The goal of this project was to classify the Big Five personality traits of film protagonists through their spoken dialogue. For this project, I utilized a massive film corpus from the University of California, Santa Cruz. The corpus is broken down by genre and contains 960 film scripts in which the dialogue has been separated from the scene descriptions. In addition, I leveraged the IBM Watson Personality Insights platform with Python to visualize personality traits.
The hypothesis for this project is that we can find archetypal consistencies between fictional and nonfictional dialogue that can be mapped to general personality traits (i.e., the Big Five). The hope is to utilize these findings as a baseline for a generative narrative delivery system currently being developed.
The overarching idea for this generative delivery can be thought of as essentially the antithesis of the YouTube algorithm. YouTube funnels users down a path of increasingly 'extreme' content based on their watch history. A user who watches a video with some right-wing tendency (or comments), for example, might then be served a video with slightly more right-leaning views, and so on. The end result is that the user becomes entrenched in content from a hyper-specific viewpoint.
For this project, we would like to establish a baseline categorical classification for any given user, marry them to the corresponding ‘fictional’ characterization and expose them to divergent patterns of thinking through a customized narrative delivery. Said another way, we’d like to open up the user to previously unexplored patterns of thinking, to broaden their viewpoint and ultimately help them to break out of established patterns of thought through story.
While I was not able to cluster characters by personality using only the script dialogue, I was able to uncover some insights using topic modeling (specifically LDA). Namely, I uncovered the toxicity of action movie protagonists: many commanding and misogynistic words appear throughout the corpus.
In addition, I was able to successfully visualize the Big Five personality traits of any given character utilizing the existing IBM architecture. The findings are in the personality insights notebook in my project repo.
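A minimal sketch of that topic-modeling step with scikit-learn, assuming the aggregated action/adventure dialogue lives in a list of strings called `dialogue_docs` (the variable name and number of topics are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Bag-of-words representation of the aggregated dialogue
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
doc_term = vectorizer.fit_transform(dialogue_docs)

# Fit LDA and inspect the top words per topic
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")
```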
Topic modeling of all dialogue aggregated from the action/adventure genre. We can see patterns of domineering and toxic speech in most of the topics presented. This makes sense considering we are utilizing films such as the Bond franchise, The Fast and the Furious, and Rambo (to name a few).
Using the IBM architecture, we built a personality profiler that can be visualized for any film protagonist. Above are the Big Five traits for Gandalf the Gray (The Lord of the Rings: The Fellowship of the Ring).
Project Goal:
At my company, Screenshot Productions, we craft intimate, immersive experiences, often for a single audience member at a time, with the goal of placing the individual within universal human experiences and challenging them to consider their place and responsibility within a larger whole. While this format provides a conduit for great emotional impact, it is not at all scalable: we can often only accommodate 100-150 audience members per weekend and have to charge high ticket prices merely to break even.
The central question at the heart of this project, then, is: can we utilize data science, and specifically generative deep learning, to provide a similar impact and personalized experience over a long-form, interactive narrative, such that we can reach hundreds of thousands of individuals simultaneously?
Enter this project: a project with the long-form goal of leveraging personality insights derived from a user's Twitter profile, in conjunction with fictional dialogue, to craft robust, interactive, and completely personalized narratives.
As text generation is still a relatively young field, I took the opportunity during my time at Metis to craft a short-form goal as a stepping stone to this larger one: to successfully create an MVP generative model that marries a user's Big Five personality profile with a fictional character and generates a short paragraph or tweet from that character's perspective.
Data was pulled from a massive film corpus from the University of California, Santa Cruz. The corpus is broken down by genre and contains 960 film scripts in which the dialogue has been separated from the scene descriptions.
The process was to clean and preprocess the data such that I was left with a pandas DataFrame broken down by character, with each dialogue instance as a row. Next, I filtered the DataFrame for characters that had 100 or more lines, with each line consisting of at least 3 words.
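A minimal sketch of what that filtering step might look like in pandas (the DataFrame and column names here are assumptions):

```python
import pandas as pd

# Assumed layout: one dialogue line per row, with 'character' and 'line' columns
df = pd.read_csv("film_dialogue.csv")  # hypothetical file name

# Keep only lines with at least 3 words
df = df[df["line"].str.split().str.len() >= 3]

# Keep only characters with 100 or more remaining lines
line_counts = df.groupby("character")["line"].transform("count")
df = df[line_counts >= 100]
```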
From here, I created a relational database that included the following information in a nested structure: film genre, film title, film character, and then all dialogue from each character.
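One way that nested structure might be built from the filtered DataFrame above (column names again assumed):

```python
from collections import defaultdict

# genre -> film title -> character -> list of dialogue lines
corpus = defaultdict(lambda: defaultdict(dict))
for (genre, film, character), group in df.groupby(["genre", "film", "character"]):
    corpus[genre][film][character] = group["line"].tolist()
```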
Next, I created a script that ran each individual character's dialogue through the IBM Watson Personality Insights service and returned their detailed Big Five personality profile.
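A hedged sketch of that step using the ibm-watson Python SDK. The credential placeholders, the `character_dialogue` dictionary (character name mapped to a list of lines), and the exact response handling are assumptions rather than the precise code used here:

```python
from ibm_watson import PersonalityInsightsV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")          # placeholder credential
personality_insights = PersonalityInsightsV3(
    version="2017-10-13",
    authenticator=authenticator,
)
personality_insights.set_service_url("YOUR_SERVICE_URL")  # placeholder URL

def profile_for(text):
    """Return the Big Five percentile scores for a blob of text."""
    response = personality_insights.profile(
        text,
        accept="application/json",
        content_type="text/plain",
    ).get_result()
    # 'personality' is assumed to hold the five traits with percentile scores
    return {trait["name"]: trait["percentile"] for trait in response["personality"]}

profiles = {name: profile_for(" ".join(lines))
            for name, lines in character_dialogue.items()}
```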
At this point, I had a database of 3,000+ characters and their associated personality profiles, so I moved on to Twitter, where I used Tweepy to pull the last 200 tweets from any given user and wrote a similar script to run their personality through IBM as well.
The Twitter profile was then compared to all characters in the relational database, and using cosine similarity, I printed out the character in each genre most similar to the user.
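The Twitter pull might look roughly like this with Tweepy's v1.1-style API (the keys are placeholders and the helper name is an assumption):

```python
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

def last_200_tweets(screen_name):
    """Pull a user's most recent 200 tweets and join them into one text blob."""
    tweets = api.user_timeline(screen_name=screen_name,
                               count=200,
                               tweet_mode="extended")
    return " ".join(t.full_text for t in tweets)

user_text = last_200_tweets("some_user")
user_profile = profile_for(user_text)  # reuse the Watson helper sketched above
```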
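The matching step can be sketched with scikit-learn, reusing the `profiles` and `user_profile` objects from the sketches above (trait names follow Watson's Big Five labels; the per-genre grouping is omitted here for brevity):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

traits = ["Openness", "Conscientiousness", "Extraversion",
          "Agreeableness", "Emotional range"]

char_names = list(profiles.keys())
char_matrix = np.array([[profiles[c][t] for t in traits] for c in char_names])
user_vector = np.array([[user_profile[t] for t in traits]])

# Cosine similarity between the user and every character
scores = cosine_similarity(user_vector, char_matrix)[0]
best_match = char_names[int(scores.argmax())]
print(f"Most similar character: {best_match}")
```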
For the final step, I leveraged OpenAI's GPT-2 architecture to generate a short tweet from each character. The GPT-2 instance, built with the GPT-2 Simple Python library created by Max Woolf, was trained individually on each character's dialogue and fine-tuned to produce a tweet-length piece of original content.
While this project was fascinating (and successful in its own right), it is simply the calibration step for the much larger, long-form narrative delivery that I hope to roll out in the near-ish future. Long-form text generation is a difficult problem to solve, however, so for now I am satisfied with generating short samples in the speaking style of fictional characters and creating a storytelling calibration device that meets each user at their specific personality profile.
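A rough sketch of the fine-tune-and-generate loop with the gpt-2-simple library; the file name, run name, and hyperparameters are illustrative assumptions, not the exact values used:

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")  # smallest GPT-2 checkpoint

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="gandalf_dialogue.txt",  # one character's dialogue, one utterance per line
              model_name="124M",
              steps=500,
              run_name="gandalf")

# Generate a tweet-length sample in the character's voice
tweet = gpt2.generate(sess,
                      run_name="gandalf",
                      length=60,
                      temperature=0.8,
                      return_as_list=True)[0][:280]
print(tweet)
```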
In this project, I analyzed a UCI dataset that includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom, no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
I utilized a number of supervised machine learning models (logistic regression, random forest, SVMs, Naive Bayes) to determine which mushrooms are poisonous and which are edible. For this, I used accuracy as my classification metric. Eating a poisonous mushroom can spell certain death in many cases, so I wanted to be ABSOLUTELY sure that any mushroom identified as edible truly is.
Once modeled, I found that (with the exception of Naive Bayes) all models perfectly classified the mushrooms into their respective categories. I then took a look at feature importance and built an interactive dashboard visualization in Tableau, which showed that odor and gill size were HUGELY correlated with a mushroom being edible. Specifically, mushrooms with an 'earthy' odor and mushrooms with broad gills were virtually always edible.
In addition to my supervised models, I built an image classifier to augment the work done so far. The image classifier utilized transfer learning (built off the Inception V3 model) in one instance and a custom-built convolutional neural network in another. The from-scratch CNN severely overfit and did poorly on test data. However, the Inception model did great, classifying mushrooms with 89.6% accuracy on test data.
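A sketch of what that modeling loop might look like in scikit-learn, assuming the categorical features have already been one-hot encoded into `X` with binary labels `y` (the specific Naive Bayes variant and hyperparameters are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(),
    "Naive Bayes": BernoulliNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, preds):.4f}")
```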
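A rough sketch of the transfer-learning setup in Keras with a frozen InceptionV3 base; the directory layout, classification head, and hyperparameters are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

# Load training/validation images from class-named folders (paths assumed)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "mushroom_images/train", image_size=(299, 299), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "mushroom_images/val", image_size=(299, 299), batch_size=32)

# Frozen InceptionV3 base with a small classification head on top
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False

model = models.Sequential([
    layers.Rescaling(1.0 / 127.5, offset=-1, input_shape=(299, 299, 3)),  # Inception expects [-1, 1]
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # edible vs. poisonous
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```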
In the future, I'd love to dial in these findings further and build out an image classification application for hands-on foraging. THAT SAID - an inaccurate reading could be disastrous, so I want to stress that no model is a substitute for a professional forager, and I'd want to extensively test this app with professionals before thinking about deploying it in a consumer-facing setting.
AUC for Logistic Regression model.
Feature Importance Dashboard
Results from Inception V3 Image Classifier
Primary Goals:
In this project, I looked at the potential relationship between unit sales of a video game and the corresponding streaming metrics on Twitch. Specifically, I looked at:
Watch Time
Stream Time
Peak Viewers (the highest number of concurrent watchers on a particular title)
Peak Channels (the highest number of concurrent streamers on a particular title)
Streamers
Average Viewers
Average Channels
I utilized stats collected from SullyGnome (a third-party Twitch stats and analysis site) for the streaming metrics and scraped data on unit sales from VGChartz. The data collected from both sources is global; I did not limit this study to a single country. Because the data on SullyGnome is only available from 2016 to the present, I had to limit my study to 2016 onward.
Data collection: Web scraping using Beautiful Soup for VG sales information
Data collection: Download CSV files for Twitch streaming metrics.
Clean and orient the data.
Look at correlations and pairplots for all years, just a single year, and then individual months.
Build linear regression models (a brief sketch follows this list)
Zoom in on the coefficients
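The regression and coefficient steps above might be sketched as follows with statsmodels, assuming a DataFrame `monthly` with one row per game, the Twitch metrics as columns, and `total_sales` as the target (all names are assumptions):

```python
import statsmodels.api as sm

features = ["watch_time", "stream_time", "peak_viewers", "peak_channels",
            "streamers", "average_viewers", "average_channels"]

X = sm.add_constant(monthly[features])
y = monthly["total_sales"]

model = sm.OLS(y, X).fit()
print(model.summary())              # R-squared, p-values, etc.
print(model.params.sort_values())   # zoom in on the coefficients
```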
While I didn't find a strong signal between video game sales and streaming metrics, I did find a surprisingly impactful coefficient in the form of peak_streams. Thus, I recommended that Twitch develop a platform sponsorship program to connect video game developers and advertisers with mid-tier streamers to saturate the market with streams within the first 4 weeks of release.
Residual Plot for Monthly Sales
R-squared test results
Correlation matrix between streaming metrics and total sales