Saturday, June 22, 2013

Given the speed at which I consume them, it's only justified that the first post on this blog is about movies. (Although, by that logic, it could have equally well been about sandwiches, Nutella, or tissue paper. Note to self: Look for a Nutella consumption dataset)

Anyway, this post is about movie taglines - specifically, the words that constitute them.

The data is pretty much there for the picking – IMDb hosts a number of freely available1 datasets, and one of them is about taglines.

The data is in an odd format, but at least it’s all available in one place. After the coding equivalent of jamming the fork into the toaster and jerking it around until something pops, I have the data in a usable structure.

Once here, R’s tm package makes quick work of the word frequency analysis, and I have derived a dataset of common words and their frequencies in movie taglines. After removing some of the most frequent English words (articles, pronouns, some prepositions, etc.), here’s a list of the most used words in movie taglines, ordered by frequency:

love, life, story, world, time, film, comedy, death, woman, don’t

Not many surprises there – until we look at the fraction of taglines these terms occur in:

Word      Fraction of taglines
love      7.5%
life      6.0%
story     5.0%
world     3.8%
time      3.1%
film      2.3%
comedy    2.2%
death     2.1%
woman     2.1%
dont      1.9%

These numbers are way higher than I expected. ‘Love’ alone occurs in a whopping 7.5% of all movie taglines!
Here’s a visual representation of the words you'd have seen most often in movie taglines (the size of each word is proportional to the frequency of its occurrence).

The R code to parse the data and make the word cloud is available on GitHub if you're interested.
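If you'd rather not dig through the R code, here's a rough Python sketch of the counting step: strip punctuation, drop stop words, and work out the fraction of taglines each word appears in. It's not the code behind the numbers above. The `STOP_WORDS` set, the function names, and the sample taglines are all made up for illustration, and tm's real stop-word list is much longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "to", "and", "for", "his", "her", "you", "it"}


def tokenize(tagline):
    """Lower-case, strip everything except letters and whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z\s]", "", tagline.lower())
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def tagline_fractions(taglines):
    """Map each word to the fraction of taglines that contain it at least once."""
    doc_counts = Counter()
    for tagline in taglines:
        # set() so a word repeated inside one tagline is only counted once
        doc_counts.update(set(tokenize(tagline)))
    return {word: count / len(taglines) for word, count in doc_counts.items()}


# Made-up taglines, purely for illustration.
sample = [
    "A story of love and death.",
    "Love is in the air.",
    "Don't look back.",
    "The story of his life.",
]

fractions = tagline_fractions(sample)
# Highest fraction first; ties broken alphabetically so the order is stable.
top = sorted(fractions.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top)
```

Because punctuation is stripped before splitting, "Don't" turns into "dont" here as well.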

I'm kinda keen to know if this trend has been constant through the years. Let's do the same thing, except looking at the taglines decade by decade. Here’s the list of the top 10 words in movie taglines from each decade2 – from the fifties to the teens. (Teens? Onesies? I like ‘onesies’.)
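The decade split itself is just a matter of bucketing each movie by its release year before counting. Here's a minimal Python sketch of that idea, assuming every record is a (year, words) pair that has already been tokenized. The function names and the sample records are hypothetical, not taken from the actual analysis.

```python
from collections import Counter, defaultdict


def decade_of(year):
    """Round a year down to the start of its decade, e.g. 1957 -> 1950."""
    return (year // 10) * 10


def top_words_by_decade(records, n=10):
    """records: iterable of (year, list_of_words). Returns {decade: top n words}."""
    counts = defaultdict(Counter)
    for year, words in records:
        # Count each word once per tagline.
        counts[decade_of(year)].update(set(words))
    result = {}
    for decade in sorted(counts):
        # Most frequent first; ties broken alphabetically.
        ranked = sorted(counts[decade].items(), key=lambda kv: (-kv[1], kv[0]))
        result[decade] = [word for word, _ in ranked[:n]]
    return result


# Made-up records, purely for illustration.
records = [
    (1954, ["love", "story"]),
    (1958, ["love", "world"]),
    (2011, ["life", "time"]),
    (2013, ["life"]),
]

by_decade = top_words_by_decade(records, n=2)
print(by_decade)
```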

So that’s that for frequent words. But I'm also after words that are frequent exclusively in high (or low) rated movies. Or to look at it another way, words that, in retrospect, are indicative of the movie’s success.

One way of doing this is to segment the data into different parts by performance, and do the same analysis as above. But the prior frequencies will likely dominate these lists. What I really want is words whose presence (or absence) is highly indicative of the movie’s rating.

NOTE: Some math to follow. If you're uncomfortable with arithmetic and/or statistics, skip a couple of paragraphs.

For a given term, if D1 is the distribution of movie ratings with the term present in the tagline, and D2 is the distribution of movie ratings with the term absent in the tagline, I'm going to define my divergence/separation metric as3,4:

<Obligatory CORRELATION DOES NOT IMPLY CAUSATION warning>

Adding such words will not automatically make your movie successful – this is offered as a post-event descriptive analysis, not a predictive one. I'm not implying any causality here.

</warning>

This divergence is just a magnitude - so I had to separate the most related ‘good movie’ keywords list from the ‘bad movie’ keywords list.
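To show what that separation step looks like, here's a small Python sketch. Loud caveat: it does not use the metric from this post. As a stand-in it uses the plain difference in mean rating between movies with and without the term, which conveniently comes with a sign. The function names and the sample data are invented for the example.

```python
from statistics import mean


def signed_separation(ratings_with, ratings_without):
    """Stand-in metric (an assumption, not the post's formula):
    mean rating with the term minus mean rating without it."""
    return mean(ratings_with) - mean(ratings_without)


def split_terms(movies, terms):
    """movies: list of (set_of_tagline_words, rating).
    Returns (good, bad) lists of (term, magnitude), largest magnitude first."""
    good, bad = [], []
    for term in terms:
        present = [rating for words, rating in movies if term in words]
        absent = [rating for words, rating in movies if term not in words]
        if not present or not absent:
            continue  # nothing to compare against
        delta = signed_separation(present, absent)
        if delta > 0:
            good.append((term, abs(delta)))
        elif delta < 0:
            bad.append((term, abs(delta)))
    good.sort(key=lambda pair: -pair[1])
    bad.sort(key=lambda pair: -pair[1])
    return good, bad


# Made-up movies, purely for illustration.
movies = [
    ({"love", "war"}, 8.0),
    ({"love"}, 7.0),
    ({"zombies"}, 3.0),
    ({"zombies", "space"}, 4.0),
]

good, bad = split_terms(movies, ["love", "zombies"])
print(good, bad)
```

The magnitude decides how far up a list a term sits, and the sign decides which of the two lists it lands in.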

So, without further math or ado, the 10 terms that correspond to highest ratings:

If these lists make you have second thoughts about making Warrior Zombies from Outer Space II: Mayhem Unleashed, don't be disheartened - because like I said earlier, there is certainly a correlation, but it’s not necessarily a causal relationship5. And hey, I know a bunch of people who would watch the hell out of that movie.

Footnotes

1 Going through and adhering to the legal clauses for use of the datasets is left as an exercise for the reader.

2 The punctuation has been removed from the data to make the analysis easier. So if you see “cant”, that’s probably “can’t”, and so on.

3 It is possible that a better metric might have been used, or even a simpler one, but for some reason, I went with this. Other suggestions are welcome.

4 IMDb ratings are arguably not the best indicators of movie success, but that's certainly one way of estimating it, and there is probably going to be a future post analyzing how reliable a measure this is.