Jeffrey Girard

Carnegie Mellon University

Exploring movie lengths using R

To show off how R can help you explore interesting and even fun questions using data that is freely available online, I thought I’d put together a quick tutorial.

First, I will download the most recent “basic information” data file from the Internet Movie Database (IMDb) and explore the length (i.e., runtime) of movies. To do so, I will use functions from base R and the tidyverse family of packages.

# Load packages
library(glue)
library(tidyverse)

To download the file, we can use the aptly named download.file() function, and to store it in a temporary file, we can use the tempfile() function, which generates a unique file path for the current R session. Of course, we could also have downloaded the file using a web browser and then read it into R from disk.
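The download step might look like the sketch below. The URL is an assumption on my part (it is where IMDb has hosted its "title basics" file, but it is not given in the text above), and download_to_temp() is a hypothetical helper name introduced only for illustration.

```r
# Assumed location of IMDb's "title basics" file (not stated in this tutorial)
imdb_url <- "https://datasets.imdbws.com/title.basics.tsv.gz"

# Hypothetical helper: download a URL to a temporary file and return its path
download_to_temp <- function(url, fileext = ".tsv.gz") {
  # tempfile() only generates a unique path; the download creates the file
  path <- tempfile(fileext = fileext)
  # mode = "wb" writes the bytes unchanged, keeping the gzip file intact on Windows
  download.file(url, destfile = path, mode = "wb")
  path
}

# Only fetch the (large) file when running interactively
if (interactive()) {
  temp <- download_to_temp(imdb_url)
}
```

The temporary file is removed when the R session ends, so this approach suits a one-off exploration; for repeated work, saving the file to a project folder avoids downloading it again.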

Next, we need to read the data from the temporary file, which we know (from its file extension) is a tab-separated values (tsv) file that has been compressed using gzip. So we need to uncompress it using the gzfile() function and then read the tsv data using the read_tsv() function. We can make the file’s formatting explicit by passing additional arguments (e.g., col_names, quote, na, and col_types) to the read_tsv() function. This process will take a little while.
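Here is a minimal sketch of that reading step. The column names, the use of "\N" as the missing-value code, and the helper name read_imdb_basics() are all assumptions about the file layout rather than details confirmed above, so check them against the file's header before relying on them.

```r
library(readr)

# Hypothetical helper: read a gzip-compressed IMDb "title basics" file.
# Column names and the "\N" missing-value code are assumed, not confirmed here.
read_imdb_basics <- function(path) {
  read_tsv(
    file = gzfile(path),   # decompress while reading
    col_names = TRUE,      # the first row holds the column names
    quote = "",            # treat quotation marks in titles as ordinary text
    na = "\\N",            # assumed missing-value code
    col_types = cols(
      tconst = col_character(),
      titleType = col_character(),
      primaryTitle = col_character(),
      originalTitle = col_character(),
      isAdult = col_integer(),
      startYear = col_integer(),
      endYear = col_integer(),
      runtimeMinutes = col_integer(),
      genres = col_character()
    )
  )
}

# Usage, where `temp` stands for the path to the downloaded file:
# imdb_all <- read_imdb_basics(temp)
```

Setting quote = "" matters because movie titles can contain unbalanced quotation marks, which would otherwise cause several rows to be merged into one field.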

Now that we have imported the data into the imdb_all data frame, we can select a subset of columns and observations. For the purposes of this tutorial, let’s use the filter() function to exclude non-movies (e.g., TV series, shorts, and video games), adult movies, movies that are over 4 hours long (these are rare at only 0.287% of all movies), and movies from before 1918 or after 2018. Let’s also select just the movie’s primary title, release year, runtime, and genre listing. Finally, let’s sort by release year and then by title and output a preview of the resulting data.
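The filtering, selecting, and sorting described above could be sketched as follows. The column names (titleType, isAdult, startYear, runtimeMinutes, primaryTitle, genres) are assumed from the layout of IMDb's file, and select_movies() is a hypothetical helper name.

```r
library(dplyr)

# Hypothetical helper: keep non-adult movies of at most 4 hours from 1918-2018.
# Column names are assumptions about the imported data frame.
select_movies <- function(imdb_all) {
  imdb_all %>%
    filter(
      titleType == "movie",     # drop TV series, shorts, video games, etc.
      isAdult == 0,             # drop adult movies
      runtimeMinutes <= 240,    # drop movies over 4 hours long
      startYear >= 1918,
      startYear <= 2018
    ) %>%
    select(primaryTitle, startYear, runtimeMinutes, genres) %>%
    arrange(startYear, primaryTitle)
}

# Usage:
# imdb <- select_movies(imdb_all)
# imdb
```

Note that filter() drops rows where a condition evaluates to NA, so movies with a missing runtime or release year are excluded as well.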

Let’s visualize the distribution of runtimes across all included movies. We can do so using several types of visualization. First, let’s use the trusty histogram and plot the count of movies for each possible runtime (grouped in intervals of 5 mins).
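One way to draw such a histogram with ggplot2 is sketched below. The helper name plot_runtime_histogram() and the runtimeMinutes column name are assumptions; the styling is illustrative rather than a reproduction of any particular figure.

```r
library(ggplot2)

# Hypothetical helper: histogram of runtimes in 5-minute bins.
# `movies` is assumed to have a numeric runtimeMinutes column.
plot_runtime_histogram <- function(movies) {
  ggplot(movies, aes(x = runtimeMinutes)) +
    # boundary = 0 aligns the bin edges to multiples of 5 (0-5, 5-10, ...)
    geom_histogram(binwidth = 5, boundary = 0, color = "white") +
    labs(x = "Runtime (minutes)", y = "Number of movies")
}

# Usage, with the filtered data frame:
# plot_runtime_histogram(imdb)
```

Changing binwidth is worth a try: wider bins smooth over detail, whereas narrower bins can reveal spikes at popular round-number runtimes.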

Next, we can visualize the same distribution using a density plot, which is like a smoothed histogram. Note that the y-axis shows the kernel density estimate, not the proportion of movies at each runtime value. This distinction matters because density values do not have to add up to 1 (it is the area under the curve that equals 1), whereas proportions do.
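To make the difference between densities and proportions concrete, here is a small sketch in base R. The runtimes are simulated stand-ins, not IMDb data, and the variable names are only illustrative.

```r
# Simulated stand-in runtimes (NOT IMDb data), in minutes
set.seed(1)
runtimes <- rnorm(500, mean = 90, sd = 15)

# Kernel density estimate evaluated on an evenly spaced grid of x values
d <- density(runtimes)

# The density values themselves do not have to add up to 1 ...
density_sum <- sum(d$y)

# ... but the area under the curve (height times grid spacing) is close to 1
density_area <- sum(d$y) * diff(d$x[1:2])

# Proportions of observations per histogram bin do add up to 1
h <- hist(runtimes, plot = FALSE)
prop_sum <- sum(h$counts / sum(h$counts))

c(density_sum = density_sum, density_area = density_area, prop_sum = prop_sum)
```

In a ggplot2 figure, geom_density() draws this kind of curve, so its y-axis should be read the same way: heights are densities, and only areas correspond to proportions.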

Another way to visualize this distribution is the boxplot. The boxplot below shows the middle 50% of the data as a white box (i.e., the box’s left and right sides are the 25th and 75th percentiles, respectively) and the 50th percentile (i.e., median) is shown as a vertical line within the box. The light horizontal lines extending from the edges of the box are called “whiskers” and show data points within 1.5 times the inter-quartile range (IQR), which is the width of the box. Finally, the black dots (which are grouped so closely in this figure that they look like thicker horizontal lines) are data points that are more than 1.5 times the IQR away from the box (i.e., outliers). Note that boxplots can be depicted horizontally, as below, or vertically.
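The quantities that make up a boxplot can be computed by hand, which may help demystify the figure. The sketch below uses a handful of made-up runtimes (not IMDb data) and base R only.

```r
# Made-up runtimes (in minutes) purely for illustration, NOT IMDb data
runtimes <- c(60, 80, 85, 90, 95, 100, 180)

# 25th, 50th, and 75th percentiles: the box edges and the median line
q <- quantile(runtimes, probs = c(0.25, 0.50, 0.75), names = FALSE)
iqr <- q[3] - q[1]   # inter-quartile range, i.e., the width of the box

# Fences: the farthest a whisker is allowed to reach from the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers stop at the most extreme data points that lie within the fences
inside <- runtimes[runtimes >= lower_fence & runtimes <= upper_fence]
whiskers <- range(inside)

# Anything beyond the fences is drawn as an individual outlier dot
outliers <- runtimes[runtimes < lower_fence | runtimes > upper_fence]
```

Here the box runs from 82.5 to 97.5 with the median at 90, so the IQR is 15 and the fences sit at 60 and 120. The whiskers therefore reach 60 and 100, and the 180-minute movie is the only outlier.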

Next, let’s examine the runtimes per year to see if there have been trends over time. We can do this effectively by plotting a vertical boxplot for each year and stacking them next to each other. We can see below that the median runtimes have been remarkably stable since 1950 or so, although the median runtime increased from around 60 min in the early 1920s to around 90 min by 1950 or so. The 75th and especially the 25th percentiles (i.e., the top and bottom of the boxes) have seen a bit more variability over time. It appears that runtimes were relatively more clumped around 90 min between 1949 and 1999, but saw more variability before and after this range; it would be fascinating for film scholars to weigh in on what factors may have contributed to these changes.
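A per-year boxplot like the one described could be sketched with ggplot2 as follows. The helper name plot_runtime_by_year() and the startYear and runtimeMinutes column names are assumptions, and the styling is only illustrative.

```r
library(ggplot2)

# Hypothetical helper: one vertical boxplot of runtimes per release year.
# `movies` is assumed to have numeric startYear and runtimeMinutes columns.
plot_runtime_by_year <- function(movies) {
  # group = startYear is needed because startYear is numeric; without it,
  # ggplot2 would draw a single boxplot across all years
  ggplot(movies, aes(x = startYear, y = runtimeMinutes, group = startYear)) +
    geom_boxplot(outlier.size = 0.5) +
    labs(x = "Release year", y = "Runtime (minutes)")
}

# Usage, with the filtered data frame:
# plot_runtime_by_year(imdb)
```

With a century of years on the x-axis, shrinking the outlier dots (as done here with outlier.size) helps keep the boxes themselves readable.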