rvest + imdb -> explore Friends episode titles

Published: 18 Dec, 2017

This post includes R code to download Friends episode data from IMDB using the package rvest. It analyzes and visualizes episode data.

I always wanted to be a scriptwriter. But my approach to doing creative things is “find the secret, program it, retire”. So what’s the secret to a successful Friends episode? [Really, I want to write/experience a gentle introduction to rvest, and later tidytext and language data science.]

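First, find the link, and download some html. This takes some back and forth, so I’ll do that annoying thing where I present the finished product as if I thought of it straight away. Suppose we have two functions: one that returns the data for a single episode page, and get_season_data, which takes a season url and returns the data for every episode in that season.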

Given those functions (we’ll talk about them later), we only need to go through the list of seasons and download everything.

library(purrr)  # map(), map_chr()
library(dplyr)  # %>%, bind_rows()
# the base Friends url; it only needs a season number suffix
url <- "http://www.imdb.com/title/tt0108778/episodes?season="
# make a vector of all the season urls (there are 10 seasons)
season_urls <- map_chr(1:10, function(n) {paste0(url, n)})
# for each season url, download all the data and bind it into one tibble
titles <- season_urls %>% map(get_season_data) %>% bind_rows()

Inspect the data

The data includes season, episode, title, rating, number of ratings, director and writers (in list form, because there’s usually more than one writer, and sometimes multiple directors).

Do ratings change over the seasons? Well, a bit: they peak in Season 5 (✅), drop a bit in 9 (✅) and come back for the final season. That checks out. It started good, hit its stride in the middle, and started to lose steam around 8 and 9 (see the list of worst episodes ⬇️).
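The by-season comparison boils down to a group-and-summarize. Here is a minimal sketch, assuming the season and rating columns described above. To keep it runnable without scraping, it uses only the ten rows printed later in this post (the David Schwimmer episodes), so the numbers are not the series-wide averages.

```r
library(dplyr)
library(tibble)

# Ten rows copied from the David Schwimmer table later in this post;
# a stand-in for the full `titles` data, not a representative sample.
sample_titles <- tribble(
  ~season, ~episode, ~rating,
  6,  6,  8.6,
  7,  4,  8.2,
  7,  7,  8.6,
  7,  9,  8.3,
  7,  16, 8.7,
  8,  2,  9.1,
  8,  8,  8.9,
  8,  12, 8.6,
  9,  5,  8.5,
  10, 9,  8.6
)

# Average rating and episode count per season
season_summary <- sample_titles %>%
  group_by(season) %>%
  summarize(mean_rating = mean(rating), n_episodes = n())

season_summary
```

Swapping sample_titles for the full titles tibble should give the per-season averages discussed here.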

‘Embryos’ is the ep where the winner of a quiz wins Monica and Rachel’s apartment. So good. ‘Everybody Finds Out’: it’s the one where Phoebe finds out about Monica/Chandler’s relationship; great ep (other than the Phoebe line I always hated “my eyes! my eyes!”). It’s also a great example of common knowledge and rational agents in game theory.

One problem with titles: “The One with the Embryos” gives no information other than a noun that is related to babies. 😟 No character names and no indication of why I like it.

Are the directors any good?

I’m trying to write a good Friends script, so who should I get to direct it? Here, we’ll just have to unnest the list-cols (because of one episode that has two directors, for some reason), and then group and summarize. I’ve already picked out a few directors to highlight with geom_text_repel.
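The unnest, group and summarize steps might look roughly like this. It is a sketch on placeholder data: the director names, titles and ratings below are invented for illustration, and only the column names follow the data described earlier.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Placeholder data: director names, titles and ratings are invented.
toy_titles <- tibble(
  title    = c("Episode 1", "Episode 2", "Episode 3"),
  rating   = c(8.0, 9.0, 8.5),
  director = list("Director A", c("Director A", "Director B"), "Director B")
)

director_summary <- toy_titles %>%
  unnest(director) %>%                 # one row per episode-director pair
  group_by(director) %>%
  summarize(n_episodes = n(), mean_rating = mean(rating)) %>%
  arrange(desc(mean_rating))

director_summary
```

Note that the episode with two directors counts once for each of them after unnest, which is what we want for a per-director average.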

There aren’t many that stick out—we’d want Kevin Bright (of Bright/Kaufman/Crane), who has one of the highest average ratings and directed over 50 eps. Gary Halvorson’s done a lot, but he’s only an average Friends director.

Also Peter Bonerz lolol. On the other hand: David Schwimmer directed like 10 episodes! I assume he directed all the ones about Ross:

titles %>%
  unnest(director) %>%
  filter(director == "David Schwimmer")
## # A tibble: 10 x 6
##    season episode title                      rating n_ratings director
##     <dbl>   <dbl> <chr>                       <dbl>     <dbl> <chr>
##  1     6.      6. The One on the Last Night    8.60     1893. David Schwi…
##  2     7.      4. The One with Rachel's Ass…   8.20     1843. David Schwi…
##  3     7.      7. The One with Ross's Libra…   8.60     1877. David Schwi…
##  4     7.      9. The One with All the Cand…   8.30     1794. David Schwi…
##  5     7.     16. The One with the Truth Ab…   8.70     1908. David Schwi…
##  6     8.      2. The One with the Red Swea…   9.10     2252. David Schwi…
##  7     8.      8. The One with the Stripper    8.90     2009. David Schwi…
##  8     8.     12. The One Where Joey Dates …   8.60     1860. David Schwi…
##  9     9.      5. The One with Phoebe's Bir…   8.50     1792. David Schwi…
## 10    10.      9. The One with the Birth Mo…   8.60     1814. David Schwi…

Nope. I guess he did ok, mainly focusing on Seasons 7 and 8 and coming out with an average rating of 8.61, slightly higher than the series average of 8.54. Well. Fine. Ross still sucks.
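As a quick check on that average, the ratings and seasons from the ten rows printed above can be typed in directly. This only reproduces the arithmetic; the series-wide average needs the full data.

```r
# Ratings and seasons copied from the ten rows printed above
schwimmer_rating <- c(8.6, 8.2, 8.6, 8.3, 8.7, 9.1, 8.9, 8.6, 8.5, 8.6)
schwimmer_season <- c(6, 7, 7, 7, 7, 8, 8, 8, 9, 10)

mean_rating <- mean(schwimmer_rating)   # 8.61
per_season  <- table(schwimmer_season)  # episodes directed per season

mean_rating
per_season
```

Seven of the ten episodes fall in Seasons 7 and 8.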

Title character breakdown

I bet if I write an episode titled “The One Where Rachel [does X]” it’ll be an automatic classic. Let’s check. First, use tidytext::unnest_tokens to split the titles into words, and then take out common filler words (‘The’, ‘One’, ‘a’, etc.) with the quick and helpful anti_join(stop_words).
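A minimal sketch of that tokenizing step, assuming a title column as in the data above. The four titles are just a hand-picked handful for illustration, not the scraped data.

```r
library(dplyr)
library(tibble)
library(tidytext)

# A handful of Friends titles, chosen only for illustration
toy_titles <- tibble(title = c(
  "The One Where Ross Finds Out",
  "The One Where Rachel Finds Out",
  "The One with the Embryos",
  "The One with the Stripper"
))

title_words <- toy_titles %>%
  unnest_tokens(word, title) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop 'the', 'one', etc.
  count(word, sort = TRUE)

title_words
```

One thing to watch for on the real titles: possessives such as "Rachel's" may come through as their own token rather than as "rachel", so they might need a clean-up step before counting.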

The characters are the most frequent title words, and ‘Rachel’ (surprise, surprise) is No. 1. But episodes with ‘Ross’ in the title are actually rated slightly better!

That gives us some insight into what explains ratings: some directors are good, but with a lot of noise. Some title words are popular, but also with a lot of noise. And don’t write a f-----g clip show. Next up, let’s use the titles and statistics to explain ratings!

Appendix: rvest

The toughest part of learning rvest is the lack of transparency in the returned objects. I find it difficult to navigate a page with rvest through trial-and-error because it’s hard to see inside the xml_document or xml_nodeset objects—but maybe there’s some xml stuff I don’t understand yet. Anyway, here’s what I found.
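For what it’s worth, a few calls do help with peeking inside those objects. This is a sketch on a made-up html snippet (not an IMDB page), using html_nodes as elsewhere in this post.

```r
library(rvest)

# A made-up snippet standing in for a downloaded page
page <- read_html('<html><body>
  <div id="main"><h1 itemprop="name">Some Title</h1>
  <a href="/a">first</a> <a href="/b">second</a></div>
</body></html>')

links <- html_nodes(page, "a")   # an xml_nodeset of matching nodes

n_links    <- length(links)             # how many nodes matched
link_html  <- as.character(links)       # the raw html of each node
link_text  <- html_text(links)          # the text inside each node
link_hrefs <- html_attr(links, "href")  # one attribute from each node

xml2::xml_structure(page)               # prints the tag tree of the document
```

Printing as.character() of a nodeset is the closest thing I know to "just show me what you matched".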

First, explore the IMDB html / source. The Friends url is easy to find, and then we need a way to go from each season’s page (that lists the episodes) to each individual episode page (to get the data). Suppose we have the season url and one episode url; let’s write a function to get that episode’s data.

We need html_session to jump to different links, and then read_html (good name) to read the html. (Newer versions of rvest call the former session.) selectorgadget is a useful Chrome extension for finding CSS paths that select the objects you like, but I found it easier to just right-click + inspect element and find attributes that identify the things I want (usually class and id, though itemprop turns out to be super useful on these IMDB pages).

Here, I’m looking for the episode title: inspect element suggests div h1[itemprop="name"] would select it (which I need to verify for this one, and then hope it works for the rest 🙏). I also want the rating (span[itemprop="ratingValue"]) and the rating count (span[itemprop="ratingCount"]).
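Those three selectors can be tried out on a small stand-in page. The markup below is made up to mimic the attributes just described, and the values are placeholders rather than real IMDB data. The comma in the rating count is also an assumption about how the page formats numbers.

```r
library(rvest)

# Made-up markup that only mimics the attributes described above;
# the values are placeholders, not real IMDB data.
episode_page <- read_html('<html><body>
  <div><h1 itemprop="name">The One with the Placeholder</h1></div>
  <span itemprop="ratingValue">8.5</span>
  <span itemprop="ratingCount">1,234</span>
</body></html>')

title <- episode_page %>%
  html_node('div h1[itemprop="name"]') %>%
  html_text(trim = TRUE)

rating <- episode_page %>%
  html_node('span[itemprop="ratingValue"]') %>%
  html_text() %>%
  as.numeric()

n_ratings <- episode_page %>%
  html_node('span[itemprop="ratingCount"]') %>%
  html_text() %>%
  gsub(",", "", .) %>%   # strip thousands separators (assumed format)
  as.numeric()
```

html_node returns the first match only, which is what we want when a page should have exactly one title and one rating.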

After that, we get lists of directors and writers from the cast table, and filter accordingly. Sometimes the writers are listed with credits “written by”, sometimes “writer”, or “teleplay by” or “story by” or whatever else they like to say.

Finally, we stick them all in a tibble and return it. The writers and directors are returned in list-cols, so every episode gets one row.
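The filtering and the list-col step could be sketched like this. The cast table here is hypothetical: the names, the credit strings and the column names name and credit are all invented, and the real table on the page may be shaped differently.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical cast table: names, credits and column names are invented
credits <- tibble(
  name   = c("Person A", "Person B", "Person C", "Person D"),
  credit = c("directed by", "written by", "teleplay by", "producer")
)

# Catch the various ways a writing credit can be phrased
writer_pattern <- regex("written by|writer|teleplay by|story by",
                        ignore_case = TRUE)

writers <- credits %>%
  filter(str_detect(credit, writer_pattern)) %>%
  pull(name)

directors <- credits %>%
  filter(str_detect(credit, regex("direct", ignore_case = TRUE))) %>%
  pull(name)

# Wrapping each vector in list() keeps the episode to a single row
episode <- tibble(
  title    = "Placeholder title",
  rating   = 8.5,
  director = list(directors),
  writers  = list(writers)
)

episode
```

Because the list-cols hold whole vectors, an episode with three writers is still one row, and unnest can expand it later when needed.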

Once we have that episode data function, we can write a function to return season data (although you know I wrote this first just to download the season url and look for the episode titles).

First, read_html the url. The episode titles can be found (via a quick inspect element) with the selector strong a[itemprop="name"]; then ask for the href/link attribute via html_attr.
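On a stand-in season page, that selector plus html_attr looks like the sketch below. The markup and hrefs are made up for illustration; only the selector comes from the inspection described above.

```r
library(rvest)

# Made-up markup mimicking a season page; the hrefs are placeholders
season_page <- read_html('<html><body>
  <div><strong><a href="/title/tt0000001/" itemprop="name">Episode One</a></strong></div>
  <div><strong><a href="/title/tt0000002/" itemprop="name">Episode Two</a></strong></div>
  <a href="/elsewhere">not an episode</a>
</body></html>')

episode_links <- season_page %>%
  html_nodes('strong a[itemprop="name"]') %>%  # only the episode title links
  html_attr("href")                            # pull out each link target

episode_links
```

The third link is skipped because it is neither inside a strong tag nor marked with itemprop="name".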

This is my first experiment with purrr::possibly—since html sessions can do weird stuff (with this very detailed bug report: bad stuff happens sometimes), if getting the episode data fails for some reason, just forget about that episode for now.
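A small sketch of the possibly pattern, with a deliberately flaky stand-in function in place of the real episode downloader (flaky_download and safe_download are hypothetical names):

```r
library(purrr)

# A stand-in for the episode downloader: fails on one input on purpose
flaky_download <- function(x) {
  if (x == 2) stop("bad stuff happened")
  data.frame(episode = x)
}

# possibly() returns a wrapped function that yields `otherwise` on error
safe_download <- possibly(flaky_download, otherwise = NULL)

results <- map(1:3, safe_download)

# Drop the failures (NULLs) and bind what is left
episodes <- results %>%
  compact() %>%
  dplyr::bind_rows()

episodes
```

The failed episode simply goes missing from the result, so it is worth comparing the row count against the number of links afterwards to see what got dropped.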