A Show about Something? Topic Modeling Seinfeld (Part I)

Historians of the recent past are blessed (or, some would say, cursed) with an enormous body of non-textual primary sources: films, music, and radio, just to name a few. Among these, one of the most underutilized bodies of historical “data” is television. The under-representation of television in academic history is, perhaps, not surprising: TV is not indexed, cannot easily be skimmed for relevant information, and requires a large time commitment on the part of the researcher. Further, because of copyright restrictions, some TV shows are much harder for scholars to access than textual resources.

Fortunately, volunteers at OpenSubtitles, Addic7ed, and a number of other sites have produced a huge multilingual corpus of TV and movie subtitles that are completely free to download. Working with captions from Seinfeld, this guide will show you how to batch download TV subtitle files (.srt files) and prepare them for use with MALLET or any other data mining/machine learning application. It presumes you have FileBot (available here) installed and are running the current version of Ubuntu (13.10), but the instructions should work with some modification on OS X.

Downloading the Subs

The first step is actually getting the subtitle files downloaded. This is not as easy as it might seem; the major subtitle websites don’t have a clean, easy way to download a large number of files. And although automated downloaders exist, they all presume you have a properly named copy of an episode located on your machine.

FileBot offers an easy, if inelegant, workaround to this problem. With it, we can generate a list of episode titles, turn that list into a simple script that creates empty video files, and then use FileBot to download the matching subtitle files. (Note: you can mass download from FileBot’s Subtitle menu, but there is no easy way to deal with the problem of duplicate episodes. This method will download one subtitle file for every episode.)

Open FileBot and click on the “Episodes” menu. Type in the name of the show, in this case “Seinfeld,” and make sure the other options are appropriate. Save the list as “Seinfeld.txt” in an empty directory.

Open a terminal in the directory that contains your episode list. The following set of sed commands will change your episode list into a shell script that will generate video files with valid filenames, delete special features, and ensure proper handling of double episodes:
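A minimal sketch of such a conversion might look like the block below. It assumes each line of Seinfeld.txt has the form “Seinfeld - 1x01 - The Seinfeld Chronicles” and that special features contain the word “Special”; check both against your own list before running it. The function name episodes_to_script and the output name make_videos.sh are placeholders, and the sketch makes no attempt to handle double episodes.

```shell
# ASSUMPTION: one episode per line, e.g. "Seinfeld - 1x01 - The Seinfeld Chronicles".
# ASSUMPTION: special features can be recognised by the word "Special".
# Reads an episode list on stdin and prints one `touch` command per episode.
episodes_to_script() {
  sed -e '/Special/d' \
      -e '/^[[:space:]]*$/d' \
      -e "s/'//g" \
      -e 's#[/:*?"<>|]##g' \
      -e "s/.*/touch '&.mp4'/"
}

# Build and run the script only if the episode list is present.
if [ -f Seinfeld.txt ]; then
  episodes_to_script < Seinfeld.txt > make_videos.sh
  sh make_videos.sh
fi
```

The sed expressions, in order: drop special features, drop blank lines, strip apostrophes (so the single-quoted filename stays valid), strip characters that are awkward in filenames, and wrap what remains in a touch command.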

If this all worked correctly, your folder should be populated with a set of empty .mp4 files. We can now use FileBot’s command line interface to download the subtitle files in the working directory. Don’t forget the period!

filebot -get-missing-subtitles .

Finally, delete the script and the empty video files:

rm *.mp4; rm *.sh

Cleaning the Subs

Now that we’ve downloaded the subtitle files, we should clean up the markup, timestamps, and release information:
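As a rough sketch of this kind of cleanup, the function below strips the standard parts of an .srt file: cue numbers, timestamp lines, HTML-style tags, and brace-style positioning codes. The name clean_srt is a placeholder, and the release-information pattern is only an assumption (credit lines that mention OpenSubtitles or Addic7ed); credits worded differently will need patterns of their own.

```shell
# Reads one .srt file on stdin and prints only the dialogue text.
# CAVEAT: a dialogue line made up solely of digits is removed along with
# the cue numbers.
# ASSUMPTION: release credits mention "OpenSubtitles" or "Addic7ed".
clean_srt() {
  tr -d '\r' |
    sed -E \
      -e '/^[0-9]+$/d' \
      -e '/^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> /d' \
      -e '/[Oo]pen[Ss]ubtitles|[Aa]ddic7ed/d' \
      -e 's/<[^>]*>//g' \
      -e 's/\{[^}]*\}//g' \
      -e '/^[[:space:]]*$/d'
}

# Clean every subtitle file in the working directory in place.
for f in *.srt; do
  [ -f "$f" ] || continue
  clean_srt < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```

Spot-check a few files after running it, since subtitle files from different uploaders vary in encoding and in how they format credits.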

This is not strictly necessary, but I like to append the .txt extension to the files so that I know I have already processed them:

find . -type f -exec mv '{}' '{}'.txt \;

Topic Modeling Seinfeld with MALLET

Now that we’ve downloaded and scrubbed the files, the subtitles are ready for MALLET. If you haven’t installed the package yet, first open a terminal window and navigate to the directory in which you would like to build MALLET, then run the following commands:
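Once MALLET is built, importing the subtitles and training a model might look roughly like the sketch below. It uses MALLET’s import-dir and train-topics commands. The install location ($HOME/mallet), the function name model_seinfeld, the file names sein.mallet and sein_composition.txt, and the default of 20 topics are all assumptions chosen for illustration.

```shell
# ASSUMPTION: MALLET has already been built and lives in $MALLET_HOME.
MALLET_HOME="${MALLET_HOME:-$HOME/mallet}"

# Usage: model_seinfeld SUBTITLE_DIR [NUM_TOPICS]
model_seinfeld() {
  subs_dir="$1"
  num_topics="${2:-20}"   # 20 is an arbitrary starting point, not a recommendation

  # Convert the directory of cleaned text files into MALLET's binary format.
  "$MALLET_HOME/bin/mallet" import-dir \
      --input "$subs_dir" --output sein.mallet \
      --keep-sequence --remove-stopwords &&
  # Train the topic model and write the topic keys and per-document weights.
  "$MALLET_HOME/bin/mallet" train-topics \
      --input sein.mallet --num-topics "$num_topics" \
      --output-topic-keys sein_keys.txt \
      --output-doc-topics sein_composition.txt
}

# Example (uncomment once the paths match your machine):
# model_seinfeld ./subs 20
```

Wrapping both steps in one function makes it easy to rerun the whole pipeline with a different number of topics.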

After you do that, open “sein_keys.txt” and inspect your results. You’ll notice a few contraction fragments (e.g. “ll,” “ve,” etc.) among the topic keys. To get rid of these, create a file called “customstop” and enter the words you don’t want to appear in your results, one per line. Delete your sein* files and import your data again with the --extra-stopwords argument:
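A sketch of that re-import step, under the same caveats: the install location, the function name reimport_with_stoplist, and the file name sein.mallet are assumptions, and the stop list holds only the two fragments named above, so extend it with whatever else turns up in your own results.

```shell
# ASSUMPTION: MALLET has already been built and lives in $MALLET_HOME.
MALLET_HOME="${MALLET_HOME:-$HOME/mallet}"

# Usage: reimport_with_stoplist SUBTITLE_DIR
reimport_with_stoplist() {
  subs_dir="$1"

  # One unwanted token per line.
  printf '%s\n' ll ve > customstop

  # Remove the earlier output so stale results are not mistaken for new ones.
  rm -f -- sein.mallet sein_keys.txt

  # Re-import, adding the custom list on top of MALLET's default stopwords.
  "$MALLET_HOME/bin/mallet" import-dir \
      --input "$subs_dir" --output sein.mallet \
      --keep-sequence --remove-stopwords \
      --extra-stopwords customstop
}

# Example (uncomment once the paths match your machine):
# reimport_with_stoplist ./subs
```

After the import finishes, train the model again to produce a fresh sein_keys.txt.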