That American Life

Introduction

If you're even trivially familiar with podcasts and have any interest in them, chances are you are familiar with This American Life. Although not originally a podcast, TAL pioneered the format and has been publishing episodes online for about as long as podcasting has existed.

From their own website:

This American Life is a weekly public radio show, heard by 2.2 million people on more than 500 stations. Most weeks, it is the most popular podcast in the country, with around 2.5 million people downloading each episode.

TAL is the 800-pound gorilla of podcasting.

What is this all about?

In my free time I've been working on a crude application that takes transcripts of everything ever said in each episode of TAL and uses them to generate a data set. The transcripts are downloaded directly from the TAL website, parsed for meaningful information, and then shared on GitHub.

Why are you doing this?

I'm doing this because I love This American Life, and I wanted to come up with a way to build on top of all the great material they've produced.

Originally I wanted to run a blog in which I'd review every single episode and individual segment of the show. This wouldn't be strictly impossible, as I actually have listened to every single episode at least once (and many of them multiple times), but reviewing each one without listening to it again seems unfair, and I don't have a spare 600+ hours for that project.

How does this all work?

The data set is built using two components:

A set of Python scripts that acquire new episode transcripts, extract meaningful data, and generate CSV representations of the content of each episode.

A Java application that checks the local file system for new CSVs and pushes them to a GitHub repository.

The Data Pipeline

The Python data pipeline makes use of a number of libraries and standard-library modules:

lxml

requests

datetime

csv

glob

bs4
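As a rough sketch of how some of these pieces fit together: the real pipeline fetches transcript pages with requests and parses them with bs4/lxml, but to keep this self-contained I use a tiny hardcoded HTML fragment and a regex instead. The markup, the speaker names, and the CSV columns below are my own illustrative assumptions, not the actual TAL page structure.

```python
import csv
import io
import re

# Hypothetical transcript-like markup standing in for a fetched page.
SAMPLE_HTML = """
<p><strong>Ira Glass</strong> Each week on our program, of course...</p>
<p><strong>Sarah Vowell</strong> I grew up listening to the radio.</p>
"""

def parse_transcript(html):
    """Extract (speaker, text) pairs from transcript-like markup."""
    pattern = r"<p><strong>(.*?)</strong>\s*(.*?)</p>"
    return re.findall(pattern, html)

def rows_to_csv(rows):
    """Render parsed rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["speaker", "text"])  # assumed column names
    writer.writerows(rows)
    return buf.getvalue()

rows = parse_transcript(SAMPLE_HTML)
print(rows_to_csv(rows))
```

The real scripts would do this once per episode page and write one CSV file per episode to disk.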

Originally I was building the entire system in Python because of its popularity in the ML community, and because I saw it as a good way to grow more familiar with the language. This would have been fine, except that I couldn't sort out the nuances of publishing updates to GitHub the way I wanted them published, so in a last-ditch act of despair I spent some time seeing whether I could figure out how to do it in Java. If you're using Python and trying to write code that interacts with github.com in any useful or meaningful way, my heart bleeds for you. In the end, Java worked out just fine.

What am I supposed to do with this?

Build something with the data set. My scripts are currently configured to publish one new CSV per day as long as they find new content. There are over 600 episodes of This American Life that have aired (not excluding repeated episodes), so the scripts should continue posting new content for at least the next two years. As the public data set matures, I hope to build my own experiments on top of it.
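One trivial example of something you could build: a tally of how many lines each speaker has across the data set. The column names here are an assumption about the CSV layout, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the data set's assumed layout.
SAMPLE_CSV = """speaker,text
Ira Glass,Each week on our program...
Ira Glass,Act One.
David Sedaris,I was a nervous child.
"""

def lines_per_speaker(csv_text):
    """Count transcript lines per speaker from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["speaker"] for row in reader)

counts = lines_per_speaker(SAMPLE_CSV)
print(counts.most_common(1))  # [('Ira Glass', 2)]
```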

I'm also interested in feedback on the data set itself. I'm a software engineer by trade but still very green when it comes to data science. Knowing what data to collect, how to represent it, and how to store it is something I could stand to learn more about.

https://github.com/ian4d/TALGithubUpdater: This repo contains the source for the Java application that I wrote to interact with my GitHub repo. All this code does is scan a local directory, sort the files by name, and attempt to add the last file in the list to my GitHub repository. It doesn't get much dumber or more trivial than this.
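The file-selection part of that logic is simple enough to sketch in a few lines of Python (the real thing is Java and also handles the GitHub push, which is omitted here; the filenames below are made up for the example):

```python
import os
import tempfile

def pick_newest(directory):
    """Scan a local directory, sort filenames, and return the last one --
    the same file-selection step the updater performs before pushing."""
    names = sorted(os.listdir(directory))
    return os.path.join(directory, names[-1]) if names else None

# Example with a throwaway directory and hypothetical CSV names:
d = tempfile.mkdtemp()
for name in ("episode-001.csv", "episode-003.csv", "episode-002.csv"):
    open(os.path.join(d, name), "w").close()
print(os.path.basename(pick_newest(d)))  # episode-003.csv
```

Note that sorting by name only picks the newest file if the naming scheme is zero-padded, which is why the example pads episode numbers to three digits.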

Anything else I should check out?

thisamericanlife.org: Not only is This American Life one of the best podcasts/radio shows of all time, but they have a really excellent website. You can go there to stream old episodes, buy merchandise, etc.

This American Life on Google Play: The This American Life app is also really good. You can do pretty much anything with it that you can do on the website (and more?), plus you get the added satisfaction of giving financial support to something you enjoy.