Converting JSON to TSV using Python streaming

Earlier today, a friend asked my advice on how to convert a JavaScript Object Notation (JSON) file to tab-separated values (TSV) in Python. As with most things in software development, there are many ways to accomplish this, some more Pythonic than others.

I thought about it, and decided to illustrate how to do this map/reduce style (sans reducer) by streaming to STDOUT.

The Data

To start with, he was dealing with a fairly simple JSON format for closed captioning data. There’s a “cc” root key holding an array of items, each containing a duration, the content, and a timestamp. The derived schema looks like this:
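As a rough illustration of that shape, here is a small Python sketch that builds a document with a "cc" root key and a list of items. The exact key names (`duration`, `content`, `timestamp`), the value types, and the caption text are assumptions made for this example, not his actual data.

```python
import json

# Hypothetical document matching the description above: a "cc" root key
# holding a list of caption items. Key names and values are assumptions.
sample = {
    "cc": [
        {"duration": 4.0, "content": "First caption line", "timestamp": "00:00:11.000"},
        {"duration": 2.5, "content": "Second caption line", "timestamp": "00:00:15.000"},
    ]
}

# Pretty-print the structure as JSON text.
print(json.dumps(sample, indent=2))
```

Each item becomes one row of the TSV output, with one column per key.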

Since I couldn’t use his actual data here, I searched Google for another example of closed captioning JSON data. That proved elusive, so I took the sample data from the W3C’s WebVTT introduction and converted it to JSON. It appears to be the beginning of an audio interview between Roger Bingham and Neil deGrasse Tyson.

The Code

Normally, you might write a class that knows how to load and read this specific JSON file, plus a second class (or some extra code) to write the TSV file. In this case, though, since I’m working in a map/reduce style and streaming the data to STDOUT, I only need the mapper (CcJsonMapper), not a TsvWriter. The mapper maps the JSON file to STDOUT as TSV. Here’s how it will be used:

python cc_json_mapper.py data.json > data.tsv

So, basically, I’m calling the cc_json_mapper.py script, passing in the filename of the JSON file, and redirecting its output to a file.
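To make that invocation concrete, here is a minimal sketch of what such a mapper could look like. It assumes the items live under a "cc" key and use the key names `timestamp`, `duration`, and `content`; the column order, the `map` method name, and the `out` parameter are choices made for this illustration only.

```python
import csv
import json
import sys


class CcJsonMapper:
    """Maps a closed captioning JSON file to TSV rows on an output stream."""

    # Assumed key names and column order; adjust to the real schema.
    FIELDS = ("timestamp", "duration", "content")

    def __init__(self, out=None):
        # Default to STDOUT so the shell can redirect the output to a file.
        self.writer = csv.writer(out or sys.stdout, delimiter="\t", lineterminator="\n")

    def map(self, path):
        # The whole JSON file is parsed up front; rows are then written one by one.
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        for item in data["cc"]:
            # Missing keys become empty cells rather than raising an error.
            self.writer.writerow([item.get(field, "") for field in self.FIELDS])


# Only run when a filename was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    CcJsonMapper().map(sys.argv[1])
```

Using `csv.writer` with a tab delimiter, rather than joining strings by hand, means any caption text that itself contains a tab or newline gets quoted instead of silently breaking the row.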