Taxi TechBlog 1: Data Prep and Backend

A week ago, I published NYC Taxis: A Day in the Life and it went a little bit viral. This seems to be the perfect mix of some brand new (for me) techniques that were a huge hit, along with subject matter that seemingly everyone can relate to. This is part 1 of 2 of a techblog about how I built this visualization, and will cover data munging and building the backend. The next post will talk about the frontend, including animations and charts.

On Monday afternoon, someone from Heroku (where the app is hosted) noticed the spike in traffic and the subsequent failed page loads, and contacted me to recommend turning on more dynos. I am not very experienced with Heroku and had no idea what that meant, but it literally came down to dragging a slider, and the traffic became much more manageable. Heroku also generously added me to their beta program to help cover the costs of the additional dynos.

Concerned about almost $1000 in overages on my Mapbox account in the first 12 hours, I took to the Twittersphere to ask the civic hacker community for advice on how to deal with the situation. Monetize with ads? Add a PayPal donate button? Switch to OpenStreetMap’s free tile solutions? The chatter caught the attention of Mapbox, who got in touch with me within an hour and offered to sponsor the app! (I had already swapped out the Mapbox maps for an OSM tileset, but not having a dark, subdued basemap was suboptimal for the visualization. In any case, OSM rules and deserves a shoutout for making hosted, Leaflet-compatible tiles available for the world to use.)

At the time of this posting, the visualization has been shared over 4700 times on Facebook and 3700 times on Twitter, and has had 300,000 pageviews from 216,000 visitors.

Querying the Data

I’d FOILed the 2013 taxi trip data a few months ago. The blog post I wrote about it had gotten some attention, and resulted in dozens of requests to share the data. Many people came to the BetaNYC hacknights, hard disk in hand, to transfer the data on the spot. About a month ago, some of the requesters pushed me hard enough that I created a couple of torrents. At the same time, Andrés Moroy (someone I don’t know except via Twitter) offered to host it, so I finally got all 50GB zipped and uploaded.

Over the next couple of weeks, data scientists all over the world started publishing lots of interesting analyses of the data, and after someone loaded the data into Google BigQuery, a rather interesting thread erupted where people were showing off all sorts of ways to slice up the 173 million rows quickly. This would remove a huge barrier to entry for me.

I’m not sure when the idea of “a day in the life” of a taxi came to mind, but it came directly from the simple question: “How much does a single taxi/driver earn in a single day/shift?” I’d heard that many drivers rent their shifts and need to earn back enough to break even before they take anything home. I’ve done similar static “day in the life” map projects using the Twitter API as well. Because the time and start/end location of each taxi trip was available, it just made sense to follow along to see when/how/where they earned their money over the course of the day.

I set out trying to figure out how to use BigQuery to get the data I wanted: a full day’s trips for a single medallion. With a hard-coded medallion and date, I was able to slap together a query that worked, but I had no idea how to pull all trips from a random cab on a random date, let alone many random cab/days. Reddit to the rescue! u/fhoffa, who started the Taxi Data BigQuery thread, helped me out in an hour with the exact query I needed:

The combination of having the data in BigQuery and a rather awesome community on Reddit already working on the data probably saved me the dozens of hours it would have taken to do all of this on my own. I can’t stress enough how important the community is to civic hacking projects like this.

So, data in hand, I see that there are around 30-60 trips a day for most of these vehicles. We have a time succession of geographic points, which is enough to string together into a line. This can be animated, as I’ve done in this animation of Twitter users’ movements, by simply moving a dot “as the crow flies” between the start and end points, and keeping accurate time with an accelerated clock. However, simply moving a dot between start and end points wasn’t going to cut it for this visualization. I wanted to show something that resembles a car actually driving, so I needed a way to trace out a reasonable driving path for each trip. Enter the Google Directions API.

The route is our overall trip, which is divided into legs. This API call included only an origin and destination, so there’s only 1 leg. If we had added waypoints to our API call, we could have up to 9 legs returned.

Each leg has steps, which are basically chunks of the trip that don’t require some instruction to the driver, such as turning or merging onto a highway. Each step has its own html_instructions, which are the same thing Google would show you for turn instructions if you got directions via Google Maps. Neato.
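To make the route/leg/step nesting concrete, here is a minimal sketch that walks a Directions API response and collects the instruction text for every step. The `routes`, `legs`, `steps` and `html_instructions` keys are the ones described above; the function name `listInstructions` and the sample response are made up for illustration.

```javascript
// Walk routes -> legs -> steps and collect each step's instruction text.
// `listInstructions` is a hypothetical helper, not part of any library.
function listInstructions(response) {
  const rows = [];
  response.routes.forEach((route, routeIndex) => {
    route.legs.forEach((leg, legIndex) => {
      leg.steps.forEach((step, stepIndex) => {
        rows.push({
          route: routeIndex,
          leg: legIndex,
          step: stepIndex,
          // html_instructions contains markup such as <b>; strip the tags for plain text
          text: step.html_instructions.replace(/<[^>]+>/g, ''),
        });
      });
    });
  });
  return rows;
}

// Made-up response with one route, one leg and two steps
const sampleResponse = {
  routes: [{
    legs: [{
      steps: [
        { html_instructions: 'Head <b>south</b> on <b>Example St</b>' },
        { html_instructions: 'Turn <b>left</b> onto <b>Sample Ave</b>' },
      ],
    }],
  }],
};

console.log(listInstructions(sampleResponse));
```

With only an origin and a destination there is a single leg, so every row comes back with `leg: 0`.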

But wait, how does this help us map the step/leg/route? Here’s where it gets interesting. Each step includes a polyline, which looks like this:

"polyline" : {
   "points" : "az~vF|jmbMzJiA"
}

What the heck does az~vF|jmbMzJiA mean? It turns out that this is Google’s super-compressed format for encoding polylines. This string can be decoded into a series of latitude/longitude coordinates representing the path to be followed for this step! There’s a handy widget for decoding/encoding these and mapping them on the fly here. Give it a try. Copy az~vF|jmbMzJiA and paste it in, see what you get.
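If you would rather see the decoding in code than in the widget, here is a minimal sketch of a decoder for the encoded polyline format. It is written from scratch for illustration (the name `decodePolyline` is mine), and it assumes the default precision of 5 decimal places.

```javascript
// Decode an encoded polyline string into an array of [lat, lng] pairs.
function decodePolyline(encoded, precision = 5) {
  const factor = Math.pow(10, precision);
  const coordinates = [];
  let index = 0;
  let lat = 0;
  let lng = 0;

  // Read one signed value: 5-bit chunks, lowest chunk first, each character
  // offset by 63, with bit 0x20 meaning "more chunks follow".
  function nextValue() {
    let result = 0;
    let shift = 0;
    let chunk;
    do {
      chunk = encoded.charCodeAt(index++) - 63;
      result |= (chunk & 0x1f) << shift;
      shift += 5;
    } while (chunk >= 0x20);
    // The lowest bit carries the sign
    return (result & 1) ? ~(result >> 1) : (result >> 1);
  }

  while (index < encoded.length) {
    // Every point is stored as an offset from the previous point
    lat += nextValue();
    lng += nextValue();
    coordinates.push([lat / factor, lng / factor]);
  }
  return coordinates;
}

// The step polyline from above
console.log(decodePolyline('az~vF|jmbMzJiA'));
```

Because only the first point is stored in full and every later point is a small offset, a short straight step like this one fits in a handful of characters.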

Further down in the JSON response from our API call, there’s an overview_polyline that shows the entire route strung together:

Go ahead, try that one in the decoder widget too. You’ll see that the results don’t seem to make sense, as the polyline starts on the roads and then wanders off in seemingly wild directions. This is because there are escape characters in the encoded string. Look above: every place you see “\\” is really meant to represent a single backslash. Try the decoder tool again with single backslashes, and you should see a more sensible path.
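In code, the escaping mostly takes care of itself as long as the response goes through a JSON parser rather than being copied by hand. A small sketch, using a made-up placeholder string (not a real route) that contains one escaped backslash:

```javascript
// Raw JSON text as it appears on the wire; String.raw keeps both backslash characters.
// The "points" value is a placeholder, not a real encoded route.
const rawJson = String.raw`{"overview_polyline":{"points":"ab\\cd"}}`;

// JSON.parse turns the two-character escape \\ into one real backslash
const parsed = JSON.parse(rawJson);
const points = parsed.overview_polyline.points;
console.log(points.length); // 5 characters: a, b, \, c, d

// If the string was copied straight out of the raw text instead, collapse the
// doubled backslashes by hand before decoding (hypothetical helper).
function unescapeBackslashes(text) {
  return text.replace(/\\\\/g, '\\');
}
```

The manual replace is only needed for text pasted from the raw response; a string that has already been through `JSON.parse` should be left alone.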

Use BigQuery to get a bunch of random trip/days, then download them as a CSV

Write a node script to build out API calls for each series of 4 trips (each API call can handle an origin/destination and up to 8 waypoints, meaning 9 total legs per call. Each taxi trip consists of two legs: the trip itself and the “downtime” between this trip and the next trip. So, we can efficiently handle 4 trips/8 legs per API call.)

Append the polylines to the raw data, also append the start time of the next trip as a value for the current trip (so we know how long the “downtime” lasts)

Move everything into a sqlite database

Build a node server with a single endpoint, /trip, that will query the sqlite db for all trips for a random medallion, convert the results to geoJSON, and send the response to the browser.
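The step above about carrying the next trip’s start time can be sketched roughly like this. The field names (`pickupTime`, `dropoffTime`, `nextPickupTime`) are hypothetical stand-ins rather than the real CSV headers, and the input is assumed to be one taxi’s trips for one day.

```javascript
// Attach each trip's successor start time so the length of the downtime is known.
// Field names are hypothetical; times are assumed to be ISO 8601 strings.
function addNextPickupTimes(trips) {
  // Sort a copy by pickup time so "next" really means the following trip
  const sorted = [...trips].sort(
    (a, b) => new Date(a.pickupTime) - new Date(b.pickupTime)
  );
  return sorted.map((trip, i) => ({
    ...trip,
    // The last trip of the day has no successor
    nextPickupTime: i + 1 < sorted.length ? sorted[i + 1].pickupTime : null,
  }));
}

// Made-up example: two trips listed out of order
const exampleTrips = [
  { pickupTime: '2013-05-01T09:10:00Z', dropoffTime: '2013-05-01T09:25:00Z' },
  { pickupTime: '2013-05-01T08:00:00Z', dropoffTime: '2013-05-01T08:20:00Z' },
];
console.log(addNextPickupTimes(exampleTrips));
```

The downtime for a trip is then simply the gap between its `dropoffTime` and its `nextPickupTime`.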

getDirections.js opens our BigQuery results CSV, builds out API calls, and appends the appropriate polylines to the raw data. You can check out the code in my GitHub repo. I won’t go into too much detail here, but I basically slice the trips for each taxi into groups of 4 or fewer, use the start point of the first trip as my origin, and the start point of the next group of 4 or the last stretch of downtime as my destination. The rest of the start and end points are waypoints. Once everything is stored in the appropriate strings/arrays, I build out my API call like this:

I store all of the fullApiCall strings in an array that I use later to actually make the calls and process the results.
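For readers who want the grouping spelled out, here is a rough sketch of the idea. It is not the code from the repo. The trip shape (`pickup` and `dropoff` as `[lat, lng]` pairs), the function name and the `YOUR_API_KEY` placeholder are all assumptions, and when a taxi’s final group has no following trip this sketch simply ends the route at the last dropoff.

```javascript
// Build one Directions API URL per group of up to 4 trips.
// Hypothetical trip shape: { pickup: [lat, lng], dropoff: [lat, lng] }
function buildDirectionsUrls(trips, apiKey, groupSize = 4) {
  const base = 'https://maps.googleapis.com/maps/api/directions/json';
  const fmt = ([lat, lng]) => `${lat},${lng}`;
  const urls = [];

  for (let i = 0; i < trips.length; i += groupSize) {
    const group = trips.slice(i, i + groupSize);
    const nextTrip = trips[i + groupSize];

    // Pickup and dropoff of every trip, in driving order
    const points = group.flatMap((trip) => [trip.pickup, trip.dropoff]);
    // Route on to the next group's first pickup so the final downtime gets a leg too
    if (nextTrip) points.push(nextTrip.pickup);

    const params = new URLSearchParams({
      origin: fmt(points[0]),
      destination: fmt(points[points.length - 1]),
      key: apiKey,
    });
    // Everything between the origin and the destination is a waypoint
    const waypoints = points.slice(1, -1);
    if (waypoints.length > 0) {
      params.set('waypoints', waypoints.map(fmt).join('|'));
    }
    urls.push(`${base}?${params.toString()}`);
  }
  return urls;
}

// Five made-up trips produce two calls: a full group of 4 and a final group of 1
const demoTrips = Array.from({ length: 5 }, (_, n) => ({
  pickup: [40.7 + n * 0.01, -74.0],
  dropoff: [40.705 + n * 0.01, -73.99],
}));
console.log(buildDirectionsUrls(demoTrips, 'YOUR_API_KEY'));
```

A full group of 4 trips yields 7 waypoints and 8 legs, which stays inside the 8-waypoint limit mentioned earlier.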

There’s also a step to combine all of the polylines for the steps into a large polyline for each trip or downtime. This is done using the Mapbox package I talked about before: decoding each step into latitude and longitude coordinates, linking all of the steps together for a single trip, and then re-encoding them.
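As a rough sketch of the linking and re-encoding half of that step, the snippet below takes steps that have already been decoded into `[lat, lng]` pairs, joins them, and encodes the result. It is a hand-rolled encoder written for illustration rather than the Mapbox package, and the function names are hypothetical.

```javascript
// Encode one signed integer in the polyline format
function encodeValue(value) {
  // Shift left one bit and invert negatives so the sign lives in the lowest bit
  let v = value < 0 ? ~(value << 1) : value << 1;
  let out = '';
  // Emit 5 bits at a time, flagging every chunk except the last with 0x20
  while (v >= 0x20) {
    out += String.fromCharCode((0x20 | (v & 0x1f)) + 63);
    v >>= 5;
  }
  return out + String.fromCharCode(v + 63);
}

// Encode [lat, lng] pairs; each point is stored as an offset from the previous one
function encodePolyline(coordinates, precision = 5) {
  const factor = Math.pow(10, precision);
  let prevLat = 0;
  let prevLng = 0;
  let out = '';
  for (const [lat, lng] of coordinates) {
    const latInt = Math.round(lat * factor);
    const lngInt = Math.round(lng * factor);
    out += encodeValue(latInt - prevLat) + encodeValue(lngInt - prevLng);
    prevLat = latInt;
    prevLng = lngInt;
  }
  return out;
}

// Join already-decoded steps into one path and re-encode it
function mergeAndEncode(decodedSteps) {
  const merged = [];
  for (const step of decodedSteps) {
    for (const point of step) {
      const last = merged[merged.length - 1];
      // Consecutive steps share their junction point; keep only one copy
      if (last && last[0] === point[0] && last[1] === point[1]) continue;
      merged.push(point);
    }
  }
  return encodePolyline(merged);
}

// Two made-up steps that meet at [40.7, -120.95]
console.log(mergeAndEncode([
  [[38.5, -120.2], [40.7, -120.95]],
  [[40.7, -120.95], [43.252, -126.453]],
]));
```

Dropping the repeated junction point keeps the merged line free of zero-length segments.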

Next, a function called createGeojson() decodes the polyline for each trip and downtime, and creates a properly formatted geoJSON LineString. These are pushed to a featureCollection, and that featureCollection is sent back to the browser as a response.
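The shape of that conversion looks roughly like the sketch below. It is an illustration, not the actual createGeojson() from the repo: the row fields (`medallion`, `tripPolyline`, `downtimePolyline`) are hypothetical, and `decode` stands for whatever function turns an encoded polyline into `[lat, lng]` pairs.

```javascript
// Turn database rows into a GeoJSON FeatureCollection of LineStrings.
// `decode` is passed in: any function mapping an encoded polyline to [lat, lng] pairs.
function createGeojsonSketch(rows, decode) {
  const toFeature = (encoded, properties) => ({
    type: 'Feature',
    properties,
    geometry: {
      type: 'LineString',
      // GeoJSON positions are [longitude, latitude], so flip each decoded pair
      coordinates: decode(encoded).map(([lat, lng]) => [lng, lat]),
    },
  });

  const features = [];
  for (const row of rows) {
    features.push(
      toFeature(row.tripPolyline, { kind: 'trip', medallion: row.medallion })
    );
    // The last trip of the day may have no downtime after it
    if (row.downtimePolyline) {
      features.push(
        toFeature(row.downtimePolyline, { kind: 'downtime', medallion: row.medallion })
      );
    }
  }
  return { type: 'FeatureCollection', features };
}
```

The coordinate flip is the easy thing to miss here: decoded polylines come out latitude first, while geoJSON expects longitude first.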

It weighs in at 131 KB for this trip. It would have been more efficient to decode the polylines client-side, but I really wanted to just have geoJSON ready to go for d3. Give it a try: copy the geoJSON result and paste it into GeoJSONLint and you’ll see a taxi’s trips for one day:

So there you have it: I combined raw taxi trip data with Google Directions API results, put them in a database, and built a simple API endpoint to grab one taxi’s day at random and serve it up as proper geoJSON. Done-zo.

In the next post, I’ll detail how I built the frontend, designed the visualization and made the timing, animations, and charts work. Stay tuned, and thanks for reading!