Last week, New York City released data on millions of taxi trips: 165,114,361, to be precise. With a war raging between taxis and Uber, and just a few days after the clash between NYC’s mayor de Blasio and Uber, it looked like a good idea to play with this data by uploading it to the OpenDataSoft platform and displaying it on a map and as graphs and charts!

South of Manhattan, NYC: heatmap of taxi trip pickups and subway entrance locations

Some unsurprising facts

As we could expect, the number of taxi trips grows through the week, peaking on Saturday.

Number of trips, sum of distance and total fare by day of the week
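For readers who want to reproduce this kind of chart outside the platform, here is a minimal sketch of the aggregation behind it. The tuple layout `(pickup_datetime, distance, fare)` and the function name `totals_by_weekday` are assumptions made for illustration, not the dataset's real schema, and the sample values are made up.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def totals_by_weekday(trips):
    """Aggregate trip count, total distance and total fare per weekday.

    `trips` is an iterable of (pickup_datetime, distance, fare) tuples.
    This layout is an assumption, not the actual column order of the data.
    """
    totals = defaultdict(lambda: {"trips": 0, "distance": 0.0, "fare": 0.0})
    for pickup, distance, fare in trips:
        day = WEEKDAYS[pickup.weekday()]  # weekday() returns 0 for Monday
        totals[day]["trips"] += 1
        totals[day]["distance"] += distance
        totals[day]["fare"] += fare
    return dict(totals)


# Made-up sample rows, only to show the shape of the result.
sample = [
    (datetime(2014, 1, 4, 23, 15), 2.5, 11.0),  # a Saturday
    (datetime(2014, 1, 4, 9, 0), 1.0, 6.5),     # same Saturday
    (datetime(2014, 1, 6, 8, 30), 3.0, 12.5),   # a Monday
]
by_day = totals_by_weekday(sample)
```

On the full 165M rows you would stream the file rather than load it into memory, but the grouping logic stays the same.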

The month-by-month evolution seems to show more taxi trips during spring and fall:

Number of trips, sum of distance and total fare by month

It would be interesting to wait for the 2015 data to see whether this is a real pattern, though.

Reverse Engineering of Taxi pricing

For each fare amount, here are the average distance and average duration of the corresponding trips:
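The same view can be approximated offline by grouping trips on their fare amount and averaging the other two fields. This is only a sketch: the tuple layout `(fare, distance, duration_minutes)` and the name `averages_by_fare` are illustrative assumptions, and the sample rows are invented.

```python
from collections import defaultdict


def averages_by_fare(trips):
    """Return {fare: (average_distance, average_duration_minutes)}.

    `trips` is an iterable of (fare, distance, duration_minutes) tuples,
    an assumed layout rather than the dataset's real schema.
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0])  # count, distance, minutes
    for fare, distance, minutes in trips:
        bucket = sums[fare]
        bucket[0] += 1
        bucket[1] += distance
        bucket[2] += minutes
    return {
        fare: (total_distance / count, total_minutes / count)
        for fare, (count, total_distance, total_minutes) in sums.items()
    }


# Invented sample rows: two trips at one fare, one trip at another.
sample = [
    (10.0, 2.0, 10.0),
    (10.0, 3.0, 14.0),
    (52.0, 17.0, 40.0),
]
pricing = averages_by_fare(sample)
```

A fare where the average distance barely changes while the average duration varies (or the reverse) hints at which component of the pricing dominates for those trips.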

About data quality

For a dataset this huge, there are not many errors or bad records, but perfectly clean datasets are rare. Data visualization and mapping are good ways to spot incorrect data, especially when you have 160M rows to check!

In the same vein, we can see both very long trips (up to 13 days) that may need more investigation, and trips with a negative duration, which need no investigation at all:

Number of trips by trip’s duration
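A check like this is easy to automate once the pickup and dropoff timestamps are parsed. The sketch below flags the two cases mentioned above; the 24-hour threshold and the name `classify_duration` are arbitrary assumptions chosen for the example.

```python
from datetime import datetime, timedelta

# Assumed cut-off for a "suspiciously long" trip; tune it to your own needs.
MAX_PLAUSIBLE = timedelta(hours=24)


def classify_duration(pickup, dropoff):
    """Label a trip as 'negative', 'suspiciously long' or 'ok'."""
    duration = dropoff - pickup
    if duration < timedelta(0):
        return "negative"           # dropoff recorded before pickup
    if duration > MAX_PLAUSIBLE:
        return "suspiciously long"  # e.g. the multi-day trips seen above
    return "ok"


start = datetime(2014, 3, 1, 12, 0)
labels = [
    classify_duration(start, start - timedelta(minutes=10)),
    classify_duration(start, start + timedelta(days=13)),
    classify_duration(start, start + timedelta(minutes=20)),
]
```

Counting the rows under each label gives a quick measure of how much of the dataset needs a closer look.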

Smart Cities 101

The harder, the more fun: we’ve created a heatmap of every pickup location in 2014 (~160 million, remember) and added the subway entrances as a layer. It’s pretty amazing to navigate that easily through such huge datasets!

By implementing some route calculation between the subway stations, we could compare every taxi trip with the equivalent public transport trip and better understand how people behave, why, and what the city can do to improve their lives. That would be a nice first step in the development of a Smart City…
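A real comparison needs proper route calculation, but a first building block could be finding the subway entrance closest to each pickup point. The sketch below uses the haversine formula for straight-line distance. The function names, the `(name, lat, lon)` layout and the coordinates are all hypothetical placeholders, not values from the datasets.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_entrance(pickup, entrances):
    """Return (name, distance_km) of the entrance closest to `pickup`.

    `pickup` is (lat, lon); `entrances` is a list of (name, lat, lon).
    Both layouts are assumptions made for this example.
    """
    def distance(entrance):
        return haversine_km(pickup[0], pickup[1], entrance[1], entrance[2])

    best = min(entrances, key=distance)
    return best[0], distance(best)


# Hypothetical coordinates somewhere in Manhattan, for illustration only.
entrances = [
    ("entrance A", 40.7527, -73.9772),
    ("entrance B", 40.7580, -73.9855),
]
name, km = nearest_entrance((40.7530, -73.9780), entrances)
```

This linear scan is fine for a handful of points. For 160 million pickups, a spatial index would be the more realistic choice.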

You want to open your data?

And easily explore, visualize and map your datasets with hundreds of millions of rows? Let’s talk about it!