Thoughts on data, statistics, computer science, and those other things people call life.

Hello World – Urban Data Edition

I was recently asked to help teach business analytics for the City of New York Management Academy, and in putting together an appropriate practical exercise, I thought about what would make the best introduction to the world of urban data. In programming, there’s almost always a “Hello, world” exercise that gets programmers started with some tangible result from their initial efforts. It builds confidence and gets people on their way to learning more. Most city data is dirty, requiring a good deal of work to clean and organize, not to mention the work that goes into understanding the larger context of city operations around the data. For this exercise, I needed something that was reasonably easy to understand, easy to work with, and easy to extract meaningful insights from.

In the case of New York City, probably the closest thing to a “Hello, world” exercise is using the 311 Service Request dataset. This is basically crowdsourcing the problems of New York City, with each person calling, texting, or emailing their service needs to a centralized call center. The data holds a wealth of information about the various needs of New Yorkers, including building issues, potholes, malfunctioning street lights, and the ubiquitous noise complaints. These are all things New Yorkers (and almost all city dwellers the world over) deal with on a daily basis, so understanding the context wasn’t difficult, and the data is presented in a generally clean and manageable way in the NYC Open Data Portal. Because the complaints are self-reported, there is a bias towards areas of the city where New Yorkers feel empowered enough to complain. This usually means Lower Manhattan is over-represented in the data while the outer boroughs are under-represented. There have been some cool visualizations of the data, including an interesting video from Chris Whong showing a heat-map of the 311 calls in 2012.

The advantage of this dataset is that it doesn’t take much to get something both interesting and intuitive out of it. Just charting the hour when noise complaints were made shows a predictable distribution, with the most calls logged between 11 pm and midnight and the fewest between 5 am and 6 am.

311 Noise Complaints by hour for the period 1 January through 20 May 2014

Breaking it down by day of the week also shows a pattern anyone would expect after spending more than a week in New York City:

311 Noise Complaints by day for the period 1 January through 20 May 2014
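
For readers who would rather script the two tallies above than build them in a spreadsheet, here is a minimal sketch in Python using only the standard library. It is not how the charts in this post were made. The column names ("Created Date", "Complaint Type"), the timestamp layout, and the idea that noise complaint types start with "Noise" are assumptions about a CSV export of the dataset, and the sample rows are made up purely to show the mechanics.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout of the "Created Date" column in a CSV export.
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def tally_by_hour_and_weekday(rows):
    """Count noise complaints per hour of day and per weekday name."""
    by_hour = Counter()
    by_weekday = Counter()
    for row in rows:
        # Assumption: noise-related complaint types start with "Noise".
        if not row["Complaint Type"].startswith("Noise"):
            continue
        created = datetime.strptime(row["Created Date"], DATE_FORMAT)
        by_hour[created.hour] += 1
        by_weekday[WEEKDAYS[created.weekday()]] += 1
    return by_hour, by_weekday


# Made-up rows for illustration only; these are not real 311 records.
sample_rows = [
    {"Created Date": "01/04/2014 11:15:00 PM", "Complaint Type": "Noise - Commercial"},
    {"Created Date": "01/04/2014 11:40:00 PM", "Complaint Type": "Noise - Street/Sidewalk"},
    {"Created Date": "01/05/2014 05:30:00 AM", "Complaint Type": "Noise - Commercial"},
    {"Created Date": "01/06/2014 09:00:00 AM", "Complaint Type": "Street Light Condition"},
]

by_hour, by_weekday = tally_by_hour_and_weekday(sample_rows)

# Print the hourly tally in clock order, including hours with no complaints.
for hour in range(24):
    print(f"{hour:02d}:00  {by_hour[hour]}")
```

With a real export, the rows could come from `csv.DictReader` instead of the hand-written list, provided the file uses the column names assumed here.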

With this very basic analysis, a call center manager can start planning staffing levels for various times of the day and week. Drilling down into locations, it’s easy to see how these complaints vary across the city. As my students discovered, in most boroughs the most common noise complaints concerned nightclubs and bars, except for the Bronx, where loud noise (presumably on the street or in parks) was the most prevalent type of complaint. As the (low) quality of the charts shows, this analysis was easy to do in Excel with PivotTables and simple charts, which underscores the point that sometimes data analysis doesn’t require fancy tools in order to be potentially useful. That was the point I tried to get across to my students, and maybe a few readers of this blog post.
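
The borough comparison is essentially a PivotTable with boroughs as rows and complaint types as columns. As a rough sketch of the same idea outside Excel, the Python below counts complaint types per borough and picks the most frequent one. The "Borough" and "Complaint Type" column names and the category labels are assumptions about the export, and the rows are invented for illustration, so the output says nothing about the real data.

```python
from collections import Counter, defaultdict


def top_complaint_by_borough(rows):
    """Return {borough: (most common complaint type, count)}."""
    # One Counter per borough, like a pivot of borough x complaint type.
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["Borough"]][row["Complaint Type"]] += 1
    return {
        borough: type_counts.most_common(1)[0]
        for borough, type_counts in counts.items()
    }


# Made-up rows for illustration only; these are not real 311 records.
sample_rows = [
    {"Borough": "MANHATTAN", "Complaint Type": "Noise - Commercial"},
    {"Borough": "MANHATTAN", "Complaint Type": "Noise - Commercial"},
    {"Borough": "MANHATTAN", "Complaint Type": "Noise - Street/Sidewalk"},
    {"Borough": "BRONX", "Complaint Type": "Noise - Street/Sidewalk"},
    {"Borough": "BRONX", "Complaint Type": "Noise - Street/Sidewalk"},
    {"Borough": "BRONX", "Complaint Type": "Noise - Commercial"},
]

top_types = top_complaint_by_borough(sample_rows)
for borough, (complaint_type, count) in sorted(top_types.items()):
    print(f"{borough}: {complaint_type} ({count})")
```

Keeping one counter per borough means the full table of counts is still available afterwards, so the same structure can feed a chart or a second question without rereading the data.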