Our investigation found that marijuana arrests actually rose after Kelly’s order. But finding that story involved diving into the data.

Thanks to a recent lawsuit, the NYPD releases a database each year of every single “stop-and-frisk” that officers make. Unfortunately, the database is so big it can’t easily be opened in Excel and the data also requires some serious “cleaning” to be usable.

To address these issues, we analyzed the data using the open source statistics program R, which can handle data cleaning, interrogation, and visualization in one program. Because R lets you type in commands that apply across multiple files, it removes the need for switching among Excel windows. R also supports the SQL-like queries through the sqldf extension package, which makes more complex database systems so powerful.

Cleaning
Because we were interested in when certain types of stop-and-frisk incidents had taken place, we used R to split the day, month, and year of each incident’s date field into individual columns. This set the data up for the next step in our analysis, which was to count up how many marijuana arrests occurred each month.

Querying
Using SQL queries, we were able to group and count the data by month and crime type. We focused our searches on marijuana possession (which in the NYPD data was spelled “marihuana”).

We ran a number of queries to see month-to-month trends and also compared across years to see how 2011 compared to data to 2008. This gave us valuable context because stops actually dropped in November and December of 2011, but not as much as they did in those same months in prior years. If Kelly’s order impacted officer behavior, we should have seen relatively dramatic decreases during those months, but found only slight declines. This context was vital to our story, and explaining why the 2011 drop was not significant was a high priority for our final visualization.

We also ran queries comparing arrests to stops as well as isolating specific precincts. However, only a few of these queries yielded results that were worthy of inclusion in the final interactive.

Visualizing Part 1:
In order to find the trends mentioned above, though, we first had visualize our query results, which R can do, too. An extension package for R called ggplot2 will generate high-quality, customizable line graphs that could be used directly for print graphics. However, we wanted ours to be interactive, which required some additional work.

Visualizing Part 2:
SVG (Serialized Vector Graphic) is a type of graphic that is drawn dynamically on a computer screen, which means that it can be highlighted, clicked, rolled over, or animated in ways that .jpg, .gif and .png files can’t. The ggplot2 graphics can be converted to SVG, and then published to the web using a javascript library called called Raphaël. Although this requires some copying-and-pasting, the clean, dynamic graphics it produces are worth it.

Putting it all together
To better tell the story, we compiled four sets of charts that we incorporated into a so-called “stepper graphic.” Thanks to the newsapps team at ProPublica, there is a great open-source library (http://www.propublica.org/nerds/item/anatomy-of-a-stepper-graphic) for building these graphics. Turning my four charts into four different “slides” was as easy as creating a function for each of them and then copying in their Raphaël code. The stepper graphic library took care of numbering and transitions. We built the grid and axes with standard HTML and CSS, and made label fades using simple jQuery fadeIn() and fadeOut() methods.

Finally, once we confirmed we were running the story with the Guardian, we adjusted the styles to make sure it would mesh well with their design. So we made the months lowercase, the font Georgia, and the line fuschia – perhaps the most important part.