Using OpenStreetMap to Predict Crime in San Francisco with Dataiku Data Science Studio

When I read Hanna's blog post about San Francisco Open Data visualization, I was very excited about what I could do with that crime data. Could I be a "Data Science Super-hero" and predict crime? That would be awesome, right?!

SFPD Open Data

As Hanna said, the SFPD dataset is really cool. We have all the incidents reported in San Francisco from 2003 until now, with their precise locations. It is fast and easy to load, clean and enrich them with a visual data preparation script in the studio. I will not go deeper into this description since she already did the job in her own blog post.

OpenStreetMap Data

Let's talk a little bit about OpenStreetMap. It is an open source project to create a "free editable map of the world", and it is an incredible source of data for anyone who wants to build applications based on geographical data. Indeed, people usually tag very interesting information such as the category of a street (residential, highway...), the location of parking lots, the location of public transport stops...

With DSS you can easily get this data for any city or country in the world. Here, let's focus on San Francisco and the three tables we're going to use:

ways

nodes

way_nodes

Ways is a list of all ways with their geographical shape ('linestring'), the nodes composing them, and a lot of valuable information in the 'tags' field.

Nodes is a list of all nodes with their positions ('geom') and, again, a lot of valuable information in the 'tags' field.

Way_nodes is the table that gives you the position of a node in the sequence composing a way.

PostGIS

If we want to manipulate such geographical data efficiently, PostGIS is the perfect tool. It is an extension of PostgreSQL with many powerful functions to perform geographical joins, compute distances... If you're not familiar with it you can have a look at this section.

Creating Segments and Features

First, I decided to do the analysis on street segments (the stretch between two intersections) and not on whole streets, because if you have ever been to the US you know that a street can be very... very long. So I used PostGIS within an SQL script to create the list of segments with their geography.
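To make the idea of a segment concrete, here is a minimal Python sketch of the splitting logic. It is not the SQL script used in the project: it assumes, for illustration, that a node is an intersection when it belongs to more than one way, and the function name and toy node ids are hypothetical.

```python
from collections import Counter


def split_ways_into_segments(way_nodes):
    """Split each way into segments between consecutive intersections.

    way_nodes: dict mapping way_id -> ordered list of node ids
    (the ordering that the way_nodes table provides).
    Returns a list of (way_id, [node ids]) tuples, one per segment.
    """
    # Assumption: a node shared by more than one way is an intersection.
    usage = Counter(n for nodes in way_nodes.values() for n in set(nodes))
    segments = []
    for way_id, nodes in way_nodes.items():
        if len(nodes) < 2:
            continue  # a way with fewer than two nodes has no segment
        current = [nodes[0]]
        for i in range(1, len(nodes)):
            node = nodes[i]
            current.append(node)
            is_last = i == len(nodes) - 1
            # Close the segment at an intersection or at the end of the way.
            if usage[node] > 1 or is_last:
                segments.append((way_id, current))
                current = [node]
    return segments


# Toy example: two ways crossing at node 3.
toy_ways = {"market": [1, 2, 3, 4], "cross": [5, 3, 6]}
toy_segments = split_ways_into_segments(toy_ways)
```

On this toy input each way is cut in two at node 3, which gives four segments in total.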

Secondly, I extracted all the POIs (points of interest) from the ways and nodes tables (still with an SQL script) and classified them into several categories: shop, entertainment, transport, amenity... What do I call a POI? It is a way or a node with an interesting tag that lets us classify it into one of these categories.

Now I have everything I need to create characteristic features for our segments. Here is a (non-exhaustive) list of the indicators I thought about, but if others come to your mind please feel free to comment and we will improve the model together:

Distance to closest public transport

Distance to closest shop

Distance to closest parking

Density of public transport within 100m, 250m...

Density of shops within 100m, 250m...

Density of parking within 100m, 250m...
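The indicators above boil down to two operations per POI category: a nearest-neighbour distance and a count within a radius. Here is a rough sketch in plain Python, using the haversine formula on (lat, lon) pairs. In the project this was done with PostGIS; the function names are hypothetical, and the count within a radius is used here as a simple stand-in for a density.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def poi_features(point, pois, radii=(100, 250)):
    """Distance to the closest POI and POI counts within each radius.

    point: (lat, lon) representing a segment, e.g. its midpoint.
    pois: list of (lat, lon) for one category (shops, parking...).
    """
    distances = [haversine_m(point[0], point[1], lat, lon) for lat, lon in pois]
    features = {"closest_m": min(distances) if distances else None}
    for radius in radii:
        features[f"count_within_{radius}m"] = sum(d <= radius for d in distances)
    return features


# Toy example: three made-up shops at 0 m, about 111 m and about 1.1 km.
segment_midpoint = (37.7749, -122.4194)
shops = [(37.7749, -122.4194), (37.7759, -122.4194), (37.7849, -122.4194)]
shop_features = poi_features(segment_midpoint, shops)
```

Running the same function once per category (transport, shop, parking...) and per radius yields the kind of feature columns listed above.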

I also thought I could see all the segments as a network and get information from the structure of the city. Two street segments (the nodes of the network) are connected if they meet in reality. To do that, I used the networkx Python package to generate the graph and compute several network scores, for example the degree centrality of a segment.
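As an illustration of what this score measures, here is a dependency-free sketch of the segment graph and its degree centrality. It is a stand-in for the networkx code, not a copy of it: it assumes two segments are connected when they share an endpoint node, and it uses the same normalisation as networkx's degree_centrality (degree divided by n - 1) for graphs with at least two nodes. The toy segment ids are made up.

```python
from collections import defaultdict


def build_segment_graph(segments):
    """Build an adjacency map where segments sharing an endpoint are linked.

    segments: dict mapping segment_id -> (start_node, end_node).
    """
    segments_by_node = defaultdict(set)
    for seg_id, (start, end) in segments.items():
        segments_by_node[start].add(seg_id)
        segments_by_node[end].add(seg_id)
    adjacency = {seg_id: set() for seg_id in segments}
    for seg_ids in segments_by_node.values():
        for seg_id in seg_ids:
            # Link the segment to every other segment touching this node.
            adjacency[seg_id] |= seg_ids - {seg_id}
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other segments each segment is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {seg_id: 0.0 for seg_id in adjacency}
    return {seg_id: len(neigh) / (n - 1) for seg_id, neigh in adjacency.items()}


# Toy example: a, b and c meet at node 2; c and d meet at node 4.
toy_segments = {"a": (1, 2), "b": (2, 3), "c": (2, 4), "d": (4, 5)}
toy_adjacency = build_segment_graph(toy_segments)
toy_centrality = degree_centrality(toy_adjacency)
```

In this toy network segment c touches all three other segments, so it gets the highest score, while the dead-end segment d gets the lowest.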

Here is the flow corresponding to this feature generation:

Geographical Join

I now have one dataset with segments (geography and features) and one dataset of geolocated crimes. The 'natural' thing to do is a geographical join between these two datasets. I want to match segments and crimes; again, this is a job for Postgres with PostGIS, and the function ST_Distance is our friend. I will describe how to perform such a geographical join in more depth in a how-to post coming soon.
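Until then, here is a rough Python sketch of the matching logic: each crime is attached to its nearest segment, and the crimes are counted per segment. This is only an approximation of what a PostGIS join does. It assumes straight segments given by their two endpoints, a local equirectangular projection (reasonable at city scale), and a hypothetical cutoff of 50 m beyond which a crime is left unmatched.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, in metres


def to_xy(lat, lon, lat0):
    """Project (lat, lon) to local metres (equirectangular approximation)."""
    x = math.radians(lon) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat) * EARTH_RADIUS_M
    return x, y


def point_segment_distance(p, a, b):
    """Planar distance from point p to the line segment [a, b]."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection of p onto the line so it stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def count_crimes_per_segment(crimes, segments, max_distance_m=50.0):
    """crimes: list of (lat, lon); segments: id -> ((lat, lon), (lat, lon))."""
    lat0 = sum(lat for lat, _ in crimes) / len(crimes) if crimes else 0.0
    projected = {sid: (to_xy(*a, lat0), to_xy(*b, lat0))
                 for sid, (a, b) in segments.items()}
    counts = {sid: 0 for sid in segments}
    for lat, lon in crimes:
        p = to_xy(lat, lon, lat0)
        best_id, best_d = None, max_distance_m
        for sid, (a, b) in projected.items():
            d = point_segment_distance(p, a, b)
            if d <= best_d:
                best_id, best_d = sid, d
        if best_id is not None:  # crimes farther than the cutoff are dropped
            counts[best_id] += 1
    return counts


# Toy example with made-up coordinates: two parallel segments, four crimes.
toy_segments = {
    "s1": ((37.7750, -122.4200), (37.7750, -122.4180)),
    "s2": ((37.7800, -122.4200), (37.7800, -122.4180)),
}
toy_crimes = [(37.7751, -122.4190), (37.7750, -122.4190),
              (37.7799, -122.4185), (37.7850, -122.4190)]
toy_counts = count_crimes_per_segment(toy_crimes, toy_segments)
```

The brute-force loop compares every crime with every segment, which is fine for a toy example; on the full dataset a spatial index, as PostGIS provides, is what keeps this kind of join fast.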

I get my final dataset with 17,012 rows (one for each segment), 106 features coming from OSM, and one Boolean target column set to 1 if this is a safe street (no crime ever recorded) and 0 if not (at least one crime since 2003). This gives 18% safe streets.

Creating a Predictive Model

This is the most exciting part! Are the features we computed related to crimes? Can we actually predict whether a street is safe or not based on them?

I love the new analysis bench of DSS v2.0 because it lets me build a Machine Learning model in only three clicks! It deals with missing values, parameters, etc., and I have almost nothing to do.

So here are the results of our classification problem (Random Forest and Logistic Regression) without any optimization or grid-search.

And there are plenty of fun new charts to analyze the performance of your models. For example, I like the density chart. It plots the density of predicted probabilities for both classes:

You can see how well our model separates the distributions of safe and non-safe streets! The higher the predicted probability, the more segments of class 1 (safe streets) we have.
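If you want to reproduce the idea behind this chart outside the studio, a coarse version is a per-class histogram of the predicted probabilities. The sketch below is only an approximation of the DSS chart, which draws smooth density curves: it computes the share of each class falling in each probability bin. The function name and the toy values are made up for illustration.

```python
def probability_histograms(probabilities, labels, bins=10):
    """Share of each class (0 or 1) falling in each predicted-probability bin.

    probabilities: predicted probability of class 1 for each segment.
    labels: true class (0 or 1) for each segment.
    """
    counts = {0: [0] * bins, 1: [0] * bins}
    for p, y in zip(probabilities, labels):
        # A probability of exactly 1.0 goes into the last bin.
        index = min(int(p * bins), bins - 1)
        counts[y][index] += 1
    shares = {}
    for y, class_counts in counts.items():
        total = sum(class_counts)
        shares[y] = [c / total if total else 0.0 for c in class_counts]
    return shares


# Toy example with two bins: [0, 0.5) and [0.5, 1].
toy_probabilities = [0.05, 0.15, 0.10, 0.90, 0.85, 0.40]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_shares = probability_histograms(toy_probabilities, toy_labels, bins=2)
```

A model that separates the classes well puts most of class 0 in the low bins and most of class 1 in the high bins, which is the pattern the chart shows.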

Interpreting and Visualizing the Results of the Model

Now that we have a good score, the interesting thing to do is to look at our model and its feature importances, so that we can learn what lies underneath and what characterizes safe and non-safe streets.

We see that the top-ranked features are the ones related to graph characteristics. For example, the degree centrality represents how connected a segment is to other segments. As the name says, it's a centrality measure.

The OSM features coming next are density measures like density of shops. It's a way to measure the level of activity around a segment.

With DSS 2.0 it is very simple to build maps in a webapp, so let's visualize our top features with a heatmap (from blue to red) to see if we can easily detect a correlation with safe streets (green points).

Clearly the centrality of a street segment is anti-correlated with safe streets!

Again, the density of shops is anti-correlated with safe streets!

So what we have here is that safe streets are not in the city center and have few shops close to them. Or, to say it in other words: crimes happen where there are many people and things to rob! It is quite fun to see that the most touristic areas of San Francisco are also the most "dangerous".

If you want to turn yourself into a data science hero, you just have to try DSS 2.0 and start tracking crime, or whatever else you can think of!