Abstract

From 1934 to 1963, San Francisco was infamous for housing some of the world’s most notorious criminals on the island of Alcatraz. Although the city by the bay is now known more for its tech scene than for its criminal past, crime is still no rarity, driven in part by rising wealth inequality and housing shortages.

Our project therefore explores nearly 12 years of crime reports, each with a time and location, from across all of San Francisco’s neighborhoods. We first explore the dataset visually with D3 and MATLAB to discover the overall trends and distributions. Then, building on the insights gained from the visualizations, we apply machine learning techniques to various combinations of attributes to build a predictive model for crime classification. As a real-world application, the San Francisco Police Department could consult such predictions to assign police manpower more quickly and efficiently.

Data

The data we are using comes from the Kaggle data science competition platform (https://www.kaggle.com/c/sf-crime/data). The dataset contains criminal incidents derived from the San Francisco Police Department Crime Incident Reporting system, ranging from 1/1/2003 to 5/13/2015, with the following attributes:

Dates – timestamp of the crime incident, format YYYY-MM-DD HH:MM:SS

Category – category of the crime incident, 39 categories in total

Descript – detailed description of the crime incident

DayOfWeek – the day of the week, 7 values

PdDistrict – name of the Police Department District

Resolution – how the crime incident was resolved

Address – the approximate street address of the crime incident

X – Longitude of crime scene

Y – Latitude of crime scene

Since the data is well formatted, we did not perform further data cleaning, as doing so might discard useful information. However, because the label attribute “Category” is excluded from the test data, the test data cannot be used to evaluate our machine learning models by accuracy. In our exploration we therefore focused on the training data, which contains 878050 crime entries, each represented as (Dates, Descript, DayOfWeek, PdDistrict, Resolution, Address, X, Y). To prepare for feature extraction and machine learning, we split the provided “train.csv” into new training and test data: half for training, half for testing.
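The half-and-half split described above can be sketched with pandas and scikit-learn. The tiny DataFrame below is an illustrative stand-in for the real train.csv (which has 878050 rows), not the actual pipeline code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv (the real file has 878050 rows and more columns).
df = pd.DataFrame({
    "Dates": ["2003-01-06 14:00:00", "2010-07-04 23:30:00",
              "2015-05-01 05:00:00", "2008-03-15 12:00:00"],
    "Category": ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "NON-CRIMINAL"],
    "PdDistrict": ["SOUTHERN", "MISSION", "SOUTHERN", "CENTRAL"],
})

# Shuffle, then keep half for training and half for testing.
train, test = train_test_split(df, test_size=0.5, random_state=0)
```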

Hypothesis

We assume the total crime count decreased from 2003 to 2015.

We assume the Police Department District (PdDistrict) is the feature most indicative of and correlated with the crime category.

We assume more crimes take place on weekdays than on weekends.

We assume Southern is the most dangerous district in San Francisco.

Methodology

Data Visualization:

To answer questions such as “What is the crime category distribution and pattern in the city by the bay?” and “Which features are most relevant for predicting crime categories?”, we start by exploring the data visually to see what hints and insights can be obtained. Our visualizations are delivered mainly with D3 and MATLAB.

Machine Learning:

We also perform some statistical analysis to better understand the overall dataset. Then, based on the hypotheses listed above, we take the following steps to build and select a predictive model:

Split the data into a training set and a test set. Learn a model on the training data and evaluate it on the unseen test data.

Extract attributes to be used as input for machine learning algorithm.

Use one-hot encoding to prepare the actual training features, so that learners based on standard distance metrics between samples (such as k-nearest neighbors) are not misled by arbitrary integer labels.

Data Visualization, Trends & Patterns

To begin with, we would like to know the distribution of the crime categories and answer the question: what are the most common crimes? There are 39 crime categories in total, listed as follows (ordered by total crime count from highest to lowest):

Total Crime Counts of 39 Crime Category from 2003-2015

We can see that the total crime count over the 12 years differs significantly between crime types, suggesting the categories are unevenly distributed.

Top 10 Crime Category Distribution

Additionally, the statistics show that nearly a quarter of all crimes belong to “Larceny/Theft”. The top five crime types account for 60% of the data, and the top 20 classes cover 97% of the entire dataset. This means there are very few training instances for crime categories #21-#39, so predictions for this part of the data might hurt the overall performance of the model, given that we have to build a 39-class classifier. Theft is the dominant crime type and thus deserves further exploration.

Having established the overall distribution of the target label “Category”, we move on to probe the hidden temporal and geographical patterns in the data belonging to the top 10 crime categories, since the data provides information on two aspects: time and location.

To test the assumption that crime has temporal patterns, we derived hour, day of week, month, and year from the attributes “Dates” and “DayOfWeek” and produced the visualization above, which reveals some interesting patterns:

Weekly Crime Count

The distribution of crime counts across days of the week is comparatively even. No day shows an extreme value: the highest count is 133734 on Friday and the lowest is 116707 on Sunday. This suggests our third hypothesis is incorrect.

Hourly Crime Count:

5 am is the most peaceful time of day, with the fewest crime incidents.

Crimes are more likely to take place after 12pm than before 12pm.

There are three peak hours for crime: midnight (around 12 am), noon (around 12 pm), and the late afternoon (17:00-18:00).

Monthly Crime Count:

From the stacked area chart, we can see two peaks of crime through the year: May and October.

Yearly Crime Count:

The 13th year, which corresponds to 2015, shows far less crime than the years before. This is not because the police department mounted an intensive fight against crime; the underlying reason is simply that we only have crime data through 05/13/2015.

More importantly, San Francisco has witnessed an increasing crime rate since 2010; by 2014 the rate had risen 16.5% compared to 2010. When we investigate theft, which tops all 39 categories, it is obvious that theft has increased sharply since 2009, driving the overall rise. The vast majority of theft is property crime: automobile break-ins, pickpocketing/purse-snatching, and shoplifting.

After thorough analysis, we consider two possible causes to be of particular significance:

Electronic devices like smartphones and tablets are increasingly prevalent, and these items seem to be the easiest and most lucrative targets for thieves.

Another possible reason is that victims and witnesses are now more willing to report crimes to the police, meaning the crime wave may actually be an increase in crimes being reported rather than in crimes committed.

Statistics also show that Southern has the highest property-theft count, reaching 41845 over the 12-year period. This picture shows the yearly theft counts in 2003 (the full animation can be accessed via this link: https://embed.plnkr.co/GrPNfG7qQuwZwYKLr1D0/).

From the pie chart above, we discovered that nearly a quarter of the city’s crime occurs in the densely populated, transit-rich Southern Police Station district, which runs from The Embarcadero to south of Market Street. We can also find supporting evidence in the hourly theft counts (refer to the bar chart below): the peak is 18:00-19:00, the evening rush hour when people are busy commuting.

Hourly Larceny/Theft Counts

Machine Learning

Our project went through a tough stretch as we attempted to build an effective and comparatively “accurate” predictive model. We went through the four phases below; both the negative and the positive results from our trials are recorded here.

Negative Results

“Robbery/Non-Robbery” Classifier

Our first trial was binary classification instead of multi-class classification. The assumption is quite simple: we use the (DayOfWeek, PdDistrict) pair as features to predict the crime category. We chose the category ‘ROBBERY’, so the classification labels are a 0/1 array indicating whether a crime belongs to this category.

Using these two features, we tried three algorithms: Naive Bayes, Logistic Regression, and SVM. We used the first half of the training data for training and the second half for testing, for the simple reason that test.csv has no category label and thus cannot be used to compute accuracy. All three algorithms output the same accuracies: 0.974156 on training and 0.973455 on validation.

Why does this happen? Let’s take a look at the confusion matrix: [[427371, 0], [11654, 0]]. As we can see, the classifier simply labels everything as non-ROBBERY. The reason is that only 2.69% (23000/855049 = 0.026899) of all crimes are ROBBERY. We can conclude that the (DayOfWeek, PdDistrict) pair is not a good predictor of crime category.
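This majority-class failure mode is easy to reproduce on synthetic data of the same shape: integer-coded (DayOfWeek, PdDistrict) features carrying no signal and roughly 2.7% positives. The numbers below are illustrative, not the real dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
# Uninformative integer-coded features and a ~2.7% positive rate.
X = rng.integers(0, 7, size=(1000, 2))
y = (rng.random(1000) < 0.027).astype(int)

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)
cm = confusion_matrix(y, pred)
# Accuracy looks high, but the second column of cm is all zeros:
# the model never predicts the positive (ROBBERY) class at all.
```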

Triple-classification

In the next phase, we decided to adjust the learning goal of our machine learning algorithm. Instead of a binary classifier outputting whether a crime is ROBBERY or NON-ROBBERY, we adopted multi-class classification, classifying each crime as LARCENY/THEFT, NON-CRIMINAL, or ASSAULT. In this attempt we still used the same (DayOfWeek, PdDistrict) feature set. The resulting accuracy is around 0.22 for all of Naive Bayes, Logistic Regression, and SVM.

This is clearly not an accurate classifier. We suspect the problem is that we are using too few features, leading to underfitting. We could try more attributes as learning features, or try other learning goals such as regional data, which could help the police department focus more on certain regions, as crimes are committed more in some places than others.

Positive Results

“LARCENY/THEFT / NON-LARCENY/THEFT” Classifier

Based on the category distribution we visualized earlier, we targeted the crime category LARCENY/THEFT and attempted to build a LARCENY/THEFT vs. NON-LARCENY/THEFT classifier: this category has 174900 incidents in the training data, so we can probably get a more accurate classifier from the extra information hidden in the additional data.

We also adjusted the features fed into the model. During feature extraction, we found the geographical coordinates are not very useful: since all the crimes occurred within San Francisco, the longitudes and latitudes are nearly the same, making it hard to discriminate between classes based on them. It makes more sense to use PdDistrict to convey the geographical location of crime zones.

In addition, we added a new feature indicating the time of day. From the visualizations, we found that crime is to some extent related to time; specifically, 05:00:00 – 05:59:59 is the most peaceful period. So we converted the “Dates” field into 4 discrete values indicating the period when the crime occurred. We thus had three features: Time Period, DayOfWeek, and PdDistrict.

Moreover, to balance the data between the two classes and keep the classifier from simply predicting the majority class, we used only part of the NON-LARCENY/THEFT data. With these adjustments, the resulting accuracy for predicting whether a crime belongs to LARCENY/THEFT is around 60%. The performance statistics are as follows:

Binary Classification Performance on LARCENY/THEFT

                      Precision   Recall   F1-score
NON LARCENY/THEFT     0.62        0.67     0.64
LARCENY/THEFT         0.58        0.52     0.55

From the statistics above, this is much better than the previous attempt’s 22% accuracy. Since the type I and type II error rates for the two classes are similar, the classifier predicts both classes reasonably well. Considering that a certain amount of data shares the same features while belonging to different classes, the result is quite good. For next week, we plan to optimize our features to improve the classifier’s performance; we cannot make further progress with the current features alone.

Ultimate Solution

Since the dataset is relatively large (over 800,000 entries), we preprocess and filter the input file before applying any learning algorithm. The basic idea is to generate a series of subsets of training and testing pairs. Furthermore, as discussed earlier, crime counts are uneven across categories; the unpopular ones such as ‘GAMBLING’ and ‘BRIBERY’ may introduce outliers and decrease accuracy, so we decided to filter those instances out. The output of this preprocessing is a series of training and testing pairs for the top 20, top 10, top 5, and top 3 categories, plus the original unfiltered pair.
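The category filtering step can be sketched in pandas (toy row counts for illustration; `Category` matches the dataset’s real column name):

```python
import pandas as pd

# Toy data: a popular, a mid-sized, and an unpopular category.
df = pd.DataFrame({"Category": ["LARCENY/THEFT"] * 5 + ["ASSAULT"] * 3 + ["GAMBLING"] * 1})

# Keep only rows whose category is among the N most frequent.
top2 = df["Category"].value_counts().nlargest(2).index
filtered = df[df["Category"].isin(top2)]
```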

After preprocessing, the next step is feature extraction. Raw inputs such as the date (2003-01-06) are split into 3 separate features (year, month, and day). The Resolution attribute is left out, since a fair number of entries have a NULL value. The address of the individual crime is left out, since there is generally one crime per address, and the coordinates (X, Y) are dropped for the same reason. However, we do keep PdDistrict as an important geographical feature, as it is much easier to represent. The resulting feature space is [year, month, day, hour, DayOfWeek, PdDistrict].
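Splitting the timestamp into separate features can be done with pandas datetime accessors; this is a sketch, not necessarily the exact pipeline we ran:

```python
import pandas as pd

df = pd.DataFrame({"Dates": ["2003-01-06 14:30:00", "2015-05-13 23:59:00"]})
ts = pd.to_datetime(df["Dates"])

# Derive the year/month/day/hour features used in the feature space.
df["Year"] = ts.dt.year
df["Month"] = ts.dt.month
df["Day"] = ts.dt.day
df["Hour"] = ts.dt.hour
```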

Another important optimization we use is one-hot encoding. Features such as DayOfWeek are categorical labels: it makes no sense to treat Friday (labeled 4) as twice as large as Wednesday (labeled 2). A better representation is a 7-bit vector that represents Friday as [0000100] (bit 4 on, zero-indexed), Wednesday as [0010000] (bit 2 on), and so on.
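This encoding can be reproduced with `pandas.get_dummies`; a minimal sketch:

```python
import pandas as pd

days = pd.DataFrame({"DayOfWeek": ["Wednesday", "Friday", "Sunday"]})

# Each day becomes its own 0/1 indicator column, so no day is
# numerically "twice as large" as another.
encoded = pd.get_dummies(days, columns=["DayOfWeek"])
```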

Finally, the dataset is ready to use. Thanks to the sklearn Python package, we can easily apply various machine learning algorithms, from basic ones like Logistic Regression to more complex ones like Random Forest. Here is the list of classifiers we have tried:
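Trying several sklearn classifiers on the same feature matrix boils down to a small fit-and-score loop like the one below (synthetic data for illustration; the accuracies in the table come from the real crime features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the encoded crime features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "BernoulliNB": BernoulliNB(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}
# Fit each model and record its accuracy.
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```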

LARCENY/THEFT vs. NON-LARCENY/THEFT

SVM (Linear SVC)      0.7973
Logistic Regression   0.7974
BernoulliNB           0.7973
AdaBoost              0.7728
Random Forest         0.7972
Bagging               0.7974
Gradient Boosting     0.7972

Last but not least, we revisit the multi-class classification we discussed earlier, using the models constructed above, this time classifying the most popular categories. Here is the result:

                    TOP 3     TOP 5     TOP 10    TOP 20    ALL 39
                    CLASSES   CLASSES   CLASSES   CLASSES   CLASSES
SVM                 0.2907    0.2907    0.1969    0.1415    0.1477
LogReg              0.3329    0.3329    0.2389    0.2043    0.1998
BernoulliNB         0.3476    0.3476    0.2487    0.2117    0.2079
AdaBoost            0.3767    0.3767    0.2662    0.2138    0.1825
Random Forest       0.2877    0.2877    0.1727    0.1559    0.1309
Bagging             0.4690    0.3412    0.2368    0.2026    0.1931
Gradient Boosting   0.4880    0.3760    0.2701    0.2316    0.2233

Analysis and Future Directions

Although it is difficult to classify most examples correctly in the multi-class setting, our results beat random guessing because we extract several useful features (such as year, month, day, and PdDistrict) and ignore uninformative ones (X, Y). We employed several machine learning algorithms discussed in class, such as SVM, Naive Bayes, and Logistic Regression. The best result on the top 3 classes is around 0.48, using gradient boosting; on average we reach around 0.35.

We also tried to optimize feature extraction, but the accuracy seemed to hit a bottleneck. The reason multi-class accuracy cannot be improved further is that the classifiers have high bias: the algorithms can only learn simple models from relatively uninformative features.

In addition, we found that quite a few examples share the same feature combination yet belong to different categories, which also caps the accuracy of multi-class classification. Looking into the weights of the classifiers, no feature carries a large weight, indicating our classifiers may underfit the data. One solution to underfitting is to add features more related to the category, such as the criminal environment, weather, or temperature.

Moreover, as the number of categories increases, classifier accuracy drops considerably. We see two main reasons:

1) the features are not sufficient;

2) the number of examples per category varies widely.

The first point has been covered above. As for the second, we attribute it to the category distribution: as discussed earlier, training instances for crime types #21~#39 are very few compared to the other types, adding noise that hurts accuracy. Lack of data is the most severe problem in the multi-class setting; it leads the classifier to label an unseen example with a well-represented category rather than a sparse one. To classify better, the data should be more evenly distributed among the categories.

Machine Learning

First, a quick recap of our last attempt: we used SVM with three features, namely Time, DayOfWeek, and PdDistrict. For the “Time” field, we discretized the timestamp by defining 00:00:00~06:59:59 as “Late Night”, 07:00:00~12:59:59 as “Morning”, 13:00:00~18:59:59 as “Afternoon”, and 19:00:00~23:59:59 as “Evening”, so the new “Time” field takes one of those four values. This feature combination yields an accuracy of 0.66.
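The discretization above maps an hour of day to one of four periods; as a sketch:

```python
def time_period(hour):
    """Map an hour (0-23) to the four discrete periods used as a feature."""
    if hour <= 6:
        return "Late Night"   # 00:00:00 - 06:59:59
    if hour <= 12:
        return "Morning"      # 07:00:00 - 12:59:59
    if hour <= 18:
        return "Afternoon"    # 13:00:00 - 18:59:59
    return "Evening"          # 19:00:00 - 23:59:59
```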

This week, we explored the LARCENY/THEFT vs. NON-LARCENY/THEFT classifier further, deciding to use more features to see whether performance improves. Given the insight from the visualizations, we consider the “Dates” field quite informative, so we transformed the “YYYY-MM-DD HH:MM:SS” timestamp into 4 separate time features: Hour, DayOfWeek, Month, and Year. The final feature space is therefore (Hour, DayOfWeek, Month, Year, PdDistrict).

We plan to use Logistic Regression for the simple reason that, compared to SVM, it outputs not only a label but also a probability. For the later multi-class stage, we can train a separate Logistic Regression classifier per crime category, run all of them on each unseen test example, and assign the category whose classifier reports the highest probability.
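This one-classifier-per-category idea corresponds to scikit-learn’s one-vs-rest scheme, where prediction picks the class whose classifier reports the highest probability. The features below are toy values for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy (Hour, district-code) features with two crime categories.
X = np.array([[0, 1], [1, 2], [2, 0], [20, 1], [21, 2], [22, 0]])
y = np.array(["ASSAULT", "ASSAULT", "ASSAULT",
              "LARCENY/THEFT", "LARCENY/THEFT", "LARCENY/THEFT"])

# One binary logistic regression per category.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
proba = clf.predict_proba(X)                 # one column per category
pred = clf.classes_[proba.argmax(axis=1)]    # highest-probability category wins
```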

Through a number of experiments, we made some interesting discoveries.

Focusing on the Data belonging to Top 20 Crime Categories

We did some simple statistical analysis of the distribution of the 39 crime categories in the data.

As you can see, the top 20 categories account for 851677 incidents in the training data, or 96.99% of the total 878050. So we may treat crime types 21~39 as “outliers” or “exceptional cases”; dropping them and focusing on the top 20 categories should reduce some noise in the training data.

Splitting Data

Since the test set provided by Kaggle has no category label, it is of little use to us at this stage. Thus, we have to split the original training set, which contains 884263 records in total, into new training data and test data.

We found that classifier accuracy depends largely on the composition of the training data, more specifically on the balance between the positive and negative classes. To begin with, we used all 884263 training records, selecting 1/3 of the total as test data and the rest as training data.

The confusion matrix is as follows:
[[223063 0]
[57991 0]]
The problem is that we have 676777 negative cases, since any record that does not belong to “LARCENY/THEFT” is counted as class 0, but only 174900 positive cases. Such training data is clearly unbalanced, so we adjust the proportion and keep only 200000 negative cases.
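That downsampling step can be sketched in pandas (toy row counts here; the real run keeps 200000 of the 676777 negatives):

```python
import pandas as pd

df = pd.DataFrame({"Category": ["LARCENY/THEFT"] * 3 + ["ASSAULT"] * 9})

pos = df[df["Category"] == "LARCENY/THEFT"]
# Keep only a random subset of negatives, roughly matching the positives.
neg = df[df["Category"] != "LARCENY/THEFT"].sample(n=len(pos), random_state=0)
balanced = pd.concat([pos, neg])
```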

Feature Transformation

We also performed some feature transformation. As mentioned, we are using (Hour, DayOfWeek, Month, Year, PdDistrict), all of which are discrete values. To fit the Logistic Regression model, we apply the following conversions:

Hour: an integer between 0 and 23.

DayOfWeek: 1 represents Monday, 2 represents Tuesday, and so on, so the domain is {1,2,3,4,5,6,7}.

Month: 1 represents January, 2 represents February, and so on, so the domain is {1,2,...,12}.

Year: the data ranges from 1/1/2003 to 5/13/2015, spanning 13 calendar years, so 1 represents 2003, 2 represents 2004, and so on; the domain is {1,2,...,13}.
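These conversions are simple lookups; a sketch:

```python
# Map day names and month names to small integers, as described above.
DAY_CODE = {day: i + 1 for i, day in enumerate(
    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])}

MONTH_CODE = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

def year_code(year):
    # 2003 -> 1, 2004 -> 2, ..., 2015 -> 13, keeping Year on a small scale.
    return year - 2002
```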

In this way the features are on a much more similar scale, and it indeed works: with the raw year values (2003~2015) the accuracy is only 0.54, but after the transformation it reaches 0.8.

By adopting the strategies above, our classifier for predicting whether a crime belongs to “LARCENY/THEFT” or “NON LARCENY/THEFT” reached an accuracy of 0.8. The detailed performance statistics are listed below:

Failed Trial and Negative Results

We tried to narrow the classes under investigation down to the top 10, whose training-instance counts range from 174900 down to 31414, to see whether we can perform multi-class classification there.

From the category statistics above, we need to adjust the proportions of the class instances in the training data, preferably making them evenly distributed. So we build the new training data in several steps:

Split the data into top 10 classes according to the “Category” field value.

For each per-class “.csv” file, shuffle and select 1/3 as training data and 1/3 as test data.

Merge the per-class training data into the whole training set, and likewise merge the individual test data into the final test set.
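The three steps above can be sketched with a pandas groupby (toy data here; the real inputs are per-class .csv splits):

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A"] * 6 + ["B"] * 3,
                   "Hour": list(range(9))})

train_parts, test_parts = [], []
for _, group in df.groupby("Category"):
    group = group.sample(frac=1, random_state=0)      # shuffle within the class
    third = len(group) // 3
    train_parts.append(group.iloc[:third])            # 1/3 for training
    test_parts.append(group.iloc[third:2 * third])    # 1/3 for testing

# Merge the per-class pieces into the final sets.
train = pd.concat(train_parts)
test = pd.concat(test_parts)
```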

We found that if we do not balance the negative and positive class distributions, the accuracy of the top 10 per-crime binary classifiers is as follows:

Although the accuracy is not as good as above, the confusion matrix makes more sense.
Besides that, with both Logistic Regression and Random Forest multi-class classifiers, the accuracy is only around 0.2.
We think an important reason the other classes perform poorly is that there is not much information in the training data: with the unbalanced dataset, the classifier naively predicts the majority class; with the balanced dataset, the training data is not informative enough to build a decent predictive model.

What to do next

We may consider using other attributes such as “Descript”. Since it is a text feature related to the essence of the crime, we could perform text analysis and derive class-indicative keywords to improve accuracy.

Visualization

After a couple of unsuccessful attempts at finding an appropriate basemap and mapping the X, Y coordinates of the data points, we decided to change our design. Our machine learning results suggested that Year is the most relevant feature and that, compared to the X, Y coordinates, PdDistrict is more relevant for binary classification of crime categories. Based on this, we decided to shift our focus to mapping the yearly total incidents of each crime type onto a San Francisco contour map. If time permits, we will also explore a monthly view option.

Machine Learning

Last week, we tried to build a multi-class classifier that labels each observation as LARCENY/THEFT, NON-CRIMINAL, or ASSAULT. It turned out that an accuracy of 22% on the test set is not promising.

So this week we targeted the crime category LARCENY/THEFT, attempting to build a LARCENY/THEFT vs. NON-LARCENY/THEFT classifier: this category has 174900 incidents in the training data, so we can probably get a more accurate classifier from the extra information hidden in the additional data.

We also adjusted the features fed into the model. During feature extraction, we found the geographical coordinates are not very useful: since all the crimes occurred within San Francisco, the longitudes and latitudes are nearly the same, making it hard to discriminate between classes based on them. It makes more sense to use PdDistrict to convey the geographical location of crime zones.

In addition, we added a new feature indicating the time of day. From the visualizations, we found that crime is to some extent related to time; specifically, 05:00:00 – 05:59:59 is the most peaceful period.

So we converted the “Dates” field into 4 discrete values indicating the period when the crime occurred, giving us three features: Time Period, DayOfWeek, and PdDistrict. Moreover, to balance the data between the two classes, we used only part of the NON-LARCENY/THEFT data.

With these adjustments, the resulting accuracy for predicting whether a crime belongs to “LARCENY/THEFT” is around 60%. The performance statistics are as follows:

NON LARCENY/THEFT:

Precision: 0.62, Recall: 0.67, F1-score: 0.64

LARCENY/THEFT:

Precision: 0.58, Recall: 0.52, F1-score: 0.55

From the statistics above, this is much better than the previous attempt’s 22% accuracy. Since the type I and type II error rates for the two classes are similar, the classifier predicts both classes reasonably well. Considering that a certain amount of data shares the same features while belonging to different classes, the result is quite good.

For next week, we plan to optimize our features to improve the classifier’s performance; we cannot make further progress with the current features alone.

Visualization

As you can see, we have made many visualizations presenting different aspects of the data, and some of them genuinely helped us during feature engineering. But we also realize that we may need to drop the ones that did not reveal much about the patterns or interesting aspects of the data.

Thus, in the next phase of the project, we will not only try to improve the accuracy of the prediction model, but also rework the visualizations into more expressive, more interpretable ones for the final presentation, by adopting more effective visualization methods or adding animations.

In the last version, we simply visualized the raw data, from which the patterns stayed hidden and we made no significant discoveries. This week we improved our visualizations and tried other predictive models based on the insights they provided.

Crime Map

To begin with, we made a better crime map and fixed last time’s bug that limited plotting. We attempted one of the most promising-looking ways of creating geographical maps for large datasets: R and its ggmap library. One of our problems with D3 was that no sufficiently detailed San Francisco topographic basemap is available, which means plotting the SF contours alone would take a tremendous amount of time, let alone adding more than 10k data points onto the basemap; in fact, the browser crashed when the number of data points exceeded 3k.

The ggmap library is easy to use. The OpenStreetMap package containing the SF basemap can be used directly in R, which means generating the following 10-year map of Prostitution and Sex Offenses Forcible crimes took us less than 20 seconds.

However, we found it is almost impossible to create interactive geographical maps with R and serve them on the client side the way D3 does. There may be other alternatives, but for now it looks like .jpg files are the best R can produce.

We will keep looking for other solutions over the next week or so, hoping to come across a framework that maps data geographically and efficiently without compromising functionality.

Pattern of Date/Time

Given that date/time serves as an important independent feature of crime, we further derive the hour, month, and year from the “Dates” field, and normalize the hourly and monthly crime counts using the z-score formula (x − mean(X)) / std(X).
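Assuming the intended formula is the standard z-score (x − mean(X)) / std(X), the normalization can be sketched with numpy (the counts below are hypothetical):

```python
import numpy as np

# Hypothetical hourly crime counts for one category.
counts = np.array([120.0, 95.0, 140.0, 60.0, 85.0])
normalized = (counts - counts.mean()) / counts.std()
# After normalization every category's curve has mean 0 and unit variance,
# so curve shapes can be compared across categories directly.
```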

We picked the top 10 crimes. Although their total crime counts vary a lot, similar patterns emerge after normalization.

Finding this pattern may help us produce training features. Next week we plan to build predictive models using 5 independent features: hour, day of week, month, year, and PdDistrict.

One more idea concerns the “Resolution” field. Common sense suggests that crimes that went unresolved might be more likely to occur again, since the perpetrator was not caught. So we may group crimes into a binary Resolved/Unresolved category to see whether it helps make more accurate predictions.

Machine Learning

One useful prediction to make is the crime category from DayOfWeek and PdDistrict. We have explored Naive Bayes, Logistic Regression, and SVM so far. The result is a high accuracy (>0.97); however, we conclude it is a bad attempt after looking into the confusion matrix: the classifier simply labels every entry as negative. The reason is that, taking ROBBERY as an example, only 2.69% (23000/855049 = 0.026899) of all crimes are ROBBERY.

During one of our group discussions, we decided to adjust the learning goal of our machine learning algorithm. Instead of a binary classifier outputting whether a crime is ROBBERY or NON-ROBBERY, we adopted multi-class classification, labeling each crime as LARCENY/THEFT, NON-CRIMINAL, or ASSAULT. In this attempt we still used the same (DayOfWeek, PdDistrict) feature set. The resulting accuracy is around 0.22 for all of Naive Bayes, Logistic Regression, and SVM.

This is clearly not an accurate classifier. We plan to investigate further in the final project, possibly by using other columns of the data as learning features, or by trying other learning goals such as regional data, which could help the police department focus more on certain regions, since crimes cluster in some places. If the crime category proves genuinely hard to predict, we will try to analyze and explain why, and possibly conclude what is missing from the data that leads to the bad performance. All three algorithms (Naive Bayes, Logistic Regression, SVM) can still be used to predict the crime category.

Introduction

This dataset contains incidents derived from the San Francisco Police Department Crime Incident Reporting system, ranging from 1/1/2003 to 5/13/2015, and the data has already been divided into a training set and a test set. The training set contains 878050 records, each representing an incident, and the test set contains 884263. The goal of this project is to train a predictive model and use it to predict which category each record in the test set belongs to. Here is a subset of the training data:

The incidents have the following attributes:

Dates – timestamp of the crime incident, format YYYY-MM-DD HH:MM:SS

Category – category of the crime incident, 39 categories in total

Descript – detailed description of the crime incident

DayOfWeek – the day of the week, 7 values

PdDistrict – name of the Police Department District

Resolution – how the crime incident was resolved

Address – the approximate street address of the crime incident

X – Longitude

Y – Latitude

Data Cleaning

Since the data is well formatted, we did not perform further data cleaning, which might have discarded useful information and hurt accuracy. So at this stage, we use all the provided data in the following visualization and machine learning processes.

Data Visualization

In order to dig out correlations between the attributes listed above, we’ve created a series of visualizations showing the different dimensions of the training set.

Part 1 : Category

So to begin with, we’d like to know the distribution of the crime categories and answer the question: what are the most common crimes? As mentioned, there are 39 crime categories in total, and we marked each category as follows:

Observation:
We intend to visualize the crime categories from the highest counts to the lowest. Since there are too many categories, the idea is to pick the top 10 crime categories and see whether there is any interesting pattern within a specific category.
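Selecting the top-10 categories amounts to a simple frequency count over the Category column. A minimal standard-library sketch; the inline sample rows below are illustrative stand-ins for the real train.csv:

```python
# Count crimes per category and take the top 10, using only the stdlib.
# The sample CSV mimics the Kaggle train.csv layout.
import csv
import io
from collections import Counter

sample_csv = io.StringIO(
    "Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y\n"
    "2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,ARREST BOOKED,OAK ST / LAGUNA ST,-122.4,37.7\n"
    "2015-05-13 23:33:00,LARCENY/THEFT,GRAND THEFT,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.4,37.8\n"
    "2015-05-13 23:30:00,LARCENY/THEFT,PETTY THEFT,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.4,37.7\n"
)

counts = Counter(row["Category"] for row in csv.DictReader(sample_csv))
top10 = counts.most_common(10)
print(top10)  # [('LARCENY/THEFT', 2), ('WARRANTS', 1)]
```

On the full file, the same two lines of counting code would yield the ranking plotted above.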

Part 2 : DayofWeek

First, we’d also like to see whether there is a correlation between the “DayOfWeek” attribute and crime counts. So the question here is: are most crimes committed on weekdays or on the weekend?

Actually, the distribution of “DayOfWeek” vs. crime counts is comparatively even. No extreme value appears on any specific day, with the highest count of 133734 crimes on Friday and the lowest of 116707 on Sunday.

Part 3 : Hour

Since no obvious trend can be seen when considering only the day, we’d like to explore the time of each occurrence belonging to the top 10 most common crimes in San Francisco. So we made a dashboard here for better visualization.
The following picture plots hour vs. crime counts for all ten categories.

Then we move on to: given a specific category, what is the distribution of hours vs. crime counts?

05:00:00 – 05:59:59 is the most peaceful time period, with the fewest crime incidents.

Crimes are more likely to take place after 12pm than before 12pm.

There are three peak periods: midnight around 12am, noon around 12pm, and the evening period 17:00–18:00.
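This hour-level view requires deriving an hour attribute from the “Dates” field (format YYYY-MM-DD HH:MM:SS). A minimal sketch with the standard library; the sample timestamps are illustrative:

```python
# Extract the hour of day (0-23) from the "Dates" field so crimes can be
# bucketed into 24 hourly periods.
from datetime import datetime

def crime_hour(dates_field):
    """Parse a Dates value and return its hour of day."""
    return datetime.strptime(dates_field, "%Y-%m-%d %H:%M:%S").hour

print(crime_hour("2015-05-13 23:53:00"))  # 23
print(crime_hour("2003-01-01 05:15:00"))  # 5
```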

Also, we can check, given a specific hour period, the distribution of category vs. crime counts. For example, “00” represents incidents taking place between 00:00:00 and 00:59:59. You can click on the gallery to see the detailed statistics in the legend table of each visualization.

(Gallery: one chart per hour period, labeled 00 through 23.)

Observation:

The overall distributions tend to be even and similar: the larger share a category has of the overall distribution, the larger its share of the distribution within any specific hour period. For example, in each hour of the day, LARCENY/THEFT, OTHER OFFENSES, NON-CRIMINAL and ASSAULT account for the large majority of the data.

Part 4 : Police Department District

We are using pie chart visualization to find the correlation between the crime category and the PD District.

Observation:
Each PD District has slightly different proportions of the crime categories, but the overall distribution is fairly even. The most common crime category is LARCENY/THEFT, shown in the blue region of the pie chart.

Part 5 : Longitude & Latitude

We have explored the relationship between reported crimes on Prostitution and Sex Offenses Forcible (SOF) by geographically mapping the related data from 2010 – 2015 onto a map of San Francisco. Each red dot represents one reported SOF crime, while each blue dot represents one reported Prostitution case.

2010 – Prostitution and SOF

2011 – Prostitution and SOF

2012 – Prostitution and SOF

2013 – Prostitution and SOF

2014 – Prostitution and SOF

2015 – Prostitution and SOF

Map for referencing districts in San Francisco:

It is easy to see from the maps that, during the past five years, Prostitution crimes aggregated almost exclusively in the downtown and inner Mission areas (8 and 9 on the reference map). Although SOF cases seem to scatter everywhere on the map, they also aggregate more densely in areas 8 and 9. However, it is very interesting that in area 9, SOF crimes hardly overlap with Prostitution geographically. One possible, though speculative, conclusion from this observation is that Prostitution may help reduce the rate of SOF.

Part 6: Parallel Coordinates of High-Dimensional Data

We are using parallel coordinates to find correlations among all the dimensions except ‘Descript’ and ‘Address’ for each category, which means the following visualization covers six-dimensional data: “Dates”, “DayOfWeek”, “PdDistrict”, “Resolution”, “X” and “Y”. Here we target four crime categories: ‘WARRANTS’, ‘DRUNKENNESS’, ‘LARCENY/THEFT’ and ‘KIDNAPPING’.

WARRANTS

DRUNKENNESS

LARCENY/THEFT

KIDNAPPING

Observation
The crime category is strongly related to ‘Resolution’: for example, ‘WARRANTS’ is mainly resolved by ‘Arrest, Booked’, ‘None’ and ‘Arrest Cited’ in descending order, while ‘KIDNAPPING’ is resolved by ‘Arrest, Booked’, ‘None’ and ‘District Attorney Refuses to Prosecute’. For other fields like ‘PdDistrict’ and ‘DayOfWeek’, the visualization does not indicate a strong relationship with the category.

Machine Learning

Assumption

Our first trial, at the current stage of the project, is binary classification instead of multi-class classification. Our assumption is quite simple: we use the (DayOfWeek, PdDistrict) pair as features to predict the crime category. We choose the category ‘ROBBERY’, and the classification labels are a 0/1 array indicating whether a crime belongs to this category.
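The feature and label construction described above could be sketched as follows. This is a minimal stdlib illustration, assuming a one-hot encoding of the categorical pair (the post does not state the exact encoding); the sample records are stand-ins for real rows.

```python
# One-hot encode the (DayOfWeek, PdDistrict) pair into a length-17 vector
# (7 days + 10 districts), and build 0/1 labels for ROBBERY.

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday"]
DISTRICTS = ["BAYVIEW", "CENTRAL", "INGLESIDE", "MISSION", "NORTHERN",
             "PARK", "RICHMOND", "SOUTHERN", "TARAVAL", "TENDERLOIN"]

def featurize(day, district):
    """One-hot vector for a (DayOfWeek, PdDistrict) pair."""
    vec = [0] * (len(DAYS) + len(DISTRICTS))
    vec[DAYS.index(day)] = 1
    vec[len(DAYS) + DISTRICTS.index(district)] = 1
    return vec

records = [("Friday", "SOUTHERN", "ROBBERY"),
           ("Sunday", "PARK", "LARCENY/THEFT")]
X = [featurize(day, district) for day, district, _ in records]
y = [1 if category == "ROBBERY" else 0 for _, _, category in records]
print(y)  # [1, 0]
```

X and y in this shape can be fed directly into any of the three classifiers we tried.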

Training Algorithms

We are using supervised classification, and we’ve tried three algorithms: Naive Bayes, Logistic Regression and SVM so far.

We use the first half of the training data for training and the second half for testing, for the simple reason that test.csv does not have a category label, so we cannot use it to compute accuracy.
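The half-and-half split could look like the sketch below. Note that shuffling before splitting is our own added precaution (the Kaggle file is ordered by date, so a raw midpoint split would put different years in each half); treat it as an assumption rather than what the post literally did.

```python
# Split the 878050 training rows into two equal halves: one for training,
# one for held-out testing. Shuffle first (an assumption here) so both
# halves cover the full date range.
import random

def half_split(rows, seed=0):
    """Shuffle rows reproducibly, then return (train_half, test_half)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]

train, test = half_split(range(878050))
print(len(train), len(test))  # 439025 439025
```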

All three algorithms output the same training and validation accuracy:
Training accuracy: 0.974156
Validation accuracy: 0.973455

Analysis

Why does this happen? Let’s look into the confusion matrix:

[[427371, 0]
[11654, 0]]

As we can see, the classifier simply labels everything as non-ROBBERY. The reason is that only 2.69% (23000/855049 = 0.0269) of all crimes are ROBBERY. Thus we can conclude that the (DayOfWeek, PdDistrict) pair is not a good predictor of crime category.
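The confusion matrix above can be sanity-checked by hand: with every test example predicted as non-ROBBERY, accuracy is just the share of true negatives, which matches the 0.973455 validation accuracy reported, while recall on ROBBERY is zero.

```python
# Recompute accuracy and ROBBERY recall directly from the confusion matrix
# reported above (rows = true class, columns = predicted class).

conf = [[427371, 0],   # true non-ROBBERY: all predicted non-ROBBERY
        [11654,  0]]   # true ROBBERY:     also predicted non-ROBBERY

total = sum(sum(row) for row in conf)
accuracy = (conf[0][0] + conf[1][1]) / total
recall_robbery = conf[1][1] / sum(conf[1])  # true positive rate for ROBBERY

print(total)               # 439025
print(round(accuracy, 6))  # 0.973455
print(recall_robbery)      # 0.0
```

The zero recall, not the headline accuracy, is what exposes the degenerate classifier.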

Discussion

What’s hardest part of the project that you’ve encountered so far?

The dataset is huge, with nearly 880 thousand labeled training records and a similarly sized test set, which adds challenges to both the visualization and the training process.

The project initially deals with multi-class classification rather than a binary classifier, which is much more difficult. We are considering transforming it into a binary classification, such as whether or not a specific incident is resolved (the target variable changes to “Resolution” rather than “Category”).

As for the cross-validation part, we initially intended to use 5-fold validation, which makes more sense, but there were some bugs in our implementation due to the limited time. We will fix it at a later stage.
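The 5-fold validation we intend to use could be sketched as below: partition the row indices into 5 folds, with each fold serving once as the validation set. A pure standard-library sketch; in practice we would shuffle the indices first for a date-ordered file.

```python
# Generate (train, validation) index pairs for k-fold cross-validation.

def k_fold_indices(n, k=5):
    """Yield (train_idx, val_idx) pairs covering n rows in k folds."""
    fold_size, remainder = divmod(n, k)
    start = 0
    for fold in range(k):
        # Early folds absorb the remainder so every row is used exactly once.
        end = start + fold_size + (1 if fold < remainder else 0)
        val_idx = list(range(start, end))
        train_idx = list(range(0, start)) + list(range(end, n))
        yield train_idx, val_idx
        start = end

folds = list(k_fold_indices(10, k=5))
print([val for _, val in folds])  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Averaging accuracy over the 5 folds would give a more stable estimate than our current single half-and-half split.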

What are your initial insights?

Our initial insights are presented with each visualization above. We’ve tried to explore the relationships between the different variables and the target label. It turns out that some of them work, while others seem useless, showing no obvious trends. But at least this sheds light on the direction we should focus on: attributes like “PdDistrict”, “Resolution” and the “hour” derived from the “Dates” variable appear to carry significant information for determining the category of each incident.

Are there any concrete results you can show at this point? If not, why not?

It seems that at this stage we cannot show very instructive results. However, we have at least tested our first assumption and concluded that the (DayOfWeek, PdDistrict) pair is not a good predictor of crime category. In a later stage, adding other dimensions of the data may yield better performance and accuracy.

Going forward, are the current biggest problems you’re facing?

We are on track and well-planned for the rest of the project, so at the moment we are not facing any major problems going forward.

Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?

Yes, we are definitely on track with our project. But we also need to dedicate more time to the machine learning parts, integrating the insights we’ve gained into feature selection and extraction so that we can further verify the correctness of our ideas.

Given your initial exploration of the data, is it worth proceeding with your project?

Yes, it’s worth proceeding with the project, since we are discovering a number of interesting patterns through visualization. And as long as we keep improving our training algorithm, we think we can make more progress, refining the raw massive data into little gems of genuinely useful insight.