
Harvey Alférez, Ph.D.

Data Scientist, School of Engineering and Technology, Montemorelos University, Mexico
www.harveyalferez.com

Traffic in NYC as captured by my wife Doris’ camera lens

There is a wealth of open data on the Web. Seventh-day Adventists can freely use this data to look for ways to help the inhabitants of the cities. This post describes how the students in my Pattern Recognition course at Montemorelos University and I have used open data and machine learning, a key component of data science, to discover interesting mission-oriented patterns for the church in NYC.

In my courses, I mostly focus on analyzing open data from NYC for two reasons: 1) NYC has pivotal significance in our church’s ongoing Mission to the Cities project; and 2) NYC provides a portal that makes the wealth of public data generated by various NYC agencies and other city organizations available for public use [1].

Although the number of traffic deaths in NYC has fallen [2], city officials and traffic-safety groups agree that more aggressive steps must be taken to reach Mayor Bill de Blasio’s goal of eliminating traffic deaths in the city [3]. With this problem in mind, we analyzed a dataset of motor vehicle collisions in NYC, which is freely provided by the Police Department [4].

The dataset we studied was created in 2014 and updated in 2016. It registers motor vehicle collisions in the Bronx, Brooklyn, Manhattan, Queens, and Staten Island from 2014 to 2016. This is a large dataset with 932,904 registered incidents! Moreover, each registered incident has 30 variables.

With traditional queries and spreadsheet analysis, it is quite difficult (and sometimes impossible) to uncover hidden patterns in large quantities of data in a timely way, as in our case study. In cases like this, machine learning, which “gives computers the ability to learn without being explicitly programmed” [5], can help us grasp patterns we did not even know existed.

From the set of 30 variables, we chose a subset to carry out the experiments. First, we chose the Date and Time variables because we wanted to know the day and time of each traffic incident. The Zip Code, Borough, Longitude, and Latitude variables were chosen because we wanted to know where the accidents happened. We were also interested in figuring out which groups of road users were injured the most. Therefore, we included the Injured Persons, Injured Pedestrians, Injured Motorists, and Injured Cyclists variables in the experiments. Last but not least, we wanted to determine what provoked each accident and the type of vehicle that caused it. Therefore, we chose the Contributing Factor Vehicle 1 and Vehicle Type Code 1 variables from the dataset.
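The variable selection described above can be sketched in a few lines of Python. This is only an illustration, not the procedure we followed in Weka. The column headers follow the names used in this post and are assumptions (the headers in the downloaded file may be spelled differently), the `select_columns` helper is hypothetical, and the single sample row is invented so that the example runs on its own.

```python
import csv
import io

# Column names as written in this post. The exact headers in the
# downloaded CSV file are an assumption and may differ.
SELECTED_COLUMNS = [
    "Date", "Time", "Borough", "Zip Code", "Latitude", "Longitude",
    "Injured Persons", "Injured Pedestrians", "Injured Motorists",
    "Injured Cyclists", "Contributing Factor Vehicle 1",
    "Vehicle Type Code 1",
]


def select_columns(csv_file, columns=SELECTED_COLUMNS):
    """Yield one dict per incident, keeping only the chosen columns."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        # Columns missing from the file are kept as empty strings.
        yield {name: row.get(name, "") for name in columns}


# Invented sample data (one incident) with one extra column, "Unique Key",
# that is not part of the chosen subset.
SAMPLE_CSV = """Date,Time,Borough,Zip Code,Latitude,Longitude,Injured Persons,Injured Pedestrians,Injured Motorists,Injured Cyclists,Contributing Factor Vehicle 1,Vehicle Type Code 1,Unique Key
01/02/2015,08:30,BROOKLYN,11215,40.66,-73.97,1,0,0,1,Driver Inattention,BICYCLE,1
"""

rows = list(select_columns(io.StringIO(SAMPLE_CSV)))
print(rows[0]["Borough"], rows[0]["Injured Cyclists"])
```

For the real file, the `io.StringIO` object would be replaced by an opened file, for example `open(path, newline="")`, where `path` points to the downloaded CSV.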

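To analyze the data, we used Weka, which is a powerful tool for machine learning [6]. Although Weka contains a wide range of machine learning algorithms, for our exploration we used the K-Means clustering algorithm. Because the input data is unlabeled, we needed an unsupervised method that groups similar incidents together without being told in advance what the groups are.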
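To give an idea of how K-Means groups unlabeled data, here is a minimal sketch in plain Python. It is not Weka's implementation and not the exact setup of our experiments: it only handles numeric values, the `k_means` function and its parameters are hypothetical names, and the six latitude and longitude pairs are invented for illustration. The algorithm repeats two steps: assign each point to its nearest centroid, then move each centroid to the mean of the points assigned to it.

```python
import math
import random


def k_means(points, k, iterations=20, seed=0):
    """Cluster numeric points into k groups; return (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct points picked at random from the data.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(v) / len(members) for v in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Nothing moved, so the clusters are stable.
        centroids = new_centroids
    return centroids, assignments


# Invented (latitude, longitude) pairs forming two separate groups.
points = [
    (40.660, -73.970), (40.661, -73.969), (40.659, -73.971),
    (40.780, -73.960), (40.781, -73.961), (40.779, -73.959),
]
centroids, assignments = k_means(points, k=2)
print(assignments)
```

The first three points end up in one cluster and the last three in the other. In Weka, the same idea is applied to many more incidents and to more variables than just the coordinates.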

Our findings are as follows:

On Fridays, around Prospect Park, Brooklyn, there are around 77,000 pedicab accidents registered in the dataset, which covers 2014 to 2016. This finding can help Seventh-day Adventist congregations in the area (see the map below) raise awareness in the community in order to prevent this kind of accident. For example, Pathfinders could go to the park on a field day and ride around with banners as flags on their bicycles, or with images and messages on their shirts saying “Beware of your surroundings. Be swift on the brakes”. Also, information booths could display pamphlets offering cycling safety tips to avoid collisions and explaining what to do if people are involved in a traffic accident. A mini clinic in the vicinity to attend to minor injuries would also come in handy.

Alcohol has been one of the biggest contributors to accidents. Moreover, accidents in which bicycles were involved have caused the highest number of deaths and injuries. Church members could look for innovative ways to inform the general population about the dangers of drinking and driving, and about cycling safety.

On Thursdays, Fridays, and Saturdays, drivers tend to drive aggressively. This situation increases the number of accidents during those days. A solution the church could offer to this problem is to launch a social media campaign on stress management at the end of the week.

Let us use the knowledge that open data offers us to make a difference in the cities. As the results above show, big problems can have simple, implementable solutions through which church members could make an extraordinary difference in their communities. Although the large dataset in our case study could have been analyzed manually, that would have taken weeks or even months. In our case, the process took just a few days and relatively little human analysis (computers did the hard work).

I thank the students in my Pattern Recognition course, Anthony, Claudia, Carlos, Isaías, Jairo, Eduard, Marco, Jaziel and Carlos, for their intense work on the experiments.

Well done! This is a great example of how the Church can respond better to the needs of their surrounding societies.
An infographic showing the results of some of the data analyses carried out would give the reader more insight into the complexity and power of your method.
Keep up the good work!