
“Approximate Street Address” Doesn’t Do It Justice

Whew! It’s been a while since the last update. Don’t worry: I’ve been hard at work learning TensorFlow (and I’ve even contributed to its documentation a touch), and I’ll have a fairly large post later this week. In the meantime, I thought I’d share something I’ve discovered about one of the Santa Monica Parking API’s fields that I had previously shrugged off as unhelpful.

I was looking through some of my parking data and decided to print some information for all parking meters, ordered by meter_id, when I noticed something interesting:

Multiple meters were given the same street_address field. On further inspection, I also noticed that address numbers in the list either ended in 0 or 1. I couldn’t think of a better thing to do than plot some of them on a map and see what I came up with.

First, I picked two groups of meters that had similar addresses. In this example, “00 Pico Blvd” and “01 Pico Blvd”.

Here are the “01 Pico Blvd” coordinates mapped:

And then with “00 Pico Blvd” added in:

The street_address is a label for their block! I tested this out with several groups of meters, and found the block-by-block grouping consistent. That makes me comfortable saying this:

street_address Groups Parking Meters Together by Block

It’s going to save a huge amount of effort when people inevitably want to group these meters together block-by-block.

Before realizing this, I was thinking of various ways to use a combination of meter_id and GPS coordinates to group these together without doing it manually, but this provides a very natural way to group them! Hooray for the data being even better than first thought!
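Here is a minimal sketch of what that grouping could look like in Python. The field names (meter_id, street_address, latitude, longitude) come from the API, but the records themselves, including the meter ids and coordinates, are made up for illustration, and group_by_block is just a name I picked:

```python
from collections import defaultdict

# Hypothetical metered_space records: real field names, invented values.
meters = [
    {"meter_id": "A1", "street_address": "00 Pico Blvd", "latitude": 34.0101, "longitude": -118.4901},
    {"meter_id": "A2", "street_address": "00 Pico Blvd", "latitude": 34.0102, "longitude": -118.4902},
    {"meter_id": "B1", "street_address": "01 Pico Blvd", "latitude": 34.0103, "longitude": -118.4899},
]


def group_by_block(meters):
    """Map each street_address to the list of meter_ids that share it."""
    blocks = defaultdict(list)
    for meter in meters:
        blocks[meter["street_address"]].append(meter["meter_id"])
    return dict(blocks)


blocks = group_by_block(meters)
for address, meter_ids in sorted(blocks.items()):
    print(address, meter_ids)
```

No coordinates are needed for the grouping itself; they only matter once you want to draw the blocks on a map.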

If you read my last post, you may remember that there were a couple of issues that need to be overcome before putting the data into any sort of machine learning algorithm. Namely, the data could potentially be noisy (i.e. lots of events from one meter within a few seconds); the data is unbalanced; and the raw data, which is in the form of events, is not the ideal format. What we’d prefer to know is which meters were occupied at any given time. So, how are we going to do this with what we have?

Enter Meter Sessions

The key, as I mentioned in the previous post, is to take advantage of sessions. Each sensor_event from the Santa Monica API contains both an event_type and session_id field, which can be used to construct sessions: a period of time that a given parking meter was occupied. By constructing all of the sessions in our data, we can go back and query our data set to see whether or not each parking meter was occupied at a given time. To show how this works, I’ve prepared some graphs with dummy data below to illustrate the concept on a smaller scale.

Example Scenario

In this example, assume we have two parking meters in our town that send out data in the form of events (in a similar manner to Santa Monica’s meters). Whenever somebody drives over or leaves Meter 1, we receive that information and it is stored in our database. Same goes for Meter 2. Let’s say that we decided to take a look at a 12-hour snapshot of this event data. That data might look something like this (NOTE: the data in this example is simplified for the sake of illustration):

As you can see, we have an array of JavaScript objects, each of which represents an event. Inside we find a unique event_id, which allows us to find this particular event amongst a sea of others; an event_time, which tells us exactly when the event occurred; the event_type, which identifies if the event represents somebody entering the space ("SS") or leaving the space ("SE"); meter_id, which lets us know which of our two meters sent this event; and session_id, which connects two events together.

Even though this sample data set is small, it’s already hard to get a good handle on what exactly is going on. Let’s start doing some simple visualizations and try to get a better picture, step by step.

Step 0: Organize Events By Meter

The first thing we do is separate the events by parking meter. Here, I’ve graphed the parking meter ID on the y-axis so that each meter has room to organize its own events along the x-axis (time).

Using this, it’s much easier to see that Meter 1 (in blue) has five events, while Meter 2 (in red) has three events in this 12-hour snapshot. This by itself isn’t particularly useful. At best, we get an idea of how busy the meters are in relation to one another, but we don’t have any idea when either meter is occupied or available. Let’s apply the event_type property to our chart and see what things look like.
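In code, this step is just a group-by on meter_id. The sketch below assumes events are plain dictionaries with the fields described earlier; the events themselves are invented and are not the data behind the graphs, and events_by_meter is a hypothetical helper name:

```python
from collections import defaultdict

# Made-up events in the simplified shape described above.
# Times are zero-padded "HH:MM" strings, so they sort correctly as text.
events = [
    {"event_id": 1, "event_time": "00:30", "event_type": "SE", "meter_id": 2, "session_id": 12},
    {"event_id": 2, "event_time": "03:00", "event_type": "SS", "meter_id": 1, "session_id": 9},
    {"event_id": 3, "event_time": "05:30", "event_type": "SE", "meter_id": 1, "session_id": 9},
    {"event_id": 4, "event_time": "06:00", "event_type": "SS", "meter_id": 2, "session_id": 13},
]


def events_by_meter(events):
    """Group events by meter_id, keeping each meter's events in time order."""
    grouped = defaultdict(list)
    for event in sorted(events, key=lambda e: e["event_time"]):
        grouped[event["meter_id"]].append(event)
    return dict(grouped)


by_meter = events_by_meter(events)
for meter_id, meter_events in sorted(by_meter.items()):
    print(meter_id, [(e["event_time"], e["event_type"]) for e in meter_events])
```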

Step 1: Identify Event Type (Start or End Event)

Now that we’ve marked which events are start events ("SS") and which are end events ("SE"), we can say a little more about the data. For example, a car pulled into Meter 1 at 1:00, and left at 7:00. For Meter 2, a car left at 2:00, and another pulled in at 4:00. It’s starting to become clear how these events are connected, but let’s actually connect them by their session_id.

Step 2: Connect Events by Session

Things are starting to fall into place. We’ve connected start events and end events, creating the sessions shown above. A session, as stated before, is a period of time where a meter is occupied. The one thing we are inferring from our event data is the extent of the two sessions that go off the edge of the graph: Meter 1, session 8, and Meter 2, session 12. Because we don’t have matching events for those sessions, we have to make some assumptions:

If the first event we see from a parking meter is an end event, then there must have been a start event before our observation period, and thus the session was active from at least the start of our observation period.

If the last event we see from a parking meter is a start event, then the session must either still be going on, or the end event was after our observation period. In either case, that session must have been active at least until the end of our observation period.

In the full application of the dataset, we’re going to avoid making any such assumptions by truncating our data on either side – when you have many months of data, cutting off a few hours will not make a huge difference, and it’s best to keep the data as unbiased as we are able to.
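For the small example, the pairing and the two edge assumptions might be sketched like this. It is a simplification: any session missing its start or end gets clamped to the observation window, without checking that the unmatched event really is the meter’s first or last. Times are hours as floats, the events are invented, and build_sessions is a hypothetical name:

```python
def build_sessions(events, window_start, window_end):
    """Pair SS/SE events by session_id and clamp unmatched ends to the window."""
    sessions = {}
    for event in events:
        session = sessions.setdefault(
            event["session_id"],
            {"meter_id": event["meter_id"], "start": None, "end": None},
        )
        if event["event_type"] == "SS":
            session["start"] = event["event_time"]
        elif event["event_type"] == "SE":
            session["end"] = event["event_time"]

    for session in sessions.values():
        # Assumption 1: an end event with no start began before the window.
        if session["start"] is None:
            session["start"] = window_start
        # Assumption 2: a start event with no end lasted past the window.
        if session["end"] is None:
            session["end"] = window_end
    return sessions


# Made-up events over a 12-hour window (times in hours).
events = [
    {"event_time": 0.5, "event_type": "SE", "meter_id": 2, "session_id": 12},
    {"event_time": 3.0, "event_type": "SS", "meter_id": 1, "session_id": 9},
    {"event_time": 5.5, "event_type": "SE", "meter_id": 1, "session_id": 9},
    {"event_time": 10.0, "event_type": "SS", "meter_id": 1, "session_id": 10},
]

sessions = build_sessions(events, window_start=0.0, window_end=12.0)
for session_id, session in sorted(sessions.items()):
    print(session_id, session)
```

For the truncated version described above, you would instead drop the sessions that still have a missing start or end rather than clamping them.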

This graph gives us a nice visual indication of how long each car stayed at a meter: any time that is on a line is occupied. All we have to do to convert this session data into balanced data is to pick a time increment and sample both meters at the same time with that increment amount. For this example, let’s make the time increment 30 minutes (way too large for the real application, but easier to see here). Below is a graph with vertical grids marking half hour slices:

Great! Now all we have to do is do our sampling.

Step 3: Convert Session Data into Availability Data

Here is what our final data looks like graphed. Green circles represent a meter being available at a given time, and red x’s represent a meter being occupied. These are the data points that we’ll be able to feed into something like a logistic regression or a neural network!
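The sampling itself could be sketched roughly as follows. Times are whole minutes from the start of the window so the grid stays exact, a meter counts as available again at the very instant its session ends, and both the sessions and the name sample_availability are made up for illustration:

```python
def sample_availability(sessions, meter_ids, start, end, step):
    """Sample every meter at the same instants; True means available."""
    rows = []
    t = start
    while t <= end:
        row = {}
        for meter_id in meter_ids:
            occupied = any(
                s["meter_id"] == meter_id and s["start"] <= t < s["end"]
                for s in sessions
            )
            row[meter_id] = not occupied
        rows.append((t, row))
        t += step
    return rows


# Made-up sessions, in minutes since the start of the snapshot.
sessions = [
    {"meter_id": 1, "start": 60, "end": 150},
    {"meter_id": 2, "start": 0, "end": 30},
]

rows = sample_availability(sessions, meter_ids=[1, 2], start=0, end=180, step=30)
for minute, availability in rows:
    print(minute, availability)
```

Every row has one value per meter, which is exactly what makes the result balanced.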

<side note>

You may notice that a meter is being marked “available” at the same time end events occur. That is a particularity that is being used for this example due to the huge time increments we’re using (in the actual data set, each event_time is given to the second). In order to account for this, I reasoned it was fair to say that once an end event occurred, the parking space was immediately available. Although this is less likely to occur frequently in the real data, I will continue to use a non-inclusive right bound on the sessions. Put in other terms, I will say that a parking space is occupied if a session is active at that time, and an active session will be defined by:

the time
a) greater than or equal to a start event (as a UNIX timestamp)
b) strictly less than the end event (as a UNIX timestamp)
where the two events share the same session_id

</side note>
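As a small sketch, that definition boils down to a half-open interval check on UNIX timestamps. The helper name session_active and the example times are my own:

```python
from datetime import datetime, timezone


def session_active(t, start, end):
    """True when start <= t < end, with all three given as UNIX timestamps."""
    return start <= t < end


# Hypothetical session from 14:00:00 to 14:30:00 UTC.
start = int(datetime(2015, 8, 1, 14, 0, 0, tzinfo=timezone.utc).timestamp())
end = int(datetime(2015, 8, 1, 14, 30, 0, tzinfo=timezone.utc).timestamp())

print(session_active(start, start, end))    # the start second counts as occupied
print(session_active(end - 1, start, end))  # one second before the end: still occupied
print(session_active(end, start, end))      # the end second counts as available
```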

To see how the resulting data compares to what we started with, here’s one possible approach to storing it:

This is fantastic! Let’s go over how this approach solves the various problems with the data we had at the start:

Unbalanced Data: By definition, this data is now balanced. For each time slice (0:00, 0:30, 1:00, etc.), we have an availability value for both meters. Additionally, the sampling rate, or how we sliced up our session data, is constant.

Not Best Data: We now have data that instantly says whether or not a meter was available at a particular time. Since we will be training our machine learning algorithms on that metric, it’s vital we nail that piece down.

Noisy data: This one was harder to showcase with this dummy data, but this approach alleviates much of the noise. This is due to the fact that our final dataset doesn’t care how many events happened in a short period of time. If a session lasts for three seconds, it is unlikely to show up or affect our dataset. To clean it even further, we can try out some measures that take length of session into account and weed out the pesky sessions that last an unusually brief period of time. Luckily, by creating the session data in Step 2, we’ll be able to easily go through and see which sessions are abnormally short (or long, for that matter).

To top it off, we did more than just fix issues we had with the original data. We’ve also made some additional improvements without even intentionally trying to do so:
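One possible version of that cleanup, as a sketch only: split sessions on a minimum duration. The 60-second cutoff is an arbitrary placeholder I chose for the example, not a tuned value, and the sessions are invented:

```python
def split_by_duration(sessions, min_seconds):
    """Separate sessions into (kept, dropped) using a minimum duration in seconds."""
    kept, dropped = [], []
    for session in sessions:
        duration = session["end"] - session["start"]
        if duration >= min_seconds:
            kept.append(session)
        else:
            dropped.append(session)
    return kept, dropped


# Made-up sessions with UNIX-style timestamps in seconds.
sessions = [
    {"session_id": 1, "start": 1000, "end": 1003},  # three seconds: probably noise
    {"session_id": 2, "start": 2000, "end": 3800},  # thirty minutes
]

kept, dropped = split_by_duration(sessions, min_seconds=60)
print([s["session_id"] for s in kept], [s["session_id"] for s in dropped])
```

Keeping the dropped list around makes it easy to eyeball what the cutoff is throwing away before trusting it.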

MORE DATA: By slicing up our sessions, we multiplied the number of data points we had by SIX. And that was slicing with a time interval of half an hour. Think of how much data we can extract by reducing it to five minutes, or even one minute! The nice thing is that this isn’t false or fabricated data; we’re just extracting more from the data that was already there!

COMPACT DATA: Our data is going to be super compact. The above data can easily be stored as a comma separated values (CSV) file, and even when we have gigabytes of data, it will compress down immensely due to the majority of the file being commas, spaces, and the words “true” and “false”.
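To get a feel for this, here is a rough sketch that writes availability rows as CSV text and gzips it. The rows are a synthetic repeating pattern, so the compression it shows will be more flattering than real data; the column names and the helper availability_csv are my own invention:

```python
import csv
import gzip
import io


def availability_csv(rows, meter_ids):
    """Render (time, {meter_id: available}) rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["time"] + [f"meter_{m}" for m in meter_ids])
    for time_label, availability in rows:
        writer.writerow(
            [time_label] + [str(availability[m]).lower() for m in meter_ids]
        )
    return buffer.getvalue()


# Synthetic rows: 1000 half-hour slices with a simple repeating pattern.
rows = [(i * 30, {1: i % 4 != 0, 2: i % 3 == 0}) for i in range(1000)]

text = availability_csv(rows, meter_ids=[1, 2])
raw = text.encode("utf-8")
compressed = gzip.compress(raw)
print(len(raw), "bytes raw,", len(compressed), "bytes gzipped")
```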

I love happy accidents.

Alright, that’s the end of this long post. I think it’ll be a week or so before the next Santa Monica Spaces update (I’ll have some other features to keep the content flowing), but next time I’ll be showing off my code progress and maybe finally get some code out on my GitHub! Speaking of which, you can see the iPython Notebook used to produce the graphs in this post here. I’m still learning the matplotlib library, so if you have any suggestions to improve my visuals, leave me a comment below! Peace out, data nerds.

Next up: Enough Talk- Where is Santa Monica Spaces Now?

Taking a Look at Santa Monica’s Parking API

Santa Monica Spaces is a project I started back in July 2015, with the hope of creating useful data and models as well as improving my programming, machine learning, and data analysis skills. This post will focus on the data used in the project, some of the opportunities it creates and the challenges it presents.

Context

Back in June, the City of Santa Monica released a RESTful API that spits out real-time data for both parking lots and meters throughout the city. When it was released, I went to meetings run by the city council to introduce the APIs and go over what the data looked like. Below is a breakdown of the /meters/ route, the main data source of this project. Note that I am not going to talk about each and every sub-route and feature, but rather those that are of interest to this project:

GET /meters/ – Parking Meter Information

The base route, /meters/, sends out mostly static information about each parking meter in Santa Monica, represented as a metered_space object. Inside of each metered_space are the parking meter’s area, latitude, longitude, meter_id, street_address, and active properties. Let’s examine the fields that may need further explanation:

active: Indicates whether or not a meter is in service/functioning. This field is rarely updated, and so checking once or twice daily for any changes is sufficient to stay up-to-date with meter statuses

area: a short description of the parking meter’s location. Can be used as a loose way of grouping meters together

GET /meters/events – Real time Meter Events Data

Going deeper, the /meters/events/ route returns a list of sensor_event objects, which represent parking meter events: a car either entering or leaving a parking space. These events are what I’ll be using to construct Santa Monica Spaces’ predictive models, so let’s go over all of the sensor_event properties in detail:

event_id: The unique numeric identifier for each event

meter_id: The id of the meter that sent this event. The ids found here connect to meter_id properties returned from the base /meters/ route

event_type: Denotes the type of this event. Can be one of "SS" or "SE". "SS" stands for “session start” (i.e. a car just entered this space), and "SE" stands for “session end” (i.e. a car just left this space)

session_id: The unique session number that contains this event. A session is defined as one start event ("SS") and one end event ("SE"). Therefore, exactly two events should share a session_id

event_time: The time this event occurred, with precision to the second. The string format is an ISO 8601 formatted UTC date/time, but with all non-alphanumeric characters removed (ISO 8601 calls this compact form the “basic” format). For example, “2007-04-05T14:30Z” becomes “20070405T1430Z”. Many date libraries don’t parse this form by default, so you may need to supply an explicit format string or write a small parsing function yourself

ordinal: the unique number identifying the order in which the server received this event. An event with a lower ordinal was received earlier than those with a higher ordinal. Additionally, the server emits event data sorted by ordinal. Can be used as an argument in the /meters/events/since/ route to limit the events returned to only those after this event was received by the server

Without any additional arguments, /meters/events/ returns all parking events emitted in Santa Monica from the past 5 minutes. You can use the sub-route /meters/events/since/ in order to modify how many events are sent from the server. By using /meters/events/since/:datetime, you can pass in a UTC date string (formatted as described above), which will return all of the events that have occurred since that time. Additionally, you can call /meters/events/since/:ordinal to return all events that have occurred after the event with the specified ordinal number (inclusive of that event).

For both of these sub-routes, the API will not serve any events that occurred more than three hours prior to the request; i.e., you can only get three hours of historic event data without storing it yourself.
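A polling loop built on the ordinal route might keep its bookkeeping along these lines. This is a sketch with no network code: the paths mirror the routes above, the host is deliberately left out, and next_request_path and absorb_batch are hypothetical helpers. Because the since-ordinal route includes the event you pass in, the already-seen event is filtered out:

```python
BASE_PATH = "/meters/events"  # route from the post; prepend the API host yourself


def next_request_path(last_ordinal):
    """Pick the route to call based on the newest ordinal seen so far."""
    if last_ordinal is None:
        return BASE_PATH + "/"
    return f"{BASE_PATH}/since/{last_ordinal}"


def absorb_batch(events, last_ordinal):
    """Return (new_events, updated_last_ordinal), skipping already-seen events."""
    fresh = [
        e for e in events
        if last_ordinal is None or e["ordinal"] > last_ordinal
    ]
    if fresh:
        last_ordinal = max(e["ordinal"] for e in fresh)
    return fresh, last_ordinal


# Made-up batch, as if returned by /meters/events/since/105.
batch = [{"ordinal": 105}, {"ordinal": 106}, {"ordinal": 107}]
fresh, last_ordinal = absorb_batch(batch, last_ordinal=105)
print([e["ordinal"] for e in fresh], next_request_path(last_ordinal))
```

Polling more often than every three hours (with plenty of margin) is what keeps the stored history gap-free.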

Interesting Features

Looking at the data available from the API, a couple of interesting things stand out that will be of use when designing code:

Minimizing data transfer: You can reduce the amount of repeat data you receive by keeping track of the latest ordinal you’ve seen and using /meters/events/since/:ordinal

Implicit data structure – Sessions: By tying together the start event ("SS") and end event ("SE") that share a session_id, you can create a representation of a “session”. A “session” represents a period of time during which a parking meter is occupied

Geo Data: We’ll have to make some sort of heatmap or other visualization on a map of Santa Monica- this data is begging for it.

Additionally, there appear to be several challenges this data presents, and they will need to be overcome in order to make the best use of it:

Noisy data: There is potentially going to be a lot of noise in the data in the form of events. Because an event is emitted any time a car drives over or leaves a parking space, it’s possible for a single car to trigger multiple events while attempting to park, perhaps even at the same parking meter.

Unbalanced Data: Ideally, we would receive the same amount of data from each parking meter at any given time, but that is not the case. The events we receive are sporadic, and some parking meters have a lot more events than others. This imbalance will cause issues if we try to do time series predictions

Not Best Data: The event data we have is useful, but what we really want is information about a parking meter’s availability at any given time. That is, “was this parking meter occupied or open at this time?”

How are we going to fix these problems? It turns out that the solution lies in the meter sessions. In the next post, I’ll walk us through a visualization of unbalanced data, and how we can use sessions to solve it (and how that will alleviate other problems as well).

Up next: Balancing the data and visualizing sessions

Predictive Modeling and Better Historical Data for the Beach City

Santa Monica Spaces is a project I started back in July 2015, with the hope of creating useful data and models, as well as improving my programming, machine learning, and data analysis skills. Over the next series of posts, I’ll introduce you to the project, talk about the goals and challenges involved, and catch you up to where I am now.

What is this thing?

This project is all about parking meter availability. Santa Monica Spaces aims to provide useful analysis and services by transforming data from the City of Santa Monica Parking API. Once complete, those accessing Santa Monica Spaces should be able to create historical visualizations in a real-time interface, export a number of useful datasets in a variety of formats, and make use of a predictive modeling feature to estimate the percentage availability of parking meters in a given region.

Why are you doing this thing?

Currently, the data available from the API is not in a format particularly suited for analysis, visualizations, or modeling. Additionally, the data is imbalanced, which can cause a number of headaches when trying to do time-series predictive modeling. Finally, the availability of historical parking meter data is rather low, and by storing records myself, I hope to have a large enough dataset to perform robust analysis and modeling.

What technologies are you using?

So far, I’ve used the following languages and software:

Java

JavaScript (Node/Express)

Python/iPython Notebook, pandas, scikit-learn

awk

Octave

Is there an open source repository?

Soon! Parts of the project will definitely be put on my GitHub, but I need to double check and separate out the non-safe parts of my code (probably not the best idea to give out secret keys). Portions of my datasets will be available to test out code and get a better sense of everything.