Clustering 101, or: On Fridays, People Bike Differently!

Sep 3, 2017
• Dirk

We have talked about the BABS open data set many times before. It
lists bike trips in the
San Francisco Bay area, with start and end point, date, time, and some
extra information about the rider. What we want to look at in this
episode is some basic clustering, and some surprising results from
this well-known data set. The plan is to find classes of typical days
in terms of bike usage. One would e.g. expect different usage patterns
between weekdays and weekends, and we will actually discover some fun
things beyond these basics as we go along. Let’s dive right in.

Some Data Processing

We start off by dissecting our data set into dates, hours, and trip
counts. We can do this with one line of Python code, using
the pandas library. Send me a
Tweet if you want some future post to give an introduction to
pandas. For now, all we need is the groupby function and some
housekeeping.

The first step (#1) groups all rows with the same date and hour
together; the second (#2) counts how many rows (i.e. bike trips) we
have on a given day, for a given hour. The last two steps (#3) and
(#4) make a data frame with sensible column names. The result should
look like this.

         date  hour  count
0  2014-09-01     0      3
1  2014-09-01     3      1
2  2014-09-01     4      2
3  2014-09-01     5      1
4  2014-09-01     6      1
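
The one-liner itself is not reproduced in this copy of the post, so here is a minimal sketch of what the four steps could look like. The column name `start_date` and the tiny hand-made `trips` frame are assumptions for illustration; the real BABS export may use different names.

```python
import pandas as pd

# Hypothetical stand-in for the BABS trip table: one row per trip.
trips = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2014-09-01 00:05", "2014-09-01 00:17", "2014-09-01 00:40",
        "2014-09-01 03:12", "2014-09-01 04:01", "2014-09-01 04:30",
    ])
})

# Split the timestamp into the two grouping keys.
date = trips["start_date"].dt.date.rename("date")
hour = trips["start_date"].dt.hour.rename("hour")

hourly = (
    trips.groupby([date, hour])  # 1: group rows with the same date and hour
    .size()                      # 2: count the rows (trips) in each group
    .rename("count")             # 3: give the counts a sensible name
    .reset_index()               # 4: turn the result into a flat data frame
)
print(hourly)
```

Hours without any trips simply do not show up as rows, which is why the hour column can skip values.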

Now we want to pivot the table we created to get the hourly counts,
creating one row with hourly counts for each day. Pandas data frames
have a built-in pivot function that does just that.
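
As a rough sketch of that pivot step (the numbers below are made up, and filling missing hours with 0 is my assumption about how to treat hours without trips):

```python
import pandas as pd

# Hypothetical long-format counts with one row per (date, hour) pair.
hourly = pd.DataFrame({
    "date": ["2014-09-01", "2014-09-01", "2014-09-02", "2014-09-02"],
    "hour": [8, 17, 8, 9],
    "count": [40, 35, 55, 20],
})

# One row per day, one column per hour of the day.
daily = (
    hourly.pivot(index="date", columns="hour", values="count")
    .reindex(columns=range(24))  # make sure all 24 hours are present
    .fillna(0)                   # hours without trips count as 0
)
print(daily.shape)
```

Each row of `daily` is now a 24-dimensional point describing one day, which is the shape a clustering algorithm expects.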

To illustrate the contents of the new data frame, let’s plot the
second row, containing the trips taken on the second day (a Monday)
in our data frame. The result looks like this.

We see a peak of activity in the morning hours, which likely consists
of people commuting to work, and a corresponding afternoon peak of
people coming back from work.

Clustering

Let’s think about the clustering task a little. Clustering is a
sub-discipline of unsupervised learning where one has data points
available, but no target variable. We want to find days that look
alike, not e.g. predict the number of trips taken. Famous clustering
algorithms include the k-Means algorithm and the family
of hierarchical clustering algorithms.

So what can we expect from the output of a clustering algorithm applied
to our data set? Well, we’ll get a label for each day, such that days
looking alike should get the same label and thus be grouped together,
and from what we’ve seen above, at least two clusters should be
present, weekdays (where we have peak activity caused by commuters)
and weekends.
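
To make "a label for each day" concrete, here is a small sketch using scikit-learn's `KMeans`. The data is synthetic, not BABS: the two day profiles, the noise level and the number of days are all invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
hours = np.arange(24)

# Synthetic profiles (an assumption, not BABS data): commuter days peak
# at 8h and 17h, leisure days have one broad midday bump.
commute = 100 * (np.exp(-(hours - 8) ** 2 / 2) + np.exp(-(hours - 17) ** 2 / 2))
leisure = 30 * np.exp(-(hours - 13) ** 2 / 18)

# 20 noisy commuter days followed by 8 noisy leisure days.
X = np.vstack([
    commute + rng.normal(0, 5, size=(20, 24)),
    leisure + rng.normal(0, 5, size=(8, 24)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # one integer label per day
print(labels)
```

The label values themselves are arbitrary; what matters is which days share a label.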

The Elbow Method

Algorithms like k-Means expect the number of clusters k to be an
input, determined before running the algorithm. What do we do if we
don’t know the number of clusters beforehand? Well, we can use a trick
known as the elbow method. Here, one plots the number of clusters
against the inertia, a measure of how well the points cluster
together (it’s defined as the sum of the squared distances of the
points to their respective cluster centers). The resulting curve should
be monotonically decreasing, since of course the more clusters we have,
the lower the sum of squared distances to the closest cluster center
becomes. However, once we reach the ‘correct’ number of clusters, the
gain from each additional cluster shrinks, which creates an elbow shape
in the plot. Below, you’ll find the elbow plot I made with artificially
created data to illustrate this. The data consists of three sources of
Gaussian 2D noise. Successive runs of the k-Means algorithm with
different values for k give different clusters and different
inertias; the resulting elbow plot would suggest 2 or 3 clusters
(sometimes it’s not 100% obvious what the right answer is).
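
If you want to reproduce an experiment like this, a minimal sketch could look as follows. The blob centers, the spread and the range of k are my own choices, not the values behind the plot in this post.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Three sources of Gaussian 2D noise; centers and spread are made up.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(100, 2)) for c in centers])

# Fit k-means for k = 1..6 and record the inertia of each fit.
ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `ks` against `inertias` gives the elbow curve. With blobs this well separated, the inertia drops sharply up to k = 3 and flattens afterwards.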

Now let’s look at the elbow plot from our cycling data.

From the plot it looks like 4 is a good number of clusters. Let’s have
a look at the cluster centers, i.e. the typical days for each group.

It looks like clusters 1-3 are weekdays of some sort and class 0
contains weekends. Time for some quality control.

Quality Control

Inspecting the cluster centers, we see immediately that classes 1 and
3 look very similar. This is typically a reason for suspicion. But
let’s press ahead for now and re-visit this later. One thing you
absolutely need to do after applying any clustering algorithm is
to look at the cluster sizes. If you see big disparities, you could
have issues like outliers that you need to deal with. Our cluster
sizes look like this.
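
Counting the cluster sizes is a one-liner in pandas. The labels below are invented for illustration; in practice they would come from the fitted clustering model.

```python
import pandas as pd

# Hypothetical labels, one per day, as returned by a clustering run.
labels = [1, 1, 3, 0, 0, 1, 1, 1, 2, 3, 0, 0, 1]

# Number of days per cluster, ordered by cluster label.
sizes = pd.Series(labels, name="cluster").value_counts().sort_index()
print(sizes)
```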

Now, class 2 looks like it has very few members. A quick look at the
cluster center plot above tells us that this class contains
lower-traffic days, so that’s nothing too strange. Now what about
those similar classes 1 and 3?

Let’s look at which days of the week the days labeled 0-3 fall on.
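
One way to get such a breakdown is a cross-tabulation of cluster label against day of the week. This is only a sketch with made-up dates and labels, not the code behind the plot.

```python
import pandas as pd

# Hypothetical dates and labels; in practice the labels come from the
# clustering run and the dates from the index of the pivoted frame.
dates = pd.to_datetime([
    "2014-09-01", "2014-09-02", "2014-09-05", "2014-09-06",
    "2014-09-07", "2014-09-08", "2014-09-12", "2014-09-13",
])
labels = [1, 1, 3, 0, 0, 1, 3, 0]

# Rows are clusters, columns are weekday names, cells are day counts.
table = pd.crosstab(
    pd.Series(labels, name="cluster"),
    pd.Series(dates.day_name(), name="weekday"),
)
print(table)
```

Swapping `dates.day_name()` for `dates.month_name()` gives the same kind of table by month.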

Looks like our clustering cleanly puts almost all Fridays into class 3
and Monday-Thursday into class 1. This is quite exciting: we can tell
if it’s a Friday or another weekday by just looking at the way people
cycle! Weird, but these slightly wacky results are what I love about
my job. Now what about the small cluster, labeled with 2? Let’s repeat
the plot above, now grouping by month.

Group 2 contains a high number of December weekdays. Not as cleanly
cut as our Friday/other weekday dissection, but also quite
neat. Especially considering that we capture 10 out of the 15 or so
proper December weekdays in this group.

That was today’s data adventure; I hope you’ve enjoyed it. Next time,
we’ll have, as promised, a look at the traffic chaos we created
last time.