Counting Crowds and Lines with AI

In Union Square, NYC, there’s an untoppable burger joint named Shake Shack that’s
always crowded. A group of us would obsessively check the Shake Cam
around lunch to figure out whether the trip was worth it.

A 14-person line, not bad

Rather than do this manually (come on, it’s nearly 2018), it would be great if this could be done
for us. Taking that idea further, imagine being able to measure foot traffic on a month-to-month basis,
or to measure the impact of a new promotional campaign.

Object detection has received a lot of attention in the deep learning space, but it’s
ill-suited for highly congested scenes like crowds. In this post, I’ll talk about
how I implemented a multi-scale convolutional neural network (CNN)
for crowd and line counting.

Why not object detection?

Region-based CNNs (R-CNNs) effectively slide a window over the image to find objects. High-density crowds are
ill-suited for sliding windows due to heavy occlusion:

Failed attempt with an off-the-shelf (no retraining) TensorFlow R-CNN

Further exploration in this approach led me to TensorBox,
but it too had issues with high congestion and large crowd counts.

Density Maps to the rescue

Rather than using a sliding window, density maps (a.k.a. heat maps) estimate the likelihood of a head
being at each location:

What’s happening here?

Following the paper Multi-scale Convolutional Neural Networks for Crowd Counting,
the ground truth is generated by taking the head annotations, setting each annotated pixel’s value to one, and
then Gaussian-blurring the image. The model is then trained to output these blurred images, or density maps.
Summing all the pixel values then yields the predicted crowd count. Read the paper for more insight.
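The ground-truth construction above can be sketched in a few lines. This is only an illustration, not the paper’s exact pipeline: the map size, head coordinates, and `sigma` below are made-up values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_density(head_points, shape, sigma=4.0):
    """Build a ground-truth density map from head annotations.

    head_points: list of (row, col) head locations
    shape: (H, W) of the image
    sigma: blur radius, a tunable assumption
    """
    density = np.zeros(shape, dtype=np.float64)
    for r, c in head_points:
        density[r, c] = 1.0  # one unit of "mass" per annotated head
    # Blurring spreads each head's mass over a neighborhood but preserves
    # the total sum, so summing the map still recovers the head count.
    return gaussian_filter(density, sigma=sigma)

heads = [(10, 12), (30, 40), (31, 42)]  # three annotated heads
dmap = ground_truth_density(heads, (64, 64))
print(round(dmap.sum(), 6))  # total mass equals the number of heads: 3.0
```

A model trained against these targets inherits the same property: summing its predicted density map gives the crowd count.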

Let’s look at density maps applied to the Shake Cam. Don’t worry about the color switch from blue to white in the density maps.

The sum of the pixel values is the size of the crowd

As you can see above, we have:

The annotated image, courtesy of Amazon Mechanical Turk.

The calculated ground truth, produced by setting head locations to one and then Gaussian-blurring.

How to annotate the data?

How to count the line?

Lines aren’t merely people in a certain space; they’re people standing next to each other
to form a contiguous group. As of now, I simply feed the density map into a
three-layer fully connected (FC) network that outputs a single number: the line count.
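As a sketch of that readout, here’s a minimal three-layer FC network in plain NumPy. The real model was presumably built and trained in a deep learning framework; the layer widths, initialization, and map size here are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class LineCountHead:
    """Three fully connected layers mapping a flattened density map
    to a single scalar: the predicted line count."""

    def __init__(self, in_dim, hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.01, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.01, (hidden, hidden))
        self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0, 0.01, (hidden, 1))
        self.b3 = np.zeros(1)

    def forward(self, density_map):
        x = density_map.reshape(-1)        # flatten H*W into a vector
        h1 = relu(x @ self.W1 + self.b1)
        h2 = relu(h1 @ self.W2 + self.b2)
        return (h2 @ self.W3 + self.b3)[0]  # single number: the line count

# Untrained weights, so the output is meaningless until fitted to labels.
model = LineCountHead(60 * 80)
prediction = model.forward(np.zeros((60, 80)))
```

The point of the design is that the FC head can learn which spatial arrangement of density mass constitutes a line, rather than just summing everything.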

Gathering data for this also ended up being an Amazon Mechanical Turk task.

Here are some examples of where lines aren’t immediately obvious:

Making a product out of data science

This is all good fun on your development box, but how do you host it? That
will be the topic of another blog post, but the short story is: