Segmenting Audience with KMeans and Voronoi Diagram using Spark and MLlib

Analyzing huge data sets to extract meaningful properties can be a difficult task. Many methods have been developed over the last 50 years to uncover such hidden information.

Clustering algorithms can be used to group similar news stories as in Google News, find areas with high crime concentration, find trends, and more generally segment data into groups. This segmentation can be used, for instance, by a publisher to reach a specific target audience.

The k-means clustering algorithm is an unsupervised algorithm, meaning that you don't need to provide training examples for it to work (unlike neural networks, SVMs, Naive Bayes classifiers, …). It partitions observations into clusters in which each observation belongs to the cluster with the nearest mean. The algorithm takes as input the observations, the number of clusters (denoted k) that we want to partition the observations into, and the number of iterations. It returns the centers of the clusters.

The algorithm works as follows:

1. Take k random observations out of the dataset and set the k cluster centers to those points.

2. For each observation, find the closest cluster center and assign the observation to that cluster.

3. For each cluster, compute the new center by taking the average of the features of the observations assigned to that cluster.

4. Go back to step 2 and repeat for a given number of iterations.

The cluster centers will converge, minimizing the cost function, which is the sum of the squared distances of each observation to its assigned cluster center.

This minimum might be a local optimum and will depend on the observations that were randomly picked at the beginning of the algorithm.
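The steps above can be sketched in a few lines of plain Python with NumPy (this is an illustrative local implementation, not the distributed Spark MLlib version the post uses; the function name and seed handling are my own):

```python
import numpy as np

def kmeans(points, k, iterations, seed=0):
    """Plain k-means: random init, assign to nearest center, recompute means."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # 1. take k random observations as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # 2. assign each observation to the nearest cluster center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each center as the mean of its assigned observations
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, labels
```

Running it several times with different seeds and keeping the result with the lowest cost is a common way to mitigate the local-optimum issue mentioned above.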
In this post, we are going to listen to a tweet stream to get tweets with their geolocation and then apply the k-means algorithm on their coordinates to find geographical clusters.

Fetching the tweets

Twitter provides an API to continuously listen to a stream of tweets. In order to use it, you need Twitter API keys and access tokens.
To get those, log in at https://apps.twitter.com/
Click on "Create New App". Fill in the Name, Description, Website, and Callback URL, and click on "Create your twitter application".
Now go to the "API Keys" tab and click on "Create my access token".
To run the program that listens to the tweet stream and writes the tweets to disk, you need to create the file twitter-credentials.txt with the following content:
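(The exact key names depend on how the program parses the file; a typical key=value layout, with hypothetical key names and placeholder values, would look like this:)

```
consumerKey=<your consumer key>
consumerSecret=<your consumer secret>
accessToken=<your access token>
accessTokenSecret=<your access token secret>
```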

Choosing the colors

To generate a color for each cluster, we could pick colors at random, but that often gives bad results: colors too close to each other, colors too bright or too dark, etc.

An easy way to generate colors is to use the HSL color model (Hue, Saturation, Lightness). The hue ranges from 0 to 360, going from red to yellow to green to blue and back to red, with all the intermediate values in between.

We choose fixed values for the saturation and lightness and set the hue based on the cluster number:
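A minimal sketch of this mapping in Python (the helper name and the particular saturation and lightness constants are illustrative, not the post's original code; Python's `colorsys` takes the hue in [0, 1], which corresponds to spreading the clusters evenly over the 0–360° wheel):

```python
import colorsys

def cluster_color(group, k, saturation=0.7, lightness=0.5):
    """Evenly spread k hues around the color wheel, fixing saturation/lightness."""
    hue = group / k  # colorsys expects hue in [0, 1] rather than degrees
    # note the HLS argument order: hue, lightness, saturation
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return '#%02x%02x%02x' % (round(r * 255), round(g * 255), round(b * 255))
```

Because the hues are evenly spaced, neighboring cluster colors stay as distinguishable as possible while keeping a uniform brightness.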

With this representation, we can see two issues: 1) it can be difficult to see the area associated with each cluster, and 2) it's hard to see the density of the points.
We are going to address those two issues in the next sections.

Voronoi Diagram

A Voronoi diagram is a way of dividing the space into regions. Each cluster center is associated with a region such that any point in that region is closer to this cluster center than to any other cluster center.

Unfortunately, those edges may not form complete polygons.
For instance, if we run generateVoronoi on 4 points, we get these edges:

(the dashed lines represent the borders of the image and are not edges returned by generateVoronoi)

The edges associated with each region don't form a polygon. To close the polygons, we do the following for each region:

extract the vertices from the edges

sort the vertices by the angle they form with the x-axis by using the atan2 function (clockwise)

add missing corner vertices

To find which corner vertices need to be added, we check the position of each vertex and the next one (clockwise):

if one vertex is on a border and the next vertex is on the next border (clockwise), add one corner vertex

if one vertex is on a border and the next vertex is on the opposite border, add two corner vertices

So in order to complete region A, we have to add a vertex at the top corner (ab). For region B, we have to add two corner vertices: one at the bottom-right corner (da1) and one at the top-right corner (da2). For region C, we have to add a vertex at the bottom-left corner (bc).

The code to close the polygons and draw them on the map is the following:
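As an illustration of the closing step described above (the original listing is not reproduced here), here is a hedged Python sketch. It assumes a unit-square bounding box in screen coordinates (y pointing down), sorts the vertices clockwise around the region's centroid with atan2, and inserts one corner per border side crossed between consecutive border vertices; all names are hypothetical:

```python
import math

# Sides of the unit square, numbered clockwise:
# 0 = top (y=0), 1 = right (x=1), 2 = bottom (y=1), 3 = left (x=0).
# CORNER_AFTER[s] is the corner reached when walking clockwise past side s.
CORNER_AFTER = {0: (1.0, 0.0), 1: (1.0, 1.0), 2: (0.0, 1.0), 3: (0.0, 0.0)}

def side(v, eps=1e-9):
    """Which border of the unit square the vertex lies on, or None if interior."""
    x, y = v
    if abs(y) < eps: return 0
    if abs(x - 1.0) < eps: return 1
    if abs(y - 1.0) < eps: return 2
    if abs(x) < eps: return 3
    return None

def close_region(vertices):
    """Sort a region's vertices by angle and add the missing corner vertices."""
    cx = sum(x for x, _ in vertices) / len(vertices)
    cy = sum(y for _, y in vertices) / len(vertices)
    # ascending atan2 is clockwise when y grows downward (screen coordinates)
    ordered = sorted(vertices, key=lambda v: math.atan2(v[1] - cy, v[0] - cx))
    closed = []
    for i, v in enumerate(ordered):
        closed.append(v)
        s1 = side(v)
        s2 = side(ordered[(i + 1) % len(ordered)])
        if s1 is not None and s2 is not None and s1 != s2:
            s = s1
            while s != s2:  # adjacent borders add one corner, opposite add two
                closed.append(CORNER_AFTER[s])
                s = (s + 1) % 4
    return closed
```

This sketch only handles regions whose border vertices are separated by interior vertices or lie on adjacent/opposite sides, which matches the cases described above; a full implementation would also clip edges to the bounding box first.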

Percentile circles

To show the concentration of tweets around the center, we are going to draw two circles around each cluster center:

50 percentile circle: this circle contains 50% of the tweets closest to the center

90 percentile circle: this circle contains 90% of the tweets closest to the center

To compute the percentiles, we sort the tweets in each cluster by their distance to the cluster center. The radius of the 50 percentile circle is the distance of the tweet in the middle of the sorted list; the radius of the 90 percentile circle is the distance of the tweet located at position (0.9 * number_of_tweets).
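This computation can be sketched as follows (a hypothetical helper, not the post's code; it assumes plain Euclidean distance, whereas geographic coordinates would really call for a great-circle distance):

```python
import math

def percentile_radii(points, center, percentiles=(0.5, 0.9)):
    """Radius of the circle containing the given fraction of the cluster's
    points, read off the sorted distances to the cluster center."""
    distances = sorted(math.dist(p, center) for p in points)
    # the p-percentile radius is the distance at position p * n in the sort
    return [distances[min(int(p * len(distances)), len(distances) - 1)]
            for p in percentiles]
```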

With those circles, we can see that most of the tweets in the American West Coast cluster are grouped very close to the center. This is not the case for the tweets in the East Asia cluster, which are spread more loosely.

Conclusion

We have seen in this post how to listen to a tweet stream and then use Spark to cluster the tweets by location. We described some techniques to visualize the clusters on a map: dividing the map into Voronoi cells and showing the concentration of tweets using percentile circles around the cluster centers.

K-means can be used for other applications, such as clustering news articles based on their textual content; that will be the subject of another post.
