Ribbon

Sunday, August 31, 2014

Clustering for Business People

In many situations, a business needs to simplify its customer base by extracting a few typical representatives [i.e. segments or clusters] that, taken together, approximately describe the main trends within the customer base. More than anyone else, Marketing departments need this to understand:

Trends (what segments are growing/shrinking)

Customer LifeCycle (Do I recruit the same customers?)

What are the shopping/behavioral/spending/channel habits of my customers?

Who is the most profitable group of customers (is it a function of my Marketing Spend?)

Where are the Marketing dollars spent? Is it a good investment?

In brief, at a high level, it is about understanding the interactions between customers and the Brand.

The Analytics needed to tackle the issue are conceptually pretty simple. Yet it is easy to get lost among all the available techniques. Let's run down what's out there.

Overview:

| Family | Name | Description | Use | Drawback | Upside |
|---|---|---|---|---|---|
| Partitional Clustering | K-means (K = number of clusters) | For a predetermined number of clusters, find the center of each cluster that minimizes the distances between each point and its assigned cluster center | Used by Marketing departments 99% of the time | No notion of partial membership (for individuals on the border between groups, for instance); sensitive to outliers; not deterministic | Simple, intuitive technique; easy to deploy |
| Partitional Clustering | K-Medoids [1], ex: Clara | Clara is a sister version of k-means that creates medoids instead of centroids. The main difference is that Clara returns centers that are actual individuals | Used for large datasets that still fit in memory; less common than k-means | Not deterministic | More scalable than k-means |
| Partitional Clustering | Fuzzy clustering, ex: Fanny | Similar to k-means but allows each point to be a member of several clusters | Rare; can be useful to better account for non-obvious cases | More complex paradigm | Better representation of reality |
| Hierarchical | Hierarchical clustering | Comes in two flavors: top-down (divisive) and bottom-up (agglomerative). Bottom-up: merge each point or group with the closest one until only one group remains. The output is a tree (called a dendrogram) | Can be used in the natural sciences as it reproduces the species/sub-species/sub-sub-species paradigm well | Slow | No a priori number of clusters (one can choose to prune the tree at any height); deterministic (run twice, same results!) |
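To make the k-means description above concrete, here is a minimal sketch of Lloyd's algorithm in plain NumPy. The two-blob dataset and the choice of k are made up for illustration; note how the random initialization is exactly why the technique is not deterministic:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points to the
    nearest center and moving each center to the mean of its points."""
    rng = np.random.default_rng(seed)
    # Random initialization -- the reason k-means is not deterministic.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs of toy data.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(data, k=2)
```

In practice one would reach for a library implementation (e.g. R's `kmeans` or scikit-learn's `KMeans`), but the loop above is the whole idea.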

Applications:

K-means Clustering:

Let's use the following (bogus) problem statement: we would like to design a county-level experiment to understand the effect of implementing a public policy across racial groups (for instance, the very controversial No Child Left Behind policy), with the objective of being fair to everyone. As such, we would like to ensure that we identify counties with the same racial mix. This leads to 2 questions:

1- How many different segments of counties are there?

We run a small experiment using Census data (see script) and we end up with this curve that shows distinctiveness (first row) and homogeneity (bottom row) performance:
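The original script is not shown here, but the shape of such a performance curve can be sketched on toy data: compute the within-cluster distortion for a range of k and look for the point of diminishing returns. The four-group dataset below stands in for the county profiles and is purely illustrative, not the Census data:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Toy data with 4 well-separated groups standing in for county racial-mix profiles.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 3, 6, 9)])

# scipy's kmeans returns (centers, mean distance to the closest center);
# that second value is the "homogeneity" axis of an elbow curve.
distortions = {k: kmeans(data, k)[1] for k in range(1, 9)}
```

Plotting `distortions` against k shows a sharp drop up to the true number of groups and a flat tail afterwards, which is the visual cue used to pick a range of candidate k values.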

2- What are their characteristics?

Those 2 curves show that values of k between 5 and 10 are good picks that achieve most of the optimal performance. I usually use this curve to select a range, then create 2-3 scenarios and see which one the client feels most comfortable with (How much is more granularity worth? Are the cluster sizes ideal and uniform enough for your campaigns?). Depending on the use, we may choose a different k.
For the sake of the argument, let's go with 6.

K-Means is a rigid algorithm: once the number of clusters has been determined, it is difficult to go back. In addition, we see that there are some cases on a cluster frontier that the algorithm nonetheless assigned to one single cluster (i.e. no sense of partial membership).

To remedy this situation, we can use Hierarchical Clustering (only for small populations, as it is expensive to compute). In our case, we want to understand how close States are in terms of racial mix.
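Such a dendrogram can be sketched with SciPy's agglomerative (bottom-up) linkage. The per-state racial-mix vectors below are made-up numbers for illustration, not the 2013 Census figures:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative (made-up) racial-mix shares per state -- not Census data.
states = ["Alabama", "Mississippi", "Ohio", "Iowa", "Hawaii"]
mix = np.array([
    [0.66, 0.27, 0.04, 0.01],
    [0.58, 0.37, 0.03, 0.01],
    [0.80, 0.12, 0.04, 0.02],
    [0.86, 0.03, 0.06, 0.02],
    [0.23, 0.02, 0.10, 0.38],
])

# Each row of Z records one bottom-up merge (indices, height, cluster size).
Z = linkage(mix, method="average")
tree = dendrogram(Z, labels=states, no_plot=True)
print(tree["ivl"])  # leaf order; the outlier (Hawaii) sits at one end
```

Calling `dendrogram` without `no_plot=True` draws the tree itself; the height of each merge is the distance at which two groups were joined, which is what makes pruning at a chosen height meaningful.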

Running a divisive clustering on the 2013 data, we would get the following:

Nice! We see that there are 4 main groups (from left to right):

1-Alabama ... Missouri: Those are the Southern States

2-Arizona ... New Mexico: Mixed bags. We will go back to it later

3- Alaska ... Ohio: Mostly Midwest States

4- Idaho ... West Virginia: Northern States

That being said, we can choose to prune (i.e. cut) the tree at whatever height we want, giving us the extra flexibility that K-Means did not allow.
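Pruning at a chosen height is one call with SciPy's `fcluster`, sketched here on a tiny made-up 1-D example: a low cut keeps the tight groups separate (and isolates the outlier), while a cut above the root height collapses everything into one cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D profiles: two tight pairs plus one outlier.
data = np.array([[0.0], [0.1], [5.0], [5.1], [20.0]])
Z = linkage(data, method="average")

# Cut at height 1.0: only merges cheaper than 1.0 survive, so the two
# tight pairs form clusters and the outlier stands alone (3 clusters).
labels_low = fcluster(Z, t=1.0, criterion="distance")

# Cut above the root height: everything collapses into a single cluster.
labels_high = fcluster(Z, t=50.0, criterion="distance")
```

Each candidate cut height corresponds to one of the k-means scenarios discussed above, except that moving between them requires no re-fitting.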

We also notice an outlier: Hawaii. Its population mix must be significantly different from the others (perhaps because of its native population?).