A collection of sloppy snippets for scientific computing and data visualization in Python.

Tuesday, January 22, 2019

A visual introduction to the Gap Statistics

We have previously seen how to implement KMeans. However, the results of this algorithm strongly rely on the choice of the parameter K. In this post we will see how to use the Gap Statistics to pick K in an optimal way.
The main idea of the methodology is to compare the clusters inertia on the data to cluster and a reference dataset. The optimal choice of K is given by k for which the gap between the two results is maximum. To illustrate this idea, let’s pick as reference dataset a uniformly distributed set of points and see the result of KMeans increasing K: