Using clustering to find points in an image

In this post, I present my new package {img2coord}. This package can be used to retrieve coordinates from a scatter plot (as an image).

devtools::install_github("privefl/img2coord")

Have you ever made a plot, saved it as a PNG, and moved on? When you come back to it, it can be difficult to read values off the plot, especially if it has no grid.
Making this package was also a good way to practice with clustering.

Even when using the true number of clusters, kmeans() gets trapped in a local minimum (this is clearly not the best solution!), depending on how the centers are initialised. One possible solution would be to use many initialisations; let’s try that.

set.seed(1)
km <- kmeans(ind, centers = 22, nstart = 100, iter.max = 100)

## Warning: did not converge in 100 iterations
## Warning: did not converge in 100 iterations
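
To make the effect of nstart easier to see in isolation, here is a minimal sketch on simulated data. The points, the number of groups and the seeds are assumptions made for illustration; this is not the `ind` matrix used in this post.

```r
# Simulated data (an assumption, not the post's `ind`): 5 tight groups of 2D points
set.seed(1)
k <- 5
true_centers <- cbind(x = c(0, 10, 20, 0, 10), y = c(0, 0, 0, 10, 10))
pts <- true_centers[rep(seq_len(k), each = 40), ] +
  matrix(rnorm(2 * k * 40, sd = 0.5), ncol = 2)

# Same seed before both runs, to make the comparison reproducible
set.seed(2)
km_one <- kmeans(pts, centers = k, nstart = 1, iter.max = 100)
set.seed(2)
km_many <- kmeans(pts, centers = k, nstart = 50, iter.max = 100)

# kmeans() keeps the start with the lowest total within-cluster sum of squares
c(one_start = km_one$tot.withinss, many_starts = km_many$tot.withinss)
```

More starts cost more computation, and they only help if at least one of the random initialisations lands close enough to the good solution.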

This works better here because I combined the silhouette statistic with the Gini coefficient (a measure of dispersion) of the number of pixels in each cluster, on the assumption that all clusters should contain approximately the same number of pixels. Let’s have a look at the combined statistic:
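
As a rough illustration of the Gini part only, here is a sketch in base R. The gini() helper and the pixel counts are hypothetical, and the way the package combines this value with the silhouette statistic is not reproduced here.

```r
# Gini coefficient of a vector of non-negative counts:
# mean absolute difference between all pairs, divided by twice the mean
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

balanced   <- c(50, 52, 49, 51)  # hypothetical pixel counts, similar cluster sizes
unbalanced <- c(5, 95, 50, 52)   # hypothetical counts from a poor clustering

gini(balanced)
gini(unbalanced)
```

A value close to 0 means that the clusters have nearly the same number of pixels, which is what we expect when every point is drawn with the same symbol and size.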

Handling large images

## Error: Detected more than 10000 pixels associated with points (21358).
## Make sure you have a white background with no grid (only points).
## You can change 'max_pixels', but it could become time/memory consuming.
## You can also downsize the image using `img_scale()`.

The green points span 21,358 pixels, which could be a lot to process, depending on your computer. To solve this problem, you can either increase 'max_pixels' or downsize the image using `img_scale()`, as the error message suggests.
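
To give an idea of why downsizing helps, here is a crude sketch in base R that keeps every second row and column of a simulated binary image. This is only an assumption-laden illustration of the idea; it is not how `img_scale()` is implemented, and the image is made up.

```r
# Hypothetical binary image: TRUE where a pixel belongs to a point (a filled disc)
n <- 200
grid <- expand.grid(row = seq_len(n), col = seq_len(n))
img <- matrix((grid$row - 100)^2 + (grid$col - 100)^2 <= 80^2, nrow = n)

# Keep every second row and column: a crude downsizing by a factor of 2
keep <- seq(1, n, by = 2)
small <- img[keep, keep]

# Number of pixels to cluster, before and after
c(before = sum(img), after = sum(small))
```

Halving both dimensions leaves roughly a quarter of the pixels, and the distance matrix used by hierarchical clustering grows with the square of that number.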

Conclusion

We have seen that hclust() performed better than kmeans() (for this example). For some reason I don’t understand yet, initializing kmeans() with centers from hclust() works even better.
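
For readers who want to try this initialisation, here is a minimal sketch on simulated data. The data, the default (complete) linkage and the variable names are assumptions; the package may do this differently.

```r
# Simulated data (an assumption): 5 tight groups of 2D points
set.seed(1)
k <- 5
true_centers <- cbind(x = c(0, 10, 20, 0, 10), y = c(0, 0, 0, 10, 10))
pts <- true_centers[rep(seq_len(k), each = 40), ] +
  matrix(rnorm(2 * k * 40, sd = 0.5), ncol = 2)

# Hierarchical clustering (default complete linkage), cut into k groups
hc <- hclust(dist(pts))
groups <- cutree(hc, k = k)

# Centroid of each hclust group, used as starting centers for kmeans()
init <- apply(pts, 2, function(col) tapply(col, groups, mean))
km <- kmeans(pts, centers = init, iter.max = 100)

# Cross-tabulate the two partitions
table(hclust = groups, kmeans = km$cluster)
```

Passing a matrix to centers makes kmeans() start from these rows instead of from randomly chosen points, so the result no longer depends on the seed.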

Then, we have seen how to determine the number of clusters. Finally, we have seen that using a particular statistic, specifically designed for this problem, improved the solution.

Of course, this could be improved a lot. For example, this won’t work for plots with a background color or a grid. Feel free to bring your ideas. BTW, thanks to Robin, who brought some nice ideas that improved this package a lot.