Handling missing data in K-Means

One of the challenges of building "big data" apps is dealing with messy data sets. At SupplyFrame, we ran into a problem while doing some analysis with K-Means clustering: all of the interesting features in our data had varying amounts of missing values. It turns out that how the values go missing is significant! Say you knock out cells at random: your analysis won't suffer too much, because the missing cells contribute to the error uniformly. This is known as Missing Completely at Random (MCAR).
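To make the MCAR case concrete, here is a small sketch in plain Python. It is an illustration only, not our analysis code: the synthetic Gaussian feature, the 30% knock-out rate, and the helper names `knock_out_mcar` and `observed_mean` are all assumptions made for the example. It removes cells independently of their values and compares the mean of what is left with the mean of the full data.

```python
import random

def knock_out_mcar(values, rate, rng):
    """Replace each value with None independently with probability `rate`."""
    return [None if rng.random() < rate else v for v in values]

def observed_mean(values):
    """Mean over the cells that are still present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

rng = random.Random(0)  # fixed seed so the run is repeatable
full = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # hypothetical feature
mcar = knock_out_mcar(full, rate=0.3, rng=rng)

full_mean = sum(full) / len(full)
mcar_mean = observed_mean(mcar)
print(f"full mean: {full_mean:+.3f}, mean after MCAR knock-out: {mcar_mean:+.3f}")
```

Because the knock-out ignores the values themselves, the two means should land close to each other, which is the sense in which the error contribution is uniform.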

However, let's say cells get knocked out more frequently when the user comes from a certain country with latency problems. Now the contribution to the error is no longer random, because whether a value is missing depends on where the user came from. We had to modify the K-Means algorithm to handle this situation. Since we also deal with non-Euclidean distances, we had to adapt K-Means to accept any distance function. A simple Python project provides a reference implementation.
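The project itself is not reproduced here. As a rough illustration of the general idea only, the sketch below shows one common way to let K-Means tolerate missing cells and accept a pluggable distance function. Everything in it is an assumption made for the example rather than the reference implementation: missing cells are encoded as `None`, the default distance `partial_euclidean` compares only observed coordinates and rescales the result, centroids are updated with per-dimension means over observed values, and the caller supplies complete starting centroids. Note that skipping missing cells like this does not by itself correct the bias that non-random missingness introduces.

```python
import math

def partial_euclidean(point, centroid):
    """Euclidean distance over the coordinates observed in `point`,
    rescaled so rows with different numbers of gaps stay comparable."""
    pairs = [(p, c) for p, c in zip(point, centroid) if p is not None]
    if not pairs:
        return math.inf  # a fully missing row carries no information
    squared = sum((p - c) ** 2 for p, c in pairs)
    return math.sqrt(squared * len(point) / len(pairs))

def update_centroid(members, old):
    """Per-dimension mean over observed values; keep the old value if none."""
    new = []
    for d, old_value in enumerate(old):
        seen = [m[d] for m in members if m[d] is not None]
        new.append(sum(seen) / len(seen) if seen else old_value)
    return new

def kmeans_missing(points, centroids, distance=partial_euclidean, max_iter=100):
    """Lloyd-style iteration; `distance(point, centroid)` must accept None cells."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid under the supplied distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so we are done
        labels = new_labels
        # Update step: recompute each non-empty cluster's centroid.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = update_centroid(members, centroids[j])
    return labels, centroids

# Toy data with gaps: two well-separated groups.
points = [
    (1.0, 1.1), (0.9, None), (None, 1.0), (1.2, 0.8),
    (5.0, 5.2), (None, 4.9), (5.1, None), (4.8, 5.0),
]
labels, centroids = kmeans_missing(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(labels)
print(centroids)
```

Any other function with the same `distance(point, centroid)` signature can be passed in instead. One caveat with that design: the per-dimension mean is the natural centroid for squared Euclidean distance, so for other distances a different update rule (a medoid, for example) may be a better fit.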


About us

SupplyFrame Engineering is a cheerful bunch of developers, hackers and researchers on a mission to revolutionize the electronics industry. We tackle a wide range of problems in Search, Computational Advertising, Knowledge Engineering and Tools Development. Our day-to-day includes everything from platform engineering and data science to Web apps. We love what we do. You should join us.