Kaggle Digit Recognizer: A K-means attempt

Over the past couple of months Jen and I have been playing around with the Kaggle Digit Recognizer problem - a ‘competition’ created to introduce people to Machine Learning.

The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is.

You are given an input file containing multiple rows, each of which holds 784 pixel values representing a 28x28 pixel image, as well as a label indicating which digit that image actually represents.

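One of the algorithms that we tried out for this problem was a variation on k-means clustering. Since the training data is already labelled, we didn't need to discover the clusters: for each label we took the values at each pixel location across all of that label's images and came up with an average value for each pixel, giving us one averaged image per digit. Strictly speaking this is closer to a nearest-centroid classifier than to k-means proper.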
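The averaging step can be sketched roughly as follows. This is a Python illustration rather than the code we actually ran; the function name average_pixels and the (label, pixels) row shape are assumptions made for the example, and the toy rows use 4 pixels instead of 784.

```python
from collections import defaultdict


def average_pixels(rows):
    """Average the pixel values position by position for each label.

    rows is an iterable of (label, pixels) pairs (an assumed shape).
    Returns a dict mapping each label to its list of per-pixel means.
    """
    sums = {}
    counts = defaultdict(int)
    for label, pixels in rows:
        if label not in sums:
            sums[label] = [0.0] * len(pixels)
        total = sums[label]
        for i, value in enumerate(pixels):
            total[i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Toy 4-pixel "images" standing in for real 784-pixel rows.
toy_rows = [
    (0, [0, 0, 255, 255]),
    (0, [0, 0, 255, 155]),
    (1, [255, 255, 0, 0]),
]
averages = average_pixels(toy_rows)
```

With the real data this would produce 10 entries, one averaged 784-value image per digit.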

When we needed to classify a new image we calculated the distance between each pixel of the new image and the equivalent pixel of each of the 10 averaged images, and then worked out which label our new image was closest to.

We started off with some code to load the training set data into memory so we could play around with it.

One thing we learnt was that it’s helpful to just be able to take a small subset of the data set into memory rather than loading the whole thing in straight away. I ended up crashing my terminal a few times by evaluating a 40,000 line file into the Slime buffer - not a good idea!
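A minimal sketch of loading only the first few rows might look like this. It is written in Python purely for illustration and is not our original code; load_rows and its limit parameter are hypothetical names, and it assumes the file has a header row with the label in the first column followed by the pixel values.

```python
import csv
import io
from itertools import islice


def load_rows(lines, limit=None):
    """Parse up to `limit` rows into (label, pixels) pairs.

    Assumes a header row, then the label in column 0 and the pixel
    values in the remaining columns. limit=None reads everything.
    """
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    rows = []
    for record in islice(reader, limit):
        label = int(record[0])
        pixels = [int(value) for value in record[1:]]
        rows.append((label, pixels))
    return rows


# An in-memory stand-in for the training file, with 3 pixels per row.
sample = io.StringIO(
    "label,pixel0,pixel1,pixel2\n"
    "1,0,128,255\n"
    "7,0,0,64\n"
    "3,9,9,9\n"
)
subset = load_rows(sample, limit=2)

# With a real file (the file name here is an assumption):
# with open("train.csv", newline="") as f:
#     subset = load_rows(f, limit=1000)
```

Because the reader is consumed lazily, rows beyond the limit are never parsed, which keeps a quick experiment from pulling tens of thousands of lines into the session.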

distance-between finds the Euclidean distance between two sets of pixel values; find-gap then uses this to find the distance from each trained label's set of averaged pixels to the test image; and we can then call which-am-i to find out which label a new set of pixels should be classified as.

The which-am-i function returns its prediction first, followed by the distance from the test image to each of the trained labels, so that we can tell how close it was to being classified as something else.
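The three functions can be sketched along these lines. This is a Python approximation for illustration, not our original code: the names mirror the ones above with underscores, and the exact return shape of which_am_i (a prediction plus a dict of distances) is an assumption.

```python
import math


def distance_between(pixels_a, pixels_b):
    """Euclidean distance between two equal-length pixel lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pixels_a, pixels_b)))


def find_gap(averages, pixels):
    """Distance from each label's averaged pixels to the given pixels."""
    return {
        label: distance_between(average, pixels)
        for label, average in averages.items()
    }


def which_am_i(averages, pixels):
    """Return the closest label, plus the distance to every label."""
    gaps = find_gap(averages, pixels)
    prediction = min(gaps, key=gaps.get)
    return prediction, gaps


# Toy 2-pixel averages standing in for the 10 real 784-pixel ones.
toy_averages = {0: [0.0, 0.0], 1: [3.0, 4.0]}
prediction, gaps = which_am_i(toy_averages, [0, 0])
```

Returning the full set of distances alongside the prediction makes it easy to spot near misses, for example an image that is almost as close to one digit as it is to another.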

We got an accuracy of 80.657% when classifying new values with this algorithm, which isn’t great but doesn’t seem too bad given how simple it is and that we were able to get it up and running in a couple of hours.
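One way to check a figure like this locally is to hold back some labelled rows and count how many are classified correctly. The sketch below is a Python illustration under that assumption; accuracy, toy_predict and the held-out rows are all made up for the example and are not how the number above was produced.

```python
def accuracy(predict, labelled_rows):
    """Percentage of (label, pixels) rows for which predict(pixels) == label."""
    rows = list(labelled_rows)
    correct = sum(1 for label, pixels in rows if predict(pixels) == label)
    return 100.0 * correct / len(rows)


def toy_predict(pixels):
    """Stand-in classifier: label 1 if the first pixel is bright, else 0."""
    return 1 if pixels[0] > 127 else 0


# Four made-up labelled rows with a single pixel each.
held_out = [(0, [10]), (1, [200]), (1, [90]), (0, [0])]
score = accuracy(toy_predict, held_out)
```

Any classifier that maps a list of pixels to a label can be passed in as predict, so the same helper works for comparing different approaches on the same held-out rows.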