With all sorts of people waving around the term “Machine Learning” lately, I figured it was time to look into what the fuss was about, so I purchased “Machine Learning In Action”. I am mostly enjoying the book so far, with one inconvenience: all the code presented is in Python, which is easy enough to follow, but not directly useful to me. The best way to learn is to get your hands dirty and code, so I am planning on converting the Python examples into F# as I progress through the book - which should also be a good exercise in learning more F#.

Chapter 2 of the book covers classification using k-Nearest Neighbors. The idea behind the algorithm is fairly straightforward: given a dataset of numeric observations, each assigned to a known group, the algorithm will classify a new observation based on the group most of its closest neighbors belong to.
The book uses a linear algebra library in its implementation. It seemed like overkill for the situation, so I’ll go for raw F# here.

Let’s first create a new F# library project, and start working on a script, creating a fictional dataset like this:

createDataSet returns a Tuple with two elements. The first is an Array of Arrays, containing 6 observations of 2 fictional variables each. The second is an Array containing the Label of the group each observation belongs to. For instance, the first observation is [ 1.0; 0.9 ], and it belongs to group A.
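Based on that description, the function might look something like this - only the first observation and its label are given in the text above, so the other five values are made up for illustration:

```fsharp
// A sketch of createDataSet: returns a tuple of
// (observations, labels). Only the first observation
// [| 1.0; 0.9 |] / "A" is from the original; the rest
// are invented to match the 6-observation, 2-variable shape.
let createDataSet () =
    [| [| 1.0; 0.9 |]
       [| 0.8; 1.0 |]
       [| 1.0; 1.0 |]
       [| 0.1; 0.1 |]
       [| 0.0; 0.1 |]
       [| 0.1; 0.0 |] |],
    [| "A"; "A"; "A"; "B"; "B"; "B" |]
```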

It would be helpful to visualize the dataset, to get a sense for the structure of the data. One option to do this is FSharpChart, a lightweight charting library which works fairly well with the F# interactive window. The easiest way to get it is via NuGet, which adds a reference to MSDN.FSharpChart to the project. We then need to reference FSharpChart from the script, pointing at the path where NuGet downloaded the libraries (this post by Don Syme provides a great example) - we are now ready to add a scatterplot function to the script:

The scatterplot simply takes a dataset, maps each observation to a tuple of X and Y coordinates, and passes it to FSharpChart.FastPoint, which produces a… scatterplot. Let’s select all that code, send it to F# interactive, and start playing in fsi:
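For reference, a sketch of what that script code looks like - the #r path is an assumption and depends on where NuGet placed MSDN.FSharpChart.dll on your machine:

```fsharp
// Assumed path - adjust to wherever NuGet dropped the library.
#r @"..\packages\MSDN.FSharpChart.dll.0.60\lib\MSDN.FSharpChart.dll"

open MSDN.FSharp.Charting

// Map each observation to an (X, Y) tuple and plot the points.
let scatterplot (dataSet: float [] []) =
    dataSet
    |> Array.map (fun obs -> obs.[0], obs.[1])
    |> FSharpChart.FastPoint
```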

At that point, you should see a chart popping up, looking like this one:

Our 6 observations are there. However, this isn’t very informative - it would be nice to also see which group each point belongs to, maybe with some colors and labels. I had a few problems with that part; here is the code I came up with:

dataSet and labels match what we have done so far; I added two arguments, i and j, which represent what variable to plot. byLabel takes the dataset and labels, and packages each observation in a tuple, where the first element is the label of the observation, and the second the X and Y coordinates we will display.

Then we create a combination chart: we build an individual series for each label, by filtering the observations matching that label and generating a Point series, adding a Label to each individual observation. Note the ugly static upcast of each Point chart to a GenericChart - I struggled quite a bit with that one, because FSharpChart.Combine would complain about the Chart type. Two other things are needed here: a reference to System.Windows.Forms.DataVisualization.dll (#r @"System.Windows.Forms.DataVisualization.dll"), and an open System.Drawing statement.
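Putting the pieces described above together, the labeled version might be sketched as follows - the function name is mine, and the exact FSharpChart property names (Name, Label) are from memory and may differ slightly from the library's actual API:

```fsharp
#r @"System.Windows.Forms.DataVisualization.dll"

open System.Drawing
open MSDN.FSharp.Charting

// i and j pick which two variables of each observation to plot.
let labeledScatterplot (dataSet: float [] []) (labels: string []) i j =
    // Package each observation as (label, (X, Y)).
    let byLabel =
        Array.zip labels (dataSet |> Array.map (fun obs -> obs.[i], obs.[j]))
    let uniqueLabels = Set.ofArray labels
    FSharpChart.Combine
        [ for label in uniqueLabels ->
            let points =
                byLabel
                |> Array.filter (fun (l, _) -> l = label)
                |> Array.map snd
            // The static upcast to GenericChart keeps
            // FSharpChart.Combine happy.
            FSharpChart.Point(points, Name = label)
            :> ChartTypes.GenericChart ]
```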

This is much more useful - we see that group A is lying in the upper-right quadrant, whereas group B is in the lower-left area:

Now that we know what we are dealing with, let’s go classify. The procedure is pretty simple: to classify a new subject based on the dataset, we compute the distance between the subject and every observation in the set, pick the k closest observations, and take a majority vote in this set of k nearest neighbors.

The price to pay for not using a Linear Algebra library is that I can’t directly compute the difference between two vectors - I have to write my distance function by hand.

The distance function simply takes two arrays of doubles, computes the sum of the squared differences of their elements by folding, and takes the square root of the total, which is the Euclidean distance.
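A minimal sketch of that function, following the fold-then-square-root description:

```fsharp
// Euclidean distance between two observations:
// fold over both arrays accumulating squared differences,
// then take the square root of the total.
let distance (obs1: float []) (obs2: float []) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 obs1 obs2
    |> sqrt
```

For instance, `distance [| 0.0; 0.0 |] [| 3.0; 4.0 |]` yields 5.0, the classic 3-4-5 triangle.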

classify maps each row of the dataset (an observation) to its distance from the subject we want to classify. Once that’s done, we zip the distances with the observation labels, sort by distance, take the first k elements, group them by label, and pick the group with the largest number of elements.
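That pipeline can be sketched as below - distance is repeated here so the snippet stands on its own:

```fsharp
// Euclidean distance, as before.
let distance (obs1: float []) (obs2: float []) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 obs1 obs2
    |> sqrt

// Classify a subject by majority vote among its k nearest neighbors.
let classify (subject: float []) (dataSet: float [] []) (labels: string []) k =
    dataSet
    |> Array.map (distance subject)      // distance to each observation
    |> Array.zip labels                  // pair each label with its distance
    |> Array.sortBy snd                  // closest first
    |> Seq.take k                        // keep the k nearest neighbors
    |> Seq.groupBy fst                   // group the votes by label
    |> Seq.maxBy (snd >> Seq.length)     // largest group wins
    |> fst
```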