This is the continuation of my series converting the samples found in Machine Learning in Action from Python to F#. After starting on a nice and steady pace, I hit a speed bump with Chapter 6, dedicated to the Support Vector Machine algorithm. The math is more involved than the previous algorithms, and the original Python implementation is very procedural , which both slowed down the conversion to a more functional style.

Anyways, I am now at a good point to share progress. The current version uses Sequential Minimization Optimization to train the classifier, and supports Kernels. Judging from my experiments, the algorithm works - what is missing at that point is some performance optimization.

I’ll talk first about the code changes from the “naïve SVM” version previously discussed, and then we’ll illustrate the algorithm in action, recognizing hand-written digits.

Main changes from previous version

From a functionality standpoint, the 2 main changes from the previous post are the replacement of the hard-coded vector dot product by arbitrary Kernel functions, and the modification of the algorithm from a naïve loop to the SMO double-loop, pivoting on observations based on their prediction error.

The code I presented last time relied on vector dot-product to partition linearly separable datasets. The obvious issue is that not all datasets are linearly separable. Fortunately, with minimal change, the SVM algorithm can be used to handle more complex situations, using what’s known as the “Kernel Trick
“. In essence, instead of working on the original data, we transform our data in a new space where it is linearly separable:

… we substitute with an inner-product function of the appropriate signature:

// A Kernel transforms 2 data points into a floattypeKernel=floatlist->floatlist->float

… which we can now explicitly inject in the algorithm, like this:

letsmodatasetlabels(kernel:Kernel)parameters=...dostuffhere

This allows us to still support the linear case, by passing in the dot function as a Kernel - but also other more exotic Kernels, like the Gaussian Radial Basis Function, which we will see in action later, in the hand-written digits recognition part:

Note that the radialBias function has 3 arguments, and the Kernel type expects two - when using that Kernel, you simply need to supply the sigma argument by currying:

letrbfKernel=radialBias10.0

… which now has the correct signature. Using a different Kernel would be as simple as creating a new function which has the proper signature, and passing it to the algorithm. First-class functions rule.

The SMO algorithm

The “naïve version” of the algorithm worked essentially this way:

assign coefficients Alpha to each observation,

iterate over the observations, pick another random observation and attempt to “pivot” them, updating the Alpha coefficients for the selected pair while respecting some constraints,

until no Alpha updates are occurring for a while.

By design, the Alpha coefficients stay between 0 and an upper value C, and once the algorithm finishes, observations with Alpha > 0, the “Support Vectors”, are used to generate a separation boundary.

The SMO algorithm is largely similar, with two main differences:

instead of picking a random pivot candidate, it uses a heuristic to select the candidate: from the unbounded candidates (where 0 < Alpha < C), it picks the candidate with the largest absolute error difference,

instead of a single-loop, it switches between full passes over the entire dataset, and passes over the unbounded candidates.

At that point, I think the code in the smo function is decently readable, and conveys the “how” of the algorithm fairly well. I won’t go into the “why”, which would lead us way beyond a single post. I provided a list of resources at the end of the post, in case you want to understand the logic behind the algorithm better, including the original article by Platt.

Three final comments before moving to the fun part, using the algorithm on hand-written recognition.
First, I mentioned last time that I was bothered by the deeply-nested structure of the pivot code. The original Python code was performing multiple checks, and exiting the computation early if some conditions was or wasn’t met. I resolved that issue by using a Maybe builder in the pivotPair function, which I lifted from here - and I really like the result.

Then, I reorganized a bit the code from the original. In Platt’s pseudo code (and in the Python code from Machine Learning in Action), there are 2 key methods: takeStep, and examineExample. The pivot is performed by takeStep, and examineExample selects a suitable pivot - and calls takeStep to execute the pivot. I re-arranged a bit the code, so that these two steps are completely separate: identifyCandidate, given an observation, returns a pair of observations (or None), as well as the corresponding prediction error, and doesn’t call pivotPair, which is called in a separate step and handles the alpha updates. In my opinion, this makes the algorithm workflow much clearer.

Finally, the code in the book uses some error caching, which I have not included in my version - because I actually don’t understand how it is useful. I am under the impression that my current implementation avoids un-necessary error computations, but I may be missing something. If anyone spots an error, I’d love to hear about it! In the same vein, at that point, the algorithm seems to be behaving correctly (the file in Chapter6.fsx illustrates it on a variety of test datasets), but can be very slow on larger datasets or specific parameters settings. I attempted to use memoization to cache some of the Kernel computations, which was a complete failure (I’ll probably post more about that later), and will probably revisit the code for some tuning later (input welcome). In any case, you’ve been warned - this is not a fully optimized version (yet)!

The dataset consists of 1593 scanned digits, written down by 80 people. Each scan is stored as 256 values (1 or 0), representing the 16x16 pixels of the scan, and 10 values (1 or 0), where 1 marks the digit that was written down (i.e. there should be 9 zeroes and 1 one).

We create a web request, and grab the result into data, which at that point contains a gigantic string - our dataset. We need to put it into the shape our algorithm expects, that is, an array of observations (each observation represented as a list of floats), and a matching array of labels. To do that, we need to break the data into lines, and parse each line into 256 numbers (our observations), and the label of the observation - like this:

// a line in the dataset is 16 x 16 = 256 pixels,// followed by 10 digits, 1 denoting the numberletparse(line:string)=letparsed=line.Split('')letobservation=parsed|>Seq.take256|>Seq.map(funs->(float)s)|>Seq.toListletlabel=parsed|>Seq.skip256|>Seq.findIndex(fune->(int)e=1)observation,label// classifier: 7s vs. rest of the worldletdataset,labels=data.Split((char)10)|>Array.filter(funl->l.Length>0)// because of last line|>Array.mapparse|>Array.map(fun(data,l)->data,ifl=7then1.0else-1.0)|>Array.unzip

The parse function splits a line according to whitespace, take the 256 first elements, and transform them to an list of floats (the “observation” part), and identifies what digit was scanned by finding the index/position marked “1” in the last block (a “2” is encoded for instance as 0 0 1 0 0 0 0 0 0 0: all positions are 0, except index 2).

Out of the box, the SVM classifier only separates between 2 classes. In this case, we’ll arbitrarily try to recognize 7s. We process the string data, breaking it into an array split by line breaks (char 10), remove the last chunk (the file ends with an end line), apply our parse function to transform each line into a tuple observation/label, and reduce the labels into 1.0 for 7s, -1.0 otherwise - and finally unzip, exploding the array of tuples into 2 arrays, one containing observations, the other labels. Done!

Because I am a visual guy, I thought it would be nice to see how the original scans looked like, hence the render function:

We set the search parameters and pick a radial basis kernel (more on parameters selection further), split the dataset into 600 observations for training, the rest for validation, using array slices - and start training the classifier.

Note: the choice of 600 is largely random. I just selected a multiple of 200, because of the internal organization of the dataset, which has blocks of 20 times the same digit in a row. Taking multiples of 200 ensures every digit will be represented. Note also that the sample is heavily unbalanced, with only 10% of 7s.

Given the nature of the algorithm, the time to complete the training step will vary. On my humble machine, I have seen it take typically 6 to 9 minutes - so give it a bit of time.

So this is nice but… how good is our classifier?

The proof of the pudding is in the eating, and of the classifier in the classifying, so let’s see if we can gauge quality. One straightforward metric we can use here is the percentage properly recognized for each group. Let’s do that:

quality takes a classifier and a sample. The classifier is a function that takes in a observation, and returns a float: if the result is greater than 0.0, the prediction is the group with label 1.0, otherwise it’s the group labeled -1.0. The sample is an array of tuples - an observation, and its known label, 1.0 or -1.0. quality computes the % properly classified by computing the prediction for each observation. If the prediction and label are of the same sign, their product is positive, and the prediction is correct: we map the correct predictions to 1.0, 0.0 otherwise - and the percentage properly classified is simply the average of these.

evaluate has the same structure, but first splits the data into 2 groups, separating by class. This is important, because computing the raw proportion on the entire sample could be highly misleading. For instance, imagine a classifier that predicts “not a 7” for every observation: that predictor would be correct 90% of the time, because 90% of the sample is in that class. On the other hand, splitting by class would show us that it gets 100% hits in one group, and 0% in the other, a much more informative diagnosis of our state of affairs!

We can now evaluate how the classifier does, on the training set, and on the validation set, the one we really care about. Here is what I got:

Classification is excellent in the training set, which is reassuring, and pretty good on the validation set (92% of 7s properly recognized, and 96% for the rest of the world) - the classifier is obviously doing something right.

How did I come up with the parameter C = 5.0, and sigma = 10.0 for the radial basis kernel? Frankly, by poking around. In the end of the script, you’ll find a block of code which iterates over various values for C and sigma:

Run this only if you have some time on your hands - this tries out various combinations, and prints out some quality metrics for each setting, which helps determining what order of magnitude “works”. I reduced Depth to 10, to limit the search time, but this is still fairly lengthy.

Conclusion

That’s it for now on the Support Vector Machine classification algorithm. Compared to the other classification algorithms I looked into so far, this one has been the hardest to convert, and I frankly can’t wait to move on with my life and look into other algorithms!

I was very impressed by the results on the digits classification - there is something almost magical in seeing 200+ lines of F# and a function as simple as the radial basis function do a pretty good job at classifying fairly messy data.

I suspect the current code is not running as fast as it should, and has room for optimization. I actually tried out some avenues - for instance, I attempted memoizing the Kernel evaluations, but the results were worse than disappointing. If anyone sees some obvious improvements, or has comments on the code, I’d love to hear them! In the meanwhile, I’ll start looking into other algorithms - and come back to this later.