Handwriting Recognition Revisited: Kernel Support Vector Machines

In a previous article, we discussed how to perform the recognition of handwritten digits using Kernel Discriminant Analysis. In this article, we will discuss some techniques to do it using Kernel Support Vector Machines.

Kernel Discriminant Analysis has its own set of problems. It scales poorly with the number of samples, as O(n³), even though it has no trouble dealing with high-dimensional data. Another serious problem is that it requires the entire training set to be available during model evaluation, making it unsuitable for many scenarios (such as embedded systems).

At the end of the previous article, I mentioned Support Vector Machines (SVMs) as better candidates for performing handwritten digit recognition. One advantage of SVMs is that their solutions are sparse, so, unlike KDA, they do not require the entire training set to always be available during evaluation. Only a (typically small) subset is needed. The vectors in this subset are what are commonly called the "support vectors".

Support Vector Machines (SVMs) are a set of related supervised learning methods which can be used for both classification and regression. In simple words, given a set of training examples, each marked as belonging to one of two categories, an SVM classification training algorithm tries to build a decision model capable of predicting whether a new example falls into one category or the other. If the examples are represented as points in space, a linear SVM model can be interpreted as a division of this space so that the examples belonging to separate categories are divided by a clear gap that is as wide as possible. New examples are then predicted to belong to a category based on which side of the gap they fall on.

Support Vector Machine as a maximum margin classifier. The leftmost picture shows a decision problem for two classes, blue and green. The middle picture shows the hyperplane which has the largest distance to the nearest training data points of each class. The rightmost picture shows that only two data points are needed to define this hyperplane. Those will be taken as support vectors, and will be used to guide the decision process.

A linear support vector machine is composed of a set of support vectors z and a set of weights w. The output of a given SVM with N support vectors z1, z2, ..., zN and weights w1, w2, ..., wN is then given by:

A decision function is then applied to transform this output into a binary decision. Usually, sign(·) is used, so that outputs greater than zero are assigned to one class and outputs less than zero to the other.
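This computation can be sketched in a few lines of Python. The support vectors, weights, and bias below are hand-crafted for illustration only; they are not the output of any training algorithm:

```python
# Raw output of a linear SVM: f(x) = sum_i w_i * <z_i, x> + b
def svm_output(x, support_vectors, weights, bias):
    return sum(w * sum(zi * xi for zi, xi in zip(z, x))
               for w, z in zip(weights, support_vectors)) + bias

# Binary decision: the sign of the raw output selects the class
def svm_decide(x, support_vectors, weights, bias):
    return 1 if svm_output(x, support_vectors, weights, bias) > 0 else -1

# A hand-crafted machine whose decision boundary is the line x1 + x2 = 1
z = [[1.0, 0.0], [0.0, 1.0]]   # support vectors
w = [1.0, 1.0]                 # weights
b = -1.0                       # bias (threshold)

print(svm_decide([2.0, 2.0], z, w, b))   # -> 1  (one side of the gap)
print(svm_decide([0.0, 0.0], z, w, b))   # -> -1 (the other side)
```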

As detailed above, the original SVM optimal hyperplane algorithm is a linear classifier. However, almost 30 years after its introduction in 1963, researchers (including the original author himself) suggested a way to create non-linear classifiers by applying the kernel trick to these maximum-margin hyperplanes. The result was a boom in research on Kernel Machines, which became some of the most powerful and popular classification methods to date.

And it was not without reason. The kernel trick is a very powerful tool able to provide a bridge from linearity to non-linearity for any algorithm that depends solely on the dot product between two vectors. It comes from the fact that, if we first map our input data into a higher-dimensional space, a linear algorithm operating in that space will behave non-linearly in the original input space.

The "trick" resides in the fact that this mapping does not ever need to be computed: all we have to do is replace all dot products by a suitable kernel function. The kernel function denotes an inner product in feature space, and is usually denoted as:

Using a kernel function, the algorithm can then be carried out in a higher-dimensional space without explicitly mapping the input points into that space. This is highly desirable, as our higher-dimensional feature space can sometimes even be infinite-dimensional and thus infeasible to compute in. Since the original formulation of SVMs consists mainly of dot products, it is straightforward to apply the kernel trick. Even though the resulting classifier is still a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space. The use of the kernel trick also gives much stronger theoretical support to the underlying classifiers in comparison with methods inspired by other sources, such as biology.

Schematic diagram demonstrating how the Kernel trick can be applied to the original Support Vector Machine formulation.
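To make the trick concrete, the following sketch replaces the dot product in the decision function above with a kernel function. The polynomial and Gaussian kernels follow their standard textbook definitions; all names are illustrative rather than taken from any particular library:

```python
import math

def linear_kernel(a, b):
    # Plain dot product: the kernel of the original linear SVM
    return sum(ai * bi for ai, bi in zip(a, b))

def polynomial_kernel(a, b, degree=2, constant=1.0):
    # (<a, b> + c)^d: an inner product in a finite higher-dimensional space
    return (linear_kernel(a, b) + constant) ** degree

def gaussian_kernel(a, b, sigma=1.0):
    # exp(-||a - b||^2 / (2 sigma^2)): an inner product in an
    # infinite-dimensional feature space that is never computed explicitly
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def kernel_svm_output(x, support_vectors, weights, bias, kernel):
    # Same computation as the linear SVM, with <z, x> replaced by k(z, x)
    return sum(w * kernel(z, x)
               for w, z in zip(weights, support_vectors)) + bias

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))    # -> 1.0 (identical inputs)
print(polynomial_kernel([1.0, 1.0], [1.0, 1.0]))  # -> 9.0
```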

Suppose we have to decide between three classes, A, B, and C. Now, suppose all we have are binary classifiers, i.e., methods which can decide only between two classes. One possible way to use our binary classifiers on a multi-class classification problem is to divide the multi-class problem into a set of binary problems. The left matrix below shows all possible combinations of binary decision problems which can be formed from our three classes:

However, notice that the left matrix includes some redundant scenarios. For example, it is pointless to compute the decision between A and A. It is also inefficient to compute both B x A and A x B, when computing only one and taking the opposite would suffice. Discarding the redundant options, we are left with the matrix on the right. As can be seen, a typical decision problem between n classes can always be decomposed into n(n-1)/2 binary problems.

Now that we are left with three binary problems, it is straightforward to see that, in order to achieve multiclass classification using SVMs, all we have to do is create three SVMs, one for each of the sub-problems.

To decide on a class, we can use a voting scheme in which the class that receives the most votes is declared the winner of the decision process. For example, let's say class A won in the first machine, and C won in both the second and the third. With two votes against one, C would then be chosen as the final answer.
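The decomposition and the voting scheme can be sketched as follows. The pairwise "machines" below are toy stand-ins that reproduce the votes from the example above, not trained SVMs:

```python
from itertools import combinations
from collections import Counter

classes = ['A', 'B', 'C']

# n(n-1)/2 binary sub-problems for n classes: (A,B), (A,C), (B,C)
pairs = list(combinations(classes, 2))

# Toy pairwise decisions mimicking the example in the text:
# A beats B, C beats A, and C beats B.
def pairwise_decision(pair, x):
    winners = {('A', 'B'): 'A', ('A', 'C'): 'C', ('B', 'C'): 'C'}
    return winners[pair]

def classify(x):
    # Each pairwise machine casts one vote; the most-voted class wins
    votes = Counter(pairwise_decision(pair, x) for pair in pairs)
    return votes.most_common(1)[0][0]

print(len(pairs))      # -> 3
print(classify(None))  # -> 'C' (two votes against one for A)
```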

Another possible approach is a one-against-all strategy in which the input pattern is presented to all SVMs and we decide for the SVM which produces the highest output. Unfortunately, nothing guarantees that a higher positive output from one machine is better than a lower, but still positive, output from another machine, so simply choosing the machine with the highest output as the winner can lead to poor results. The exception is Relevance Vector Machines, for which this strategy would work, since their outputs are indeed probabilities. However, RVMs have problems of their own which are beyond the scope of this article.

For the reasons listed above, we will be focusing only on one-against-one multi-class classification in the rest of this article.

The (Kernel) Support Vector Machine code presented here is also part of Accord.NET, a framework I have been building over the years. It is built on top of AForge.NET, a popular framework for computer vision and machine learning, and aggregates various methods I have needed in my past research. Currently, it has implementations of PCA, KPCA, LDA, KDA, LR, PLS, SVMs, HMMs, LM-ANN, and many other acronyms. The project is hosted on GitHub at https://github.com/accord-net/framework/. For the latest version of this code, which may contain the latest bug fixes, enhancements, and features, I highly recommend downloading the latest version of Accord.NET directly from the project's site.

The training algorithms can perform both classification and regression. They are direct implementations of Platt's Sequential Minimal Optimization (SMO) algorithm. MulticlassSupportVectorLearning provides a delegate function, named Configure, which can be used to select and configure any learning algorithm. This approach does not impose limitations on which algorithm to use, and also allows users to specify their own training algorithms.

Since the MulticlassSupportVectorLearning algorithm works by training a set of independent machines at once, it is easily parallelizable. In fact, the available implementation can take full advantage of extra cores in a single machine.

The code listed above uses AForge.NET Parallel constructions alongside Accord.NET matrix extensions. I decided not to use the newly added .NET 4.0 Parallel Extensions so that the framework could remain compatible with .NET 3.5 applications.

If you have already read the previous article about Kernel Discriminant Analysis for Handwritten Digit Recognition, just skip this section. This is an introduction to the Optdigits Dataset from the UCI Machine Learning Repository.

In the raw Optdigits data, digits are represented as 32x32 matrices. They are also available in a pre-processed form in which the digits have been divided into non-overlapping 4x4 blocks and the number of "on" pixels in each block has been counted. This generates 8x8 input matrices in which each element is an integer in the range 0..16.
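This pre-processing can be sketched as follows; `block_counts` is a hypothetical helper written for illustration, not code from the Optdigits distribution:

```python
# Reduce a 32x32 binary image to an 8x8 matrix by counting the "on"
# pixels inside each non-overlapping 4x4 block (values range 0..16).
def block_counts(image, block=4):
    n = len(image)                    # assumes a square n x n binary matrix
    size = n // block
    out = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            out[r // block][c // block] += image[r][c]
    return out

# An all-ones 32x32 image reduces to an 8x8 matrix filled with 16s
image = [[1] * 32 for _ in range(32)]
reduced = block_counts(image)
print(len(reduced), len(reduced[0]))  # -> 8 8
print(reduced[0][0])                  # -> 16
```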

Kernel methods are appealing because they can be applied directly to problems which would require significant data pre-processing (such as dimensionality reduction) and extensive knowledge about the structure of the data being modeled. Even if we know little about the data, a direct application of Kernel methods (with less preprocessing) often finds interesting results. Achieving optimality using Kernel methods can become, however, a very difficult task because we have an infinite choice of Kernel functions to choose from - and for each kernel, an infinite parameter space to tweak.

The following code shows how a Multiclass (Kernel) Support Vector Machine is instantiated. Note how the inputs are given as full vectors of 1024 positions. This would be impractical if we were using Neural Networks, for example. Kernel methods in general, however, have no trouble processing high-dimensional data, because they do not suffer from the curse of dimensionality.

To start the learning process, click the button "Start training". Using the default settings, it should not take too long. Since the code uses the Parallel Extensions from AForge.NET, the greater the number of cores, the faster the training.

Training a Multi-class Support Vector Machine using the parallelized learning algorithms of the Accord.NET Framework.

After the training is complete, click "Classify" to start the classification of the testing set. Using the default values, it should achieve up to 95% accuracy, correctly identifying around 475 of the 500 available instances. The recognition rate on the testing set may vary a little depending on the learning algorithm's run.

Results from the Kernel Support Vector Machine learning algorithm.

For comparison purposes, the same dataset and the same number of training and testing samples were used as in the previous Kernel Discriminant Analysis article. Since SVMs are more efficient in both processing time and memory requirements, higher accuracy rates could probably be achieved by using more samples from the dataset.

After training, the created SVMs can be seen in the "Machines" tab. The support vectors and the bias (threshold) for each machine can be seen by selecting one of the entries in the first DataGridView. The darker the vector, the more weight it has in the decision process.

Details about Support Vector Machines, including their support vectors.

Even though the gain in recognition rate was just over 3%, this is a considerable improvement over KDA. By clicking on the "Classification" tab, we can manually test the Multi-class Support Vector Machine on user-drawn digits.

We can see that the SVM method produces much more robust results, as even badly drawn digits can still be recognized accurately:

In this article, we detailed and explored how (Kernel) Support Vector Machines can be applied to the problem of handwritten digit recognition, with satisfactory results. The suggested approach does not suffer from the limitations of Kernel Discriminant Analysis, and also achieves a better recognition rate. Unlike KDA, SVM solutions are sparse, meaning only a generally small subset of the training set is needed during model evaluation. This also means the complexity of the evaluation phase is greatly reduced, since it depends only on the number of vectors retained during training.

One of the disadvantages of Support Vector Machines, however, is the multitude of methods available for performing multi-class classification, since SVMs cannot be applied directly to such problems. Nevertheless, here we discussed and exemplified how to employ a one-against-one strategy to produce accurate results even when the data set is unbalanced.

As with KDA, another problem that arises with Kernel methods is the proper choice of the Kernel function (and the tuning of its parameters). This problem is often tractable with grid search and cross-validation, which are themselves very expensive operations, both in terms of processing power and of the training data available. Nonetheless, these methods can easily be parallelized, and a parallel implementation of grid search is also available in the Accord.NET Framework.
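As an illustration of the procedure (not of the Accord.NET implementation), the sketch below evaluates a small parameter grid with k-fold cross-validation. A toy kernel classifier stands in for a full SVM; the structure (a parameter grid, folds, and averaged accuracy) is the point:

```python
import math

def gaussian_kernel(a, b, sigma):
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2 * sigma ** 2))

def kernel_nearest_mean(train, labels, x, sigma):
    # Toy classifier: predict the class whose training points have the
    # highest mean kernel similarity to x (a stand-in for a real SVM)
    scores = {}
    for xi, yi in zip(train, labels):
        scores.setdefault(yi, []).append(gaussian_kernel(xi, x, sigma))
    return max(scores, key=lambda c: sum(scores[c]) / len(scores[c]))

def cross_validate(data, labels, sigma, folds=2):
    # k-fold cross-validation: each fold is held out once for testing
    correct = 0
    for f in range(folds):
        train_idx = [i for i in range(len(data)) if i % folds != f]
        test_idx = [i for i in range(len(data)) if i % folds == f]
        for i in test_idx:
            pred = kernel_nearest_mean([data[j] for j in train_idx],
                                       [labels[j] for j in train_idx],
                                       data[i], sigma)
            correct += (pred == labels[i])
    return correct / len(data)

# Toy two-class data and a small grid over the kernel parameter
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]
labels = [0, 0, 0, 1, 1, 1]
grid = [0.1, 0.5, 1.0, 2.0]

# Grid search: keep the parameter with the best cross-validated accuracy
best_sigma = max(grid, key=lambda s: cross_validate(data, labels, s))
print(best_sigma, cross_validate(data, labels, best_sigma))
```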

If you would like to hire good developers to build your dream application, please check out DaitanGroup, one of the top outsourcing companies in Brazil. This company, located in Brazil's Silicon Valley but with US-based offices, has extensive experience developing telecommunications software for companies large and small worldwide.

Comments and Discussions

Dear Mr. Souza, I want to learn about Support Vector Machines. Through your article I found the Accord.NET homepage. I downloaded "framework-3.0.0". As I understood it, it is not necessary to install AForge.NET. I want to do some tests with regression (SVMs). I work with Microsoft Visual Studio Ultimate 2013. When I try to compile "Accord.NET.LGPL (35 projects)", the error message says "Framework-3.0.0\Setup\Scripts\Accord.dll.targets" was not found.

I will need probably more help in future to run your software. Thank you in advance.

Hello: I am a student in computer engineering. I've built an SVM network and I want to recognize people's faces. I used the SMO algorithm to obtain the alpha values, and I extract features using a Legendre polynomial of order 4. When I train the SVM model, the error rate is very high. My question is: what is the best kernel function for this work, and what are the best parameters? Best regards

In December you were kind enough to add a composite kernel for Gaussian and DynamicTimeWarping. Could you make the other kernels composite as well? This would give the other kernels the ability to work with the DynamicTimeWarping kernel, too. Given the results I have received from the Gaussian composite kernel, I am very excited to see the results from extending this to the other kernels. Thanks in advance for your reply.

When using the DynamicTimeWarping kernel, setting the degree to the default of 1 defines the kernel as linear, given that the DynamicTimeWarping kernel is tied to the Polynomial kernel. However, if I want the DynamicTimeWarping kernel to function as a non-linear kernel, what should the degree property be set to?

To make it non-linear, it should suffice to set the degree to anything higher than 1. Often setting it to 2 would already be a good choice. It is also possible to apply any other non-linear function to the DTW output and pass this combined function to a Custom[^] kernel, but I suppose just increasing the polynomial degree would already suffice.

I have looked through your documentation trying to find an example of how to apply a non-linear kernel function to the output of a DTW kernel function, using a Custom kernel. I have not been able to find this example. Would you create an example for this?

Better yet than a Custom kernel, I would suggest creating a new kernel function (a new class) that contains the DTW kernel as a field. Then, in the implementation of the Function method, call it like this:
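A minimal sketch of the idea, written in Python for illustration (the actual Accord.NET kernels are C# classes, and all names below are hypothetical): the composite kernel keeps the inner kernel as a field and applies a Gaussian to the distance that the inner kernel induces.

```python
import math

class LinearKernel:
    # Stand-in for the inner kernel (e.g. DTW, which is not reproduced here)
    def function(self, x, z):
        return sum(xi * zi for xi, zi in zip(x, z))

class GaussianOfInnerKernel:
    def __init__(self, inner, sigma=1.0):
        self.inner = inner        # the inner kernel kept as a field
        self.sigma = sigma

    def function(self, x, z):
        # k(x, z) = exp(-d(x, z) / (2 sigma^2)), where
        # d(x, z) = k_in(x, x) - 2 k_in(x, z) + k_in(z, z) is the squared
        # distance induced by the inner kernel in its feature space
        d = (self.inner.function(x, x)
             - 2 * self.inner.function(x, z)
             + self.inner.function(z, z))
        return math.exp(-d / (2 * self.sigma ** 2))

k = GaussianOfInnerKernel(LinearKernel(), sigma=1.0)
print(k.function([1.0, 0.0], [1.0, 0.0]))  # -> 1.0 (identical inputs)
```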

However, this wouldn't give you the Estimate methods of the Gaussian kernel, which would be quite interesting in this case. I will see if I can write a proper class for this kernel with the Gamma estimation part, but the above should already work.

If you would write a proper class, it would be greatly useful. I have used your Accord.NET library with great success! I have created a machine that achieves an accuracy of 91% matching stock patterns on approximately 3 years of stock data. I have automatically generated more than 12,000 sequences and 800,000 observations. I am training on 30% of the data and testing on the remainder.

Click to see the accuracy image: http://www.findmynexttrade.com/codeproject/StockAccuracy.png

I am planning on publishing a Code Project article on using the Accord.NET library to classify stock patterns. The key to success has been automatically generating a large enough set of sequences with sufficient observations.

Your proper class that combines both DTW and Gaussian or ANOVA (RBF) will provide me with the kind of hybrid kernel needed to improve accuracy on fewer observations. I noticed that when using either the Gaussian or RBF kernel, it takes much longer to generate the PCA from that kernel when using the KernelPrincipalComponentAnalysis class. Is there a way to make the hybrid kernel less computationally intensive?

However, it may also require some bits from the updated versions of the classes DynamicTimeWarping[^] and the normal Gaussian[^]. If it proves to be too difficult to replace those classes, I can also generate some new temporary binaries with them.

I have run tests on the newly created generic/composite Gaussian class. I have used the DTW as a composite kernel. The results with the new class are far more accurate than using the DTW alone. I am using the overloaded constructor (kernel: DTWKernel, sigma: 3).

The new approach to the kernel should be extended to all of the kernels (in my humble opinion).

When attempting to use multiple classes with the ProbabilisticOutputCalibration class, the ProbabilisticOutputCalibration constructor will not accept a MulticlassSupportVectorMachine as claimed in the documentation. Is there a way to add this support?

I am working on a handwriting recognition application for regional languages. For that, I am developing a user interface to create a database. I am working with the InkCanvas in a WPF application. I want to save the InkCanvas data as (x, y) coordinates in a text file on a button click. Can you guide me, please?

I am attempting to load a saved machine using the following code:

svm = MulticlassSupportVectorMachine.Load(@"C:\DevTools\C# Accord Scientifc Framework\New\TechnicalAnalysis\AWSMachines\AWSMachine.aws");

When I call the following code I get a "null reference" error message.

Thanks for the message. Was the machine saved with a previous version of the framework? The null reference must be due to a link function not being specified. If it could be possible, could I get one copy of the AWSMachine.aws file you have so I can investigate? You can send it through the project's GitHub issue tracker[^] if you want, or personally to my email if you don't want to make it public.

I had sent you an email yesterday (through Gmail) with the updated binaries fixing the locals issue in the Dynamic Time Warping kernels. Unfortunately I mistakenly included the wrong link in that email. I am sending the email through CodeProject now, in case you didn't receive it. The updated binaries can be found here:

Dear César de Souza, we preferred to use your code for our experiments in neural networks. In this regard, we would like to know which neural network simulator is suitable for the code, as we are using a Windows 64-bit OS. Hope to hear from you soon.

However, if you are searching for a true neural network simulator, i.e. something that really models the biological aspects of the neurons, including chemical synapses and frequency activation signals, perhaps you could take a look at Emergent[^]. It is mainly focused at providing biologically-plausible neural network models.

Hi César de Souza, thanks for your post. I have a question about Optdigits-test; can you help me? I don't understand why the test data format is an integer in the range 0..16 while the training data format is 0..1. For each raw Optdigits test sample, how is it pre-processed into integers in the range 0..16? Why didn't you set the raw test data to 0..1, the same as the training data? Thank you very much!

Fantastic! Thank you for sharing this perfect project. I'm a student and hope one day I can code like you. Just a question: I couldn't understand the format of the test file "optdigits-tes.txt". For example, in line one, what do numbers like 15 or 13 refer to? Could you give a description?

In fact, the project uses only the 'optdigits-tra.txt' text file to create both the training and testing sets. The 'optdigits-tes.txt' file is not used, but I ended up including it in the sources as well. If I am not very mistaken, each row in this file contains a 'grouped pixel' representation of a digit. Instead of representing each digit as a 32x32 binary matrix containing only zeros and ones, this file represents each digit by counting how many pixels are set in each 4x4 region of the image.

So, for example, if we have 32x32 bitmaps, we separate them into 64 blocks of size 4x4. Then we count how many pixels are set to 1 inside each block. The first element of the row will then contain the number of active pixels in the first block. We do the same for every block, and end up with a feature vector of 64 positions containing values from 0 to 16. The last value in the row is the class label for the digit.

Error 1 'int' does not contain a definition for 'Compute' and no extension method 'Compute' accepting a first argument of type 'int' could be found (are you missing a using directive or an assembly reference?) ...1\Samples\MachineLearning\Handwriting\MainForm.csline no-338 project-Handwriting (SVM)

I am working in the field of image annotation and I have used Accord.NET functions. I extracted features and classified with an SVM, but I have 5 classes and the output always returns the first class. Here is the link to an image below. Please help me. http://www.8pic.ir/images/19598223966679296049.png

I'm doing research using SVMs, but I want to understand the kernels so that I can make my paper better.

I saw that the documentation of Accord.NET's Gaussian kernel says the Gaussian.Estimate() method is based on the paper by Caputo, Sim, Furesjo and Smola, "Appearance-based object recognition using SVMs: which kernel should I use?", 2002.

But it was so hard to find the paper that I have to turn back to you. Could you send that paper to me? If you can, I will contact you personally so you can email it to me.

It is so gracious of you to answer my questions and point out my mistakes. I wrote the numbers at the scale I felt comfortable with; maybe that is what caused the inaccuracy. But it is disappointing that I cannot open YouTube, as it is not available anywhere in the PRC (People's Republic of China). And it is not convenient for me to pay for your app online in my country, although I very much want to support you.

The other app I linked was not mine, it was from a user who used the framework to create an application to read digits using Windows Phone. It is called Point-and-Call, there was also a video on Youtube[^] about it. Sorry that you can't see the videos, though :(

It's so nice of you, and I think I will benefit a lot from your work. Actually, I am now writing a paper for my degree, and I am looking for methods and apps to support my work. Given your significant work, may I have your e-mail, please? That would make it more convenient for us to stay in contact.

The results of the recognition are not good enough. It could not distinguish 0 from 9 most of the time. Maybe the samples are wrong. Could you please test it again and help me make it better? I hope this can help me with a lot of my work.

1) Could not find type 'Handwriting.Canvas'. Please make sure that the assembly that contains this type is referenced. If this type is a part of your development project, make sure that the project has been successfully built using settings for your current platform or Any CPU.

2) The variable 'canvas' is either undeclared or was never assigned.

3) Could not find type 'ZedGraph.ZedGraphControl'. Please make sure that the assembly that contains this type is referenced. If this type is a part of your development project, make sure that the project has been successfully built using settings for your current platform or Any CPU.

4) The variable 'graphClassification' is either undeclared or was never assigned.

Hi César, my name is Vanessa Manni and I'm a student. I have seen your work with MulticlassSupportVectorMachine. Thanks to your tutorial, I have created a program that can recognize 177 numbers out of 300 in bitmap images. I would like to ask you the meaning of the parameters: Complexity, Epsilon, Tolerance, Degree, and Constant. How could they improve my results? I have used a Polynomial kernel. Can the Gaussian kernel improve my results? Can you upload your file for SVM learning?

Hi César de Souza, it's great work. I have read your articles Handwriting Recognition Revisited: Kernel Support Vector Machines and Handwriting Recognition using Kernel Discriminant Analysis. I want to implement an SVM in my work, but I don't understand the principle. My work is about recognizing handwritten numerals via webcam; I have implemented the Freeman chain code and the Fourier descriptor for all numerals. The problem is how to apply an SVM when, for example, 48 frequencies represent one numeral '0'. If I have 12 samples of '0', then for the whole database of '0' we have 12*48 values, with every '0' represented by 48 frequencies.

I needed a Windows Phone version of SVM and found your solution. Your solution is fast. Good work!

To get it to work on Windows Phone, I had to remove the AForge.Parallel.For statements (and replace them with normal for loops). AForge cannot be used on Windows Phone; all code must be managed. Of course it is now a little bit slower, but fast enough.

I'm soon going to publish my solution in the Windows Phone Marketplace, and I'm thinking about licensing. It's a little bit confusing that the code is licensed under the CPOL, while the source code says the license is GNU LGPL v2.1 or later. Microsoft has banned LGPLv3, so it's not possible to use it. Should I state in my application that the SVM is licensed under both the CPOL and GNU LGPL v2.1? Of course you will be credited as well. Do you want me to add a reference (HTML link) to the original SVM solution (this article) in my app too? What do you think?

This is great! I can grant you another free (non-GPL) license if you need one (and for the most recent version, which is also faster than the version available here on CP). Please send me an email at cesarsouza at gmail dot com so we can discuss it.