Step 3: Prepare the data

Every time you want to classify text, you will need to prepare your data. As this is a language-agnostic process, I created a separate page for it: How to prepare your data for text classification? Check it out before reading the remainder of this SVM tutorial!

Step 4: Read the data

The document-term matrix is saved as a CSV file.
It can easily be read in C#.
To do this we will use another NuGet package called CsvReader.
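
If you want to see what the reading step boils down to, here is a minimal sketch that uses only the .NET base library as a stand-in. It is not the CsvReader package API. It assumes a layout that the article does not spell out: a header row, the class label in the first column, and numeric term counts in the remaining columns. The names DocumentTermCsv and Parse are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class DocumentTermCsv
{
    // Parses CSV lines into class labels and feature rows.
    // Assumed layout (not confirmed by the article): header row first,
    // class label in column 0, numeric term counts in the other columns.
    // The naive Split does not handle quoted fields containing commas.
    public static (List<double> Labels, List<double[]> Rows) Parse(IEnumerable<string> lines)
    {
        var labels = new List<double>();
        var rows = new List<double[]>();

        foreach (var line in lines.Skip(1)) // skip the header row
        {
            if (string.IsNullOrWhiteSpace(line)) continue;

            var fields = line.Split(',');
            labels.Add(double.Parse(fields[0], CultureInfo.InvariantCulture));
            rows.Add(fields.Skip(1)
                           .Select(f => double.Parse(f, CultureInfo.InvariantCulture))
                           .ToArray());
        }

        return (labels, rows);
    }
}
```

With a file on disk you would call it as `DocumentTermCsv.Parse(File.ReadLines(path))`, where `path` points to your own CSV file. A real CSV library remains the better choice as soon as fields can contain commas or quotes.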

49 thoughts on “How to classify text using SVM in C#”

Hi Alexandre, I'm quite inexperienced with text classifiers and I'm looking for something super simple, so that I can pass a set of text documents (all belonging to the same subject matter) to train the system and then pass another text document to get a probability that it belongs to the same subject matter. It doesn't seem that this is something SVM can do very easily. Do you know a simple solution for that (preferably C#/.NET)? Thank you.

Hello Marcelo. You can use SVM to predict probabilities. In libsvm.net you just need to give the value 1 to the probabilities parameter of the C_SVC class, then you will be able to call the PredictProbabilities method.

It would be cool to have a simple example project where you divide documents into categories like "spam" and "no spam". Like Marcelo Calbucci, I am inexperienced with libsvm too. The data preparation for libsvm in particular looks complicated to me.

Hello Alexandre KOWALCZYK sir, I am a student and I found your blog quite useful.
I am currently working on a project where we need to find the missing letter in a given string, for example th*n --> then. Can you help me understand how SVM can solve this problem?
Thank you.

If for each word with a missing letter you assign a class number corresponding to the missing letter (1 to 26) then it is a multi-class classification problem. Then you use SVM in the one-vs-all approach to predict which is the missing letter.
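
To make the labelling concrete, here is a minimal sketch of how the training examples could be generated. It only covers the labelling step: each masked word still has to be turned into a numeric vector before an SVM can use it. The names MissingLetterLabels, ClassOf and ExamplesFor are hypothetical, and the sketch assumes lowercase words using the letters a to z.

```csharp
using System;
using System.Collections.Generic;

public static class MissingLetterLabels
{
    // Maps a lowercase letter to a class number: 'a' -> 1, ..., 'z' -> 26.
    public static int ClassOf(char letter)
    {
        if (letter < 'a' || letter > 'z')
            throw new ArgumentOutOfRangeException(nameof(letter));
        return letter - 'a' + 1;
    }

    // Builds one training example per position by replacing that letter
    // with '*' and using the class of the removed letter as the label.
    public static List<(string Masked, int Label)> ExamplesFor(string word)
    {
        var examples = new List<(string Masked, int Label)>();
        for (int i = 0; i < word.Length; i++)
        {
            string masked = word.Substring(0, i) + "*" + word.Substring(i + 1);
            examples.Add((masked, ClassOf(word[i])));
        }
        return examples;
    }
}
```

For the word "then", the third example is the pair ("th*n", 5), because "e" is the fifth letter of the alphabet.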

Thanks for your comment Alexandre. I just found that I did not share the source code for this article. You can find it on GitHub. Note: be sure to change the path in the Program.cs file to match your directory structure. I hope this will help you get started.

No, I don't know of a tutorial for multi-class classification. There is not much to say about it, as it is usually performed using the one-vs-all approach. There is an option in libsvm to indicate that you want to perform multi-class classification.

Well, I am afraid this question is too broad. I would probably first look for research papers on the internet to see if someone has done that before. Then I would probably try Convolutional Neural Networks, as they are currently the state of the art for image classification. You have to pick the best tool for the task at hand, and I would not recommend SVM for this one. If you are forced to stick with C#, I would say take a look at the Accord.Net library.

Hi Alexandre,
My .csv file contains more than one column (you have only one column here, 'text', in your csv file), so in my case the 'List x' will have all the columns except the class column. Could you please give me a hint on how to achieve this?

Hello Rajni. It does not really matter how many columns you have. In the end, keep in mind that the text column is transformed into a vector X. So if you have several columns, you just have to find a way to transform them into a vector X which describes the problem at hand.
Regards,
Alexandre

Hello. You would need to extract the data from your xml file to create the x and y vectors, then call the CreateProblem method as in the article. To do SVM with C# nowadays I recommend Accord.Net. They have a lot of documentation that should help you.

Hi Alexandre,
I want to classify tweets as positive or negative based on emoticons as sentiment labels.
Happy face :), :-), :D, :-)), etc. can be mapped to positive sentiment, while sad face :(, :-(, etc. can be mapped to negative sentiment.
How can I build a classifier model using libsvm?
Thank you!

Hi. You would need to label each tweet as being positive or not, then train your SVM to classify. But what you are trying to do seems too simple for that. There is only a limited set of emoticons, so you could just write a function checking whether the "positive" emoticons are in the tweet, and you would not need machine learning at all.
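
Such a function could look like the sketch below. It uses the emoticons from the question; the class name EmoticonSentiment, the method name Classify and the return convention (+1, -1, 0) are my own assumptions for the example.

```csharp
using System;
using System.Linq;

public static class EmoticonSentiment
{
    // Emoticon lists taken from the question; extend them as needed.
    // ":-))" needs no entry of its own because it contains ":-)".
    private static readonly string[] Positive = { ":)", ":-)", ":D" };
    private static readonly string[] Negative = { ":(", ":-(" };

    // Returns +1 for a positive tweet, -1 for a negative one,
    // and 0 when no emoticon is found or both kinds are present.
    public static int Classify(string tweet)
    {
        bool hasPositive = Positive.Any(e => tweet.Contains(e));
        bool hasNegative = Negative.Any(e => tweet.Contains(e));

        if (hasPositive == hasNegative) return 0;
        return hasPositive ? 1 : -1;
    }
}
```

Tweets for which the function returns 0 are the ones where a trained classifier could still be useful, since the emoticons alone do not decide the label.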

Hi Alexandre,
I've finished reading your tutorials on SVM. Great job, clear and to the point (although I had some doubts, which I raised in specific tutorials).

I would like to ask you if you have any suggestions about my task.
The task is to build a model that predicts the labels of a deterministic finite automaton (accepting or rejecting, 0 or 1, that is, whether a string ends in an accepting or a rejecting state).
(A sample has the form 01010001, for example, or 1100 if the alphabet is binary. But the alphabet can change and have 4 symbols, for example [0 1 2 3]. In that case a sample string can be 32201100 or 3333333332221110...)

With SVM you will be able to classify the data. So if the task is "given a string, predict whether it ends in an accepting or a rejecting state", it might be possible to use it.
1) I recommend libsvm, but other libraries might suit your needs better. You have to read the documentation of each library to see whether it provides functionality that is useful for you.
2) You cannot use samples that are not of the same length. One possibility would be to automatically pad all samples to the length of the longest one. You have to figure out the best approach for your domain. You could, for instance, start all the strings with a special character. The problem is the same as "how to deal with missing data": in some cases people take an average, in others the most frequent value. It really depends on the problem.
3) I do not know enough about the domain to answer this question.
4) When the data is not linearly separable, a Gaussian kernel is often recommended.
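
Here is a minimal sketch of the padding idea from point 2, assuming the samples are strings of digit symbols as in the question. The names SamplePadding, PadToLongest and ToFeatures are hypothetical. Encoding the padding symbol as -1 is just one possible choice, and you may well find a better encoding for your domain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SamplePadding
{
    // Pads every sample on the left with padSymbol so that all samples
    // end up with the length of the longest one.
    public static List<string> PadToLongest(IEnumerable<string> samples, char padSymbol)
    {
        var list = samples.ToList();
        int maxLength = list.Count == 0 ? 0 : list.Max(s => s.Length);
        return list.Select(s => s.PadLeft(maxLength, padSymbol)).ToList();
    }

    // Turns a padded sample into numeric features: each digit symbol maps
    // to its numeric value, and the padding symbol maps to -1 (an assumption).
    public static double[] ToFeatures(string padded, char padSymbol)
    {
        return padded.Select(c => c == padSymbol ? -1.0 : (double)(c - '0')).ToArray();
    }
}
```

With '#' as the special character, the samples 1100 and 01010001 become ####1100 and 01010001, so both can be described by a vector of 8 features.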

First of all, thank you very much. There is not really much information available about SVM in C#. I have one question: in your code, you always train your SVM at the beginning. Is it also possible to store the trained model and just load it before doing a prediction? I want to design an email classifier, and I don't want to train the SVM before each classification. I don't think it's necessary.

Thanks for the great tutorial. Just one question on the text preparation: In your page you use a matrix that is based on the words + number of times the word appears. Is there a way to use SVM in scenarios where the word order is important? Could I use a matrix that always uses a count of 1, but has the same word appear multiple times within the matrix?
thx