In my opinion, it is a good idea to get familiar with both the Explorer and the command-line interface if you want to get a feel for the amazing power of this data mining library. However, it is in your own Java programs that you can take full advantage of its power. Now it is time to deal with it. The process involves two steps:

Representing your text database in a way that enables learning, and training a classifier on it.

Using the classifier to predict text labels of new, unseen documents.

The first step is a batch process, in the sense that you can run it periodically (as long as your labelled data set improves with time: bigger sizes, new labels or categories, corrected predictions via user feedback). The second step is the moment in which you actually take advantage of the knowledge distilled by the learning process, and it is online in the sense that it is done on demand (when new documents arrive). This distinction is conceptual: in practice, modern text classifiers retrain on newly added documents as soon as they get them, in order to keep or improve accuracy over time.

In consequence, what we need to demonstrate the text classification process are two programs: one to learn from the text dataset, and another to use the learnt model to classify new documents. Let us start by showing a very simple text learner in Java, using WEKA. The class is named MyFilteredLearner.java, and its main() method demonstrates its usage, which involves:

Loading the text dataset.

Evaluating the classifier.

Training the classifier.

Storing the classifier.

The most interesting parts of the process are:

We read the dataset by simply using the method getData() of an ArffReader object that wraps a BufferedReader.
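A minimal sketch of that loading step (the DatasetLoader class name is mine, not part of the original code; MyFilteredLearner does the same thing wrapping a FileReader):

```java
import java.io.BufferedReader;
import java.io.Reader;
import weka.core.Instances;
import weka.core.converters.ArffLoader.ArffReader;

// Hypothetical helper class; the original code does this inside MyFilteredLearner
public class DatasetLoader {

    // Reads an ARFF dataset from any Reader (a FileReader in the original code)
    public static Instances load(Reader source) throws Exception {
        BufferedReader reader = new BufferedReader(source);
        ArffReader arff = new ArffReader(reader);
        Instances data = arff.getData();
        reader.close();
        return data;
    }
}
```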

We programmatically create the classifier by combining a StringToWordVector filter (in order to represent the texts as feature vectors) and a NaiveBayes classifier (for learning), using the FilteredClassifier class discussed in previous posts.

The process of creating the classifier is demonstrated in the next code snippet:
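Here is a sketch of that snippet, with class and variable names of my own (see MyFilteredLearner.java for the exact code):

```java
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Hypothetical helper class; the original code does this inside MyFilteredLearner
public class LearnerSketch {

    // Builds and trains the filter + classifier combination described below
    public static FilteredClassifier build(Instances trainData) throws Exception {
        trainData.setClassIndex(0);              // the class is the first attribute
        StringToWordVector filter = new StringToWordVector();
        filter.setAttributeIndices("last");      // the text is the last attribute
        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(filter);
        classifier.setClassifier(new NaiveBayes());
        classifier.buildClassifier(trainData);
        return classifier;
    }
}
```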

So we set the class of the dataset to be the first attribute; then we create the filter and set the attribute to be transformed from text into a feature vector (the last one); and then we create the FilteredClassifier object, combining the previous filter with a new NaiveBayes classifier. Given this setup, the dataset must have the class as its first attribute and the text as its second (and last) one, as in my typical SMS spam subset example (smsspam.small.arff).

You can execute this class with the following commands to get the following output:

In case you do not want to evaluate the classifier on the training data, you can omit the call to the evaluate() method.
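For reference, here is a sketch of what such an evaluate() method can look like when evaluating on the training data (the class and method names are mine, not the original ones):

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

// Hypothetical helper class; MyFilteredLearner has its own evaluate() method
public class EvaluatorSketch {

    // Evaluates a trained classifier on the training data itself and
    // returns the textual summary (note: a resubstitution estimate is optimistic)
    public static String evaluateOnTraining(Classifier classifier,
                                            Instances trainData) throws Exception {
        Evaluation eval = new Evaluation(trainData);
        eval.evaluateModel(classifier, trainData);
        return eval.toSummaryString();
    }
}
```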

Now let us deal with the classification program, which is more complex, but only because of the process of creating an instance. The class is named MyFilteredClassifier.java, and its main() method demonstrates its usage, which involves:

Reading the text to be classified from a file.

Reading the model or classifier from a file.

Creating the instance.

Classifying it.

Creating the instance is performed in the makeInstance() method, and its code is the following one:
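A sketch of that method, wrapped in a hypothetical helper class (the original makeInstance() may differ in details):

```java
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

// Hypothetical helper class; the original code lives in MyFilteredClassifier
public class InstanceMaker {

    // Wraps the text in a one-instance dataset with the expected two-attribute
    // schema; the class value is deliberately left missing ("?")
    public static Instance makeInstance(String text) {
        // Nominal class attribute with the two possible labels
        FastVector classValues = new FastVector(2);
        classValues.addElement("spam");
        classValues.addElement("ham");
        // Attribute list: the class first, then the String attribute for the text
        FastVector attributes = new FastVector(2);
        attributes.addElement(new Attribute("class", classValues));
        attributes.addElement(new Attribute("text", (FastVector) null));
        // The dataset carries the schema; the class is the first attribute
        Instances instances = new Instances("Test relation", attributes, 1);
        instances.setClassIndex(0);
        // Create the instance with all values missing, then fill in the text
        Instance instance = new DenseInstance(2);
        instance.setValue(instances.attribute(1), text);
        instances.add(instance);
        return instances.instance(0);
    }
}
```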

The classifier learnt with MyFilteredLearner.java expects an instance with two attributes: the first one is the class, a nominal attribute with the values "spam" and "ham"; the second one is a String, the text to be classified. Instead of creating a single instance, we create a whole new dataset whose first instance is the one that we want to classify. This is required in order to let the classifier know the schema of the dataset, which is stored in the Instances object (and not in each individual instance).

So first we create the attributes by using the FastVector class provided by WEKA. The case of the nominal attribute ("class") is relatively simple, but the case of the String one is a bit more complex, because it requires the second argument of the constructor to be null, cast to FastVector. Then we create an Instances object, using a FastVector to store the two previous attributes, and set the class index to 0 (which means that the first attribute will be the class). As a note, the FastVector class is deprecated in the WEKA development version.

The last step is to create an actual instance. I am using the WEKA development version in this code (as of the date of this post), so we have to use a DenseInstance object. However, if you make use of the stable version, then you can use Instance (link to the stable version doc), and must change this code to:

Instance instance = new Instance(2);

As a note, I have commented in the code a different way of setting the value of the second attribute. Note also that we do not set the value of the first attribute, as it is unknown.

And if you feed this classifier with a file (smstest.txt) that stores the text "this is spam or not, who knows?", and the model learnt with MyFilteredLearner.java (that is stored in myClassifier.dat), then you get the following result:

It is interesting to see that the class assigned to the instance before classifying it is "?", which means undefined or unknown.

For those interested in using the classifiers discussed in my previous posts (including AttributeSelection, and using PART and SMO as classifiers), the only parts of this code that you have to change are the learn() and evaluate() methods in MyFilteredLearner.java. Just play with it, and have fun.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topic!

Just FYI (and you can probably just delete this comment, no need to put it up if you don't want to) this one is still throwing 404s: http://www.esp.uem.es/jmgomez/tmweka/MyFilteredClassifier.java but I found it based on the other URL; it should be: https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/MyFilteredClassifier.java

Hi, you've got a great post, but I got an error loading the model file (.model file). I'm using Naive Bayes Multinomial with the StringToWordVector filter. I used the WEKA Explorer to save the model file.

Yes. This code assumes that you have the raw text (e.g. ["this is my text",label] instances), so it is required to use a FilteredClassifier that first applies the StringToWordVector filter to the text (to get a word-weights vector representation), and then applies the classifier to the word-based representation. The FilteredClassifier does this in a smooth fashion.

In my previous post: http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html I show how to use a FilteredClassifier in the WEKA Explorer. Once you have it, you can save it as a FilteredClassifier.

You can create your ARFF files with scripts, as the output of other programs, etc. There are many ways; it depends on the source of your data. If the data is going to be very large, you may consider using a database and the appropriate connectors in WEKA.

Thank you for your response, sir. Actually, I am a student doing my final year project, which is to identify the disease-treatment relation in short text. As an initial task I have to annotate the sentences as informative and non-informative. Before that I have to do the tagging part. Now my question is: should I give the tagged base words as my input for creating the ARFF file, or are normal sentences enough? Which one will provide the better result? Thanks in advance.

My experience is that if you have the sentences tagged, applying the StringToWordVector filter and then AttributeSelection with Ranker and Information Gain will give you which words are most valuable to predict if a sentence is informative or not.

Then the StringToWordVector filter will give you the words, and after that the AttributeSelection filter will rank those words according to how good they are as predictors. Beware: it could be the case that a word is not very "informative" (that is, a good predictor of your positive class) but very "non-informative" (that is, a good predictor of your negative class).
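As an illustrative sketch (the class name and option values, like the number of words to keep, are assumptions on my part), the two filters can be chained like this:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Hypothetical helper class showing the filter chain described above
public class SelectionSketch {

    // Chains StringToWordVector and AttributeSelection into a single filter,
    // ranking words by Information Gain with respect to the class
    public static MultiFilter buildChain() {
        StringToWordVector words = new StringToWordVector();
        words.setAttributeIndices("last");       // the text attribute

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100);              // illustrative: keep the top 100 words
        selection.setSearch(ranker);

        MultiFilter chain = new MultiFilter();
        chain.setFilters(new Filter[] { words, selection });
        return chain;
    }
}
```

A chain like this can be set as the filter of a FilteredClassifier, so the same transformation is applied consistently to training and test data.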

To get the ARFF file, you can have two folders, one called "informative" with a sentence per file, and another one called "non-informative" with a sentence per file as well.
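If it helps, WEKA's TextDirectoryLoader can read such a folder layout directly; this sketch assumes the two folder names suggested above:

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

// Hypothetical helper class; folder names follow the suggestion above
public class DirectoryToArff {

    // Loads a directory whose subfolders ("informative", "non-informative")
    // become the class values; each file becomes one string instance
    public static Instances load(String directory) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File(directory));
        return loader.getDataSet();
    }
}
```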

This code completely depends on your training set. If you are using mine (smsspam.small.arff), it should be that way, although it is more likely to get the class ham, as it is the majority class. You can do the test by submitting a sentence from the dataset that is already labelled as spam.

I have counted the number of spam messages in smsspam.small.arff. I have found that there are only 33 spam lines in smsspam.small.arff, but after using the Java code, the output shows only 13 incorrectly classified instances. Is there something wrong with it?

No, there is nothing wrong. There are 33 spam instances in the dataset, and 167 ham instances. The error from evaluating on the 200 training instances is 6.5% (13 instances). That is, you train on the dataset with 200 examples, then you run on the same dataset and get 187 correctly classified instances and 13 mistakes; some of them will be of the class spam and some of them will belong to ham. That's all.

Obviously, it is more likely for the test to fail on a spam message, because there are few spams, so the classifier tends to predict the majority class (ham).

First I would like to say that your posts here are amazing, keep up the good work! I am using WEKA in my project now too (I am still a beginner), and I wish to use a topic model such as Latent Dirichlet Allocation. I have looked into the documentation, but there is no implementation of LDA. There are some APIs, such as LingPipe and Mallet, that allow LDA transformation. However, I do not know how I can get this representation into WEKA so I can classify the documents. Do you have any experience with this? Help is really appreciated!

Unfortunately, LDA is not implemented in WEKA. You can ask for it in the WEKA list at: http://list.waikato.ac.nz/mailman/listinfo/wekalist.

In a search, I have found this quote by Mark Hall:

"Q: I was looking for an LDA in Weka, but I didn't find it. Is there an LDA in Weka or something similar?

A: Weka doesn't have an implementation of LDA, but it does have a number of other methods that are arguably as good or better: multi-response linear regression, logistic regression, PCA, partial least squares regression and linear support vector machines."

Found in: http://list.waikato.ac.nz/pipermail/wekalist/2011-September/053397.html

If you are working with your own file, it is very likely that the error is caused by having a different classification problem (class type, for instance). A quick search will give you my email; please send the file to me (or a subset of it) if you want me to check it, as the code works perfectly with my sample files.

It is possible to get the probability for each of the class values or labels in the case of a classification problem (nominal class) using the distributionForInstance() method available in every classifier (see http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html#distributionForInstance(weka.core.Instance) ). Instead of calling classifyInstance() in line #116, you can call the previous method to get an array with the probabilities of each class value. Beware, not all classifiers produce robust class membership probabilities, so this depends on the base classifier that you are using inside the FilteredClassifier.
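As an illustrative sketch (the helper class and variable names are mine, not the ones in MyFilteredClassifier.java), the call can look like this:

```java
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

// Hypothetical helper class showing distributionForInstance() instead of classifyInstance()
public class DistributionSketch {

    // Prints the estimated probability of each class value for one instance
    // and returns the probability array
    public static double[] printDistribution(Classifier classifier, Instances schema,
                                             Instance instance) throws Exception {
        double[] dist = classifier.distributionForInstance(instance);
        for (int i = 0; i < dist.length; i++) {
            // dist[i] is the estimated probability of class value i
            System.out.println(schema.classAttribute().value(i) + ": " + dist[i]);
        }
        return dist;
    }
}
```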

However, if you want to get information about the internal probability calculations done during training, the only way I see to do this is to use a base classifier that makes use of probabilities (e.g. the NaiveBayes family), output the classifier as a String somewhere after training, and then post-process that output.

I used my files and all the functions work, but I'm having a problem with the last one, classify(); it shows "Problem found when classifying the text". Can you please tell me what the problem is?

First, I am using the version 3.7.9 (development version) in those tests.

Second, regarding the exception. You get that message because I catch the exception (lines 120-122 of MyFilteredClassifier.java). Just substitute line #121 with e.printStackTrace(); to get a more informative error message, and post it here if you are not able to solve it.

Most likely, the error is produced because either the model has not been previously learnt, or the training and test datasets are not compatible.

Thank you for your reply. How can I know if they are not compatible? I built them using the WEKA tool, not your MyFilteredLearner.java; does this cause the problem?

Also, I have replaced line #121 and I got this error:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:604)
    at java.util.ArrayList.get(ArrayList.java:382)
    at weka.core.Instances.attribute(Instances.java:341)
    at weka.core.AttributeLocator.locate(AttributeLocator.java:153)
    at weka.core.AttributeLocator.initialize(AttributeLocator.java:119)
    at weka.core.AttributeLocator.<init>(AttributeLocator.java:102)
    at weka.core.StringLocator.<init>(StringLocator.java:69)
    at weka.filters.Filter.flushInput(Filter.java:431)
    at weka.filters.unsupervised.attribute.StringToWordVector.batchFinished(StringToWordVector.java:768)
    at weka.classifiers.meta.FilteredClassifier.filterInstance(FilteredClassifier.java:474)
    at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:495)
    at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:70)
    at myfilteredclassifier.MyFilteredClassifier.classify(MyFilteredClassifier.java:117)
    at myfilteredclassifier.MyFilteredClassifier.main(MyFilteredClassifier.java:197)

I am afraid that the output is not very informative, so I cannot help you with this unless I have more information. In particular, a short sample of the training and testing files may be enough; however, it is required that you describe the process of generating the model in more detail: did you just use the Explorer? Which version? Which model (classifier)? Etc.

Hey Jose, thanks for this example. I tried it, but I have a problem. You suggested to switch the methods learn() and evaluate(). I did this and the training and evaluation work. But when I want to classify my own text after that, I get the following error:

java.lang.NullPointerException: No output instance format defined

I didn't see in your code where you set the output format. Do you know what I have to do?

Hi, This looks like an excellent demonstration of how to use Weka with java. But I have unfortunately experienced an issue right at the end:

I have copied and pasted your classes and used the example file formats for the training instances and the new instance, and I am using the WEKA developer version. The classifier is built, trained and evaluated correctly. But when I run the MyFilteredClassifier methods to load the instance, load the model, make the instance and classify it, it fails to classify the instance. I get the following error: No output instance format defined

This is the single line of my instance file: this is spam or not, who knows?

This is the start of my train ARFF file:

@relation sms_test
@attribute spamclass {spam,ham}
@attribute text String
@data
ham,'Go............................

Could you please let me know why this is happening, because I am using the exact code and file formats you have supplied. Thanks in advance.

Hi, I'm new to Weka and I'm implementing a movie classifier system based on genres for my project. I have a small question regarding your code. When you load the model, it seems that you load a "something.dat" file, but I am loading a "something.model" file previously created and saved using the WEKA Explorer. Can you tell me whether this is the reason why I keep getting errors in the classify() function? Thank you in advance.

It is strange; in principle you should be able to use a model file you have previously saved using the Explorer with my code, if the classifier is compatible (the same kind of FilteredClassifier with the same filters, classifier and so on). The name of the file does not matter...

I am afraid I cannot provide better guidance without more details...

Hello. I am new to Weka. I read and understood about classification, but I don't understand one thing about testing: I have 4 news categories, and I made an ARFF file, transformed it with StringToWordVector and classified it. Now I want to test one new text (one news item). How am I going to transform this basic text into a test set?

This was a really great way for me to understand how to get started with Weka, more than with any other tutorial I have come across. A million thanks for this! One question: your MyFilteredLearner class has an evaluate and a learn method, both of which perform mostly the same steps of initialising/setting options for many of the same variables. Can't this be handled in the main function itself? Or by declaring the classifiers globally and avoiding having to repeat the code in the learn() method?

@Adina - This post explains exactly that. You can apply the same configuration of the StringToWordVector filter properly to the test set by using a FilteredClassifier.

@Kikazz - You are right, that code can be factored out into the main function or another "initialization" one. My purpose was to let you easily delete the method you don't need without losing the one you do need, while at the same time keeping all the code for evaluation or training together. But it is better the way you propose.