How to Run Your First Classifier in Weka

Weka makes learning applied machine learning easy, efficient, and fun. It is a GUI tool that allows you to load datasets, run algorithms and design and run experiments with results statistically robust enough to publish.

In this post, I want to show you how easy it is to load a dataset, run an advanced classification algorithm and review the results.

If you follow along, you will have machine learning results in under 5 minutes, and the knowledge and confidence to go ahead and try more datasets and more algorithms.

1. Download Weka and Install

Visit the Weka Download page and locate a version of Weka suitable for your computer (Windows, Mac, or Linux).

Weka requires Java. You may already have Java installed and if not, there are versions of Weka listed on the download page (for Windows) that include Java and will install it for you. I’m on a Mac myself, and like everything else on Mac, Weka just works out of the box.

If you are interested in machine learning, then I know you can figure out how to download and install software into your own computer. If you need help installing Weka, see the following post that provides step-by-step instructions:

Need more help with Weka for Machine Learning?

2. Start Weka

Start Weka. This may involve finding it in program launcher or double clicking on the weka.jar file. This will start the Weka GUI Chooser.

The Weka GUI Chooser lets you choose one of the Explorer, Experimenter, KnowledgeExplorer and the Simple CLI (command line interface).

Weka GUI Chooser

Click the “Explorer” button to launch the Weka Explorer.

This GUI lets you load datasets and run classification algorithms. It also provides other features, like data filtering, clustering, association rule extraction, and visualization, but we won’t be using these features right now.

3. Open the data/iris.arff Dataset

Click the “Open file…” button to open a data set and double click on the “data” directory.

Weka provides a number of small common machine learning datasets that you can use to practice on.

Select the “iris.arff” file to load the Iris dataset.

Weka Explorer Interface with the Iris dataset loaded

The Iris Flower dataset is a famous dataset from statistics and is heavily borrowed by researchers in machine learning. It contains 150 instances (rows) and 4 attributes (columns) and a class attribute for the species of iris flower (one of setosa, versicolor, and virginica). You can read more about Iris flower dataset on Wikipedia.

4. Select and Run an Algorithm

Now that you have loaded a dataset, it’s time to choose a machine learning algorithm to model the problem and make predictions.

Click the “Classify” tab. This is the area for running algorithms against a loaded dataset in Weka.

You will note that the “ZeroR” algorithm is selected by default.

Click the “Start” button to run this algorithm.

Weka Results for the ZeroR algorithm on the Iris flower dataset

The ZeroR algorithm selects the majority class in the dataset (all three species of iris are equally present in the data, so it picks the first one: setosa) and uses that to make all predictions. This is the baseline for the dataset and the measure by which all algorithms can be compared. The result is 33%, as expected (3 classes, each equally represented, assigning one of the three to each prediction results in 33% classification accuracy).

You will also note that the test options selects Cross Validation by default with 10 folds. This means that the dataset is split into 10 parts: the first 9 are used to train the algorithm, and the 10th is used to assess the algorithm. This process is repeated, allowing each of the 10 parts of the split dataset a chance to be the held-out test set. You can read more about cross validation here.

The ZeroR algorithm is important, but boring.

Click the “Choose” button in the “Classifier” section and click on “trees” and click on the “J48” algorithm.

This is an implementation of the C4.8 algorithm in Java (“J” for Java, 48 for C4.8, hence the J48 name) and is a minor extension to the famous C4.5 algorithm. You can read more about the C4.5 algorithm here.

Click the “Start” button to run the algorithm.

Weka J48 algorithm results on the Iris flower dataset

5. Review Results

After running the J48 algorithm, you can note the results in the “Classifier output” section.

The algorithm was run with 10-fold cross-validation: this means it was given an opportunity to make a prediction for each instance of the dataset (with different training folds) and the presented result is a summary of those predictions.

Just the results of the J48 algorithm on the Iris flower dataset in Weka

Firstly, note the Classification Accuracy. You can see that the model achieved a result of 144/150 correct or 96%, which seems a lot better than the baseline of 33%.

Secondly, look at the Confusion Matrix. You can see a table of actual classes compared to predicted classes and you can see that there was 1 error where an Iris-setosa was classified as an Iris-versicolor, 2 cases where Iris-virginica was classified as an Iris-versicolor, and 3 cases where an Iris-versicolor was classified as an Iris-setosa (a total of 6 errors). This table can help to explain the accuracy achieved by the algorithm.

Summary

In this post, you loaded your first dataset and ran your first machine learning algorithm (an implementation of the C4.8 algorithm) in Weka. The ZeroR algorithm doesn’t really count: it’s just a useful baseline.

You now know how to load the datasets that are provided with Weka and how to run algorithms: go forth and try different algorithms and see what you come up with.

Leave a note in the comments if you can achieve better than 96% accuracy on the Iris dataset.

Changing the test option to “use training set” changes the nature of the experiment and the results are not really comparable. This change tells you how well the model performed on the data to which was trained (already knows the answers).

This is good if you are making a descriptive model, but not helpful if you want to use that model to make predictions. To get an idea at how good it is at making predictions, we need to test it on data that it has not “seen” before where it must make predictions that we can compare to the actual results. Cross validation does this for us (10 times in fact).

Great work on Multilayer Perceptron! That’s a complicated algorithm that has a lot of parameters you can play with.

Maybe you could try some other datasets from the “data” directory in Weka.

We need to evaluate a suite of algorithms and see what works best. This means that we need a robust test harness (so we cannot be fooled by the results). This involves careful selection of a metric and a resampling method like cross validation.

We can then choose an algorithm that both performs well and has low complexity (easy to understand and explain to others/use in production).

An excellent article indeed! Given the classifier (prediction for numeric class) model as well as an instance from a training or test set, do you have any idea about the steps as to how predicted values are calculated specifically under the “Predictions ontest data” sectionof WEKA’s output?

A trained classifier is evaluated by providing it the input attributes (without the numeric output attribute), and the prediction is made. This is repeated for all instances in the test dataset. Weka then provides a summary of the error in the predictions using a number of measures such as mean absolute error and root mean squared error.

I’m new in weka,I want use simple CLI,i want delete some attribute from arff file and i inter :
[java weka.filters.unsupervised.attribute.Remove –R 5-34 -i data/kdd_01.arff -o data/kdd_02.arff]
But it doesn’t work!!!
I hope You can help me…

I am trying an .arff file of my own and i don’t get over 65% correct answers for the model that I built.Is there a way to make it better or does that mean that the data maybe don’t relate with the outcome class?
Thanks.

You could try some different algorithms. You could try some further preparation of your data such as normalization, standardization and feature engineering. You could also try tuning the parameters of an algorithm that is doing well.

Be scientific and methodical in your approach, devise specific questions and investigate them and record what you discover.

Can Weka,given that a data belongs to a certain class, output a dataset that belongs to that class?Of course, there maybe many datasets that satisfy that class, but there are also certain problems that have unique solutions.
I want to try that with the 8-queens problem.
Thanks a lot for your answers.

Not in the GUI I think Soto. You might need to write a program using the WEKA API to handle this case. I believe I’ve read examples of neural nets being used to solve 8 queens and TSP problems. It all comes down to representation of the problem.
Good luck.

” 3 cases where a Iris-versicolor was classified as a Iris-setosa” in your explanation of the confusion matrix in the post should be ” 3 cases where a Iris-versicolor was classified as a Iris-virginica”. Great post by the way, cheers.

Hi a bit off topic for this lesson but im in need of help! Im using J48-cross validation but i want to change the amount of times the model can run and adjust the weights given to each variable- if that makes sense. also known as the number of epochs/iterations. i have done this before and im sure its a simple fix but i cant remember where or what this is called in weka.

hello Jason, I must say this is exciting, i absolutely have no foundation in computer science or programming and neither was i very good at mathematics but somehow i

am in love with the idea of machine learning, probably because i have a real life scenario i want to experiment with.

I have up to 20 weekends and more of historical data of matches played and i would like to see how weka can predict the outcome of matches played within that 20 week

period.

My data is in tabular form and it is stored in microsoft word.
It is a forecast of football matches played in the past.

Pattern detection is the key, By poring over historical data of matches played in the past, patterns begin to emerge and i use this to forecast what the outcome of

matches will be for the next game.

I use the following attributes for detecting patterns and making predictions which on paper is always 80-100% accurate but when i make a bet, it fails.
(results, team names, codes, week’s color, row number)

Results= Matches that result in DRAWS

Team names = Believe it or not, teams names are used as parameters to make predictions, HOW? They begin with Alphabets.

Codes= These are 3-4 strings either digits or a combo of letters and digits, depending on where they are strategically placed in the table, they offer insight into

detecting patterns.

Weeks Color= In the football forecasting world, there are 4 colours used to represent each week in a month. RED, BLUE, BROWN and PURPLE. These also allows the

forecaster to see emerging patterns.

Row Number= Each week, the data is presented in a table form with two competing teams occupying a row and a number is associated with that row. These numbers are used

to make preditions.

So i would like to TEACH WEKA how i detect these patterns so that my task can be automated and tweaked anyhow i like it.

In plain english, how do i write out my “pattern detecting style” for weka to understand and how do i get this information loaded into weka for processing into my

desired results.
Going by my scenario, What will be my attributes?
What will be my instances?
What will be the claasifiers?
What algorithms do i use to achieve my aim or will i need to write new algorithms?

Can you pls help me. i actually new to this datamining concepts. i want to know how to extract a features and accuracy of a given url name. for eg: if the url name is http://www.some@url_name.com it will extract the feature is _ and @ in it and i also tells the age of the url and also some feature extraction like ip address, long or short url,httos and ssl ,hsf,redirect page ,anchor tag like that it should extract and it will tell the accuracy too.and then implement using c4.5 classifier algorithm to find whether the given url name is malicious or benign url.

Sounds like a great project. As with any project, you need to start by building up a dataset for you to analyse.

I suspect the words in the URL will be useful, SSL cert or not (https) may be useful, and so on. It is hard to know a priori what will be most useful, I’d recommend brain storming and trying a lot of different features in your model. Also consider using an importance measure or correlation with the output variable to see which features look promising and which appear redundant.

Hi!
I’m trying to use libsvm for classification (2 class) in 10-fold cross-validation mode. The output predictions that I get have an instance#, but I dont know which instances of my dataset do these correspond to. For example, my output predictions looks like this:

How does my dataset get divided into 10 parts, which files do these instances correspond to?
I’m interested in knowing which files get incorrectly classified. Is there some other/better way to do this?

Hi Jason. I’ve been playing around for a while with WEKA, and now I get good prediction results. But I still wonder how to apply the model built further?
I mean, I train and tune algorithms and get better results, but then?
When I try to input, say, a set of four attributes corresponding to those of the IRIS set, it doesn’t recognize it as something that it can use in the model.
If I put these four attributes and an empty column, it accepts this, but I don’t know how to predict the class then? How should I set the parameters in WEKA to do that, please?
Thanks by advance.

Hi,
I am trying to use different types of classification rules with numeric data (both independent and dependent variables). It will be great if you please let me know about the applicable classification rules and results interpretation using numerical data in Weka or R.
Thanks in advance.

Once I have trained the model, how do I run it?
For example, I have trained a model to make predictions on bodyfat % according to age, height, weight and abdominal circumference. Now, I want to input parameters about myself too see what the model predicts.
Any help would be appreciated.

Hello,I have data set with numeric prediction.I would like to apply rules learning algorithm in Weka. I noticed that Weka does not support rule based algorithm for numerical prediction as it does for nominal prediction. I do not know why. also, why the accuracy is not shown in the same way as nominal prediction please? also I noticed the forms of the rules are also different? How to calculate the accuracy of the correct classification please? Many thanks

Hi, this is a great work. Thanks for this. One thing. Could you pls elaborate little more about “test options.” Got to know what is the purpose of the ‘use training set option’. But what about the others. Cross validation is already being used. What about other two. ‘supplied test set and percentage split’??
Thank you!

and finally someone explains ML in plain english, i’ve been bashing my head off the desk for weeks working through training material on this. Very well written and explained in a few simple steps. Many thanks for this 🙂

Under Review Results wouldn’t the last comparison be 3 iris-versicolor classified as iris-virginica? I’m having trouble comprehending the pattern in this table, and using the example from Wikipedia it appears to not be the same as you mentioned? Totally new to this so I have no idea what I’m talking about but I’d appreciate some clarification.

Hi, i have just trained a Random forest classifier for a text classification. I have got the results too. When they say- train your classifier and then test it with different data for evaluation, i want to know what should i do next to evaluate this classifier? Can anyone help me with steps to be followed ? I am not sure of how to save this trained classifier model and upload a test data.

I have same problem… Its my first day in Weka.. and i’m trying to load the German credit data, available as credit-g.arff in the Weka 3.8 distribution. I cant load this dataset because when I open the “open file” it only shows me the folders that are in my computer, and not the weka dataset….. Please if someone has an idea how to solve this thing…

Just a quick note that I love the whole site and have never had such an easy time establishing a new direction of endeavor with a high degree of confidence and understanding! This post had me up, running valid data, and evaluating the output from the classifier in under 10 minutes! Simply amazing!

One small issue of note, though. The last paragraph of section 5, just before the Summary, covers the Confusion Matrix. In that paragraph, the third case cited refers to the three instances in which “Iris-versicolor was classified as a Iris-setosa.” My understanding of the table, however, is that there were three instances in which the Iris-versicolor was classified as an Iris-verginica, not Iris-setosa as stated. Naturally Mr. Murphy would select the most ironic location for this confusion of interpretation.

Thanks for your helpful information.
I have a specific question. Using the steps that you have mentioned we can train a machine learning model in WEKA and test its accuracy. I am wondering how we can classify new instances, with no class labels, using a model that we have trained in WEKA. For example, lets say that we have 1000 instances of positive and negative sentences. We train a machine learning model using an algorithm. Afterwards, we want to label 100 new sentences that have not already been classified with either positive or negative labels. How can we do such a work using WEKA?

I am making an application based on handwritten digit recognition, it will be an android app. User will click picture of a digit, it will be send to the server then text will be recognized using machine learning.

I will write back-end of my application in java language so does WEKA provide interfacing with java.

Nice post. I got your book on “Master Machine Learning Algorithm”, It is very good but i want to know how to replicate the example on “Logistic Regression” using weka pakage and java but could not. Please assist. Thank you.

A quick question: I ran the SMO classifier, as I needed a Support Vector Machine, and got a set of results that included a list of the features used, under a line that reads, “Machine Linear: showing attribute weights, not support vectors”. Each feature has a value to the left and the label “(normalised)” next to it.

What does this mean, please? Values for each feature was used in the classification so I assume the numbers refer to some sort of weighting i.e. how heavily each feature impacted on the results. Is this the case?

Any chance someone can please explain this in simple terms as I am a beginner, or at least point me to a website with a detailed explanation of the SMO classifier and ALL its results section contents.

Hi, am quite new to weka. I am faced with a problem of clustering my data. I have a dataset of 13 distinct attributes, though only 12 is significant to the dataset. Am interested in using the kmeans algorithm and the euclidean distance for the distance function.

The problem faced is to cluster the data belonging to 1 class along among the dataset.
How do i go about it.
Please answer is need ASAP.

I am beginner in Weka. i use weka for ECG classification not for a dataset but classify record by record … firstly i convert .mat file to .csv and by weka i convert it to .arff as i can not get relation and header for my file >>>> i have question here in ECG i use .mat file or use annotation files also ????

I am training data set of posts from Facebook on Naive bayes multinomial,the data gets more classified if i use the üse training set test option but if i use cross folds and percentage split the percentage of the correctly classified instances dropss drastically(i get 40% or below).Why is this happening.i have tried to search for this problem and I cannot find solutions.I repeated with 3 diffrent classifiers
(SMO,IBk anns J45) but the same problem persists.what could be the problem and how do i solve it.

Can you please provide a link to example/demo code to classify unlabeled data. I have a set of unlabeled data, training data and test data. Using weka, I am not able to get the label for unlabeled data.

I am dealing with multi class problem weka. Here my doubt is variations in accuracy results with different classifiers with same attributes. EX: With SMO accuracy -84%, Random Forest- 92%. How this much variation comes? and is there any option to enhance smo performance in weka. Let me know as early as possible. thank you.

Can you guide me how to change the parameters in WEKA? I want to change the classification technique with different parameters values (at least 3 parameters with 3 different value for each), AND their classification results.

I am new to Weka and machine language .I am trying to load following dataset. For the
following dataset the J48 classifier is disabled. I need to generate java class files .Could you please help on the issue

Would you please tell me how i can apply the trained models on new data.
I want to build a decision support system that predicts the best radiological procedure for a patient with given indications. I tried Weka but i cant figure out how to apply the models on new patients’ conditions and get the predictions.

Dear Jason Brownlee, Now i have one Question how many records (attributes and instance) required to conduct analysis/Research or Thesis a minimum and Maximums using data mining technology especially Weka software.

I will be taking a course in Predictive Analytics through UC San Diego in April and I thought I’d get a start by looking at WEKA. That will be the tool that we will be using in the course.

I liked the example of the iris file that you gave and how it analyzed the data and found a certain number of records misclassified within the data set. I’m trying to understand the real life application of this tool and I’m thinking about my previous role as a manager over all incidents that were reported by customers who are trying to access our websites. Some of those issues were data related problems, eg missing data, incorrect data in the wrong fields, etc.

So would I be able to run analysis of the data and identify records that had these data issues in my dataset and then drill down to the actual record to identify the real root cause and address the issue upstream in the data ETL process?

I thought I posted a comment last night, but don’t see it 🙁 At any rate, here is my question/thought.

I will be taking Predictive Analytics through UC San Diego next month and will be using WEKA in the course. Your example on the Iris made it easier for me to digest how to use this tool and I thank you for that.

I’m thinking about the practical use of this tool in my every day work. I worked for a large insurance company and was responsible for 7 websites and all of the reported issues/incidents within those portals from customers (internal and external). I’m assuming this would be a good tool to look at data related issues, e.g. we get data feeds from mainframe to a repository. In that data, there can be defects, e.g. missing data, incorrect data in wrong field, etc. I’m assuming this will help with this type of issue.

My questions are – is my assumption above correct and if so, does this tool then allow you to easily identify the data at a line level?

Hello Sir,
how to use Weka KnowledgeFlow Environment . i m trying to use weka knowledge flow
environment bt output not display.
i m working on Chronic Kidney Disease Detection by Analyzing Medical Datasets in Weka .

Am trying to classify tweets into 3 categories: +ve, -ve and neutral.
Currently I have around 800 tweets. Steps followed by me:
1. Converting tweet text column to string using NomialtoString.
2. Applying StringtoWordVector (stemmer: Snowball, tokenizer: NGramTokenizer or CharNGramTokenizer.
3. Applying AttributeSelection (with default settings) under “preprocess” only so as to automatically select the attributes.

Am getting an accuracy to around 65% with Naive Bayes. How can I improve the result??
I need to know exact process or settinngs to follow. Is there anything else I should be doing??

I have next question:
I have for example 3 attributes and a decision: blue, round, small -> ball

Weka does analyze this dataset and build decisions, but how to see this decisions in the way like: if(blue, round, small) then -> ball
Great respect
Thanks
P.s. Our team is going to implement our own classification algorithm to Weka (LEM2), if you have some similar idea I would appreciate your help

I would like to know if Weka can help in evaluating phishing emails detection model, not only based on words found in the content, but also based on checking other features such as sender IP through online blacklisting service, then increase the probability of being phishing if sender IP is blacklisted? if Weka can help, then how?

hello sir
tomorrow will be my external viva .. my topic is lung nodule using CBIR. NN techinque.
i have a table of feature extraction…that i run according the same way that u told in this PDF.. plz tell me some new things regarding this..
thnks

Thanks for this great machine learning tool. I tried using Weka to classify spambase dataset into either spa or non-spam but it is not giving me the right result. Can you explain how I can use Weka for spam email classification using any dataset? What I first did was to convert the text file to .cvs file, then do select normalize on Weka, and then went on select the classification algorithms that I want. The result I got is below:
=== Run information ===

Hy Mr.Jaon….I am working on diagnosis of thyroid ….i created two separate dataset files…train file and test with with ? instead of output …WEKA works well with trained file but when i execute test file it pops up an error message NOT ENOUGH TRAINING INSTANCES WITH CLASS LABELS (REQUIRED 1,PROVIDED 0)….can you please help me with this