Discover what you didn’t even know you were looking for with Weka Data

Posted by KingEclient on 14 December, 2015

What if huge datasets could be turned into useful information that you didn’t even know existed?

This is where machine learning algorithms come in. Techniques such as support vector machines (SVMs), artificial neural networks (ANNs), decision trees, and clustering are used to detect and surface previously unknown patterns in very large datasets. Although these algorithms have been validated in thousands of studies, they are complex to apply correctly, and it is easy to mishandle a dataset along the way. This is where specialized data mining software plays its part.

Would you like to know how?

Weka is a piece of software that focuses on finding and confirming underlying patterns in data by using supervised and unsupervised machine learning algorithms. Its interface offers a set of tools for discovering information that is invisible at first sight: data pre-processing, classification, regression, clustering, association rules, and model visualization. Even though some of these functions may sound specialized, the user interface is well organized. Users need a basic understanding of data mining and must know which algorithm they want to apply to their dataset; the software takes care of the rest.

As a simple starter’s guide, here are the steps needed to analyze a dataset with a supervised machine learning method.

1. Load a .csv file in which each row is an instance of data and the columns hold the attributes (features) plus the class label.

2. Preview your data. When you load a .csv, the program automatically displays your data. For example, you can select ‘Class’ to be displayed and you will be able to see whether the classes are balanced or not (i.e. whether each class has the same number of instances).

Fig. 2: In this case, each class has the same number of instances (3 classes, 50 instances each).

3. If your data is balanced, you can proceed and apply a classifier (algorithm). If it is unbalanced, you can either balance it manually in the .csv file or apply a filter at the pre-processing stage, for instance the SMOTE filter. Once you have chosen the algorithm, e.g. a support vector machine (SVM), you just have to start the run and the model will be built in a few seconds or minutes, depending on the amount of data.

4. Users can inspect the raw model output on the same page, or simply view it in the visualizer. For instance, we loaded a sample dataset taken from a botanical study. In the image below, Iris-setosa is the class that is best separated from the other two, which lie closer together in 2-D space. The visualization also lets users see which class scores higher or lower on the dimensions represented by the x and y axes.

Weka is open-source software that provides users with a step-by-step interface for analyzing large datasets. If you want to try it, follow the steps above and find out what your data has been hiding from you.
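
The class-balance check from step 2 can also be reproduced outside Weka before loading a file. The following is only a minimal Python sketch using the standard library, not part of Weka itself. The column names (`sepal_length`, `petal_length`, `class`) and the six sample rows are made up for illustration; a real file would have its own header and values.

```python
import csv
import io
from collections import Counter


def class_counts(csv_file, class_column="class"):
    """Count how many instances (rows) belong to each class label."""
    reader = csv.DictReader(csv_file)
    return Counter(row[class_column] for row in reader)


def is_balanced(counts):
    """Return True when every class has the same number of instances."""
    return len(set(counts.values())) <= 1


# Hypothetical sample: two attributes plus a class column, two rows per class.
SAMPLE = """sepal_length,petal_length,class
5.0,1.5,Iris-setosa
4.8,1.4,Iris-setosa
6.0,4.5,Iris-versicolor
5.8,4.2,Iris-versicolor
6.5,5.5,Iris-virginica
6.8,5.8,Iris-virginica
"""

counts = class_counts(io.StringIO(SAMPLE))
for label, n in sorted(counts.items()):
    print(f"{label}: {n} instances")
print("balanced" if is_balanced(counts) else "unbalanced")
```

To run the same check on a file on disk, pass an open file handle instead of the in-memory sample, for example `class_counts(open("my_data.csv", newline=""))`, where `my_data.csv` is a placeholder name. If the result is unbalanced, step 3 describes the options available inside Weka.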