Naive Bayes is a simple technique for constructing classifiers: models that assign class labels, drawn from some finite set, to problem instances represented as vectors of feature values. It is a straightforward yet powerful algorithm for classification. Even on a data set with millions of records and only a few attributes, the Naive Bayes approach is worth trying.

In many of my previous articles, I have written about the Naive Bayes classifier: what it is and how it works. Today we will use the KSAI library to build our Naive Bayes model. But before that, let's look at what KSAI is.

What Is KSAI?

KSAI is an open-source machine learning library that contains algorithms for classification, regression, clustering, and more. It is an attempt to build machine learning algorithms in Scala. For its mathematical operations it uses Breeze, a numerical library that is also written in Scala.

KSAI mainly uses Scala's built-in case classes, Futures, and some of the language's other cool features. It also uses Akka in some places to do things asynchronously. The test cases are a good starting point for exploring the library. Right now the library might not be easy to use, because the documentation is limited and the API is unclear; however, the committers will update them in the near future.

How to Use It

You can add the KSAI library to your project by adding the dependency below.

Using KSAI for Naive Bayes Classifier

The KSAI Naive Bayes classifier supports three models:

General/Gaussian: Used for classification with continuous features; it assumes that the features follow a normal distribution.

Multinomial: Used for discrete counts, such as how many times each word occurs in a document.

Bernoulli: Useful if your feature vectors are binary (i.e. zeros and ones).
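
To make the difference between the three models concrete, here is a minimal sketch in plain Scala of how each one scores the features of a single instance for one class. It does not use KSAI's API; the object and function names (LikelihoodSketch, gaussian, multinomialLog, bernoulli) are made up for illustration, and the per-class parameters are assumed to have been estimated (and smoothed) already.

```scala
object LikelihoodSketch {
  // Gaussian: density of one real-valued feature x, given the class mean and variance.
  def gaussian(x: Double, mean: Double, variance: Double): Double =
    math.exp(-math.pow(x - mean, 2) / (2 * variance)) / math.sqrt(2 * math.Pi * variance)

  // Multinomial: log-likelihood (up to an additive constant) of a vector of counts,
  // given the per-class probability of each term.
  // Assumes every probability is > 0, e.g. after smoothing.
  def multinomialLog(counts: Seq[Int], termProbs: Seq[Double]): Double =
    counts.zip(termProbs).map { case (c, p) => c * math.log(p) }.sum

  // Bernoulli: likelihood of a binary feature vector, given the per-class
  // probability that each feature equals 1.
  def bernoulli(features: Seq[Int], onProbs: Seq[Double]): Double =
    features.zip(onProbs).map { case (f, p) => if (f == 1) p else 1.0 - p }.product
}
```

In all three cases the classifier would combine this score with the class prior and pick the class with the highest result; only the way the features are modelled changes.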

For better understanding, let's take an example and try to build something using KSAI's Naive Bayes Classifier algorithm.

In this example, I'll use the data file movie.txt to demonstrate the application. The file is also included in the GitHub repository provided below, so you can play around with it too.

As the name suggests, movie.txt contains movie reviews, each labeled as neg (a negative review) or pos (a positive review). In our example, we first convert these labels into numeric values: 1 for each positive record and 0 for each negative record.
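
The label conversion could look roughly like the sketch below. The exact layout of movie.txt is not described here, so the code assumes each line starts with the label (pos or neg) followed by the whitespace-separated words of the review; LabelEncoding, encode, and parseLine are hypothetical names, not part of KSAI.

```scala
object LabelEncoding {
  // Map the textual label to the numeric class: pos -> 1, neg -> 0.
  def encode(label: String): Option[Int] = label.trim.toLowerCase match {
    case "pos" => Some(1)
    case "neg" => Some(0)
    case _     => None // unknown label: the record is skipped
  }

  // Assumed line layout (hypothetical): "<label> <word> <word> ...".
  // Returns the numeric label together with the review's words.
  def parseLine(line: String): Option[(Int, Array[String])] = {
    val tokens = line.trim.split("\\s+")
    tokens.headOption.flatMap(encode).map(y => (y, tokens.tail))
  }
}
```

With that in place, something like scala.io.Source.fromFile("movie.txt").getLines().flatMap(LabelEncoding.parseLine).toVector would give a collection of (label, words) pairs; lines with an unrecognised label are dropped rather than guessed at.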

Our data set is ready. Now we will split it: one part of the source data is used to train the algorithm, and the remaining part is used to make predictions and check the accuracy of the algorithm.
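
As a rough illustration of that split-and-check step, here is a small sketch in plain Scala. It is independent of KSAI; the names (SplitAndScore, trainTestSplit, accuracy), the fixed seed, and the idea of shuffling before the cut are my own assumptions for the example.

```scala
import scala.util.Random

object SplitAndScore {
  // Shuffle with a fixed seed so the split is reproducible,
  // then cut the data into (train, test) at the given fraction.
  def trainTestSplit[A](data: Seq[A], trainFraction: Double, seed: Long = 42L): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    val cut = (data.size * trainFraction).toInt
    shuffled.splitAt(cut)
  }

  // Accuracy: the fraction of predictions that match the true labels.
  def accuracy(predicted: Seq[Int], actual: Seq[Int]): Double = {
    require(predicted.size == actual.size && actual.nonEmpty, "need equal-length, non-empty inputs")
    predicted.zip(actual).count { case (p, a) => p == a }.toDouble / actual.size
  }
}
```

For example, trainTestSplit(records, 0.8) would keep 80% of the records for training, and accuracy would then compare the model's predictions on the remaining 20% with their known labels.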