Using Intel® Data Analytics Acceleration Library to Improve the Performance of Naïve Bayes Algorithm in Python*


Introduction

How is Netflix able to recommend videos for customers to view? How does Amazon know which products to suggest to potential buyers? How does Microsoft Outlook* categorize an email as spam?

How do they do it? Netflix recommends videos to customers based on their past viewing history. Amazon uses data from shoppers' viewing and buying histories to suggest products that they are likely to buy. Microsoft analyzes a huge volume of email to determine whether a message is spam.

It seems that Netflix, Amazon, and Microsoft each perform some kind of analysis on historical databases to make decisions that assist their customers. With the huge amount of data (video, audio, and text) available on social media and the Internet, there needs to be an efficient way to handle these big data sets with minimal human intervention in order to gain insight into how people think, behave, and interact with society and their environment.

This article discusses machine learning and describes a machine learning method/algorithm called Naïve Bayes (NB)​2. It also describes how to use Intel® Data Analytics Acceleration Library (Intel® DAAL)​3 to improve the performance of an NB algorithm.

What is Machine Learning?

Machine learning (ML) is a data analysis method used to create an analytical model based on a set of data. That model has the ability to learn without being explicitly programmed as new data is fed to it. ML has been around for quite some time, but only recently has it proved broadly useful, due to the following:

Growing volumes and varieties of data are available from social media and the Internet.

Computer systems are getting more powerful.

Data storage is getting larger and cheaper.

The two most common types of ML are supervised learning​4 and unsupervised learning​5.

With supervised learning, algorithms are trained using a set of input data and a set of labels (known results). The algorithms learn by comparing their output on the input data with the known labels and adjusting the machine learning model accordingly. Classification is considered supervised learning.

Unlike supervised learning, with unsupervised learning the algorithms have no labels to learn from. Instead, they have to look through the input data and detect patterns on their own. For example, to categorize which region of the world a person belongs to, the algorithms have to look at the population data and identify races, religions, languages, and so on.

Figure 1: Simplified machine learning diagram.

Figure 1 shows a simplified diagram of how ML works. First, an ML algorithm is trained on a training data set to create an ML model. The model then processes the testing data set and produces predicted results.

The next section discusses a type of supervised learning algorithm, the Naïve Bayes algorithm.

Naïve Bayes Algorithm

The Naive Bayes (NB) algorithm is a classification method based on Bayes’ theorem​6 with the assumption that all features are independent of each other.
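To make the independence assumption concrete, here is a small worked example in plain Python (the prior and word probabilities are made up for illustration): under Bayes' theorem, the posterior for a class is proportional to the class prior multiplied by the product of the per-feature likelihoods, precisely because the features are assumed independent.

```python
# Bayes' theorem with the naive independence assumption:
# P(class | features) is proportional to P(class) * product of P(feature_i | class).

def naive_bayes_posterior(prior, likelihoods):
    """Unnormalized posterior for one class: the class prior times each
    (assumed independent) feature likelihood."""
    p = prior
    for lk in likelihoods:
        p *= lk
    return p

# Hypothetical numbers: P(spam) = 0.4, and the words "free" and "winner"
# appear in 80% and 60% of spam, but only 10% and 5% of non-spam.
spam = naive_bayes_posterior(0.4, [0.8, 0.6])    # 0.4 * 0.8 * 0.6 = 0.192
ham = naive_bayes_posterior(0.6, [0.1, 0.05])    # 0.6 * 0.1 * 0.05 = 0.003
p_spam = spam / (spam + ham)                     # normalize over both classes
print(round(p_spam, 3))                          # 0.985
```

Even though the two words may well be correlated in real email, the classifier simply multiplies their probabilities, which is what makes NB fast to train and evaluate.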

The NB algorithm is commonly used for document categorization, email spam detection, and similar classification tasks.

The diagram in Figure 2 shows how the NB algorithm works:

Figure 2: Machine learning diagram with Naïve Bayes algorithm.

Figure 2 shows that the training data set consists of two sets: the training labels and the training data. The training labels are the correct outputs of the training data. These two sets are needed to build the classifier. After that, the testing data set will be fed to the classifier to be evaluated.

Figure 3: Email spam detection using Naïve Bayes algorithm.

Figure 3 shows the flow diagram of the NB algorithm detecting email spam. The known spam email goes through the NB algorithm to build the email classifier. After that, unknown emails go through the email classifier to check whether or not an email is spam.
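The flow in Figure 3 can be sketched as a tiny from-scratch multinomial NB spam classifier in plain Python. The toy emails and variable names below are invented for illustration; this is not the Intel DAAL API, which is discussed later in this article.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Build the email classifier from known (tokenized) spam and non-spam.
    Uses Laplace (add-one) smoothing to avoid zero probabilities."""
    classes = set(labels)
    priors, word_counts, totals = {}, {}, {}
    vocab = {w for d in docs for w in d}
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = len(class_docs) / len(docs)
        word_counts[c] = Counter(w for d in class_docs for w in d)
        totals[c] = sum(word_counts[c].values())
    return priors, word_counts, totals, vocab

def classify(doc, model):
    """Run an unknown email through the classifier; return the best class."""
    priors, word_counts, totals, vocab = model
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc:
            # add-one smoothing keeps unseen words from zeroing the score
            score += math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Known spam and non-spam used to build the classifier (toy data)
docs = [["free", "winner", "prize"], ["free", "cash", "offer"],
        ["meeting", "tomorrow", "agenda"], ["project", "status", "report"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(classify(["free", "prize", "offer"], model))   # spam
print(classify(["project", "meeting"], model))       # ham
```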

Applications of NB

The following are some NB applications:

Taking advantage of its speed to make predictions in real time

Multiclass/multinomial classification, such as classifying many types of flowers

Detecting spam emails

Classifying text

Advantages and Disadvantages of NB

The following are some of the advantages and disadvantages of NB.

Advantages

Training the model can be done quickly.

NB does well with multi-class prediction.

Disadvantages

If a class or feature value never appears in the training data set, NB assigns it zero probability and cannot make a sensible prediction (the zero-frequency problem, commonly mitigated with Laplace smoothing).

NB does not work well with large data sets.

Features/events are not always completely independent.

Training a model with a large data set can take a considerable amount of time, sometimes weeks or even months for a particular model. To help optimize this training step, Intel developed Intel DAAL.

Intel® Data Analytics Acceleration Library

Intel DAAL​7 is a library of basic building blocks for data analytics and machine learning, highly optimized to take advantage of the latest features of Intel® processors. The multinomial Naïve Bayes classifier is one of the classification algorithms that DAAL provides. In this article, we use PyDAAL​10, the Python API of DAAL, to build a basic Naïve Bayes classifier.

Using the NB Method in the Intel Data Analytics Acceleration Library

This section shows how to invoke the multinomial NB algorithm​8 in Python​9 using Intel DAAL.

2) Create a function to split the input data set into training data, training labels, and test data.

Basically, split the input dataset array into two arrays. For example, for a dataset with 100 rows, split it 80/20: 80 percent of the data for training and 20 percent for testing. The training data will contain the first 80 lines of the input dataset, and the testing data will contain the remaining 20 lines.
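One possible sketch of such a split function (the function and variable names are illustrative, not taken from the original listing):

```python
def split_dataset(data, labels, train_fraction=0.8):
    """Split a dataset and its labels into training and testing portions.
    The first train_fraction of rows become the training set, and the
    remaining rows become the testing set."""
    n_train = int(len(data) * train_fraction)
    train_data, test_data = data[:n_train], data[n_train:]
    train_labels, test_labels = labels[:n_train], labels[n_train:]
    return train_data, train_labels, test_data, test_labels

# 100-row example dataset: 80 rows go to training, 20 to testing
rows = [[i, i + 1] for i in range(100)]
labels = [i % 2 for i in range(100)]
train_data, train_labels, test_data, test_labels = split_dataset(rows, labels)
print(len(train_data), len(test_data))   # 80 20
```

In practice the rows are usually shuffled before splitting so that both portions are representative of the whole dataset.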

3) Restructure the training and testing datasets so they can be read by Intel DAAL.

Use the commands to reformat the data as follows (we treat ‘trainLabels’ and ‘testLabels’ as n-by-1 tables, where n is the number of lines in the corresponding datasets):
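As a sketch of this reformatting step (the variable names follow the article's ‘trainLabels’/‘testLabels’ convention; the exact commands may differ from the original listing), NumPy's reshape can turn a 1-D label vector into the n-by-1, C-contiguous layout that DAAL's numeric tables expect:

```python
import numpy as np

# DAAL numeric tables are built from 2-D, C-contiguous arrays, so the
# 1-D label vectors are reshaped into n-by-1 tables (toy values shown).
trainLabels = np.array([0, 1, 0, 1, 1], dtype=np.float64)
trainLabels = trainLabels.reshape(len(trainLabels), 1)   # shape (5, 1)

testLabels = np.array([1, 0], dtype=np.float64)
testLabels = testLabels.reshape(len(testLabels), 1)      # shape (2, 1)

print(trainLabels.shape, testLabels.shape)
```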

prediction: Prediction result that contains predicted labels for test data.
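Putting the pieces together, a training-and-prediction sequence along the lines of PyDAAL's published batch-processing examples might look like the following sketch. The module paths and symbol names follow those examples but can vary between DAAL versions, so treat this as a guide rather than a definitive implementation; trainData, trainLabels, and testData stand for the NumPy arrays prepared in the earlier steps.

```python
from daal.algorithms import classifier
from daal.algorithms.multinomial_naive_bayes import training, prediction
from daal.data_management import HomogenNumericTable

nClasses = 2  # e.g., spam vs. not spam

# Wrap the prepared NumPy arrays as DAAL numeric tables
trainDataTable = HomogenNumericTable(trainData)
trainLabelsTable = HomogenNumericTable(trainLabels)
testDataTable = HomogenNumericTable(testData)

# Train the multinomial Naive Bayes model in batch mode
trainAlg = training.Batch(nClasses)
trainAlg.input.set(classifier.training.data, trainDataTable)
trainAlg.input.set(classifier.training.labels, trainLabelsTable)
trainResult = trainAlg.compute()

# Feed the test data through the trained classifier
predAlg = prediction.Batch(nClasses)
predAlg.input.setTable(classifier.prediction.data, testDataTable)
predAlg.input.setModel(classifier.prediction.model,
                       trainResult.get(classifier.training.model))
predResult = predAlg.compute()

# 'prediction' holds the predicted labels for the test data
predictionTable = predResult.get(classifier.prediction.prediction)
```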

Conclusion

Using Intel DAAL algorithms can help simplify and improve the performance of your ML-related applications, since those algorithms are highly optimized for the latest features of the latest Intel processors. Using Intel DAAL also means that you don't have to modify your code to take advantage of new Intel processor features; you only need to link to the latest version of Intel DAAL.

The next article will explore another Intel DAAL algorithm, the support vector machine (SVM).