Saturday, September 19, 2009

Hi all,

In this article we continue our study of data mining algorithms. This time I will present a supervised learning algorithm called Bayesian classification. As in the previous articles on this blog, a simple example of the algorithm is included, which can be run with the Python interpreter.

The Algorithm

The Bayesian classification algorithm takes its name from Bayes' probability theorem, on which it is based. It is also known as the Naïve Bayes classification rule or simply as the Bayesian classifier.

The algorithm predicts class membership probabilities, that is, the probability that a given tuple or pattern belongs to each class, and from them the most probable class for the pattern. This type of prediction is called statistical classification because it relies entirely on probabilities.

The classifier is also called simple, or naïve, because it assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence, and it is made to simplify the computations involved.
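In symbols, the assumption lets the probability of a whole pattern factor into a product of per-attribute probabilities. The notation below is introduced here only for illustration (C is a class and x_1, ..., x_m are the attribute values of the pattern); it does not appear in the rest of the article:

```latex
P(C \mid x_1, \dots, x_m) \;\propto\; P(C) \prod_{i=1}^{m} P(x_i \mid C)
```

The predicted class is the one for which the right-hand side is largest.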

Regarding the attributes, it is important to note that the Bayesian classifier tends to get better results when the attribute values are categorical rather than continuous. This should become clearer in the example shown later in this article.

Another characteristic of the algorithm is that it requires a data set that is already classified, that is, a set of patterns and their associated class labels. This data set is called the training set. Given the training set, the algorithm receives as input a new pattern (unknown data) whose class label is not known, and returns as output the class for which the calculated probability is highest. Unlike the K-means algorithm seen in the previous article, the Bayesian classifier does not need a metric to measure the 'distance' between instances, and it cannot classify an unknown pattern without a previously classified data set (the training set). Because of this requirement, Bayesian classification is considered a supervised data mining algorithm.

To see how the algorithm works, let's summarize it in the following steps:

Step 01: Evaluation of the class probabilities (class prior probability).

In this step, the probability of each class in the training set is calculated. In most cases we work with only two classes, for instance, a class that indicates whether or not a consumer buys a product based on his or her demographic characteristics. The probability is obtained by dividing the number of patterns of a given class by the total number of patterns in the training set.
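As a minimal sketch of this step, the priors can be obtained by counting the labels. The function name `class_priors` and the five labels below are made up for illustration; they are not the training set of this article:

```python
from collections import Counter


def class_priors(labels):
    """Return {class: count / total} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical labels, used only to illustrate the calculation.
labels = ["YES", "NO", "NO", "YES", "NO"]
priors = class_priors(labels)
print(priors)
```

The priors always sum to 1, since every pattern of the training set belongs to exactly one class.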

Step 02: Evaluation of the probabilities of the training set.

Now, for each attribute value, the probability of that value is calculated within each possible class. This is the most time-consuming step of the algorithm: depending on the number of attributes, classes and patterns in the training set, many calculations may be needed before any result (probability) is obtained.

It is important to note that this calculation depends entirely on the attribute values of the unknown sample, that is, the sample whose class label you want to predict. Supposing that there are k classes and m attributes in the training set, k x m probabilities must be calculated.
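The counting in this step could be sketched as follows. The function `conditional_probabilities` and the four toy rows are assumptions made for this illustration only, and the sketch assumes that every class has at least one row in the training set:

```python
def conditional_probabilities(rows, labels, sample):
    """For each class, return the list of P(attribute = value | class)
    for the attribute values found in `sample`."""
    result = {}
    for cls in set(labels):
        # Rows of the training set that belong to this class.
        members = [row for row, label in zip(rows, labels) if label == cls]
        result[cls] = [
            sum(1 for row in members if row[i] == value) / len(members)
            for i, value in enumerate(sample)
        ]
    return result


# Hypothetical training rows (GENDER, MARITAL_STATUS) and their labels.
rows = [
    ("MALE", "SINGLE"),
    ("FEMALE", "MARRIED"),
    ("MALE", "MARRIED"),
    ("FEMALE", "SINGLE"),
]
labels = ["YES", "NO", "NO", "NO"]
sample = ("MALE", "MARRIED")

probs = conditional_probabilities(rows, labels, sample)
print(probs)
```

With k = 2 classes and m = 2 attributes, the result holds k x m = 4 probabilities, one per class and attribute value of the sample.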

Step 03: Evaluation of the probabilities of the unknown data.

In this step, the probabilities calculated in Step 02 for the attribute values of the unknown data are multiplied together within each class. The result is then multiplied by the class probability calculated in Step 01.

With the probability of each class calculated, check which class has the highest value for the unknown data. The algorithm ends by returning that class (the predicted class).
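The multiplication and the final choice can be sketched in a few lines. The function `classify` and the numbers passed to it are hypothetical; `math.prod` requires Python 3.8 or later:

```python
from math import prod


def classify(priors, conditionals):
    """Multiply each class prior by its per-attribute probabilities
    and return the class with the highest score, plus all scores."""
    scores = {cls: priors[cls] * prod(conditionals[cls]) for cls in priors}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical priors and per-attribute probabilities for two classes.
priors = {"YES": 0.25, "NO": 0.75}
conditionals = {"YES": [1.0, 0.5], "NO": [1 / 3, 2 / 3]}

best, scores = classify(priors, conditionals)
print(best, scores)
```

Note that the scores are not normalized, so they do not sum to 1. Only their relative order matters for choosing the class.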

Now, let's see a practical example of the simple Bayesian classification algorithm, including the evaluation of the probabilities.

Example of the Algorithm

In this example, suppose that a bank loan officer wants to predict whether or not a client will become a bank defaulter. For this, the bank must consider its historical client profiles and some of their attributes. To keep the scenario and the data model easy to understand, let's use a training set with only 15 rows and 4 columns (attributes). Figure 01 shows the training set that will be used in this example.

Figure 01. The historical client profiles (training set).

The columns shown in Figure 01 are described below:

CLIENT_ID: This column holds a unique sequential integer identifier. It is optional for the algorithm, but it may help to organize the rows of the data set.

GENDER: This attribute identifies the gender of the client. The only values allowed are MALE and FEMALE.

MARITAL_STATUS: This attribute gives the marital status of the client. It can only take the values MARRIED or SINGLE.

EDUCATION: This attribute gives the education level of the client. It can take only four values: HIGHSCHOOL_INCOMPLETE, HIGHSCHOOL_COMPLETE, GRADUATION_INCOMPLETE and GRADUATION_COMPLETE.

INCOMES: This attribute refers to the earnings of the client. It can only take the values ONE_MIMINUM_SALARY, TWO_MINIMUM_SALARIES and UPPER_THREE_MINIMUM_SALARIES.

DEFAULTER: This column holds the class label of each pattern. In this example the label shows whether the client is a bank defaulter (DEFAULTER=YES) or not (DEFAULTER=NO). To make this easier to see, the clients of the training set that are defaulters are marked in red and the clients that are not defaulters are marked in blue.

Let's apply Bayesian classification to a given unknown pattern. Based on the data shown in Figure 01, the goal is to predict the class label (DEFAULTER) of the new client shown in Figure 02 by using the Bayesian classifier.

Figure 02. The new client to be classified as DEFAULTER or NOT DEFAULTER.

Step 01: Evaluation of the class probabilities.

There are only two classes: one indicating that the client is a bank defaulter (DEFAULTER=YES) and another indicating that the client is not (DEFAULTER=NO). Calculating the probabilities of the classes, we have:

Probability of DEFAULTER=YES: 4/15 = 0,2667

Probability of DEFAULTER=NO: 11/15 = 0,7333

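Step 02: Calculation of the probabilities of the training set.

For the first attribute of the unknown data, GENDER=MALE, let's calculate the probability given DEFAULTER=YES:

Probability of GENDER=MALE given DEFAULTER=YES: 2/4 = 0,5

And for the case where the client is male and is not a defaulter, we have:

Probability of GENDER=MALE given DEFAULTER=NO: 4/11 = 0,3636

The same calculation is repeated for the remaining attributes of the unknown data.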

Step 03: Evaluation of the probabilities of the unknown data.

Multiplying the probabilities of the unknown data for the case DEFAULTER=YES by the prior probability of DEFAULTER=YES calculated in Step 01, we have:

0,5 x 0,25 x 0,25 x 0,25 x 0,2667 = 0,0021

Multiplying the probabilities of the unknown data for the case DEFAULTER=NO by the prior probability of DEFAULTER=NO calculated in Step 01, we have:

0,3636 x 0,5455 x 0,3636 x 0,3636 x 0,7333 = 0,0192

As 0,0192 > 0,0021, the algorithm classifies the unknown pattern as DEFAULTER=NO, that is, based on the previous data (the training set) and Bayesian classification, this new client is more likely not to become a bank defaulter than to become one.
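The arithmetic of this worked example can be checked with a few lines of Python. The per-attribute probabilities are copied from the products above, rounded as printed there; the variable names are only illustrative:

```python
# Values copied from the worked example above (rounded as printed there).
prior = {"YES": 4 / 15, "NO": 11 / 15}
likelihoods = {
    "YES": [0.5, 0.25, 0.25, 0.25],
    "NO": [0.3636, 0.5455, 0.3636, 0.3636],
}

scores = {}
for label, factors in likelihoods.items():
    score = prior[label]
    for p in factors:
        score *= p  # multiply by each per-attribute probability
    scores[label] = score

predicted = max(scores, key=scores.get)
for label, score in scores.items():
    print(f"DEFAULTER={label}: {score:.4f}")
print("Predicted class: DEFAULTER=" + predicted)
```

Rounded to four decimal places, the scores are 0.0021 for DEFAULTER=YES and 0.0192 for DEFAULTER=NO, so the predicted class is DEFAULTER=NO.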

To help classify such clients, let's use an implementation of the Bayesian classification algorithm that works only with attributes that have nominal (categorical) values. This implementation was written as the Python script 'bayesian_classify.py'.

The first argument of the script is the path of the data set file. The second argument is the list of columns used in the classification, separated by commas in a single string. The third argument is the column that holds the class labels. The fourth argument is the list of values of the unknown pattern, separated by commas and in the same order as the attributes passed in the second argument. Figure 03 shows the call of the script at the console, based on the example above.

The script has one more parameter. If it is passed as 0, the script returns the probabilities of all classes. If it is passed as 1, the script returns only the classification of the unknown data. Figure 04 shows the result of the script call presented in Figure 03.

>>> DEFAULTER=NO

Figure 03. Execution of the bayesian_classifier.py returning the classification.
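The original script is not reproduced in this article, so the following is only a hypothetical sketch of how a command line with the five parameters described above could be parsed using `argparse`. The argument names, the file name `clients.csv` and the sample values (other than GENDER=MALE) are placeholders, not the actual interface or data of the script:

```python
import argparse


def build_parser():
    """Hypothetical CLI with the five positional parameters described above."""
    parser = argparse.ArgumentParser(description="Naive Bayes CLI sketch")
    parser.add_argument("dataset", help="path of the data set file")
    parser.add_argument("columns", help="comma-separated attribute columns")
    parser.add_argument("class_column", help="column holding the class labels")
    parser.add_argument("sample", help="comma-separated values of the unknown pattern")
    parser.add_argument("mode", type=int, choices=[0, 1],
                        help="0: all class probabilities, 1: predicted class only")
    return parser


# Placeholder arguments, only to show the parsing.
args = build_parser().parse_args([
    "clients.csv",
    "GENDER,MARITAL_STATUS,EDUCATION,INCOMES",
    "DEFAULTER",
    "MALE,MARRIED,HIGHSCHOOL_COMPLETE,TWO_MINIMUM_SALARIES",
    "1",
])
columns = args.columns.split(",")
sample = args.sample.split(",")
print(columns, sample, args.mode)
```

Splitting the second and fourth arguments on commas keeps the attribute names and the sample values aligned by position, which is why they must be passed in the same order.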

However, some observations must be considered before using the Bayesian classifier. The training set must be correct and consistent, since a single row with a wrong value can compromise the final result. Another drawback appears when an attribute value is missing: its probability is then assigned 0, which zeroes the whole product and makes it difficult to classify certain samples correctly. Some techniques have been proposed to work around these problems, but they are outside the scope of this article.
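To make the zero-probability problem concrete, here is a small sketch. The function `estimate` and all counts are made up for illustration. Additive (add-one) smoothing is shown only as one commonly used workaround; it is an assumption of this sketch and is not part of the script presented in this article:

```python
def estimate(count, class_total, n_values, alpha=0.0):
    """Estimate P(value | class); alpha > 0 enables additive smoothing."""
    return (count + alpha) / (class_total + alpha * n_values)


# Hypothetical counts: the value never occurs among the 4 patterns of a class,
# and the attribute has 2 possible values.
raw = estimate(0, 4, 2)            # plain relative frequency
smoothed = estimate(0, 4, 2, 1.0)  # add-one smoothing

other_factors = 0.5 * 0.25         # made-up probabilities of the other attributes
print(raw * other_factors)         # the single zero wipes out the product
print(smoothed * other_factors)    # the product stays above zero
```

With the plain frequency, one unseen value is enough to make the score of the class exactly 0, no matter how strongly the other attributes point to it.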

To download the script of the Bayesian classification algorithm and the data set used in this article, click here.

Marcel Caraciolo

I am a brazilian data scientist, entrepreneur, python hacker and technology consultant. Nowadays I work with data-centric applications, specially in machine learning, recommender systems and bioinformatics. I am also interested in distributed computing, high performance and data visualization, educational and bioinformatics ventures.

Until 2013 I was the co-founder of two companies Atepassar.com, a social network for students in Brazil and co-founder of PyCursos, a on-line startup for python training and on-line courses. In 2014, I assumed a new position at Genomika Diagnósticos, a brazilian genetics tests laboratory, as CTO.