Custom R Components – Classification with the Naive Bayes Algorithm

The Naive Bayes algorithm is one of many methods of classification. For instance, you may want to derive from a past marketing campaign which prospects you should focus on in your next marketing activity. The algorithm can identify patterns in the type of contacts who have already purchased a certain product (i.e. their age, gender, income, etc.). You can then use this information for your next campaign and focus on the people who are most likely to be interested, so you spend your marketing budget where it is most effective.

SAP Predictive Analysis can use the Naive Bayes algorithm thanks to the ability to create Custom R Components. Within such a component, an expert user can encapsulate R script in an end-user-friendly format. With thousands of different methods available in R, this concept is extremely powerful. This article explains how to implement and use Naive Bayes.

Usage

Let’s try the Naive Bayes algorithm on some real-world data. The UC Irvine Machine Learning Repository kindly hosts a dataset with information taken from the 1994 US Census. The file, called Adult, contains anonymous information from over 32,000 people, listing their age, education, marital status and much more, including whether the person was earning over 50,000 US Dollars in the year 1994. We will use this information to create a model that we can apply to future data to determine whether a person is likely to earn more or less than these 50,000 USD.

You can follow the steps below if you download the above dataset. Before getting started, you may just have to add a first row with column names.
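If you would rather add the column names in R than edit the file by hand, a small base-R sketch of the idea follows. The column names are a CamelCase rendering of the dataset's documented fields (to match the names used in this article), and the two inline rows are just examples in the file's comma-separated format; for the real file you would point `read.csv` at your downloaded `adult.data`.

```r
# Two example rows in the format of the UCI Adult file (which has no header).
raw <- textConnection(
"39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K")

# Column names based on the dataset's documentation; 'Income' is the target.
cols <- c("Age", "Workclass", "Fnlwgt", "Education", "EducationNum",
          "MaritalStatus", "Occupation", "Relationship", "Race", "Sex",
          "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry",
          "Income")

# For the real file, replace textConnection(...) with "adult.data".
adult <- read.csv(raw, header = FALSE, col.names = cols,
                  strip.white = TRUE, stringsAsFactors = TRUE)

str(adult)
```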

Just load your data into SAP Predictive Analysis. You see some of the available columns. The ‘Income’ field on the right-hand side tells us whether the person was over or below the 50k threshold in that year. This column is called ‘TargetVariable’ in the screenshots below.

Now add the Naive Bayes Classifier component to your model. Further below you will find the details on how to add this logic to your own SAP Predictive Analysis installation.

Configure the component. You need to tell the component

– the Classifier Column: Income

– and the Predictor Columns: here you can pick Age, Occupation and HoursPerWeek to start.

Run the model, then go to the charts area. The table shows how many records were correctly and incorrectly classified: 24,263 people were correctly classified as earning less than 50,000 USD, and 556 people were correctly classified as high earners.
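Behind the scenes the component relies on the e1071 package (listed under R Libraries below), whose naiveBayes function implements the algorithm. A minimal standalone sketch of the same train-and-score cycle, using R's built-in iris dataset in place of the census data:

```r
library(e1071)  # provides naiveBayes()

# Train on iris: Species is the classifier column, the four
# measurements are the predictor columns.
model <- naiveBayes(Species ~ ., data = iris)

# Score the data and cross-tabulate actual vs. predicted, which yields
# the same kind of confusion matrix shown in the charts area.
pred <- predict(model, iris)
print(table(Actual = iris$Species, Predicted = pred))

# Overall share of correctly classified records.
mean(pred == iris$Species)
```

In the real component the scoring would of course run against a held-out testing sample rather than the training data itself.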

You can also save the trained model to further test it on data that is already classified. Or you can apply the model to new data for which the classification is actually unknown.

R Libraries

Please make sure you have the R libraries e1071 and gplots installed. The following document explains how to make new libraries available in SAP Predictive Analysis:

The component can be downloaded as a .spar file from GitHub. Then deploy it as described here: you just need to import it through the option “Import/Model Component”, which you will find by clicking the plus sign at the bottom of the list of available algorithms.

Disclaimer

Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.

Related Questions

In your description you take 70% to train your model and 30% to test it.

Have you seen any rule of thumb for this specific split?

Furthermore, in order to reduce bias in the data, how would you ensure that the data is picked randomly for both samples and not reused? If this were sales transactions, they could be presorted and carry a “timestamp” bias.

Looking at PA, the options are: First N, Last N, Every N, Simple Random or Systematic Random. However, how do we make sure that data is not reused across training, testing or validation, and that any presorting in the data does not skew the samples?

By the way, it would be nice to have a function to automatically control the process flow for training, testing and validation – right?

Often two thirds of the data are used for training a model and the remaining third is used for testing. To keep it simple, I did a straight 70% / 30% split and found that it worked quite nicely on this data.

Just as you say, you want to avoid having the same record in both the training and testing datasets. Here I used “First 70%” and “Last 30%” to achieve that. However, this requires the data to be randomly sorted. If this is not the case, like in your example, then a little custom R script could do the trick to randomly separate/flag the records.

Greetings
Andreas
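The random separation mentioned above could be scripted in base R along these lines – a sketch in which iris stands in for the actual dataset and the 70/30 ratio follows the discussion:

```r
set.seed(42)   # make the split reproducible

df <- iris     # stand-in for the input dataset

# Draw a random 70% of the row indices for training;
# everything else becomes the testing sample.
n         <- nrow(df)
train_idx <- sample(n, size = round(0.7 * n))

training <- df[train_idx, ]
testing  <- df[-train_idx, ]

# No row appears in both samples, regardless of any presorting.
nrow(training); nrow(testing)
```

Because the indices are drawn randomly, any “timestamp” ordering in the source data no longer determines which sample a record lands in.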

One small question: does this algorithm work only when the target is numerical? I have one column named Priority with 3 values – LOW, MEDIUM, HIGH. When I applied this algorithm to it, an error appeared.

I tested with iris and it worked fine. But the iris dataset has measures, on which the algorithm runs smoothly. My data has dimensions only, i.e. PRIORITY along with CREATION DATE over a year and some additional fields like who raised the incident, etc.

If I apply the HANA-based Naive Bayes algorithm it works on this data, but if I apply this extension an error comes up.

Custom R extensions currently have a limitation with dates. If you remove the date column, it might work.

Here is the comment from the release notes:

“You cannot use date columns as strings in the data that is passed to the custom R component. Therefore, we recommended to filter the date column from the dataset or use the as.date function in R script.”

I don’t know your use case, but a date as direct input for a classification seems unusual. Often dates are used to describe activity in relation to a point in time, e.g. “number of days since last contact with the customer”. Is that something that would make sense in your case?
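Turning a raw date column into such a “days since” variable can be done with the as.Date function mentioned in the release notes. A sketch with made-up incident data (the column names and dates are purely illustrative):

```r
# Hypothetical input: incidents with a creation date stored as text.
incidents <- data.frame(
  Priority     = c("LOW", "HIGH", "MEDIUM"),
  CreationDate = c("2013-01-15", "2013-06-01", "2013-03-20")
)

# Convert the text column to a Date and derive a numeric feature:
# days elapsed up to a fixed reference date.
reference <- as.Date("2013-12-31")
incidents$DaysSinceCreation <-
  as.numeric(reference - as.Date(incidents$CreationDate))

# The original date column can then be dropped before the data is
# passed to the custom R component, avoiding the date limitation.
incidents$CreationDate <- NULL
incidents
```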

Maybe you are aware of the Data Manager in Automated Mode that helps create such variables based on dates, amongst other things?

This custom R component was released before the HANA PAL included the Naive Bayes algorithm. Now that it is also available in HANA, this custom extension is probably most relevant for customers using data sources other than HANA.

I assume the core functionality is very similar, if not identical, but I haven’t tested it. The differences might rather be in the parameterisation. Out of the box, the algorithm from the HANA PAL offers more options than the component here, but there are additional parameters in R that could be exposed as well.

If possible, I’d clearly recommend using the PAL. If that is not possible or sufficient, this component might be an option.