Oleksiy Dyagilev on computer science and related ..

This blog post will provide an introduction to using machine learning algorithms with InsightEdge. We will go through an exercise to predict mobile advertisement click-through rate with Avazu’s dataset.

Overview

There are several compensation models in the online advertising industry; probably the most notable is CPC (Cost Per Click), in which an advertiser pays a publisher when the ad is clicked.
Search engine advertising is one of the most popular forms of CPC. It allows advertisers to bid for ad placement in a search engine’s sponsored links when someone searches on a keyword that is related to their business offering.

For search engines like Google, advertising is one of the main sources of revenue. The challenge for the advertising system is to determine which ad should be displayed for each query the search engine receives.

The revenue a search engine can get is essentially:

revenue = bid * probability_of_click

The goal is to maximize the revenue for every search engine query. While the bid is a known value, the probability_of_click is not. Thus predicting the probability of a click becomes the key task.
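To see why the click probability matters as much as the bid, consider two candidate ads. This is an illustrative sketch only; the bids and probabilities are made-up numbers:

```python
# Illustrative only: ranking two candidate ads by expected revenue.
# The bids and click probabilities below are made-up numbers.

def expected_revenue(bid, p_click):
    """Expected CPC revenue for showing an ad once."""
    return bid * p_click

ad_a = expected_revenue(2.00, 0.01)   # high bid, unlikely click
ad_b = expected_revenue(0.50, 0.10)   # low bid, likely click

# The cheaper ad wins because its click probability is 10x higher.
best = max([("A", ad_a), ("B", ad_b)], key=lambda kv: kv[1])
print(best)  # ('B', 0.05)
```

A higher bid does not guarantee higher expected revenue, which is why the system must estimate the click probability for every ad.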

Working on a machine learning problem involves a lot of experimentation with feature selection, feature transformation, training different models and tuning parameters. While there are a few excellent machine learning libraries for Python and R, such as scikit-learn, their capabilities are typically limited to relatively small datasets that fit on a single machine.

With large datasets and/or CPU-intensive workloads, you may want to scale out beyond a single machine. This is one of the key benefits of InsightEdge: it can scale both the computation and the data storage layers across many machines in one unified cluster.

Let’s now calculate the CTR (click-through rate) of the dataset. The click-through rate is the number of times a click is made on the advertisement divided by the total impressions (the number of times an advertisement was served):

The CTR is 0.169 (or 16.9%), which is quite high; the common value in the industry is about 0.2-0.3%. The high value is probably because non-clicks and clicks were subsampled according to different strategies, as stated by Avazu.
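The computation itself is straightforward. A minimal sketch, assuming rows with a binary `click` column (1 = clicked, 0 = not clicked) as in the Avazu dataset:

```python
# A minimal sketch of the CTR computation over a binary `click` column.

def ctr(clicks):
    """Click-through rate: clicks divided by total impressions."""
    if not clicks:
        return 0.0
    return sum(clicks) / len(clicks)

sample = [1, 0, 0, 1, 0, 0]   # toy data: 2 clicks out of 6 impressions
print(round(ctr(sample), 3))  # 0.333
```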

Now, the question is which features should we use to create a predictive model? This is a difficult question that requires a deep knowledge of the problem domain. Let’s try to learn it from the dataset we have.

For example, let’s explore the device_conn_type feature. Our assumption might be that this is a categorical variable with values like Wi-Fi, 2G, 3G or LTE. This might be a relevant feature, since clicking on an ad over a slow connection is not common.

We see that some features have a lot of unique values; for example, device_ip has 6M+ distinct values.
Machine learning algorithms are typically defined in terms of numerical vectors rather than categorical values. Converting such high-cardinality categorical features into numerical form will result in very high-dimensional vectors, which can be prohibitively expensive.
We will need to deal with this later.

Processing and transforming the data

Looking further at the dataset, we can see that the hour feature is in YYMMDDHH format.
To allow the predictive model to learn effectively from this feature, it makes sense to split it into separate features: year, month, day and hour.
Let’s develop the function to transform the dataframe:
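The transformation itself is applied to the dataframe, but its core can be sketched in plain Python (assuming hour values come as strings like "14102123", i.e. 2014-10-21, hour 23; the `time_*` field names match the `time_day` column used later):

```python
# A sketch of the `hour` transformation, assuming YYMMDDHH-formatted strings.

def split_hour(yymmddhh):
    """Split a YYMMDDHH timestamp into year, month, day and hour fields."""
    return {
        "time_year": 2000 + int(yymmddhh[0:2]),
        "time_month": int(yymmddhh[2:4]),
        "time_day": int(yymmddhh[4:6]),
        "time_hour": int(yymmddhh[6:8]),
    }

print(split_hour("14102123"))
# {'time_year': 2014, 'time_month': 10, 'time_day': 21, 'time_hour': 23}
```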

Saving preprocessed data to the data grid

The entire training dataset contains 40M+ rows; it takes quite a long time to experiment with different algorithms and approaches, even in a clustered environment.
We want to sample the dataset and checkpoint it to the in-memory data grid that runs collocated with Spark.
This way we can:
* quickly iterate through different approaches
* restart the Zeppelin session or launch other Spark applications and pick up the dataset more quickly from memory

Since the training dataset contains data for 10 days, we can pick any day and sample it:


prepared.filter("time_day = 21").count()
res51: Long = 4122995

There are 4M+ rows for this day, which is about 10% of the entire dataset.

Now let’s save it to the data grid. This can be done with two lines of code:

The complete listing of the notebook can be found on GitHub. You can import it into Zeppelin and play with it on your own.

A simple algorithm

Now that we have the training and test datasets sampled, initially preprocessed and available in the data grid, we can close the Web Notebook and start experimenting with different techniques and algorithms by submitting Spark applications.

For our first baseline approach, let’s take a single feature, device_conn_type, and the logistic regression algorithm:

First, we load the training dataset from the data grid, which we prepared and saved earlier with the Web Notebook.

Then we use StringIndexer and OneHotEncoder to map a column of categories to a column of binary vectors. For example, with 4 categories of device_conn_type, an input value
of the second category would map to an output vector of [0.0, 1.0, 0.0] (Spark’s OneHotEncoder drops the last category by default, so 4 categories produce a 3-element vector).
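What these two stages do can be sketched in plain Python, ignoring Spark specifics (in particular, this sketch keeps all categories rather than dropping the last one as Spark’s OneHotEncoder does by default):

```python
# A pure-Python sketch of what StringIndexer + OneHotEncoder do.
from collections import Counter

def string_indexer(values):
    """Map each distinct category to an integer index, most frequent first."""
    ordered = [v for v, _ in Counter(values).most_common()]
    return {v: i for i, v in enumerate(ordered)}

def one_hot(value, index):
    """Encode a category as a binary vector of size len(index)."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec

conn_types = ["wifi", "3g", "wifi", "2g", "wifi", "lte", "3g"]
idx = string_indexer(conn_types)
print(one_hot("3g", idx))  # [0.0, 1.0, 0.0, 0.0] -- wifi is most frequent
```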

Then we convert the dataframe to an RDD[LabeledPoint], since LogisticRegressionWithLBFGS expects an RDD as its training parameter.
We train the logistic regression model and use it to predict clicks on the test dataset. Finally, we compute the metrics of our classifier by comparing the predicted labels with the actual ones.
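The headline metric here is AUROC. In Spark it comes from BinaryClassificationMetrics, but the definition itself is simple enough to sketch: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties count as half):

```python
# A minimal AUROC sketch via pairwise comparison of positive/negative scores.

def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.2]
labels = [1,   0,   1,   0]
print(auroc(scores, labels))  # 0.75
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why AUROC is a natural metric for comparing classifiers on imbalanced click data.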

You can see how the AUROC improves as we add more and more features. This comes at the cost of training time:

We didn’t include high-cardinality features such as device_ip and device_id, as they would blow up the feature vector size. One may consider applying techniques such as feature hashing
to reduce the dimensionality; we will leave that out of this blog post’s scope.
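The idea of feature hashing (the "hashing trick") can be sketched as follows. Instead of one vector component per distinct value, the value is hashed into a fixed number of buckets; the bucket count (2**18 here) and the sample device_ip value are arbitrary illustrative choices:

```python
# A sketch of feature hashing: map a high-cardinality categorical value
# to a fixed-size index space instead of one column per distinct value.
import hashlib

NUM_BUCKETS = 2 ** 18  # illustrative choice, not a recommendation

def hashed_index(feature_name, value, buckets=NUM_BUCKETS):
    """Deterministically map a (feature, value) pair to a bucket index."""
    key = f"{feature_name}={value}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % buckets

i = hashed_index("device_ip", "ddd2926e")  # sample value, arbitrary
assert 0 <= i < NUM_BUCKETS  # vector size is fixed regardless of cardinality
```

The trade-off is that distinct values can collide in the same bucket, but the feature vector stays a manageable, fixed size even for a 6M+-value feature like device_ip.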

Tuning algorithm parameters

Tuning algorithm parameters is a search problem. We will use the Spark Pipeline API with a grid search technique.
Grid search evaluates a model for each combination of algorithm parameters specified in a grid (not to be confused with the data grid).

The Pipeline API supports model selection using cross-validation. For each set of parameters it trains the given Estimator and evaluates it using the given Evaluator.
We will use BinaryClassificationEvaluator, which uses AUROC as its metric by default.
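Independent of the Spark API, the grid-search-with-cross-validation procedure can be sketched in a few lines. Here `train_and_score` is a hypothetical stand-in for training the Estimator on a fold and scoring it with the Evaluator, and the parameter names are illustrative:

```python
# A Spark-free sketch of grid search with k-fold cross-validation:
# every parameter combination is scored on k folds; best average wins.
import itertools

def grid_search(param_grid, folds, train_and_score):
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = sum(train_and_score(params, f) for f in range(folds)) / folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function: pretends regParam=0.01, iters=20 is optimal.
def fake_score(params, fold):
    return -abs(params["regParam"] - 0.01) - abs(params["iters"] - 20) / 100

grid = {"regParam": [0.1, 0.01, 0.001], "iters": [10, 20]}
best, _ = grid_search(grid, folds=3, train_and_score=fake_score)
print(best)  # {'iters': 20, 'regParam': 0.01}
```

Note that the cost grows multiplicatively: a 3x2 grid with 3 folds means 18 training runs, which is exactly where InsightEdge’s ability to scale the computation tier helps.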

Architecture

Summarizing the benefits of this design:

* real-time applications can execute any OLTP query (read, insert, update, delete) on training data that is immediately available to Spark analytical queries and machine learning routines. There is no need to build a complex ETL pipeline that extracts training data from an OLTP database with Kafka/Flume/HDFS. Besides the complexity, an ETL pipeline introduces unwanted latency that can be a stopper for reactive machine learning apps. With InsightEdge, Spark applications see the live data;

* the training data lives in the memory of the data grid, which acts as an extension of Spark memory. This way we can load the data more quickly;

* an in-memory data grid is a general-purpose, highly available and fault-tolerant storage. With support for ACID transactions and SQL queries, it can become the primary storage for the application;

* the InsightEdge stack is scalable in both the computation (Spark) and storage (data grid) tiers. This makes it attractive for large-scale machine learning.

Summary

In this blog post we demonstrated how to use machine learning algorithms with InsightEdge. We went through the typical stages:

* interactive data exploration with Zeppelin
* feature selection and transformation
* training predictive models
* calculating model metrics
* tuning parameters

Our goal was not to build a perfect predictive model, so there is plenty of room for improvement.

In the architecture section we discussed what a typical design may look like and what the benefits of using InsightEdge for machine learning are.

The Zeppelin notebook can be found here and the submittable Spark apps here.