Optimize Linear Regression Model with Intel DAAL

How does a company forecast how much it should spend on future advertising in order to increase sales? Similarly, how much should a company spend on future training programs in order to improve productivity?

In both cases, the company relies on the historical relationship between variables to predict the future outcome. In the first case, for example, that is the relationship between sales and advertising spending.

This is a typical regression problem, and one that machine learning [1] is well suited to solve.

This article describes a common type of regression analysis called linear regression [2] and how the Intel® Data Analytics Acceleration Library (Intel® DAAL) [3] helps optimize this algorithm when running it on systems equipped with Intel® Xeon® processors.

Linear regression (LR) is the most basic type of regression used for predictive analysis. LR models a linear relationship between variables and shows how one variable is affected by one or more other variables. The variable that is affected by the others is called the dependent, response, or outcome variable; the others are called independent, explanatory, or predictor variables.

Before using LR analysis, we need to check whether LR is suitable for the data set at hand. To do that, we observe how the data is distributed. Let's look at the following two graphs:

Figure 1: A straight line can be fitted through the data points.

Figure 2: A straight line cannot be fitted through these data points.

In Figure 1, we can fit a straight line through the data points; there is no way to do that with the data points in Figure 2, where only a curved (non-linear) line fits. Therefore, linear regression analysis can be done on the data set in Figure 1, but not on the one in Figure 2.
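Besides inspecting a plot, one rough numerical check is the Pearson correlation coefficient, which is close to +1 or -1 when the points lie near a straight line. The sketch below is an illustration only: the check itself is not part of the original article, and the data sets and the `pearson_r` helper are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: x runs from -5 to 5 in steps of 1.
xs = [float(i) for i in range(-5, 6)]
ys_linear = [2.0 * x + 1.0 for x in xs]  # points on a straight line
ys_curved = [x * x for x in xs]          # points on a parabola

r_linear = pearson_r(xs, ys_linear)
r_curved = pearson_r(xs, ys_curved)

print(f"r for the straight-line data: {r_linear:.3f}")
print(f"r for the curved data:        {r_curved:.3f}")
```

A coefficient near zero, as for the parabola here, signals that a straight line is a poor description of the data. A high coefficient alone does not prove linearity, so it complements a plot rather than replacing it.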

Depending on the number of independent variables, LR is divided into two types: simple linear regression (SLR) and multiple linear regression (MLR).

LR is called SLR when there is only one independent variable, and MLR when there is more than one.
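
The simplest form of the equation, with one dependent and one independent variable, is:

y = Ax + B    (1)

where y is the dependent variable, x is the independent variable, A is the slope of the line, and B is the intercept.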

The question is how to find the best-fit line, that is, the line for which the difference between the observed and predicted values of the dependent variable (y) is smallest. In other words, find A and B in equation (1) such that |y_observed - y_predicted| is as small as possible over all data points.

The best-fit line can be found using the least squares method [4]. It calculates the best-fit line by minimizing the sum of the squares of the vertical differences between each data point and the line, that is, the differences between the observed and predicted values of y. Each vertical difference is also called a residual.
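For a single independent variable, the least squares problem has a well-known closed-form solution. The following is a minimal sketch in plain Python, assuming equation (1) has the form y = Ax + B with A as the slope and B as the intercept. The function names and the advertising figures are hypothetical, and this is not how Intel DAAL implements the algorithm.

```python
def fit_line(xs, ys):
    """Least squares fit of y = A*x + B; returns (A, B)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical differences between the data and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: advertising spend and the resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

A, B = fit_line(ad_spend, sales)
best = sum_squared_residuals(ad_spend, sales, A, B)
nudged = sum_squared_residuals(ad_spend, sales, A + 0.1, B)

print(f"A (slope) = {A:.4f}, B (intercept) = {B:.4f}")
print(f"sum of squared residuals at the fit:     {best:.4f}")
print(f"sum of squared residuals, slope nudged:  {nudged:.4f}")
```

Nudging the slope away from the fitted value increases the sum of squared residuals, which is what it means for the fitted line to be the least squares solution.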

Figure 3: Linear regression graph.

In Figure 3, the green dots represent the actual data points. The black dots show the vertical (not perpendicular) projection of the data points onto the regression line (the red line); they are also called the predicted data. The vertical difference between an actual data point and its predicted point is called the residual.
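To make the distinction between the vertical and the perpendicular projection concrete, the sketch below computes both distances for one point. The line coefficients and the observed point are hypothetical values chosen for easy arithmetic, not data from the article.

```python
import math


def vertical_residual(x, y, slope, intercept):
    """Observed y minus the predicted y at the same x."""
    return y - (slope * x + intercept)


def perpendicular_distance(x, y, slope, intercept):
    """Shortest distance from the point (x, y) to the line y = slope*x + intercept."""
    return abs(slope * x - y + intercept) / math.sqrt(slope ** 2 + 1.0)


slope, intercept = 2.0, 1.0   # hypothetical regression line
x_obs, y_obs = 3.0, 9.0       # hypothetical observed data point

y_pred = slope * x_obs + intercept   # the predicted point on the line
residual = vertical_residual(x_obs, y_obs, slope, intercept)
distance = perpendicular_distance(x_obs, y_obs, slope, intercept)

print(f"predicted y:            {y_pred}")
print(f"vertical residual:      {residual}")
print(f"perpendicular distance: {distance:.4f}")
```

The least squares method works with the vertical residual because the goal is to predict y for a given x, so only the error in the y direction matters.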

Applications of Linear Regression

Some of the applications that can make good use of linear regression:

Forecasting future sales.

Analyzing the effect of marketing and pricing on the sales of a product.

Assessing risk in financial services or insurance domains.

Studying engine performance from test data in automobiles.

Advantages and Disadvantages of Linear Regression

Some of the pros and cons of LR:

Advantages

LR gives its best results when the relationship between the independent and dependent variables is almost linear.

Disadvantages

LR is very sensitive to outliers.

It is not appropriate for modeling non-linear relationships.

Linear regression is limited to predicting numeric output.
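The sensitivity to outliers can be seen in a small experiment: fit the same data twice, once as is and once with a single value replaced by an outlier. This is a sketch on hypothetical data using the standard closed-form least squares fit for one independent variable, not the Intel DAAL implementation.

```python
def fit_line(xs, ys):
    """Least squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical clean data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys_clean = [2.0 * x + 1.0 for x in xs]

# Same data, but the last observation is replaced by an outlier (13 -> 40).
ys_outlier = list(ys_clean)
ys_outlier[-1] = 40.0

slope_clean, intercept_clean = fit_line(xs, ys_clean)
slope_outlier, intercept_outlier = fit_line(xs, ys_outlier)

print(f"clean data:   slope = {slope_clean:.3f}, intercept = {intercept_clean:.3f}")
print(f"with outlier: slope = {slope_outlier:.3f}, intercept = {intercept_outlier:.3f}")
```

Because residuals are squared, one distant point contributes far more to the sum than the well-behaved points do, and the fitted line is pulled strongly toward it.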

Intel® Data Analytics Acceleration Library

Intel DAAL is a library of basic building blocks for data analytics and machine learning. These building blocks are highly optimized for the latest Intel® processors and their newest features. LR is one of the predictive algorithms that Intel DAAL provides. In this article, we use the Python* API of Intel DAAL to build a basic LR predictor. To install Intel DAAL, follow the instructions in How to install the Python Version of Intel DAAL in Linux* [5].

Linear regression is a very common predictive algorithm, and Intel DAAL provides an optimized implementation of it. By using Intel DAAL, developers can take advantage of new features in future generations of Intel Xeon processors without modifying their applications; they only need to link their applications to the latest version of Intel DAAL.