The Simplest Machine Learning Algorithm

It's easy to take a black box view of machine learning algorithms and apply them without knowing how they work. However, a deep understanding of an algorithm helps with selecting an approach, preprocessing the data, interpreting learnt models, and improving accuracy and efficiency. With this aim in mind, we introduce one of the simplest machine learning algorithms, ridge regression [1]. We outline its strengths and weaknesses, show how to implement it and then evaluate its empirical performance.

A Theoretical Interlude

To keep articles accessible we generally try to avoid mathematics in this blog, but sometimes it's necessary to get a better understanding of an algorithm or concept. Let's start with a definition of the data used for learning. We have \(n\) observations or examples \(\textbf{x}_1, \textbf{x}_2, \ldots, \textbf{x}_n\) and a corresponding set of labels \(y_1, y_2, \ldots, y_n\). Each example is a vector/array of numbers and the labels are scalars. For instance, each example could be a vector of properties of a person (age, education, gender, ethnicity, occupation) and the label could be their annual income. Let's say that there is some function linking \(\textbf{x}\) and \(y\) so that \(f^*(\textbf{x}) \approx y\). In general we can never recover \(f^*\) exactly (because of a finite number of examples and noisy data, amongst other things), but we can recover a function \(f\) which is a reasonable approximation of \(f^*\). The important aspect of \(f\) is that it makes good predictions for \(y\) not just on the training data, but on new unseen examples.

So here is one way of finding the function \(f\): try to minimise the error between the \(i\)th predicted and actual label using the square of the difference, i.e. \((y_i - \textbf{x}_i^T \textbf{w})^2\). Here we are choosing functions which are the dot product between the example \(\textbf{x}_i\) and a weight vector \(\textbf{w}\). This form is easy to study analytically, as we shall see later. Note that if \(\textbf{x}_i^T \textbf{w} = y_i\) then the error becomes zero for that example. Summing this error over all examples and adding a penalty on the size of the weights results in the following problem:

\[ \min_{\textbf{w}} \sum_{i=1}^n (y_i - \textbf{x}_i^T \textbf{w})^2 + \lambda \|\textbf{w}\|^2 \]

where \(\|\textbf{w}\|^2\) is the square of the norm of \(\textbf{w}\) and equal to \(\textbf{w}^T\textbf{w}\) (the squared magnitude of this vector), and \(\lambda\) is a user-defined regularisation parameter. This second term may not be completely obvious, but it is required to improve the generalisation ability of the learner: it penalises large weights, which discourages the model from fitting the noise in the training data.

We now put all of the examples and labels into a matrix \(\textbf{X}\) and vector \(\textbf{y}\). In this case the \(i\)th row of \(\textbf{X}\) is the \(i\)th example and similarly the \(i\)th element of \(\textbf{y}\) is the corresponding label. The solution to this optimisation is as follows:

\[ \textbf{w} = (\textbf{X}^T\textbf{X} + \lambda \textbf{I})^{-1} \textbf{X}^T \textbf{y} \]

where \(\textbf{I}\) is the identity matrix, i.e. one with 1s on the main diagonal and 0s elsewhere. This gives us the complete ridge regression algorithm: simply compute \(\textbf{w}\) as above and then make predictions using \(f(\textbf{x}) = \textbf{x}^T\textbf{w}\).
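As a quick sanity check on the closed-form solution, the sketch below computes \(\textbf{w}\) with NumPy on synthetic data and confirms that the gradient of the objective, \(-2\textbf{X}^T(\textbf{y} - \textbf{X}\textbf{w}) + 2\lambda\textbf{w}\), vanishes there. It also shows the weight norm shrinking as \(\lambda\) grows. The function names and the data are illustrative assumptions, and a linear solve is used in place of an explicit matrix inverse.

```python
import numpy as np


def ridge_weights(X, y, lmbda):
    """Closed-form ridge solution w = (X^T X + lmbda I)^{-1} X^T y."""
    d = X.shape[1]
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lmbda * np.eye(d), X.T @ y)


def ridge_gradient(X, y, w, lmbda):
    """Gradient of sum_i (y_i - x_i^T w)^2 + lmbda * ||w||^2 with respect to w."""
    return -2.0 * X.T @ (y - X @ w) + 2.0 * lmbda * w


# Synthetic data, used purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)

w = ridge_weights(X, y, lmbda=1.0)
grad_norm = np.linalg.norm(ridge_gradient(X, y, w, lmbda=1.0))
print("gradient norm at the solution:", grad_norm)

# A larger lmbda pulls the weight vector towards zero.
norms = [np.linalg.norm(ridge_weights(X, y, l)) for l in (0.01, 1.0, 100.0)]
print("weight norms for increasing lmbda:", norms)
```

The gradient norm should be zero up to floating point error, which is what being a minimiser of the objective requires.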

Implementation and Example

So let's now implement this algorithm in Python conforming to the scikit-learn standards. We create a RidgeRegression class as follows:

The class stores the regularisation parameter as lmbda (not lambda, as this is a Python keyword), which is initialised in the constructor. The fit method performs the training by computing the weight vector as given in the section above. The other two methods get and set the parameters, and are required by the scikit-learn GridSearchCV class for model selection. The computational complexity of training is cubic in the number of features of X due to the matrix inverse (forming \(\textbf{X}^T\textbf{X}\) additionally costs time proportional to the number of examples multiplied by the square of the number of features). Prediction is fast, however, and scales linearly in the number of features for each example.
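A minimal sketch of a class matching this description is given below. It is a reconstruction under stated assumptions rather than the original listing: the attribute name w, the default value of lmbda and the exact signatures of get_params and set_params are guesses, and np.linalg.solve is used instead of an explicit inverse.

```python
import numpy as np


class RidgeRegression:
    """Ridge regression via the closed-form solution (illustrative sketch)."""

    def __init__(self, lmbda=0.1):
        # Regularisation parameter; the default value is an assumption.
        self.lmbda = lmbda

    def fit(self, X, y):
        # w = (X^T X + lmbda I)^{-1} X^T y, computed with a linear solve.
        d = X.shape[1]
        A = X.T @ X + self.lmbda * np.eye(d)
        self.w = np.linalg.solve(A, X.T @ y)
        return self

    def predict(self, X):
        # One dot product per example: linear in the number of features.
        return X @ self.w

    def get_params(self, deep=True):
        return {"lmbda": self.lmbda}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


# Usage on a tiny noise-free problem: a very small lmbda recovers the weights.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = X_demo @ np.array([2.0, -1.0])
model = RidgeRegression(lmbda=1e-8).fit(X_demo, y_demo)
print(model.w)
```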

We will test out this algorithm on the red wine subset of the Wine Quality dataset. This dataset has a quality rating (0-10) for 1599 red wines based on 11 properties such as volatile acidity, chlorides, pH and alcohol. The data is loaded and processed as follows:

The scale function transforms the features (columns) of X so that they have zero mean and unit variance. We also require y to have zero mean because the ridge regression model above has no intercept term. We then split the data into a training set containing 80% of the examples and a test set containing the remaining 20%.
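The preprocessing just described might look roughly like the sketch below. The helper names load_wine and prepare, the path argument and the seed are hypothetical. The loader assumes a semicolon-separated file with one header row and the quality rating in the last column.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale


def load_wine(path):
    """Read a semicolon-separated file; assumes the last column is the label."""
    data = np.genfromtxt(path, delimiter=";", skip_header=1)
    return data[:, :-1], data[:, -1]


def prepare(X, y, test_size=0.2, seed=0):
    """Standardise the features, centre the labels and split 80/20."""
    X = scale(X)        # each column now has zero mean and unit variance
    y = y - y.mean()    # centred labels, since the model has no intercept
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```

A call such as `X_train, X_test, y_train, y_test = prepare(*load_wine(path))` would then produce the four arrays used in the rest of the example. Note that scaling before splitting, as described above, lets the test examples influence the scaling statistics. Fitting the scaling on the training set alone is a stricter alternative.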

In line 2 we perform model selection over lmbda for ridge regression using the GridSearchCV class and select the parameters with the minimum mean absolute error. We then retrain on the training data in line 4 and make some predictions for the test set, outputting the error as ridge_error. For comparison, we repeat this process with Support Vector Regression (SVR), a state-of-the-art regression algorithm.
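The sketch below shows one way this model selection step could be written. It is not the original listing: it runs on synthetic data standing in for the wine data, the parameter grids are arbitrary choices, and the ridge class here inherits get_params and set_params from scikit-learn's BaseEstimator instead of defining them by hand. The helper tune_and_score is a hypothetical name.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR


class RidgeRegression(RegressorMixin, BaseEstimator):
    """Closed-form ridge regression; parameter handling comes from BaseEstimator."""

    def __init__(self, lmbda=0.1):
        self.lmbda = lmbda

    def fit(self, X, y):
        A = X.T @ X + self.lmbda * np.eye(X.shape[1])
        self.w_ = np.linalg.solve(A, X.T @ y)
        return self

    def predict(self, X):
        return X @ self.w_


# Synthetic stand-in for the scaled wine data (11 features, centred labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))
y = X @ rng.normal(size=11) + 0.5 * rng.normal(size=300)
y = y - y.mean()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def tune_and_score(estimator, grid):
    """Grid search by cross-validated mean absolute error, then test error."""
    search = GridSearchCV(estimator, grid, scoring="neg_mean_absolute_error", cv=5)
    # With the default refit=True the best model is retrained on all training data.
    search.fit(X_train, y_train)
    return search.best_params_, mean_absolute_error(y_test, search.predict(X_test))


ridge_grid = {"lmbda": [0.01, 0.1, 1.0, 10.0, 100.0]}
svr_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}
ridge_params, ridge_error = tune_and_score(RidgeRegression(), ridge_grid)
svr_params, svr_error = tune_and_score(SVR(), svr_grid)
print("ridge:", ridge_params, ridge_error)
print("SVR:  ", svr_params, svr_error)
```

The scorer is the negated mean absolute error because GridSearchCV maximises its score. The errors printed here come from the synthetic data and are not comparable to the wine results.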

The resulting errors are 0.56 and 0.50 for ridge regression and the SVR respectively. We would generally expect the SVR to be more accurate than ridge regression, as the generalisation of the SVR is much better grounded in theory. However, ridge regression performs surprisingly well in this case relative to a considerably more complicated algorithm.

How can we improve upon ridge regression? One way is to solve for the weight vector in a different way. We already stated that computing the matrix inverse has a cubic complexity. A large speedup is possible using an optimisation approach called Stochastic Gradient Descent (SGD), which scikit-learn provides through its SGDRegressor class; the Ridge class also offers the closely related stochastic average gradient solvers. This also hints at the possibility of parallelising the whole algorithm for use on large datasets. Another way of improving ridge regression is by using the kernel trick, which allows one to efficiently model non-linear functions.
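To make the SGD idea concrete, here is a plain NumPy sketch that minimises the ridge objective one example at a time and compares the result with the closed-form solution. It is a textbook-style illustration, not scikit-learn's implementation. The function name ridge_sgd, the step size schedule and the synthetic data are all assumptions.

```python
import numpy as np


def ridge_sgd(X, y, lmbda, epochs=50, eta0=0.02, seed=0):
    """Minimise sum_i (y_i - x_i^T w)^2 + lmbda * ||w||^2 by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        eta = eta0 / (1.0 + epoch)  # step size decays after every pass
        for i in rng.permutation(n):
            residual = y[i] - X[i] @ w
            # Gradient of the i-th squared error plus a 1/n share of the penalty.
            grad = -2.0 * residual * X[i] + 2.0 * (lmbda / n) * w
            w -= eta * grad
    return w


# Synthetic data, used purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=500)

lmbda = 1.0
w_sgd = ridge_sgd(X, y, lmbda)
w_closed = np.linalg.solve(X.T @ X + lmbda * np.eye(3), X.T @ y)
print("SGD:        ", w_sgd)
print("closed form:", w_closed)
```

Each pass over the data costs time proportional to the number of examples multiplied by the number of features, with no matrix inverse involved. The price is an approximate solution whose quality depends on the step size schedule.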

Summary

We gave a complete description of ridge regression, perhaps one of the simplest machine learning algorithms. Beginning with its formulation, we gave its implementation in Python using just a few lines of code. Using the wine quality dataset we showed that its error (0.56) is only slightly worse than that of Support Vector Regression (0.50), a state-of-the-art approach.