Improving Linear Models Using Explicit Kernel Methods

In this tutorial, we demonstrate how combining (explicit) kernel methods with linear models can drastically increase the latter's quality of predictions without significantly increasing training and inference times. Unlike dual kernel methods, explicit (primal) kernel methods scale well with the size of the training dataset, both in training/inference time and in memory requirements.

Intended audience: Even though we provide a high-level overview of concepts related to explicit kernel methods, this tutorial primarily targets readers who already have at least basic knowledge of kernel methods and Support Vector Machines (SVMs). If you are new to kernel methods, refer to either of the following sources for an introduction:

Currently, TensorFlow supports explicit kernel mappings for dense features only; TensorFlow will provide support for sparse features in a later release.

This tutorial uses tf.contrib.learn (TensorFlow's high-level Machine Learning API) Estimators for our ML models. If you are not familiar with this API, tf.estimator Quickstart is a good place to start. We will use the MNIST dataset. The tutorial consists of the following steps:

Load and prepare MNIST data for classification.

Construct a simple linear model, train it, and evaluate it on the eval data.

Replace the linear model with a kernelized linear model, re-train, and re-evaluate.

Load and prepare MNIST data for classification

Call the following utility function to load the MNIST dataset:

data = tf.contrib.learn.datasets.mnist.load_mnist()

The preceding method loads the entire MNIST dataset (containing 70K samples) and splits it into train, validation, and test data with 55K, 5K, and 10K samples respectively. Each split contains one numpy array for images (with shape [sample_size, 784]) and one for labels (with shape [sample_size, 1]). In this tutorial, we only use the train and validation splits to train and evaluate our models respectively.

In order to feed data to a tf.contrib.learn Estimator, it is helpful to convert it to Tensors. For this, we will use an input function which adds Ops to the TensorFlow graph that, when executed, create mini-batches of Tensors to be used downstream. For more background on input functions, check Building Input Functions with tf.contrib.learn. In this example, we will use the tf.train.shuffle_batch Op which, besides converting numpy arrays to Tensors, allows us to specify the batch_size and whether to randomize the input every time the input_fn Ops are executed (randomization typically expedites convergence during training). The full code for loading and preparing the data is shown in the snippet below. In this example, we use mini-batches of size 256 for training and the entire sample (5K entries) for evaluation. Feel free to experiment with different batch sizes.
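To make the batching behaviour concrete, here is a minimal pure-Python sketch of shuffled mini-batching. It is not tf.train.shuffle_batch, which draws from a queue inside the TensorFlow graph; the helper name shuffled_minibatches and the toy data are hypothetical and only illustrate the idea of randomizing the order and slicing it into batches of a fixed size.

```python
import random


def shuffled_minibatches(images, labels, batch_size, seed=None):
    """Yield (images_batch, labels_batch) pairs covering one shuffled pass."""
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    rng = random.Random(seed)
    order = list(range(len(images)))
    rng.shuffle(order)  # randomize the sample order for this pass
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [images[i] for i in idx], [labels[i] for i in idx]


# Toy stand-in for MNIST (an assumption for illustration):
# 10 "images" of 4 pixels each, where image i is filled with the value i.
toy_images = [[float(i)] * 4 for i in range(10)]
toy_labels = list(range(10))

batches = list(shuffled_minibatches(toy_images, toy_labels, batch_size=4, seed=0))
print([len(lbls) for _, lbls in batches])  # prints [4, 4, 2]
```

The last batch is smaller because 10 is not a multiple of 4. Each image stays paired with its label because both are indexed with the same shuffled order.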

Training a simple linear model

We can now train a linear model over the MNIST dataset. We will use the tf.contrib.learn.LinearClassifier estimator with 10 classes representing the 10 digits. The input features form a 784-dimensional dense vector which can be specified as follows:

In addition to experimenting with the (training) batch size and the number of training steps, there are a couple of other parameters that can be tuned as well. For instance, you can change the optimization method used to minimize the loss by explicitly selecting another optimizer from the collection of available optimizers. As an example, the following code constructs a LinearClassifier estimator that uses the Follow-The-Regularized-Leader (FTRL) optimization strategy with a specific learning rate and L2-regularization.

Regardless of the values of the parameters, the accuracy a linear model can achieve on this dataset tops out at around 93%.

Using explicit kernel mappings with the linear model

The relatively high error (~7%) of the linear model over MNIST indicates that the input data is not linearly separable. We will use explicit kernel mappings to reduce the classification error.

Intuition: The high-level idea is to use a non-linear map to transform the input space to another feature space (of possibly higher dimension) where the (transformed) features are (almost) linearly separable and then apply a linear model on the mapped features. This is shown in the following figure:
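The intuition can be checked on a toy problem. The following sketch uses made-up one-dimensional data (not MNIST) in which the positive class lies far from the origin on both sides. In one dimension a linear classifier is a threshold rule, so the hypothetical helper best_threshold_accuracy tries every useful threshold. No threshold separates the raw values, but after the non-linear map x -> x*x a single threshold does.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of any rule 'predict 1 iff value > t', or its negation."""
    n = len(values)
    sorted_vals = sorted(set(values))
    # Candidate thresholds: below the minimum, between neighbours, above the maximum.
    thresholds = [sorted_vals[0] - 1.0]
    thresholds += [(a + b) / 2.0 for a, b in zip(sorted_vals, sorted_vals[1:])]
    thresholds.append(sorted_vals[-1] + 1.0)
    best = 0.0
    for t in thresholds:
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        # Also consider the flipped rule 'predict 1 iff value <= t'.
        best = max(best, correct / n, (n - correct) / n)
    return best


xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, 0, 0, 1]  # class 1 means "far from the origin"

raw_acc = best_threshold_accuracy(xs, ys)
mapped_acc = best_threshold_accuracy([x * x for x in xs], ys)
print(raw_acc, mapped_acc)  # prints 0.75 1.0
```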

Technical details

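In this example we will use Random Fourier Features, introduced in the "Random Features for Large-Scale Kernel Machines" paper by Rahimi and Recht, to map the input data. Random Fourier Features map a vector \(\mathbf{x} \in \mathbb{R}^d\) to \(\mathbf{x'} \in \mathbb{R}^D\) via the following mapping:

\[ \mathbf{x'} = \sqrt{\frac{2}{D}} \cos(\mathbf{\Omega} \mathbf{x} + \mathbf{b}) \]

where \(\mathbf{\Omega} \in \mathbb{R}^{D \times d}\), \(\mathbf{b} \in \mathbb{R}^D\), and the cosine is applied element-wise. In the construction from that paper, the entries of \(\mathbf{\Omega}\) are sampled from a zero-mean Gaussian with standard deviation \(1/\sigma\) and the entries of \(\mathbf{b}\) uniformly from \([0, 2\pi]\), so that the inner product of two mapped vectors satisfies:

\[ \mathbf{x'}^\top \mathbf{y'} \approx e^{-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2 \sigma^2}} \]

The right-hand-side quantity of the expression above is known as the RBF (or Gaussian) kernel function. This function is one of the most widely used kernel functions in Machine Learning and implicitly measures similarity in a different, much higher dimensional space than the original one. See Radial basis function kernel for more details.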
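As a rough illustration, the sketch below implements Random Fourier Features in plain Python and compares the inner product of two mapped vectors with the exact RBF kernel value. It assumes the standard construction from the Rahimi and Recht paper: features sqrt(2/D) * cos(Omega x + b), with Gaussian entries of standard deviation 1/stddev in Omega and uniform phases b in [0, 2*pi]. The helper names and the test vectors are hypothetical, and this is not the TensorFlow implementation.

```python
import math
import random


def make_rff_mapper(input_dim, output_dim, stddev, seed=0):
    """Return a function mapping a length-input_dim list to output_dim features."""
    rng = random.Random(seed)
    # Omega entries ~ N(0, 1/stddev^2); b entries ~ Uniform[0, 2*pi].
    omega = [[rng.gauss(0.0, 1.0 / stddev) for _ in range(input_dim)]
             for _ in range(output_dim)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(output_dim)]
    scale = math.sqrt(2.0 / output_dim)

    def mapper(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + bj)
                for row, bj in zip(omega, b)]

    return mapper


def rbf_kernel(x, y, stddev):
    """Exact RBF (Gaussian) kernel value for two vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * stddev ** 2))


x = [0.5, -1.0, 2.0]
y = [1.0, 0.0, 1.5]
mapper = make_rff_mapper(input_dim=3, output_dim=2000, stddev=2.0)

approx = sum(a * c for a, c in zip(mapper(x), mapper(y)))
exact = rbf_kernel(x, y, stddev=2.0)
print("approximate:", approx, "exact:", exact)
```

The same mapper must be applied to both vectors, since the approximation relies on them sharing Omega and b. The two printed values should be close but not identical, because the mapping is random.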

Kernel classifier

tf.contrib.kernel_methods.KernelLinearClassifier is a pre-packaged tf.contrib.learn estimator that combines the power of explicit kernel mappings with linear models. Its constructor is almost identical to that of the LinearClassifier estimator with the additional option to specify a list of explicit kernel mappings to be applied to each feature the classifier uses. The following code snippet demonstrates how to replace LinearClassifier with KernelLinearClassifier.

The only additional parameter passed to KernelLinearClassifier is a dictionary from feature_columns to a list of kernel mappings to be applied to the corresponding feature column. The following lines instruct the classifier to first map the initial 784-dimensional images to 2000-dimensional vectors using random Fourier features and then learn a linear model on the transformed vectors:

Notice the stddev parameter. This is the standard deviation (\(\sigma\)) of the approximated RBF kernel and controls the similarity measure used in classification. stddev is typically determined via hyperparameter tuning.

The results of running the preceding code are summarized in the following table. We can further increase the accuracy by increasing the output dimension of the mapping and tuning the standard deviation.

| metric | value |
| --- | --- |
| loss | 0.10 |
| accuracy | 97% |
| training time | ~35 seconds on my machine |

stddev

The classification quality is very sensitive to the value of stddev. The following table shows the accuracy of the classifier on the eval data for different values of stddev. Among the values tested, stddev=5.0 gives the best accuracy. Notice how stddev values that are too small or too large can dramatically decrease the accuracy of the classification.

| stddev | eval accuracy |
| --- | --- |
| 1.0 | 0.1362 |
| 2.0 | 0.4764 |
| 4.0 | 0.9654 |
| 5.0 | 0.9766 |
| 8.0 | 0.9714 |
| 16.0 | 0.8878 |
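One way to build intuition for this sensitivity is to look at the exact RBF kernel on toy points (hypothetical values, not MNIST). With a very small stddev, every pair of distinct points looks dissimilar. With a very large stddev, every pair looks almost identical. Only an intermediate value separates near neighbours from distant points. This sketch illustrates the similarity measure only and does not reproduce the accuracies reported above.

```python
import math


def rbf_kernel(x, y, stddev):
    """Exact RBF (Gaussian) kernel value for two vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * stddev ** 2))


# Three toy points: `near` is close to `anchor`, `far` is not.
anchor = [0.0, 0.0]
near = [1.0, 0.0]
far = [6.0, 0.0]

results = {}
for stddev in (0.1, 2.0, 100.0):
    k_near = rbf_kernel(anchor, near, stddev)
    k_far = rbf_kernel(anchor, far, stddev)
    results[stddev] = (k_near, k_far)
    print(f"stddev={stddev}: k(anchor, near)={k_near:.4f}  k(anchor, far)={k_far:.4f}")
```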

Output dimension

Intuitively, the larger the output dimension of the mapping, the more closely the inner product of two mapped vectors approximates the kernel, which typically translates to better classification accuracy. Another way to think about this is that the output dimension equals the number of weights of the linear model; the larger this dimension, the more "degrees of freedom" the model has. However, beyond a certain threshold, higher output dimensions increase the accuracy by very little while making training take longer. This is shown in the following two figures, which depict the eval accuracy as a function of the output dimension and of the training time, respectively.
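The first claim, that a larger output dimension gives a closer approximation of the kernel, can be checked numerically. The sketch below assumes the standard Random Fourier Feature construction (cosine features scaled by sqrt(2/D), Gaussian Omega with standard deviation 1/stddev, uniform phases) and measures the mean absolute error against the exact RBF kernel on random toy pairs. All names and data are hypothetical, and it does not measure classification accuracy or training time.

```python
import math
import random


def rff_features(x, omega, b):
    """Map x to len(omega) random Fourier features."""
    scale = math.sqrt(2.0 / len(omega))
    return [scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(omega, b)]


def mean_abs_kernel_error(output_dim, pairs, stddev, rng):
    """Mean |approximate kernel - exact RBF kernel| over the given pairs."""
    input_dim = len(pairs[0][0])
    omega = [[rng.gauss(0.0, 1.0 / stddev) for _ in range(input_dim)]
             for _ in range(output_dim)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(output_dim)]
    total = 0.0
    for x, y in pairs:
        fx, fy = rff_features(x, omega, b), rff_features(y, omega, b)
        approx = sum(p * q for p, q in zip(fx, fy))
        sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        exact = math.exp(-sq_dist / (2.0 * stddev ** 2))
        total += abs(approx - exact)
    return total / len(pairs)


rng = random.Random(0)
# 20 random pairs of 5-dimensional toy vectors.
pairs = [([rng.uniform(-1, 1) for _ in range(5)],
          [rng.uniform(-1, 1) for _ in range(5)]) for _ in range(20)]

errors = {dim: mean_abs_kernel_error(dim, pairs, stddev=1.0, rng=rng)
          for dim in (10, 100, 2000)}
for dim, err in errors.items():
    print(f"output_dim={dim}: mean abs error={err:.4f}")
```

The error is expected to shrink roughly like one over the square root of the output dimension, which is consistent with the diminishing returns described above.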

Summary

Explicit kernel mappings combine the predictive power of nonlinear models with the scalability of linear models. Unlike traditional dual kernel methods, explicit kernel methods can scale to millions or hundreds of millions of samples. When using explicit kernel mappings, consider the following tips:

Random Fourier Features can be particularly effective for datasets with dense features.

The parameters of the kernel mapping are often data-dependent. Model quality can be very sensitive to these parameters. Use hyperparameter tuning to find the optimal values.

If you have multiple numerical features, concatenate them into a single multi-dimensional feature and apply the kernel mapping to the concatenated vector.
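The last tip can be sketched as follows. The helper concatenate_features and the feature names are hypothetical. It flattens several numerical features of one example into a single vector in a fixed order, so that one kernel mapping can be applied to the combined vector.

```python
def concatenate_features(example, feature_names):
    """Flatten the named numerical features of one example into a single vector."""
    vector = []
    for name in feature_names:
        value = example[name]
        # Accept both scalars and sequences of numbers.
        if isinstance(value, (list, tuple)):
            vector.extend(float(v) for v in value)
        else:
            vector.append(float(value))
    return vector


example = {"age": 42, "height_cm": 180.5, "pixel_stats": [0.1, 0.9]}
combined = concatenate_features(example, ["age", "height_cm", "pixel_stats"])
print(combined)  # prints [42.0, 180.5, 0.1, 0.9]
```

Passing the feature names explicitly keeps the order of the concatenated vector identical across examples, which the mapping needs in order to treat each position consistently.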