April 7, 2018
machine-learningNearest Neighborregression

The basic Nearest Neighbor (NN) algorithm is simple and can be used for classification or regression. NN is a non-parametric approach and the intuition behind it is that similar examples should have similar outputs .

Given a training set, all we need to do to predict the output for a new example is to find the “most similar” example in the training set.

A slight variation of NN is k-NN where given an example we want to predict we find the k nearest samples in the training set.
The basic Nearest Neighbor algorithm does not handle outliers well, because it has high variance, meaning that its predictions can vary a lot depending on which examples happen to appear in the training set. The k Nearest Neighbor algorithm addresses these problems.

To do classification, after finding the nearest sample, take the most frequent label of their labels.
For regression, we can take the mean or median of the k neighbors, or we can solve a linear regression problem on the neighbors

Nonparametric methods are still subject to underfitting and overfitting, just like parametric methods.
In this case, 1-nearest neighbors is overfitting since it reacts too much to the outliers. High , on the other hand, would underfit.
As usual, cross-validation can be used to select the best value of .

Distance

The very word “nearest” implies a distance metric. How do we measure the distance from a query point to an example point ?

Typically, distances are measured with a Minkowski distance or norm, defined as:

With this is Euclidean distance and with it is Manhattan distance.
With Boolean attribute values, the number of attributes on which the two points differ is called the Hamming distance.

For our purposes we will adopt Euclidean distance and since our dataset is made of two attributes we can use the following function where .

Weighted distance

Instead of computing an average of the neighbors, we can compute a weighted average of the neighbors. A common way to do this is to weight each of the neighbors by a factor of , where is its distance from the test example.
The weighted average of neighbors is then , where is the distance of the th neighbor.

For our implementation, we chose to use weighted distance according to a paper1 which proposes another improvement to the basic k-NN where the weights to nearest neighbors are given based on Gaussian distribution.

If we are happy with an implementation that takes execution time, then that is the end of the story. If not, there are possible optimization using indexes based on additional data structures, i.e. k-d trees or hash tables, which I might write about in the future.

Strength and Weakness

k Nearest Neighbor estimation was proposed sixty years ago, but because of the need for large memory and computation, the approach was not popular for a long time. With advances in parallel processing and with memory and computation getting cheaper, such methods have recently become more widely used.
Unfortunately, it can still be quite computationally expensive when it comes to large training dataset as we need to compute the distance for each sample.
Some indexing (e.g. k-d tree) may reduce this cost.

Also, when we consider low-dimensional spaces and we have enough data, NN works very well in terms of accuracy, as we have enough nearby data points to get a good answer. As the number of dimensions rises the algorithm performs worst, this is due to the fact that the distance measure becomes meaningless when the dimension of the data increases significantly.

On the other hand, k-NN is quite robust to noisy training data, especially when a weighted distance is used.

Dataset

To test our k-NN implementation we will perform experiments using a version of the automobile dataset from the UC Irvine Repository. The problem will be to predict the miles per gallon (mpg) of a car, given its displacement and horsepower. Each example in the dataset corresponds to a single car.

Number of Instances: 291 in the training set, 100 in the test set
Number of Attributes: 2 continous input attributes, one continuous output

Predict

Using the data in the training set, we predicted the output for each example in the test, for , , and . Reported the squared error on the test set.
As we can see the test error goes down while increasing .