K Nearest Neighbor – A data driven Machine Learning Algorithm

I'm Piyush Malhotra, a Delhilite who loves to dig Deep in the woods of Artificial Intelligence. I like to find new ways to solve not so new but interesting problems. Fitting new models to data and articulating new ways to manipulate and personify things is what I think my field is all about. When not working or playing with data, you'll find me in the gym or writing new blog posts.

April 7, 2019

Most supervised machine learning algorithms are model-based: given a set of training points, we find a function that maps input points to output labels. This means that once the training points have been used to fit this function, we can throw the data points away and keep only the function's parameters!

Today we will look past this model-driven approach and work on a data-driven machine learning algorithm – K Nearest Neighbor (KNN). What’s more, we will be going full Super Developer Mode and build it from scratch! I too love scikit-learn, but sometimes it’s fun to code. 😉

KNN Step-by-Step

KNN works on the principle of majority wins and similarity matters. What do I mean by that? Let’s understand how KNN works!

1. Instantiate a KNN object with a natural number k.

2. Provide the KNN object with the training data – .fit(X, y).

3. The .fit() method does only one thing: it creates a copy of all the training points and stores it in the KNN object.

4. Prediction:

   - A given test point is compared against every single point in the stored training set using a similarity metric.
   - A similarity metric is a mathematical function that takes two points – a test set point and a training set point – and returns a real number, which is larger when the two points are more similar and smaller when they are less similar.
   - Collect the K highest similarity values and the labels of the associated training points.
   - Find the majority label among these K points.
   - This label is our prediction result.

5. Repeat step 4 for all test points.

Remember: in mathematical terms, the similarity metric here is the inverse of a distance measure between two points.

```python
import numpy as np

class KNN:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.X = np.copy(X)
        self.y = np.copy(y)

    def distance(self, x_test):
        pass

    def predict(self, x_test):
        preds = []
        for index in range(x_test.shape[0]):
            # get similarities
            distances = self.distance(x_test[index])
            # Sort labels in ascending order of distances
            sorted_labels = self.y[np.argsort(distances)]
            # Take first k labels
            k_sorted_labels = sorted_labels[:self.k]
            # get unique labels and their counts
            unique_labels, counts = np.unique(k_sorted_labels, return_counts=True)
            # prediction is the label with the maximum count
            pred = unique_labels[np.argmax(counts)]
            preds.append(pred)
        return np.array(preds)
```

Now that we’ve got the basic structure of our class ready, let’s learn about the distance measures that play a major role in similarity metrics.

Distance Measures for KNN

Let’s look at 3 well-known distance measures:

Manhattan Distance

Euclidean Distance

Minkowski Distance

Manhattan Distance

Use this when there are features with vastly different ranges. It is the sum of the absolute differences between the features of the two points.

Euclidean Distance

This is the most frequently used distance function. Make sure to standardize the data so that each feature has an almost identical range. It is the square root of the sum of squared differences between the features of the two points.

Minkowski Distance

A distance function which generalizes the basic formulas of Manhattan and Euclidean distance. It is the pth root of the sum of the pth powers of the absolute differences between the features of the two points.

For,

p = 1, it is Manhattan distance

p = 2, it is Euclidean distance

We know that Euclidean distance gets distorted by features with highly varying ranges, while Manhattan distance is less affected by this. Using this information, we can tweak p to get something in between, or use a search algorithm to find a suitable p value for a given dataset.
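To make the relationship concrete, here is a small sketch (the two points are arbitrary values chosen for illustration) showing that the Minkowski distance reduces to Manhattan distance at p = 1 and Euclidean distance at p = 2:

```python
import numpy as np

# Two illustrative points (arbitrary values)
a = np.array([1.0, 3.0])
b = np.array([4.0, -1.0])

def minkowski(x, y, p):
    """pth root of the sum of pth powers of absolute feature differences."""
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

manhattan = np.sum(np.abs(a - b))          # 3 + 4 = 7
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16) = 5

print(minkowski(a, b, p=1))    # 7.0 – matches Manhattan
print(minkowski(a, b, p=2))    # 5.0 – matches Euclidean
print(minkowski(a, b, p=1.5))  # somewhere in between
```

Note how any p strictly between 1 and 2 gives a distance between the Manhattan and Euclidean values for the same pair of points.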

Let’s add this new knowledge to our already defined class.

```python
import numpy as np

class KNN:
    def __init__(self, k, d_metric='eucd', p=None):
        if d_metric == 'mind' and p is None:
            self.p = 1
        else:
            self.p = p
        self.k = k
        self.d_metric = d_metric
        self.d_metric_to_fn = {
            'eucd': self._euclidean_dist,
            'mand': self._manhattan_dist,
            'mind': self._minkowski_dist,
        }

    def fit(self, X, y):
        self.X = np.copy(X)
        self.y = np.copy(y)

    def _manhattan_dist(self, x_test):
        return np.sum(np.abs(self.X - x_test), axis=-1)

    def _euclidean_dist(self, x_test):
        sq_diff = (self.X - x_test) ** 2
        return np.sqrt(np.sum(sq_diff, axis=-1))

    def _minkowski_dist(self, x_test):
        abs_diff = np.abs(self.X - x_test)
        sum_p_diff = np.sum(abs_diff ** self.p, axis=-1)
        pth_root = sum_p_diff ** (1 / self.p)
        return pth_root

    def distance(self, x_test):
        return self.d_metric_to_fn[self.d_metric](x_test)

    def predict(self, x_test):
        preds = []
        for index in range(x_test.shape[0]):
            # get similarities
            distances = self.distance(x_test[index])
            # Sort labels in ascending order of distances
            sorted_labels = self.y[np.argsort(distances)]
            # Take first k labels
            k_sorted_labels = sorted_labels[:self.k]
            # get unique labels and their counts
            unique_labels, counts = np.unique(k_sorted_labels, return_counts=True)
            # prediction is the label with the maximum count
            pred = unique_labels[np.argmax(counts)]
            preds.append(pred)
        return np.array(preds)
```

Now that we’ve defined the KNN class, let’s visually compare how these distance functions work. But what will work as a perfect dataset? How about a dataset that represents a circular decision boundary? That’s fun. Here, let’s define it:
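The original dataset definition isn't reproduced here, but a minimal stand-in that matches the description – points in a square labeled by whether they fall inside a circle – might look like this (the radius, sample count, and seed are all arbitrary choices, not the post's original values):

```python
import numpy as np

rng = np.random.RandomState(42)  # arbitrary seed for reproducibility

n = 200
# Sample points uniformly in the square [-1.5, 1.5] x [-1.5, 1.5]
X = rng.uniform(-1.5, 1.5, size=(n, 2))
# Label 1 ("Inner Circle") if the point lies inside a circle of radius 0.75,
# otherwise 0 ("Outer Circle")
y = (np.sqrt(np.sum(X ** 2, axis=-1)) < 0.75).astype(int)
```

A circle makes a good stress test because no linear boundary separates the classes, so the result depends entirely on the distance function.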

We can certainly see a block of negative region around point (-0.5, 0.48) in the first image. Although it should have been a region of positive labels, the model was fooled due to the presence of nearby outliers. This is resolved in the second image via the use of Manhattan Distance.

Now, let’s see how Minkowski distance performs for different values of p.

```python
import matplotlib.pyplot as plt

def get_clf(p):
    # A wrapper function to get KNN with different p values
    return KNN(k=5, d_metric='mind', p=p)

def scatter_plot(X, y):
    # scatter plot of original data
    plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='s', label='Inner Circle')
    plt.scatter(X[y == 0, 0], X[y == 0, 1], marker='^', label='Outer Circle')

# We will be testing p values from 0.25 till 2,
# increasing each time by a step of 0.25
ps = np.arange(0.25, 2.25, 0.25)

plt.figure(dpi=300, figsize=(10, 14))
for i, p in enumerate(ps):
    # get a KNN clf
    clf = get_clf(p)
    # Fit data
    clf.fit(X, y)
    # plot it
    plt.subplot(4, 2, i + 1)
    plot_predictions(clf, [-1.75, 1.75, -1.75, 1.75])
    scatter_plot(X, y)
    plt.title('p={}'.format(p))
    plt.legend()
plt.show()
```

After looking at these distance functions, you might have a question. If not, then let me ask you: what if none of the pre-defined distance functions is suitable for our problem statement? Naturally, we might look into creating our own distance measures. But to do so, we must take into account the properties that a distance measure should follow!

Properties of Distance Measure for KNN

There are many well-known distance measures, but you can certainly define your own. While defining a distance measure, remember these necessary properties that it should follow (Deza & Deza, 2009):

Note: For using distance measures, the properties matter, not their names. So just relax and focus on what each property means.

Non-Negativity: The distance between two points is always greater than or equal to zero.

d(x, y) ≥ 0

Identity of indiscernibles: The distance between x and y is equal to zero if and only if x is equal to y.

d(x, y) = 0 iff x = y

Symmetry: The distance between x and y is equal to the distance between y and x.

d(x, y) = d(y, x)

Triangle inequality: Considering the presence of a third point z, the distance between x and y is always less than or equal to the sum of the distance between x and z and the distance between y and z.

d(x, y) ≤ d(x, z) + d(z, y)
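As a quick sanity check (a spot-check on arbitrary points, not a proof), we can verify all four properties numerically for the Euclidean distance:

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# Three arbitrary points chosen for illustration
x, y, z = np.array([1.0, 3.0]), np.array([4.0, -1.0]), np.array([0.0, 0.0])

assert euclidean(x, y) >= 0                                   # non-negativity
assert euclidean(x, x) == 0                                   # identity of indiscernibles
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, y) <= euclidean(x, z) + euclidean(z, y)   # triangle inequality
```

Running the same checks on a large batch of random points is a cheap way to catch a broken custom distance measure before plugging it into KNN.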

That concludes our discussion over distance measures. But we aren’t there yet!

A KNN classifier has two major players in the classification process – the distance measure and the hyperparameter k. While we have covered the former, we still need to work our way through the latter.

Significance of k in KNN

The k nearest data points to the point under observation have a huge role to play. Let’s see why by considering 2 scenarios:

Scenario 1: Majority Wins

Out of k closest data points, the majority of points of one class declares the label for the point under observation.

If k = 1 (aka Nearest Neighbor), classification might be wrong if the closest point is an outlier.

If k = any multiple of n, where n = number of classes, a no-majority situation can occur. Example: for 2-class classification, what if we have k = 8 and 4 points belong to each class?

Advice:

Try to keep k = any multiple of n, plus 1, where n = number of classes and the multiple isn’t 0. This gives us a tie-breaker.

Keep a validation set and a range of values of k, and select the value of k at which your performance metric starts to saturate on the validation set.
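A minimal sketch of that validation-set search might look like the following. The data, seed, and split sizes here are all made up for illustration, and `knn_predict` is a stripped-down stand-in for the post's KNN class:

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed

# Synthetic 2-class data: class 0 around (-1, -1), class 1 around (1, 1)
X = np.vstack([rng.randn(60, 2) - 1, rng.randn(60, 2) + 1])
y = np.array([0] * 60 + [1] * 60)

# Hold out a validation set
idx = rng.permutation(len(X))
train, val = idx[:90], idx[90:]

def knn_predict(X_train, y_train, X_test, k):
    """Plain majority-vote KNN with Euclidean distance."""
    preds = []
    for x in X_test:
        d = np.sqrt(np.sum((X_train - x) ** 2, axis=-1))
        k_labels = y_train[np.argsort(d)[:k]]
        labels, counts = np.unique(k_labels, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Odd k values only (a multiple of n, plus 1, for n = 2 classes)
for k in [1, 3, 5, 7, 9]:
    preds = knn_predict(X[train], y[train], X[val], k)
    acc = np.mean(preds == y[val])
    print(f"k={k}: validation accuracy = {acc:.2f}")
```

Pick the smallest k after which the validation accuracy stops improving meaningfully.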

Scenario 2: Distance Weighted Classification

What if you have k = 9 for a 2-class classification, out of which 4 points say True and 5 say False? According to Scenario 1, the prediction must be False! But what if the 4 closest points are all True? Should our prediction still be False? That’s where distance-weighted KNN comes into play!

The idea of distance weighted classification is a 3 step process:

Convert the distances of the closest k points to similarities. The closer the point, the higher the similarity.

similarity(x, y) = 1 / d(x, y)

Find the total similarity for each label among the closest k points.

The label with the highest total similarity will be our prediction.
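The 3-step process above can be sketched as a small helper function (`weighted_vote` is a hypothetical name, not part of the post's KNN class; the distances and labels at the bottom match the worked example that follows):

```python
import numpy as np

def weighted_vote(distances, labels, k):
    """Distance-weighted KNN vote: similarity = 1 / distance."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(distances)[:k]   # indices of the k closest points
    sims = 1.0 / distances[order]       # closer point -> higher similarity
    near_labels = labels[order]
    # Total similarity per label; the label with the highest total wins
    totals = {label: sims[near_labels == label].sum()
              for label in np.unique(near_labels)}
    return max(totals, key=totals.get)

d = [0.3605, 0.5, 0.509, 0.75, 0.728]
votes = ['False', 'True', 'False', 'True', 'True']
print(weighted_vote(d, votes, k=5))  # prints "False"
```

Even though True holds the simple majority (3 of 5), the two False points are close enough that their similarity total wins.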

Let’s try this with an example. We will be working with Euclidean distance.

Point under observation:

| x1 | x2 |
|----|----|
| 1  | 3  |

Top k (=5) points:

| S No. | x1  | x2   | Distance | Similarity (= 1/distance) | Label |
|-------|-----|------|----------|---------------------------|-------|
| 1     | 1.3 | 2.8  | 0.3605   | 2.7735                    | False |
| 2     | 1   | 2.5  | 0.5      | 2                         | True  |
| 3     | 1.1 | 3.5  | 0.509    | 1.9611                    | False |
| 4     | 1   | 3.75 | 0.75     | 1.3333                    | True  |
| 5     | 1.9 | 2.9  | 0.728    | 1.3736                    | True  |

Total similarity for True label = 4.7069

Total similarity for False label = 4.7346

Since the total similarity of the False label is higher, even though we have a majority of True labels, our prediction will be False.
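Tallying the similarity column from the table confirms the arithmetic:

```python
import numpy as np

# Similarity values taken from the table above
sim_true = np.array([2.0, 1.3333, 1.3736])   # rows 2, 4, 5
sim_false = np.array([2.7735, 1.9611])       # rows 1, 3

print(sim_true.sum())   # 4.7069
print(sim_false.sum())  # 4.7346 -> False wins despite being the minority
```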

This should give you a sense of how important the hyperparameter k is in the KNN algorithm.

With all said and done, you might still be very wary of using KNN in production. Let’s learn why!

Run Time Analysis of KNN

For model-based approaches, we usually have O(n×m) training time (.fit() – n examples/batches and m iterations) and O(1) time per prediction. The high training time is due to optimization algorithms like Gradient Descent, while the constant prediction time is due to a simple weighted sum of the input features.

In contrast to the model-based approach, data-driven KNN has O(1) training time and O(n) time per prediction. The former is due to the simple assignment operation in the fit(…) method, and the latter is due to computing the distance between every training point and the test point.

At time of deployment, our major focus is on prediction time. Thus, KNN raises a major issue of increased prediction time.

Although we can mitigate this with techniques like parallelization, these require more compute resources, making KNN an expensive approach.

This brings us to the end of the blog post. KNN is a powerful technique and can do marvels but requires more computation at time of prediction. Hope to see you next time. 🙂