The training of a KNN model basically does nothing but memorizing all the training points and the associated labels, which is very cheap in computation but costly in storage. The prediction is implemented by finding the K nearest neighbors of the query point, and voting. Here K is a hyper-parameter for the algorithm. Smaller values for K give the model low bias but high variance; while larger values for K give low variance but high bias.

In SHOGUN, you can use CKNN to perform KNN learning. To construct a KNN machine, you must choose the hyper-parameter K and a distance function. Usually, we simply use the standard CEuclideanDistance, but in general, any subclass of CDistance could be used. For demonstration, in this tutorial we select a random subset of 1000 samples from the USPS digit recognition dataset, and run 2-fold cross validation of KNN with varying K.

Now the question is - is 97.30% accuracy the best we can do? While one would usually re-train KNN with different values for k here and likely perform Cross-validation, we just use a small trick here that saves us lots of computation time: When we have to determine the $K\geq k$ nearest neighbors we will know the nearest neigbors for all $k=1...K$ and can thus get the predictions for multiple k's in one step:

We have the prediction for each of the 13 k's now and can quickly compute the accuracies:

In [6]:

forkinxrange(13):print"Accuracy for k=%d is %2.2f%%"%(k+1,100*np.mean(multiple_k[:,k]==Ytest))

Accuracy for k=1 is 97.20%
Accuracy for k=2 is 97.20%
Accuracy for k=3 is 97.30%
Accuracy for k=4 is 96.70%
Accuracy for k=5 is 96.70%
Accuracy for k=6 is 96.60%
Accuracy for k=7 is 96.40%
Accuracy for k=8 is 96.50%
Accuracy for k=9 is 96.20%
Accuracy for k=10 is 96.20%
Accuracy for k=11 is 96.00%
Accuracy for k=12 is 96.30%
Accuracy for k=13 is 96.20%

Obviously applying KNN is very costly: for each prediction you have to compare the object against all training objects. While the implementation in SHOGUN will use all available CPU cores to parallelize this computation it might still be slow when you have big data sets. In SHOGUN, you can use Cover Trees to speed up the nearest neighbor searching process in KNN. Just call set_use_covertree on the KNN machine to enable or disable this feature. We also show the prediction time comparison with and without Cover Tree in this tutorial. So let's just have a comparison utilizing the data above:

Although simple and elegant, KNN is generally very resource costly. Because all the training samples are to be memorized literally, the memory cost of KNN learning becomes prohibitive when the dataset is huge. Even when the memory is big enough to hold all the data, the prediction will be slow, since the distances between the query point and all the training points need to be computed and ranked. The situation becomes worse if in addition the data samples are all very high-dimensional. Leaving aside computation time issues, k-NN is a very versatile and competitive algorithm. It can be applied to any kind of objects (not just numerical data) - as long as one can design a suitable distance function. In pratice k-NN used with bagging can create improved and more robust results.

In contrast to KNN - multiclass Support Vector Machines (SVMs) attempt to model the decision function separating each class from one another. They compare examples utilizing similarity measures (so called Kernels) instead of distances like KNN does. When applied, they are in Big-O notation computationally as expensive as KNN but involve another (costly) training step. They do not scale very well to cases with a huge number of classes but usually lead to favorable results when applied to small number of classes cases. So for reference let us compare how a standard multiclass SVM performs wrt. KNN on the mnist data set from above.

Let us first train a multiclass svm using a Gaussian kernel (kind of the SVM equivalent to the euclidean distance).