K-nearest-neighbor (kNN) classification is one of the most fundamental and simple classification methods and should be one of the first choices for a classification study when there is little or no prior knowledge about the distribution of the data. K-nearest-neighbor classification was developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine. In an unpublished US Air Force School of Aviation Medicine report in 1951, Fix and Hodges introduced a non-parametric method for pattern classification that has since become known as the k-nearest-neighbor rule (Fix & Hodges, 1951). Later, in 1967, some of the formal properties of the k-nearest-neighbor rule were worked out; for instance, it was shown that for \(k = 1\) and \(n\rightarrow \infty\) the k-nearest-neighbor classification error is bounded above by twice the Bayes error rate (Cover & Hart, 1967). Once such formal properties of k-nearest-neighbor classification were established, a long line of investigation ensued, including new rejection approaches (Hellman, 1970), refinements with respect to Bayes error rate (Fukunaga & Hostetler, 1975), distance-weighted approaches (Dudani, 1976; Bailey & Jain, 1978), soft computing methods (Bermejo & Cabestany, 2000), and fuzzy methods (Jozwik, 1983; Keller et al, 1985).

Characteristics of kNN

Between-sample geometric distance

Figure 1: Voronoi tessellation showing Voronoi cells of 19 samples marked with a "+". The Voronoi tessellation reflects two characteristics of the example 2-dimensional coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge.

The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let \({\textbf x}_i\) be an input sample with \(p\) features \((x_{i1}, x_{i2}, \ldots, x_{ip})\ ,\) where \(n\) is the total number of input samples \((i=1,2,\ldots,n)\) and \(p\) is the total number of features \((j=1,2,\ldots,p)\ .\) The Euclidean distance between samples \({\textbf x}_i\) and \({\textbf x}_l\) \((l=1,2,\ldots,n)\) is defined as

\(
d({\textbf x}_i,{\textbf x}_l)=\sqrt{(x_{i1}-x_{l1})^2+(x_{i2}-x_{l2})^2+\cdots+(x_{ip}-x_{lp})^2}.
\)

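A graphic depiction of the nearest neighbor concept is given by the Voronoi tessellation (Voronoi, 1907) shown in Figure 1. The tessellation shows 19 samples marked with a "+", and the Voronoi cell, \(R\ ,\) surrounding each sample. Writing \(d(\cdot,\cdot)\) for the Euclidean distance, a Voronoi cell encapsulates all neighboring points that are nearest to each sample and is defined as

\(
R_i=\{\textbf{x}\in\mathbb{R}^p : d(\textbf{x},\textbf{x}_i)\leq d(\textbf{x},\textbf{x}_m),\ \forall\, m\neq i\},
\)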

where \(R_i \) is the Voronoi cell for sample \(\textbf{x}_i\ ,\) and \(\textbf{x}\) represents all possible points within Voronoi cell \(R_i \ .\) Voronoi tessellations primarily reflect two characteristics of a coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge. Using the latter characteristic, the k-nearest-neighbor classification rule is to assign to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties. The k = 1 rule is generally called the nearest-neighbor classification rule.
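The first Voronoi property can be illustrated with a short Python sketch: a point lies in the cell of whichever sample is closest to it in Euclidean distance. The function names are illustrative and do not come from the article, and ties are assumed to go to the lowest sample index.

```python
import math


def euclidean_distance(x_i, x_l):
    """Euclidean distance between two samples with the same number of features."""
    if len(x_i) != len(x_l):
        raise ValueError("samples must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_l)))


def voronoi_cell_index(x, samples):
    """Index i of the sample whose Voronoi cell R_i contains point x.

    Assumption: when x is equidistant from several samples (a cell edge),
    the lowest index is returned.
    """
    return min(range(len(samples)), key=lambda i: euclidean_distance(x, samples[i]))


samples = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(voronoi_cell_index((1.0, 1.0), samples))   # cell of the first sample
print(voronoi_cell_index((3.0, 0.5), samples))   # cell of the second sample
```

Looking up the containing cell in this way is the same operation as finding the single nearest neighbor of a point.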

Classification decision rule and confusion matrix

Classification typically involves partitioning samples into training and testing categories.
Let \(\textbf{x}_i\) be a training sample and \(\textbf{x}\) be a test sample, and
let \(\omega\) be the true class of a training sample and \(\hat{\omega}\) be the predicted class for a test sample
\((\omega,\hat{\omega}=1,2,\ldots,\Omega)\ .\) Here, \(\Omega\) is the total number of classes.

During the training process, we use only the true class \(\omega\) of each training sample to train the classifier, while during testing we predict the class \(\hat{\omega}\) of each test sample. It warrants noting that kNN is a "supervised" classification method in that it uses the class labels of the training data. Unsupervised classification methods, or "clustering" methods, on the other hand, do not employ the class labels of the training data.

With the 1-nearest-neighbor rule, the predicted class of test sample \(\textbf{x}\) is set equal to the true class \(\omega\) of its nearest neighbor, where \(\textbf{m}_i\) is a nearest neighbor to \(\textbf{x}\) if the distance

\(
d(\textbf{m}_i,\textbf{x})=\min_j\{d(\textbf{m}_j,\textbf{x})\}.
\)

For k-nearest neighbors, the predicted class of test sample \(\textbf{x}\) is set equal to the most frequent true class among its \(k\) nearest training samples. This forms the decision rule \(D:\textbf{x}\rightarrow \hat{\omega}\ .\)
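A minimal Python sketch of this decision rule follows. The function name, the argument layout, and the tie-breaking choice (the tied class whose sample is nearest wins) are assumptions made for illustration; the article does not prescribe how ties are resolved.

```python
import math
from collections import Counter


def knn_predict(x, train_X, train_y, k=3):
    """Predict the class of x as the most frequent label among its k nearest training samples."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training indices by Euclidean distance to x and keep the k closest.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    nearest = order[:k]
    votes = Counter(train_y[i] for i in nearest)
    top = max(votes.values())
    # Assumption: a tie is resolved in favour of the tied class with the nearest sample.
    for i in nearest:
        if votes[train_y[i]] == top:
            return train_y[i]


train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict((2, 2), train_X, train_y, k=3))
print(knn_predict((6, 5), train_X, train_y, k=1))
```

Setting `k=1` reduces the function to the nearest-neighbor rule.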

The confusion matrix used for tabulating test sample class predictions during testing is denoted as \(\textbf C\) and has dimensions \(\Omega \times \Omega\ .\) During testing, if the predicted class of test sample \(\textbf{x}\) is correct (i.e., \(\hat{\omega}=\omega\)), then the diagonal element \(c_{\omega \omega}\) of the confusion matrix is incremented by 1. However, if the predicted class is incorrect (i.e., \(\hat{\omega} \neq \omega\)), then the off-diagonal element \(c_{\omega \hat{\omega}}\) is incremented by 1. Once all the test samples have been classified, the classification accuracy is based on the ratio of the number of correctly classified samples to the total number of samples classified, given in the form

\(
Acc=\frac{\sum_{\omega}^{\Omega}c_{\omega\omega}}{n_{total}},
\)

where \(c_{\omega\omega}\) is a diagonal element of \(\textbf{C}\) and \(n_{total}\) is the total number of samples classified.
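A minimal Python sketch of this tabulation, in which accuracy is the sum of the diagonal of the confusion matrix divided by the number of samples classified. The function names are illustrative, and classes are assumed to be coded as integers 0 to \(\Omega-1\) rather than 1 to \(\Omega\) so that they can index the matrix directly.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Tabulate predictions: rows are true classes, columns are predicted classes."""
    C = [[0] * num_classes for _ in range(num_classes)]
    for w, w_hat in zip(true_labels, predicted_labels):
        # Correct predictions land on the diagonal, errors off the diagonal.
        C[w][w_hat] += 1
    return C


def accuracy(C):
    """Sum of diagonal elements divided by the total number of samples classified."""
    n_total = sum(sum(row) for row in C)
    return sum(C[j][j] for j in range(len(C))) / n_total


true_labels = [0, 0, 1, 1, 2]
predicted_labels = [0, 1, 1, 1, 0]
C = confusion_matrix(true_labels, predicted_labels, num_classes=3)
print(C)
print(accuracy(C))
```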

Figure 2 shows an X-Y scatterplot of the 19 samples plotted as a function of their \(X\) and \(Y\) values. One can notice that among the four samples closest to test sample \({\textbf x}_{11}\) (labeled green), 3/4 of the class labels are for Class A (red color), and therefore, the test sample is assigned to Class A.

Figure 2: X-Y Scatterplot of the 19 samples for which pairwise Euclidean distances are listed in Table 1. Among the 4 nearest neighbors of the test sample, the most frequent class label color is red, and thus the test sample is assigned to the red class.

Feature transformation

Increased performance of a classifier can sometimes be achieved when the feature values are transformed prior to classification analysis. Two commonly used feature transformations are standardization and fuzzification.

Standardization removes scale effects caused by use of features with different measurement scales. For example, if one feature is based on patient weight in units of kg and another feature is based on blood protein values in units of ng/dL in the range [-3,3], then patient weight will have a much greater influence on the distance between samples and may bias the performance of the classifier. Standardization transforms raw feature values into z-scores using the mean and standard deviation of a feature's values over all input samples, given by the relationship

\(
z_{ij}=\frac{x_{ij} - \mu_j}{\sigma_j},
\)

where \(x_{ij}\) is the value for the ith sample and jth feature, \(\mu_j\) is the average of all \(x_{ij}\) for feature j, and \(\sigma_j\) is the standard deviation of all \(x_{ij}\) over all input samples. If the feature values take on a Gaussian distribution, then the histogram of z-scores will represent a standard normal distribution having a mean of zero and variance of unity. Once standardization is performed on a set of features, the range and scale of the z-scores should be similar, provided the distributions of raw feature values are alike.
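The z-score transformation can be sketched in a few lines of Python. The function name is illustrative. The sketch assumes the population standard deviation (dividing by n); the article does not say whether the population or the sample form is intended. A constant feature is mapped to zero to avoid division by zero, which is also an assumption.

```python
import statistics


def standardize(X):
    """Column-wise z-scores for a list of samples (rows), each with p features."""
    columns = list(zip(*X))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: population standard deviation; the sample form divides by n - 1.
    sds = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in X
    ]


# Weight in kg and a second feature on a much smaller scale.
X = [[60.0, -1.0], [70.0, 0.0], [80.0, 1.0]]
for row in standardize(X):
    print(row)
```

After the transformation both columns share the same scale, so neither dominates the Euclidean distance.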

Fuzzification is a transformation which exploits uncertainty in feature values in order to increase classification performance. Fuzzification replaces the original features by mapping original values of an input feature into 3 fuzzy sets representing linguistic membership functions in order to facilitate the semantic interpretation of each fuzzy set (Klir and Yuan, 1995; Dubois and Prade, 2000; Pal and Mitra, 2004). First, determine \(x_{min}\) and \(x_{max}\) as the minimum and maximum values of \(x_{ij}\) for feature j over all input samples and \(q_1\) and \(q_2\) as the quantile values of \(x_{ij}\) at the 33rd and 66th percentile. Next, calculate the averages \(Avg_1=(x_{min}+q_1)/2\ ,\) \(Avg_2=(q_1+q_2)/2\ ,\) and \(Avg_3=(q_2+x_{max})/2\ .\) Next, translate each value of \(x_{ij}\) for feature j into 3 fuzzy membership values in the range [0,1] as \(\mu_{low,i,j}\ ,\) \(\mu_{med,i,j}\ ,\) and \(\mu_{high,i,j}\) using the relationships

The above computations result in 3 fuzzy sets (vectors) \(\boldsymbol{\mu}_{low,j}\ ,\) \(\boldsymbol{\mu}_{med,j}\) and \(\boldsymbol{\mu}_{high,j}\) of length n which replace the original input feature.
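The anchor values used by the fuzzification step (\(x_{min}\ ,\) \(x_{max}\ ,\) \(q_1\ ,\) \(q_2\) and the three averages) can be computed as in the following Python sketch. It covers only those anchor values, not the membership functions themselves. The function name is illustrative, and the percentile interpolation rule (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the text does not specify one.

```python
import statistics


def fuzzy_anchor_points(values):
    """Anchor values for one feature: extremes, 33rd/66th percentiles, and their midpoints."""
    # Assumption: percentiles are interpolated with the "inclusive" method.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    q1, q2 = cuts[32], cuts[65]  # 33rd and 66th percentiles
    x_min, x_max = min(values), max(values)
    return {
        "x_min": x_min,
        "x_max": x_max,
        "q1": q1,
        "q2": q2,
        "Avg1": (x_min + q1) / 2,
        "Avg2": (q1 + q2) / 2,
        "Avg3": (q2 + x_max) / 2,
    }


print(fuzzy_anchor_points(list(range(0, 101))))
```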

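The statistical significance of class discrimination for each jth feature can be assessed by using the F-ratio test, given as

\(
F(\Omega-1,n-\Omega)=\frac{\sum_{\omega=1}^{\Omega}n_\omega(\bar{y}_\omega-\bar{y})^2/(\Omega-1)}{\sum_{\omega=1}^{\Omega}\sum_{i=1}^{n_\omega}(y_{\omega i}-\bar{y}_\omega)^2/(n-\Omega)},
\)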

where \(n_\omega\) is the number of training samples in class \(\omega\) \((\omega=1,2,\ldots,\Omega)\ ,\) \(\bar{y}_\omega\) is the mean feature value among training samples in class \(\omega\ ,\) \(\bar{y}\) is the mean feature value for all training samples, \(y_{\omega i}\) is the feature value for the ith training sample in class \(\omega\ ,\) \((\Omega-1)\) is the numerator degrees of freedom, and \((n-\Omega)\) is the denominator degrees of freedom for the F-ratio test. Tail probabilities, i.e., \(Prob_j\ ,\) are derived for values of the F-ratio statistic based on the numerator and denominator degrees of freedom. A simple way to quantify simultaneously the total statistical significance of class discrimination for p independent features is to sum the negative natural logarithms of the feature-specific p-values using the form

\(
\textrm{sum[-log(p-value)]}=\sum_{j=1}^{p}-\ln(Prob_j).
\)

High values of sum[-log(p-value)] for a set of features (>1000) suggest that the feature values are heterogeneous across the classes considered and can discriminate classes well, whereas low values of sum[-log(p-value)] (<100) suggest poor discrimination ability of a feature.
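The following Python sketch computes the F-ratio for a single feature and the sum of negative log p-values. It assumes the standard one-way analysis-of-variance form of the F-ratio, with \((\Omega-1)\) and \((n-\Omega)\) degrees of freedom. The function names are illustrative. The tail probability itself is not computed here because it needs the F distribution; a statistics library could supply it (for example `scipy.stats.f.sf`), so the p-values are passed in directly.

```python
import math


def f_ratio(values, labels):
    """One-way ANOVA F-ratio of one feature across classes.

    Raises ZeroDivisionError if every class has zero within-class variation.
    """
    groups = {}
    for y, w in zip(values, labels):
        groups.setdefault(w, []).append(y)
    n, num_classes = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class sum of squares, weighted by class size.
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-class sum of squares around each class mean.
    within = sum((y - sum(g) / len(g)) ** 2 for g in groups.values() for y in g)
    return (between / (num_classes - 1)) / (within / (n - num_classes))


def sum_neg_log_p(p_values):
    """Sum of -ln(p) over feature-specific tail probabilities."""
    return sum(-math.log(p) for p in p_values)


print(f_ratio([1, 2, 3, 4, 5, 6], ["A", "A", "A", "B", "B", "B"]))
print(sum_neg_log_p([0.01, 0.001]))
```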

Performance assessment with cross-validation

A basic rule in classification analysis is that class predictions are not made for data samples that are used for training or learning. If class predictions are made for samples used in training or learning, the accuracy will be artificially biased upward. Instead, class predictions are made for samples that are kept out of the training process.

The performance of most classifiers is typically evaluated through cross-validation, which involves the determination of classification accuracy for multiple partitions of the input samples used in training. For example, during 5-fold \((\kappa=5)\) cross-validation training, a set of input samples is split up into 5 partitions \(\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_5\) having equal sample sizes to the extent possible. Ensuring uniform class representation among the partitions is called stratified cross-validation, which is preferred. To begin, for 5-fold cross-validation, samples in partitions \(\mathcal{D}_2, \mathcal{D}_3, \ldots, \mathcal{D}_5\) are first used for training while samples in partition \(\mathcal{D}_1\) are used for testing. Next, samples in partitions \(\mathcal{D}_1, \mathcal{D}_3, \ldots, \mathcal{D}_5\) are used for training and samples in partition \(\mathcal{D}_2\) are used for testing. This is repeated until each partition has been used singly for testing. It is also customary to re-partition all of the input samples, e.g. 10 times, in order to get a better estimate of accuracy.
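One way to build stratified partitions is sketched below in Python. The function name and the round-robin dealing scheme are assumptions chosen for illustration; other schemes that keep class proportions similar across partitions would serve equally well.

```python
import random
from collections import defaultdict


def stratified_folds(labels, num_folds=5, seed=0):
    """Split sample indices into num_folds partitions with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(num_folds)]
    position = 0
    for label in sorted(by_class, key=str):
        members = by_class[label]
        rng.shuffle(members)
        # Deal indices round-robin, continuing where the previous class stopped,
        # so that fold sizes differ by at most one.
        for index in members:
            folds[position % num_folds].append(index)
            position += 1
    return folds


labels = ["A"] * 10 + ["B"] * 5
for fold in stratified_folds(labels, num_folds=5):
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Each partition then serves once as the test set while the remaining partitions are used for training.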

Pseudocode is defined as a listing of sequential steps for solving a computational problem. Pseudocode is used by computer programmers to mentally translate each computational step into a set of programming instructions involving various mathematical operations (addition, subtraction, multiplication, division, power and transcendental functions, differentiation/integration, etc.) and resources (vectors, arrays, graphics, input/output, etc.) in order to solve an analytic problem. Following is a listing of pseudocode for the k-nearest-neighbor classification method using cross-validation.

begin

    set \(TotAcc \leftarrow 0\ .\)

    calculate distances between all the input samples and store in \(n \times n\) matrix \(\textbf{D}\ .\) (For a large number of samples, use only the lower or upper triangular of \(\textbf{D}\) for storage since it is a square symmetric matrix.)

    for \(t \leftarrow \) 1 to \(NumIterations\) do

        set \(\textbf{C} \leftarrow \) 0, and \(n_{total} \leftarrow 0\ .\)

        partition the input samples into \(\kappa\) equally-sized groups.

        for \(fold \leftarrow\) 1 to \(\kappa\) do

            assign samples in the \(fold\)th partition to testing, and use the remaining samples for training. Set the number of samples used for testing as \(n_{test}\ .\)

            set \(n_{total} \leftarrow n_{total}+n_{test}\ .\)

            for \(i \leftarrow \) 1 to \(n_{test}\) do

                for test sample \(\textbf{x}_i\) determine the \(k\) closest training samples based on the calculated distances.

                determine \(\hat{\omega}\ ,\) the most frequent class label among the \(k\) closest training samples.

                increment confusion matrix \(\textbf{C}\) by 1 in element \(c_{\omega,\hat{\omega}}\ ,\) where \(\omega\) is the true and \(\hat{\omega}\) the predicted class label for test sample \(\textbf{x}_i\ .\) If \(\omega =\hat{\omega}\) then the increment of \(+1\) will occur on the diagonal of the confusion matrix, otherwise, the increment will occur in an off-diagonal.

            endfor

        endfor

        determine the classification accuracy using \(Acc = \frac{\sum_j^{\Omega}c_{jj}}{n_{total}}\) where \(c_{jj}\) is a diagonal element of the confusion matrix \(\textbf{C}\ .\)

        calculate \(TotAcc = TotAcc + Acc\ .\)

    endfor

    calculate \(AvgAcc = TotAcc/NumIterations\)

end
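The pseudocode can be turned into a short runnable Python sketch as follows. It is not the code used to produce the results reported in this article. The function name and signature are illustrative. Class labels are assumed to be integers 0 to \(\Omega-1\ ,\) partitions are drawn at random without stratification, and voting ties are assumed to go to the smallest class index.

```python
import math
import random
from collections import Counter


def repeated_cv_knn(X, y, k=5, num_folds=10, num_iterations=10, seed=0):
    """Average pooled cross-validation accuracy of kNN over repeated partitions."""
    n = len(X)
    num_classes = max(y) + 1
    rng = random.Random(seed)
    # Distances between all input samples, computed once and reused in every fold.
    D = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    tot_acc = 0.0
    for _ in range(num_iterations):
        C = [[0] * num_classes for _ in range(num_classes)]
        n_total = 0
        # Assumption: simple random (unstratified) partition into num_folds groups.
        indices = list(range(n))
        rng.shuffle(indices)
        folds = [indices[f::num_folds] for f in range(num_folds)]
        for test in folds:
            held_out = set(test)
            train = [j for j in range(n) if j not in held_out]
            n_total += len(test)
            for i in test:
                # k closest training samples according to the stored distances.
                nearest = sorted(train, key=lambda j: D[i][j])[:k]
                votes = Counter(y[j] for j in nearest)
                # Assumption: ties go to the smallest class index.
                predicted = min(votes, key=lambda c: (-votes[c], c))
                C[y[i]][predicted] += 1
        # Accuracy is taken once per iteration, after every fold has been tested.
        tot_acc += sum(C[c][c] for c in range(num_classes)) / n_total
    return tot_acc / num_iterations


# Two well-separated groups of 10 samples each.
X = [(0.1 * i, 0.0) for i in range(10)] + [(10 + 0.1 * i, 0.0) for i in range(10)]
y = [0] * 10 + [1] * 10
print(repeated_cv_knn(X, y, k=3, num_folds=5, num_iterations=3))
```

Storing the full matrix keeps the sketch short; for large n only one triangle of the symmetric matrix needs to be kept, as noted in the pseudocode.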

The above pseudocode was applied to several commonly used data sets (see next section), where the fold value was varied in order to assess performance (accuracy) as a function of the size of the cross-validation partitions.

Commonly Employed Data Sets

Nine data sets from the Machine Learning Repository of the University of California - Irvine (UCI) were used for several k-nearest neighbor runs (Newman et al, 1998). Table 2 lists the data sets, number of classes, number of samples, and number of features (attributes) in each data set.

Table 2. Data sets used.

| Data set | #Samples | #Classes | #Features | Reference |
|---|---|---|---|---|
| Cancer (Wisconsin) | 699 | 2 | 9 | Wolberg & Mangasarin, 1990 |
| Dermatology | 366 | 6 | 34 | Guvenir et al, 1998 |
| Glass | 214 | 6 | 9 | Evett & Spiehler, 1987 |
| Ionosphere | 351 | 2 | 32 | Sigillito et al, 1989 |
| Fisher Iris | 150 | 3 | 4 | Fisher, 1936 |
| Liver | 345 | 2 | 8 | Forsyth, 1990 |
| Pima Diabetes | 768 | 2 | 8 | Smith et al, 1988 |
| Soybean | 266 | 15 | 38 | Michalski & Chilausky, 1980 |
| Wine | 178 | 3 | 13 | Aeberhard et al, 1992 |

Performance Evaluation

Figure 3 shows a strong linear relationship between 10-fold cross-validation accuracy for the 9 data sets and the ratio of the feature sum[-log(p-value)] to the number of features. The liver data set resulted in the lowest accuracy, while the Fisher Iris data resulted in the greatest accuracy. The low value of sum[-log(p-value)] for features in the liver data set will on average result in lower classification accuracy, whereas the greater level of sum[-log(p-value)] for the Fisher Iris data and cancer data set will yield much greater levels of accuracy.

Figure 3: Linear relationship between classification accuracy and the ratio sum[-log(p)]/#features. 5NN used with feature standardization.

Figure 5 shows that, when averaging performance over all data sets (k=5), both feature standardization and feature fuzzification resulted in greater accuracy levels when compared with no feature transformation.

Figures 6, 7, and 8 illustrate the CV10 accuracy for each data set as a function of k with no transformation, standardization, and fuzzification, respectively. It was apparent that feature standardization (Figure 7) and fuzzification (Figure 8) greatly improved the accuracy of the dermatology and wine data sets. Fuzzification (Figure 8) slightly reduced the performance of the Fisher Iris data set. Interestingly, performance for the soybean data set did not improve with increasing values of k, suggesting overlearning or overfitting.

Figure 4: Bias as a function of various cross-validation methods for the data sets used. Feature values standardized and k=5.

Average accuracy as a function of k for all data sets combined, with feature standardization and fuzzification, is shown in Figure 9. Again, feature standardization and fuzzification resulted in improved accuracy values over the range of k.
Finally, Figures 10, 11, and 12 show the bootstrap accuracy as a function of training sample size for k=5, i.e. 5NN, with and without feature standardization and fuzzification. The use of feature standardization and fuzzification resulted in substantial performance gains for the dermatology and wine data sets. Feature fuzzification markedly improved performance for the dermatology data set, especially at lower sample sizes. Standardization also improved the dermatology data set performance at smaller sample sizes. Performance for the liver, glass, and soybean data sets was not improved by feature standardization or fuzzification.

Figure 5: Bias as a function of cross validation method averaged over all training sets as a function of feature transformation. K=5 used.

Figure 9: Bias as a function of k averaged over all training sets as a function of feature transformation.

Figure 10: Bootstrap bias as a function of the number of training instances sampled randomly with replacement. No feature transformation used.

Figure 11: Bootstrap bias as a function of the number of training instances sampled randomly with replacement. Feature standardization used.

Figure 12: Bootstrap bias as a function of the number of training instances sampled randomly with replacement. Feature fuzzification used.

Performance of the supervised k-nearest-neighbor classification method was assessed using several data sets, cross-validation, and bootstrapping. All methods involved initial use of a distance matrix and construction of a confusion matrix during sample testing, from which classification accuracy was determined. With regard to accuracy calculation, for cross-validation it is recommended that the confusion matrix be filled incrementally with results for all input samples partitioned into the various groups, and that accuracy be calculated only afterwards, rather than calculating accuracy after each partition is used for testing and then averaging. In other words, for e.g. 5-fold cross-validation, it is not recommended to calculate accuracy after the first 4/5ths of samples are used for training and the first 1/5th of samples are used for testing. Instead, it is better to determine accuracy after all 5 partitions have been used for testing, so that the confusion matrix holds a result for every input sample. Then, re-partition the samples into 5 groups again and repeat training and testing on each of the partitions. As another example, consider an analysis for which there are 100 input samples and 10-fold cross-validation is to be used. The suggestion is not to calculate accuracy every time 10 of the samples are used for testing and then average, but rather to go through all 10 partitions in order to fill in the confusion matrix for the entire set of 100 samples, and then calculate accuracy. This should be repeated e.g. 10 times, with re-partitioning each time.
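The difference between the two ways of calculating accuracy can be seen in a small Python sketch. The function names and the fold counts are hypothetical and chosen only for illustration. The two calculations coincide when every partition contains the same number of test samples and can differ otherwise.

```python
def pooled_accuracy(fold_results):
    """Accuracy from one confusion matrix filled over all folds.

    fold_results: list of (num_correct, num_tested) pairs, one per fold.
    """
    correct = sum(c for c, _ in fold_results)
    tested = sum(t for _, t in fold_results)
    return correct / tested


def mean_fold_accuracy(fold_results):
    """Average of accuracies computed separately after each fold (not recommended above)."""
    return sum(c / t for c, t in fold_results) / len(fold_results)


# Hypothetical counts for 11 samples split into folds of size 4, 4, and 3.
fold_results = [(3, 4), (4, 4), (1, 3)]
print(pooled_accuracy(fold_results))
print(mean_fold_accuracy(fold_results))
```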

The hold-out method of accuracy determination is another approach to assess the performance of k-nearest neighbor. Here, input samples are randomly split into 2 groups, with 2/3 (~66%) of the input samples assigned to the training set and the remaining 1/3 (~33%) of the samples assigned to testing. Training results are used to classify the test samples. A major criticism of the hold-out method when compared with cross-validation is that it makes inefficient use of the entire data set, since the data are split one time and used once in this configuration to assess classification accuracy. It is important to recognize that the hold-out method is not the same as predicting class membership for an independent set of supplemental experimental validation samples. Validation sets are used when the goal is to confirm the predictive capabilities of a classification scheme based on the results from an independent set of supplemental samples not used previously for training and testing. Laboratory investigations involving molecular biology and genomics commonly use validation sets raised independently from the original training/testing samples. By using an independent set of validation samples, the ability of a set of pre-selected features (e.g. mRNA or microRNA transcripts, or proteins) to correctly classify new samples can be better evaluated. The attempt to validate a set of features using a new set of samples should be done carefully, since processing new samples at a later date using different lab protocols, buffers, and technicians can introduce significant systematic error into the investigation. As a precautionary measure, a laboratory should plan on processing the independent validation set of samples in the same laboratory, using the same protocol and buffer solutions, the same technician(s), and preferably at the same time the original samples are processed. Waiting until a later phase in a study to generate the independent validation set of samples may seriously degrade the predictive ability of the features identified from the original samples, ultimately jeopardizing the classification study.
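The random 2/3 to 1/3 split used by the hold-out method can be sketched in Python as follows. The function name and the rounding of the training-set size are assumptions made for illustration.

```python
import random


def holdout_split(n, train_fraction=2 / 3, seed=0):
    """Randomly split sample indices 0..n-1 once into training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    # Assumption: the training-set size is rounded to the nearest integer.
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]


train, test = holdout_split(150)
print(len(train), len(test))
```

Because the split is made only once, each sample is used either for training or for testing but never both, which is the inefficiency noted above.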

The data sets used varied in the number of classes, the number of features, and the statistical significance for class discrimination based on the feature-specific F-ratio tests. An important finding during the performance evaluation of k-nearest neighbor was that feature standardization improved accuracy for some data sets and did not reduce accuracy for any of them. On the other hand, while feature fuzzification improved performance for several data sets, it nevertheless resulted in decreased performance for one data set (Fisher Iris). The effect of feature standardization and fuzzification varies depending on the data set and the classifier being used. In an independent analysis of 14 classifiers applied to 9 large DNA microarray data sets, it was found that feature standardization or fuzzification improved performance for all classifiers except the naive Bayes classifier, quadratic discriminant analysis, and artificial neural networks (Peterson and Coleman, 2007). While standardization reduced the performance of only quadratic discriminant analysis, fuzzification reduced the performance of the naive Bayes, quadratic discriminant analysis, and artificial neural network classifiers.

In light of the transformations explored in this study of k-nearest-neighbor classification, it is recommended that at least the effects of feature standardization be comparatively assessed when using k-nearest-neighbor classification. In addition, the effect of the value of k should be determined in order to identify regions where overlearning or overfitting may occur. Lastly, there may be unique characteristics of the sample and feature space being studied that cause other classifiers to perform better or worse than k-nearest-neighbor classification. Hence, a full evaluation of k-nearest-neighbor performance as a function of feature transformation and k is suggested.

Acknowledgments

We are grateful to the current and past librarians of the University of California-Irvine (UCI) Machine Learning Repository, namely, Patrick M. Murphy, David Aha, and Christopher J. Merz.

References

Aeberhard, S., Coomans, D., de Vel, O. Comparison of Classifiers in High Dimensional Settings. Tech. Rep. no. 92-02, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland, 1992.

Michalski, R.S., Chilausky, R.L. Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis. International Journal of Policy Analysis and Information Systems. 4(2), 1980.