This is the third post of our series on classification from scratch, following the previous post that introduced smoothing techniques with (b)-splines. Here, we consider kernel-based techniques. Note that we no longer use the "logistic" model: the approach is purely non-parametric.

Kernel-Based Estimate, From Scratch

I like kernels because they are somehow very intuitive. With GLMs, the goal is to estimate the conditional expectation m(x) = E[Y | X = x].

Heuristically, we want to compute the (conditional) expected value in the neighborhood of X. If we consider some spatial model, where X is the location, we want the expected value of some variable Y "in the neighborhood" of X. A natural approach is to use some administrative region (county, department, region, etc). This means that we have a partition of the space in which the variable(s) lie. This yields the regressogram, introduced in Tukey (1961). For convenience, assume some interval/rectangle/box type of partition. In the univariate case, consider

or the moving regressogram
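For reference, these two estimators are usually written as follows. This is a sketch in standard notation: the breakpoints $a_j$ of the partition and the hat notation are assumptions, not necessarily the symbols of the original display. The first line is the regressogram for $x \in [a_j, a_{j+1})$, the second is the moving regressogram with half-width $h$.

```latex
\hat m_a(x) = \frac{\sum_{i=1}^n \mathbf{1}\big(x_i \in [a_j, a_{j+1})\big)\, y_i}
                   {\sum_{i=1}^n \mathbf{1}\big(x_i \in [a_j, a_{j+1})\big)}
\qquad\text{and}\qquad
\hat m(x) = \frac{\sum_{i=1}^n \mathbf{1}\big(x_i \in [x-h, x+h]\big)\, y_i}
                 {\sum_{i=1}^n \mathbf{1}\big(x_i \in [x-h, x+h]\big)}
```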

In that case, the neighborhood is defined as the interval (x ± h). That's nice, but clearly very simplistic. If Xi = X and Xj = X - h + ε (ε > 0), both observations are used to compute the conditional expected value. But if Xj' = X - h - ε, only Xi is considered, even if the distance between Xj and Xj' is extremely small. Thus, a natural idea is to use weights that are a function of the distance between the Xi's and X. Use

where (classically)

for some kernel k (a non-negative function that integrates to one) and some bandwidth h. Usually, kernels are denoted with a capital letter K, but I prefer to use k, because it can be interpreted as the density of some random noise we add to all observations (independently).
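In the usual notation, such a weighted estimator can be written as below. The symbols $\tilde m$ and $\omega_i$ are assumptions about the missing display; the form of the weights is the classical one for a kernel $k$ and a bandwidth $h$.

```latex
\tilde m(x) = \frac{\sum_{i=1}^n \omega_i(x)\, y_i}{\sum_{i=1}^n \omega_i(x)},
\qquad
\omega_i(x) = k_h(x - x_i) = \frac{1}{h}\, k\!\left(\frac{x - x_i}{h}\right)
```

With the uniform kernel on [-1, 1], the weights are constant on the interval (x ± h) and zero outside, so this reduces to the moving regressogram.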

Actually, one can derive that estimate by using kernel-based estimators of densities. Recall that:

Now, use the fact that the expected value can be defined as:

Consider now a bivariate (product) kernel to estimate the joint density. The numerator is estimated by:

While the denominator is estimated by:
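The two estimates can be spelled out as follows. This is a sketch of the standard argument, assuming the same kernel $k$ and bandwidth $h$ for both coordinates, and a kernel with mean zero (true for any symmetric kernel, such as the Gaussian one). The substitution $u = (y - y_i)/h$ gives $\int y\, k_h(y - y_i)\, dy = y_i$, which is what makes the numerator collapse.

```latex
\begin{aligned}
\hat f(x, y) &= \frac{1}{n} \sum_{i=1}^n k_h(x - x_i)\, k_h(y - y_i),
  \qquad k_h(u) = \frac{1}{h}\, k\!\left(\frac{u}{h}\right) \\
\int y\, \hat f(x, y)\, dy
  &= \frac{1}{n} \sum_{i=1}^n k_h(x - x_i) \int y\, k_h(y - y_i)\, dy
   = \frac{1}{n} \sum_{i=1}^n y_i\, k_h(x - x_i) \\
\hat f(x) &= \frac{1}{n} \sum_{i=1}^n k_h(x - x_i) \\
\tilde m(x) &= \frac{\int y\, \hat f(x, y)\, dy}{\hat f(x)}
   = \frac{\sum_{i=1}^n y_i\, k_h(x - x_i)}{\sum_{i=1}^n k_h(x - x_i)}
\end{aligned}
```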

In a general setting, we still use product kernels between Y and X and write

for some symmetric positive definite bandwidth matrix

Now that we know what kernel estimates are, let us use them. For instance, assume that k is the density of the N(0,1) distribution. At point x, with a bandwidth h we get the following code:
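As an illustration, here is a minimal Python sketch of that estimate, using only the standard library. The function names `gaussian_kernel` and `kernel_mean` and the toy data are assumptions made for this example; they are not taken from the original code or dataset.

```python
import math

def gaussian_kernel(u):
    """Density of the N(0,1) distribution at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_mean(x, xs, ys, h):
    """Kernel-weighted average of ys at point x, with bandwidth h."""
    weights = [gaussian_kernel((xi - x) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed: x is too far from the data for this h.
        raise ValueError("all weights are zero; try a larger bandwidth")
    return sum(w * yi for w, yi in zip(weights, ys)) / total

# Toy data (made up for illustration): one covariate and a 0/1 label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

# Estimated probability that Y = 1 at x = 3, for two bandwidths.
p_small = kernel_mean(3.0, xs, ys, h=0.1)
p_large = kernel_mean(3.0, xs, ys, h=1000.0)
print(p_small, p_large)
```

With a very small bandwidth the estimate is driven by the single closest observation, while with a very large one it gets close to the overall proportion of ones.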

We observe what we can read in any textbook: with a smaller bandwidth, we get more variance, less bias. "More variance," here, means more variability (since the neighborhood is smaller, there are fewer points to compute the average, and the estimate is more volatile), and "less bias" in the sense that the expected value is supposed to be computed at point x, so the smaller the neighborhood, the better.

We can replicate our previous estimate. Nevertheless, the output is not a function, but two series of vectors. That's nice to get a graph, but that's all we get. Furthermore, as we can see, the bandwidth is not exactly the same as the one we used before. I did not find any information online, so I tried to replicate the function we wrote before:

k-Nearest Neighbors

An alternative is to consider a neighborhood defined not by a distance to the point X but by its k nearest neighbors among the n observations we have.

where

with
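One common way to write the k-nearest neighbor estimate is given below. The notation ($\tilde m_k$, $\omega_{i,k}$, $\mathcal{I}_x^k$) is an assumption about the missing displays; the content is simply the average of the $y_i$'s over the $k$ observations closest to $x$.

```latex
\tilde m_k(x) = \frac{1}{n} \sum_{i=1}^n \omega_{i,k}(x)\, y_i,
\qquad
\omega_{i,k}(x) = \frac{n}{k}\, \mathbf{1}\big(i \in \mathcal{I}_x^k\big),
\qquad
\mathcal{I}_x^k = \big\{ i : x_i \text{ is one of the } k \text{ nearest observations to } x \big\}
```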

The difficult part here is that we need a valid distance. If units are very different on each component, using the Euclidean distance will be meaningless. So, quite naturally, let us consider here the Mahalanobis distance.
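A possible Python sketch of such a neighbor search is given below, restricted to two covariates so that the covariance matrix can be inverted by hand. The names `covariance_2d`, `mahalanobis_sq` and `k_nearest`, the toy points, and the choice to exclude observation i from its own neighbors are all assumptions made for this illustration.

```python
def covariance_2d(points):
    """Sample covariance (sxx, sxy, syy) of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_sq(p, q, cov):
    """Squared Mahalanobis distance between p and q for a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("singular covariance matrix")
    dx, dy = p[0] - q[0], p[1] - q[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def k_nearest(i, points, k):
    """Indices of the k observations closest to observation i (i excluded)."""
    cov = covariance_2d(points)
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: mahalanobis_sq(points[i], points[j], cov))
    return others[:k]

# Toy data (made up): the second coordinate lives on a much larger scale.
points = [(1.0, 100.0), (1.1, 900.0), (5.0, 120.0),
          (1.2, 500.0), (4.8, 800.0), (3.0, 450.0)]
print(k_nearest(0, points, k=3))
```

Because the distance is rescaled by the covariance, changing the unit of one coordinate does not change which observations are the nearest neighbors, which is the reason for preferring it to the Euclidean distance here.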

Here we have a function to find the k closest neighbors of some observation. Then two things can be done to get a prediction. The goal is to predict a class, so we can think of using a majority rule: the prediction for Yi is the class of the majority of its neighbors.

But we can also compute the proportion of black points among the closest neighbors. It can actually be interpreted as the probability of being black (that's actually what was said at the beginning of this post, with kernels).
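Both rules can be sketched in a few lines of Python, given the indices of the neighbors. The names `knn_majority` and `knn_proportion`, the 0/1 coding (1 standing for a black point) and the toy labels are assumptions made for this example.

```python
from collections import Counter

def knn_majority(neighbors, labels):
    """Most frequent label among the neighbors (majority rule)."""
    counts = Counter(labels[j] for j in neighbors)
    return counts.most_common(1)[0][0]

def knn_proportion(neighbors, labels, positive=1):
    """Share of neighbors carrying the `positive` label."""
    hits = sum(1 for j in neighbors if labels[j] == positive)
    return hits / len(neighbors)

# Toy labels (made up): 1 stands for a black point, 0 for the other class.
labels = [1, 0, 1, 1, 0]
neighbors = [0, 1, 2]  # indices of the k = 3 closest observations

print(knn_majority(neighbors, labels))    # predicted class
print(knn_proportion(neighbors, labels))  # estimated probability of being black
```

With two classes, taking an odd k avoids ties in the majority rule.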