Nearest neighbor search with kd-trees

Nearest neighbor search is an important task which arises in many areas -
pattern recognition, recommendation systems, DNA sequencing and even game development.

Usually, this task is formulated as follows.
We have N points in some space (the dataset S).
We have to serve queries which take the dataset S and some point X as their parameters
(X does not have to belong to S).
Typical queries are "find the k nearest neighbors of X" or "find all points in S at a given distance R from X or closer".
Depending on the problem, we may have:
a) a different number of dimensions - from one to thousands,
b) a different metric (Euclidean, 1-norm, ...),
c) a different dataset size.
Hence, different algorithms are feasible for different problems.

The key point of the problem formulation is that the dataset S is considered fixed.
X may vary from query to query, but S remains unchanged.
This allows us to preprocess the dataset and build a data structure which accelerates query processing.
All strategies which promise better-than-O(N) query time rely on some kind of preprocessing.
Different preprocessing strategies have different properties.

Basic description

Kd-trees are data structures used to store points in k-dimensional space.
As the name suggests, a kd-tree is a tree.
Its leaves store the points of the dataset (one or several points per leaf).
Each point is stored in one and only one leaf, and each leaf stores at least one point.
Internal nodes correspond to splits of the space (axis-oriented splits are used in most implementations).
Each split divides the space and the dataset into two disjoint parts.
Subsequent splits, from the root node down to a leaf, cut away parts of the dataset (and space)
until only a small part of the dataset (and space) is left.
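The splitting scheme described above can be sketched in a few lines of Python. This is an illustrative toy, not ALGLIB's actual implementation, and names such as LEAF_SIZE are invented for the example: we recursively split the point set at the median along a cycling coordinate until each leaf holds only a few points.

```python
# Toy kd-tree construction via recursive median splits
# (illustrative sketch, not ALGLIB's actual implementation).

LEAF_SIZE = 4  # hypothetical limit: max points stored per leaf

def build(points, depth=0):
    """Build a kd-tree node from a list of k-dimensional points."""
    if len(points) <= LEAF_SIZE:
        return {"leaf": True, "points": points}
    axis = depth % len(points[0])       # cycle through the coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "leaf": False,
        "axis": axis,                   # axis-oriented split
        "split": points[mid][axis],     # split value along that axis
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid:], depth + 1),
    }

# 20 points in the plane; every point ends up in exactly one leaf
pts = [(x * 0.1, (x * 7 % 10) * 0.1) for x in range(20)]
tree = build(pts)
```

Each internal node stores only the split axis and split value; the points themselves live exclusively in the leaves, mirroring the structure described above.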

The chart at the left shows an example of a kd-tree in 2-dimensional space.
Red squares are dataset points,
black lines are splits.
The thinner the line, the deeper the node which corresponds to the split.

Kd-trees allow us to efficiently perform searches like "all points at distance lower than R from X" or
"k nearest neighbors of X".
When processing such a query, we first find the leaf which corresponds to X.
Then we process the points stored in that leaf and start to scan nearby leaves.
At some point we may notice that the distance from X to a leaf is larger than the distance to the worst point found so far.
At that moment we can stop the search, because subsequent leaves cannot improve the result.
This algorithm works well for searches in low-dimensional spaces.
However, its efficiency decreases as dimensionality grows,
and in high-dimensional spaces kd-trees give no performance benefit over naive O(N) linear search
(although they continue to give correct results).
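The pruned search just described can be sketched as follows (a self-contained Python toy, not ALGLIB code; names like LEAF_SIZE are invented for illustration). The query descends to the leaf containing X first, then visits the other subtree of a node only while the splitting plane is closer than the best point found so far:

```python
import math

LEAF_SIZE = 2  # hypothetical limit: max points stored per leaf

def build(points, depth=0):
    """Toy kd-tree construction: recursive median splits."""
    if len(points) <= LEAF_SIZE:
        return {"leaf": True, "points": points}
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"leaf": False, "axis": axis, "split": points[mid][axis],
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid:], depth + 1)}

def nearest(node, x, best=None):
    """Return the point nearest to x, pruning subtrees which cannot
    contain anything closer than the best point found so far."""
    if node["leaf"]:
        for p in node["points"]:
            if best is None or math.dist(p, x) < math.dist(best, x):
                best = p
        return best
    near, far = ((node["left"], node["right"])
                 if x[node["axis"]] < node["split"]
                 else (node["right"], node["left"]))
    best = nearest(near, x, best)   # descend towards x's leaf first
    # scan the far side only if the splitting plane is closer than best
    if abs(x[node["axis"]] - node["split"]) < math.dist(best, x):
        best = nearest(far, x, best)
    return best                     # otherwise the far side is pruned

pts = [(i % 5, i // 5) for i in range(25)]   # a 5x5 grid of points
tree = build(pts)
```

With the 5x5 grid above, `nearest(tree, (2.2, 2.9))` returns the grid point (2, 3); the plane-distance test is exactly the stopping criterion described in the text.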

Considering the number of dimensions K fixed and the dataset size N variable,
we can estimate the complexity of the most important kd-tree operations:

building a kd-tree has O(N·logN) time complexity and O(K·N) space complexity

nearest neighbor search - close to O(logN)

M nearest neighbors - close to O(M·logN)

Kd-tree construction

The ALGLIB implementation of kd-trees uses the following dataset model:

search is performed in NX-dimensional space

the dataset S includes N points in NX-dimensional space.
The coordinates of points in NX-dimensional space are called X-values.

each point may have additional attributes - Y-values and/or an integer tag.
These attributes may be returned along with X-values as the query result.

a Y-value is a vector of NY real numbers associated with a point.
Y-values are not used for distance computation during search.
Usually a Y-value is the value of some NY-dimensional vector function at the point.

a tag is an integer number associated with a point.
Usually, tags are used to identify points (for example, to store a point's index in the dataset).

X-values, Y-values and tags are passed to the kd-tree when it is built from the dataset.

Y-values and tags can be omitted.
If your problem does not require additional information,
the kd-tree will store only X-values.
However, if you decide to use these attributes, then each point must have the same set of attributes.
For example, it is impossible to assign 3 Y-values to one point and 5 Y-values to another.
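To illustrate this dataset model (a hypothetical layout, not ALGLIB's API), here is a tiny dataset with NX=2 X-values, NY=1 Y-value and one tag per point; note that only the X-values take part in the distance computation:

```python
import math

# Each dataset entry: (x_values, y_values, tag). Every point carries
# the same set of attributes: NX=2 X-values, NY=1 Y-value, one tag.
dataset = [
    ((0.0, 0.0), (10.0,), 0),
    ((1.0, 0.0), (11.0,), 1),
    ((0.0, 1.0), (12.0,), 2),
]

def query_nn(data, x):
    """Brute-force nearest neighbor (stand-in for a kd-tree query).
    Only X-values participate in the distance computation; Y-values
    and tags are just carried along and returned with the result."""
    return min(data, key=lambda rec: math.dist(rec[0], x))

xv, yv, tag = query_nn(dataset, (0.9, 0.1))
```

In ALGLIB the analogous effect is achieved by passing X-values, Y-values and tags to kdtreebuildtagged and reading them back after a query.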

Kd-trees are built with one of the constructor functions: kdtreebuild or kdtreebuildtagged.
The first one builds a kd-tree without tags (with optional Y-values),
the second one builds a kd-tree with tags (and optional Y-values).
As a result, these functions return a kdtree structure.

Querying a kd-tree

The ALGLIB implementation of kd-trees supports the following kinds of queries:

KNN-queries - find K nearest neighbors of X.

AKNN-queries - find K ε-approximate nearest neighbors with a given degree of approximation.
Such a query returns K nearby points which do not have to be the true nearest neighbors,
but are located no farther than R·(1+ε) from X,
where R is the maximum distance from X to its true nearest neighbors.
Such queries are several times faster than exact KNN queries, especially in high-dimensional spaces.

RNN-queries - find all points at distance R from X or closer.

NN searches are performed in two stages.
At the first stage we issue a query using one of the query functions:
kdtreequeryknn, kdtreequeryaknn or kdtreequeryrnn.
These functions perform the search in the kd-tree, save the result in the kdtree structure,
and return the result size (the number of points satisfying the search criteria).

At the second stage the user may extract the result by calling one of the following functions:
kdtreequeryresultsx to get X-values,
kdtreequeryresultsxy to get X- and Y-values,
kdtreequeryresultstags to get tags,
kdtreequeryresultsdistances to get distances from the dataset points to X.

The ALGLIB Reference Manual contains the nneighbor_d_1 example, which shows how to work with kd-trees.

Other functions

The kdtreeserialize and kdtreeunserialize functions
can be used to serialize a kd-tree (convert it to a string) and unserialize it
(restore the structure from its string representation).
Serialization allows you to save a tree to a file and move it between different computers,
and even between ALGLIB versions in different programming languages.
The ALGLIB Reference Manual includes an example of kd-tree serialization - nneighbor_d_2.
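The idea is analogous to the following Python sketch, where a toy tree of nested dicts is converted to a string and restored. This is only an analogy: ALGLIB's kdtreeserialize uses its own portable string format, not JSON.

```python
import json

# A toy kd-tree node (nested dicts), serialized to a string and restored.
tree = {"leaf": False, "axis": 0, "split": 0.5,
        "left": {"leaf": True, "points": [[0.1, 0.2]]},
        "right": {"leaf": True, "points": [[0.7, 0.9]]}}

s = json.dumps(tree)       # "serialize": tree -> string
restored = json.loads(s)   # "unserialize": string -> identical tree
```

Because the intermediate representation is a plain string, it can be written to a file or sent to another machine, which is exactly what makes the ALGLIB functions useful for persistence and cross-language interoperability.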

Performance of kd-trees

In order to estimate the performance of the ALGLIB implementation of kd-trees, we conducted a series of numerical experiments.
The experiments were performed on an AMD Phenom II X6 3.2GHz CPU with the Microsoft C++ compiler and maximum optimization settings.
During the experiments we generated N=50,000 points,
uniformly and randomly distributed across the D-dimensional unit hypercube.
Then we performed 50,000 queries for the K nearest neighbors.

Depending on the number of dimensions D, kd-tree construction took from 30 to 60 ms.
The approximate time complexity is O(D·N·logN),
i.e. construction time is linear in the number of dimensions and linearithmic in the dataset size.

The charts above show how query processing time depends on the number of neighbors K and the number of dimensions D.
From the first chart we can conclude that KNN query time grows approximately linearly with K
(the best result possible).

The second chart shows the performance of kd-trees as a function of the number of dimensions D.
You may see that for any K (even for K=1) query time grows exponentially as the number of dimensions increases.
Each additional dimension results in an approximately 1.7-fold increase in processing time.
In high-dimensional spaces we have to scan more leaves of the tree,
and at some point the kd-tree becomes just a complicated form of linear search (all nodes are examined).
Note that beyond this point query time stops growing exponentially with D.

One way to fight the "curse of dimensionality" is to use approximate (AKNN) queries.
When we process an ε-approximate KNN query, we may return an inexact nearest neighbor
located at distance up to r·(1+ε) from X (where r is the distance to the true nearest neighbor).
The larger the approximation factor ε we choose, the more leaves can be skipped when we scan the tree.
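A minimal sketch of this pruning rule (illustrative Python, not ALGLIB's implementation; LEAF_SIZE is an invented name): compared with an exact search, the far subtree of a node is visited only if the splitting plane is closer than best_distance/(1+ε), so a larger ε skips more leaves while the returned point stays within a factor (1+ε) of the true nearest distance.

```python
import math

LEAF_SIZE = 2  # hypothetical limit: max points stored per leaf

def build(points, depth=0):
    """Toy kd-tree construction: recursive median splits."""
    if len(points) <= LEAF_SIZE:
        return {"leaf": True, "points": points}
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"leaf": False, "axis": axis, "split": points[mid][axis],
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid:], depth + 1)}

def ann(node, x, eps, best=None):
    """eps-approximate nearest neighbor: the far subtree is visited only
    if its splitting plane is closer than best_distance/(1+eps), so the
    answer is within a factor (1+eps) of the true nearest distance."""
    if node["leaf"]:
        for p in node["points"]:
            if best is None or math.dist(p, x) < math.dist(best, x):
                best = p
        return best
    near, far = ((node["left"], node["right"])
                 if x[node["axis"]] < node["split"]
                 else (node["right"], node["left"]))
    best = ann(near, x, eps, best)
    # tightened pruning test: with eps = 0 this is the exact search
    if abs(x[node["axis"]] - node["split"]) < math.dist(best, x) / (1 + eps):
        best = ann(far, x, eps, best)
    return best

pts = [(i % 5, i // 5) for i in range(25)]   # a 5x5 grid of points
tree = build(pts)
```

Setting ε=0 recovers the exact search; any ε>0 trades a bounded loss of accuracy for fewer visited leaves, which is where the speedups in the charts above come from.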

The charts above show the time complexity of AKNN-queries for different values of the approximation factor ε
and different numbers of dimensions D.
You may see that the higher D is, the more advantageous AKNN queries become.
With D=5 and ε=1, AKNN queries give us only a 3-fold increase in performance,
but D=10 and ε=1 give us a 10-fold increase.

Obviously, approximate KNN queries are not a universal solution.
In many cases we have to know the exact nearest neighbors.
However, in some areas (like large-scale data processing) it is sufficient to work with inexact results.