tl;dr How can something like Mathematica's Nearest be implemented efficiently?

Mathematica has a function called Nearest which will take a list of "things" (they can be numbers, coordinates in n-dimensional space, strings, etc.), and will return a NearestFunction object. This object is a function that, when applied to x, will return the list element which is closest to x by some distance metric. The distance metric can be passed as a parameter to Nearest: by default it uses Euclidean distance for numerical data and some kind of edit distance for strings.

Example (this will hopefully make the question more clear):

nf = Nearest[{92, 64, 26, 89, 39, 19, 66, 58, 65, 39}];

nf[50] will return 58, the element closest to 50. nf[50, 2] will return {58, 39}, the two closest elements.

Question: What is an efficient way to implement this functionality? What sort of data structure is NearestFunction likely to use internally? What is the best possible complexity of computing a nearest element for different types of data?

For a plain list of numbers sorting them and doing a binary search would work, but Nearest works with multidimensional data as well as with an arbitrary distance function, so I suppose it uses something more general. But I wouldn't be surprised if it turned out to be specialized for certain kinds of data / distance functions.

2 Answers
2

For distance functions that are well-behaved, there are many data structures optimized specifically for this. For multidimensional data, the k-d tree (and other binary space partitioning trees) can give excellent nearest-neighbor searches, usually in sublinear time. You may also want to look into metric trees, which are tree structures optimized to store points in some metric space in a way that supports nearest-neighbor searches. Depending on the particular metric space (Euclidean distance, edit distance, etc.), different data structures might be more or less appropriate.

For arbitrary distance functions in which there are no restrictions on the behavior (not even things like the triangle inequality, for example), then the best you can do is a linear search, since the distance function might be infinite for all points except for one specific point in the set.