Fast Feature Vector Distance Sorting Using Postgres

Jan 6, 2015 • David Golightly

Sorting by feature vector distance

Sometimes, describing the real world requires us to use a lot of numbers
measuring a variety of features. We call a set of these numerical qualities a
feature vector. For example,
let’s consider the specs of a variety of apples. Suppose we’ve rated each apple
on five flavor characteristics, each scored from 0 to 100:

Name    Crunchy  Tart  Sweet  Nutty  Bitter
Fuji         50    90     10     30      40
Gala         90    60     70     80      10
Cameo        40    30     90     50      80

Since we have 5 different numerical features, we can describe the apple
characteristics in 5-dimensional vector space.

Calculating distance

Suppose, given an apple from this list with 5-dimensional flavor
characteristics, we want to find the top 3 other apples in our system that are
most similar to it in flavor. In other words, we want to measure the “flavor
distance” between apples, and find the 3 apples with the shortest distance to
our original apple.

Fortunately, the simplest way to calculate distance in any number of dimensions
is one we all learned in middle school: a variation on the familiar Pythagorean
theorem known as Euclidean distance,
which extends that theorem from 2 to an arbitrary number of dimensions.
As in the Pythagorean theorem, we take the difference between points in each
dimension, square the differences, then add up the squares. Finally, we take
the square root of that sum. So to find the distance between vectors q and
p in n dimensions:
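Written out, the formula the paragraph describes is:

d(q, p) = √((q₁ − p₁)² + (q₂ − p₂)² + … + (qₙ − pₙ)²)

A quick Ruby sketch of the same calculation (the function name is our own):

```ruby
# Euclidean distance between two equal-length numeric vectors:
# square the per-dimension differences, sum them, take the square root.
def euclidean_distance(q, p)
  Math.sqrt(q.zip(p).sum { |qi, pi| (qi - pi)**2 })
end

fuji = [50, 90, 10, 30, 40]
gala = [90, 60, 70, 80, 10]
euclidean_distance(fuji, gala) # => 97.467... (the square root of 9500)
```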

If we’re modeling these apples in Ruby and ActiveRecord, we might have a model
that looks like:
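A sketch of what that model might contain (the flavor-column names follow the
table above; Apple is written here as a plain Ruby class so the sketch stands
alone, but in the app it would inherit from ActiveRecord::Base):

```ruby
# Sketch: in the Rails app this would be `class Apple < ActiveRecord::Base`
# with an integer column per flavor; a plain class keeps it self-contained.
class Apple
  FLAVORS = %i[crunchy tart sweet nutty bitter].freeze

  attr_reader :name, *FLAVORS

  def initialize(name:, crunchy:, tart:, sweet:, nutty:, bitter:)
    @name = name
    @crunchy, @tart, @sweet, @nutty, @bitter = crunchy, tart, sweet, nutty, bitter
  end

  # The 5-dimensional flavor vector for this apple.
  def flavor_vector
    FLAVORS.map { |flavor| public_send(flavor) }
  end

  # Euclidean distance between this apple's flavors and another's.
  def flavor_distance_to(other)
    Math.sqrt(flavor_vector.zip(other.flavor_vector).sum { |a, b| (a - b)**2 })
  end
end
```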

Scaling

Now that we have a basic approach for finding the distance between 2 flavor
vectors, what if we want to dramatically increase the number of apples in our
database? While finding the distance between a single apple and every other
apple runs in O(n) time, comparing each apple to every other apple runs in
O(n²) time, which becomes prohibitive rather quickly. In addition,
a naive approach using this function would be extra slow in Rails:
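Something like this sketch, where all of the work happens in Ruby (flavor
vectors are plain arrays here to keep it self-contained; in Rails the
candidates would come from `Apple.all`, instantiating an ActiveRecord object
for every row on every call):

```ruby
# Euclidean distance over two flavor vectors.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Naive approach: compute every distance in Ruby and sort there.
def top_apple_similarities(apple, candidates, count = 3)
  candidates
    .reject { |other| other == apple }          # skip the apple itself
    .sort_by { |other| distance(apple, other) } # O(n) distances per call
    .first(count)
end
```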

This is incredibly slow, because not only are the comparisons O(n²), but we’re
also instantiating O(n²) ActiveRecord objects for a one-off comparison!

To speed this up, we can calculate and store these comparisons in a
separate table, so we can query based on distance and avoid the overhead of
instantiating that many AR objects. However, even with this approach we’ll need
to take the hit of recalculating the distances every time a change is made to
the flavor vectors of one of the apples. Fortunately, Postgres gives us tools
to avoid ActiveRecord altogether.

Leveraging Postgres array and cube modules

Postgres supports an array
column type, which can be used to store our feature vectors, and the Rails
Postgres adapter (since Rails 4) allows us to use this column type in our AR
models.
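A migration along these lines would add the column (the class and column names
are our assumptions; the stub base class exists only so the sketch runs outside
a Rails app):

```ruby
# Stub standing in for Rails' migration base class outside a Rails app.
unless defined?(ActiveRecord)
  module ActiveRecord
    class Migration; end
  end
end

class AddFlavorVectorToApples < ActiveRecord::Migration
  def change
    # A float[] column holding [crunchy, tart, sweet, nutty, bitter].
    add_column :apples, :flavor_vector, :float, array: true, default: []
  end
end
```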

We’ll need to keep this flavor_vector column in sync whenever one of the
flavor fields changes; alternatively, we can drop the individual flavor
columns altogether and treat the array as the canonical source.
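One way to keep it in sync is a save callback, sketched below (the stub base
class exists only so the sketch runs outside Rails; the callback name is ours):

```ruby
# Stub standing in for ActiveRecord::Base outside a Rails app.
unless defined?(ActiveRecord)
  module ActiveRecord
    class Base
      def self.before_save(*); end # no-op stand-in for the real callback macro
    end
  end
end

class Apple < ActiveRecord::Base
  FLAVORS = %i[crunchy tart sweet nutty bitter].freeze

  before_save :refresh_flavor_vector

  private

  # Recompute the array column from the individual flavor columns
  # so distance queries always see current values.
  def refresh_flavor_vector
    self.flavor_vector = FLAVORS.map { |flavor| public_send(flavor).to_f }
  end
end
```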

Taking this a step further, Postgres’ cube
module provides cube_distance, which calculates the Euclidean distance between
two cube values right in a query (a cube can be built from our float array
with the cube() function). This allows us to sort by distance and return only
the top 3 matches. These distances can be cached at the database level, making
subsequent queries even faster.

This allows us to rewrite our top_apple_similarities function as follows:
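A sketch of the rewritten method (it assumes the cube extension has been
installed with CREATE EXTENSION cube, and that flavor_vector is the float[]
column from the migration above; the SQL and names are our reconstruction):

```ruby
# Let Postgres compute, sort, and limit by distance; Rails instantiates
# only the `count` rows that come back.
def top_apple_similarities(apple, count = 3)
  Apple.find_by_sql([<<-SQL, apple.flavor_vector, apple.id, count])
    SELECT apples.*,
           cube_distance(cube(flavor_vector), cube(ARRAY[?])) AS flavor_distance
    FROM apples
    WHERE id <> ?
    ORDER BY flavor_distance ASC
    LIMIT ?
  SQL
end
```

find_by_sql sanitizes the bound values, and ActiveRecord expands the Ruby
array passed for the first placeholder into a comma-separated list inside
ARRAY[?].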

We’re still in O(n²) territory, but our AR
instantiations are now limited to O(n),
pushing the bulk of the work onto the database where it belongs. This
optimization led to a 5-6x speedup of production code, from >30 sec. to 5-8 sec.,
with a table of 500 rows and 15 feature dimensions. Selecting the result of the
cube_distance function as a column allows us to inspect the computed result on
the returned model.

By leveraging the database to do the hard work, we can achieve significant
gains in performance even without dramatically changing the algorithm.