Similarity Search

By Pedro Oliveira, 07 Aug 2018 · 4 minute read

Learn how to find similar items in the Knowledge Graph with machine learning.

When we showcase Stardog’s machine learning capabilities, customers keep asking for the
ability to find similar items in the Knowledge Graph. This is a useful feature
for many tasks, such as generating recommendations or finding near duplicates.

So in Stardog 5.3.3 we introduced a new type of machine learning model that
supports search and retrieval of similar items in an efficient and scalable way.
This is a general problem: you’re searching in a graph of nodes that represent
real-world objects and the main thing you want to consider is similarity between
pairs of objects. The reasons for doing this vary: maybe you’re building a
recommendation system, tracing data lineage, or debugging a business process in
which a problem in one object may also occur in similar objects.

Similarity Search

To get into the details without getting bogged down, let’s explore a specific
example using the movie dataset.

Similarity search follows the same syntax and pipeline as our other machine
learning models. First, you need to create a model, which holds the set of items
available for search. The spa:arguments property receives the features used for
similarity calculation, while spa:predict contains the identifier of the item.

In this example, we create a SimilarityModel named :simModel, which takes as
input the genres, directors, writers, producers, and MetaCritic score of every
movie in the dataset.

Using this model it’s pretty easy to find similar movies. We select a movie and
its properties and pass it as input to the model. The number of similar items to
return is controlled by the spa:limit property given in spa:parameters.

In this example, the query finds the five movies most similar to The Big
Lebowski, along with their similarity scores, based on the features given
through spa:arguments.

similarMovieLabel           confidence
The Big Lebowski            0.9999999999999998
Fargo                       0.9996443676337468
Blood Simple                0.9996332068990889
The Man Who Wasn’t There    0.9996019945613324
Barton Fink                 0.9995802728226650

As expected, the most similar item is the movie itself, followed by other movies
from the inimitable Coen Brothers.

Just like our other models, similarity search accepts features of any datatype:
numbers, strings, sets, etc. Stardog automatically chooses the best
representation for each feature when it calculates a similarity score.

Under the Hood

Items and their features are vectorized using feature hashing, the same
technique used by our classification and regression models. These vectors are
saved in a search index created using cluster pruning, an approximate search
algorithm that groups items by similarity in order to speed up queries.
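To make the vectorization step concrete, here is a minimal Python sketch of feature hashing. It is not Stardog’s implementation: the function name `hash_features`, the vector size `dim`, the use of MD5 as the hash function, and the toy feature values are all assumptions made for illustration.

```python
import hashlib


def hash_features(features, dim=64):
    """Map a dict of named features to a fixed-length vector (hashing trick)."""
    vec = [0.0] * dim
    for name, value in features.items():
        if isinstance(value, (set, frozenset, list, tuple)):
            # A set-valued feature contributes one token per element.
            tokens = [(f"{name}={v}", 1.0) for v in value]
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            # A numeric feature keeps its magnitude as the token weight.
            tokens = [(name, float(value))]
        else:
            # Anything else (e.g. a string) is treated as a single category.
            tokens = [(f"{name}={value}", 1.0)]
        for token, weight in tokens:
            digest = hashlib.md5(token.encode("utf-8")).digest()
            h = int.from_bytes(digest[:8], "big")    # 64-bit hash of the token
            index = h % dim                          # bucket the token falls into
            sign = 1.0 if (h >> 63) == 0 else -1.0   # sign taken from the top bit
            vec[index] += sign * weight
    return vec


# Toy item; the feature values are illustrative, not taken from the movie dataset.
movie = {"genres": {"Comedy", "Crime"}, "director": "Joel Coen", "score": 0.5}
vector = hash_features(movie, dim=32)
print(len(vector))  # 32
```

Because every token is hashed straight to a bucket, no vocabulary has to be stored, and items with different numbers of genres or writers still end up as vectors of the same length. The signed update is a common way to reduce the bias introduced when two tokens collide in one bucket.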

The index is used to find the vectors with the largest cosine similarity to the
query item; that similarity is the score given by spa:confidence.
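The general idea of cluster pruning with cosine similarity can be sketched in a few lines of Python. This is a textbook-style illustration under stated assumptions, not Stardog’s code: the names `build_index` and `search`, the choice of roughly √N random leaders, the `probes` parameter, and the two-dimensional toy vectors are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(vectors, seed=0):
    """Pick about sqrt(N) random leaders and attach every other item to its nearest leader."""
    rng = random.Random(seed)
    ids = sorted(vectors)
    n_leaders = max(1, round(math.sqrt(len(ids))))
    leaders = rng.sample(ids, n_leaders)
    clusters = {leader: [leader] for leader in leaders}
    for item in ids:
        if item in clusters:
            continue  # leaders already belong to their own cluster
        nearest = max(leaders, key=lambda l: cosine(vectors[item], vectors[l]))
        clusters[nearest].append(item)
    return clusters


def search(query, vectors, clusters, limit=5, probes=1):
    """Score only the members of the `probes` clusters whose leaders are closest to the query."""
    ranked = sorted(clusters, key=lambda l: cosine(query, vectors[l]), reverse=True)
    candidates = [i for leader in ranked[:probes] for i in clusters[leader]]
    scored = sorted(((cosine(query, vectors[i]), i) for i in candidates), reverse=True)
    return [(item, score) for score, item in scored[:limit]]


# Toy 2-D vectors standing in for hashed feature vectors.
vectors = {
    "a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.8, 0.3],
    "d": [0.0, 1.0], "e": [0.1, 0.9], "f": [0.3, 0.8],
}
clusters = build_index(vectors)
print(search(vectors["a"], vectors, clusters, limit=3))
```

Only the clusters whose leaders are closest to the query are scored, which is where the speedup comes from. The price is that a true neighbor attached to a different leader can be missed; visiting more clusters (a larger `probes` in this sketch) trades speed for recall.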

The Stardog docs describe
advanced parameters which can be used to increase query performance and recall.

Future Work

We are exploring other ways of representing items as vectors, such as knowledge
graph embeddings and predication-based semantic indexing, while improving the
techniques underlying the search index itself. Stay tuned for updates.