Implementing SPARQL arbitrary-length property paths (redux)

I recently became aware that the implementation of arbitrary-length property paths I created for Sesame’s SPARQL engine suffered a serious flaw: it didn’t scale. At all. The chosen strategy (see my earlier blog post on this topic) of evaluating paths of ever-increasing length using multi-joins slowed down to a crawl (through tar) on any serious dataset at about length 7 or 8. Queries involving such paths took 15 minutes or more to complete. Clearly, this would need a little improvement.

Looking at the implementation, it turned out there were two main bottlenecks in the performance:

complete re-evaluation of all paths of length (n – 1) whenever we expand the path length

a computationally heavy cycle-detection check

After fiddling about with some ineffectual half-measures for one or both of these problems I decided to take a step back, and completely rethink the algorithm. It occurred to me that if we did a proper breadth-first search implementation we could eliminate the computationally expensive multi-join altogether (and even more importantly, the hugely inefficient repeat performances on known paths for every increase of the length). We could simply use a FIFO queue to pop intermediate values on and keep doing simple lookups of patterns of length 1.

Let me clarify with an example. Suppose you have the following path-expression:

ex:alice foaf:knows+ ?x

In the naive join-based algorithm, this would be evaluated by first looking up all paths of length 1:

BGP(ex:alice, foaf:knows, ?x)

If the result of this is not empty, we continue by looking up paths of length 2, adding a filter-condition to guard against cycles:

I won’t bore you with the details for paths of length 3 and further, but I trust you appreciate that this can quickly become quite big and complex. Also, notice the fact that in the expression for length 2, we have BGP(ex:alice, foaf:knows, ?a1), which is essentially the same expression we already evaluated for paths of length 1. We already know the answers to this bit, yet we still create a new join which will re-evaluate the damn thing, for every increase of the length, again and again.

Enter the new algorithm. I got rid of joins completely. Instead, I introduced a FIFO-queue of values. We iterate over results for length 1, reporting all resulting solutions, and putting all bindings found for ?x onto the queue. Then, when all results for length 1 are exhausted, we examine the queue. If it is not empty, we pop the first value, and create a new BGP where we replace the start with the popped value. We iterate over solutions to this BGP, and again, we report all solutions and add all bindings of ?x thus found to the stack. We continue in this fashion until finally the last iteration is exhausted and the queue is empty.

This is really classic breadth-first search, so no new wheels have been invented here, but it is amazing (although in retrospect really quite understandable) how much more performant this graph traversal mechanism is than the earlier join-based approach. One experiment I did (using the DBPedia category thesaurus) was the following query:

This query (which effectively retrieves all subcategories of Category:Business) took over 15 minutes to complete in the original approach. After switching to the new BFS-based approach, it completes in roughly 120ms.

One thing I haven’t touched on yet in this new implementation is how to detect cycles. In the original approach, we retrieved solutions that encompassed the complete path (joins with intermediate variables) and we used filters to check that all of the intermediate vars are pairwise distinct. Clearly, we can’t do that in this approach, because we no longer have all these intermediate vars available at once.

The solution is actually rather simple, if you stop and think of what a cycle actually is: a cycle is a path where at the end you have gotten back where you started (in other news: water is quite wet). Therefore, you simply have to mark where you’ve been before (that is, which solutions you’ve previously reported), and you’re home free.

A list internally records all reported combinations of the start value (ex:alice) and the end value (each binding of ?x) of our path. Whenever we find a new solution during iteration, we check if the reported combination already exists in that list, and if so, we discard it.

Why not just record the end values? Well… Think about it. Do all path expressions start with a fixed start point?

This approach to cycle detection has two consequences:

cycle detection has become a cheap lookup in a list.
This is quick, but puts a burden on memory space. In practice, however, the result of property-paths stays small enough to be able to keep such a list without problems.

We now automatically filter out duplicate results

The second consequence is really useful in practice: the old approach reported back all possible paths of all possible lengths, so if for example Bob is a friend of Alice but also a friend of a friend of Alice, we would get back Bob twice – which is clearly redundant information. In the new algorithm, Bob is only reported once, saving bandwidth, computation time, and the user having to jump through a lot of hoops to get unique results.

The new and improved algorithm will be available in the next Sesame release (2.6.1).