Tuning SPARQL Queries

The following is an excerpt from our upcoming Dydra Developer Guide, taken
from a section that offers some simple tips on tuning your queries for
better application performance.

SPARQL is a powerful query language, and as such it is easy to write complex
queries that require a great deal of computing power to execute. As both
query execution time and billing directly depend on how much processing a
query requires, it is useful to understand some of Dydra’s key performance
characteristics. With larger datasets, simple changes to a query can result
in a significant performance improvement.

This post describes several factors that strongly influence the execution
time and cost of queries, and explains a number of tips and tricks that will
help you tune your queries for optimal application performance and a reduced
monthly bill.

Note that the following may contain too much detail if you are casually
using Dydra for typical and straightforward use cases. You probably won’t
need these tips until you are dealing with large datasets or complex
queries. Nonetheless, you may still find it interesting to at least glance
over this material.

SELECT Queries

A general tip for SELECT queries is to avoid unnecessarily projecting
variables you won’t actually use. That is, if your query’s WHERE clause
binds the variables ?a, ?b, and ?c, but you actually only ever use ?b when
iterating over the solution sequence in your application, then you might
want to avoid specifying the query in either of the following two forms:

SELECT * WHERE { ... }
SELECT ?a ?b ?c WHERE { ... }

Rather, it is better to be explicit and project just the variables you
actually intend to use:

SELECT ?b WHERE { ... }

The above has two benefits. Firstly, Dydra’s query processing will apply
more aggressive optimizations knowing that the values of the variables
?a and ?c will not actually be returned in the solution
sequence. Secondly, the size of the solution sequence itself, and hence the
network use necessary for your application to retrieve it, is reduced by not
including superfluous values. The combination of these two factors can make
a big performance difference for complex queries returning large solution
sequences.

If you remember just one thing from this subsection, remember this:
SELECT * is a useful shorthand when manually executing queries, but
it is best avoided in a production application that runs complex queries
over non-trivial amounts of data.

Remember, also, that SPARQL provides an ASK query form. If all you
need to know is whether a query matches something or not, use an ASK
query instead of a SELECT query. This enables the query to be
optimized more aggressively, and instead of a solution sequence you will get
back a simple boolean value indicating whether the query matched or not,
minimizing the data transferred in response to your query.
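
As a minimal sketch of such an existence check, assuming a dataset that uses
the FOAF vocabulary (the vocabulary choice and the name "Alice" are
illustrative assumptions, not something specific to Dydra):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Evaluates to true if at least one resource has the name "Alice",
# and to false otherwise. No variable bindings are returned.
ASK {
  ?person foaf:name "Alice" .
}
```

The equivalent SELECT query would return every matching ?person, only for
the application to check whether the solution sequence is empty.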

The ORDER BY Clause

The ORDER BY clause can be very useful when you want your solution
sequence to be sorted. It is important to realize, though, that ORDER
BY is a relatively heavy operation, as it requires the query processing to
materialize and sort a full intermediate solution sequence, which prevents
Dydra from returning initial results to you until all results are available.

This does not mean that you should avoid using ORDER BY when it
serves a purpose. If you need your query results sorted by particular
criteria, it is best to let Dydra do that for you rather than manually
sorting the data in your application. After all, that is why ORDER BY
is there. However, if the solution sequence is large, and if the latency to
obtain the initial solutions is important (sometimes known as the
“time-to-first-solution” factor), you may wish to consider whether you in
fact need an ORDER BY clause or not.

The OFFSET Clause

Dydra’s query processing guarantees that a query solution sequence has a
consistent and deterministic ordering even in the absence of an ORDER
BY clause. This has an important and useful consequence: the results of an
OFFSET clause are always repeatable, whether or not the query has an
ORDER BY clause.

Concretely, this means that if you have a query containing an OFFSET
clause, and you execute that query multiple times in succession, you will
get the same solution sequence in the same order each time. This is not a
universal property of SPARQL implementations, but you can rely on it with
Dydra.

This feature facilitates, for example, paging through a large solution
sequence using an OFFSET and LIMIT clause combination, without
needing ORDER BY. So, again, don’t use an ORDER BY clause
unnecessarily if you merely want to page through the solution sequence (say)
a hundred solutions at a time.
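
The paging pattern described above might look as follows. This is only a
sketch: the FOAF vocabulary and the page size of 100 are assumptions chosen
for illustration, and each query would be sent as a separate request.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# First request: solutions 1 to 100. No ORDER BY clause is needed.
SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 100 OFFSET 0

# Second request: solutions 101 to 200. Only the OFFSET changes,
# increasing by the page size with each subsequent request.
SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 100 OFFSET 100
```

Your application can stop paging as soon as a request returns fewer
solutions than the page size.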

The LIMIT Clause

Include a LIMIT clause in your queries whenever possible. If your
application only needs the first 100 query solutions, specify LIMIT 100.
This puts an explicit upper bound on the amount of work to be performed in
answering your query.

Note, however, that if your query contains both ORDER BY and
LIMIT clauses, query processing must always construct and examine the
full solution sequence in order to sort it. Therefore the amount of
processing needed is not actually reduced by a LIMIT clause in this
case. Still, limiting the size of the ordered solution sequence with an
explicit LIMIT improves performance by reducing network use.
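
To make the combined case concrete, here is a sketch of a query that uses
both clauses, again assuming a hypothetical dataset described with the FOAF
vocabulary:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Every matching ?name is still collected and sorted,
# but only the first ten sorted solutions are returned.
SELECT ?name
WHERE { ?person foaf:name ?name }
ORDER BY ?name
LIMIT 10
```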