An Introduction to SPARQL

This tutorial, the first of a three-part series, introduces SPARQL -- a query language and data access protocol for the Semantic Web. SPARQL is defined in terms of the W3C's RDF data model and will work for any data source that can be mapped into RDF. The specification is under development by the RDF Data Access Working Group (DAWG) and has recently reached Last Call Working Draft.

At this point in its life cycle the specification is stable enough that developers can begin seriously exploring its capabilities. And the availability of several SPARQL query engines means that this exploration can be practical rather than theoretical.

But what if you're a lot more interested in Web 2.0, which is practical and real, than in the Semantic Web, about which opinions vary widely? Why might you want to go to the trouble of learning SPARQL? For dyed-in-the-wool Semantic Web fans, this question may well be a no-brainer: RDF has needed a standard query language for some time and having one will make many development tasks much easier.

However SPARQL has a much wider potential audience. A key aspect of the Web 2.0 idea is the ability to extract and query information held across many different ad hoc, third-party apps, services, or repositories. That ability to move in and among various data sources is key to the Web 2.0 idea of the mashup -- take a little Google Maps, salt with some eBay, and sprinkle with a heaping hunk of Flickr, right?

SPARQL, which is both a query language and a data access protocol, has the ability to become a key component in Web 2.0 applications: as a standard backed by a flexible data model, it can provide a common query mechanism for all Web 2.0 applications. XML.com managing editor Kendall Clark has published an excellent essay (Web 2.0 Meet The Semantic Web) that expands more fully on this idea. SPARQL should be of interest to developers exploring the available options for publishing open data on the Web.

The goal of these tutorials is to enable developers to quickly become productive with SPARQL. All of the key language features will be introduced with abundant examples. No previous experience with RDF query languages is required, but a basic familiarity with RDF and RDF/XML is essential. There are many good primers on RDF available for readers interested in a quick refresher course or a bottom-up introduction.

This first tutorial introduces the key concepts in SPARQL and its relationships to the other specifications under development by the DAWG. By the end of the tutorial you'll be able to write some simple SPARQL queries to extract data from RDF.

In the second tutorial we'll cover some of the more advanced query options, including working with multiple data sources. That tutorial will also demonstrate the ease with which data can be merged and queried using SPARQL.

The third and final tutorial will introduce the other SPARQL query forms: CONSTRUCT, DESCRIBE, and ASK. Far from being limited to querying data, SPARQL also offers the ability to extract information from a data repository according to rules of the client's devising. Powerful stuff.

Before jumping into the syntax, let's put SPARQL into some context, and take a brief look at the data we'll be using throughout the series.

SPARQL in Context

Work on RDF query languages has been progressing for a number of years. Several different approaches have been tried, ranging from familiar looking SQL-style syntaxes, such as RDQL and Squish, through to path-based languages like Versa.

Of these approaches, those that emulate SQL syntactically have probably been the most popular and widely implemented. This is perhaps surprising given the very different models that lurk behind relational databases and RDF -- familiarity with syntax has no doubt contributed to this success. SPARQL follows this well-trodden path, offering a simple, reasonably familiar (to SQL users) SELECT query form which will be the main focus of this first article.

SPARQL actually consists of three separate specifications. The query language specification makes up the core. But alongside it sits the query results XML format which, as you might guess, describes an XML format for serializing the results of a SPARQL SELECT (and ASK) query. This simple format is easily processable with common XML tools such as XSLT; we'll look at an example of that later.

The third specification is the data access protocol, which uses WSDL 2.0 to define simple HTTP and SOAP protocols for remotely querying RDF databases. (Or, cunningly, for querying any data repository that can be mapped to the RDF model.) The XML results format is used to generate responses from services that implement this protocol.

In total, then, SPARQL consists of a query language, a means of conveying a query to a query processor service, and the XML format in which query results will be returned.

There are a number of issues that SPARQL does not address yet; most notably, SPARQL is read-only and cannot modify an RDF dataset. Work on this area is currently out of scope for the DAWG, as noted in Section 2 of their charter. It seems likely that this will become a later task for the Working Group once the initial specifications have reached Recommendation status. A similar strategy of "query first, update later" was also adopted by the XQuery Working Group.

SPARQL Query Tools

Happily the SPARQL specifications don't exist in isolation. There are several tools and APIs that already provide SPARQL functionality, and most of them are up to date with the latest specifications. A brief list includes:

My SPARQL query tool Twinkle offers a simple GUI for the ARQ library, and supports multiple output formats and simple facilities for loading, editing, and saving queries. It's handy if you want to play with SPARQL on the desktop. But for a minimum of installation fuss you can't beat an online SPARQL query tool, which we'll use throughout the rest of the tutorials. As it happens, the service is also a self-contained example of the SPARQL protocol in action.

The Periodic Table in RDF

Tutorial writers can burn a lot of time crafting a good set of examples. A balance needs to be struck between making the data clear and making it too trivial. What you really want is for the examples to reflect the power of the technology being introduced. For this series, I'm going to dispense with the art of data design and instead pick up some data already published in the wild on the Web. That is, we're doing real RDF processing of real-world data. Not only will this help illustrate SPARQL's utility, we may even learn a few interesting facts along the way.

Bob DuCharme has done an excellent job of curating public collections of RDF on his site rdfdata.org. I've picked out this RDF representation of the periodic table for our purposes. It's data that most people will have at least a passing familiarity with, so it won't take a great deal of review for you to get started. Here's a handy periodic table to use as a reference if your chemistry is a little rusty.

The RDF data provides some essential facts about each element including its name, symbol, atomic weight and number, plus a good deal more. We'll focus on these simple properties for now. A slightly edited extract of the data, showing a description of sodium, is included here:

Note that the namespace for this data is http://www.daml.org/2003/01/periodictable/PeriodicTable# -- that'll be important when we start formulating our SPARQL queries. The RDF includes a mixture of properties; some are simple literals such as name and atomicWeight, while others such as group and standardState have resources as values.

Introducing the Triple Pattern

RDF is built on the triple, a 3-tuple consisting of subject, predicate, and object. Likewise SPARQL is built on the triple pattern, which also consists of a subject, predicate and object. In fact an RDF triple is also a SPARQL triple pattern. A triple from our data expressed using the SPARQL triple pattern syntax looks like this:

A triple pattern is written as subject, predicate, and object and is terminated with a full stop. URIs, e.g. for identifying resources, are written inside angle brackets. Literal strings are denoted with either double or single quotes. While properties, like name, can be identified by their URI, it's more usual to use a qname-style syntax to improve readability. Later in the tutorial I'll show you how to associate a prefix with a URI using a mechanism very similar to XML namespaces.
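As a sketch of the syntax just described, the sodium triple could be written in either of the following two ways. The resource URI ending in #Na and the lowercase literal "sodium" are assumptions about the data, and the second form assumes the prefix table: has been bound to the PeriodicTable# namespace mentioned earlier.

```sparql
# Full URIs in angle brackets, a quoted literal, and a terminating full stop.
<http://www.daml.org/2003/01/periodictable/PeriodicTable#Na>
    <http://www.daml.org/2003/01/periodictable/PeriodicTable#name> "sodium" .

# The same triple with the predicate abbreviated qname-style
# (assumes table: is bound to the PeriodicTable# namespace).
<http://www.daml.org/2003/01/periodictable/PeriodicTable#Na> table:name "sodium" .
```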

SPARQL specifies a number of handy abbreviations for writing complex triple patterns. Both the basic syntax and abbreviations borrow heavily from Turtle, a very terse RDF serialization alternative to RDF/XML. As a text rather than XML format, Turtle can be used to express RDF very succinctly. Rather than exhaustively list all of the SPARQL syntax shortcuts here, we'll introduce them throughout the examples contained in this and later tutorials.

The triple pattern above is fine for demonstrating syntax but isn't very useful as a query. If we know all the data, there's no need to run a query. However, unlike a triple, a triple pattern can include variables. Any or all of the subject, predicate, and object values in a triple pattern may be replaced by a variable. Variables are used to indicate data items of interest that will be returned by a query. The next example shows a pattern that uses variables in place of both the subject and the object:

?element table:name ?name.

Since a variable matches any value, this pattern will match any RDF resource that has a name property. (SPARQL also allows a variable to be spelled with a $ character, as in $element.) Each triple that matches the pattern will bind an actual value from the RDF dataset to each of the variables. For example, there is a binding of this pattern to our dataset where the element variable is bound to <http://www.daml.org/2003/01/periodictable/PeriodicTable#Cl> and the name variable is bound to "chlorine".

In SPARQL all possible bindings are considered, so if a resource has multiple instances of a given property, then multiple bindings will be found. That's a good thing to remember if you end up with more data than expected in your query results.

At this point you may be wondering if it's legal for a triple pattern to include only variables. Well, it is:

?subject ?predicate ?object.

This pattern matches all triples in an RDF graph.

Triple patterns can also be combined to describe more complex patterns, known as graph patterns. These will be clearer when seen within the context of some sample queries. So let's look at the basic structure of our first SPARQL query.

Structure of a Query

This SPARQL query selects the names of all the elements in the periodic table:

Let's break down the query into its parts to better understand the syntax.

Starting from the top we encounter the PREFIX keyword. PREFIX is essentially the SPARQL equivalent of declaring an XML namespace: it associates a short label with a specific URI. And, just like a namespace declaration, the label applied carries no particular meaning. It's just a label. A query can include any number of PREFIX statements. The label assigned to a URI can be used anywhere in a query in place of the URI itself; for example, within a triple pattern. In the single triple pattern included in this query we can see the table prefix in use as a shorthand for http://www.daml.org/2003/01/periodictable/PeriodicTable#name, the full URI of the name property.

The start of the query proper is the SELECT keyword. Like its twin in a SQL query, the SELECT clause is used to define the data items that will be returned by a query. In this query we're returning a single item, the name of the element.

As you might expect, the FROM keyword identifies the data against which the query will be run. In this instance, the query references the URI of the periodic table in RDF. A query may actually include multiple FROM keywords, as a means to assemble larger RDF graphs for querying. We'll have more to say about that (and SPARQL datasets in general) in the next tutorial. For now, think of all the lovely mashups . . .

Finally, we have the WHERE clause, which contains the graph pattern for the query. A graph pattern is a collection of triple patterns, enclosed in braces, that identifies the shape of the graph we want to match against. In this instance you'll recognize the pattern for this query as the triple pattern we used earlier.
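Putting the four parts together, a query of the shape described above might look like the following sketch. The namespace is the one given earlier for the data; the FROM URI (PeriodicTable.owl) is an assumption about where the RDF is published and should be replaced with the actual location of the data.

```sparql
# Bind the label "table" to the namespace of the periodic table data.
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

# Return a single variable per solution.
SELECT ?name

# Data to query (assumed location, not confirmed by the text).
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>

# Graph pattern: any resource that has a name property.
WHERE { ?element table:name ?name. }
```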

The WHERE keyword is actually optional and can legally be omitted to make queries slightly terser:

URIs are often long and unwieldy, and you can never have too much syntactic sugar to help avoid typing them out repeatedly. BASE is another form of URI abbreviation, defining the base URI against which all relative URIs in the query will be resolved, including those defined with PREFIX. As you can see, the common prefixes of the two URIs in the previous example have been factored out into a BASE URI declaration.
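A minimal sketch of both abbreviations, dropping the WHERE keyword and factoring the shared part of the two URIs into a BASE declaration, could look like this. The file name PeriodicTable.owl is again an assumption about the data's location.

```sparql
# Relative URIs below are resolved against this base.
BASE <http://www.daml.org/2003/01/periodictable/>

# Resolves to http://www.daml.org/2003/01/periodictable/PeriodicTable#
PREFIX table: <PeriodicTable#>

SELECT ?name

# Resolves to http://www.daml.org/2003/01/periodictable/PeriodicTable.owl (assumed location)
FROM <PeriodicTable.owl>

# The WHERE keyword has been omitted; the braces alone introduce the pattern.
{ ?element table:name ?name. }
```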

Now that we've written a complete query, let's run it and get some results.

Our First Results

The result of a SPARQL SELECT query is a sequence of results that, conceptually, form a table or result set. Each row in the table corresponds to one query solution. And each column corresponds to a variable declared in the SELECT clause. If you've done any kind of database development, this kind of table-oriented result set should be immediately familiar.

In later sections we'll look at how that sequence can be modified, e.g. to apply a sort order, limit the number of returned results, etc. We'll also take a quick look at the XML results format. But for now, let's make the query do something more interesting.

Graph Patterns

Taking what we've learned about the simplest kind of triple patterns and the structure of a SPARQL query, we can now explore how to do more complex and useful queries.

The next example shows a query that selects the name, symbol, and atomic number of all elements in the periodic table:

What's new here is that the query pattern consists of multiple triple patterns. A collection of triple patterns is a graph pattern. In this instance the graph pattern consists of three triple patterns, one to match each of the desired properties: name, symbol, and atomicNumber. Understanding how this query operates involves a bit more background on the pattern matching process.

The most important point is that within a graph pattern a variable must have the same value no matter where it is used. So in the previous example the variable element will always be bound to the same resource. In other words, this query will match any resource that has all three of the desired properties. A resource that does not contain all of these properties will not be included in the results because it won't satisfy all of the triple patterns. We'll cover optional matching in a later section.

The other notable item here is that there is one triple pattern for each of the variables required to be present in the result set. In SPARQL a variable named in the SELECT clause only receives a value if it also appears in the graph pattern; a variable that is selected but never matched is simply left unbound in every row. This may seem slightly odd if you're only used to SQL; in that language it is quite common to return columns that are not mentioned in a WHERE clause. But remember a SPARQL query processor has no data dictionary that lists all columns (i.e. properties) of a resource. Variables must be bound to an RDF term via a triple pattern in order for the processor to be able to extract that term from the graph.
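A query along these lines, with one triple pattern per selected variable, might be sketched as follows. The property names symbol and atomicNumber are taken from the description of the data above; the FROM URI is an assumed location.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?name ?symbol ?number
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  # ?element must be the same resource in all three patterns,
  # so only resources with all three properties are returned.
  ?element table:name ?name.
  ?element table:symbol ?symbol.
  ?element table:atomicNumber ?number.
}
```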

Graph Pattern Shortcuts

SPARQL includes a number of syntax shortcuts that simplify the writing of patterns. Let's rewrite our query more succinctly:

We've used two shortcuts here. The first should be familiar to SQL users: *. This shortcut means "return all variables listed in the graph pattern." It saves having to itemize every variable at the cost of relying on the processor to order the columns in the result set.

The second shortcut is, formally, the use of a predicate-object list. This shortcut allows a query author to list the subject of a series of triple patterns only once. When we're using this form, each predicate-object pair is followed by a semicolon rather than a full stop, and the full stop appears only after the last pair. This shortcut can be used when several patterns share the same subject.
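Here is a sketch of a query that uses both shortcuts. It assumes the same property names (name, symbol, atomicNumber) and the same assumed dataset location as before.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

# * returns every variable in the pattern, including ?element.
SELECT *
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  # Predicate-object list: the subject is written once,
  # semicolons separate the pairs, and a full stop ends the list.
  ?element table:name ?name;
           table:symbol ?symbol;
           table:atomicNumber ?number.
}
```

Note that SELECT * also returns the element variable itself, which the explicit variable list would normally leave out.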

SPARQL offers a similar shortcut, an object list, which simplifies patterns that share a subject and predicate and differ only in their object. The objects are separated by commas.
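An object list covers patterns that have the same subject and predicate and differ only in their object. The following sketch is purely illustrative: the variable ?otherName is hypothetical, and the dataset location is assumed.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?element ?name ?otherName
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  # Object list: one subject, one predicate, objects separated by commas.
  ?element table:name ?name, ?otherName.

  # The line above is shorthand for these two triple patterns:
  #   ?element table:name ?name.
  #   ?element table:name ?otherName.
}
```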

OPTIONAL Patterns

RDF graphs are often semi-structured; some data may be unavailable or unknown. How do we allow for this when querying for data? Let's work through an example to illustrate the problem. Imagine that we wanted to adapt the previous query to also return the color of the element. Our first attempt may look like this:

We've extended our SELECT statement to include the new variable, color, and have also added a match for the relevant property (table:color) to the graph pattern. So far, so good.
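A first attempt of this kind could be sketched as below, assuming the property names used so far and the assumed dataset location. Because the color pattern is a required part of the graph pattern, any element without a table:color property cannot match.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?name ?symbol ?number ?color
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  ?element table:name ?name;
           table:symbol ?symbol;
           table:atomicNumber ?number;
           # Required: elements lacking a color are dropped from the results.
           table:color ?color.
}
```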

If you run this query, though, you'll notice that some elements are missing. Ununtrium, for example. (No, I'd never heard of it either.) If we look closely at the RDF data, we find that ununtrium, and several of the other heavier elements, do not have the relevant table:color property. So these elements are not returned in the results.

We need to alter the query to allow for the fact that we have some missing or incomplete data. We achieve this by indicating that the relevant triple pattern is optional:

If you run this version of the query you'll find that all of the elements are now correctly included. The OPTIONAL keyword must be followed by a sub-pattern containing the optional aspects of the query. Within the result set, if an element doesn't have a color property, then the color variable is said to be unbound for that particular solution (row).
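The revised query might look like the following sketch, which moves only the color pattern into an OPTIONAL sub-pattern. Property names and the dataset location are assumed as in the earlier examples.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?name ?symbol ?number ?color
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  # Required part: every returned element has these three properties.
  ?element table:name ?name;
           table:symbol ?symbol;
           table:atomicNumber ?number.

  # Optional part: ?color is bound when present and left unbound otherwise.
  OPTIONAL { ?element table:color ?color. }
}
```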

Matching Alternatives with UNION

Now that we've seen how to explore optional data, let's see how we can select from alternatives. If we were interested in the chemistry of the halogens and the noble gases, we might simply construct and run separate queries in order to find out their atomic weights and CAS registry numbers.

But using the SPARQL UNION keyword we can write a single query that matches the elements of both groups. That query looks like this:

There are a few things to notice. First, the query pattern consists of two nested patterns joined by the UNION keyword. If an element resource matches either of these patterns, then it will be included in the query solution. For clarity the patterns use the predicate-object list shortcut.

The query also includes another demonstration of URI shortening, this time within the object of a triple pattern. The value (range) of the table:group property is a resource. Each of the groups in the table is modeled as a resource with its own URI. The full URI for group 17 is http://www.daml.org/2003/01/periodictable/PeriodicTable#group_17. As we've already declared a URI PREFIX for http://www.daml.org/2003/01/periodictable/PeriodicTable# we can truncate this to table:group_17.

Any number of UNIONs can be included in a query, providing a great deal of flexibility in assembling data from alternatives.
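A sketch of a UNION query for the halogens (group 17) and the noble gases (group 18) follows. The property name casRegistryID is an assumption about how the data models CAS registry numbers, table:group_18 is assumed to follow the same naming as table:group_17, and the dataset location is assumed.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?name ?symbol ?weight ?cas
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  {
    # Halogens: table:group_17 abbreviates the full group URI.
    ?element table:group table:group_17;
             table:name ?name;
             table:symbol ?symbol;
             table:atomicWeight ?weight;
             table:casRegistryID ?cas.
  }
  UNION
  {
    # Noble gases.
    ?element table:group table:group_18;
             table:name ?name;
             table:symbol ?symbol;
             table:atomicWeight ?weight;
             table:casRegistryID ?cas.
  }
}
```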

Sorting

With all of the examples we've seen so far, we've been content to let the results be returned in whatever order the query engine chooses. This is rarely desirable in practice, as we commonly need to impose some sensible and relevant ordering to the data.

SPARQL offers the ORDER BY clause to let us do precisely that. The next example demonstrates the new syntax:

This example selects the name and atomicNumber of all of the elements in group 18 of the periodic table, the noble gases. The ORDER BY clause indicates that the elements should be ordered by their atomic number property, in ascending order.

Formally, ORDER BY is a solution sequence modifier -- it manipulates the result set prior to it being returned by the query processor. As such, it is not part of the graph pattern and so is listed after the WHERE clause in the query syntax.
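For illustration, a query of this kind could be written as in the sketch below, with ORDER BY placed after the closing brace of the graph pattern. The group resource table:group_18 and the dataset location are assumptions.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?name ?number
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  ?element table:group table:group_18;
           table:name ?name;
           table:atomicNumber ?number.
}
# Solution sequence modifier: ascending order is the default.
ORDER BY ?number
```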

An ORDER BY clause can list one or more variable names, indicating the variables that should be used to order the result set. The query processor will sort by each variable in turn, in order of their declaration. By default all sorting is done in ascending order, but this can be explicitly changed using the DESC (descending) and ASC (ascending) functions. The next example sorts all of the elements in the periodic table in descending order of atomic weight:
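A descending sort by atomic weight might be sketched like this, assuming the atomicWeight property holds values the processor can compare numerically and that the data lives at the assumed location.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?name ?weight
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  ?element table:name ?name;
           table:atomicWeight ?weight.
}
# DESC reverses the default ascending order; ASC(?weight) would state it explicitly.
ORDER BY DESC(?weight)
```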

SPARQL also allows us to limit the total number of results in a result set using the LIMIT keyword, which indicates the maximum number of rows that should be returned. A value of zero will return no results; if the value is greater than the size of the result set, then all rows will be returned. Used in combination with ORDER BY we can modify our query to create a new query that returns the ten heaviest elements in the periodic table:
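A sketch of a "ten heaviest elements" query, combining a descending sort with LIMIT, is shown below. Property names and the dataset location are assumed as in the other examples.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?name ?weight
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  ?element table:name ?name;
           table:atomicWeight ?weight.
}
ORDER BY DESC(?weight)
# Return at most ten rows from the sorted sequence.
LIMIT 10
```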

When building user interfaces to navigate through a database or set of results, it's common to break the results into pages, e.g. displaying 10 search results at a time. SPARQL supports such paging by allowing a query to specify an OFFSET into the result set. This indicates that the processor should skip a fixed number of rows before constructing the result set. This usage is naturally combined with ORDER BY in order to ensure a consistent and meaningful order. By way of example, let's assume that we've already listed the ten heaviest elements in the periodic table and now want to fetch the next ten heaviest. In this query we use OFFSET to skip the data we've already seen:
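Paging to the second group of ten could then be sketched as follows, under the same assumptions about property names and dataset location.

```sparql
PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>

SELECT ?name ?weight
FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
WHERE {
  ?element table:name ?name;
           table:atomicWeight ?weight.
}
ORDER BY DESC(?weight)
# Skip the first ten rows, then return the next ten.
LIMIT 10
OFFSET 10
```

Without the ORDER BY clause the rows skipped by OFFSET would be arbitrary, so successive pages could overlap or miss results.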

SPARQL Query Results XML Format

For readability the examples we've viewed so far have been rendered as HTML tables. Most SPARQL processors will include a custom API to allow the direct manipulation of a result set, allowing a programmer to manipulate results in whatever way suits an application. But if we want to serialize a SPARQL result set in a standard way, perhaps to return data via a web service, we can use the SPARQL Query Results XML Format.

By way of an example, here's an extract of the results from the first example above. To view the complete set of results, refer to the online service:

All of the key elements belong to a single namespace, http://www.w3.org/2005/sparql-results#.

The root element is sparql, which contains a head and a results element that together describe the result set.

The head section declares all variables that will be returned in the result set. It's equivalent to the column headings in an HTML table.

The results section lists each query result, i.e. one result element for each row in the result set.

A result element contains one binding for each variable. Each binding holds either a literal or a uri element, and these elements contain the actual values returned. If a variable is not bound in a query (see the above section on OPTIONAL Patterns), then it is marked as unbound.

Given its obvious simplicity and regular structure, manipulating this format with XSLT or XQuery is fairly trivial. The SPARQL Query Results XML Format specification includes several relevant examples.

Summary

This brings us to the end of our first look at SPARQL.

We've seen how SPARQL allows us to match patterns in an RDF graph using triple patterns, which are like triples except they may contain variables in place of concrete values. The variables are used as "wildcards" to match RDF terms in the dataset.

We introduced the SELECT query which can be used to extract data from an RDF graph, returning it as a tabular result set. We built up more complex graph patterns from simple triple patterns and illustrated how to deal with both required and OPTIONAL data. UNION queries were also introduced as a way of dealing with selecting alternatives from our dataset. Finally, we demonstrated how to apply ordering to our results, LIMIT the amount of data returned, and jump forward through results using OFFSET.

Along the way we took a brief look at the SPARQL Query Results XML Format, and a number of the syntax shortcuts that make writing queries much simpler. These are especially useful with repetitive graph patterns and long URIs.

Armed with this information, and the growing range of SPARQL implementations, you can start to investigate the language yourself and put it to good use. As you begin working with the language you'll no doubt find Dave Beckett's query language reference a handy resource.

In our next tutorial in this series we'll look more closely at how SPARQL deals with data typing, applying constraints to our data, and the facilities for querying data from multiple sources.

Finally, I'd like to thank Katie Portwin and Priya Parvatikar for early feedback on this article.