Translating SPARQL queries into SQL using R2RML

RDF is the data model of the semantic web as it has been conceived by the W3C. This data model is extremely general. Because RDF has been modelled to be after the web and using web technologies, like URIs, it is also very easy to link different RDF graphs using the same mechanism that makes possible to link web pages. RDF is also finally becoming a mainstream technology. Right now, there is already a huge amount of information available as inter-linked RDF graphs, for example as part of the Linked Data initiative.

Another advantage of having information available in the web as RDF graphs is that this information, once it has been retrieved, can be easily queried using the SPARQL W3C recommendation.

SPARQL can be used to query data stored using the RDF data model in the same way SQL can be used to query data stored using the relational data model. In fact, SPARQL could be used as a API to retrieved data from a RESTful web service where each web service URI is attached to a RDF graph or subgraph.

The question from an infrastructure point of view is what software should be used to store RDF graphs and to provide a SPARQL interface to these data.

Special purpose data stores for RDF have been available for a long time: triple stores like Jena, Mulgara or AllegroGraph or Virtuoso are possible solutions. It is also possible to use graph databases supporting the RDF model like Neo4j.

Unfortunately, deploying new infrastructure requirements supposes many times a real barrier for the adoption of semantic technologies in the development of web systems. This is specially true when most of the present web applications are already built using the relational model and migrating these data to a triple store solution, or adopting a hybrid solution are not valid options.

In these situations, recent W3C proposals like R2RML can provide an easy way to introduce semantic technologies in web applications using existing relational solutions and data.

R2ML provides a standard way to lift relational data to the RDF data model. This way, web application developers can expose their information as linked data with important gains for web clients, in example, the abilitiy to link the application data to other data providers using URIs and the possibility to use standard vocabularies in the description of the data, without any modification in the underlying mechanism to store these data.

However, if a web client consuming the generated RDF wants to use SPARQL to query the data, it must first retrieve all the data from the R2RML mapped web service and save it locally in some kind of SPARQL enabled store. This problem would disappear if the web service could offer a SPARQL endpoint to the clients so they could retrieve only the fragment of the whole RDF graph they were interested in.

To accomplish this task it would be necessary to translate the SPARQL query in a semantic preserving SQL query that could be executed against the relational data mapped by the R2RML description.

The efficient translation of SPARQL into SQL is an active field of research in the academy and in the industry. In fact, a number of triple stores are built as a layer on top of a relational solution. Support for SPARQL in these RDF stores supposes the translation of the SPARQL query to a SQL query that can be executed in a certain relational schema.

Some foundational papers in the field include “A Relational Algebra for SPARQL” by Richard Cyganiak that translates the semantics of SPARQL as they were finally defined by the W3C to the Relational Algebra semantics or “Semantics preserving SPARQL-to-SQL translation” by Chebotko, Lu and Fotohui, that introduces an algorithm to translate SPARQL queries to SQL queries.

This latter paper is specially interesting because the translation mechanism is parametric on the underlying relational schema. This makes possible to adapt their translation mechanism to any relational database using a couple of mapping functions, alpha and beta, that map a triple pattern of the SPARQL query and a triple pattern and a position in the triple to a table and a column in the database.

Provided that R2RML offers a generic mechanism for the description of relational databases, in order to support SPARQL queries in any R2RML RDF graph, we just need to find an algorithm that receives as an input the R2RML mapping and builds the mapping functions required by Chebotko et alter algorithm.

The straightest way to accomplished that is using the R2RML mapping to generate a virtual table with a single relation with only subject, predicate and object. The mapping for this table is trivial. A possible implementation of this
algorithm can be found in the following Clojure code.

The code basically generates a SQL query where for each triple-map and their included term-maps in the R2RML mapping generates a SQL SELECT subquery with the right selection and projections. The final table is the UNION of all the generated subqueries.

The generated SQL code is far from being optimal in this sample implementation, but it is another glimpse of the possibilities R2RML introduces to boost the adoption of semantic technologies in common everyday web development.

This implementation does not take into account the use of datatypes and language support in RDF literals or the possibility of select from different graphs in the query. It neither generates any additional triple due any entailment regime.

It should be possible to improve the translation mechanism using the existing research in SPARQL to SQL translation, for example, “A Complete Translation from SPARQL into Efficient SQL” by Elliott, Cheng et alt. or to find better translation alternatives.

Another option could be extending the R2RML vocabulary with optional terms that could be used to give hints to a translation mechanism so it can provide more efficient translations for more specific database schemas.