30 November 2006

I wrote about the SPARQL algebra last time, not with the idea of a regular update about DAWG work, but noting a significant issue being considered by the working group.
I hope that the wider community will comment and contribute; much of the ground work for the algebra is drawn from outside the working group anyway.

This week DAWG made a significant decision in a different, but nearby, area: basic graph pattern (BGP) matching.

Basic Graph Patterns

Basic graph patterns (BGPs) are the building block in SPARQL. They are a
sequence of adjacent triple patterns in the query string. A BGP is matched
against whatever is being queried and the results of matching feed into the
algebra.

The decision is, in effect, no change: the "implementation hint" for BGP matching, made more formal, becomes the basis
of SPARQL. No query results will be changed by the decision.

The Issue

Having revealed the punch line, what's the joke?
The issue is whether blank nodes in BGP matching behave like hidden, named variables, or as a different kind of variable altogether. This matters for counting; it does not matter for the logical meaning of a solution.

Example data:

:a :p 1 .
:a :p 2 .

Example query pattern:

{ ?x :p [] }

How many answers? Blank nodes are existential variables in RDF; named variables (the regular ?x ones) are universal variables. Queries don't return the binding of an existential; queries can return the binding of a named variable, and the bound value of a named variable can be passed to other parts of the query (FILTERs, OPTIONALs
etc.) via the algebra.

In the absence of a DISTINCT in the query, are the solutions to this pattern:

1 : ?x = :a (one solution)

2 : ?x = :a , ?x = :a (two solutions, one per matching triple)

or either 1 or 2, at the implementation's discretion?
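Written out as full queries (a sketch; the prefix declaration is assumed), the question is whether the first form returns one row or two. With DISTINCT, both readings give one row:

```sparql
PREFIX : <http://example/ns#>

# Does this return one row or two?
SELECT ?x { ?x :p [] }

# One row under either reading
SELECT DISTINCT ?x { ?x :p [] }
```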

In OWL-DL, existential variables give more expressivity: if there is a disjunction in the ontology, the reasoner can know that something exists without finding an actual value - it may not be able to
find the value anyway - but the reasoner does know there is "something there".

Users and Implementers

For application writers, the main use of blank nodes in queries that I've seen is
the convenience of using []:

{
  ?x :something [ :p ?v ; :q ?w ]
}

[] saves having to introduce a named variable and split the
expression. It would be one more thing to learn if it gave different answers
(duplicates) from the pattern using query variables:

{
  ?x :something ?z .
  ?z :p ?v ;
     :q ?w .
}

especially if the first pattern had been written:

{
  ?x :something _:bnode .
  _:bnode :p ?v ;
          :q ?w .
}

which is the [] example written out with a labelled blank node.
This is also why it's BGPs that are the building block, not triple patterns -
blank nodes are scoped to BGPs.

For implementers, having two kinds of variable that behave differently in matching can make life harder, especially if they are using some existing technology, like an SQL database,
which has one preferred type (universal-like). Work on SPARQL semantics suggests that there can be one (complicated) SQL statement for one SPARQL query. Nice if it's true, because all the work that has gone into SQL database optimizers is
reusable. Having two types of variables would need extra SQL to be generated, like
nested SELECTs, or some post-processing.

More work for implementers is not good; many open source RDF toolkits are built by people in their own time, or by people with other things to do at work. So it might be a
noticeable dampener on toolkit development and hence deployment. Not a good idea without a reasonable need.

DAWG Resolution

Logically, duplicates don't make any difference. DAWG decided there are two
solutions for the simple entailment matching example above. It's what
implementations currently do.

The
DAWG process is often test case driven. We use test cases as both a concrete way to define what results we expect and also to make concrete decisions. So once consensus was emerging to have duplicates, the
working group decided to approve a test case that captured that. In this case, the test is
rdfsemantics-bnode-type-var (the example above is the same idea, just
shorter).

Defining BGP Matching for SPARQL

In the Last Call 2,
the first CR
publication and the
intermediate publication
versions of SPARQL, BGP matching is a
complicated definition that allows for different entailment regimes, for extensibility
within a single entailment framework. SPARQL is defined for simple entailment, and
so for anything where there is a virtual graph, but it allows for extension to more complicated entailment regimes, like OWL-DL.

The trouble with the matching definition is that it has a bug: it allows too many solutions when the graph is redundant. The bug has to be fixed.

SPARQL is defined for simple
entailment. With this decision on how blank nodes behave, the working group can now rework
the BGP matching. The options are outlined in the message
2006OctDec/0140,
and either of the variable mapping approaches can be made to work. The result is more
akin to a formalised version of the
last call 1
style and of the implementation hint in the complex version.

... and Extensibility

Extensibility comes by replacing the whole of BGP matching, subject to a few
natural conditions, such as: all named variables must be bound in a BGP match, and there are no accidental
clashes of blank nodes.

(Vested interest declaration)

OK - ARQ isn't built on SQL; it's a self-contained query engine with various extension points. By default, it queries any
Jena graph and exploits Jena's direct connection to the storage layer, but only for basic graph patterns, not more complex patterns.
It could handle the two types of variables quite easily.

But we also have a Jena SPARQL database engine (called SDB) that is built on SQL and is specifically designed for SPARQL, for large datasets
and for named graphs.

SDB fits in as an ARQ query engine, and the more of a query it can turn into a single SQL
statement the better - all of the query being the ideal case (and hopefully the
normal one). So making the translation to a relational quad store simpler
is helpful for SDB.

23 November 2006

Basic graph patterns (BGPs) are a block of adjacent triple patterns
that match, or don't match, all together. In BGP matching every named
variable must have some value for the BGP to have been matched. Filters
add restrictions on the values variables can take.
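For example (a sketch with made-up properties), this BGP of two triple patterns only matches when both patterns match, giving values for both ?name and ?age, and the FILTER then restricts the values ?age can take:

```sparql
{ ?x :name ?name .
  ?x :age ?age .
  FILTER(?age > 18)
}
```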

They are also an extension point in SPARQL because you can plug in
a different BGP matcher (e.g. a DL reasoner or a wrapper for legacy
SQL data), then reuse the other two layers to form complex queries out
of the exact matches from BGP matching.

Graph expressions (OPTIONAL, UNION,
GRAPH, and groups - things between {}) combine BGPs in various
ways to give more complicated patterns. Graph expressions are recursive,
so you can have patterns within patterns. A SPARQL query has one
graph expression in the WHERE clause.

Solution modifiers (Project, DISTINCT,
ORDER BY, LIMIT/OFFSET) take the output of
matching the query graph pattern and process it in various ways to
yield the result set.
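Putting the three layers together (a sketch; the vocabulary is made up):

```sparql
SELECT DISTINCT ?name            # solution modifiers: projection, DISTINCT
WHERE
{ ?x :name ?name .               # basic graph pattern
  OPTIONAL { ?x :mbox ?mbox }    # graph expression
}
ORDER BY ?name                   # solution modifier: sort
LIMIT 10                         # solution modifier: slice
```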

Current Situation

The semantics for SPARQL have, so far, been declarative and top-down.
A solution to a query passes if it meets all the conditions following
from the definitions of each of the SPARQL graph pattern forms and filters.
That means, in theory, the whole of the solution is available.

In fact, ARQ builds solutions up by adding the
minimum necessary at each subpattern to make the solution match.
This is what gives ARQ
streaming evaluation which keeps the total memory use down. But the
effect is the same, whole solutions are available everywhere in the
pattern.

Another way is to have a bottom-up algebra. The
relational
algebra works this way. Subexpressions are evaluated
to give tables (relations) and then these tables combined together
to form new tables.

It turns out that these are nearly the same as shown by
Richard Cyganiak and
Jorge Perez et al
(see references below).
The class of graph patterns that differ are a certain kind of
nested
OPTIONAL where a variable is used on the outside,
fixed part, and in the inner OPTIONAL, but not in
between.
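A pattern in this class (with made-up properties) looks like the following: ?v is used in the outer fixed part and in the inner OPTIONAL, but not in the level between:

```sparql
SELECT *
{ ?x :p ?v .
  OPTIONAL
  { ?y :q ?w .
    OPTIONAL { ?z :r ?v }
  }
}
```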

Now that is a rather unusual query.
The tricky bit is the use of ?v at the outermost and
innermost levels, but not in between. The query can be rewritten so
as not to nest (note the repeat of :x3 :q ?w).

In DAWG, we're considering
an algebra for SPARQL, based on the paper
Semantics
of SPARQL. This will mean a change of results in the case
above. I've never seen such a query in the wild, only in artificial test cases.

The change from declarative semantics to a constructional algebra is
motivated primarily by the implementers in the working group wanting to
apply all the good work on relational algebra optimization to SPARQL
engines. The complexity of the exact mapping in semantics-preserving
SPARQL-to-SQL translation is daunting.

Not that simple!

But other things differ as well if a purely relational style is
applied (and that's not being proposed currently), because
"Semantics of SPARQL"
does not consider the different treatment of filters (its focus is on the
combination of graph patterns).

Two cases arise: one where filters are in OPTIONALs, and one
where filters are nested inside groups.

SELECT *
{ ?x :p ?v .
  OPTIONAL
  { ?y :q ?w .
    FILTER(?v = 2)
  }
}

The FILTER uses a variable from the fixed part of
the OPTIONAL. Under Jorge's semantics the variable
isn't in-scope, so the filter evaluation is an error (unbound variable),
otherwise known as false; the optional part never matches.
This form of query is realistic and does occur in practice,
so the current proposal for DAWG makes this work. Like SQL's
LEFT OUTER JOIN with an ON clause,
the LeftJoin operation in the SPARQL algebra can take a condition
which applies over the whole LeftJoin.
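As a sketch in the algebra notation (the exact form may differ in the final document), the query above becomes a LeftJoin that carries the filter expression as its condition:

```
LeftJoin( BGP( ?x :p ?v ),
          BGP( ?y :q ?w ),
          ?v = 2 )
```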

The fixed
pattern could be repeated inside the optional part, but such repetition,
for something applications might well use, is bad, very bad.

Application writers will have to take care about scope and
groups: putting FILTERs inside {} changes
their scope.

Currently,

{ :x :p ?v . FILTER(?v < 5) }

and

{ :x :p ?v . { FILTER(?v < 5) } }

have the same effect. But they don't in the algebraic form,
because { FILTER(?v < 5) } evaluates over the empty
table, which does not have a ?v in it; evaluation is an
error and hence false.

ARQ

There is an implementation of the algebra in
ARQ in CVS
(not in the current release, ARQ 1.4). It's enabled in the command
line tool
arq.sparql
with --engine=ref, and with

QueryEngine2.register() ;

in code.

The work-in-progress,
unfinished
editor's working text is available. Check it is actually up to
date, because that's just work space and isn't a DAWG document. It does not
reflect working group decisions - it is more a proposal for consideration.

Disclaimer: Any views expressed here are mine and not
those of the working group (which at
the time of writing is thinking about it but hasn't decided
anything yet).

12 November 2006

SPARQL is normally
thought of as only querying fixed RDF data. At the core of SPARQL are
the building blocks of basic graph patterns, and on top of these there is
an algebra to create more complex patterns (OPTIONAL, UNION, FILTER, GRAPH).

The key question a basic graph pattern asks is "does this pattern match
the graph?". The named variables record how the pattern matches.

Not all information needs to be in the raw data. ARQ
property functions are a way to let the application add some
relationships to be computed at query execution time.

LARQ
adds free text search. The real work is done by
Lucene.
LARQ adds ways to create a Lucene index from RDF data
and a property function to perform free text matching
in a SPARQL query.
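A sketch of what a LARQ query looks like (the property function name and namespace here are my recollection of the ARQ conventions; check the LARQ documentation for the exact form):

```sparql
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>

SELECT ?doc
{ ?lit pf:textMatch "+keyword" .   # free text match against the Lucene index
  ?doc ?p ?lit .                   # connect matched literals back to subjects
}
```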

This will be in ARQ 1.5 but is available from
ARQ CVS now. Hopefully, this will be useful to users and
application writers.
Comments and feedback on the design are welcome, especially
before the next ARQ release.

13 October 2006

SPARQL
is a graph pattern matching language.
The matching can be semantic
(basic graph patterns are entailed by the data in order to match)
or through the
algebra that works over the top of basic graph pattern matching.

Complex queries make use of UNION and OPTIONAL,
including the idiom with
OPTIONAL and !BOUND, in all sorts of creative ways.
And sometimes they want the answers to indicate which way the query matched
to get a particular solution.
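For example (a sketch assuming SKOS labels), a query might accept a label from either of two properties:

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?x ?label
{ { ?x skos:prefLabel ?label }
  UNION
  { ?x skos:altLabel ?label }
}
```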

Which matched - the prefLabel or the altLabel?
The application can assign the different branches to different
variables, although that might be inconvenient in a larger query
if ?label is used elsewhere in the query.

ARQ's property functions provide a way of adding different ways to match a
triple pattern. Uses include calling out to custom code
to perform some calculation, or calling out to use a external
index, such as Lucene.

Property functions can assign to variables in the solution - unlike
FILTER functions which must not have side effects. Property
functions can't change values but they can bind an unbound variable or check
an existing binding is compatible.

So { ?x apf:assign "Hello World" } binds ?x. It works both ways round: { "Hello World" apf:assign ?x }
gives the same results.

If subject and object are constants or already bound,
apf:assign checks to see if they are the same. If they are,
the triple pattern
matches with no change to the solution; otherwise it does not match and that solution
is rejected. The "sameness" is Jena's sameValueAs,
just like the rest of graph matching.
If both variables are unbound, apf:assign complains
and the query is broken.

07 May 2006

Sometimes, an application will be making a SPARQL query,
using the results from a previous query or using some RDF
term found through the other Jena APIs.

SQL has prepared statements - they allow an SQL statement to
take a number of parameters. The application fills in the
parameters and executes the statement.

One way is to build
a complete, new query string, parse it and execute it.
But it takes a little care to handle all cases, like
quoting special characters; you can at least use some of the
many utilities in ARQ for producing strings, such as
FmtUtils.stringForResource (it's
not in the application API but in the util
package currently).

Queries in ARQ can be
built
programmatically but it is tedious, especially when the
documentation hasn't been written yet.

Another way is to use query variables and bind them to
initial values that apply to all query solutions. Consider
the query:
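A sketch of such a query (made-up vocabulary):

```sparql
SELECT ?name
{ ?person :name ?name }
```

The application binds ?person to a known RDF term before execution (in ARQ, by supplying an initial binding such as a QuerySolutionMap), and that binding applies to every solution.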

It doesn't matter if the node is a literal, a resource with
a URI, or a blank node. It becomes a fixed value in the query - even
a blank node - because it's not part of the SPARQL syntax; it's a
fixed part of every solution.

This gives named parameters to queries, enabling something
like SQL prepared statements, except with named parameters rather
than positional ones.

This can make a complex application easier to structure and
clearer to read. It's better than bashing strings together,
which is error prone, inflexible, and does not
lead to clear code.

13 April 2006

SPARQL is now (April 2006) a W3C Candidate Recommendation,
which means it is stable enough for widespread
implementation. Actually, there are quite a few implementations already
(SPARQL
implementations page on ESW wiki).

RDQL predates
SPARQL - in fact, the RDQL design predates the
current RDF specifications, and some of the design decisions
in RDQL are a reflection of that. The biggest of these is
that RDF didn't have any datatyping, so RDQL handles tests
on, say, integers without checking the datatype (if it looks like
an integer, it can be tested as an integer).

SPARQL has all the features of RDQL and more:

ability to add optional information to query results

disjunction of graph patterns

more expression testing (date-time support, for example)

named graphs

sorting

but, above all, it is more tightly specified so queries in one implementation
should behave the same in all other implementations.

ARQ - A Query Engine for Jena

In parallel with developing the SPARQL specification,
I have been developing a new query subsystem for
Jena called
ARQ.
ARQ is now part of the Jena download.

ARQ builds on top of the existing Jena query support
for matching of basic graph patterns
(BGPs are the building block in SPARQL).
ARQ can execute SPARQL and RDQL as well as an extended form of
SPARQL. It has several extension points, such as
Property functions. The ARQ query engine works with
any Jena Graph or Model.

Converting RDQL code to SPARQL code

The functionality of RDQL is a subset of SPARQL, so it's
not hard to convert RDQL queries to SPARQL. What needs
to be done is to convert the triple syntax and
any constraints.

Syntax

SPARQL uses a Turtle-like syntax, which will be familiar to anyone knowing N3.
Namespaces go at the start of the query, not at the end as with
USING. There are no () around triple
patterns; instead there is a "." (a single dot)
between triple patterns. An RDQL query only ever has one graph pattern;
in SPARQL, blocks of triple patterns are
delimited by {}.
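A sketch of the conversion (the RDQL form is from memory, so check the details):

```sparql
# RDQL:
#   SELECT ?name
#   WHERE (?x, <http://xmlns.com/foaf/0.1/name>, ?name)
#
# The same query in SPARQL:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE { ?x foaf:name ?name }
```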

You can even use the command line tool
arq.qparse to read in an
RDQL query and write out the SPARQL equivalent, but it's a rough approximation
you'll need to check, and it may not be completely legal SPARQL.

Constraints

The constraints need the most care because SPARQL uses RDF datatyping
and RDQL doesn't. Some common areas are:

regular expressions

string equality and numeric equality

Regular expressions

A SPARQL regular expression looks like:

regex(expr, "pattern")
regex(expr, "pattern", "i")

The catch is that the expr must be a literal; it can't be a URI.
(Well - it can, but it will never match!). If you want to perform a regular
expression match on a URI, use the
str() built-in to get the string form of the URI.

regex(str(?uri), "^http://example/ns#")

Equality

RDQL has = for numeric equality and eq for
string equality. A number in RDQL was anything that could be parsed as a number, whether
it had a datatype or not (or even the wrong datatype). Likewise, anything could be
treated as a string (like URIs in regular expressions).

SPARQL has = which is taken from
XQuery/XPath Functions and Operators. It decides whether that is numeric equals,
string equals or URI-equals based on the kind of arguments it is given.
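For example (a sketch):

```sparql
FILTER ( ?age = 42 )        # numeric equality when ?age is bound to a typed number
FILTER ( ?name = "John" )   # string equality when ?name is bound to a plain literal
```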

API Changes

The ARQ API is in the package com.hp.hpl.jena.query. The RDQL API
is deprecated, starting with Jena 2.4. The new API
is similar in style to the old one for SELECT, with iteration over
the rows of the results (javadoc).
Differences include the widespread use of factories, naming consistent with the
SPARQL specifications, and different exec operations for the
different kinds of SPARQL query. QueryExecution objects should be
properly closed.

One change is that, to get the triples that matched a query, instead of asking the binding for the triples
used in the matching, the application should now make a
CONSTRUCT query.

Normally, the unit of matching in ARQ is the basic graph pattern (a sequence of triple patterns). These sets of triple patterns are dispatched to Jena for matching by Jena's graph-level query handler.
Each kind of storage provides the appropriate query handler. For example, the database fastpath
is a translation of a set of triple patterns into a single SQL query involving joins.

There is also a default implementation that works by using plain graph find (a triple with possible wildcards), so a new storage system does not need to provide its own query handler until it wants to exploit some feature of the storage.

If a property function is encountered, it is internally treated as a call to an extension. There is a registry
mapping property functions to implementing code.
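A sketch of a query using such a property function (ext:list here is hypothetical):

```sparql
SELECT ?member
{ ?x :p ?list .
  ?list ext:list ?member .
}
```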

where ext:list is a function that binds its arguments
(unlike a FILTER function). The property function form is legal SPARQL.

So, this mechanism shows that collection access can be done in SPARQL
without resorting to handling told blank nodes.

cwm (which is a forward chaining rules engine) and Euler
(which is a backward chaining rules engine) already provide this style of access; their property has the subject and object meanings the other way around.

11 February 2006

This approach means that the same source code is used for both the Java world
and the .Net world, making future improvements visible to both from a single source tree.

I tried doing it for .Net on Windows with
C# Express
and IKVM-0.24.0.1.

Summary

SPARQL queries work.

Using Jena from C# works for small scale cases - lots of checking to do but
it should be a matter of verifying everything from the dependent libraries works properly.

Some things aren't working, but there are a few hotspots of trouble that, when
fixed, mean that the majority (maybe all) of the Jena test suite will run. As
it is at the moment, quite a lot can be done, including using the ARQ command line programs.

The Conversion

The IKVM bytecode conversion route is my preferred choice because it means one source
codebase, not two. When I tried this before, I got an early version of ARQ up and running.
But it wasn't complete; the first big block was the lack of java.nio.charset
support in GNU Classpath. Jena and ARQ have lots of tests of internationalization and
charsets. That alone was enough to make it not worthwhile exploring further at the time.

The process is simple: run ikvmc on all the jars to get a library. Ignore all
the warnings about missing stuff. It's surprising what various libraries actually
reference - Log4j has references to a lot of log record transports. At the simplest:

ikvmc *.jar -out:XXX.dll -target:library

I've now broken this into two DLLs: jena-libs.dll (all the jars except the Jena ones)
and jena.dll (jena.jar, jenatest.jar, arq.jar, iri.jar),
but that is just because I keep rebuilding the DLLs while testing.

It takes a minute or so (less time than building jena.jar itself).
The result is two DLLs totaling about 16M; the whole assembly is about
23M including the three IKVM DLLs. Not small - but it works and it is simple to do.

file:///c:/absolute was incorrectly turned into a Windows filename. It worked OK with
Sun's Java but not IKVM. Fixed.

GNU Classpath bugs:

InputStreamReader(InputStream, Charset) is broken although the other two
constructors that allow the charset conversion to be explicitly
controlled do seem to work. This can be worked around in Jena.
Bugzilla Entry.

Zero-width lookbehind regexs aren't implemented. They are used by JJC's new IRI code.
Bugzilla Entry.