30 November 2006

I wrote about the SPARQL algebra last time, not with the idea of a regular update about DAWG work but noting a significant issue being considered by the working.
I hope that the wider community will comment and contribute; much of the ground work for the algebra is drawn from outside the working group anyway.

This week DAWG made a significant decision in a different, but nearby area, that of basic graph pattern matching
(BGP).

Basic Graph Patterns

Basic graph patterns (BGPs) are the building block in SPARQL. They are a
sequence of adjacent triple patterns in the query string. A BGP is matched
against whatever is being queried and the results of matching feed into the
algebra.

The decision is that no change, or rather the "implementation hint" for BGP matching, made more formal, becomes the basis
of SPARQL. No queries will be changed by the decision.

The Issue

Having reveal the punch line, what's the joke?
The issue is whether blank nodes in BGP matching behave like hidden, named variables or do they behave as a different kind of variable all together. This matters for counting - it does not matter for the logical meaning of a solution.

Example data:

:a :p 1 .
:a :p 2 .

Example query pattern:

{ ?x :p [] }

How many answers? Blank nodes are existential variables in RDF; named variables (the regular ?x ones) are universal variables. Queries don't return the binding of an existential; queries can return the binding of a named variable and the bound value of a named variables can be passed to other parts of the query (FILTERs, OPTIONALs
etc) via the algebra.

In the absence of a DISTINCT in the query, are the solutions to this pattern:

1 : ?x = :a

2 : ?x = :a , ?x = :a

Either 1 or 2

In OWL-DL, existential variables give more expressivity: if there is a disjunction in the ontology, the reasoner can know that something exists, but not need to find an actual value - and it may not be able to
find the value anyway - but the reasoner does know there is "something there".

Users and Implementers

For application writers, the main use of blank nodes in queries that I've seen is
the convenience of using []

{
?x :something [ :p ?v ; :q ?w ]
}

[] saves having to have a named variable and splitting the
expression. It would be one more thing to learn if that causes different answers
(duplicates) to the pattern using query variables:

{
?x :something ?z .
?z :p ?v ;
:q ?w .
}

especially if the first pattern had been written:

{
?x :something _:bnode .
_:bnode :p ?v ;
:q ?w .
}

which is the [] example written out with a labelled blank node.
This is also why it's BGPs that are the building block, not triple patterns -
blank nodes are scoped to BGPs.

For implementers, having two kinds of variable that behave different in matching can make life a harder, especially if they using some existing technology, like an SQL database
which has one preferred type (universal-like). Work on SPARQL semantics suggests that there can be one (complicated) SQL statement for one SPARQL query. Nice if it's true because all the work that has gone into SQL database optimizers is
reusable. Having two types of variables would need extra SQL to be generated like
nested SELECTs, or some post-processing done.

More work for implementers is not good; many open source RDF toolkits are built by people in their own time or by people with other things to do at work. So, it
might be a
noticeable dampener on toolkit development and hence deployment. Not a good idea to do this without a reasonable need.

DAWG Resolution

Logically, duplicates don't make any difference. DAWG decided there are two
solutions for simple entailment matching example above. It's what
implementations currently do.

The
DAWG process is often test case driven. We use test cases as both a concrete way to define what results we expect and also to make concrete decisions. So once consensus was emerging to have duplicates, the
working group decided to approve a test case that captured that. In this case, the test is
rdfsemantics-bnode-type-var (the example above is the same idea, just
shorter).

Defining BGP Matching for SPARQL

In Last Call 2,
the first CR
publication and the
intermediate publication
versions of SPARQL, BGP matching is a
complicated definition that allows for different entailment regimes for extensibility
within a single entailment framework. SPARQL is defined for simple entailment, and
so for anything where there a virtual graph, but it allows for extension to more complicated entailment regimes, like OWL-DL.

The trouble is the matching definition is that it has a bug in it; it allows too many solutions when the graph is redundant. The bug has to be fixed.

SPARQL is defined for simple
entailment. With this decision on how blank nodes behave, the working group can now rework
the BGP matching. The options are outlined in the message
2006OctDec/0140
and either of the variable mapping approaches can be made to work. It's more
akin to a formalised version of the
last call 1
style and the implementation hint of the complex version.

... and Extensibility

Extensibility comes by replacing the whole of BGP matching, with just a few
natural conditions such as all named variables must be bound in a BGP match and there are no accidental
clashes of blank nodes.

(Vested interest declaration)

OK - ARQ isn't built on SQL; it's a self contained query engine with various extension points - by default, it queries any
Jena graph and exploits Jena's direct connection to the storage layer but only for basic graph patterns, not more complex patterns.
It could sort out the two types of variables quite easily.

But we also have a Jena SPARQL database engine (called SDB) that is built on SQL and is specifically designed for SPARQL, for large datasets
and for named graphs.

SDB fits in as an ARQ query engine and the more it can turn into a single SQL
statement the better. All of the query being the ideal case (and hopefully the
normal one). So making the translation to a relational quad store simpler
is helpful for SDB.

23 November 2006

Basic graph patterns (BGPs) are a block of adjacent triple patterns
that match, or don't match, all together. In BGP matching every named
variable must have some value for the BGP to have been matched. Filters
add restrictions on the values variables can take.

They are also an extension point in SPARQL because you can plug in
a different BGP matcher (e.g. a DL reasoner or a wrapper for legacy
SQL data), then reuse the other two layers to form complex queries out
of the exact matches from BGP matching.

Graph expressions (OPTIONAL, UNION,
GRAPH, Groups (things between {}) combine BGPs in various
ways to give more complicated patterns. Graph expressions are recursive
so you can have patterns within patterns. A SPARQL query has one
graph expression in the WHERE clause.

Solution modifiers (Project, DISTINCT,
ORDER BY, LIMIT/OFFSET) take the output of
matching the query graph pattern and process it in various ways to
yield the result set.

Current Situation

The semantics for SPARQL have, so far, been declarative and top-down.
A solution to a query passes if it meets all the conditions following
from the definitions of each of the SPARQL graph pattern forms and filters.
That means, in theory, the whole of the solution is available.

In fact, ARQ builds solutions up by adding the
minimum necessary at each subpattern to make the solution match.
This is what gives ARQ
streaming evaluation which keeps the total memory use down. But the
effect is the same, whole solutions are available everywhere in the
pattern.

Another way is to have a bottom-up algebra. The
relational
algebra works this way. Subexpressions are evaluated
to give tables (relations) and then these tables combined together
to form new tables.

It turns out that these are nearly the same as shown by
Richard Cyganiak and
Jorge Perez et al
(see references below).
The class of graph patterns that differ are a certain kind of
nested
OPTIONAL where a variable is used on the outside,
fixed part, and in the inner OPTIONAL, but not in
between.

Now that is a rather unusual query.
The tricky bit is the use of ?v at the outermost and
innermost levels but not in between.The query can be rewritten so
as not to nest (note the repeat of :x3 :q ?w).

In DAWG, we're considering
an algebra for SPARQL, based on the paper
Semantics
of SPARQL. This will mean a change of results in the case
above. I've never seen such a query in the wild, only in artificial test cases.

The change from declarative semantics to a constructional algebra is
motivated primarily by the implementers in the working group wanting to
apply all the good stuff on relational algebra optimization to SPARQL
engines. The complexity of the exact mapping to SQL in the
semantics preserving SPARQL-to-SQL is daunting.

Not that simple!

But other things differ as well if a purely relational style is
applied (and that's not being proposed currently) because
"Semantics of SPARQL"
does not consider the difference in treatment of filters (it's focus is on the
combination of graph patterns).

Two cases arise: one where filters are in OPTIONALs, and one
where filters are nested inside groups.

SELECT *
{ ?x :p ?v .
OPTIONAL
{
?y :q ?w .
FILTER(?v = 2)
}
}

The FILTER uses a variable from the fixed part of
the OPTIONAL. Under Jorge's semantics the variable
isn't in-scope, so the filter evaluation is an error (unbound variable),
otherwise known as false; the optional part never matches.
This form of query is realistic and does occur for real
so the current proposal for DAWG makes this work. Like SQL
and LEFT OUTER JOIN with the ON clause,
the LeftJoin operation in the SPARQL algebra can take a condition
which applies over the whole LeftJoin.

The fixed
pattern could be repeated inside the optional part but such repetition
for something application might well use is bad, very bad.

Application writers will have to take care about scope and
groups: putting FILTERs inside {} changes
their scope.

Currently,

{ :x :p ?v . FILTER(?v < 5) }

and

{ :x :p ?v . { FILTER(?v < 5) } }

have the same effect. But they aren't in the algebraic form
because { FILTER(?v < 5) } evaluates on the empty
table that does not have a ?v in it; evaluation is an
error and hence false.

ARQ

There is an implementation of the algebra in
ARQ in CVS
(not in the current release ARQ 1.4). It's enabled in the command
line tool
arq.sparql
with --engine=ref from the command line and

QueryEngine2.register() ;

in code.

The work-in-progress
unfinished
editor's working text is available. Check it is actually up to
date because that's just work space and isn't a DAWG document. It does not
reflect working group decisions - it more a proposal for consideration.

Disclaimer: Any views expressed here are mime and not
the working group (which at
the time of writing is thinking about it but hasn't decided
anything yet).

12 November 2006

SPARQL is normally
thought of as only querying fixed RDF data. At the core of SPARQL are
the building blocks of basic graph patterns, and on top of these there is
an algebra to create more complex patterns (OPTIONAL UNION, FILTER, GRAPH).

The key question a basic graph pattern asks is "does this pattern match
the graph". The named variables record how the pattern matches.

Not all information needs to be in the raw data. ARQ
property functions are a way to let the application add some
relationships to be computed at query execution time.

LARQ
adds free text search. The real work is done by
Lucene.
LARQ adds ways to create a Lucene index from RDF data
and a property function to perform free text matching
in a SPARQL query.

This will be in ARQ 1.5 but is available from
ARQ CVS now. Hopefully, this will be useful to users and
application writers.
Comments and feedback on the design are welcome, especially
before the next ARQ release.