There have been rather strong claims made in recent emails about how
bnodes in query patterns should (must?) be handled, in some cases
accompanied by inappropriate condescending ad-hominem comments. I
will say no more about the latter but focus on the former, as many of
the technical claims being made do not seem to me to be justified on
either semantic or pragmatic grounds; and because this issue has
significant consequences for the design of SPARQL.
First, a brief review of the issue as I understand it.
(1) There is an obvious similarity, noted by several commentators,
between a query pattern containing a bnode and a similar pattern with
a variable in place of the bnode. Let me refer to these as the 'bnode
pattern' and the 'variable pattern'. For simple (RDF/S) entailment
there is very little to choose between them, since an instance of the
variable pattern is found in the reference graph just when the bnode
pattern is simply entailed by the (RDF/S closure of) the reference
graph: so one can, in these cases, think of a bnode in a pattern as
simply being a 'blank variable' to which a binding is not required to
be delivered in the answer (but is of course required to exist.) And
from there it is a short step to the idea of simply banning bnodes
from query patterns altogether, with a concomitant simplification to
the spec., particularly to the semantics. (<Aside> there are
pragmatic arguments in favor of allowing them, but these have been
disputed: but in any case, if bnodes and variables are essentially
equivalent, this debate can be conducted entirely on pragmatic
grounds, without getting muddled with semantic issues.</Aside>)
(2) There is an objection to this comfortable picture, which is that
for more expressive languages (OWL-DL), it is possible for a KB to
entail an existential statement without there being any term in the
KB for a variable to bind to; and in such a case, the natural way to
account for this situation would be for the variable-pattern query to
fail, but the corresponding bnode-pattern query to succeed. This
would therefore require that the two kinds of pattern be given
fundamentally different semantics: intuitively, one might say that a
bnode means 'some thing exists', but a query variable means 'some
named thing exists', where the presence of the name is what makes the
answer binding possible; and a bnodeID does not count as a 'name' in
the required sense.
(3) This idea is given terminological flesh by the distinction,
introduced in
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0192
between 'distinguished' and 'undistinguished' variables, and the
observation that SPARQL currently uses 'semi-distinguished' variables
(which can bind to both "names" and bnodeIDs). In
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0018
this distinction is justified on the grounds that bnodes are not
"names", so that it would not be appropriate to supply a bnodeID as
an answer binding to a query variable; or at any rate, as is
well-known to the cognoscenti, this is simply not the way that things
are done:
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0031.
(4) None of the reasons cited in the previous paragraph seem to me to
be convincing objections to the obvious reply to the objection in
(2), which is that bnodeIDs *can* be given as bindings to query
variables; and since this is so, any engine which can determine the
entailment of a pattern with a bnode, could deliver such a binding as
an answer to the similar query pattern with a variable in place of
the bnode. That is, if ASK PATTERN[_:x] (with _:x scoped to the
pattern) is entailed, then SELECT ?x PATTERN[?x] should succeed,
possibly with ?x bound to some suitable bnodeID generated by the
querying engine. Adopting this as a design principle for querying
languages in the RDF family would, I suggest, bring some overall
regularity and simplicity to the spectrum of SPARQL versions (with
different values for the entailment and scoping set parameters.) The
fact that bnodes exist in the language means that purely existential
answers can indeed be given by the same variable-binding mechanism as
is used to deliver name bindings. In other words, there should be no
difference between 'a thing exists such that...' and 'a named thing
exists such that...', since the "name" can itself be a bnodeID, if
necessary one generated for the sole purpose of being a binding for
the query variable. Let me call this the 'existential name
principle', as it amounts to the claim that the RDF language family
makes no distinction between existence and named existence; since if
something exists, the presence of bnodes means that we can, in
effect, validly generate a name for it. In the rest of this message I
will defend this idea against the objections raised in the above
messages, and try to distinguish this issue from others with which it
might be confused.
(5) One objection to this idea, cited earlier, is that it embodies a
logical confusion: bnodes are existentially bound *variables*, and
are therefore not logical names. Now, while it is of course true that
bnodes and URIrefs/literals (which I will collectively call 'names'
here) have subtly different truth conditions, they are not so
different as this contrast suggests. First, calling one a 'name' and
the other a 'variable' is merely a matter of surface terminology.
Many logical syntaxes make a lexical distinction but this is not
necessary, and many others, such as ISO Common Logic, do not: a
'variable' then is simply a name that is bound by a quantifier.
Second, more to the point, they have essentially the same semantics:
they both in effect assert that something exists, and use a lexical
token to refer to it. (It is worth noting that there are logical
syntactic conventions which do not use lexical tokens *of any kind*
to show existence, such as Peirce's existential graphs.) The only
difference is that the scope of a name token might extend beyond the
local syntactic context, allowing other assertions to be made about
the thing denoted by the name; whereas an existential 'variable' is a
lexical token only within the context defined by the scope of the
quantifier. In natural deduction logical systems, logical names and
existentials are almost interconvertible, using the rules EI and EG:
PHI[name]
----------------EG
(exists (x)PHI[x])
(exists (x)PHI[x])
--------------------EI
PHI[name*]
when name* does not occur free in the derivation, ie is a 'new' name.
The need to invoke a 'new' name in EI illustrates the point about
scoping: the 'newness' is a guarantee that no larger scope can
'capture' this name. But once this matter is dealt with, there really
is no logical difference at all between a name and an existentially
bound 'variable'. (Another way to illustrate this is to consider
alternative textbook variations of logic which dispense with names
altogether, and only have quantified variables. Logical 'names' then
are simply existential variables which are bound at the top lexical
level; a typical database is considered to be a huge existential
sentence.)
Bnodes are in fact even more similar to RDF names, i.e. to
URIreferences, than is usual in a logical notation; since bnodes,
like URIrefs, have global scope. This might seem an odd claim, but it
is both true and important. The basic RDF graph syntax model, on
which the RDF semantics is defined, has *no notion at all* of
syntactic scope. The existential interpretation of bnodes is given
relative to an RDF graph, but an RDF graph is simply any set of RDF
triples. The set of all RDF triples on the Web is an RDF graph, as is
the set of all syntactically possible RDF triples, as is any
singleton set containing a single RDF triple, or anything in between.
There in nothing in the abstract graph syntax which allows one to
indicate the edge or extent of an RDF graph: no bnode binders or
quantifiers or scoping syntax devices. All scoping of bnode IDs - not
bnodes themselves, but their identifiers in a concrete interchange
syntax such as SPARQL or RDF/XML - is presumed to be done by external
conventions arising from the lexical/syntactic/scoping rules of
whatever surface syntax form is being used to convey the abstract
syntax, and it is these external conventions which determine which
sets of triples are being considered as 'graphs' in any particular
RDF application. This was a deliberate design decision, but perhaps
one that was not adequately explained. What it amounts to is a
decision to trade issues of lexical syntax (name scope, name binders,
the free/bound distinction, etc.) for a purely mathematical 'abstract
syntax' which treats all items uniformly. Thus, the only meaningful
question about bnodes in RDF graphs is whether they are the same or
not. If two graphs contain the same bnodes, then the implied 'scope'
of that bnode is the union of the graphs. Thus, questions of lexical
or syntactic scope which usually arise when talking about existential
'variables' in a conventional syntax are replaced, in the RDF graph
syntax, by simple questions of identity between bnodes, allowing the
entire graph syntax to be described using elementary set theory. The
resulting picture of the RDF graph syntax has both URIrefs and bnodes
being treated almost identically: both have global scope, both are
referring terms, and their role in the semantics is almost identical;
but bnodesIDs are usually treated as having document scope, while
URIrefs are lexical items with global scope in all Web applications.
(6) Another objection seems to be that to allow variables to bind to
bnodes somehow introduces great additional complexity to the
query-answering process. It seems to me that this must be wrong,
because one can define the variable-binding option in terms of the
other. Suppose that a query engine is capable of generating answers
to ASK queries containing bnodes, then it can deliver appropriate
answers to SELECT queries by the following pseudocode, where
(gensymBnode) generates a new bnodeID and P[a/b] means the pattern P
with a replaced by b:
if ( ASK P[(gensymBnode)/?x] ) then return (bind ?x (gensymBnode))
Of course, this needs to be written more carefully to account for
multiple variables, but you get the idea. And as written this does
require doing an ASK for each pattern of variables in the query; but
this is at worst a small multiple linear extra cost, and need only be
done in cases where a real SELECT query on the variable has failed;
and I am sure a more efficient method can be defined in practice,
since the derivation of the inferences supporting the existential
query will in any case involve generating suitable 'names'. (I can
kind of see that in the presence of elaborate equality reasoning,
problems might arise in determining whether or not multiple bindings
are really the same. But even here, the possibility of providing a
mere existential name - a bnode - does seem to provide an opportunity
for computational savings with a possible loss of completeness but
without sacrificing correctness.)
(7) Finally, there seems to be a misapprehension that because any
graph entails infinitely many other bnode-rich graphs which have it
as an instance, that to allow bnodes as bindings will somehow open an
infinite floodgate of redundant answers. And this would be true, if
the spec were written in a very naive way so that entailment alone
were a sufficient criterion for correctness of an answer binding. But
this, of course, is why the SPARQL specs are written in the way that
they are, to restrict answer bindings to a particular scoping set of
possible bindings - in the case of normative SPARQL, the URIs and
Bnodes which actually occur in the scoping graph. The separation
between the scoping graph and the scoping set was written into the
spec to allow for alternative strategies for handling OWL. Bijan
notes that the spec as written therefore allows 'OWL-DL-SPARQL' -
which, to emphasize, is not itself defined by the spec - to restrict
itself to bindiings to an arbitrarily small set of legal answer
bindings, and therefore to restrict itself so as to disallow bindings
to bnodes. This is technically correct, but wholly against the design
intention of those definitions. It was never the intent of that
construction to restrict the set of legal answer bindings: it was,
rather, to allow the scoping set to contain *extra* bnodeIDs,
precisely in order to overcome the objection raised in the second
paragraph above, i.e. to allow the query engine to invent new bnodes
when they are required in order to provide an answer binding to a
query which has a purely existential answer, for which no bnode which
actually occurs in the scoping graph is an appropriate answer binding.
(8) Although the restricted interpretation (where a scoping set for
OWL-DL queries must contain no bnodes) does not violate any actual
technical requirement of the current SPARQL draft, I would argue
strongly against adopting it as a standard, to the point that I would
argue for pre-emptively rendering it illegal by suitable wording to
the current spec. With the restricted interpretation, for example, we
would have the absurd situation that answer bindings could be derived
from a graph using RDF entailment which are not legal using OWL-DL
entailment, even when the graph itself is legal OWL-DL/RDF.
(9) A final quick word on the claim of prior art implicit in the
wording of some of the earlier messages. The entire subject of RDF
querying is too new for there to be a substantial body of accepted
techniques, accepted ideas, etc.., ; so any appeal, explicit or
implicit, to the 'way things are done', or to a body of expert
knowledge not vouchsafed to us all but weighty in its implications,
should be treated as a symptom of blowhardiness; or at the very least
in order to be taken seriously needs to be backed up by actual
citations and/or arguments.
Comments, ripostes, objections, etc. welcome.
Pat Hayes
--
---------------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973 home
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 cell
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes