02 June 2010

Standardising RDF Syntaxes

One area of interest at the RDF Next Steps Workshop is
other RDF-related syntaxes, ones that are not RDF/XML. RDF/XML is the standard
syntax; N-Triples is defined as part of the RDF test suite but not formally as a syntax on the same level as RDF/XML; there is RDFa for embedding in XHTML.

RDF/XML is not easy to read as RDF. Turtle appeals because it more clearly shows the triple structure of the data. N-Quads is a proposal to extend the N-Triples file format to named graphs, and TriG is a Turtle-inspired named graph syntax. There is also TriX, but I've never come across it in the wild.

Using XML had several advantages, such as comprehensive character set support, neutrality of format
and reuse of parsers. However, RDF/XML is complicated in its entirety, even after using an XML parser, and it is quite expensive to parse, making parsing large (and not so large) files a significant cost. Because it can't, practically, be processed by XSLT, there are
nowadays few advantages.

All the non-XML formats, which are much easier to read and process, would be good to standardise but they are not without the need for sorting out some details.
Details matter when you're dealing with anything over a trivial amount of data
and when it's millions of triples, it's just a friction point to get the data
cleaned up if there is disagreement between information publisher and
information consumer.

Turtle

Turtle takes the approach of using UTF-8 as the character set, rather than relying on character set control like XML. Given that nowadays UTF-8 support is well understood and widely available, the internationalization issues of different scripts are best dealt with that way. Parsers are both simple to write and fast.
(The tricks needed to get Java to parse fast would be a subject for a separate discussion.)

As Turtle is the more mature of the possible syntaxes, it is also the best
worked out. One issue I see is the migration from a one-standard-syntax world to a two-standard-syntax world,
which is not without its practical problems. What if system A speaks RDF/XML only, and system B speaks only Turtle? How long will it take for use of content negotiation to catch up? Going from
V-nothing to V1 of a system (which is where we are now) is usually quicker than
going from V1 to V2 as the need to upgrade is much less. If it ain't broke why change?

Turtle can write graphs that RDF/XML can't encode. If the property can't be split into namespace and local name, then RDF/XML can't represent it. An XML qname must have a local part of at least one alphabetic character. This isn't common but these details arise and cause problems (that is, costs) when exchanging data at scale.
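
As a rough illustration of that check, here is a minimal Python sketch of how a writer might try to split a property IRI for RDF/XML. The function name and the ASCII-only pattern are my own simplifications; a real XML NCName admits many more Unicode characters, so treat this as an approximation and not the actual rule.

```python
import re

# ASCII-only approximation of an XML NCName: a letter or underscore,
# followed by letters, digits, '.', '-' or '_'. This is an assumption
# made for illustration; real XML names allow far more characters.
_NCNAME_AT_END = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")

def split_property_iri(iri):
    """Return (namespace, local_name), or None if no split is possible."""
    # search() scans left to right, so the first match that reaches the
    # end of the string is the longest usable local name.
    match = _NCNAME_AT_END.search(iri)
    if match is None or match.start() == 0:
        # No legal local name at the end, or nothing left for the namespace.
        return None
    return iri[:match.start()], match.group()

print(split_property_iri("http://example.org/ns#name"))
print(split_property_iri("http://example.org/id/1234"))
```

The second call returns None: the IRI ends in digits only, so there is no place to cut it, and a property like that could be written in Turtle but not in RDF/XML.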

What would be useful would be a set of language tokens to build all sorts of languages, like rule languages, but at the moment there are some unnecessary restrictions in Turtle on prefixed names (Turtle calls them qnames but they are not exactly XML qnames).

Turtle disallows:

employee:1234

because the local part starts with a digit. In data converted from existing (non-RDF) data this is a nuisance, and one that caused SPARQL to allow it, based on community feedback.

But there are other forms that can be useful that are not allowed (and aren't in SPARQL):

ex:xyz#abc

ex:xyz/abc

ex:xyz?parm=value

The last one might be a bit extreme but the first two are just using the prefix
to tidy up long IRIs. Partial alignment with XML qnames makes no sense in Turtle. Extending the range of characters to include /, # and maybe a few others makes prefixed names more useful. Issues just like this led to the CURIE syntax.
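
To make the three positions concrete, here is a small Python sketch that expands a prefixed name under three different rules for the local part. The patterns and the names STRICT_LOCAL, SPARQL_LOCAL, EXTENDED_LOCAL and expand are hypothetical and ASCII-only; they approximate the restrictions described above and are not the real Turtle or SPARQL grammars.

```python
import re

# Hypothetical, ASCII-only approximations of the local-part rules.
# An empty local part (as in "ex:") is accepted by all three.
STRICT_LOCAL = re.compile(r"(?:[A-Za-z_][A-Za-z0-9_-]*)?\Z")      # no leading digit
SPARQL_LOCAL = re.compile(r"(?:[A-Za-z0-9_][A-Za-z0-9_-]*)?\Z")   # leading digit allowed
EXTENDED_LOCAL = re.compile(r"(?:[A-Za-z0-9_][A-Za-z0-9_/#-]*)?\Z")  # also '/' and '#'

def expand(pname, prefixes, local_pattern):
    """Expand prefix:local to a full IRI, or return None if not allowed."""
    prefix, colon, local = pname.partition(":")
    if not colon or prefix not in prefixes:
        return None
    if not local_pattern.match(local):
        return None
    return prefixes[prefix] + local

prefixes = {
    "employee": "http://example.org/employee/",  # example namespaces only
    "ex": "http://example.org/",
}
for rule in (STRICT_LOCAL, SPARQL_LOCAL, EXTENDED_LOCAL):
    print(expand("employee:1234", prefixes, rule), expand("ex:xyz#abc", prefixes, rule))
```

Only the extended rule accepts both names, and it still accepts everything the stricter rules do, which is why old data stays valid under the extension.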

While these IRIs can be written in Turtle, they need the long form, with <...>, and the only way to abbreviate is via the base
IRI, but you can only have one base IRI. It's a workaround really that gets ugly
when the advantage of Turtle is that it is readable. Extending the range of
characters in the local part does not invalidate old data; it does create
friction in interoperability so we have one last chance to sort this out if
Turtle is to be standardised.

N-Quads

<s> <p> <o> .

<s> <p> <o> <g> .

What could be simpler? N-Quads is N-Triples with an optional 4th field to give the graph name (or context - it wasn't designed specifically for named graphs, but let's just consider
IRIs in the 4th field, not blank nodes or literals which the syntax allows).

But TriG puts the graph name before the triples, while N-Quads puts it after. Maybe N-Quads should be like TriG so that TriG can make N-Quads a subset. Parsing this modified N-Quads only takes buffering of the tokens on the line and counting to 3 or 4 to determine if it's a triple or a quad. Making TriG more flexible, at the cost of the slightly less intuitive graph name first, in what is basically a dump format, seems to me to be a good trade-off.
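
A minimal Python sketch of that buffer-and-count approach, for the graph-name-first variant suggested here (which is not N-Quads as currently proposed). The tokeniser is deliberately simplified: it handles IRIs, blank node labels and plain quoted literals only, with no language tags or datatypes, and the names TOKEN and parse_line are made up for the example.

```python
import re

# Simplified token pattern: <iri>, _:label, or a plain "quoted" literal.
TOKEN = re.compile(r'<[^>]*>|_:[A-Za-z0-9]+|"(?:[^"\\]|\\.)*"')

def parse_line(line):
    """Parse one line; return (graph, s, p, o), graph is None for a triple."""
    body = line.strip()
    if not body.endswith("."):
        raise ValueError("missing terminating '.'")
    # Buffer every token on the line, then decide by counting.
    tokens = TOKEN.findall(body[:-1])
    if len(tokens) == 3:
        return (None, *tokens)
    if len(tokens) == 4:
        return tuple(tokens)  # graph name comes first in this variant
    raise ValueError("expected 3 or 4 terms, got %d" % len(tokens))

print(parse_line("<s> <p> <o> ."))
print(parse_line("<g> <s> <p> <o> ."))
```

The parser never needs to look ahead beyond the current line, which is the point: the cost of putting the graph name first is a small token buffer.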

Blank node labels need to be clarified - is the scope the graph or the document? Both are workable. I'd choose scope-to-the-document, if only to avoid the confusion of two identical labels referring to different bnodes, and it's occasionally useful to say that a bnode
in one graph really is the same as another when using it as a transfer syntax
(for example, when one graph is a subgraph of another). TriG has the same issue but the use of nested forms for graphs makes graph scope more reasonable (except that graphs can be split over different {} blocks). Doing the same in N-Quads and TriG is important, and my preference is document-scoped labels.
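
The difference between the two scoping rules comes down to what a parser uses as the key when it maps labels to blank nodes. A small Python sketch, with a hypothetical BNodeTable class and made-up internal identifiers:

```python
import itertools

class BNodeTable:
    """Maps blank node labels to internal node ids (illustrative only)."""

    def __init__(self, scope="document"):
        if scope not in ("document", "graph"):
            raise ValueError("scope must be 'document' or 'graph'")
        self.scope = scope
        self._nodes = {}
        self._counter = itertools.count()

    def lookup(self, label, graph=None):
        # Document scope: the label alone identifies the bnode.
        # Graph scope: the same label in another graph is another bnode.
        key = label if self.scope == "document" else (graph, label)
        if key not in self._nodes:
            self._nodes[key] = "b%d" % next(self._counter)
        return self._nodes[key]

doc = BNodeTable("document")
print(doc.lookup("x", "<g1>") == doc.lookup("x", "<g2>"))

per_graph = BNodeTable("graph")
print(per_graph.lookup("x", "<g1>") == per_graph.lookup("x", "<g2>"))
```

With document scope, _:x in <g1> and _:x in <g2> are the same node, which is what allows a subgraph to share bnodes with its parent graph in a transfer file.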

TriG

TriG is a Turtle-like syntax for named graphs. It is useful for writing down RDF datasets.

It has some quirks though. Turtle is not a subset of TriG because the default graph needs to be wrapped in {} but the prefixes need to
be outside the {}. The default graph needs to be given in a single block, but named graphs can be fragmented (that was just an oversight in the spec). It would be helpful to allow the unnamed graph to be specified as Turtle, and similarly if an N-Quads file were legal TriG.

TriG allows the N3-ish form:

<g> = { ... } .

I've seen some confusion about this form in the data.gov.uk data. The additional "=" and ".", which are optional, cause confusion and at least one parser does not accept them as they weren't expected.

In N3, = is a synonym for owl:sameAs but, read as N3, the relationship isn't likely to be owl:sameAs; it's more likely to be log:semantics. Now I like the uniformity of the N3 data model, with graph literals (formulae), because of the simplicity and completeness it introduces, but it's not RDF; it's an extension and it breaks all RDF-only systems.

If <g> is the IRI of a graph document, it would be more like the N3:

<g> log:semantics { ... } .

or

<g> log:semantics ?v .
?v owl:sameAs { ... } .

Avoiding the variability of syntax, which brings no benefit, is better. Drop the
optional adornment.

Summary

None of these issues are roadblocks; they are just details that need sorting out to move from the current
de facto formats to specifications. When exchanging data between systems
that are not built together, details matter.