12 February 2007

Jena SDB

SDB was released for the first time last week. While this is the first alpha
release, it's actually at least the second internal query architecture and the
second loader architecture (Damian did the work to get the loader working fast).

SPARQL now has a formal algebra. ARQ is used to turn the SPARQL query syntax
into an algebra expression; SDB takes over and compiles it first to a relational
algebra(ish) structure, then generates SQL syntax. Now that there is a SPARQL
algebra, this is all quite a bit simpler for SDB, which is why this is the second
design for query generation; much of the work now comes naturally from ARQ.
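
To make the last step concrete, here is a minimal sketch of turning a basic graph pattern into SQL. It is not SDB's code: the function name `bgp_to_sql`, the single `triples(s, p, o)` table with terms stored as plain strings, and the sample data are all assumptions made for illustration. Each triple pattern becomes one alias of the triples table, constants become equality tests, and a variable seen a second time becomes a join condition.

```python
import sqlite3

def bgp_to_sql(patterns):
    """Compile a list of (s, p, o) triple patterns into one SQL SELECT.

    Terms starting with '?' are variables; anything else is a constant.
    Hypothetical helper for illustration, not part of SDB or ARQ.
    Returns (sql, params, variable_names).
    """
    var_cols = {}    # variable -> first column that binds it
    conditions = []  # WHERE clauses, in the order they are generated
    params = []      # constants, in the same order as their '?' placeholders
    for i, pattern in enumerate(patterns):
        alias = f"t{i}"
        for col, term in zip(("s", "p", "o"), pattern):
            ref = f"{alias}.{col}"
            if term.startswith("?"):
                if term in var_cols:
                    # Repeated variable: equate with its first occurrence.
                    conditions.append(f"{var_cols[term]} = {ref}")
                else:
                    var_cols[term] = ref
            else:
                conditions.append(f"{ref} = ?")
                params.append(term)
    select = ", ".join(f"{c} AS {v[1:]}" for v, c in var_cols.items()) or "1"
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT {select} FROM {tables}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params, [v[1:] for v in var_cols]

# Toy data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    (":alice", ":knows", ":bob"),
    (":bob", ":name", '"Bob"'),
    (":alice", ":name", '"Alice"'),
])

# { ?x :knows ?y . ?y :name ?n }
sql, params, names = bgp_to_sql([("?x", ":knows", "?y"), ("?y", ":name", "?n")])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The sketch only covers basic graph patterns; join and leftjoin between patterns need the extra care described under Patterns.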

Patterns

At the moment, SDB only translates basic graph patterns, join and leftjoin
expressions to SQL and leaves anything else to ARQ to stitch back together. For
the current patterns, there are two tricky cases to consider:

For the first, sometimes the join needs to involve more than equality
relationships, like "if col1 is null or col1 = col2", which takes a bit
of scope tracking. For the second, if a variable can be bound in two or more
OPTIONALs, you have to take the first binding. The scope tracking is needed
anyway.
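
Both cases can be seen in a small example. The sketch below is hand-written SQL, not SDB's generated output, and the table, the prefixed names and the data are made up. It answers a query of the shape `{ ?x :type :Person OPTIONAL { ?x :nick ?label } OPTIONAL { ?x :name ?label } }`: the second LEFT JOIN carries the "is null or equal" condition, and COALESCE picks the first binding of `?label`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    (":alice", ":type", ":Person"),
    (":alice", ":nick", "Al"),
    (":alice", ":name", "Alice"),
    (":bob", ":type", ":Person"),
    (":bob", ":name", "Bob"),
    (":carol", ":type", ":Person"),
])

sql = """
SELECT t0.s AS x,
       COALESCE(t1.o, t2.o) AS label      -- take the first binding of ?label
FROM triples AS t0
LEFT JOIN triples AS t1
       ON t1.s = t0.s AND t1.p = ':nick'
LEFT JOIN triples AS t2
       ON t2.s = t0.s AND t2.p = ':name'
      -- ?label may already be bound by the first OPTIONAL:
      -- only join if it is unbound or the values agree.
      AND (t1.o IS NULL OR t1.o = t2.o)
WHERE t0.p = ':type' AND t0.o = ':Person'
ORDER BY x
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

Here alice keeps the label from the first OPTIONAL, bob gets his from the second, and carol comes back with the label unbound (NULL).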

Over time, more and more of the SPARQL algebra expression will be translated
to SQL.

Layouts

SDB supports a number of database layouts in two main
classes:

Type 1 layouts have a single triple table, with all information encoded
into the columns of the table.

Type 2 layouts use a triple table with a separate RDF terms table.

Type 1 layouts are good at supporting fine-grained API calls, because the RDF
terms are encoded into the triple table's columns and no join is needed to get
them. Jena's existing database layer, RDB, is an example of this. When the move
was made from Jena1 to Jena2, the DB layout changed to a type 1 layout and it
went faster. The two type 1 layouts supported are Jena's existing RDB layout
and a debug version which encodes RDF terms in SPARQL syntax directly into the
triples table, so you can simply read the table with an SQL query browser.

Type 2 layouts, where the triples table has pointers to a nodes table, are
better as SPARQL queries get larger. Databases prefer small, fixed-width columns
to variable-length string comparisons. SDB has two variations, one using 4-byte
integer indexes and one using 8-byte hashes. The hash form means that the hashes
of query constants can be calculated and don't have to be looked up in the SQL.
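
A minimal sketch of the hash variation follows. The schema, the helper names `node_hash` and `add_triple`, and the choice of the first 8 bytes of SHA-256 as the hash are all assumptions for illustration; the hash function SDB actually uses is not described here, and collisions are ignored. The point is that the query constants are hashed on the client, so the WHERE clause compares fixed-width integers and the nodes table is only joined to turn results back into RDF terms.

```python
import hashlib
import sqlite3

def node_hash(term: str) -> int:
    """8-byte hash of an RDF term as a signed 64-bit integer (assumed scheme)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes   (hash INTEGER PRIMARY KEY, lex TEXT NOT NULL);
CREATE TABLE triples (s INTEGER NOT NULL, p INTEGER NOT NULL, o INTEGER NOT NULL);
""")

def add_triple(s: str, p: str, o: str) -> None:
    """Store the terms once in nodes, and the triple as three hashes."""
    for term in (s, p, o):
        conn.execute("INSERT OR IGNORE INTO nodes VALUES (?, ?)",
                     (node_hash(term), term))
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)",
                 tuple(node_hash(t) for t in (s, p, o)))

add_triple(":alice", ":knows", ":bob")
add_triple(":alice", ":knows", ":carol")
add_triple(":bob", ":knows", ":carol")

# { :alice :knows ?o } : constants are hashed here, not looked up in SQL.
rows = conn.execute(
    """SELECT n.lex
       FROM triples AS t JOIN nodes AS n ON n.hash = t.o
       WHERE t.s = ? AND t.p = ?
       ORDER BY n.lex""",
    (node_hash(":alice"), node_hash(":knows")),
).fetchall()
print(rows)
```

With the 4-byte index variation, the constants would first have to be resolved to their ids through the nodes table, either in a separate lookup or as extra joins in the query.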

It seemed that the hash form would be better all round, but it isn't: loading
was 2x slower (sometimes worse), despite the fact that RDF terms don't have to
be inserted into the nodes table first to get their auto-allocated sequence id.
The databases we have tried are significantly slower at indexing 8-byte
quantities than 4-byte quantities, and this dominates the load performance.

Next

There are three directions to go:

1. Inference support

2. Application-specific layout control

3. Filters support

(1) and (2) are linked by the fact that both involve looking at a query and
deciding, for certain known predicates and part-patterns, that different
database tables should be used instead. See Kevin's work on property tables,
which uses this approach to put some higher-level understanding of the data
back into the RDF store. (3) is "just" a matter of doing it.