09 December 2010

There are "Lies, damned lies, and statistics" but worse are probably performance measurements done by someone else.
The real test is what the figures mean for any given application and whether performance is "fit for purpose". Database-related performance measurements are particularly murky. The shape of the data matters and the usage made of the data matters, all in ways that can wildly affect whether a system is fit for purpose.

Treat these figures with care - they are given to compare the current TDB bulk loader (versions up to 0.8.7) with the new one (version 0.8.8 and later). Even then, the new bulk loader is new, so it is subject to tweaking and tuning, but hopefully only to improve performance, not worsen it.

The new bulk loader is faster by 2x or more, depending on the characteristics of the data.
As loads can take hours, this saving is very useful.
It produces smaller databases, and the databases are as good as or better in query performance
than the ones produced by the current bulk loader.

Numbers are "query mixes per hour"; larger numbers are better. The BSBM performance engine was run with 100 warmups and 100 timing runs over local databases.

Loader used |      50k |     250k |      1m |      5m |    25m |  100m |  200m |  350m
Loader 1    | 102389.1 |  87527.4 | 58441.6 |  5854.7 | 1798.4 | 673.0 | 410.7 | 250.0
Loader 2    | 106920.1 |  86726.1 | 62240.7 | 11384.5 | 3477.9 | 797.1 | 425.8 | 259.2

What this does show is that for a range of database sizes around 5m to 25m triples,
the databases produced by loader2 are faster.
This happens because the working set of the databases produced by loader1 did not fit mostly in memory, but the working set of those produced by loader2 does.

LUBM isn't a very representative benchmark for RDF and linked data applications - it is designed more
for testing inference. But there are some details of various systems published using this benchmark.
To check the new loader on this data, I ran loads for a couple of the larger generated datasets. These are
the 1000 and 5000 datasets, with inference applied during data creation. The 5000 dataset, just under
a billion triples, was only run through the new loader.

This article could be subtitled "Good I/O and Bad I/O".
By arranging to use good I/O, the new TDB loader achieves faster
loading rates despite writing more data to disk. "Good I/O"
is file operations that occur in a buffered and streaming fashion.
"Bad I/O" is file operations that cause the disk heads to jump
about randomly or that work in small units of disk transfer.

The new TDB
loader "loader2" is a standalone program that bulk loads
data into a new TDB database. It does not support incremental loading,
and may destroy existing data. It has only been tested on Linux; it
should run on Windows with Cygwin but what the performance will
be is hard to tell.

Figures demonstrating the loader in action for various large datasets
are in a separate blog entry. It is faster than the current loader
for datasets over about 1 million triples and comes into its own
above 100 million triples.

Like the current bulk loader ("loader1"), loader2 can load triple and quad RDF formats,
and from gzipped files. It runs fastest from N-Triples or N-Quads
because the parser for these formats is fastest and has low overhead.

The loader is a shell script that coordinates the various phases.
It's available in the TDB development code repository in
bin/tdbloader2 and in the current 0.8.8 snapshot build.

Loader2 is based on the observation that the speed of loader1 can
drop sharply as the memory mapped files fill up RAM (the "can" is
because it does not always happen, which is slightly weird). This fall-off
is more than one would expect simply from having to use some disk, and
sometimes the rate of loader1 becomes erratic. This could be
due to the OS and its management of memory mapped files,
but the effect is that secondary index creation can
become rather slow. Loader1 tends to do "bad I/O" - as the
caches fill up, blocks are written back in what looks to the disk
like a random order, causing the disk heads to jump around.

Copying from the primary index to a secondary index involves a sort
because TDB uses
B+Trees
for its triple and quad indexing. A B+Tree keeps its
records in sorted order, and each index is in a different order.

Loader1 is much faster than simply loading all indexes at once because
in that case too much RAM is spent caching parts of all the
indexes at the same time. Better is to do one index at a time, using the RAM for caching one
index at a time.

Loader2 similarly has a data loading phase and an index creation phase.

The first phase is to build the node table and write out the data for index building.
Loader2 takes the stream of triples and quads
from the parser and writes out the RDF terms (IRI, literal, blank node)
into the internal node table.
It also writes out text files of tuples of NodeIds (the internal 64 bit
number used to identify each RDF term). This is "good I/O" - the writes
of the tuple files are buffered up and the files are written append-only.
This phase is a Java program, which exits after the node table and working files have been written.

The next phase is to produce the indexes, including the primary index. Unlike
loader1, loader2 does not write the primary index during node loading.
Experimentation showed it was quicker to do this separately, despite needing more I/O.
This is slightly strange.

To build the indexes, loader2 uses the
B+Tree rebuilder,
which requires the data in index-sorted order. Index rebuilding is a sort followed
by B+Tree building. The sort is done by Unix sort.
Unix sort is very easy to use and it scales smoothly from a few lines to gigabytes of data.
Having written the tuple data out as text files in the first phase (and fixed-width hex numbers at that - quite wasteful),
Unix sort can do a plain text sort on the files. Despite that meaning lots of I/O, it's good I/O,
and the sort program really knows how best to manage temporary files.

For each index, a Unix sort is done to get a temporary file of tuple data in the right sort order.
The B+Tree rebuilder is called with this file as the stream of sorted data it needs to
create an index.

There are still opportunities to tune the new loader, and to see whether piping the output of the sorts
directly into the rebuilder is better or worse than the current two-step approach
using temporary files. Using different disks for different temporary files
should also help.

The index building phase is parallelisable. Because I/O and memory usage are the bottlenecks, not CPU cycles,
the crossover point for this to become effective might be quite high.

To find out whether loader2 is better than loader1, I've run a number of tests:
load and query tests with the Berlin SPARQL Benchmark (2009 version), a load test on the RDF version of
COINS (UK Treasury Combined Online Information System - about 420 million quads, and it's real data) and a load test using the Lehigh University Benchmark with some inferencing. Details, figures and tables are in the next article.

03 December 2010

The indexes hold 3 or 4 NodeIds, where a NodeId is a fixed length 64 bit unique
number for each RDF term in the database. Numbers, dates and times are encoded directly
into the 64 bits where possible; otherwise the NodeId refers to the location of the term in a separate NodeId-to-RDF-term table, like all other types, including IRIs.
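The inlining idea can be sketched as follows. The tag bit and integer layout here are hypothetical, chosen for the example; TDB's actual bit layout differs.

```java
// Illustrative sketch of inlining a value directly into a 64 bit NodeId.
// The tag bit and layout are invented for the example, not TDB's encoding.
public class NodeIdSketch {
    static final long INLINE_FLAG = 1L << 63;   // top bit marks an inlined value

    // Inline a small non-negative integer into the NodeId if it fits in 62 bits;
    // anything larger would instead be a reference into the node table.
    static long encodeInt(long value) {
        if (value >= 0 && value < (1L << 62))
            return INLINE_FLAG | value;
        throw new IllegalArgumentException("too large to inline; store in the node table");
    }

    static boolean isInline(long nodeId) { return (nodeId & INLINE_FLAG) != 0; }

    static long decodeInt(long nodeId)   { return nodeId & ~INLINE_FLAG; }

    public static void main(String[] args) {
        long id = encodeInt(2010);
        System.out.println(isInline(id));   // true: no node table lookup needed
        System.out.println(decodeInt(id));  // 2010
    }
}
```

The benefit is that a filter or comparison on an inlined number never touches the node table at all.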

The B+Trees have a number of leaf blocks, each of which holds only records (key-value pairs, except there's no "value" part in a triple index - just the key of S, P and O in various orders).
TDB threads these blocks together so that a scan does not need to touch the
rest of the tree - scans happen when you look up, say, S?? for a known subject and unknown property and object.
The scan returns all the triples with a particular S. Counting all the triples only touches the leaves of the B+Tree, not the branches.

B+Trees provide performant indexing over a wide range of memory situations,
ranging from very little caching of disk structures in memory, through to
being able to cache substantial portions of the tree.

The TDB B+Trees have a number of
block storage layers:
an in-JVM block cache for use on 32 bit JVMs, memory mapped files
for 64 bit JVMs, and an in-memory RAM disk for testing. The in-memory RAM disk is
not efficient but it is a very good simulation of a disk - it
really does copy the blocks used by a client when written to another area
so there is no possibility of updating blocks through references held by the
client after the block has been written to "disk".

However, one disadvantage can be that the index isn't very well packed. The B+Trees
guarantee that each block is at least 50% full. In practice, the blocks are 60-70% full for the indexes POS and OSP.
But a worse case arises when inserting into the SPO index, because data typically arrives with all the triples for one subject, then all the triples for another subject, meaning the data is nearly sorted. While this makes the processing faster, it makes the resulting B+Tree only about 50-60% packed.

Packing density matters because it influences how much of the tree is cached in a fixed amount of computer memory. If it's 50% packed, then it's only 50% efficient in the cache.

There are various ways to improve on this (compress blocks,
B#Trees,
and many more besides - B-tree variations are very extensively studied data-structures).

I have been working on a B+Tree repacking programme that takes an existing B+Tree and produces a maximally packed B+Tree. The database is then smaller on disk and the in-memory caches are used more efficiently. The trees produced are legal B+Trees, and have a packing density of close to 100%. Rebuilding indexes is fast and scales linearly.

The Algorithm

Normally, B+Trees grow at the root. A B+Tree is the same depth everywhere in the tree, and the tree
only gets deeper if the root node of the tree is split and a new root is created pointing down to the two blocks formed by splitting the old root. This algorithm, building a
tree from a stream of records, grows the tree from the leaves towards the root.
While the algorithm is running there isn't a legal tree - only when the algorithm
finishes does a legal B+Tree emerge.

All the data of a B+Tree resides in the leaves - the branches above tell you
which leaf block to look in (this is the key difference between B-Trees and B+Trees).
The first stage of repacking takes a stream of records (key and value) from the initial tree.
This stream will be in sorted order because it's being read out of a B+Tree;
for a TDB B+Tree, it's a scan tracing the threading of the leaf blocks.
In other words, it's not memory intensive.

In the first stage, new leaf blocks are produced one at a time. A block is filled
completely, a new block is allocated, the threading pointers are completed and the full
block is written out. In addition, the block number and the highest key in the block are emitted.
The leaf block is not touched again.

The exception is the last two blocks of the leaf layer. A B+Tree must have blocks
at least 50% full to be a legal tree. Although the TDB B+Tree code can cope with blocks
that are smaller than the B+Tree guarantee, it's neater to rebalance the last two blocks in case the last block is below the minimum size. Because the second-to-last block is
completely full, it's always possible to rebalance within just two blocks.
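The leaf-packing step, including the rebalance of the last two blocks, can be sketched as follows, using integer records and a toy block size. The structures are simplified and the names invented for illustration; real blocks hold fixed-size records plus the threading pointers.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the leaf-packing phase: fill each leaf completely from the
// sorted record stream, then rebalance the final two leaves so the last
// one meets the B+Tree minimum-fill invariant.
public class LeafPackSketch {
    static List<List<Integer>> packLeaves(List<Integer> sortedRecords, int blockSize) {
        List<List<Integer>> leaves = new ArrayList<>();
        for (int i = 0; i < sortedRecords.size(); i += blockSize)
            leaves.add(new ArrayList<>(
                sortedRecords.subList(i, Math.min(i + blockSize, sortedRecords.size()))));
        int n = leaves.size();
        // If the last leaf is under half full, rebalance with its (full) neighbour.
        if (n >= 2 && leaves.get(n - 1).size() < (blockSize + 1) / 2) {
            List<Integer> merged = new ArrayList<>(leaves.get(n - 2));
            merged.addAll(leaves.get(n - 1));
            int split = merged.size() / 2;
            leaves.set(n - 2, new ArrayList<>(merged.subList(0, split)));
            leaves.set(n - 1, new ArrayList<>(merged.subList(split, merged.size())));
        }
        return leaves;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 41; i++) records.add(i);
        // 41 records, block size 10: blocks of 10,10,10,10,1 would leave the
        // last block illegal, so the final two are rebalanced to 5 and 6.
        for (List<Integer> leaf : packLeaves(records, 10))
            System.out.println(leaf.size());
    }
}
```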

Phase two takes as input the stream of block numbers and highest keys from the level below
and builds branch nodes for the B+Tree pointing, by block number, to the blocks produced
in the phase before. When a block is finished, the block can be written out
and a block number and split key emitted. This split key isn't the highest key
stored in the block - it's the highest key of the entire sub-tree at that point,
and it is the key passed upwards. A B+Tree branch node has N block pointers and N-1
keys; the split key is the key left over from filling the block, the
Nth key from below.
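This branch-building pass can be sketched as follows, again with simplified structures and names invented for the example. Note how one key per full block is left over and passed upwards as the split key rather than being stored in the block.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the branch-building pass: consume (blockNumber, highestKey)
// pairs from the level below, pack N pointers and N-1 keys per branch
// block. Real blocks also carry counts and other bookkeeping.
public class BranchBuildSketch {
    static class Branch {
        List<Integer> pointers = new ArrayList<>(); // up to N child block numbers
        List<Long> keys = new ArrayList<>();        // N-1 separating keys
    }

    // childKeys[i] is the highest key under child block childBlocks[i].
    static List<Branch> buildLevel(int[] childBlocks, long[] childKeys, int fanout) {
        List<Branch> level = new ArrayList<>();
        Branch current = new Branch();
        for (int i = 0; i < childBlocks.length; i++) {
            if (current.pointers.size() == fanout) {
                // Block is full: childKeys[i-1] becomes the split key passed
                // upwards for this block; it is not stored at this level.
                level.add(current);
                current = new Branch();
            }
            if (!current.pointers.isEmpty())
                current.keys.add(childKeys[i - 1]); // separates previous child from this one
            current.pointers.add(childBlocks[i]);
        }
        level.add(current);
        return level;
    }

    public static void main(String[] args) {
        int[] blocks = {10, 11, 12, 13, 14};
        long[] keys  = {100, 200, 300, 400, 500};
        // With fanout 3: first branch holds pointers [10,11,12], keys [100,200];
        // key 300 is passed upwards as the split key; second branch holds [13,14].
        for (Branch b : buildLevel(blocks, keys, 3))
            System.out.println(b.pointers + " keys=" + b.keys);
    }
}
```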

Once again, the last two blocks are rebalanced to maintain the B+Tree invariant of all blocks
being at least half full. For large trees there are quite a few blocks, so rebalancing
just two of them is insignificant. For small trees, it's not really worth repacking the tree -
block caching at runtime hides any advantage there might be.

The second phase is repeatedly applied to the block number and split key stream from the layer below
until a layer of the tree is only one block (it can't be zero blocks). This single block
is the new root block. The third phase writes the B+Tree details to disk
and puts the root block somewhere it can be found when the B+Tree is reopened.

Consequences

The repacking algorithm produces B+Trees that are approaching half the size of the
original trees. For a large dataset, that's several gigabytes.

The repacked trees perform much the same as trees formed by normal use, except
in one case where they are faster. If the tree is small, so the majority fits in the RAM caches,
then repacking means less RAM is used but the speed is much the same (in fact a few percent slower -
hard to measure, but less than 5% - presumably because there is a different ratio of tree descent to in-block binary search being done by the CPU.
This may be no more than a RAM cache hierarchy effect).

However, if the tree was large and, once repacked, now fits mostly in memory, the repacked trees are faster.
As the indexes for an RDF dataset grow much larger than the cacheable space, this effect
slowly declines. Some figures to show this are in preparation.

The biggest benefit, however, is not directly the speed of access or the reduced disk space.
It's that there is now a fast, linearly scaling way to build a B+Tree from
a stream of sorted records. It's much faster than simply using
regular insertion into the B+Tree.

This is part of the new bulk loader for TDB. It uses external sorting to
produce the input to index creation using this B+Tree repacking algorithm.
The new bulk loader can save hours on large data loads.

28 August 2010

SPARQL 1.1 Update is work-in-progress by
the SPARQL Working Group, but the general
design and language is reasonably stable. There is also the W3C submission
SPARQL Update from July 2008.
The languages are similar in style, but the details of the grammars differ.
So how to migrate from the syntax used in the submission to the upcoming SPARQL recommendation
for a SPARQL Update language?

One way is to provide both languages behind a common API, with the application indicating which language
to use. This maximises compatibility because, if the submission is the chosen language, the parser for
the submission language will be used. But the application has to be changed to move between the languages, and each update script has to be converted, so it's probably a "big bang" changeover. The two languages are very close - is it possible to have a single language that covers both?
Then the application can mix usages, and when an update request is printed it can be printed in the soon-to-be standard language, helping people see how the language has changed.

It turns out that most, but not all, of the submission language can be incorporated into
the grammar for the emerging standard. The cases not covered don't seem to be ones likely
to be widely used, although it would be good to know if they are.

CREATE, CLEAR, LOAD, DROP are covered.

INSERT DATA and DELETE DATA are covered on the default graph or on one
named graph, but not on more than one graph at once.

An extra grammar rule for MODIFY is supported, again working on the default graph or one named graph, but with only a single, optional GRAPH <uri>.

The old style INSERT { :s :p :o }, DELETE { :s :p :o }, that is, insert or delete some data using just the INSERT or DELETE keyword, without DATA, leads to ambiguity in the combined grammar. These forms are not supported in the combined language. In fact, these forms pre-date the DATA forms in the submission language.

The ability to work on only one named graph needs a little explanation. In the
combined grammar, the INTO or FROM
is used to set the WITH part of an update.
There can be at most one WITH. In the submission,

INSERT INTO <g1> <g2> <g3> { ... } WHERE { ... }

is legal. In terms of language, this could be incorporated into the extended language but it introduces a capability not present in the upcoming working group language and it can't be written out again without repeating the operation, once for each named graph. Operating on a single named graph, or the default graph, is covered by the standard.

For old-style INSERT or DELETE of data, conversion can be done by adding the word DATA to the operation or by adding WHERE {} to the update operation. Both these conversions yield something that is legal and means the same under the submission language, so the conversion can be done while retaining the use of old software.
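Both conversions are mechanical enough to sketch as string rewrites. This is a toy illustration with invented names; a real converter would work on the parsed update request, not on raw text.

```java
// Toy sketch of the two mechanical conversions for old-style data updates:
// insert the DATA keyword, or append an empty WHERE clause. Either form is
// legal, and means the same, in both languages.
public class UpdateConvertSketch {
    static String addData(String op) {
        // "INSERT { ... }" -> "INSERT DATA { ... }" (likewise DELETE)
        return op.replaceFirst("^(INSERT|DELETE)\\s", "$1 DATA ");
    }

    static String addWhere(String op) {
        // "INSERT { ... }" -> "INSERT { ... } WHERE {}"
        return op.trim() + " WHERE {}";
    }

    public static void main(String[] args) {
        System.out.println(addData("INSERT { :s :p :o }"));
        System.out.println(addWhere("INSERT { :s :p :o }"));
    }
}
```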

By using an extended grammar, the application can even mix syntax of the submission on SPARQL Update and SPARQL 1.1 Update in a single request or, indeed, single operation. When printed the output can be in the equivalent SPARQL 1.1 Syntax.

ARQ (currently, the development snapshot) includes a command line SPARQL 1.1 Update extended parser, "arq.uparse". arq.uparse reads the extended syntax and prints the equivalent strict SPARQL 1.1 Update form. It can be used to translate from the submission language to W3C standards language. More on practical details: jena-dev/message/45040.

Key points from the extended grammar: the working group is not planning on including these extensions in the published SPARQL 1.1 Update grammar.

30 July 2010

As before, I will still be able to work on Jena, ARQ and TDB, and I also get to continue participating in the W3C SPARQL working group, now as an Invited Expert. The working group is making good progress on its chosen list of features, and now it's just a "small" matter of doing the core work and getting the Last Call documents out to the community.

17 July 2010

I have Ubuntu 10.04 working on a Samsung N210, running Thunderbird and Firefox as well as all my Java development systems. It may not be a fast machine but it's very convenient. The process is now easy, easier than some older material on the web (for 9.10 and very early 10.04) might suggest.

When first turned on, the machine installed Windows 7 Starter. I let this finish, even though I didn't want it, so I could install Ubuntu 10.04 alongside Windows in case it didn't work. Once I was happy it would work, I repartitioned the disk (with gparted) to create a single partition, deleting Windows and the restore partition, then reinstalled.

First, build a USB drive with the installer on it. To get the machine to boot from this I had to:

As the machine boots, keep F2 pressed to go into the BIOS.

Make sure the machine will boot from a USB pendrive.

Reboot with USB and install Ubuntu Netbook Remix

You have to press F2 very early to get into the BIOS configuration screens. The boot through the BIOS is very fast, so don't wait for the machine to put the Samsung splash screen up.

You can reset the BIOS to not boot from USB if you want to at this stage, or later.

At this point the wireless does not work. Don't panic; plug in an Ethernet cable and update the system.

sudo apt-get update
sudo apt-get upgrade
sudo reboot

and now the wireless works. There's quite a lot of advice on the web about this but it now seems that there is no need for any custom software - looks like the main Ubuntu repositories have a working version of the system.

The missing function keys are due to the fact that the Samsung N150/N210/N220 models are missing from the udev rules in:

/lib/udev/rules.d/95-keymap.rules
/lib/udev/rules.d/95-keyboard-force-release.rules

Adding "|*N150/N210/N220*" to the product part of the Samsung rules in BOTH files will enable the Fn-up and Fn-down keys. The new product section will look like:

ENV{DMI_VENDOR}=="[sS][aA][mM][sS][uU][nN][gG]*", ATTR{[dmi/id]product_name}=="*NC10*|*NC20*|*N130*|*SP55S*|*SQ45S70S*|*SX60P*|*SX22S*|*SX30S*|*R59P/R60P/R61P*|*SR70S/SR71S*|*Q210*|*Q310*|*X05*|*P560*|*R560*|*N150/N210/N220*"

Now you can map these keys to any program that sets the backlight

and then install some Samsung tools - you need to add the repository to the package manager, which you can do graphically or as:

02 June 2010

One area of interest at the RDF Next Steps Workshop is
other RDF-related syntaxes, ones that are not RDF/XML. RDF/XML is the standard
syntax; N-Triples is defined as part of the RDF test suite, but not formally as a syntax on the same level as RDF/XML; and there is RDFa for embedding in XHTML.

RDF/XML is not easy to read as RDF. Turtle appeals because it more clearly shows the triple structure of the data. N-Quads is a proposal to extend the RDF file format to named graphs, and TriG is a Turtle-inspired named graph syntax. There is also TriX, but I've never come across that in the wild.

Using XML had several advantages, such as comprehensive character set support, neutrality of format
and reuse of parsers. However, it's complicated in its entirety, even when using an XML parser, and it is quite expensive to parse, making parsing large (and not so large) files a significant cost. Because it can't, practically, be processed by XSLT, there are
nowadays few advantages.

All the non-XML formats, which are much easier to read and process, would be good to standardise, but they are not without the need to sort out some details.
Details matter when you're dealing with anything over a trivial amount of data,
and when it's millions of triples, any disagreement between information publisher and
information consumer is just a friction point in getting the data
cleaned up.

Turtle

Turtle takes the approach of using UTF-8 as the character set, rather than relying on character set control like XML. Given that nowadays UTF-8 support is well understood and widely available, the internationalization issues of different scripts are best dealt with that way. Parsers are both simple to write and fast.
(The tricks needed to get Java to parse fast would be a subject for a separate discussion.)

As Turtle is the most mature of the possible syntaxes, it is also the best
worked out. One issue I see is the migration from a one-standard-syntax world to a two-standard-syntax world,
which is not without its practical problems. What if system A speaks only RDF/XML, and system B speaks only Turtle? How long will it take for uses of content negotiation to catch up? Going from
V-nothing to V1 of a system (which is where we are now) is usually quicker than
going from V1 to V2, as the need to upgrade is much less. If it ain't broke, why change?

Turtle can write graphs that RDF/XML can't encode. If the property can't be split into namespace and local name, then RDF/XML can't represent it. An XML qname must have a local part of at least one alphabetic character. This isn't common but these details arise and cause problems (that is, costs) when exchanging data at scale.

What would be useful is a set of language tokens from which to build all sorts of languages, like rule languages, but at the moment there are some unnecessary restrictions in Turtle on prefixed names (Turtle calls them qnames but they are not exactly XML qnames).

Turtle disallows:

employee:1234

because the local part starts with a digit. In data converted from existing (non-RDF) data this is a nuisance, and one that caused SPARQL to allow it, based on community feedback.

But there are other forms that can be useful that are not allowed (and aren't in SPARQL):

ex:xyz#abc

ex:xyz/abc

ex:xyz?parm=value

The last one might be a bit extreme, but the first two are just using the prefix
to tidy up long IRIs. Partial alignment with XML qnames makes no sense in Turtle. Extending the range of characters to include /, # and maybe a few others makes prefixed names more useful. Issues just like this led to the CURIE syntax.

While these IRIs can be written in Turtle, it needs the long form, with <...>, and the only way to abbreviate is via the base
IRI, but you can only have one base IRI. It's a workaround, really, that gets ugly
when the advantage of Turtle is that it is readable. Extending the range of
characters in the local part does not invalidate old data; it does create
friction in interoperability, so we have one last chance to sort this out if
Turtle is to be standardised.

N-Quads

<s> <p> <o> .

<s> <p> <o> <g> .

What could be simpler? N-Quads is N-Triples with an optional 4th field to give the graph name (or context - it wasn't designed specifically for named graphs, but let's consider only
IRIs in the 4th field, not the blank nodes or literals the syntax also allows).

But TriG puts the graph name before the triples, while N-Quads puts it after. Maybe N-Quads should be like TriG, so that TriG can make N-Quads a subset. Parsing this modified N-Quads only takes buffering the tokens on a line and counting to 3 or 4 to determine whether it's a triple or a quad. Making TriG more flexible, at the cost of N-Quads' slightly less intuitive graph-name-first order, in what is basically a dump format, seems to me a good trade-off.
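The counting idea can be sketched directly. This is a toy tokenizer with invented names that assumes well-formed, whitespace-separated terms; a real parser would also validate each term.

```java
// Toy sketch of line-oriented parsing for a triple/quad dump format:
// count the terms before the final '.' - 3 means a triple (default graph),
// 4 means a quad with a graph name.
public class LineClassifySketch {
    static int termCount(String line) {
        String body = line.trim();
        if (body.endsWith("."))
            body = body.substring(0, body.length() - 1).trim();
        return body.isEmpty() ? 0 : body.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(termCount("<s> <p> <o> ."));     // 3: a triple
        System.out.println(termCount("<s> <p> <o> <g> .")); // 4: a quad
    }
}
```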

Blank node labels need to be clarified - is the scope the graph or the document? Both are workable. I'd choose scope-to-the-document, if only to avoid the confusion of two identical labels referring to two different bnodes, and it's occasionally useful to say that a bnode
in one graph really is the same as one in another when using the format as a transfer syntax
(for example, when one graph is a subgraph of another). TriG has the same issue, but the use of nested forms for graphs makes graph scoping more reasonable (except that graphs can be split over different {} blocks). Doing the same in N-Quads and TriG is important, and my preference is document-scoped labels.

TriG

TriG is a Turtle-like syntax for named graphs. It is useful for writing down RDF datasets.

It has some quirks, though. Turtle is not a subset of TriG because the default graph needs to be wrapped in {}, but the prefixes need to
be outside the {}. The default graph has to be given in a single block, but named graphs can be fragmented (that was just an oversight in the spec). It would be helpful to allow the unnamed graph to be specified as Turtle, and similarly if an N-Quads file were legal TriG.

TriG allows the N3-ish form:

<g> = { ... } .

I've seen some confusion about this form in the data.gov.uk data. The additional "=" and ".", which are optional, cause confusion, and at least one parser does not accept them because they weren't expected.

In N3, = is a synonym for owl:sameAs, but the relationship here isn't likely to be owl:sameAs; read as N3, it's more likely to be log:semantics. Now, I like the uniformity of the N3 data model, with graph literals (formulae), because of the simplicity and completeness it introduces, but it's not RDF - it's an extension, and it breaks all RDF-only systems.

If <g> is the IRI of a graph document, it would be more like the N3:

<g> log:semantics { ... } .

or

<g> log:semantics ?v .
?v owl:sameAs { ... } .

Avoiding the variability of syntax, which brings no benefit, is better. Drop the
optional adornment.

Summary

None of these issues are roadblocks; they are just details that need sorting out to move from the current
de facto formats to specifications. When exchanging data between systems
that are not built together, details matter.