09 December 2010

TDB bulk loader - version 2

This article could be subtitled "Good I/O and Bad I/O".
By arranging to use good I/O, the new TDB loader achieves faster
loading rates despite writing more data to disk. "Good I/O"
means file operations that occur in a buffered, streaming fashion;
"bad I/O" means file operations that cause the disk heads to jump
around at random, or that work in small units of disk transfer.

The new TDB loader, "loader2", is a standalone program that bulk loads
data into a new TDB database. It does not support incremental loading,
and it may destroy existing data. It has only been tested on Linux; it
should run on Windows under Cygwin, but what the performance would be
there is hard to tell.

Figures demonstrating the loader in action on various large datasets
are in a separate blog entry. It is faster than the current loader
for datasets over about 1 million triples and comes into its own
above 100 million triples.

Like the current bulk loader ("loader1"), loader2 can load triple and quad RDF formats,
including from gzipped files. It runs fastest on N-Triples or N-Quads
because the parser is fastest, and lowest overhead, for these formats.

The loader is a shell script that coordinates the various phases.
It is available in the TDB development code repository as
bin/tdbloader2 and in the current 0.8.8 snapshot build.
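As a sketch of usage (the --loc flag and the destination directory here are illustrative; check the script's own help output for the exact options in your build):

```shell
# Bulk load one gzipped N-Triples file and one N-Quads file into a
# new database directory. Remember: loader2 is not incremental and
# may destroy existing data at that location.
bin/tdbloader2 --loc /data/DB1 data1.nt.gz data2.nq
```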

Loader2 is based on the observation that the speed of loader1 can
drop sharply as the memory-mapped files fill up RAM (the "can" is
because it does not always happen, which is slightly weird). This fall-off
is more than one would expect simply from having to use some disk, and
sometimes the rate of loader1 becomes erratic. This could be
due to the OS and its management of memory-mapped files,
but the effect is that secondary index creation can
become rather slow. Loader1 tends to do "bad I/O": as the
caches fill up, blocks are written back in what, to the disk, looks
like a random order, causing the disk heads to jump around.

Copying from the primary index to a secondary index involves a sort,
because TDB uses
B+Trees
for its triple and quad indexing. A B+Tree keeps its
records in sorted order, and each index uses a different order.

Loader1 is much faster than simply loading all indexes at once,
because in that case the RAM available for caching is split across
parts of all the indexes. It is better to build one index at a time,
using the RAM to cache just that index.

Loader2 similarly has a data-loading phase and an index-creation phase.

The first phase builds the node table and writes out the data for index building.
Loader2 takes the stream of triples and quads
from the parser and writes the RDF terms (IRIs, literals, blank nodes)
into the internal node table.
It also writes out text files of tuples of NodeIds (the internal 64-bit
numbers used to identify each RDF term). This is "good I/O": the writes
of the tuple files are buffered up and the files are written append-only.
This phase is a Java program, which exits after the node table and working files have been written.
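To illustrate what one line of such a tuple file might look like (the exact layout is an assumption; the post only says the files hold fixed-width hex numbers):

```shell
# Print one S P O line of fixed-width, 16-hex-digit (64-bit) NodeIds.
# The widths, separator and values are illustrative, not TDB's actual
# on-disk format.
printf '%016X %016X %016X\n' 4103 2849 3076
# -> 0000000000001007 0000000000000B21 0000000000000C04
```

Fixed-width fields are what make a plain lexicographic text sort of these lines equivalent to a numeric sort on the NodeIds.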

The next phase produces the indexes, including the primary index. Unlike
loader1, loader2 does not write the primary index during node loading.
Experimentation showed it was quicker to build it separately, despite
needing more I/O. This is slightly strange.

To build the indexes, loader2 uses the
B+Tree rebuilder,
which requires the data in index-sorted order. Index rebuilding is therefore a sort followed
by B+Tree building. The sort is done by Unix sort, which
is very easy to use and scales smoothly from a few lines to gigabytes of data.
Having written the tuple data out as text files in the first phase (and as fixed-width hex numbers at that, which is quite wasteful),
Unix sort can do a plain text sort on the files. Despite that meaning lots of I/O, it is good I/O,
and the sort program really knows how best to manage temporary files.

For each index, a Unix sort is done to get a temporary file of tuple data in the right sort order.
The B+Tree rebuilder is called with this file as the stream of sorted data it needs to
create an index.
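A minimal sketch of the per-index sort step, assuming space-separated, fixed-width hex tuples (the key orders follow from the index names; the real script's exact sort invocations may differ):

```shell
# Three toy S P O tuples in fixed-width hex.
cat > tuples.txt <<'EOF'
0000000000000002 0000000000000001 0000000000000003
0000000000000001 0000000000000002 0000000000000003
0000000000000001 0000000000000001 0000000000000002
EOF

# SPO index: with fixed-width fields, a plain lexicographic sort of
# the whole line gives subject-predicate-object order.
sort tuples.txt > spo.sorted

# POS index: sort by predicate, then object, then subject.
sort -k2,2 -k3,3 -k1,1 tuples.txt > pos.sorted

# Each sorted file is then streamed into the B+Tree rebuilder.
```

Because each output file is written sequentially and sort manages its own temporary files, this is "good I/O" in the sense used above.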

There are still opportunities to tune the new loader, for example to
see whether piping the output of the sorts directly into the rebuilder
is better or worse than the current two-step approach using temporary
files. Using different disks for the different temporary files should
also help.

The index building phase is parallelisable. Because I/O and memory usage are the bottlenecks, not CPU cycles,
the crossover point at which this becomes effective might be quite high.
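A sketch of the parallel variant, with assumed file names: each index's sort reads the same tuple file, so the sorts can run concurrently and the script waits for all of them before building the B+Trees:

```shell
# Two toy S P O tuples in fixed-width hex.
cat > tuples.txt <<'EOF'
0000000000000002 0000000000000001 0000000000000003
0000000000000001 0000000000000001 0000000000000002
EOF

# Run the three index sorts concurrently; each writes its own file,
# so the outputs do not interfere with each other.
sort tuples.txt                    > spo.sorted &
sort -k2,2 -k3,3 -k1,1 tuples.txt  > pos.sorted &
sort -k3,3 -k1,1 -k2,2 tuples.txt  > osp.sorted &
wait   # block until all background sorts have finished
```

Whether this wins in practice depends on how much the concurrent sorts compete for disk bandwidth and memory, which is exactly the crossover point mentioned above.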

To find out whether loader2 is better than loader1, I have run a number of tests:
load and query tests with the Berlin SPARQL Benchmark (2009 version), a load test on the RDF version of
COINS (the UK Treasury's Combined Online Information System; about 420 million quads, and it is real data), and a load test using the Lehigh University Benchmark with some inferencing. Details, figures and tables are in the next article.

2 comments:

This is somewhat similar to what the 4s-import loader does, though it's not completely tuned for bulk loading, so it works in 5MT stripes. We don't have a dedicated bulk loader anymore; it wasn't enough better than the 4s-import one to justify maintaining it (maybe 5% faster).

We load 5MT-worth of resource data (ID -> resource maps), then 5MT of quads get sent to the indexes, using the same trick of stashing the quad data and then sorting it. We use binary data, mmap() and qsort() IIRC, but the idea is the same.

The various indexes then do their own internal sort to place the data in logical disk order, to get more write locality before writing. I doubt that matters so much on 2010 hardware though.

It was done that way because when we wrote that loader, the normal situation would be that we would get at most 10% of the indexes mapped into RAM during normal operation.

5store does something very different, but it's optimised for vastly more powerful hardware.

At the time 4s-import was tuned (2006ish) our data/user ratio was pretty high, so the key economic factor was triples/machine, but now we're much more focused on users/second, so we make sure 100% of the index is in RAM during normal operation.

This looks very interesting. I have been trying to load a very large dataset (~3B triples) and I have been observing the sharp fall-off when the memory-mapped files start to fill up. This usually happens around the 350,000,000 statement mark, and the load performance drops quickly from an average of about 70k TPS to about 3-6k TPS.