13 February 2008

In other work, I've needed some storage and indexing components. To
test them out, I've built a persistent
Jena graph that behaves like an "RDF VM", whereby an application
can handle more triples than fit in memory alone, and it flexes to use and release its
cache space based on the other applications running on the machine.
Working name: TDB. It's early days, having only just finished writing the
core code, but the core is now working to the point where it can load and query
reliably.

The RDF VM uses indexing code (currently, classical
B-Trees) but in a way
that matches the implementation model of RDF: there is no translation
between the index's view of the data and its on-disk form. To check that made sense, I
also tried a version with the B-Trees replaced by
Berkeley DB
Java Edition. The BDB version behaves similarly, with a constant slowdown. Of
course, BDB-JE is more sophisticated, with variable-sized data items and
duplicates (and transactions, though I wasn't using them), so some overhead isn't
surprising.

I have also tried some other indexing structures, but B-Trees have proved to
scale better, from situations where there isn't much free memory up to 64-bit
machines where there is plenty.
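To illustrate the idea of the index working directly on the disk form of the data, here is a minimal sketch (my own illustration, not the actual TDB code; the 8-byte NodeId width and record layout are assumptions): a triple is a fixed-width record of three NodeIds, so the bytes the B-Tree compares are exactly the bytes written to disk, with no encode/decode layer in between.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch: a triple as a fixed-width record of three 8-byte NodeIds.
// The record is both the index key and the on-disk representation,
// so no translation step sits between the B-Tree and the disk.
public class TripleRecord {
    static final int RECORD_SIZE = 3 * Long.BYTES; // 24 bytes per triple

    // Pack (subject, predicate, object) NodeIds into one record.
    static byte[] encode(long s, long p, long o) {
        return ByteBuffer.allocate(RECORD_SIZE)
                .putLong(s).putLong(p).putLong(o)
                .array();
    }

    // Unpack a record back into its three NodeIds.
    static long[] decode(byte[] record) {
        ByteBuffer bb = ByteBuffer.wrap(record);
        return new long[] { bb.getLong(), bb.getLong(), bb.getLong() };
    }

    public static void main(String[] args) {
        byte[] rec = encode(1L, 2L, 3L);
        System.out.println(rec.length);                   // 24
        System.out.println(Arrays.toString(decode(rec))); // [1, 2, 3]
    }
}
```

Because records are fixed-width and self-describing, byte-wise comparison of records doubles as triple comparison, which is what lets one index layout serve subject, predicate and object orderings.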

Node Loads

The main difference between the custom and BDB-backed implementations is loading
speed. They handle RDF node representations differently.
Storing nodes in a BDB database, or a JDBM htable,
was adequate, giving a load rate of around 12K triples/s, but it generates too many
scattered, asynchronous disk writes. Changing to streaming writes in TDB
fixed that. Because all the implementations fit the same framework, this technique
can be rolled back into the BDB-backed code. And BDB supports
transactions. The node technique may also help a SQL-backed
system like SDB as well.
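A rough sketch of the streaming-write node-table idea (again my own illustration, not TDB's actual file format; the length-prefix encoding is an assumption): each node is appended sequentially to the node file, its byte offset becomes its NodeId, and a map gives node-to-id lookup, so all writes are one sequential stream rather than scattered random I/O.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch: a node table where the NodeId is simply the byte offset of the
// node's serialized form in an append-only file (a byte stream here).
// All writes are sequential appends - the streaming-write pattern.
public class NodeTable {
    private final ByteArrayOutputStream file = new ByteArrayOutputStream(); // stands in for the node file
    private final Map<String, Long> idCache = new HashMap<>();              // node -> NodeId (offset)

    // Return the NodeId for a node, appending it to the file if unseen.
    long getOrAllocate(String node) throws IOException {
        Long id = idCache.get(node);
        if (id != null) return id;                         // already stored: no write at all
        long offset = file.size();                         // NodeId = current end of file
        byte[] bytes = node.getBytes(StandardCharsets.UTF_8);
        file.write(bytes.length >> 8);                     // 2-byte length prefix (assumed encoding)
        file.write(bytes.length & 0xFF);
        file.write(bytes);                                 // sequential append, never a seek
        idCache.put(node, offset);
        return offset;
    }

    public static void main(String[] args) throws IOException {
        NodeTable nt = new NodeTable();
        long a  = nt.getOrAllocate("http://example/s");
        long b  = nt.getOrAllocate("http://example/p");
        long a2 = nt.getOrAllocate("http://example/s");    // cached, no new write
        System.out.println(a + " " + b + " " + (a == a2)); // prints "0 18 true"
    }
}
```

The point of the design is that loading never seeks within the node file, which is why it helps any backend - the same append-only table could sit in front of BDB or a SQL store.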

I did try Lucene - not a good idea.
Loading is too slow, but
then that's not what Lucene is designed for.

Testing

TDB works and gives the right results for the queries.
(It would be
good to have the expected results published, as well as described, in the
DAWG test suite so
that testing can be done against them.)

Query 2 benefits hugely from caching. If run
completely cold, after a reboot, it can take up to 30s. Running cold is also a
lot more variable on machine sdb1 because other projects use the disk array.

Still room for improvement,
though. The new index code doesn't quite pack the leaf nodes optimally yet, and some more profiling may
show up hotspots, but for a first pass, just getting the benchmark to run is fine.
Rewriting queries, as an optimizer should, lowers the execution times for queries
3 and 5 to 0.48s and 1.46s respectively.

The results for query 4 show one possible hotspot. This query churns through
nodes while executing its filters, but the node retrieval code does not benefit from
co-locality of disk access. Fortunately, alternative code for the node
table does make co-locality possible while still running almost as fast. Time to get
out the profiler.

To illustrate the "RDF VM" effect: when run with Eclipse, Firefox etc. all
consuming memory, my home PC is 5-10% slower than when run without them
hogging bytes, even on a dataset as small as 16 million triples.

First Results for TDB

Machine              sdb1         Home PC
Date                 11/02/2008   11/02/2008
Load (seconds)       686.582      726.1
Load (triples/s)     23,478       22,961
Query 1 (seconds)    0.05         0.03
Query 2 (seconds)    1.30         0.73
Query 3 (seconds)    9.87         9.50
Query 4 (seconds)    30.99        35.32
Query 5 (seconds)    29.87        34.24

Breakdown of the sdb1 load:

Loading          Triples       Load time (seconds)   Load rate (triples/s)
Overall          16,120,177    686.582               23,478
infoboxes        15,472,624    651.543               23,748
geocoordinates   447,517       24.084                18,581
homepages        200,036       10.955                18,259

Setup

My home PC is a media centre - quad core, 3 Gbytes of RAM, consumer-grade disks,
running Vista and Norton Internet Security anti-virus. I
guess it's quicker on the short queries because there is less latency in getting
to the disk - even if the disks are slower - but it falls behind when the query
requires some crunching or a lot of data drawn from disk.

sdb1 is a machine in a blade rack in the data centre - details below.

(My work desktop machine, running Windows XP, has various Symantec antivirus
and anti-intrusion software components and is generally slower for database work.)