RDF Store Benchmarks with DBpedia

In the course of my diploma thesis, I evaluated the performance of several RDF stores when small pieces of information are requested from a large dataset (DBpedia infoboxes plus two very small sets). The benchmark queries employ varying levels of joins and constraints.

As of now, only the OpenLink Virtuoso configuration has been optimized; this must be taken into account when comparing performance.

1. Motivation

The use case is a mobile client-server application that allows for the exploration of Linked Data based on geographical coordinates. As the application will be user-facing, short response times are of high importance. In this context, queries are expected to yield small result sets, but involve large datasets (such as DBpedia) and possibly several levels of joins.

2. Tested RDF Stores

Candidate RDF stores had to support large datasets such as DBpedia, SPARQL and Named Graphs, and offer a means to implement owl:sameAs inference (i.e. either a built-in ability or a suitable programming interface). The following stores were selected:

In an initial release of this benchmark, Virtuoso's performance was far from ideal. OpenLink traced this back to indexes that were inappropriate for this usage scenario, in which queries do not specify a graph. Following suggestions by OpenLink, the configuration was adjusted to include POGS, PSOG and SOPG indexes in addition to the default OGPS index, which shortened query times by a factor of 3 to 45.
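
To illustrate why the index order matters when no graph is given, here is a minimal sketch in Python. It uses a deliberately simplified model that I am assuming for illustration: an index can only be used up to its first unbound column. This is not a description of Virtuoso's actual query optimizer, and the function names `usable_prefix` and `best_index` are hypothetical.

```python
from typing import Iterable, Set


def usable_prefix(index: str, bound: Set[str]) -> int:
    """Count the leading index columns that are bound in a quad pattern.

    `index` is a column order such as "OGPS" (S=subject, P=predicate,
    O=object, G=graph); `bound` holds the positions the query fixes.
    Simplifying assumption: the index is only useful up to the first
    column that the pattern leaves unbound.
    """
    count = 0
    for column in index:
        if column not in bound:
            break
        count += 1
    return count


def best_index(indexes: Iterable[str], bound: Set[str]) -> str:
    """Pick the index with the longest usable prefix for the pattern."""
    return max(indexes, key=lambda ix: usable_prefix(ix, bound))


if __name__ == "__main__":
    # Pattern with predicate and object given, but no graph.
    pattern = {"P", "O"}
    for ix in ["OGPS", "POGS", "PSOG", "SOPG"]:
        print(ix, usable_prefix(ix, pattern))
```

Under this model, a pattern that fixes only predicate and object can use just the first column of OGPS, because the unbound graph column comes second, whereas POGS covers both bound columns.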

2.2 SDB Beta 1

The index layout was tested on PostgreSQL 8.2.5 and MySQL 5.0.45 (x64 versions, default configurations). The hash layout was tested only on PostgreSQL due to performance issues ("Hash loading is very bad on MySQL." - SDB Wiki).

The results obtained for SDB cannot currently be compared to those of Virtuoso, as the underlying databases have not been optimized. Andy Seaborne suggests the use of PostgreSQL's ANALYZE command.

2.3 Sesame 2.0 beta 6

Sesame's good preliminary results and moderate loading times prompted me to explore the effects of supplementary indexes in addition to the default spoc and posc indexes. The following table shows the build times on the full dataset (see the section Queries for query times):

4. Benchmark Configuration

The low amount of RAM (1 GB vs. a 4 GB dataset) likely impacts the results. Accordingly, the results are meaningful only for comparable configurations.

5. Loading

The RDF stores feature different indexing behaviors: Sesame automatically indexes after each import, while SDB and Virtuoso allow for selective index activation. In order to make load times comparable, the data import was performed as follows:

infoboxes-fixed.nt was imported with indexes initially disabled in SDB and Virtuoso. Indexes were then activated, and the time required for index creation was factored into the import time.

geocoordinates-fixed.nt was imported with indexes enabled.

homepages-fixed.nt was imported with indexes enabled.
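
The measurement described above can be sketched as follows. This is only an illustration of the timing approach, not the tooling actually used for the benchmark; `import_data` and `build_indexes` are hypothetical callables standing in for the store-specific import and index activation steps.

```python
import time
from typing import Callable, Optional


def timed_load(
    import_data: Callable[[], None],
    build_indexes: Optional[Callable[[], None]] = None,
) -> float:
    """Return the wall-clock seconds for an import, including index creation.

    `import_data` loads the file into the store (hypothetical callable).
    `build_indexes` activates the indexes afterwards; pass None when the
    store already indexes during the import, so that both cases yield a
    comparable total.
    """
    start = time.perf_counter()
    import_data()
    if build_indexes is not None:
        # Index creation counts toward the load time.
        build_indexes()
    return time.perf_counter() - start
```

For a store that indexes automatically, the second argument is simply omitted, so the returned figure covers the same work in both cases.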

5.1 Loading of infoboxes-fixed.nt

5.2 Loading of geocoordinates-fixed.nt

5.3 Loading of homepages-fixed.nt

6. Queries

As little data has been prepared for actual use in the application, the queries are mostly of a generic nature. They run against the DBpedia infoboxes set and assess performance with varying levels of joins and constraints.

In order to minimize query caching effects, queries were always executed in order after server startup. An exception was Virtuoso, where a noticeable warm-up delay occurred with the initial query. Accordingly, results for query 1 were obtained by restarting the server and warming it up using query 5.
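
A minimal sketch of this procedure in Python is given below. It is an assumption-laden illustration rather than the harness used for the benchmark: `execute` and `restart_server` are hypothetical callables for the store under test, and running all queries after the warm-up (rather than query 1 alone) is my simplification.

```python
import time
from typing import Callable, Dict, Optional


def run_queries(
    queries: Dict[str, str],
    execute: Callable[[str], object],
    restart_server: Callable[[], None],
    warmup: Optional[str] = None,
) -> Dict[str, float]:
    """Time each query once, in order, after a fresh server start.

    `queries` maps a query name to its text; dict order is the run order.
    `warmup` optionally names a query that is executed first without being
    timed, to absorb a start-up delay.
    """
    restart_server()
    if warmup is not None:
        execute(queries[warmup])  # untimed warm-up run

    timings: Dict[str, float] = {}
    for name, text in queries.items():
        start = time.perf_counter()
        execute(text)
        timings[name] = time.perf_counter() - start
    return timings
```

With `warmup` left at None, the function simply runs the queries in order after startup; passing the name of query 5 mirrors the Virtuoso exception.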