Data Hub Scaling Challenges

preface: This blog post is generally intended for a more technical audience, i.e., software developers, administrators or knowledge engineers. Albeit, all other interested people are also invited to read it 😉

d:swarm is successfully deployed and in production mode at SLUB Dresden right now. At the current state we make use of the streaming variant. This variant has especially been chosen to overcome the shortcomings that have been experienced via applying larger amounts of data through the Data Hub of d:swarm. This blog post describes the current state and experiences that has been made with the Data Hub in d:swarm and, finally, the challenges that needs to be tackled when scaling it to approach the vision of a central knowledge graph.

Current State of the Data Hub

Currently, all content data (bibliographic records – a collection of statements) in d:swarm is stored in a graph database, which is Neo4j right now. We’ve implemented a graph data model (GDM) on top of the Property Graph data model that leverages features of the RDF data model (see also here for a comparison of our GDM vs RDF). We’ve chosen this database type first, since we could fit very naturally the graph data model that fits all our requirements into the the Property Graph data model. Amongst other following demands needs to be fulfilled:

storing data from arbitrary data sources in a generic format without information loss,

qualified attributes (e.g. order or index position) or

versioning at a statement-level base

All content data will be processed via an unmanaged extension that runs at the Neo4j database server to provide custom HTTP APIs to process content data more efficiently between the d:swarm backend and the database that was chosen as Data Hub.

Testing, Testing, Testing

Since we started intensively working at scaling the write performance at our Data Hub implementation in April 2015, i.e., first testing the Data Hub with some larger data sets via different write opportunities, we tackled some challenges that turned out through our testing. Neo4j support two different write modi: batch insert, where the database needs to be offline and transactional, i.e. at a running database instance. We could identify two main demands from our testing observations:

reduce the storage amount of the database itself (initially we observed storage footprint with a ratio from 1:72 (original source data vs. processed GDM data in the Data Hub))

increase the speed (throughput) at write operations

… and Iterate

So we tried to reduce the amount of information that really needs to be stored in the data hub as much as possible. One the one hand, we eliminated all unnecessary or duplicated information, e.g., write the type (/class) information of resources or bnodes only as node label and not as separate relationship to a type node. Those often occurring type relationships can “end” in a super nodes (/dense nodes), i.e. nodes with many ingoing (or outgoing) relationships. On the other hand, we tried to make necessary information as compact as possible, e.g., via hashing. We are applying robust and fast hashing algorithms. Currently, we make use of SipHash. We even hash UUIDs Another approach that we’ve implemented to tackle information reduction is compressing. For example, we make use of prefixed URIs (instead of full-fledged ones).

Finally, we made use of another index engine (MapDB) for exact-match indices, which are the majority of our indices. For example, an index that guarantees that each statement does only exists once per data model and version range. So this index acts like a constraint or entry barrier.

… to Improve and Be Prepared for Scaling

With all those improvements we were able to reduce the storage footprint of the Data Hub by a factor from 3 to 7, i.e., depending on the structure of the original data source (e.g. flat or more hierarchical) we now have ratios from 1:10 to ~ 1:23. The resulting data storage footprint ratio are applicable for us right now, since we need to keep in mind that we usually add additional information to the data in its original format, e.g., versioning or provenance information. The major storage reduction was achieved by the introduction of prefixed URIs. This slackened the processing a bit and made it a bit more complex, i.e., a mechanism and index to do the namespace/prefix resolving was necessary.

… but Are We There Yet?

All the improvements that we’ve implemented so far mainly dealt with reducing the amount of the database itself. However, we weren’t really able to get on the second challenge – increasing the speed at write operations. Since, we prefer to do everything at a running database instance, we can only rely on the transactional processing. So we cannot take advantages of the batch mode at this time. A processing (import + transform + export) of rather flat XML source data – 1.2 million records – still took ~ 13.7 hours. To illustrate this (one of many) dataset a bit more, these are ~ 121 million statements. Whereby, a records consists of ~ 20 – 250 statements.

This duration was not acceptable for SLUB having the amount of information in mind that should finally be processed:

~ 100 million records (~ 2 billion statements)

That’s why, we decided to bring forward implementing the streaming variant that bypasses the Data Hub completely and, which was requested by other interested institutions (e.g. Dortmund University Library). Plain data processing without storing anything in between took only 429 seconds for the mentioned 1.2 million records data set. Nevertheless, there’s still a challenge left

d:swarm – a Challenge to Modern Databases?

So what’s it what we are looking for (ideally)?

arbitrary data can be stored in a generic data format (without information loss)

records (a collection of statements) are our main unit of processing (that can be flat or hierarchical)

qualified attributes can be added to statements, e.g., order or provenance

versioning can be done at a statement-level base

the data hub can be partitioned semantically, e.g., all data that belong to a certain data model can be stored and retrieved rather easily

an intuitive query language to work with the data

an indexing engine to speed up read operations

constraints can be defined to guarantee data integrity

data is stored in an RDF-compatible format or can be converted to RDF rather easily

~ 100 million records (~ 2 billion statements) can be written within a day

A system consisting of one or more (, different) storage solutions (e.g. a Data Hub powered by polyglot persistence) that fulfil all these requirements and demands might be an opportunity or a re-evaluation of the use cases that should be possible with such a system may reduce one or more features.

d:swarm Data Hub in the Future

SLUB Dresden together with the d:swarm team will re-evaluate the requirements and demands that should be satisfied by a central data hub. Afterwards, improvements or a re-implementation of the Data Hub component can be planned. This can probably be done together with ScaDS Dresden/Leipzig (Competence Center for Scalable Data Services and Solutions). The people at ScaDS are interested in use cases for storing and processing large knowledge graphs, which fits perfectly to our feature list of or (ideal) Data Hub. We discussed our use case and the scability challenge at ScaDS project meeting in June 2015.

Besides an evaluation (and maybe cooperation) with other, existing projects (that arised and evolved from the beginning of the d:swarm project and later on) that needs to tackle similar challanges might be a good option. Amongst others, Wikidata is a very interesting project, since it is founded on a very similar (graph) data model (see this comparison). Most recently, they published a (graph) query service that is powered by a triple store (Blazegraph). Wikibase the data hub behind Wikidata has a notion of record and also supports qualified attributes at statements and statement-level based versioning. A database with rather promising features (and that’s why also a look worth ) is AsterixDB. It has an open and flexible data model (a superset of JSON) that fits good to our demand of a generic data format, and this system was developed and implemented with scale in mind.

Let’s see what the future will bring us to achieve the goal to have a central data hub for bibliographic records and related information.