iansrobinson.com

Graph Databases

(Want to talk about this or any other data-modelling topic in person? Sign up for our GOTO Amsterdam tutorial, June 18.)

Many graph database applications need to version a graph so as to see what it looked like at a particular point in time. Neo4j doesn’t provide intrinsic support either at the level of its labelled property graph model or in its Cypher query language for versioning. Therefore, to version a graph we need to make our application graph data model and queries version aware.

I’ll say this up front: versioning necessarily creates a lot more data – both more nodes and more relationships. In addition, queries will tend to be more complex, and slower, because every MATCH must take account of one or more versioned elements. Given these overheads, apply versioning with care. Perhaps not all of your graph needs to be versioned. If that’s the case, version only those portions of the graph that require it.

In this post I’m going to discuss what I consider to be the easiest form of versioning – time-based versioning. Time-based versioning gives us a universal versioning scheme: pick an instant, convert it to a long value, and the resulting version number applies to all parts of the graph. Version-number-based versioning, where each entity has its own independently varying version number (and scheme even), is more complex to implement. I’ll be dealing with that in a future post.

All graph versioning strategies with Neo4j must make tradeoffs between the cost of writes, the amount of on-disk storage require, the degree of fragmentation incurred, and the performance of reads for both current and historical queries. The design described here strikes a balance between these competing considerations. At the end of the post I describe some variations and optimisations.

Separate Structure From State

The key to versioning a graph is separating structure from state. This allows us to version the graph’s structure independently of its state.

To help describe how to design a version-aware graph model, I’m going to introduce some new terminology: identity nodes, state nodes, structural relationships and state relationships.

Identity Nodes

Identity nodes are used to represent the positions of entities in a domain-meaningful graph structure. Each identity node contains one or more immutable properties, which together constitute an entity’s identity. In a version-free graph (the kind of graph we build normally) nodes tend to identify an entity, locate it in the network structure, and represent its state. Identity nodes in a version-aware graph, in contrast, serve only to identify an entity and locate it in a network structure.

Structural Relationships

Identity nodes are connected to one another using timestamped structural relationships. These structural relationships are similar to the domain relationships we’d include in a version-free graph, except they have two additional properties, from and to, both of which are timestamps.

State Nodes and Relationships

Connected to each identity node are one or more state nodes. Each state node represents an immutable snapshot of an entity’s state. State nodes are connected to identity nodes using timestamped state relationships.

Example

The following example describes a simple domain made up of shops, products and suppliers. A ShopSELLS one or more Products, which are SUPPLIED_BYSuppliers. Over time prices change, shops sell different products, and source them from different suppliers. With a versioned graph we’ll be able to see which products a shop was selling, what price they were being sold at, and who was supplying them, at any point in the graph’s history.

The diagram below shows a version-aware graph representation of this domain. To help you see how structural and state-based elements are combined in the same graph, I’ve drawn identity nodes and structural relationships in blue, state nodes and state relationships in red. In the data itself, I’ve labelled the identity nodes Shop, Product, and Supplier, and the state nodes ShopState, ProductState and SupplierState. Structural relationships are named SELLS and SUPPLIED_BY. All state relationships are named STATE.

This graph represents the domain in its initial state on 1 January 2014. All the relationships, both structural and state, are current. Each relationship has a from property with a long millisecond value representing 1 January 2014, and a to property with a very large long value (End-Of-Time, or EOT, which is a magic number – in this case Long.MAX_VALUE) that effectively means there is no current upper bound to the period associated with the relationship.

Current Queries

To query the current state of a version-aware graph, we need to match current structural and state relationships, which we can do simply by matching the to property on relationships against our EOT value. Here’s a query for finding all products that currently retail for one pound or less, together with their suppliers, and the shops where they can be purchased:

As you can see, because of the need to make all parts of the query version-aware, even simple queries can become quite complex.

Modifying the Graph

The kinds of changes we may need to track in a version-aware graph include changes to state and changes to structure. Changes to state involve adding, removing and modifying node properties. Structural changes involve adding and deleting relationships, as well as modifying the strength, weight or quality of relationships by changing one or more of their properties.

Modifying Structure

In the following query, we move product 1 from shop 1 to shop 2. This change will be effective from 1 February 2014 (1391212800000 milliseconds since 1970-01-01T00:00:00Z).

As part of this query, we archive the old SELLS relationship, setting its to property to the millisecond value representing 1 February 2014, and introduce a new SELLS relationship that connects product 1’s identity node to shop 2’s identity node. We set the from property of this new relationship to 1 February 2014, and its to property to EOT.

This is what the graph looks like after we’ve made the change. I’ve used a dashed blue arrow to show the archived SELLS relationship, and a thick blue arrow to show the new SELLS relationship.

Modifying State

In the following query, we update product 1’s price. As with our structural change, this change will be effective from 1 February 2014.

This time, we archive the product’s current STATE relationship, and then create a new state node and a new STATE relationship. We copy the product’s properties from the old state node to the new state node (SET newState = currentState), and then update the price.

Historical Queries

We’ve seen that to query the current state of the graph, we need to do an exact match on the to property of both structural and state relationships, matching against our EOT value. Historical queries are slightly different, in that we’re looking for relationships whose from-to period covers the millisecond instant we’re interested in; that is, relationships where the from value is less than or equal to the instant we’re interested in, and the to value is greater than that instant. These kinds of comparisons must be done in the WHERE clause.

The following query finds all the products sold by shop 1 on 5 January 2014 (1388880000000 milliseconds from the Java epoch):

Observations

Some further comments on this approach:

Additive

This design is purely additive in nature. To delete a structural relationship we simply archive it; that is, timestamp its to property. To delete a node we must archive both its current state relationship and any current structural relationships. Current relationships (both structural and state) are those with an EOT value.

This additive model strikes a reasonable balance between the competing considerations mentioned at the beginning of this post: the cost of writes, the amount of on-disk storage require, the degree of fragmentation incurred, and the performance of reads for both current and historical queries. Changes in state require the addition of nothing more than a single new state relationship and one new state node, which keeps the cost of writes down. Nothing is rewritten; therefore, nothing is deleted, which in turn keeps store growth to a minimum. On the other hand, both current and historical queries require scanning timestamped relationships, and every state lookup requires traversing at least one state relationship.

Indexes and Constraints

Indexes and constraints are not version aware. When using indexes (or label scans), you will have to test the results of each lookup. For a node to be included in the results for a particular point in time, it must be reachable by a single state relationships whose period includes that point. For entities that change frequently, or are (logically) deleted frequently, this check can impact the performance of index lookups.

Having multiple states for a particular entity poses problems for unique constraints. If a uniquely constrained property value doesn’t change between versions, the constraint will fail because there will now be duplicate property values. To get round this problem, we have to introduce a version property that increments with each state change, and a synthetic property that combines the current version number with the value of the property we want to constrain. We then apply the unique constraint to this synthetic property instead of the underlying domain property. I’ll write this up in more detail in the near future.

Variations and Optimisations

The design strategy described here can be modified in several ways, depending on your needs:

Linked Lists

I’ve attached state nodes to identity nodes in a radial fashion. This works well when there are no more than several tens of changes to an entity’s state over its lifetime. If an entity changes far more often than that, you might consider using a linked list.

Linked lists are also useful when you need to compare successive states: the ordering imposed by the linked list helps you identify next and previous states. In fact, you can combine radial relationships with linked lists:

CURRENT_STATE Relationships

For graphs with entities that change frequently, you might consider adding a CURRENT_STATE relationship that connects an entity’s identity node directly to its latest state node. This allows you to navigate directly to that entity’s current state, without having to iterate many radial state relationships, or traverse a linked list. The use of a CURRENT_STATE relationship, however, requires more housekeeping when updating state: you’ll have to delete the old CURRENT_STATE relationship, and then create a new one.

Overloading Identity Node with Initial State

For graphs where most entities change infrequently, or not at all, you might consider overloading each entity’s identity node with that entity’s initial state. You then add state nodes and relationships only with the first change to an entity’s state. If you adopt this strategy, you should continue to adjust structure at the level of the identity nodes, however, not the latest state nodes. Further, while it helps reduce the number of additional nodes and relationships in the graph, it does impose some additional cognitive overhead when visualising the resulting structure, because you now have to make sense of a structure in which current structural relationships are connected to historical initial state nodes.

Overloading Identity Node with Current State

Following on from the idea of overloading the identity node with initial state, and as an alternative to using a CURRENT_STATE relationship, you might consider overloading the identity node with the current state of the entity, and building a linked list of previous states. The following diagram shows what our shop-product-supplier model would look like after if we were to adopt this approach:

At first glance, this looks simpler. There are fewer nodes, for starters. But it comes with its own compromises. Updating a node requires us not only to create a new node at the head of the linked list, but also to physically delete all the structural relationships attached to the old head of the list, before then recreating them on behalf of the new head of the list. The model is optimised for querying the current state of the graph, but at the expense of more costly writes and increased space on disk (because each time we recreate a batch of relationships, we add them to the end of the stores on disk). I’ll be looking at this model in more detail in a future post.

The current version of Neo4j has two different types of index: named indexes and automatic indexes. The Cypher query examples throughout the Graph Databases book use named indexes. In this query, for example, venue, city and author are named node indexes:

There’s a problem, however: data created using a Cypher CREATE statement won’t be indexed in a named index. This has led to some confusion for anyone wanting to code along with the examples: if you use the CREATE statements as published, the query examples won’t work.

If you want to code along with the examples, the easiest thing to do is create the data using Cypher and automatic indexes. You’ll need to modify the queries to use the automatic indexes, but the changes aren’t complicated.

Configure Automatic Indexes

The examples below show how to run the Shakespeare examples (pp.47-50) using automatic indexes.

Before you create any data, you’ll need to configure the automatic indexes. The approach you take will depend on whether you are running Neo4j in embedded or server mode.