iansrobinson.com

The Power of the Daleks (and Neo4j)

Jim Webber and I have been busy these last few months writing a tutorial for Neo4j. As we did with REST in Practice, we’ve chosen an example domain that falls a little outside the enterprise norm. With REST in Practice we chose the everyday world of the coffee shop; this time round, we’ve gone for the grand old universe of Doctor Who.

As we’ve worked on the tutorial, we’ve built up a Doctor Who data model that shows how Neo4j can be used to address several different data and domain concerns. For example, part of the dataset includes timeline data, comprising seasons, stories and episodes; elsewhere we’ve social network-like data, with characters connected to one another through being companions, allies or enemies of the Doctor. It’s a messy and densely-connected dataset – much like the data you might find in a real-world enterprise. Some of it is of high quality, some of it is lacking in detail. And for every seeming invariant in the domain, there’s an awkward exception – again, much like the real world. For example, each incarnation of the Doctor has been played by one actor, except the first, who was originally played by William Hartnell, and then later by Richard Hurdnall, who reprised the tetchy first incarnation some years after William Hartnell’s death for the twenty-fifth anniversary story, The Five Doctors. Each episode has a title; at least, they did when the series first began, in 1963, and they do today. But towards the end of the third season, the original series stopped assigning individual episode titles (in the original series, stories typically took place over several episodes); a story’s episodes were simply labelled Part 1, Part 2, etc. And so on.

The Dalek Props

The earliest alien monsters to appear in Doctor Who, the Daleks have long held a fascination for the British viewing public, and for the enquiring minds of Doctor Who fandom. In 2002, Jon Green began researching the history of the Dalek props created for the original series, which ran from 1963 to 1988. Jon was later joined by a second researcher, Gav, and the result of their hard work is the wonderful Dalek 6388.

Recently, I imported some of this Dalek research into our Doctor Who database. From the wealth of detail available on Dalek 6388 I chose to focus on the ways the different prop parts were reused in different stories.

Each episode or story featuring the Daleks made use of one or more Dalek props. Each prop came in two parts: the top half, called the shoulders, and the bottom half, called the skirt. These parts were handmade, and as a result there were many differences between them: the hemisphere spacing, collars, slats, eyestalks, and gun boxes in particular varied in subtle but noticeable ways. Reconstructing the history of the props has been largely a matter of identifying specific parts in screenshots and photos based on these visual clues.

Over time, the different shoulders and skirts were mixed and matched. For example, the shoulders originally belonging to Dalek 1, one of the props created for the very first Dalek story, The Daleks, were later paired with the skirt belonging to Dalek 5 for the story Day of the Daleks. This composite prop is known as Dalek One-5. A couple of seasons later, the shoulders were paired with the skirt belonging to Dalek 7 for the Daleks’ last outing with Jon Pertwee’s Doctor, Death to the Daleks. This composite prop is known as Dalek One-7.

Structuring the Data

With the details from Dalek 6388 added to our dataset we now have something that resembles a supply chain traceability scheme – another excellent use for Neo4j. Here’s how some of that data is structured (click the image for a closeup view):

The screenshot above shows some of the Dalek prop data as viewed in Neoclipse. Neoclipse allows you to view, create and edit nodes, as well as search any indexes you’ve created for your store. Neo4j’s browser-based Webadmin tool also includes a visualisation tab for viewing a database running in server mode.

The screenshot above show three of the stories in which the Daleks appeared: The Daleks, The Dalek Invasion of Earth and The Power of the Daleks. (They’ve appeared in many more stories – the view here shows only a fraction of the data we have. Even here I’ve only shown full details of the props for one of the stories, The Power of the Daleks. Portions of the data associated with the other two stories are then included to show how the data is interconnected.)

Below each story node is a node representing the group of Dalek props used in that story. I’ve chosen to create these group nodes because there is potentially other data to be included at the group level. For example, each story was assigned a budget for building or maintaining the props used in that story. Moreover, in the research, the voice artists and operators responsible for bringing the Daleks to life are not associated with individual props but with the group as a whole (the details of the voice artists and operators are not, however, currently in our dataset). Hence the Dalek props group nodes USED_IN each story.

Focusing now on the props used in The Power of the Daleks, we can see that four props appeared in the story: Dalek 1, Dalek 2, Dalek 7, and Dalek Six-5. Daleks 1 and 2 also appeared in the two earlier stories, The Daleks and The Dalek Invasion of Earth (together with some other props not shown here), as well as in several other stories not shown in the screenshot. Daleks 7 and Six-5 are here not associated with any other stories, but in the full dataset, Dalek 7 is shown to have appeared in two other stores, The Evil of the Daleks and The Chase, while Dalek Six-5 appeared in four other stories following its debut in the The Power of the Daleks.

Dalek Six-5 is one of those composite props I mentioned previously. It is COMPOSED_OF shoulders whose ORIGINAL_PROP was Dalek 6, and a skirt originally made for Dalek 5. The shoulders, as the graph in the screenshot shows, appeared as part of at least one other composite prop, Dalek Six-7, where they were paired with the skirt belonging to Dalek 7, which also appears in The Power of the Daleks. Dalek Six-5‘s skirt, whose ORIGINAL_PROP was Dalek 5, was at some other time paired with the shoulders from Dalek 1 to create the composite prop Dalek One-5. (To find out which story Dalek One-5 appeared in, all we’d have to do is expand the graph beyond the nodes shown here.)

Querying the Data

The Dalek prop data captures many of the significant domain concepts and relationships described in the Dalek 6388 research. Once we’d built the dataset, I set about asking it some tricky questions. At the top of my list of pressing questions was this one: what’s the hardest working Dalek prop part in showbiz? That is, which of those shoulders and skirts has appeared in more stories than any other part?

Neo4j offers several different ways to query a graph. There’s the Core API, which allows you to deal imperatively with nodes and relationships, and the original Traverser API, which offers a slightly more declarative route into a graph. For more sophisticated traversals there’s the Traversal Framework, which, being more powerful than the Traversal API, has a slightly steeper learning curve. For querying the server there’s the REST API. Next, there’s Gremlin, a powerful graph traversal language with a plugin for Neo4j, and the Pattern Matching library, which is like a regex language for graphs. And finally, introduced in the latest 1.4 milestones, there’s Cypher, a new declarative graph pattern matching language – a SQL for graphs.

In the rest of this post we’ll look at three different ways of figuring out which was the hardest working Dalek prop part. In the first example we’ll look at the Traversal Framework. In the next we’ll use the Pattern Matching library. Finally, we’ll use Cypher.

There’s quite a bit of code here. If you haven’t the patience to wade through it all, do just one thing: go look at the Cypher query. It’s a thing of Dalek-like precision.

Traversal Framework

Here’s what the query to find the hardest working Dalek prop part looks like using the Traversal Framework:

This example shows an idiomatic usage of the Neo4j Traversal Framework. First, we lookup a starting node for the traversal in an index. In this instance, we look up the Dalek species node. Then we build a traversal description and execute the traversal from this starting node. The traversal crawls the subgraph below the starting node. With each node that it visits, it calls the evaluate() method on the custom evaluator supplied with the description. This method determines whether the traverser is positioned over a node of interest (in this case, an original prop node). If it is, the traverser returns the path to this node to the calling code, and then begins a new travesal further up the subtree. If it isn’t a node of interest, the traverser continues deeper into the subgraph.

The description specifies the types of relationship we’re prepared to follow, and their direction. Every relationship in Neo4j has a type label and a direction. The direction helps provide semantic clarity when dealing with asymmetric relationships between nodes. While a relationship between two nodes is always directed – it has a start node, a direction, and an end node – a traverser can be programmed to follow either an INCOMING or an OUTGOING relationship, and even to ignore the direction entirely (using Direction.BOTH).

Note that we’ve specified that the traversal proceed depth first. This means that from the Dalek species node it will strike out on a depth first hunt for an original prop, following the first APPEARED_IN relationship to an episode and then searching the nodes below that episode. Once it has found a first matching original prop, the traverser returns the path to the prop to the calling code (INCLUDE_AND_PRUNE), and then moves successively back up that path, trying alternate branches, until it works its way all the way back to the species node, from where it tries the next APPEARED_IN relationship.

Here’s the route the traverser takes from the starting Dalek species node through all the nodes below The Power of the Daleks:

(Dalek)-[:APPEARED_IN]->(The Power of the Daleks)(skirt)-[:ORIGINAL_PROP]->(Dalek 7) INCLUDE_AND_PRUNE

When it’s run, our traverser will return all the paths from the Dalek species node to the original props nodes associated with each prop part. Each path contains all the details of the nodes and relationships along that path, starting with the Dalek species node and ending with the original prop node, with the relevant episode and prop part nodes somewhere in between.

By itself, the traverser result doesn’t tell us which prop part has been used the most; for that, we need to do some accumulation. To generate the counted and sorted results, our code calls getSortedResults(), which is simply a helper method that delegates to a ResultsAssembler instance. This ResultsAssembler is responsible for accumulating the results for each prop part.

The ResultsAssembler builds a map of prop part keys and episode count values. The assembler uses a QueryResultAccessor to retrieve the original prop name and prop part from a query result. Here the query result is a path, but when we get round to the next example, which uses the Pattern Matching library, it’s a pattern match object. Hence the generic parameters.

Here’s the PathAccessor we use to access the original prop name and prop part from a path. The original prop details are held in the last node in the path, while the prop part details are held in the penultimate node. The getOriginalProp() method can retrieve the original prop details quite easily from the path’s endNode() property. To get the part name (shoulders or skirt) the getPath() method uses the path’s nodes() iterator and an iterator helper function to access the last but one node.

Having accumulated and sorted the results, our unit test can then assert that the results contain the top three most used prop parts: Dalek 1‘s shoulders, Dalek 5‘s skirt, and Dalek 6‘s shoulders. Each of these parts appeared in 12 stories.

What do Dalek 1‘s shoulders look like today, nearly 50 years after they first appeared on TV? Dalek 6388, of course, has the answer.

Pattern Matching

The Traversal Framework helped us find the hardest working Dalek prop parts in showbiz, but we still had to do quite a bit of work to accumulate the results. While the traversal description itself was relatively trivial, having to iterate nodes in each of the returned paths proved quite tedious. With the Pattern Matching library we can avoid some of the messy iterative code that cropped up in the PathAccessor.

Using the Pattern Matching library we create an abstract subgraph that describes the graph patterns we want to match in our real graph. When matching, we anchor this abstact graph to a starting node in the real graph, much as we started the traversal in the previous example from a node we’d looked up in an index. The matcher then matches subgraph instances corresponding to the nodes, relationships and constraints defined in our abstract subgraph.

The pattern node objects that we define as part of our abstract subgraph can then be used as keys into the match results. Below is the getSortedResults() method and MatchAccessor class for the pattern matching example. You’ll notice that we pass the originalProp and part pattern node objects into the MatchAccessor. The accessor then calls the match’s getNodeFor() method with one or other of these pattern nodes in order to retrieve the real, matched node.

Cypher

The Pattern Matching example still required us to accumulate the results in order to work out which was the hardest working prop part in the original series. The abstract pattern nodes provided convenient keys into matched nodes, but we still had then to pass the matched nodes’ contents to the ResultsAssembler in order to build a map of prop parts and usage counts. With Cypher, all that manual accumulation disappears. Better still, the entire query is expressed using a very simple, declarative syntax. Unencumbered by a “fluent” interface, a Cypher query concisely describes our desired route into the graph:

Spend a couple of moments looking at it and you’ll soon appreciate its simplicity and declarative expressiveness. If only the Daleks had such power…

Cypher is still in its infancy, and its syntax is still in flux. In the 1.4 release timeframe, which culminates this week with the release of Neo4j 1.4 GA, Cypher performs only non-mutating operations. Over the course of the 1.5 release, we expect to add mutating operations.

The Tutorial

Our Neo4j tutorial is freely available from GitHub. Download it now and try some of the exercises. And keep checking back for updates: as Neo4j grows, so will the tutorial.

Jim and I are also running one- and two-day versions of the tutorial at a number of conferences and other events:

Hi there! My name is Lindsay, and I’m the Community Manager at DZone, a website for article and link sharing and discussion for the developer community. We were hoping, with your permission, to republish some of your articles pertaining to NoSQL (including this great one, loving the Dalek connection, I’m working my way through the new Dr. Whos myself now, I recently met Martha Jones!). The articles would go onto the DZone site and would gain a lot of new readership. I’d also like to talk to you about our free program for bloggers such as yourself, it is called the Most Valuable Blogger (MVB) program. If you are interested in letting us republish your work or in learning more about the MVB program please contact me at lgordon (at) dzone (dot) com. I promise that this is not spam and I hope to hear from you!

A single instance of Neo4j supports approx 32 billion nodes, 32 billion relationships, and 64 billion properties. The Enterprise edition provides High Availability capabilities in the form of a master-slave replicated cluster – more details in the manual: http://docs.neo4j.org/chunked/stable/ha.html

Querying is fastest when the portion of the graph to be traversed is in main memory: on my MacBook Air, I can traverse approx 1.5 million nodes per second using the Core API. The traversal frameworks and Cypher are around a third as fast, though Cypher being a declarative query language, we foresee there being plenty of opportunities to create optimised Cypher query plans, and so speed up performance.

Neo4j excels at graph local queries – that is, contextualized queries where you identify one or more start nodes (often using an index to lookup these start nodes) and ‘crawl’ or pattern match the surrounding portions of the graph. (See http://markorodriguez.com/2011/03/30/global-vs-local-graph-ranking/ for more details on graph local queries).