Over the past few weeks I’ve been modelling ThoughtWorks project data in Neo4j, and I’ve realised that my approach has been to decide what question I want to answer and then build a graph to answer it.

When I first started doing this the main question I wanted to answer was ‘how connected are people to each other’ which led to me modelling the data like this:

The ‘colleagues with’ relationship stored information about the project the two people had worked on together and how long they’d worked together.
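As a rough illustration of this first design, here is a minimal sketch in plain Python rather than Neo4j. The names, projects and durations are made up, and the property keys (`project`, `months`) are my own assumption about how the relationship data might be laid out. It answers the ‘how connected are people’ question with a breadth-first search over the ‘colleagues with’ edges.

```python
from collections import defaultdict, deque

# Hypothetical data: each 'colleagues with' edge carries the shared project
# and how long the two people overlapped on it.
colleague_edges = [
    ("alice", "bob", {"project": "Project X", "months": 6}),
    ("bob", "carol", {"project": "Project Y", "months": 3}),
    ("alice", "dave", {"project": "Project X", "months": 2}),
]


def build_adjacency(edges):
    """Treat 'colleagues with' as undirected and index neighbours by person."""
    adjacency = defaultdict(set)
    for a, b, _props in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degrees_of_separation(adjacency, start, goal):
    """Breadth-first search; returns the hop count, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbour in adjacency.get(person, ()):
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


adjacency = build_adjacency(colleague_edges)
print(degrees_of_separation(adjacency, "alice", "carol"))
```

Note that the project only exists as a property on the edge here, which is what makes project-centric questions awkward.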

This design was fine while that was the only question I wanted to answer, but after I showed it to a few people it became clear that there were other questions we could ask which would be difficult to answer with this design.

e.g.

Which people on project X have I never worked with?

Which person has worked for client X for the longest?

Which people worked together on the same client if not the same project?

I therefore needed to make ‘client’ and ‘project’ first-class entities in the graph rather than leaving them implicit, which favours a design more along these lines:

It makes it a little more difficult to answer the initial question about connections between people but opens up the answers to other questions such as the ones detailed above.
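To make the trade-off concrete, here is a hedged sketch of the second design, again in plain Python rather than Neo4j and with entirely hypothetical data. Projects and clients are now things you can look up directly, so the three questions above become short set operations, while ‘who has this person worked with’ has to be derived by walking person → project → person.

```python
from itertools import combinations

# Hypothetical data: which client each project belongs to.
project_client = {
    "Project X": "Client A",
    "Project Y": "Client A",
    "Project Z": "Client B",
}

# Hypothetical data: (person, project, months spent on it).
worked_on = [
    ("alice", "Project X", 6),
    ("bob", "Project X", 4),
    ("dave", "Project X", 2),
    ("carol", "Project Y", 9),
    ("bob", "Project Z", 3),
]


def people_on(project):
    return {person for person, proj, _ in worked_on if proj == project}


def colleagues_of(person):
    """Derived, not stored: everyone who shares at least one project."""
    projects = {proj for p, proj, _ in worked_on if p == person}
    return {p for p, proj, _ in worked_on if proj in projects and p != person}


def never_worked_with(person, project):
    """Which people on `project` has `person` never worked with?"""
    return people_on(project) - colleagues_of(person) - {person}


def longest_serving(client):
    """Which person has the most total months across the client's projects?"""
    totals = {}
    for person, proj, months in worked_on:
        if project_client[proj] == client:
            totals[person] = totals.get(person, 0) + months
    return max(totals, key=totals.get) if totals else None


def same_client_not_same_project(client):
    """Pairs who worked for `client` but never shared any project."""
    people = sorted(
        {p for p, proj, _ in worked_on if project_client[proj] == client}
    )
    return {(a, b) for a, b in combinations(people, 2)
            if b not in colleagues_of(a)}


print(never_worked_with("carol", "Project X"))
print(longest_serving("Client A"))
print(same_client_not_same_project("Client A"))
```

This sketch ignores dates, so two people on the same project count as colleagues even if their time on it never overlapped; a real model would need start and end dates on the relationship.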

I’m still getting used to this way of modelling data but it feels like you’re driven towards designing your data in a way that’s useful to you as opposed to the relational approach where you tend to model relations and then work out what you want to do with the data.

I like that it is a “different way of thinking.” No doubt you could also model “Client” as a table and “Project” and “Person” as other tables, with relationships between them, but it sounds like maybe the graph approach helped you discover these relationships?

Is it possible to “traceback” through the edges to project and client, to infer that original edge directly between people?

In your final model, are there multiple nodes for each person, one for each project they were on, or is the same node hit by multiple edges coming from the different projects?

The neat thing about graphs is that multiple subgraphs can live in the same data-space. There’s no reason not to keep the original Person-colleague->Person relationships in addition to the Client–>Project–>Person relationships. You can query only the relationships you want and ignore the ones that are irrelevant for any given question. Nodes can have as many relationships of as many different types as you need.

If you haven’t already, check out Cypher for some of the neat queries you can do with multiple relationships. For instance, assuming you kept the colleagues relationships, the answer to “Which people on project X have I never worked with?” can be found with:
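The idea can be sketched in plain Python (this is not Cypher, just an illustration of the logic). The relationship type names `COLLEAGUES_WITH`, `HAS_PROJECT` and `HAS_MEMBER` and all of the data are hypothetical. Both kinds of relationship sit in one edge list, and each lookup filters on the type it cares about.

```python
# Hypothetical mixed edge list: (source, relationship type, target).
edges = [
    ("Client A", "HAS_PROJECT", "Project X"),
    ("Project X", "HAS_MEMBER", "bob"),
    ("Project X", "HAS_MEMBER", "carol"),
    ("Project X", "HAS_MEMBER", "dave"),
    ("me", "COLLEAGUES_WITH", "bob"),
    ("carol", "COLLEAGUES_WITH", "me"),
]


def related(node, rel_type):
    """Nodes connected to `node` by `rel_type`, ignoring edge direction."""
    found = set()
    for source, rel, target in edges:
        if rel != rel_type:
            continue  # skip relationships irrelevant to this question
        if source == node:
            found.add(target)
        elif target == node:
            found.add(source)
    return found


def never_worked_with(me, project):
    """People on `project` with no colleague relationship to `me`."""
    return related(project, "HAS_MEMBER") - related(me, "COLLEAGUES_WITH") - {me}


print(never_worked_with("me", "Project X"))
```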

@jadell:disqus I hadn’t thought about the subgraph idea, but that sounds pretty neat. I guess I initially ruled it out because it seemed like it would create some duplication in the graph – the normal form lectures from my university DB course coming back to haunt me!

Anything that’s new is straightforward enough. You might have to compose a nifty query to find the people you want to attach to an office, but growing the graph is unlikely to cause any significant issues.
And changing the structure of the database should usually be possible by traversal and transformation.

Graphs are the future 😉

Balaji

I kinda think that we would have ended up with a similar relationship graph if we had normalized our tables in a relational DB…