Pages

Friday, 6 December 2013

Untying the Graph Database Import Knot

Working for Neo Technology has many, many upsides. I love my job, love my colleagues, love our product, love our market - I think you can pretty much say that I am a happy camper. But. There's always a but. At least a couple times a week I am confronted with things that make me go "Oh no, not that again!" And "that" is usually about one particular topic: Importing data into Neo4j. Many, smart people are having trouble with it - and there are many reasons for this. So let's start zooming into this Gordian Knot - and see if we can untie it - without having to cut it ;-) ...

The Graph Database Import Knot

The first thing that everyone should understand that, in a connected world, importing data is, per definition more difficult to do. It is a true "knot" that is terribly difficult to untie, for many different reasons.

Just logically, the problem of importing "connected" data is technically more difficult than with "unconnected" data structures. Importing unconnected data (eg. the nodes of your graph model) is always easy/easier. Just dump it all in there. But then you come to importing the connections, the relationships, and you find that there's no such thing as an "external entity" (aka "the database schema") that is going to be ensuring the consistency and connectedness of the import. You have to do that yourself, and explicitly, by importing the relationships between a) a start node that you have to find, and b) an end node that you have to lookup. It's just ... more complicated. Especially at scale, of course.

So how to untie this knot? I can really see two steps that everyone needs to take, in order to do so:

Understand the import problem. Every import is different, just like every graph is different. There is little or no uniformity there, and in spite of the fact that many people would love to just have a silver bullet solution to this problem, the fact of the matter is that there is none - at least not today. So therefore we will have to "create" a more or less complex import solution for every use case - using one of the tools at hand. But like with any problem, understanding the import problem is often the key to choosing the right solution - so that's what I will focus on here as well.

Pick the right tool. There are many tools out there, and we should not be defeated by the law of the instrument - and use the right tool for the job. Maybe, this article can help in bringing these different tools together, bring some structure to them, and then - even though I have not used all tools, but I have used a few - I can also tell you about my experiences. That should allow us to make some kind of a mapping between the different types of Import problems, and the different tools at hand.

So let's give it a shot.

YOUR import scenario

Like I said before: one import problem is different from the next one. Some people want to store the facebook social graph in neo4j, other people just want to import a couple of thousand proteins and their interactions. It's really, very different. So what are the questions that you should ask yourself? Let me try and map that out for you:

This little mindmap should give you an insight into the types of questions you should ask yourself. Some of these are project related, others are size/scale related, others are format related, and then the final set of questions are related to the type of import that you are trying to do.

The Tools Inventory

If you have ever visited the neo4j website, you have probably come across the import page. There's a wealth of information there around the different types of tools available, but I would like to try and help by providing a bit of structure to these tools:

So these tools range from using a spreadsheet - which most of use should be able to wield as a tool - to writing a custom piece of software to achieve the solution to the import problem at hand. The order in which I present these is probably very close to "from easy to difficult", and "from not so powerful to very powerful".

So let's start doing a little assessment on these tools. Note that this is by no means scientific - this is just "Rik's view of the world".

Very easy: all you need to do is write some formulas that concatenate strings with cell content - and compose cypher statements this way. These cypher statements can then just be copied into the neo4j-shell.

Only works at limited scale (< 5000 nodes/relationships at a time). Performance is not good: overhead of unparametrized cypher transactions. Quirks in copying/pasting the statements above a certain scale. Piping the statements in can work on OSX/Linux - but not on Windows.

1 comment:

really good piece of information, I had come to know about your site from my friend shubodh, kolkatta,i have read atleast nine posts of yours by now, and let me tell you, your site gives the best and the most interesting information. This is just the kind of information that i had been looking for, i'm already your rss reader now and i would regularly watch out for the new posts, once again hats off to you! Thanks a lot once again, Regards, Regards mulesoft training in hyderabad