Tuesday, 15 October 2013

Importing my Last.fm dataset - the neo4j way

Some time ago, I blogged about how you
could create an interesting graph dataset in neo4j using the data
from Last.fm. At the time, I used Talend
as an ETL tool, to do the import into neo4j – as the dataset was
quite large and the spreadsheet
method would probably not cut it anymore. It worked great. The
only downside (for this particular use case) was that ... I had to
learn Talend. Not that that is terribly difficult, especially
if you are an experienced ETL professional, which I am clearly
NOT, but there was definitely a learning curve involved. So there
continued to be a latent desire to do this import into neo4j natively,
without separate tooling. And now, I think we have that, thanks
to the ever-amazing Michael Hunger.

Michael created a collection of utilities that plug into the neo4j-shell and extend its
functionality with things like... data import.
There are different options, and you should definitely read up on the
different capabilities, but for my specific Last.fm use case, what
was important was that it can easily import the csv files that I had
created at the time for the import using Talend.

You can read up on the details of the
shell-tools in the readme (it contains very simple installation
instructions that you would need to go through beforehand,
essentially installing the .jar file in neo4j's lib directory). Once
you have done that and have shut down and restarted the neo4j server, you are
good to go.

Creating the database from scratch.

As you will see below, the steps are
quite simple:

Step 1: start with an empty neo4j database

What's important here is that the
neo4j-shell-tools work on a **running** neo4j database. You do not
need to introduce downtime, and you do not use the so-called “batchimporter” method. Instead you are doing a full-blown,
transactional, live update on the graph, using this toolset.

Step 2: prepare the .csv files

I had already prepared these files for
the previous blogpost, so that was easy. I only had to make two
changes.

First, I had to make sure that the
delimiter that I was using was right. The neo4j-shell-tools allow
you to specify the type of delimiter, and getting that wrong will
obviously lead to faulty imports.

Second, I had to add a “header” row at
the top of the text files. The neo4j-shell-tools will assume that the
first line of each .csv file defines the structure of the rest of
the file. This also means that I needed multiple files, as
the nodes and the relationships that I wanted to add each have a
different structure/type.
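
Both changes can be checked with a few lines of Python before starting the import. This is only a minimal sketch and not part of the neo4j-shell-tools: the helper names, the column names and the sample rows are hypothetical, and it assumes the “;” delimiter used in this post.

```python
import csv
import io


def sniff_delimiter(text, candidates=";,\t"):
    """Guess the delimiter by counting each candidate on the first line."""
    first_line = text.splitlines()[0]
    return max(candidates, key=first_line.count)


def add_header(text, columns, delimiter=";"):
    """Return the csv text with a header row prepended."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    return out.getvalue() + text


# Hypothetical sample data: two rows without a header.
raw = "1;Radiohead\n2;Portishead\n"

delimiter = sniff_delimiter(raw)
with_header = add_header(raw, ["id", "name"], delimiter)
print(with_header)
```

The same two functions can be pointed at the contents of a real file by reading it into a string first.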

So I ended up with 2 .csv files to add
nodes to the graph, and 7 .csv files to add the relationships between
the nodes. You can download everything here.

Step 3: prepare the import commands

-d defines
the delimiter of the file that we are importing. In this case, a
“;”.

-i
defines the input file. On OS X, if you do not add a path, the tool will look for
the file in the root of your neo4j installation directory. In many
cases you will want to use an absolute path, or a path relative to
that directory.

-o
defines an optional output file where the result of the import
commands will be written. This is intended for logging purposes.

And then finally, with the
“create...”
section of the command, we define the Cypher query that will do
the import transaction, using the parameters from the csv file
(between { }) as input.
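
To make the link between the header row and the { } parameters more concrete, here is a small Python sketch. It does not call the neo4j-shell-tools and is not how the tool is implemented. It only illustrates the idea that every column name in the header becomes a parameter name, and that each data row supplies one set of parameter values. The query text, the column names and the helper names are hypothetical.

```python
import csv
import io
import re


def rows_as_params(csv_text, delimiter=";"):
    """Return one dict per data row, keyed by the header's column names."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


def placeholders(query):
    """Return the names that appear as {name} parameters in the query."""
    return re.findall(r"\{(\w+)\}", query)


def missing_columns(query, csv_text, delimiter=";"):
    """List query parameters that have no matching column in the header."""
    header = csv_text.splitlines()[0].split(delimiter)
    return [name for name in placeholders(query) if name not in header]


# Hypothetical query and data, for illustration only.
query = "create (n {id: {id}, name: {name}})"
data = "id;name\n1;Radiohead\n2;Portishead\n"

# An empty list means every parameter in the query has a column to read from.
print(missing_columns(query, data))
for params in rows_as_params(data):
    print(params)
```

A check like `missing_columns` is a cheap way to catch a typo in a header before running the real import.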

Note that the neo4j-shell-tools provide
separate functionality for dealing with large input files and
for tuning the transaction throttling (how many updates go into one
transaction), but for this dataset we really did not need
that.
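
The idea behind throttling, grouping a fixed number of updates into one transaction, can be sketched in Python as follows. This is a generic illustration of batching, not the neo4j-shell-tools implementation. The function name and the batch size are hypothetical.

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows; each list stands for one transaction."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical example: 7 updates, at most 3 per transaction.
for number, chunk in enumerate(batches(range(7), 3), start=1):
    print("transaction", number, "contains", len(chunk), "updates")
```

Smaller batches keep each transaction light, larger batches reduce the number of commits. For an import of this size the default behaviour was good enough.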

Then for the relationship import
commands, we have a very similar structure:

Note that, because of the domain model that we have from the Last.fm
dataset, some relationships have to be unique and others don't,
hence the difference in the Cypher queries.
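
Independent of the exact Cypher used, the distinction between the two kinds of relationship can be illustrated in Python. In this sketch a relationship is a (start, type, end) tuple. The relationship types and identifiers are hypothetical and are not taken from the actual Last.fm model.

```python
def keep_all(edges):
    """Non-unique: every input row produces its own relationship."""
    return list(edges)


def keep_unique(edges):
    """Unique: a (start, type, end) combination is created only once."""
    seen = set()
    unique = []
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            unique.append(edge)
    return unique


# Hypothetical input rows: the first two describe the same relationship.
rows = [
    ("user1", "LISTENED_TO", "artist1"),
    ("user1", "LISTENED_TO", "artist1"),
    ("user1", "FRIEND_OF", "user2"),
]

print(len(keep_all(rows)), "relationships without a uniqueness check")
print(len(keep_unique(rows)), "relationships with a uniqueness check")
```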

Step 4: executing the commands

Then all we need to do is to put the files in the right locations, make
sure that autoindexing is correctly defined, and then copy/paste the
commands into the neo4j-shell.

On
my MacBook Pro, the entire import took about 35 seconds, and I ended
up with the database that I had previously created with the Talend
toolset:

And then the same graphical/query exploration can begin. You can take the graphical tools for a spin, or alternatively create your own Cypher queries and get going.

Conclusion

Overall,
I found this new process to be extremely intuitive and straightforward,
even simpler than what I had experienced using the Talend toolset. I
have put the zip-file
and the corresponding input statements over here, so feel free to
download and experiment yourself. Just make sure that you put the
.csv files in the neo4j “home directory”, or adjust the paths as
you want (both relative and absolute paths seemed to work on my
machine).