A blog on technology, data, NYC, and more

Main menu

Category Archives: NoSQL

I have been interested recently in NoSQL databases as a way to handle large collections (I recommend NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence as an introduction). I have touched keystores and BigTable in previous internships, but never a graph database. Therefore, here is my attempt at using one on my laptop to create a dummy submission for the Million Song Dataset Challenge (on music recommendation). This post will be about installing the database, loading the training data from the challenge, and perform basic queries. The second post will cover walking through the graph and making recommendations based on “people would listened to this also listened to that”. This is not intended to be a fully documented example! My intent is to gather pointers to the relevant information so I (and hopefully, you) can get started faster the next time.

Neo4j

Why Neo4j? I could have chosen something else, but it’s open-source, works on linux/max/windows, there is a good amount of documentation online, and for non-intensive use you can interact with it from most languages. Also, now that I understand it better, it’s trivial to install!

Installing Neo4j

It’s so easy! I use Ubuntu 12.04, it took me 30 seconds to be up and running. Download a tar package, untar, launch with

bin/neo4j start

and that’s it! You have a database contained in a folder “data”, a server running to interact through a REST API, and a dashboard that looks like this (counters would be 1 – 0 – 0 – 0):

It really takes a matter of seconds to complete the installation on one machine. Of course, the challenge starts now.

Interacting: Python + REST API

Not knowing where to start, I chose my language of choice, Python. I used this Neo4j client which can be installed using easy_install. Basic commands are as follow:

The code above connects to the database (the webserver must be running) and creates two nodes, each with two properties: ‘name’ and ‘id’. In our experiments, nodes will be either users or songs. In IPython, you can easily inspect the properties of a node:

In [6]: n1.properties
Out[6]: {u'id': u'123', u'name': u'song1'}

A graph database is made of nodes, relationships (directed edges between nodes), and properties (metadata/info on nodes or relationships). We saw how to make nodes, relationships are as easy. Below we create, than delete, a relationship of the type ‘listened_to’ between two nodes n1 and n2:

rel = n2.relationships.create("listened_to", n1)
rel.delete()

Once a node has a relationship ‘rel’, you can find the other end of that relationship. In our case, n2 is a user that “has listened to” n1, a song. Remember that relationships are directed. In IPython, basic relationship traversing is as follow (note ‘outgoing’ versus ‘incoming’):

Indexing

If you have a node object, you can walk along its relationships to other nodes and find paths with different constraints, great things we will discuss in the second post on this subject. But a more pressing matter is to find a node that you inserted in the database. We assume that you have some form of ID that you put in the properties of that node (e.g., ‘user_id’ for users). We need indexes!

Creating an index is easy, and then you must insert each node you create into it. Note that we haven’t explored Neo4j auto-indexing yet. Making and using an index looks like:

Cypher

Cypher is a graph query language for Neo4j. It is inspired by SQL and borrows from SPARQL. It sounds like the proper way to use Neo4j if you get serious about it, but we haven’t explored it much yet. However, one command that’s useful when experimenting is how to delete everything in the database (except the root node):

Batch Inserting the MSD Challenge data

We know the basic commands, the goal is now to upload the training file of the Million Song Dataset Challenge. The direct link is (careful it’s 500MB) here. Unzipped, it’s more than 2GB. It contains info about >1M users that have listened to >300K songs. Each of the 50M lines is a triplet (user, song, playcount):

we have one type of relationship, ‘listened_to’, that goes from a user to a song

‘listened_to’ has one property, playcount

we have two indexes: song_index and user_index (they record the song/user ID)

We could insert all that data (1.5M nodes, 50M relationships) using the Python commands we just saw, but after doing that for an hour, it’s clearly a dead end in term of speed.
Therefore, we have to use batch inserting! It is intended to jump start a database with a lot of data, exactly our case. Long story short, Neo4j is developed in Java, and through Java you can interact directly with the database files, there is no more efficient way.

Following the example Neo4j provides, we created a Java script to go through the data file, and (in order):

insert all 1M users (with indexing)

insert all 300K songs (with indexing)

enter all relationships (by querying the indexes to find the proper nodes)

It is not the most efficient, we should have entered only the songs, than enter all relationships for one user at a time so we don’t have to search for that user in the index. But hey, we tried, it worked! The full code is available here: MSDCBatchInsert.java. You will need to stop the Neo4j server before launching the insert code, it needs to have an exclusive lock on the database. FYI, the section that inserts songs looks like this:

On my Ubuntu laptop with 4GB of RAM, core i3 CPU, inserting all the nodes took 3 minutes, inserting all the relationships took 90 minutes. Not bad! Now my dashboard looks like:

The MSD Challenge training data is loaded in the graph database, now we need to use it. We’ll do so in our second post on the subject, where we’ll try to make simple recommendations by walking along the edges. Stay tuned!
T.