Some recommendations in Neo4j

Not so long ago, I attended Graph Café, a nice type of meetup with no formal agenda, just a series of lightning talks with beer-inspired breaks in between. Graph Cafés are organized every so often by the Graph Database - Amsterdam meetup group. Naturally, my friend Rik van Bruggen had kindly, and without any pressure, asked me to do a lightning talk as well, so I had to come up with something. Neo4j 2.0 had recently been released, and this Graph Café was held to celebrate the occasion. One of the new features in 2.0 is a major overhaul of the Cypher query language. I wanted to find out how much functionality you could build with the query language alone. My challenge: implement different forms of recommendations purely in Cypher. And I'm not talking about the basic co-occurrence counting type of recommendations, but something less trivial than that. My lightning talk ended up being about creating a naïve Bayes classifier in Cypher, which kind of worked. Obviously, I couldn't just leave it at that, so I have now implemented both the classifier and collaborative filtering (based on item-item cosine similarity) purely in Cypher. This post shows how.

Your Meetup.com neighborhood in a graph

In order to do recommendations, we need a dataset with things in it that can be recommended. For this, I chose to grab data for a number of Meetup.com groups through their excellent API. This allows us to retrieve groups, their members, the members' RSVPs, the members' interests and much more. Our graph contains several meetup groups with all their members, the topics that are of interest to members and to groups, and all RSVPs by members for each group. These are the possible relations and node labels (if this looks awkward to you, go read about Cypher first):

The graph is built using a simple Python script that makes the required API calls and populates the database using py2neo. Before we can do useful matching against the database, we need to create some indexes, so we can find things by name. Luckily, in the new Cypher, this is super easy!

Now, let's find me (Friso van Vollenhoven) and show all my group memberships, the topics I like, the topics my groups discuss and the events that I RSVP'ed for:

Are you a graph database person?

In our example, we are going to look for people that we can recommend the Graph Database - Amsterdam meetup group to. Basically, for each person in the database, we can ask: are you a graph database person? We will try to predict who would answer yes to that question. Those people will be the ones we can recommend the graph database group to.

So what makes a person a graph database person? One way to approach this is to look at the topics that typical graph database people like versus the topics that other people like. We use each topic as a binary feature of a person (the person either likes it or not) and we will create a naïve Bayes classifier based on these features. For training we will assume that people already in the graph database group are graph database people and that people who are in some other, unrelated group are non-graph database people.

In order to use the liking of topics as binary features for our classification, we need to determine the probabilities that a person likes a certain topic, both in general and for each of the two classes that we'd like to classify (graph database person and non-graph database person). We can do this easily by just counting how many people like each topic. We do this for the entire training population and for both classes. Also, we will split the labeled data into two parts, so we can use one part as training data and the other as test data to verify the accuracy of our classifier.

The result shows us how many people went into the training data set for each group. We need to remember those numbers for later reference, as they will be the denominators in some of the calculations we'll do.
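To make the split concrete outside the database, here is a minimal Python sketch of the same idea. The member names, the `split_half` helper and the fixed seed are assumptions for illustration only; the post itself does the equivalent in Cypher.

```python
import random

def split_half(members, seed=42):
    """Shuffle members reproducibly and split them into (training, test) halves."""
    shuffled = sorted(members)              # sort first so input order does not matter
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

# Hypothetical labeled population: class label -> member names.
labeled = {
    "graph": ["ann", "bob", "cas", "dee"],
    "other": ["eve", "fay", "gus", "hal", "ivy", "jon"],
}

training, test = {}, {}
for label, members in labeled.items():
    training[label], test[label] = split_half(members)

# Class sizes in the training set: the denominators used in later calculations.
training_sizes = {label: len(members) for label, members in training.items()}
total_training = sum(training_sizes.values())
```

Splitting per class keeps both classes represented in the training and the test half.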

Learning by counting

Let's look at how liking different topics adds to the chance of someone being a graph database person. For this, we also need to know the total number of people in the training data. Of course we can add up the numbers above, but where's the fun in that? Let's query for it.

The nice thing about naïve Bayes is that for binary features, it's mostly just counting and multiplying. The problem is that we need a way to remember these counts. Cypher is stateless and declarative, so we have no way to keep things around in memory in between queries (AFAIK). To work around this, we are just going to store the counts in the graph itself. First, we set the likes from all members on each topic. Notice how we add 1 to the actual like count. We will later see why this is.

Cool. Now we have all ingredients to see if we can classify a member as a graph database person. Let's give it a try with my friend Rik. You can see in the query that we use coalesce to account for topics that we have not seen in our training data. We give these topics a default value of 1. However, this would not be fair to the topics that are actually present once. As a solution we could add 1 to those topics, but that in turn would not be fair to the topics that were actually present twice, which is why we add 1 to all the topic counts. This is the +1 that we saw in the earlier queries where we set the counts on the topics. It is a form of smoothing the data, based on the assumption that really rare properties occur less often than the ones in the training data.
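The counting and the default value can be illustrated with a small Python sketch. It is not the Cypher used in this post; the helper names and the example likes are made up, and `like_count` only mimics what coalesce with a default of 1 does for unseen topics.

```python
from collections import Counter

def smoothed_like_counts(liked_topics_per_member):
    """Count likes per topic and add 1 to every count (add-one smoothing)."""
    counts = Counter()
    for topics in liked_topics_per_member:
        counts.update(set(topics))          # a member counts at most once per topic
    return {topic: count + 1 for topic, count in counts.items()}

def like_count(smoothed, topic):
    """Look up a stored count; topics never seen in training fall back to 1."""
    return smoothed.get(topic, 1)

# Hypothetical training members and the topics they like.
training_likes = [
    ["neo4j", "big data"],
    ["neo4j"],
    ["big data", "python"],
]
stored = smoothed_like_counts(training_likes)
```

With these assumptions an unseen topic gets 1, a topic liked once gets 2, and a topic liked twice gets 3, so the ordering between rare and unseen topics is preserved.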

Now, let's look at how the different topics that Rik likes add to the likelihood that he is or is not a graph database person. We take a look at all topics that Rik likes and return the probability of a graph database person liking each of them versus the probability of a non-graph database person liking it. We can use these to determine the conditional probabilities of someone being a graph database person based on the presence of a topic using Bayes' theorem.

Liking Neo4j and Graph Databases really increases the likelihood of being a graph database person. What a surprise! Topics exclusively present in the graph database training group will have a 1.0 probability, which sometimes shows up as slightly more than 1.0 because of rounding errors. We also return a boolean flag that tells us whether the topic was present in the training data. You can see that topics not present in either class in the training data return a 1.0 probability on both sides, which is counterintuitive (and wrong), but for the classification it doesn't matter.

Once more, what a surprise. Rik is likely to be a graph database person! Note that we are not using the denominator as above. We can do this, because it's the same for both classes and we only care about which of the two results is larger.
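Both calculations can be sketched in a few lines of Python. This is an illustration under assumptions, not the queries from this post: the function names and all numbers are made up, the counts are taken to be the stored like counts (actual likes plus one), and every probability is assumed to be a simple ratio of a count to a group size.

```python
def topic_posterior(likes_class, likes_all, n_class, n_total):
    """P(class | topic) via Bayes' theorem, from smoothed like counts."""
    p_topic_given_class = likes_class / n_class
    p_class = n_class / n_total
    p_topic = likes_all / n_total           # the denominator in Bayes' theorem
    return p_topic_given_class * p_class / p_topic

def class_score(topics, likes_in_class, n_class, n_total):
    """Unnormalized naive Bayes score: P(class) times the product of P(topic | class)."""
    score = n_class / n_total
    for topic in topics:
        score *= likes_in_class.get(topic, 1) / n_class   # unseen topics default to 1
    return score

# Hypothetical training sizes and smoothed like counts per class.
n_graph, n_other = 50, 150
n_total = n_graph + n_other
graph_likes = {"neo4j": 41, "java": 11}
other_likes = {"neo4j": 3, "java": 61}

member_topics = ["neo4j", "java", "knitting"]
graph_score = class_score(member_topics, graph_likes, n_graph, n_total)
other_score = class_score(member_topics, other_likes, n_other, n_total)
predicted = "graph" if graph_score > other_score else "other"
```

Under these assumptions a topic that only occurs in one class, or in neither, yields a posterior of 1.0 up to floating point rounding, which matches the behaviour described above. With many topics the product becomes very small, so summing logarithms instead of multiplying is a common alternative.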

The next obvious question now is: how accurate are those results? Because we kept half of the data in our labeled data set apart as a test set, we can now use that to figure out how accurate our classifier is by creating a confusion matrix (although it doesn't look like a matrix in our output, but you get the point). Let's have a look.

As it turns out, we have 10 false positives, so we wrongly classify about 7% of non-graph database people as graph database people. If we wish to improve on this, there are several options. One is to investigate the details of the false positives by doing manual exploratory analysis and as a result of that come up with potentially better features. The other, obvious one is: MORE DATA! Go for the latter if you can; it's cheaper than spending numerous hours improving your model.
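For reference, tallying such a confusion matrix comes down to counting pairs of actual and predicted labels. The Python sketch below uses made-up labels and a hypothetical `confusion_counts` helper; it does not reproduce the numbers from the test set above.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Tally (actual label, predicted label) pairs over the test set."""
    return Counter(zip(actual, predicted))

# Hypothetical test labels and classifier output.
actual    = ["graph", "graph", "other", "other", "other", "other"]
predicted = ["graph", "other", "other", "graph", "other", "other"]

matrix = confusion_counts(actual, predicted)
false_positives = matrix[("other", "graph")]
true_negatives = matrix[("other", "other")]

# Share of non-graph database people wrongly classified as graph database people.
false_positive_rate = false_positives / (false_positives + true_negatives)
```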

Production ready?

The above solution works. However, there is one thing: we need to store the like counts for topics in the graph itself for things to work. The good thing is that this actually creates a denormalized (does that exist in a graph database?), pre-aggregated view of some of the data required for the classification. This makes the classification process faster. On the downside, setting and updating the counts is a graph-global operation which also writes back to the graph.

If we were to do this classification just once and then forget about it, it would be nicer to keep the counts in memory and not in the graph. We could open a feature request for a Cypher-based scripting language that allows setting variables during script execution which are reusable throughout the script.

On the other hand, if you need to run the classifier all the time, it would be nicer to have the counts stored in the graph, but keep them updated when things change. Another feature request: triggers.

Targeting entire groups

Classifying each person individually is a lot of work, and scaling such an approach can be problematic. Can't we just target entire meetup groups that somehow resemble our graph database meetup group? Of course we can. We can use collaborative filtering to figure out which groups are most similar to ours.

The absolute simplest (stupidest) thing you can do is just assume that groups that have the most members in common with our group are most similar and hence will also like graph databases (as a group). The issue with this is that it tends to favor larger groups over smaller ones (because they will have more members in common). Because of that, we will normalize the count of members in common by the target group's size.

This result implies that we should target the members of the Netherlands Cassandra Users meetup group when we are looking for new members for the graph database group. This seems nice, but there are still issues. First of all, we see that the Open Web Meetup scores relatively high (ranked 3rd), but it is a very small group, so it needs only a few members in common to reach a large fraction in common, and the result may not be very significant. We could try to take the number of members in the target group into account to create a confidence interval of which we could use the lower bound as the actual score (here's a nice explanation of the concept), but we'll save that maybe for later.
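As a rough illustration of both the normalization and the confidence interval idea, here is a Python sketch. It assumes the Wilson score interval as the interval in question, which this post does not confirm; the function names and member sets are made up.

```python
import math

def fraction_in_common(our_members, their_members):
    """Share of the target group's members who are also members of our group."""
    return len(our_members & their_members) / len(their_members)

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

# Hypothetical groups: three of the small group's four members are also ours.
ours = {"ann", "bob", "cas", "dee"}
small_group = {"ann", "bob", "cas", "zed"}

raw_score = fraction_in_common(ours, small_group)
small_group_score = wilson_lower_bound(3, 4)       # 3 of 4 members in common
large_group_score = wilson_lower_bound(300, 400)   # same fraction, 100 times the size
```

Both groups have the same raw fraction in common, but the lower bound for the small group is much lower, so it would no longer outrank a large group on the strength of a handful of shared members.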

Another issue with this result is that it only looks at group membership as input. It is debatable whether group membership is a really strong indication of a member's interest. Anyone can be a member of many meetup groups, as joining is easy and free. Perhaps a lot of people join a group just to check it out once and then never interact with the group anymore, but are still listed as a member. In order to work around this, we are going to use the RSVPs of people to see how often they actually, physically interact with the group. More RSVPs means more interest in the group. Once more, we of course need to normalize for the number of meetups a group actually organizes. We are going to use the number of meetups attended as a fraction of the meetups organized by a group as a score for the interest that each member has in a group. We can use these scores to calculate the similarity of two groups based on the similarity in the way members interact with them. One way of doing this is collaborative filtering using cosine similarity as a similarity measure (as opposed to pure co-occurrence in the example above). So, here it is:

And in the results you can see that: a) when you don't just look at membership, but at meetup attendance, there are only three other groups that have actual co-occurrence with the graph database group and b) the Amsterdam Applied Machine Learning Meetup Group now scores better than the Cassandra group.
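To make the similarity measure itself concrete, here is a minimal Python sketch of cosine similarity over attendance scores. It is an illustration with made-up members, RSVP counts and helper names, not the Cypher query used for the results above.

```python
import math

def attendance_scores(rsvps_per_member, meetups_organized):
    """Score each member by the fraction of the group's meetups they RSVP'ed for."""
    return {m: count / meetups_organized for m, count in rsvps_per_member.items()}

def cosine_similarity(scores_a, scores_b):
    """Cosine similarity between two sparse vectors given as member -> score dicts."""
    dot = sum(s * scores_b[m] for m, s in scores_a.items() if m in scores_b)
    norm_a = math.sqrt(sum(s * s for s in scores_a.values()))
    norm_b = math.sqrt(sum(s * s for s in scores_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical RSVP counts per member and number of meetups organized per group.
graph_group = attendance_scores({"ann": 4, "bob": 2}, 4)
same_pattern = attendance_scores({"ann": 10, "bob": 5}, 10)
partial_overlap = attendance_scores({"ann": 1, "cas": 1}, 2)
no_overlap = attendance_scores({"zed": 3}, 3)

sim_same = cosine_similarity(graph_group, same_pattern)
sim_partial = cosine_similarity(graph_group, partial_overlap)
sim_none = cosine_similarity(graph_group, no_overlap)
```

Groups whose members attend in the same proportions score close to 1, and groups without any attendees in common score 0, regardless of how many members each group has.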