Warming up the "Cypher Muscles"

This query gives us the "degree" of the nodes in our dataset (that is, the number of "TWEETS" relationships of a "Handle" node), and a first indication of what to look for further on:
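A degree query along these lines would do the job. This is a minimal sketch, not necessarily the exact query I ran: it assumes the handles carry a "realname" property, it ignores relationship direction, and the limit of 25 is an arbitrary choice.

```cypher
//degree of the Handle nodes: count the TWEETS relationships per handle
//(sketch: the realname property and the limit are assumptions)
match (h:Handle)-[:TWEETS]-()
return h.realname as Name, count(*) as degree
order by degree DESC
limit 25
```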

Turns out that this is quite interesting. Very few of the "top gun" cyclists seem to be the top Tweeters. The only one that really stands out I think is Luca Paolini - the others are basically excellent riders, but not the "top guns". At least not in my opinion/experience of the sport.

So let's take a look at the #hashtags. Which ones are most mentioned in Tweets?

"Impossible is Nothing" with the power of WITH

Because here's the thing. When I was starting to do this little project, I did not know what I was going to find. I really didn't. Of course I knew a bit about Neo4j, I am a fan of cycling, but still... it was all kind of an experiment, a jump into the unknown. So this is where I fell in love - again - with Neo4j, Cypher, and the way it allows you to interactively and iteratively explore your data - hunting out the hidden insights that you may not have thought of beforehand.

The WITH clause allows query parts to be chained together, piping the results from one to be used as starting points or criteria in the next.

So that means that I can basically iteratively query my dataset, and use the result of one iteration as input for the next iteration. Which is very powerful in my opinion. So here's what I did with the "top ranked" nodes:

First I explored which other nodes are connected to these top-ranked ones, using WITH:

//what is connected to the top NodeRanked handles
match (h:Handle)
where h.nodeRank is not null
with h
order by h.nodeRank DESC
limit 1
match (h)-[r*..2]-()
return h,r
limit 50

This gave me a nice little overview:

However, because I had to "LIMIT" the result, it felt as if I was artificially skewing the view. So let's take a second pass at this.

Second, I looked at the labels of the nodes that are directly connected to the single top-ranked node:

//what is connected to the top NodeRanked handles at depth 1
match (h:Handle)
where h.nodeRank is not null
with h
order by h.nodeRank DESC
limit 1
match (h)--(connected)
return labels(connected), count(connected)
limit 25

And I can do something very similar by just tweaking the query to find out what is connected at depth 2 or 3...

//what is connected to the top NodeRanked handles up to depth 2 (change *..2 to *..3 for depth 3)
match (h:Handle)
where h.nodeRank is not null
with h
order by h.nodeRank DESC
limit 1
match (h)-[*..2]-(connected)
return labels(connected), count(connected)
order by count(connected) DESC

The order of the result is a bit different then:

So that gave me a good feel for the dataset. Again: I think it's mostly this interactive query capability that makes it so interesting.

Betweenness on a subgraph

So then I thought back to some work that I did last year to try and implement Betweenness Centrality in Cypher. The result of that was clearly that it was pretty easy to do, but... that it would be very expensive to do so on a large dataset. I think this would be a prime candidate for another Graphaware component :) ... but let's see if we can use WITH to

first find a subgraph of interesting suspect nodes

then calculate the betweenness on these suspect nodes

Turns out that this was pretty straightforward. Here's the query:

//betweenness centrality for the top ranked nodes - query using UNWIND
//first we create the subgraph that we want to analyse
match (h:Handle)
where h.nodeRank is not null
with h
order by h.nodeRank DESC
limit 50
//we store all the nodes of the subgraph in a collection, and pass it to the next query
WITH COLLECT(h) AS handles
//then we unwind this collection TWICE so that we get a product of rows (2500 in total)
UNWIND handles as source
UNWIND handles as target
//and then finally we calculate the betweenness on these rows
MATCH p=allShortestPaths((source)-[:TWEETS|MENTIONED_IN*]-(target))
WHERE id(source) < id(target) and length(p) > 1
UNWIND nodes(p)[1..-1] as n
WITH n.realname as Name, count(*) as betweenness
WHERE Name is not null
RETURN Name, betweenness
ORDER BY betweenness DESC;

Here's the result:

As you can see - this is quite interesting. It's clear that there are a number of lesser known riders that are very "between" the top guns (in terms of PageRank).

Wrapping it up with some pathfinding

So last but not least, we need to do some pathfinding on this dataset. In my experience, that always gives away some interesting insights.

So let's experiment with two very well known riders, Tom Boonen (former world champ and multiple winner of the Tour of Flanders and Paris-Roubaix) and Alexander Kristoff (this year's winner of the Tour of Flanders). Here's the simple query:
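Something along these lines would work. Treat it as a sketch under assumptions: the "realname" values are my guess at how the two riders are stored, and I restrict the paths to the same TWEETS and MENTIONED_IN relationships used in the betweenness query.

```cypher
//shortest paths between two riders
//(sketch: the realname values below are assumed, not checked against the dataset)
match (h1:Handle {realname: "Tom Boonen"}), (h2:Handle {realname: "Alexander Kristoff"})
with h1, h2
match p = allShortestPaths((h1)-[:TWEETS|MENTIONED_IN*]-(h2))
return p
```

The first part looks up the two handles, and WITH pipes them into the second part, which does the actual pathfinding.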

As you can see we are using the same principle as above: WITH ties it all together.

That's about it, folks. There are so many other things that I would love to do with this dataset (Community detection is high on my wishlist) - but I think 5 parts to a blogpost series is probably enough :) ...

I guess you could have seen from this series of blogposts, that I am a bit into Cycling, and that I enjoy working this stuff with Neo4j. It's been a lot of fun - and a bit of effort - to get all of this done, but overall... I am pretty happy with the result.

Please let me know what you thought of it too - would love to get feedback.