Pages

Saturday, March 21, 2015

Graph Pattern Mining (gSpan) - Introduction

For those of you out there that follow this blog fairly regulary, you may have noticed that there has been a lull in the amount of posts recently. This is not because I have not been working on my next post, it is because it has taken me quite some time to understand the topic described in this post. That being said, I'm assuming (yes I know what happens when people assume) that for many of you out there, this topic would drain several precious hours of your life to understand as well. So, I'm going to share what I've learned about the gSpan algorithm. In this post I will lay the foundation by describing how data scientists create some sort of order out of all of the different possible graph configurations that can exist. In my gSpan algorithm post, I'll describe how the information presented in this post is used to find frequent graph patterns. The information in this post is based on "gSpan: Graph-Based Substructure Pattern Mining" by Xifeng Yan and Jiawei Han, September 3, 2002.

Graphs and why they're tricky to pattern mine
First of all, let's start SUPER simple. What do people mean when they talk about finding patterns in graphs? Let's just say that they aren't talking about something like this:

They are actually talking about some sort of visual representation of something, like this:

I have no idea what the graph above represents...I totally made it up. But, it could look like a chemical composition diagram to some of you out there, or maybe a social network diagram, a family history tree...whatever. Essentially, graphs are systems that have nodes (the A, B and C in the picture above) and connections between them (solid, dashed and double lines above). In graph mining terminology they call a node a vertex and a connection an edge. If you ask me why, I have no idea, but that random knowledge might help you if you ever decide to pick up a technical paper on the subject.

Now, if you have followed some of my previous posts on pattern mining (Apriori Basket Analysis, or Generalized Sequential Pattern Mining) you might look at that graph above and ask yourself, where do I start with that thing? This is a major question that have vexed many mathematicians and data scientists. Let's just start by stating the fact that, in general, when we're dealing with graphs, we don't really care about the orientation of the graph, or the how it "looks" per se; we really only care about what vertices are connected to each other and how. As an example here's that first graph again, with some other equivalent graphs.

You'll notice the term "Isomorphism" in that picture. That's just a fancy way of saying those graphs are essentially the same as graph 1 if you only consider the vertices and edges, but rearranged to look different. Since what we really want, is to be able to look for sub-graphs (or small subsets of the graphs in our database) and see if they are frequent, we would love to be able to create some sort of order for the graphs so we could kind of treat them like a sequential pattern. In sequential pattern mining we represent patterns something like this <A,(B,C)D,E,(E,F)>. There's a chronological order there: first A, then B and C, then D, then E, then E and F. But with a graph, where do you start??? For example, I can number the vertices of graph 1 several different ways

First of all, make a note that the numbering starts with the number 0; this is referred to as the root. Two of the variations above start at the C at the top (but is that really the "top"?) One starts at the A in the lower left corner, and the last one starts at the B. The order that the nodes are numbered can create a LOT of numbered graphs, but they're all different versions of the same thing. What a data miner really needs is a way to create some sort of sequential order so that we always number graph nodes/vertices the same way and therefore don't waste a whole bunch of time dealing with graphs that might be numbered differently, but represent the same thing we already have. Since, I'm focusing on the gSpan algorithm, I'm going to describe how DFS codes solve this problem.

DFS in DFS code stands for depth first search. When we get into the gSpan algorithm, the reason for this will become more apparent, but for now just go with it. A DFS code is just a way of documenting the vertices and edges of a graph in tabular form, but with special rules. We'll represent each edge by defining 5 things. Do do this, first we need to create codes for the features in our graphs.

With this coding system, we can turn the green edge in the 1st graph above into (0,1,C,c,B). The 0 is for the starting vertex; the 1 is for the ending vertex; 'C' is for the starting vertex label/type; 'c' is for the edge label/type; and 'B' is for the ending vertex label/type. The reason for having a code for both the vertex number and the vertex label is that if you're analysing a graph for frequent patterns, the vertex number doesn't give you much information about patterns, but we need it to keep track of where we are in the graph. If you finish coding all of the edges in that first graph you might come up with something that looks like this.

Since you don't have any rules at this point to guide you, you could have just as easily come up with DFS codes that look like any of the DFS codes below...or more

All of these examples start from 0 in the lower left corner, but are very different. It becomes obvious that we need to come up with a standardized way to order our graphs for several reasons; (1) so that we don't drive ourselves crazy comparing codes like the ones above only to realize they're the same graph, (2) to make is easier to find a subgraph we might think is frequent in many different graphs that will likely have very different configurations. One of the things required to solve these problems is a set of rules called "Neighborhood Restriction"that limit the ways that we can create these lists of connections in our graphs. These rules govern the order that you add edges to the list when you're creating codes for a graph. The simplified English version of the rules listed in the technical paper go something like this.

If the first vertex of the current edge is less than the 2nd vertex of the current edge (forward edge)

If the first vertex of the next edge is less than the 2nd vertex of the next edge (forward edge)

If the first vertex of the next edge is less than or equal to the 2nd vertex of the current edge

AND If the 2nd vertex of the next edge is equal to the 2nd vertex of the current edge plus one this is an acceptable next edge

Otherwise the next edge being considered isn't valid

Otherwise (next edge is a backward edge)

If the first vertex of the next edge is equal to the 2nd vertex of the current edge

AND If the 2nd vertex of the next edge is less than the 1st vertex of the current edge this is an acceptable next edge

Otherwise the next edge being considered isn't valid

Otherwise (the current edge is a backward edge

If the first vertex of the next edge is less than the 2nd vertex of the next edge (forward edge)

If the first vertex of the next edge is less than or equal to the 1st vertex of the current edge

AND If the 2nd vertex of the next edge is equal to the 1st vertex of the current edge plus one this is an acceptable next edge

Otherwise the next edge being considered isn't valid

Otherwise (next edge is a backward edge)

If the first vertex of the next edge is equal to the 1st vertex of the current edge

AND If the 2nd vertex of the current edge is less than the 2nd vertex of the next edge this is an acceptable next edge

Otherwise the next edge being considered isn't valid

I know that's hard to follow, I'll walk you through one and give you the answers to a couple more of these. It just so happens that Example 1 above meets all of these criteria. So, for (0,1,C,c,B) to (1,2,B,a,A) we see that 0,1 is a forward edge, and 1,2 is a forward edge. so we need to make sure that the 1 in (1,2,B,a,A) is less than or equal to the 1 in (0,1,C,c,B)...check; AND, we need to make sure that the 2 in (1,2,B,a,A) is equal to the 1 in (0,1,C,c,B) +1; 2 = 1+1...check. so the step from edge 0 to edge 1 is valid.

Let's do another one. In example 1 still, (1,2,B,a,A) is a forward edge because 1 is less than 2 and (2,0,A,b,C) is a backward edge because 2 is greater than 0. So we check if the 2 in (2,0,A,b,C) is equal to the 2 in (1,2,B,a,A)...check; AND we check if the 0 in (2,0,A,b,C) is less than the 1 in (1,2,B,a,A)...check.

If you keep going through Example 1, and the rest of the examples above, you'll see how hard it might be to create this type of order at random

I created example 1 to fit the rules on purpose. Examples 2, 3 and 4 were created at random. I was surprised to see how many edge steps I accidentally got right for those 3.

If you take some time and are careful, you can create several different DFS codes like example 1 that meet the neighborhood restriction rules, but are different from example 1. Below are 3 examples (1, 5, and 6) that meet the rules. To get these 2 other examples, I started at different locations in the graph. The easiest way to get these patterns (and the method that is actually used in gSpan) is to do the following: First you take a step forward, then try to make any backward connections possible connecting from smallest to largest (e.g. 3,0 comes before 3,1 if applicable). Then, you take another step forward (e.g. 3,4) and repeat until you're done with the graph.

So, it appears we've gotten closer to figuring out the 1 way we should represent a graph, but we're not quite there. To fix this problem the creators of the gSpan algorithm have an ordering system that will allow us to figure out if example 1 is "less than" example 5, or if example 5 is "less than" example 6. This ordering system is great because once we have this, we just define THE representation of the graph (the one we'll be using in the gSpan algorithm) as the minimum DFS code possible. This order is called DFS Lexicographic Order. You may remember from other posts that lexicographic is just a fancy way of saying the order is going from smallest to largest (e.g. 1,2,3 or A, B, C...). To do this we also make use of the fact that we have given codes to our vertex types and edge types. Remember how the blue circles in our examples are 'A', and the dotted green lines are 'c'? 'A' is a vertex label and 'c' is an edge label. Here's the rules that govern this order that will help us get our "minimum" DFS code; I'm going to write it in a very simplified pseudo-code format.

Let X be the one version of the graph's DFS code Y be another version for the same graph

Assume that X > Y to startStart by comparing the first edge (edge 0) of both DFS codesStart a loop for comparisons going through all the edges

Check each of the following rules; if one applies, set X < Y and exit the loop

Is X a backward edge and Y a forward edge?

Is X a backward edge, Y a backward edge, and the 2nd vertex of X < 2nd vertex of Y?

Is X a backward edge, Y a backward edge, 2nd vertex of X = 2nd vertex of Y, and the edge label of X is less than the edge label of Y?

Is X a forward edge, Y a forward edge, the 1st vertex of Y < the 1st vertex of X

Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, and the label for the first vertex of X is less than the label for the first vertex of Y?

Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, the first vertex label of X = the first vertex label of Y, and the edge label for X is less than the edge label for Y?

Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, the first vertex label of X = the first vertex label of Y, the edge label for X = the edge label for Y, and the 2nd vertex label of X < the 2nd vertex label of Y?

If you've made it to this section of code, you haven't proven that A is less than B yet If there's another edge left in the graph to check, increment up one edge (e.g. go from edge 0 to 1)End Loop
Again, like a lot of things in data mining, this might look complex, but when you see it in a couple of examples it is pretty easy to follow. If we apply this to Example 1, 5, and 6 above you'll see what I mean.

Let's let example 1 be DFS code X and example 5 be DFS code Y in the code above. For the first edge, they are both forward edges with 0,1. So, rules 1 through 4 in the code don't apply. When we look at rule 5, we see that the 'C' in (0,1,C,c,B) is "greater than" the 'A' in (0,1,A,a,B) so rule 5 doesn't apply either. Because the first vertex label 'C' and 'A' aren't equal, rules 6 & 7 don't apply. So, comparing the first edge (edge 0) didn't give us the answer. We move on to the next edge (edge 1). For X we have (1,2,B,a,A) and for Y we have (1,2,B,c,C). The first two items match and they're forward edges so we jump all the way down to rule 5. We see that the first vertex labels match too so we go to rule 6. When we look at rule 6 we see that the edge 'a' is "less than" the edge 'c' so we get to say that X<Y, or Example 1 is "less than" Example 5. We don't have to check the rest of the edges because we break out of the loop once we find just one example of one of the 7 rules being met.

Let's compare example 5 (X) and example 6 (Y). Starting at edge 0, we see that they're both forward edges with (0,1,...), so rules 1-4 don't apply. When we look at rule 5 we see that the 'A' from (0,1,A,a,B) is "less than" the 'B' from (0,1,B,c,C) which satisfies rule 5. This makes example 5 "less than" example 6; the loop breaks and we have our answer.

So, for our 3 examples, we know that example 1 < example 5 and example 5 < example 6. Just like other places where inequalities are used, this information tells us that example 1 < example 6, or to be complete example 1 < example 5 < example 6.

With the information and understanding we've covered in this post we are armed to tackle the gSpan algorithm itself. When I complete that blog post, I'll add a link to it right HERE. Hope this helps somebody out there!