clusters, moving their nodes to random clusters.
5. If no improvement is seen for X steps, start over from Step 2, but use a more sensitive cost function: (# of neighbors of u that are not in the same cluster) + (# of nodes co-clustered with u that are not its neighbors). This is approximately the naive cost function scaled by the size of cluster C.

MCODE

Bader and Hogue (2003) use a heuristic to find dense regions of the graph.

Key Idea. A
k-core of G is an induced subgraph of G such that every vertex has degree ≥ k. [Figure: a 2-core, and a vertex u that is not part of a 2-core.]
A local k-core(u, G) is a k-core in the subgraph of G induced by {u} ∪ N(u).
A highest k-core is a k-core such that there is no (k+1)-core.

MCODE, continued
1. The core clustering coefficient CCC(u) is computed for each vertex u. CCC(u) = the density of the highest local k-core of u; in other words, it's the density of the highest k-core in the graph induced by {u} ∪ N(u). ("Density" is the ratio of existing edges to possible edges.)
2. Vertices are weighted by k_highest(u) × CCC(u), where k_highest(u) is the largest k for which there is a local k-core around u.
3. Do a BFS starting from the vertex v with the highest weight w_v, including vertices with weight ≥ TWP × w_v.
4. Repeat step 3, starting with the next highest weighted seed, and so on.

MCODE, final step
Post-process clusters according to some options:
Filter. Discard clusters if they do not contain a 2-core.
Fluff. For every u in a cluster C_i, if the density of {u} ∪ N(u) exceeds a threshold, add the nodes in N(u) to C_i if they are not part of C_1, C_2, ..., C_q. (This may cause clusters to overlap.)
Haircut. 2-core the final clusters (removes tree-like regions).

Comparison – 40% edges removed; varied % added
[Figure: Geometric Accuracy = GeoMean(PPV, Sn) vs. % of added edges, for MCL, RNSC, SPC, and MCODE.] Representative test; MCL generally outperformed the others.
"Sensitivity" (Sn) := % of a complex covered by its best matching cluster. PPV is the % of a cluster covered by its best matching complex.

MCL Motivation
van Dongen (2000) proposes the following intuition for the graph clustering paradigm:
(1) The number of u–v paths of length k is larger if u, v are in the same dense cluster, and smaller if they belong to different clusters.
(2) A random walk on the graph won't leave a dense cluster until many of its vertices have been visited.
(3) Edges between clusters are likely to be on many shortest paths.
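Intuition (2) can be checked numerically. The sketch below uses a toy graph that is not from the lecture (two triangles joined by a single bridge edge) and an illustrative helper name `walk_distribution`; it simulates the probability distribution of a uniform random walk started in one triangle.

```python
# A quick numerical check of intuition (2) on a toy graph (an assumption
# for this example, not from the lecture): two triangles {0,1,2} and
# {3,4,5} joined by the bridge edge 2-3.

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
nbrs = {v: [] for v in range(n)}
for u, v in edges:
    nbrs[u].append(v)
    nbrs[v].append(u)

def walk_distribution(start, steps):
    """Distribution of a uniform random walk after `steps` steps."""
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(steps):
        nxt = [0.0] * n
        for v in range(n):
            for w in nbrs[v]:
                nxt[w] += dist[v] / len(nbrs[v])  # uniform choice of edge
        dist = nxt
    return dist

dist = walk_distribution(0, 3)
inside = sum(dist[v] for v in (0, 1, 2))
# most of the probability mass is still inside the starting triangle
```

Here `inside` comes out around 0.8: after three steps the walk has rarely crossed the bridge, which is exactly the property MCL exploits.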
Think of driving in a city: (1) if you're going from u to v, there are lots of ways to go; (2) random turns will keep you in the same neighborhood; (3) bridges will be heavily used.

Girvan-Newman

Structural Units of a Graph
k-bond. A maximal subgraph S with all nodes having degree ≥ k
in S.
k-component. A maximal subgraph S such that every pair u, v ∈ S is connected by k
edge-disjoint paths in S.
k-block. A maximal subgraph S such that every pair u, v ∈ S is connected by k vertex-disjoint
paths in S.

k-blocks of a graph (van Dongen, 2000): (k+1)-blocks nest inside k-blocks.

Every k-block ⊆ some k-component, and every k-component ⊆ some k-bond:
All vertices of a k-component must have degree ≥ k in S. (If degree(u) < k, u couldn't have k edge-disjoint paths to v in S.)
k vertex-disjoint paths are all edge-disjoint; hence if u, v are connected by k vertex-disjoint paths in S, they are connected by k edge-disjoint paths in S.

Thm. (Matula) The k-components form equivalence classes (they don't overlap).

Problem with k-blocks as clusters
The clustering is very sensitive to node degree and to particular configurations of edge-disjoint paths (van Dongen, 2000):
Example 1. The red shaded region is nearly a complete graph (missing only one edge), yet each of its nodes is in its own 3-block.
Example 2. The blue shaded region can't be in a 3-block with any other vertex (because it has degree 2), but really it should be with the K4 subgraph it is next to.

Number of Length-k Paths
Let A be the adjacency matrix of an unweighted simple graph G. A^k is A ⋅ A ⋅ ... ⋅ A (k times).
Thm. The (i,j) entry of A^k, denoted (A^k)_ij, is the number of paths of length k from i to j:
(A^k)_ij = ∑_{r=1}^{n} (A^{k-1})_ir · A_rj
Note: the paths do not have to be simple.

k-Path Clustering
Idea. Use Z_k(u,v) := (A^k)_uv as a similarity matrix; k is an input parameter.
Given Z_k(u,v), for some k, use it as a similarity matrix and perform single-link clustering.
Single-link clustering of matrix M:
Throw out all entries of M that are < threshold t.
Return connected components of the remaining edges.
It is called single-link clustering because a single "good" edge can merge two clusters.

Problem with k-Path Clustering
Consider Z_2: Z_2(a,b) = 1 and Z_2(a,c) = 1, but intuitively a, b are more closely coupled than a, c.
Consider Z_3: Z_3(a,b) = 2 and Z_3(a,d) = 2 [Why?], but intuitively a, b are more closely coupled than a, d.
While there are more short paths between a & b than between other pairs, half of the short paths are of odd length and half are of even length.
Solution. Add self-loops to every node. (van Dongen, 2000)

Example
k-path clustering (van Dongen, 2000), using k = 2 and Z_2(u,v) := (A^2)_uv as the similarity matrix.

Random Walks
On an unweighted graph: start at a vertex, choose an outgoing edge uniformly at random, walk along that edge, and repeat.
On a weighted graph: start at a vertex u, choose an incident edge e with weight w_e with probability w_e / ∑_d
w_d, where d ranges over the edges incident to u; walk along that edge, and repeat.

Transition matrix. If A_G is the adjacency matrix of graph G, we form T_G by normalizing each row of A_G to sum to 1 (each entry of a row is divided by that row's sum).

Random Walks, 2
Suppose you start at u. What's the probability you are at w after 3 steps?
Let v_u be the vector that is 0 everywhere except at index u.
At step 0, v_u[w] gives the probability you are at node w.
After 1 step, (T_G v_u)[w] gives the probability that you are at w.
After k steps, the probability that you are at w is (T_G^k v_u)[w].
In other words, T_G^k v_u
is a vector giving our probability of being at any node after taking k steps.

Random Walks for Finding Clusters
T_G^k v_u
is a vector giving our probability of being at any node after taking k steps, starting from u.
We don't want to choose a starting point. Instead of v_u we could use the vector v_uniform with every entry = 1/n. But then, for clustering purposes, v_uniform just gives a scaling factor, so we can ignore it and focus on T_G^k =: T^k.
T^k[i,j] gives the probability that we will cross from i to j on step k. If i, j are in the same dense region, you expect T^k[i,j] to be higher.

Example (van Dongen, 2000). The probability tends to spread out quickly.

Second Key Idea
According to some schedule, apply an "inflation" operator to the matrix:
Inflation(M, r) := raise each entry p_ij of M to the power r, then rescale each column to sum to 1.
The effect will be to heighten the contrast between the existing small differences. (As in inflation in cosmology.)
Examples (r = 2):
(0.25, 0.25, 0.25, 0.25) → (0.25, 0.25, 0.25, 0.25)
(0.25, 0.25, 0.25, 0.25, 0) → (0.25, 0.25, 0.25, 0.25, 0)
(0.3, 0.3, 0.2, 0.2) → (0.346, 0.346, 0.154, 0.154)

Example. [Figure: example graph with numbered nodes.] Attractors: nodes with positive return probability.

The algorithm
MCL(G, {e_i}, {r_i}):
  # Input:
  #   graph G,
  #   sequence of powers e_i, and
  #   sequence of inflation parameters r_k
  Add weighted loops to G and compute T_{G,1} =: T_1
  for k = 1, ..., ∞:
    T := Inflate(r_k, Power(e_k, T))
    if T ≈ T²
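The expand/inflate loop can be sketched in a few lines. This is a minimal illustration, not van Dongen's implementation: the toy graph (two triangles joined by one edge), the fixed parameters e = 2 and r = 2, and a fixed iteration count in place of the T ≈ T² convergence test are all assumptions made for the example.

```python
# A minimal sketch of MCL-style expansion and inflation on a
# column-stochastic matrix, using plain lists instead of a matrix library.
# The toy graph and the parameter choices are illustrative assumptions.

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][r] * b[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Raise every entry to the r-th power, then rescale columns to sum to 1."""
    n = len(m)
    p = [[m[i][j] ** r for j in range(n)] for i in range(n)]
    for j in range(n):
        s = sum(p[i][j] for i in range(n))
        for i in range(n):
            p[i][j] /= s
    return p

def mcl(t, e=2, r=2, iters=30):
    """Repeat expansion (T -> T^e) followed by inflation."""
    for _ in range(iters):
        m = t
        for _ in range(e - 1):
            m = matmul(m, t)      # Power(e, T)
        t = inflate(m, r)         # Inflate(r, .)
    return t

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
a = [[0.0] * n for _ in range(n)]
for u, v in edges:
    a[u][v] = a[v][u] = 1.0
for v in range(n):
    a[v][v] = 1.0                 # add self-loops, as the algorithm prescribes
t = mcl(inflate(a, 1))            # inflating with r=1 just column-normalizes
# Flow between the two triangles dies out: t[i][j] becomes ~0 across the
# bridge, so the nonzero blocks of t recover the clusters {0,1,2} and {3,4,5}.
```

Reading off clusters: nodes whose columns are supported on the same block belong to one cluster; here the column mass tends to concentrate inside each node's own triangle.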