Two provably consistent divide and conquer clustering algorithms for large networks


Soumendu Sundar Mukherjee

Purnamrita Sarkar

Peter J. Bickel

Abstract

In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms which perform clustering on a number of small subgraphs and finally patch the results into a single clustering. The main advantage of these algorithms is that they significantly reduce the computational cost of traditional algorithms, including spectral clustering, semi-definite programs, modularity based methods, likelihood based methods, etc., without compromising accuracy, and at times even improving it. These algorithms are also, by nature, parallelizable. Thus, exploiting the fact that most traditional algorithms are accurate on small problems, where the corresponding optimization problems are also much simpler, our divide-and-conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real-data analysis to understand the advantages of the divide-and-conquer approach in various settings.

For the convenience of the reader, we collect here some of the more frequently used notations in the paper, providing a summarizing phrase for each, as well as the page number at which the notation first appears.

Community detection, also known as community extraction or network clustering, is a central problem in network inference. A wide variety of real world problems, ranging from finding protein-protein complexes in gene networks [13] to studying the consequences of social cliques on adolescent behavioral patterns [15], depend on detecting and analyzing communities in networks. In most of these problems, one observes co-occurrences or interactions between pairs of entities, i.e. pairwise edges, and possibly additional node attributes. The goal is to infer the hidden community memberships.

Many of these real world networks are massive, and hence it is crucial to develop and analyze scalable algorithms for community detection. We first discuss methodology that uses only the network connections for inference. These methods can be roughly divided into two types. The first type consists of methods derived independently of any model assumptions. These typically involve the formulation of a global optimization problem; examples include normalized cuts [28], spectral clustering [24], etc.

On the other end, statisticians often devise techniques under model assumptions. The simplest statistical model for networks with communities is the Stochastic Blockmodel (SBM) [14]. The key idea in an SBM is to enforce stochastic equivalence, i.e. two nodes in the same latent community have identical probabilities of connection to all nodes in the network. There are many extensions of the SBM. The Degree Corrected Stochastic Blockmodel [16] allows one to model varied degrees within the same community, whereas a standard SBM does not. Mixed membership blockmodels [2] allow a node to belong to multiple communities, whereas in an SBM a node belongs to exactly one cluster.

For an SBM generating a network with n nodes and K communities, one has a hidden community/cluster membership matrix Z ∈ {0,1}^{n×K}, where Z_ik = 1 if node i is in community k ∈ [K]. Given these memberships, the link formation probabilities are given as

P(A_ij = 1 | Z_ik = 1, Z_jk′ = 1) = B_kk′,

where B is a K×K symmetric parameter matrix of probabilities. The elements of B may decay to zero as n grows to infinity, to model sparse networks.
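To make the model concrete, the following is a minimal sketch of sampling an adjacency matrix from an SBM; the function name `sample_sbm` and its interface are ours, chosen for illustration.

```python
import numpy as np

def sample_sbm(z, B, rng=None):
    """Sample a symmetric 0/1 adjacency matrix A with
    P(A_ij = 1) = B[z_i, z_j] and no self-loops."""
    rng = np.random.default_rng(rng)
    n = len(z)
    P = B[np.ix_(z, z)]                      # P[i, j] = B[z_i, z_j]
    upper = rng.random((n, n)) < P           # independent Bernoulli draws
    A = np.triu(upper, k=1).astype(int)      # keep the strict upper triangle
    return A + A.T                           # symmetrize; diagonal stays 0

# two balanced communities with stronger within-community connectivity
z = np.repeat([0, 1], 50)
B = np.array([[0.30, 0.05],
              [0.05, 0.30]])
A = sample_sbm(z, B, rng=0)
```

Scaling the entries of B toward zero as n grows (e.g., B = α_n B_0) produces the sparse networks discussed above.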

Typically the goal is to estimate the latent memberships consistently. A method outputting an estimate ^Z is called strongly consistent if P(^Z = ZΠ) → 1 as n → ∞, for some K×K permutation matrix Π. A weaker notion of consistency requires only that the fraction of misclustered nodes goes to zero as n goes to infinity. Most consistency results are derived in the regime where the average degree of the network grows faster than the logarithm of n; this is often called the semi-dense regime. When the average degree is bounded, we are in the sparse regime.

There is a plethora of algorithms for community detection. These include likelihood-based methods [3], modularity based methods [29], spectral methods [26], semi-definite programming (SDP) based approaches [8], etc. Among these, spectral methods are scalable, since the main bottleneck is computing the top K eigenvectors of a large and often sparse matrix. While the theoretical guarantees of spectral clustering are typically proven in the semi-dense regime [21], a regularized version of it has been shown to perform better than a random predictor for sparse networks [17]. Profile likelihood methods [6] involve a greedy search over all possible membership matrices, which makes them computationally expensive. Semidefinite programs are robust to outliers [8], are strongly consistent in the dense regime [4], and yield a small but non-vanishing error in the sparse regime [12]. However, semidefinite programs are slow and typically scale only to thousands of nodes, not millions.

Methods like spectral clustering on geodesic distances [5] are provably consistent in the semi-dense case and can give a small error in sparse cases. However, they require computing shortest paths between all pairs of nodes, which poses serious computational and storage problems for very large graphs.

Monte Carlo methods [29], which are popular tools in Bayesian frameworks, are typically not scalable. More scalable alternatives such as variational methods [10] do not have provable consistency guarantees and often suffer from bad local minima.

So far we have discussed community detection methods which only look at the network connections, and not node attributes, which are often also available and may carry useful information about the community structure (see, e.g., [22]). There are extensions of the methods mentioned earlier which accommodate node attributes, e.g., modularity based [34], spectral [7], SDP based [32], etc. These methods come with theoretical guarantees and perform well on moderately sized networks. While existing Bayesian methods [20] are more amenable to incorporating covariates in the inference procedure, they are often computationally expensive and lack rigorous theoretical guarantees.

While the above array of algorithms is diverse and each algorithm has its unique aspects, scaling them to very large datasets requires different computational tools tailored to different algorithmic settings. While stochastic variational updates may be suitable for scaling Bayesian methods, pseudo-likelihood methods are better optimized using row sums of edges inside different node blocks.

In this article, we propose a divide and conquer approach to community detection. The idea is to apply a community detection method on small subgraphs of a large graph, and then stitch the results together. If we can achieve this, we can scale up any community detection method (which may involve covariates as well as the network structure) that is computationally feasible on small graphs but difficult to execute on very large networks. This would be especially useful for computationally expensive community detection methods (such as SDPs, modularity based methods, Bayesian methods, etc.). Another possible advantage concerns variational likelihood methods (such as mean-field) with a large number (growing with n) of local parameters, which typically have an optimization landscape riddled with local minima. For smaller graphs there are fewer parameters to fit, and the optimization problem often becomes easier.

Clearly, the principal difficulty in doing this is matching the possibly conflicting label assignments from different subgraphs (see Figure ?(a) for an example). This immediately rules out a simple-minded averaging of estimates ^Z_S of cluster membership matrices Z_S, over the various subgraphs S, as a viable stitching method.

In this regard, we propose two different stitching algorithms. The first is called Piecewise Averaged Community Estimation (PACE), in which we focus on estimating the clustering matrix C = ZZ⊤, which is label invariant: the (i,j)-th element of this matrix is one precisely when nodes i and j belong to the same cluster, and zero when they belong to different clusters. Thus we first compute estimates of Z_S Z_S⊤ for various subgraphs S and then average over these matrices to obtain an estimate ^C of C. Finally, we apply some computationally cheap clustering algorithm, such as approximate K-means, DGCluster, or spectral clustering, on ^C to recover an estimate of Z.
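As a concrete (and simplified) sketch of the PACE stitching step, suppose each subgraph comes with a local label vector produced by some black-box clustering algorithm; the function name `pace_estimate` and its interface are ours, for illustration only.

```python
import numpy as np

def pace_estimate(n, subgraphs, local_labels, tau=1):
    """Average label-invariant local clustering matrices into an
    estimate of C = Z Z^T, ignoring node pairs covered by fewer
    than tau subgraphs.

    subgraphs    : list of 1-D index arrays, the node sets S_l
    local_labels : list of label arrays, one per subgraph; labels
                   need NOT be consistent across subgraphs
    """
    C_sum = np.zeros((n, n))
    N = np.zeros((n, n))    # N[i, j]: number of subgraphs containing both i and j
    for S, labels in zip(subgraphs, local_labels):
        same = (labels[:, None] == labels[None, :]).astype(float)
        C_sum[np.ix_(S, S)] += same          # local clustering matrix on S
        N[np.ix_(S, S)] += 1.0
    # divide where the pair was seen often enough; leave 0 elsewhere
    return np.divide(C_sum, N, out=np.zeros((n, n)), where=N >= tau), N

subgraphs = [np.array([0, 1, 2]), np.array([1, 2, 3])]
local_labels = [np.array([0, 0, 1]), np.array([1, 0, 0])]  # conflicting label conventions
C_hat, N = pace_estimate(4, subgraphs, local_labels)
```

Because each local clustering enters only through its clustering matrix, the conflicting label conventions across subgraphs are harmless.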

We also propose another algorithm called Global Alignment of Local Estimates (GALE), where we first take a sequence of subgraphs, such that any two consecutive subgraphs in this sequence have a large intersection, and then traverse this sequence, aligning the clustering based on each subgraph with an averaged clustering of the union of all its predecessor subgraphs in the sequence, which have already been aligned. The alignment is done via an algorithm called Match, which identifies the right permutation to align two clusterings on two subgraphs by computing the confusion matrix of these two clusterings restricted to the intersection of the two subgraphs. Whereas a naive approach would entail searching through all K! permutations, Match finds the right permutation in K log K time. Once this alignment step is complete, we get an averaged clustering of the union of all subgraphs (which covers all the vertices). By design, GALE works with estimates of cluster membership matrices Z_S directly to output an estimate of Z, and thus, unlike PACE, avoids the extra overhead of recovering such an estimate from ^C.

The rest of the paper is organized as follows. In Section 3 we describe our algorithms. In Section 4 we state our main results and some applications. Section 5 contains simulations and real data experiments. In Section 7 we provide proofs of our main results, relegating some of the details to Appendix B. Finally, in Section 6 we conclude with some discussion and directions for future work.

As we discussed in the introduction, the main issue with divide and conquer algorithms for clustering is that one has to somehow match up various potentially conflicting label assignments. In this vein we propose two algorithms to accomplish this task. Both algorithms first compute a clustering on small patches of the network; these patches can be induced subgraphs of random subsamples of nodes, or neighborhoods. The stitching procedures, however, are different.

Suppose A is the adjacency matrix of a network whose true cluster memberships are given by the n×K matrix Z, with K clusters. Set C = ZZ⊤ to be the clustering matrix whose (i,j)-th entry is the indicator of whether nodes i and j belong to the same cluster. Given A, we will perform a local clustering algorithm to obtain an estimate of C, from which an estimate ^Z of the cluster memberships may be reconstructed.

The τ parameter in PACE seems to reduce the variance of the estimate by discarding information from less credible sources: if a pair of nodes appears in only a few subgraphs, we do not trust what the patching has to say about them. Setting τ = θ·E(N_ij) seems to work well in practice (this choice is also justified by our theory).

A slight variant of Algorithm ? allows subgraph and/or node-pair specific weights w_{ℓ,i,j} in the computation of the final estimate, i.e.

where N_ij now equals ∑_{ℓ=1}^{T} w_{ℓ,i,j} y^{(ℓ)}_{ij}. We call this estimator w-PACE, standing for weighted-PACE. If the weights are all equal, w-PACE reduces to ordinary PACE. There are natural nontrivial choices, including

w_{ℓ,i,j} = |S_ℓ|, which places more weight on estimates based on large subgraphs,

w_{ℓ,i,j} = deg_{S_ℓ}(i) + deg_{S_ℓ}(j), where deg_S(u) denotes the degree of node u in subgraph S (this puts more weight on pairs with high degree in S_ℓ).

The first prescription above is intimately related to the following sampling scheme for ordinary PACE: pick subgraphs with probability proportional to their sizes. For instance, in Section 5.2 we analyze the political blog data of [1] where neighborhood subgraphs are chosen by selecting their roots with probability proportional to degree.

In real world applications, it might make more sense to choose these weights based on domain knowledge (for instance, certain subnetworks may be known to be important). Another (minor) advantage of having weights is that when T = 1 and |S_1| = n, we have N_ij = w_{1,i,j}, and so if w_{1,i,j} ≥ τ, then

^C_ij = ^C^{(1)}_ij,

i.e. w-PACE becomes the estimator based on the full graph. This is, for example, true with w_{ℓ,i,j} = |S_ℓ|, because τ is typically much smaller than n. Ordinary PACE, however, lacks this property unless τ = 1; in fact, in this setting with τ > 1, the estimate returned by ordinary PACE is identically 0. In what follows, we stick with ordinary PACE because of its simplicity.

Before we discuss how to reconstruct an estimate ^Z of Z from ^C, let us note that we may obtain a binary matrix ^C_η by thresholding ^C at some level η (for example, η = 1/2):

^C_η := [^C > η].

This thresholding does not change consistency properties (see Lemma ?). Looking at a plot of this matrix gives a good visual representation of the community structure. In what follows, we work with unthresholded ^C.

Reconstruction of ^Z: How do we actually reconstruct ^Z from ^C? The key is to note that members of the same community have identical rows in C and that, thanks to PACE, we have a consistent estimate of C. Thus we may use any clustering algorithm on the rows of ^C to recover the community memberships. Another option is to run spectral clustering on the matrix ^C itself. However, as the rows of ^C are n-vectors, most clustering algorithms will typically have a running time of O(n^3) at best. Indeed, the main computational bottleneck of any distance based clustering algorithm in a high dimensional situation like the present one is computing d_ij = ‖^C_i⋆ − ^C_j⋆‖, which takes O(n) bit operations. However, since we have a good estimate of C, we can project the rows of ^C onto lower dimensions without distorting the distances too much. The famous Johnson-Lindenstrauss lemma for random projections says that by projecting onto Ω(log n/ϵ²) dimensions, one can keep, with probability at least 1 − O(1/n), the distances between projected vectors within a factor of (1 ± ϵ) of the true distances. Choosing ϵ as an inverse polylog(n), we need to project onto only polylog(n) dimensions, which readily brings the computational cost of any distance based algorithm down from O(n^3) to O(n^2 polylog(n)).

Following the discussion in the preceding paragraph, we first apply a random projection to the rows of ^C, mapping them to s (= polylog(n)) dimensions, and then apply a (distance based) clustering algorithm.
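The projection step can be sketched as follows; `project_rows` is an illustrative name, and the Gaussian projection is one standard Johnson-Lindenstrauss construction among several.

```python
import numpy as np

def project_rows(C_hat, s, rng=None):
    """Project the n-dimensional rows of C_hat down to s dimensions
    using a random Gaussian matrix.  By the Johnson-Lindenstrauss
    lemma, pairwise distances are preserved up to a (1 +/- eps)
    factor with high probability when s = Omega(log n / eps^2)."""
    rng = np.random.default_rng(rng)
    n = C_hat.shape[0]
    R = rng.standard_normal((n, s)) / np.sqrt(s)   # random projection matrix
    return C_hat @ R                                # n x s projected rows

# a block-constant clustering matrix: 2 communities of 3 nodes each
C = np.kron(np.eye(2), np.ones((3, 3)))
X = project_rows(C, s=16, rng=0)
```

Any distance based clustering algorithm can then be run on the rows of X instead of the rows of ^C, replacing the O(n) cost of each distance computation by O(polylog(n)).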

As for Cluster(⋅, K), we may use approximate K-means or any other distance based clustering algorithm, e.g., DGCluster(^C, ^C_proj, K), a greedy algorithm presented in Appendix A as Algorithm ?.

First we introduce a simple algorithm for computing the best permutation to align the labels of one clustering (Z_1) to another (Z_2) of the same set of nodes (with fixed ordering) in a set S. The idea is to first compute the confusion matrix between the two clusterings. Note that if the two labelings each have low error with respect to some unknown true labeling, then the confusion matrix will be close to a diagonal matrix up to permutations. The following algorithm essentially finds a best permutation to align one clustering to another.
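The idea can be sketched as follows. This is a simple greedy variant of Match written for illustration (the paper's Match achieves K log K time; our greedy loop is quadratic in K but implements the same confusion-matrix principle, and the names are ours):

```python
import numpy as np

def match(labels1, labels2, K):
    """Estimate the permutation aligning labels2 to labels1 on a common
    node set, from their K x K confusion matrix.  perm[k2] = k1 means
    cluster k2 of labels2 is relabeled as cluster k1 of labels1."""
    M = np.zeros((K, K))
    for a, b in zip(labels1, labels2):
        M[a, b] += 1                 # confusion matrix on the common nodes
    perm = np.full(K, -1)
    for _ in range(K):               # greedily peel off the largest remaining entry
        k1, k2 = np.unravel_index(np.argmax(M), M.shape)
        perm[k2] = k1
        M[k1, :] = -1                # remove the assigned row and column
        M[:, k2] = -1
    return perm

labels1 = np.array([0, 0, 1, 1, 2, 2])
labels2 = np.array([1, 1, 2, 2, 0, 0])     # same partition, permuted labels
perm = match(labels1, labels2, K=3)
```

When the confusion matrix is near-diagonal up to a permutation, as happens when both clusterings have low error, the greedy choice recovers that permutation, and `perm[labels2]` reproduces `labels1`.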

Now we present our sequential algorithm which aligns labelings across different subgraphs. The idea is to first fix the indexing of the nodes; cluster the subgraphs (possibly with a parallel implementation) using some algorithm, and then align the clusterings along a sequence of subgraphs. To make things precise, we make the following definition.

After we construct a traversal, we travel along it so that at each step we align the current subgraph's labels, using the Match algorithm (Algorithm ?), on its intersection with the union of the previously aligned subgraphs. At the end, all subgraph labelings are aligned to the labeling of the starting subgraph. We can then simply take an average or a majority vote among them.

Implementation details: Constructing a traversal of the subgraphs can be done using a depth first search of the super-graph S_{m,T} of subgraphs. For our implementation, we start with a large enough subgraph (the parent), pick another subgraph that has a large overlap with it (the child), align it, and mark the child as visited. We then recursively find another unvisited child of the current subgraph, and so on. It is possible that a particular path does not cover all vertices, and hence it is ideal to estimate clusterings from multiple traversals with different starting subgraphs, align all the resulting clusterings, and take an average. This is what we do for real networks. We also note that, at any step, a poorly clustered subgraph can produce a bad permutation, which may deteriorate the quality of the alignment for all subsequent subgraphs on the path. To avoid this, we use a self validation routine. Let S be the intersection of the current subgraph with the union of the previously visited subgraphs. After aligning the current subgraph's clustering, we compute the classification accuracy of the current labeling of S against the previous labeling of S. If this accuracy is large enough, we use this subgraph; if not, we move on to the next subgraph on the path. In our implementation, we use a threshold of 0.55.
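A compressed sketch of this traversal-and-alignment loop is given below. The function names, the greedy variant of Match, and the toy interface are ours; also, for brevity this sketch aligns against a running majority vote over the covered region, in the spirit of Algorithm ?.

```python
import numpy as np

def greedy_match(ref_labels, new_labels, K):
    """Permutation aligning new_labels to ref_labels via the confusion matrix."""
    M = np.zeros((K, K))
    for a, b in zip(ref_labels, new_labels):
        M[a, b] += 1
    perm = np.full(K, -1)
    for _ in range(K):
        k1, k2 = np.unravel_index(np.argmax(M), M.shape)
        perm[k2] = k1
        M[k1, :] = -1
        M[:, k2] = -1
    return perm

def gale_stitch(n, subgraphs, local_labels, K, check=0.55):
    """Traverse the subgraphs in the given order, aligning each local
    clustering to the consensus on the already-covered region and
    combining the aligned labelings by a per-node majority vote."""
    votes = np.zeros((n, K))            # votes[i, k]: aligned votes for node i, cluster k
    covered = np.zeros(n, dtype=bool)
    for S, labels in zip(subgraphs, local_labels):
        mask = covered[S]               # nodes of S seen in earlier subgraphs
        if not mask.any():
            aligned = labels            # the first subgraph anchors the labeling
        else:
            consensus = votes[S[mask]].argmax(axis=1)
            perm = greedy_match(consensus, labels[mask], K)
            aligned = perm[labels]
            # self-validation: discard the subgraph if, even after
            # alignment, it agrees poorly on the intersection
            if (aligned[mask] == consensus).mean() < check:
                continue
        votes[S, aligned] += 1
        covered[S] = True
    return votes.argmax(axis=1)         # majority vote per node

subgraphs = [np.array([0, 1, 2, 3]), np.array([2, 3, 4, 5])]
local_labels = [np.array([0, 0, 0, 1]), np.array([1, 0, 0, 0])]
est = gale_stitch(6, subgraphs, local_labels, K=2)
```

In the toy run above, the second subgraph uses the opposite label convention, yet the alignment on the intersection {2, 3} recovers the swap and the majority vote returns a single coherent clustering.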

Computational time and storage: The main computational bottleneck in GALE is building a traversal through the random subgraphs. Let η_{m,T} be the time for computing the clusterings of the T subgraphs in parallel. A naive implementation would require computing intersections between all (T choose 2) pairs of m-subsets. As we show in our theoretical analysis, we take m = ω(√n/π_min), where π_min := min_k π_k (here π_k := n_k/n, with n_k the size of the k-th cluster), and T = Ω̃(n/m). Taking π_min = Θ(1), the computation of intersections takes O(T²m) = Õ(n^{3/2}) time. Further, a naive comparison for finding subsets similar or close to a given one would require T log T time per subset, leading to T² log T = Õ(n) computation. However, for building a traversal one only needs access to subsets with large overlap with a given subset, which is a classic example of nearest neighbor search in Computer Science.

One such method is the widely used and theoretically analyzed technique of Locality Sensitive Hashing (LSH). A hash function maps a data object of arbitrary size to an object of fixed size; in our case, we map the characteristic vector of a subset to a number. The idea of LSH is to use hash functions such that, for two subsets A and B, the hashes agree with high probability if A and B are "close". In fact, the amount of overlap normalized by m is simply the cosine similarity between the characteristic vectors χ(A) and χ(B) of the two subsets, for which efficient hashing schemes h: {0,1}^n → {0,1} using random projections exist [9], with

P(h(A) = h(B)) = 1 − arccos(χ(A)⊤χ(B)/m)/π.

For LSH schemes, one needs to build L := T^ρ hash tables, for some ρ < 1 that governs the approximation quality. In each hash table, a "bucket" corresponding to an index stores all subsets hashed to that index. For any query point, one evaluates O(L) hash functions and examines the O(L) subsets hashed to the corresponding buckets in the respective hash tables. For these candidate subsets, the distance is then computed exactly. The preprocessing time is O(T^{1+ρ}) = Õ(n^{(1+ρ)/2}), the storage is O(T^{1+ρ} + Tm) = Õ(n), and the total query time is O(T^{1+ρ}m) = Õ(n^{1+ρ/2}). This brings the running time added to the algorithm-specific η_{m,T} down from the sub-quadratic Õ(n^{3/2}) to nearly linear, i.e. Õ(n^{1+ρ/2}).
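The random-projection hashing scheme of [9] (often called SimHash) can be sketched as follows; the function names and the toy subsets are ours, and a full LSH index would combine many such signature bits into the L hash tables described above.

```python
import numpy as np

def simhash_signatures(char_vectors, n_bits, rng=None):
    """Random-hyperplane signatures for the rows of a 0/1 matrix of
    characteristic vectors.  Each signature bit agrees across two
    vectors with probability 1 - arccos(cosine similarity)/pi."""
    rng = np.random.default_rng(rng)
    n = char_vectors.shape[1]
    R = rng.standard_normal((n, n_bits))     # one random hyperplane per bit
    return (char_vectors @ R > 0).astype(np.uint8)

def bucket_by_signature(signatures):
    """One hash table: signature bytes -> indices of subsets in that bucket."""
    table = {}
    for idx, sig in enumerate(signatures):
        table.setdefault(sig.tobytes(), []).append(idx)
    return table

# three subsets of [n]: two heavily overlapping, one disjoint from the first
n = 200
chi = np.zeros((3, n))
chi[0, :100] = 1        # subset {0, ..., 99}
chi[1, 1:101] = 1       # overlaps the first in 99 of 100 elements
chi[2, 100:] = 1        # disjoint from the first
sigs = simhash_signatures(chi, n_bits=64, rng=0)
```

Subsets with large overlap share most signature bits and therefore tend to collide in the same buckets, which is exactly what the traversal construction needs.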

Thus, for clustering algorithms that themselves run in nearly linear time, GALE may not lead to computational savings. However, for algorithms like profile likelihood or SDPs, which are well known to be computationally intensive, GALE can lead to significant computational savings without requiring much storage.

With PACE we have mainly used random m-subgraphs, h-hop neighborhoods, and onion neighborhoods, but many other subgraph sampling schemes are possible. For instance, one can choose the roots of hop neighborhoods with probability proportional to degree, or sample roots from high degree nodes (we do this in our analysis of the political blog data in Section 5.2). As discussed earlier, this weighted sampling scheme is related to w-PACE. A natural question regarding h-hop neighborhoods is how many hops to use. While we do not yet have a theory for this, because of the "small world phenomenon" we expect not to need many hops; typically, in moderately sparse networks, 2-3 hops should be enough. An adaptive procedure (e.g., of cross-validation type) for choosing h would nevertheless be welcome. Also, since neighborhood size increases exponentially with hop size, an alternative to choosing full hop-neighborhoods is to choose a smaller hop-neighborhood and then add some (but not all) randomly chosen neighbors of the already chosen vertices. Other possibilities include sampling a certain proportion of edges at random and then considering the subgraph induced by the participating nodes. We leave these possibilities for future work.
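For concreteness, an h-hop neighborhood is just the set of nodes within geodesic distance h of a root, computable by breadth-first search; the helper below is an illustrative sketch (the ego network of the root is its 1-hop neighborhood minus the root itself).

```python
from collections import deque

def h_hop_neighborhood(adj, root, h):
    """Nodes within geodesic distance h of root, by breadth-first search.

    adj : dict mapping node -> iterable of neighbors
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue                   # do not expand past h hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sorted(dist)

# a 6-cycle: the 2-hop neighborhood of node 0 misses only the antipodal node 3
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

The exponential growth of neighborhood size with h in denser graphs is what motivates the partial-expansion variants mentioned above.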

We have analyzed GALE under the random sampling scheme. For any other scheme, one has to understand the behavior of the intersection of two samples or neighborhoods. For example, with h-hop neighborhoods in sparse graphs, each neighborhood predominantly contains nodes from a single cluster. Hence GALE often suffers under this scheme. We show this empirically in Section 5, where GALE's accuracy is much improved under the random m-subgraph sampling scheme.

The ideas behind PACE and GALE are not restricted to community detection and can be modified for application in other interesting problems, including general clustering problems, co-clustering problems ([27]), mixed membership models, among others (these will be discussed in an upcoming article). In fact, [19] took a similar divide and conquer approach for matrix completion.

Let σ and σ′ be two clusterings (of n objects into K clusters); their discrepancy is usually measured by

δ_c(σ, σ′) = inf_{ξ∈S_K} (1/n) ∑_{i=1}^{n} 1{σ(i) ≠ ξ(σ′(i))} = (1/n) inf_{ξ∈S_K} ‖ξ(σ) − σ′‖_0,

where S_K is the permutation group on [K]. If Z, Z′ are the corresponding n×K binary matrices, then a related measure of discrepancy between the two clusterings is δ(Z, Z′) = inf_{Q perm.} (1/n)‖ZQ − Z′‖_0. It is easy to see that δ(Z, Z′) = 2δ_c(σ, σ′). (To elaborate, let Q_ξ be the permutation matrix corresponding to the permutation ξ, i.e. (Q_ξ)_ij = 1{ξ(i) = j}. Then ξ(σ(i)) ≠ σ′(i) if and only if (ZQ_ξ)_i⋆ ≠ Z′_i⋆, i.e. ‖(ZQ_ξ)_i⋆ − Z′_i⋆‖_0 = 2.) For our purposes, however, a more useful measure of discrepancy is the normalized squared Frobenius distance between the corresponding clustering matrices C = ZZ⊤ and C′ = Z′Z′⊤, i.e.

δ̃(C, C′) = (1/n²)‖C − C′‖²_F.
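Both discrepancy measures are straightforward to compute; below is a small illustrative sketch (function names are ours, and `delta_c` enumerates all K! permutations, so it is meant only for small K):

```python
import itertools
import numpy as np

def delta_c(sigma1, sigma2, K):
    """Misclustering proportion: minimum over the K! label permutations."""
    n = len(sigma1)
    best = n
    for xi in itertools.permutations(range(K)):
        relabeled = np.array([xi[s] for s in sigma2])
        best = min(best, int((sigma1 != relabeled).sum()))
    return best / n

def delta_tilde(Z1, Z2):
    """Normalized squared Frobenius distance between clustering matrices."""
    C1, C2 = Z1 @ Z1.T, Z2 @ Z2.T
    return ((C1 - C2) ** 2).sum() / C1.shape[0] ** 2

s1 = np.array([0, 0, 1, 1])
s2 = np.array([1, 1, 0, 0])          # same partition, labels swapped
Z1, Z2 = np.eye(2)[s1], np.eye(2)[s2]
```

Note that δ̃ needs no optimization over permutations at all, since the clustering matrices are label invariant; this is precisely what makes it convenient for analyzing PACE.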

Now we compare these two notions of discrepancies.

Incidentally, if the cluster sizes are all equal to n/K, then one can show that

δ̃(C, C′) ≤ (4/K)·δ(Z, Z′) = (8/K)·δ_c(σ, σ′).

Although we do not have a lower bound on δ̃(C, C′) in terms of δ(Z, Z′), Lemma A.1 of [30] gives us (with X = Z, Y = Z′) that there exists an orthogonal matrix O such that

‖ZO − Z′‖_F ≤ ‖C − C′‖·(√(K‖C‖) + √(K‖C′‖))/λ_min(C) ≤ 2√(Kn)·‖C − C′‖/n_min(Z),

where we used the fact that ‖C‖ = λ_max(C) = n_max(Z) ≤ n. The caveat here is that the matrix O need not be a permutation matrix.

To prove consistency of PACE we have to assume that the clustering algorithm A we use has some consistency properties. For example, it will suffice to assume that, for a randomly chosen subgraph S (under our subgraph selection procedure), E δ(^Z_S, Z_S) is small. The following is our main result on the expected misclustering rate of PACE.

The first term in (?) essentially measures the performance of the clustering algorithm we use on a randomly chosen subgraph. The second term measures how well the chosen subgraphs cover the full graph, and depends only on the subgraph selection procedure. The effect of the algorithm we use is felt through the first term only.

We can now specialize Theorem ? to various subgraph selection schemes. First, we consider randomly chosen m-subgraphs, which is an easy corollary.

Notice that the constant m/(θ(m − 1)) in the bound can be made as close to 1 as one desires, which means that the above bound is essentially optimal.

Full h-hop neighborhood subgraphs are much harder to analyze and will not be pursued here. However, ego networks, which are 1-hop neighborhoods minus the root node (see Figure ?(b)), are easy to deal with. One can also extend our analysis to h-hop onion neighborhoods, which are recursively defined as follows: O_1(v) = S_1(v) is just the ego network of vertex v; in general, the h-th shell is S_h(v) := ⊎_{u∈S_{h−1}(v)} [O_1(u) ∖ (O_{h−1}(v) ∪ {v})], and O_h(v) = O_{h−1}(v) ⊎ S_h(v), where the operation ⊎ denotes superposition of networks. Here, for ease of exposition, we choose to work with ego networks (1-hop onion neighborhoods).

We will now use existing consistency results on several clustering algorithms A, in conjunction with the above bounds, to see what conditions (on the model parameters, m, T, etc.) are required for PACE to be consistent. We first take as A the (1+ϵ)-approximate adjacency spectral clustering (ASP) of [18]. We use the stochastic blockmodel as the generative model and, for simplicity, assume that the link probability matrix has the following simple form

B = α_n[λI + (1 − λ)11⊤], with 0 < λ < 1.    (2)

We now quote a slightly modified version of Corollary 3.2 of [18] for this model.

The proof of Corollary ? follows from Corollary ? and an estimate of E δ(^Z_S, Z_S) given in (16), which is obtained using Lemma ?. In order for the first term of the bound to go to zero we need Kπ_max/(λ²mα_nπ²_min) = o(1), i.e. m ≫ Kπ_max/(λ²α_nπ²_min). Thus for balanced block sizes (i.e. π_max, π_min ≍ 1/K) we need m ≫ K²/(λ²α_n). So, qualitatively, for large K, small α_n, or a small separation between the blocks, m has to be large, which is natural to expect. In particular, for fixed K and λ, this shows that we need subgraphs of size m ≫ n/d_n, and T ≫ n²/m² of them, to achieve consistency (here the average degree is d_n ≍ nα_n). Let T = (n²/m²)·r_n and m = (n/d_n)·s_n, where both r_n, s_n → ∞. Let us see what computational gain we get from this. Spectral clustering on the full graph has complexity O(n³), while the complexity of PACE with spectral clustering is

O(Tm³) = O((n²/m²)·r_n·m³) = O(n²m·r_n) = O((n³/d_n)·r_n s_n).

So if d_n = Θ(n^α), then the complexity would be O(n^{3−α} r_n s_n), which is essentially O(n^{3−α}). When d_n = Θ(log n), the gain is small.

Note, however, that for a parallel implementation with M sources, each processing T/M of the subgraphs, we may get a significant further boost in running time, at least in terms of constants; the running time would be O((n³/(M d_n))·r_n s_n).

The proof of Corollary ? follows from Corollary ? and an estimate of E[δ(^Z_S, Z_S) 1(|S| ≥ m⋆)] given in (17), which is obtained using Lemma ?. For the right hand side of the bound to go to zero (assuming K fixed and balanced block sizes), we need min{nα_n², Tα_n²} → ∞. In terms of the average degree this means that we need d_n ≫ √n and T ≫ n²/d_n². That with ego neighborhoods we cannot go down to d_n = Θ(log n) is not surprising, since the ego networks are then rather sparse; one needs to use larger neighborhoods. Writing d_n = √n·r_n and T = (n²/d_n²)·s_n, where both r_n, s_n → ∞, the complexity of adjacency spectral clustering in this case becomes O(Td_n³) = O(n²d_n s_n) = O(n^{5/2} r_n s_n), and with M processing units it comes down further to O(n^{5/2} r_n s_n/M).

Although our analysis does not make clear why PACE with spectral clustering should work well in sparse settings, in numerical simulations we have found that in various regimes PACE with (regularized) spectral clustering vastly outperforms ordinary (regularized) spectral clustering (see Table 4).

It seems that the reason why PACE works well in sparse settings lies in the weights N_ij. With h-hop neighborhoods as the chosen subgraphs, if P_uv = 1{ρ_g(u,v) ≤ h}, where ρ_g is the geodesic distance on G, then N_ij = (P²)_ij. It is known that spectral clustering on the matrix of geodesic distances works well in sparse settings ([5]). PACE seems to inherit that property through N, although we do not presently have a rigorous proof of this.

We conclude this section with an illustration of PACE with random m-subgraphs using an SDP as the algorithm A. We use the setting of Theorem 1.3 of [12] for the illustration, stated here with slightly different notation. Let SDP-GV denote the following SDP [12]:

maximize ⟨A, X⟩ subject to X ⪰ 0, X ≥ 0, diag(X) = I_n, 1⊤X1 = 1⊤C1.

For the simple two-parameter blockmodel B = (1/n)((a − b)I + b·11⊤) with equal community sizes, we have g ≍ (a + (K − 1)b)/K ≍ d_n, the average degree of the nodes (note that d_n = (a + (K − 1)b)/K − a/n). The assumptions of Corollary ? are satisfied when

m = Ω(max{n/d_n, n·d_n/(ϵ²(a − b)²)}).

This is exactly analogous to what we saw for spectral clustering (take a = nα_n and b = nα_n(1 − λ)). In particular, when the average degree d_n = Θ(n^α) and a − b = Θ(n^α), we need m = Ω(n^{1−α}/ϵ²) and T ≫ n^{2α}/ϵ⁴ for PACE to succeed. However, in the bounded degree regime, the advantage is negligible, coming only from a potentially smaller constant, because we need m = Ω(n). Again, from our numerical results, we expect that with h-hop subgraphs PACE will perform much better.

We denote the unnormalized misclustering error between estimated labels ^Z and the true labels Z (^Z, Z ∈ {0,1}^{n×K}) of the same set of nodes by M(Z, ^Z) := nδ(Z, ^Z) = min_{Q perm.} ‖^Z − ZQ‖_0. Note that since ^Z and Z are binary, we have ‖^Z − ZQ‖_0 = ‖^Z − ZQ‖_1 = ‖^Z − ZQ‖²_F. As discussed earlier, the number of misclustered nodes is half this number.

The main idea of our algorithm is simple. Every approximately accurate clustering of a set of nodes is accurate only up to a permutation, which can never be recovered without the true labels. However, we can align a labeling to a permuted version of the truth, where the permutation is estimated from another labeling of the same set of vertices. This is done by calculating the confusion matrix between the two clusterings. We call two clusterings aligned if cluster i from one clustering has a large overlap with cluster i from the other clustering. If the labels are aligned and the clusterings agree, this confusion matrix will have large diagonal entries. This idea is used in the Match algorithm, where we estimate the permutation matrix to align one clustering to another.

Now we present our main result. We prove consistency of a slightly modified and weaker version of Algorithm ?. In Algorithm ?, at every step of a traversal, we apply the Match algorithm on the intersection of the current subgraph and the union of all previously aligned subgraphs to estimate the permutation of the yet-unaligned current subgraph. In the theorem presented below, however, we use the intersection between the unaligned current subgraph and the last aligned subgraph. Empirically it is better to use the scheme presented in Algorithm ?, since it increases the size of the intersection, which in turn requires weaker conditions on the clustering accuracy of any individual subgraph. We leave the analysis of this variant for future work.

The entries of ^Z_GALE will be fractions, but as we show in Lemma ?, rounding them to a binary matrix does not change the consistency properties.

Note that GALE depends on the spanning tree we use, and in particular on the traversal of that spanning tree. Let SpanningTrees_G be the set of all spanning trees of a graph G. For T ∈ SpanningTrees_G, let Traversals_T be the set of all traversals of T. Let ^Z^GALE_{T,(x_1,…,x_J)} be the outcome of GALE on the traversal (x_1,…,x_J) of T ∈ SpanningTrees_{S_{m,T}}.

Again, the constant θ can be taken as close to 1 as one desires. Thus the above bound is also essentially optimal.

We will now illustrate Theorem ? with several algorithms A. We begin with a result on (1+ϵ)-approximate adjacency spectral clustering.

We see that the first term is exactly the same as the first term in Corollary ?. For balanced graphs, this again imposes the condition m≫K2/(λ2αn). In particular, if K=Θ(1) and we are in a dense, well separated regime, with λ=Θ(1), αn=Ω(1/√n), then we need m=Ω(√nlogn). If K=Θ(1), λ=Θ(1) and αn=Θ(logn/n), then we need m≫n/logn. In both cases, we need T=Ω(nlogn/m). Thus, in the regime where the average degree grows like logn, there is still some computational advantage for very large networks (also factoring in parallelizability); however, for moderately sized networks, GALE may not lead to much computational advantage.

Now we present an exact recovery result with SDP as the base algorithm A. We shall use a result from [33] on an SDP which they call SDP-λ. Let κ:=πmax/πmin. Also let π denote the vector of cluster proportions (π1,…,πK).

Assuming that any subsequent clustering of the exactly recovered scaled clustering matrix Zdiag(nπ)−1Z⊤ gives back the exact clustering Z (for example, our distance based naive algorithm NaiveCluster will do this), we have the following corollary.

Note that, in the above bound, r can be taken to be greater than 1. This means that, with high probability, the proportion of misclustered nodes is less than 1/n and hence zero, leading to exact recovery. As for computational complexity, note that the separation condition, with n replaced by m, restricts how small m can be. Consider the simple SBM with balanced block sizes for concreteness. In this case, the separation condition essentially dictates, as in the case of spectral clustering, that m≫K2/(λ2αn). Thus the remarks made earlier on how large m or T should be chosen apply here as well.

As discussed earlier in Section 3, even a naive implementation of GALE will only result in an O(n3/2) running time in addition to the time (ηm,T) required to cluster the T random m-subgraphs, whereas a more careful implementation will only add a time to ηm,T that is nearly linear in T. Since SDPs are notoriously time intensive to solve, this gives us a big saving.

where I is the K dimensional identity matrix and J is the K×K matrix of all ones. Here ρn will be the degree density, and r will measure the relative separation between the within block and between block connection probabilities, i.e. p and q. If the blocks have prior probabilities πi,i=1,…,K, then the average degree d, under this model is given by

dn=(n−1)ρna((1−r)∑iπ2i+r).

In particular, if the model is balanced, i.e. πi=1/K for all i, then

dn=(n−1)ρna(1+(K−1)r)/K.
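As a quick numerical sanity check under this parametrization, the balanced formula is just the general one evaluated at πi=1/K (the values of n, ρn, a, r, K below are arbitrary illustrative choices, not from the paper):

```python
# Check that the balanced average degree (n-1)*rho*a*(1+(K-1)r)/K agrees
# with the general formula (n-1)*rho*a*((1-r)*sum_i pi_i^2 + r) at pi_i = 1/K.
n, rho, a, r, K = 1000, 0.05, 1.0, 0.2, 4
pi = [1.0 / K] * K

d_general = (n - 1) * rho * a * ((1 - r) * sum(p * p for p in pi) + r)
d_balanced = (n - 1) * rho * a * (1 + (K - 1) * r) / K

assert abs(d_general - d_balanced) < 1e-9
```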

In order to understand and emphasize the role of PACE and GALE in reducing computational time while maintaining good clustering accuracy, we use different settings of sparsity for different methods. For recovering ^Z from ^C in PACE, we have used random projection plus K-means (abbreviated as RPKmeans below), and spectral clustering (SC). We also want to point out that, for sparse unbalanced networks GALE may return more than K clusters, typically when a very small fraction of nodes has not been visited. However, it is possible that the unvisited nodes have important information about the network structure. For example, all subgraphs may be chosen from the larger clusters, thereby leaving the smallest cluster unvisited. We take care of this by computing the smallest error between the (K+1)! permutations of GALE’s clustering to the ground truth. This essentially treats the smallest cluster returned by GALE as misclustered. In real and simulated networks we have almost never seen GALE return a large number of unvisited nodes.

SDP with ADMM: Interior point methods for SDPs are not very fast in practice. We have solved SDPs using an ADMM based implementation of [32]. From Table ? we see that PACE and GALE significantly reduce the running time of SDP without losing much accuracy. In fact, if we use spectral clustering to estimate ^Z from ^C in the last step of PACE, we get zero misclustering error (ME).

Mean Field Likelihood: From Table 3 we see that our implementation of mean field on the full graph did not converge to an acceptable solution even after five and a half hours, while both PACE and GALE return much better solutions in about two minutes. In fact, with spectral clustering in the last step of PACE, the misclustering error is only 0.14, which is quite good. This raises the question of whether the improvement is due to spectral clustering alone. We show in the next simulation that in certain settings, even when spectral clustering is used as the base algorithm, PACE and GALE lead to significant improvements in terms of accuracy and running time.

Regularized Spectral Clustering: In sparse unbalanced settings, regularized spectral clustering with PACE and GALE performs significantly better than regularized spectral clustering on the full graph. In fact, with spectral clustering used in the last step of PACE, we can hit about 5% error or below, which is quite remarkable. See Table 4. In Section 5.2 we will see that PACE and GALE also add stability to spectral clustering (in terms of clustering degree 1 vertices).

Profile Likelihood with tabu search: Optimizing profile likelihood (PL) or likelihood modularity ([6]) for community detection is a combinatorial problem, and as such hard to scale, even if we ignore the problem of local minima. In Table 5 we compare running time of profile likelihood (optimized using tabu search) and its divide and conquer versions. We see that the local methods significantly cut down the running time of PL without losing accuracy too much.

We also applied profile likelihood on 5000 node graphs with 20 workers. Although PACE and GALE finished in about 22 minutes, the global method did not finish in 3 days. So, here we present results on 1000 node networks instead.

Table 5: PACE and GALE with Profile Likelihood (PL) as the base method. Simulation settings: n=1000, average degree =18.47, 2 unequal sized clusters with relative sizes π=(0.4,0.6), T=50, parallel implementation in Matlab with 12 workers. We sampled 2-hop neighborhoods by selecting their roots uniformly at random from nodes having degree greater than the 0.1th lower quantile (=12) of the degree distribution (average neighborhood size was 310). Ordinary PACE with such a scheme may be thought of as w-PACE, as discussed in Section .

This is a directed network (see Figure 1) of hyperlinks between 1490 blogs (2004) that are either liberal or conservative ([1]); ground truth labels are available for comparison: 758 blogs are liberal and 732 are conservative. We convert it into an undirected network by putting an edge between blogs i and j if there is at least one directed edge between them.

The resulting network has many isolated nodes and isolated edges. The degree distribution is also quite heterogeneous (so a degree-corrected model would be more appropriate). We focus on the largest connected component. We use Laplacian spectral clustering (row normalized, to correct for degree heterogeneity) with PACE.

Table 6: Misclustering rate in the political blog data. PACE was used with T=10, and h-hop neighborhoods with h=2, with roots chosen at random from high degree nodes.

Largest Conn. Comp.           RSC      RSC + PACE   SC       SC + PACE
With leaves (1222 nodes)      18.74%   6.79%        48.12%   6.55%
Without leaves (1087 nodes)   11.87%   4.23%        3.13%    3.86%

Table 7: Misclustering rate in the political blog data. GALE and PACE were used with T=50, and m=300 random subgraphs.

Largest Conn. Comp.           RSC      RSC + PACE   RSC + GALE   SC       SC + PACE   SC + GALE
With leaves (1222 nodes)      18.74%   13.34%       11.62%       48.12%   7.86%       5.81%
Without leaves (1087 nodes)   11.87%   12.8%        10.0%        4.23%    7.28%       6.7%

Tables 6 and 7 show that PACE and GALE add stability (possibly in the eigenvector computation) to spectral clustering. Indeed, with PACE and GALE we are able to cluster “leaf” vertices (i.e., vertices of degree 1) with significantly higher accuracy.

To summarize, we have proposed two divide-and-conquer type algorithms for community detection, PACE and GALE, which can lead to significant computational advantages without sacrificing accuracy. The main idea behind these methods is to compute the clustering for each individual subgraph and then “stitch” them together to produce a global clustering of the entire network. The main challenge of such a stitching procedure comes from the fundamental problem of unidentifiability of label assignments. That is, if two subgraphs overlap, the clustering assignment of a pair of nodes in the overlap may be inconsistent between the two subgraphs.

PACE addresses this problem by estimating the clustering matrix for each subgraph and then estimating the global clustering matrix by averaging over the subgraphs. GALE takes a different approach, using the overlap between two subgraphs to calculate the best alignment between the cluster memberships of nodes in those subgraphs. We prove that, in addition to being computationally much more efficient than base methods, which typically run in Ω(n2) time, these methods have accuracy at least as good as the base algorithm's typical accuracy on the type of subgraphs used, with high probability. Experimentally, we show something more interesting: we identify parameter regimes where a local implementation of a base algorithm based on PACE or GALE in fact outperforms the corresponding global algorithm. One example of this is the Meanfield algorithm, which typically suffers from bad local optima on large networks. Empirically, we have seen that on a smaller subgraph, with a reasonable number of restarts, it finds a local optimum that is often highly correlated with the ground truth. PACE and GALE take advantage of this phenomenon to improve accuracy and running time significantly. Another example is regularized spectral clustering on sparse unbalanced networks. We intend to investigate this further theoretically in future work.
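The PACE averaging idea described above can be sketched in a few lines (a minimal illustration under our own simplifications, not the authors' implementation; `base_cluster` stands in for any base algorithm, and a toy connected-components labeler plays that role here):

```python
# Minimal PACE sketch: cluster each subgraph, form its local clustering
# matrix C_hat(l)[i, j] = 1{i, j in same estimated cluster}, and average
# entrywise over subgraphs, keeping only pairs covered at least tau times.
import numpy as np

def pace(A, subgraphs, base_cluster, tau=1):
    n = A.shape[0]
    num = np.zeros((n, n))   # running sum of local clustering matrices
    cnt = np.zeros((n, n))   # N_ij: how often pair (i, j) fell in a subgraph
    for nodes in subgraphs:
        nodes = np.asarray(nodes)
        labels = np.asarray(base_cluster(A[np.ix_(nodes, nodes)]))
        same = (labels[:, None] == labels[None, :]).astype(float)
        num[np.ix_(nodes, nodes)] += same
        cnt[np.ix_(nodes, nodes)] += 1.0
    # Entries of the averaged matrix are fractions; a final clustering
    # step (e.g. spectral clustering on C_hat) recovers labels.
    return np.where(cnt >= tau, num / np.maximum(cnt, 1.0), 0.0)

# Toy base algorithm: label nodes by connected component.
def components(B):
    n, lab, c = B.shape[0], [-1] * B.shape[0], 0
    for s in range(n):
        if lab[s] < 0:
            stack, lab[s] = [s], c
            while stack:
                u = stack.pop()
                for v in range(n):
                    if B[u, v] > 0 and lab[v] < 0:
                        lab[v] = c
                        stack.append(v)
            c += 1
    return lab

# Two 3-cliques with no cross edges; both subgraphs cover all nodes.
A = np.kron(np.eye(2), np.ones((3, 3))) - np.eye(6)
C_hat = pace(A, [range(6), range(6)], components, tau=2)
```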

Finally, working with many subgraphs naturally leads to the question of self consistency of the underlying algorithm. This is often crucial in real world clustering problems with no ground truth labels available. We intend to explore this direction further for estimating model parameters like the number of clusters, as well as algorithmic parameters like the size and number of subgraphs, the number of hops to be used for the neighborhood subgraphs, etc. Currently, these are all picked a priori based on degree considerations. It may also be possible to choose between different models (e.g., standard blockmodels, degree corrected models, dot product models, etc.) by examining which model leads to the most self consistent results. We leave this for future work.

In conclusion, not only are our algorithms, to the best of our knowledge, the first ever divide-and-conquer type algorithms used for community detection, we believe that the basic principles of our methods will have a broad impact on a range of clustering and estimation algorithms that are computationally intensive.

But ∥Z′∥22 is the maximum eigenvalue of Z′⊤Z′ which is diagonal with its maximum diagonal entry being the size of the largest cluster under Z′. Thus ∥Z′∥22 equals the size of the largest cluster under Z′ and so is trivially upper bounded by n. Same goes for ∥ZQ⊤∥22. Therefore we get

∥C−C′∥F≤2√n∥ZQ−Z′∥F.

Squaring this, and taking infimum over all permutation matrices Q in the right hand side, we obtain the claimed inequality.
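The inequality can be spot checked numerically (assuming, as in the surrounding derivation, C=ZZ⊤ and C′=Z′Z′⊤ for one-hot membership matrices Z, Z′, and Q a permutation matrix):

```python
# Random spot check of ||C - C'||_F <= 2 sqrt(n) ||Z Q - Z'||_F, where
# C = Z Z^T, C' = Z' Z'^T, and Q is any K x K permutation matrix.
import numpy as np

rng = np.random.default_rng(0)
n, K = 30, 3
for _ in range(100):
    Z = np.eye(K)[rng.integers(0, K, n)]    # random one-hot memberships
    Zp = np.eye(K)[rng.integers(0, K, n)]
    Q = np.eye(K)[rng.permutation(K)]       # random permutation matrix
    lhs = np.linalg.norm(Z @ Z.T - Zp @ Zp.T)
    rhs = 2 * np.sqrt(n) * np.linalg.norm(Z @ Q - Zp)
    assert lhs <= rhs + 1e-9
```

Note that the bound holds for every permutation Q, which is why one may take the infimum over Q on the right hand side.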

Now we will prove Theorem ?. The proof will be broken down into two propositions. First we decompose

Note that Wij=1{Nij≥τ}/Nij≤1/τ. On the other hand, since the subgraphs were chosen independently using the same sampling scheme, the ∥^C(ℓ)−C(ℓ)∥F are identically distributed. Therefore, taking expectations we get

E∥E1∥2F≤(1/τ)∑Tℓ=1E∥^C(ℓ)−C(ℓ)∥2F=(T/τ)E∥^C(S)−C(S)∥2F,

where S is a randomly chosen subgraph under our subgraph selection scheme.

Since −(E2)ij=Cij1{Nij<τ}, we have ∥E2∥2F=∑i,jC2ij1{Nij<τ}, and by taking expectations we get

For this sampling scheme |S|=m≥m⋆, and with p=m(m−1)/(n(n−1)) we have Nij∼Binomial(T,p), so that, using the Chernoff bound for the binomial lower tail,

P(Nij<θTp)≤exp(−(1−θ)2Tp/2).
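This tail bound is easy to check by simulation (the values of n, m, T, θ below are illustrative, not from the paper):

```python
# Empirical check of the binomial lower-tail Chernoff bound
# P(N < theta*T*p) <= exp(-(1-theta)^2 * T*p / 2) for N ~ Binomial(T, p),
# with p = m(m-1)/(n(n-1)) the probability that a fixed pair (i, j) lands
# in a uniformly random m-subgraph of an n-node graph.
import math

import numpy as np

rng = np.random.default_rng(1)
n, m, T, theta = 200, 60, 400, 0.5
p = m * (m - 1) / (n * (n - 1))

N = rng.binomial(T, p, size=200_000)
empirical = np.mean(N < theta * T * p)
bound = math.exp(-(1 - theta) ** 2 * T * p / 2)
assert empirical <= bound
```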

Finally, we get ( ?) by plugging in these parameter values and estimates in ( ?).

The most crucial thing to observe here is that if one removes the root node and its adjacent edges from a 1-hop neighborhood, then the remaining “ego network” has again a blockmodel structure. Indeed, let S be a random ego neighborhood of size ≥s with root R, i.e. V(S)={j:ARj=1}. Then conditional on V(S) being R’s neighbors, and the latent cluster memberships, edges in E(S) are independently generated, i.e. for j,k,ℓ,m∈V(S), and s,t∈{0,1}, we have

P(Ajk=s,Aℓm=t|S,Z)=P(Ajk=s|Z)P(Aℓm=t|Z).

This is because the “spoke” edges ARj are independent of Aj,k,j,k∈V(S). Therefore, conditional on S, this ego-subgraph is one instantiation of a block model with the same parameters on |S| vertices.

Now for ego networks, y(ℓ)ij∼Bernoulli(nij/n), where nij is the total number of ego-subgraphs containing both i and j. Notice that

nij=∑ℓ≠i,j1{Aiℓ=1,Ajℓ=1},

that is, nij is the sum of (n−2) independent Bernoulli random variables

1{Aiℓ=1,Ajℓ=1}∼Bernoulli(Bσ(i)σ(ℓ)Bσ(j)σ(ℓ)).

So

(n−2)B#2≥Enij=∑ℓ≠i,jBσ(i)σ(ℓ)Bσ(j)σ(ℓ)≥(n−2)B⋆2,

and we have, by the Chernoff bound, that

P(nij≤(n−2)B⋆2−Δ)≤exp(−(Enij−(n−2)B⋆2+Δ)2/(2Enij))≤exp(−Δ2/(2(n−2)B#2)). (6)

In order to apply Theorem ? we need the following two ingredients, which we will now work out.

(i) an estimate of |S|, and

(ii) an estimate of P(Nij<τ).

(i) Estimate of |S|. Note that |S|=∑kAkR. So

P(|S|<m⋆)=ERP(|S|<m⋆|R).

Since (n−1)B⋆≤E(|S||R)=∑k≠RBσ(k)σ(R)≤(n−1)B#, and AkR, 1≤k≤n, are independent, we have, by Chernoff’s inequality, that