Abstract

We conduct a comparative analysis on various estimates of the number of clusters in community detection.
An exhaustive comparison requires testing of all possible combinations of frameworks, algorithms, and assessment criteria.
In this paper we focus on the framework based on a stochastic block model, and investigate the performance of greedy algorithms, statistical inference, and spectral methods.
For the assessment criteria, we consider modularity, map equation, Bethe free energy, prediction errors, and isolated eigenvalues.
From the analysis, the tendency of overfit and underfit that the assessment criteria and algorithms have, becomes apparent.
In addition, we propose that the alluvial diagram is a suitable tool to visualize statistical inference results and can be useful to determine the number of clusters.

Community detection is a coarse graining process for networks.
Whereas the original dataset, given as a network, possesses information that is quite rich, it is often too microscopic to have its important structures interpreted.
For better interpretability, a community detection algorithm summarizes (i.e., clustering or partitioning) the dataset by aggregating the vertices and edges of densely connected components.
That is, the detailed relational information of similar vertices is discarded, while preserving an important macroscopic structure.
A set of aggregated vertices is regarded as a cluster, or a community.

A straightforward approach, or a framework, for community detection is to optimize an objective function that evaluates the quality of clustering.
Another popular approach is based on statistical inference and considers a generative model of a network. It is often formulated using the so-called stochastic block model Holland et al. (1983); Wang and Wong (1987); Karrer and Newman (2011) as a generative model, which is a random graph with a modular structure. Therefore, the community structure can be inferred by fitting the network to the model.
While these two approaches may seem very different, the former can sometimes be formulated as a limiting case (zero-temperature limit in physics terminology) of the latter, and in this paper, it is mainly explained in terms of the latter approach.

When community detection is performed, the number of clusters q∗ needs to be determined. In other words, the complexity of the model, i.e., the model selection needs to be specified.
This process consists of determining the partition that describes the modular structure most efficiently, or to evaluate whether the obtained partition is statistically significant.
Just as the quality of the q-way partition varies depending on the very definition of a cluster, the appropriateness of the number of clusters also varies depending on the principle followed.
Meanwhile, it is difficult to decide on which principle to apply, given a dataset.
Therefore, it is important to investigate the typical behavior and biases of each criterion.
For example, some criteria may behave very differently from others in some cases, or some criteria may be more sensitive to the accuracy of a particular algorithm.
Moreover, the dangers of underfit and overfit are often not symmetric.
In the case of community detection, it is often safer to underfit than to overfit, because the former only implies a different level of coarse graining, while the latter implies the detection of fictitious small clusters.

In this paper, a comparative analysis of various criteria that estimate q∗ is conducted.
This analysis is distinct from other comparative analyses in the following sense.
Whereas the performance of the community detection completely depends on the framework (or objective function), algorithm, and assessment criterion used, it is often the case that a specific combination of them is employed in a benchmark test.
For example, Infomap, a greedy algorithm for the map equation Rosvall and Bergstrom (2008, 2011), is a popular algorithm and frequently appears in benchmark tests. However, when the performance of a certain objective function or an assessment criterion is compared with the map equation, it is fair to use a common algorithm.
For this reason, the performance of various assessment criteria using the same statistical inference algorithm is compared.
In addition, the performance of the same assessment criterion on the same dataset using different algorithms is investigated.
Therefore, when a criterion is ill-behaved, it can be argued whether it is due to the criterion itself, or the algorithm used.

As can be observed below, sometimes, the validation curves of assessment criteria change very gradually.
In such a case, it is not easy to determine the plausible value of q∗ from the assessment criteria, and a finer inspection of the partition actually obtained is required.
For this purpose, a visualization technique, called the alluvial diagram Rosvall and Bergstrom (2010) is proposed as a suitable tool; not only because of the way the network is partitioned, but also because it allows the significance level to be evaluated from an inference algorithm.

The remainder of the paper is organized as follows.
First, stochastic block models are defined in Sec. II to set the basic framework.
Second, the algorithms used to determine the cluster assignments and for the estimate of q∗ in Sec. III are introduced.
Third, the assessment criteria of the number of clusters q∗ is explained in Sec. IV.
Then, in Sec. V, the results of the comparative analyses are shown.
In Sec. VI, how the alluvial diagram helps determine the number of clusters is explained.
Finally, Sec. VII is devoted to the summary and discussion.

Community detection based on a stochastic block model is considered to be the basic framework.
The sets of vertices and edges are denoted as V and E, respectively.
Their respective cardinalities are referred to as N and L.

ii.1 Standard stochastic block model

The most standard version of stochastic block models is constructed as follows.
We first consider a set of vertices V without edges.
For each vertex, we randomly specify the cluster assignment σi∈{1,…,q}, where i is the index of a vertex and the number of clusters q is given as an input.
The probability of the cluster size can also be specified.
It is a prior distribution of the cluster assignments and is expressed by a multinomial distribution ∏iγσi.
Then, the edges are generated randomly according to the vertex pair’s cluster assignment, where the connection probability is specified by an element of the q×q affinity matrix ω; the edge probability distribution of a vertex pair is given by the Bernoulli distribution.
Thus, the likelihood of the stochastic block model is given as

p(A,\boldmathσ|\boldmathω,%\boldmath$γ$,q)

=p(A|\boldmathσ,\boldmathω,q)p(\boldmathσ|\boldmathγ,q)

=∏iγσi∏i<jωAijσiσj(1−ωσiσj)1−Aij.

(1)

When a higher connection probability is provided for pairs of vertices with the same cluster assignment (as compared to pairs of vertices with different cluster assignments), i.e., ωσσ>ωσσ′ (σ≠σ′), a set of random graphs with a community structure can be generated.
Generating the stochastic block model is a forward problem, and community detection is its inverse problem, i.e., the inference of γ, ω, and σ.

Whereas the standard stochastic block model is very flexible, it is often not suitable to fit real-world networks, mainly because it can only have a binomial degree distribution, which is not true in many datasets.
To resolve this problem, the so-called degree-corrected stochastic block model Karrer and Newman (2011) was proposed.
Following Ref. Karrer and Newman (2011), the Bernoulli distribution for the edge probability of each vertex pair was approximated with the Poisson distribution, which is justified when the network is sparse.
In this model, it is assumed that the mean of the Poisson distribution depends on the degrees of the vertex pair as well as on the affinity matrix.
Hence, for a given affinity matrix ω and the number of clusters q, the likelihood of the degree-corrected stochastic block model is given as

p(A,\boldmathσ|\boldmathω,q)

=p(A|\boldmathσ,\boldmathω,q)p(\boldmathσ|q)

=∏i<j(diωσiσjdj)Aije−diωσiσjdj,

(2)

where di is the degree of vertex i.
Here and hereafter, the uniform prior distribution for p(\boldmathσ) was considered for simplicity. Moreover, we neglected the existence of self-loops, which is also justified when the network is sparse.
The log likelihood reads

logp(A,\boldmathσ|\boldmathω)

=∑i<j[Aijlog(diωσiσjdj)−diωσiσjdj],

(3)

where we neglected a constant term.

While the stochastic block model of Eq. (3) is able to express various modular structures, hereafter, we restrict our interest to the community structure.
The affinity matrix ω is then restricted to the form

ωσσ′={ωin(σ=σ′),ωout(σ≠σ′).

(4)

In other words, only whether a vertex pair is assigned to the same cluster or not is distinguished.
This restriction to the inference algorithm using BP was proposed in Ref. Zhang and Moore (2014).
One of the reasons why this restriction is employed is because it can considerably reduce the computational cost, so that performance on large networks with many clusters can be evaluated.
Another reason is because some assessment criteria compared in this paper are specialized to the community structure, and their generalizations to general modular structures are not known.

iii.1 Statistical inference

The goal of community detection is to determine the set of cluster assignments σ; it is a hidden variable, and when the model parameter ω is learned for a given number of clusters q, fitting the network for the stochastic block model is carried out by maximizing the marginal log likelihood or equivalently, minimizing the free energy,

f(\boldmathω,q)

=−log∑\boldmathσp(A,\boldmathσ|\boldmathω,q).

(6)

Unfortunately, this optimization problem is computationally difficult in general, and a number of approximate methods have been proposed in the literature.
The EM algorithm is employed in this paper, which is a popular method for fitting the stochastic block model.
To obtain the minimum of the free energy, the EM algorithm iteratively optimizes the distribution of the hidden variable σ with a fixed model parameter ω (E step) and the optimization of ω with a fixed distribution of σ (M step).
For the E step, we use the BP algorithm which will be explained later in this section.
Thus, we obtain the probability distribution of the cluster assignment for each vertex, such that Eq. (6) is expected to be minimized.
Hereafter, we often omit the number of clusters q in the argument, which is always given as an input; we try various values of q for model assessment.

When the affinity matrix ω is fixed as a constant in the E step, the free energy reads

f(\boldmathω,q)

=const.−log∑\boldmathσe2LβQ(\boldmathσ),

(7)

where

Q(\boldmathσ)=12L∑i<jδσiσj[Aij−αdidj2L],

(8)

α=2Lβ(ωin−ωout),β=logωinωout,

(9)

are the modularity function Q(\boldmathσ), resolution parameter αReichardt and Bornholdt (2006), and inverse temperature β, respectively.
This indicates that modularity maximization can be regarded as a special case of the inference using the stochastic block model; the partition with the maximum modularity coincides with the result of the statistical inference when the entropic effect is ignored, or β→∞. This is known as the maximum a posteriori (MAP) estimate Nishimori (2001).
The connection between likelihood maximization and modularity maximization was first discussed in Ref. Newman (2013) for q=2 in the context of spectral graph partitioning; the above relation was pointed out in Ref. Zhang and Moore (2014), which discusses a finite temperature formulation of the modularity maximization.
It is known that β also plays the role of a resolution parameter Schülke and Ricci-Tersenghi (2015) that controls the typical scale of clusters.

In Refs. Zhang and Moore (2014); Schülke and Ricci-Tersenghi (2015), α is set to unity and β is treated as an input parameter, which corresponds to fitting a network with a fixed affinity matrix.
However, it is more natural to learn them instead.
The learning of ωin and ωout can be carried out in a straightforward manner.
They are obtained as the values that minimize the free energy, Eq. (7).
The derivatives with respect to the model parameters Foo (a) yield

^ωin

=∑(i,j)∈E⟨δσiσj⟩∑i<jdidj⟨δσiσj⟩,

(10)

^ωout

=∑(i,j)∈E(1−⟨δσiσj⟩)∑i<jdidj(1−⟨δσiσj⟩),

(11)

where δσσ′ is the Kronecker delta, ⟨⋯⟩ is the average with respect to the current estimate of the posterior distribution p(\boldmathσ|A,\boldmathω) and the hat notation indicates the estimated quantity.
Let nσ=∑i⟨δσσi⟩ denote the number of nodes within cluster σ.
As mentioned in Ref. Zhang et al. (2015), if we assume that p(\boldmathσ|A,\boldmathω) is the distribution that prevents nσ from fluctuating significantly, i.e., ,

∑i<jdidj⟨δσiσj⟩

≈12∑σ⟨∑idiδσσi⟩2.

(12)

We also assumed that the overcounting for i=j in the sum is negligible.
Then, Eqs. (10) and (11) can be approximated as

^ωin

=2∑σ∑(i,j)∈E⟨δσσiδσσj⟩∑σ(∑idi⟨δσσj⟩)2,

(13)

^ωout

=2L−∑σ∑(i,j)∈E⟨δσσiδσσj⟩(2L)2−∑σ(∑idi⟨δσσj⟩)2.

(14)

Note that the update of model parameters only costs linear time; therefore, it is not a bottleneck in the algorithm.

To evaluate the probability of cluster assignment for each vertex, BP is used (see Refs. Mézard and Montanari (2009); Decelle
et al. (2011b, a); Zhang and Moore (2014) for details).
The marginal probability ψiσi of vertex i’s cluster assignment σi can be obtained by marginalizing the likelihood Eq. (II.2). Using the tree approximation, which is valid for sparse networks, it can be expressed as

ψiσi=1Zi∏k∈∂i[∑σkψk→iσkeβδσiσk]∏ℓ∉i∪∂i[∑σℓψℓ→iσℓe−αβdidℓ2Lδσiσℓ]

=1Zi∏k∈∂i[1+ψk→iσi(eβ−1)]∏ℓ∉i∪∂i[1+ψℓ→iσi(e−αβdidℓ2L−1)],

(15)

where ∂i denotes the neighboring vertices of i.
In Eq. (15), ψk→iσi indicates the cavity bias from vertex k to vertex i, that is, the marginal probability of k without the marginalization from i, and Zi is the normalization factor.
Assuming that αβdidℓ/2L=O(N−1), we can further approximate ψiσi and ψi→jσi as

ψiσi

≈1Zie−αβ2Ldiθσi∏k∈∂i[1+ψk→iσi(eβ−1)],

(16)

ψi→jσi

≈1Zi→je−αβ2Ldiθσi∏k∈∂i∖j[1+ψk→iσi(eβ−1)],

(17)

respectively, where θσ≡∑ℓdℓψℓσ.
The cavity bias ψi→jσi is normalized by Zi→j (see Ref. Zhang and Moore (2014) for details).
Using these quantities, we can estimate the elements in Eqs. (13) and (14) as

∑idi⟨δσσi⟩=∑idiψiσ=θσ,

(18)

∑σ⟨δσσiδσσj⟩=∑σ1Zijψi→jσdiωindjψj→iσ

(19)

It should be noted that the iteration of Eqs. (13), (14), (16), and (17) does not minimize the free energy Eq. (6) itself, but minimizes its approximated quantity called the Bethe free energy. We show the specific form of the Bethe free energy in Sec. IV.1.

The critical values of β for the stochastic block model have been discussed in Refs. Zhang and Moore (2014); Schülke and Ricci-Tersenghi (2015).
There are three phases of state, depending on the value of β and the strength of the community structure: the retrieval phase, paramagnetic phase, and spin-glass phase.
In the retrieval phase, the fixed point of BP with the minimum Bethe free energy correctly indicates the community structure.
In the paramagnetic phase, BP converges to the so-called factorized state as the minimum of the Bethe free energy.
In the factorized state, for any vertex i, the marginal probability distribution of the cluster assignment ψiσ has a uniform distribution, ψiσ=1/q. In other words, any vertex has an equal probability of joining any cluster. Therefore, the resulting partition does not exhibit any community structure.
Finally, the spin-glass phase is the phase in which BP typically does not converge. This is also the case in which the statistically significant community structure cannot be retrieved.
In the case of the standard stochastic block model with equal size clusters, for a given number of clusters q∗, the critical value of β between the paramagnetic phase and the spin-glass phase obtained by the stability of the factorized state against a random perturbation is

β∗

=log(q∗√c−1+1),

(20)

where c is the average degree.
The lower bound estimate of β that prevents BP from going into the paramagnetic phase is given by:

β0=log(q∗c−1+1).

(21)

In practice, it cannot be uniquely determined whether BP belongs to the retrieval phase, paramagnetic phase, or spin-glass phase, because real-world networks do not precisely emulate the stochastic block model.
However, they work as the reference values of β to obtain an intuition regarding which phase BP belongs to.
In Ref. Zhang and Moore (2014), it is suggested that β=β∗ should be used as an input, because BP is expected to belong to the retrieval phase with this value.

The effect due to the absence of model-parameter learning can be interpreted as follows.
Given that the model only distinguishes whether a pair of vertices is in the same cluster or not, the specific values of ωin and ωout may not be so crucial for the resulting cluster assignment.
Conversely, when other statistical quantities such as likelihood or cross-validation errors are considered, erroneous model-parameter estimates may cause a large bias.
As we observe in Sec. V, the results of the criteria that depend only on cluster assignments (e.g., modularity and minimum description length of the map equation) are not very sensitive to model-parameter learning, while the criteria that utilize the model parameters (e.g., the Bethe free energy and cross-validation errors) are ill-behaved without learning.

iii.2 Greedy algorithms

In the previous section, the free energy minimization based on the stochastic block model has been considered.
In the limit of β→∞, it reduces to the maximization of the modularity function Q(\boldmathσ) in Eq. (8), or the energy minimization.
In this case, the probability distribution with respect to σ is no longer considered, and our goal here is to find the best cluster assignment for each vertex.

While a number of algorithms have been proposed in the literature, perhaps, greedy algorithms, such as the Louvain method Blondel et al. (2008), are the most widely used in practice.
Another greedy algorithm for community detection that we analyze is the Infomap Rosvall and Bergstrom (2008), which optimizes the map equation Rosvall and Bergstrom (2008) (see Sec. IV.3 for the details of the map equation).
In such algorithms, we assign a unique cluster label for each vertex at the beginning, i.e., q=N, and merge and update their assignments as referring to the neighboring vertices to achieve a higher or lower value of the objective function, e.g., Q(\boldmathσ).
Note that the number of clusters is also determined automatically during the optimization process.
Although these greedy algorithms are extremely fast, as we will observe in Sec. V.1, they tend to largely overfit when the algorithm is trapped in a local extremum of the energy landscape.
The situation is very severe particularly when the landscape is glassy Good et al. (2010).

iii.3 Spectral methods

Figure 1: Relationships with the modularity function, normalized cut, and the corresponding spectral methods in the case of bisection.

Other commonly used algorithms are spectral methods.
In this section, the focus is on the case of q=2 (bisection).
Let us consider the case of modularity maximization.
While maximizing Q(\boldmathσ) is originally a discrete optimization problem, the assignments σ are relaxed to a real vector \boldmathx∈RN with a spherical normalization constraint, i.e.,

max\boldmathxQ(\boldmathx)∑ix2i=N.

(22)

Here, Q(\boldmathx)=\boldmathx⊤Q\boldmathx and Q is a matrix the element of which is given as

Qij=Aij−αdidj2L.

(23)

This matrix is known as the modularity matrix Newman (2006b, a).
If Eq. (22) is rewritten using the Lagrange multiplier, an eigenvalue problem is obtained with respect to Q and the leading eigenvector is expected to be correlated to the optimum assignments.
Although Q is a dense matrix, because it is due to the rank 1 matrix of the second term in Eq. (23), its leading eigenvalues and eigenvectors can be computed efficiently using the power iteration Git ().

A more classical version of a spectral method is the one based on the minimization of an objective function called the normalized cut Shi and Malik (2000).
The normalized cut fNcut is defined as, for σi∈{1,2},

fNcut(\boldmathσ)

∝∑i,jAijδσi,1δσj,2(∑i(σi=1)di)(∑j(σj=2)dj).

(24)

Analogous to the case of modularity, the continuous relaxation with the spherical normalization constraint yields the eigenvalue problem with respect to the normalized Laplacian L, which is defined as:

L

=I−D−1/2AD−1/2,

(25)

where I is the identity matrix and Dk≡diag{dk1,…,dkN} (see Ref. Luxburg (2007) for details).

The optimizations of the normalized cut and modularity look different. However, it is known that minimizing the normalized cut fNcut(\boldmathσ) is equivalent to maximizing modularity Q(\boldmathσ) with a special choice of the resolution parameter αKawamoto and
Kabashima (2015a) at the level of discrete optimization. Moreover, when the problem is relaxed to a continuous optimization, it is also possible to formulate the modularity maximization as the eigenvalue problem of the normalized Laplacian Newman (2013); it can be done by imposing a degree-dependent normalization constraint instead of the spherical normalization constraint and setting the resolution parameter α to unity. These relationships are summarized in Fig. 1.

In general, the leading q eigenvectors are expected to be correlated to the optimum q-way partition. Thus, to determine the assignment of each vertex, those eigenvectors have to be rounded, e.g., using the K-means method. However, the focus of this paper is only on the eigenvalues, because they are sufficient to estimate the number of clusters.

While the above two spectral methods are based on energy minimization, a spectral method related to the Bethe free energy minimization was proposed in Ref. Krzakala et al. (2013).
The matrix that appears in this method is called the non-backtracking matrix B, and is derived from the linear stability analysis of the BP algorithm that minimizes the Bethe free energy.
The non-backtracking matrix B is not a symmetric matrix, and its specific form is given as:

B=(0D−I−IA).

(26)

A symmetric variant of the non-backtracking matrix is also proposed in Ref. Saade et al. (2014a).

In Sec. IV.4, we will discuss the properties of the spectra of the above matrices and their use as the assessment criteria of the number of clusters.

In this section, we explain the assessment criteria of the number of clusters q∗.
To determine it using an algorithm in which q is given as an input, we assess the quality of the clustering based on a criterion, as we sweep the value of q.
It should be noted that, although the input value of the number of clusters, namely, the maximum number of clusters that the vertices can be assigned to, is q, the resulting partition might have less than q clusters.

iv.1 Bethe free energy

In the present framework of statistical inference, the most natural assessment is to measure the free energy and observe its saturation as an increment of the number of clusters q.
When the network is generated by a block model with q∗ clusters, the marginal likelihood will not be increased for q>q∗.
Therefore, we expect that a parsimonious number of clusters can be selected from the saturation of the free energy.

As mentioned above, the algorithm using BP does not minimize the free energy itself.
Instead, it minimizes the Bethe free energy as an approximated quantity, and it can be written in terms of the cavity bias ψi→jσ and affinity matrix ω as follows.

fBethe

=−1βN⎛⎝∑ilogZi−∑(i,j)∈ElogZij−∑(i,j)∉Elog~Zij⎞⎠,

(27)

where

Zi

=∑σe−dihσi∏k∈∂i[∑σkψk→iσkdkωσkσidi],

(28)

Zij

=∑σσ′ψi→jσdiωσσ′djψj→iσ′

for (i,j)∈E,

(29)

~Zij

=∑σσ′ψi→jσ(1−diωσσ′dj)ψj→iσ′

for (i,j)∉E,

(30)

hσ

=∑kdk∑σkψkσkωσkσ.

(31)

Some simple algebra shows that the Bethe free energy here fBethe is related to the Bethe free energy in Ref. Zhang and Moore (2014) (which we refer to as fmodBethe) as

fBethe

=fmodBethe+C(ωout),

(32)

C(ωout)

=−c2β(2Lωout+logωout+1L∑idilogdi).

(33)

iv.2 Modularity

While modularity appeared as an objective function with a fixed q in Sec. III.1, it was originally defined as an assessment criterion of the number of clusters Newman and Girvan (2004).
In modularity, the strength of a community structure is measured by comparing the actual network and a randomized network in each cluster.
Although the performance of modularity is not considered state of the art, it has been extensively studied and used as a baseline in many benchmark tests.

Precisely speaking, while the sum is taken over every vertex pair (i,j) (i<j) in Eq. (8), the sum is taken over all possible combinations of vertices (including the case i=j) in the original definition, although this does not cause a qualitative difference unless the self-loops are significant.
The modularity function of Eq. (8) with the partition obtained by free energy minimization is sometimes distinguished as the retrieval modularity. However, it is referred to as modularity in this paper for simplicity.
The partition is selected with a maximum modularity, or the parsimonious one among the partitions with a high modularity.

From the view point of statistical inference, the modularity maximization corresponds to a maximum likelihood estimate; i.e., there is no penalty term.
In principle, it can still assess the number of clusters because the degrees of freedom of the affinity matrix ω are restricted as in Eq. (4), and thus, the model with a larger q does not contain the model with a smaller q as a subset.
In addition, if we tune the resolution parameter α, the likelihood varies, and the optimum value q∗ changes.

iv.3 The map equation

Another popular criterion is the map equation Rosvall and Bergstrom (2008, 2011), in which the strength of the community structure is measured in terms of the minimum description length of a random walker.
The map equation encodes the moves of a random walker on a given network using multiple codebooks.
Specifically, it considers a codebook that encodes moves between clusters, as well as codebooks that encode moves within each cluster.
Given that the codewords of different codebooks can be overlapped, a proper assignment of clusters will compress the description length of a random walker. Moreover, by using the codebooks of superclusters, i.e., the clusters of clusters, its hierarchical extension can be performed naturally.
The map equation also has an interesting feature in that it allows for the consideration of flow information, e.g., the directedness of edges, although we do not address this point in this paper (see Refs. Rosvall and Bergstrom (2008, 2011) for more details).

The excellent performance of the map equation and its greedy implementation (Infomap) has been shown in numerous articles.
As with modularity, one can use the minimum description length of the map equation for model assessment only and perform community detection based on another objective function.
It should be noted that the characterization of a cluster in the map equation is not equivalent to that of the stochastic block model.
However, when densely connected components exist in a network, the minimum description length of a random walker is further compressed by clustering them; thus, it is expected that an optimal partition in the sense of modularity is also a good partition in the sense of the map equation.

It is also debatable whether we should consider the hierarchical nature of the map equation Rosvall and Bergstrom (2011).
The map equation is naturally formulated as a hierarchical clustering, and the fundamental two-level method can be regarded as a truncation of the general multilevel method.
Nevertheless, we measure the minimum description length of the two-level method and compare it with other model assessment criteria, because it is not always possible to measure the minimum description length in the sense of the multilevel map equation.
Whereas the multilevel map equation assumes a hierarchical structure, for example, in the case of partitions using the inference algorithm considered in this paper, each partition with different values of q is independent and is not constrained to constitute a hierarchical structure.

iv.4 Spectral band

The spectra of matrices Q, L, and B, in Sec. III.3 can be used to estimate the number of clusters.
In the case of a uniform random graph, in the infinite graph size limit, the spectrum of a corresponding matrix exhibits a non-zero spectral density within a finite range.
In other words, the spectral band can be observed, as exemplified in Fig. 2; it is often referred to as the semicircle lawMehta (2004) in the case of a symmetric matrix.
The spectral band stems purely from the random nature of a network, and if a characteristic structure in a network exists, the eigenvalues outside of the spectral band, i.e., isolated eigenvalues, will be observed.
As we mentioned in Sec. III.3, because the leading eigenvectors are expected to be correlated to the optimum partition, the number of statistically significant clusters can be estimated by counting the isolated eigenvalues.

Figure 2: (Color online) A part of the histogram of the eigenvalues of the modularity matrix for the standard stochastic block model.
The network has N=2000 with two equally sized planted clusters and an average degree equal to 6 . The dashed line indicates the estimate of the boundary of the spectral band calculated by the result in Ref. Kawamoto and
Kabashima (2015a).

While it is sometimes possible to evaluate the boundary of the spectral band by visual inspection, it is not trivial and it is preferable to have its estimate.
The estimate of the boundary of the spectral band of the modularity matrix Q was first derived in Ref. Nadakuditi and Newman (2012), the estimate was then generalized for random networks with arbitrary expected degrees Nadakuditi and Newman (2013), and arbitrary degree sequences Zhang et al. (2014).
These results, however, assume that the average degree is sufficiently large.
A result that is applicable to sparse networks is derived in Ref. Kawamoto and
Kabashima (2015a); although it is still a mean-field result, it yields the estimate for a random network with an arbitrary degree sequence and it is exact when the network is regular.

Although the boundary estimate of the spectral band of the normalized Laplacian L is also possible using the mean-field approximation Kawamoto and
Kabashima (2015b), it is known that L seriously suffers from the emergence of localized eigenvectors; those localized eigenvectors consist of a few elements with very large values and most of the elements are close to zero.
These localized eigenvectors are not correlated to the optimum partition and can deteriorate the estimate of the number of clusters.
For this reason, we do not pursue the assessment using the spectrum of L in this paper.

On the other hand, the non-backtracking matrix B tends to avoid the emergence of the localized eigenvectors, and its spectral band has a clear boundary at √ρ(B)Saade et al. (2014b), where ρ(B) is the spectral radius of B.
As seen in Sec. V.2, the assessment using the non-backtracking matrix performs well in many cases.
Note that, however, the non-backtracking matrix is not completely free from the localized eigenvectors Kawamoto (2016); Pastor-Satorras and
Castellano (2016).

iv.5 Prediction errors

Finally, we explain the cross-validation estimates of prediction errors, which are also useful to estimate the number of clusters (see Ref. Kawamoto and
Kabashima (2017) for the detailed derivations).
Although evaluating cross-validation errors is computationally demanding in general, the leave-one-out cross-validation (LOOCV) is an exceptional case and the corresponding errors can be obtained efficiently using the result of BP Kawamoto and
Kabashima (2017).

We consider four types of cross-validation errors, the Bayes prediction error, Gibbs prediction error, MAP estimate of the Gibbs prediction error, and Gibbs training error.
We refer to A∖(i,j) as the adjacency matrix without the knowledge of Aij.
Given A∖(i,j), the cluster assignment probability of i and j is

p(σi,σj|A∖(i,j))=ψi→jσψj→iσ′.

(34)

Then, the prediction probability ^p(Aij=1|A∖(i,j)) that i and j are connected is:

^p(Aij=1|A∖(i,j))

=∑σi,σj^p(Aij=1|σi,σj)p(σi,σj|A∖(i,j))

=∑σσ′ψi→jσdiωσσ′djψj→iσ′=Zij.

(35)

Note that the two-point partition function Zij is the normalization factor in Eq. (19) and is not equivalent to the two-point partition function defined in Ref. Zhang and Moore (2014), which does not have a probabilistic interpretation.
The cross-entropy error function with respect to ^p(Aij|A∖(i,j)), which is referred to as the Bayes prediction error of LOOCV, EBayes, is:

EBayes(q)

=−1L∑i<j[Aijlog^p(Aij=1|A∖(i,j))

+(1−Aij)log(1−^p(Aij=1|A∖(i,j)))].

(36)

Using the fact that ωσσ′=O(N−1), it can be approximated as:

EBayes(q)

≃1−1L∑(i,j)∈ElogZij,

(37)

where we neglected the O(N−1) term.
The Bayes prediction error EBayes should be the appropriate choice for assessing models in terms of the predictability of edges when the network is generated by the stochastic block model.
However, this is often not the case.
Hence, the Gibbs prediction error EGibbs is considered, which is a rough estimate of the prediction error compared to EBayes.
While the probability with respect to σi and σj is marginalized when the cross-entropy error function is measured in EBayes, a specific choice is made regarding σi and σj first, and the average is taken later in EGibbs.
Thus, we have:

EGibbs(q)

≃1−1L∑(i,j)∈E∑σi,σjp(σi,σj|A∖(i,j))log(^p(Aij=1|σi,σj))

=1−1L∑(i,j)∈E∑σi,σjψi→jσiψj→iσjlog(diωσiσjdj)

=−βL∑(i,j)∈E∑σψi→jσψj→iσ−logωout+const.,

(38)

where we again neglected the O(N−1) term.
By replacing ψi→jσ with the delta function that has a peak at argmaxσψi→jσ, the MAP estimate of the Gibbs prediction error is obtained, which is referred to as EMAP.

The Gibbs training error Etraining can be derived in the same manner.
In Etraining, we include the information of Aij for the probability with respect to σi and σj.
Thus, we have,

Note that the complexity of computing the Bethe free energy and the cross-validation errors is considerably reduced by restricting the parameter space of the stochastic block model.
While the stochastic block model required a computation of O(q2L) in the general case, it is O(L) with the restriction: Eq. (4).

In this section, a comparative analysis of the assessment of the number of clusters was conducted using synthetic and real-world networks.
For the synthetic networks, the planted number of clusters is denoted as: qplanted.

v.1 Assessment using the greedy algorithms

Figure 3: (Color online) Numbers of clusters q∗ detected by the (a) Louvain method and (b) Infomap for instances of the standard stochastic block model. Each network has two equally sized clusters as a planted structure, and the networks are generated for various values of ϵ=ωout/ωin, i.e., the strength of community structure.
The algorithms were executed 30 times for each network; the resulting number of clusters fluctuates depending on the initial cluster assignments, and the shaded regions show the standard deviations from the mean value. In each case, the stochastic block models of N=2,000 with the average degrees c=6, 12, and 24 are evaluated.
The bottom figures show the N dependence on the estimated number of clusters q∗ for the (c) Louvain method and (d) Infomap; the planted structure is of two equally sized clusters with c=6 and ϵ=0.5, and the experimental procedure is the same as in Figs. 3a and 3b. The error bars indicate the standard deviations.Figure 4: (Color online) Histograms of the cluster size distributions detected by the Infomap for the same stochastic block models as in Fig. 3 with (a) c=12 and ϵ=0.1 and (b) c=12 and ϵ=0.5.
The algorithm is run 100 times on the same network and the results show their cumulative frequencies.
When the community structure is strong (ϵ=0.1), one of the planted clusters seems to be detected.
Conversely, only small clusters that are broadly distributed are detected when ϵ=0.5.
Note that the specific form of the cluster size distribution possibly depends on the details of the algorithm.

The performance of the greedy algorithms was first examined on the basis of the standard stochastic block model.
Figures 3a and 3b show the number of clusters detected using the Louvain method and the two-level Infomap, respectively.
The horizontal axes represent the strength of the community structure ωout/ωin≡ϵ.
The Louvain method is a hierarchical clustering algorithm that aims to optimize modularity, while the (two-level) Infomap is a non-hierarchical clustering algorithm that aims to optimize the map equation.
For the implementation, we used Ref. lou () for the Louvain method and Ref. inf () for the Infomap.

All instances considered in this section have qplanted=2.
Given that the stochastic block model is exactly the model assumed in the inference algorithm, the assessment by the Bethe free energy and some of the prediction errors are known to be very accurate Mossel et al. (2014); Massoulié (2014); Kawamoto and
Kabashima (2017), even when the planted modular structure is very weak.

When the average degree is sufficiently high and the community structure is strong (i.e., ϵ∼0), both algorithms correctly detect two clusters.
However, when the networks are sparse and the community structure is weak (i.e., ϵ≫0), those algorithms tend to largely overfit.
Moreover, as shown in Figs. 3c and 3d, the detected number of clusters increases as the network becomes larger.
A non-hierarchical clustering algorithm for modularity Clauset et al. (2004); fas () and the multilevel Infomap Rosvall and Bergstrom (2011) were also tested.
Although the tendency that the hierarchical clusterings slightly prevent overfitting was confirmed, significant differences were not observed.

It should be noted that, detecting too many clusters does not readily imply the failure of the algorithm.
For example, when the result consists of a few large clusters and many very small clusters, significant clusters can be extracted via a visual inspection.
This is actually the case for instances with strong community structures.
Otherwise, the sizes of clusters can be broadly distributed, and such a visual inspection may fail.
Such situations are exemplified in Fig. 4 for the Infomap.

As a reference to the comparative analysis of the latter sections, we list the results of the greedy algorithms on synthetic and real-world networks in Fig. 5.
The descriptions of the networks can be found in Sec. V.2 and the references therein.

Figure 5: (Color online) Box plots of the estimates of the number of clusters q∗ using the Louvain method (top) and the two-level Infomap (bottom) on the synthetic and real-world networks that are analyzed in Secs. V.2 and V.3.
The algorithms were executed 30 times for each network.
Each box represents the range from the upper quantile to the lower quantile, the line in the box represents the median, whiskers from the box represent the upper and lower extremes, and circles represent the outliers, which are significantly far from the upper and lower quantiles.

Table 1: Parameters of the LFR networks. All networks have N≃104. The average degree c, maximum degree dmax, size of the smallest cluster Nσmin, and size of the largest cluster Nσmin are the realized values, while the mixing parameter μ, minus the exponent of the degree distribution τ1, and minus the exponent of the planted cluster size distribution τ2 are the input values.
The planted cluster size distribution is controlled by: Nσmin and Nσmax, while τ2 is fixed.
A comparative analysis for different values of τ2 would be difficult because a required graph size N becomes extremely large for a slight change of τ2.
The planted number of clusters qplanted and the estimates using the modularity matrix q∗mod and the non-backtracking matrix q∗NBT are indicated in the last three columns.Figure 6: (Color online) Assessment of various criteria with respect to given inputs of q for a LFR network.
The top panel indicates the model parameters α (blue cross) and β (red plus), where the shaded region indicates the region of β between Eqs. (20) and (21).
The middle panel indicates the Bayes prediction errors EBayes (red circles),
Gibbs prediction errors EGibbs (green triangles),
Gibbs training errors Etraining (blue diamonds), and
MAP estimates EMAP of EGibbs (yellow squares).
The bottom panel indicates the modularity (yellow pentagon), map equation (blue hexagon), and Bethe free energy (gray octagon).
In each panel, the number of clusters selected by the spectral method of the non-backtracking matrix is indicated by a vertical dashed line.
The planted number of clusters qplanted is indicated with a filled triangle at the top of the figure.
Figure 7: (Color online) Result of the same LFR network as Fig. 6. Here, statistical inference is performed without the model-parameter learning; the model parameters are fixed to: α=1 and β=β∗.

In this section, the performance of various assessment criteria based on the statistical inference algorithm that were explained in Sec. III.1 is analyzed.
First, the performance on synthetic networks, called the LFR network Lancichinetti and Fortunato (2009), is analyzed.
The LFR network is a random graph model that has a power-law degree distribution and, as a planted structure, a power-law cluster size distribution.
The parameters of the LFR networks considered are listed in Table 1.
Although the LFR network is often analyzed as a random graph that emulates typical real-world networks, in this paper it is not argued whether the parameter set investigated is “realistic” or not.
In fact, it is not obvious whether the LFR network really emulates typical real-world networks, because, as can be seen from Fig. 4, a broad cluster size distribution can be obtained fictitiously.

Figure 6 shows the result for an instance with a strong community structure (small mixing parameter μ, in terms of the LFR network).
Although the network has vertices with very large degrees, the cluster sizes are set to be almost equal.
For this network, all the criteria support values close to qplanted, although the criterion based on the non-backtracking matrix slightly overfits.

While the values of the model parameters are learned in Fig. 6, we show the result for the same network without the model-parameter learning in Fig. 7.
As we can see from Figs. 6 and 7, modularity and the map equation behave similarly in both cases.
As we mentioned at the end of Sec. III.1, this may be due to the fact that the cluster assignment is not very sensitive to specific values of model parameters, at least when the network has a strong community structure.
Conversely, the performance of the Bethe free energy and cross-validation errors change qualitatively, indicating that the learning step cannot be skipped.
Note that skipping the learning step does not necessarily mean that it is computationally more efficient. With an incorrect choice of the affinity matrix ω, it will be more difficult to fit the network.
It turns out that BP requires more sweeps until convergence.
Therefore, it is more beneficial to optimize the model parameters.
The rest of the results in this paper are generated according to the algorithm with model-parameter learning.

Figure 8: (Color online) Assessment of various criteria with respect to given inputs of q for several LFR networks.
They are plotted in the same manner as in Figs. 6 and 7.

The LFR networks with weak community structures are shown in Fig. 8.
Figures 8a–c represent the results for networks with narrowly peaked planted cluster size distributions.
Conversely, Fig. 8d–f represent the result for networks with broad planted cluster size distributions.
Although it is difficult to thoroughly examine the effect of cluster size distribution, we can at least confirm that the performance of the present algorithm and assessment criteria are not very sensitive to the cluster size distribution.

In the case of sparse networks such that the average degree is of O(1), if the planted community structure is too weak, it becomes fundamentally impossible to retrieve the planted structure. In other words, the network becomes statistically impossible to distinguish from a uniform random graph.
The critical strength of the community structure is called the detectability threshold, or the detectability limit Decelle
et al. (2011a); Moore (2017); Foo (b).
In terms of the spectral method, it is the case that the leading eigenvalues are buried in the spectral band.
In terms of other assessment criteria, the slope of a validation curve becomes flat, or the value at q=1 becomes the minimum.
In the case of the stochastic block model with equally sized clusters, this threshold is given by the value of ϵ that corresponds to β∗ in Eq. (20), and the paramagnetic phase will be observed beyond the detectability threshold.

For the network in Fig. 8a, all the criteria we consider behave reasonably, supporting the values close to qplanted.
For the network in Fig. 8b, other than the Gibbs prediction error and its MAP estimate, the assessment criteria still support the values near qplanted.
Indeed, in the case of the stochastic block model, it is known that the Gibbs prediction error tends to underfit near the detectability threshold Kawamoto and
Kabashima (2017).
Although the value of the (information-theoretic) detectability threshold for the LFR network is not known, the network in Fig. 8c may be beyond the detectability threshold. The Gibbs prediction error and Bayes prediction error are minimized or saturated already at q=1 (not shown in the figure). The Bethe free energy exhibits a monotonic behavior, while the values of other criteria behave violently; this implies that the landscapes of the objective functions are glassy.

More importantly, while we observed in Figs. 3 and 5 that the estimates by modularity and the map equation largely overfit when the greedy algorithm is used, the results in Fig. 8 indicate that those criteria behave reasonably when statistical inference is used.
Therefore, the shortcomings that we observed in Fig. 3 were not the flaw of the criteria themselves, but of the greedy algorithms and the MAP estimate (i.e., β→∞) framework.
In fact, when the assumed model is correct, it is known that the MAP estimate overfits compared to the estimate with the optimum inverse temperature βZhang and Moore (2014); Moore (2017); Peixoto (2017).
The contribution of this paper is that it confirms that the overfitting occurs with the greedy algorithms near the detectability threshold.

While the results in Figs. 8d and 8f are similar to those in Figs. 8a–c, the result in Fig. 8e is qualitatively different.
As the input values of q are increased, at some point, BP converged to the factorized state Foo (c); as a result, the estimated value of β jumps, and some prediction errors become constant.
Convergence to the factorized state is a desirable feature of BP; it implies that BP has reached the detectability threshold and that there is no significant structure.
Note, however, that it is often difficult to determine whether BP is actually in the paramagnetic phase or the retrieval phase.
Given that the factorized state always exists as a fixed point of BP in the retrieval phase, it is possible that BP is trapped in a local minimum of the Bethe free energy, while the correct initial state would converge to the global minimum.

As the final example using the LFR network, consider the case in which the assessment seems to fail because of the EM algorithm.
Consider a network that has a broad degree distribution as in Figs. 6 and 7, in addition to a weak community structure and a broad planted cluster size distribution.
As shown in Fig. 9, while the spectrum of the non-backtracking matrix exhibits the estimate of q∗ near qplanted, such estimates cannot be obtained via the other criteria, because BP converges to the factorized state at q=12, although the values of the criteria significantly vary when they reach this value.
In this case, we can hardly conclude that there are no statistically significant structures beyond q=12, and it is more natural to conclude that the BP converged to a local minimum of the Bethe free energy.
Note that, even if we accept the estimate of q∗=27, we cannot obtain a result with 27 clusters; recall that the input value of q is the maximum number of clusters allowed, and the actual number of clusters that can be obtained is much less than 27.
Readers might wonder what factor dominates the performance of the EM algorithm in the LFR network.
Although the degree distribution seems to be an important factor, because there are many model parameters in the LFR network, it is difficult to precisely determine parameter dependencies experimentally.
Note that a thorough investigation of the phase space of a particular model is not the goal of this paper.
Instead, we investigate generic behaviors in community detection.

Figure 9: (Color online) Assessment of various criteria with respect to given inputs of q for a LFR network.
They are plotted in the same manner as in Figs. 6–8.

Table 2: Estimated number of clusters of real-world networks using the spectra of the modularity matrix q∗mod and non-backtracking matrix q∗NBT. Multi-edges, self-loops, and the direction and weights of edges are neglected in all networks.Figure 10: Assessment of various criteria with respect to given inputs of q for several real-world networks.
They are plotted in the same manner as in Figs. 6–9.
The shaded parts in the cross-validation error plot indicate the standard errors.
Figure 11: Assessment of various criteria with respect to given inputs of q for several real-world networks.
They are plotted in the same manner as in Figs. 6–9.
The shaded parts in the cross-validation error plot indicate the standard errors.
Figure 12: Assessment of various criteria with respect to given inputs of q for several real-world networks.
They are plotted in the same manner as in Figs. 6–9.
The shaded parts in the cross-validation error plot indicate the standard errors.
Figure 13: Assessment of various criteria with respect to given inputs of q for several real-world networks.
They are plotted in the same manner as in Figs. 6–9.
The shaded parts in the cross-validation error plot indicate the standard errors.

Let us next examine the performance of the assessment criteria on real-world networks.
The basic information about each dataset is listed in Table 2 and the results of the assessment by each criterion are shown in Figs. 10(a)–(d), 11(e)–(h), 12(i)–(l), and 13(m)–(p).
For some networks, the inference algorithm does converge to the factorized state at some point as the input value of q is increased; as far as we investigated, in many cases, the convergence to this trivial BP fixed point either supports a plausible value of q∗ or does not affect the assessment.

It is known that the Bethe free energy tends to largely overfit for real-world networks Decelle
et al. (2011a); Kawamoto and
Kabashima (2017) when an affinity matrix of full degrees of freedom is used.
However, with a restricted affinity matrix, the assessment by the Bethe free energy does not seem to be very wrong.

Unlike the cases of synthetic networks, the behaviors of the assessment criteria are often very different from each other.
For example, modularity tends to support a relatively small value for q∗, while the map equation tends to support a relatively large value.
Assessment by the Bethe free energy and prediction errors can be close to either of them, and we cannot determine a similarity tendency.
Note again that the inference algorithm does not optimize the minimum description length of the map equation; the partition is obtained such that the marginal likelihood is maximized and the minimum description length is measured only as a criterion for the assessment of the number of clusters.
Another way to utilize the assessment by the minimum description length is to check whether the resolution limit Fortunato and Barthélemy (2007); Kawamoto and Rosvall (2015) affects the result.
The estimates for the number of clusters q∗ by modularity and by the map equation can differ considerably when many small clusters exist, because their resolution limits are very different Kawamoto and Rosvall (2015).
When both modularity and the map equation support the same number of clusters q∗, it implies that the network is resolution limit-free Foo (d).

v.3 Assessment using the spectral methods

Finally, we examine the performance of the assessment of the number of clusters using the spectral methods that we explained in: Secs. III.3 and IV.4.
The results of the estimates using the modularity matrix q∗mod and non-backtracking matrix q∗NBT are listed in Tables. 1 and 2.
The estimates using the non-backtracking matrix are also shown in Figs. 6–13 as dashed lines.

Note that, because the leading eigenvalues can be quickly computed for sparse networks, the assessment using the spectral method can be conducted most easily.
Overall, the assessment using the modularity matrix tends to underfit, while the assessment using the non-backtracking matrix tends to overfit, compared to the other criteria.
Furthermore, for real-world networks, it is often the case that the spectral band of the modularity matrix cannot be clearly observed.
Therefore, in many cases, it is also difficult to visually assess the number of clusters from the spectral density.

The assessment using the non-backtracking matrix is often very accurate in the sense that it coincides with the planted value qplanted of an LFR network.
The analysis with various values of μ was also analyzed in Ref. Darst et al. (2014).
It is difficult to analyze what exactly causes overfitting in the cases of the real-world networks; one of the possibilities is that the community structure may not be the only structure that contributes to the eigenvalues outside of the spectral band, and those unknown structures cause overfitting when we focus on community detection.

As we have observed, the validation curves of the criteria are often gradually saturated, particularly when the community structure is weak.
In such a case, a criterion supports a range of values for q∗, instead of a single plausible value.
Therefore, a finer inspection is required as a final step, if one wishes to determine the most plausible value of q∗.

Visualizing how a network is partitioned for each input value of q can be helpful for the assessment of the number of clusters.
The alluvial diagram Rosvall and Bergstrom (2010) is a suitable tool for this purpose.
It was originally introduced as a diagram to indicate the time evolution of a community structure.
Here, we visualize the change in the partition for different values of q, rather than the change in the partition over time.
In the alluvial diagram, the results of community detection are aligned horizontally. For each partition, the set of vertices in the same cluster is expressed as a vertical bundle. Then, the same vertices in different partitions are smoothly connected.
The alluvial diagram can be generated at Ref. map () using .smap files.

The alluvial diagram also uses color tone to express the significance of the cluster assignment; the vertices with insignificant assignments are expressed by faint colors.
We assess that the cluster assignment of vertex i is not significant if maxψiσ is less than a threshold.
Here, we regard maxψiσ>0.9 as a significant assignment.

Figure 14: Alluvial diagrams of the political books, political blogs, protein, and US airports networks. Some clusters are highlighted.

The alluvial diagrams of four real-world networks are shown in Fig. 14.
As we can see from the political books and political blogs networks, the actual partition may be kept the same as we increase the input value of q.
We can confirm that the partition with q=3 stably exists in the political books network and the partition with q=4 is also consistent with the partitions with fewer clusters; i.e., it is a refinement of the partition with q=3, and the highlighted clusters in the middle belong to different clusters in the partition with q=2.
For the political blogs network, although modularity and the map equation support q=3 or 4, in the end, we can confirm that the dominant structure does not change from the partition with q=2.

In the case of the protein network, for any choice of q, only a fraction of vertices belong to a specific cluster significantly. In other words, the network does not have a global community structure.
Whereas the vertices with insignificant assignments exhibit a random-like behavior, the vertices with significant assignments roughly exhibit a hierarchical structure.
According to the Gibbs prediction error EGibbs and its MAP estimate EMAP in Fig. 12(i), the partitions with q=3 or 4 are supported.
Although we cannot clearly see a qualitative difference in the alluvial diagram, if we look carefully, from the partition with q≥5, we can observe that a small fraction of vertices with significant assignments start to exhibit a non-hierarchical structure.

The way the US airports network is partitioned is also different from other networks.
When we focus on the vertices with significant assignments, the resulting partitions do not constitute a hierarchical structure for small values of q, while they roughly do for large values of q.
As various assessment criteria support the range of 5≤q∗≤8, it seems to be plausible to select q∗ in this range.

We conducted a comparative analysis on the assessment of the number of clusters in community detection.
Although we examined the performance of various algorithms and assessment criteria, an exhaustive analysis requires all possible combinations of the frameworks, algorithms, and assessment criteria.
For example, an important missing part is the Monte Carlo method Nowicki and Snijders (2001); Peixoto (2014); Newman and Reinert (2016).
The Monte Carlo method usually incorporates the prior probability distribution with respect to the affinity matrix ω; it plays the role of a penalty in the assessment of the number of clusters.
Therefore, a comparative analysis including the Monte Carlo method would not only be a comparison of different algorithms, but also a comparison of different frameworks.
In a broader sense, we should note that community detection also possesses some other fundamental issues as discussed in Ref. Peel et al. (2017).

We confirmed that the assessment using the EM algorithm with BP and the corresponding prediction errors also provide plausible estimates in various synthetic and real-world networks, while the greedy algorithms tend to largely overfit.
Note that it is not trivial that the BP algorithm performs reasonably for real-world networks, because the emergence of the so-called hard phase Decelle
et al. (2011a) may deteriorate the performance when the planted number of clusters is large.
Furthermore, the EM algorithm may be trapped in a local minimum depending on the choice of the initial condition of the model parameters.

We also observed that the estimate of q∗ using the modularity matrix tends to underfit, while the estimate using the non-backtracking matrix tends to overfit.
To the best of our knowledge, the assessment using the boundary of the spectral band of the modularity matrix has not been investigated in the literature.

Finally, we proposed the utilization of the alluvial diagram for the assessment of q∗.
Although there is the obvious issue that it is not applicable to the networks with a very large q∗, when it is not the case, the alluvial diagram is very useful, particularly when the network does not clearly have a global community structure.

For the LFR networks and real-world networks, we do not know the number of clusters that are statistically significant.
It may not coincide with the planted number of clusters or the number of clusters in the metadata.
Therefore, we rely on the consistency among various criteria and algorithms for the plausibility of assessment.
The tendency of overfit and underfit that we investigated in this paper represents the relative performance among the criteria and algorithms.
Although there is no ground truth in a real-world network, estimating the number of clusters is a practical problem.
At the end of the day, a practitioner selects a certain value (or a set of values) as q∗.

The code for the assessment using the modularity matrix is available at Ref. Git ().
The code for the assessment using the other criteria in this paper, which can also generate .smap files, is available at Ref. gra ().

acknowledgments

T. K. thanks Jean-Gabriel Young for useful comments.
This work was supported by Japan Society for the Promotion of Science KAKENHI Grants No. 26011023 (TK) and No. 25120013 (YK).

Precisely speaking, the algorithmic detectability
threshold that a certain algorithm fails to retrieve the planted structure is
different from the information-theoretic detectability threshold that it is
fundamentally impossible to retrieve the planted structure.

T. P. Peixoto,
arXiv preprint arXiv:1705.10225 (2017).

Of course, this can also be confirmed directly by
observing the marginal distributions themselves.

Precisely speaking, the algorithmic detectability
threshold that a certain algorithm fails to retrieve the planted structure is
different from the information-theoretic detectability threshold that it is
fundamentally impossible to retrieve the planted structure.