Statistical significance of communities in networks

Nodes in real-world networks are usually organized in local modules. These groups, called communities, are intuitively defined as sub-graphs with a larger density of internal connections than of external links. In this work, we introduce a new measure aimed at quantifying the statistical significance of single communities. Extreme and Order Statistics are used to predict the statistics associated with individual clusters in random graphs. These distributions allow us to define the significance of a community as the probability that a generic clustering algorithm finds such a group in a random graph. The method is successfully applied to real-world networks for the evaluation of the significance of their communities.


I. INTRODUCTION
Complex networks play a crucial role in understanding physical, biological, social and technological systems [1][2][3]. Interactions between proteins in the cells of living organisms, relations between human actors in socioeconomic contexts and connections between Web pages in the World Wide Web can naturally be described as graphs. Real-world networks typically have complex topological properties but, in spite of their evident diversity, structural analysis has revealed that they share a conspicuous set of common features: scale-freeness (i.e., the number of connections per node follows a wide or power-law distribution) [1] and small-worldness (i.e., the average number of hops between two nodes in the network scales logarithmically with its size) [4] are two celebrated examples of such properties. Recent studies have focused on deeper structural features of networks. Real-world networks are typically organized in local clusters of nodes, usually called communities. Communities are groups of nodes with a higher level of interconnection among themselves than with the rest of the graph. In this sense, communities are groups relatively isolated from the other nodes of the network and are expected to represent elements sharing common features and/or playing similar roles within the system (see Ref. [5] for an exhaustive review). For instance, in the World Wide Web, communities are composed of groups of Web pages dealing with similar topics; in social networks, communities stand for sets of actors sharing common interests, ideas and friendship relationships; in protein interaction networks, communities represent groups of proteins with similar functions.
This imbalance between internal and external connections corresponds to an intuitive concept, and several formalizations of the definition of community exist. The LS set [6] or strong community [7,8] stands for a group where every node belonging to the group has more internal connections than external ones. A less restrictive definition refers to a weak community [8] as a set of nodes where the number of intracommunity connections (summed over all nodes within the group) is larger than the number of links going out of the community. Along these lines, the well-known modularity is a quality function that quantifies the statistical importance of a partition by comparing the number of internal connections observed in the communities with the number expected in a suitable null model [9]. Besides the formulation of a definition, great efforts have been devoted to the detection of communities in networks. Since the total number of possible divisions of a network into subgraphs is a non-polynomial function of the size of the network itself, detecting communities is not a trivial issue. Many algorithms have been proposed in recent years, each of them in the same spirit of finding the groups that maximize the internal density of links [5, 9-24]. Different principles may be used, but in all cases some property related to the community structure is locally or globally optimized. The consequence is that even in uncorrelated networks these algorithms find clusters that are supposed to be good according to the modularity function or to other quality measures.
If algorithms are able to identify communities even in random graphs, what value can we give to communities found in real networks? Or better, how can the significance of a community be determined statistically? This problem has been the subject of some studies in the literature [20, 25-30]. In [20,27], for example, the partition of a network maximizing the modularity is compared with the maximum modularity partition of a randomized version of the given network (i.e., all edges are randomly rewired). In [29], by contrast, the importance of a community partition is proportional to its robustness against random perturbations (i.e., random reshuffling of edges). Such heuristic approaches rely on the modularity function to evaluate the quality of a partition, which means that they are subject to the resolution limits of modularity [21,31]. Furthermore, all the proposed methods are designed to deal with full partitions, not with single communities, even though in a network one might find some meaningful communities alongside randomly connected node clusters. In this paper, we develop a statistical method aimed at discriminating between a single bona fide community and structures arising as topological fluctuations. Instead of a direct comparison with an average outcome, the community is compared with the best expected result for a null model. The reason for stressing this "best outcome" is that community detection algorithms will in general produce the best possible clusters given a graph, even if it is random. The threshold of significance can be approximated by using Extreme and Order Statistics [32,33] applied to null-model community fitness. The significance of a community can then be obtained as the probability of finding a group equal to or better than the given one in a set of equivalent random graphs.

II. NULL-MODELS AND DEFINITION OF C-SCORE
Consider a scenario such as the one depicted in Fig. 1, with a given community $C$ in a graph. $k_i$ denotes the number of connections (degree) of node $i$. Given $C$, $k_i$ can be divided into two terms: $k^{int}_i$, the number of links connecting $i$ to nodes in $C$, and $k^{ext}_i$, the number of connections to nodes outside $C$. Similarly, we define the internal degree of $C$, $m^{int}_C = \sum_{j \in C} k^{int}_j$, as well as $m^{ext}_C = \sum_{j \in C} k^{ext}_j$ and its total degree $m_C = m^{int}_C + m^{ext}_C$. We consider a very simple stochastic null model: all connections inside the group are locked (the community is given, so it cannot be altered), while the other links are randomly reshuffled among all nodes, preserving their degrees. For simplicity, we allow the rewiring operation to form multiple links (two nodes can be connected by more than one edge) and self-loops. In some weighted graphs the weights of the links are equivalent to multiple connections, so the present null model is appropriate. Some examples are social networks (the Zachary club [34], see Sec. VI) or the C. elegans metabolic network [16], which will be analyzed later. For unweighted graphs, we have checked that our results do not change noticeably whether or not multiple links are included, as long as the graph is not condensed (i.e., no node holds a finite fraction of the links). When node degrees are much smaller than the network size, the probability of generating self-loops and multiple links by random reshuffling becomes negligible. Note also that our null model is similar to the one used for the definition of modularity [9] and close in spirit to the configuration model [35]. It generates graphs that have no special internal structure except that given by random fluctuations, keep the degree sequence of the original network and can show degree-degree correlations only if the degree sequence and the network size determine their presence [36].
This is the most general null model, appropriate when no knowledge about the system is available and simple enough to be treated analytically. If further information is available regarding the constraints present in the process that generated the given network, other null models, simpler or more elaborate, can be employed. Our method to evaluate group significance is general enough to admit the use of different null models by altering accordingly the distributions that will be described next.
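The bookkeeping behind these quantities can be sketched in a few lines. This is only an illustration: the function name `community_degrees` and the edge-list representation (repeated pairs standing for multiple links) are assumptions made here, not part of the authors' software.

```python
from collections import defaultdict


def community_degrees(edges, community):
    """Split node degrees into internal and external parts with respect to C.

    edges: iterable of (u, v) pairs of an undirected multigraph
           (a repeated pair counts as a multiple link).
    community: iterable with the nodes of the group C.
    Returns (k_int, k_ext, m_int_C, m_ext_C).
    """
    members = set(community)
    k_int = defaultdict(int)  # links from each node to nodes inside C
    k_ext = defaultdict(int)  # links from each node to nodes outside C
    for u, v in edges:
        # every link contributes one end to each of its two endpoints
        for a, b in ((u, v), (v, u)):
            if b in members:
                k_int[a] += 1
            else:
                k_ext[a] += 1
    # sums over the members of C; internal links are counted once per end
    m_int_C = sum(k_int[j] for j in members)
    m_ext_C = sum(k_ext[j] for j in members)
    return dict(k_int), dict(k_ext), m_int_C, m_ext_C


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
    k_int, k_ext, m_int_C, m_ext_C = community_degrees(edges, {0, 1, 2})
    m = 2 * len(edges)                   # total number of ends
    m_star = m - (m_int_C + m_ext_C)     # ends not belonging to nodes of C
    print(m_int_C, m_ext_C, m_star)
```

Note that, with this definition, $m^{int}_C$ counts every internal link twice (once per end), which matches the sum over $k^{int}_j$ in the text.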
Once the null model has been selected, suppose that $C$ is a group composed of randomly chosen nodes and consider a generic node $i$ not belonging to $C$. The distribution of $k^{int}_i$ is given by the hypergeometric distribution

$$f(k^{int}_i) = \frac{\binom{m^{ext}_C}{k^{int}_i}\,\binom{m^* - m^{ext}_C}{k_i - k^{int}_i}}{\binom{m^*}{k_i}}, \qquad (1)$$

where $\binom{y}{x} = \frac{y!}{(y-x)!\,x!}$ is a binomial coefficient and $m^*$ is the number of free ends in the network: $m^* = m - m_C$ ($m$ is the total number of ends in the graph, twice the number of links). Eq. (1) states that the probability for node $i$ to have $k^{int}_i$ internal connections to $C$ is given by the ratio of two terms: the number of ways in which $k^{int}_i$ links can be attached to the $m^{ext}_C$ free ends of the community, multiplied by the number of ways to place the remaining $k_i - k^{int}_i$ edges among the other $m^* - m^{ext}_C$ free ends, divided by the total number of ways to place all $k_i$ connections in the network (i.e., among the $m^*$ free ends). If node $i$ belongs to $C$, Eq. (1) has to be corrected to exclude $i$ from the group. When the group $C$ is composed of $n_C$ randomly chosen nodes, Eq. (1) reproduces the results obtained via numerical simulations (see inset of Fig. 2).
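The verbal description of Eq. (1) translates directly into code. The following is a minimal sketch using only the standard library; the function name `f_internal` and its argument names are illustrative choices, and the check at the end simply confirms that the probabilities sum to one.

```python
from math import comb


def f_internal(k_int, k, m_ext_C, m_star):
    """Hypergeometric probability that a node of degree k outside C has
    k_int links to C, following the verbal description of Eq. (1).

    m_ext_C: free ends attached to the community C.
    m_star:  total number of free ends (assumed >= k and >= m_ext_C).
    """
    if k_int < 0 or k_int > k:
        return 0.0
    ways_inside = comb(m_ext_C, k_int)               # ends landing on C
    ways_outside = comb(m_star - m_ext_C, k - k_int)  # remaining ends
    return ways_inside * ways_outside / comb(m_star, k)


if __name__ == "__main__":
    k, m_ext_C, m_star = 5, 10, 100
    total = sum(f_internal(q, k, m_ext_C, m_star) for q in range(k + 1))
    print(total)  # should be 1 up to floating-point rounding
```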
The next, more interesting, case is when $C$ is not composed of randomly chosen nodes but has been detected by a clustering algorithm. As can be seen in the main plot of Figure 2, the shape of $f(k^{int})$ changes dramatically due to the node selection performed by the algorithm. Most of the nodes populating the tail of the distribution are incorporated into the group. Correlations are also present, since nodes in the community are expected to be connected among themselves. Still, it is possible to obtain an approximate expression for the probability $f(k^{int})$. We first consider the case of homogeneous graphs where all nodes have the same degree (i.e., $k_i = k$, $\forall i$) and later extend our analysis to networks with arbitrary degree sequences. We will assume that $C$ has been selected to maximize $k^{int}_i$ for each node inside $C$ as well as the overall $m^{int}_C$. Since the nodes are all equivalent (they have the same degree $k$), they can be ranked according to their $k^{int}$. We indicate with $w$ the node (or nodes) with the lowest $k^{int}$ within the community (see Fig. 1). The internal degree of the worst node, $k^{int}_w$, then establishes an upper cut-off for the possible values of $k^{int}$ of the out-group nodes. An expression similar to Eq. (1), the distribution $g(k^{int} \mid C, k^{int}_w)$ of Eq. (2), can then be derived for the external nodes by taking this new cut-off into account; it involves the term $\tilde{m} = (N - n_C)\,k^{int}_w$. The term $\tilde{m}$ accounts for the fact that no node can connect to more than $k^{int}_w$ internal vertices and therefore some of the free ends $m^*$ become occupied. Eqs. (1) and (2) specify the null model. Our method does not depend on the particular functional shapes of $f(k^{int}_i)$ and $g(k^{int}_i)$. For instance, a more restricted null model without multiple links can be approximated by using the Wallenius hypergeometric distribution, although this considerably complicates the numerical evaluation of the functions.
Another null model, less realistic but very easy to implement, is given by Erdős-Rényi-like networks, for which $f(k^{int}_i)$ and $g(k^{int}_i)$ are binomial distributions.
The worst node within the community, $w$, will play a central role in our method to evaluate group significance. We assume that in a random graph there is no drastic variation between $k^{int}_w$ and the internal degree $k^{int}$ of the best nodes outside the group. Postulating a smooth variation of $k^{int}$ between the inside and the outside of the community allows us to find an expression for the probability distribution of $k^{int}_w$ based on Eq. (2), which only applies to external nodes. The degree of the worst node, $k^{int}_w$, is a given quantity in $g(k^{int} \mid C, k^{int}_w)$. In order to find a formula for $P(k^{int}_w)$, we thus need to shift our point of reference and consider the second worst node within the community, $w'$. If the statistics of $k^{int}_w$ is comparable to that of the best external nodes, $k^{int}_w$ should follow the distribution of the extreme of $g(k^{int} \mid C, k^{int}_{w'})$. This means that the probability $\Pr(\leq k^{int}_w)$ for $k^{int}_w$ to be lower than or equal to a certain number is given by Eq. (3), where $G(\cdot)$ is the cumulative of the function $g(\cdot)$. The distribution $P(k^{int}_w)$ is given by the derivative of the cumulative of Eq. (3), $P(k^{int}_w) = \partial \Pr(\leq k^{int}_w)$. It must be remarked that Eq. (3) is valid for independent random variables; in our null model the independence is justified for external nodes and is an approximation when it refers to $w$. Figure 3 shows a comparison between the distribution $P(k^{int}_w)$ obtained with this procedure and its counterpart from numerical simulations. Despite the approximations made to reach an analytical form for $P(k^{int}_w)$, the agreement is remarkable. The use of Extreme Statistics contributes in part to this agreement, since under very general conditions the limiting extreme-value distribution is stable and has no memory of the parent distribution.
Once a functional form for $\Pr(\leq k^{int}_w)$ has been obtained, we can define a measure of the significance of a group, the C-score $c$ of Eq. (4), which corresponds to the probability that $k^{int}_w$ for an optimized community in an equivalent random graph ensemble is higher than or equal to the value seen in $C$. A point to stress here is that $c$ contains information not only about the worst node, $k^{int}_w$, but also about the external links of the community and about the degrees of the external nodes.
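The extreme-statistics step can be illustrated numerically. The sketch below rests on stated assumptions rather than on the exact expressions of Eqs. (3) and (4): it treats the quantity of interest as the maximum of $n$ independent draws from a discrete distribution (in the text, the distribution would be $g$ and $n$ would be of the order of the number of external nodes), so that its cumulative is the $n$-th power of the single-variable cumulative. The pmf passed in is a placeholder and the helper names are hypothetical.

```python
def max_cdf(pmf, n):
    """Cumulative distribution of the maximum of n i.i.d. draws.

    pmf: list with the probabilities of the values 0, 1, ..., len(pmf) - 1.
    Returns a list whose entry x is Pr(max <= x) = G(x) ** n.
    """
    cdf = []
    running = 0.0
    for p in pmf:
        running += p
        cdf.append(min(running, 1.0) ** n)  # clip rounding above 1
    return cdf


def prob_max_at_least(pmf, n, x):
    """Pr(max >= x): one possible reading of a C-score-like probability
    for an observed worst-node internal degree x (an assumption made here)."""
    if x <= 0:
        return 1.0
    if x > len(pmf) - 1:
        return 0.0
    return 1.0 - max_cdf(pmf, n)[x - 1]


if __name__ == "__main__":
    # two equally likely values and two draws: the maximum is 1
    # unless both draws give 0
    print(prob_max_at_least([0.5, 0.5], 2, 1))
```

Any null-model distribution (hypergeometric, binomial, or another choice) can be plugged in as `pmf`, which mirrors the remark in the text that the method does not depend on the particular functional shapes of $f$ and $g$.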
In order to extend our results to heterogeneous graphs, we need to rank the nodes according to the role they play with respect to the given community $C$. For regular networks, since all the nodes are equivalent, the ranking can be simply established by considering the values of the internal degrees $k^{int}$. However, another criterion is required to deal with heterogeneous networks. We use the probability distribution provided by Eq. (1) as the basis for such a procedure. The rank of a node $i$ can be established by the probability of finding a node with an internal degree $k^{int}_i$ or higher in the null model, given its degree $k_i$ and $C$. That is, for each node $i$ we calculate the score $r_i = \sum_{q = k^{int}_i}^{k_i} f(q)$ and then perform comparisons on the basis of $r$. The values of $r$ fall in the interval $[0, 1]$ regardless of the node degree, which facilitates the comparison. $w$ and $w'$ thus correspond to the nodes with the highest and second highest values of $r$ within the community, respectively. Under the hypothesis of a randomly connected network, the score $r_w$ of the vertex $w$ and the scores of the external nodes can be seen as random variables uniformly distributed in the interval $[r_{w'}, 1]$. The C-score can then be calculated as the probability of observing $r_w$ as the minimal value of a set of $(N - n_C + 1)$ random extractions from a uniform distribution defined in the interval $[r_{w'}, 1]$. An alternative to this last step is to map the internal degree of $w$ into $k^{int}_w$ (the internal degree that it would have if its degree were equal to $k_w$ and its score to $r_w$) by inverting the distribution of Eq. (1). Once the transformation has been performed, we can proceed in the same way as for homogeneous networks with Eqs. (3) and (4).
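The ranking score and the uniform-extraction step can be sketched as follows. The score `r_score` follows the sum over Eq. (1) given in the text. The closed form in `c_score_uniform` is an assumption made for this illustration: it takes the lower end of the interval to be the score of the second worst node and computes the probability that the minimum of $n$ independent uniform draws falls at or below $r_w$; the authors' exact expression may differ. All function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(q, k, m_ext_C, m_star):
    """Probability of q internal links for a node of degree k (Eq. (1)-style)."""
    if q < 0 or q > k:
        return 0.0
    return comb(m_ext_C, q) * comb(m_star - m_ext_C, k - q) / comb(m_star, k)


def r_score(k_int, k, m_ext_C, m_star):
    """r_i: probability of an internal degree k_int or higher in the null model."""
    return sum(hypergeom_pmf(q, k, m_ext_C, m_star) for q in range(k_int, k + 1))


def c_score_uniform(r_w, r_low, n):
    """Pr(min of n uniform draws on [r_low, 1] <= r_w).

    r_w:   score of the worst node of the community.
    r_low: lower end of the interval (assumed: score of the second worst node).
    n:     number of draws (in the text, N - n_C + 1).
    """
    if r_w <= r_low:
        return 0.0
    if r_w >= 1.0:
        return 1.0
    # each draw exceeds r_w with probability (1 - r_w) / (1 - r_low)
    return 1.0 - ((1.0 - r_w) / (1.0 - r_low)) ** n


if __name__ == "__main__":
    r_w = r_score(k_int=4, k=5, m_ext_C=10, m_star=100)
    print(r_w, c_score_uniform(r_w, r_low=0.0, n=97))
```

A small $r_w$ (a worst node that is already unlikely under the null model) gives a small value, in line with the interpretation of low scores as significant groups.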

III. BEYOND THE C-SCORE
A low value of the C-score (i.e., $c \leq 5\%$) is enough to consider a group as significant. However, when the C-score is higher, one could argue that relying only on the worst node of the community to evaluate the full group is too severe a criterion. Algorithms may misplace a single node, and this would translate into a non-significant community according to the C-score approach. The performance of the method can be improved by a further refinement. Instead of considering only the last node, one can include a longer list of nodes and use this information for the computation of the statistical significance of the community. A way to do so is to write an algorithmic procedure. Three classes of nodes can be considered: the community $C$, the "border" $B$ and the rest of the network. Initially, the group $B_0$ is empty and $C_0 = C$. Then, at each step $t$ of the algorithm, the following actions are taken:
• Calculate the score $r_i = \sum_{q = k^{int}_i}^{k_i} f(q)$, where the function $f(\cdot)$ is given by Eq. (1), for each node $i \in C$ with respect to the group $C_t$;
• Determine the worst node in $C_t$, $w_{t+1}$, as the vertex with the highest score $r_{w_{t+1}}$. Set $B_{t+1} = B_t \cup \{w_{t+1}\}$ and $C_{t+1} = C_t \setminus \{w_{t+1}\}$;
• Compute $\Pr(< S_{t+1} \mid C_{t+1}, B_{t+1}, r_{w_{t+2}})$, where $S_{t+1} = \sum_{w_i \in B_{t+1}} r_{w_i}$ and $w_{t+2}$ is the worst node still in $C_{t+1}$;
• Increase $t \to t + 1$.
This algorithm explores the interior of the community, trying to always keep the worst nodes in $B$; it ends when $t = n_C - 1$. $\Pr(< S_{t+1} \mid C_{t+1}, B_{t+1}, r_{w_{t+2}})$ stands for the probability that the sum of the scores of the worst $t$ nodes of an optimized community in an ensemble of equivalent random graphs is smaller than the one observed for $C$. Its value for a set of independent random variables can be estimated by using Order Statistics (see Appendix A for more details). We then define the B-score as $\min_t \Pr(< S_t \mid C_t, B_t, r_{w_{t+1}})$, which corresponds to the lowest value of the probability $\Pr(< S_t \mid C_t, B_t, r_{w_{t+1}})$ observed during the iterative procedure. We take the minimum as the best approximation for the significance of the group $C$, since it is evaluated at the most favorable split of the nodes of $C$ into border and core. This probability is equivalent to the C-score for $t = 1$, while it becomes a more collective quantity as $t$ increases. The inclusion of a longer list of worst nodes in the calculation helps to correct conservative estimates due to under-sampling. When communities are significant with respect to the C-score, they are also significant according to the B-score. The converse does not hold: low values of the B-score do not necessarily correspond to small C-scores. Many concomitant bad nodes with features slightly different from the random expectations may multiply their effect and lead, if there is a real signal, to the prediction of a significant community.
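The peeling part of the procedure above (moving the worst node from the community to the border and accumulating the sums $S_t$) can be sketched as below. Several choices are assumptions made for this illustration and are not specified in the text: the scores are recomputed only for nodes still in $C_t$, a node is excluded from the group when its own score is evaluated, and ties are broken by node order. The probability $\Pr(< S_t \mid \ldots)$ itself is not computed here, since it requires the Order Statistics expression of Appendix A. The function names are hypothetical.

```python
from math import comb


def r_in_group(i, group, adj, degree, m):
    """Score r_i of node i with respect to `group`, excluding i itself."""
    rest = group - {i}
    k_i = degree[i]
    k_int = sum(1 for j in adj[i] if j in rest)          # links from i to the group
    m_rest = sum(degree[j] for j in rest)                # total degree of the group
    m_ext = sum(1 for j in rest for x in adj[j] if x not in rest)
    m_star = m - m_rest                                  # ends outside the group
    ways = sum(comb(m_ext, q) * comb(m_star - m_ext, k_i - q)
               for q in range(k_int, k_i + 1))
    return ways / comb(m_star, k_i)


def peel_worst_nodes(edges, community):
    """Return (border, partial_sums): removal order and the sums S_1, S_2, ...

    Assumes every node of `community` appears in at least one edge.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    degree = {u: len(nb) for u, nb in adj.items()}
    m = sum(degree.values())
    core, border, partial_sums, s = set(community), [], [], 0.0
    while len(core) > 1:                                 # stops after n_C - 1 steps
        scores = {i: r_in_group(i, core, adj, degree, m) for i in sorted(core)}
        worst = max(scores, key=scores.get)              # highest r = worst node
        core.remove(worst)
        border.append(worst)
        s += scores[worst]
        partial_sums.append(s)
    return border, partial_sums


if __name__ == "__main__":
    clique = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    rest = [(4, 0), (4, 5), (4, 6), (4, 7), (5, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
    print(peel_worst_nodes(clique + rest, {0, 1, 2, 3, 4}))
```

In the toy example, node 4 is attached to the four-node clique by a single link and is the first one moved to the border.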

IV. COMPUTATIONAL BENCHMARKS
As a first test, we applied the C- and B-scores to groups found in random graphs using clustering techniques. Both scores are able to identify these groups as not significant (see Figure 4). The results confirm that the scores are good estimators for the statistics of such groups, further contributing to our confidence in the method. We consider next the performance of the scores on artificial networks with planted community structure. In order to do so, we build networks in the spirit of Girvan and Newman's benchmark [7]. Since our aim is to evaluate a single cluster, the benchmark is composed of a group $C$ with 32 nodes and of another 96 nodes in the rest of the network. Every node in $C$ is connected on average with $k^{int}$ nodes of its own group and $k^{ext}$ nodes outside. The external nodes are connected at random. The average total degree for all the nodes is fixed at $k = 16$. $k^{ext}$ thus acts as a control parameter for the strength of the community structure: the higher it is, the more prominent the disorder of the connections becomes. The scores are shown in Fig. 5a as a function of $k^{ext}$. Both are able to detect the increasing disorder. As expected, however, the C-score is more conservative than the B-score: it rises at lower values of $k^{ext}$ and thus signals earlier that the group could be found in random graphs. The (green) continuous curve in the figure represents a numerical estimate of the ideal function that we want to approximate with the scores. Before explaining how it is obtained, we need to describe the second panel of the figure. The distribution of the internal number of connections of $C$ is displayed in Fig. 5b for the benchmarks at different $k^{ext}$ as well as for equivalent randomized graphs. The randomized graphs are obtained by reshuffling the connections of the benchmark networks, and the groups of 32 nodes in them are found by modularity maximization.
The curves for the benchmarks start far away in the area of high $m^{int}_C$ when $k^{ext}$ is low. As $k^{ext}$ increases, they move towards the left and at a certain point, close to $k^{ext} \approx 8$, cross under the distribution for the randomized graphs. This point marks the end of the significance of the community: similar (or better) groups could be found in a random graph by a clustering algorithm. The continuous curve in Fig. 5a is obtained by simulating this process. For each value of $k^{ext}$, a set of instances of the benchmark is generated. $m^{int}_C$ is measured for each of them, and the green curve is calculated by averaging the probability of the value $m^{int}_C$ or a higher one (cumulative distribution) in the random graph curve of Fig. 5b. The good agreement of this curve with the B-score shows that, despite all the approximations, the B-score is a good measure of cluster significance.
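The single-group benchmark described above can be sketched with independent link probabilities. This is an assumption-laden illustration, not the authors' generator: links are placed independently with probabilities chosen so that the expected internal degree of the group nodes is $k - k^{ext}$, their expected external degree is $k^{ext}$, and the expected total degree of every node is $k = 16$. The function name and parameter names are hypothetical.

```python
import random


def planted_group_graph(k_ext, n_c=32, n_out=96, k_avg=16, seed=None):
    """Edge list of a random graph with one planted group (nodes 0..n_c-1).

    k_ext: expected number of links from a group node to the outside.
    Each pair of nodes is linked independently (no multiple links).
    """
    rng = random.Random(seed)
    n = n_c + n_out
    p_in = (k_avg - k_ext) / (n_c - 1)             # pairs inside the group
    p_cross = k_ext / n_out                        # group-to-outside pairs
    p_out = (k_avg - n_c * p_cross) / (n_out - 1)  # pairs outside the group
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            if v < n_c:
                p = p_in        # both endpoints in the group
            elif u < n_c:
                p = p_cross     # one endpoint in the group
            else:
                p = p_out       # both endpoints outside
            if rng.random() < p:
                edges.append((u, v))
    return edges


if __name__ == "__main__":
    edges = planted_group_graph(k_ext=4, seed=1)
    print(len(edges), 2 * len(edges) / 128)  # number of links, mean degree
```

Sweeping `k_ext` from low to high values reproduces the kind of control parameter used in Fig. 5, from a sharply defined group to one that is indistinguishable from the rest of the graph.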
As a final test on benchmarks, we have evaluated the performance of the scores on the benchmark proposed by Lancichinetti et al. (LFR) in Ref. [37]. This technique to generate graphs with planted community structure is a generalization of Girvan and Newman's method to networks with heterogeneous group sizes and degree distributions. As before, each node has $k^{int}$ connections within its own group and $k^{ext} = k - k^{int}$ edges linking elsewhere. The mixing parameter $k^{ext}/k$ indicates the "strength" of the communities. As can be seen in Figure 6, the scores show a great ability to characterize the modular structure of the benchmark as the mixing parameter increases. Due to the absence of fluctuations, all the communities are well defined until each node shares almost half of its connections with nodes of its group, while the groups become less defined for larger values of the mixing parameter. When about 60% of the links connect to nodes outside the a priori established groups, the communities become equivalent to those found in random graphs.

V. EXPLORING THE INTERIOR OF A COMMUNITY
An interesting application of the scores is the exploration of the internal structure of groups. One could decide to remove the worst node from the community, as we did to measure the B-score, and recompute the scores for the remaining group. The operation can be repeated iteratively as long as there are nodes remaining in the group. Interestingly, this process is able to identify the presence of internal structure in groups of vertices if the original community displays internal modularity. Figure 7 shows two examples of the described operation. The B-score is plotted as a function of the number of removed nodes. We consider two different examples: a well-defined cluster (generated with the LFR benchmark) plus some randomly added nodes (Figure 7a); and a group composed of two clusters connected via a few random links (Figure 7b). The iterative procedure is able to detect and set apart the randomly added nodes (Figure 7a), and also to find the deeper internal structure inside the two-cluster group (Figure 7b).
This procedure also allows us to define more detailed measures for the quality of a community. We can search for deeper and deeper cores in the community, which we will call the C-q or B-q core. Given a level of significance q, the C-q (or B-q) core corresponds to the largest sub-group of a community with C-score (B-score) lower than q. In practical applications, a reasonable value of q is 5%. As we will see next, this concept turns out to be a useful tool to characterize communities in real networks. In the case of the benchmarks, the average sizes of the C-q cores obtained for the GN-like networks at q = 5% are close to 32 up to $k^{ext} = 8$. At this level of disorder, some nodes stop being significant for the planted communities and are therefore excluded from the q-core. For higher disorder levels, the cores shrink further until they eventually vanish.

Table I: Analysis of real networks with known community structure. For each network the table reports, from left to right, the name of the network, the size of its communities $n_C$, the C-score, the B-score, the size of the C-5% core and the size of the B-5% core.
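The q-core definition can be read as a simple scan over the scores recorded while worst nodes are removed one at a time. The sketch below assumes that such a list of scores is already available (entry `t` being the C-score or B-score of the group after removing `t` worst nodes); the function name `q_core_size` is hypothetical.

```python
def q_core_size(scores_after_removal, n_C, q=0.05):
    """Size of the q-core of a community of n_C nodes.

    scores_after_removal[t]: score of the group left after removing the
    t worst nodes (t = 0 is the full community).
    Returns the size of the largest remaining group with score below q,
    or 0 if no such group is found.
    """
    for t, score in enumerate(scores_after_removal):
        if score < q:
            return n_C - t   # first (hence largest) group below the threshold
    return 0


if __name__ == "__main__":
    # a 10-node group that becomes significant after dropping two nodes
    print(q_core_size([0.30, 0.20, 0.04, 0.01], n_C=10))
```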

VI. EMPIRICAL NETWORKS
We now show the utility and versatility of our method for the statistical evaluation of communities in real networks. An exhaustive study of the networks with modular structure in the literature has been performed; the following are only a few examples. We report results on social networks, such as the Zachary karate club [34] or the one extracted from the characters of the novel Les Miserables [38], and on biological networks, such as the C. elegans metabolic network [16]. In two cases, the Zachary club and the college football networks, the structure of the groups is known a priori: in the Zachary club because the network split into two separate groups due to internal dissensions, and in the college football network because the conference in which each team plays is given. It is also important to note that some of these networks, for instance the Zachary club or the C. elegans metabolic network, are weighted graphs for which the weights of the links are equivalent to multiple connections. We have analyzed both the weighted and unweighted versions and report both results in the case of the Zachary club. The evaluation of the groups for the a priori known communities is summarized in Table I, while the results for the communities obtained by maximizing the modularity with a simulated annealing technique are displayed in Table II. There are some general observations valid for all networks. The C-score is often able to discriminate good communities, although sometimes a more sophisticated approach such as the B-score is needed. There are also a few cases in which the B-score reverses the judgment based on the C-score, meaning that a deeper analysis of the communities was required. An example of this type is the Zachary club 2-partition. However, when the original graph with the weight information is considered, its communities become more significant.
This seems to apply also to the other weighted graphs, showing that there is a connection between clustering structure and weight location in these networks. We also show the sizes of the 5%-cores of each community in the Tables, as well as a detailed analysis of one of the communities of the C. elegans metabolic network in Fig. 8.

VII. CONCLUSION AND DISCUSSION
Finding structure in graphs has direct implications for the study of several empirical disciplines as well as for a general understanding of the phenomena behind the evolution of the systems in which such structures arise. Communities are the most direct and easy-to-envisage example of network structures. This concept is a direct heir of the intuitive idea of close-knit groups in social networks. As such, it has had a long history, with a good number of algorithms proposed to detect communities in graphs. There are, however, two important issues missing in the literature: a firm mathematical definition of what a community means, and a clear way to determine which of the outputs of the community detection algorithms are really significant.
In this work, we have focused on the second question, with the hope of giving, even if only partially, a hint of where the answer to the first one may lie. A new measure able to statistically quantify the significance of a single community in networks has been introduced. This measure, called C-score, represents the probability of occurrence of a group with the same properties (i.e., same number of nodes, nodes with the same degree sequence and same internal connections) under the following hypotheses: (i) nodes in the network are randomly connected; (ii) the group is chosen, among all possible groups with the same properties, because it is the one which maximizes the density of internal connections. The first hypothesis is a natural assumption, and a null model where links are randomly placed is very often used as a term of comparison for the determination of correlations or other topological properties in networks. The second one stems from the common understanding of communities as groups with high intra-connectivity. Thanks to the theory of Extreme Statistics, we approximate the values of the C-score in the case in which our hypotheses hold. We have tested the performance of the C-score on several networks, ranging from random graphs to artificial networks with controlled community structure, or to real networks with unknown internal organization. In all cases, the method has produced good results. The ability of the method to evaluate one community at a time makes it possible to detect situations in which only some of the communities of the graph are meaningful while the rest of the groups are equivalent to random fluctuations. This approach is also flexible enough to deal with overlapping groups that share nodes between them, providing a separate evaluation for each cluster. Two further refinements of the C-score have also been introduced: one, the q-core, with the aim of exploring the internal structure of the communities, and another, the B-score, with the intention of evaluating the significance of a community based on a group of nodes instead of on the worst node of the cluster. The computational complexity of the evaluation of the B- and C-scores grows quadratically and linearly with the community size, respectively. These tools constitute a set of statistical measures for a thorough evaluation of single communities, thus avoiding the blind acceptance of the output of clustering algorithms.

Table II: Analysis of the community structure of several real networks via modularity maximization. For each network the table reports, from left to right, the name of the network, the size of its communities $n_C$, the C-score, the B-score, the size of the C-5% core and the size of the B-5% core. The community highlighted in Figure 8 is marked as (Green) in the table text.

Figure 8: In (a), an overview of the graph partition is shown. In (b), we display a zoom of a single community, depicting in red the nodes that are not significant group members. In (c), the C-q core analysis of the community.
The software to calculate the C-score and B-score of communities is available at http://filrad.homelinux.org/cscore.

APPENDIX A
... sum $S_t = \sum_{i=1}^{t} r_{w_i}$. Finding the distribution of $S_t$ can be formulated as calculating the probability that, given a sequence of $N - n_{C_t}$ i.i.d. random variables [we indicate by $n_F$ the size of a set $F$], the sum of the $t$ largest variables is less than $S_t$. The solution to this problem can be found in [32,39]. The cumulative probability distribution is given by Eq. (A1), where $\theta_t = \text{Integer-Value}\,[\,n_{B_t} + 1 - \xi_t\,]$ and $\xi_t = (S_t - n_{B_t} w_{t+1})/(1 - w_{t+1})$. Note that Eq. (A1) is valid under the assumption of independent variables, which is justifiable to some extent in the case of random networks.