Cluster and propensity based approximation of a network.

Abstract

BACKGROUND:

The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets.

RESULTS:

Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network generalizes not only correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood based tests for assessing the significance of an edge after controlling for clustering. We present a novel Majorization-Minimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bipartite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM).

CONCLUSIONS:

The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood based significance tests for edge counts, c) it directly models higher-order relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust.

Four clusters were simulated in the Euclidean plane by sampling from rotationally symmetric normal distributions with means at the respective cluster centers and covariance matrix I (the identity matrix). The numbers of points in the clusters were 50, 100, 150, and 200 for the black, red, green, and blue clusters, respectively. A) Plot of the points colored by cluster. B) Heatmap that color-codes the ordered adjacency matrix, calculated using the formula A(i,j) = 1 − [Euclidean.Distance(i,j) / max(Euclidean.Distance(i,j))]^2. In this plot red indicates a high adjacency, and green indicates a low adjacency. As expected, the adjacency within clusters is very high, and the adjacency between the blue and black clusters is the lowest since they are the furthest apart. C) The scatter plot between propensity (y-axis) and whole network connectivity (row sum of the adjacency matrix, Eq. 7) shows that the propensity is related to the distance between a point and its cluster’s center (given Eq. 10) in this example. D) The scatter plot between cluster similarity calculated using CPBA (y-axis) and the Euclidean distance between cluster centers (x-axis) shows a perfect negative correlation (r = −1).
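The simulation and the adjacency formula in panel B can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the cluster centers and the random seed are hypothetical (the caption does not report them), and only the cluster sizes, the unit variance, and the adjacency formula are taken from the caption.

```python
import math
import random


def simulate_clusters(centers, sizes, seed=0):
    """Sample 2-D points from unit-variance normals around each center."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, ((cx, cy), n) in enumerate(zip(centers, sizes)):
        for _ in range(n):
            points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            labels.append(label)
    return points, labels


def distance_adjacency(points):
    """A(i,j) = 1 - [d(i,j) / max d]^2, with d the Euclidean distance."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    d_max = max(max(row) for row in dist)
    return [[1.0 - (d / d_max) ** 2 for d in row] for row in dist]


# Hypothetical cluster centers; the sizes follow the caption
# (black, red, green, blue).
centers = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (8.0, 4.0)]
sizes = [50, 100, 150, 200]
points, labels = simulate_clusters(centers, sizes)
adjacency = distance_adjacency(points)
```

By construction every diagonal entry equals 1 and the most distant pair of points has adjacency 0, so the matrix is a valid similarity measure on [0, 1].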

Gene expression simulation results. Gene expression data were simulated using the simulateDatExpr5Modules function in the WGCNA R package. An adjacency matrix was then calculated from the Pearson correlation coefficients between the expression levels of each pair of genes. These plots show the relationship between the intramodular propensity and the true module membership, kME (Eq. 3), first in all the clusters combined (top left) and then in each of the five clusters individually. Note the strong correlation and significant p-value in all cases.
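As a rough illustration of how an adjacency matrix can be derived from pairwise Pearson correlations, the sketch below raises the absolute correlation to a soft thresholding power. The caption does not state which transformation was applied, so the unsigned form |cor|^power and the default power of 6 are assumptions, and the function names and toy expression profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_adjacency(expr, power=6):
    """Unsigned adjacency |cor(i,j)|^power (assumed transformation).

    expr is a list of gene profiles, each a list of sample values.
    """
    n = len(expr)
    adj = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = abs(pearson(expr[i], expr[j])) ** power
            adj[i][j] = adj[j][i] = value
    return adj


# Toy data: three genes measured in four samples.
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with gene 0
    [2.0, 1.0, 4.0, 3.0],
]
toy_adjacency = correlation_adjacency(expr)
```

Because the absolute value is taken, perfectly anti-correlated genes receive the same adjacency as perfectly correlated ones; a signed variant would treat them differently.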

Human brain expression data illustrate how CPBA can be interpreted as a generalization of WGCNA. A) Hierarchical cluster tree based on WGCNA. Color bands show the WGCNA modules (first band), CPBA modules identified by propensity clustering (second band), and the modules identified by Oldham et al. (third band). CPBA yields modules very similar to those identified by WGCNA. The overlap with the well annotated modules of Oldham et al. confirms that these clustering procedures yield meaningful modules. B) The intermodular adjacency calculated using CPBA (y-axis) is strongly correlated (r = 0.93) with its WGCNA counterpart, the correlation between eigengenes raised to the soft thresholding power. C) For nodes restricted to module 1 (turquoise in the color bands in panel A), CPBA propensity is highly correlated with its WGCNA counterpart, the module membership kME (Eq. 3) raised to the soft thresholding power. D) and E) show analogous scatter plots for modules 2 (blue) and 3 (brown), respectively. F) The co-expression network exhibits approximate scale free topology (SFT). Specifically, the x-axis corresponds to equal width bins of the logarithm (base 10) of the connectivity (Eq. 1), and the y-axis reports the corresponding logarithm of the frequency. The approximate straight line relationship (linear model fitting index R^2 = 0.91) indicates that SFT fits very well. G) evaluates SFT for CPBA connectivity defined by the right-hand side of Eq. 7. H) evaluates SFT for the propensity p_i only. I) The CPBA connectivity (y-axis) is highly correlated (r = 0.96) with connectivity k_i in the correlation network (x-axis). Genes are colored according to module assignment (PropClust color band in panel A). J) There is a high correlation (r = 0.88) between k_i (x-axis) and propensity (y-axis). K) There is a high correlation (r = 0.93) between CPBA based connectivity (x-axis) and propensity (y-axis).
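The scale free topology check described for panel F can be sketched as follows: bin the base-10 logarithm of the connectivity into equal width bins, take the logarithm of each bin's relative frequency, and fit a straight line. This is a minimal sketch under stated assumptions, not the procedure used to produce the figure: the function name, the default of 10 bins, the use of bin midpoints as x-values, and the dropping of empty bins are all choices made here for illustration.

```python
import math


def scale_free_fit(connectivity, n_bins=10):
    """Return (slope, R^2) of log10(frequency) versus log10(connectivity).

    Connectivities are log-transformed, grouped into n_bins equal width
    bins, and empty bins are dropped before the least squares fit.
    """
    logs = [math.log10(k) for k in connectivity if k > 0]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        raise ValueError("connectivity values must not all be equal")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in logs:
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1

    xs, ys = [], []
    for b, count in enumerate(counts):
        if count > 0:
            xs.append(lo + (b + 0.5) * width)          # bin midpoint
            ys.append(math.log10(count / len(logs)))   # log relative frequency
    if len(xs) < 2:
        raise ValueError("need at least two non-empty bins")

    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        raise ValueError("bin frequencies are all equal; R^2 is undefined")
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Toy connectivities whose bin frequencies drop tenfold per decade.
toy_connectivity = [1.0] * 100 + [10.0] * 10 + [100.0]
slope, r_squared = scale_free_fit(toy_connectivity, n_bins=3)
```

A negative slope together with an R^2 close to 1 is what an approximately scale free network would produce; the toy input is constructed so that the log-log relationship is exactly linear.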

OMIM disease network. The intramodular connections between the nodes of the eye disease cluster are shown. Diseases are colored based on their MeSH categories: eye diseases (green), diseases linked to multiple categories (grey), and diseases that were not found in MeSH (white). Note that, judging by name alone, more nodes should have been classified as eye diseases by MeSH. Primary examples include retinitis pigmentosa, cone-rod dystrophy, retinal dystrophy, and microcornea. Although MeSH did not label these nodes as eye diseases (green), CPBA correctly assigned them to the eye cluster. Node and font sizes are proportional to a disease’s propensity.

OMIM Gene Network. Genes are colored based on their cluster membership, and node size is proportional to a gene’s propensity. This view was achieved with a spring-embedded layout in Cytoscape using the number of edges between two genes as weights. Note that CPBA based clustering identifies modules of highly interconnected nodes.

OMIM CPBA versus PPP Analysis. Scatterplot of the −Log10(P) values obtained from analysis of OMIM using 14 and 10 clusters versus a single cluster for the disease network and the gene network, respectively. Points are colored according to whether they come from a pair within a cluster (red) or between two clusters (black). The coloring shows that, by conditioning on the clustering, CPBA gains sensitivity for detecting intercluster pairs while reducing sensitivity for intracluster pairs.

Simulated CPBA versus PPP Analysis. Scatterplot of the −Log10(P) values versus the true adjacency values, which were obtained from a 0/1 block diagonal matrix by resetting a few other entries from 0 to 1. The changed entries are shown along with the resulting −Log10(P) values obtained using CPBA and PPP.
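The construction of the simulated adjacency matrix can be sketched as below. Only the 0/1 block diagonal structure and the resetting of a few entries from 0 to 1 come from the caption. The block sizes, the number of reset entries, the random seed, and the function names are hypothetical, and the sketch stops short of computing the CPBA or PPP P-values.

```python
import random


def block_diagonal_adjacency(block_sizes):
    """0/1 matrix with 1 inside each diagonal block and 0 elsewhere."""
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    adj = [[1 if labels[i] == labels[j] else 0 for j in range(n)]
           for i in range(n)]
    return adj, labels


def reset_zero_entries(adj, n_edges, seed=0):
    """Return a copy with n_edges randomly chosen 0 entries reset to 1.

    Pairs are drawn from the upper triangle and mirrored so the
    matrix stays symmetric.
    """
    rng = random.Random(seed)
    n = len(adj)
    zero_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
                  if adj[i][j] == 0]
    changed = rng.sample(zero_pairs, n_edges)
    perturbed = [row[:] for row in adj]
    for i, j in changed:
        perturbed[i][j] = perturbed[j][i] = 1
    return perturbed, changed


# Hypothetical sizes: three clusters and five reset entries.
base, cluster_labels = block_diagonal_adjacency([10, 15, 20])
perturbed, changed_pairs = reset_zero_entries(base, n_edges=5)
```

Every reset entry necessarily links nodes from two different blocks, which makes these pairs the natural targets of an edge significance test that conditions on the clustering.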