Weighted gene co-expression network analysis with TCGA RNAseq data

The following tutorial describes the generation of a weighted co-expression network from TCGA (The Cancer Genome Atlas) RNAseq data using the WGCNAR package by Langfelder and Horvarth[^1]. In addition, individual genes and modules will be related to sample traits. Exemplarly, a co-expression network for skin cutaneous melanomas (SKCM) will be generated. However, the following weighted gene co-expression analysis (WGCNA) framework is applicable to any TCGA tumour entity.

The code of this vignette is a proof of principial example that can't be run as listed without assembling the RNAseq data as described in the following beforehand.

Assembly and preprocessing of TCGA RNAseq data

Melanoma RNAseq data for the CVE extension were downloaded as expression estimates per gene (RNAseq2 level 3 data) from the TCGA data portal. Please note that the TCGA Data portal is no longer operational and all TCGA data now resides at the Genomic Data Commons.
For WGCNA, the individual TCGA RNAseq2 level 3 files were concatenated to a matrix RNAseq with gene symbols as row and TCGA patient barcodes as column names.

Further preprocessing included the removal of control samples (for more information see the TCGA Wiki) and expression estimates with counts in less than 20% of cases.

To relate co-expression modules to disease phenotypes, clinical metadata is needed. As for the melanoma TCGA data, the clinical data was published as a curated spreadsheet in the supplements of the latest publication (suppl_table_S1D.txt)[^2].

As read counts follow a negative binomial distribution, which has a mathematical theory less tractable than that of the normal distribution, RNAseq data was
normalised with the voom methodology[^3]. The voom method estimates the mean-variance of the log-counts and generates a precision weight for each observation. This way, a comparative analysis can be performed with all bioinformatic workflows originally developed for microarray analyses.

library(limma)
RNAseq_voom = voom(RNAseq)$E

A large fraction of genes are not differentially expressed between samples. These have to be excluded from WGCNA, as two genes without notable variance in expression between patients will be highly correlated. As a heuristic cutoff, the top 5000 most variant genes have been used in most WGCNA studies. In detail the median absolute devision (MAD) was used as a robust measure of variability.

#transpose matrix to correlate genes in the following
WGCNA_matrix =t(RNAseq_voom[order(apply(RNAseq_voom,1,mad), decreasing =T)[1:5000],])

Construction of co-expression network

The connections within a network can be fully described by its adjacency matrix $a_{ij}$, a $N~x~N$ matrix whose component $a_{ij}$ denotes the connection strength between node $i$ and $j$. The connection strength is defined by the co-expression similarity $s_{ij}$. The most widely used method defines $s_{ij}$ as the absolute value of the correlation coefficient between the profiles of node $i$ and $j$: $s_{ij} = |cor(x_i,x_j)|$. However, we employed the biweight midcorrelation to define $s_{ij}$, as it is more robust to outliers[^4]. This feature is pivotal, as we do not expect genes to be co-expressed in all patients.

Originally, the co-expression similarity matrix was transformed into the adjacency matrix using a 'hard' threshold. In these unweighted co-expression networks, two genes were identified to be linked ($a_{ij} = 1$), if the absolute correlation between their expression profiles were higher than a 'hard' threshold $\tau$. However, this hard threshold does not reflect the underlying continuous co-expression measure and leads to a significant loss of information. As a consequence, Horvath and colleagues introduced a new framework for weighted gene co-expression analysis (WGCNA)[^5]. At its core, a weighted adjacency is defined by raising the co-expression similarity to a power ('soft' threshold):

$$a_{ij} = s_{ij}^\beta$$

with $\beta \geq 1$. To choose an appropriate $\beta$-value, the authors present a methodology that assesses the scale free topology of the network. For detailed rational of this approach, please see Zhang and Horvath[^5].

As for the melanoma network, a beta value of 3 was the lowest power for which the scale-free topology fit index curve flattens out upon reaching a high value ($R^2$ \textgreater 0.9 as suggested by Langfelder and Horvarth).

#calculation of adjacency matrix
beta =3
a = s^beta

Lastly, the dissimilarity measure is defined by

$$w_{ij} = 1 - a_{ij}$$

#dissimilarity measure
w =1-a

Please note that TOM-based (topological overlap matrix) dissimilarity proposed by Horvarth and colleagues did not result in distinct gene modules for the analysed melanoma network.

Identification of co-expression modules

To identify co-expression modules, genes are next clustered based on the dissimilarity measure, where branches of the dendrogram correspond to modules. The gene dendrogram obtained by average linkage hierarchical clustering is depicted in figure 2. Ultimately, gene co-expression modules are detected by applying a branch cutting method. We employed the dynamic branch cut method developed by Langfelder and colleagues [^6], as constant height cutoffs exhibit suboptimal performance on complicated dendrograms. WGCNA of the 472 TCGA melanoma samples revealed 41 co-expression modules. All genes that are not significantly co-expressed within a module are summarized in an additional module 0 for further analysis.

The relation between the identified co-expression modules can be visualized by a dendrogram of their eigengenes (fig. 3). The module eigengene is defined as the first principal component of its expression matrix. It could be shown that the module= eigengene is highly correlated with the gene that has the highest intramodular connectivity[^6].

Relation of co-expression modules to sample traits

An advantage of co-expression network analysis is the possibility to integrate external information.
At the lowest hierarchical level, gene significance (GS) measures can be defined as the statistical significance (i.e. p-value, $p_i$) between the $i$-th node profile (gene) $x_i$ and the sample trait $T$

$$GS_i = -log~p_i$$

Module significance in turn can be determined as the average absolute gene significance measure. This conceptual framework can be adapted to any research question. The clinical metadata used in the following was obtained from the recent TCGA melanoma publication[^2] (Supplemental Table S1D: Patient Centric Table).

#load clinical metadata. Make sure that patient barcodes are in the same format #create second expression matrix for which the detailed clinical data is available
WGCNA_matrix2 = WGCNA_matrix[match(clinical$Name, rownames(WGCNA_matrix)),]
#CAVE: 1 sample of detailed clinical metadata is not in downloaded data (TCGA-GN-A269-01')
not.available =which(is.na(rownames(WGCNA_matrix2))==TRUE)
WGCNA_matrix2 = WGCNA_matrix2[-not.available,]
str(WGCNA_matrix2)
#hence it needs to be removed from clinical table for further analysis
clinical = clinical[-not.available,]

Representatively, co-expression modules will be related to the so called lymphocyte score, which summarises the lymphocyte distribution and density in the pathological review.

To enable a high-level interpretation of the dendrogram of module eigengenes, gene ontology (GO) enrichment analysis was performed for the module genes using the GOstatsR package [^8]. Modules were named according to the most significant GO einrichment given a cutoff for the ontology size. The smaller the ontology size, the more specific the term. In this analysis a cutoff of 100 terms per ontology was chosen.

Exploration of individual genes within co-expression module

Assessing the module significance for different sample traits facilitates an understanding of individual co-expression modules for melanoma biology. As for the prioritisation of variants we are next interested in the role of the variant gene within a co-expression module. To this end, Langfelder and Horvath suggest a 'fuzzy' measure of module membership defined as

$$K^q = |cor(x_i,E^q)|$$

where $x_i$ is the profile of gene $i$ and $E^q$ is the eigengene of module $q$. Based on this definition, $K$ describes how closely related gene $i$ is to module $q$. A meaningful visualization is consequently plotting the module membership over the p-value of the respective GS measure. As a third dimension, the dot-size is weighted according to the effect size.