Quality Surrogate Variable Analysis

Studying differential gene expression in postmortem human brain tissue requires understanding how tissue degradation affects expression, particularly when degradation confounds [1] the differences in expression levels between subject groups. Addressing this confounding requires measurements from a control dataset of postmortem tissue from individuals who do not have the outcome of interest. These measurements quantify the impact of tissue degradation on expression, and that estimate can then be used in a case-control study to examine the impact of the outcome of interest on gene expression. Incorporating the degradation estimates from control brains into the differential expression analysis of brains that have the outcome of interest leads to more accurate results and reduces the number of false positives, that is, genes incorrectly identified as differentially expressed between cases and controls.

SVA background

RNA sequencing (RNA-seq) is a high-throughput method for quantifying gene expression levels that requires high-quality RNA. The effect of RNA quality on accurately detecting differential expression was previously addressed with surrogate variable analysis (SVA), which estimates surrogate variables for unmodeled sources of heterogeneity in expression studies, such as batch effects (Leek and Storey, 2007). Confounding by RNA quality, however, requires a more robust approach to identifying differentially expressed genes.

qSVA

The quality surrogate variable analysis (qSVA) framework, an extension of SVA, was developed by Andrew Jaffe and colleagues (Jaffe, Tao, Norris, Kealhofer, et al., 2017) to address confounding by brain tissue degradation. qSVA reduces the number of false positives, since genes can be spuriously identified as differentially expressed when confounding by RNA quality is not adequately controlled for. This conservative analysis approach uses stricter criteria: it relies on well-established processing methods, applies expression cutoffs, avoids potential batch effects, and adjusts for confounding by RNA degradation using qSVA.

Datasets

The qSVA algorithm requires two datasets. Here, the dataset of interest is part of BrainSeq, A Human Brain Genomics Consortium, which was initiated to generate a public database of gene expression in postmortem brain tissue and to enhance the understanding of psychiatric disorders through neurogenomic data (Schubert, O’Donnell, Quan, Wendland, et al., 2015). The other is a control dataset, also called the degradation dataset, since it measures the impact of degradation on gene expression in postmortem tissue from individuals who do not have the outcome of interest. The degradation dataset is much smaller and is used to determine the genomic regions most associated with brain tissue degradation. This helps address the concern that an apparent association between the outcome of interest and gene expression actually reflects degradation, and it helps characterize RNA quality metrics through experimental approaches.

Using these two datasets, and by extending qSVA to more than one brain region, we are able to examine RNA quality confounding in RNA-seq data from multiple brain regions in a case-control study comparing patients with schizophrenia to non-psychiatric controls using BrainSeq consortium data (Collado-Torres, Burke, Peterson, Shin, et al., 2018). We focused on the hippocampus (HIPPO) and dorsolateral prefrontal cortex (DLPFC), two brain regions that have been identified as functionally altered in schizophrenia (Rasetti, Mattay, White, Sambataro, et al., 2014).
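To make the role of the degradation dataset more concrete, here is a minimal Python sketch. It assumes, following our reading of Jaffe et al. (2017), that the qSVs are the leading principal components of expression measured in the degradation-associated regions within the dataset of interest. The function name, the toy data, and the choice of a single component are illustrative assumptions, not the published implementation.

```python
import numpy as np

def quality_surrogate_variables(region_expr, k):
    """Return the top-k principal component scores (samples x k).

    region_expr: samples x regions matrix of log-scale expression from the
    dataset of interest, restricted to regions that the degradation dataset
    flagged as degradation-associated (hypothetical input).
    """
    centered = region_expr - region_expr.mean(axis=0)
    # SVD of the centered matrix: columns of u scaled by s are the PC scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

# Toy data (simulated, not BrainSeq): one hidden RNA quality factor drives
# expression in 200 degradation-associated regions across 40 samples.
rng = np.random.default_rng(0)
n_samples, n_regions = 40, 200
quality = rng.normal(size=n_samples)
loadings = rng.normal(size=n_regions)
noise = 0.1 * rng.normal(size=(n_samples, n_regions))
region_expr = np.outer(quality, loadings) + noise

qsvs = quality_surrogate_variables(region_expr, k=1)
# The first qSV should track the hidden quality factor, up to sign.
recovery = abs(np.corrcoef(qsvs[:, 0], quality)[0, 1])
print(f"|correlation| between qSV1 and hidden quality: {recovery:.3f}")
```

In a real analysis the number of components would be chosen from the data rather than fixed at one, and the resulting qSVs would be added as covariates to the differential expression model.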

Results

After using qSVA to adjust for the confounding effect of RNA quality, differential expression quality (DEqual) plots are used to assess the effectiveness of the statistical correction. These plots compare the differential expression statistics from the degradation experiment on the y axis to the statistics for the outcome from the dataset of interest on the x axis. The plots are shown for the HIPPO samples, with degradation measured as the log fold change in expression per minute and each point representing a gene. The goal is to assess the correlation between the two sets of statistics, and how that correlation changes after including the quality surrogate variables (qSVs) in the model. There should be no correlation between the degradation statistics and the statistics from the schizophrenia case-control BrainSeq dataset (labeled Dx, for diagnosis, on the axis), since the datasets are independent and the degradation dataset serves as a control. Model 1 is a naïve model that includes diagnosis only. Model 2 includes diagnosis, measures of RNA quality, and demographic covariates. Model 3 includes all of the terms from the previous models plus the qSVs. The number of genes identified as differentially expressed is shown in parentheses next to each model; it drops drastically from over 6,000 in model 1, to 63 in model 2, to 48 in model 3.
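The logic behind a DEqual comparison can be illustrated with a small simulation in Python. This is only a sketch under stated assumptions: the data are simulated with no true diagnosis effect, the `quality` covariate stands in for the qSVs, regression coefficients are used in place of the full differential expression statistics, and every variable name is hypothetical. It contrasts a naïve model with an adjusted one and does not reproduce the three BrainSeq models or their gene counts.

```python
import numpy as np

def dx_coefficients(expr, design):
    """Fit one least-squares model per gene; return the diagnosis coefficient.

    expr: samples x genes matrix; design: samples x covariates matrix whose
    column 1 is the diagnosis indicator (column 0 is the intercept).
    """
    beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return beta[1]

rng = np.random.default_rng(1)
n_samples, n_genes = 60, 300
dx = np.repeat([0.0, 1.0], n_samples // 2)          # controls, then cases
# RNA quality differs between groups, which creates the confounding.
quality = dx + 0.5 * rng.normal(size=n_samples)
# Per-gene degradation effect, standing in for degradation-dataset statistics.
degradation_effect = rng.normal(size=n_genes)
# Expression depends on quality only; there is no true diagnosis effect.
expr = np.outer(quality, degradation_effect) + 0.5 * rng.normal(size=(n_samples, n_genes))

intercept = np.ones(n_samples)
naive_design = np.column_stack([intercept, dx])
adjusted_design = np.column_stack([intercept, dx, quality])

# DEqual-style summary: correlation across genes between the degradation
# effects and the estimated diagnosis effects under each model.
r_naive = np.corrcoef(degradation_effect, dx_coefficients(expr, naive_design))[0, 1]
r_adjusted = np.corrcoef(degradation_effect, dx_coefficients(expr, adjusted_design))[0, 1]
print(f"naive model: r = {r_naive:.2f}; adjusted model: r = {r_adjusted:.2f}")
```

In this toy setting the naïve model's diagnosis effects closely mirror the degradation effects, while the adjusted model's do not, which is the pattern a DEqual plot is meant to reveal.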

Conclusions

Once we are confident that confounding has been removed from the samples of interest, we can assess differential expression between cases and controls. Using the 48 genes identified by model 3, we can then perform gene ontology enrichment analysis for biological processes to determine which processes are enriched, and to gain clearer insight into which genes are most affected in brain tissue of individuals with schizophrenia. For more information, please see the freely available preprint describing the BrainSeq Phase II project (Collado-Torres, Burke, Peterson, Shin, et al., 2018).
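As a rough illustration of what an enrichment analysis computes, the Python sketch below uses a one-sided hypergeometric test, one common choice for over-representation analysis. The text does not state which test was used in the study, so this is an assumption. Apart from the 48 selected genes, every number (the gene universe, the term size, the overlap) is made up for the example.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_in_term, n_selected, n_overlap):
    """Probability of seeing at least n_overlap term genes among n_selected
    genes drawn without replacement from n_universe genes, of which
    n_in_term are annotated to the term.
    """
    upper = min(n_in_term, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 expressed genes, a biological process term
# annotated to 200 of them, and 5 of the 48 selected genes in that term.
p_value = hypergeom_enrichment_p(
    n_universe=20_000, n_in_term=200, n_selected=48, n_overlap=5
)
print(f"enrichment p-value: {p_value:.2e}")
```

In practice this test is repeated for every term, and the resulting p-values are corrected for multiple testing before any term is reported as enriched.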

1. Wikipedia defines a confounder as follows: “In statistics, a confounder (also confounding variable, confounding factor or lurking variable) is a variable that influences both the dependent variable and independent variable causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations.”

2. These are principal components (PCs) computed on the genotype information from the individuals in this study. We use them to adjust for ethnicity more rigorously than a categorical race variable would allow.