eQTL tutorial, Bioc 2014

VJ Carey stvjc at channing dot harvard dot edu

Basic data from RECOUNT

We will use transcript profiles obtained through RNA-seq applied
to HapMap cell lines (Montgomery, Sammeth, Gutierrez-Arcelus, Lach, Ingle, Nisbett, Guigo, and Dermitzakis (2010)). The reported
gene-level count data have been rendered as an ExpressionSet (montpick_eset.RData) in the RECOUNT project
(Frazee, Langmead, and Leek (2011)) using the ENSEMBL nomenclature
for genes. These counts were filtered to
exclude genes with uniform zero value over all samples, subjected to
rlog transformation (DESeq2), and then filtered to include only
those with MAD at the 40th percentile of the overall distribution of genewise
MADs.

Characteristic code for the transformation to SummarizedExperiment is given here.
You do not need to execute this as the transformed data are available for loading
as noted below. (exs2se() is a bit of custom code available in the source
of this vignette.)

A naive analysis

The GGtools package includes a function, cisAssoc(), that
can test additive genetic associations in cis between SNP genotypes
housed in tabix-indexed VCF and expression data housed in a
SummarizedExperiment. The test is the score test for a generalized
linear model fit to independent
samples with a continuous response, with response variance approximately
independent of response mean. These assumptions are generally not met
for count data from RNA-seq experiments, but they may hold reasonably
approximately after
the rlog transformation. We will check on the severity
of the violations
after performing some tests.

Here is code that computes, for 200 genes on
chr17, association test statistics
and versions thereof under permutation of expression against
genotype. Do not perform these calculations, they will take too
long for the tutorial. The answer is provided in the
tutorial data, see below.

If instead of counting SNP-gene pairs, we wish to
count genes that show evidence of association with
some genotype, the plug-in FDR procedure can be
used with maximal association scores per gene, for
observed and permuted tests.

s2 = eqsens_dt( dt1, by="probes" )

plotsens(s2, ylab="Count of genes with evidence of eQTL at given FDR")

There is an indication here that we get a larger yield if
we filter more sharply on MAF and distance than we did in the
initial run.

Exercise 1.

How can we compute the FDR for the gene-wise hypotheses
H0g: Mean expression of gene g is independent of SNP
content for all SNP cis to g, under the restrictions
cis radius less than 50kb and MAF greater than 5 percent?

Visualization

The following code creates a visualization of genotype-expression
association for a selected
SNP-gene pair.

Exercise 5: Master regulators.

Find SNP whose FDR for association with multiple distinct genes is
found to be less than one percent. Visualize the relationships.

Connection with “GWAS hits”

We can acquire an image of the current NHGRI GWAS catalog
as follows.

library(gwascat)
curgw = makeCurrentGwascat()
curgw

HOWEVER – as of 28 July 2014, the coordinates used for the
current catalog are GRCh38. We used rtracklayer's liftOver
with the appropriate chain file to obtain an hg19 addressing scheme,
losing a small number of nonmappable loci.

Exercise 6

How frequent are exact coincidences between SNP exhibiting
eQTL association at FDR five percent or
less, and GWAS hits? Compare this frequency as computed
for SNP exhibiting association at FDR 95 percent or greater.

Further exercises

Exercise 6: Population of origin.

We have disregarded the population of origin in
this analysis. Use visualization to assess whether the
association between gene OSBPL7 and SNP rs17774008
has similar forms in subjects from CEU and YRI populations.

You can use s3url to get access to genotype data for chr22.
It takes 5 minutes or so to run cisAssoc on chr22 genes and SNP.
Obtain analyses for CEU and YRI groups separately. What
is the concordance on significance of SNP-gene pairs at FDR five percent or so?

Exercise 7: Efficient analysis with counted expression values.

/data/eQTL2014/montpick_eset.RData has the raw counts.
(Note, probe identifiers are in ENSEMBL vocabulary.) Use DESeq2 or edgeR to
obtain an FDR for a selected SNP-gene pair. Write a function
that does this for any given selection. In principle the mean-variance
relationship needs to be modeled separately for each gene-SNP pair,
so obtaining a statistically optimal solution for a
general search may be laborious. It would be nice to understand
the costs of using rlog with blind and fast set to TRUE
before doing a Gaussian-like analysis of eQTL, as opposed to
pair-specific optimally modeled negative binomial testing for eQTL.