Workflow example

Christian Arnold, Pooja Bhat, Judith Zaugg

30 October 2017

Abstract

This workflow vignette shows how to use the SNPhood package in a real-world example. For this purpose, you will use the SNPhoodData package for a more complex analysis to illustrate most of the features from SNPhood. Importantly, you will also learn in detail how to work with a SNPhood object and what its main functions and properties are. The vignette will be continuously updated whenever new functionality becomes available or when we receive user feedback.

1 Example Workflow

In the following example, you will use data from the SNPhoodData package to address the question how many of the previously identified H3K27ac quantitative trait loci (QTLs) for individuals from the Yoruban (YRI) population [1] show allele-specific signal in individuals of European origin (CEU).

corresponding genotypes for the hQTLs in a gzipped VCF file (genotypes.vcf.gz)

The first two are required to run a SNPhood analysis, genotype files are optional.

1.2 Setting up SNPhood

To set up a SNPhood analysis, you first need to create a named list that contains all parameters needed in a SNPhood analysis. This can be done by calling the function getDefaultParameterList, which generates a default list of parameters. It takes up to two (optional) arguments, which is 1) the path to the user-defined regions file and 2) whether or not the data are paired-end:

In many cases the default returned parameter values are a reasonable choice. However, always check the validity and usefulness of the parameters before starting an analysis to avoid unreasonable results.

In this example the default value for most parameters is a reasonable choice: the size of the regions (regionSize = 500) resulting in an analysis window of 2 * 500 (5’ and 3’ of the hQTL) + 1 (for the hQTL) = 1001 bp. Often it is useful to do a pilot analysis with only a few regions to explore the data fast. This can be done by setting the parameter linesToParse to a value > -1 to indicate the number of lines that should be parsed. Since here we work with only 178 hQTLs on chr21 we keep all of them for the analysis.

There are a few parameters that we have to adjust: First, we want to pool datasets because there are two replicates for each individual, and combining the datasets will give us more power, for example to detect allelic biases (parameter poolDatasets). You also have to carefully check if the start positions in the user regions file are 0-based or 1-based because a shift of one base pair will result in non-sensical results for the genotype distribution of the hQTLs that are determined automatically during the analysis. In this case, start coordinates are 1-based, which is also the default for the parameter zeroBasedCoordinates. Our the data are mapped allele-specifically, so perform allele-specific analysis you have to ensure to set the parameter readGroupSpecific to TRUE.

SNPhood offers a powerful and intuitive way of controlling which reads are considered valid when importing BAM files by means of the various flags that exist (see https://samtools.github.io/hts-specs/SAMv1.pdf), in complete analogy to the Rsamtools package. In essence, for each flag, a corresponding parameter (readFlag_) exists that specifies if the flag has to be set (TRUE), not set (FALSE) or if is irrelevant (NA).

We thus have to check the default values for the various readFlag_ parameters. The default values are for paired-end data, and all of our data here are also paired-end, so we can leave them at their default values. Note that the default values disregard any reads that could be problematic. For example, reads that are marked as duplicates (readFlag_isDuplicate), unmapped (either the read itself or its mate - readFlag_isUnmappedQuery and readFlag_hasUnmappedMate), not a primary read (readFlag_isNotPrimaryRead), not (properly) paired (readFlag_isPaired and readFlag_isProperPair) or fail quality controls (readFlag_isNotPassingQualityControls) are discarded. Note that all flags (as specified by the readFlag parameters) that are set to NA will be ignored when importing the reads.

Lastly, we adjust the size of each bin within the region and select a smaller value than the default one (binSize = 25 instead of 50).

# Verify that you do not have zero-based coordinates
par.l$zeroBasedCoordinates

## [1] FALSE

par.l$readGroupSpecific

## [1] TRUE

par.l$poolDatasets = TRUE
par.l$binSize = 25

You are almost done with the preparation, all that is left is to create a data frame to tell SNPhood which data to use for the analysis and some additional meta information. For this, you can use another helper function: collectFiles. The argument patternFiles specifies the folder and file name of input files; wildcards are allowed.

Finally, you assign the names of the individuals in the column “inidivual” to make pooling of the datasets possible. The column input can be ignored because there is no negative control in this analysis due to the fact that the analysis is done allele-specifically and currently, input normalization does not work in this setting (see the main vignette for details). The column genotype can also be ignored for now because (will be integrated later):

1.3 Quality control

** As stated explicitly in the main vignette, SNPhood is not a designated and sufficient tool for ChIP-Seq QC and has never been designed as such. It is important to assess potential biases such as GC, mapping, contamination or other biases beforehand using dedicated tools both within and outside the Bioconductor framework (see the main vignette for designated QC tools).**

However, SNPhood does offer some rudimentary QC controls. Before executing the full pipeline, you will therefore first do a quick QC step to make sure that the datasets do not have any artefacts. For this, you will investigate the correlation of the raw read counts for our regions among the different datasets. The correlation coefficients should be very high among replicate samples and relatively high among different samples.

For this, you run the main function analyzeSNPhood with a special argument and afterwards employ the function plotCorrelationDatasets. Note that you temporarily reset the value of the parameter poolDatasets to also check the correlation among the replicates samples.

## Warning in analyzeSNPhood(par.l, files.df, onlyPrepareForDatasetCorrelation
## = TRUE, : Forcing parameter normAmongEachOther to FALSE because either input
## normalization is turned on, only one files is going to be processed, or
## because allele-specific reads are requested

## Warning in analyzeSNPhood(par.l, files.df, onlyPrepareForDatasetCorrelation
## = TRUE, : Note that you set the parameter "onlyPrepareForDatasetCorrelation"
## to TRUE. You will not be able to use any functionality except the sample
## correlation plots. For a full analysis, run the function again and set the
## parameter to FALSE.

You can now run the correlation analysis on the SNPhood object and plot it directly (we could also plot it to a PDF file):

The correlation values are indeed very high among the replicate samples and also quite high among the individuals. The mean correlation coefficient among the datasets is 0.91, which is very high (note that only the lower triangle matrix of the correlation matrix is used for this, to not bias the analysis). There does not seem to be a problem with the datasets. However, as mentioned repeatedly, this is not sufficient for QC and should only be another verification that data quality is high and that biases have been controlled for.

1.4 Executing the main function

Now you can execute the full pipeline by setting the parameter onlyPrepareForDatasetCorrelation to FALSE again (the default). The execution of the function may take a few minutes and may issue some warnings e.g. indicating missing data and “correcting” inconsistent parameter settings:

## Warning in analyzeSNPhood(par.l, files.df, onlyPrepareForDatasetCorrelation
## = FALSE, : Forcing parameter normAmongEachOther to FALSE because either
## input normalization is turned on, only one files is going to be processed, or
## because allele-specific reads are requested

The warnings indicate that 4 duplicate regions have been removed as well as that the parameter normAmongEachOther has been set to FALSE because, in this case, reads are read group(allele)-specific (parameter readGroupSpecific is set to TRUE).

1.5 Working with a SNPhood object: Extracting and manipulating information and metadata

Now you can start analyzing the results already, as the main pipeline finished. To familiarize ourselves with the SNPhood object let’s first take a look at some of its properties and helper functions. Generally, there are four “entities” stored in a SNPhood object: regions, datasets, bins, and read groups. For all of them, methods to extract and alter the stored information exist (see below).

## Warning in .getCounts(SNPhood.o, type = "enrichmentBinned", readGroup,
## dataset): Returning an empty list, could not find the requested data in the
## object. Did you ask for the correct type of data? See also the help pages.

1.6 Visualizing counts and enrichment

Let’s first visualize the number of overlapping reads for the regions before (plotRegionCounts) and after binning (plotBinCounts) datasets and read groups. The two functions have a number of arguments to make the visualization as flexible as possible. See the help pages for details, we here only touch upon a few parameters: