Overview

This document covers the use of various tools and metrics associated with depth of coverage.

Note that this document is currently incomplete; in the future we will add more details pertaining to the DepthOfCoverage and DiagnoseTargets analysis tools. For now, it gives you some keys to understanding the coverage annotations reported in the AD and DP fields of the VCF files output by the variant calling tools.

Coverage annotations: DP and AD

The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and the overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:

DepthPerAlleleBySample (AD): outputs the unfiltered count of reads supporting each allele, for each sample.

Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.

Note that because the AD includes reads and bases that were filtered by the caller (and in the case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determine the genotype calls.
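As a concrete illustration of how the two metrics can disagree, here is a minimal sketch (plain Python, with made-up field values) of pulling AD and DP out of a VCF genotype column:

```python
def parse_ad_dp(format_keys, sample_field):
    """Parse the AD and DP entries from a VCF genotype column.

    format_keys:  the FORMAT column, e.g. "GT:AD:DP:GQ:PL"
    sample_field: the per-sample column, e.g. "0/1:17,21:36:99:642,0,501"
    """
    entries = dict(zip(format_keys.split(":"), sample_field.split(":")))
    ad = [int(x) for x in entries["AD"].split(",")]  # unfiltered per-allele depths
    dp = int(entries["DP"])                          # filtered total depth
    return ad, dp

ad, dp = parse_ad_dp("GT:AD:DP:GQ:PL", "0/1:17,21:36:99:642,0,501")
# here sum(ad) = 38 while dp = 36: AD counts two reads that DP filtered out
```

In this invented example the AD values sum to more than DP, which is possible precisely because AD is computed from unfiltered reads.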

TO BE CONTINUED... meanwhile, check out the DepthOfCoverage and DiagnoseTargets tool docs.

See this announcement regarding our plans for support of DepthOfCoverage and DiagnoseTargets. If you find that there are functionalities missing in either tool, leave us a comment and we will consider adding them.

For a complete, detailed argument reference, refer to the GATK document page here.

Introduction

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:

Total, mean, median, and quartiles for each partition type: aggregate

Total, mean, median, and quartiles for each partition type: for each interval

A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)

A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X

A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)

A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.
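The "at least X depth in at least Y groups" matrices described above are simple counts over loci. A toy sketch of that computation (invented depths, not actual DepthOfCoverage output):

```python
# depths[locus] = list of per-sample depths at that locus (toy data)
depths = {
    "chr1:100": [20, 18, 5],
    "chr1:101": [30, 25, 16],
    "chr1:102": [12, 4, 3],
}

def loci_covered(depths, min_depth, min_samples):
    """Count loci where at least `min_samples` samples have
    depth >= `min_depth`."""
    return sum(
        1 for per_sample in depths.values()
        if sum(d >= min_depth for d in per_sample) >= min_samples
    )

loci_covered(depths, min_depth=15, min_samples=2)  # -> 2
```

This also makes it clear why the walker should see all samples at once: the per-locus sample counts cannot be reconstructed from separate single-sample runs without keeping every per-locus depth.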

For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

Coverage by Gene

To get a summary of coverage for each gene, you may supply a RefSeq (or alternative) gene list via the -geneList argument.

Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.

We have decided to continue providing and supporting DepthOfCoverage and DiagnoseTargets for the foreseeable future. Going forward, we'll try to integrate them and develop their features to address the main needs of the community. To this end we welcome your continuing feedback, so please feel free to contribute comments and ideas in this thread.

To all who took the time to tell us what you find useful about DoC and DT (and what you wish it could do), a big thank you! This is always very useful to us because it helps us identify which features are most valuable to our users.

HaplotypeCaller has a --min_base_quality_score flag to specify the minimum "base quality required to consider a base for calling". It also has -stand_call_conf and -stand_emit_conf, both of which use a "minimum phred-scaled confidence threshold".

What is the difference between the base quality flag and the phred-scaled confidence thresholds flags in deciding whether or not to call a variant?

Also, why are there separate flags for calling variants and emitting variants? To my way of thinking, calling and emitting variants are one-and-the-same.

Is there any way to specify a minimum minor-allele frequency for calling variants in HaplotypeCaller? Under the default settings, the program expects a 1:1 ratio of each allele in a single diploid organism. However, stochastic variance in which RNA fragments are sequenced and retained could lead to departures from this ratio.

Finally, is there any way to specify the minimum level of coverage for a given variant for it to be considered for calling/genotyping?

When I checked the coordinates of intervals given in the DepthOfCoverage "*.interval_summary" file, I found that it adds one to the start position of each interval. For example, the original coordinates in my .bed interval file are 3669022-3669474, but DOC gives 3669023-3669474; only the start position changes, not the end.

I am not sure whether this is caused by a "start from 0" versus "start from 1" problem, because I have no idea whether the coordinates in my original BED files start from 1 or from 0.
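For reference: BED intervals are 0-based and half-open, while GATK interval coordinates are 1-based and inclusive, which explains exactly the one-base shift at the start position (and only the start). A minimal conversion sketch:

```python
def bed_to_gatk(start, end):
    """Convert a BED interval (0-based, half-open) to GATK/VCF-style
    coordinates (1-based, inclusive): only the start shifts by one."""
    return start + 1, end

bed_to_gatk(3669022, 3669474)  # -> (3669023, 3669474)
```

So the interval_summary output is consistent with the BED input; the two conventions simply differ in where counting starts.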

I'm assuming this is not normal as everything was working fine in 3.1-1. Ironically I recently switched to 3.2-2 because the RealignerTargetCreator was giving me an Unsupported major.minor version 51.0 error. I'm running Java 1.7.0-b147.

[EDIT]

I've since noticed that the BAM file was created with GATK version 2.4-9-g532efad. So the 'current' and 'latest' software filing system in place here appears to have failed. I can confirm that GATK v3.2-2 DepthOfCoverage tool generates the above program record group id error. I'll chase the administrators down to ensure the latest GATK version is installed and start from scratch.

ERROR MESSAGE: Argument with name 'includeRefNSites' isn't defined.

I've been learning how to use the DepthOfCoverage to calculate coverage across genes. I noticed that some genes were covered by the bam, in the interval_list file, and in the geneList file, but were not reported in the gene_summary table. Most of these were non-coding RNAs, and in reviewing the geneList file, I noticed that for the non-coding RNAs, the Coding region start position is 1 base higher than the Coding region end position (which also put the Coding region start position higher than the Transcription end position). I adjusted the file and made the Coding region identical to the Transcription regions for the non-coding RNAs, and this resolved the issue for most of the genes. It appears that remaining genes that are still not reported in the gene_summary table, all overlap with exons or UTRs from other genes. My questions are:

Is it the case that if two exons overlap, only one will be reported on, or is something else going on?

What regions of the gene is the tool reporting on, the whole transcribed region, or just the coding region?

Am I safe in changing the coding regions of non-coding RNAs to equal the transcribed region for the coverage analysis?

I ran DepthOfCoverage on several samples for ~5000 genes. It turns out that a handful (~30) of these genes are overlapping (but on different strands). In these cases, it seems that DoC analyzes the first overlapping gene but not the second overlapping gene. For example, in this case (below), it will produce depth stats for the first interval, but not the second interval. You can see the end of the first interval 495620 is actually bigger than the beginning of the second interval 494949.

Hi we've been looking at results of a recent run of GATK-HC (3.1-1) using the new N1 pipeline and we've been seeing something odd in the distribution of the Depth of Coverage (Using DP from the Genotype Fields) we're seeing for the raw unfiltered variants.

All our samples are sequenced using PCR-Free libraries and have two lanes of sequence (~24x mapped depth) and looking at depth of coverage from bedtools we see a nice clean distribution (red in graph) but when we look at the data from the HaplotypeCaller sites (Blue in graph) we see a bimodal distribution with an excess of variants called at lower coverage (~16x) vs the most common coverage of around 24x. We've seen this in all the samples we've looked at so far, so it's not just a one off.

I've had a quick look at read depth from another variant caller (Platypus) and there we see no evidence of this bimodal distribution in the variants it has called.

Is this expected behaviour?
If so why does it occur?
If not any idea what is going on here, is it a bug in the variant caller or the depth statistics?
Do you see the same thing in other datasets?

I have a large number of gVCF files, either single samples or combined in approximately 100 samples per combined gVCF file.
I would like to compute something like the average depth for a set of regions of interest from the combined or single gVCF files.
I can see that I could try to get a VCF output at every position, and use that to infer the percentage with a given read depth, but that seems mightily cumbersome.

So my question: is there an equivalent to the DepthOfCoverage GATK module that takes gVCF files as input? Ideally it would take a combined gVCF and extract per-sample average/median/minimum depth, but otherwise I can work with single-sample gVCF data.

I'm a little surprised to find that the DepthOfCoverage tool still does not allow parallelization as of v3.1-1. Would it be possible to parallelize this tool and others that treat samples/BAMs as independent data? Pretty please... :-)

I am interested in creating genome-wide gVCFs from various sources, including exome samples. The rationale is that it is not completely clear what the target region actually is all the time, and it would be good to keep some flexibility about what I capture.

Nevertheless, these single-sample gVCF files can become quite large, and most of the lines in the file are used to store very low depth data (DP <= 2), which is probably not critical for downstream analysis. I would be interested to have, in the HaplotypeCaller gVCF data generation process, an additional parameter that marks regions with, say, 2 reads or fewer as missing. That would considerably reduce the storage at little cost in terms of discarded information.

Does such a parameter exist already? Would it make sense at all to add this? I am not sure but right now it seems to me it might be useful.
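No such HaplotypeCaller parameter is described here; as a stopgap, low-depth records could be dropped in post-processing instead. A hedged sketch (the DP threshold is illustrative, and note this discards the records entirely rather than marking them missing):

```python
def filter_low_dp(lines, min_dp=3):
    """Yield gVCF lines, dropping records whose sample-level DP is
    below `min_dp`. Header lines are kept unchanged. This is a
    post-processing sketch, not an existing HaplotypeCaller option."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        entries = dict(zip(cols[8].split(":"), cols[9].split(":")))
        if "DP" not in entries or int(entries["DP"]) >= min_dp:
            yield line
```

This only handles single-sample gVCFs (one genotype column) and keeps records with no DP entry; a real implementation would also want to merge adjacent surviving reference blocks.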

I used the following command to get the coverage and I am getting only one output file, which gives coverage per sample. I am providing a gene list in RefSeq format along with a sorted interval BED file. However, I am not getting any file with per-gene coverage.

I tried to use DepthOfCoverage to figure out the coverage of 50 whole-exome sequencing samples. It works fine for a single sample; however, the program always complains about memory issues, even when I provide 100 GB via "-Xmx100g". Any suggestions for this problem? There is an option "--read_buffer_size", which seems to be helpful; how should I set its value?

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html
states (bold is my emphasis):

Parallelism options

This tool can be run in multi-threaded mode using this option.

TreeReducible (-nt)

Yet when I run the GATK with -nt, I get an error that -nt is not supported (truncated to save space):

ERROR A USER ERROR has occurred (version 2.8-1-g932cd3a):
ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis DepthOfCoverage aggregates results by interval. Due to a current limitation of the GATK, analyses of this type do not currently support parallel execution. Please run your analysis without the -nt option.

Error :
ERROR MESSAGE: Badly formed genome loc: Contig '1 69114 69332' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?
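This error usually means the intervals file is not in a format GATK recognizes: a BED file must be tab-delimited, so a space-delimited line like "1 69114 69332" gets read as a single malformed contig name. One workaround (an illustrative sketch, assuming the coordinates are BED-style 0-based) is to rewrite such lines in the contig:start-end interval-list format:

```python
def to_interval_list(line, zero_based=True):
    """Convert a whitespace-delimited 'contig start end' line into the
    GATK contig:start-end interval format. Set zero_based=True when the
    source coordinates are BED-style (0-based, half-open)."""
    contig, start, end = line.split()
    start = int(start) + 1 if zero_based else int(start)
    return f"{contig}:{start}-{end}"

to_interval_list("1 69114 69332")  # -> "1:69115-69332"
```

Alternatively, replacing the spaces with tabs and keeping the .bed extension should also let GATK parse the file as a proper BED.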

Coverage was not reported for 36025109 bases. Many of the skipped bases fall one after another in the genome, forming big continuous regions. I do not understand why these regions were skipped; usually, if there was no mapping, the coverage value should be reported as 0.

According to the documentation:
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html
"This tool can be run in multi-threaded mode using this option. TreeReducible (-nt)"

When I try with GATK2.5 I get this error message:
"Invalid command line: Argument nt has a bad value: The analysis DepthOfCoverage aggregates results by interval. Due to a current limitation of the GATK, analyses of this type do not currently support parallel execution. Please run your analysis without the -nt option."

Would it work if I upgraded to 2.7? You upgrade GATK more frequently than I change my underwear, so it is difficult to keep up :) Thanks!

This information is used to verify that a certain variant was covered well, not only at the position of the variant but also over the rest of the gene it is in. This is a bit hard to see in the interval_summary file because I can't see the exon or gene names in there, and I can't see it in the gene_summary file because I miss individual exon information. This means cross-referencing the BED file and interval_summary with ugly scripting, or creating a spoofed RefSeq file per exon. Is there an option to format the output so that the coverage of each exon is shown with the exon and/or gene name (from the info in the BED or RefSeq file)?

Trying to downsample in an orderly fashion in the name of experimentation, and in doing so would like to specify just one chromosome for the experiment - so I picked chromosome 17 with -L and a coverage of 30x with -dcov 30. This came up:

ERROR MESSAGE: Locus-based traversals (ie., Locus and ActiveRegion walkers) require a minimum -dcov value of 200 when downsampling to coverage. Values less than this can produce problematic downsampling artifacts while providing only insignificant improvements in memory usage in most cases.

I was hoping to poke through results from using the HaplotypeCaller with many different simulated depths of coverage for several samples. I read that one can use -dfrac instead, and that it might even be more appropriate, though I was hoping to find out what level of coverage led to what level of results, and using -dfrac feels much less specific: it appears to toss a fraction of however many reads were at a given position, rather than tossing reads over a certain coverage. Thus with -dfrac, I could say that my sample had an average of 30x for this chromosome and I tossed half, so theoretically I've simulated 15x depth of coverage...

Which approach would be more representative of reality? Using -dfrac to simulate a certain depth of coverage, or -dcov assuming I didn't have the 200 restriction?
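Conceptually, the two options behave as follows (a toy sketch, not GATK's actual implementation): -dfrac keeps each read with a fixed probability, so depth varies around the target from locus to locus, while -dcov caps depth at an exact value wherever coverage exceeds the cap:

```python
import random

def downsample_dfrac(reads, fraction, seed=0):
    """-dfrac style: keep each read independently with probability
    `fraction`; the resulting depth varies by locus."""
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < fraction]

def downsample_dcov(reads, cap):
    """-dcov style: keep at most `cap` reads at a locus; depth is exact
    wherever coverage exceeds the cap. (The real engine picks the
    retained subset at random; slicing is just for illustration.)"""
    return reads[:cap]

reads = list(range(60))            # toy locus with 60x coverage
len(downsample_dcov(reads, 30))    # exactly 30
len(downsample_dfrac(reads, 0.5))  # about 30, varies by locus and seed
```

So -dcov simulates a hard ceiling (deep loci all land at the cap), while -dfrac scales the whole depth distribution down proportionally, which is arguably closer to having sequenced fewer lanes.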

I'm a bit confused about some of the output of GATK's DepthOfCoverage script.
In the *.GATK.covdepth.sample_summary, what does the "total" column signify?
Is it the total number of base pairs in the data? My raw read data has ~37.2 billion bp, but GATK reports over 92 billion bp.
Does 'total' mean something else, or will I have to rerun the DepthOfCoverage script?
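As far as I understand the sample_summary output (worth verifying against the tool docs), "total" is the per-sample depth summed over every locus analyzed, not the raw base count of the input reads. For example, with hypothetical per-locus depths:

```python
# toy per-locus depths for one sample (hypothetical values)
depths = [10, 12, 8, 0, 30]

total = sum(depths)         # the "total" column: depth summed over all loci
mean = total / len(depths)  # the "mean" column
total, mean                 # -> (60, 12.0)
```

Since a read can overlap many loci (and loci can be restricted by -L), this total need not match the number of sequenced bases.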

Starting with a VCF file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverage should be excluded, how does one decide what counts as very high or very low? E.g., 5% on either side?

I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script, but I'm trying to do this earlier on in the GATK processes. I've tried -L 2L -L 2R -L 3L ...etc., -L 2L 2R 3L ...etc., and -L 2L, 2R, 3R ...etc., but the result is either an input error message or chromosome 2L only.

Many thanks and apologies if I've missed anything in the instructions.

Hi, I am trying to run the DepthOfCoverage walker from GATK for the first time and I am running into stumbling blocks. I am interested in getting coverage info for genes, so I downloaded the RefSeq gene list and am filtering and sorting it according to specifications. I am using the SortByRef.pl script provided on the website. However, I am getting an error saying:

Can not open temporary file 1045: Too many open files at SortByRef.pl line 68, <$INPUT> line 1021.

I also checked the soft/hard limits for filesystem and it shows that my current system's limit is the following:

I am running GATK DepthOfCoverage on a bunch of samples (sequenced using Illumina MiSeq). For one of the samples, I am getting the following error:

ERROR MESSAGE: SAM/BAM file SAMFileReader{<file.bam>} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 61; please see the GATK --help documentation for options related to this error

I found a suggestion that I should use the -fixMisencodedQuals flag in this case. However, if I do that, "the engine will simply subtract 31 from every quality score as it is read in". That will fix this error, but then all my quality scores will be incorrect.

Furthermore, the BAM file I am using was last recalibrated with GATK. Why is GATK producing files with invalid quality scores?
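For what it's worth, a score of 61 under the standard Phred+33 interpretation is the signature of qualities that were actually written with the older Illumina Phred+64 offset. Subtracting 31 moves the ASCII offset from 64 to 33 without changing the underlying scores, so the corrected values should in fact be right, not wrong. A sketch of the conversion:

```python
def phred64_to_phred33(quals):
    """Re-encode a quality string from Phred+64 (old Illumina) to the
    standard Phred+33. The scores themselves are unchanged; only the
    ASCII offset moves, which is what subtracting 31 accomplishes."""
    return "".join(chr(ord(c) - 31) for c in quals)

# 'h' is Q40 under Phred+64; it becomes 'I', which is Q40 under Phred+33
phred64_to_phred33("hhhh")  # -> "IIII"
```

A Q40 base stays a Q40 base; only its on-disk character representation changes.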

Hello, I'm trying to use DepthOfCoverage to check the coverage of my reads. I already have indexed BAM files (created with SAM-to-BAM) that have also been reordered (Reorder SAM/BAM), but they are still not recognized by DepthOfCoverage and I get this error message:

"Sequences are not currently available for the specified build"

I used "human (homo sapiens) hg19 full" for mapping but I can't select it; it only allows the b37 version.

I would like to include a RefSeq ROD file in order to get the coverage per gene using the GATK coverage tool (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html). However, it is not really clear to me how I can easily generate such a file, since I can not find the right documentation. All the links that should point to this information seem to be (incorrectly?) redirected to the main GATK homepage. For example, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_refseq_RefSeqCodec.html has a link that points to http://www.broadinstitute.org/gsa/wiki/index.php/RefSeq, which is redirected to http://www.broadinstitute.org/gatk/.
Can someone point me to the correct docs? Other resources I found also point to the same wiki, which I can't find at the moment.

Invalid command line: Argument annotation has a bad value: Class DepthOfCoverage is not found; please check that you have specified the class name correctly

I've looked at the release notes but it's not giving me a clue as to what has changed. Has the DepthOfCoverage annotation now been dropped? I've checked and I can reproduce this on the latest nightly (nightly-2013-03-11-g184e5ac)

Now that DepthOfCoverage is being retired in GATK 2.4, I decided to investigate DiagnoseTargets. I have some questions.

1) Are there plans to support output formats other than VCF? What was great about DOC is that I could easily send the output to anyone and it could be easily read. With DT, that requires additional processing.

2) DOC provided summary for intervals as well as samples. DT only does intervals. Is there a way to get per-sample info?

3) DOC output full intervals. DT only outputs the start positions. To get the end positions, an additional step is required. Can that be adjusted?

4) DOC provided coverage info for all intervals. DT only shows covered intervals, so if an interval is not covered, it will not be listed in the output. Is there a way to output all intervals?

Sorry if I sound too critical. I am a big fan of DepthOfCoverage and am disappointed to see it go.

For matched tumor and normal pairs, we easily get insertion and deletion counts from the output of Somatic Indel Detector in GATK. However, when we run multiple samples from the same patient, sometimes calls are made in one sample but not another, so we might not have the numbers for all samples for all indel events. We can get the deletion counts from Depth of Coverage in GATK, but retrieving insertions is trickier.

Do you have a suggestion for how to solve this problem in an automated (i.e., non-IGV) fashion?

Additionally, as DepthOfCoverage is being retired, what do you recommend that we use for getting SNP and deletion counts?

I am learning to use the DepthOfCoverage tool to obtain gene coverage information for a collection of bacterial contigs that were mapped with metagenomic reads. The original post introducing this tool is here: http://gatkforums.broadinstitute.org/discussion/40/depthofcoverage-v3-0-how-much-data-do-i-have#latest

I am working with bacterial genomic contigs; can you please specify what basic information is needed for a gene list (e.g., name of contig, name of gene, location of the gene in the contig, from... to..., etc.)?

ERROR MESSAGE: Reference index 85 not found in sequence dictionary.

I have already reheadered my bam file to fix a contig mismatch error, and the fasta dict file was generated automatically by gatk. Moreover, about 160 lines of output are generated, but I do not see any irregularities at the position where the code crashed. Please let me know what I can try. Thank you.