I'm looking for official GATK documentation (or a recent manuscript) that defines a general recommendation/requirement for sequencing depth to reliably call a heterozygous point mutation in a diploid organism (WGS). In this case, I'm working with humans. Specifically, suppose we're looking at position X in the genome and we want to classify (True/False) whether there was enough coverage to detect a heterozygous point mutation.

What coverage would that be? I'm curious about both (1) UnifiedGenotyper and (2) HaplotypeCaller.

I know general practice is to have 30x coverage genome-wide, but I don't see anything in GATK's documentation (i.e., Best Practices) about required coverage.

1 Answer

UPDATE: Sarah Walker (co-author on the poster) responded to my question on the GATK forum. She clarified with the following statement:

We believe the sites around 30X (and above 150X) are being filtered
due to low mapping quality (since it is whole genome so there are many
areas that are hard to map), which explains the low sensitivities in
these areas. As the error bars show, there are very few variants in
these low mapping quality areas.

She then posted the following plot, which falls much more in line with what we would all expect. It shows sensitivity for NA12878 compared to the NIST truth set.

shlee, from the GATK forums, pointed me to a poster from AGBT 2018 where the authors compared the HiSeqX and NovaSeq instruments. Figure 3 shows sensitivity for SNPs and short INDELs across multiple allelic fractions (AF) from spike-ins. I was surprised to find that SNPs where AF = 0.4 only had ~50% sensitivity at ~25x coverage, and only ~75% sensitivity at ~50x coverage.

The poster is targeted at somatic variation, but the data in Figure 3 should behave like standard diploid germline variants, because you often see a wide range of allelic fractions from germline variants. They used spike-ins for Figure 3. It's not clear to me whether they used HaplotypeCaller in Figure 3, but I don't see how they could have used Mutect2, like they did in Figures 1 & 2.

Based on this information, it appears the standard 30x genome-wide coverage is generally insufficient to confidently call many heterozygous mutations, and that ~90-100x is ideal.
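As a rough sanity check on these numbers, here is a minimal sketch of what pure binomial read sampling would predict. It is not how UnifiedGenotyper or HaplotypeCaller decide on a genotype; the function name and the `min_alt_reads=3` threshold are my own assumptions for illustration, and the model ignores sequencing error, mapping quality, and filtering.

```python
from math import comb


def prob_at_least_k_alt_reads(depth, allele_fraction=0.5, min_alt_reads=3):
    """P(X >= min_alt_reads) for X ~ Binomial(depth, allele_fraction).

    Assumption: a site counts as "detectable" if at least `min_alt_reads`
    reads carry the alternate allele. This threshold is hypothetical and
    is not a GATK rule.
    """
    # Probability of seeing fewer than min_alt_reads alternate-allele reads.
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


# Compare a few depths at the allelic fraction discussed above (AF = 0.4).
for depth in (10, 15, 25, 30, 50):
    print(depth, round(prob_at_least_k_alt_reads(depth, allele_fraction=0.4), 4))
```

Under these assumptions, sampling alone leaves very little chance of missing the alternate allele at ~25x, so sensitivity as low as ~50% at that depth would have to come from something other than read sampling, such as filtering or mapping difficulty.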

Am I missing something? Are we really missing that many heterozygous mutations in standard WGS/WES?

I'll leave this question unanswered for now to allow others to chime in on interpretation, etc. I'd appreciate additional thoughts.