
Why do joint calling rather than single-sample calling? [RETIRED]

Since the release of GATK 3.0, multisample calling has been replaced by the reference model (gVCF-based) workflow for joint analysis in our Best Practices recommendations. This new workflow provides all the benefits of the joint calling strategy detailed below, without any of the drawbacks. See this FAQ article and this method document for more details.

Overview

There are three potential strategies for calling genetic variants on multiple samples:

single sample calling: samples are analyzed individually

batch calling: samples are analyzed in separate batches, with call sets being merged in a downstream processing step

joint calling: variants are analyzed simultaneously across all samples

Our recommendation: joint calling

We recommend joint calling because it can dramatically improve consistency across batches and produce fewer artefacts, thanks to three key advantages:

1. Clearer distinction between homozygous reference sites and sites with missing data

Batch calling does not output a genotype call at sites where no member of the batch has evidence for a variant; it is thus impossible to distinguish such sites from sites with missing data. In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.

2. Greater sensitivity for low-frequency variants

By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location.

3. Greater ability to filter out false positives

The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.
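The "rescue" effect described under the second advantage can be illustrated with a toy model. This is only a sketch under strong assumptions, not the model GATK actually uses: reads are treated as binomial draws, the sequencing error rate (0.01) and the calling threshold (20) are hypothetical values, and evidence is pooled by simply summing each sample's Phred-scaled support for a heterozygous genotype over homozygous reference.

```python
import math

def log10_binomial(alt_reads, depth, p_alt):
    """log10 probability of seeing alt_reads ALT reads among depth reads."""
    return (math.log10(math.comb(depth, alt_reads))
            + alt_reads * math.log10(p_alt)
            + (depth - alt_reads) * math.log10(1.0 - p_alt))

def het_vs_ref_evidence(alt_reads, depth, error_rate=0.01):
    """Phred-scaled support for a het genotype over hom-ref in one sample."""
    het = log10_binomial(alt_reads, depth, 0.5)
    ref = log10_binomial(alt_reads, depth, error_rate)
    return 10.0 * (het - ref)

def site_evidence(samples, error_rate=0.01):
    """Pooled support for 'some sample carries the variant'.

    Samples that look hom-ref contribute nothing (their best genotype is
    hom-ref), so only positive per-sample support is accumulated.
    """
    return sum(max(0.0, het_vs_ref_evidence(alt, depth, error_rate))
               for alt, depth in samples)

# Hypothetical cohort as (ALT reads, depth) per sample: two poorly covered
# carriers and two well-covered hom-ref samples.
cohort = [(1, 2), (1, 2), (0, 30), (0, 25)]
CALL_THRESHOLD = 20.0  # hypothetical Phred-scaled cutoff

singles = [het_vs_ref_evidence(alt, depth) for alt, depth in cohort]
joint = site_evidence(cohort)

for (alt, depth), score in zip(cohort, singles):
    print(f"alt={alt} depth={depth} single-sample support={score:.1f}")
print(f"pooled support={joint:.1f} threshold={CALL_THRESHOLD}")
```

In this toy cohort neither carrier clears the threshold alone, but their pooled support does. That is the same situation as the two low-coverage carriers in Figure 1.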

Figure 1: (left) Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when the samples are processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling to square off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.
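The "squaring off" idea in the right-hand panel can be sketched in a few lines. The sample names, site name, depths and the `min_depth` cutoff below are all hypothetical, and a real joint caller decides between homozygous reference and missing from genotype likelihoods rather than a simple depth threshold. The only conventions borrowed from VCF are `0/0` for homozygous reference and `./.` for a missing genotype.

```python
# Hypothetical variants-only results: a sample has a record at a site only
# if it was called non-reference there.
variants_only = {
    "sample1": {"siteA": "0/1"},
    "sample2": {},  # well covered, homozygous reference: no record
    "sample3": {},  # no coverage at all: also no record
}

# Hypothetical read depth per sample and site.
depth = {
    "sample1": {"siteA": 35},
    "sample2": {"siteA": 40},
    "sample3": {"siteA": 0},
}

def merge_variants_only(calls, sites):
    """Naive merge of single-sample outputs: absent records are unknown ('?')."""
    return {s: {site: calls[s].get(site, "?") for site in sites} for s in calls}

def squared_off(calls, depth, sites, min_depth=10):
    """Fill every cell: 0/0 where there is enough data, ./. where there is not."""
    matrix = {}
    for sample in calls:
        matrix[sample] = {}
        for site in sites:
            if site in calls[sample]:
                matrix[sample][site] = calls[sample][site]
            elif depth[sample].get(site, 0) >= min_depth:
                matrix[sample][site] = "0/0"   # evidence for the reference
            else:
                matrix[sample][site] = "./."   # genuinely missing data
    return matrix

sites = ["siteA"]
merged = merge_variants_only(variants_only, sites)
squared = squared_off(variants_only, depth, sites)
print(merged)
print(squared)
```

After the naive merge, sample2 and sample3 are indistinguishable. In the squared-off matrix one is a reference call and the other is flagged as missing.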

Some numbers and lessons from a large-scale joint calling project

We recently participated in a large-scale project in which we applied joint calling approaches to raw sequencing data from approximately 57,000 human exomes representing a wide range of human population diversity. This was done in collaboration with other groups studying the genetic basis of complex and Mendelian diseases.

We performed two pilot studies as part of this project, one focused on sample QC and one focused specifically on evaluating the joint calling approach. In the latter pilot, we performed complete joint calling across chromosomes 11, 20, 21 and 22, which together represent approximately 11.4% of the human exome. In a nutshell, we found that large-scale joint calling results in greater sensitivity to low-frequency variants, an increased ability to remove systematic false positives such as mapping errors, and greater consistency of variant calls across projects (Figure 2).

So, should you call your samples jointly? Yes! But there are a few issues you should be aware of before you start.

Outstanding issues with joint calling

- Scaling & infrastructure

Most of the problems we experienced in our joint calling experiments were scaling problems: we managed to do joint analysis on 50K+ exomes, but that was already pushing the bounds of what our fairly heavyweight infrastructure can support. Anyone with less hardware is going to struggle to reach those numbers, not to mention the logistical headache of managing access to the data if it originates from multiple separate projects. But this only really applies to people who are dealing with seriously large projects involving tens of thousands of samples.

- The “N+1” problem

This one is probably more widely applicable. When you’re getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (whether it’s because the analysis is time-sensitive or your PI is breathing down your neck). But that’s not joint calling, that’s batch calling, and it doesn’t give you the same significant gains that joint calling can give you. Unfortunately the current joint calling approaches don’t allow for incremental analysis: every time you get even one new sample sequenced, you have to re-call all samples from scratch.
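To get a feel for why re-calling from scratch hurts, here is a back-of-the-envelope sketch. It counts "sample-analyses" as a crude proxy for compute cost, which is an assumption (real cost does not scale exactly linearly with cohort size), and the batch sizes are made up for illustration.

```python
def cost_batch_calling(batch_sizes):
    """Each batch is called once, on its own."""
    return sum(batch_sizes)

def cost_recalling_from_scratch(batch_sizes):
    """Every new batch triggers a joint call over all samples seen so far."""
    total = 0
    cohort_size = 0
    for size in batch_sizes:
        cohort_size += size
        total += cohort_size
    return total

# Hypothetical project: 10 batches of 100 samples arriving over time.
batches = [100] * 10
batch_cost = cost_batch_calling(batches)
joint_cost = cost_recalling_from_scratch(batches)
print(f"batch calling: {batch_cost} sample-analyses")
print(f"joint calling, re-run per batch: {joint_cost} sample-analyses")
```

With k equal batches of size b, the re-calling cost is b * k * (k + 1) / 2, so it grows quadratically with the number of batches while batch calling grows only linearly.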

The good news is that we’re working on a new joint calling procedure that will address these issues (Figure 3), so stay tuned for updates!

Here the "low-frequency variants" refers to sites of low coverage, and does NOT mean "rare variants", right?

Yes and no; this bit does refer primarily to the ability of joint calling to overcome the issues associated with low coverage of sites in general, but it is also key to enabling discovery of rare variants. If you're looking for a rare variant, you'll find it well enough in samples where the site is well covered, regardless of calling mode (single or joint), but in samples where the site has low coverage, you have very little chance of fishing it out with any useable confidence in single calling mode. In contrast, with joint calling, you have a better chance of calling the variant reasonably confidently if there is another sample in the cohort that has it. So it's true that you may still miss poorly covered singletons (because this is not a miracle-making method), but you would struggle to find those in single sample calling mode anyway. The solution is to call large cohorts at a time, to minimize the chance that rare variants will end up as singletons, as those are a pain to call in any case.

(Edit for second question, which I initially forgot to answer)

And re: Greater ability to filter out false positives (FP)

Does it mean fewer FP? And if so, does it mean lower Non-Reference Discrepancy (NRD)?

Yes, it leads to fewer false positives. The effect on NRD depends on how appropriate your truth set is; e.g. if your truth set has more FPs than you think, having fewer FPs in your eval callset will actually lead to a larger discrepancy.
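A small worked example of that last point, as a sketch. It assumes the usual definition of NRD as the fraction of discordant genotypes among all compared sites, excluding sites where both callsets are homozygous reference. The genotypes are invented, and the fourth site is imagined to be a false positive that made its way into the truth set.

```python
def non_reference_discrepancy(eval_gts, truth_gts, hom_ref="0/0"):
    """Discordant genotypes / all sites that are not hom-ref in both callsets."""
    discordant = 0
    concordant_nonref = 0
    for eval_gt, truth_gt in zip(eval_gts, truth_gts):
        if eval_gt == hom_ref and truth_gt == hom_ref:
            continue  # concordant reference calls are excluded from NRD
        if eval_gt == truth_gt:
            concordant_nonref += 1
        else:
            discordant += 1
    considered = discordant + concordant_nonref
    return discordant / considered if considered else 0.0

# Hypothetical genotypes for one sample at five sites.
truth        = ["0/1", "0/1", "1/1", "0/1", "0/0"]  # 4th site: FP in the truth set
eval_noisy   = ["0/1", "0/1", "1/1", "0/1", "0/0"]  # shares the same FP
eval_cleaner = ["0/1", "0/1", "1/1", "0/0", "0/0"]  # FP correctly removed

nrd_noisy = non_reference_discrepancy(eval_noisy, truth)
nrd_cleaner = non_reference_discrepancy(eval_cleaner, truth)
print(f"NRD with the FP kept:    {nrd_noisy:.2f}")
print(f"NRD with the FP removed: {nrd_cleaner:.2f}")
```

Here the cleaner callset scores a higher NRD even though it is the more accurate one, purely because the truth set contains the artefact.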

No, in this context singleton means a variant that is found in only one sample of a cohort. Basically you want to have a cohort with many individuals to increase the chance that two or more individuals will carry the variant of interest. Because if it's only found in one person and coverage at that site is low, it's going to be needle: meet haystack whichever way you cut it.

Are the stats in the INFO field of VCF file comparable among individual sample callings or batch callings? My thinking is that they are not comparable due to the different VQSR in the different callings.
Thanks (still on vacation)

It depends which stats you mean, but generally speaking we don't recommend comparing annotation values directly between callsets that were generated separately, because a lot of them are relative, not absolute.