
Any ploidy goes!

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.
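For reference, a minimal invocation might look like the following (GATK 3.x command-line syntax; the reference and BAM file names are placeholders):

```shell
# Call variants on a haploid sample with the nightly build.
# -ploidy is an alias for --sample_ploidy; reference.fasta and
# sample.bam are placeholder file names.
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -ploidy 1 \
    -o sample.vcf
```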

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

When running the tools, you need to indicate with -ploidy the sample ploidy that was used to generate the GVCFs. As usual, the default ploidy is 2.

The system does not support mixed ploidy, either across samples or across positions. An error will be thrown if you attempt to genotype GVCFs that contain a mixture, or that have any genotype whose ploidy does not match the -ploidy argument.
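For example, joint genotyping of GVCFs that were all generated with ploidy 1 might look like this (file names are placeholders):

```shell
# Joint-genotype GVCFs that were all produced with -ploidy 1.
# All input GVCFs must share the same ploidy at every position.
java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    --variant sample1.g.vcf \
    --variant sample2.g.vcf \
    -ploidy 1 \
    -o joint.vcf
```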

LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.

The terminal printout for UG shows the number of processed active regions, while HC does not. I see this with a single BAM input and ploidy set to 1, and with two BAM inputs and ploidy set to 2. Disclaimer: I have never used HaplotypeCaller before, as I work solely on a haploid organism, and I have never run HC on diploid data. Apologies if this is the intended behavior, but I thought I should ask about it.

The difference in the ProgressMeter behavior that you're seeing is expected. It's due to fundamental differences in how UG and HC traverse the data. UG traverses the genome one individual position at a time, so it's easier to track its progress, and it goes through each unit of data much faster than HC. In contrast, HC traverses the data by ActiveRegions, which are stretches of genome that can go up to several hundred bases (or more if you set the max higher). Each unit takes much longer to process since it contains more data, and so you may not see the count go up for a while (although it should move eventually, unless something is broken that I'm not aware of).

That said, I realize I wasn't clear about the usage of the ploidy argument. If you have 7 samples that are clearly separated into distinct read groups (as opposed to pooled samples, which are not individually identified), you need to set the ploidy value to the actual ploidy of a single sample, not the aggregate ploidy over all samples. So here you would set the ploidy to 1 for your haploid samples; otherwise you're asking HC to consider each sample a heptaploid. This should also speed up the process, as HC will have considerably less work to do.
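As a sketch of the setup described above (sample and reference file names are placeholders), the seven haploid samples would all be passed in together, but -ploidy describes a single sample and so stays at 1:

```shell
# Seven haploid samples, each with its own read group / SM tag.
# -ploidy describes ONE sample, not the pool, so it is 1 here.
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam -I sample2.bam -I sample3.bam \
    -I sample4.bam -I sample5.bam -I sample6.bam -I sample7.bam \
    -ploidy 1 \
    -o all_samples.vcf
```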

I might also recommend doing a single-sample to single-sample comparison first before diving into multi-sample comparisons, but that's up to you.

Thank you for your quick response, @Geraldine_VdAuwera. To clarify, if I have merged those same 7 BAM files I should still set ploidy to 1 if they have distinct read groups? And this recommendation applies to both HC and UG? Thank you.

As long as the samples have distinct read groups (specifically, distinct SM tags) it makes no difference whether they are in the same file or in separate files. The GATK engine merges all the data when it reads in the file contents, and distinguishes samples by SM tags. This does apply to both UG and HC, yes.
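One quick way to confirm that a merged BAM carries distinct SM tags (assuming samtools is available; merged.bam is a placeholder name) is to inspect its @RG header lines:

```shell
# List the read groups in the merged file; each sample should have
# its own @RG line with a distinct SM: tag.
samtools view -H merged.bam | grep '^@RG'
```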

I have a question about specifying ploidy for population sequence data. I am conducting a population sequencing study on a vector-borne bacteria, so each genomic library I have is the bacterial population infecting a single vector, but constitutes a pool of an unknown number (likely 1-7) different bacterial clones. I am planning to use Haplotype Caller to call variants but am wondering how to deal with ploidy when I have an unknown ploidy (or number of coinfecting bacterial strains) present within each genomic library. Should I set the ploidy to 7 (the maximum we likely will see in nature) for all samples and call them all together?

@Katie There is no absolute answer to your question; it's going to be a tradeoff between sensitivity (aim to detect every variant, even if present only as a minority fraction) and specificity (aim to reduce false positives resulting from sequencing noise/artifacts). If you want to maximize sensitivity, you'll need to set the ploidy to the highest likely number of clones in your population.

One more question about setting ploidy for a haploid sample with an unknown number of clones present at unknown proportions (i.e. they are unlikely to be present at equal proportions). If I set ploidy to 10 because that's the highest likely number of clones in my population, does the variant-calling algorithm assume that each clone should be present at 10%? In other words, am I then limiting my sensitivity to variants found in > 10% of the total population? Or does the ploidy argument instead limit the number of unique alleles at a locus? HaplotypeCaller is very slow with high ploidy, but I would like to maintain high sensitivity to minority variants.
Thanks for your help!

Thank you for your help on this. It seems like variant calling may be a two-step process:
(1) to identify variants informative of differentiation between pooled samples, set PLOIDY=1 (call the consensus allele at each locus) and run HaplotypeCaller on all BAM files, which may or may not contain mixed samples.
(2) to identify within-host polymorphisms, set PLOIDY=100 and run HaplotypeCaller on each BAM file individually. (Calling variants on all samples at once with high PLOIDY is computationally infeasible.)

In this case, can I leave the HAPLOTYPE argument at its default value, because this will not have as strong an effect on sensitivity to minority variants?

Using ploidy 1 seems a bit radical for the first step -- this will favor sites where your population is very homogeneous. Sites with lots of heterogeneity may still be called, but with low confidence, so you'd have to make sure to lower the emit thresholds. Or use the GVCF mode of HC, since it sounds like what you'll be looking for are sites with a low probability of being hom-ref, and potentially multiple alternate alleles. You can then select these out to serve as a list of interesting sites.

Assuming you plan to run step 2 on just the sites called in step 1, note that you can do this by passing in your vcf using -L, and add padding (optional but recommended) using -ip 50.
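Concretely, that might look something like this (assuming the step-1 calls are in step1.vcf; file names are placeholders):

```shell
# Restrict the second pass to the sites called in step 1,
# padded by 50 bp on each side via -ip (--interval_padding).
# The ploidy here follows the plan above; a lower value may be
# needed for performance (see the caveat below).
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -ploidy 100 \
    -L step1.vcf \
    -ip 50 \
    -o sample_minority.vcf
```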

But ploidy 100 may prove challenging in terms of performance. The number of calculations needed to evaluate all possible combinations is astronomical. In the future this may become more feasible if we can triage which combinations need to be calculated, but the current model does not do that. I'm afraid your job might run forever... You may have to compromise and use a lower ploidy even for that second step.

Thank you. Using the GVCF mode makes a lot more sense, so that I can separate single-sample variant calling from multi-sample variant discovery and identify different subsets of variants that may inform (1) between-sample variation and (2) within-host variation.

However, I ran into trouble using the GVCF mode of HC using the following input:

One more ploidy question: the GenotypeGVCFs tool returns no output when I test it on a list of gVCFs called with a ploidy of 30. I have tested several iterations and get no error messages, but the output file only includes the VCF header. The gVCF files look reasonable.

Thank you. Yes, the issue is that if I run HaplotypeCaller with PLOIDY=30, all GQ scores are extremely low, due to low confidence in calling 30 alleles. When I set a lower ploidy, GQ scores increase (slightly) and it is possible to use the GenotypeGVCFs tool. Even though GQ scores are low, this will help me identify sites that show evidence of within-sample polymorphism.
Thanks for the help!

Hi, I have a question related to ploidy in UnifiedGenotyper. I am using GATK version 3.8.1 and working with P. falciparum.
I have a file with samples that were pooled prior to sequencing (so they are not individually identified). The number of samples is known, but the number of parasites per sample is unknown. Also, the parasite has stages where it is haploid and others where it is diploid. Which ploidy should I set when calling variants?
Thanks!

@Rebs Based on what you describe there is no "correct" setting for the ploidy. One possible approach is to choose what is the minimum allele fraction you aim to be able to detect, and set ploidy accordingly.

Alternatively you could try using the Mutect2 somatic caller (in tumor-only mode) which does not make ploidy assumptions.

What do you mean by allele fraction?
Also, I performed variant calling with different ploidies, and when I set it to 1, it detects fewer SNPs...

Regarding Mutect2, I would prefer to stick to UnifiedGenotyper. It was recommended by some colleagues of mine who also work with the same organism (FYI, they used ploidy 2, but their conditions were different from mine).

Imagine you have a pool of 10 diploid individuals and one of them has a heterozygous SNP. Assuming perfect sequence quality, what you expect to see in the data is that the variant allele will be represented by one out of every 20 reads -- an allele fraction of 0.05. Considering the regular occurrence of technical artifacts, we're not going to want to trust an allele that's represented by very few reads. So let's say we're arbitrarily setting that limit at 5 (there are better ways to do this; this is a strawman) -- that means you'll need to have ~100x coverage at this site to make a confident call.

Now in your case you already have the data so you have to work backward from however much coverage you do have. Let's say it's ~50x coverage -- that means you can only really expect to make confident calls for variants with an observed allele fraction of 0.1, ie one hom-alt in 10 diploid samples or one heterozygote in 5 diploid samples.
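The arithmetic above can be checked with a few lines of shell (the numbers are the ones from the worked example, not from any real dataset):

```shell
# Pool of 10 diploid individuals = 20 chromosomes; one heterozygous
# carrier contributes 1 variant allele, so the expected allele
# fraction is 1/20.
pool_chromosomes=20
variant_chromosomes=1
min_alt_reads=5   # arbitrary confidence floor from the example

allele_fraction=$(awk "BEGIN {print $variant_chromosomes / $pool_chromosomes}")

# Coverage needed so the alt allele is seen ~min_alt_reads times:
required_coverage=$(awk "BEGIN {print $min_alt_reads / ($variant_chromosomes / $pool_chromosomes)}")

echo "allele fraction:   $allele_fraction"      # 0.05
echo "required coverage: ${required_coverage}x" # 100x
```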

Does that make sense?

Re: what you see with ploidy 1, that's expected. You're throwing out anything that doesn't look like a hom-alt call.