GATK Showcase in Terra

Check out these fully configured workspaces to test drive the Best Practices pipelines and workshop tutorials with zero installation required!

Get email notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements on the blog, by following the instructions given here.

Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. When reporting a problem, include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.

I found that GenotypeGVCFs in GVCF mode can lead to an unexpected homozygous or heterozygous genotypes when one SNP is called within an indel.

In the following example I have the individual1 showing only an indel at position 183095185. The second individual shows also an indel at position 1830095185 but also one SNP at the position 183095187 (even if this individual seems to be homozygous because only 2 reads support the laternate allele).
Executing GenotypeGVCFs on those 2 individuals lead to an erroneous calling with weird allele depths for the individual 1 for the SNP at the position 183095187 : 1/1 with an AD of 1,0 (the second individual remains homozygous for the reference allele at the SNP).

I have found that this error occurs many time and, each time, it seems that a SNP and an indel overlap in 2 different individuals. Sometimes the resulting error calling is homozygous for the alternate allele (1/1), sometimes heterozygous but always with almost no allele depths for alternate allele. When I look at bam resulting file from HaplotypeCaller, there has been some reassembly but there's no read supporting an alternate allele for the first individual at the SNP position (in my example, at position 183095185).

I see that there's an issue still opened on github for this problem (last updated on March, the 11th) but since the gihutb link lead to a 404 error page, I don't have any clue on the current state of this bug correction.

Hi @bgrenier, sorry for the lack of transparency. The 404 error is because the code repo is private to the dev team. I plan to set up a public view of the bugs tracked from the forum to resolve this, just haven't got around to it yet.

As for your bug, I'm afraid it's been lingering in the backlog due to lack of developer time. Its current status is that it needs to be retested following some developments that may have resolved or clarified the problem. I'll ask @Sheila if she might be able to do the retest in the next few days so we can give you a better update.

Was there any resolution to this bug? I'm seeing a similar result in a similar situation, but it seems to be dependant on whether I merge GVCFs before genotyping or just genotyping all the samples from seperate GVCFs. What I'm most interested is the what happens to Sample2 at position 25268503.

genotype for Sample2 at position 25268503. Where is GenotypeGVCF getting a depth of 12 from? Everything it had up to that point for this samples says there's a deletion at that position and even the merge command gave it ./. for a GT. When I run GenotypeGVCF on all 3 GVCFs without any merging, I get a VCF like this:

With a ./. GT and a depth of 0 for Sample2...which is the behavior I would expect. This is from a test case where I want to be able to dynamically decide if there are too many samples to just genotype and instead insert a merge step before genotyping. What I was expecting was to get the same answer merging then genotyping as when I only genoytped. I do get some other differences in my test case, all slightly different MQ values, which don't affect the depths or genotypes, and I can live with that. But I often filter variants differently on ./. vs 0/0 GT's. Any further assistance on this would be very helpful. This was all using 3.3-0-g37228af. Thanks.

Unfortunately, this has not been prioritized. We do have a bug fix for indel representation coming up in the next release. I'm not sure if this fix will fix your issue, however. I will talk with the team at our meeting today and see if we can get this moved up in priority.

Has this bug been addressed yet? I think I am having a similar issue. GenotypeGVCFs is calling some sites as heterozygous when I can't see any evidence in the bam file. This is on a targeted sequencing project, I followed the best practices workflow with the exception of duplicate removal. I did not remove duplicates because many reads are expected to be duplicates on the platform I am analyzing. I used GATK 3.4.

Hi,
I think I'm experiencing a similar problem as the ones discussed in this thread. I have performed targeted sequencing on a large cohort of patients/controls using Illumina. I followed the best practices workflow using GATK 3.2.2. After running haplotypeCaller (without --dontUseSoftClippedBases) I combined the g.vcf's into "groups" (approximately 120 gvcfs/group) using CombineGVCFs; the groups were subsequently genotyped using GenotypeGVCFs. By chance, I discovered that new variants/alternative alleles have appeared for some samples in the genotyped vcf, not present in the bam files or gvcfs. Moreover, several of these new variants were only present in samples within a specific group. As suggested above, I re-ran genotypeGVCFs using GATK 3.4-46; this removed a couple of the false positive variants (indels) but not the majority. I could see that several of the false positive SNPs were located in repetitive regions, but that doesn't explain why they appear in certain groups. If I run samples from one group as separate gvcf's when genotyping, the variant disappears from that group.
I was wondering how new heterozygous and homozygous variants can appear after genotyping, when they are not present before? why/how can they be specific to the combined groups, I didn't think the sub-grouping procedure would affect the results? Is the only way to deal with this to run haplotypeCaller with --dontUseSoftClippedBases or should I change the grouping somehow?
Below is part of the code and some output:

The GVCF contains the PLs for the reference blocks. If there is some evidence for a variant at a site, but not enough evidence to call a variant, that is reflected in the PL of the block. Can you please post a VCF record for one of the sites you are referring to above? Also, please post the corresponding GVCF records for the samples that are called variant in the VCF but were not called variant in the GVCF.

Thank you for you quick response!
I'm not sure I understand correctly, but one example sample has this gvcf data for position chr4:129961069:
chr4 129961060 . C . . END=129961178 GT:DP:GQ:MIN_DP:PL 0/0:64:99:57:0,118,1800

Whereas the vcf data for the same position and individual is:
G/G [DP=48;PL=131,20,0;GQ=20;AD=0,0]

Although the gvcf data is for a block of reference calls, the PL shows very low evidence for an alternative call (see attached pic from IGV). In the vcf, on the other hand, there is low evidence for a reference call.
The alternative allele for this, and many other positions, only appear in samples that were grouped together using CombineGVCFS. If I run the samples included in this particular group as separate gvcf's when I run GenotypeGVCFs the alternative alleles disappear.
I found another discussion concerning the same problem at the GATK forum, but in that case the problem was solved using GATK 3.2.2, but I am already using that version.