
I expect to see a variant at a specific site, but it's not getting called

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should), you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on top and the bam file after reassembly on the bottom. In the original, there seem to be many SNPs present; however, after reassembly, we find there is really a large deletion!

2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.
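
To make the distinction concrete, here is a toy Python sketch (made-up pileup and threshold, not GATK code) of how the reported depth can exceed the depth the caller actually uses:

```python
# Toy illustration of unfiltered vs usable depth.
# The threshold and pileup are made-up numbers, not GATK defaults.
MIN_BASE_QUALITY = 10  # hypothetical caller threshold

pileup = [  # (base, base_quality) observed at the site of interest
    ("T", 30), ("T", 28), ("C", 8), ("C", 9), ("C", 7), ("T", 25),
]

unfiltered_dp = len(pileup)                       # what the DP field reports
usable = [(b, q) for b, q in pileup if q >= MIN_BASE_QUALITY]
alt_support = [b for b, q in usable if b == "C"]  # high-quality alt evidence
print(unfiltered_dp, len(usable), len(alt_support))  # 6 3 0
```

Here DP says 6, but only 3 bases pass the quality filter, and none of them support the expected C allele.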

3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!
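
A minimal sketch of the capping idea (a toy model, not the actual GATK implementation):

```python
def effective_quality(base_quality, mapping_quality):
    """Toy model of the capping described above: evidence from a base
    can never be stronger than confidence in the read's placement."""
    return min(base_quality, mapping_quality)

print(effective_quality(35, 60))  # well-mapped read: base quality stands, 35
print(effective_quality(35, 5))   # poorly mapped read: capped down to 5
```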

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This limit can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page for the default value). Note however that genotyping sites with many alternate alleles increases the computational cost of processing, scaling exponentially with the number of alternate alleles, so it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.
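
To see why the cost grows so quickly, consider how many genotypes must be evaluated at a site; this toy calculation assumes a diploid sample:

```python
from math import comb

def num_genotypes(num_alleles, ploidy=2):
    # Multiset coefficient: number of unordered genotypes made of
    # `ploidy` alleles drawn from `num_alleles` possibilities
    return comb(num_alleles + ploidy - 1, ploidy)

# reference + N alternate alleles, diploid sample
for alts in (1, 2, 6, 10):
    print(alts, num_genotypes(alts + 1))  # 1->3, 2->6, 6->28, 10->66
```

And the growth compounds with higher ploidy and with the number of candidate haplotypes that must be assembled and scored.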

5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

6. Check for platform-specific artifacts

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias, and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site, you are probably seeing false positives.

7. Try fiddling with graph arguments (ADVANCED)

This is highly experimental, but if all else fails, worth a shot (with HaplotypeCaller and MuTect2).

Fiddle with kmers

In some difficult sequence contexts (e.g. repeat regions), when some default-sized kmers are non-unique, cycles get generated in the graph. By default the program increases the kmer size automatically to try again, but after several attempts it will eventually quit trying and fail to call the expected variant (typically because the variant gets pruned out of the read-threading assembly graph, and is therefore never assembled into a candidate haplotype). We've seen cases where it's still possible to force a resolution using --allowNonUniqueKmersInRef and/or increasing --kmerSize (or the range of permitted sizes: 10, 25, 35, for example).

Note: While --allowNonUniqueKmersInRef allows missed calls to be made in repeat regions, it should not be used in all regions as it may increase false positives. We have plans to improve variant calling in repeat regions, but for now please try this flag if you notice calls being missed in repeat regions.
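
A toy illustration of why repeat regions cause trouble for fixed-size kmers (made-up sequence, not GATK code):

```python
def nonunique_kmers(seq, k):
    # Return kmers occurring more than once; repeats like these are
    # what create cycles in the read-threading assembly graph
    seen, dupes = set(), set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        (dupes if kmer in seen else seen).add(kmer)
    return dupes

repeat_ref = "ACGTACGTACGT"                      # toy repeat region
print(sorted(nonunique_kmers(repeat_ref, 4)))   # every 4-mer repeats
print(sorted(nonunique_kmers(repeat_ref, 10)))  # larger k restores uniqueness
```

This is the same logic behind the automatic kmer-size increase: once k exceeds the repeat period, the kmers become unique and the cycle disappears.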

Fiddle with pruning

Decreasing the value of -minPruning and/or -minDanglingBranchLength (i.e. lowering the amount of evidence necessary to keep a path in the graph) can recover variants, at the risk of taking on more false positives.
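
A rough sketch of what pruning does (hypothetical path names and read counts, not the real graph machinery):

```python
def prune_paths(path_support, min_support):
    # Keep only assembly-graph paths supported by >= min_support reads
    return {p: n for p, n in path_support.items() if n >= min_support}

paths = {"ref_haplotype": 40, "alt_haplotype": 2}  # made-up read counts
print(prune_paths(paths, 3))  # alt pruned out: the variant is never assembled
print(prune_paths(paths, 1))  # lower threshold keeps it (and more noise)
```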

Comments

I have a question however: if I view a BAM file with recalibrated base qualities in IGV - does IGV then report the recalibrated score or the score before recalibration? That is, the "Base phred quality" score that shows in the info box that pops up when you hover over a locus/read.

IGV will report the recalibrated qualities, because they are written in place of the originals. It is possible to still see the original qualities in the read information; if you set the appropriate flag to keep the original quals when you did the second step of recalibration, they will be written to the OQ tag.

Suppose someone is doing a mutation rate analysis, say in yeast (e.g. exposing the yeast to radiation) to measure the mutation rate. Different mutations in different locations might happen in each yeast cell. Let's say I want to call all of those variants even if they are seen in only one read (but with good mapping quality and good phred score, as one can see in IGV by mousing over).

As far as I know, UG and HC are designed to be very sensitive so as not to miss anything, and they don't just look at the base quality score, MAPQ and DP to decide how likely a variant is to be real; they also consider things like HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum, etc.

I know that we shouldn't trust everything we see in IGV. My question is: is it possible that UG misses calling some of those mutations because they are seen in only one read? Since the coverage in those regions is very high, might UG ignore some of them because they are only present in one read (or some other scenario like that)?

Would you please elaborate on the sensitivity of UG? Because no matter how many times you tell people not to trust everything they see in IGV, they still say "I can see it right there"!

I'm not sure that "sensitivity" is the right word for what you're looking for. I suppose it is in the strictest sense, but in practical terms I think it's something else. You don't ever mention what kind of read depths you're using, but with any significant depth the probability of seeing an "alternate allele" in one read at any given locus approaches 1. The Illumina sequencers have been getting better and better, but they've still got a pretty significant error rate.
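
That intuition is easy to check with a toy calculation, assuming independent errors and an assumed round per-base error rate:

```python
def p_at_least_one_error_read(depth, per_base_error=0.001):
    # P(>= 1 read carries a sequencing error at a given site) under a
    # toy independent-error model; the error rate is an assumed figure
    return 1 - (1 - per_base_error) ** depth

print(round(p_at_least_one_error_read(30), 3))    # modest depth: unlikely
print(round(p_at_least_one_error_read(1000), 3))  # deep data: near-certain
```

At thousands of reads, essentially every site will show at least one "alternate" base by chance alone.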

UG and HC (and most other variant callers) help to mitigate that error rate by using the expected ploidy of the library in question. So if we're using human data (which is what the GATK was initially developed for), we would expect to see the reads supporting putative variations to be near 50 or 100% of the reads at that site. If the read support varies significantly from those levels, it's less likely to be a real variation. It sounds like in your experiment, you have a whole bunch of yeast cells and you're looking for evidence of variation in any of them - this is a completely different analysis.

If you haven't already, you may want to look at the ploidy parameter of UG. It won't resolve your genotypes to the cell level, but I would think you could, e.g., try to find variants with 5% penetrance by setting the ploidy to 20.

Caveat emptor: I've done exactly one UG run with a non-default ploidy - I used 8. It's entirely possible that 20 is way too high and that you'll be swallowed by our dying Sun before the analysis finishes. Or maybe it will finish in two hours. I have no idea. But I think it's your best hope for doing this type of analysis in GATK
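
The connection between ploidy and detectable allele fraction can be sketched like this (a deliberate simplification of the genotype model):

```python
def min_modeled_allele_fraction(ploidy):
    # With -ploidy N, the smallest non-zero allele fraction the genotype
    # model represents is one chromosome copy out of N
    return 1 / ploidy

print(min_modeled_allele_fraction(2))   # diploid: 0.5
print(min_modeled_allele_fraction(20))  # pooled model: 0.05, ~5% "penetrance"
```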

Let me ask it this way: I have some mitochondrial sequence with 10,000x coverage (10,000 mean depth at any given position), and of course -ploidy is 1.

In my analysis, any high-quality variant should be called even if it is seen in one read among thousands, as long as it has high MAPQ and base quality! Is it possible that UG would miss calling it because it is only in one read? Does UG ignore it because all the other reads disagree with that one read, even though everything in that read is of high quality?

From what you said, I understand that UG would call such variants only if I increase the ploidy number. Right?

As is always the case in bioinformatics, there's uncertainty here. Mitochondria are subject to some degree of heteroplasmy, for instance. But the error rate of the sequencer itself is much, much higher than one out of ten thousand - to say nothing of the errors in library prep (PCR polymerases aren't perfect) and alignment. To me, there's absolutely no way you should take a single read out of many (think dozens, not thousands) as significant evidence of variation
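
The arithmetic behind that point, using an assumed round error rate:

```python
depth = 10_000
per_base_error = 0.001          # assumed Illumina-like per-base error rate
expected_error_reads = depth * per_base_error
print(expected_error_reads)     # ~10 artifact reads per site from errors alone
```

So at 10,000x, a lone mismatching read is far more likely to be one of those ~10 expected errors than a real 1-in-10,000 event.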

So the answer is that UG would not call it, due to the pile of evidence (thousands of reads), even though that one read is of very high quality, right? Because of the sequencing machine's error rate, the algorithm would not trust that variant, right?

Right. One high-quality read is not enough to outweigh the evidence of many mixed-quality reads. The definition of "many" is left as an exercise for the reader, but it ultimately depends on ploidy and error rates.

In fact, I would never trust such a variant even if it is of high quality. But I just want to show this post to the people who ask again and again, saying "I can see it right there". Thanks a lot, I think you closed this thread quite well.

We are comparing variants called using GATK HaplotypeCaller and MiSeq Reporter. It's an amplicon-based Illumina TruSeq targeted gene panel. Strangely, the PL is 0 for all the variants that match the MiSeq calls (vcf from one sample, no VQSR done):

Also, the total reads should be over 300k, but the DP here is very small. I did downsampling to 3000 and marked the duplicates but didn't remove them. Any clue why all this happened? How do I calculate the variant frequency here? Thanks!

PL is by definition relative to the called genotype, so will always be 0 for the called genotype.
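
A toy sketch of that normalization (illustrative numbers, not the actual likelihood model):

```python
import math

def to_pl(genotype_likelihoods):
    # Phred-scale the likelihoods, then normalize so the most likely
    # genotype gets PL = 0 (this is why the called genotype is always 0)
    raw = [-10 * math.log10(gl) for gl in genotype_likelihoods]
    best = min(raw)
    return [round(r - best) for r in raw]

# toy likelihoods for genotypes 0/0, 0/1, 1/1
print(to_pl([1e-9, 0.99, 1e-5]))  # [90, 0, 50]: called genotype 0/1 has PL 0
```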

What does the log output of HaplotypeCaller say at the end, in the read summary? I suspect you'll see an absurdly high number of reads discarded because they're duplicates. Marked reads are skipped by HC. Regarding your depth, the duplicate filters play some role in that, but it seems like HaplotypeCaller 3.2 downsampled a bit too aggressively. I'm not sure what version you used, or whether that's still the case in 3.3.

I'm not sure what to tell you about variant frequency, because it depends on which definition you're using. The vcf already contains an allele frequency for each variant, but you only have 1 sample, so that's not very interesting. As I understand it, you only ran HC over the variants reported by another tool - if you're just looking for how many of them were actually variant here, you probably want a tool like VariantEval (or maybe GenotypeConcordance). Of course, if you just ran HaplotypeCaller in the default mode, it only outputs calls at variant sites. VariantEval will also help you calculate things like variants per kilobase, if that's what you're after.

It is true that HaplotypeCaller does some fairly aggressive downsampling, and unfortunately it seems we have a bug that occurs when you try to override HaplotypeCaller's downsampling defaults. It's a hard one to fix because the downsampling system is rather complex, so in the meantime our recommendation is to not touch the downsampling settings for HaplotypeCaller. Rest assured that the tool downsamples data to a level that gives it all the data it needs to make an informed call.

@bwubb Is it a borderline call? Any excess coverage? This could be due to downsampling differences triggered by the different interval lists (which are used to generate the random seed for downsampling).

Hmm, given the coverage for those two samples, there's no downsampling going on, so that doesn't explain it. Aside from a filesystem hiccup / memory issue that caused a skipped position, I can't think of why this would happen. And it's hard to test without running the whole thing again. If you can run again on the full cohort, I would suggest trying in allSites mode, and see if something gets output at that position. Or maybe first try on just those two samples and see what happens.

Unfortunately there's not much else I can do aside from encouraging you to upgrade and use HaplotypeCaller instead of UG.

@wchen, what you're showing is not downsampling, it's read filtering, which is a separate process. Your problem is that almost all your reads are getting filtered out because they are marked as duplicates, as shown in this line:

@Geraldine_VdAuwera
Illumina TruSeq chemistry with PE. It's amplicon-based. How can I run GATK without marking duplicates?
By the way, what does downsampling do exactly? I set it to dcov 3000 since we have very high coverage for tumor sample without normal match. Thanks!

Ah, that explains it -- for amplicon sequencing you should skip the MarkDuplicates step.

Downsampling to coverage reduces the number of reads that the program will use for an analysis, to the amount that is sufficient for the analysis to be empowered, without being slowed down by excess/redundant data. This is done because very high coverage (above several hundred x) is typically unnecessary and counterproductive. The read depth is reduced by randomly throwing out reads until the desired quantity is achieved, so the remaining set is representative of the overall data. Each tool's default downsampling settings are tuned for its particular requirements, so you generally don't need to modify them unless you are trying to do something specific.
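
Conceptually, downsampling to coverage is just a seeded random subsample, as in this sketch (not GATK's actual implementation):

```python
import random

def downsample(reads, target, seed=42):
    # Randomly subsample to `target` reads so the remaining set stays
    # representative of the full pileup; below the target, keep everything
    if len(reads) <= target:
        return list(reads)
    return random.Random(seed).sample(reads, target)

pileup = [f"read{i}" for i in range(300_000)]
print(len(downsample(pileup, 3000)))  # 3000
```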

Nope, GATK does not require duplicates to be marked at the programmatic level. We just recommend doing it per Best Practices because for most experimental designs it's the right thing to do, but amplicon designs are one notable exception.

I tried not marking the duplicates, just added the read group, then ran the indel realigner in GATK 3.3, but got this error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/gatk/engine/CommandLineGATK : Unsupported major.minor version 51.0

@wchen, This error occurs when you run with the wrong Java version. GATK 3.3 uses Java 1.7. Perhaps you are running on a different machine that is not set up the same way, or you just upgraded your GATK version?
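
For reference, the class-file version in that error maps to a Java release as follows (these version numbers are standard Java facts, not GATK-specific):

```python
# Class-file major versions for common Java releases.
# "Unsupported major.minor version 51.0" means the jar was compiled
# for Java 7 but is being run on an older JVM.
CLASS_FILE_MAJOR = {"Java 6": 50, "Java 7": 51, "Java 8": 52}
print(CLASS_FILE_MAJOR["Java 7"])  # 51
```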

Try using our Best Practices calling workflow which involves the HaplotypeCaller instead of UnifiedGenotyper. If that doesn't work, then we can try to help you. But we can't help you use an older, inferior tool.

@raphael123‌ Don't worry about it, but remember we can only help you if we have all the information we need. It is better to be overly descriptive than not enough. In a case like this, you can tell us everything you tried, step by step, and show the results (posting the vcf record is usually very helpful).

Hi. I have just tried the HaplotypeCaller on a new dataset, and it is not calling any variants in a mapping that should have plenty of them (and visual inspection seems to support that). Base qualities at the SNP locations all seem healthy. The main suspect seems to be MAPQ. The mapping has been produced with BWA, and for some reason it has decided to give ALL my reads MAPQ scores of 6 or less. I am mapping barley exome reads to the barley genome, which is a) huge, b) fragmented and c) very incomplete, so that might be causing BWA to freak out. I have set the -mmq parameter to 0 for testing, and the HaplotypeCaller log output tells me that it is no longer filtering out any reads now. Could you please clarify for me your point 3 above -- does the local reassembly of reads still take into account the reads' MAPQ scores even if -mmq is set to zero?

P.S. I forgot to mention that this is a single sample which is almost completely homozygous but has plenty of differences with the reference sample, i.e. the SNPs I can see visually are all homozygous for the alternate allele. Could this be part of the problem? Does the HaplotypeCaller not call SNPs like these by default? I am not using any optional command line parameters apart from -mmq.

I couldn't figure out how to move my post, nor could I figure out how to delete my post in this thread. I presume having a duplicate post is not good. If you can direct me how to delete my post from this thread, or if you'd rather just do that yourself, I'd appreciate it.

Hi, I have a similar issue to those posted above. I tried to follow every post to figure out why HaplotypeCaller is missing an important variant on chromosome X, position 31219271. This is a known SNP (C -> T) for HapMap individual NA12878. I followed the Best Practices calling workflow, which uses Picard to mark duplicate reads, and GATK v3.6 for indel realignment and base recalibration. The processed bam file is then used for HaplotypeCaller with the following command:

SNP chrX:31219271 is missing in result.vcf. I loaded result_HC_debug.bam into IGV and see that there are actually 55% C and 45% T. Most of the reads have mapping quality 60 to 70 and base phred quality from 20 to 33. The IGV view is attached below. There is no other variation called in the neighborhood.

I am wondering if anyone has other suggestions to help me look into why HaplotypeCaller fails to pick up this SNP.

Hello,
I am validating a custom amplicon sequencing platform. I have targeted genotyping results for ~200 samples and corresponding sequence data on 60 amplicons, each with 1-2 known variant sites per amplicon. At certain loci, the alternate allele is not being called at known locations in samples carrying the variant. I have tried variant discovery (i.e. without a bed file) and targeted calling. Position 23 is the culprit; it should be an A. In the IGV view (attached), the A SNP appears in all the reads.

Hi
I performed a variant call using GATK 3.8 -T HaplotypeCaller on samples sequenced with a TruSeq amplicon-based assay over a list of targeted genes.
A variant at 3:38,783,873 is called HOM_REF in the vcf file, whereas in the bam file it is G 49% and T 47%.

I also attached the screenshot of the bamout, in which the reads in this region are not present. Why?

Could you clarify why I'm obtaining a homozygous call instead of a heterozygous one?

Hi
In the previous post, I described an issue regarding a variant that I expected to see (in the bam) but did not find in the vcf.
Now I'm asking about some variants that are not present in the bam file (0/0) but are called in the vcf as 0/1.
Example: 3:38770101, as reported in the figure.

Hi there,
I am doing a variant-calling analysis using GATK 3.3 HaplotypeCaller on human WES (Illumina HiSeq). As others have reported before, a variant was not called in the vcf file, yet in IGV I see the variant in the homozygous state (homozygous alternate) in all my samples (~30). I see the same using the bamout file. The alternative allele is C and the reference is T.
This variant C is actually really frequent in the population, but the reference human genome (Hs37) that I used still says T. So my question is: is there some step in GATK that I'm missing that corrects for population frequency or something like that?
Coverage is high, mapping quality is high, and the quality of the bases at that position is good as well.
Any ideas?
Thanks a lot!