
Fewer unfiltered reads than filtered ones

We are working with GATK 3.2.2, focusing on optimizing SNP and indel calling for our data. As part of this, we also determine the frequency of certain variants, and here we came across some strange results.
In the VCF files there is, of course, the DP in the INFO field, representing the unfiltered depth, and the DP in the FORMAT field, representing the filtered depth. We also calculated the depth ourselves from the BAM file and checked it in IGV.
One especially strange case is the following:
For one sample GATK calls a deletion. The unfiltered DP (INFO field) is DP=644. The filtered DP (FORMAT field) is DP=505, consisting of 438 reads with the original genotype and 67 reads containing the variant. However, in our raw BAM file and in IGV things are different: there we see 499 reads in total, 425 containing a C (reference), one with an N, and 73 with a deletion. We checked, and the results are reproducible.
We see similar differences elsewhere as well. In many cases the number of reads containing the variant or the reference allele, as we count it and as it is observable in IGV, is smaller than the reported number of filtered reads.
How can this happen? I'm really looking forward to your answers.
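For anyone comparing these fields, the INFO-field depth versus the FORMAT-field depth and allele depths can be pulled apart on a toy record shaped like the one described above (plain Python, no VCF library; the chromosome, position, and quality are made up purely for illustration):

```python
# Minimal sketch: extracting the unfiltered depth (INFO DP) and the
# filtered per-sample depth and allele depths (FORMAT DP / AD) from one
# VCF data line. The record is a made-up example, not the actual call.
line = "chr1\t1000\t.\tCT\tC\t50\tPASS\tDP=644\tGT:AD:DP\t0/1:438,67:505"

fields = line.split("\t")
info = dict(kv.split("=") for kv in fields[7].split(";") if "=" in kv)
fmt_keys = fields[8].split(":")
sample = dict(zip(fmt_keys, fields[9].split(":")))

unfiltered_dp = int(info["DP"])                 # depth before read filtering
filtered_dp = int(sample["DP"])                 # depth after read filtering
ad = [int(x) for x in sample["AD"].split(",")]  # per-allele depths (ref, alt)

print(unfiltered_dp, filtered_dp, ad)           # 644 505 [438, 67]
```

Note that the AD values (438 + 67) sum to the FORMAT DP here, while the INFO DP is larger because it counts reads before filtering.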

Best Answer

Ah, this makes a lot more sense. And the final piece of the puzzle is, if I recall correctly, that the reads written to the bamout are those that pass all filters, while the calculations reported in the final vcf as AD may include filtered reads. So I really don't think there's anything to worry about here.

You can tell which reads are sample reads and which are artificial haplotypes by right-clicking on the reads in IGV -> selecting "Color Alignments By" -> "Sample".
When you hover over the differently colored reads, you can see in the label which are sample reads and which are artificial haplotypes.

Thanks for your quick answer. Yet I still do not properly understand what is going on with my reads. I colored them in IGV according to "Sample"; however, they are all the same color (light red).
Just to understand it correctly: the HaplotypeCaller creates artificial reads? How am I supposed to deal with them when I want to analyze the frequency at which a variant appears? I cannot really trust the frequency I calculate from the VCF file if it includes additional artificial reads, can I?
In one case I saw 3 reads with an insertion and 1 with a deletion, but according to the VCF file there were 72 reads containing the variant. And in IGV I just saw my four reads containing an indel.
For us it is really important to have reliable information on the frequency, but sometimes the difference between the frequency based on the VCF file and the frequency I calculated from the raw BAM file is really big. What would you suggest we do?

Thank you very much for your answer. I started looking at the bamout files and restarted our calculations. From the first results it seems that in most cases the raw number of reads is now bigger than the one reported in the VCF file. In most cases where this is not so, different ways of counting (especially concerning inserted bases) might explain the differences. However, there remain a few cases for which I simply cannot find an explanation.
For example: in one case a SNP is called. I count 124 reads containing the reference allele and 132 reads containing the alternate allele. These results are consistent with the observations in IGV. Additionally, there are 9 reads containing a deletion. However, according to the FORMAT field in the VCF file there should be 178 reads with the original genotype and 45 with the variant. Do you have any idea how this could happen?

I chose a slightly different SNP as an example, because I think it is even more problematic:
Below you will find a screenshot from IGV.
For the SNP I counted 295 reads containing the reference and 277 reads containing the alternate allele (as you can also see in the screenshot). In addition, there are 57 reads with a deletion at the position of the variant, which I ignored, as they contain neither the reference nor the alternate allele.
The corresponding line in the vcf is the following:

So on the one hand, I cannot understand how 319 reads could have been counted as containing genotype 0. On the other hand, it seems odd that there are a total of 572 reads in IGV but 575 according to the INFO field of the VCF.

I tried to disable all filtering options, so that all reads should be displayed in IGV. Yet there seems to be one read missing, which contains a deletion (I counted the reads manually to be sure). So there seem to be 58 instead of 57 reads containing a deletion at the site of the SNP. The other figures, however, do not change.

What I did for the SNP is the following:
I extract information on the SNP from the VCF file (using R and the package Rsamtools). I use this information to get all reads (from the BAM file) that cover the mutated base. In total there are 630 reads. I checked the number of reads directly with samtools, and it is the same.
In the next step, I looked at the CIGAR string of every read. As I know the starting position of every read, I can simply calculate whether a read covers the mutated base or whether there is a deletion at the site.
In 58 of the 630 cases I find a deletion, which is why I disregard these reads. For the remaining 572 reads I check whether the base at the position of the mutation is the same as the reference base, the same as the alternate base, or a different base. I then count the number of reads containing the reference base (295) and those containing the alternate base (277) and compare the results with those reported in the VCF file.
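That CIGAR walk can be sketched in plain Python (a simplified illustration rather than the poster's actual R/Rsamtools script; it handles only the common CIGAR operations, ignores base qualities, and assumes 1-based coordinates):

```python
import re

def base_at(ref_start, cigar, seq, target):
    """Walk a CIGAR string and report what a read shows at a reference
    position: the read base, 'D' if a deletion spans it, or None if the
    read does not cover it. ref_start and target are 1-based."""
    ref = ref_start  # current reference position
    q = 0            # current offset into the read sequence
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X":                       # consumes reference and query
            if ref <= target < ref + length:
                return seq[q + (target - ref)]
            ref += length
            q += length
        elif op in "DN":                      # consumes reference only
            if ref <= target < ref + length:
                return "D" if op == "D" else None  # N = skipped region
            ref += length
        elif op in "IS":                      # consumes query only
            q += length
        # H and P consume neither reference nor query
    return None

# A read starting at reference position 100 with a 2 bp deletion:
print(base_at(100, "5M2D5M", "ACGTTACGTA", 106))  # "D": position is deleted
print(base_at(100, "5M2D5M", "ACGTTACGTA", 108))  # "C": base after the deletion
```

Classifying the returned base as reference, alternate, or other, and skipping the "D" cases, reproduces the kind of counts described above.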

Ok, that makes sense. I assume these calculations were made on the original BAM rather than on HC's output BAM? If so, I think what's happening is that HC realigns some of those reads with the artifactual indel so that they now support the reference.
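For reference, HC's output BAM with the realigned reads can be produced with HaplotypeCaller's -bamout argument (GATK 3.x syntax; the file names and interval below are placeholders):

```shell
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L 20:10000000-10010000 \
    -o calls.vcf \
    -bamout realigned.bam
```

Restricting with -L to the region of interest keeps the bamout small; the bamout contains the reassembled reads HC actually used (plus the artificial haplotypes), which is what manual counts should be compared against.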

Hello Geraldine,
At first I made my calculations with the original BAM, but when I read about the output BAM (with the realignments), I changed my script so that I always use the output BAMs for the calculations. (Otherwise the differences in the numbers are even greater.)
Sarah

So, concerning the total number of reads, the results now appear to be fine. Yet there remains the large difference between the reads without the SNP as counted in the bamout file and as reported in the VCF file, which I do not understand.

The command lines I used for the variant calling are the following:

Oh, I just realized that it looks like the ADs are basically flipped between the alleles. Does that match what you're seeing? There was an ordering bug in 3.2 that occasionally switched the AD values between alleles, which could explain this. And do you still see it happen when you re-call variants from the original BAM file with 3.3?
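A quick way to screen for that ordering bug is to compare the manually counted depths against the VCF's AD pair. The helper below and the numbers in the example are hypothetical, purely for illustration:

```python
# Hypothetical check for the AD-ordering bug described above: report
# whether manually counted allele depths match the VCF AD pair directly,
# match with the two values swapped, or disagree outright.
def ad_orientation(counted_ref, counted_alt, vcf_ad):
    if (counted_ref, counted_alt) == tuple(vcf_ad):
        return "match"
    if (counted_alt, counted_ref) == tuple(vcf_ad):
        return "flipped"
    return "mismatch"

# Illustrative counts only: 100 ref / 40 alt counted, VCF reports (40, 100).
print(ad_orientation(100, 40, (40, 100)))  # flipped
```

Running such a check over all sites would show whether the discrepancies are systematic swaps (the 3.2 bug) or genuine count differences.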

I found a little mistake in my former calculations (I used the BAM file resulting from GATK 3.2.2 as input for the pipeline using GATK 3.3.0, and not the original file resulting from the alignment with TMAP). I corrected this mistake, yet the results are still somewhat strange:
