Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Did we give you a GitHub issue to track?

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community

Attention:
The frontline support team will be slow on the forum because we are occupied with the GATK Workshop on March 21st and 22nd 2019. We will be back and more available to answer questions on the forum on March 25th 2019.

Comments

I have read a few different forum threads about BQSR. I saw a few times it was mentioned that you can safely run BQSR on the chromosome level if there is a lot of data. I am assuming a 30X WGS sample produced on a HiSeq X machine would produce enough data to do this safely? Would you recommend to run BQSR on all chromosome together? I just want to run as many commands in parallel as possible.

Hi,
this is my first try with GATK. After aligning with bwa on ucsc_hg19 reference and de-dup with Picard, I have re-aligned a bam file and started the base re-calibration process before going to SNP calling (as indicated in best practices). I've run everything with default options, and generated the bqsr report file with RStudio and a downloaded BQSR.R script (to circumvent a Rscript PATH issue).
I provided the BQSR.R script with the recal.table from the first BaseRecalibrator passage and the csv file from AnalyzeCovariates run after a second passage, as indicated in the "how to recalibrate base quality" tutorial but my feeling is that quality values are worse after recalibration. First I thought it was due to downsampling, but after running the whole process on the entire bam file, I got similar results (see attached file). I'm using the (I think) latest GATX version, v3.3-0-g37228af.
Any help on how to interpret my recalibration plots?
Thanks a lot,
Fabrice.

Hi, If I am working with a non-model organism without a database of verified SNPs, is BaseRecalibration impossible? Should I still perform indel realignment with RealignerTargetCreateor and IndelRealigner?

I am currently running whole-exome data from 350 trios (unaffected parents + affected child) through the best practices pipeline (GATK V3.2-2), starting from fastq files, with the eventual goal of identifying de novo mutations.

I have healthy control data from another sequencing project (~650 trios), that I would like to use as a control here for comparing burden of de novo mutations. However, for these samples I only have bam files that have undergone BQSR (with V2.X GATK), and the original quality scores were not saved. I am reverting these back to fastq files in order to align them to the same reference as the other cohort, etc.

Obviously I would like the control data to be as well-matched as possible, so in an ideal world, I would have raw quality scores for both cohorts, and after alignment would conduct BQSR analogously on each before proceeding to variant calling. Given that this is not possible, I am wondering if it is better to use the existing recalibrated base quality scores for the control data as is, or whether I should re-recalibrate these? Is it a bad idea to recalibrate already recalibrated base quality scores and/or is there any advantage to doing this?

I would recommend running your controls through BQSR again. Successive runs should not have any negative effect, and since there have been a couple of substantial improvements in the BQSR tools compared to early 2.x versions, your data may benefit.

Question -- for the BaseRecalibrator walker, the documentation says having a list of known variant sites is optional, but with the walker, it's apparently mandatory.

I was wondering if there was a recommended practice to get this information when it's unavailable? For instance, when I process raw reads, I usually do:
mapping with bowtie
duplicate removal
indel realignment

Would it make sense for me to get a preliminary list of high quality-score variants (e.g., QUAL > 100) and use these as "known" variant sites? Or do I need ALL known variant sites first?

Hi @agopal, yes, you're on the right track. Have a look at the presentation slides from the recent workshop posted on the GATK blog. In the "Non-human" presentation you will find recommendations for bootstrapping a set of known variants.

Working with a species of plant that does not have a set of known SNPs and read here in the forum about bootstrap variants calls. I wonder what that is. You get the bam file recalibrated and using variables identified to recalibrate it again? Or is using the new variants identified to recalibrate the original bam?

Thank you very much for spending time to read my question.
Sincerely,
Rhewter.

@Sheila@Geraldine_VdAuwera I have heard a few people at different occasions saying it's not necessary to run BQSR for Illumina HiSeq X Ten high coverage data. I thought I would hear it from the horse's mouth myself. Is it best practice to run BQSR in all cases? I'm sorry if this is a stupid question. I don't have much experience with BQSR and bam pre-processing in general. I am perfectly happy with you telling me, that I should evaluate it myself. Thank you for your time.

@tommycarstensen Not a stupid question at all. I'm going to give you one of my infamous "yes and no" answers. Yes, it is true that with higher-quality, higher-coverage data (as one might expect from the beauty that is the HiSeq X Ten), BQSR produces diminishing returns. No, I wouldn't recommend skipping BQSR despite the previous statement.

My favorite analogy is that BQSR is like fire insurance. Most of the time you don't need it and it feels like a waste of money. But when your house burns down (which you don't know is going to happen in advance -- unless you're committing insurance fraud, and that's another matter entirely) it can really save your bacon. Because it mutates into a fire department and rebuilding crew all at once. And that's where my analogy truly breaks down, but you get the point

We're looking into ways to make it faster and economize on storage space, but for now BQSR is still definitely best-practice.

I've recently been given some low coverage (6-8X) WGS BAMs to analyse. They have been processed with a GATK based workflow, but after alignment each BAM was split by chromosomes and each chromosome was process separately. This means some of the read groups going through BQSR have fewer than 100M bases. My understanding from the above is that this is not what you would recommend, but possible to do. My question is how much impact on quality is this likely to have (that is, having fewer than 100M reads) and how significant is the improvement with 1 B bases? If I decide to re-run BQSR can I simply run it on the BAMs I have?

I've also noticed that they have been run through IndelRealigner before BQSR. I've always performed these steps the other way around with the rational that bases will potentially move during realignment, therefore, you can't tell if a bases is truly different from reference until it's in its final alignment. Can anyone comment on the correct order to run these modules, or does it not matter?

BQSR will be able to build a better model with more bases, so we do not recommend running on less than 100M bases. Can you tell me a little more about how the reads were sequenced? If you have whole genomes, I suspect you have a lot of reads per lane. BQSR is supposed to be run per-lane. So, you may be able to combine some of your data to get better results, even if they are not from the same sample. Again, it just matters that they are from the same lane.

@Sheila, thanks for your response. Can you clarify how BQSR determines which data come from the same lane? I know this is a question that has been asked before, but the answers don't seem to be that clear. I've read previous posts by you stating that the read group ID is used to determine lane (sorry can't find the link for that one), but here http://gatkforums.broadinstitute.org/discussion/4586/read-group-pu-field-used-by-baserecalibrator you state that the PU field will take precedence. There are lots of references all over the forum that the GATK doesn't require the PU field. Is it used in preference to the ID tag or not? If it's not, they I don't fully understand how to set the ID tags up for the data I have. The samples have been multiplexed (at least four samples per lane). Each sample will need it's own read group (with distinguishing SM tags), and each of these will need to have a unique ID tag. How will the GATK then know they are from the same lane?

As those documents state, PU is used over RGID if it is present, but we generally assume that PU is not present, and in that case RGID is used. I'm not sure how I can state that more clearly.

Note that when we say that BQSR analyzes the data per lane, we mean per lane per sample. So in your case, you'd typically first demultiplex the data into separate per-sample files. Within that, you may have data from different lanes or libraries (distinguished by RGID and LB tags, and having the same SM tag). BQSR will distinguish them accordingly. Does that make more sense?

Can you explain how BQSR will handle binned base qualities, for example from HiSeq4000 data? Will it be possible to merge a HiSeq2xxx BAM with a HiSeq4000 BAM? My Illumina FAS suggested performing base recalibration prior to merging in this case, but he doesn't have direct experience with this. What workflow would you recommend?

I am new to GATK. Can someone please explain me what base quality score is? what does base quality score of 20 mean? I also need to know about the filter. What does PASS and VQSR mean in Filter column? Thank you

Hi,
Previously, I used v3.4 to do BQSR with default settings, after print reads, the BAM size became much larger, I know this case is normal as discussed in some post here.
However, when I updated to v3.5, and in the printread task included one more option --disable_indel_quals, I found the new BAM size does not increase a lot, in some case it even decrease, for example, before BQSR the BAM is 8.1G, after that it became 7.9G.
I would like to know is there problem?
Thanks

I also had the same case but thought it was related with the amount of information that is in the BAM files. When you disable the indel qualities you reduce the amount of information in the BAM file which would result in reduced size of the BAM file. But, it would not affect anyother information or values in the BAM file. Is that the case?

It's true that disabling indel quals doesn't affect anything else, but there are a few other reasons why file size fluctuates when going through BQSR. When you don't have indel quals inflating the size of the file, the rest is more noticeable in comparison.

Hi @Geraldine_VdAuwera. I am new at working with ion torrent sequencing data and will very much appreciate any pointers.I was wondering if BQSR is valid for bam files generated on ion torrent platform. In general, are there any steps in best practices guideline that are not valid / recommended for ion torrent data. Many thanks.

We know that indel realignment won't work properly on Ion Torrent data; if I recall correctly the tool actually has a hard-coded check that will make it quit with an error if it encounters Ion Torrent data. I'm not sure about BQSR, I think that should be fine.

More generally though we don't support running on Ion Torrent data because we know it has some error modes that GATK is not able to model correctly. I can't comment on Ion PGM as we have never had any such data to experiment with, to my knowledge at least.

Ion has their own caller which does purport to model the idiosyncrasies of that data type correctly. I can't comment on its accuracy as we've never done any comparisons. I can only recommend trying both (or other callers as well) and seeing what performs best in your hands. Others in the community may have more useful recommendations, besides.

Unfortunately the BQSR tables are not expected to be consistent from one run to the next. The problems it addresses are not just machine-specific, they can also be lane-specific or library-specific. So it's not possible to be lazy. Believe me, we would be all over that if it was possible!

Please do go and school your friend, for they are sorely mistaken! It would be quite useless to run a method that amputates our ability to discover new things. Then we wouldn't be doing variant discovery, we'd just be re-genotyping!

Your intuition is correct -- the effect of novel variants on the recal tables is tiny, largely because their number (relative to the number of bases in a genome or exome) is tiny. And qualities will only be adjusted based on trends across all reads, so any individual site may not see any adjustments (regardless of whether there is a variant there) unless it is affected by some bias -- in which case the adjustment is a good thing. As you say sometimes the adjustment is an increase in qualities. It's not all about reducing scores. So finally, you're correct that real novel variants have no less chance of getting called before or after BQSR. The major effect of BQSR is to decrease false positives and increase reliability of calls overall.

Thank you for being an advocate of good methodology and understanding!

Hi Geraldine, currently practicing with GATK sample data frag_1.fastq and frag_1.fastq with reference genome Hgenome14RK.fasta. presently running the analysis but having problem with BaseRecalibrator. /Elvis_tutorial$ java -jar /home/francis/GenomeAnalysisTK.jar -T BaseRecalibrator -R /home/francis/Elvis_tutorial/Hgenome14RK.fasta -I /home/francis/Elvis_tutorial/H14_realigned_reads.bam -knownSites /home/francis/Elvis_tutorial/known.vcf -o /home/francis/Elvis_tutorial/H14_recal_data.grp. how do i get known.vcf file because that is the errors that am getting from "Couldn't read file /home/francis/Elvis_tutorial/known.vcf because file '/home/francis/Elvis_tutorial/known.vcf' does not exist"?

Hi,
I am sending you a file with an example of my graphics after BQSR.
Can you please tell me if it is normal that after re-calibration my quality scores become lower? Is there any reason that can explain this?

In general, it is possible for the quality scores to be lower after BQSR if the program found that the machine systematically overestimated the quality of the calls.

But in some cases, if you see a big drop in all quality scores, it can also mean that the set of known sites is not adequate. We mostly see that in studies of non-human/non-model organisms. What kind of data are you working with?

@Sheila I didn´t know that issue.
We sequenced only a small target panel (3,509bp) in 96 samples and we used a MiSeq flowcell that produced a total of 250M bases, which gives a mean of 2M bases per readgroup (sample). Each sample has a mean of 300 reads per site (+- 75000bp).

I performed BaseRecalibrator tool independently in each sample/readgroup (only 2M bases). Maybe I should create only one recal.table with all my samples (250M bases), however the problem is that this tool will analyze each sample/readgroup independently, so the result will be the same, right? Should I give the same @RG to all my samples since they are from the same library and sequenced in the same single-lane flowcell?

@Sheila
All my samples were processed in the same single-lane flowcell. But they represent different DNA samples.
To distinguish these samples we used barcodes (but this information is not present in the FASTQ files).
My read name is like this: @Instrument:RunID:FlowCellID:Lane:Tile:X:Y ReadNum:FilterFlag:0:SampleNumber

I am not sure about what @RG ID should I give to my samples. I found the read groups tutorial very confuse.

If I look to this example, I conclude that I should give my flowcell name to the @RG ID, and in this case all my BAM files from all samples will have the same readgroup ID:@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200

I read in the comments, that the @RG ID is only important to BQSR. Is that right? Will I be able to distinguish my samples from @RG SM after this step?

@Sheila
Thank for your help. I wonder if there is other GATK tool that can detect bases resulting from sequencing errors or do you think it is not necessary to recalibrate bases in my case?
It is ok to run the other tools in my small database?
Thanks

Unfortunately there's not other way of recalibrating base qualities, but this is not a blocking problem. It just means that your data will be a bit more vulnerable to errors on that front, yet it is still usable. Since you have such a small target region, you will be able to evaluate variant calls individually to some extent, which compensates for the inability to correct for some types of error. So now you can proceed to calling variants on your data. Note that you also won't be able to apply variant recalibration (VQSR filtering) but that's also ok since you should have a small enough number of variants to be able to examine them each more closely.

Hi. I am calling variants on fox RNAseq reads which have been aligned to dog. I am hoping to perform BQSR on this data set using previously called fox variants which were generated using fox genomic reads aligned to dog. However, my PI is concerned about the fact that this recalibration data will contain both dog vs fox and fox vs fox variants, and that the dog vs fox variants will be much
more numerous (around 80-90% of the variants in our experience). Should we be concerned about using this data set for recalibration?

No, this makes perfect sense. What you're trying to do here is mask out as many as possible of the sites where your fox reads will legitimately mismatch the dog genome (meaning where you expect to potentially see real variation in the fox relative to the dog) to avoid counting those base calls as errors. Since your known variants were generated using the same fox vs dog setup, this seems absolutely appropriate to me.

Is it advisable to specify an interval file with -L when running BQSR on WES data? We are only calling SNPs within the capture region for our exome data, but there are certainly reads outside that region, due to a variety of factors (as I'm sure you're aware).

However, the reads outside the exome region should be subject to the same sequencing machine errors as those inside. So would we end up with a better model by including these regions, simply because we're feed more reads into the BQSR model?

Yes @buddej, that is what we recommend. There should be a minority of off target sequence, so you won't lose anything substantial in terms of amount of data for building the error model, and in fact you'll avoid incorporating reads that may have more mismatches than on target reads.

When you apply the recalibration though you don't restrict to intervals.

I've got some experience with other pipelines but am new to GATK. My files are prepped, but I have 1152 individuals (of a coral, Acropora cervicornis) from 6 flowcells/lanes and am unsure how many of them to start with. I'd like to run HaplotypeCaller on some (? all?) of these samples and then use that data for the bqsr and proceed from there. Can anyone share any advice on this? I only really need help with the organization component, and I'm not sure if I should be running so many samples, if they need to be distributed across read groups, or if I need to run them all. From what I can tell it seems best to maintain my .bam files per sample, but I've also seen that I need to merge based on read group at some point.

What do you mean by your "files are prepped"? Can you tell us about your end goal?

We mostly work with humans at the Broad, so we don't have much experience with non-human data. That said, you are correct you probably don't need to run HaplotypeCaller on all 1152 of your samples then do the bootstrapping process. You may try running HaplotypeCaller on a subset of your samples (perhaps try 20-30) then run BQSR. In the next round, you can try with a different subset of 20-30 individuals, and so on, until you see convergence.

When you say "merge based on read group at some point", I am assuming you mean aggregating bams sequenced on the same flowcell that belong to the same sample. We do that at the Mark Duplicates step.

Hi Sheila, Thanks for the info. So by 'files are prepped' I just meant I've got the trimmed bam files with marked duplicates ready to go.

As for merge based on read groups I was trying to figure out if I should be running haplotype caller on bams containing all the data from a read group, but it seems running each sample individually and then combining GVCFs should work.

On another note, if I use a subset of samples for BQSR, I assume I should distribute them across read groups, correct? From what I understand detecting the systematic differences between flow cells should require that information, so I think that seems like the way to go.

I have around 100M bases per sample but in some cases, half of that. I think the best way to run BSQR is to merge my bam files by read group and run BSQR on entire read groups at a time, generating the recalibration report from there on files that are well over 1B bases.

My question is, do I need to include data from multiple read groups in these merged .bam files so that the systematic differences between read groups (due to machine) can be calculated? I'm basically trying to figure out how to organize the files for BSQR. I've run it on some individual files and it works fine (as far as I can see, it functions at least).

Hi everyone!
I will try to do bootstrapping as I am working on newly sequenced organisms.
The explanation here seems already clear I guess, that I should reach kind of the last stage of the analysis by skipping BQSR first, and then I should use my filtered variants as the vcf input for BQSR and re-do the rest of the analysis(?) The question here is, how crucial is this step, I mean, if I had a known set of variants, I'd definitely do it, but bootstrapping will probably not be as efficient. And how can I know when I reach convergence?

In case that may help me, Geraldine has a post from April 2015 that goes as:
"In the "Non-human" presentation you will find recommendations for bootstrapping a set of known variants"
and she shares a dropbox link, which gives 404. Is it possible to share that file again maybe?
I also checked out BQSR presentations from some workshops, but couldn't find any details.

@Dragonite
I had the same issue when following GATK best practices. I did my BQSR by calling the variants on my samples first and then went back to do BQSR - then rest of the best practices.
But, for individuals working with emerging models, I think the deal breaker is your sample size and the quality of the your "per base sequence quality" in your fastq file, see this http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

The standard minimum threshold for sequence quality at each base is 20, but with my sequence I had all the bases above quality 20 (actually 30) at all the positions. So, I realized that BQSR isn't making much difference since the whole purpose of BQSR is to recalibrate this sequence quality scores. And, I was gaining as much valuable variants without doing this extra step. So, check your fastqc first, if it looks good I would rather emphasize on VQSR using gvcf approach on cohorts.

@Sheila@Geraldine_VdAuwera
I think this particular question comes a lot and I understand the dilemma of using vs. not using this method (BQSR) since is is very tempting (honest opinion) but resource consuming. But, I think looking at the "per base sequence quality" prior making the judgement is a good idea. Am I missing something else that is more important on BQSR?

@everestial007 FastQC is a great tool but it only tells you what the scores are, not whether they are accurate. So you can't know just from that whether BQSR will help your data or not.

We have not yet found a good way to predict upfront whether a given dataset will benefit from recalibration, so we run it systematically on all our data. We consider that it's like fire insurance: most of the time it's not needed but when it is you're really happy you got it.

The tool has been rewritten in the upcoming GATK4 version to be quite a bit faster, so in future that point of pain will be eased to some extent.

Thanks much @Geraldine_VdAuwera .
I will think if I should reconsider doing BQSR on my pipeline. I have been doing my analyses for about more than a year but keep coming back and adding parameters in the analyses. Lol

Now I have another question:
After BaseRecalibrator we can get a report table which contains many EmpiricalQualities, I used to think that we just need to use these EmpQS in PrintReads, but I find that it recalculate EmpQS in PrintReads process, and the EmpQS is a little bit different with those in report table. I want to know why it has to recalculate the EmpQS rather than just use EmpQS in report table?

What version of GATK are you using? Can you share with us how the tool is telling you that it is using BAQ? If you are using an older version of GATK, I highly recommend you update to the latest version (v3.7). Unfortunately, we cannot help you with questions related to features that are deprecated.

@oskarv Sorry your question got lost amid the other conversation on this thread. Yes, we expect to get marginal differences in output when parallelizing execution of BaseRecalibrator. It shouldn't cause major differences, however. Are you seeing very different results?