When you are running AnalyzeCovariates to analyze your BQSR outputs, you may run into an error starting with this:

org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.

The main reason why this error often occurs is simple, and so is the solution. The script depends on some external R libraries, so if you don’t have them installed, the script fails. To find out what libraries are necessary and how to install them, you can refer to this FAQ article.

One other common issue is that the version of ggplot2 you have installed is very recent and is not compatible with the BQSR script. If so, download this file and use it to generate the plots manually according to the instructions below.

If you have already checked that you have all the necessary libraries installed, you’ll need to run the script manually in order to find out what is wrong. To new users, this can seem complicated, but it only takes these 3 simple steps to do it!

1. Re-run AnalyzeCovariates with these additional parameters:

-l DEBUG (that's a lowercase L, not an uppercase i, to be clear) and

-csv my-report.csv (where you can call the .csv file anything; this is so the intermediate csv file will be saved).

2. Identify the lines in the log output that says what parameters the RScript is given.

The snippet below shows you the components of the R script command line that AnalyzeCovariates uses.

You may have noticed that a lot of the scores that are output by the GATK are in Phred scale. The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see this Wikipedia article for more historical background). Now they are widely used to represent probabilities and confidence scores in other contexts of genome science.

Phred scale in context

In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer.

In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample.

Phred scale in practice

In today’s sequencing output, by convention, Phred-scaled base quality scores range from 2 to 63. However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A higher score indicates a higher probability that a particular decision is correct, while conversely, a lower score indicates a higher probability that the decision is incorrect.

The Phred quality score (Q) is logarithmically related to the error probability (E).

$$ Q = -10 \log E $$

So we can interpret this score as an estimate of error, where the error is e.g. the probability that the base is called incorrectly by the sequencer, but we can also interpret it as an estimate of accuracy, where the accuracy is e.g. the probability that the base was identified correctly by the sequencer. Depending on how we decide to express it, we can make the following calculations:

If we want the probability of error (E), we take:

$$ E = 10 ^{-\left(\frac{Q}{10}\right)} $$

And conversely, if we want to express this as the estimate of accuracy (A), we simply take

Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.

For many purposes, a Phred Score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error.

Phred Quality Score

Error

Accuracy (1 - Error)

10

1/10 = 10%

90%

20

1/100 = 1%

99%

30

1/1000 = 0.1%

99.9%

40

1/10000 = 0.01%

99.99%

50

1/100000 = 0.001%

99.999%

60

1/1000000 = 0.0001%

99.9999%

And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities.

The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically.

Note: You can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is really steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both of the graphs level out. This is why Q20 is a good cutoff score for many basic purposes.

Detailed information about command line options for BaseRecalibrator can be found here.

Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

Reported quality score

The position within the read

The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.

The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.

Running the tools

BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

read group the read belongs to

assigned quality score

machine cycle producing this base

current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:

After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.

This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.

Effectively the new quality score is:

the sum of the global difference between reported quality scores and the empirical quality

plus the quality bin specific shift

plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.

A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.

Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real snps, you should always perform recalibration on your bam file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.

The recalibrator applies a "yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting problems, which has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.

Example pre and post recalibration results

Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010

There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

The output of the BaseRecalibrator

A Recalibration report containing all the recalibration information for the data

Note that the BasRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

Arguments Table -- a table with all the arguments and its values

Quantization Table

ReadGroup Table

Quality Score Table

Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

First do an initial round of SNP calling on your original, unrecalibrated data.

Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.

Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

We have discovered a bug that seriously impacts the results of BQSR/ BaseRecalibrator when it is run with the scatter-gather functionality of Queue. To be clear, the bug does NOT affect BaseRecalibrator runs performed "normally", i.e. WITHOUT Queue's scatter-gather.

Consequences/ Solution:

Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.

This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!

ERROR stack trace

java.lang.RuntimeException: java.io.IOException: Input/output error
at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:87)
at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:74)
at htsjdk.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
at htsjdk.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:47)
at htsjdk.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:25)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:41)
at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:478)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:440)
at htsjdk.tribble.index.IndexFactory.createIndex(IndexFactory.java:338)
at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:312)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:402)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:288)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:225)
at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:148)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:997)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:779)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:290)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
Caused by: java.io.IOException: Input/output error
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:224)
at htsjdk.tribble.readers.PositionalBufferedStream.fill(PositionalBufferedStream.java:127)
at htsjdk.tribble.readers.PositionalBufferedStream.peek(PositionalBufferedStream.java:118)
at htsjdk.tribble.readers.PositionalBufferedStream.read(PositionalBufferedStream.java:57)
at htsjdk.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:80)
at htsjdk.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:122)
at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:85)
... 24 more

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: java.io.IOException: Input/output error

I am working on non-human data and want to use GATK's Haplotype caller to call variants. I am using the protocol that is listed here.

So far I have reached the indel realignment stage and want to do base recalibration next. Since I do not have known sites, this page in the GATK forum advises you to do an initial round of SNP calling on original un-recalibrated data.

Has anybody experience doing this? Is this initial calling done with UnifiedGenotyper on the indel realigned BAMs, or since I want to use Haploytype Caller downstream, should I use this module instead?

I am performing BQSR on RNA-seq data for the purpose of SNP calling. I was wondering about some issues:

My organism does not have a known set of SNPs. I asked this question before in the forum and accordingly, I am using a set of SNPs filtered as the input of knownsites for BQSR. In filtering, some SNPs have a tag 'SNPcluster', shall I use this file or shall I somehow filter the file so only the PASS SNPs are retained?

I have performed SNP calling only on a subset of my bam file because I was interested in certain chromosomes. Would that be fine if I only use this subset of SNPs and try to recalibrate only the subset of bam file not the entire bam? I am asking this question in case this can somehow create a bias for final results.

Hi all,
I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former document as my guide since it is more recent, but I am curious as to why these two are different. The first step is the same for the 2 docs, but then for the second step, the output is a recal.bam file in the brussels document instead of a post_recal.table file in the (howto) document. Then, I am confused about the (howto) document because it seems that the 4th step is exactly the same as the second step, except we generate a recal.bam file. Is the brussels document presenting steps that are more efficient? (since there is only 1 use of -BQSR to generate the recalibrated bam file we need, as opposed to 2 uses of the option)

I encounter the "RScript exited with 1. Run with -l DEBUG for more info" error when trying to make recalibration plots using Rscript. However, I do have all the required packages installed on my computer, including gsalib, which seems to be the problem. When I run with -l DEBUG, I get the following information:

Error in library(gsalib) : there is no package called ‘gsalib’

The thing is that I am using my computer but working in a cluster, so not directly on my computer. Should I have the packages installed in the cluster or is it fine that I've installed the packages on my computer? I'm a bit new to this, so I apologize if this question is a bit basic.

Hi,
I am using GATK version 3.2-2-gec30cee.
I have a pb with BQSR. Whwn I run the first step (analyze covariates), everything seems to be fine according to the INFO given by the walker:
Command used:
java -jar -Xmx16g $GATK -T BaseRecalibrator -R ./assembly/spades_reapr.genome.abv500.fna -I BAMsort-RG-dedup-realign_Miseq300.bam -knownSites CEW1_Miseq300_HC0.vcf -o CEW1_Miseq300_BQSR1.table
INFOs read on shell:

Hi there,
I have been using GATK to identify variants recently. I saw that BQSR is highly recommended. But I don’t know whether it is still needed for de novo mutation calling. For example, I want to identify de novo mutations generated in the progenies by single seed descent METHODS in plants. For example, in the paper of “The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana”, these spontaneous arising mutations may not included in the known sites of variants. Based on documentation posted in GATK websites, they assume that all reference mismatches we see are errors and indicative of poor base quality. Under this assumption, these de novo mutations may be missed in the step of variant calling. So in this situation, what should I do? Or should I skip the BQSR step?
Also what should I do when I reach to step- VQSR?
Hope some GATK developers can help me on this.
Thanks.

I am recalibrating the base qualities, using all the standard covariates, including machine cycle. The reads come from a 454 machine, and they have an average length of around 300 bases. Curiously, the recalibrator only issues an error if I set the --maximum_cycle_value to less than 27. And even if I set the --maximum_cycle_value to the true maximum length (1601), the RecalTable2 does not record cycles longer than 27. In DEBUG level, I get a few messages that seem to be related, but that I don't know how to interpret:

The fact is that after recalibration the low-qualitity tails get higher qualities, and I suspect that the recalibration does not work well for late cycles. I'm using GATK v3.2-2-gec30cee. This is what my command looks like:

One particularity of the bam file is that it's not huge. There are only about 500000 reads. Recalibration may not be accurate, but still, I wonder if it is using all the information available. There are hundreds of thousands of reads longer than 100 bases. Can anybody explain what's happening? Thanks.

Hi! I’m trying to plan my GATK pipeline and have a few questions. We currently have an assembled genome for a non-model avian species and would like to align the reads from the single individual to the genome and subsequently mine SNPs. Our end game is to design a Fluidigm SNP chip. Since we are only working with the data from a single individual, I am trying to figure out how to best minimize false positive SNP calls.

1) Prior to beginning mapping with BWA, is any pre-processing of the reads recommended? The tutorials say to use raw reads… do we not even need to remove adaptors?

2) Does GATK have any specific recommendations for non-model BQSR (i.e., when no reference set of high-quality SNPs is available)? I’ve found a couple references to this process (http://gatkforums.broadinstitute.org/discussion/3286/quality-score-recalibration-for-non-model-organisms, https://gist.github.com/brantfaircloth/4315737). One user indicated they were using SNPs with quality scores greater than 30 as their reference SNPs for BQSR, another 90. Does GATK have any other thoughts, or is this largely a judgment call?

I have attended the great Brussels workshop, and I am posting here the BQSR question I had.

I have a human exome experiment with 24-multiplexed samples per lane (Nextera libraries) where we first only did one lane of sequencing (~15x) and then added a second lane (summing up to ~30x). From what I understood reading the Best Practices, I probably don't have enough data to run a BQSR on each sample. How should I then do the BQSR step? Should I skip it altogether? Could I estimate the model parameters on one whole lane (all the samples) and then apply it separately to each sample?

And as a separate question: If I could turn back the clock, would it have been better to do 12-multiplexed samples per lane and run these two lanes of sequencing (24 samples in total) for the same amount of reads but giving me more data to do a BQSR step per sample?

A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.

If I have 150Mbases of data over a set of small target intervals, does that count as a small dataset for which I should avoid using BQSR? What about 1B bases, again over a small set of intervals? What are the best practices in this case?

ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR MESSAGE: SAM/BAM file SAMFileReader{xxxx.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (72) with BAQ correction factor of 8; please see the GATK --help documentation for options related to this error

I have a question regarding the PrintReads walker. I am running it with the --BQSR engine using the command below. While my job has yet to finish, it is clear that the output bam file will be significantly larger than the input (~2x). I have not set the –emit_original_quals flag, so I expect that the original scores should be discarded. Should I still be expecting this size increase?

When I was trying to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying that (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match".

However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data.

So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data.

I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:

ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.

Which means that the SNP vcf (which has the same format as the indel vcf) is not used by BQSR.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -37
at org.broadinstitute.sting.utils.BaseUtils.convertIUPACtoN(BaseUtils.java:172)
at org.broadinstitute.sting.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:288)
at org.broadinstitute.sting.gatk.datasources.providers.ReferenceView.getReferenceBases(ReferenceView.java:116)
at org.broadinstitute.sting.gatk.datasources.providers.ReadReferenceView$Provider.getBases(ReadReferenceView.java:87)
at org.broadinstitute.sting.gatk.contexts.ReferenceContext.fetchBasesFromProvider(ReferenceContext.java:145)
at org.broadinstitute.sting.gatk.contexts.ReferenceContext.getBases(ReferenceContext.java:189)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateIsSNP(BaseRecalibrator.java:335)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:253)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:132)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

As a note, I validated the BAM with picard/validateSamFile, with no errors found

I have a study including 100 samples. While most samples were sequenced in one lane, a dozen were in two lanes. I wonder if the difference in # of lanes may lead to difference in overall base quality scores after BQSR?

I have been working primarily with non-model organisms (and mostly inbred-mapping populations, but that's a topic for a different discussion). To recalibrate base qualities, I have taken the approach of running through the Indel Realignment, SNP, and INDEL calling. Then, filtering around INDELs. I use multi-sample VCFs and have taken the following approach to recalibrate base quality: I grab the top 90th percentile SNPs from all SNPs in my filtered SNP VCF file (based on ALTQ), then I pull out these top SNPs for each SAMPLE in the VCF file (in my case I usually have between 100-300 samples) and write to SEPARATE VCF files for each SAMPLE if the GQ > 90 and it's a SNP for that sample. I then use these SAMPLE HQ VCF files for the BQSR tools.

When using queue for BQSR scatter/gather parellelism I've been seeing the following:

java.lang.IllegalArgumentException: Table1 188,3 not equal to 189,3
at org.broadinstitute.sting.utils.recalibration.RecalUtils.combineTables(RecalUtils.java:808)
at org.broadinstitute.sting.utils.recalibration.RecalibrationReport.combine(RecalibrationReport.java:147)
at org.broadinstitute.sting.gatk.walkers.bqsr.BQSRGatherer.gather(BQSRGatherer.java:86)
at org.broadinstitute.sting.queue.function.scattergather.GathererFunction.run(GathererFunction.scala:42)
at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:53)
at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

I'm using gatk version: v2.4-7-g5e89f01 (I can't keep up the pace with you guys). I'm wondering if this is a know issue, and if so, if there is a workaround or a fix in later GATK versions.

I am trying to test -nct parameter in printreads (bqsr). My machine has 32 cores (aws cc2.8xlarge) and input is a single BAM. So I tried "-nct 32" first, but it was almost as slow as "-nct 1". Actually any -nct >8 made it slow, so I am wondering if it's my fault or known/expected result. Thanks!

The header of the Mills_and_1000G_gold_standard.indels.b37.vcf seems to the indicate that the correct 16569 bp MT version is used for the VCF file

##contig=<ID=MT,length=16569,assembly=b37>

Why does the BQSR tool think that a different version of MT is used for the Mills_and_1000G_gold_standard.indels.b37.vcf ?

Edit:

I have the same problem with the 1000G_phase1.indels.b37.vcf from the GATK_b37_bundle. Get the same error and the MT contig seems the be the correct one from the vcf header. Only the dbsnp_137.b37.vcf is accepted by the BQSR tool without complaining about a different MT contig.

Hi,
I am working on a data set in which (1) multiple individuals were sequenced on the same lane, and (2) the same individuals were sequenced in multiple runs. If I get this right, we would thus want to consider both lane and run as covariates in BQSR. I have two questions related to this:
1) Which elements of the @RG header are considered by BQSR? All given there (e.g. ID, SM, LB, PI, PL)?
2) I am not sure where (in which @RG elemnt) to best incorporate run and lane info?

##### ERROR stack trace
java.lang.NullPointerException
at org.broadinstitute.sting.utils.Utils.join(Utils.java:286)
at org.broadinstitute.sting.utils.recalibration.RecalUtils.writeCSV(RecalUtils.java:450)
at org.broadinstitute.sting.utils.recalibration.RecalUtils.generateRecalibrationPlot(RecalUtils.java:394)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.generatePlots(BaseRecalibrator.java:474)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:464)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.onTraversalDone(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:97)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

Hi,
I would like to perform base quality score recalibration on only the reads that have the "properly aligned" bit (0x2) set in the FLAG column of the SAM format. Ideally, I would like to use the --read_filter argument. Below is some code that does this to my satisfaction with the PrintReads walker of GATK 2 lite. However, GATK 2 lite does not support base quality score recalibration table creation. Is there any way someone could add the code to the GATK 2 full version?

I am not sure why, but the code seems to only work with the System.out.println() line.

I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:

MarkDuplicates

Count Covariates

Table Recalibration

Realigner Target Creator

Indel Realigner

What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.

My current workflow for analysing mouse exome-sequencing (based on v4 of Best Practices) can require me to use slightly different VCFs as --knownSites or --known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as AF'. It'd be great to be able to do this on-the-fly in conjunction with--known' in tools that required knownSites rather than having to create project-specific (or even tool-specific) VCFs.

Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?

This method is described to be the "First pass of the base quality score recalibration". What is the second pass? It is not mentioned anywhere, or am I looking in the wrong place? In v1.2 there were two steps, is there only one step now for bqsr?
Confused,
Juan

I am using GATK v2 (GenomeAnalysisTK-2.0-0-g4c0ffd4) and was trying out the new BaseRecalibrator walker. According to this post the BaseRecalibrator should output "A PDF file containing quality control plots showing the patterns of recalibration of the data", however I do not have any such file. Both the BaseRecalibrator and PrintReads steps of the BQSR pipeline appear to have worked as I have a recalibrated BAM file and the accompanying GATKReport but I would like to be able to view plots of the recalibration process (and preferably have these generated automatically by the recalibration pipeline).