

Reads traversed during IndelRealigner

I am running GATK version 2.3. I've run a few jobs (with different reference sequences and Illumina read datasets) and noticed the same puzzling phenomenon.

When I run RealignerTargetCreator, I get the following output in the log...

96553 reads were filtered out during traversal out of 31487680 total (0.31%)
96137 reads (0.31% of total) failing MappingQualityZeroFilter
416 reads (0.00% of total) failing UnmappedReadFilter
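For context, a RealignerTargetCreator run of that era would look roughly like the sketch below (GATK 2.x syntax; the reference, BAM, and output file names are placeholders, not the poster's actual files):

```shell
# Sketch of a GATK 2.x RealignerTargetCreator invocation.
# File names are hypothetical placeholders.
java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R reference.fasta \
    -I input.bam \
    -o output.intervals
```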

Then, when I run IndelRealigner, I get the following puzzling output in the log...

0 reads were filtered out during traversal out of 20024 total (0.00%)
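The corresponding IndelRealigner step, which consumes the intervals file produced above, would be invoked roughly like this (again GATK 2.x syntax with placeholder file names):

```shell
# Sketch of a GATK 2.x IndelRealigner invocation.
# File names are hypothetical placeholders.
java -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R reference.fasta \
    -I input.bam \
    -targetIntervals output.intervals \
    -o realigned.bam
```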

Previously, when I was using an earlier version of GATK (pre-2.0), the number of reads traversed during RealignerTargetCreator and during IndelRealigner was about the same. Now the IndelRealigner log always reports about 20,000 reads traversed.

Should I be concerned about this?

Also...a quick question...will IndelRealigner still output a warning if the maxReadsForRealignment threshold is exceeded for an interval (and therefore the region is returned as-is instead of realigned)?

Well, IndelRealigner will start up even if the intervals file is empty; it will just see that it has nothing to do and consider that it has finished without error. I'm not sure why it reports a non-zero number of reads traversed, though; that's weird.

As for the fact that RTC is outputting an empty target list, I'm a little surprised. Unless you have a very small dataset (or one that has already been realigned) it is unlikely that there would be nothing at all that needs realignment. Have you validated your input files?

My BAM file was generated using BWA. I had to use Picard's AddOrReplaceReadGroups to add read group information to make the file compatible with GATK. My read dataset represents a single genotype, and it was generated on a single lane.
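The read-group fix described above would have been done with something like the following. Picard of that era shipped per-tool jars, and all the read-group values shown here (RGID, RGSM, etc.) are hypothetical placeholders:

```shell
# Sketch of a Picard AddOrReplaceReadGroups invocation.
# All read-group values and file names are hypothetical placeholders.
java -jar AddOrReplaceReadGroups.jar \
    INPUT=input.bam \
    OUTPUT=rg_fixed.bam \
    RGID=lane1 \
    RGLB=lib1 \
    RGPL=ILLUMINA \
    RGPU=unit1 \
    RGSM=sample1 \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true
```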

Would these errors cause problems with GATK?

A final note...coverage for this alignment is incredibly deep (~2500x). Overkill, I know... Would this impact the output from GATK?

Are you getting a lot of these bad mate errors? RTC does apply some very stringent filters (see the tech doc for a complete list) -- but if the bulk of your data was getting filtered out you'd see it in the output log. Yet what you posted above doesn't suggest that's the case.

That is a lot of depth you've got there (have you marked duplicates?), but it's still well below the default numbers that the IndelRealigner will just give up on (because past a certain depth, realignment is too hard). And RTC doesn't have a maximum depth setting. But you could try setting -dcov to downsample, and see if that helps.
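The -dcov suggestion above caps per-locus coverage by downsampling. A sketch of how it would be added to the RTC command (the cap of 1000 is an illustrative value, not a recommendation, and file names are placeholders):

```shell
# Sketch: RealignerTargetCreator with per-locus downsampling via -dcov.
# The value 1000 is illustrative only; file names are placeholders.
java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R reference.fasta \
    -I input.bam \
    -dcov 1000 \
    -o output.intervals
```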

Hi, this is just a general comment about running Picard's ValidateSamFile.jar on a BAM file generated with BWA. The "MAPQ should be 0 for unmapped read" error you get is pretty common and might mask what your real error is (I think by default, ValidateSamFile.jar outputs the first 100 records and then stops). In any case, it might be a good idea to pass IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR when running it. From the Picard SourceForge FAQ page:

Q: Why am I getting errors from Picard like "MAPQ should be 0 for unmapped read" or "CIGAR should have zero elements for unmapped read?"
A: BWA can produce SAM records that are marked as unmapped but have non-zero MAPQ and/or non-"*" CIGAR. Typically this is because BWA found an alignment for the read that hangs off the end of the reference sequence. Picard considers such input to be invalid. In general, this error can be suppressed in Picard programs by passing VALIDATION_STRINGENCY=LENIENT or VALIDATION_STRINGENCY=SILENT. For ValidateSamFile, you can pass the arguments IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR.

I did run ValidateSamFile with the three detected error types masked/ignored (and the default max output raised to 1000 records) just to make sure I wasn't missing any other errors. That run didn't pick up any additional errors, so it looks like those three are the only errors afflicting my BAM file.

Good advice for people who might read this thread and use the ValidateSam tool.

Note that Kurt's advice is good if you are looking to uncover an underlying error, but users shouldn't automatically make Picard ignore those errors the first time they validate their files. Issues like bad CIGAR strings can cause the GATK to crap out on some analyses.

Following up on your recommendations, I removed duplicates from my BAM file using Picard MarkDuplicates. Running RTC on this file (with no other changes from my previous command line) produced an intervals.intervals file with 5 regions flagged.
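The duplicate-removal step described above would look roughly like this (since duplicates were removed rather than just marked, REMOVE_DUPLICATES=true is assumed; file names are placeholders):

```shell
# Sketch of a Picard MarkDuplicates invocation that removes duplicates.
# File names are hypothetical placeholders.
java -jar MarkDuplicates.jar \
    INPUT=input.bam \
    OUTPUT=dedup.bam \
    METRICS_FILE=dup_metrics.txt \
    REMOVE_DUPLICATES=true
```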

I then tried downsampling using the -dfrac flag.

-dfrac 0.75 produced an intervals.intervals with 6 regions flagged.

-dfrac 0.60 produced an intervals.intervals with 10 regions flagged.
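The -dfrac runs above downsample to a fraction of the total reads rather than to a fixed per-locus depth. One of those runs would be invoked roughly like this (file names are placeholders):

```shell
# Sketch: RealignerTargetCreator with fractional downsampling via -dfrac.
# File names are hypothetical placeholders.
java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R reference.fasta \
    -I dedup.bam \
    -dfrac 0.60 \
    -o dfrac060.intervals
```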

I did some searching to find recommendations on appropriate settings for downsampling, but didn't find much.

Do you have any recommendations on what cutoff to use for downsampling with RTC?

The thing is that we normally don't suggest using downsampling at this step because it's not usually necessary, but it sounds like it might be helping produce at least some target intervals.

In principle you want to produce as many intervals as the RTC can identify as problematic, to avoid missing any areas that will give messy results if they are not realigned.

At this point I would recommend visually inspecting the regions you have there (using IGV, for example) to see whether they look legitimate. If so, you can run them through the realigner and inspect again after realignment to check that the results make sense. To be completely thorough, you can then produce different lists of intervals with different settings of -dfrac, realign each into a separate file (using -L to restrict the run to the test intervals rather than reprocessing the whole dataset each time), and compare the results for sanity.
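A sketch of the restricted comparison run suggested above, using -L to limit traversal to the candidate intervals (interval and file names are placeholders):

```shell
# Sketch: realign only the flagged intervals, restricting traversal with -L.
# Interval and file names are hypothetical placeholders.
java -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R reference.fasta \
    -I dedup.bam \
    -L dfrac060.intervals \
    -targetIntervals dfrac060.intervals \
    -o realigned_dfrac060.bam
```

Repeating this for each -dfrac-derived interval list yields small output BAMs that can be compared side by side in IGV.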

I wish I could give you a more straightforward answer but since this situation is a little odd I think it's best to proceed cautiously. Let me know if you need me to clarify anything.