
When should I use -L to pass in a list of intervals?

The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: intervals are not required, but they can help speed up analysis
- Whole exome analysis: you must provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you must provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a BAM or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a BAM file, since doing so will omit some data from the output.

Example Use of -L:

-L 20 for chromosome 20 in the b36/b37 builds

-L chr20:1-100 for chromosome 20 positions 1-100 in the hg18/hg19 builds

-L intervals.list (or intervals.interval_list, or intervals.bed) where the value passed to the argument is a text file containing intervals

-L some_variant_calls.vcf where the value passed to the argument is a VCF file containing variant records; their genomic coordinates will be used as intervals.
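To make the file-based form concrete, here is a minimal sketch: an intervals file is created and then passed to a tool via -L. The GATK invocation is shown commented as an illustration only; the jar, reference, and BAM paths are placeholders.

```shell
# Create a simple intervals file (1-based, chr:start-stop syntax, one per line).
cat > intervals.list <<'EOF'
20:1000000-2000000
20:5000000-5500000
EOF

# Illustrative GATK 3.x invocation (paths are placeholders):
# java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
#     -R reference.fasta -I sample.bam \
#     -L intervals.list -o calls.vcf
```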

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

- For example, HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 compared to prior assemblies. This has implications for using the -L option of GATK, as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
- When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
- However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the contig name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

-L HLA-A*01:01:01:01:1+
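One practical wrinkle worth noting: the contig name contains both '*' (a shell glob character) and ':', so the value must be quoted on the command line. A minimal sketch of how the interval strings are built (the echo lines just show what GATK would receive):

```shell
# Quote the contig name so '*' and ':' survive the shell;
# ':1+' tells the interval parser everything before it is the contig name.
CONTIG='HLA-A*01:01:01:01'
printf '%s\n' "-L ${CONTIG}:1+"      # whole contig
printf '%s\n' "-L ${CONTIG}:1-100"   # positions 1-100 only
```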

So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use an intervals list in whole genome analysis -- presumably you're interested in the whole genome!

However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere) where you know the data is not reliable or is very messy, causing excessive slowdowns. You can do this by providing a list of "good" intervals with -L, or you could also provide a list of "bad" intervals with -XL, which does the exact opposite of -L: it excludes the provided intervals. We share the whole-genome interval lists (of good intervals) that we use in our production pipelines, in our resource bundle (see Download page).
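As a sketch of the exclusion approach (contig names here assume an hg19-style reference, and the GATK command is illustrative only, with placeholder paths):

```shell
# List the contigs/regions to mask out, one per line.
cat > exclude.intervals <<'EOF'
chrY
chrM
EOF

# Illustrative: same command as usual, with -XL instead of -L.
# java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
#     -R reference.fasta -I sample.bam \
#     -XL exclude.intervals -o calls.vcf
```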

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
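The interval padding argument handles this automatically, but for illustration, here is what the equivalent manual padding of a BED file would look like (a sketch; the 100 bp value and file names are just examples). Note that BED starts are 0-based, so the padded start must be clamped at 0:

```shell
# One example target (BED is tab-separated, 0-based start).
printf 'chr20\t1000\t2000\n' > targets.bed

# Pad each interval by 100 bp on both sides, clamping the start at 0.
awk -v P=100 'BEGIN{OFS="\t"} {s = $2 - P; if (s < 0) s = 0; print $1, s, $3 + P}' \
    targets.bed > padded.bed
cat padded.bed    # -> chr20  900  2100
```

Letting the engine do the padding is simpler in practice, since it guarantees the same padding is applied at every step where you use -L.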

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

| Tool | -L? | Why / why not |
| --- | --- | --- |
| BaseRecalibrator | YES | This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration. |
| PrintReads | NO | Output is a BAM file; using -L would lead to lost data. |
| UnifiedGenotyper/HaplotypeCaller | YES | We're only interested in making calls in exome regions; the rest is a waste of time and includes lots of false positives. |
| Next steps | NO | No need, since subsequent steps operate on the callset, which was restricted to the exome at the calling step. |

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can use -L a lot while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval. This is really useful when you're trying to figure out what's going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what the effect of changing a parameter would be (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you'd use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).

Comments

This is very helpful! Thank you very much. I am running GATK on whole exome sequencing data with BAM files downloaded from TCGA, and I am not sure how to make the intervals file for the -L option. I saw some people use the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) to generate a BED file containing all the exons plus 10bp at each end; it is based on the hg19 assembly. Would that be a good way to go? I do not see any exome interval file in the GATK ftp bundle, although it has been mentioned in the forum. I also saw that GATK accepts BED-style interval lists, but the warning message says that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1. I do not know how this fits into the story if I use the UCSC Table Browser to generate a BED file. Since I set 10bp at each end, will there be no problem? The output should still be 1-based because the annotation is based on the reference fasta file, human_g1k_v37.fasta, instead of the intervals, right?

**start ** - The zero-based starting position of the feature in the chromosome.
The first base in a chromosome is numbered 0.
The start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20, inclusive.

If you use a program or website that generates a BED file natively, that's fine; the intervals will be read and interpreted correctly by GATK. The warnings are meant for users who generate interval files with their own scripts, to remind them to define intervals appropriately.
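As a concrete illustration of the 0-based/1-based conversion (a sketch using a made-up interval): a BED record and its equivalent in GATK's 1-based chr:start-stop syntax differ by exactly 1 in the start coordinate.

```shell
# BED line: 0-based start, end is the last covered 1-based position.
printf 'chr20\t9\t20\n' > example.bed

# The equivalent 1-based GATK interval adds 1 to the start only.
awk '{printf "%s:%d-%d\n", $1, $2 + 1, $3}' example.bed   # -> chr20:10-20
```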

But you should pay attention to what reference you are choosing. The UCSC file will be based on the hg19 build, but the human_g1k_v37.fasta file you are referring to is the b37 build. Have a look at our documentation on reference files if this is confusing to you. You can find both the hg19 and the b37 build files in our resource bundle.

Thank you very much for your explanation! You are right about the reference issue. The problem that I ran into is that GATK does not accept the UCSC version of intervals (hg19 build) when I use the human_g1k_v37.fasta reference (b37 build). But I have to use the human_g1k_v37.fasta reference, because the bam files from TCGA originally used that version. I am wondering if there is any way to solve this problem. It is simple to change chr1 to 1 for the chromosomes in the interval file, but it would be problematic for others like chr6_apd_hap1, chr6_cox_hap2, chrUn_gl000211, chr1_gl000191_random, etc. Or is there any other website for generating exome.intervals based on b37 build?

ERROR MESSAGE: File associated with name exome_intervals.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary

I don't know the non-chromosomal contigs of hg19 well enough to really answer that. I think they're not represented by equivalents in b37 but I could be wrong. Do you actually need those targets in your analysis? What intervals did the TCGA study use?

Thanks a lot for your information. I am also considering that maybe the non-chromosomal contigs are not as important, and I plan to remove them from the interval file to give it a try. I do not know what intervals the TCGA study use. Thank you!

Hi there! I have a similar problem to the one rxy712 had, but I'm afraid I can't do without the -L argument. I'm performing an analysis on exome sequencing data, so I need to restrict it to the target regions. My problem is that I had to use the b37 build for the alignment, but the .bed file of target regions (provided by Agilent) is in the UCSC format, so the program gives me an error of no overlapping contigs. Is there a way to rename my .bed file contigs to match the b37 reference? In case there isn't, if I just decide not to use the -L argument, will there be consequences for my analysis? Thank you!
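For the primary chromosomes, a simple rename is enough. A sketch of a sed-based conversion from hg19-style to b37-style names might look like the following (file names are examples). Two caveats: haplotype and unlocalized contigs such as chr6_cox_hap2 have entirely different names and cannot be fixed this way, and the hg19 chrM and b37 MT sequences actually differ, so mitochondrial intervals should not simply be renamed either.

```shell
# Example hg19-style BED (tab-separated).
printf 'chr1\t100\t200\nchrM\t5\t50\n' > hg19_targets.bed

# Rename chrM -> MT first, then strip the 'chr' prefix from the rest.
# (Illustration only: see the caveats above about chrM and non-primary contigs.)
sed -e 's/^chrM/MT/' -e 's/^chr//' hg19_targets.bed > b37_targets.bed
cat b37_targets.bed
```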

Hello, Sheila! That was a great help, thank you. Actually, before your reply, I tried the sed command without much hope, but it seems to have worked well. Now everything is running smoothly. I don't have mitochondrial contigs in my data, but I'm not sure what you mean by unmapped contigs. Is it a separate file that should come out of BWA? But anyway, if I understand correctly, in the case I had unmapped contigs it would not be sufficient to use the sed command and I would still be stuck, am I right? Thanks again!

Unmapped contigs are parts of the genome sequence for which people have not yet figured out where they belong on the chromosomes. They are put in the reference build as "decoy" contigs that collect reads that do not map anywhere else on the chromosomes, to prevent them from getting mapped with low confidence to the canonical chromosomes. There are versions of the reference build that include them and some that don't.

It sounds like you do not have unmapped contigs. There is some evidence that having them helps a little, but not having them is not a big problem either.

Hello!
Thank you very much for your explanation. Actually I'm using the hs37d5 reference genome, which should contain also the decoy contigs. How can I find out if I have unmapped contigs in my dataset? when I downloaded my intervals list for the SureSelect All Exons Kit v5 they didn't give any info about unmapped contigs, so I guess they shouldn't be present in my sequences, hence whether I have the decoy or not in the reference genome should not make a difference for me, isn't it?

You are going to have non-specific hybridizations (i.e. not all of your sequences are going to map within your targets/baited intervals; the percentage depends on your capture efficiency). The decoy mainly affects non-coding regions, but there are a few exceptions. CDC27 is one. I believe a few of the MUC genes are as well (some of the reads will end up mapping to the decoy sequence as opposed to being forced onto CDC27, MUC, etc.). So even though they are not targeted by your capture product, it's not a bad idea when aligning your sequencing data to use the hs37d5 reference, which has the unlocalized contigs and decoy sequence, as they act as a sink for non-specific reads and remove false positive variants in those genes.

Hi all, before I read this topic, I used -L in my PrintReads step, which made the rate of mapped reads in my recalibrated BAM very low. I think this is the so-called "lost data"; after I removed -L, the result seems to have become normal. Many thanks to all of you. But I am still not very clear on why this could happen. Why is data lost when I use -L in PrintReads?

I have a few questions about the usage of -L. Currently, I use version 3.2 on exomes.

In order to speed things up, I run HaplotypeCaller in GVCF mode and GenotypeGVCFs on split chromosomes, and then merge the results into sample.g.vcf and sample.vcf, respectively. I use -L chr (for example, chr1) when running HaplotypeCaller to get more variants, and then -L chr1.bed when running GenotypeGVCFs. The reason I use -L chr1 is that I would like to get all variants, even though I know that sites called outside the target region may have a high false-positive rate. The results from this step may be used together with other data captured by a different chip. And -L chr1.bed (which is exactly the target region) is to get more accurate genotypes.

My question is: would this combination of commands produce the same result as using -L chr1.bed for both HaplotypeCaller and GenotypeGVCFs? And would it be the same as the result produced by running on the per-sample BAM (rather than per-chromosome)?

The problem with using -L on GVCFs that were generated without an intervals file is that if you have an interval that starts in the middle of a reference block, the entire block will be skipped, which is going to cause missing records. It would be better to run GenotypeGVCFs on the entire GVCFs, then subset the sites of interest using SelectVariants. Including regions outside the capture targets will not affect the calls within the targets.

I have just discussed the usage of -L with my colleague, but we disagree on some points, and we would like to clarify them. Would you please give us more detail on several questions?

What does "the reference block" refer to?

Say we have three exome datasets. For exome1 and exome2, there is no variant in chr1:1-5000, and exome3 has a SNP (A->G, for example) at chr1:3000. If the capture chip we used did not cover chr1:1-5000, would this record (chr1 3000 . A G) be missing from the final output of GenotypeGVCFs? Still using -L chr1 for HaplotypeCaller in GVCF mode and -L chr1.bed for GenotypeGVCFs.

If the record (chr1 3000 . A G) is emitted in the final VCF generated by GenotypeGVCFs, what would the record be like for exome1 and exome2?

In the GVCFs, you will get records representing every single site that is included in your intervals file. If there is no usable data at a site, the record will be a no-call. In the final joint-genotyped VCF, you will only get records at variant sites unless you enable the option to emit non-variant sites.

Hello, I have a question about the -L flag. Since you mentioned that the coordinates are zero-based, for the first ten thousand bases should I be using something like -L chr1:0-10000? When I tried this, HaplotypeCaller gave me an error message: Badly formed genome loc: Parameters to GenomeLocParser are incorrect: The start position 0 is less than 1. My understanding is that if I use -L chr1:1-10000, it will miss the first base. How should I specify the interval in this case? Thanks for any help!

The reason that I ask is because I have created a consensus intervals file, which contains the intervals in common between 3 exome capture arrays (should be equivalent to using -L for each intervals file and using -isr INTERSECTION). I have done this so that I can accurately compare rates of variation across samples prepared with different arrays. My concern is that if I use interval padding, I will bias some capture arrays over others. I.e. a 'larger' capture array may have more reads spanning the intervals in the consensus file... At the same time, I don't want to lose reads unnecessarily as this may affect variant calling at exon boundaries. Any thoughts?

Hi,
I have a specific question about which BED file I should use with the -L option. We are doing whole exome sequencing on the Illumina platform, and the exome capture is performed using the Agilent SureSelectXT V5 capture kit by the service provider. From the Agilent website I have downloaded the S04380219_SureSelectV5+5UTR folder, which contains four different BED files and two text files, namely:

S04380219_AllTracks.bed

S04380219_Covered.bed

S04380219_Padded.bed

S04380219_Regions.bed

S04380219_Targets.txt

S04380219_Probes.txt

Now I am confused about which of these BED files is the appropriate intervals list to pass with the -L option. Can you suggest which file I should use in my analysis?

I don't think that should be a concern. When you use the new joint calling pipeline, if any sites in a sample don't have coverage but some other samples have coverage and a variant call, the sample with no coverage will show a no-call so you know there was no information there. https://www.broadinstitute.org/gatk/guide/article?id=3893

I am looking to find out which reads exactly are filtered out when I use -L. Is filtering based on whether the starting position of a read is outside the intervals in the interval_list? To use AJWillsey's example above: if an interval spans chr1:150-300, would the read chr1:100-250 be filtered out? Would read chr1:299-449 not be filtered out? What about read chr1:300-450?

Is there any documentation (or answered forum question) on this topic that you can point me to?

@Ivo The rule is different for reads and variants. When working with reads, any reads that extend into the intervals specified with -L will be included, even if they start before the interval. When working with variants, only variants that have their start position within the interval will be included.

I have to apologize. I misled you earlier. It turns out that as long as any part of a read is contained in the interval you specify, it will be used by the tools. I told you earlier that it would not be. Sorry.

I see the importance of padding for RealignerTargetCreator and BaseRecalibrator, which could be useful when indels are at the border of an exome target capture region.

You also mentioned that "if you do use it, you should do it consistently at all steps where you use -L". If I use padding for UnifiedGenotyper or HaplotypeCaller, wouldn't I end up with insignificant extra mutation calls in my VCF which are not actually covered by my exome target capture region?

Hi,
The data I have is whole genome sequencing, but now I'm only interested in around 100 genes.
I am applying GATK to an organism that doesn't have a SNP database like dbSNP for humans. So I'm using the recommended pipeline: 1st round variant calling --> use hard filtering to manually get high-confidence variants as a gold standard --> base recalibration --> 2nd round variant calling. My question is:

In the 1st round variant calling step, can I set -L? If -L is set, I will only get variants in those targeted genes' regions; in this case, will the recalibration of the reads covering the target regions be affected? Thanks

If you have WGS data, it is best to not use -L. BaseRecalibrator depends on many reads, and it will not perform well if you don't have enough data. You can do the pipeline without using -L; once you have finished the pipeline, you can use SelectVariants to select the intervals you are interested in.

@Sheila
Thanks for your reply. The problem is that the organism we are using has over 50,000 scaffolds, so HaplotypeCaller is very slow. Based on your answer, if I extract the scaffolds of interest as the reference, I would still have the same problem, because only a small portion of reads would map, correct? I guess the way to improve the speed is to parallelize the scaffold analysis and then merge the results.

For the pre-processing steps, you just can't get around base recalibration's need for big data, so you need to give it all the data at once. Once you have finished pre-processing everything, then sure, you can run HaplotypeCaller with -L.

If even the pre-processing steps are slow because of the number of scaffolds in your reference, then you may want to consider assembling some artificial scaffolds to reduce the number of separate contigs you are working with.

@Geraldine_VdAuwera
Thanks for the suggestion. I wonder if I can just merge all the scaffolds into one long scaffold separated by Ns? I can make the stretch of Ns longer than the allowed gap length in bwa so it will not be considered an insert, and I can also change the gff file. Is there any downside to doing this?

Yes that's essentially what I meant, although you may want to consider making a few separate contigs as opposed to a single one. IIRC the BAM index format imposes a limit of 512MB for a single contig sequence.

Hi,
I'd like to know if you have any experience or recommendation on using -L together with variant calling on human RNA-Seq data. I know that there are probably still some unannotated genes or isoforms in the genome which I would discard with such a list, but I actually don't care about those atm. Besides that it seems reasonable to me to provide a "gene-interval-list".
Any thoughts?
Cheers

Thanks for the explanation. I'm just wondering about doing so in VQSR, because in the best practice it is stated that "...it requires a large callset (minimum 30 exomes, more than 1 whole genome if possible)." I also notice in some previous discussion, you did seem to be opposed to the idea of using a single chromosome. So, to my understanding, it sounds like if I use the -L option, but input the WGS dataset, it's okay; but if I only input oneChromosome.bam to VQSR, it is biased. Is that correct please? Thank you very much.

@mscjulia, it all depends what's in your interval list. It could be just one gene out of the entire genome. Or it could be everything in the genome except the really troublesome regions. Depending on that the conclusion is completely different. Basically you need to maximize the number of variants that will be available for building the VQSR model.

@Sheila @Geraldine_VdAuwera I have a question in relation to the -L option. Recently I was working with ~1500 exomes; I followed the Best Practices guidelines, used the -L option to specify the target intervals of the exome capture kit, and generated the g.vcf files. Then I split the g.vcf files chromosome-wise using the -L option and finally combined them into batches of ~50 multisample g.vcf files. So I used chromosome numbers to split them and then used the capture intervals during the GenotypeGVCFs step. My question is: is it OK to simply use chromosome numbers to split the g.vcf files, or should I split the capture interval file into chromosomes and use those to split the g.vcf files with the SelectVariants walker?

Have a look at this article. If you called variants on only the capture regions, only those regions will be in your GVCFs. If you want to save time, you can run per chromosome, but you don't need to specify the capture regions anymore.

Because you generated your g.vcfs using the capture intervals, only those genomic intervals should be present in your variant and reference blocks. It is safe for you to simply use -L with chromosome numbers at the GenotypeGVCFs step to generate your per-chromosome multisample VCFs. This should take the same time to process as using your more defined interval list.

If you are using GRCh38, and depending on the version of GATK3 you are using, I may have an additional comment. So let me know.

@Sheila Thanks for the message. Just to confirm, you meant that it's not necessary to specify the regions (-L) in the GenotypeGVCFs step, right? @shlee Thanks for the message. You are right, I am using GATK version 3.6, and for the reference I am using GRCh37 for now. But I am interested to know the additional comment you have, in case the reference genome is GRCh38.

@Sheila Thank you. Just for the record, here's what I did. After obtaining the individual g.vcf files for each of the 1500 exomes, I split the g.vcfs into individual chromosomes (1-22, X, Y, MT) and combined them chromosome-wise in batches of 50, then combined all batches into per-chromosome multisample g.vcf files. Finally, I genotyped them chromosome-wise and combined the VCFs into a single file. This approach was really fast. Thanks again.

The same arguments were used to obtain the two VCF files (except, of course, for the -L argument).

I don't understand why I get two different GTs by changing the interval. What's more, when I try the same method but with ERC GVCF and GenotypeGVCFs, the position isn't detected anymore in the VCF file for the region of +/- 150 bp around chr1:32372389 (it is still present in the VCF file produced by looking specifically at position chr1:32372389). I suppose GenotypeGVCFs drops position chr1:32372389 since its genotype is 0/0 when detected in an interval, but I am not sure this explanation is correct and I would like your opinion.

The intervals affect the reads that are used in the calling. So, inputting different intervals can cause some different calls, however, those calls are usually edge cases.

In your case however, HaplotypeCaller needs some padding around the site to be able to call variants properly. We usually set -ip 100. Remember, HaplotypeCaller does a reassembly step, so having more reads in the active region allows us to get a better reassembly.

First, thanks for your answer and sorry for the delay! I launched HaplotypeCaller with the -ip argument set to 200, but it didn't result in a genotype of 1/1 for position chr1:32372389. I still have:

chr1 32372389 . C . . . GT:AD:DP:GQ:PL 0/0:38,287:325:0:0,0,0

I had a look at the input BAM file, and the position is in the center of an exon (chr1:32372389 corresponds to the green covered position).

Furthermore, when I look for the specific positions I want (3595 positions in total), either by specifying each position or an interval around it, I get the following result: the wider the interval, the fewer variants I get.

You can find, for instance, the Venn diagram between the positions I am looking for, the positions detected locally (the blue/grey circle at the center), and the ones detected with an interval (here +/- 150 bp, corresponding to the "global" circle).

When I look at the same diagram for an interval of +/-10 bp I get

So for the +/- 10 bp interval I get 522 of the variants I am looking for; for +/- 150 bp I get only 441 of them. Note that I deleted the indels before producing the Venn diagrams.

I didn't include all the diagrams, but I did the same for +/- 50 bp and +/- 1000 bp and always observe that the larger the interval, the fewer variants I detect. There is, however, no proportionality between the size of the interval and the number of variants lost (when I go from +/- 50 to +/- 1000 I lose 30 variants, and when changing from +/- 50 to +/- 10 I gain 56 variants).

I suspect your coverage is falling off towards the end of your padded region. Your site of interest, chr1:32372389, is in the 3' UTR of an RNA transcript. Not only that, it is immediately adjacent to a ~50 bp low-complexity sequence region that repeats the dinucleotide sequence AC as shown via IGV below:

The transcript is on the minus strand. So 5' to 3', your site of interest comes first, then the ~50 bp low-complexity region, and finally the annotated end of the transcript ~300 bp downstream at chr1:32372020.

Even if your library originated from DNA, this region poses a challenge for calling variants because of the low complexity region. But your library is RNA and so any issues are exacerbated. For example,

The location of polyA addition isn't exact within cells, and varies between tissues (e.g. cell line vs. primary tissue) and cellular states (e.g. quiescent vs. dividing), so some proportion of transcript ends could be well upstream (or even downstream) of the annotated end.

Degradation of transcripts from the 3' end is common. I'm not just talking about RNA degradation from sample preparation. I'm saying transcript turnover is actively going on in cells. What we capture via RNA-seq is some equilibrium snapshot. I'm certain somewhere out there the literature records the relative turnover rate for PTP4A2 for some context.

Compounding these issues are

library preparation methods whose coverage falls off in terms of capturing the very ends of transcripts,

higher rates of PCR error for such regions no matter the complexity but especially for low-complexity regions and their surrounds,

and the ambiguity of mapping reads containing such low-complexity sequence. This is what the equivalent region of your site of interest looks like in GRCh38:
Notice that the dinucleotide repeat region is in lowercase. I believe this means that you will find many regions in the genome with this sequence pattern, which is not surprising.

For these and other reasons, variant calling on RNA-Seq data comes with this message. My tentative suggestions are (i) that you limit variant calling to regions that exclude the transcript edges and only include those regions for which your sample exhibits even coverage per transcript; and (ii) when it comes to low-complexity regions, be wary that any variants found in such regions will always be suspect, even with DNA-seq. Finally, consider leveraging sequence data originating from a PCR-free DNA library if calling variants for such challenging regions is important to your research.

I have whole exome sequencing and targeted panel sequencing samples that may require the -L argument. If I run the alignment and HaplotypeCaller steps without the target intervals, would the omission of interval restriction have any meaningful impact on the risk of false positives/negatives? I would only consider exonic variants within the known sequenced interval in the results.

One step I anticipate being affected is the BQSR step, since off-target sequencing can skew the quality scores and influence the recalibration. As for VQSR, I would be using hard filtering instead.

@gracec You're correct that in terms of results, BQSR is the only step that would really be affected. But you'll find that restricting the analysis to specific intervals will also speed up execution a little.

We always run on intervals because we use them as a basis for parallelization, but that's a pipeline implementation detail.

I have a question: how can I treat each interval separately? For example, I want to calculate the mapping depth of chr1:1-1000 and chr1:500-1500. When I use -L with otherwise default parameters, the two intervals are merged together as chr1:1-1500 by DepthOfCoverage. Thanks.
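One workaround for overlapping intervals like these is to run each interval in its own invocation, so the results are reported separately. A sketch (the GATK command is commented and illustrative, with placeholder paths; the output-prefix naming scheme is just an example):

```shell
# Run each interval separately; derive a distinct output prefix per interval.
for ivl in chr1:1-1000 chr1:500-1500; do
    out="depth_$(printf '%s' "$ivl" | tr ':' '_')"
    printf '%s\n' "$out"
    # java -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
    #     -R reference.fasta -I sample.bam -L "$ivl" -o "$out"
done
```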

Hello @Sheila and @Geraldine_VdAuwera, I did not clearly understand this sentence: "Small targeted experiments:
The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets." I have data from targeted sequencing (~500kb) of 150 individuals. Are you suggesting not to use the -L flag with BaseRecalibrator (BQSR), or not to run the BaseRecalibrator step at all for this type of data? I really appreciate your support. Thank you.

To clarify, does the table presented in this post apply to both whole genome analysis and whole exome analysis? It is difficult to tell, since the table is introduced under the Whole Exome Analysis header. Thanks!

Using the -L option to parallelize various GATK commands that output BAM files (such as PrintReads, RealignerTargetCreator and IndelRealigner) across all chromosomes can speed up runtime in a pre-processing pipeline. Do you recommend not using the -L option for such commands even when executing the commands for all chromosomes and merging the results? Thanks!

Hi all,
I am following the Best Practices suggested by the Broad Institute to call variants from whole exome sequencing data. Currently, I am using Mutect2 to call variants from tumor and normal samples against the latest reference genome, GRCh38.
But, I don't have interval list to use -L option.
Is there any way to generate an interval list from the exome samples I have? Or is there a default interval list for exome data?
Thank You
Snijesh
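For what it's worth, the usual answer is that the capture kit manufacturer supplies the target BED file, and -L accepts BED files directly. If you prefer a plain .list file of interval strings, the only subtlety is coordinates: BED is 0-based and half-open, while GATK interval strings are 1-based and inclusive. A minimal sketch of that conversion (the helper name is mine, not a GATK utility):

```python
# Convert BED lines (0-based, half-open) into GATK-style interval
# strings (1-based, inclusive) suitable for a .list file passed to -L.
def bed_to_gatk_intervals(bed_lines):
    intervals = []
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and BED headers
        contig, start, end = line.split("\t")[:3]
        # BED start is 0-based, so add 1; BED end is exclusive, which
        # equals the 1-based inclusive end, so it is used unchanged.
        intervals.append(f"{contig}:{int(start) + 1}-{end}")
    return intervals

print(bed_to_gatk_intervals(["chr1\t999\t2000", "chr2\t0\t500"]))
# -> ['chr1:1000-2000', 'chr2:1-500']
```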

The table is for all types of analysis. The different ways/reasons to use -L are described under "So here’s a little more detail for each experimental design type."

Do you recommend not using the -L option for such commands even when executing the commands for all chromosomes and merging the result?
Are you asking if you can use -L per chromosome and then merge the per-chromosome files together? If so, that is fine. We were recommending not using -L on the entire BAM file and outputting only the intervals of interest, because you would lose the reads that are not included in your intervals of interest.

I noticed that you have updated this page with newer information. In the whole exome analysis section, you removed RealignerTargetCreator and IndelRealigner from the tool/-L/why/why-not table. Do you suggest not using -L in the RealignerTargetCreator/indel realignment step, or are there other concerns? I am working on 70-gene targeted sequencing and want to know if I can skip both the indel realignment and BQSR steps. Thank you very much, and happy new year!

Hi, I want to run a test of my exome sequences (~70 samples) using the pipeline I have set up. I pass my BED file to the -L argument to run the analysis. However, is it possible to run just a smaller subset of the data, e.g. 2 million bases or a single chromosome? Would I change the -L input from the BED file to a range or chromosome, or do I still need that BED file in there? Thank you.

I am assuming you are asking if you can use a different BED file/different intervals than the exome intervals you have been given by the sequencing company. In that case, yes, you can input a different interval file using -L. If you want, you can run with -L chr1 -L chr2... Have a look at this thread as well.

@shlee said:
For these and other reasons, variant calling on RNA-Seq data comes with this message. My tentative suggestions are (i) that you limit variant calling to regions that exclude the transcript edges and only include those regions for which your sample exhibits even coverage per transcript; and (ii) when it comes to low-complexity regions, be wary that any variants found in such regions will always be suspect, even with DNA-seq. Finally, consider leveraging sequence data originating from a PCR-free DNA library if calling variants for such challenging regions is important to your research.

I am curious about your thoughts on using -L in Mutect2/HaplotypeCaller (and BQSR?) for RNA-seq variant calling. This post and several others seem to suggest supplying an interval list specific to regions with sufficient coverage (dependent on the specific BAM file, i.e. computationally determined using DepthOfCoverage or DiagnoseTargets).

However, the (current version of the) RNA-seq Best Practices pipeline (WDL on GitHub) feeds a generic interval list (Homo_sapiens_assembly19_1000genomes_decoy.whole_genome.interval_list) to HaplotypeCaller (presumably for optimized scatter-gather) and no interval list to BaseRecalibrator/ApplyBQSR.

Do RNA-seq variant calls improve with a (stringent) interval list, or is there no benefit? Is there any difference in results between running Mutect2/HaplotypeCaller with a stringent interval list versus calling without one and imposing the stringent interval list on the variants after calling? Is there a reason this is not currently part of the RNA-seq Best Practices WDL file on GitHub?

I am curious about your thoughts on using -L in Mutect2/HaplotypeCaller (and BQSR?) for RNA-seq variant calling. This post and several others seem to suggest supplying an interval list specific to regions with sufficient coverage (dependent on the specific BAM file, i.e. computationally determined using DepthOfCoverage or DiagnoseTargets).

It's been my recent experience that for our assembly-based callers, calling on the entirety of the data without restriction and then applying an intervals list post-calling to the VCF allows for consistent calling from exome data. It also allows capture of off-target alignments, as can occur for targeted exomes, especially in somatic samples where fusion events are not uncommon. Otherwise, you must be sure that the intervals list you apply allows for sufficient coverage around the target sites for reassembly; your exome kit provider should have supplied this, and you may additionally pad the intervals with bases before applying them, or on the fly with the -ip parameter. It seems a better idea to capture calls from all of the alignments from the get-go and then filter for target regions of interest with the intervals list afterward, at least for a somatic case. It is up to researchers to decide what is best for their research aims.
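The padding mentioned above can be sketched as follows (a rough illustration of the idea behind -ip/--interval-padding, not GATK's implementation; names are mine): each target interval is widened on both sides and clipped to the contig bounds.

```python
# Pad 1-based inclusive (contig, start, end) intervals on both sides,
# clipping to [1, contig length] -- roughly what interval padding does.
def pad_intervals(intervals, padding, contig_lengths):
    padded = []
    for contig, start, end in intervals:
        padded.append((contig,
                       max(1, start - padding),
                       min(contig_lengths[contig], end + padding)))
    return padded

print(pad_intervals([("chr1", 150, 400)], 100, {"chr1": 450}))
# -> [('chr1', 50, 450)]
```

Note that after padding, neighboring targets may overlap and get merged into larger intervals downstream.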

However, the (current version of the) RNA-seq best-practice pipeline (WDL on Github) feed a generic interval list (Homo_sapiens_assembly19_1000genomes_decoy.whole_genome.interval_list) to HaplotypeCaller (presumably for optimized scatter-gather) and no interval list to BaseRecalibrator/ApplyBQSR.

I am unfamiliar with this particular pipeline. Are you certain this Homo_sapiens_assembly19_1000genomes_decoy.whole_genome.interval_list intervals list isn't just a breakdown of non-N regions of the genome for the purposes of parallelizing compute? Take a look at this pipeline description. As you see, the pipeline parallelizes BQSR over genomic intervals, e.g. chr1 and chr2, for faster compute.

I wrote this article 4 years ago when I first started, and I agree we need to update it. I think Soo Hee was referring to her analysis with somatic data, but I will sync up with her and see how best to update this. Also, have a look at this thread.

@aneek I noticed you had asked a question about which BED file you need to use to detect variants.
Did you ask Agilent? Could you tell me which BED file they suggested? Is it the Padded.bed file? Many thanks.

Hello, we are using data from wild populations of a non-model organism. Currently our 'reference' genome is millions of contigs, which we've reduced into a 'pseudo-reference' (contigs stitched together with poly-nucleotide spacers to form a few thousand 'scaffolds'). My first question is whether I need to run HaplotypeCaller on all of the contigs simultaneously (also specifying the intervals of the contigs), or whether I can parallelize HaplotypeCaller as separate calls to produce as many GVCF files as I have contigs. Is there more information gained by specifying multiple contigs at once to HaplotypeCaller? It seems clear that I can parallelize the genotyping phase by interval, but it is unclear whether the same holds for the HaplotypeCaller phase. Thanks!

@Geraldine_VdAuwera, you said: "it all depends what's in your interval list. It could be just one gene out of the entire genome. Or it could be everything in the genome except the really troublesome regions. Depending on that the conclusion is completely different. Basically you need to maximize the number of variants that will be available for building the VQSR model."

So I wanted to know if there's any reported threshold of reads, bases, or number of variants for performing VQSR. I have about 70-100 low-coverage (5-10X) 1000 Genomes WGS samples, and I've used HaplotypeCaller (-ERC GVCF) with -L to extract some 100+ genes of interest, primarily to reduce the runtime of HC. I think I have about 13 Mbp. I want to know if VQSR will (a) even work and (b) build a good enough model for inference.

Logically, I don't see why it shouldn't be able to extrapolate to regions given corresponding sites in truth/training sets even if I restrict my analysis to intervals.

I'm using GATK 4.0.12, Java 1.8, Illumina paired end sequencing.

PS Sorry if I'm spamming all over the place with similar issues but I'm in a bit of a rush.

Hi there,
It is not clear to me how to use the -L argument (working on WES) with GenomicsDBImport.
I'm running HaplotypeCaller with the -L option using the target BED.

My question is about GenomicsDBImport: if -L is given the same target BED as HaplotypeCaller, it takes a long time to run, since it creates a folder for EACH region in the GenomicsDB workspace. It is faster if -L is a list of all chromosomes (1-22, X, Y), but I am not sure whether the latter is wrong, since it does not correspond to the original intervals used in HaplotypeCaller. On the other hand, the gVCF files being loaded into the database were generated using that target BED, so it seems it is not wrong to use 1-22, X, Y.
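One common workaround for the many-partitions problem is to collapse the many small capture targets into a handful of larger intervals, e.g. one spanning interval per chromosome, before passing them to GenomicsDBImport (recent GATK 4 releases also offer a --merge-input-intervals option intended for exactly this situation, if your version supports it). A minimal sketch of the collapsing step (the helper name is mine, not a GATK utility):

```python
from collections import defaultdict

# Collapse many small capture targets into one spanning interval per
# contig, to keep the number of GenomicsDBImport partitions small.
def spans_per_contig(targets):
    """targets: iterable of (contig, start, end), 1-based inclusive."""
    bounds = defaultdict(lambda: (float("inf"), 0))
    for contig, start, end in targets:
        lo, hi = bounds[contig]
        bounds[contig] = (min(lo, start), max(hi, end))
    return {c: (int(lo), hi) for c, (lo, hi) in bounds.items()}

print(spans_per_contig([("chr1", 100, 200), ("chr1", 5000, 5100), ("chr2", 10, 60)]))
# -> {'chr1': (100, 5100), 'chr2': (10, 60)}
```

This is safe for import because the gVCFs only contain data inside the original targets anyway; the broader spans just determine how the database is partitioned.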