
Does MergeVCFs have a limit to how many gvcf files it can merge?

I am running the GATK Best Practices pipeline on FireCloud on 12 pooled WGS samples (50 individuals per barcode, one barcode per pool, but I'm using a ploidy of 20), using GATK4 and a WDL adapted from the examples on GitHub: github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/haplotypecaller-gvcf-gatk4.wdl

I have tried running haplotypecaller-gvcf-gatk4.wdl on one pool (I added arguments to the HaplotypeCaller command to adjust for our data, plus -L with a master list of all the scaffolds in our reference genome). The HaplotypeCaller step works and finishes, but the MergeVCFs step hasn't finished after 5 days. I have 9,000 scaffolds that each have a gVCF file after HaplotypeCaller; is there some limit to the number of gVCF files that MergeVCFs can combine?

I would like to have one gVCF for each pool after the HaplotypeCaller to be able to use JointGenotyping to get a final VCF with all 12 pools.

Is there a better tool or a different way to combine the gVCFs after HaplotypeCaller to get a single gVCF per pool for JointGenotyping? I have seen forum posts from 2015 suggesting that CatVariants might be a good solution, but I don't know whether it is appropriate for this pipeline.


Hi @AdelaideR
Thank you for helping me out! I did use -ploidy 20 in my HaplotypeCaller command. I have a true ploidy of between 92 and 96 per pool, but what I have learned from other posts on the forum is that anything more than 21 is an overwhelming computational load. I might switch to ploidy 10 if that would help get genotypes.

The WDL that I used is the last and largest chunk of code at the bottom of this page. It is basically the exact one posted on GitHub (haplotypecaller-gvcf-gatk4-nio.wdl), but I added more arguments to HaplotypeCaller and also changed the names of the input and output files for the MergeVCFs command. (Maybe my changing the input and output names is what went wrong?)

There was no explicit error message for MergeVCFs not working, but when I try to look at the error logs, I get this message:

It never finished MergeVCFs; after 4 days I aborted it, and there were no files in the Google bucket except for the script. These are the last couple of lines in the workflow.log before I aborted MergeVCFs:

# We need disk to localize the sharded input and output due to the scatter for HaplotypeCaller.
# If we take the number we are scattering by and reduce by 20 we will have enough disk space
# to account for the fact that the data is quite uneven across the shards.
Int potential_hc_divisor = length(scattered_calling_intervals) - 20
Int hc_divisor = if potential_hc_divisor > 1 then potential_hc_divisor else 1
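For context, the divisor computed above typically feeds a per-shard disk request. A hypothetical continuation (the variable names here are illustrative, not necessarily what the published WDL uses) might look like:

```wdl
# Hypothetical: each HaplotypeCaller shard requests roughly 1/hc_divisor of
# the total input size, plus fixed padding for reference and output files.
# With 8310 intervals, hc_divisor = 8310 - 20 = 8290.
Int hc_disk_size = ceil(size(input_bam, "GB") / hc_divisor) + 20
```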

I moved on from trying to get MergeVCFs to work and have been troubleshooting CatVariants to combine all the gVCF files that were generated by HaplotypeCaller. Though MergeVCFs didn't work, the HaplotypeCaller step did finish for the sample, producing 8310 gVCF files, one for each of my linkage groups or scaffolds.

Because the syntax for CatVariants is so strange (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_CatVariants.php), I am having a lot of trouble getting it to work.

There are a lot of errors each time I try to run it, and I adjust something in the code each time. The main issue is that I don't understand how to run the command on FireCloud, since it bypasses the GATK engine (this part of the command seems to throw errors: java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants).

The second issue is that there seems to be a limit to the number of files you can combine, though the documentation says there isn't. I always get this error, even though I am trying to combine subsets of the gVCF files, not all 8310:
/cromwell_root/script: line 24: /gatk/gatk: Argument list too long

How can I combine the 8310 gVCF files I have already generated, and going forward, how do I get the MergeVCFs command to work in the example GATK pipeline posted on GitHub (haplotypecaller-gvcf-gatk4-nio.wdl)?
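One way around an over-long argument list (the `Argument list too long` error above comes from the shell, not the tool itself) is to pass a single file of paths instead of thousands of -I flags. If I remember correctly, Picard-style GATK4 tools such as MergeVcfs accept an input file ending in `.list` that contains one path per line. A hedged WDL sketch (the task and variable names are mine, not from the published workflow):

```wdl
task MergeGVCFsFromList {
  Array[File] input_gvcfs
  String output_basename

  command <<<
    # write_lines() turns the array of gVCF paths into a file, one per line;
    # renaming it to end in ".list" should let Picard-style tools read it as a
    # file-of-filenames rather than a single VCF (assumption: verify in docs)
    cp ${write_lines(input_gvcfs)} inputs.list
    gatk MergeVcfs \
      --INPUT inputs.list \
      --OUTPUT ${output_basename}.g.vcf.gz
  >>>
  output {
    File merged_gvcf = "${output_basename}.g.vcf.gz"
  }
}
```

Because the shell never sees 8310 separate arguments, this sidesteps the argument-list limit entirely.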

I have taken another look at the original issue that I was having with MergeVCFs, and I think I probably did not give it enough time to run, considering the number of gVCF files it was merging. When I looked again at the workflow log file, I could see that for two days job results were being received and occasionally HaplotypeCaller was restarted:

If that is the case, then I have re-started this WDL workflow on another sample. I'm going to let it run for a week and see if it finishes running.

Good call, this job is very large and will require a considerable amount of time.

Is there a way to improve the run time without racking up huge costs?

One option that isn't already implemented in haplotypecaller-gvcf-gatk4.wdl is NIO (a "protocol called NIO that allows us to only retrieve the piece of the file we care about for a given task, which neatly blows away the localization problem and delivers substantial savings"). You can alter aspects of the workflow to improve runtime at the expense of cost, such as reducing preemption or the amount of scattering, and then offset that increase in cost by using NIO. This is more of a theory than an answer. There is already an NIO version of the HaplotypeCaller workflow in the repo you mentioned, called haplotypecaller-gvcf-gatk4-nio.wdl.

One thing I noticed is the large amount of scattering taking place during an individual run. The workflow bases the number of times the HaplotypeCaller task is scattered on the number of intervals in the provided interval list; I can't view your interval list, but I assume it has over 3000 intervals. This is one way of scattering the HaplotypeCaller task, but it can be overwhelming when thousands of intervals are used to spin up thousands of jobs. Another approach is to divide the intervals evenly into a set number of sub-lists, as is done in the five-dollar-genome-analysis-pipeline using the Utils.ScatterIntervalList task. You could incorporate that task into your workflow so that you aren't running 3000+ scatter jobs.
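Structurally, that suggestion could look something like this in the workflow body (a sketch only; the task and variable names are placeholders, not the exact five-dollar-pipeline code):

```wdl
call ScatterIntervalList {
  # one input list of all 8310 scaffolds, split into e.g. 50 balanced sub-lists
  input: interval_list = scaffold_intervals, scatter_count = 50
}

scatter (sub_list in ScatterIntervalList.out) {
  call HaplotypeCaller {
    input: input_bam = input_bam, interval_list = sub_list
  }
}

# only ~50 shard gVCFs to merge at the end instead of 8310
call MergeGVCFs {
  input: input_vcfs = HaplotypeCaller.output_gvcf
}
```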

The FireCloud run that I started last week finally finished, but with an error. I was running just HaplotypeCaller (which finished with no errors) and then MergeVCFs, which failed with the error message: The job was aborted from outside Cromwell

The workflow.log has just these couple of lines relating to the MergeVCFs command:

I did not cancel the run; do you think it was cancelled because I reached some FireCloud or GCP limit, maybe because there were 8310 gVCF files to merge and after a certain point the job was automatically cancelled? Do you have any ideas about what could have caused the run to cancel?

I'm going to try using the Utils.ScatterIntervalList task from the five-dollar-genome-analysis-pipeline for the next pool to reduce the number of gvcf files that need to be merged and hopefully to avoid this in the future. But I now have two pools that have 8310 gvcf files that need to be merged into one. Can you recommend a way to do that?

Another tool for merging gVCFs is GenomicsDBImport, but try testing your updated WDL with the Utils.ScatterIntervalList task first. You can then decide whether it's worth spending time on GenomicsDBImport or just rerunning the data with the updated WDL.

I think I can't use the GenomicsDBImport because I have a ploidy of 20, which is a shame because it definitely seems like the easiest route.

Looking at the Utils.ScatterIntervalList task, I was confused about what it actually does and how best to apply it to my job. I have 8310 scaffolds (I can't stitch them together because I have the aligned BAM files, not the raw reads, so I can't realign to an altered version of the genome). Do I use the intervals list (-L) in the HaplotypeCaller command and then also the Utils.ScatterIntervalList task? It seems like Utils.ScatterIntervalList takes my intervals and splits them up even further, which does not solve the problem of how to combine all 8310 resulting gVCFs after HaplotypeCaller, but I might just not be understanding it. There is no way to have intervals that span multiple scaffolds, right? So could I use no intervals (-L) in HaplotypeCaller but then scatter the job with Utils.ScatterIntervalList? And if I go that route, I still need to provide Utils.ScatterIntervalList with a list of intervals, which would be my 8310 scaffolds, right? And that is OK, because it can hopefully gather all 8310 gVCF files together using Utils.ScatterIntervalList instead of MergeVCFs, which times out.

I would like to use GenotypeGVCFs, and I have in some tests that were smaller than these full pools. The issue I run into is that GenotypeGVCFs can take multiple gVCF files, but aren't they each intended to be discrete samples? I have 8310 files from one sample (one pool of 50 individual fish) that I want to combine into one gVCF file for that sample before using GenotypeGVCFs to genotype 12 samples (12 gVCF files; 12 pools of 50 fish each) at once. Combining all 8310 gVCF files is where I run into trouble: MergeVCFs times out before it can finish, though it works for smaller tests (i.e. 30 scaffolds instead of all 8310).

I can't reduce the number of scaffolds that I have, so I have to figure out either a hierarchical way of running HaplotypeCaller to produce fewer gVCF files per sample, or some way to combine all the gVCF files from one sample before moving on to GenotypeGVCFs with multiple samples, maybe a hierarchical way of running MergeVCFs. Do you have any ideas?
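The hierarchical MergeVCFs idea can be expressed as a two-level scatter-gather. This is only a sketch, and it assumes some upstream step (not shown) has grouped the 8310 gVCF paths into batches:

```wdl
# Level 1: merge each batch of ~100 gVCFs in parallel.
# batched_gvcfs is assumed to be an Array[Array[File]] (~84 batches),
# produced by a hypothetical batching step not shown here.
scatter (batch in batched_gvcfs) {
  call MergeGVCFs as MergeBatch {
    input: input_vcfs = batch
  }
}

# Level 2: merge the ~84 batch-level gVCFs into the final per-pool gVCF,
# so no single MergeVCFs invocation ever sees thousands of inputs.
call MergeGVCFs as MergeAll {
  input: input_vcfs = MergeBatch.merged_gvcf
}
```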

Regarding your earlier question about ScatterIntervalList: the task uses IntervalListTools (Picard) "to scatter the input interval list into scatter_count sub interval lists". That is, the input interval list you provide is broken down into sub-interval-list files (the number of files is determined by the scatter_count variable), each containing a subset of the intervals from the input list. This set of sub-interval-list files is then used in the HaplotypeCaller scatter so that each shard runs its own job over one sub-list.
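Under the hood, the task boils down to a single IntervalListTools invocation along these lines (paraphrased from memory, so check the pipeline source for the exact arguments and output layout):

```wdl
task ScatterIntervalList {
  File interval_list
  Int scatter_count

  command <<<
    mkdir scatter_out
    # Split the input list into scatter_count balanced sub-lists without
    # subdividing individual intervals (mode name recalled from the pipeline;
    # verify against the Picard IntervalListTools documentation)
    gatk --java-options "-Xms1g" IntervalListTools \
      --INPUT ${interval_list} \
      --SCATTER_COUNT ${scatter_count} \
      --SUBDIVISION_MODE BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
      --UNIQUE true \
      --SORT true \
      --OUTPUT scatter_out
  >>>
  output {
    # IntervalListTools writes each sub-list into its own subdirectory
    Array[File] out = glob("scatter_out/*/*.interval_list")
  }
}
```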

This is different from the way the workflow is currently being run. It looks like your workflow is scattering the haplotypecaller call based on each line in the interval list and your interval list has 8k lines, hence the 8k scattered jobs.

Note: scattering the HaplotypeCaller call is used to parallelize the workflow; it isn't required. The workflow can simply call the HaplotypeCaller task once with one interval list, instead of calling it once for each line in the interval list. Not scattering the call would also mean you wouldn't need MergeVCFs, since the workflow would execute a single HaplotypeCaller task creating a single gVCF.
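In other words, the unscattered version collapses to a single call (a sketch; the input names are illustrative):

```wdl
# One HaplotypeCaller invocation over the full interval list:
# one task, one output gVCF, and no MergeVCFs step afterwards
call HaplotypeCaller {
  input:
    input_bam     = input_bam,
    interval_list = all_scaffolds_interval_list   # all 8310 scaffolds via -L
}
```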

I have a couple more questions about optimizing my HC run, thank you for all your help so far!

I removed the scatter that had HaplotypeCaller running a task for each of the 8310 lines of my interval list, and instead have it running on a test of 6 intervals, using -L in the body of the HC command but no additional scattering/parallelization. It has been running for more than 30 hours because it was preempted a couple of times and had to restart the HC task from the beginning after each preemption. It looks like it may have just restarted again. I'm thinking of running the HC command without the preemptible allowance, because I think I am just wasting time allowing the job to be preempted.

But I also wonder whether I am running HC on a fast or powerful enough Google VM. How do I know whether to increase the memory or the CPUs of the job thoughtfully, that is, without asking for more than I need and wasting money? I have asked for 7GB of memory in the WDL, but no explicit number of CPUs. Maybe another clue that my WDL isn't optimized for this task is that the HaplotypeCaller.log has these lines:

In addition to optimizing the computing power and memory of the VM I'm running HC on, I want to reduce HC's runtime with some sort of parallelization, because I think HC is stalling on the large scaffolds since they are so big. So my other question is whether this new plan is a good idea: I would run HaplotypeCaller on my 24 largest scaffolds using the original scatter method (one HC task for each line of a 24-scaffold interval file, probably on preemptible machines to save money) and then merge the resulting 24 gVCF files into one gVCF. I would run the remaining 8286 very small scaffolds using the non-scattered protocol (a WDL with one HC call and -L for the 8286 scaffolds within the body of the command) to produce one gVCF for all of those small scaffolds. Then I would use MergeVCFs on the two gVCF files to get one gVCF containing all 8310 scaffolds.
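As a workflow skeleton, that two-tier plan might look like this (a hedged sketch with made-up names, not tested code):

```wdl
# Tier 1: the 24 largest scaffolds, one preemptible HC job each
scatter (scaffold in large_scaffold_intervals) {    # 24 entries
  call HaplotypeCaller as HC_Large {
    input: input_bam = input_bam, interval_list = scaffold
  }
}

# Tier 2: all 8286 small scaffolds in a single unscattered HC job
call HaplotypeCaller as HC_Small {
  input: input_bam = input_bam, interval_list = small_scaffolds_list
}

# Final gather: 24 + 1 = 25 gVCFs, a trivial merge compared to 8310
call MergeGVCFs {
  input: input_vcfs = flatten([HC_Large.output_gvcf, [HC_Small.output_gvcf]])
}
```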

I'm thinking of running the HC command without the preemptible allowance, because I think I am just wasting time allowing the job to be preempted.

Be aware that this will increase the cost of your run.

How do I know whether to increase the memory or the CPUs of the job thoughtfully, that is, without asking for more than I need and wasting money?

Good question. You could add a few commands in the task's command block that write the system resources being used to a log file, then use that file to determine whether the task is limited by CPU, memory, etc. (FYI: you can adjust the CPU number.)
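For example, something like the following could go at the top of the task's command block (a sketch only; the real HaplotypeCaller invocation is omitted, and whether `free` or `vmstat` is available depends on the Docker image):

```shell
# Start a background loop that samples memory and load every 10 seconds
# and appends it to a log file that gets delocalized with the task outputs.
(
  while true; do
    echo "== $(date) =="
    free -m 2>/dev/null || vmstat 2>/dev/null   # whichever the image provides
    uptime
    sleep 10
  done
) > resource_usage.log 2>&1 &
monitor_pid=$!

# ... the real HaplotypeCaller command would run here ...
sleep 2   # stand-in for the actual job in this sketch

# Stop the monitor so the task can exit cleanly
kill "$monitor_pid"
```

Comparing the peak numbers in resource_usage.log against the VM's requested memory and CPU tells you which resource to resize.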
