Variant Calling Pipeline: FastQ to Annotated SNPs in Hours

Identifying genomic variants, such as single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), can play an important role in scientific discovery. To this end, a pipeline has been developed to allow researchers at the CGSB to rapidly identify and annotate variants. The pipeline employs the Genome Analysis Toolkit (GATK) to perform variant calling and is based on the best practices for variant discovery analysis outlined by the Broad Institute. Once SNPs have been identified, SnpEff is utilized to annotate and predict the effects of the variants.

The pipeline requires six input values. The values can be hard coded into the script, or provided as command line arguments. They are found at the very top of the script.

DIR: The directory with your FastQ libraries to be processed. This should be located in /scratch/cgsb/gencore/out/<PI>/

REF: Reference genome to use. This should be located in the Shared Genome Resource (/scratch/work/cgsb/reference_genomes/). Selecting a reference genome from within the Shared Genome Resource ensures that all index files and reference dictionaries required by BWA, Picard, GATK, etc. are available.

SNPEFF_DB: The name of the SnpEff database to use. To determine the name of the SnpEff DB to use, issue the following command:java -jar /share/apps/snpeff/4.1g/snpEff.jar databases | grep -i SPECIES_NAMEFrom the list of returned results, select the database you wish to use. The value of the first column is the value to input for SNPEFF_DB.Example:java -jar /share/apps/snpeff/4.1g/snpEff.jar databases | grep -i thaliana

Run the script. If you hard coded the parameters in the script as shown in the example above, no additional parameters need to be provided. Simply run:sh pipeline.sh

Optional: Monitor the jobswatch -d qstat -u <netID>

Overview of the pipeline

Important Notes:

The pipeline assumes no known variants are available for the Base Quality Score Recalibration step and as such bootsraps a set of SNPs and Indels to provide as input at this step. Contact me if a list of high quality known variants is available for your organism as this data can improve the quality of the output

The current script is designed to work with Paired End data

Due to HPC queue limits on mercer, the pipeline can run on a maximum of 25 libraries at a time

If you need to run the pipeline on more than 25 libraries or on Single End reads, or have any other questions, please contact me

Walk through

Step 1

Alignment – Map to Reference

Tool

BWA MEM

Input

.fastq files, reference genome

Output

aligned_reads.sam*

*Intermediary file, removed from final output

Notes

Need to provide the -M flag to BWA, this tells it to consider split reads as secondary, need this for GATK variant calling/Picard support. Alternate alignment tools: Bowtie2, Novoalign

Readgroup info is provided with the -R flag. This information is key for downstream GATK functionality. GATK will not work without a read group tag.

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however they will be marked as ‘basic_snp_filter’, while SNPs which passed the filter will be marked as ‘PASS’

Note: Indelss which are ‘filtered out’ at this step will remain in the filtered_indels.vcf file, however they will be marked as ‘basic_indel_filter’, while Indels which passed the filter will be marked as ‘PASS’

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps_final.vcf file, however they will be marked as ‘basic_snp_filter’, while SNPs which passed the filter will be marked as ‘PASS’

Note: Indelss which are ‘filtered out’ at this step will remain in the filtered_indels_recal.vcf file, however they will be marked as ‘basic_indel_filter’, while Indels which passed the filter will be marked as ‘PASS’

Hi,
I am Minju, a master student of Dept. of Biology, performing computational research with HTS data.

my question is that where could I get the file, GenomeAnalysisTK.jar. because the Broad Institute, the developer, only provide GenomeAnalysisTK.tar which means I can’t execute the commands(specifically I am stuck in the Step 8: raw variant calling)

See “Important Notes” above: The pipeline assumes no known variants are available for the Base Quality Score Recalibration step and as such bootsraps a set of SNPs and Indels to provide as input at this step.

I am a PHD student and trying to run the codes step by step. When I followed the steps and ran Step 10, there were warnings said “undefined variable MQRankSum”. How can I solve this problem? Thanks for your time.

The pipeline you elaborated is good, please keep up the good work. It will help many. You have explained it well and kept it simple and lucid.
well, i was wondering, from where “depth_out.txt” you are sourcing out?
just i am curious to know…

Hi David really nice work!
Just a quick question. The GATK website says that you actually don’t need to run IndelRealigner if you are going to use Haplotypecaller. But this is in your steps. Would you recommend it anyway?

Hi Irene, thanks for the feedback! At the time of writing this post, IndelRealigner was a recommended step in the GATK best practices. With the latest version of Haplotypecaller, it seems this step is no longer required, so I think you are safe to skip it.

Hi
I’m wondering if an extra “SelectVariants” step is necessary to select only SNP with “PASS” tag after step 10 before used as knownsites in Step12. I’ve seen an example doing SelectVariants with -select ‘vc.isNotFiltered()’. Does BaseRecalibrator use SNP with ‘PASS” tag only in filtered_snps.vcf? I appreciate your insights.