Identify and validate coding variants from exome sequencing data

Summary

What coding variants do my exome sequences contain, and are any of them phenotypically interesting?

This recipe provides a method to identify coding variants from exome sequencing data. An example use of this recipe is a case where an investigator may want to identify variants from their alignment data and evaluate the phenotypes.

In particular, this recipe uses the variant detector, FreeBayes, in Galaxy to identify potential variants from short-read alignments and assign them quality scores. We also use other Galaxy tools to pare these potential variants down to a high quality subset and to compare them against a catalog of gold-standard variants. Finally, we use IGV to visualize these variants alongside known pathogenic variants.

Why identify genetic variants? Genetic variants are DNA sequence differences that occur with relative frequency in a population. These variants include single nucleotide polymorphisms (SNPs), insertions and deletions (indels), multi-nucleotide polymorphisms (MNPs), and complex events that are combinations of indels and polymorphisms. With the advent of next generation sequencing technologies, it has become common practice to identify variants using sequence data. This is usually accomplished by comparing sequence data from many samples or individuals against a reference. Identified variants can have many practical applications. For instance, variants that correlate strongly with disease risk may provide insight into the disease's genetic underpinnings. Variants may also be used as biomarkers for disease risk or to DNA "fingerprint" individuals.

Why use exome sequencing data? The human exome comprises all the exonic regions of the human genome. That is, it includes all the DNA sequences that, after transcription into RNA, remain after RNA-splicing, which amounts to about 1% of the entire genome. Whole exome sequencing (WES) thus generates much less data than whole genome sequencing (WGS). Furthermore, since WES focuses almost exclusively on protein-coding regions it is an effective method for identifying coding variants.

NOTE: Working with DNA sequencing data is a computationally and resource intensive process. Files containing raw sequence reads can be many gigabytes in size and can be cumbersome to manage. To maintain the usability of this recipe, we have chosen to work with only a partial region of the human genome, specifically the exonic regions of a chromosome 20.

Inputs

To complete this recipe, we will need a number of datasets of various types. First, we will need a set of exonic sequence reads that have been pre-processed for quality and duplicates and aligned to the genome. In this example, we will use aligned reads generated by the 1000 Genomes Project that have been mapped to the exonic regions of chromosome 20. These samples are derived from 10 individuals in the project's Great Britain population cohort. We also require a reference genome in FASTA format, which we will use to identify variants. Finally, we will need a set of gold-standard variants as well as a set of variant annotations to compare against our discovered variants. We will need the following datasets, which can be downloaded from the following GenomeSpace Public folder:

Public > RecipeData > AlignmentData > 1000GP.GBR.chrom20.exome.tar.gz: This is an archive of aligned, sorted, and duplicate-marked chromosome 20 exome reads in BAM format. Specifically, the archive includes the alignments of 10 individuals drawn from the Great Britain population in Phase 3 of the 1000 Genome Project.

Public > RecipeData > VariantData > hapmap_3.3.b37.chr20.vcf: This is a VCF formatted file containing genetic variants identified by the International HapMap Project (version 3.3) filtered to include only variants on chromosome 20.

NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.

NOTE: While GenomeSpace provide multiple methods for sending data to tools, in this instance we highly suggest using the above method. Users may encounter errors if using other methods due to the relatively large size and number of files.

We will use a pre-built Galaxy workflow to call and filter variants in the exome regions of chromosome 20. This workflow first uses the haplotype-based genetic variant detector, FreeBayes, to call variants from the short-read sequencing alignments, generating a VCF file that lists the discovered variants, genotypes, and various quality metrics. (The VCF or Variant Call Format is a standard file format for reporting genetic variants.) This workflow also uses VCFfilter and VCFintersect, from the VCFlib toolkit, to filter the results from FreeBayes for high quality, known variants. The resulting Galaxy workflow output is thus a VCF file containing high quality variants identified by FreeBayes and extant in the gold-standard HapMap catalog.

In this step, we use IGV to visualize high-confidence genetic variants, which we have identified in our British population sample, alongside known pathogenic variants defined in dbSNP.

Launch IGV from GenomeSpace by clicking on the context menu and choosing Launch, prompting the download of a .jnlp file. Double-click the file to launch IGV.

Load a reference genome into IGV by using the IGV genome selection drop-down menu. Make sure the reference genome matches the one used in the variant calling process, e.g., Human (b37) in this example.

Use the GenomeSpace tab to import files by navigating to the following menu: Load File from GenomeSpace....

To load files, navigate to the directory in GenomeSpace which contains the file, then click the file to select it, and then click Open. Load the following files into IGV:NOTE: Each file is loaded separately.

Since our original reads were aligned to chromosome 20 only, select 20 from the dropdown menu of regions to view the variants.

Results Interpretation

This is an example interpretation of the results from this recipe. In our Galaxy workflow we first used FreeBayes to compare our short-read exome sequencing alignments to a reference sequence in order to determine likely genetic variants in our sample cohort. Next we used VCFfilter to filter these results for high quality variants based on various metrics (e.g., read depth, quality score). We also used VCFintersect to further narrow our results down to those variants with equivalent alleles in the HapMap catalog. Finally, we turned to IGV to view our filtered variants alongside a set of known pathogenic variants identified in the dbSNP database. Specifically, we would like to see if any of our identified variants may be associated with an increased risk for disease.

We zoom in on a specific variant located in the q13.31 cytoband of chromosome 20 by typing the coordinates 20:54961541 into the regions navigation bar and hitting enter. Both our results track (upper) and dbSNP track (lower) display a variant at this site, as indicated by the blue and red bar (the proportion of red in the bar corresponds to the allele frequency of the variant). The group of 10 tracks display the genotypes of the 10 individuals whose sequences we used in this recipe. Specifically, 4 individuals are heterozygous for the variant (blue), 2 individuals are homozygous (cyan), and the remaining 4 individuals are homozygous (grey) for the reference allele.

Clicking on the variants in either track brings up info-boxes that display various statistics for the variant. By comparing to the dbSNP track, we notice that our variant analysis has identified an A/T SNP that corresponds to the pathogenic variant, rs2273535, in the AURKA gene. Moreover, our set of 10 samples carried the minor allele with a frequency of 0.4, which corresponds closely to the allele frequency of 0.316 reported in the dbSNP annotations. Given this correspondence, we reason that we have identified the pathogenic SNP rs2273535 in our sample set of 10 British individuals. If we search for SNP id rs2273535 in ClinVar (link), we see that it has been identified as a risk factor for colon cancer. However, note that due to our small sample size these variant discoveries may not be generalizable to the whole population; this is only a simple representation of possible results.

Hi, I've been trying to follow this recipe, but could not launch the data to Galaxy. Please check, whether there is any problem with that. Thanks, Marta

Posted by ted on December 18, 2017 17:12

Marta, it looks like the GenomeSpace importer in Galaxy has been updated and in the new version it does not allow multiple files to be imported at once. I am contacting the developer and will try to get this corrected. I will contact you via email to follow up.
Thanks for letting us know, Ted