Guest Post: Next Generation Variant Calling

Editor’s Note by Gabe Rudy: I’ve had the chance to exchange thoughts, emails, and blog post comments for a while now with Jeff as he has written posts on NGS Leaders and engaged with me on 23andMe. He has also worked with Golden Helix software as part of Dr. Todd Lencz’s research efforts at Zucker Hillside Hospital until he recently moved to the University of Medicine & Dentistry of New Jersey. Jeff participates in the 1000 Genomes Project and is also involved in projects to de novo sequence insect genomes. Jeff’s experience as part of the 1000 Genomes team in charge of improving variant calling for classes of variants outside Single Nucleotide Variants (SNVs or SNPs depending on your nomenclature preference) has given him an informed perspective of the weaknesses of current variant calling techniques and the various ideas proposed to address those.

Just as the techniques for DNA sequencing are rapidly evolving, the algorithms for identifying variants between genomes are being continuously developed. Currently, the most highly-used variant callers focus strictly on calling SNPs with limited support for indels. In this post, we will outline some of the newer variant callers that go beyond the traditional model and use techniques such as de novo assembly or physical phasing in order to identify variants. These new approaches require greater computational resources, but the increased quality and power of the results can more than outweigh the costs.

We will focus on the calling of short sequence variants such as SNPs and indels and will not discuss structural variants, which present their own large set of challenges.

Traditional Variant Calling
The general technique for sequencing follows this paradigm.

Sequence short reads from an Illumina, SOLiD or Ion Torrent machine.

Align the reads to the standard reference sequence.

Iterate through the genome to identify locations where there are a sufficient number of reads to indicate either the presence of a non-reference base either in a heterozygous or a homozygous context. These locations are called as SNPs.

Use a gapped mapping technique to identify locations where there is either a compression or an expansion of the genomic sequence indicating an indel.

Integrate the results of #3 and #4 to produce a set of variant calls for downstream analysis.

This is obviously a rough summary of variant calling and it skips steps such as base quality score calibration and the Base Alignment Quality (BAQ) technique that are often used to improve the robustness of variant calls. Importantly, alignment and base calling are performed sequentially and SNPs and indels are called separately. These two features produce some shortcomings in the variant calls that can be produced.

The requirement for an initial alignment of short (100bp) sequencing reads limits the amount of variation from the reference genome that can be detected. A case of 15 mis-matched sequential bases in a genome will not be detected since the aligner will be unable to properly map the reads supporting that variant. This shortcoming will remain regardless of the number of reads at that locus since the number of mismatches allowed by an aligner is limited in order to allow it to run in a reasonable amount of computational time. The procedure of calling SNPs and indels separately and then integrating the calls can lead to problems when there are complex loci containing a SNP opposite an indel. Let’s assume that the SNP caller made this call of a single SNP:
ATGTATGTA
ATGTGTGTA

and the indel caller produced this call of a 3 base deletion:
ATGTATGTA
ATGT- - - TA

Should it be assumed that there is a heterozygous SNP opposite a heterozygous indel or a more complex variant that is also supported by the reads? When a program is only tasked with finding a particular type of variant, such as an indel or a SNP, it will never detect anything else.

Fortunately, there are some new variant callers that are available which are more sophisticated and call SNPs, indels, and complex variants simultaneously. There are many different variant callers available. A quick scan of any issue of Bioinformatics will include at least one, and probably more, variant callers. I will highlight 3 different variant callers that I think are leading the way and take different approaches to the task of calling variants: FreeBayes which performs local physical phasing, the Complete Genomics caller which uses local de novo assembly, and Cortex which relies completely on de novo assembly.

FreeBayes – haplotype-based calling
The FreeBayes algorithm was developed by Erik Garrison in Gabor Marth’s group at Boston College and is one of the main callers utilized by the 1000 Genomes Project. It relies on an alignment of sequencing reads to the genome, but rather than just looking at individual bases, it calls variants based upon haplotypes up to the length of sequencing reads. Through a complex computational process (outlined in this preprint), the program takes reads from a single or multiple samples and determines the number of different haplotypes that appear in a local region. It then gives a probability for the presence of each of the haplotypes in each of the samples. Based upon these probabilities, FreeBayes gives the best approximation for the diploid genotypes in each individual. This technique has a few important benefits over a traditional variant caller:

A large number of different alleles can be called at each location across samples. For many variant callers, each location has a reference and an alternate allele and each sample is reported as either homozygous reference, heterozygous, or homozygous variant. With FreeBayes, there could potentially be 2,000 different alleles represented among the diploid genomes of the 1000 Genomes samples. In practice, this number of haplotypes is not found, but this is due to a limitation of human genomic variation and not the algorithm.

Nearby variants are identified on the same haplotype so they are in phase. When there are two SNPs separated by a reference base, it is not generally noted whether they are on the same chromosome or opposing chromosomes. These two possibilities are critical to distinguish when performing functional analysis to determine if variants are synonymous or non-synonymous. For example, given the reference sequence ACA and SNPs in both the 5’ and 3’ A, it is important to know whether there is a homozygous sequence TCG, or a heterozygous ACG and TCA. With calls from FreeBayes, this phasing can extend for several dozen bases.

All types of sequence-level variants are called simultaneously. Rather than employing a separate caller for SNPs, indels, and multi-nucleotide polymorphisms, FreeBayes calls all haplotypes at a locus regardless of what type of difference they contain relative to the reference genome. In a long haplotype, there could be a mix of multiple SNPs and indels. Additionally, the two alleles at a locus could be identified as extremely different from each other if that categorization is supported by the underlying reads.

The one major shortcoming of FreeBayes is that it is reliant on a prior alignment of reads to the genome. This requirement limits the scale of variants that can be called since highly divergent sequences would not be aligned properly by any algorithm. But with the lengths of reads increasing, this should become less problematic. With a 250bp read, many local mismatches or gaps could occur without preventing a valid alignment.

The Complete Genomics Caller
The variant caller from Complete Genomics (CG) is innovative and is completely based upon a local de novo technique. The CG variant caller takes the view that the best way to call variants is to use an assembly. Since a complete de novo assembly of a genome is extremely computationally taxing, it does the next best thing and uses a local assembly. The technique is encapsulated in this figure from their paper:

Reads are all aligned to the genome (blue) and a reference score is calculated based upon the number of reads at each location and the quality of the alignment. This score gives the likelihood of a base being homozygous reference. Whenever the reference score drops (purple box), then an assembly is performed to determine the variation that has occurred. Based upon the alignment of paired ends of reads (blue), all of the reads that are aligned with the variable base and a surrounding region are recruited (yellow). These yellow reads are then de novo assembled in order to determine their sequence.

Here are the benefits of this method of variant calling:

Every variant including a SNP is called based upon a de novo assembler rather than calling SNPs first and then looking for more complex variants. In this way, variants close to a SNP, such as other SNPs or indels that could alter the variation detection using a traditional method, are included in the process.

Because there are times when the evidence is insufficient to make a variant call properly, the program includes the possibility of calling a base as a ‘no-call’. These no-calls are regions where there is not enough read data to either support a call of homozygous reference or any type of variant. In the normal procedure, whenever no variant is called, the sequence is assumed to be reference whether or not this is correct. For downstream analysis, these no-call regions can be filtered to prevent unjustified conclusions.

The variants are called over a small region of the genome rather than at each specific base individually. This allows for the calling of longer haplotypes, just as with FreeBayes. Even better, since assembly is used, the length of a variant that could be called is not limited to an individual read.

A drawback to this technique is that the reads are relatively short (35bp) and therefore there will be regions of the genome where neither end of reads can be uniquely aligned. In this case, where there is a repetitive region of the genome, there will not be any reads whose mates map uniquely to allow for the local assembly. Additionally, this variant caller is only available as part of the Complete Genomics package, and one cannot simply run it separately.

Cortex
The Cortex algorithm was developed by Zamin Iqbal and Mario Caccamo in England. It is based upon the de Bruijn graph model of genome assemblers, but takes a novel approach by using what their paper calls colored de Bruijn graphs. The ideal method to call variants in a genome would be to do a complete de novo assembly and then to do an alignment to the reference genome. This is not feasible with current sequencing technologies due to their short read length and the complexity of the human genome. What Cortex does, is to assemble the genome as much as possible and then to use this assembly to call variants. If there are two genomes being assembled and compared, then they are each assigned a color for their graph. After the assembly graph is completed, a variation between the genomes will show up as a change between the two graphs. In the case of comparing a single genome to the reference, the genome will be broken up into a de Bruijn graph and given a color for the comparison. After a variant has been detected between the reference graph and a genome of interest, the assembled contig is mapped back to the reference genome so that the correct genome coordinates of the variant can be determined.

If there were two genomes colored as blue and red, then variants would be called as shown in figure at left. Since genomes are diploid, there are two red and two blue lines. The heterozygous variant has one blue allele matching the two red alleles with the variant blue allele separated. The homozygous variant has both blue and red lines separate from each other, and a repeat is represented by a bubble in the graph where there are multiple copies of the same sequence. While this graphical color view simplifies the procedure for explanation, in reality the work is all done within complex computational data structures. Because Cortex does an almost full de novo assembly, it has larger time and memory requirements than other variant calling algorithms. Even so, it has been used on large datasets including whole populations from the 1000 Genomes Project.

Here are a few of the benefits of Cortex:

The variant calling does not require a reference and can be performed directly between two or more sample genomes. This is extremely beneficial for researchers who work on non-model organisms that lack a reference genome. Using Cortex, they could directly examine the diversity of genomes of a species.

When using Cortex, there is no alignment at all to a reference, just the combined assembly of the sample genomes and then the calling of variants between them. This is extremely helpful if the sample genomes have many loci where they all have the same variant relative to the reference. Only the variants between the sample genomes would be detected by Cortex.

Cortex can call any type of variants between the genomes. As with FreeBayes and the Complete Genomics caller, this means that there is no difficulty in calling complex variants or multi-nucleotide polymorphisms. Since a de novo assembly is performed, the length of variants is theoretically unlimited, but in practice, it is limited by the coverage of the genome and the particulars of the de Bruijn graph construction.

Conclusion
As we have seen, there are many different innovations being made in the field of variant calling. Since these tools are either open source (FreeBayes and Cortex) or have had their details published (CG), their novel features can be incorporated into other tools. In this manner, there is a new version of GATK that is being released which will incorporate haplotype-based calling and some level of local de novo assembly. Due to their greater complexity, these tools require greater amounts of computational power than traditional variant callers, but their increased power is worth the trade off. Even if a researcher is only concerned with SNPs, the use of a sophisticated variant detector will ensure that their SNP calls are more robust with fewer false positives and incorrect calls.

About Dr. Jeffrey RosenfeldJeffrey is a Bioinformatics Scientist in the Division of High Performance and Research Computing at the University of Medicine and Dentistry of New Jersey (UMDNJ) and a Research Associate in the Division of Invertebrate Zoology at the American Museum of Natural History.