Discovering genetic variants from de novo assembly of high-throughput sequencing

High-throughput sequencing is dominating genetics research (and a lot of biotech) because of its speed, low cost, and the resulting abundance of data. But it still has some major shortcomings. One big problem is that analysis of this data usually requires aligning reads to a reference genome. Since this tends to bias any results heavily toward the reference, it can be rather problematic if your sequenced genome is very different from the reference, especially for novel sequence insertions and large rearrangements. I personally struggled with this problem a few years ago when I needed to analyze yeast data for a strain without a reference — I did the analysis using the standard S288c strain as a reference, but a few percent of the genome was substantially different, which meant we couldn’t reliably identify true variants in those regions.

An alternative is to perform de novo assembly of the reads yourself and call variants with respect to that assembly, rather than an already-known reference. De novo assembly has its own problems, but with long enough reads and sufficient coverage, assembling may be better than forcing your data to align to a suboptimal reference. A new pipeline that streamlines de novo assembly and variant calling together was just posted to arXiv last week:

The author, Heng Li, is well known in bioinformatics circles for Samtools, a widely used package of tools for analyzing sequencing data. The new pipeline described in this paper, FermiKit, strings together existing packages for assembly and variant calling along with a novel data compression technique (critical for human data, which is usually enormous). The pipeline seems to run rather fast (~1 day for a typical human data set) and also seems pretty easy to use.

Reading about this new package jogged my memory of another pipeline that does both de novo assembly and variant calling:

At present I cannot speak to the relative merits of these tools, but I am definitely looking forward to trying them in the future.

NB: On the topic of software and computational methods in biology, there was a provocative blog post last week about their importance and role in biology research, specifically whether we should consider them an intellectual contribution on par with a typical research paper. It triggered quite a wave of comments from people in the community, many of which are worth reading in my opinion.