January 3, 2013

New Method Allows For The Evaluation Of A Wide Range Of Genome Sequencing Procedures

Genome sequencing is much more common than in the past. In large part, this is attributable to advances in biotechnology and computer software. However, there is still some question about both the accuracy of different sequencing methods and the best ways to evaluate them. Computer scientists, led by New York University, have now devised a new tool to better measure the validity of genome sequencing.

By tracking a small group of key statistical features in the basic structure of the assembled genome, the new method allows for the evaluation of a wide range of genome sequencing procedures. To put together the complete genome sequence, much like a complex jigsaw puzzle, a sequence-assembly algorithm lays out the individual short reads, which are strings of DNA's four nucleotide bases sampled from the target genome.

The method, as described in the journal PLOS ONE, uses techniques from statistical inference and learning theory to select the most significant features, concluding that many features thought by human experts to be important are actually misleading.

The research team, consisting of scientists from New York University's Courant Institute of Mathematical Sciences, NYU School of Medicine, Sweden's KTH Royal Institute of Technology, and Cold Spring Harbor Laboratory, says current methods of evaluating genome sequencing are typically imprecise, relying on what amounts to "crowd sourcing": scientists weigh in on the accuracy of a sequencing method, creating a consensus. Still other methods use apples-to-oranges comparisons to make assessments, limiting their value as evaluations.

With this new work, the research team expanded upon a system it had created previously. The earlier system, Feature Response Curve (FRCurve), offers a global picture of how genome-sequencing methods, or assemblers, are able to deal with different regions and structures in a large, complex genome. FRCurve points out how an assembler might have traded off one kind of quality measure against another. It shows, for example, how aggressively a genome assembler might have tried to pull together a group of genes into a contiguous piece of the genome while at the same time scrambling their correct order and copy numbers.
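The trade-off FRCurve exposes can be illustrated with a small sketch. The article does not spell out how the curve is computed, so the construction below is an assumption: contigs are taken largest first, and for each allowed number of suspicious "features" the curve records how much of the genome is covered before that budget is exceeded. The names `Contig` and `feature_response_curve`, and all the numbers, are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    length: int    # contig length in bases
    features: int  # suspicious features flagged on this contig (assumed given)


def feature_response_curve(contigs, genome_size, thresholds):
    """Return (threshold, coverage) pairs for one assembly.

    Assumed construction: walk contigs from longest to shortest and keep
    adding them while the running feature count stays within the threshold.
    """
    ordered = sorted(contigs, key=lambda c: c.length, reverse=True)
    curve = []
    for phi in thresholds:
        total_features = 0
        covered = 0
        for contig in ordered:
            if total_features + contig.features > phi:
                break  # the next contig would exceed the feature budget
            total_features += contig.features
            covered += contig.length
        curve.append((phi, covered / genome_size))
    return curve


# A cautious assembly: shorter contigs, few flagged features.
cautious = [Contig(500, 1), Contig(300, 0), Contig(200, 4)]
# An aggressive assembly: one long contig that carries many flagged features.
aggressive = [Contig(900, 6), Contig(100, 0)]

print(feature_response_curve(cautious, 1000, [0, 1, 5]))
print(feature_response_curve(aggressive, 1000, [0, 1, 5, 6]))
```

In this toy setup the aggressive assembly looks better on contiguity alone, yet it covers none of the genome until six features are tolerated, while the cautious assembly already covers 80 percent with a budget of one. That is the kind of trade-off a single summary number hides.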

The team admits FRCurve has a significant limitation, however. The system can gauge the accuracy of only certain kinds of assemblers, which rules out comparisons across the full range of sequencing methods currently in use. FRCurve fails with many of the new methods that are becoming highly popular because they are specifically designed to work with the most established next-generation sequencing technologies. These methods are also able to perform some error correction and data compression. The problem is that, by doing so, they discard the original signature of key statistical features (the position and orientation of the reads used to generate the candidate sequence) that FRCurve needs for evaluation.

The PLOS ONE article unveiled a new method, FRCbam, which can evaluate a much wider class of assemblers by reverse engineering the latent structures obscured by error correction and data compression. This operation is performed rapidly by using efficient and scalable mapping algorithms.

FRCbam validates its analysis by examining a large ensemble of assemblers working on a large ensemble of genomes, selected from crowd-sourced competitions like GAGE and Assemblathons, instead of relying on assumption-ridden simulations or expensive auxiliary methods. FRCbam is thus able to characterize the statistics that are expected and then validate any individual system against them.
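The idea of characterizing expected statistics from an ensemble and then checking an individual assembly against them can be sketched with a simple standard score. This is a stand-in technique chosen for illustration, not the statistical procedure FRCbam itself uses, which the article does not detail. The function name `deviation_from_ensemble` and the feature counts are hypothetical.

```python
from statistics import mean, stdev


def deviation_from_ensemble(ensemble_counts, candidate_count):
    """Standard score of one assembly's feature count against an ensemble.

    ensemble_counts: counts of one feature type across reference assemblies
    candidate_count: the same feature's count in the assembly under review
    """
    mu = mean(ensemble_counts)
    sigma = stdev(ensemble_counts)  # sample standard deviation; needs 2+ values
    if sigma == 0:
        raise ValueError("ensemble shows no spread; a standard score is undefined")
    return (candidate_count - mu) / sigma


# Hypothetical counts of one feature type across three reference assemblies.
reference = [10, 12, 14]
print(deviation_from_ensemble(reference, 18))
```

A large positive score would suggest that the assembly under review carries far more of that feature than the ensemble typically does, which is one simple way to read "validating with respect to expected statistics."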

The team expects that FRCurve and FRCbam will be used routinely to rank and evaluate future genome projects. The method is currently being employed to evaluate the sequence assembly of the Norway spruce, one of the largest genomes sequenced so far; it is seven times longer than the human genome.