Background

Many important viruses, such as HIV, the SARS coronavirus, hepatitis C virus, and influenza virus, have high mutation, recombination, and replication rates. These viruses generate "clouds" of sequence variants, called viral quasispecies, within infected hosts. The diversity and evolution of viral quasispecies are influenced by host-viral interactions [2]. Characterizing the quasispecies genome populations of infected individuals is a first step toward studying such interactions. A recent proof-of-concept study by researchers at 454, CuraGen, and Yale suggests that parallel sequencing and identification of the sequence variation present within a population of viral quasispecies is feasible [3] using the sequencing-by-synthesis technology recently developed by 454 Life Sciences and incorporated in the GS20 sequencer. To realize the potential of this technology for sequencing and assembling quasispecies populations, it is necessary to develop and validate a robust methodology for genome-scale quasispecies assembly. We expect the bulk of the challenge to lie in the design and construction of the quasispecies assembler.

The Quasispecies Assembly Problem

Assembling and characterizing any quasispecies genome population poses a substantial computational challenge [4],[5]. Current assembly programs such as Phred/Phrap, TIGR Assembler, and 454's Newbler Assembler are designed to connect reads into a single consensus sequence. As such, they are not appropriate for simultaneously assembling multiple genome sequences. These programs assume, for example, that base mismatches represent base-calling errors or internal repeats rather than legitimate sequence variation within a population of input sequences. In addition, assembly is complicated by rearrangements and the existence of true internal repeats, which make connecting fragments into correct genomic sequences highly challenging. The deep coverage capability of the GS20™ can aid greatly in addressing the former problem (distinguishing legitimate variation from base-calling errors); however, the GS20's limited unidirectional read length (~100 bp per read) and lack of mate-pair information present a serious challenge in dealing with the latter problem (rearrangements and repeats). Any quasispecies sequence assembler intended for use with the GS20 sequencing technology must take these factors into account.
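To make the single-consensus limitation concrete, the following is a minimal sketch, not the behavior of any of the assemblers named above. It assumes reads that are already aligned and of equal length, and the function names and the 20% threshold are hypothetical choices for illustration. A majority-vote consensus discards a variant carried by a minority of reads, while a per-column tally retains it.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Call the most common base in each column of equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )

def minor_variants(aligned_reads, min_fraction=0.2):
    """List (position, base, fraction) for non-consensus bases at or above min_fraction.

    min_fraction is an illustrative cutoff, not a recommended value.
    """
    depth = len(aligned_reads)
    found = []
    for pos, column in enumerate(zip(*aligned_reads)):
        ranked = Counter(column).most_common()
        # ranked[0] is the consensus base; everything after it is a minority base.
        for base, count in ranked[1:]:
            if count / depth >= min_fraction:
                found.append((pos, base, count / depth))
    return found
```

With deep coverage, a recurring minority base of this kind is more plausibly a true variant than a base-calling error, which is why coverage helps with the first problem but not with repeats or rearrangements.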

Assembly Strategy

The problem of simultaneously assembling multiple highly similar, yet distinct, genome sequences is not novel. Indeed, this situation is encountered routinely in determining the haplotype of diploid eukaryotic DNA (i.e., the mapping of polymorphisms to the correct chromosome). In regions where sufficient sequence variation exists between reads, a technique known as correlated differences can be applied to segregate the two distinct sequences [6],[7]. This technique uses repeatedly occurring high-quality base-call mismatches to segregate and connect sequencing reads. The same strategy can be applied to the separation of quasispecies sequences, although in general the sequences can be separated only to a degree, owing to the existence of intervening stretches of highly similar sequence that break the connection between variable regions. In addition, the greater number of distinct sequences in quasispecies assembly relative to haplotyping, and the lack of foreknowledge about the total number of members present in the quasispecies population, will compound the difficulty of applying this technique to resolving viral quasispecies sequences.
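The idea behind correlated differences can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of [6],[7]: reads are given as hypothetical (start, sequence) pairs already placed on a common coordinate system, base quality is ignored, and a greedy first-fit grouping stands in for a proper clustering step. A column counts as variable when at least two different bases each recur in min_count or more reads, and reads are grouped when they agree at every variable site they share.

```python
from collections import Counter, defaultdict

def variable_sites(reads, min_count=2):
    """Return positions where at least two bases are each seen in >= min_count reads."""
    columns = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return {
        pos
        for pos, counts in columns.items()
        if sum(1 for n in counts.values() if n >= min_count) >= 2
    }

def segregate(reads, min_count=2):
    """Greedily group reads that agree at all variable sites they share with a group."""
    sites = variable_sites(reads, min_count)
    groups = []  # each group: {"alleles": {position: base}, "reads": [read index, ...]}
    for index, (start, seq) in enumerate(reads):
        alleles = {start + i: b for i, b in enumerate(seq) if start + i in sites}
        for group in groups:
            shared = alleles.keys() & group["alleles"].keys()
            if shared and all(alleles[p] == group["alleles"][p] for p in shared):
                group["alleles"].update(alleles)
                group["reads"].append(index)
                break
        else:
            # No compatible group shares a variable site with this read.
            groups.append({"alleles": dict(alleles), "reads": [index]})
    return groups
```

In this sketch a read that covers no variable site cannot be linked to any group and ends up alone, which is the same break in connectivity that stretches of highly similar sequence cause in practice.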

Our current thinking is that comparative assembly offers the most promising approach to this problem. In this procedure, sequence reads are aligned to a reference genome rather than being assembled de novo using the standard overlap-layout-consensus paradigm. Quasispecies sequences are too diverse to align to any single reference genome, so we propose to modify the comparative assembly method with a "phylogenetic partitioning" step: input reads would first be aligned to a representative sequence from each major clade. Each group of reads would then be realigned to the subtypes of its clade, and so on at each finer level. Our initial studies suggest that this approach can successfully segregate the reads into groups that approximately represent their parent genomes, within which the final assembly can occur.
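The partitioning step can be sketched as a recursive assignment over a tree of reference sequences. The sketch below is illustrative only and rests on several assumptions: the Clade class, the function names, and the k-mer sharing score are hypothetical, and the score is a crude stand-in for a real read-to-reference alignment. Ties go to the first clade listed.

```python
from dataclasses import dataclass, field

@dataclass
class Clade:
    name: str
    reference: str  # representative sequence for this clade or subtype
    children: list = field(default_factory=list)

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def score(read, reference, k=4):
    """Crude similarity: number of distinct read k-mers also present in the reference."""
    return len(kmers(read, k) & kmers(reference, k))

def partition(reads, clades, k=4):
    """Assign each read to its best-scoring clade, then recurse into that clade's subtypes.

    Returns a dict mapping leaf clade names to the reads binned there.
    """
    by_clade = {clade.name: [] for clade in clades}
    for read in reads:
        best = max(clades, key=lambda c: score(read, c.reference, k))
        by_clade[best.name].append(read)
    bins = {}
    for clade in clades:
        assigned = by_clade[clade.name]
        if not assigned:
            continue
        if clade.children:
            bins.update(partition(assigned, clade.children, k))
        else:
            bins[clade.name] = assigned
    return bins
```

Each resulting bin would then be passed to the final assembly step. A practical version would also need a rule for reads that match no reference well, which this sketch omits.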

Reference and Test Data

We have obtained GS20 sequence reads from overlapping PCR products spanning the entire HIV genomes of two individuals, and we will use these reads to develop the assembly strategy and methodology. To aid the testing and validation of our methodology, the NML HIV and Human Genetics Laboratory has PCR products of the HIV gag region (including part of the 5'-LTR and part of protease, ~2 kb in length) and fully sequenced clones (30 to 90 clones per sample) of the same PCR products from more than 200 patient samples. These samples represent diverse HIV subtypes, mostly clades A, C, and D and recombinant subtypes, and were sequenced using standard Sanger sequencing methodology. The methods and strategy we develop will be tested by sequencing the same PCR products from the HIV gag region using the GS20 sequencer. The quasispecies genomes assembled using the developed GS20 methods will be compared with and validated against the cloned sequences.