The SR-ASM (Short Reads ASseMbly) algorithm is designed for DNA assembly of the short sequences produced by 454 sequencers. The 454 sequencing protocol is based on measuring the light intensities generated during the sequencing process. The sequences are given as chains of nucleotides, each with the number of its consecutive occurrences. In addition to the sequences, a confidence rate is provided for every nucleotide.
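The representation of a read as nucleotides with the numbers of their consecutive occurrences can be illustrated with a minimal Python sketch. The function names are hypothetical and are not taken from the SR-ASM source code. The sketch ignores the per-nucleotide confidence rates.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse each run of identical nucleotides into a (base, count) pair."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def run_length_decode(pairs):
    """Expand (base, count) pairs back into a plain nucleotide string."""
    return "".join(base * count for base, count in pairs)


read = "AAACGGTTTT"
encoded = run_length_encode(read)   # [('A', 3), ('C', 1), ('G', 2), ('T', 4)]
decoded = run_length_decode(encoded)
```

Decoding the pairs restores the original read, so no information is lost by this representation.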

The SR-ASM algorithm is a heuristic based on a graph model, with the graph built on the set of input sequences. The algorithm consists of three parts. The first part computes feasible overlaps between input sequences. It takes two parameters: the minimum overlap between two sequences and the error bound, that is, the percentage of mismatches allowed in the overlap of two sequences. The comparison is also carried out for the reverse complements of the input sequences, because the fragments are assumed to come from both strands of the DNA helix. In the second part, a graph is constructed with the fragments as vertices. Two vertices are connected by an arc if there is a feasible overlap between the two fragments. Next, a path is sought that passes through one vertex from every pair: either the fragment itself or its reverse complement. Usually it is not possible to find a single such path in the graph, so several paths, each corresponding to a contig, are returned as the solution. Finally, in the third part, a consensus sequence (or several sequences) is determined on the basis of the alignment.
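The first two parts can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the SR-ASM implementation: overlaps are ungapped suffix-prefix matches, the error bound is passed as a fraction rather than a percentage, and every pair of fragments is compared, whereas SR-ASM selects promising pairs first. All names (`best_overlap`, `build_overlap_graph`, `min_overlap`, `error_bound`) are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def best_overlap(a, b, min_overlap, error_bound):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with a mismatch fraction of at most `error_bound`, or 0 if none."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= error_bound * length:
            return length
    return 0


def build_overlap_graph(reads, min_overlap, error_bound):
    """Return arcs {(u, v): overlap_length}; a vertex is (read index, strand),
    where '+' is the read itself and '-' is its reverse complement."""
    vertices = {}
    for i, read in enumerate(reads):
        vertices[(i, "+")] = read
        vertices[(i, "-")] = reverse_complement(read)

    arcs = {}
    for u, seq_u in vertices.items():
        for v, seq_v in vertices.items():
            if u[0] == v[0]:
                continue  # never join a read to itself or to its own complement
            length = best_overlap(seq_u, seq_v, min_overlap, error_bound)
            if length:
                arcs[(u, v)] = length
    return arcs


arcs = build_overlap_graph(["ACGTAC", "TACGGA"], min_overlap=3, error_bound=0.0)
```

In this toy example the suffix `TAC` of the first read equals the prefix of the second, so an arc of length 3 joins their forward vertices, and a mirrored arc joins the reverse-complement vertices in the opposite direction. A path search over such a graph would then have to pick exactly one strand per read.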

The strength of SR-ASM comes from combining the following new elements in the assembly procedure: temporary compression of the input sequences, a new method of selecting promising pairs, operations that repair missing arcs or excess vertices in the overlap graph, and a system of voting sequences for creating the consensus sequence. Download the source code of the SR-ASM algorithm.

The usefulness of the algorithm has been demonstrated in tests on raw data generated during the sequencing of the whole 1.84 Mbp genome of the bacterium Prochlorococcus marinus. The algorithm and the computational experiment are described in the following paper: