New path lays DNA puzzles bare

US computer scientists may have a less error-prone way to piece together the many short DNA sequences that make up a genome [1]. Genomics researchers everywhere would welcome such a tool - mistakes at this stage create expensive headaches in the lab later on.

All sequencing projects involve breaking up a genome and putting it back together again. The company Celera sequenced the human genome by breaking it up at random - the 'whole-genome shotgun' approach - and piecing together the resulting 27,271,853 sequences. Although the public Human Genome Project took a more structured approach, both groups faced similar problems when re-assembling their sequences.

Chief among these is that large genomes such as ours are very repetitive, like a jigsaw with many identically shaped pieces. Sequencing errors compound the problem - you don't know whether you're looking at different stretches of DNA or not.

Sequence assembly is analogous to finding the shortest route through many cities that passes through each only once. Often called the travelling salesman problem, this puzzle amounts to finding what mathematicians call a hamiltonian path.

Mathematicians call problems like this NP-complete: no efficient way of solving them is known, and the obvious approach is to try every possible route. This requires massive computer power, as the number of possibilities rises at least exponentially with the number of towns - or DNA pieces.
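
To get a feel for that growth, here is a minimal sketch that simply counts the orders in which a handful of pieces could be laid end to end. It is an illustration only, not anything taken from the researchers' software, and the function name count_orderings is made up for the purpose.

```python
from math import factorial


def count_orderings(n_pieces: int) -> int:
    """Number of distinct orders in which n pieces can be laid end to end.

    Illustrative only: a brute-force search would, in the worst case,
    have to consider every one of these orderings.
    """
    return factorial(n_pieces)


# Print the count for a few small, arbitrary puzzle sizes.
for n in (5, 10, 20):
    print(n, count_orderings(n))
```

Twenty pieces already allow more than two million million million orderings; a shotgun project deals in millions of pieces.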

By breaking the chunks of DNA into smaller fragments of equal size, Pavel Pevzner, of the University of California, San Diego, and his colleagues have transformed the hamiltonian path of genome sequencing into a 'eulerian path'.
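
The article's description can be sketched roughly as follows. This is a simplified reading, not the EULER code itself: it assumes that each equal-sized fragment (k letters long) becomes a 'road' joining its first k-1 letters to its last k-1 letters, and that identical fragments are merged. The reads, the fragment length of 3 and the function names are all invented for illustration.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """All overlapping substrings of length k, from left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Turn each distinct k-letter fragment into one road.

    The road runs from the fragment's first k-1 letters (a junction)
    to its last k-1 letters (another junction). Merging duplicate
    fragments is a simplifying assumption made for this sketch.
    """
    graph = defaultdict(list)
    distinct = {fragment for read in reads for fragment in kmers(read, k)}
    for fragment in sorted(distinct):
        graph[fragment[:-1]].append(fragment[1:])
    return dict(graph)


# Made-up overlapping reads from an imaginary stretch of DNA.
reads = ["ATGGC", "GGCGT", "CGTGC"]
graph = build_graph(reads, 3)
for junction, destinations in graph.items():
    print(junction, "->", destinations)
```

In this toy network the junction TG has two roads leaving it, which is how a repeated stretch of sequence shows up: as a junction that is passed through more than once.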

In a eulerian path, instead of visiting every city once only, you must travel down every road once only - passing through each junction as often as you like. Finding the shortest route through this network is called the Chinese postman problem.
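
Travelling every road exactly once is something a computer can do quickly. The sketch below uses Hierholzer's algorithm, a standard textbook method; it is offered as an illustration of why the problem is tractable, not as a description of how EULER works. The road list is made up, each junction is a two-letter stretch of DNA, and the code assumes a non-empty network in which a eulerian path exists.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Hierholzer's algorithm on a directed network of roads.

    Assumes `edges` is non-empty and that a eulerian path exists.
    Returns the junctions in the order they are visited.
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # roads leaving minus roads arriving
    for src, dst in edges:
        out_edges[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1
    # Start where one more road leaves than arrives, if such a junction exists.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            # Follow an unused road out of the current junction.
            stack.append(out_edges[node].pop())
        else:
            # Dead end: this junction is finished, so record it and back up.
            path.append(stack.pop())
    return path[::-1]


# Hypothetical roads between two-letter junctions.
edges = [("AT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CG"),
         ("CG", "GT"), ("GT", "TG"), ("TG", "GC")]
path = eulerian_path(edges)
# Spell out the sequence: the first junction, then one new letter per road.
sequence = path[0] + "".join(node[-1] for node in path[1:])
print(path)
print(sequence)
```

The running time grows only in proportion to the number of roads. Note that a network with repeats can contain more than one valid route, so the route found is one consistent reading of the data rather than a guaranteed unique answer.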

Chinese postmen are much more mathematically tractable than travelling salesmen. "Nobody can solve the travelling salesman problem - assemblers are forced to make arbitrary decisions, which lead unavoidably to errors," says Pevzner. "The eulerian path is almost the same in formulation, but there's a dramatic difference in complexity."

Path finder

In a play-off against other genome assemblers including PHRAP, used by the Human Genome Project, Pevzner's program, christened EULER, was the only one to make no errors piecing together fragments of the genome of Neisseria meningitidis, a bacterium that causes meningitis.

Bacterial genomes are relatively unrepetitive, so the researchers - and sequencing labs such as the UK's Sanger Centre - are in the process of giving EULER stiffer challenges using data from higher organisms.

"The results make the new technique look very effective," says Roger Staden, of the UK Medical Research Council's Laboratory of Molecular Biology in Cambridge, who has written sequence-analysis software.

Assembly errors complicate the already laborious task of filling in the gaps in the draft sequence. "You really have to go in and look at every single base," says computer scientist Mihai Pop, of The Institute for Genomic Research, Rockville, Maryland.

Pop believes that it is too early to say whether EULER will make fewer mistakes than conventional assemblers. But he likes the program for flagging up particularly tricky repeat sections. "I'd like to try their assembler out on some of the genomes we have here," he says.

"To see basic concepts from complexity and information theory applied in this way is very beautiful," says Riccardo Zecchina, who works on similar optimization problems at the Abdus Salam International Centre for Theoretical Physics in Trieste, Italy. The quest to find out whether other puzzles will yield to the piece-breaking technique "should stimulate a lot of research", he says.

Pevzner believes that EULER will find applications beyond DNA. "People look on fragment assembly as a 20-year-old problem. But it's really thousands of years old - how to solve puzzles. We've found a new way to assemble puzzles."