NextGen sequencing: What Is It Good For?

Sequencing DNA has become a major industry. The genetic code of an organism contains huge amounts of data, and with it the potential for a greater understanding of how the organism works at an intracellular level. Whole centers and genome sequencing factories now exist to fill this need. While most sequencing is still done using a modified and more efficient version of Sanger's original dideoxy method, next-generation sequencing machines are starting to emerge that can achieve what is imaginatively named massively parallel sequencing. Massive amounts of DNA can be sequenced in parallel, and we're talking MASSIVE amounts of DNA. Illumina/Solexa machines can sequence hundreds of thousands of DNA molecules all in parallel.

The basic Sanger sequencing method is shown below (image taken from the Science Creative Quarterly, which also has a very good description of the process for those more interested in DNA sequencing).

There is a catch in massively parallel sequencing, however. Sequencing works by breaking a large DNA molecule down into smaller 'reads'. Each read is then sequenced, and the reads can be stuck back into the right order (with varying accuracy) once they have all been completed. Sanger sequencing (diagram above) can produce reads up to 1000 base pairs long. NextGen sequencing is lucky if it manages 350 base pairs. NextGen reads also tend to be less accurate.
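To make the idea of sticking reads back into the right order a little more concrete, here is a toy sketch in Python. It is not how real assemblers work; it simply assumes error-free reads that overlap by at least a few bases, and repeatedly merges the pair with the longest overlap. The function names, the example sequence and the minimum overlap of 3 are all made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy greedy assembly: keep merging the two reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


# Hypothetical example: three overlapping reads taken from one short sequence.
original = "ATGCGTACGTTAGC"
reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
assembled = greedy_assemble(reads)
print(assembled)
```

With shorter reads there are more pieces and shorter overlaps to work with, which is one way to see why read length matters for putting a sequence back together.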

What they are is cheap, which gives geneticists an important tool: large numbers of short genome reads generated at very low cost. While these NextGen techniques are being improved, and there are many people looking into making them more effective for de novo gene sequencing, they are also being put to use in other areas, where the ability to sequence large numbers of short genomic sequences at low cost is hugely beneficial.

The most obvious areas are those where you don't need a particularly long sequence, such as when you just need to find the site of origin of a particular length of DNA. This is particularly useful for looking at transcribed portions of the DNA (those parts that are actually turned into proteins). Short bits of the transcribed RNA copy (the one used to make the protein) can be sequenced and compared to the original DNA sequence. This shows where the DNA corresponding to the protein is and, possibly more importantly, provides concrete evidence that it is being transcribed. In this situation the short reads aren't a problem, although there are still issues with the accuracy.
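As a rough illustration of finding the site of origin of a short read, the sketch below looks up each read in a reference sequence by exact string matching. This is a deliberate simplification: it assumes the reads are already written in the DNA alphabet and contain no sequencing errors, whereas real read-mapping tools have to tolerate mismatches. The reference string, the reads and the function name are invented for the example.

```python
def map_reads(reference, reads):
    """Return, for each read, every 0-based position where it occurs exactly in the reference."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:
            positions.append(start)
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits


# Hypothetical reference sequence and short reads.
reference = "ATGCGTACGTTAGCATGCA"
reads = ["CGTA", "ATGC", "GGGG"]

for read, positions in map_reads(reference, reads).items():
    if positions:
        print(read, "maps to position(s)", positions)
    else:
        print(read, "does not map anywhere")
```

A read that maps to more than one position (like the second one here) is ambiguous, and a read with a single wrong base would not map at all under exact matching, which is where the accuracy issue bites.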

Another application is to look for novel small RNAs. These are small sections of RNA which regulate gene expression. They were discovered fairly recently (in plants originally), so there's quite a lot of excitement about them. As they're only small, the length of the reads is not a problem. Pyrosequencing (a form of NextGen sequencing) was used to discover the Piwi-interacting RNAs, which are linked to transcriptional silencing in germ line cells.

NextGen sequencing also has a role in protein-coding gene annotation. Protein-coding genes can be quite long, and would require several reads from NextGen techniques, but the low cost of these methods means that they are starting to be used for annotating protein-coding regions. Integrating them with paired-end sequencing (which allows the reads to be re-connected more easily) removes some of the problems of shorter reads, and novel techniques are continually being explored to increase the accuracy.

NextGen machines are also starting to be used more for metagenomics, which works by taking random soil or water samples and sequencing every bit of DNA you can find, regardless of which organism it comes from. A metagenomics project in the Sargasso Sea (strangely enough most of these projects tend to take place in warmer climates... no one appears to do metagenomics in, say, Iceland) produced over 1.2 million unknown gene sequences. These are suspected to be from 'unculturable' bacteria, which for some reason just don't grow in the lab, and metagenomics has revealed a huge number of these bacteria within the ecosystem.

If you want a novel genome sequenced your best bet is still to send it down to the Sanger Centre and be very polite to everyone who works there, but the growth of cheaper machines with massively parallel sequencing provides a whole range of new applications. Even if NextGen machines never quite reach the accuracy and read length of Sanger machines, there are still many areas in science to which they provide a large benefit.

-

EDIT: I have been informed by people who know a lot more about this than me that NextGen sequencers are now pretty much exclusively used for whole-genome sequencing. It appears my knowledge is a little out of date. However, this post is still an interesting exploration of the other applications of NextGen sequencers, so I'll leave it as it stands.

5 comments:

Really good piece, and a nice 'sequencing primer' to boot (pardon the pun)! While scientists indeed seem to have a preference for the more tropical surroundings, some of them do face the harsher cold environments: http://www.pnas.org/content/early/2009/10/21/0908274106.abstract. No next-gen sequencing there though..

454 pyrosequencing (a form of next, or second, generation sequencing) has a median read length of 700bp, with some reads being over 1kb, and they promise a median read length of 1kb not too far in the future.

Also, while next gen sequencing is less accurate, and has less complete coverage, per base, it is still far more accurate per unit cost than Sanger sequencing, because you can sequence to very high depth for a fraction of the cost (a single run produces a 30X genome, which is close in quality to a 10X Sanger whole-genome shotgun). Second gen sequencing is basically the only way to do whole-genome sequencing; no one does WGS with Sanger methods, and no one has for years.

Luke: I was waiting for your comment :) Very useful information, I honestly did not know that about pyrosequencing. They told us in lectures that 350bp was the maximum! And the references are all from a couple of years ago, which I know is practically stone age in some areas of science. Also wasn't aware that NextGen machines were used for WGS, although I had a feeling the price might make them more feasible, especially as you can just do multiple reads very quickly over a single area if you aren't sure about it.

(Sorry for the jargon; a 30X genome means that you sequence each base on average 30 times. The shorter read length and poorer accuracy mean that you need to sequence 3-4 times as much DNA by 2nd gen sequencing than by Sanger to get a similar quality)
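For anyone who wants to see the arithmetic behind "30X", here is a small sketch. It uses the usual back-of-the-envelope definition (total bases sequenced divided by genome size) and assumes every read has the same length; the genome size, read length and read count are made-up numbers, not figures from any real run.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average number of times each base is sequenced: total bases read / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of equal-length reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical numbers: a 10 million base genome and 300-base reads.
genome_size = 10_000_000
read_length = 300

print(mean_coverage(1_000_000, read_length, genome_size))  # average depth for one million reads
print(reads_needed(30, read_length, genome_size))          # reads needed for a 30X genome
print(reads_needed(10, read_length, genome_size))          # reads needed for a 10X genome
```

The 30X target needs three times as many reads as the 10X one at the same read length, which is the sense in which more DNA has to be sequenced to reach a similar quality.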