Abstract

The development of DNA sequencing more than 30 years ago has profoundly impacted biological research. In the last couple of years, remarkable technological innovations have emerged that allow the direct and cost-effective sequencing of complex samples at unprecedented scale and speed. These next-generation technologies make it feasible to sequence not only static genomes, but also entire transcriptomes expressed under different conditions. These and other powerful applications of next-generation sequencing are rapidly revolutionizing the way genomic studies are carried out. Below, we provide a snapshot of these exciting new approaches to understanding the properties and functions of genomes. Given that sequencing-based assays may increasingly supersede microarray-based assays, we also compare and contrast data obtained from these distinct approaches.

Keywords: ChIP-Seq, high-throughput sequencing, massively parallel sequencing, microarray, RNA-Seq, transcriptome, yeast

Sequencing as never before

Just when the era of sequencing seemed to have passed its peak, technological breakthroughs are launching a new dawn with huge potential and broad applications that are already transforming biological research. Two pioneering papers published in 2005 reported new sequencing developments and provided the first glimpse of things to come [1,2]. The sequencing revolution is currently driven by three commercially available platforms: 454 (Roche), Genome Analyzer (Illumina/Solexa) and ABI-SOLiD (Applied Biosystems) [3,4]. The development of additional platforms is well under way. These new technologies are based on principles different from those of the classical Sanger-based method [5], and they are collectively referred to as ‘next-generation’ sequencing, ‘high-throughput’ (or even ‘ultrahigh-throughput’) sequencing, ‘ultra-deep’ sequencing or ‘massively parallel’ sequencing. (Makes you wonder what terms they will come up with once even more powerful technologies become available…) The individual platforms also differ from one another in the principles they apply, resulting in differences in sequence read lengths and numbers, which may provide distinct advantages and disadvantages for different applications. All technologies have in common, however, that they generate sequences on an unprecedented scale, without the requirement for DNA cloning, and at a fraction of the cost of traditional sequencing. These features are the basis for the current revolution and provide the inspiration to apply sequencing approaches to biological questions that would not have been economically or logistically practical before. Next-generation sequencing should also democratize science in the sense that ambitious sequencing-based projects can now be tackled by individual laboratories or institutes, whereas before such projects would only have been possible in genome centres.

From genomes to function

Early studies applied next-generation sequencing to sampling microbial diversities in a deep mine and in oceans [6–8], launching the field of ‘meta-genomics’ where entire biological communities are sequenced, en masse, to survey the variety of all organisms living together in particular ecosystems [9–13]. Other interesting applications are in the field of ancient DNA research, where next-generation sequencing has been successfully applied to analyse genomes of woolly mammoths [14] and Neanderthals [15,16]. Naturally, next-generation sequencing is also used to decode modern genomes, from bacteria [17,18] and viral isolates [19,20] to James Watson [21]. The latter example illustrates that the power of next-generation sequencing is increasingly exploited to re-sequence strains and individuals for which reference genome sequences are available to sample genomic diversity. Such studies have identified mutations in bacterial strains [22,23], polymorphisms in worm [24], structural variation in the human genome [25] and specific alleles involved in cancer [26].

In addition to established analyses of genome sequences, next-generation sequencing is triggering new assays and applications that should greatly advance our understanding of genome function (Figure 1) [27]. The principle behind these alternative applications, which have been termed ‘sequence census’ methods, is simple: complex DNA or RNA samples are directly sequenced to determine their content. With reference genomes available, short sequence reads are sufficient to map their locations (except for repeated regions), and once mapped, millions of sequence hits are simply counted to determine their genomic distribution (Figure 2). This concept is based on previous approaches such as serial analysis of gene expression [28] and massively parallel signature sequencing [29]. Next-generation sequencing, however, delivers much more information at affordable costs, and it is easy to implement for a wider range of applications. Below, we will survey initial studies that analyse genome function exploiting sequence census methods, which will increasingly supersede microarray-based approaches (Figure 1).

Figure 2. The sequence score is defined as the number of times each base of the reference genome sequence is hit by a sequence read (top panel). Sequence scores (based on normalized read numbers) are then plotted along the genome (bottom panel). Based on data from our fission yeast transcriptome analysis [48].
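The sequence score described above can be illustrated with a minimal Python sketch. It assumes that reads have already been mapped and are given as 0-based, half-open (start, end) intervals on a single reference sequence. The function names and the per-million scaling are illustrative assumptions; the normalization actually used for the figure is not specified here.

```python
from typing import Iterable, List, Tuple


def sequence_scores(genome_length: int,
                    reads: Iterable[Tuple[int, int]]) -> List[int]:
    """Count how many mapped reads hit each base of the reference.

    Reads are (start, end) intervals, 0-based and half-open (an assumption
    made for this sketch).
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, genome_length)
        if start >= end:
            continue  # read lies entirely outside the reference
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the per-base hit count.
    scores = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        scores.append(running)
    return scores


def normalize_scores(scores: List[int], total_reads: int,
                     per: int = 1_000_000) -> List[float]:
    """Scale scores to hits per `per` mapped reads (hypothetical scheme)."""
    if total_reads == 0:
        return [0.0] * len(scores)
    return [s * per / total_reads for s in scores]


reads = [(0, 4), (2, 6), (3, 5)]
scores = sequence_scores(8, reads)
print(scores)
```

Plotting the returned list against genome position gives the kind of profile shown in the bottom panel of the figure.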

Mapping of DNA-binding proteins and chromatin

ChIP-on-chip [ChIP (chromatin immunoprecipitation) using microarrays] [30,31] is a key approach to globally mapping the in vivo binding sites of various DNA-binding proteins across genomes. Instead of using the DNA that is precipitated with the protein of interest to interrogate microarrays, recent studies have directly sequenced this DNA to analyse the protein-binding sites at high resolution. This approach, termed ‘ChIP-Seq’, should produce a huge windfall, in particular for studies in multicellular eukaryotes where whole genome coverage has generally required the use of several arrays.
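The counting idea behind ChIP-Seq can be sketched as follows. This is a toy illustration only: it bins mapped read positions into fixed windows and reports windows that reach a minimum read count. The window size, the threshold and the function names are hypothetical, and the sketch does not reproduce the analyses of the studies cited here, which rely on more elaborate procedures.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def window_counts(read_positions: Iterable[int], window_size: int) -> Counter:
    """Count mapped reads per fixed-size genomic window."""
    counts = Counter()
    for pos in read_positions:
        counts[pos // window_size] += 1
    return counts


def enriched_windows(read_positions: Iterable[int], window_size: int,
                     min_reads: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, read_count) for windows with >= min_reads reads."""
    counts = window_counts(read_positions, window_size)
    return sorted(
        (idx * window_size, (idx + 1) * window_size, n)
        for idx, n in counts.items()
        if n >= min_reads
    )


# Hypothetical mapped read positions on one chromosome.
positions = [5, 12, 101, 105, 110, 118, 130, 450]
peaks = enriched_windows(positions, window_size=100, min_reads=4)
print(peaks)
```

A fixed threshold is the simplest possible choice; in practice the cut-off would have to account for sequencing depth and local background.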

Initial studies looking at the binding sites of human NRSF (neuron-restrictive silencer factor) and STAT1 (signal transducer and activator of transcription 1) [32,33] indicate that the resolution of ChIP-Seq is far better than that of ChIP-on-chip. NRSF, a well-documented zinc-finger repressor that negatively regulates gene expression of neuronal genes in non-neuronal cells, has >80 previously validated targets, providing a well-defined test set to define false-positive and -negative rates [32]. The vast majority of previously known target sites have been confirmed among the ∼2000 targets identified through ChIP-Seq. Moreover, this analysis has exploited the deep sampling and high resolution of ChIP-Seq to identify a novel class of genomic NRSF-binding sites, suggesting the existence of different subclasses of genes regulated by the same factor [32]. Another ChIP-Seq study has mapped the binding sites of STAT1, a transcription factor that regulates genes involved in cell differentiation, survival and proliferation [33]. The dynamic behaviour of STAT1 is of interest as it usually localizes in the cytoplasm, but translocates to the nucleus on stimulation by an extracellular signal. As expected, the authors have observed a large increase in STAT1 binding sites after stimulation of cells with interferon-γ (from 11000 to 41000), and the results also agree well with previously published data.

Approaches to map the genomic protein-binding sites are not limited to transcription factors. One of the first papers demonstrating the utility of ChIP-Seq for whole-genome location analysis [34] mapped the genome-wide sites of 20 histone methylation marks, along with CTCF (CCCTC-binding factor), the histone variant H2A.Z and RNA polymerase II in human cells. The unprecedented detail of these data has led to several valuable conclusions about the association of specific sets of histone modifications with either active or repressed promoters. Such comprehensive and predictive patterns can be used not only to confirm annotated promoters but also to identify new ones [34]. Another comprehensive survey of two types of histone modifications has been reported for pluripotent and lineage-committed mouse cells, revealing how these modifications change during development [35].

In another adaptation of array-based methods, next-generation sequencing has also been applied to map regions with few or no chromatin proteins, namely DNase-hypersensitive sites in human cell lines to identify locations with regulatory elements [36]. Using this approach, ∼95000 DNase-hypersensitive sites have been uncovered, a surprising majority (∼80%) of which are not associated with promoter regions. This finding strongly suggests that the genome is replete with regions of open DNA, many of which may have unrecognized roles in genome function [36].

Deep sampling of transcriptomes

Next-generation sequencing is also changing the way in which gene expression is studied, a shift that is likely to have a large impact in the future. Complex RNA mixtures can be analysed using sequence census methods, an approach termed ‘RNA-Seq’. Initial applications include the accelerated discovery of small RNAs [37–40]. Other studies have used early RNA-Seq approaches to quantify expression levels using a modified paired-end ditagging method [41], to detect rare cardiac mRNAs in mouse by ‘polony multiplex analysis of gene expression’ [42,43], or to directly sequence cDNAs of human tumour and fly cells [44–46]. Another RNA-Seq study has moved beyond simply describing the expression levels of transcripts towards assigning functions to the observed expression differences [47]. Together, such studies pave the way for complete transcriptome coverage, providing the ultimate resolution to analyse the levels as well as the structures of both processed and unprocessed transcripts under different conditions.

We have recently applied RNA-Seq, complemented with high-resolution tiling arrays, to obtain a detailed picture of the fission yeast transcriptome, independently of available gene annotations, at the best possible resolution [48]. The transcriptome has been interrogated under multiple conditions, including rapid proliferation, meiotic differentiation and environmental stress, as well as in splicing and exosome mutants, to analyse the dynamic adaptation of the transcriptional landscape as a function of environmental, developmental and genetic factors. These results provide rich, condition-specific information on widespread transcription, on novel, mostly non-coding transcripts, as well as on untranslated regions and gene structures, thus improving the existing genome annotation. Perhaps most interestingly, sequence reads spanning exon–exon or exon–intron junctions have given a unique and direct insight into a surprising variability in splicing efficiency across introns, genes and conditions. This analysis has revealed that splicing efficiency is largely co-ordinated with transcript levels, and that hundreds of introns show regulated splicing during cellular proliferation or differentiation. These results suggest a global co-ordination between splicing efficiency and transcription, which may help to optimize and streamline gene expression programmes. As elaborated in the next section, the combined RNA-Seq and array data have also allowed us to compare and contrast the relative performance and properties of sequencing- and hybridization-based approaches.
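One simple way to turn junction-spanning reads into a number is sketched below. The ratio used here, exon–exon reads divided by all junction-spanning reads, is an illustrative assumption and not necessarily the measure used in the fission yeast study [48]; the function name and the intron labels are hypothetical.

```python
from typing import Dict, Optional, Tuple


def splicing_efficiency(exon_exon_reads: int,
                        exon_intron_reads: int) -> Optional[float]:
    """Fraction of junction-spanning reads that support the spliced form.

    Returns None when no read spans either type of junction, because the
    efficiency cannot be estimated in that case.
    """
    total = exon_exon_reads + exon_intron_reads
    if total == 0:
        return None
    return exon_exon_reads / total


# Hypothetical counts per intron: (exon-exon reads, exon-intron reads).
junction_counts: Dict[str, Tuple[int, int]] = {
    "intron_1": (30, 10),
    "intron_2": (2, 6),
    "intron_3": (0, 0),
}
efficiencies = {
    name: splicing_efficiency(spliced, unspliced)
    for name, (spliced, unspliced) in junction_counts.items()
}
print(efficiencies)
```

Computing such a value for every intron under each condition would allow the comparison across introns, genes and conditions described above.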

Next-generation sequencing compared with microarrays

Currently, global analysis of gene expression relies largely on hybridization-based platforms such as microarrays, which are routinely used for determining relative expression levels or changes in gene expression between different biological conditions. Unlike hybridization data, which consist of continuous signals, sequence census data consist of absolute numbers of reads (Figure 2). The countable, almost digital, nature of these results makes them highly suitable for the analysis of gene expression levels. Applying sequence census approaches to cDNAs (RNA-Seq), we can therefore estimate the relative abundance of given transcripts by counting the number of times they are hit by sequence reads. Recent studies have shown that, indeed, scores based on the number of sequence reads hitting a transcript, or on the average number of hits per base and per transcript, provide accurate measurements of relative RNA levels [45,46,48]. We have shown that sequencing-based estimates of transcript abundance are in good agreement with estimates acquired with microarrays, provided that sequencing depth is sufficient [48]. In addition, unlike microarray data, which are constrained by the dynamic range of the scanner, sequence data have a linear dynamic range limited only by the sequencing depth. This aspect is attractive, because the range of RNA abundances in a cell is almost certainly larger than the range provided by microarray scanners. Moreover, unlike hybridization-based techniques, sequencing-based approaches produce little or no background noise, allowing detection of even very weakly expressed transcripts [48]. In the short term, however, the costs and amount of data produced make it unlikely that sequence census approaches will completely replace microarrays as the routine tool for expression profiling.
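The two scores mentioned above, the number of reads hitting a transcript and the average number of hits per base, can be sketched as follows. The sketch assumes unspliced transcripts given as single (start, end) intervals and reads given in the same 0-based, half-open coordinates; the gene names and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def transcript_scores(transcripts: Dict[str, Interval],
                      reads: List[Interval]) -> Dict[str, Tuple[int, float]]:
    """Return, per transcript, (read count, mean hits per base)."""
    results = {}
    for name, (t_start, t_end) in transcripts.items():
        n_reads = 0
        covered_bases = 0
        for r_start, r_end in reads:
            # Number of bases shared by the read and the transcript.
            overlap = min(r_end, t_end) - max(r_start, t_start)
            if overlap > 0:
                n_reads += 1
                covered_bases += overlap
        length = t_end - t_start
        mean_hits = covered_bases / length if length > 0 else 0.0
        results[name] = (n_reads, mean_hits)
    return results


transcripts = {"geneA": (0, 10), "geneB": (20, 40)}
reads = [(0, 5), (3, 8), (8, 12), (25, 30)]
scores = transcript_scores(transcripts, reads)
print(scores)
```

The raw read count favours long transcripts, whereas the per-base average corrects for transcript length, which is why the two scores can rank transcripts differently.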

The structure of eukaryotic transcriptomes has received considerable attention with the availability of high-density tiling arrays [49]. These arrays consist of probes tiled evenly across the genome, allowing characterization of transcript structure without prior knowledge of genome annotation. Sequence census methods are likely to provide an exciting new twist in the structural analysis of transcriptomes. Unlike tiling arrays, whose resolution is limited by the number of probes on the platform, sequencing provides, by default, the best possible resolution. This feature may prove particularly powerful for dense genomes with small gene features or for large genomes that would otherwise require a substantial number of tiling arrays to provide adequate resolution. In addition, sequencing of the fission yeast transcriptome has proved sensitive enough to detect widespread transcription in >90% of the genome, including traces of RNAs that are not robustly transcribed or that are rapidly degraded [48], such that very rare isoforms expressed at levels below the detection threshold of tiling arrays can be identified. However, the sequencing costs incurred to reach sufficiently deep coverage of large genomes currently remain an issue. Fortunately, the costs are likely to decrease further, allowing a wider and more extensive use of these technologies.

Sequence census technologies have been developed to analyse double-stranded DNA. The analysis of transcriptomes therefore requires converting RNA into double-stranded cDNA, which, in many protocols, results in the loss of strand-specific information. Given the large extent of overlapping and antisense transcription reported even in simple eukaryotes, this is clearly an issue. Moreover, classical protocols for cDNA synthesis do not produce samples allowing unambiguous detection of overlapping transcription. Therefore, in addition to classical cDNA analysis, techniques allowing the specific analysis of the 5′- and 3′-ends of transcripts, combined with sequence census approaches, should prove useful for unambiguously determining the extent and structure of different transcripts.

Finally, a unique feature of sequence census technologies is their ability to identify, without prior knowledge, transcripts made of sequences that are not adjacent in the genome but that are connected when they are expressed. For instance, spliced transcripts can be uniquely detected through the presence of sequence reads spanning exon–exon junctions. Such positive evidence for splicing is not available from tiling array data, and although spliced transcripts could be probed with specially designed arrays, it would require a priori knowledge of the splice sites. For these reasons, sequence census technologies will provide powerful and versatile tools to study post-transcriptional processing of genetic sequences.
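How junctions can be found without prior knowledge of splice sites is sketched below. The sketch assumes that each read alignment is available as a sorted list of aligned genomic blocks (0-based, half-open), so that a read split across two non-adjacent blocks indicates a junction. The input representation, the function name and the minimum-support threshold are all illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Block = Tuple[int, int]


def find_junctions(alignments: List[List[Block]],
                   min_support: int = 2) -> Dict[Tuple[int, int], int]:
    """Collect junctions supported by at least `min_support` split reads.

    A junction is reported as (end of left block, start of right block).
    """
    support = Counter()
    for blocks in alignments:
        # Compare each aligned block with the next one in the same read.
        for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
            if right_start > left_end:  # a gap in the genome between blocks
                support[(left_end, right_start)] += 1
    return {junction: n for junction, n in support.items()
            if n >= min_support}


alignments = [
    [(90, 100), (200, 220)],   # split read across positions 100..200
    [(85, 100), (200, 215)],   # second read supporting the same junction
    [(150, 180)],              # contiguous read, no junction
    [(300, 310), (400, 420)],  # junction seen only once
]
junctions = find_junctions(alignments)
print(junctions)
```

Requiring more than one supporting read is a simple guard against spurious split alignments; the appropriate threshold would depend on sequencing depth.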

Concluding remarks

Next-generation sequencing technologies have had an enormous impact on research within a short time frame, and this impact appears certain to increase further, as many institutions are now acquiring these new sequencing platforms. Beyond conventional sampling of genome content, wide-ranging applications of next-generation sequencing are rapidly evolving. Sequence census methods such as ChIP-Seq and RNA-Seq are becoming powerful and quantitative approaches to analyse the structures and functions of both genomes and transcriptomes at maximal resolution. At this time, the huge amount of data generated by next-generation sequencing creates an informatics challenge. The establishment of routine data analysis methods, together with future decreases in sequencing costs and increases in the numbers and lengths of sequence reads, will help to unleash the full potential of next-generation sequencing.

Note added in proof (received 4 August 2008)

Since submission of this paper, six additional papers have been published reporting various applications of RNA-Seq [50–55].

Acknowledgments

We thank Josette-Renée Landry, Vera Pancaldi and Falk Schubert for comments on this paper. S.M. was supported by a Fellowship for Advanced Researchers from the Swiss National Science Foundation, and B.T.W. by Sanger postdoctoral and Canadian NSERC (Natural Sciences and Engineering Research Council) fellowships. Work in our laboratory is funded by Cancer Research UK grant number C9546/A6517.

Footnotes

British Yeast Group Meeting 2008: Independent Meeting held at National University of Ireland Maynooth, Maynooth, Co. Kildare, Ireland, 18–20 March 2008. Organized and Edited by Gary Jones (National University of Ireland Maynooth, Ireland).

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial Licence (http://creativecommons.org/licenses/by-nc/2.5/) which permits unrestricted non-commercial use, distribution and reproduction in any medium, provided the original work is properly cited.