Atopic dermatitis (AD; eczema) is characterized by a widespread abnormality in cutaneous barrier function and propensity to inflammation. Filaggrin is a multifunctional protein and plays a key role in skin barrier formation. Loss‐of‐function mutations in the gene encoding filaggrin (FLG) are a highly significant risk factor for atopic disease, but the molecular mechanisms leading to dermatitis remain unclear. We sought to interrogate tissue‐specific variations in the expressed genome in the skin of children with AD and to investigate underlying patho‐mechanisms in atopic skin. We applied single‐molecule direct RNA sequencing to analyze the whole transcriptome using minimal tissue samples. Uninvolved skin biopsy specimens from 26 pediatric patients with AD were compared with site‐matched samples from 10 nonatopic teenage control subjects. Cases and control subjects were screened for FLG genotype to stratify the data set. Two thousand four hundred thirty differentially expressed genes (false discovery rate, P < .05) were identified, of which 211 were significantly upregulated and 490 downregulated by greater than 2‐fold. Gene ontology terms for “extracellular space” and “defense response” were enriched, whereas “lipid metabolic processes” were downregulated. The subset of FLG wild‐type cases showed dysregulation of genes involved with lipid metabolism, whereas filaggrin haploinsufficiency affected global gene expression and was characterized by a type 1 interferon–mediated stress response. These analyses demonstrate the importance of extracellular space and lipid metabolism in atopic skin pathology independent of FLG genotype, whereas an aberrant defense response is seen

We measured half‐lives of 21,248 mRNA 3′ isoforms in yeast by rapidly depleting RNA polymerase II from the nucleus and performing direct RNA sequencing throughout the decay process. Interestingly, half‐lives of mRNA isoforms from the same gene, including nearly identical isoforms, often vary widely. Based on clusters of isoforms with different half‐lives, we identify hundreds of sequences conferring stabilization or destabilization upon mRNAs terminating downstream. One class of stabilizing element is a polyU sequence that can interact with poly(A) tails, inhibit the association of poly(A)‐binding protein, and confer increased stability upon introduction into ectopic transcripts. More generally, destabilizing and stabilizing elements are linked to the propensity of the poly(A) tail to engage in double‐stranded structures. Isoforms engineered to fold into 3′ stem‐loop structures not involving the poly(A) tail exhibit even longer half‐lives. We suggest that double‐stranded structures at 3′ ends are a major determinant of mRNA stability

The microglial sensome revealed by direct RNA sequencing

Microglia, the principal neuroimmune sentinels of the brain, continuously sense changes in their environment and respond to invading pathogens, toxins and cellular debris. Microglia exhibit plasticity and can assume neurotoxic or neuroprotective priming states that determine their responses to danger. We used direct RNA sequencing, without amplification or cDNA synthesis, to determine the quantitative transcriptomes of microglia of healthy adult and aged mice. We validated our findings using fluorescence dual in situ hybridization, unbiased proteomic analysis and quantitative PCR. We found that microglia have a distinct transcriptomic signature and express a unique cluster of transcripts encoding proteins for sensing endogenous ligands and microbes that we refer to as the sensome. With aging, sensome transcripts for endogenous ligand recognition were downregulated, whereas those involved in microbe recognition and host defense were upregulated. In addition, aging was associated with an overall increase in the expression of microglial genes involved in neuroprotection.

Alternative cleavage and polyadenylation influence the coding and regulatory potential of mRNAs and where transcription termination occurs. Although widespread, few regulators of this process are known. The Arabidopsis thaliana protein FPA is a rare example of a trans‐acting regulator of poly(A) site choice. Analysing fpa mutants therefore provides an opportunity to reveal generic consequences of disrupting this process. We used direct RNA sequencing to quantify shifts in RNA 3′ formation in fpa mutants. Here we show that specific chimeric RNAs formed between the exons of otherwise separate genes are a striking consequence of loss of FPA function. We define intergenic read‐through transcripts resulting from defective RNA 3′ end formation in fpa mutants and detail cryptic splicing and antisense transcription associated with these readthrough RNAs. We identify alternative polyadenylation within introns that is sensitive to FPA and show FPA‐dependent shifts in IBM1 poly(A) site selection that differ from those recently defined in mutants defective in intragenic heterochromatin and DNA methylation. Finally, we show that defective termination at specific loci in fpa mutants is shared with dicer‐like 1 (dcl1) or dcl4 mutants, leading us to develop alternative explanations for some silencing roles of these proteins. We relate our findings to the impact that altered patterns of 3′ end formation can have on gene and genome organization.

It has recently been shown that RNA 3′ end formation plays a more widespread role in controlling gene expression than previously thought. In order to examine the impact of regulated 3′ end formation genome‐wide we applied direct RNA sequencing to A. thaliana. Here we show the authentic transcriptome in unprecedented detail and how 3′ end formation impacts genome organization. We reveal extreme heterogeneity in RNA 3′ ends, discover previously unrecognized non‐coding RNAs and propose widespread re‐annotation of the genome. We explain the origin of most poly(A)+ antisense RNAs and identify cis‐elements that control 3′ end formation in different registers. These findings are essential to understand what the genome actually encodes, how it is organized and the impact of regulated 3′ end formation on these processes

The emerging discoveries on the link between polyadenylation and disease states underline the need to fully characterize genome‐wide polyadenylation states. Here, we report comprehensive maps of global polyadenylation events in human and yeast generated using refinements to the Direct RNA Sequencing technology. This direct approach provides a quantitative view of genome‐wide polyadenylation states in a strand‐specific manner and requires only attomole RNA quantities. The polyadenylation profiles revealed an abundance of unannotated polyadenylation sites, alternative polyadenylation patterns, and regulatory element‐associated poly(A)+ RNAs. We observed differences in sequence composition surrounding canonical and noncanonical human polyadenylation sites, suggesting novel noncoding RNA‐specific polyadenylation mechanisms in humans. Furthermore, we observed the correlation level between sense and antisense transcripts to depend on gene expression levels, supporting the view that overlapping transcription from opposite strands may play a regulatory role. Our data provide a comprehensive view of the polyadenylation state and overlapping transcription.

Our understanding of human biology and disease is ultimately dependent on a complete understanding of the genome and its functions. The recent application of microarray and sequencing technologies to transcriptomics has changed the simplistic view of transcriptomes to a more complicated view of genome‐wide transcription where a large fraction of transcripts emanates from unannotated parts of genomes [1–7], and underlined our limited knowledge of the dynamic state of transcription. Most of this broad body of knowledge was obtained indirectly because current transcriptome analysis methods typically require RNA to be converted to complementary DNA (cDNA) before measurements, even though the cDNA synthesis step introduces multiple biases and artifacts that interfere with both the proper characterization and quantification of transcripts [8–18]. Furthermore, cDNA synthesis is not particularly suitable for the analysis of short, degraded and/or small quantity RNA samples. Here we report direct single molecule RNA sequencing without prior conversion of RNA to cDNA. We applied this technology to sequence femtomole quantities of poly(A)+ Saccharomyces cerevisiae RNA using a surface coated with poly(dT) oligonucleotides to capture the RNAs at their natural poly(A) tails and initiate sequencing by synthesis. We observed transcript 3′ end heterogeneity and polyadenylated small nucleolar RNAs. This study provides a path to high‐throughput and low‐cost direct RNA sequencing and achieving the ultimate goal of a comprehensive and bias‐free understanding of transcriptomes.

The known regulatory role of 3′ untranslated regions (3′UTRs) and poly(A) tails in RNA localization, stability, and translation, and polyadenylation regulation defects leading to human diseases such as oculopharyngeal muscular dystrophy, thalassemias, thrombophilia, and IPEX syndrome underline the need to fully characterize genome‐wide polyadenylation states and mechanisms across normal physiological and disease states. This chapter outlines the quantitative polyadenylation site mapping and analysis strategies developed with the single‐molecule direct RNA sequencing technology

Single-Molecule Direct RNA sequencing without cDNA Synthesis

Methods in Molecular Biology March 14, 2011

Fatih Ozsolak & Patrice M. Milos

Abstract

Methods for in‐depth genome‐wide characterization of transcriptomes and quantification of transcript levels using various microarray and next generation sequencing technologies have emerged as valuable tools for understanding cellular physiology and human disease biology and have begun to be utilized in various clinical diagnostic applications. Current methods, however, typically require RNA to be converted to complementary DNA prior to measurements. This step has been shown to introduce many biases and artifacts. In order to best characterize the ‘true’ transcriptome, the single‐molecule direct RNA sequencing (DRS) technology was developed. This review focuses on the underlying principles behind the DRS, sample preparation steps, and the current and novel avenues of research and applications DRS offers. WIREs RNA 2011 2 565–570 DOI: 10.1002/wrna.84.

Transcriptome Profiling Using Single-Molecule Direct RNA Sequencing

Methods in Molecular Biology February 23, 2011

Fatih Ozsolak, Patrice M. Milos

Abstract

Methods for in‐depth characterization of transcriptomes and quantification of transcript levels have emerged as valuable tools for understanding cellular physiology and human disease biology, and have begun to be utilized in various clinical diagnostic applications. Current methods typically require RNA to be converted to cDNA prior to comprehensive measurements. However, this cDNA conversion process has been shown to introduce many biases and artifacts that interfere with the proper characterization and quantitation of transcripts. We have developed a direct RNA sequencing (DRS) approach, in which, unlike other technologies, RNA is sequenced directly without prior conversion to cDNA. The benefits of DRS include the ability to use minute quantities (e.g. on the order of several femtomoles) of RNA with minimal sample preparation, the ability to analyze short RNAs, which pose unique challenges for analysis using cDNA‐based approaches, and the ability to perform these analyses in a low‐cost and high‐throughput manner. Here, we describe the strategies and procedures we employ to prepare various RNA species for analysis with DRS.

Enhancers control the correct temporal and cell‐type‐specific activation of gene expression in multicellular eukaryotes. Knowing their properties, regulatory activity and targets is crucial to understand the regulation of differentiation and homeostasis. Here we use the FANTOM5 panel of samples, covering the majority of human tissues and cell types, to produce an atlas of active, in vivo‐transcribed enhancers. We show that enhancers share properties with CpG‐poor messenger RNA promoters but produce bidirectional, exosome‐sensitive, relatively short unspliced RNAs, the generation of which is strongly related to enhancer activity. The atlas is used to compare regulatory programs between different cells at unprecedented depth, to identify disease‐associated regulatory single nucleotide polymorphisms, and to classify cell‐type‐specific and ubiquitous enhancers. We further explore the utility of enhancer redundancy, which explains gene expression strength rather than expression patterns. The online FANTOM5 enhancer atlas represents a unique resource for studies on cell‐type‐specific enhancers and gene regulation.

A promoter-level mammalian expression atlas

Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single‐molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly ‘housekeeping’, whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell‐type‐specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter‐based expression analysis reveals key transcription factors defining cell states and links them to binding‐site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample ontology enrichment analyses. The functional annotation of the mammalian genome 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell‐type‐specific transcriptomes with wide applications in biomedical research.

We report the development of a simplified cap analysis of gene expression (CAGE) protocol adapted for single‐molecule sequencers that avoids second strand synthesis, ligation, digestion, and PCR. HeliScopeCAGE directly sequences the 3′ end of cap trapped first‐strand cDNAs. As with previous versions of CAGE, we better define transcription start sites (TSS) than known models, identify novel regions of transcription and alternative promoters, and find two major classes of TSS signal, sharp peaks and broad regions. However, using this protocol, we observe reproducible evidence of regulation at the much finer level of individual TSS positions. The libraries are quantitative over 5 orders of magnitude and highly reproducible (Pearson’s correlation coefficient of 0.987). We have also scaled down the sample requirement to 5 μg of total RNA for a standard HeliScopeCAGE library and 100 ng for a low‐quantity version. When the same RNA was run as 5‐μg and 100‐ng versions, the 100‐ng library was still able to detect expression for ∼60% of the 13,468 loci detected by a 5‐μg library using the same threshold, allowing comparative analysis of even rare cell populations. Testing the protocol for differential gene expression measurements on triplicate HeLa and THP‐1 samples, we find that the log fold change compared to Illumina microarray measurements is highly correlated (0.871). In addition, HeliScopeCAGE finds differential expression for thousands more loci including those with probes on the array. Finally, although the majority of tags are 5′ associated, we also observe a low level of signal on exons that is useful for defining gene structures.

Adaptation of bacterial pathogens to a host can lead to the selection and accumulation of specific mutations in their genomes with profound effects on the overall physiology and virulence of the organisms. The opportunistic pathogen Pseudomonas aeruginosa is capable of colonizing the respiratory tract of individuals with cystic fibrosis (CF), where it undergoes evolution to optimize survival as a persistent chronic human colonizer. The transcriptome of a host‐adapted, alginate‐overproducing isolate from a CF patient was determined following growth of the bacteria in the presence of human respiratory mucus. This stable mucoid strain responded to a number of regulatory inputs from the mucus, resulting in an unexpected repression of alginate production. Mucus in the medium also induced the production of catalases and additional peroxide‐detoxifying enzymes and caused reorganization of pathways of energy generation. A specific antibacterial type VI secretion system was also induced in mucus‐grown cells. Finally, a group of small regulatory RNAs was identified and a fraction of these were mucus regulated. This report provides a snapshot of responses in a pathogen adapted to a human host through assimilation of regulatory signals from tissues, optimizing its long‐term survival potential.

Pathogens adapt to the host environment by altering their patterns of gene expression. Microarray‐based and genetic techniques used to characterize bacterial gene expression during infection are limited in their ability to comprehensively and simultaneously monitor genome‐wide transcription. We used massively parallel cDNA sequencing (RNA‐seq) techniques to quantitatively catalog the transcriptome of the cholera pathogen, Vibrio cholerae, derived from two animal models of infection. Transcripts elevated in infected rabbits and mice relative to laboratory media

Satellite repeats in heterochromatin are transcribed into noncoding RNAs that have been linked to gene silencing and maintenance of chromosomal integrity. Using digital gene expression analysis, we showed that these transcripts are greatly overexpressed in mouse and human epithelial cancers. In 8 of 10 mouse pancreatic ductal adenocarcinomas (PDACs), pericentromeric satellites accounted for a mean 12% (range 1 to 50%) of all cellular transcripts, a mean 40‐fold increase over that in normal tissue. In 15 of 15 human PDACs, alpha satellite transcripts were most abundant and HSATII transcripts were highly specific for cancer. Similar patterns were observed in cancers of the lung, kidney, ovary, colon, and prostate. Derepression of satellite transcripts correlated with overexpression of the long interspersed nuclear element 1 (LINE‐1) retrotransposon and with aberrant expression of neuroendocrine‐associated genes proximal to LINE‐1 insertions. The overexpression of satellite transcripts in cancer may reflect global alterations in heterochromatin silencing and could potentially be useful as a biomarker for cancer detection.

Quantification of the Yeast Transcriptome by Single-Molecule Sequencing

We present single‐molecule sequencing digital gene expression (smsDGE), a high‐throughput, amplification‐free method for accurate quantification of the full range of cellular polyadenylated RNA transcripts using a Helicos Genetic Analysis system. smsDGE involves a reverse transcription and polyA‐tailing sample preparation procedure followed by sequencing that generates a single read per transcript. We applied smsDGE to the transcriptome of Saccharomyces cerevisiae strain DBY746, using 6 of the available 50 channels in a single sequencing run, yielding on average 12 million aligned reads per channel. Using spiked‐in RNA, accurate quantitative measurements were obtained over four orders of magnitude. High correlation was demonstrated across independent flow‐cell channels, instrument runs and sample preparations. Transcript counting in smsDGE is highly efficient due to the representation of each transcript molecule by a single read. This efficiency, coupled with the high throughput enabled by the single‐molecule sequencing platform, provides an alternative method for expression profiling.

RNA Sequencing and Quantitation Using the Helicos Genetic Analysis System

The recent transition in gene expression analysis technology to ultra high‐throughput cDNA sequencing provides a means for higher quantitation sensitivity across a wider dynamic range than previously possible. Sensitivity of detection is mostly a function of the sheer number of sequence reads generated. Typically, RNA is converted to cDNA using random hexamers and the cDNA is subsequently sequenced (RNA‐Seq). With this approach, higher read numbers are generated for long transcripts as compared to short ones. This length bias necessitates the generation of very high read numbers to achieve sensitive quantitation of short, low‐expressed genes. To eliminate this length bias, we have developed an ultra high‐throughput sequencing approach where only a single read is generated for each transcript molecule (single‐molecule sequencing Digital Gene Expression (smsDGE)). So, for example, equivalent quantitation accuracy of the yeast transcriptome can be achieved by smsDGE using only 25% of the reads that would be required using RNA‐Seq. For sample preparation, RNA is first reverse‐transcribed into single‐stranded cDNA using oligo‐dT as a primer. A poly‐A tail is then added to the 3′ ends of cDNA to facilitate the hybridization of the sample to the Helicos® single‐molecule sequencing Flow‐Cell to which a poly dT oligo serves as the substrate for subsequent sequencing by synthesis. No PCR, sample‐size selection, or ligation steps are required, thus avoiding possible biases that may be introduced by such manipulations. Each tailed cDNA sample is injected into one of 50 flow‐cell channels and sequenced on the Helicos® Genetic Analysis System. Thus, 50 samples are sequenced simultaneously generating 10–20 million sequence reads on average for each sample channel. The sequence reads can then be aligned to the reference of choice such as the transcriptome, for quantitation of known transcripts, or the genome for novel transcript discovery.
This chapter provides a summary of the methods required for smsDGE.
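
The length bias described in this abstract can be illustrated with a small sketch. It is not the authors' method: it assumes an idealized model in which random-primed RNA-Seq yields reads in proportion to transcript length times copy number, while a one-read-per-molecule approach such as smsDGE yields reads in proportion to copy number alone. The function names and the toy transcripts are hypothetical.

```python
from fractions import Fraction

def length_weighted_read_share(transcripts):
    """Expected read share when reads scale with length x copy number
    (idealized assumption for random-primed RNA-Seq)."""
    total = sum(length * copies for length, copies in transcripts.values())
    return {name: Fraction(length * copies, total)
            for name, (length, copies) in transcripts.items()}

def per_molecule_read_share(transcripts):
    """Expected read share when each transcript molecule yields one read
    (idealized assumption for an smsDGE-style protocol)."""
    total = sum(copies for _, copies in transcripts.values())
    return {name: Fraction(copies, total)
            for name, (_, copies) in transcripts.items()}

# Hypothetical transcripts: name -> (length in nt, copies per cell)
toy = {"short_gene": (500, 10), "long_gene": (5000, 10)}

print(length_weighted_read_share(toy))  # short gene gets 1/11 of reads
print(per_molecule_read_share(toy))     # short gene gets 1/2 of reads
```

Under these assumptions, two equally abundant transcripts receive equal read shares only in the per-molecule model, which is why fewer total reads are needed to quantify short transcripts.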

Circulating tumour cells (CTCs) shed into blood from primary cancers include putative precursors that initiate distal metastases [1]. Although these cells are extraordinarily rare, they may identify cellular pathways contributing to the blood‐borne dissemination of cancer. Here, we adapted a microfluidic device [2] for efficient capture of CTCs from an endogenous mouse pancreatic cancer model [3] and subjected CTCs to single‐molecule RNA sequencing [4], identifying Wnt2 as a candidate gene enriched in CTCs. Expression of WNT2 in pancreatic cancer cells suppresses anoikis, enhances anchorage‐independent sphere formation, and increases metastatic propensity in vivo. This effect is correlated with fibronectin upregulation and suppressed by inhibition of MAP3K7 (also known as TAK1) kinase. In humans, formation of non‐adherent tumour spheres by pancreatic cancer cells is associated with upregulation of multiple WNT genes, and pancreatic CTCs revealed enrichment for WNT signalling in 5 out of 11 cases. Thus, molecular analysis of CTCs may identify candidate therapeutic targets to prevent the distal spread of cancer.

The accurate and thorough genome‐wide detection of adenosine‐to‐inosine editing, a biologically indispensable process, has proven challenging. Here, we present a discovery pipeline in adult Drosophila, with 3,581 high‐confidence editing sites identified with an estimated accuracy of 87%. The target genes and specific sites highlight global biological properties and functions of RNA editing, including hitherto‐unknown editing in well characterized classes of noncoding RNAs and 645 sites that cause amino acid substitutions, usually at conserved positions. The spectrum of functions that these gene targets encompass suggests that editing participates in a diverse set of cellular processes. Editing sites in Drosophila exhibit sequence‐motif preferences and tend to be concentrated within a small subset of total RNAs. Finally, editing regulates expression levels of target mRNAs and strongly correlates with alternative splicing.

The function of RNA from the non-coding (the so called “dark matter”) regions of the genome has been a subject of considerable recent debate. Perhaps the most controversy is regarding the function of RNAs found in introns of annotated transcripts, where most of the reads that map outside of exons are usually found. However, it has been reported that the levels of RNA in introns are minor relative to those of the corresponding exons, and that changes in the levels of intronic RNAs correlate tightly with that of adjacent exons. This would suggest that RNAs produced from the vast expanse of intronic space are just pieces of pre-mRNAs or excised introns en route to degradation. We present data that challenges the notion that intronic RNAs are mere by-standers in the cell. By performing a highly quantitative RNAseq analysis of transcriptome changes during an inflammation time course, we show that intronic RNAs have a number of features that would be expected from functional, standalone RNA species. We show that there are thousands of introns in the mouse genome that generate RNAs whose overall abundance, which changes throughout the inflammation time course, and other properties suggest that they function in yet unknown ways. So far, the focus of non-coding RNA discovery has shied away from intronic regions as those were believed to simply encode parts of pre-mRNAs. Results presented here suggest a very different situation: the sequences encoded in the introns appear to harbor a yet unexplored reservoir of novel, functional RNAs. As such, they should not be ignored in surveys of functional transcripts or other genomic studies.

Discovery that the transcriptional output of the human genome is far more complex than predicted by the current set of protein‐coding annotations and that most RNAs produced do not appear to encode proteins has transformed our understanding of genome complexity and suggests new paradigms of genome regulation. However, the fraction of all cellular RNA whose function we do not understand and the fraction of the genome that is utilized to produce that RNA remain controversial. This is not simply a bookkeeping issue because the degree to which this unannotated transcription is present has important implications with respect to its biologic function and to the general architecture of genome regulation. For example, efforts to elucidate how non‐coding RNAs (ncRNAs) regulate genome function will be compromised if that class of RNAs is dismissed as simply ‘transcriptional noise’. We show that the relative mass of RNA whose function and/or structure we do not understand (the so called ‘dark matter’ RNAs), as a proportion of all non‐ribosomal, non‐mitochondrial human RNA (mt‐RNA), can be greater than that of protein‐encoding transcripts. This observation is obscured in studies that focus only on polyA‐selected RNA, a method that enriches for protein coding RNAs and at the same time discards the vast majority of RNA prior to analysis. We further show the presence of a large number of very long, abundantly‐transcribed regions (100’s of kb) in intergenic space and further show that expression of these regions is associated with neoplastic transformation. These overlap some regions found previously in normal human embryonic tissues and raise an interesting hypothesis as to the function of these ncRNAs in both early development and neoplastic transformation.
We conclude that ‘dark matter’ RNA can constitute the majority of non‐ribosomal, non‐mitochondrial‐RNA and a significant fraction arises from numerous very long, intergenic transcribed regions that could be involved in neoplastic transformation

Short (<200 nt) RNA (sRNA) profiling of human cells using various technologies demonstrates unexpected complexity of sRNAs with 100’s of thousands of sRNA species present [1–4]. Genetic and in vitro studies argue that these RNAs are not merely degradation products of longer transcripts but could indeed have function [1, 2, 5]. Furthermore, profiling of RNAs, including the sRNAs, can reveal not only novel transcripts, but also make clear predictions about the existence and properties of novel biochemical pathways operating in a cell. For example, short RNA profiling in human cells suggested existence of an unknown capping mechanism operating on cleaved RNA [2], a biochemical component of which was later identified [6]. Here we show that human cells contain a novel type of sRNAs that have non-genomically encoded 5’ polyU tails. Presence of these RNAs at the termini of genes, specifically at the very 3’ ends of known mRNAs, strongly argues for the presence of a yet uncharacterized endogenous biochemical pathway in a cell that can copy RNA. We show that this pathway can operate on multiple genes, with specific enrichment towards transcripts encoding components of the translational machinery. Finally we show that genes are also flanked by sense, 3’ polyadenylated sRNAs that are likely to be capped.

Bromodomain and extraterminal (BET) domain proteins have emerged as promising therapeutic targets in glioblastoma and many other cancers. Small molecule inhibitors of BET bromodomain proteins reduce expression of several oncogenes required for Glioblastoma Multiforme (GBM) progression. However, the mechanism through which BET protein inhibition reduces GBM growth is not completely understood. Long noncoding RNAs (lncRNAs) are important epigenetic regulators with critical roles in cancer initiation and malignant progression, but mechanistic insight into their expression and regulation by BET bromodomain inhibitors remains elusive. In this study, we used Helicos single molecule sequencing to comprehensively profile lncRNAs differentially expressed in GBM, and we identified a subset of GBM‐specific lncRNAs whose expression is regulated by BET proteins. Treatment of GBM cells with the BET bromodomain inhibitor I‐BET151 reduced levels of the tumor‐promoting lncRNA HOX transcript antisense RNA (HOTAIR) and restored the expression of several other GBM down‐regulated lncRNAs. Conversely, overexpression of HOTAIR in conjunction with I‐BET151 treatment abrogates the antiproliferative activity of the BET bromodomain inhibitor. Moreover, chromatin immunoprecipitation analysis demonstrated binding of Bromodomain Containing 4 (BRD4) to the HOTAIR promoter, suggesting that BET proteins can directly regulate lncRNA expression. Our data unravel a previously unappreciated mechanism through which BET proteins control tumor growth of glioblastoma cells and suggest that modulation of lncRNA networks may, in part, mediate the antiproliferative effects of many epigenetic inhibitors currently in clinical trials for cancer and other diseases.

In the past decade, numerous studies have made connections between sequence variants in human genomes and predisposition to complex diseases. However, most of these variants lie outside of the charted regions of the human genome whose function we understand; that is, the sequences that encode proteins. Consequently, the general concept of a mechanism that translates these variants into predisposition to diseases has been lacking, potentially calling into question the validity of these studies. Here we make a connection between the growing class of apparently functional RNAs that do not encode proteins and whose function we do not yet understand (the so‐called ‘dark matter’ RNAs) and the disease‐associated variants. We review advances made in a different genomic mapping effort – unbiased profiling of all RNA transcribed from the human genome – and provide arguments that the disease‐associated variants exert their effects via perturbation of regulatory properties of non‐coding RNAs existing in mammalian cells.

Heterochromatin formation drives epigenetic mechanisms associated with silenced gene expression. Repressive heterochromatin is established through the RNA interference pathway, triggered by double‐stranded RNAs (dsRNAs) that can be modified via RNA editing. However, the biological consequences of such modifications remain enigmatic. Here we show that RNA editing regulates heterochromatic gene silencing in Drosophila. We utilize the binding activity of an RNA‐editing enzyme to visualize the in vivo production of a long dsRNA trigger mediated by Hoppel transposable elements. Using homologous recombination, we delete this trigger, dramatically altering heterochromatic gene silencing and chromatin architecture. Furthermore, we show that the trigger RNA is edited and that dADAR serves as a key regulator of chromatin state. Additionally, dADAR auto‐editing generates a natural suppressor of gene silencing. Lastly, systemic differences in RNA editing activity generate inter‐individual variation in silencing state within a population. Our data reveal a global role for RNA editing in regulating gene expression.

On the importance of small changes in RNA expression

The analysis of the differential expression of genes has been the key goal of many molecular biology methods for decades and will remain with us for decades to come. It constitutes a fundamental resource at our disposal for determining the relationship between products of transcription, biology and disease. The completed genome sequencing of many common species allowed microarrays and RNA sequencing (RNAseq) to become major tools in Systems Biology. However, we estimate that at least half of all experiments ignore transcripts that change less than some subjectively chosen threshold, typically around 2–3 fold. Here we show that a majority of the informative RNAs and differentially expressed transcripts can exhibit fold changes less than 2. We use highly quantitative single‐molecule sequencing of total cellular RNA derived from a time course of inflammatory response, a process critical to a large number of diseases. Furthermore, we show that enrichment of biologically‐relevant functions occurs even at very low fold changes in RNA levels. In addition, we show that most of the common statistical methods can reliably detect transcripts with low fold change when as few as 3 biological replicates are sequenced using single‐molecule based RNAseq. In conclusion, given the prevalence of expression profiling in current research, the loss of data in half of all expression studies results in a significant, yet needless drain on the discovery process
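
The effect of a fold-change cutoff can be illustrated with a minimal sketch. The gene names and expression values below are invented for illustration and are not data from the study; the sketch only shows how a 2-fold threshold discards a transcript whose three replicates change consistently but modestly.

```python
import math
from statistics import mean

# Hypothetical normalized expression values, 3 biological replicates per
# condition. Gene names and numbers are illustrative assumptions only.
control = {
    "geneA": [100.0, 104.0, 96.0],
    "geneB": [50.0, 52.0, 48.0],
    "geneC": [200.0, 190.0, 210.0],
}
treated = {
    "geneA": [150.0, 156.0, 144.0],   # consistent 1.5-fold increase
    "geneB": [200.0, 208.0, 192.0],   # 4-fold increase
    "geneC": [205.0, 195.0, 200.0],   # no change in the mean
}

def log2_fold_change(ctrl, trt):
    """log2 of the ratio of replicate means (treated over control)."""
    return math.log2(mean(trt) / mean(ctrl))

def passes_threshold(lfc, fold=2.0):
    """True if the absolute change is at least `fold`-fold."""
    return abs(lfc) >= math.log2(fold)

lfc = {gene: log2_fold_change(control[gene], treated[gene]) for gene in control}
kept = [gene for gene, value in lfc.items() if passes_threshold(value)]
dropped = [gene for gene in lfc if gene not in kept]

for gene, value in lfc.items():
    print(f"{gene}: log2FC = {value:.3f}, kept = {gene in kept}")
```

In this toy data set, geneA changes reproducibly across all replicates yet is removed by the cutoff together with the unchanged geneC, which is the kind of loss the abstract argues against. A real analysis would use a statistical test on the replicates rather than the threshold alone.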

The 5‐methylcytosine (5‐mC) derivative 5‐hydroxymethylcytosine (5‐hmC) is abundant in the brain for unknown reasons. Here we characterize the genomic distribution of 5‐hmC and 5‐mC in human and mouse tissues. We assayed 5‐hmC by using glucosylation coupled with restriction enzyme digestion and microarray analysis. We detected 5‐hmC enrichment in genes with synapse‐related functions in both human and mouse brain. We also identified substantial tissue‐specific differential distributions of these DNA modifications at the exon‐intron boundary in human and mouse. This boundary change was mainly due to 5‐hmC in the brain but due to 5‐mC in non‐neural contexts. This pattern was replicated in multiple independent data sets and with single‐molecule sequencing. Moreover, in human frontal cortex, constitutive exons contained higher levels of 5‐hmC relative to alternatively spliced exons. Our study suggests a new role for 5‐hmC in RNA splicing and synaptic function in the brain.

Protocol Dependence of Sequencing-Based Gene Expression Measurements

RNA Seq provides unparalleled levels of information about the transcriptome including precise expression levels over a wide dynamic range. It is essential to understand how technical variation impacts the quality and interpretability of results, how potential errors could be introduced by the protocol, how the source of RNA affects transcript detection, and how all of these variations can impact the conclusions drawn. Multiple human RNA samples were used to assess RNA fragmentation, RNA fractionation, cDNA synthesis, and single versus multiple tag counting. Though protocols employing polyA RNA selection generate the highest number of non‐ribosomal reads and the most precise measurements for coding transcripts, such protocols were found to detect only a fraction of the non‐ribosomal RNA in human cells. PolyA RNA excludes thousands of annotated and even more unannotated transcripts, resulting in an incomplete view of the transcriptome. Ribosomal‐depleted RNA provides a more cost‐effective method for generating complete transcriptome coverage. Expression measurements using single tag counting provided advantages for assessing gene expression and for detecting short RNAs relative to multi‐read protocols. Detection of short RNAs was also hampered by RNA fragmentation. Thus, this work will help researchers choose from among a range of options when analyzing gene expression, each with its own advantages and disadvantages.

A Comparison of Single Molecule and Amplification Based Sequencing of Cancer Transcriptomes

The second wave of next generation sequencing technologies, referred to as single‐molecule sequencing (SMS), carries the promise of profiling samples directly without employing polymerase chain reaction steps used by amplification‐based sequencing (AS) methods. To examine the merits of both technologies, we examine mRNA sequencing results from single‐molecule and amplification‐based sequencing in a set of human cancer cell lines and tissues. We observe a characteristic coverage bias towards high abundance transcripts in amplification‐based sequencing. A larger fraction of AS reads cover highly expressed genes, such as those associated with translational processes and housekeeping genes, resulting in relatively lower coverage of genes at low and mid‐level abundance. In contrast, the coverage of high abundance transcripts plateaus off using SMS. Consequently, SMS is able to sequence lower‐abundance transcripts more thoroughly, including some that are undetected by AS methods; however, these include many more mapping artifacts. A better understanding of the technical and analytical factors introducing platform specific biases in high throughput transcriptome sequencing applications will be critical in cross platform meta‐analytic studies.

Digital Transcriptome Profiling from Attomole-Level RNA Samples

Accurate profiling of minute quantities of RNA in a global manner can enable key advances in many scientific and clinical disciplines. Here, we present low‐quantity RNA sequencing (LQ‐RNAseq), a high‐throughput sequencing‐based technique allowing whole transcriptome surveys from subnanogram RNA quantities in an amplification/ligation‐free manner. LQ‐RNAseq involves first‐strand cDNA synthesis from RNA templates, followed by 3′ polyA tailing of the single‐stranded cDNA products and direct single molecule sequencing. We applied LQ‐RNAseq to profile S. cerevisiae polyA+ transcripts, demonstrate the reproducibility of the approach across different sample preparations and independent instrument runs, and establish the absolute quantitative power of this method through comparisons with other reported transcript profiling techniques and through utilization of RNA spike‐in experiments. We demonstrate the practical application of this approach to define the transcriptional landscape of mouse embryonic and induced pluripotent stem cells, observing transcriptional differences, including over 100 genes exhibiting differential expression between these otherwise very similar stem cell populations. This amplification‐independent technology, which utilizes small quantities of nucleic acid and provides quantitative measurements of cellular transcripts, enables global gene expression measurements from minute amounts of materials and offers broad utility in both basic research and translational biology for characterization of rare cells.

Profiling of Short RNAs Using Helicos Single-Molecule Sequencing

Methods Molecular Biology December 1, 2011

Philipp Kapranov, Fatih Ozsolak, Patrice M. Milos

Abstract

The importance of short (<200 nt) RNAs in cell biogenesis has been well documented. These short RNAs include crucial classes of molecules such as transfer RNAs, small nuclear RNA, microRNAs, and many others (reviewed in Storz et al., Annu Rev Biochem 74:199–217, 2005; Ghildiyal and Zamore, Nat Rev Genet 10:94–108, 2009). Furthermore, the realm of functional RNAs that fall within this size range is growing to include less well‐characterized RNAs such as short RNAs found at the promoters and 3′ termini of genes (Affymetrix ENCODE Transcriptome Project et al., Nature 457:1028–1032, 2009; Davis and Ares, Proc Natl Acad Sci USA 103:3262–3267, 2006; Kapranov et al., Science 316:1484–1488, 2007; Taft et al., Nat Genet 41:572–578, 2009; Kapranov et al., Nature 466:642–646, 2010), short RNAs involved in paramutation (Rassoulzadegan et al., Nature 441:469–474, 2006), and others (reviewed in Kawaji and Hayashizaki, PLoS Genet 4:e22, 2008). Discovery and accurate quantification of these RNA molecules, less than 200 bases in size, is thus an important and also challenging aspect of understanding the full repertoire of cellular and extracellular RNAs. Here, we describe the strategies and procedures we developed to profile short RNA species using single‐molecule sequencing (SMS) and the advantages SMS offers.

Single Molecule Sequencing with a HeliScope Genetic Analysis System

Curr Protoc Mol Biol. October 2010

John F. Thompson and Kathleen E. Steinmann

Abstract

Helicos™ Single Molecule Sequencing (SMS) provides a unique view of genome biology through direct sequencing of cellular nucleic acids in an unbiased manner, providing both accurate quantitation and sequence information. Sample preparation does not require ligation or PCR amplification, avoiding the GC‐content and size biases observed in other technologies. DNA is simply sheared, tailed with poly A, and hybridized to a flow cell surface containing oligo‐dT for sequencing‐by‐synthesis of billions of molecules in parallel. This process also requires far less material than other technologies. Gene expression measurements can be done using 1st‐strand cDNA‐based methods (RNA‐Seq) or using a novel approach that allows direct hybridization and sequencing of cellular RNA for the most direct quantitation possible. A diverse array of applications has been successfully performed including genome sequencing for accurate variant detection, ChIP‐Seq using picogram quantities of DNA, copy number variation studies from both fresh tumor tissue and FFPE tissue samples, sequencing of ancient and degraded DNAs, small RNA studies leading to the identification of new classes of RNAs and the direct capture and sequencing of RNA from cell quantities as few as 250 cells. Because most next generation sequencing technologies require amplification and a specific size range of target molecules, DNAs not meeting those criteria cannot be sequenced in a reliable manner. Single‐molecule sequencing does not suffer from those limitations.

Single-molecule decoding of combinatorially modified nucleosomes

Different combinations of histone modifications have been proposed to signal distinct gene regulatory functions, but this area is poorly addressed by existing technologies. We applied high‐throughput single‐molecule imaging to decode combinatorial modifications on millions of individual nucleosomes from pluripotent stem cells and lineage‐committed cells. We identified definitively bivalent nucleosomes with concomitant repressive and activating marks, as well as other combinatorial modification states whose prevalence varies with developmental potency. We showed that genetic and chemical perturbations of chromatin enzymes preferentially affect nucleosomes harboring specific modification states. Last, we combined this proteomic platform with single‐molecule DNA sequencing technology to simultaneously determine the modification states and genomic positions of individual nucleosomes. This single‐molecule technology has the potential to address

Mapping the Regulon of Vibrio cholerae Ferric Uptake Regulator Expands its Known Network of Gene Regulation

PNAS June 12, 2011

Bryan W. Davies, Ryan W. Bogard, & John J. Mekalanos

Abstract

ChIP coupled with next‐generation sequencing (ChIP‐seq) has revolutionized whole‐genome mapping of DNA‐binding protein sites. Although ChIP‐seq rapidly gained support in eukaryotic systems, it remains underused in the mapping of bacterial transcriptional regulator‐binding sites. Using the virulence‐required iron‐responsive ferric uptake regulator (Fur), we report a simple, broadly applicable ChIP‐seq method in the pathogen Vibrio cholerae. Combining our ChIP‐seq results with available microarray data, we clarify direct and indirect Fur regulation of known iron‐responsive genes. We validate a subset of Fur‐binding sites in vivo and show a common motif present in all Fur ChIP‐seq peaks that has enhanced binding affinity for purified V. cholerae Fur. Further analysis shows that V. cholerae Fur directly regulates several additional genes associated with Fur‐binding sites, expanding the role of this transcription factor into the regulation of ribosome formation, additional transport functions, and unique sRNAs.

Chromatin Profiling by Directly Sequencing Small Quantities of Immunoprecipitated DNA

Chromatin structure and transcription factor localization can be assayed genome‐wide by sequencing genomic DNA fractionated by protein occupancy or other properties, but current technologies involve multiple steps that introduce bias and inefficiency. Here we apply a single molecule approach to directly sequence chromatin immunoprecipitated DNA with minimal sample manipulation. This method is compatible with just 50 pg of DNA and should thus facilitate charting chromatin maps from limited cell populations.

Genome-wide fitness profiling reveals adaptations required by Haemophilus in coinfection with influenza A virus in the murine lung

PNAS September 17, 2013

Sandy M. Wong, Mariana Bernui, Hao Shen, Brian J. Akerley

Abstract

Bacterial coinfection represents a major cause of morbidity and mortality in epidemics of influenza A virus (IAV). The bacterium Haemophilus influenzae typically colonizes the human upper respiratory tract without causing disease, and yet in individuals infected with IAV, it can cause debilitating or lethal secondary pneumonia. Studies in murine models have detected immune components involved in susceptibility and pathology, and yet few studies have examined bacterial factors contributing to coinfection. We conducted genome-wide profiling of the H. influenzae genes that promote its fitness in a murine model of coinfection with IAV. Application of direct, high-throughput sequencing of transposon insertion sites revealed fitness phenotypes of a bank of H. influenzae mutants in viral coinfection in comparison with bacterial infection alone. One set of virulence genes was required in nonvirally infected mice but not in coinfection, consistent with a defect in anti-bacterial defenses during coinfection. Nevertheless, a core set of genes required in both in vivo conditions indicated that many bacterial countermeasures against host defenses remain critical for coinfection. The results also revealed a subset of genes required in coinfection but not in bacterial infection alone, including the iron-sulfur cluster regulator gene, iscR, which was required for oxidative stress resistance. Overexpression of the antioxidant protein Dps in the iscR mutant restored oxidative stress resistance and ability to colonize in coinfection. The results identify bacterial stress and metabolic adaptations required in an IAV coinfection model, revealing potential targets for treatment or prevention of secondary bacterial pneumonia after viral infection.

Genetic testing for disease risk is an increasingly important component of medical care. However, testing can be expensive, which can lead to patients and physicians having limited access to the genetic information needed for medical decisions. To simplify DNA sample preparation and lower costs, we have developed a system in which any gene can be captured and sequenced directly from human genomic DNA without amplification, using no proteins or enzymes prior to sequencing. Extracted whole‐genome DNA is acoustically sheared and loaded in a flow cell channel for single‐molecule sequencing. Gene isolation, amplification, or ligation is not necessary. Accurate and low‐cost detection of DNA sequence variants is demonstrated for the BRCA1 gene. Disease‐causing mutations as well as common variants from well‐characterized samples are identified. Single‐molecule sequencing generates very reproducible coverage patterns, and these can be used to detect any size insertion or deletion directly, unlike PCR‐based methods, which require additional assays. Because no gene isolation or amplification is required for sequencing, the exceptionally low costs of sample preparation and analysis could make genetic tests more accessible to those who wish to know their own disease susceptibility. Additionally, this approach has applications for sequencing integration sites for gene therapy vectors, transposons, retroviruses, and other mobile DNA elements in a more facile manner than possible with other methods.

Noninvasive trisomy 21 detection performed by use of massively parallel sequencing is achievable with high diagnostic sensitivity and low false‐positive rates. Detection of fetal trisomy 18 and 13 has been reported as well but seems to be less accurate with the use of this approach. The reduced accuracy can be explained by PCR‐introduced guanine‐cytosine (GC) bias influencing sequencing data. Previously, we demonstrated that sequence data generated by single molecule sequencing show virtually no GC bias and result in a more pronounced noninvasive detection of fetal trisomy 21. In this study, single molecule sequencing was used for noninvasive detection of trisomy 18 and 13. Single molecule sequencing was performed on the Helicos platform with free DNA isolated from maternal plasma from 11 weeks of gestation onward (n = 17). Relative sequence tag density ratios were calculated against male control plasma samples and results were compared to those of previous karyotyping. All trisomy 18 fetuses were identified correctly with a diagnostic sensitivity and specificity of 100%. However, low diagnostic sensitivity and specificity were observed for fetal trisomy 13 detection. We successfully applied single molecule sequencing in combination with relative sequence tag density calculations for noninvasive trisomy 18 detection using free DNA from maternal plasma. However, noninvasive trisomy 13 detection was

Noninvasive fetal aneuploidy detection by use of free DNA from maternal plasma has recently been shown to be achievable by whole genome shotgun sequencing. The high‐throughput next‐generation sequencing platforms previously tested use a PCR step during sample preparation, which results in amplification bias in GC‐rich areas of the human genome. To eliminate this bias, and thereby experimental noise, we have used single molecule sequencing as an alternative method. For noninvasive trisomy 21 detection, we performed single molecule sequencing on the Helicos platform using free DNA isolated from maternal plasma from 9 weeks of gestation onwards. Relative sequence tag density ratios were calculated and results were directly compared to the previously described Illumina GAII platform. Sequence data generated without an amplification step show no GC bias. Therefore, with the use of single molecule sequencing all trisomy 21 fetuses could be distinguished more clearly from euploid fetuses. This study shows for the first time that single molecule sequencing is an attractive and easy to use alternative for reliable noninvasive fetal aneuploidy detection in diagnostics. With this approach, previously described experimental noise associated with PCR amplification, such as GC bias, can be overcome.
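
The idea behind a relative sequence tag density ratio can be sketched in a few lines. This is a simplified illustration and not the authors' pipeline: the read counts are invented, the three-chromosome genome is a toy, and the normalization used here (a chromosome's share of all mapped tags in the sample, divided by the same share in a control) is an assumption about one plausible way to form such a ratio.

```python
def tag_fraction(counts, chrom):
    """Fraction of all mapped sequence tags that fall on `chrom`."""
    total = sum(counts.values())
    return counts[chrom] / total

def relative_tag_density(sample, control, chrom):
    """Ratio of the sample's tag fraction on `chrom` to the control's."""
    return tag_fraction(sample, chrom) / tag_fraction(control, chrom)

# Hypothetical mapped-tag counts per chromosome (illustrative values only).
control = {"chr1": 80000, "chr18": 10000, "chr21": 10000}
# Same proportions as the control, sequenced at twice the depth.
euploid = {"chr1": 160000, "chr18": 20000, "chr21": 20000}
# A small excess of chr21 tags, as might arise from a trisomic fetal fraction.
trisomy21 = {"chr1": 79500, "chr18": 10000, "chr21": 10500}

ratio_euploid = relative_tag_density(euploid, control, "chr21")
ratio_trisomy = relative_tag_density(trisomy21, control, "chr21")

print(f"euploid chr21 ratio:   {ratio_euploid:.3f}")
print(f"trisomy21 chr21 ratio: {ratio_trisomy:.3f}")
```

Because the ratio is built from fractions, it is insensitive to sequencing depth, and the expected signal is a small shift above 1. That is why reducing experimental noise such as GC bias matters for separating trisomic from euploid samples.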

The sequencing of the human genome, combined with brilliant technical advances in microarrays and computing, opened the genomic era of personalized medicine. The next generation of genomics is now being driven by massively parallel sequencers that are effectively high definition genetic analyzers capable of sequencing an entire human genome 30‐times over in approximately a week for several thousand US dollars. Likewise, these next‐generation sequencers, sometimes called deep sequencers, can sequence RNA transcriptomes to render unprecedented, high definition views of transcript sequence, SNP haplotypes, rare variants, splicing, exon boundaries and RNA editing. Presently, next generation sequencing platforms can be grouped into ‘discovery’ platforms, which provide broad sequence coverage, but require days per sample, versus ‘diagnostic’ platforms, which provide a fraction of the coverage, but require only hours for sequencing. As these technologies converge, it will be possible to sequence a human genome in a matter of hours for a few hundred US dollars. While presenting considerable technical challenges in handling the massive data generated, next‐generation sequencing platforms offer unparalleled opportunities for biological insights, target discovery and clinical diagnostics to accelerate personalized medicine in the coming years.

The single‐stranded genome of adeno‐associated viral (AAV) vectors is one of the key factors leading to slow‐rising but long‐term transgene expression kinetics. Previous molecular studies have established what is now considered a textbook molecular model of AAV genomes with two copies of inverted tandem repeats at either end. In this study, we profiled hundreds of thousands of individual molecules of AAV vector DNA directly isolated from capsids, using single‐molecule sequencing (SMS), which avoids any intermediary steps such as plasmid cloning. The sequence profile at 3′ ends of both the regular and oversized vector did show the presence of an inverted terminal repeat (ITR), which provided direct confirmation that AAV vector packaging initiates from its 3′ end. Furthermore, the vector 5′‐terminus profile showed inconsistent termination for oversized vectors. Such incomplete vectors would not be expected to undergo canonical synthesis of the second strand of their genomic DNA and thus could function only via annealing of complementary strands of DNA. Furthermore, low levels of contaminating plasmid DNA were also detected. SMS may become a valuable tool during the development phase of vectors that are candidates for clinical use and for facilitating/accelerating studies on vector biology.

Second‐generation sequencing technologies have revolutionized our ability to recover genetic information from the past, allowing the characterization of the first complete genomes from past individuals and extinct species. Recently, third generation Helicos sequencing platforms, which perform true Single‐Molecule DNA Sequencing (tSMS), have shown great potential for sequencing DNA molecules from Pleistocene fossils. Here, we aim at improving even further the performance of tSMS for ancient DNA by testing two novel tSMS template preparation methods for Pleistocene bone fossils, namely oligonucleotide spiking and treatment with DNA phosphatase. We found that a significantly larger fraction of the horse genome could be covered following oligonucleotide spiking, although not reproducibly and at the cost of extra post‐sequencing filtering procedures and skewed %GC content. In contrast, we showed that treating ancient DNA extracts with DNA phosphatase improved the amount of endogenous sequence information recovered per sequencing channel by up to 3.3‐fold, while still providing molecular signatures of endogenous ancient DNA damage, including cytosine deamination and fragmentation by depurination. Additionally, we confirmed the existence of molecular preservation niches in large bone crystals from which DNA could be preferentially extracted. We propose DNA phosphatase treatment as a mechanism to increase sequence coverage of ancient genomes when using Helicos tSMS as a sequencing platform. Together with mild denaturation temperatures that favor access to endogenous ancient templates over modern DNA contaminants, this simple preparation procedure can improve overall Helicos tSMS performance when damaged DNA templates are targeted.

Second‐generation sequencing platforms have revolutionized the field of ancient DNA, opening access to complete genomes of past individuals and extinct species. However, these platforms are dependent on library construction and amplification steps that may result in sequences that do not reflect the original DNA template composition. This is particularly true for ancient DNA, where templates have undergone extensive damage post‐mortem. Here, we report the results of the first “true single molecule sequencing” of ancient DNA. We generated 115.9 Mb and 76.9 Mb of DNA sequences from a permafrost‐preserved Pleistocene horse bone using the Helicos HeliScope and Illumina GAIIx platforms, respectively. We find that the percentage of endogenous DNA sequences derived from the horse is higher among the Helicos data than Illumina data. This result indicates that the molecular biology tools used to generate sequencing libraries of ancient DNA molecules, as required for second generation sequencing, introduce biases into the data that reduce the efficiency of the sequencing process and limit our ability to fully explore the molecular complexity of ancient DNA extracts. We demonstrate that simple modifications to the standard Helicos DNA template preparation protocol further increase the proportion of horse DNA for this sample by threefold. Comparison of Helicos‐specific biases and sequence errors in modern DNA with those in ancient DNA also reveals extensive cytosine deamination damage at the 3′ ends of ancient templates, indicating the presence of 3′‐sequence overhangs. Our results suggest that paleogenomes could be sequenced in an unprecedented manner by combining current second‐ and third‐generation sequencing approaches.

Helicos Single-Molecule Sequencing of Bacterial Genomes

With the advent of high‐throughput sequencing technologies, multiple bacterial genomes can be sequenced in days. While the ultimate goal of de novo assembly of bacterial genomes is progressing, changes in the genomic sequence of closely related bacterial strains and isolates are now easily monitored by comparison of their sequences to those of a reference genome. Such studies can be applied to the fields of bacterial evolution, epidemiology, and diagnostics. We present a protocol for single‐molecule sequencing of bacterial DNA whose end result is the identification of single nucleotide variants, and various size insertions and deletions relative to a reference genome. The protocol is characterized by the simplicity of sample preparation and the lack of amplification‐related sequencing bias.

Helicos® Single‐Molecule Sequencing provides a unique view of genome biology through direct sequencing of cellular and extracellular nucleic acids in an unbiased manner, providing both quantitation and sequence information. Using a simple sample preparation, involving no ligation or amplification, genomic DNA is sheared, tailed with poly‐A and hybridized to the flow‐cell surface containing oligo‐dT for initiating sequencing‐by‐synthesis. RNA measurements involving direct RNA hybridization to the flow cell allow for the direct sequencing and quantitation of RNA molecules. From these methods, a diverse array of applications has now been successfully demonstrated with the Helicos® Genetic Analysis System, including human genome sequencing for accurate variant detection, ChIP Seq studies involving picogram quantities of DNA obtained from small cell numbers, copy number variation studies from both fresh tumor tissue and formalin‐fixed paraffin‐embedded tissue and archival tissue samples, small RNA studies leading to the identification of new classes of RNAs, and the direct capture and sequencing of nucleic acids from cell quantities as few as 400 cells with our end goal of single cell measurements. Helicos methods provide an important opportunity to researchers, including genomic scientists, translational researchers, and diagnostic experts, to benefit from biological measurements at the single‐molecule level. This chapter will describe the various methods available to researchers.

Single-Molecule Sequencing of an Individual Human Genome

Nature Biotechnology August 10, 2009

Dmitry Pushkarev, Norma F Neff, & Stephen R. Quake

Abstract

Recent advances in high‐throughput DNA sequencing technologies have enabled order‐of‐magnitude improvements in both cost and throughput. Here we report the use of single‐molecule methods to sequence an individual human genome. We aligned billions of 24‐ to 70‐bp reads (32 bp average) to ~90% of the National Center for Biotechnology Information (NCBI) reference genome, with 28× average coverage. Our results were obtained on one sequencing instrument by a single operator with four data collection runs. Single molecule sequencing enabled analysis of human genomic information without the need for cloning, amplification or ligation. We determined ~2.8 million single nucleotide polymorphisms (SNPs) with a false‐positive rate of less than 1% as validated by Sanger sequencing and 99.8% concordance with SNP genotyping arrays. We identified 752 regions of copy number variation by analyzing coverage depth alone and validated 27 of these using digital PCR. This milestone should allow widespread application of genome sequencing to many aspects of genetics and human health, including personal genomics.
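
The relationship between read count, read length, and average coverage quoted above is simple arithmetic, sketched below. The 3.0 Gb genome size and the example read count are assumptions for illustration; the 32 bp mean read length, 28× coverage, and ~90% covered fraction come from the abstract. The result is a back-of-envelope estimate, not a figure reported by the authors.

```python
import math

def average_coverage(num_reads, mean_read_length, covered_bases):
    """Average depth = total aligned bases / number of reference bases covered."""
    return num_reads * mean_read_length / covered_bases

def reads_needed(target_coverage, mean_read_length, covered_bases):
    """Smallest whole number of reads that reaches `target_coverage`."""
    return math.ceil(target_coverage * covered_bases / mean_read_length)

# Assumed reference size (3.0 Gb) and the ~90% covered fraction from the abstract.
ASSUMED_GENOME_SIZE = 3_000_000_000
covered = ASSUMED_GENOME_SIZE * 9 // 10   # 2.7 Gb

# Hypothetical run: 1 billion 30 bp reads over a 3.0 Gb reference.
example_depth = average_coverage(1_000_000_000, 30, ASSUMED_GENOME_SIZE)

# Aligned reads implied by 28x coverage at 32 bp mean length over 2.7 Gb.
estimated_reads = reads_needed(28, 32, covered)

print(f"example depth: {example_depth:.1f}x")
print(f"estimated aligned reads for 28x: {estimated_reads:,}")
```

Under these assumptions the estimate lands in the low billions of aligned reads, consistent in order of magnitude with the "billions of 24‐ to 70‐bp reads" described in the abstract.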

We synthesized reversible terminators with tethered inhibitors for next generation sequencing. These were efficiently incorporated with high fidelity while preventing incorporation of additional nucleotides and were used to sequence canine bacterial artificial chromosomes in a single-molecule system that provided even coverage for over 99% of the region sequenced. This single-molecule approach generated high quality sequence data without the need for target amplification and thus avoided concomitant biases.