A typical cell has approximately 10 pg of total RNA and may contain only 0.1 pg of poly-adenylated RNA. Thus, these approaches all require some sort of whole-transcript amplification to generate enough material to make a sequencing library (5). The downside of such extensive amplification is the generation of significant technical noise, and this problem has not yet been solved (35).

Finally, ribosomal footprinting can reveal the pool of cellular mRNA transcripts undergoing translation at any point in time (36, 37). The protocol involves treating cell lysates with RNase, leaving behind only the 30-nucleotide region protected by each ribosome. Ribosomes are then purified by sucrose density gradient centrifugation, and the co-purified mRNA fragments are extracted from the ribosomes. Another novel application of RNA sequencing is SHAPE-Seq (Selective 2′-hydroxyl acylation analyzed by primer extension)(38), which is used to probe the secondary structure of RNA via acylating reagents that preferentially modify unpaired bases. When the modified RNA and an unmodified control undergo RT using specific primers, the resulting cDNA fragments can be sequenced and compared to reveal nucleotide level base pairing information.

The main objective when preparing a sequencing library is to create as little bias as possible. Bias can be defined as the systematic distortion of data due to the experimental design. Since it is impossible to eliminate all sources of experimental bias, the best strategies are: (i) know where bias occurs and take all practical steps to minimize it and (ii) pay attention to experimental design so that the sources of bias that cannot be eliminated have a minimal impact on the final analysis.

The complexity of an NGS library can reflect the amount of bias created by a given experimental design. In terms of library complexity, the ideal is a highly complex library that reflects with high fidelity the original complexity of the source material. The technological challenge is that any amount of amplification can reduce this fidelity. Library complexity can be measured by the number or percentage of duplicate reads that are present in the sequencing data (39). Duplicate reads are generally defined as reads that are exactly identical or have the exact same start positions when aligned to a reference sequence (40). One caveat is that the frequency of duplicate reads that occur by chance (and represent truly independent sampling from the original sample source) increases with increasing depth of sequencing. Thus, it is critical to understand under what conditions duplicate read rates represent an accurate measure of library complexity.
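As a minimal sketch of the duplicate-read metric described above, the following Python function counts reads that share an alignment start with an earlier read. The tuple representation of an aligned read and the choice to include strand in the duplicate key are assumptions made for illustration; they are not taken from any particular duplicate-marking tool.

```python
from collections import Counter

def duplicate_rate(alignments):
    """Return the fraction of reads that duplicate an earlier read.

    `alignments` is an iterable of (chrom, start, strand) tuples, a
    hypothetical minimal representation of aligned reads.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unique = len(counts)  # one representative per distinct start position
    return (total - unique) / total

reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # same start and strand: counted as a duplicate
    ("chr1", 100, "-"),  # same start, opposite strand: kept as distinct
    ("chr1", 250, "+"),
    ("chr2", 100, "+"),
]
print(duplicate_rate(reads))  # 0.2
```

Because chance duplicates become more frequent as sequencing depth increases, a rate computed this way is best compared between libraries sequenced to similar depth.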

Using duplicate read rates as a measure of library complexity works well when doing genomic DNA sequencing, because the nucleic acid sequences in the starting pool are roughly in equimolar ratios. However, RNA-seq is considerably more complex, because by definition the starting pool of sequences represents a complex mix of different numbers of mRNA transcripts reflecting the biology of differential expression. In the case of ChIP-seq, the complexity is created by the differential affinity of target proteins for specific DNA sequences (i.e., high versus low). These biologically significant differences mean that the sequences ending up in the final pool are not equimolar.

However, the point is the same—the goal in preparing a library is to prepare it in such a way as to maximize complexity and minimize PCR or other amplification-based clonal bias. This is a significant challenge for libraries with low input, such as with many ChIP-seq experiments or RNA/DNA samples derived from a limited number of cells. It is now technologically possible to perform genomic DNA and RNA sequencing from single cells. The key point is that the level of extensive amplification required creates bias in the form of preferential amplification of different sequences, and this bias remains a serious issue in the analysis of the resulting data. One approach to address the challenge is a method of digital sequencing that uses multiple combinations of indexed adapters to enable the differentiation of biological and PCR-derived duplicate reads in RNA-seq applications (41, 42). A version of this method is now commercially available as a kit from Bioo Scientific (Austin, TX).
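The logic of using indexed adapters to separate biological from PCR-derived duplicates can be sketched as follows. This is an illustrative simplification, not the published method or the commercial kit: reads are assumed to be (chrom, start, index_tag) tuples, where `index_tag` is a hypothetical label for the adapter index combination attached before amplification. Reads at the same position with the same tag are treated as PCR copies of one molecule; reads at the same position with different tags are treated as independent molecules.

```python
from collections import defaultdict

def classify_duplicates(reads):
    """Return (n_unique_molecules, n_pcr_duplicates).

    `reads` is a list of (chrom, start, index_tag) tuples; the tag stands in
    for the combination of indexed adapters ligated before amplification.
    """
    tags_by_position = defaultdict(list)
    for chrom, start, tag in reads:
        tags_by_position[(chrom, start)].append(tag)

    molecules = 0
    pcr_duplicates = 0
    for tags in tags_by_position.values():
        distinct = len(set(tags))
        molecules += distinct                  # each distinct tag = one original molecule
        pcr_duplicates += len(tags) - distinct  # repeated tags = amplification copies
    return molecules, pcr_duplicates

reads = [
    ("chr1", 100, "A1"),
    ("chr1", 100, "A1"),  # same position, same tag: PCR duplicate
    ("chr1", 100, "B7"),  # same position, different tag: independent molecule
    ("chr2", 50, "A1"),
]
print(classify_duplicates(reads))  # (3, 1)
```

A real implementation would also need to tolerate sequencing errors in the index sequences, which this sketch ignores.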

When preparing libraries for NGS sequencing, it is also critical to give consideration to the mitigation of batch effects (43-45). It is also important to acknowledge the impact of systematic bias resulting from the molecular manipulations required to generate NGS data; for example, the bias introduced by sequence-dependent differences in adaptor ligation efficiencies in miRNA-seq library preparations. Batch effects can result from variability in day-to-day sample processing, such as reaction conditions, reagent batches, pipetting accuracy, and even different technicians. Additionally, batch effects may be observed between sequencing runs and between different lanes on an Illumina flow-cell. Mitigating batch effects can be fairly simple or quite complex. When in doubt, consulting a statistician during the experimental design process can save an enormous amount of wasted money and time.
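One simple design measure is to spread each experimental group across processing batches (or flow-cell lanes) so that batch is not confounded with the biological comparison. The sketch below shows one way this could be done; the function name, the (sample_id, group) representation, and the round-robin scheme are assumptions for illustration and are not a substitute for a statistician's design.

```python
import random
from collections import defaultdict

def assign_batches(samples, n_batches, seed=0):
    """Distribute samples over batches with each group spread evenly.

    `samples` is a list of (sample_id, group) pairs. Samples are shuffled
    within each group and then dealt round-robin across the batches.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    by_group = defaultdict(list)
    for sample_id, group in samples:
        by_group[group].append(sample_id)

    batches = [[] for _ in range(n_batches)]
    slot = 0
    for group in sorted(by_group):
        members = by_group[group]
        rng.shuffle(members)
        for sample_id in members:
            batches[slot % n_batches].append((sample_id, group))
            slot += 1
    return batches

samples = [(f"C{i}", "control") for i in range(4)] + [(f"T{i}", "treated") for i in range(4)]
for number, batch in enumerate(assign_batches(samples, n_batches=2), start=1):
    print(number, batch)
```

With four control and four treated samples and two batches, each batch receives two samples from each group, so a batch effect shifts both groups rather than mimicking a treatment effect.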

There are many ways to minimize bias during library preparation. Within a single experiment, we aim to start with samples of similar quality and quantity. We also use master mixes of reagents whenever possible. One particularly egregious source of bias is from amplification reactions such as PCR; it is well documented that GC content has a substantial impact on PCR amplification efficiency. We recommend PCR enzymes such as Kapa HiFi (Kapa Biosystems, Wilmington, MA) or AccuPrime Taq DNA Polymerase High Fidelity (Life Technologies) that have been shown to minimize amplification bias resulting from extremes of GC content. It was recently reported that, for particularly high GC targets, a 3 min initial denaturation time with subsequent PCR melt cycles extended to 80 s can significantly reduce amplification bias (18). We use as few amplification cycles as necessary, but it is critical that every sample within an experiment is amplified the same number of cycles. In miRNA library preparation protocols, ligase enzymes have been shown to contribute a high level of sequence-dependent bias (46, 47). One group found that addition of three degenerate bases to the 5′ end of the 3′ adapter and the 3′ end of the 5′ adapter significantly reduced this ligation bias (48). A miRNA library prep kit that incorporates three degenerate bases on the 5′ adapter is commercially available through Gnomegen (San Diego, CA).
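Because GC content has such a strong influence on amplification efficiency, it can be useful to flag targets with extreme GC content before choosing an enzyme and cycling conditions. The following is a minimal sketch; the function names and the 35% and 65% cutoffs are illustrative assumptions, not thresholds taken from the cited studies.

```python
def gc_fraction(seq):
    """Return the fraction of A/C/G/T bases in `seq` that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")  # ignore N and other symbols
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called

def is_gc_extreme(seq, low=0.35, high=0.65):
    """Flag sequences whose GC fraction falls outside [low, high] (assumed cutoffs)."""
    fraction = gc_fraction(seq)
    return fraction < low or fraction > high

for name, seq in [("balanced", "ATGCATGCAT"), ("gc_rich", "GGCCGCGGCC"), ("at_rich", "ATATTTAAAT")]:
    print(name, gc_fraction(seq), is_gc_extreme(seq))
```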

In addition to enzymatic steps, bias can be reduced in purification steps by pooling barcoded samples before gel or bead purification. In the case of miRNA-seq libraries, we first run the individual libraries on an Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA) to quantitate the miRNA peaks. We use this information to create barcoded library pools of up to 24 samples and then perform gel purification in a single lane of an agarose gel to avoid sizing variation between samples.

Targeted sequencing allows investigators to study a selected set of genes or specific genomic elements, for example, CpG islands and promoter/enhancer regions (reviewed in Reference (49)). A common application of targeted sequencing is exome sequencing, and high quality kits are commercially available: SureSelect (Agilent Technologies), SeqCap (Roche NimbleGen, Madison, WI), and TruSeq Exome Enrichment Kit (Illumina). All three capture methods are based on probe hybridization to enrich sequencing libraries made from whole genome samples (51, 52). Life Technologies has commercialized an alternative approach based on highly multiplexed, PCR-based AmpliSeq technology. There are options to customize all these products, and investigators can design capture or PCR probes for target regions covering from thousands to millions of bases within a genome.

Hybridization capture approaches generally work well but can suffer from off-target capture and struggle to effectively capture sequences with high levels of repetition or low complexity (e.g., the Human Histocompatibility Locus region). The PCR-based AmpliSeq method is more efficient with lower amounts of DNA (53). It should also be noted that probes are based on a reference sequence, and variations that substantially deviate from the reference, as well as significant insertion/deletion mutations, are not always going to be identified.

Another targeted sequencing method, developed by Raindance (Billerica, MA), uses microdroplet PCR and custom-designed droplet libraries (54, 55). The nature of microdroplet emulsion PCR significantly decreases PCR amplification bias (56). Microdroplet PCR allows the user to set up 1.5 × 10⁶ microdroplet amplifications in a single tube in under an hour. The droplet libraries are designed based on 500 bp amplicons, and a single custom library can target from 2000 to 10,000 different amplicons covering up to 5 × 10⁶ bases.

Amplicon sequencing involves making NGS libraries from PCR products. This form of targeted sequencing is more appropriate for applications such as microbiomic experiments where community composition is analyzed by surveying 16S rRNA sequences in complex bacterial mixtures (57), analysis of antibody diversity (58) and T cell receptor gene repertoires (50), and facilitating the process of identifying and selecting high value aptamers in a SELEX protocol (59). To highlight the flexibility of amplicon sequencing, a recent study used the method to analyze the incorporation of unnatural nucleotides during DNA synthesis (60).

Sequencing of short amplicons also makes obtaining entire sequences possible in either a single read or using a paired-end read design. Here, adapters can be added directly to the ends of the amplicons and sequenced to retain haplotype information essential for reconstructing antibody or T cell receptor gene sequences as well as identifying species in microbiome projects.

However, it is often necessary to design longer amplicons for targeted sequencing applications. In this case, the PCR products need to be fragmented for sequencing. Amplicons can be fragmented as-is using acoustic shearing, sonication, or enzymatic digestion. Alternatively, they can be first concatenated into longer fragments using ligation followed by fragmentation.

One problem associated with amplicon sequencing is the presence of chimeric amplicons generated during PCR by PCR-mediated recombination (61). This problem is exacerbated in low complexity libraries and by overamplification. A recent study identified up to 8% of raw sequence reads as chimeric (62). However, the authors were able to decrease the chimera rate to 1% by quality filtering the reads and applying the bioinformatic tool, Uchime (63).

The presence of the PCR primer sequences or other highly conserved sequences presents a technical limitation on some sequencing platforms that utilize fluorescent detection (e.g., Illumina). This can occur with amplicon-based sequencing such as microbiome studies using 16S rRNA for species identification. In this situation, the PCR primer sequence at the beginning of the read means that every cluster yields the same base at a given sequencing cycle, creating problems for the signal detection hardware and software. This limitation is not an issue with Ion Torrent systems (not fluorescence-based) and can be addressed on Illumina systems by sequencing multiple different amplicons in the same lane whenever possible. An alternative strategy we employ is to use several PCR primers during PCR of a specific amplicon. Each primer has a different number of bases (typically 1–3 random bases) added to the 5′ end to offset/stagger the order of sequencing when adapters are ligated to the amplicons.
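The effect of staggering can be illustrated by counting how many different bases appear at each sequencing cycle. This is a sketch only: the primer sequence is a made-up example, and fixed spacers are used in place of random bases so that the output is reproducible.

```python
def staggered_primers(primer, spacers=("", "A", "CT", "TGA")):
    """Return one primer per spacer, with the spacer prepended at the 5' end.

    The spacers here are fixed, hypothetical sequences of 0-3 bases; in
    practice the added bases would typically be random.
    """
    return [spacer + primer for spacer in spacers]

def cycle_diversity(read_starts, cycle):
    """Return the number of distinct bases seen at a 0-based sequencing cycle."""
    return len({read[cycle] for read in read_starts if len(read) > cycle})

primer = "GTGCCAGC"  # hypothetical primer sequence, for illustration only
unstaggered = [primer] * 4
staggered = staggered_primers(primer)

print([cycle_diversity(unstaggered, c) for c in range(4)])  # [1, 1, 1, 1]
print([cycle_diversity(staggered, c) for c in range(4)])
```

Without staggering, every read presents the same base at each cycle across the primer; with staggered primers, the first cycle already shows all four bases in this example.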

The objective of de novo sequencing is to use algorithms to produce a novel genome assembly that can serve as a reference for future experiments. Closing contigs and scaffolds into a cohesive genome map can be a remarkably challenging task. Because of this, de novo assemblies require some of the highest quality (i.e., least biased, most representative) sequencing libraries of any NGS application.

We routinely use three library preparation strategies to maximize assembly efficiency: (i) libraries comprised of long inserts (~1 kb insert sizes), (ii) no PCR amplification in library preparation, and (iii) mate-pair libraries with long distance spacing (5–20 kb) between reads. While it has so far proven impossible to build mate-pair libraries without PCR amplification, long insert libraries can easily be constructed without PCR if sufficient DNA is available (2). Such long insert libraries are created by careful shearing of genomic DNA. We find that the final data quality is greatly improved if sheared ~1 kb DNA is first size selected on a 1% agarose gel to narrow the size distribution as much as possible. This step minimizes the possibility for small fragments to concatenate during the adapter ligation step, which would increase the risk of chimeric read pairs impeding the data assembly process.

Mate-pair libraries are constructed by circularization of input DNA that has been fragmented to a size of >2 kb. Typically, insert size measures between 2 and 20 kb. We developed a mate-pair protocol using Cre-Lox recombination instead of blunt end circularization (64). In this method, a biotin-labeled LoxP sequence is created at the junction site from the end ligation of two LoxP adapters. This strategy allows junctions to be identified without using a reference genome. The location of the LoxP sequence in the reads distinguishes true mate-paired reads from spurious paired-end reads using the bioinformatics tool, Deloxer (64). A similar approach improves upon this method by allowing longer insert sizes (up to 22 kb)(65). Illumina also provides a transposome-based protocol that requires only a small amount of input material (~1 μg) and allows barcoded multiplexing of up to 12 samples per lane.

A significantly more complicated protocol generates mate-pair reads with approximately 40 kb spacing using a unique fosmid vector design (Lucigen NxSeq 40 kb Mate-Pair Cloning Kit; Middleton, WI). The phage packaging mechanism selects for DNA fragments of ~40 kb, which are packaged into phage particles in vitro by bacteriophage Lambda packaging extract followed by transfection into Escherichia coli for replication. Experience in fosmid preparation and replication is a definite plus before taking on this protocol.

Sample preparation for NGS applications: ChIP-seq

Chromatin immunoprecipitation sequencing (ChIP-seq) is now a well-established method for evaluating the presence of histone modifications and/or transcription factors on a genome-wide scale. Histone modifications are an important part of the epigenomic landscape and are thought to help regulate the recruitment of transcription factors and other DNA modifying enzymes. The precise biological role of histone modifications is still poorly understood, but genome-wide studies using ChIP-seq are beginning to provide important insights into their patterns and purpose.

ChIP was originally developed as a low-throughput PCR-based assay; the introduction of NGS technology has allowed it to be applied efficiently on a genome-wide scale (Figure 5). The general principle of this assay involves immunoprecipitation of specific proteins along with their associated DNA. The procedure usually requires DNA-protein crosslinking with formaldehyde followed by fragmentation of the chromatin using micrococcal nuclease (MNase) and/or sonication. Specific antibodies are used to target the protein or histone modification of interest, at which point the DNA is purified and subjected to high throughput sequencing. The sequencing results should be compared with a proper control. Data from a successful ChIP-seq should be enriched for the sequences that were crosslinked to the targeted protein/modified histone.

There has been some discussion on the best controls for ChIP-seq. Rabbit IgG has been used as a control for non-specific antibody binding, but these antisera typically don't control well for the non-specific cross-reactivity that is present with the use of affinity-purified antibodies. Thus, an aliquot of the input DNA pool after fragmentation but before immunoprecipitation has become more commonplace as the control for ChIP-seq. Additionally, input controls appear to give a better estimation of biases that result from chromatin fragmentation and sequencing (66).
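A comparison against an input control can be sketched as a depth-normalized fold enrichment for a genomic window. The formula, the pseudocount, and all parameter names below are illustrative assumptions; dedicated peak callers use more elaborate statistical models.

```python
def fold_enrichment(chip_count, input_count, chip_total, input_total, pseudocount=1.0):
    """Return the depth-normalized ChIP/input ratio for one genomic window.

    chip_count, input_count: reads in the window from each library.
    chip_total, input_total: total mapped reads in each library.
    pseudocount: assumed small constant that avoids division by zero.
    """
    if chip_total <= 0 or input_total <= 0:
        raise ValueError("library totals must be positive")
    raw_ratio = (chip_count + pseudocount) / (input_count + pseudocount)
    depth_correction = input_total / chip_total  # accounts for unequal sequencing depth
    return raw_ratio * depth_correction

# 199 ChIP reads and 9 input reads in a window; input library sequenced twice as deeply.
print(fold_enrichment(199, 9, chip_total=10_000_000, input_total=20_000_000))  # 40.0
```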

ChIP-seq has a number of technical challenges that require consideration and more standardization to facilitate cross-study analysis. In particular, antibody quality is a large factor affecting the outcome of ChIP-seq experiments. The ENCODE (Encyclopedia Of DNA Elements; www.genome.gov/10005107) and Roadmap consortia (NIH Roadmap Epigenomics Mapping Consortium) have set forth procedures for assessing antibody quality, including dot blot immunoassays against histone tail peptides to evaluate binding specificity and cross-reactivity (67). Some of the technical procedures used in ChIP-seq studies have a direct impact on downstream ChIP-seq library preparation and the resulting sequencing data (40, 66, 68, 69). For example, the formaldehyde crosslinking typically used in ChIP-seq experiments is particularly important for studying transcription factors, but it appears to result in lower resolution and increases the likelihood of non-specific interactions (40). Resolution was recently addressed for DNA binding proteins with the use of lambda exonuclease to digest the 5′ ends at a fixed distance from the crosslinked protein, thus greatly reducing contaminating non-specific DNA (66). Additionally, the use of formaldehyde crosslinking has been shown to protect DNA from micrococcal nuclease digestion, so sonication is now the preferred method of fragmentation when using ChIP-seq in the assessment of DNA binding proteins. Conversely, micrococcal nuclease is known to digest the linker regions between nucleosomes, so it remains the preferred method for chromatin fragmentation when studying histone modifications (68). Regardless of fragmentation method, if successful the DNA insert plus the sequencing adapters should be ~300 bp. We routinely do bead-based purifications after sequencing adapter ligation and again after the PCR step in the library protocol in order to minimize sample losses.

One of the greatest technical issues in ChIP-seq has been the requirement for large amounts of starting material (68). Typically, 1 million to 20 million cells are required per IP in order to acquire sufficient material for sequencing. These amounts are particularly difficult to achieve for primary cells, progenitor cells, and clinical samples. This remains an area that will benefit greatly from improved sequencing library preparation methods from very small quantities of relatively short fragments of DNA. To date, most methods attempting to ameliorate the large amount of starting material required for ChIP-seq have required whole genome amplification or extensive PCR amplification. However, the recently introduced Nano-ChIP-seq method allows for starting amounts down to 10,000 cells by using custom primers with hairpin structures and an internal BciVI restriction site (66, 70). In another recent development, ChIP-seq for the transcription factor ERalpha was successfully performed with an input of only 5000 cells by using single tube linear amplification (LinDA). This approach uses an optimized T7 RNA polymerase IVT-based protocol, which was demonstrated to be robust and reduced amplification bias due to GC content (66).

It is especially challenging to study a novel DNA binding protein or histone modification for which there are no commercial antibodies. The approach required in these cases usually entails the use of transient or stable expression of the protein of interest with a tag that can be targeted (such as a His or FLAG tag). The drawback of this approach is the need for extensive controls to ensure that the fusion protein is localized properly and that interactions are not affected by steric hindrance or non-endogenous expression levels (67).

Sample preparation for NGS applications: RIP-seq/CLIP-seq

Transcription of primary RNAs begins a complex process involving the recognition of intron/exon junctions, splicing and alternative splicing, addition of poly(A) tails, transport to the cytoplasm, entry into ribosomes, processing of various non-coding RNAs, and the generation of signals for RNA degradation. One powerful tool for studying these events, and the proteins that control them, is RIP-seq, where protein complexes assembled at different sites on the RNA molecules are immunoprecipitated and then the RNA bound to them is purified and sequenced (Figure 6)(71).

RNA binding proteins (RBPs) recognize ribonucleic acid motifs including specific sequences, single-stranded backbones, secondary structures, and double-stranded RNA (72, 73). These interactions involve all types of RNAs and occur at every step from transcription to degradation (74). Many steps in the post-transcriptional processing of messenger RNA overlap, resulting in multiple RBP complexes bound to a transcript at any given moment in its existence (75). RIP-seq can be done with protein-specific antibodies or by expressing tagged versions of the RBPs of interest. Furthermore, RIP-seq provides the ability to characterize the function of an RBP in a specific cell type and/or cell state based on the population of bound RNAs (76-78).

The amount of starting total RNA needed for a successful RIP-seq experiment is significantly greater than that required for RNA-seq. First, the amount of RNA bound by any given RBP is highly variable but always only a fraction of the original pool and often a very minor fraction. Second, depending on the target RBP, a nuclear lysate may be required, necessitating an even greater amount of starting material (79). Another technical challenge is the tendency of RNA to non-specifically bind proteins. We address this limitation by preclearing the lysate with an isotype control antibody bound to beads. Non-specific DNA binding is also a challenge. DNase I treatment should be performed multiple times throughout the protocol (i.e., during lysate preparation, post-TRIZOL separation, and library preparation). The duration of the IP step can vary from 2 h to overnight. Longer incubation times can increase the percentage of pulled down protein; however, non-specific RNA binding is also increased, resulting in additional noise. RIP-purified RNA can be taken directly into standard library protocols suitable for low input, short fragment samples. We have had good success with the ScriptSeq-v2 RNA-Seq Library Preparation Kit (Epicenter) with our RIP-seq samples.

A variation of RIP-seq is crosslinking and immunoprecipitation (CLIP-seq) followed by digestion of the RNA sequences not protected by the RBP complexes. This procedure is used to identify the specific binding sites and flanking sequences of RBPs. In the original CLIP protocol, the starting material was crosslinked by exposure to UV radiation (80). Prior to immunoprecipitation, the prepared lysate is digested with RNase, limiting the RNA populations to those regions protected by the bound RBPs. Next, there is a multistep protocol to radiolabel the RBP-bound RNA, separate the samples by SDS-PAGE, visualize the RNA-protein complex by autoradiography, and excise the desired region (~5–30 kDa above the target RBP's molecular weight). Finally, the RBP is digested with proteinase K, linkers are ligated to the remaining RNA fragments, and a library is constructed for sequencing (81, 82). Control samples are required to account for crosslinking efficiency, RNase digestion, and non-specific RNA binding (83).

Recent modifications to the CLIP-seq protocol include individual-nucleotide resolution CLIP (iCLIP)(84) and photoactivatable-ribonucleoside-enhanced CLIP (PAR-CLIP)(85). In iCLIP, an adapter ligation step is replaced with an intramolecular circularization step that has increased reaction efficiency and the added ability to identify the site of crosslinking (individual nucleotide resolution)(84). In PAR-CLIP, a ribonucleoside analog (4-SU or 6-SG) is added to the media prior to UV-crosslinking. The irradiation step binds the ribonucleoside analog to the RBP in addition to changing the base's identity. Following the standard CLIP-seq protocol, the photoactivated crosslinked sites can be identified by locating single base mismatches or indels when compared with the whole RNA-seq data (86).
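Locating candidate crosslinked sites from single base mismatches can be sketched as below. The example assumes gapless alignments given as (offset, sequence) pairs against a short reference string, and a hypothetical `min_reads` threshold to suppress isolated sequencing errors; indels and comparison against matched RNA-seq data are not handled. The toy reads carry a T-to-C change, the substitution generally reported for 4-SU crosslinks, which is a detail added here and not stated in the text above.

```python
from collections import Counter

def mismatch_sites(reference, aligned_reads, min_reads=2):
    """Return sorted (position, ref_base, read_base) tuples seen in >= min_reads reads.

    `aligned_reads` is a list of (offset, sequence) pairs describing gapless
    alignments to `reference` (0-based offsets).
    """
    counts = Counter()
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if pos < len(reference) and base != reference[pos]:
                counts[(pos, reference[pos], base)] += 1
    return sorted(site for site, n in counts.items() if n >= min_reads)

reference = "GATTACAGATTACA"
reads = [
    (2, "TCACAG"),  # T->C at position 3
    (1, "ATCAC"),   # T->C at position 3
    (0, "GATTAG"),  # single C->G at position 5, below threshold
]
print(mismatch_sites(reference, reads))  # [(3, 'T', 'C')]
```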

Sample preparation for NGS applications: Methylseq

A fundamental mechanism of the epigenetic regulation of gene activity is DNA methylation. This is rapidly being recognized as a critical feature of disease states where simple genetic inheritance is not sufficient to explain the complexity of the phenotypes encountered in clinical medicine. In principle, DNA methylation changes also reflect the history of the organism, not just the genetic inheritance.

Methylation of the 5 position of cytosine (5mC) is the most common form of DNA methylation, with 60%–80% of the 28 million CpG dinucleotides in the human genome being methylated (87, 88). While genome-wide hypomethylation has been linked to increased rates of mutation and chromosomal instability, hypermethylation of promoters inhibits gene transcription (89). DNA methylation is also essential for genetic imprinting, suppression of transposable elements, and X chromosome inactivation (90). Aberrant DNA methylation is associated with many diseases including cancer, autoimmune diseases, inflammatory diseases, and metabolic disorders (91-94).

Early studies were limited to investigating DNA methylation in a few genes at a time or generating a non-specific but global estimation of methylation. Recent advances in high throughput sequencing have dramatically increased both the throughput and resolution of such studies. There are three major methods for studying DNA methylation with NGS platforms: (i) restriction enzyme (RE) based, (ii) targeted enrichment, and (iii) bisulfite sequencing (Figure 7). Each of these methods has advantages and disadvantages that must be weighed according to the researcher's needs and budget.

Methylation sensitive restriction enzyme sequencing (MRE-seq) relies on restriction enzymes that are sensitive to CpG methylation (Figure 7A)(95, 96). The most commonly used REs are the methylation-sensitive HpaII and its methylation-insensitive isoschizomer MspI (97). A method called HELPseq (HpaII tiny fragment enriched by ligation mediated PCR) utilizes both of these enzymes to analyze genome-wide methylation profiles (98). A sample is digested with each enzyme, and the resulting fragments are sequenced separately. The MspI digested reference sample not only allows for a point of comparison for methylation but also controls for misinterpretation of HpaII not cutting due to single nucleotide polymorphisms (SNPs)(97). Other RE-based methods, such as methyl-sensitive cut counting (MSCC), methylation-specific digital sequencing (MSDS), and modified methylation-sensitive digital karyotyping (MMSDK) rely on other methylation sensitive REs (97). RE-based methods are limited in their scope by the fixed number of digestion sites present in the genome, which skews the view of CpG methylation to these particular sites, and their accuracy depends on complete, high-fidelity digestion (67).
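The comparison of HpaII and MspI digests can be sketched as a per-site ratio. This is a simplified illustration, not the published HELPseq analysis: counts are assumed to be already normalized for library size, site identifiers are hypothetical, and the estimate is simply one minus the HpaII/MspI ratio, capped at zero methylation. Sites with no MspI reads are skipped, which mirrors the role of the MspI reference in flagging sites lost to polymorphism.

```python
def methylation_estimates(hpaii_counts, mspi_counts):
    """Return {site: estimated methylated fraction} from paired digest counts.

    hpaii_counts, mspi_counts: dicts mapping a site identifier to a
    depth-normalized read count (assumed to be normalized upstream).
    """
    estimates = {}
    for site, mspi in mspi_counts.items():
        if mspi == 0:
            continue  # site not cut in the reference digest, e.g. altered by a SNP
        ratio = min(hpaii_counts.get(site, 0) / mspi, 1.0)
        estimates[site] = 1.0 - ratio  # HpaII cuts only unmethylated sites
    return estimates

mspi = {"site1": 100, "site2": 80, "site3": 0}
hpaii = {"site1": 100, "site2": 20, "site3": 0}
print(methylation_estimates(hpaii, mspi))  # {'site1': 0.0, 'site2': 0.75}
```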

Affinity enrichment of methylated DNA requires either antibodies specific for methylated DNA (MeDIP) or other proteins capable of binding methylated DNA (MBDseq)(Figure 7B)(95, 97, 98). Specifically, the methyl binding domain (MBD)-containing proteins MeCP2, MBD1, MBD2, and their binding partner MBD3L1 have been used to immunoprecipitate methylated DNA (98). While such immunoprecipitation methods are not limited by sequence specificity, they tend to preferentially pull down regions that are heavily methylated and miss genomic areas with sparse methylation. Moreover, sequencing of the recovered material gives the researcher an idea of the areas that are methylated, but does not reveal which individual bases are methylated.

Treatment of DNA with sodium bisulfite results in the chemical conversion of unmethylated cytosine to uracil while methylated cytosines are protected (Figure 7C)(99). Bisulfite conversion coupled with shotgun sequencing was first performed in Arabidopsis thaliana by two research groups who coined the methods BS-seq (100) and MethylC-seq (101). MethylC-seq was also used to create the first human single base resolution map of DNA methylation (87). While BS-seq/MethylC-seq is widely considered the gold standard in methylome analysis, it requires significant read depth (30× coverage)(67). It remains expensive and not easily applied to the large sample sizes needed for clinical investigations. Recently, it was shown that only ~20% of CpGs are differentially methylated across 30 human cells and tissues, suggesting that 80% of the CpG methylation in whole genome sequencing is not informative (88). To reduce the cost and complexity of data associated with whole genome bisulfite sequencing, recent methods have sought to couple enrichment methods with bisulfite sequencing. The capture and targeted sequencing of specific regions identified in the genome to be enriched for CpG methylation sites such as islands, shores, gene promoters, and differentially methylated regions (DMRs) can be accomplished using a commercially available kit from Agilent Technologies (SureSelectXT Methyl-Seq Target Enrichment). Alternatively, bisulfite conversion of DNA isolated by MeDIP or MBD pull downs allows for single base resolution to be achieved by these methods. Sequence-specific binding to beads (51) followed by bisulfite treatment or binding of bisulfite-converted DNA to bisulfite padlock probes (BSPPs)(102) has also been demonstrated to be an effective method for enriching potentially methylated regions. Our group developed a method for targeted bisulfite sequencing using microdroplet PCR with custom-designed droplet libraries (55). This technique relies on the unbiased amplification of bisulfite treated DNA with region-specific primers. All of these enrichment methods retain the single base pair resolution that is so advantageous for bisulfite sequencing while vastly reducing the amount of sequencing required. However, it is important to note that bisulfite treatment of DNA leads to DNA instability and loss of product; thus, many of these methods require more input DNA than the non-bisulfite conversion-based methods.
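The principle behind reading methylation from bisulfite-converted DNA can be sketched in a few lines. The example is a simplification under stated assumptions: only the top strand is modeled, conversion is assumed to be complete, uracil is written as T because that is how it is read after amplification, and the function names are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate complete bisulfite conversion of one strand.

    Unmethylated C is read as T after conversion and amplification;
    C at a 0-based index listed in `methylated_positions` is protected.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

def call_methylation(reference, read):
    """Return {position: True/False} for each reference C covered by the read.

    True means the C was retained (methylated); False means it was read as T
    (unmethylated). Other bases at a reference C are left uncalled.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True
        elif read_base == "T":
            calls[i] = False
    return calls

reference = "ACGTCGAC"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)                               # ACGTTGAT
print(call_methylation(reference, read))  # {1: True, 4: False, 7: False}
```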

The recent discovery that 5-hydroxymethylcytosine (5hmC)(103) is an intermediate of the demethylation of 5mC to cytosine has opened a whole new area of study into the mechanics of DNA methylation and epigenetic regulation. Studies revealed that the Ten-Eleven Translocation (TET) family of proteins facilitate demethylation of 5mC to cytosine through three intermediates, 5hmC, 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC). Bisulfite treatment converts 5fC and 5caC to uracil, but cannot convert 5mC or 5hmC. Thus, bisulfite sequencing cannot distinguish between 5mC and 5hmC (67). In order to detect these novel methylation intermediates, new techniques have been developed. The first efforts either involved antibodies specific for 5hmC (hMeDIP-seq) or chemical modification of 5hmC (67). More recent advances toward single-base resolution sequencing of 5hmC are oxidative bisulfite sequencing (oxBS-seq)(104) and TET-assisted bisulfite sequencing (TAB-seq)(105). Single-molecule real-time (SMRT) DNA sequencing (Pacific Biosciences, Menlo Park, CA) has been introduced as another method to sequence 5hmC (106). SMRT sequencing relies on the kinetics of polymerase incorporation of individual nucleotides, allowing for direct detection of these modified cytosines (106). Most recently, antibody-based immunoprecipitation methods (107, 108) and chemical modification methods have been developed to allow for sequencing of 5fC (109).
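The reason standard bisulfite sequencing cannot separate 5mC from 5hmC follows directly from the conversion behavior described above, which can be tabulated and grouped by the base that is finally read. The table restates the text; the helper name is an assumption, and uracil is written as T because that is how it is read after amplification.

```python
# Base read after bisulfite treatment and amplification, per cytosine state.
BISULFITE_READOUT = {
    "C": "T",     # unmethylated cytosine is converted
    "5mC": "C",   # protected
    "5hmC": "C",  # protected
    "5fC": "T",   # converted
    "5caC": "T",  # converted
}

def indistinguishable_groups(readout):
    """Group cytosine states by the base they are read as."""
    groups = {}
    for state, base in readout.items():
        groups.setdefault(base, []).append(state)
    return groups

print(indistinguishable_groups(BISULFITE_READOUT))
# {'T': ['C', '5fC', '5caC'], 'C': ['5mC', '5hmC']}
```

States that fall in the same group give identical reads, which is why additional chemistry or enzymatic steps are needed to resolve them.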

The tremendous and rapid evolution of NGS technologies and protocols has generated both amazing opportunities for science and significant challenges. We believe that the transformational power of deep sequencing has already been clearly demonstrated in basic science. It is poised to advance into clinical medicine, creating a new generation of molecular diagnostics based on DNA sequencing, RNA sequencing, and epigenetics.

Author contributions

SRH, PO, and DRS wrote and edited the paper. HKK contributed to the Methylseq section, SALM contributed to the ChIP-seq section, TW contributed to the RIP-seq section, and FVN contributed to the Nextera and mate-pair library sections.