Next-Gen Sequencing Poised to Open New Vistas for Biomedical Research

NGS has generated a great deal of excitement in the life sciences research community in recent years. Funding agencies have focused on genomics in general and NGS in particular, resulting in rapid dissemination of high-throughput, short-read instruments and chemistries. The bioinformatics community has cooperated by providing continually improved means to deal with the mounds of data rolling out from genome labs and centers. In the Next-Generation Sequencing Technologies: Applications and Markets report, we examine the broad spectrum of NGS applications used in studies, which even at this early date have already begun delivering intriguing new insights in a variety of fields. Much of this work to date has focused on basic research studies, which are generating important new leads suggesting downstream follow-on studies, large numbers of new biomarker candidates, multiple clues in the search for genetic and epigenetic contributions to disease, insights into the human microbiome, and new views on evolution, among other results. Most importantly, the huge acceleration in NGS activity already shows signs of generating major shifts in core paradigms central to advancing biomedical research and therapeutics.

Although the boundary between basic and applied research is not always clear, we make an effort here to describe NGS-related activities that fall clearly in the basic realm, whereas our report as a whole focuses on applied research.

During the 1990s, the life sciences research community eagerly awaited results from the Human Genome Project. Many expected that we would find 100,000 or more genes, quickly identify the function of many of them, generate new knowledge about disease mechanisms, and identify lots of new high-quality targets for drug discovery. Perhaps not surprisingly, reality fell short of expectations. The number of genes turned out to be closer to 23,000 than 100,000; functions for more than half these genes remain to be discovered; most of the new untested drug targets proved disappointing; and discovery of useful new diagnostic biomarkers fell far short of expectations.

NGS came on the scene in 2005, was pretty well ramped up by 2008, and really hit its stride in 2009. Genome centers have by now largely mastered the generation, storage, and analysis of NGS data, and impressively large numbers of basic research NGS studies now appear in print each month. Not surprisingly, much of the activity to date has centered on cancer, the prototypical “DNA disease,” but activity has recently begun shifting to other disease categories as well.

We were fortunate to have Dr. Elaine Mardis, Co-Director and Director of Technology Development at The Genome Center at Washington University, a major force in basic NGS-driven research, agree to an interview for this report. Here, she summarizes current research at her center:

Insight Pharma Reports (IPR): You and your center have recently published some groundbreaking NGS work, especially in the area of cancer. Beyond that, what kinds of projects will you work on going forward?

Elaine Mardis: They break down along four lines. You are aware of the cancer work that has been done, and there is a lot more to come on that. By virtue of having really excellent clinical collaborators here, we’ve managed to get funding for cancer samples that I think are going to address some very significant questions about the genetic basis for the disease. Especially with the breast cancer samples that we’re working on right now, we’re starting to look at generating genomic predictors of response to specific chemotherapies and things like that. It’s all pretty exciting.

Beyond cancer, we’re working on the human microbiome, looking at various body sites to begin to characterize the microbial content in people of different genders and ages. For the most part, we’re looking at healthy individuals, although there are some disease-based studies as well. Probably this is the first project in the history of human genetics to actually focus on health rather than disease as the first pass, which is kind of refreshing.

Moving from the microbiome back to the human genome, we have a number of projects just underway that are looking at human complex diseases other than cancer. Those run the gamut from eye degeneration disorders, like retinitis pigmentosa and acute macular degeneration, to garden-variety complex diseases such as metabolic syndrome, which is a catchall term for diabetes, hypertension, and others. There we’re looking at a huge Finnish cohort that’s well in excess of 7,000 samples. More recently we got approval for a look at ALS (Lou Gehrig’s disease) in some very specific samples that are derived from a warm autopsy program via a collaborator in Australia. We’ll be looking at somatic variation, as well as RNA and methylation differences, in tissues from patients that are affected by ALS versus unaffected tissue from the same patients. I think that will be a very interesting study.

Last but not least, there’s something that we’ve always done in various forms and flavors, which is continuing to de novo sequence and assemble genomes. Now with NGS technology we’re doing more of this work than ever before. The song remains the same, but you can do a lot more now in terms of looking at multiple individuals within a population, for example, of monkeys. You can identify major common SNPs in the population, things like that, just using next-gen sequencing with fairly light coverage against a high-quality reference assembly. So, those are the four basic areas of research: cancer genomics, the human microbiome, human complex diseases other than cancer, and de novo sequencing and assembly of genomes.

Epigenetics

The term epigenetics refers to changes in an organism’s phenotype or pattern of gene expression caused by mechanisms other than alteration of the DNA base sequence. A formal definition was proposed at a Cold Spring Harbor Laboratory conference in 2009: “An epigenetic trait is a stably heritable phenotype resulting from changes in a chromosome without alterations in the DNA sequence.”1 These changes may last only for the lifetime of a cell, or they may be heritable for multiple generations. Cellular differentiation in eukaryotic organisms is a normal and predictable epigenetic process, but the silencing of oncoprotective genes, which probably contributes to the onset of cancer, exemplifies the abnormal variety. The term epigenomics encompasses a genome-wide perspective of epigenetic alterations. Epigenomics has been targeted for large-scale NGS investigation, and publications have already begun to appear.

Epigenetics has several molecular manifestations. Major ones include chemical modification of histones (e.g., methylation, acetylation, and phosphorylation), which alter their charge and thereby loosen the quaternary structure of chromosomes to enhance gene expression; methylation of DNA bases, which converts cytosines to 5-methylcytosines, primarily when targeted Cs are adjacent to Gs; generation of RNAi molecules that block expression of particular genes; and formation of prions, proteins that alter the tertiary structure and function of other proteins.

NGS naturally focuses on cytosine methylation, and the following section provides an introduction to the methods for assessing the location of methylated bases in the genome and their significance for basic and applied research. Results from recent NGS epigenetic studies in cancer and other diseases are included in the Next-Generation Sequencing Technologies: Applications and Markets report.

Methylation

DNA can be methylated at the 5-carbon of cytosine or 6-nitrogen of adenine. Cytosine methylation is particularly consequential and key for epigenetics. It plays an essential role in the normal development and cellular differentiation of many organisms, including all vertebrates, by influencing gene expression patterns in cells. Notably, methylation induces pluripotent stem cells to transform into stable special-purpose cells in the body. Gone awry, however, DNA methylation apparently has an important role to play in cancer and quite possibly a number of other diseases.

Addition of methyl groups to cytosine residues in coding regions of DNA can reduce or block the expression of particular genes. In adult somatic tissues, cytosine methylation usually occurs when the neighboring base is guanosine, forming so-called CpG pairs. However, methylation at cytosines not adjacent to guanosines dominates in embryonic stem cells. Regions of the genome with lots of CpG pairs are said to contain CpG islands (CGIs). Notably, hypermethylation of CGIs in the promoter region of tumor suppressor genes has been linked to cancer development.1

Studying patterns of DNA methylation has become a major preoccupation in genomics. In fact, determining genome-wide patterns of methylation has been granted “-omic” status and dubbed methylomics. Several methods for detecting methylation patterns have been developed. The most popular entail treatment of DNA with sodium bisulfite, which converts unmodified cytosines to uracils while leaving methylated ones untouched. For hybridization purposes, uracils behave like thymines.
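The bisulfite step described above can be illustrated with a minimal in-silico sketch. It only mimics the base-level outcome on a single strand (unmethylated C becomes U, methylated C is left alone, and U is then read as T); the function names and the toy sequence are hypothetical and not taken from any published pipeline.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment of one DNA strand.

    Unmethylated cytosines are written as U (uracil); cytosines whose
    0-based index is in methylated_positions are left as C.
    """
    methylated = set(methylated_positions)
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in methylated:
            converted.append("U")
        else:
            converted.append(base)
    return "".join(converted)


def as_sequenced_read(converted):
    """Uracil behaves like thymine, so it shows up as T in the read."""
    return converted.replace("U", "T")


# Toy example (hypothetical sequence): only the C at index 1 is methylated.
strand = "ACGTCCGA"
treated = bisulfite_convert(strand, methylated_positions=[1])
read = as_sequenced_read(treated)
print(treated)  # ACGTUUGA
print(read)     # ACGTTTGA
```

Comparing the read with the untreated reference then reveals methylation: a C that survives marks a methylated position, while a C that now reads as T was unmethylated.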

Methylation analysis can be done either for typing or profiling purposes. The former encompasses only a few loci, which are typically examined in a number of samples. Typing is typically done using methods based on PCR, restriction enzymes, and mass spectrometry. Methylation profiles encompass larger regions or even whole genomes and are currently done using either microarrays or, increasingly, NGS.

An early entrant in the microarray methylation arena is Illumina’s GoldenGate Methylation Cancer Panel I, which covers 1,505 CpG pairs from 807 genes. The product was quite popular, but has now been phased out. Without going into detail, the method involves immobilization of bisulfite-treated DNA on beads followed by hybridization with allele-specific and locus-specific oligonucleotides. Bound oligos are then modified to accommodate PCR amplification with universal templates. Extent of methylation at a CpG site is determined by comparing signals from methylated and unmethylated alleles in the genomic sample.
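As a rough illustration of comparing the two signals, the extent of methylation at a site can be expressed as the methylated share of the total signal. The plain ratio below is an assumption made for illustration; the vendor's actual scoring formula may differ (for example, by adding a stabilizing offset), and the intensity values are invented.

```python
def methylation_fraction(methylated_signal, unmethylated_signal):
    """Return the methylated share of total signal at one CpG site (0.0 to 1.0).

    Assumption: a simple ratio, not the vendor's exact scoring method.
    """
    total = methylated_signal + unmethylated_signal
    if total <= 0:
        raise ValueError("no usable signal at this site")
    return methylated_signal / total


# Hypothetical intensities for one site: mostly methylated.
print(methylation_fraction(3000, 1000))  # 0.75
```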

An alternative approach involves enriching the fraction of methylated DNA via immunoprecipitation with an antibody that binds to methylated cytosines. The enriched fraction can then be compared with the total DNA for hybridization to an oligonucleotide microarray. The methylated DNA immunoprecipitation microarray method, MeDIP-chip, also applies to NGS, where it is called MeDIP-seq. A similar procedure replaces the antibody with MBD (methyl-binding domain) proteins. The highly sensitive assay is then called methylated CpG island recovery (MIRA). No single microarray-based method provides comprehensive methylation analysis, and it seems clear for this and other reasons that NGS will become the dominant modality.2

NGS methylation analysis already provides a favorable ratio of genome coverage to cost, and as sequencing costs continue to decrease, we can expect further improvement. It is currently possible to sequence all CpG islands in a genome using whole-genome bisulfite sequencing. Several alternative technological approaches have been applied to methylation profiling. MethylC-seq on Illumina’s Genome Analyzer offers single-base resolution, permitting examination of 94% of all cytosines in the human genome.

Pacific Biosciences’ SMRT single-molecule sequencer, currently in early access evaluation and due for commercial introduction before the end of 2010, offers the possibility of evaluating DNA methylation without going the bisulfite route. The key here is that SMRT measures nucleotide incorporation in real time, and methyl modification alters the kinetics of that process in a measurable way. The company has been working under an ARRA (American Recovery and Reinvestment Act) grant to develop this epigenetics application.

In that regard, PacBio’s Dr. Eric Schadt made the following comments in our interview for this report:

Dr. Schadt: The other really key advantage is detection of epigenetic changes. Because you’re observing the sequencing happen in real time, you can use this time component to assess whether there are epigenetic changes at a given site as the DNA polymerase sequences through a given region. You observe variation via the kinetics of the enzyme’s activity. We’ve just shown in a paper published in Nature Methods [2010;7:461–5] that this kinetic approach can predict chemical modifications to the bases, things like methyl-C residues become evident, and you just pick that up by looking at kinetic variation of the enzymatic sequencing process. So this opens up a whole new dimension in DNA that goes beyond the As, Ts, Gs, and Cs and leads to this whole chemical modification space without requiring any sort of bisulfite treatment as is required for the Illumina or SOLiD technologies. With those technologies, you have to treat the genome with this harsh chemical first to do the conversion, and then sequence, whereas with the PacBio technology you get it all for the price of a standard sequencing run. So the advantages are the kinetic variation detection, the fast turnaround time, and the long read length.

IPR: Will the methylation detection methodology be ready to go when the first systems ship?

Dr. Schadt: It comes straight from the sequencing data. So the information that’s needed to infer things like methylation will exist when the first units ship, but it may be six months to a year after that release before the software that enables the interpretation of that type of information is ready for prime time.

A large number of DNA methylation studies have been conducted and published. Subjects covered include variations in methylation patterns among various cell types, the role of methylation in gene regulation, X-chromosome inactivation, genomic imprinting, and tumorigenesis.3 Interest in characterizing DNA methylation patterns in whole genomes has increased markedly in recent years, no doubt driven by the maturation of NGS technologies.

A modest-in-scope Human Epigenome Project consortium was formed in 2003, involving The Wellcome Trust Sanger Institute, Epigenomics AG, and The Centre National de Genotypage. The project set out to catalog methylation variable positions (MVPs) in the human genome with emphasis on tumor samples. They have completed a pilot study of methylation patterns within the MHC (major histocompatibility complex), the region on chromosome 6 associated with a large number of diseases. They identified MVPs near promoter and other relevant regions of about 150 loci in the MHC in several tissues in a number of individuals. In conducting the study, they developed an integrated platform incorporating automated bisulfite treatment of minute tissue biopsy specimens, bisulfite PCR, and large-scale sequencing of PCR amplicons. They analyzed and quantified methylation patterns by mass spectrometry and microarray analysis. Among significant results, they found that DNA methylation remains more stable over one’s lifetime than originally suspected.

In 2008, the NIH’s Roadmap Epigenomics Program allocated $190 million for such research. As of April 2010, 44 programs had been funded in epigenomics of human health and disease, reference epigenome mapping, epigenomic data analysis and coordination, technology development, and discovery of novel epigenetic marks in mammalian cells.

In late January, a group of involved biologists announced launch of the IHEC (International Human Epigenome Consortium), which intends in its first phase to map 1,000 reference epigenomes within a decade.4 The consortium is in the process of recruiting member funding agencies and other organizations in order to generate the $130 million needed to complete the first phase. To qualify for executive committee status, members must contribute at least $10 million over five years and agree to make their data public. Spearheading the effort is the NIH, which as indicated above has its own $190 million five-year epigenomics roadmap program, along with the European Commission, which had planned to start soliciting proposals for a 30 million Euro epigenetics consortium in July 2010.

Cold Spring Harbor Laboratory’s Rob Martienssen, PhD, a member of the consortium’s steering committee, has been quoted: “Epigenomes are changeable and programmable and will feed us the bottom line on how the genome works.” Another steering committee member, Philip Avner, PhD (Institut Pasteur), cautions, “The human genome is singular and finite, but the human epigenome is almost infinite—the epigenome changes in different states and different tissues.” The epigenome changes during early development, but also with age and in response to environmental stress. Degrees of normal variation in the epigenome in a single individual over a lifetime, or even a single day, are currently unknown. Some involved scientists consider the project premature for this reason, but many others believe that the current explosion of epigenetics research is reason enough to bring some standardization and order to the field.

The first comprehensive maps of methylomes for two cell types were determined at a cost of $100,000 each and published in November 2009.3 Nature quoted lead author Joseph Ecker, PhD of the Salk Institute as saying, “I expect the cost of a similar methylome will fall to $10,000 in the next six months.” If he is correct, the $130 million IHEC budget would easily cover the 1,000 epigenome target. Participants in the formative meeting agreed that most of the reference epigenomes should come from normal human tissue to provide a reference base for subsequent examination of abnormal tissues.
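The budget claim can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that each reference epigenome costs the same as one methylome map; a full reference epigenome would likely involve additional assays, so these totals are a lower bound rather than a forecast.

```python
def total_mapping_cost(epigenome_count, cost_per_epigenome):
    """Total cost in dollars, assuming a flat per-epigenome cost."""
    return epigenome_count * cost_per_epigenome


IHEC_BUDGET = 130_000_000   # first-phase budget cited in the text
TARGET_EPIGENOMES = 1_000   # first-phase target cited in the text

at_2009_cost = total_mapping_cost(TARGET_EPIGENOMES, 100_000)
at_projected_cost = total_mapping_cost(TARGET_EPIGENOMES, 10_000)

print(at_2009_cost, at_2009_cost <= IHEC_BUDGET)            # 100000000 True
print(at_projected_cost, at_projected_cost <= IHEC_BUDGET)  # 10000000 True
```

Under this simplification, methylome mapping alone would use most of the budget at the 2009 cost but under a tenth of it at the projected cost.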

As mentioned, an international team recently generated two “whole genome” maps (actually, maps covering the majority of the human genome) at single-base resolution, one for embryonic stem cells and one for fibroblasts. They also profiled several important histone-binding regions, sites where transcription factors bind, and the transcriptomes for the two cell types (including mRNA and small RNA components).

Procedurally, they generated MethylC-seq (using the aforementioned bisulfite method) and ChIP-seq libraries (see next section) for sequencing on Illumina’s Genome Analyzer II. They aligned reads to the human reference genome (hg18) and used base calls for each reference position on each DNA strand to identify methylated cytosines. Both cell lines generated around 1.2 billion reads that aligned uniquely to the reference genome sequence. For each cell type they covered more than 86% of both strands of the reference genome, which accounted for 94% of all cytosines.
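The idea of using base calls at each reference cytosine to identify methylation can be sketched as follows. This is a simplified illustration, not the study's method: it counts C versus T calls at one reference cytosine on one strand, and the coverage and level thresholds are hypothetical placeholders for the statistical filtering a real analysis would apply.

```python
from collections import Counter


def methylation_level(read_bases):
    """Fraction of informative base calls that still read C.

    read_bases holds the bases that bisulfite reads show at one reference
    cytosine, all on the same strand. C means the base was protected
    (methylated); T means it was converted (unmethylated). Other calls
    are ignored. Returns None when no informative call is present.
    """
    counts = Counter(base.upper() for base in read_bases)
    methylated = counts["C"]
    unmethylated = counts["T"]
    covered = methylated + unmethylated
    if covered == 0:
        return None
    return methylated / covered


def call_methylated(read_bases, min_coverage=3, min_level=0.5):
    """Crude yes/no call; both thresholds are illustrative assumptions."""
    informative = sum(1 for base in read_bases if base.upper() in "CT")
    if informative < min_coverage:
        return False
    return methylation_level(read_bases) >= min_level


# Four reads cover this cytosine; three still show C after conversion.
print(methylation_level("CCCT"))  # 0.75
print(call_methylated("CCCT"))    # True
```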

They found 62 million methylcytosines for the stem cell line with a 1% false discovery rate, and 45 million for the fibroblast line. Essentially all the methylated cytosines (99.98%) in the fibroblast DNA were in CG pairs and the total number of methylated CG sites was very similar for both cell types. The major difference between the cell types comes from the observation that the stem cells had nearly 25% of all methylated cytosines at non-CG sites (methyl-CHG and methyl-CHH, where H = A, C, or T).
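The CG, CHG, and CHH categories used above depend only on the two bases that follow a cytosine. The sketch below classifies cytosines on the forward strand of a toy sequence; the function names and the sequence are hypothetical, and a real analysis would also handle the reverse strand.

```python
H_BASES = {"A", "C", "T"}  # H = any base except G


def cytosine_context(sequence, position):
    """Classify the cytosine at a 0-based position as CG, CHG, or CHH.

    Returns None when too few (or ambiguous) downstream bases are
    available to decide.
    """
    seq = sequence.upper()
    if seq[position] != "C":
        raise ValueError("position does not hold a cytosine")
    following = seq[position + 1:position + 3]
    if following[:1] == "G":
        return "CG"
    if len(following) < 2 or following[0] not in H_BASES:
        return None
    if following[1] == "G":
        return "CHG"
    if following[1] in H_BASES:
        return "CHH"
    return None


def non_cg_fraction(contexts):
    """Share of classified cytosines that fall outside the CG context."""
    known = [context for context in contexts if context is not None]
    if not known:
        return None
    return sum(1 for context in known if context != "CG") / len(known)


sequence = "TCGACAGCTT"  # toy sequence with one cytosine of each kind
contexts = [
    cytosine_context(sequence, index)
    for index, base in enumerate(sequence)
    if base == "C"
]
print(contexts)  # ['CG', 'CHG', 'CHH']
```

Applied to a list of methylated positions, `non_cg_fraction` would give the kind of summary reported above, such as the share of methylcytosines found at non-CG sites.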

Previous, more limited studies had also found non-CG methylation in human embryonic stem cells. The whole-genome study found that as differentiation proceeds, non-CG methylation tends to be lost, while CG methylation is retained. It has also been found that non-CG methylation can be restored when differentiated cells revert to pluripotent status. They also found that highly expressed genes in stem cells contained three times as many non-CG methyl groups as nonexpressed genes.

The researchers generated additional significant observations, which will not be summarized here. They do note that methylation outside the CG context is typically overlooked in studies using alternative methodologies. This study is only the first step on a long and arduous journey to characterize the epigenome in depth. Yet the results are fascinating and strongly suggest the value of this major undertaking.

Dr. Elaine Mardis, Co-Director and Director of Technology Development at The Genome Center at Washington University, provided her views on where epigenetics fits in the big picture:

IPR: How do you feel about the role of epigenetics in this whole theoretical framework that’s beginning to form? Perhaps we can view epigenetics as a kind of dynamic genetic variation mechanism.

Dr. Mardis: It seems everybody in their heart of hearts feels that epigenetics is important. I think that it’s not well established how fluid it is in any given cell on any given day in the normal spectrum let alone the disease spectrum. So it seems to me that there are some fundamental experiments that need to get done in humans, and I’m not sure if anybody is doing them. I know they’ve been done in mice. The problem is that mice and humans aren’t always necessarily equatable systems. I worry a little about that.

I also think that for epigenomics from the standpoint of Methyl-seq or specific histone methylation pull-downs and that sort of thing, the data are pretty noisy. So I think we need some level of refinement on separating peaks from pseudo-peaks before we begin to really accept the data, and probably some of that reflects the need for hard-core replication experiments, which would fit in nicely with the other kinds of experiments that I was just talking about. Somebody just really needs to take that on. I know it has been done in other organisms such as worms and flies, and the Broad Institute has done some beautiful studies in mice. But more needs to be done in humans. There may be some things I’ve missed, but I do think the field needs to be cleaned up a bit before you can get out good information.