A blog on DNA evidence

Tag Archives: ENCODE

On the eve of the en banc oral argument in Haskell v. Harris, the Electronic Frontier Foundation (EFF) filed a letter asking “the Court to consider the ENCODE project findings in determining the outcome of this case.” It seems hard to oppose the idea that the court should consider relevant scientific research, but without input from the scientific community, will the judges do better than they have in the past as “amateur scientists” (to use the skeptical phrase of Chief Justice Rehnquist in Daubert v. Merrell Dow Pharmaceuticals, Inc.)?

Deciphering the ENCODE papers’ descriptions of the data is no easy task, and EFF’s lawyers do not seem to be up to it. Their letter asserts that the project “has determined that more than 80% of DNA once thought to be no more than ‘junk’ has at least one biochemical function, controlling how our cells, tissue and organs behave.” This is not a fair characterization of the findings. Which geneticist ever claimed that all noncoding DNA plays no role in how cells behave? The issue always has been how much junk, how much func — and what “functions”?

What does EFF mean by “controlling”? Making organs function? Stimulating tissue growth? Turning normal cells into cancerous ones? Making us tall or short, fat or skinny, gay or straight? None of those things are mentioned in the Nature cover story cited in the letter. Instead, the EFF relies on New York Times reporter Gina Kolata’s misleading news article for the letter’s claim that “The ENCODE project has determined that ‘junk’ DNA plays a critical role in determining a person’s susceptibility to disease and physical traits like height.”

My earlier postings described the limited meaning of the phrase “biochemical function” in the cited paper. I’d love to see a citation to a page of an ENCODE paper that asserts that fully 80% of the noncoding DNA is determining “susceptibility to disease and physical traits like height.” And if I were a judge, I would demand an explanation of why “physical traits like height” are, in the words of the EFF letter, “sensitive and private.”

After the judges consider the ENCODE papers (by having their law clerks read them?), will they be better informed about the actual privacy implications of the CODIS loci than they were before this excursion into the realm of bioinformatics? I would not bet on it, but maybe I am growing cynical.

Earlier today, I introduced the concepts and terms required to ascertain whether the estimated proportion of the genome that encodes the structure of proteins or regulates gene expression has jumped from 5 or 10% to 80%. I now focus on the possible meanings of “functional” to see whether the ENCODE papers state or imply any such seismic change. It appears that they do not.

“Functional” is an adjective, and Alice learned from Humpty Dumpty that adjectives are malleable:

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean–neither more nor less.” “The question is,” said Alice, “whether you can make words mean so many different things.” “The question is,” said Humpty Dumpty, “which is to be master–that’s all.” Alice was too much puzzled to say anything, so after a minute Humpty Dumpty began again. “They’ve a temper, some of them–particularly verbs, they’re the proudest–adjectives you can do anything with, but not verbs–however, I can manage the whole lot! Impenetrability! That’s what I say!”

Like Humpty, who was redefining the word “glory,” the ENCODE authors recognized that “functional” can have many meanings. As Ewan Birney later explained:

Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism?1/

Still other possibilities exist. For example, the first paper to use the adjective “junk” for noncoding DNA noted that even debris accumulated in the course of evolution or introduced from viral infections could have a function simply by creating spaces between genes.2/ The pieces of dead wood that are joined together to form the hull of a row boat have a function–they exclude the water from the vessel to keep it afloat. This does not mean that the detailed structure of the planks–the precise width of each plank or the number of ridges on its surface–affects its functionality. And, just as something can be inactive and functional, so too something can be alive with activity and yet be nonfunctional.

ENCODE uses biochemical activity–the notion that “the biochemistry would be different”–as a synonym for functional. Here is the definition of “functional” in the top-level paper:

This definition may be useful for the purpose of describing the size of ENCODE’s catalog of elements for later study, but it contrasts sharply with the notion of functional as affecting a nontrivial phenotype. The ENCODE papers show that 80% of the genome displays signs of certain types of biochemical activity–even though the activity may be insignificant, pointless, or unnecessary. This 80% includes all of the introns, for they are active in the production of pre-mRNA transcripts. But this hardly means that they are regulatory or otherwise functional.4/ Indeed, if one carries the ENCODE definition to its logical extreme, 100% of the genome is functional–for all of it participates in at least one biochemical process–DNA replication.

That the ENCODE project would not adopt the most extreme biochemical definition is understandable–that definition would be useless. But the ENCODE definition is still grossly overinclusive from the standpoint of evolutionary biology. From that perspective, most estimates of the proportion of “functional” DNA are well under 80%. Various biologists or related specialists have provided varying guesstimates:

Under 50%: “About 1% … is coding. Something like 1-4% is currently expected to be regulatory noncoding DNA … . About 40-50% of it is derived from transposable elements, and thus affirmatively already annotated as “junk” in the colloquial sense that transposons have their own purpose (and their own biochemical functions and replicative mechanisms), like the spam in your email. And there’s some overlap: some mobile-element DNA has been co-opted as coding or regulatory DNA, for example. [¶] … Transposon-derived sequence decays rapidly, by mutation, so it’s certain that there’s some fraction of transposon-derived sequence we just aren’t recognizing with current computational methods, so the 40-50% number must be an underestimate. So most reasonable people (ok, I) would say at this point that the human genome is mostly junk (“mostly” as in, somewhere north of 50%).”5/

40%: “ENCODE biologist John Stamatoyannopoulos … said … that some of the activity measured in their tests does involve human genes and contributes something to our human physiology. He did admit that the press conference mislead people by claiming that 80% of our genome was essential and useful. He puts that number at 40%.”6/

20%: “[U]sing very strict, classical definitions of “functional” [to refer only to] places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases–we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as “functional” by intuition) that number goes up to 9%. … [¶] In addition, in this phase of ENCODE we did [not] sample … completely in terms of cell types or transcription factors. [W]e’ve seen [at most] around 50% of the elements. … A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our [limited] sampling) to 20%.”7/
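
The arithmetic behind the last of these figures is easy to lose in the bracketed quotation, so here is a minimal sketch of it. The percentages come from the quotation above. The variable and function names are mine, and the assumption that coverage scales in simple proportion to the fraction of elements sampled is my reading of the extrapolation, not something the ENCODE papers spell out.

```python
# Figures quoted from Birney above; names and the linear-scaling assumption are illustrative.
CONTACTS_PCT = 8.0             # specific DNA:protein contacts
CONTACTS_PLUS_EXONS_PCT = 9.0  # the same, with exons added
SAMPLED_FRACTION = 0.5         # "around 50% of the elements" seen so far
HEADLINE_PCT = 80.0            # the widely reported "biochemical function" figure


def extrapolate(observed_pct: float, sampled_fraction: float) -> float:
    """Scale an observed coverage up to full sampling, assuming simple proportionality."""
    return observed_pct / sampled_fraction


conservative_pct = extrapolate(CONTACTS_PLUS_EXONS_PCT, SAMPLED_FRACTION)
gap_points = HEADLINE_PCT - conservative_pct

print(f"Extrapolated strict estimate: {conservative_pct:.0f}% (rounded up to 20% in the quotation)")
print(f"Distance from the headline figure: {gap_points:.0f} percentage points")
```

On these numbers, the strict estimate and the headline figure sit roughly sixty percentage points apart, and that gap comes entirely from the choice of definition.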

So why did the ENCODErs opt for the broadest arguable definition of “functional”? Birney’s answer is that it describes a quantity that the project could measure; that the larger number underscores that a lot is happening in the genome; that it would have confused readers to receive a range of numbers; and that the smaller number would not have counted the efforts of all the researchers.

Whether these are very satisfactory reasons for trumpeting a widely misunderstood number is a matter that biologists can debate. All I can say is that (1) I have been unable to extract a clear number–whatever one should make of it–for a percentage of the genome that constitutes the regulatory elements–the promoters, enhancers, silencers, ncRNA “genes,” and so on; (2) this number is almost surely less than the 80% figure that, at first glance, one might have thought ENCODE was reporting; and (3) “functional element” as defined by the ENCODE Project is not a term that has clear or direct implications for claims of the law enforcement community that the loci used in forensic identification are not coding and therefore not informative.

Of course, none of this means that the description of the information content of the CODIS STRs traditionally presented by law enforcement authorities is correct. It simply means that even after this phase of ENCODE, there are still a huge number of base pairs that might or might not be regulatory or influence regulation and, hence, gene expression. The CODIS STRs might or might not be among them. Published reports suggest that they are not,8/ but the logic that just because a DNA sequence is noncoding (and nonregulatory), it conveys zero information about phenotype is flawed. It overlooks the possibility of a correlation between the nonfunctional sequence and a phenotype (because the sequence sits next to an exon or a regulatory sequence).9/ Again, however, the published literature reviewing the CODIS STRs does not reveal any population-wide correlations that permit valid and strong inferences about disease status or propensity or other socially significant phenotypes.10/

Will this situation change? A thoughtful answer would take up a lot of space.11/ For now, I’ll just repeat the aphorism attributed to Yogi Berra, Niels Bohr, and Storm P: “It’s hard to make predictions, especially about the future.”

2. David E. Comings, The Structure and Function of Chromatin, in 3 Advances in Human Genetics 237, 316 (H. Harris & K. Hirschhorn eds. 1972) (“Large spaces between genes may be a contributing factor to the observation that most recombination in eukaryotes is inter- rather than intragenic. Furthermore, if recombination tended to be sloppy with most mutational errors occurring in the process, it would an obvious advantage to have it occur in intergenic junk.”). For more discussion of this paper, see T. Ryan Gregory, ENCODE (2012) vs. Comings (1972), Sept. 7, 2012, http://www.genomicron.evolverzone.com/2012/09/encode-2012-vs-comings-1972/.

4. These regions do contain some RNA-coding sequences, and those small parts could be doing something interesting (producing RNAs that are regulatory or that defend against infection by viral DNA, for example), but this kind of activity does not exist in the bulk of the introns that are, under the ENCODE definition, 100% functional.

[A]s far as questions of “junk DNA” are concerned, ENCODE’s definition isn’t relevant at all. The “junk DNA” question is about how much DNA has essentially no direct impact on the organism’s phenotype–roughly, what DNA could I remove (if I had the technology) and still get the same organism. Are transposable elements transcribed as RNA? Do they bind to DNA-binding proteins? Is their chromatin marked? Yes, yes, and yes, of course they are–because at least at one point in their history, transposons are “alive” for themselves (they have genes, they replicate), and even when they die, they’ve still landed in and around genes that are transcribed and regulated, and the transcription system runs right through them.

What the ENCODE papers … have to say about transposons is incredibly interesting. Essentially, large numbers of these elements come alive in an incredibly cell-specific fashion, and this activity is closely synchronized with cohorts of nearby regulatory DNA regions that are not in transposons, and with the activity of the genes that those regulatory elements control. All of which points squarely to the conclusion that such transposons have been co-opted for the regulation of human genes — that they have become regulatory DNA. This is the rule, not the exception.

Last week I noted some of the hyperbolic headlines accompanying the coordinated publication of a large number of datasets from the ENCODE Project. The abstract of the top-level paper begins as follows:

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.1/

Hoping to decipher these sentences, I have been reading about gene regulation. This modest effort stems from more than academic curiosity. If the popular and even some of the scientific press is to be believed, ENCODE has exorcized “junk DNA” from the body of scientific knowledge.2/ The bright light suddenly shining on the “dark matter” of the genome (to introduce another sloppy metaphor)3/ raises a giant question mark for the criminal justice system. Law enforcement authorities have always insisted that the snippets of DNA used to generate DNA identification profiles are just nonfunctional “junk.”4/ Now, according to New York Times science correspondent Gina Kolata,

As scientists delved into the “junk” – parts of the DNA that are not actual genes containing instructions for proteins – they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed. … [¶] … The thought before the start of the project, said Thomas Gingeras, an Encode researcher from Cold Spring Harbor Laboratory, was that only 5 to 10 percent of the DNA in a human being was actually being used.5/

This juxtaposition of percentages suggests that the scientific community has shifted from the view that “only 5 to 10 percent” of the genome is functional (“needed” for the organism to function normally) to a sudden realization that 80% falls into this category.

But the more I read, the clearer it became that this description of a sudden phase transition in science is wildly inaccurate. Johns Hopkins biostatistician Steve Salzberg, in a provocative Simply Statistics podcast interview, describes the 80% figure touted in the ENCODE paper as irresponsible.6/ University of Toronto biochemist Lawrence Moran saw it as a repeat of a similar, problematic performance five years ago, at the conclusion of the pilot phase of ENCODE.7/ Responding to criticism, ENCODE Project leader Ewan Birney explained the new knowledge this way:

After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.8/

“Not so surprising”? A whopping 60%–not a minor 5 or 10%–was already estimated to be “active”? What is going on here?

The answer lies in the definition of some key terms (like exons, introns, and transcription) and requires a rudimentary understanding of the fundamentals of gene expression and its regulation in human beings. This posting presents the essential terminology and concepts. A sequel will apply them to explain what ENCODE’s “assign[ing] biochemical functions for 80% of the genome” means. Anyone who knows what RNA transcripts and transcription factors do can skip this first part (or can read it to let me know of my inaccuracies).

To avoid suspense, I shall lay out my conclusions here and now: (1) if ENCODE gives a clear number for a percentage of the genome that regulates genes–the promoters, enhancers, silencers, ncRNA “genes,” and so on–I have yet to find it; (2) this number is almost surely less than the 80% figure reported for functionality; and (3) “functional element” as defined by the ENCODE Project is not a term that has clear or direct implications for claims of the law enforcement community that the loci used in forensic identification are not coding and therefore not informative. Those claims of zero information are somewhat exaggerated, but that is another story. For now, I merely describe some basics of gene expression and regulation.

Genes make proteins. But how? There are three big steps (with many activities within each step): transcription; post-transcription modification and transportation; and translation. All involve RNA, a single-stranded molecule related to DNA, and proteins. The basic picture is DNA → pre-mRNA → mature mRNA → protein.

In the first big step, the base pairs of the gene are transcribed jot-for-jot into an RNA molecule (precursor messenger RNA, or pre-mRNA). In the second major step, the transcript is modified at its ends, edited to remove parts that do not code for the protein that will be made (splicing), and the mature messenger RNA (mRNA) is moved outside the nucleus. In the third phase, another type of RNA (transfer RNA, or tRNA) delivers individual amino acids, which are stitched together in the order dictated by the mRNA transcript to form a protein, thereby translating the DNA sequence mirrored in the mRNA into the amino-acid order of the protein. Translation occurs on a kind of microscopic workbench (a ribosome) built largely of yet another RNA (ribosomal RNA, or rRNA).
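
For readers who think more easily in code than in biochemistry, the three steps can be mimicked with a toy model. This is only a sketch under heavy assumptions: the 27-base “gene,” its exon coordinates, and the five-entry codon table are invented for illustration (the five codon assignments themselves are from the standard genetic code), and the sketch copies the coding strand directly, whereas the cell reads the complementary template strand.

```python
# Toy model of transcription, splicing, and translation.
# GENE, EXONS, and the tiny codon table are hypothetical illustrations, not real data.

GENE = "ATGGCT" + "GTAAGTTTTCAG" + "TGGAAATAA"  # exon 1 + intron + exon 2
EXONS = [(0, 6), (18, 27)]                       # half-open intervals within GENE

CODON_TABLE = {  # a small subset of the standard genetic code
    "AUG": "Met", "GCU": "Ala", "UGG": "Trp", "AAA": "Lys", "UAA": "STOP",
}


def transcribe(coding_strand: str) -> str:
    """Step 1: the pre-mRNA mirrors the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Step 2: keep the exon segments and discard everything between them (introns)."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna: str) -> list[str]:
    """Step 3: read the mRNA three bases at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


pre_mrna = transcribe(GENE)
mrna = splice(pre_mrna, EXONS)
print(len(pre_mrna), len(mrna))  # the spliced transcript is shorter than the initial one
print(translate(mrna))
```

Note that every base of the toy gene is transcribed, including the intron, yet only the exon bases affect the protein. That distinction between being transcribed and mattering to the product is the one the 80% debate turns on.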

For all this to happen, the DNA, which lies tightly coiled in the chromosomes (in a protein-DNA matrix known as chromatin), must open up for transcription to occur. Thus, changes in the chromatin regulate transcription, and these changes can be brought about in a number of ways. Transcription factors (specialized proteins) bind to the DNA. The bound transcription factors then recruit an enzyme (RNA polymerase) that produces RNA. This occurs within a region of DNA, known as a promoter, near the start of the protein-coding DNA (the structural gene). The level of transcription is influenced by activator or repressor proteins that bind to still other small regions (enhancers and silencers, respectively) that also lie outside the structural gene. In short, chemical interactions that open or close the chromatin that houses the DNA and transcription factors regulate the first step in the DNA-to-protein process.

In the past decade, other mechanisms of regulation or control of gene expression have been discovered. Many DNA sequences are not transcribed into messenger RNA, but they are transcribed into a variety of other RNAs. These non-protein-coding DNA sequences can be thought of as genes for RNA. Courting confusion, they usually are called “noncoding” (ncDNA)–because they do not code for protein–but they certainly code for RNAs that are crucial to translation–rRNA and tRNA–and for other RNAs that affect transcription, translation, and DNA replication. So it turns out that the genome is abuzz with transcription-to-RNA activity and other events that feed into the expression of the (protein-)coding DNA.

Yet, this hardly means that every biochemical event along the DNA is functionally important. Some, perhaps many, non-mRNA transcripts are just “noise.” They may float around for a while, but they may not do anything except wither away. In addition, large segments of the DNA transcribed in the course of making mRNA appear in the initial transcript (the pre-mRNA) but never make it into mature mRNA. These unused parts of the pre-mRNA transcripts correspond to long stretches of DNA, known as introns, that interrupt the smaller coding parts–the exons–that are translated into proteins. The initially transcribed intronic parts are removed from the pre-mRNA in a process called RNA splicing. Most of the RNA from introns probably just dissipates.9/

All these terms are a mouthful, but armed with this basic understanding of genes, RNA, and proteins, we can see why the 80% figure does not mean what one might think. We shall also see that the estimated proportion of the genome that encodes the structure of proteins or regulates gene expression has not jumped from 5 or 10% to 80%.

3. E.g., Gina Kolata, Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role, N.Y. Times, Sept. 5, 2012. In one respect, the “dark matter” metaphor misrepresents dark matter. The presence of dark matter is inferred from its gravitational effects on visible matter. The presence of noncoding DNA is known from experiments that detect and characterize it just as they do coding DNA. Perhaps the metaphor means that the sequence of “dark matter” DNA cannot be deduced from the structure of a protein made in a cell. This, however, is like saying that dark matter is matter that cannot be seen with the naked eye. And that is not what astronomers mean by dark matter.

4. E.g., House Committee on the Judiciary, Report on the DNA Analysis Backlog Elimination Act of 2000, 106th Cong., 2d Sess., H.R. Rep. No. 106-900(1), at 27 (“the genetic markers used for forensic DNA testing … show only the configuration of DNA at selected ‘junk sites’ which do not control or influence the expression of any trait.”); New York State Law Enforcement Council, Legislative Priorities 2012: DNA at Arrest, at 5, http://nyslec.org/pdfs/2012/1_DNA_2012.pdf (“The pieces of DNA that are analyzed for the databank were specifically chosen because they are ‘junk DNA.’”).

9. Post-splicing processing of a small fraction of the RNA from introns can produce noncoding RNAs that may regulate protein expression. L. Fedorova & A. Fedorov, Puzzles of the Human Genome: Why Do We Need Our Introns?, 6 Current Genomics 589, 592 (2005).

Or maybe you heard MSNBC report that the data from ENCODE “shows us living beyond our genes” (whatever that means), or listened to CBC intone that “Junk DNA has a purpose” (sounds divine), or saw the Independent’s mishugina announcement that “Scientists Debunk ‘Junk DNA’ Theory to Reveal Vast Majority of Human Genes Perform a Vital Function!” (as if we did not know that genes were functional and important)?

The level of hype here is phenomenal. (Some useful clarification can be found at the Nature Newsblog). In the next few days, I hope to post some quick thoughts on what the ENCODE figures (like 80%) being bandied about for the “functional” or “biologically active” fraction of the human genome mean for the loci used in forensic DNA identification.

Cross-posted to Forensic Science, Statistics, and the Law (If any readers have insights to share, send me an email at kaye at alum.mit.edu, and I’ll try to use them. I am still educating myself about some of the details of gene regulation and can use any help I can get.)