Wednesday, 5 September 2012

ENCODE: My own thoughts

5 September 2012 - Today sees the embargo lift on the
second phase of the ENCODE project and the simultaneous publication of 30
coordinated, open-access papers in Nature,
Genome Research and Genome Biology as well as publications
in Science, Cell, JBC and others. The
Nature publication has a number of
firsts: cross-publication topic threads, a dedicated iPad/eBook App and web site and a
virtual machine.

This ENCODE event represents five years of
dedicated work from over 400 scientists, one of whom is myself, Ewan Birney. I
was the lead analysis coordinator for ENCODE for the past five years (and
before that had effectively the same role in the pilot project) and for the past
11 months have spent a lot of time working up to this moment. There were
countless details to see to for the scientific publications and, later, to
explain it all in editorials, commentary, general press features and other exotic
things.

But in telling the story over and over,
only parts of it get picked up here and there – the shiny bits that make a neat
story for one audience or another. Here I’d like to add my own voice, and to tell
at least one person’s perspective of the ENCODE story uncut, from beginning to
end.

This blog post is primarily for scientists,
but I hope it is of interest to other people as well. Inspired by some of my
more sceptical friends (you know who you are!), I’ve arranged this as a kind of
Q&A.

Q. Isn’t
this a lot of noise about publications when it should be about the data?

A. You are absolutely right it’s about the
data – ENCODE is all about the
data being used widely. This is what we say in the conclusions
of the main paper: “The
unprecedented number of functional elements identified in this study provides a
valuable resource to the scientific community…” We focused on providing not
only raw data but many ways to get to it and make sense of it using a variety
of intermediate products: a virtual machine (see below), browse-able resources that
can be accessed from www.encodeproject.org and the UCSC and Ensembl browsers (and soon NCBI browsers), and a new
transcription-factor-centric resource, Factorbook. As I say in a Nature commentary, “The overall
importance of consortia science cannot be assessed until years after the data
are assembled. But reference data sets are repeatedly used by numerous
scientists worldwide, often long after the consortium disbands. We already know
of more than 100 publications that make use of ENCODE data, and I expect many more
in the forthcoming years.”

Q. Whatever
– you love having this high-profile publication.

A. Of course I like
the publications! Publications are the best way for scientists to communicate with
each other, to explain key aspects of the data and draw some conclusions from
them. But the impact of the project goes well beyond the publications
themselves. While it is nice to see so much focus on the project, publishing is
simply part of disseminating information and making the data more accessible.

Q. And 442 authors! Did they all really
contribute to this?

A. Yes. I know a large proportion of them personally, and for the ones I
don’t know, I know and trust the lead principal investigators who have
indicated who was involved in this. To achieve systematic data generation on
this scale – in particular to achieve the consistency – is a large, detailed
task. Many of the other 30 papers – and many others to be published – go into
specific areas in increasing levels of detail.

One group that I believe gets less credit than it deserves is the lead data production scientists: usually an individual with a PhD who heads up, motivates and troubleshoots the work of a dedicated group of technicians. There is a simple
sentence in the paper: “For consistency, data were generated and processed
using standardized guidelines, and for some assays, new quality-control
measures were designed”. This hides a world of detailed, dedicated work.

There is no way to
truly weigh the contribution of one group of scientists compared to another in
a paper such as this; many individuals would satisfy the deletion test of “if
this person’s work was excluded, would the paper have substantially changed”.
However, two individuals stood out for their overall coordination and analysis,
and 21 individuals in this data production area, including the key role of the
Data Coordination Center.

Q. Hmmm. Let’s move on to the science. I
don’t buy that 80% of the genome is functional.

A. It’s clear that 80%
of the genome has a specific biochemical activity – whatever that might be. This
question hinges on the word “functional” so let’s try to tackle this first.
Like many English language words, “functional” is a very useful but context-dependent
word. Does a “functional element” in the genome mean something that changes a
biochemical property of the cell (i.e.,
if the sequence was not here, the biochemistry would be different) or is it
something that changes a phenotypically observable trait that affects the whole
organism? At their limits (considering all the biochemical activities being a
phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, I’ve concluded that no single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in
ENCODE we define our criteria as “specific biochemical activity” – for example,
an assay that identifies a series of bases. This is not the entire genome (so,
for example, things like “having a phosphodiester bond” would not qualify). We
then subset this into different classes of assay; in decreasing order of
coverage these are: RNA, “broad” histone modifications, “narrow” histone
modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq
peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.
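For concreteness, here is a toy sketch (with invented intervals and a made-up genome length, not ENCODE data) of how cumulative coverage over such a hierarchy of assay classes can be computed: union the intervals class by class, in decreasing order of coverage, and report the running fraction of the genome covered.

```python
# Sketch of cumulative genome coverage over a hierarchy of assay classes.
# All intervals and the genome length are illustrative, not real data.

def merge(intervals):
    """Merge overlapping or adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def covered(intervals):
    """Total number of bases covered by the (merged) intervals."""
    return sum(end - start for start, end in merge(intervals))

GENOME = 1000  # toy genome length

# assay classes in decreasing order of coverage (toy intervals)
assays = {
    "RNA": [(0, 600)],
    "broad histone marks": [(100, 750)],
    "DNaseI sites": [(50, 120), (700, 800)],
    "TF motifs": [(60, 70), (710, 715)],
}

seen = []
for name, regions in assays.items():
    seen.extend(regions)
    pct = 100 * covered(seen) / GENOME
    print(f"{name:22s} cumulative coverage {pct:.1f}%")
```

The real calculation works the same way, just over genome-scale interval sets (BED-style files) rather than toy tuples.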

Q. So remind me which one do you think is
“functional”?

A. Back to that word “functional”: There is
no easy answer to this. In ENCODE we present this hierarchy of assays with
cumulative coverage percentages, ending up with 80%. As I’ve pointed out in
presentations, you shouldn’t be surprised by the 80% figure. After all, 60% of
the genome with the new detailed manually reviewed (GenCode) annotation is
either exonic or intronic, and a number of our assays (such as PolyA- RNA, and
H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an
additional 20% over this expected 60% is not so surprising.

However, on the other end of the scale – using
very strict, classical definitions of “functional” like bound motifs and DNaseI
footprints; places where we are very confident that there is a specific DNA:protein
contact, such as a transcription factor binding site to the actual bases – we
see a cumulative occupation of 8% of the genome. With the exons (which most
people would always classify as “functional” by intuition) that number goes up to
9%. Given what most people thought earlier this decade, that the regulatory
elements might account for perhaps a similar amount of bases as exons, this is
surprisingly high for many people – certainly it was to me!

In addition, in this phase of ENCODE we did
sample broadly but nowhere near completely in terms of cell types or
transcription factors. We estimated how well we have sampled, and our most
generous view of our sampling is that we’ve seen around 50% of the elements. There
are lots of reasons to think we have sampled less than this (e.g., the inability to sample
developmental cell types; classes of transcription factors which we have not
seen). A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily justified further (given our sampling) as 20%.

Q. [For
the more statistically minded readers]: What about the whole headache of
thresholding your enrichments? Surely this is a statistical nightmare across
multiple assays and even worse with sampling estimates.

A. It is a bit of a
nightmare, but thankfully we had a really first class non-parametric
statistical group (the Bickel group) who developed a robust, non-parametric (so
it makes minimal assumption about distribution), conservative statistic based
on reproducibility (IDR). This is not perfect. Being conservative if one
replicate has far better signal-to-noise than the other, it stops calling on
the onset of noise in the noisiest replicate, but this is generally a
conservative bias. And for the sampling issues, we explored different
thresholds and looked at saturation when we were relaxed on thresholds and then
shifted to being conservative. Read the supplementary information and have a
ball.
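For the statistically curious, here is a deliberately crude stand-in for the idea behind reproducibility-based thresholding. The real IDR method fits a statistical model to the rank consistency of peak scores between replicates; this toy version just keeps peaks whose score ranks roughly agree across two replicates, which captures the spirit (only call what both replicates support) but none of the actual statistics. The peak IDs and scores are invented.

```python
# Toy stand-in for reproducibility-based peak thresholding.
# NOT the actual IDR method -- just a crude rank-agreement filter.

def reproducible_peaks(scores_a, scores_b, max_rank_gap=2):
    """scores_*: dict peak_id -> enrichment score for each replicate.
    Keep peaks whose ranks in the two replicates differ by at most
    max_rank_gap, a crude proxy for reproducibility."""
    rank_a = {p: r for r, p in enumerate(
        sorted(scores_a, key=scores_a.get, reverse=True))}
    rank_b = {p: r for r, p in enumerate(
        sorted(scores_b, key=scores_b.get, reverse=True))}
    return sorted(p for p in rank_a
                  if p in rank_b and abs(rank_a[p] - rank_b[p]) <= max_rank_gap)

rep1 = {"p1": 9.0, "p2": 7.5, "p3": 3.0, "p4": 1.2}
rep2 = {"p1": 8.1, "p2": 6.9, "p4": 2.5, "p5": 2.0}
print(reproducible_peaks(rep1, rep2))  # -> ['p1', 'p2', 'p4']
```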

Q. [For
50% of the readers]: Ok, I buy that 20% of the genome is really doing
something specific. In fact, haven’t a lot of other people suggested this?

A. Yes. There have
been famous discussions about how regulatory changes – not protein changes –
must be responsible for recent evolution, and about other locus assays
(including about 10 years of RNA surveys). But ENCODE has delivered the most
comprehensive view of this to date.

Q. [For
the other 50% of readers]: I still don’t buy this. I think the majority of
this is “biological noise”, for instance binding that doesn’t do anything.

A. I really hate the
phrase “biological noise” in this context. I would argue that “biologically
neutral” is the better term, expressing that there are totally reproducible,
cell-type-specific biochemical events that natural selection does not care
about. This is similar to the neutral theory of amino acid evolution, which
suggests that most amino acid changes are not selected either for or against. I
think the phrase “biological noise” is best used in the context of stochastic
variation inside a cell or system, which is sometimes exploited by the organism
in aspects of biology, e.g. signal
processing.

It’s useful to keep
these ideas separate. Both are due to stochastic processes (and at some level
everything is stochastic), but these biologically neutral elements are as
reproducible as the world’s most standard, developmentally regulated gene. Whichever
term you use, we can agree that some of these events are “neutral” and are not
relevant for evolution. This is consistent with what we’ve seen in the ENCODE
pilot and what colleagues such as Paul Flicek and Duncan Odom have seen in
elegant experiments directly tracking transcription factor binding across species.

Q. Ok, so why don’t we
use evolutionary arguments to define “functional”, regardless of what evolution
‘cares about’? Isn’t this 5% of the human genome?

A. Anything under
negative selection in the human population (i.e. recent human evolution) is
definitely functional. However, even with this stated criterion, it is very hard
to work out how many bases this is. The often-quoted “5%”, which comes from the
mouse genome paper, is actually the fitting of two Gaussians that look at the
distribution of conservation between human and mouse in 50bp windows. We’ve
been referring to 5% of those 50bp windows.

When you consider the
number of bases actually conserved, the figure must be lower than this, as we
don’t expect 100% of the bases in these 50bp windows to be conserved. However,
this is only about pan-mammalian constraint, and we are interested in all
constraint in the human genome, including the lineage-specific elements, so this estimate just
provides a floor to the numbers. The end result is that we don’t actually have
a tremendously good handle on the number of bases under selection in humans.

Some have tried other estimates
of negative selection, trying to get a handle on the more recent evolution. I
particularly like Gerton Lunter’s and Chris Ponting’s estimates (published in Genome Research), which give a window of
between 10% to 15% of the bases in the human genome being under selection –
though I note some people dispute their methodology.

By identifying those regions
likely to be under selection (because they have specific biochemical activity) in
an orthogonal, experimental manner, ENCODE substantially adds to this debate.
By identifying isolated, primate-specific insertions (where we can say with
confidence that the sequence is unique to primates), we could contrast the bases
inside ENCODE-identified regions with those outside. As ENCODE data covers the
genome, we now have enough statistical power to look at the derived allele
frequency (DAF) spectrum of SNPs in the human population. The SNPs inside
ENCODE regions show more very low frequency alleles than the SNPs outside (accurate
genome-wide frequencies due to the 1000 Genomes Project), which is a
characteristic sign of negative selection and is not influenced by confounders
such as mutation rate of the sequence (see Figure 1 of the main ENCODE paper).
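As a sketch of the DAF comparison just described (with invented frequencies, not 1000 Genomes data): under negative selection, deleterious derived alleles are kept rare, so SNPs inside functional regions should show a larger fraction of low-frequency derived alleles than SNPs outside.

```python
# Sketch of the derived-allele-frequency (DAF) comparison: SNPs inside
# annotated regions should show an excess of rare derived alleles if the
# regions are under negative selection. All frequencies are hypothetical.

def frac_rare(dafs, cutoff=0.05):
    """Fraction of SNPs with derived allele frequency below cutoff."""
    return sum(d < cutoff for d in dafs) / len(dafs)

inside = [0.01, 0.02, 0.04, 0.30, 0.01]   # hypothetical DAFs in annotated regions
outside = [0.10, 0.25, 0.40, 0.03, 0.55]  # hypothetical DAFs elsewhere

print(frac_rare(inside), frac_rare(outside))  # -> 0.8 0.2
```

The real analysis compares the full spectra genome-wide, where the excess of rare alleles inside the annotated regions is the signature of selection.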

We can do that across
all of ENCODE, or break it down by broad sub-classification. Across all sub-classifications
we see evidence of negative selection. Sadly, it is not trivial to estimate the
proportion of bases from derived allele frequency spectra that are under
selection, and the numbers are far more slippery than one might think. Over the
next decade there will, I think, be much important reconciliation work, looking
at both experimental and evolutionary/population aspects (bring on the million-person
sequencing dataset!).

Q. So
– we’re really talking about things under negative selection in human – is that
our final definition of “functional”?

A. If it is under negative selection in the human population, for me it
is definitely functional.

I, and other
people, do think we need to be open to the possibility of bases that definitely
affect phenotypes but are not under negative selection – both disease-related
phenotypes and other normal phenotypes. My colleague Paul Flicek uses the shape
of the nose as an example; quite possibly the different nose shapes are not
under selection – does that mean we’re not interested in this phenotype?

Regardless of
all that, we really do need a full, cast-iron set of bases under selection in humans
– this is a baseline set.

Q. Do
you really need ENCODE for this?

A. Yes. Imagine that THE
set of bases under selection in the human genome were dropped in your lap by
some passing deity. Wonderful! But you would still want to know the how and
why. ENCODE is the starting place to answer the biochemical “how”. And given that
passing deities are somewhat thin on the ground, we should probably go ahead
and figure out models of how things work so that we can establish this set of
bases. I am particularly excited about the effectiveness of using position–weight
matrices in the ENCODE analyses (my postdoc Mikhail Spivakov did a nice piece
of work here).
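To make the position-weight-matrix idea concrete, here is a minimal log-odds scan. The matrix values and the sequence are invented for illustration; real ENCODE matrices come out of motif analysis of the ChIP-seq data.

```python
import math

# Minimal position-weight-matrix (PWM) scan. Toy matrix and sequence.

BG = 0.25  # uniform background base frequency
# one dict per motif position: probability of each base (toy values)
PWM = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
       {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]

def score(site):
    """Log-odds score of a candidate site against the background model."""
    return sum(math.log2(col[b] / BG) for col, b in zip(PWM, site))

def best_hit(seq):
    """Highest-scoring window and its offset in the sequence."""
    w = len(PWM)
    return max((score(seq[i:i + w]), i) for i in range(len(seq) - w + 1))

print(best_hit("TTAGCAA"))  # the best window is "AGC", at offset 2
```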

Q. Ok,
fair enough. But aren’t you most comfortable with the 10% to 20% figure for the
hard-core functional bases? Why emphasize the 80% figure in the abstract and
press release?

A. (Sigh.) Indeed. Originally I pushed for using an “80% overall” figure and
a “20% conservative floor” figure, since the 20% was extrapolated from the
sampling. But putting two percentage-based numbers in the same breath/paragraph
is asking a lot of your listener/reader – they need to understand why there is
such a big difference between the two numbers, and that takes perhaps more
explaining than most people have the patience for. We had to decide on a
percentage, because that is easier to visualize, and we chose 80% because (a)
it is inclusive of all the ENCODE experiments (and we did not want to leave any
of the sub-projects out) and (b) 80% best conveys the difference between a
genome made mostly of dead wood and one that is alive with activity. We refer also
to “4 million switches”, and that represents the bound motifs and footprints.

We use the bigger number
because it brings home the impact of this work to a much wider audience. But we
are in fact using an accurate, well-defined figure when we say that 80% of the
genome has specific biological activity.

Q. I get really annoyed with papers like ENCODE
because it is all correlative. Why don’t people own up to that?

A. It is mainly
correlative, and we do own up to it. (We did do a number of specific experiments
in a more “testing” manner – in particular I like our mouse and fish
transgenics – but not for everything.) For example, from the main paper: “This
is an inherently observational study of correlation patterns, and is consistent
with a variety of mechanistic models with different causal links between the
chromatin, transcription factor and RNA assays. However, it does indicate that
there is enough information present at the promoter regions of genes to explain
most of the variation in RNA expression.”

Interestingly enough,
we had quite long debates about language/vocabulary. For example, when we built
quantitative models, to what extent were we allowed to use the word “predict”?
Both the model framework and the precise language used to describe the model
imply a sort of causality. Similarly, we describe our segmentation-based
results as finding states “enriched in enhancers”, rather than saying that we
are providing a definition of an enhancer. Words are powerful things.

Q. I am still skeptical. What new insights does
ENCODE offer, and are they really novel? Most of the time I think someone has
already seen something similar before.

A. I think that the
scale of ENCODE – in particular the diversity of factors and assays – is
impressive, and although correlative, this scale places some serious
constraints on models. For example, the high, quantitative correlation between
CAGE tags and histone marks at promoters limits the extent to which RNA
processing changes RNA levels. (This is measured by 5’ ends – n.b. if there is a
considerable amount of aborted transcription generating 5’ends, this need not
mean full transcripts, though this correlation is high both for nuclear
isolated 5’ends and cytoplasmic isolated 5’ ends.)

As for “someone has
discovered it already,” I agree that the vast majority of our insights and
models are consistent with at least one published study – often on a specific
locus, sometimes not in human. Indeed, given the 30 years of study into
transcription, I am very wary of
putting forward concepts that don’t have support from at least some individual
loci studies.

ENCODE has been
selecting/confirming hypotheses that hold broadly genome-wide, or across
multiple cell lines. ENCODE is a different beast from focused, mechanistic studies, which
often (and rightly) involve precise perturbation experiments. Both the broader
studies and the more focused studies help define phenomena such as
transcription and chromatin dynamics.

This is all in the
main paper, but then the network paper (led by Mike Snyder and Mark Gerstein)
on transcription factor co-binding, the open chromatin distribution paper (led
by Greg Crawford, Jason Lieb and John Stamatoyannopoulos), the DNaseI distribution
paper (led by John Stamatoyannopoulos), the RNA distribution and processing
paper (led by Roderic Guigo and Tom Gingeras) and the chromatin conformation paper
(led by Job Dekker) all provide non-obvious insights into how different
components interact. And that’s just the Nature papers – there are
another 30-odd papers to read. (We hope our new publishing innovation –
“threads” – will help you navigate easily to the parts of all these papers you
are most interested in reading.)

Q. You talk about how this will help medicine,
but I don’t see this being directly relevant.

A. ENCODE is a foundational data set – a layer on top of the human genome –
and its impact will be to make basic and applied research work faster and more
cheaply. Because of our systematic, genome-wide approach, we’ve been able to
deliver essential, high-quality reference material for smaller groups working
on all manner of diseases. And in particular the overlap with genome-wide
association studies (GWAS) has been a very informative analysis.

Q. Moving to the disease genetics: were you surprised
at this correlation with GWAS, given that the current GWAS catalog is about lead
association-study SNPs, and we don’t expect these to overlap with functional
data?

A. This was definitely
a surprise to us. When I first saw this result I thought there was something
wrong with some aspect of the analysis! The raw enrichment of GWAS-lead SNPs
compared to baseline SNPs (e.g. those from the 1000 Genomes Project) is very
striking, and yet if the GWAS-lead SNPs are expected to be tagging (but not
coincident) with a functional variant, you would expect little or no
enrichment.

We ended up with four
groups implementing different approaches here, and all of them found the same
two results. First, that the early SNP genotyping chips are quite biased
towards functional regions. By talking to some of the people involved in those
early designs (ca. 2003), I learned some of this is deliberate, for instance
favouring SNPs near promoters. But even if you model this in, the enrichment of
GWAS SNPs over a null set of matched SNPs is still there. This is similar to
that card in Monopoly: “Bank Error in your favour; please collect 10
Euros/Dollars/Pounds”. In this case, it is: “Design bias in your favour; you
will have more functional variants identified in the first screen than you
think”.

We think that around 10% to 15% of GWAS
“lead” loci are either the actual functional SNP in the condition studied or
within 200bp of the functional variant. This is all great, but we can now do
something really brilliant: break down
this overall enrichment by phenotypes (from GWAS) and by functional type, in
particular cell type (DNaseI) or transcription factor (TF). This matrix has a
number of significant enrichments of particular phenotypes compared with
factors or cell types. Some of these we understand well (e.g., Crohn’s disease and T-Helper cells); some of these
enrichments are perfectly credible (e.g., Crohn’s disease and GATA-factor
transcription factors); and some are a bit of a head-scratcher.
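For a single cell type and phenotype, the enrichment described above boils down to an overlap ratio. A toy version (all coordinates and SNP positions are hypothetical; a real analysis matches the null set for allele frequency, distance to genes and other confounders):

```python
# Toy version of the GWAS enrichment test: how much more often do lead
# SNPs fall inside regulatory regions than matched null SNPs?
# All positions and regions are hypothetical.

def in_any(pos, regions):
    """True if a position falls inside any (start, end) region."""
    return any(start <= pos < end for start, end in regions)

def overlap_rate(snps, regions):
    """Fraction of SNP positions overlapping the regions."""
    return sum(in_any(s, regions) for s in snps) / len(snps)

dnase_regions = [(100, 200), (500, 650)]   # hypothetical DNaseI sites
gwas_lead = [120, 150, 610, 900]           # hypothetical lead-SNP positions
matched_null = [30, 320, 410, 520]         # null SNPs matched for confounders

enrichment = (overlap_rate(gwas_lead, dnase_regions)
              / overlap_rate(matched_null, dnase_regions))
print(f"enrichment of lead SNPs over matched null: {enrichment:.1f}x")  # 3.0x
```

Repeating this for every (phenotype, cell type) or (phenotype, factor) pair gives the matrix of enrichments described in the text.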

But the great thing about our data is that
we didn’t have to choose a specific cell type to test or a particular disease.
By virtue of being able to map both diseases and cell-specific (or
transcription-factor-specific) elements to the genome, we can look across all
possibilities. This will improve as we get more transcription factors and as we
get better “fine mapping” of variants. This result for me alone is totally
exciting: it’s very disease-relevant,
and it leverages the unbiased, open, genome-wide nature of both ENCODE and GWAS
studies to point to new insights for disease.

Q. You make a fuss about these new publishing
aspects, such as “threads”. Should I be excited?

A. I hope so! The idea of threads is a novel attempt by us to help readers
get the most out of this body of coordinated scientific work. Say you are only
interested in a particular topic – say, enhancers – but you know that different
groups in ENCODE are likely to have mentioned this (in particular the technical
papers in Genome Research and Genome Biology). Previously you
would have had to skim the abstract or text of all 30 papers to try and work
out which ones were really relevant. Threads offer an alternative, lighting up
a path through the assembled papers, pointing out the figures and paragraphs
most relevant to any of 13 topics and taking you all the way through to the
original data. The threads are there to help you discover more about the
science we’ve done, and about the ENCODE data. Interestingly, this is something
that’s only achievable in the digital form, and for the first time I found
myself being far more interested in how the digital components work than in the
print components.

The idea of threads
came from the consortium, but the journal editors, in particular Magdalena
Skipper from Nature made it a reality – remember that in these threads we are
crossing publishing house boundaries. The resulting web site and iPad App, I
think, work very well. I am going to be interested to see how other scientists
react to this.

Q. And what about this Virtual Machine. Why is
this interesting?

A. An innovation in computing over the last decade has been the use of
virtualization, where the whole state of a computer can be saved to a file, and
transported to another “host” computer and then restarted. This has given us a
new opportunity to increase transparency for data intensive science.

Many people have noted that complex computational methods are very hard
to track in all their detail. We currently place a lot of trust in each other
as scientists that phrases such as “we then removed outliers” or “we normalised
using standard methods” are executed appropriately. The ENCODE virtual machine
provides all these complex details explicitly in a package that
will last at least as long as the open virtualization format we use (OVF,
VirtualBox). So if you are a computational biologist in three years’ time, and
you want to see the precise details of how we achieved something, you can run
the analysis codes yourself. The only caveat is that for the large,
compute-scale pipelines we provide an exemplar processing step and then the
results of running that step in parallel (i.e. the
pipelines themselves are not virtualised). Think of this a bit like the ultimate
materials and methods section of the paper. I believe this virtual
machine substantially increases the transparency of this data-intensive
science, and that we should produce virtual machines in the future for all
data-intensive papers.

Q. I’ve read
your Nature commentary about large projects, and admit that
I'm uneasy about how these large projects throw their weight around. Isn’t there more friction and angst than
you admit to?

A. There is indeed friction and angst, in
particular with the smaller groups (“hypothesis testing groups, or R01 groups”)
close to the scientific areas of ENCODE. I regret every instance of this and
have tried my best to make things work out. After a lot of experience,
I’ve realised a couple of things: Like any large beast, projects like ENCODE
can inadvertently cause headaches for smaller groups. Part of this is actually
due to third parties, for example reviewers of papers or grants who mistakenly
think that the large datasets in ENCODE somehow replace or make redundant more
focused studies. This is rarely the case – what the large project provides is a
baseline dataset that is useful mainly for people who don’t have the time or
inclination to do such a study and, importantly, who would not find it
practical to do this work systematically (i.e.
cutting to established, promising focus areas). ENCODE’s target audience
is someone who needs this systematic approach, for example clinical researchers
who might scan their (putatively) causative alleles or somatic variants against
such a catalog. ENCODE does not replace the targeted perturbation experiment,
which illuminates some aspect of chromatin or transcriptional mechanism
(sometimes in a particular disease context). However, people less involved in
this work can make the mistake of lumping together the mechanistic study and
the catalog building as “doing ChIP-seq”, and assume they are redundant. As
scientists in this area, both large and small groups need to regularly point
out their explicit and non-overlapping complementarity.

Also, compared to some
other scientific fields, genomics has a remarkably positive track record in
data sharing and communication. We can do far better (more below), but everyone
should be mindful that for all our faults, we do share datasets completely and
openly, we nearly always share resources and techniques and we do communicate. Non-genomicists
would be surprised sometimes at the depths of distrust in other
fields. That said, there is always room for improvement. Although we did
use pre-publication raw data sharing in ENCODE, we should have spent more time
and effort sharing intermediate datasets (in addition to raw datasets). The
1000 Genomes Project provides an excellent example to follow.

Finally, I believe that
the etiquette-based system of how to handle pre-publication data release (and I
was a prominent participant in this discussion) is clumsy and outmoded:
designed for a world where data generation – not analysis – is the bottleneck.
I believe we need a new scheme. I'm
not rushing to state my own opinion here - we need to have a deliberative
process that balances getting broad buy-in and ideas with a timely and
practical result.

Q. So ENCODE is all
done now, right?

A. Nope! ENCODE “only” did 147 cell types and 119
transcription factors, and we need to have a baseline understanding of every
cell type and transcription factor. Thankfully, NHGRI has approved the idea of
pushing for this – not an unambitious task – over the next 5 years. I see there
being three phases of ENCODE: the ENCODE Pilot (1% of the genome); the ENCODE
scale-up (or production), where we showed that we can work at this scale and
analyse the data sensibly; and next the ENCODE phase “build-out” to all cell
types and factors.

Q. So you get to do
this for another five years?

A. Someone does. I have hung up my ENCODE
“cat-herder-in-chief” hat, and moved onto new things, like the equally
challenging world of delivering a pan-European bioinformatics infrastructure
(ELIXIR). But that’s for another blog post!

Q. Be honest. Will
you miss it?

A. Looking back on my
ten years with ENCODE, you know, I really am going to miss this. (Okay, maybe I
won't miss three-hour teleconferences running to 2am...). It has been hard work
and excellent science – I’ve met and interacted with so many great scientists
and have honestly had a lot of fun.

43 comments:

I also dislike the term "biological noise" to mean elements (or interactions) that are biochemically true and reproducible but unlikely to have an impact on fitness if removed. Biologically neutral might be a good way to frame it. Unfortunately, not even lack of conservation can tell you if an element is important for fitness, since you can also have compensatory changes. I don't think it will be easy to get estimates of the fraction of elements (or interactions) that are biologically neutral.

Hi -- thanks for the post. But, basically, you seem to be saying that you chose the 80% number for the hype, even though the best value is more like 10-20%?

What bugs me, I guess, is the constant bogus narrative -- which I just heard repeated on NPR 5 minutes ago -- along the lines of "scientists of yore were dumb and thought most of the genome was junk, but these new scientists are smart and now know that most of it isn't." I think that's just wrong, don't you? There were good reasons decades ago to think most of the genome didn't/couldn't have much of what people would reasonably call function, and those reasons still exist today. Basically, a ton of it looks like byproducts of viruses and the like, a ton of it isn't under selection, and we know, and have known for decades, that genome sizes vary widely among eukaryotes, without any clear pattern regarding organism complexity. Critters can and do dispense with much of their genomes without much impact at all. Sister species can vary in genome size by 50% or more. And many species have genomes many times larger than the human genome. Like onions. There's no particular reason to think the human genome is any different.

Have you heard of the onion test?

http://www.genomicron.evolverzone.com/2007/04/onion-test/

What's your answer? Why should anyone who knows about the above facts be happy that the public is being told that 80% of the genome is known to be functional, and the scientists who "thought" otherwise were dumb and old-fashioned?

I cross-posted a link to this post and the main Nature article on reddit yesterday. The post made it as high as the second spot on the r/science subreddit, although it was unable to overcome another post on the supposed increased incidence of glioma in mobile phone users. :( Still, I feel happy to have at least helped to spread the news and make the science available to a larger audience. If only you guys had included some videos with cats...

Wow. Congratulations on this amazing achievement! There are so many interesting and important aspects to the ENCODE project that I hardly know where to start. Actually, given that I left the field of bioinformatics 10 years ago I am inclined to focus on the bits that have implications for the practice of science in general, namely the use of 'threads' for navigating the literature and the use of virtualisation to distribute analyses (questions 17 & 18 if you'd numbered them). Lots to think about, and interesting times ahead! Thanks!

I like the idea of "threads" as a way of navigating through a set of papers, and unpacked the iPad app to see how they'd been put together, see http://iphylo.blogspot.co.uk/2012/09/decoding-nature-encode-ipad-app-omg-it.html. It helps that the papers were either published by Nature or open access; intellectual property issues will be a stumbling block if we want to extend the concept across a broader range of journals. But there's a lot of potential for people to create their own threads and share them (rather than bookmark a collection of papers, we could bookmark a collection of fragments).

I'll have you know I'm currently arguing with an anti-evolutionist who has taken the 80% to mean evolution is impossible. Because any mutations in that 80% would affect the function of that organism, meaning there are almost no neutral mutations.

While I think the work will be very useful in the years to come, the 80% comment is going to be a nightmare.

The 80% comment is a nightmare for those who think in terms of "random" models, or "random mutations." It's less problematic for those who think in terms of the epigenetic effects on intracellular signaling and stochastic gene expression required for adaptive evolution via ecological, social, neurogenic, and socio-cognitive niche construction, which obviously required nutrient chemicals (food) for individual survival and their metabolism to pheromones that control reproduction and thus species survival (in microbes to man).

If, for example, the epigenetic effect of a nutrient on stochastic gene expression is controlled by the epigenetic effect of a species specific pheromone, you have a controlled network of gene interactions which could include the epigenetic tweaking of a complex system-wide 80% due to the amount of "code" required for organisms to adaptively evolve.

Adaptive evolution, in this case, does not randomly occur. Can we not therefore expect that the amount of code would reflect the epigenetic effects of nutrient chemicals and pheromones in other species that led to the advent of man?

"Adaptive evolution, in this case, does not randomly occur. Can we not therefore expect that the amount of code would reflect the epigenetic effects of nutrient chemicals and pheromones in other species that led to the advent of man?

Or is there a random model for that?"

I'm glad you're on the 80% functional side. Please explain why various onions, ferns, salamanders, etc., need 5-80 times as much of this epigenetic stuff as humans do (their genomes are 5-80 times bigger than the human genome, depending on the species), while cheetahs, hummingbirds, pufferfish, Drosophila, etc., can get by with much less genome (and much less "epigenetics" or whatever) than humans have, sometimes 10% as much.

About Ewan Birney

Dr Birney is Joint Associate Director of EMBL-EBI and runs a small research group. Together with Dr Rolf Apweiler, he has strategic responsibility and oversight for bioinformatics services at EMBL-EBI.
Dr Birney played a vital role in annotating the genome sequences of the human, mouse, chicken and several other organisms; this work has had a profound impact on our understanding of genomic biology. He led the analysis group for the ENCODE project, which is defining functional elements in the human genome. Ewan’s main areas of research include functional genomics, assembly algorithms, statistical methods to analyse genomic information (in particular information associated with individual differences) and compression of sequence information.
Dr Birney completed his PhD at the Wellcome Trust Sanger Institute with Richard Durbin, and worked in the laboratories of leading scientists Adrian Krainer, Toby Gibson and Iain Campbell. He has received a number of prestigious awards, including the 2003 Francis Crick Award from the Royal Society, the 2005 Overton Prize from the International Society for Computational Biology and the 2005 Benjamin Franklin Award for contributions to Open Source Bioinformatics. Ewan was elected a Fellow of the Royal Society in 2014.
ORCID iD: 0000-0001-8314-8497