Science! (and other good things)

This article popped up in one of my feeds in the past week, and it’s a quick, fascinating take on the octopus as our best on-earth example of an alien intelligence. It makes a compelling case for why octopi, from whom we diverged a long time ago, display signs of consciousness and (thus) are good alien analogs.

Pretty neat both from the “Hey, octopi are cool” perspective and if you’re an SF author interested in writing plausible alien intelligence.

Perhaps unsurprisingly, one of my interests is functional, accurate protein annotations. The default way to annotate new sequences, especially in a high-throughput manner, is to search by sequence identity with some form of BLAST and use the best hit to annotate your sequence of interest.
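If you’ve never seen it in action, the best-hit approach really is that simple. Here’s a toy sketch in Python; the field layout, IDs, and identity cutoff are my own inventions for illustration, not any particular pipeline:

```python
# Toy sketch of naive "best BLAST hit" annotation transfer. The hit tuples
# mimic BLAST tabular output, but the IDs, scores, and cutoff are made up.

def annotate_by_best_hit(hits, min_identity=40.0):
    """Pick the highest-identity hit above a cutoff and copy its annotation.

    `hits` is a list of (subject_id, percent_identity, annotation) tuples
    for one query sequence. Returns (annotation, source_id), or a
    "hypothetical protein" placeholder if nothing clears the cutoff.
    """
    usable = [h for h in hits if h[1] >= min_identity]
    if not usable:
        return ("hypothetical protein", None)
    best = max(usable, key=lambda h: h[1])
    return (best[2], best[0])

hits = [
    ("P12345", 92.4, "malate dehydrogenase"),
    ("Q99999", 61.0, "lactate dehydrogenase"),
    ("P00001", 35.2, "uncharacterized protein"),
]
print(annotate_by_best_hit(hits))  # ('malate dehydrogenase', 'P12345')
```

The whole orphan enzyme problem lives in that `max()` call: the “best” hit is only as good as the library of sequences you’re searching against.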

Things have changed quite a bit in the last decade and especially the last five years in scientific publishing. Starting with open access efforts (notably the PLOS journals, which launched when I was in grad school and have driven a revolution in publishing) and continuing in venues like Retraction Watch and PubPeer, the move toward making science open, clear, honest, and above all accurate is powerful and completely unlike when I started doing research.

Retraction Watch put up a guest post today by Drummond Rennie and C.K. Gunsalus titled If you think it’s rude to ask to look at your co-author’s data, you’re not doing science. In it, they talk about how a couple of recent (and slightly less recent) high-profile scientific frauds unraveled, and how the senior authors failed to do their part to validate the data and really know what was going on. They also provide a truly helpful breakdown of approaches to make sure everyone knows what work is being done, that the work is actually being done (and funded, and so forth), and that everyone is credited properly.

If you’re doing, well, science, or any kind of collaboration at all, I recommend this genuinely helpful piece.

In our work on orphan enzymes, we’ve consistently seen a “rich get richer” effect. Research tends to accumulate on those proteins that already have assigned sequences. This is a systematic issue, since annotation based on sequence similarity probably means that we’re often assuming that a newly identified gene performs the same function as a known protein…when in reality, it may actually be a highly similar orphan enzyme for which we lack sequence data.

We saw this occasionally in the generally awesome BRENDA enzyme database. A curator had assigned a sequence to an orphan enzyme when that sequence was actually for a highly similar enzyme that did not catalyze the orphan enzyme activity. This kind of over-assignment likely prevents further research on the orphan enzyme and tends to focus more research on the enzyme for which we had sequence data in the first place.

Cracking the brain’s “ignorome”

Pandey and colleagues surveyed those genes that show “intense and highly selective” expression in the brain (ISE genes, for short) and asked “How well-characterized are they?” After all, one of the promises of modern high-throughput methods is that we can look at features such as tissue-specific expression and use that as a guide for which genes to devote our research attention to.

What they found is that despite our knowing that these genes are all intensely and selectively expressed in the brain, research about them has been tremendously lopsided.

I’ll quote them on just how off-kilter the research distribution is:

The number of publications for these 650 ISE genes is highly skewed (Figure 1). The top 5% account for ~68% of the relevant literature whereas the bottom 50% of genes account for only 1% of the literature.

Here’s Figure 1:

So that shows us that despite somehow being specific and important to the brain, many of these genes remain understudied.
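Just to make that skew concrete, here’s the top-5% share calculation in Python, run on made-up publication counts rather than the authors’ actual data:

```python
# Toy illustration of literature skew: what fraction of all publications
# belongs to the most-studied 5% of genes? (Counts below are invented.)

def top_share(counts, frac=0.05):
    counts = sorted(counts, reverse=True)
    k = max(1, int(len(counts) * frac))  # size of the top slice
    return sum(counts[:k]) / sum(counts)

# 100 toy genes: a handful heavily studied, a long tail barely touched.
pubs = [500, 400, 300, 250, 200] + [10] * 20 + [1] * 75
print(f"top 5% hold {top_share(pubs):.0%} of publications")
```

Even this crude toy distribution lands in the same ballpark as the paper’s real one: a few genes soak up nearly all the attention.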

Why?

What makes the ignorome different?

The short answer is “age.”

Much like the “rich get richer” phenomenon I talked about for orphan enzymes, there is (unsurprisingly) a correlation between when a gene was first characterized and how much research there has been on it. Nothing else really differs between the genes that are understudied and those that have been the focus of significant study.

That brings up the natural corollary question of “Okay, so are we figuring out what the other genes do?”

The answer here seems to be that we were for a while, but now the rate of advancing discovery is flattening out. I’ll quote the authors here as well:

While the average rate of decrease was rapid between 1991 and 2000 (−25 genes/year), the rate has been lethargic over the past five years (−6.4 genes/yr, Figure 5). This trend is surprising given the sharp increase in the rate of addition to the neuroscience literature. As a result, the number of neuroscience articles associated with the elimination of a single ignorome gene has gone up by a factor of three between 1991 and 2012 (Figure 5). The rate at which the ignorome is shrinking is approaching an asymptote, and without focused effort to functionally annotate the ignorome, it will likely make up 40–50 functionally important genes for more than a decade.

So what do we do about it?

One of the core reasons for “rich get richer” effects is that known genes (or proteins) simply have more “handles” you can work with. If your expression analysis tells you that 20 genes are significantly enriched in your test condition and you can find some functional characterization for 10 of them, it’s only natural to focus on those 10 first.

…and given how time and work tend to play out, “first” can quickly become “only.” Given how daunting a completely uncharacterized gene can be, who would fault researchers for spending the majority of their effort on those genes that have some functional characterizations (or predictions) available for them? That certainly fits the whole 80/20 rule idea of focusing most of your effort where you’ll have the most gain.

Pandey and colleagues attempt to address this by making more handles. They show how we can leverage high-throughput and large-scale phenotype databases to generate additional functional characterization for at least some of the ignorome genes without significant additional effort. Now, instead of flying relatively blind, a researcher can have both sequence-similarity-based predictions of function and some best guesses at phenotype associations for these genes.

I really like this kind of leveraging of existing data to make avenues of research more accessible and thus more likely. This kind of thing is going to be very important in tackling those dark areas of unknown function that exist all over biology.

So what’s the big deal? What are orphan enzymes and why do we need to identify them?

Sequences are card catalog numbers for everything

In modern biology, protein and nucleotide sequence data are the glue that holds everything together. When we sequence a new genome, for example, we make a “best guess” for what each gene does by comparing its sequence to a vast collection of sequences we already have. Essentially, that lets us go from a bare amino acid sequence to a best guess at what the protein does.

We’re missing a lot of sequences

As part of our Orphan Enzymes Project, we’ve tried to figure out how we can find sequences for these hundreds of enzymes.

After all, each enzyme represents hundreds of thousands of dollars in lost research…and each enzyme sequence we don’t have undercuts the value of all of our fantastic sequence-based tools.

We can rapidly identify a lot of orphan enzymes

Our new paper describes a few case studies on how we can identify orphan enzymes in the lab and just how big an impact identifying sequence for each orphan enzyme has.

We found several cases where we were actually able to buy samples of enzymes that had never been sequenced. We were also able to collaborate with Charles Waechter and Jeffrey Rush of the University of Kentucky to find sequence data for an enzyme they’d been working hard to characterize.

The key point of this part of our work is that many enzymes that are “tricky” for one set of researchers to sequence may be entirely doable for another group that specializes in sequencing. The more we collaborate, the more value we get out of all of our work.

Identifying orphan enzymes has a big impact

The second part of our work asks the simple question, “Does it matter?”

For each enzyme for which we found sequence data, we asked “How many enzymes should we now re-annotate?”

In other words, of all the guesses that have been made about what proteins do, for how many is our enzyme now the best guess, based on how close each protein’s sequence is to the one we found?
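As a toy sketch of that re-annotation question in Python (the scores and protein IDs here are hypothetical, not our actual data or pipeline):

```python
# Which proteins should be re-annotated because the newly sequenced enzyme
# is now a better match than their previous best hit? All values invented.

def reannotation_candidates(current_best, vs_new_enzyme):
    """`current_best` maps protein id -> similarity of its old best hit;
    `vs_new_enzyme` maps protein id -> similarity to the new sequence.
    Returns the proteins whose best guess should now be the new enzyme."""
    return [p for p, old in current_best.items()
            if vs_new_enzyme.get(p, 0) > old]

current_best = {"prot1": 55.0, "prot2": 80.0, "prot3": 40.0}
vs_new_enzyme = {"prot1": 72.0, "prot2": 61.0, "prot3": 88.0}
print(reannotation_candidates(current_best, vs_new_enzyme))  # ['prot1', 'prot3']
```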

It turns out that each enzyme sequence we identified led to anywhere from 130 to 430 proteins getting new, better guesses about their functions.

That’s hundreds of potential incorrect predictions or misled researchers averted by just “finishing the job” of sequencing a handful of enzymes.

Given the tremendous amount of work that has gone into characterizing each of these enzymes, it’s essential that we take every opportunity to apply modern sequencing expertise to existing samples.

Randy Schekman, James Rothman, and Thomas Südhof were just awarded the 2013 Nobel Prize in Physiology or Medicine for their discovery of how the cell transports materials around. Theirs were fundamental discoveries that play into every aspect of cell biology, including fun things like “How neurotransmitters get into and out of nerves.”

Randy taught part of my undergraduate molecular and cell bio course way back in the late 90s (one of the other instructors was Nicholas Cozzarelli – I was blessed with some excellent teachers).

My grad school lab (the Hampton lab) had a close working relationship with Schekman’s group, as his ongoing discoveries about how molecules and materials move within the cell directly tied into our own research on protein degradation (gotta move all those protein parts somehow…).

Given the nature of Randy’s discoveries and the quality of his work, we all thought it was only a matter of time before he was awarded the Nobel. It’s awesome to see it finally happen.

Last year I found myself needing to visualize growth of a relatively thin lawn of E. coli on imperfectly translucent minimal medium plates. It was part of testing growth based on our predictions in the recently published Computing minimal nutrient sets from metabolic networks via linear constraint solving (I’ll have a separate post about that soon). Trying to get the lawns to stand out via backlighting or dark backgrounds didn’t do the trick, but staining the cells finally gave me the lovely picture you see above.

1) Pour the staining solution onto the plate so that it just covers the surface of the agar.
2) Let stand for 45 seconds.
3) Pour off the solution.
4) Pour on the rinse solution.
5) Swirl it for 10 seconds, then let stand for another 50 seconds.
6) Pour off the rinse solution.

One of the most unnerving aspects of biological research is the possibility that your samples aren’t what you think they are. Most of my lab work has involved yeast (cerevisiae) and a smattering of types of bacteria (largely coli and some cyanobacteria). In general, we didn’t maintain the cells by passaging them, and there were some obvious antibiotic and auxotrophic (nutrition-based) markers we could use to tell that the cells were basically what we thought they were.

But as I’ve learned since entering the exciting world of genome analyses, there is just a ton of variation between the “same” organism and strain in different labs…or in the same lab at different times. The “default” E. coli strain, K-12 MG1655, has a neat little mutation in amino acid biosynthesis that easily reverts to wild type, which plays all sorts of havoc with computational models that assume that it’s nonfunctional.

I’m quite interested in how we can account for these kinds of differences and make modeling and predictive tools that are resilient to them.

Except it turns out that they’re not bladder cancer cells. As Jager and his colleagues discovered, basically all the KU7 cell lines in the world are actually a completely different kind of cell (the most common cancer cell line in the world, HeLa). This apparently started with cross-contamination back at the source.

So what does this mean for studies based on those cells? Presumably we’d want to have a way to mass-tag those publications and all the databases or other informatics resources derived from them with the true identity of the cells used. Is this reasonably achievable, and is there a good way to track areas where the ideas or conclusions drawn from experiments using these misidentified cells ended up?

I’m not especially familiar with the bioinformatics and quantitative bio of cancer biology, so I don’t know how much impact this specific discovery has on large-scale data resources those fields rely on. Presumably this kind of thing is going to keep happening – we’ve certainly seen it in the misidentification and renaming of microbial samples from which enzyme and other metabolic data were derived. It would be handy to have consistent mechanisms in place to add additional metadata to publications so that this kind of “switch” can be tracked and propagated into downstream resources.

There’s more discussion of this discovery and its consequences for publications that used the misidentified cells over at Retraction Watch.

Gene and protein sequences really are the basic blueprints of life. We’re now living in a time where you can get that full blueprint for a bacterium in well under an hour – and you’ll spend most of that time loading your sample into the sequencing machine and reading the output. We interpret that full blueprint by comparing it to all the individual sequences we already know.

“What does this gene do?”

“I don’t know. Let’s check our library of genes and see if there’s something like it.”

As a result, protein and gene sequences are not only blueprints, but effectively addresses or library card catalog numbers. They tie the genetic information we’re looking at right now to all the research that has gone before. Without a sequence address, we can’t connect our past knowledge with the sequence we’re staring at right now.

So it’s been a real problem that we don’t have that sequence information for up to a third or so of all the enzymes we know. The Orphan Enzymes Project is an effort I’ve been leading for a few years now that aims to tackle that problem and connect modern sequencing efforts to the research community’s “back catalog” of amazing research.

The new site was put together by my talented collaborator Christine Rhee, who is also working on a vision for a true community effort to resolve the orphan enzyme problem.

Quite a bit of the work in our group is about cutting the scut work out of experimental biology. My colleagues typically say that we’re “accelerating” research, but I usually couch it in terms of “cutting waste.” They’re two sides of the same coin, but I think it makes a lot more sense to consider bioinformatics tools in terms of how they impact the time and effort we spend doing things that we don’t want to be doing.

Consider the example of owning a car. What does that car do for you?

Sure, it accelerates your day – after all, you’re going everywhere faster than you would have without it. But the real tragedy happens when your car breaks down. Suddenly, that 20 minute drive to work is 40 minutes on a bus or a laughably impossible day hike. In other words, it’s about the cost that you’re cutting out by using the tool – in this case, the car.

Sprint, stumble, walk, then run again

Genome sequencing is ridiculously fast. Just in terms of pure sequencing power, it takes maybe 1% of a single run from a state-of-the-art sequencing setup to plow through a bacterial genome. That’s not to say that sequencing a new genome is trivial – it’s not. But the chemistry and machinery behind sequencing have both advanced so far that the simple ability to churn through genomes is no longer the bottleneck.

Stumbling

Then things slow down a bit.

The first slowdown comes during the assembly and error checking steps…which I thankfully never, ever touch. I have a passing familiarity with the problems that pop up during assembly. For example, read lengths are a major issue. The “read length” is how long a continuous stretch of genome sequence you get in one go.

As you can imagine, you’d like read lengths to be long, since that means you have more ability to stitch together a finished genome by overlapping the pieces you’ve sequenced (since, you know, they’re more likely to overlap if they’re longer).
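To make the overlap idea concrete, here’s a toy greedy assembler in Python. Real assemblers are enormously more sophisticated, but this shows why longer reads (and thus longer overlaps) make the stitching easier:

```python
# A toy greedy overlap-merge assembler. Real assembly handles errors,
# repeats, and coverage; this just demonstrates the core stitching idea.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)  # candidate anchor for an overlap
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no overlaps left; remaining contigs stay separate
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACC", "ACCTTAG"]))  # ['ATGGCGTACCTTAG']
```

Notice that with shorter reads the overlaps shrink toward nothing and the `olen == 0` branch kicks in, leaving you with disconnected fragments instead of one finished sequence.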

Fortunately, that’s not my complication to deal with.

Walking

Once we get out of the genome assembly woods, then we have the annotation step…which I’m also not going to talk about here, but which a lot of our work does directly support. Annotation is the slow, messy, and often inaccurate process by which we guess what the genes in our newly sequenced organism actually do.

The quality of a genome’s annotation still relates pretty directly to the amount of hands-on human time put into it. There are a bunch of solid predictive tools that try to fill in the gaps, but there’s still a tremendous amount of room for error. Right now, we’re stuck with the choice between “good” and “fast,” and that means that annotation is, at best, a walking step.

Back up into a run

One of our group’s major products is the Pathway Tools software package. The Pathway Tools package is in many ways the next comprehensive step after that initial annotation. It takes your shiny, new annotated genome and converts it into a database representing that organism’s genes, proteins, transporters, and metabolic network. You can read more about this process elsewhere. The concise version is that the software takes the genome’s annotation, then uses that to guess which metabolic pathways, transporters, and protein complexes the organism has…and then uses those pathways it guessed to figure out additional enzymes that are likely to be present in your organism.

I like to call that post-pathways second pass “second stage” annotation.
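Here’s a toy Python sketch of that second-stage idea (my own simplification, not the actual Pathway Tools algorithm): call a pathway present when enough of its reactions already have annotated enzymes, then flag the pathway’s remaining reactions as enzymes worth hunting for:

```python
# Toy "second stage" annotation: infer pathway presence from partial
# enzyme evidence, then predict the missing enzymes. The pathways,
# reaction IDs, and evidence threshold are all invented for illustration.

def second_stage(pathways, annotated_rxns, min_evidence=0.5):
    predicted = set()
    for name, rxns in pathways.items():
        have = rxns & annotated_rxns
        if len(have) / len(rxns) >= min_evidence:
            # Pathway looks present, so its unannotated reactions are
            # likely carried out by enzymes we just haven't identified.
            predicted |= rxns - annotated_rxns
    return predicted

pathways = {
    "glycolysis-like": {"R1", "R2", "R3", "R4"},
    "exotic-plant-pathway": {"R8", "R9"},
}
annotated = {"R1", "R2", "R3"}
print(second_stage(pathways, annotated))  # {'R4'}
```

Three of four glycolysis-like reactions have enzymes, so the toy logic predicts the fourth; the pathway with no evidence contributes nothing.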

These databases are super convenient in terms of getting to check out the organism’s biology. Here’s an overview of the metabolism of Staphylococcus aureus, strain MRSA252, an antibiotic-resistant strain found in hospitals in the United Kingdom:

That’s a visualization of the metabolic network of that drug-resistant bug as inferred and predicted by the Pathway Tools software. There are all sorts of handy tricks you can do with this kind of visualization-rich database, but for now, the important part is that this is a fully computable database.

In other words, you can use it to model the organism’s metabolic biology and ask handy questions like “Does this bug grow on glucose?”

Some easy steps from genome to model

So, after we’ve done all that work in building a database, we hit another slow, slow step…slower than that post-stumble walk I was just talking about above.

That’s the part where we actually do stuff with the organism in the lab. And then we have to troubleshoot over and over again for each new thing we want to do. Maybe we’re trying to get our microbe to crank out a bunch of our favorite small molecule – let’s say pinene, which you may have correctly guessed is responsible for pine scent. What should we grow our pine-scented microbe on? Will it make more pinene on sucrose instead of glucose? Are there some genes we should knock out to help raise our pinene yield?

Geez. Sounds like about a million experiments.

We can try to cut out some of the massive time and effort involved here using our expert knowledge, but each experiment is a ton of time and effort, even in a really friendly scientific model organism like E. coli.

We’d rather not do that.

MetaFlux to bridge genomes and lab work

So, this paper just came out from some of the very clever folks I work with:

In it, they describe some recent work on taking that first step that Pathway Tools handles, going from genome to model, and adding in the next step that we’d really want to have, where we go from model to what I’ll call a “functional model.”

Specifically, the new tool, called “MetaFlux,” uses mixed integer linear programming to make what’s called a “flux balance analysis” model from the database that Pathway Tools built from your genome.

Flux balance analysis, traditionally shortened to FBA, is a modeling method that approaches the organism’s metabolism as a steady-state system.

You’ll note right up front that an organism’s metabolism is never truly a steady-state system. This is what we call a “simplifying assumption.” Even though it’s kind of vigorously wrong, it does a pretty good job of modeling many metabolic situations.
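To make the steady-state assumption concrete, here’s a minimal flux balance calculation in Python with SciPy. The three-reaction network is my own toy construction, not anything from the paper:

```python
# Minimal flux balance analysis on a toy network:
#   v1: uptake -> A     v2: A -> B     v3: B -> biomass
# "Steady state" is the constraint S @ v = 0 for internal metabolites A, B:
# every unit produced is consumed, so nothing accumulates.
import numpy as np
from scipy.optimize import linprog

S = np.array([
    [1, -1,  0],   # metabolite A: made by v1, used by v2
    [0,  1, -1],   # metabolite B: made by v2, used by v3
])
c = [0, 0, -1]                             # maximize v3 (linprog minimizes)
bounds = [(0, 10), (0, None), (0, None)]   # uptake capped at 10 units

res = linprog(c, A_eq=S, b_eq=[0, 0], bounds=bounds)
print(res.x)  # biomass flux v3 climbs to the uptake cap of 10
```

The steady-state constraint forces v1 = v2 = v3, so the only thing limiting “growth” here is the uptake bound. Real FBA models just scale this up to thousands of reactions.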

Making a hideously slow step a whole lot faster

Folks in the field have been doing flux balance models for years…very, very slowly. It turns out that the traditional approach to FBA generation involves making the world’s biggest spreadsheet containing all of the known reactions in the organism’s metabolism. The next steps look like this:

That “figure out where it’s broken” step is very labor intensive. It’s also a good thing, since it catches weird holes in your model.

For example, consider the pathway in E. coli that breaks down cyanate.

I introduced the idea of metabolic modeling by talking about how we’re going to go from genes to enzymes and then to chemical networks in the body. Sometimes, however, the reactions just happen, like the breakdown of carbamate that gives us ammonia and carbon dioxide. In these cases, we might actually miss the reaction when we’re making our database, if it isn’t already included in a metabolic pathway in our MetaCyc database.

That’s actually one reason to have those pathways – to catch those reactions we can’t predict directly from the organism’s genes.

If you didn’t have the carbamate breakdown reaction, it would leave a gap in your model metabolic network that would “break” if you tried to model growth that depended on cyanate.

Imagine finding gaps and breaks like this over and over again, and you have a good feel for the tediousness of troubleshooting a flux balance model.

This is where MetaFlux steps in. Starting with a basic metabolic network model and a set of inputs (stuff you’re feeding that organism) and outputs (stuff it has to make to live), MetaFlux tries to see if it can make a working flux balance model…and if it can’t, it tries adding in the missing reactions until the model works.

It’s about weighing the costs

The basic principle behind what MetaFlux is trying to do is that basically all of our annotated genomes are going to have gaps – that is, areas where our predicted take on the organism’s metabolic network is going to be incomplete. This is just a given – we can’t even figure out any reasonable guess for what up to half the genes in any new genome do, so it would be surprising if we were able to build a perfect, working metabolism on the first try.

So, when we first build our metabolic network, it’s gonna have gaps. We want to fill those gaps, so we’re going to plug in reactions derived from our MetaCyc database, which contains over 10,000 distinct metabolic reactions culled from a wide range of organisms.

Now, it would be trivial to just chain together a bunch of reactions to patch a hole in our current metabolic model. Of course, that would be a lot like having a map that shows how to get from your home to the grocery store and “fixing” a missing city block from the route by replacing it with a journey back and forth across the entire continent.

With that in mind, we start introducing “costs” that help inform MetaFlux what direction we want our replacements to go. For example…

…if we were engineering a new pathway into E. coli and wanted to add as few new genes as possible, we could assign a high cost (expressed as an actual number) to the act of “adding a new reaction.”

…if we wanted to make sure that any new reactions didn’t undercut growth, we could assign a high value to growth, effectively making it costly to move away from robust growth.

…if we wanted to avoid predicting “plant” reactions for bacteria, we could assign a high cost to adding reactions that are not known to occur in our type of organism. This will mean that MetaFlux will only add these evolutionarily distant reactions if they are critical to making the model work.

MetaFlux takes all of these costs – that we’ve defined based on our goals – and uses them in calculating a working metabolic network that can yield a flux balance model.
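Here’s a toy Python sketch of cost-weighted gap filling (a greedy simplification of my own, not MetaFlux’s actual mixed integer linear program), using the cyanate/carbamate example from earlier:

```python
# Toy cost-weighted gap filling: add the cheapest candidate reactions from
# a database until the network can make a required output. Reaction names
# and costs are invented; real gap filling solves this as an optimization.

def producible(mets, reactions):
    """Expand the set of reachable metabolites until a fixed point."""
    mets = set(mets)
    changed = True
    while changed:
        changed = False
        for ins, outs in reactions:
            if ins <= mets and not outs <= mets:
                mets |= outs
                changed = True
    return mets

def gap_fill(inputs, target, model_rxns, candidates):
    """Greedily add cheapest candidates until `target` becomes producible."""
    rxns, added = list(model_rxns), []
    for name, ins, outs, cost in sorted(candidates, key=lambda c: c[3]):
        if target in producible(inputs, rxns):
            break  # network already works; stop adding
        rxns.append((ins, outs))
        added.append((name, cost))
    return added

model = [({"cyanate", "bicarbonate"}, {"carbamate"})]  # cyanase step known
candidates = [
    ("carbamate-breakdown", {"carbamate"}, {"ammonia", "co2"}, 1.0),
    ("plant-only-detour",   {"cyanate"},   {"ammonia"},        9.0),
]
print(gap_fill({"cyanate", "bicarbonate"}, "ammonia", model, candidates))
```

Because the “plant” detour carries a high cost, the filler patches the hole with the cheap carbamate breakdown reaction and never touches the evolutionarily implausible one, which is exactly the behavior those cost assignments are meant to encourage.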

MetaFlux is also set up to give you a “next best” answer when it simply can’t find a working metabolic network using all the rules you’ve given it. In those cases, it effectively tells you, “Well, you can’t have what you want, but what if you could make all but one of the products you wanted?”

The eventual goal – genome, model, go!

The end goal of all of this is to have a system where you can plug in an annotated genome, ask Pathway Tools to build a model organism database from it, and then ask MetaFlux to make a working flux balance model from that, so that you can now make predictions and model situations without months and months of scrolling through big spreadsheets of reactions looking for the missing and broken parts.

Right now, the chain from genome through to working model is not ready to go right out of the box, but it’s a pretty sweet set of tools if you have someone with a decent bit of technical savvy.