Thursday, August 22, 2013

Instantiating generic BioCyc (MetaCyc) reactions

Most databases of metabolic reactions include reactions that are generic. That is, some of the reactants and/or products are not specific molecules, but classes of molecules. This is a wonderfully abstract way of representing a large number of chemical transformations with a simple notation.
An example of a generic reaction is the CARBOXYLESTERASE-RXN in MetaCyc, which has the formula: a carboxylic ester + water <=> an alcohol + a carboxylate + a proton

As useful and elegant as generic reactions are, they present a problem for people trying to generate mathematical models from a database. The problem is that most mathematical modeling frameworks do not know how to deal with generic reactions.* So reactions in mathematical models should generally be balanced and unambiguous.

Latendresse et al. (2012) describe a strategy for generating specific reactions ("instances") from generic reactions from the BioCyc family of databases. Their strategy is to enumerate all possible combinations of reactants and products for a generic reaction, then check for mass balance. The set of mass balanced instances is further filtered by removing instances that include a reactant or product that appears in more than one balanced instance, as such instances are regarded as ambiguous. They also treat polymerization reactions as a special case and handle them differently from other instantiations.

As far as I can tell (I have not looked at the source code, and would have trouble understanding it even if I did because I'm not familiar with Lisp), Pathway Tools uses a brute force algorithm and will not instantiate reactions that have thousands or millions of possible instantiations. For example, in MetaCyc 17.1, there are
23 carboxylic esters,
314 alcohols,
321 carboxylates,
which means that there are a total of 2,318,262 possible instantiations for CARBOXYLESTERASE-RXN. Pathway Tools (at least the GUI version) will balk at this number and refuse to instantiate that reaction. However, the vast majority of these possible combinations will not mass balance. This suggests a more efficient instantiation algorithm, one that quickly prunes out as many non-mass-balancing instances as possible without checking them.

A common algorithm that does something similar is the binary search algorithm, which can be used to quickly find a number in an ordered list, by skipping over large groups of numbers if their upper bound is lower than the number being searched for. Lists of compounds can be reduced to lists of numbers that can be searched. We are interested only in combinations of compounds that will make a mass balanced reaction, so our search algorithm should produce a superset (or exact set), of all balanced reactions. We could reduce each compound to its mass, or to lists of its elemental composition. What I ended up using was the total number of protons in the molecule (that is, the sum of the atomic number of each elemental component times its frequency in the molecule), this has the nice feature of reducing each compound to a single integer, but the disadvantage that it won't filter out absolutely every non-mass-balancing combination of compounds (for example the reaction 12 H+ <=> 1 C, would pass this heuristic), but based on practical tests, reducing the pool of instances to only those that fulfill this heuristic is enough that the remaining instances can quickly be checked individually.

A python implementation of and more detailed explanation of the algorithm can be found here. (this is only for the binary search/branch and bound part of the problem, you'd still need to wrap it in code that can read and interpret a metabolic database. I have such code for BioCyc that I can provide if anybody wants it)

A problem with instantiating generic reactions is that there are some instantiations that while chemically valid, probably don't occur at non-negligible rates in-vivo. Building a model from instantiated reactions can lead to the inclusion of many orphan metabolites that cannot be synthesized by the cell/organism, but get included because they can maybe theoretically be metabolized by the organism. For example, alcohol dehydrogenase, E.C. 1.1.1.1, is a common enzyme/enzyme activity that probably shows up in most annotated genomes and transcriptomes, but including in a model every possible reaction that conforms to:
an alcohol + NAD <=> an aldehyde + NADH + H+
would be excessive and beyond the bounds of usefulness.

There are of course, some additional nuances as well. For example reactions involving protein prosthetic groups will often be filtered out as non-balanced because they do not have a supplied stoichiometry.

I think a truly effective instantiation algorithm would have to take structure into account (as described, but not implemented(?) in the BioHacker paper), but right now, I don't think the metabolic databases contain detailed enough descriptions of generic reactions for this to be possible. Rhea, is getting close though, as it supplies links to generic ChEBI entries that contain structural information. The problem with Rhea is that it is missing a lot of reactions and it has some inconsistencies where the same metabolite is given different names in different reactions. It seems to be getting better with every release, so hopefully pretty soon it will get to the point where it can be used as a reliable, consistent source for reactions for models. MetaCyc and (presumably) the other BioCyc databases that are derived from it also have some issues that need to be worked on, such as compounds that should probably actually be considered generics (like D-HEXOSE-6-PHOSPHATE). But again, as flux balance analysis and stoichiometric modeling continue to become more widely used, curators of these databases seem to be making an effort to make the databases more friendly for that kind of modeling, which is great.

For instantiating generic reactions, in the near term, I think that the strategy I present here (or some variation on it) is just about as good as we can do lacking structural information. Once structural information for generic compounds is more readily available, an algorithm for finding truly viable reactions could continue to use this algorithm as a base, but add on a check for structural coherence in addition to or in place of the current check for mass balance.

There is also a good case to be made for relying manual instantiation, which would be to have an expert look at a generic reaction and decide which of the possible instantiations are physiologically relevant (either in general, or for a specific organism/context). This would avoid many of the pitfalls discussed above for automatic instantiation, but at the expense of requiring more manpower to implement effectively.

*It is conceivably possible to write a differential equation model simulator that takes into account generic reactions with a sort of on-the-fly reaction instantiation, but I'm not aware of any that do. I think it would be more difficult to write a linear-programming based simulator, but maybe some kind of dynamic flux balance analysis based algorithm could do it. In any case, I don't think it's currently necessary for simulators to handle generic reactions, because generic reactions can be handled well enough upstream.