Next Generation Sequencing (NGS) data is knocking at our door and, at the same time, our ability to design novel enzymes (by rational design or directed evolution) using high-throughput methods has improved tremendously. As a result, the demand to link enzyme sequences to their chemical products and metabolic pathways is ever increasing. On the other hand, the push to generate metabolomics data to design biomarkers and to understand toxicity, functional genomics and nutrigenomics has given researchers a run for their money!

Last year we launched EC-Blast (my old post), a robust tool to compare chemical reactions using chemical knowledge of bond changes, molecule-molecule pairs (MMP) and molecule substructures. This tool helps plough through and understand the reactions present in the Enzyme Commission (E.C.) classification. It has generated a lot of interest in the research community and in industry to revisit and mine knowledge which might have been overlooked by traditional methods. Feedback from our users strongly suggested a demand for tools and methods that systematically link protein sequences to the knowledge of bond changes, molecule-molecule pairs (MMP) and molecule substructures.

We have recently developed Sequence to Enzyme (Seq2EC), a novel tool (Figure 1) to:

Shortest Path (SP) has been used in many aspects of graph traversal. The idea is to minimise the cost (the number of edges traversed, or the sum of the costs on the edges) of travelling between a source and a destination. It is an efficient way of picking a single path in a graph where many alternative paths could otherwise be generated, for example by random walks.
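As a rough illustration of the idea, here is a minimal breadth-first search sketch in plain Python for the unweighted case (every edge costs 1). The toy graph `ring` and the function name `shortest_path` are made up for this example and are not part of the CDK.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest path (fewest edges) from source to target, or None."""
    previous = {source: None}          # node -> node we reached it from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:    # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None                        # target is not reachable


# Hypothetical toy graph: a six-membered ring (0..5) with a substituent (6) on node 0.
ring = {0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0], 6: [0]}

print(shortest_path(ring, 6, 2))
```

For weighted edges one would swap the queue for a priority queue (Dijkstra's algorithm); the bookkeeping stays the same.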

Interestingly, generating path based hashed fingerprints is very common in chemoinformatics. The basic idea is to find all paths up to a certain length starting from each source atom (fragment) in a molecule and to convert them into a hashed fingerprint. This works very well with smaller or sparse graphs, although in a few cases the run time may increase exponentially with the size (connectivity) of the graph. The Chemistry Development Kit (CDK) has one such effective path based hashed fingerprint generator (Fingerprinter.java). This module in the CDK has generated a lot of interest from the user community. Recently, Nina Nikolova-Jeliazkova posted an interesting set of molecules where the path search in the fingerprint was hit by combinatorial explosion!
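To make the all-paths idea concrete, here is a small sketch in plain Python. It is not the CDK `Fingerprinter` code: the molecule encoding (a list of element symbols plus an adjacency dictionary), the function names and the CRC32 hash are all assumptions chosen for illustration, and bond orders are ignored.

```python
import zlib


def all_path_labels(atoms, bonds, max_depth=8):
    """Collect a label for every simple path of up to max_depth atoms."""
    labels = set()

    def walk(path):
        symbols = tuple(atoms[i] for i in path)
        # A path and its reverse are the same fragment: keep one spelling.
        labels.add("-".join(min(symbols, symbols[::-1])))
        if len(path) < max_depth:
            for nxt in bonds[path[-1]]:
                if nxt not in path:        # simple paths only, no revisits
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return labels


def hashed_fingerprint(labels, size=1024):
    """Fold each path label onto a bit position in a fixed-size fingerprint."""
    return {zlib.crc32(label.encode("ascii")) % size for label in labels}


# Ethanol heavy atoms only: C-C-O
ethanol = (["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]})
print(sorted(all_path_labels(*ethanol)))
print(sorted(hashed_fingerprint(all_path_labels(*ethanol))))
```

The recursive `walk` visits every simple path, which is exactly where the run time can blow up on densely connected ring systems.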

Note: The behaviour of the path finding algorithm is compromised once the depth of the path search is more than 6 (the recommended depth is 8). Hence, for this set of molecules, one may not be able to compute the fingerprints.

Here are the exemplar molecules where the present CDK hashed fingerprint struggles.

I have modified the existing CDK fingerprinter to report the shortest path rather than all paths. This overcomes the problem of combinatorial explosion, and the runtime is no longer exponential as it was in the previous case.
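The gist of a shortest-path fingerprint can be sketched as follows. This is plain Python under simplifying assumptions, not the modified CDK code: atoms are a list of element symbols, bonds an adjacency dictionary, bond orders and aromaticity are ignored, and the helper names are invented. One breadth-first search per atom yields at most one path per atom pair, so the number of hashed paths grows quadratically with the number of atoms rather than exponentially.

```python
import zlib
from collections import deque


def bfs_tree(bonds, source):
    """Predecessor map of a breadth-first search started at source."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in bonds[node]:
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return previous


def shortest_path_labels(atoms, bonds):
    """One label per (source, target) pair, taken along a shortest path."""
    labels = set()
    for source in range(len(atoms)):
        previous = bfs_tree(bonds, source)
        for target in previous:
            symbols = []
            node = target
            while node is not None:        # walk back from target to source
                symbols.append(atoms[node])
                node = previous[node]
            symbols = tuple(symbols)
            labels.add("-".join(min(symbols, symbols[::-1])))
    return labels


def sp_fingerprint(atoms, bonds, size=1024):
    """Fold the shortest-path labels onto bit positions."""
    return {zlib.crc32(l.encode("ascii")) % size
            for l in shortest_path_labels(atoms, bonds)}


# Hypothetical example: a six-membered carbon ring with an oxygen on atom 0.
ring_atoms = ["C"] * 6 + ["O"]
ring_bonds = {0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0], 6: [0]}
print(sorted(shortest_path_labels(ring_atoms, ring_bonds)))
```

When two atoms are joined by several equally short paths with different labels, the path this sketch picks depends on the atom ordering, which is why a real implementation needs some form of canonical atom order.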

Here are the runtime and the density of the FP (number of bits set) as calculated by the SP based FP. One can reduce the runtime and the density by half if the FP is based only on either weighted or unweighted bonds. In the modified SP based FP, I have used both weighted and unweighted bonds to give a better consensus FP (more in my next blog!).

a) Presently the fingerprint accounts for only one shortest path between a source and a sink atom (it discriminates between aromatic, ring and aliphatic paths). Hence, I had to canonicalize the atoms in the graph container so that if two molecules are similar then the returned SP is the same. A natural extension would be to report the k shortest paths, but this may be only as good as the CDK default fingerprinter (in terms of runtime).

b) For sparse and smaller graphs it might be as fast as the previous implementation, and it will perform better on complex graphs.

Here is a test case based on the ring systems (aromatic and non aromatic) and aliphatic molecules.

Updated: 29th Aug 2012

Thanks to Egon for his suggestion to use SP2 hybridization instead of the aromaticity checker. In my case I have to use the CDK aromaticity detection, as the SP2 concept may not work.

I have clustered 11 molecules based on their fingerprint similarity scores using the

a) CDK default fingerprinter,

b) SP based Fingerprinter and

c) CDK Hybridization Fingerprinter

The clustered results are as shown below.

The CDK default fingerprinter based similarity clusters

The Shortest path fingerprinter based similarity clusters

Molecule similarity clusters based on the Hybridization Fingerprinter (doesn’t discriminate between open and closed ring systems)

The Hybridization based fingerprinter is the fastest one (in the non-complex cases), followed by the SP fingerprinter and the improved CDK fingerprinter. In terms of sensitivity and specificity, the SP fingerprinter is the best, and in complex cases it is by far the fastest one!

I will leave it to the readers to choose their favorite fingerprinter.

Well, for many of us this might be a regular exercise in a chemical editor, but it is a little trickier to program. Mathematically, we are aiming for a union of two molecules. The size of the union can be written as n(A ∪ B) = n(A) + n(B) − n(A ∩ B), where A is the query molecule, B is the target molecule and n(·) counts atoms. The intersection (A ∩ B) can be obtained by finding the isomorphism (common substructure) between the two molecules. The joining part is a little challenging, as not all combinations might be chemically valid. So we need to find combinations which are unique and unsaturated!
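As a language-neutral illustration of the counting argument, here is a short plain-Python sketch. It assumes the atom mapping between the two molecules is already known (in practice it would come from a substructure search), ignores bond orders, and does no valence or uniqueness checks; the function name `merge` and the data layout are invented for this example.

```python
def merge(atoms_a, bonds_a, atoms_b, bonds_b, mapping):
    """Union of two molecules given a mapping {atom index in B: atom index in A}.

    Atoms are element symbols, bonds are pairs of atom indices.
    """
    atoms = list(atoms_a)
    bonds = {frozenset(bond) for bond in bonds_a}
    index = dict(mapping)                  # B index -> index in the union
    for j, symbol in enumerate(atoms_b):
        if j not in index:                 # atom only present in B
            index[j] = len(atoms)
            atoms.append(symbol)
    for i, j in bonds_b:
        bonds.add(frozenset((index[i], index[j])))
    return atoms, bonds


# Query A: ethanol (C-C-O); target B: ethylamine (C-C-N); they share C-C.
atoms_a, bonds_a = ["C", "C", "O"], [(0, 1), (1, 2)]
atoms_b, bonds_b = ["C", "C", "N"], [(0, 1), (1, 2)]
mapping = {0: 0, 1: 1}

atoms, bonds = merge(atoms_a, bonds_a, atoms_b, bonds_b, mapping)
print(atoms, len(atoms_a) + len(atoms_b) - len(mapping))
```

Here n(A) = 3, n(B) = 3 and n(A ∩ B) = 2, so the union has 4 atoms. A different mapping of the same common substructure can give a different union, which is where the chemical validity checks come in.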

Here is a little snippet to perform this task in Java using the SMSD and CDK.

Metabolism influences building or replacement of tissue, conversion of food to energy, disposal of waste materials, reproduction etc. “Catalysis” is defined as the acceleration of a chemical reaction by a substance which itself undergoes no permanent chemical change. Most biochemical reactions do not take place spontaneously and enzyme catalysis plays an important role in biochemical reactions necessary for all life processes. Without enzymes, these reactions would take place at a rate far too slow for effective metabolism.

Enzymes can be classified by the kind of chemical reaction they catalyze. One such scheme of enzyme classification is defined by the International Union of Biochemistry and Molecular Biology (IUBMB).

The IUBMB assigns a 4-digit code to each enzyme. Each code is prefixed by EC, followed by the digits separated by dots.

For example: alcohol dehydrogenase, an oxidoreductase, is EC 1.1.1.1

1. The first digit denotes “Class” of the enzyme

2. The second digit indicates the “Sub-class” of the enzyme

3. The third digit gives “Sub sub-class” of the enzyme

4. The fourth digit in the code is “Serial number” of the enzyme
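The four levels described above can be pulled apart with a few lines of code. This is a minimal sketch in plain Python; the function name `parse_ec`, the returned keys and the handling of unknown classes are assumptions made for illustration, and the class names are the six groups covered in this post.

```python
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(code):
    """Split a complete EC number such as 'EC 1.1.1.1' into its four levels."""
    text = code.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()            # drop the 'EC' prefix
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers: %r" % code)
    ec_class, subclass, sub_subclass, serial = (int(p) for p in parts)
    return {
        "class": ec_class,
        "class_name": EC_CLASSES.get(ec_class, "Unknown"),
        "subclass": subclass,
        "sub_subclass": sub_subclass,
        "serial": serial,
    }


print(parse_ec("EC 1.1.1.1"))
```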

The classification is as follows:

| Group Name | Type of Reaction Catalysed | Example |
| --- | --- | --- |
| Oxidoreductases | Oxidation-reduction reactions | Alcohol oxidoreductase (EC 1.1) |
| Transferases | Transfer of functional groups | Methyltransferase (EC 2.1) |
| Hydrolases | Hydrolysis reactions | Lipase (EC 3.1) |
| Lyases | Addition of groups to double bonds, or formation of double bonds by removal of groups | Decarboxylases (EC 4.1) |
| Isomerases | Isomerization reactions | Epimerases and Racemases (EC 5.1) |
| Ligases | Formation of bonds with ATP cleavage | Enzymes forming carbon-oxygen bonds (EC 6.1) |

b) How can I find similar enzymes?

Any similarity search is based on the presence of similar patterns (similar bond changes and/or small molecules) shared between the query and target reactions. A large number of shared patterns results in a higher similarity score or a lower distance score. In bioinformatics, the concept of similarity or distance is used to find similar sequences based on amino acid similarity, structural topology, etc. In chemoinformatics, similarity between small molecules or drug molecules (e.g. the Tanimoto score) is based on the presence of similar bonds and atoms in the query and target molecules.
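For reference, the Tanimoto score between two pattern sets is the number of shared patterns divided by the number of patterns present in either set. A minimal plain-Python sketch follows; representing a fingerprint as a set of "on" bit positions and returning 0.0 for two empty fingerprints are assumptions of this example (conventions for the empty case differ).

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0                         # assumed convention for empty inputs
    return len(a & b) / len(a | b)


query = {1, 2, 3, 4}
target = {3, 4, 5, 6}

similarity = tanimoto(query, target)       # 2 shared bits out of 6 in total
distance = 1.0 - similarity
print(round(similarity, 3), round(distance, 3))
```

The same function works unchanged whether the patterns are hashed path bits, bond changes or substructures, as long as each pattern is reduced to a hashable identifier.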

I reckon that in the near future we might see such concepts being adopted by the IUBMB itself to annotate and classify enzymes.

This would be vital in the study of the interactions between the components of biological systems (metabolites, enzymes and metabolic pathways), and how these interactions give rise to the function and behavior of that system.