Asad's Blog
https://chembioinfo.com
Man and his will to survive!
A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function!
https://chembioinfo.com/2015/08/15/a-romance-between-biology-and-chemistry-protein-sequences-molecules-and-enzyme-function/
Sat, 15 Aug 2015

Next Generation Sequencing (NGS) data is knocking at our door and, simultaneously, our ability to design novel enzymes (by rational design or directed evolution) using high-throughput methods has improved tremendously. As a result, the demand to link enzyme sequences to their chemical products and metabolic pathways is ever increasing. On the other hand, the push to generate metabolomics data to design biomarkers and to understand toxicity, functional genomics and nutrigenomics has given researchers a run for their money!

Last year we launched EC-Blast (my old post), a robust tool to compare chemical reactions using chemical knowledge of bond changes, molecule molecule pairs (MMP) and molecule substructures. This tool helps plough through and understand the reactions present in the Enzyme Commission (E.C.) classification. This has generated a lot of interest in the research community and industry to revisit and mine knowledge which might have been overlooked by traditional methods. Feedback from our users strongly suggested a demand for tools/methods to systematically link protein sequences to the knowledge of bond changes, molecule molecule pairs (MMP) and molecule substructures.

We have recently developed Sequence to Enzyme (Seq2EC), a novel tool (Figure 1), to:

BLAST (Basic Local Alignment Search Tool) was born in the 1990s (1,2) and has since been the bread and butter of homology searches (sequence similarity in large databases). Having said that, I would not like to discredit similar tools such as WU-BLAST, FASTA etc. I recently came across and read with great interest a wonderful paper from Baldi and Benz (3) highlighting the possibility of using this approach to fish out similar chemicals from databases. It touches upon a few statistical challenges and adaptations of Tanimoto scores for calculating the similarity of molecules. There are various ways of calculating chemical similarity, e.g. graph based, fingerprint based etc. In today’s blog I will discuss using fingerprint based similarity methods to calculate the similarity between molecules and how we can use this for BLASTing small molecules. The wiki has a rich page dedicated to the BLAST tool.
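As a rough illustration of fingerprint based similarity, the Tanimoto score between two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below assumes fingerprints are stored as Python sets of "on" bit positions; the function name and the toy bit positions are made up for illustration and are not taken from ChemBLAST.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here: two empty fingerprints score 0
    return len(a & b) / union


# Toy fingerprints (hypothetical bit positions, not real molecules)
query = {1, 5, 9, 42}
target = {1, 5, 10, 42, 77}
print(tanimoto(query, target))  # 3 shared bits out of 6 distinct bits
```

Ranking every database fingerprint by this score against the query is the simplest form of a similarity search; the statistics discussed below are about deciding which of those scores are significant.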

We all agree with the top hit CHEMBL23456, but we differ on the remaining top 4 hits. This could be due to the variation in our choice of fingerprints/methodology. Nonetheless, the top hits in both cases look interesting. I am sure there are many ways to reach the same goal and further optimisation of the code will make it even more attractive. Needs more playing around with!

Suggestions are welcome.

The ChemBLAST tool is freely available on GitHub for you to play with, and it’s reasonably fast.

All things come to him who waits – provided he knows what he is waiting for. ~ Woodrow T. Wilson

—————————————————————————————————————————–

Basic principle of sequence BLAST

—————————————————————————————————————————–

To summarise a few of the important concepts:

a) It tries to compare gene sequences (amino acid or nucleotide) – query string against target string. BLAST will find conserved patterns in the database which are similar to sub-patterns in the query.

b) It uses heuristics, looking for frequently occurring patterns called ‘high-scoring segment pairs’ (HSPs) in the database (using random walks). It then uses the Gumbel extreme value distribution (EVD) to calculate the probability p of observing a score S equal to or greater than x.

The statistical parameters \lambda and K are estimated by fitting the distribution of the un-gapped local alignment scores of the query sequence against a large number of shuffled versions (global or local shuffling) of the database sequence patterns to the Gumbel extreme value distribution.
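For reference, this tail probability is usually written in the Karlin-Altschul form shown below. It is the textbook expression rather than a reproduction of the original figure, and the symbols m and n (the effective lengths of the query and of the database) are assumptions that are not defined in the text above.

```latex
p(S \ge x) = 1 - \exp\left(-K\, m\, n\, e^{-\lambda x}\right)
```

For large x the exponent is small, so p is approximately K m n e^{-\lambda x}, which is the expected number of chance hits scoring at least x.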

d) Find high-scoring patterns of length W, and compile a list L of all W-mers that score > T with some pattern in the query sequence. Then scan the database for words in L; each positive hit is matched and extended. When the score drops more than X below the best score so far, stop the extension. Now report all words with a large score S.

Conditions: If W is too large, it will lead to too many patterns in L and, vice versa, if it is too small it will lead to too few patterns. If T is too large, the search is very stringent (conserved blocks only) and if it is too small, it will lead to too many extensions. Look for High-scoring Segment Pair (HSP) tuples and choose a cut-off for relevant hits.
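The seed-and-extend idea with the parameters W, T, X and S can be sketched in a few lines. This is only a toy, assuming a DNA alphabet and a made-up scoring scheme (+2 for a match, -1 for a mismatch); all function names are hypothetical and real BLAST uses substitution matrices, two-hit seeding and gapped extension on top of this.

```python
from itertools import product

ALPHABET = "ACGT"  # toy DNA alphabet (assumption)


def word_score(w1, w2, match=2, mismatch=-1):
    """Ungapped score of two equal-length words (toy scoring scheme)."""
    return sum(match if a == b else mismatch for a, b in zip(w1, w2))


def neighbourhood(query, W, T):
    """List L: every W-mer scoring > T against some W-mer of the query.

    Maps each word to the query offsets it matches.
    """
    L = {}
    for cand in map("".join, product(ALPHABET, repeat=W)):
        for qi in range(len(query) - W + 1):
            if word_score(cand, query[qi:qi + W]) > T:
                L.setdefault(cand, []).append(qi)
    return L


def extend(query, target, qi, ti, W, X):
    """Extend a seed hit left and right without gaps, using an X-drop cut-off."""
    total = word_score(query[qi:qi + W], target[ti:ti + W])
    for step in (1, -1):  # extend to the right, then to the left
        q, t = (qi + W, ti + W) if step == 1 else (qi - 1, ti - 1)
        running = best_gain = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            running += word_score(query[q], target[t])
            if running > best_gain:
                best_gain = running
            elif best_gain - running > X:
                break  # score fell more than X below the best so far
            q += step
            t += step
        total += best_gain
    return total


def blast_like(query, target, W=3, T=4, X=3, S=8):
    """Return (query offset, target offset, score) for extended hits scoring >= S."""
    L = neighbourhood(query, W, T)
    hits = set()
    for ti in range(len(target) - W + 1):
        for qi in L.get(target[ti:ti + W], ()):
            score = extend(query, target, qi, ti, W, X)
            if score >= S:
                hits.add((qi, ti, score))
    return sorted(hits, key=lambda h: -h[2])


hits = blast_like("ACGTACGT", "TTACGTACGTTT")
print(hits[0][2])  # best ungapped score found
```

With this scoring an exact 3-mer scores 6 and a 3-mer with one mismatch scores 3, so T=4 keeps only exact words in L while T=2 would also admit one-mismatch neighbours, which shows how T trades stringency against the number of extensions.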

Figure: Top 10 ChEMBL 19 hits reported by ChemBLAST tool for query molecule CHEMBL23456.
(ChemBLAST post: https://chembioinfo.com/2014/12/31/chemblast-old-dog-new-tricks/)

Atom Atom Mapping (AAM) and Challenges
https://chembioinfo.com/2014/03/18/aam/
Tue, 18 Mar 2014

We have just released our long-awaited AAM tool in the public domain…this was long overdue! You can download the tool from GitHub. This tool is based on the algorithm published in the S.A. Rahman et al. (2014) Nature Methods paper. We have successfully mapped more than 6000 KEGG reactions in EC-BLAST.

Shortest Path and Molecular Hashed Fingerprints
https://chembioinfo.com/2012/07/23/shortest-path-and-molecular-hashed-fingerprints/
Mon, 23 Jul 2012

Shortest Path (SP) has been used in many aspects of graph traversal. The idea is to minimise the cost (the number of edges to be traversed, or the sum of the edge costs) of travelling between a source and a destination. This is one of the most efficient ways of finding a path in a graph where you could otherwise generate a combination of paths using random walks.

Interestingly, the generation of path based hashed fingerprints is very common in the area of cheminformatics. The basic idea is to find all paths up to a certain length from each source atom (fragments) in a molecule and convert them into a hashed fingerprint. This works very well with smaller or sparse graphs, although in a few cases the run time may increase exponentially with the size of the graph (connectivity). The Chemistry Development Kit (CDK) has one such effective path based hashed fingerprint generator (Fingerprinter.java). This module in the CDK has generated a lot of interest from the user community. Recently, Nina Nikolova-Jeliazkova posted an interesting set of molecules where the path search in the fingerprinter was hit by combinatorial explosion!

Note: The behaviour of the path finding algorithm is compromised once the depth of the path search is more than 6 (the recommended depth is 8). Hence, for this set of molecules one may not be able to generate the fingerprints.

Here are the example molecules on which the present CDK hashed fingerprinter struggles.

I have modified the existing CDK fingerprinter to report the shortest path rather than all paths. This overcomes the problem of combinatorial explosion, and the runtime is no longer exponential as it was in the previous case.
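The difference between the two strategies can be seen on a toy graph. The sketch below is not the CDK code: it assumes a molecule is given as a plain adjacency dictionary of atom indices, uses breadth-first search to keep one shortest path per atom pair, and counts all simple paths by depth-first search for comparison. The function names and both example graphs are hypothetical.

```python
from collections import deque


def shortest_paths_from(graph, source):
    """BFS: one shortest path (list of atom indices) from source to every reachable atom."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for nbr in sorted(graph[atom]):  # sorted => ties are broken by atom index
            if nbr not in parent:
                parent[nbr] = atom
                queue.append(nbr)
    paths = {}
    for atom in parent:
        path, cur = [], atom
        while cur is not None:
            path.append(cur)
            cur = parent[cur]
        paths[atom] = path[::-1]
    return paths


def count_simple_paths(graph, source, max_depth):
    """DFS: number of simple paths of 1..max_depth bonds starting at source."""
    def walk(atom, visited, depth):
        total = 1 if depth > 0 else 0
        if depth == max_depth:
            return total
        for nbr in graph[atom]:
            if nbr not in visited:
                total += walk(nbr, visited | {nbr}, depth + 1)
        return total
    return walk(source, {source}, 0)


# A six-membered ring and a densely connected toy graph (not a real molecule)
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
dense = {i: set(range(6)) - {i} for i in range(6)}

print(count_simple_paths(ring, 0, 5), count_simple_paths(dense, 0, 5))
print(shortest_paths_from(ring, 0)[3])
```

Both graphs have six atoms, so BFS reports at most five non-trivial paths per source atom in either case, while the number of simple paths grows rapidly with connectivity. Because ties between equally short paths are broken by atom index here, the reported path depends on how the atoms are numbered.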

Here is the runtime and the density of the FP (number of bits occupied) as calculated by the SP based FP. One can reduce the runtime and density by half if the FP is based only on either weighted or unweighted bonds. In the modified SP based FP, I have used both weighted and unweighted bonds to give a better consensus FP (more in my next blog!).

a) Presently the fingerprint accounts for only one shortest path between a source and a sink atom (it discriminates between aromatic, ring and aliphatic paths). Hence, I had to canonicalize the atoms in the graph container so that if two molecules are similar then the returned shortest path is the same. A natural extension would be to report the k shortest paths, but this may be only as good as the CDK default fingerprinter (in terms of runtime).

b) For sparse and smaller graphs it might be as fast as the previous implementation, and it will perform better on complex graphs.

Here is a test case based on the ring systems (aromatic and non aromatic) and aliphatic molecules.

Updated: 29th Aug 2012

Thanks to Egon for his suggestion to use sp2 hybridization instead of the aromaticity checker. In my case I have to use CDK aromaticity detection, as the sp2 concept may not work.

I have clustered 11 molecules based on their fingerprint similarity scores using the

a) CDK default fingerprinter,

b) SP based Fingerprinter and

c) CDK Hybridization Fingerprinter

The clustered results are as shown below.

The CDK default fingerprinter based similarity clusters

The Shortest path fingerprinter based similarity clusters

Molecule similarity clusters based on the Hybridization Fingerprinter (doesn’t discriminate between open and close ring system)

The Hybridization based fingerprinter is the fastest one (in the non-complex cases), followed by the SP fingerprinter and the improved CDK fingerprinter. In terms of sensitivity and specificity, the SP fingerprinter is the best, and in complex cases it’s by far the fastest one!

I will leave it to the readers to choose their favorite fingerprinter.

Kindly leave your comments and suggestions!

EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes
https://chembioinfo.com/2012/04/11/ec-blast/
Wed, 11 Apr 2012

Enzymes have been part of our evolutionary machinery and their importance in our lives is ever increasing. A hierarchical functional classification of enzymes has been developed to cluster similar enzymes based on their chemistry (kindly refer to my previous blog on enzymes). A parallel system involves sequence and protein structure based classification. One of the most challenging issues in today’s bio/chemo informatics is to automatically link sequence knowledge with enzymatic chemistry. There exist many methods in the literature addressing this issue, but it’s hard to find a direct link which holds true for all cases. Very recently, though, in Prof. Janet Thornton’s group we have come up with a web tool, “FunTree”, for linking enzyme superfamilies based on knowledge of their evolution, derived from sequences and structures (proteins and small molecules).

It’s very hard to find a one-to-one mapping between genes->proteins->enzymes and it’s equally mind-boggling to navigate in this space. This is one of the reasons why we have many orphan enzymes, i.e. enzymes which do not have a sequence assigned to them yet. On the one hand we have ever-increasing sequence databases and sophisticated tools like BLAST and FASTA to compare them. Unfortunately, the biochemical side of the story is slower, as we have a limited number of publicly available chemical databases and tools in chemistry. In recent years, though, databases like BRENDA, KEGG, BioCyc, UniProt, EC->PDB and SwissProt etc. have emerged to link sequence to chemistry. There are efforts to bring various resources of enzyme chemistry under one umbrella, and one such web portal is the “Enzyme Portal”. Likewise, there exist a few curated databases linking enzyme function and reaction mechanism, like MACiE, Rhea and SFLD etc.

The challenge for a biologist/chemist is to find a tool which can function like BLAST (as a magic black box) in finding similar enzymes in a reaction database (a needle in a haystack). The good news is that we have made some progress in this interesting area of research by coming up with a novel tool, “EC-BLAST”. The core idea behind this tool is to find similar enzymes ranked by the similarity of the bond changes, the reaction centres or the chemical structures of the participating reactions. One could start a search with a molecule/reaction name or its structure. The Atom-Atom Mapping (AAM) is algorithmically generated on the fly for a balanced input reaction, and the bond changes are automatically deduced and marked before performing any search.

EC BLAST front page

Understanding the search results would help us gain better insight into the catalytic promiscuity of enzymes and complement the sequence based results obtained from tools like BLAST, FASTA etc. (where the chemistry is not necessarily retained in the results). This will help us link the evolutionary and mechanistic aspects of enzymes, connecting biological findings with chemical knowledge.

Such tools will also help us gain better insight into toxicity studies (they can add a valuable parameter to the likes of ChEMBL/DrugBank), and help in designing novel enzymes, retrosynthetic pathways etc. Although the first glimpse of EC-BLAST was unveiled at ISMB 2011 in Vienna, where it won the “Killer Apps 2011” award, it largely remained restricted to the EBI and collaborators. The response at ISMB 2011 (poster here) was very encouraging for us, and there has been an ever-increasing need, scope and demand for such a resource. Hence, we have now decided to go public with a beta version of our web portal service.

EC-Blast result page for bond change similarity searches.

Note: If you are interested in testing this service or sending us your comments or feedback, please do let me know!

In my previous post, I discussed the impact of the hash code and random number generators on a hashed fingerprint. They play a major role in the uniform distribution of the bits in a fixed-length array and in the occurrence of bit clashes. In order to prove the concept, I have prepared a test case of 1200 molecules and performed a substructure search using the default CDK Fingerprinter class and an improved version of the Fingerprinter class (with the Apache Commons Lang HashCodeBuilder and the Mersenne Twister random number generator).

Each molecule was searched against other molecules in the dataset including itself. This was done at an interval of 200 data points. The gold standard was the substructure search results from the SMSD.

As expected, the improved version of the Fingerprinter class outperformed the present CDK Fingerprinter class. The number of false positives (FP) was reduced by 35-40% (due to minimal bit clashes), thereby increasing the accuracy of the results, while the true positives remained unchanged. This also made an overall positive impact on the speed of the search!
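To make the notion of a false positive concrete: in a fingerprint pre-screen, a target can only contain the query as a substructure if every bit set in the query is also set in the target, and a false positive is a target that passes this bit test but fails the exact substructure match. The sketch below assumes fingerprints are stored as Python integers used as bitsets; the names and the three toy targets are hypothetical, and the "true hits" stand in for a gold standard such as the SMSD results.

```python
def passes_screen(query_fp, target_fp):
    """A target can contain the query only if every query bit is also set in the target."""
    return (query_fp & target_fp) == query_fp


def false_positives(query_fp, target_fps, true_hits):
    """Targets that pass the bit screen but are not real substructure matches."""
    return [name for name, fp in target_fps.items()
            if passes_screen(query_fp, fp) and name not in true_hits]


# Toy data: integers used as 4-bit fingerprints (hypothetical)
query_fp = 0b0101
target_fps = {"A": 0b0111, "B": 0b1101, "C": 0b0011}
true_hits = {"A"}  # gold standard from an exact substructure search

print(false_positives(query_fp, target_fps, true_hits))
```

Here "C" is rejected by the screen, "A" is a genuine hit, and "B" passes only because its bits happen to cover the query, which is what a bit clash looks like from the outside. Fewer clashes mean fewer such targets reach the expensive exact matching step.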

Fingerprints have been widely used in various fields to find similar features. Now, for those of you who are using your detective instincts and thinking of DNA fingerprints or biological fingerprints, I might disappoint you in the latter half of my post. Fingerprints are typically used to avoid cumbersome data comparison by using a shorter “bit” string. My focus will be on molecular fingerprints, which are used by chemo/bio informaticians for finding similar molecular structures, i.e. finding a needle in a haystack!

Theoretically, if you know the prerequisite features (“should have” and “should not have”) of the target molecules, then you can use a set of predefined keys to generate fingerprints. For example, PubChem fingerprints, MACCS keys etc. are based on certain substructure/SMARTS keys which are expected to be found or skipped in your target. On the other hand, when both the query and the target are unknowns, one of the fastest ways to go for the kill is hashed fingerprints.

Typically, in a hashed fingerprint a set of patterns is generated by gathering atom environment information, subgraph information, or both. The generated patterns are then transformed into hash codes (fixed-size message digests) using a hashing algorithm. These hash codes can then be transformed into positions in a bit string of a defined length (the size of the fingerprint) using random number generation. The presence and absence of a pattern are marked as “1” and “0” respectively.

Pros:

Hashed fingerprints are like a black box with an assurance that similar patterns will have similar bits set to “1”. In the language of information science you are allowing clashes of the similar bits with certain probability.

The size of the generated fingerprints can be controlled by the user, as predefined knowledge of the fingerprint patterns is not required.

Cons:

The resolution of the fingerprints depends on algorithms used for generating the hash code and random numbers.

It’s challenging to find a perfectly sized fingerprint which strikes a balance between minimising the clashes of bits and the wastage of bit space.

Implementation

Let’s play with some real examples to understand the depth of the above mentioned statements. First we need to generate some patterns from molecules and store them as fingerprints. In order to analyse the quality of the fingerprints we will open the black box by keeping track of the generated pattern types. This will help us to quantify the patterns involved in the bit clashes. Circular fingerprints or molecular signatures can be used to generate patterns of various diameters/heights for a molecule. By increasing the diameter/height, we can enrich the patterns/information about the molecules. However, this will also increase the overhead of balancing the fingerprint size against the bit clashes.

Stage 1: Generate patterns using molecular signatures of heights 0 to 3 for every atom in the molecule. An example is illustrated in the figure below.

Circular / Signatures patterns encoded as fingerprints

Stage 2: Transform these patterns as SMARTS/SMILES/Signatures and generate hash code for each pattern using your favourite algorithm.

Stage 3: Once we have the hash codes for these patterns, use a random number generator to convert them into positions in a bit set of fixed range (e.g. 1024).
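Stages 2 and 3 can be sketched as follows. This is a stand-in rather than the Java code used in the post: it assumes MD5 (via hashlib) as the hashing algorithm and uses Python's random.Random, which is itself a Mersenne Twister, seeded with the hash code. The function names and the example pattern strings are made up for illustration.

```python
import hashlib
import random


def pattern_to_bit(pattern, size=1024):
    """Map a pattern string to a bit position in [0, size)."""
    digest = hashlib.md5(pattern.encode("utf-8")).digest()
    hash_code = int.from_bytes(digest[:4], "big")  # stage 2: 32-bit hash code
    rng = random.Random(hash_code)                 # stage 3: seeded Mersenne Twister
    return rng.randrange(size)


def fingerprint(patterns, size=1024):
    """Hashed fingerprint as the set of 'on' bit positions."""
    return {pattern_to_bit(p, size) for p in patterns}


# Hypothetical signature strings standing in for stage 1 output
patterns = ["[C]", "[C]([C])", "[O]", "[C]([C][O])"]
fp = fingerprint(patterns)
print(len(fp), "bits set out of 1024")
```

Seeding the generator with the hash code makes the mapping deterministic: the same pattern always lands on the same bit, and two different patterns clash only if the generator happens to send them to the same position.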

I have used the CDK to generate molecular signatures (σ) of various heights (0 to 3) for 5000 mols. These signatures were transformed into canonical SMILES and a hash code was generated using the Apache Commons Lang HashCodeBuilder class (better than the default Java hashCode() due to its flexibility). Well, you could use any method you like as long as equal objects produce the same hash code and unequal objects, as far as possible, produce distinct hash codes. Some of the most common hash code generation algorithms are MD5, SHA, PJW (Peter Weinberger’s hash) etc. The choice is made on the basis of the data distribution (balance between random generation and pattern in generation) and the efficiency of the hashing function (it should be very quick, stable and deterministic).

Now the tricky part is the conversion of hash codes into a fingerprint. I have used the famous Mersenne Twister random number generator. This yields better results than Java’s default Random class in terms of minimising the bit clashes and maximising the bit set resolution.

Here are a few statistical measures regarding the patterns generated and encoded into fingerprint bitsets.

Statistical Measure (5000 mols)           Height 0        Height 1         Height 2          Height 3
Unique Pattern Count (UPC)                53              426              4083              14448
Average number of patterns/fingerprint    3.09 +/- 1.04   10.34 +/- 5.82   15.16 +/- 10.01   17.01 +/- 13.07
Median number of patterns/fingerprint     3               9                13                13
Max. number of patterns/fingerprint       7               35               64                89

In order to understand the resolution of the fingerprints with respect to the bit clashes and the size of the fingerprints, I generated fingerprints of various sizes (ranging from 128 to 8192 bits). A fingerprint size of 1024 bits seems like a good bet for signatures of height up to 2 (as marked in the graph below), while 4096 bits works well for signatures of height 3 (more than 95% of the bits are used and a smaller percentage of bits clash).
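One plausible way to measure this trade-off is sketched below. It is an assumption about how such numbers could be computed, not the procedure behind the graph: a clash is counted whenever two distinct patterns land on the same bit, and usage is the share of the bit space that ends up set. The hashing follows the same MD5 plus seeded Mersenne Twister stand-in as before, and the 400 synthetic pattern names are placeholders for real signatures.

```python
import hashlib
import random


def bit_for(pattern, size):
    """Deterministic bit position for a pattern (MD5 seed + Mersenne Twister)."""
    seed = int.from_bytes(hashlib.md5(pattern.encode("utf-8")).digest()[:4], "big")
    return random.Random(seed).randrange(size)


def clash_stats(patterns, size):
    """Bits used and percentage of patterns lost to clashes for a given fingerprint size."""
    unique = set(patterns)
    bits = {bit_for(p, size) for p in unique}
    clashes = len(unique) - len(bits)
    return {
        "size": size,
        "bits_used": len(bits),
        "usage_pct": 100.0 * len(bits) / size,
        "clash_pct": 100.0 * clashes / len(unique),
    }


# Synthetic placeholder patterns (hypothetical, not real signatures)
patterns = [f"pattern-{i}" for i in range(400)]
for size in (128, 256, 512, 1024, 2048, 4096, 8192):
    print(clash_stats(patterns, size))
```

Small fingerprints show high usage but many clashes, while large ones show few clashes but mostly empty bit space, which is the balance the chosen sizes try to strike.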

BitSet usage vs Bit Clash in the hashed fingerprints

Analysis

From the above figure, it is clear that one of the key improvements which can be made in the hashed fingerprints is to divide it into sub-fingerprints. Then each sub-fingerprint can be populated with certain chemical/subgraph property of the molecule. Say in the case of molecular fingerprint of size 1024 bitset, one can divide the fingerprints into two sub-fingerprints –

a) One of 256 bits for storing labelled atom types and,

b) The second, of 768 bits for graph/topological information.

The hash code for the atom typed section is derived from a concatenated label string of the CDK atom type + presence of the atom in a ring system + stereo, for each atom in the molecule (you could choose your own physicochemical labelling scheme). The signatures/graph section can be populated with signatures/circular fingerprints of height/diameter 2. The sub-fingerprints are easy to generate and store with the above mentioned process, thanks to the flexibility of generating hash codes within a range. The idea is to get the best of both worlds, i.e. physicochemical properties and subgraph patterns.
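The 256 + 768 split can be sketched by restricting each kind of pattern to its own bit range. This is a minimal illustration under the same assumptions as before (MD5 hash code, seeded Mersenne Twister via random.Random); the function names, the label format and the example strings are hypothetical.

```python
import hashlib
import random

ATOM_BITS = 256   # section for labelled atom types
GRAPH_BITS = 768  # section for graph/topological patterns


def _bit(label, lo, hi):
    """Deterministic bit position for a label, restricted to the range [lo, hi)."""
    seed = int.from_bytes(hashlib.md5(label.encode("utf-8")).digest()[:4], "big")
    return random.Random(seed).randrange(lo, hi)


def split_fingerprint(atom_labels, graph_patterns):
    """1024-bit fingerprint: bits 0-255 for atom labels, bits 256-1023 for graph patterns."""
    bits = {_bit(label, 0, ATOM_BITS) for label in atom_labels}
    bits |= {_bit(p, ATOM_BITS, ATOM_BITS + GRAPH_BITS) for p in graph_patterns}
    return bits


# Hypothetical labels: atom type + ring membership + stereo
atom_labels = ["C.sp3|ring=0|stereo=none", "O.sp3|ring=0|stereo=none"]
graph_patterns = ["[C]([C][O])", "[O]([C])"]
print(sorted(split_fingerprint(atom_labels, graph_patterns)))
```

Because the two sections never share a bit, an atom type label can never clash with a subgraph pattern, and each section can be compared on its own if only one kind of information is of interest.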

Conclusion

The quality of the hashed fingerprint depends a lot on the patterns generated (UPC), the size of the bitsets, the hashing function and the random number generator. The next step for me would be to cluster these similarity matrices or perform a Leave One Out test on the dataset to check the specificity and sensitivity of the model.

References:

Further reading, and the references therein, will give you more insight into the story: