BLAST (Basic Local Alignment Search Tool) was born in the 1990s (1,2) and has since been the bread and butter of homology searches (sequence similarity searches in large databases). Having said that, I would not like to discredit similar tools such as WU-BLAST, FASTA etc. I recently came across and read with great interest a wonderful paper from Baldi and Benz (3) highlighting the possibility of using this tool to fish out similar chemicals from databases. It touches upon a few statistical challenges and adaptations of Tanimoto scores for calculating the similarity of molecules. There are various ways of calculating chemical similarity, e.g. graph based, fingerprint based etc. In today’s blog I will discuss using fingerprint based similarity methods to calculate similarity between molecules and how we can use this for BLASTing small molecules. The wiki has a rich page dedicated to the BLAST tool.

We all agree with the top hit CHEMBL23456, but we differ on the remaining top 4 hits. This could be due to the variation in our choice of fingerprints/methodology. Nonetheless, the top hits in both cases look interesting. I am sure there are many ways to reach the same goal, and further optimisation of the code will make it even more attractive. Needs more playing around with!

Suggestions are welcome.

The ChemBLAST tool is freely available on GitHub for you to play with, and it’s reasonably fast.

All things come to him who waits – provided he knows what he is waiting for. ~ Woodrow T. Wilson

—————————————————————————————————————————–

Basic principle of sequence BLAST

—————————————————————————————————————————–

To summarise a few of the important concepts:

a) It compares gene sequences (amino acid or nucleotide): a query string against target strings. BLAST will find conserved patterns in the database which are similar to sub-patterns in the query.

b) It uses heuristics by calculating frequently occurring patterns called ‘high-scoring segment pairs’ (HSPs) in the database (using random walks). It then uses the Gumbel extreme value distribution (EVD) to calculate the probability p of observing a score S equal to or greater than x. This is given by the equation:

p(S ≥ x) = 1 − exp(−K·m·n·e^(−λx))

where m and n are the effective lengths of the query and the database.

The statistical parameters λ and K are estimated by fitting the distribution of the un-gapped local alignment scores of the query sequence against many shuffled versions (global or local shuffling) of database sequence patterns to the Gumbel extreme value distribution.
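To make the formula concrete, here is a minimal sketch of the EVD-based p-value in Java. The parameter values in `main` are illustrative placeholders, not fitted constants; in practice λ and K come from the fitting procedure described above.

```java
public class BlastStats {

    /**
     * Probability of observing an ungapped alignment score S >= x
     * under the Gumbel extreme value distribution:
     * p = 1 - exp(-K * m * n * e^(-lambda * x)).
     */
    static double pValue(double k, double lambda, long m, long n, double x) {
        return 1.0 - Math.exp(-k * m * n * Math.exp(-lambda * x));
    }

    public static void main(String[] args) {
        // Illustrative (not fitted) parameter values.
        double k = 0.13, lambda = 0.32;
        // Higher scores give smaller p-values, as expected.
        System.out.println(pValue(k, lambda, 250, 1_000_000, 60.0));
        System.out.println(pValue(k, lambda, 250, 1_000_000, 80.0));
    }
}
```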

c) Conditions: if the word length W is too large, it will lead to too many patterns in the list L and vice versa; if too small, it will lead to too few patterns. If the threshold T is too large, then it’s very stringent (only conserved blocks) and if too small, it will lead to too many extensions. Look for high-scoring segment pairs (HSPs) and choose a cut-off for relevant hits.

d) Find high-scoring patterns of length W, and compile a list L of all W-mers that score > T against some pattern in the query sequence. Then scan the database for words in L; each positive hit is matched and extended. When the score drops more than X below the hitherto best score, stop the extension. Now report all words with a large score S.
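The seeding step can be sketched as follows. This is a toy version that only collects the exact W-mers of the query; a real implementation would also score neighbourhood words against a substitution matrix and keep those scoring > T.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class Seeding {

    /** Compile the list L of all distinct W-mers occurring in the query. */
    static List<String> wMers(String query, int w) {
        Set<String> words = new LinkedHashSet<>();
        for (int i = 0; i + w <= query.length(); i++) {
            words.add(query.substring(i, i + w));
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        // The duplicate "ACG" appears only once in L.
        System.out.println(wMers("ACGTACG", 3)); // [ACG, CGT, GTA, TAC]
    }
}
```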

In my previous post, I discussed the impact of the hash code and random number generators on a hashed fingerprint. They play a major role in the uniform distribution of the bits in a fixed length array and in the occurrence of bit clashes. In order to prove the concept, I have prepared a test case of 1200 molecules and performed a substructure search using the default CDK Fingerprinter class and its improved version (with the Apache Commons Lang HashCodeBuilder() method and the Mersenne Twister random number generator).

Each molecule was searched against other molecules in the dataset including itself. This was done at an interval of 200 data points. The gold standard was the substructure search results from the SMSD.

As expected, the improved version of the Fingerprinter class outperformed the present CDK Fingerprinter class. The number of false positives (FP) was reduced by 35-40% (due to minimal bit clashes), thereby increasing the accuracy of the results, while the true positives remained unchanged. This also made an overall positive impact on the speed of the search!

Fingerprints have been widely used in various fields to find similar features. Now, for those of you who are using your detective instincts and aiming for DNA fingerprints or biological fingerprints, I might disappoint you in the latter half of my post. Fingerprints are typically used to avoid cumbersome data comparison by using shorter “bit” strings. My focus will be on molecular fingerprints, which have been used by chemo-/bio-informaticians for finding similar molecular structures, i.e. finding a needle in a haystack! Theoretically, if you know the prerequisite “should have and should not have” features of the target molecules, then you can use a set of predefined keys to generate fingerprints. For example, the PubChem fingerprint, MACCS keys etc. are based on certain substructure/SMARTS keys which are expected to be found or skipped in your target. On the other hand, when we play with unknowns both at the level of query and target, then one of the fastest ways to go for the kill is hashed fingerprints. Typically, in a hashed fingerprint a set of patterns is generated by gathering atom environment information or subgraph information or both. The generated patterns are then transformed into hash codes (a fixed size message digest) using a hashing algorithm. These hash codes can then be transformed into bit positions in a string of a defined length (the size of the fingerprint) using random number generation. The presence and the absence of a pattern are marked as “1” and “0” respectively.
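Once molecules are encoded as fixed-length bit strings, comparing them becomes cheap bit arithmetic. Here is a minimal sketch of the Tanimoto score mentioned earlier, written against Java BitSets (how empty fingerprints are handled is my own convention here):

```java
import java.util.BitSet;

public class Tanimoto {

    /** Tanimoto score: |A AND B| / |A OR B|, in the range [0, 1]. */
    static double score(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        BitSet or = (BitSet) a.clone();
        or.or(b);
        // Two empty fingerprints are conventionally treated as dissimilar here.
        return or.isEmpty() ? 0.0 : (double) and.cardinality() / or.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(1024);
        a.set(1); a.set(5); a.set(9);
        BitSet b = new BitSet(1024);
        b.set(1); b.set(5); b.set(42);
        System.out.println(score(a, b)); // 2 common bits / 4 total bits = 0.5
    }
}
```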

Pros:

Hashed fingerprints are like a black box, with an assurance that similar patterns will have similar bits set to “1”. In the language of information science, you are allowing clashes of similar bits with a certain probability.

The size of the generated fingerprint can be controlled by the user, as predefined knowledge of the fingerprint patterns is not required.

Cons:

The resolution of the fingerprint depends on the algorithms used for generating the hash codes and random numbers.

It’s challenging to find a perfectly sized fingerprint which strikes a balance between minimising bitset clashes and wasting bit space.

Implementation

Let’s play with some real examples to understand the depth of the above statements. We need to generate some patterns from molecules and store them as fingerprints. In order to analyse the quality of the fingerprints, we will open the black box by keeping track of the generated pattern types. This will help us to quantify the patterns involved in the bitset clashes. Circular fingerprints or molecular signatures can be used to generate patterns of various diameters/heights for a molecule. By increasing the diameter/height, we can enrich the patterns/information about the molecule. However, this will also increase the overhead of balancing the fingerprint size and reducing the bit clashes.

Stage 1: Generate patterns using molecular signatures of heights 0 to 3 for every atom in the molecule. An example is illustrated in the figure below.

Circular / Signatures patterns encoded as fingerprints

Stage 2: Transform these patterns into SMARTS/SMILES/Signatures and generate a hash code for each pattern using your favourite algorithm.

Stage 3: Once we have the hash codes for these patterns, then using a random number generator, convert these hash codes into a bitset bucket of a fixed range (e.g. 1024).
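Stages 2 and 3 can be sketched as follows. For a self-contained example I am using the pattern string’s own hashCode() and java.util.Random as stand-ins for the HashCodeBuilder() and Mersenne Twister discussed later in the post:

```java
import java.util.BitSet;
import java.util.List;
import java.util.Random;

public class HashedFingerprint {

    /** Stages 2 and 3: pattern string -> hash code -> bit position. */
    static BitSet encode(List<String> patterns, int size) {
        BitSet bits = new BitSet(size);
        for (String pattern : patterns) {
            int hash = pattern.hashCode();   // stage 2: hash the pattern
            Random rng = new Random(hash);   // stage 3: seed the RNG with the hash
            bits.set(rng.nextInt(size));     // one bit per distinct pattern
        }
        return bits;
    }

    public static void main(String[] args) {
        // These patterns stand in for the SMILES/signature strings from stage 1.
        BitSet fp = encode(List.of("C", "O", "C(=O)O"), 1024);
        System.out.println(fp.cardinality()); // at most 3 bits set (clashes possible)
    }
}
```

Because the RNG is seeded deterministically by the hash code, the same pattern always lands on the same bit, which is exactly what makes the fingerprint reproducible.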

I have used the CDK to generate molecular signatures (σ) of various heights (0 to 3) for 5000 mols. These signatures were transformed into canonical SMILES, and a hash code was generated using the Apache Commons Lang HashCodeBuilder() method (better than the default Java hashCode() due to its flexibility). Well, you could use any method you like as long as equal objects produce the same hash code and unequal objects produce distinct hash codes. Some of the most common hash code generation algorithms are MD5, SHA, PJW (Peter Weinberger’s hash) etc. The choice is made on the basis of data distribution (the balance between random generation vs patterned generation) and hashing function efficiency (it should be very quick, stable and deterministic).

Now the tricky part is the conversion of hash codes into a fingerprint. I have used the famous Mersenne Twister random number generator. This yields better results than the default Java Random() method in terms of minimising the bit clashes and maximising the bitset resolution.

Here are a few statistical measures regarding the patterns generated and encoded into fingerprint bitsets.

Statistical Measure (5000 mols)        | Height 0    | Height 1     | Height 2      | Height 3
---------------------------------------|-------------|--------------|---------------|--------------
Unique Pattern Count (UPC)             | 53          | 426          | 4083          | 14448
Average number of patterns/fingerprint | 3.09 ± 1.04 | 10.34 ± 5.82 | 15.16 ± 10.01 | 17.01 ± 13.07
Median number of patterns/fingerprint  | 3           | 9            | 13            | 13
Max. number of patterns/fingerprint    | 7           | 35           | 64            | 89

In order to understand the resolution of the fingerprints with respect to bit clashes and fingerprint size, I generated fingerprints of various sizes (ranging from 128 to 8192 bits). A fingerprint size of 1024 bits seems like a good bet for signatures of height up to 2 (as marked in the graph below), while 4096 bits stands good for signatures of height 3 (more than 95% of the bitsets are used and a smaller percentage of bits clash).

BitSet usage vs Bit Clash in the hashed fingerprints
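The bit-clash statistic behind the graph can be reproduced in spirit with a small counter. As before, the string hashCode()/Random pair is a stand-in for HashCodeBuilder()/Mersenne Twister, and the SMILES strings are just illustrative patterns:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class ClashCounter {

    /** Count how many distinct patterns land on an already occupied bit. */
    static int clashes(List<String> distinctPatterns, int fingerprintSize) {
        Set<Integer> occupied = new HashSet<>();
        int clashes = 0;
        for (String pattern : distinctPatterns) {
            int bit = new Random(pattern.hashCode()).nextInt(fingerprintSize);
            if (!occupied.add(bit)) {
                clashes++;
            }
        }
        return clashes;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("C", "O", "N", "c1ccccc1", "C(=O)O");
        // A one-bit fingerprint forces every pattern after the first to clash.
        System.out.println(clashes(patterns, 1));    // 4
        // A roomy fingerprint leaves far fewer (typically zero) clashes.
        System.out.println(clashes(patterns, 8192));
    }
}
```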

Analysis

From the above figure, it is clear that one of the key improvements which can be made to hashed fingerprints is to divide them into sub-fingerprints. Each sub-fingerprint can then be populated with a certain chemical/subgraph property of the molecule. Say, in the case of a molecular fingerprint of 1024 bits, one can divide the fingerprint into two sub-fingerprints:

a) One of 256 bits for storing labelled atom types and,

b) The second, of 768 bits for graph/topological information.

The hash code for the atom-typed section is derived from a concatenated label string of the CDK atom type + presence of the atom in a ring system + stereo, for each atom in a molecule (you could choose your own physicochemical labelling schema). The signatures/graph section can be populated with signatures/circular fingerprints of height/diameter 2. The sub-fingerprints are easy to achieve and store with the above mentioned process due to the flexibility of generating hash codes within a range. The idea is to get the best of both worlds, i.e. physicochemical properties and subgraph patterns.
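The 256/768 split can be sketched by constraining the bit position derived from each hash code to its section’s range. The label strings here are hypothetical, and the hashCode()/Random pair again stands in for HashCodeBuilder()/Mersenne Twister:

```java
import java.util.BitSet;
import java.util.Random;

public class SubFingerprint {

    static final int ATOM_TYPE_BITS = 256; // section a) labelled atom types
    static final int GRAPH_BITS = 768;     // section b) graph/topological patterns

    /** Map a pattern into its own section of the 1024-bit fingerprint. */
    static int bitFor(String pattern, boolean isAtomTypePattern) {
        Random rng = new Random(pattern.hashCode());
        return isAtomTypePattern
                ? rng.nextInt(ATOM_TYPE_BITS)               // bits 0..255
                : ATOM_TYPE_BITS + rng.nextInt(GRAPH_BITS); // bits 256..1023
    }

    public static void main(String[] args) {
        BitSet fp = new BitSet(ATOM_TYPE_BITS + GRAPH_BITS);
        fp.set(bitFor("C.sp3;ring=false", true)); // hypothetical atom-type label
        fp.set(bitFor("C(C)(C)O", false));        // hypothetical graph signature
        System.out.println(fp);
    }
}
```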

Conclusion

The quality of a hashed fingerprint depends a lot on the patterns generated (UPC), the size of the bitset, the hashing function and the random number generator. The next step for me would be to cluster these similarity matrices or perform a Leave One Out test on the dataset to check the specificity and sensitivity of the model.

References:

Further reading and the references therein will give you more insight into the story:

Well, for many of us this might be a regular exercise in a chemical editor, but it is a little trickier to program. Mathematically, we are aiming for a union between two molecules. The union between two molecules can be defined as n(A ∪ B) = n(A) + n(B) − n(A ∩ B), where A is the query molecule and B is the target molecule. The intersection (A ∩ B) can be obtained by finding an isomorphism between the two molecules. The joining part is a little challenging, as not all combination(s) might be chemically valid. So we need to find combination(s) which are unique and unsaturated!
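The counting identity itself is a one-liner; the hard part, chemically valid joining, is what the SMSD/CDK code has to do. A quick sanity check in Java, with the overlap size standing in for the MCS atom count:

```java
public class MoleculeUnion {

    /** Inclusion-exclusion on atom counts: n(A ∪ B) = n(A) + n(B) − n(A ∩ B). */
    static int unionAtomCount(int atomsInQuery, int atomsInTarget, int atomsInOverlap) {
        return atomsInQuery + atomsInTarget - atomsInOverlap;
    }

    public static void main(String[] args) {
        // Two six-carbon rings fused over a shared two-atom edge give a
        // ten-atom naphthalene-like skeleton: 6 + 6 - 2 = 10.
        System.out.println(unionAtomCount(6, 6, 2)); // 10
    }
}
```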

Here is a little snippet to perform this task in Java using the SMSD and CDK.

Here are some interesting results, as they highlight the run times of three flavours of substructure search algorithms. I have chosen: a) the UIT (Universal Isomorphism Tester) as implemented in the CDK (Chemistry Development Kit); b) VF2 as implemented in the SMSD (I have further optimised it); and c) VF2 as ported from ChemKit. The former two implementations (i.e. a and b) are optimised for substructure search whereas the latter implementation (i.e. c) is optimised for the graph automorphism search (a further extension would be good/ideal).

While benchmarking the substructure search, I realised that the CDK AtomContainer takes an ample amount of time in methods such as getConnectedAtomsList(atom) or getConnectedAtomsCount(atom). Hence, I ended up refactoring the AtomContainer to store an adjacency map of atoms and bonds. This speeds up the above mentioned methods, so that the benchmark results bring out the actual differences in the run time. Please find the updated AtomContainer here (there is scope for further optimisation).
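The idea of the refactoring can be sketched with a plain adjacency map; the real change lives inside the CDK AtomContainer, so the container class and the use of plain objects as atoms here are stand-ins of my own:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdjacencyContainer {

    // Stand-in for CDK atoms: any object identity works as a key.
    private final Map<Object, List<Object>> adjacency = new HashMap<>();

    void addBond(Object atom1, Object atom2) {
        adjacency.computeIfAbsent(atom1, a -> new ArrayList<>()).add(atom2);
        adjacency.computeIfAbsent(atom2, a -> new ArrayList<>()).add(atom1);
    }

    /** Map lookup instead of scanning every bond in the container. */
    List<Object> getConnectedAtomsList(Object atom) {
        return adjacency.getOrDefault(atom, List.of());
    }

    int getConnectedAtomsCount(Object atom) {
        return getConnectedAtomsList(atom).size();
    }

    public static void main(String[] args) {
        AdjacencyContainer propane = new AdjacencyContainer();
        propane.addBond("C1", "C2");
        propane.addBond("C2", "C3");
        System.out.println(propane.getConnectedAtomsCount("C2")); // 2
    }
}
```

Maintaining the map at bond-insertion time trades a little memory for constant-time neighbour queries, which is exactly what the VF2-style matchers hammer on.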

Now let’s focus on the results from the benchmark 🙂

Benchmark results from the substructure search. X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

The comparative study/benchmark results between the three algorithms revealed some very interesting characteristics.

1) ChemKit version of the VF2 is very fast as it just looks for graph automorphisms (light green bars on the graph).

2) CDK UIT seems to perform better on graphs with atom count <= 19.

3) The VF2 implementation in the SMSD performs better on graphs with atom count > 19, and yes, this implementation is better off in the worst case (i.e. if no hits were reported, e.g. the first bar with 11 atoms).

Further permutation based tests will reveal more intricacies, artefacts and the innate nature of these algorithms and implementations. If you are interested in these implementations and test cases, please feel free to refer to the benchmark code on GitHub.

Update (16 March 2011): The VF2 implementation from ChemKit now calculates substructure isomorphism. Thanks to Kyle Lutz for his feedback. The new runtime of the ChemKit VF2 implementation looks as good as the one in the SMSD, except that the rejection of a solution (in case of a mismatch) is faster in the SMSD version of VF2 whereas matching is faster in the former. The rest of the facts regarding size and runtime mentioned in the graph above still hold good. Below is the updated graph.

Update (17 March 2011): Further optimisation of the ChemKit ported VF2 implementation makes it the fastest (for reporting matches and rejecting solutions; refer to the benchmark code on GitHub).

Performance benchmark of substructure searches VF2 vs CDK’s UIT

X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

Update (17 March 2011): Gilleain and I were able to code an ultrafast VF2 algorithm for isomorphism/substructure search (the permutation bug is resolved). The code is ported/re-factored from ChemKit and I have further optimised it at our end. The previous VF2 code in the SMSD is more MCS oriented, which was a big compromise in terms of speed while performing a substructure search. I have mentioned these issues in a previous blog. The whole idea of a substructure search is to answer a boolean question, i.e. is A a subgraph/substructure of B? If the answer is true then the matching atoms are reported. On the other hand, MCS is about reporting the commonality between A and B.

Below is the benchmark graph which speaks to the performance of VF2 (30 queries were subjected to 15030 targets, the same as in my previous benchmark test cases).

Substructure benchmark results between VF and CDK’s UIT

X-axis: 30 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

As is apparent from the benchmark graph, VF2 takes less than 10-20% of the total time (in most cases) for performing substructure searches (results are sorted in ascending order of the query atom count). So for now we have an implementation which at last outperforms the UIT (CDK code) in terms of speed for automorphic graphs. The performance of VF2 improves as the query atom count (graph size) increases. We might need a few changes in the CDK AtomContainer class itself to further improve the speed/performance, but this will involve further discussions with the CDK developers.


The design principle of the SMSD code was always to find the Maximum Common Subgraph (MCS) between two molecules. Of course, it can be optimised to find substructures between two molecules. I was lazy in doing so, but a recent post on the CDK mailing list about SMSD based substructure search performance was a wake-up call for me.

I have made some changes to the SMSD code and now the substructure search is not that bad. Nonetheless, I have to redesign the substructure search part of the SMSD, as a substructure search ideally shouldn’t perform an exhaustive comparison between all possible combinations of the graph. It’s usually a boolean question, i.e. is A a subgraph/substructure of B, unlike in the case of MCS where it’s about reporting the commonality between A and B.
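The boolean question can be sketched with naive backtracking over adjacency matrices. This is not VF2 (no feasibility pruning, no atom/bond labels), but it shows the key difference from MCS: it exits on the first embedding instead of enumerating the commonality:

```java
public class SubgraphCheck {

    /** Is the query a subgraph of the target (every query edge preserved)? */
    static boolean isSubgraph(boolean[][] query, boolean[][] target) {
        return extend(query, target, new int[query.length],
                      new boolean[target.length], 0);
    }

    private static boolean extend(boolean[][] q, boolean[][] t,
                                  int[] map, boolean[] used, int depth) {
        if (depth == q.length) {
            return true; // every query vertex mapped: answer the boolean and stop
        }
        for (int v = 0; v < t.length; v++) {
            if (used[v]) continue;
            boolean feasible = true;
            for (int u = 0; u < depth; u++) {
                // Every edge to an already-mapped query vertex must exist in t.
                if (q[depth][u] && !t[v][map[u]]) { feasible = false; break; }
            }
            if (feasible) {
                map[depth] = v;
                used[v] = true;
                if (extend(q, t, map, used, depth + 1)) return true;
                used[v] = false; // backtrack
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean[][] edge = {{false, true}, {true, false}};    // two bonded atoms
        boolean[][] triangle = {{false, true, true},
                                {true, false, true},
                                {true, true, false}};         // three-membered ring
        System.out.println(isSubgraph(edge, triangle));       // true
        System.out.println(isSubgraph(triangle, edge));       // false
    }
}
```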

So now one can run a substructure search in the SMSD using the following code.

Results: Here is a small substructure search benchmark test between the UIT and the SMSD. I have taken 30 query molecules and fired each of them against a decoy set of 15030 molecules. At least one hit is expected from each query run.

Substructure benchmark between SMSD vs UIT

In this case the UIT seems to have an edge over the SMSD as, in a few cases, the latter takes more time. I have to check whether the real difference lies in the implementation of the code or in the actual algorithm itself.

The change(s) required in the SMSD code will involve heavy refactoring of the VF lib code. The present code works fine for MCS, but it’s time to optimise it for substructure search. Any suggestions for optimising the VF code would be more than welcome. Please refer to the following class if you have any suggestions 🙂