Well for many of us this might be a regular exercise on a chemical editor but this is a little trickier to program. So mathematically, we are aiming for a union between two molecules. The union between two molecules can be defined as (A ∪ B) = n(A) + n(B) – (A ∩ B), where A is a query molecule and B is the target molecule. The intersection (A ∩ B) can be obtained by finding isomorphism between two molecules. The joining part is a little challenging as not all combination(s) might be chemically valid. So we need to find combination(s) which are unique and unsaturated!

Here is a little snippetto perform this task in Java using the SMSD and CDK.

Here are some interesting results as they highlight the run time between three flavours of substructure search algorithms. I have chosen, a) UIT (Universal Isomorphism Tester) as implemented in the CDK (Chemistry Development Kit); b) VF2 as implemented in the SMSD (I have further optimised it); and c) VF2 as ported from ChemKit. The former two implementations (i.e. a and b) are optimised for substructure search whereas the later implementation (i.e. c) is optimised for the graph automorphism search (further extension would be good/ideal).

While benchmarking the substructure search, I realised that the CDK AtomContainer takes ample amount of time for methods such as getConnectedAtomsList(atom) or getConnectedAtomsCount(atom). Hence, I ended up refactoring the AtomContainer to store the adjacency map of atoms and bonds. This will speed up the above mentioned methods and the benchmark results will bring out the actual differences in the run time. Please find the updated AtomContainer here (there is scope for further optimisation).

Now let’s focus on the results from the benchmark 🙂

Benchmark results from the substructure search. X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

The comparative study/benchmark results between the three algorithms revealed some very interesting characteristics.

1) ChemKit version of the VF2 is very fast as it just looks for graph automorphisms (light green bars on the graph).

2) CDK UIT seems to perform better on graphs with atom count <= 19.

3) VF2 implementation in the SMSD performs better on graphs with atom count > 19 and yes this implementation is better off in the worst case (i.e. if no hits were reported, e.g.: first bar with 11 atoms).

Further permutation based tests will reveal more intricacies, artefacts and the innate nature of these algorithms and implementation. If you are interested in these implementations and test cases please feel free to refer to the benchmark code on the github.

Update (16 March 2011): VF2 implementation from the Chemkit now calculates substructure isomorphism. Thanks to Kyle Lutz for his feedback. The new runtime of the VF2 chemkit implementation looks as good as the one in the SMSD except that the rejection of a solution (in case of mismatch) is faster in the SMSD version of the VF2 whereas matching is faster in the former. Rest of the facts regarding size and runtime mentioned in the graph above still hold good. Below is the updated graph.

Update (17 March 2011): Further optimisation of the ChemKit ported VF2 implementation makes is fastest (reporting the matches and rejection of solution; refer to the benchmark code on the github).

Performance benchmark of substructure searches VF2 vs CDK’s UIT

X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

Updated (17 March 2010): Gilleain and myself were able to code an ultrafast VF2 algorithm for Isomorphism substructure search (permutation bug is resolved). The code is ported or re-factored from Chemkit and I have further optimised the code at our end. The previous VF2 code in the SMSD is more MCS oriented which was a big compromise in terms of speed while performing a substructure search. I have mentioned these issues in a previous blog. The whole idea of a substructure search is to answer a boolean question i.e. is A a subgraph/substructure of B ? If the answer is true then matching atoms are reported. On the other hand MCS where it’s about reporting the commonality between A and B.

Below is the benchmark graph which speaks about the performance (30 queries were subjected to 15030 targets, same as my previous benchmark test cases) of the VF2.

Substructure benchmark results between VF and CDK’s UIT

X-axis: 30 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

As is apparent from the benchmark graph, VF2 takes less than 10-20% of the total time (in most cases) for performing substructure searches (results are sorted in ascending order of the query atom count). So as for now we have an implementation which at last outperforms UIT (CDK code) in terms of speed for automorphic graphs. The performance of the VF2 improves as the query atom count (graph size) increases. We might need a few changes in the CDK AtomContainer class itself to further improve the speed/performance but this will involve further discussions with CDK developers.