Sunday, October 27, 2013

Comparing fingerprints to each other. Part 1

Goal: Look at the differences between similarity methods.

This uses a set of pairs of molecules that have a baseline similarity: a Tanimoto similarity using count-based Morgan0 fingerprints of at least 0.7. The construction of this set was presented in an earlier post: http://rdkit.blogspot.com/2013/10/building-similarity-comparison-set-goal.html.
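To make the baseline criterion concrete, here is a minimal pure-Python sketch of a count-based Tanimoto similarity. It is not the RDKit implementation. The function name `count_tanimoto`, the toy feature keys, and the choice to return 0.0 for two empty fingerprints are assumptions made for illustration. Fingerprints are modeled as feature-to-count mappings, which is roughly what a radius-0 Morgan fingerprint amounts to (counts of atom-level features).

```python
from collections import Counter


def count_tanimoto(fp_a, fp_b):
    """Tanimoto similarity for count-based fingerprints (feature -> count).

    The shared part is the sum of per-feature minimum counts; the
    denominator is the total count in both fingerprints minus that shared part.
    """
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    denom = sum(fp_a.values()) + sum(fp_b.values()) - shared
    if denom == 0:
        # Assumption: two empty fingerprints are treated as dissimilar.
        return 0.0
    return shared / denom


# Hypothetical toy fingerprints keyed by element symbol, for illustration only.
fp_a = Counter({"C": 6, "O": 1, "N": 1})
fp_b = Counter({"C": 5, "O": 2, "N": 1})

sim = count_tanimoto(fp_a, fp_b)
passes_baseline = sim >= 0.7  # the threshold used to build the set of pairs
print(sim, passes_baseline)
```

In this toy case the two fingerprints share 7 feature counts out of 9 in the union, so the pair would clear the 0.7 baseline.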

Note: this notebook and the data it uses/generates can be found in the GitHub repo: https://github.com/greglandrum/rdkit_blog

Set up

Do the usual imports, read in the molecules, set up the fingerprints we'll compare, and calculate the similarities between the pairs of molecules using those fingerprints.

What about methods that work well for similarity-based virtual screening?

Look at the methods that we found to be "best" as measured by AUC for similarity-based virtual screening in our benchmarking paper (http://www.jcheminf.com/content/5/1/26). The table itself is here: http://www.jcheminf.com/content/5/1/26/table/T1

I've put "best" in quotes here because there wasn't a statistically significant difference in performance between these methods.

The Tau values are still pretty low. The rankings from these fingerprints tend to have a low correlation with each other.
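For readers who want to see what such a rank correlation measures, here is a minimal sketch, assuming that "Tau" refers to Kendall's tau. It implements the simple tau-a variant in pure Python rather than whatever library routine the notebook used. The function name `kendall_tau` and the similarity values are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Pairs that are tied in either list count as neither concordant
    nor discordant.
    """
    if len(xs) != len(ys):
        raise ValueError("both lists must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical similarities for the same five molecule pairs,
# computed with two different fingerprints.
sims_a = [0.9, 0.8, 0.7, 0.6, 0.5]
sims_b = [0.5, 0.9, 0.6, 0.8, 0.7]

tau = kendall_tau(sims_a, sims_b)
print(tau)
```

A value near 1 means the two fingerprints order the pairs almost identically; values near 0 mean the orderings are close to unrelated, and negative values mean they tend to disagree.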

The comparison in the benchmarking paper showed, on the other hand, that across a broad range of data sets the fingerprints perform at about the same level when it comes to enrichment. Either there's a contradiction, or this set of pairs isn't particularly representative of what we used for that paper.

Even more concrete: look at the number of overlapping pairs in that pick
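One way to count overlapping pairs is sketched below. It assumes the "pick" is the N highest-scoring pairs under each fingerprint, which the text does not state. The helper `top_n_pairs`, the pair identifiers, and the similarity values are hypothetical.

```python
def top_n_pairs(similarities, n):
    """Return the set of pair ids with the n highest similarity values."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return set(ranked[:n])


# Hypothetical similarities for the same five pairs under two fingerprints.
sims_fp1 = {"pair1": 0.95, "pair2": 0.90, "pair3": 0.85, "pair4": 0.60, "pair5": 0.55}
sims_fp2 = {"pair1": 0.70, "pair2": 0.92, "pair3": 0.50, "pair4": 0.88, "pair5": 0.91}

top_fp1 = top_n_pairs(sims_fp1, 3)
top_fp2 = top_n_pairs(sims_fp2, 3)

# Pairs that both fingerprints place among their top 3.
overlap = top_fp1 & top_fp2
print(len(overlap), sorted(overlap))
```

A small overlap between the two top-N sets says the same thing as a low rank correlation, in more tangible terms: the fingerprints disagree about which pairs are the most similar.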