Purchasable compound similarity maps

Community

Subject

More than 5 million compounds are available to purchase according to the meta service, E-molecules (http://www.emolecules.com).
It is worth exploring these in the context of the OSDD project, as doing so will identify compound series that are very easy to explore by purchasing analogues (i.e. SAR by catalogue), as well as compounds that are potentially more synthetically accessible than others (i.e. if a compound has many close neighbours, it might be easier to make than others).
To this end I have generated three chemical similarity maps, showing purchasable compounds that are very similar to known anti-malarials.

It would be great to get feedback on this approach, namely:

1) Does this type of visualization work for chemists? Is it straightforward to download Cytoscape, install ChemViz and load the network?
2) What do we know about the SAR of these compounds? This would help to prioritize/focus our search of chemical space. While I used ECFP_4 fingerprints, other similarity measures weight structural features differently (e.g. if we know a particular substructure is key, then all compounds should contain it, etc.).
3) While I think the network view is great for a global overview of the compounds available (and can be overlaid with any other types of data that we can think of, such as predicted targets etc.), perhaps there is a better visualization for a smaller number of compounds?
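
To make the idea behind the maps concrete, here is a minimal sketch of how a similarity network can be derived from fingerprints. It is not the workflow used to build the posted maps: the fingerprints below are toy sets of "on" bit indices (real ECFP_4 bits would come from a cheminformatics toolkit), and the 0.7 Tanimoto cut-off and all compound names are assumptions for illustration only.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_edges(fingerprints: dict, threshold: float = 0.7) -> list:
    """Return (id1, id2, similarity) for every pair at or above the threshold."""
    edges = []
    for id1, id2 in combinations(sorted(fingerprints), 2):
        score = tanimoto(fingerprints[id1], fingerprints[id2])
        if score >= threshold:
            edges.append((id1, id2, round(score, 3)))
    return edges


# Toy fingerprints (hypothetical bit indices, not real ECFP_4 output).
toy_fps = {
    "known_active_1": frozenset({1, 2, 3, 4, 5, 6, 7}),
    "purchasable_A": frozenset({1, 2, 3, 4, 5, 6, 8}),
    "purchasable_B": frozenset({20, 21, 22}),
}

edges = similarity_edges(toy_fps)
print(edges)
```

Each edge becomes a line in the network view, so raising the threshold gives tighter, smaller clusters and lowering it merges series together.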

Comments

Iain - this is very interesting. Cytoscape/ChemViz are a doddle to install. So - let me clarify three things:

1) If a compound appears in one of the three maps you posted above then it's commercially available? (was that through emolecules?)
2) How do we visualise structures? I see there is a SMILES when I click once, but is there a possibility of a little window displaying the structure?
3) (Non-essential) One of the next steps would be to contact suppliers. It would save a non-trivial amount of time if we could group availability of compounds by vendor. Non-essential, just convenient!

Each of the maps is interesting in its own right, but I will flag the GSK one up to Javier and Felix at Tres Cantos - it's a very nice layer on top of their "Open Invitation" paper and could have ramifications for target prioritization. It's also important for us (at Sydney) to look at the other two targets Jimmy Cronshaw is pursuing.

1) Yeah, if a node is blue it means it was found in E-Molecules. The node label is the e-molecules id. The downloaded dataset was from 1-Mar-2012, so not all molecules will necessarily be available today.
2) This is what I posted in response to a comment on Google+ (https://plus.google.com/u/0/114702323662314783325/posts/fp4qfLsBkUe)
Once it is installed, you can either:
a) Right-click on a node and select "Cheminformatics Tools" -> "Depict 2D Structure" -> "Show 2D Structures from this Node".
b) Alternatively, if that doesn't work (occasionally it doesn't), select the node or nodes of interest, then select from the menu bar "Plugins" -> "Cheminformatic Tools" -> "Show table of compounds from selected nodes".
This will generate a table showing some information from various attributes. Columns can be deleted or resized.
There are some more details on general Cytoscape usage linked there as well.
3) To get suppliers, you can do the following:
From the Cytoscape menu, press "Select" -> "Nodes" -> "All Nodes".
In the node attribute browser at the bottom of the screen, select all the ids by clicking the first one, holding shift and pressing the down arrow. Once all the ids are selected/highlighted in the table, right-click and select "Copy".
Now the ids can be pasted into the emolecules website (http://www.emolecules.com/) -> click "Enter a list".
On the list page, select "Emolecules ID", paste the id list and search.
On the results page, click "Export" -> "Excel Spreadsheet".
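
Once the spreadsheet has been exported, grouping by vendor is a few lines of scripting. The sketch below assumes the export has been saved as CSV and that it has an id column and a supplier column; the column names `emolecules_id` and `supplier`, and the sample rows, are made up for illustration and will need adjusting to match the real export.

```python
import csv
import io
from collections import defaultdict


def group_by_vendor(csv_file, id_column="emolecules_id", vendor_column="supplier"):
    """Map each vendor name to the list of compound ids it supplies.

    The column names are assumptions; check the header row of the real export.
    """
    vendors = defaultdict(list)
    for row in csv.DictReader(csv_file):
        vendors[row[vendor_column].strip()].append(row[id_column].strip())
    return dict(vendors)


# Made-up sample standing in for the exported spreadsheet saved as CSV.
sample_export = io.StringIO(
    "emolecules_id,supplier\n"
    "1001,Vendor A\n"
    "1002,Vendor B\n"
    "1003,Vendor A\n"
)

by_vendor = group_by_vendor(sample_export)
for vendor, ids in sorted(by_vendor.items()):
    print(vendor, len(ids), ids)
```

For a real file, replace the `io.StringIO` sample with `open("export.csv", newline="")` (the file name is a placeholder).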

Great!
It will be very interesting to see which compounds you identify as likely candidates, and why. As I mentioned above, it is possible to regenerate the maps based on the type of compounds you want to purchase.

Reproduction (with permission) of an email correspondence on this question:

Javier Gamo from GSK:

“Hi Matt,
Your rationale is good, but amenability of the chemistry (i.e. easy chemistry, to have a good number of derivatives in a short time) should be another one if the initial goal is to get as many derivatives as possible to establish a preliminary SAR. This could be a good possibility provided there are no other major differences... In any case I should feel more comfortable driving the choice by a good biological profile. This could give you a better opportunity of starting with a good parasite target from the beginning and increasing the chances of getting your desired therapeutic profile.

Comments?

Javier”

Iain Wallace responded, also by email:

“Hi Javier,

I would be very interested to know if the number of commercially available compounds correlates with amenability of the chemistry. Or maybe this is very lab specific?

Regardless, I would agree the biological profile should be the main driver in deciding which compounds to screen. However, my understanding (admittedly naive) is that all of these compounds are equally attractive starting points, with no additional biological information available other than that they are potent against the parasite?

We can also use these maps to identify which compound series is the largest. A cluster containing many known anti-malarials should have a lot of SAR information that could be used to guide synthesis design.
Conversely, a cluster containing only one known anti-malarial, or a singleton, would not be a great starting point for lead optimization as we would have no SAR information.”
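
As a rough illustration of the cluster-size idea, the sketch below treats each connected component of the similarity network as a series and counts how many known anti-malarials it contains. The node names and edges are toy data, and using connected components as the definition of a "cluster" is an assumption; a proper clustering method could give different groupings.

```python
from collections import defaultdict


def clusters_from_edges(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:  # depth-first walk from the start node
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbours[node] - members)
        seen |= members
        clusters.append(members)
    return clusters


def summarise(clusters, known_actives):
    """Rows of (cluster size, number of known actives, members), most actives first."""
    rows = [(len(c), len(c & known_actives), sorted(c)) for c in clusters]
    return sorted(rows, key=lambda r: (-r[1], -r[0]))


# Toy network: three known actives and one purchasable analogue.
nodes = {"act1", "act2", "act3", "buy1"}
edges = [("act1", "act2"), ("act2", "buy1")]
known = {"act1", "act2", "act3"}

summary = summarise(clusters_from_edges(nodes, edges), known)
for size, n_actives, members in summary:
    print(size, n_actives, members)
```

Here `act3` comes out as a singleton with no neighbours, which is the case described above as having no SAR information to build on.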

Yes, BUT, sometimes we like the small series or singletons because those may represent unusual compounds with new mechanisms of action that may complement what's known. So a cluster may be a safer bet for deriving a new bioactive, but not necessarily the best for diversity. There must be an analogy there...

Yeah, I think it all depends on what you would like to do and what the ideal end result would be.
For example, if you wanted to identify compounds that have good anti-malarial properties and that you can quickly build structure-activity relationship data for, you could prioritize compounds that
1) already have lots of SAR info from the large publicly available screens
2) have lots of purchasable analogs
3) are easy to make
4) and importantly don't look similar to known anti-malaria compounds in the clinic.

Alternatively, if you wanted to identify novel, unique anti-malarial compounds then you could
1) look for active singletons
2) that don't have any purchasable analogs
3) that don't appear in the patent literature
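
The two strategies above can be sketched as two simple filters over a table of series. Everything here is hypothetical: the field names, the toy numbers, and the choice to sort by public SAR data first and purchasable analogues second. "Easy to make" is left out because it is not obvious how to score it automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    name: str
    known_actives: int             # actives with public SAR data in the cluster
    purchasable_analogs: int       # close neighbours that can be bought
    resembles_clinical_drug: bool  # looks like a known clinical anti-malarial
    in_patent_literature: bool


def rank_for_fast_sar(series_list):
    """Strategy 1: lots of SAR data and analogues, unlike clinical compounds."""
    candidates = [s for s in series_list if not s.resembles_clinical_drug]
    return sorted(candidates, key=lambda s: (-s.known_actives, -s.purchasable_analogs))


def pick_novel_singletons(series_list):
    """Strategy 2: active singletons with no analogues and no patent coverage."""
    return [
        s for s in series_list
        if s.known_actives == 1
        and s.purchasable_analogs == 0
        and not s.in_patent_literature
    ]


# Toy data, invented purely to exercise the two functions.
toy_series = [
    Series("series_A", 12, 40, False, True),
    Series("series_B", 3, 5, False, False),
    Series("series_C", 20, 60, True, True),
    Series("singleton_D", 1, 0, False, False),
]

fast_sar = [s.name for s in rank_for_fast_sar(toy_series)]
novel = [s.name for s in pick_novel_singletons(toy_series)]
print(fast_sar)
print(novel)
```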

The second option seems to be much more costly and difficult, but that is just my risk-averse two cents. There are probably other aspects that I haven't considered either, which might be much more important.
Personally, I don't understand the drive for diversity for diversity's sake, because if that was all that mattered we should be just screening and making natural products. And I am not sure how well the idea that a novel, diverse compound = a unique mode of action holds up. According to people who screen natural products, the same mode of action is repeatedly found in compounds that are structurally different (this is a really nice review of natural products for anti-bacterial compounds carried out at Merck http://www.sciencedirect.com/science/article/pii/S1074552111000354).

Mat/Iain, my takes are somewhat orthogonal to yours but I hope they add to the discussion.

Pharma’s need to generate IP has impelled them to pack their screening collections with Chinese/Indian/Russian libraries that have designed-in novelty and diversity but also a high proportion of “no hitters” because they are too divergent from known bioactives. We don’t have this problem and should therefore be circumspect about such strategies.

IMCO the best predictor of possible new (antimalarial) bioactivity is simply any previous bioactivity, in vitro or in vivo, from any source (paper, patent, poster or public assay).

This thus includes ~ 40K old drug leads, ~ 200K NPs, ~ 3-5 million actives from patents and ~ 1 million from ChEMBL (that includes the confirmed PubChem BioAssay hits).

What we might want to emulate from pharma is their strategy of working up several series of diverse chemotypes that can serve as back-ups when one of them runs into trouble. They can also (on a good day) cross-corroborate pharmacophores/isosteres and other types of VS models.

However, as has been pointed out, with our “bag of targets” different series may have different molecular MoAs. This can obviously have advantages but may confound SAR modeling between series.

Verified (and QC’d) singletons can be valuable founders of new series depending on why they are singletons, e.g. if analogues are sparse or they have idiosyncratic discontinuous SAR.