Friday, May 29, 2015

Andrew Dalke has been working on putting together a dataset for testing substructure screenout performance for a while now. Due to concerns about privacy and/or confidentiality it's really difficult to get "real world" data on the kinds of things that scientists (really mostly chemists) look for in large-scale research systems like the data warehouses one finds in pharma companies. We have the data, but because of the nature of the searches we can't share it. Andrew visited us recently and the topic of substructure screening came up again. (Aside: There were three people in that room who cared deeply about substructure screening and who have invested some real thought and effort in it, how often does that happen? Sometimes I love my job!) During the conversation we came up with a way to fake having access to our actual queries. This post is about that.

One of the common use cases we see for substructure searching in the warehouse is to find all compounds that are in a particular chemical series. We obviously can't disclose the series that people are searching our database for, but we can fake it by collecting a bunch of substructure queries defining medchem-relevant chemical scaffolds and using those. This would have the added advantage of including things that have query features (variable atom and bond lists). Now we just need a source of medchem relevant scaffolds.

I did a post a while ago on the composition of the molecule sets we called "Datasets 2" in the model fusion paper. Since these are pulled from individual publications found in ChEMBL they tend to be made up of a set of molecules from one scaffold and then a few other molecules. One of the things I showed in that post was a method for estimating what the scaffold is in these sets using the RDKit's MCS code. This builds on that.

# define a function to convert a query molecule to SVG using the new rendering code.fromIPython.displayimportSVGfromrdkit.ChemimportrdDepictorfromrdkit.Chem.DrawimportrdMolDraw2Ddefmoltosvg(mol,molSize=(450,250),kekulize=True):mc=Chem.Mol(mol.ToBinary())ifkekulize:try:Chem.Kekulize(mc)except:mc=Chem.Mol(mol.ToBinary())ifnotmc.GetNumConformers():rdDepictor.Compute2DCoords(mc)drawer=rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])# the MolDraw2D code is not very good at the moment at dealing with atom queries,# this is a workaround until that's fixed.# The rendering is still not going to be perfect because query bonds are not properly indicatedopts=drawer.drawOptions()foratominmc.GetAtoms():ifatom.HasQuery()andatom.DescribeQuery().find('AtomAtomicNum')!=0:opts.atomLabels[atom.GetIdx()]=atom.GetSmarts()drawer.DrawMolecule(mc)drawer.FinishDrawing()svg=drawer.GetDrawingText()# It seems that the svg renderer used doesn't quite hit the spec.# Here are some fixes to make it work in the notebook, although I think# the underlying issue needs to be resolved at the generation stepreturnsvg.replace('svg:','')

In [43]:

# define a function to find the MCS for a set of molecules and return some summary info about that MCSimporttime# we will use a namedtuple to return the resultsfromcollectionsimportnamedtupleMCSRes=namedtuple('MCSRes',('smarts','numAtoms','numMols','avgNumMolAtoms','mcsTime'))defMCS_Report(ms,printSmarts=True,atomCompare=MCS.AtomCompare.CompareAny,bondCompare=MCS.BondCompare.CompareAny,completeRingsOnly=True,**kwargs):t1=time.time()mcs=MCS.FindMCS(ms,atomCompare=atomCompare,bondCompare=bondCompare,timeout=30,completeRingsOnly=completeRingsOnly,**kwargs)t2=time.time()nAts=numpy.array([x.GetNumAtoms()forxinms])print('Mean nAts %.1f, mcs nAts: %d'%(nAts.mean(),mcs.numAtoms))ifprintSmarts:print(mcs.smartsString)mcsM=Chem.MolFromSmarts(mcs.smartsString)mcsM.UpdatePropertyCache(False)Chem.SetHybridization(mcsM)svg=moltosvg(mcsM,kekulize=False)tpl=MCSRes(mcs.smartsString,mcs.numAtoms,len(ms),nAts,t2-t1)returntpl,svg

The MCS_Report() function defined above prints out the mean number of atoms per molecule in the input along with the size of the MCS. This is intended to help assess whether or not the MCS is actually a scaffold.