Noel O'Blog

Sunday, 17 February 2019

There are a number of graph algorithms with not-so-obvious names, and "transitive reduction" is one of those. When I first needed to implement this algorithm, I couldn't figure out whether it had a name, and I can tell you that it's hard to google for something when you don't know what it's called.

Anyway, what transitive reduction means is to remove the maximum number of edges from a directed acylic graph while keeping the 'reachability'; that is, you can remove any edge A->B in the graph if there's some other way to get from A to B. In practice, this is super-useful in tidying up a graph representing pairwise relationships, e.g. a partially-ordered set. For example, suppose you know that A>B, B>C and A>C. The graph showing this would have three edges, but you can reduce it to just two, A->B->C, that summarises the entire ordering.

I used this algorithm in my fingerprint paper, but more recently I've been working on another problem that also requires it, but this time the graphs are much bigger. At first, I used the tred program that comes with graphviz, but the graphs became too large for tred to finish within a reasonable amount of time. Next I used NetworkX, first from CPython and then from PyPy. The latter could just about handle it but required 30GB of memory and ran overnight, and so I moved onto C++. First I tried with igraph. This was painfuly slow - slower than the Python. It all comes down to the data structures that the library is using; I'm deleting 100s of millions of edges and though I batched up the deletions, it was just too slow.

Finally, I ended up at Boost Graph Library. This library is unusual in that you get to choose the data structures that the graph uses. If you need fast edge removal, then you can choose a linked list in which to store the edges. For the graphs of interest, the code ran in a matter of minutes (after a bit of profiling and optimisation) using a single GB of RAM.

Sunday, 25 November 2018

One of Roger's observations came to mind recently when I was looking at some SMARTS patterns in Open Babel that were causing some problems: "People cannot write SMARTS correctly." Instead, he argues, they should use tools to create SMARTS for them. The problem is not so much that patterns don't match molecules of interest, but that they match additional molecules (false positives) and may miss others (false negatives). False positives are a particular problem when transformations (e.g. SMIRKS) are applied to matching molecules, as they can result in exotic valence and charge states that were unintended.

The following is a description of a tool I put together to help craft SMARTS patterns. It is based upon the idea that a SMARTS pattern is completely defined by the molecules it matches. Thus, by looking at the matches in a database of sufficient size, we can see whether there are any mismatches.

So far, so obvious. What's novel is that the tool only returns distinct matches based upon simple atom types of the matching atoms, and sorts them by probability of those atom types (most unusual first). Thus, even for common substructures the number of hits is small and information-rich. Furthermore, by using my favourite trick of sorting the reference database by size (ChEMBL here), this means that the first hit found for each distinct match is likely to be small in size.

An example will illustrate: let's suppose we want to create a SMARTS pattern to match alcohols. I know that "CO" in SMILES indicates this, and so let's start with that. Only 28 hits are returned (from a 10% subset of ChEMBL), of which the first three are:

Ok, obviously we don't want a charged oxygen, so let's try "C[O+0]" and look at the first three hits:

And the oxygen should have a hydrogen, "C[Oh1+0]":

Now, about that carbon. I only had sp3-hybridised carbons in mind, so how about "[CX4][Oh1+0]"? Here is the full list of hits at this point:

At this point the hits are what I expected to see. Does this mean I should stop here? No, but it's a reasonable start.

After using this tool for a while, you begin to appreciate the importance of nailing down exactly what you want to see: if you only want to match neutral carbons, you should say so; if you don't want the oxygen to have additional neighbours, you should say so. Once every degree of freedom is nailed down...then it's finished. You might think this is going too far, but the molecules that exist in databases don't necessarily conform to typical valence rules - they may be unusual (consider TEMPO) or downright incorrect. Your SMARTS pattern should match exactly what you had in mind and not leave any wiggle room to match anything more: "[Cv4X4+0]-[OD1h1+0]"

This tool is by no means a panacea; you still need to think for yourself, and consider for example what to do about aromaticity, or the presence of explicit hydrogens or hydrogen isotopes, or whether to reorder the pattern for speed. But one could imagine a tool that could ease the burden further and do some of this for the user, or for example allow the user to mark "match" or "don't match" for the set of hits returned to automatically refine the query.

Tuesday, 2 October 2018

At the last ICCS, Andrew Dalke and I got talking. We had both been around the poster session, and had heard first-hand the problems of "invalid SMILES" that were affecting the generation of SMILES strings using deep neural networks. I remember Xuhan Liu in particular asking me whether I had a solution to this. Well, I had been turning over an idea in my mind for a little while, and by the time I got talking to Andrew, I had decided to do something about it. Now Andrew knows a thing or two about SMILES himself, so I shouldn't have been surprised that he had also come up with something (along with two references from 1964 to back it up). We put the two ideas together and called it DeepSMILES.

The problematic areas of SMILES syntax involve paired ring closure symbols for cycles, and parentheses for branches. These particular aspects of the syntax are difficult to reproduce correctly when generating SMILES strings using machine-learning methods, and so a certain percentage of generated SMILES tend to fail basic syntax checks. While there have been a variety of approaches aimed at improving the SMILES generation (with quite some success), it is reasonable to assume that the syntax also causes difficulties during the learning phase.

Our approach is not to use SMILES, but instead an alternative syntax that does not have these problems. Paired ring closures are replaced by a single digit, the ring size; paired parentheses are replaced by close parentheses indicating branch size. See the talk below, the preprint, or the GitHub site for more information. Feedback (positive or negative) is welcome, either here or on GitHub.

Friday, 20 July 2018

I've done a bit of work over the last year or two stripping back some accumulated cruft in Open Babel. One question that's been on my mind during this process is how much can you simplify the toolkit's core but yet leave it capable of complex behaviour?

Take the case of aromaticity handling. Just this part of the toolkit on it's own raises many questions. Should aromaticity be lazily perceived or require an explicit function call? What happens when the user modifies a molecule? What about copying a molecule? Or just copying a subset of the atoms to another molecule? What should happen when you add two molecules together? If you read aromaticity from a SMILES string, what if it's a different model than the toolkit uses internally? Should the SMILES writer reperceive aromaticity, or just use it as presented?

Often the easiest solution to these problems is to always do the maximum amount of work possible (i.e. throwing away perceived aromaticity information at every opportunity), which I wanted to avoid at all costs. So I went through removing bits and simplifying things, making sure that aromaticity information was copied where possible, and hoping that in the end all of the complex behaviour that I wanted to maintain would still be possible without adding back fixes or kludges. And fortunately it was. I even managed to add additional behaviour, an option to keep the aromaticity information as read from a SMILES string.

I'm a firm believer that there's no point adding features or improving things if you don't write documentation explaining why/how/when they should use it. What's the point doing this work if no-one knows how they can take advantage of it? So, I've just written up documentation on how Open Babel handles aromaticity. The following (exclusive!) excerpt describes how aromaticity information is stored by Open Babel.

The purpose of this section is to give an overview of how Open Babel handles aromaticity. Given that atoms can be aromatic, bonds can be aromatic, and that molecules have a flag for aromaticity perceived, it’s important to understand how these all work together.

Like many other toolkits, Open Babel stores aromaticity information separate from bond order information. This means that there isn’t a special bond order to indicate aromatic bond. Instead, aromaticity is stored as a flag on an atom as well as a flag on a bond. You can access and set this information using the following API functions:

OBAtom::IsAromatic(), OBAtom::SetAromatic(), OBBond::UnsetAromatic()

OBBond::IsAromatic(), OBBond::SetAromatic(), OBBond::UnsetAromatic()

There is a catch though, or rather a key point to note. OBMols have a flag to indicate whether aromaticity has been perceived. This is set via the following API functions:

OBMol::SetAromaticPerceived(), OBMol::UnsetAromaticPerceived()

The value of this flag determines the behaviour of the OBAtom and OBBond IsAromatic() functions.

If the flag is set, then IsAromatic() simply returns the corresponding value of the atom or bond flag.

If unset, then IsAromatic() triggers aromaticity perception (from scratch), and then returns the value of the flag.

Tuesday, 10 July 2018

What happens when four software developers get together and compare their implementations of the CIP rules? This is the background to a recent preprint deposited in ChemRxiv.

The CIP system is a series of rules that describe how to assign a stereodescriptor (e.g. R/S, E/Z) to a stereocentre. When Bob Hanson decided to add support for CIP to Jmol, rather than simply read the rules and implement it according to his interpretation as others have done, he decided to work with three other implementations to challenge each other on disagreements and clarify the wording of the rules.

Essentially, the issue that the authors are addressing is the fact that existing implementations even in "highly respected software packages" disagree with each other (see John Mayfield's presentation). By comparing the implementations in Jmol, Centres, Balloon and ChemSketch they were able to identify cases where:

"(a) the disagreement was due to different interpretations of CIP rules among software developers, (b) there was a problem with an algorithm or its implementation in code, or (c) the CIP rules themselves were flawed."

In all cases, however, they were able to come to a consensus, which led to "the discovery of a small number of errors in the Blue Book, two minor flaws in the CIP rules, and a proposal for a new rule".

The paper walks through their discussions of each rule in turn, looking at any issues arising and clarifying any ambiguities. It also includes a validation suite (browse it here) that covers all aspects of the rules and will allow future CIP implementations to avoid the pitfalls that have beset the field in the past.

Thursday, 14 June 2018

I've already gotten the "valid SMILES" issue off my chest, but there's something else that's been bugging me about deep learning papers in chemistry, and that's the use of canonical SMILES for training. Why would you use canonical SMILES? And if you did, wouldn't that introduce some problems?

This is one of those cases where this just feels like a mistake to me - and I'm going to do my best to articulate my concerns. However, I'm not familiar enough with the mechanics of DNNs to be sure, but I hope those in the field will consider the points I raise.

The problem

So, canonical SMILES. Behind the scenes, the canonicalisation procedure gives a unique label to each atom (typically 1 to N). These labels are then used to generate a canonical SMILES, typically by starting at the lowest label (but not necessarily). The canonicalisation procedure is based upon the attributes of the graph, with the result that the first atom in a canonical SMILES tends to favor particular atom types and avoid others.

For example, if you look at the atom types for the first atom in the canonical SMILES generated by RDKit for ChEMBL molecules, you will find that the second most common atom type in ChEMBL (namely, *-C(-*)=*) never appears as the first atom in a canonical SMILES string. This is by design and you'll see similar behaviour with other toolkits - SMILES strings tend to start with degree 1 atoms.

So what if the distribution of atom types is different for the first atom?

Well, firstly, I predict that as a result these atom types will be over-represented in structures generated by DNNs (and others under-represented). If you train on canonical SMILES, then the probabilities for the first atom will be determined by the atom types favored as starting atoms by canonical SMILES. Consider the extreme example where fluorine always occurs as the first atom in any canonical SMILES that contains it; you should see an increased number of fluorines as the first atom in the generated molecules. Now you could argue that the probabilities for the remaining atoms will be adjusted accordingly, but I believe that there is a strong edge affect and that any correction will attenuate as the SMILES string becomes longer.

Secondly, this bias makes the DNNs job harder. Instead of a relatively even distribution of atom types at all points in the SMILES string, the distribution will depend on the distance from the starting atom. It's now trying to learn something about the properties of canonical SMILES instead on concentrating on the task at hand...

...which bring me nicely to the third point. Predictive models attempt to deduce a property value from the structure, and a SMILES string is used by DNNs as a proxy for the structure. Using a canonical SMILES string is another step removed. What about a molecule with a very similar structure but very different canonical SMILES? Surely the goal of a robust model is to handle this well. Restricting good predictive power to only those structures that are both similar and have similar canonical SMILES is to develop a model with a reduced applicability. A fun task is do is to measure the degree to which this fitting to canonical SMILES occurs; this is left to the reader.

The solution

The solution is simple. Use random SMILES. A single one, or multiple. The use of multiple random SMILES has already been described by Thomas Bergwinkl and subsequently by Esben Jannik Bjerrum as a 'data augmentation technique', but I see it as just avoiding the inherent bias of using canonical SMILES. But either way, I like this quote from Thomas:

The output for alternative representations of a molecule should be the same, if you understand SMILES. Using alternative representations in the test data allows to verify if the neural network understands SMILES. Spoiler: After a while the output becomes very close for the alternatives!

So why do people use canonical SMILES in the first place? I have my theory on this.

I believe it's because the generative models more quickly converge to generation of syntactically valid SMILES strings when they train on canonical SMILES. And for some reason, the percentage of syntactically valid SMILES strings generated by the model has become a figure of merit.

But this makes no sense - who cares what this percentage is? Sure, we can all overfit to canonical SMILES and get high percentages quickly. But how is this a good thing? You know that feeling you get when you see a QSAR model with very high R2 on training data - that's how I feel when I see a high value for this percentage. If it's actually doing what it's supposed to be doing (i.e. learning the underlying structure rather than the training set of canonical SMILES), then the percentage should really be lower. What do I care if the percentage of syntactically valid SMILES is 1%? So long as that 1% solves my problem, it's irrelevant - these structures are spewed out of these models thousands per second (I presume, but even so).

Wednesday, 6 June 2018

Just back from the recent ICCS in Noordwijkerhout, the Netherlands. I really enjoyed it although, as you can see from the picture, it was hard work at times.

Here are my notes on the scientific program, which I have just extracted from Twitter. A big thank-you to ThreadReader, without which I couldn't have done the extraction. Naturally, all errors are my own - I may have misunderstood something or lost the thread of the talk. Also, just to note, I didn't take notes on all of the talks.

If you are interested in a particular talk, and paste the provided link into a browser, you can see if anyone on Twitter added a comment on the tweet.

If you want to follow the entire Twitter conversation on the meeting, which used the hashtag #11thICCS, go to this link. Again, if a particular Tweet has replies, you need to click on it to see them.