At NextMove we've been doing a bit on processing sketches from patents (see Sketchy Sketches). I learnt a few things about how generic structures are typically depicted I thought be interesting to share.

Substituent Variation (R groups)

The most common type of generic feature is substituent variation, colloquially known as R groups. The variation allows concise representation with an invariant/fixed part of a compound and variable/optional part.

wherein R denotes
That is: anisole, toluene, or ethylbenzene.

Substituent Labels

Multiple substituent labels may be distinguished by a number R1, R2, ... Rn. However in reality, any label can and will be used. This can be particularly confusing when they collide with elements, examples include: Ra (Radium), Rg (Roentgenium) B (Boron), D (Deuterium), Y (Yttrium), W (Tungsten). The distinction between the label Ra and Radium may be semantically captured by a format but lost in depiction.

To distinguish such labels we can style them differently. By using superscripting and italicizing the label the distinction becomes clear and also somewhat improves the aesthetics of numbered R groups. We avoid subscript due to ambiguities with stoichiometry, for example: –NR2.

Attachment Points

For substituents there are different notation options. In writing, radical nomenclature is used, for the above example we'd say: methyl-oxyl (-OMe), ethyl (-Et), or methyl (-Me). However this doesn't translate well to depictions: .

The CTfile actually does stores substituents this way and specifies the attachment point (APO) information separately.

Alternatively we may use a virtual or 'null' atom. We can convert to/from CTfile format although it's slightly easier to delete the null atom that add it on, due to coordinate generation. A disadvantage of this is the atom count isn't accurate, however the labelled group is also a type of null atom and already distorts the atom count. There are unfortunately different ways of depicting this null atom.

Don't use a dative bond style! You have to fudge the valences and just doesn't work, how would I show a double bond attachment?

The first time I'd encountered attachment points was in ChEBI where and R group means 'something attaches here' (CHEBI:58314, CHEBI:52569), whilst a 'star' label means 'attaches to something' (CHEBI:37807, CHEBI:32861). This actually a nice way of thinking about it, like two jigsaw pieces the asymmetry allows the substituent to connect to the labelled atom.

The 'star' atom used by ChEBI is tempting to use as there is a star atom in SMILES.

A major disadvantage of this notation is mis-encoding by users mistaking it for a wavy up/down stereo bond. I talk more about this in the poster (Sketchy Sketches) but the number of times you see the following drawn:

The captured connection table for that sketch does not have null atoms but instead uses carbon:

One way to implement abbreviations would be by modifying the molecule data structure with literal collapse/contract and expand operations. Whilst this approach is perfectly reasonable, deleting atoms/bonds is expensive (in most toolkits) and it somewhat subtracts the "display shortcut" nature of this Sgroup.

For efficiency abbreviations are implemented by hiding parts of the depictions and remapping symbols. Just before rendering we iterator over the Sgroups and set hint flags that these atoms/bonds should not be included in the final image. If there is one attachment (e.g. Phenyl) we remap the attach point symbol text to the abbreviation label ('C'->'Ph'). When there are no attachments (e.g. AlCl3) we add a new symbol to the centroid of the hidden atoms.

Hide atoms and bonds

Symbol Remap

Abbreviated Result

For two or more attachments (e.g. SO2) you also need coordinate remapping.

Multiple Group

Multiple groups allow, contraction of a discrete number of repeating units. They are handled similarly to the abbreviations except we don't need to remap parts.

All atoms are present in the data structure but are laid out on top of each other (demonstrated below). We have a list of parent atoms that form the repeat unit. Therefore to display multiple groups we hide all atoms and bonds in the Sgroup except for parent atoms and the crossing bonds.

It's worth mentioning that hidden symbols are still generated but simply excluded from the final result. This allows bond back off for hetero atoms to be calculated correctly as is seen in this somewhat tangled example:

Brackets

Polymer and Multiple group Sgroups require rendering of brackets. Encoded in the molfile (and when laid out) brackets are described by two points, a line. It is therefore up to the renderer to decide which side of the line the tick marks should face.

I've seen some implementations use the order of the points to convey bracket direction. Another method would be to point the brackets at each other. As shown for CHEBI:59342 this is not always correct.

Poor bracket direction

Preferred bracket direction

CHEBI:59342

I originally thought the solution might involve a graph traversal/flood-fill but it turns out there is a very easy way to get the direction correct. First we consider that brackets may or may not be placed on bonds, if a bracket is on a bond this information is available (crossing bonds).

For a bracket on a crossing bond exactly one end atom will be contained in the Sgroup, the bracket should point towards this atom.

If a bracket doesn't cross a bond then the direction should point to the centroid of all atoms in the Sgroup.

Saturday, 17 October 2015

The default Java serialization framework provides a convenient mechanism for streaming in-memory Objects to another computer or storing them on disk. Beyond the obvious badness of being tied to the internal object layout (i.e. not stable through changes), serialization can be very inefficient. Externalization and libraries like Kyro are popular for improving performance.

In the domain of Chemistry we have a rich variety of formats (e.g. SMILES) with which we can store molecules and reactions (in memory these are labelled graphs). Although these formats do not completely fulfil the utility of Object serialization they can be used as building block upon which we build. Not only are these defacto standards but they can be much faster and compact than default serialization of the in-memory connection table (graph) representation.

Recent History

Crafting efficient (de)serialization is beneficial and you can get great speed with simple setup. A few years ago I ran some experiments on writing an externalization stream for the Chemistry Development Kit (CDK) molecules (thread - High Performance Structure IO). Since the objects are huge any improvement over the default would be useful. This partly fed into the needs of CDK-Knime (a workflow tool) where I think CML was being used originally. From testing on ChEBI (~20,000 molecules) we see actually the ObjectInputStream was about as fast as an SDfile and much faster than CML. SDfiles are now much faster but that would be another post.

Read Performance

Method

Time

Size

Throughput

AtomContainerStream

346 ms

11.1 MiB

63739 s-1

SDfile

4159 ms

51.7 MiB

5302 s-1

CML

18605 ms

91.5 MiB

1185 s-1

ObjectInputStream

5552 ms

93.9 MiB

3972 s-1

It was around that time that Andrew Dalke payed a visit to EMBL-EBI. In discussing what I was currently working on he promptly showed me how fast OEChem could read/write SMILES. Needless to say – pretty quick and as fast if not faster than my attempt at 'High Throughput' streaming.

The CDK now also has fast SMILES processing and I wanted to compare this to the serialization to see how much of a performance penalty there is.

Benchmark

For a benchmark I used 100,000 structures for ChEMBL 20.

$ shuf chembl_20.smi | head -n 100,000 > chembl_20_subset.smi

Writing it to a ObjectOutputStream takes 28.78 seconds. The SMILES subset file takes up 6.8 MiB on disk whilst the serialized objects take up 295 MiB. Ouch, that's 42x larger.

In CDK we first read SMILES with Beam and then convert to the CDK objects so we'll also look at that small overhead. Here I compare the time to read the 100,000 SMILES using Beam, CDK, and the objects using an ObjectInputStream. Both CDK and Beam take less than 1 second whilst the ObjectInputStream takes more than 50.

In terms of throughput (mol per sec) here is the kind of speed we hit. I also show the total elapsed time for all 15 repeats.

Method

Min

Max

Elapsed Time

Size

Deserialization

1961 s-1

2089 s-1

12 m 16 s

295 MiB

Kryo (Auto)

42401 s-1

44557 s-1

33.9 s

186 MiB

Kryo Unsafe (Auto)

44854 s-1

47331 s-1

31.9 s

231 MiB

CDK

135286 s-1

142126 s-1

10.7 s

6.8 MiB

Beam

347534 s-1

489545 s-1

3.2 s

6.8 MiB

Auxiliary Data

With a performance difference that huge why would anyone want to use Serialization? Some use-cases might be that a format doesn't store the parts we need. A common argument against SMILES is the lack of coordinates but we can simply store this supplementary to the SMILES if we no what the input order will be (Code 2).

Code 2 - Writing Coordinates with SMILES

IAtomContainer mol = ...;
// 'Generic' - avoid canon SMILES we are not doing identity check
SmilesGenerator sg = SmilesGenerator.generic();
int n = mol.getAtomCount();
int[] order = new int[n];
// the order array is filled up as the SMILES is generated
String smi = sg.create(mol, order);
// load the coordinates array such that they are in the order the atoms
// are read when parsing the SMILES
Point2d[] coords = new Point2d[mol.getAtomCount()];
for (int i = 0; i < coords.length; i++)
coords[order[i]] = container.getAtom(i).getPoint2d();
// SMILES string suffixed by the coordinates
String smi2d = smi + " " + Arrays.toString(coords);

Using that same technique it's relatively simply to extend this to handle arbitrary data fields and it even forms the basis of ChemAxon's extended SMILES. A more advanced method would be combining the SMILES with a DataOutputStream since we know how many coordinates there are expected to be.

Summary

I'm certainly not against a performant AtomContainerInputStream but the default Java serialization should never be the first choice. Hopefully this post has put some numbers on why and will discourage knee-jerk usage.

Update

Tuesday, 8 September 2015

Despite the flaws, the molfile has been a defacto standard for chemical representation for several decades. The core format (atom and bond block) is well supported in many toolkits but more advanced features (dark corners) of the property block may be skipped.

At this year's Fall ACS (Boston '15) I bumped into an old colleague from ChEBI who told me they (ChEBI) couldn't use CDK because they wanted to display repeating brackets on records and CDK didn't do that.

Polymer representation (more precisely Structural Repeat Unit) used by ChEBI falls under the category of a Ctab Sgroup. I'd wanted to add support for Sgroups for some time and now had motivation to do so.

Substructure (or Substance) Groups

Over the years there seems to have been a shift in definition. The original literature[1] uses the term "substructure groups" but more recent materials use "substance groups"[2,3]. Personally I prefer "substructure" since it concisely summarises what they really are about.

Essentially an Sgroup annotates some part of the connection table (a substructure) with meta-information (data). There are several types of Sgroup that formalise the types of annotation present:

Display Shortcuts

Abbreviations

Multiple Groups

Polymers

Structural Repeat Unit (SRU)

Monomer

Copolymer (alternating, block, or random)

Mer

Crosslink

Graft

Modified

Any

Mixtures

Unordered Mixture

Ordered Mixture (formulation)

Component

Generic

Data

Example ChEBI Depictions

Egon reviewed the first patch (pull/149) last week that focussed on representation and molfile round tripping. The second patch enhances the rendering code to handle more than basic SRUs (e.g. >2 brackets) and display shortcuts.

As of ChEBI 131 there are 809 entries with at least one Sgroup. Generating the depictions of these from an SDfile took < 3 seconds, then a further 11 to actually write the files to disk. The rest of this post demonstrates some example of those depictions.

Display Shortcut, Abbreviations

Previously referred to as "superatoms", parts of a structure can be abbreviated to a more concise name (e.g. Ph for a phenyl substituent). The full structure is present but is only displayed when the expansion flag is set.

Display Shortcut, Multiple Group

Multiple groups allow structures with fixed repeating parts to be drawn more concisely. Similar to abbreviations, all the atoms and bonds are present but are hidden from display. They're actually all overlaid on one another with duplicated coordinates but for rendering you still want omit them from display.

Polymer, SRUs

The most common Sgroup used in ChEBI is the Structure Repeat Unit (SRU), an SRU defines a repeat unit of variable length. The brackets do not necessarily come in pairs, are parallel, or point towards each other.

Polymer, Others

Combinations

A structure can have more than one Sgroup and they can be nested. Here we see a multiple group within an SRU. There is also a data Sgroup attached to the Zn-N bond marking it as a coordination bond for Marvin. I've not decided whether to render those yet, but we have the information there.

Sunday, 9 August 2015

Some time last year Mark Williamson brought to my attention discrepancies in CDK's MMFF partial charge calculation. Investigating further it seemed to mainly be a problem with atom typing. There were two existing classes that could assigned MMFF atom types using a combination of a decision tree and string matching hose codes. The 761 molecules from the MMFF94 Validation Suite provided by Paul Kersey were used to give a more comprehensive overview then our current tests.

The results showed reasonable precision per-atom in the validation suite but were less favourable per-molecule, the best implementation assigned types to <90% of the molecules with <16% assigned correctly.

Assigned Types
(Atoms)

Correct Types
(Atoms)

Assigned Types
(Molecules)

Correct Types
(Molecules)

ForceFieldConfigurator

15576

90.1%

12932

74.8%

678

89.1%

118

15.5%

MMFF94AtomTypeMatcher

17120

99.1%

12309

71.2%

659

86.6%

75

9.9%

MmffAtomTypeMatcher

17279

100.0%

17279

100.0%

761

100.0%

761

100.0%

I wasn't keen to hard code the atom typing procedure but was delighted to find Robert Hanson of JMol had some SMARTS patterns that could be used as a starting point. After about a month of tweaking I managed to simplify the SMARTS patterns and achieve 100% precision on the validation suite. You can find the SMARTS patterns here: /org/openscience/cdk/forcefield/mmff/MMFFSYMB.sma.

Apart from improving atom type assignments the charge assignment also needed updating to include charge sharing and bond class differences. This wasn't quite as simple as I first thought as the parameter set parsing also needed reworking. After many months of analysis paralysis I decided last week to just rewrite what was needed and delegate calls from the existing implementation.

Now the patch is finished, charge assignments are much better. Notice that in the previous version (labelled CDK 1.5.10) equivalent terminal oxygens and the nitrogens in imidazole anion have different values. The overall charge was also inconsistent with the formal charges.

Improved charge assignment

Roger Sayle noted to me this week that MMFF charges should not be affected by representation, for example, charge separated pi bonds in nitro groups or phosphates.

Charges are independant of representation

Many thanks to Mark and Alison Choy for reporting the problem and adding patches for debugging and testing.