Wednesday, September 7, 2011

Representing stereochemistry in 2D is not a trivial task; otherwise, this document wouldn't be this long. :) There are (in my opinion) too many conventions, which makes large-scale, automatic recognition a nightmare. Some people claim that cheminformatics has been solved; for those who think we are not quite done yet, here is one of the issues still left to solve:

How to indicate double bond stereochemistry and how to store this information in MDL MOL files (v2000) correctly?

Even for a single tetrahedral stereocentre there are numerous ways to represent stereochemical information in 2D. Nevertheless, the following bond types are generally used for the most typical cases:

Wedge bonds can indicate a single isomer or a racemic mixture. In drawings, they are typically distinguished by "ABS" or "absolute" labels in case of single isomers and "RAC" or "racemic" labels in case of racemic mixtures.

In v2000 MOL files, each of these bond types has a corresponding bond stereo value. According to the MOL file spec, the stereochemical information should be stored in the bond block rather than in the atom block, which means that storing the stereo bond types is sufficient.

Single isomers with absolute configuration can be distinguished from racemic mixtures by turning on the chiral flag.
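To make the storage concrete, here is a minimal sketch of reading the chiral flag and the bond stereo values from a v2000 MOL block. It relies on the fixed-column layout of the counts line and the bond block and on the stereo codes as I understand them from the CTfile spec (single bonds: 0 none, 1 up, 4 either, 6 down; double bonds: 0 derived from coordinates, 3 either). The function name, the label strings and the toy connection table are made up for illustration, and the toy table is not meant to be a chemically sensible molecule.

```python
# Stereo codes of the V2000 bond block, with illustrative label strings.
SINGLE_BOND_STEREO = {0: "none", 1: "up (wedge)", 4: "either (wavy)", 6: "down (hash)"}
DOUBLE_BOND_STEREO = {0: "cis/trans from coordinates", 3: "either (cis or trans)"}


def read_v2000_stereo(molblock):
    """Return (chiral_flag, bonds) for a V2000 MOL block.

    bonds is a list of (atom1, atom2, order, stereo_label) tuples.
    Simplified sketch: no V3000 support, no validation of the header.
    """
    lines = molblock.splitlines()
    counts = lines[3]                       # counts line follows the 3 header lines
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    chiral_flag = counts[12:15].strip() == "1"

    bonds = []
    first_bond = 4 + n_atoms                # bond block follows the atom block
    for line in lines[first_bond:first_bond + n_bonds]:
        a1, a2 = int(line[0:3]), int(line[3:6])
        order = int(line[6:9])
        code = int(line[9:12].strip() or 0)
        # Double bonds use their own code table; everything else is
        # looked up in the single-bond table in this sketch.
        table = DOUBLE_BOND_STEREO if order == 2 else SINGLE_BOND_STEREO
        bonds.append((a1, a2, order, table.get(code, "unrecognised")))
    return chiral_flag, bonds


# Hand-written toy connection table (hypothetical, for illustration only).
TOY_MOLBLOCK = "\n".join([
    "toy example",
    "  hand-written",
    "",
    "  4  3  0  0  1  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.0000    0.0000    0.0000 C   0  0",
    "    2.0000    0.0000    0.0000 C   0  0",
    "    3.0000    0.0000    0.0000 C   0  0",
    "  1  2  1  1",
    "  2  3  2  3",
    "  3  4  1  0",
    "M  END",
])

chiral, bonds = read_v2000_stereo(TOY_MOLBLOCK)
print(chiral)
for bond in bonds:
    print(bond)
```

Note that the same stereo column means different things depending on the bond order, which is why the sketch keeps two separate lookup tables.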

Great. Now let’s see what possibilities we have for representing double bond stereochemistry (for clarity, I skipped some linear representations only applicable when the double bond has exactly two substituents, one on each end):

From the picture above one thing is clear: some representations are missing. In the discussion on the Blue Obelisk Exchange forum, people (at least those who commented on this question) seem to prefer the crossed double bond representation over introducing a wavy bond next to the double bond in question. OK, so let’s forget about the wavy bonds. But have we decided what the crossed double bond will be used for? Egon Willighagen correctly pointed out that we should distinguish between unknown and undefined stereochemistry. So let’s first decide what we mean by these two terms. Here is my go:

Unknown double bond stereo: configuration can be cis OR trans. We don’t know which one, but it is certainly one of them, and the sample is not a mixture of the two.

Undefined double bond stereo: we don’t know anything about the configuration. Can be cis, can be trans, can be a mixture of the two.

Now, which of the two would be best represented by a crossed double bond? I honestly don’t know, but it seems that the crossed double bond is associated with the unknown case (at least ChemWriter and MarvinSketch use it that way – they both write out the “either” bond stereo type for a crossed double bond in MOL files).

For undefined double bond stereo, Marvin introduced another bond type (see picture on the left). A problem arising from this new bond type is how to store it, e.g. in v2000 MOL files. Since the bond stereo type for double bonds is limited to the absolute and unknown cases, there is no way to indicate this directly in the bond block without violating the MOL file spec. Marvin has a clever solution for this: it stores the information as an extra “M” line in the properties block:

M MRV CTU 1 1 (number of undefined double bonds in the connection table followed by their identifiers).
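A reader for such a line could look roughly like the sketch below. It is based only on the description given here (a count followed by that many identifiers) and splits on whitespace; the exact column layout that Marvin writes is not something I have checked, and the function name is made up for illustration.

```python
def parse_marvin_ctu(line):
    """Return the identifiers listed on an 'M MRV CTU' line, or None.

    Assumption: after the 'M MRV CTU' prefix the line holds a count
    followed by that many integer identifiers, separated by whitespace.
    """
    tokens = line.split()
    if tokens[:3] != ["M", "MRV", "CTU"]:
        return None                      # some other properties-block line
    count = int(tokens[3])
    identifiers = [int(tok) for tok in tokens[4:4 + count]]
    if len(identifiers) != count:
        raise ValueError("fewer identifiers than the count announces")
    return identifiers


print(parse_marvin_ctu("M MRV CTU 1 1"))
print(parse_marvin_ctu("M  END"))
```

A reader that does not know about this extension simply sees the double bond as a plain one, so the undefined information is lost silently when the file passes through other tools.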

Unfortunately, we are not done yet… what about a sample containing a mixture of cis AND trans? I’m not aware of any representation for this case, except for drawing both isomers with additional explanatory text. This is exactly what IUPAC suggests for mixtures, but I don’t think many cheminformaticians will like this idea… and neither do I.

In principle, such a bond type for mixtures (cis AND trans) could be added like the undefined bond type in Marvin without any problems.

Another suggestion, which came from the developers of Ketcher, was to apply a system of labels to plain double bonds. That way the unknown, undefined and mixture cases could be represented by the labels “OR”, “?” and “AND”, respectively. I personally like this idea, as it goes somewhat in the direction of the v3000 MOL format. The only issue I see here is that these labels have to be positioned carefully to make sure they are not mistaken for labels indicating stereochemistry at tetrahedral stereocentres.
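The cases discussed so far can be summarised in a small data model. This is only a sketch of the proposal as described in this post, not how Ketcher or any other toolkit actually implements it: the enum, the label table and the helper are hypothetical names, and the v2000 codes (0 for a configuration taken from the coordinates, 3 for "either") are the two double bond stereo values mentioned above.

```python
from enum import Enum
from typing import Optional


class DoubleBondStereo(Enum):
    ABSOLUTE = "absolute"    # cis or trans, as drawn
    UNKNOWN = "unknown"      # cis OR trans, a single isomer
    UNDEFINED = "undefined"  # nothing known; may also be a mixture
    MIXTURE = "mixture"      # cis AND trans


# Labels of the proposed scheme; an absolute bond needs no label.
LABELS = {
    DoubleBondStereo.UNKNOWN: "OR",
    DoubleBondStereo.UNDEFINED: "?",
    DoubleBondStereo.MIXTURE: "AND",
}

# Only two of the four cases map to a v2000 bond stereo code.
V2000_CODES = {
    DoubleBondStereo.ABSOLUTE: 0,
    DoubleBondStereo.UNKNOWN: 3,
}


def v2000_bond_stereo(kind: DoubleBondStereo) -> Optional[int]:
    """Return the v2000 bond stereo code, or None if the case has none."""
    return V2000_CODES.get(kind)


for kind in DoubleBondStereo:
    print(kind.value, LABELS.get(kind, ""), v2000_bond_stereo(kind))
```

The two `None` results are exactly the gap discussed here: the undefined and mixture cases need either an extension outside the bond block or a label-based scheme.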

I would be very much interested to hear other people's opinions about the above suggestions, and also to hear other possible solutions to the problem.