Learn how to hack RDKit to handle peptides with pseudo atoms

The chemoinformatics package RDKit has its strength in handling small organic molecules. These molecules are characterized by a large diversity in chemical structure, so an exact description of how the atoms are bonded together is necessary to know which molecule it is.

Biological macromolecules are often built from repeating sequences of standard building blocks, such as amino acids or nucleotides. This makes it possible to specify the structure of a protein by writing its sequence of amino acids, each of which has been assigned a single-letter code. As an example, the peptide Met-enkephalin (https://en.wikipedia.org/wiki/Enkephalin) has the sequence Tyr-Gly-Gly-Phe-Met, or YGGFM in single-letter notation.

Met-Enkephalin as full structure and amino acid sequence
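
The step from three-letter to single-letter notation is easy to sketch in plain Python. The table holds the codes of the 20 standard amino acids. The function name `to_one_letter` is made up for this illustration and is not part of RDKit.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(sequence: str) -> str:
    """Convert a dash-separated three-letter sequence to one-letter notation.

    Hypothetical helper for illustration only.
    """
    # capitalize() normalises "TYR" or "tyr" to "Tyr" before the lookup.
    # An unknown residue raises KeyError instead of being silently skipped.
    residues = (res.strip().capitalize() for res in sequence.split("-"))
    return "".join(THREE_TO_ONE[res] for res in residues)


print(to_one_letter("Tyr-Gly-Gly-Phe-Met"))  # YGGFM
```

A modified or unnatural residue has no entry in such a table, which is exactly where the plain sequence notation stops being enough.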

But what if there’s suddenly a need to work with modified or unnatural amino acids? Of course the whole structure can be written out as in the illustration, but as the sequence length increases this becomes too complex to take in at a glance. Instead, the two worlds can be combined by defining pseudo atoms that represent the naturally occurring amino acids. The old drawing program ISIS/Draw supports this kind of structure, and it is also possible in MarvinSketch. My good friend Jan’s Proteax software also supports writing peptides in condensed form with pseudo atoms, and has functions that can interconvert between fully atomistic representations, mixed representations with pseudo atoms, and line notation.

To cut a long story short, can RDKit read SD files with amino acids specified as pseudo atoms? No, not straight away. But with just a small modification it can.

Thanks to its open source nature, it’s relatively straightforward to extend the capabilities of RDKit. ISIS/Draw defined the pseudo atoms as atoms with atomic numbers starting from 171. It’s good the nuclear physicists are not that far yet 😉 RDKit has its atom definitions in the file Code/GraphMol/atomic_data.cpp, so the pseudo atoms need to be defined there.

Below is an excerpt of the additions made to that file. The first column is the atomic number. RDKit expects a continuous sequence, so dummy values have to be filled in for atomic numbers 112 to 170. Then follow the label, the covalent radius, the rB0 (some kind of Born radius; its origin is lost in RDKit lore), the van der Waals radius, the atomic mass, the number of outer shell electrons, the most common isotope and the most common isotope mass, followed by the number of allowed valences.

I put in the most sensible values I could think of and find from coarse-grained force fields, and calculated the atomic mass and exact atomic mass with RDKit for each amino acid (minus H2O). Cysteine is treated differently: it doesn’t contain the sulfur atom, and it has 5 outer shell electrons and a valence of 3. That way the sulfur can be added as a separate atom and used for creating cysteine bridges in the condensed format.
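
As a rough illustration of the bookkeeping described above, here is a small Python sketch. It does not reproduce the actual entries in atomic_data.cpp or their file layout. It only lists the atomic numbers that need dummy filler entries and computes average residue masses (amino acid minus H2O) from elemental formulas. The real masses were calculated with RDKit; the hand-rolled `residue_mass` helper, the rounded standard atomic weights and all variable names are stand-ins made up for this example.

```python
# Rounded standard atomic weights (average masses, in u).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

# Elemental composition of the free amino acids found in Met-enkephalin.
AMINO_ACID_FORMULA = {
    "Tyr": {"C": 9, "H": 11, "N": 1, "O": 3},
    "Gly": {"C": 2, "H": 5, "N": 1, "O": 2},
    "Phe": {"C": 9, "H": 11, "N": 1, "O": 2},
    "Met": {"C": 5, "H": 11, "N": 1, "O": 2, "S": 1},
}

WATER = {"H": 2, "O": 1}


def formula_mass(formula):
    """Average mass of an elemental formula given as {element: count}."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in formula.items())


def residue_mass(name):
    """Average mass of a residue: the free amino acid minus one water.

    Hypothetical stand-in for the RDKit calculation, for illustration only.
    """
    return formula_mass(AMINO_ACID_FORMULA[name]) - formula_mass(WATER)


# Atomic numbers that need dummy entries so the table stays continuous
# up to the first pseudo atom at 171.
FIRST_PSEUDO_ATOM = 171
filler_numbers = list(range(112, FIRST_PSEUDO_ATOM))

print(len(filler_numbers), "filler entries, from", filler_numbers[0], "to", filler_numbers[-1])
for name in AMINO_ACID_FORMULA:
    print(name, round(residue_mass(name), 3))
```

Summing the residue masses over a sequence and adding one water back gives the average mass of the whole peptide, which is a handy sanity check on the values entered for the pseudo atoms.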

Some of the fingerprints and descriptors seemingly work. But please do check that they do something sensible before using them for production or research work; I have not tested them. Still, this opens up a whole bunch of interesting possibilities. Alignment-free comparison of protein similarity would be one use, I think. Let me know if you have other ideas or investigate it further.

So a lot of the usual RDKit goodies seem to work. To be truly useful, it should be possible to interconvert between the full structure and the condensed structure. On the other hand, Proteax already has this capability, and the idea was originally just to ensure interoperability. Please comment below if you find it useful, and tell me how you used it.