Parsing SMILES, compression, Ragel

SMILES

Some classes of molecular structures can be represented using graphs
from graph theory, with atoms as nodes and bonds as edges. These
happen to include effectively all of the molecules used to make drugs
and pesticides, and those used in biological systems. (There are rough
edges, like tautomers, and some no-go areas, like metals, which
have more complex bonding systems not representable as bonds between
atoms.)

Tokenizing SMILES with a state diagram

I talked some about SMILES before
but didn't get into the details, only a bit about the tokenization
step. Years ago I developed the SMILES
tokenizer for FROWNS. I'm rather proud of it. It's pretty
simple to understand, especially compared to the set of if/switch
statements used in most of the other ones I've seen (for example, CDK's).

BTW, if you're going to use it you should make the following
extensions, based on OpenEye SMILES:

aromatic arsenic is allowed inside of brackets (that is, "[as]"); this
is found in the NCI data set

quadruple bonds are allowed, using '$'

I now think that "C..C" should be an allowed SMILES string

A tokenizer is one stage in the parsing process. Usually it isn't the
only stage. My version strengthens the tokenizer so it knows which
tokens are allowed after a given token. This is usually the parser's
job, along with checking that parentheses are balanced and that ring
closures match.
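To show the idea, here is a minimal sketch in the same spirit (this is not the FROWNS code itself): each token kind names which token kinds may follow it, so illegal sequences are caught during tokenization. The token set and the ALLOWED table are deliberately tiny illustrations, not a complete SMILES grammar.

```python
import re

# One regular expression with a named group per token kind.
TOKEN = re.compile(r"""
    (?P<atom>Cl|Br|[BCNOSPFI]) |
    (?P<open_branch>\()        |
    (?P<close_branch>\))       |
    (?P<ring_closure>\d)       |
    (?P<bond>[-=\#])
""", re.X)

# Which token kinds may follow each token kind; "start" is the
# initial state, and an atom must come first.  (Illustrative only.)
ALLOWED = {
    "start": {"atom"},
    "atom": {"atom", "open_branch", "close_branch", "ring_closure", "bond"},
    "open_branch": {"atom", "bond"},
    "close_branch": {"atom", "open_branch", "close_branch", "bond"},
    "ring_closure": {"atom", "open_branch", "close_branch", "ring_closure", "bond"},
    "bond": {"atom", "ring_closure"},
}

def tokenize(smiles):
    """Yield (kind, text) pairs; raise ValueError on a bad SMILES."""
    state = "start"
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError("unknown character at position %d" % pos)
        kind = m.lastgroup
        if kind not in ALLOWED[state]:
            raise ValueError("%r may not follow %r at position %d"
                             % (kind, state, pos))
        yield kind, m.group()
        state = kind
        pos = m.end()

print(list(tokenize("CC(=O)O")))
```

A real tokenizer handles brackets, aromatic atoms, two-digit ring closures, and so on, but the table-driven shape stays the same.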

I used the FROWNS tokenizer idea for one of my clients. They have a
web form text entry box that lets you put in SMILES, an SD file, or a
set of identifiers. Their code autodetects the input format. I used
my tokenizer to see if the input was sufficiently SMILES-like. It ran
on the browser client side, so I translated the code by hand into
JavaScript.

Parser generators and Ragel

I've long been interested in parser generation tools for things like
this. I like the idea that I can write something very succinctly and
let the machine make it work, and hopefully make it work fast. My
SMILES parser, which uses regular expressions to drive a state
machine, isn't particularly fast, though it's fast enough for most
cases.

There are a large number of parser generation tools out there. Most
are oriented towards context-free grammars, which are more powerful
than I need; it turns out most of what I deal with uses regular
grammars. There are tools for that class of problems as well. One
I've had an eye on is Ragel.

In the last few years there have been two cases where I wanted a
SMILES parser which was more low-level. To understand what low-level
means in this context I'll describe the problems.

Unique chemical graph enumeration

When Mick Kappler was at Daylight (some years ago) he did some work on
the structure/graph enumeration problem. How many chemically
reasonable structures are possible with N (carbon) atoms? Some
details on his
poster.

One of the steps in generating a structure is ensuring that it's not a
duplicate. The simplest way is to keep a set of "seen" structures.
If the new structure isn't in that set, and its canonical form isn't
either, then it's new; add the canonical SMILES (and possibly the
non-canonical SMILES) to the set.
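A sketch of that bookkeeping, where canonicalize() stands in for whatever toolkit routine produces a canonical SMILES (the placeholder here just sorts the characters, which is nowhere near real canonicalization but is enough to show the set logic):

```python
def canonicalize(smiles):
    # Placeholder, NOT real canonicalization: a toolkit routine
    # would go here.
    return "".join(sorted(smiles))

seen = set()

def is_new_structure(smiles):
    """Return True the first time a structure (in any spelling) appears."""
    if smiles in seen:
        return False
    canonical = canonicalize(smiles)
    if canonical in seen:
        # Seen before under a different atom ordering; remember this
        # spelling too so the cheap string test catches it next time.
        seen.add(smiles)
        return False
    seen.add(smiles)
    seen.add(canonical)
    return True
```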

The number of monocyclic unsaturated structures he generated
(including duplicates) goes something like 3**(n-3.25) and the number
of unique ones also has exponential growth. At some point you can't
keep all of those SMILES in memory.

Compressing SMILES

I got to wondering how to make better use of memory. Dave
Weininger and Roger Sayle (then at Metaphorics) did some internal work
in the late 1990s to compress SMILES strings using LZW. I don't
recall the details now and I wasn't too clear on it then because it
wasn't until a couple years later that I learned about how compression
works, from the great book "Managing Gigabytes."

My memory is they only used a bit or two per letter. A nice thing
about LZW compression is there's a direct mapping from a given bit
pattern to a token, which makes it straightforward to do a text
search of a compressed data stream. There's a bit of trickiness
because the algorithm has to deal with variable bit-sized tokens, and
there might be multiple ways to encode the SMILES given an input
dictionary. But Roger and Dave are smart and clever people.

(Another neat thing about compressing the data is that modern
computers are usually waiting on I/O. It can be faster to read
compressed data off the disk and uncompress it in memory than to read
uncompressed data directly. I remember Dave not believing this at
first and having to test it to prove it to himself.)

Consider the input SMILES "CCOC" and a dictionary containing encodings
for "C", "CC", "CO" and "OC". Depending on the bit size for each
token, the encoder may choose to encode that string as "C" "CO" "C" or
as "CC" "OC". As I recall from Mick, the compression code he
inherited would try the various combinations to get better
compression, at the expense of higher compression time. I don't
recall if he ended up compressing the input.
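The choice the encoder faces can be enumerated directly. This sketch uses made-up bit costs for the dictionary entries; with these numbers the two-token split wins, but a different cost assignment could favor another split:

```python
# Made-up bit costs per dictionary entry, for illustration only.
DICT_BITS = {"C": 3, "CC": 4, "CO": 4, "OC": 4}

def tokenizations(s):
    """Yield every way to split s into dictionary entries."""
    if not s:
        yield []
        return
    for entry in DICT_BITS:
        if s.startswith(entry):
            for rest in tokenizations(s[len(entry):]):
                yield [entry] + rest

for tokens in tokenizations("CCOC"):
    cost = sum(DICT_BITS[t] for t in tokens)
    print(tokens, cost, "bits")
```

Trying all combinations like this is exactly the "better compression at the expense of higher compression time" trade-off; a greedy longest-match encoder is faster but may miss the cheapest split.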

Arithmetic coding

One of the limitations of LZW is the quantization of the number of
bits used per symbol. This is also its strength, depending on the
application. Another approach is arithmetic
coding, which is also really cool. When I was in 8th or 9th grade
I learned about how in a physically perfect world all information
could be stored in a single stick. Take the information, convert it
to a number, put a decimal point in front of it, and cut a stick
that's exactly that long.

For example, the text "Hello!" in base 256, encoded as ASCII and
displayed in hex is "0x48656c6c6f21", or in base 10 as 79600447942433.
Cut a stick which is exactly 0.79600447942433 meters long. To read
information off of the stick, determine its length exactly (remember,
this is a physically perfect world) and do the back transformation.
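The back-and-forth transformation is easy to check in a few lines of Python:

```python
# Interpret the bytes of "Hello!" as one big base-256 number.
n = int.from_bytes(b"Hello!", "big")
print(hex(n))        # 0x48656c6c6f21
print(n)             # 79600447942433

# The stick would be 0.79600447942433 meters long.  Reading it back
# means measuring the length exactly and undoing the transformation:
length = n / 10**14                     # put a decimal point in front
recovered = int(round(length * 10**14)).to_bytes(6, "big")
print(recovered)     # b'Hello!'
```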

Arithmetic coding is exactly that principle, with some tricks to
handle things like knowing when to stop, and generating the minimum
number of bits needed to get close enough to the stopping condition
for a given number.
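The interval-narrowing principle itself fits in a toy encoder. This sketch uses exact fractions and made-up symbol probabilities, and skips everything a real coder must handle (bit output, rescaling, the stopping condition), but it shows how each symbol narrows the interval and how any number inside the final interval identifies the string (given its length):

```python
from fractions import Fraction

# Made-up probabilities for a two-symbol alphabet.
PROBS = {"C": Fraction(4, 5), "O": Fraction(1, 5)}

# Assign each symbol a sub-interval of [0, 1) proportional to its
# probability.
CUM = {}
_low = Fraction(0)
for _sym, _p in PROBS.items():
    CUM[_sym] = (_low, _low + _p)
    _low += _p

def encode(text):
    low, high = Fraction(0), Fraction(1)
    for sym in text:
        s_low, s_high = CUM[sym]
        width = high - low
        low, high = low + width * s_low, low + width * s_high
    return (low + high) / 2   # any number in [low, high) would do

def decode(x, n):
    out = []
    for _ in range(n):
        for sym, (s_low, s_high) in CUM.items():
            if s_low <= x < s_high:
                out.append(sym)
                x = (x - s_low) / (s_high - s_low)
                break
    return "".join(out)

print(decode(encode("CCOC"), 4))
```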

Another limitation to Dave and Roger's approach, in my opinion, is
they didn't look at what they were compressing. They treated SMILES
as a text string. The more modern approach is to do statistical
modeling. They probably used a 0 order approach to find the most
common tokens in some training set. SMILES doesn't use all of the
characters so they could use "Q" for, say, "CC(" and "!" for "O)" if
those proved common enough.

Modeling the SMILES grammar

It's a 0 order model because it only looks at the statistics of each
token, without regard to what came before. "C" might occur 80% of the
time
and "O" the other 20%. A first order model looks at the frequencies
of when one token followed another. For example, the pair "CC" might
occur three times more often than "CO", so if there's a "C" then
there's a 25% chance that the next character is an "O", instead of the
20% predicted by the 0 order model. This means the encoder can
use slightly fewer bits in the output.
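The bit counts follow directly from the probabilities (an event with probability p costs -log2(p) bits under an ideal coder):

```python
from math import log2

# 0 order model: P(C) = 0.8, P(O) = 0.2.
# 1st order model, after a 'C': "CC" is three times as likely as
# "CO", so P(C|C) = 0.75 and P(O|C) = 0.25.
print(-log2(0.2))    # an 'O' costs about 2.32 bits in the 0 order model
print(-log2(0.25))   # but exactly 2 bits after a 'C' in the 1st order model
```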

Encoding tools like gzip support higher order encoders, but at some
point it's probably not worth going higher unless you have very
repetitive strings. Though computers are fast and memory is cheap, so
these days go ahead and use "-9" when you gzip something. These
systems are adaptive: they adjust the encoding frequencies to better
fit the data as it comes in, and the encoder and decoder use the same
adaptation algorithm so everything stays in sync.

The better you can model the input, the better the compression.
General purpose algorithms know nothing about SMILES strings, so I
figured I could probably make something which compresses better than
gzip. For Mick's case the SMILES are very simple: carbon atoms, '('
branches ')', one or two ring closures, no bond terms, nothing inside
of '[' brackets ']'. By construction I know the first character must
be a 'C' so there's no reason to actually encode it. If a '(' is in
the SMILES then there must be a ')' before the end.
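Those by-construction constraints are easy to state as a checker. This is an illustrative sketch, not Mick's code: a shape test for the restricted alphabet, a parenthesis-balance pass, and a necessary (though not sufficient) pairing test on ring-closure digits:

```python
import re

# Only 'C' atoms, '(' ')' branches, and digit ring closures, and the
# string must start with 'C'.
SHAPE = re.compile(r"C[C()0-9]*\Z")

def is_restricted_smiles(s):
    if SHAPE.match(s) is None:
        return False
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # ')' before its '('
                return False
    if depth != 0:               # unclosed '('
        return False
    # Ring closure digits come in pairs, so each digit must appear an
    # even number of times.  (Necessary, not sufficient: a digit can
    # legally be reused after it closes.)
    return all(s.count(d) % 2 == 0 for d in "0123456789")

print(is_restricted_smiles("CC1(CC)C1"))
```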

What I wanted to do was take his input SMILES strings (around 34 bytes
each), compress them down to 64 bits (8 bytes) or less, then insert
each one as a 64-bit number into a Judy1 bitset from the Judy library,
along with some special code to handle the few SMILES which don't
compress down that far. I thought of using a cryptographic hash like
MD5 or SHA-1, but those are 128 or 160 bits; folded down to 64 bits,
the birthday paradox says I would get duplicates after about 2**32
compounds. Mick generated about 2% of that number of compounds.

I want to compress short strings. This is not the normal case
for gzip, which compresses larger blocks of text, and well outside
what bzip2 does with its Burrows-Wheeler block transform. gzip
uses zlib for the compression. I got and tried out the zlib source
code. Its output starts with a signature, which wastes some bits,
and while I can prepopulate some of the frequency information, there's
a high startup cost which makes it infeasible to use for every string.

Hence I couldn't compare directly. What I ended up doing was taking
my test set, compressing it with gzip/bzip2, dividing by the number of
input compounds, and using that as an estimate for how well those
compress. Were I to use the gzip idea I would train on a test set to
generate statistics for the higher order models and figure out some
way to restart output anew for each token.

SMILES tokenization compression results

Here's the email summarizing my conclusions.

From: dalke@...
Subject: more on compression
Date: October 6, 2004 10:06:42 AM GMT+02:00
Following up on the conversation with Tudor and Mick during Friday
lunch, I experimented some with the combination of arithmetic coding
and better modeling of SMILES strings. (That is, don't assume they
are just characters but take into account that the first character must
be an atom, as well as the character after a '('. )
I tried two models. The first uses this state diagram for a SMILES
string. The numbers in parentheses are the statistics generated from
the test set Mick gave me.
START:
  'C' -> expect_any (20043)
expect_any:
  None -> END (20043)
  '(' -> expect_atom (92148)
  ')' -> expect_any (92148)
  '1' -> expect_any (39454)
  '2' -> expect_any (34766)
  '3' -> expect_any (10156)
  '4' -> expect_any (388)
  '5' -> expect_any (38)
  'C' -> expect_any (268353)
expect_atom:
  'C' -> expect_any (92148)
END:
I then compressed that same data set using the above coding. Here's a
summary of what I found
20,043 strings, 669,685 bytes (649,642 without newline)
13517 with 64 bits or less
6526 with more than 64 bits
largest needs 105 bits, 16 need >= 100 bits
1,243,386 bits total = 62.0 bits per compound (avg)
163,260 bytes (1,306,080 bits) quantized to 8 bits = 65.2 bits/cmpd
compressed size = 23% uncompressed size (at bit level)
compressed size = 24% uncompressed size (at quantized byte level)
I'm looking at the number < 64 bits because one of the Judy tree data
structures is optimized for word-sized boolean arrays. This says 2/3
of the compounds could be stored in that very compact manner.
The "quantized to 8 bits" is because some of the codings didn't take
up the full byte. That's the number rounded up to the nearest byte
size.
I next considered a better model of SMILES strings. The data set I
got has at most 4 nested '(' branches. That's
CCC1(CC(C(C(C)(C)C)C(C)(C)C)C2CC1C)C2
When inside of a '(' there cannot be the end of string, so there's no
need to include that possibility. Of course balanced parens are not
regular -- they need at least a context free grammar -- but if the
maximum depth is known beforehand then there is a regular grammar.
So I tried with this model of the SMILES strings
START:
  'C' -> expect_any (20043)
expect_any:
  None -> END (20043)
  '(' -> expect_atom_close1 (76129)
  '1' -> expect_any (33090)
  '2' -> expect_any (27112)
  '3' -> expect_any (8025)
  '4' -> expect_any (305)
  '5' -> expect_any (29)
  'C' -> expect_any (214609)
expect_any_close1:
  '(' -> expect_atom_close2 (15454)
  ')' -> expect_any (76129)
  '1' -> expect_any_close1 (6055)
  '2' -> expect_any_close1 (7309)
  '3' -> expect_any_close1 (2039)
  '4' -> expect_any_close1 (81)
  '5' -> expect_any_close1 (9)
  'C' -> expect_any_close1 (51073)
expect_any_close2:
  '(' -> expect_atom_close3 (560)
  ')' -> expect_any_close1 (15454)
  '1' -> expect_any_close2 (309)
  '2' -> expect_any_close2 (343)
  '3' -> expect_any_close2 (92)
  '4' -> expect_any_close2 (2)
  '5' -> expect_any_close2 (0)
  'C' -> expect_any_close2 (2649)
expect_any_close3:
  '(' -> expect_atom_close4 (5)
  ')' -> expect_any_close2 (560)
  '1' -> expect_any_close3 (0)
  '2' -> expect_any_close3 (2)
  '3' -> expect_any_close3 (0)
  '4' -> expect_any_close3 (0)
  '5' -> expect_any_close3 (0)
  'C' -> expect_any_close3 (22)
expect_any_close4:
  ')' -> expect_any_close3 (5)
  '1' -> expect_any_close4 (0)
  '2' -> expect_any_close4 (0)
  '3' -> expect_any_close4 (0)
  '4' -> expect_any_close4 (0)
  '5' -> expect_any_close4 (0)
  'C' -> expect_any_close4 (0)
expect_atom:
  'C' -> expect_any (0)
expect_atom_close1:
  'C' -> expect_any_close1 (76129)
expect_atom_close2:
  'C' -> expect_any_close2 (15454)
expect_atom_close3:
  'C' -> expect_any_close3 (560)
expect_atom_close4:
  'C' -> expect_any_close4 (5)
END:
You'll note that this is only good enough for coding the given data
set. Other compounds are likely to use an edge in the state diagram
which is listed as 0 probability. That's easy enough to fix, but it
means the numbers I'm about to give are on the low side.
On the other hand, I can do even better modeling. For example, for
some of your data you'll know how many atoms are expected, so I can
remove the
'expect_any' -> END
probabilities entirely. I also know that if the previous atom used a
ring closure number, then the next atom (which is bonded to it, since
they are sequential in the SMILES string, barring branches) cannot use
the same ring closure number. (I.e., 'C1C1' is not allowed, nor is
'C1(CC)C1'.)
These could be described with a state diagram, with great difficulty,
but it's probably better to augment these with a bit of code.
Okay, so here are the numbers
20,043 strings, 669,685 bytes (649,642 without newline)
19219 with 64 bits or less
824 with more than 64 bits
largest needs 115 bits, 6 need >= 100 bits
1,046,046 bits total = 52.2 bits per compound (avg)
139,548 bytes (1,116,384 bits) quantized to 8 bits = 55.7 bits/cmpd
compressed size = 19.5% uncompressed size (at bit level)
compressed size = 21% uncompressed size (at quantized byte level)
As you can see, there's a 15% reduction in average size, and nearly
all the coded strings fit in a 64 bit word.
The impressive thing is that the gzip'ed size is 122,295 bytes and
bzip2 uses only 112,487 bytes. Both general purpose compressors beat
my best effort even when I have a good model of the SMILES string.
I can conjecture on why this is true. Despite the detail of my model
it's still only a first order model. There is higher order
information in the system that I do not capture. For example, the
odds of "CCC" given "CC" is not the same as the odds of "CCCC" given
"CCC".
This can be fixed in the usual way of having a 2nd order model to
depend on this first order one, and a 3rd to depend on the second.
Though I see more advanced systems use Markov modeling.
By the way, I experimented with using zlib, the compression library
used by gzip, as a way to get this higher order model information
without having to do extra work. You can provide text to it to make
the coding dictionary so it can be pre-trained on the input data.
It didn't work out. The compressed strings were not well compressed.
Why? I suspect it's something to do with the stream oriented nature
of the zlib compression while these SMILES strings are rather small.
It may take a few characters for the higher order predictors to kick
in. I also tried to remove the obvious control bits it produced, so
as to only work with data bits. But I may not have done it correctly.
I have not looked into using the library behind bzip2.
This is all the work I'm going to put into this for now. The code for
this is all in Python. Available if desired.

I have the paper but haven't read it yet. At the poster presentation
I noted that because of its direct dependence on SMILES encoding
rules, 'C' and 'I' are given the same weight while 'Br' is different,
as is '[Se]'. I wondered if there could be some way of normalizing
the SMILES, at the syntax level. For example, assign a single
character to each element (convert "H" into chr(1), "C" into chr(6),
...) and strip out any atomic numbers and chirality information.
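The element-mapping part of that normalization is simple to sketch. The element table and regular expression here cover only a handful of elements for illustration, and this ignores brackets, aromatic atoms, and everything else the full idea would have to handle:

```python
import re

# Map each element symbol to the single byte of its atomic number,
# so 'Br' and 'Se' cost no more than 'C'.  Illustrative subset only.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16, "Cl": 17,
                 "Se": 34, "Br": 35, "I": 53}

# Two-letter symbols first, so 'Cl' isn't read as 'C' plus 'l'.
ELEMENT = re.compile("Cl|Br|Se|[HCNOSI]")

def normalize(smiles):
    return ELEMENT.sub(lambda m: chr(ATOMIC_NUMBER[m.group()]), smiles)

print([ord(c) for c in normalize("CBrC")])
```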

I prototyped a version at CUP using my Python tokenizer to show it's
feasible, but the performance isn't anywhere near what OpenEye wants.
There's also the obvious problem that finding a good "normalization"
is a lot of work for perhaps only a little improvement; it's best to
get something out first and find out if tuning is even needed.

He used straight gzip (not even zlib, so there's extra overhead) and
bzip2, and commented that he needed to generate N copies of the input
to get better compression statistics. That's probably because of the
header information and to let the higher order models kick in, though
he didn't say. When I went up to talk with him afterwards he asked
if I had published my results. Not being an academic, I hadn't. Maybe
I should. But this article is probably as informative to the world;
J Chem Inf Mod, for example, is subscriber-only.

Kolmogorov-Chaitin algorithmic complexity

He started his talk with a reference to "Kolmogorov-Chaitin
algorithmic complexity ... defined as the size of the program (that
you would feed to a universal Turing machine) required to print the
sequence and then terminate." [Quoting his zippotron
page]. The tricky thing about that definition, which I didn't mention
to him when we chatted, is that it's highly dependent on the Turing
machine. SMILES strings don't cover all of symbol space so there's no
need to have a general purpose TM. Quoting (thanks be to Google) Delahaye and Zenil in "On
the Kolmogorov-Chaitin Complexity for short sequences"

A drawback to Kolmogorov-Chaitin complexity $(K)$ is that it is
uncomputable in general, and that limits its range of
applicability. Moreover when strings are short, the dependence of $K$
on a particular universal Turing machine $U$ can be arbitrary. In
practice one can approximate it by computable compression
methods. However, such compression methods do not provide a good
approximation for short sequences -- shorter for example than typical
compiler lengths. In this paper we will suggest an empirical approach
to overcome this difficulty and to obtain a stable definition of the
Kolmogorov-Chaitin complexity for short sequences.

(I quote this only for the background text. The paper itself does not
appear to be relevant to this topic.)

SMILES strings are definitely "short sequences". Also note that my
1st order compression scheme, which is very knowledgeable about SMILES,
is not a good compression method for Zippotron-style comparison. It
really is the higher order modeler in gzip kicking in.

Ragel

I decided to write up here what I did, which gave me the incentive to
look at Ragel. I've been looking for a way to describe SMILES which
can do the parsing without having the overhead of using a regular
expression engine, and preferably something cross-platform. Because
I know Roger, I want the generated result to be as fast as
hand-written code, e.g., with no extra variable assignments which are
thrown away later. (Actually, it could potentially be faster, because
Ragel can generate several different types of parsers, e.g., if switch
statements are faster or slower than state tables for a given compiler
and architecture.)

I also wanted it to be flexible. For example, support for being able
to instrument state transitions to gather statistics and do predictive
modeling. Or to simplify the grammar (e.g., no one really uses a
charge of '++++' instead of '+4', and no toolkit supports the higher
level chiral symmetry classes).

The code in there started as a "count the number of atoms and bonds in
a SMILES string" program, then turned into a "compute the molecular
weight of each SMILES in a file" program as I worked on testing it,
then changed a bit more. The original molecular weight numbers were
wrong because they didn't include implicit hydrogens. I used
OpenEye's OEChem to convert my original test data into explicit atom
form and to get the molecular weight data.

The timings I got were that my code is "faster" than OpenEye's, but
it's a worthless comparison because I'm not doing the perception
needed to add implicit hydrogens myself and I'm not verifying that the
SMILES is correct. So I won't tell you actual timing numbers. I
will, however, compare the 650 or so lines I needed to the following
OEChem version