For now my driving example will be to print the Tanimoto values for
each compound in a data set, in order. I get to control where the
data comes from and the format that it's in, and I can precompute
whatever I want about the target data set. I deliberately don't care
about the identifiers, only the Tanimoto scores, so I can focus on the
search algorithm. In this essay I'll focus on computing the Tanimoto
between two fingerprints. The next one will compute the Tanimoto
between a single query fingerprint and a large number of targets.

Converting Babel's "fpt" format to "binary_fp" format

I'll start by converting the hex fingerprints from the "fpt" file into
a new binary file format, and I'll assume that there are 1024 bits in
the fingerprint. This new format will be very simple: each
fingerprint takes up 1024/8 = 128 bytes, and the fingerprints are
stored in the same order as in the input file. This is not a good
format for general use. I'll
talk about why later. Because it's a bad format, I'm using an ugly
and long extension to the filename. Don't use this format for real
work!

I decided to use Python for this conversion rather than C because
while my eventual goal is to do fast Tanimoto searches, this is a
preprocessing step where the performance is not critical. Plus, I
usually don't like parsing in C.

Parsing the "fpt" format turns out to be a bit of a pain, in almost
the same way that parsing
a FASTA file is a pain. The only way to know you are done with
one record is to read the first line of the next record, or to get to
the end of the file. I read the header line (which is different for the
first record vs. the other records in the file) and merge the hex
lines into a single hex fingerprint string.

(Earlier I made it easier for myself by assuming I could read the
entire file into memory, but because I'm going to parse a large
fingerprint file soon, I can't make that assumption now.)

After I have the fingerprint as a string of hex digits I need to convert
it into the raw byte values. I thought this would be tedious until I
remembered there's a "hex" encoding in Python which does that for me,
and which makes this step trivial.

Here's the conversion code, which I named "fpt_to_binary_fp.py" for
boringly obvious reasons.

# fpt_to_binary_fp.py
# Convert from an OpenBabel "fpt" format with hex fingerprints (use
# -xf to generate the fingerprints) into a binary fingerprint file.
# Fingerprints are written consecutively, with no title information or
# separator between the fingerprints. This is a very raw format and
# should not be used for anything other than benchmarking and
# bootstrapping into a better format.

# Parsing formats with no explicit end-of-record marker is a nuisance.
def read_records(infile):
    # Parse the header line for the first record, which looks like:
    #   >Strychnine 183 bits set
    # I'm not splitting on the " " because the title might have a space.
    # Using a regular expression proved a bit too cumbersome.
    line = infile.readline()
    assert line[0] == ">"
    # Turn ">Strychnine 183 bits set " into
    #   [">Strychnine", "183", "bits", "set"]
    words = line.rsplit(None, 3)
    assert words[-2:] == ["bits", "set"]
    title = words[0][1:]
    while 1:
        hex_blocks = []
        for line in infile:
            if line.startswith("Possible"):  # "Possible superstructure"
                continue
            if line[:1] == ">":
                break
            hex_blocks.append("".join(line.split()))
        else:
            # End of input
            yield title, "".join(hex_blocks)
            break
        # "line" contains a title line for the next record
        yield title, "".join(hex_blocks)
        # Parse the header line for the other records, which look like:
        #   >cocaine Tanimoto from Strychnine = 0.353234
        assert line[0] == ">"
        i = line.rindex("Tanimoto from")
        title = line[1:i].strip()

import sys, os

input_filename = sys.argv[1]
output_filename = os.path.splitext(input_filename)[0] + ".binary_fp"
infile = open(input_filename)
outfile = open(output_filename, "wb")
for title, hexdata in read_records(infile):
    #print repr(title)
    outfile.write(hexdata.decode("hex"))

Does it work? I'll convert the "drugs.fpt" file and compare its
contents with the newly created "drugs.binary_fp" file.

The "od -t cH" displays the hex values for the bytes of the
"drugs.binary_fp" file. You can see that the output file is simply
the input hex digits, expressed directly in binary. (You might have
to stare at it for a bit.) So yes, it does work.

Computing the popcount using only Python

Onwards to computing the Tanimoto scores! This will require computing
the popcount of the intersection of two fingerprints. Normal Python
is really quite bad at that. I could convert the 1024 bit/128 byte
fingerprint into a Python integer (like I did earlier). Computing the
bit intersection between two integers is easy with the '&'
operator, but finding the popcount is a bit complicated. (See below
for elaboration.)

I could keep everything as a byte string. Finding the popcount of the
fingerprint is the same as the sum of the popcount of each byte.
There are only 256 possible bytes, so that's a small lookup table.
The problem is computing the bits in common to both fingerprints.
Python doesn't support bit operations on characters, so I would have
to convert the bytes to a number before doing the &. This is
completely doable, but there's a somewhat better solution.

I can use the 'array' module
and create a byte array. This is a seldom used Python module but it
seems just the thing for this case.
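
A minimal sketch of the conversion (the file name and the 128-byte
record size come from the conversion step above; the variable names
are mine):

from array import array

# Read one 128-byte (1024-bit) fingerprint and store its bytes as
# small integers.
fp = open("drugs.binary_fp", "rb").read(128)
arr = array("B", fp)    # "B" means unsigned byte, values 0-255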

This converted the input string into an array of bytes, stored as
integers. To compute the total fingerprint popcount I'll sum the
popcount for each byte in the fingerprint. I can get that by looking
it up in a table of precomputed values for each possible byte. To
compute the number of bits in common between the two fingerprints I'll
compute the sum of the number of bits in common between the
corresponding bytes in the fingerprints.
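
Here is a rough sketch of both pieces; the table-building trick and
the helper names are just for illustration:

from array import array

# popcounts[i] is the number of "on" bits in the byte value i.
# Each entry is built from the entry with the low bit shifted away.
popcounts = [0] * 256
for i in range(1, 256):
    popcounts[i] = popcounts[i >> 1] + (i & 1)

def byte_popcount(fp):
    # Total number of on bits in a fingerprint byte string.
    return sum(popcounts[b] for b in array("B", fp))

def intersect_popcount(fp1, fp2):
    # Number of on bits in common between two equal-length fingerprints.
    return sum(popcounts[b1 & b2]
               for (b1, b2) in zip(array("B", fp1), array("B", fp2)))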

The two methods agree, so it's very likely that the code works,
although there could still be mistakes in the lookup table.

Computing the Tanimoto using only Python

Now that I have a popcount it's easy to compute the Tanimoto. In the
following program I'll print the Tanimoto for every fingerprint in the
specified binary_fp file, using the first fingerprint as the query
fingerprint. I'll include computing the Tanimoto of the query
fingerprint against itself.
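
Something along these lines would do it; the script name, the helpers,
and the overall structure are just a sketch (the 128-byte record size
comes from the 1024-bit fingerprints):

# tanimoto_from_first.py (sketch)
# Print the Tanimoto between the first fingerprint in a .binary_fp
# file and every fingerprint in that file, including itself.
import sys
from array import array

popcounts = [0] * 256
for i in range(1, 256):
    popcounts[i] = popcounts[i >> 1] + (i & 1)

def tanimoto(fp1, fp2):
    arr1 = array("B", fp1)
    arr2 = array("B", fp2)
    a = sum(popcounts[x] for x in arr1)
    b = sum(popcounts[x] for x in arr2)
    c = sum(popcounts[x & y] for (x, y) in zip(arr1, arr2))
    return float(c) / (a + b - c)

infile = open(sys.argv[1], "rb")
query = infile.read(128)      # the first fingerprint is the query
infile.seek(0)
while 1:
    target = infile.read(128)
    if len(target) < 128:
        break
    print tanimoto(query, target)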

How do I verify those are correct? OpenBabel also computed the
Tanimoto scores when it generated the ".fpt" file. You can go back
and compare (as I did) and see that they agree. The only difference
is that I'm printing the results to double precision while OpenBabel
prints them to single precision. Since the denominator is at most
1024, single precision is more than accurate enough. In fact, double
precision implies an accuracy that isn't warranted.

Can I do this faster without writing C code?

There are ways to make the popcount code go faster using standard
Python or available 3rd party extensions. I'll review some of them
but won't use them outside this section.

The standard Python distribution is not a good platform for doing fast
Tanimoto calculations. There is no builtin function for it, and while
I can build my own, they are all relatively slow. That's why I'll
shortly develop a C extension for doing this calculation. Before
doing that, just what other options exist?

I could convert the 1024 bit/128 byte fingerprint into a Python
integer. The first problem is converting the raw fingerprint bytes
into an integer. I don't know a direct way, but
int(s.encode("hex"), 16) works. Computing the bit
intersection is simple - just use "&". The problem is computing
the popcount.

The simple popcount algorithm is to see if the number has the 1 bit
set, then the 2 bit, then the 4 bit, ...
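
In code, for a Python long 'n', that looks something like this (the
function name is mine):

def simple_popcount(n):
    # Check the lowest bit, shift it away, and repeat until nothing is left.
    count = 0
    while n:
        count += n & 1
        n >>= 1
    return count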

By being more clever I could make this a bit faster, but I'm not even
going to try.

During proofreading I remembered another way to make the popcount of
the and-ed fingerprints faster - I could use 16 bit integers for
everything and look up the popcounts in a 16 bit lookup table. But
this doesn't work that well because Python doesn't have a 16 bit array
type. You would have to use one of the NumPy/Numeric/numarray arrays.

Another way to compute the popcount of an integer is to use Scott David
Daniels' 'bits' module. It's a C extension and is quite fast. I
used it a couple of years ago when I was doing some experiments with
fingerprint searches.

If you're going to use a third-party C extension then you might
instead consider using GMPY. It's based on the GNU Multiple Precision Library and
includes a popcount function. Digging in the source code I see that
the popcount function is implemented in hand-crafted assembly for
different platforms.
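
Using it would look something like this; I'm assuming the function is
spelled gmpy.popcount, and the fingerprint values here are made up:

import gmpy

fp1 = int("00ff13c8", 16)    # made-up fingerprint values
fp2 = int("80ff03c0", 16)
print gmpy.popcount(fp1 & fp2)    # number of bits in common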

Yet another possibility for computing the popcount of a fingerprint is
to convert the integer into a hex string, or generate a binary pickle
string, and work with the resulting string using the same lookup table
approach I did earlier. This has the advantage of working with any
length Python integer, but has a lot of overhead.
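
For example, a sketch of the hex-string route (the 16-entry table and
the helper name are mine):

# Number of on bits in each hex digit.
hex_popcounts = {"0": 0, "1": 1, "2": 1, "3": 2, "4": 1, "5": 2, "6": 2,
                 "7": 3, "8": 1, "9": 2, "a": 2, "b": 3, "c": 2, "d": 3,
                 "e": 3, "f": 4}

def hex_popcount(n):
    # "%x" formats an arbitrarily large integer as lowercase hex digits.
    return sum(hex_popcounts[c] for c in "%x" % n)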

Or I could stick with the raw byte string. The hard part there was
computing the number of bits in common between two characters. The
easiest solution is to convert the characters into integers, do the
bitwise-and, and convert back to a character, but that's a lot of
Python function calls for something that's a single assembly
instruction. I could perhaps use a lookup table, where the key in the table
is (query_byte, target_byte) and the value is the number of bits in
common. I don't like that solution. For one, it requires
256*256=65,536 entries.

Computing the Tanimoto in C, using GCC extensions

Computing the Tanimoto score is much easier in C, and even easier with
gcc because the latter includes a built-in function for computing the
popcount of an integer, a long, and a long long. I'll use gcc for now
so I don't need to define a popcount just yet. (I'll do that in the
next section.)

In the following I assume that integers are 32 bits in length, so
there are 32 integer parts to a 1024 bit fingerprint. I'll write each
unsigned integer in hex, so the number starts with "0x" (to mean
"hex") and ends with "u" (to mean "unsigned").

Fill in the real hex values, compile it, and run it, and you can see
that it gives the same Tanimoto score that OpenBabel computed.

To me this bit of C is much easier to understand than the same Python
code. Part of the reason is that I'm using the __builtin_popcount of
gcc but the real reason is that in C things like byte arrays, integer
arrays, and the bytes and bits in an integer are very fluid. It's
very easy to go from one form to another. Even if I have to implement
the bit count myself, as in the following bit of code, I can compute
over 1 million Tanimoto similarities per second, which is more than 20
times faster than my fastest Python code. Note though that this is
computing the score between the same pair over and over. Computing
Tanimoto scores for a large number of targets will be slower because
bringing that data to the CPU takes additional time, but there are
other ways to make the performance better. I'll talk about this in a
bit.

Computing the Tanimoto in C, without GCC extensions

Here's the equivalent C code using an 8-bit lookup table to compute
the popcount for a 32 bit integer.
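
Roughly like this; the table-building loop and the function names are
for illustration:

/* An 8-bit lookup table popcount, as a drop-in replacement for
 * __builtin_popcount in the previous program. */
#include <stdio.h>

static unsigned char byte_popcounts[256];

static void init_byte_popcounts(void) {
    int i;
    for (i = 1; i < 256; i++)
        byte_popcounts[i] = byte_popcounts[i >> 1] + (i & 1);
}

/* Popcount of a 32-bit integer: the sum of the popcounts of its 4 bytes. */
static int popcount32(unsigned int x) {
    return (byte_popcounts[x & 0xff] +
            byte_popcounts[(x >> 8) & 0xff] +
            byte_popcounts[(x >> 16) & 0xff] +
            byte_popcounts[(x >> 24) & 0xff]);
}

int main(void) {
    init_byte_popcounts();
    printf("%d\n", popcount32(0xf00f0001u));   /* prints 9 */
    return 0;
}

In the Tanimoto loop, popcount32() takes the place of
__builtin_popcount(), with init_byte_popcounts() called once at startup.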

Computing the pairwise Tanimoto in C is fast. I put the Tanimoto
calculation in a loop and calculated it 100,000,000 times, which took
42 seconds. The last time I looked, PubChem contained 19,218,991
compounds. Assuming everything is available in fast memory, I should
be able to search it in 8 seconds using a C program, or about 7
minutes using Python. Which would you prefer?

This is interesting. Doing the popcount using a lookup table took 43
seconds. Using __builtin_popcount took 109 seconds. I was not
expecting that.

Computing the Tanimoto in Python via a C extension

I prefer the speed of C but the general ease of programming using
Python. (That's a sort of "have my cake and eat it too" answer.)
Using Python for Tanimoto searches is slow in part because there is no
fast way to compute the popcount. I can fix that by writing a Python
extension using C. Rather than add only the popcount I'll have the
extension compute the entire Tanimoto between two fingerprints.

The easiest way to pass fingerprints around in Python is as a string,
and more specifically a byte string. (As compared to a Unicode
string.) I want my Python code to work like:
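
Something like this, say, where "tanimoto" is the module I'm about to
write and query_fp and target_fp stand for two byte strings of the
same length:

import tanimoto
score = tanimoto.similarity(query_fp, target_fp)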

Ctypes is a "foreign function interface" ("FFI") package for Python.
It lets Python load a shared library and call C functions directly.
You still have to write some interface code to tell ctypes how to call
the function, and perhaps write some extra error checking code, but
not much.

I find ctypes to be the easiest way to get Python to talk to C code,
but there are a few downsides. Calling a function through ctypes has
a bit more overhead (it's a bit slower) than using a special-purpose
compiled interface like the other solutions use. It's also very easy
to crash Python using ctypes.

But the advantage is that I can write C code without worrying about
writing the Python/C interface in C. I do have to worry, but I'll
worry about it from Python. Here's a simple C library to compute the
Tanimoto similarity between two fingerprints as stored in unsigned
byte strings.
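
Something along these lines should do; the exact signature is my
sketch, with the function taking the two byte strings and their common
length:

/* tanimoto.c (sketch) -- Tanimoto similarity between two fingerprints
 * stored as unsigned byte strings of num_bytes bytes each. */

static int byte_popcounts[256];
static int table_ready = 0;

static void init_byte_popcounts(void) {
    int i;
    for (i = 1; i < 256; i++)
        byte_popcounts[i] = byte_popcounts[i >> 1] + (i & 1);
    table_ready = 1;
}

double similarity(unsigned char *fp1, unsigned char *fp2, int num_bytes) {
    int i, a = 0, b = 0, c = 0;
    if (!table_ready)
        init_byte_popcounts();
    for (i = 0; i < num_bytes; i++) {
        a += byte_popcounts[fp1[i]];          /* bits on in fp1 */
        b += byte_popcounts[fp2[i]];          /* bits on in fp2 */
        c += byte_popcounts[fp1[i] & fp2[i]]; /* bits on in both */
    }
    return ((double) c) / (a + b - c);
}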

Very simple, yes? Next, I need to compile it. Again there are many
possible solutions. I'll use SCons, which is one of several
replacements for make. Instead of using a "Makefile" it uses a
configuration file named "SConstruct". The configuration language is
Python, extended with functions that know how to build binaries,
shared libraries, and other targets. This is very nice because
building shared libraries across multiple platforms is a nasty mess
and I don't want to do it myself.
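
The SConstruct file can be as short as this (assuming the C file from
above is named "tanimoto.c"):

# SConstruct -- build the shared library for the Tanimoto code.
env = Environment()
env.SharedLibrary("tanimoto", ["tanimoto.c"])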

The result is a "libtanimoto.dylib", which is what Macs use as a
run-time loadable shared library. Linux and other Unix OSes use the
extension ".so" and MS Windows uses ".dll".

Using ctypes to access the shared library

To load the function in ctypes I first load the shared library
(remember, the extension depends on which OS you are using -
'find_library' handles that detail) then get the function from the
module. I then have to annotate the function so ctypes knows how to
convert the Python call arguments to the right C types and how to
convert the C result into Python. Here's an example of me doing that
by hand, with a fingerprint of 4 bytes:
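
For instance, assuming find_library can locate the newly built
library, and using two made-up 4-byte fingerprints:

from ctypes import CDLL, c_char_p, c_int, c_double
from ctypes.util import find_library

# Load the shared library; find_library works out the platform-specific
# name (it's "libtanimoto.dylib" here on my Mac).
libtanimoto = CDLL(find_library("tanimoto"))

# Describe the C signature:
#   double similarity(unsigned char *fp1, unsigned char *fp2, int num_bytes)
similarity = libtanimoto.similarity
similarity.argtypes = [c_char_p, c_char_p, c_int]
similarity.restype = c_double

# Two made-up 4-byte fingerprints, with the length passed in by hand.
print similarity("\xde\xad\xbe\xef", "\xff\x01\xbe\xef", 4)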

If I know the number of bytes in the fingerprint then I can call this
interface directly. Most people would prefer a simpler solution and
not have to worry about passing in the length every time. After all,
Python strings already know their length.

Making a simplified interface

What I'll do is create a new Python module in the file "tanimoto.py".
This will load the shared library and do the ctypes annotation. Once
done I'll define a "high-level" function also called "similarity"
which gets the fingerprint lengths, checks that they are the same, and
calls the underlying C similarity implementation.
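
A sketch of what "tanimoto.py" might contain (the leading underscore
on the raw ctypes function is just my convention here):

# tanimoto.py (sketch)
from ctypes import CDLL, c_char_p, c_int, c_double
from ctypes.util import find_library

_lib = CDLL(find_library("tanimoto"))

# Low-level C function:
#   double similarity(unsigned char *fp1, unsigned char *fp2, int num_bytes)
_similarity = _lib.similarity
_similarity.argtypes = [c_char_p, c_char_p, c_int]
_similarity.restype = c_double

def similarity(fp1, fp2):
    # High-level interface: take the lengths from the strings themselves,
    # check that they match, then call the C implementation.
    if len(fp1) != len(fp2):
        raise TypeError("fingerprints must have the same length")
    return _similarity(fp1, fp2, len(fp1))

Trying it on the first two fingerprints in "drugs.binary_fp":

import tanimoto
f = open("drugs.binary_fp", "rb")
query = f.read(128)
target = f.read(128)
print tanimoto.similarity(query, target)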

Sweet! The right answer! Though we display it to double precision
while OpenBabel only reports to float precision. That's fine -
there are only 1024 bits, so the resolution of the answer can't be
better than 1/1024.

The high-level Python wrapper interface is easier to use than calling
the C interface exposed by ctypes. This will usually be the case. C
and Python have different world views. C libraries tend to optimize
towards performance and require that programmers think more about all
the details, while Python libraries tend to be safer and easier to use,
but slower. I'm hoping to show that it's possible to get both by
mixing the two together.

Performance timings for the C extension

I also did some performance timings. The public high-level interface
can compute about 340,000 Tanimoto scores per second (for fingerprints
of 1024 bits each). That's about 15% of the speed of the native C code but
about 7 times faster than my pure Python solution. Remember, the
high-level interface does extra work and then calls the low-level
ctypes function. If I remove that extra work then I can do about
440,000 calculations per second.

One nice thing about having a benchmark in place is that I can test
alternate implementations and see if I get better performance. My
Tanimoto calculation uses fingerprint bit counts 'a', 'b', and 'c',
based on the Daylight usage. This has general applicability because
other similarity measures can be described as a function of those
variables. The Tanimoto calculation has a simpler form, which is the
number of on bits in the bitwise-and of the two fingerprints divided
by the number of on bits in their bitwise-or. It is succinctly
written as:
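
    Tanimoto(A, B) = popcount(A & B) / popcount(A | B)

In the C loop that means two popcounts per pass instead of three.
Reusing the byte_popcounts table from the sketch above, the variation
looks roughly like:

double similarity(unsigned char *fp1, unsigned char *fp2, int num_bytes) {
    int i, in_common = 0, in_union = 0;
    for (i = 0; i < num_bytes; i++) {
        in_common += byte_popcounts[fp1[i] & fp2[i]];
        in_union  += byte_popcounts[fp1[i] | fp2[i]];
    }
    return ((double) in_common) / in_union;
}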

Doing a bitwise-or is faster than doing another popcount. My
benchmark became about 1% to 2% faster with this variation. That's
almost in the noise, but useful to note if you want to eke out a bit
more performance.