Andrew Dalke's writingshttp://www.dalkescientific.com/writings/diary/index.htmlWritings from the software side of bioinformatics and
chemical informatics, with a heaping of Python thrown in for good
measure. Code to taste. Best served at room temperature.Fri, 28 Nov 2014 14:36:20 GMTPyRSS2Gen-1.0.0http://blogs.law.harvard.edu/tech/rssSimilarity web servicehttp://www.dalkescientific.com/writings/diary/archive/2014/11/28/similarity_web_service.html<P>
Now that I've <a href="http://www.dalkescientific.com/writings/diary/archive/2014/11/28/indexing_chembl.html">indexed
ChEMBL for record lookup and similarity search</a>, I can write a
simple interactive web server. By "interactive" I mean that it will
show you results, if available, as you enter the SMILES or record
id. This is possible because <a href="http://chemfp.com/">chemfp</a>
takes under 0.1 seconds to search the 1.4 million fingerprints of
ChEMBL and report the results. By "simple" I mean there will be no
sketcher and a plain looking user interface. I'll also stick to using
jQuery for the browser-side code; I usually use <a
href="http://knockoutjs.com/">Knockout</a> for this sort of app.
</P><P>
I also decided to try out the <a
href="http://flask.pocoo.org/">Flask</a> web development framework
instead of using Django. This has the advantage that my web service
code fits into a single file.
</P><P>
Here's what the result looks like, so you can see where I'm going:<br />
<img style="border: 3px solid orange" src="http://dalkescientific.com/writings/diary/chembl_search_demo.png">
</P>
<h2>Multiple threads</h2>
<P>
Flask is a multi-threaded server, which means I need to configure a
few things to work correctly.
</P><P>
The record data is in a SQLite database. SQLite is <a
href="http://www.sqlite.org/faq.html#q6">threadsafe</a> but "<a
href="https://docs.python.org/2/library/sqlite3.html#multithreading">Older
SQLite versions had issues with sharing connections between
threads. That's why the Python module disallows sharing connections
and cursors between threads.</a>"
</P><P>
I don't write to the database and I'll be careful and create a new
cursor for each request, so I'm safe. However, I do need to disable
that check, which I do through the check_same_thread parameter:
<pre class="code">
import sqlite3
database_filename = "chembl_19.sqlite3"
db = sqlite3.connect(database_filename, check_same_thread=False)
</pre>
Chemfp can be compiled for OpenMP, which is the
default. Unfortunately, on the Mac, multi-threaded code and OpenMP
don't mix. I need to tell chemfp to always use the non-OpenMP code
path, as otherwise the program will crash on the first similarity
search:
<pre class="code">
import chemfp
chemfp.set_num_threads(1)
</pre>
</P>
<h2>Load the data sets</h2>
<P>
I'll open the SQLite database and make sure it has compounds, then
open the fingerprint file and make sure it has fingerprints. Hopefully
the two are the same! I made sure when I created the fingerprint file
to include the fingerprint type in the metadata. Now I can use the
type string to get the fingerprint type object, and from that I can
get the correct toolkit to use.
<pre class="code">
database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"
db = sqlite3.connect(database_filename, check_same_thread=False)
cursor = db.cursor()
try:
num_compounds, = cursor.execute("SELECT COUNT(*) FROM Compounds").fetchone()
except sqlite3.ChemFPError, err:
raise SystemExit("Cannot connect to SQLite database %r: %s" % (database_filename, err))
del cursor # remove the global cursor so there's no chance of accidental reuse
if num_compounds == 0:
sys.stderr.write("WARNING: Database %r contains no compounds." % (database_filename,))
# Load the fingerprints
arena = chemfp.load_fingerprints(fingerprint_filename)
# Use the fingerprint type string from the metadata to get the fingerprint type object
try:
fptype = arena.get_fingerprint_type()
except chemfp.ChemFPError, err:
raise SystemExit("Unable to load the fingerprint type for fingerprint file %r: %s"
% (fingerprint_filename, err))
# Get the toolkit associated with this fingerprint type
toolkit = fptype.toolkit
if toolkit is None:
raise SystemExit("No toolkit available to generate fingerprints of type %r"
% (fptype.base_name,))
# Report information about the number of compounds and fingerprints
if num_compounds == len(arena):
sys.stderr.write("Loaded %d compounds and fingerprints of type %r\n"
% (num_compounds, fptype.get_type()))
else:
sys.stderr.write("Loaded %d compounds and %d fingerprints of type %r\n"
% (num_compounds, len(arena), fptype.get_type()))
</pre>
When my server start (or in debug mode, when it restarts) it prints:
<pre class="code">
Loaded 1404532 compounds and fingerprints of type 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
</pre>
The FPB format for chemfp-1.2 was developed in part so that web server
reloads, which is important during web development, would be quick. I
basically don't notice the time it takes for the above code to run.
</P>
<h2>Defining the web service API</h2>
<P>
1) I'll have a single similarity search API, named '/search', which take
the following parameters:
<ul>
<li>q: the query, which is a SMILES string or a record id</li>
<li>k: the number of neighbors to find (default: 3)</li>
<li>threshold: the minimum required threshold for a valid result</li>
<li>oformat: the output format. "ids" return the match identifiers,
one per line; "smiles" returns a tab-separated file with the SMILES in
the first column, id in the second, and score in the third; "sdf"
returns the original SD records, each with a new field called "SCORE"
containing the similarity score, and "json" returns a JSON document
containing the ids, SMILES, scores, and a few other pieces of
information.</li>
</ul>
</P><P>
2) I'll have the home page ('/') return an interactive search brower.
</P><P>
3) I'll have a way to get the record for CHEMBL286759 by going to a
URL like "/CHEMBL286759.sdf". In Flask terminolgy, I need to match
anything of the form "/&lt;string:id&gt;.sdf", though if the given id
doesn't exist I'll need to return a <tt>404 NOT FOUND</tt> error.
</P><P>
All of them require hooking on to the Flask 'app', which is defined like this:
<pre class="code">
from flask import Flask, Response, jsonify, request
app = Flask(__name__)
</pre>
</P>
<h3>Get a record given its id</h3>
<P>
I'll start with the easiest, where "/CHEMBL286759.sdf" returns the
original SDF record for CHEMBL286759.
<pre class="code">
# get a record given its id, or return "404 Not Found"
@app.route("/&lt;string:id&gt;.sdf")
def get_record(id):
cursor = db.cursor()
cursor.execute("SELECT record FROM Compounds WHERE id = ?", (id,))
row = cursor.fetchone()
if not row:
page_not_found()
record = zlib.decompress(row[0])
return Response(record, mimetype="text/plain")
</pre>
</P><P>
I don't think it needs much explanation, but it might help to look at
the interactive example from my <a href="http://dalkescientific.com/writings/diary/archive/2014/11/28/indexing_chembl.html">previous essay</a>.
</P><P>
This calls a error handler function to report the 404 NOT FOUND error,
which is:
<pre class="code">
@app.errorhandler(404)
def page_not_found(error=None):
return "Not found", 404
</pre>
</P>
<h3>Similarity search - parse query arguments</h3>
<P>
The similarity search code is pretty involved, but mostly because it
need to handle different input and output cases. The input contains a
lot of code like this:
<pre class="code">
# Similarity search
@app.route("/search")
def similarity_search():
...
k = request.args.get("k", "3")
...
try:
k = int(k)
except ValueError:
return _error_response("bad value for k")
</pre>
which get the query parameter "k" if it exists, or uses the default of
"3" if it doesn't. It then converts it into an integer, and if that
fails it returns an error response. That function constructs a JSON
document:
<pre class="code">
def _error_response(errmsg):
return jsonify({
"status": "error",
"message": errmsg,
"columns": ["id", "score", "smiles", "exact match"],
"hits": [],
})
</pre>
</P>
<h3>Similarity search - handle the SMILES or record id</h3>
<P>
The "q" parameter is more complicated, because it can be a SMILES
string or a query id. I'll resolve this by trying to parse it as a
SMILES string. If that fails, I'll see if there's a record in the
database with that name, which also means I need a new cursor for the
database connection:
<pre class="code">
def similarity_search():
...
q = request.args.get("q", "").strip()
...
cursor = db.cursor()
# Flask queries are Unicode strings, but SMILES and record identifiers
# can only be ASCII, so convert and report any failures
try:
q = str(q)
except UnicodeEncodeError:
return _error_response("non-ASCII query")
errmsg, mol = _get_molecule(cursor, q)
if errmsg:
return _error_response(errmsg)
...
</pre>
Here's where I implement the business logic. Given the toolkit, parse
the query as a SMILES. If that fails (chemfp.ParseError is also a
ValueError) then get the SMILES for the given id. If that fails, or is
the SMILES cannot parsed, then report an error. Otherwise return the
newly parsed molecule object. (It's actually a two-value result, where
the first is the error message and the second is the molecule. At
least one of them will be None.)
<pre class="code">
def _get_molecule(cursor, query):
if not query:
return None, None
# Is this a valid SMILES?
try:
mol = toolkit.parse_molecule(query, "smistring")
except ValueError, err:
# No. Then is this an identifier?
match = cursor.execute("SELECT smiles FROM Compounds WHERE id = ?", (query,)).fetchone()
if not match:
return "Not a valid SMILES or identifier", None
smiles = match[0]
try:
mol = toolkit.parse_molecule(smiles, "smistring")
except ValueError:
return "Invalid SMILES in database!", None
return None, mol
</pre>
</P>
<h3>Similarity search - compute the fingerprint</h3>
<P>
It's only a couple of lines to compute the fingerprint for the
molecule and search the FPB file for the k-nearest neighbors within
the given threshold.
<pre class="code">
fp = fptype.compute_fingerprint(mol)
hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)
</pre>
</P>
<h3>Similarity search - handle the oformat parameter and return ids</h3>
<P>
The 'oformat', short for 'output format', specifies the desired output
format if there are no errors. The "ids" output format returns the
hits as a list of target identifiers, with one id per line.
<pre class="code">
def similarity_search():
...
oformat = request.args.get("oformat", "json")
...
if oformat not in ("json", "ids", "smiles", "sdf"):
oformat = "json"
...
hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)
...
if oformat == "ids":
return Response("\n".join(hits.get_ids()) + "\n", mimetype="text/plain")
...
</pre>
</P>
<h3>Similarity search - oformat == "smiles"</h3>
<P>
Returning SMILES is a bit more complicated because I need to get the
SMILES string for each hit id, and that requires a database lookup. I
decided to make a function call "get_columns()", which takes the list
of requested columns and the hits of interest, and searches for the
intersection of them.
</P><P>
As far as I can tell, there's no easy way to ask for all of the rows
for a given list of ids. There is the "WHERE id IN (?, ?, ... ?)"
construction, but I need a '?' for each query id. That's not hard to
do, because this is Python, but I feel like there should be a better
way to make the query.
<pre class="code">
def get_columns(cursor, columns, hits):
sql_query = "SELECT %s FROM Compounds WHERE id IN (" % (", ".join(columns),)
sql_query += ",".join("?"*len(hits)) + ")"
cursor.execute(sql_query, tuple(hits.get_ids()))
return cursor
</pre>
I use the get_columns() function to get a list of 2-element rows,
which I convert into a dictionary. There's the off chance that the
SQLite database and the fingerprint database can get out of
synchronization. A loop through the hits, which has the ids and
scores, also gives me the SMILES. If a given id is missing, I'll use
"MISSING" as the SMILES string for it. And I'll format the score to
three decimal places.
<pre class="code">
if oformat == "smiles":
id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
lines = []
for id, score in hits.get_ids_and_scores():
smiles = id_to_cansmiles.get(id, "MISSING")
lines.append("%s\t%s\t%.3f\n" % (smiles, id, score))
return Response("".join(lines), mimetype="text/plain")
</pre>
</P>
<h3>The text_toolkit and "text molecules"</h3>
<P>
The SDF output is a bit more complicated because I need to decompress
the compressed SD record, and because I want to modify the original
record to have a new tag containing the "SCORE". For that I'll use the
text_toolkit, but this time for its "text molecule" toolkit API. It
implements the same toolkit API as rdkit_toolkit, openeye_toolkit and
rdkit_toolkit, but instead of being real molecule objects that knows
chemistry, it's really just a light-weight container for the record,
with enough knowledge of SD records to do some light changes to the
tags.
</P><P>
Here's what I mean. I'll start with a record (I'll leave out the left
side of Python's interactive shell to make it easier to copy and
paste).
<pre class="code">
record = """\
CHEMBL17564
SciTegic10311311082D
1 0 0 0 0 0 999 V2000
0.0000 0.0000 0.0000 C 0 0
M END
> <chembl_id>
CHEMBL17564
$$$$
</pre>
I'll then "parse" the record into a text molecule, using the
text_toolkit, and some of the SD tag API:
<pre class="code">
&gt;&gt;&gt; <b>from chemfp import text_toolkit</b>
&gt;&gt;&gt; <b>text_mol = text_toolkit.parse_molecule(record, "sdf")</b>
&gt;&gt;&gt; <b>text_mol.get_tag("chembl_id")</b>
'CHEMBL17564'
&gt;&gt;&gt; <b>text_mol.get_tag_pairs()</b>
[('chembl_id', 'CHEMBL17564')]
&gt;&gt;&gt; <b>text_mol.add_tag("SCORE", "0.567")</b>
&gt;&gt;&gt; <b>text_mol.get_tag_pairs()</b>
[('chembl_id', 'CHEMBL17564'), ('SCORE', '0.567')]
</pre>
Now I'll convert the modified record back into a string, with the new
tag(s) at the end:
<pre class="code">
&gt;&gt;&gt; print text_toolkit.create_string(text_mol, "sdf")
CHEMBL17564
SciTegic10311311082D
1 0 0 0 0 0 999 V2000
0.0000 0.0000 0.0000 C 0 0
M END
&gt; &lt;chembl_id&gt;
CHEMBL17564
&gt; &lt;SCORE&gt;
0.567
$$$$
</pre>
As a side note, I think this API is too complicated. A future version
of chemfp will likely have an "add_sdf_tag" which operates directly on
strings rather than through text molecule objects.
</P>
<h3>Similarity search - oformat == "sdf"</h3>
<P>
Once you know how to use the text_toolkit to add an SD tag, the SDF
output format handler is pretty straight-forward:
<pre class="code">
if oformat == "sdf":
id_to_sdfz = dict(get_columns(cursor, ["id", "record"], hits))
records = []
for id, score in hits.get_ids_and_scores():
recordz = id_to_sdfz.get(id, None)
if recordz is None:
continue
record = zlib.decompress(recordz)
text_mol = text_toolkit.parse_molecule(record, "sdf")
text_mol.add_tag("SCORE", "%.3f" % (score,))
writers.append(text_toolkit.create_string(text_mol, "sdf"))
return Response("".join(writer), mimetype="text/plain")
</pre>
</P>
<h3>Similarity search - oformat == "json"</h3>
<P>
I decided that I want the JSON output to have the id, score, and
SMILES string, and also have a flag (either 0 or 1) if it matches the
query SMILES:
<pre class="code">
query_cansmiles = toolkit.create_string(mol, "canstring")
</pre>
I then need to get the ids and smiles for the hits
<pre class="code">
id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
</pre>
get the results, as a 4-element tuple of (id, score, SMILES, and match flag)
<pre class="code">
results = []
for id, score in hits.get_ids_and_scores():
cansmiles = id_to_cansmiles.get(id, None)
results.append( (id, score, cansmiles, query_cansmiles == cansmiles) )
</pre>
and return the results, along with some timing information I haven't yet described:
<pre class="code">
return jsonify({
"status": "ok",
"message": "",
"search_time": fp_search_time,
"lookup_time": smiles_lookup_time,
"total_time": time.time() - start_time,
"hits": results,
})
</pre>
</P>
<h2>The GUI</h2>
<P>
The code to display the search page is simple, because it's just
returning a HTML as a string.
<pre class="code">
@app.route("/")
def homepage():
return Response(js_code, mimetype="text/html")
</pre>
But how does the Javascript work? You should be aware that I am not as
good of a Javascript programmer as I am a Python programmer. What you
see here is advice, and not a recommendation as for best practices.
</P><P>
The main part of it says that if any of the values for the
"threshold", "k", or "smiles" form elements should change, then call
make_request(), and when the form starts also call make_request().
<pre class="code">
$("#threshold").change(make_request);
$("#k").change(make_request);
$("#smiles").on("input", make_request);
make_request(); // In case fields remain after a browser refresh.
</pre>
The make_request() function starts by getting request parameters from
the different form element, though if the query is empty it won't do a
new search.
<pre class="code">
var make_request = function() {
var k = $("#k").val();
var threshold = $("#threshold").val();
var smiles = $("#smiles").val();
if (smiles === "") {
$("#request").text("(empty query)");
return;
}
</pre>
Otherwise it reports the query parameters and starts an AJAX request
with the correct parameters. This request can either succeed or fail.
I'll leave out the success part for now. On failure you can see that
it adds a message to the "error" text and clears out the "message"
text.
<pre class="code">
$("#request").text("threshold=" + threshold + " k=" + k + " q=" + smiles);
$.ajax({
url: "search",
data: [
{name: "q", value: smiles},
{name: "k", value: k},
{name: "threshold", value: threshold}
]
}).done(function(data) {
... code to call on success .. see below ....
}).fail(function() {
$("#error").text("Server failure");
$("#message").text("");
});
</pre>
</P><P>
"Success" is relative. It might successfully report that the SMILES
string was invalid, so unless the status is "ok" I'll report the
resulting message as an error:
<pre class="code">
if (data["status"] !== "ok") {
$("#error").text(data["message"]);
$("#message").text("");
return;
}
</pre>
Otherwise clear the error message and report some timing numbers.
<pre class="code">
$("#error").text("");
var hits = data["hits"];
var round_trip_time = (new Date() - start_time) / 1000.0;
$("#message").text("Found " + hits.length + " hits. Times for " +
" Search: " + data["search_time"].toFixed(3) +
" Lookup: " + data["lookup_time"].toFixed(3) +
" Handler: " + data["total_time"].toFixed(3) +
" Round-trip: " + round_trip_time.toFixed(3) + " seconds.");
</pre>
The timings here are all in seconds, and reported to 3 digits. The
meanings are:
<ul>
<li>Search: the time for the fingerprint search</li>
<li>Lookup: the time to get the SMILES strings</li>
<li>Handler: the total time in my Flask handler</li>
<li>Round-trip: the elapsed time from just before the AJAX request
to just before reporting the time</li>
</ul>
</P><P>
I then clear the table rows that contain the results, and convert the
results into new rows. It's a bit more complicated than I would like;
this is where client-side templates can be nice.
<pre class="code">
$(".result-row").remove();
var tbody = $("tbody");
$.each(data["hits"], function(rowno, row) {
var tr = $(document.createElement("tr")).addClass("result-row");
tbody.append(tr);
$.each(row, function(colno, value) {
var ele = $(document.createElement("td"));
if (colno == 0) {
// The identifier, wrapped in a hyperlink to the orginal SD record
var a = $(document.createElement("a"));
a.attr({"href": value + ".sdf"});
a.text(value);
ele.append(a);
} else if (colno == 1) {
// The score, to three decimals
ele.addClass("score");
ele.text(value.toFixed(3));
} else {
ele.text(value);
}
tr.append(ele);
});
});
</pre>
</P><P>
The last part is to change the "Download" button. Because searches are
so fast, all I do is tell it the new query parameters, and let the
user choose the download format. The form download is actually a
regular query against the "/search" handler. This is easier than
storing all of the ids and scores from the current result and making a
new API to somehow resynthesize the correct results.
</P><P>
Here's the form I'm updating:
<pre class="code">
&lt;form action="search" method="GET"&gt;
&lt;input type="hidden" name="q"&gt;
&lt;input type="hidden" name="k"&gt;
&lt;input type="hidden" name="threshold"&gt;
Choose format:
&lt;select name="oformat"&gt;
&lt; &lt;option value="ids"&gt;ids&lt;/option&gt;
&lt; &lt;option value="smiles"&gt;SMILES&lt;/option&gt;
&lt; &lt;option value="sdf" SELECTED&gt;SDF&lt;/option&gt;
&lt; &lt;option value="json"&gt;JSON&lt;/option&gt;
&lt;/select&gt;
&lt;input id="submit-button" type="submit" disabled="disabled" value="Download" /&gt;
&lt;/form&gt;
</pre>
and the update code is:
<pre class="code">
$("form input[name=q]").val(smiles);
$("form input[name=k]").val(k);
$("form input[name=threshold]").val(threshold);
$("#submit-button").prop("disabled", data["hits"].length == 0);
if (data["hits"].length == 1) {
$("#submit-button").val("Download 1 record");
} else {
$("#submit-button").val("Download " + data["hits"].length + " records");
}
</pre>
I also did a bit extra to report the number of available records to
download, and don't allow a download if there are no records.
</P>
<h2>The final code</h2>
<P>
I left out a few pieces here and there, like most of the HTML and some
of the timing details. Here's the final code:
<pre class="code">
import sys
import sqlite3
import zlib
import time
from flask import Flask, Response, jsonify, request
import chemfp
from chemfp import text_toolkit
database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"
# <b>Threads and OpenMP don't work well under Mac OS X.</b>
# <b>Tell chemfp to not use OpenMP.</b>
chemfp.set_num_threads(1)
db = sqlite3.connect(database_filename, check_same_thread=False)
cursor = db.cursor()
try:
num_compounds, = cursor.execute("SELECT COUNT(*) FROM Compounds").fetchone()
except sqlite3.ChemFPError, err:
raise SystemExit("Cannot connect to SQLite database %r: %s" % (database_filename, err))
del cursor # <b>remove the global cursor so there's no chance of accidental reuse</b>
if num_compounds == 0:
sys.stderr.write("WARNING: Database %r contains no compounds." % (database_filename,))
# <b>Load the fingerprints</b>
arena = chemfp.load_fingerprints(fingerprint_filename)
# <b>Use the fingerprint type string from the metadata to get the fingerprint type object</b>
try:
fptype = arena.get_fingerprint_type()
except chemfp.ChemFPError, err:
raise SystemExit("Unable to load the fingerprint type for fingerprint file %r: %s"
% (fingerprint_filename, err))
# <b>Get the toolkit associated with this fingerprint type</b>
toolkit = fptype.toolkit
if toolkit is None:
raise SystemExit("No toolkit available to generate fingerprints of type %r"
% (fptype.base_name,))
# <b>Report information about the number of compounds and fingerprints</b>
if num_compounds == len(arena):
sys.stderr.write("Loaded %d compounds and fingerprints of type %r\n"
% (num_compounds, fptype.get_type()))
else:
sys.stderr.write("Loaded %d compounds and %d fingerprints of type %r\n"
% (num_compounds, len(arena), fptype.get_type()))
app = Flask(__name__)
@app.errorhandler(404)
def page_not_found(error=None):
return "Not found", 404
js_code = """<span style="color: blue">&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Search demo&lt;/title&gt;
&lt;script type="text/javascript" src="https://code.jquery.com/jquery-2.1.1.min.js"&gt;&lt;/script&gt;
&lt;style&gt;
#error {
color: red;
}
table {
border-collapse: collapse;
}
table tr:first-child + tr td {
border-top: 1px solid black;
}
table tr:last-child td {
border-bottom: 1px solid black;
}
.score {
width: 5em;
}
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;h2&gt;Search demo&lt;/h2&gt;
Target contains NUM_FINGERPRINTS fingerprints of type FINGERPRINT_TYPE&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
k: &lt;select id="k"&gt;
&lt;option value="1"&gt;1&lt;/option&gt;
&lt;option value="3"&gt;3&lt;/option&gt;
&lt;option value="5" SELECTED&gt;5&lt;/option&gt;
&lt;option value="10"&gt;10&lt;/option&gt;
&lt;option value="100"&gt;100&lt;/option&gt;
&lt;option value="1000"&gt;1000&lt;/option&gt;
&lt;/select&gt;
threshold: &lt;select id="threshold"&gt;
&lt;option value="1.0"&gt;100%&lt;/option&gt;
&lt;option value="0.99"&gt;99%&lt;/option&gt;
&lt;option value="0.98"&gt;98%&lt;/option&gt;
&lt;option value="0.95"&gt;95%&lt;/option&gt;
&lt;option value="0.9"&gt;90%&lt;/option&gt;
&lt;option value="0.8 SELECTED"&gt;80%&lt;/option&gt;
&lt;option value="0.7"&gt;70%&lt;/option&gt;
&lt;option value="0.5"&gt;50%&lt;/option&gt;
&lt;option value="0.2"&gt;20%&lt;/option&gt;
&lt;option value="0.0"&gt;0%&lt;/option&gt;
&lt;/select&gt;
SMILES or id: &lt;input id="smiles" size="50"&gt;&lt;/input&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
Request: &lt;span id="request"&gt;&lt;/span&gt;&lt;br /&gt;
Response: &lt;span id="error"&gt;&lt;/span&gt;&lt;span id="message"&gt;(no response available)&lt;/span&gt;
&lt;table&gt;
&lt;tr&gt;&lt;th&gt;id&lt;/th&gt;&lt;th class="score"&gt;score&lt;/th&gt;&lt;th&gt;SMILES&lt;/th&gt;&lt;th&gt;Exact match?&lt;/th&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;form action="search" method="GET"&gt;
&lt;input type="hidden" name="q"&gt;
&lt;input type="hidden" name="k"&gt;
&lt;input type="hidden" name="threshold"&gt;
Choose format:
&lt;select name="oformat"&gt;
&lt;option value="ids"&gt;ids&lt;/option&gt;
&lt;option value="smiles"&gt;SMILES&lt;/option&gt;
&lt;option value="sdf" SELECTED&gt;SDF&lt;/option&gt;
&lt;option value="json"&gt;JSON&lt;/option&gt;
&lt;/select&gt;
&lt;input id="submit-button" type="submit" disabled="disabled" value="Download" /&gt;
&lt;/form&gt;
&lt;script type="text/javascript"&gt;
<b>var make_request = function()</b> {
var k = $("#k").val();
var threshold = $("#threshold").val();
var smiles = $("#smiles").val();
if (smiles === "") {
$("#request").text("(empty query)");
return;
}
var start_time = new Date();
$("#request").text("threshold=" + threshold + " k=" + k + " q=" + smiles);
<b>$.ajax({</b>
url: "search",
data: [
{name: "q", value: smiles},
{name: "k", value: k},
{name: "threshold", value: threshold}
]
}).<b>done(function(data) {</b>
if (data["status"] !== "ok") {
$("#error").text(data["message"]);
$("#message").text("");
return;
}
// <b>Summarize the timings</b>
$("#error").text("");
var hits = data["hits"];
var round_trip_time = (new Date() - start_time) / 1000.0;
$("#message").text("Found " + hits.length + " hits. Times for " +
" Search: " + data["search_time"].toFixed(3) +
" Lookup: " + data["lookup_time"].toFixed(3) +
" Handler: " + data["total_time"].toFixed(3) +
" Round-trip: " + round_trip_time.toFixed(3) + " seconds.");
// <b>Display the results as rows in the table</b>
$(".result-row").remove();
var tbody = $("tbody");
$.each(data["hits"], function(rowno, row) {
var tr = $(document.createElement("tr")).addClass("result-row");
tbody.append(tr);
$.each(row, function(colno, value) {
var ele = $(document.createElement("td"));
if (colno == 0) {
// The identifier, wrapped in a hyperlink to the orginal SD record
var a = $(document.createElement("a"));
a.attr({"href": value + ".sdf"});
a.text(value);
ele.append(a);
} else if (colno == 1) {
// The score, to three decimals
ele.addClass("score");
ele.text(value.toFixed(3));
} else {
ele.text(value);
}
tr.append(ele);
});
});
// <b>Configure the download form</b>
$("form input[name=q]").val(smiles);
$("form input[name=k]").val(k);
$("form input[name=threshold]").val(threshold);
$("#submit-button").prop("disabled", data["hits"].length == 0);
if (data["hits"].length == 1) {
$("#submit-button").val("Download 1 record");
} else {
$("#submit-button").val("Download " + data["hits"].length + " records");
}
}).<b>fail(function() {</b>
$("#error").text("Server failure");
$("#message").text("");
});
};
$().<b>ready(function() {</b>
$("#threshold").change(make_request);
$("#k").change(make_request);
$("#smiles").on("input", make_request);
make_request(); // In case fields remain after a browser refresh.
});
&lt;/script&gt;
&lt;/body&gt;</span>
""".replace("NUM_FINGERPRINTS", str(len(arena))).replace("FINGERPRINT_TYPE", fptype.get_type())
# <b>Return the client-side application</b>
@app.route("/")
def homepage():
return Response(js_code, mimetype="text/html")
# <b>Parse the query as either a SMILES string or a database entry</b>
# <b>Return the (error string, None), or (None, parsed molecule)</b>
def _get_molecule(cursor, query):
if not query:
return None, None
# <b>Is this a valid SMILES?</b>
try:
mol = toolkit.parse_molecule(query, "smistring")
except ValueError, err:
# <b>No. Then is this an identifier?</b>
match = cursor.execute("SELECT smiles FROM Compounds WHERE id = ?", (query,)).fetchone()
if not match:
return "Not a valid SMILES or identifier", None
smiles = match[0]
try:
mol = toolkit.parse_molecule(smiles, "smistring")
except ValueError:
return "Invalid SMILES in database!", None
return None, mol
# <b>Return an error document in JSON format</b>
def _error_response(errmsg):
return jsonify({
"status": "error",
"message": errmsg,
"columns": ["id", "score", "smiles", "exact match"],
"hits": [],
})
# <b>Get the selected columns for the given hits</b>
def get_columns(cursor, columns, hits):
sql_query = "SELECT %s FROM Compounds WHERE id IN (" % (", ".join(columns),)
sql_query += ",".join("?"*len(hits)) + ")"
cursor.execute(sql_query, tuple(hits.get_ids()))
return cursor
# <b>Similarity search</b>
@app.route("/search")
def similarity_search():
start_time = time.time()
q = request.args.get("q", "").strip()
k = request.args.get("k", "3")
oformat = request.args.get("oformat", "json")
threshold = request.args.get("threshold", "0.0")
cursor = db.cursor()
# <b>Flask queries are Unicode strings, but SMILES and record identifiers</b>
# <b>can only be ASCII, so convert and report any failures</b>
try:
q = str(q)
except UnicodeEncodeError:
return _error_response("non-ASCII query")
errmsg, mol = _get_molecule(cursor, q)
if errmsg:
return _error_response(errmsg)
try:
k = int(k)
except ValueError:
return _error_response("bad value for k")
try:
threshold = float(threshold)
except ValueError:
return _error_response("bad value for threshold")
if oformat not in ("json", "ids", "smiles", "sdf"):
oformat = "json"
if mol is None:
return jsonify({
"status": "ok",
"message": "",
"search_time": 0.0,
"lookup_time": 0.0,
"total_time": time.time() - start_time,
"hits": [],
})
# <b>Compute the fingerprint and do the similarity search</b>
fp = fptype.compute_fingerprint(mol)
t1 = time.time()
hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)
fp_search_time = time.time() - t1
<b>if oformat == "ids":</b>
return Response("\n".join(hits.get_ids()) + "\n", mimetype="text/plain")
<b>if oformat == "smiles":</b>
id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
lines = []
for id, score in hits.get_ids_and_scores():
smiles = id_to_cansmiles.get(id, "MISSING")
lines.append("%s\t%s\t%.3f\n" % (smiles, id, score))
return Response("".join(lines), mimetype="text/plain")
<b>if oformat == "sdf":</b>
id_to_sdfz = dict(get_columns(cursor, ["id", "record"], hits))
records = []
for id, score in hits.get_ids_and_scores():
recordz = id_to_sdfz.get(id, None)
if recordz is None:
continue
record = zlib.decompress(recordz)
text_mol = text_toolkit.parse_molecule(record, "sdf")
text_mol.add_tag("SCORE", "%.3f" % (score,))
writers.append(text_toolkit.create_string(text_mol, "sdf"))
return Response("".join(writer), mimetype="text/plain")
# <b>Otherwise it's a json output format</b>
query_cansmiles = toolkit.create_string(mol, "canstring")
t1 = time.time()
id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
smiles_lookup_time = time.time() - t1
# <b>get the (id, score, cansmiles, and SMILES match flag)</b>
results = []
for id, score in hits.get_ids_and_scores():
cansmiles = id_to_cansmiles.get(id, None)
results.append( (id, score, cansmiles, query_cansmiles == cansmiles) )
return jsonify({
"status": "ok",
"message": "",
"search_time": fp_search_time,
"lookup_time": smiles_lookup_time,
"total_time": time.time() - start_time,
"hits": results,
})
# <b>get a record given its id, or return "404 Not Found"</b>
@app.route("/&lt;string:id&gt;.sdf")
def get_record(id):
cursor = db.cursor()
cursor.execute("SELECT record FROM Compounds WHERE id = ?", (id,))
row = cursor.fetchone()
if not row:
page_not_found()
record = zlib.decompress(row[0])
return Response(record, mimetype="text/plain")
if __name__ == "__main__":
#app.run(debug=False, host="0.0.0.0")
app.run(debug=True)
</pre>
http://www.dalkescientific.com/writings/diary/archive/2014/11/28/similarity_web_service.htmlFri, 28 Nov 2014 12:00:00 GMTIndexing ChEMBL for chemistry searchhttp://www.dalkescientific.com/writings/diary/archive/2014/11/28/indexing_chembl.html<P>
I've just downloaded the SD file for ChEMBL-19. I'm happy that after
many years they have finally placed the record identifier in the title
line of each record! It used to contain either a blank line, or
occasionally text that was essentially junk.
</P><P>
Now I want to be able to work with the data in the file. More
specifically, I want a simple cheminformatics system where I can:
<ul>
<li>Get the original SD record given its identifier.</li>
<li>Get the canonical SMILES for a given identifier.</li>
<li>Search for records similar to a given query.</li>
<li>Check if a given SMILES is in the database.</li>
</ul>
and I want to show how to use the new features in <a
href="http://chemfp.com/">chemfp-1.2</a> to do this.
</P>
<h2>Parse records, find the SMILES, and add to the SQLite database</h2>
<P>
The easiest way is to set up two files. The first is a SQLite database
containing a Compound table. The Compound will contain three columns:
the record identifier, the original record (as a compressed string),
and the canonical SMILES string. SQLite is a very widely used SQL
database, and it's included as <a
href="https://docs.python.org/2/library/sqlite3.html">part of the
standard Python distribution</a>. Here's the SQL that I'll used to
create the database schema.
<pre class="code">
CREATE TABLE Compounds (
id TEXT PRIMARY KEY NOT NULL,
record BLOB,
smiles TEXT)
</pre>
which when embedded in Python code looks like:
<pre class="code">
import sqlite3
database_filename = "chembl_19.sqlite3"
conn = sqlite3.connect(database_filename)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE Compounds (
id TEXT PRIMARY KEY NOT NULL,
record BLOB,
smiles TEXT)
""")
</pre>
</P><P>
I need some way to get the SD records for each one, so I can save it
as a compressed blob in the database, and so I can create the
canonical SMILES string for the record. I pointed out in <a
href="http://dalkescientific.com/writings/diary/archive/2014/11/27/maccs_in_rdkit_and_open_babel.html">my
previous essay</a> that chemfp includes a "read_sdf_ids_and_records"
function in its text_toolkit submodule, which reads each record and
also pulls out the identifier. The code for this, along with
commented-out code to support older ChEMBL releases which only
stored the identifier in the "chembl_id" tag, looks like:
<pre class="code">
from chemfp import text_toolkit
filename = "/Users/dalke/databases/chembl_19.sdf.gz"
id_tag = None # Use the title as the id
#id_tag = "chembl_id" # Uncomment for pre-ChEMBL-19 releases
reader = text_toolkit.read_sdf_ids_and_records(filename, id_tag=id_tag, errors="ignore")
for (id, record) in sdf_reader:
# work with the identifier and SDF record
</pre>
In practice I also want to track the current record number. I can use
the enumerate() function for that, and I'll seed it so that the first
record is 1.
<pre class="code">
recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
# work with the identifier and SDF record
</pre>
You might wonder why I set recno to 0 before doing the iteration. This
handles the case where the input file is empty. If that happens then
recno will never be set, leaving it with whatever the old value
was. If the value was never set, which is the case here, it would
raise a NameError. Here's an example:
<pre class="code">
&gt;&gt;&gt; for i, c in enumerate(""):
... print i, c
...
&gt;&gt;&gt; print i
Traceback (most recent call last):
File "&lt;stdin&gt;", line 1, in &lt;module&gt;
NameError: name 'i' is not defined
</pre>
</P><P>
I want to include the isomeric canonical SMILES string. Chemfp-1.2 has
a common toolkit API to work with RDKit, OpenEye/OEChem, and Open
Babel, so I'll simply use "toolkit" for respectively "rdkit_toolkit",
"openeye_toolkit", or "openbabel_toolkit". The conversion is
straight-forward; turn the SDF record into a toolkit molecule, then
turn the toolkit molecule into the "smistring" (that's the chemfp
format name for "isomeric canonical SMILES string").
<pre class="code">
toolkit_mol = toolkit.parse_molecule(record, "sdf")
smistring = toolkit.create_string(toolkit_mol, "smistring")
</pre>
It's possible that the toolkit doesn't handle a given record, and
unlikely but still perhaps possible that toolkit can't make a SMILES
string for the molecule. If that happens, these functions will raise a
chemfp.ParseError, which I can catch, ignore, and continue on to the
next molecule.
<pre class="code">
recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
try:
toolkit_mol = toolkit.parse_molecule(record, "sdf")
smistring = toolkit.create_string(toolkit_mol, "smistring")
except chemfp.ParseError:
continue
.... work with the id, SDF record, and SMILES string ...
</pre>
All that's left for the SQLite database is to compress the SDF record
using the zlib library, and add the new row to the database.
<pre class="code">
# Compress the record (result is about 1/3rd of the original string)
compressed_record = zlib.compress(record)
# Save into the database and write to the fingerprint file
conn.execute("""
INSERT INTO Compounds (id, record, smiles)
VALUES (?, ?, ?)""",
(id, sqlite3.Binary(compressed_record), smistring))
</pre>
Compression is useful because the uncompressed file takes up 2.8GB
while the original compressed file takes 470 MB for a 6x compression
ratio. Compressing each record independently isn't as effective,
because the compressor can't share any of its analysis between
records, but it still manages about a 3x compression ratio.
</P><P>
However, the compressed string is a byte string with NUL characters
and non-ASCII data. The BLOB and the sqlite3.Binary() tell SQLite and
Python's sqlite3 API to treat the data as a byte string and not as a
Unicode string.
</P><P>
The last bit of code creates an index on the SMILES column (so I can
see if a given SMILES exists or find records with the given SMILES)
and commits the new data to the database and closes it.
<pre class="code">
conn.execute("CREATE INDEX Compounds_by_smiles on Compounds (smiles)")
conn.commit()
conn.close()
</pre>
</P>
<h2>Generate fingerprint and save to the FPB fingerprint file</h2>
<P>
The other file I need is an FPB fingerprint file containing the
fingerprints for similarity search. I'll use RDKit's hash
fingerprints, and since I'm using RDKit to generate the fingerprints I
might as well use RDKit as the toolkit to convert the SDF record into
a SMILES string.
<pre class="code">
fingerprint_type = "RDKit-Fingerprint fpSize=2048"
fptype = chemfp.get_fingerprint_type(fingerprint_type)
toolkit = fptype.toolkit
</pre>
</P><P>
I'll ask chemfp to open up a fingerprint writer. I want it to include
the canonical fingerprint type string in the metadata, and for good
record-keeping I'll also record the original source filename for the
structures. The fingerprint type's get_metadata() has a helper
function to create the correct metatdata object, which I can pass to
the writer.
<pre class="code">
filename = "/Users/dalke/databases/chembl_19.sdf.gz"
fingerprint_filename = "chembl_19.fpb"
metadata = fptype.get_metadata(sources=[filename])
fp_writer = chemfp.open_fingerprint_writer(fingerprint_filename, metadata=metadata)
</pre>
</P><P>
Then for each toolkit molecule, generate the fingerprint and save the
id and the fingerprint to the file.
<pre class="code">
recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
# This part is shared with the SQLite builder
try:
toolkit_mol = toolkit.parse_molecule(record, "sdf")
smistring = toolkit.create_string(toolkit_mol, "smistring")
except chemfp.ParseError:
pass
...
# Compute the fingerprint and save to the file
fp = fptype.compute_fingerprint(toolkit_mol)
fp_writer.write_fingerprint(id, fp)
</pre>
Once finished, I'll close the file explicitly, rather than depend on
the garbage collector to do it for me.
<pre class="code">
fp_writer.close()
</pre>
</P>
<h2>Putting it all together</h2>
<P>
You've now seen the details, here's it is when I put it all together,
along with some progress information and a summary report when it's
complete.
<pre class="code">
import sys
import sqlite3
import zlib
import time
import chemfp
from chemfp import text_toolkit
# <b>Configuration information</b>
filename = "/Users/dalke/databases/chembl_19.sdf.gz"
id_tag = None # Use the title as the id
#id_tag = "chembl_id" # Uncomment for pre-ChEMBL-19 releases
fingerprint_type = "RDKit-Fingerprint fpSize=2048"
database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"
# <b>Open the output database and create the schema</b>
#import os # Uncomment to delete the database before trying again
#os.unlink(database_filename)
conn = sqlite3.connect(database_filename)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE Compounds (
id TEXT PRIMARY KEY NOT NULL,
record BLOB,
smiles TEXT)
""")
# <b>Use the fingerprint type string to get the fingerprint type</b>
# <b>and the toolkit for that type</b>
fptype = chemfp.get_fingerprint_type(fingerprint_type)
toolkit = fptype.toolkit
# <b>Open the fingerprint writer, along with the correct metadata.</b>
# <b>(It handles the canonical fingerprint type string automatically.</b>
# <b>I give it a bit more information about the source.)</b>
metadata = fptype.get_metadata(sources=[filename])
fp_writer = chemfp.open_fingerprint_writer(fingerprint_filename, metadata=metadata)
# <b>Iterate through the file as a set of SDF records.</b>
sdf_reader = text_toolkit.read_sdf_ids_and_records(filename, id_tag=id_tag, errors="ignore")
start_time = time.time()
recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
# <b>Display some progress information</b>
if recno == 1 or recno % 1000 == 0:
dt = time.time() - start_time
sys.stderr.write("Processing record %d. Elapsed time: %.1f seconds\n" % (recno, dt))
# <b>Convert the record into a molecule and then get the SMILES and fingerprint.</b>
# <b>Ignore any conversion errors.</b>
try:
toolkit_mol = toolkit.parse_molecule(record, "sdf")
smistring = toolkit.create_string(toolkit_mol, "smistring")
except chemfp.ParseError:
continue
# <b>Compute the fingerprint</b>
fp = fptype.compute_fingerprint(toolkit_mol)
# <b>Compress the record (result is about 1/3rd of the original string)</b>
compressed_record = zlib.compress(record)
# <b>Save into the database and write to the fingerprint file</b>
conn.execute("""
INSERT INTO Compounds (id, record, smiles)
VALUES (?, ?, ?)""",
(id, sqlite3.Binary(compressed_record), smistring))
# Save the id and fingerprint to the output fingerprint file
fp_writer.write_fingerprint(id, fp)
# <b>How long did it take?</b>
end_read_time = time.time()
dt = end_read_time - start_time
sys.stderr.write("Processed %d records in %.1f seconds. %.2f records/sec\n" %
(recno, dt, recno/dt))
# <b>Create the index to be able to search on SMILES.</b>
conn.execute("CREATE INDEX Compounds_by_smiles on Compounds (smiles)")
dt = time.time() - end_read_time
sys.stderr.write("Indexing took %.1f seconds.\n" % (dt,))
# <b>Save everything</b>
conn.commit()
conn.close()
fp_writer.close()
</pre>
</P><P>
I ran the above code, and at the end it reports:
<pre class="code">
Processed 1404752 records in 7243.6 seconds. 193.93 records/sec
Indexing took 29.1 seconds.
</pre>
</P><P>
When I profile the code, I can see that most of the time is spent
creating the fingerprint. When I first ran this code I used an older
version of RDKit, which took twice as long. I'm glad that RDKit has
gotten faster over time.
</P><P>
Let's see if it works, first using the sqlite interactive shell:
<pre class="code">
% sqlite3 chembl_19.sqlite3
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite&gt; <b>select count(*) from Compounds;</b>
1404532
</pre>
(Oddly, it takes a long time to compute this count the first time I run
it, but once done I can quit and restart sqlite and it's fast.)
</P><P>
How about some more interesting queries?
<pre class="code">
sqlite&gt; <b>select smiles from Compounds where id = "CHEMBL23456";</b>
CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1
sqlite&gt; <b>select id, smiles from Compounds limit 5;</b>
CHEMBL153534|Cc1cc(-c2csc(N=C(N)N)n2)cn1C
CHEMBL440060|CC[C@H](C)[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](N)CCSC
)[C@@H](C)O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N
[C@@H](CC(N)=O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCC(N)=O)C(=O)N
[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](CCC(N)=O)
C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)NCC(=O)N[C@@H](CCC(N)=O)C(=O)N[
C@@H](CC(C)C)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](CO)C(=O)
N[C@@H](CCCN=C(N)N)C(N)=O
CHEMBL440067|CC(C)[C@@H]1NC(=O)[C@@H](CCCCN)NC(=O)[C@H](Cc2c[nH]c3ccccc23)NC(=O)
[C@@H](Cc2c[nH]cn2)NC(=O)[C@H](NC(=O)[C@@H](N)Cc2ccc3ccccc3c2)CSSC[C@@H](C(=O)N[
C@@H](Cc2cccc(-c3ccccc3)c2)C(N)=O)NC1=O
CHEMBL440245|CCCC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCCCN)NC(=O)[C@H](CCCN=C(N
)N)NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](CO)NC(=O)[C@H](Cc1c[nH]cn1)NC(=O)[C@H](C)NC(=
O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](C)NC(=O)[C@H](CC(C)C)NC(=O)[C@
H](CCC(N)=O)NC(=O)[C@@H]1CCCCNC(=O)CC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O
)[C@H](CCC(=O)O)NC(=O)[C@H](CCCN=C(N)N)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(C)C)NC(
=O)[C@H](Cc2c[nH]cn2)NC(=O)[C@H](N)Cc2ccccc2)C(C)C)C(=O)N[C@@H](CCCC)C(=O)N[C@@H
](C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](C)C(=O)N1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[
C@H](C(=O)N[C@H](C(=O)C(N)=O)[C@@H](C)CC)[C@@H](C)CC
CHEMBL440249|CC(C)C[C@@H]1NC(=O)CNC(=O)[C@H](c2ccc(O)cc2)NC(=O)[C@@H]([C@@H](C)O
)NC(=O)[C@H](c2ccc(O[C@H]3O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]3O[C@H]3O[C@H](CO)[C@
@H](O)[C@H](O)[C@@H]3O)cc2)NC(=O)[C@@H](CCCN)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H]([
C@@H](C)O)NC(=O)[C@@H](c2ccc(O)cc2)NC(=O)[C@H](c2ccc(O)cc2)NC(=O)[C@@H](C(C)C)NC
(=O)[C@@H](CCCN)NC(=O)[C@@H](c2ccc(O)cc2)NC(=O)[C@@H](CNC(=O)[C@H](CC(N)=O)NC(=O
)Cc2cccc3ccccc32)[C@@H](C(N)=O)OC(=O)[C@H](c2ccc(O)c(Cl)c2)NC(=O)[C@@H](C)NC1=O
sqlite&gt; <b>select id, smiles from Compounds order by length(smiles) limit 5;</b>
CHEMBL17564|C
CHEMBL1098659|O
CHEMBL1160819|N
CHEMBL1200739|S
CHEMBL1233550|I
sqlite&gt; <b>.quit</b>
</pre>
That last query takes a little while because the database isn't
indexed by SMILES length.
</P><P>
Okay, now what about the FPB file? When know from the previous code
that CHEMBL23456 has the SMILES
<tt>CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1</tt>, so I'll search for its
nearest 5 neighbors.
<pre class="code">
% <b>time simsearch --query 'CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1' -k 5 chembl_19.fpb</b>
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=5 threshold=0.0
#software=chemfp/1.2b4
#targets=chembl_19.fpb
#query_sources=/Users/dalke/databases/chembl_19.sdf.gz
#target_sources=/Users/dalke/databases/chembl_19.sdf.gz
5 Query1 CHEMBL23456 1.00000 CHEMBL180330 0.94105 CHEMBL286759 0.93508
CHEMBL314618 0.93295 CHEMBL382002 0.92812
0.262u 0.139s 0:00.47 82.9% 0+0k 0+0io 0pf+0w
</pre>
You can verify that CHEMBL23456 is in the output, with the expected
similarity of 1.0.
</P><P>
Also, take a look at the timings. Because of the new binary format in
chemfp-1.2, it takes less than 0.5 seconds to search 1.4 million
fingerprints from ChEMBL, from the command-line. I'm rather proud of
that.
</P><P>
Finally, I'll using Python's interactive shell to connect to the
database and display the uncompressed SD record for CHEMBL286759.
<pre class="code">
&gt;&gt;&gt; <b>import sqlite3</b>
&gt;&gt;&gt; <b>db = sqlite3.connect("chembl_19.sqlite3")</b>
&gt;&gt;&gt; <b>c = db.cursor()</b>
&gt;&gt;&gt; <b>c.execute('SELECT record FROM Compounds WHERE id = "CHEMBL286759"')</b>
&lt;sqlite3.Cursor object at 0x100d690e0&gt;
&gt;&gt;&gt; <b>record, = c.fetchone()</b>
&gt;&gt;&gt; <b>record</b>
&lt;read-write buffer ptr 0x100c64280, size 385 at 0x100c64240&gt;
&gt;&gt;&gt;
&gt;&gt;&gt; <b>import zlib</b>
&gt;&gt;&gt; <b>print zlib.decompress(record)</b>
CHEMBL286759
11280714492D 1 1.00000 0.00000 0
20 22 0 0 0 999 V2000
6.7292 -2.0542 0.0000 C 0 0 0 0 0 0 0 0 0
6.2417 -1.3875 0.0000 C 0 0 0 0 0 0 0 0 0
6.2417 -2.7292 0.0000 N 0 0 0 0 0 0 0 0 0
5.4542 -1.6417 0.0000 C 0 0 0 0 0 0 0 0 0
5.4542 -2.4750 0.0000 C 0 0 0 0 0 0 0 0 0
7.5542 -2.0500 0.0000 C 0 0 0 0 0 0 0 0 0
6.4917 -0.6042 0.0000 S 0 0 0 0 0 0 0 0 0
4.7375 -1.2292 0.0000 C 0 0 0 0 0 0 0 0 0
7.9625 -1.3292 0.0000 O 0 0 0 0 0 0 0 0 0
4.7375 -2.8875 0.0000 C 0 0 0 0 0 0 0 0 0
7.9667 -2.7625 0.0000 N 0 0 0 0 0 0 0 0 0
4.0167 -1.6417 0.0000 C 0 0 0 0 0 0 0 0 0
5.9417 0.0083 0.0000 C 0 0 0 0 0 0 0 0 0
4.0167 -2.4750 0.0000 C 0 0 0 0 0 0 0 0 0
3.3000 -1.2292 0.0000 Cl 0 0 0 0 0 0 0 0 0
6.2000 0.7958 0.0000 C 0 0 0 0 0 0 0 0 0
5.1417 -0.1667 0.0000 C 0 0 0 0 0 0 0 0 0
4.5875 0.4458 0.0000 C 0 0 0 0 0 0 0 0 0
5.6417 1.4083 0.0000 C 0 0 0 0 0 0 0 0 0
4.8375 1.2333 0.0000 C 0 0 0 0 0 0 0 0 0
2 1 2 0 0 0
3 1 1 0 0 0
4 2 1 0 0 0
5 3 1 0 0 0
6 1 1 0 0 0
7 2 1 0 0 0
8 4 1 0 0 0
9 6 2 0 0 0
10 5 1 0 0 0
11 6 1 0 0 0
12 8 2 0 0 0
13 7 1 0 0 0
14 10 2 0 0 0
15 12 1 0 0 0
16 13 2 0 0 0
17 13 1 0 0 0
18 17 2 0 0 0
19 16 1 0 0 0
20 18 1 0 0 0
4 5 2 0 0 0
14 12 1 0 0 0
20 19 2 0 0 0
M END
&gt; &lt;chembl_id&gt;
CHEMBL286759
$$$$
</pre>
</P>
<h2>Simple variations</h2>
<P>
You can easily see how to expand the database. You likely also want to
store and index the non-isomeric canonical SMILES, which you can do by
asking for the "cansmiles" format. You might also want to pre-compute
and store molecular properites, like cLogP and molecular weight, along
with a searchable index. It's very easy to make your own specialized
database once you know how.
</P><P>
Of course, I really would like to integrate the SQL query with the
fingerprint query, to search for things like "similar records with a
given logP range." It's possible to hack these a bit, but what I would
like to do some day is write a chemfp virtual table for SQLite - a
sort of similarity cartridge - so they can be integrated. I just need
funding. <tt>;)</tt>
</P>
http://www.dalkescientific.com/writings/diary/archive/2014/11/28/indexing_chembl.htmlFri, 28 Nov 2014 12:00:00 GMTMACCS in RDKit and Open Babelhttp://www.dalkescientific.com/writings/diary/archive/2014/11/27/maccs_in_rdkit_and_open_babel.html<P>
In this essay I'll compare the RDKit and Open Babel MACCS
implementations at a somewhat coarse level. The main goal of this will
be to show how some of the new <a
href="http://chemfp.com/">chemfp-1.2</a> features make it easier to do
cross-toolkit fingerprint analysis.
</P>
<h2>Fingerprint type API</h2>
<P>
It's hard to make things easy. I want to be able to generate a
fingerprint given the fingerprint type string and a structure record
as a string. Here are some examples of fingerprint type strings:
<pre class="code">
RDKit-Fingerprint fpSize=1024
OpenEye-Path numbits=1024 maxbonds=6
OpenBabel-FP2
</pre>
You can look at the names and figure out that they use different
cheminformatics toolkits, each with its own API for parsing a chemical
record and generating a fingerprint. That's complicated.
</P><P>
Chemfp solves it in the usual way, with an extra layer of indirection.
There's a common fingerprint type API, with back-end implementations
for OEChem, Open Babel, and RDKit. Here's how it looks in action:
<pre class="code">
import chemfp
smiles = "c1ccccc1O"
for typestring in (
# Iterate through fingerprints from three different toolkits
"RDKit-Fingerprint fpSize=1024",
"OpenEye-Path numbits=1024 maxbonds=6",
"OpenBabel-FP2"):
# Use the fingerprint type string to get a FingerprintType object
fptype = chemfp.get_fingerprint_type(typestring)
# The fingerprint type can parse a SMILES string and return a fingerprint
fp = fptype.parse_molecule_fingerprint(smiles, "smistring")
# The fingerprint type also has a way to get the canonical type string.
print fptype.get_type()
# Display the fingerprint in hex, and wrapped over several lines
hex_fp = fp.encode("hex")
for i in range(0, len(hex_fp), 60):
print hex_fp[i:i+60]
print
</pre>
This generates the following output:
<pre class="code">
RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1
040000000000100000000000800000000000000002040000040041000000
000080000000400000400004080800000001000000000008000000208020
000000000000000000000800000000000100000000112000000000000000
040000000001000000010000000220040200020008000000000200004085
0000000000000000
OpenEye-Path/2 numbits=1024 minbonds=0 maxbonds=6 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
200000000000000000000000008090000000000000000000800010000000
00800020000000100200000200008042000000000a000000000000100000
000000000400020000000001000000000000000000000000000004000010
00000000200802000000004008010000100000000000008a002184000000
2000000000000004
OpenBabel-FP2/1
000000000000000000000200000000000000000000000000000000000000
000000000000000000000000000800000000000002000000000000000000
000000000800000000000000000000000200000000800000000000004008
000000000000000000000000000200000000000000000002000000000020
0800000000000000
</pre>
</P>
<h2>Use parse_molecule() to parse the record once</h2>
<P>
How does RDKit's fingerprint density change as a function of its
length? It's not hard using the above example, though instead of
creating a type string containing the size information this time I'll
pass in the fingerprint parameters as the second argument to
get_fingerprint_type().
<pre class="code">
import chemfp
from chemfp import bitops
smiles = "CC(=O)Oc1ccccc1C(=O)O"
print "fpSize popcount density"
for size in (256, 512, 1024, 2048, 4096):
fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint", {"fpSize": size})
# The next line converts the SMILES to a molecule each time through the loop.
fp = fptype.parse_molecule_fingerprint(smiles, "smistring")
popcount = bitops.byte_popcount(fp)
print "%6d %4d %.3f" % (size, popcount, float(popcount)/size)
</pre>
For the curious, the output is:
<pre class="code">
fpSize popcount density
256 208 0.812
512 271 0.529
1024 319 0.312
2048 354 0.173
4096 375 0.092
</pre>
</P><P>
If you look at the code for a bit you'll see that it parses the same
SMILES string 5 times. This seems a bit excessive. Instead, I would
like to parse the SMILES string once, to get an RDKit molecule, and
pass that molecule to the different fingerprint types. I can do this
directly with the RDKit API, as in:
<pre class="code">
from rdkit import Chem
rdmol = Chem.MolFromSmiles(smiles)
</pre>
I instead created a common cross-toolkit API, where "rdkit_toolkit",
"openeye_toolkit", and "openbabel_toolkit", share the same toolkit
API. Here I use chemfp.rdkit_toolkit to make a molecule:
<pre class="code">
from chemfp import rdkit_toolkit
rdmol = rdkit_toolkit.parse_molecule(smiles, "smistring")
</pre>
and the same code would work with the other toolkits.
</P><P>
At this point it looks more complicated, or at least longer, to call
the chemfp toolkit API than use RDKit's native API. I hope to convince
you in a bit why it's useful.
</P><P>
I'll rewrite the density code to use the toolkit API and parse the
SMILES string only once. This time I call the compute_fingerprint()
method of the fingerprint type to compute the fingerprint given a
toolkit molecule.
<pre class="code">
import chemfp
from chemfp import bitops
from chemfp import rdkit_toolkit
smiles = "CC(=O)Oc1ccccc1C(=O)O"
print "fpSize popcount density"
# Parse the SMILES once, into an RDKit molecule
rdmol = rdkit_toolkit.parse_molecule(smiles, "smistring")
for size in (256, 512, 1024, 2048, 4096):
fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint", {"fpSize": size})
fp = fptype.compute_fingerprint(rdmol)
popcount = bitops.byte_popcount(fp)
print "%6d %4d %.3f" % (size, popcount, float(popcount)/size)
</pre>
As expected, the output doesn't change, so I won't show it again.
</P>
<h2>How to use the right toolkit for a given fingerprint type.</h2>
<P>
The above code worked because I knew to use the rdkit_toolkit for the
"RDKit-Fingerprint" type. What if I want to compare the fingerprint
density for different sizes and different fingerprint types, including
ones from other toolkits? I could keep track of them manually, but in
chemfp each fingerprint type knows its toolkit, which makes things
easier.
<pre class="code">
import chemfp
from chemfp import bitops
smiles = "CC(=O)Oc1ccccc1C(=O)O"
print "fpSize popcount density"
for typename, sizefield in (
("RDKit-Fingerprint", "fpSize"),
("RDKit-Morgan radius=1", "fpSize"),
("RDKit-Morgan radius=2", "fpSize"),
("RDKit-Morgan radius=3", "fpSize"),
("OpenEye-Path", "numbits"),
("OpenEye-Circular maxradius=1", "numbits"),
("OpenEye-Circular maxradius=2", "numbits"),
("OpenEye-Circular maxradius=3", "numbits")):
# Each fingerprint type knows its toolkit.
# Use that to create the toolkit-appropriate molecule
fptype = chemfp.get_fingerprint_type(typename)
toolkit = fptype.toolkit
mol = toolkit.parse_molecule(smiles, "smistring")
print "="*23, typename
for size in (256, 512, 1024, 2048, 4096):
fptype = chemfp.get_fingerprint_type(typename, {sizefield: size})
fp = fptype.compute_fingerprint(mol)
popcount = bitops.byte_popcount(fp)
print "%6d %4d %.3f" % (size, popcount, float(popcount)/size)
</pre>
The output of this is:
<pre class="code">
fpSize popcount density
======================= RDKit-Fingerprint
256 208 0.812
512 271 0.529
1024 319 0.312
2048 354 0.173
4096 375 0.092
======================= RDKit-Morgan radius=1
256 16 0.062
512 16 0.031
1024 16 0.016
2048 16 0.008
4096 17 0.004
======================= RDKit-Morgan radius=2
256 24 0.094
512 24 0.047
1024 24 0.023
2048 24 0.012
4096 25 0.006
======================= RDKit-Morgan radius=3
256 31 0.121
512 31 0.061
1024 31 0.030
2048 31 0.015
4096 32 0.008
======================= OpenEye-Path
256 124 0.484
512 142 0.277
1024 153 0.149
2048 161 0.079
4096 164 0.040
======================= OpenEye-Circular maxradius=1
256 15 0.059
512 16 0.031
1024 16 0.016
2048 16 0.008
4096 16 0.004
======================= OpenEye-Circular maxradius=2
256 23 0.090
512 24 0.047
1024 24 0.023
2048 24 0.012
4096 24 0.006
======================= OpenEye-Circular maxradius=3
256 29 0.113
512 31 0.061
1024 31 0.030
2048 31 0.015
4096 31 0.008
</pre>
This new code once again parses the SMILES more often than it needs
to. I only need to parse the string once for all of the OpenEye
toolkit fingerprint calculations and once for RDKit's. I'll make a
little cache where the key is "toolkit.name" (which is the string
"openeye" or "rdkit" or "openbabel") and the value is the correct
toolkit molecule. The following is a truncated of the result, since
the other details aren't so important and get in the way.
<pre class="code">
# The cache key is the toolkit name
mol_cache = {}
for typename, sizefield in (
("RDKit-Fingerprint", "fpSize"),
# ... removed many lines
("OpenEye-Circular maxradius=3", "numbits")):
# Each fingerprint type knows its toolkit.
# Use that to create the toolkit-appropriate molecule
fptype = chemfp.get_fingerprint_type(typename)
toolkit = fptype.toolkit
if toolkit.name in mol_cache:
mol = mol_cache[toolkit.name]
else:
mol = mol_cache[toolkit.name] = toolkit.parse_molecule(smiles, "smistring")
# ... compute and display the fingerprints
</pre>
</P>
<h2>Use a fingerprint type to read a structrue file</h2>
<P>
Now for something a bit different. Every wondered if the RDKit and
Open Babel MACCS implementations give the same answers? They should be
close, since they use the same SMARTS definitions, but there will
likely be differences in chemistry.
It's very easy to generate the MACCS fingerprints given a single
fingerprint type. The fingerprint type has a
"read_molecule_fingerprints()" method which takes the structure
filename and returns the (id, fp) pairs. Here's what that looks like
when I analyze a copy of ChEBI:
<pre class="code">
import chemfp
from chemfp.bitops import byte_popcount
# Choose one or the other, depending on your preference.
fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
#fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"
total_popcount = 0
with fptype.read_molecule_fingerprints(filename, errors="ignore") as reader:
for recno, (id, fp) in enumerate(reader, 1):
total_popcount += byte_popcount(fp)
print "#records %d total popcount: %d average popcount: %.1f average density: %.3f" % (
recno, total_popcount, total_popcount/float(recno), total_popcount/float(recno)/fptype.num_bits)
</pre>
This gives the output for RDKit of:
<pre class="code">
#records 18928 total popcount: 637448 average popcount: 33.7 average density: 0.203
</pre>
When I run it for Open Babel's MACCS I get:
<pre class="code">
#records 19640 total popcount: 649935 average popcount: 33.1 average density: 0.199
</pre>
These are close, but not identical. It's not clear why there's a
difference. Part of the reason is surely because RDKit and Open Babel
aren't able to process all of the records. It's also likely that they
have differences in aromaticity. But there may be other reasons.
<P><P>
It may just that the records that RDKit can't handle vs. those that Open
Babel can handle, so the real analysis should only include the ones
that both can handle.
</P>
<h2>Read SD records using the text_toolkit</h2>
<P>
I need some way to only analyze the records that both RDKit and Open
Babel can understand. I can't use the fingerprint type's own parser
API because it's difficult to keep the two parsers synchronized; I
don't know if the RDKit skipped a record or the Open Bable one did.
</P><P>
Chemfp includes a "text_toolkit" module, with functions to parse
individual SMILES and SDF records from a the corresponding files. I
can use that to get the SDF records as a string, then use
parse_molecule_fingerprint() to get the fingerprints from that
string. The functions will raise a chemfp.ParseError exception (which
is also a ValueError) if there was a problem. If there is an
exception, I'll skip that record.
</P><P>
The code to read SDF records can be as simple as:
<pre class="code">
for id, sdf_record in text_toolkit.read_sdf_ids_and_records(filename):
... sdf_record contains the record as a string ...
</pre>
though in practice I'll use a context-manager so the file is closed
automatically. Here's the full code:
<pre class="code">
import chemfp
from chemfp import text_toolkit
from chemfp.bitops import byte_popcount, byte_intersect_popcount
rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
openbabel_maccs_fptype =chemfp.get_fingerprint_type("OpenBabel-MACCS")
assert rdkit_maccs_fptype.num_bits == openbabel_maccs_fptype.num_bits == 166
filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"
total_rdkit_popcount = total_openbabel_popcount = 0
only_rdkit_popcount = only_openbabel_popcount = 0
num_skipped = 0
with text_toolkit.read_sdf_ids_and_records(filename) as sdf_reader:
for recno, (id, sdf_record) in enumerate(sdf_reader, 1):
# Convert the record into a fingerprint. If there is a problem,
# keep track of the skip count and go on to the next record.
try:
rdkit_fp = rdkit_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
openbabel_fp = openbabel_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
except chemfp.ParseError:
num_skipped += 1
continue
# Figure out the number of bits in each fingerprint and number of bits in common
rdkit_popcount = byte_popcount(rdkit_fp)
openbabel_popcount = byte_popcount(openbabel_fp)
intersect_popcount = byte_intersect_popcount(rdkit_fp, openbabel_fp)
# Keep track of overall statistics for each molecule.
total_rdkit_popcount += rdkit_popcount
total_openbabel_popcount += openbabel_popcount
only_rdkit_popcount += rdkit_popcount - intersect_popcount
only_openbabel_popcount += openbabel_popcount - intersect_popcount
# Report the results
print "#records %d #skipped %d" % (recno, num_skipped)
num_valid_records = recno - num_skipped
print "RDKit popcount: %d RDKit only: %d average popcount: %.1f average density: %.3f" % (
total_rdkit_popcount, only_rdkit_popcount, total_rdkit_popcount/float(num_valid_records),
total_rdkit_popcount/float(num_valid_records)/166)
print "Open Babel popcount: %d OB only: %d average popcount: %.1f average density: %.3f" % (
total_openbabel_popcount, only_openbabel_popcount, total_openbabel_popcount/float(num_valid_records),
total_openbabel_popcount/float(num_valid_records)/166)
</pre>
<pre class="code">
#records 19640 #skipped 712
RDKit popcount: 637448 RDKit only: 9034 average popcount: 33.7 average density: 0.203
Open Babel popcount: 629173 OB only: 759 average popcount: 33.2 average density: 0.200
</pre>
</P><P>
That's interesting. RDKit has a slightly higher popcount than Open
Babel, and both have unique bits set which aren't in the other. Which bits are different?
For that I'll use a decoder function which returns the list "on" bit
positions in a fingerprint:
<pre class="code">
&gt;&gt;&gt; from chemfp import encodings
&gt;&gt;&gt; encodings.to_on_bitlist("\0\0\1\xf1")
[16, 24, 28, 29, 30, 31]
&gt;&gt;&gt;
</pre>
(Remember that chemfp's uses a mixed endian encoding, where bytes are
little endian and bits in a byte are big endian.)
</P><P>
The following will find where the two MACCS implementations have
different fingerprints, compare which bits are set in one
implementation which aren't set in the other, and report a few of the
records which are different. (ChEBI stores the record identifier in
the "ChEBI ID" tag instead of the title, so I use the 'id_tag' to
extract the actual record id.)
<pre class="code">
import chemfp
from chemfp import text_toolkit
from chemfp.bitops import byte_popcount, byte_intersect_popcount
from chemfp.encodings import to_on_bitlist
from collections import defaultdict
rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
openbabel_maccs_fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
assert rdkit_maccs_fptype.num_bits == openbabel_maccs_fptype.num_bits == 166
filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"
extra_rdkit = defaultdict(list)
extra_openbabel = defaultdict(list)
with text_toolkit.read_sdf_ids_and_records(filename, id_tag="ChEBI ID") as sdf_reader:
for recno, (id, sdf_record) in enumerate(sdf_reader, 1):
try:
rdkit_fp = rdkit_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
openbabel_fp = openbabel_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
except chemfp.ParseError:
continue
if rdkit_fp == openbabel_fp:
continue
in_rdkit = set(to_on_bitlist(rdkit_fp))
in_openbabel = set(to_on_bitlist(openbabel_fp))
# Record the identifier for each bit in one fingerprint which is not in the other
for bitno in (in_rdkit - in_openbabel):
extra_rdkit[bitno].append(id)
for bitno in (in_openbabel - in_rdkit):
extra_openbabel[bitno].append(id)
print "Extra in RDKit:", sum(len(ids) for ids in extra_rdkit.values())
for bitno, ids in sorted(extra_rdkit.items()):
examples = repr(ids[:5])[1:-1]
if len(ids) > 5:
examples += " ..."
print "%3d: %d extra = %s" % (bitno, len(ids), examples)
print
print "Extra in Open Babel:", sum(len(ids) for ids in extra_openbabel.values())
for bitno, ids in sorted(extra_openbabel.items()):
examples = repr(ids[:5])[1:-1]
if len(ids) > 5:
examples += " ..."
print "%3d: %d extra = %s" % (bitno, len(ids), examples)
</pre>
</P><P>
The results is a bit lengthy - good thing I don't report all of the identifiers for each one!
<pre class="code">
Extra in RDKit: 9034
2: 31 extra = 'CHEBI:30444', 'CHEBI:33313', 'CHEBI:37340', 'CHEBI:37341', 'CHEBI:37342' ...
20: 4 extra = 'CHEBI:27581', 'CHEBI:51461', 'CHEBI:51467', 'CHEBI:51474'
23: 12 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
42: 13 extra = 'CHEBI:21143', 'CHEBI:33633', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
43: 3036 extra = 'CHEBI:7963', 'CHEBI:11230', 'CHEBI:11714', 'CHEBI:12194', 'CHEBI:15431' ...
44: 34 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
48: 29 extra = 'CHEBI:15378', 'CHEBI:15724', 'CHEBI:17045', 'CHEBI:17735', 'CHEBI:3611' ...
49: 10 extra = 'CHEBI:23066', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
52: 12 extra = 'CHEBI:21143', 'CHEBI:31196', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
53: 1 extra = 'CHEBI:33633'
58: 12 extra = 'CHEBI:228275', 'CHEBI:51895', 'CHEBI:52273', 'CHEBI:52298', 'CHEBI:52299' ...
64: 4 extra = 'CHEBI:57255', 'CHEBI:38600', 'CHEBI:57718', 'CHEBI:58571'
67: 6 extra = 'CHEBI:33130', 'CHEBI:29229', 'CHEBI:31197', 'CHEBI:33633', 'CHEBI:52699' ...
68: 33 extra = 'CHEBI:47402', 'CHEBI:49464', 'CHEBI:21143', 'CHEBI:29230', 'CHEBI:29275' ...
73: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
75: 7 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
77: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
83: 29 extra = 'CHEBI:27899', 'CHEBI:21549', 'CHEBI:29275', 'CHEBI:30027', 'CHEBI:30312' ...
85: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
89: 5 extra = 'CHEBI:31197', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734', 'CHEBI:58416'
90: 7 extra = 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726', 'CHEBI:32729' ...
92: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
98: 8 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
100: 1 extra = 'CHEBI:60153'
107: 2 extra = 'CHEBI:23066', 'CHEBI:51736'
112: 33 extra = 'CHEBI:15342', 'CHEBI:52781', 'CHEBI:57255', 'CHEBI:33826', 'CHEBI:33831' ...
113: 1 extra = 'CHEBI:51736'
114: 5 extra = 'CHEBI:23066', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504', 'CHEBI:55506'
115: 2 extra = 'CHEBI:51736', 'CHEBI:55504'
117: 1 extra = 'CHEBI:51736'
118: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
121: 1 extra = 'CHEBI:60153'
124: 4500 extra = 'CHEBI:1734', 'CHEBI:2303', 'CHEBI:2914', 'CHEBI:3013', 'CHEBI:3048' ...
125: 3 extra = 'CHEBI:41981', 'CHEBI:29374', 'CHEBI:29375'
127: 1 extra = 'CHEBI:51736'
128: 1 extra = 'CHEBI:51736'
130: 18 extra = 'CHEBI:21143', 'CHEBI:29229', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734' ...
134: 86 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16048', 'CHEBI:16304', 'CHEBI:16379' ...
137: 1 extra = 'CHEBI:51736'
140: 3 extra = 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429'
143: 63 extra = 'CHEBI:15342', 'CHEBI:15955', 'CHEBI:16379', 'CHEBI:16848', 'CHEBI:17679' ...
146: 1 extra = 'CHEBI:51736'
147: 1 extra = 'CHEBI:60153'
148: 5 extra = 'CHEBI:23066', 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504'
150: 2 extra = 'CHEBI:29275', 'CHEBI:58416'
152: 1 extra = 'CHEBI:51736'
156: 17 extra = 'CHEBI:52781', 'CHEBI:51222', 'CHEBI:51224', 'CHEBI:51225', 'CHEBI:51899' ...
157: 12 extra = 'CHEBI:16531', 'CHEBI:17675', 'CHEBI:50375', 'CHEBI:2311', 'CHEBI:11573' ...
159: 7 extra = 'CHEBI:23066', 'CHEBI:36771', 'CHEBI:48079', 'CHEBI:51730', 'CHEBI:51736' ...
161: 21 extra = 'CHEBI:57255', 'CHEBI:30858', 'CHEBI:33134', 'CHEBI:33668', 'CHEBI:33826' ...
165: 934 extra = 'CHEBI:3243', 'CHEBI:3385', 'CHEBI:3395', 'CHEBI:3567', 'CHEBI:3637' ...
Extra in Open Babel: 759
1: 1 extra = 'CHEBI:33397'
2: 11 extra = 'CHEBI:49920', 'CHEBI:30437', 'CHEBI:30439', 'CHEBI:30440', 'CHEBI:30442' ...
25: 58 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
42: 7 extra = 'CHEBI:29779', 'CHEBI:30163', 'CHEBI:32060', 'CHEBI:35836', 'CHEBI:36143' ...
44: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
49: 39 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
52: 5 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:32060', 'CHEBI:50953', 'CHEBI:50958'
53: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30598', 'CHEBI:32743' ...
62: 11 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
64: 34 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379', 'CHEBI:17439' ...
67: 24 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278', 'CHEBI:30101', 'CHEBI:30104' ...
68: 37 extra = 'CHEBI:16144', 'CHEBI:44951', 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278' ...
75: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
77: 67 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
81: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:32060', 'CHEBI:32743' ...
83: 18 extra = 'CHEBI:161680', 'CHEBI:9016', 'CHEBI:29278', 'CHEBI:29422', 'CHEBI:30104' ...
89: 1 extra = 'CHEBI:37179'
90: 2 extra = 'CHEBI:32060', 'CHEBI:37178'
98: 58 extra = 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609', 'CHEBI:28421' ...
103: 2 extra = 'CHEBI:36143', 'CHEBI:37179'
105: 1 extra = 'CHEBI:55504'
112: 17 extra = 'CHEBI:40071', 'CHEBI:595389', 'CHEBI:8247', 'CHEBI:32014', 'CHEBI:36603' ...
118: 76 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:18023' ...
130: 32 extra = 'CHEBI:29278', 'CHEBI:29779', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30101' ...
134: 4 extra = 'CHEBI:34921', 'CHEBI:51597', 'CHEBI:53328', 'CHEBI:60153'
135: 27 extra = 'CHEBI:49706', 'CHEBI:29221', 'CHEBI:29270', 'CHEBI:29271', 'CHEBI:29420' ...
143: 10 extra = 'CHEBI:5835', 'CHEBI:33089', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:51962' ...
150: 3 extra = 'CHEBI:29240', 'CHEBI:29278', 'CHEBI:30227'
156: 2 extra = 'CHEBI:51204', 'CHEBI:53327'
157: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
159: 4 extra = 'CHEBI:29360', 'CHEBI:29434', 'CHEBI:29437', 'CHEBI:29440'
161: 4 extra = 'CHEBI:36603', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:60153'
</pre>
</P>
<h2>Why are there big differences?</h2>
<P>
<h3>MACCS key 44</h3>
The second biggest difference is bit 43, where RDKit has 3036 extra
matches and Open Babel has 0 extra matches. This corresponds to MACCS
key 44, which is the "OTHER". The difference is due to timing. In <a
href="http://www.dalkescientific.com/writings/diary/archive/2014/10/17/maccs_key_44.html">last
month's essay</a> I tracked down the definition for OTHER, which had
been missing for Open Babel, RDKit, and CDK. I reported it to the
repsective projects, and the updated definition is now part of the
main code base.
<P></P>
RDKit had a new release between last month and now, which I'm
using. Open Babel has not. I'll demonstrate by asking the chemfp
toolkit API for information about the underlying toolkit version:
<pre class="code">
&gt;&gt;&gt; from chemfp import rdkit_toolkit, openbabel_toolkit
&gt;&gt;&gt; rdkit_toolkit.software
'RDKit/2014.09.1'
&gt;&gt;&gt; openbabel_toolkit.software
'OpenBabel/2.3.2'
</pre>
</P>
<h3>MACCS key 125</h3>
<P>
The biggest difference is bit 124/key 125. RDKit has 4500 matches that
weren't found in Open Babel, while Open Babel has no extra
matches. This key is defined as "Aromatic Ring &gt; 1".
</P><P>
This definition cannot be defined by a SMARTS pattern. RDKit has
special code to handle that case. Open Babel does not. The current
Open Babel code assumes that a fingerprint can be defined by a set of
SMARTS patterns, with one SMARTS per bit/key, and there's no provision
to indicate an alternative.
</P><P>
As a result, Open Babel will <i>always</i> set a 0 for that bit.
</P>
<h2>Open Babel MACCS bit count distribution</h2>
<P>
This got me to wonder about the number of records set by each bit, so
I wrote a short program for that.
<pre class="code">
import chemfp
from chemfp.bitops import byte_contains_bit
from chemfp.encodings import to_on_bitlist
fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"
# Initialize a list of bit counts to 0.
counts = [0 for i in range(fptype.num_bits)]
with fptype.read_molecule_fingerprints(filename) as fp_reader:
for (id, fp) in fp_reader:
# Keep track of the total bit count
for bitno in to_on_bitlist(fp):
counts[bitno] += 1
# Report the result
for bitno, count in enumerate(counts):
print "%3d: %6d" % (bitno, count)
</pre>
Here is the Open Babel MACCS bit use distribution:
<pre class="code">
0: 0
1: 2
2: 405
3: 31
4: 14
5: 44
6: 209
7: 407
8: 306
9: 125
10: 452
11: 160
12: 115
13: 144
14: 35
15: 232
16: 131
17: 204
18: 622
19: 102
20: 97
21: 438
22: 192
23: 666
24: 687
25: 870
26: 200
27: 242
28: 2654
29: 422
30: 402
31: 153
32: 214
33: 701
34: 311
35: 730
36: 740
37: 1508
38: 706
39: 753
40: 275
41: 583
42: 2069
43: 0
44: 665
45: 226
46: 590
47: 3519
48: 5666
49: 2429
50: 621
51: 452
52: 5505
53: 6337
54: 916
55: 248
56: 4981
57: 1013
58: 551
59: 969
60: 1012
61: 2398
62: 335
63: 565
64: 3477
65: 3222
66: 1315
67: 211
68: 3119
69: 490
70: 754
71: 6860
72: 1111
73: 3312
74: 2734
75: 3010
76: 2988
77: 1245
78: 2880
79: 3308
80: 1801
81: 5228
82: 4477
83: 3579
84: 3668
85: 2435
86: 1060
87: 2839
88: 7561
89: 8478
90: 7778
91: 4308
92: 2861
93: 1967
94: 6416
95: 5504
96: 5064
97: 6145
98: 4558
99: 4120
100: 5355
101: 4878
102: 1340
103: 7468
104: 5827
105: 5600
106: 1540
107: 4975
108: 5852
109: 4748
110: 5774
111: 7248
112: 3699
113: 2494
114: 5190
115: 5297
116: 4970
117: 7356
118: 1667
119: 5755
120: 5254
121: 4356
122: 9127
123: 6298
124: 0
125: 6236
126: 8584
127: 6237
128: 6218
129: 4972
130: 10503
131: 9283
132: 4075
133: 2230
134: 2722
135: 7841
136: 8606
137: 5036
138: 11348
139: 9332
140: 3954
141: 6373
142: 8584
143: 4266
144: 6504
145: 11568
146: 7385
147: 7141
148: 6480
149: 8560
150: 7651
151: 10272
152: 9338
153: 11775
154: 11310
155: 9060
156: 13650
157: 9125
158: 14337
159: 10031
160: 10299
161: 8148
162: 11358
163: 16302
164: 12939
165: 0
</pre>
You can see that bits 0, 43, 124, and 165 (or keys 1, 44, 125, and
166) are not set. I've already discussed 44 and 125. Key 1 is
"Isotope" and key 166 is "Fragment", neither of which have a SMARTS
definition. (I believe the SMARTS definitions should be "[!*0]" and
"(*).(*)", but I'll need to test if those are sufficiently
cross-toolkit ... Nope, RDKit doesn't seem to support component-level
SMARTS.)
</P>
<h2>RDMACCS - a more cross-toolkit MACCS implementation</h2>
<P>
Chemfp includes the "RDMACCS" fingerprints, which is my attempt at a
cross-toolkit implementation of the MACCS keys. It starts with the
same SMARTS definitions as RDKit (though I haven't yet added key 44)
and includes toolkit-specific implementations for the keys which can't
be expressed in SMARTS.
<P><P>
When I redo the same comparisons using the following fingerprint types:
<pre class="code">
rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDMACCS-RDKit")
openbabel_maccs_fptype = chemfp.get_fingerprint_type("RDMACCS-OpenBabel")
</pre>
I get answers which are much closer together
<pre class="code">
#records 19640 #skipped 712
RDKit popcount: 634388 RDKit only: 528 average popcount: 33.5 average density: 0.202
Open Babel popcount: 634698 OB only: 838 average popcount: 33.5 average density: 0.202
</pre>
I haven't yet done the research to figure out the source of all of the
differences, but from work I did in 2011 on <a
href="http://www.dalkescientific.com/writings/diary/archive/2011/06/04/dealing_with_sssr.html">Dealing
with SSSR</a> I know that aromaticity perception is likely an important factor.
</P><P>
I'll leave you with a bitwise comparison so you can get a sense of the differences for yourself:
<pre class="code">
Extra in RDKit: 528
20: 4 extra = 'CHEBI:27581', 'CHEBI:51461', 'CHEBI:51467', 'CHEBI:51474'
23: 12 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
42: 13 extra = 'CHEBI:21143', 'CHEBI:33633', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
44: 34 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
48: 18 extra = 'CHEBI:15724', 'CHEBI:17045', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612' ...
49: 10 extra = 'CHEBI:23066', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
52: 12 extra = 'CHEBI:21143', 'CHEBI:31196', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
53: 1 extra = 'CHEBI:33633'
58: 12 extra = 'CHEBI:228275', 'CHEBI:51895', 'CHEBI:52273', 'CHEBI:52298', 'CHEBI:52299' ...
64: 4 extra = 'CHEBI:57255', 'CHEBI:38600', 'CHEBI:57718', 'CHEBI:58571'
67: 6 extra = 'CHEBI:33130', 'CHEBI:29229', 'CHEBI:31197', 'CHEBI:33633', 'CHEBI:52699' ...
68: 33 extra = 'CHEBI:47402', 'CHEBI:49464', 'CHEBI:21143', 'CHEBI:29230', 'CHEBI:29275' ...
73: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
75: 7 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
77: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
83: 29 extra = 'CHEBI:27899', 'CHEBI:21549', 'CHEBI:29275', 'CHEBI:30027', 'CHEBI:30312' ...
85: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
89: 5 extra = 'CHEBI:31197', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734', 'CHEBI:58416'
90: 7 extra = 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726', 'CHEBI:32729' ...
92: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
98: 8 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
107: 2 extra = 'CHEBI:23066', 'CHEBI:51736'
112: 33 extra = 'CHEBI:15342', 'CHEBI:52781', 'CHEBI:57255', 'CHEBI:33826', 'CHEBI:33831' ...
113: 1 extra = 'CHEBI:51736'
114: 5 extra = 'CHEBI:23066', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504', 'CHEBI:55506'
115: 2 extra = 'CHEBI:51736', 'CHEBI:55504'
117: 1 extra = 'CHEBI:51736'
118: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
124: 12 extra = 'CHEBI:18023', 'CHEBI:35895', 'CHEBI:36303', 'CHEBI:37372', 'CHEBI:52583' ...
127: 1 extra = 'CHEBI:51736'
128: 1 extra = 'CHEBI:51736'
130: 18 extra = 'CHEBI:21143', 'CHEBI:29229', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734' ...
134: 86 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16048', 'CHEBI:16304', 'CHEBI:16379' ...
137: 1 extra = 'CHEBI:51736'
140: 3 extra = 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429'
143: 63 extra = 'CHEBI:15342', 'CHEBI:15955', 'CHEBI:16379', 'CHEBI:16848', 'CHEBI:17679' ...
146: 1 extra = 'CHEBI:51736'
148: 5 extra = 'CHEBI:23066', 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504'
150: 2 extra = 'CHEBI:29275', 'CHEBI:58416'
152: 1 extra = 'CHEBI:51736'
156: 17 extra = 'CHEBI:52781', 'CHEBI:51222', 'CHEBI:51224', 'CHEBI:51225', 'CHEBI:51899' ...
157: 12 extra = 'CHEBI:16531', 'CHEBI:17675', 'CHEBI:50375', 'CHEBI:2311', 'CHEBI:11573' ...
159: 7 extra = 'CHEBI:23066', 'CHEBI:36771', 'CHEBI:48079', 'CHEBI:51730', 'CHEBI:51736' ...
161: 21 extra = 'CHEBI:57255', 'CHEBI:30858', 'CHEBI:33134', 'CHEBI:33668', 'CHEBI:33826' ...
Extra in Open Babel: 838
25: 58 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
42: 7 extra = 'CHEBI:29779', 'CHEBI:30163', 'CHEBI:32060', 'CHEBI:35836', 'CHEBI:36143' ...
44: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
48: 4 extra = 'CHEBI:29344', 'CHEBI:29940', 'CHEBI:30147', 'CHEBI:30148'
49: 39 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
52: 5 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:32060', 'CHEBI:50953', 'CHEBI:50958'
53: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30598', 'CHEBI:32743' ...
62: 11 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
64: 34 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379', 'CHEBI:17439' ...
67: 24 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278', 'CHEBI:30101', 'CHEBI:30104' ...
68: 37 extra = 'CHEBI:16144', 'CHEBI:44951', 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278' ...
75: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
77: 67 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
81: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:32060', 'CHEBI:32743' ...
83: 17 extra = 'CHEBI:161680', 'CHEBI:9016', 'CHEBI:29278', 'CHEBI:30104', 'CHEBI:30227' ...
89: 1 extra = 'CHEBI:37179'
90: 2 extra = 'CHEBI:32060', 'CHEBI:37178'
98: 58 extra = 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609', 'CHEBI:28421' ...
103: 2 extra = 'CHEBI:36143', 'CHEBI:37179'
105: 1 extra = 'CHEBI:55504'
112: 17 extra = 'CHEBI:40071', 'CHEBI:595389', 'CHEBI:8247', 'CHEBI:32014', 'CHEBI:36603' ...
118: 76 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:18023' ...
124: 88 extra = 'CHEBI:15852', 'CHEBI:15955', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379' ...
130: 32 extra = 'CHEBI:29278', 'CHEBI:29779', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30101' ...
134: 4 extra = 'CHEBI:34921', 'CHEBI:51597', 'CHEBI:53328', 'CHEBI:60153'
135: 27 extra = 'CHEBI:49706', 'CHEBI:29221', 'CHEBI:29270', 'CHEBI:29271', 'CHEBI:29420' ...
143: 10 extra = 'CHEBI:5835', 'CHEBI:33089', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:51962' ...
150: 3 extra = 'CHEBI:29240', 'CHEBI:29278', 'CHEBI:30227'
156: 2 extra = 'CHEBI:51204', 'CHEBI:53327'
157: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
159: 4 extra = 'CHEBI:29360', 'CHEBI:29434', 'CHEBI:29437', 'CHEBI:29440'
161: 4 extra = 'CHEBI:36603', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:60153'
</pre>
As far as I'm aware, no one has done an extensive analysis of the
differences between the available MACCS implementations. Are you up
for the challenge?
</P>
http://www.dalkescientific.com/writings/diary/archive/2014/11/27/maccs_in_rdkit_and_open_babel.htmlThu, 27 Nov 2014 12:00:00 GMTMACCS key 44http://www.dalkescientific.com/writings/diary/archive/2014/10/17/maccs_key_44.html<P>
The MACCS 166 keys are one of the mainstay fingerprints of
cheminformatics, especially regarding molecular similarity. It's
rather odd, really, since they were developed for substructure
screening and not similarity. I suppose that <a
href="http://onlinelibrary.wiley.com/doi/10.1111/j.1469-8137.1912.tb05611.x/abstract">Jaccard</a>
would agree that any relatively diverse feature vector can likely be
used to measure similarity, whether it be Alpine biomes or chemical
structures.
</P><P>
Here's a bit of dirty laundry that you'll not read in the
literature. There are a lot of MACCS implementations, and they don't
agree fully with each other. The differences are likely small, but as
far as I can tell, no one has really investigated if it's a problem,
or noted that it might be a problem.
</P><P>
I'll structure the explanation around key #44. What is definition for
key 44?
</P><P>
To start, there is no publication describing the MACCS 166 public
keys. All of the citations for it either say a variation of "MDL did
it" or cite the 2002 paper which <a
href="http://pubs.acs.org/doi/abs/10.1021/ci010132r">reoptimized</a>
the keys for similarity ([<a
href="http://www.akosgmbh.de/pdf/Reoptimization_960_key.pdf">PDF</a>]). Thing
is, just about everyone uses the "unoptimized" definitions, so this
is, technically, the wrong citation. (Why do people use it? Tradition,
perhaps, or because it feels better to have a real citation rather
than a nebulous one.)
</P><P>
Instead, the definitions appear to have come from ISIS/Base, and have
been passed around from person to person through informal means. I
haven't used the MDL software and can't verify the source myself. There's
a relatively recent whitepaper from Accelrys titled
"<a href="http://accelrys.com/products/pdf/keys-to-keyset-technology.pdf">The
Keys to Understanding MDL Keyset Technology</a>" which says they are defined
in the file "eksfil.dat". A Google search finds 8 results for
"<a href="https://www.google.com/search?q=%22eksfil.dat%22">eksfil.dat</a>".
All are tied to that white paper. The PDF has creation and
modification dates of 31 August 2011, and Archive.org first saw that
URL on 11 October 2011.
</P><P>
It's easy to see that the reoptimization fingerprint is not the same
as the 166 keys that everyone uses. You'll find that many places say
that key 44 is defined as "OTHER". Table 5 of the reoptimization paper
has an entry for '"other" atom type', but there's nothing which
assigns it to key 44. You can't even try to infer some sort of
implicit ordering because the previous entry in table 5 is "isotope",
which is key 1 in the MACCS 166 keys, and two entries later is
"halogen", which is key 134.
</P><P>
If you cite Durant, Leland, Henry, and Nourse (2002) as your reference
to the MACCS 166 bit public keys then you are doing your readers a
disservice. Those define <i>different</i> fingerprints than you
used. Just go ahead and cite "MACCS keys. MDL Information Systems" and
if the reviewer complains that it's a bad citation, point them to this
essay and ask them for the correct one. Then tell me what they
said. If Accelrys complains then they need to suggest the correct
citation and put it in their white paper. Even better would be a
formal publication and a validation suite. (I can dream, can't I?)
</P><P>
In practice, many people use the MACCS keys <i>as interpreted by</i>
the implementers of some piece of software. I used "interepreted by"
because "implemented by" is too strong. There are ambiguities in the
definition, mistakes in the implementations, and differences in
chemical interpretation, compounded by a lack of any sort of
comprehensive validation suite.
</P><P>
Let's take key 44, "OTHER". Remember how the definition comes from an
internal MDL data file? What does "OTHER" mean? RDKit defines it as
'?' in MACCSkeys.py to indicate that it has no definition for that
key. That line has a commit date of 2006-05-06. RDKit's lack of a
definition is notable because
<a href="https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt">Open Babel</a>,
<a href="http://pele.farmbio.uu.se/nightly/api/org/openscience/cdk/fingerprint/MACCSFingerprinter.html">CDK</a>,
a <a href="https://www.chemaxon.com/forum/ftopic8138.html">user contributed implementation for ChemAxon</a>
and many others reuse the RDKit SMARTS definitions. All of them omit key 44.
</P><P>
Others have implemented key 44. TJ O'Donnell, in "<a
href="http://www.crcpress.com/product/isbn/9781420064421">Design and
Use of Relational Databases in Chemistry</a>" (2009) defines it as the
SMARTS <tt>[!#6!#7!#8!#15!#16!#9!#17!#35]</tt>. <a
href="http://www.mayachemtools.org/docs/modules/html/MACCSKeys.html">MayaChemTools</a>
defines it in code as an atom with element number in
"1|6|7|8|9|14|15|16|17|35|53". (See <a href="http://www.mayachemtools.org/docs/modules/html/code/MACCSKeys.html">_IsOtherAtom</a>.)
</P><P>
These are the ones where I have access to the source and could
investigate without much effort.
</P><P>
Both the whitepaper and the reoptimization paper define what "other"
means, and the whitepaper does so specifically in the context of the
MACCS 166 keys. It says:
<blockquote>
"Other" atoms include any atoms other than H, C, N, O, Si, P, S, F,
Cl, Br, and I, and is abbreviated "Z".
</blockquote>
This appears definite and final. Going back to the three different
implementation geneologies, RDKit and its many spinoffs don't have a
definition so by definition isn't correct. O'Donnell's is close, but
the SMARTS pattern omits hydrogen, silicon, and iodine. And
MayaChemTools gets it exactly correct.
</P><P>
Good job, Manish Sud!
</P>
<h2>Are these MACCS variations really a problem?</h2>
<P>
No. Not really. Well, maybe. It depends on who you are.
</P><P>
When used for similarity, a dead bit just makes things more similar
because there are fewer ways to distinguish between molecules. In this
case too, key 44 is rare. Only a handful of molecules contain "other"
atoms (like the gold in <a
href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI%3a2922">auranofin</a>)
so when characterizing a database it's likely fine.
</P><P>
You don't need to trust my own gut feeling. You can read the <a
href="http://www.rdkit.org/docs/GettingStartedInPython.html#maccs-keys">RDKit
documentation</a> and see "The MACCS keys were critically evaluated
and compared to other MACCS implementations in Q3 2008. In cases where
the public keys are fully defined, things looked pretty good."
</P><P>
Okay, so you're hesistent about the keys which aren't "fully defined"?
No need to despair. Roger Sayle ported the RDKit patterns (and without
key 44) over to ChemAxon, and <a
href="https://www.chemaxon.com/forum/ftopic8138.html">reported</a>:
<blockquote>
This work is heavily based upon the previous implementation by Miklos
Vargyas, and the SMARTS definitions developed and refined by Greg
Landrum and Andrew Dalke. This implementation achieves ~65% on the
standard Briem and Lessel benchmark, i.e. almost identical to the
expected value for MACCS keys reported in the literature by MDL and
others.
</blockquote>
</P><P>
(NB: All I did was proofread the RDKit SMARTS and find a few places
that needed fixing.)
</P><P>
The MACCS 166 keys are a blunt tool, designed for substructure search
and repurposed for similarity more because it was already present and
easy to generate. 2D similarity search is another blunt tool. That's
not to say they are horrible or worthless! A rock is a blunt tool for
making an ax, but we used stone axes quite effectively throughout the
Neolith.
</P><P>
Just don't treat the MACCS 166 keys as a good luck charm, or as some
sort of arcane relic passed down by the ancients. There are
limitations in the definition and limitations in the
implementation. Different tools <i>will</i> give different answers,
and if you don't understand your tools they may turn on you.
</P><P>
And when you write a paper, be honest to your readers. If you are
using the RDKit implementation of the MACCS keys or derived version in
another toolkit (and assuming they haven't been changed since I wrote
this essay), point out that you are only using 164 of those 166 bits.
</P>
<h2>Homework assignment</h2>
<P>
For a warmup exerecise, what is the other unimplemented bit in the
RDKit MACCS definition?
</P><P>
For your homework assignment, use two different programs to compute
the MACCS keys for a large data set and see 1) how many bits are
different? (eg, sum of the Manhattan distance between the fingerprints
for each record, or come up with a better measure), 2) how many times
does the nearest neighbor change?, and 3) (bonus points) characterize
how often those differences are because of differences in how to
interpret a key and how often it's because of different toolkit
aromaticity/chemistry perception methods.
</P><P>
I expect a paper in a journal by the end of next year. <tt>:)</tt>.
</P><P>
(Then again, for all I know this is one of those negative results
papers that's so hard to publish. "9 different MACCS key
implementations produce identical MACCS keys!" doesn't sound exciting,
does it?)
</P>
http://www.dalkescientific.com/writings/diary/archive/2014/10/17/maccs_key_44.htmlFri, 17 Oct 2014 12:00:00 GMTchemfp's format APIhttp://www.dalkescientific.com/writings/diary/archive/2014/09/26/format_api.html<P>
This is part of a series of essays describing the reasons behind
chemfp's new APIs. This one, about the new format API, is a lot
shorter than the previous one on <a
href="http://dalkescientific.com/writings/diary/archive/2014/09/25/parse_molecule.html">parse_molecule()
and parse_id_and_molecule()</a>.
</P><P>
It's sometimes useful to know which cheminformatics formats are
available, if only to display a help message or pulldown menu. The
get_formats() toolkit function returns a list of available formats, as
'Format' instances.
<pre class="code">
&gt;&gt;&gt; from chemfp import rdkit_toolkit as T
&gt;&gt;&gt; T.get_formats()
[Format('rdkit/canstring'), Format('rdkit/inchikey'),
Format('rdkit/usmstring'), Format('rdkit/smistring'),
Format('rdkit/molfile'), Format('rdkit/usm'),
Format('rdkit/inchikeystring'), Format('rdkit/sdf'),
Format('rdkit/can'), Format('rdkit/smi'), Format('rdkit/inchi'),
Format('rdkit/rdbinmol'), Format('rdkit/inchistring')]
</pre>
(The next version of chemfp will likely support RDKit's relatively new
PDB reader.)
</P><P>
You can ask a format for its name, or see if it is an input format or
output format by checking respectively "is_input" and "is_output". If
you just want the list of input formats or output formats, use
get_input_formats() or get_output_formats().
</P><P>
Here's an example to show which output formats are not also input
formats:
<pre class="code">
&gt;&gt;&gt; [format.name for format in T.get_output_formats() if not format.is_input_format]
['inchikey', 'inchikeystring']
</pre>
</P><P>
You may recall that some formats are record-based and others are, for
lack of a better word, "string-based". The latter include "smistring",
"inchistring", and "inchikeystring". These are not records in their
own right, so can't be read or written to a file.
</P><P>
I really couldn't come up with a good predicate which described those
formats. This closest was "is_a_record". I ended up with
"supports_io". I'm not happy with the name. If true, the format can be
used in file I/O.
</P><P>
The RDKit input formats which do not support I/O are the expected ones
... and rdbinmol.
<pre class="code">
&gt;&gt;&gt; [format.name for format in T.get_input_formats() if not format.supports_io]
['canstring', 'usmstring', 'smistring', 'molfile', 'rdbinmol', 'inchistring']
</pre>
(The "rdbinmol" is an experimental format. It's the byte string from
calling an RDKit molecule's "ToBinary()" method, which is also the
basis for its pickle support.)
</P>
<h2>get_format() and compression</h2>
<P>
You can get a specific format by name using get_format(). This can
also be used to specify a compressed format:
<pre class="code">
&gt;&gt;&gt; T.get_format("sdf")
Format('rdkit/sdf')
&gt;&gt;&gt; T.get_format("smi.gz")
Format('rdkit/smi.gz')
&gt;&gt;&gt; format = T.get_format("smi.gz")
&gt;&gt;&gt; format
Format('rdkit/smi.gz')
&gt;&gt;&gt; format.name
'smi'
&gt;&gt;&gt; format.compression
'gz'
</pre>
</P>
<h2>Default reader and writer arguments</h2>
<P>
Toolkit- and format-specific arguments were a difficult challenge. I
want chemfp to support multiple toolkits, because I know people work
with fingerprints from multiple toolkits. Each of the toolkit has its
own way to parse and generate records. I needed some way to have a
common API but with a mechanism to control the underlying toolkit
options.
</P><P>
The result are reader_args, which I discussed in <a
href="http://dalkescientific.com/writings/diary/archive/2014/09/25/parse_molecule.html">the
previous essay</a>, and the writer_args complement for turning a
molecule into a record.
</P><P>
A Format instance can be toolkit specific; the "rdkit/smi.gz" is an
RDKit format. (The toolkit name is available from the aptly named
attribute 'toolkit_name'.) Each Format has a way to get the default
reader_args and writer_args for the format:
<pre class="code">
&gt;&gt;&gt; format = T.get_format("smi.gz")
&gt;&gt;&gt; format
Format('rdkit/smi.gz')
&gt;&gt;&gt; format.get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
&gt;&gt;&gt; format.get_default_writer_args()
{'isomericSmiles': True, 'delimiter': None, 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
</pre>
This is especially useful if you are on the interactive prompt and
have forgotten the option names.
</P>
<h2>Convert text settings into arguments</h2>
<P>
The -R command-line options for the chemfp tools rdkit2fps, ob2fps,
and oe2fps let users set the reader_args. If your target molecules are
in a space-delimited SMILES file then you can set the 'delimiter'
option to 'space':
<pre class="code">
oe2fps -R delimiter=space targets.smi.gz -o targets.fps
</pre>
or ask RDKit to disable sanitization using:
<pre class="code">
rdkit2fps -R sanitize=false targets.smi.gz -o targets.fps
</pre>
The -R takes string keys and values. On the other hand reader_args
take a dictionary with string keys but possibly integers and booleans
as values. You could write the converter yourself, but that gets old
very quickly. Instead, I included it the format's
get_reader_args_from_text_settings(). (The *2fps programs don't
generate structure output, but if they did the equivalent command-like
flag would be -W, and the equivalent format method is
get_writer_args_from_text_settings().)
</P><P>
Yes, I agree that get_..._settings() is a very long name. I couldn't
think of a better one. I decided that "text settings" are the
reader_args and writer_args expressed as a dictionary with string
names and string values.
</P><P>
I'll use that long named function to convert some text settings into
proper reader_args:
<pre class="code">
&gt;&gt;&gt; format.get_reader_args_from_text_settings({
... "delimiter": "tab",
... "sanitize": "false",
... })
{'delimiter': 'tab', 'sanitize': False}
</pre>
You can see that the text "false" was converted into the Python False
value.
</P>
<h2>Namespaces</h2>
<P>
Names like "delimiter" and "sanitize" are 'unqualified' and apply for
every toolkit and every format which accept them. This makes sense for
"delimiter" because it's pointless to have OEChem parse a SMILES file
using a different delimiter style than RDKit. It's acceptable for
"sanitize" because only RDKit knows what it means, and the other
toolkits will ignore unknown names. For many cases then you could
simply do something like:
</P><P>
<pre class="code">
reader_args = {
"delimiter": "tab", # for SMILES files
"strictParsing": False, # for RDKit SDF
"perceive_stereo": True, # for Open Babel SDF
"aromaticity": "daylight, # for all OEChem readers
}
</pre>
</P><P>
At the moment the toolkits all have different names for option names
for the same format, so there's no conflict there. But toolkits do use
the same name for options on different formats, and there can be a
good reason for why the value for a SMILES output is different than a
value for an SDF record output.
</P><P>
The best example is OEChem, which uses a "flavor" flag to specify the
input and output options for all formats. (For chemfp I decided to
split OEChem's flavor into 'flavor' and 'aromaticity' reader and
writer arguments. I leave that discussion for elsewhere.) I'll start
by making an OEGraphMol.
<pre class="code">
from chemfp import openeye_toolkit
phenol = "c1ccccc1[16OH]"
oemol = openeye_toolkit.parse_molecule(phenol, "smistring")
</pre>
Even though "smistring" output by default generates the canonical
isomorphic SMILES for the record, I can ask it to generate a different
output flavor. For convience, the flavor value can be an integer,
which is treated as the flavor bitmask, or it can be a string of "|"
or "," separated bitmask names. Usually the bitmask names are or'ed
together, but a leading "-" means to unset the corresponding bits for
that flag.
<pre class="code">
>>> openeye_toolkit.create_string(oemol, "smistring")
'c1ccc(cc1)[16OH]'
&gt;&gt;&gt; openeye_toolkit.create_string(oemol, "smistring",
... writer_args={"flavor": "Default"})
'c1ccc(cc1)[16OH]'
&gt;&gt;&gt; openeye_toolkit.create_string(oemol, "smistring",
... writer_args={"flavor": "Default,-Isotopes"})
'c1ccc(cc1)O'
&gt;&gt;&gt; openeye_toolkit.create_string(oemol, "smistring",
... writer_args={"flavor": "Canonical|Kekule|Isotopes"})
'C1=CC=C(C=C1)[16OH]'
</pre>
Here I'll ask for the SDF record output in V3000 format. (In the
future I expect to have a special "sdf3" or "sdf3000" format, to make
it easier to specify V3000 output across all toolkits.)
<pre class="code">
&gt;&gt;&gt; print(openeye_toolkit.create_string(oemol, "sdf",
... writer_args={"flavor": "Default|MV30"}))
-OEChem-09261411132D
0 0 0 0 0 999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 7 7 0 0 0
M V30 BEGIN ATOM
M V30 1 C 0 0 0 0
M V30 2 C 0 0 0 0
M V30 3 C 0 0 0 0
M V30 4 C 0 0 0 0
M V30 5 C 0 0 0 0
M V30 6 C 0 0 0 0
M V30 7 O 0 0 0 0 MASS=16
M V30 END ATOM
M V30 BEGIN BOND
M V30 1 2 1 6
M V30 2 1 1 2
M V30 3 2 2 3
M V30 4 1 3 4
M V30 5 2 4 5
M V30 6 1 5 6
M V30 7 1 6 7
M V30 END BOND
M V30 END CTAB
M END
$$$$
</pre>
</P>
<h3>What's the problem?</h3>
<P>
One problem comes when I want to configure chemfp so that if the
output is SMILES then use one flavor, and if the output is SDF then
use another flavor. You could construct a table of format-specific
writer_args, like this:
<pre class="code">
writer_args_by_format = {
"smi": {"flavor": "Canonical|Kekule|Isotopes", "aromaticity": "openeye"},
"sdf": {"flavor": "Default|MV30", "aromaticity": "openeye"},
...
}
record = T.create_string(mol, format,
writer_args = writer_args_by_format[format])
</pre>
but not only is that tedious, it doesn't handle toolkit-specific
options. Nor is there an easy way to turn the text settings into this
data structure.
</P>
<h3>Qualified names</h3>
<P>
Instead, the reader_args and writer_args accept "qualified" names,
which can be format-specific like "sdf.flavor", toolkit-specific like
"openeye.*.aromaticity", or both, like "openeye.sdf.aromaticity".
</P><P>
A cleaner way to write the previous example is:
<pre class="code">
writer_args = {
"smi.flavor": "Canonical|Kekule|Isotopes",
"sdf.flavor": "Default|MV30",
"aromaticity": "openeye", # Use the openeye aromaticity model for all formats
...
}
record = T.create_string(mol, format, writer_args = writer_args)
</pre>
or if you want to be toolkit-specific, use "openeye.smi.flavor",
"openeye.sdf.flavor" and "openeye.*.aromaticity", etc.
</P>
<h3>Precendence</h3>
<P>
You probably noticed there are many ways to specify the same setting,
as in the following:
<pre class="code">
reader_args = {
"delimiter": "tab",
"openeye.*.delimiter": "whitespace",
"smi.delimiter": "space",
}
</pre>
The chemfp precedence goes from most-qualified name to
least-qualified, so for this case the search order is:
<pre class="code">
openeye.smi.delimiter
openeye.*.delimiter
smi.delimiter
delimiter
</pre>
</P>
<h2>How to convert qualified names into unqualified names</h2>
<P>
The Format object's get_unqualified_reader_args() converts a
complicated reader_args dictionary which may contain qualified names
into a simpler reader_args dictionary with only unqualified names and
only the names appropriate for the format. It's used internally to
simplify the search for the right name, and it's part of the public
API so you can help debug if your qualifiers are working
correctly. I'll give an example of debugging in a moment.
<P><P>
Here's an example which shows that the previous 'reader_args' example,
with several delimiter specification, is resolved to using the
'whitespace' delimiter style.
<pre class="code">
&gt;&gt;&gt; from chemfp import openeye_toolkit
&gt;&gt;&gt;
&gt;&gt;&gt; reader_args = {
... "delimiter": "tab",
... "openeye.*.delimiter": "whitespace",
... "smi.delimiter": "space",
... }
&gt;&gt;&gt;
&gt;&gt;&gt; format = openeye_toolkit.get_format("smi")
&gt;&gt;&gt; format.get_unqualified_reader_args(reader_args)
{'delimiter': 'whitespace', 'flavor': None, 'aromaticity': None}
</pre>
You can see that it also fills in the default values for unspecified
arguments. Note that this function does not validate values. It's only
concerned with resolving the names.
</P><P>
The equivalent method for writer_args is get_unqualified_writer_args()
- I try to be predictable in my APIs.
</P><P>
This function is useful for debugging because it helps you spot
typos. Readers ignore unknown arguments, so if you type "opneye"
instead of "openeye" then it just assumes that you were talking about
some other toolkit.
</P><P>
If you can't figure out why your reader_args or writer_args aren't
being accepted, pass them through the 'unqualified' method and see
what it gives:
<pre class="code">
&gt;&gt;&gt; format.get_unqualified_reader_args({"opneye.*.aromaticity": "daylight"})
{'delimiter': None, 'flavor': None, 'aromaticity': None}
</pre>
</P>
<h2>Qualified names and text settings</h2>
<P>
The Format object also supports qualifiers in the reader and writer
text_settings and applies the same search order to give the
unqualified reader_args.
<pre class="code">
&gt;&gt;&gt; format.get_reader_args_from_text_settings({
... "sanitize": "true",
... "rdkit.*.sanitize": "false",
... })
{'sanitize': False}
</pre>
</P>
<h2>Errors in the text settings</h2>
<P>
The get_reader_args_from_text_settings() and
get_writer_args_from_text_settings() will validate the values as much
as it can, and raise a ValueError with a helpful message if that
fails.
<pre class="code">
&gt;&gt;&gt; from chemfp import openeye_toolkit
&gt;&gt;&gt; sdf_format = openeye_toolkit.get_format("sdf")
&gt;&gt;&gt; sdf_format.get_writer_args_from_text_settings({
... "flavor": "bland",
... })
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "chemfp/base_toolkit.py", line 407, in get_writer_args_from_text_settings
return self._get_args_from_text_settings(writer_settings, self._format_config.output)
File "chemfp/base_toolkit.py", line 351, in _get_args_from_text_settings
% (self.toolkit_name, name, value, err))
ValueError: Unable to parse openeye setting flavor ('bland'): OEChem sdf format does not support the 'bland' flavor option. Available flavors are: CurrentParity, MCHG, MDLParity, MISO, MRGP, MV30, NoParity
</pre>
</P>
<h2>File format detection based on extension</h2>
<P>
All of the above assumes you know the file format. Sometimes you only
know the filename, and want to determine (or "guess") the format based
on its extension. The file "abc.smi" is a SMILES file, the file
"xyz.sdf" is an SD file, and "xyz.sdf.gz" is a gzip-compressed SD
file.
</P><P>
The toolkit function get_input_format_from_source() will try to
determine the format for an input file, given the source filename:
<pre class="code">
&gt;&gt;&gt; from chemfp import openbabel_toolkit as T
&gt;&gt;&gt; T.get_input_format_from_source("example.smi")
Format('openbabel/smi')
&gt;&gt;&gt; T.get_input_format_from_source("example.sdf.gz")
Format('openbabel/sdf.gz')
&gt;&gt;&gt; format = T.get_input_format_from_source("example.sdf.gz")
&gt;&gt;&gt; format.get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
</pre>
The equivalent for output files is get_output_format_from_destination().
</P><P>
The main difference between the two is get_input_format_from_source()
will raise an exception if the format is known but not supported as an
input format, and get_input_format_from_destination() will raise an
exception if the format is known but not supported as an output
format.
<pre class="code">
&gt;&gt;&gt; T.get_input_format_from_source("example.inchikey")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/openbabel_toolkit.py", line 109, in get_input_format_from_source
return _format_registry.get_input_format_from_source(source, format)
File "chemfp/base_toolkit.py", line 606, in get_input_format_from_source
format_config = self.get_input_format_config(register_name)
File "chemfp/base_toolkit.py", line 530, in get_input_format_config
% (self.external_name, register_name))
ValueError: Open Babel does not support 'inchikey' as an input format
</pre>
</P><P>
The format detection functions actually take two arguments, where the
second is the format name.
<pre class="code">
&gt;&gt;&gt; T.get_input_format_from_source("example.inchikey", "smi.gz")
Format('openbabel/smi.gz')
</pre>
This is meant to simplify the logic that would otherwise lead to code
like:
<pre class="code">
if format is not None:
format = T.get_input_format(format)
else:
format = T.get_input_format_from_source(source)
</pre>
</P><P>
By the way, the source and destination can be None. This tells chemfp
to read from stdin or write to stdout. Since stdin and stdout don't
have a file extension, what format do they have? My cheminformatics
roots started with Daylight, so I decided that the default format is
"smi".
</P>
http://www.dalkescientific.com/writings/diary/archive/2014/09/26/format_api.htmlFri, 26 Sep 2014 12:00:00 GMTchemfp's parse_molecule()http://www.dalkescientific.com/writings/diary/archive/2014/09/25/parse_molecule.html<P>
I used several use cases to guide chemfp-1.2 development. One, under
the category of "web service friendly", was to develop a web service
which takes a structure record and optional format name as input for a
k=3 nearest neighbor search of a pre-loaded target fingerprint
file. That doesn't sound so hard, except I'll let the fingerprint file
determine the fingerprint type based on its 'type' header, which might
specify Open Babel, RDKit, or OpenEye's toolkits.
</P><P>
Those toolkit all have different APIs, but I would prefer not to write
different code for each case. Chemfp-1.1 can be hacked to handle most
of what's needed, because read_structure_fingerprints() will read a
structure file and computes fingerprints for each record. The hack
part is save the XML-RPC query to a file, since
read_structure_fingerprints() only works from a file, not a string.
</P><P>
That isn't a full solution. Chemfp-1.1 doesn't support any sort of
structure output, so there would need to be toolkit-specific code to
set the tag data and convert the molecule to a record.
</P><P>
For chemfp I decided on the classic approach and make my own
toolkit-independent API, with a back-end implementation for each of
the supported toolkits. I think it's a great API, and I've been using
it a lot for my other internal projects. My hope is that some of the
ideas I developed go into other toolkits, or at least influence the
design of the next generation of toolkits.
</P><P>
To see a more concrete example, here's that use case implemented as an
XML-RPC service using the new chemfp-1.2 API.
<pre class="code">
from SimpleXMLRPCServer import SimpleXMLRPCServer
import chemfp
# Load the target fingerprints and figure out the fingerprint type and toolkit
targets = chemfp.load_fingerprints("databases/chembl_18_FP2.fps")
fptype = targets.get_fingerprint_type() # uses the 'type' field from the FPS header
toolkit = fptype.toolkit
print("Loaded %d fingerprints of type %r" % (len(targets), fptype.get_type()))
def search(record, format="sdf"):
# Parse the molecule and report any failures
try:
mol = toolkit.parse_molecule(record, format)
except ValueError as err:
return {"status": "error", "msg": str(err)}
# Compute its fingerprint, search the targets, and get the nearest 3 neighbors
fp = fptype.compute_fingerprint(mol)
nearest = targets.knearest_tanimoto_search_fp(fp, k=3, threshold=0.0)
# Add a tag value like "CHEMBL14060 1.00 CHEMBL8020 0.92 CHEMBL7970 0.92"
(id1, score1), (id2, score2), (id3, score3) = nearest.get_ids_and_scores()
toolkit.add_tag(mol, "NEAREST", "%s %.2f %s %.2f %s %.2f" %
(id1, score1, id2, score2, id3, score3))
return {"status": "ok",
"record": toolkit.create_string(mol, "sdf")}
server = SimpleXMLRPCServer(("localhost", 8000))
server.register_introspection_functions()
server.register_function(search)
if __name__ == "__main__":
server.serve_forever()
</pre>
</P><P>
I think this is a clean API, and a bit easier to understand and use
than the native toolkit APIs. It's very similar to <a
href="https://code.google.com/p/cinfony/">cinfony</a>, though I think
at this level cinfony is bit easier to understand because it wraps its
own Molecule object around the native toolkit molecules, while I leave
them as native molecule objects. I have to use helper functions where
cinfony can use methods. I did this because I don't want the
performance overhead of wrapping and unwrapping for the normal use
case of converting a structure file to fingerprints.
</P><P>
I also have more stand-alone objects than cinfony, like my fingerprint
type object for fingerprint generation, where cinfony uses a method of
the molecule.
</P>
<h2>parse_molecule()</h2>
<P>
That example, while illustrative of the new API, isn't significantly
better than existing systems. For that, I need to delve into the
details, starting with parse_molecule() .
</P><P>
Chemfp has three "toolkit" APIs, one for each of the supported
toolkits. The usual way to get them is through one of the following:
<pre class="code">
from chemfp import rdkit_toolkit
from chemfp import openeye_toolkit
from chemfp import openbabel_toolkit
</pre>
Most of the examples are toolkit independent, so I'll use "T" rather
than stress a toolkit. I'll use RDKit as the back-end for these
examples:
<pre class="code">
&gt;&gt;&gt; from chemfp import rdkit_toolkit as T
</pre>
</P><P>
The type signature is:
<pre class="code">
parse_molecule(content, format, id_tag=None, reader_args=None, errors='strict')
</pre>
In the simplest case I'll parse a SMILES record to get an RDKit
molecule object:
<pre class="code">
&gt;&gt;&gt; mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
&gt;&gt;&gt; mol
&lt;rdkit.Chem.rdchem.Mol object at 0x104521360&gt;
</pre>
If you use the openeye_toolkit you would get an OEGraphMol, and
openbabel_toolkit returns an OBMol.
</P><P>
You might have noticed that the first two parameters are in reverse
order from cinfony's readstring(), which take the format in the first
position instead of the second. This is because the parse_molecules()
parameters parallel chemfp's read_molecules() parameters:
<pre class="code">
read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None)
</pre>
The format there is second because the default of None means to
auto-detect the format based on the source. (If the source is a
filename then detection is based on the extension. I'll go into more
details in another essay.)
</P><P>
In comparison, cinfony readfile() takes <tt>(format, filename)</tt>,
and doesn't auto-detect the format. (A future version could do that,
but it would require a format like "auto" or None, which I thought was
less elegant than being able to use the default format.) I wanted
read_molecules() to support auto-detection, which meant the format had
to be in the second position, which is why parse_molecule() takes the
format in the second position.
</P>
<h2>parse_molecule() always returns a new molecule</h2>
<P>
This function promises to return a new molecule object each
time. Compare to OEChem or Open Babel, which parse into an existing
molecule object.
</P><P>
Those two toolkits reuse the molecule for performance reasons;
clearing is faster than deallocating and reallocating a molecule
object. My experience is that in practice molecule reuse is
error-prone because it's easy to forget to clear the molecule, or save
multiple results to a list and be surprised that they all end up with
the same molecule.
</P><P>
I agree that performance is important. I chose a different route to
get there. I noticed that even if the molecule were reused, there
would still be overhead in calling parse_molecule() because it has to
validate or at least interpret the function parameters. These
parameters rarely change, that validation is unneeded overhead.
</P><P>
What I ended up doing was making a new function,
make_id_and_molecule_parser(), which take the same parameters as
parse_molecule(), except leaving out 'content'. It returns a
specialized parser function which only takes one parameter, the
content, and returns the (id, molecule) pair. (In a future release I
may also have a make_molecule_parser() which only returns the
molecule, but that's too much work for now.)
</P><P>
This new function is designed for performance, and is free to reuse the
molecule object. Functions which return functions are relatively
unusual, and I think only more advanced programmers will use it, which
makes it less likely that people will experience the negative
problems.
<P><P>
It's still possible, so in the future I may add an option to
make_molecule_parser() to require a new molecule each time.
</P>
<h2>Format names</h2>
<P>
The second parameter is the format name or Format object. For now I'll
only consider format names.
<pre class="code">
mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
</pre>
</P><P>
This was a surprisingly tricky to get right, because a SMILES record
and a SMILES string are different things, and because toolkits differ
in what a SMILES record means. When someone says "takes a SMILES as
input", does that mean to take a record or a string?
</P><P>
To clarify, a SMILES file contains SMILES records. A SMILES record is
a SMILES string followed by a whitespace, followed by a title/id. Some
toolkits take Daylight's lead and say that the title goes to the end
of the line. Others interpret the file as a whitespace or space or tab
separated file; or even an arbitrary character delimited file. Some
also support a header in the first line. Finally, SMILES records are
newline terminated, although that's not always required for the last
record. (I'll come back to this in a bit.)
</P><P>
I decided to use different format names for these two cases. The "smi"
format refers to a SMILES record, and the "smistring" format refers to
just the SMILES string.
</P><P>
Bj&ouml;rn Gr&uuml;ning pointed out that a similar issue exists with
InChI. There's the InChI string, but Open Babel and some other tools
also support an "InChI file" with the InChI string as the first term,
followed by a whitespace, followed by some title or identifier, as
well as an "InChIKey file", using the InChIKey instead of the InChI.
</P><P>
Thus, chemfp has "inchistring" and "inchi", and "inchikeystring" and
"inchikey", in parallel with the "smistring"/"smi" distinction.
</P><P>
The other formats, like "sdf" and "pdb", are only record-oriented and
don't have the same dual meaning as SMILES and InChI.
</P>
<h3>Compressed formats</h3>
<P>
A longer term chemfp idea is to extend the binary format to store
record data. My current experiments use SQLite for the records and FPB
files for the fingerprints.
</P><P>
Uncompressed SDF records take up a lot of space, and compress well
using standard zlib compression. The file-based I/O function support
format names like "smi.gz". I extended the idea to support
zlib-compressed records, like "sdf.zlib".
</P>
<h3>Output format names: different types of SMILES</h3>
<P>
The SMILES <i>output</i> format names are also tricky. This needs a
bit of history to understand fully. Daylight toolkit introduced SMILES
strings, but the original syntax did not support isotopes or
chirality. These were added latter, as so-called "isomeric SMILES".
Daylight and nearly every toolkit since maintained that separation,
where a "SMILES" output string (either canonical or non-canonical) was
not isomeric, and something different needed to be done to get an
isomeric SMILES.
</P><P>
This was a usability mistake. Most people expect that when the input is
<sup>238</sup>U then the output SMILES will be "[238U]". I know,
because I've probably made that mistake a dozen times in my own code.
On the plus side, it's usually very easy to detect and fix. On the
minus side, I've only rarely needed canonical non-isomeric SMILES, so
the default ends up encouraging mistakes.
</P><P>
OEChem 2.0 decided to break the tradition and say that "smi" refers to
canonical isomeric SMILES, which is what most expect but didn't get,
that "can" refers to canonical non-isomeric SMILES (this is
unchanged), and that "usm" is the new term for non-canonical,
non-isomeric SMILES.
</P><P>
It's a brillantly simple solution to a usability problem I hadn't
really noticed before they solved it. This made so much sense to me
that I changed chemfp's new I/O API to use those format names and
meanings. I hope others also follow their lead.
</P><P>
That's why "smi" as a chemfp output format means canonical isomeric
SMILES, "can" means canonical non-isomeric, and "usm" means
non-canonical non-isomeric. The corresponding string formats are
"smistring", "canstring", and "usmstring". Here they are in action:
<pre class="code">
&gt;&gt;&gt; mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
&gt;&gt;&gt; T.create_string(mol, "smi")
'[16OH]c1ccccc1 phenol\n'
&gt;&gt;&gt; T.create_string(mol, "can")
'Oc1ccccc1 phenol\n'
&gt;&gt;&gt; T.create_string(mol, "usm")
'c1ccccc1O phenol\n'
&gt;&gt;&gt; T.create_string(mol, "smistring")
'[16OH]c1ccccc1'
</pre>
</P>
<h2>id_tag, parse_id_and_molecule(), and titles</h2>
<P>
The parse_molecule() function only really uses the "id_tag" parameter
to improve error reporting - the error message will use the 'id_tag's
value rather than the default title for a record.
</P><P>
The id_tag parameter exists because developers in Hinxton decided to
put the record identifier in a tag of an SD file instead of placing it
in the record title like the rest of the world. As a result, many
cheminformatics tools stumble a bit with the ChEBI and ChEMBL
datasets, which either have a blank title or the occasional non-blank
but useless title like:
<pre class="code">
"tetrahydromonapterin"
(1E)-1-methylprop-1-ene-1,2,3-tricarboxylate
S([O-])([O-])(=O)=O.[Ni+2].O.O.O.O.O.O.O
Structure #1
Untitled Document-1
XTP
</pre>
I've decided that this is a bug, but despite years of complaints, they
haven't changed, so chemfp has to work around it.
</P><P>
Here's an SD record from ChEBI that you can copy and paste:
<pre class="code">
Marvin 02030822342D
2 1 0 0 0 0 999 V2000
-0.4125 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.4125 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
M STY 1 1 SUP
M SAL 1 2 1 2
M SMT 1 O2
M END
&gt; &lt;ChEBI ID&gt;
CHEBI:15379
&gt; &lt;ChEBI Name&gt;
dioxygen
&gt; &lt;Star&gt;
3
$$$$
</pre>
I'll assign that to the variable 'sdf_record', parse it, and show that
the title is empty, though the identifier is available as the
"ChEBI&nbsp;ID" tag.
<pre class="code">
&gt;&gt;&gt; from chemfp import rdkit_toolkit as T
&gt;&gt;&gt; sdf_record = """\
...
... Marvin 02030822342D
...
... 2 1 0 0 0 0 999 V2000
... -0.4125 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
... many lines deleted ...
... $$$$
... """
&gt;&gt;&gt; mol= T.parse_molecule(sdf_record, "sdf")
[00:49:59] S group SUP ignored on line 8
&gt;&gt;&gt; mol
&lt;rdkit.Chem.rdchem.Mol object at 0x1043de4b0&gt;
&gt;&gt;&gt; T.get_id(mol)
''
&gt;&gt;&gt; T.get_tag(mol, "ChEBI ID")
'CHEBI:15379'
</pre>
(The "get_id" and "get_tag" methods are part of the new molecule
API. I'll discuss them in a future essay.)
</P><P>
I found that I often want to get both the identifier and the molecule.
Rather than use a combination of parse_molecule() and
get_id()/get_tag() to get that information, I created the
parse_id_and_molecule() function, which returns the 2-element tuple of
(id, molecule), whose id_tag parameter specifies where to find the id.
<pre class="code">
&gt; &gt; &gt; T.parse_id_and_molecule(sdf_record, "sdf", id_tag="ChEBI ID")
[00:50:37] S group SUP ignored on line 8
('CHEBI:15379', &lt;rdkit.Chem.rdchem.Mol object at 0x104346a60&gt;)
</pre>
If the id_tag is None, then it uses the appropriate default for the
given format; the title for SD records. Otherwise it contains the name
of the tag with the title. (If there are multiple tags with the same
name then the choice of which tag to use is made arbitrarily.)
</P><P>
I can imagine that some people might place an identifier in a
non-standard location for other formats. If that happens, then there
will likely be an id_tag syntax for that format.
</P><P>
For example, the closest I've come up with is a SMILES variant where
the id you want is in the third column. In that case the id_tag might
be "#3". If the column is labeled "SMILES" then an alternate would be
"@SMILES", or if you know the column name doesn't start with a '#'
or '@' then simply "SMILES".
</P><P>
However, that's hypothetical. I haven't seen it in real life. What I
have seen are CSV files, where the SMILES and id columns are
arbtirary, and which may have Excel-specific delimiter and quoting
rules. Chemfp doesn't currently support CSV files and that will likely
be handled through reader_args.
</P>
<h2>Handling whitespace in a SMILES record</h2>
<P>
The SMILES record format is not portable across toolkits. According to
Daylight, and followed by OEChem, the format is the SMILES string, a
single whitespace, and the rest of the line is the title. This title
might be a simple alphanumeric id, or an IUPAC name with spaces in it,
or anything else.
</P><P>
Other toolkits treat it as a character delimited set of columns. For
example, RDKit by default uses the space character as the delimiter,
though you can change that to tab, comma, or other character.
</P><P>
This is a problem because you might want to generate fingerprints from
a SMILES file containing the record:
<pre class="code">
C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a
</pre>
In one toolkit you'll end up with the identifier "vitamin" and with
another toolkit get the identifier "vitamin A".
</P><P>
My experience is that most people treat a SMILES file as a space or
tab delimited set of columns, with a strong perference for the space
character. This isn't universally true. Roger Sayle pointed out to me
that the formal IUPAC name is a perfectly reasonable unique and
canonical identifer, which can have a space in it. The Daylight
interpretation supports IUPAC names, while the space delimited version
does not.
</P><P>
There is no perfect automatic solution to this ambiguity. Instead,
chemfp lets you specify the appropriate delimiter type using the
"delimiter" reader argument. The supported delimiter type names are
"whitespace", "tab", "space", and "to-eol", as well as the literal
characters "&nbsp;" and "\t". The default delimiter is
"whitespace", because most of the people I work with think of a SMILES
file more like a whitespace delimited file.
</P><P>
Here's an example of how the default doesn't like "vitamin A", but
where the "to-eol" delimiter handles it correctly:
<pre class="code">
&gt;&gt;&gt; smiles = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a\n"
&gt;&gt;&gt; T.parse_id_and_molecule(smiles, "smi")
('vitamin', &lt;rdkit.Chem.rdchem.Mol object at 0x10bbf39f0&gt;)
&gt;&gt;&gt; T.parse_id_and_molecule(smiles, "smi",
... reader_args={"delimiter": "to-eol"})
('vitamin a', &lt;rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0&gt;)
</pre>
as well as an example of the difference between the "tab" and "to-eol"
delimiter types:
<pre class="code">
&gt;&gt;&gt; T.parse_id_and_molecule("C\tA B\tC\t", "smi",
... reader_args={"delimiter": "tab"})
('A B', &lt;rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0&gt;)
&gt;&gt;&gt; T.parse_id_and_molecule("C\tA B\tC\t", "smi",
... reader_args={"delimiter": "to-eol"})
('A B\tC\t', &lt;rdkit.Chem.rdchem.Mol object at 0x10bbf39f0&gt;)
</pre>
</P><P>
It was a bit of work to make the different toolkits work the same way,
and my best attempt isn't perfect. For example, if you are daft and
try to interpret the record "C Q\tA" as a tab-delimited set of
columns, then OEChem will see this as methane with an id of "Q" while
RDKit and Open Babel will say they can't parse the SMILES "C Q".
</P><P>
So don't do that!
</P>
<h2>reader_args</h2>
<P>
As you saw, reader_args is a Python dictionary. All of the SMILES
parsers accept the 'delimiter' argument, and the RDKit and Open Babel
reader_args also support the "has_header" argument. If true, the first
line of the file contains a header. (I couldn't think of a good
implementation of this for OEChem.)
</P><P>
There are also toolkit-specific reader_args. Here I'll disable RDKit's
sanity checker for SMILES, and show that it accepts a SMILES that it
would otherwise reject.
<pre class="code">
&gt;&gt;&gt; T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": False})
&lt;rdkit.Chem.rdchem.Mol object at 0x104407440&gt;
&gt;&gt;&gt; T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": True})
[22:09:40] Can't kekulize mol
Traceback (most recent call last):
File "&lt;stdin&gt;", line 1, in <module>
File "chemfp/rdkit_toolkit.py", line 251, in parse_molecule
return _toolkit.parse_molecule(content, format, id_tag, reader_args, errors)
File "chemfp/base_toolkit.py", line 986, in parse_molecule
id_tag, reader_args, error_handler)
File "chemfp/base_toolkit.py", line 990, in _parse_molecule_impl
id, mol = format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
File "chemfp/_rdkit_toolkit.py", line 1144, in parse_id_and_molecule
error_handler.error("RDKit cannot parse the SMILES string %r" % (terms[0],))
File "chemfp/io.py", line 77, in error
raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES string 'c1ccccc1#N'
</pre>
</P><P>
The SMILES parsers use different reader_args than the SDF parser. You
can see the default reader_args by using the toolkit's format API:
<pre class="code">
&gt;&gt;&gt; from chemfp import rdkit_toolkit
&gt;&gt;&gt; rdkit_toolkit.get_format("smi").get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
&gt;&gt;&gt; rdkit_toolkit.get_format("sdf").get_default_reader_args()
{'strictParsing': True, 'removeHs': True, 'sanitize': True}
</pre>
Also, the different toolkits may use different reader_args for the
same format.
<pre class="code">
&gt;&gt;&gt; from chemfp import openeye_toolkit
&gt;&gt;&gt; openeye_toolkit.get_format("sdf").get_default_reader_args()
{'flavor': None, 'aromaticity': None}
&gt;&gt;&gt; from chemfp import openbabel_toolkit
&gt;&gt;&gt; openbabel_toolkit.get_format("sdf").get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
</pre>
I'll cover more about the format API in another essay.
</P>
<h3>Namespaces</h3>
<P>
This can lead to a problem. You saw earlier how to get the correct
toolkit for a given fingerprint type. Once you have the toolkit you
can parse a record into a toolkit-specific molecule. But what if you
want toolkit-specific and format-specific settings?
</P><P>
First off, parsers ignore unknown reader_arg names, and the reader_arg
names for the different toolkits are different, except for "delimiter"
and "has_header" where it doesn't make sense for them to be
different. That means you could do:
<pre class="code">
reader_args = {
"delimiter": "tab", # for SMILES files
"strictParsing": False, # for RDKit SDF
"perceive_stereo": True, # for Open Babel SDF
"aromaticity": "daylight, # for all OEChem readers
}
</pre>
and have everything work.
</P><P>
Still, it's possible that you want OEChem to parse a SMILES using
"daylight" aromaticity and an SD file using "openeye" aromaticity.
</P><P>
The reader_args are namespaced, so for that case you could use a
format qualifier, like this:
<pre class="code">
reader_args = {
"smi.aromaticity": "daylight",
"can.aromaticity": "daylight",
"usm.aromaticity": "daylight",
"sdf.aromaticity": "openeye",
}
</pre>
There's also a toolkit qualifier. In this daft example, the OpenEye
reader uses the whitespace delimiter option for SMILES files, the
RDKit SMILES and InChI readers uses tab, and the Open Babel SMILES and
InChI readers use the space character.
<pre class="code">
reader_args = {
"openeye.*.delimiter": "whitespace",
"rdkit.*.delimiter": "tab",
"openbabel.*.delimiter": "space",
}
</pre>
Fully qualified names, like "openeye.smi.delimiter" are also allowed.
</P><P>
The remaining problem is how to configure the reader_args. It's no
problem to write the Python dictionary yourself, but what if you want
people to pass them in as command-line arguments or in a configuration
file? I'll cover that detail when I talk about the format API.
</P>
<h2>Errors</h2>
<P>
What happens if the SMILES isn't valid? I'll use my favorite invalid SMILES:
<pre class="code">
&gt;&gt;&gt; T.parse_id_and_molecule("Q q-ane", "smi")
[22:01:37] SMILES Parse Error: syntax error for input: Q
Traceback (most recent call last):
File "&lt;stdin&gt;", line 1, in &lt;module&gt;
File "chemfp/rdkit_toolkit.py", line 264, in parse_id_and_molecule
return _toolkit.parse_id_and_molecule(content, format, id_tag, reader_args, errors)
File "chemfp/base_toolkit.py", line 1014, in parse_id_and_molecule
id_tag, reader_args, error_handler)
File "chemfp/base_toolkit.py", line 1018, in _parse_id_and_molecule_impl
return format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
File "chemfp/_rdkit_toolkit.py", line 249, in parse_id_and_molecule
error_handler.error("RDKit cannot parse the SMILES %r" % (smiles,))
File "chemfp/io.py", line 77, in error
raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES 'Q'
</pre>
A chemfp.ParseError is also a ValueError, so you can check for this exception like this:
<pre class="code">
try:
mol = T.parse_molecule("Q", "smistring")
except ValueError as err:
print("Not a valid SMILES")
</pre>
</P><P>
I think an exception is the right default action when there's an
error, but various toolkits disagree. Some will return a None object
in that case, and I can see the reasoning. It's sometimes easier to
write things more like:
<pre class="code">
&gt;&gt;&gt; mol = T.parse_molecule("Q", "smistring", errors="ignore")
[00:04:26] SMILES Parse Error: syntax error for input: Q
&gt;&gt;&gt; print(mol)
None
</pre>
</P><P>
The default errors value is "strict", which raises an exception. The
"ignore" behavior is to return None when the function can't parse a
SMILES.
</P><P>
In the above example, the "SMILES Parse Error" messages come from the
the underlying RDKit C++ library. It's difficult to capture this
message from Python. Chemfp doesn't even try. OEChem and Open Babel
have similar error messages. Chemp doesn't try to capture those
either.
</P><P>
Error handling control for a single record isn't so important. It's
more important when parsing multiple records, where "ignore" skips a
record and keeps processing and "strict" raises an exception and stops
processing once a record cannot be parsed. I'll cover that in the
essay where I talk about reading molecules.
</P>
<h2>Other error handling - experimental!</h2>
<P>
The errors parameter can be two other strings: "report" and "log".
The "report" option writes the error information to stderr, like this:
<pre class="code">
&gt;&gt;&gt; T.parse_molecule("Q", "smi", errors="report")
[03:05:52] SMILES Parse Error: syntax error for input: Q
ERROR: RDKit cannot parse the SMILES 'Q'. Skipping.
</pre>
The "SMILES Parse Error", again, is generated by the RDKit C++ level
going directly to stderr. The "RDKit cannot parse the SMILES" line,
however, is chemfp sending information to Python's sys.stderr.
</P><P>
I don't think the "report" option is all that useful for this case
since it's pretty much a duplicate of the underlying C++ toolkit. It
does know about the id_tag, so it gives better error message for ChEBI
and ChEMBL files.
<P><P>
The "log" option is the same as "report" except that it sends the
message to the "chemfp" logger using Python's built-in logging
system. I have very little experience with logging, so this is even
more experimental than "report".
<P><P>
Finally, the "errors" parameter can take an object to use as the error
handler. The idea is that the handler's error() is called when there's
an error. See chemfp/io.py for how it works, but consider this to be
undocumented, experiemental, and almost certain to change.
</P><P>
I have the idea that the error handler can have a warn() method, which
would be called for warnings. The current generation of toolkits uses
a global error handler or sends the message directly to stderr. In a
multi-threaded system, which might parse two molecules at the same
time in different threads, it's hard to know which molecule caused the
error.
</P><P>
I haven't done this because ... it's hard using the current generation
of tools to get that warning information. I'm also unsure about the
error handler protocol. How does it know that it's collected all of
the warnings for a given molecule. Is there a special
"end_of_warnings()" method? Are the warning accumulated and sent in
one call? What if there are both warnings and errors?
</P><P>
That's why this subsection it titled "experimental". <tt>:)</tt>
</P>
http://www.dalkescientific.com/writings/diary/archive/2014/09/25/parse_molecule.htmlThu, 25 Sep 2014 12:00:00 GMTNew chemfp APIshttp://www.dalkescientific.com/writings/diary/archive/2014/09/24/new_chemfp_apis.html<P>
In the new few essays I'll cover chemfp's new APIs. I'll have a
different focus here than will be in the official documentation. I
will be writing for the people interested in chemistry toolkit API
development, and with a focus on the I/O subsystem in such a
toolkit. I've spent about 15 years experimenting and evaluating
different API ideas across a range of use cases. This new API is a
synthesis of those ideas. I hope it proves influential.
</P>
<h2>Why does chemfp have these new APIs?</h2>
<P>
Chemfp is a multi-platform Python library for cheminformatics
fingerprint generation and search. It depends on OEChem/OEGraphSim,
RDKit, or Open Babel to parse a structure file and generate
fingerprints, and has its own highly optimized search algorithms.
</P><P>
Chemfp supports multiple toolkits because many of the people I know
who use fingerprints use fingerprints from multiple toolkits. While
this is biased by the people I know, in general, I've seen that most
people who uses a similarity-based search as part of their science
spend some time to figure out which fingerprint type, and which
parameters, are most appropriate for that research.
</P><P>
Unfortunately, this may mean a lot of fiddling around with different
toolkits, since each has a different API and different take on the
problem. My belief is that more people will use chemfp (and pay me
money) if it really makes it easier to handle fingerprints across
different toolkits. The hard part is that no one wants to learn yet
another system except perhaps if it's significantly better. I think
I've done that.
</P>
<h2>Example</h2>
<P>
Here's an example program using new API. It reads in a SMILES file,
compute OpenEye's circular fingerprints (with the defaults settings)
and save the results to an FPS file.
<pre class="code">
import chemfp
fptype = "OpenEye-Circular"
source = "benzodiazepine.smi"
destination = "benzodiazepine.fps"
with chemfp.read_molecule_fingerprints(fptype, source) as reader:
with chemfp.open_fingerprint_writer(destination,
metadata=reader.metadata) as writer:
writer.write_fingerprints(reader)
</pre>
I wrote that in a more formal style, using Python's "with" statement
to ensure that the files are closed when no longer needed. While this
gets all of the details correct, it is verbose. You might even call it
excessive, since you can get the basic result using:
<pre class="code">
import chemfp
chemfp.open_fingerprint_writer("benzodiazepine.fps").write_fingerprints(
chemfp.read_molecule_fingerprints("OpenEye-Circular", "benzodiazepine.smi"))
</pre>
The only major difference between the two is that the output
fingerprint file in the latter doesn't contain any metadata.
</P>
<h2>Fingerprint family and fingerprint type API</h2>
<P>
Object-oriented programming uses the words "class" and "instance". A
class is the more abstract concept. It describes what's possible. An
instance fills in the details. For example, "integers" is the class of
integer numbers, and "38" is an instance of that class.
</P><P>
Chemfp-1.2 makes a similar distinction between a "fingerprint family"
and a "fingerprint type". A family might be "circular fingerprints"
and a type is "circular fingerprints of size 2".
</P><P>
Internally, chemfp has a fingerprint family registry. To get
information about all of the available fingerprint families, use
get_fingerprint_families(), and use get_fingerprint_family() to get a
specific family by name:
<pre class="code">
&gt;&gt;&gt; import chemfp
&gt;&gt;&gt; chemfp.get_fingerprint_families()
[FingerprintFamily(&lt;ChemFP-Substruct-OpenBabel/1&gt;),
FingerprintFamily(&lt;ChemFP-Substruct-OpenEye/1&gt;),
FingerprintFamily(&lt;ChemFP-Substruct-RDKit/1&gt;),
FingerprintFamily(&lt;OpenBabel-FP2/1&gt;),
...
FingerprintFamily(&lt;RDMACCS-OpenBabel/1&gt;),
FingerprintFamily(&lt;RDMACCS-OpenEye/1&gt;),
FingerprintFamily(&lt;RDMACCS-RDKit/1&gt;)]
&gt;&gt;&gt; family = chemfp.get_fingerprint_family("OpenEye-Circular")
&gt;&gt;&gt; family
FingerprintFamily(&lt;OpenEye-Circular/2&gt;)
</pre>
(This make take a while because the registry uses a lazy resolver
which doesn't load or probe the underlying toolkits until first
requested.)
</P><P>
You can ask a fingerprint family for its default values, and make a
new fingerprint type using those defaults.
<pre class="code">
&gt;&gt;&gt; fptype = family()
&gt;&gt;&gt; family.get_defaults()
{'numbits': 4096, 'minradius': 0, 'atype': 4239, 'maxradius': 5, 'btype': 1}
&gt;&gt;&gt; fptype
&lt;chemfp.openeye_types.OpenEyeCircularFingerprintType_v2 object at 0x10dd0ee90&gt;
</pre>
Every fingerprint type has a type string, which is the canonical
string representation of the fingerprint family name, the chemfp
version number for that family, and the fingerprint type parameters.
<pre class="code">
&gt;&gt;&gt; fptype.get_type()
'OpenEye-Circular/2 numbits=4096 minradius=0 maxradius=5
atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HCount btype=Order'
</pre>
(This is really one long string. I added a newline to make it easier
to read.)
</P><P>
You aren't restricted to the defaults. I'll use 1024 bits/fingerprint
and a maximum radius of 3 instead of the defaults of 4096 and 5:
<pre class="code">
&gt;&gt;&gt; fptype = family(numbits=1024, maxradius=3)
&gt;&gt;&gt; fptype.get_type()
'OpenEye-Circular/2 numbits=1024 minradius=0 maxradius=3
atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HCount btype=Order'
</pre>
</P><P>
If you have a type string then you can get the fingerprint type
directly using chemfp's get_fingerprint_type(). The fingerprint type
string does not need to be in canonical form nor include all of the
parameters.
<pre class="code">
&gt;&gt;&gt; fp2type = chemfp.get_fingerprint_type("OpenBabel-FP2")
&gt;&gt;&gt; fp2type.get_type()
'OpenBabel-FP2/1'
&gt;&gt;&gt;
&gt;&gt;&gt; rdktype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2")
&gt;&gt;&gt; rdktype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
&gt;&gt;&gt; rdktype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2 fpSize=512 maxPath=5")
&gt;&gt;&gt; rdktype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=5 fpSize=512 nBitsPerHash=2 useHs=1'
</pre>
</P>
<h2>Fingerprint writer API</h2>
<P>
Chemfp has always been able to write fingerprints to an FPS file, but
I thought the API was too clumsy for public use so I left it as part
of the undocumented internals.
</P><P>
I'm much happier about the new fingerprint writer API, which is based
on a writer object with a method to save a single fingerprint at a
time, and another method to save multiple fingerprints. In this
example I asked it write to None, which means to write to stdout, so
you could see the output from each methods call.
<pre class="code">
&gt;&gt;&gt; writer = chemfp.open_fingerprint_writer(None)
#FPS1
&gt;&gt;&gt; writer.write_fingerprint("zeros", "\0\0\0\0")
00000000&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; zeros
&gt;&gt;&gt; writer.write_fingerprints( [("fp1", "\0\0\0\1"), ("fp2", "\0\0\0\2")] )
00000001&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fp1
00000002&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fp2
&gt;&gt;&gt; writer.close()
</pre>
</P><P>
Version 1.2 also introduces a new binary file format, the FPB
format. The writer API knows how to write an FPB file and will do so
if the format ends with ".fpb" or told specifically to write an FPB
file.
</P>
<h2>Toolkit APIs</h2>
<P>
The toolkit API is the collective name for the format API, the
molecule I/O API, and the molecule API. There is one toolkit API
implementation for each supported chemistry toolkit, which is
basically a wrapper from the chemfp API to the underlying toolkit
API. Use get_toolkit_names() to get a list of the available
chemistry-based toolkits:
<pre class="code">
&gt;&gt;&gt; chemfp.get_toolkit_names()
set(['openeye', 'rdkit', 'openbabel'])
&gt;&gt;&gt; T = chemfp.get_toolkit("openeye")
&gt;&gt;&gt; T.is_licensed()
True
</pre>
Instead of get_toolkit() you can import a toolkit directly in Python:
<pre class="code">
&gt;&gt;&gt; from chemfp import openeye_toolkit as T
&gt;&gt;&gt; T.is_licensed()
True
&gt;&gt;&gt; T.name
'openeye'
</pre>
</P><P>
In addition to the chemistry-based toolkits, the "text" toolkit
implements just enough of the toolkit API to process SMILES and SDF
records. It doesn't not know chemistry; it stores the contents of each
record as a string. I wrote it to handle a few use cases that couldn't
be handled with the molecule centered toolkit. I'll discuss these
cases in the future.
</P>
<h3>Format API</h3>
<P>
The get_formats() function lists the available formats for a given
toolkit.
<pre class="code">
&gt;&gt;&gt; from chemfp import rdkit_toolkit
&gt;&gt;&gt; rdkit_toolkit.get_formats()
[Format('rdkit/canstring'), Format('rdkit/usmstring'),
Format('rdkit/smistring'), Format('rdkit/molfile'),
Format('rdkit/usm'), Format('rdkit/sdf'), Format('rdkit/can'),
Format('rdkit/smi'), Format('rdkit/rdbinmol')]
</pre>
There's also a get_input_formats() and get_output_formats(), or you
can use attributes of the format determine if the toolkit understand
the format as an input or output, as well as a few other properties.
</P><P>
You can use the format API to figure out the toolkit's (or at least
chemfp's) best guess for the format for a given filename. This
includes compression detection.
<pre class="code">
&gt;&gt;&gt; rdkit_toolkit.get_input_format_from_source("example.smi.gz")
Format('rdkit/smi.gz')
</pre>
</P>
<h2>Molecule input API</h2>
<P>
This is by far the most extensive of the new APIs in
chemfp-1.2. Previously there was read_molecule_fingerprints() which
took a fingerprint type name and the source and returned fingerprints
for each record. Suppose though that you want to read a structure file
and generate fingerprints using several three RDKit fingerprint
types. The older API requires reading the file three different types,
which means parsing the same record three different times, even though
the same RDKit molecule could be reused for each.
</P><P>
The new API lets you separate the process of reading molecules from
generating fingerprints, which can be a nice performance boost.
</P><P>
The old API could only take input from files, or file-like
sources. The new API has more extensive support for reading from a
string. This is useful in a web application, where the input might be
text field from a submitted form.
</P><P>
In practice, there several variations on how to read an input,
depending on if it comes from a file or a string or if you want to
read a single record (the first one) or all records. Sometimes you
want the really molecule id when reading a molecule and sometimes you
really don't that overhead; I ended up implementing both of those
readers. Finally, there are also cases where you want to convert a lot
of records into a molecule, but each record is stored into its own
string. The make_id_and_molecule_parser() creates a specialized parser
for the case.
</P><P>
<table>
<caption>Different toolkit reader API functions</caption>
<tr>
<td>&nbsp;</td>
<th>...from a file</th>
<th>...from a string</th>
</tr>
<tr>
<td>Read 0 or more molecules ...</td>
<td>read_molecules()</td>
<td>read_molecules_from_string()</td>
</tr>
<tr>
<td>Read 0 or more (id, molecule) pairs ...</td>
<td>read_ids_and_molecules()</td>
<td>read_ids_and_molecules_from_string()</td>
</tr>
<tr>
<td>Parse the first molecule ...</td>
<td>&nbsp;</td>
<td>parse_molecule()</td>
</tr>
<tr>
<td>Parse the first molecule and its id...</td>
<td>&nbsp;</td>
<td>parse_id_and_molecule()</td>
</tr>
<tr>
<td>Make an optimized function to parse the<br />first molecule and its id...</td>
<td>&nbsp;</td>
<td>make_id_and_molecule_parser()</td>
</tr>
</table>
</P>
<h3>Molecule output API</h3>
<P>
Chemfp-1.2 adds new APIs to save molecules to a structure file or
convert a molecule into a record. Like the reader, there are
several variations in the writer.
</P><P>
The open_molecule_writer() and open_molecule_writer_to_string()
functions return a writer object, though in the first case the writer
saves to a file and in the latter it saves to memory that can be
accessed as a string. The writer object has methods to save a single
molecule at a time, to save multiple molecules at once, and to save a
molecules using an alternate id.
</P><P>
To demonstrate, here's a program which reads the compressed ChEBI SD
file and saves it to a compressed SMILES file. I specify the id tag
because ChEBI stores its identifiers in the "ChEBI ID" SD tag, and not
the title:
<pre class="code">
from chemfp import openeye_toolkit as T
with T.read_ids_and_molecules("ChEBI_complete.sdf.gz", id_tag="ChEBI ID") as reader:
with T.open_molecule_writer("chebi.smi.gz") as writer:
writer.write_ids_and_molecules(reader)
</pre>
</P><P>
When you only need to turn a molecule into a record, use the
create_string() function. If you need to do a lot of these
conversions, all with the same parameters, then use
make_string_creator(). This function returns returns a specialized
function which is better optimized to turn a molecule into a string.
</P>
<h3>Molecule API</h3>
<P>
Chemfp does not try to make a common molecule API across the differnet
toolkits. For that, see Noel O'Boyle's "<a
href="https://code.google.com/p/cinfony/">cinfony</a>". In fact, it's
quite the opposite. The readers, writers, and fingerprint generation
functions take native molecule types and not wrapped objects.
</P><P>
All chemfp offers are a small set of functions to get what the toolkit
considers to be a molecule's title, to do some simple manipuation of
the SD tags for a record, and to make a copy of a molecule.
</P>
<h2>Connecting fingerprint types and toolkits to make a fingerprint</h2>
<P>
If you have a fingerprint type string, and a SMILES string, how to you
generate the fingerprint for it? Each toolkit fingerprint type has a
"compute" function, which takes a toolkit molecule and returns the
fingerprint string, so that's in the right direction. You still need
to know which toolkit to use to convert the SMILES into the right
molecule.
</P><P>
Well, actually you don't. The fingerprint type's parse_fingerprint()
will parse a molecule and generate the fingerprint for you:
<pre class="code">
&gt;&gt;&gt; fptype = chemfp.get_fingerprint_type("OpenBabel-FP2")
&gt;&gt;&gt; fptype.parse_fingerprint("c1ccccc1O", "smi")
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x80\x00\x00\x00
\x00\x00\x00@\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00'
</pre>
This helper function works because the fingerprint type knows its
toolkit, through the "toolkit" attribute. This attribute is part of
the public API, so you could do the two steps yourself if you want to:
<pre class="code">
&gt;&gt;&gt; fptype.toolkit
&lt;module 'chemfp.openbabel_toolkit' from 'chemfp/openbabel_toolkit.pyc'&lt;
&gt;&gt;&gt; mol = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
&gt;&gt;&gt; fptype.compute_fingerprint(mol)
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x80\x00\x00\x00
\x00\x00\x00@\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00'
</pre>
</P>
http://www.dalkescientific.com/writings/diary/archive/2014/09/24/new_chemfp_apis.htmlWed, 24 Sep 2014 12:00:00 GMTSummary of the /etc/passwd reader APIhttp://www.dalkescientific.com/writings/diary/archive/2014/09/23/etc_passwd_api_summary.html<P>
Next month I will be in Finland for <a
href="http://fi.pycon.org/2014/">Pycon Finland 2014</a>. It will be my
first time in Finland. I'm especially looking forward to the
pre-conference sauna on Sunday evening.
</P><P>
My presentation, "<a
href="http://fi.pycon.org/2014/#talks-iterators-generators">Iterators,
Generators, and Coroutines</a>", will cover much of the same ground as
my <a
href="http://dalkescientific.com/writings/diary/archive/2014/07/12/io_location_and_errors.html">earlier
essay</a>. In that essay, I walked through the steps to make an
/etc/passwd parser API which can be used like this:
<pre class="code">
from passwd_reader import read_passwd_entries
with read_passwd_entries("/etc/passwd", errors="strict") as reader:
location = reader.location
for entry in reader:
print("%d:%s" % (location.lineno, entry.name))
</pre>
I think the previous essay was a bit too detailed to understand the
overall points, so in this essay I'll summarize what I did and why I
did it. Hopefully it will also help me prepare for Finland.
</P><P>
The /etc/passwd parser is built around a generator, which is a pretty
standard practice. Another standard approach is to build a parser
object, as a class instance which implements the iterator protocol.
The main difference is that the generator uses local variables in the
generator's execution frame where the class approach uses instance
variables.
</P><P>
Since the parser API can open a file, when the first parameter is a
filename string instead of a file object, I want it to implement the
context manager protocol and implement deterministic resource
handling. If it always created a file then I could use
contextlib.closing() or contextlib.contextmanager() to convert an
iterator into a self-closing context manager, but my
read_passwd_entries reader is polymorphic in that first parameter, so
I can't use a standard solution.
</P><P>
I instead wrapped the generator inside of a PasswdReader which
implements the appropriate __enter__ and __exit__ methods.
</P><P>
I also want the parser to track location information about current
record; in this case the line number of the current record but in
general it could include byte position or other information about the
record's provenance. I store this in a Location instance accessed via
the PasswdReader's "location" attribute.
</P><P>
The line number is stored as a local variable in the iterator's
execution frame. While this could be accessed through the generator's
".gi_frame.f_locals", the documentation says that frames are internal
types whose <q>definitions may change with future versions of the
interpreter</q>. That doesn't sound like something I want to depend
upon.
</P><P>
Instead, I used an uncommon technique where the generator registers a
callback function that the Location can use to get the line number.
This function is defined inside of the generator's scope so can access
the local variables. This isn't quite as simple as I would like,
because exception handling in a generator, including the StopIteration
from calling a generator's close(), is a bit tricky, but it does work.
</P><P>
The more common technique is to rewrite the generator as a class which
implements the iterator protocol, where each instance stores its state
information as instance variables. It's easy to access instance
variables, but it's a different sort of tricky to read and write the
state information at the respectively start and end of each iteration
step.
</P><P>
A good software design balances many factors, including performance
and maintability. The weights for each factor depend on the expected
use cases. An unsual alternate design can be justified when it's a
better match to the use cases, which I think is the case with my
uncommon technique.
</P><P>
In most cases, API users don't want the line number of each record.
For the /etc/passwd parser I think it's only useful for error
reporting. More generally, it could be used to build a record index,
or a syntax highlighter, but those are relatively rare needs.
</P><P>
The traditional class-based solution is, I think, easier to understand
and implement, though it's a bit tedious to save and restore the
parser state for each entry and exit point. This synchronization adds
a lot of overhead to the parser, which isn't neeed for the common case
where that information is ignored.
</P><P>
By comparison, my alternative generator solution has a larger overhead
- two function calls instead of an attribute lookup - to access
location information, but it doesn't need the explicit save/restore
for each step because those are maintained by the generator's own
execution frame. I think it's a better match for my use cases.
</P><P>
(Hmm. I think I've gone the other way. I think few will understand
this without the examples or context. So, definitely lots of code
examples for Pycon Finland.)
</P>
http://www.dalkescientific.com/writings/diary/archive/2014/09/23/etc_passwd_api_summary.htmlTue, 23 Sep 2014 12:00:00 GMTLausanne Cheminformatics workshop and contesthttp://www.dalkescientific.com/writings/diary/archive/2014/07/24/cheminformatic_tool_contest.html<P>
Dietrich Rordorf from <a
href="http://en.wikipedia.org/wiki/MDPI">MDPI</a> sent an announcement
to the CHMINF mailing list about the upcoming <a
href="http://cheminformatics.epfl.ch/workshop/20140912program.shtml">9<sup>th</sup>
Workshop in Chemical Information</a>. It will be on 12 September 2014
in Lausanne, Switzerland. It seems like it will be a nice meeting, so
I thought to forward information about it here. They also have a
software contest, with a 2,000 CHF prize, which I think will interest
some of my readers.
</P><P>
The workshop has been around for 10 years, so I was a bit suprised
that I hadn't heard of it before. Typically between 30 and 50 people
attend, which I think is a nice size. The preliminary program is
structured around 20 minute presentations, including:
<ul>
<li>Peter Ertl - Database of bioactive ring systems with calculated properties and its use in bioisosteric design and scaffold hopping</li>
<li>Micha&euml;l Zasso - Slice based electronic laboratory notebook</li>
<li>Dragos Horvath - Chemigenomics - is it more than inductive transfer?</li>
<li>Guillaume Godin - GC/MS identification in a Browser</li>
<li>Modest Korff - Sub-pharmacophore models as seeds in drug discovery</li>
<li>Jean-Louis Reymond - Interactive tools for visualisation and virtual screening of large compound databases</li>
<li>Michael J E Sternberg - INDDEx - logic-based ligand screening for scaffold hopping</li>
<li>Thomas Sander - 2D scaling with rubber bands and descriptors</li>
</ul>
If you know the authors, you might recognize that one is from
Strasbourg and another London, and the rest from Switzerland. I can
understand. From where I live in Sweden it will cost over US $300 in
order to get there, and Lausanne doesn't have its own commercial
airport so I would need to fly into Geneva or Bern, while my local air
hub doesn't fly there directly.
</P><P>
But I live in a corner of Europe, and my constraints aren't yours.
</P>
<h2>Source code contest</h2>
<P>
I had an email conversation with Luc Patiny about an interesting
aspect of the workshop. They are running a <a
href="http://www.mdpi.com/journal/challenges/special_issues/cheminformatic_tool_contest">contest
to identify the best open source cheminformatics tool of the year</a>,
with a prize of 2000 CHF. That's 1650 EUR, 2200 USD, 1300 GBP, or
15000 SEK, which is plenty enough for someone in Europe or even the US
to be able to travel there! They have a time slot set aside for the
winner of the contest to present the work. The main requirement is
that contest participants are selected from submissions (1-2 pages
long) to the open access journal <a
href="http://ideas.repec.org/s/gam/jchals.html">Challenges</a>. (And
no, there are no journal page fees for this contest, so it doesn't
seem like a sneaky revenue generating trick.)
</P><P>
The other requirement is that the submission be "open source". I put
that in quotes because much of my conversation with Luc was to
understand what they mean. They want people to be able to download the
(unobsfucated) software source code for no cost and be able to read
and review it to gain insight.
</P><P>
I think this is very much in line with classical peer review thought,
even though it can include software which are neither <a
href="http://opensource.org/osd">open source</a> nor <a
href="https://www.gnu.org/philosophy/free-sw.html">free
software</a>. For example, software submissions for this contest could
be "for research purposes only" or "not for use in nuclear reactors",
or "redistributions of modified versions are not permitted."
</P><P>
Instead, I think their definition is more in line with Microsoft terms <a
href="http://en.wikipedia.org/wiki/Shared_source">shared source</a>.
</P><P>
In my case, my latest version of <a
href="http://chemfp.com/">chemfp</a> is 'commercial open source',
meaning that those who pay me money for it get a copy of it under the
MIT open source license. It's both "free software" and "open source",
but it's not eligible for this workshop because it costs money to
download it.
</P><P>
But I live in a corner of open source, and my constraints aren't
yours. <tt>;)</tt> If you have a free software project, open source
software project, or shared source software project, then you might be
interested in submitting it to this workshop and journal. If you win,
think of it as an all-expenses paid trip to Switzerland. If you don't
win, think of it as a free publication.
</P>
http://www.dalkescientific.com/writings/diary/archive/2014/07/24/cheminformatic_tool_contest.htmlThu, 24 Jul 2014 12:00:00 GMT