Saturday, 30 November 2013

Recently I had to quickly come up with Python code that found the longest path through a weighted DAG (directed acyclic graph). This particular DAG had a single start node and a single end node, but there were multiple weighted paths from the start to the end. "Longest path" in this context meant the path with the highest sum of weights.

To begin with, I coded up an exhaustive algorithm for trying all possible paths. This wouldn't be fast enough in the general case, but it didn't take long to write and is useful for comparison.
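An exhaustive search like this can be sketched as follows (a minimal sketch, not my actual code; the adjacency-list graph format and the example graph are assumptions for illustration):

```python
# Exhaustive search: enumerate every path from start to end in a weighted DAG
# and keep the one with the highest sum of weights.
# Assumed graph format: {node: [(neighbour, weight), ...]}

def longest_path_exhaustive(graph, start, end):
    best = (float("-inf"), None)  # (best length, best path)

    def visit(node, length, path):
        nonlocal best
        if node == end:
            if length > best[0]:
                best = (length, path)
            return
        for nbr, wt in graph.get(node, []):
            visit(nbr, length + wt, path + [nbr])

    visit(start, 0, [start])
    return best

# A toy DAG: two routes from A to D, the heavier one via C
graph = {"A": [("B", 2), ("C", 5)], "B": [("D", 4)], "C": [("D", 3)], "D": []}
print(longest_path_exhaustive(graph, "A", "D"))  # (8, ['A', 'C', 'D'])
```

Since a DAG has no cycles, the recursion always terminates, but the number of paths can grow exponentially with graph size, which is why this only serves as a reference implementation.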

Internet rumour (contradicted by Wikipedia, I now see) had it that the well-known Dijkstra's algorithm for shortest paths could be adapted to find longest paths (but only in a DAG) by negating the weights. However, a quick test of a couple of libraries indicated that negative weights were not supported (a segfault with R, for example), and I didn't want to code Dijkstra myself, so I decided to hunt around the internet for a direct solution to the longest path problem.

Geeksforgeeks (appropriately enough, I guess) had a lucid description of a 3-step algorithm, which I implemented. The only snag was that Step 1 involved a topological sort. Again, I didn't want to spend any time figuring this out, so I looked for an implementation in Python and found one at ActiveState's Python recipes. I note that another popular Python implementation is available at PyPI.

Here's the code (if you are using an RSS reader to read this, you may have to click through to the original blog post):
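For readers who can't see the embedded code, the 3-step approach can be sketched like this (a sketch only, not the original code; the graph format is the same assumed adjacency-list layout, and the topological sort here is Kahn's algorithm rather than the ActiveState recipe):

```python
# Longest path in a weighted DAG via topological sort:
#   Step 1: topologically sort the nodes
#   Step 2: initialise all distances to -infinity, except the start (0)
#   Step 3: process nodes in topological order, relaxing each outgoing edge
# Assumed graph format: {node: [(neighbour, weight), ...]}

def longest_path_dag(graph, start, end):
    # Step 1: topological sort (Kahn's algorithm)
    indegree = {n: 0 for n in graph}
    for edges in graph.values():
        for nbr, _ in edges:
            indegree[nbr] += 1
    order, queue = [], [n for n, d in indegree.items() if d == 0]
    while queue:
        node = queue.pop()
        order.append(node)
        for nbr, _ in graph[node]:
            indegree[nbr] -= 1
            if indegree[nbr] == 0:
                queue.append(nbr)
    # Step 2: distances start at -infinity, except the start node
    dist = {n: float("-inf") for n in graph}
    dist[start] = 0
    # Step 3: relax edges in topological order
    for node in order:
        if dist[node] == float("-inf"):
            continue  # unreachable from start
        for nbr, wt in graph[node]:
            dist[nbr] = max(dist[nbr], dist[node] + wt)
    return dist[end]

graph = {"A": [("B", 2), ("C", 5)], "B": [("D", 4)], "C": [("D", 3)], "D": []}
print(longest_path_dag(graph, "A", "D"))  # 8
```

Because each edge is relaxed exactly once, this runs in O(V + E), in contrast to the exhaustive search.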

Thursday, 28 November 2013

I've just been using GNU parallel for the first time. It makes running jobs over multiple CPUs trivial.

In the past, if I had a large number of single-CPU computationally intensive jobs and multiple CPUs to run them over, I would create separate bash scripts for each CPU with a line for each calculation, e.g. ./runthis input1.smi > output1.txt. This is not super-ideal as different jobs take different lengths of time and so any CPU that finishes its bash script ahead of schedule just sits there idle. It also involves making N separate bash scripts.

Enter GNU parallel. This comes with several Linux distributions, but on CentOS I just quickly installed it from source. Once done, you just need to put all of the jobs in a single script and pipe it through parallel:

cat myjobs.sh | parallel -j7 # e.g. for 7 CPUs
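For illustration, the jobs file itself might be generated like this (a sketch, assuming inputs named input*.smi and a ./runthis executable, as in the example above):

```shell
# Dummy inputs for illustration (in practice these already exist)
touch input1.smi input2.smi input3.smi

# One jobs file with a line per job, instead of N per-CPU scripts;
# parallel then balances the lines across CPUs as each one finishes
for f in input*.smi; do
    echo "./runthis $f > ${f%.smi}.txt"
done > myjobs.sh

cat myjobs.sh
```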

There are a whole load of complicated ways of using parallel. I'm very happy with this one simple way.

Wednesday, 20 November 2013

There's been a longstanding problem with the 2d layout in Open Babel. When support for correct layout of cis/trans stereo was added, it was done as a correction to the initial layout rather than as an intrinsic part of the layout. This means that after nicely laying out a molecule to avoid overlaps, it may then flip two groups at the end of a double bond, crashing one part of a molecule into another.

At the time, given that the depiction code had just moved from no support for stereo to supporting it correctly (at a small cost to the layout), sorting this out was way down my priority list. Anyhoo, a couple of days ago I decided to revisit this issue. Here's an example before and after:

I thought it might be interesting for some (or perhaps not) to see behind-the-scenes how one might go about sorting out something like this. In this case, I already knew the technical fix, but it still took some time to ensure it was done right. It's not quite #realtimechem, but probably as close as I'll get.

Day one
1. Update local git repo by pulling from the main branch
2. Create (and switch to) branch for bugfix
3. Build Open Babel under MSVC (debug and release)
4. Some build issues with recent code (presumably only tested on Linux) for chemicaljson and chemdoodle plugins. Ignore for now as it doesn't impact this work (note to self: fix).
5. Need to generate problematic test cases. Convert several thousand ChEMBL molecules to SMILES and use grep to remove those containing "." and keep those containing either "\" or "/". Write a Python script to sort them by size and add the line number as the title (the titles are useful for debugging; smaller test cases are easier to handle).
6. Add print statements to the code to print out the measured "overlap error" before and after calling the code to fix the 2D layout
7. Use obabel to convert my test cases to SVG and pipe the output to a file
Night came. Morning came. The second day.
8. Coding time: move the double-bond correction to immediately after the initial layout.
9. When doing the somewhat exhaustive rotation around bonds for layout, do not alter bonds with defined double bond stereo.
10. Add print statements to the code to print out the measured "overlap error" afterwards
11. Use obabel to convert the test cases to SVG and pipe the output to a file
12. Write a Python script to compare the overlap errors for the old and new code
13. Visually inspect some of the cases where an improvement is supposed to have occurred.
14. Oh oh! I see an example where the layout has improved but the stereo is now wrong.
Night came. Morning came. The third day.
15. I run the problem case in the debugger and step through the code inspecting the molecule object in the depiction code. Funny thing is, my code appears to be correct but none of the bonds are marked with stereo refs and so it doesn't do anything.
Night came. Morning came. The fourth day.
16. I run the problem case in the debugger with the original code. The stereo refs are there.
17. I update to the new code and run it again. The stereo refs are there originally but must be disappearing later. I step further along. A copy is made of the molecule; it seems like it uses its own special clone method. Looking further along, the bonds are cloned but the stereo refs are not being copied at that point. I fix this and all is well.
18. I run the test cases again, run my comparison Python script and start visually inspecting problem cases.
19. Seems like everything is working better than before.
20. I write a blog post before I forget everything.
21. Still need to remove the print statements, inspect the diff, commit the code, and send a pull request.
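The sort-by-size helper from step 5 might look something like this (a sketch only; the actual script isn't shown, and SMILES length is used here as a rough proxy for molecule size):

```python
# Sort SMILES test cases by size (smaller cases are easier to debug) and
# append the original line number as a title, so each depiction can be
# traced back to its source line in the ChEMBL-derived file.

def sort_by_size(smiles_lines):
    # Decorate with (length, original line number), sort, then format
    numbered = sorted((len(s), i + 1, s) for i, s in enumerate(smiles_lines))
    return ["%s line%d" % (smi, lineno) for _, lineno, smi in numbered]

smiles = ["CC(=O)O/C=C/C1=CC=CC=C1", "C/C=C\\C", "CC/C=C/CCO"]
for line in sort_by_size(smiles):
    print(line)
```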

This release has no new functionality apart from updates to the parser. The main reason for this new 3.0 release is that the codebase has been modernised so that GaussSum will continue to work in future. Specifically, GaussSum is now a Python 3 application, and it uses Python's Matplotlib for all graphs instead of Gnuplot. Also the dependency on Python Imaging Library has been removed.

Unfortunately, because of the move to Python 3, Linux users who are not on fairly recent distributions may not find either python3-numpy or python3-matplotlib in their package manager. If this is the case, you'll either have to update your distro or try compiling them yourself.

The Windows release is now a self-installing .msi courtesy of Python's cx_Freeze.

Since I'm now using Matplotlib instead of Gnuplot, if you want to spruce up your staid scientific plots à la xkcd, uncomment "# pyplot.xkcd()" in mpl.py. Then you can make graphs like the following:
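The same effect can be tried outside GaussSum with plain Matplotlib; a minimal sketch (the data and axis labels here are placeholders, not GaussSum output):

```python
# Render a plot in Matplotlib's hand-drawn xkcd style.
# (If the "Humor Sans" font isn't installed, Matplotlib warns and
# falls back to another font, but the wobbly lines still appear.)
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; no display needed
from matplotlib import pyplot

with pyplot.xkcd():  # context manager: style applies only inside the block
    fig, ax = pyplot.subplots()
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
    ax.set_xlabel("Energy (eV)")       # placeholder label
    ax.set_ylabel("Intensity (arb.)")  # placeholder label
    fig.savefig("xkcd_plot.png")
```

In GaussSum itself, uncommenting the pyplot.xkcd() line in mpl.py applies the style globally instead.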

Like a perfect alignment of stars, the Teach-Discover-Treat initiative touches on several of my interests. Its aim is to encourage the development of tutorials for drug discovery that involve freely-available software. The winners get a cash prize plus partial reimbursement of expenses to attend the ACS National Meeting, where they will present the results. The competition is now open over at the TDT website.

I was fortunate to be present at the recent Spring ACS in New Orleans where the winners of the inaugural competition were presenting: Paulo Tosco, David Koes, George Nicola and Khaled Elokely. Unfortunately the tutorials themselves do not seem to be directly available on the web (and so will not be found by Google) but after agreeing not to be unethical you can download them from this page.

The results of the current competition will be presented at the ACS in San Francisco next year. Maybe I'll see you there.