Tuesday, 21 December 2010

It's the Open Babel user documentation. It still has a (*cough*) few TODOs here and there, but it brings together various bits of documentation scattered over the wiki and covers a bit of everything I think. It's been on the web as HTML for a while, but I've just set up the automated build for a PDF and so am giving it a shout out.

So grab a few mince pies. Throw another logP on the fire, and get ready for some riveting reading. Then dial it back a notch and check out the Open Babel docs.

P.S. If you find any mistakes or have any suggestions, feel (very) free to just fix them yourself or add some text in by forking the docs on github.

Monday, 13 December 2010

About a year ago, SciTouch LLC announced the release of the open source cheminformatics library Indigo. At that time, the full API of the library was not exposed to the user. Instead, a set of simplified APIs were available along with a number of command-line applications.

Just a few weeks ago at Goslar, Dmitry and Mikhail of Indigo made a couple of announcements. First, the library, documentation, etc. is no longer at SciTouch, but at GGA software. Second, the full API is now being made available. It's actually a C library, but wrappers are available for C#, Java and Python. If you're interested in accessing it from Python, follow the Download link to the Python API.

The Indigo library is currently at the version 1.0 Beta 3 stage, but new API functions are being added on request over at the Indigo mailing list. So if there's something missing that you'd like to have, get over there now and ask about it.

Naturally, I'm interested in adding support for Indigo to Cinfony. I've already pretty much done the Python bindings. Just put indy.py into the same directory as indigo.py and away you go. As usual, let me know if you find any bugs.

Wednesday, 8 December 2010

Here's an image from PubChem. I count two chiral centers, but for how many of these is the chirality specified?

As Egon pointed out in a comment on the original post, it's impossible to interpret this image without assuming the use of a particular wedge/hash bond convention. Either the stereochemistry is defined at both edges of the wedge, or it's defined only at one end.

The problem with MOL files

But this isn't just a problem with images of molecules; this same problem affects 2D MOL files. The underlying MOL file for the above molecule has that same bond marked as a wedge. Any time this happens to a bond connecting two chiral centers, there is a resulting ambiguity that requires a particular convention to be assumed. If your primary means of storing chemical data is a 2D MOL file, you should start feeling nervous right about now.

To be fair, the example discussed is really an isolated case in PubChem. In the case of the first 23071 molecules in PubChem, there are 14362 bonds connecting two chiral centers, but there are only 21 instances where they have been marked as wedge or hash. (I note that for all of these cases, it was possible to choose a different stereobond to avoid this problem.)

Another database whose primary means of storing chemical information is 2D MOL files is the ChEMBL database. This contains 635933 molecules, with 482773 inter-chiralcenter bonds. Of these bonds, there are 7335 marked as stereobonds. In other words, more than 1% of the molecules have an ambiguous stereocenter, simply because of the way the stereochemistry was encoded into the MOL file. This is probably a bit high, and I expect that ChEMBL will either fix this or point out an error in my calculations.

Interpret this

Okay. So we know the problem. But we still need to answer the original question...what stereochemistry was intended for the molecule in question?

The most common (and recommended) convention for handling a wedge/hash bond connecting two stereocenters is to consider the stereochemisty as defined only at the thin end of the wedge. If you look at the SMILES string calculated by OEChem, and the InChI string (calculated by the official InChI binary?), you will see that this is how the MOL file was interpreted.

But, here's the thing...

I think that the convention that the generator of the MOL file was following was that the stereochemistry was defined at both ends. Why? Well, firstly, similar molecules in the database have their stereochemistry clearly defined, most notably ascorbic acid of which this is a derivative. And secondly, the chiral marks in the connection table in the MOL file indicate that the stereochemistry is defined at both centers.Correction (see comment by WDI below): The chiral marks do not indicate this - they actually say that the stereochem is only defined at one center.

What to do

The reason I'm even looking into this is because I'm trying to figure out how Open Babel should handle these cases. When reading them, should it just assume that the MOL file was following the common convention but issue a warning to flag up the fact that the MOL file sucks? Should it provide an option to read other conventions? Should it avoid writing files that contain inter-chiralcenter stereobonds even if they were in the input, or will that upset users who expect Open Babel to pass wedge/hash bonds through unchanged?

Of course, this could all be avoided if people would just fix their 2D MOL files. With that in mind, here a couple of lines of code to identify such problems (requires dev version of OB 2.3):

Tuesday, 30 November 2010

How much does it cost to attend a conference, all in? If your boss is paying then you mightn't care. However, you need to have a figure in mind when you start writing grants because you need to budget it in. Want to go to one European and one international conference a year for three years? How much would that set you back?

I recently attended the German Conference on Cheminformatics in Goslar, and I determined that this time, at least, I would do my utmost (short of squabbling over the bill) to collect all relevant receipts as I had specific funding just for those costs.

Now, bearing in mind that Cork->Goslar is an indirect flight involving an overnight stay plus some train journeys here are the figures (rounded to the nearest 20 euro, E&OE):

Thursday, 25 November 2010

As I said only last year, when I started programming I used to play around with graphics. So here are a couple of programming problems based around graphics, and specifically around the Sierpinski triangle.

Random motion within a triangle

What do you get if you put the following recipe together?

(1) Draw an equilateral triangle and label the corners A, B and C(2) Set your current position, P, to corner A.(3) Choose one of the three corners at random and move P halfway to that corner.(4) Draw a dot at the new location of P.(5) Go back to 3 (Repeat 500 or 1000 times in total)

Apparently, this is called the Chaos game. If you follow the links on the Wikipedia page I linked to, it seems that the same method could be used to draw a fern fractal.

Triangles within triangles

This one is a bit tricky in terms of programming as you need to learn about recursion (functions that call themselves). It's not rocket science but it's a bit confusing first time you see it. Not to worry - let's just plough ahead anyway :-)

(0) Draw a white equilateral triangle of length M (some value) pointing upwards centered around a point N (some point in the middle of the screen).(1) Write a function that takes a point on the screen, P, and a length, L, and draws an upside-down black equilateral triangle centered around P with sides of length L. For example, if this function is called with L=M/2 and with P=N, this should leave three white equilateral triangles on the screen (separated by the black one).(2) The second thing the function should do is if L > M/16, it should call itself three separate times to draw three black triangles of length L/2 centred in the middle of each of the three white equilateral triangles around it.(3) Now, back in the main part of the program, call the function you've just written using N for the value of P, and M/2 for the value of L.

If you get this working, this is a fun program to play around with. What happens if you use squares or circles? Another related structure is the Koch snowflake: there's a recipe at Wikipedia.

Friday, 19 November 2010

I'm interested in analysing the barrier to rotation around a particular bond in a molecule. A simple way to look at this is to fix the torsion angle around the bond, and optimise the other coordinates. The end result is a plot of the energy versus the torsion angle.

Here I'll look at the energy profile when rotating around the C-C bond in ethane. There seem to be two approaches to do this using Firefly. The first is described at the Firefly website and involves using IFREEZ in the STATPT group.

However, this approach seems to require users to spell out the internal coordinates themselves. Instead, it's easier to have Firefly work out the coordinates itself. After some hacking based on examples in the docs (see links below), I came up with the following which appears to give the same results but I think is more generally useful:

(2)For the actual system of interest, I found that the initial conversion of the 'displacement' from internal to cartesian coordinates failed to converge until I set the starting angle for the scan to close to the angle in the input data.(3) The graph was created using the following Python code:

Wednesday, 17 November 2010

Here's a graph I created showing the relationship between the first few elements of the periodic table and...what?Every year around November the ACS sends me a mug corresponding to a different chemical element. The first year ("H") the mug arrived in one piece...:-)

Tuesday, 16 November 2010

Last week I attended the 6th German Conference on Chemoinformatics at Goslar for the first time. This was thanks to the Jacques-Émile Dubois Grant from the CSA Trust which covered the cost of my attendance. There aren't many awards available for young cheminformaticians, so if you're 35 or younger think about applying (it's open right now) - it's a short and simple application, and could cover the cost of attending a conference or visiting another group.

Here's a graph I created showing the relationship between the first few elements of the periodic table and...what?Don't take it too seriously, as it's not quite a scientific study, but here's a hint: every year around November I will be able to add the next element to the graph.

Thursday, 4 November 2010

I've just been spending some time ensuring that Tim's new symmetry and isomorphism code in Open Babel can be accessed from Python (you can already access it from C++). It didn't make it into the 2.3.0 official release but I'll push it out into the Windows Python in the next while.

First the symmetry classes and canonical labelling. These can be calculated and accessed as follows (the test molecule is shown in the image above):

Wednesday, 27 October 2010

I am very happy to finally announce the release of Open Babel 2.3.0, the next major release of the open source chemistry toolbox.

Open Babel has been downloaded over 160,000 times and is used in over 40 projects.

This release represents a major update and should be a stable upgrade, strongly recommended for all users of Open Babel. Highlights include a completely rewritten stereochemistry engine, Spectrophore descriptor generation, 2D depiction, improved 3D coordinate generation, conformer searching, and more. Many formats are improved or added, including CIF, PDBQT, SVG, and more. Improved developer API and scripting support and many, many bug fixes are also included.

I would like to personally thank a few people for extra effort in improving this release. In alphabetical order Chris Morley, Noel O'Boyle, Tim Vandermeersch, and Silicos NV, which donated their Spectrophore(TM) framework to Open Babel 2.3.

This is a community project and we couldn't have made this release without you. Many thanks to all the contributors to Open Babel including those of you who submitted feedback, bug reports, and code.

Friday, 8 October 2010

In this (final) post on the topic (see Parts I, II), I will look at the error rate going from SDF->Canonical SMILES (using Open Babel) versus SDF->SMI->Canonical SMILES. Note that unlike InChI, there is no standard 'canonical SMILES' (although there has been some talk about this in the context of OpenSMILES) - each toolkit has its own way of creating it.

So, a failure rate of five out of 18084. 7527 and 21918 seem to be the same problem (same substructure involved) which is related to graph symmetry, while 167, 9107 and 23699 are kekulization issues. (Update 11/10/2010: 167 and 23699 fixed.)

Credits: The main Open Babel developers are Geoff Hutchison, Chris Morley, Tim Vandermeersch, Craig James and myself. Code contributions from many more.

Thursday, 7 October 2010

In a comment on the previous post, Richard Hall asked about the error rate going from SDF->SMI->InChI. A very good question. Clearly, this tests the fidelity of reading and writing to and from SMILES. But there's a second less obvious effect...

But first, the results.

For the initial set of 18053 molecules, we find 108 disagreements (0.6%) with the InChIs obtained by converting using Open Babel straight from SDF->InChI. Of these, 25 have an error in the molecular formula in the InChI. These are straightforward bugs in Open Babel in determining the correct number of implicit hydrogens when reading some SMILES (Update 08/10/2010: Now fixed).

The others are more interesting disagreements: when converting from SDF -> InChI, the InChI library itself gets to decide which are the stereocenters*; when converting from SMI -> InChI, the InChI library needs to accept what Open Babel tells it. In other words, disagreements arise when the internal stereochemistry models in the two libraries disagree.

I took a look at three which appeared to disagree in different ways.

[CID 15550] Going through SMI leads to loss of stereo at a double bond in a ring systemFrom SDF: InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2/b3-1-,4-2-From SMI: InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2

Whoops, it's those pesky double bonds in ring systems as discussed in the previous post. Might be time to look into this. The ring in question is an 8-membered ring. Is it possible to have a trans bond in an 8-membered ring?

The double bond in question is a [H]N=C bond. Open Babel doesn't think this can be a cis/trans bond; InChI thinks it can. Anyone actually know?

[CID 15456] Going through SMI leads to loss of stereo at two tetrahedral centers. From SDF: InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5/h6-8H2,1-5H3,(H,12,15,16)/t13-,14-/m0/s1From SMI: InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5/h6-8H2,1-5H3,(H,12,15,16)

The tetrahedral centers in question are both sp3 nitrogens where two of the (three) bonds to the nitrogen are part of the same ring. Again, there is a disagreement between Open Babel and InChI on whether such nitrogens can be tetrahedral centers.

The good news about these results is that we're almost down to the level where the only disagreements we see are disagreements on stereocenters rather than plain buggy bugs.

The bad news is that it's not clear what to do about these disagreements. Setting aside the "Open Babel is wrong - no, InChI is wrong!" discussion, another cheminformatics library will produce different InChIs yet again depending on how it defines stereochemical centers. If we could all agree on what constitutes a reasonable stereocenter there wouldn't be any problem. Alternatively, we could just follow whatever InChI says is a stereocenter even if we don't agree...?

* I apologise for use of the American spelling throughout. It's a symptom of preparing a paper for an ACS journal. Normal service will resume shortly.

Tuesday, 5 October 2010

If you've followed my posts on parsing stereochemistry in SMILES, you'll have realised that every conversion between different chemical formats has the very real possibility of losing or confusing information. There are several ways to identify such problems. One way is to compare results of a particular conversion to an independent standard.

Here I'll calculate the error rate of conversion from SDF to InChI using Open Babel, compared to doing the same conversion using the official InChI binary. Note that what we are actually calculating is the error rate of conversion from SDF -> Open Babel's internal chemical model -> InChI; it's not that Open Babel just hands the InChI code the raw SDF file.

This test is using the OB 2.3 development code. The test file is the first entry in PubChem3D. This contains 18084 3D structures.

First, run the InChI binary (I'm using Windows):

inchi-1.exe Conformers_00000001.sdf /AuxNone 2> errors.txt

Next, convert from SDF to InChI with obabel.exe (InChI format options described here):

Ah...now we have something. I found 15 disagreements on the InChI. Hmmmm...but 13 of these involve molecules with isotopes of Br...one quick bug fix later (SVN r4134), I have 2 errors left: molecules 144031 and 144132. These both have multiple double bonds in ring systems, and I think there may be a difference in opinion between Open Babel and InChI over the cutoff for the size of ring in which the stereochemistry of double bonds should be considered...but that's a problem for another day.

So how does the current release compare to this? Not so well, not so well at all. We started reimplementing stereochemistry in Open Babel about 1.5 years ago, and it's only now we're getting such good performance. In short, if stereochemistry in InChI is important for your application, you should wait for the 2.3 release (or run the development code).

Thursday, 30 September 2010

Due to popular demand (just kidding!), here is yet another post on the subject of stereochemistry in SMILES, this time focusing on the handling of ring bonds (ring opening/closure) at cis/trans stereo centers. This follows on from Part I.

The key point (and difficulty) when dealing with rings bonds on such double bonds is that, since the ring bond appears twice in the SMILES string (at both the opening and closing), the stereo symbol can appear at either occurrence or indeeed both. (I think this was a mistake in the SMILES specification, but there you go.) When writing a SMILES string, the preferred syntax just shows the stereo symbol at the end on the double bond (Open Babel will only output this syntax).

The following structure will be used as an example:So, using the preferred syntax, a SMILES string for the example structure would be:

(a) C/C=C\1/NC1

In other words, from carbon-3 it's down to the C of the ring closure, and up to the N of the ring, where up and down are relative to carbon-1.

There's no need to specify the stereochemistry of both groups on the right-hand side, of course. The following SMILES is equivalent to (a) although not so clear:

(b) C/C=C1/NC1

Coming back to SMILES (a), we could have written the stereo symbol at the ring closure, or indeed at both ends:

(c) C/C=C\1/NC/1(d) C/C=C1/NC/1

Note that the symbol used for the ring opening is the opposite of that for the ring closure. The rationale for this is that from the point of view of carbon-4, carbon-3 is up (hence C/1), whereas from the point of view of carbon-3, carbon-4 is down (hence C\1). Whatever...just stick to form (a) and you won't need to think about this, as it will just be a source of errors. (It would have been simpler for everyone if Daylight had only allowed the stereo symbol at the carbon on the double bond.)

So much for valid SMILES. How should invalid SMILES be handled? Consider the following:

(e) C/C=C\1\NC1

Both the ring closure and the N are down...? I don't think so. This should be treated as undefined stereochemistry.

How about the case where the two ring bonds have stereo symbols which are not in agreement?

(f) C/C=C\1NC\1(g) C/C=C\1/NC\1

In both of these cases the stereochemistry for the ring bonds should be considered undefined. In Open Babel, I've chosen to handle these as follows:

Monday, 20 September 2010

Sometimes you just need to identify the chemical structure in an SD file, but don't have access to a graphics terminal. For example, you could be logged into a server at a remote location. What to do?

Well, it turns out that you can depict pretty much anything using text symbols - this is known as ASCII art. Ubuntu has two of the main ASCII art libraries available, aa-lib and libcaca (named by a 10-year old). With both of these there are associated viewers, asciiview and cacaview.

There a few ways to go here: either a cheminformatics library could directly depict a molecule using ASCII art, or it could depict it using one of these libraries, or we can be lazy and just convert an existing PNG to text. The first case is likely to produce a better quality image - it is actually the subject of a paper by Raymond Carhart in JCICS in 1976 (via Pat Walters). Naturally, since this is a blog post, we will take the lazy route here and just convert from PNG to text.

So, this is the original image:I found it better to convert to B&W by thresholding all non-white pixels to black:

convert orig.png -threshold 99% blackwhite.png

Running asciiview, we have the following:

Note that the structure is immediately clear. Still - we can do better. If we "-negate" the image first, we have:

How about for cacaview?

Not so good. However, both asciiview and cacaview have zoom and pan functionality and once we zoom in, the structure can be clearly identified:

I was originally thinking of including this functionality in Pybel (which, with the help of OASA, can generate 2D depictions as PNG files), but I think that generating such text images is best done through these ASCII art viewers, as you might need to zoom and pan to get the "full picture".

In a comment on FriendFeed, Hari wondered whether an exact depiction of a chemical structure could be made using Unicode characters. Good question. But first, how close can we get with ASCII characters? Here's my best attempt:

Monday, 13 September 2010

A chemistry undergraduate from South America recently emailed me asking about how to get into cheminformatics:

My area is chemistry and I'm very interested about cheminformatics. Actualy, I'm using Python to develop a software to make some analysis (image analysis applied to chemistry). Here in ----, the college course of chemistry don't have disciplines of informatics related.

Because of this, I got some questions, if you can answer to me, I'll be very grateful:

Have you done chemistry college or some informatics college related?If you have done the chemistry college, how you started to work with computation applied to chemistry?Here, in ----, actualy I think that the cheminformatics is not very known, even in the scientific field. What about in other countrys? The most of people that are working with cheminformatics have done chemistry colleges or some computation college related?

I answered as follows:

My own background is a degree in Chemistry, followed by a PhD in Inorganic Computational Chemistry (DFT calculations). In the field most people have chemistry degrees, although there are also a few computer scientists. The types of problems the two work on are often different; the computer scientists may be more interested in developing methods, while the chemists may be more interested in applying and interpreting the results. I think that most chemists would not do any informatics or programming during their degree - they would just teach themselves at the start of their PhD - it sounds like you have already done this.

When I was 12 or 13, I started programming in BASIC on my home computer and got involved in programming competition for high school students. I didn't have a computer while in university, but during my PhD I started programming again, this time in Python. More recently, I've learnt C++ by working on Open Babel.

If you want to gain expertise in the field, I would very much encourage you to get involved in an open source cheminformatics project. You will learn a lot about programming, organising large projects, testing, how to work with other people, and so on. If you're interested in image analysis you could look at OSRA, etc. You may also want to subscribe to the blueobelisk mailing list or ask a question at blueobelisk.shapado.com.

Cheminformatics is not a very well known field - I didn't know what it was until I started doing it, even though I had done computational chemistry during my PhD. The main countries associated with cheminformatics are the UK, US and Germany, it seems to me; these are the countries where a lot of the pharmaceutical companies do drug design. But you can do cheminformatics anywhere - you just need a computer.

What advice would you give? I'm especially keen to hear from cheminformaticians from South America. (I'll point the student to this blog post)

Friday, 3 September 2010

You know you have arrived in Liverpool when your first sight on exiting the airport is a yellow submarine. The mug you just bought showing the cover of Sgt. Pepper's Lonely Hearts Club Band is also a good clue.

The 1st Open Babble meeting took place this week in Chester, UK, near Liverpool and Manchester. For the first time in its 7 (8?) year history, the developers of Open Babel met each other and discussed issues such as whether we should open brackets on the same line or on the next line. Well, that and other more serious questions. Many discussions aren't really suited to email, and so this was a chance for us to cover a whole variety of issues in a short space of time.

We (Marcus, Tim, Chris and I) worked off a agenda of topics pre-suggested through a Google spreadsheet, keep very brief minutes to record decisions/action items, and brought Geoff in remotely through Skype after sending around the minutes. Although we had a couple of slides prepared here and there, in general it was all round-table discussions. It worked out pretty well I think.

This Open Babble involved core developers only, and we are going to continue such meetings in future. However, just as a matter of interest, are there people out there interested in a user's meeting, e.g. a 1-day workshop or something of that nature? I've added a poll on the left-hand side where you can indicate your interest (or lack thereof). (Update: Poll now over)

Thursday, 19 August 2010

Update 21/08/2010: I've found enough beta testers for the moment - thanks to all those who volunteered.

I'm looking for beta testers for Confab, a new conformation generator I've been working on over the last while. It's based on Open Babel and aims to generate diverse low energy conformations. In particular I am interested in feedback on what sort of features should be added.

If you have access to a Windows system and are interested trying out Confab, please send me an email at baoilleach@gmail.com and I'll follow up with download instructions.

Note: The software is open source (GPL v2) and the source is available on the fastconf branch (commit b9300dd) at http://github.com/baoilleach/openbabel-svn-mirror.

Tuesday, 17 August 2010

I've been working on moving documentation from the Open Babel wiki (Mediawiki) to a Sphinx-based system, and it's really so much better.

Wikis are great because they have a low barrier to contribution, make it easy to collaborate on documentation and (less importantly) have versioning. Wiki are rubbish because they don't impose a hierachical structure (they have some ideas about categories, but that's just not enough), pages get orphaned, and never get read. (They don't look great either, without a lot of customisation.)

Take, for example, the Open Babel wiki. Over the years a lot of effort has gone into creating content. However, there are many pages on the wiki I simply have never seen before because there is no clear path to them from the front page.

So, what to do?

As I said, I've been moving the docs into Sphinx (either manually, or using Pandoc). This uses a simple markup, reStructuredText (reST), to highlight headings, links and so forth. Most importantly it imposes a hierarchy - only pages that are included in a table of contents get incorporated into the documentation. The best part of all is that unlike a wiki (or at least, unlike Mediawiki) Sphinx can output not only HTML, but also LaTeX (for making PDFs) and HTMLHelp (for Windows CHM files).

But it's not a wiki any more, you're thinking. Well actually, that's where you'd be wrong. Through the magic of the internets, if you store your source code on Github, you have a de facto wiki, because reStructuredText is understood by Github and rendered reasonably well. You can click "Edit" right there on Github, edit the source and immediately see the result (and it's versioned of course).

An in-progress version of the Open Babel user documentation is available here, and you can check out the source code on Github (and fork it too, if you want).

Thursday, 29 July 2010

I met Roger Sayle the other day and he was showing me some of the work he's been doing recently since leaving OpenEye. Apart from OpenEye, you may remember Roger from such programs as Rasmol and also GCC (he's the middle level maintainer; this is apparently the part that fixes your code to be more efficient - see this quiz for examples). He's also the author of the only paper in JCIM to contain Klingon, the Lexichem paper (incidentally this is also, as far as I know [1], the only paper in JCIM that is freely available through Author Choice).

Early this year, Roger set up NextMove Software which is still in early days but is one to keep an eye on. It currently has software to aid in checking and correcting the spelling of chemical nomenclature (e.g. in a user interface, word document, text mining) and has software for searching databases in the works.

What I want to highlight here is a bizarre way he found to speed up a Python script (via Andrew Dalke) by turning it into a function. If you go to his page on PPAccelerator, which is all about a potential product that speeds up Pipeline Pilot scripts (by up to 37 times if interpreted, or 1083 times if compiled), you will see timing comparisons of code for the Mandelbrot fractal written in different languages.

You will see two Python v2.5 examples, the second of which is about twice as fast. Look at the code for each. The only difference in the second case is that the entire script has been wrapped in a function (main() in this case), which is called at the end of the script. This is just plain weird! I don't think I've ever heard of this effect before. Apparently, Python optimizes the bytecode of a function but not module level code. If this is written anywhere in the docs, I certainly can't find it.

On a final note, a quick Google search brings up promise, which may be useful for exploiting such optimizations further.

Notes:

[1] The ACS don't provide any means to list Author Choice articles on their website. Supposedly Author Choice articles should receive increased exposure but the ACS seem to want to hide them.

...it is our pleasure to announce that Silicos NV, a Belgian-based company providing services in the field of computational drug discovery and virtual screening, has made a strategic decision to port its own developed software under the open source domain of Open Babel following the GNU GPL.

...The reason why we have made this strategic decision to port all our software to the open source domain is that we, as management of Silicos, strongly believe in an open innovation model, and open source is just one of these factors that make open innovation possible. For Silicos as a company, we believe that by actively participating and supporting the OB community, we could create more business in the form of services than we could otherwise...

I'm quite excited about this, both about the actual code being contributed and about what this decision signifies for Open Babel. I've always believed in the idea that contributing novel algorithms to a common resource is the best way of maximising usage; you still get full credit for your work, you still get your paper, but you are going to have basically bajillions of users. The fact that a commercial company also groks the advantages is significant.

Porting the code to Open Babel is currently under way, and a formal announcement will be made with the release of Open Babel 2.3.0 in a couple of months.

Wednesday, 23 June 2010

Molekel is an open-source 3D molecular visualisation package for analysing the results of computational chemistry packages. It has been developed at the Swiss National Supercomputing Centre by Ugo Varetto (et al.?). Unlike Avogadro, it does not have facilities for setting up calculations or building structures; however, at least for what I want to do, it has the edge in visualising orbitals and density. So, my curent workflow is to set up Gaussian calculations with Avogadro, and then analyse them with Molekel.

One interesting aspect of Molekel is that the electron density (and orbitals, etc.) are calculated on the fly by Molekel rather than extracted from a cube file. This makes the program much more capable and completely avoids the need for cube files. On the other hand, it is not possible to save the volumetric information generated by Molekel, so there is no way to sanity check the accuracy against a Gaussian cube (e.g. by subtracting them with cubman, and then visualising the results).

It took me a little while to figure out a workflow for using Molekel successfully on Windows with Gaussian 09 log files from a Linux server. Here's how it goes:

Run a single point calculation with POP=(FULL, ESP) and GFOLDPRINT. I also use IOP(3/36=-1) to reduce the file size. The current release of Molekel will not be able to colour surfaces with the electrostatic potential unless the ESP option is specified (this information is courtesy of src/old/readgauss.cpp around line 1043).

Convert the log file to Gaussian03 format using the bash script available here. (I hope they will fix this in the next version of Molekel.)

Convert to dos format with unix2dos. If you don't do this, you will get a segfault from Molekel after opening the file. (They should fix this.)

scp the file over to your Windows machine. (It would be nice to be able to open the file directly on the server using ssh; maybe this is planned for the future?)

Start Molekel, File, Open, Choose "G03", then the filename and "Open". If you don't choose G03 and instead use "All files", it will fail horribly.

Under Display, 3D View Properties, change Samples to "10" to add reasonable anti-aliasing. This also affects images saved to disk so it's worth doing as the difference is quite noticeable.

Also under Display, Set Background to white, and turn off the bounding box and the axes. (There is a bug if you leave the axes on and choose scaling > 1 when saving an image - the axes are repeated several times.)

In practice, I carried out steps (2) and (3) using Cygwin on Windows, but it would be easier to do this on the Linux server. (As a side issue, I note that only the File, Edit, Interaction and Help menu are accessible through keyboard shortcuts - hopefully this will be fixed in a future release.)

To actually draw the orbitals and so forth you go to Surfaces/Electron Density, but see the video tutorials on the Molekel website for more information. One thing that might save you some time is to generate all of the orbital isosurfaces you want first, and then go to File/Save Orbital Pictures and choose a folder - this automatically saves all of the orbital images one after the other with sensible names.

As a final point, if you have a better graphics card than I do, you will be able to apply some really impressive shading effects using the GLSL shaders under Display/Shaders. Again see the video tutorials on the Molekel website for a demo.

Tuesday, 15 June 2010

I was going to say this was a novel method, but unfortunately it isn't. The idea of splitting a data set into training and test sets non-randomly is one that is very appealing, seems quite reasonable at first, and leads to great results. And those results are completely misleading.

The safest procedure when doing the two-third/one-third split (or whatever) of a dataset into training and test sets is to choose the split randomly. As soon as you do anything else, the performance of your results on the test set will not be related to its real-life predictive performance.

Now and then a paper comes out where the training set is chosen to be a diverse set (whether by a Kennard-Stone type procedure, gridding the points, or initial clustering). At first this seems quite reasonable: you've made sure that you have chosen points that span the entire space of your dataset. In actual fact, what you've just done is to ensure that all of your test set points are close to a point in your training set. This means that the predictions on the test set are now woefully optimistic and unrepresentative of predictions on a true test set (which probably won't have the courtesy to be close to points in your training set, nor be distributed evenly across the whole space).

Here's another way of thinking about it. Let's imagine that our predictive model is meant to work on a particular chemical space of molecules (the domain of applicability or so). These molecules have a particular distribution of descriptor variables (or whatever). And both the training and test sets are supposed to be representative samples of this distribution. However, once you use a non-random method to split into training and test sets, both the training and test sets become less and less representative of the population and hence you will have little idea of the real-life predictive ability of your model.

I've never seen this idea written down anywhere, so I'd be interested to know whether anyone thinks there is justification for choosing a non-random training set in certain circumstances. My own advice though is to keep it real and keep it random.

Thursday, 10 June 2010

I have a list of 36 ligands whose structures I need to get from specific PDB entries. Matteo's method for getting their structures from the PDB Ligand Expo worked fine for all but 2.

The underlying problem was the same in both cases; the authors had decided that their ligand was actually composed of several 'residues' and had given the different parts different names. The PDB had not taken this craziness into account and had split the ligand into separate ones based on the HETATM residue names.

So, a little bit of extra legwork for the first case (1pph). I searched for the three ligands in the large SDF file, copied and pasted them into a new SDF, and joined the ligands into one entry with "babel --join". Finally I opened it in Avogadro and drew the missing bonds between the 'residues'.

Fine.

The second case was worse (1ppc). Only 3 of the 4 residues made it to the PDB's ligand SDF file; the fourth residue was a 'glycine' and was not labelled as HETATM in the PDB file - as a result it was not included in the ligand SDF file.

Time to get the sledgehammer. I don't use the PDB file format in Open Babel very often, so I wasn't sure if the following would work:

babel 1PPC.pdb --separate split.sdf

The second molecule in split.sdf is the complete ligand. Nice.

Now you might be wondering why I didn't just do this in the first place for all of the ligands. Well, the PDB file format doesn't contain bond order information, and so I was hoping to take advantage of the work the PDB had put into preparing its own lignad SDF files (which I think was one of the issues addressed during the PDB remediation process a few years ago).

If 2 out of my 36 ligands had this same problem, I wonder how common it is in the PDB data? Maybe I should make a list of cases and send it to the PDB.

The Blue Obelisk Q&A website at http://blueobelisk.shapado.com is really turning into a useful resource. There are now over 100 user accounts, and the quality of the questions and answers is generally quite high. Although several of the questions relate to open source cheminformatics software, many are simply general cheminformatics questions. You can log in with OpenID or your Google account, or even ask questions anonymously (but this doesn't work so well, as you won't be updated when it's answered).

Q&A sites are much better than forums or mailing lists for handling technical questions because there is no repetition; each question only gets asked once (how times have you seen the same question on the CCL for example?). If additional information becomes available, that can be added as a new answer (unlike a forum, where adding to an old forum thread is discouraged). Also, if a question is ambiguous, the original author can correct it after some prompting (in a forum, the corrected question may appear half-way down the page).

Anyway, I recently asked the following question which is related to my research:

Given a PDB code, what is the simplest way to get an SDF file with the experimental coordinates of a particular ligand?

My current procedure is to go to the PDB Ligand Expo, and use the “Search for instances of chemical components by 3-letter ID code” at the bottom with “PDB entry codes + coordinates files” chosen. This requires me to specify the 3-letter ID code, which I need to have obtained from the web page for the PDB file.

On the search results page, I can simply click on the SDF file for the particular PDB file in which I am interested to download the experimental coordinates.

Is there an easier way? For example, it does not seem to be possible to go directly from the website for a particular PDB entry.

(I would also be interested in the same question applied to a handling a large number of PDB codes at the same time.)

First, you should download this dictionary (Tabulation of PDB entries containing each chemical component); in this way you can get the full list of ligands associated to PDB entries.Then, the best is to download the SDF file with all the ligands and use the dictionary for the mapping PDB identifier -> list of associated ligands.

...this gives unique combinations of PDB entry – ligand – PDB entry chain… let me show two examples: the first is the ligand ZU3, for which we have the coordinates in all the PDB where this ligand is present:

2zu3_ZU3_1_A_501B 2zu4_ZU3_1_A_501B

the second example, the ligand ZZC, that is present in the SDF file for the same PDB entry but in 2 different chains (A and B):

I have been doing some work on conformer generation and trying to think about how to measure the quality of one method over another. Since I'm a great believer in looking at the data (see also this post), I decided to calculate the pairwise RMSD of the systematically generated conformers of a molecule with 3 rotatable bonds, and then visualise these data with classical multidimensional scaling (cmdscale in R):It turned out to be quite a nice graph. Each point here represents a different conformer. Of the three torsion angles, it seems that two were incremented in steps of 30°, while for the third Δang was 120°. Each torsion angle had a different magnitude of effect on the RMSD (or at least, this is how I interpret this graph - I haven't drilled down into the figures yet); presumably the torsion angle closer to the centre of the molecule gives rise to the three clusters of values.

The points in red are the lowest energy conformers (by MMFF94). In this graph it is difficult to see what the correspondence between these conformers is, although it is clear that the overall RMSD between some of these conformers is high.

All data was generated using a locally-modified development version of OpenBabel. Here is the R code I used:

Wednesday, 12 May 2010

I used to think I was pretty good at avoiding spam. I always tick the box (or sometimes untick it - you have to be on your toes) that stops companies sending me marketing 'literature'. It all worked well until I met the ACS...

You see, I did tick that box with the ACS (I've just checked) but the spam keeps coming. And there's isn't just one list - oh no, they have got lists for authors, conference attendees, everything and anything. They used to send me a pre-approved membership form even though I was already a member! Naturally I complained, someone responded, and I continued to receive the requests. So it goes. I subsequently moved countries - seems to have worked.

So...back to the main topic. Today I received an email from marketing@acs.org about something called ACS Spam (I'm summarising some details here). As required by some internet law (in the US?), they include the following helpful unsubscribe link at the bottom of the email:

If you do not wish to receive any further correspondence from ACS via email, please send a blank e–mail to purge-allcontributors2010-7753700d73f2fd1010f8a0584006fffc5bb2fdf@listmanager1.acs.org and send the message.

Apart from the fact that the sentence doesn't make any sense, it simply doesn't work - and I've tried it before for other ACS spam. Here's the response I received:

Technical details of permanent failure:Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 550 550 ... Hostname of recipient unknown to Lyris ListManager (state 14).

State 14, eh? This means that the ACS wants to play hardball (hurling perhaps). I was about to email the ACS directly, when I noticed a feature of Gmail I hadn't seen before. If you click on "Show Details" for the original email, you will see a link for "Unsubscribe me from this sender". And believe it or not, clicking on this appeared to work. I got an email back from "Lyris ListManager" saying that I had been subscribed from 'allcontributors2010'. This appears to be a Gmail only feature (see this blog post) so good luck unsubscribing if you use a different email provider.

Friday, 7 May 2010

You may have missed one or two talks at this year's ACS. If so, break open your internet and head over to the recorded talks. If you want a retro fix, you can check out last Fall's ACS too.

About half the talks from the Visual Analysis of Chemical Data symposium are available (just search the page for that phrase), but you know what's really missing - some sort of social voting system. In its absence, feel free to leave comments below about which talks are really great.

Tuesday, 4 May 2010

I have recently become a convert to the idea of putting as many as possible of my talks, posters and lectures on the web (see current progress here).

On March 25th I gave a talk on Cinfony to about 30 people at the ACS National Meeting. As soon as I got back, I put the talk up on Slideshare and inserted it in my blog. There have been 451 views in the month and a half since.

Because of this, now I'm going back and digging up all of my talks and posters and putting them up both on Slideshare and on my website. Naturally, I'm going to include the original file (typically a Powerpoint file) as one day Slideshare will be no more.

The funny thing is that although few scientists tend to provide their talks (notable exceptions are Rajarshi and Jean-Claude Bradley), I find it hard to think of any disadvantage. The whole point of giving talks and presentations is to publicise your work and surely your website is equally important in this regard. Indeed, many people might prefer to click through a presentation that explains your paper in bullet points rather than wade through your scintillating prose.

Another good question to ask is whether there's any difference between the audience that might have seen the talk at the conference and those that would have seen it online? The conference audience (especially at the ACS) would be more of the PI variety than the postgrad/postdoc. And I would guess that it would be the other way around online.

If you have any thoughts on whether this is a good/bad idea or know anyone else that provides all their talks, please leave a comment below...

Saturday, 10 April 2010

cclib is an open source Python library for parsing and analysing computational chemistry log files.

Avogadro is a cross-platform molecular visualiser with a lot of functionality geared towards setting up and analysing computational chemistry calculations. It is an open source project led by Marcus Hanwell and is built on top of OpenBabel.

Although Avogadro has increasing support for parsing comp chem log files from a variety of packages, there may be times when it chokes on a file. If this happens, it's nice to have a backup option. This is where cclib can come in handy.

If you're on Windows, have Avogadro 1.0.0, cclib 1.0, OpenBabel 2.2.3 and its Python bindings (these latter shouldn't be necessary in a future version of Avogadro), just copy the following code into a .py file and save it in C:\Program Files\Avogadro\bin\extensionScripts.

When you next start Avogadro, it will have an "Open with cclib" Option in the Scripts menu.

There should be no need to have the OpenBabel Python bindings installed separately, but there is no way to access the OpenBabel library in Avogadro (at least on Windows). This prevents me, for example, from calling ConnectTheDots (I had to use my own installation of OpenBabel) or to add Conformers (which was what I wanted to do).

There are two Script menus after installing this plugin!

How do I emit a debug message?

The Python prompt in Avogadro requires you to "print" everything to see its value. This should not be necessary.

Cutting and pasting multiple lines into the Python prompt works fine, but it looks pretty weird as the prompt (>>>) is missing.

Tuesday, 6 April 2010

At the recent ACS in San Francisco I co-organised a Visual Analysis of Chemical Data symposium with Jean-Claude Bradley and Andy Lang (see summary here). The original idea for the symposium was as a sort of follow-up to ChemToolsMeet ([1], [2]), a really excellent workshop organised by Daresbury Labs in the UK. I mentioned this idea to Rajarshi (the CINF Program Chair) and he volunteered me to organise it, but suggested tailoring the topic towards the more general area of Chemical Data Viz rather than Molecular Viz as this was being covered by the preceding ACS Meeting (i.e. last Fall). As this was my first time organising a session, I thought it best to have some help and so Jean-Claude and Andy were invited on board.

What went well:

Great talks, and good participation from the audience in questions.

We had a couple of first-time ACS speakers. I thought about standing up at the start and trying to settle their nerves by telling them I was a first-time organiser, but that might have made things worse. :-)

Having a keynote speaker was a great idea. I especially liked the idea of having a keynote speaker with broader experience outside the main topic of the symposium, the reason being that this is sometimes one of the best ways of bringing new ideas into a field. Elizabeth Dorland (Washington Uni) kicked things off to a great start with a look at Chemistry Visualization in Education.

High attendance. This is a reflection on the quality of the speakers of course.

I think the symposium was reasonably well advertised on the CCL, my blog, and CHMINF-L. Apart from the keynote, none of the speakers were invited and yet we received quite a number of talks.

What could have been done better:

The talks were 20min + 5min for questions. At the 15min mark, we indicated that 5 mins were left. Afterwards, one of the presenters told us that he understood that the signal indicated the 20min mark and so brought things to a premature end. Given that all the talks ended with plenty of time for questions, he may not have been the only person we confused. Next time, we should explain the hand signals in advance.

Bring a functioning timepiece and a pointer next time. Apparently, my phone charger doesn't work in the US and my watch is broken. A pointer was provided by the ACS on Day Two, but we lacked one on Day One (I should have asked the meeting technicians).

We also had some problems following up on sponsorship which I won't discuss further here.

What was unexpected:

Exactly how the recording worked. Many talks were recorded, in which case the desktop PC provided needed to be used to enable the powerpoint timing to be captured.

The inability to log into the PACS system after the symposium was finalised. This made it difficult to get the speakers email addresses and abstracts. Fortunately, I had printed out (as PDF) some of the details earlier.

The fact that apart from one or two exceptions, almost everyone submits their abstracts in the last week before the deadline. This is somewhat understandable as it's still 6 months before the conference at that stage.

The time required for the ACS to communicate to authors that their papers were accepted.

Notes:

Some of these talks were recorded and will be up on the web soon. In the meanwhile, you can find some on Slideshare, such as that by Jean-Claude and Andy.

If you have an idea for a symposium or are simply interested in helping to organise one (good for CV, etc.), I know that Rajarshi is very happy to hear from you..

Finally, I'd like to thank Rajarshi publicly for all of his help in organising this symposium. Considering that this was only one of several symposia in the CINF division he really does a huge amount behind the scenes keeping things going.

Tuesday, 30 March 2010

Cinfony is a cheminformatics toolkit that makes it easy to access OpenBabel, RDKit, CDK and internet webservices from a Python program. Here is a talk I gave last Thursday at the ACS National Meeting in San Francisco describing Cinfony 1.0.

Work I've done as part of a collaboration with Geoff Hutchison will be presented by Geoff on Monday morning. This involves the computational design of organic polymers for use in solar cells. We're pretty excited about this.

And of course I'll be attending the session I'm organising along with Jean-Claude Bradley and Andy Lang, so I might see you at the Visual Analysis of Chemical Data Symposium on Monday morning or all-day Tuesday.

Tuesday, 9 March 2010

Due to a clerical oversight (originally pointed out by Fredrik Wallner over at Cinfony), the current release of Pybel doesn't support the MACCS fingerprint (although OpenBabel does). It's not pretty but we can hack it in as follows: