Sunday, October 07, 2012

Before you continue reading this post, please be aware that the CDK 1.5.x series is not a new stable release, but the current unstable development release, where all the API changes happen. For stable releases, only look at 1.4.x, such as the just released CDK 1.4.14.

It took me some effort to remove all patches in the cdk-1.4.x branches, but I think the below list shows all changes since 1.5.0. And since that first alpha version, the changes in the releases 1.4.8 through 1.4.14 have been included. Therefore, you may also want to read the changelogs for 1.4.8, 1.4.9, 1.4.10, 1.4.11, 1.4.12, and 1.4.14.

Significant changes in this release include that IO settings now use an enum, along with a matching new implementation for IO readers and writers to handle settings (done by John). The getHillString() API has been improved to make it more consistent with the matching getString(). The iterating SD file reader has been renamed from IteratingMDLReader to IteratingSDFReader, and the Elements static fields for the elements in the periodic table now use a final class called NaturalElement, independent of any data interfaces implementation. Daniel worked a lot on the 3D structure builder, improving the code significantly, and Jonathan revamped the fingerprint stack and introduced two new interfaces, IBitFingerprint and ICountFingerprint, making the framework more uniform. On top of that, there is a new ShortestPathFingerprinter and new IO classes for the Mopac 7 input and output formats.

All in all, quite a lot, but that was to be expected after 8 months. Mind you, like 1.5.0 this release too shows an increased number of failing unit tests on Nightly. Nothing severe, so if you are in a development branch, with the first betas a few months from now, it may be tempting to migrate.

The changes

s/Molecule/AtomContainer/ to fix a compile issue with the port of the SDG patch to master 4d8be8a

Implemented new flag storage on IChemObject implementations. Flags are now stored on a single numeric value (currently a short) and flags are accessed/mutated via bit shifting of this value. This implementation provides space saving over using a boolean array; however, getFlags() and setFlags() now have an overhead due to conversion from the array to the numeric value. Usage, however, indicates the singular setFlag/getFlag is used >1000 times whereas setFlags/getFlags is used ~50. 1a1b03d
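To make the trade-off in that commit message concrete, here is a minimal sketch of the idea, not the actual CDK code. The class name FlagStore, the three flag masks and FLAG_COUNT are all made up for illustration; only the technique (flags packed into a short, single-flag access via bit masks, array access via conversion) comes from the commit message.

```java
// Hypothetical illustration of bit-packed flags; not taken from the CDK source.
final class FlagStore {
    // Made-up flag masks, each occupying its own bit.
    static final int IS_AROMATIC = 1 << 0;
    static final int IS_IN_RING  = 1 << 1;
    static final int IS_VISITED  = 1 << 2;
    static final int FLAG_COUNT  = 3;

    // All flags live in one 16-bit value instead of a boolean[].
    private short flags;

    // Cheap single-flag mutation: one bitwise operation.
    void setFlag(int mask, boolean value) {
        if (value) {
            flags |= mask;   // compound assignment narrows back to short
        } else {
            flags &= ~mask;
        }
    }

    // Cheap single-flag access: one bitwise test.
    boolean getFlag(int mask) {
        return (flags & mask) != 0;
    }

    // Array access now pays for a conversion from the packed value.
    boolean[] getFlags() {
        boolean[] result = new boolean[FLAG_COUNT];
        for (int i = 0; i < FLAG_COUNT; i++) {
            result[i] = (flags & (1 << i)) != 0;
        }
        return result;
    }

    // Array mutation pays for the reverse conversion.
    void setFlags(boolean[] values) {
        flags = 0;
        for (int i = 0; i < values.length && i < FLAG_COUNT; i++) {
            if (values[i]) {
                flags |= (1 << i);
            }
        }
    }
}
```

The loops in getFlags() and setFlags() are where the overhead mentioned above comes from, while setFlag() and getFlag() stay a single bit operation.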

Updated unit test for commit #3093241, where nulls are always larger than an actual object ab8aa48
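For readers wondering what "nulls are always larger" means for a comparator, the convention can be sketched with plain JDK classes. This is not the CDK comparator; the class name NullsLargest and the use of Integer values are assumptions made for the example.

```java
import java.util.Comparator;

// Hypothetical illustration: a comparator that ranks null above any real value.
final class NullsLargest {
    // Comparator.nullsLast treats null as greater than every non-null value.
    static final Comparator<Integer> BY_SIZE =
            Comparator.nullsLast(Comparator.<Integer>naturalOrder());
}
```

With this convention, sorting a list in ascending order pushes the nulls to the end.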

Removed two static fields that are already provided by RingSizeComparator 5e7772a

Very basic tests for the setWriter() methods (it cannot test if something is really written, as we do not know what objects are supported by a random reader; therefore, we just expect that no exception is thrown) 3cc4232

And here are the changes in CDK 1.4.14. Compared to 1.4.12/1.4.13 I think this release is much more interesting. For example, as of this release, we report details on the IO options for readers and writers automatically in the JavaDoc (see this post), it has improvements to the CML stack, and tetrahedral stereochemistry encoded with the ITetrahedralChirality interface is now reflected in generated InChIs.

Again, the number of changes is not that large, reflecting that we are really moving towards development in the master branch for the 1.5.x releases. The first alpha version was already released a while ago, and I will try to make a 1.5.1 release soon.

The changes

Added unit tests for two CML bugs; both use the same molecule to test.
- 3553328: Atoms missing explicit atomic number default to 1.
- 3557907: Only support for bond stereo with attribute dictRef
d953285

Implementing fix for bugs 3557907 and 3553328.
3557907: Previously only the dictRef attribute of bondStereo was supported. This patch adds support for the 'content/text' of the bondStereo element to be set. This patch allows the bondStereo to be added from the charContent when the end of the element is detected.
3553328: Added support for CML files missing atomic number information. As the starting atom is a hydrogen in the parser, if no atomic number is provided the atomic number will default to '1'. This fix checks if the atom 'hasAtomicNumber' before the atom data is stored; if there is no atomic number specified but the symbol has been, the atomic number is looked up in the periodic table (as per Atom constructor). 89ce74a

Added null check before input close. If the reader was created with a URL the input is never created and invoking '.close()' will throw a null pointer exception 345c0be
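The pattern behind this fix is small enough to sketch. The class below is a hypothetical stand-in, not the CDK reader that was patched: GuardedReader and its field name are assumptions, and a null constructor argument mimics the case where the input was never created.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical stand-in for a reader whose input may never have been created.
final class GuardedReader {
    // Stays null when the reader was set up without an input stream.
    private final Reader input;

    GuardedReader(Reader input) {
        this.input = input;
    }

    // Without the null check, close() would throw a NullPointerException
    // whenever input was never created.
    void close() throws IOException {
        if (input != null) {
            input.close();
        }
    }
}
```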

Implemented test for conversion of SMILES with a topological chiral centre. bd922f6

- Added properties for JVM arguments; this allows us to switch on/off debugging and stdout via ant. This is useful: as can be seen from the run target, debug had been commented out. The properties allow us to explicitly turn off debugging (on by default)
- Used properties for junit-test, run-test and run targets
- Added jarTestData as a required target before junit-test can be run
913c796

I was just about to write up the changes of CDK 1.4.14 I uploaded to SourceForge last week, when I noticed that I forgot to blog the changes for CDK 1.4.13 and 1.4.12. Well, fortunately, those two releases are identical, caused by me fighting the SourceForge file system. So, first I will post the changes of that release.

This release contains a few patches to improve the packaging on Debian, now also reports the version in the JavaDoc window title along with a few other JavaDoc fixes, adds the Co.plus atom type, has bug fixes for the MolecularFormulaManipulator, DebugChemObjectBuilder, and the NoNotificationChemObjectBuilder, and adds roundtripping of aromaticity in the CML format. All in all, unless you were affected by one of the fixed bugs, this release is not overly interesting.

Refactored to have jar file names as properties and thus customizable, and split out development targets into devel.xml, reducing the dependencies for compiling the CDK (via the taskdef, it depended on JavaNCSS too) 18d4cb2

Fixed pointing to the development libraries, using the same customizable approach as in the rest of the build.xml a22f151

Added Nina's modifications to ensure that getting a molecular formula includes implicit hydrogens. Addressed bug 2983334. Also updated a unit test to take into account that H's are being considered 37ca43c

Updated a method based on Nina's suggestion to avoid modifying an atom container when looping over it. This avoids a concurrent modification exception. Also updated some Javadocs eba09ca
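The general problem behind that patch can be sketched with plain Java collections. This is not the patched CDK method: the class SafeRemoval, the method name and the use of element symbols in a list are assumptions for illustration. Removing from a list inside a for-each loop over that same list typically throws a ConcurrentModificationException; looping over a snapshot avoids it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: mutate a collection while looping over a copy of it.
final class SafeRemoval {
    // Removes every "H" entry from the given list of element symbols.
    static void removeHydrogens(List<String> symbols) {
        // Iterate over a snapshot so the removals below do not
        // invalidate the iterator driving the loop.
        for (String symbol : new ArrayList<>(symbols)) {
            if ("H".equals(symbol)) {
                symbols.remove(symbol); // removes one matching entry per snapshot hit
            }
        }
    }
}
```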

Change the getHTML method to return the elements in the Hill System order (bug #3432131). 3f3fab5
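For those not familiar with the Hill system: if carbon is present, carbon comes first, then hydrogen, then all other elements alphabetically; without carbon, all elements (hydrogen included) are simply sorted alphabetically. A minimal sketch of that ordering follows. It produces a plain string rather than HTML and is not the CDK implementation; the class HillOrder and the map-based input are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of Hill system ordering for a molecular formula.
final class HillOrder {
    // Takes element symbol -> atom count and returns the formula in Hill order.
    static String toHillString(Map<String, Integer> counts) {
        // TreeMap sorts the element symbols alphabetically.
        TreeMap<String, Integer> sorted = new TreeMap<>(counts);
        StringBuilder formula = new StringBuilder();
        if (sorted.containsKey("C")) {
            // With carbon present: C first, then H, then the rest.
            append(formula, "C", sorted.remove("C"));
            Integer hydrogens = sorted.remove("H");
            if (hydrogens != null) {
                append(formula, "H", hydrogens);
            }
        }
        for (Map.Entry<String, Integer> entry : sorted.entrySet()) {
            append(formula, entry.getKey(), entry.getValue());
        }
        return formula.toString();
    }

    // Appends the symbol, followed by the count only when it exceeds one.
    private static void append(StringBuilder formula, String symbol, int count) {
        formula.append(symbol);
        if (count > 1) {
            formula.append(count);
        }
    }
}
```

Note that sulfuric acid comes out as H2O4S under this scheme, because without carbon the hydrogen gets no special treatment.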

Two weeks ago an experienced researcher asked me this question. I was speechless. I am not sure the other understood that I was more wondering where to start, with so many things I want to get done, or whether I had literally no idea how to spend that other than propagating my current post-doc position.

Thoughts that blasted through my mind in random order: make the CDK 1000x faster, finish the orbital development kit, make a chemically decent Connectivity Map, run weekly, untargeted metabolomics on my body fluids for one full year (and three months) - that is about where my opponent began wondering if I had any ideas, I think -, do the same for the plants in my garden, do nanoQSAR (and in fact, I am writing this up for 4 million as we speak), develop a linked data, CCZero PubChem/ChemSpider alternative, make a database with the original literature on the first 1M organic compounds ever discovered (starting with urea)... seriously, I have many more ideas; don't get me started on doing boring things like studying a single disease; we have M.Sc. students for that.

Saturday, October 06, 2012

As part of the Open PHACTS hackathon in Manchester last week, we worked on further integration with the identifiers.org project at the EBI. There was a lot of talk with Nick Juty and Camille Laibe on integration with BridgeDB and the OPS Identity Mapping Service (IMS), but I also asked Nick to get LinkedChemistry.info listed as provider of information from ChEMBL using the ChEMBL-RDF data:

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
