
Wednesday, December 28, 2005

Derek Lowe is the author of the blog In the Pipeline, which is really fun to read. Derek works in the pharmaceutical industry and gives great insight into how things work in that field of molecular sciences. Yesterday he blogged about What Makes an Ugly Molecule?, touching on the Rule-of-Five, the hydrochloric acid bath (aka stomach), and other reasons that make molecules ugly.

But there are many other interesting posts and, something that my blog still lacks, comments by many users discussing the ideas he posts, which makes his blog even nicer.

Tuesday, December 27, 2005

After the three obligatory days of Christmas holidays (fun, especially with two children, but very exhausting), it is time to get back to business again. I'm still at my father-in-law's place with only XP installed, so I booted the Knoppix 4.0.2 DVD I burned last Friday. Eclipse is not working, but being able to use KMail to read my email again is just what you need as an internet junkie. A computer is just not complete without a nice KDE session hanging around.

Anyway, I started Eclipse on my computer at work and tunneled the window over SSH. Not overly fast, but it seems to run fine. (If only I knew how to set up NX on that Kubuntu Breezy system!) Let's see if I can get the CDK bug count somewhat lower.

Friday, December 23, 2005

In a recent JCIM article, Schuffenhauer compares a few subset selection methods, and notes that some of them reduce the average complexity of the molecules. They put this in relation to other research that states that lead compounds with high complexity have higher activities. Recommended reading material for the holidays.

I knew I did a lot of work on the CDK, but never realized that 62.7% of the commits were mine! Keep in mind, though, that a lot of these commits are for code maintenance! Next in line are steinbeck and rajarshi. In total 28 people committed patches to CVS, though other people contributed patches too, which were committed by a developer with write access. There is a jump in the commit messages somewhere this summer, which I think is the move of the data directory from cdk/data to cdk/src/data.

The full analysis results can be found here. It was generated with the StatCVS version in sid, and I will rerun it soon with a more recent StatCVS version.

Friday, December 16, 2005

For some weeks now I have been thinking about bug 1309731 : "ModelBuilder3D overwrites Atom IDs". The ModelBuilder3D is a complex piece of source code, reusing many other parts of the CDK, including atom type perception.

Somewhere in October, however, I found that Taverna could not create 3D models and convert these into reasonable CML because the atom IDs were messed up. So the question is, where did the ModelBuilder3D do this? Did it do this itself, or is it done by one of the other pieces of CDK that it uses? But due to the complex nature of this algorithm, it quickly became clear that looking at the code was not going to solve it; there was too much code to look at.

The solution was clear to me: use the new data interfaces. To identify where the IDs were messed up, I only needed to write a DebugAtom class with a method that looked like:
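A minimal sketch of such a class, not the actual CDK code: the `Atom` class below is a hypothetical stand-in that models only the ID handling, and the log format is an assumption. The point is that the debug subclass reports every `setID()` call, including who made it, before delegating to the normal implementation.

```java
// Hypothetical stand-in for the CDK Atom data class; only ID handling is modeled.
class Atom {
    private String id;
    private final String symbol;

    Atom(String symbol) { this.symbol = symbol; }

    public String getSymbol() { return symbol; }
    public String getID() { return id; }
    public void setID(String id) { this.id = id; }
}

// Debug variant: behaves like Atom, but logs each setID() call and its caller.
class DebugAtom extends Atom {
    DebugAtom(String symbol) { super(symbol); }

    @Override
    public void setID(String id) {
        // Element 0 of this stack trace is setID() itself, element 1 is the caller.
        StackTraceElement[] stack = new Throwable().getStackTrace();
        String caller = stack.length > 1 ? stack[1].toString() : "unknown";
        System.out.println("DEBUG: setID(" + id + ") called from " + caller);
        super.setID(id);
    }
}

class DebugAtomDemo {
    public static void main(String[] args) {
        Atom atom = new DebugAtom("C");
        atom.setID("carbon1");   // logged
        atom.setID("a1");        // logged again, revealing who overwrote the ID
    }
}
```

Because the debug class is a drop-in subclass, code that accepts an `Atom` does not need to change to be traced.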

So I started this week to implement the DebugAtom and related classes. By extending Atom, I could just add debugging stuff and reuse the code in that class. However, the DebugAtom cannot then extend DebugAtomType as well. And this is a pity, because all methods that the Atom interface inherits from the AtomType, Isotope, Element and ChemObject interfaces cannot be inherited from the DebugAtomType class. Instead, those bits of code now have to be duplicated.

This is not a clean solution, as duplicate code is a known cause of bugs. So, the next step was to write JUnit tests for the new debug classes. And for this I wanted to reuse, i.e. extend, the tests for the default data classes. This required, however, changes to those test classes.

The first thing that needed to be changed was that instantiation of data classes in the tests would now have to depend on the data classes being tested. A simple

Atom atom = new Atom("C");

only makes sense when a specific Atom class is important. Fortunately, the new interfaces provide a solution for this: the ChemObjectBuilder implementations. These allow the following syntax to replace the hard-coded instantiation:

Atom atom = builder.newAtom("C");

Therefore, I added a protected field to the AtomTest, which was instantiated in the setUp():
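A sketch of how that could look, under stated assumptions: the interfaces and builder classes below are simplified, hypothetical stand-ins for the CDK ones (only `builder.newAtom("C")` is taken from the text above), and JUnit is left out so the example is self-contained. In the real tests, `setUp()` would be the JUnit fixture method.

```java
// Hypothetical, simplified stand-ins for the CDK interfaces.
interface Atom { String getSymbol(); }
interface ChemObjectBuilder { Atom newAtom(String symbol); }

class DefaultAtom implements Atom {
    private final String symbol;
    DefaultAtom(String symbol) { this.symbol = symbol; }
    public String getSymbol() { return symbol; }
}

class DebugAtom extends DefaultAtom {
    DebugAtom(String symbol) { super(symbol); }
}

class DefaultChemObjectBuilder implements ChemObjectBuilder {
    public Atom newAtom(String symbol) { return new DefaultAtom(symbol); }
}

class DebugChemObjectBuilder implements ChemObjectBuilder {
    public Atom newAtom(String symbol) { return new DebugAtom(symbol); }
}

// Test for the default data classes: instantiation goes through the builder field.
class AtomTest {
    protected ChemObjectBuilder builder;

    public void setUp() { builder = new DefaultChemObjectBuilder(); }

    public void testGetSymbol() {
        Atom atom = builder.newAtom("C");
        if (!"C".equals(atom.getSymbol())) {
            throw new AssertionError("expected symbol C");
        }
    }
}

// Test for the debug classes: inherits every test method, swaps only the builder.
class DebugAtomTest extends AtomTest {
    @Override
    public void setUp() { builder = new DebugChemObjectBuilder(); }
}
```

With this layout, each test written for the default classes runs unchanged against the debug classes.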

The sources for these debug data classes tests are found in the new cdk.test.debug package.

The number of JUnit tests for the CDK jumped from around 1250 to over 1500 tests right now. And if you think these new tests only test old code, because of all the super.bla() calls in the debug classes, you're way off. I found bugs in the new debug classes, but also many class cast bugs and several other problems in the real data classes!

This shows me where the Atom ID is overwritten to be something other than "carbon1"! I can now look at the rest of the result.modeling.builder3d.ModelBuilder3dTest.txt file to see what the ModelBuilder3D was doing at the time, and which CDK class made the setID() call.

I only needed to change this line in the JUnit test for the bug to generate the above debug lines:

Tuesday, December 13, 2005

I drop in on the #classpath channel of the freenode.net IRC network, where the #cdk channel runs too. The #classpath channel is for the Classpath project, which is developing the free Java libraries used by most open source virtual machines.

A Slashdot.org item, "Java Is So 90s", was mentioned. It led to a funny discussion about what that would make C/C++ and Fortran. A more serious question was brought up: where are the efficient and super fast Java linear algebra and complex number libraries?

There is Weka, but it is more aimed at data analysis. I believe it has support for principal component analysis, so it presumably has singular value decomposition. There is a book called Java Number Cruncher: The Java Programmer's Guide to Numerical Computing by Ronald Mak, 2003, Prentice Hall.

After some further asking about it on the channel, they mentioned the Apache Commons Math project, which seems promising. The website mentions complex numbers, linear algebra, statistics and numerical analysis, but I have not looked at the full API, so I am not sure how well populated these areas are.

Saturday, December 10, 2005

I reported earlier that the CDK has been updated in CVS to use CML from the new Jumbo 5.0. The transition actually involved a lot of changes in the CDK, some of which I would like to address in the following comments. One thing is that CML write support (not reading!) uses the new Jumbo library, which requires Java 1.5. Thus, if Java 1.5 is not available, then CML writing should not be compiled. This is how it is done.

The JavaDoc

The CDK makes extensive use of JavaDoc taglets. CDK uses tags of type @cdk.SOMETAG. An important tag in this case is the @cdk.require tag, because it allows us to make the CDK build system aware that the class requires Java 5.0 to be compiled. Thus, we have for example this code in CVS, of which bits are:

As is probably clear, this class depends on two jars, of which jumbo50.jar itself is not required for compiling the class source code. It also shows the use of the @cdk.require tag.

The build.xml

Because the CDK still does not require Java 1.5, the CDK is supposed to be buildable with Java 1.4 (the oldest supported Java release). The Ant build.xml script is quite able to conditionally leave out compiling parts of the CDK, if configured correctly using proper JavaDoc tags, as explained earlier.

First, the build.xml checks what libraries are available for compiling certain parts of the CDK. For example, the build.xml code to check for Java 1.5 looks like:

Keep in mind that the *.javafiles are created with JavaDoc based on the CDK JavaDoc tags mentioned earlier.

The build.xml 2

While the above mechanism has been present for some time now, having jumbo50.jar in CVS made the situation a bit trickier: the jumbo50.jar uses the 49.0 class format used in Java 1.5, and cannot be processed by Java 1.4 systems. Since the classpath used when compiling CDK source code is defined in configuration files for those modules in src/META-INF, the problem did not occur when compiling the modules. However, it did show an error in the reallyRunDoclet target today, when I was creating the *.javafiles with JavaDoc. The solution was trivial:

There is another area of interest: the FileConvertor, which is, sort of, CDK's variant of OpenBabel's babel. The FileConvertor must be compiled in all cases, so we need to conditionally instantiate the CMLWriter, which is not really a problem. However, compiling the source code is more troublesome: the CMLWriter class must be loaded at runtime, and must not occur hardcoded in the source code.

In the past I have solved this by using .getInstance() constructs, but the ChemObjectWriter interface does not define this functionality, so I decided to use the java.lang.reflect mechanism:
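A minimal sketch of this kind of reflective instantiation, not the actual FileConvertor code: the helper name `newInstanceOrNull` is hypothetical, and it assumes the target class has a public no-argument constructor. The class is referenced only by its name as a string, so the calling code compiles even when the class is absent, and `null` is returned when it cannot be loaded at runtime.

```java
import java.lang.reflect.Constructor;

class OptionalClassLoader {

    // Returns a new instance of the named class, or null if the class is
    // missing, cannot be loaded by this JVM, or has no usable no-arg constructor.
    static Object newInstanceOrNull(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> constructor = clazz.getConstructor();
            return constructor.newInstance();
        } catch (ReflectiveOperationException | LinkageError e) {
            // LinkageError covers, for example, class files in a format that is
            // too new for the running virtual machine.
            return null;
        }
    }

    public static void main(String[] args) {
        // java.lang.StringBuilder stands in for an optional class that is present.
        Object present = newInstanceOrNull("java.lang.StringBuilder");
        // A made-up name stands in for an optional class that is absent.
        Object absent = newInstanceOrNull("no.such.OptionalWriter");
        System.out.println("present loaded: " + (present != null));
        System.out.println("absent loaded: " + (absent != null));
    }
}
```

The caller would then cast the returned object to the shared interface (here that would be ChemObjectWriter) and skip the feature when the result is `null`.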

Now, this has been, by far, the longest blog item I have written so far. I hope it gave you good insight into some techniques CDK uses to deal with situations where functionality might, or might not, be present at build and at run time.

Thursday, December 08, 2005

Tobias committed Jumbo 5.0 to CDK CVS, so that the CDK is now again up to date with the latest CML library. Note that Jumbo 5.0 requires Java 5.0.

At first all JUnit tests seemed to work, but apparently the CML2Writer tests were skipped because they were only run when Java 1.4 was found. I updated the test to check for the appropriate Java version, and then it turned out that most tests fail. So those running CDK from CVS who depend on CML writing: hang on, it will be fixed very soon.

Tuesday, December 06, 2005

The code clean up after CDK's interfaces transition is in progress, and two CDK modules are now independent of the data module. After doing the core module, the standard module was next, and I finished this yesterday. [Figure: the module dependencies as they now look in CVS]

The last advantage is really important: it allows alternative implementations of the data classes. For example, we could make debug data classes, which, unlike the normal classes, do all sorts of checks when using methods of these classes. For example, they can explicitly check that parameters are not null, of the right class, and generally make sense. This makes them, possibly, slower, but also more type safe, and as such great for debugging and development sessions.
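As an illustration of such a checking implementation, here is a small sketch. All type and method names are hypothetical, simplified stand-ins rather than the CDK API; the idea is only that the debug variant validates its parameters before delegating to the normal implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for data interfaces.
interface Atom { String getSymbol(); }

interface AtomContainer {
    void addAtom(Atom atom);
    int getAtomCount();
}

// Normal implementation: fast, no checks.
class DefaultAtomContainer implements AtomContainer {
    private final List<Atom> atoms = new ArrayList<>();
    public void addAtom(Atom atom) { atoms.add(atom); }
    public int getAtomCount() { return atoms.size(); }
}

// Debug implementation: rejects parameters that do not make sense.
class DebugAtomContainer extends DefaultAtomContainer {
    @Override
    public void addAtom(Atom atom) {
        if (atom == null) {
            throw new IllegalArgumentException("atom must not be null");
        }
        if (atom.getSymbol() == null || atom.getSymbol().isEmpty()) {
            throw new IllegalArgumentException("atom must have an element symbol");
        }
        super.addAtom(atom);
    }
}
```

Code written against the `AtomContainer` interface can use either implementation, so the extra checks cost nothing outside debugging sessions.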

Another important application of making the CDK library independent of the data classes (and only depending on the interfaces), is that we can have data classes shared with other Java libraries, such as JOElib, Octet, CML (Jumbo 5.0 is out!), and even proprietary libraries. This approach is already used in the CDK-Taverna library, and I anticipate much wider use with the arrival of Bioclipse.

Sunday, December 04, 2005

After requests, I made the RSS and Atom feeds for the Planet Blue Obelisk more visible yesterday. They are linked in the menu on the right, and as alternative links to the document. These should show up in most recent web browsers as a feed icon in the lower right corner of the browser window. It is often an orange icon. I also added a 'Leave a comment' link to encourage people to leave comments on items. Please do!

Saturday, December 03, 2005

Stefan has done an excellent debugging week on JChemPaint, while I have been late with a 2.1 release. Anyway, I've just uploaded a Java 1.4 compiled JChemPaint 2.1 series release. I was told the (reported) bug count is down to one, so I expect the next stable branch (2.2 series) to be released soon.

But what after JChemPaint 2.2 gets released? Will a 2.3 developers branch be opened? Or will the JChemPaint application, as we know it, cease to exist, and make way for the Bioclipse JChemPaint plugin that is being worked on?

It is worth mentioning the pros and cons of JChemPaint. One big pro is the applet version of JChemPaint, though free but closed source alternatives are available (e.g. MarvinSketch). Another advantage is the great semantics of the chemistry being drawn. For example, when drawing reactions, reactants are really marked as reactants, and are not just molecules left of an arrow. Moreover, JChemPaint is a great platform in which ideas can be tested! One of the key virtues of opensourceness. Cons include the limited number of templates, the lack of print quality graphics, and others. (Comments on JChemPaint most welcome.)

So what about this Bioclipse then? It is inherently SWT based, but currently the SWT_AWT bridge is used to embed the current JChemPaint and underlying CDK code as is. Unfortunately, this bridge is using proprietary code from Sun (sun.awt classes), which makes it impossible to use with free virtual machines.

But there is also the option of using the SWT drawing classes. This has the advantage that it can be run with free virtual machines, and that it can even be compiled to native code. It requires serious rewriting of code in the JChemPaint and CDK code base. But CDK's Renderer2D needs a rewrite anyway: it does not even use Swing's Java2D efficiently (try to figure out how it transforms atomic 2D coordinates into screen coordinates!). Some efforts have been ongoing, but a rewrite from scratch, with a better, more modular design, cannot hurt at all.


This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at Maastricht University, studying biology at an unsupervised but atomic level. Open science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and Wikipathways.
