Sunday, July 29, 2012

The CDK customizes quite a few things in the build process. One aspect of that is custom JavaDoc tags, such as @cdk.githash (source of the Taglet). This tag replaced a similar tag for Subversion (@cdk.svnrev) and links to the matching source code for that class, something we have not found another way to achieve. This linking functionality was broken for a while, but is now fixed again:

The last line now also shows the branch name, instead of always master, thanks to GitHub's link-friendly URIs for Git repository content.
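To make the idea concrete, here is a minimal sketch of how such a branch-aware link can be composed. This is not the actual CDK Taglet: the class name GitHubSourceLink, the repository owner, and the file path are all hypothetical, and the only thing relied upon is GitHub's owner/repo/blob/branch/path URL layout.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not the real CDK Taglet code. */
class GitHubSourceLink {

    /**
     * Builds a browsable GitHub URL for one file on one branch,
     * following the https://github.com/OWNER/REPO/blob/BRANCH/PATH layout.
     */
    static URI forFile(String owner, String repo, String branch, String path) {
        return URI.create("https://github.com/" + owner + "/" + repo
                + "/blob/" + branch + "/" + path);
    }

    public static void main(String[] args) {
        // Owner, branch, and path are made-up example values.
        System.out.println(
            forFile("example-owner", "cdk", "cdk-1.4.x", "src/main/Example.java"));
    }
}
```

Because the branch is just one path segment, the same helper works for master and for any release branch.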

Sunday, July 22, 2012

Computer-aided code evaluation (CACE) is an important part of scientific code development projects. There are many ways to do peer review of source code (Maven, Gerrit, ...), and I won't go into details here. Instead, I focus on CDK's Nightly build system.

Nightly reports
Making sure the source code compiles is one of the most basic requirements. Given that it compiles, we get a full report with a lot of information:

The lead of the report contains useful links to a precompiled binary Java ARchive (jar file), the latest git commit, the source code, and the JavaDoc. Also very useful is the keyword list, which acts as an index to CDK functionality, using @cdk.keyword tags in the class JavaDoc.

Unit testing

Below the horizontal bar are the code evaluation reports. First are the results for the unit tests (for which JUnit is used):

In the middle we get the JUnit test results for each module separately, and behind the 'Stable' link there is a summary, giving a quick glance at all modules:

Full reports are again available for individual modules, but we also get statistics per module on the number of unit tests run, the number of failures and errors, and the number of methods not tested.

JavaDoc quality

For JavaDoc we also run evaluations. For this, we use OpenJavaDocCheck, for which, as I learned later, alternative solutions are available too. The front page section of Nightly looks like:

The summary is quite like that of the unit testing, and a single report for a module looks like this:

Many of these tests are general for JavaDoc, but we also have CDK-specific tests, such as shown below (along with the summary down the bottom of the page):

There are a lot of small code fixes here for those who would like to contribute to the CDK project and learn git skills along the way.

Code evaluation

We use PMD for general code evaluation, which is most useful in computer-aided code evaluation. It often highlights the more interesting bits of code and, importantly, those bits where errors may occur. Another set of tests covers code readability, which is very important too, as it allows your peers to review your code more efficiently. For PMD, the Nightly front page looks very much the same as for the other parts:

For example, we get warnings like these:

Various things get reported here. For example, short variable names, like 'st'. Really short variable names often make the code harder to read, because they are less informative. Is 'ac' referring to the old or the new atom container?

We also get a warning about incorrect use of the StringBuffer.append() method, indicating where we can improve the code (making it faster in this case). We also see a CDK-specific test here (sources are here), warning us about a bad practice: interfaces should take data model interfaces, rather than implementations.
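As an illustration of the first two warnings, here is a small sketch. It assumes the append() warning is of the kind raised by PMD's InefficientStringBuffering rule (string concatenation inside append()); the post does not say which rule fired, and the class and method names below are made up for this example.

```java
/** Hypothetical example class; the names are for illustration only. */
class AppendStyle {

    // Style that PMD tends to flag: 'sb' is a very short variable name, and the
    // argument is concatenated into a temporary String before being appended.
    static String concatenatedAppend(String atomSymbol, int atomCount) {
        StringBuffer sb = new StringBuffer();
        sb.append("Element " + atomSymbol + " occurs " + atomCount + " times");
        return sb.toString();
    }

    // Preferred style: a descriptive variable name and chained append() calls,
    // so no intermediate String has to be built.
    static String chainedAppend(String atomSymbol, int atomCount) {
        StringBuffer description = new StringBuffer();
        description.append("Element ").append(atomSymbol)
                   .append(" occurs ").append(atomCount).append(" times");
        return description.toString();
    }
}
```

Both methods return the same text; only the way the buffer is filled differs.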

Conclusion

As will be clear, the Nightly reports provide a wealth of information to help code review. I hope this post has popularized this useful resource a bit more, and I invite you to visit it frequently. For example, it is a useful tool to validate your own code before you send it for review. For the latter it is useful to know that you do not have to install a full Nightly to do this. Mind you, for larger patch-writing efforts, we can set up a Nightly crontab on a specific branch, as we have done frequently before.

But you can also run these code evaluations from the command line with:

$ ant clean dist-all test-dist-all jarTestdata

$ ant -Dmodule=io qa-module

This will run the JavaDoc, JUnit, and PMD tests, and store the results in the reports/ subfolder.

Sunday, July 15, 2012

I have uploaded a new revision of my Groovy Cheminformatics book, based on CDK 1.4.11 and CDK-JChemPaint 26. Slowly I am becoming confident in uploading PDFs to Lulu.com, and perhaps the frequency of updates will increase. I would love to have revisions of the book at least follow the stable releases, but the previous book version was already based on 1.4.7.

But there is interesting new content, and I am happy this version is out, so that I can now prepare a larger update to be released in September or so. This version has 176 pages, I think an increase of twelve, but the next release will pass 200 pages, making it reach my minimal page count for releasing on Amazon.com.

Friday, July 06, 2012

This is something I have been asked about many times. I had to find out myself, as I had no experience with this corner of the CDK rendering stack. In fact, I think there will be a second, follow-up post on that later, where I will explain I did it all wrong :)

Anyway, here is example code for how to mark a substructure. It is a variation of the triazole examples I have given earlier. The first thing is to add the proper generator:

The full script can be downloaded here. A downside of this script is that the background of the symbol is not in the same color as the selection highlight. Also, I do not think you can color multiple selections at the same time. But I guess it is the start of an answer.

Wednesday, July 04, 2012

OK, the advantage of Linked Data is that it is Linked Data. So, when a link is made to, for example, side effects, as reported below by the Free University of Berlin (using SIDER, doi:10.1038/msb.2009.98), we do not just get a link to a new resource, but we can actually look up the label for that resource, and show that in the Isbjørn results instead of the URL:

Of course, we also do have the link, so notice the link icons behind the side effect names.
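The label lookup can be sketched roughly as follows. This is not Isbjørn's actual code: the class ResourceLabel is hypothetical, the properties of a resource are assumed to be available as a simple map from predicate URI to literal value, and the priority order among the common labelling predicates (rdfs:label, dc:title, skos:prefLabel, skos:altLabel) is my own assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not taken from Isbjørn or Bioclipse. */
class ResourceLabel {

    // Labelling predicates, tried in this (assumed) order of preference.
    static final List<String> LABEL_PREDICATES = Arrays.asList(
        "http://www.w3.org/2000/01/rdf-schema#label",      // rdfs:label
        "http://purl.org/dc/elements/1.1/title",           // dc:title
        "http://www.w3.org/2004/02/skos/core#prefLabel",   // skos:prefLabel
        "http://www.w3.org/2004/02/skos/core#altLabel"     // skos:altLabel
    );

    /**
     * Returns the first non-blank label found for the resource,
     * or the resource URI itself when no label is known.
     */
    static String displayLabel(String resourceUri, Map<String, String> properties) {
        for (String predicate : LABEL_PREDICATES) {
            String label = properties.get(predicate);
            if (label != null && !label.trim().isEmpty()) {
                return label;
            }
        }
        return resourceUri;
    }
}
```

Falling back to the URI keeps the output usable for resources that carry none of these predicates.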

And because it's all using common standards (rdfs:label, dc:title, skos:prefLabel, skos:altLabel) it works for any database, thus my DBpedia support got upgraded too:

Still on a tight schedule, and you must be getting tired of my updates, but I'm still beefing up Isbjørn a bit more. First, I added DBpedia and Freebase support, which means it knows about the ontologies they use. But I also played with inline images and set the encoding, so that the page not only looks nice in Bioclipse, but you can also email it and it will still look nice in Chrome and Firefox:

For Freebase it looks similar. Note that I had to cut out spidering from that resource, as it links each translated page back to DBpedia, and DBpedia is not always very fast in responding. After I added an additional read timeout to the RDFManager, it seems to be working.
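For readers unfamiliar with read timeouts, here is a minimal sketch using only the standard java.net API. It does not show how the RDFManager does it; the class TimedFetch and the timeout values are hypothetical, and the DBpedia URI is just an example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper for illustration; not the Bioclipse RDFManager. */
class TimedFetch {

    /**
     * Prepares (but does not yet open) a connection that gives up when
     * connecting or waiting for data takes longer than the given limits.
     */
    static URLConnection openWithTimeouts(String url, int connectMillis, int readMillis) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(connectMillis); // limit on establishing the connection
            connection.setReadTimeout(readMillis);       // limit on waiting for the next data
            return connection;
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
    }

    public static void main(String[] args) {
        // Example values only: 2 s to connect, 5 s to wait for data.
        URLConnection connection =
            openWithTimeouts("http://dbpedia.org/resource/Aspirin", 2000, 5000);
        System.out.println(connection.getReadTimeout());
    }
}
```

With a read timeout set, a slow server leads to a SocketTimeoutException when reading, which the caller can catch to skip that resource instead of blocking.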

Wrapping up the first release of Isbjørn I am adding further data extraction from the databases, such as for Bio2RDF (doi:10.1016/j.jbi.2008.03.004). Bio2RDF does not (yet) use standard ontologies, so I added support for their ontology:

Monday, July 02, 2012

During the summer holidays I plan to extend my Groovy Cheminformatics book to reach 200 pages, but before that I plan to upload an updated PDF for CDK 1.4.11. This upcoming 6th edition will have a few new things, including the above section.

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
