Monday, August 22, 2011

Not so long since the CDK 1.4.1 blog post, but about a month after the actual release, there was already more than enough content for the next minor release. The 1.4.2 release adds another batch of atom types by Asad, Nimish, and Gilleain, which constitutes the majority of the patches. There is also some fixing of the JavaDoc creation, and it seems that the link to the Git repository got broken somewhere. It turns out that the SourcePosition class returned by Tag.position() no longer returns the full file name path, but only the ClassName.java in some intermediate folder, causing all package information to be lost. And more was apparently lost, because we no longer seem to have module pages. Clearly, some help with project maintenance is appreciated!

Other cool changes include a patch by Thorsten to speed up Morgan number calculation, a reference to the paper by Miguel on metabolite identification (DOI:10.1093/bioinformatics/btr409) for which he wrote a lot of code for the CDK, a bug fix, code clean-up patches by Dmitry that allow this branch to be compiled with Java 5 again, and the first patches by my son :)

The Changes

Apparently something changed in the JavaDoc API, and I only get the file name now, so I added this workaround to be at least somewhat useful: 819f2fc
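The referenced commit is not shown here, but the gist of such a workaround is to reconstruct the lost package path from the documented class itself. Below is a minimal, self-contained sketch of that idea; the helper name is hypothetical and this is not the actual commit:

```java
// Sketch of a path-reconstruction workaround (hypothetical helper, not commit 819f2fc):
// SourcePosition.file() only yields "ClassName.java" now, so the package name of the
// documented class is used to rebuild the repository-relative path.
public class SourcePathFix {

    /** Rebuilds e.g. "org/openscience/cdk/Atom.java" from package and bare file name. */
    static String sourcePath(String packageName, String fileName) {
        return packageName.replace('.', '/') + "/" + fileName;
    }

    public static void main(String[] args) {
        System.out.println(sourcePath("org.openscience.cdk", "Atom.java"));
    }
}
```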

Saturday, August 20, 2011

The CDK code base is not just a regular dump of Java source code; it is an annotated dump of Java source code. You might have heard about git blame, and if you did, this would be a good time to start reading up on git, e.g. using this great book: Git from the bottom up. However, that will not tell you about git blame, but The Git Community Book will, and the man page will give you all the details.

We take advantage of the history of a file, as it helps us understand the full picture, complementing JavaDoc, inline comments, proper variable names, etc, etc. The annotation links each code line to a commit message. And that also explains why CDK reviewers insist on good commit messages. No useless messages like 'fixed a bug', but a message that actually describes what has been fixed, and how. That's hard to do, but we are all increasingly trained twitterers, so we are trained to say much in 140 chars. Well, some are. So, I always hope to see something like "made the creation of Morgan numbers N times faster, where N is the number of atoms in the AtomContainer" (like today).

What I do not like to see is line changes that do not actually change anything, for example because they 'fix' whitespace. First, they ruin the git line annotation, by linking a random commit to a particular code line. Second, the reviewer does not know if the line has code changes or just whitespace changes, and has to check the line in detail anyway. That is a waste of precious time, when code review is already quite a bottleneck in the CDK development process.

So, no stuff like this in your next patch please (it's extracted from a larger patch):

Update: It's fine to have whitespace changes as separate patches; if one knows a patch contains only whitespace changes, it requires a different kind of reviewing. Just don't mix them with functional changes that require more in-depth reviewing.

Thorsten Flügel found a nice speed-up for the CDK as part of the work in Dortmund on Scaffold Hunter: the calculation of Morgan numbers. He has actually written a set of patches, and analyzed several bottlenecks. I expect more of that work to enter the CDK. Below is my observation of the speed up:
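For context, the classic Morgan algorithm can be sketched in a few lines. To be clear, this is a minimal illustration of the algorithm being sped up, not Thorsten's patch or the CDK implementation:

```java
import java.util.Arrays;
import java.util.HashSet;

// Classic Morgan extended-connectivity numbering: start from atom degrees,
// then repeatedly replace each atom's value by the sum of its neighbors'
// values, until no further discrimination between atoms is gained.
public class Morgan {

    /** neighbors[i] holds the indices of atoms bonded to atom i */
    static long[] morganNumbers(int[][] neighbors) {
        int n = neighbors.length;
        long[] ec = new long[n];
        for (int i = 0; i < n; i++) ec[i] = neighbors[i].length; // initial: degrees
        int distinct = countDistinct(ec);
        while (true) {
            long[] next = new long[n];
            for (int i = 0; i < n; i++)
                for (int j : neighbors[i]) next[i] += ec[j]; // sum over neighbors
            int d = countDistinct(next);
            if (d <= distinct) return ec; // stop: no more classes split
            ec = next;
            distinct = d;
        }
    }

    static int countDistinct(long[] values) {
        HashSet<Long> seen = new HashSet<>();
        for (long v : values) seen.add(v);
        return seen.size();
    }

    public static void main(String[] args) {
        // propane: C0-C1-C2; the middle atom is distinguished from the terminal ones
        int[][] propane = {{1}, {0, 2}, {1}};
        System.out.println(Arrays.toString(morganNumbers(propane)));
    }
}
```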

Saturday, August 13, 2011

Some time ago Andrew asked me how to write molecular descriptor implementations for the CDK. I have no such chapter in my book yet, and at that time wrote up a quick overview of the general steps. In the near future I will elaborate on those steps, but just to have them more easily recoverable, here are those steps as I replied on FriendFeed at the time:

1. get yourself a working CDK development environment in NetBeans or Eclipse (I have experience only with the latter)

3c. that URI you will use in your descriptor implementation's .getSpecification() method; see e.g. BCUTDescriptor.java

4. decide if your descriptor has parameters, which it does not have to

if it does not, .getParameters() should return a zero-length Object[] array

5. .getDescriptorNames() should return labels for each value you return; e.g. some descriptor algorithms return multiple values, like the BCUTDescriptor, while others return a single value, like the XLogP descriptor

oh, and I guess step: 0. decide which version you like to develop against. Recommended is the cdk-1.4.x branch at this moment, but master is good too. If you must, cdk-1.2.x is possible too, which is the current stable release.

6. decide what your descriptor will return... also discussed in step 5. A single value? A double, or an integer? A boolean perhaps? Or an array of integers? This is what the method .getDescriptorResultType() is about.
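The steps above can be sketched as a standalone skeleton. To be clear, the interface below only mimics the shape of the CDK descriptor contract so the example is self-contained; the names and the URI are illustrative, and this is not actual CDK code:

```java
import java.util.Arrays;

// Standalone sketch of the descriptor contract described in the steps above.
// The nested interface mimics the CDK's descriptor API; it is illustrative only.
public class DescriptorSketch {

    interface MolecularDescriptor {
        String getSpecification();          // step 3c: the specification URI
        Object[] getParameters();           // step 4: empty array if parameter-free
        String[] getDescriptorNames();      // step 5: one label per returned value
        Class<?> getDescriptorResultType(); // step 6: what kind of value comes back
    }

    /** A parameter-free descriptor returning a single integer value. */
    static class AtomCountLikeDescriptor implements MolecularDescriptor {
        public String getSpecification() {
            return "http://example.org/descriptors#atomCount"; // hypothetical URI
        }
        public Object[] getParameters() { return new Object[0]; } // no parameters
        public String[] getDescriptorNames() { return new String[] {"nAtoms"}; }
        public Class<?> getDescriptorResultType() { return Integer.class; }
    }

    public static void main(String[] args) {
        MolecularDescriptor d = new AtomCountLikeDescriptor();
        System.out.println(d.getParameters().length);
        System.out.println(Arrays.toString(d.getDescriptorNames()));
    }
}
```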

Friday, August 12, 2011

Despite an initially hesitant BioStar community, I got some good replies to my question about biology personas, including good material from Søren Mønsted of CLC bio. Coincidentally, a few humorous perspectives came online, which in fact nicely demonstrate what I'm after.

When building a new platform, you need to know who will be using it and how, and how those people will interact. So, for our ToxBank design we need personas to do the requirement analysis, and I have created initial draft personas now, which I hope I'll be able to share later.

So, how people interact is important, as communication is central to scholarly research. For example, this is why we blog: blogs are like conferences. And some insight into how the various personas look at each other can be helpful in describing personas and modeling a social science platform. Matus Sotak (aka @biomatushiq) created this funny but right-on overview:

So, there it is: five personas, each of whom characterizes the others. These views reflect how others think about that persona, which is what a persona is all about: a virtual character we recognize and can characterize in terms as done in this plot. If we hook this up to requirements, we could observe that the less knowledgeable need better access to important literature. Just to name something off the top of my head.

The second is an XKCD comic. This one is more important to the message of this post: what happens if you neglect personas? The comic shows that ignoring personas is daily business, but is that bad?

It shows two personas: an average user who appreciates cool GUIs and apps on cool topics, and a regular dude who lives in an area where tornadoes actually occur. The take-home message here is that the mere ratio of persona abundance is not generally a proper guide for design.

Now, try to map these two comics to anything you see around you. For example, do the five personas match your research group? How does the head of your group handle this? Is he accepting the status quo, or is he trying to overcome these stereotypes? How do these personas get reflected in author lists? How does that map onto how you think about your EU project partners? Is it useful?

Repeating this experiment for the second comic is more useful. For example, map this comic to your citation list, and then reevaluate the impact of your research. This is exactly why CiTO is crucial. For our ToxBank project this last observation has major implications too.

Thursday, August 11, 2011

It seems I had forgotten to blog about the stable update for CDK 1.4, but here it is for 1.4.1 (download). I also finally got fed up with searching my blog each time for those git scripts for commit statistics to write up this post, so I have now posted them in this gist. I also took the opportunity to now point to GitHub commit pages, rather than those on SourceForge.

As a decent stable update release, nothing much happened. A good part is atom types: I added a few myself, and the first bits of the patch by Nimish and Gilleain in Asad's team made it in. Nimish has done a great job over the summer finding the details (some details are still missing, such as hybridization info for many metallic elements) for atom types found in KEGG. Besides some minor code clean up, only one other thing happened: the addition of InChINumbersTool, which you can already read about in my book (see figure), but I'll be blogging about that pretty soon too.

Wednesday, August 10, 2011

For a toxicology paper we are writing up, I need to create a few plots showing how the toxic and non-toxic molecules differ (or not) with respect to a few molecular properties, such as logP or the molecular weight. The rcdk package provides it all, of course, except for a nice convenience method (or does it?) to make such a plot. That is, I just want to do something like:

Tuesday, August 09, 2011

Blog planets are websites that aggregate blog feeds around a particular topic or project. The name probably comes from one of the first implementations, the Planet software. These planets are like conferences, rather than journals. Like conferences with a continuously ongoing, year-round poster session. And like any good scientist, you blog (read: present posters) and you join blog planets (read: present your poster at conferences). The reality is that many of our peers are afraid of presenting posters at conferences (read: they are afraid of blogging).

This week my blog got accepted (read: I submitted an abstract which was reviewed and accepted) to R-bloggers.com. I do not present all my posters at this venue, and use labels to identify which posters go to this meeting. For this planet, those are labeled R. And unlike other virtual worlds, these virtual conference venues (read: web sites) are easy to reschedule. With a simple click I switch from today's floor (read: a web page) to a room dedicated to me (read: another web page).

There are many other of these conferences I attend, including Planet CDK, Planet Bioclipse, Chemical blogspace (quite a general topic, chemistry, but with sessions on many topics, like cheminformatics), Planet Eclipse, Planet RDF, Nature.com Blogs (very general too, but also with dedicated floors, like chemistry), and a few more I cannot think of right now. In science such planets do not really exist in this form. The closest things are blog service providers, like Science 3.0. I guess these are like conferences for general sciences, where you're kind of lost as to which corner you really belong in, and you cling to a few bloggers (read: colleagues) whose work (read: posters) you know you'll probably like.

Sunday, August 07, 2011

Usability. I am not an expert in Human-Computer Interaction (HCI) at all. Worse, I make the crappiest looking interfaces, typically. So, that's said. Usability. Wikipedia writes that "[U]sability is the ease of use and learnability of a human-made object."

A cheminformatician, despite doing cool science, is per popular demand by peer scientists also an HCI expert, to at least some extent. Scientists want usability. It is merely an extension of any scientist being a Human-Paper Interaction (HPI) expert to some extent (you know, getting the bibliography properly typeset).

Now, what is usability? What is it that someone means when they say your system has a 'usability issue'? That is what makes any cheminformatician some sort of HCI expert. I have had usability discussions many more times than I personally care for. Too often these discussions are held without defining who the users actually are. Are they chemists/biologists for whom Excel is the supreme data analysis tool, or statisticians who work with Matlab or R, or are they hackers (like Pierre or Neil perhaps) who just want to get their work done?

Taverna and KNIME primarily target a user who thinks visually and who likes to see what happens with their data. Jmol users do not even want to see what happens to their data (file reading, etc), and only care about seeing it in nice colors. The Chemistry Development Kit, on the other hand, is targeted at hackers who know, and want to know, in detail what they are doing and what is going on.

Importantly, the last paragraph talks about the most visible part of usability: ease of use. In particular, ease of use for humans. However, readers of my blog know there is more than humans: there is software too, and software too is a user of a system. Here the ease of use is defined by the Application Programming Interface, or API.

So, any system is oriented at multiple user types. And each user type will have their own set of requirements. So, in a requirement analysis process, you identify the user types and associate requirements with them. Now, my software engineering book is hidden in some box, and I can therefore not cite good-practice standards right now, but the bottom line is that talking about usability without a set of project-defined user types is difficult, and may in fact result in heated discussion, where people probably want the same thing but just are not aligned, resulting in confusion of priorities. (This sounds wise, but I get fooled at each meeting again myself.)

Targeting more than one user type doubles the effort. Yet, in science this is important, particularly for large projects where a lot of user types are expected to interact anyway: project managers, bench chemists/biologists, statisticians, data warehouses, etc. Agreement on which user types are being targeted is core to the analysis. Bioclipse is an example of software where multiple user types are targeted: the visually oriented human (who will use the graphical user interface (GUI), like the Bioclipse-OpenTox one), and people who want full control (and use a scripting language).

Once the user types are defined, we can start thinking about data flow and how to model that. It is important here to find common ground and ensure that the underlying technologies are the same. That requires your design to be expressed in layers that build on top of each other (e.g. as done in the TCP/IP and OSI network stacks). Multiple applications oriented at multiple user types must use the same lower layers. Some initial agreement about what such a layered approach looks like for your project is important too.

Now, we're not done yet. There is the learnability aspect of usability. That is often neglected, and the discussion often only focuses on ease of use. Bioclipse is based on Eclipse, which has several approaches to learnability, one of which we adopted in Bioclipse: cheat sheets (a great Open Standard, I think!). They talk the user through a particular process, but at the same time link tightly to the software, and they can even make things happen in the software by running certain actions. This way, they show the user around in the design.

I personally like scripting very much, hacker that I am. Just because of the learnability aspect of HCI. Scripts are not for everyone, but for those who know a bit about programming, scripts are a perfect tool to teach others how your product works. This is why projects like MyExperiment exist: to share scripts (and workflows of course, but those are just graphical scripts). They are explicit, show what is happening, etc, and thus are the most informative means to get your message across. This is why my Groovy Cheminformatics book is full of scripts too. For GUIs, screencasts serve pretty much the same role, but are much less interactive: you cannot pause a screencast just to see what happens if you hit that other button at that exact same time, limiting the learnability of the solution.

As a final note, I will briefly return to Bioclipse, Jmol and layers. What Bioclipse and Jmol have in common is that they have a two-layer design (well, maybe more, but for the current argument I want to focus on two layers). The lower layer defines an API on top of which two applications are developed, both using the exact same underlying API: a GUI and a scripting language. In both Bioclipse and Jmol, all GUI functionality (or at least 90%) is expressed in terms of API calls. How that technically works is a whole other story, but early on the developers of Bioclipse and Jmol decided that was a smart thing to do. In fact, neither project started with this approach; both changed their design later. The point here is that any new project should take advantage of that experience and express from the start:

what are the targeted user types

what is the layered model that is going to be used, to allow targeting all user types

Users are demanding. Peter (of Specs) and I chatted briefly yesterday afternoon about the Bioclipse Scripting Language, and in particular how to append content to an existing file from JavaScript. I do not know how to do that with either the Bioclipse managers or with JavaScript.

So, I looked up an old patch that updated the JavaScript console to use Groovy instead (which I also use in my Groovy Cheminformatics book, which just had its third edition come out). And it still worked! On the train back home I cleaned up the code to use a separate console window, separate threads, etc, so that you can have a Groovy and a JavaScript console running in parallel (they do not share variables):

The print command still routes the text to something undefined, instead of the console. But it mostly works the same as the JavaScript console, and all managers are available in the same way.

Now, I just realize that the 'print' issue is actually worked around in the JavaScript console with a dedicated js manager, which I used above. But of course that routes the output to the JavaScript console, not the Groovy console :)

Update: I discovered that JSR 223 provides a ScriptContext which allows one to overwrite the ScriptEngine's standard output. That means, practically, that I got print to work properly now :)
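A minimal sketch of that ScriptContext trick, assuming a JavaScript engine is available on the JVM (older JDKs bundle Rhino or Nashorn; newer ones may not, hence the null check). This illustrates the redirection idea, not the actual Bioclipse code:

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import java.io.StringWriter;

// Redirect a JSR 223 ScriptEngine's 'print' output to a writer of our choosing,
// instead of letting it go to some undefined default.
public class RedirectPrint {
    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        if (engine == null) { // e.g. recent JDKs without a bundled JavaScript engine
            System.out.println("no JavaScript engine available");
            return;
        }
        StringWriter captured = new StringWriter();
        engine.getContext().setWriter(captured); // reroute the engine's standard output
        engine.eval("print('hello console')");
        System.out.println(captured.toString().trim());
    }
}
```

In a console application, the writer would wrap the console widget instead of a StringWriter.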

Wednesday, August 03, 2011

Machine readable content is good for something. The actual format is not so important, and we can route RDF over JSON, so I'm fine with JSON. MediaWiki has an External Data (ED) extension that allows getting remote data in various formats, among which JSON. It works OK, but I have not figured out how to take advantage of the hierarchy. If the same field shows up at various places in the hierarchy, ED still does not distinguish between them :(

Anyway, the obvious application is to show your latest tweets on your MediaWiki homepage. Right?

The {{{Twitter|}}} bit hooks into the template field that created that People box on the right-hand side of that page, which you can see in the above screenshot with my twitter account set.

Of course, I have a hidden agenda here. The true reason is not twitter messages. It is being able to move data around. For example, wouldn't it be great to embed SPARQL results in wiki pages this way? Well, maybe there are better solutions, but the point is that this is technologically possible, and that we should think creatively in making mashups that help our scientific research.

Tuesday, August 02, 2011

Web of Science is my de facto standard for citation statistics (I need these for VR grant applications), and defines the lower limit of my citation counts (it is pretty clean, but I do have to ping them now and then to fix something). Its public front-end is ResearcherID. There is a Microsoft initiative, which looks clean but doesn't work on Linux for the nicer things, and its coverage of journals is pretty bad in my field, giving a biased (downwards) H-index. And CiteULike and Mendeley focus more on your publications than on citations (though the former has great CiTO support!).

Then Google Scholar Citations (GSC) shows up. While it does not look as pretty as competing products, it compensates for that with wide coverage of literature (for example, it covers JChemInf, which Web of Science currently does not, and I happen to have published a lot in that journal recently), books, and reports, while keeping false positives fairly low. Thus, it provides an upper limit on my citation statistics, but one I am pretty confident about. And my H-index is quite comparable anyway. This is what my profile looks like:

So, these statistics have two purposes for me: 1. grant applications, and 2. I like to know what work people base on my research. (Well, OK, 3. it helps me understand why I work so hard on too many things.)

Now the question is, will GSC take off. Will it replace ORCID? Will they join ORCID? Will GSC get a good API? Who will write the first userscript to make the GUI fancier? Will GSC support CiTO? Will GSC start using microformats or RDFa? What mashups can we expect between bibliographic databases? Will new entries automatically be posted to Google+? Will it have a button to autocreate a blog post when a paper gets cited 100, 500, or a 1000 times? Will GSC support #altmetrics?

Update: these are my personal experiences, and do not reflect that of other people and/or organizations.

The first half year of the ToxBank EU FP7 project (co-funded by Colipa; I think I am legally required to mention them, and I am happy to do so in either case) that I am working on (50%) has given me mixed feelings. ToxBank has a great team of people (and great names in the other cluster projects too!), and I am quite happy about the results we made. What results, you may ask. Well, indeed: they exist, and they are nice results too! It's just that they are not so visible.

That is the part I am less happy about: legalities. It took months for the consortium agreement to get finalized, and then there is a cluster agreement for the whole of SEURAT-1. Information is monopolized, and there is a general fear of accidentally releasing information on which others may claim IP. It slows us down; it inhibits new collaborations and thus serendipity. Naive and idealistic as I am, I say this is bad for science.

But the community is positive about Open nowadays, one achievement of the gold Open Access journals, I guess, and we recently hosted a well-attended workshop on Open Data too. People realize that openly sharing data has a role in science (see also this post).

Anyway, the goals of SEURAT-1 are great. There now is an official website, and I am pretty sure I am not disclosing any trade secrets. SEURAT (Safety Evaluation Ultimately Replacing Animal Testing) has the ambitious goal of making animal testing obsolete. The 'ultimately' is there to reflect that this will not happen any time soon, but at least we die trying (this is also why there is a -1 in the name... there may be follow-up projects, as outlined in the Vision and Strategy). Of course, that following up makes Open Source and Open Data important, in my personal opinion. Fortunately, more and more people share that opinion. I hope my direct and/or indirect contributions to ToxBank can set an example.

The title of the -1 project is "Towards the replacement of in vivo repeated dose systemic toxicity testing". So, that will be more or less the focus of the ToxBank data warehouse. The types of data will be very diverse, and include many areas of the omics space, including my favorite: metabolomics. To allow a systems toxicology approach, a short list of test compounds will be established. As part of ToxBank we have set up a Semantic Web system, allowing these compounds to be part of the Linked Data network. However, here comes the legal stuff again, and the wiki is not generally accessible; only to SEURAT-1 members (more precisely: it will be very soon).

And that makes it impossible for me to call on the community to run their favorite tools against these compounds. For example, to calculate solubilities in various solvents. That is information useful to our compound evaluation, and it would contribute to the SEURAT-1 project! But I cannot do that right now :(

But we're just 6 months into the five-year project. I think we'll see a lot of fireworks later! The hurdles may slow us down, but they will not keep us from reaching what we want.

Monday, August 01, 2011

This page nicely writes up what you need to do to make your RDF resource part of the Linked Open Data network. CKAN is used to aggregate facts about the resources, and I am finally getting around to adding the metadata describing how the ChEMBL data (CC-SA-BY) is linked to other LOD resources. This process is conveniently supported by a validator (see the screenshot on the right side).

The links out are mostly to various data sets of Bio2RDF. SPARQL helps me count the number of links to other LOD nodes. A typical query looks like:
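The query itself did not survive in this copy, but a hedged sketch of such a counting query might look like the following; the owl:sameAs predicate and the Bio2RDF URI prefix are assumptions on my part, not necessarily what the original post used:

```sparql
# Hedged sketch: count outgoing links that point into Bio2RDF
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT (COUNT(?link) AS ?count)
WHERE {
  ?compound owl:sameAs ?link .
  FILTER regex(str(?link), "^http://bio2rdf.org/")
}
```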


This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at Maastricht University, studying biology at an unsupervised but atomic level. Open science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and Wikipathways. ORCID:0000-0001-7542-0286
