
Wednesday, April 30, 2008

The MetWare components are slowly coming together. The RAW data upload facility prototype went into beta stage, while the SKOS has proven really useful for various things.

To stay compatible with various Java libraries and tools, we decided some time ago to use Java. We also wanted to start off with an HTML GUI to MetWare, which led us to JavaServer Faces. Not being so fond of Tomcat (e.g. as used by the NMRShiftDB), I was not sure how that would turn out, but Steffen was rather positive about it. And I like it :)

The key concept here is that JSF uses Java Beans, which are referred to in the above example with expressions like #{bean.field} for bean fields and #{bean.method} for bean methods, assuming a bean exists with getField(), setField() and method(). The <h:outputText> tags are JSF components that work out the bean details and create HTML in the output. That is it, as a really brief intro.
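To make the bean convention a bit more concrete, here is a minimal sketch in plain Java, without any JSF dependency. The MetaboliteBean class, its name property, the search() method and the readProperty() helper are all made up for illustration; they are not MetWare code. The helper only mimics, roughly, how an expression like #{bean.name} ends up calling getName(); the real JSF expression language does considerably more.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.InvocationTargetException;

public class BeanSketch {

    /** Hypothetical bean; class and property names are illustrative only. */
    public static class MetaboliteBean {
        private String name = "";
        private int searchCount = 0;

        // A getter/setter pair is what makes "name" a bean property.
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }

        /** An action method, as would be referenced by #{bean.search}. */
        public String search() {
            searchCount++;
            return "results";
        }

        public int getSearchCount() { return searchCount; }
    }

    /**
     * Reads a property by name through its getter, roughly what happens
     * when an expression like #{bean.name} is evaluated for output.
     */
    public static Object readProperty(Object bean, String property)
            throws IntrospectionException, IllegalAccessException,
                   InvocationTargetException {
        BeanInfo info = Introspector.getBeanInfo(bean.getClass());
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getName().equals(property) && pd.getReadMethod() != null) {
                return pd.getReadMethod().invoke(bean);
            }
        }
        throw new IllegalArgumentException("No readable property: " + property);
    }

    public static void main(String[] args) throws Exception {
        MetaboliteBean bean = new MetaboliteBean();
        bean.setName("glucose");
        System.out.println(readProperty(bean, "name")); // resolved via getName()
        System.out.println(bean.search());              // plain method call
    }
}
```

The point of the sketch is just that the page never names getName() itself: it names the property, and the getter is found by convention.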

The MetWare Beans

It is clear that Java beans for MetWare would be useful, and this is what I have been working on for the last few weeks. The relevant beans for the above example are automagically created from the SKOS, complemented with extra bits of RDF for the additional details, like field data type, mapping to SQL tables, and an example value. This all works very smoothly (the code to load() and save() into the SQL database is automatically generated too!) as you can see in the above example. The screenshot shows matches from a (local) live SQL metabolomics database. The text on the right side is directly taken from the SKOS.
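As a rough idea of what such generation could look like, here is a small sketch that turns a list of field descriptions into bean source code and a matching SQL INSERT statement. This is not the actual MetWare generator: the FieldSpec class, the method names and the table and column names are assumptions made up for illustration, and the field descriptions are hard-coded here instead of being read from SKOS and RDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BeanGeneratorSketch {

    /** Hypothetical field description; in practice this would come from the SKOS/RDF. */
    public static final class FieldSpec {
        final String name;      // bean field name
        final String javaType;  // field data type
        final String column;    // SQL column the field maps to

        public FieldSpec(String name, String javaType, String column) {
            this.name = name;
            this.javaType = javaType;
            this.column = column;
        }
    }

    private static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    /** Generates Java source for a bean with one getter/setter pair per field. */
    public static String generateBean(String className, List<FieldSpec> fields) {
        StringBuilder src = new StringBuilder("public class " + className + " {\n");
        for (FieldSpec f : fields) {
            String cap = capitalize(f.name);
            src.append("    private ").append(f.javaType).append(' ')
               .append(f.name).append(";\n");
            src.append("    public ").append(f.javaType).append(" get").append(cap)
               .append("() { return ").append(f.name).append("; }\n");
            src.append("    public void set").append(cap).append('(')
               .append(f.javaType).append(' ').append(f.name)
               .append(") { this.").append(f.name).append(" = ")
               .append(f.name).append("; }\n");
        }
        return src.append("}\n").toString();
    }

    /** Generates the parameterized INSERT a generated save() might execute. */
    public static String generateInsertSql(String table, List<FieldSpec> fields) {
        List<String> columns = new ArrayList<>();
        List<String> marks = new ArrayList<>();
        for (FieldSpec f : fields) {
            columns.add(f.column);
            marks.add("?");
        }
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + String.join(", ", marks) + ")";
    }

    public static void main(String[] args) {
        List<FieldSpec> fields = Arrays.asList(
                new FieldSpec("name", "String", "sample_name"),
                new FieldSpec("mass", "double", "mass"));
        System.out.print(generateBean("Sample", fields));
        System.out.println(generateInsertSql("sample", fields));
    }
}
```

Keeping the field name, data type and column together in one description is what lets the bean and the SQL be derived from the same source, so they cannot drift apart.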

Now, the bean library allows integration with other tools too, though this is not on our current roadmap. But, for example, I have been thinking about a simple Bioclipse wrapper around these beans. What is on our roadmap involves workflows for metabolomics.

Sunday, April 27, 2008

About 1

Actually, much of the work I have been doing in opensource chemoinformatics was done as 'home' science; I started as an organic chemistry student, and later became a data analyst, while the CDK/Jmol/JChemPaint work was something I did at home because I liked and needed it. I started in 1995 working on a website to aid my organic chemistry studies, the Woordenboek Organische Chemie (open data). And I needed semantic tools for 2D and 3D display of molecular structure. Commercial offerings were not an option for me as a student, so I got involved with the Chemical Markup Language, Jmol and JChemPaint in 1997-98.

Note that at that time free academic licenses were rarer than now. I always had, and still have, the feeling that those clauses are just there to give academics a reason to support non-opensource tools. Also note that a lot of commercial offerings started by incorporating the code base of some PhD work. Not uncommonly, the PhD would simply be hired by the company.

Fact is, commercial chemoinformatics licenses are indeed unfriendly for scientists who maintain related hobbies at home. And, given my experience, I appreciate your worries: the high costs for those tools, which I certainly could not afford with my student funding, drove me to the opensource ideas many, many years ago.

About 2

The second issue brought up regards the ability to make mash ups. Open source and open standards are indeed important for making mash ups, though the former only helps you work around a lack of open standards. Using web services contributes to the solution, as they have a well-defined, open standard interface. Open source is particularly important for reproducibility of scientific results (see my thesis), and is the opposite of proprietary software, not commercial software. So, it seems bbgm is just looking for Blue Obelisk projects.

On a practical note, I think that Bioclipse might just be what you are looking for: it integrates local services and services on the internet alike. Particularly, the upcoming Bioclipse2 is strong at this, and supports SOAP, BioMart and BioMoby for online services (also see this), as well as R, BioJava, CDK and Jmol as local services. You can even run Taverna workflows from within Bioclipse, if you like. Mash ups can be done in various ways. Hard-core Java coders would go the RCP plugin way, for example this nanotube example. Others will prefer scripting languages, such as JavaScript and Ruby (in addition to R and Jmol scripting). Or, you might record as a script the things you did graphically, using the recording feature.

Of course, there are other solutions... Bioclipse is just one, one to which I contributed.

About running webservices...

Running webservices basically means being a hosting provider, and requires some commercial model. One complicating problem is that, or so it is said, large groups within the potential user base, i.e. the pharma industry, do not even like sending their highly secret data over an https:// line to the outside world.

Rajarshi and the rest of the Indiana group have been running chemoinformatics webservices. They might be the provider you are looking for.

Conclusion

All I can say to bbgm: "Yes, your two thoughts are indeed issues, and many from within the Blue Obelisk community have been addressing them." Oh, and we will not stop either. Peter recently gave in Nature a nice overview of what we, Blue Obelisk members, have been cooking on: Chemistry for Everyone: and that includes the hobby scientist.

Thursday, April 24, 2008

Via Rich's blog, I was informed about the work by goesLightly on CampDepict, a Ruby-based application which uses the CDK for SMILES parsing and 2D diagram generation. With cdk-20060714.jar it's using pretty ancient code, and I have not seen a screenshot.

But, importantly, it allows third parties to efficiently set up DOI-InChI tables. Cheap (Asian?) workers become rather expensive when compared to machine mining to create such databases. Sure, the authoring becomes somewhat more expensive, but who will argue against scientists being a bit more precise in what they publish? I, for sure, would rather see authors focus on adding InChIs to experimental sections than on getting EndNote to put the comma, bold and upper casing in the right place to meet journal standards.

Another publisher who takes its job seriously is Beilstein. Stephan recently showed me some of the things they are up to, like information-rich figures (yes, you'll have access to the source, and can identify the molecular structures in reaction schema). He also pointed me to the RDF that is now available by default for all their articles. For example, for DOI:10.1186/1860-5397-3-50, the RDF is available here. It's indicated in the HTML with:

There is, actually, also a lot of citation information available in the <meta> tags in the HTML, but apparently not the right stuff yet to have Zotero pick it up nicely (not sure what this Firefox plugin is actually looking for). No chemistry in the RDF it seems, but there is BIBO, FOAF and Dublin Core.

Main suggestion to Stephan, right now, would be to include InChIs in the RDF and RSS feed.

Disclaimer: Colin, behind Project Prospect, visited our group when I was still in Cologne; Stephan contributed code bits to the CDK project, e.g. this Matrix class.

Oh, Nature is, of course, also a publisher who is actively getting into the electronic publishing age.

ChemSpider has been using a similar approach to add value to existing resources. The interesting thing in this case is that these substructure-searchable versions have a useful spin-off: they allow ChemSpider to build a valuable DOI-InChI table. So far, I spotted:

Tuesday, April 15, 2008

While I do not agree on the details of the statement made by Klaus, I agree with his intentions, and am happy to propagate the mantra, like others did before me:

MAKE ALL RESEARCH RESULTS CC-BY

The details I disagree with:

no need for shouting; we can all perfectly well read it in lower case

CC-BY is not required; any open data license will do

Now, I know some of you disagree, and I understand the costs of maintaining and curating a database. But, if all research results were freely available, these costs could be shared by the community, and we could all stand on the shoulders of giants.

Wednesday, April 09, 2008

Today starts the MetWare developers meeting, hosted by Steffen Neumann, at the Leibniz-Institut für Pflanzenbiochemie. Steffen's group and the Applied Bioinformatics group, where I now work, are co-developing an opensource platform for metabolomics data management. Not really a full LIMS system, but a system to keep track of all the facts about the experiments and samples we would use when analyzing the data in order to find new chemistry, biomarkers, etc (see this earlier blog too). The good news is that BioAssist is developing a support platform for the NMC, and plans to use MetWare as a main component.

OK, off to catch my train now. See you online (#metware @ irc.freenode.net); the wiki has an agenda for the meeting.

We used the R statistics software together with Rajarshi's rcdk package (an R wrapper around the CDK library) and Ron's (my PhD supervisor) PLS package (see this paper), to predict retention indices for a number of metabolites.

You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement.

CC-BY 3.0 reads differently, but has similar aims.

Let me make clear that I value machine readable publications much more than free (gratis, as-in-free-beer) publications. Now, the NIH initiative is just 'Free Access'. An interesting step, but not one I care much about; not in relation to science anyway.

Now, Peter indicates that the NIH has put in place 'technological measures to control access' to the distribution of our work on userscripts (the PMC entry). That is in clear violation of the CC license.

I know that other NIH initiatives do allow this, such as PMC OAI, but that's just an 'auxiliary service'. It may come down to technical details, but some text on the PMC website is at least inaccurate:

Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the PMC web site. Bulk downloading of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions.

The way it is described right now, it is like: You may not drive a car. Next paragraph. But, if you have a driver's license, we will approve. Or, translated to this example: You may only use this and that article, but only a few of them. Next paragraph. Unless you use the following technical hole in the measure we took to disallow you access.

What the PMC website should indicate, instead, is that text mining is allowed for the PMC OAI subset, but that they would highly prefer that people use the PMC OAI or PMC FTP routes. This is the least they have to do.

No matter what, I still have the feeling that any technical obstacles are disallowed by the CC license. Is there any legal expert here who can explain to me whether the CC license allows controlling how people access my material?

Thursday, April 03, 2008

I am a doctor now; I shall now be addressed as weledelzeergeleerde Egon, translating to something like quite-noble-very-knowledgeable, hahahaha. Later, I'll put up a few photos of the ceremony, which is actually quite formal at the Radboud University.

With this blog item, I would like to thank everyone who left a message, sent email, etc with good luck messages. Very much appreciated! I'd also like to thank my supervisors, promotores Lutgarde Buydens and Peter Murray-Rust (he mentions the event here), and Ron Wehrens, for their confidence in me and their guidance on the path towards the post-doc life. I also thank all those who attended my defense; I had a brilliant day, and actually enjoyed talking to those who took a seat in my promotion committee and who asked me the not-really-nasty-questions about my work.

CDK-Chemometrics in Metabolomics Unconference

For today, I organized a small, informal unconference, oriented around the CDK, chemometrics and metabolomics. I'm certain we will be online much of the day, as we typically do. The meeting will start around 10:00 CEST, but we'll attend a seminar by Marjana Novič at 11:00 CEST. If you happen to be in Nijmegen, just drop in on the Analytical Chemistry department. Otherwise, join the #cdk chat channel in the irc.freenode.net network.

Tuesday, April 01, 2008

In about 26 hours from now, I will be defending my PhD thesis. Follow that link to read the summary. I was thinking of publishing my introduction and discussion (the rest has been published in peer-reviewed journals) on Nature Precedings; would that be a good idea? Otherwise, I'll post it in my blog. If you just happen to want to attend the public defense, it's here:

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
