
Thursday, December 30, 2010

Two and a half months after the CDK milestone, the Blue Obelisk paper also reached 100 citations. Here the lucky paper is Design, Synthesis, and Preclinical Evaluation of New 5,6- (or 6,7-) Disubstituted-2-(fluorophenyl)quinolin-4-one Derivatives as Potent Antitumor Agents by Chou et al (doi:10.1021/jm100780c). The Blue Obelisk paper (doi:10.1021/ci050400b) is cited because the authors used OpenBabel. About half of the 100 citations are because OpenBabel was used, whereas OpenBabel is only mentioned as one of the Blue Obelisk-associated projects in the Blue Obelisk paper.

I am not sure how this habit started, but given current citation practices, it is unlikely to go away. But who am I to complain....

Oscar is a text miner. It mines text for chemistry. Oscar4 is the next iteration of the Oscar code that I worked on in the past three months, with Lezan, Sam, and David. I blogged about aspects of Oscar4 on several occasions:

These posts will serve as some initial critical mass for a draft report I plan to finish today. I might have to blog some further posts with diagrams, here and there. This post is actually one of them, and discusses where Oscar can be expected to go next, now that the design is cleaned up (though this effort is not finished yet) and it has become possible again to extend it. The over 250 unit tests make this a lot easier too.

One direction in which I expect Oscar to go in 2011 is support for other languages. To a very large extent this is based on multi-language support in the dictionaries, as well as having training data in a particular language. This also provides some context to my earlier post about the need for an Oscar training data repository.

This extension opens a number of options: analysis of patent literature in other languages, monitoring of press releases in other languages, news items in local newspapers, etc. For example, it could analyse this C2W news item on yeast cells:

There are many use cases for such localized text mining. And it surely matters for determining the impact of research.

Oscar has various places where language specifics are found. For example, in the tokenization of a text. One step here is the detection of sentence ends. In most western languages this is done with a period, exclamation mark, question mark, etc. But periods (dots) are also used in abbreviations. Similarly, colons can be used in chemical names. And every language comes with different abbreviations that need to be recognized.

Currently, some abbreviations are found in NonSentenceEndings. In the past three months, we have been cleaning up the code, and restructured the source code, making it easier to detect such places. This class will likely undergo further refactoring, to make the list of such non-sentence-endings configurable via files or so. What I expect to see is that you initiate Oscar like this:

Oscar oscar = new Oscar(Locale.US);
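To make the idea concrete, here is a minimal sketch of what locale-aware sentence splitting with a configurable non-sentence-endings list could look like. Everything here (the SentenceSplitter class, its abbreviation lists) is made up for illustration; it is not the actual Oscar4 API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: a locale-aware sentence splitter that skips
// periods belonging to known abbreviations ("non-sentence-endings").
public class SentenceSplitter {
    private final Set<String> nonSentenceEndings;

    public SentenceSplitter(Locale locale) {
        // In a real implementation this list would be loaded from a
        // per-locale configuration file; these entries are examples only.
        if (Locale.GERMAN.getLanguage().equals(locale.getLanguage())) {
            nonSentenceEndings = new HashSet<>(Arrays.asList("z.B.", "bzw.", "Abb."));
        } else {
            nonSentenceEndings = new HashSet<>(Arrays.asList("Prof.", "e.g.", "Fig."));
        }
    }

    public List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.split(" ")) {
            if (current.length() > 0) current.append(' ');
            current.append(token);
            // A period only ends a sentence if the token is not a known abbreviation.
            if (token.endsWith(".") && !nonSentenceEndings.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) sentences.add(current.toString());
        return sentences;
    }

    public static void main(String[] args) {
        SentenceSplitter splitter = new SentenceSplitter(Locale.US);
        // "Prof." is skipped as a sentence end, so this yields two sentences.
        System.out.println(splitter.split("Prof. Murray-Rust works in Cambridge. Oscar mines text."));
    }
}
```

The design choice is that the Locale passed to the constructor selects which abbreviation list is loaded, which is exactly what an `Oscar(Locale)` constructor could do internally.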

This might actually even make a nice student summer project. The biggest challenge will be in making a good corpus of training data, like the SciBorg training data that was used for training Oscar3.

But the whole normalization is tainted with English language specifics too. For example, the normalizer will have to 'normalize' question marks, for which several Unicode variations exist. But the normalized variant is language dependent. For example, Greek and Armenian have different characters (see this page), and then we have not even started talking about right-to-left scripts.
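A sketch of what such language-dependent normalization could do. This is illustrative only: the variant list and the choice to map everything onto the Greek (U+037E) or Armenian (U+055E) question mark for those languages are my assumptions, not Oscar's actual behavior:

```java
// Illustrative sketch (not Oscar's actual normalizer): map Unicode
// question-mark variants onto one canonical character per language.
public class QuestionMarkNormalizer {
    // Some variants: full-width (U+FF1F), small (U+FE56), Arabic (U+061F).
    private static final String VARIANTS = "\uFF1F\uFE56\u061F";

    public static String normalize(String text, String language) {
        char canonical = '?';
        if ("el".equals(language)) canonical = '\u037E'; // Greek question mark
        if ("hy".equals(language)) canonical = '\u055E'; // Armenian question mark
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            sb.append(c == '?' || VARIANTS.indexOf(c) >= 0 ? canonical : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Is this benzene\uFF1F", "en")); // Is this benzene?
    }
}
```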

Besides localized dictionaries, Oscar will also benefit from a localized OPSIN. It seems to recognize the Dutch propaan, but not benzeen. I am not going to look at that soon, but if you are interested, I recommend checking out Rich's posts about forking OPSIN and writing patches.

Getting Oscar going for other languages is a challenge, but also offers new opportunities. Just email the oscar mailing list if you are interested and need help.

Today is the last day I work on Oscar in my position in Cambridge (tomorrow I have a day off and fly back to Sweden). Three months go by quickly indeed. Next Monday I start my position in Stockholm at the IMM department at the Karolinska Institutet on predictive toxicology. Back in Sweden, it is. Well, of course, I worked from home most of the time anyway.

So, today it is time for me to write up a report for the last three months. This blog item is basically a prelude, or procrastination, or so. People sometimes ask me how I find time to blog so much. The trick is to just make blogging part of your workflow. So, a small script I use to finish another task becomes a blog post (e.g. Converting JSON to RDF/XML with Groovy). I started blogging (in 2005) to actually optimize my workflow. I was sending the same message ("have you seen this interesting webpage") to several mailing lists, often tuning it a bit to the audience. Now, by just posting it on my blog, I removed the need for tuning and, as a bonus, reach a much larger audience too. Actually, with more than 300 unique visitors a week, I cannot complain.

Neither can I complain about the amount of discussion it triggers. It's like having my own private symposium:

Monday, December 27, 2010

What if scientists could host small amounts of CC0 data for free? Something like computation results, e.g. outputted as HTML+RDFa? Without having to worry about setting up a triple store, etc? Well, that future might be near. The above screenshot shows a first go. Not by me, but in response to a feature request by me. So, the question right now is: what would we like to see on the summary page? Some things I can think of are:

One of my first encounters with open source cheminformatics was the XYZ file viewer applet by Sun. I extended it back then with minimal PDB support for our Woordenboek Organische Chemie website (started in 1995, now extinct). This applet dates back to at least 1997, as shown by the screenshot.

Sunday, December 26, 2010

Oscar uses a Maximum Entropy Markov Model (MEMM) based on n-grams. Peter Corbett has written this up (doi:10.1186/1471-2105-9-S11-S4). So, it basically is statistics once more. If you really want a proper bioinformatics education, do your PhD at a (proteo)chemometrics department.

N-grams are word parts of n characters. For example, the trigrams of acetic acid include ace, cid, tic, eti, and aci. N-grams of length four include acid, etic, and acet. The MEMM assigns weights to these n-grams, and based on those decides if something is indeed a named entity (in Oscar terminology). For example, consider the acet n-gram: acetone should be matched, but facet not.
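Extracting character n-grams is simple enough to sketch in a few lines (my own minimal version, not Oscar's implementation). Note that the acet 4-gram alone cannot distinguish acetone from facet, since both contain it; that is exactly why the MEMM combines weights over many n-grams:

```java
import java.util.ArrayList;
import java.util.List;

// Character n-grams: all substrings of length n, in order of appearance.
public class NGrams {
    public static List<String> of(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(of("acetic acid", 3));
        // Both words contain the acet 4-gram, so no single n-gram decides:
        System.out.println(of("acetone", 4).contains("acet")); // true
        System.out.println(of("facet", 4).contains("acet"));   // true
    }
}
```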

Put this in the perspective of the ongoing refactoring of the Oscar software. We are changing normalization (e.g. converting all Unicode hyphen alternatives into one specific hyphen), and updating the tokenizer (e.g. changing the list of non-sentence-endings like Prof.). It is clear this changes the n-grams typical for chemical-like things. Worse, the weights are tuned towards the known n-grams, and statistical models are generally a bit overtrained for the data, or, at least, specific to it.

Now, if the distribution of n-grams changes, the weights in the model need to be updated too, to not degrade the model performance. So, Oscar is useless if we cannot retrain its MEMM component after a refactoring. If that would be impossible, we would have effectively created an intellectual monopoly.

Thus, what the Oscar project needs is one or more free sets of annotated literature, which can be used to train new MEMM models. The SciBorg corpus was used to train the current Oscar3 and Oscar4 models. This data (copyright RSC) will very likely be made available under a Creative Commons license (RSC++), but may carry the NC clause, which would not be good for developing a business model around the open source Oscar (such as providing a high-performance web service via a subscription service). I have recently written up the problems the NC clause introduces, and some examples of commercial Open Source cheminformatics projects.

We need not focus only on this SciBorg data, however. In fact, we will need multiple models anyway. For example, the SciBorg papers (42, if I am not mistaken) cover a particular kind of literature. So, it introduces the risk of using it to analyse papers outside the application domain. Furthermore, I am very interested (and others indicated so too) in using Oscar for other languages. Surely, English is the major language, but there are many use cases for Oscar in other languages.

Therefore, what we need in the Oscar project is a registry of training (/test) data, itself annotated with metadata on how that data was created (what quality assurance, what kind of named entity types, how many domain experts were involved, etc), test results for those data sets, etc. My time on the Oscar project is almost over, and I have no clue when I will be able to invest the same amount of time in the project as I did in the past three months. But the creation of this registry is a clear step that must be taken in the Oscar4 development.

Thursday, December 23, 2010

The two earlier posts in this series showed screenshots of results of Oscar, but the title also promised results by Lezan's ChemicalTagger. Sam helped with getting the HTML pages online via the Cambridge Hudson installation. Where Oscar finds named entities (chemical compounds, processes, etc), ChemicalTagger finds roles, like solvent, acid, base, catalyst. Roles are properties of chemical compounds in certain situations. Ethanol is not always a solvent; sometimes it is a Xmas present. The current output is not entirely where I want it to go yet, but makes it easy to see which solvents are frequently found in the BJOC corpus:

This screenshot of an analysis of 15 BJOC papers shows that AcOEt (is that the same as EtOAc?) is mentioned as solvent three times in PMC1399459. Brine, however, is mentioned as solvent in three papers.

As said, these two pages contain RDF and the tables are sortable. Hudson recompiles them automatically when I update the source code that creates the HTML+RDFa. So, go ahead, send me bug reports, feature requests, and patches!

Tuesday, December 21, 2010

OK, the second paper I ran into today is a perfect match for the paper by Khanna and Ranganathan I just discussed in the Commercial or Proprietary? post. So perfect, in fact, that I really should have combined them. But since the other post is already infecting the WWW, I'll have to post this update.

Yap wrote up a paper on PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints (doi:10.1002/jcc.21707), and Table 2 is quite like Table 1 in the paper by Khanna and Ranganathan. Not only does Yap correctly differentiate between product cost and license, he also details the descriptor type count and descriptor value count. It is a good exercise to compare those two tables yourself.

Khanna and Ranganathan wrote up a review paper on molecular similarity (doi:10.1002/ddr.20404). I have not fully read it yet, but my eye fell on Table 1, which lists a number of programs that can be used to calculate QSAR descriptors, both open source and proprietary. However, the table features a column Availability which has two options: Public and Commercial. They qualify Bioclipse, CDK, and RDKit as public, while Dragon, MOE, CODESSA and others are commercial. Effectively, it seems to suggest that they classify them as open source versus commercial, though I am not entirely sure what they mean by public.

The authors and referees would not be the first to make this common mistake: not expressing the difference between two orthogonal axes: free-versus-commercial and open-source-versus-proprietary. To clarify these axes, I created this diagram (CC-BY, SVG source available upon request):

It is very important to realize that Open Source software can be commercial. For example, you can get commercial support for Bioclipse and CDK from GenettaSoft. It is also really important to realize that free software (public?) does not mean it is Open Source (or vice versa). E-Dragon is an example here: you can use it freely, but the source code is proprietary. Some years after open source cheminformatics took off, commercial providers started to provide free-for-academic-use packages, which fit into this category too.

Readers of my blog know that I advocate Open Source, not gratis software (see also Re: Why I and you should avoid NC licences), even though you can download many of the Open Source cheminformatics tools I worked on for free. Here it is important to realize that the CDK and Bioclipse are not free: it is just that the tax-payer covered the cost via academic institutes mostly, as well as hobbyists working out-of-office, like I have done for many, many years, and companies who saw mutual benefit. Maybe something to consider the next time you are wondering about donating money to an Open Source cheminformatics project, and pay some respect to the project contributors of the software you use.

Off-topic: there is a second inaccuracy in this table. For each program, they list the number of descriptors, but without units. Units, units?? Yes. For example, for the CDK they list ">40" descriptors, while for Dragon "3,224" (it puzzles me why you can count accurately above 3000, but not below 50). But the point here is that the CDK count is really the number of Java classes, reflecting descriptor algorithms. One algorithm can calculate more than one descriptor value, and those values are what is counted for Dragon. The column is comparing apples with oranges. While I have never really counted it, and every CDK user can in fact tune it, the number of calculated CDK descriptor values approaches a thousand. Well, I guess that is ">40" too :(
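The classes-versus-values distinction can be sketched in a few lines of Java. This is a hypothetical illustration, not the actual CDK descriptor API; the interface, class, and value names are invented:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of why descriptor counts need units: one descriptor
// *class* implements one algorithm, but can return several descriptor *values*.
interface Descriptor {
    Map<String, Double> calculate(String molecule);
}

class BcutLikeDescriptor implements Descriptor {
    // One Java class, six values: counting classes gives 1, counting
    // values gives 6. The value names and results are made up.
    public Map<String, Double> calculate(String molecule) {
        Map<String, Double> values = new LinkedHashMap<>();
        String[] names = {"w.lo", "w.hi", "c.lo", "c.hi", "p.lo", "p.hi"};
        for (String name : names) values.put("BCUT." + name, 0.0);
        return values;
    }
}

public class DescriptorCountDemo {
    public static void main(String[] args) {
        System.out.println(new BcutLikeDescriptor().calculate("c1ccccc1").size()); // 6
    }
}
```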

Sunday, December 19, 2010

Peter blogged about an issue that recently came up: the role of the Non-Commercial clause in Creative Commons licenses. This clause goes like:

NonCommercial
nc

You let others copy, distribute, display, perform, and (unless you have chosen NoDerivatives) modify and use your work for any purpose other than commercially unless they get your permission first.

Originally, I was tempted to use this clause too, and maybe I forgot to remove it here and there, but I agree with Peter that it is better not to use this clause. Peter outlines several of the things involved. One important thing is indeed that 'commercial' use is not well-defined.

Moreover, one of the arguments outlines that "[the NC clauses] are unlikely to increase the potential profit from your work". I think this is true (I could have said 'I believe this is true', but that would only introduce unqualified trust): the material already being free to some extent, the material itself will not be profitable, but the services around it would be. But, by making it impossible for others to set up 'services' around the material (education at a high-fee university, books, software you sell on CD, a web service for which you require fees for high-volume use, ... remember, 'commercial' is not well-defined), you also make it impossible to build a community around the material; and as such, you reduce the value of the material, also for yourself. Facebook would not be what it is right now without community building (neither would the CDK).

These issues are acknowledged by Creative Commons themselves too, and they wrote up an interesting report last year about how commercial use is understood. The bottom line is that no one really knows. I guess. I have not fully read the report yet, but anticipate it is a must read. Here is a bit of the executive summary as a teaser:

The most notable differences among subgroups within each sample of creators and users are between creators who make money from their works, and those who do not, and between users who make money from their uses of others’ works, and those who do not. In both cases, those who make money generally rate the uses studied less commercial than those who do not make money. The one exception is, again, with respect to personal or private uses by individuals: users who make money consider these uses more commercial than those who do not make money.

Tuesday, December 14, 2010

Yesterday, a new journal launched, but not your regular new journal. This journal is about scientific software. Tested software. Documented software. Open Source software. I am on the editorial board, so I am biased here. But, this journal is special. This new journal is called Open Research Computation (ORC?). Several others blogged about it too.

It was great talking to Judy about her PhD research on NMR in metabolomics, and to Tim about further work in this research. It was interesting to learn that here too they have problems in metabolite identification, quite like those when analyzing *C/MS data, and I was really happy to hear he is in contact with Christoph about an Open repository for metabolomics data.

Another 'new' slide was one advertising Open Research Computation, launched yesterday, which I'll blog about in more detail later today.

Saturday, December 11, 2010

A quick update on the post of this morning. The above screenshot shows the progress of the reporting of text mining results using Oscar on the BJOC literature. I think I am almost ready to analyze the full corpus, with a blacklist put in place for large papers. What you see is the same kind of jQuery-enabled sortable list in the HTML view, and a SPARQL query in RDFaDev, to list all papers that mention DHMO (in the first 10 of all 350 BJOC papers) by its InChI.

Some smart software developer once said not to optimize your code too early. However, not caring about it at all does not help either. Some basic knowledge of memory management can keep you going. That is, I just ran into the limits of Oscar and ChemicalTagger. As I blogged earlier today, I am analyzing the BJOC literature, but Lezan and I are running into a reproducible out-of-memory exception. At first I thought it was a memory leak, as it was the 95th paper it fell over on, but after we optimized our code a bit, by reusing classes, the problem remained and turned out to be not in recreating objects (though the code is significantly faster now), but in a single BJOC paper being too large.

The particular paper is not even ridiculously large, though it has an amazing 800 references! The paper, Molecular recognition of organic ammonium ions in solution using synthetic receptors (doi:10.3762/bjoc.6.32), is in fact an interesting review paper on supramolecular chemistry. The molecules I worked on (see one below) in my own supramolecular chemistry time (doing an M.Sc. minor (6 month practical) with Peter Buijnsters in organic chemistry in the group of Prof. Nolte) are actually of the type they review, though surfactants are not really covered in it particularly.

Yeah, supramolecular chemistry has this nice level of complexity; it is so supramolecular that it is currently outside the scope of the molecular analysis of Oscar and ChemicalTagger ;)

Friday, December 10, 2010

Thanks to all who replied and shared their views. Particular thanx to Christina, who replied in her blog. Like Sam, Cameron, and Bill, she thinks this is about semantics. Linguistic tricks. I hope not; this is too serious to get away with such. "Reliable, trustworthy, assumptions": it's all working around the real issue. Similarly, splitting up 'trust' into 'blind trust' and 'smart trust' is just working around the real problem.

Indeed, my point is different. The key of science is to replace trust by facts. Or, when talking about databases, software, or research papers in Nature, it is replacing trust with traceability. Actually, we seem to have lost a long-standing tradition of citing previous work when we write down the arguments we base our argumentation on. Facts are backed up with references, providing the required traceability.

Now, compare that to the current electronic sciences. We 'trust' our databases to have done something sane. Well, don't. They made an attempt, but made errors. As they say with software, having zero bugs just means you have not found them yet.

The real point with 'trust' is that it is completely irrelevant. It adds zero to the scholarly discussion. Whether you trust the highly curated ChEMBL database or not, it has errors. (Noel pointed out one source of ambiguity in the ChEMBL database this week.) What does matter, instead, is whether those errors are significant. Do they affect the conclusions I draw when I use this data? That is what actually matters. Trust has no place in science. Error has.

Sadly, this is basically the hypothesis of the VR grant I wrote up but did not get awarded. But I trust I do better next time.

Why this matters? Well, this is what ODOSOS is about: bring back the traceability into science, and get rid of trust.

But, I do like to put this question out in the open: how many H-bond acceptor groups does this triazole have? The CDK calculates 3 groups, while PubChem counts 2. ChemSpider thinks 3 too.

Intuitively, I would agree with the CDK and ChemSpider: the nitrogen which acts as the single H-bond donor still has a free electron pair. What do you think? Is PubChem wrong, or are the CDK and ChemSpider wrong? Can this special nitrogen be both a donor and acceptor at the same time? I think so. However, I do not know how I can easily search CrystalEye for this. Bonus points for answering the Blue Obelisk eXchange question.

Monday, December 06, 2010

One discussion I have often had in the past year is about trust in science. I, for one, believe (hahahaha; you see the irony? ;) that trust has nothing to do with science. Likewise, any scholar should, IMHO, be suspicious when someone talks about trust. A scholarly scientist will never trust any result: they will accept it as true or false, but will take responsibility for that decision; they will not hide behind 'but I trusted him' or 'but it was published in Nature'.

Antony asked the community last week to answer a questionnaire, which turned out to be about our trust in online chemical databases. He presented the results at the EBI. This is the slide that summarizes the results from that questionnaire:

We see that trust clearly has a very significant place in science. How disappointing. You can spot me in these results easily: I am the one that consistently answered 'Never Trust' for all databases. It's not that I do not value those databases, but there is no need for me to trust them. I verify. This is actually a point visible in Tony's presentation: we can compare databases.

This is the point that I and others have been making for more than a decade now: if we do things properly, we can do this verification. Anyone can. With Open Data, Open Source, and Open Standards we can. I can only stress once more how important this is. We trust people, we trust governments, but repeatedly this trust is taken advantage of. Without transparency, people can hide. By being able to hide, humans lose their ability to decide what is right. With transparency, we see things return to normal, as we saw this week with UK politicians.

Sunday, December 05, 2010

In 2004 I wrote up a short CDK News article on how to set up Konqueror web shortcuts for the CDK (Windows users can download Konqueror here). They are very handy, and I just found another simple use case, and as I have not seen it get much attention recently, and it is a great productivity tool, here goes.

CHEMINF is an ontology for cheminformatics, under development by people at the EBI and in Canada and Sweden. I was aggregating examples, and when you browse these, you immediately run into the problem that all OWL resources have rather cryptic names, like CHEMINF_000000. Of course, any decent OWL tool will just show the rdfs:label, but I and others prefer to work in plain text editors, rather than, for example, Protege.

Fortunately, Michel and/or Leonid have made the CHEMINF ontology Linked Data, which is where the web shortcuts come in. So, CHEMINF_000000 has the URI http://semanticscience.org/resource/CHEMINF_000000. For the RDF and ontology users, a web shortcut is just like defining a namespace (actually, a bit more general), so we would define cheminf:CHEMINF_000000. Actually, let's skip this step, make use of web shortcuts fully, and define cheminf:000000. After all, a web shortcut is nothing more than a simple expansion.
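The expansion itself is trivial; a minimal sketch in Java (the class and prefix table are my own illustration of what Konqueror does internally with a web shortcut):

```java
import java.util.HashMap;
import java.util.Map;

// A web shortcut is just prefix expansion:
// cheminf:000000 -> http://semanticscience.org/resource/CHEMINF_000000
public class WebShortcuts {
    private static final Map<String, String> PREFIXES = new HashMap<>();
    static {
        PREFIXES.put("cheminf", "http://semanticscience.org/resource/CHEMINF_");
    }

    public static String expand(String shortcut) {
        int colon = shortcut.indexOf(':');
        String prefix = shortcut.substring(0, colon);
        String local = shortcut.substring(colon + 1);
        return PREFIXES.get(prefix) + local;
    }

    public static void main(String[] args) {
        // prints http://semanticscience.org/resource/CHEMINF_000000
        System.out.println(expand("cheminf:000000"));
    }
}
```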

Friday, December 03, 2010

Update: Wow, how tired can you be. I have to apologize for this post: as Andrew points out in the comments, Rich did not analyze the Chrome source code, but his own source code. That is not so special indeed. I have misread Rich's post. This completely ruins the point I was making. He did not take advantage of Chrome being Open Source and find the problem that way, but in an old-fashioned debugging session on ChemWriter. The below could have happened, but it didn't.

Rich of MetaMolecular works on Open Source and closed source cheminformatics solutions. ChemWriter is one product he is working on, which uses JavaScript and SVG (two Open Standards), and he recently asked for feedback on the new version. Test users found a problem in Google's Chrome browser, and Rich then did something that is only possible in an Open Source environment: he downloaded the buggy product (Chrome), started looking for the cause, found it, and filed a detailed bug report. Just think what would have happened if this problem was in MS Internet Explorer...

Monday, November 29, 2010

Say, you have your own dictionary of chemical compounds. For example, your company's list of yet-unpublished internal research codes. Still, you want to index your local listserv to make it easier for your employees to search for particular chemistry you are working on, perhaps related to something done at other company sites. This is what Oscar is for.

But, it will need to understand things like UK-92,480. This is made possible with the Oscar4 refactorings we are currently working on. You only need to register a dedicated dictionary. Oscar4 has a default dictionary which corresponds to the dictionary used by Oscar3, and a dictionary based on ChEBI (an old version) (see this folder in the source code repository).

Adding a new dictionary is very straightforward: you just implement the IChemNameDict interface. This is, for example, what the OPSIN dictionary looks like:

Now, you can implement the interface in various ways. You can even have the implementation hook into a SQL database with JDBC, or use something else fancy. The dictionary will be used at various steps of the Oscar4 text analysis workflow.
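As a sketch of the plug-in idea, here is a minimal in-memory dictionary. This is hypothetical: the real IChemNameDict interface in Oscar4 has more methods and different signatures, and the method names below are invented for illustration:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified version of Oscar4's dictionary interface.
interface IChemNameDict {
    boolean hasName(String queryName);
    Set<String> getNames();
}

// A minimal in-memory implementation, e.g. for a company's list of
// yet-unpublished internal research codes. A JDBC-backed implementation
// would implement the same interface against a SQL database instead.
class InMemoryChemNameDict implements IChemNameDict {
    private final Set<String> names = new HashSet<>();

    public void register(String name) { names.add(name); }
    public boolean hasName(String queryName) { return names.contains(queryName); }
    public Set<String> getNames() { return Collections.unmodifiableSet(names); }
}

public class DictDemo {
    public static void main(String[] args) {
        InMemoryChemNameDict dict = new InMemoryChemNameDict();
        dict.register("UK-92,480");
        System.out.println(dict.hasName("UK-92,480")); // true
    }
}
```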

Mind you, the refactoring is not over yet, and the details may change here and there.

Saturday, November 27, 2010

As you know, my post-doc in Uppsala ended. It was a good time, and it was great collaborating on Bioclipse with Ola, Jonathan, Arvid, and Carl. I would have loved tighter integration with the work of Maris and Martin, but that was limited to one joint paper (in press). I thank Professors Jarl Wikberg and Eva Brittebo for allowing me to continue my research at their department, and hope this is not the end of the collaboration yet.

Like with New Year, the end of a contract is a good time to reflect on one's accomplishments. It's been a bit delayed, but as you know, I am already in my next project in Cambridge, and will start in January with a yet longer-term position in predictive toxicology (more on that soon). This makes this a really crowded period, on top of birthdays, Sinterklaas, x-mas, and sorts.

My Research
As you might know, my research interest lies in understanding molecular properties and their applications in larger molecular systems. This can be how small molecules pack in crystals, finding patterns in properties (QSAR-like work), etc. Because the underlying methods are useful in many domains, you see applications in various fields too, including drug discovery, metabolomics, etc. These methods involve statistics and cheminformatics, primarily, which is clear from my publications on method development in chemometrics and cheminformatics. You will also have seen that visualization is a very important tool here, as our numerical validation can easily mislead even a trained scientist.

How Uppsala fits in
About 30 months ago, I got an offer to join the Bioclipse team to work on the cheminformatics features of the workbench. It was already using the CDK, so the project was a tight match with what I did in the past. Additionally, there were plans to integrate R, and while the latter is partially implemented, that part was unfortunately not completed by the group yet. I believe this is a crucial aspect, and without it the large-scale impact of Bioclipse will be severely reduced.

Bioclipse is positioned as a workbench to use third-party libraries, web services, databases, etc, and has done so very successfully (doi:10.1186/1471-2105-8-59). It speaks many Open Standards, and already incorporates various important Open Source libraries for life sciences research, including the aforementioned CDK (doi:10.2174/138161206777585274, doi:10.1021/ci025584y), but also Jmol, JChemPaint, BioJava (doi:10.1093/bioinformatics/btn397), and others. Using these libraries it has rich visualization means for life sciences data, including molecules and protein sequences. The latter, of course, is directly related to the proteochemometrics research done in Wikberg's group. Recently, Bioclipse adopted scripting functionality, making it a perfect tool to share life sciences computation, just like Taverna (doi:10.1093/bioinformatics/bth361) and KNIME.

Results
So, what has this resulted in, besides a number of unsuccessful grant applications? We're still counting, but two book chapters, a book on pharmaceutical bioinformatics, one proceedings paper, five research papers, seven oral presentations at international meetings, and an ACS conference on RDF in chemistry. Oh, and tons of Open Source code, of course. (I'm at the edge of collapsing; I did that as a student, lost a year, but learned a lot about myself ... these results I have worked very hard for; I am not a miracle worker. And I have to disappoint people occasionally, as things do not work out how I expected them to. My apologies for that.)

I will not describe it all in detail now, but focus on a few things around what I made my research in Uppsala: semantic cheminformatics, which I believe to be a key concept of where cheminformatics must be going. The first paper resulted from a collaboration with Johannes, a medical researcher at the Ludwig-Maximilians-Universität in Germany (full reference at the end of the post). This work provides an alternative to SOAP, with a better solution to asynchronous computing than the polling approaches now commonly used. An XMPP-based service just reports back when it is done, so that you do not have to ask all the time. Makes sense to me. We made the platform available in Bioclipse and Taverna, and demonstrated the technology with applications in life sciences, including (QSAR) descriptor calculation, and susceptibility for seven known HIV protease inhibitors.

This work stresses that if we really want to, we can significantly improve scientific computing. It's very much like what Peter concluded this week: "None of this is rocket science - it's purely a question of will". This is what I have been trying to show in the past few years. The disuse of accurate scientific computing is a deliberate choice. Making your cheminformatics research irreproducible is a choice, and a bad one too. There can be acceptable reasons, but the choice would be bad nevertheless (I hope that distinction is clear: you can have valid reasons to do something intrinsically wrong. You will be forgiven, and be encouraged to change your behavior.) Many people in the Blue Obelisk community are laying out the foundations and showcases, hoping to make it easier for others to change behavior. I think we have been quite successful there.

Anyway... on to the second paper. As said, Bioclipse is the platform that can bring these new cheminformatics methods to the desktop. The new and improved Bioclipse 2 (see citation below) adds one important new feature: scripting. My work in this paper focused on making sure the cheminformatics library was properly integrated, on continued development of JChemPaint (yet unpublished, and in collaboration with the EBI, but see for example this blog post), and on helping Ola and others to properly use the CDK in their applications (MetaPrint2D (doi:10.1186/1471-2105-11-362), etc.). The impact of this work goes far beyond the papers on which I am an author, though not every reviewer will understand that, unfortunately. This work is really the plumbing: it's the development of the measuring machines to do the job, the development of an STM device to actually get going.

The third paper, also listed at the end, is about defining a standard for the detailed exchange of QSAR data. It defines what information is needed to reproduce a set of QSAR descriptors, including the input, using the descriptor ontology we published before (doi:10.1021/ci050400b). This project can be the seed of a public repository of QSAR data, where it will be clear what is meant, and how the data can be used. If you are interested in setting up such a public repository, please contact me or Ola.

That brings me to the work that I have initiated in the group: the use of RDF technologies (I do hope all VR reviewers are listening). RDF provides a lingua franca for data exchange in the life sciences, where the meaning of words is provided by shared dictionaries (ontologies). Bioclipse has been extended to speak RDF, and we developed various applications based on it. A proceedings paper previews the effort, while the full paper is in print in the new Open Access Journal of Biomedical Semantics. Of course, you can also read much about this topic in this blog.

RDF is going to change bio- and cheminformatics in ways that XML has been unable to. Various papers are currently in preparation to provide detailed use cases and related research. I am very excited about this technology, which further improves interoperability and reproducibility in cheminformatics. Should you care about that? Yes, because by using these good practices, research will be easier to interpret and conclusions easier to judge, and as such we can focus on the underlying chemistry in much more detail, instead of looking at noise, as much of the current cheminformatics literature does. (Ouch, that's a bold statement indeed. True? Well, without reproducibility it is hard to tell. Let's all work towards less magic, less black box, and more science in this field; we will all benefit from that. Who knows, we might even convince the bench chemist that we are doing something right ;)

So, where is the understanding of underlying patterns, you may wonder? That is a fair question, and I have no qualms admitting that after my PhD that part has been underrepresented. That will change soon enough, though. Now I can only hope it is in time to get me a Nature or Science paper, required to get tenure (see this discussion).

Monday, November 22, 2010

I am about halfway through my Oscar project now. I blogged about the Oscar Java API, the command line utility, and the Taverna plugin (all in development). Also, David Jessop has joined in, boosting the refactoring. The meeting with the ChEBI team last week was great too. We worked out details for the use cases, involving Oscar and Lezan's ChemicalTagger.

Here follow some install instructions to get you going. Please give things a try, even though the code is under heavy development. You can monitor the stability of the code via these Hudson pages for oscar4, oscar4-cli, ChemicalTagger, and others. This kind of continuous building should be set up for any scientific code.

Requirements
To follow the instructions below, you will need a working environment with a tool to process zip files (or equivalent), a Java JDK, and Maven. Having wget makes things easier. Oh, and you need a working internet connection.

The pattern is otherwise the same for all of the above tools; I will demonstrate the process with oscar4-cli.

Ubuntu and Debian users can use:

$ sudo aptitude install unzip maven2 \
  openjdk-6-jdk wget

Downloading the source
The source code for the above tools is all hosted on BitBucket, using the Mercurial version control system. However, there is no need to worry about that, because BitBucket provides source drops. At this page you can download a .zip file for the command line utilities, which you can unzip with your favorite tool.
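Concretely, fetching and unpacking a source drop can be done along these lines (the URL and archive name here are assumptions; check the project's BitBucket downloads page for the current one):

```shell
# Fetch a source drop of the oscar4-cli tip revision from BitBucket
# (illustrative URL; the downloads page lists the actual archives)
wget -O oscar4-cli.zip https://bitbucket.org/wwmm/oscar4-cli/get/tip.zip
unzip oscar4-cli.zip
```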

which downloads and unzips the latest version in the repository. (This is where monitoring Hudson comes in; there you can check if there are failing unit tests.)

Compiling the code
With the source code locally available, it is time for Maven to come into action. Again, I have no clue how to do this on Windows (other than with Cygwin, which every Windows user should have installed, if a virtual machine with a full Linux is not an option), but Maven should run the 'assembly:single' target. Or, from the command line:
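Presumably something like the following (the directory name is an assumption; 'assembly:single' is the target mentioned above):

```shell
# From the unpacked source directory (name is an assumption)
cd oscar4-cli
# Build the stand-alone jar; the result ends up under target/
mvn assembly:single
```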

True, the top journals in my field (chemometrics, cheminformatics) do not have very high impact factors, because the field is less eager to add 100+ citations to each journal paper, nor is the field known to be popular enough for Nature, Science, etc.

Two of these are highly cited.

Indeed. I recently blogged about that. Mind you, 46 citations is not highly cited, even though it exceeds the impact factor of Nature and Science.

This is quite impressive for such a young scientist (PhD in 2008).

But, of course, that does not matter. It's surely not about impression, right?

His PhD work (in Netherlands) and his postdoctoral work (in Uppsala) is actually all on the same project ...

This is where the reviewers show some disrespect, I believe. Apparently, they have not taken it as one of their responsibilities to actually check what I have done. My PhD work was partly in the UK (Cambridge), and I have done postdoctoral work in the Netherlands and Germany too.

On the same project? Well, that depends on how you look at it. Surely, cheminformatics, QSAR, statistics, etc., are all the same. Same for crystallography, NMR, etc. One big pile of science. Again, I feel the reviewers took their responsibility of reviewing very narrowly.

... and with the same collaborators.

Wow... that's impressive, right? And I was always thinking that international collaboration was positive. But apparently not if you have long term, successful collaborations. WTF??

The role of the applicant in relationship to Prof X and the other developers is not clear.

OK, I should have made it clearer how the other scientists are involved.

However the backgrounds is definitely adequate for the suggested project but the applicant lack in independency.

(Carefully transcribed.)

This is an interesting point, and it nicely outlines how the current academic system works. As a post-doc you are forced to hop from one funded project to another, hoping to get funding. Until you do, you are working on other PIs' projects with predefined topics.

Project quality

The main focus of the project is software development of Bioeclips in collaboration with X and others.

No, if you read the proposal, the project is about statistical method development, and Bioclipse (not Bioeclips) is used as a platform to make it look like Excel so that the average scientist understands it. That distinction is difficult, even for scholars.

The application is mainly about managing errors in observations and processing, annotation and propagation of these.

Indeed! Well copied from the proposal's abstract.

Expected outcome is identification of processing errors, and potentials for improvement in the data handling.

The reviewers got it almost right. I had not written up clearly enough that the main improvement is finding the source of the error, which we all know is the (biological) experiment and the average scholar's inadequacy at data handling (think Excel).

Although the development might be of real importance the application does not show a significant scientific component, neither from a computational not from a life science perspective.

This quite puzzles me, as we had written it very strongly all over this proposal: metabolomics, metabolomics, metabolomics! With applications including metabolite identification, with experimental partners.

Have they actually read the proposal?

Project quality

The background and competence of the applicant should ensure success.

So, why not fund me? Read on...

The applicant requires compliance of users and buy-in from scientific community.

I guess my work is not cited enough to show that my work is actually used. Anyone using the CDK here?

Although some indication that this will happen is provided...

I guess this refers to the international collaborations I listed in the proposal.

Saturday, November 20, 2010

A casual reader may not know the background of the title of my blog. A bit over five years ago, when I started this blog, I defined chemblaics:

Chemblaics (pronounced chem-bla-ics) is the science that uses computers to address and possibly solve problems in the area of chemistry, biochemistry and related fields. The common denominator seems to be molecules, but I might be wrong there. The big difference between chemblaics and areas such as cheminformatics, chemoinformatics, chemometrics, proteochemometrics, etc., is that chemblaics only uses open source software, making experimental results reproducible and validatable. And this is a big difference with how research in these areas is now often done.

Later, I also identified molecular chemometrics (doi:10.1080/10408340600969601) when I reviewed important innovations in the field, which has, IMHO, a strong overlap with chemblaics. Any reader of my blog will understand that I see semantic technologies playing a very important role here, as Open Standards for communicating Open Data between the bench chemist and the data analyst, using Open Source to allow others to reproduce, validate, and extend that work. Some identified the possibilities the internet brings over 10 years ago, while the use of semantic computing goes back even further.

And what have the big publishers done? Nothing much yet. Not the old ones, not the new ones. There are projects ongoing, and there is a tendency (BioMed Central is starting to deliver Open Data, the Beilstein Institute has been serving RDF for a few years now, the Royal Society of Chemistry has Project Prospect), but most publishers are too late, and not investing enough with respect to their yearly turnover. This is particularly clear if you realize that citizen and hobby scientists can innovate publishing more effectively than those projects (really!). Anyway, I do not want to talk about publishing now, but it is just so relevant: I am not a publisher, but publications are the primary source of knowledge (not implying at all that I think that is the best way; it is not).

Instead, I am a data analyst, a chemometrician, a statistician, a cheminformatics dude, or a (pharmaceutical) bioinformatician, depending on your field of expertise. Really, I am a chemblaics guy: I apply and develop informatics and statistics methods to understand chemistry (and biology) better.

During my PhD it became painfully clear that current science is horribly failing, in many ways:

Firstly, we are hiring the wrong people, because we care more about a co-authored pipetting paper in Nature than about ground-breaking work in J. Chem. Inf. Model. (what journal?! Exactly my point!).

Secondly, we have our brightest scientists (the full-time professors, assuming that some have been hired for the right reasons) spend most of their time on administrative work (like proposal writing and administrating big national/EU projects).

Thirdly, we spend millions (in whatever currency) on large projects which end in useless political discussions instead of getting science done.

Finally, all knowledge from laborious, hard work is placed in PDF hamburgers and lost to society (unless you spend multibucks to extract it again).

There are likely several more, but these four are the most important to me right now.

So, in the years after finishing my PhD research, which was on data mining and modeling molecular data, I have spent much of my time on improving data handling methods in chem- and bioinformatics. Hardly anyone else was doing it (several Blue Obelisk community un-members being prominent exceptions), but someone has to.

Why have I been doing this? Well, without good, curated data sources it is impossible to figure out why my (or others') predictive models are not working (as well as we want them to). Is this relevant? I would say so, yes! The field is riddled with irreproducible studies of which one has no clue how useful they are. Trust the authors who wrote the paper? No thank you; I'd rather verify: I am a scientist, not a cleric. Weirdly, one would have expected this to be the default in cheminformatics, where most stuff is electronic and reproducing results should be cheap. Well, another fail for science, I guess.

So, that explains why I have not recently done so much in chemometrics. Will I return? Surely! Right about now. There is already a paper in press where we link the semantic web to (proteo)chemometrics, and more is to follow soon.

One example, interestingly, is pKa prediction, which has seen quite a few publications recently, yet experimental pKa data is not available as Open Data. Why?? Let me know if you have any clue. Yet pKa prediction seems to be important to drug discovery, as it gets an awful lot of attention (400+ papers in the past 10 years, of which 50+ in 2010!). But this is about to change. Samuel and I are finishing a project that greatly simplifies knowledge aggregation and curation, as input to statistical modeling. We now have the tools ready to do this fast and efficiently. Right now, I am entering curated data at a speed of about 3 chemical structures a minute. That means, given that I need a break now and then, that I can create a data set of reasonable size in a few days. Crowd-sourcing this, a small community can liberate data from the literature in a few days.

This will have a huge impact on the cheminformatics and QSAR communities. They will no longer have any excuse for not making their data available; there is no argument anymore that curation is expensive. This will also have a huge impact on cheminformatics and chemical data vendors. Where Open Source has only had moderate impact so far (several software vendors have already joined the Open Source cheminformatics community), this will force them to rethink their business model. Where they could hide behind curation when it came to text mining initiatives (like Oscar, on which I am currently working), with cheap, expert knowledge curation at hand they will be forced to rethink their added value.

The impact on the CDK should be clear too. We will no longer depend on published models for ALogP, XLogP, pKa, etc. predictions. Within a year, you can expect the CDK project to release the tools to train your own models, and to make choices suitable for your user base. For example, you can make more precise models around the structures your lab works on, or more generic models for large screening projects. Importantly, the community will provide an Open Data knowledge base to start from. Using our Open Standards, you can plug in your own confidential data and make mixed, targeted models.

Is this possible with the cheminformatics of the past 30 years? No, and that's the reason why I have been away from chemometrics for a while.

Thursday, November 18, 2010

One goal of my three-month project is to take Oscar4 to the community. We want to get it used more, and we need a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but they have to be maintained, just like any other piece of code. To make using them easier, we are developing new APIs, as well as two user-oriented applications: a Taverna 2 plugin and command line utilities. The Oscar4 Java API has evolved slightly in the last three weeks, removing some complexity. In this post, I will introduce the command line utilities.

Oscar4
Most people will mainly be interested in the full Oscar4 program, to extract chemical entities. Oscar3 was also capable of extracting data (like NMR spectra), but that has not yet been ported. The OscarCLI program takes input, extracts chemicals, and where possible resolves them into connection tables (viz. InChI).

To extract chemicals from a line of text (e.g., "This is propane."), you do:
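A sketch of what the invocation could look like (the jar file name and exact argument handling are assumptions based on the Maven build described above; run the jar without arguments to see its actual usage):

```shell
# Hypothetical OscarCLI invocation using the jar-with-dependencies
# produced by 'mvn assembly:single'; the version number is an assumption
java -jar oscar4-cli-jar-with-dependencies.jar "This is propane."
```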

Wednesday, November 03, 2010

The fact that Piet Hein said it, gives it an extra (fourth) dimension. Things Take Time.

But, as Peter indicated, we are getting there in cheminformatics. We see commercial entities experimenting with and contributing to Open Source cheminformatics, and the Blue Obelisk reached critical mass a few years ago. Next year is the year of Open Source cheminformatics on the desktop ;)

Sunday, October 31, 2010

I promised the CiTO author, David, my use cases, but have been horribly busy in the past few weeks with my new position, wrapping up my past position, and thinking about my position after Cambridge. But finally, here it is. Based on source code I wrote and released earlier, the first use case I present is the Wordle one, which I showed with manual work in February.

Now that all the data is semantically marked up in CiteULike, I can easily extract all paper titles (or whatever is available in CiteULike) for all papers that cite the first CDK paper (doi:10.1021/ci025584y). Using the JSON interface, I have this Groovy script to extract all titles:

The output is two blocks which I can easily copy/paste into Wordle. Now, I think I heard one can actually download the Java code, so I am tempted to integrate it later, but for now copy/paste will do fine, as the data handling is mostly automated: with a few extra lines I can make such visualizations for any paper I annotated in CiteULike with CiTO.

Search This Blog

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.

Cookies

In the EU there is an upcoming directive requiring websites to warn people about HTTP cookies. This website uses the Blogger.com platform, Google AdSense (not that it actually pays anything significant), and a few scripts to count how often a blog post was tweeted, using Topsy and LinkedIn. These services undoubtedly make use of cookies, which you can disallow in your browser.