Monday, May 25, 2015

Update: if you had problems installing this feature, please try again. Two annoying issues have been fixed now.

Third in this series is this post about the Bioclipse plugin I wrote for the OWLAPI library. I wrote this manager to learn how the OWLAPI works. The OWLAPI feature is available from the same update site as the Linked Data Fragments, so you can just follow the steps outlined here (if you have not already).

Using the manager
The manager has various methods, for example, for loading an OWL ontology:

ontology = owlapi.load( "/eNanoMapper/enanomapper.owl", null);

If your ontology imports other ontologies, you may need to tell the OWLAPI first where to find those, by defining mappings. For example, I could do this before making the above call:
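To illustrate the idea behind such mappings, here is a minimal plain-JavaScript sketch. It is not the Bioclipse manager call and not the OWLAPI itself: the function names addMapping and resolveImport, the example IRI, and the local path are all made up for illustration. The only point is that an import IRI is looked up in a table and redirected to a local copy when one is registered.

```javascript
// Hypothetical illustration only: this is NOT the Bioclipse owlapi manager API.
// It mimics what an IRI mapping does: redirect an import IRI to a local copy.
const mappings = new Map();

// Register that the ontology with this IRI can be found at a local path.
function addMapping(ontologyIRI, localPath) {
  mappings.set(ontologyIRI, localPath);
}

// Resolve an import: use the local copy if one is mapped, else the IRI itself.
function resolveImport(ontologyIRI) {
  return mappings.has(ontologyIRI) ? mappings.get(ontologyIRI) : ontologyIRI;
}

// The IRI and the path below are invented for this sketch.
addMapping("http://example.org/imports/bao.owl", "/eNanoMapper/bao.owl");
console.log(resolveImport("http://example.org/imports/bao.owl"));
console.log(resolveImport("http://example.org/imports/other.owl"));
```

An unmapped import simply falls through to its original IRI, which is why mappings only need to be defined for the ontologies that cannot be fetched from where they claim to live.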

Tuesday, May 19, 2015

Apparently I never extended the cdk.cite JavaDoc Taglet to use DOIs from the bibliographic database to create hyperlinks in the JavaDoc. But fear no more! I have submitted a simple patch today to add these to the JavaDoc, and I assume it will be part of the next CDK release from the master branch.

Of course, many papers in this bibliographic database (i.e. this cheminf.bibx file) do not have DOIs yet :/

Of course, you can help out here! The only thing you need is a web browser and some knowledge of how to look up DOIs for papers. Just check this blog post (from Step 4 onwards) and line 260 in cheminf.bibx to see what a DOI addition to a BibTeXML entry should look like.
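For those wondering what "creating a hyperlink from a DOI" boils down to, here is a minimal sketch in plain JavaScript. It is not the taglet code (that is Java and lives in the CDK); the function names doiToLink and escapeHtml are made up, and the sketch assumes the public resolver at https://doi.org/ and a simple "10.registrant/suffix" shape for the DOI.

```javascript
// Hypothetical helper, not the cdk.cite taglet: turn a DOI into an HTML link.

// Escape the characters that matter in HTML text content.
const escapeHtml = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

function doiToLink(doi) {
  // Strip an optional "doi:" prefix, as used in running text.
  const bare = doi.trim().replace(/^doi:/i, "");
  // Assumed shape: "10.", a numeric registrant code, a slash, and a suffix.
  if (!/^10\.\d{4,9}\/\S+$/.test(bare)) {
    throw new Error("Not a DOI: " + doi);
  }
  // Percent-encode URL-unsafe characters, but keep "/" readable.
  const url = "https://doi.org/" + encodeURIComponent(bare).replace(/%2F/g, "/");
  return '<a href="' + url + '">doi:' + escapeHtml(bare) + "</a>";
}

console.log(doiToLink("doi:10.1038/520585a"));
```

Entries without a DOI would simply keep their plain-text citation, which is exactly the gap that adding DOIs to the database closes.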

Sunday, May 17, 2015

Nature wrote a piece on data sharing (doi:10.1038/520585a). It remains a tricky area to write about, particularly terms like public access. Researchers are still a bit shy about sharing data, in some fields more than in others, and for valid reasons. Data sharing is a choice; it is something you do to get something in return. The return you get on your investment can vary, for example:

goodwill (e.g. from your employer or funder)

others will donate data to the same resource to benefit your research (research needs some critical mass)

it can be enjoyable

the repository where you contribute your data adds value (e.g. by linking to other resources)

others can find your data more easily, leading to more citations of your publications

after using Open Data yourself (e.g. pdb.org), you would like to return the favor

I am probably missing a few. On the other hand, you may miss out on other opportunities. For example, your data could have been part of an IP-based business model, in which you are the only one able to use that data to solve/answer questions.

As said, there are many good and valid reasons for either option. It is an option, it is a choice.

The Nature News article has this lead that misled me:

Initiatives to make genetic and medical data publicly available could improve diagnostics — but they lose value if they do not share with other projects.

The article, however, then discusses a few mechanisms used for data sharing, but I could not spot one that had anything to do with "publicly available". So, I left this comment with the editorial and with PubMed Commons:

Like Open Access, "sharing" is a meaningless term if it is not linked to meaningful rights. The problems outlined in this paper result from the fact that there may be a wish to share data but only if it allows you to take back the data. Private, custom data licenses do just that. There is nothing wrong with this kind of sharing, but it must not be confused with Open Data. It must not be confounded with terms like "publicly available", because if it needs a signature, it's not publicly available. That makes the lead of this article quite misleading.

For public or open data, three basic rights are part of the social agreement between the data owner (yes, a fact in many countries: database rights, etc.) and the data user. These rights are: 1. make a copy, 2. make modifications, and 3. reshare (under the same conditions). With a license (or waiver) that gives these rights automatically to the receiver, there is no need for signatures. It also allows anyone to make the mappings that are required to convert one format into another.

BTW, the image I used in this post is from a paper by Roche et al. from about a year ago (doi:10.1371/journal.pbio.1001779). I have not read that one yet, but it looks like an interesting read too, just like the Nature editorial.

Saturday, May 16, 2015

Update 2015-06-04: the authentication with the Google Drive has changed; I need to update the code and am afraid I missed the point, so that the below code is not working right now :(

Similar to the previous post in this new series, this post will outline how to make use of the Google Spreadsheet functionality in Bioclipse 2.6.2. But before I provide the steps needed to install the functionality, first consider this Bioclipse JavaScript:

Because that's what this functionality does: read data from Google Spreadsheets. That opens up an integration of Google Spreadsheets with your regular data analysis workflows. I am not sure if Bioclipse is the only tool that embeds the Google client code to access these services, and can imagine similar functionality is available from R, Taverna, and KNIME.

Getting your credentials

The first call to the google manager requires your login details. But don't use your regular password: you need an application password. This specific, sixteen-character password needs to be created manually using your web browser, following this link. Create a new App password ("Other (Customized name)") and use this password in Bioclipse.

Installing Bioclipse 2.6.2 and the Google Spreadsheet functionality

The first thing you need to do (unless you already did that, of course) is install Bioclipse 2.6.2 (the beta) and enable the advanced mode. This is outlined in my previous post up to Step 1. The update site, obviously, is different, and in Step 2 in that post you should use:

Yes, the links only seem to get longer and longer. Just continue to the next step and install the Google Feature:

That's it, have fun!

Oh, and this hack is not so recent. The first version of the net.bioclipse.google plugin and matching manager, as used in the above code, dates back to January 2011, when I had just started at the Karolinska Institutet. But the code to download data from spreadsheets is even older, and goes back to 2008 when I worked with Cameron Neylon and Pierre Lindenbaum on creating RDF for data being collected by Jean-Claude Bradley. If you're interested, check the repository history and this book chapter.

Friday, May 15, 2015

Originally a series I started in the CDK News, later for some issues part of this blog, and then for some time on Google+, CDK Literature is now returning to my blog. BTW, I created a poll about whether CDK News should be picked up again. The reason why we stopped was that we were not getting enough submissions anymore.

For those who are not familiar with the CDK Literature series, the posts discuss recent literature that cites one of the two CDK papers (the first one is now Open Access). A short description explains what the paper is about and why the CDK is cited. For that I am using the CiTO, of which the data is available from CiteULike. That allows me to keep track of how people are using the CDK, resulting, for example, in these wordles.

I will try to pick up this series again, but may be a bit more selective. The number of CDK-citing papers has grown considerably, resulting in at least one new paper each week (indeed, not even close to the citation rate of DAVID). I aim at covering ~5 papers each week.

Ring perception
Ring perception has evolved in the CDK. Originally, there was the Figueras algorithm (doi:10.1021/ci960013p) implementation, which was improved by Berger et al. (doi:10.1007/s00453-004-1098-x). Now, John May (the CDK release manager) has reworked the ring perception in the CDK, also introducing a new API which I covered recently. Also check John's blog.

Screening Assistant 2
A bit longer ago, Vincent Le Guilloux published the second version of their Screening Assistant tool for mining large sets of compounds. The CDK is used for various purposes. The paper is already from 2012 (I am that much behind with this series) and the source code on SourceForge does not seem to have changed much recently.

Figure 2 of the paper (CC-BY) shows an overview of the Screening Assistant GUI.

Similarity and enrichment
Using fingerprints for compound enrichment, i.e. finding the actives in a set of compounds, is a common cheminformatics application. This paper by Avram et al. introduces a new metric (eROCE). I will not go into details, which are best explained by the paper, but note that the CDK is used via PaDEL and that various descriptors and fingerprints are used. The data set they used to show the performance is one of close to 50 thousand inhibitors of ALDH1A1.
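To make "enrichment" a bit more concrete, here is a minimal sketch of the classic enrichment factor in plain JavaScript. This is explicitly not the eROCE metric from the paper, just the textbook ratio of the hit rate in the top of a ranked list to the hit rate in the whole set. The function name enrichmentFactor and the toy ranking are made up for illustration.

```javascript
// Classic enrichment factor, NOT the eROCE metric from the paper.
// ranked: array of booleans (true = active), sorted best score first.
// fraction: the top fraction of the list to inspect, e.g. 0.2 for the top 20%.
function enrichmentFactor(ranked, fraction) {
  const total = ranked.length;
  const totalActives = ranked.filter(Boolean).length;
  // Undefined when there is nothing to rank or nothing to find.
  if (total === 0 || totalActives === 0) return NaN;
  // Always inspect at least one compound.
  const topN = Math.max(1, Math.round(total * fraction));
  const topActives = ranked.slice(0, topN).filter(Boolean).length;
  // Hit rate in the top selection divided by the hit rate overall.
  return (topActives / topN) / (totalActives / total);
}

// Toy ranking of ten compounds; the activity labels are invented.
const ranked = [true, true, false, true, false, false, false, false, false, false];
console.log(enrichmentFactor(ranked, 0.2));
```

A value above 1 means the ranking puts actives near the top more often than random picking would; a value around 1 means the fingerprint-based ranking is not helping.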

The International Chemical Identifier
It is only because Antony Williams advocated the importance of the InChI in these excellent slides that I list this paper again: I covered it here in more detail already. The paper describes work by Sam Adams to wrap the InChI library into a Java library, how it is integrated in the CDK, and how Bioclipse uses it. It does not formally cite the CDK, which now feels silly. Perhaps I did not add it because of fear of self-citation? Who knows. Anyway, you find this paper cited on slide 30 in the aforementioned presentation from Tony.

Predictive toxicology
Cheminformatics is a key tool in predictive toxicology. It starts with the assumption that compounds of similar structure behave similarly when coming in contact with biological systems. This is a long-standing paradigm which turns out to be quite hard to use, but has not been shown to be incorrect either. This paper proposes a new approach using Pareto points and used the CDK to calculate logP values for compounds. However, I cannot find which algorithm it is using to do so.

Cheminformatics in Python
ChemoPy is a tool to do cheminformatics in Python. This paper cites the CDK just as one of the tools available for cheminformatics. The tool is available from Google Code. It has not been migrated yet, but they still have about half a year to do so. Then again, given that there does not seem to have been activity since 2013, I recommend looking at Cinfony instead (doi:10.1186/1752-153X-2-24): it exposes the CDK and is still maintained.

It's been a while since I blogged about a release of my "Groovy Cheminformatics with the CDK" book, but not too long ago I made another release, 1.5.10-0. This was also the first one with white paper, and updated for the latest CDK development release.

There are two versions (and always check the special deals, e.g. today you can use UNPLUG10 to get an additional 10% off the below prices):

Sunday, May 03, 2015

Event 1
The Nature Publishing Group (NPG) has launched a new journal, which you probably did not miss. There is a founding editorial whose title, From mechanisms to management (doi:10.1038/nrdp.2015.1), states the goal of the journal. Very noble and very needed, indeed! They write:

Each Primer article includes the same major sections: epidemiology, mechanisms, pathophysiology, diagnosis, screening, prevention, management and patient quality of life.

They complement the articles with PrimeViews and even animations:

Together, we hope that the Primer and PrimeView will provide readily accessible introductions to each topic for readers from all disciplines.

Very exciting! The mechanistic diagrams in the papers are perhaps even better, but, it wouldn't be a proper chem-bla-ics post had I not something to bitch about. And I do; read on.

In related news, about a year ago, Patricia Zaandam worked in our group on pathway analysis related to malaria. At the time, we selected human data from ArrayExpress because of the abundance of human pathways in WikiPathways (>600 now, of which the Curated Collection and Reactome Approved are subsets). So, on a weekend where I really needed a break from working and with some time free, I decided to make that pathway. One of the first observations was that you cannot create Plasmodium pathways on WikiPathways yet. Second, we also do not have a BridgeDb gene identifier mapping database for this organism either. But that is not needed for drawing the pathway.

So, I am digitizing the pathway from the various sources that I can find, added MMV008138, and will probably add more malaria drugs and drug leads along the way. The idea of Patricia's project last year was indeed to look at possible drug targets. This resulted in this current outcome (with MMV008138 highlighted in red):

Thoughts
The new NPG journal realized we need high quality summaries, and they are correct. This is why the periodic table of elements has been so useful, and the purpose of physical laws expressed as mathematical equations: it puts emphasis on what we think matters. This is also why I believe WikiPathways is so important.

But that's where the parallel between WikiPathways and NatRevDiseasePrimers about ends. The goal of WikiPathways is not just to summarize the knowledge, but to make it manageable. We are talking about data management here. I don't care that much about nice graphics; if we really want science and industry to move forward, then we cannot hide behind a knowledge publishing system that doesn't scale and that doesn't integrate. That is not the kind of management we need.

New readers of my blog - welcome! - can browse my past writings to read what the publishing industry should have done. I have explored many different solutions, and only a few of them are being picked up. The Nature Publishing Group has repeatedly experimented with new technologies to make the flood of knowledge manageable, and I find it rather disappointing that this editorial does not manage to go beyond nice graphics. I hope the journal will quickly pick up speed, and add the missing machine readability and APIs. Because a new journal is for years, and we really cannot wait another 15 years.

I am not claiming this new journal is not useful, but it could have been so much more.

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
