Anyway, when this paper became the most viewed paper in the JChemInf journal and I tweeted about it, I was asked for an update of the linked data graph (the darker nodes are the twelve the LODD task force worked on). A good question indeed, particularly if you consider the name, and that not all of the data sets were really Open (see some of the things on Is It Open Data?). UMLS is not open; parts of SIDER and STITCH are, but not all; CAS is not at all, and KEGG Cpd has since been locked down. Etc. A further issue is that the Berlin node in the LODD network, which hosted many data sets (Open or not), is down. Chem2Bio2RDF seems down too.

Bio2RDF is still around, however (doi:10.1007/978-3-642-38288-8_14). At this moment, it is a considerable part of the current Linked Drug Data network. It provides 28 data sets. It even provides data from KEGG, but I still have to ask them what they had to do to be allowed to redistribute the data, and whether that applies to others too. Open PHACTS is new and integrates a number of data sets, like ChEMBL, WikiPathways, ChEBI, a subset of ChemSpider, and DrugBank. However, it does not expose that data as Linked Data. There is also the new (well, compared to three years ago :) Linked Life Data, which exposes quite a few data sets, some originating from the Berlin node.

I am aggregating data in a Google Spreadsheet, but obviously this needs to go onto the DataHub. And a new diagram needs to be generated. And I need to figure out how things are linked. But the biggest question is: where are all the triples with the chemistry behind the drugs? Like organic syntheses, experimental physical and chemical data (spectra, pKa, logP/logD, etc), crystal structures (I think COD is working on an RDF version), etc, etc. And, what data sets am I missing in the spreadsheet (for example, data sets exposed via OpenTox)?

Friday, March 28, 2014

This week the ChEBI 3rd User Workshop took place, and I presented how WikiPathways is using ChEBI, and how I have been using it in the BridgeDb identifier mapping database for metabolites, and in mapping metabolites to WikiPathways using the ChEBI ontology.

What some of us already suspected is that the Non-Commercial (NC) clause of the Creative Commons (CC) licenses is a killer. A German court has ruled that the NC clause means that the material is only for personal use. And that is literally breaking news! It means that such material is not Open Access in the context of (European) universities. I learned from Lessig's Free Culture (a must read) that academic use falls under fair use under USA law, but as far as I know this is not the case in Europe. It effectively means that all journals using a CC license with the NC clause now officially do not fall under most Open Access directives (AFAICS but IANAL).

Sunday, March 16, 2014

I have ranted often enough about publishing. I have also often enough indicated how publishers (or journals) could improve their act. There is enough to find in the archives of this blog. Even the more innovative publishers have a long way to go. The reason why I blog about this is also the reason why I can be happy with something like the rrdf package (doi:10.7287/peerj.preprints.185v3). Seriously, it is far away from where my heart is: understanding the underlying chemistry of biology. Really, I would rather study how phosphorylation really causes signaling; at some level this is just a protein interacting with another protein, small molecule, or something. But what? Still, the package makes me happy. No one else is doing it; I need it. We all need it to make science more reproducible. We need good tools and we do not need excuses for not doing it right (tm).

And just to make the point, we do need tools like this. We did 20 years ago. And publishers have done way too little. I really understand innovation is slow and expensive. But, come on, use your imagination. I cannot solve everything in the world and rely on others to implement stuff too. And here is an idea.

What if publishers could actually solve this problem? I know plenty of people are talking about it, and give it funny names, like nanopublications. That idea too has existed for more than 20 years now. In fact, CMLRSS is not far from the nanopublication (doi:10.1021/ci034244p). And it was functional. Really, the implementation and standard are not even the issue. The key is adoption. Adoption may be slow, but it must exist. And for adoption to happen, you need commitment. For example, by promising that the time and resources invested in the adoption will have a return on investment. For example, have a guarantee that your solution won't go commercial at some point (causing a vendor lock-in!).

But that something must happen is clear if you return to the science. Have you ever tried to do some theoretical study of some phenomenon? Then you know that data availability is a problem. And this data scarcity is exactly the reason why data has become valuable, causing people to sit on top of it like a hen on her egg(s). If you have ever been involved in getting some good quality data together (ever noticed that much commercial data does not have the data you really need?), you know how expensive data is. Recovering it costs more after the publishing process than before. Really, the original notebook has more information, and is likely more informative than the formal publication.

Not only has the publishing model itself become more expensive than needed (just think about the APC of newer publishers, like PeerJ!), publishers also make access to the data more expensive than really needed.

This is a huge failure in the Western approach to science: we enormously disrespect data.

If you are not convinced, please give me answers to these questions (for "drug", read active ingredient):

how were the CYP experiments performed for the top ten selling drugs and what are the main human transformations?

what are the experimental errors on pKa measurements of the top ten selling drugs (uncharged and singly charged, positive and negative)?

how were the logP values measured for the top ten selling drugs and at what pH?

what are the size distributions of samples of nanomaterials reported in literature?

what are the different forms of a protein (not shape, but in terms of structure; so, phosphorylation states, exact position, relevant SNPs, etc) of the top ten proteins relevant to pancreatic cancer?

If you can answer any of these questions in less than one hour with provenance (list of DOIs and/or PubMed IDs), then I would love to hear that. It would give an estimate of the problem. However, my estimate currently is that you cannot fully answer these questions, and most certainly not within one day. Had publishers taken their goal of knowledge dissemination seriously in the past 20 years, it would have been a lot simpler. But they failed. Why should I trust them to do better in the next 20 years? Meanwhile, with the limited funding I get, I will keep being happy with things I can contribute.

Now, if you do not understand why those details matter, start doing a multivariate statistics course. </rant>

Saturday, March 08, 2014

Three weeks ago the CDK project migrated from Ant to Maven as the primary build tool. That means that my workflow for making and, importantly, reviewing patches is completely turned upside down. Well, that happens.

When there were issues, I always had CDK Nightly as a backup, and this is now replaced by Jenkins; e.g. check this instance at the EBI. This workflow now translates to something like this (the extraction of the results was suggested by John):

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at Maastricht University, studying biology at an unsupervised but atomic level. Open science is my main hobby, resulting in participation in, among many others, Bioclipse, CDK and WikiPathways.
