
Friday, September 30, 2016

Two weeks ago (already!), the NanoSafety Cluster (NSC) organized two meetings. First, on Wednesday afternoon, there was the NSC half-yearly meeting. Second, on Thursday and Friday, in beautiful Visby on Gotland, there was the 2nd NanoSafety Forum for Young Scientists. I ran an experiment there, which I will blog about later. Here, please find the slides of the presentation about Open Data that I gave on Wednesday:

Oh, and I also presented a few slides about the Working Group 4 activities:

Monday, September 12, 2016

If you want to map experimental data to (digital) biological pathways, you need to know which measured datum matches which metabolite in the pathways (that also applies to transcriptomics and proteomics data, of course). However, if a pathway does not take all its identifiers from a single database, or your analysis platform outputs data with CAS registry numbers, then you need something like identifier mapping. In Maastricht we use BridgeDb for that, and I develop the metabolite identifier mapping databases, which provide the mapping data to BridgeDb, which then performs the mapping.
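To make the idea of identifier mapping a bit more concrete, here is a minimal sketch in Python. It is not BridgeDb and does not use its API; the function names (`build_index`, `map_identifier`) and all identifiers are hypothetical placeholders I made up for illustration. It only follows direct links between two identifiers, nothing transitive.

```python
from collections import defaultdict

# A cross-reference is a (database name, identifier) pair.
# All identifiers below are made-up placeholders, not real database entries.
PAIRS = [
    (("CAS", "cas-placeholder-1"), ("HMDB", "hmdb-placeholder-1")),
    (("ChEBI", "chebi-placeholder-1"), ("HMDB", "hmdb-placeholder-1")),
]


def build_index(pairs):
    """Build a symmetric lookup table from pairs of equivalent identifiers."""
    index = defaultdict(set)
    for left, right in pairs:
        index[left].add(right)
        index[right].add(left)
    return index


def map_identifier(index, xref, target_db):
    """Return the identifiers in target_db directly linked to xref (sorted)."""
    linked = index.get(xref, set())
    return sorted(ident for db, ident in linked if db == target_db)


INDEX = build_index(PAIRS)

# A measurement reported with a CAS number, looked up as an HMDB identifier.
print(map_identifier(INDEX, ("CAS", "cas-placeholder-1"), "HMDB"))
```

In this toy version the mapping data (`PAIRS`) plays the role of the mapping database, and `map_identifier` plays the role of the tool that performs the mapping.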

So, this weekend I released a new mapping database, based on HMDB 3.6, ChEBI 142, and data from Wikidata from September 7. Here are the total numbers of identifiers and the changes compared to the June release for the supported identifier databases:

Friday, September 09, 2016

Those following me on Twitter may have seen the discussion this afternoon. A weird law case went to the European court, which sent out their ruling today. And it's scary, very scary. The details are still unfolding and several media have written about it earlier. It's worth checking out for everyone doing research in Europe, particularly if you are a chem- or bioinformatician. I may be wrong in my interpretation, and I hope to be. I hope even more to be proven wrong soon, but I fear it will not be soon at all. The initial reporting I saw was in a Dutch news outlet, but Sven Kochmann pointed me to this press release from the Court of Justice of the European Union. Worth reading. I will need to write more about this soon, to work out the details of why this may turn out disastrous for European research. For now, I will quote this part of the press release:

Furthermore, when hyperlinks are posted for profit, it may be expected that the person who posted such a link should carry out the checks necessary to ensure that the work concerned is not illegally published.

I stress this is only part of the full ruling, because the verdict is on a combination of arguments. What this argument does, however, is turn around an important principle: you have to prove you are not violating copyright.

Now, realize that in many European Commission funded projects, with multiple partners, sharing IP is non-trivial, ownership even less (just think about why traditional publishers require you to reassign copyright to them! BTW, never do that!), etc, etc. A lot of funding actually goes to small and medium sized companies, who are really not waiting for more complex law, nor more administrative work.

A second realization is that few scientists understand or want to understand copyright law. The result is hundreds of scholarly databases which do not define who owns the data, nor under what conditions you are allowed to reuse it, or share, or reshare, or modify. Yet scientists do. So, not only do these databases often not specify the copyright/license/waiver (CLW) information, they certainly don't really tell you how they populated their database. E.g. how much they copied from other websites, under the assumption that knowledge is free. Sadly, database content is not. Often you don't even need to wonder about it, as it is evident or even proudly said they used data from another database. Did they ask permission for that? Can you easily look that up? Because of the above quoted argument, you are now not allowed to link to that database until you have figured out whether they did. And believe me, that is not cheap.

Combine that, and you have this recipe for disaster.

A community that knows these issues very well is the open source community. Therefore, you will find a project like Debian to be really picky about licensing: if it is not specified, they won't have it. This is what is going to happen to data too. In fact, this is also basically why eNanoMapper is quite conservative: if it does not get clear CLW information from the rightful owner (people are more relaxed with sharing data from others, than their own data!), the data is not going to be included in the output.

I have yet to figure out what this means for my Linked Data work. Some databases do great work and have very clear CLW information. Think ChEMBL, WikiPathways, and also Open PHACTS did a wonderful job in tracking and propagating this CLW information. On the other hand, Andra Waagmeester did an analysis of database license information of life sciences databases: note the number of 'free content' and 'proprietary' databases (top right figure), which are the two categories of databases where the CLW info is not really clear. How large the problem is with illegal content in those databases (e.g. text mined from literature, screenscraped from another database), who knows, but I can tell you this is not insignificant, unless you think it's 99%.

At the same time, of course, the solution is very simple. Only use and link to websites with clear CLW information and good practices. But that rules out many of the current databases, and also supplementary information, where, even more than in databases, the rules of copyright are ignored by scientists.
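As a rough illustration of that "only use what has clear CLW information" rule, here is a small Python sketch. The record layout, the `has_clear_clw` helper, and the database names are all hypothetical. Treating 'free content' and 'proprietary' as unclear is my own assumption, based on the two categories mentioned above.

```python
# License labels treated as "not clear enough" (an assumption for this sketch).
UNCLEAR = {"", "unknown", "unspecified", "free content", "proprietary"}


def has_clear_clw(record):
    """Return True if the record states an explicit license or waiver."""
    license_name = (record.get("license") or "").strip().lower()
    return license_name not in UNCLEAR


# Hypothetical database records, made up for illustration.
DATABASES = [
    {"name": "db-a", "license": "CC0"},
    {"name": "db-b", "license": "free content"},
    {"name": "db-c", "license": None},
    {"name": "db-d", "license": "CC-BY 4.0"},
]

# Keep only the databases we would dare to use and link to.
usable = [db["name"] for db in DATABASES if has_clear_clw(db)]
print(usable)
```

The strict default matters here: a missing or empty license counts as unclear, just like Debian treats unspecified licensing.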

And, honestly, I cannot help but wonder what all the publishers will now do with all the articles published in the past 20 years with hyperlinks in them. I hope for them that those articles don't link to illegal material. Worse, under the above quoted argument they will have to make sure that none(!) of those hyperlinks point to material with unclear copyright.

I'll end this post with a related Dutch law (well, at least for the sake of this post). If you buy second hand goods, and the price is less than something like 1/3rd of the new price, you must demand the original receipt of the first purchase. Because if it is not provided, you are legally assumed to realize the goods are probably stolen. How will that translate to this situation? If the linked scientific database is less than 1/3rd of the cost of the commercial alternative, you may assume it is illegal? Fortunately, this argumentation does not apply.

Problem is, there are enough "smart" people that misuse weird laws and rulings like this to make money. Think of the patent trolls, or about this:

For scientific information this doesn't exist; we have to do with tools like Google Scholar and Google Images. Both are pretty brilliant and allow you to filter on things, besides your regular keyword search. Of course, what we really need is an ontology-backed search, which Google seamlessly integrates under the hood, e.g. using the aforementioned schema.org.

Now, particularly for my teaching roles, I am frequently looking for material for slides, to support my message. Then, Google Images is great, as it allows me to filter for images that I am allowed to use, reuse, and even modify (e.g. highlight part of the image). Now, I know that some jurisdictions (like the USA) have more elaborate rules about fair use in education, but these rules are too often challenged and money, DRM, etc, limit those rights. Let alone scary, proposed European legislation (follow Julia Reda!).

So, I very much welcome this new effort! Search engines have a better track record than catalogs, like the Open Knowledge Foundation's DataHub. Of course, some repositories, like FigShare, are getting so large, to a large extent through very active population by publishers like PLOS, that they may soon become a single point of entry.

Anyway, Elsevier is looking for peer-review, which I give them for free (like I gave them free peer reviews until they crossed an internal, mental line, see The Cost of Knowledge). I can only hope that I am not violating their patent. Oh, and please don't look at the HTML of the website. You would certainly be violating their Terms of Use. They really need to talk to their lawyers; they're making a total mess of it.


This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
