A Scientist and the Web


Monthly Archives: August 2010

The Green Chain Reaction will soon be generating a lot of high quality structured data. The question is how and where to store this. To give an idea of the scope let me illustrate this with the patent data.

The European Patent Office publishes about 100 patents each week in the categories that we are interested in. Our current software downloads the index, extracts all patents, and selects those which are classified as chemistry. Each patent contains anywhere between 5 and 500 files (the large number is because the chemical structures are represented as graphical images, usually TIFFs). So this means about 10,000 files each week, in a well structured hierarchy. The absolute size is not large, and is about 100 MB per index. We arrange the raw data and processed data in a directory structure for each index such that it can easily be exposed on the web. Every document will have a unique identifier, so that it is straightforward to transform this into URIs and URLs. This means that we will be able to create Linked Open Data in a straightforward manner.
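As a rough sketch of how a structured directory maps to stable URIs, the snippet below shows one possible convention. The base URL, identifier format, and directory names are illustrative assumptions, not the project's actual layout.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PatentLocator {
    // Placeholder base URL - the real host would be wherever the data is served.
    private static final String BASE_URL = "http://example.org/patents";

    /** Local directory for one patent within a weekly index, e.g. data/2010-W33/EP1234567 */
    public static Path localPath(String weekIndex, String patentId) {
        return Paths.get("data", weekIndex, patentId);
    }

    /** The same hierarchy exposed on the web gives every document a unique URL. */
    public static String url(String weekIndex, String patentId, String file) {
        return BASE_URL + "/" + weekIndex + "/" + patentId + "/" + file;
    }

    public static void main(String[] args) {
        System.out.println(localPath("2010-W33", "EP1234567"));
        System.out.println(url("2010-W33", "EP1234567", "description.xml"));
    }
}
```

Because the identifier appears identically in the file system and in the URL, turning a directory listing into Linked Open Data becomes a mechanical transformation.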

This shouldn't terrify anybody and in fact I routinely hold subsets of this on my laptop. It's simply a set of structured directories which can be held on a file system and exposed as web pages. There is no need for relational databases or other engines to deliver them. (Of course we shall build indexes as well, which may require engines such as triple stores).

The question is where to store it? We've been having discussions recently as to whether data should be stored in domain-specific repositories (DSRs) or in institutional repositories (IRs) or in both or in neither. This is where we'd like your help.

I don't know how to store five million web pages in an institutional repository. It ought to be easy. (Jim Downing and I tried with 200,000 files in DSpace and it was a failure, because of the insistence on splash pages, which destroy the natural structure.) It's critical to store them as web pages so that they are indexed by the major search engines. We shall also index them chemically and by data. It's obviously a valid type of digital hyperobject to store in a repository, and our request must be similar to many others that scientists are likely to make.

We could also store them in a domain repository. I don't know any Open domain repository for chemical patents (there are many closed ones and a few that are free but not open). It's possible that we could create an equivalent service to the one we provide for Crystaleye (http://wwmm.ch.cam.ac.uk/crystaleye). However this does not address the problem of long term archiving (although assuming this experiment is successful I don't think there will be any problem in finding people who wish to help.)

Or we could store it through the Open Knowledge Foundation and its CKAN repository of metadata. CKAN is not normally used for storing data per se, so this would be a departure and the OKF will need to discuss it. It wouldn't be my first choice, but it's certainly better than not storing the results at all.

Or we could store it through something like BioTorrent (http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0010071 and http://www.biotorrents.net/ ). This is a new and exciting service which tackles the problem of sharing open data files in a community. One of its purposes is to solve the problem of distributing very large files, but it may also be suitable for distributing a very large number of small files – I don't yet know but I'm sure I will get feedback from the community. If this is the best technical solution then I don't think I would have much difficulty persuading them that chemical patents were a useful source of linked open data for the Bio community.

Or some other organizations that I haven't thought of might offer to host the data. Examples could be the British Library, or UKPubMedCentral (UKPMC), or a bioinformatics institute, ...

... or you.

It would be a major step forward in Open data-driven science if we could find some common answers to this problem.

We've made great progress over the last week on the Green Chain Reaction. Our progress is all being recorded on the Etherpad provided by the Open Knowledge Foundation at http://okfnpad.org/solo10. Have a look! And feel free to add anything - it's very easy.

Each section here will contain an invitation to participate.

CODE

Dan and Mark have worked very hard in showing that the system works and documenting what is necessary. I am regularly feeding in new bits of code and we are very close to having a system which will extract and analyse chemistry from published documents. The test bed is Acta Crystallographica E, which is open, with about 10,000 reactions. The main data will come from patents, and I have been modifying David Jessop's downloader and analyser so that they can be distributed.

We're now looking for computer-savvy volunteers to see if the code can be widely distributed in its current form. Please volunteer by signing up on the Etherpad.

CONTENT

Mat Todd and Jean-Claude Bradley have already contributed material (via their links) and we will soon be analysing this in detail. Mat's posted a number of places where we might get additional content and some of these will be straightforward as it is clear that the content is Open. However a number of offers, such as from Chemspider, are formally copyrighted and we will need explicit permission from the "owners" to make them open.

We'd like to hear from anyone who has chemical reaction content that we can extract and that is completely open. We'd particularly like to hear from publishers who would like to take part in this high-profile activity to show the value of open data. And we'd also like to hear from organisations, such as government agencies, whose data is open by default.

ISITOPEN

Heather Piwowar and I are creating a series of letters to enquire of data providers whether their data is open. I hope to draft the principles today in a PantonPaper, and will blog this probably in the afternoon.

Contributions on describing and treating data as open will be particularly valuable. Feel free to join Heather and myself on the Etherpad. Any pointers to existing protocols and manifestos on Open Data will be especially welcome.

OTHER HELP

We've had some other offers which are much appreciated. These include links to other resources, help with the Green concept, and a lot more. You may have ideas on what is possible in the next few weeks so we'd love to hear them.

Please make other suggestions and offers of help that we may not have thought of.

TIMELINE

I'm expecting that by the end of today I will have managed to modularise the parts of David's code which will be used in this project. They will be described in the Etherpad, and deposited in several Bitbucket projects.

Also by the end of today I expect to have drafted the principles of data extraction from scientific documents on websites. Heather and I will work on this so that we can present these to document providers such as publishers and get their early response.

I also hope to be able to spend some time on creating semantic chemistry definitions that will result from parses. I shall do this on the Etherpad and will welcome contributions.

We've been bouncing emails around on the Green Chain Reaction coding part of the project and we've got to the stage where looking back through emails to see where bits of code got copied in, who was on the email header, etc. is a nightmare. Email is awful for collaboration. Awful.

Googledocs is better, but you have to invite people, so there's an entry barrier when you don't know who your collaborators are.

Etherpad is a wonderful collaborative tool for community projects. It takes a tiny bit of effort to set up (i.e. you have to have a machine, and you have to be allowed to set up a server on it). But don't let that put you off, read on ... it's already done.

It's completely open to anyone in the world. [Surely it will get spammed? Not much point, since it's ephemeral and has no outward pointing links. Any vandalism is immediately apparent and can be reverted. ] So I'm setting one up for the Green Chain Reaction collaborators at #solo10.

And I can do this because the OKF makes Etherpads available for its projects and collaborators. After getting Rufus's OK (not needed technically, but polite) it takes 5 seconds to set one up:

Anyone – and that means YOU! – can edit the material. Two or more people can edit it at once! You can see the others' keystrokes appearing as you type! (If you both try to edit the same part of the document you can backtrack). There's a timeslider that literally remembers everything typed in (and deleted). It's almost like a Memex.

So from now on most of my contributions to the project will be on the Etherpad. It's a great way of creating communal documents – READMEs, code snippets, expected outputs, failing tests, useful URLs, letters to publishers on IsItOpen, and so on.

And at the end of it you can have a polished document which could be mailed, pasted into a Wiki, turned into a PDF...

It's not Google Wave, it doesn't do multimedia, it's Flash-agnostic, ... It does honest-to-goodness text.

Aren't most supplementary information files freely available? There must be tons of synthetic procedures hiding in there.

From a JACS paper: "This [Supporting Information] material is available free of charge via the Internet at http://pubs.acs.org."

As regards explicit permission…good luck finding any information at all about that.

The problem is copyright, followed by contract. By default copyright prevents you from doing anything with a document. And it's the law, so that not surprisingly reputable institutions such as universities are absolutely clear that it must not be broken.

It has been argued that copyright law even forbids your saving a copy of any downloaded article on your hard disk. It's almost certainly true that if you have thousands of PDF articles you're breaking copyright in the eyes of some interpreters. And I should stress that this is an area where almost every question is unanswered. The only responsible default answer is "you can't do that."

This is an enormous impediment to data-driven science. Enormous. By default we cannot extract or use any data from the published literature unless there is explicit permission. It's got to the stage where this problem is seriously holding back science.

You might argue that I'm being pedantic or splitting hairs. I'm afraid that's not true. Shelley Batts reproduced a single graph from a closed access article (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=338 ) and was sent a threatening legal letter from the publisher (Wiley). There was an explosion in the blogosphere, and Wiley dropped their threats against Shelley. However they have never explicitly given permission to anyone to post any closed material from any journal, and we must assume that they are capable of carrying out the same threats in the future, although they know how the blogosphere will react.

I myself have caused the University of Cambridge to be completely cut off from a publisher on two separate occasions (ACS and Royal Society of Chemistry). In both cases what I did was completely legal, but the publisher's automatic systems assumed that I was stealing content and automatically closed down all access to anybody in the University of Cambridge for an indefinite period. I do not know on what basis this was done, but I assume that it was for a (wrongly) interpreted infringement of the specific contract that the University of Cambridge had with the publisher.

You should note that almost all universities sign contracts with publishers which are more restrictive than copyright. These contracts are often covered by secrecy agreements, and so the average scientist in the university probably has no access to the details of the contract that they may infringe. The contracts in many cases limit the amount of material a scientist may access and how frequently. Given that data-driven science requires access to large amounts of material on a frequent basis, it is almost certain that attempts to carry it out will involve a breach of contract at particular institutions.

I appreciate that the current business model of publishing is that closed access publishers "own – or control – the content". I personally do not agree that this is a good thing for science (in fact I believe it is highly detrimental) but I do not intend to break the law (at present) and I do not intend to cause my university to break its contract. However this means that readers are regarded as potential content thieves, and the publishers put in place expensive content management and access systems partly in order to police the activities of readers. It is more important to most closed-access publishers to protect content than to disseminate science, and they err on that side.

The default position, therefore, is that one cannot automatically download and re-use content on the web without permission. It is very disappointing that none of the major publishers (other than of course the CC-BY Open Access publishers who are excused from all of these arguments and discussion) have changed their attitude or practice in making content available to scientists for the practice of modern science. I have on several occasions written to publishers asking whether I can use material and almost invariably I get no reply. It is difficult for me to see the closed-access publishers as part of an alliance which is trying to improve the way that data-driven science is done.

I should comment also that Openly reusable data must lead to higher quality science, in that it is easier to pick up errors. For example our Crystaleye system (or rather Nick Day's system http://wwmm.ch.cam.ac.uk/crystaleye ) not only aggregates all publicly visible crystallographic data on publishers' web sites but also validates it as it processes it. This validation is part of our recently funded #jiscxyz bid, where we are working to develop a means of validating scientific data before it appears in print. Just recently I have been pointed at two cases from very high profile closed-access publishers where the crystallography has been inappropriately used to support what is clearly invalid science. If the data had been openly available these errors would not have happened. There are also notable cases where the blogosphere has rightly criticized published science on the basis of invalid or incorrectly interpreted data.

Despite my campaign for greater openness in publication, I am not against a commercial publishing industry. I am in favour of a responsible publishing industry which makes efforts to innovate and support science. I am disappointed that publishers have not addressed the question of re-use of data and I am saddened by the fact that they do not regard readers' emails and enquiries such as mine as worth replying to. It is not just me – it is now two years since Chemspider enquired to ACS about the rights to re-use ACS supplemental data and as far as I know ACS has not given a formal reply.

I am, however, an optimist and I believe the publishers will now approach this problem constructively and start to give clear information. Since data are not only valuable for citations but also necessary for the proper practice of science, I'm going to assume that the major players in chemistry will be keen to give definitive answers on this problem. If nothing else, it is actually in their best interests – being seen to be helpful and to increase their own citations is hardly a barrier to doing good business.

Therefore Heather and I will be preparing a set of letters to send to all the publishers under the IsItOpen tool. This allows the precise request and precise response to be publicly visible and act as a useful definitive record. It therefore saves both the readers and the publishers from having to continually reiterate their position. We appreciate that it may in some cases not be possible to give complete answers but we would certainly expect the courtesy of a timely reply.

We hope that it will be possible to collect all the replies from the major publishers before the Science Online meeting and hope that this will be a valuable contribution to the delegates and those who are following the proceedings from a distance.

I'm waiting until the rain stops before I cycle in to work, so here is an update on The Green Chain Reaction. It's going incredibly well. The energy of those who have already volunteered is enormous, and so is the speed with which they've picked up the ideas and the competence and initiative that they have used. Absolutely incredible. An important byproduct of this experiment is to show how universal the ideas of collaboration are, and how the tools have developed over the last few years so that they are straightforward to use.

Mark and Dan have made spectacular progress on the code. We have produced a system in house which we use regularly for downloading and analysing publicly visible and Open scholarly material. We have a build system which involves about 30 different projects and libraries and is quite complex. I should pay tribute to Jim Downing, Sam Adams, Nick Day, and others in our group for having set this up. I don't think we could have done it without the infrastructure that we have built, and we'd be delighted to talk with other people who are interested in managing large and varied amounts of scholarly information. What is really exciting is that Mark and Dan have understood what we're doing, have written some very nice documentation on the science online wiki, and have robustified the procedures. Because they are working on other systems, including other operating systems, this is an excellent test of portability. Quite shortly we should be able to create a package which many of you will be able to use in this project.

Heather Piwowar has made wonderful progress on the IsItOpenData resource. We are shortly going to be creating template letters to send to those people who expose data on the web, and will be sending these to the owners of most of the sources. It may seem strange to enquire of an open access publisher whether their data is open, but this letter will give us a chance to thank them and also an opportunity for them to respond to the project. So if you're a publisher or a site that exposes open data, don't be surprised if you get a request asking IsItOpen. And thank you.

Mat Todd and Jean-Claude Bradley have already posted their notebooks and lab reports. I'll be looking at these in detail later this morning, I hope. Mat has also posted a list of potential sources of Open chemistry, and will be using these where it is clear that they are open. Some of those sources are not explicitly open, and Heather and I (and anyone else who wishes to help) will be composing letters and sending them through IsItOpen. We hope that we will get a quick enough response to allow us to use these sources for the project, and if so this will be fantastic.

Because these volunteers have made such rapid progress much of this is in a state where you can join in. This is part of the purpose of the experiment, allowing anyone to get a feel for what data-rich science is like. So for example you will be able to install the package for downloading and analysing Open Data.

ONE CAVEAT – SOME OF THE TOOLS CARRY OUT AUTOMATIC DOWNLOADING. WHERE POSSIBLE WE WISH TO DO THIS ONLY ONCE TO AVOID DENIAL-OF-SERVICE AND MESSING UP ACCESS COUNTS. SO WE WOULD HOPE TO CACHE DATA AFTER IT HAS BEEN DOWNLOADED. ANY SUGGESTIONS ON HOW AND WHERE WE CAN DO THIS (AND OFFERS OF HELP) WOULD BE WELCOMED. NOTE THAT THE DATA WILL ALL BE EXPLICITLY OKD-COMPLIANT SO CANNOT BE SERVED FROM A LESS-THAN-OPEN REPOSITORY.

Initial comment here - sample size here might be small, and hence the kinds of reactions looked at may skew the analysis a little. We're focussing on one main reaction type, and so is Jean-Claude. Not exclusively, just a preference. Obviously the wider the sample set the better the analysis.

To this I will add:

About 7,000-9,000 syntheses from Acta Crystallographica E (some will be simply "we took this from a bottle" but most are actual preparations). Licence: CC-BY

Somewhere about 100,000 reactions per year in patents. We expect the historical quality to be textually lower. Licence PUBLIC DOMAIN

A number of gifted theses in Cambridge

A number of theses in University repositories. Most licences CC-DONTKNOW, some CC-SA, some CC-BY. Smallish numbers (guess about 100-1000 theses if we work hard. A really good opportunity for collaboration)

PMR: The main aspects that will be important are:

What are the explicit permissions on the site?

This is more important than anything else. We cannot use material that is visible on the web unless there is explicit permission. This is an ideal opportunity to use IsItOpenData. Of the sources above I shall assume 1 and 2 are Open; 3 is unknown until we hear from Chemspider; 4 is unknown (most blogs are CC-BY or CC-SA but we have to check) – we cannot use CC-NC; 5 and 6 are fully CC-BY.

What is the format? PDF is the worst, but probably usable for this project. XHTML is normally excellent. *.doc (theses) is also excellent.

How is the information structured? If it's in diagrams it's very difficult. If it's in running text it depends on the style. Formal reports of single compounds are often quite tractable. Highly detailed accounts are potentially much more valuable but harder to parse as there is less consistency.

What information is given? Acta E and Patents do not normally give yields (a pity). Theses are usually very rich.
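As a small illustration of why formal experimental text is tractable, the sketch below pulls quantities such as "(1.00 mmol)" out of running text with a regular expression. The real extraction will use much more sophisticated NLP; the pattern here is only a toy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmountExtractor {
    // Matches parenthesised amounts like "(1.00 mmol)" or "(5 g)".
    private static final Pattern AMOUNT =
            Pattern.compile("\\((\\d+(?:\\.\\d+)?)\\s*(mol|mmol|ml|g|mg)\\)");

    /** Returns every quantity found in the text, as "value unit" strings. */
    public static List<String> amounts(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            found.add(m.group(1) + " " + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "To glycine (1.00 mol) and potassium hydroxide (1.00 mmol) in 10 ml of methanol";
        System.out.println(amounts(s)); // [1.00 mol, 1.00 mmol]
    }
}
```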

I'll start liaising with Heather about asking for formal permission on IsItOpen.

We are delighted that Heather Piwowar has offered to help on the project. This is especially exciting as she says "I know nothing about chemistry" – I doubt that's absolutely true but I'll take her at her word. This means that anyone can help. Already she is organizing the Wiki and starting to increase the use of the IsItOpenData resource (http://www.isitopendata.org/ ). I'll talk more about that later...

The Green Chain Reaction is a ground-breaking, innovative, global experiment. It will apply open data and citizen science to rapidly investigate:

"Are chemical reactions in the literature getting greener?"

Background work will be done in the next month, and some of the investigation will take place in real time at Science Online London 2010. As such the project will be highly visible.

Do you believe in the power of open data, open science, citizen science, and fun sprints around an important problem? Do you talk about it? Then join us.

Yup, us. I'm helping. I'm putting my minutes where my mouth is, and reallocating a few hours in the next month to the cause. I know nothing about chemistry. That is no problem, everybody can help. I'll probably focus on writing to publishers asking for clarification about whether their publications can be data-mined.

You can help extract information from full text publications, test some software, help flesh out the wiki, tweet about the experiment, contribute ideas, etc.

Don't wait, check it out now and participate with some minutes. This sprint is only on for a month…

ps Kudos to Peter Murray-Rust, Simon Hodson, and the conference organizers for initiating this project. Innovative, informative, and exciting.

I hope this helps you also to jump in...

Heather is, of course, a star in her own right as she has seminally shown that if you share your data you get more citations. The first step to sharing your data is to find it. Then you have to make it available and if you post it Openly on the Web, with a CC0 or PDDL license then anyone can find it and re-use it. When they re-use it, they cite it. Bingo!

The International Union of Crystallography are showing this enlightenment by exposing their data and we thank them for it. Actually it's much more than that as they have also worked tirelessly to try to persuade other publishers to expose their data. And to the credit of the ones that agree, they have also exposed their data. So kudos to the Royal Society of Chemistry and the American Chemical Society for making their crystallographic data Open. As a result I expect they get more citations.

Those publishers also expose some of their chemical syntheses, and we'd like to follow that up in this project, because it would help not only here but also in the more general goal of better-quality science. Here are some simple mantras:

Open Data is Shared Data

Open Data means Better Science

Shared Data gets More Citations

Better Science gets More Citations

And so

Making Data Open is Good for Everyone

If Heather finds time among the Wiki tending she can point to more convincing figures.

Viewed in the light of normal projects with Gantt charts, milestones, etc. the GCR project is completely mad. There is very little formal support other than the excellent wikis provided by the conference organizers. It relies almost completely on volunteer contributions over which we have no moral or other control. People promise what they can when they can, but often they find that they can't deliver. There's no shame in that whatever. We all have day jobs, including myself. It's just that it's such a wonderful opportunity to do something new and exciting.

What is fantastic is that we have already had a number of really valuable offers, and I have no doubt that we will continue to get more.

I could reasonably count on Jean-Claude Bradley and Mat Todd and today I shall be exploring how they manage reactions and seeing if we can come up with a common representation. This will also be closely coupled to our own work on the JISC funded AMI project. But the other offers have been unexpected and delightful.

The real value-add of data archiving, though, is in the potential for more efficient and effective scientific progress through data reuse. There have been many calls to quantify the extent and impact… to do a cost/benefit analysis. An estimate of value of reuse would help to make a business case for repository funding, an ethical case for compelling investigators to bear the personal cost of sharing data, and clarify that sharing even esoteric data is useful — as the Nature Neuroscience editorial Got Data? puts it, "After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay."

The GCR project is looking at re-use and Heather's contribution will be invaluable.

We've also had two amazing offers of computing support. I have already mentioned Dan Hagon, but yesterday we also got an offer from Mark W:

I'm no chemistry expert (more bioinformatics) but can contribute Java coding and/or computing power. Please get in touch if I can help with either!

This is brilliant. I have put Dan and Mark in touch and pointed them at the first batch of data and code to see if they can get this working. They both have experience with high throughput computing, and this will be magnificent for the project.

We've also had three contributions about the greenness of the project. I have already mentioned Rob Hanson and the tools that he has built.

Please keep the contributions coming. Today I shall concentrate on making sure that the next batch of code is likely to work, and also that we have a good formal representation for the chemical reactions.

I'm confident that we will get many more volunteers and that between us we will be able to show not only that there is significant value in the published literature but also that a rapidly convened group of committed people can make enormous progress.

Why do I believe this? Because of a wonderful project over 10 years ago that produced the SAX specification for XML. In only one month (see http://www.saxproject.org/sax1-history.html ) a group of people on the xml-dev mailing list had created and tested a de facto standard which now runs on every computer on the planet. There is no reason why the same dynamics should not apply to this project.

Hi Peter, sounds a really fun project. I'm happy to help out with some Java coding. Also I have a cloud-hosted virtual machine I'm not really making much use of right now which you're welcome to use.

This is exactly one of the skills we shall need for this project. If we are going to look at patents over many past years we are going to have to use either a lot of humans or a lot of computing.

Dan worked with us as a summer student and then moved on to RAL. He helped us get much of the automation into crystal structure repositories. So I know that he knows this contribution is possible and valuable.

I'll explain in more detail what we are going to do, but this is about how. We have written most of the tools (in Java) and we'll be able to offer them so they can run standalone on any machine. This may require wrapping them as a WAR or other self-starting distributable. We'll also need to make sure they run remotely. (Java is described as write-once-run-anywhere and parodied as write-once-debug-everywhere, so people who know what debugging looks like are highly valued.)

The main distributed tool will be natural-language-processing (NLP) for chemical documents and specifically reactions. I'll describe this in detail in a later post. The overall strategy looks something like:

1. Find all reactions in the document (can be hundreds in patents, only one in Acta)

2. Carry out NLP on each reaction

3. Create a datafile from each

4. Index each datafile (probably using RDF)

5. Search for green concepts in the RDF repository

6. Present the results

We've got code for steps 1-4. We'll need help and imagination with the later stages (5 and 6), especially since they may come slightly later than the initial parsing. But there will be many of you out there who have some experience of this sort of thing.
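A minimal sketch of how the steps in this strategy might hang together. The class and method names are invented for illustration, and the keyword heuristics are crude stand-ins for the real extraction and NLP code.

```java
import java.util.ArrayList;
import java.util.List;

public class ReactionPipeline {
    // Step 1: find candidate reactions in a document (hundreds per patent, one per Acta paper).
    static List<String> findReactions(String documentText) {
        List<String> reactions = new ArrayList<>();
        for (String para : documentText.split("\n\n")) {
            // Crude keyword heuristic standing in for the real extractor.
            if (para.contains("was added") || para.contains("was stirred")) {
                reactions.add(para);
            }
        }
        return reactions;
    }

    // Steps 2-3: NLP on each reaction, emitting a structured datafile (stubbed here).
    static String parseToDatafile(String reactionText) {
        return "<reaction length=\"" + reactionText.length() + "\"/>";
    }

    // Steps 4-5: index the datafiles (the post proposes RDF) and query for green concepts.
    static long countMatches(List<String> datafiles, String concept) {
        return datafiles.stream().filter(d -> d.contains(concept)).count();
    }

    public static void main(String[] args) {
        String doc = "Introductory text.\n\nGlycine was added to methanol and was stirred for 2 h.";
        List<String> reactions = findReactions(doc);
        List<String> datafiles = new ArrayList<>();
        for (String r : reactions) {
            datafiles.add(parseToDatafile(r));
        }
        // Step 6: present the results.
        System.out.println(reactions.size() + " reaction(s) extracted");
    }
}
```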

Note that the cloud is an ideal place to do this sort of work, as it is embarrassingly parallel – or can be recast as map-reduce. For example, each volunteer could take a year of patents (many tens of thousands of reactions in each year).
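To illustrate the embarrassingly parallel structure, here is a sketch in which each year is a map task and the per-year counts are reduced to a total. The per-year figure of 100,000 reactions is the rough estimate quoted earlier in these posts; everything else (class names, year range) is illustrative.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class YearPartition {
    /** Stand-in for the real per-year work (download, parse, index). */
    static int reactionsInYear(int year) {
        return 100_000; // rough estimate of patent reactions per year
    }

    /** Map: each year is processed independently - on volunteers' machines or cloud nodes. */
    static Map<Integer, Integer> mapYears(int from, int to) {
        return IntStream.rangeClosed(from, to)
                .parallel() // embarrassingly parallel: no shared state between years
                .boxed()
                .collect(Collectors.toMap(Function.identity(), YearPartition::reactionsInYear));
    }

    /** Reduce: merge the independent per-year results into one total. */
    static int reduceTotal(Map<Integer, Integer> perYear) {
        return perYear.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // 5 years x 100,000 reactions each
        System.out.println("Total reactions: " + reduceTotal(mapYears(2005, 2009)));
    }
}
```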

I'd like to be able to calculate the "greenness" of a reaction. This is obviously subjective, but should illustrate the principles and components of green chemistry even if it's not completely worked out. Bob Hanson (of Jmol / BlueObelisk fame) has created a Green Chemistry Assistant that calculates Process Mass Efficiency (PME) and Atom Efficiency. Bob's done great work for getting students involved with thinking about Green Chemistry. I asked him about the nature of the materials as well (e.g. toxicity, flammability) but the calculator doesn't do this.
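For concreteness, here is a sketch of the two metrics using their standard textbook definitions: atom economy from molecular weights, and process mass efficiency from actual masses including solvents. Bob Hanson's Green Chemistry Assistant may compute them differently in detail, and the numbers in `main` are toy values.

```java
public class GreenMetrics {
    /** Atom economy (%): MW of the desired product over the summed MWs of all reactants. */
    public static double atomEconomy(double productMw, double... reactantMws) {
        double total = 0;
        for (double mw : reactantMws) total += mw;
        return 100.0 * productMw / total;
    }

    /** Process mass efficiency (%): product mass over the total mass of all inputs,
     *  including solvents and reagents - so a dilute reaction scores poorly. */
    public static double processMassEfficiency(double productMass, double... inputMasses) {
        double total = 0;
        for (double m : inputMasses) total += m;
        return 100.0 * productMass / total;
    }

    public static void main(String[] args) {
        // Toy numbers, for illustration only.
        System.out.printf("Atom economy: %.1f%%%n", atomEconomy(60.0, 40.0, 60.0));
        System.out.printf("PME: %.1f%%%n", processMassEfficiency(5.0, 10.0, 40.0));
    }
}
```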

DOES ANYONE HAVE A GREEN CHEMISTRY CALCULATOR/PROGRAM THAT ASSESSES THE OVERALL GREENNESS OF A REACTION?

To give an example, here's our first reaction:

The title compound [1] was synthesized as described in the literature. To glycine (1.00 mol) and potassium hydroxide (1.00 mmol) in 10 ml of methanol and 5 ml of water was added 2-hydroxy-1-naphthaldehyde (1.00 mmol in 10 ml of methanol) dropwise. The yellow solution was stirred for 2.0 h at 333 K. The resultant mixture was added dropwise to Cu(II) nitrate hexahydrate (1.00 mmol) and pyridine (1.00 mmol) in an aqueous methanolic solution (20 ml, 1:1 v/v), and heated with stirring for 2.0 h at 333 K. The brown solution was filtered and left for several days; brown crystals had formed that were filtered off, washed with water, and dried under vacuum.

We can't address yield (it wasn't given, and crystallographers don't need large amounts of material – purity is more important). However it uses methanol and pyridine and several other compounds you will find in Wikipedia, which have some modest hazards associated with them. Anyone doing this in universities and industry has to prepare a safety assessment (COSHH in the UK) before doing the work. Can we create a numerical index of safety from this form?
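As one possible starting point for such an index, the toy sketch below assigns each recognised substance a hazard score and averages over the substances mentioned in the text. The substances and scores are invented placeholders, not real COSHH or hazard data, and a real index would need per-substance quantities, not just mentions.

```java
import java.util.Map;

public class SafetyIndex {
    // Invented 0 (benign) to 10 (severe) scores - purely illustrative, not real hazard data.
    static final Map<String, Integer> HAZARD = Map.of(
            "water", 0,
            "methanol", 6,
            "pyridine", 7);

    /** Mean hazard score over the known substances mentioned in the procedure text. */
    public static double score(String procedureText) {
        String lower = procedureText.toLowerCase();
        int sum = 0, n = 0;
        for (Map.Entry<String, Integer> e : HAZARD.entrySet()) {
            if (lower.contains(e.getKey())) {
                sum += e.getValue();
                n++;
            }
        }
        return n == 0 ? 0.0 : (double) sum / n;
    }

    public static void main(String[] args) {
        System.out.println(score("dissolved in methanol and water")); // (6 + 0) / 2
    }
}
```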