A Scientist and the Web


Monthly Archives: September 2010

This post introduces the command line and the file system, which are the bedrock of the Lensfield system we have built to support scientific computing.

Neal Stephenson has a marvellous essay/book http://en.wikipedia.org/wiki/In_the_Beginning..._Was_the_Command_Line . It's primarily a discussion on proprietary operating systems, but it highlights the role of the command line in supporting one of the most natural and most durable ways of humans communicating with machines. It's so "natural" that we may feel it's a fundamental part of human intelligence, mapped onto the machine.

But there were machines before command lines. When I started (what an awful phrase – but please don't switch off)… there was no keyboard input into machines. Input was via paper tape:

[thanks to wikimedia]

Ours was even worse – it had only 5 holes and would tear at the slightest chance. And ask yourself: how would you edit it? Yes, it was possible, and we did it… You encoded your data on an ASR33. To get your results out, the computer had a tape punch, and you fed your tape into the ASR33, which printed it out at 10 chars/second – rather slower than a competent typist.

And where was the filestore? Yes, you've guessed it – rolls of paper tape. How were they indexed? By writing on them with pens. If the information was too big it was split over two or more tapes. If you read them in the wrong order, the result was garbage. And if the rolls were large the middle could fall out. A beautiful effect, but topologically diabolical, because every layer in the spiral contributed to a twist that had to be undone by hand. It could take an hour to rewind a tape.

The next advance was magnetic tape. This was almost infinite in storage.

Initial tape speed was 75 inches per second (2.95 m/s) and recording density was 200 characters per inch, giving a transfer speed of 120 kbps[1]. Later 729 models supported 556 and 800 characters/inch (transfer speed 480 kbps). At 200 characters per inch, a single 2400 foot tape could store the equivalent of some 50,000 punched cards (about 4,000,000 six-bit bytes, or 3 MByte).

A tape could therefore hold many chunks of information. I can't remember whether we called them "files" – it depended on the machine. But it was really the first filing system. Let's assume we had 1000 files, each about 32Kb. The files were sequential. To find one at the end of the tape you had to read through all the other 999. This could take ages. Writing a file could only be done if you have a tape with space at the end (even if you wanted to delete a file in the middle and overwrite it, problems with tape stretch might make this dangerous). So generally you read from one tape and wrote to another.
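The quoted figures make the pain easy to quantify. Here is a back-of-envelope sketch (Python, purely illustrative – the reel, speed and density numbers are the ones quoted above):

```python
# Sequential tape access, using the figures quoted above:
# 75 in/s tape speed, 200 chars/inch density, 2400 ft reels.
TAPE_LENGTH_IN = 2400 * 12       # reel length in inches
SPEED_IN_PER_S = 75              # tape speed
DENSITY_CHARS_PER_IN = 200       # recording density

# Transfer rate: 75 * 200 = 15,000 chars/s (~120 kbit/s at 8 bits/char)
chars_per_second = SPEED_IN_PER_S * DENSITY_CHARS_PER_IN

# Worst case: the file you want is at the far end of the reel,
# so you stream past everything else first.
worst_case_seek_s = TAPE_LENGTH_IN / SPEED_IN_PER_S
print(chars_per_second)     # 15000
print(worst_case_seek_s)    # 384.0 (over six minutes for one lookup)
```

Compare that with a disk filesystem, where any file is reachable in milliseconds – the hierarchy plus random access is the whole breakthrough.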

And how did you input instructions to the machine? Not with a command line. Not with buttons and GUIs. But with switches. Here's a PDP-8 (which controlled my X-ray diffractometer in the mid-1970s).

[thx Wikimedia]

To enter a program you had to key in the operating system – set the toggle switches (orange/white at the bottom) to a 12-bit number – enter it. Set it to another number. Enter it. Do this for about 10 minutes without a mistake (a single mistake meant going back to the start). Then you read the rest of the operating system on paper tape. Then you could read in your program!

The point is that back then…

No command line

No file system

That's why when Dennis Ritchie and others introduced the hierarchical file system and the command line in UNIX it was a major breakthrough.

The magnificent thing is that 40 years on these two fundamentals are still the most natural way for many humans to interact with machines. 40 years in which the tools and approaches have been tested and honed and proved to work.

And that's why Lensfield builds on the filesystem and the command line to provide an infrastructure for modern scientific computing and information management. And why we've used Lensfield for the Green Chain Reaction and will be providing it to the Quixote project (and anyone else for anything else).

I'm delighted to see that a major computational chemistry program (NWChem) has been released under a fully F/OSS Open Source software licence. There are many programs ("codes") in compchem but few of them are F/OSS. The norm is either to be fully commercial, or allow free-to-academics. The main exceptions I knew about already were http://en.wikipedia.org/wiki/ABINIT and http://en.wikipedia.org/wiki/MPQC ; the former deals with solids rather than molecules so is outside the scope of Quixote. This means we can now – in principle – create completely open distributions for a community project.

We are pleased to announce the release of NWChem version 6.0. This version marks a transition of NWChem to an open-source software package. The software is being released under the [Educational Community License 2.0] (ECL 2.0). Users can download the source code and a select set of binaries from the new open source web site http://www.nwchem-sw.org

"These ambiguities, redundancies, and deficiencies recall those attributed by Dr. Franz Kuhn to a certain Chinese encyclopedia called the Heavenly Emporium of Benevolent Knowledge. In its distant pages it is written that animals are divided into (a) those that belong to the emperor; (b) embalmed ones; (c) those that are trained; (d) suckling pigs; (e) mermaids; (f) fabulous ones; (g) stray dogs; (h) those that are included in this classification; (i) those that tremble as if they were mad; (j) innumerable ones; (k) those drawn with a very fine camel's-hair brush; (l) etcetera; (m) those that have just broken the flower vase; (n) those that at a distance resemble flies."[3]

I shall be writing one or more blog posts on a proposed architecture for the Quixote project (which will gather computational chemistry output). I'll be describing a concept – the World Wide Molecular Matrix – which is about 10 years old but only now is the time right for it to start to flourish. It takes the idea of a decentralised web where there is no ontological dictatorship and people collect and republish what they want.

In classifying the output of computational chemistry we can indeed see the diversity of approaches. Here are some meaningful and useful reasons why someone might wish to aggregate outputs:

Molecules which contain an Iron atom

Calculations of NMR shifts in natural products

Molecules collected by volunteers

Calculations using B3LYP functional

Studies reported in theses at Spanish Universities

Work funded by the UK EPSRC

Molecules with flexible side chains

Large macromolecules with explicit solvent

Work published in J. American Chemical Society

Calculations which cause the program to crash

These are all completely serious and some collections along some of these axes already exist.

The point is that there is no central approach to collection and classification. For that reason there should be no central knowledgebase of calculations, but rather a decentralised set of collections. Note, of course, that both Borges's classification and mine have intersections and omissions. This is the fundamental of the web – it has no centre and is both comprehensive and incomplete.

I'll be showing how the WWMM now has the technology – and more important the cultural acceptability – to provide a distributed knowledgebase for chemistry. It knocks down walled gardens through the power of Open Knowledge.

Last week we agreed that a small, very agile group of co-believers would put together a system for collecting, converting, validating, and publishing Open Data for computational chemistry, described by the codeword "Quixote". This is not a fantasy – it's based on a 10-year vision which I and colleagues put together and called the "World Wide Molecular Matrix". I've talked about this on various occasions and it's even got its own Wikipedia entry (http://en.wikipedia.org/wiki/WorldWide_Molecular_Matrix – not my contribution, which is as it should be). We put together a proposal to the UK eScience program in (I think) 2001 which outlined the approach. Like so much, the original design is lost to me, though it may be in the bowels of the EPSRC grant system. We got as far as presenting it to the great-and-the-good of the program but it failed (at least partly) because it didn't have "grid-stretch". [I have been critical of the GRID's concentration on tera-this and peta-that and the absurdly complex GLOBUS system, but I can't complain too much because the program gave us 6 FTE-years funding for the "Molecular Standards for the Grid" which has helped to build the foundations of our current work.] Actually it's probably a good thing that it failed – we would have had a project which involved herding a lot of cats and where the technology – and even more the culture – simply wasn't ready. And one of my features is that I underestimate the time to create software systems – it seems to be the boring bits that take the time.

But I think the time has now come where the WWMM can start to take off. It can use crystallography and compchem separately and together as the substrates and then gradually move toward organic synthesis, spectroscopy and materials. We need to build well-engineered, lightweight, portable, self-evident modules and I think we can do this. As an example when we built an early prototype it used WSDL and other heavyweight approaches (there was a 7-layer software stack of components which were meant to connect services and do automatic negotiation – as agile as a battle tank). We were told that SOAP was the way forward. And GLOBUS. And certificates. We were brainwashed into accepting a level of technology which was vastly more complex (and which I suspect has frequently failed in practice). Oh, and Upper Level Ontologies – levels of trust, all the stuff from the full W3C layer cake.

What's changed is that the bottom-up approach has taken a lightweight approach. REST is simple (I hacked the Green Chain Reaction in REST – with necessary help from Sam Adams). The new approach to Linked Open Data is that we should do it first and then look for the heavy ontology stuff later – if at all. Of course there are basics such as an ID system. But URLs don't have to resolve. Ontological systems don't have to be provably consistent. The emergent intelligent web is a mixture of machines and humans, not First-Order predicate logic on closed systems. There's a rush towards Key-value systems – MongoDB, GoogleData, and so on. Just create the triples and the rest can be added later.
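To show how little machinery "just create the triples" needs, here is a minimal sketch in Python. The URIs and predicate names are invented for illustration (and they need not resolve, which is exactly the point):

```python
# Statements as (subject, predicate, object) tuples; no store, no schema.
# All identifiers below are made up for the example.
triples = set()

def add(s, p, o):
    triples.add((s, p, o))

add("http://example.org/calc/1", "ex:molecule", "benzene")
add("http://example.org/calc/1", "ex:functional", "B3LYP")
add("http://example.org/calc/2", "ex:molecule", "ferrocene")

# A pattern query: None acts as a wildcard, like a SPARQL variable.
def match(s=None, p=None, o=None):
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(p="ex:functional"))
```

A real system would use an RDF store and SPARQL, but the mental model – statements as triples, queries as patterns with wildcards – really is this simple, and the heavy ontology can come later if at all.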

What's also happened is Openness. If your systems are Open you don't have to engineer complex human protocols – "who can use my data?" – "anyone!" (that's why certificates fail). Of course you have to protect your servers from vandalism and of course you have to find the funding somewhere. But Openness encourages gifts – it works both ways as the large providers are keen to see their systems used in public view.

And the costs are falling sharply. I can aggregate the whole of published crystallography on my laptop's hard drive. Compchem is currently even less (mainly because people don't publish data). Video resources dwarf many areas of science – there are unnecessary concerns about size, bandwidth, etc.

And our models of data storage are changing. The WWMM was inspired by Napster – the sharing of files across the network. The Napster model worked technically (though it required contributors to give access to local resources which can be seen as a security risk and which we cannot replicate by default). What killed Napster was the lawyers. And that's why the methods of data distribution and sharing have an impaired image – because they can be used for "illegal" sharing of "intellectual property". I use these terms without comment. I believe in copyright, but I also challenge the digital gold rush that we've seen in the last 20 years and the insatiable desire of organizations to possess material that is morally the property of the human race. That's a major motivation of the WWMM – to make scientific data fully Open – no walled gardens, however pretty. Data can and will be free. So we see and applaud the development of Biotorrents, Mercurial and Git and many Open storage locations such as BitBucket. These all work towards a distributed knowledge resource system without a centre and without controls. Your power is your moral power, the gift economy.

And that is also where the Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk ) comes to help. Over the last 6 years we have built a loose bottom-up infrastructure where most of the components are deployed. And because we believe in a component-based approach rather than monoliths it is straightforward to reconfigure these parts. The Quixote system will use several Blue Obelisk contributions.

And we have a lot of experience in our group in engineering the new generation of information systems for science. This started with a JISC project, SPECTRa, between Cambridge and Imperial, chemistry and libraries and which has seeded the creation of a component-based approach. Several of these projects turned out to be more complex than we thought. People didn't behave in the way we thought they should, so we've adjusted to people rather than enforcing our views. That takes time and at times it looks like no progress. But the latest components are based on previous prototypes and we are confident that they now have a real chance of being adopted.

To keep the post short, I'll simply list them and discuss in detail later:

Lensfield. The brainchild of Jim Downing: a declarative make-like build system for aggregating, converting, transforming and reorganizing files. Originally designed in Clojure (a functional language on the JVM), Sam Adams has now built a simpler system (Lensfield2). This doesn't have the full richness and beauty of Clojure – which may come later – but it works. The Green Chain Reaction used the philosophy and processed tens or hundreds of thousands of files in a distributed environment.
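To make the "declarative make-like" idea concrete, here is a toy sketch in Python. This is NOT Lensfield's actual API – just an illustration of the pattern: declare that inputs matching a glob become outputs via a transform, and rebuild only what is missing or stale.

```python
# Toy make-like rule: files matching `pattern` are converted to files
# with `out_ext` by `transform`, skipping up-to-date outputs.
import glob
import os

def rule(pattern, out_ext, transform):
    for src in glob.glob(pattern):
        dst = os.path.splitext(src)[0] + out_ext
        # Rebuild only if the output is missing or older than the input.
        if not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src):
            with open(src) as f_in, open(dst, "w") as f_out:
                f_out.write(transform(f_in.read()))

# e.g. "convert every .dat log into an uppercased .txt summary":
# rule("data/*.dat", ".txt", str.upper)
```

The real Lensfield chains many such rules over thousands of files; the point of the sketch is only that the whole mental model fits in a dozen lines.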

Emma. The embargo manager. Because data moves along the axis of private->Open we need to manage the time and the manner of its publication. This isn't easy and with support from JISC (CLARION) we've built an Embargo manager. This will be highly valuable in Quixote because people need a way of staging release.

Chem# (pronounced "ChemPound"). A CML-RDF repository of chemistry – based on molecules. We can associate crystallography, spectra, and in this case compchem and properties. The repository exposes a SPARQL endpoint. This means that a simple key-value approach can be used to search for numeric or string properties. And we couple this to a chemical search system based on (Open) Blue Obelisk components.
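As a sketch of the kind of query such an endpoint could answer, here is the shape of a SPARQL request for a numeric property (wrapped in a Python string; the property names and dictRef terms are invented – the real repository vocabulary will differ):

```python
# The shape of a key-value SPARQL query against a (hypothetical)
# Chem#-style endpoint. Term names below are illustrative only.
query = """
PREFIX cml: <http://www.xml-cml.org/schema#>
SELECT ?calc ?energy
WHERE {
  ?calc cml:property ?p .
  ?p cml:dictRef "compchem:totalEnergy" ;
     cml:value ?energy .
  FILTER (?energy < -230.0)
}
"""
# With a live endpoint this string would be POSTed to its /sparql URL;
# here we only show the shape of the request.
print(query.strip().splitlines()[0])
```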

The intention is that these components can be easily deployed and managed without our permission (after all they are Open). They will act as a local resource for people to manage their compchem storage. They can be used to push either to local servers or to community Chem# repositories which we shall start to set up. Using Nick Day's pub-crawler technology (which builds crystaleye every night) we can crawl the exposed web for compchem, hopefully exposed through Emma-based servers.

We hope this prompts publishers and editors to start insisting that scientists publish compchem data with their manuscripts. The tools are appearing – is the communal will-to-publish equally encouraging?

In some subjects such as crystallography (and increasingly synthetic chemistry), the publication of a manuscript requires the publication of data as supplemental/supporting information/data (the terms vary). This is a time consuming process for authors but many communities feel it is essential. In other disciplines, such as computational chemistry, it hasn't ever been mandatory. In some cases (e.g. J. Neuroscience) it was mandatory and is now being abolished without a replacement mechanism. In other cases such as Proteomics it wasn't mandatory and is now being made so. So there are no universals.

At the ZCAM meeting there was general agreement that publishing data was a "good thing" but that there were some barriers. Note that compchem, along with crystallography, is among the least labour-intensive areas as it's a matter of making the final machine-generated files available. By contrast publishing synthetic chemistry can require weeks of work to craft a PDF document with text, molecular formulae and spectra. Some delegates said that suppInfo could take twice as long as the paper. (There was rejoicing (sadly) in some of the neuroscience community that they no longer needed to publish their data).

So this post explores the positive and negative aspects of publishing data.

Here were some negatives (they were raised and should be addressed):

If I publish my data my web site might get hacked (i.e. it is too much trouble to set up a secure server). I have some sympathy – scientists should not have to worry about computer infrastructure if possible. We do, but we are rather unusual.

It may be illegal or it may break contractual obligations. Some compchem programs may not be sold to the various enemies of US Democracy (true) and maybe it's illegal to post their outputs (I don't buy this, but I live in Europe). Also some vendors put severe restrictions on what can be done with their programs and outputs (true) but I doubt that publishing output breaks the contract (but I haven't signed such a contract)

If I publish my data certain paper-hungry scientists in certain countries will copy my results and publish them as theirs (doesn't really apply after publication, see below)

Too much effort. (I have sympathy)

Publishers not supportive (probably true)

Now the positives. They fall into the selfish and the altruistic. The altruistic is the prisoner's dilemma (i.e. there is general benefit but *I* benefit only from other people being altruistic). The selfish should be compelling in any circumstances.

Altruistic:

The quality of the science improves if results are published and critiqued. Converge on better commonality of practice.

New discoveries are made ("The Fourth Paradigm") from mining this data, mashing it up, linking it, etc.

Speed up the publication process (e.g. less work required to publish data with complying publishers).

Be mandated to comply (by funder, publisher, etc.)

Store one's data in a safe (public) place

Be able to search one's own data and share it with the group

Find collaborators

Create more portable data (saves work everywhere)

That's the "why". I hope it's reasonably compelling. Now the "when", "where" and "how".

The "when" is difficult because the publication process is drawn out (months or even years). The data production and publication is decoupled from the review of the manuscript. (This is what our JISCXYZ project is addressing). The "where" is also problematic. I would have hoped to find some institutional repositories that were prepared to take a role in supporting data management, publication, etc. but I can't find much useful. At best some repositories will store some of the data created by some of their staff in some circumstances. BTW it makes it a lot easier if the data are Open. Libre. CC0. PDDL, etc. Then several technical problems vanish.

So the scientist has very limited resources:

Rely on the publisher (works for some crystallography)

Rely on (inter)national centres (works for the rest of crystallography).

Put it on their own web site. A real hassle. Please let's try to find another way.

Find a friendly discipline repository (Tranche, Dryad). Excellent if it exists. Of course there isn't a sustainable business model but let's plough ahead anyway

Twist some other arms (please let me know).

Anyway there is no obvious place for compchem data. I'd LOVE a constructive suggestion. The data need not be huge – we could do a lot with a few Tb per year – we wouldn't get all the data but we'd get most that mattered to make a start.

So, to seed the process, we'll see what we can do in the Quixote project. If nothing else we (i.e. our group) may have to do it. But I would love a white knight to appear.

That's the "where". Now the "when" and "how". I'd appreciate feedback.

If we are to have really useful data it should point back to the publication. Since the data and the manuscript are decoupled that only works when the publisher takes on the responsibility. Some will, others won't.

An involved publisher will take care of co-publishing the paper and the data files. Many publishers already do this for crystallography. The author will have to supply the files, but our Lensfield system (used in the Green Chain reaction) will help.

Let's assume we have a non-involved publisher...

Let's also assume we have a single file in our plan9 project: pm286/plan9/data.dat (although we can manage thousands) and that there is a space for a title. When we know we are going to publish the file we'll get a DataCite DOI. (I believe this only involves a small fixed yearly cost regardless of the number of minted dataDOIs – please correct me if not). We'll mint a DOI. Let's say we have a root of doi:101.202, so we mint doi:101.202/pm286/plan9/data.dat. We add that to the title (remember that our files are not yet semantic). This file is then semantified into /plan9/data.cml with the field (say)

<metadata DC:identifier="doi:101.202/pm286/plan9/data.cml"/>
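The minting-and-embedding step can be sketched in a few lines of Python (illustrative only, not the real tooling; it follows the invented doi:101.202 root used above):

```python
# Sketch: build a data DOI from the example root above and embed it
# in a CML metadata element. The root and paths are the invented
# examples from the text, not real DOIs.
DOI_ROOT = "doi:101.202"

def data_doi(path):
    # e.g. "pm286/plan9/data.dat" -> "doi:101.202/pm286/plan9/data.dat"
    return f"{DOI_ROOT}/{path}"

def metadata_element(path):
    return f'<metadata DC:identifier="{data_doi(path)}"/>'

print(metadata_element("pm286/plan9/data.cml"))
# <metadata DC:identifier="doi:101.202/pm286/plan9/data.cml"/>
```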

The author adds the two identifiers to the manuscript (again the system could do this automatically, e.g. for Word or LaTeX documents).

After acceptance of the manuscript the two files (data.dat and data.cml) are published into the public repository. Again our Lensfield system and the Clarion/Emma (JISC-CLARION) tools can manage the embargo timing and make this automatic. The author can choose when they do this so they don't pre-release the data.

So the reader of the manuscript has a DataCite DOI pointing to the repository. What about the reverse?

This can be automated by the repository. Every night (say) it trawls recently published papers, looking for DataCite DOIs. Whenever these are located, the repository is updated to include the paper's DOI. In that way the repository data will point to the published paper.
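The nightly back-linking pass could be as simple as a regular-expression scan over fetched paper text (a Python sketch; the DOI prefix and paper DOI are the invented examples from above):

```python
# Sketch of the back-linking step: find DataCite-style data DOIs in
# paper text and record the citing paper against each. The doi:101.202
# prefix is the invented example root from the text.
import re

DATA_DOI_RE = re.compile(r"doi:101\.202/\S+")

def backlinks(paper_doi, paper_text, index):
    """Update index: data DOI -> set of citing paper DOIs."""
    for data_doi in DATA_DOI_RE.findall(paper_text):
        index.setdefault(data_doi, set()).add(paper_doi)
    return index

index = backlinks("10.1000/jacs.2010.123",
                  "Data are at doi:101.202/pm286/plan9/data.cml .",
                  {})
print(index)
```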

This doesn't need any collaboration from the publisher except to allow their paper to be read by robots and indexed. They already allow Google to do this. So why not a data repository?

And what publisher would forbid indexing that gave extra pointers to the published work?

So – some of the details will need to be hammered out but the general process is simple and feasible.

Quantum Chemistry addresses how we can model chemical systems (molecules, ensembles, solids) of interest to chemistry, biology and materials science. To do that we have to solve Schrödinger's equation (http://en.wikipedia.org/wiki/Quantum_chemistry ) for our system. This is insoluble analytically (except for the hydrogen atom) so approximations must be made and there are zillions of different approaches. All of these involve numerical methods and all scale badly (e.g. the time and space taken may go up as the fourth power or even worse).
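To see what "scale badly" means in practice, a one-line calculation: if cost grows as the fourth power of system size, doubling the molecule multiplies the work sixteen-fold.

```python
# Illustrative scaling arithmetic: cost ~ N^4 (as in the fourth-power
# example above; many methods scale even worse).
def relative_cost(n_basis, power=4):
    return n_basis ** power

print(relative_cost(200) / relative_cost(100))   # 16.0
```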

The approach has been very successful in the right hands but also is often applied without thought and can give misleading results. There are a wide variety of programs which make different assumptions and which take hugely different amounts of time and resources. Choosing the right methods and parameters for a study are critical.

Millions (probably hundreds of millions) of calculations are run each year and are a major use of supercomputing, grids, clusters, clouds, etc. A great deal of work goes into making sure the results are "correct", often checked to 12 decimal places or more. People try to develop new methods that give "better" answers and have to be absolutely sure there are no bugs in the program. So testing is critical.

Very large numbers of papers are published which rely in part or in full on compchem results. Yet, surprisingly, the data are often never published Openly. By contrast, in some disciplines (such as crystallography) it's mandatory to publish supplemental information or deposit data in databases. Journals and their editors will not accept papers that make assertions without formal evidence. But, for whatever reason, this isn't generally the culture and practice in compchem.

But now we have a chance to change it. There's a growing realisation that data MUST be published. There are lots of reasons (and I'll cover them in another post). The meeting had about 30 participants – mainly, but not exclusively, from Europe – and all agreed that, in principle, it was highly beneficial to publish data at the time of publication.

There are lots of difficulties and lots of problems. Databases have been attempted before and have not worked out. The field is large and diverse. Some participants were involved in method development and wanted resources suitable for that. Others were primarily interested in using the methods for scientific and engineering applications. Some required results which had been shown to be "correct"; others were interested in collecting and indexing all public data. Some felt we should use tried and tested database tools, others wanted to use web-oriented approaches.

For that reason I am using the term knowledgebase so that there is no preconception of what the final architecture should look like.

I was invited to give a demonstration of working software. I and colleagues have been working for many years using CML, RDF, semantics and several other emerging approaches and applying these to a wide range of chemistry applications including compchem. So, recently, in collaboration with Chemical Engineering in Cambridge we have built a lightweight approach to compchem repositories (see e.g. http://como.cheng.cam.ac.uk/index.php?Page=cmcc ). We've also shown (in the Green Chain Reaction, http://scienceonlinelondon.wikidot.com/topics:green-chain-reaction ) that we can bring together volunteers to create a knowledgebase with no more than a standard web server.

I called my presentation "A quixotic approach to computational chemistry knowledgebases". When I explained my quest to liberate scientific information into the Open a close scientist friend (of great standing) asked "where was my Sancho Panza?" – implying that I was a Don Quixote. I'm tickled by the idea and, since the meeting was in Aragon, it seemed an appropriate title. Since many people in chemistry already regard some of my ideas as barmy, there is everything to gain.

It was a great meeting and a number of us found compelling common ground. So common that it is not an Impossible Dream to see computational chemistry data made Open through web technology. The spirit of Openness has advanced hugely in the last 5 years and there is a groundswell that is unstoppable.

The mechanics are simple. We build it from the bottom up. We pool what we already have and show the world what we can do. And the result will be compelling.

We've given ourselves a month to get a prototype working. Working (sic). We're meeting in Cambridge in a month's time – the date happened to be fixed and that avoids the delays that happen when you try to arrange a communal get-together. As always everything is – or will be when it's created – in the Open.

Who owns the project? No-one and everyone. It's a meritocracy – those who contribute help to decide what we do. No top-down planning – but bottom-up hard work to a tight deadline. So, for those who like to see how Web2.718281828... projects work, here's our history. It has to be zero cost and zero barrier.

Abstraction of the commonest attributes found in compchem (energy, dipole, structure, etc.) This maps onto dictionaries and ontologies

Automated processing (perhaps again based on Lensfield)

Compelling user interfaces (maybe Avogadro, COMO, etc.)
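The first item – abstracting the commonest attributes – can be sketched as a small dictionary-driven extractor (Python; the log format and term names here are invented, since every code writes different output):

```python
# Sketch: pull common compchem attributes out of a (hypothetical) log
# fragment into a dict keyed by dictionary/ontology terms. Real logs
# and term names differ per code; this shows only the pattern.
import re

PATTERNS = {
    "compchem:totalEnergy": re.compile(r"Total energy\s*=\s*(-?\d+\.\d+)"),
    "compchem:dipoleMoment": re.compile(r"Dipole moment\s*=\s*(\d+\.\d+)"),
}

def extract(log_text):
    found = {}
    for term, pat in PATTERNS.items():
        m = pat.search(log_text)
        if m:
            found[term] = float(m.group(1))
    return found

log = "Total energy = -230.712\nDipole moment = 0.000"
print(extract(log))
```

Once the values are keyed by dictionary terms like this, converting to CML or RDF triples is a mechanical step.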

By giving ourselves a fixed deadline and working in an Open environment we should make rapid progress.

When we have shown that it is straightforward to capture compchem data we'll then engage with the publishing process to see how and where the supplemental data can be captured. This is a chance for an enthusiastic University or national repository to make an offer, but we have alternative plans if they don't.

A community creates results and wants to make the raw results available under Open licence on the web. The results don't all have to be in the same place. Value can be added later.

One solution is to publish this as supplemental data for publications. (The crystallographers require this and it's worked for 30 years). But the compchem people have somewhat larger results – perhaps 1-100 TB/year. And they don't want the hassle (particularly in the US) of hosting it themselves because they are worried about security (being hacked).

So where can we find a few terabytes of storage? Can university repositories provide this? Would they host data from other universities? Could domain-specific repositories (e.g. Tranche, Dryad) manage this scale of data?

Last time I asked for help on this blog I got no replies and we had to build our own virtual machine and run a webserver. We shouldn't have to do this. Surely there is a general academic solution – or do we have to buy resources from Amazon? If so, how much does it cost per TB-year?

If we can solve this simple problem then we can make rapid advance in Comp Chem.

Simple web pages, no repository, no RDB, no nothing.

UPDATE
Paul Miller has tweeted a really exciting possibility: http://aws.amazon.com/publicdatasets/
At first sight this looks very much what we want. It's public, draws the community together, it's Open. Any downside?

I attended the Open Bibliographic Data meeting in London and came away very excited. Here are a few thoughts.

Although bibliography is often regarded as a dry subject it's actually incredibly relevant. We were asked for a compelling use-case for bibliography to be presented to Vice-Chancellors. Here was my suggestion:

Universities compete against each other. They do it in large part through bibliography. Really? Yes! The RAE or REF or whatever metric is increasingly based on bibliography. The Universities that manage their bibliographies best will be more visible in all sorts of metrics. Soton and QUT have a concerted policy on exposing their research, and they succeed. So a modern bibliographic tool is a sine qua non for a VC. I'm serious. Here are some questions that an Open bibliography could make a lot of contribution to:

What subjects does my University publish?

Which departments co-publish?

Which other universities does mine co-publish with?

Which universities are starting to eat my lunch?

Open Bibliography can answer those!

So here are some ideas that the meeting has converged on, drafted by Paul Miller. They're not final, but Paul's style is wonderfully brief and I doubt I could improve on it:

Universities should proceed on the presumption that their bibliographic data should be freely available for use and reuse. [...] The default position remains transparency, unless the risk assessment can compellingly argue otherwise.

Just use CC-BY for creative works. Just use ODC-PDDL for facts.

DO NOT USE 'Non-Commercial'. DO NOT DEVELOP YOUR OWN LICENSE.

That is all there is to it. By using open approaches you don't have to explain, qualify, niggle. It simply works.

Given that, we can now develop powerful tools that speak directly to the world, including vice-chancellors. And, with open data, and open source they are cheap to build.

My practical message was that we could reclaim our scholarship. That's true, though I now prefer the phrase "Open Scholarship". An Open Bibliography is practicable. It's even more practicable than I thought in July. A number of technologies and people have come together and we have already prototyped parts of it. I'll post on this later.

But, again, I have paid my homage to Berkeley. These images and sounds refresh and reinvigorate.