Learned society publishers have a key role in educating researchers about data.

Institutions have a key role as part of mobilising the scientific community.

The expectations of institutions regarding data have to be spelled out.

Group 3: What future is there for national and institutional data repositories to provide a platform for
data publication?

The future's great!

At the moment, institutional data policies are patchy.

A good incentive for building a good institutional repository is that it will provide a good record of all institutional research outputs.

Data is a first class scientific output

Institutional repositories should be based on a national template of good practice.

Some journals are taking this role at the moment, not sure if someone else should.

Reuse of datasets is a key driver.

Is there mileage in offering a cash prize for best demonstration of best data reuse?

Summary and next steps

Everyone at the meeting was given the job of cascading information about data publication to their colleagues/funders/institution. The DCC promised to engage with funders and others to the extent it can within the UK.

Getting the sharing of research data right brings real economic benefits, and that's something we don't have to persuade government about. We need to identify areas for action where everyone gains. We might find ourselves in a situation where the effort and the benefit don't fall to the same people, so we need to be prepared.

David's role is to work on strategy and policy for all the ways people access Elsevier's data, including open access, mechanisms for access, and access to data.

Elsevier's approach: Interconnections between data and publications are important. Scientists who create data need to have their effort recognised and valued. When journals add value and/or incur significant cost, then their contributions also need to be recognised and valued.

There are many potential new roles - want to embrace an active test-and-learn approach. Will be sensitive to different practices in different domains. Want to work in collaboration with others. Key is sustainability - want to ensure that information is available for the long term.

Publishing Research Consortium survey: Data sets/models/algorithms shown as being important yet difficult to access.

Paradox in data availability and access: asking researchers gives positive reasons for sharing data (increased collaboration, reduced duplication of effort, improved efficiencies), but also negative reasons (being scooped, no career rewards for sharing, effort needed to make data suitable for sharing). Embargo periods allow researchers to maximise the use of their data.

Researchers should be certifying data, not the publisher.

Articles on ScienceDirect can link to many external sources - continuing to work on external and reciprocal linking (at the moment there are 40 different linking agreements). Example: linking with Pangaea.

Article of the future: tabbed so the user can move from one bit of the article to another very quickly, incorporating all the different data elements into the text but also in the tabs. Elsevier are rolling it out across all their journals (alongside the traditional view).

Content mining: researchers want to do this so Elsevier are doing a lot to enhance and enable content mining wherever they can. An analogy was shown with physical mining workflows (and some nice pictures too).

IUCr is unusual among international unions in that it publishes its own journals. Two journals publish crystal structure reports. These are the most structured and disciplined publications, and IUCr had to integrate the handling of these within more general publishing workflows.

Brian gave a very nice description of a crystallographic (x-ray diffraction) experiment, handily explaining what's involved for all the non-crystallographers in the audience.

Data can mean any or all of: raw measurements from an experiment, processed numerical observations, derived structural information, variable parameters in the experimental set-up or numerical modelling and interpretation, bibliographic and linking information. Make no distinction between data and metadata - metadata are data that are of secondary interest to the current focus of attention.

Crystallographic Information Framework (CIF): human readable and easily machine parseable. Simple tag and value structure. There are XML instances of CIF data. CIF can be used as a vehicle for article submission. Within the CIF file, the abstract and other text are held as data fields. The CIF file can be reformatted to produce a more standard paper format.
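To give a flavour of the "simple tag and value structure", here is a minimal Python sketch that reads single-line tag/value pairs from a CIF-like string. It is a simplification for illustration only: it ignores loops and multi-line text fields, the function name `parse_simple_cif` is made up, and the sample values are illustrative rather than taken from a real structure report.

```python
# Illustrative sample only; values are not from a real deposited structure.
SAMPLE = """\
data_example
_chemical_name_common 'silicon'
_cell_length_a 5.4307
_symmetry_space_group_name_H-M 'F d -3 m'
"""


def parse_simple_cif(text):
    """Return (block_name, items) for single-line `_tag value` pairs.

    Simplification: `loop_` constructs and multi-line text fields
    (delimited by lines starting with ';') are not handled.
    """
    block_name = None
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.lower().startswith("data_"):
            block_name = line[5:]  # text after the `data_` prefix
            continue
        if line.startswith("_"):
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # tag without a value on the same line
            tag, value = parts[0], parts[1].strip()
            # Strip one pair of matching surrounding quotes, if present.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
                value = value[1:-1]
            items[tag] = value
    return block_name, items


BLOCK_NAME, ITEMS = parse_simple_cif(SAMPLE)
```

A real parser has to follow the full, openly documented CIF syntax; the point here is only that each data item is a named tag followed by its value, which is what makes the format easy for both people and programs to read.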

CIF standard is documented and open.

"Standards are great: everyone should have one!" Important to get started - whatever you can standardise you can leverage.

There is a web service called checkCIF, which runs the same programs that are used to check data on submission of a paper. Authors are encouraged to use this before submission. The author uploads a CIF file, and the programs generate a report flagging outlying values. If an anomaly is detected, the paper will not be passed on through the publishing process unless the anomaly is addressed by the author. The reviewer sees the outlier flag and the author's response, and makes a judgement about it.
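The workflow described above (flag outlying values, then hold the paper until the author responds) can be sketched roughly as follows. This is not how checkCIF is implemented: the value names, the plausibility ranges and the functions `check_values` and `can_proceed` are all hypothetical, chosen only to show the shape of the gate.

```python
from dataclasses import dataclass

# Hypothetical plausibility ranges; these are NOT the real checkCIF tests.
EXPECTED_RANGES = {
    "r_factor": (0.0, 0.10),
    "bond_length_c_c": (1.20, 1.60),
}


@dataclass
class Alert:
    name: str
    value: float
    response: str = ""  # author's explanation; empty until addressed


def check_values(values):
    """Return an Alert for every value outside its expected range."""
    alerts = []
    for name, value in values.items():
        low, high = EXPECTED_RANGES.get(name, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            alerts.append(Alert(name, value))
    return alerts


def can_proceed(alerts):
    """A submission moves on only if every alert has an author response."""
    return all(alert.response.strip() for alert in alerts)


submission = {"r_factor": 0.04, "bond_length_c_c": 1.95}
alerts = check_values(submission)
blocked_before_response = not can_proceed(alerts)

# The author addresses the flagged value; the reviewer then judges the response.
alerts[0].response = "Long bond is genuine; see discussion of strain in the text."
allowed_after_response = can_proceed(alerts)
```

Note that the gate only checks that a response exists. Whether the response is convincing stays a human judgement for the reviewer, as in the process described in the talk.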

Why publish data? Reproducibility, verification, safeguard against error/fraud, expansion of research, example materials for teaching/learning, long-term preservation, systematic collection for comparative studies. Each community has to assess the cost-benefit of each of these reasons for themselves.

IUCr policies: Derived data made freely available. Working on a formal policy for primary data.

Community discussion for peer-review: is the method appropriate? Is there enough information for replication? Appropriate controls? Usual format/structure? Data limitations described? Does data "look" ok?

Basic idea of packing information into one thing (the paper) not threatened by enhanced publications, nano publications, data papers.

Researcher - requesting data after publication doesn't work very well.

There is a logical point in time to archive data associated with publications, during the publication process. That's when researchers are motivated to clean up and make data available.

Joint Data Archiving Policy - start of Dryad. Grass-roots effort, rolled out slowly, in the knowledge that there wasn't the infrastructure to handle the long tail data. Response to this policy has been very positive. Embargo on data for a year after publication key to community acceptance.

Dryad requirements (handed down from on high): Less than 15 minutes to complete the deposit through repository interface (once files etc. had been completed). Long term preservation important.

Paper provides large amounts of rich metadata associated with dataset. Orphan data, as long as one has the paper associated with it, can still be valuable. Long-tail data very information rich.

Journals refer authors to Dryad or other suitable repositories.

Curation is the most expensive part of the process. Data DOI (assigned by Dryad) is put into the article, in whatever version of the article.

Dryad also has authors submitting data outside the integrated systems with specific journals.

Data made available through CC0. About 1/3 of the files get embargoed. Some journals disallow the embargo.

Dryad have handshaking set up with specialised repositories, working with TreeBASE, trying to make progress with GenBank. Will require a lot of community effort on standards.

Adding new journals, ~1/month. Getting closer to financial sustainability all the time.

Legacy data being added. Data being added as a result of it being challenged in the press.

Incentives - in some cases data has a different author list from article author list - providing credit for dataset authors.

"Perfect is the enemy of the good" for long tailed data. Repository governance should be community initiative. Lot of room for education about how to prepare data for re-use, how to make data citations actually count. Do we have enough researcher incentives, or are publisher mandates and citation enough?

Limit of 10 GB for dataset. Curation costs for lots of files/complicated metadata drive the costs of deposit.

Reuse of Dryad data: median 12 downloads in a year after deposit. Leader has ~2,000 downloads. Room for improvement in tracking how people use the downloaded data.

Simon Coles (University of Southampton) - "Making the link from laboratory to article"

Talk focussed on the researcher perspective: doing research and writing papers!

We don't think a lot about the researcher's notebooks/how they actually work, or record what they're doing.

Faraday's notebooks are a great example. He recorded over 30,000 experiments and devised a metadata scheme for indexing and tagging experiments.

The notion that we're drowning in a sea of data is true and important to researchers.

Researchers manage and discuss data in relative isolation.

At some level, academics really do want to share, but they want recognition. There's also "how do I get my PhD student to share with me?"

Data puts a large burden on the journals, and it's not clear what the benefits are for the journals.

Example shown of Dial-a-molecule, an EPSRC grand challenge, where information about molecules is provided very efficiently and quickly, all predicated on informatics.

We need to understand all the experiments ever done and the negative results are as important as the positive ones.

Mining data is a big scientific driver.

Chemistry data is: scribblings in a book, the process of mixing stuff, analysis and characterisation of compound using instruments and computers, images, molecules, spectra, all the raw data coming out of instruments. And data ranges from highly structured to difficult to describe.

In chemistry publications, the abstract has complicated and difficult information to catalogue and understand. The experimental section has reams of coded text providing the recipes for what was done. Supplementary information also has pages of text.

There is a problem with information loss, for example, when an author chooses one point from a complete spectrum to report on in the text.

With structured data the problem is largely solved. The problem is with unstructured data.

My Lab Notebook provides an on-line research diary to capture what you're doing when you're doing it. This allows a stripped down paper to be written, containing links to the notebook.

Christopher's remit is to find open data at the University of Southampton and publish it in a joined up way. Or, in other words "Allow the bearer to publish any non-confidential data in our realm without let or hindrance".

His job title is "architect", but he thinks that "gardener" might be more appropriate.

Working on the principle that "we'd be smarter if we knew what we knew".

He started working with buildings on the grounds that they (usually) stay in the same place, and aren't known for suing.

There is a danger with this sort of work, in that the temptation is strong to start by opening up and designing a new system/data model, instead of seeing what's already out there first.

Best practice is to simply list what's available (buildings... people...) and what key is used for them.

He showed an example page about a building, with information about the building, a map of where it is, and a picture of what it looks like, all of which make it a lot easier for visitors and students to find the building. A PhD student put a load of information about the buildings into some code to generate a map of the university buildings. This took a lot of effort to build, but is easy to maintain.

Linked data on the web should have some text with it, explaining what it is, for the use of a random person who has just been dumped on the page courtesy of Google.

If we want to link between research facilities and papers, then each facility needs a unique id. There is value for research data in linking it with the infrastructure that produced it.
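As a rough illustration of why the unique id matters, the sketch below links papers and datasets to the facilities that produced them. Everything in it is hypothetical: the identifiers, the record layout and the function `outputs_by_facility` are assumptions made for the example, not Southampton's actual data model.

```python
from collections import defaultdict

# Hypothetical facility register: one stable, unique id per facility.
FACILITIES = {
    "facility/0001": "Single-crystal X-ray diffractometer",
    "facility/0002": "NMR spectrometer",
}

# Hypothetical research outputs, each naming the facilities it relied on.
OUTPUTS = [
    {"id": "paper/A", "kind": "paper", "produced_with": ["facility/0001"]},
    {
        "id": "dataset/B",
        "kind": "dataset",
        "produced_with": ["facility/0001", "facility/0002"],
    },
]


def outputs_by_facility(outputs, facilities):
    """Build an index from facility id to the outputs produced with it."""
    index = defaultdict(list)
    for output in outputs:
        for facility_id in output["produced_with"]:
            if facility_id not in facilities:
                # Without a registered id the link cannot be made reliably.
                raise KeyError(f"unknown facility id: {facility_id}")
            index[facility_id].append(output["id"])
    return dict(index)


LINKS = outputs_by_facility(OUTPUTS, FACILITIES)
```

With free-text facility names instead of ids, the same lookup would depend on everyone spelling "X-ray diffractometer" the same way, which is exactly the problem a shared key avoids.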

Most of the links of value to an organisation are internal.

Homework for the forum attendees: think about the vocabulary we need to specify data management requirements.

The discussion was lively on the Thursday evening (I think we ran out of steam on the Friday, but it was still an excellent event). Below are the points that were raised:

Journals have a significant role in driving the connections between data and publications. The example given was that Nature demanding accession numbers in the 1970s was a key driver for setting up data repositories.

We've only just started with interactive data in papers, and we really do need to think about what readers need and want. Publishers need to become more aware of how researchers work, and get involved further upstream of paper production.

What is the journals' role in the preservation of data? Not sure if there is a need for publishers to get into the data repository business. There is a need to move away from supplementary information, and think about how to preserve it. We all have a responsibility to maintain data.

Big question: how do we define a trusted repository? Trusted repositories should be "community endorsed". Publishers are driven by the norms in each scientific community. What are sustainable models for repositories?

An easy way to get more out of supplementary information would be to support it in more and different formats.

What constitutes the version of record for datasets?

The peer-review process is unfunded - how would it change with the integration of data? Nature did a survey where they found that a high percentage of respondents wanted peer-review of data, but didn't want to be the ones to actually do the review.

What role should repositories play in the peer-review of data?

Data papers might help the peer-review process, as it'd break up the procedure of review. For example, in the publication of protocols, the Royal Society of Chemistry checks the data to ensure it is internally consistent, a process separate from peer-review. Could this be part of a new role for technical editors?

There is a CrossRef initiative (CrossMark) in the works which will allow users to see what version a section of a paper is by hovering over it - allowing users to be aware of post publication changes.

The UK Data Archive have a system of high impact and low impact changes for when/if changes in a dataset trigger a new DOI.
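One way to picture a high impact/low impact scheme is sketched below. The notes don't record which kinds of change the UK Data Archive puts in each class, so the two sets here, the record layout, the function `apply_change` and the placeholder DOI strings are all assumptions made for illustration.

```python
# Assumed classification of change kinds; not the UK Data Archive's actual rules.
HIGH_IMPACT = {"data_values_changed", "variables_added", "cases_removed"}
LOW_IMPACT = {"typo_in_documentation", "label_corrected"}


def apply_change(record, change_kind):
    """Return an updated copy of a dataset record after a change.

    High impact changes produce a new version with a new DOI;
    low impact changes are logged but keep the existing DOI.
    """
    if change_kind not in HIGH_IMPACT | LOW_IMPACT:
        raise ValueError(f"unclassified change kind: {change_kind}")
    history = record["history"] + [change_kind]
    if change_kind in HIGH_IMPACT:
        version = record["version"] + 1
        return {
            "base": record["base"],
            "version": version,
            "doi": f"{record['base']}-{version}",  # placeholder DOI pattern
            "history": history,
        }
    return {**record, "history": history}


original = {
    "base": "10.5072/example",  # placeholder, not a real dataset DOI
    "version": 1,
    "doi": "10.5072/example-1",
    "history": [],
}
after_low = apply_change(original, "label_corrected")
after_high = apply_change(after_low, "data_values_changed")
```

The useful property is that anyone citing the earlier DOI still resolves to the data they actually used, while small corrections don't multiply identifiers.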

Where should data citations be put? In the text? Footnotes? There is concern about things being in the reference list which aren't peer-reviewed, and dual citations. Some publications limit their reference lists.

UKDA are approaching publishers to suggest methods of citations for the social sciences.


Ruth Wilson (Nature Publishing Group) set the scene for us with an excellent keynote talk, which led into some very spirited discussion both after the talk and down the bar before dinner. I scribbled down 3 1/2 pages of notes, so I'm not going to transcribe them all (that would be silly) but instead will aim to get the key points as I understood them. If it's a case of tl;dr, then skip down to the end for the talk's conclusions, and you'll get the gist.

NPG's main driving factors for their interest in data publication are: ensuring the transparency of the scientific process, and to speed up the scientific process.

Increasing amounts of information are integral to the article (and even more are supplementary). How can we link to data with no serving repository?

Interactive data is becoming important - things like 3D structure, regraph info, add/remove traces, download data from behind graphs/figures, geospatial data on maps. These are all being pulled together in things like Elsevier's article of the future.

Supplementary data has become "a limitless bag of stuff!", often with the data locked in pdf. Supplementary information is adversely affecting the review process, in that it puts extra pressure on authors, reviewers and readers. There has been a 65% increase in supplementary information between 2008 and 2011. Sometimes it's only tenuously linked to the article, or it can be integral to the article, but put in supplementary information due to stringent journal space restrictions.

Nature Neuroscience will be trialling a new type of paper from April 2012, where the authors will submit one seamless article, putting all of the essential information into it. Editors will then work with the referees and the authors to determine what elements should stay in the paper, and what should be considered supplementary. The plan is that it will make people think what's integral to the paper and ensure all the information submitted is peer-reviewed.

Nature are also investigating an extended on-line version of articles (in html and pdf) where there can be up to 14 extra figures or tables included.

Nature Chemistry was shown as an example: they publish a lot of compounds, where the synthetic procedure for the compounds is in the supplementary information, and gets pulled through to the on-line article in an interactive way.

Linking and integration between journals and data repositories is important. NPG are looking for bidirectional linking between article and data, and are seeking more serious, interactive integration.

NPG has a condition that "authors are required to make materials, data and associated protocols promptly available to others without undue qualifications". It also "strongly recommends" data deposit in subject repositories.

Regarding data publications, the call for data to be a first class scientific object was acknowledged, along with the interest publishers now have in data (as shown by the increasing number of fledgeling data publications).

Data papers were described as being a detailed descriptor of the dataset, with no conclusions, instead focussing on increasing interoperability and reuse. The data should be held in a trusted repository (definition of trusted to be defined!), with linking and integration between the paper and data. Credit would be given through citation for data producers, and would also provide attribution and credit for data managers, who might not qualify for authorship of a traditional paper.

The conclusions:

Linking publications and data strengthens the scientific record and improves transparency

Funders' policies are a key driver for integrating data and publications

Journals can and do influence data deposition

Not a situation of one size fits all!

Partnerships are important (institutions, repositories, publishers, researchers, funders), but the roles are not well established, and business models need to be determined.