Event Report: Managing Research Data Hack Day

DevCSI worked with the JISC Managing Research Data Programme and the JISC Orbital Project to organise a Managing Research Data hack event in Manchester on 3rd-4th May 2012. The event was designed to bring together software developers, project managers, data librarians and other experts with an interest in managing research data to share, talk, collaborate and create useful solutions.

Participants were encouraged to develop ideas, paper prototypes or even working code to address some of the issues raised by delegates from a range of different projects. A prize was available for the best idea, with the winners receiving their expenses paid to get together and develop their idea further. There were also opportunities to share skills throughout the event.

The event followed a relaxed hack event format, opening with a series of lightning talks from participants describing their projects and areas of interest, followed by a period of brainstorming, development into the evening, and reporting back to the group to gather feedback.

Lightning Talks

History Data Management Plan (HDMP) Project

John Nicholls, University of Hull

Nicholls described himself as an example of “the reason we are all here.” He represents history researchers at the University of Hull, where he works as the data manager on the JISC-funded HDMP project. This involved working with the university’s library services to create useable data sets from the information collected by ordinary historians, and has resulted in the formulation of a history data management plan, which they are now using to help inform new projects so the researchers can put their data into a useable format from the outset. He was able to offer examples of the historical data for developers to experiment with during the event, but appealed for information about the tools available for managing data that an ordinary researcher could use, and asked how he might engage in the documenting process.

MongoDB

Nick Jackson, University of Lincoln

Jackson offered to run a crash course for those who were interested in storing and querying data in MongoDB, a NoSQL database. He provided a brief overview of the benefits of MongoDB, which he argued was massively scalable, agile and flexible, and explained why it is useful for handling research data.

PIMMS (Portable Infrastructure for the Metafor Metadata System)

Gerard Devine, National Centre for Atmospheric Science (NCAS), University of Reading

PIMMS provides institutions with tools to capture information about the workflow of running simulations, from the design of experiments to their implementation by running simulation models.

Devine explained how this works within his own research area of climate modelling, where the outputs are so large and complex that past strategies for understanding the models and the limited available metadata are no longer sufficient. The PIMMS project has created a system to help document this climate data, including a web form to help describe all the aspects of the experiments and models according to a set vocabulary. These can then be used in portals which can understand the schema and expose the information in different ways.

He noted that mapping data to metadata has been a particular problem, so he was interested in working with people who have experienced similar issues or found solutions.

Database as a Service implemented in Oxford University

Asif Akram, Oxford University

Akram outlined the Virtual Infrastructure with Database as a Service (VIDaaS) project, which allows users to create a project and upload, for example, Microsoft Access databases or Excel spreadsheets. The system then converts these into an online database in the cloud, which can be modified and shared using a simple user interface. He described the three tools they have created to make this process simple: a database migration tool, a Microsoft Access database converter, and an SQL Designer to help researchers create a working SQL database using drag-and-drop tools.

ORCID and DataCite Interoperability Network (ODIN)

John Kay, The British Library

Kay described his role as a social sciences curator at the British Library and his work with DataCite, which creates persistent identifiers for datasets so they can be cited. They have just received funding for a project to take DataCite forward, which will include working on interoperability between other systems, such as ORCID.

He was able to offer some APIs for developers to play with at the event, and the opportunity to mint DataCite DOIs.

REWARD

Brian Hole, Ubiquity Press

Brian Hole described the JISC-funded REWARD project, which aims to incentivise researchers to deposit their data without introducing any new steps into their everyday procedures. He provided an overview of how this worked, including submission to the Journal of Open Archaeological Data and ePrints. He outlined some of the recommendations resulting from the project, including the need for more training of library staff to customise ePrints to accept data more neatly.

Hole also provided an overview of the Journal of Open Archaeological Data and the benefits this gives to researchers as a peer reviewed journal which guides researchers through the process of finding an acceptable repository and issuing a data paper that can be cited in traditional papers.

He was particularly interested in collaborating with others to consider some of the issues identified by the project, including minting their own identifiers.

YouShare

Aaron Turner, University of York

YouShare is a HEFCE-funded project to provide an environment for researchers to share programs and data and apply programs to their data.

Turner described their current efforts to link the front end that researchers see to a standards-compliant archival system. This will move a data set into various tiers of the archival system when it is not being used, then bring it back into the live system when people want to access it to carry out further experiments. He provided a demonstration of the interface, showing how to create workflows using YouShare and publish DOIs to facilitate citations.

He was looking to form collaborations to discuss issues associated with data ingest to archival systems during the hack event.

Data.bris

Damian Steer, University of Bristol

Steer described the cluster system at the University of Bristol, where they have high performance computing and persistent storage for researchers. Research projects can apply for storage, nominate a data steward, and receive 5 TB of free storage. He observed that quite a few people are already using it, including arts and humanities researchers.

The data.bris project aims to create an interface to help researchers use some of this infrastructure and make deposits with metadata. They are looking to add submission via SWORD, despite internal policy questions, and intend to start looking at packaging data sets and integrating with the PURE system by Atira.

DataStage

Sander van der Waal, OSS

Van der Waal outlined the two software components of the DataFlow project: DataStage and DataBank. DataBank is an institutional repository system, and DataStage is the step before that: before researchers are ready to publish a data set, they are working with data which needs to be stored and managed. DataStage helps researchers manage departmental data at a local level, in a similar way to Dropbox, by providing external backup and version control. Van der Waal explained that he would like DataStage to push data to other SWORD-compliant repositories, and appealed for people interested in connecting repositories using SWORD to collaborate.

Using dSpace

Ian Wellaway, University of Exeter

Wellaway described work at the University of Exeter, where they are using dSpace with Oracle. He observed that the submission process is a bit clunky, so they have been looking at easy deposit. The problem they have encountered is that a lot of researchers have big data sets, which they are struggling to get into the repository over HTTP. This causes frustration and inevitably puts people off submitting. He appealed for help from people who have solved, or have an interest in solving, a similar problem.

DMPOnline

Monica Duke, DCC

Duke introduced the DMPOnline tool developed by the Digital Curation Centre to help researchers create the data management plans that many funders now request. The DCC are looking to create an API for this, and she has been involved with some thinking about how people might interact with DMPOnline via this API. She was interested in talking further with anyone who wants to get data in or out of DMPOnline, or thinks their system should be interacting with it. She also promoted a forthcoming workshop at Open Repositories 2012 which will be exploring this further.

Biomedical Research Infrastructure Software Service kit (BRISSkit)

Malcolm Newbury, Guildfoss

Newbury outlined the BRISSkit, a suite of applications to support the entire clinical study process, including CiviCRM to recruit participants, CA Tissue, which tracks assets (blood, samples etc.), and Informatics for Integrating Biology and the Bedside (I2B2), which decomposes information about each patient, adds an ontology and allows researchers to query the data based on that ontology. These tools have all been integrated so information can travel between the applications, and the full application set can now be provisioned in the cloud.

Newbury observed that they still have some challenges, including generating the unique numbers that are attached to samples, and integrating the applications in a way that does not slow them down, so they are currently looking into open source ways of orchestrating that integration.

Ideas

A number of ideas were shared before the event. These were summarised briefly before the group began to brainstorm new ones on the ideas wall. A complete list of all the ideas shared before and at the event can be found on the MRD Hack Days Ideas page.

Teams

Several broad teams formed to discuss the ideas further and suggest potential projects to work on throughout the rest of the event.

Data Activity Stream

A group worked on a proof-of-concept for a centralised service for tracking activity data around research projects and individual datasets. This would allow researchers to see what others have been doing with particular data objects, together with a stream of information about activity within the project as a whole.
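To make the idea concrete, the following is a minimal in-memory sketch of such a service. It is not the group's API: the class names, field names and verbs are all hypothetical, chosen only to show the two views described above (activity on one data object, and activity across a whole project).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Activity:
    """One event in the stream. All field names are hypothetical."""
    actor: str
    verb: str
    project: str
    dataset: Optional[str]  # None for project-level events
    when: datetime


class ActivityStream:
    """Keeps events in memory; a real service would persist them."""

    def __init__(self) -> None:
        self._events: List[Activity] = []

    def record(self, actor: str, verb: str, project: str,
               dataset: Optional[str] = None,
               when: Optional[datetime] = None) -> Activity:
        # Default to the current UTC time when no timestamp is supplied.
        event = Activity(actor, verb, project, dataset,
                         when or datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def for_dataset(self, dataset: str) -> List[Activity]:
        # Newest first: what has been done with this particular data object.
        matches = [e for e in self._events if e.dataset == dataset]
        return sorted(matches, key=lambda e: e.when, reverse=True)

    def for_project(self, project: str) -> List[Activity]:
        # Newest first: everything that happened within the project.
        matches = [e for e in self._events if e.project == project]
        return sorted(matches, key=lambda e: e.when, reverse=True)
```

A caller would use `record(...)` whenever something happens to a dataset, then read `for_dataset(...)` or `for_project(...)` to build the two streams.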

In this video interview, Nick Johnson explains the concept in more detail and outlines their progress during the event, which included building a working API.

SWORD 2

This group decided that the problem with SWORD 2 and big data is the resumption problem that is fundamental to HTTP. They discussed how they might send a SWORD request asking the server to get content via some other mechanism, such as a BitTorrent client, FTP or Dropbox. Discussion with the wider group generated positive feedback about BitTorrent as a good route to handle big data. The group experimented with this during the event to test their reasoning.
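One reason a piece-based protocol copes better with interrupted transfers can be sketched in a few lines. The example below is an illustration only: it is not part of SWORD 2 and not the group's code, and the function names, the tiny piece size and the choice of SHA-256 are all assumptions. The sender publishes a hash per fixed-size piece; after an interruption the receiver compares its partial copy against those hashes and re-fetches only the pieces that are absent or damaged, rather than starting again.

```python
import hashlib
from typing import List

# Tiny piece size so the example is easy to follow; a real transfer
# would use far larger pieces.
PIECE_SIZE = 4


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> List[str]:
    """Return one SHA-256 hex digest per fixed-size piece of data."""
    return [
        hashlib.sha256(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]


def missing_pieces(expected: List[str], partial: bytes,
                   piece_size: int = PIECE_SIZE) -> List[int]:
    """Return indices of pieces that are absent or corrupt in partial."""
    have = piece_hashes(partial, piece_size)
    return [
        index for index, digest in enumerate(expected)
        if index >= len(have) or have[index] != digest
    ]
```

For a ten-byte payload split into pieces of four bytes, a receiver that holds only the first six bytes has one verified piece and would need to request the second and third pieces again.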

In this video interview, Damian Steer outlines the progress made by the group and the issues

Academic Dropbox

Also connected with the issue of big data, a separate group discussed the potential of an academic Dropbox, using a client rather than a server-based pool approach. They explored a number of tools, including SparkleShare, and documented their survey of the issues in a series of blog posts.

In this video interview Joss Winn and Jez Cope reflect on some of these issues in more detail…

Metadata for Datasets

This group chose to explore existing metadata schemas to identify the minimum number of elements needed in a schema to accompany data transferred between repositories. They highlighted a potential use case involving a researcher who makes a deposit into a subject repository. From an institutional perspective it will be useful to know about all research outputs, so a basic common schema would allow information about the deposit to be shared between the subject repository, the institutional repository, and any other interested repository, such as the British Library. They also noted that this may be useful if the data is held in more than one place, helping to make it clear where the citable data is held and which versions are copies.
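As a rough illustration of what such a minimal record might look like, the sketch below pairs a handful of descriptive elements with a list of places the data is held. The element names are assumptions made for this example, not the schema the group arrived at, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Holding:
    """One place where a copy of the dataset is kept (hypothetical)."""
    repository: str
    url: str
    is_citable_copy: bool = False


@dataclass
class DatasetRecord:
    """A deliberately small set of elements; the choice is an assumption."""
    identifier: str
    title: str
    creators: List[str]
    publisher: str
    publication_year: int
    holdings: List[Holding] = field(default_factory=list)

    def citable_holding(self) -> Holding:
        # Exactly one holding should be marked as the citable copy, so
        # other repositories can tell it apart from mirrors.
        citable = [h for h in self.holdings if h.is_citable_copy]
        if len(citable) != 1:
            raise ValueError(
                "expected exactly one citable copy, found %d" % len(citable)
            )
        return citable[0]
```

A subject repository could send a record like this to an institutional repository, which would then know the deposit exists and which location to cite, without needing the data itself.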

The group speculated that an extension of this work could allow people to “follow” a particular researcher in a social media style.

In this video interview, Brian Hole describes the progress they made during the hack event and how they see this developing in the future…

Other Work

During the event there were a number of discussions about issues associated with identifiers. Whilst these did not lead to a working project group, they covered useful ground and led to solutions to some of the specific problems participants brought to the hack event with them.

In this video interview, Gerard Devine from the PIMMS project describes one such outcome…

Conclusions

One of the key outcomes from the event was a consensus about the need for a different paradigm to deal with moving and managing big data, compared to smaller data sets or multiple small data sets. Exploring these issues, and identifying where projects and institutions are encountering similar problems, proved to be one of the most useful aspects of the event for all participants.

Participant Responses

A number of participants blogged about this event from their own perspectives: