Really interesting article that proposes a model that distinguishes the characteristics of the object, its “state” (the external, objectively determinable, characteristics), from the subjective “standing” (the position, status, or reputation) granted to it by different communities.

Fascinating article about the history of academic journal peer review, and the societal pressures that have made peer review the "gold standard" of academic credibility, with some discussion of how it's creaking at the seams.

Issues with assessing multi-file datasets (with files in different formats), the quality of metadata (how to evaluate when metadata is insufficient versus rich), and how to define the use of standard vocabularies

I was asked to present at an Association of Learned and Professional Society Publishers seminar back in March this year. You can find my presentation slides here, and the audio of my presentation here.

I've info-dumped my notes on the various talks below, but to sum up, it was a very interesting seminar that seemed to go down well with an audience of primarily publishers, many of whom were getting to grips with this whole data thing for the first time.

William Kilbride, Digital Preservation Coalition

* "Access is not an event, it's a process"
* Standing on someone's shoulders is quite precarious! We need a stable and secure platform - but how do we make one?
* Solutions for digital preservation need to be put in place at the beginning of the lifecycle
* Discussions with publishers can get bogged down in Open Access issues
* Small publishers hold the content that's most at risk
* We need action on Open Access! We've talked about it lots already
* International profile is important

Mark Thorley, NERC

* The digital, networked world is a real game changer. People want online access now, and for free. And anyone can "publish" anything on the web
* Open research is not an admin overhead
* The data revolution is replaying the printing revolution established by Gutenberg's mechanical, moveable type
* ICSU's report "Open Data in a Big Data World"
* Open research costs money - we have to learn to live with that
* Technology is the "easy bit" - people are complicated!

Robert Gurney, University of Reading

* The cloud approach is developing fast in environmental data - visualisation of data (especially large quantities of data) is very important
* Infrastructure as a service provides easy access to resources
* Problems in Big Data - volume, variety, veracity
* The Belmont Forum:
    * is set up to allow common cross-national calls - their data policy and principles are published on the web
    * is establishing a data and e-Infrastructure coordination office
    * is creating a common enhanced data plan
    * is planning scoping workshops and international calls for case studies, to share infrastructure and develop best practice
* NERC are leading the effort on a cross-disciplinary training curriculum to expand human capacity. This will involve the UN training agency, and there will be an open call for a training champion
* The Belmont Forum implementation plan is published

Phil Jones, Digital Science

* We are moving from cottage industry to industrial scale science, but funding structures are more set up to support cottage industry science.
* Valen, Blanchat, figshare, 2015 - Survey of data policies for funders across the UK and USA
* Open Academic Tidal Wave is moving from recommendations to enforcement
* Data repositories have different approaches - structured versus unstructured
* Publishers only have a limited window of time to engage with researchers during the research workflow - but new tools are emerging that allow publishers to interact with researchers over a longer period
* If we want compliance, the simpler we can make the tools to do it, the better

Peter Burnhill, EDINA

* Increasingly, references point out to the wild web, not just back to other articles
* Scholarly record always has a fuzzy edge
* Libraries no longer have e-collections, only e-connections
* Mostly it's big publisher content being archived - we don't know if the small stuff is. Research libraries that archive content aren't going after the long tail published by small publishers
* Reference rot = link rot + content drift
* analysed ~1 million URI links - tested whether each URI still worked, and whether there was a "memento" of that reference in the "archived web"
* ~75% not archived within 14 days of publication
* Klein 2014, PLOS One - "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot"
* rotten references mean defective articles!
* author workflow - note taking software, working with Zotero
* Publishers should accept robust links in cited reference, avoid reference rot by triggering archiving of snapshots and inserting Hiberlinks/robust links at the point of ingest into submission system.
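The "reference rot = link rot + content drift" distinction above can be sketched as a simple classification - a toy sketch with a hypothetical helper, nothing to do with the actual Hiberlink tooling: a link has rotted if its URI no longer resolves, and drifted if it resolves but the content no longer matches what was cited.

```python
import hashlib

def classify_reference(status_code, cited_hash, current_body):
    """Classify a cited link as intact, link rot, or content drift.

    status_code: HTTP status from re-fetching the cited URI
    cited_hash: SHA-256 hex digest of the page at citation time
    current_body: bytes of the page as fetched now (None if unreachable)
    """
    if current_body is None or status_code >= 400:
        return "link rot"        # the URI no longer resolves
    current_hash = hashlib.sha256(current_body).hexdigest()
    if current_hash != cited_hash:
        return "content drift"   # URI works, but the content has changed
    return "intact"

# Example: a page whose content changed since it was cited
cited = hashlib.sha256(b"original page").hexdigest()
print(classify_reference(200, cited, b"revised page"))  # content drift
print(classify_reference(404, cited, None))             # link rot
```

Archiving a snapshot at the point of citation (as the Hiberlink/robust-links proposal suggests) is what makes the "drift" case recoverable at all.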

Mike Taylor, Elsevier

* Research data metrics - interest has exploded in past few years
* NISO data metrics recommendations - NISO set up 3 working groups:
    * a "metrics for non-traditional outputs" group
    * recommending that dataset download usage be reported using COUNTER-compliant formulations, and that funders support repositories to do this
* Elsevier is adapting its research infrastructure to deal with research data - much easier to set up new products than to adapt existing systems!
* Ambitions for next year:
    * most Elsevier journals promoting data publishing with data policies
    * submission system to support data citations and data submissions
    * communicate what's being done
* Data metrics are part of the value loop encouraging researchers to make their data available (also including data)
    * metrics based on data citation will be happening in the near future, as soon as the infrastructure is built
* Not just one metric!
    * article level metrics
    * journal level metrics
    * the more metrics, the harder it is to hide things - multiple metrics give multiple points of view

Josh Brown, Orcid

* CRediT schema - update ORCID schema to include other research roles e.g. data etc.
* Contributor type badges
* project-thor.eu
* need PIDs for organisations
* issues with versioning, identifier equivalence, granularity, changes over time, making cultural changes mainstream
* all research activities need to be taken into account
* we can't reward it if we don't recognise it
* we won't recognise it if we can't agree on what it is

Matthew Addis, Arkivium

* direct benefit to researchers in getting involved with digital preservation
* tools and services exist now that allow researchers to get on and do digital preservation
* 44% of links to Astronomy data broken after 10 years
* Researchers really only get judged on how much grant money they bring in and how many publications they produce - digital preservation will help with both
* Lots of tools and models out there, but not particularly helpful for most researchers. Too much choice!
* do the bare minimum to get benefits from digital preservation - parsimonious preservation:
    * know what you have - understand the formats, catalogue the data
    * put it somewhere safe
* link rot - how to address it?
* Droid - file format identification tool, can generate xml/pdf reports. Metadata includes links to PRONOM - technical registry for file formats
* checksums - useful to establish if data has been lost/corrupted. Tools e.g. exactly - creates BagIt manifest of files
* ADMIRe survey at Nottingham
* make lots of copies to keep stuff safe - put them in places like institutional repositories...
* links are important. DOIs are dependent on URLs, which are as brittle as any URLs - lots of links compensate for reference rot
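The "know what you have / put it somewhere safe" advice above can be sketched in a few lines of Python. This is a crude stand-in - Droid matches files against the full PRONOM signature registry, and Exactly writes complete BagIt bags - and the helper names are hypothetical:

```python
import hashlib
from pathlib import Path

# A few toy magic-byte signatures - Droid uses the PRONOM registry instead
SIGNATURES = {
    b"%PDF": "PDF document",
    b"\x89PNG": "PNG image",
    b"PK\x03\x04": "ZIP container (could be docx/xlsx/odt...)",
}

def identify(path):
    """Guess a file's format from its leading bytes."""
    head = Path(path).read_bytes()[:8]
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"

def manifest_line(path):
    """One BagIt-style manifest line: '<sha256>  <path>'."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return f"{digest}  {path}"
```

Running `identify` over a directory and writing the `manifest_line` output to a `manifest-sha256.txt` file gives you a minimal, re-checkable catalogue of what you hold - enough to notice later if anything has been lost or corrupted.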

* Lots of different types of data journals and data papers
* Data paper describes the research context of a dataset
* Presentation of a data paper should look attractive - more user-friendly than the view of the dataset in the archive
* Variety of interactive data visualisation - make the data more alive
* publishing data in Mendeley Data - Elsevier aren't making it obligatory to publish data in Mendeley Data

Being fairly new to the business of being an editor, and with the workshop being so local, I took the opportunity to go, and found it all really useful. Not only from my perspective as someone in charge of a journal, but also from the data management and publication point of view. A lot of the issues raised during the workshop, like attribution, authorship, plagiarism etc. are just as easily applied to datasets as they are to journal articles.

The workshop was a mixture of talks and discussion sessions, where we were given examples of actual cases that COPE had been told about, and we had to discuss and decide what the best course of action was. Then we were told what the response from the COPE members was in those particular cases - reassuringly we were pretty much in agreement in all cases!

It was a pretty standard workshop format - lots of talks, but there were a wide variety of speakers, coming from a wide spread of backgrounds, which really helped make people think about the issues involved in data visualisation. I particularly enjoyed the interactive demonstrations from the speakers from the BBC and the Financial Times - both saying things that seem really obvious in retrospect, but are worth remembering when doing your own data visualisations (like keep it simple, and self contained, and make sure it tells a story).

For those who are interested, I've copied my (slightly edited) notes from the workshop below. Hopefully they'll make sense!

Richard O’Beirne (Digital Strategy Group, Oxford University Press)

What is a figure? A scientific result converted into a collection of pixels

Data FAIRport initiative - to join and support existing communities that try to realise and enable a situation where valuable scientific data is ‘FAIR’ in the sense of being Findable, Accessible, Interoperable and Reusable

Hard to make visualisations scale!

Open data and APIs make it easier to understand the context behind the stories

Keep interaction simple - remember different devices are used to access content

Rowan Wilson (Research Technology Specialist, University of Oxford)

Creating cross walks for common types of research data to get it into Blender

People aren't that used to navigating around 3-dimensional data - example imported into Minecraft (as a sizeable proportion of the population is comfortable with navigating around that environment)

Issues with confidentiality and data protection, data ownership, copyright and database rights. Open licenses are good for data, but consider waiving a hard requirement for attribution, as cumbersome attribution lists will put people off using the data

MeshLab - tool to convert scientific data into Blender format
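As an illustration of the kind of crosswalk being described (not the actual scripts used), here's a minimal sketch that turns a 2-D grid of values - say, elevation samples - into Wavefront OBJ text, a mesh format that both MeshLab and Blender import directly:

```python
def grid_to_obj(heights):
    """Convert a 2-D grid of height values into Wavefront OBJ text.

    heights: list of rows, each a list of z values on a unit x/y grid.
    Emits one vertex per sample, then quad faces stitching neighbouring
    samples together.
    """
    rows, cols = len(heights), len(heights[0])
    lines = []
    for y, row in enumerate(heights):
        for x, z in enumerate(row):
            lines.append(f"v {x} {y} {z}")
    for y in range(rows - 1):
        for x in range(cols - 1):
            # OBJ vertex indices are 1-based
            a = y * cols + x + 1
            b, c, d = a + 1, a + cols + 1, a + cols
            lines.append(f"f {a} {b} {c} {d}")
    return "\n".join(lines) + "\n"

print(grid_to_obj([[0, 1], [1, 2]]))
```

Saving the output as a `.obj` file and importing it gives a navigable surface - the same idea, scaled up, behind putting research data into Blender (or indeed Minecraft) for exploration.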

Felix Krawatzek (Department of Politics and International Relations, University of Oxford)

Visualising 150 years of correspondence between the US and Germany

Letters (handwritten/typed) need significant resource and time to process them before they can be used

Software produced to systematically correct OCR mistakes
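A systematic OCR-correction pass can be sketched with nothing more than the standard library's fuzzy matching. This is a toy version, not the software actually produced for the project, and the vocabulary here is made up (the real thing would need period dictionaries, in German as well as English):

```python
import difflib

# Toy vocabulary - a real pipeline would load full word lists
VOCAB = ["letter", "Germany", "America", "family", "harvest"]

def correct_token(token, vocab=VOCAB, cutoff=0.75):
    """Replace an OCR token with its closest vocabulary word, if any.

    Falls back to the original token when nothing is similar enough,
    so correctly-read rare words (and numbers) are left alone.
    """
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(correct_token("Germeny"))  # Germany
print(correct_token("1876"))     # 1876 (no close match - kept as-is)
```

The `cutoff` parameter is the interesting knob: too low and correct rare words get "fixed" into wrong ones, too high and genuine OCR errors slip through.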

Visualise the temporal dynamics of the letters

Visualisation of political attitudes

Can correlate geographic data from the corpus with census data

Always questions about availability of time or resources

Crowdsourcing projects that tend to work are those that appeal to people's sense of wonder, or their human interest. You get more richly annotated data if you can harness the power of crowds.

Zooniverse created a byline to give the public credit for their work in Zooniverse projects

Andrea Rota (Technical Lead and Data Scientist. Pattrn)

Origin of the platform: the Gaza platform - documenting atrocities of war, humanitarian and environmental crises

Paper on survey about reproducibility - "More than 70% of researchers have tried and failed to reproduce another scientist's experiments, and more than half have failed to reproduce their own experiments."

The problem with doing closing talks is that so much of what I wanted to say had pretty much already been said by someone during the course of the day - sometimes even by me during the breakout sessions! Still, it was a really interesting workshop, with excellent discussion (despite the pall that Brexit cast over the coffee and lunchtime conversation - but that's a topic for another time).

There were three breakout session possibilities; the timings meant that you could go to two of them.

I started with Group 3: Making possible and encouraging the reuse of data: incentives needed. This is my day job - taking data in from researchers, making it understandable and reusable, and figuring out ways to give them credit and rewards for doing so. And my group has been doing this for more than 2 decades, so I'm afraid I might have gone off on a bit of a rant. Regardless, we covered a lot, though mainly the old chestnuts of the promotion and tenure system being fixated on publications as the main academic output, the requirements for standards (especially for metadata - acknowledging just how difficult it would be to come up with a universal metadata standard applicable to all research data), and the fact that repositories can control (to a certain extent) the technology, but culture change still needs to happen. Though there were some positives on the culture change - I noted that journals are now pushing DOIs for data, and this has had an impact on people coming to us to get DOIs.

Next breakout group I went to was Group 1: Research Data services planning, implementation and governance. What surprised me in this session (maybe it shouldn't have) was just how far advanced the UK is when it comes to research data management policies and the like, in comparison to other countries. This did mean that my UK colleagues and I got quizzed a fair bit about our experiences, which made sense. I had a bit of a different perspective from most of the other attendees - being a discipline-specific repository means that we can pick and choose what data we take in, unlike institutional repositories, who have to be more general. On being asked about what other services we provide, I did manage to name-drop JASMIN, in the context of a UK infrastructure for data analysis and storage.

I think the key driver in the UK for getting research data management policies working was the Research Councils, and their policies, but also their willingness to stump up the cash to fund the work. A big push on institutional repositories was EPSRC's putting the onus on research institutions to manage EPSRC-funded research data. But the increasing importance of data, and people's increased interest in it, is coming from a wide range of drivers - funders, policies, journals, repositories, etc.

I understand that the talks and notes from the breakouts will be put up on the workshop website, but they're not up as of the time of me writing this. You can find the slides from my talk here.

Friday, 2 October 2015

Last week was the 6th Plenary of the Research Data Alliance, held in Paris, France. It officially started on the Wednesday, but I was there from the Monday to take advantage of the other co-located events.

This workshop consisted of a quick-fire selection of presentations (12 of them!) all in the space of one afternoon, covering such topics as busting DOI myths; persistent identifiers other than DOIs; persistent identifiers for people (including ORCIDs and ISNI - including showing Brian May's ISNI account, linking his research with his music); persistent identifiers for use in climate science; the International Geo Sample Number (IGSN) - persistent identifiers for physical samples; the THOR project - all about establishing seamless integration between articles, data, and researchers across the research lifecycle; and Making Data Count - a project to develop data-level metrics.

(I also learned that DOIs are also assigned to movies, as part of their supply chain management)
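Given how much of the workshop revolved around DOIs, it's worth noting how simple the identifier itself is. A minimal sketch (hypothetical helper names, and a made-up DOI for illustration) that splits a DOI into its registrant prefix and suffix, and builds the resolver URL:

```python
def parse_doi(doi):
    """Split a DOI into its registrant prefix and suffix.

    Assumes the standard '10.<registrant>/<suffix>' form and raises
    ValueError otherwise. The suffix is opaque to the resolver.
    """
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a valid DOI: {doi!r}")
    return prefix, suffix

def resolver_url(doi):
    """Resolution is just an HTTPS GET against the doi.org resolver."""
    return "https://doi.org/" + doi

prefix, suffix = parse_doi("10.1234/example.5678")
print(prefix, suffix)            # 10.1234 example.5678
print(resolver_url("10.1234/example.5678"))
```

The persistence, of course, lives not in the string but in the registration agencies keeping the prefix-to-landing-page mappings up to date - which is exactly why the myth-busting talks were needed.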

Questions were collected via Google doc during the course of the workshop, and have all since been answered, which is very helpful! I understand that the slides presented at the workshop will also be collected and made available soon.

This was a day long event featuring several parallel streams. Of course, I went to the stream on Research Data infrastructures for Environmental related Societal Challenges, though I had to miss the afternoon session because of needing to be at the RDA co-chairs meeting (providing an update on my Working Group and also discussing important processes, like, what exactly happens when a Working Group finishes?) Thankfully, all the slides presented in that stream are available on the programme page.

Unsurprisingly, a lot of the presentations at this workshop dealt with the importance of e-infrastructures to address the big changes we'll need to face as a result of things like climate change. There was also talk about the importance of de-fragmenting the infrastructure, across geographical, technological and domain boundaries (RDA being a key part of these efforts).

A common thing in this, and the other RDA meetings, were analogies between data infrastructures and other infrastructures, like those for water or electricity. Users aren't worried about how the water or power gets to them, or how the pipes, agreements and standards are created. They just want to be able to get water when they turn the tap, and electricity when they flick a switch. Another interesting point was that there's a false dichotomy between social and technical solutions; what we really have is a technical solution with a social choice attached to it.

Common themes across the presentations were the sheer complexity of the data we're managing now, whether it's from climate science, oceanography, agriculture, and the needs to standardise, and fill in those gaps in infrastructure that exist now.

As ever, the RDA plenaries are a glorious festival of data, with many, many parallel streams, and even more interesting people to talk to! It's impossible to capture the whole event, even with my pages of notes.

If I can pick out a few themes though, these are them:

Data is important to lots of people, and the RDA is a key part of keeping things going in the right direction.

Infrastructures that exist aren't always interoperable - this needs to be changed for the vast quantities of data we'll be getting in the future.

The RDA is all about building bridges, connecting people and creating solutions with people, not for them.

Axelle Lemaire, Minister of State for Digital Technology, French Ministry of Economy, Industry and Digital Technology, said that people say that data are the oil of the 21st century, but this isn't such a good comparison – better to compare it to light – the more light gets diffused, the better it is, and the more the curtains are open the more light gets in. She is launching a public consultation on a digital bill she's preparing and is looking for views from people outside of France - the RDA will distribute the information about this consultation at a later date.

It's interesting now that the RDA has matured to the point that several working groups are either finished, or will be finished by the next plenary (though there is still some uncertainty what "finished" actually means). Given the 18 month lifespan of the working groups - that's enough time to build/develop something, but the actual time to get the community to adopt those outputs will be a lot longer. So there was plenty of discussion about what outputs could/should be, and how the adoption phase could be handled. I suspect that, even with all our discussions, no definite solution was found, so we'll have another phase of seeing what the working groups decide to do over the next few months.

This is of particular relevance to me, as my working group on Bibliometrics for Data is due to finish before the next plenary in March. We had a packed meeting room (standing room only!) which was great, and we achieved my main aim for the session, which was to decide what the final group outputs would be, and how to achieve them. Now we have a plan - hopefully the plan will work for us!

A key part of that plan is collecting information about what metrics data repositories already collect - if you are part of a library/repository, please take a look at this spreadsheet and add things we might have missed!

I went to the following working group and Birds of a Feather meetings:

We have a plan for our group outputs, which will basically map the landscape for data bibliometrics as it stands - identifying what needs to be done, and the other groups that are addressing aspects of this problem (which is a big one!)

Interesting stuff this, anthropologists and social scientists looking at how we deal with data as humans. Not directly relevant to me, but I think I'll keep half an eye on it purely out of personal interest.

There were a few demonstrations made, which showed off how far the group has come in developing a potential new service.

Obviously, when ingesting links from several places, standards and interfaces are needed!

Supporting RDA women networking breakfast

An interesting meeting, despite it being held in a corner of the main marquee, so it was really difficult to get a proper conversation going. RDA is about 1/3 female, which is good, but given that more than 50% of Internet users are female, we need to be careful of the human aspect of our work. It was also very good to see several male RDA members in attendance too - this is not just a woman's issue!

Again, the fragmented landscape of repositories came up - we'll need to help people navigate it and find the best places for their data

There was some discussion about commercial data repositories, and the threat they pose to domain/institutional ones. My thoughts (as part of a domain repository) - I'd rather have the data with minimal metadata in a commercial repository than lost on a CD in a drawer somewhere. And the commercial companies are putting pressure on us to up our game. If we're losing researchers to them because it's easier to put data in the commercial repositories, then we either have to make it easier to put data into ours, or really explain why the pain is worth it!

We had a lot of discussion about the structure of the Publishing Data Interest Group, now that most of the Working Groups under its umbrella are coming to an end. Personally, I think there's still a lot that this group can do - we haven't touched on issues like peer review of data for example, plus implementation and adoption of the working group outputs is going to take a while. But having a refresh of the group is probably a good thing too.

So, that was RDA Plenary 6. Next plenary will be held in Tokyo, Japan from the 1st to the 3rd of March 2016. In the meantime, we've got work to be getting on with!