Where Are Our Data Backups Heading?

Taking care of data (collecting it, curating it, adding in all the metadata, and generally getting it into a state where other people might be able to work with it in the future) is a lot of work. Increasingly, researchers are being asked to make their data open and available on publication. I’ve written and talked before about the whole ecosystem, and why it’s hard to do a good job of making research data a first-class citizen among research outputs.

It’s clear that there are many reasons for all you researchers to become skilled at managing research data, but probably the most pragmatic of these is that it supports reuse of that data. If you have created some data, who is the person that is most likely to want to reuse that data, most likely to swear at you for doing a bad job with the data, and most likely to get annoyed when that data is no longer available? The simple answer is that it’s your future self, looking back a few months, or perhaps a few years later! So getting good at managing research data up front is actually going to save your future self quite a lot of headaches.

If you are working with data at scale within a large collaboration, the structures that are in place almost certainly have your back, but if you are primarily responsible for the creation and curation of your data, then there are a couple of trends that I think could be of interest to you.

The first is about the projected future of data storage. I’ve been fascinated by the astonishing decrease in the cost of data storage. In 1985 one gigabyte of storage cost US$71,000. By 2015 that cost had come down to just three cents. In those 30 years our basic understanding of the world has probably advanced in line with the pace of growth of human knowledge that we have seen over the past century, but our power to collect and analyse data has increased several fold.

Storage costs were falling at a supra-linear rate until about 2010. If that decrease had continued into the future, then by 2025 it would have cost only one dollar to store a terabyte of data, and by 2038 one dollar would have bought you a petabyte (you could have stored the PDFs of all the research papers in the world – about 28 TB – on a thumb drive!).
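As a sanity check on those numbers, here is some rough arithmetic. This is a sketch only: the two price points are the ones quoted above, and the decline is assumed to be a smooth exponential in between.

```python
import math

# Back-of-envelope check of the storage cost trend, assuming a steady
# exponential decline between the two figures quoted in the text:
# US$71,000 per GB in 1985 down to US$0.03 per GB in 2015.
cost_1985 = 71_000.0   # dollars per gigabyte, 1985
cost_2015 = 0.03       # dollars per gigabyte, 2015

halvings = math.log2(cost_1985 / cost_2015)   # how many times the cost halved
halving_months = 30 * 12 / halvings           # average months per halving

print(f"cost halved about every {halving_months:.0f} months "
      f"({halvings:.0f} halvings in 30 years)")
```

On those figures the price of storage halved roughly every year and a half for three decades, which is why even modest extrapolations land on numbers like a dollar per terabyte.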

In that imagined future my advice would have been something like: make many redundant copies of your research data, have your data with you at all times, and never have an excuse for losing your data. But we are seeing that price decrease tail off, and one very plausible reason is that many of us (and here I mean all of us in our normal, non-researcher, non-work personas) are increasingly using cloud services. The data we want – the song we are listening to, the article we are reading, the video we are watching – is streamed to us, and these objects hang around on our local devices just long enough to be consumed, only then to be relegated to some holding pen on our machines, awaiting eventual overwrite.

The implication for research is clear: we should be moving our research data into cloud services, or at least backing up our local copies to a cloud service.

The cost of storing those hypothetical 28 TB of research papers in Google’s cloud platform is about $6,800 per year, and if you only need ready access to a portion of that data, keeping the rest in a “warm” state, that cost roughly halves. It’s rather more than the magical one-dollar mark, but it’s possibly within reach of a research budget, and given that most small-scale data comes nowhere near this size, the argument for cloud hosting at least a copy of your data becomes quite strong.
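For the curious, the arithmetic behind that figure looks roughly like this. The per-gigabyte prices are illustrative assumptions (around $0.02/GB/month for standard storage and $0.01/GB/month for “warm” storage); real cloud prices vary by provider, region, and year.

```python
# Illustrative storage prices (assumptions for this sketch, not current quotes):
standard_per_gb_month = 0.020   # "hot" standard storage, $/GB/month
warm_per_gb_month = 0.010       # "warm" infrequent-access storage, $/GB/month

corpus_gb = 28 * 1_000          # the hypothetical 28 TB of research PDFs

standard_year = corpus_gb * standard_per_gb_month * 12  # one year, all hot
warm_year = corpus_gb * warm_per_gb_month * 12          # one year, all warm

print(f"standard: ${standard_year:,.0f}/year")
print(f"warm:     ${warm_year:,.0f}/year")
```

The same arithmetic scales down reassuringly: a 100 GB dataset at these prices costs a couple of dollars a month to keep fully backed up.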

If you have some fluency with systems programming you can set up a sync to a service like Amazon Glacier, and if you don’t, then there are tools like Arq that can mostly take care of the heavy lifting of backing up your data. If you want to, and are willing to pay a bit more, a service like Dropbox or Apple’s iCloud will give you about one terabyte of backup for about $10 per month.

Another trend that I think is interesting is the growth in platforms and tools that want to help you collaborate on your research, or that make data sharing a lot easier. I sometimes hear people asking what is going to be the Dropbox for science or the GitHub for science, often missing the point that the answer is probably Dropbox and GitHub respectively. That aside, generic data repositories like Dryad, Zenodo and Figshare are just great, and they are by far the pieces of scholarly infrastructure to emerge in the last six years that I think are making a real difference to our data sharing habits.

One tool that I will mention is one that I’ve not used myself, but it’s just so cool that I have to tell you about it. It’s called Dat, and it uses something similar to the BitTorrent protocol to let you set up a peer-to-peer data sharing network right from your laptop. Multiple researchers can work together and have their data auto-synced over the wire, with data transfer speeds increasing along with the number of people you add to the network. You can read more here.

There are a lot of tools and strategies out there, and none of them provide a one size fits all approach. Sharing your experiences, and learning from others who seem to be doing a good job is always a good idea, so if you have the chance to participate in a Data Carpentry Workshop then do try to go along.

My final thought to leave you with is this: in a world where we can create, manage, and store volumes of data that go beyond our ability to navigate them naively, search and retrieval become critical factors. Skilling up on tools like Postgres or Elasticsearch, or looking to work with solutions providers like Merriam tech, may become critical for the future.
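To make that point concrete, here is a minimal sketch of full-text search over dataset metadata, using SQLite’s FTS5 extension (which ships with most Python builds). Postgres and Elasticsearch apply the same idea at far larger scale; the dataset records below are invented for illustration.

```python
import sqlite3

# Build a tiny full-text index over (invented) dataset metadata.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE datasets USING fts5(title, description)")
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?)",
    [
        ("coral survey 2014", "reef transect photos with depth metadata"),
        ("air quality logs", "hourly PM2.5 readings from city sensors"),
        ("mouse rnaseq", "raw fastq files plus alignment metadata"),
    ],
)

# A keyword query finds every record whose metadata mentions the term,
# without us having to remember which file lives where.
hits = conn.execute(
    "SELECT title FROM datasets WHERE datasets MATCH ?", ("metadata",)
).fetchall()
print(hits)
```

The value shows up exactly when your holdings outgrow what you can eyeball in a file browser: a query answers in milliseconds a question that scanning directories no longer can.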