I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Thursday, June 13, 2013

Brief talk at ElPub 2013

I was on the panel entitled Setting Research Data Free: Problems and Solutions at the ElPub 2013 conference. Below the fold is the text of my introductory remarks with links to the sources.
One of the few things that most people actually trying to preserve large amounts of digital content agree on is that the number-one problem they face is not technical but economic. Unlike paper, bits are very vulnerable to interruptions in the money supply. To survive, or in the current jargon to "be sustainable", a digital collection needs an assured stream of funds for the long term. Very few have it. You can tell people are worried about a topic when they appoint a "Blue Ribbon Task Force" to study it. We had such a task force. It reported 2 years ago that, yes, sustainable economics was a big problem. But the task force conspicuously failed to come up with credible solutions.

You, or at least I, often hear people say something similar to what Dr. Fader of the Wharton School Customer Analytics Initiative attributes to Big Data zealots:

Save it all - you never know when it might come in handy for a future data-mining expedition.

Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is on average probably small. The reason the Big Data zealot gets away with saying things like this is that he, and his audience, believe that this small value outweighs the cost of keeping the data indefinitely. They believe that storage is, in effect, free.

So how free does storage turn out to be? This concern has motivated a good deal of research into the costs of
digital preservation, efforts such as CMDP (PDF), LIFE, KRDS, PrestoPrime,
ENSURE, and others. Their conclusions differ, but broadly we can say that
typically about half the total cost is ingest, about one-third is
preservation, mostly storage, and about one-sixth is dissemination.

It is easy to understand why ingesting content is expensive, at least it
is easy if you have ever tried to do it on a production scale. There is a
lot of stuff to ingest. In the real world it is diverse and messy. People
want not just the content, but also metadata. This has to be either
manually generated, which is expensive, or extracted automatically, which
is a great way of revealing the messy nature of the real world.

It is easy to understand why disseminating content is a small
part of the total, because preserved content is, on average,
very rarely accessed.

Why has storage, an on-going cost that must be paid for the
life of the collection, been such a small part of the total
in the past?

The reason is this graph, showing Kryder's Law, which says that the areal
density of bits on disk platters has increased 30-40%/year for the last
30 years. The areal density doesn't have a one-to-one relationship with
the cost per GB of disk, but they are closely correlated. The effect has
been, for the last 30 years, that consumers got roughly double the storage
at the same price every two years or so.

If something goes on steadily for 30 years or so it gets built into
people's models of the world. For digital preservation, the model of the
world into which it gets built is that, if you can afford to store something
for a few years, you can afford to store it forever. The price per
byte of the storage will have become negligible.
Thus, the breakdown that has storage costs being one-third of the
total has built into it the idea that storage media costs drop so
fast that the one-third has only to pay for a few years of storage.

If you look on my blog, for example at the talk I gave at the UNESCO "Memory of the World" meeting last year, you will find
a lot of detailed explanation of the technological and economic
reasons why Kryder's Law has slowed, and will continue to slow,
and what this means for the cost of storing data for the long term.
But there's a much simpler argument to convey the basic idea.

This graph projects three numbers out over the next 10 years. The
red line is Kryder's Law, at 20%/yr. The blue line is the IT budget,
growing at 2%/yr. The green line is the annual cost of storing the
data accumulated since year 0 at a 60%/yr growth rate, all relative
to the value in the first year. 10 years from now, storing all the
accumulated data would cost over 20 times as much as it does this
year. If storage is 5% of your IT budget this year, in 10 years it
will be more than 100% of your budget. If you're in the digital
preservation business, storage is already way more than 5% of your
IT budget. It's going to consume 100% of the budget in much less
than 10 years.
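The arithmetic behind that claim can be sketched in a few lines. The accumulation model below is my reconstruction from the rates given above, not necessarily the exact model behind the graph:

```python
# Rates from the projection above.
KRYDER_RATE = 0.20   # storage cost per byte falls 20%/yr
BUDGET_RATE = 0.02   # IT budget grows 2%/yr
DATA_RATE = 0.60     # newly ingested data grows 60%/yr

def storage_cost(year):
    """Annual cost of storing everything accumulated since year 0,
    relative to the year-0 cost."""
    # Total data held: the sum of each year's additions, growing 60%/yr.
    data = sum((1 + DATA_RATE) ** k for k in range(year + 1))
    # Cost per byte falls by the Kryder rate each year.
    unit_cost = (1 - KRYDER_RATE) ** year
    return data * unit_cost

ratio = storage_cost(10) / storage_cost(0)
print(f"Year-10 storage cost: {ratio:.1f}x year 0")  # over 20x

# If storage was 5% of the IT budget in year 0:
share = 0.05 * ratio / (1 + BUDGET_RATE) ** 10
print(f"Storage share of the budget in year 10: {share:.0%}")  # over 100%
```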

Let's look at the economics of each of the three components.

Ingest: This is a one-time, up-front cost, so it can in principle be
grant-funded. The big cost is generating and validating
metadata that is good enough to allow sharing. This is hard to
automate, so it is expensive. The cost falls on the owner of
the data, but the benefits accrue to the re-user of the data.
This makes it hard to motivate the data owner to fill the gap
between metadata that is good enough for their own use, and
good enough for sharing and re-use.

Worse, the potential beneficiaries of this effort are
competitors for recognition and funding. The mechanisms for
getting credit for re-use of data don't work well, precisely
because they depend on the competitor to assign the credit,
which isn't in their interest.

Thus, for the data owner, the costs of re-use are likely to
exceed the benefits. And, unless the data owner takes
pro-active steps to market the data to competitors, re-use
isn't likely to happen.

Dissemination: If the data owner doesn't provide good metadata and doesn't
market their data well, it won't be accessed much and thus
the access costs will be low. It isn't hard to pay for the
outcome we don't want.

But if data, especially large data, gets popular the access
costs can be significant. The market price, just for data
transfer, is roughly $120/TB. These costs are borne by the
data owner, so the better job of marketing they do the more
costs they incur.

Further, these costs are unpredictable, so they are hard to
budget for. And they're an on-going cost that can't be grant
funded.

We all want the data to be open, but that implies that the
access costs can't be recovered from the readers. And selling
ads on data isn't a viable business model. Worse, much of the
data either is burdened with rosy projections of the IP that
could be generated from it, or contains personally identifiable
information, and so cannot be made open.
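To give a feel for the scale of those access costs, here is a toy calculation at the $120/TB transfer price quoted above; the dataset size and download count are invented for illustration:

```python
TRANSFER_PRICE = 120.0   # $/TB, the market price for data transfer

# Hypothetical: a 5 TB dataset downloaded 100 times in a year.
dataset_tb = 5
downloads_per_year = 100
annual_transfer_cost = dataset_tb * downloads_per_year * TRANSFER_PRICE
print(f"Annual transfer cost: ${annual_transfer_cost:,.0f}")  # $60,000
```

Double the popularity and the bill doubles too, which is exactly what makes these costs hard to budget for.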

Storage: This is an on-going cost but, unlike access, it is somewhat
predictable. This makes the endowment model possible, where
data is deposited together with a capital sum thought to be
adequate to pay for its storage "for ever". The endowment
model enables grant-funding of long-term data storage.
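A minimal sketch of the endowment arithmetic, as a present-value sum. The $100/TB/yr cost, the 3% interest rate, and the two Kryder rates are illustrative assumptions, not figures from the talk:

```python
def endowment(annual_cost, kryder_rate, interest_rate, years=100):
    """Capital deposited now that, earning interest_rate while the
    annual storage cost falls at kryder_rate, covers that cost for
    `years` years (a simple present-value sum)."""
    r = (1 - kryder_rate) / (1 + interest_rate)
    return sum(annual_cost * r ** n for n in range(years))

# Hypothetical $100/TB/yr storage cost, 3% interest:
fast = endowment(100, kryder_rate=0.20, interest_rate=0.03)
slow = endowment(100, kryder_rate=0.05, interest_rate=0.03)
print(f"Endowment at 20%/yr Kryder rate: ${fast:,.0f}/TB")
print(f"Endowment at  5%/yr Kryder rate: ${slow:,.0f}/TB")
```

The point of the exercise: if Kryder's Law slows from 20%/yr to 5%/yr, the required endowment roughly triples, which is why the slowdown matters so much for long-term preservation.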

The required endowment is so high that funders, used to Kryder's Law and
thus assuming that "storage will be free", are unlikely to
agree to fund storing everything. But allowing data owners
to select the data to be stored for re-use is a very bad
idea. We see that from the dire effects of selective publishing
of the results of drug trials. Thus either no data should be
shared, or all data should be shared.

Given these economic hurdles, one can expect that data sharing
will continue to be the exception rather than the rule, no
matter how much society might benefit.