Mine was the second, entitled Costs: Why Do We Care? It was an update and revision of The Half-Empty Archive, stressing the importance of collecting, curating and analyzing cost data. Below the fold, an edited text with links to the sources.

Introduction

I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford University Libraries, and I have two reasons for being especially happy to be here today. First, I'm a Londoner. Second, under the auspices of JISC the UK has been a very active participant in the LOCKSS program since 2006. As with all my talks, you don't need to take notes or ask for the slides. The text of the talk, with links to the sources, will go up on my blog shortly.

Why do I think I am qualified to stand here and pontificate about preservation costs? The LOCKSS Program develops, and supports users of, the LOCKSS digital preservation technology. This is a peer-to-peer system designed to let libraries collect and preserve copyright content published on the Web, such as e-journals and e-books. LOCKSS users participate in a number of networks customized for these and other forms of content including government documents, social science datasets, library special collections, and so on. One of these networks, the CLOCKSS archive, a community-managed dark archive of e-journals and e-books, was recently certified to the Trusted Repository Audit Criteria, equalling the previous highest score and gaining the first-ever perfect score for technology. The LOCKSS software is free open source; the LOCKSS team charges for support and services. On that basis, with no grant funding, for more than 7 years we have covered our costs and accumulated some reserves.

Because understanding and controlling our costs is very important for us, and because the LOCKSS system's Lots Of Copies approach trades more disk space for less of other resources (especially lawyers), I have been researching the costs of storage for some years.

Like all of you, the LOCKSS team has to plan and justify our budget each year. It is clear that economic failure is one of the most significant threats to the content we preserve, as it is even for the content national libraries preserve. For each of us individually the answer to "Costs: Why Do We Care?" is obvious. But I want to talk about why the work we are discussing over these two days, of collecting, curating, normalizing, analyzing and disseminating cost information about digital curation and preservation, is important not just at an individual level but for the big picture of preservation. What follows is in three sections:

The current situation.

Cost trends.

What can be done?

The Current Situation

In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry,
across its 8 reporting repositories, reports just over 21K "preserved"
and about 10.5K "in progress". Thus under 40% of the median research
library's serials are at any stage of preservation.
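As a sanity check, here is the arithmetic behind that "under 40%" figure, using only the numbers above and counting "in progress" titles as generously as if they were already preserved:

```python
# Fraction of the median research library's serials at any stage of preservation,
# using the figures cited above.
median_serials = 80_000    # serials received by the median ARL research library (2010)
preserved = 21_000         # titles reported "preserved" across the Keepers Registry's 8 repositories
in_progress = 10_500       # titles reported "in progress"

fraction = (preserved + in_progress) / median_serials
print(f"{fraction:.1%}")   # ~39.4%, i.e. under 40% even counting "in progress"
```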

We manually repeated this experiment with the more complete Keepers
Registry and found that more than 50% of all journal titles and 50% of
all attributions were not in the registry and should be added.

Scott Ainsworth and his co-authors tried to estimate the probability
that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?"
They generated lists of "random" URLs using several different
techniques including sending random words to search engines and random
strings to the bit.ly URL shortening service. They then:

tried to access the URL from the live Web.

used Memento to ask the major Web archives whether they had at least one copy of that URL.

An Optimistic Assessment

First, the assessment isn't risk-adjusted:

As regards the scholarly literature, librarians, who are concerned with post-cancellation access rather than with preserving the record of scholarship,
have directed resources to subscription rather than open-access
content, and within the subscription category, to the output of large
rather than small publishers. Thus they have driven resources towards
the content at low risk of loss, and away from content at high risk of
loss. Preserving Elsevier's content makes it look like a huge part of
the record is safe because Elsevier publishes a huge part of the record.
But Elsevier's content is not at any conceivable risk of loss, and is
at very low risk of cancellation, so what have those resources achieved
for future readers?

As regards Web content, the more links to a page, the more likely
the crawlers are to find it, and thus, other things such as robots.txt
being equal, the more likely it is to be preserved. But equally, the
less at risk of loss.

Second, the assessment isn't adjusted for difficulty:

A similar problem of risk-aversion is manifest in the idea that
different formats are given different "levels of preservation".
Resources are devoted to the formats that are easy to migrate. But
precisely because they are easy to migrate, they are at low risk of
obsolescence.

The same effect occurs in the negotiations needed to obtain
permission to preserve copyright content. Negotiating once with a large
publisher gains a large amount of low-risk content, whereas negotiating
once with a small publisher gains a small amount of high-risk content.

Similarly, the web content that is preserved is the content that is
easier to find and collect. Smaller, less linked web-sites are probably
less likely to survive.

Harvesting the low-hanging fruit directs resources away from the content at risk of loss.

Third, the assessment is backward-looking:

As regards scholarly communication it looks only at the traditional
forms, books and papers. It ignores not merely published data, but also
all the more modern forms of communication scholars use, including workflows, source code repositories,
and social media. These are mostly both at much higher risk of loss
than the traditional forms that are being preserved, because they lack
well-established and robust business models, and are much more difficult to
preserve, since the legal framework is unclear and the content is either
much larger, or much more dynamic, or in some cases both.

As regards the Web, it looks only at the traditional, document-centric surface Web rather than including the newer, dynamic forms of Web content and the deep Web.

Fourth, the assessment is likely to suffer measurement bias:

The measurements of the scholarly literature are based on
bibliographic metadata, which is notoriously noisy. In particular, the
metadata was apparently not de-duplicated, so there will be some amount
of double-counting in the results.

As regards Web content, Ainsworth et al describe various forms of bias in their paper.

As Cliff Lynch pointed out in his summing-up of the 2014 IDCC conference,
the scholarly literature and the surface Web are genres of content for
which the denominator of the fraction being preserved (the total amount
of genre content) is fairly well known, even if it is difficult to
measure the numerator (the amount being preserved). For many other
important genres, even the denominator is becoming hard to estimate as
the Web enables a variety of distribution channels:

Books used to be published through well-defined channels that assigned ISBNs, but now e-books can appear anywhere on the Web.

YouTube
and other sites now contain vast amounts of video, some of which
represents what in earlier times would have been movies.

Scientific
data is exploding in both size and diversity, and despite efforts to
mandate its deposit in managed repositories, much still resides on grad students' laptops.

Of course, "what we should be preserving" is a judgement call, but
clearly even purists who wish to preserve only stuff to which future
scholars will undoubtedly require access would be hard pressed to claim
that half that stuff is preserved.

Preserving the Rest

Overall, it's clear that we are preserving much less than half of the
stuff that we should be preserving. What can we do to preserve the rest
of it?

We can do nothing, in which case we needn't worry about bit rot, format obsolescence, and all the other risks any more, because they lose only a few percent. The reason why more than 50% of the stuff won't make it to future readers would be that we couldn't afford to preserve it.

We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain the current funding
levels.

We can more than halve the cost per unit content. Doing so requires a radical re-think of our preservation processes and technology.

Such a radical re-think requires understanding where the costs go in our current preservation methodology, and how they can be funded. As an engineer, I'm used to using rules of thumb. The one I use to
summarize most of the research into past costs is that ingest takes half the
lifetime cost, preservation takes one third, and access takes one sixth.

On this basis, one would think that the most important thing to do would be to reduce the cost of ingest. It is important, but not as important as you might think. The reason is that ingest is a one-time, up-front cost. As such, it is relatively easy to fund. In principle, research grants, author page charges, submission fees and other techniques can transfer the cost of ingest to the originator of the content, and thereby motivate them to explore the many ways that ingest costs can be reduced. But preservation and dissemination costs continue for the life of the data, for "ever". Funding a stream of unpredictable payments stretching into the indefinite future is hard. Reductions in preservation and dissemination costs will have a much bigger effect on sustainability than equivalent reductions in ingest costs.

Cost Trends

We've
been able to ignore this problem for a long time, for two reasons. The first is that from at least 1980 to 2010 storage costs followed Kryder's Law,
the disk analog of Moore's Law, dropping 30-40%/yr. This meant that, if
you could afford to store the data for a few years, the cost of storing
it for the rest of time could be ignored, because of course Kryder's
Law would continue forever. The second is that as the data got older,
access to it was expected to become less frequent. Thus the cost of
access in the long term could be ignored.

Can we continue to ignore these problems?

Preservation

Kryder's Law held for three decades, an astonishing feat for exponential
growth. Something that goes on that long gets built into people's model
of the world, but as Randall Munroe points out,
in the real world exponential curves cannot continue for ever. They are
always the first part of an S-curve.

This graph, from Preeti Gupta of UC Santa Cruz, plots the cost per GB of disk drives against time. In 2010 Kryder's Law
abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the
world's capacity to build disks, and prices doubled. Earlier this year
they finally got back to 2010 levels. Industry projections are for no
more than 10-20% per year going forward (the red lines on the graph).
This means that disk is now about 7 times as expensive as was expected
in 2010 (the green line), and that in 2020 it will be between 100 and
300 times as expensive as 2010 projections.

These are big numbers, but do they matter? After all, preservation is only about one-third of the total, and only about one-third of that is media costs.

Our models of the economics of long-term storage compute the endowment, the amount of money that, deposited with the data and invested at interest, would fund its preservation "for ever". This
graph, from my initial rather crude prototype model, is based on
hardware cost data from Backblaze and running cost data from the San Diego Supercomputer Center (much higher than Backblaze's) and Google. It plots the endowment needed for three copies of a 117TB dataset to
have a 95% probability of not running out of money in 100 years, against
the Kryder rate (the annual percentage drop in $/GB). The different
curves represent policies of keeping the drives for 1, 2, 3, 4 or 5 years. Up
to 2010, we were in the flat part of the graph, where the endowment is
low and doesn't depend much on the exact Kryder rate. This is the
environment in which everyone believed that long-term storage was
effectively free.
But suppose the Kryder rate were to drop below about 20%/yr. We would be
in the steep part of the graph, where the endowment needed is both much
higher and also strongly dependent on the exact Kryder rate.
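My actual economic model is considerably more detailed (and the graph above comes from a Monte Carlo simulation with a 95% confidence target), but a minimal deterministic sketch conveys why the curve has this shape. All the specific numbers below (hardware cost, replacement cycle, running-cost ratio, interest rate) are placeholder assumptions, not figures from the model:

```python
# Minimal deterministic sketch of an endowment calculation: the present value of
# buying media every few years plus annual running costs, with media prices
# dropping at the Kryder rate. Not the real model, just the shape of it.
def endowment(kryder_rate, years=100, replace_every=4,
              media_cost0=100.0,      # placeholder: media cost for the collection in year 0
              running_ratio=1.0,      # placeholder: annual running cost as a fraction of current media cost
              interest=0.02):         # placeholder: real rate of return on the endowment
    pv = 0.0
    for year in range(years):
        media_cost = media_cost0 * (1 - kryder_rate) ** year
        spend = media_cost * running_ratio           # running costs accrue every year
        if year % replace_every == 0:
            spend += media_cost                      # buy replacement media this year
        pv += spend / (1 + interest) ** year         # discount back to year 0
    return pv

for rate in (0.40, 0.30, 0.20, 0.10, 0.00):
    print(f"Kryder rate {rate:.0%}: endowment {endowment(rate):8.1f}")
# The endowment is nearly flat at high Kryder rates but rises steeply below ~20%/yr.
```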

We don't need to suppose. Preeti's graph and industry projections show that now and for the foreseeable future we are in the steep part of the graph. What happened to slow Kryder's Law? There are a lot of factors; we outlined many of them in a paper for UNESCO's Memory of the World conference (PDF). Briefly, both the disk and tape markets have consolidated to a couple of vendors, turning what used to be a low-margin, competitive market into one with much better margins. Each successive technology generation requires a much bigger investment in manufacturing, so requires bigger margins, so drives consolidation. And the technology needs to stay in the market longer to earn back the investment, reducing the rate of technological progress.

Thanks to aggressive marketing, it is commonly believed that "the cloud"
solves this problem. Unfortunately, cloud storage is actually made of
the same kind of disks as local storage, and is subject to the same
slowing of the rate at which it was getting cheaper. In fact, when all costs are taken into account, cloud storage is not cheaper for long-term preservation than doing it yourself
once you get to a reasonable scale. Cloud storage really is cheaper if
your demand is spiky, but digital preservation is the canonical
base-load application.

Jillian Mirandi, senior analyst at Technology Business Research Group (TBRI), estimated that AWS will generate about $4.7 billion in revenue this year, while comparable estimated IaaS revenue for Microsoft and Google will be $156 million and $66 million, respectively.

cloud prices across the industry were falling by about 6 per cent
each year, whereas hardware costs were falling by 20 per cent. And
Google didn't think that was fair. ... "The price curve of virtual
hardware should follow the price curve of real hardware."

Notice that the major price drop triggered by Google
was a one-time event; it was a signal to Amazon that they couldn't have
the market to themselves, and to smaller players that they would no
longer be able to compete.

In fact commercial cloud storage is a trap.
It is free to put data into a cloud service such as Amazon's S3, but
it costs to get it out. For example, getting your data out of Amazon's
Glacier without paying an arm and a leg takes 2 years. If you commit to
the cloud as long-term storage, you have two choices. Either keep a copy
of everything outside the cloud (in other words, don't commit to the
cloud), or stay with your original choice of provider no matter how much
they raise the rent.

Unrealistic expectations that we can collect and store the vastly increased amounts of data projected by consultants such as IDC within current budgets place currently preserved content at great risk of economic failure. Here are three numbers that illustrate the looming crisis in the cost of long-term storage:

Storage media prices are now projected to drop no more than 20%/yr (IHS iSuppli).

IT budgets are growing at about 2%/yr (computereconomics.com).

The data to be stored is growing at about 60%/yr (IDC).

Here's a graph that projects these three numbers out for the next 10
years. The red line is Kryder's Law, at IHS iSuppli's 20%/yr. The blue
line is the IT budget, at computereconomics.com's 2%/yr. The green line
is the annual cost of storing the data accumulated since year 0 at the
60% growth rate projected by IDC, all relative to the value in the first
year. 10 years from now, storing all the accumulated data would cost
over 20 times as much as it does this year. If storage is 5% of your IT
budget this year, in 10 years it will be more than 100% of your budget.
If you're in the digital preservation business, storage is already way
more than 5% of your IT budget.
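A few lines reproduce the shape of this projection. The three growth rates are the ones cited above; the exact multiple at year 10 depends on how the accumulation is counted, but under this straightforward convention it comfortably exceeds 20x:

```python
# Project the annual cost of storing all accumulated data against the IT budget,
# using the three growth rates cited above (arbitrary units, year 0 = 1).
kryder = 0.20         # $/GB falls 20%/yr (IHS iSuppli)
budget_growth = 0.02  # IT budgets grow 2%/yr (computereconomics.com)
data_growth = 0.60    # annual data growth of 60% (IDC)

for year in (0, 5, 10):
    accumulated = sum((1 + data_growth) ** y for y in range(year + 1))
    storage_cost = accumulated * (1 - kryder) ** year   # relative to year-0 storage cost
    budget = (1 + budget_growth) ** year                # relative to year-0 budget
    print(year, round(storage_cost, 1), round(budget, 2))
# Year 10: storage cost ~31x year 0, budget only ~1.22x. If storage was 5% of the
# IT budget in year 0, by year 10 it needs more than 100% of the budget.
```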

Dissemination

The storage part of preservation isn't the only on-going cost that will be much higher than people expect; access will be too. In 2010 the Blue Ribbon Task Force on Sustainable Digital Preservation and Access
pointed out that the only real justification for preservation is to
provide access. With research data this can be a real difficulty; the value of
the data may not be evident for a long time. Shang dynasty astronomers
inscribed eclipse observations on animal bones. About 3200 years later,
researchers used these records to estimate that the accumulated clock
error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.

But the advent of "Big Data" techniques means that, going forward, scholars increasingly
want not to access a few individual items in a collection, but to ask
questions of the collection as a whole. For example, the Library of
Congress announced that it was collecting the entire Twitter feed,
and almost immediately had 400-odd requests for access to the
collection. The scholars weren't interested in a few individual tweets,
but in mining information from the entire history of tweets.
Unfortunately, the most the Library could afford to do with
the feed is to write two copies to tape. There's no way they could afford
the compute infrastructure to data-mine from it. We can get some idea of how expensive this is by comparing Amazon's S3, designed for data-mining type access
patterns, with Amazon's Glacier, designed for traditional archival
access. S3 is currently at least 2.5 times as expensive; until recently it was 5.5 times.

Ingest

Almost everyone agrees that ingest is the big cost element. Where does the money go? The two main cost drivers appear to be the real world, and metadata.

In the real world it is natural that the cost per unit content increases through time, for two reasons. The content that's easy to ingest gets ingested first, so over time the difficulty of ingestion increases. And digital technology evolves rapidly, mostly by adding complexity. For example, the early Web was a collection of linked static documents. Its language was HTML. It was reasonably easy to collect and preserve. The language of today's Web is Javascript, and much of the content you see is dynamic. This is much harder to ingest. In order to find the links much of the collected content now needs to be executed
as well as simply being parsed. This is already significantly
increasing the cost of Web harvesting, both because executing the
content is computationally much more expensive, and because elaborate
defenses are required to protect the crawler against the possibility
that the content might be malign.

The days when a single generic crawler could collect pretty much
everything of interest are gone; future harvesting will require more and
more custom tailored crawling such as we need to collect subscription
e-journals and e-books for the LOCKSS Program. This per-site custom work
is expensive in staff time. The cost of ingest seems doomed to
increase.

Worse, the W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.

Metadata in the real world is widely known to be of poor quality, of both the format and bibliographic kinds. Efforts to improve the quality are
expensive, because they are mostly manual and, inevitably, reducing
entropy after it has been generated is a lot more expensive than not
generating it in the first place.

What can be done?

We are preserving less than half of the content that needs preservation. The cost per unit content of each stage of our current processes is predicted to rise. Our budgets are not predicted to rise enough to cover the increased cost, let alone more than doubling to preserve the other more than half. We need to change our processes to greatly reduce the cost per unit content.

Preservation

It is often assumed that, because it is possible to store and copy
data perfectly, only perfect data preservation is acceptable. There are
two problems with this expectation.

To illustrate the first problem, let's examine the technical
problem of storing data in its most abstract form. Since 2007 I've been
using the example of "A Petabyte for a Century".
Think about a black box into which you put a Petabyte, and out of which
a century later you take a Petabyte. Inside the box there can be as
much redundancy as you want, on whatever media you choose, managed by
whatever anti-entropy protocols you want. You want to have a 50% chance
that every bit in the Petabyte is the same when it comes out as when it
went in.

Now consider every bit in that Petabyte as being like a radioactive
atom, subject to a random process that flips it with a very low
probability per unit time. You have just specified a half-life for the bits. That
half-life is about 60 million times the age of the universe. Think for a
moment how you would go about benchmarking a system to show that no
process with a half-life less than 60 million times the age of
the universe was operating in it. It simply isn't feasible. Since at
scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
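The arithmetic behind the "60 million times the age of the universe" figure is simple; here it is as a quick check, with each bit modelled as flipping independently at a constant rate:

```python
import math

bits = 8e15                  # one Petabyte = 8 * 10^15 bits
years = 100                  # a century
age_of_universe = 13.8e9     # years, roughly

# Want P(no bit flips in a century) = exp(-bits * lam * years) = 0.5,
# where lam is the per-bit flip rate per year.
lam = math.log(2) / (bits * years)
half_life = math.log(2) / lam          # = bits * years = 8e17 years
print(half_life / age_of_universe)     # ~5.8e7, i.e. about 60 million times
```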

Here's some back-of-the-envelope hand-waving. Amazon's S3 is a
state-of-the-art storage system. Its design goal is an annual
probability of loss of a data object of 10^-11. If the average
object is 10K bytes, the bit half-life is about a million years, way
too short to meet the requirement but still really hard to measure.

Note that the 10^-11 is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at
scale, and it all shows them under-performing expectations based on the
specifications of the media. Why is this? Real storage systems are
large, complex systems subject to correlated failures that are very hard
to model.

Worse, the threats against which they have to defend their contents are
diverse and almost impossible to model. Nine years ago we documented the
threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats:

Media failure

Hardware failure

Software failure

Network failure

Obsolescence

Natural Disaster

but that the experience of operators of large data storage facilities
was that the significant causes of data loss were quite different:

Operator error

External Attack

Insider Attack

Economic Failure

Organizational Failure

To illustrate the second problem, consider that building systems to defend against all these threats combined is
expensive, and can't ever be perfectly effective. So we have to resign
ourselves to the fact that stuff will get lost. This has always been true; it should not be a surprise. And it is subject to the law of
diminishing returns. Coming back to the economics, how much should we
spend reducing the probability of loss?

Consider two storage systems with the same budget over a decade, one
with a loss rate of zero, the other half as expensive per byte but which
loses 1% of its bytes each year. Clearly, you would say the cheaper
system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1%
of its accumulated content. At the end of the decade the cheaper system
has preserved 1.89 times as much content at the same cost. After 30
years it has preserved more than 5 times as much at the same cost.

Why is this? Because a collection like the Internet Archive's was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something peculiar to the Internet Archive; it is true of very large collections in general. In the real world they always have noise;
questions asked of them are always statistical in nature. The benefit of
doubling the size of the sample vastly outweighs the cost of a small
amount of added noise. In this case more really is better.

Unrealistic
expectations for how well data can be preserved make the best the enemy of the good. We spend money reducing even further the small
probability of even the smallest loss of data that could instead
preserve vast amounts of additional data, albeit with a slightly higher
risk of loss.

Within the next decade all current popular storage media, disk, tape and
flash, will be up against very hard technological barriers. A
disruption of the storage market is inevitable. We should work to ensure
that the needs of long-term data storage will influence the result. We should pay particular attention to the work underway at Facebook and elsewhere that uses techniques such as erasure coding, geographic diversity, and custom hardware based on mostly spun-down disks and DVDs to achieve major cost savings for cold data at scale.
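To make "erasure coding" concrete: instead of keeping several complete replicas, the data is split into shards plus parity shards, so a given level of protection costs much less raw storage. Here is a toy single-parity example in plain Python; real systems such as Facebook's cold storage use Reed-Solomon codes that tolerate multiple simultaneous losses, but the principle is the same:

```python
# Toy erasure coding: k data shards plus one XOR parity shard. Any single lost
# shard (data or parity) can be rebuilt from the survivors, at 1.25x storage
# overhead for k=4, versus 3x for keeping three full copies.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    if len(data) % k:                                  # pad to a multiple of k
        data += b"\0" * (k - len(data) % k)
    size = len(data) // k
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def rebuild(shards, lost_index):
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    rebuilt = survivors[0]
    for s in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, s)
    return rebuilt

shards = encode(b"some cold archival data", k=4)
assert rebuild(shards, lost_index=2) == shards[2]      # a lost shard is recoverable
```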

Every few months there is another press release announcing that some new, quasi-immortal medium such as fused silica glass or stone DVDs has solved the problem of long-term storage. But the
problem stays resolutely unsolved. Why is this? Very long-lived media
are inherently more expensive, and are a niche market, so they lack
economies of scale. Seagate could easily make disks with archival life,
but they did a study of the market for them, and discovered that
no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very
low Kryder rates. Even if the rate is only 10%/yr, after 10 years you
could store the same data in 1/3 the space. Since space in the data
center or even at Iron Mountain isn't free, this is a powerful incentive
to move old media out. If you believe that Kryder rates
will get back to 30%/yr, after a decade you could store 30 times as much
data in the same space.
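The space arithmetic, treating the Kryder rate as a density improvement:

```python
# How much less space the same data needs after a decade at a given Kryder rate.
for rate in (0.10, 0.30):
    shrinkage = (1 - rate) ** 10
    print(f"{rate:.0%}/yr: {shrinkage:.2f} of the space, "
          f"or {1 / shrinkage:.0f}x the data in the same space")
# 10%/yr -> ~0.35 (about 1/3 of the space); 30%/yr -> ~35x the data in the same space.
```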

There is one long-term storage medium that might eventually make sense.
DNA is very dense, very stable in a shirtsleeve environment, and best of
all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA
sequencing and synthesis are improving at far faster rates than magnetic
or solid state storage. Right now the costs are far too high,
but if the improvement continues DNA might eventually solve the archive
problem. But access will always be slow enough that the data would have
to be really cold before being committed to DNA.

The reason that the idea of long-lived media is so attractive is that it
suggests that you can be lazy and design a system that ignores the possibility of
failures. You can't:

Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.

Long media life does not imply that the media are more reliable,
only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the
task ahead.

Double the reliability is only worth 1/10th of 1 percent cost increase. ... Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m. The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more). Moral of the story: design for failure and buy the cheapest components you can. :-)

Dissemination

The real problem here is that scholars are used to having free access
to library collections and research data, but what scholars now want to do with archived data is so expensive that they must be
charged for access. This in itself has costs, since access must be
controlled and accounting undertaken. Further, data-mining
infrastructure at the archive must have enough performance for the peak
demand but will likely be lightly used most of the time, increasing the
cost for individual scholars. A charging mechanism is needed to pay for the
infrastructure. Fortunately, because the scholar's access is spiky,
the cloud provides both suitable infrastructure and a charging
mechanism.

For smaller collections, Amazon provides Free Public Datasets: Amazon stores a copy of the data at no charge, charging scholars accessing the data for the computation rather than charging the owner of the data for storage.

Even for large and non-public collections it may be
possible to use Amazon. Suppose that in addition to keeping the two archive copies of
the Twitter feed on
tape, the Library of Congress kept one copy in S3's Reduced Redundancy Storage
simply to enable researchers to access it. For this year, it would have averaged about $4100/mo, or
about $50K. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges
could be borne by the library or charged back to the researchers. If
they were charged back, the 400 initial requests would each need to
pay about $125 for a year's access to the collection, not an
unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost; the collection would still be safe on tape. In the short term, using cloud storage for an access copy of
large, popular collections may be a cost-effective approach. Because the
Library's preservation copy isn't in the cloud, they aren't locked-in.
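The back-of-the-envelope numbers above work out as follows:

```python
# Cost of keeping an S3 Reduced Redundancy access copy of the Twitter collection,
# using the figures cited above.
monthly_storage = 4_100                  # $/month (average for this year, as above)
annual_storage = monthly_storage * 12    # ~$49,200, i.e. about $50K/year
initial_requests = 400                   # access requests the Library received

print(annual_storage, round(annual_storage / initial_requests))
# ~$123 per scholar for a year's access if the storage charge were passed through.
```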

Ingest

There are two parts to the ingest process, the content and the metadata.

The evolution of the Web that poses problems for preservation also poses problems for search engines such as Google. Where they used to parse the HTML of a page into its Document Object Model (DOM) in order to find the links to follow and the text to index, they now have to construct the CSS object model (CSSOM), including executing the Javascript, and combine the DOM and CSSOM into the render tree to find the words in context. Preservation crawlers such as Heritrix used to construct the DOM to find the links, and then preserve the HTML. Now they also have to construct the CSSOM and execute the Javascript. It might be worth investigating whether preserving a representation of the render tree rather than the HTML, CSS, Javascript, and all the other components of the page as separate files would reduce costs.
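To make the change concrete: where a classic crawler could fetch the HTML and parse the <a href> links out of the DOM, a crawler for today's Web has to load the page in a real browser engine, execute the Javascript, and only then extract the links. A minimal sketch of that second approach, assuming the third-party Playwright package (this is not what Heritrix or the LOCKSS crawler actually do, just an illustration of the extra machinery, and cost, involved):

```python
# Sketch: extract links from a Javascript-heavy page by executing it in a
# headless browser rather than parsing the raw HTML.
# Assumes: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def harvest_links(url: str) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch()                 # a full browser engine per page:
        page = browser.new_page()                     # far costlier than a simple HTTP GET
        page.goto(url, wait_until="networkidle")      # let scripts run and content settle
        links = page.eval_on_selector_all(
            "a[href]", "els => els.map(e => e.href)") # links from the executed DOM
        browser.close()
        return links

print(harvest_links("https://example.org/"))
```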

It is becoming clear that there is much important content that is too
big, too dynamic, too proprietary or too DRM-ed for ingestion into an
archive to be either feasible or affordable. In these cases where we
simply can't ingest it, preserving it in place may be the best we can do: creating a legal framework in which the
owner of the dataset commits, for some consideration such as a tax
advantage, to preserve their data and allow scholars some suitable
access. Of course, since the data will be under a single institution's
control it will be a lot more vulnerable than we would like, but this
type of arrangement is better than nothing, and not ingesting the
content is certainly a lot cheaper than the alternative.

Metadata is commonly regarded as essential for preservation. For
example, there are 52 criteria for ISO
16363 Section 4. Of these, 29 (56%) are metadata-related. Creating and validating
metadata is expensive:

In both cases, extracted metadata is sufficiently noisy to impair its usefulness.

We need less metadata so we can have more data. Two questions need to be asked:

When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive,
which ingest much of the same content. The Portico pipeline is far more
expensive because it extracts, generates and validates metadata during
the ingest process. CLOCKSS, because it has no need to make content
instantly available, implements all its metadata operations as
background tasks, to be performed as resources are available.

The LOCKSS and CLOCKSS systems take a very parsimonious approach to
format metadata. Nevertheless, the requirements of ISO 16363 forced us
to expend resources implementing and using FITS, whose output does not in fact contribute to our preservation strategy,
and whose binaries are so large that we have to maintain two separate
versions of the LOCKSS daemon, one with FITS for internal use and one
without for actual preservation. Further, the demands we face for
bibliographic metadata mean that metadata extraction is a major part of ingest costs for both systems. These demands come from requirements for bibliographic search, preservation tracking, and reporting on exactly what is preserved.

Bibliographic search, preservation tracking and bragging about exactly how many
articles and books your system preserves are all important, but whether
they justify the considerable cost involved is open to question. Because they are cleaning up after the milk has been spilt, digital preservation systems are poorly placed to improve metadata quality.

Resources should be devoted to avoiding spilling milk rather than cleanup. For example, given how much the academic community spends on the services publishers allegedly provide in the way of improving the quality of publications, it is an outrage that even major publishers cannot spell their own names consistently, cannot format DOIs correctly, get authors' names wrong, and so on.

The alternative is to accept that metadata correct enough to rely on is impossible, downgrade its importance to that of a hint, and stop wasting resources on it. One of the reasons full-text search dominates bibliographic search is that it handles the messiness of the real world better.

Conclusion

Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" will vastly exceed that from all other causes, such as bit rot and format obsolescence. This raises two questions:

Will persisting with current preservation technologies improve the odds of preservation? At each stage of the preservation process current projections of cost per unit content are higher than they were a few years ago. Projections for future preservation budgets are at best no higher. So clearly the answer is no.

If not, what changes are needed to improve the odds? At each stage of the preservation process we need to at least halve the cost per unit content.
I have set out some ideas; others will have different ideas. But the need for major cost reductions needs to be the focus of discussion and development of digital preservation technology and processes.

We live in a marketplace of competing preservation solutions. A very
significant part of the cost of both not-for-profit systems such as
CLOCKSS or Portico, and commercial products such as Preservica
is the cost of marketing and sales. For example, TRAC certification is a
marketing check-off item. The cost of the process CLOCKSS underwent to obtain this check-off item was well in excess of 10%
of its annual budget.

Making the tradeoff of preserving more stuff using "worse preservation"
would need a mutual non-aggression marketing pact. Unfortunately, the
pact would be unstable. The first product to defect and sell itself as
"better preservation than those other inferior systems" would win. Thus
private interests work against the public interest in preserving more
content.

To sum up, we need to talk about major cost reductions. The basis for this conversation must be more and better cost data.

7 comments:

"Worse, the W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it"

The vast majority of the content in question was never available without DRM, so the addition of an optional DRM spec doesn't change the situation in any significant way – it's just swapping Flash/Silverlight for EME. In every case you have the same questions which we've had for the last couple decades with e.g. DVDs: is it legal to backup the encrypted bitstream and/or the keys? Where can it legally be played back and for whom?

This is a hard problem because it'll require a legal solution. The only good news is that companies expecting payment for the content have an above-average incentive to preserve it, hopefully enough to at least partially defray the risk of being forced to rely on them.

Chris, I agree with your comment under the assumption that only content now DRM-ed will be delivered using HTML5 DRM.

But I don't think this is a safe assumption. Given a universal, standards-approved DRM mechanism in browsers, I expect that vast swathes of content now delivered without DRM but also without a CC or equivalent license will be delivered with HTML5 DRM.

Necessary reading, as with all of your blog posts David, almost compensation for not having been at the conference.

A correction/clarification requested: "The Prelida project studied the links in 46,000 US theses and determined that about 50% of the linked-to content was preserved in at least one Web archive" should instead credit and link to the Hiberlink project. I presented Hiberlink work at that workshop, http://www.slideshare.net/prelida/hiberlink-reference-rot-and-linked-data-threat-and-remedy but better to go to http://hiberlink.org/news.html for the ETD2014 & other reporting.