I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Tuesday, March 24, 2015

The Opposite Of LOCKSS

Jill Lepore's New Yorker "Cobweb" article has focused attention on the importance of the Internet Archive, and the analogy with the Library of Alexandria. In particular on the risks implicit in the fact that both represent single points of failure because they are so much larger than any other collection.

Typically, Jason Scott was first to respond with a outline proposal to back up the Internet Archive, by greatly expanding the collaborative efforts of ArchiveTeam. I think Jason is trying to do something really important, and extremely difficult.

The Internet Archive's collection is currently around 15PB. It has doubled in size in about 30 months. Suppose it takes another 30 months to develop and deploy a solution at scale. We're talking crowd-sourcing a distributed backup of at least 30PB growing at least 3PB/year.

To get some idea of what this means, suppose we wanted to use Amazon's Glacier. This is, after all, exactly the kind of application Glacier is targeted at. As I predicted shortly after Glacier launched, Amazon has stuck with the 1c/GB/mo price. So in 2017 we'd be paying Amazon $3.6M a year just for the storage costs. Alternately, suppose we used Backblaze's Storage Pod 4.5 at their current price of about 5c/GB, for each copy we'd have paid $1.5M in hardware cost and be adding $150K worth per year. This ignores running costs and RAID overhead.

It will be very hard to crowd-source resources on this scale, which is why I say this is the opposite of Lots Of Copies Keep Stuff Safe. The system is going to be short of storage; the goal of a backup for the Internet Archive must be the maximum of reliability for the minimum of storage.

Nevertheless, I believe it would be well worth trying some version of his proposal and I'm happy to help any way I can. Below the fold, my comments on the design of such a system.

Reliability

Why is reliability so important for this system? After all, I've been arguing that reliability isn't as important as people think elsewhere. Lets suppose that somehow we have a single copy of the 30PB on disks, and that we perform an integrity check via checksums 10 times a year. Optimistically, we assume these disks never fail for any reason, but they achieve their specified Unrecoverable Bit Error Rate (UBER) of 10-15. There are 2.4*1017 bits in the copy, so on average every time we do an integrity check we will get 240 bad bits. Pessimistically, we assume these bits are randomly distributed (this makes the analysis much easier).

If, as Jason suggests, the backup is divided into 70K 500GB blocks, the probability that any of them will have more than 1 bad bit is small, so we will lose 10*240*500GB of data every year, or 1.2PB, or about a third of the incoming data. Of course, we can repair these failures from the Internet Archive itself, at the cost of increasing the bandwidth impact from about an additional quarter of the Archive's current bandwidth to about a third (see below). But the probability that the Archive and the backup would lose the same data becomes significant.

This argues for much smaller blocks, to reduce the impact of the UBER at the cost of increasing the overhead of the system. Smaller blocks would also make it possible for more people to contribute storage, both from the cost of their contribution and from the impact on their bandwidth. Downloading 500GB on my DSL link would take its entire capacity for two weeks.

In real life, even in data centers disks fail in all sorts of ways that make UBER fairly unimportant. The crowd-sourced disks are likely to be much less reliable still. So the system needs to replicate the data.

Replication

The discussions I've seen so far assume that the data is simply replicated, as it is in the LOCKSS system, but only three times. Even replicating by a factor of three means the demand for storage in the backup network by 2018 is nearly 100PB. Clearly, some scheme that gave adequate reliability but used less storage would help significantly. There are two techniques that might help, erasure coding and entanglement. Warning: the following discussion is radically simplified, see here.

Erasure Coding

Erasure coding is like a distributed version of RAID; files are divided into storage blocks. These data blocks are organized into groups of N. For each group (M>N) blocks are stored, containing the data from the N blocks mixed together so that from any N blocks in the group the original data can be recovered. This allows for non-integer replication factors; the replication factor is (M/N). There are two ways to do this:

The N data blocks can be stored unchanged, and (M-N) parity blocks computed from the N blocks can be added. This is the way RAID typically works; it has the advantage that, if nothing has gone wrong, reading a data block requires accessing only a single block. Writing a data block requires writing (1+M-N) blocks, as the parity blocks need to be updated to reflect the new data.

The N data blocks are not stored. Instead M blocks are stored each computed from all of the N data blocks in the group. Reading a block requires accessing N blocks, writing a data block requires accessing M blocks.

The second form is much more expensive, so why would you do it? In a distributed
system these accesses can happen in parallel, so the impact is less. Also, reads of
the backup, other than for integrity checking, will be rare, and
integrity checks do not need to recover data "in the clear"; the
performance costs are not significant in this application.

The real importance is that no individual storage node can, if
compromised, reveal any data. In the context of a crowd-sourced backup of the Internet Archive, this is important. If a node in the backup network contains data from the Archive "in the clear" the owner of the node might be in trouble if the relevant authorities considered that content undesirable. If the owner has deniability, in the sense that they can say "there is no way I can know what the data I am storing is, and no way anyone can recover usable data from my disk alone" it is much harder for the authorities to claim that the owner is doing something bad.

The second form of erasure coding has desirable properties for a backup of the Archive, and it can significantly reduce the demand for storage. Examples of systems using the second form of erasure coding are Tahoe-LAFS and Cleversafe.

Like the second form of erasure coding, entanglement does not store the data blocks themselves. Each stored block contains data derived from multiple data blocks. The key difference is that erasure coding mixes the data from a fixed group of blocks, whereas entanglement does not organize the blocks into groups but mixes each incoming block with a pseudo-randomly chosen set of stored blocks. This has the following effects:

Since the information from which a data block can be recovered is spread across the whole set of stored blocks, deleting or over-writing a data block will affect other data blocks. If the spread is wide enough, selective censorship is effectively blocked and the system is append-only. For the Internet Archive backup application, this is a good thing.

Entanglement supports only integer replication factors:

Tangler's publishing algorithm takes two stored blocks and a data block and outputs two new stored blocks, thus its replication factor is two. A data block can be recovered from any three of the four (two input and two output) stored blocks.

In the default three-strand configuration of the Helical Entanglement Code (HEC) system publishing takes three stored blocks, one from each strand, and a data block and outputs three new stored blocks, one for each strand. Its replication factor is thus three. Absent data loss, recovering a data block requires accessing two successive stored blocks from any of the three strands. If data loss means that none of the three strands can supply the necessary blocks, a search process can recover the lost stored blocks from information in other stored blocks.

Entanglement systems vary in how they spread the information about a data block among the stored blocks. In Towards A Theory of Data Entanglement James Aspnes et al introduced two criteria for this:

A system provides document dependency if a document cannot be recovered if any document it is entangled with is lost.

A system provides all-or-nothing integrity if no document can be recovered if any document is lost.

They show that Dagster and Tangler do not meet these criteria. Helical Entanglement is claimed to provide all-or-nothing integrity, the stronger of the criteria. In her brief Work-In-Progress talk at FAST15, Verónica Estrada Galiñanes showed that HEC systems could be configured with large numbers of strands and devices and a replication factor of four to have very high tolerance for failures.

Entanglement has several desirable properties for a backup of the Archive, but it has too high a replication factor to be practical.

Requirements

If I were doing the design I would start from the end I haven't seen any discussion of so far. Its neat to have a backup copy of the archive, but if it won't actually work when it is needed what's the point? So I'd start the design by looking not at how the data gets out there, but at the use cases when the data needs to get back. Two obvious cases are:

The archive loses say 10TB, perhaps because it suffers a rash of correlated disk failures. How does the archive get it back from the backup?

The Big One hits the Bay Area and the entire archive is lost. How can the service be re-constituted from the backup?

Note that time is a big issue here. If it is theoretically possible to recover the needed data, but only in a timescale that's so long everyone will have forgotten about the archive by the time it's back, nothing has really been achieved. Recovery needs to be modelled with realistic upstream bandwidths, which will be much less than downstream for most nodes, replication factors, and proportion of accessible nodes.

Once I had a good recovery design, then I'd figure out:

How to get data from the archive into a system like that.

How to continually audit the system verify that it was in good enough shape to work when needed, something systems used only in an emergency frequently fail to do.

So lets say the requirements are:

Capacity of 35PB by late 2017.

Replication factor less than 1.5, to limit storage demand by late 2017 to less than about 50PB, or say 100K volunteers each providing 500GB.

Provides deniability (see above).

Ingest bandwidth of 15PB/year by late 2017 (the current content of the archive needs to be backed up in say 2 years while it is growing say 5PB/year). Note that this is about 4Gb/s leaving the archive, or roughly an additional 25% outbound bandwidth.

95% probability of correctly recovering 10TB in 5 days (it is assumed that much of the content will be off-line most of the time, so instant recovery cannot be a requirement).

95% probability of correctly recovering 95% of the entire archive in 90 days.

Meet these requirements in the face of 5% malign nodes conspiring with each other, and realistic error and availability probabilities for the non-malign nodes.

The system self-configures as available storage resources change so as to:

The Internet Archive is a great project, though it needs a disaster plan. But the backup is only the initial step. To talk about preserving our cultural heritage, we need to think about geo-distribution of content. Storing data only in one country and even worse in only one building makes data too vulnerable. This is the reason why I believe helical entanglement codes can be useful. At first, one may argue that we need minimum storage to reduce costs and my solution requires at minimum 3x space overhead. But if we consider that we also need a global distribution of the content (for multiple reasons such avoid censorship, fault-tolerance, latency). Then, helical entanglements are a flexible method that permits: 1) different parties will have the totality of data, 2) cooperation to repair missing data, and 3) controls to the integrity of data made by data holders or a third party.

Firstly, in this application distribution of content is achieved by either erasure coding or entanglement, so that is not a difference. It is achieved by spreading the content, using either technique, across the storage provided by the volunteers.

What is an important difference in this application is that erasure coding would use the storage contributed by volunteers more efficiently, probably at least twice as efficiently.

The idea of crowdsourcing 45PB of storage is only half as intimidating as crowdsourcing 90PB. It requires only 90K volunteers contributing 500GB instead of 180K of them. 180K active participants is in the same league as the very small number of highly successful crowdsourced computation projects such as BOINC.

Hi, David. Certainly, 45PB is less intimidating but 90PB. But in that model, network bandwidth is not considered. Using traditional erasure coding volunteers need to donate more network bandwidth and more I/O access. With entanglements, one single failure is repair with only two blocks.

Needing at least twice as much space is a REALLY BIG DEAL, like almost $2M worth of disk. In practice, it means that you're not actually going to be backing up most of the Internet Archive's collection, and failures in the part you're not backing up are going to be totally lost.

Paying a little extra bandwidth and I/O in the, hopefully rare, case when you need to recover is a tiny detail in comparison to failing to back up much of the collection.

As usual, I'm running a little later than my promised "tomorrow" when I was on a plane to Sweden. By the time I'd come back, I was in a whole other related world of obligations and issues. But here I am with a dazzling smattering of responses to the article.

The entire INTERNETARCHIVE.BAK project, I have to stress to the utmost, is not connected to the Internet Archive's own data retention, backup, and infrastructure policies. It is an experiment intended to serve a wide amount of functions and discoveries, and as such could flame out as easily as anything else being poked at by a bunch of volunteers. While Archive Team tinkers about with the project, Internet Archive continues doing its own proper efforts any professional and competent organization does in taking care of its data. Entirely separate sets - in fact, except for an info page I'm running on a box in the lab area, none of the IA.BAK client-server is happening at the Archive at all. I just wanted that out there.

Next, I appear to have stepped in it when I referenced LOCKSS. I mostly think of it as the philosophy more than the toolset and software - the philosophy that ensuring data is in as many places, especially physically, as possible is one of the most important, possibly the most important, requirements for taking care of that data going forward and protecting it. So when you called it the opposite of LOCKSS, I was surprised, until I saw that you were including the toolset. Anyway, sorry about that - was trying to acknowledge I didn't just come up with this on a plane ride on my own a month or two ago. I'll get that reference out of there.

Along the theme of different intentions, I see it all up and down this entry. What I'm trying to play with along with the Archive Team is a completely different dialect of storage approach, certainly for the round of what we're trying out. You're talking about an agnostic (nay, nearly double-blind) virtual drive that is shared among volunteers that provides a deniability defense and extensive efforts to incorporate optimization of the data.

This project, in its current state and round, is nothing like that!

In the model of what we're up to with this, think of it like FTP (as opposed to SSH, or even MOSH). We threw around a bunch of ideas, and people have thrown in lots of them, but the first round of effort is humble, careful and prone to faceplants. Working with a 3tb example dataset, we're seeing how the whole thing functions, finding interesting issues (and documenting them), and once we have acquired the +3 copies in separate places, we're going to hit it with a sledgehammer and see what we learn from THAT.

If this sounds rudimentary and quaint, it is - it was just something I was messing with after dreaming something up, that one of the Archive Team members who likes to post what we're doing on Reddit, did what he does. Once it got on Reddit, a lot of attention came in, and you can imagine my surprise when this baking-soda-and-vinegar volcano was suddenly being discussed like it was an iPhone 7 prototype.

The process as it stands is just poking around the whole Internet Archive collection, asking what represents the "Digital Heritage" that can be publically downloaded (nothing internal is being put into this experiment), and documenting all our stumbles along the way. Eventually, I expect some interesting writing to come out of it, along with a continued highlighting of the neat stuff the Archive has within its walls, which a very small number of people know about.

And like FTP, whatever next-generation distributed backups come to the Archive will likely smoke this effort into the ground.