I will not be acquihired

There's a story I notice playing out increasingly often these days: A startup
company is founded and provides a service lots of people want and use; the
company is acquired by one of the tech giants (most often Google or Facebook);
and a few months later the startup's founders are shuffled off to work on other
projects and the service they created is shut down. In corporate-speak, this
is a "talent acquisition"; in the world of internet startups, it's more often
called an "acquihire".

This, quite reasonably, tends to make people nervous about using services
which are provided by small startup companies. After "what if you get hit by
a bus", the most common question people ask me about my
online backup service
is some variation on "what if Google buys Tarsnap and doesn't want to keep it
running?" I've answered this privately many times, but I think it's time to
answer it publicly: Tarsnap will not be acquihired.

Announcements of startup acquisitions rarely provide much in the way of back
story, but reading between the lines there are two common sub-plots I see. I
call these the "job hunter" story and the "failure to lift-off" story.

The job hunter story runs roughly as follows: Some smart kids graduate from
college and start looking for jobs. They go around to the usual list of high
profile companies — Google, Facebook, Microsoft, Amazon, etc. —
but either aren't offered jobs or are offered relatively low salaries. While
it's obvious to their interviewers that they are smart, they've never built
anything beyond the scope of a one-term college course, and employers are
naturally hesitant about hiring anyone without practical work experience.

So lacking any job offers commensurate with their talents, our job hunters set
out to demonstrate that they have practical skills in addition to academic
knowledge. They build a company much like an artist or architect puts
together a portfolio: As a way to show off their talents. It doesn't matter
if the company makes any money — that's not the point, since they're
trying to demonstrate their suitability for software development jobs, not
their ability to run companies. A year or so later, they get the jobs they
were looking for; a large software firm gets employees with demonstrated
practical skills; and the only people who aren't happy are the users who now
find that the service they were using no longer exists.

The failure to lift-off story is a bit uglier. Some people have a
business idea, and decide to set out on their own to create it. Usually
it turns out that the market isn't quite as large as they thought; in some
cases a lot of people want what they're providing, but aren't willing to
pay enough to make the business very profitable. Maybe the company is
approaching bankruptcy and can't convince any venture capital companies to
invest any more money; maybe the company is limping along barely profitably
but the founders aren't getting rich and the investors are getting impatient.

Either way, some calls are made, and the company is "acquired"; most often
it seems that the only connection between the acquired and acquiring companies
is that they share an investor. (From a fiduciary perspective,
these deals stink: It seems very implausible that they would all occur if
it weren't for personal relationships greasing the wheels.) This story
doesn't end with anyone being very happy, but at least it's a cleaner and
less embarrassing wrap-up than a company going bankrupt and its founders being
unemployed.

Tarsnap doesn't fit either of these stories. I started Tarsnap shortly after
receiving my doctorate, so it might look like a "job hunter" story; but I
only started Tarsnap after being offered a well-paid job. In fact, without
that job offer, I wouldn't have started Tarsnap — it was the security
of having "get a job at a big internet company" as a proven backup plan which
made me willing to take the risk of starting my own company. Nor does Tarsnap
fit the "failure to lift-off" story: While Tarsnap is nowhere near as large as
Dropbox, Airbnb, or Heroku,
it doesn't need to be. As a "bootstrapped" company, Tarsnap has no investors
who could get impatient; and it's sufficiently profitable that I'd be
satisfied with running it indefinitely even if it never grows any further
(which is, of course, a very unlikely scenario).

In short, Tarsnap won't be acquihired because I'd have nothing to gain from
it. I didn't build Tarsnap in the hopes of attracting a job offer, and it's
successful enough that I'd be a fool to ever abandon it. I'm in this for the
long haul; your backups are safe.

Tarsnap outage

At approximately 2012-06-30 03:02 UTC, the central Tarsnap server (hosted in
Amazon's EC2 US-East region) went offline due to a power outage. According
to Amazon's post-mortem of this incident, this was caused by two generator
banks independently failing to provide stable voltages after utility (grid)
power was lost during a severe electrical storm.

When the Tarsnap EC2 instance came back online about an hour later, I found
that the abrupt loss of power had resulted in filesystem corruption. As I
explained in a blog post in December 2008, Tarsnap stores all user data in
Amazon S3, but keeps metadata cached in EC2. While I could not see any
evidence that the power loss had resulted in this cached metadata being
corrupted, I could not absolutely rule out the possibility -- so keeping in
mind that the first responsibility of a backup system is to avoid any
possibility of data loss or corruption, I decided to err on the side of
caution by treating the state on EC2 as "untrustworthy".

As I described in my December 2008 blog post, losing all state on EC2 is a
failure mode which Tarsnap is designed to survive, by reading data back from
S3 and replaying all of the operation log entries. That said, losing local
state and needing to recover completely from off-site backups (which Amazon
S3 counts as, since it's replicated to multiple datacenters) is very much a
worst-case scenario. It's also, quite literally, a nightmare scenario: A
few days before this outage, I had a nightmare about this happening, probably
provoked by the 2012-06-15 power outage which affected instances in a different
EC2 availability zone.
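
To make the "replay the operation log" idea concrete, here is a minimal sketch in Python. It is not Tarsnap's actual code or log format: the `LogEntry` fields, the "write" and "delete" operation names, and the per-machine set of archive names are all assumptions chosen for illustration. The point is only that cached state can be thrown away and rebuilt deterministically from an ordered log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    seq: int       # position in the log; replay must follow this order
    op: str        # "write" or "delete" (hypothetical operation names)
    machine: str   # which machine's state the entry applies to
    archive: str   # archive name affected by the operation


def replay(entries):
    """Rebuild per-machine state from scratch by applying entries in order."""
    state = {}
    for e in sorted(entries, key=lambda e: e.seq):
        archives = state.setdefault(e.machine, set())
        if e.op == "write":
            archives.add(e.archive)
        elif e.op == "delete":
            archives.discard(e.archive)
        else:
            raise ValueError(f"unknown operation {e.op!r} at seq {e.seq}")
    return state


# Entries may be fetched in any order; replay sorts them by sequence number.
log = [
    LogEntry(2, "write", "laptop", "tuesday"),
    LogEntry(1, "write", "laptop", "monday"),
    LogEntry(3, "delete", "laptop", "monday"),
    LogEntry(4, "write", "server", "nightly"),
]
rebuilt = replay(log)
```

Because the result depends only on the log contents, the same replay can be run on a fresh machine and compared against whatever state survived on the old one.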

I restored Tarsnap to a fully operational state at 2012-07-01 22:50 UTC,
slightly less than 44 hours after the outage started. Obviously any outage
is bad and an outage of this length is unacceptable; over the course of the
recovery process I learned several lessons which should make recovery from
any future "complete metadata loss" incidents faster (see below).
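
The "slightly less than 44 hours" figure, and the time spent in each recovery phase, can be checked from the timestamps in the timeline below. The following Python sketch does that arithmetic. The helper names are hypothetical, and it assumes that each phase ended at the moment the next one started, which the timeline does not state explicitly for the second phase.

```python
from datetime import datetime


def utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp as used in the timeline."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M")


def hours_minutes(start, end):
    """Return the elapsed time between two timestamps as (hours, minutes)."""
    minutes = int((utc(end) - utc(start)).total_seconds()) // 60
    return divmod(minutes, 60)


# Whole outage: power loss until the Elastic IP was switched over.
outage = hours_minutes("2012-06-30 03:02", "2012-07-01 22:50")   # (43, 48)

# Phase 1: reading objects back from S3.
phase1 = hours_minutes("2012-06-30 05:25", "2012-07-01 01:45")   # (20, 20)

# Phase 2: rebuilding the map database (assumed to end when phase 3 starts).
phase2 = hours_minutes("2012-07-01 01:45", "2012-07-01 11:18")   # (9, 33)

# Phase 3: replaying logs, including the failed first attempt.
phase3 = hours_minutes("2012-07-01 11:18", "2012-07-01 21:59")   # (10, 41)
```

On these numbers the first phase alone accounts for nearly half of the outage.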

While Tarsnap does not have any formal SLA, I have a (rather ill-defined)
policy of issuing credits to Tarsnap users affected by outages or bugs in the
Tarsnap code, based on my personal sense of fairness. While the original cause
of this outage was out of my control, the outage should have been much shorter,
and as a result I have credited Tarsnap users' accounts with 50% of a month of
storage costs.

Timeline of events

(Some of the times below are approximate; I was not worrying about keeping notes
as this process was underway so I've had to reconstruct the timeline based on
my memory, log files, and file timestamps.)

2012-06-30 03:02 UTC: Power is lost to the Tarsnap server, and I observe it
ceasing to respond to network traffic. There had been two recent outages in
the EC2 US-East region: the power outage on June 15th, after which Amazon wrote
in a post-mortem that "We have also completed an audit of all our back-up power
distribution circuits", and a network outage earlier on June 30th. (These two
earlier outages were far more limited in scope, and neither affected Tarsnap.)
Given that history, my initial presumption is that the outage is caused by a
network issue.

2012-06-30 03:21 UTC: My presumption that the outage was network-related is
reinforced by Amazon posting to its status page that they were
"investigating connectivity issues".

2012-06-30 03:40 UTC: Amazon posts to its status page that "a large number
of instances [...] have lost power". I start attempting to launch a replacement
EC2 instance in case it becomes necessary to perform a full state recovery from
S3 (as eventually proved to be the case), but my attempts fail due to EC2 APIs
being offline as a result of the power outage.

2012-06-30 04:03 UTC: Power is restored to the Tarsnap server, and I SSH in to
find that it suffered filesystem corruption as a result of the power outage.
Since I cannot rule out the possibility that the local state was corrupted, I
continue with plans to perform a full state recovery.

2012-06-30 04:37 UTC: I succeed in launching a replacement Tarsnap server and
start configuring it and installing the Tarsnap server code. This process
includes creating and attaching Elastic Block Store disks using the AWS
Management Console, which is slowed down by repeated timeouts and errors.

2012-06-30 05:25 UTC: I finish configuring the replacement Tarsnap server and
start the process of regenerating its local state from S3. The first phase of
this process involves reading millions of stored S3 objects; unfortunately,
these reads were performed in sequential order, triggering a worst-case
performance behaviour in S3. As a result, this phase of recovery took much
longer than I had anticipated, and the design of the code meant that
changing the order in which objects were read was not something I could do "on
the fly".

2012-06-30 15:00 UTC: After inspecting the "outaged" Tarsnap server, I conclude
that the state corruption is almost certainly limited to archives committed in
the last few seconds before the power loss. Consequently, I bring the server
back online in a read-only mode, so that anyone who needs their data urgently
can retrieve it before the full recovery process is complete.

2012-07-01 01:45 UTC: The first phase of Tarsnap recovery -- retrieving bits
from S3 -- completes, and the second phase -- reconstructing a "map" database
which identifies the location of each block of data within S3 -- starts.

2012-07-01 02:30 UTC: I notice that the map database reconstruction is running
anomalously slowly. This turns out to be caused by some I/O rate-limiting code
I had put in place to prevent back-end processes starving the front-end Tarsnap
daemon (which has much stricter scheduling requirements) for I/O -- obviously,
those limits were not necessary during the reconstruction stage when there was
no front-end daemon running.

2012-07-01 11:18 UTC: The third (final) phase of Tarsnap recovery -- replaying
logs to reconstruct the server-side cached state for each machine -- starts.

2012-07-01 12:41 UTC: The third phase of Tarsnap recovery fails due to requests
to S3 timing out. Normally the Tarsnap code retries failed S3 requests, but in
this particular code path that functionality was missing. I increase the S3
timeouts by a factor of 10 and restart the third recovery phase.

2012-07-01 21:59 UTC: The third phase of Tarsnap recovery completes. I look
over the reconstructed state (and compare it against the mostly-correct state
on the outaged server) to confirm that the reconstruction worked properly; then
start the Tarsnap server code and run some tests.

2012-07-01 22:50 UTC: I'm satisfied that everything is working properly, and
switch the Tarsnap server's "Elastic IP" address over to point at the new
server instance.
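
The slow first phase (the 05:25 entry above) came from reading S3 objects in sequential key order. One way to avoid that pattern, sketched below in Python, is to shuffle the list of keys before fetching them so that consecutive requests no longer target neighbouring keys. This is an illustration rather than the fix actually made to Tarsnap: the key names and the `recovery_read_order` helper are hypothetical, and it assumes the slowdown is tied to key adjacency, which is more than the timeline entry itself states.

```python
import random


def recovery_read_order(keys, seed=None):
    """Return the same keys in a shuffled order for bulk reads.

    A seed can be supplied to make the order reproducible, which is
    handy when testing a recovery run.
    """
    order = list(keys)
    random.Random(seed).shuffle(order)
    return order


# Hypothetical object names; sorted order is the problematic access pattern.
keys = [f"blocks/{i:08d}" for i in range(1000)]
order = recovery_read_order(keys, seed=2012)
```

Every object is still read exactly once; only the order changes, so this can sit in front of whatever code performs the actual fetches.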

Lessons learned

1. Test disaster recovery processes *at scale*. I have always tested the
process for recovering from state stored on S3 every time I prepare to roll out
new code -- it's part of my "make test" -- but I have always done this on a
small data set. A test with six orders of magnitude less data may help to
confirm that a recovery process works, but it certainly doesn't test that the
process works *quickly*; and both success and speed are important.

2. Disaster recovery processes should not rely on the AWS Management Console.
According to Amazon, the EC2 and EBS control plane was restored by 2012-06-30
04:10 UTC, but I was experiencing timeouts and errors (most commonly "too many
requests") from the Console for over an hour beyond that point. Obviously when a
disaster occurs there is a large influx of users into the Management Console;
evidently it needs work to improve its scalability.

3. Disaster recovery processes should start at the first sign of an outage. In
this particular case it wouldn't have made any difference since for the first
hour after the outage began it was impossible to launch a replacement EC2
instance; but that is an EC2 failure mode which Amazon says they are addressing,
so in the future starting the recovery immediately rather than waiting to find
out if the outage is a transient network issue or more severe could save time.

4. Sequential accesses to Amazon S3 are bad. This isn't so much a "lesson
learned" as a "lesson I remembered applies here" -- I was aware of it but hadn't
realized that it would slow down the recovery process so much.

5. The appropriate behaviours during state recovery are not always the same as
the appropriate behaviours during normal operations. Disk I/O rate limiting
of back-end processes and filesystem syncing are normally useful, but not so
much when it is important to recover state and get back online as quickly as
possible. Code shared between "normal" and "recovery" operations -- which is,
in fact, most of the Tarsnap server code -- should be aware of which mode it's
running in and control those behaviours appropriately.

6. Tarsnap users are amazing people. (Ok, I knew this already, but still, this
episode reinforced it.) I didn't see a single irate email, tweet, or comment on
IRC during the entire outage; people politely asked if there was an outage and
if it was related to the EC2 outage, and aside from that the only communications
I had were very positive, mostly thanking me for my efforts and status reports.
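
Lesson 5 above can be sketched in a few lines of Python: shared code asks a throttle how long to wait, and the throttle knows which mode the server is in. The `Mode` enum, the `IOThrottle` class, and the simple bytes-per-second model are assumptions made for this example; they are not taken from the Tarsnap server code.

```python
import time
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"       # front-end daemon is running; protect its I/O
    RECOVERY = "recovery"   # nothing to protect; finish as fast as possible


class IOThrottle:
    def __init__(self, mode, bytes_per_second, sleep=time.sleep):
        self.mode = mode
        self.bytes_per_second = bytes_per_second
        self._sleep = sleep  # injectable so tests need not really sleep

    def delay_for(self, nbytes):
        """Seconds a back-end process should pause after moving nbytes."""
        if self.mode is Mode.RECOVERY:
            return 0.0
        return nbytes / self.bytes_per_second

    def charge(self, nbytes):
        """Apply the delay for nbytes of I/O and return it."""
        delay = self.delay_for(nbytes)
        if delay > 0:
            self._sleep(delay)
        return delay


normal = IOThrottle(Mode.NORMAL, bytes_per_second=1_000_000)
recovery = IOThrottle(Mode.RECOVERY, bytes_per_second=1_000_000)
normal_delay = normal.delay_for(500_000)      # 0.5 seconds
recovery_delay = recovery.delay_for(500_000)  # 0.0 seconds
```

Keeping the decision in one object means the back-end code paths stay identical in both modes, and a forgotten limit cannot silently slow down a recovery.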

Final words

After writing so much about Tarsnap, I'd like to take a moment to provide some
wider context to this outage. The power outage which knocked Tarsnap offline
was big -- judging by Amazon's statement of "7% of instances" and the number of
IP addresses in the EC2 US-East region, somewhere around 50,000 instances went
offline -- and as I'm sure everybody reading this is aware, also affected such
"big names" as Netflix, Pinterest, Instagram, and Heroku. Most people stop at
this point, but there's more to the story than that.

This power outage was caused by a "derecho" thunderstorm system which is
believed to be one of the most severe non-hurricane storm systems in North
American history. 22 people are believed to have died as a direct result of
the storm, and over 3 million homes and businesses lost power. Approximately
a million people are still without power now, five days later.

Even worse, this storm system was caused by a heat wave which has set record
temperatures in hundreds of locations across the Eastern US, with many areas
exceeding 38 C (100 F) for several days in a row. For the elderly, the young,
and individuals with chronic diseases, prolonged exposure to these temperatures
can be life-threatening -- and without electricity, air conditioning is not
available. By the time this heat wave is over, it could easily be responsible
for hundreds or even thousands of deaths.

For many of us, a datacenter losing power is the only effect we will see from
this storm. For most of the people who were directly affected by the storm,
it's the least of their worries.