HTTP Routing Errors

Follow-up

Beginning at 2012-06-07 15:52:41 UTC, the Heroku routing mesh experienced a major outage that impacted all apps running on the Heroku platform. Customer impact was as follows:

Approximately 30 minutes of complete HTTP routing outage.

Afterward, approximately 1.5 hours of intermittent HTTP errors and degraded HTTP routing latency for 10-15% of all HTTP traffic on the platform.

For most of the outage, API maintenance mode was enabled as a control rod to contain damage.

status.heroku.com was largely inaccessible for the early part of the outage, and intermittently unreliable in later parts.

What happened?

Routing Outage

The routing outage was the result of three root causes.

The first root cause was related to the streaming data API which connects the
dyno manifold to the routing mesh. On the dyno management side, an engineer
was performing a manual garbage collection process which created an unusual
record in the data stream. On the routing side, the subprocess of the
router which handles the incoming stream could not parse this record.

The nature of this streaming API is similar to that of the replication
protocols used by CouchDB or
Redis. As such, unexpected records
cannot simply be discarded, since the veracity of the entire dataset depends on
the in-sequence collection of data. In this model, the correct failure mode of
the routing subprocess which consumes the stream from the dyno manager when
encountering an unexpected record is to stop processing the stream, flag an
error in the monitoring system, and wait for a human to investigate. This puts
that routing node into a degraded / read-only mode, which is suitable to
continue operating for the next few minutes as engineers investigate the issue.
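The fail-stop behavior described above can be sketched as follows. This is a minimal illustration, not Heroku's actual router: the record format, the field names (`app`, `dynos`), and the class shape are all hypothetical assumptions. The key property is that an unparseable record halts stream processing and flags a human, while lookups continue from the last consistent snapshot:

```python
import json

class StreamConsumer:
    """Consumes an ordered record stream; on a bad record it stops
    applying updates and degrades to read-only instead of crashing."""

    def __init__(self):
        self.routes = {}        # last known-good routing table
        self.read_only = False  # set when the stream is halted
        self.alerts = []        # stand-in for a monitoring system

    def apply(self, raw):
        """Apply one record from the stream. Returns True if applied."""
        if self.read_only:
            return False        # stream halted; keep serving stale data
        try:
            record = json.loads(raw)
            self.routes[record["app"]] = record["dynos"]
            return True
        except (ValueError, KeyError):
            # The record can't simply be skipped: correctness depends on
            # in-sequence application. So halt the stream, flag a human,
            # and keep answering lookups from the last good snapshot.
            self.read_only = True
            self.alerts.append("unparseable record; stream halted")
            return False

    def lookup(self, app):
        """Route lookups keep working even while the stream is halted."""
        return self.routes.get(app)
```

In this model a bad record degrades one node to read-only service for the few minutes it takes an engineer to investigate, rather than taking the node down entirely.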

The second root cause was that when the router subprocess encountered the
record, instead of going into this degraded mode of operation, it crashed
completely. Each time it was restarted by the supervisor process it tried to
handle the record and crashed again.

The third root cause was that the supervisor process had a cooldown for
subprocess restarts, similar to that found in
Upstart and other
init-style process managers. Due to a
design flaw in the router's process tree, the supervisor process was itself
crashing when this cooldown was reached.
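A restart cooldown of this kind is usually meant to protect the supervisor, not kill it: once a child exceeds some restart budget, the supervisor should stop restarting it and alert a human, while itself surviving. The following is an illustrative sketch under that assumption; the `supervise` function and its parameters are invented for this example and are not the router's actual process manager:

```python
import time

def supervise(start_child, max_restarts=5, window=10.0, alert=print):
    """Restart a crashing child, but if it crashes more than
    max_restarts times within `window` seconds, suspend restarts and
    alert a human. The supervisor itself must survive the crash loop."""
    restarts = []
    while True:
        try:
            start_child()   # blocks until the child exits or crashes
        except Exception as exc:
            now = time.monotonic()
            # Keep only crashes inside the cooldown window.
            restarts = [t for t in restarts if now - t < window]
            restarts.append(now)
            if len(restarts) > max_restarts:
                alert(f"child crash-looping ({exc}); restarts suspended")
                return False  # degrade gracefully; don't crash upward
            continue
        return True           # child exited cleanly
```

The design flaw described above amounts to the `return False` branch instead propagating the failure and taking the supervisor down with the child.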

Additionally, there is a warm-up time when new routing nodes are brought
online. As our engineers worked to boot a large amount of extra capacity at the
same time, this placed a substantial load on our internal systems and increased
the boot time for these new nodes. Had we been able to bring up this new
capacity faster, the residual effects of this incident would have been
shortened.

Status Outage

Just over two weeks ago, we launched a totally rewritten version of our status
site. The improved status site allows users to subscribe to notifications when
an incident is opened. As a result, our status site experienced unprecedented
spikes of load during this incident. This high load crushed the site, resulting
in an inability for us to effectively communicate with customers during the
course of this incident.

Customer communication during an incident is extremely important. When we lost
our primary channel for communicating with our customers, we made a very bad
incident even worse.

Remediation

HTTP Routing Failure

Since the outage, we've rearchitected the routing subprocesses to be more
resilient to unexpected input. Rather than crashing, they will fail gracefully
by alerting a human to the bad input and continuing to operate in read-only mode.
We are also updating the router to be able to run cleanly in read-only mode in
order to provide service even when unanticipated control plane failures occur.

We're also working to decrease the time it takes us to bring additional routing
nodes online, especially when many of them are launched at the same time.
Reducing this cycle will enable us to shorten the duration of these kinds of
residual effects.

Additionally, we intend to make much greater use of fine-grained control rods
in the future to disable specific functionality in order to prevent incidents
from spiraling out of control. Had we utilized this functionality earlier, it
would have shortened the system's recovery time.
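As an illustration of the idea, a control rod can be as simple as a set of operator-togglable flags that gate individual subsystems, so one feature can be shut off without a platform-wide maintenance mode. The `ControlRods` class and feature names below are hypothetical, not Heroku's internal tooling:

```python
class ControlRods:
    """Minimal 'control rods': operators can disable specific
    subsystems to contain an incident without a full shutdown."""

    def __init__(self, features):
        # All listed features start enabled.
        self._enabled = {f: True for f in features}

    def scram(self, feature):
        """Disable one feature to limit damage during an incident."""
        self._enabled[feature] = False

    def restore(self, feature):
        """Re-enable a feature once the incident is contained."""
        self._enabled[feature] = True

    def allowed(self, feature):
        """Gate requests on this check; unknown features are denied."""
        return self._enabled.get(feature, False)
```

For example, disabling `api_writes` while leaving reads and routing untouched contains a control-plane problem without taking down serving traffic.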

Status Site

The status site is not hosted on Heroku, so it does not benefit from the
platform's increased elasticity. In order to cope with the increased demand
generated by the status notifications, we have replaced the server with
a larger instance. We're also improving the site's performance with improved
caching and other optimizations. Finally, an inadvertent dependency on fonts &
images hosted on the Heroku platform has been removed.

Incident Response

We are continually striving to achieve the uptime our customers demand. In
addition to platform uptime, however, there's also the question of how we
respond to incidents when they do happen. Our internal incident readiness team has been working on
improving our procedures for rapid response and customer communication during
outages and we saw some of the results of that work during Thursday's incident:

Engineers were on the case within 2 minutes of our monitoring alerts, and had
diagnosed the actual issue within 7 minutes.

Improved communication procedures resulted in a public status incident being
opened 3 minutes after the first alert was sent. Previously, a lack of
procedure meant that it could sometimes take much longer to confirm an issue.
(Unfortunately, the status site outage caused our timely action here to be
largely moot.)

Utilization of control rods (locking down the API via maintenance mode)
prevented the issue from being prolonged by secondary effects.

However, this incident has demonstrated exactly where we should focus our
efforts to further improve our incident response. We intend to use the lessons
learned to rapidly iterate on the enhanced incident response procedures we've
been developing. In the end, we want to deliver a reliable platform and keep
our customers as informed as possible when we're having issues.