We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node [0], and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.

If you’re not familiar with game days, the best introductory article is this one from John Allspaw [1]. Below, we’ll lay out a playbook for how to run a game day, and describe the results from our latest exercise to show why we believe they are valuable.

How to run a game day exercise

The system we recently tested, scoring-srv, is one part of our fraud detection system. The scoring-srv processes run on a cluster of boxes and connect to a three-node Redis cluster to store fraud scoring data. Our internal charge-processing code connects to scoring-srv for each charge made on Stripe’s network, so it needs to be very low-latency; likewise, accurate scoring requires historical data, so it needs durable storage.

The scoring-srv developers and a member of our systems team, who could help run the tests, got together around a whiteboard. We drew a basic block diagram of the machines and processes, the data stores, and the network connections between the components. With that diagram, we were able to come up with a list of possible failures.

We came up with a list of six tests we could run easily:

destroying and restoring a scoring-srv box,

destroying progressively more scoring-srv boxes until calls to it began timing out,

partitioning the network between our charge processing code and scoring-srv,

increasing the load on the primary Redis node,

killing the primary Redis node, and

killing one of the Redis replicas.

Since the team was new to game days, we did not try to be comprehensive or clever. We instead chose the simplest, easiest to simulate failures we could think of. We’d take a blunt instrument, like kill -9 or aws ec2 terminate-instances, give the system a good hard knock, and see how it reacted [2].

For each test, we came up with one or more hypotheses for what would happen when we ran it. For instance, we guessed that partitioning the network between charge processing and scoring-srv would cause these calls to time out and fail open (that is, allow the charge to go through immediately). Then, we decided on an order to perform the tests, saved a backup of a recent Redis snapshot as a precaution, and dove in.

Here, then, is a quick-start checklist for running a game day:

Get the development team together with someone who can modify the network and destroy or provision servers, and block off an afternoon to run the exercise.

Make a simple block diagram of the machines, processes, and network connections in the system you’re testing.

Come up with 5-7 of the simplest failures you can easily induce in the system.

Write down one or more hypotheses for what will happen after each failure.

Back up any data you can’t lose.

Induce each failure and observe the results, filing bugs for each surprise you encounter.

Observations and results

We were able to terminate a scoring-srv machine and restore it with a single command in roughly the estimated time. This gave us confidence that replacing or adding cluster machines would be fast and easy. We also saw that killing progressively more scoring-srv machines never caused timeouts, showing we currently have more capacity than necessary. Partitioning the network between the charge-processing code and scoring-srv caused a spike in latency, where we’d expected calls to scoring-srv to time out and fail open quickly. This test also should have immediately alerted the teams responsible for this system, but did not.

The first Redis test went pretty well. When we stopped one of the replicas with kill -9, it flapped several times on restart, which was surprising and confusing to observe. As expected, though, the replica successfully restored data from its snapshot and caught up with replication from the primary.

Then we moved to the Redis primary node test, and had a bigger surprise. While developing the system, we had become concerned about latency spikes during snapshotting of the primary node. Because scoring-srv is latency-sensitive, we had configured the primary node not to snapshot its data to disk. Instead, the two replicas each made frequent snapshots. In the case of failure of the primary, we expected one of the two replicas to be promoted to primary; when the failed process came back up, we expected it to restore its data via replication from the new primary. That didn’t happen. Instead, when we ran kill -9 on the primary node (and it was restarted by daemontools), it came back up – after, again, flapping for a short time – with no data, but was still acting as primary. From there, it restarted replication and sent its empty dataset to the two replica nodes, which lost their datasets as a result. In a few seconds, we’d gone from a three-node replicated data store to an empty data set. Fortunately, we had saved a backup and were able to get the cluster re-populated quickly.

The full set of tests took about 3.5 hours to run. For each failure or surprise, we filed a bug describing the expected and actual results. We wound up with 15 total issues from the five tests we performed (we wound up skipping the Redis primary load test) – a good payoff for the afternoon’s work. Closing these, and re-running the game day to verify that we now know what to expect in these cases, will greatly increase our confidence in the system and its behavior.

Learning from the game day

The invalidation of our Redis hypothesis left us questioning our approach to data storage for scoring-srv. Our original Redis setup had all three nodes performing snapshots (that is, periodically saving data to disk). We had tested failover from the primary node due to a clean shutdown and it had succeeded. While analyzing the cluster once we had live data running through it, though, we observed that the low latency we’d wanted from it would hit significant spikes, above 1 second, during snapshotting:

Obviously these spikes were concerning for a latency-sensitive application. We decided to disable snapshotting on the primary node, leaving it enabled on the replica nodes, and you can see the satisfying results below, with snapshotting enabled, then disabled, then enabled again:

Since we believed that failover would not be compromised in this configuration, this seemed like a good trade-off: relying on the primary node for performance and replication, and the replica nodes for snapshotting, failover, and recovery. As it turned out, this change was made the day before the game day, as part of the final lead-up to production readiness. (One could imagine making a similar change in the run-up to a launch!)

The game day wound up being the first full test of the configuration including all optimizations and changes made during development. We had tested the system with a primary node shutdown, then with snapshotting turned off on the primary, but this was the first time we’d seen these conditions operating together. The value of testing on production systems, where you can observe failures under the conditions you intend to ship, should be clear from this result.

After discussing the results we observed with some friends, a long and heated discussion about the failure took place on Twitter, in which Redis’ author said he had not expected the configuration we were using. Since there is no guarantee the software you’re using supports or expects the way you’re using it, the only way to see for certain how it will react to a failure is to try it.

While Redis is functional for scoring-srv with snapshotting turned on, the needs of our application are likely better served by other solutions. The trade-off between high-latency spikes, with primary node snapshotting enabled, versus total cluster data loss, with it disabled, leaves us feeling neither option is workable. For other configurations at Stripe – especially single-node topologies for which data loss is less costly, such as rate-limiting counters – Redis remains a good fit for our needs.

Conclusions

In the wake of the game day, we’ve run a simple experiment with PostgreSQL RDS as a possible replacement for the Redis cluster in scoring-srv. The results suggest that we could expect comparable latency without suffering snapshotting spikes. Our testing, using a similar dataset, had a 99th percentile read latency of 3.2 milliseconds, and a 99th percentile write latency of 11.3 milliseconds. We’re encouraged by these results and will be continuing our experiments with PostgreSQL for this application (and obviously, we will run similar game day tests for all systems we consider).

Any software will fail in unexpected ways unless you first watch it fail for yourself. We completely agree with Kelly Sommers’ point in the Twitter thread about this:

If there's anything to learn from this Redis problem, even a simple kill -9 test needs to happen more often in our industry.

We’d highly recommend game day exercises to any team deploying a complex web application. Whether your hypotheses are proven out or invalidated, either way you’ll leave the exercise with greater confidence in your ability to respond to failures, and less need for on-the-fly diagnosis. Having that happen for the first time while you’re rested, ready, and watching is the best failure you can hope for.

Notes

[0] We’ve chosen to use the terms “primary” and “replica” in discussing Redis, rather than the terms “master” and “slave” used in the Redis documentation, to support inclusivity. For some interesting and heated discussion of this substitution, we’d recommend this Django pull request and this Drupal change.