Blockading Stardog: Creating Order from Chaos

Stardog Cluster provides a highly available platform that can be distributed
across multiple servers in your infrastructure. To get there, we've
implemented a kind of controlled chaos testing.

However, any distributed system will experience unexpected node or network
failures. These failures, while relatively infrequent, can wreak havoc
when they occur. Debugging and fixing these seemingly random
failures is often extremely difficult, if not impossible, depending on the
circumstances of the failure.

Background

In this post we discuss the work we're doing at Stardog to tackle a key
challenge of chaos engineering for Stardog Cluster: building a suite of
repeatable chaos tests that we call "controlled chaos". These tests help us
simulate, quickly diagnose, and fix issues that arise from unexpected node and
network failures, and to do all of that reproducibly. The controlled chaos
tests supplement our extensive set of unit and integration tests for Stardog.

When unexpected failures happen in production it is often very difficult to fix
them solely from a postmortem.

First, you need to determine the cause of the
problem. Was it a node crashing? A network partition? Excessive network latency?
Excessive processing latency due to garbage collection?

Once you identify the cause, you need to determine when, exactly, it
occurred, which services were impacted, and for how long. Even if you can
piece that information together from logs and whatever monitoring tools are
available (if any), you are still forced to reason about the potential
problems the failure could have triggered in the code.

When you think you’ve identified the problematic code, you have to
set about attempting to recreate the failure and testing your fix, repeating
the process if your fix doesn’t work.

And, finally, you need to add a regression
test to ensure the problem isn’t reintroduced by future code.
While it’s not impossible to fix bugs in this manner, it’s extremely
time consuming and often fraught with roadblocks and trial and error.

Services like Netflix’s Chaos Monkey
can help inflict failures on your application by randomly terminating Amazon
EC2 instances in your test or production environments. Killing an instance is
a failure your application should certainly be able to withstand.

But many other kinds of failure can cause your application to behave
unexpectedly, such as excessive latency or network partitions. And because
Chaos Monkey's failures are random, it does little to help you reproduce a
problem or determine its root cause.

Controlled Chaos

The idea behind our controlled chaos tests is to inject chaotic node and
network failures in a systematic fashion so that bugs can be easily recreated,
debugged, fixed, and added to our regression tests, ensuring that the issue is
not reintroduced in the future.

It may seem ironic to talk about “chaos” and “systematic” or “deterministic” in the same sentence, but that’s exactly what we’re after. We want to be resilient in the face of a class of unpredictable errors and faults, and the best way to do that is to inject them into the system in a programmatic, repeatable fashion.

Blockade

To introduce node and network failures we use an open source tool,
Blockade, that is designed to insert
failures into distributed systems. Blockade assumes the
target application is running in Docker containers and causes failures by
terminating a container or using system commands like iptables and tc
to partition the network, add latency, or drop packets between different
groups of containers. Blockade can inject the following failures between
containers:

Partition the network

Add network latency

Duplicate packets

Drop packets

Kill, stop, and restart containers

Blockade can be run as a standalone command line application or as a daemon,
providing access to Blockade operations via a
REST API.
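As a sketch of how Blockade is driven (the container names and image here are illustrative assumptions, not our actual configuration), a minimal blockade.yaml might look like:

```yaml
# blockade.yaml -- illustrative sketch; names and image are hypothetical
containers:
  stardog1:
    image: stardog-test
  stardog2:
    image: stardog-test
  stardog3:
    image: stardog-test

network:
  # defaults used by `blockade slow` and `blockade flaky`
  slow: "75ms 100ms distribution normal"
  flaky: "30%"
```

With a config like this in place, `blockade up` starts the containers, `blockade partition stardog1` isolates a node from the rest, `blockade join` heals the partition, and `blockade kill stardog1` terminates a container outright.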

Test Environment and Test Cases

Our controlled chaos test environment runs three Stardog nodes, three Zookeeper nodes, an HAProxy node, and Blockade, each in its own Docker container. The
cluster is defined in a small Docker Compose
configuration file, allowing the cluster to be deployed on any Docker-enabled
host, including local Linux or Mac developer environments.

The HAProxy and Stardog nodes use known static ports and are configured to
point to the dynamic Zookeeper ports set up by Docker.
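A sketch of what such a Compose file could look like (the image names, ports, and layout here are illustrative assumptions, not our actual configuration):

```yaml
# docker-compose.yml -- illustrative sketch only
version: "2"
services:
  zk1:
    image: zookeeper:3.4      # Zookeeper client ports left dynamic
  zk2:
    image: zookeeper:3.4
  zk3:
    image: zookeeper:3.4
  stardog1:
    image: stardog-test       # hypothetical image name
    ports:
      - "5821:5820"           # known static host port
  stardog2:
    image: stardog-test
    ports:
      - "5822:5820"
  stardog3:
    image: stardog-test
    ports:
      - "5823:5820"
  haproxy:
    image: haproxy:1.7
    ports:
      - "5820:5820"           # clients reach the cluster through the proxy
```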

The cluster test suite is written in Java with JUnit and uses Blockade’s REST
interface to manipulate the Stardog and Zookeeper Docker containers.
The initial set of controlled chaos tests covers a variety of Stardog
Cluster use cases.

The tests iterate over the Blockade failure events and target Stardog and
Zookeeper instances, triggering failures at specific points throughout the test
by making REST calls to Blockade. For example, a test may create a database,
begin a transaction, insert triples, partition a Stardog node with Blockade and
then attempt to commit the transaction.
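Sketching that flow as an ordered plan of REST calls (our actual suite is Java/JUnit; every URL path below, and the container names, are assumptions for illustration rather than Stardog's or Blockade's exact API):

```python
# Illustrative sketch of one controlled-chaos test as an ordered plan of
# REST calls. The URL paths are hypothetical, not the real APIs.

def chaos_commit_plan(blockade="http://blockade:5000",
                      stardog="http://haproxy:5820"):
    """Return the ordered (method, url, body) calls the test would make."""
    return [
        # 1. create a database through the proxy
        ("POST", stardog + "/admin/databases", {"name": "chaosdb"}),
        # 2. begin a transaction and insert triples
        ("POST", stardog + "/chaosdb/transaction/begin", None),
        ("POST", stardog + "/chaosdb/add", "<urn:s> <urn:p> <urn:o> ."),
        # 3. partition one Stardog node mid-transaction via the Blockade daemon
        ("POST", blockade + "/blockade/chaos/partitions",
         {"partitions": [["stardog1"], ["stardog2", "stardog3"]]}),
        # 4. the commit should still succeed on the majority side
        ("POST", stardog + "/chaosdb/transaction/commit", None),
    ]

plan = chaos_commit_plan()
for method, url, body in plan:
    print(method, url)
```

A real test executes each step in order, asserting success or failure of the commit depending on which side of the partition the transaction coordinator lands on.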

Because Stardog Cluster prefers consistency over availability in the presence
of a partition (the C in CAP-theorem parlance), the remaining Stardog nodes
should expel the partitioned node from the cluster and commit the transaction
successfully. Of course, depending on the
size of the cluster and the number and type of nodes impacted by the partition,
Stardog may continue to provide both consistency and availability. When the
partition is resolved, the node will synchronize any changes and rejoin
the cluster.
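The “depending on the size of the cluster” point boils down to a simple majority rule. A minimal sketch (this is the general quorum principle, not Stardog’s internal implementation):

```python
def has_quorum(cluster_size: int, reachable: int) -> bool:
    """True when a strict majority of nodes remain mutually reachable."""
    return reachable > cluster_size // 2

# In a 3-node cluster, partitioning off one node leaves a 2-node side
# that can keep committing, while the isolated node must stop serving.
print(has_quorum(3, 2))  # True
print(has_quorum(3, 1))  # False
```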

There are currently about 250 controlled chaos tests that iterate over these
Stardog use cases, Blockade failures, and Stardog and Zookeeper node targets.
Each test performs specific database operations while specific node and
network failures, occurring at inopportune times, strike specific targets, and
each behaves the same way every time it runs. The tests also serve as a set of
chaos regression tests as we continue to improve and evolve Stardog Cluster.

This testing approach has dramatically improved the stability, resilience, and performance of Stardog Cluster and will be available in the upcoming 5.0 release.

What’s Next

The controlled chaos tests are only the beginning of Stardog Cluster chaos
testing. We’re constantly adding additional use cases to help us identify
and fix issues. For example, we will integrate programmatic failure events at the filesystem level to improve how Stardog handles filesystem faults and performance degradation.

We’ll also be adding even more chaotic tests that introduce random failures
(via Blockade in our case) at completely random times on a cluster under load,
giving us even higher confidence in Stardog Cluster’s ability to withstand
chaotic node and network failures. Look forward to more posts on this in the
coming months!