Simulating the P of CAP on a Riak cluster

When developing and testing a distributed system, one of the essential concerns you will deal with is eventual consistency (EC). There areplenty of articles covering EC so I’m not going to dwell on that much here. What I am going to talk about is testing, particularly of the automated kind.

Testing an eventually consistent system is difficult because everything is transient and unpredictable. Unless you tune your consistency values like N and DW then you’re offered no guarantees about the extent of propagation of your commit around the cluster. And whilst consistency tuning may be acceptable for some tests, it most definitely won’t be acceptable for tests that cover concurrent write concerns such as your sibling resolution protocols.

What is “partition tolerance”?

This is where a single node or a group of nodes in the cluster become segregated from the rest of the cluster. I liken it to a “net split” on an IRC network. The system continues operating but it has been split into two or more parts. When a node has become segregated from the rest of the cluster it does not necessarily mean that its clients can also not reach it. Therefore all nodes can continue to perform writes on the dataset.

Generally speaking, a partition event is a transient situation. They may last a few seconds, a few minutes, hours.., days.. or even weeks. But the expectation is that eventually the partition will be repaired and the cluster returned to full health.

Fig. 1: State changes of a cluster during a partition event

In the diagram (Fig. 1) there is a cluster comprised of three nodes, and this is the sequence of events:

N2 becomes network partitioned from N1 and N3.
N1 and N3 can continue replicating between one another.

Client(s) connected to either N1 or N3 perform two writes (indicated by the green and orange).

Client(s) connected to N2 perform three writes (purple, red, blue).

When the network partition is resolved, the cluster begins to heal by merging the commits between the nodes.

The yellow and green commits (that were already replicated between N1 and N3) are propagated onto N2.

The purple, red and blue commits on N2 are propagated onto N1 and N3.

The cluster is now fully healed.

Simulating a partition

I have a Riak cluster of three nodes running on Windows Azure and I needed some way to deterministically simulate a network partition scenario. The solution that I came up with was quite simple. I basically wrote some iptables scripts that temporarily firewalled certain IP traffic on the LAN in order to prevent the selected node from communicating with any other nodes.

To erect the partition:

# Simulates a network partition scenario by blocking all TCP LAN traffic for a node.
# This will prevent the node from talking to other nodes in the cluster.
# The rules are not persisted, so a restart of the iptables service (or indeed the whole box) will reset things to normal.
# First add special exemption to allow loopback traffic on the LAN interface.
# Without this, riak-admin gets confused and thinks the local node is down when it isn't.
sudo iptables -I OUTPUT -p tcp -d $(hostname -i) -j ACCEPT
sudo iptables -I INPUT -p tcp -s $(hostname -i) -j ACCEPT
# Now block all other LAN traffic.
sudo iptables -I OUTPUT 2 -p tcp -d 10.0.0.0/8 -j REJECT
sudo iptables -I INPUT 2 -p tcp -s 10.0.0.0/8 -j REJECT

To tear down the partition:

# Restarts the iptables service, thereby resetting any temporary rules that were applied to it.
sudo service iptables restart

Disclaimer: These are just rough shell scripts designed for use on a test bed development environment.

I then came up with a fairly neat little class that wraps up the concern of a network partition:

RingStatus.AssertOkay("riak03");
using (NetworkPartition.Create("riak03")) {
RingStatus.AssertDegraded("riak03");
// TODO: Do other stuff here whilst riak03 is network partitioned from the rest of the cluster.
}
RingReady.Wait("riak03");
RingStatus.AssertOkay("riak03");

Obviously as part of this work I had to write a quick and dirty wrapper around the PuTTYplink.exe tool. This tool is basically a CLI version of PuTTY and it is very good for scripting automated tasks.

My solution could be improved to support creating partitions that consist of more than just one node, but for me the added value of this would be very small at this stage. Maybe I will add support for this later when I get into stress testing territory!

Source

You can view it over on GitHub; the namespace is tentatively called “ClusterTools”. Bear in mind it’s currently held inside another project but if there’s enough interest I will make it standalone. There has been talk on the CorrugatedIron project (which is a Riak client library for .NET) about starting up a Contrib library, so maybe this could be one its first contributions.

2 Responses

Today, I went to the beach with my children. I found a sea shell and gave
it to my 4 year old daughter and said “You can hear the ocean if you put this to your ear.” She
put the shell to her ear and screamed. There was a hermit crab
inside and it pinched her ear. She never wants to go back!
LoL I know this is entirely off topic but I had to tell someone!