Programmer. Currently building containerization and orchestration system for evo.company

Jan 17, 2015

Evaluating Mesos

Mesos is becoming increasingly popular and aspires to be a common standard for orchestrating distributed systems. Many components of the mesos ecosystem lack documentation on how they behave under certain failure conditions, so I’ll run some distributed tests to shed some light on this.

Testing distributed systems is hard. It requires spawning and killing a lot of virtual machines. Vagrant makes setting up VMs easier, but running many of them is still slow and tedious. Since I’m developing a containerization tool called vagga, I took this opportunity to implement network virtualization and testing tools in vagga.

So this article is twofold: an adventure in designing and building network virtualization for vagga, and the actual testing of mesos and marathon.

Vagga Basics

I’m going to start with a brief introduction of what vagga looks like. If you are familiar with it you might skip this section.

Note: all examples in this article use vagga 0.2, which is not released yet. The master branch of vagga (which is 0.1.x) has a slightly different syntax, but is very close in spirit.

First, let’s build a zookeeper cluster. To start linux containers with vagga we need to create a `vagga.yaml` file with the following contents:

If you set up the zookeeper configuration files right (I’ll omit them here for brevity; examples of all configuration files are in the repository), the command starts three zookeepers. Because we have no network isolation yet, we have to start the zookeepers on different ports. When you want to shut down these processes, just hit Ctrl+C and vagga will shut down all of them.

A few words about the vagga config, because we will change it a lot in the following sections. It consists of two sections: containers, which defines base container images and their properties, and commands, which defines concise command lines to run inside a container and abstracts over running multiple commands simultaneously.

Vagga uses a fairly big set of YAML features, some of which many people are not aware of. Here is a quick reference of commonly underused features.

A word prefixed with an exclamation mark, such as “!Something”, is a tag. The tag is attached to the value that follows, which may be a string, a mapping, or a sequence.

The pipe sign “|” means that the following text is a literal string, which ends based on the indentation. We use it extensively for pieces of shell scripts embedded into our config.

In the containers section we define how to build the filesystem image, and a few runtime properties of the container. It has a range of mapped users and groups (“mapped” basically means “usable inside a container”). The container has an ubuntu distribution inside, with the universe repository enabled and the zookeeper package installed. Two more things here:

The /work directory inside the container is the directory where the vagga.yaml file resides; we also call it the project directory. We link /etc/zookeeper/conf to the zk-conf dir inside our project dir, so that we can edit configuration files without rebuilding the container.

We create a tmpfs (in-memory) volume at /var/lib/zookeeper. This volume is separate for each container instance (basically for each started process). This serves two purposes: (a) zookeepers running in parallel do not interfere with each other and (b) each start of a process is “clean”.

The container’s filesystem is built when we first run a command that needs that container.

In the commands section we define commands, which can be run as “vagga command_name”. We define a command three-zookeepers which runs three processes (each in its own namespace/container), based on the parameters set in the zookeeper container defined above. By default processes run in parallel, and are shut down when any process dies, or when a signal is sent to vagga itself.

Because on every run we get a new and empty /var/lib/zookeeper folder, we need to initialize it by echoing the appropriate zookeeper id to the myid file.

Adding Networking

Many networking tools for containers are based on the old model of how networks were managed on real machines. I’m not going to describe all the downsides of the common model, but we are going to redesign it so that running a test network requires absolutely minimal initial setup.

Let’s start with our own assumptions:

1. Vagga is for development environments only; in particular, this means containers run only on a single host.

2. Container communication: (a) containers should be able to establish outgoing connections to the internet; (b) incoming connections come only from the host machine; (c) communication between multiple containers is unrestricted.

3. Containers should be able to use fixed IP addresses, which can be placed into configuration files that are tracked by a revision control system.

4. Multiple instances of the same project might be running in parallel.

#3 is more complex than it might appear. First, it kinda contradicts #4. But let’s first note that #3 means that the same IP addresses may be used on the machine of every other developer (but not on production systems, of course), and everybody might have a different network configuration on the host system and multiple projects running simultaneously. Still, we should do our best to keep these assumptions, as this tremendously simplifies the setup of development environments (e.g. otherwise we would need to generate configuration files for zookeeper on the fly).

Network Topology

The first reasonable scheme I came up with:

Note that the bridge and the other processes run in separate network namespaces, so only the IP 172.18.255.1 is visible to the host system, and only 172.18.0.x IP addresses are visible to target processes. The IP address ranges are picked arbitrarily, as ones that are presumably not used very often. The network “172.18.255.0/24” and gateway “172.18.0.254” addresses are chosen so that you don’t need to remember to start numbering IPs from “2”.

The IP 172.18.255.1, which is the only one visible to the host system, may be changed in vagga’s settings without affecting containers. And containers may have hard-coded IP addresses because they do not interfere with the outer network.

The only case where this may be a problem is if some container needs to access a service on the LAN, and the LAN happens to have the same network address (i.e. 172.18.0.0/24). But I believe this case will be rare enough, so we accept it as a reasonable trade-off.
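The addressing scheme and the LAN conflict case can be checked with a few lines of Python. This is only a sketch using the standard `ipaddress` module: the addresses are the ones quoted above, while the helper name `conflicts_with_lan` and the example LAN ranges are hypothetical and not part of vagga.

```python
import ipaddress

# Addresses taken from the text above; the LAN ranges used below are hypothetical.
CONTAINER_NET = ipaddress.ip_network("172.18.0.0/24")
HOST_SIDE_NET = ipaddress.ip_network("172.18.255.0/24")
GATEWAY = ipaddress.ip_address("172.18.0.254")


def conflicts_with_lan(lan_cidr):
    """Return True if a LAN network overlaps the container network."""
    return ipaddress.ip_network(lan_cidr).overlaps(CONTAINER_NET)


if __name__ == "__main__":
    # The gateway sits at the top of the range, so containers can start at .1.
    print(GATEWAY in CONTAINER_NET)
    # The host-visible address lives in a separate network.
    print(ipaddress.ip_address("172.18.255.1") in HOST_SIDE_NET)
    # A typical home LAN does not clash; a LAN inside 172.18.0.0/16 does.
    print(conflicts_with_lan("192.168.1.0/24"))
    print(conflicts_with_lan("172.18.0.0/16"))
```

A check like this could run before a test suite starts, to warn a developer whose LAN happens to collide with the container network.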

A careful reader might notice that with this scheme we can’t run multiple commands simultaneously. So the real network isolation works as follows:

The two instances here have isolated networks, so they may use the same IP addresses.

Setting Up Network for Unprivileged User

The next part of vagga’s problem is that it’s impossible to set up networking as a regular user (the rules which we need on the host system), and we are trying to do our best to run vagga as an unprivileged user. So what we decided to do is to set up network namespaces with sudo, and then attach to them as a regular user. I.e. we added a command:

Here are some notes on veth interfaces if you want to know how to set up networking for linux containers. The “mount --bind” lines are there to keep user and network namespaces persistent, so we can attach to them later and don’t need to run sudo for each start of a container.

The names of all internal vagga commands start with an underscore to avoid conflicts with user-defined commands. E.g. the counterpart of “_create_netns” is “_destroy_netns”.

Configuring Network

Configuration should be as simple as possible. We just add an IP address to the command. As we don’t need to set up different ports for different zookeeper instances any more, we can even use a single configuration, which is named “zoo.cfg” by default. So we omit the configuration file name from the command:

This is enough to get the zookeepers running and forming a quorum, which you can see in the logs:

[LearnerHandler-/172.18.0.1:35177:Leader@598] - Have quorum of supporters; starting up and setting last processed zxid: 0x100000000

But now that we have network isolation, it’s unclear how to access the zookeepers. There is a “nsenter” utility in the linux ecosystem that may be helpful for debugging. It allows you to join any namespace and execute commands there. In particular, you may join only the network namespace of a process (“nsenter --target PID --net”) and have all the utilities of the host system available to access that network. We have also implemented port forwarding in vagga, but let’s go on.

How to Write Tests?

We expect that different users who test networked systems with vagga will use different tools. In particular, I might choose the programming language for my tests based on the following criteria:

Language should be one of the few that I use every day, so I can write tests quickly

This is enough to get the test running. Note that we mark the command as “!BridgeCommand”. This effectively makes the script run in the bridge network and user namespace, which lets us run the networking tools described in the next section. Just for completeness, I’ll show the first test script:

The “Connection refused” messages just mean that the zookeepers start more slowly than python.

The other “connection broken” messages are there because zookeeper drops connections during early initialization (probably before a leader is elected).

Each time the script is run zookeeper is empty (i.e. create_node doesn’t fail with a NodeExists error), as was described earlier.
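The “Connection refused” phase is easy to handle explicitly: retry until the service accepts connections. Below is a minimal sketch of such a wait loop using plain sockets. It is not the test script from this article; the function name `wait_for_port` and its parameters are hypothetical, and the `connect` argument exists only so the loop can be exercised without a real server.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.2,
                  connect=socket.create_connection):
    """Retry a TCP connection until it succeeds or `timeout` seconds pass.

    Returns the number of failed attempts that preceded the success.
    """
    deadline = time.monotonic() + timeout
    failures = 0
    while True:
        try:
            conn = connect((host, port), timeout=interval)
        except OSError:
            # ConnectionRefusedError and socket timeouts are OSError subclasses.
            failures += 1
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{host}:{port} not reachable after {timeout}s")
            time.sleep(interval)
        else:
            conn.close()
            return failures
```

For example, `wait_for_port("172.18.0.1", 2181)` would block until the first zookeeper accepts connections on its client port (2181 is the port mentioned later in this article). An open port does not prove that a leader has been elected, so the zookeeper client still needs its own retry logic.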

Network Partitioning

The most interesting tests for network failures are those where the network is partitioned into arbitrary connected groups. For easy integration with various testing tools and workflows, we made a command-line tool which sets up partitions.

We assume that the normal network setup is a full mesh (i.e. every node is connected to every other node). We also assume that every node is always visible from the bridge node (regardless of partitioning), so that the test script can query nodes in every partition.

The simplest kind of partitioning is done via “disjoint” command:

vagga_network disjoint -- [NODE,...] [--, NODE, ...] ...

The groups of nodes are delimited by a double dash. Nodes are referred to by their names in vagga.yaml. For example, the following command:

vagga_network disjoint -- zk1 zk2 -- zk3

Will make zk3 inaccessible from zk1 and zk2 and vice versa.

The “disjoint” command requires all nodes to be listed, and each node exactly once. You may use the less strict “split” command to achieve interesting things too:

vagga_network split -- zk1 zk2 -- zk2 zk3

This means that node zk2 is accessible from both zk1 and zk3, but zk1 is not reachable from zk3 and vice versa. This simple yet powerful mode can potentially expose interesting failure scenarios.
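The semantics of the two commands can be modelled in a few lines. The sketch below is a model of the behaviour described above, not vagga’s implementation: two nodes can talk if they appear together in at least one group. The bridge node, which always sees everyone, is left out. The function names are hypothetical.

```python
from itertools import combinations


def reachable_pairs(groups):
    """Return the set of unordered node pairs that can talk to each other.

    Two nodes are connected if they appear together in at least one group.
    """
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(set(group)), 2):
            pairs.add((a, b))
    return pairs


def check_disjoint(groups, all_nodes):
    """Validate the stricter "disjoint" rule: every node listed exactly once."""
    listed = [node for group in groups for node in group]
    if sorted(listed) != sorted(all_nodes):
        raise ValueError("disjoint needs every node listed exactly once")


if __name__ == "__main__":
    # disjoint -- zk1 zk2 -- zk3
    print(sorted(reachable_pairs([["zk1", "zk2"], ["zk3"]])))
    # → [('zk1', 'zk2')]
    # split -- zk1 zk2 -- zk2 zk3
    print(sorted(reachable_pairs([["zk1", "zk2"], ["zk2", "zk3"]])))
    # → [('zk1', 'zk2'), ('zk2', 'zk3')]
```

In the second case the pair (zk1, zk3) is missing from the result, which is exactly the “bridge node in the middle” situation that “split” allows and “disjoint” rejects.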

And to revert networking back to normal, execute:

vagga_network fullmesh
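From a test script these commands are convenient to wrap in small helpers. The following sketch only builds the command lines shown above and hands them to `subprocess`. The helper names are hypothetical and this is not the helper module used by the tests in this article.

```python
import subprocess


def partition_argv(mode, groups):
    """Build the argv for `vagga_network disjoint|split -- A B -- C`."""
    if mode not in ("disjoint", "split"):
        raise ValueError(f"unknown mode: {mode}")
    argv = ["vagga_network", mode]
    for group in groups:
        argv.append("--")  # every group is introduced by a double dash
        argv.extend(group)
    return argv


def fullmesh_argv():
    """Build the argv that restores the fully connected network."""
    return ["vagga_network", "fullmesh"]


def run_network_command(argv, run=subprocess.check_call):
    """Execute the command; `run` is injectable so tests can do a dry run."""
    return run(argv)
```

A test would then call `run_network_command(partition_argv("disjoint", [["zk1", "zk2"], ["zk3"]]))`, make its assertions, and finish with `run_network_command(fullmesh_argv())`.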

If you need some special rules you may also run iptables, for example:

vagga_network run zk1 iptables -A INPUT -p tcp --dport 2181 -j DROP

This blocks only port 2181 on the first zookeeper node. But see the notes in the next section.

Network Partitioning Idempotence and Atomicity

One of the most important aspects of our design is being able to repeat tests reliably. We try to improve reproducibility of tests by the following:

1. Every time you run the “vagga_network” command (except in “run” mode), the whole state of the firewall tables is rewritten, so that the rules depend only on the last command line used, and not on any previous invocation. I.e. calling “vagga_network” is idempotent.

2. We write firewall tables with “iptables-restore”, so applying rules on each node is atomic. I.e. when adding a new rule for some node, there is no time window in which the node is visible to everybody.

#2 requires further comment. Unfortunately we can’t apply rules atomically across the whole network. But running “vagga_network” takes only about 5–15 milliseconds; together with the atomicity of changing the table on each node, this should be no problem (network timeouts are usually bigger than a few tens of milliseconds). If that turns out to be an issue, we may apply additional techniques for ensuring atomicity (or just a shorter delay).

Also, #1 means that all rules added by “vagga_network run iptables …” are discarded by the next run of “vagga_network”.
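To illustrate why rewriting the whole table gives both properties, here is a sketch that renders a complete filter table for one node in the format “iptables-restore” reads. The rules that vagga actually generates are not shown in this article, so the drop-by-source-address rule and the function name `restore_payload` are assumptions made for illustration.

```python
def restore_payload(blocked_ips):
    """Render a complete filter table for one node as iptables-restore input.

    The output depends only on `blocked_ips`, never on rules already loaded,
    which is what makes re-applying it idempotent.
    """
    lines = [
        "*filter",
        ":INPUT ACCEPT [0:0]",
        ":FORWARD ACCEPT [0:0]",
        ":OUTPUT ACCEPT [0:0]",
    ]
    # Sort and de-duplicate so equal inputs always produce identical text.
    for ip in sorted(set(blocked_ips)):
        # Assumed rule shape: drop everything coming from a blocked peer.
        lines.append(f"-A INPUT -s {ip} -j DROP")
    lines.append("COMMIT")
    return "\n".join(lines) + "\n"
```

Piping this text into `iptables-restore` inside the node’s network namespace replaces the whole table when `COMMIT` is processed, because the table is flushed unless `--noflush` is given. An empty list of blocked addresses therefore yields the full-mesh state for that node.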

Example Test

Okay, let’s finish this section with a small zookeeper test, to demonstrate our partitioning tools (skipping boilerplate code for brevity; you can find it in the repository):

In this unscientific benchmark we find out that the combination of zookeeper doing master election plus kazoo doing reconnection takes about 15 seconds. Here are a few other things you can do with vagga_network:

Emulate multiple failures sequentially and simultaneously

Do other interesting network splits, e.g. “split -- zk1 zk2 -- zk2 zk3”

Add more zookeepers and emulate various failures and network partitions

Inspect how the python client handles failures, including session expiration and watches. Try to set up locks and see what happens when zookeeper fails while a lock is expected to be acquired.

The Test Plan

Measure how quickly marathon restarts processes when they die, and which components need to be functioning for this to work (marathon, mesos, zookeeper)

Find out how mesos + marathon behave in various scenarios of network partitioning

Warming Up

Starting mesos + marathon was pretty easy. Mesos likes to access nodes by name, so I needed to set up hostnames. When setting up port forwarding I noticed that mesos assumes that the browser can visit “hostname:port” directly, which is often not the case (including in our situation, because of the gateway), but it’s a small inconvenience, because we are going to drive everything from a python script in the same network.

It’s a good idea to set the work and log dirs of mesos to somewhere in “/work”, so they are accessible from outside.

What we’ve got at the start are mesos-master + mesos-slave + marathon + zookeeper, 3 instances of each. Mesos and marathon are configured to use zookeeper for bookkeeping, and marathon has “--ha” (which means high availability) enabled. The memory footprint at the start is 3*(20 + 20 + 200 + 50) ≈ 870 megabytes, which is even better than I expected ☺

I decided to run the “webfsd” daemon under marathon, because it’s easy to configure (just a command line) and has a low memory footprint (which is important because we run many containers/processes on a single machine). So we are going to run something like the following for all the tests:

Which is pretty ugly. But at least I can get rid of one of the shells by prefixing the command in marathon’s config with “exec”.

Ideally I would like to run commands without even having a shell in the system. Also, I don’t understand why mesos needs 3 additional processes (+10 threads in mesos-executor and 12 threads in mesos-slave itself) to run a single command when no isolation is enabled. These are merely quibbles, but when comparing with any other (single-machine) supervisor it’s unclear why mesos-slave must be more bloated.

Another observation came up when I omitted the “-F” flag for webfsd. The flag keeps webfsd in the foreground; without it the daemon forks into the background. Mesos just lets it daemonize and starts another one, so I get many webfsd processes on each node:

I think that it’s OK to expect a process to run in the foreground (although there are some old-fashioned services that still do not support this mode). But what is not OK is leaving them hanging around.

The actual issue is not just a wrongly used webfsd daemon. There are many cases where an application might spawn some processes and leave them around for some reason (e.g. a bug in the code), and they are not accounted for by mesos in any way. So mesos will continue both to schedule the buggy task on every machine and to schedule tasks on this machine regardless of free memory/cpus. So a buggy application may just fill up the whole cluster.

There is documentation which explains that it’s something related to checkpoints, but not why I need to delete files manually. For now, after an unsuccessful attempt to disable checkpoints, I just proceed with removing the files at each run.

The next step is to check the health of the running processes. Mesos master has “http://mesos-host:5050/state.json” which returns its state. Marathon has “http://marathon-host:8080/v2/apps” (amongst other endpoints). We assume that as soon as the JSON is OK, the services are running. But for slaves we also need to check whether they are registered. So I decided to check whether “activated_slaves” (in mesos-master’s state.json) is equal to the number of slaves we started (marathon does not expose that information AFAIK).

Here is the first catch: state.json is adequate only on the mesos-master which is elected as the leader. The state of the other masters is all zeros/empty lists. I.e. rather than responding with some kind of redirect, a non-leading master just sends empty data. So if for some reason you don’t check whether the host you are asking matches “leader” (for example if your real mesos is behind a proxy or load balancer which makes requests on your behalf), you will silently get wrong data. Honestly, this JSON is not an API but rather a chunk of data for Javascript, so the critique is quite weak. But what we learn from this situation is that mesos-master doesn’t synchronize state to stand-by nodes.
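A readiness check therefore has to look only at the master that considers itself the leader. Here is a minimal sketch of that logic, operating on already parsed state.json documents. The “activated_slaves” and “leader” fields are the ones mentioned above. The assumption that a master’s own identity string can be compared directly with its “leader” value, the function name, and the sample addresses are all hypothetical.

```python
def cluster_ready(states, expected_slaves):
    """Decide readiness from the leader's state only.

    `states` maps each master's identity string to its parsed state.json.
    Non-leading masters report zeros, so their counters are ignored.
    """
    for identity, state in states.items():
        if state.get("leader") == identity:
            return state.get("activated_slaves", 0) == expected_slaves
    return False  # nobody claims leadership yet


if __name__ == "__main__":
    # Hypothetical sample data: the second master is the elected leader.
    sample = {
        "master@172.18.0.11:5050": {
            "leader": "master@172.18.0.12:5050", "activated_slaves": 0},
        "master@172.18.0.12:5050": {
            "leader": "master@172.18.0.12:5050", "activated_slaves": 3},
    }
    print(cluster_ready(sample, expected_slaves=3))
```

Polling this function in a loop avoids the trap of reading zeros from a stand-by master and concluding that no slaves have registered.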

Mesos and marathon start in about 11.5 seconds (the ~3.5 seconds from the start of vagga to python printing the first message are not visible in the log), which is very good for this kind of complex system (i.e. we have 12 processes fighting for a two-core i7 CPU on a laptop)

It takes ~3.5 seconds for marathon to run a command

Let’s look at some more unscientific stuff:

Probably this kind of 3-point grouping occurs because we start 3 mesos slaves and offers from different slaves are processed in parallel, while offers from the same slave are never processed simultaneously. But I’m merely speculating. What I’m sure of is that when I change the number of slaves, the group size changes proportionally. Anyway, this property is probably not even noticeable on the large clusters that mesos usually runs on.

Failure Detection

Process death

The simplest scenario is when a process just dies for some reason. I tried killing one of the running processes, and it was restarted in a few seconds. It seems that the process is restarted on a random node (maybe the same one, maybe some other). That was expected.

There are some interesting properties to observe. When there is exactly one worker per slave, a dead child process is usually started on the same node where it died. I.e. there is one process per machine. But when there are 2 workers on 3 slaves, it’s a lot more likely that both processes will be on the same machine. But maybe I was just unlucky. I gave up on making any scientific test of this, because actually any allocation strategy is legitimate.

The interesting question is how mesos behaves on network failures.

Isolating Slave Node

Let’s start with simple partitioning. We isolate a slave node so that it’s not seen by the master:

We observe that it takes 1 min 20 seconds to notice that the slave is inaccessible. This delay is reproducible, so it’s likely tunable, but I found no obvious setting in the documentation. Note that processes on the isolated node continue to run, but are not restarted in case of failure.

And what do you think happens when the slave reconnects to the master again? Well, I could never have guessed:

Yes, really, it just shuts down. Which basically means that mesos-slave needs a supervisor to restart it. Remember, mesos-slave is actually a supervisor itself. Taking into account the “executor” (as shown above), which is essentially a supervisor too, you need at least 3 supervisors to run anything useful. When you want containerization (and usually you do, unless you run only java) you should run docker under mesos, which is a process supervisor too. And docker runs user processes as pid 1 by default, which is not a good idea unless that process is a supervisor too. Running 5 nested supervisors on each machine sounds like a very reliable design ☺

You run at least 3 nested supervisors to run anything useful on mesos, and up to 5 to run anything in a container.

The Essential Parts

Another good question is how the system behaves when one of the components is inaccessible. We have plenty of them: mesos-slave, mesos-master, zookeeper and marathon. We already know that mesos-slave does not touch processes when it’s alone. Let’s try the other ones. The basic test is to wait for the processes to be started and then run something like this (for zookeeper):

network.isolate('zk1', 'zk2', 'zk3')

Then try to kill any webfsd and see what happens. Let’s start with zookeeper.

Lost leadership… committing suicide!

Yes. That was an actual log message:

Lost leadership… committing suicide!

By this point you shouldn’t be too surprised. You need to run mesos-master under a supervisor too. But mesos-master is just useless on restart while zookeeper is still inaccessible. The other interesting observation is that one of the slaves kills its children, and the others don’t. I’m not sure why, but this doesn’t bother me enough to investigate. Sure, this also happens when I isolate just mesos-master, because it does not see zookeeper in that case either.

Okay, let’s isolate marathon. This time mesos keeps working seamlessly. But it does not restart processes run by marathon. This is probably expected by a mesos specialist, but I was not sure.

This is just a simple thing that traditional (single-node) supervisors do in a fraction of a second, regardless of any external resources available.

Isolating Marathon

This test is also interesting in the long run: if marathon is inaccessible, when does mesos-master decide to drop its processes? This is a very tough trade-off:

If marathon has failed or is somehow isolated from the network, we often want our workers to continue their job. So the timeout should be big enough for the system to be repaired (maybe even long enough to allow for manual intervention)

But if frameworks (marathon is a “framework” in mesos terms) come and go, mesos needs to clean up after them so there are no processes that run tasks nobody cares about.

There is another interesting catch: when marathon has no connection to zookeeper it can’t determine a leader. And when it can’t determine a leader it can’t respond to API requests. So it’s impossible to determine whether processes are actually running on behalf of marathon, so at least service discovery with marathon will not work. Which may or may not make your workers actually fail.

Well, the result was not something I expected. I waited 8 hours and mesos still had the framework registered and all processes still running. I’m not sure whether it’s a bug or a feature, or maybe I’m impatient, but at least …

After reconfiguring marathon (or any other mesos framework) you need to review your cluster state to be sure that the framework isn’t registered twice.

And I’m not sure whether a mere restart of marathon can cause duplicates (in some obscure scenarios), because I never set any identifiers for marathon. I believe that it keeps its identity in the specified zookeeper folder, so that folder serves as marathon’s identity. Which is good enough if it’s true.

Wrapping Up

As we see in this test, running network partitioning tests is a very important task. I’m not sure whether it is the lack of tools or the complexity of the problem that leads to so many issues with network partitioning in various software projects. But good tools are definitely a plus. Efficient tools not only help to develop and debug good software products, they also help to evaluate a product before putting it in production. Similarly, they help with post-mortem examinations.

I tried to build a very simple yet powerful tool for that in vagga. I think it’s simple enough for SysOps without strong programming skills to use. The vagga 0.2.x that I use here is in active development now, and is expected to reach the beta stage in a few months. It’s an open-source project on github and you are encouraged to run it, hack on it and send pull requests. It’s written in rust, so if you ever wanted to try rust on some project it may be a good start. Also, being written in rust, it has zero runtime dependencies (it installs a couple of static binaries).

My overall impression of mesos is that it is largely a research project which only big and rich companies can afford to use. I tested only a small part of the possible issues and they are already complex enough. And building some higher level “operating system” for the data center on top of it doesn’t look promising.

Talking about mesos, I should say that I used binaries from the mesosphere repositories for ubuntu (marathon 0.7.6 and mesos 0.21.1). It’s possible that development builds already have some of these issues fixed.

But regardless of bugs, mesos looks over-engineered. I mean, running 4 services just to start a process is too much (and in fact a real project needs a 5th one that tells marathon how many processes to run). Additionally, running 3 to 5 local supervisors is kinda ugly too. But I’m coming from web development, where most of the time we use many long-running processes per machine. For other uses (e.g. hadoop) mesos abstractions may be appropriate.