Chaos Engineering with Docker EE


Sameer Kumar | Senior Solution Architect, Ashnik

Why Chaos Engineering?

Even before we get into the definition of Chaos Engineering or why it has become important, let’s take a look at the traditional approach. Most applications and configurations would be put under stress testing to find the breaking point. This primarily assured the operations team that the provisioned capacity was enough for the anticipated workload. The tests were relatively (if not fairly) simple to do. But with time, a couple of things have changed:

Systems have become more and more complex

Workloads can change abruptly, and scaling up and down is now a necessity

Also, there is a philosophical shift happening in the way IT operations used to think –

Servers are disposable – Earlier, the basic deployment units (in most cases physical or virtual servers) were treated like “pets”, and configuration changes would turn each one into a snowflake. Now, with configuration management tools, servers are disposable like “cattle” and can be resurrected from scratch whenever there is a configuration change, aka Phoenix Servers.

Failures are accepted as business as usual; outages are not. I am not trying to force you to accept system failures, but most IT operations teams today acknowledge that things will go wrong. Simply put, one needs to be prepared for it.

Because of the explosion of the internet, services are no longer limited by geography. Workloads are not predictable anymore, and they are bound to go beyond the breaking point of a single server; it is just a matter of time and chance.

The complexity of applications has increased many-fold. Today, applications are not just three-tier deployments. A single rendered web page might be working with tens, or in some cases hundreds, of microservices in the backend. The only way to test the resiliency of such a system is by injecting random issues on purpose.

How do you go about it?

So what should your strategy be? I believe the easiest way is to introduce unit testing and integration testing for infrastructure and architecture components too, just like application code. So for any kind of High Availability or Disaster Recovery approach you have implemented, you should have a test case. For example, if you have a two-node cluster, your test case could be to shoot down one of the nodes. Yes, you read that right: I am suggesting that you take down a node. There is no other way to test high availability but to simulate failure. Similarly, you can test scalability by injecting slowness and network congestion.

There are many popular examples of and inspirations for chaos injection, the best known being Netflix’s Chaos Monkey and the wider Simian Army toolkit.

How does that translate in the container’s world?

Today, a lot of new applications and services are being deployed as containers. If you are starting out with Chaos Engineering on Docker, there are many different mechanisms and tools available at your disposal.

Before we get into tools, let’s look at some of the basic features of Docker which should be helpful to you.

1. Docker Service

It is often better to deploy your application as a Swarm Service instead of running standalone containers. In case you are using Kubernetes, it is likewise better to expose your workload as a Service. Both definitions are declarative and describe the desired state of the service. This is really helpful in maintaining the uptime of your application, as the orchestrator will always try to reconcile back to the declared state.

Example

In this example, I am going to use a Dockerfile to build a new image and then use it to deploy a new service. The example is executed against a Docker UCP cluster from a client node (with the docker CLI and a UCP Client Bundle).
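The build and deploy commands themselves did not survive formatting here; a sketch of the flow, reusing the image name and service options referenced throughout this example (the Dockerfile contents and published port are whatever your application needs):

```shell
# Build the image on the client-connected node and push it to DTR
# so that every node in the cluster can pull it.
docker build -t dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay .
docker push dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay

# Deploy it as a replicated Swarm service with two tasks.
docker service create -d --name=twet-app \
  --mode=replicated --replicas=2 --publish 8080:80 \
  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay
```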

This request asks the Swarm cluster to set up the service with --mode=replicated and --replicas=2, i.e. Swarm will try to maintain two tasks for this service at any point in time, unless requested otherwise by the user. You can inspect the tasks running for the service with the docker service ps command:

ID            NAME        IMAGE                                                      NODE            DESIRED STATE  CURRENT STATE          ERROR  PORTS
zzq1jgolcc2o  twet-app.1  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay  ip-10-100-2-67  Running        Running 3 minutes ago
zlkf4ejuxus8  twet-app.2  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay  ip-10-100-2-93  Running        Running 3 minutes ago

As you can see, there are two tasks running. These tasks are set up behind a VIP, which load-balances between the two containers/tasks.
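If you want to see that VIP for yourself, docker service inspect exposes it (the service name comes from the example above):

```shell
# Show the virtual IP(s) allocated to the service, one per attached network.
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' twet-app

# List where the individual tasks actually run; traffic sent to the VIP
# is load-balanced across these tasks by the routing mesh.
docker service ps --format '{{.Name}} -> {{.Node}}' twet-app
```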

Let’s try to kill one of the underlying containers and see if Swarm is able to maintain the declarative state we had requested:

CONTAINER ID  IMAGE                                                      COMMAND                 CREATED        STATUS        PORTS            NAMES
603c7f8940fe  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   "nginx -g 'daemon ..."  7 minutes ago  Up 7 minutes  80/tcp, 443/tcp  ip-10-100-2-67/twet-app.1.zzq1jgolcc2oyucexn4j9u9pq
54aa164ea509  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay   "nginx -g 'daemon ..."  7 minutes ago  Up 7 minutes  80/tcp, 443/tcp  ip-10-100-2-93/twet-app.2.zlkf4ejuxus851onp4i2t143p

sh-4.2$ docker container kill 603c7f8940fe
603c7f8940fe
sh-4.2$ docker service ps twet-app

ID            NAME           IMAGE                                                      NODE            DESIRED STATE  CURRENT STATE          ERROR                        PORTS
sp4hz64oytu0  twet-app.1     dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay  ip-10-100-2-67  Running        Running 2 seconds ago
zzq1jgolcc2o   \_ twet-app.1 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay  ip-10-100-2-67  Shutdown       Failed 7 seconds ago   "task: non-zero exit (137)"
zlkf4ejuxus8  twet-app.2     dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay  ip-10-100-2-93  Running        Running 8 minutes ago

As you can see, the container 603c7f8940fe was used by one of the tasks of our service twet-app, and once we killed the container, Swarm maintained the desired state by starting another task.

Note: Pushing the image to a repository is needed when you are running a distributed setup. The build above was done on one of the nodes of the Swarm cluster, ip-10-100-2-106, so the image would be available on that node only. Hence, if we ran the service without pushing the image to a repository, there is a good chance that the tasks would be started on the same node (ip-10-100-2-106), i.e. the only node that has access to the image, or that different nodes would get different images (left over from earlier builds). Swarm does a good job of reminding us about this. Here is the warning I got when I tried to run the service without pushing the image:

image dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay could not be accessed on a registry to record
its digest. Each node will access dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay independently,
possibly leading to different nodes running different versions of the image.

t46gb1wi3tc7xs2j08egzcut1

2. Health Checks

Docker allows you to use healthchecks to keep a tab on the health of running containers. A healthcheck can either be baked into your image during the build process using the HEALTHCHECK instruction in the Dockerfile, or supplied at runtime using the --health-* options of docker service create or docker container run.

The HEALTHCHECK instruction tells Docker how to test a container to check that it is still working. This can detect cases such as a web server that is stuck in an infinite loop and unable to handle new connections, even though the server process is still running.

Note: The HEALTHCHECK feature was added in Docker 1.12.

Build time example of HEALTHCHECK

To make use of this feature, we will now add a new instruction to our Dockerfile:

HEALTHCHECK --interval=30s --timeout=3s --retries=2 \
  CMD python /usr/share/nginx/html/healthcheck.py || exit 1

This means that the healthcheck command python /usr/share/nginx/html/healthcheck.py will be run for the first time 30 seconds after the task starts up, and at an interval of 30 seconds thereafter. Each check times out after 3 seconds, and after 2 consecutive failures the container is declared unhealthy.

We will have to add a few new files to support HEALTHCHECK

healthcheck.py – our own little piece of code to check the health of the container.
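The file itself is not reproduced here; a sketch of what healthcheck.py might look like, reconstructed from the health-log output below (the URL and the expected "healthy" marker are assumptions):

```python
import urllib.request

URL = "http://localhost/healthcheck.html"  # hypothetical page served by nginx
EXPECTED = "healthy"                       # marker the page body must match

def content_is_healthy(body: str, expected: str = EXPECTED) -> bool:
    """True if the fetched page body matches the expected marker."""
    return body.strip() == expected

def check(url: str = URL) -> int:
    """Fetch the page; return 0 (healthy) or 1 (unhealthy) for Docker."""
    try:
        body = urllib.request.urlopen(url, timeout=2).read().decode()
    except Exception as exc:
        print(f"healthcheck request failed: {exc}")
        return 1
    if content_is_healthy(body):
        return 0
    print(f'The content of the healthcheck did not match. '
          f'Expected Content-"{EXPECTED}", we got: {body}')
    return 1

# Docker invokes this as: CMD python healthcheck.py || exit 1
# so a non-zero return from check() marks the container as unhealthy.
```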

The health log, recorded under .State.Health and viewable with docker inspect, shows the failing streak building up once the page content stops matching:

{"Status":"healthy","FailingStreak":1,"Log":[{"Start":"2018-05-12T17:28:33.714393794Z","End":"2018-05-12T17:28:33.793534206Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:28:53.793900452Z","End":"2018-05-12T17:28:53.871217425Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"}]}

Once the failing streak reaches the retry limit, the status flips to unhealthy:

{"Status":"unhealthy","FailingStreak":2,"Log":[{"Start":"2018-05-12T17:28:33.714393794Z","End":"2018-05-12T17:28:33.793534206Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:28:53.793900452Z","End":"2018-05-12T17:28:53.871217425Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"},{"Start":"2018-05-12T17:29:13.871399894Z","End":"2018-05-12T17:29:13.948097443Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"}]}

Note: The output will contain a friendly message if one is printed by your healthcheck command.

3. Tooling and Automation

Now that we have covered the basic building blocks of chaos engineering with Docker, let’s take a look at some tools. Pumba is a fairly new but quite promising tool for chaos orchestration. The best thing is that it works well with a Swarm cluster; you just need to point it at a manager node. We can easily get it to work with the Docker UCP Client Bundle.

Example

First, we need to set up an isolated network where we will deploy and test our application:

docker network create -d overlay tweet-app-net

Now let’s set up a service using the healthcheck from the previous examples:

docker service create -d --name=twet-app --network tweet-app-net \
  --mode=replicated --replicas=2 --publish 8080:80 \
  --health-cmd "python /usr/share/nginx/html/healthcheck.py || exit 1" \
  --health-interval 20s \
  --health-retries 2 \
  --health-timeout 200ms \
  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck

Let’s ensure that the service has started properly with the requested number of healthy replicas:

sh-4.2$ docker container ls | grep -i twet
75b2bf6f219d  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck  "nginx -g 'daemon ..."  27 seconds ago      Up 21 seconds (healthy)  80/tcp, 443/tcp  ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia
393355d083fb  dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck  "nginx -g 'daemon ..."  About a minute ago  Up 59 seconds (healthy)  80/tcp, 443/tcp  ip-10-100-2-67/twet-app.2.6uueh28nxj7btpfzffeq40f6b

Now let’s use Pumba to randomly kill some containers under the service
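The exact invocation did not survive here; a sketch of the kind of command Pumba accepts for this, run from a client with the UCP bundle loaded (the interval, signal, and name pattern are illustrative):

```shell
# Every 30s, pick one random container whose name matches ^twet-app
# and kill it with SIGKILL; Swarm should immediately replace it.
pumba --random --interval 30s kill --signal SIGKILL "re2:^twet-app"
```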

As you can see, Pumba was able to introduce network delay, and the HEALTHCHECK in the image (or --health-cmd at the service level) helped us restart the containers that were slowing down. At this time, this is the most that Pumba and Swarm can do. I am hoping that in time, Swarm service healthchecks will also allow us to define auto-scale policies.

Now, if you are running against a UCP setup, or any “true” Swarm cluster with separate worker and manager nodes, the pumba netem command will not work when you fire it from a client. This is unlike the kill command (and most other Pumba commands), which do work against a Swarm cluster. I came up with a simple way to work around it.

Pumba in a container

Well, you can run Pumba in a container, as the example on its GitHub page shows:

# once in a 10 seconds, try to kill (with `SIGTERM` signal) all containers named **hp(something)**
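The command that goes with that comment on the project page looks roughly like this (image name and flags as published by the Pumba project; check its README for the current syntax):

```shell
docker run -d \
  -v /var/run/docker.sock:/var/run/docker.sock \
  gaiaadm/pumba --random --interval 10s kill --signal SIGTERM "re2:^hp"
```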

This means that we can create a service that runs on each node of the Swarm cluster and executes the pumba netem command. We need to change the entrypoint of the service and mount /var/run/docker.sock of the local node into the container so that Pumba has access to the Docker daemon on each node.

The pumba command should essentially target only the containers that belong to your service, so you need to pass a name pattern (or list of containers) to the pumba entrypoint command.
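Putting the pieces together, a sketch of such a global service (the service name, duration, delay, and name pattern are illustrative; pumba netem also needs a tc-capable helper image, supplied here via --tc-image):

```shell
# One Pumba task per node, each bound to its local Docker daemon,
# injecting 1000ms of delay into containers matching ^twet-app.
# The image's entrypoint is pumba, so the command starts at `netem`.
docker service create --name pumba-netem --mode global \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
  gaiaadm/pumba \
  netem --duration 1m --tc-image gaiadocker/iproute2 \
  delay --time 1000 "re2:^twet-app"
```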

As you can see, the node became unavailable once the reboot was executed.

In order to maintain the desired state of the service with 2 replicas, the Swarm manager starts a new container on one of the surviving nodes:

ID            NAME           IMAGE                                                                  NODE                       DESIRED STATE  CURRENT STATE           ERROR
7e2icqhk0n54  twet-app.1     dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck  ip-10-100-2-93             Running        Running 12 seconds ago
s8iib1ue7nrd   \_ twet-app.1 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck  m4j4g27conj199uciw98k5h1b  Shutdown       Running 50 seconds ago
oiafp1o6klxx  twet-app.2     dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck  ip-10-100-2-93             Running

Sameer Kumar – Senior Solution Architect

Sameer Kumar is a Database Solution Architect working with Ashnik. He has worked on many complex setups and migration assignments for key customers from the Retail, BFSI and Telecom sectors. Sameer is a certified PostgreSQL and EDB Postgres Plus Advanced Server Professional. He is also a certified Postgres trainer and has delivered many trainings for public and corporate batches. He is well versed in other RDBMSs, e.g. DB2, Oracle, and SQL Server, and is also trained in NoSQL technologies, viz. MongoDB. He has worked closely with customers and helped them build analytics platforms on NoSQL databases and migrate from RDBMS to MongoDB. And while he’s in free mode, he loves to take his cycle around Singapore for a spin.