Docker | Kubernetes | Cloud

Category: Docker Swarm

Docker Engine 1.13.1 went GA last week and introduced one of the most awaited features: Secrets Management. With a mission to introduce a container-native solution that strengthens the Trusted Delivery component of container security, the new Secrets API is integrated directly into the Docker 1.13.1 orchestration engine. The new secrets-management capabilities are also included in Docker Datacenter as part of the Docker 1.13.1 release.

What are secrets all about?

A secret is a blob of data, such as a password, SSH private key, certificate, API key or encryption key. In broader terms, it can be anything to which access should be tightly controlled. The secrets-management capability is the latest security enhancement integrated into the Docker platform to make applications safer in a containerized environment. This should benefit financial-sector players who are looking for a hybrid cloud security strategy.

Why do we need Docker secrets?

There have been numerous concerns about environment variables being used to pass configuration and settings to containers. Environment variables are easily leaked when debugging and are exposed in many places, including child processes and any server that hosts the secrets.

Environment variables are insecure by nature because they are accessible to any process in the container, preserved in intermediate layers of an image, easily visible through docker inspect and, lastly, shared with any container linked to the container. To overcome this, one can use secrets to manage any sensitive data which a container needs at runtime, with no need to store it in the image. A given secret is only accessible to those services which have been granted explicit access to it, and only while those service tasks are running.

How does it actually work?

Docker secrets are currently supported in Swarm mode only, starting with Docker Engine 1.13.1. If you are using Docker 1.12.x, you need to upgrade to the latest 1.13.x release to use this feature. To understand how secrets work under Docker Swarm mode, you can follow the process flow below:

Docker Compose v3.1 File Format now supports Secrets

Docker Compose file format v3.1 is available and requires Docker Engine 1.13.1+. It introduces support for secrets for the first time, which means that you can now use secrets inside your docker-compose file.

Let us test-drive Compose v3.1 file format to see how secrets can be implemented using the newer docker stack deploy utility as shown below:

Ensure that you have the latest Docker 1.13.1 running on your Swarm Mode cluster:

I will leverage a 4-node Swarm Mode cluster to test the secret API:

Let us first create a secret using the docker secret create utility as shown:

As shown above, one can use docker exec to connect to the container and read the contents of the secret data file, which defaults to being readable by all and has the same name as the name of the secret.
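The walk-through above can be sketched roughly as follows. This is a minimal sketch, not the exact commands from the original screenshots: the secret name my_secret, its value, the redis:alpine image and the stack name collab are all assumptions made for illustration. The commands are wrapped in shell functions so that nothing runs until you call them on a manager node.

```shell
# Sketch only: secret name, secret value, image and stack name are hypothetical.

# Write a Compose v3.1 file that grants the redis service access to a
# secret that was created outside the file (external: true).
write_compose_file() {
  cat > docker-compose.yml <<'EOF'
version: "3.1"
services:
  redis:
    image: redis:alpine
    secrets:
      - my_secret
secrets:
  my_secret:
    external: true
EOF
}

secrets_demo() {
  # Create the secret; the trailing "-" makes docker read the value from stdin.
  printf 'S3cr3tValue' | docker secret create my_secret -

  # Deploy the stack straight from the Compose file.
  write_compose_file
  docker stack deploy --compose-file docker-compose.yml collab

  # List the secrets known to the swarm and the services in the stack.
  docker secret ls
  docker stack services collab
}
```

After calling secrets_demo, the secret should be mounted inside each task of the service at /run/secrets/my_secret, so running docker exec against one of the service's containers with cat /run/secrets/my_secret is one way to confirm that it arrived.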

Key Takeaways:

Docker secrets are only available to swarm services, not to standalone containers. To use this feature, consider adapting your container to run as a service with a scale of 1.

No Compose binaries are required to run docker stack deploy. All you require is Compose v3.1 file format for this to work.

Raft data is encrypted in Docker 1.13 and higher.

It is recommended to update all of your manager nodes to Docker 1.13 to prevent secrets from being written to plain-text Raft logs.

Docker Engine 1.13.0 Final Release has been officially announced. With over 1050 commits, 1025 file changes and 175 days since Engine 1.12, the Docker team has put in a major effort to extend Swarm Mode functionality and deliver bug fixes and improvements. Docker 1.13.0 brings dozens of new features. Major highlights of this release include Centralized Logging, a new Docker management CLI, an impressive new Secret API, deploying a stack directly from Docker Compose, Auto Locking, plugin management and many more. With this release, Docker Inc. added support for building Docker DEBs for Ubuntu 16.04 Xenial and 16.10 Yakkety Yak on the PPC64LE and s390x platforms. There is now an RPM builder for VMware Photon OS and Fedora 25, and a DEB builder for Ubuntu 16.10 Yakkety Yak.

In this blog, let us explore a few of the important new features of Docker 1.13 Swarm Mode:

Upgrading Docker Engine from 1.12 to 1.13

curl -sSL https://get.docker.com/ | sh

Docker 1.13 includes new system management CLIs (which we will discuss later in this blog):

Experimental & Stable Release ~ all in a single binary

Experimental features are now included in the standard Docker binaries as of version 1.13.0. This is a great improvement over the 1.12 release. Under Docker 1.12, there were separate builds for the experimental and stable releases, and one had to pull them via curl from the test.docker.com and get.docker.com repositories respectively. To enable experimental features, start the Docker daemon with the --experimental flag. You can also enable it via /etc/docker/daemon.json, e.g.

{
"experimental": true
}

Then make sure the experimental flag is enabled:

$ docker version -f '{{.Server.Experimental}}'

true

With this new experimental feature enabled, one can see additional Docker management commands as shown below:

Docker 1.13 provides you an ability to set DOCKER_HIDE_LEGACY_COMMANDS environment variable to show only the management commands. You can enable it using the below command:

DOCKER_HIDE_LEGACY_COMMANDS=true docker --help

New CLI for System Resource Management:

Docker 1.13 addresses one of the biggest issues faced by Docker users and the community: how to reclaim disk space used by Docker. Issues like <none>:<none> images, dangling images, and disks filling up even when containers consume little space are a few of the pain points that have been reported a number of times. Docker 1.13 introduces new system resource management commands to help users understand how much disk space Docker is using and to help remove unused data.

The Docker team has introduced a new system management CLI, docker system, to help users get information such as disk usage, system-wide information and real-time events from the server.

The table below lists the various system-level commands and their usage:
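As a rough sketch, the docker system subcommands that shipped with Docker 1.13 can be wrapped as follows. The function names are purely illustrative; only the docker system commands themselves come from the CLI.

```shell
# Sketch of the docker system subcommands; function names are hypothetical.

# Show how much disk space images, containers and local volumes consume.
disk_usage() {
  docker system df
}

# Remove unused data; docker asks for confirmation before deleting anything.
reclaim_space() {
  docker system prune
}

# Print system-wide information (the same data as docker info).
system_info() {
  docker system info
}

# Stream real-time events from the daemon until interrupted with Ctrl-C.
watch_events() {
  docker system events
}
```

A reasonable habit is to run disk_usage before and after reclaim_space to see how much space was actually recovered.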

A New Centralized Logging Under Swarm Mode

Under Docker 1.12, one feature which we really missed was “Centralized Logging”. There was no docker service logs command, so Docker users had to depend on third-party tools like rsyslog or an ELK-based solution. With Docker 1.13, a new command, docker service logs, has been introduced. It is a powerful new experimental command that simplifies debugging services. There is no longer any need to track down the hosts and containers powering a particular service and pull logs from those containers: docker service logs pulls logs from all containers running a service and streams them to your console.

Want to try out this feature? As shown in the screenshot, one can easily retrieve logs from scaled-out services from various worker and master nodes using this simple API:

Let us scale out the redis service to 5 replicas and check whether the logs are collected from all of those worker nodes and streamed to the master node:
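A minimal sketch of that check, assuming a service named redis already exists and the daemon on the manager was started with --experimental (docker service logs is experimental in 1.13). The function name is hypothetical.

```shell
# Sketch only: scale a service and then read its aggregated logs.
# $1: service name, $2: desired number of replicas
scale_and_show_logs() {
  # Ask the swarm to run the requested number of tasks for the service.
  docker service scale "$1=$2"

  # Pull the logs of every task of the service, whichever node it runs on.
  docker service logs "$1"
}
```

For the example in this post the call would be scale_and_show_logs redis 5, run on a manager node.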

Auto Locking Feature under Swarm Mode:

Docker 1.13 brings an interesting security feature called Auto Locking for Swarm Mode. To understand this concept, let us go back to the Raft consensus that forms the backbone of the Swarm Mode implementation. By default, all of the Raft consensus data is stored encrypted at rest on all the managers. A key is generated and stored on disk so that managers can restart without operator intervention. However, if a disk gets corrupted, or one accidentally backs up both the data and the key, all of the cluster data can be leaked (which, starting with 1.13, might include secrets).

Because of this, the newer release allows you to take ownership of this key and enable autolock, which effectively means that the key used to encrypt your data is never persisted to disk. Taking ownership of the key also means that a manager can no longer restart without human intervention, since providing the key is now the responsibility of the operator (or an external application). At any point in time you may rotate the key that is used to unlock the cluster, or give the responsibility of managing the key back to the managers so that they can go back to restarting without external intervention. You can change which mode the cluster is operating in using docker swarm update --autolock=true/false. You can inspect the current key, or rotate it, by using docker swarm unlock-key.

There are two ways to enable the Auto Locking feature under 1.13 Swarm Mode. Either use:

docker swarm init --autolock

or

docker swarm update --autolock (to turn on manager locking).

When you run either of the above commands, it prints out a key. After a restart, docker swarm unlock is necessary to start the manager; it takes the key on stdin. You can retrieve the current key with docker swarm unlock-key, or rotate it with docker swarm unlock-key --rotate. The next question could be: what happens while the swarm is locked? The swarm components do not operate until you run docker swarm unlock.
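The autolock life cycle described above can be sketched as a handful of small helpers. The function names are hypothetical; the underlying docker swarm commands are the ones named in the text, and they are meant to be run on a manager node.

```shell
# Sketch only: helper names are illustrative.

# Turn on manager locking for an existing swarm; docker prints the unlock key.
enable_autolock() {
  docker swarm update --autolock=true
}

# Show the current unlock key so it can be stored somewhere safe.
show_unlock_key() {
  docker swarm unlock-key
}

# Generate a new unlock key and invalidate the old one.
rotate_unlock_key() {
  docker swarm unlock-key --rotate
}

# After a daemon restart, unlock the manager; docker prompts for the key.
unlock_manager() {
  docker swarm unlock
}

# Hand key management back to the managers.
disable_autolock() {
  docker swarm update --autolock=false
}
```

Keeping the key outside the cluster (for example in a password manager) is the whole point of the feature: without it, a restarted manager stays locked.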

Let us look at how it actually works:-

Step-1: Let us assume that we have a 6-node cluster and Master1 is our Leader node.

Step-2: Let us make “Node1” the Leader node as shown below:

Step-3: Run the docker swarm update command to enable autolock. This command displays the key needed to unlock a swarm manager.

Step-4: One can provide the key to unlock the manager node.

Step-5: Let us test if it really works by restarting the manager node.

Step-6: After the system comes back up, you can join it back as a manager only if you have the unlock key.

Step-7: As shown below, Manager1 joins back as a manager node:

Deploy Stack directly from docker-compose

One of the compelling features of Docker 1.12 was the introduction of the “Distributed Application Bundle”, or DAB. DAB helped developers build and test containerized apps and then wrap all the app artifacts into stable .dab files. Operations teams, in turn, could take those .dab files and deploy apps by creating stacks of services from them.

Under Docker 1.13, this two-stage process has been simplified into one single command to deploy microservices under Swarm Mode. You don’t need to run docker-compose bundle to build a .dab file; instead, Docker 1.13 adds Compose-file support to the `docker stack deploy` command so that services can be deployed using a `docker-compose.yml` file directly.

Below picture depicts one single-liner command to achieve this:

To test this out, let us write a docker-compose file as shown below:

Create a directory called “collab” and place the below docker-compose.yml file under the same directory:
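The original Compose file is not reproduced here, so the following is only a stand-in: a minimal sketch assuming a single nginx-based service named web with three replicas. The service name, image and port mapping are hypothetical; the collab directory and the docker stack deploy command come from the text.

```shell
# Sketch only: the service definition below is a hypothetical example.

# Create the collab directory and a version 3 Compose file inside it.
write_stack_file() {
  mkdir -p collab
  cat > collab/docker-compose.yml <<'EOF'
version: "3"
services:
  web:
    image: nginx:alpine
    ports:
      - "80:80"
    deploy:
      replicas: 3
EOF
}

deploy_stack() {
  write_stack_file

  # Create the stack's services directly from the Compose file.
  docker stack deploy --compose-file collab/docker-compose.yml collab

  # Show on which nodes the tasks of the stack were scheduled.
  docker stack ps collab
}
```

The deploy key is only honoured by docker stack deploy in Swarm Mode, which is why the replica count lives there rather than in a separate scale command.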

2016 has been a great year for Docker Inc. With the announcement of the Docker 1.12 release at the last DockerCon, a new generation of Docker clustering and distributed systems was born. With an optional “Swarm Mode” feature integrated into the core Docker Engine, native management of a cluster of Docker Engines, orchestration, decentralized design, service and application deployment, scaling, rolling updates, desired state reconciliation, multi-host networking, service discovery and routing mesh all work flawlessly. With the recent Engine 1.12.5 release, all of these features have matured enough to be production ready.

In this blog post, I will spend another 20 minutes going quickly through a complete A-Z tutorial on Swarm Mode, covering important features like Orchestration, Scaling, Routing Mesh, Overlay Networking, Rolling Updates etc.

Docker Compose has gained lots of attention in the recent past due to its easy one-liner installation (on Linux, Windows and Mac OS X), easy-to-use JSON and YAML format support, the sample docker-compose files available on GitHub, and a one-liner command to create and start all the services from your configuration. If you are looking at a microservices implementation, Docker Compose is a great tool to get started with. With Compose, you can define and run complex applications with Docker. You define a multi-container application in a single file, then spin up your application with a single command which takes care of linking services together through Service Discovery.

Docker Compose 1.9 is currently in the RC4 phase and nearing its Final Release. Several new features and improvements in terms of Networking, Logging and the Compose CLI have been introduced. With this release, Compose file format version 2.1 has been introduced for the first time. This release supports setting volume labels and network labels in the YAML specification. There is also good news for Microsoft Windows enthusiasts: interactive mode for docker-compose run and docker-compose exec is now supported on Windows platforms, which will surely help them play around with services.

The below picture shows what major features have been introduced in Docker Compose releases since last year:

In case you are very new to Docker Compose, I suggest you read the official documentation. If you are an experienced Compose user and curious to know how Docker Compose fits into Swarm Mode, don’t miss my recent blog post. In this blog post, we will look at the new features introduced in the Docker Compose 1.9 release.

Installation of Docker Compose v1.9

On Windows Server 2016 system, you can run the below command to get started with Docker Compose 1.9-rc4 release.

If you are on a Linux host, the installation goes flawlessly as shown below:

Introduction of Version 2.1 YAML specification format for the first time

Docker Compose 1.9 introduces a newer version of the Compose YAML specification format, “version: 2.1”, for the first time. To test-drive it, I created a docker-compose file for my WordPress application and it worked well.

The docker-compose up -d command ran without issues as shown below:

We can have a look at the list of services running using Docker compose as shown below:

Interactive Mode for docker-compose exec & docker-compose run

Though this feature has been available to Linux users for quite some time, it is newly introduced and supported on the Windows platform too. In case you are new to the docker-compose run command, here is a simplified way to demonstrate it.

On Linux Host:

Note: In case you are new to docker-compose config command, it is a CLI tool which validates your Docker compose file.

Cool. One can use docker-compose run command to target one service out of several services mentioned under docker-compose.yml file and interact with that particular service without any issue.

On Windows Host:

To quickly test this feature, I spun up Windows Server 2016 on Azure, installed Docker and Docker Compose, and forked the https://github.com/ajeetraina/Virtualization-Documentation repository, which has a collection of Windows Docker images. Though pulling the images was quite slow in the beginning, once they were pulled, bringing up services using Docker Compose was pretty quick.

NOTE: When running docker-compose, you will need to either reference the host port explicitly by adding the option -H tcp://localhost:2375 to the command (e.g. docker-compose -H "tcp://localhost:2375") or set your DOCKER_HOST environment variable to always use this port (e.g. $env:DOCKER_HOST="tcp://localhost:2375").

The services were finally up and running, and one can easily check them through the docker-compose ps command as shown below:

Let us test docker-compose run feature now. I tried targeting the db service and running cmd command to see if it works well.

Support for setting volume labels and network labels in docker-compose.yml

This is an important addition to the Docker Compose release. There have been several requests from Docker community users for this feature, and the Docker team has done a great job in introducing it in this release.

If you look at the last few lines, the volume labels have been specified in the following format:

volumes:
  volume_with_labels:
    labels:
      - "alpha=beta"

To verify that the volume was created with the volume labels, one can issue the below command:
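One way to run that check is sketched below. It assumes the Compose project is named collab, in which case Compose would name the volume collab_volume_with_labels; the project name and the helper name are assumptions, not taken from the original screenshot.

```shell
# Sketch only: print the labels attached to a named volume as JSON.
# $1: full volume name, including the Compose project prefix
show_volume_labels() {
  docker volume inspect --format '{{ json .Labels }}' "$1"
}
```

Calling show_volume_labels collab_volume_with_labels should list the alpha=beta label if the volume was created from the file above; docker volume ls shows the exact volume name if the prefix differs.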

In the upcoming posts, I will be covering more features and bug fixes introduced under Docker Compose 1.9. Keep watching this space for further updates.

Containers are stateless by nature and likely to be short-lived. They are far more ephemeral than VMs. What does that actually mean? Say you have data or logs generated inside the container, such as HTTP requests, and you don’t really care about losing them no matter how many times you spin the container up and down; then the stateless model is good enough. But if you are looking for a solution for “stateful” applications, such as storing databases or logs, you definitely need persistent storage. This is achieved by leveraging Docker’s volume mounts to create either a data volume or a data volume container that can be used and shared by other containers.

In case you’re new to Docker Storage, Docker Volumes are logical building blocks for shared storage when combined with plugins. They help us store state from the application in locations outside the Docker image layers. Whenever you run a Docker command with -v and provide it with a name or path, the volume is managed within /var/lib/docker or, in case you’re using a host mount, it is something that exists on the host. The problem with this implementation is that the data is pretty inflexible: anything you write to that specific location will stay there after the container’s life cycle, but only on that host. If you lose that host, the data is lost too. This clearly means that the setup is very prone to data loss. Within Docker, this looks very similar to the /var/lib/docker directory structure shown in the above picture. Let’s talk about how to manage data with external storage. This could be anything from NFS to distributed file systems to block storage.

In my previous blog, we discussed a Persistent Storage implementation with DellEMC RexRay for Docker 1.12 Swarm Mode. In this blog, we will look at how NFS works with Swarm Mode. I assume that you have an existing NFS server running in your infrastructure. If not, you can quickly set it up in just a few minutes. I have Docker 1.12 Swarm Mode initialized with 1 master node and 4 worker nodes. As an example, I am going to use an Ubuntu 16.04 node (outside the swarm mode cluster) as the NFS server and the rest of the nodes (1 master and 4 workers) as NFS client systems.

Setting up NFS environment:

There are two ways to set up an NFS server: either using an available Docker image or manually setting up the NFS server on the Docker host machine. As I already have an NFS server working on one of the Docker hosts running Ubuntu 16.04, I will just verify that the configuration looks good.

Let us ensure that NFS server packages are properly installed as shown below:

raina_ajeet@master1:~$ sudo dpkg -l | grep nfs-kernel-server

1:1.2.8-9ubuntu12 amd64 support for NFS kernel server

raina_ajeet@master1:~$

I created the following NFS directory which I want to share across the containers running the swarm cluster.

$ sudo mkdir /var/nfs
$ sudo chown nobody:nogroup /var/nfs

It’s time to configure the NFS shares. For this, let’s edit the exports file to look as shown below:

raina_ajeet@master1:~$ cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
#               to NFS clients.  See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes       hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes  gss/krb5i(rw,sync,no_subtree_check)
#
/var/nfs *(rw,sync,no_subtree_check)
raina_ajeet@master1:~$

As shown above, we will be sharing /var/nfs directory among all the worker nodes in the Swarm cluster.

Let’s not forget to run the below commands to provide the proper permission:

$ sudo chown nobody:nogroup /var/nfs
$ sudo exportfs -a
$ sudo service nfs-kernel-server start

Great ! Let us cross-check if the configuration holds good.

raina_ajeet@master:~$ sudo df -h
Filesystem           Size  Used Avail Use% Mounted on
udev                 1.9G     0  1.9G   0% /dev
tmpfs                370M   43M  328M  12% /run
/dev/sda1             20G  6.6G   13G  35% /
tmpfs                1.9G     0  1.9G   0% /dev/shm
tmpfs                5.0M     0  5.0M   0% /run/lock
tmpfs                1.9G     0  1.9G   0% /sys/fs/cgroup
tmpfs                100K     0  100K   0% /run/lxcfs/controllers
tmpfs                370M     0  370M   0% /run/user/1001
10.140.0.7:/var/nfs   20G   19G  1.3G  94% /mnt/nfs/var/nfs

As shown above, we have an NFS server with IP 10.140.0.7, ready to share the volume with all the worker nodes.

Running NFS service on Swarm Mode

In case you are new to the --mount option introduced under Docker 1.12 Swarm Mode, here is an easy explanation:
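As an illustration of --mount, here is a minimal sketch of a service whose tasks mount the NFS export through the built-in local volume driver. The service name testnfs, the volume name nfsvol, the target path and the image are assumptions; the server address 10.140.0.7 and the export /var/nfs are the ones set up above. The worker nodes are assumed to have the NFS client packages installed.

```shell
# Sketch only: service name, volume name, target path and image are hypothetical.
# $1: NFS server address, $2: exported directory on that server
create_nfs_service() {
  docker service create \
    --name testnfs \
    --replicas 3 \
    --mount "type=volume,source=nfsvol,target=/mnt/nfs,volume-driver=local,volume-opt=type=nfs,volume-opt=device=:$2,volume-opt=o=addr=$1" \
    ubuntu:16.04 tail -f /dev/null
}
```

For the setup in this post the call would be create_nfs_service 10.140.0.7 /var/nfs. Each node that runs a task creates its own local volume named nfsvol backed by the same export, so files written under /mnt/nfs by one task should be visible to the others.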

In our previous posts, we spent a considerable amount of time deep-diving into Swarm Mode, the built-in orchestration engine in the Docker 1.12 release. The Swarm Mode orchestration engine comprises desired state reconciliation, replicated and global services, configuration updates in the form of parallelism/delay, and restart policies, to name a few. Docker Engine 1.12 is not just about multi-host and multi-container orchestration; there are also numerous improvements in terms of Scheduling, Cluster management and Security. In this post, I am going to talk about the scheduling aspect (primarily Engine and Swarm labels) of the new service API introduced in the 1.12 engine.

Looking at Engine 1.12, scheduling can be regarded as a subset of orchestration. Orchestration is a broader term that refers to container scheduling, cluster management, and possibly the provisioning of master and worker nodes. When applications are scaled out across multiple swarm nodes, the ability to manage each node and abstract away the complexity of the underlying platform becomes more important. Under a Docker 1.12 swarm mode cluster, we talk more of docker service than of docker run. In terms of the new service API, “scheduling” refers to the ability of an operations team to place application services onto a swarm cluster and to establish how a specific group of tasks/containers should run. While scheduling refers to the specific act of loading the application service, in a more general sense schedulers are responsible for hooking into a node’s init system (dockerd, the docker daemon) to manage services.

Under Docker 1.12, scheduling refers to resource awareness, constraints and strategies. Resource awareness means the scheduler is aware of the resources available on nodes and places tasks/containers accordingly. Swarm Mode handles that quite effectively. As of today, 1.12 ships with a spread strategy which attempts to schedule tasks on the least loaded nodes, provided they meet the constraints and resource requirements. With constraints, the operations team can limit the set of nodes where a task/container can be scheduled by defining constraint expressions. When multiple constraints are given, only nodes that satisfy every expression are selected, i.e., an AND match. Constraints can match the node attributes in the following table.

Few Important Tips:

The engine.labels are collected from Docker Engine with information like operating system, drivers, etc.

The node.labels are added by the operations team for operational purposes. For example, some nodes carry security-compliance labels so that tasks with compliance requirements run on them.

Below is a snippet of the constraints used under 1.12 Swarm:

| node attribute | matches | example |
|---|---|---|
| node.id | node’s ID | node.id == 5ivku8v2gvtg4 |
| node.hostname | node’s hostname | node.hostname != node-1 |
| node.role | node’s manager or worker role | node.role == manager |
| node.labels | node’s labels added by cluster admins | node.labels.security == high |
| engine.labels | Docker Engine’s labels | engine.labels.operatingsystem == ubuntu 16.04 |
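To illustrate the AND match, here is a minimal sketch of a service that carries two constraints at once. The service name secure-web, the replica count and the nginx image are hypothetical; the constraint expressions reuse attributes from the table above.

```shell
# Sketch only: service name, replica count and image are hypothetical.
# Tasks are placed only on nodes that are workers AND carry the label
# security=high; a node matching just one of the two is skipped.
create_constrained_service() {
  docker service create \
    --name secure-web \
    --replicas 2 \
    --constraint 'node.role == worker' \
    --constraint 'node.labels.security == high' \
    nginx
}
```

If no node satisfies both expressions, the tasks stay pending rather than being scheduled elsewhere, which is a handy way to notice a mistyped label.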

Let’s take a look at Docker 1.12 labels and constraints in detail. A label is a key-value pair which serves a wide range of uses, such as identifying the right set of nodes. A label is metadata which can be attached to dockerd (the docker daemon). Labels, along with the semantics of constraints, help services run on targeted worker nodes. For example, payment-related application services can be targeted at nodes which are more secure, and some database R/W operations can be limited to a specific set of SSD-equipped worker nodes.

Under 1.12, there are two types of labels: Swarm labels and Engine labels. Swarm labels add security to scheduling decisions on top of Engine labels. It is important to note that Engine labels can’t be trusted for security-sensitive scheduling decisions, since any worker can report any label up to a manager. However, they can be useful for certain scenarios like scheduling containers on SSD-specific nodes, running application services based on resources and so on.

On the other hand, Swarm labels add an additional layer of trust, as they are explicitly defined by the operations folks. One can easily label worker nodes as “production” or “secure” to ensure that the payment-related application gets scheduled on those nodes, which keeps malicious workers away.

To get started, let us set up a 5-node Swarm cluster running Docker 1.12 on Google Compute Engine. I will be running Ubuntu 16.04 so as to show what changes have to be made on Ubuntu/Debian to make it work. Before I start setting up the Swarm cluster, let us pick 2 nodes (node-2 and node-3) to which we will be adding labels and constraints.

Login to node3 instance and add the following lines under [Service] section:

[Service]

EnvironmentFile=/etc/default/docker

ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS

Your file should look like as shown below:

Next, open up /etc/default/docker and add the highlighted line as shown below:

As shown above, I have added a label named “com.example.environment” with a value “production” so as to differentiate this node from the other nodes.

PLEASE NOTE: These are systemd-specific changes which work great for Debian/Ubuntu distributions.

To ensure that the $DOCKER_OPTS variable is rightly integrated into the docker daemon, run the below command:

As shown in the screenshot, the label is correctly attached to the dockerd daemon.

Follow the same step for node-2 before we start building the swarm cluster.

Once done, let’s set up a swarm cluster as shown below:

Set up the worker nodes by joining all the nodes one by one. We now have a 5-node Swarm cluster ready:

It’s time to create a service which schedules its tasks or containers only on node3 and node2, based on the labels which we defined earlier. This is how the docker service command should look:
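A minimal sketch of such a command is given here. It assumes that /etc/default/docker on node2 and node3 contains DOCKER_OPTS="--label com.example.environment=production", matching the label described earlier; the service name collabtest, the replica count and the nginx image are hypothetical.

```shell
# Sketch only: service name, replica count and image are hypothetical.
# Assumes node2 and node3 run dockerd with
#   --label com.example.environment=production
create_production_service() {
  # Restrict placement to engines that report the production label.
  docker service create \
    --name collabtest \
    --replicas 4 \
    --constraint 'engine.labels.com.example.environment == production' \
    nginx

  # List the tasks and the nodes they were scheduled on.
  docker service ps collabtest
}
```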

If you look at the ‘docker service’ command (shown above), a new prefix ‘engine.labels’ has been added, which is specific to the service API introduced in this new release. Once you pass this constraint along with the service specification, the scheduler ensures that these tasks only run on the specific set of nodes (node2 and node3).

Even though we had a 5-node cluster, the master node chose only node2 and node3, based on the constraints supplied through the Engine labels.

Demonstrating Node Label constraints:

Let us pick node1 and node4 to demonstrate node label constraints. We will use the docker node update command to add labels to the nodes directly. (Please remember that this doesn’t require engine-level label changes.)

As shown above, we added ostype=ubuntu to both nodes individually. Now create a service named collabtest1, passing the label through the --constraint option. You can easily verify the labels on each individual node using docker node inspect with a format option as shown below:

Now if you try scaling the service to 4, the tasks will be restricted to node1 and node4, as per the node label constraints we supplied earlier.
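The node label steps above can be sketched as follows. The node names, the ostype=ubuntu label and the service name collabtest1 come from the text; the initial replica count, the nginx image and the helper names are assumptions.

```shell
# Sketch only: run on a manager node; image and replica count are hypothetical.

# Attach the swarm-level label ostype=ubuntu to node1 and node4.
label_nodes() {
  docker node update --label-add ostype=ubuntu node1
  docker node update --label-add ostype=ubuntu node4
}

# Print the labels stored in the swarm for a node. $1: node name
show_node_labels() {
  docker node inspect --format '{{ .Spec.Labels }}' "$1"
}

# Create the service so that its tasks only land on the labelled nodes.
create_labelled_service() {
  docker service create \
    --name collabtest1 \
    --replicas 2 \
    --constraint 'node.labels.ostype == ubuntu' \
    nginx
}

# Scale up; the extra tasks are still confined to node1 and node4.
scale_labelled_service() {
  docker service scale collabtest1=4
}
```

Running docker service ps collabtest1 after scaling is a quick way to confirm that no task was placed outside the two labelled nodes.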

This brings up an important point: if two containers should always run on the same host because they operate as a unit, that affinity can often be declared during scheduling. On the other hand, if two containers should not be placed on the same host, for example to ensure high availability of two instances of the same service, this too can be achieved through scheduling. In my next post, I will cover more on affinity and additional filters in Swarm Mode.

In the previous blog post, we deep-dived into Service Discovery aspects of Docker. A service is now a first class citizen in Docker 1.12.0 which allows replication, update of images and dynamic load-balancing. With Docker 1.12, services can be exposed on ports on all Swarm nodes and load balanced internally by Docker using either a virtual IP(VIP) based or DNS round robin(RR) based Load-Balancing method or both.

In case you are very new to Load-balancing concept, the load balancer assigns workload to a set of networked computer servers or components in such a manner that the computing resources are used in an optimal manner. A load balancer provides high availability by detecting server or component failure and re-configuring the system appropriately. Under this post, I will try to answer the following queries:

Let’s get started –

Is Load-Balancing new to Docker?

The load-balancing (LB) feature is not at all new to Docker. It was first introduced in the Docker 1.10 release, where Docker Engine implemented an embedded DNS server for containers in user-defined networks. In particular, containers run with a network alias (--net-alias) were resolved by this embedded DNS to the IP address of the container when the alias was used.

No doubt, DNS round robin is extremely simple to implement and is an excellent mechanism to increase capacity in certain scenarios, provided that you take into account the default address selection bias. But it has certain limitations: some applications cache the DNS host name to IP address mapping, which causes them to time out when the mapping changes. Also, a non-zero DNS TTL value delays DNS entries from reflecting the latest state. Depending on the client implementation, DNS-based load balancing may not balance load properly. To learn more about DNS RR, which is sometimes called the poor man’s protocol, you can refer here.

What’s new in Load-balancing feature under Docker 1.12.0?

Docker 1.12.0 now comes with a built-in load-balancing feature. LB is designed as an integral part of the Container Network Model (CNM) and works on top of CNM constructs like network, endpoint and sandbox. Docker 1.12 comes with VIP-based load balancing. VIP-based services use Linux IPVS load balancing to route to the backend containers.

There is no more centralized load balancer; it’s distributed and hence scalable. LB is plumbed into each individual container. Whenever a container wants to talk to another service, load balancing happens right inside that container. LB is more powerful now and just works out of the box.

Docker 1.12 introduces Routing Mesh for the first time. With IPVS routing packets inside the kernel, swarm’s routing mesh delivers high-performance container-aware load balancing. Docker Swarm Mode includes a Routing Mesh that enables multi-host networking. It allows containers on two different hosts to communicate as if they were on the same host. It does this by creating a Virtual Extensible LAN (VXLAN), designed for cloud-based networking. We will talk more about Routing Mesh at the end of this post.

Whenever you create a new service in a Swarm cluster, the service gets a Virtual IP (VIP) address. Whenever you make a request to that VIP, the swarm load balancer distributes the request to one of the containers of the specified service. The built-in service discovery resolves the service name to the Virtual IP, and the service VIP to container IP load balancing is achieved using IPVS. It is important to note that the VIP is only useful within the cluster. It has no meaning outside the cluster because it is a private, non-routable IP.

2. Let’s create a new service called collabweb which is a simple Nginx server as shown:

$ docker service create \
    --replicas 3 \
    --name collabweb \
    --network collabnet \
    nginx

3. As shown below, the 3 replicas of the service are running on 3 nodes under the swarm overlay network called “collabnet”.

4. Use the docker inspect command to look into the service internally as shown below:

It shows the “VIP” address added to each service. A single command can fetch the Virtual IP address, as shown in the diagram below:

5. You can use the nsenter utility to enter its sandbox and check the iptables configuration:

In iptables, a packet usually enters the mangle table chains first and then the NAT table chains. Mangling refers to modifying the IP packet, whereas NAT refers only to address translation. As shown above in the mangle table, the 10.0.3.2 service IP gets a marking of 0x10c in the iptables OUTPUT chain. IPVS uses this marking and load balances it to containers 10.0.3.3, 10.0.3.5 and 10.0.3.6 as shown:

As shown above, you can use ipvsadm to set up, maintain or inspect the IP virtual server table in the Linux kernel. This tool can be installed on any Linux machine through apt or yum, depending on the Linux distribution.

A typical DNS RR and IPVS LB can be differentiated as shown in the diagram below: DNS RR returns a rotating list of IP addresses each time we access the service (either through curl or dig), while the VIP load balances requests to the containers (i.e. 10.0.0.1, 10.0.0.2 and 10.0.0.3).
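The difference can also be sketched in a few lines of Python. This is only a toy simulation under stated assumptions: the DNS RR client resolves once and caches the answer (the caching problem described earlier), while the VIP path picks a backend per connection in round-robin order.

```python
from collections import Counter
from itertools import cycle

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# DNS RR: the client resolves once and caches the answer, so every
# request goes to the same container until the cache entry expires.
dns_answers = cycle(backends)
cached_ip = next(dns_answers)
dns_rr_hits = Counter(cached_ip for _ in range(6))

# VIP: the client always talks to one virtual IP; the balancer behind it
# picks a backend for every new connection.
vip_backends = cycle(backends)
vip_hits = Counter(next(vip_backends) for _ in range(6))

print("DNS RR with a caching client:", dict(dns_rr_hits))
print("VIP:", dict(vip_hits))
```

With a caching client all six requests land on one container under DNS RR, whereas the VIP spreads them evenly.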

6. Let’s create a new service called collab-box under the same network. As shown in the diagram, a new Virtual IP (10.0.3.4) is automatically attached to this service:

Also, the service discovery works as expected:

Why IPVS?

IPVS (IP Virtual Server) implements transport-layer load balancing inside the Linux kernel, so-called Layer 4 switching. It is a load balancing module integrated into the Linux kernel and is based on Netfilter. It supports TCP, SCTP and UDP, over both IPv4 and IPv6. IPVS running on a host acts as a load balancer in front of a cluster of real servers: it can direct requests for TCP/UDP-based services to the real servers and make the services of the real servers appear as a virtual service on a single IP address.

It is important to note that IPVS is not a proxy; it is a forwarder that runs on Layer 4. IPVS forwards traffic from clients to back-ends, meaning you can load balance anything, even DNS! Its features include:

UDP support

Dynamically configurable

8+ balancing methods

Health checking

IPVS has lots of interesting features and has been in the kernel for more than 15 years. The chart below differentiates IPVS from other LB tools:

Is Routing Mesh a Load-balancer?

Routing Mesh is not a load balancer, but it makes use of LB concepts. It provides a globally published port for a given service. The routing mesh uses port-based service discovery and load balancing. So to reach any service from outside the cluster, you need to expose ports and reach them via the Published Port.

In simple words, if you had 3 swarm nodes, A, B and C, and a service running on nodes A and C with node port 30000 assigned, the service would be accessible via any of the 3 swarm nodes on port 30000, regardless of whether the service is running on that machine, and requests would be automatically load balanced between the 2 running containers. I will talk about Routing Mesh in a separate blog post if time permits.
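The A, B, C example can be modelled with a short Python sketch. It is a simplification under stated assumptions: the task names and the `handle` helper are hypothetical, and a plain round-robin stands in for the kernel-level balancing.

```python
from itertools import cycle

nodes = ["A", "B", "C"]
# The service runs on A and C only; B has no local task.
tasks_on = {"A": ["task-1"], "B": [], "C": ["task-2"]}
published_port = 30000

# Every node accepts traffic on the published port and forwards it
# to any running task of the service, wherever that task lives.
all_tasks = cycle(t for n in nodes for t in tasks_on[n])


def handle(node, port):
    """Return the task that serves a request arriving at node:port."""
    if node not in nodes or port != published_port:
        raise ConnectionRefusedError(f"nothing published on {node}:{port}")
    return next(all_tasks)


served = [handle(n, 30000) for n in ["B", "B", "A", "C"]]
print(served)
```

Note that the requests sent to node B are served even though B runs no task of the service, which is exactly the point of the published port.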

It is important to note that Docker Engine 1.12 creates an “ingress” overlay network to achieve the routing mesh. Usually the frontend web service and the sandbox are part of the “ingress” network and take part in the routing mesh. All nodes become part of the “ingress” overlay network by default, using the sandbox network namespace created inside each node. You can refer to this link to learn more about the internals of Routing Mesh.

Is it possible to integrate an external LB with the services in the cluster? Can I use HAProxy in Docker Swarm Mode?

You can expose the ports for services to an external load balancer. Internally, the swarm lets you specify how to distribute service containers between nodes. If you would like to use an L7 LB and it cannot be made part of the cluster, you need to point it at any (or some, or all) node IPs and the PublishedPort. If the L7 LB can be made part of the cluster by running the L7 LB itself as a service, then it can just point to the service name (which will resolve to a VIP). A typical architecture would look like this:

Prior to the Docker 1.12 release, setting up a Swarm cluster needed some sort of service discovery backend. There are multiple discovery backends available, such as the hosted discovery service, a static file describing the cluster, etcd, consul, zookeeper or a static list of IP addresses.

Thanks to Docker 1.12 Swarm Mode, we don’t have to depend upon these external tools and complex configurations. Docker Engine 1.12 runs its own internal DNS service to route services by name. Swarm manager nodes assign each service in the swarm a unique DNS name and load balance across its running containers. You can query every container running in the swarm through a DNS server embedded in the swarm.

How does it help?

When you create a service and provide a name for it, you can use just that name as a target hostname, and it’s going to be automatically resolved to the proper container IP of the service. In short, within the swarm, containers can simply reference other services via their names and the built-in DNS will be used to find the appropriate IP and port automatically. It is important to note that if the service has multiple replicas, the requests would be round-robin load-balanced. This would still work if you didn’t forward any ports when you created your docker services.

Embedded DNS is not a new concept. It was first included in the Docker 1.10 release. Please note that DNS lookup for containers connected to user-defined networks works differently from lookup for containers connected to the default bridge network. As of Docker 1.10, the docker daemon implements an embedded DNS server which provides built-in service discovery for any container created with a valid name or net-alias or aliased by link. Moreover, the container name configured using --name is used to discover a container within a user-defined docker network. The embedded DNS server maintains the mapping between the container name and its IP address (on the network the container is connected to).

How does Embedded DNS resolve unqualified names?

With the Docker 1.12 release, a new API called “service” is included, which clearly covers the functionality of service discovery. It is important to note that service discovery is scoped within the network. What it really means is this: if you have a redis application and a web client as two separate services, you combine them into a single application and put them on the same network. If you build your application so that it reaches redis through the name “redis”, that name will always resolve to the redis service, because both services are part of the same network. You don’t need the application to resolve this service using an FQDN. An FQDN is not going to be portable, which in turn makes your application non-portable.

Internally, there is a listener opened inside the container itself. If we enter a container that uses this service discovery and look at /etc/resolv.conf, we will find that the nameserver entry holds something quite different: 127.0.0.11. This is nothing but a loopback address. So, whenever the resolver tries to resolve a name, the query goes to 127.0.0.11, where the request is trapped.

Once this request is trapped, it is sent to a particular random UDP/TCP port on which the docker daemon is currently listening. The socket for this is created inside the container’s namespace. When the DNS server in the daemon gets the request, it knows which specific network the request is coming from, and is therefore aware of its context. Once it knows the context, it can generate the appropriate DNS response.
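The network-scoped lookup described above can be illustrated with a toy resolver in Python. This is a sketch, not the daemon’s code: the `EmbeddedDns` class, the “othernet” network, the “billing” service and the IP addresses are all hypothetical.

```python
class EmbeddedDns:
    """Toy resolver: names are only visible inside the network they belong to."""

    def __init__(self):
        self._records = {}  # (network, name) -> ip

    def register(self, network, name, ip):
        self._records[(network, name)] = ip

    def resolve(self, source_network, name):
        # The lookup key includes the caller's network, which models the
        # daemon knowing which network the query came from.
        return self._records.get((source_network, name))


dns = EmbeddedDns()
dns.register("collabnet", "redis", "10.0.3.2")    # hypothetical VIP
dns.register("othernet", "billing", "10.0.4.2")   # hypothetical VIP

print(dns.resolve("collabnet", "redis"))    # same network: resolves
print(dns.resolve("collabnet", "billing"))  # different network: no answer
```

Because the caller’s network is part of the lookup, the short name “redis” is enough inside collabnet and no FQDN is needed.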

To demonstrate Service Discovery under Docker 1.12, I have upgraded Docker 1.12.rc5 to the 1.12.0 GA version. The swarm cluster looks like this:

I have created a network called “collabnet” for the new services as shown below:

Let’s create a service called “wordpressdb” under the collabnet network:

You can list the running tasks (containers) and the nodes on which these containers are running:

Let’s create another service called “wordpressapp” under the same network:

Now, we can list the services running on our swarm cluster as shown below.

I have scaled out wordpressapp and wordpressdb just for demonstration purposes.

Let’s consider my master node, where I have two of the containers running as shown below:

I can reach one service from another (here, between wordpressapp and wordpressdb) using just the service name, as shown below:

Also, I can reach a particular container by its name from another container that runs a different service on the same network. As shown below, I can reach the wordpressapp.3.6f8bthp container from wordpressdb.7.e62jl57qqu, which is running wordpressdb.

The below picture depicts the Service Discovery in a nutshell:

Every service has a Virtual IP (VIP) associated with it, which can be derived as shown below:

As shown above, each service has an IP address, and this IP address maps to the multiple container IP addresses associated with that service. It is important to note that the service IP associated with a service does not change even when containers associated with the service die or restart.

Few important points to remember:

VIP-based services use Linux IPVS load balancing to route to the backend containers. This works only for TCP/UDP protocols. When you use DNS-RR mode, services don’t have a VIP allocated. Instead, the service name resolves to one of the backend container IPs randomly.

Ping not working for a VIP is as designed. Technically, IPVS is a TCP/UDP load balancer, while ping uses ICMP, and hence IPVS is not going to load balance the ping request.

For VIP-based services, the reason ping works on the local node is that the VIP is added as a second IP address on the overlay network interface.

You can use any of the tools like dig, nslookup or wget -O- <service name> to demonstrate the service discovery functionality.

The picture below shows that the network is the scope of service discoverability: when you have a service running on one network, it is scoped to that network and won’t be able to reach a different service running on a different network (unless it is part of that network too).

Let’s dig a little further and introduce the load balancing aspect too. To see what is actually enabling the load balancing functionality, we can go into the sandbox of each container and see how it has been resolved.

Let’s pick the two containers running on the master node. We can see the sandboxes through the following command:

Under /var/run/docker/netns, you will find various namespaces. The namespaces marked with x-{id} represent network namespaces managed by the overlay network driver for its operation (such as creating a bridge, terminating the vxlan tunnel, etc.). They don’t represent the container network namespace. Since they are managed by the driver, it is not recommended to manipulate anything within them. But if you are curious about a deep dive, you can use the “nsenter” tool to understand more about this internal namespace.

We can enter the sandbox through the nsenter utility:

In case you face an error stating “nsenter: reassociate to namespace ‘ns/net’ failed: Invalid argument”, I suggest looking at this workaround.

The 10.0.3.4 service IP is marked 0x108 using the iptables OUTPUT chain. IPVS uses this marking and load balances it to containers 10.0.3.5 and 10.0.3.6 as shown below:

Here are key takeaways from this entire post:

In my next blog post, I am going to deep dive into Load-Balancing aspect of Swarm Mode. Thanks for reading.

At the last Meetup (#Docker Bangalore), there was a lot of curiosity around the “Desired State Reconciliation” and “Node Management” features of Docker Engine 1.12 Swarm Mode. After the presentation session I received lots of queries on how node failure is handled in the new Docker Swarm Mode, particularly when a master node participating in the raft consensus goes down. In this blog post, I will demonstrate how master node failure is handled, which is very specific to the Raft consensus algorithm. We will look at how SwarmKit (the technical foundation of the Swarm Mode implementation) uses the Raft consensus algorithm to avoid a single point of failure and make effective decisions in a distributed system.

In the previous post we did a deep-dive into Swarm Mode implementation where we talked about the communication in between manager and worker nodes. Machines running SwarmKit can be grouped together in order to form a Swarm, coordinating tasks with each other. Once a machine joins, it becomes a Swarm Node. Nodes can either be worker nodes or manager nodes. Worker nodes are responsible for running Tasks while Manager nodes accept specifications from the user and are responsible for reconciling the desired state with the actual cluster state.

Manager nodes maintain a strongly consistent, replicated (Raft-based) and extremely fast (in-memory reads) view of the cluster, which allows them to make quick scheduling decisions while tolerating failures. Node roles (Worker or Manager) can be dynamically changed through API/CLI calls. If any master or worker node fails, SwarmKit reschedules its tasks (which are nothing but containers) onto a different node.

A Quick Brief on Raft Consensus Algorithm

Let’s understand what raft consensus is all about. A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader. Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader. To understand the raft implementation, I recommend reading https://github.com/hashicorp/raft

PLEASE NOTE that there should always be an odd number of managers (1, 3, 5 or 7) to reach consensus. If you have just two managers and one of them goes down, you end up in a situation where you can’t achieve consensus. The reason is that more than 50% of the managers need to “agree” to make the raft consensus work.
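The “more than 50%” rule is easy to check with a little arithmetic. The following Python sketch computes the majority needed and the number of manager failures that can be tolerated; the function names `quorum` and `tolerated_failures` are illustrative and not part of any Docker API.

```python
def quorum(managers):
    """Smallest majority of a group of `managers` Raft members."""
    return managers // 2 + 1


def tolerated_failures(managers):
    """How many managers can be lost while a majority still remains."""
    return managers - quorum(managers)


for n in (1, 2, 3, 4, 5, 7):
    print(f"{n} managers: quorum={quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Two managers tolerate no failure at all, and four tolerate only one, the same as three. This is why an even number of managers adds overhead without adding fault tolerance.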

The Swarm Mode cluster is already running a service which is replicated across 3 nodes (test-master1, test-node2 and test-node1) out of a total of 5 nodes. Let us use the docker-machine command (my all-time favorite) to ssh to test-master1 and promote the workers (test-node1 and test-node2) to managers, as shown above.

Hence, the worker nodes are promoted to manager nodes, which are shown as “Reachable”.

The “$docker ps” command shows that there is a task (container) already running on the master node. Please remember that “$docker ps” has to be run manually on each node to see which containers are running locally on that node.

The below picture depicts the detailed list of the containers(or tasks) which are distributed across the swarm cluster.

Let’s bring down the manager node “test-master1”, either by shutting it down uncleanly or by stopping the instance through the available GCE feature (as shown below). The manager node (test-master1) is no longer reachable. If you ssh to test-node2 and check whether the cluster is up and running, you will find that the node failure has been taken care of and desired state reconciliation has come into the picture. Now the 3 replicas of the task are running on test-node1, test-node2 and test-node3.

To implement raft consensus, the minimal recommendation is an odd number of managers (1, 3, 5 or 7). The recommended maximum is 5 manager nodes for better performance; increasing the manager nodes to 7 might introduce a performance bottleneck, as there is additional communication overhead in keeping the managers in agreement.

Today Docker Inc. released Engine 1.12 Release Candidate 4 with numerous improvements and added security features. With an optional “Swarm Mode” feature integrated into the core Docker Engine, native management of a cluster of Docker Engines, orchestration, decentralized design, service and application deployment, scaling, desired state reconciliation, multi-host networking, service discovery and routing mesh implementation are just a matter of a few one-line commands.

In the previous posts, we introduced Swarm Mode, implemented a simple service application and went through the 1.12 networking model. In this post, we will deep dive into Swarm Mode and study what kind of communication takes place between master and worker nodes in the Swarm cluster.

Setting up Swarm Master Node

Let’s start setting up a Swarm Mode cluster and see how the underlying communication takes place. I will be using docker-machine to set up master and worker nodes on Google Compute Engine.

As you see below, the Docker host machines get created through docker-machine, with all the nodes running Docker Engine 1.12-rc4.

Let’s initialize the swarm mode on the first master node as shown below:

I have used a one-liner docker-machine command to keep it clean and simple. The docker-machine command will SSH to the master node and initialize the swarm mode.

The newly released RC4 version improves security, which is now enabled by default. In earlier releases, one had to pass the --secret parameter to secure and control which worker nodes could join and which couldn’t. Going forward, swarm mode automatically generates a random secret key. This is just awesome!

When further nodes join the swarm, they are issued their own keypair, signed by the root CA, and they also receive the root CA public key and certificate. All the communication is encrypted over TLS.

The node keys and certificates are automatically renewed at regular intervals (by default 90 days), but one can tune this with the docker swarm update command.

Let us spend some time understanding the master and worker architecture in detail.

Every node in Swarm Mode has a role, which is either Manager or Worker. A manager node is responsible for orchestrating the cluster, performing health checks, serving the API and so on. A worker node just executes the tasks, which are actually containers. It cannot decide to schedule containers on a different machine, and it cannot change the desired state. Workers only take work and report back the status. You can promote or demote a node easily through a one-liner command.

Managers and workers use two different communication models. Managers have a built-in Raft system that allows them to share information and elect a new leader. At any one time, only one manager is actually performing the scaling; the managers use a leader-follower model to figure out which one is supposed to do what. No external K-V store is required, as a built-in internal distributed state store is available.

Workers, on the other hand, use a gossip network protocol, which is quite fast and eventually consistent. Whenever a new container/task is created in the cluster, gossip broadcasts to all the other containers in that specific overlay network that the new container has started. Please remember that ONLY the containers running in that specific overlay network are notified, NOT every container globally. Gossip is optimized for heavy traffic.

Let us go one level deeper to understand how the underlying service is created and dispatched to the worker nodes. Before creating the service, let us first create a new overlay network called mynetwork.

--network mynetwork dockercloud/hello-world

Once you run the above command, 3 replicas of the service get created and distributed across the cluster nodes.

[Under the hood] – Let’s understand what happens whenever a new service is created.

Whenever we create an overlay network through the “docker network create -d overlay” command, the request goes to the manager. The manager is built from multiple pipeline stages, one of which is the Allocator. The Allocator takes the network creation request and chooses a particular pre-defined subnet that is available. Allocation happens purely in memory and is hence quick. Once the network is created, it’s time to connect a service to that network. When you create a service, the orchestrator gets involved and tries to generate the requisite number of tasks, which are nothing but containers in the real world. But the tasks need IP addresses and VXLAN IDs, as the overlay network needs those too. This allocation happens in the manager nodes. Once allocation completes, the tasks are created and their state is preserved in the raft store. Only then can the scheduler move a particular task into the Assigned state, after which it is dispatched to one of the worker nodes. A manager can also be a worker. Every task goes through multiple stages: New, Allocated, Assigned, etc. If a task has not passed the allocator stage, it will not be assigned to a worker node. With the help of the network control plane (gossip protocol), the multiple tasks distributed across multiple worker nodes are managed effectively.
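The ordering constraint between the allocator and the scheduler can be sketched as a tiny state machine in Python. This is an illustration only, not SwarmKit code: the `Task` class, its methods, the task name and the IP address are hypothetical, and only the three stages named above are modelled.

```python
from enum import IntEnum


class TaskState(IntEnum):
    NEW = 0
    ALLOCATED = 1
    ASSIGNED = 2


class Task:
    def __init__(self, name):
        self.name = name
        self.state = TaskState.NEW
        self.ip = None
        self.node = None

    def allocate(self, ip):
        """Allocator stage: attach network resources (here, just an IP)."""
        self.ip = ip
        self.state = TaskState.ALLOCATED

    def assign(self, node):
        """Scheduler stage: only allocated tasks may be placed on a node."""
        if self.state != TaskState.ALLOCATED:
            raise RuntimeError(f"{self.name} must be allocated before it is assigned")
        self.node = node
        self.state = TaskState.ASSIGNED


task = Task("hello-world.1")
try:
    task.assign("worker-1")  # rejected: the task is still NEW
except RuntimeError as err:
    print("scheduler refused:", err)

task.allocate("10.0.0.3")
task.assign("worker-1")
print(task.state.name, task.node)
```

The first `assign` call fails because the task has not yet passed the allocator stage, which mirrors the rule that unallocated tasks are never dispatched to workers.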