#SwarmWeek: Docker Swarm Exceeds Kubernetes Performance at Scale

There are people who will tell you that the community has made up its mind when it comes to container orchestration.

Nothing could be further from the truth. A recent survey of more than 500 respondents, covering DevOps, microservices, and the public cloud, revealed a three-way orchestration race between Docker Swarm, Google Kubernetes, and Amazon EC2 Container Service (ECS).

When you think about which orchestration tool is right for your environment, we believe the following three key things must be considered:

Performance: How fast can I get containers up and running at scale? How responsive is the system when under load?

Simplicity: What’s the learning curve to set up and ongoing burden to maintain? How many moving parts are there?

Flexibility: Does it integrate with my current environment and workflows? Will my applications seamlessly move from dev to test to production? Will I be locked into a specific platform?

Docker Swarm leads in all three areas.

Performance at Scale

We released the first beta of Swarm just over a year ago, and since then we’ve made remarkable progress. In less than a year, we introduced Swarm 1.0 (November 2015) and made clear that Swarm can scale to support 1,000 nodes running in a production environment, and our internal testing proves that.

Kubernetes previously published its own blog post detailing performance testing on a 100-node cluster. The problem for customers is that there was no way to really compare the results of these two efforts, as the test methodologies were fundamentally different.

In order to accurately assess performance across orchestration tools there needs to be a unified framework.

To that end, Docker engaged Jeff Nickoloff, an independent technology consultant, to help create this framework and make it available to the larger container community for use in their own evaluations.

Today Jeff released the results of his independent study comparing the performance of Docker Swarm to Google Kubernetes at scale. The study and article, commissioned by Docker, tested the performance of both platforms while running 30,000 containers across 1,000-node clusters.

The tests were designed to measure two things:

Container startup time: How quickly can a new container actually be brought online, as opposed to simply being scheduled to start?

System responsiveness under load: How quickly does the system respond to operational requests under load (in this case, listing all the running containers)?

The test harness looks at both of these measurements as the cluster is built. A fully loaded cluster is 1,000 nodes running 30,000 containers (30 containers per node).

As nodes are added to the cluster, the harness stops to measure container startup time and system responsiveness. These breakpoints occur when the cluster is 10%, 50%, 90%, 99%, and 100% full. At each of these load levels, 1,000 test iterations are executed.

What this means is that, for instance, when the cluster is 10% full (100 nodes and 3,000 containers), the harness will pause adding new nodes. It will instead measure the time it takes to start up a new container (in this case the 3,001st container) and how long it takes to list all the running containers (3,001). It repeats this sequence 1,000 times: in each iteration, the 3,001st container is created, the startup and list times are measured, and the container is removed.
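The load levels and the per-level loop described above can be sketched roughly as follows. This is only an illustration of the procedure, not the harness used in the study: the three callables passed to `measure` (`start_container`, `list_containers`, `remove_container`) are hypothetical stand-ins for real cluster operations, and the function names are made up for this example.

```python
import time

NODES = 1000               # full cluster size described in the study
CONTAINERS_PER_NODE = 30   # 30,000 containers when the cluster is full
LOAD_LEVELS = [10, 50, 90, 99, 100]  # percent full at each breakpoint
ITERATIONS = 1000          # test iterations per breakpoint


def breakpoints(nodes=NODES, per_node=CONTAINERS_PER_NODE, levels=LOAD_LEVELS):
    """Return (percent, node_count, container_count) for each load level."""
    result = []
    for pct in levels:
        node_count = nodes * pct // 100
        result.append((pct, node_count, node_count * per_node))
    return result


def measure(start_container, list_containers, remove_container,
            iterations=ITERATIONS):
    """Time one extra container start and one list call, `iterations` times.

    The three callables are hypothetical placeholders supplied by the caller.
    Returns a list of (startup_seconds, list_seconds) pairs.
    """
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        container = start_container()   # e.g. the 3,001st container at 10% load
        t1 = time.perf_counter()
        list_containers()               # list everything currently running
        t2 = time.perf_counter()
        remove_container(container)     # return the cluster to its prior load
        samples.append((t1 - t0, t2 - t1))
    return samples


if __name__ == "__main__":
    for pct, node_count, container_count in breakpoints():
        print(f"{pct}% full: {node_count} nodes, {container_count} containers")
```

Removing the extra container at the end of each iteration keeps the cluster at the same load level for all 1,000 measurements.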

The results show that Swarm is on average 5X faster in terms of container startup time and 7X faster in delivering operational insights necessary to run a cluster at scale in production.

Looking more closely at the results for container startup time, there is a clear performance advantage for Swarm regardless of cluster load level.

From Jeff’s blog:

Half the time Swarm will start a container in less than .5 seconds as long as the cluster is not more than 90% full. Kubernetes will start a container in over 2 seconds half of the time if the cluster is 50% full or more.
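The "half the time" figures quoted above are medians. A median on its own says nothing about the slower half of the runs, so when reading or reproducing results like these it helps to look at tail percentiles and the extremes as well. A minimal sketch, assuming the raw startup times are available as a list of seconds; `summarize` is a hypothetical helper, and the sample values are illustrative rather than taken from the study:

```python
import statistics


def summarize(samples):
    """Summarize latency samples (seconds) beyond the median alone."""
    ordered = sorted(samples)
    # method="inclusive" keeps every percentile within the observed min and max.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p90": cuts[89],   # 90th percentile
        "p99": cuts[98],   # 99th percentile
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Illustrative values only; these are not measurements from the study.
    startup_seconds = [0.31, 0.35, 0.42, 0.44, 0.47, 0.52, 0.61, 0.90, 1.40, 4.80]
    for name, value in summarize(startup_seconds).items():
        print(f"{name}: {value:.2f}s")
```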

One important thing to note is that this test isn’t about container scheduling, it’s about getting containers running and doing work.

The reality is nobody cares if a container was “scheduled” to run, what they care about is that the container is actually running. I think about it like this: If I go out to eat, taking my order and handing it off to the kitchen is great, but what’s really important is how long it takes to actually get my meal prepared and delivered to my table.

One of the promises of containers is agility and responsiveness. A 5X delay in container startup time absolutely wreaks havoc on distributed applications that need near real-time responsiveness. Even in cases where real-time responsiveness isn’t needed, taking all that extra time to bring up infrastructure is painful – think about using orchestration as part of a continuous integration workflow, longer container startup times directly correspond to longer test cycle times.

It’s one thing to scale a cluster to 30,000 containers, and it’s a completely different thing to be able to efficiently manage that environment. System responsiveness under load is critical to effective management. In a world where containers may only live for a few minutes, having a significant delay in gathering real-time insight into the state of the environment means you never really know what’s happening in your infrastructure at any particular moment in time.

In order to gauge system responsiveness under load, the test harness measured the time it took to list out all the running containers at various levels of cluster load.

The result: Compared to Swarm, Kubernetes took up to 7X longer to list all the running containers as the cluster approached full load, taking over 2 minutes to list out the running containers. Furthermore, Kubernetes had a 98X increase in response time (that’s not a typo; it’s 98X, not 98%) as the cluster went from 10% to 100% full.
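The gap between "98X" and "98%" is easy to check with a little arithmetic. The sketch below assumes that "98X" means the response time at 100% load divided by the response time at 10% load; that reading, the helper names, and the input values are assumptions made for illustration, not figures from the study.

```python
def fold_change(before, after):
    """How many times larger `after` is than `before` (98.0 means 98X)."""
    return after / before


def percent_increase(before, after):
    """Relative growth from `before` to `after`, as a percentage."""
    return (after - before) / before * 100


if __name__ == "__main__":
    # Hypothetical response times in seconds, chosen only to show the arithmetic.
    baseline = 1.0
    print(fold_change(baseline, 98.0))       # 98X the baseline
    print(percent_increase(baseline, 98.0))  # far more than 98%
    print(fold_change(baseline, 1.98))       # what a 98% increase would look like
```

Under that reading, a 98% increase would not even double the response time, while a 98X increase multiplies it almost a hundredfold.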

Simplicity

So why exactly is Kubernetes so much slower and less responsive than Swarm? It really comes down to system architecture. A quick glance at the diagrams from Jeff’s testing environments shows that Swarm has fewer moving parts than Kubernetes.

All of these components introduce a high degree of complexity to the setup process, inject latency into command execution, and make troubleshooting and remediation difficult. The diagram below depicts the number of component-level interactions in Kubernetes compared to Swarm. The 8X more “hops” needed to complete a command like run or list add latency and result in a 7X slower system for critical orchestration functions. Another effect of these many interactions is that when a command fails to complete, it is difficult to deduce at which point the failure occurred.

Kubernetes was born out of Google’s internal Borg project, so people assume it’s designed to perform well at “cloud scale”. The test results are one proof point that Kubernetes is fairly divergent from Borg. However, it does share one thing in common with Borg: being overly complex and needing teams of cloud engineers to implement and manage it day to day.

Swarm, on the other hand, shares in a core Docker discipline of democratizing complex cloud technologies. Swarm has been built from day one with the intent of being the best way to orchestrate containers for organizations of all sizes without requiring an army of engineers. It offers an easy-to-use experience that is the same whether you are testing a small cluster on your laptop, setting up some test servers in a datacenter, or running your production cloud infrastructure.

As Jeff said, “Docker Swarm is quantitatively easier to adopt and support than Kubernetes clustering components.”

Some might argue that Kubernetes is more complicated because it does more. But “doing more” does not bring any value to the table if the “more” isn’t anything you care about. And, in reality, it can actually end up being a detriment as “more” can introduce additional points of failure, increased support costs, and unnecessary infrastructure investments.

Or as Jeff describes it:

“…Kubernetes is a larger project, with more moving parts, more facets to learn, and more opportunities for failure. Even though the architecture implemented by Kubernetes can prevent a few known weaknesses in the Swarm architecture it creates opportunities for more esoteric problems and nuances.”

Flexibility

As I stated at the outset of this post, performance and simplicity are only two factors when considering an orchestration tool. The third critical element is flexibility, and flexibility itself means many things.

The previously mentioned survey results show that the three main orchestration tools companies are using or considering are Docker Swarm, Google Kubernetes, and Amazon EC2 Container Service (ECS).

Of those three, only Docker is fully committed to ensuring that your application runs unfettered across the full gamut of infrastructure: from your developers to your test environment to a production deployment on the platform of your choosing, whether on a laptop, in your private datacenter, or on the cloud provider of your choosing. Docker Swarm allows you to cluster hosts and orchestrate containers anywhere.

Beyond offering true portability of your workloads across public and private infrastructure, Docker features a plugin-based architecture. These plugins ensure that your Dockerized applications will work with your existing technology investments across networking, storage, and compute, and can be moved to a different network or storage provider without any change to your application code.

We know this because the same survey previously mentioned also tells us that users want tools that address the full application lifecycle, feature integrated tooling for both their developers and operations engineers, and support the widest range of developer tools.

Docker Swarm allows organizations to leverage the full power of the native Docker CLI and APIs. It allows developers to work in a consistent way, regardless of where their applications are developed or where they will run. Docker works with the infrastructure investments you have today and smooths your transition to different providers. Our design philosophy puts you – the user – and your applications first.

Don’t forget to participate in our DockerCon ticket raffle! Share a picture or description of your Swarm with us on Twitter and tag @docker and #SwarmWeek for a chance to win a free ticket to DockerCon 2016 in Seattle, June 20-21.

JeffB

Mike, you write “Even in cases where real-time responsiveness isn’t needed, taking all that extra time to bring up infrastructure is painful – think about using orchestration as part of a continuous integration workflow, longer container startup times directly correspond to longer test cycle times.”

But on Jeff Nickoloff’s blog entry he writes “Anecdotally, I’d like to add that the Kubernetes parallel container scheduling provided by replication controllers is remarkable. Using a Kubernetes replication controller I was able to create 3000 container replicas in under 155 seconds. Without using parallel requests it would take approximately 1100 seconds to do with Swarm and almost 6200 seconds on Kubernetes.”

It seems that k8s definitely has an edge here in spinning up infrastructure for disposable environments.

I hope Docker, Inc is also looking at Nickoloff’s notes on Docker Machine “Provisioning machines with Docker Machine is too slow and experiences unrecoverable errors too often to be viable when creating a cluster with 1000 nodes.”

Peter

That’s all pretty nice, but if on the other “half of the time” Swarm needs 10 minutes to fire up a container it would be pretty useless. Please don’t drop vital parts of a statistic; always include extrema or other properties of the distribution.

bo

As a newbie, I tried to deploy 4 containers on 2 swarm nodes at AWS. It worked. But I found the containers on different nodes are inaccessible mutually. They even have the same set of private ips. Could it be possible to fix this?

Ashish

Mohammed Elshambakey

Hi
I'm trying to use the c4-benchmark, but I'm limited on AWS resources (fewer than 10, which is smaller than the smallest cluster size required by c4, as I understood).
I wonder if anyone has managed to run the c4-benchmark on local machines, or to reduce the cluster size to fewer than 10?
Also, if other benchmarks (like those used for HPC, and any other benchmarks for CPU, network, I/O and energy) are run inside the containers of each cluster, would that be expected to introduce new differences between Docker Swarm and Kubernetes? Or will the differences between the two clusters still be the same as in this article?