
Scalability of CFEngine 3.3.5 and Puppet 2.7.19

Introduction

CFEngine and Puppet are configuration management tools that help you automate IT infrastructure. Practical examples include adding local users, installing Apache, and making sure password-based authentication in sshd is turned off. The more servers and complexity you manage, the more valuable such a tool becomes.

In this test, we set out to explore the performance of the tools as the environment scales, both in terms of nodes and policy/manifest size. Amazon EC2 is used to easily increase the size of the environment. This test is primarily a response to the comments in this older test.

I want to start with a few disclaimers:

I am affiliated with CFEngine (the company), and so it is extremely important for me to provide all the details so the test procedure and results can be scrutinized and reproduced. I would love for some of you to create independent and alternative tests.

The exact numbers in this test probably do not map directly to your environment, as everybody’s environment is a little different (especially in hardware, node count, and policy/manifest). The goal is therefore to identify trends and degrees of difference, not exact numbers.

For simplicity, all ports were left open during the tests (the “Everything” security group by Amazon was used).

Both tools were set to run every 5 minutes.
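For reference, the five-minute interval can be expressed in each tool’s configuration. The snippets below are a sketch, not the exact configuration used in the test; CFEngine’s cf-execd already defaults to a five-minute schedule, and the Puppet setting goes in puppet.conf:

```
# Puppet: /etc/puppet/puppet.conf (runinterval is given in seconds)
[agent]
runinterval = 300

# CFEngine: cf-execd's default schedule is already every five minutes,
# but it can be stated explicitly in the executor control body:
body executor_control
{
    schedule => { "Min00_05", "Min05_10", "Min10_15", "Min15_20",
                  "Min20_25", "Min25_30", "Min30_35", "Min35_40",
                  "Min40_45", "Min45_50", "Min50_55", "Min55_00" };
}
```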

Test procedure

The policy server/Puppet master was set up manually first, as described in detail below. For test efficiency, both were set to accept new clients automatically (trusting all keys or autosigning).

Clients were added in steps of 50 every 15 minutes, up to 300 (there wasn’t enough time to go higher). The policy/manifest was changed twice during the test to see what impact this had. These were the exact steps taken:

Time    Client count    Policy/manifest
0:00    50              Apache
0:15    100             Apache
0:30    150             Apache
0:45    200             Apache
1:00    200             Apache, 100 echo commands
1:15    250             Apache, 100 echo commands
1:30    250             Apache, 200 echo commands
1:45    300             Apache, 200 echo commands

CPU usage at the policy server/master was measured with Amazon CloudWatch. Client run-time was measured by picking a random client and invoking a manual run of each tool under the time utility. Each run-time was measured three times and the average was taken.

The policy/manifest was changed twice, by adding 100 echo commands each time (run /bin/echo 1, /bin/echo 2,… /bin/echo 100), to see how the tools handled a simple increase in work size.
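To give a concrete picture of this extra work, the resources could be generated with a small script. The snippet below is a sketch of what the Puppet side might look like; the resource titles and the file name (echo_commands.pp) are my own choices, not necessarily what was used in the test:

```shell
#!/bin/sh
# Generate 100 exec resources ("/bin/echo 1" .. "/bin/echo 100")
# as a Puppet manifest fragment. Names are illustrative only.
for i in $(seq 1 100); do
  printf 'exec { "echo%s":\n  command => "/bin/echo %s",\n}\n' "$i" "$i"
done > echo_commands.pp

# Show the first resource for illustration
head -n 3 echo_commands.pp
```

The CFEngine equivalent would be 100 commands promises running /bin/echo 1 through /bin/echo 100.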

Setting up the CFEngine policy server

The Ubuntu 12.04 package was found and downloaded at the CFEngine web site. To save time and money, this package was uploaded to Amazon S3 for the clients to download (internal Amazon traffic is free on EC2). Note that I could have used the CFEngine apt repository, but since CFEngine is just one package I chose to install it directly with dpkg. The following steps were carried out:
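The detailed listing is not reproduced here; in outline, the setup amounts to something like the following sketch. The package filename, S3 bucket and IP address are placeholders, and the bootstrap syntax is the pre-3.6 form used by CFEngine 3.3:

```
# Download the package from S3 and install it (filename is a placeholder)
wget https://s3.amazonaws.com/BUCKET/cfengine-community_3.3.5-1_amd64.deb
dpkg -i cfengine-community_3.3.5-1_amd64.deb

# Bootstrap the policy server to its own IP (placeholder address)
/var/cfengine/bin/cf-agent --bootstrap --policy-server 10.0.0.1
```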

Setting up the Puppet master

It is important to note that the default Puppet master configuration is not production ready according to the documentation: “The default web server is simpler to configure and better for testing, but cannot support real-life workloads”.

The main reason for this is that Ruby does not support true multi-threading, so puppetmasterd can only handle one connection at a time. This would limit the scale to just tens of clients, which is too low for our purposes.

The recommended way around this is to create an Apache proxy balancer with the Passenger extension, which receives all connections and hands them over to Puppet. There are documents that describe this for Red Hat and Debian. The configuration is quite complex, so I used Ubuntu 12.04, which provides a package containing the necessary configuration.

These steps were taken to install and set up the Puppet master with passenger:
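In outline, the setup looks something like the sketch below. The package name is the one shipped in Ubuntu 12.04, and the wildcard autosign rule matches the “accept all clients” test configuration (it is insecure and for testing only):

```
# Install the Puppet master behind Apache/Passenger (Ubuntu 12.04 package)
apt-get install puppetmaster-passenger

# Autosign all client certificates -- test convenience only, insecure
echo '*' > /etc/puppet/autosign.conf

# Reload Apache so Passenger picks up the configuration
service apache2 restart
```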

Results

Server-side CPU usage

These graphs were taken directly from Amazon CloudWatch at the master/server instance.

[Figure: CFEngine policy server CPU usage]
[Figure: Puppet master CPU usage]

From the graphs, we can see that the Puppet master instance uses about 10 times as much CPU at 50 clients. At 300 clients with 200 echo commands, the Puppet master uses about 18 times as much CPU.

The most interesting points, though, are not when we increase the client count, but when we increase the work per client in the policy/manifest.

This happens where we add 100 extra echo commands and the following 15 minutes when they are run. These areas are indicated by red lines in both graphs:

200 clients, 100 echo commands (until we go to 250 clients)

250 clients, 200 echo commands (until we go to 300 clients)

The reason this is more interesting is that users will probably extend the policy/manifest more often than they add nodes. How does changing the policy/manifest impact the server?

We can clearly see that the Puppet master is heavily impacted by changing the manifest (the load increases at the red lines), while the CFEngine policy server seems unaffected by the changes.

Client-side execution time

The data for the client execution time was captured with three runs of time /var/cfengine/bin/cf-agent -K and time puppet agent --onetime --no-daemonize, respectively. The resulting data files for CFEngine and Puppet are also provided. If we calculate averages, the comparison graph looks like the following (the ods file is available here).
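The averaging procedure is simple enough to sketch as a script. CMD below is a placeholder standing in for the actual agent invocation (/var/cfengine/bin/cf-agent -K or the puppet agent run); here it is a dummy sleep so the sketch is self-contained:

```shell
#!/bin/sh
# Run a command three times and report the average wall-clock time.
# CMD is a placeholder for the real agent command.
CMD="sleep 0.2"
total=0
for run in 1 2 3; do
  start=$(date +%s.%N)
  $CMD
  end=$(date +%s.%N)
  total=$(awk -v t="$total" -v s="$start" -v e="$end" 'BEGIN{print t + e - s}')
done
avg=$(awk -v t="$total" 'BEGIN{printf "%.3f", t / 3}')
echo "average: $avg seconds"
```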

At 50 hosts with just the Apache configuration, CFEngine agents run 20 times faster than Puppet agents. At 300 hosts, with 200 echo commands, CFEngine agents run 166 times faster than Puppet agents.

Note that the spikes at 200c,100e and 250c,200e are to be expected, since we added 100 more echo commands to the policy/manifest at these points. At 100c,a the Puppet agent had one very long run (as shown in the data files), which caused a spike there for Puppet.

The individual charts compare each tool to itself more easily — does the client execution time increase much when only the number of nodes increases?

For completeness, the client execution results are provided in tabular form below.

Environment    CFEngine time (seconds)    Puppet time (seconds)
50c,a          0.172                      3.427
100c,a         0.173                      19.24
150c,a         0.172                      3.63
200c,a         0.178                      3.63
200c,100e      0.481                      22.408
250c,100e      0.459                      32.56
250c,200e      0.742                      106.4
300c,200e      0.732                      121.86

Final remarks

It is clear that CFEngine is much more efficient and vertically scalable than Puppet. This is probably due to two factors:

Puppet’s architecture is heavily centralised: the Puppet master does a lot of work for every client, especially catalog compilation. In contrast, CFEngine agents interpret the policy in a distributed fashion; the CFEngine policy server is just a file server.

Puppet runs under the Ruby interpreter, which is far less efficient than running natively on the operating system.

The most interesting observation, I think, is that Puppet master and agent performance were heavily influenced by manifest complexity. When the manifest was small, increasing the agent count did not have much impact on agent performance. However, as the manifest grew, the performance of all the agents (and the master) degraded significantly. It is also evident that as the master gets more loaded, all the Puppet agents run slower. This, too, can probably be attributed to Puppet’s heavily centralised architecture.

It would be interesting to create a more real-world policy/manifest to explore this further. The manifest in this test did not have many dependencies, so the Puppet resource dependency graph was quite simple. If the dependency graph were more realistic, would that have had an impact on the test results?

This post was primarily created in response to feedback on an older post on the same matter. Please don’t hesitate to add your comments below!

Comments

The Puppet Way(tm) and DSL just seem easier to start using for most people. Also, for newcomers to programming, C has a steeper learning curve than Ruby. CFEngine altogether seems more academic (read: harder to understand and appreciate).

I’m not saying that either one is better. Just that the audiences are different.

I think ease-of-use is highly subjective. Learning a new DSL is always going to have some sort of learning curve. Indeed, the Puppet people have tried to position Puppet as easier than CFEngine 2 (upon which its ideas are based).

Very interesting post. I wasn’t aware of the heavily centralised approach to Puppet’s execution – it makes me wonder what happens if the master server goes down: can clients execute at all? I have many servers/devices with no or poor network connectivity (satellite links and such), and one of the few assurances I have that things are “OK” is the knowledge that they are running Cfengine3, which is keeping things in order.

Makes me feel a bit more validated in the decision to use Cfengine in the first place and stick with it over the years…

I think your measurements mostly capture the difference between Puppet’s “centralised and online” architecture and CFEngine’s “distributed and offline” approach. Have you thought about measuring against masterless Puppet, where manifests are distributed to your nodes and applied locally with no involvement of the server? Since that effectively cuts the Puppet master out of the equation and makes Puppet behave like CFEngine, I think what you’d be left with is the same near-infinite vertical scalability. After this, the difference in client execution times would just be the difference between the Puppet and CFEngine interpreters – a faster C program and a slower Ruby program.

Also, if I saw someone using a powerful tool like CFEngine or Puppet and all they were doing was echo statements, I’d sit them down for a serious talking-to. A more realistic use case might be to pick a complex service, say a LAMP stack with a few extras thrown in, then time how long each of them takes. I have no doubt CFEngine will be faster – because it just is – however the difference between them might not look so great under a real workload, say when your nodes are taking 15 seconds each to do yum/apt installs.

I did the tests with the architecture that is recommended for each tool (including the Apache/Passenger load balancer for Puppet), and I expect that this is how pretty much every user deploys the respective tools.

I could set up masterless Puppet and use scp/git or something to distribute manifest updates, but that was not the purpose of this test. I also believe this is not how most people use Puppet, so it would not be that interesting.

I totally agree that a more real-world policy/manifest would be more interesting (I also stated this at the end). However, I hope the yum/apt installs wouldn’t happen every time the tool runs (just the first), so the difference across multiple runs may still be similar. It would also be very interesting to see whether a more complex manifest resource graph would impact Puppet’s performance.

Thanks for taking the time to do a simple yet thorough comparison of both tools.

The heavy workload on the master with increased policy complexity does not surprise me, given that the puppetmaster does all of the distillation of the policy that needs to be applied, and doing that work for multiple clients concurrently really adds up.

The need to stand up all the web server infrastructure to serve policy always bothered me when I was using Puppet. My management tools are supposed to be lightweight, and I really appreciated the simplicity of CFEngine when I switched. One package, and an efficient built-in file server.

I too would be curious to see how a masterless puppet would compare. Even a comparison of the fileserver component to other typical policy transfer mechanisms (rsync, svn (svnsrv and http/https, git)) would be interesting. I think that one big advantage you will find with the cfengine server is that you gain flexibility to choose how different files are transferred and difference checked (encryption/hash/mtime). There are a lot of variations that could be compared just in the policy transfer mechanism itself.