CFEngine's Decentralized Approach to Configuration Management

Configuration management is the foundation that makes modern infrastructure possible. Tools that enable configuration management are required in the toolbox of any operations team, and many development teams as well. Although all the tools aim to solve the same basic set of problems, they adhere to different visions and exhibit different characteristics. The issue is how to choose the tool that best fits each organization's scenarios.

This InfoQ article is part of a series that aims to introduce some of the configuration tools on the market, the principles behind each one and what makes them stand out from each other.

CFEngine was born in the early 1990s as a way of creating and maintaining complex requirements over the very diverse operating systems of the day, in a hands-free manner. Today, the landscape is very different, with far fewer operating systems to worry about, but the key challenges are still the same. According to our reckoning, there are still three challenges that IT faces over the coming decade: scale, complexity and knowledge.

CFEngine is the all-terrain vehicle of automation software, and it has gone through many variations since it was released in 1993. It helped pioneer self-repairing automation and desired-state technology. After five years of extensive research, it was completely rewritten in 2008 (as CFEngine 3) to capture the lessons learned over its then 15 years of history. During the 2000s CFEngine 2 was very widespread and was involved in the growth of some of the major players like Facebook, Amazon and LinkedIn. Indeed that legacy is still with us in many more companies, but today's world needs a more sophisticated tool, hence CFEngine 3 was written.

The landscape today has evolved as IT has become a platform for global business, woven into the fabric of society: change happens faster, online services are ubiquitous, and developers play a larger role in steering operational issues. They represent the generators of business value in the modern economy. The now infamous concept of WebScale is not just about housing massive numbers of computers in the cloud; it means going beyond deploying (virtualized) boxes and operating systems to deploying entire software stacks, including networking, across environments from massive datacenters to tiny mobile devices. Here are the challenges:

Scale - CFEngine was designed to run with the smallest hardware footprint in the largest possible numbers, in a fundamentally decentralized way. By avoiding the need for centralization (though not dismissing the possibility), it allows management of hundreds of thousands of hosts from a single model (one CFEngine 3 user reports 200,000 hosts under CFEngine management). The rise of mobility further means that IT management has to be performed across partially connected environments with changing address spaces. All of this leads to increased complexity.

Complexity - Scale is not the only cause of complexity. Strong couplings or dependencies between systems are one of the bad habits of classical IT design. Strong dependence means that a failure in one part of a system is transmitted quickly to the rest of the system, leading to Byzantine failures. Part of the research that went into designing CFEngine 3 includes a very simple model for avoiding strong dependencies, called Promise Theory. Indeterminism in systems can no longer be ignored and papered over by brute force patching. Systems need to be built to support it as an inevitable reality of operations. (For a full introduction to the issues, see Mark Burgess's In Search of Certainty; see also the InfoQ review.) The challenge is that complexity makes comprehending systems hard.

Knowledge - What we truly crave of a complex infrastructure is knowledge -- to know and understand what assets we have and how well they are delivering business value. Hiding complexity sounds fair until something goes wrong; then it becomes a nightmare. Auditors too need to peer into systems to hold them accountable to standards of security and safety. Compliance with public regulators is a major issue that few automation schemes can address plausibly. CFEngine was designed to handle this `from the ground up', using its model of keeping promises. The issue of insight into resources and processes goes far deeper than this, however. As complexity rises, our ability to comprehend the monsters we create in these virtualized laboratories is diminishing rapidly --- and `software-defined-everything' can only help if it is based on a clear model of intent that is verifiable. CFEngine's latest version, released last week, gives special attention to solving this challenge. For instance, an improved dashboard and detailed inventory reports are made possible by CFEngine's knowledge-oriented principles.

The easier we try to make management through deployment of commodity boxes, the less visibility into the details we have.

Fundamentally decentralized and knowledge-oriented

Let's take a moment to understand these aspects of CFEngine. CFEngine decentralizes management in the following way. Every device runs a copy of the CFEngine software. This includes a lightweight agent for making targeted changes, and some helper programs like a server and scheduler, totaling a few megabytes. Each device can, in principle, have its own separate policy determined by the owner of that device. An agent cannot be forced into submission by an external authority. Thus policy is fundamentally federated.

In practice, however, agents often adopt a policy of following an external authority's guidance voluntarily, accepting updated policies from a single coordination point. Each agent can take as much or as little as it wants from a trusted source. To avoid bottlenecks associated with centralization, each host caches the policy it downloads so that it is never dependent on being able to talk to the coordination hub. All computation, reasoning and change is performed by each agent in a fully decentralized way, based on this policy. Thus, distribution of policy works in one of two ways: by federation and by caching. No device is ever strongly dependent on any resource it does not own.
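The caching step can be sketched roughly as follows. This is an illustration, not the update policy that ships with CFEngine. The bundle and body names are hypothetical, and `/var/cfengine/masterfiles` is assumed to be the directory the hub serves. The hub must also grant access to that directory in its server policy, which is not shown.

```cf3
bundle agent cache_policy_from_hub
{
  files:
    any::
      # Local cache of policy; the agent always evaluates this copy.
      "$(sys.workdir)/inputs"
        copy_from    => policy_from_hub("/var/cfengine/masterfiles"),
        depth_search => whole_tree,
        comment      => "Refresh the cached policy when the hub is reachable";
}

# Hypothetical body: copy from the hub this host was bootstrapped to.
body copy_from policy_from_hub(source_dir)
{
      source  => "$(source_dir)";
      servers => { "$(sys.policy_hub)" };
      compare => "digest";
}

# Hypothetical body: descend through the whole directory tree.
body depth_search whole_tree
{
      depth => "inf";
}
```

If the hub cannot be reached, this promise is not kept on that run and the agent carries on with the policy it already holds, which is the property described above.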

This also leads to the claim that CFEngine is knowledge-oriented. Although we sometimes confuse knowledge with available information, knowledge is really about our level of certainty about information. As humans we say we know someone (like a friend) if we communicate with them regularly and learn to understand their behaviours and habits. This allows us to form expectations so that we can tell when something is wrong. CFEngine uses machine learning to characterize machine behaviours.

Similarly, we say that we know a skill if we practice it often. CFEngine's model of promises defines states that it revisits and checks every few minutes in order to verify whether they have changed. CFEngine manages persistent or knowable state; it does not merely change one state into another unexpectedly. It classifies the environment it learns into types (like operating system, disk and runtime integrity, performance levels, etc.) and we use these characteristics in defining policy. Thus a CFEngine policy is based on what we believe we can expect, rather than just what we want.

Knowledge is a documented relationship, a feedback loop that we revisit regularly. By having a continuous and on-going relationship with every promised resource, based on its model, CFEngine knows the state of the system (like a friend), because it regularly checks in and says: how are you?

Marrying intent with outcome

To close the loop between what we intend for our IT systems and what actually happens, CFEngine uses a desired state model. Many people have likened CFEngine to a rather sophisticated Makefile in the sense that, instead of focusing on what to do next, you focus on the desired end state that you want to achieve. The target (or the maker of the promise according to Promise Theory) is the object in focus, and our goal is to describe its desired state.

The design goals of the CFEngine `engine' are the four S's: scale, speed, security and stability. Today, CFEngine is unparalleled in these areas, across platforms from hand-held Android devices to mainframes and global datacentres. Moreover, we take it for granted that everything you can express in CFEngine is `convergent', i.e. idempotent and always leading to a correct desired outcome.

In terms of the three challenges above, CFEngine's goal has been to lead the way in researching solutions to them. Simplicity is not the same as ease: if we make complex things too easy, we can quickly get into a state we don't understand. This is one of the main reasons people seek out CFEngine and knowledge-oriented solutions today.

The rise of DevOps has emphasized the human aspects of integrating automation into our workflows, and we think this is crucial. We need to understand why we do it. Automation is only meaningful in the hands of clear human intentions. The goal is not to remove humans from the loop, simply to take away the buttons and levers that lead to accidents due to lack of awareness or diligence. Human faculties are limited, and consuming necessary situational knowledge without automation is no longer plausible.

The current tendency for encouraging programmability through APIs puts a lot of power in the hands of developers. However, this cannot be an answer in itself. Developers also need to delegate, and often have the wrong expertise for operational decisions. Programmability opens businesses up to a potential minefield of incorrect reasoning, spurred on by power tools. Engineering of fundamentally safe systems has to be a goal for systems society can rely on. The aim of CFEngine is to minimize the amount of reasoning in a system and simply provide a defined outcome. In many ways, CFEngine is like cascading style-sheets (CSS) but for devices: data-driven promises about a desired state.

The challenge facing all the automation frameworks today, including CFEngine, is to find a simple way to unify the stories we want to tell about our requirements with their outcomes. The dilemma is that while we are building, we focus on a very different issue: how to climb the mountain. When doing the post-mortems after failures, we are trying to figure out how to climb down again. If we knew more about what was intended, these two stories could come together in a more meaningful cycle of continuous improvement, simply by planning ahead.

Example:

CFEngine installs a small agent of a few megabytes on every device. Each agent looks at a common policy that can be distributed amongst the agents. A CFEngine policy is made from bundles of `promises'. Here is a promise to report a message:
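A minimal sketch of such a promise might read as follows. The bundle name `hello_world` is hypothetical, and the `body common control` block is included only so that the file can be run on its own.

```cf3
body common control
{
      # Tell the agent which bundles to verify, and in what order.
      bundlesequence => { "hello_world" };
}

bundle agent hello_world
{
  reports:
    any::
      # The quoted string is the promiser: the message we promise to report.
      "Hello world from $(sys.fqhost) with address $(sys.ipv4) at $(sys.date)";
}
```

Saved as, say, `hello_world.cf`, it can be tried with `cf-agent -K -f ./hello_world.cf`, where `-K` ignores the time locks that would otherwise suppress a repeated run.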

The word `bundle' refers to the fact that the curly braces gather together a bundle of promises. The word agent denotes that this bundle of promises is kept by the CFEngine agent, i.e. not by the server or the scheduler. The word `reports' denotes the type of promise, and the `hello world' string is formally the desired outcome or the promise to be kept.

The `sys' variables expand to the fully qualified hostname, the IP address and date for the host keeping the promise at the moment of verification. CFEngine verifies whether these promises are kept (and usually takes measures to keep them) every five minutes, by default. We could add to this a promise to install some software, like a web-server, just on certain classes of machines:
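One way to write this, as a sketch only, uses the `package_module` form of newer CFEngine 3 releases (older releases expressed the same idea with `package_method` bodies). It assumes that the standard library is installed under `$(sys.libdir)` and provides the `apt_get` package module, that a class named `cumulus` is defined on Cumulus Linux hosts, and that the package is called `openlldp`. The bundle name is hypothetical.

```cf3
body file control
{
      # Assumption: the standard library lives here and defines `apt_get`.
      inputs => { "$(sys.libdir)/stdlib.cf" };
}

bundle agent software
{
  packages:
    ubuntu::
      # Only hosts in the `ubuntu` context keep this promise.
      "apache2"
        policy         => "present",
        package_module => apt_get,
        comment        => "Web server on Ubuntu hosts";

    cumulus::
      # Assumed class and package names; "latest" asks for the newest version.
      "openlldp"
        policy         => "present",
        version        => "latest",
        package_module => apt_get,
        comment        => "Keep OpenLLDP up to date on Cumulus Linux";
}
```

Hosts that belong to neither class skip both promises without error, so the same file can be distributed everywhere.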

Now, wherever CFEngine runs, whether it be a small handheld phone, a virtual machine on your laptop, or a server in a datacenter, CFEngine will ask: am I an Ubuntu system? If so, make sure the apache2 software package is installed. On Cumulus Linux systems, it would ensure that OpenLLDP was up to date. What actually happens to keep that promise can be configured as much or as little as you want as you drill into the details. The same policy works on every device in the fleet, because CFEngine knows about context and adapts promises to the targeted environments.

This is what we mean by orchestration. Just as the players in an orchestra only play their own part of the total score, so each agent only plays its role. Orchestration is about sharing the plan and delegating roles, not about remote control from a central place.

At a higher level, we can describe the storyline of our intended state in terms of more descriptive encapsulations. CFEngine `methods' are bundles of promises that can be `called up', by name, in a particular context, i.e. they can be re-used like subroutines, possibly with parameters. Methods are the entry-point mechanism by which bundles of promises may be verified in a sequential storyline, more like classical imperative programming, but still in a continuously revisited feedback relationship at the atomic level. Each promise is convergent, idempotent and standalone, but attains a meaning within the whole by the storyline we build around it.
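As a small illustration of a parameterized method, the sketch below calls one bundle twice with different arguments. All bundle names, body names and paths are hypothetical.

```cf3
bundle agent publishing_layout
{
  methods:
    any::
      # Each promise calls the same bundle with a different parameter.
      "staging area" usebundle => ensure_directory("/srv/publish/staging");
      "output area"  usebundle => ensure_directory("/srv/publish/output");
}

bundle agent ensure_directory(path)
{
  files:
    any::
      # The trailing "/." tells CFEngine that the promiser is a directory.
      "$(path)/."
        create => "true",
        perms  => octal_mode("755");
}

# Hypothetical helper body: set permissions from an octal string.
body perms octal_mode(m)
{
      mode => "$(m)";
}
```

The methods are verified in the order written, but each underlying `files` promise is still checked and repaired independently on every run.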

CFEngine services, on the other hand, are also implemented as promise bundles. These represent persistent and ever-present operating system services. The underlying mechanism is the same, but the semantics of description are slightly different, mainly for readability.
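A sketch of how services and methods might sit side by side for a publishing environment is given below. It assumes the standard library is loaded (it supplies the default bundle that `services` promises call), that the services are named `ssh` and `apache2` on the hosts in question, and that a build slave is marked by a hypothetical file. All bundle names are illustrative.

```cf3
body file control
{
      # Assumption: the standard library provides the default services bundle.
      inputs => { "$(sys.libdir)/stdlib.cf" };
}

bundle common roles
{
  classes:
      # Hypothetical marker file that designates a host as a build slave.
      "build_slave" expression => fileexists("/etc/publishing/build-slave");
}

bundle agent publishing
{
  services:
    any::
      # Every host in the environment keeps these running.
      "ssh"     service_policy => "start";
      "apache2" service_policy => "start";

  methods:
    build_slave::
      # Only build slaves take on the extra role.
      "build content" usebundle => build_xhtml;
}

bundle agent build_xhtml
{
  reports:
    any::
      # Placeholder for the promises that would construct the content.
      "Build promises for $(sys.fqhost) would be verified here";
}
```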

Thus, while all hosts in a publishing environment would run the Web service and SSH, only a build slave would keep promises to automatically construct XHTML content from source materials for publishing. (CFEngine can perform sophisticated editing of files that goes far beyond sed or awk, much more efficiently and in a convergent way.) Editing of text files is a surprisingly common requirement of automation. Software systems (like publishing format translators) don't always do exactly what we want of them. We find ourselves patching up files, modifying style-sheets that were generated by one tool before feeding into another, and so on. Naturally, CFEngine does this in a convergent manner so `insert_lines:' really means `convergently insert lines if they are not already present in the modeled context'.
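A sketch of such an edit follows. The file path, the inserted lines and all bundle and body names are hypothetical; the two helper bodies are written out here so that the fragment does not depend on a library. The relative order in which the three lines end up at the top of the file depends on how the agent applies the location body, so check that before relying on it.

```cf3
bundle agent style_fixes
{
  files:
    any::
      # Tie the edit bundle below to a particular file.
      "/srv/publish/output/style.css"
        edit_line => patch_css;
}

# Hypothetical body: insert before the first line of the file.
body location at_start
{
      before_after => "before";
}

# Hypothetical body: replace every occurrence of the matched pattern.
body replace_with literal_value(x)
{
      replace_value => "$(x)";
      occurrences   => "all";
}

bundle edit_line patch_css
{
  insert_lines:
      # Each line is inserted only if it is not already present in the file.
      "@import url(base.css);"   location => at_start;
      "@import url(layout.css);" location => at_start;
      "@import url(print.css);"  location => at_start;

  replace_patterns:
      # Desired state: no occurrence of `monospace` remains.
      "monospace" replace_with => literal_value("serif");
}
```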

This bundle, when tied to a file, consists of three promises of type `insert_lines' and one promise of type `replace_patterns'. The desired outcomes are that the quoted lines should be inserted at the start of the CSS file if they do not already exist somewhere (order is important in CSS parsing). Similarly, the desired outcome of the `replace_patterns' promise is to have no instances of the font-name `monospace' in the style-sheet. We promise to replace any such instances with the font `serif'. This is a convergent operation.

If we pay attention to writing pedagogically, addressing the knowledge challenge by aiming for readability, then a CFEngine configuration becomes executable documentation for the system.

Engineering for the future

There are many tools one could use for automation, but CFEngine is unique in its distributed model of operation. It embodies and integrates many aspects of the tools one needs to deploy software and infrastructure quickly and safely, and it is robust in the most mission critical of environments. CFEngine allows autonomy, cooperation, direct secure file copying from point to point for decentralized sharing; it can manage routing and networking services as well as server-based systems, and it runs disconnected in embedded devices as well as in massive datacenters.

CFEngine is used in some of the most demanding environments on the planet. Our goal has been to design not merely a tool but a systematic approach to maintaining the software stack for the coming decade. It is based on state-of-the-art research and tried-and-tested techniques so that an investment in infrastructure does not become a legacy issue as soon as it is deployed. We believe that self-healing system state should be based on a minimum of programming. CFEngine's model of promises achieves that. What users get for free is immediate and continuous measurements of compliance based on a documented model of intent, without the need for independent monitoring.

The CFEngine community co-exists in a lively and growing arena of automation solutions. We are always looking to extend this base of expertise and viewpoints in our community, and address the challenges from small to large in day-to-day operations. The CFEngine community features many passionate engineers who manage some of the most impressive (and sometimes secretive) installations on the planet in terms of size, complexity and knowledge.

CFEngine is free to download, and has Enterprise grade enhancements.

About the Authors

Mark Burgess is the CTO and Founder of CFEngine, formerly professor of Network and System Administration at Oslo University College, and the principal author of the Cfengine software. He is the author of numerous books and papers on topics ranging from physics and Network and System Administration to fiction.

Diego Zamboni is a computer scientist, consultant, author, programmer and sysadmin who works as Senior Security Advisor and Product Manager at CFEngine. He has more than 20 years of experience in system administration and security, and has worked in both the applied and theoretical sides of the computer science field. He holds a Ph.D. from Purdue University, has worked as a sysadmin at a supercomputer center, as a researcher at the IBM Zurich Research Lab, and as a consultant at HP Enterprise Services. He is the author of the book "Learning CFEngine 3", published by O'Reilly Media.
