UpGuard was created to answer the fundamental questions of configuration management: how are my systems configured, are they configured correctly, what's changed since yesterday, what's for lunch: the stuff you absolutely need to know. In its first release, UpGuard satisfied the first three by scanning and recording configuration state, continuously testing with policies, and giving users the ability to diff configuration state over time or between nodes. But one thing was missing: the ability to diff a whole group of nodes at once.

The Challenge

Defining what "group differencing" even meant was the first challenge. In a one-to-one comparison, each item is in one of a known number of states: it exists only on one node, only on the other, on both but with different values, or on both and identical. A group difference expands the number of possible states as a function of the number of nodes, turning a two-dimensional problem into an n-dimensional one. Even if all that information might be useful, we needed to simplify it into terms that humans could understand.
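The four states of a one-to-one comparison can be sketched as a small classifier. This is an illustrative example, not UpGuard's actual code; the function and item names are hypothetical.

```python
def classify_item(key, left, right):
    """Classify one configuration item between two nodes.

    left and right are dicts mapping item names to observed values.
    Returns one of the four possible states of a one-to-one diff.
    """
    if key not in left:
        return "only_on_right"   # item exists only on the other node
    if key not in right:
        return "only_on_left"    # item exists only on this node
    if left[key] != right[key]:
        return "differs"         # on both, but with different values
    return "identical"           # on both, and the same

# Hypothetical configuration state for two nodes.
node_a = {"ssh.port": 22, "ntp.server": "pool.ntp.org"}
node_b = {"ssh.port": 2222, "selinux": "enforcing"}

# Walk the superset of items and classify each one.
for key in sorted(set(node_a) | set(node_b)):
    print(key, "->", classify_item(key, node_a, node_b))
```

With more than two nodes, each item can land in many more combinations of these states at once, which is where the simple classification breaks down.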

Make it Useful

Next we turned to math for help. There are formulas and algorithms for calculating variance, but they wouldn't solve the real problem at hand. Our end goal wasn't to measure variance; our goal was to provide the information needed to reduce it. Finding that one-in-a-billion instance might be cool if you're studying the fish in Lake Springfield, but if you're looking at your environment configurations, your first thought is to fix it before it breaks. The output of a group diff shouldn't be a number telling you exactly how deep you're in trouble. It should be a map telling you how to get out of it.

The Solution

After a week of prototyping we arrived at a solution that made sense. We would assemble a superset of all configuration items (CIs) in the group, grade each CI on how many nodes agreed it was the right configuration, and present the findings as a heat map.

Measuring "how different" an item is across multiple nodes has its own challenges. Each configuration item can have multiple attributes, which means that a set of several nodes can disagree about a configuration item in several different ways. One might lack the CI entirely, while three others have different versions. The more attributes, the more permutations of difference per CI.

Given that we're more interested in discovering points of interest than in precise but useless calculations, we created an algorithm inspired by the Raft election metaphor. Each node votes for what it thinks is the correct version of each configuration item. Where there is a unanimous opinion, all is well. Where there is disagreement, we count how many nodes voted for the most popular version of the item and compare that to the total number of nodes. This yields a continuum from items with high consensus (many nodes voted for the same version) to items with low consensus (each node voted for a different version).
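The voting idea above can be sketched in a few lines. This is a minimal illustration of the technique as described, not UpGuard's implementation; it assumes a missing CI counts as a vote for "absent," and all names are hypothetical.

```python
from collections import Counter

def consensus_scores(nodes):
    """Score each configuration item by node agreement.

    nodes: a list of dicts, each mapping CI name -> observed version/value.
    Returns {ci_name: score}, where score is the fraction of nodes that
    voted for the most popular version (1.0 means unanimous).
    """
    all_items = set().union(*nodes)  # superset of every CI seen in the group
    scores = {}
    for item in all_items:
        # Each node votes with the version it actually has.
        votes = Counter(node.get(item, "<absent>") for node in nodes)
        top_votes = votes.most_common(1)[0][1]
        scores[item] = top_votes / len(nodes)
    return scores

# Eight nodes that should be identical; node 7 has drifted.
baseline = {"openssl": "1.1.1k", "ssh.port": 22}
nodes = [dict(baseline) for _ in range(8)]
nodes[7]["openssl"] = "1.0.2u"

print(sorted(consensus_scores(nodes).items()))
# -> [('openssl', 0.875), ('ssh.port', 1.0)]
```

A score near 1.0 means high consensus; the floor, 1/n, means every node voted for a different version. Low-scoring items are exactly the hot spots a heat map would highlight.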

In this example we have eight nodes that are supposed to be configured the same, and for the most part they are. As it turns out, only one node is misconfigured. Using a one-to-one comparison method, it would take several attempts and a good bit of luck to find the misconfigured node, and even then we wouldn't have any guarantee that there weren't other misconfigurations hiding in the group. Here, in one view, we see all the configuration items, whether they are identical across the nodes, and, when they are not, how they differ.

What's Next

The variance report contains a lot of information and can be overwhelming, especially in environments with many configuration inconsistencies. We are trying to find new ways to simplify the presentation without losing the depth of information. If you want to try it for yourself and see how well your environments match up to your expectations, you can request an UpGuard demo by clicking the button below.