December 23, 2014

Prologue: A Concept

One aspect of Chef that’s confusing to people comes up when searching for nodes that have
some attribute: just what is the difference between a node’s reported ‘role’ attribute,
and its ‘roles’ attribute? It seems like it could almost be taken for a typo – but underlying it
are some very deep statements about pluralism, pluralization, and the differences between them.

One definition of the term ‘pluralism’ is “a condition or system in which two or more states, groups, principles,
sources of authority, etc., coexist.”
And while pluralism is common in descriptions of politics, religion and culture,
it also has a place in computing: to describe situations in which many systems
are in more than one desired state.

Once a desired state is determined, it’s enforced. But then time passes – days,
minutes, seconds or even nanoseconds – and every moment has the potential to
change the server’s actual state.
Files are edited, hardware degrades, new data is pulled from external sources;
anyone who has run a production service can attest to this.

Act I: Terms

Businesses commonly offer products. These products may be composed of multiple systems, where each system
could be a collection of services, which run on any number of servers, which run on some number of hosts.
Each host, in turn, provides another set of services to the server that makes up part of the system, which then
makes up part of the product, which the business sells.

An example to illustrate: MyFace offers a social web site (the product), which may need a web portal,
a user authentication system, index and search systems, long-term photo storage systems, and many more.
The web portal system may need servers like Apache or Nginx, running on any number of instances.
A given server-instance will need to use any number of host services, such as I/O, CPU, memory and more.

In Days of Yore, when a Company ran a ‘Web Site’, they may have had a single System, maybe some web content Service,
made up of a web Server, a database Server (maybe even on the same host) - both consuming host services
(CPU, Memory, Disk, Network) - to provide the Service the Company then sells, hopefully at a profit (right!?).

Back then, if you wanted to enact a change on the web and database at the same time (maybe release a new feature),
it was relatively simple, as you could control both things in one place, at roughly the same time.

Intermission

In English, to pluralize something, we generally add a suffix of “s” to the word.
For instance, to convey more than one instance, “instance” becomes “instances”, “server” becomes “servers”,
“system” becomes “systems”, “turtle” becomes “turtles”.

We commonly use pluralization to describe the concept of a collection of similar items, like “apples”, “oranges”,
“users”, “web pages”, “databases”, “servers”, “hosts”, “turtles”. I think you see the pattern.

This extends even into programming languages and idiomatic use in development frameworks.
For example, a Rails application will typically pluralize the table name for objects named Apple to apples.

This emphasizes that the table in question does not store a singular Apple, rather many
Apple instances will be located in a table named apples.

This is not pluralism, this is pluralization - don’t get them confused. Let’s move on to the next act.

Act II: Progress

We’ve evolved quite a bit since the Days of Yore. Now, a given business product
can span hundreds or even thousands of systems of servers running on hosts all over the world.

As systems grow, it becomes more difficult to enact a desired change at a deterministic point in time across
a fleet of servers and hosts.

In the realm of systems deployment, many solutions perform what has become known as “test-and-repair” operations -
meaning that they are provided a “map” of the desired state (which typically manifests as human-written, human-readable code) that,
when executed, will “test” the current state of a given host and perform any “repair” operations needed to bring the host
to the desired state - whether that means installing packages or writing files.

Each system calls this map something different - cfengine:policies, bcfg2:specifications, puppet:modules, chef:recipes,
ansible:playbooks, and so on.
While they don’t always map 1:1, they all have some sort of concept for ‘Things that are similar, but not the same.’
Such hosts will have unique IP addresses and hostnames, while sharing enough common features to be termed something
like “web heads” or the like.

Act III: Change

In the previous sections, I laid the groundwork to understand one of the more subtle features in Chef.
This feature may be available in other services, but I’ll describe the one I know.

Using Chef, there is a common deployment model where Chef Clients check in with a Chef Server to ask
“What is the desired state I should have?” The Chef terminology is ‘a node asks the server for its run list’.

A run list can contain a list of recipes and/or roles. A recipe tells Chef how to accomplish
a particular set of tasks, like installing a package or editing a file.
A role is typically a collection of recipes, and maybe some role-specific metadata (‘attributes’ in Chef lingo).
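
To make the relationship between run lists, roles and recipes concrete, here is a minimal sketch in plain Ruby of how a run list might be expanded. This is a simplified model, not Chef’s actual implementation: the ROLES table and the expand_run_list helper are hypothetical names, and the role contents are an assumption that borrows the ‘base’ and ‘webhead’ example used later in this post.

```ruby
# Simplified model of run-list expansion; NOT Chef's real implementation.
# ROLES is a hypothetical lookup table standing in for roles stored on the Chef Server.
ROLES = {
  "base"    => ["recipe[base::packages]"],
  "webhead" => ["recipe[nginx]", "recipe[webapp]"]
}.freeze

# Walks a run list, recording every role seen and every recipe it resolves to.
def expand_run_list(run_list, roles = ROLES, seen = { roles: [], recipes: [] })
  run_list.each do |item|
    match = item.match(/\A(role|recipe)\[(.+)\]\z/)
    raise ArgumentError, "unrecognized run list item: #{item}" unless match

    type, name = match.captures
    if type == "role"
      next if seen[:roles].include?(name) # skip roles already expanded
      seen[:roles] << name
      expand_run_list(roles.fetch(name), roles, seen) # roles may nest other roles
    else
      seen[:recipes] << name unless seen[:recipes].include?(name)
    end
  end
  seen
end

expanded = expand_run_list(["role[base]", "role[webhead]"])
puts expanded[:roles].inspect   # ["base", "webhead"]
puts expanded[:recipes].inspect # ["base::packages", "nginx", "webapp"]
```

The point of the sketch is only that a short run list fans out into a longer, flattened list of roles and recipes once it is expanded.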

The node may be in any state at this point. Chef will test for each desired state, and take action to enforce
it: install this package, write that file, etc.
The end result should either be “this node now conforms to the desired state” or “this node was unable to comply”.
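
The test-and-repair cycle can be sketched in a few lines of plain Ruby. This is an illustration of the idea only, not how Chef implements its file resources; ensure_file and its return values are names I’ve made up for the example.

```ruby
require "tmpdir"

# Brings a file to the desired content, writing only when the test shows drift.
# Returns :repaired when it had to write, :up_to_date when nothing was needed.
def ensure_file(path, desired_content)
  current = File.exist?(path) ? File.read(path) : nil
  return :up_to_date if current == desired_content # test passed, take no action

  File.write(path, desired_content)                # repair
  :repaired
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "motd")
  puts ensure_file(path, "welcome\n")  # file is missing, so it gets written
  puts ensure_file(path, "welcome\n")  # already correct, nothing to do
  File.write(path, "edited by hand\n") # simulate drift between runs
  puts ensure_file(path, "welcome\n")  # drift detected and repaired
end
```

Running the same check repeatedly is safe: it only acts when the actual state has wandered away from the desired one.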

When the node completes successfully, it will report back to Chef Server that “I am node ‘XYZZY’,
and my roles are ‘base’ and ‘webhead’, my recipes are ‘base::packages’, ‘nginx’, ‘webapp’” along with a lot of node-specific metadata (IP addresses, CPU, Memory, Disk, and much more).

This information is then indexed and available for others to search for.
A common use case we have is where a load balancing node will perform a search for all nodes holding the webhead role, and add these to the balancing list.

Pièce de résistance, or Searching for Servers

In a world where we continue to scale and deploy systems rapidly and repeatedly, we often choose to reduce
the need for strong consistency amongst a cluster of hosts.
This means we cannot expect to change all hosts at the precise same moment.
Rather we opt for eventual consistency: either all my nodes
will eventually be correct, or failures will occur and I’ll be notified that something is wrong.

This changes how we think about deployments, and more importantly, how we use our tools to find other nodes.

Using Chef’s search feature, a search like this:

webheads = search(:node, 'role:webhead')

will use the node index (a collection of node data) to look for nodes with the webhead role in the node’s
run list. This will also return nodes that have not yet completed an initial Chef run and reported the complete
run list back to Chef Server.

This means that my load balancer could find a node that is still mid-provisioning, and potentially begin
to send traffic to a node that’s not ready to receive yet, based on the role assignment alone.

A better search, in this case might be:

webheads = search(:node, 'roles:webhead')

One letter, and all the difference.

This search now looks in the “expanded list” of roles that the node has reported back.
Any node with the role webhead that has completed a Chef run would be included.
If the mandate is that only webhead nodes get the webhead role assigned to them, then I can safely use this search to include nodes that have completed their provisioning cycle.

Another way to use this search to our benefit is to search one axis and compare with another to find nodes that never completed provisioning:

badnodes = search(:node, 'role:webhead AND NOT roles:webhead')

This will grab any nodes with an assignment but not a completion – very helpful when launching large numbers of nodes.
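
If you don’t have a Chef Server handy, the difference between the two fields can be modelled with a toy index in plain Ruby. Everything here is a stand-in: NODE_INDEX, names_where, the node names and the ‘database’ role are hypothetical, and real searches go through Chef’s search index rather than an array of hashes.

```ruby
# Toy stand-in for the node index. The "role" and "roles" keys mirror the
# search fields discussed above; the data and helper are hypothetical.
NODE_INDEX = [
  { "name" => "web1", "role" => ["base", "webhead"],  "roles" => ["base", "webhead"] },
  { "name" => "web2", "role" => ["base", "webhead"] }, # assigned, but has not reported back yet
  { "name" => "db1",  "role" => ["base", "database"], "roles" => ["base", "database"] }
].freeze

# Returns the names of nodes whose given field contains the value.
def names_where(index, field, value)
  index.select { |node| node.fetch(field, []).include?(value) }
       .map { |node| node["name"] }
end

assigned  = names_where(NODE_INDEX, "role", "webhead")  # analogous to 'role:webhead'
converged = names_where(NODE_INDEX, "roles", "webhead") # analogous to 'roles:webhead'
pending   = assigned - converged                        # assigned but never completed

puts "assigned:  #{assigned.inspect}"
puts "converged: #{converged.inspect}"
puts "pending:   #{pending.inspect}"
```

In this model a load balancer would want the converged list, while the pending list is the one to watch when a large launch is under way.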

Note: This is not restricted to roles; this also applies to recipe/recipes. I’ve used roles here,
as we use them heavily in our organization, but the same search patterns apply for using recipes directly
in a run list.

Curtain

This little tidbit of role vs roles has proven time and again to be a confusing point when someone tries to pick up more of Chef’s searching abilities.
But having both attributes describe a state of the node is helpful in making a determination
of what state the node is in, and whether it should be included in some other node’s list (such as in the
loadbalancer/webhead example from before).

Now, you may argue against the use of roles entirely, or the use of Chef Server and search, and use something else for service discovery.
This is a valid argument - but be careful you’re not tethering a racehorse to a city
carriage.
If you don’t fully understand its abilities, someday it might run away on you.

Epilogue

A surgeon spends a lot of time learning how to use a sharpened bit of metal to fix the human body.
While there are many instruments he or she will go on to master, the scalpel remains the fundamental tool, available when all else is gone.

While we don’t have the same risks involved as a surgeon, the tools we use can be more complex, and provide us
with a large amount of power at our fingertips.

It behooves us to learn how they work, and when and how to use their features to provide better systems and services for our businesses.

Chef’s ability to discern between what a node has been told about itself, and what it reports about itself, can make all the difference when using Chef to accomplish complex deployment scenarios and maintain flexible infrastructure as code.
This not only covers the fundamentals of service discovery and reduces hard-coded configuration, but also lets you avoid the uncertainty of bringing in yet another
outside tool.