Saturday, June 02, 2018

Immutable infrastructure is a pattern for changing and evolving processes without modifying the running code, the actual configuration, or the base components (libraries, auxiliary processes, OS configuration, OS packages, etc.). In short, we avoid any manual or automated configuration change to running systems.
But if we don't allow changes, how can we evolve our applications and services? Easy: replace the whole piece (function, container, machine) in one shot...

So to make any kind of change (code, configuration, OS) we don't connect to the target box to execute commands; we "cook" a new artifact (container image or machine image) and use it to create a new instance (container or machine) with the changes. In the past, the cost and time of creating a new instance were huge, so to optimize the process we tended to execute the minimal change needed to run the new version. That is, we used a manual or automated process to ssh into the box, install the needed packages, change the configuration, update the code, etc.

But...
What happens when the ssh connection dies in the middle of an update?
What if a package fails to install?
How can we be sure about the actual state of a machine?
How can we calculate the changes to execute if we are not sure about the actual state of a machine?

Making changes to a machine over ssh is not a transactional operation, so the process can fail halfway and leave the machine in an unknown state.

The solution, Immutable Infrastructure...
Create a new artifact and run it. Without intermediate states. Only "Not Ready" or "Ready". Simple. And if something is wrong, destroy the artifact and try again.
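
As a minimal sketch of this "create, check, destroy and retry" loop in Python: the function name `launch_until_ready` and the three callables `create`, `is_ready` and `destroy` are hypothetical placeholders for whatever your platform offers, not a real API.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def launch_until_ready(
    create: Callable[[], T],
    is_ready: Callable[[T], bool],
    destroy: Callable[[T], None],
    max_attempts: int = 3,
) -> Optional[T]:
    """Return a ready instance, or None if every attempt failed."""
    for _ in range(max_attempts):
        instance = create()  # always built from the artifact, never patched
        if is_ready(instance):
            return instance
        # Never repair a broken instance in place: discard it and start over.
        destroy(instance)
    return None
```

Note that there is no "fix" branch: an instance is either returned as ready or destroyed.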

This pattern is at the core of the main container-orchestration systems and PaaS offerings (Kubernetes, Swarm, Mesos, Heroku, OpenShift, Deis, etc.).

Why it is a good idea

Simplicity. It is an order of magnitude easier to destroy a resource and create a new one from scratch than to calculate the deltas to apply and execute them.

If we need scalability, we need to support this pattern anyway.

Right now it is easy to implement with the help of the various cloud and PaaS providers.

Very easy to return to the previous version.

Very easy to troubleshoot a problem, because there are no intermediate states. You can know the exact content of a running version (the OS state, the concrete configuration, the concrete code, etc.).

With this approach, there is no difference between languages and runtime environments (Python, JRuby, Java, all the same...). It is a general solution for all the technologies your systems require.

What are the downsides

In some cases, the bootstrapping/setup time of a new machine is longer than the time needed to modify an existing one. But this time is improving continuously, and if we have a scalable architecture we are already dealing with it. Another solution is to use the pattern at a different level, for example, functions or containers instead of machines.

This pattern requires more and longer steps than the classic approach, so it is not practical without automation. But not automating this kind of task is shooting yourself in the foot anyway.

Implementation Samples

As a general implementation strategy, we need to be able to perform the following steps:

Start a new running process (without processing workload).

Detect when this running process is ready to accept real workload.

Connect the running process to start processing workload.

Disconnect the old running process to stop receiving new workload.

The old running process completes the workload that it already has.

Detect that all the pending workload of the old process is completed.

Destroy the old process.
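
The steps above can be sketched in Python roughly as follows. The `Replaceable` interface, its method names, and the `replace` and `wait_until` helpers are all hypothetical; a real platform exposes its own equivalents. Destroying the old process once the drain timeout expires is also an assumption made here for brevity.

```python
import time
from typing import Callable, Protocol


class Replaceable(Protocol):
    """Hypothetical interface a process must offer to be replaceable."""

    def start(self) -> None: ...
    def is_ready(self) -> bool: ...
    def connect(self) -> None: ...
    def disconnect(self) -> None: ...
    def is_drained(self) -> bool: ...
    def destroy(self) -> None: ...


def wait_until(condition: Callable[[], bool], timeout: float, interval: float) -> bool:
    """Poll condition until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def replace(old: Replaceable, new: Replaceable,
            timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Swap old for new; return False if new never became ready."""
    new.start()  # start the new process without workload
    if not wait_until(new.is_ready, timeout, interval):
        new.destroy()  # not ready: discard it, the old process keeps working
        return False
    new.connect()     # the new process starts receiving workload
    old.disconnect()  # the old process stops receiving new workload
    # Let the old process finish its pending workload. Assumption: once the
    # timeout expires we destroy it anyway.
    wait_until(old.is_drained, timeout, interval)
    old.destroy()
    return True
```

The old process is only touched after the new one has reported ready, so a failed rollout leaves the running system as it was.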

If we talk about web servers the steps can be:

Start a new machine/container running the web server in an internal endpoint.

Detect when the web server is ready using the health check endpoint.

Connect the new web server internal endpoint to the load balancer with the external endpoint.

Tell the old web server to stop accepting new traffic and disconnect it from the load balancer.

Detect when the old web server has finished its in-flight requests.

Destroy the old web server.
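
For the web server case, the health check and the load balancer switch could look roughly like this in Python. The `LoadBalancer` class is an in-memory stand-in, not a real load balancer API, and `http_ok`, `swap_backend`, the endpoint strings and the health URL are illustrative assumptions. Starting the new server and draining and destroying the old one are left to the platform.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, 4xx/5xx: the server is not ready.
        return False


class LoadBalancer:
    """Hypothetical in-memory stand-in for a real load balancer API."""

    def __init__(self) -> None:
        self.backends: set = set()

    def register(self, endpoint: str) -> None:
        self.backends.add(endpoint)

    def deregister(self, endpoint: str) -> None:
        self.backends.discard(endpoint)


def swap_backend(lb: LoadBalancer, old_endpoint: str, new_endpoint: str,
                 health_url: str, probe: Callable[[str], bool] = http_ok,
                 timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Route traffic to new_endpoint once its health check passes."""
    deadline = time.monotonic() + timeout
    while not probe(health_url):
        if time.monotonic() >= deadline:
            return False  # never became healthy: the load balancer is untouched
        time.sleep(interval)
    lb.register(new_endpoint)    # the new server starts receiving traffic
    lb.deregister(old_endpoint)  # the old server stops receiving new traffic
    return True
```

Registering the new endpoint before deregistering the old one means there is always at least one backend behind the external endpoint.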

If we talk about a background queue job processor the workflow can be:

Start a new machine/container running the background queue job processor.

Detect that the new machine is processing jobs from the queue.

Tell the old background queue job processor to stop fetching new jobs.

Detect when the old background queue job processor has completed its pending work.

Destroy the old background queue job processor.
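
On the worker side, the cooperation this workflow needs might look like the following Python sketch. It assumes an in-process `queue.Queue` as the job queue; the `Worker` class and its method names are hypothetical, and a real job processor would talk to an external queue instead.

```python
import queue
import threading
from typing import Callable, Optional


class Worker:
    """Hypothetical job processor that can be asked to stop fetching jobs."""

    def __init__(self, jobs: queue.Queue, handle: Callable[[object], None]) -> None:
        self._jobs = jobs
        self._handle = handle
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop_fetching(self) -> None:
        """Ask the worker not to take any new job from the queue."""
        self._stop.set()

    def wait_until_drained(self, timeout: Optional[float] = None) -> bool:
        """Return True once the worker has finished its pending job and exited."""
        self._thread.join(timeout)
        return not self._thread.is_alive()

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                job = self._jobs.get(timeout=0.1)
            except queue.Empty:
                continue
            self._handle(job)  # a job already taken is always completed
            self._jobs.task_done()
```

Jobs the old worker never fetched simply stay in the queue, where the new job processor picks them up.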

We can devise similar steps for other kinds of processes.

Design Notes

As we can see, this pattern requires collaboration from our code and support from the platform, but I assume that this is part of designing scalable systems that make good use of the capabilities of the cloud. For a good reference on designing applications for the cloud, please read The Twelve-Factor App.

We can apply the same pattern at different levels: virtual machines, containers, functions... The ideas are the same but the granularity is different.