Patching Docker Containers – The Balance of Secure and Functional

PaaS, IaaS, SaaS, CaaS, …

The cloud is evolving at a rapid pace. We have increasingly more options for how to host and run the tools that empower our employees, customers, friends and family.
New apps depend on the capabilities of underlying SDKs, frameworks, services and platforms, which in turn depend on operating systems and hardware. At each layer of this stack, things are constantly moving. We want and need them to move and evolve. And, while our “apps” evolve, bugs surface. Some are simple. Some are more severe, such as the dreaded vulnerability that must be patched.
We’re seeing a new tension where app authors, companies, enterprises want secure systems, but don’t want to own the patching. It’s great to say the cloud vendor should be responsible for the patching, but how do you know the patching won’t break your apps? Just because the problem gets moved down the stack to a different owner doesn’t mean the behavior your apps depend upon won’t be impacted by the “fix”.
I continually hear the tension between IT and devs. IT wants to remove a given version of the OS. Devs need to understand the impact of IT updating or changing their hosting environment. IT wants to patch a set of servers and needs to account for downtime. When does someone evaluate if the pending update will break the apps? Which is more important: a secure platform, or functioning apps? If the platform is secure, but the apps don’t work, does your business continue to operate? If the apps continue to operate, but expose a critical vulnerability, there is many a story of a company that failed as a result.

So, what to do? Will containers solve this problem?

There are two layers to think about: the app and the infrastructure. We’ll start with the app layer.

Apps and their OS

One of the major benefits of containers is the packaging of the app and the OS. The app can take dependencies on behaviors and features of a given OS. Developers package it up in an image, put it in a container registry, and deploy it. When the app needs an update, the developers write the code, submit it to the build system, test it – (that’s an important part…) and if the test succeeds, the app is updated. If we look at how containers are defined, we see a lineage of dependencies.
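An overly simplified version of our app's Dockerfile lineage may look something like this: the mycriticalapp image builds FROM a dotnet image, which in turn builds FROM a Debian image.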

At any point, one of these images may get updated. If the updates are functional, the tags should change, indicating a new version that developers can opt into. However, if the update fixes a vulnerability or some other bug, it is applied using the same tag, and notifications are sent between the different registries indicating the change. The Debian image takes an update. The dotnet image takes the update and rebuilds. The mycriticalapp gets notified, rebuilds and redeploys; or should it?
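
To make the cascade concrete, here is a minimal sketch of how a build system might work out which images need rebuilding when a base image takes an update. The image names mirror the example above; the `BASE_OF` mapping and the `images_to_rebuild` function are hypothetical, not part of any registry's API, and tags are left out for brevity.

```python
from collections import deque

# Hypothetical lineage: each image maps to the base image it builds FROM.
BASE_OF = {
    "dotnet": "debian",
    "mycriticalapp": "dotnet",
}


def images_to_rebuild(updated_image, base_of=BASE_OF):
    """Return the downstream images, in the order they should rebuild."""
    # Invert the mapping: base image -> images built directly on top of it.
    children = {}
    for image, base in base_of.items():
        children.setdefault(base, []).append(image)

    # Walk the lineage breadth-first so a base always rebuilds before
    # the images that depend on it.
    order = []
    queue = deque([updated_image])
    while queue:
        current = queue.popleft()
        for child in sorted(children.get(current, [])):
            order.append(child)
            queue.append(child)
    return order


print(images_to_rebuild("debian"))  # ['dotnet', 'mycriticalapp']
```

Whether each of those rebuilds should then deploy automatically is a separate decision, and that is where the tests come in.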

Now you might remember that important testing step. At any layer of these automated builds, how do we know the framework, the service or our app will continue to function? Tests. By running automated tests, the owner of each layer can decide if it’s ready to proceed. It’s incumbent on the public image owners to make sure their dependencies don’t break them.

By building an automated build system that not only builds your code when it changes, but also rebuilds when the dependent images change, you're now empowered with the information to decide how to proceed. If the update passes tests and the app just updates, life is good. You might be on vacation and see the news of a critical vulnerability. You check the health of your system, and you can see that a build traveled through, passed its tests and your apps are continuing to report a healthy status. You can go back to your drink at the pool bar knowing your investments in automation and containers have paid off.
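
The decision described above can be sketched as a small gate in the build pipeline. This is only an illustration under stated assumptions: `handle_trigger`, `BuildResult` and the two trigger names are hypothetical, and the `build` and `run_tests` callables stand in for whatever your build system actually provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BuildResult:
    trigger: str        # "code_change" or "base_image_update" (hypothetical names)
    image: str          # the candidate image produced by the build
    tests_passed: bool
    action: str         # "deploy" or "hold"


def handle_trigger(
    trigger: str,
    build: Callable[[], str],
    run_tests: Callable[[str], bool],
) -> BuildResult:
    """Rebuild on any trigger, but only deploy when the tests pass."""
    image = build()
    passed = run_tests(image)
    # A failing build is held back so a person can decide how to proceed.
    action = "deploy" if passed else "hold"
    return BuildResult(trigger, image, passed, action)


# A base image update arrives and the functional tests pass.
result = handle_trigger(
    "base_image_update",
    build=lambda: "mycriticalapp:candidate",
    run_tests=lambda image: True,
)
print(result.action)  # deploy
```

The same gate handles both triggers, which is the point: a patched base image travels through exactly the same tests as a code change.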

What about the underlying infrastructure?

We’ve covered our app updates, and the dependencies they must react to. But what about the underlying infrastructure that’s running our containers? It doesn’t really matter who’s responsible for it. If the customer maintains it, they’re annoyed that they must apply patches, but they’re empowered to test their apps before rolling out the patches. If we move the responsibility to the cloud provider, how do they know if the update will impact the apps? Salesforce has a great model for this as they continually update their infrastructure. If your code uses their declarative model, they can inspect your code to know if it will continue to function. If you write custom code, you must provide tests that have 75% code coverage. Why? So Salesforce can validate that their updates won't break your custom apps.
Containers are efficient in size and startup performance because they share core parts of the kernel with the host OS. When a host OS is updated, how does anyone know it will not impact the running apps in a bad way? And, how would they be updated? Does each customer need to schedule downtime? In the cloud, the concept of downtime shouldn't exist.

Enter the orchestrator…

A basic premise of containerized apps is they’re immutable. Another aspect developers should understand: any one container can and will be moved. It may fail, the host may fail, or the orchestrator may simply want to shuffle workloads to balance the overall cluster. A specific node may get overutilized by one of many processes. Just as your hard drive defrags and moves bits without you ever knowing, the container orchestrator should be able to move containers throughout the cluster. It should be able to expand and shrink the cluster on demand. And that is the next important part.

Rolling Updates of Nodes

If the apps are designed to have individual containers moved at any time, and if nodes are generic and don’t have app-centric dependencies, then the same infrastructure used to expand and shrink the cluster can be used to roll out updates to nodes. Imagine the cloud vendor is aware of, or owns, the nodes. The cloud vendor wants or needs to roll out an underlying OS update or perhaps even a hardware update. It asks the orchestrator to stand up some new nodes, which have the new OS and/or hardware updates. The orchestrator starts to shift workloads to the new nodes. While we can’t really run automated tests on the image, the app can report its health status. As the cloud vendor updates nodes, it monitors the health status. If it sees failures, that is the signal to stop the update, de-provision the new node and resume the workloads on the previous nodes. The cloud vendor must then determine whether it’s something they must fix, or whether to notify the customer that update x was attempted but the apps aren’t functioning. The cloud vendor provides information for the customer to test, identify and fix their app.
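
The rollout described above can be sketched as a simple loop. This is a minimal illustration, not how any particular orchestrator works: `rolling_update` and every callable passed into it (`provision_new_node`, `move_workloads`, `is_healthy`, `deprovision`) are hypothetical stand-ins for operations a real orchestrator would expose in its own way.

```python
from typing import Callable, List


def rolling_update(
    old_nodes: List[str],
    provision_new_node: Callable[[], str],
    move_workloads: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
    deprovision: Callable[[str], None],
) -> dict:
    """Replace nodes one at a time; halt and roll back on the first failure."""
    updated = []
    for old in old_nodes:
        new = provision_new_node()      # node with the new OS/hardware update
        move_workloads(old, new)        # shift containers onto the new node
        if not is_healthy(new):
            # Apps report failures: resume on the previous node and stop.
            move_workloads(new, old)
            deprovision(new)
            return {"status": "halted", "updated": updated, "failed_on": old}
        deprovision(old)                # the old node is no longer needed
        updated.append(new)
    return {"status": "complete", "updated": updated, "failed_on": None}
```

A "halted" result is the cue for the vendor to investigate or notify the customer; nodes already updated stay in place, and the remaining nodes keep running the previous version.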

Dependencies

The dependencies to build such a system look something like this:

Unit and functional tests for each app

A container registry with notifications

Automated builds that can react to image update notifications as well as app updates

Running the automated functional tests as part of the build and deploy pipeline

Apps designed to fail and be moved at any time

Orchestrators that can expand and contract on demand

Health checks for the apps to report their state as they’re moved

Monitoring systems to notify the cloud vendor and customer of the impact of underlying changes

Cloud vendors to interact with their orchestrators to roll out updates, monitor the impact, roll forward or roll back
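
Of these, the health check is probably the smallest piece to get started with. As a rough sketch, an app might combine a few named checks into a single status that the orchestrator and monitoring systems can read. The `health_report` function and the check names below are hypothetical; a real app would expose the result through whatever health endpoint or probe its orchestrator expects.

```python
from typing import Callable, Dict


def health_report(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named check and summarize the app's overall state."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failure rather than crashing the report.
            results[name] = False
    status = "healthy" if all(results.values()) else "unhealthy"
    return {"status": status, "checks": results}


# Hypothetical checks; real ones would probe a database, a queue, and so on.
report = health_report({
    "database": lambda: True,
    "cache": lambda: False,
})
print(report["status"])  # unhealthy
```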

The challenges of software updates, vulnerabilities, bugs will not go away. The complexity of the layers will likely only increase the possibility of update failures. However, by putting the right automation in place, customers can be empowered to react, the apps will be secure and the lights will remain on.