Kubernetes: Industry Transformation 10 Years in the Making

We’ve come a long way as an industry in the last 10 years. As I travel to #KubeCon in Austin, I’m reflecting on what has changed.

10 years ago, I ran an independent research group called the IT Process Institute, and I was lead researcher on a study designed to identify change, configuration, and release best practices. I had the privilege of working with @realgenekim and personally interviewing IT ops teams from a dozen companies recognized for their exemplary results. From the interviews, we created hypotheses about which practices enabled the highest levels of performance. We then collected data from 250 companies to test whether those practices correlated with higher performance across a broad industry sample.

Back Then – People Were Breaking Things

Change was the biggest cause of system failures. Applications were hard-wired to their environments. Systems were reaching a point of complexity where a single person didn’t have the knowledge to understand the impact of a simple change. And people were responsible for making changes. Changes made by people up and down the stack often had unintended consequences. As a result, we used change advisory boards, a forward schedule of change, release engineers, and a CMDB to help document dependencies. Change management was a major ITIL process implemented to help gain control. Controls made sure people followed processes, and helped reduce the chaos of managing brittle, finicky, prickly systems.

The general approach to a successful code release was to test changes in a pre-production environment that was “sufficiently similar” to production, in order to verify changes worked before rollout.

Changes to deployed systems – made in response to a change request or a service-impacting incident – often left production systems in an unknown state. That resulted in additional service quality and security/compliance risk. As a result, the collective “we” of IT professionals shot ourselves in the foot over and over again.

Pinnacle of the Slow and Careful Era

As an example of exemplary practice, consider one organization where the whole IT org’s bonus was tied to downtime (think of the IT group that ran a US stock exchange):

Rollouts – including environment and application changes – were documented in a runbook. They practiced and timed the rollouts in a pre-production environment. They knew what should happen, and how long it should take.

Rollbacks – were documented in a runbook, practiced, and timed.

Scheduled changes – happened during nightly maintenance windows. If the rollout wasn’t successful by a pre-set time, they triggered a rollback. A task that didn’t match the runbook also triggered a rollback.

Devs were banned from production – but there was a “break glass” process where developers could fix production in an emergency. Someone from Ops literally looked over their shoulder and wrote down everything they did.

A key question of that time was how much money to spend on building and maintaining a redundant, underutilized, “sufficiently similar” pre-production environment in order to pre-test changes and ensure success.

Digital Eats “Slow and Careful” for Lunch

The “slow and careful” era had an inherent conflict built in. Everyone knew that slowing down improved results. A careful and cautious approach improved uptime, security, and compliance for complex systems. However, that approach turned out to be wholly inadequate. As Marc Andreessen observed, software is eating the world, and the Lean Startup approach of minimum viable products, along with new digital business models (Uber, Airbnb), all relied on getting new products and features into users’ hands faster, not slower.

Looking back at my interview notes from 10 years ago, I see that I asked everyone, “What metrics do you use to measure success?” Everyone measured uptime and change success rate. Nobody measured frequency of change, or the time between a change request and the completed change.

Along Comes Kubernetes

At the same time I was conducting this research, Google was building Borg, the first unified container management system. Their second iteration was called Omega. Both remain proprietary. But their third version of this system is called Kubernetes, and they launched it as an open source project to share their new and powerful way of doing things, and to help drive usage of their infrastructure-as-a-service offering, Google Cloud Platform.

Kubernetes is a container orchestration system. But more importantly, Kubernetes codifies a new way of doing things that wasn’t even aspirational in the “slow and careful” era. Kubernetes changes how you build, deploy, and manage applications – it is “built for purpose” to meet the needs of the digital era.

Slow and careful IT – with its focus on uptime – doesn’t support digital business models that need new features to attract users. Fast and careless dev – which produces unusable or unavailable applications – drives users away.

Velocity – as a measure – combines the two. It measures the number of features you can ship while maintaining quality of service. Kubernetes and its ecosystem tools give you what you need to move quickly while maintaining quality.

@kelseyhightower, Brendan Burns, and Joe Beda explain in “Kubernetes: Up and Running” that there are three core concepts baked into Kubernetes that enable velocity. Based on my look back, they represent a 180-degree shift from the best practices of the slow and careful era.

Immutability – Once an artifact is created, it is not changed by users. The antipattern: changing something inside a running container, or in an application deployed via a container. It is better to build a new container image and redeploy than for a human to make a change to a deployed system. This supports a blue/green release process. There is no runbook rollback. There is no “break glass” process for people making changes to deployed systems.
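As a minimal sketch of what this looks like in practice – assuming a Deployment named “web” in the “default” namespace and a hypothetical image registry, and using the official Kubernetes Python client – rolling out a change means declaring a new, immutable image tag rather than editing anything in a running container:

from kubernetes import client, config

# Hypothetical example: the Deployment name, namespace, and registry are assumptions.
config.load_kube_config()
apps = client.AppsV1Api()

# Roll forward by declaring a new immutable image tag;
# nobody edits the running container in place.
new_image = {"spec": {"template": {"spec": {"containers": [
    {"name": "web", "image": "registry.example.com/web:2.0.1"}]}}}}
apps.patch_namespaced_deployment(name="web", namespace="default", body=new_image)

Kubernetes then replaces the old pods with pods running the new image, and rolling back is simply declaring the previous tag again – the deployed artifact itself is never mutated.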

Declarative configuration – Kubernetes objects define the desired state of the system. Kubernetes makes sure the actual state matches the desired state. There is no runbook with a documented series of steps to take. The configuration does not need to be executed to be understood. Its impact is declared.
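As a rough sketch – again using the Python client, with the image name and replica count as assumptions – the entire desired state is declared up front, rather than described as a sequence of steps:

from kubernetes import client, config

config.load_kube_config()

# Declare the desired state: three replicas of one (hypothetical) container image.
desired = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(name="web", image="registry.example.com/web:2.0.0"),
            ]),
        ),
    ),
)

# Hand the declaration to Kubernetes; it decides what actions to take.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=desired)

In practice most teams keep the same declaration as a YAML manifest under version control and apply it with kubectl apply; the point is that the file describes what should exist, not how to get there.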

Self-healing – Kubernetes includes a controller manager that continuously takes action to make sure the current state matches the desired state. People don’t repair systems (i.e., make changes) via mitigation steps performed in response to an alert or a change request. Kubernetes consistently and repeatedly takes action to ensure the current state matches the desired state.
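To make the idea concrete, here is a toy control loop in plain Python – not the real controller manager, just an analogy with a simulated replica count – showing how repeatedly comparing actual state to desired state repairs failures automatically:

import itertools
import random
import time

DESIRED_REPLICAS = 3
running = set()                      # stand-in for the replicas actually running
name_seq = itertools.count()

def inject_failure():
    # Simulate the real world: occasionally a replica dies on its own.
    if running and random.random() < 0.3:
        running.pop()

def reconcile():
    # One pass of the control loop: make actual state match desired state.
    while len(running) < DESIRED_REPLICAS:
        running.add(f"replica-{next(name_seq)}")
    while len(running) > DESIRED_REPLICAS:
        running.pop()

# A controller never "finishes"; it repeats the comparison forever,
# so failures get repaired without anyone executing a runbook.
for _ in range(10):                  # bounded only so this sketch terminates
    inject_failure()
    reconcile()
    print(f"running={len(running)}, desired={DESIRED_REPLICAS}")
    time.sleep(0.1)

The real thing watches the cluster through the API server rather than a local set, but the shape is the same: observe, compare, act, repeat.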
