Secure, scalable and robust production operations are complex. In fact, most of these platforms are specifically designed to hide that fact from developers.

That means that these platforms intentionally hide the very complexity that they themselves need to run effectively. Reintroducing that complexity, at best, undermines the utility of the platform and, at worst, causes distractions that keep us forever looping on “day 1” installation issues.

I believe that systems designed to manage ops processes and underlay are different from the platforms designed to manage the developer life-cycle. (This is distinct from the fidelity gap, which is about portability.) Accepting that distinction allows us to focus on delivering secure, scalable and robust infrastructure for both groups of users.

In a pair of DevOps.com posts, I lay out my arguments about the harm being caused by trying to blend these concepts in much more detail:

As much as we talk about how we should have shared goals spanning Dev and Ops, it’s not nearly as easy as it sounds. To fuel a DevOps culture, we also have to build robust tooling.

That means investing up front in five key areas: abstraction, composability, automation, orchestration, and idempotency.

Together, these concepts allow sharing work at every level of the pipeline. Unfortunately, it’s tempting to optimize work at one level and miss the true system bottlenecks.
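To make the idempotency idea concrete, here is a minimal Python sketch (the function and paths are illustrative, not from any RackN tooling): an automation step that converges on a desired state and reports whether it actually changed anything, so it can be safely re-run at any level of the pipeline.

```python
import os
import stat
import tempfile

def ensure_directory(path: str, mode: int = 0o755) -> bool:
    """Converge `path` to an existing directory with `mode`.

    Returns True only when a change was made, so repeated runs are
    safe no-ops once the desired state is reached (idempotency).
    """
    if os.path.isdir(path):
        if stat.S_IMODE(os.stat(path).st_mode) != mode:
            os.chmod(path, mode)  # converge permissions only
            return True
        return False  # already in desired state: nothing to do
    os.makedirs(path)
    os.chmod(path, mode)  # set explicitly so umask can't skew the result
    return True

# Running twice changes the system only the first time.
demo = os.path.join(tempfile.mkdtemp(), "underlay")
print(ensure_directory(demo))  # True: created
print(ensure_directory(demo))  # False: already converged
```

The same check-then-act shape is what lets configuration tools like Ansible or Chef report “changed” versus “ok” and makes shared automation composable.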

Creating production-like fidelity for developers is essential: We need it for scale, security and upgrades. It’s not just about sharing effort; it’s about empathy and collaboration.

But even with growing acceptance of DevOps as a cultural movement, I believe deployment disparities are a big unsolved problem. When developers have vastly different working environments from operators, it creates a “fidelity gap” that makes it difficult for the teams to collaborate.

Before we talk about the costs and solutions, let me first share a story from back when I was a bright-eyed OpenStack enthusiast…

RackN creates infrastructure-agnostic automation so you can run physical and cloud infrastructure with the same elastic operational patterns. If you want to make infrastructure unimportant, then your hybrid DevOps objective is simple:

Create multi-infrastructure Amazon equivalence for ops automation.

Even if you are not an AWS fan, they are the universal yardstick (see my 15-minute and 40-minute presentations). That goes for other clouds (public and private) and for physical infrastructure too. Their footprint is simply so pervasive that you cannot ignore “works on AWS” as a requirement even if you don’t need to work on AWS. Like PCs in the late ’80s, we can use vendor competition to create user choice of infrastructure. That requires a baseline for equivalence between the choices. In the ’90s, the Windows monopoly provided those APIs.

Why should you care about hybrid DevOps? As we increase operational portability, we empower users to make economic choices that foster innovation. That’s valuable even for AWS-locked users.

We’re not talking about “give me a VM” here! The real operational need is to build accessible, interconnected systems – what is sometimes called “the underlay.” It’s more about networking, configuration and credentials than simple compute resources. We need consistent ways to automate systems that can talk to each other and static services, have access to dependency repositories (code, mirrors and container hubs) and can establish trust with other systems and administrators.

These post-provisioning tasks are sophisticated and complex. They cannot be statically predetermined; they must be handled dynamically based on the actual resource being allocated. Without automation, this process becomes manual, glacial and impossible to maintain. Does that sound like traditional IT?

Side Note on Containers: For many developers, we are adding platforms like Docker, Kubernetes and Cloud Foundry that do these integrations automatically for their part of the application stack. This is a tremendous benefit for those use-cases. Sadly, hiding the problem from one set of users does not eliminate it! The teams implementing and maintaining those platforms still have to deal with underlay complexity.

I am emphatically not looking for AWS API compatibility: I am talking about emulating their service implementation choices. We have plenty of ways to abstract APIs. Ops is a post-API issue.

In fact, I believe that red herring leads us to a bad place where innovation is locked behind legacy APIs. Steal APIs where it makes sense, but don’t blindly require them, because it’s the layer under them where the real compatibility challenges lurk.

Side Note on OpenStack APIs (why they diverge): Trying to implement AWS APIs without duplicating all their behaviors is more frustrating than a fresh API without the implied AWS contracts. This is exactly the problem with OpenStack variation: the APIs work, but there is no behavior contract behind them.

For example, transitioning to IPv6 is difficult to deliver because Amazon still relies on IPv4. That gap makes it impossible to create hybrid automation that leverages IPv6, because it won’t work on AWS. In my world, we had to disable default use of IPv6 in Digital Rebar when we added AWS. Another example? Amazon’s regional AMI pattern, thankfully, is not replicated by Google; however, the difference means there’s no consistent image naming pattern. In my experience, a bad pattern is generally better than inconsistent implementations.

As market dominance drives us to benchmark on Amazon, we are stuck with the good, bad and ugly aspects of their service.

For very pragmatic reasons, even AWS automation is highly fragmented. There are a large and shifting number of distinct system identifiers (AMIs, regions, flavors) plus a range of user-configured choices (security groups, keys, networks). Even within a single provider, these options make it impossible to maintain a generic automation process. Since other providers logically model themselves on AWS, we will continue to expect AWS-like behaviors from them. Variation from those norms adds effort.
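One way to picture the fragmentation: portable automation has to translate its abstract choices into provider-native identifiers. The catalog below is a hedged sketch; the image and flavor IDs are made up, and a real mapping would be far larger and constantly shifting, which is exactly the maintenance burden described above.

```python
# Made-up identifiers: even "the same" OS image and small instance size
# have unrelated names on each provider.
CATALOG = {
    "aws": {
        "ubuntu-20.04": "ami-0abc1234def567890",
        "small": "t3.small",
    },
    "google": {
        "ubuntu-20.04": "ubuntu-2004-focal-v20240101",
        "small": "e2-small",
    },
}

def resolve(provider: str, image: str, flavor: str) -> tuple:
    """Translate portable names into provider-native identifiers."""
    try:
        table = CATALOG[provider]
        return table[image], table[flavor]
    except KeyError as missing:
        # A gap in the catalog is an automation failure, not a silent default.
        raise ValueError(f"no mapping for {missing} on provider {provider!r}")

print(resolve("aws", "ubuntu-20.04", "small"))
```

Every new region, image revision or instance family forces a catalog update, so the translation layer is never “done.”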

Diverging from AWS without a clear reason and an alternative path is frustrating to users.

Do you agree? Join us with Digital Rebar in creating a real hybrid operations platform.

The RackN team has been working on making DevOps more portable for over five years. Being portable between vendors, sites, tools and operating systems means that our automation needs to be hybrid in multiple dimensions by design.

I believe that applications should drive the infrastructure, not the reverse. I’ve heard many times that the “infrastructure should be invisible to the user.” Unfortunately, lack of abstraction and composability makes it difficult to code across platforms. I like the term “fidelity gap” to describe the cost of these differences.

Everyone wants to get stuff done quickly; however, we make the same hard-coded ops choices over and over again. Big-bang configuration automation that embeds sequence assumptions into the script is not just technical debt; it’s fragile and difficult to upgrade or maintain. The problem is not configuration management (that’s a critical component!); it’s the lack of system-level tooling that forces us to overload the configuration tools.

My ops automation experience says that these factors must be solved together because they are interconnected.

What would a platform that embraced all these ideas look like? Here is what we’ve been working towards with Digital Rebar at RackN:

| Mono-Infrastructure IT | Hybrid DevOps |
| --- | --- |
| Locked into a single platform | Portable between sites and infrastructures with layered ops abstractions. |
| Limited interop between tools | Adaptive to mix and match best-for-job tools. Use the right scripting for the job at hand and never force-migrate working automation. |
| Ad hoc security based on site specifics | Secure using repeatable automated processes. We fail at security when things get too complex to change and adapt. |
| Difficult to reuse ops tools | Composable modules enable ops pipelines. We have to be able to interchange parts of our deployments for collaboration and upgrades. |
| Fragile configuration management | Service-oriented design simplifies API integration. The number of APIs and services is increasing; configuration management alone is not sufficient. |
| Big bang: configure-then-deploy scripting | Orchestrated action is critical because sequence matters. Building a cluster requires sequential (often iterative) operations between nodes in the system. We cannot build robust deployments without ongoing control over order of operations. |
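To illustrate why sequence matters, here is a minimal Python sketch using the standard library’s `graphlib` (the step names are invented): it orders cross-node cluster steps by their dependencies, something a per-node configuration run cannot express on its own.

```python
# Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter

# step -> set of steps that must finish first; names are illustrative.
# Bootstrap one etcd member, join the others, then bring up the API
# server and finally the workers.
deps = {
    "etcd-bootstrap@node1": set(),
    "etcd-join@node2": {"etcd-bootstrap@node1"},
    "etcd-join@node3": {"etcd-bootstrap@node1"},
    "api-server@node1": {"etcd-join@node2", "etcd-join@node3"},
    "worker-join@node4": {"api-server@node1"},
}

# static_order yields one valid cross-node execution sequence.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

A real orchestrator also has to re-run steps iteratively and react to failures, but even this toy shows ordering knowledge that lives above any single node’s configuration script.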

Should we call this “Hybrid DevOps”? That sounds so buzz-wordy!

I’ve come to believe that Hybrid DevOps is the right name. More technical descriptions like “composable ops” or “service oriented devops” or “cross-platform orchestration” just don’t capture the real value. All these names fail to capture the portability and multi-system flavor that drives the need for user control of hybrid in multiple dimensions.

The RackN team is working on the “Start to Scale” position for Digital Rebar that targets the IT industry-wide “fidelity gap” problem. When we started on the Digital Rebar journey back in 2011 with Crowbar, we focused on “last mile” problems in metal and operations. Only in the last few months did we recognize the importance of automating smaller “first mile” desktop and lab environments.

A fidelity gap is created when work done on one platform (a developer laptop) does not translate faithfully to the next platform (a QA lab). Since there are gaps at each stage of deployment, we end up with the ops staircase of despair.

These gaps hide defects until they are expensive to fix and make it hard to share improvements. Even worse, they keep teams from collaborating.

With everyone trying out Container Orchestration platforms like Kubernetes, Docker Swarm, Mesosphere or Cloud Foundry (all of which we deploy, btw), it’s important that we can gracefully scale operational best practices.

For companies implementing containers, it’s not just about turning their apps into microservice-enabled, immutable rock stars: they also need to figure out how to implement the underlying platforms at scale.

My example of fidelity gap harm is OpenStack’s “all in one, single node” DevStack. There is no useful single system OpenStack deployment; however, that is the primary system for developers and automated testing. This design hides production defects and usability issues from developers. These are issues that would be exposed quickly if the community required multi-instance development. Even worse, it keeps developers from dealing with operational consequences of their decisions.

What are we doing about fidelity gaps? We’ve made it possible to run and faithfully provision multi-node systems in Digital Rebar on a relatively light system (16 GB RAM, 4 cores) using VMs or containers. That system can then be fully automated with Ansible, Chef, Puppet and Salt. Because of our abstractions, if a deployment works in Digital Rebar, then it can scale up to hundreds of physical nodes.

My takeaway? If you want to get to scale, start with the end in mind.