Every Ops team I know is underwater and doesn’t have time to catch its breath.

Why does the load increase and leave Ops behind? It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors. Many teams simply step around their struggling compatriots and spin up yet more Ops work, adding to the backlog. Dashing off yet another Ansible playbook to install something on AWS feels empowering, but it ultimately adds to the Ops sustaining backlog.

Ops Tsunami

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms, and those platforms are built on rapidly evolving components. For example, Kubernetes has a three-month release cycle. That’s really fast; however, it’s built on other components like Docker, SDN and operating systems that also have fast release cycles. That means that even a single Kubernetes infrastructure has many moving parts that may not be consistent within your own organization. For example, cloud deploys may use CoreOS while internal ones use a corporate-approved CentOS.

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

Since then, we’ve seen a container-fueled explosion in developer productivity and an AI-driven rise in new hardware-flavored instances. Both are powerful drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

If the ratio of systems to operators climbs above 50 (for example, a five-person team responsible for more than 250 systems), then the team slowly sinks under growing operational load. If you are not actively decreasing the load via automation then your teams go underwater and basic ops hygiene fails.

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help. Stop writing automation that you can buy or re-use. Get help running it. Don’t waste time solving problems that other people have solved. That may mean some upfront learning and investment, but if you aren’t getting out of your own way then you’ll be run over.

I’ve been posting about the unique composable operations approach the RackN team has taken with Digital Rebar to enable hybrid infrastructure and mix-and-match underlay tooling. The orchestration design (what we call annealing) allows us to dynamically add roles to the environment and execute them as single role/node interactions in operational chains.

With our latest patches (short demo videos below), you can now create single-role Ansible or Bash scripts dynamically and then incorporate them into the node execution.

That makes it very easy to extend an existing deployment on-the-fly for quick changes or as part of a development process.
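
To make that concrete, here’s a minimal sketch of what a single-role Bash script can look like. The NTP_SERVER variable is a hypothetical attribute injected by the orchestrator, not an actual Digital Rebar convention, and it assumes a CentOS-style node like the earlier example:

    #!/usr/bin/env bash
    # Single-role sketch: configure an NTP client on one node, and nothing
    # else. Safe to rerun because every step checks before acting.
    set -euo pipefail

    NTP_SERVER="${NTP_SERVER:-pool.ntp.org}"   # hypothetical injected attribute

    # Idempotent install: only act if the daemon is missing.
    if ! command -v ntpd >/dev/null 2>&1; then
      yum install -y ntp
    fi

    # Write configuration only when it has changed, so reruns are no-ops.
    if ! grep -q "server ${NTP_SERVER}" /etc/ntp.conf 2>/dev/null; then
      echo "server ${NTP_SERVER} iburst" >> /etc/ntp.conf
      systemctl restart ntpd
    fi

Keeping each role small and idempotent is the point: the orchestrator can safely rerun it while you iterate.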

You can also run an ad hoc Bash script against one machine or groups of machines. If that script is something unique to your environment, you can manage it without having to push it back upstream because Digital Rebar workloads are composable and designed to be safely integrated from multiple sources.
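
For instance, an environment-specific one-off might be as small as this version-drift check (a generic sketch; Digital Rebar runs it in its own context on each target machine):

    #!/usr/bin/env bash
    # Ad hoc sketch: report kernel and Docker versions so drift across a
    # group of machines is easy to spot. Too site-specific to upstream.
    set -euo pipefail
    echo "$(hostname): kernel $(uname -r), docker $(docker --version 2>/dev/null || echo 'not installed')"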

Beyond tweaking running systems, this is the fastest script development workflow that I’ve ever seen. I can make fast, surgical, iterative changes to my scripts without having to rerun whole playbooks or runlists. Even better, I can build multiple operating system environments side by side and test changes in parallel.

For secure environments, I don’t have to hand out user SSH access to systems because the actions run in Digital Rebar context. Digital Rebar can limit control per user or tenant.

I’m very excited about how this capability can be used for dev, test and production systems. Check it out and let me know what you think.

2016 is the year we break down the monoliths. We’ve spent a lot of time talking about monolithic applications and microservices; however, there’s an equally deep challenge in ops automation.

Anti-monolith composability means making our automation into function blocks that can be chained together by orchestration.

What is going wrong? We’re building fragile, tightly coupled automation.

Most of the automation scripts that I’ve worked with become very long, interconnected sequences that reach well beyond the actual application they are trying to install. For example, Kubernetes needs etcd as a datastore. The current model is to include the etcd install in the Kubernetes install script. The same is true for SDN install/configuration and for post-install tests and dashboard UIs. The simple “install Kubernetes” quickly explodes into a kitchen sink of related adjacent components.

Those installs quickly become fragile and bloated. Even worse, they have hidden dependencies. What happens when etcd changes? Now we’ve got to track down all the references to it buried in etcd-based applications. Further, we don’t get the benefit of etcd deployment improvements like secure or scaled configurations.
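
To make the shape of the problem concrete, here’s a compressed sketch of the monolithic pattern; the version number is a placeholder and the function bodies are stubs standing in for real install steps:

    #!/usr/bin/env bash
    # Monolithic "install Kubernetes" sketch (stubs only). Every adjacent
    # component is inlined, so a change to etcd means re-editing and
    # re-testing this entire file.
    set -euo pipefail

    ETCD_VERSION="vX.Y.Z"   # placeholder: pinned here, invisible to everyone else

    install_etcd()       { echo "fetch and configure etcd ${ETCD_VERSION}"; }
    install_sdn()        { echo "install and configure the SDN"; }
    install_kubernetes() { echo "install kubernetes against the local etcd"; }
    install_dashboard()  { echo "install dashboard and run post-install tests"; }

    install_etcd
    install_sdn
    install_kubernetes
    install_dashboard

The etcd knowledge is trapped inside the Kubernetes script, which is exactly the hidden dependency described above.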

What can we do about it? Resist the urge to create vertical silos.

It’s tempting and fast to create automation that works in a very prescriptive way for a single platform, operating system and tool chain. The work of creating abstractions between configuration steps seems like a lot of overhead. Even if you create those boundaries or reuse upstream automation, you’re likely to be vulnerable to changes within that component. All these concerns drive operators to walk away from working collaboratively with each other and with developers.
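
For contrast, here’s a minimal sketch of the composable shape, assuming a hypothetical roles/ directory where each component’s install script is owned and versioned independently:

    #!/usr/bin/env bash
    # Composable sketch (hypothetical layout): each concern is a separate
    # single-role script; this orchestration shim only chains them.
    set -euo pipefail

    run_role() {
      # A real orchestrator would match roles to nodes and order them by
      # dependency; this stub just shows the delegation boundary.
      echo "delegating to roles/$1/install.sh"
    }

    # Kubernetes declares that it needs etcd; it does not embed the install.
    for role in etcd sdn kubernetes dashboard; do
      run_role "$role"
    done

When the etcd role improves (say, secure or scaled configurations), every chain that references it inherits the improvement without edits.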

Giving up on collaborative Ops hurts us all and makes it impossible to engineer excellent operational tools.

And…v2.1 is the first release with commercial support!

RackN (rackn.com) offers consulting and support for the OpenCrowbar v2.1 release. The company was started by Crowbar founders Greg Althaus, Scott Jensen, Dan Choquette, and myself specifically to productize and extend Crowbar.