
Update 12/14/16: Docker announced that they would create a container-engine-only project, containerd, to decouple the engine from the management layers above. Hopefully this addresses the issues outlined in the post below.

Monday, The New Stack broke news about a possible fork of the Docker Engine and prominently quoted me saying “Docker consistently breaks backend compatibility.” The technical instability alone is not what’s prompting industry leaders like Google, Red Hat and Huawei to take drastic and potentially risky community action on a central project.

So what’s driving a fork? It’s the intersection of Cash, Complexity and Community.

In fact, I’d warned about this risk over a year ago: Docker is both a core infrastructure technology (the docker container runner, aka Docker Engine) and a commercial company that manages the Docker brand. The community formed a standard, runC, to try to standardize; however, Docker continues to deviate from (or innovate faster than) that base.

It’s important for me to note that we use Docker tools and technologies heavily. I’ve been a long-time advocate and user of Docker’s innovative technology; as such, we’ve also had to ride the rapid release roller coaster.

Let’s look at what’s going on here in three key areas:

1. Cash

The expected monetization of containers is the multi-system orchestration and support infrastructure. Since many companies look to containers as leading the next disruptive innovation wave, the idea that Docker is holding part of their plans hostage is simply unacceptable.

So far, the open source Docker Engine has been simply included without payment into these products. That changed in version 1.12 when Docker co-mingled their competitive Swarm product into the Docker Engine. That effectively forces these other parties to advocate for and distribute their competitor’s product.

2. Complexity

When Docker added cool Swarm Orchestration features into the v1.12 runtime, it added a lot of complexity too. That may be simple from a “how many things do I have to download and type” perspective; however, that single unit is now dragging around a lot more code.

In one of the recent comments about this issue, Bob Wise made a plea for infrastructure to be boring. Even as we look to complex orchestration like Swarm, Kubernetes, Mesos, Rancher and others to perform application automation magic, we also need to reduce complexity in our infrastructure layers.

Along those lines, operators want key abstractions like containers to be as simple and focused as possible. We’ve seen similar paths for virtualization runtimes like KVM, Xen and VMware that focus on delivering a very narrow band of functionality very well. There is a lot of pressure from people building with containers to have a similar experience from the container runtime.

This approach both helps operators manage infrastructure and creates a healthy ecosystem of companies that leverage the runtimes.

Note: My company, RackN, believes strongly in this need and it’s a core part of our composable approach to automation with Digital Rebar.

3. Community

Multi-vendor open source is a very challenging and specialized type of community. In these communities, most of the contributors are paid by companies with a vested (not necessarily transparent) interest in the project components. If the participants of the community feel that they are not being supported by the leadership then they are likely to revolt.

Ultimately, the primary difference between Docker and a fork of Docker is the brand and the community. If the companies paying the contributors have the will, then it’s possible to move a whole community. It’s not cheap, but it’s possible.

Developers vs Operators

One overlooked aspect of this discussion is the apparent lock that Docker enjoys on the container developer community. The three Cs above really focus on the people with budgets (the operators) over the developers. For a fork to succeed, there needs to be a non-Docker set of tooling that feeds the platform pipeline with portable application packages.

In Conclusion…

The world continues to get more and more heterogeneous. We already had multiple container runtimes before Docker, and the idea of a new one really is not that crazy right now. We’ve already got an explosion of container orchestration tools, and this fork talk is a reflection of that.

My advice? Worry less about the container format for now and focus on automation and abstractions.

Steven Spector and I talked about “Hybrid DevOps” as a concept. Our discussion led to a ‘there’s a picture for that!’ moment that helped clarify the concept. We believe that this concept, like Rugged DevOps, is additive to existing DevOps thinking and culture. It’s about expanding our thinking to include orchestration and composability.

The RackN team has been working on making DevOps more portable for over five years. Portability between vendors, sites, tools and operating systems means that our automation needs to be hybrid in multiple dimensions by design.

I believe that applications should drive the infrastructure, not the reverse. I’ve heard many times that the “infrastructure should be invisible to the user.” Unfortunately, lack of abstraction and composability makes it difficult to code across platforms. I like the term “fidelity gap” to describe the cost of these differences.

Everyone wants to get stuff done quickly; however, we make the same hard-coded ops choices over and over again. Big bang configuration automation that embeds sequence assumptions into the script is not just technical debt, it’s fragile and difficult to upgrade or maintain. The problem is not configuration management (that’s a critical component!), it’s the lack of system level tooling that forces us to overload the configuration tools.
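
To make that concrete, here’s a minimal sketch (plain Python, not RackN code, with made-up step names) of the alternative: steps declare their dependencies and the tooling computes the sequence, instead of the script hard-coding it:

```python
# Hypothetical sketch: deriving run order from declared dependencies
# instead of hard-coding the sequence into a monolithic script.
# Step names are illustrative only.
steps = {
    "provision_os": [],
    "configure_network": ["provision_os"],
    "install_database": ["configure_network"],
    "install_app": ["configure_network"],
    "register_dns": ["install_app", "install_database"],
}

def run_order(steps):
    """Topologically sort steps so order is computed, not assumed.
    Assumes no dependency cycles (a real tool would detect them)."""
    done, order = set(), []
    def visit(name):
        if name in done:
            return
        for dep in steps[name]:
            visit(dep)
        done.add(name)
        order.append(name)
    for name in steps:
        visit(name)
    return order

print(run_order(steps))
# ['provision_os', 'configure_network', 'install_database',
#  'install_app', 'register_dns']
```

When the sequence is declared data rather than embedded control flow, changing one step no longer means re-auditing the whole script.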

My ops automation experience says that these factors must be solved together because they are interconnected.

What would a platform that embraced all these ideas look like? Here is what we’ve been working towards with Digital Rebar at RackN:

Mono-Infrastructure IT → “Hybrid DevOps”

Locked into a single platform → Portable between sites and infrastructures with layered ops abstractions.

Limited interop between tools → Adaptive to mix and match best-for-job tools. Use the right scripting for the job at hand and never force migrate working automation.

Ad hoc security based on site specifics → Secure using repeatable automated processes. We fail at security when things get too complex to change and adapt.

Difficult to reuse ops tools → Composable Modules enable Ops Pipelines. We have to be able to interchange parts of our deployments for collaboration and upgrades.

Fragile Configuration Management → Service Oriented design simplifies API integration. The number of APIs and services is increasing; configuration management alone is not sufficient.

Big bang: configure-then-deploy scripting → Orchestrated action is critical because sequence matters. Building a cluster requires sequential (often iterative) operations between nodes in the system. We cannot build robust deployments without ongoing control over order of operations.

Should we call this “Hybrid DevOps”? That sounds so buzz-wordy!

I’ve come to believe that Hybrid DevOps is the right name. More technical descriptions like “composable ops” or “service oriented devops” or “cross-platform orchestration” just don’t capture the real value. All these names fail to capture the portability and multi-system flavor that drives the need for user control of hybrid in multiple dimensions.

Normalizing the APIs for hardware configuration is a noble and long-term goal. While the end result, a configured server, is very easy to describe, the differences between vendors’ hardware configuration tools are substantial. These differences make it challenging to create repeatable operations automation (DevOps) on heterogeneous infrastructure.

[Illustration: potential changes in provisioning control flow over time.]

The OpenStack Ironic project is a multi-vendor community solution to this problem at the server level. By providing a common API for server provisioning, Ironic encourages vendors to write drivers for their individual tooling such as iDRAC for Dell or iLO for HP.
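
As a rough illustration of that driver pattern (class and method names here are hypothetical, not Ironic’s actual interface), the orchestration layer codes against one API while vendor drivers absorb the tooling differences:

```python
# Illustrative sketch of a common provisioning API in front of
# vendor-specific management tooling. Names are hypothetical.
from abc import ABC, abstractmethod

class PowerDriver(ABC):
    @abstractmethod
    def power_on(self, node_id: str) -> None: ...
    @abstractmethod
    def set_boot_device(self, node_id: str, device: str) -> None: ...

class IDRACDriver(PowerDriver):
    """Would wrap Dell iDRAC calls."""
    def power_on(self, node_id): print(f"iDRAC: powering on {node_id}")
    def set_boot_device(self, node_id, device): print(f"iDRAC: boot {node_id} from {device}")

class ILODriver(PowerDriver):
    """Would wrap HP iLO calls."""
    def power_on(self, node_id): print(f"iLO: powering on {node_id}")
    def set_boot_device(self, node_id, device): print(f"iLO: boot {node_id} from {device}")

def provision(driver: PowerDriver, node_id: str):
    # Orchestration code sees one API regardless of vendor.
    driver.set_boot_device(node_id, "pxe")
    driver.power_on(node_id)

provision(IDRACDriver(), "node-01")
provision(ILODriver(), "node-02")
```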

Ironic abstracts configuration and expects to be driven by an orchestration system that makes the decisions of how to configure each server. That type of orchestration is the heart of Crowbar’s physical ops magic [side note: 5 ways that physical ops is different from cloud].

The OpenCrowbar project created extensible orchestration to solve this problem at the system level. By decomposing system configuration into isolated functional actions, Crowbar can coordinate disparate configuration actions for servers, switches and between systems.

Today, the Provisioner component of Crowbar performs similar functions as Ironic for operating system installation and image lay down. Since configuration activity is tightly coupled with other Crowbar configuration, discovery and networking setup, it is difficult to isolate in the current code base. As Ironic progresses, it should be possible to shift these activities from the Provisioner to Ironic and take advantage of the community-based configuration drivers.

The immediate synergy between Crowbar and Ironic comes from accepting two modes of operation for OpenStack: bootstrapping infrastructure and multi-tenant server allocation.

Crowbar was designed as an operational platform that seeds an OpenStack ready environment. Once that environment is configured, OpenStack can take over ownership of the resources and allow Ironic to manage and deliver “hypervisor-free” servers for each tenant. In that way, we can accelerate the adoption of OpenStack for self-service metal.

Physical operations is messy and challenging, but we’re committed to working together to make it suck less. Operators of the world unite!

I’ve had several conversations comparing OpenCrowbar with other “bare metal provisioning” tools that do things like serve golden images to PXE or iPXE servers to help bootstrap deployments. While those are handy tools, they do nothing to really help operators drive system-wide operations; consequently, they have limited system impact/utility.

In building the new architecture of OpenCrowbar (aka Crowbar v2), we heard very clearly that the system should have “less magic.” We took that advice very seriously and made sure that Crowbar is a system layer that works with, not a replacement for, standard operations tools.

Specifically, node boot & kickstart alone is just not that exciting. It’s a combination of DHCP, PXE, HTTP and TFTP, or DHCP and an iPXE HTTP server. It’s a pain to set up, but I don’t really get excited about it anymore. In fact, you can pretty much use open ops scripts (Chef) to set up these services because it’s cut-and-dried operational work.
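
For the curious, the handoff chain that paragraph describes looks roughly like this (an illustrative sketch only, no real servers involved):

```python
# Sketch of the classic netboot handoff; purely illustrative.
netboot_chain = [
    ("DHCP", "NIC asks for an address and is pointed at a boot server"),
    ("TFTP", "firmware fetches the PXE/iPXE bootloader"),
    ("HTTP", "iPXE fetches kernel, initrd, and kickstart/preseed files"),
    ("OS",   "installer runs unattended from the fetched config"),
]
for proto, step in netboot_chain:
    print(f"{proto:5} -> {step}")
```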

Note: Setting up the networking to make it all work is perhaps a different question and one that few platforms bother talking about.

So, if node provisioning is not a big deal, then why is OpenCrowbar important? Because sustaining operations is about ongoing system orchestration (we’d say an “operations model”) that starts with provisioning.

It’s not the individual services that are critical; it’s doing them in a system-wide sequence that’s vital.

Crowbar does NOT REPLACE the services. In fact, we go out of our way to keep your proven operations tool chain. We don’t want operators to troubleshoot our iPXE code! We’d much rather use the standard stuff and orchestrate the configuration in a predictable way.

In that way, OpenCrowbar embraces and composes the existing operations tool chain into an integrated system of tools. We always avoid replacing tools. That’s why we use Chef for our DSL instead of adding something new.

What does that leave for Crowbar? Crowbar provides physical-infrastructure-targeted orchestration (we call it “the Annealer”) that coordinates this tool chain to work as a system. It’s the system perspective that’s critical because it allows all of the operational services to work together.

For example, when a node is added, we have to create v4 and v6 IP address entries for it. This is required because secure infrastructure requires reverse DNS. If you change the name of that node or add an alias, Crowbar again needs to update the DNS. This has to happen in the right sequence. If you create a new virtual interface for that node then, again, you need to update DNS. This type of operational housekeeping is essential and must be performed in the correct sequence at the right time.
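
A toy sketch of that housekeeping (with illustrative names, not Crowbar’s actual API) shows why order matters:

```python
# Hypothetical sketch of ordered DNS housekeeping for node lifecycle events.
class ToyDNS:
    """In-memory stand-in for a real DNS service."""
    def __init__(self):
        self.records = {}  # name or address -> value
    def add(self, key, value):
        self.records[key] = value
    def remove_name(self, name):
        self.records = {k: v for k, v in self.records.items()
                        if k != name and v != name}

def on_node_added(dns, node, ipv4, ipv6):
    # Forward entry first, then the reverse entries that
    # secure infrastructure requires -- sequence matters.
    dns.add(node, ipv4)
    dns.add(ipv4, node)   # reverse v4
    dns.add(ipv6, node)   # reverse v6

def on_node_renamed(dns, old, new, ipv4, ipv6):
    # Clear stale entries before re-adding under the new name,
    # so lookups never resolve to the wrong name mid-change.
    dns.remove_name(old)
    on_node_added(dns, new, ipv4, ipv6)

dns = ToyDNS()
on_node_added(dns, "node-01", "10.0.0.5", "fd00::5")
on_node_renamed(dns, "node-01", "db-01", "10.0.0.5", "fd00::5")
print(dns.records)
# {'db-01': '10.0.0.5', '10.0.0.5': 'db-01', 'fd00::5': 'db-01'}
```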

The critical insight is that Crowbar works transparently alongside your existing operational services with proven configuration management tools. Crowbar connects links in your tool chain but keeps you in the driver’s seat.

The operations challenge

A deployment framework is key to solving the problems of deploying, configuring, and scaling open source clusters for cloud computing.

Deploying an open source cloud can be a complex undertaking. Manual processes can take days or even weeks of work to get a cloud fully operational. Even then, a cloud is never static; in the real world, cloud solutions are constantly on an upgrade or improvement path. There is a continuous need to deploy new servers, add management capabilities, and track upstream releases, all while keeping the cloud running and providing reliable services to end users. Service continuity requirements dictate a need for automation and orchestration. There is no other way to reduce the cost while improving the uptime and reliability of a cloud.

These were among the challenges that drove the development of the OpenCrowbar software framework from its roots as an OpenStack installer into a much broader orchestration tool. Because of this evolution, OpenCrowbar has a number of architectural features to address these challenges:

Abstraction Around Orchestration

OpenCrowbar is designed to simplify the operations of large scale cloud infrastructure by providing a higher level abstraction on top of existing configuration management and orchestration tools based on a layered deployment model.

Web Architecture

OpenCrowbar is implemented as a web application server, with a full user interface and a predictable and consistent REST API.

Platform Agnostic Implementation

OpenCrowbar is designed to be platform and operating system agnostic. It supports discovery and provisioning from a bare metal state, including hardware configuration, updating and configuring BIOS and BMC boards, and operating system installation. Multiple operating systems and heterogeneous operating systems are supported. OpenCrowbar enables use of time-honored tools, industry standard tools, and any form of scriptable facility to perform its state transition operations.

Modular Architecture

OpenCrowbar is designed around modular plug-ins called Barclamps. Barclamps allow for extensibility and customization while encapsulating layers of deployment in manageable units.

State Transition Management Engine

The core of OpenCrowbar is based on a state machine (we call it the Annealer) that tracks nodes, roles, and their relationships in groups called deployments. The state machine is responsible for analyzing dependencies and scheduling state transition operations (transitions); a toy sketch of the idea follows this list.

Data Model

OpenCrowbar uses a dedicated database to track system state and data. As discovery and deployment progresses, system data is collected and made available to other components in the system. Individual components can access and update this data, reducing dependencies through a combination of deferred binding and runtime attribute injection.

Network Abstraction

OpenCrowbar is designed to support a flexible network abstraction, where physical interfaces, BMCs, VLANs, bonding, teaming, and other low level features are mapped to logical conduits, which can be referenced by other components. Networking configurations can be created dynamically to adapt to changing infrastructure.
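
Here is the toy sketch promised above: a node’s roles advance only when the roles they depend on are already active, which is the essence of annealer-style scheduling. Names and states are illustrative, not OpenCrowbar’s actual data model:

```python
# Toy annealer sketch: roles transition only when their
# dependencies are satisfied. Illustrative names only.

# Role -> roles that must be "active" on the node first.
ROLE_DEPS = {
    "discovered": [],
    "bios-configured": ["discovered"],
    "os-installed": ["bios-configured"],
    "service-ready": ["os-installed"],
}

def runnable(node_roles):
    """Return roles whose dependencies are already active."""
    return [r for r, state in node_roles.items()
            if state == "todo"
            and all(node_roles[d] == "active" for d in ROLE_DEPS[r])]

node = {r: "todo" for r in ROLE_DEPS}
while any(s == "todo" for s in node.values()):
    for role in runnable(node):
        print(f"transitioning {role}")
        node[role] = "active"  # a real system would run the operation
```

The point of the pattern is that the scheduler, not the operator, decides when each transition is safe to run.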

I have to say that last week’s Opscode Community Summit was one of the most productive summits that I have attended. Their use of the open-space meeting format proved to be highly effective for a team of motivated people to self-organize and talk about critical topics. I especially liked the agenda negotiations (see picture for an agenda snapshot) because everyone worked to adjust session times and locations based on what other sessions were being offered. Of course, it also helped to have an unbelievable level of Chef expertise on tap.

Overall

Overall, I found the summit to be a very valuable two days; consequently, I feel some need to pay it forward with a good summary. Part of the goal was for the community to document their sessions on the event wiki (which I have done).

The roadmap sessions were of particular interest to me. In short, Chef is converging the code bases of their three products (hosted, private and open). The primary changes will be moving from CouchDB to a SQL-based DB and moving the API calls from Merb/Ruby to Erlang. They are also improving search so that we can make more fine-tuned requests that perform better and return less extraneous data.

I had a lot of great conversations. Some of the companies represented included: Monster, Oracle, HP, DTO, Opscode (of course), InfoChimps, Reactor8, and Rackspace. There were many others – overall >100 people attended!

Crowbar & Chef

Greg Althaus and I attended for Dell with a Crowbar specific agenda so my notes reflect the fact that I spent 80% of my time on sessions related to features we need and explaining what we have done with Chef.

Observations related to Crowbar’s use of Chef

There is a class of “orchestration” products that have objectives similar to Crowbar’s. Ones that I remember are Cluster Chef, Run Deck and Domino.

Crowbar uses Chef in a way that is different from users who have a single application to deploy. We use roles and databags to store configuration that other users would inject directly into their recipes. This is due to the fact that we are trying to create generic recipes that can be applied to many installations.
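
As a rough illustration of that pattern (sketched in Python rather than Chef’s Ruby DSL, with made-up attribute names), generic cookbook defaults get merged with per-installation role data:

```python
# Rough illustration (not Chef's actual attribute system) of generic
# defaults overridden by per-installation role data.

cookbook_defaults = {          # generic recipe defaults
    "mysql": {"port": 3306, "bind": "127.0.0.1"},
}

role_overrides = {             # site-specific data the tool injects
    "mysql": {"bind": "0.0.0.0"},
}

def effective_attrs(defaults, overrides):
    """Merge overrides on top of defaults, one service at a time."""
    merged = {}
    for service, attrs in defaults.items():
        merged[service] = {**attrs, **overrides.get(service, {})}
    return merged

print(effective_attrs(cookbook_defaults, role_overrides))
# {'mysql': {'port': 3306, 'bind': '0.0.0.0'}}
```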

Our heavy use of roles enables something of a cookbook service pattern. We found that this was confusing to many Chef users who rely on the UI and knife. It works for us because all of these interactions are automated by Crowbar.

We picked up some smart security ideas that we’ll incorporate into future versions.

Managed Nodes / External Entities

Our primary focus was creating an “External Entity” or “Managed Node” model. Matt Ray prefers the term “managed node” so I’ll defer to that name for now. This model is needed for Crowbar to manage system components that cannot run an agent, such as a network switch, blade chassis, IP power distribution unit (PDU), or SAN array. The concept for a managed node is that there is an instance of the chef-client agent that can act as a delegate for the external entity. I had so much to say about that part of the session that I’m posting it as its own topic shortly.
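
To illustrate the delegate concept (a hypothetical sketch, not Chef’s or Crowbar’s actual API), an agent process runs where agents can run and translates desired state into device-specific calls:

```python
# Hypothetical sketch of the "managed node" idea: an agent acts as
# a delegate for a device that cannot run chef-client itself.

class SwitchDelegate:
    """Runs where an agent CAN run; speaks the switch's protocol."""
    def __init__(self, switch_address):
        self.switch_address = switch_address

    def apply_desired_state(self, desired):
        # A real delegate would translate desired state into
        # device-specific calls (SNMP, SSH/CLI, vendor API).
        for port, vlan in desired["port_vlans"].items():
            print(f"{self.switch_address}: set port {port} -> VLAN {vlan}")

# The orchestrator treats the delegate like any other node's agent.
desired_state = {"port_vlans": {"1/0/1": 100, "1/0/2": 200}}
SwitchDelegate("switch-01.example.com").apply_desired_state(desired_state)
```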