Tag Archives: operations

Post navigation

Contrary to pundit expectations, OpenStack did not roll over and die during the keynotes yesterday.

In my 2011 Boston Summit shirt.

In fact, I saw the signs of a maturing project seeing real use and adoption. More critically, OpenStack leadership started the event with an acknowledgement of being part of, not owning, the vibrant open infrastructure community.

Continued Growth in Core Areas

Practical reasons for running dedicated infrastructure (compliance, control and cost) make OpenStack relevant for companies and governments with significant budgets. There is also a healthy shared infrastructure (aka public cloud) market living in the shadow of the big 3 players. It’s still unclear how this ecosystem will make money for the vendors.

What do customers buy? Should the Core be free?

My personal experience is that most customers are reluctant to (but grudgingly do) buy distros for the core open technology. They are much more willing to pay for adjacencies like security, storage and networking.

Emerging Challenges from Adjacent Technologies

Containers and Kubernetes are making a significant impact on the OpenStack community. At points, the OpenStack keynote was more about Kubernetes than OpenStack. It’s also clear that customers want to use containers as an abstraction layer to make infrastructure less visible or locked-in. That opens the market for using servers directly (bare metal) or other clouds. That portability is likely to help OpenStack more than hurt it because customers can exit workloads from the Big 3 players.

Friction for adoption remains a critical hurdle.

Containers, which are cloud first platforms, have much less friction than IaaS platforms. IaaS platforms, even managed ones, require physical infrastructure with the matching complexity and investment.

OpenStack: an open infrastructure software community

Overall, the summit remains an amazing community space for open infrastructure software and cloud alternatives to the Big 3 players. The Foundation’s pivot to embrace Kubernetes and foster several other open technologies helps maintain the central enthusiasm for open source infrastructure that gave birth to the platform in the first place.

A healthy pragmatic vibe

The summit may not have the same heady taking-on-the-world feeling as the early days; instead, it has a healthy pragmatic vibe. Considering how frothy this space remains, that may be a welcome relief.

Originally posted on Kubernetes Blog. I wanted to repost here because it’s part of the RackN ongoing efforts to focus on operational and fidelity gap challenges early. Please join us in this effort!

We think Kubernetes is an awesome way to run applications at scale! Unfortunately, there’s a bootstrapping problem: we need good ways to build secure & reliable scale environments around Kubernetes. While some parts of the platform administration leverage the platform (cool!), there are fundamental operational topics that need to be addressed and questions (like upgrade and conformance) that need to be answered.

Enter Cluster Ops SIG – the community members who work under the platform to keep it running.

Our objective for Cluster Ops is to be a person-to-person community first, and a source of opinions, documentation, tests and scripts second. That means we dedicate significant time and attention to simply comparing notes about what is working and discussing real operations. Those interactions give us data to form opinions. It also means we can use real-world experiences to inform the project.

We aim to become the forum for operational review and feedback about the project. For Kubernetes to succeed, operators need to have a significant voice in the project by weekly participation and collecting survey data. We’re not trying to create a single opinion about ops, but we do want to create a coordinated resource for collecting operational feedback for the project. As a single recognized group, operators are more accessible and have a bigger impact.

What about real world deliverables?

We’ve got plans for tangible results too. We’re already driving toward concrete deliverables like reference architectures, tool catalogs, community deployment notes and conformance testing. Cluster Ops wants to become the clearing house for operational resources. We’re going to do it based on real world experience and battle tested deployments.

Connect with us.

Cluster Ops can be hard work – don’t do it alone. We’re here to listen, to help when we can and escalate when we can’t. Join the conversation at:

Last Tuesday, I had the honor of joining an OpenStack scale operators dinner. Foundation executives, Jonathan Bryce and Lauren Sell, were also on the guest list so talk naturally turned to “how can OpenStack better support operators.” Notably, the session was distinctly not OpenStack bashing.

The conversation was positive, enthusiastic and productive, but one thing was clear: the OpenStack default “we’ll fix it in the upstream” answer does not work for this group of operators.

What is upstreaming? A sans nuance answer is that OpenStack drives fixes and changes in the next community release (longer description). The project and community have a tremendous upstream imperative that pervades the culture so deeply that we take it for granted. Have an issue with OpenStack? Submit a patch! Is there any other alternative?

Upstreaming [to trunk] makes perfect sense considering the project vendor structure and governance; however, it is a very frustrating experience for operators. OpenStack does have robust processes to backport fixes and sustain past releases and documentation; yet, the feeling at the table was that they are not sufficiently operator focused.

Operators want fast, incremental and pragmatic corrections to the code and docs they are deploying (which is often two releases back). They want it within the community, not from individual vendors.

There are great reasons for focusing on upstream trunk. It encourages vendors to collaborate and makes it much easier to add and expand the capabilities of the project. Allowing independent activity on past releases creates a forward integration mess and could make upgrades even harder. It will create divergence on APIs and implementation choices.

The risk of having a stable, independently sustained release is that operators have less reason to adopt the latest shiny release. And that is EXACTLY what they are asking for.

Upstreaming is a core value to OpenStack and essential to our collaborative success; however, we need to consider that it is not the right answer to all questions. Discussions at that dinner reinforced that pushing everything to latest trunk creates a significant barrier for OpenStack operators and users.

What are your experiences? Is there a way to balance upstreaming with forking? How can we better serve operators?

Last week, Forbes and ZDnet posted articles discussing the cost of various cloud (451 source material behind wall) full of dollar per hour costs analysis. Their analysis talks about private infrastructure being an order of magnitude cheaper (yes, cheaper) to own than public cloud; however, the open source price advantages offered by OpenStack are swallowed by added cost of finding skilled operators and its lack of maturity.

At the end of the day, operational concerns are the differential factor.

The Magic 8 Cube

These articles get tied down into trying to normalize clouds to $/vm/hour analysis and buried the lead that the operational decisions about what contributes to cloud operational costs. I explored this a while back in my “magic 8 cube” series about six added management variations between public and private clouds.

In most cases, operations decisions is not just about cost – they factor in flexibility, stability and organizational readiness. From that perspective, the additional costs of public clouds and well-known stacks (VMware) are easily justified for smaller operations. Using alternatives means paying higher salaries and finding talent that requires larger scale to justify.

Operational complexity is a material cost that strongly detracts from new platforms (yes, OpenStack – we need to address this!)

Unfortunately, it’s hard for people building platforms to perceive the complexity experienced by people outside their community. We need to make sure that stability and operability are top line features because complexity adds a very real cost because it comes directly back to cost of operation.

In my thinking, the winners will be solutions that reduce BOTH cost and complexity. I’ve talked about that in the past and see the trend accelerating as more and more companies invest in ops automation.

I’ve been trying to explain the pain Tao of physical ops in a way that’s accessible to people without scale ops experience. It comes down to a yin-yang of two elements: exploding complexity and iterative learning.

Exploding complexity is pretty easy to grasp when we stack up the number of control elements inside a single server (OS RAID, 2 SSD cache levels, 20 disk JBOD, and UEFI oh dear), the networks that server is connected to, the multi-layer applications installed on the servers, and the change rate of those applications. Multiply that times 100s of servers and we’ve got a problem of unbounded scope even before I throw in SDN overlays.

But that’s not the real challenge! The bigger problem is that it’s impossible to design for all those parameters in advance.

When my team started doing scale installs 5 years ago, we assumed we could ship a preconfigured system. After a year of trying, we accepted the reality that it’s impossible to plan out a scale deployment; instead, we had to embrace a change tolerant approach that I’ve started calling “Apply, Rinse, Repeat.”

Using Crowbar to embrace the in-field nature of design, we discovered a recurring pattern of installs: we always performed at least three full cycle installs to get to ready state during every deployment.

The first cycle was completely generic to provide a working baseline and validate the physical environment.

The second cycle attempted to integrate to the operational environment and helped identify gaps and needed changes.

The third cycle could usually interconnect with the environment and generally exposed new requirements in the external environment

The subsequent cycles represented additional tuning, patches or redesigns that could only be realized after load was applied to the system in situ.

Every time we tried to shortcut the Apply-Rinse-Repeat cycle, it actually made the total installation longer! Ultimately, we accepted that the only defense was to focus on reducing A-R-R cycle time so that we could spend more time learning before the next cycle started.

Simulated Annealing

Simulated Annealing is a modeling strategy from Computer Science for seeking optimum or stable outcomes through iterative analysis. The physical analogy is the process of strengthening steel by repeatedly heating, quenching and hammering. In both computer science and metallurgy, the process involves evaluating state, taking action, factoring in new data and then repeating. Each annealing cycle improves the system even though we may not know the final target state.

Annealing is well suited for problems where there is no mathematical solution, there’s an irregular feedback loop or the datasets change over time. We have all three challenges in continuous operations environments. While it’s clear that a deployment can modeled as directed graph (a mathematical solution) at a specific point in time, the reality is that there are too many unknowns to have a reliable graph. The problem is compounded because of unpredictable variance in hardware (NIC enumeration, drive sizes, BIOS revisions, network topology, etc) that’s even more challenging if we factor in adapting to failures. An operating infrastructure is a moving target that is hard to model predictively.

Crowbar implements the simulated annealing algorithm by decomposing the operations infrastructure into atomic units, node-roles, that perform the smallest until of work definable. Some or all of these node-roles are changed whenever the infrastructure changes. Crowbar anneals the environment by exercising the node-roles in a consistent way until system re-stabilizes.

One way to visualize Crowbar annealing is to imagine children who have to cross a field but don’t have a teacher to coordinate. Once student takes a step forward and looks around then another sees the first and takes two steps. Each child advances based on what their peers are doing. None wants to get too far ahead or be left behind. The line progresses irregularly but consistently based on the natural relationships within the group.

To understand the Crowbar Annealer, we have to break it into three distinct components: deployment timeline, annealing and node-role state. The deployment timeline represents externally (user, hardware, etc) initiated changes that propose a new target state. Once that new target is committed, Crowbar anneals by iterating through all the node-roles in a reasonable order. As the Annealer runs the node-roles they update their own state. The aggregate state of all the node-roles determines the state of the deployment.

A deployment is a combination of user and system defined state. Crowbar’s job is to get deployments stable and then maintain over time.

Ops Late Binding

In terms of computer science languages, late binding describes a class of 4th generation languages that do not require programmers to know all the details of the information they will store until the data is actually stored. Historically, computers required very exact and prescriptive data models, but later generation languages embraced a more flexible binding.

Ops is fluid and situational.

Many DevOps tooling leverages eventual consistency to create stable deployments. This iterative approach assumes that repeated attempts of executing the same idempotent scripts do deliver this result; however, they are do not deliver predictable upgrades in situations where there are circular dependencies to resolve.

It’s not realistic to predict the exact configuration of a system in advance –

the operational requirements recursively impact how the infrastructure is configured

ops environments must be highly dynamic

resilience requires configurations to be change tolerant

Even more complex upgrade where the steps cannot be determined in advanced because the specifics of the deployment direct the upgrade.

Late Binding is a foundational topic for Crowbar that we’ve been talking about since mid-2012. I believe that it’s an essential operational consideration to handle resiliency and upgrades. We’ve baked it deeply into OpenCrowbar design.