How IT Ops can avoid the DevOps capacity crunch

DevOps and digital transformations have brought an unprecedented increase in the pace and volume of daily change in IT. While this may sound like great news to development and product groups, IT operations management is often alarmed by the potential risk that its already overloaded workforce will be squeezed beyond the breaking point.

IT Ops finds itself squeezed between the "go-go-go, deploy-deploy-deploy" demands of DevOps and digital transformations and the "Don't be the next hack! Don't be the next outage! Slow down!" demands of today’s business environment.

The result is a labor-capacity crunch that can undermine your business’s ability to operate. Since the pressures causing this crunch show no sign of abating, IT operations must rethink how it handles both planned and unplanned work. Self-service operations represent a design pattern that you can use to relieve your labor-capacity crunch and simultaneously improve organizational agility and response times.

Unplanned work kills capacity

You can look to DevOps and agile for guidance on handling an increased pace and volume of planned work. However, IT operations organizations are struggling with the increasing load of unplanned work—outages, performance issues, security issues, unexpected load, and other incidents.

Unplanned work and the interrupt-driven nature of IT operations are at the heart of most labor-capacity challenges. Unplanned and planned work mix like oil and water. Interruptions of planned work by unplanned work, and the costly context-switching that comes with them, prevent teams from completing the planned project work that the business needs to move forward.

Unplanned work has a ripple effect throughout the organization in the form of cascading escalations (e.g., from Level 1 teams to Level 3 teams) and compounding schedule slippage: project A is delayed by outage A, which in turn delays project B, which depends on project A. That in turn delays project C, which needs people from project B and requires deliverables from project A.

While IT operations management cannot fully prevent unplanned work from wreaking havoc with its schedules, the IT Ops team needs to find a different way to work that limits its impact.

Current models for handling operations support

Most unplanned work arrives in the form of operational support tasks generated during and after the deployment of an application, service, or environment. Currently, there are two divergent schools of thought for handling operational support:

You build it. They run it.

This is the more traditional operating model, where development/delivery teams deliver parts of the system to a separate team responsible for running those parts as a coherent, functioning system.

This model scales well with the project-based funding model that most enterprises use. The strategy is to move expensive developers on to the next business-driven project and move support and maintenance to lower-cost engineers.

In reality, this methodology often results in hand-off problems that lead to a high rate of errors. Also, development/delivery teams aren’t protected as planned and are frequently interrupted as they are pulled into support issues.

You build it. You run it.

In this newer operating model, development/delivery staff consist of fully cross-functional teams dedicated to managing the full build, deploy, and run lifecycle for specific parts of the system. The team benefits from operating as one tight unit throughout the end-to-end lifecycle, without hand-offs or breaks in context.

Teams working in this manner generally operate at a high velocity and maintain higher quality. Unfortunately, this model operates contrary to how many large enterprises are structured and funded. As a result, if you don’t fundamentally change how the business works, you can quickly get to the point where dedicated, integrated teams have too many services to maintain (or legacy dependencies are too great), team capacity is overrun, and the model collapses.


Amazon and Netflix are often cited as examples of companies that made the required structural changes to their businesses and moved wholesale to a “You build it. You run it” model. Unfortunately, most enterprises aren’t in a position to restructure their business to achieve a new IT operating model.

There is a happy medium, however. Companies such as Ticketmaster and Equifax are blending the best of both IT operating models. As part of their overall IT transformations, these IT operations teams are deploying the processes and tools needed to get the labor-scaling benefits of "You build it. They run it" where needed while also getting the responsiveness and control of "You build it. You run it." A critical mechanism in this blended model is a design pattern called “self-service operations.”

The many benefits of self-service operations

Self-service operations help relieve the IT operations capacity crunch while simultaneously improving organizational agility and incident-response times. Self-service operations enable IT Ops to safely shift activity to where its workforce can be best utilized. This includes moving the ability to take action closer to the problem or party in need.

The key to self-service operations is that they allow IT operations to divide and distribute the three essential parts of an automated operations procedure: the definition of the procedure, the ability to execute it, and control over the security and management policies that govern it.

This split enables popular scenarios such as having the development organization define operations procedures (e.g., restart for development's applications), the IT operations group vet those procedures, and the security organization control where the procedure can run and who can run it (e.g., development or perhaps another dedicated IT Ops support team).
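The three-way split described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference to any particular product: the names Procedure, Policy, SelfServiceCatalog, and the role and environment strings are all hypothetical, and a real implementation would add authentication, audit logging, and persistent storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass(frozen=True)
class Procedure:
    """Definition of an automated procedure (hypothetical; e.g., written by development)."""
    name: str
    action: Callable[[str], str]  # takes a target name, returns a result message


@dataclass
class Policy:
    """Who may run a procedure and where (hypothetical; e.g., owned by security)."""
    allowed_roles: Set[str] = field(default_factory=set)
    allowed_environments: Set[str] = field(default_factory=set)


class SelfServiceCatalog:
    """Keeps definition, vetting, policy, and execution as separate steps."""

    def __init__(self) -> None:
        self._procedures: Dict[str, Procedure] = {}
        self._vetted: Set[str] = set()
        self._policies: Dict[str, Policy] = {}

    def define(self, procedure: Procedure) -> None:
        # Any team can contribute a definition; it is not runnable yet.
        self._procedures[procedure.name] = procedure

    def vet(self, name: str) -> None:
        # IT operations approves a procedure that has already been defined.
        if name not in self._procedures:
            raise KeyError(name)
        self._vetted.add(name)

    def set_policy(self, name: str, policy: Policy) -> None:
        # Security decides who can run the procedure and in which environments.
        self._policies[name] = policy

    def execute(self, name: str, role: str, environment: str, target: str) -> str:
        if name not in self._procedures:
            raise KeyError(name)
        if name not in self._vetted:
            raise PermissionError(f"{name} has not been vetted")
        policy = self._policies.get(name)
        if (
            policy is None
            or role not in policy.allowed_roles
            or environment not in policy.allowed_environments
        ):
            raise PermissionError(f"{role} may not run {name} in {environment}")
        return self._procedures[name].action(target)


# Example: development defines a restart, operations vets it, security scopes it.
catalog = SelfServiceCatalog()
catalog.define(Procedure("restart-app", lambda target: f"restarted {target}"))
catalog.vet("restart-app")
catalog.set_policy("restart-app", Policy({"developer", "l1-support"}, {"staging"}))
print(catalog.execute("restart-app", "developer", "staging", "billing-api"))
```

The point of the sketch is that no single call grants everything: a procedure only runs once it has been defined, vetted, and covered by a policy that matches the caller's role and environment.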

Rather than keeping all of the pressure on operations, safely shift activity to where the IT Ops workforce can be best utilized.

Improve the capacity of teams doing operational support

Operations support teams can create standard operating procedures for known problems and expected events. By creating a central repository to catalog and collaborate on automated procedures, teams reduce manual effort and reduce variability—a key to reducing errors and rework.

Enable other teams to do more specialized, advanced work

In any enterprise, there are key personnel and specialists who get pulled into a disproportionate share of incidents. The result is bottlenecks and delays, even when it looks on paper as though there should be plenty of labor capacity to go around.

The self-service design pattern helps you protect the capacity of these internal experts. You need them to stay focused on work that moves the company forward, rather than being swamped with repetitive requests. By capturing specialized knowledge in a reusable and shareable way, you can safely delegate those repetitive tasks to others.

With the self-service model, IT operations support teams handle more and more requests on their own. This directly cuts down on escalations and protects the capacity of specialists who need to focus on project work. Teams also respond to incidents and requests more quickly when they don’t have to escalate.

Protect the capacity of specialists by having them define procedures that a broader set of people can safely execute on demand.

Enable the safe, secure, and effective delegation of operational support work

Self-service operations let you safely give a wide range of teams access to tasks that were formerly entrusted only to a handful of people in operations. This could mean letting developers or QA engineers do deployments or run previously approved procedures to respond to production incidents. In addition to providing labor-capacity improvements, this also allows delivery teams to streamline how they work, speed up the end-to-end delivery lifecycle, and work in the tightest feedback loops possible.

By deploying self-service operations, IT Ops can support the streamlined, rapid pace of working espoused by modern development practices while maintaining the high level of quality, security, and reliability that is expected of a modern IT operations organization.

Allow the people who know best how to operate a component (the team that created it) to provide operations with the procedures to operate it.

If you follow newer industry advice by handing off code (scripts or tools) that the receiving party can execute, you can use the tooling implemented to support your self-service operations model and provide a controlled, standardized way to hand off knowledge in the form of vetted, tested automated procedures. This allows your organization to move faster and removes the bottlenecks, long delays, and self-inflicted errors that come with the traditional, manual hand-offs between delivery and operations support teams.

Self-service operations: The tactical and strategic advantages

Applying self-service operations to your organization brings the immediate tactical benefit of helping you relieve the pressure of the capacity crunch brought on by the pace of modern application lifecycles. Beyond those short-term benefits, self-service operations bring a strategic advantage in that you will have a more flexible organization capable of moving faster while maintaining control. Self-service operations constitute a straightforward, powerful design pattern that should be in every IT leader's playbook.

Want to know more? I'll be discussing self-service operations and the future of IT operations at the upcoming DevOps Enterprise Summit London session “Better, Faster, Cheaper: What Does It Mean for Ops?” Can't make it? Post your questions below, and I'll do my best to answer them.