Software @ Scale

The Evolution of Self-Driving IT Ops

A Practical Look at Machine Learning, Augmented Intelligence, and Automation

Self-driving cars make headlines every day. The envisioned future is a car that runs itself, maintains itself, sends alerts when help is needed, and prevents accidents. While opinions of a self-driving car vary from excitement about simplifying the daily commute to “no way would I ever put total control in the hands of a machine,” the concept gives rise to thoughts about self-driving data centers. What would they look like, and how would they change IT as we know it? Reports indicate that enterprises lose $21.8 million per year on average to downtime, and 87 percent expect this figure to increase [1]. For organizations that are trying to manage and optimize increasingly complex hybrid IT environments that span mainframe and multi-cloud infrastructures, could evolving to a self-driven data center provide the keys to driving smarter, faster IT operations and preventing downtime?

Augmented Rather than Artificial Intelligence

For some, the thought of a self-driven data center conjures up scenes from classic sci-fi flicks like WarGames, Terminator, and Tron. But, as Erik Brynjolfsson shares in his TED Talk, the future shouldn’t be about machines and humans competing against each other, but about how they can work together to achieve business objectives through intelligent automation. They are better together.

Self-driving capabilities in IT operations are very real, and are progressing rapidly due to increasing levels of automation coupled with advancements in machine learning and augmented intelligence (AI). It’s important to note that self-driven IT Ops is more about augmenting the operator through data-driven intelligence and automation rather than completely replacing them. AI in this article should be understood as ‘augmented’ intelligence rather than ‘artificial’ intelligence.

Stages of Automation

To explore the concept of self-driven IT Ops further, we will look at how the Society of Automotive Engineers (SAE) breaks down the levels of automation in self-driving cars into six stages – all of which have a direct correlation to the growth and development of the autonomous data center through machine learning and augmented intelligence (Figure 1).

Figure 1 – The Six Stages of Automation [2]

Level 0 – No Automation: This is our starting point. It’s what most of us drive today – cars with no automation, where the person at the wheel is in complete control.

Historically, this has described the corporate data center perfectly. IT operational tasks are performed by people, be it troubleshooting or root cause analysis across a complex technical stack. For example, in a Level 0 data center, if a server runs out of capacity, you must figure out whether it’s an application problem, a server problem, a data flow problem, or something else, and then set about fixing it. Multiple subject matter experts often need to be called in, and the mean time to repair is high because the entire process is reactive in nature.

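Level 1 – Driver Assistance: At Level 1, the car can assist with either steering or speed (adaptive cruise control, for example), but the driver remains responsible for everything else.

In the data center, this translates to basic automation in a single domain, such as monitoring networks. There is some automated data collection, which increases productivity and offers insight to the network expert, but the feedback loop is still delayed and reactive in nature. Complicated issues still typically require “war room” situations that pull in the majority of IT staff.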

Level 2 – Partial Automation: More convenience enters the automobile world at Level 2, with features such as automation of both steering and speed. However, driver engagement is still a “must-have.”

In similar fashion, data centers at Level 2 can recognize “normal” and “abnormal” system behavior. Automation solutions monitor logs and take corrective action on simple problems that have been encountered and addressed in the past. Scripts are typically created for more complex tasks. While IT administrators continue to keep a watchful eye on the data center, routine repetitive tasks can be taken out of their hands.
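As a rough illustration of what recognizing “normal” versus “abnormal” behavior and applying a known fix could look like, here is a minimal Python sketch. It assumes a simple standard-deviation baseline, and the problem names and actions in `KNOWN_FIXES` are hypothetical placeholders rather than part of any particular product.

```python
from statistics import mean, stdev

# Hypothetical runbook: problems seen before, mapped to the fix that worked.
KNOWN_FIXES = {
    "disk_full": "rotate_logs",
    "service_down": "restart_service",
}


def is_abnormal(history, value, threshold=3.0):
    """Return True if value lies more than `threshold` standard
    deviations away from the mean of the historical samples."""
    if len(history) < 2:
        return False  # not enough data to define "normal"
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def respond(problem, history, value):
    """Pick an action: nothing if the metric looks normal, a scripted fix
    if the problem is known, otherwise hand over to a human operator."""
    if not is_abnormal(history, value):
        return "no_action"
    return KNOWN_FIXES.get(problem, "escalate_to_operator")
```

Anything the runbook does not recognize is escalated, which mirrors the Level 2 idea that the administrator stays engaged while routine cases are handled automatically.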

Level 3 – Conditional Automation: Here is where things start getting really interesting. At Level 3, conditional autonomy enters the picture, and cars can handle the road unless the automation system itself fails, in which case the driver intervenes.

Through machine learning, clustering, and correlation, a data center at this level can bring various knowledge feeds to bear on system activity, understand relationships and causality, compare situations to past events, remediate problems in real-time, and even predict issues before they occur while simultaneously recommending appropriate action. Unless the augmented intelligence itself is in jeopardy or an especially acute and unique issue arises, IT administrators are free to focus on other value-added IT projects and leave the system to run itself. At this level, root cause analysis is focused on creating fixes to prevent recurrence and the need for war rooms begins to diminish.
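To make “predict issues before they occur” more concrete, the sketch below fits a straight line through recent usage samples and estimates how many sampling intervals remain before capacity is reached. This is only one simple forecasting approach, chosen for illustration. The function name, evenly spaced samples, and a linear trend are all assumptions, and a real system would likely use richer models.

```python
def forecast_steps_until_full(usage, capacity):
    """Fit a least-squares line through evenly spaced usage samples and
    return how many future intervals remain until the line reaches
    `capacity`. Returns None if there is no upward trend."""
    n = len(usage)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(usage) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(usage))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None  # usage is flat or shrinking
    intercept = y_mean - slope * x_mean
    x_full = (capacity - intercept) / slope  # sample index where line hits capacity
    return max(0.0, x_full - (n - 1))        # intervals beyond the latest sample
```

With daily disk-usage samples, for instance, the return value would be an estimate in days, which gives the system time to recommend an action before the threshold is crossed.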

Level 4 – High Automation: In Level 4 vehicles, no driver interaction is needed. A Level 4 car can stop itself if the systems fail and will be able to drive from point A to point B in most use cases. However, these cars will still include functional driving apparatus (like steering wheels, brakes, and gas pedals), so people can drive manually when they choose to.

In a highly-automated data center, machine learning and predictive intelligence algorithms trigger adaptive automation that encompasses applications, workload inputs, and the whole business environment, reaching past IT with process and release automation.

Level 5 – Full Automation: Some view this as the holy grail, where the driver becomes the passenger and the car is in complete and total control. Although moving from Level 4 to Level 5 may seem like a small step, in reality, it’s a giant leap. Level 5 autonomy takes the driver all the way out of the equation. Automakers have not yet laid out a firm timeline for when Level 5 cars will become mainstream, and many have said that it’s at least a decade or longer away.

In the IT realm, this is the data center that is completely self-sufficient: better at healing itself, optimizing itself, and running itself than would be possible even with the most elite IT operations team. But, as mentioned at the beginning of this article, the near-term reality of a self-driving data center is more about augmented intelligence, a complement—not a replacement—to human intelligence. It’s about helping people (in this case, IT operators) become faster and smarter at the tasks they’re performing, freeing up time for more meaningful business and personal activities.

Moving toward Self-driving IT Ops

With a clear understanding of the evolution of self-driven IT Ops, the question becomes, how do you move forward to whatever level of automation you deem optimal for your business?

As with so much in business today, the center of everything is data. When it comes to machine learning and AI, the broader the set of data, the more opportunity you have to analyze relationships and perform pattern analysis to proactively predict when performance is veering outside the norm, and the more accurate those predictions become. Therefore, it is critical to gather data from multiple sources across your IT infrastructure.

Attempting to manually track data across all your sub-systems and weed out meaningful insights is a daunting, if not impossible, task. Therefore, to make sense of the wealth of operations data you gather, automate data correlation to provide actionable insights. Advanced data science algorithms such as alert clustering can ensure that the information presented to your organization is timely, focused, and relevant. For example, dozens of network alerts over a thirty-minute period may appear normal and go unnoticed. But when they are correlated with numerous mobile app alerts plus multiple CPU spikes, alert clustering can signal that a critical system may be heading toward a slowdown or even an outage.
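The alert clustering example above can be sketched in a few lines of Python. This is a deliberately simplified version that groups alerts by a fixed time window and flags a cluster when several distinct sources are involved. The `Alert` fields, the thirty-minute window, and the three-source rule are illustrative assumptions, not a description of any specific clustering algorithm.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    source: str       # e.g. "network", "mobile_app", "cpu"
    message: str


def cluster_alerts(alerts, window_seconds=1800.0):
    """Group alerts that fall within `window_seconds` of the first
    alert in the current cluster (alerts are sorted by time first)."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if clusters and alert.timestamp - clusters[-1][0].timestamp <= window_seconds:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


def is_critical(cluster, min_sources=3):
    """Flag a cluster when alerts from at least `min_sources`
    different sources occur together."""
    return len({a.source for a in cluster}) >= min_sources
```

Taken alone, each source’s alerts might look routine. It is the co-occurrence across network, application, and CPU within one window that raises the cluster’s priority.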

Traditionally, when triaging an issue, Level 1 IT operators are forced to call in numerous experts if they don’t know the source of a problem. One of the many benefits of using machine learning and embedded intelligence to provide actionable insights is that it helps novice IT Ops staff quickly rule out what isn’t the problem, isolate root causes of issues faster, and only involve the experts who are needed. This increases the productivity of your entire IT staff.

When automating your IT operations, you should deploy a machine learning solution that incorporates tribal knowledge into the data-driven predictions and recommendations (Figure 2). For example, in the case of the mainframe, where many experienced operators are retiring from the work force, look for an intelligent automation system that captures the knowledge of these experts and enables them to correct or assist the automatic recommendations. Sentiment analysis is one way of achieving this goal. In sentiment analysis, operators vote “thumbs up” or “thumbs down” when the system makes a prediction or recommendation. Since user sentiment may be of variable quality, the model gradually learns the sentiment through strength of sentiment and spread.
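One way to picture the thumbs-up/thumbs-down feedback loop is the Python sketch below. It is an interpretation, not the method the article describes in detail: it assumes “strength” can be approximated by the number of votes and “spread” by how divided they are, and it uses a smoothed approval ratio. The class name, threshold, and minimum vote count are hypothetical.

```python
from collections import defaultdict


class FeedbackTracker:
    """Collects operator votes on recommendations and derives a
    smoothed approval score for each one."""

    def __init__(self):
        # recommendation -> [thumbs_up_count, thumbs_down_count]
        self.votes = defaultdict(lambda: [0, 0])

    def vote(self, recommendation, thumbs_up):
        self.votes[recommendation][0 if thumbs_up else 1] += 1

    def approval(self, recommendation):
        """Smoothed share of positive votes. With no votes the score is
        0.5, and it moves toward 0 or 1 only as evidence accumulates."""
        up, down = self.votes[recommendation]
        return (up + 1) / (up + down + 2)

    def should_auto_apply(self, recommendation, threshold=0.8, min_votes=5):
        """Trust a recommendation only when enough operators have voted
        and they largely agree."""
        up, down = self.votes[recommendation]
        return (up + down) >= min_votes and self.approval(recommendation) >= threshold
```

Because a single vote barely moves the score, one careless or mistaken click has little effect, which is one way to cope with feedback of variable quality.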

Summary

Companies across all industries are exploring intelligent automation tools to use their resources better, know their environments better, and make better decisions. Like owning or riding in a self-driving car, many of us are not quite ready to hand over total control to a fully-automated data center. But the ability to dramatically improve IT operations’ speed and agility while maintaining control is a reality today. When it comes to IT operations, we have so much data – spanning mainframe to cloud – but we’re not using it to learn. Automation and machine learning will provide a learning-based approach to help deliver numerous positive outcomes: delighting customers, increasing efficiencies, lowering costs, and addressing skills gaps, to name just a few. These are exciting times, and I encourage you to embrace machine learning and operational intelligence solutions to begin your journey toward a self-driving data center that will fuel your organization in becoming a true digital enterprise. I invite you to reach out to me at vikas.sinha@ca.com.

Greg Lotko

Greg Lotko is Senior Vice President and General Manager of Broadcom’s Mainframe Division, helping enterprises build, operate, automate, and secure their current and next-generation systems to optimize and fuel digital transformation. He joined Broadcom (previously CA Technologies) in August 2017 as Senior Vice President, Software Engineering, Mainframe, where he was responsible for leading engineering for all Mainframe products, including CA’s primary labs around the world in North America, the Czech Republic, and India.

Jeff Henry

Jeff Henry is a Vice President of Strategy and Product Management at Broadcom. Jeff is responsible for Workload Automation across distributed and mainframe systems, and for the IT Operations and Automation mainframe portfolio of products. Jeff is passionate about blending the disciplines of Product Management, Engineering, and Design, placing results-based customer experiences as the primary target in delivering innovative and differentiated value to customers.