Can Machine Learning Help You Manage vSphere?

In 2008, the communications technology provider Alcatel-Lucent (soon to be part of Nokia) acquired a consumer analytics firm called Motive. Some of the dozen or so people who kept track of that move wondered what a telecom service provider would want with a consumer-facing analytics tool.

The answer was something no one expected: A-L leveraged Motive’s machine learning algorithms to diagnose service failures in its IP communications network, as well as to detect failures before they happened.

Consumer-grade analytics had become strong enough to find trouble spots better than any program purpose-built for the task.

Well, if we apply that same logic to our own data centers, should we be using consumer-grade analytics as a monitoring tool for administering infrastructure systems as complex as VMware’s vSphere? It’s a problem being tackled by SIOS Technology, to which CMSWire introduced you last April.

Patterns of Learning

The company's product, SIOS IQ, provides the requisite, tablet-ready, touch-sensitive monitoring dashboard for VMware environments, particularly those that are also running Microsoft SQL Server instances, at least for these initial releases.

What’s different is that IQ boasts the ability to detect problems using machine learning algorithms. As SIOS Chief Operations Officer Jerry Melnick told CMSWire, IQ scans the event logs produced throughout the entire VMware infrastructure, including storage, networking, memory, CPU and, where feasible, application behavior.

“We’re identifying patterns of behavior in the system,” said Melnick. “First we identify the relationships of all the objects in the system — which virtual machines are talking to other virtual machines, and which storage devices.

“When we see a pattern of behavior become abnormal, as defined by our observations over a period of time,” he continued, “that’s when we’re pro-actively alerting, and identifying their root cause.”
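To make the idea of "relationships of all the objects" concrete, here is a minimal sketch in Python. It is not SIOS's implementation. The object names, the `observed_traffic` list and both helper functions are hypothetical, invented only to show how a map of who talks to whom can point to everything that depends on one troubled storage device.

```python
from collections import defaultdict

# Hypothetical observations: (source object, object it talks to).
observed_traffic = [
    ("vm-web-01", "vm-sql-01"),
    ("vm-web-02", "vm-sql-01"),
    ("vm-sql-01", "datastore-a"),
    ("vm-report-01", "datastore-a"),
]

def build_relationships(pairs):
    """Map each object to the set of objects it was seen talking to."""
    graph = defaultdict(set)
    for source, target in pairs:
        graph[source].add(target)
    return graph

def dependents_of(graph, target):
    """List objects that reach `target` directly or through other objects."""
    found = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for source, targets in graph.items():
            if current in targets and source not in found:
                found.add(source)
                frontier.append(source)
    return sorted(found)

relationships = build_relationships(observed_traffic)
# Every object that could be affected if datastore-a misbehaves.
print(dependents_of(relationships, "datastore-a"))
```

In this toy example, a slowdown on `datastore-a` would implicate the SQL Server VM and, through it, both web VMs, which is the kind of chain a root-cause tool would need to follow.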

In this context, “behavior” is a mathematical pattern. In any system defined by transactions, a set of metrics can be attached to those transactions to measure efficiency, latency, time to completion, number of repetitions, total time elapsed and similar quantities.

Any scientific experiment regarding observed patterns of behavior in systems makes no assumptions about what to expect in advance. Just ask any first-term congressperson.

If you’ve had any experience with machine learning, you’ll recall that in order for an algorithm to identify abnormal behavior with any degree of reliability, it must first know what normal behavior looks like. That’s part of the training of any ML system. Unless it has some seed data to help it form an idea of which behaviors deserve alert notices, it won’t know which behavior is good and which is bad.
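A rough illustration of why the baseline has to come first: the sketch below learns "normal" from a window of observations, then flags values that stray too far from it. It uses a simple standard-deviation test, which is an assumption chosen for clarity and not a description of SIOS IQ's algorithms. The latency figures and the threshold of 3.0 are made up.

```python
from statistics import mean, stdev

def learn_baseline(samples):
    """Summarize a window of observations as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_abnormal(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # No spread was observed, so any different value counts as abnormal.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical storage latencies (ms) observed during the learning period.
observed_latency_ms = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
baseline = learn_baseline(observed_latency_ms)

print(is_abnormal(4.4, baseline))   # slightly high, still within range
print(is_abnormal(12.0, baseline))  # far outside anything seen so far
```

Without the `observed_latency_ms` window there would be nothing to compare 12.0 against, which is the point of the seed data.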

Super Vision

So how does SIOS IQ draw any conclusions about the environment it’s studying without the kind of rigorous “supervised training” sessions that supercomputer program architects give to their AI algorithms?

Melnick acknowledged to us that SIOS’ baseline for “normal behavior” comes from its observations of the state of the virtualization platform at the time it’s first launched. That could be an issue for data centers that may have been plagued with bad behavior for a long period of time, without their users ever really knowing it’s bad.

“If we report a behavior as anomalous, and you do nothing about it, over time the learning set incorporates that into ‘normal,’” said Melnick. “Yes, you will get some alerts over that period of time, but if you keep your hands off of them or tell us they’re not anomalous, we’re going to learn that that’s the case.”

If you’ve ever used a software-based firewall that learns which applications are permitted to use open ports by alerting you about them the first time it sees them, you’ll get the idea. The training period for SIOS IQ may very well be a hands-on affair, during which it constructs a training set.

“We get more efficient over time the more we learn,” he continued.
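The way an ignored anomaly turns into the new "normal" can be sketched with a sliding window. This is a simplified model under stated assumptions, not SIOS's method: the `AdaptiveBaseline` class, its window of 20 samples and its threshold are hypothetical, and it models only the "do nothing" path, not explicit feedback from an administrator.

```python
from collections import deque
from statistics import mean, stdev

class AdaptiveBaseline:
    """Toy detector whose learning set keeps absorbing what it observes."""

    def __init__(self, window=20, threshold=3.0):
        self.history = deque(maxlen=window)  # oldest samples fall off
        self.threshold = threshold

    def observe(self, value):
        """Return True if value looks abnormal, then add it to the learning set."""
        abnormal = False
        if len(self.history) >= 5:  # wait for a few samples before judging
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma == 0:
                abnormal = value != mu
            else:
                abnormal = abs(value - mu) / sigma > self.threshold
        self.history.append(value)
        return abnormal

detector = AdaptiveBaseline()
for latency in [4.0, 4.2] * 10:      # hypothetical healthy period
    detector.observe(latency)

# Latency jumps to 9.0 ms and nobody intervenes.
alerts = [detector.observe(9.0) for _ in range(25)]
print(alerts[0], alerts[-1])
```

The first 9.0 ms reading raises an alert. By the end of the run the window holds nothing but 9.0 ms readings, so the same value no longer stands out.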

One possibility that SIOS is exploring, he mentioned, is for IQ to anonymize that training set and share it with a broader, public network. This way, different sets of “normal” behavior may be compared against one another right away, and perhaps a crowd-sourced notion of normality may emerge from the process. He did emphasize, however, that this feature has not yet been made available in SIOS’ July release.

The free edition of SIOS IQ offers limited insight into VMware environments using SQL Server. The commercial version, said Melnick, will go somewhat deeper, and may in future releases probe into what’s called the “Deep Root Causes” of networking bottlenecks.

For now, he says, IQ is a first-response tool of sorts, expediting the triage process to help admins decide what to do next.