A Quantum Theory of IT Monitoring

There are certain things that are true in the quantum world but make no sense in our everyday reality. I remember, in an advanced physics course in college, having to calculate the likelihood that a baseball thrown at a window would pass through, emerge on the other side, and leave both the ball and the window intact, due to quantum tunneling. I was astonished to find that while the odds of this happening are infinitesimally small, they are not zero.

Never mind that you'd have to keep throwing the ball at the window (ignoring breakage) for longer than the universe has existed to have even a remote chance of observing this; the odds are not zero, and they can be calculated. And at the sub-atomic level, unlike the level of everyday physical objects, this type of behavior isn't just common, it is expected. That small fact has stuck with me for decades as a great illustration of how odd the quantum world truly is.

What, then, does that possibly have to do with IT monitoring? It might be a stretch, but I think the new world of applications, which we call Web Scale, is in some ways as strange to traditional monitoring products as the world of quantum behavior is to baseballs, windows, and normal humans.

Let me explain. In the past, we built applications that were not so sensitive to small changes in infrastructure performance, for two main reasons. First, our users had very low expectations. From batch, to time sharing, to PC networks, to early web applications, we became accustomed to waiting for a screen to advance, an hourglass to spin, a web page to update. But over the past couple of years, our expectations have changed. Movies stink when they stall, missed stock quotes can cost us real money, and we hang voraciously on our phones and tablets for real-time updates of everything from sporting events to natural disasters, to pictures and updates from loved ones, to new orders from customers.

Second, we just got tired of the standard practice of over-provisioning data centers for peak loads, running at 50% capacity or less to ensure performance. Despite falling hardware costs, our appetites for data and applications just kept growing. So we virtualized everything, and when we tapped out the efficiency gains there, we went to the cloud, much as we stopped building power plants inside office buildings decades ago, so we could "scale" on demand and share in the economies of scale of computing experts.

Yet while the entire infrastructure changed and the costs of performance delays and degradations increased, we happily kept monitoring things every five minutes or so, or even every hour, checking for the same things we always had: capacity, resource utilization, and the like. Today, users scream and customers leave over five-second delays. Outages of streaming information cost us money. The "quantum" of time we care about has shrunk dramatically to match the new infrastructure, the new applications, and the new user expectations. We live in a real-time world, yet we continue to monitor our last architecture.

Which brings me to another engineering theorem buried deep in my memory. The Nyquist–Shannon sampling theorem, in its simplest form, says that in order not to lose information, you need to sample at least twice as fast as the fastest change you want to capture. Sample any slower, and your reconstructed signal suffers from "aliasing", a loss of information.
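
To make that concrete, here's a minimal sketch in Python (my own illustration with made-up numbers, not anything from an actual monitoring tool): a signal that oscillates once per second, sampled both faster and slower than the Nyquist rate.

```python
import numpy as np

# A hypothetical "event" that repeats once per second (a 1 Hz signal).
# Nyquist says we must sample faster than twice that frequency, i.e.
# more often than every 0.5 seconds, to avoid losing information.
def signal(t):
    return np.sin(2 * np.pi * 1.0 * t)

t_fast = np.arange(0, 5, 0.25)   # every 0.25 s (4 Hz): above the Nyquist rate
t_slow = np.arange(0, 5, 0.75)   # every 0.75 s (~1.3 Hz): below the Nyquist rate

print("0.25 s samples:", np.round(signal(t_fast), 2))  # tracks the real oscillation
print("0.75 s samples:", np.round(signal(t_slow), 2))  # aliased: looks like a much slower wave
```

Sampled every 0.25 seconds, the readings trace the real once-per-second oscillation. Sampled every 0.75 seconds, the same event shows up as a phantom wave about three times slower; sampled every five minutes, it would not show up at all.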

Today’s Web-Scale IT architecture and demanding users care about changes and delays that last a few seconds, sometimes even less. If our quantum of caring is now measured in a second or two, Nyquist and common sense say we had better be capturing and processing monitoring data every second or so as well.

Last-generation IT monitoring solutions simply can't capture and process data fast enough. They can stare all day at the baseball, but it will never tunnel through the window. But unlike our quantum baseball example, the slow sampling of infrastructure monitoring data leaves us blind to things that do happen and that we actually care about: stalled video, missed quotes, lost business opportunities, service delays and outages that cost us money.
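
Here's what that blindness looks like, again as a rough Python sketch with invented numbers rather than a real monitoring feed: an hour of per-second latency readings containing a single three-second outage. A poller on a five-minute interval can land entirely around the outage and never know it happened.

```python
import random

# One hour of per-second latency readings (ms), hovering around 50 ms,
# with a hypothetical 3-second outage just past the 20-minute mark.
random.seed(42)
latency = [50 + random.random() * 5 for _ in range(3600)]
for t in range(1201, 1204):
    latency[t] = 5000  # the outage: latency spikes to 5 seconds

def worst_seen(series, interval_s):
    """Poll every `interval_s` seconds and report the worst value observed."""
    return max(series[::interval_s])

print("5-minute polling sees a peak of %.0f ms" % worst_seen(latency, 300))  # ~55 ms: outage invisible
print("1-second polling sees a peak of %.0f ms" % worst_seen(latency, 1))    # 5000 ms: outage caught
```

The five-minute poller reports a perfectly healthy hour; the one-second poller catches the three seconds your users were screaming about.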

Our new math of IT monitoring needs solutions like Boundary’s, ones that measure in seconds. That fact is as plain and simple to see as the shattered window that I am staring at right now.