In an earlier post we looked into using the Performance Co-Pilot toolkit to explore performance characteristics of complex systems. While surprisingly rewarding, and often unexpectedly insightful, this kind of analysis can be rightly criticized for being “hit and miss”. When a system has many thousands of metric values it is not feasible to manually explore the entire metric search space in a short amount of time. Or the problem may be less obvious than the example shown – perhaps we are looking at a slow degradation over time.

There are other tools that we can use to help us quickly reduce the search space and find interesting nuggets. To illustrate, here’s a second example from our favorite ACME Co. production system.
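Before reaching for specialized tooling, it helps to see what "reducing the search space" means in principle. Below is a minimal, hypothetical sketch (not part of any PCP tool) that ranks metric time series by their coefficient of variation, so the most volatile metrics float to the top of a shortlist for closer inspection. The metric names and sample values are invented for illustration.

```python
# Hypothetical illustration: given recent samples for many metrics, rank
# them by relative variability to shortlist candidates worth a closer look.
import statistics

def shortlist(metrics, top=3):
    """Return metric names ordered by coefficient of variation, largest first."""
    scored = []
    for name, samples in metrics.items():
        mean = statistics.fmean(samples)
        if mean == 0:
            continue  # avoid division by zero for all-zero metrics
        cv = statistics.pstdev(samples) / abs(mean)
        scored.append((cv, name))
    return [name for _, name in sorted(scored, reverse=True)[:top]]

metrics = {
    "disk.reads": [100, 101, 99, 100],  # steady
    "net.errors": [0, 0, 0, 50],        # bursty: probably interesting
    "cpu.user":   [40, 42, 41, 39],     # steady
}
print(shortlist(metrics, top=2))  # → ['net.errors', 'cpu.user']
```

Real systems carry thousands of metrics rather than three, which is exactly why this kind of automated triage (however crude) beats manual eyeballing; the tools discussed below do this far more thoroughly.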

Investigating performance in a complex system is a fascinating undertaking. When that system spans multiple, closely-cooperating machines and has open-ended input sources (shared storage, Internet-facing traffic, and so on), the degree of difficulty of such investigations ratchets up quickly. There are often many confounding factors, with many things going on at the same time.

The observable behaviour of the system as a whole can change frequently even while, at a micro level, things appear the same. Or vice versa – the system may appear healthy, with average and 95th percentile response times in excellent shape, yet a small subset of tasks is taking an unusually long time to complete, just today perhaps. Fascinating stuff!

Let’s first consider the desirable characteristics of the performance tools we’d want at our disposal when exploring an environment like this.