Process Monitoring -- Expecting too much?

In general, how accurate should one expect 3rd party monitoring tools to be with respect to process-level CPU reporting?

In other words, I'm used to looking at TOPAS and/or NMON to view running processes and how much CPU they are eating. I understand that TOPAS and NMON can report the numbers in slightly different ways. I'm evaluating a 3rd party tool that uses the perfstat library to retrieve performance data, among many other things, for display. My issue is that its process-level numbers are nowhere close to what TOPAS or NMON show. The vendor claims "Overall AIX data collection, reporting, and interpretation are complex and almost subjective in virtual (LPAR) situations; hence the agent will not be able to provide matching statistics to the AIX cmd 'topas'." All of my systems are shared, uncapped systems.

I guess I'd like to know if I'm expecting too much. Isn't TOPAS pretty much the default tool for peeking at current process performance stats on a running instance of AIX? They went on to mention that "Topas is not part of the base AIX install so we cannot assume it always being present, nor can we easily obtain information from 'topas' through a remote connection."

Don't VMSTAT, SAR, NMON, & TOPAS all get their data from the same place?

If I'm not expecting too much, are there any 3rd party tools that are recommended as being consistent with and complementary to the built-in AIX performance tools?

Thanks for any feedback. I'm really trying to make sure I'm not just being unreasonable in my expectations that process level reporting in this tool be close to TOPAS or NMON.

Re: Process Monitoring -- Expecting too much?

‏2014-04-22T15:00:42Z

Hi,

I don't dare give a definitive answer, because we have a similar dilemma. One thing to take into account is that the amount of raw CPU provided to the LPAR by the firmware varies (in shared mode, of course). It can be below the entitled capacity one instant and above it the next. So the basic question is: what is 100% CPU utilization?

Still, there are tools to find out (or even graph) the actual CPU time provided to the LPAR, and this can be used as another source to track the LPAR's actual CPU usage.

Re: Process Monitoring -- Expecting too much?

Agreed; in a shared uncapped environment, it can be very hard to do trending analysis when the actual entitlement could vary over time.

However, when taking a snapshot over a fixed window, say a 5 or 10 minute interval, TOPAS and NMON each at least tell me what they are doing.

TOPAS gives percentages relative to what I'm being given. If that's over my entitlement, no matter: process utilization is shown relative to what I am given. It will also show me if I'm being given more than my entitlement.

NMON will tell me process utilization relative to the number of logical cores allocated to me.
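To make the difference between those two viewpoints concrete, here is a minimal Python sketch. It takes the description above at face value (one view relative to the capacity actually given, one relative to the number of logical CPUs); it is not a reconstruction of how TOPAS or NMON compute anything internally. The function name, parameters, and sample numbers are all hypothetical.

```python
def utilization_views(proc_cpu_seconds, interval_seconds,
                      physical_consumed, logical_cpus):
    """Express one process's CPU use against two different denominators.

    All inputs are hypothetical illustration values:
      proc_cpu_seconds  - CPU seconds the process used during the interval
      interval_seconds  - length of the sampling interval
      physical_consumed - average physical processors given to the LPAR
      logical_cpus      - number of logical CPUs configured in the LPAR
    """
    cpus_used = proc_cpu_seconds / interval_seconds  # average CPUs in use
    return {
        # "relative to what I'm being given" style of view
        "relative_to_consumed": 100.0 * cpus_used / physical_consumed,
        # "relative to the number of logical CPUs" style of view
        "relative_to_logical_cpus": 100.0 * cpus_used / logical_cpus,
    }


views = utilization_views(proc_cpu_seconds=5.0, interval_seconds=10.0,
                          physical_consumed=1.25, logical_cpus=8)
for name, pct in views.items():
    print(f"{name}: {pct:.2f}%")
```

The same process activity comes out as 40% in one view and 6.25% in the other, which is why a tool needs to state its denominator before its numbers can be compared with anything else.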

If I'm being sold a 3rd party tool, I expect it to cover the basics and take an NMON or TOPAS style of viewpoint; if it takes neither, and the numbers clearly don't match NMON or TOPAS, then the vendor had better do a darn good job of explaining what they are claiming to be utilizations across processes. That just can't be too much to ask, can it? <sigh>

It is very frustrating to have very powerful systems with complex virtualized configurations, and to have someone saying they can help monitor our systems with some fancy graphics and numbers thrown around on animated web pages. And when I question the numbers (from a test environment *I* created), they claim "it's very complex" and "it'll never match what TOPAS or NMON says" ... "there are too many variables" ... well, you can see how irritating it can be.

And to be fair, I'm not claiming there isn't value in some of the other functions provided by this software and others out there. What I do demand is that if you say you do something, you do it, do it correctly, and prove that you do it correctly.

I think more than anything I was looking to make sure I wasn't being unreasonable ;-)

Re: Process Monitoring -- Expecting too much?

‏2014-04-29T17:57:47Z

It is reasonable to ask the vendor "how are you computing process utilization?"

All of the tools (ps, topas and nmon) ultimately use the legacy Unix counters in procinfo.h (the getprocs64() system call). In kernel speak, this is known as sysproc. This code isn't normalized for "PURR" (SMT dispatch "tics") activity at a micro level, and PURR is the real measurement of how many cycles something executed. As far as I recall:

ps - reports utilization relative to the number of logical CPUs, whether they are dispatched or not, and thus treats each logical thread as though it were 100%. So, with 4 VPs and 4-way SMT = 16 lcpus, 15.7% of (16 x 100%) = ~2.5 cores.
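The arithmetic in that ps example can be written out as a small Python sketch. The function name is hypothetical, and the result is in logical-CPU equivalents, because ps (as described above) counts every logical thread as a full 100%.

```python
def ps_percent_to_cpus(pct_of_all_lcpus, virtual_processors, smt_threads):
    """Convert a percentage of all logical CPUs into logical-CPU equivalents."""
    logical_cpus = virtual_processors * smt_threads  # e.g. 4 VPs x SMT4 = 16
    return pct_of_all_lcpus / 100.0 * logical_cpus


# The example from the post: 15.7% on a 4 VP, 4-way SMT LPAR.
cpus = ps_percent_to_cpus(15.7, virtual_processors=4, smt_threads=4)
print(f"{cpus:.2f}")  # 2.51, i.e. the "~2.5" quoted above
```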

Pretty much everything else in perfstat is derived from "modern" kernel interfaces, and perfstat actually abstracts this now (/usr/include/libperfstat.h, perfstat_process_t), providing user and system CPU time per process. But these are exactly the same values as those from getprocs64(). These values are converted to the variable number of "tics" consumed each second by the dispatcher under virtualization.

Basically, when you're dedicated, all the VPs dispatch equally, so every second you have the same number of tics. You get process tics from these data structures and it's easy to figure out.

In virtualization, the VP dispatch time varies all the time. So you have to account for that (say every second) and then you have to account for the delta in process time over the same interval to figure out how much that process took of the variable number of dispatcher tics.
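As a rough illustration of that delta-over-delta calculation, here is a minimal Python sketch. It does not call libperfstat or getprocs64(); the `Sample` fields and the numbers are hypothetical stand-ins for cumulative counters that a collector would read once per interval.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of cumulative counters (hypothetical field names)."""
    timestamp: float                # wall-clock seconds
    proc_cpu_seconds: float         # cumulative user + system time of the process
    lpar_dispatched_seconds: float  # cumulative processor time dispatched to the LPAR


def process_share(prev: Sample, curr: Sample) -> float:
    """Percent of the LPAR's dispatched time that the process used in one interval."""
    proc_delta = curr.proc_cpu_seconds - prev.proc_cpu_seconds
    lpar_delta = curr.lpar_dispatched_seconds - prev.lpar_dispatched_seconds
    if lpar_delta <= 0:
        return 0.0  # nothing was dispatched, so there is no meaningful share
    return 100.0 * proc_delta / lpar_delta


samples = [
    Sample(0.0, 10.0, 100.0),
    Sample(1.0, 10.5, 101.0),  # LPAR got 1.0 s of processor time, process used 0.5 s
    Sample(2.0, 11.0, 103.0),  # LPAR got 2.0 s of processor time, process used 0.5 s
]
shares = [process_share(a, b) for a, b in zip(samples, samples[1:])]
print(shares)  # [50.0, 25.0]
```

The process does the same amount of work in both intervals, yet its share drops from 50% to 25% because the denominator doubled. That is the effect the varying VP dispatch time has on any per-process percentage.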

Actually, these process structs are new to libperfstat (after my time). Send this to the developer: