About this blog

AIXpert Blog is about the AIX operating system from IBM running on POWER-based machines called Power Systems, and related software such as IBM Systems Director, PowerVM for virtualisation and PowerSC for security, plus performance monitoring and nmon.

FAQ2: Analyzing Large Volumes of nmon Data

I regularly get asked a question like this: I have 4 months of data from 25 machines and have to develop a Capacity Planning model to size these LPARs onto new machines, but I am having problems handling so much data. What can you recommend? We need graphs of:

CPU compared to Entitlement

Physical CPU use

Maximum real memory use

Network MB/s

Disk MB/s

Disk IOPS

Sometimes this data is needed as input to the Workload Estimator tool or server consolidation tools.

My standard response is: You now understand that Performance Monitoring and Tuning level data is NOT what you really need if you are doing Capacity Planning!

Followed by: Have you also collected the nmon -x data? This collects data at a 15-minute sample rate, so just 96 samples a day - this is what you need for Capacity Planning. Note: below I assume this reduced data is not available.
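For reference, collecting that reduced data is a one-line job, shown here as a crontab sketch (the file paths are assumptions - check your nmon version's help output for the exact flags it supports):

```shell
# Crontab entry to collect capacity-planning-grade data daily at midnight.
# -f = spreadsheet (.nmon file) output, -s 900 = one snapshot every
# 15 minutes, -c 96 = 96 snapshots (one day), -m = output directory.
# The "nmon -x" shorthand sets roughly this interval and count for you.
0 0 * * * /usr/bin/nmon -f -s 900 -c 96 -m /var/perf/nmon
```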

You have roughly 3000 nmon files of (let us guess here) 10 MB each = 30 GB of raw data, or something like 300 GB as Excel data. Microsoft Excel would explode with a handful of these files. The nmon consolidator has limits on the number of files, the size of files and, in particular, the number of snapshots in the files.

There are also hidden traps:

If you average out the CPU, network, RAM and disk statistics, you will dilute the peaks to meaningless low averages - and it's the peaks you really want.

If you take just the peaks then you will find every LPAR has tiny periods at 100% CPU and RAM, and you will not see whether the peaks happen at the same time of day across machines or not.

I do NOT like the use of anything relative to Entitlement - please use Physical CPU used - otherwise you cannot add up the CPU requirements across LPARs.
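The first two traps are easy to demonstrate with made-up numbers. A minimal sketch (the series below is hypothetical - one short 95% spike in an otherwise idle period):

```shell
# Hypothetical CPU-busy series: mostly idle with one short peak.
data="2 3 2 95 3 2 2 3"
# Averaging dilutes the 95% peak down to a meaningless low number...
avg=$(echo $data | awk '{s=0; for(i=1;i<=NF;i++) s+=$i; printf "%.1f", s/NF}')
# ...while the maximum alone says "busy" with no sense of duration or timing.
max=$(echo $data | awk '{m=0; for(i=1;i<=NF;i++) if($i>m) m=$i; print m}')
echo "average=$avg max=$max"
```

The average hides the spike completely, and the maximum alone cannot tell you how long or how often the machine was that busy - which is why looking at whole busy periods beats both.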

So what can I recommend? Well, it is a complex question and there is no simple answer. If there were, I would take out a patent and retire!

Some approaches are as follows - I attempt to make some of these humorous:

Just do one day - Ask the users for the busiest day in the past 4 months and just look at that - silently ditch the rest of the data.

Just do them all - By hand, look at each day's worth of nmon files (25 of them) and work through all 120 days. If you work hard that is roughly 180 hours = roughly 5 weeks of work, plus two weeks off sick due to headaches.

Fix the world - Fix the nmon consolidator and Microsoft Excel while you are there, and run it on a PC with 300 GB of RAM. Your fellow nmon specialists will thank you for a long time to come - just don't expect any cash from them!

Script to pick out the data you need - If the data is highly consistent (same OS level, similar numbers of CPUs and disk drives) then you could attempt awk, sed and grep scripts to start reducing the data set, so you could aggregate the 25 machines into data files, one for each statistic (CPU, disks and so on), then load the CSV data into Excel or a similar tool for graphing.
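A minimal sketch of that extraction idea, using a made-up miniature nmon file (real files carry many more columns; here PhysicalCPU is assumed to be field 3 of the LPAR lines, as the header suggests):

```shell
# Build a tiny fake nmon file (two snapshots) to demonstrate the idea.
cat > host1.nmon <<'EOF'
LPAR,Logical Partition host1,PhysicalCPU,virtualCPUs,logicalCPUs
ZZZZ,T0001,12:00:00,01-JAN-2013
LPAR,T0001,1.25,4,8
ZZZZ,T0002,12:15:00,01-JAN-2013
LPAR,T0002,2.50,4,8
EOF

# Pull out just the PhysicalCPU column (field 3 on LPAR data lines,
# which have a Tnnnn snapshot tag in field 2) into one small CSV
# per machine, ready to load into a spreadsheet.
awk -F, '$1 == "LPAR" && $2 ~ /^T[0-9]/ {print $2 "," $3}' host1.nmon > host1_physcpu.csv
cat host1_physcpu.csv
```

The same one-liner, pointed at the CPU_ALL, MEM, NET or DISK lines instead, gives one small per-statistic CSV per machine that Excel can handle comfortably.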

Use nmon2web - Pour the data into nmon2web (if you have not used it before this would take some time to set up - it creates rrdtool databases on a web server and displays the results in a browser) and get the nmon2web front end to aggregate the days and machines as you want. This is a workable solution but has some "up front" costs in setup, and assumes you have a web server and hands-on skills.

Make it someone else's problem - Hire IBM or IBM Business Partner services and make it their problem!

Do one week and cross check - Ask the users for a busy week, then look at just those 175 graphs (25 machines x 7 days) and create a spreadsheet of the core numbers you want. Then sanity check that the peaks are roughly the same in other weeks. For example, check that the online peak really is normally on Friday at 2 pm, that the other machines are busy on Wednesday mornings, that the heavy batch jobs run on specific servers between 10 pm and 1 am, and that the network backup is always done by 7 am.

RDBMS - Stuff all the nmon data into an RDBMS and use RDBMS tools to extract the data in the format you need - thankfully not my (Mr nmon's) problem.

Go for 3rd party tools - There are third-party performance monitoring tools (for example, from the Midrange Performance Group, whose default collector for AIX, VIOS and Linux is nmon) and capacity planning tools that will take nmon data. That is a very nice option if you already have the tools available, but most come at a cost due to the marvellous data handling and modelling functions they supply.

Use rrdtool to consolidate the data - Lastly, the nmon2rrd tool could be used to extract the data from each machine into longer term rrdtool databases and the graphs generated from there. This would require some rrdtool and scripting knowledge.
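A rough sketch of what those rrdtool steps could look like (assuming rrdtool is installed, and that a separate script turns the nmon ZZZZ timestamps into the epoch seconds rrdtool wants - the database and field names here are invented):

```shell
# One database per LPAR: a PhysicalCPU gauge at a 15-minute (900 s) step,
# keeping roughly 4 months (11712 x 900 s) of averages and of maxima.
rrdtool create host1_cpu.rrd --step 900 \
    DS:physcpu:GAUGE:1800:0:U \
    RRA:AVERAGE:0.5:1:11712 \
    RRA:MAX:0.5:1:11712

# Feed in one sample (epoch-timestamp:value), then graph the whole period.
rrdtool update host1_cpu.rrd 1357038000:1.25
rrdtool graph host1_cpu.png --start now-120d \
    DEF:c=host1_cpu.rrd:physcpu:AVERAGE \
    LINE1:c#0000ff:"PhysicalCPU"
```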

I don't think any of these is a simple solution to the large problem here, and I am not missing some obvious trick - this is a genuine, large problem faced by many.

Final thoughts:

If you have some magic bullet for fixing this problem that I have missed, then please let us have it.

Perhaps you already have a script to extract the key global stats (as seen at the top of this blog entry) from nmon data.

Or a simple way to produce, say, 15-minute stats from 10-second nmon data.

An nmon user group supported project for method 10 would be good. Anyone got 4 months of data from one machine? My machines are pretty boring - they are not production machines and not regularly busy.
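On the downsampling point, a minimal awk sketch of turning 10-second samples into 15-minute stats (the input here is a fabricated, simplified CPU_ALL series with the busy% assumed to be in field 3; real nmon files carry more columns, with timestamps on the ZZZZ lines):

```shell
# Fake a 10-second CPU_ALL series: 180 samples = 30 minutes, quiet
# for the first 15 minutes then busy for the next 15.
awk 'BEGIN { for (i = 1; i <= 180; i++)
               print "CPU_ALL,T" i "," (i <= 90 ? 10 : 50) }' > fine.csv

# Reduce to 15-minute stats: every 90 ten-second samples become one
# line holding the window average and the window peak, so both the
# level and the spikes survive the reduction.
awk -F, '$1 == "CPU_ALL" {
    sum += $3; if ($3 > max) max = $3; n++
    if (n == 90) { printf "%.1f,%.1f\n", sum/90, max; sum = max = n = 0 }
}' fine.csv > coarse.csv
cat coarse.csv
```

Keeping a peak column alongside the average is the point: a plain average at 15 minutes would quietly reintroduce the peak-dilution trap described above.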