Gathering performance data on a virtual Windows server

When troubleshooting a potential storage-related performance problem on a virtual Windows server, the analysis is more difficult because many virtual hosts share the same LUN for a datastore in ESX. EMC’s Analyzer and Control Center Performance Manager only give me statistics on specific disks or LUNs, so I have no visibility into a specific virtual server with those tools. When this situation arises, I run a Windows batch script directly on the server that uses the typeperf command line utility to gather data for a specific time period. Typically I’ll let it run for 24 hours and then analyze the data in Excel, where it’s easy to make charts and graphs to see what’s going on.

Sometimes the most difficult thing to figure out is the correct syntax for the command and which parameters to use. For reference, here are the typeperf parameters:

-c { Path [ path ... ] | -cf FileName } : Specifies the performance counter path to log. To list multiple counter paths, separate each counter path by a space.
-cf FileName : Specifies the file name of the file that contains the counter paths that you want to monitor, one per line.
-f { csv | tsv | bin } : Specifies the output file format. File formats are csv (comma-delimited), tsv (tab-delimited), and bin (binary). Default format is csv.
-si interval [ mm: ] ss : Specifies the time between samples, in the [mm:] ss format. Default is one second.
-o FileName : Specifies the pathname of the output file. Defaults to stdout.
-q [ object ] : Displays and queries available counters without instances. To display counters for one object, include the object name.
-qx [ object ] : Displays and queries all available counters with instances. To display counters for one object, include the object name.
-sc samples : Specifies the number of samples to collect. Default is to sample until you press CTRL+C.
-config FileName : Specifies the pathname of the settings file that contains command line parameters.
-s computer_name : Specifies the system to monitor if no server is specified in the counter path.
/? : Displays help at the command prompt.
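
As a rough illustration of how these parameters combine, the Python sketch below assembles a typeperf argument list from the options described above. The function name build_typeperf_args and the file names in the usage note are hypothetical; only the -cf, -f, -si, -sc and -o switches come from the list.

```python
def build_typeperf_args(counter_file, output_file, interval_seconds, samples, fmt="csv"):
    """Assemble a typeperf command line as a list of arguments.

    Hypothetical helper: it only builds the argument list and does not
    run typeperf itself.
    """
    if fmt not in ("csv", "tsv", "bin"):
        raise ValueError("format must be csv, tsv, or bin")
    if interval_seconds < 1 or samples < 1:
        raise ValueError("interval and sample count must be at least 1")
    return [
        "typeperf",
        "-cf", counter_file,          # file with one counter path per line
        "-f", fmt,                    # output file format
        "-si", str(interval_seconds), # time between samples, in seconds
        "-sc", str(samples),          # number of samples to collect
        "-o", output_file,            # output file path
    ]
```

On a Windows server the resulting list could be passed to subprocess.run, for example build_typeperf_args("counters.txt", r"C:\Collection\disk.csv", 60, 1440) for one sample per minute over 24 hours.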

EMC’s Analyzer vs. Windows Perfmon Metrics

I tend to look at response time, disk queue length, total/read/write IO, and service time first. I dive into how to interpret many of the SAN performance metrics in my older post here.

The counters you’ll choose in Windows Performance Monitor aren’t named the same way as the metrics we commonly look at in EMC’s tools. In addition, the same disk counters are available under both the ‘LogicalDisk’ and ‘PhysicalDisk’ objects, so you need to decide which one to select.

What is the difference between the Physical Disk and Logical Disk performance objects in Perfmon, and why monitor both? Their counters are calculated the same way but their scope is different. I generally use both “\LogicalDisk(*)\” and “\PhysicalDisk(*)\” when I run my perfmon script.

The Physical Disk performance object monitors disk drives on the computer. It identifies the instances representing the physical hardware, and the counters are the sum of the access to all partitions on the physical instance.

The Logical Disk performance object monitors logical partitions. Performance Monitor identifies logical disks by their drive letter or mount point. If a physical disk contains multiple partitions, this counter will report the values just for the partition selected and not for the entire disk. On the other hand, when using Dynamic Disks, a logical volume may span more than one physical disk. In that scenario the counter values include the access to the logical disk on all the physical disks it spans.

Here are the Performance Monitor counters that I frequently use, and how they compare to EMC’s Navisphere Analyzer (or ECC):

“\LogicalDisk(*)\Avg. Disk Queue Length” : (Named the same as EMC) The average number of outstanding requests when the disk was busy.
“\LogicalDisk(*)\%% Disk Time” : (No direct EMC equivalent) The “% Disk Time” counter is the “Avg. Disk Queue Length” counter multiplied by 100. It is the same value displayed in a different scale.
“\LogicalDisk(*)\Disk Transfers/sec” : Total Throughput (IO/sec). The total number of individual disk IO requests completed over a period of one second. We’ll use this value to help determine Disk Service Time.
“\LogicalDisk(*)\Disk Reads/sec” : Read Throughput (IO/sec).
“\LogicalDisk(*)\Disk Writes/sec” : Write Throughput (IO/sec).
“\LogicalDisk(*)\%% Idle Time” : (No direct EMC equivalent) This counter provides a very precise measurement of how much time the disk remained in idle state, meaning all the requests from the operating system to the disk have been completed and there are zero pending requests. We’ll also use this to calculate disk service time.
“\LogicalDisk(*)\Avg. Disk sec/Transfer” : Response time (sec). EMC uses milliseconds, Windows uses seconds, so you’ll see 8ms represented as .008 in the results.
“\LogicalDisk(*)\Avg. Disk sec/Read” : Response times for read IO.
“\LogicalDisk(*)\Avg. Disk sec/Write” : Response times for write IO.
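
If you would rather keep the counters in a separate file and pass it with -cf, the list above can be generated for both objects. The Python sketch below is one way to do that; the names DISK_COUNTERS, counter_paths and write_counter_file are hypothetical. It assumes that a counter file is read by typeperf directly rather than by the batch interpreter, so the counters are written with a single percent sign instead of the doubled one used inside a batch script.

```python
# Counter names as they appear in Performance Monitor (single percent sign).
DISK_COUNTERS = [
    "Avg. Disk Queue Length",
    "% Disk Time",
    "Disk Transfers/sec",
    "Disk Reads/sec",
    "Disk Writes/sec",
    "% Idle Time",
    "Avg. Disk sec/Transfer",
    "Avg. Disk sec/Read",
    "Avg. Disk sec/Write",
]


def counter_paths(objects=("LogicalDisk", "PhysicalDisk"), instance="*"):
    """Return one counter path per object/counter combination."""
    return [
        f"\\{obj}({instance})\\{name}"
        for obj in objects
        for name in DISK_COUNTERS
    ]


def write_counter_file(path, paths):
    """Write counter paths to a text file, one per line, as -cf expects."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(paths) + "\n")
```

Calling write_counter_file("counters.txt", counter_paths()) produces 18 lines, nine for LogicalDisk and nine for PhysicalDisk.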

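Disk Service Time is calculated in two steps. First, Disk Utilization = 100 - % Idle Time. Then, Disk Service Time = Disk Utilization / Disk Transfers/sec. Because utilization is a percentage, divide it by 100 first if you want the result in seconds per IO.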
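
A worked example may help here. The Python sketch below applies the formula to a single sample; the function name disk_service_time is hypothetical, and it assumes utilization is converted from a percentage to a fraction so that the result comes out in seconds per IO, the same unit as “Avg. Disk sec/Transfer”.

```python
def disk_service_time(idle_percent, transfers_per_sec):
    """Estimate disk service time in seconds per IO for one sample.

    idle_percent is the "% Idle Time" value (0 to 100) and
    transfers_per_sec is the "Disk Transfers/sec" value.
    Returns None when there was no IO in the sample.
    """
    if transfers_per_sec <= 0:
        return None
    # Assumption: convert percent to a fraction so the result is in seconds.
    utilization = (100.0 - idle_percent) / 100.0
    return utilization / transfers_per_sec
```

For a sample with 50% idle time and 100 transfers per second, the disk was busy for half of each second while completing 100 IOs, which works out to 0.005 seconds (5 ms) per IO.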

Configuring the Script

This batch script collects all of the relevant data for disk activity. As configured here, it runs for 24 hours and writes the data to a CSV file. The length of time is controlled by the combination of the “-sc” and “-si” parameters. To collect data in one-minute intervals for 24 hours, you’d set -si to 60 (collect data every 60 seconds) and -sc to 1440 (1440 one-minute samples = 24 hours). To collect data every minute for 30 minutes, you’d enter “-si 60 -sc 30”. This script assumes you have a local directory on the C: drive named ‘Collection’.
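
The arithmetic behind those two parameters can be sketched in a few lines of Python. The function name sample_count is hypothetical; it simply divides the total collection time by the sample interval to get the value for -sc.

```python
def sample_count(duration_minutes, interval_seconds):
    """Return the -sc value needed to cover duration_minutes at a given -si."""
    total_seconds = duration_minutes * 60
    if interval_seconds <= 0:
        raise ValueError("interval must be a positive number of seconds")
    if total_seconds < interval_seconds:
        raise ValueError("duration is shorter than one sample interval")
    # Whole samples only; any remainder shorter than one interval is dropped.
    return total_seconds // interval_seconds
```

With this helper, sample_count(24 * 60, 60) gives 1440 and sample_count(30, 60) gives 30, matching the two examples above.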

I took a look this morning. I copied and pasted the script and initially received the same error as you. The error is related to syntax: a single percent sign is needed on a variable when running directly from the command line, while two percent signs are needed when running from within a script (%A vs. %%A). That wasn’t the issue in this case, however, because the script already used two percent signs. I believe it was due to an encoding error when copying and pasting the script directly from the web page. When I manually typed the same “@for” commands into a brand new script, it worked fine. I recopied the script into the original post and placed it in a preformatted code window, so hopefully it will work now if you try again. If not, manually retyping it seems to work.