Monday, 1 June 2015

I have pulled together a quick and dirty guide to capturing and interpreting ESXTOP results.

Launching ESXTOP

ESXTOP can be launched from the command line after SSHing into the ESXi host:

esxtop

Metrics and Thresholds

| Display | Metric | Threshold | Explanation |
|---|---|---|---|
| CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive usage of vSMP, or a limit has been set (check %MLMTD). See Jason’s explanation for vSMP VMs. |
| CPU | %CSTP | 3 | Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities. |
| CPU | %SYS | 20 | The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause. |
| CPU | %MLMTD | 0 | The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0, the world is being throttled due to the limit on CPU. |
| DISK | RESETS/s | 1 | The number of commands reset per second. |
| DISK | CONS/s | 20 | SCSI reservation conflicts per second. If many SCSI reservation conflicts occur, performance could be degraded due to the lock on the VMFS. |

NUMA locality: if less than 80, the VM experiences poor NUMA locality. If a VM has a memory size greater than the amount of memory local to each processor, the ESX scheduler does not attempt to use NUMA optimizations for that VM and “remotely” uses memory via the “interconnect”. Check “GST_ND(X)” to find out which NUMA nodes are used.

Aborts: issued by the guest (VM) because storage is not responding. For Windows VMs this happens after 60 seconds by default. This can be caused, for instance, by failed paths or by an array that is not accepting any IO for whatever reason.
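If you want to check a handful of readings against the thresholds listed above without eyeballing them, something like the following rough sketch could do it. The threshold values are the ones from this guide; the function name, the sample readings and the choice to flag only values strictly greater than the threshold are my own assumptions, not anything ESXTOP provides.

```python
# Thresholds as listed in this guide. Treating "over the threshold" as a
# strict greater-than comparison is an assumption made for this sketch.
THRESHOLDS = {
    "%RDY": 10,
    "%CSTP": 3,
    "%SYS": 20,
    "%MLMTD": 0,
    "RESETS/s": 1,
    "CONS/s": 20,
}


def flag_metrics(sample):
    """Return the metrics in `sample` whose value exceeds its threshold."""
    return {
        name: value
        for name, value in sample.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    }


# Hypothetical readings for one VM, typed in by hand from the esxtop screen.
reading = {"%RDY": 14.2, "%CSTP": 0.5, "%SYS": 3.0, "%MLMTD": 0.0}
flagged = flag_metrics(reading)
print(flagged)
```

With the made-up readings above, only %RDY is over its threshold, so it is the only entry reported.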

%VMWAIT: A derivative of %WAIT that represents only the hardware and SWAP waiting time, and is therefore a better metric than %WAIT when diagnosing performance issues such as those involving storage controllers.

%WAIT: Represents the waiting time for devices (e.g. storage controller), SWAP waiting time AND %IDLE time, so it should not be taken at face value!

%RUN: Represents the percentage of total time scheduled for the world to run. %USED = %RUN + %SYS - %OVRLP. When the %RUN value of a virtual machine is high, it means the VM is using a lot of CPU resource.
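To make the %USED relationship concrete, here is a tiny worked example. The function name and the input numbers are invented for illustration; only the formula itself comes from the text above.

```python
def used_pct(run_pct, sys_pct, ovrlp_pct):
    """%USED = %RUN + %SYS - %OVRLP, with all values as percentages."""
    return run_pct + sys_pct - ovrlp_pct


# Hypothetical values read off the CPU screen for a single world.
example = used_pct(run_pct=62.0, sys_pct=4.5, ovrlp_pct=1.5)
print(example)  # 65.0
```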

ESXTOP Toggles

c = cpu
m = memory
n = network
i = interrupts
d = disk adapter
u = disk device (includes NFS as of 4.0 Update 2)
v = disk VM
p = power states
V = only show virtual machine worlds
e = Expand/Rollup CPU statistics, show details of all worlds associated with group (GID)
k = kill world, for tech support purposes only!
l = limit display to a single group (GID), enables you to focus on one VM
# = limit the number of entities, for instance the top 5
2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
e = statistics broken down per world
6 = statistics broken down per world

Exporting results from ESXTOP

From the command line we can run ESXTOP in batch mode (-b), here sampling every 2 seconds (-d 2) for 250 iterations (-n 250):

esxtop -b -d 2 -n 250 > esxtopout.csv
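Once the CSV is off the host, a few lines of Python can summarise it. The sketch below averages every column whose header contains a given piece of text. It assumes the export has a header row of perfmon-style counter names followed by one row per sample; the host name, VM name and counter names in the sample data are made up, so check the header of your own export before relying on them.

```python
import csv
import io
from statistics import mean


def average_counters(csv_text, needle):
    """Average every column whose header contains `needle`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if needle in name]
    columns = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            # Skip short rows and empty cells rather than failing on them.
            if i < len(row) and row[i] != "":
                columns[header[i]].append(float(row[i]))
    return {name: mean(values) for name, values in columns.items() if values}


# Made-up two-sample excerpt; a real export has far more columns.
SAMPLE = (
    '"(PDH-CSV 4.0) (UTC)(0)",'
    '"\\\\esx01\\Group Cpu(1001:web01)\\% Ready",'
    '"\\\\esx01\\Group Cpu(1001:web01)\\% Used"\n'
    '"06/01/2015 10:00:00","4.0","35.0"\n'
    '"06/01/2015 10:00:02","8.0","45.0"\n'
)

averages = average_counters(SAMPLE, "% Ready")
for name, value in averages.items():
    print(name, value)
```

To run it against a real capture, read the file first, for example `average_counters(open("esxtopout.csv").read(), "% Ready")`.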

Interpreting results from ESXTOP

You can hook directly into ESXTOP with a utility called VisualESXTOP, rather than having to manually export its results. It builds graphs that make the data a little easier to interpret.