Quick Lesson: VM Performance Checklist for Troubleshooting

Let’s set the stage for this VMware troubleshooting lesson…

It’s 2 o’clock in the morning and you just received a call from the NOC. Hundreds of users from around the world are calling the help desk and reporting very slow performance [and some are unable to log in].

Welcome to virtualization admin reality!

For many virtualization admins this is a common event, and having a plan or checklist for troubleshooting VM performance is key to getting back to sleep.

5 Basic VM Troubleshooting Steps

Over the years I’ve learned what works best in these situations.

Having a basic VM performance checklist for troubleshooting, one that helps me analyze the stack from the bottom up (environment to user) or top down (user to environment), has been the key.

So let’s get the lesson started…

First, Gather Information

1. Before you do anything, get a complete update from the NOC or help desk on what’s happening and who is impacted, and try to narrow it down to a location or application pool.

The reason you need to narrow it down is that many applications span more than one data center (or cloud) location, and the issue could be caused by a patch installed on a server halfway across the world, meaning the wrong admin was pulled out of bed.

Once you verify the issue is within your realm of responsibility and you understand the impact, you’re ready to start troubleshooting.

Is the outage due to too many users trying to use the system? Sometimes systems fail because there aren’t enough resources to handle the hordes of users visiting after an advertising campaign [this really does happen when marketing and IT are out of sync].

Is there an outage with one of the service providers [Internet and telecom providers have outages too]?

Begin Troubleshooting Steps

2. As I said, I start from the bottom of the stack and work upward [depending on your environment you can start at the top and work downward, too].

The first thing I usually do is make sure there’s power and check whether something has failed.

Are there any alarms or alerts in your email from equipment?

What alerts has the NOC been seeing on their monitors? Many environments have SiteScope or other monitoring tools.

Is the issue due to a hardware failure? Power supply, disk, server crash, etc.

Is there a dependency offline? An expired license or a third-party vendor issue.

Once you have checked the alarms and verified the hardware, move up the stack if the problem still exists.

3. Now it’s time to drill down and see if something is going on with network, storage or servers.

Is there network congestion caused by backups running or something similar?

Are backups thrashing storage and using up all your IOPS?

Is your ESXi host maxed because too many VMs are sucking up all the memory and CPU?

Is something else going on that is causing performance issues?
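As a rough sketch, the checks at this layer boil down to comparing a handful of counters against sane cutoffs. The counter names and thresholds below are illustrative rules of thumb from my own checklist, not official VMware limits, so tune them for your environment:

```python
# Sketch: flag common infrastructure red flags from host/VM counters.
# Counter names and thresholds are rules of thumb, not VMware-official.

THRESHOLDS = {
    "cpu_ready_pct": 5.0,    # sustained CPU ready over ~5% per vCPU hurts
    "mem_balloon_mb": 0.0,   # any ballooning means memory pressure
    "disk_latency_ms": 25.0, # sustained device latency over ~25 ms is trouble
    "net_drop_pct": 0.1,     # dropped packets suggest congestion
}

def infra_red_flags(counters):
    """Return only the counters that exceed their threshold."""
    return {
        name: value
        for name, value in counters.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    }

# Example: a host suffering during the backup window.
sample = {"cpu_ready_pct": 2.1, "mem_balloon_mb": 512.0,
          "disk_latency_ms": 48.3, "net_drop_pct": 0.0}
print(infra_red_flags(sample))  # flags ballooning and disk latency
```

You’d feed this from whatever your monitoring exports (esxtop batch output, vCenter stats, etc.); the point is to decide the cutoffs before the 2 a.m. call, not during it.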

Once you have finished checking the infrastructure, move up the stack and log into vCenter if the problem still exists.

Next, Start the Deep Dive

4. Log into vCenter and check the VMs that are in the application pool having the issue.

Are the VMs powered on? Sometimes a server will accidentally get powered off after patching or code updates.
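A quick sketch of that first check, assuming you’ve exported a VM list from vCenter (in practice you’d pull this via PowerCLI or the pyVmomi SDK; here it’s hard-coded sample data):

```python
# Sketch: list VMs in the app pool that are not powered on.
# "poweredOn"/"poweredOff" mirror vSphere power-state strings; the
# sample inventory below is made up for illustration.

def powered_off(vms):
    return [vm["name"] for vm in vms if vm["power_state"] != "poweredOn"]

vms = [
    {"name": "web01", "power_state": "poweredOn"},
    {"name": "web02", "power_state": "poweredOff"},  # missed after patching
    {"name": "app01", "power_state": "poweredOn"},
]
print(powered_off(vms))  # ['web02']
```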

How does the performance chart look on memory, CPU, storage latency or network?

Do you see anything else going on in vCenter that could be causing VM performance issues?

Because vSphere shares resources, sometimes a bully VM can use up all the resources and strangle the virtual servers sharing the same host or datastore. Look for signs of other VMs having issues.

If vCenter shows that performance on the VMs and shared resources looks good, then move up the stack and start logging into the VMs if the problem still exists.

Diving Deeper

5. Traditional application stacks have at least 3 servers [database, application, and web], but this can scale into hundreds of servers in an app pool, so becoming familiar with your environment is key to knowing where to start.

Since our example is about users complaining of slowness, let’s start with the web server and check it first since it’s at the front-end.

Did you notice any latency when logging into the server?

Or maybe the login failed and the server is frozen and needs to be rebooted?

After logging in, how’s the VM performance? Check Perfmon or run top.

Are there any stopped services that need to be restarted, such as IIS or Apache?

Is there a resource issue like memory, CPU or storage space?

If all looks good, can you ping the gateway?

Is storage latency too high?

Is antivirus scanning or backups running?

Did you check the logs for errors?
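When the logs are long and the clock is ticking, a quick grep-style scan gets you to the interesting lines first. A minimal sketch, assuming plain-text logs; the patterns and sample lines are examples to adjust for your apps:

```python
# Sketch: scan a server log for likely error lines so you can triage
# before reading the whole file. Patterns and sample data are examples.
import re

ERROR_PAT = re.compile(r"\b(ERROR|FATAL|CRITICAL|timeout|refused)\b", re.I)

def scan_log(lines):
    """Yield (line_number, line) for every line matching an error pattern."""
    for n, line in enumerate(lines, 1):
        if ERROR_PAT.search(line):
            yield n, line.rstrip()

sample = [
    "2024-01-15 02:03:11 INFO request served in 84 ms",
    "2024-01-15 02:03:12 ERROR upstream timeout contacting db01",
    "2024-01-15 02:03:13 WARN connection refused by 10.0.0.5",
]
for n, line in scan_log(sample):
    print(n, line)  # prints the two problem lines with their line numbers
```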

Now go through each server in the stack and check the same things. If nothing can be found, then move up the stack again and get other teams involved.

Warm Hand-off…

Never just assume the NOC or help desk knows what is going on when an incident is in progress. Keep them updated as you work your way through the issue.

And once you have ruled out the infrastructure, VM, and OS as the root cause of the performance issue or outage, it’s time to transfer ownership of the issue to the next layer of the stack.

Make sure you do a warm hand-off and someone acknowledges they are picking up the ball now.

For example, if the DBA is now checking the DB stack, make sure they know to do a clean pass-off to the application team next, and keep the NOC updated.

This will save you from getting called again once you have returned to bed because the issue was left hanging unresolved and nobody knew you were done with your part.

Single Point of Support

Has a gigantic report been running for hours and taking up all the resources?

Is the application hanging?

Was there a code update pushed recently?

Did a developer make an unapproved change?

Was the system hacked?

Lessons Learned

I did say quick, so this brings us to the end of this VM performance troubleshooting lesson. And yes, there are many other steps you can add to the checklist if I missed something unique to your vSphere cloud.

My goal here was to provide a high-level guide for beginner admins and help them understand the need for a troubleshooting plan or checklist to follow… rather than jumping head-first into the abyss of a chaotic incident call.