Win 2K8 with PVSCSI Critical Issue

I’ve recently started to see reports of I/O errors when running very high I/O workloads on Windows 2008 and Windows 2008 R2 VMs. Mostly this was during artificial benchmark tests run against MS SQL, and against Exchange 2010 with Jetstress. However, it could impact production workloads. Upon further investigation, it appears these I/O errors are a known defect in a certain version of the PVSCSI driver that comes with VMware Tools, and the defect can affect vSphere 4.0 U1, 4.1 and 5.0. Here I’ll cover more about this potentially serious issue and how to fix it.

Although the referenced KB articles describe a situation with SQL, it is possible for this to happen under any high I/O workload on the impacted versions of Windows, including, for example, Exchange. The information available right now doesn’t mention Windows 7 VMs. Win7 VMs are generally less likely to see the high I/O workloads that Exchange and SQL servers do, but the PVSCSI driver in Win7 is still affected by this problem and should be updated. VDI desktops could simply be re-provisioned if they experienced this issue.

What makes this issue potentially serious is that in the worst-case (rare) scenario it could lead to data corruption. This makes it very important that you upgrade or patch your vSphere environment to address the defect. With ESXi 5.0, the fix is included in Update 1. For ESXi 4.1, you should deploy Patch 04, described in VMware KB 2009144 – VMware ESXi 4.1 Patch ESXi410-201201402-BG: Updates VMware Tools.
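If you want to check an inventory against the versions named above, something like the following Python sketch might be a starting point. It is only an illustration: the `VM` record, its field names, and the `triage` function are hypothetical, and the `host_patched` flag is something you would have to fill in yourself, since this post does not list the affected build numbers. The release-to-fix table contains only what is stated in this post.

```python
from dataclasses import dataclass

# Fixes named in this post, keyed by ESXi release. 4.0 U1 is listed as
# affected, but the post does not name its fix, so it maps to None.
FIXES = {
    "4.0 U1": None,
    "4.1": "Patch 04 (KB 2009144, ESXi410-201201402-BG)",
    "5.0": "Update 1",
}

# Guests the post describes as affected (Windows 7 shares the driver problem).
AFFECTED_GUESTS = {"Windows 2008", "Windows 2008 R2", "Windows 7"}


@dataclass
class VM:
    """Hypothetical inventory record; field names are assumptions."""
    name: str
    esxi_release: str   # e.g. "4.1" or "5.0"
    guest_os: str       # e.g. "Windows 2008 R2"
    controller: str     # e.g. "pvscsi" or "lsilogic-sas"
    host_patched: bool  # True if the fix above is already applied


def triage(vm: VM) -> str:
    """Return a short verdict for one VM based on the post's criteria."""
    if vm.controller != "pvscsi":
        return "ok: not using PVSCSI"
    if vm.esxi_release not in FIXES:
        return "ok: release not listed as affected"
    if vm.guest_os not in AFFECTED_GUESTS:
        return "ok: guest not listed as affected"
    if vm.host_patched:
        return "ok: fix already applied"
    fix = FIXES[vm.esxi_release]
    if fix is None:
        return "at risk: check the VMware KB for the fix for this release"
    return f"at risk: apply {fix}"


if __name__ == "__main__":
    inventory = [
        VM("sql01", "4.1", "Windows 2008 R2", "pvscsi", False),
        VM("exch01", "5.0", "Windows 2008", "pvscsi", True),
        VM("file01", "4.1", "Windows 2008 R2", "lsilogic-sas", False),
    ]
    for vm in inventory:
        print(f"{vm.name}: {triage(vm)}")
```

A verdict of "ok" here only means the VM falls outside the combinations described in this post, not that it is free of I/O problems.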

About the Author

Michael is Technical Director, Business Critical Applications Engineering at Nutanix. He has been using VMware products since 1998 and has been deploying ESX solutions since 2002. He specializes in designing virtualization solutions for Unix to Linux migrations, business critical applications, disaster avoidance, and mergers and acquisitions. Michael has been in the IT industry since 1995 and consulting since 2001. Michael is Nutanix Platform Expert (NPX) #007. In addition to VMware Certified Design Expert (VCDX) he holds VCP, VCP-Cloud, VCAP-DCD, VCAP-DCA, VCAP-CID, VCAP-CIA, ITIL Foundation, MCP I, and MCSE (NT4 – 2K3).

Thank you for the helpful post! I'm looking into standardizing a Windows Server 2012 Datacenter template and was curious about this adapter. I may steer clear of this for now. Data integrity comes before performance any day of the week in my book, and I would assume this might not be the only risk from using the PVSCSI adapter.

I agree data integrity comes before performance. In this case the bug has been fixed, I've tested it, and I have a number of customers who have deployed PVSCSI successfully without any issues. Given that Windows Server 2012 isn't supported on any version of vSphere that has this bug, you're pretty safe to use it. But as with everything, you should do some testing yourself and also ensure the benefits are what you're expecting. The results I posted on my Fusion IO testing were with Windows 2008 R2 using the PVSCSI driver. Without PVSCSI the latency was 3x higher, but we are talking about something that was 300 microseconds becoming 1 millisecond. This bug was definitely a surprise from left field, as I'd used PVSCSI quite a bit in the work I do with business critical applications. Fortunately it wasn't a problem for long and VMware got the fix out. I didn't hear of any actual instances of data corruption in production environments caused by this bug either, as it was picked up during customer project test phases.

Actually, we discovered this bug back in June of 2011 and we were one of the original cases reporting it. VMware stated at the time that it was a problem with our configuration, but we proved it was PVSCSI by jetstressing a VM on SSD disks and generating the errors; when we switched to LSI Logic SAS, the errors went away. We encountered this issue in production with our Exchange 2007 VMs. Furthermore, the likelihood of the issue increased when we added more than 7 VMDKs to an individual PVSCSI adapter. We ended up using 1 LSI Logic adapter for C: (due to support statements by Microsoft at the time) and 3 PVSCSI adapters with all of our VMDKs spread across them, but that only reduced the frequency of the problem, not the problem itself. We rolled back to LSI on our Exchange servers and haven't switched back yet, but we do have a large number of SQL and Indexing Crawlers using PVSCSI with I/O loads in excess of 2500 IOPS on SSD (as seen by the guest VM) without any issue. We are now comfortable with the PVSCSI adapter and will be moving our Exchange servers back to it in our next maintenance window.

For the past two weeks, one or more Oracle instances have been terminated every other day. At first it seemed to be related to a disk issue. I did some research and found articles suggesting that VMware's paravirtual adapters might be causing the issue.

We changed the SCSI controllers back from Paravirtual to LSI Logic SAS for the disks that contain the Oracle datafiles, logfiles, controlfiles, etc., but not for the OS disk.

However we are still experiencing the terminations.

Should we also change the controller of the disk where Windows is installed?

What version of vSphere are you running? Not all versions have this problem. In fact only one build has this problem. So if you have the most recent patches it will likely be another cause. Have you looked in the VM and host logs and have you logged a support request with VMware Support?

We don't have the most recent patches. The article says we have to go to 4.1, but then we would have to patch our SAN as well. That's why we chose to change back to LSI Logic SAS instead. It might not be enough.

To rule out the vSCSI adapters, I'd recommend you change them all back to LSI Logic SAS. But I think it's likely you're not being impacted by this specific problem and that your processes are being terminated for another reason. Think about when the problems first started and what might have changed at the time. Given that VMware supports the full stack, including the Oracle components, you should log a support request with VMware Support.

[…] corruptions being reported in the windows event logs and SQL logs. Similar to those reported in Win 2K8 with PVSCSI Critical Issue. Although I don’t have conclusive evidence that the PVSCSI driver was the cause, in all cases […]

Disclaimer

The views expressed anywhere on this site are strictly mine and not the opinions and views of VMware or anyone else. All content is provided without any form of warranty, explicit or implied, for informational purposes and for use at your own risk.