Friday, December 17, 2010

Hyper-V Clustering: Guest Connection Problem

Virtualization is pretty neat. It lets you consolidate several servers onto fewer physical machines, which cuts power consumption (making us greener and saving us some money). For server virtualization, Microsoft has Hyper-V. With Windows Server 2008 and Windows Server 2008 R2, we can set up Hyper-V with failover clustering. This is cool since it gives the virtualized environment high availability. In addition, starting with Windows Server 2008 R2, Microsoft added live migration capability to Hyper-V … super cool. Live migration basically allows us to move a virtual system from one host node to another while it is live and running. You don’t have to shut down the virtual system. From a system administrator’s perspective, this is very helpful. Let’s say you need to reboot one of the host nodes (for example, after a Microsoft security patch): you can live migrate the guest systems from that host node to another host node in the cluster, then reboot the host node without causing significant downtime to the guest systems.

Setup

Recently, we had the opportunity to set up Hyper-V clustering on Dell PowerEdge 710 servers connected to a SAN box, using Windows Server 2008 R2 Enterprise Edition. The setup itself went without any problems. We were able to set up a guest (virtual) system and live migrate it back and forth from one host node to another without any problems.

Problem

Everything ran smoothly for a few days, then one day the guest system suddenly dropped off the network. We could not even ping it. The only way to connect to the system was through Hyper-V Manager. The guest system itself was up and running, but its network connection had died or hung. We waited about half a day to see if it would recover by itself, but it did not. The only way to get the network connection back was to restart the guest system. The Windows event log on the guest system gave few clues, and there was nothing on the host system. The only errors we saw in the Windows event log on the guest system were:

The miniport ‘Microsoft Virtual Machine Bus Network Adapter’ hung.

And

The miniport ‘Microsoft Virtual Machine Bus Network Adapter’ reset.
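If you export the guest's System event log to a text file, a few lines of script can tell you how often the adapter hung and reset. This is only a minimal sketch: it assumes the log has been saved as plain text with one message per line, and the function name `find_miniport_events` is made up for illustration.

```python
import re

# Matches the two messages quoted above. Accepts straight or curly
# single quotes around the adapter name.
MINIPORT_EVENT = re.compile(
    r"The miniport\s+['‘’](?P<adapter>[^'‘’]+)['‘’]\s+(?P<state>hung|reset)\b",
    re.IGNORECASE,
)


def find_miniport_events(lines):
    """Return (line_number, adapter, state) for every hung/reset message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = MINIPORT_EVENT.search(line)
        if match:
            hits.append(
                (number, match.group("adapter"), match.group("state").lower())
            )
    return hits


if __name__ == "__main__":
    # Hypothetical exported log lines, used only to demonstrate the parser.
    sample = [
        "The miniport ‘Microsoft Virtual Machine Bus Network Adapter’ hung.",
        "Some unrelated event.",
        "The miniport ‘Microsoft Virtual Machine Bus Network Adapter’ reset.",
    ]
    for hit in find_miniport_events(sample):
        print(hit)
```

Counting the "hung" entries and lining them up with the times of large outbound copies is a quick way to confirm the pattern.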

In terms of patches, as far as we could tell we had downloaded and applied the latest Microsoft patches on both the host and guest systems. We noticed that the problem happened while we were copying large files (500+ GB in total) from the guest system to another system on the network. So we tried that again, and sure enough, the network adapter on the guest system stopped working. Strangely, the problem did not appear when we copied large files from another system on the network onto the guest system. What could it be?
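If you want to repeat the outbound copy test in a more controlled way than dragging files around in Explorer, something like the sketch below could be run inside the guest. It is not what we used at the time. It just copies a file in chunks and reports how much was written and how long it took. The function name `timed_copy` and both paths are hypothetical, and the destination is assumed to be a share on another machine so that the traffic is outgoing from the guest.

```python
import time


def timed_copy(src_path, dst_path, chunk_size=4 * 1024 * 1024):
    """Copy src_path to dst_path in chunks.

    Returns (bytes_copied, seconds_elapsed). If the network connection
    drops in the middle of the copy, the write is expected to raise
    OSError (or stall), which is the symptom we are trying to trigger.
    """
    copied = 0
    started = time.monotonic()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied, time.monotonic() - started


if __name__ == "__main__":
    # Hypothetical paths: a large local file and a share on another system.
    source = r"D:\data\large_file.bin"
    target = r"\\otherserver\share\large_file.bin"
    try:
        size, seconds = timed_copy(source, target)
        print(f"Copied {size} bytes in {seconds:.1f} seconds")
    except OSError as error:
        print(f"Copy failed: {error}")
```

Running it before and after a fix gives you a repeatable test instead of a one-off observation.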

Solution

After searching the Microsoft and Dell sites, we found a Microsoft KB article that seemed to provide a hotfix for our problem. It is KB article 974909 (The network connection of a running Hyper-V virtual machine is lost under heavy outgoing network traffic on a Windows Server 2008 R2-based computer). You can find it here. Well, what do you know … the title of the KB article describes exactly the problem that we were having. So here’s what we did:

Request the hotfix download from Microsoft

Download the hotfix

Back up the guest system

Shut down the guest system

Apply the hotfix on all of the host nodes of the Hyper-V cluster

Restart the Hyper-V host nodes

Restart the guest system

Reinstall the Integration Services on the guest system
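Since the hotfix has to go on every host node in the cluster, it is worth double-checking that none were missed. Here is a minimal sketch of one way to do that. It assumes you have saved each host's installed-update listing as text (for example the output of `wmic qfe get HotFixID` or PowerShell's `Get-HotFix`) and that the hotfix appears in that listing as KB974909. The host names and helper names are made up for illustration.

```python
import re


def installed_hotfixes(listing_text):
    """Extract KB identifiers (e.g. 'KB974909') from a hotfix listing."""
    found = re.findall(r"\bKB\d+\b", listing_text, re.IGNORECASE)
    return {kb.upper() for kb in found}


def hosts_missing_hotfix(listings_by_host, hotfix_id="KB974909"):
    """Return the sorted names of hosts whose listing lacks hotfix_id."""
    wanted = hotfix_id.upper()
    return sorted(
        host
        for host, text in listings_by_host.items()
        if wanted not in installed_hotfixes(text)
    )


if __name__ == "__main__":
    # Hypothetical listings for two host nodes.
    listings = {
        "hvnode1": "HotFixID\nKB974909\nKB2345886\n",
        "hvnode2": "HotFixID\nKB2345886\n",
    }
    print(hosts_missing_hotfix(listings))
```

Any host name it prints still needs the hotfix (and the restart that follows).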

Once those steps were done, we tested the guest system again by copying large files from it to another system on the network. The hotfix seems to have fixed the problem.

One thing I must say: before you attempt to reinstall the Integration Services on the guest system, I strongly recommend that you back up the guest system. One possible way is to create a snapshot of that guest system in Hyper-V Manager. The reason is that when I was reinstalling the Integration Services on one of the guest systems, I ran into the following error during the install process:

An error has occurred: One of the update processes returned error code 61658.

To resolve this error, we had to do the following:

Restore the guest system from the snapshot

While the guest system is off, add a Legacy Network Adapter to the guest system. We did this from Hyper-V Manager.

Turn on the guest system

Install the Integration Services on the guest system

Turn off the guest system

Remove the Legacy Network Adapter

Turn on the guest system

Lots of steps, but somehow that seems to do the trick. We got this solution from Michael Phillip’s blog post. Since then, the guest (virtual) system has been running great, with no more connectivity problems.

Some Notes:

While researching this problem, I came across Michael Hanes’ blog, in which he lists the hotfixes needed for Windows Server 2008 R2. You might want to check his blog out. It can be found here.

You might also want to consider calling Microsoft Support. They might charge you, but they are usually good at troubleshooting this type of problem.