Asked by

Windows Server 2012 BSOD caused by vmswitch.sys

General discussion

The situation is the following: we have two Windows Server 2012 hosts with Hyper-V installed. There are approximately 10 VMs on each server, and these VMs replicate in both directions.

We had two NICs in a team (switch independent) connected to one switch. The interfaces are in access mode.

On Friday evening we added another switch and connected the other two NICs of each server to it, to decrease the load on the first switch. We added these NICs to the team. The switches are connected by a trunk port.

Today (Monday) at around 11am (possibly under higher network load), both servers got a BSOD three minutes apart: one at 11:02, the other at 11:05.

On both servers the cause is listed as vmswitch.sys. I could not find any information about this happening on Google. Here is the output from WinDBG:

*** ERROR: Module load completed but symbols could not be loaded for bxnd60a.sys
*** ERROR: Module load completed but symbols could not be loaded for bxvbda.sys
Probably caused by : vmswitch.sys ( vmswitch!VmsPtNicPvtPacketRouted+ae )

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
      DISPATCH_LEVEL or above. The offending component can usually be
      identified with a stack trace.
Arg2: 0000000000001e0d, The watchdog period.
Arg3: 0000000000000000
Arg4: 0000000000000000

*** ERROR: Module load completed but symbols could not be loaded for bxnd60a.sys
*** ERROR: Module load completed but symbols could not be loaded for bxvbda.sys
Page d8bdc0 not present in the dump file. Type ".hh dbgerr004" for details
Probably caused by : vmswitch.sys ( vmswitch!RndisDevHostDeviceIndicatePackets+1e1 )

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
      DISPATCH_LEVEL or above. The offending component can usually be
      identified with a stack trace.
Arg2: 0000000000000784, The watchdog period.
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

Page d8bdc0 not present in the dump file. Type ".hh dbgerr004" for details

All replies

Please contact Microsoft Customer Service and Support (CSS) via telephone so that a dedicated Support Professional can assist with your request. To troubleshoot this kind of kernel crash, we need to debug the crashed system's memory dump.
Unfortunately, debugging is beyond what we can do in the forum. Please be advised that contacting phone support is a charged call.

To obtain the phone numbers for a specific technology request, please take a look at the web site listed below:

We have the update installed on one of the servers but not on the other (the servers are in different patch groups), so I do not think this update is the issue: the problem occurred on both servers at nearly the same time, regardless of whether the patch was installed. Not to mention, the patch is for live migration, not for replication, and at that point in time we had NOT performed a failover between the primary and replica servers.

Nonetheless, we shall re-install the patch. We shall schedule a change and perform it in the next week or two.

David,

Our NICs are: HP NC382i DP Multifunction Gigabit Server Adapter.

What we noticed is that since we added the new switch, the hosts' MAC addresses were flapping between the server interfaces and the trunk. This was due to the load-balancing algorithm we were using: Switch Independent / Address Hash. Since the hosts generate the most traffic due to replication, the traffic from the hosts was leaving from a different interface each time, so we had packet loss and timeouts on outbound traffic. Unfortunately, since the switches we use are not high-end Cisco products and cannot be stacked, there is no other way to fix this on two trunked switches. As a solution, we changed the teaming mode to Switch Independent / Hyper-V Port. There is no MAC address flapping anymore. We are waiting to see whether the issue recurs.

With the Address Hash algorithm there is an additional layer of complexity in the outbound path: a calculation of which interface each packet should leave from. With Hyper-V Port, by contrast, there is a static mapping for the VMs and the host. Maybe (I am speculating here) this additional complexity, combined with the MAC address flapping, caused the issue with the vmswitch driver.
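To make the difference concrete, here is a toy sketch of the two egress-selection strategies. This is not the actual vmswitch or LBFO code; the NIC names, function names, and the CRC32 flow hash are all illustrative assumptions. The point it shows is that a per-flow hash can send frames from one source MAC out of different team members (so the upstream switches see that MAC moving between ports), while a static per-virtual-port mapping keeps each MAC on one NIC.

```python
import zlib

# Toy illustration (NOT the real vmswitch/LBFO implementation) of why
# Address Hash mode can make one source MAC flap between team members,
# while Hyper-V Port mode keeps each MAC pinned to a single NIC.
# TEAM_NICS and the hashing scheme are illustrative assumptions.

TEAM_NICS = ["NIC1", "NIC2"]

def egress_address_hash(src_mac: str, dst_ip: str, dst_port: int) -> str:
    """Address Hash: the egress NIC is computed per flow from packet
    fields, so frames from one source MAC can leave on different NICs."""
    flow_hash = zlib.crc32(f"{src_mac}|{dst_ip}|{dst_port}".encode())
    return TEAM_NICS[flow_hash % len(TEAM_NICS)]

def egress_hyperv_port(vswitch_port_id: int) -> str:
    """Hyper-V Port: each virtual switch port (VM or host vNIC) is
    statically mapped to one team member, so its MAC never moves."""
    return TEAM_NICS[vswitch_port_id % len(TEAM_NICS)]

if __name__ == "__main__":
    mac = "00:1B:78:AA:BB:01"
    # Different flows from the same MAC may hash to different NICs,
    # so the upstream switches see that MAC on changing ports.
    nics_seen = {egress_address_hash(mac, "10.0.0.5", port)
                 for port in range(1024, 1100)}
    print("Address Hash, NICs used by one MAC:", sorted(nics_seen))
    # The same virtual port always maps to the same NIC: no flapping.
    print("Hyper-V Port, NIC for port 7:", egress_hyperv_port(7))
```

In the Hyper-V Port case a given VM's MAC only ever appears on one team member, which matches what we observed after changing the mode: the flapping stopped.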

For me it has happened only once (so far). I hope it will not occur again; that is why I have not done more advanced troubleshooting and monitoring on the systems. Nonetheless, it does seem to be a real issue.

Since I am in a partner organization, I am eligible to request assistance from Microsoft. I will try to do that and provide them with the memory dumps. Hopefully they will provide a patch or some other form of fix. I will keep you updated on the developments.

None of the solutions above worked for me. I tried disabling TCP Chimney and disabling the offload properties. It seems to occur when I use the internet connection actively. I am using Windows Server 2012 with a wireless adapter.