ESXi 5.5 Hosts randomly lose network connectivity

I recently ran into an issue where my ESXi 5.5 hosts started randomly dropping off the network, and the only way to get them back was to reboot them. After going through the logs, here are my findings on what happened and why:

This is a low-level driver crash caused by an MC assert, with no other hypervisor problems at the time. The bnx2x card then begins a crash dump and resets itself. The adapter's crash dump contains data, but it appears to be useful only to Broadcom/QLogic. The end of the crash dump shows the card being reset.

After the crash dump, the state of the card is not exactly clear, as the driver did not report a state. A few seconds later, ESXi’s netdev watchdog service, which is responsible for monitoring the health of the network adapters, determined that vmnic0 was not in a good state. It can clearly see that the adapter is unresponsive via the bnx2x driver:

The watchdog service did as designed and issued a reset of the network adapter in an attempt to wake it up from whatever state it was in. A few seconds later, you can see the watchdog service still reporting vmnic0 as unresponsive and attempting to reset it several more times:

The interesting part is that it gets no response at all from the bnx2x driver when the card reset is performed. Normally we’d expect to see the card initialize. A few seconds later, we get a slew of internal failures from the bnx2x driver indicating timeouts, and failures to enable features and rx queues:

After this, the link remains down and we hear nothing further from either the bnx2x driver or vmnic0.

Based on this log analysis, below are the findings:

1. The bnx2x adapter crashed due to an assert failure in the bnx2x driver code.

2. The adapter failed to respond for almost a full minute before going link-down. This caused an outage during this time as ESXi would not have initiated a failover action until the link went down.
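The "almost a full minute" figure in the second finding comes from comparing timestamps in the log. Below is a minimal Python sketch of that calculation. It assumes each log line starts with a UTC timestamp in the form `2015-01-01T10:00:00.000Z`. The marker strings and the sample lines are made-up placeholders, not real bnx2x or vmkernel output, so substitute the text that actually appears in your own logs.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def parse_timestamp(line: str) -> Optional[datetime]:
    """Return the leading timestamp of a log line, or None if it has none.

    Assumes the first space-separated token looks like
    2015-01-01T10:00:00.000Z (an assumption about the log format).
    """
    token = line.split(" ", 1)[0]
    try:
        return datetime.strptime(token, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None


def outage_window(lines: Iterable[str],
                  start_marker: str,
                  end_marker: str) -> Optional[timedelta]:
    """Time between the first line containing start_marker and the first
    later line containing end_marker. Returns None if either is missing."""
    start = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # skip lines without a parsable timestamp
        if start is None:
            if start_marker in line:
                start = ts
        elif end_marker in line:
            return ts - start
    return None


# Synthetic placeholder lines, NOT real log output.
SAMPLE = [
    "2015-01-01T10:00:00.000Z cpu0:1000)SYNTHETIC: vmnic0 driver assert",
    "2015-01-01T10:00:05.500Z cpu0:1000)SYNTHETIC: vmnic0 watchdog reset",
    "2015-01-01T10:00:58.250Z cpu0:1000)SYNTHETIC: vmnic0 link down",
]

if __name__ == "__main__":
    print(outage_window(SAMPLE, "driver assert", "link down"))
```

To run it against a real log, pass the open file object as `lines` and use whatever strings mark the assert and the link-down event in your environment.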

After much digging, I finally found the root cause of the issue, documented in VMware KB 2114957. The issue has been identified as a problem with the bnx2x driver, and there is currently no updated driver. For now there is a workaround, which consists of disabling TSO inside the guest OS of each VM. Alternatively, if there are a large number of VMs, you can disable VXLAN offloading at the host level: