
UPDATE – Ignore the Broadcom driver stuff. It seemed to be ok all afternoon, but I have rebooted the ESXi host and it’s gone completely unstable again, with pretty much continuous iSCSI disconnects. Clearly this TOE/iSCSI offload support is absolutely terrible. I’m going to have to use the software initiator. What is the point of Dell marketing this?

UPDATE 2 – Dell decided to get to the bottom of this and, following an extended troubleshooting session in which I reverted one of the hypervisors, they were able to replicate the fault in their lab. It’s now being escalated with VMware and Broadcom. More news as I get it…

I’m doing this upgrade from vSphere 4.1U1 to 5.0 at the moment, so I wanted to make notes, particularly on the hypervisor rebuild part, so I don’t have to keep looking stuff up when I do each one. Since 4.1 I have used the hardware iSCSI offload features of the Broadcom bnx2 chips in the servers, using them as HBAs in their own right. As per the Dell MEM driver 1.1 release notes, they still don’t support using jumbo frames with this configuration. However, I had big problems getting this working at all with 5.0. According to Dell support I’m in a minority of customers that use TOE, so their inclination was to suggest I fall back to software iSCSI. I purposely delayed adopting vSphere 5.0 until it had been out for a few months, hoping to avoid being among the first to hit major issues, but I still ran into this one. The problem manifests itself as regular errors (every few seconds) in the array logs like this:

iSCSI login to target ‘192.168.100.12:3260, iqn.2001-05.com.equallogic:0-8a0906-c541d5105-94c0000000a4adc3-vsphere’ from initiator ‘192.168.100.25:2076, iqn.1998-01.com.vmware:server.domain.com:1454019294:34’ failed for the following reason: Initiator disconnected from target during login.

These errors are generated by all HBAs that are configured for storage. Furthermore only one path is established, and the volume will occasionally go offline altogether. The ESXi host’s /var/log/vmkernel.log shows bnx2 disconnection events like this:

Dell support’s first suggestion was to edit the iSCSI login timeout value from 5 seconds to 60 seconds; you need build 515841 to be able to edit this. However, this did not fix the issue with TOE. It turned out to be a Broadcom driver issue.

The vanilla install of ESXi 5.0.0 (build 469512), the Hypervisor Driver Rollup 1, and the update to build 515841 all include these same driver vib packages which seem to be broken. You can audit these by running esxcli --server=servername software vib list
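As a sketch of that audit (the VIB names below are the stock ESXi bnx2 driver packages, but the version strings are illustrative, not copied from a live host), you can filter the listing for the bnx2 family:

```shell
# Illustrative stand-in for "esxcli --server=<host> software vib list" output;
# on a live host you would pipe the real command into the same grep.
vibs=$(cat <<'EOF'
net-bnx2    2.0.15g.v50.11-5vmw.500.0.0.469512   VMware   VMwareCertified
net-bnx2x   1.61.15.v50.1-1vmw.500.0.0.469512    VMware   VMwareCertified
scsi-bnx2i  1.9.1d.v50.1-3vmw.500.0.0.469512     VMware   VMwareCertified
net-e1000   8.0.3.1-2vmw.500.0.0.469512          VMware   VMwareCertified
EOF
)
echo "$vibs" | grep -i 'bnx2'
```

The same filter works against the live command; anything in the bnx2 family still showing as VMware-supplied rather than Broadcom-supplied is a candidate for replacement.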

However, there is a further complication. These drivers have to be loaded after the VMware updates. When the Broadcom drivers are installed, the VMware-supplied drivers for these devices are removed. Confusingly, the VMware updater to build 515841 will see that they are missing, will ignore the OEM Broadcom replacements, and will re-install the older versions! If the host reboots at that point it will crash to a magenta screen of death as the kernel inits, possibly because two different driver versions are trying to access the same hardware. Take note: the Broadcom installer removes the following bootbank packages from the host:

So my recommendation would be to cross check this list whenever you install any further roll-ups to your ESXi hosts. If these or future non-OEM versions are reinstated, remove them before you restart the host, or it may not boot at all.
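A minimal dry-run sketch of that cross-check. It only echoes the esxcli removal commands for review; the VIB names are my assumption of the stock packages involved (check them against the actual list the Broadcom installer printed on your hosts), and esx01 is a placeholder hostname.

```shell
# Build (but do not run) the commands that would strip re-instated
# non-OEM driver VIBs before the next reboot. VIB names are assumptions.
cmds=$(for vib in net-bnx2 net-bnx2x scsi-bnx2i; do
  echo esxcli --server=esx01 software vib remove --vibname="$vib"
done)
echo "$cmds"
```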

vCenter Server migration

Migrate vCenter server – for 4.1 -> 5.0 the wizard does it all automatically (big improvement!)

After upgrade you’ll get HA failing to find a master agent, and probably some vCenter cert warnings about the hosts

Shut down all guests, put both hosts into maintenance mode and shut down

Use WebUI to update the EqualLogic firmware to 5.1.2

Restart the SAN

Use iDRAC to power on ESXi hosts

If vCenter is a VM you need to use the v4 infrastructure client to connect directly to its ESXi host

Power up a DC first, then vCenter

Quit the v4 client

Load the v5 infrastructure client and connect to vCenter

Start other DCs, Exchange, and SQL servers

Start web, app, and file servers

ESXi host update

From your iSCSI vSwitch make a note of the current iSCSI kernel port IP addresses
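Assuming shell access to the host, either of these will list the vmkernel ports with their IPs (esxcfg-vmknic is the legacy form; the esxcli namespace is the 5.0 way):

```
esxcfg-vmknic -l
esxcli network ip interface ipv4 get
```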

vMotion guests off ESXi host, maintenance mode, shutdown

Remove host from vCenter

For Dell servers use iDRAC, boot into System Services mode and try connecting to the net for updates

vmnic0 was in the management vSwitch and it was port channelled on the network switch

Telnet to the switch and use the port descriptions to find the correct port channel. If you don’t have descriptions in your switch config you could, as a fallback, find the MAC addresses in the server BIOS and look them up in the switch’s MAC address table, or use CDP (show cdp neighbors) while ESXi is running

Disable each of the ports in turn, checking in iDRAC to see if that fixes the access to the Dell firmware repo
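On a Cisco IOS switch the lookup and port-disable side of that goes roughly like this (the MAC address and interface name are placeholders; syntax differs on other vendors):

```
show mac address-table address aabb.ccdd.eeff    ! which port learned the server NIC's MAC
show cdp neighbors detail                        ! ESXi advertises itself over CDP while running
conf t
interface Gi1/0/12
 shutdown                                        ! disable one member port at a time
```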

Apply all firmware updates

Use iDRAC’s Virtual Media feature to present the VMVisor ISO image to the server

Reboot selecting the boot menu, then boot from the virtual CD

Select new install for ESXi host and install to SD card

This way there is no legacy partition table, and the upgrade would still require you to install the Dell MEM driver in any case

Use iDRAC to set management IP

Start v5 infrastructure client and connect to vCenter

Add ESXi host back into vCenter

Add vmnic4 back to the management vSwitch

Remove VM Network port group

Configure NIC teaming as Route based on IP hash (for each vmkernel and port group!)
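The esxcli form of that, echoed as a dry run rather than executed (the vSwitch and port-group names are placeholders; note that the port-group-level call is separate from the vSwitch-level one, which is why it is easy to miss):

```shell
# Dry run: print the commands instead of running them on the host.
cmds=$(
  echo esxcli network vswitch standard policy failover set \
    --vswitch-name=vSwitch0 --load-balancing=iphash
  # repeat the portgroup-level call for every port group on the vSwitch
  echo esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name='Management Network' --load-balancing=iphash
)
echo "$cmds"
```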

Enable vMotion on the Management vmkernel port

Commit changes and re-enable the disabled switchport on your switch

Configure NTP service and hostname

Configure ESXi licence key

Compare the MAC addresses of the vmhba initiators in Storage Adapters with the NICs listed in Network Adapters. You may notice that the numbering is different from the vmhba initiators that your ESXi 4.1 host was using

Download the Dell MEM 1.1 Early Production Access, since there are bug fixes over v 1.0.1 and it is certified for vSphere 5.0

Some of these archives need extracting to expose the actual vib zipfile, some don’t

Install VMware vSphere CLI

Use the infrastructure client’s Datastore browser to upload the MEM, the 515841 patch release, and the Broadcom vib files to a local volume on the ESXi host (mine all have a single SATA hard disk for scratch)

For each HBA, check the iqn name and amend it to use the hostname instead of localhost, and check the numbering. On my servers the vmhba designations shifted during one of the reboots, leaving the iqns with misleading names, which caused additional confusion while setting up the array volume access. e.g. vmhba34 showed up as iqn.1998-01.com.vmware:localhost.domain.com:2062235227:36
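The rename itself can be scripted; this sketch only echoes the command for review (the esxcli iscsi adapter namespace is the CLI path I would use on 5.0, the adapter name and IQN are placeholders, and you can equally make the change in the initiator's properties dialog):

```shell
# Dry run: print the IQN-rename command instead of executing it on the host.
hba=vmhba34                                        # placeholder adapter
newiqn=iqn.1998-01.com.vmware:server.domain.com    # placeholder IQN
cmd="esxcli iscsi adapter set --adapter=$hba --name=$newiqn"
echo "$cmd"
```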

Run the MEM configuration script, selecting vmnic1 and vmnic3, and using the IP addresses you noted from the old ESXi instance. Dell support also advised creating the heartbeat vmkernel port, though it’s described as optional

setup.pl --configure --server=servername

Update these new iqns on the SAN’s ACLs for the vSphere storage volume(s)

After that has finished, take the CHAP passwords for each vmhba from the EqualLogic Web UI and add those to the Storage Adapter configs in the infrastructure client. Remember to use the username as you see it in the EqualLogic UI, not the initiator iqn

For each of your active HBAs use the advanced settings to edit the iSCSI login timeout from 5 to 15 seconds (to match what ESXi 4.1 had)
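A scripted form of the timeout change, echoed for review rather than run (the adapter names are examples from my hosts, and the LoginTimeout key is only settable from build 515841 onwards):

```shell
# Dry run: print the per-adapter timeout commands rather than running them.
cmds=$(for hba in vmhba33 vmhba34; do
  echo esxcli iscsi adapter param set --adapter="$hba" --key=LoginTimeout --value=15
done)
echo "$cmds"
```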

Configure a scratch disk path and enable scratch – use the real drive UID in the path, rather than the volume name in case you change it later. To retrieve that, use