
UPDATE – Ignore the Broadcom driver stuff. It seemed to be ok all afternoon, but I have rebooted the ESXi host and it’s gone completely unstable again, with pretty much continuous iSCSI disconnects. Clearly this TOE/iSCSI offload support is absolutely terrible. I’m going to have to use the software initiator. What is the point of Dell marketing this?

UPDATE 2 – Dell decided to get to the bottom of this and, following an extended troubleshooting session in which I reverted one of the hypervisors, they were able to replicate the fault in their lab. It’s now being escalated with VMware and Broadcom. More news as I get it…

I’m doing this upgrade at the moment from vSphere 4.1U1 so I wanted to make notes, particularly on the hypervisor rebuild part, so I don’t have to keep looking stuff up when I do each one. Since 4.1 I have used the hardware iSCSI offload features of the Broadcom bnx2 chips in the servers, using them as HBAs in their own right. As per the Dell MEM driver 1.1 release notes they still don’t support using jumbo frames with this configuration. However, I had big problems with getting this working at all with 5.0. According to Dell support I’m in a minority of customers that use TOE so their inclination was to suggest I fall back to software iSCSI. I purposely delayed adopting vSphere 5.0 until it had been out for a few months to hopefully avoid being among the first to hit major issues, but I still ran into this. The problem manifests itself as regular errors (every few seconds) in the array logs like this:

iSCSI login to target ‘192.168.100.12:3260, iqn.2001-05.com.equallogic:0-8a0906-c541d5105-94c0000000a4adc3-vsphere’ from initiator ‘192.168.100.25:2076, iqn.1998-01.com.vmware:server.domain.com:1454019294:34’ failed for the following reason: Initiator disconnected from target during login.

These errors are generated by all HBAs that are configured for storage. Furthermore, only one path is established, and the volume occasionally goes offline altogether. The ESXi host’s /var/log/vmkernel.log shows bnx2 disconnection events like this:

Dell support’s first suggestion was to raise the iSCSI login timeout value from 5 seconds to 60 seconds; you need build 515841 to be able to edit this setting. However, this did not fix the issue when using TOE. It turned out to be a Broadcom driver issue.

The vanilla install of ESXi 5.0.0 (build 469512), the Hypervisor Driver Rollup 1, and the update to build 515841 all include these same driver vib packages, which seem to be broken. You can audit these by running:

esxcli --server=servername software vib list

However, there is a further complication. These drivers have to be loaded after the VMware updates. When the Broadcom drivers are installed, the VMware-supplied drivers for these devices are removed. Confusingly, the VMware updater to build 515841 will see that they are missing, will ignore the OEM Broadcom replacements, and will re-install the older versions! If the host reboots at that point it will crash to a magenta screen of death as the kernel initialises, possibly because two different driver versions are trying to access the same hardware. Take note: the Broadcom installer removes the following bootbank packages from the host:

So my recommendation would be to cross check this list whenever you install any further roll-ups to your ESXi hosts. If these or future non-OEM versions are reinstated, remove them before you restart the host, or it may not boot at all.
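As a sketch of that cross-check (the vib name below is a placeholder, not one of the actual packages from my hosts):

```
# List installed vib packages and look for non-OEM Broadcom driver entries
esxcli --server=servername software vib list | grep -i bnx2

# Remove a reinstated non-OEM package before rebooting
# (replace <vib-name> with the offending package name from the list)
esxcli --server=servername software vib remove --vibname=<vib-name>
```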

vCenter Server migration

Migrate vCenter server – for 4.1 -> 5.0 the wizard does it all automatically (big improvement!)

After the upgrade you’ll get HA failing to find a master agent, and probably some vCenter certificate warnings about the hosts

Shut down all guests, put both hosts into maintenance mode, and shut them down

Use WebUI to update the EqualLogic firmware to 5.1.2

Restart the SAN

Use iDRAC to power on ESXi hosts

If vCenter is a VM you need to use the v4 infrastructure client to connect directly to its ESXi host

Power up a DC first, then vCenter

Quit the v4 client

Load the v5 infrastructure client and connect to vCenter

Start other DCs, Exchange, and SQL servers

Start web, app, and file servers

ESXi host update

From your iSCSI vSwitch make a note of the current iSCSI kernel port IP addresses
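If you prefer the CLI to the GUI for this, the vSphere CLI should be able to list them, something like (assuming the CLI is installed on your workstation):

```
# List all VMkernel NICs with their IP addresses, netmasks and MTU
esxcfg-vmknic.pl --server=servername --list
```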

vMotion guests off ESXi host, maintenance mode, shutdown

Remove host from vCenter

For Dell servers use iDRAC, boot into System Services mode and try connecting to the net for updates

vmnic0 was in the management vSwitch and it was port channelled on the network switch

Telnet to the switch and use the port descriptions to find the correct port channel. If you don’t have descriptions in your switch config you could, as a fallback, find the MAC addresses in the server BIOS and look them up in the switch’s MAC address table, or use CDP (show cdp neighbors) while ESXi is running

Disable each of the ports in turn, checking in iDRAC to see if that fixes the access to the Dell firmware repo
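For reference, on a Cisco IOS switch the lookup and shutdown steps might look something like this (the interface name and MAC address are examples only):

```
show interfaces description
show mac address-table address 0024.e812.3456
show cdp neighbors detail
configure terminal
 interface GigabitEthernet1/0/1
  shutdown
```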

Apply all firmware updates

Use iDRAC’s Virtual Media feature to present the VMVisor ISO image to the server

Reboot selecting the boot menu, then boot from the virtual CD

Select new install for ESXi host and install to SD card

This way there is no legacy partition table, and an in-place upgrade would still require you to reinstall the Dell MEM driver in any case

Use iDRAC to set management IP

Start v5 infrastructure client and connect to vCenter

Add ESXi host back into vCenter

Add vmnic4 back to the management vSwitch

Remove VM Network port group

Configure NIC teaming as Route based on IP hash (for each vmkernel and port group!)

Enable vMotion on the Management vmkernel port

Commit changes and re-enable the disabled switchport on your switch

Configure NTP service and hostname

Configure ESXi licence key

Compare the MAC addresses of the vmhba initiators in Storage Adapters with the NICs listed in Network Adapters. You may notice that the numbering is different from the vmhba initiators that your ESXi 4.1 host was using

Download the Dell MEM 1.1 Early Production Access release, since it contains bug fixes over v1.0.1 and is certified for vSphere 5.0

Some of these archives need extracting to expose the actual vib zipfile, some don’t

Install VMware vSphere CLI

Use the infrastructure client’s Datastore browser to upload the MEM, the 515841 patch release, and the Broadcom vib files to a local volume on the ESXi host (mine all have a single SATA hard disk for scratch)
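Once uploaded, the bundles can be installed with esxcli — a sketch, assuming the files landed on a datastore named scratch1 (all the filenames are placeholders, adjust paths to suit):

```
# Install the patch bundle first, then the OEM Broadcom drivers, then the MEM
esxcli --server=servername software vib install --depot=/vmfs/volumes/scratch1/<patch-bundle>.zip
esxcli --server=servername software vib install --depot=/vmfs/volumes/scratch1/<broadcom-offline-bundle>.zip
esxcli --server=servername software vib install --depot=/vmfs/volumes/scratch1/<dell-mem-bundle>.zip
```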

For each HBA, check the iqn name, amend it to use the hostname instead of localhost, and check the numbering. On my servers the vmhba designations shifted during one of the reboots, leaving the iqns with misleading names, which caused additional confusion while setting up the array volume access. e.g. vmhba34 showed up as iqn.1998-01.com.vmware:localhost.domain.com:2062235227:36
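One way to make that change from the CLI (vmhba34 and the iqn are just my examples; substitute your own adapter names):

```
# Set a corrected iqn on the offload HBA so it reflects the real hostname
esxcli --server=servername iscsi adapter set --adapter=vmhba34 --name=iqn.1998-01.com.vmware:server.domain.com:2062235227:36
```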

Run the MEM configuration script, selecting vmnic1 and vmnic3, and using the IP addresses you noted from the old ESXi instance. Dell support also advised creating the heartbeat vmkernel port, though it’s described as optional

setup.pl --configure --server=servername

Update these new iqns on the SAN’s ACLs for the vSphere storage volume(s)

After that has finished, take the CHAP passwords for each vmhba from the EqualLogic Web UI and add those to the Storage Adapter configs in the infrastructure client. Remember to use the username as you see it in the EqualLogic UI, not the initiator iqn

For each of your active HBAs use the advanced settings to edit the iSCSI login timeout from 5 to 15 seconds (to match what ESXi 4.1 had)
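Assuming build 515841 or later, the timeout can also be set per adapter from the CLI, something like (vmhba34 is a placeholder):

```
# Check the current value, then raise LoginTimeout from 5 to 15 seconds
esxcli --server=servername iscsi adapter param get --adapter=vmhba34
esxcli --server=servername iscsi adapter param set --adapter=vmhba34 --key=LoginTimeout --value=15
```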

Configure a scratch disk path and enable scratch – use the real drive UID in the path, rather than the volume name in case you change it later. To retrieve that, use
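One way to retrieve the UID is to list the filesystems, then build the scratch path from the UUID column (the .locker directory name below is my own convention, not a requirement):

```
# List datastores/filesystems with their UUIDs and friendly volume names
esxcli --server=servername storage filesystem list

# Then set the advanced option ScratchConfig.ConfiguredScratchLocation to
# /vmfs/volumes/<uuid>/.locker in the client, and reboot for it to take effect
```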

There are several big motivators for moving over to vSphere 4.1 with respect to storage. Firstly, there’s support for the vStorage APIs in new EqualLogic array firmwares (starting at v5.0.0, which together with 5.0.1 has sadly been withdrawn pending fixes for some show-stopping bugs). VM snapshot and copy operations will be done by the SAN at no I/O cost to the hypervisor. Next, there’s the support for vendor-specific Multipathing Extension Modules – EqualLogic’s is available for download under the VMware Integration category. Finally, there’s the long-overdue TCP Offload Engine (TOE) support for Broadcom bnx2 NICs. All of this means a healthy increase in storage efficiency.

If you’re upgrading to vSphere 4.1 and have everything set up as per Dell EqualLogic’s vSphere 4.0 best practice documents you’ll first need to:

Upgrade the hypervisors using vihostupdate.pl as per VMware’s upgrade guide, taking care to backup their configs first with esxcfg-cfgbackup.pl
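For the record, the backup and update invocations look something like this (the bundle filename and backup path are placeholders):

```
# Back up the host config before touching anything
esxcfg-cfgbackup.pl --server=servername -s /backups/servername.cfg

# Apply the upgrade bundle
vihostupdate.pl --server=servername --install --bundle=/updates/<upgrade-bundle>.zip
```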

Once that’s done choose an ESXi host to update, and put it in Maintenance Mode.

Make a note of your iSCSI VMkernel port IP addresses.

Make sure your ScratchConfig (Configuration -> Advanced Settings) is set to local storage. Reboot and check the change has persisted.

If the server has any Broadcom bnx2 family adapters they will now be treated as iSCSI HBAs, so each will have its own vmhba designation. To unassign the previous explicit bindings to the Software iSCSI Initiator you therefore need to check for its new name in the Storage Adapters configuration page.

You can’t unbind the VMkernel ports while there is an active iSCSI session using them, so edit the properties of the Software iSCSI Initiator and remove the Dynamic and Static targets, then perform a rescan. Find your bound VMkernel ports using the vSphere CLI (replacing vmhba38 with the name of your software initiator):
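On 4.1 the listing and unbinding look something like this (vmhba38 and the vmk port names are examples only):

```
# List the VMkernel NICs bound to the software initiator
esxcli --server=servername swiscsi nic list -d vmhba38

# Unbind each one in turn
esxcli --server=servername swiscsi nic remove -n vmk1 -d vmhba38
esxcli --server=servername swiscsi nic remove -n vmk2 -d vmhba38
```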

Now you can disable the Software iSCSI Initiator using the vSphere Client and then remove all the VMkernel ports and your iSCSI vSwitches.

Take note at this point that, according to the release notes PDF for the EqualLogic MEM driver, the Broadcom bnx2 TOE-enabled driver in vSphere 4.1 does not support jumbo frames. This information is further on in the document and unfortunately I only read it after I had already configured everything with jumbo frames so I had to start again. Any improvement they offer is kind of moot here since the Broadcom TOE will take over all the strenuous TCP calculation duties from the CPU, and is probably able to cope with traffic at line speed even at 1500 bytes per packet. I guess it could affect performance at the SAN end so perhaps they will work on supporting a 9000 byte MTU in forthcoming releases.
Make sure you set the MTU back to 1500 for any software initiators running in your VMs that used jumbo frames!

Re-patch your cables so you’re using your available TOE NICs for storage. On a server like the Dell PowerEdge R710 the four Broadcom TOE NICs are in fact two dual chips. So if you want to maximize your fault tolerance, be sure to use vmnic0 & vmnic2 as your iSCSI pair, or vmnic1 & vmnic3.

Log in to your EqualLogic Group Manager and delete the CHAP user you were using for the Software iSCSI Initiator for this ESXi host. Create new entries for each hardware HBA you will be using. Copy the initiator names from the vSphere GUI, and be sure to grant them access in the VDS/VSS pane too. Add these users to the volume permissions, and remove the old one.

Now go back to your active HBAs and enter the new CHAP credentials. Re-scan and you should see your SAN datastores.

Recreate a pair of iSCSI VM Port Groups for any VMs that may use their own software initiators (very convenient for off-host backup of Exchange or SQL), making sure to explicitly set only one network adapter active, and the other to unused. Reverse the order for the second VM port group. Notice that setup.pl has done this for the VMkernel ports which it created.

Reboot again for good measure since we’ve made big changes to the storage config. I noticed at this point that on my ESXi hosts the Path Selection Policy for my EqualLogic datastore reset itself to Round Robin (VMware). I had to manually set it back to DELL_PSP_EQL_ROUTED. Once I had done that it persisted after a reboot.
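If you need to script that fix rather than clicking through the GUI, the 4.1 esxcli syntax is roughly as follows (the device ID is a placeholder — take it from the device list output):

```
# Check which PSP each device is currently using
esxcli --server=servername nmp device list

# Set the EqualLogic routed PSP back on the datastore's device
esxcli --server=servername nmp device setpolicy --device <naa-device-id> --psp DELL_PSP_EQL_ROUTED
```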