Cisco C200 Interrupt Debacle

During a recent storage upgrade and virtual environment refresh at a customer site, I ran into an issue that kept me busy and really tested my patience: patience for vendor support, customers, and IT in general. Please forgive the rambling. Skip ahead to the Resolution section for the answer.

The documents I found along the way are represented as links throughout the article.

1. The array and the FC fabric (Brocade 5100B) were configured, and the current vCenter instance was upgraded from 4.1 Update 1 to 5.0 Update 1.

– Due to a previously identified bug between the VNX running Block OE 5.31 and VMware ESXi 5.x, we made sure all of the hosts were configured for the VMware NMP FIXED path policy, with Failover Mode 4 (ALUA) set on the array.
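For reference, the path-policy side of that change can be made from the ESXi shell. This is a sketch, not the exact commands from the engagement: VMW_SATP_ALUA_CX is the SATP normally used for CLARiiON/VNX arrays in ALUA mode, but verify it against your own claim rules first. The snippet only executes on a real ESXi host.

```shell
# Sketch: make FIXED the default PSP for the ALUA SATP used by CLARiiON/VNX
# arrays (VMW_SATP_ALUA_CX). Assumed SATP name; check your claim rules first.
CMD="esxcli storage nmp satp set --satp VMW_SATP_ALUA_CX --default-psp VMW_PSP_FIXED"
if command -v esxcli >/dev/null 2>&1; then
  $CMD
  # Confirm what each device is actually claimed with afterwards
  esxcli storage nmp device list
else
  echo "Not an ESXi shell; would run: $CMD"
fi
```

Note that changing the default PSP only affects devices claimed after the change; existing devices may need their policy set individually (or the host rebooted).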

2. I installed ESXi 5.0 Update 1 on the new Cisco C200 and validated it could see the legacy CX4 datastores, as well as the new VNX datastores.
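The visibility check is quick from the ESXi shell. A rough sketch (guarded so it does nothing outside an ESXi host); the exact output fields vary by build:

```shell
# List mounted VMFS datastores and the storage paths behind them.
# Only meaningful on an ESXi host; elsewhere it just reports that it skipped.
if command -v esxcli >/dev/null 2>&1; then
  esxcli storage filesystem list   # datastores the host has mounted
  esxcli storage core path list    # FC paths, per device, with their state
  STATUS="listed"
else
  STATUS="skipped (not an ESXi shell)"
fi
echo "datastore check: $STATUS"
```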

3. After moving one of the production host’s virtual machines to the new C200, I upgraded its ESXi from 4.1 Update 1 to 5.0 Update 1.

At this point the implementation was smooth sailing, so I upgraded the remaining hosts.

The next day I was informed that the virtual environment was extremely unresponsive. Virtual machines were still running, but latencies were high and sporadic (10,000ms – 45,000ms). What I found most interesting was that this was NOT consistent across all of the hosts. The newly installed C200 server was operating as expected, as were some of the older servers.

After reviewing the vmkernel.log for each of the ESXi hosts, I noticed very specific failures on the hosts exhibiting the storage performance issues.

Error Messages

Here is an example of some of the messages seen in /var/log/vmkernel.log:

Now the main entries to note are the references to “qla2xxx”. I only found sparse references to similar errors, but nothing promising. Finally I threw in the towel and escalated the incident to VMware.
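The filtering itself was trivial. As a sketch, the two sample entries below are made-up placeholders in the general shape of qla2xxx driver errors; the real messages from this environment differed, and on a live host you would point the grep at /var/log/vmkernel.log instead:

```shell
# Made-up sample entries standing in for the real vmkernel.log messages.
cat > /tmp/vmkernel.sample <<'EOF'
2012-06-12T14:01:03Z cpu4:4101)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver qla2xxx, for vmhba2
2012-06-12T14:01:05Z cpu4:4101)<6>qla2xxx 0000:0e:00.0: ISP System Error - mbx1=4b1fh mbx2=10h
EOF

# Pull the HBA driver-level entries; on an ESXi host, use /var/log/vmkernel.log
grep "qla2xxx" /tmp/vmkernel.sample
```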

VMware (Round 1)

My experience with VMware support has been hit-and-miss, but in this case they were on the ball. The case was quickly picked up by an eager engineer, and we had some environment details to verify. The first item we identified was outdated firmware on each of the QLogic HBAs. The default response from any vendor’s support is usually to have you make sure “everything” is at the latest revision before proceeding. (This is a good practice, as it rules out previously identified/resolved bugs, but it can be frustrating when firmware/drivers are outside the problem’s scope.)

That said, I was apprehensive about spending time going down the “upgrade BIOS/firmware/drivers” route, but the engineer found a few internal cases between VMware/QLogic (not releasable) citing that the newest HBA firmware resolved this issue. The article can be found here: Qlogic ISP Error

Current Code                Latest Code

BIOS version 2.10           BIOS version 3.00
FCODE version 2.04          FCODE version 3.15
EFI version 2.04            EFI version 2.21
Flash FW version 4.06.02    Flash FW version 5.04.01

To my disappointment, the issue was still present after all of the HBA firmware and drivers had been upgraded. Enter Cisco TAC.

Cisco (Round 2)

A colleague of mine mentioned a BIOS (or VMkernel) setting that has to be made on Cisco C200 hosts and pointed me to this article: KB1030265. Even though the change is simple enough to implement, this particular environment couldn’t withstand a host reboot because the performance issues prevented the virtual machines from being evacuated. Furthermore, the article specifically states that the fix only applies if “ALERT: APIC: 1823: APICID 0x00000000 – ESR = 0x40” is present in /var/log/vmkernel.log (which it wasn’t).
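For completeness, the VMkernel-side workaround that KB describes is, as I understand it, disabling interrupt remapping via the iovDisableIR kernel setting (the same thing can be done in the server BIOS). On ESXi 5.x it looks roughly like this; a host reboot is required for it to take effect, so confirm against the KB before applying:

```shell
# Sketch of the KB1030265-style workaround: disable VMkernel interrupt
# remapping (iovDisableIR). Requires a reboot afterwards; ESXi shell only.
CMD="esxcli system settings kernel set --setting=iovDisableIR -v TRUE"
if command -v esxcli >/dev/null 2>&1; then
  $CMD
  esxcli system settings kernel list -o iovDisableIR   # verify the new value
else
  echo "Not an ESXi shell; would run: $CMD"
fi
```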

Needing further assistance, I escalated the case to Cisco TAC. After reviewing my findings, as well as some ESXi host logs, support agreed that the APIC interrupt issue referenced above did NOT apply, but noted that the server firmware was not supported with ESXi 5.x. After reviewing the release notes (found here), I confirmed this was the case.

The Holy Grail

To my surprise, the answer was in the release notes. Although the VMware KB referenced above states that the APIC interrupt mapping issue only occurs if you see the “ALERT: APIC: 1823: APICID 0x00000000 – ESR = 0x40” entry, the release notes mention another symptom. Here is a little gem I pulled from the release notes: