Month: July 2018

A few days ago I mentioned that MS KB RSS had stopped working. Well, today I was reading some articles and noticed a “Subscribe RSS Feed” button. I’m not sure when it showed up but it can’t have been there for long. It’s the same thing (URIs and UI are different though), great stuff!

I’m not sure if these values would make any sense or work at all but my guess is that they will not crash anything. By observation, I think each mitigation is optional and can be enabled automatically if the hardware/microcode supports it. I don’t have an AMD at hand but someone could try out these homebrew combinations.

Windows CUs get a lot of hate these days. Rightfully so, occasionally. But you have to consider the times before CUs, and those were arguably even worse.

Going back to the era before Windows 8, there was the service pack + hotfix model. Deploy an SP and get hotfixes for a few years. Deploy the next SP and the cycle starts again. But over time fewer and fewer SPs came out and the gaps between SP releases got longer. It got worse with Vista+ releases. Vista SP2 came out in early 2009, which left 8 hotfix-only years until EOL. Windows 7 SP1 was early 2011, so we were nearly 6 years in before CUs began.

The ugly part. The vast majority of hotfixes were limited release. This meant that they never showed up on WU/WSUS. You just couldn’t find them. There was no general list of updates. Some of them couldn’t be downloaded at all. Some MS teams had their private lists of recommended updates. Better, but always out of date. And still, most updates went under the radar. At one point I found out that the Microsoft KB portal had a per-product RSS feed. It was a great, somewhat obscure and semi-hidden way to stay up to date. Sadly it stopped working about 2 years ago, I think around the time CUs became the new black. (Update: it’s back with a respin.)

Before the Windows 7 2016 convenience update, I think I had ~500 hotfixes in my image building workflow. Maybe a quarter of them were public ones. Sure, quite a few were for obscure features and problems but I believe in proactive patching. But the really bad part was patching already deployed systems. These hotfixes couldn’t be used in WSUS/SCCM so custom scripting it was. But as WU detection is really slow from a script, and because of the sheer number of patches and the plumbing required to handle supersedence… it was unfeasible to deploy more than maybe a dozen or two of the most critical ones.
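To give an idea of what the supersedence plumbing boils down to, here is a minimal Python sketch. It is not what I actually ran. The update names and the `supersedes` mapping are made-up placeholders, and it assumes you already have supersedence metadata for every hotfix in your collection, which was exactly the hard part back then.

```python
from typing import Dict, List, Set


def effective_updates(supersedes: Dict[str, Set[str]]) -> List[str]:
    """Return the updates that no other update in the collection supersedes.

    `supersedes` maps each update ID to the set of older update IDs it replaces.
    Every update in the collection must appear as a key (with an empty set if
    it replaces nothing).
    """
    superseded: Set[str] = set()
    for older in supersedes.values():
        superseded.update(older)
    # Only updates that nobody replaces are worth installing.
    return sorted(kb for kb in supersedes if kb not in superseded)


# Hypothetical collection: C replaces B, B replaces A, D stands alone.
catalog = {
    "KB-A": set(),
    "KB-B": {"KB-A"},
    "KB-C": {"KB-B"},
    "KB-D": set(),
}

to_install = effective_updates(catalog)
print(to_install)
```

With the placeholder data only KB-C and KB-D remain. The real work was building that mapping for hundreds of limited release hotfixes and then detecting what was already installed on each machine.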

And there were quite a few. I think folder redirection and offline files required 5 patches to different components to work properly. ALL had to be hunted down quite manually. These were dark times…

Over the years, some community projects started to mitigate the problem. MyDigitalLife’s WHDownloader worked best for me; its main maintainer Abbodi86 is a Windows servicing genius. I built an image building framework around it that I use to this day.

The Windows 8 era started with monthly optional rollups. And these were great! Just great! Oh how much I miss them! Pretty much (or totally?) every optional hotfix was quickly rolled up into the monthly rollup. These were not cumulative so you could skip buggy ones (there were a few…) and still deploy next month’s one. And they had proper detailed release notes. Every issue fixed, each with reasonably detailed symptoms, cause and fix. Sure, you had to deploy quite a few updates each month, but not having to hunt down limited hotfixes was a breeze. However, this model was abruptly stopped at the end of 2014. I never saw an announcement about this.

Windows 10 came and, later in 2016, cumulative updates came to downlevel OSes. While not perfect, it’s a HUGE upgrade over what we had before Windows 8. I still believe that the Windows 8 model was superior. If you think now is bad, you didn’t know the pain or you just didn’t know better.

In the end RSS and iSCSI were separate issues. RSS is to be fixed in vSphere 6.7U2 sometime this spring. Updated Marvell (wow, Broadcom -> QLogic -> Cavium -> Marvell, I’m not sure what to call it by now) drivers are on VMware’s support portal. I haven’t tested them yet as I don’t currently have any Marvell NICs to try out.

Three months have passed and QLogic/Cavium drivers are still broken. I’ve gotten a few debug drivers (and others have as well) but there’s no solution. The initial suspicion about bad optics was a red herring (the optics really were bad but that was unrelated). Currently there are 2 issues:

Hardware iSCSI offload will PSOD the system (in my case in 5-30 minutes, in other cases randomly)

NIC RSS configuration will randomly fail (once every few weeks), causing total loss of network connectivity, a PSOD or an NMI by BIOS/BMC (or a combination of the three).

So far I’ve had to swap everything to Intels (being between a rock and a hard place). They have their own set of problems, but at least no PSODs or networking losses. Beacon probing doesn’t seem to work with Intel X710 based cards (confirmed by HPE) – incoming packets just disappear in NIC/driver. Compared to random PSOD, I can live with that.

Edit 2018.07.11

HPE support confirmed that the qfle3 bundle is dead in the water. Our VAR was astonished that the sales branch was completely unaware of the severe stability issues. Edited subject to reflect findings.

Edit 2018.07.09

Qlogic qfle3i (and the whole Qlogic 57810 driver bundle) seems to be just fucked. qfle3i crashes no matter what. Even the basic NIC driver qfle3 crashes occasionally. So if you’re planning to switch from bnx2 to qfle3 as required by HPE, don’t! bnx2 is at least stable for now. Latest HPE images already contain this fix, however it doesn’t fix these specific crashes. VMware support also confirmed that there’s an ongoing investigation into this known common issue and that it also affects vSphere 6.5. I’m suffering on HPE 534FLR-SFP+ adapters but your OEM may have other names for the Qlogic/Cavium/Broadcom 57810 chipset.

A few days ago I was setting up a new green-field VMware deployment. As a team effort, we were ironing out configuration bugs and oversights, but despite all the fixes, the vSphere hosts kept PSODing consistently. The stack showed crashes in the Qlogic hardware iSCSI adapter driver qfle3i.

Firmware was updated and updates were installed, to no effect. After looking around and some trial and error, one fiber cable turned out to be faulty and caused occasional packet loss on the SAN to switch path. TCP is supposed to fix that in theory but hardware adapters seem to be much more picky. Monitoring was not yet configured so it was quite annoying to track down. Also, as the SAN was not properly accessible, there was no persistent storage for logs or dumps.

So if you’re using hardware adapters and seeing PSODs, check for packet loss in switches. I won’t engage support for this as I have neither logs nor dumps. But if you see “qfle3i_tear_down_conn” in PSOD, look for Ethernet problems.
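If you do manage to save the PSOD text somewhere (a screenshot transcription, a dump, whatever you have), a quick scan for that symbol is trivial. A minimal Python sketch follows. The function name and the sample text are placeholders I made up, the sample is not real PSOD output, and only the symbol name comes from the crashes I saw.

```python
from typing import List, Tuple

# Symbol seen in the PSOD backtraces described above.
SUSPECT_SYMBOL = "qfle3i_tear_down_conn"


def find_symbol_lines(text: str, symbol: str = SUSPECT_SYMBOL) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line mentioning the symbol."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if symbol in line
    ]


# Placeholder text standing in for a saved backtrace, NOT real PSOD output.
sample = "\n".join([
    "placeholder line one",
    "placeholder frame mentioning qfle3i_tear_down_conn",
    "placeholder line three",
])

hits = find_symbol_lines(sample)
if hits:
    print("Symbol found, go check the Ethernet path for packet loss:")
    for number, line in hits:
        print(f"  line {number}: {line}")
```

A hit only tells you where to look first. The actual confirmation still has to come from error counters on the switch ports.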

Who?

I'm a Microsoft technologies sysadmin. Mainly core services but also bits from here and there.

Disclaimer

Any and all opinions or statements are my own, not ones of my employers or clients.
Any and all scripts or guides are provided as-is without any warranties or guarantees. Test and understand the risks before use.