Inspired by Insiders – A Tale of Two Kernels

This is absolutely true. Each and every Windows Insider out there is an individual who represents not only themselves, but also – to varying degrees – countless others who don’t participate in this amazing program. One area where the Windows Insider Program excels is in gathering broad coverage on the holistic Windows experience. We receive feedback and insights on every aspect of the OS – from app usage and core OS functionality, to usability and accessibility, from each of the primary languages around the globe. Windows Insiders represent the world in a microcosm.

One of the things we as a team have learned along the way is that at times the voice of one or two Insiders can be representative of hundreds of Windows users. Or thousands. Or even hundreds of thousands. Data from a handful of Insiders can truly be the “needle in a haystack” we’re often looking for while trying to squash bugs during a development cycle. The fun and challenging part of our work is when a small number of users – even just one or two – report an issue. Is this signal weak? Did only one or two people report it because they hit a unique scenario (commonly referred to as a single-user or one-off issue), or are these few users the proverbial tip of a “bug iceberg”? How do we tell the difference between these two categories?

The Tale of a Blue Screen (Well, Two Actually)

Let’s take a quick trip down memory lane. It’s mid-December 2016. We’ve released Build 14986 for PC to Insiders in the Fast ring. Users are installing and giving feedback and all seems well as we head into the end of the year. Fast forward a few weeks into the beginning of January 2017. Build 15002 was released. The Engineering team is taking in Insider feedback, reviewing bugs, and as usual, we’re chatting with Insiders on Twitter. As part of these conversations, a tweet comes across like so many others. One user seems to be having an issue installing the new build:

Together we go through various troubleshooting steps, we work with the user so that feedback gets filed, and we wait until the next build to see if it was an issue only for this user. About a week after 15002 was released, another build was shared out to the Fast ring for PC users, build 15007. Unfortunately for this user, it’s the same situation yet again:

Same user, same problem, but still only one report that landed in front of us.

A Blue Screen by Any Other Name

Fast forward a bit to early February. We’re doing the second #WinBugBash for the Windows 10 Creators Update and a new error appears from another Insider who is frustrated with not being able to update and participate.

Well, this is interesting. Another single-user issue of being stuck on 14986, but with a very different error message. We work through all the usual troubleshooting steps, we ensure that feedback gets filed, and we test through a few more releases on the chance that changes in the OS end up solving this issue. None of these steps resolves it.

As we work on this issue over the course of a few weeks, the frustration begins to set in for these two users. On our side, we’re still poring over the data. How do we fix this issue? Why is it so hard to diagnose? And why aren’t other users reporting this? If this were an OS issue, there should be more than one person hitting each of these errors. And if it were the same issue as the PTE_Misuse error from the first user, why aren’t both machines showing the same error? The one thing in common here is that both users are stuck on 14986 and haven’t been able to update to any new builds since then. Something is wrong and we need to tackle this from a different angle. It’s time to put a new strategy in place.

Comparing Notes

It’s March 1st, I have two unresolved issues, and I haven’t been able to make headway for these Insiders. Pulling out all the stops, I loop both users into a private conversation via Twitter DM. It’s time to crush these issues once and for all. I’ve talked extensively with the Deployment team here at Microsoft, but there’s no actionable data in the setup error logs. All that we know is there’s a failure and it’s early enough in the update process that no telemetry is being captured. All signs point to a kernel-level failure, but we don’t know exactly what is happening.

Sometimes taking drastic actions can have surprising results. After starting the private conversation, both users share their system information data and immediately we see a connection:

Now we have something to work with! Deeper investigation and additional hardware comparison show that both users are running nearly identical laptops, from the same manufacturer and with the same processor. The only notable difference is the amount of RAM in each machine (8GB and 12GB). At last we have a link! But even with this commonality, we can’t resolve the issue without seeing what’s going on. It’s time for kernel debugging, but neither user has done this before, and neither have I. There’s always time to learn something new though, and that’s part of what makes the Windows Insider Program fun.
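For readers who like to see the idea in code, the comparison we did by hand can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually used: the function name, the field names, and every value below are hypothetical placeholders, since the real system information reports aren’t reproduced in this post.

```python
def shared_fields(report_a, report_b):
    """Return the fields that are present in both reports with equal values."""
    return {
        key: value
        for key, value in report_a.items()
        if key in report_b and report_b[key] == value
    }


# Hypothetical, made-up values standing in for two system information reports.
insider_one = {"manufacturer": "ExampleOEM", "processor": "Example CPU", "ram_gb": 8}
insider_two = {"manufacturer": "ExampleOEM", "processor": "Example CPU", "ram_gb": 12}

common = shared_fields(insider_one, insider_two)
print(sorted(common))  # prints ['manufacturer', 'processor']
```

Fields that differ, such as the amount of RAM here, drop out, and what remains is the short list of commonalities worth investigating first.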

Cracking the Issue

I could spend pages of writing detailing what happened over the month of March and early April, but I’ll sum it up in a few bullets:

I shared instructions with Tony and he set up a kernel debugger. We noticed that the failure happened before the debugger could grab anything during the boot process. The effort dead-ended on his machine.

I did some research and found another laptop from a different manufacturer with the same processor. I set up a kernel debugger and tried to reproduce the issue. The issue didn’t appear. This points us to viewing this bug as being specific to the OEM (manufacturer) of the laptops these two users have.

We engaged with the OEM and kicked off a full investigation. The OEM tested the failure scenario on a variety of laptops and could reproduce the issue. This reassured us that we were on the right path, but we still had to find the root cause of the bug.

Several weeks of investigation and hypothesis testing ensued. After nearly 50 emails and countless hours of investigation between the OEM and Microsoft, our combined triage team found the issue.

It turned out that a BIOS update released by the OEM in late November for this family of laptop models prevented newer preview builds of Windows 10 from installing. The failure was so early in the boot process that none of our log-gathering tools, including a standard kernel debugger, could capture it. Now that we knew the full scope of the problem, we could begin to work on a fix.

Once the OEM had prepared an updated BIOS for these machines, it was time to test. The OEM completed a full range of testing and validation on their side. But what about the two Insiders who had worked tirelessly to help us troubleshoot? It was time to thank them. Once again partnering with this OEM, we had the updated BIOS delivered securely to these two Insiders. Both users installed the new BIOS and then attempted to install build 15063. At last… success! The BIOS update fully resolved the update issue for these two users.

An Interesting Side-Note

After all of this, it’s funny to look back and realize that we had accidentally found the issue back at the beginning of March without even realizing it. While the noted timing of the BIOS update was slightly off (Brent had updated in December), this bit of conversation captured the problem:

Build 15002 hadn’t been released when each of these users had taken the BIOS update. Both users had installed build 14986 and then installed the BIOS that was now causing problems. Sometimes it’s best to trust your instincts, and it is always important to pay attention to all the details, but I digress from our story.

Learning an Important Lesson or Two (Actually, Three)

After fully detailing the scope of the issue and identifying all possible laptop models from this OEM that could be affected by the issue, we put a block in place to prevent affected retail users across the globe from attempting to install the Windows 10 Creators Update and hitting this upgrade failure. Months of hard work and investigation had helped protect users from a failed OS upgrade experience. The work of two dedicated Windows Insiders helped prevent potentially hundreds of thousands of failed OS upgrades and frustrated users. These two users and their feedback helped others across the world avoid spending their internet bandwidth (which can be extremely expensive depending on locale) to download an upgrade they wouldn’t be able to install. We listened to Insider feedback and understood there was something bigger here, even though they were the only two people reporting the issue.

Windows Insiders are an amazing bunch of people. They like to learn, they like to explore new features and functionality, and, just as importantly, they like to poke around and find bugs. It’s this curiosity that helps us continually make Windows even better. Another important lesson we learned from this experience is how many layers of software are at play in making a PC work properly. The initial thoughts from the affected users were that we had regressed (broken) functionality at the OS level. As it turned out in this scenario, the OEM firmware was the culprit. The software creation process is complex and fraught with ways to create potentially unexpected results (a polite way of saying “bugs”!). Anyone who writes software wants to create a good product and a positive end-user experience. To help with the troubleshooting process in the future, we’ll be creating an additional series of documentation; some docs will highlight the various layers of software and how they work, while others will lay out the foundations of troubleshooting and some of the related best practices. We’re committed to helping Insiders learn just as much as we’re committed to learning from Insiders.

Lastly, this overall scenario was an important reminder to us on the Windows Insider Engineering Team. Seeking to increase the diversity of hardware and application usage will be an even greater focal point of the Windows Insider Program as we move forward. Seeing two Insiders represent so many retail users was a profound moment. We’re already working through ways to identify additional scenarios such as this and to give them the attention they so greatly deserve. It will be an ongoing effort, but we are committed to this program and to listening to our greatest fans who take the time to share their insights and feedback. Knowing how important it is for the millions (of Insiders) to represent the billions (of Windows users), we also know it is just as important for the two who represent the hundreds of thousands. You never know if you may be the one user, or one of a small handful of users, whose feedback and dedication end up being responsible for helping so many others.

This tale of two kernels is an expression of my unending gratitude for the dedication of every Windows Insider who helps make a difference every day. Without your efforts, this program would not be the success it is.