Application Troubleshooting: An Error Message, Without Context, is Worthless

I was inspired to write this article based off an internal discussion I was involved with where someone was requesting a comprehensive list of all possible event log messages delivered in Windows Server. I have always been interested in the reasoning behind an “ask” such as this because it could be misguided. Where to begin? First of all, it is quite a loaded “ask” as it is not specifying whether we are talking about all of the in-box default error messages relating to Windows – the operating system specifically, or are we also talking about all of the roles and features which also contain event providers? Where would all of these go? What would this be used for? Second of all, what is the purpose?

Answers always baffle me: “Oh, for the help desk.” “As a reference.” “For our in-house troubleshooting guide.” The last example leads me to what often results in a bi-directional obtuse conversation where I re-ask what are the use cases and contexts of these error messages. Will the person requesting the error messages also be attempting to follow-up on each error message to determine all of the possible situations in which these may occur? That would be quite an insurmountable task. In most cases, no, these will be either copied-and-pasted into a “guide” – sometimes even being printed out into a very thick, but mostly useless material reference.

HRESULTS vs. Event Logs

First of all, the event log message tells you everything you need to know about the ERROR itself. This has evolved significantly over the past decade with advancements in the Windows. Often, you will get errors/exceptions thrown from an application and the level of description will depend on the generosity of the programmer/developer of the application. One thing is for sure – at the very minimum in most cases, an integer-based or hex-based response will occur. The most common of these are HRESULTS. A few years ago, due to the nature of overlap between operating system components and external software, a tool called ERR.EXE was made available on the Microsoft Download Center (http://www.microsoft.com/en-us/download/details.aspx?id=985.) This tool parsed all of the known header files for windows and applications to provide the corresponding strings and descriptions of known HRESULTS and error codes within the Microsoft software ecosystem. The use of the utility would yield all known matches such as the example below:

This utility was especially useful because it could also automatically translate the decimal-based equivalent of an HRESULT:

But . . . Here’s the Thing. Where do you go from Here?

Well, how did you get there? What was occurring when you encountered that error? For example: Let’s say you get an HRESULT 0x80000005. You use the ERR tool (or you know from memory) that it is “Access Denied.” What was occurring when this happened? Let’s say this occurred within an application. You could then leverage Process Monitor or another tool to trace the issue to see what it file/registry entry/etc. the program was trying to accessed. There is no way to give anything more than a generic recommendation without additional context related to the error.

In the case of an event log entry, there is much more information. The source component, Event ID, description, and more – along with additional XML detail. In the example below, what else would you gather from this simply looking up Event ID 36888 other than what you see in the dialog boxes?

That is where the scenario in which it happened plays an important role in determine what the root cause of this error is. In some cases, only one situation warrants a particular error and resolution. These are the ones we love – yet they are rare. In most cases there will be more than one scenario.

Why are Some Event Messages More Descriptive than Others?

Like with applications errors, event log entries vary regarding the degree of detail. In many cases during the development and evolution of a component, particular errors are hypothetically conceived during the development cycle and are laid out with the event tracing framework. These get further nailed down in the test and beta phases and the events are adjusted accordingly. At the release time as many clear-cut, known issues are mapped out through release notes and knowledge base articles. This process continues through the lifecycle of the component or software.

But again, these are only events. On all of my machines I have ever worked with, I would venture to say that I have only ever encountered a small fraction of all of the potentially available windows events which warrant errors or warnings. There are easily over 200,000 ETW and legacy windows events. Why reinvent the wheel and create a huge list for searching when you already have BING at your fingertips. And with Bing, you can search the event\error with additional contents.

What is probably the most beneficial troubleshooting assistant is the known resolution – in the form of a knowledge base article. The “Symptom(s)” section of most types of knowledge base articles is where the context is mapped out with as much detail as possible. It is usually in the form of “You are running [SOFTWARE\COMPONENT] and you are performing [ACTION]” or “you are attempting to [CONFIGURE\LAUNCH\RUN\CLICK ACTION] and you encounter the following [ERROR\MESSAGE]

The Symptom(s) section is usually followed by the “Cause” section. Often this section is prefaced with “This issue may be caused by . . .” indicating that this may not be the only cause of the issue. Yet this particular cause will be remediated\resolved by the “Resolution” section. That format is what makes the knowledge base article so great. It cuts right down to the break-fix scenario. Symptom-Error-Cause-Resolution.

Building knowledge bases “from the error up” is not an optimal way of building out a framework for a help desk. These are built and drive by the overall experience of the issue itself. Working in support for many years, one of many elements that separated the true escalation engineers from what could automated with a diagnostic utility or front-line support was the ability to resolve an issue for the first time. While the first question was usually “What’s the Error Message?” it was the second question that was most important – “What were you doing?”