Best Practices for Event Logging...

I think this is a good topic for a debate. So what is the best way to use the event log? The classical thinking is that a well-designed event log entry should contain only easy-to-read, easy-to-follow, pertinent observations such as "the disk space on C:\, less than 200 MB, was insufficient to complete the operation X. Please re-configure the application Y, or make more room on this drive". If the application works as expected, it should be silent. If something goes outside the plan, it should communicate that fact gracefully. No extraneous stuff.
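To make the "classical" rule concrete, here is a minimal sketch in Python of a check that stays silent on success and logs one readable, actionable entry on failure. The logger name `application_y` and the function names are hypothetical, chosen only for illustration, and the standard `logging` module stands in for whatever event log API your platform provides.

```python
import logging
import shutil

MB = 1024 * 1024
logger = logging.getLogger("application_y")  # hypothetical application name


def low_disk_space_message(drive, free_bytes, required_bytes):
    """Return a readable message if space is insufficient, otherwise None."""
    if free_bytes >= required_bytes:
        return None  # everything is fine, so there is nothing to say
    return (
        f"The disk space on {drive} ({free_bytes // MB} MB free) was insufficient "
        f"to complete operation X, which needs {required_bytes // MB} MB. "
        "Please re-configure application Y, or make more room on this drive."
    )


def check_disk_space(drive, required_bytes):
    """Log one clear error if the drive is too full; return True if it is safe to proceed."""
    free_bytes = shutil.disk_usage(drive).free
    message = low_disk_space_message(drive, free_bytes, required_bytes)
    if message is None:
        return True  # silent on success
    logger.error(message)
    return False
```

Keeping the message construction separate from the logging call is just a design choice that makes the wording easy to review and test on its own.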

But in reality this rarely holds. It is very easy to go wrong. First, a bug in the application (or a misapplied design) can cause thousands of useless entries to be logged. Not to be too picky, but on my machine at least, 99% of the Application event log is filled with ESENT informational entries.

The next capital mistake is logging impossible-to-read event log entries. For example (just to pick one of my own unfortunate coding mistakes, of which I am not proud at all ...)

In the end, such event log entries only add confusion to the problem. When seeing such an entry, what will the user's reaction be? First, you see a warning there. You feel that something is wrong. Or at least something might be, but you don't know for sure (since it is a warning, not an error). So what is GetVolumeInformationW? As an IT administrator you might be unfamiliar with it, but a quick web search will tell you that this API returns information about a certain volume. Now, what is this string \\?\Volume{57d9017d-f07e-11d8-8e52-505054503030}? And the rest of the blurb? This is bad.

So the conclusion (at least at this stage) is: don't put hard-to-read stuff in your event logs. But we are not done yet! We have settled the question of "clear event logs", but this doesn't mean that your application should never log anything that the user might find undecipherable. There can be a good reason to include more information about potential errors. That reason is called "supportability".

To see why, let's walk through a typical scenario in which one of your customers hits a bug in some application, let's say an antivirus product. The application doesn't log anything; it just refuses to scan a certain directory. No errors, no event logs, no popups, nothing. The system is a clean install of the latest version of the OS.

So what will the customer do? Again, call the Product Support department of that software company. Most likely (assuming that this is a reputable company), the PSS engineer will walk the customer through a series of lengthy investigation steps. That investigation might take hours. A total absence of information about the failure is just as bad as an unreadable event log.

Now, the fact that the application fails silently makes the investigation much harder. There is no immediate way to approach the problem, and live debugging might not be possible. Maybe the customer has stringent privacy requirements and you can't just debug the application on their machine. What then? The engineer might end up enabling some tracing or advanced logging functionality that the application provides or, if these are missing, sending scripts for automated, offline debugging. Either way, a lot of time and money will be spent investigating this problem.

This is why supportability is important. Sometimes a well-directed event log entry might be cryptic for a customer but of great help to the PSS engineer... That said, event logs like the one above are not an example to be followed :-)
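One way to reconcile readability with supportability is to lead with a plain-language summary and keep the cryptic details in a clearly labeled section meant for the support engineer. The sketch below is only an illustration in Python: `format_event` is a hypothetical helper, and the summary wording and the pairing of the GetVolumeInformationW details with the antivirus scenario are my assumptions, not the output of any real product.

```python
def format_event(summary, action, support_details):
    """Build an event text: readable summary first, diagnostic details last."""
    lines = [summary, action, "", "Details for product support:"]
    for name, value in support_details.items():
        lines.append(f"  {name}: {value}")
    return "\n".join(lines)


# Hypothetical entry for the antivirus scenario discussed above.
entry = format_event(
    summary="The scan of one directory was skipped because its volume could not be queried.",
    action="If this keeps happening, please contact product support and include this entry.",
    support_details={
        "api": "GetVolumeInformationW",
        "volume": r"\\?\Volume{57d9017d-f07e-11d8-8e52-505054503030}",
    },
)
print(entry)
```

The customer can stop reading after the first two lines, while the PSS engineer gets the failing API and the volume name without a debugging session.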