DBAs and SQL Server Logs

In last month's "DBAs and Windows Event Logs," I opined about how import it can be for DBAs to keep tabs on their underlying hosts by paying attention to Windows Event Logs. In that same spirit, it effectively goes without saying that DBAs should also make a point of regularly checking their SQL Server Logs as well. Accordingly, I wanted to share a couple of tips and tricks about what I typically look for when scanning SQL Server’s logs.

Checking SQL Server Logs - Frequency

Back in the day when I was a full-time production DBA, I made a point of checking SQL Server logs at least daily on most of my production servers. On some of my more mission critical servers I even checked my logs more frequently than once a day. Whereas, on the other hand, I would only check on logs for some dev or testing servers every few days or maybe only once a week or so. So, the point in this case is that frequency really depends upon the priority and importance of your workloads – and there’s no real ‘one-size-fits-all’ recommendation for WHEN (or how frequently) you should be checking on your logs. Stated differently, checking your logs isn’t part of a checklist of things you HATE to do and need to ‘check off’ of your list – it’s about using some VERY powerful tools that SQL Server provides to help make sure you’re not running into problems, getting attacked, or missing something critical that could jeopardize your systems or data.

Checking SQL Server Logs – What to Look For

Of course, if you’re regularly checking your logs as a means of gaining insight into any potential problems or concerns that might be cropping up on your servers, that kind of begs the question of what kinds of things you should be looking for. And while an exhaustive list of things to look for is impossible to compile, what follows is a list of some of the more important things that I typically tend to check (though feel free to ‘spout off’ on any thing you think I may have missed in the comments):

SQL Server Recovery Time. As I mentioned in my post on SQL Server 2012’s Indirect Checkpoints, one of my favorite things about SQL Server is checking on how well it does in terms of meeting the Recovery Interval goal – something that you can check on in the logs by looking for how long it takes the ‘Recovery is complete’ informational message to show up in the logs after SQL Server starts up. But, more than being a mere morbid curiosity about whether or not SQL Server is able to pull off the goal of recovering on time, there’s real value in checking on how long it takes for recovery to complete – because as Paul Randal recently pointed out, this interval is NOT guaranteed. As such, if you’re regularly seeing situations where the recovery interval takes longer than expected, you should do additional investigation of the cause. In some cases that’ll just be ‘how the cookie crumbles’ because you MIGHT be ‘stuck’ with lots of large and long-running transactions. But, in my experience, situations like that should be very rare and should be a known- quantity in the sense that this shouldn’t really be a surprise to you (meaning that due-diligence has been done to try and decrease the scope, size, and duration of transactions). Otherwise, the key point in keeping an eye on recovery times is to help make sure you’re avoiding any surprises when it comes to your RTOs.

Login Failures. Login failures typically come in two forms – either from ‘innocent’ end-users or applications that are having what should be temporary or fleeting problems, and situations where malicious users or applications are trying to brute-force their way into your box. For the most part, you don’t typically need to worry or care about ‘innocent’ users and apps failing to login. Instead, your primary point of interest should be to watch for repeated attempts to login within VERY short time periods – as seen in the screenshot below:
And, this is especially true in cases where SQL Server authentication is being used and when you have a SQL Server that is internet facing.
As such, I can’t recommend enough that you PAY ATTENTION to these kinds of events if you’ve got SQL Server Auth enabled on an internet facing server – because watching for brute-force attacks in your log is SADLY the only line of defense that you really have – and that’s only a line of defense after you’ve done everything outlined in my post on protecting yourself from brute-force attacks that are very efficient at trying to hack their way into SQL Servers using SQL Auth that are internet-facing.

Backups. Another key thing to keep an eye on within the SQL Server logs is reports on backups. Or, more specifically, what you’ll want to keep an eye out for would be any errors or warnings about potential backup failures. Accordingly (and especially if you’re on a multi-tenant system (or any system) with lots of databases and lots of transaction log backups), you might want to actually DISABLE informational messages about SUCCESSFUL backups – just as a means of cutting down the noise to signal ratio so that problems are easier to spot when skimming.

Errors and other Warnings. Many times errors and warnings that end up in SQL Server’s logs will end up in the Windows Application Event log as well – which is why checking them can be such a beneficial task. But, the short story here is that if something is ailing, running into security/connectivity problems, or is about to break (or has already done so), then the log is going to be one of your first lines of defense against such an occasion taking you by surprise. Or, in other words, if you’re regularly checking your logs and something is running into problems, then you’re less likely to let such a problem go days/weeks/months unnoticed.

Aggressive Workset Trimming. This one sucks, and it’s something you hopefully won’t see in your SQL Server logs. But, by the same token, it’s also a perfect example of something that you CAN see in your logs that will be indicative of big/ugly problems. And, the short story is that it’s possible for the OS to tell SQL Server to ‘give up’ huge/contiguous blocks of memory (typically for a driver or other new/spun-up process) that will, in turn, force SQL Server to page large amounts of memory to disk. Only, SQL Server is literally too good to use paged memory (and for good reason – it’d be tantamount to performance suicide). So, instead, SQL Server just kind of takes it in the shorts when this happens and performance drops into the toilet as you see messages about how “A significant part of sql server process memory has ben paged out” in the log at the same times when performance starts to tank. Happily, though, this is fairly easy to correct via the instructions outlined in this KB article. (Though, you’ll sadly have to jump through additional hoops with an additional Trace Flag to get those work-around steps to actually work if you’re NOT running Enterprise Edition (i.e., if you’re running into these trimming problems on SQL Server Standard Edition).