My Approach to Scanning SQL Server Agent Activity

For the few production servers that I manage for clients, whenever I peek in on their servers (which might be as often as a few times a week or only every few weeks – depending upon workload characteristics), my primary goal is to ‘get in and out’ as quickly as possible – without missing anything pertinent.

To do that, I start by scanning the Job Activity Monitor – just by double-clicking on it.

Because by looking at this ‘view’ of SQL Server Agent Jobs, I’m able to quickly spot any big problems such as jobs that might have failed or crashed or run into issues – simply because those jobs will have a red icon next to them. (Of course, for jobs that are mission critical and which can’t fail without causing some problems, I always make sure to set up a Notification so that someone will be notified when these jobs run into issues. But, for other (less critical) jobs – scanning them via the Job Activity Monitor is a great way to go as you get a lot of information in very short order.)

Of course, just because a job succeeded the last time it ran, doesn’t mean there might not be problems. Likewise, even if a job continues to run on a regular basis without any actual errors, that doesn’t mean it’s necessarily working correctly either.

As such, another thing that I very commonly do is take a peek at the history of particular jobs:

As this, too, represents a great way to get a lot of information ‘at a glance’ because, for example, if there have been any recent failures, they’ll show up in the Log – along with any potential details about notifications, and so on.

Whereas, if jobs are running along without any problems, you’ll see “Nuthin’ but Green” which is what you’re hoping to see.

Don’t Forget to Check Durations

Of course, while Green/Success outcomes are what you want to see – their presence isn’t always 100% indicative of the lack of problems. And that, in turn, is where it’s really up to you as the DBA to scrutinize the potential for any issues going on with key jobs on your server.

And this especially applies to the kinds of jobs you might have already set up alerts (or Notifications) for should they ‘fall down’. And, what I’m specifically referring to here would be key jobs such as Data Integrity Checks (i.e., execution of DBCC CHECKDB()), backups, index maintenance, or other mission-critical business operations that may be deployed on your server via the SQL Server Agent.

And, as an example, no where are Durations more important than when it comes to looking at various ‘size of data’ jobs – or jobs that are directly keyed to how much data you have. Because sharp increases in the amount of time these jobs take can either be indicative of unexpected and un-planned-for growth (that you may need to immediately address), or they can even be indicative of the potential for database corruption – such as in situations where SQL Server might have to be spending MUCH more time reading data than it might normally have to (thanks to retries vs corrupt sectors/etc.)

Likewise, another anomaly you will want to watch for will be cases where a job that normally takes a long time to execute, suddenly and unexpectedly executes in much less time than normal – because cases like this might mean that some key aspect of the job was skipped. In fact, a great example of where you can run into just such a situation would be a case where you might be trying to run DBCC CHECKDB() dynamically against all of your databases using the undocumented sp_msForEachDb stored procedure to ‘loop’ through the names of all of your databases. But, as Aaron Bertrand (blog | twitter) points out, sp_msForEachDb is capricious, treacherous, and simply shouldn’t be trusted – because (under load) it can quite simply LIE about how many databases there are, and if you’re using this undocumented proc (that Microsoft is within their rights to NOT fix), then you’ll run into just such a problem. And, the same thing would apply to situations where you might be using similar techniques for index maintenance, backups, or whatever. The key point being that sudden DROPS in how long an operation takes MIGHT be an indicator that there’s been some sort of a problem.

Otherwise, for other kinds of jobs, by studying them periodically you should be able to gain a sense of their ‘ebbs and flows’ in terms of the variability that you see in their individual execution times – and, as you continue regular check-ins and scans of your SQL Server Agent Jobs, you should also be able to potentially get to a point where you can even gain a sense of when there might be problems with these jobs – based solely upon anomalies with their durations.

Finally, don’t forget too that the SQL Server Agent has its OWN logs – where it complains about issues, problems, and hiccups that it might have. And while I personally don’t check it as often as I should, it’s something you’ll probably want to take a peek at semi-regularly as well.

Contributors

Michael K. Campbell is a contributing editor for SQL Server Pro and Dev Pro and is an ASPInsider. Michael is the president of OverAchiever Productions, a consultancy dedicated to technical evangelism...