T-SQL Tuesday #66 – Monitoring

T-SQL Tuesday is a recurring blog party, that is started by Adam Machanic (Blog | @AdamMachanic). Each month a blog will host the party, and everyone that want’s to can write a blog about a specific subject.

This month the subject is “Monitoring”. If you want to read the opening post, please click the image below to go to the party-starter: Cathrine Wilhelmsen (Blog | @cathrinew).

Monitoring the performance or health of your environment might be the single most important thing in your daily job. But this might be the most important question you need to answer, before you start working on any monitoring or process. You don’t want to find out that your database isn’t working anymore, because someone from the business starts calling you his application isn’t working anymore. You want to know it as soon as possible, so when the business calls you, you can tell them you’re already working on it, or maybe even solved it. But how do you do that?

What do you need to monitor?
There are many different things you can and want to monitor, that’s why it’s difficult to explain the different aspects of monitoring. You don’t only want to monitor SQL Server, but also the server it’s running on. If a disk fills up, SQL Server will eventually stop working. But do you want to find out that you have a problem when your SQL Server stops working? I don’t think so…

But even within SQL Server, there are so much things you can monitor, depending on your needs. A number of these things are explained by Kevin Kline (Blog | @kekline) in his SQLBits session: “An End-to-End Troubleshooting Checklist for SQL Server”. He talks about troubleshooting SQL Server, but he mentions a lot of stuff you can start monitoring today. So if this will ever cause problems in the future, you can always look back at the history of a specific setting or configuration. You don’t want to spend time on searching for a previous setting value, when your SQL server isn’t working anymore.

Different ways of monitoring
There are different ways of monitoring your environment. Some parts you want to monitor with an automatic monitoring solution, that raises errors when you need to take action. But there are also parts of your environment, that you might want to monitor “manually”. Let me explain that.

SCOM
One of the corporate monitoring solutions you could use is System Center (SCOM). This gives you the oppertunity to monitor your whole environment from top to bottom: from server to specific disk. The downside is that it might be pretty expensive when you compare it to your other options.

And SCOM might be to overwhelming for your environment or process. Dealing with SCOM means you need a certain level of knowledge of your environment, and you need to know how to configure SCOM. This learning curve might be too much, when you just want to monitor a specific part of your environment or server.

SQL Sentry

SQL Sentry has proven itself as a tool provider with tools like SQL Sentry Plan Explorer. Personally, I use this tool every time I need to do some performance tuning. We’ve blogged about this tool with the SQLCoop team. It’s also featured on the SQL Sentry Plan Explorer page (which I’m very proud of!).

But besides that awesome tool, SQL Sentry is a company that built a number of monitoring solutions for your SQL Server. These can be found on their website. These are created specifically for monitoring SQL Server, and is a better choice compared to SCOM, if you just want to monitor SQL Server. You can use this if you want to monitor your whole SQL Server environment, with a tool build specifically for SQL Server.

Manual monitoring
Let’s say you have a production SQL Server instance. On this instance, you run a number of SQL Server Agent jobs. These jobs run on a specific schedule, and you want to monitor the outcome of those jobs. But perhaps you don’t need to take action when a specific job fails. If this is the case, using a big solution like SCOM for example, might be a bit too much.

Having a manual tool to check on job outcome and run time might be a better solution for you. If you want to do that, the SQL Server Agent job history window in SQL Server Management Studio (SSMS) might be the right tool for you. There’s no need for big tools, if your monitoring needs to be as basic as that.

Another tool you could use is SQLjobvis. I’ve mentioned this tool before, and I still use it almost on a daily basis. This can be used to monitor your environment “manually”.

A failed job can also send out a notfication by mail. This can be configured in the SQL Server Agent. This can be used to monitor your job activity as well.

Creating your own monitoring solution?
In the years I’ve worked in IT, I’ve seen a lot of companies and different monitoring systems. And there’s always that one company, that want to create its own SCOM implementation. They think the rest of the world is crazy, and that they can build a system better that’s better than everything else. Let me tell you, all custom build SCOM systems I’ve seen so far never lasted longer than a year. After 12 months, they use either another system on the side to fill up the gaps the custom SCOM system has left, or the custom SCOM system is gone entirely.

Maybe you can build a custom monitoring system for your business-specific processes (like checking the outcome of a webservice call), that could work. But when you’re building your own system to check the free space of a server, sorry to burst your bubble, you’re doing it wrong! Do you really think that you can build a custom system that can beat another monitoring solution that has evolved in 10+ years, and that has been tested in thousands of companies? I need to come across a system like that yet…

Conclusion
Looking at the different ways to monitor your environment, and the tools you can use for that, we can only conclude you have a world full of options. But in the end, you just fall back to 1 basic rule: if you monitor your environment, you need to act upon it as well. If a SQL Server Agent job fails, don’t just notice it and carry on your job, but act on it! Fix the job, or fix the underlying issue.

It’s basically the same as your backup process: when you’re taking backups of your database, and you never try to see if you can restore it, just stop taking backups! You can test, monitor and backup your entire environment, but if you don’t act on any issues, it’s all a waste of time. And I’ve seen that happening just too often…