Monitoring SQL Servers Availability

July 31, 2006

How many SQL Server
instances are you running? Do you ever have customers calling you stating that
their application is not working, then when you research the problem you find
that the instance that supports their application is unavailable? Have you
ever had someone mistakenly shutdown the SQL Agent service and forget to
restart it, causing a number of scheduled jobs to not be run? One of the tasks
of a DBA is to monitor the availability of all SQL Server instances and
services. If you have a large number of instances this monitoring task could
take you a fair amount of time. This article will discuss why a DBA should
monitor instance availability, and provides an approach to automating this process.

Why Monitor Server Availability

A DBA needs to monitor the
availability of the SQL Server machines they support. Without monitoring, you
can never be sure whether all your SQL Server machines are available and
performing as expected. If you dont monitor then your first indication of a problem
is when someone reports they cannot connect to one of your servers. No DBA
wants to hear that an instance of SQL Server is not available.

If you are going to monitor,
you need to do it routinely. The goal of monitoring is to identify a problem
soon after it occurs. By monitoring routinely and frequently, you can be more
pro-active in resolving server availability issues. The sooner you know about
a problem, the sooner you can resolve it. Ideally, you want to resolve
problems before they are discovered by your customers. Monitoring server and
service availability can be a time consuming task. Because of this, it is best
if the monitoring process can be automated.

What to Monitor

One of the first things you
need to determine is what should you monitor. If you think about what needs to
be available to make your customer applications work then you can easily
develop a list of things to monitor. In my shop, I wanted to make sure that an
application/customer could connect to SQL Server. If an application/customer
could not connect then either the MSSQLSERVER service was not running, the box
was shutdown, powered off, or there was some kind of network connectivity
issue. I also wanted to verify that SQL Agent was running. If SQL Agent is not
running then routine scheduled jobs might not be started as scheduled.

The Automated process

The automated process to monitor
my servers consisted of a table, 4 different stored procedures, and two
different monitoring servers. Let me discuss each one of these pieces.

The two different monitoring
servers are just two different SQL Server machines, on two different networks
with different power supplies. One server is defined as the primary monitoring
server, and the other one as the backup monitoring server. The reason for two
monitoring servers is to allow the backup monitoring server to take over the
duties of monitoring should the primary monitoring server become unavailable.
Each SQL Server machine has connectivity to all servers that are being
monitored. Linked server definitions are defined on the two monitoring servers
for every machine that is being monitored. These linked server definitions are
used to submit commands to verify that the monitoring server can connect to the
server being monitored, and that all necessary services are running.

This table contains a list
of servers and services that will be monitored. Each monitoring server has an
identical list of servers and services. Here might be a typical set of records
for this table for a shop that only has three different servers to monitor:

Here you can see I am only
monitoring the "MSSQLSERVER" and "SQLSERVERAGENT" services. If in your shop,
you want to monitor additional services on your machines this process can
handle that by just adding additional table entries for the additional services
you wish to monitor.

The 4 different stored
procedures to support this process are: "sp_send_cdosysmail", "usp_check_connection",
"usp_services_monitor", and "usp_service_monitor_backup". I have placed each
of these stored procedures on both monitoring servers. Although the "usp_service_monitor_backup"
stored procedure (SP) really only needs to be placed on the backup monitoring
server. The first two stored procedures play a supporting role in the
monitoring process. The third SP, in the list above, does the actual
monitoring, while the forth SP is used to detect whether the primary monitoring
server is working.

The "sp_send_cdosysmail" SP provides
a mechanism to send SMTP mail. I obtained this SP from Microsoft; therefore,
I have not included the code for this SP. I like using SMTP mail, instead of
using "xp_sendmail". If you want to use SMTP mail and do not already have this
SP, you can find the code at the link below. Or, you can replace my executions
of this SP with however your shop supports mail from T-SQL. Here is the link
to a Microsoft page that describes using SMTP mail and the "sp_send_cdosysmail"
SP:

The next supporting SP is "usp_check_connection".
This SP is used to verify that the monitoring SP can connect to the servers
being monitored. This SP is critical to this process since it keeps the
process from failing when it cannot connect to a server being monitored.
Testing connectivity is accomplished using OLE automation. This SP tests by connecting
to the server being monitored via a linked server definition from the primary
or backup monitoring servers. This allows for this process to not to have to
store passwords, since the passwords are stored in the linked server security
definitions. Here is the code for this SP:

As you can see this SP creates
a SQLServer object, sets the "LoginSecure" property to "True" and then makes a
connection to the monitoring server using Windows Authentication. Once
connected to the monitoring server this SP builds a simple dynamic query using
a linked server "SELECT" statement to determine if the server being monitored
is reachable from the monitoring server. If it cannot connect to either the
monitoring server or the server being monitored then an error is returned to
the calling SP.

The reason I go through all
this logic just to verify that a server is up and running is because a normal
inline T-SQL linked server "SELECT" command would cause my SP to stop executing
if the "SELECT" command failed because a server was down. By using OLE
automation to submit my T-SQL command, this SP is able to detect the outage,
and continue running, so it can monitor the rest of the servers. Now that you
understand the supporting SPs, let's look at the two different monitoring
stored procedures.

Prior to reviewing the code
in the monitoring SP, let me first explain the method I use to determine if a
particular service is running. Microsoft provides an undocumented extended
stored procedure called "xp_servicecontrol". This undocumented stored
procedure can perform a number of service related functions. One of these
functions, "QueryState", returns the status of a particular service. A status
for a service like "SQLSERVERAGENT" can have 3 different possible values: "Running.",
"Stopped.", or "Starting...". I use the "QueryState" function of "xp_servicecontrol"
to verify that the service I want to monitor has a status of "Running".

As mentioned earlier, I have
a stored procedure "usp_service_monitor_backup" that verifies that the primary
monitor appears to be functioning. The "usp_service_monitor_backup" SP is scheduled
to run every 10 minutes, on the backup server, via a SQL Agent job. If this SP
cannot connect to the primary server, it assumes it needs to take over
monitoring duties and executes "usp_service_monitor" the primary monitoring
SP. Here is the code for this backup monitoring SP:

Here you can see I use the "usp_check_connection"
SP to determine if I can connect to the primary monitoring machine from the
backup monitoring machine. If I cannot then this SP assumes it needs to
perform the monitoring and executes the "usp_service_monitor" SP, which is the
main monitoring SP. If this SP can connect to the primary monitor, it then
checks to verify that the SQL Server Agent is up and running. If SQL Server
Agent is not running then this SP takes over the primary monitoring duties by
executing "usp_service_monitor". If this SP finds that SQL Server Agent is
running then it assumes that the primary monitor is functioning normally and
therefore the backup machine does not need to take over primary monitoring
duties.

I have already mentioned
that the "usp_service_monitor" SP is the primary SP for monitoring SQL Server services.
This SP needs to be placed on both the primary and backup monitoring machines.
Here is the code for this SP:

This code processes through
a list of servers and services found in table "MonitorServerServices" one record
at a time. For each record, this code calls "usp_check_connection" to
determine if the server to be monitored is reachable. If this SP cannot
connect to the server being monitored then an error message is sent to the
email distribution address passed as a parameter to this SP using the "sp_send_cdosysmail"
SP. If the server is reachable, then this SP uses the "xp_servicecontrol" SP
to determine the STATE of the service being monitored. If the service being
monitored is found to not be running then this SP sends an email to the email
distribution list. This email notifies the individuals on the distribution
list of the status of the service, so they can quickly respond to SQL Server
service problems. If the current service is running then the next server and
service to be monitored is obtained from the "MonitorServerServices" table and the
while loop is repeated. This process continues until all servers and services
have been monitored

Conclusion

I have found this monitor
process provides an early indication on whether MSSQLSERVER or SQLSERVERAGENT
services are not running. Every so often, our System Administrators will take
down both SQL Agent and SQL Server, but only restart SQL Server. When this
happens, this process lets the DBAs know that SQL Server Agent needs to be
started to prevent possible loss of work due to SQL Server Agent being down.
If you have experienced outages of services on your critical SQL Server
machines then you already know how nice it would be to have a service
monitoring process.