Monitoring Databases

June 10, 2009

It seems there is no
end of tools or ways to monitor a database. Monitoring tools range from the
expensive, high-end see-all be-all variety to homegrown, simplistic shell
scripted ones. Before pursuing one or more approaches (and before spending
money), identify what it is you want to accomplish. What is it you want to
monitor, how that information should be surfaced, and by what means do you want
to capture events are some of the questions for which you should have answers.

Categorize the severity

Starting with the most
severe errors, pretty much anything that prevents users from connecting and
completing transactions is a showstopper, and is something you should be
notified about immediately. On the other hand, the low end may include events
of interest, which are just above the trivial. The stuff in between? Maybe they
are events you want to know about, but are not of such urgency that direct
human interaction is required right then and there.

To formalize these
categories, a set of tiers can be used.

Tier 1 events are
characterized by anything having an immediate impact on the business and
typically involve production systems. Instance and media failure are examples
of where the system (the database proper) is unavailable. The instance is not
running nor is the database open. Whatever the reason, users and processes
cannot connect.

Tier 2 events have a
limited impact, or are of a nature where no immediate response may result in an
escalation to Tier 1. A check disk space or tablespace full
type monitoring process can send a warning when a threshold value is reached.
You want to know about an event in this category, and when you do become aware
of it, you tend to have ample time to rectify the situation.

Tier 3 events have
none to little impact on the business as a whole, and may be events related to
after the fact. These are events you can wait to find out about when you get to
work in the morning. Its not that these are inherently trivial, but theres
practically nothing you could have done to prevent them in the first place, and
when they do occur, life generally goes on. An example is the snapshot too old
error. It could have been a one-time thing because of a users bad query being
run late at night.

However, there are many
other situations where non-database killing events qualify for Tier 1. Many
of them are related to space issues. Other events can span tiers, depending on
what is taking place. A failed database job is an example of this crossover. A
once a month job that fails, but can take place later (such as the next day) is
Tier 2, but if the job was supposed to run a critical process and it failed or
ran late, youd want to know right away.

Speaking of time, the
checks for Tier 1 events needs to be frequent. Further, if the monitoring
system fails and that is your only check mechanism, then you are blinded to
extinction level events within the database. Put another way, what monitoring
system do you use to check the monitoring system? For the homegrown approach, its
hard to beat the reliability of multiserver crosscheck. For example, to check
if database A is available on server A, a 5-minute between checks interval on
server A can be scheduled via a cron job. From server B however, you can use a
15-minute check interval against database A.

Checks in the other
tiers can be performed less frequently. Once a day, or at the end of the event
are two epochs. An example of something I use, which is based on these epochs,
is a check on the completion of RMAN backups. A (Veritas) scheduled backup of
datafiles for a particular database runs during a window of 0700 to 1000 every
day. A simple script to scrape the last line of the spooled log file, testing
for certain keywords, can trigger an alert in the form of an email if the job
did not finish or failed. At the end of the day, a last check of every
databases RMAN log files (datafile backup and archive log sweep) can be
consolidated and sent via email.

What are some of the
show stopping events, where something other than instance or media failure is
preventing normal operations to take place?

Max segments reached 
in the dark ages, max extents started at 505, and now the typical limit is
unlimited. If you migrated from dictionary managed to locally managed tablespaces,
the max extents stayed at whatever was being used in the DMT. You can still get
bit by max extent errors depending upon how you migrated and upgraded.

Out of space  running
out of space in an archived redo log destination is an attention getter as the
database hangs until the space issue is resolved. This lasts forever until
rectified. Other out of space events may cause operations to hang, but are not
fatal (as they used to be) if using resumable space allocation and you allow
enough time to address the problem.

Invalid objects  if
an object goes invalid, its because a user (to include a DBA) did something to
invalidate it, but it could also be due to an Oracle-generated reason. A
statistics gathering job, that is, something normal taking place within Oracle,
may generate a self-inflicted event. However, if something else went horribly
awry, you would want to know why there are scores of invalid objects all of a
sudden.

Distributed
transactions  a distributed transaction failure could represent something
serious or something inconvenient. In the serious case, connection to a remote
server may have been lost, so what happens on your server? If it was based on a
users ad hoc query, well, try again later--but if it involved DML, something
critical could have failed. This one is kind of a wild card due to what was
taking place at the time of the error.

Alert log checks 
probably one of the more classic checks is to run a difference between tail end
copies of the alert log, and check for differences, especially where the new
version has ORA in it. This follows that ORA errors raised (written) to the
alert log are of relative importance to the database. Whats different here is
that the alert log in and of itself is obviously not a show stopping event, but
is the collector of serious errors.

The People Side of Monitoring

In a multiple DBA shop
where the on-call responsibility rotates, alerts about Tier 1 events are
directed to the on-call DBA. As to whether or not you want all DBAs to get a
text/SMS alert sent to cell phones is up to you, but consider what happens if
the primary on-call person cannot be reached. Tier 2 events can be emailed to
the group, and notification about Tier 3 events may go to the principal
owners of a database. It is a hugely self-defeating system where an on-call
DBA receives an alert, but has to call and wake up the owner of the problem
database when a routine operation needs to be performed  especially when it is
something that could have been taken care of during business hours the day
before, or was entirely avoidable in the first place. Likewise, what is the
point of responding to a transient invalid object check that resolves itself,
but your monitoring system sends out an alert about the event, and it is an
alert that requires human interaction such as rolling the alert log? Are you
really doing your job, or are you building in things to make you look busy or
important?

In Closing

Monitor what is
important to you and keep the amount of nice to know but not essential
notifications to a minimum. When you are paged in the middle of the night, it
should be for something serious (and fun to fix) as opposed to you having to
react to mundane car alarms. Sometimes, the simplest solution is the best
solution. If building in monitoring via crond scripts on a server, consider
keeping a running log file of the alert. The benefit of using a robust
monitoring system is its ability to maintain a history of notification events.
Documentation by email is a poor way to maintain a central repository about the
history of an event.