Monitoring, recovery, and precise logs can often do more to reduce the number of outages and limit the scope of system failures than the typical panaceas: more hardware, more software, and more hot spares.

Technical types are inclined to improve availability by applying technology: more hardware, more software, and more hot spares. But as Linux escapes the lab and moves into the machine room and increasingly the corner office, monitoring, recovery, and precise logs can often do more to reduce the number of outages, shorten the duration of each outage, and limit the scope of failures. Moreover, well-planned responses to IT "events" can significantly improve availability.

In many ways, an IT staff is like a fire department: all things are quiet until a call comes in, and then everyone leaps into action. And while a server crash or network hiccup isn't as life threatening or dangerous as a house fire, an outage can still translate to dire consequences.

Let's define an event as anything out of the ordinary that occurs in IT and that potentially or actually causes an outage. By this definition, any outage clearly involves one or more events. An event also occurs when the filesystem fills up or CPU utilization exceeds some threshold. The loss of a redundant component is also an event, even if it doesn't cause a service outage, since it may have a potential impact, such as reducing your infrastructure to a single point of failure or impaired capacity.

If you can catalog the types of events that may occur in your organization and how each one may impact your IT service, you can define the right approach to monitoring. You can also ensure that you plan appropriately for how and when to respond to each type of event. Depending on the event in question, the planned response is a mix of people and technology.

Basic Principles

A well-organized IT staff demonstrates these best practices:

It maintains control and duration predictability by always governing with time bounds (such as RTOs)

It has a consistent and well-understood support model

It ensures that every event has a clearly defined response. For example, "ignore" is a perfectly acceptable response for certain events

It maintains concern for inter-relationships among components

The latter point may seem the most esoteric, but it may be the most important practice of all. You must consider how a failure in each and every component of your infrastructure impacts the availability of other elements in the layers above, as well as any other components with which it interacts. Given today's IT complexity, "siloed" knowledge is inevitable; no person can be "a mile wide and a mile deep." One thing that distinguishes highly-available shops from those in constant crisis is the presence of specialists that understand how components in their domain of expertise impact and interact with other components in the system.

To react to a problem, you must first be aware of the problem. Notification is critical: you should know about a service outage before users start calling the help desk. The best helpdesk response is often, "We are aware of the problem. Estimated uptime is…"

To receive notification, you must monitor appropriately. Several monitoring tools are available for Linux systems, including Nagios and OpenNMS. At the core, each monitoring tool performs the same basic set of functions: the tool checks a service on a regular basis and takes some action if the response is abnormal.
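That core check-and-react loop can be sketched in a few lines of Python. This is an illustration only, not the API of any particular tool; the service list, health check, and alert callback are all stand-ins:

```python
def poll_once(services, check, alert):
    """One polling pass: run the health check against every service
    and invoke the alert callback for each abnormal response."""
    failed = []
    for name, target in services.items():
        if not check(target):
            alert(name)
            failed.append(name)
    return failed

# Example pass with a stubbed-out check. A real check might open a
# TCP socket, issue an HTTP request, or run an SNMP query.
services = {"web": "10.0.0.1:80", "db": "10.0.0.2:5432"}
down = poll_once(services,
                 check=lambda target: target.endswith(":80"),
                 alert=lambda name: print(f"ALERT: {name} is not responding"))
```

A production tool runs this loop on a schedule, de-duplicates repeated failures, and routes alerts according to policy, but the skeleton is the same.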

OpenNMS, for example, provides a number of predefined services that it can poll without much configuration. It also allows you to poll your own services. In the words of its developers, "OpenNMS was developed from the beginning to be an enterprise-grade solution capable of monitoring an eventual unlimited number of devices."

The best approaches to monitoring architect for three levels of monitoring:

High-priority component monitoring (monitored for automatic failover by automation software such as Heartbeat or Tivoli System Automation; servers are often monitored at this level).

There are all sorts of other considerations, such as where to send events (system logs, operator consoles, pagers, etc.), what events to ignore, how to review system logs proactively in an effort to prevent events, and many others. For a very good discussion on event-handling concepts, check out the first couple of chapters of the IBM Redbook "Event Management and Best Practices," available here.

How to Respond?

Once youâ€™ve defined your events and have deployed monitoring software, the next step is to specify the corresponding responses. Remember that each response must include both technology and process. As one practitioner noted in comments on the OpenNMS website,

"Watching a screen of scrolling messages is a bad way to monitor anything. Monitoring processes and procedures are just as important, if not more important, than the application that you choose."

Many IT organizations employ a three-tiered event prioritization structure, placing each monitored event into one of the tiers. Within each tier, you can define several parameters that govern the defined response. Figure One shows one simple example of tier definitions. (Feel free to modify these to fit your organization's needs.)

Figure One: Defining tiers of events for application ABC

Application/Service: ABC

Top-Tier Events: Events that indicate that users are either currently unable to access ABC or will soon be unable to access the application.

Examples: Monitoring of the ABC application itself reports the application is unavailable; monitoring of the database supporting ABC indicates no instances available; XYZ filesystem reports greater than 95% full; or utilization of CPU and/or memory across all nodes in ABC cluster above 90%.

Middle-Tier Events: Events that indicate that service may not be fully robust. Users could be impacted over time or if another event happens.

Lower-Tier Events: Anything out of the ordinary with less potential impact than middle and upper tier events.

Examples: Sustained transaction volume for some time period higher than normal; transaction response times slower than normal; event logging filesystem reports greater than 75% full
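The figure's thresholds could be encoded in a small classification function. This is only a sketch: the event names are invented for illustration, and since the figure gives no concrete middle-tier examples, everything below the top tier falls through to the lower tier here:

```python
def classify(kind, value=None):
    """Assign an ABC monitoring event to a tier using the figure's
    illustrative thresholds. Event names are invented for this sketch."""
    top = (kind in ("app_unavailable", "db_no_instances")
           or (kind == "xyz_fs_pct_full" and value > 95)
           or (kind == "cluster_util_pct" and value > 90))
    if top:
        return "top"
    # The figure lists no concrete middle-tier examples, so this sketch
    # lets everything else out of the ordinary fall to the lower tier.
    return "lower"
```

In practice you would add middle-tier rules (degraded redundancy, early warning thresholds) between the two branches.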

A three-tiered scheme allows you to define separate policies for each event type. For example:

All events: Logged to disk. Logs reviewed daily.

Middle-tier and above: Fed back to system console. Response required within 4 hours.

Upper-tier events: The on-call operator is paged, and an initial response is required within 10 minutes. If there is no response within 5 minutes, the operator is paged again; if there is no response within 10 minutes of the initial page, the operator's backup is paged. Resolution of the issue is required within 60 minutes, or an explanation must be provided to the Availability Manager.

Ideally, the policy structure cascades so that middle-tier events are logged to disk and fed to the system console. Similarly, upper-tier events are logged to disk, fed back to the system console, and sent to an operator's pager.
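One way to get that cascade for free is to derive each tier's action set from the tiers beneath it. A minimal Python sketch, with action names invented for illustration:

```python
# Response actions defined per tier; higher tiers inherit everything
# required of the tiers below them. Action names are invented.
TIER_ACTIONS = {
    "all":    {"log_to_disk"},
    "middle": {"console_alert"},
    "upper":  {"page_operator"},
}
TIER_ORDER = ["all", "middle", "upper"]

def actions_for(tier):
    """Return the cascaded action set: a tier's own actions plus
    those of every tier beneath it."""
    actions = set()
    for t in TIER_ORDER[: TIER_ORDER.index(tier) + 1]:
        actions |= TIER_ACTIONS[t]
    return actions
```

Defining only the incremental actions per tier keeps the policy table small and guarantees the cascade stays consistent as tiers are added or changed.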

Policy also drives a proactive approach to maintaining availability with an appropriate time bound and focus. Obviously, an upper-tier event must be addressed immediately, since it corresponds to an actual or imminent outage. A middle-tier event is something like capacity falling to a single point of failure or other semi-serious signs of trouble. A lower-tier event must be noticed because it may point to a more serious impending problem, such as an unusually high transaction volume or a buildup of old logs that are ready to be archived and removed from primary storage.

The exact nature of your tiers and policies depends on your requirements. What's important is that you have this consistency and that the policies reflect an acceptable level of support to the business. Furthermore, it is essential to assign time bounds to each of the associated reactions. Response times keep recovery efforts "on track" and also provide a level of predictability to anybody who depends on the system, especially users of the business processes supported by IT.

Keeping Track

Another aspect of mindful service is the mechanism by which you maintain information about each incident. In most cases, you should create and track a trouble ticket in some kind of tool. Some of the information to maintain is similar to what would be maintained in a bug report, except that updates are presumably happening within a much smaller time period.

Your issue tracking system should support four important goals:

Progress is matched against recovery policies and recovery time objectives.

Any interested party can obtain status information without bothering the teams working on the problem. The technical teams need only write the information once, and many can read it.

Technical information about the issue is recorded so that others may benefit from it.

Reports, such as post-incident reviews, trending analyses, and other problem management procedures can be obtained or facilitated.

Free-flow comments are good because they're flexible, but the lack of structure can also be a hindrance to the objectives above. One way around this is to define certain types of comments and a standard one-line header to be placed in front of each one. That way, a person scanning the record can simply scroll to the comments of interest. Even better, it's helpful if scripts can parse out the information into a single stream.

One perfectly serviceable approach is to simply define two types of comments: technical notes and status notes. As their respective names indicate, technical notes describe information about the problem itself and its solution. Status notes are intended both for keeping people updated in real time and for timeline analysis during a post-incident review. An even simpler solution is to assume that most comment content is technical and embed status information where appropriate. Status notes are then set off with a clear indicator so that they can be picked out by scripts or by someone scanning the record.

For example, a comment in Bugzilla may look something like:

------- Comment #x From Network_Admin 2006-03-01 06:17 EST [reply] -------
****STATUS****
Estimated uptime 60 minutes from the point at which the source of the bottleneck is pinpointed. Currently scanning network logs. Log info should allow us to determine the bottleneck within 30 minutes.
******************
Performance issue has been reported from multiple users on the XYZ network. Network congestion is likely the culprit with a number of possible components that could be responsible. Parsing network logs with diagnostic scripts to determine the source(s) of the bottleneck.

This comment helps in a number of ways. During the incident, anybody trying to work around the performance issue would know that restoration is expected in 90 minutes. Long after the incident is resolved, anybody facing a similar issue can see some of the steps to take to diagnose it. Finally, a timeline of events can easily be constructed for a post-incident review. The timeline might simply contain the timestamps and the status messages:

2006-03-01 06:03 EST:
Performance issue first reported to helpdesk
2006-03-01 06:17 EST:
Transferred to level 2 Network Support
Estimated uptime 60 minutes from the point at which the source of the bottleneck is pinpointed. Currently scanning network logs. Log info should allow us to determine the bottleneck within 30 minutes.
2006-03-01 06:47 EST:
Source of failure pinpointed to…
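Because the status indicator is consistent, a short script can pull the status notes and their timestamps out of such comments to build exactly this kind of timeline. A sketch, assuming the comment-header and asterisk-fence conventions shown above:

```python
import re

def status_timeline(comments):
    """Extract (timestamp, status text) pairs from Bugzilla-style
    comments whose status notes are fenced by lines of asterisks."""
    timeline = []
    for raw in comments:
        # Timestamp from the comment header, e.g. "2006-03-01 06:17 EST"
        stamp = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2} [A-Z]{3}", raw)
        # Status text between the ****STATUS**** marker and the closing fence
        status = re.search(r"\*{4,}STATUS\*{4,}\n(.*?)\n\*{4,}", raw, re.S)
        if stamp and status:
            timeline.append((stamp.group(0), status.group(1).strip()))
    return timeline
```

Feeding every comment on a ticket through this function yields the timestamped status stream a post-incident review needs, with the technical detail filtered out.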

The Bigger Picture

Now suppose that an event reveals some larger issue and the latter must also be resolved. One classic example is rapid escalation of allocated memory. The event itself may be resolved by recycling one or more instances of the live application, but there is an underlying cause that must be investigated and addressed.

If such a circumstance occurs, you can either keep the same record open or create another one. Closing the first report and opening another is the best approach, since the isolated incident is resolved, yet the larger, perhaps more systemic problem is yet to be addressed. However, your policies should ensure that a reference to the initial incident record appears in the problem record created to track the application problem.

This cross-reference is important for several reasons:

When an application fix is available, change control can have good information on the importance of the fix in determining when and how to roll it out.

Depending on the amount of information gathered in the event report, the test team can better verify that the (alleged) fix addresses the issue seen in production.

Once the fix is applied to production, the Availability Manager can monitor whether or not this was the only issue leading to high memory utilization.

Hardware and software are often required to improve availability, but processes, policies, and organization can minimize the effect of an outage when (not if) it occurs. Connecting the dots â€” being aware of how your components depend on and interact with each other â€” is a vital skill. Shops that see the big picture enjoy highly-available IT; those that are more narrowly-focused will always be putting out fires.
