Alerts, Alert Escalations, and Server Arrays

Objective

The purpose of this document is to explain the relationship between alerts, alert escalations, and server arrays.

Overview

To understand RightScale's alert and monitoring system and how you can use it to automatically perform actions on your setup, it's important to understand each component in the context of the others. The alert and monitoring system consists of several key components, each of which is discussed in more detail in the sections below.

Scalable Architecture Diagrams

These are typical architectures for front-end, client-facing sites. A common basic scalable setup is a four-server setup in which the front-end servers act as both load balancers and application servers.

For larger websites, you might have two front-end servers that are used strictly as load balancers, so you can expand the number of application servers underneath.

Monitoring

RightScale's Monitoring system serves as the backbone for Alerts and Alert Escalations. Before you can use any alerts, you must first enable monitoring on each server. If you are not using one of RightScale's ServerTemplates, you need to enable monitoring on all of your servers yourself. See Setting up collectd.

Important! Alerts only work on servers that have monitoring enabled.

Server Arrays

A Server Array is a group of mostly identical instances where the number of instances in the array varies (i.e., scales) over time in response to changing factors. RightScale offers both alert-based and queue-based arrays. Alert-based arrays are most commonly used for scaling a pool of application servers.

When you create an array, you need to associate it with a particular deployment. Multiple arrays can be associated with the same deployment. For example, you might have an alert-based array for scaling the number of application servers and a queue-based array for scaling worker instances for back-end batch processing.

In some ways, you have the same level of control with server arrays as you do with deployments. For example, you can define common inputs and alerts that will be inherited by all servers in the array or run RightScripts on server instances.

When you create a server array, you must define how you want that pool of server resources to scale up and scale down. For example, you can define the minimum and maximum number of servers that can be launched in an array, which availability zones you want server instances to be launched into, and the basic server launch options that will be used for all server instances, including ServerTemplate, SSH Key, Security Group, instance type, and machine image (RightImage).

The server array's scaling parameters also dictate how many servers to launch when scaling up and how many servers to shut down when scaling down. You can specify different rates for scale-up and scale-down actions. For example, you might want to scale up quickly by adding four servers at a time, but take a more conservative approach when scaling down by terminating only two servers at a time.
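The interaction between the minimum, maximum, and resize values can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not RightScale's implementation; the names `ArrayBounds`, `resize_up`, `resize_down`, and `next_size` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArrayBounds:
    # Hypothetical field names that mirror the parameters described above.
    min_count: int    # minimum number of servers in the array
    max_count: int    # maximum number of servers in the array
    resize_up: int    # servers to launch per scale-up action
    resize_down: int  # servers to terminate per scale-down action


def next_size(current: int, bounds: ArrayBounds, direction: str) -> int:
    """Return the array size after one scale action, clamped to min/max."""
    if direction == "up":
        target = current + bounds.resize_up
    elif direction == "down":
        target = current - bounds.resize_down
    else:
        raise ValueError("direction must be 'up' or 'down'")
    return max(bounds.min_count, min(bounds.max_count, target))


# Scale up by four at a time, scale down by two at a time.
bounds = ArrayBounds(min_count=2, max_count=10, resize_up=4, resize_down=2)
print(next_size(3, bounds, "up"))    # 7
print(next_size(8, bounds, "up"))    # 10 (capped at the maximum)
print(next_size(3, bounds, "down"))  # 2 (held at the minimum)
```

The clamp at the end is the part worth noticing: a resize value larger than the remaining headroom does not push the array past its limits.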

When you configure scaling, you must define the decision threshold (Default: 51%) at the server array level. The threshold is the percentage of servers that must be voting for the "vote_grow/shrink_array" action before the array is allowed to scale. This threshold helps prevent an outlier server from unnecessarily scaling the array. For example, one of the application servers might get hit with an odd spike while the rest of the servers are operating normally. Instead of letting one server make the decision to scale, you can set a 51% decision threshold to ensure that you only scale up when a majority of the servers are experiencing the same alert condition. Be sure to read the voting section that follows so that you thoroughly understand the flow of the scale-up and scale-down actions.
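The decision threshold amounts to a simple percentage check, sketched here in Python. The function name `threshold_met` is hypothetical, and treating the threshold as "at least 51%" is an assumption based on the wording in the voting walkthrough later in this document.

```python
def threshold_met(votes: int, total_servers: int, threshold_pct: int = 51) -> bool:
    """True when the share of voting servers reaches the decision threshold."""
    if total_servers <= 0:
        return False
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return votes * 100 >= threshold_pct * total_servers


print(threshold_met(1, 2))  # False: 50% is below the 51% default
print(threshold_met(2, 2))  # True: 100% of the servers are voting
print(threshold_met(2, 3))  # True: about 67% of the servers are voting
```

With only two voting servers, a 51% threshold means both must vote before the array scales, which is why a single outlier cannot trigger a scale-up.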

If you have predictable scaling patterns, you can also configure a daily/weekly scaling schedule where your array automatically grows/shrinks at predefined times. See Server Array Schedule and Server Arrays for more information.

Alert Escalations

An alert escalation defines the action or set of actions to be taken when specified alert conditions are met. Each alert escalation can have a series of actions that are executed in sequential order until the alert condition no longer exists. By default, you must associate an alert escalation with an alert when you create the alert. Therefore, you will need to create an alert escalation before you create an alert (unless you use one of our predefined alert escalations). See Create an Alert Escalation.

Alert Escalations are assigned at the Deployment level. Alert Escalations are either assigned to a specific deployment or they are made available to all deployments. (As shown by the labels "Deployment - A" and "All Deployments" in the following illustration.) Once an Alert Escalation is assigned to a deployment, all servers within that deployment can call for that alert escalation.

Actions

Each alert escalation consists of one or more actions, which are performed in response to a triggered alert condition. Actions are executed in sequence for as long as the alert exists. If the alert goes away and later returns, the actions are processed again in sequence from the beginning of the list.
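The "in sequence, restart when the alert returns" behavior can be modeled as a small state machine. The sketch below is a simplified illustration in Python: the `Escalation` class and its `tick` method are hypothetical, the action names are placeholders, and the choice to do nothing after the last action has run is an assumption, since this document does not say what happens at that point.

```python
class Escalation:
    """Runs a list of actions in order while an alert stays active."""

    def __init__(self, actions):
        self.actions = list(actions)
        self._next = 0  # index of the next action to run

    def tick(self, alert_active: bool):
        """Advance one step; return the action that ran, or None."""
        if not alert_active:
            self._next = 0  # alert cleared: start from the top next time
            return None
        if self._next >= len(self.actions):
            return None  # assumption: nothing more to do after the last action
        action = self.actions[self._next]
        self._next += 1
        return action


esc = Escalation(["vote_grow_array", "send_email"])
history = [esc.tick(active) for active in (True, True, False, True)]
print(history)  # ['vote_grow_array', 'send_email', None, 'vote_grow_array']
```

In a real escalation each action would also carry its own timing; that detail is left out here to keep the ordering logic visible.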

Alerts

An alert (specification) defines the conditions under which an alert is triggered and an alert escalation is called. Several of the most common alerts have already been predefined for your convenience and are already included in many of our ServerTemplates. You can either create an alert from scratch (see Create an Alert) or import an alert from one of the following locations:

Default (RightScale Alerts) - A list of commonly used alerts configured by RightScale

An alert must be assigned at either the ServerTemplate, Server, or Server Array level.

When you create an alert you need to associate it with a particular alert escalation. You can only assign one alert escalation to an alert. However, you can create multiple alerts that monitor the same metric and then vote for a different alert escalation. Similarly, you can have multiple alerts that call the same alert escalation. For example, you might have several alerts that point to the 'default' alert escalation.

Understanding the Voting Process

Let's say you have a basic four-server setup with two front-end servers. You want to set up a scalable server array where additional application servers are launched when 51% of the front-end and application servers (in the array) have a cpu idle value that's less than 60% for more than two minutes. First, you need to make sure that the ServerTemplates used by both the front-end servers and the application servers monitor the same alerts. Remember, each server that is launched into an array uses the associated ServerTemplate. By associating the alert with the ServerTemplate (e.g. PHP App Server), you ensure that each server launched into the array monitors the same alerts.
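The alert condition in this scenario (cpu idle below 60% for more than two minutes) can be expressed as a check over a series of monitoring samples. The Python sketch below is an illustration only: the function name `alert_triggered`, the sample format, and the 30-second sampling interval in the example are all assumptions, not details of RightScale's monitoring system.

```python
def alert_triggered(samples, threshold=60.0, duration_s=120):
    """Check whether cpu idle has stayed below `threshold` for more than
    `duration_s` seconds, up to and including the latest sample.

    samples: list of (timestamp_seconds, cpu_idle_pct), oldest first.
    """
    breach_start = None
    for ts, idle in samples:
        if idle < threshold:
            if breach_start is None:
                breach_start = ts  # start of the current low-idle run
        else:
            breach_start = None  # condition cleared; the run is reset
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > duration_s


# cpu idle drops below 60% at t=30 and stays there through t=180.
busy = [(0, 85), (30, 55), (60, 50), (90, 58), (120, 40), (150, 45), (180, 52)]
# cpu idle recovers at t=60, so the low-idle run restarts at t=90.
blip = [(0, 55), (30, 55), (60, 70), (90, 50), (120, 50)]

print(alert_triggered(busy))  # True: below 60% for 150 seconds
print(alert_triggered(blip))  # False: current run has lasted only 30 seconds
```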

The basic equation for the alerts is as follows:

Let's watch the voting process in action.

At first, no servers are voting for an alert escalation. Later, one front-end server is getting hit pretty hard. Once its cpu idle value has been less than 60% for more than two minutes, it starts to vote for the 'scale-up' alert escalation. Each server can cast only one vote. So even if a server has three alerts that are triggered and escalating to a 'vote to grow' action, the server counts as one vote, not three.

When scaling in the cloud, you always want to scale democratically: a majority of the servers must experience the same alert conditions before you take corrective action such as launching additional servers. To guard against accidental scaling, where one outlier server causes a scale-up event, you must define a decision threshold, which is the percentage of servers that must be voting for a 'vote to grow' action before any servers are launched. For example, you might set the server array to scale only if 51% of the servers are voting for the same alert escalation.

In the following example, once two servers have been voting for the same alert escalation for longer than two minutes, it is time to execute the actions associated with the 'scale-up' alert escalation.
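The one-vote-per-server rule described above can be sketched as follows. The function name `count_votes`, the server names, and the alert names are hypothetical and serve only to illustrate the counting.

```python
def count_votes(voting_alerts_by_server):
    """Count servers with at least one alert voting to grow.

    A server contributes one vote no matter how many of its alerts
    are currently escalating to a 'vote to grow' action.
    """
    return sum(1 for alerts in voting_alerts_by_server.values() if alerts)


state = {
    "frontend-1": ["cpu idle low", "load high", "connections high"],
    "frontend-2": [],
}
print(count_votes(state))  # 1: three alerts on one server still count once
```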

Once 51% of the servers are voting, the 'scale-up' alert escalation is called and its actions are executed, as long as at least 51% of the servers continue to vote. In this example, the first action is to launch a new application server into the server array.

In this example, we've configured the array to have a 'calm time' of 10 minutes in order to prevent the array from scaling too quickly. Since it can take several minutes to launch and configure a new server, you don't want to launch a bunch of servers all at once without giving the new servers enough time to become operational and actually have an impact on your site's performance.

However, if the new application server does not help the front-end servers and the alert conditions persist, another server will be launched into the array (after the calm time) because more than 51% of the servers are still voting for the alert escalation.
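The effect of the calm time can be seen in a minute-by-minute simulation. The Python sketch below is a simplified model under stated assumptions: the function name `launch_minutes` is hypothetical, the decision threshold is evaluated once per minute, and the calm time is measured from the moment of the previous launch.

```python
def launch_minutes(threshold_met_by_minute, calm_minutes=10):
    """Return the minutes at which a scale-up action fires.

    threshold_met_by_minute: one boolean per minute, True when enough
    servers are voting to satisfy the decision threshold.
    """
    launches = []
    last_launch = None
    for minute, threshold_met in enumerate(threshold_met_by_minute):
        if not threshold_met:
            continue
        # Launch only if no launch has happened within the calm time.
        if last_launch is None or minute - last_launch >= calm_minutes:
            launches.append(minute)
            last_launch = minute
    return launches


# Servers keep voting for 25 minutes with a 10-minute calm time.
print(launch_minutes([True] * 25))                 # [0, 10, 20]
# Voting stops after 12 minutes, so only two launches occur.
print(launch_minutes([True] * 12 + [False] * 5))   # [0, 10]
```

Without the calm time, every minute in which the threshold is met would trigger another launch, long before the first new server could become operational.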

This time, the second application server in the array was able to reduce the load on the two front-end servers enough to resolve the alert condition. Notice how the cpu idle value suddenly increased back to 90% once the second server came online. Once the alert conditions have been resolved and the alert escalation is no longer being called, everything is reset. A server will again need to have a cpu idle value of less than 60% for two minutes before it starts to vote for an alert escalation.

In this example, the alert escalation did not have enough time to execute its second action (send email), because the alert conditions were resolved in less than an hour.

Important! Don't forget to create a similar alert and alert escalation to 'scale-down' the array once the additional servers are no longer needed.

You may need to test different scaling parameters in order to determine an appropriate array resize value to use when scaling up and down. In the preceding example, we should probably change our resize value to 2 instead of 1, because the addition of a single server did not resolve the alert condition.