New Relic University

Set proactive alerting: understand and respond to performance issues

The term "alerting" often carries some negative connotations; for too many developers, alerting correlates too closely with errors, mistakes, and ongoing issues. However, for developers who are proactive about alerting, they know they don’t have to stare at their dashboards all day, because effective alerts will tell them when to check in.

Well-defined alerts help you understand the health of your systems, so you can respond to performance problems before they affect your customers.

1. Define required alerting policies based on Service Level Objectives

A service level objective (SLO) is an agreed upon means of measuring the performance of your service. The SLO defines a target value of a specified quantitative measure, which is called a service level indicator (SLI). Examples service level indicators could be average response time, response time percentile, and application availability. SLOs would then clarify a target value for those SLIs such as:

Average response time should be less than 200 ms.

95% of requests should be completed within 250 ms.

Availability of the service should be 99.99%.

These SLOs can also be logically grouped together to provide an overall boolean indicator of whether the service is meeting expectations or not (for example, “95% of requests completed within 250 ms AND availability is 99.99%”), which can be helpful for alerting purposes.

Use these SLIs as key performance indicators (KPIs) so your team and organization can measure your service and ensure it’s meeting customer expectations. By breaking down the quantitative performance metrics that are required of your services, you can then identify what kinds of alerts you should set for each. For instance, you could set an alert to notify you if web transaction times go above half a millisecond, or if the error rate goes higher than .20%.

However, not every SLO needs to become an alert. A strong alert strategy takes SLOs and creates a set of simple, actionable alerts. New Relic often finds that our most mature DevOps customers set fewer alerts in general and focus those alerts on a core set of metrics that indicate when their customer experience is truly degraded. As a result, New Relic DevOps customers often use Apdex as part of their alerting strategy to align alerts with signs of degraded user satisfaction.

As you design your alert strategy, keep this question in mind: “If the customer isn’t impacted, is it worth waking someone up?”

For a simple framework of areas to set alerts for, use the following questions and advised metrics and KPIs:

With New Relic you can set alerts on your instrumented applications, end-user experience, infrastructure, databases, and more. New Relic will alert you if your site’s availability dips or if your error rate spikes above acceptable levels, as defined by your SLOs. You can set warning thresholds to monitor issues that may be approaching a critical severity but don’t yet warrant a pager notification.

Setting thresholds for when alerts should notify teams can be challenging. Thresholds that are too tight will create alert fatigue while thresholds that are too loose will lead to unhappy customers.

Baseline alerts allow you to set dynamic thresholds for alerts based on historical performance. Use baselines to tune your alert to the right threshold. For example, an alert in APM can notify incident response teams if web transaction times deviate from historical performance for an allotted amount of time.

As you develop smaller, independent services running on increasingly ephemeral architectures, your environments become significantly more complex. Visibility into outliers can be an important tool for understanding likely performance issues. You should set alerts to automatically fire when you have an outlier -- this can indicate misbehaving hosts, load balancers, or apps.

For example, a load balancer divides web traffic approximately evenly across five different servers. You can set an alert based on a NRQL query, and notification to be sent if any server starts getting significantly more or less traffic than the other servers. Here’s the graph:

3. Identify groups to alert, and set broadcasting methods

Alerting without the proper broadcasting methods leaves you vulnerable. Your alerts strategy should include a notification channel to ensure the appropriate teams are notified if your application or architecture encounters issues. New Relic has many notification integrations, but we recommend that you start simple and add more complexity later.

We recommend that you first send alerts to a group chat channel (for example, using Slack or HipChat). Evaluate these alerts in real time for several weeks to understand which alerts are indicative of important or problematic issues. These are the the types of alerts that warrant waking someone up.

4. Fine-tune alerts and thresholds

As you use New Relic to optimize your application and infrastructure performance, tighten your New Relic Alerts policy conditions to keep pace with your improved performance.

When rolling out new code or modifications that could negatively impact performance over a period of time, loosen your threshold conditions to allow for these short-term changes. For instance, we recommend using pre-established baselines and thresholds to increase efficiency during high-impact times for your business, such as annual events and major releases. Fine-tuning gives you the flexibility you need to increase efficiencies and extend your notification channels.

As noted earlier, we recommend you start with a group chat service when you first establish notifications. Once you’ve identified other tools you’d like to integrate with, set up a notification channel to maintain your momentum. Tools such as xMatters and PagerDuty provide popular integrations, but don’t overlook simple methods, such as webhooks. The goal is to continuously improve your alerting scheme.

Be sure to check your alerts and confirm that they're firing regularly and are still relevant for your customer satisfaction metrics. Use the New Relic platform to create dashboards centered around alerts and incidents for the most common policy conditions and violations.

Conclusion

Establishing a focused alerts policy helps you pinpoint any degradation that could impact performance in your application or infrastructure. With proactive alerting, you will decrease user-reported incidents, and allow your teams to spend less time firefighting and more time deploying significant changes to your product.

For more help

For more tips and best practices for alerting, see the following documentation: