Incident Monitoring and Response

Within most technical operations environments, you’ll find some form of incident response plan. These procedures are used to manage and resolve an incident as well as to communicate with customers when issues arise.

At Dyn, we’ve evolved a philosophy around incident response that scales and that takes into consideration the human impact of this work, for our colleagues and our customers, as we work through issues of varied duration, severity, and complexity. Our incident response process is broken down into three distinct components:

Incident Command: The management of resources and coordination of efforts within the confines of incident response.

Communications Coordination: Communication, both internally and externally with our customers, when issues arise or when we make changes to production environments as part of remediating incidents.

Documentation Coordination: The collection of metrics and the analysis of alerts, key performance indicators, and other data sets to properly inform the other components of the impact and scope of an incident.

The key to effective monitoring and alerting is to provide a clear illustration of the blast radius in a complex-system failure or security-related incident. Monitoring visualizes “situation normal”: the number of failed login attempts on critical infrastructure, latency on page loads, or the time it takes DNS to perform lookups. Deviations in any of these are harder to detect without regular monitoring in any technical operations environment. Many organizations struggle to implement effective monitoring that triggers meaningful alerts when deviations from established operating norms are identified. The unfortunate byproduct can manifest itself as noise or as alerts devoid of meaningful context.
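To make "deviation from an established operating norm" concrete, here is a minimal sketch in Python. It is an illustration only, not a description of Dyn's tooling: the function name, the window size, the threshold, and the sample latencies are all assumptions chosen for the example. Each new sample is compared against the mean and standard deviation of the samples just before it.

```python
from statistics import mean, stdev


def find_deviations(samples, window=5, threshold=3.0):
    """Return (index, value, baseline_mean) for samples far from the trailing baseline.

    `window` and `threshold` are illustrative defaults, not recommended values.
    """
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a perfectly flat baseline, where sigma is 0.
        limit = threshold * max(sigma, 1e-9)
        if abs(samples[i] - mu) > limit:
            alerts.append((i, samples[i], mu))
    return alerts


# Hypothetical DNS lookup latencies in milliseconds, with one obvious spike.
latencies_ms = [20, 22, 21, 19, 20, 21, 95, 20]
alerts = find_deviations(latencies_ms)
```

With these made-up numbers, only the spike at index 6 is flagged. Note that the spike then sits inside the baseline window for the next few samples and inflates it, which is one small example of how a naive rule can hide follow-on problems or generate noise.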

In the Network Operations Center (NOC) at Dyn, we rely on many tools to help navigate the landscape of supporting diverse, complex systems with varying operating norms. With these tools we can ascertain impact to our customers when unforeseen events occur. In addition, we can identify capacity needs to better serve our customers.

This brings us to a critical facet of monitoring: external service monitoring. An organization can invest considerable amounts of time implementing a monitoring solution that paints a portrait of expected operating norms by collecting metrics and alerts from internally focused monitoring infrastructure. Of course, it’s vital to know that a server went down or that a database partitioning function is taking an inordinate amount of time to complete. Unfortunately, these conditions may not fully illustrate the scope of deliverable services and their relative functionality and availability. Functional tests are great, but if reachability is an issue, it doesn’t matter how quickly your user interface (UI) is poised to respond or how many queries per second your DNS servers can answer. A critical component of incident response is gauging this impact quickly. Your decisions around the best course of action may stem from your customers’ perception of reality and not the conclusions you can draw exclusively from your own monitoring. It’s far better to be able to quickly draw your own conclusions about service delivery than it is to be told by a flooded call center that you missed something in that last deployment to your edge services.

Understanding variances in reachability and availability means understanding your transit and service delivery capabilities. These are critical components of defining normal. The internet can be chaotic under normal conditions: provider outages, BGP session flaps, datacenter outages, and DDoS attacks are normal occurrences in this space we all share and conduct our business in.

Our internet intelligence products provide us with a great illustration of transit congestion and reachability of our services from diverse markets. We leverage sensible alerting to reconcile maintenance activities and to understand the impact to our customers when we have occasional provider outages. These alerts help us reconcile our understanding of what normal looks like, and when deviations from it occur, they help us quantify the real impact within our incident response process.

Catchpoint provides Dyn with critical functional DNS tests. We use both their direct nameserver tests and their experience tests, which use recursive resolvers to simulate customer interactions with our products globally and in near real time. We rely on alerting and historical graphs from Catchpoint to indicate milestone events within incidents and also to gauge our performance and availability serving queries from various global vantage points.
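As a rough illustration of what a single external DNS check measures, the sketch below times a lookup through whatever recursive resolver the local machine is configured to use. This is not Catchpoint's API or methodology: it runs from one vantage point only, cached answers can make lookups look faster than they are, and the function names and the 0.5 second limit are assumptions made up for the example.

```python
import socket
import time


def time_lookup(hostname, resolve=socket.gethostbyname, clock=time.perf_counter):
    """Resolve `hostname` once and return (succeeded, elapsed_seconds)."""
    start = clock()
    try:
        resolve(hostname)
        ok = True
    except OSError:
        # socket.gaierror (resolution failure) is a subclass of OSError.
        ok = False
    return ok, clock() - start


def check_names(hostnames, max_seconds=0.5, resolve=socket.gethostbyname):
    """Return the hostnames that failed to resolve or resolved too slowly.

    `max_seconds` is an illustrative limit, not a recommended value.
    """
    failing = []
    for name in hostnames:
        ok, elapsed = time_lookup(name, resolve=resolve)
        if not ok or elapsed > max_seconds:
            failing.append(name)
    return failing


# Example call (performs real lookups through the system resolver):
#     check_names(["example.com", "example.org"])
```

The `resolve` parameter exists so the check can be exercised with a stand-in resolver. A real external monitoring service runs the equivalent of this from many networks at once, which is what makes the results representative of customer experience.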

Having this external visibility into our expected behavior, reconciled against internal metrics, helps Dyn’s Operations teams ensure that we are working on problems before they affect customers. In those incidents that do become service affecting, we are able to publish accurate and transparent status posts that illustrate the impact to our customers.
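One way to picture this reconciliation is as a small decision table that combines an internal alert state with the share of external probes that are failing. The sketch below is a hypothetical simplification, not Dyn's incident process: the category names, the function name, and the 20% ratio are assumptions for illustration.

```python
def classify_impact(internal_alerting, failed_probes, total_probes, impact_ratio=0.2):
    """Combine internal and external signals into a coarse impact category.

    `impact_ratio` is an illustrative cutoff for "enough external probes are
    failing that customers are likely affected".
    """
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    externally_degraded = failed_probes / total_probes >= impact_ratio
    if internal_alerting and externally_degraded:
        return "customer-impacting"
    if internal_alerting:
        # Something is wrong internally, but customers do not appear to see it yet.
        return "internal-only"
    if externally_degraded:
        # Customers see a problem that internal monitoring has not flagged,
        # for example a reachability issue outside our own infrastructure.
        return "external-only"
    return "normal"


state = classify_impact(internal_alerting=True, failed_probes=1, total_probes=40)
```

In this framing, the "internal-only" case is the chance to fix a problem before it becomes service affecting, and the "external-only" case is the one that internal monitoring alone would miss.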

We also use Catchpoint tests to gauge the global availability and performance of our API and websites by simulating customer experience. This reconciliation point has proven effective at gauging the actual impact of discrete degradations within complex systems that we wouldn’t expect to cause real impact.

For security incidents, including DDoS attacks, an incident response plan with a strong emphasis on the consumption of metrics from both internal and external sources creates an equalizer for our operations team in an evolving internet. Knowing how to prevent impact is ideal, and having a staff that knows how to react when incidents do occur can greatly reduce the overall impact to customers.

To hear more about this topic, check out my upcoming talk with my colleague, Phil Stanhope, Dyn’s VP, Technology Strategy, at Catchpoint Elevate in April.