Summary

On December 11, 2017, from 19:10 UTC to 21:45 UTC, we experienced a degradation in our ability to process events. As a result, incident creation, and therefore notifications, were delayed for some customers. A subset of the delayed events was also dropped.

What Happened?

We received a number of events that caused heavy concurrent resource contention in one of our downstream services. As a result, our event processing pipeline began halting periodically in short bursts.

Once we identified the problem, we isolated these events from the processing pipeline to relieve the pressure of the event backlog. Unfortunately, a number of other events that were unrelated to the cause of the issue were placed into the same isolation. When we decided to drop the isolated events, the benign events were dropped along with the problematic ones.

What Are We Doing About This?

We will be improving our tooling so that we can isolate and fail only the specific events that are problematic. In the longer term, we will also be working on replacing our downstream services and removing the bottleneck that lowered our processing throughput.

We would like to reiterate our regret for the service interruption. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Dec 18, 2017 - 21:17 UTC

Resolved

All systems are operating normally at this time. Please reach out to support@pagerduty.com should you have any further questions.

Posted Dec 11, 2017 - 21:27 UTC

Update

New events are being processed normally. There is a queue of older events that we are working on processing. All other systems are functioning normally.

Posted Dec 11, 2017 - 21:12 UTC

Update

We are still investigating event processing delays. All other services are unaffected.

Posted Dec 11, 2017 - 20:49 UTC

Investigating

Event processing is currently experiencing delays. All other services are unaffected.