SUMMARY

15:43 UTC | 08:43 PT We have seen improvement in the performance issues affecting Talk and Help Center/Guide on Pod 14. We are still investigating other issues.

14:55 UTC | 07:55 PT We are currently experiencing performance issues on Pod 14. Updates to follow.

POST-MORTEM

Several issues were reported involving services being unable to connect to various endpoints. The reported error messages showed that DNS name resolution was failing or taking an excessive amount of time, causing timeouts and failed connections. The CPUs of the dnscache hosts were overwhelmed by the volume of DNS requests combined with verbose logging. We reduced log verbosity and restarted the DNS service, after which services recovered. To prevent this from happening again, we have reduced log verbosity in the DNS cache, and we will optimize DNS performance and update monitoring for CPU load and DNS resolution.
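
As an illustration of the kind of DNS resolution monitoring mentioned above, the following is a minimal Python sketch of a probe that times a single lookup and flags it as failed or slow. It is not the monitoring we run; the function name time_dns_lookup, the 0.5 second threshold, and the injectable resolver parameter are all assumptions made for this example.

```python
import socket
import time


def time_dns_lookup(hostname, resolver=socket.getaddrinfo, slow_threshold_s=0.5):
    """Time one DNS lookup and report whether it failed or was slow.

    `resolver` defaults to the standard library's socket.getaddrinfo and is
    injectable so the probe can be exercised without network access.
    `slow_threshold_s` is an assumed value, not a figure from the incident.
    """
    start = time.monotonic()
    try:
        resolver(hostname, None)
        ok = True
    except socket.gaierror:
        # Resolution failed outright (for example, the resolver is unreachable).
        ok = False
    elapsed = time.monotonic() - start
    return {
        "hostname": hostname,
        "ok": ok,
        "elapsed_s": elapsed,
        "slow": elapsed > slow_threshold_s,
    }
```

A scheduler could call this probe periodically and alert when results come back failed or slow several times in a row, alongside a separate alert on resolver CPU load.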

During the incident windows, the web widget would not have loaded for any user visiting a page with the widget embedded. Users who already had a page with an embedded widget loaded may have been unable to submit tickets. In response to increased load in Pod 14, AWS scaled up the instances underlying the ELB for several host groups. When load dropped, these instances were scaled down again. Our proxy configuration did not handle this addition and removal of instances, because it currently updates its view of upstream hosts only at load time. This led to traffic being directed to hosts that no longer existed until the proxies were reloaded. Reloading the proxy software forced a rediscovery of upstream hosts, which led to recovery. To prevent this from happening again, we will update the proxy configuration to use upstream variables, address embeddables scaling, and fix embeddables app errors when writing to the DB.
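
The failure mode described above, a host list resolved once at load time and never refreshed, can be contrasted with periodic rediscovery in a short Python sketch. This is not our proxy configuration. The class name UpstreamPool, the 30 second refresh interval, and the discover callback are hypothetical and serve only to show the general idea.

```python
import time


class UpstreamPool:
    """Caches a list of upstream hosts and re-discovers it after a TTL.

    `discover` is any zero-argument callable returning the current hosts
    (for example, a DNS lookup of the load balancer name). `ttl_s` is an
    assumed refresh interval. `clock` is injectable for testing.
    """

    def __init__(self, discover, ttl_s=30.0, clock=time.monotonic):
        self._discover = discover
        self._ttl_s = ttl_s
        self._clock = clock
        self._hosts = []
        self._expires_at = None  # None means nothing has been discovered yet

    def hosts(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Refresh the view of upstream hosts instead of trusting the
            # list captured when the process started.
            self._hosts = list(self._discover())
            self._expires_at = now + self._ttl_s
        return list(self._hosts)
```

With this approach, a host removed by a scale-down event drops out of the pool within one refresh interval rather than staying in rotation until a manual reload.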

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.