Problem description & Impact

On Thursday, August 3, 2017, at approximately 4:55am PDT, Okta experienced a performance issue in US Cell 2 in which end-users and administrators encountered sporadic slowness while attempting to access any interactive user-pages within their Okta tenant. During the event, an average of 10% of incoming requests to US Cell 2 experienced latency until 8:05am, with peaks of up to 40% of requests affected between 4:55am and 6:30am PDT and between 7:55am and 8:05am PDT.

On Friday, August 4, 2017, Okta experienced a similar issue in US Cell 2 beginning at 5:30am PDT, resulting in a similar user experience. During this event, an average of 20% of incoming requests to US Cell 2 experienced intermittent latency until 6:00am, with a peak of 25% of requests.

Root Cause

Okta has identified the root cause of the performance issues as a recently deployed performance optimization, which was intended to improve the speed of processing certain endpoint requests. The optimization had the unintended side-effect of using more database CPU resources than expected during periods of high load in US Cell 2, which resulted in slow system responsiveness during these peak load times.

The performance optimization had been thoroughly tested and did not exhibit performance issues during development/testing or subsequent deployment to Okta’s Preview or Production Cell environments. However, the peak traffic pattern in US Cell 2 surfaced the unexpected performance profile.

Mitigating Steps & Corrective Actions

On Thursday, August 3rd, at approximately 4:57am PDT, and Friday, August 4th, at approximately 5:30am PDT, Okta’s monitoring identified and alerted on a spike in database CPU utilization and corresponding interactive user session latency. Okta began mitigation steps by routing traffic away from the affected server-tier and adding web application server capacity to handle the increased load until it subsided. The process and performance learnings gained during the August 3rd event directly improved Okta’s ability to respond more quickly during the performance issue on August 4th and to minimize the risk of future occurrences.

Following the events of August 3rd & 4th, Okta has taken the following actions to prevent further performance issues and improve our response to similar issues in the future.

Okta tested and implemented a new method and process for routing requests to an additional server-tier during high server utilization scenarios. This greatly improves our capacity and ability to respond quickly to dynamic load changes and ensure service reliability for our customers.

Okta has identified and optimized several database queries which were contributing factors to these performance issues.

Okta has deployed a code fix across all Preview and Production Cells to resolve the problematic optimization which was the root cause of the performance issues observed in US Cell 2.

Okta continues to investigate other dimensions of the performance issues and will continue to invest in new opportunities for improvement to ensure we are providing the most stable and reliable service for our customers.
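
As an illustration only, the first corrective action above (routing requests to an additional server-tier during high utilization) can be pictured as a simple threshold rule. The sketch below is not Okta's implementation: the `Tier` type, the `choose_tier` function, and the 80% threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """A hypothetical server-tier with its current load."""
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0


def choose_tier(primary: Tier, overflow: Tier, threshold: float = 0.8) -> Tier:
    """Pick the tier that should receive the next request.

    Requests stay on the primary tier until its utilization reaches the
    (assumed) threshold. Above that, they spill over to the additional
    tier, but only if that tier is less loaded than the primary.
    """
    if primary.utilization >= threshold and overflow.utilization < primary.utilization:
        return overflow
    return primary
```

Under these assumptions, a primary tier at 90% utilization with an additional tier at 20% would send the next request to the additional tier, while a primary tier at 50% would keep handling its own traffic.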
