What happened

Between May 19, 2020 22:47 PST and May 20, 2020 02:51 PST, we experienced an incident impacting several video services.

API calls to create new sessions and archive requests failed intermittently on both Enterprise and Standard environments. Other services made unavailable by this incident included Broadcast, SIP, and Session Monitoring, as well as the logging service for the Account Portal and Developer Tools (including Inspector and Playground).

Most services were recovered at 00:58 PST. The logging service for the Account Portal and Developer Tools was recovered at 02:51 PST.

Ongoing sessions and sessions created before the incident began were not impacted, provided they required no further API calls.

Causes

One of the servers hosting the non-persistent database, which is used to provide Video API services, experienced a kernel error. As a result, the server triggered an automatic failover to two master databases, which is the expected behavior. While one of the masters became operational, the other experienced a configuration problem and remained unavailable.

As a result, all services requiring interaction with the impacted server were affected. Services interacting with the master that became operational were set up successfully and were not affected.

Most services were recovered after a migration from the current Data Center provider. The Account Portal and Developer Tools were fully recovered after the servers hosting these services were able to access the new database nodes.

Preventive Actions

Improve monitoring of our non-persistent database to make alerts more informative.

Improve the post-deployment tests for the non-persistent database to proactively prevent issues.

Upgrade the tools and processes used to perform routine procedures and operations for the database, once the migration to a new cloud provider is completed.

Review and improve the internal escalation processes to reduce the time to announce incidents to customers.

Posted May 22, 2020 - 05:48 PDT

Resolved

This incident is now resolved. Please contact support@tokbox.com if you have any questions.

Posted May 21, 2020 - 00:30 PDT

Update

The pending development and account management tools should now be reestablished as well and are under monitoring. For any questions in the meantime, please reach out to us at support@tokbox.com.

Posted May 20, 2020 - 03:50 PDT

Update

API, services, and connectivity should be reestablished now and are under monitoring. Development and account management tools are being reestablished. Customers should avoid using sessions that failed to establish during the incident. For any questions in the meantime, please reach out to us at support@tokbox.com.

Posted May 20, 2020 - 02:38 PDT

Update

Issues with the non-persistent database cluster were detected at 2020-05-19T22:47:00 PST and were addressed at 2020-05-20T00:58:00 PST. API, services, and connectivity should be reestablished now and are under monitoring. Development and account management tools are being reestablished. For any questions in the meantime, please reach out to us at support@tokbox.com.

Posted May 20, 2020 - 02:07 PDT

Monitoring

The issue has been identified as related to a problem in the non-persistent database cluster. Remediation has been put in place and the reliability of our API, services, and connectivity should be reestablished now. For any questions in the meantime, please reach out to us at support@tokbox.com.

Posted May 20, 2020 - 01:45 PDT

Investigating

We are observing some issues in our backend service which may affect the reliability of our API and could affect connectivity. Our Engineering team is looking into this issue and will update you soon. Meanwhile, you can reach out to us at support@tokbox.com.