Coinbase Traffic and Scaling: December 6th-7th Post Mortem

Over the past week, Coinbase has continued to invest significant resources into scaling our backend systems to provide a reliable customer experience during periods of high traffic.

Despite these efforts to scale, the continued rise of legitimate user traffic led to periods of slowness and downtime over the course of December 6th and December 7th. The chart below shows the relative change in Coinbase traffic from our incident last week to the peak traffic period this week.

December 6th

On December 6th, Coinbase continued to experience high traffic as a result of large Bitcoin price movements. At 12:22 PST, one of our primary MongoDB clusters began to experience degraded performance, leading to elevated response times across all API endpoints. We attributed this slowness to saturation of the storage engine's concurrency mechanisms as a result of poor disk performance.
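The post mortem doesn't name the exact metric, but in MongoDB's WiredTiger storage engine this kind of concurrency saturation typically shows up as exhausted read/write "tickets" in the `serverStatus` output. The sketch below is purely illustrative (the sample numbers are invented, and in a live deployment the document would come from `db.adminCommand({serverStatus: 1})`); it shows how ticket utilization can be computed from a `serverStatus`-shaped document:

```python
# Sketch: estimating WiredTiger concurrency-ticket saturation from a
# serverStatus-style document. Sample numbers are illustrative only.

def ticket_utilization(server_status):
    """Return (read_util, write_util) as fractions of configured tickets."""
    ct = server_status["wiredTiger"]["concurrentTransactions"]

    def util(section):
        out = section["out"]                # tickets currently checked out
        total = out + section["available"]  # total configured tickets
        return out / total

    return util(ct["read"]), util(ct["write"])

# Illustrative snapshot: every write ticket is in use, so new writers
# queue behind the storage engine and all DB-backed endpoints slow down.
sample = {
    "wiredTiger": {
        "concurrentTransactions": {
            "read": {"out": 96, "available": 32},
            "write": {"out": 128, "available": 0},
        }
    }
}

read_util, write_util = ticket_utilization(sample)
print(read_util, write_util)  # 0.75 1.0
```

When slow disks make each operation hold its ticket longer, utilization pins at 1.0 and latency climbs across the board, which matches the behavior described above.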

Earlier in the day, we had begun the process of upgrading the hardware of several of our most important databases (including this saturated MongoDB cluster) as part of our ongoing efforts to scale Coinbase's backend databases. At 13:13 PST, we successfully completed the process of upgrading this database to an instance type with significantly faster instance storage.

By 13:25 PST, services were fully restored. We are continuing the process of upgrading all of our databases to prepare for further periods of high traffic.

December 7th

On December 7th, starting at 07:10 PST, Coinbase began to experience extreme levels of traffic as a result of rapid Bitcoin price movements.

Starting at 07:20 PST this traffic began to bottleneck behind one of our primary database clusters, even though that cluster had been upgraded on December 6th. This resulted in elevated response times and periods of downtime. Over the course of the next several hours, we took several steps to reduce web requests and better cache queries to this cluster, yielding small improvements in usability.
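The post doesn't describe the caching layer in detail; as one common shape for the "better cache queries" mitigation, a short-TTL cache in front of hot read queries lets repeated requests be served without each one hitting the saturated cluster. The names below (`TTLCache`, `fetch_price`) are hypothetical, not Coinbase's actual code:

```python
# Sketch of a TTL cache in front of a hot database query, so that a burst
# of identical read requests results in a single trip to the database.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry timestamp)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]                     # serve from cache
        value = compute()                     # fall through to the database
        self.store[key] = (value, now + self.ttl)
        return value

calls = 0
def fetch_price():
    """Stand-in for an expensive query against the overloaded cluster."""
    global calls
    calls += 1
    return 14000

cache = TTLCache(ttl_seconds=5)
for _ in range(1000):
    cache.get_or_compute("BTC-USD", fetch_price)

print(calls)  # 1 -- 1000 requests, one database query
```

The trade-off is staleness: a 5-second TTL means price reads can lag by up to 5 seconds, which is why this kind of caching yields usability improvements rather than a full fix when write traffic is also saturating the cluster.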

At 13:55 PST, it became apparent that the cluster was suffering from a bottleneck in memory management, which led us to alter lower-level configuration to better utilize the upgraded resources. This yielded a significant improvement in query performance under the high concurrency we had been experiencing. By 14:00 PST, services were fully restored.
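The post doesn't say which setting was changed. As a purely illustrative example of this class of fix: MongoDB's WiredTiger engine sizes its internal cache to roughly half of available RAM by default, so after an instance-type upgrade the cache does not automatically grow to use the new memory unless it is resized explicitly:

```yaml
# mongod.conf fragment (illustrative example, not the actual change).
# cacheSizeGB is a real mongod option; the value here is hypothetical.
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 96   # tune to the upgraded instance's memory
```

Defaults tuned for the old hardware are a common reason an upgraded database still bottlenecks, which is consistent with the "better utilize the upgraded resources" description above.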