Overview

On 18 Jan 2019 at 12:16 PM AEDT (UTC+11), we started receiving reports from customers that they were seeing errors when loading the Buildkite dashboard. We began investigating right away, and our automated systems also started alerting us that the majority of our systems were down.

We quickly identified a storage problem with our main transactional database. We increased database storage capacity and waited for recovery to complete.

At 12:32 PM AEDT, Buildkite was back online and started serving requests as usual.

During this period, we had a major outage across all of our components. Agents continued to run any in-process work, but no new work was scheduled. Our webhook intake system was also down, which means we didn't process any GitHub webhooks during the outage.

Root Cause

As part of regular database maintenance, we occasionally use pg_repack to optimise our database storage and performance. It works by creating a new table, re-writing the data from the existing table into it (using triggers to keep the two tables in sync), and then switching the tables when finished. It does all this without any downtime, and is generally not noticeable. pg_repack is also a great alternative to VACUUM FULL, which takes an exclusive lock on the table until it completes and can take hours depending on the size of the table. Because pg_repack creates a copy of the table, it requires at least as much free storage space as the size of the table being repacked. We never repack a table unless we know it will fit within our free storage space, with a large margin left over for regular operations.
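The storage constraint above can be made explicit as a pre-flight check before kicking off a repack. This is a minimal sketch, not Buildkite's actual tooling: the `safe_to_repack` helper and the 2x safety factor are illustrative assumptions, and in practice the table size would come from Postgres's `pg_total_relation_size` and free space from the host or cloud provider's metrics.

```python
# Hypothetical safety factor: require free space of at least twice the
# table's size, because pg_repack writes a complete copy of the table.
SAFETY_FACTOR = 2.0

def safe_to_repack(table_size_bytes: int, free_bytes: int,
                   safety_factor: float = SAFETY_FACTOR) -> bool:
    """Return True only if a full copy of the table fits in free storage
    with a comfortable margin left for regular operations."""
    return free_bytes >= table_size_bytes * safety_factor

# Example: with 500 GB free, a 150 GB table passes the check,
# but a 400 GB table (which would need an 800 GB margin) does not.
GB = 10**9
assert safe_to_repack(150 * GB, 500 * GB)
assert not safe_to_repack(400 * GB, 500 * GB)
```

Forcing the table name and its measured size through a check like this also guards against the exact failure mode described below, where a similarly-named but much larger table is repacked by mistake.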

This morning we kicked off a process to repack one of our tables. But due to human error, instead of repacking the target table we knew was safe to repack within our available free storage space, we repacked a much larger, similarly-named table which exceeded our storage capacity. This caused a storage error and pushed our database into recovery until we increased storage space.

Our database's free storage space dropped from over half a terabyte to zero within a matter of minutes. Our monitoring was designed to alert on storage exhaustion caused by regular application activity, and didn't trigger soon enough because the drop happened so fast. When free space hit zero our database went into recovery, Buildkite started throwing errors, and we had a major system outage.
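One way to catch an abnormally fast drop like this is to alert on projected time-to-exhaustion rather than only on absolute free-space thresholds. A rough sketch of the idea, where the function names and the 30-minute response window are illustrative assumptions, not our actual monitoring configuration:

```python
def minutes_until_exhaustion(free_bytes_now: int, free_bytes_before: int,
                             interval_minutes: float) -> float:
    """Project how long until free storage hits zero, given two samples
    taken interval_minutes apart. Returns infinity if storage isn't shrinking."""
    consumed = free_bytes_before - free_bytes_now
    if consumed <= 0:
        return float("inf")
    rate_per_minute = consumed / interval_minutes
    return free_bytes_now / rate_per_minute

def should_alert(free_now: int, free_before: int, interval_minutes: float,
                 response_window_minutes: float = 30.0) -> bool:
    """Alert when projected exhaustion is closer than the time needed to
    respond, regardless of how much absolute free space remains."""
    projected = minutes_until_exhaustion(free_now, free_before, interval_minutes)
    return projected < response_window_minutes
```

With this approach, 400 GB free would still page someone if the last 5-minute sample showed 100 GB consumed, because at that rate exhaustion is only 20 minutes away.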

All systems are back online, agents are being assigned work as usual, and we're consistently serving 200s for dashboard requests. During this period, some GitHub Webhooks may have been lost.

If you're seeing issues with builds not being created in response to pushes to a GitHub Pull Request, try closing and re-opening the Pull Request, which should refresh Buildkite's internal cache of pull requests.

If you're still seeing problems (or any sort of problem, for that matter), please shoot us an email at hello@buildkite.com so we can help get it resolved.

Jan 18, 2019 - 13:08 AEDT

Monitoring

Everything is running smoothly again and agents should be receiving work as usual. We're keeping a very close eye on all our server monitoring to ensure all components are working normally.

Identified

We've identified an issue with our main transactional database that is causing errors across all components. The database is currently being restored, and we're monitoring the situation closely. Your agents will continue to run any existing work that doesn't interact with the Buildkite Agent API, but no new work is currently being scheduled.