GitHub October 21st outage RCA: How prioritizing ‘data integrity’ launched a series of unfortunate events that led to a day-long outage

Yesterday, GitHub posted the root-cause analysis of its outage of 21st October. The outage began shortly before 23:00 UTC on 21st October and left the site degraded until about 23:00 UTC on 22nd October.

Although the backend Git services stayed up and running during the outage, multiple internal systems were affected. Users were unable to log in or to submit Gists or bug reports, outdated files were served, and branches went missing. Moreover, GitHub couldn’t serve webhook events or build and publish GitHub Pages sites.

“At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation,” said the GitHub team.
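The 24 hours and 11 minutes quoted above can be checked against the first and last timestamps in the timeline further down (22:52 UTC on 21st October and 23:03 UTC on 22nd October). The snippet below is only a quick arithmetic sketch. The year 2018 is an assumption, since the article does not state it, and it does not affect the result.

```python
from datetime import datetime, timezone

# Start and end times taken from the timeline; the year is an assumption.
start = datetime(2018, 10, 21, 22, 52, tzinfo=timezone.utc)
end = datetime(2018, 10, 22, 23, 3, tzinfo=timezone.utc)

elapsed = end - start
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours and {minutes} minutes")  # 24 hours and 11 minutes
```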

GitHub uses MySQL to store its non-Git metadata. It operates multiple MySQL clusters of different sizes, each with up to dozens of read replicas. These clusters are how GitHub’s applications are able to provide pull requests and issues, manage authentication, coordinate background processing, and serve additional functionality beyond raw Git object storage.

For improved performance, GitHub applications direct writes to the relevant primary for each cluster, but delegate read requests to a subset of replica servers. Orchestrator is used to manage GitHub’s MySQL cluster topologies and to handle automated failover. It considers a number of factors during this process and is built on top of Raft for consensus. In some cases, Orchestrator can implement topologies that the applications are unable to support, which is why it is crucial to align Orchestrator configuration with application-level expectations.
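The write-to-primary, read-from-replica split described above can be illustrated with a minimal sketch. This is not GitHub’s code: the `ClusterRouter` class, the server names, and the rule that every non-SELECT statement counts as a write are all assumptions made for illustration.

```python
import itertools


class ClusterRouter:
    """Hypothetical router: writes go to the primary, reads rotate over replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas)

    def route(self, statement):
        # Simplification: anything that is not a plain SELECT is treated as a write.
        is_read = statement.lstrip().upper().startswith("SELECT")
        return next(self._replica_cycle) if is_read else self.primary


# Server names are invented for the example.
router = ClusterRouter("east-primary", ["east-replica-1", "east-replica-2"])
print(router.route("INSERT INTO issues VALUES (1)"))
print(router.route("SELECT * FROM issues"))
```

A router like this only works if the application and the topology manager agree on which server is the primary, which is the alignment concern raised above.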

Here’s a timeline of the events that began on 21st October and led to the outage:

22:52 UTC, 21st Oct

Orchestrator began a process of leadership deselection as per the Raft consensus. It then proceeded to organize the US West Coast database cluster topologies and, once connectivity was restored, write traffic was directed to the new primaries in the West Coast site.
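In Raft, a leader can only act while it can reach a strict majority of the group, which is why a node cut off by a network partition steps down. The sketch below shows that majority rule only. The node counts are invented, as the article does not say how many Orchestrator nodes GitHub runs or where they are located.

```python
def has_quorum(reachable_nodes, total_nodes):
    """Return True if the reachable nodes form a strict majority of the group."""
    return 2 * reachable_nodes > total_nodes


# Illustrative numbers only: a three-node group split by a partition.
total = 3
print(has_quorum(1, total))  # the isolated node cannot lead
print(has_quorum(2, total))  # the remaining two nodes can elect a leader
```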

The database servers in the US East Coast data center contained writes that had not been replicated to the US West Coast facility. As a result, the database clusters in each data center contained writes that were not present in the other. This is why the GitHub team was unable to safely fail the primaries back over to the US East Coast data center (failover being a procedure by which a system automatically transfers control to a duplicate system on detecting a failure).
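The reason a fail back was unsafe can be sketched with two sets of write identifiers. This is a simplified model under stated assumptions: the identifiers `w1` to `w5` and the helper `can_fail_back_safely` are hypothetical, and real MySQL replication tracks positions in a far more involved way.

```python
def can_fail_back_safely(east_writes, west_writes):
    """Failing back to East loses nothing only if East already holds every West write."""
    return west_writes <= east_writes


# Hypothetical write IDs: w3 never reached the West Coast before the partition,
# while w4 and w5 were accepted by the West Coast after the failover.
east = {"w1", "w2", "w3"}
west = {"w1", "w2", "w4", "w5"}

only_east = east - west
only_west = west - east

print(sorted(only_east), sorted(only_west))
print(can_fail_back_safely(east, west))
```

Because both differences are non-empty, promoting either side as-is would discard writes held only by the other.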

22:54 UTC, 21st Oct

GitHub’s internal monitoring systems began to generate alerts indicating that the systems were experiencing numerous faults. By 23:02 UTC, GitHub engineers had determined that the topologies for numerous database clusters were in an unexpected state. Querying the Orchestrator API showed a database replication topology that included only servers from the US West Coast data center.

23:07 UTC, 21st Oct

The responding team then manually locked the deployment tooling to prevent any additional changes from being introduced. At 23:09 UTC, the site was placed into yellow status. At 23:11 UTC, the incident coordinator changed the site status to red.

23:13 UTC, 21st Oct

As the issue had affected multiple clusters, additional engineers from GitHub’s database engineering team started investigating the current state. The aim was to determine what actions were needed to manually configure a US East Coast database as the primary for each cluster and rebuild the replication topology. This was difficult because the West Coast database cluster had by then ingested writes from GitHub’s application tier for nearly 40 minutes.

Engineers decided that the 30+ minutes of data written to the US West Coast data center prevented them from considering options other than failing forward, in order to keep user data safe. They therefore accepted a longer outage to ensure the consistency of users’ data.

23:19 UTC, 21st Oct

After querying the state of the database clusters, GitHub stopped running jobs that write metadata about things such as pushes. This led to partially degraded site usability, as webhook delivery and GitHub Pages builds had been paused.

“Our strategy was to prioritize data integrity over site usability and time to recovery,” said the GitHub team.

00:05 UTC, 22nd Oct

Engineers started resolving data inconsistencies and implementing failover procedures for MySQL. The recovery plan was to fail forward, synchronize, fail back, and then churn through backlogs before returning to green.

Restoring multiple terabytes of backup data took hours, because the large backup files had to be decompressed, checksummed, prepared, and loaded onto newly provisioned MySQL servers.

00:41 UTC, 22nd Oct

A backup process started for all affected MySQL clusters. Multiple teams of engineers started to investigate ways to speed up the transfer and recovery time.

06:51 UTC, 22nd Oct

Several clusters completed restoration from backups in the US East Coast data center and started replicating new data from the West Coast. This resulted in slow site load times for pages executing a write operation over a cross-country link.

The GitHub team identified ways to restore directly from the West Coast in order to overcome the throughput restrictions caused by downloading from off-site storage. The status page was updated to set an expectation of two hours as the estimated recovery time.

07:46 UTC, 22nd Oct

GitHub published a blog post to provide more information. “We apologize for the delay. We intended to send this communication out much sooner and will be ensuring we can publish updates in the future under these constraints”, said the GitHub team.

11:12 UTC, 22nd Oct

All database primaries were established in the US East Coast again. This resulted in the site becoming far more responsive, as writes were now directed to a database server located in the same physical data center as GitHub’s application tier. This improved performance substantially, but dozens of database read replicas still lagged behind the primary. These delayed replicas caused users to see inconsistent data on GitHub.

13:15 UTC, 22nd Oct

GitHub.com was approaching peak traffic load. Engineers had begun provisioning additional MySQL read replicas in the US East Coast public cloud earlier in the incident.

16:24 UTC, 22nd Oct

Once the replicas were in sync, a failover to the original topology was conducted. This addressed the immediate latency and availability concerns. To prioritize data integrity, the service status was kept red while GitHub began processing the accumulated backlog of data.

16:45 UTC, 22nd Oct

At this time, engineers had to balance the increased load represented by the backlog, which risked overloading GitHub’s ecosystem partners with notifications, against restoring services as quickly as possible. There were over five million hook events and 80 thousand Pages builds queued.

“As we re-enabled processing of this data, we processed ~200,000 webhook payloads that had outlived an internal TTL and were dropped. Upon discovering this, we paused that processing and pushed a change to increase that TTL for the time being”, mentions the GitHub team.
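The effect of a time-to-live (TTL) on a delayed backlog can be shown with a small sketch. It is not GitHub’s implementation: the `QueuedHook` type, the `process_backlog` function, and the 12-hour and 48-hour TTL values are assumptions, since the article does not state the actual TTL.

```python
from dataclasses import dataclass


@dataclass
class QueuedHook:
    hook_id: int
    enqueued_at: float  # seconds since an arbitrary reference point


def process_backlog(hooks, now, ttl_seconds):
    """Split queued hooks into those delivered and those dropped as expired."""
    delivered, dropped = [], []
    for hook in hooks:
        age = now - hook.enqueued_at
        if age > ttl_seconds:
            dropped.append(hook.hook_id)  # older than the TTL, so discarded
        else:
            delivered.append(hook.hook_id)
    return delivered, dropped


HOUR = 3600
backlog = [QueuedHook(1, 0), QueuedHook(2, 10 * HOUR), QueuedHook(3, 20 * HOUR)]
now = 24 * HOUR  # processing resumes a day after the first hook was queued

print(process_backlog(backlog, now, ttl_seconds=12 * HOUR))  # short TTL drops old hooks
print(process_backlog(backlog, now, ttl_seconds=48 * HOUR))  # raised TTL keeps them all
```

Raising the TTL before draining the queue, as in the second call, is the kind of change the quote describes.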

To avoid degrading the reliability of their status updates, GitHub remained in degraded status until the entire backlog of data had been processed.

23:03 UTC, 22nd Oct

At this point, all the pending webhooks and Pages builds had been processed. The integrity and proper operation of all systems had also been confirmed. The site status was updated to green.

Apart from this, GitHub has identified a number of technical initiatives and continues to work through an extensive post-incident analysis process internally.

“All of us at GitHub would like to sincerely apologize for the impact this caused to each and every one of you. We’re aware of the trust you place in GitHub and take pride in building resilient systems that enable our platform to remain highly available. With this incident, we failed you, and we are deeply sorry”, said the GitHub team.