GitLab database incident write-up

This reads like a disaster porn travelogue. Favourite statements of admin guilt include:

Removed a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account (causing high DB load). This was communicated with the infrastructure and support team.

YP adjusts max_connections to 2000 from 8000, PostgreSQL starts again (despite 8000 having been used for almost a year)

db2.cluster still refuses to replicate, though it no longer complains about connections; instead it just hangs there not doing anything

At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.

YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

2017/01/31 23:27 YP - terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left

Sid: try to undelete files?

CW: Not possible! rm -Rvf

Sid: OK

YP: PostgreSQL doesn't keep all files open at all times, so that wouldn't work. Also, Azure is apparently also really good in removing data quickly, but not at sending it over to replicas. In other words, the data can't be recovered from the disk itself.

2017/02/01 23:00 - 00:00: The decision is made to restore data from db1.staging.gitlab.com to db1.cluster.gitlab.com (production). While 6 hours old and without webhooks, it’s the only available snapshot. YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN.

Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging (red production, yellow staging)
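The PS1 idea above is easy to sketch. Here's a hypothetical `~/.bashrc` snippet; the hostname patterns are assumptions based on the machine names in the write-up, not GitLab's actual configuration:

```shell
# Map a hostname to an ANSI colour code: red for production,
# yellow for staging, no colour for anything else.
prompt_colour() {
    case "$1" in
        *.cluster.gitlab.com) echo '1;31' ;;  # red: production
        *.staging.gitlab.com) echo '1;33' ;;  # yellow: staging
        *)                    echo ''    ;;  # unknown host: leave prompt plain
    esac
}

c="$(prompt_colour "$(hostname -f)")"
if [ -n "$c" ]; then
    # \[ and \] tell readline the escape sequences are zero-width.
    PS1="\[\e[${c}m\]\u@\h \w \$ \[\e[0m\]"
else
    PS1='\u@\h \w \$ '
fi
```

Cheap as it is, a blood-red prompt on production is exactly the kind of cue that matters at hour 12 of an incident.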

Somehow disallow rm -rf for the PostgreSQL data directory?

Figure out why PostgreSQL suddenly had problems with max_connections being set to 8000, despite it having been set to that since 2016-05-13. A large portion of frustration arose because of this suddenly becoming a problem.

Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
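The failure mode SH describes, wrong binaries chosen because `data/PG_VERSION` is absent, is exactly the kind of thing a pre-flight check can catch. A minimal sketch, assuming a conventional data-directory layout (the paths and expected version here are illustrative, not GitLab's actual setup):

```shell
# Refuse to proceed unless the data directory declares the PostgreSQL
# version we expect. A missing PG_VERSION file is what let omnibus fall
# back to the 9.2 binaries and fail silently.
check_pg_version() {
    data_dir="$1"
    expected="$2"
    if [ ! -f "$data_dir/PG_VERSION" ]; then
        echo "ALERT: $data_dir/PG_VERSION is missing; wrong binaries may be picked" >&2
        return 1
    fi
    actual="$(cat "$data_dir/PG_VERSION")"
    if [ "$actual" != "$expected" ]; then
        echo "ALERT: data directory is version $actual, expected $expected" >&2
        return 1
    fi
    return 0
}

# Hypothetical usage: only dump if the check passes.
# check_pg_version /var/opt/gitlab/postgresql/data 9.6 && pg_dump mydb > dump.sql
```

The point is turning a silent failure into a loud one before the dump runs, not after you need the dump.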

The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost

The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented

Our backups to S3 apparently don’t work either: the bucket is empty

So in other words, out of the five backup/replication techniques deployed, none was working reliably or even set up in the first place.

I actually sympathize with YP and the GitLab crew here. Where the personnel involved could have attempted to shift blame, squelch discussion, or otherwise deny the awfulness of this issue, they acted instead with honesty, humility, and humour. Self-effacement this extreme takes big brass balls.

Also, that writeup is pure meme gold. This is the ongoing horror show that just keeps giving.

I generally pitch it as: sure, they're paid to do it, and they're probably good at it, but if their service goes down, or we go down, I can get you back up and running with a physical off-site backup, while theirs are just floating out there somewhere.

I work for a very small company with some servers in the cloud...and even I know that for every backup I need some system that tells me automatically whether the backup is actually happening in the expected way. It's amazing they didn't apply any common sense to this.

Well, even a simple script that tells you that you aren't writing an empty backup, and that the last backup happened within the expected timeframe...this is pretty simple and it would have been useful to them. I agree on the restore part; sadly, that can't be automated.
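That "simple script" really is simple. A minimal sketch, assuming backups land as `*.sql.gz` files in one directory (the path, naming, and thresholds are assumptions for illustration):

```shell
# Alert if the newest backup is missing, suspiciously small, or stale.
# Exits non-zero on any problem so cron/monitoring can pick it up.
check_backup() {
    dir="$1"
    latest="$(ls -t "$dir"/*.sql.gz 2>/dev/null | head -n 1)"
    if [ -z "$latest" ]; then
        echo "ALERT: no backups found in $dir" >&2
        return 1
    fi
    # An empty or few-byte dump means the backup silently failed,
    # exactly the symptom JN reported.
    if [ "$(stat -c %s "$latest")" -lt 1024 ]; then
        echo "ALERT: $latest is under 1 KiB; backup probably failed" >&2
        return 1
    fi
    # -mmin +1440: the file is more than 24 hours old.
    if [ -n "$(find "$latest" -mmin +1440)" ]; then
        echo "ALERT: $latest is older than 24 hours" >&2
        return 1
    fi
    echo "OK: $latest"
}
```

Run it from cron and route the alert somewhere a human actually reads; a check that fails silently is no better than a backup that fails silently.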

I'm sure they used all their common sense, but common sense usually tells you "someone else surely took care of this". Running a reliable high-tech business requires plenty of uncommon sense and paranoia.

That's the problem with the DevOps approach. "Backup? Yeah, we do something there I think. But let's rather focus on bringing new awesome features in, quickly! It will be awesome!"

LVM snapshots are not backups, the same way as RAID is not a backup.

Also, trying to prevent this from happening again by disallowing "rm -rf"? Yeah, right. They should implement proper processes for the admins so that they don't do stuff like this after having worked 12+ hours already. Or at least make it so that nobody can be logged in to production and staging at the same time. But don't disallow random commands in the shell.

I see this all the time, and then the customer calls us, at 10PM, after they screwed up and we have to fix their stuff. "I just wanted to quickly fix $foo" is the #1 reason shit like this happens.

I hope they can resolve it somehow with not too much data loss (and without this YP guy having to commit seppuku or something ;-) but I'm pretty sure that after this they won't neglect their backups anymore.