Tech:Incidents/2016-10-18-Database

A search query crashed MariaDB on October 18, leaving it in a state where InnoDB corruption was present and MariaDB refused to restart. An attempt to restore the Bacula backups (two days old) failed. All wikis were completely inaccessible (not readable, not editable) for almost 25 hours, and the whole incident lasted 29 hours.

October 18

19:46 Southparkfan: tries to find out which databases are corrupt using "mysqlcheck --all-databases", but needs to get MariaDB running first
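
For reference, the check in question looks roughly like this; it needs a running mysqld to connect to, which was exactly the blocker here (credentials are illustrative):

 # Check every table in every database; requires a running server.
 mysqlcheck --all-databases -u root -p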

20:24 Southparkfan: rsyncing /srv/mariadb on db2 to another location on the same server, as a backup measure before trying to touch the actual databases
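
A minimal sketch of that on-host copy (the destination path here is an assumption):

 # Copy the datadir aside on db2 itself, preserving permissions and ownership.
 rsync -a /srv/mariadb/ /srv/mariadb-pre-recovery/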

21:10-21:41: John and Southparkfan issue a Bacula job to restore the most recent database backup (2 days old) on db1
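
The restore was driven through Bacula's bconsole; such a session might look roughly like this (the client name and restore path are assumptions):

 # Select the most recent backup set for db1 and restore it to a side path.
 bconsole
 * restore client=db1-fd where=/srv/restore select current all done
 # bconsole then shows the proposed job; confirm with "yes"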

22:09 Southparkfan: reports the rsync backup on db2 is complete

22:45: Bacula restore progress is 39%

October 19

01:05: Bacula restore is done

01:10 Southparkfan: tries to restart MariaDB on db1 with the Bacula backup, but MariaDB refuses to restart here as well

01:54 Southparkfan goes to sleep

03:58 NDKilla: set InnoDB force recovery to level 6 on db2 and MariaDB came alive (levels 3 through 5 were previously tested with no success)
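
For context, innodb_force_recovery is normally set in the server configuration before starting mysqld; level 6 is the last resort, bringing InnoDB up read-only and skipping redo-log recovery entirely so data can still be read out. A sketch (the config path may differ):

 # /etc/mysql/my.cnf
 [mysqld]
 # 0 = normal operation, 1-6 increasingly aggressive. Levels 4 and up
 # risk further damage; 6 skips the redo log and is for salvage only.
 innodb_force_recovery = 6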

03:58-05:02 NDKilla: restarted MariaDB on a non-standard port to prevent external connections, as he wasn't sure if the databases were in read-only mode. During this time NDKilla also started to dump databases one at a time to another directory on db2, but stopped at allthetropeswiki for disk space and time reasons.
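
A sketch of that setup; the port number, dump directory and credentials are illustrative:

 # Start MariaDB on a non-default port so the MediaWiki servers cannot connect.
 mysqld_safe --innodb-force-recovery=6 --port=3307 &

 # Dump one database at a time from the recovery instance
 # (--protocol=TCP forces the port to be used instead of the local socket).
 mysqldump --port=3307 --protocol=TCP -u root -p metawiki > /srv/dumps/metawiki.sql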

05:02 NDKilla: stopped MariaDB on db2

05:02: NDKilla sends an email to other staff members explaining what he has done and apologizing for the ridiculously high CPU usage, then goes to bed

11:01: NDKilla and Southparkfan are having a conversation about level 6 InnoDB recovery

11:29-11:57 Southparkfan: manually running mysqld_safe on db2 with level 6 InnoDB recovery, dumping metawiki on db2 and working on getting scp to work between db2 and db1
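
The transfer step, sketched with hypothetical paths and assuming SSH key access between the hosts:

 # On db2: push a finished dump over to db1.
 scp /srv/dumps/metawiki.sql db1:/srv/dumps/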

11:57 Southparkfan: moved the corrupt Bacula backup out of the way and re-installed MariaDB on db1

~11:58 Southparkfan: started importing the metawiki dump on db1
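
That import, sketched with the same hypothetical paths; a plain mysqldump of a single database does not include CREATE DATABASE, hence the explicit create:

 # On db1: create the empty database, then load the dump into it.
 mysqladmin -u root -p create metawiki
 mysql -u root -p metawiki < /srv/dumps/metawiki.sql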

~11:58 Southparkfan: changed DB server to db1 on mw1, depooled mw2

12:00: confirmed metawiki was back online

12:08 Southparkfan: committed the db1 patch, repooled mw2

12:00-12:30 Southparkfan: recovered several other wikis to verify that his method was working

13:09 Southparkfan: started the first recovery batch: dumped 267 wikis, transferred them to db1 and imported them all there

From this point on, Southparkfan was running batches (~200-250 wikis each) continuously.
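
A plausible shape for such a batch, run on db2 (the list file, paths and host names are assumptions, and credentials are assumed to live in ~/.my.cnf; this is a sketch, not the actual script used):

 #!/bin/bash
 # Dump each wiki from the force-recovery instance, ship it to db1
 # and import it there. All names below are illustrative.
 set -euo pipefail

 while read -r wiki; do
     # Dump from the read-only force-recovery instance on port 3307.
     mysqldump --port=3307 --protocol=TCP "$wiki" > "/srv/dumps/${wiki}.sql"

     # Ship the dump to db1 and import it into the fresh server there.
     scp "/srv/dumps/${wiki}.sql" db1:/srv/dumps/
     ssh -n db1 "mysql -e 'CREATE DATABASE IF NOT EXISTS ${wiki}' && mysql ${wiki} < /srv/dumps/${wiki}.sql"
 done < /srv/dumps/batch-1.list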

14:50: 505 wikis imported

14:55 Southparkfan: deleted Bacula backup from db1

16:41: 1168 wikis imported

17:50: all wikis (except All The Tropes) imported

18:00: Operations decided to continue operating in read-only mode; after ATT has been backed up to db1, we'll work on transferring db1's SQL files back to db2

18:55 Southparkfan: a compressed version of ATT's database dump has been transferred to db1

19:00-19:34: Southparkfan was not sure whether all databases had been imported; some valuable time was wasted here

The way to dump all wikis was to set innodb-force-recovery to 6. Southparkfan had been experimenting with that variable for a large part of the evening, but nobody had tried setting it higher than 3. After a 10.5-hour site outage, NDKilla finally found the way to get MariaDB running well enough to read out the non-corrupted data.

Two out of three members of the Operations team live in Europe (NL and UK), and the incident started in the evening. Southparkfan left at 4 AM local time (shortly before NDKilla found out how to dump all databases), and nine hours passed before he came back. During those nine hours there was only one system administrator online. John's absence was a few hours longer. While real-life obligations are unavoidable (so this is not a personal attack on anyone), it is unfortunate that so many hours went by without multiple Operations members being available.

We originally suspected a Bacula failure and a MariaDB bug, but after Southparkfan opened a bug report, it appears the data corruption happened before the incident and is the actual cause of both the crash and MariaDB's refusal to restart.

The above theory is supported by running the offending query after the db2 re-install: it ran without issues.