Tuesday, May 13, 2014

Troubleshooting Database Corruption

When your database gets corrupted, one of the most important things to do is figure out why that happened, so that you can try to ensure that it doesn't happen again. After all, there's little point in going to a lot of trouble to restore a corrupt database from backup, or in attempting to repair the damage, if it's just going to get corrupted again. However, there are times when root cause analysis must take a back seat to getting your database back on line.

In general, if you have a working Hot Standby to which you can fail over, or if you have a recent backup you can recover, you shouldn't bother trying to recover the corrupted database. Just fail over or restore from backup. The reason for this is that it's very difficult to determine the nature and extent of the damage to your original database. You can run pg_dumpall (or, better, pg_dumpall -g plus pg_dump -Fc in each individual database) to verify that all of your data can be read, but how do you know that none of that data has been corrupted? Also, what if pg_dumpall itself fails? Plus, even if everything looks OK today, it's hard to be sure that there aren't lurking problems that will bite you in the future. For example, pg_dump will happily dump a table with a duplicate value in a supposedly-unique index; you won't necessarily know about the problem until you try to reload the dump. Even if you successfully reload the dump on another machine, it doesn't guarantee that there aren't subtler problems on the original machine, like an all-visible bit set on a data page containing a dead tuple, that won't cause surprising inconsistencies at some possibly quite distant point in the future.

But sometimes you have no choice. Perhaps you don't have a Hot Standby server (you really should!) or the corruption event affects both machines. Perhaps your data changes too quickly to consider restoring from a backup, or you do restore from backup but still want to extract what you can from the original, now corrupt master. In that case, it's a very good idea to use pg_dumpall and/or pg_dump to dump the corrupted database and restore it into a newly-initialized cluster on known-good hardware. The data you extracted from the old cluster may not be intact, but at least by dumping and restoring it you have a good chance of finding any obvious problems, like duplicated "unique" keys, and you also free yourself of any low-level corruption that doesn't show up clearly at the SQL level. You might do something like this:

pg_catcheck checks all of your system catalogs for references to nonexistent objects. In the above case, the problem is a reference to a schema that does not exist. PostgreSQL uses numeric identifiers called OIDs to identify schemas and many other kinds of SQL objects. In the above example, a table, function, or some other object is identified as being located in the schema with OID 559701082, but in fact no such schema exists. You could issue queries against each system catalog individually, trying to figure out which ones contains a reference to that OID; and then you could issue further queries to figure out exactly which row of that table contains the bad reference. After fixing the issue, you might rerun pg_dump and find that there's another, similar problem; and you won't really know how many more things remain to be fixed until pg_dump finally works. Alternatively, you could run pg_catcheck, and it will tell you about all problems of this type throughout the system.

(For those familiar with the structure of the PostgreSQL system catalogs, a few technical details: pg_catcheck verifies that every oid column that acts like a foreign key points to a matching row on the referring side. It also checks oidvector and oid[] columns, as well as references contained in pg_depend, pg_shdepend, pg_description, pg_shdescription, pg_seclabel, and pg_shseclabel using class OID/object OID notation. And, it catches a few other problems, too, like pg_class.relnatts failing to match the number of rows actually present in pg_attribute. Also, it does all of this using only sequential scans, so that it won't be fooled by corrupt system catalog indexes. You could check all of this by hand, but running pg_catcheck is a lot faster.)

pg_catcheck is not a general corruption detector; for that, you should use initdb -k to create a database cluster with checksums enabled. It also won't provide advice on how to recover from system catalog corruption; for that, you should consult an expert. But it's a useful aid in getting your database back to the point where it can be successfully dumped.

That's a good question. It's hard to know. I could count the number of times that such problems are reported to pgsql-bugs, or to other PostgreSQL mailing lists, or to EnterpriseDB's support team. But I have no way of counting the number of people who didn't have a problem, which is much larger. Perhaps someone who has run a large number of PostgreSQL instances for a long time could provide aggregated statistics, but I don't have access to any such information.

In terms of how many people are using checksums, I suspect that it's relatively few, just because it isn't the default. However, I don't have any statistics to back it up.

The inability to enable checksums on pg_upgrade also limits the number of users. Many of those who might be most interested in it can't easily enable it because they'd have to initdb, dump and reload, and most bigger sites don't tend to be keen on that.

I see corruption reported quite frequently on Stack Overflow. With the exception of an obvious spate of corruption reports linked to the nasty 9.3.x (x<4) multixact replication/recovery bugs, these mostly seem to be corruption by administrator action, power failure with cheap SSDS, disk failure, incorrect basebackup procedures for failover, etc.

I used to manage 300 to 400 small postgres instance (from 1gb to 15gb) and, for what I remember, I only had 2 or 3 problems in 7 years. The first one was human related: a system backup has been restored on a running instance of postgresql and the second was a faulty raid-controller. The reputation of robustness is not a legend!

Two months ago I recover a damaged cluster from a customer. The damage occurred at 'pg_type' catalog then we didn't able to run any kind of query, even a simple 'SELECT 1'.What I did:

0) before all I did a "physical backup" of course ;-)1) create a new fresh cluster (initdb)2) create an empty database with the same structure (ddl) that the damaged3) stop all clusters (damaged and new)4) copy the 'pg_type' datafile from this new cluster to the damaged (overwrite the damaged)5) start the damaged cluster in a single-user mode6) reindex system7) start the damaged cluster and run a pg_dump... and worked!! \o/

During the restore we discover some unique violations as you mentioned and we fixed it. After all the customer did some checks (business data) and all was ok. The lost data doesn't was significant.

This is one of my 'horror stories' with PostgreSQL. ;-)

After this incident the customer discovered how important are things like backup, archiving, standby, etc... :-)

A really great step forward. Now if there was a download ready binary / RPM perhaps ? While compiling from source is not a problem, many database end users will unfortunately only allow legitimate EnterpriseDB created binaries on their servers...