Postgres et cetera

Tuple Flensing II: Treating Postgres Corruption

Mon, 2013-09-02 10:21:00 PDT

Part Two: Ingress

This is the second part of the saga of treating a nasty case of
Postgres corruption. This post discusses reloading a set of data that was
wrenched from the jaws of oblivion with careful manual intervention
(read all about that in Part I).

Integrity constraints for fun and profit

If your recently-generated dump from a corrupt database does not
restore correctly, the first thing you should do is thank your lucky
stars. Every constraint check that fails when restoring is one you
don’t have to deal with later, in the application, where it’s much
harder to track down, and probably much harder to reason about.

Unfortunately, Postgres does not process database integrity
constraints when generating a dump (to do so would be handy but would
involve significant overhead). They’re only checked when
restoring. But in an immediate dump/restore situation like corruption
recovery, this is not a big deal.

Postgres integrity constraints
probably can’t represent all the data semantics in your application,
but features like NOT NULL, foreign keys, enumerated types,
unique constraints, and general CHECK constraints can help you avoid
shooting yourself in the foot by forbidding some nonsensical data from
entering your system. Sometimes, integrity constraints can be a pain to
set up and update, especially for a fast-moving schema, but consider
looking back at parts of your system where the dust has settled and
adding these as sanity checks there.

The first thing that hit me in this case was the perenial favorite, NULL.

What do we want? NULL semantics! When do we want ‘em? False!

NULL semantics in SQL are a perennial source of bugs (Jeff Davis has
a great post
about their gotchas and inconsistencies). What bit me here was more pedestrian:

Yeah, good call, Postgres: that doesn’t look quite right. It’s
probably safe to say that anything with this much NULL is
nonsense. Unfortunately, we can’t delete those rows directly in the
existing dump file, and going back to face the corrupt database again
for a fresh dump does not sound like fun. So let’s gin up a surrogate
table, drop constraints on that, and temporarily swap it in for the
original table:

I chose to do it by just copying all the data to the original table
since it was fairly small. You could probably also just add the
original constraints to the surrogate table instead, but this gets
dicier if you have foreign key or other constraints to maintain.

Like a snowflake

A primary key guarantees uniqueness of a piece of data. It ensures
that there is one canonical piece of data describing a user, comment,
or other entity in your system.

In Postgres, this is enforced by an index, and when restoring data,
that index is only built after the data has been loaded (this is more
efficient, as the index does not have to be updated individually for
each row).

In this case, the index would not rebuild because there were several
entries for certain primary key values (this is the next error from
the original restore; it’ll keep going and you’ll get these all at
once):

Fortunately, this table had an updated_at column, so I simply copied
all duplicates to a surrogate table, assuming that newer was legit
(not always true, in the case of rolled-back transactions, but I had
nothing better to go on):

I had to resolve the same primary key issue on another table, but
after resolving that, the restore was finally completed.

Victory (for real)

At this point, the original data was back in a sane Postgres system,
running smoothly. Chances are, there was still a lot of work to do in
resolving data model issues at the application level (basically,
ensuring any semantic constraints at the application level that were
not represented with database constraints). Luckily, I was free of
that, since I was not the application owner. I handed off the restored
database and explained the duplicate row issues and the potential
outstanding issues with invalid data. I did not get any further
requests or complaints, so I hope the remaining data was in reasonably
good shape.

Overall, this whole process, while rather stressful and somewhat
terrifying, was an incredible learning experience. Thanks to
knowledgeable colleagues, relevant posts archived in the Postgres
mailing lists, some other blogs posts about corruption, and Postgres’
impressive robustness, I was able to save most of the affected
database. I hope the notes in these two posts come in useful should
others be in the unfortunate position of dealing with corruption
issues like these.

About Me

I'm a jack-of-all-trades software engineer, working with Heroku's Department of Data
to provide Postgres as a service.
Besides SQL, I currently write Ruby, Go, more bash than I care to admit, and a touch of
Python here and there.