Postgres et cetera

Tuple Flensing I: Treating Postgres Corruption

Tue, 2013-08-27 01:25:00 PDT

Part One: Egress

While Postgres is an incredibly robust and defensively-written system,
occasionally one does run into problems with data corruption. This can
happen due to hardware issues, filesystem or other OS bugs, and
sometimes, yes, even Postgres bugs. However, even when data corruption
does occur, Postgres will generally contain the damage to specific
chunks of data, and manual intervention will let you recover anything
that was not affected.

This is the story of lessons learned from treating an extensive case
of such corruption in the course of my job with the Heroku Postgres
Department of Data. This post describes recognizing the problems and
getting the data out of a corrupt system; the next details restoring it.

All data and metadata is mocked out; no customer data was used in the making of this post.

Preface

Josh Williams has a fantastic post
about treating corruption; Josh’s post and tips from my colleagues
(and Postgres contributors) Daniel Farina
and Peter Geoghegan were instrumental
in helping me wrap my head around all the nuts and bolts relevant here.

Now before we dig in, note that the easiest recovery is one you don’t need
to perform at all. So how can you get a pristine copy of your database? Try your
standby. There’s
a chance that the corruption only affects the primary, because it
never made it into the WAL stream and was not present when the
replica’s base backup was taken. Don’t have a standby, but have a
base backup
and the required WAL segments? Try making a fresh replica (this can
take some time when the base backup is old, but it’s better than the
alternative). You can try a fresh replica in parallel with other
recovery efforts. Validate these with pg_dump, as discussed below,
and then run a dump and restore (again, see below) to ensure that the
corruption is quarantined. Seriously,
don’t delude yourself into thinking it’s going to be okay.

This is by far the easiest and safest mechanism for treating
corruption. It’s not a panacea, but we’ve found it to be unreasonably
effective.

In my case, the existing standby and fresh standbys all failed and
extensive manual recovery was required, but at least it makes for an
interesting story.

If you do need to treat corruption manually, there are a few things
you should keep in mind:

Avoid writes: Postgres is in a precarious state; don’t tempt fate
by asking it to do more than necessary. In particular, this is a good
point to turn off autovacuum.

Stop Postgres and make a filesystem copy of the full data
directory, ideally to another disk if you suspect the hardware. The
recovery process will involve clobbering data; you want to be
able to start over or cross-reference with the original state of
the system if you run into problems.

Your final step in a corruption recovery should almost always be a
dump and restore via pg_dump. This is the only way to ensure that
you’re actually running on a sane system. This can take many hours
for a large database, but if you skip this you might be back at
square one before too long.

The zero_damaged_pages
setting may be an easier way to recover. It doesn’t always work, but
it basically does what is outlined here for you.

This ain’t Kansas anymore: this is effectively Byzantine failure
territory. The advice here might work or everything might go
horribly wrong. There are no fool-proof step-by-step
instructions. You’re on your own. Furthermore, what’s outlined here is a
general-purpose, fairly coarse approach to recovery. If you need to
take a more precise approach or just need someone to blame,
consider hiring a Postgres consultancy.

A bad sign

In this particular case, the problem was almost certainly caused by a
filesystem or hardware issue. Postgres keeps track of the state of
your database via three main sets of data: the heap, where all the
physical data resides; the c(ommit)log, which tracks which
transactions have been committed and which haven’t; and the
write-ahead log (WAL), which does all the bookkeeping to ensure
crash-safety and is a key mechanism in Postgres replication.

After initially noticing the corruption, I poked at the affected files
with hexdump, and noticed that one of the clog files, which should
contain nothing but opaque bitmap data, had some odd, text-like
artifacts in it.
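
Here's a rough mock-up of the difference (per the note above, all
output here is fabricated; a healthy file first, then the tail of the
damaged one):

$ hexdump -C pg_clog/0990 | head -3
00000000  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000010  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000020  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|

$ hexdump -C pg_clog/099C | tail -4
0003ffc0  55 55 55 55 55 55 55 55  6e 67 69 6e 78 2f 61 63  |UUUUUUUUnginx/ac|
0003ffd0  63 65 73 73 2e 6c 6f 67  2e 31 2e 67 7a 00 00 00  |cess.log.1.gz...|
0003ffe0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0003fff0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|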

Note the ASCII view of the last few lines. If you see anything
human-readable in the clog bitmaps, you should either start buying
lottery tickets, or something very bad has happened.

I originally checked the file after seeing an unexpected error message
that led me to file a Postgres bug.
(The Postgres community, by the way, takes data integrity issues very
seriously and addresses them quickly, so don’t hesitate to file a bug
if you think you’re running into a legitimate Postgres problem.)

Trouble with transactions

The above happened on a replica, and checking the master revealed a
different—but just as serious—problem in the logs:

PG::Error: ERROR:  could not access status of transaction 2616655872
DETAIL:  Could not open file "pg_clog/09BF": No such file or directory.

The clog files are bitmaps that live in the pg_clog subdirectory of
your Postgres data directory and track which transactions have been
committed and which have aborted.

You can take a look in the subdirectory and see the various clog
files. If you don’t know where your data directory is, you can run the
following Postgres command (this must be run as a Postgres superuser,
e.g. postgres in most installations):
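
For example (paths, names, and numbers here and throughout are mocked
up):

postgres=# SHOW data_directory;
        data_directory
------------------------------
 /var/lib/postgresql/9.2/main
(1 row)

$ ls /var/lib/postgresql/9.2/main/pg_clog
0000  0001  0002  0003  ...  099A  099B  099C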

These are sequential, and end way before the 09BF file that the
error message complained about, so that’s not a great sign. So what
can we do? A reasonable assumption might be that everything
referencing transactions in that suspect clog file committed
(actually, that’s a terrible assumption, but it’s easy to fake
transaction metadata like that, and that’s the only way to see the
effects of these alleged transactions, so let’s roll with it). In
order to gin up a hunky-dory clog file, where everything committed, we
can use the trusty dd tool. If you’re not familiar with dd, you
can think of it as a bit scalpel for file surgery:
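
In our case, the invocation looked roughly like this (09BF is the
segment from the error message above, and 256K is the standard clog
segment size, 32 pages of 8kB each):

$ dd if=/dev/zero of=pg_clog/09BF bs=256K count=1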

Here, of is the output clog file you need to generate (dd has a
somewhat unconventional way to pass arguments). Everything else (block
size, block count, and input file—the handy /dev/zero special file)
can stay the same. Note that you’ll probably need to run with sudo as
whatever user owns the other clog files (typically postgres).

Checking sanity

So does this fix our problem? One way to check is to re-run the query
that triggered the error in the first place. A more thorough check is
to just run pg_dump to /dev/null:

$ pg_dump --verbose >/dev/null my_db

This works because generally, corruption manifests as either
references to bogus transactions (which can be addressed as above), or
some sort of failure when attempting to read a specific set of data
from disk. Postgres organizes data on disk in pages, normally 8192
bytes each. When things get corrupted, problems tend to be
page-aligned—either a page will be fine, or it's wholly
suspect. Failure to read a page indicates that either the Postgres
page headers are corrupt, or that the data is mangled in such a way as
to cause errors in the per-datatype recv deserialization functions.

Occasionally, a page will be corrupt in such a way as to crash
Postgres when one attempts to read it. This is problematic for
recovery, but not a showstopper—if we can figure out what pages are
causing crashes, we can surgically excise them.

In either case, pg_dump attempts to read every single live page of
data, so if you make it through that, you’re on your way to a
functioning system. Note, however, that this does not validate
the integrity of your indexes: another good reason to always go
through a dump and restore.

Note also that the most pernicious aspects of corruption occur when
none of the above trigger: when bits get shuffled around, but in such
a way as to still represent valid data (just not the data you put
there). Unfortunately, prior to Postgres 9.3’s
checksums
feature, there’s not much you can do about this. Fortunately, this is
also the least likely problem, as corruption tends to cause chaos and
very few datatypes are resilient to bit scrambling (numeric types
being a notable exception).

Pages behaving badly

In this case, pg_dump still failed with the same clog error a few more
times. After applying the same remedy to several other clog files, I
eventually got to a different error:
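
It looked something like this (the table and numbers are mocked up;
let's pretend we run a sandwich shop):

pg_dump: Dumping the contents of table "sandwiches" failed: PQgetCopyData() failed.
pg_dump: Error message from server: ERROR:  invalid memory alloc request size 2685403251
pg_dump: The command was: COPY public.sandwiches (id, filling, topping) TO stdout;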

While Postgres supports pretty big rows
and fields, its internal memory management is limited to working with chunks
up to 1GB.
This request for 2.5GB seems shady. It’s likely the metadata about the
size of the value stored on this page is corrupted.

Without significantly more digging, we have to give up on this data as
lost and nuke the page. There are a few other errors you may see that
likely point to corruption:
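
Some typical examples (the specific numbers and file names here are
illustrative, not from this incident):

ERROR:  invalid page header in block 1226710 of relation base/16385/16443
ERROR:  could not read block 982439 in file "base/16385/16443.7": read only 0 of 8192 bytes
ERROR:  missing chunk number 0 for toast value 1396223 in pg_toast_16443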

So how do we track down the offending page? Fortunately, all rows in
Postgres have a hidden column called ctid that is effectively the
“physical” row id. If you have the ctid, you can figure out where
the page lives on disk. You can project the ctid just like any other
column:
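
For example, with our mocked-up sandwiches table:

SELECT ctid, id, filling FROM sandwiches LIMIT 3;

 ctid  | id | filling
-------+----+---------
 (0,1) |  1 | turkey
 (0,2) |  2 | ham
 (0,3) |  3 | tempeh
(3 rows)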

The ctid values you see are a (page, offset) pair. Chances are,
projecting just the ctid from an affected table is not going to
trigger the memory allocation error we saw above (because Postgres
doesn’t have to read any of the affected data).

With that, you can go row by row, projecting every value to see if it
causes problems. Doing this manually can take forever on a good-sized
table, so I wrote a PL/pgSQL function to do it (the FETCH_COUNT trick
from Josh’s post could also have helped, but I didn’t think of that
at the time):
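
Here's a sketch of that function (a reconstruction for this post; the
find_bad_rows name and the exact messages are incidental):

CREATE OR REPLACE FUNCTION find_bad_rows(tablename regclass)
RETURNS SETOF tid AS $$
DECLARE
  curr_tid record;
  col record;
BEGIN
  -- walk every live row in the table
  FOR curr_tid IN EXECUTE format('SELECT ctid FROM %s', tablename) LOOP
    -- and project its columns one at a time, so a single mangled
    -- value does not keep us from reading its neighbors
    FOR col IN
      SELECT attname FROM pg_attribute
      WHERE attrelid = tablename AND attnum > 0 AND NOT attisdropped
    LOOP
      BEGIN
        EXECUTE format('SELECT %I FROM %s WHERE ctid = $1',
                       col.attname, tablename)
          USING curr_tid.ctid;
      EXCEPTION WHEN OTHERS THEN
        RAISE WARNING 'error in row %, column %: %',
          curr_tid.ctid, col.attname, SQLERRM;
        RETURN NEXT curr_tid.ctid;
      END;
    END LOOP;
  END LOOP;
  RETURN;
END
$$ LANGUAGE plpgsql;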

This looks at all the ctids in the target table, and for each one,
tries to project the values in every column for that row. The columns
are projected one at a time, so that if one exhibits a problem and we
have to purge the row, we can still check the valid ones and save
that data beforehand.

This works for small and medium-sized tables, but takes forever on
anything sizable. I tried to optimize it first by doing a binary
search for bad tuples across ctids, and then by taking advantage of
the fact that corruption is—as mentioned above—typically
page-oriented. In the end, neither of these panned out (for reasons
I’ll get to below), but I learned some interesting PL/pgSQL. In any
case, this approach was serviceable for a while and did help me make
progress. The output looks something like this:
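
Continuing with the sketch above and our mocked-up table:

SELECT find_bad_rows('sandwiches');

WARNING:  error in row (6,10), column filling: invalid memory alloc request size 2685403251
 find_bad_rows
---------------
 (6,10)
(1 row)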

Trouble on page 6! At this point, you can try to project the other
columns from the affected rows (and any potentially unaffected rows)
so you can save whatever data is still recoverable.

Once we find an affected page, we need to figure out where it lives on
disk. Again, we find what we’re looking for in the data directory.
The base subdirectory contains all the “actual” data (as opposed to
metadata) in your Postgres instance, so this is where we look. The
first level looks like this:
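
$ ls /var/lib/postgresql/9.2/main/base
1  12035  12040  16385

Each of those directories corresponds to a database, named by the
database's oid (the pg_database catalog maps oids to database names);
ours is the mocked-up 16385.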

To find the right table, you might assume you can use the oid again (I
certainly did), but this won’t always work. The pg_class system
catalog table can show you the actual file you need (adjust the namespace
as needed):
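
Something like this (output mocked up to match our example):

SELECT relfilenode
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname = 'public' AND c.relname = 'sandwiches';

 relfilenode
-------------
       16443
(1 row)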

So we have the subdirectory for the database, the file for the table,
and the page and offset from the ctid. Now we need to do some bit
surgery, just as with the clog files above. This is where dd’s
“addressing” feature (via its blocksize and seek parameters) comes
in really handy: dd makes it easy to treat a file as a sequence
of contiguous chunks, and to overwrite just one of these chunks.
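
To zap page 6 of our sandwiches table, for instance (using the
database directory and relfilenode from above, relative to the data
directory):

$ dd if=/dev/zero of=base/16385/16443 bs=8192 count=1 seek=6 conv=notrunc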

Here, bs=8192 indicates the Postgres page size, count=1 means we
overwrite just one page, and seek=6 means we want to seek to the 6th
8192-byte page. Again, we use /dev/zero as a source. The very
important conv=notrunc means don’t truncate the file; just write to
the middle. Remarkably, overwriting a page like this in Postgres is enough
to wipe it out with no ill effects. That’s a fantastic property for what
is effectively a massive binary serialization format.

In any case, doing these one-by-one can get tedious (and
error-prone). If you find a lot of these up front, you can even script
the dd invocations:
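
For example (a sketch, assuming a file bad_pages.txt holding one bad
page number per line):

for page in $(cat bad_pages.txt); do
  dd if=/dev/zero of=base/16385/16443 bs=8192 count=1 \
    seek="$page" conv=notrunc
done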

Be extremely careful with this, though—it’s easy to fat-finger your
way to wiping out large swathes of your database.

Trouble in purgatory

After a while of making progress like this, I ran into a problem that
I couldn’t solve with PL/pgSQL:

LOG:  server process (PID 23544) was terminated by signal 11
LOG:  terminating any other active server processes
FATAL:  the database system is in recovery mode
LOG:  all server processes terminated; reinitializing

Postgres is written defensively, but if its data is scrambled badly
enough, it will still occasionally wig out. With luck, it will recover in
a few seconds and you can continue treating the corruption, but
subsequent attempts to read the affected pages are likely to cause
repeat crashes. This means that any PL/pgSQL functions that rely on
being able to scan a whole table can’t really cope with server crashes.

Bash (and shell scripting in general) gets a bad rap—in many cases,
deservedly so—but it’s still a fantastic way to glue together other
programs. It seemed like an appropriate tool here:
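
What follows is a reconstruction rather than the original script, but
it captures the approach (database and table names are mocked as
before):

#!/bin/bash
# Scan a table page by page, tolerating backend crashes along the way.
DB=my_db
TABLE=sandwiches

wait_for_postgres() {
  # If we just crashed the server, wait for it to finish recovery.
  until psql -c 'SELECT 1' "$DB" >/dev/null 2>&1; do
    echo 'waiting for Postgres to come back...' >&2
    sleep 5
  done
}

last_page=''
for tid in $(psql -Atc "SELECT ctid FROM $TABLE" "$DB"); do
  # ctids look like '(page,offset)'; we only care about the page.
  page=${tid#(}
  page=${page%%,*}
  # Skip ctids until we get to the next page.
  [ "$page" = "$last_page" ] && continue
  last_page=$page
  # Dump all data on this page via a TID scan. MaxHeapTuplesPerPage
  # is 291 for 8kB pages, so this covers every possible offset.
  if ! psql -Atc "SELECT * FROM $TABLE WHERE ctid = ANY (ARRAY(
           SELECT ('($page,' || i || ')')::tid
           FROM generate_series(1, 291) i))" "$DB" \
       > "page_${page}.out" 2>/dev/null; then
    echo "trouble on page $page" >&2
    wait_for_postgres
  fi
done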

This selects all ctids from a table and then iterates over them.
For each, it dumps all data on a page, and then skips ctids until
the next page (yes, it would probably be faster to query for only
distinct pages and iterate over just that; this was fast enough).
This prints errors on failed pages (I no longer cared about individual
columns at this point; these few pages were a mess and I couldn’t get
anything out of them), but the really nice part is that if Postgres
crashes, the script just prints an error and waits for the server to
come back up. This
lets you scan a full table even if Postgres crashes on every page,
and you can apply the page-zapping technique above.

Victory (sort of)

After a few more rounds of dealing with a potpourri of the issues above,
I was eventually able to clear out all the errors and get pg_dump to
complete successfully. Only a few dozen pages were affected, but
sorting through everything took hours and was very mentally demanding:
this is where any abstraction provided by your database breaks down,
and you need to start thinking of your data in a completely different
way.

I’ve also elided some of the more frustrating inconsistencies and
gotchas in dealing with a system in this state. Occasionally corrupt
pages that I had cleared would come back (autovacuum and other
connections were off, so I’m still not sure what caused these),
and in one case a manual COPY worked even though pg_dump’s invocation
of the same command failed. Be prepared for weirdness.

Overall, though, Postgres held up admirably: it’s very impressive how
well it does in the face of serious data corruption.

Unfortunately, after all this, the dump I got would still not restore
cleanly. I’ll go over the (much simpler) steps I had to take to get
things back into Postgres in the next post.
