Apologies if this is a duplicate, but my original post stalled, and I
noticed I had omitted the PostgreSQL version, which you will want.
I'm reporting this as a PostgreSQL bug because it involves index
corruption, and I can't see any other way our application could have
corrupted an index. I will attach the tail of the log from when the
corruption was detected (and the postmaster shut itself down), as well
as the subsequent attempt to start. Fortunately we run our web site off
of a farm of four database servers, so we are taking one of the others
out of the mix, stopping its postmaster, and copying its data directory
over to this machine for recovery; we don't need advice on that aspect
of things, but we'd like to do what we can to help track down the cause
and prevent a recurrence. We have renamed the failing data directory to
make room for recovery at the normal location, but otherwise its
structure is unmodified.
For context, this is running on Windows 2003 Server on an eight-CPU
Xeon box (no hyperthreading) with 6 GB RAM and a 13-drive 15,000 RPM
RAID 5 array behind a battery-backed controller for everything. This
database is about 180 GB with about 300 tables. We are running 8.1.3
modified with a patch we have submitted
(pending review last I saw) to implement the standard_conforming_strings
TODO. We have autovacuum running every ten seconds because of a few
very small tables with very high update rates, and we have a scheduled
VACUUM ANALYZE VERBOSE every night. It appears that last night's vacuum
found the problem, which the previous night's vacuum didn't. We had
some event which started at 14:25 yesterday and persisted until we
restarted the middle tier at 15:04. The symptom was that a fraction of
the queries which normally run in a few ms were timing out on a 20
second limit. pg_locks showed no blocking. We've been getting episodes
with these symptoms occasionally, but they have only lasted a minute or
two; this duration was unusual. We haven't identified a cause. One odd
thing is that, given the number of queries per second that we run, the
number of timeouts during an episode is too small to support the notion
that _all_ similar queries are failing.
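For reference, the blocking check was a pg_locks self-join along these
lines (a sketch only; the column names are from the 8.1 pg_locks view,
and the exact query we ran may have differed):

```sql
-- Pair any backend waiting on a lock with the backend(s) holding a
-- granted lock on the same relation.  pg_locks showed no blocking,
-- i.e. a query like this came back empty during the episode.
SELECT waiter.pid      AS waiting_pid,
       holder.pid      AS holding_pid,
       waiter.relation AS relation_oid,
       waiter.mode     AS requested_mode,
       holder.mode     AS held_mode
  FROM pg_locks waiter
  JOIN pg_locks holder
    ON holder.relation = waiter.relation
   AND holder.granted
 WHERE NOT waiter.granted;
```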
How best to proceed?
-Kevin