If we call pg_ctl stop, the server might continue and thus
hold a log file for a short time after it has deleted its pid file,
(which is when pg_ctl will exit), and so a subsequent attempt to
open the log file might fail.
We therefore try to open it a few times, sleeping one second between
tries, to give the server time to exit.
This corrects an error that was observed on the buildfarm.
Backpatched to 9.2,

When the startup process restores a WAL file from the archive, it deletes
any old file with the same name and renames the new file in its place. On
Windows, however, when a file is deleted, it still lingers as long as a
process holds a file handle open on it. With cascading replication, a
walsender process can hold the old file open, so the rename() in the startup
process would fail. To fix that, rename the old file to a temporary name, to
make the original file name available for reuse, before deleting the old
file.

pg_upgrade opened the output from pg_dumpall in text mode and
wrote the split files in text mode. This caused unwanted eating
of intended carriage returns on input and production of spurious
carriage returns on output. To avoid this, open all these files
in binary mode. On non-Windows platforms, this change has no
effect.
Backpatch to 9.0. On 9.0 and 9.1, we also switch from redirecting
pg_dumpall's output to using pg_dumpall's -f switch, for the same
reason.

Perl, for some unaccountable reason, believes it's a good idea to reset
SIGFPE handling to SIG_IGN. Which wouldn't be a good idea even if it
worked; but on some platforms (Linux at least) it doesn't work at all,
instead resulting in forced process termination if the signal occurs.
Given the lack of other complaints, it seems safe to assume that Perl
never actually provokes SIGFPE and so there is no value in the setting
anyway. Hence, reset it to our normal handler after initializing Perl.
Report, analysis and patch by Andres Freund.

The cascading replication code assumed that the current RecoveryTargetTLI
never changes, but that's not true with recovery_target_timeline='latest'.
The obvious upshot of that is that RecoveryTargetTLI in shared memory needs
to be protected by a lock. A less obvious consequence is that when a
cascading standby is connected, and the standby switches to a new target
timeline after scanning the archive, it will continue to stream WAL to the
cascading standby, but from a wrong file, ie. the file of the previous
timeline. For example, if the standby is currently streaming from the middle
of file 000000010000000000000005, and the timeline changes, the standby
will continue to stream from that file. However, the WAL on the new
timeline is in file 000000020000000000000005, so the standby sends garbage
from 000000010000000000000005 to the cascading standby, instead of the
correct WAL from file 000000020000000000000005.
This also fixes a related bug where a partial WAL segment is restored from
the archive and streamed to a cascading standby. The code assumed that when
a WAL segment is copied from the archive, it can immediately be fully
streamed to a cascading standby. However, if the segment is only partially
filled, ie. has the right size, but only N first bytes contain valid WAL,
that's not safe. That can happen if a partial WAL segment is manually copied
to the archive, or if a partial WAL segment is archived because a server is
started up on a new timeline within that segment. The cascading standby will
get confused if the WAL it received is not valid, and will get stuck until
it's restarted. This patch fixes that problem by not allowing WAL restored
from the archive to be streamed to a cascading standby until it's been
replayed, and thus validated.

Serializable Snapshot Isolation used for serializable transactions
depends on acquiring SIRead locks on all heap relation tuples which
are used to generate the query result, so that a later delete or
update of any of the tuples can flag a read-write conflict between
transactions. This is normally handled in heapam.c, with tuple level
locking. Since an index-only scan avoids heap access in many cases,
building the result from the index tuple, the necessary predicate
locks were not being acquired for all tuples in an index-only scan.
To prevent problems with tuple IDs which are vacuumed and re-used
while the transaction still matters, the xmin of the tuple is part of
the tag for the tuple lock. Since xmin is not available to the
index-only scan for result rows generated from the index tuples, it
is not possible to acquire a tuple-level predicate lock in such
cases, in spite of having the tid. If we went to the heap to get the
xmin value, it would no longer be an index-only scan. Rather than
prohibit index-only scans under serializable transaction isolation,
we acquire an SIRead lock on the page containing the tuple, when it
was not necessary to visit the heap for other reasons.
Backpatch to 9.2.
Kevin Grittner and Tom Lane

In pg_upgrade, pull the port number from postmaster.pid, like we do for socket location. Also, prevent putting the socket in the current directory for pre-9.1 servers in live check and non-live check mode, because pre-9.1 pg_ctl -w can’t handle it.

pg_upgrade produces a platform-specific script to remove the old
directory, but on Windows it has not been making sure that the
paths it writes as arguments for rmdir and del use the backslash
path separator, which will cause these scripts to fail.
The fix is backpatched to Release 9.0.

This syncs contrib/pg_upgrade in the 9.2 branch with HEAD, except for the
HEAD changes related to converting XLogRecPtr to 64-bit int. It includes
back-patching these commits:
666d494d19dbd5dc7a177709a2f7069913f8ab89
pg_upgrade: abstract out copying of files from old cluster to new
7afa8bed65ea925208f128048f3a528a64e1319a
pg_upgrade: Run the created scripts in the test suite
ab577e63faf792593ca728625a8ef0b1dfaf7500
Remove analyze_new_cluster.sh on make clean, too
34c02044ed7e7defde5a853b26dcd806c872d974
Fix thinko in comment
088c065ce8e405fafbfa966937184ece9defcf20
pg_upgrade: Fix exec_prog API to be less flaky
f763b77193b04eba03a1f4ce46df34dc0348419e
Fix pg_upgrade to cope with non-default unix_socket_directory scenarios.

Formerly it would only show them for relkinds 'r' and 'f' (plain tables
and foreign tables). However, as of 9.2, views can also have reloptions,
namely security_barrier. The relkind restriction seems pointless and
not at all future-proof, so just print reloptions whenever there are any.
In passing, make some cosmetic improvements to the code that pulls the
"tableinfo" fields out of the PGresult.
Noted and patched by Dean Rasheed, with adjustment for all relkinds by me.

This was removed in commit cd004067742ee16ee63e55abfb4acbd5f09fbaab,
we're not quite sure why, but there have been reports of crashes due
to AS Perl being built with it when we are not, and it certainly
seems like the right thing to do. There is still some uncertainty
as to why it sometimes fails and sometimes doesn't.
Original patch from Owais Khani, substantially reworked and
extended by Andrew Dunstan.

We previously supposed that any given platform would supply both or neither
of these functions, so that one configure test would be sufficient. It now
appears that at least on AIX this is not the case ... which is likely an
AIX bug, but nonetheless we need to cope with it. So use separate tests.
Per bug #6758; thanks to Andrew Hastie for doing the followup testing
needed to confirm what was happening.
Backpatch to 9.1, where we began using these functions.

This back-ports commits c8ba697a4bdb934f0c51424c654e8db6133ea255 and
e5db11c5582b469c04a11f217a0f32c827da5dd7, which fix one definite and one
speculative bug in gistchoose, and make the code a lot more intelligible as
well. In 9.2 only, this also affects the largely-copied-and-pasted logic
in gistRelocateBuildBuffersOnSplit.
The impact of the bugs was that the functions might make poor decisions
as to which index tree branch to push a new entry down into, resulting in
GiST index bloat and poor performance. The fixes rectify these decisions
for future insertions, but a REINDEX would be needed to clean up any
existing index bloat.
Alexander Korotkov, Robert Haas, Tom Lane

The existing documentation in Linux Memory Overcommit seemed to
assume that PostgreSQL itself could never be the problem, or at
least it didn't tell you what to do about it.
Per discussion with Craig Ringer and Kevin Grittner.

This is so that these sections will have stable HTML tags that one can
link to, rather than things like "AEN1902". Perhaps we should mount a
campaign to do this everywhere, but I've found myself pointing at
syntax.sgml subsections often enough to be sure it's useful here.

This threw ERROR, not the expected NOTICE, if the index didn't exist.
The bug was actually visible in not-as-expected regression test output,
so somebody wasn't paying too close attention in commit
8cb53654dbdb4c386369eb988062d0bbb6de725e.
Per report from Brendan Byrd.

The GUC check hooks for transaction_read_only and transaction_isolation
tried to check RecoveryInProgress(), so as to disallow setting read/write
mode or serializable isolation level (respectively) in hot standby
sessions. However, GUC check hooks can be called in many situations where
we're not connected to shared memory at all, resulting in a crash in
RecoveryInProgress(). Among other cases, this results in EXEC_BACKEND
builds crashing during child process start if default_transaction_isolation
is serializable, as reported by Heikki Linnakangas. Protect those calls
by silently allowing any setting when not inside a transaction; which is
okay anyway since these GUCs are always reset at start of transaction.
Also, add a check to GetSerializableTransactionSnapshot() to complain
if we are in hot standby. We need that check despite the one in
check_XactIsoLevel() because default_transaction_isolation could be
serializable. We don't want to complain any sooner than this in such
cases, since that would prevent running transactions at all in such a
state; but a transaction can be run, if SET TRANSACTION ISOLATION is done
before setting a snapshot. Per report some months ago from Robert Haas.
Back-patch to 9.1, since these problems were introduced by the SSI patch.
Kevin Grittner and Tom Lane, with ideas from Heikki Linnakangas

If we revoke a grant option from some role X, but X still holds the option
via another grant, we should not recursively revoke the privilege from
role(s) Y that X had granted it to. This was supposedly fixed as one
aspect of commit 4b2dafcc0b1a579ef5daaa2728223006d1ff98e9, but I must not
have tested it, because in fact that code never worked: it forgot to shift
the grant-option bits back over when masking the bits being revoked.
Per bug #6728 from Daniel German. Back-patch to all active branches,
since this has been wrong since 8.0.

As of 9.2, constraint exclusion should work okay with prepared statements:
the planner will try custom plans with actual values of the parameters,
and observe that they are a lot cheaper than the generic plan, and thus
never fall back to using the generic plan. Noted by Tatsuhito Kasahara.

The docs claimed that this mode only waits for the standby to receive WAL
data, but actually it waits for the data to be written out to the standby's
OS; which is a pretty significant difference because it removes the risk of
crash of the walreceiver process.

If a view has circular dependencies, pg_dump splits it into a CREATE TABLE
and a CREATE RULE command to break the dependency loop. However, if the
view has reloptions, those options cannot be applied in the CREATE TABLE
command, because views and tables have different allowed reloptions so
CREATE TABLE would reject them. Instead apply the reloptions after the
CREATE RULE, using ALTER VIEW SET.

We had put a test for libxml2's xmlStructuredErrorContext variable in
configure, but of course that doesn't work on Windows builds. The next
best alternative seems to be to test the LIBXML_VERSION symbol provided
by xmlversion.h.
Per report from Talha Bin Rizwan, though this fixes it in a different way
than his proposed patch.

In the initial cut at the "parameterized paths" feature, I'd simplified
create_index_paths() to the point where it would only generate a single
parameterized bitmap path per relation. Experimentation with an example
supplied by Josh Berkus convinces me that that's not good enough: we really
need to consider a bitmap path for each possible outer relation. Otherwise
we have regressions relative to pre-9.2 versions, in which the planner
picks a plain indexscan where it should have used a bitmap scan in queries
involving three or more tables. Indeed, after fixing this, several queries
in the regression tests show improved plans as a result of using bitmap not
plain indexscans.

We use a hash table to track the parents of inner pages, but when inserting
to a leaf page, the caller of gistbufferinginserttuples() must pass a
correct block number of the leaf's parent page. Before gistProcessItup()
descends to a child page, it checks if the downlink needs to be adjusted to
accommodate the new tuple, and updates the downlink if necessary. However,
updating the downlink might require splitting the page, which might move the
downlink to a page to the right. gistProcessItup() doesn't realize that, so
when it descends to the leaf page, it might pass an out-of-date parent block
number as a result. Fix that by returning the block a tuple was inserted to
from gistbufferinginserttuples().
This fixes the bug reported by Zdeněk Jílovec.

The previous coding essentially assumed that nodes would be rescanned in
the same order they were initialized in; or at least that the "leader" of
a group of CTEscans would be rescanned before any others were required to
execute. Unfortunately, that isn't even a little bit true. It's possible
to devise queries in which the leader isn't rescanned until other CTEscans
on the same CTE have run to completion, or even in which the leader never
gets a rescan call at all.
The fix makes the leader specially responsible only for initial creation
and final destruction of the tuplestore; rescan resets are now a
symmetrically shared responsibility. This means that we might reset the
tuplestore multiple times when restarting a plan subtree containing
multiple CTEscans; but resetting an already-empty tuplestore is cheap
enough that that doesn't seem like a problem.
Per report from Adam Mackler; the new regression test cases are based on
his example query.
Back-patch to 8.4 where CTE scans were introduced.

This situation creates a dependency loop that confuses pg_dump and probably
other things. Moreover, since the mental model is that the extension
"contains" schemas it owns, but "is contained in" its extschema (even
though neither is strictly true), having both true at once is confusing for
people too. So prevent the situation from being set up.
Reported and patched by Thom Brown. Back-patch to 9.1 where extensions
were added.

This essentially reverts commit e54b10a62db2991235fe800c629baef4531a6d67,
in which I'd decided that the "last ditch" join logic was useless. The
folly of that is now exposed by a report from Pavel Stehule: although the
function should always find at least one join in a self-contained join
problem, it can still fail to do so in a sub-problem created by artificial
from_collapse_limit or join_collapse_limit constraints. Adjust the
comments to describe this, and simplify the code a bit to match the new
coding of the earlier loop in the function.
I'm not terribly happy about this: I still subscribe to the opinion stated
in the previous commit message that the "last ditch" code can obscure logic
bugs elsewhere. But the alternative seems to be to complicate the earlier
tests for does-this-relation-have-a-join-clause to the point where they can
tell whether the join clauses link outside the current join sub-problem.
And that looks messy, slow, and possibly a source of bugs in itself.
In any case, now is not the time to be inserting experimental code into
9.2, so let's just go back to the time-tested solution.

libxslt offers the ability to read and write both files and URLs through
stylesheet commands, thus allowing unprivileged database users to both read
and write data with the privileges of the database server. Disable that
through proper use of libxslt's security options.
Also, remove xslt_process()'s ability to fetch documents and stylesheets
from external files/URLs. While this was a documented "feature", it was
long regarded as a terrible idea. The fix for CVE-2012-3489 broke that
capability, and rather than expend effort on trying to fix it, we're just
going to summarily remove it.
While the ability to write as well as read makes this security hole
considerably worse than CVE-2012-3489, the problem is mitigated by the fact
that xslt_process() is not available unless contrib/xml2 is installed,
and the longstanding warnings about security risks from that should have
discouraged prudent DBAs from installing it in security-exposed databases.
Reported and fixed by Peter Eisentraut.
Security: CVE-2012-3488

xml_parse() would attempt to fetch external files or URLs as needed to
resolve DTD and entity references in an XML value, thus allowing
unprivileged database users to attempt to fetch data with the privileges
of the database server. While the external data wouldn't get returned
directly to the user, portions of it could be exposed in error messages
if the data didn't parse as valid XML; and in any case the mere ability
to check existence of a file might be useful to an attacker.
The ideal solution to this would still allow fetching of references that
are listed in the host system's XML catalogs, so that documents can be
validated according to installed DTDs. However, doing that with the
available libxml2 APIs appears complex and error-prone, so we're not going
to risk it in a security patch that necessarily hasn't gotten wide review.
So this patch merely shuts off all access, causing any external fetch to
silently expand to an empty string. A future patch may improve this.
In HEAD and 9.2, also suppress warnings about undefined entities, which
would otherwise occur as a result of not loading referenced DTDs. Previous
branches don't show such warnings anyway, due to different error handling
arrangements.
Credit to Noah Misch for first reporting the problem, and for much work
towards a solution, though this simplistic approach was not his preference.
Also thanks to Daniel Veillard for consultation.
Security: CVE-2012-3489

DST law changes in Morocco; Tokelau has relocated to the other side of
the International Date Line; and apparently Olson had Tokelau's GMT
offset wrong by an hour even before that.
There are also a large number of non-significant changes in this update.
Upstream took the opportunity to remove trailing whitespace, and the
SCCS-style version numbers on the individual files are gone too.

This command generated new pg_depend entries linking the index to the
constraint and the constraint to the table, which match the entries made
when a unique or primary key constraint is built de novo. However, it did
not bother to get rid of the entries linking the index directly to the
table. We had considered the issue when the ADD CONSTRAINT USING INDEX
patch was written, and concluded that we didn't need to get rid of the
extra entries. But this is wrong: ALTER COLUMN TYPE wasn't expecting such
redundant dependencies to exist, as reported by Hubert Depesz Lubaczewski.
On reflection it seems rather likely to break other things as well, since
there are many bits of code that crawl pg_depend for one purpose or
another, and most of them are pretty naive about what relationships they're
expecting to find. Fortunately it's not that hard to get rid of the extra
dependency entries, so let's do that.
Back-patch to 9.1, where ALTER TABLE ADD CONSTRAINT USING INDEX was added.

Prevent pg_upgrade from crashing if it can’t write to the current directory.

Force archive_status of .done for xlogs created by dearchival/replication. This prevents spurious attempts to archive xlog files after promotion of standby, a bug introduced by cascading replication patch in 9.2.

Fix minor bug in XLogFileRead() that accidentally worked. Cascading replication copied the incoming file into pg_xlog but didn’t set path correctly, so the first attempt to open file failed causing it to loop around and look for file in pg_xlog. So the earlier coding worked, but accidentally rather than by design.

This was broken in commit ed0b409d22346b1b027a4c2099ca66984d94b6dd,
which revised the GlobalTransactionData struct to not include the
associated PGPROC as its first member, but overlooked one place where
a cast was used in reliance on that equivalence.
The most effective way of fixing this seems to be to create a new function
that looks up the GlobalTransactionData struct given the XID, and make
both TwoPhaseGetDummyBackendId and TwoPhaseGetDummyProc rely on that.
Per report from Robert Ross.

Fix pg_upgrade file share violation on Windows created by the commit 4741e9afb93f0d769655b2d18c2b73b86f281010. This was done by adding an optional second log file parameter to exec_prog(), and closing and reopening the log file between system() calls.

In particular, with a controlled shutdown of the master, pg_basebackup
with streaming log could terminate without an error message, even though
the backup is not consistent.
In passing, fix a few cases where walfile wasn't properly set to -1 after
closing.
Fujii Masao

We used to convert the unicode object directly to a string in the server
encoding by calling Python's PyUnicode_AsEncodedString function. In other
words, we used Python's routines to do the encoding. However, that has a
few problems. First of all, it required keeping a mapping table of Python
encoding names and PostgreSQL encodings. But the real killer was that Python
doesn't support EUC_TW and MULE_INTERNAL encodings at all.
Instead, convert the Python unicode object to UTF-8, and use PostgreSQL's
encoding conversion functions to convert from UTF-8 to server encoding. We
were already doing the same in the other direction in PLyUnicode_FromString,
so this is more consistent, too.
Note: This makes SQL_ASCII to behave more leniently. We used to map
SQL_ASCII to Python's 'ascii', which on Python means strict 7-bit ASCII
only, so you got an error if the python string contained anything but pure
ASCII. You no longer get an error; you get the UTF-8 representation of the
string instead.
Backpatch to 9.0, where these conversions were introduced.
Jan Urbański

DecodeInterval() failed to honor the "range" parameter (the special SQL
syntax for indicating which fields appear in the literal string) if the
time was signed. This seems inappropriate, so make it work like the
not-signed case. The inconsistency was introduced in my commit
f867339c0148381eb1d01f93ab5c79f9d10211de, which as noted in its log message
was only really focused on making SQL-compliant literals work per spec.
Including a sign here is not per spec, but if we're going to allow it
then it's reasonable to expect it to work like the not-signed case.
Also, remove bogus setting of tmask, which caused subsequent processing to
think that what had been given was a timezone and not an hh:mm(:ss) field,
thus confusing checks for redundant fields. This seems to be an aboriginal
mistake in Lockhart's commit 2cf1642461536d0d8f3a1cf124ead0eac04eb760.
Add regression test cases to illustrate the changed behaviors.
Back-patch as far as 8.4, where support for spec-compliant interval
literals was added.
Range problem reported and diagnosed by Amit Kapila, tmask problem by me.

As noted by Noah Misch, btree_xlog_delete_get_latestRemovedXid is
critically dependent on the assumption that it's examining a consistent
state of the database. This was undocumented though, so the
seemingly-unrelated check for no active HS sessions might be thought to be
merely an optional optimization. Improve comments, and add an explicit
check of reachedConsistency just to be sure.
This function returns InvalidTransactionId (thereby killing all HS
transactions) in several cases that are not nearly unlikely enough for my
taste. This commit doesn't attempt to fix those deficiencies, just
document them.
Back-patch to 9.2, not from any real functional need but just to keep the
branches more closely synced to simplify possible future back-patching.

In yesterday's commit 962e0cc71e839c58fb9125fa85511b8bbb8bdbee, I added the
ResolveRecoveryConflictWithSnapshot call in the wrong place. I correctly
put it before spgRedoVacuumRedirect itself would modify the index page ---
but not before RestoreBkpBlocks, so replay of a record with a full-page
image would modify the page before kicking off any conflicting HS
transactions. Oops.

The correct test for whether a redirection tuple is removable is whether
tuple's xid < RecentGlobalXmin, not OldestXmin; the previous coding
failed to protect index searches being done in concurrent transactions that
have no XID. This mirrors the recent fix in btree's page recycling logic
made in commit d3abbbebe52eb1e59e621c880ad57df9d40d13f2.
Also, WAL-log the newest XID of any removed redirection tuple on an index
page, and apply ResolveRecoveryConflictWithSnapshot during InHotStandby WAL
replay. This protects against concurrent Hot Standby transactions possibly
needing to see the redirection tuple(s).
Per my query of 2012-03-12 and subsequent discussion.

After taking awhile to digest the row-processor feature that was added to
libpq in commit 92785dac2ee7026948962cd61c4cd84a2d052772, we've concluded
it is over-complicated and too hard to use. Leave the core infrastructure
changes in place (that is, there's still a row processor function inside
libpq), but remove the exposed API pieces, and instead provide a "single
row" mode switch that causes PQgetResult to return one row at a time in
separate PGresult objects.
This approach incurs more overhead than proper use of a row processor
callback would, since construction of a PGresult per row adds extra cycles.
However, it is far easier to use and harder to break. The single-row mode
still affords applications the primary benefit that the row processor API
was meant to provide, namely not having to accumulate large result sets in
memory before processing them. Preliminary testing suggests that we can
probably buy back most of the extra cycles by micro-optimizing construction
of the extra results, but that task will be left for another day.
Marko Kreen

Parse analysis neglected to cover the case of a WITH clause attached to an
intermediate-level set operation; it only handled WITH at the top level
or WITH attached to a leaf-level SELECT. Per report from Adam Mackler.
In HEAD, I rearranged the order of SelectStmt's fields to put withClause
with the other fields that can appear on non-leaf SelectStmts. In back
branches, leave it alone to avoid a possible ABI break for third-party
code.
Back-patch to 8.4 where WITH support was added.

Fix syslogger so that log_truncate_on_rotation works in the first rotation.

In the original coding of the log rotation stuff, we did not bother to make
the truncation logic work for the very first rotation after postmaster
start (or after a syslogger crash and restart). It just always appended
in that case. It did not seem terribly important at the time, but we've
recently had two separate complaints from people who expected it to work
unsurprisingly. (Both users tend to restart the postmaster about as often
as a log rotation is configured to happen, which is maybe not typical use,
but still...) Since the initial log file is opened in the postmaster,
fixing this requires passing down some more state to the syslogger child
process.
It's always been like this, so back-patch to all supported branches.

The most user-visible part of this is to change the long options
--statusint and --noloop to --status-interval and --no-loop,
respectively, per discussion.
Also, consistently enclose file names in double quotes, per our
conventions; and consistently use the term "transaction log file" to
talk about WAL segments. (Someday we may need to go over this
terminology and make it consistent across the whole source code.)
Finally, reflow the code to better fit in 80 columns, and have pgindent
fix it up some more.

When the internal loop mode was added, freeing memory and closing
filedescriptors before returning became important, and a few cases
in the code missed that.
This is a backpatch of commit 058a050e to the 9.2 branch, which seems to
have been neglected (in error, because the bugs it fixes were introduced
in commit 16282ae6 which is present in both master and 9.2).
Fujii Masao

Now that the diskchecker.pl author has updated the download link on his website, revert the separate link to the download git repository.

This function suppressed any stderr output from the called program, which
is unnecessary in the normal case and unhelpful in error cases. It also
gave a rather opaque message along the lines of "fgets failure: Success"
in case the called program failed to return anything on stdout. Since
we've seen multiple reports of people not understanding what's wrong when
pg_ctl reports this, improve the message.
Back-patch to all active branches.

In the original coding of the autovacuum cancel feature, commit
acac68b2bcae818bc8803b8cb8cbb17eee8d5e2b, an autovacuum process was
considered a target for cancellation if it was found to hard-block any
process examined in the deadlock search. This patch tightens the test so
that the autovacuum must directly hard-block the current process. This
should make the behavior more predictable in general, and in particular
it ensures that an autovacuum will not be canceled with less than
deadlock_timeout grace period. In the old coding, it was possible for an
autovacuum to be canceled almost instantly, given unfortunate timing of two
or more other processes' lock attempts.
This also justifies the logging methodology in the recent commit
d7318d43d891bd63e82dcfc27948113ed7b1db80; without this restriction, that
patch isn't providing enough information to see the connection of the
canceling process to the autovacuum. Like that one, patch all the way
back.

The old message was at DEBUG2, so typically it didn't show up in the
log at all. As a result, in most cases where autovacuum was canceled,
the only information that was logged was the table being vacuumed,
with no indication as to what problem caused the cancel. Crank up
the level to LOG and add some more details to assist with debugging.
Back-patch all the way, per discussion on pgsql-hackers.

If a crash occurred immediately after the first nextval() call for a serial
column, WAL replay would restore the sequence to a state in which it
appeared that no nextval() had been done, thus allowing the first sequence
value to be returned again by the next nextval() call; as reported in
bug #6748 from Xiangming Mei.
More generally, the problem would occur if an ALTER SEQUENCE was executed
on a freshly created or reset sequence. (The manifestation with serial
columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step
to serial column creation.) The cause is that sequence creation attempted
to save one WAL entry by writing out a WAL record that made it appear that
the first nextval() had already happened (viz, with is_called = true),
while marking the sequence's in-database state with log_cnt = 1 to show
that the first nextval() need not emit a WAL record. However, ALTER
SEQUENCE would emit a new WAL entry reflecting the actual in-database state
(with is_called = false). Then, nextval would allocate the first sequence
value and set is_called = true, but it would trust the log_cnt value and
not emit any WAL record. A crash at this point would thus restore the
sequence to its post-ALTER state, causing the next nextval() call to return
the first sequence value again.
To fix, get rid of the idea of logging an is_called status different from
reality. This means that the first nextval-driven WAL record will happen
at the first nextval call not the second, but the marginal cost of that is
pretty negligible. In addition, make sure that ALTER SEQUENCE resets
log_cnt to zero in any case where it touches sequence parameters that
affect future nextval results. This will result in some user-visible
changes in the contents of a sequence's log_cnt column, as reflected in the
patch's regression test changes; but no application should be depending on
that anyway, since it was already true that log_cnt changes rather
unpredictably depending on checkpoint timing.
In addition, make some basically-cosmetic improvements to get rid of
sequence.c's undesirable intimacy with page layout details. It was always
really trying to WAL-log the contents of the sequence tuple, so we should
have it do that directly using a HeapTuple's t_data and t_len, rather than
backing into it with some magic assumptions about where the tuple would be
on the sequence's page.
Back-patch to all supported branches.

The initially implemented syntax, "CHECK NO INHERIT (expr)" was not
deemed very good, so switch to "CHECK (expr) NO INHERIT" instead. This
way it looks similar to SQL-standards compliant constraint attribute.
Backport to 9.2 where the new syntax and feature was introduced.
Per discussion.

Commit f5bcd398addcbeb785f0513cf28cba5d1ecd2c8a introduced a test using
a table named "circles" in inherit.sql. Unfortunately, the concurrently
executed constraints test was already using that table name, so the
parallel regression tests would sometimes fail. Rename table to dodge
the problem. Per buildfarm.

We made use of the ROWS estimate for set-returning functions used in FROM,
but not for those used in SELECT targetlists; which is a bit of an
oversight considering there are common usages that require the latter
approach. Improve that. (I had initially thought it might be worth
folding this into cost_qual_eval, but after investigation concluded that
that wouldn't be very helpful, so just do it separately.) Per complaint
from David Johnston.
Back-patch to 9.2, but not further, for fear of destabilizing plan choices
in existing releases.

There is no point in running this test when prepared transactions are disabled,
which is the default. New make targets that include the test are provided. This
will save some useless waste of cycles on buildfarm machines.
Backpatch to 9.1 where these tests were introduced.

The code was setting it true for other constraints, which is
bogus. Doing so caused bogus catalog entries for such constraints, and
in particular caused an error to be raised when trying to drop a
constraint of types other than CHECK from a table that has children,
such as reported in bug #6712.
In 9.2, additionally ignore connoinherit=true for other constraint
types, to avoid having to force initdb; existing databases might already
contain bogus catalog entries.
Includes a catversion bump (in HEAD only).
Bug report from Miroslav Šulc
Analysis from Amit Kapila and Noah Misch; Amit also contributed the patch.

When a whole-row Var is reading the result of a subquery, we need it to
ignore any "resjunk" columns that the subquery might have evaluated for
GROUP BY or ORDER BY purposes. We've hacked this area before, in commit
68e40998d058c1f6662800a648ff1e1ce5d99cba, but that fix only covered
whole-row Vars of named composite types, not those of RECORD type; and it
was mighty klugy anyway, since it just assumed without checking that any
extra columns in the result must be resjunk. A proper fix requires getting
hold of the subquery's targetlist so we can actually see which columns are
resjunk (whereupon we can use a JunkFilter to get rid of them). So bite
the bullet and add some infrastructure to make that possible.
Per report from Andrew Dunstan and additional testing by Merlin Moncure.
Back-patch to all supported branches. In 8.3, also back-patch commit
292176a118da6979e5d368a4baf27f26896c99a5, which for some reason I had
not done at the time, but it's a prerequisite for this change.

Instead of having one hash table entry per relation/fork/segment, just have
one per relation, and use bitmapsets to represent which specific segments
need to be fsync'd. This eliminates the need to scan the whole hash table
to implement FORGET_RELATION_FSYNC, which fixes the O(N^2) behavior
recently demonstrated by Jeff Janes for cases involving lots of TRUNCATE or
DROP TABLE operations during a single checkpoint cycle. Per an idea from
Robert Haas.
(FORGET_DATABASE_FSYNC still sucks, but since dropping a database is a
pretty expensive operation anyway, we'll live with that.)
In passing, improve the delayed-unlink code: remove the pass over the list
in mdpreckpt, since it wasn't doing anything for us except supporting a
useless Assert in mdpostckpt, and fix mdpostckpt so that it will absorb
fsync requests every so often when clearing a large backlog of deletion
requests.

We were sending one per fork, but a little bit of refactoring allows us
to send just one request with forknum == InvalidForkNumber. This not only
reduces pressure on the shared-memory request queue, but saves repeated
traversals of the checkpointer's hash table.

Functions like range_eq, range_before etc. are exposed at the SQL-level, but
they're also used internally by the GiST consistent support function. The
code sharing was done by a hack, TrickFunctionCall2, which relied on the
knowledge that all the functions used fn_extra the same way. This commit
splits the functions into internal versions that take a TypeCacheEntry as
argument, and thin wrappers to expose the functions at the SQL-level. The
internal versions can then be called directly and in a less hacky way from
the GiST consistent function.
This is just cosmetic, but backpatch to 9.2 anyway, to avoid having a
different version of this code in the 9.2 branch. That would make
backpatching fixes in this area more difficult.
Alexander Korotkov

ForwardFsyncRequest() supposed that it could only be called in regular
backends, which used to be true; but since the splitup of bgwriter and
checkpointer, it is also called in the bgwriter. We do not want to count
such calls in pg_stat_bgwriter.buffers_backend statistics, so fix things
so that they aren't.
(It's worth noting here that this implies an alarmingly large increase in
the expected amount of cross-process fsync request traffic, which may well
mean that the process splitup was not such a hot idea.)

mdinit() was misusing IsBootstrapProcessingMode() to decide whether to
create an fsync pending-operations table in the current process. This led
to creating a table not only in the startup and checkpointer processes as
intended, but also in the bgwriter process, not to mention other auxiliary
processes such as walwriter and walreceiver. Creation of the table in the
bgwriter is fatal, because it absorbs fsync requests that should have gone
to the checkpointer; instead they just sit in bgwriter local memory and are
never acted on. So writes performed by the bgwriter were not being fsync'd
which could result in data loss after an OS crash. I think there is no
live bug with respect to walwriter and walreceiver because those never
perform any writes of shared buffers; but the potential is there for
future breakage in those processes too.
To fix, make AuxiliaryProcessMain() export the current process's
AuxProcType as a global variable, and then make mdinit() test directly for
the types of aux process that should have a pendingOpsTable. Having done
that, we might as well also get rid of the random bool flags such as
am_walreceiver that some of the aux processes had grown. (Note that we
could not have fixed the bug by examining those variables in mdinit(),
because it's called from BaseInit() which is run by AuxiliaryProcessMain()
before entering any of the process-type-specific code.)
Back-patch to 9.2, where the problem was introduced by the split-up of
bgwriter and checkpointer processes. The bogus pendingOpsTable exists
in walwriter and walreceiver processes in earlier branches, but absent
any evidence that it causes actual problems there, I'll leave the older
branches alone.

Since the scandir() emulation was taken out of pg_upgrade, there's
no longer any need for scandir_file_pattern to exist as a global
variable. Replace it with a local in the one remaining function
that was making use of it.

Error out on out-of-memory, rather than returning -1, which the sole
existing caller wasn't checking for anyway. There doesn't seem to be
any use-case for making the caller check for failure here.
Detect failure return from readdir().
Use a less platform-dependent method of calculating the entrysize.
It's possible, but not yet confirmed, that this explains bug #6733,
in which Mike Wilson reports a pg_upgrade crash that did not occur
in 9.1. (Note that load_directory is effectively new code in 9.2,
at least on platforms that have scandir().)
Fix up comments, avoid uselessly using two counters, reduce the number
of realloc calls to something sane.

In all branches back to 8.3, this patch fixes a questionable assumption in
CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are
no uninitialized pad bytes in the request queue structs. This would only
cause trouble if (a) there were such pad bytes, which could happen in 8.4
and up if the compiler makes enum ForkNumber narrower than 32 bits, but
otherwise would require not-currently-planned changes in the widths of
other typedefs; and (b) the kernel has not uniformly initialized the
contents of shared memory to zeroes. Still, it seems a tad risky, and we
can easily remove any risk by pre-zeroing the request array for ourselves.
In addition to that, we need to establish a coding rule that struct
RelFileNode can't contain any padding bytes, since such structs are copied
into the request array verbatim. (There are other places that are assuming
this anyway, it turns out.)
In 9.1 and up, the risk was a bit larger because we were also effectively
assuming that struct RelFileNodeBackend contained no pad bytes, and with
fields of different types in there, that would be much easier to break.
However, there is no good reason to ever transmit fsync or delete requests
for temp files to the bgwriter/checkpointer, so we can revert the request
structs to plain RelFileNode, getting rid of the padding risk and saving
some marginal number of bytes and cycles in fsync queue manipulation while
we are at it. The savings might be more than marginal during deletion of
a temp relation, because the old code transmitted an entirely useless but
nonetheless expensive-to-process ForgetRelationFsync request to the
background process, and also had the background process perform the file
deletion even though that can safely be done immediately.
In addition, make some cleanup of nearby comments and small improvements to
the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.

These only pass cleanly on UTF8 and SQL_ASCII encodings, besides the
Japanese encoding in which they were originally written, which is clearly
not good enough. Since the functionality they test has not ever been
tested from PL/Perl, the best answer seems to be to remove the new tests
completely.
Per buildfarm results and ensuing discussion.

Formerly, when trying to copy both indexes and comments, CREATE TABLE LIKE
had to pre-assign names to indexes that had comments, because it made up an
explicit CommentStmt command to apply the comment and so it had to know the
name for the index. This creates bad interactions with other indexes, as
shown in bug #6734 from Daniele Varrazzo: the preassignment logic couldn't
take any other indexes into account so it could choose a conflicting name.
To fix, add a field to IndexStmt that allows it to carry a comment to be
assigned to the new index. (This isn't a user-exposed feature of CREATE
INDEX, only an internal option.) Now we don't need preassignment of index
names in any situation.
I also took the opportunity to refactor DefineIndex to accept the IndexStmt
as such, rather than passing all its fields individually in a mile-long
parameter list.
Back-patch to 9.2, but no further, because it seems too dangerous to change
IndexStmt or DefineIndex's API in released branches. The bug exists back
to 9.0 where CREATE TABLE LIKE grew the ability to copy comments, but given
the lack of prior complaints we'll just let it go unfixed before 9.2.

rfree() failed to cope with the case that pg_regcomp() had initialized the
regex_t struct but then failed to allocate any memory for re->re_guts (ie,
the first malloc call in pg_regcomp() failed). It would try to touch the
guts struct anyway, and thus dump core. This is a sufficiently narrow
corner case that it's not surprising it's never been seen in the field;
but still a bug is a bug, so patch all active branches.
Noted while investigating whether we need to call pg_regfree after a
failure return from pg_regcomp. Other than this bug, it turns out we
don't, so adjust comments appropriately.

Walsenders must have working SIGALRM handling during InitPostgres,
but they set the handler to SIG_IGN so that nothing would happen
if a timeout was reached. This could result in two failure modes:
* If a walsender participated in a deadlock during its authentication
transaction, and was the last to wait in the deadly embrace, the deadlock
would not get cleared automatically. This would require somebody to be
trying to take out AccessExclusiveLock on multiple system catalogs, so
it's not very probable.
* If a client failed to respond to a walsender's authentication challenge,
the intended disconnect after AuthenticationTimeout wouldn't happen, and
the walsender would wait indefinitely for the client.
For the moment, fix in back branches only, since this is fixed in a
different way in the timeout-infrastructure patch that's awaiting
application to HEAD. If we choose not to apply that, then we'll need
to do this in HEAD as well.

Document that Log-Shipping Standby Servers cannot be upgraded by pg_upgrade.

Back-patch of commits 72dd6291f216440f6bb61a8733729a37c7e3b2d2,
f6a05fd973a102f7e66c491d3f854864b8d24844, and
60e9c224a197aa37abb1aa3aefa3aad42da61f7f.
This is needed to support fixing the regex prefix extraction bug in
back branches.

When in SQL_ASCII encoding, strings passed around are not necessarily
UTF8-safe. We had already fixed this in some places, but it looks like
we missed some.
I had to backpatch Peter Eisentraut's a8b92b60 to 9.1 in order for this
patch to cherry-pick more cleanly.
Patch from Alex Hunsaker, tweaked by Kyotaro HORIGUCHI and myself.
Some desultory cleanup and comment addition by me, during patch review.
Per bug report from Christoph Berg in
20120209102116.GA14429@msgid.df7cb.de

Previously, pattern_fixed_prefix() was defined to return whatever fixed
prefix it could extract from the pattern, plus the "rest" of the pattern.
That definition was sensible for LIKE patterns, but not so much for
regexes, where reconstituting a valid pattern minus the prefix could be
quite tricky (certainly the existing code wasn't doing that correctly).
Since the only thing that callers ever did with the "rest" of the pattern
was to pass it to like_selectivity() or regex_selectivity(), let's cut out
the middle-man and just have pattern_fixed_prefix's subroutines do this
directly. Then pattern_fixed_prefix can return a simple selectivity
number, and the question of how to cope with partial patterns is removed
from its API specification.
While at it, adjust the API spec so that callers who don't actually care
about the pattern's selectivity (which is a lot of them) can pass NULL for
the selectivity pointer to skip doing the work of computing a selectivity
estimate.
This patch is only an API refactoring that doesn't actually change any
processing, other than allowing a little bit of useless work to be skipped.
However, it's necessary infrastructure for my upcoming fix to regex prefix
extraction, because after that change there won't be any simple way to
identify the "rest" of the regex, not even to the low level of fidelity
needed by regex_selectivity. We can cope with that if regex_fixed_prefix
and regex_selectivity communicate directly, but not if we have to work
within the old API. Hence, back-patch to all active branches.

We can do this without creating an API break for estimation functions
by passing the collation using the existing fmgr functionality for
passing an input collation as a hidden parameter.
The need for this was foreseen at the outset, but we didn't get around to
making it happen in 9.1 because of the decision to sort all pg_statistic
histograms according to the database's default collation. That meant that
selectivity estimators generally need to use the default collation too,
even if they're estimating for an operator that will do something
different. The reason it's suddenly become more interesting is that
regexp interpretation also uses a collation (for its LC_TYPE not LC_COLLATE
property), and we no longer want to use the wrong collation when examining
regexps during planning. It's not that the selectivity estimate is likely
to change much from this; rather that we are thinking of caching compiled
regexps during planner estimation, and we won't get the intended benefit
if we cache them with a different collation than the executor will use.
Back-patch to 9.1, both because the regexp change is likely to get
back-patched and because we might as well get this right in all
collation-supporting branches, in case any third-party code wants to
rely on getting the collation. The patch turns out to be minuscule
now that I've done it ...

join_path_components() tried to remove leading ".." components from its
tail argument, but it was not nearly bright enough to do so correctly
unless the head argument was (a) absolute and (b) canonicalized.
Rather than try to fix that logic, let's just get rid of it: there is no
correctness reason to remove "..", and cosmetic concerns can be taken
care of by a subsequent canonicalize_path() call. Per bug #6715 from
Greg Davidson.
Back-patch to all supported branches. It appears that pre-9.2, this
function is only used with absolute paths as head arguments, which is why
we'd not noticed the breakage before. However, third-party code might be
expecting this function to work in more general cases, so it seems wise
to back-patch.
In HEAD and 9.2, also make some minor cosmetic improvements to callers.

That caused the plpython_unicode regression test to fail on SQL_ASCII
encoding, as evidenced by the buildfarm. The reason is that with the patch,
you don't get the detail in the error message that you got before. That
detail is actually very informative, so rather than just adjust the expected
output, let's revert that part of the patch for now to make the buildfarm
green again, and figure out some other way to avoid the recursion of
PLy_elog() that doesn't lose the detail.

Windows encodings, "win1252" and so forth, are named differently in Python,
like "cp1252". Also, if the PyUnicode_AsEncodedString() function call fails
for some reason, use a plain ereport(), not a PLy_elog(), to report that
error. That avoids recursion and crash, if PLy_elog() tries to call
PLyUnicode_Bytes() again.
This fixes bug reported by Asif Naeem. Backpatch down to 9.0, before that
plpython didn't even try these conversions.
Jan Urbański, with minor comment improvements by me.

Fix missing regex slash that caused perltidy to get confused on copyright.pl.

Per bug #6593, REASSIGN OWNED fails when the affected role has created
an extension. Even though the user related to the extension is not
nominally the owner, its OID appears on pg_shdepend and thus causes
problems when the user is to be dropped.
This commit adds code to change the "ownership" of the extension itself,
not of the contained objects. This is fine because it's currently only
called from REASSIGN OWNED, which would also modify the ownership of the
contained objects. However, this is not sufficient for a working ALTER
OWNER implementation extension.
Back-patch to 9.1, where extensions were introduced.
Bug #6593 reported by Emiliano Leporati.

A thinko in commit 029dfdf1157b6d837a7b7211cd35b00c6bcd767c caused the year
519 to be handled differently from either adjacent year, which was not the
intention AFAICS. Report and diagnosis by Marc Cousin.
In passing, remove redundant re-tests of year value.

When (re) loading the typcache comparison cache for an enum type's values,
use an up-to-date MVCC snapshot, not the transaction's existing snapshot.
This avoids problems if we encounter an enum OID that was created since our
transaction started. Per report from Andres Freund and diagnosis by Robert
Haas.
To ensure this is safe even if enum comparison manages to get invoked
before we've set a transaction snapshot, tweak GetLatestSnapshot to
redirect to GetTransactionSnapshot instead of throwing error when
FirstSnapshotSet is false. The existing uses of GetLatestSnapshot (in
ri_triggers.c) don't care since they couldn't be invoked except in a
transaction that's already done some work --- but it seems just conceivable
that this might not be true of enums, especially if we ever choose to use
enums in system catalogs.
Note that the comparable coding in enum_endpoint and enum_range_internal
remains GetTransactionSnapshot; this is perhaps debatable, but if we
changed it those functions would have to be marked volatile, which doesn't
seem attractive.
Back-patch to 9.1 where ALTER TYPE ADD VALUE was added.

If a CHECK constraint or index definition contained a whole-row Var (that
is, "table.*"), an attempt to copy that definition via CREATE TABLE LIKE or
table inheritance produced incorrect results: the copied Var still claimed
to have the rowtype of the source table, rather than the created table.
For the LIKE case, it seems reasonable to just throw error for this
situation, since the point of LIKE is that the new table is not permanently
coupled to the old, so there's no reason to assume its rowtype will stay
compatible. In the inheritance case, we should ideally allow such
constraints, but doing so will require nontrivial refactoring of CREATE
TABLE processing (because we'd need to know the OID of the new table's
rowtype before we adjust inherited CHECK constraints). In view of the lack
of previous complaints, that doesn't seem worth the risk in a back-patched
bug fix, so just make it throw error for the inheritance case as well.
Along the way, replace change_varattnos_of_a_node() with a more robust
function map_variable_attnos(), which is capable of being extended to
handle insertion of ConvertRowtypeExpr whenever we get around to fixing
the inheritance case nicely, and in the meantime it returns a failure
indication to the caller so that a helpful message with some context can be
thrown. Also, this code will do the right thing with subselects (if we
ever allow them in CHECK or indexes), and it range-checks varattnos before
using them to index into the map array.
Per report from Sergey Konoplev. Back-patch to all supported branches.

The LISTEN/NOTIFY subsystem got confused if SimpleLruZeroPage failed,
which would typically happen as a result of a write() failure while
attempting to dump a dirty pg_notify page out of memory. Subsequently,
all attempts to send more NOTIFY messages would fail with messages like
"Could not read from file "pg_notify/nnnn" at offset nnnnn: Success".
Only restarting the server would clear this condition. Per reports from
Kevin Grittner and Christoph Berg.
Back-patch to 9.0, where the problem was introduced during the
LISTEN/NOTIFY rewrite.

The callers of UtilityContainsQuery want it to return a non-utility Query
if it returns anything at all. However, since we made CREATE TABLE
AS/SELECT INTO into a utility command instead of a variant of SELECT,
a command like "EXPLAIN SELECT INTO" results in two nested utility
statements. So what we need UtilityContainsQuery to do is drill down
to the bottom non-utility Query.
I had thought of this possibility in setrefs.c, and fixed it there by
looping around the UtilityContainsQuery call; but overlooked that the call
sites in plancache.c have a similar issue. In those cases it's
notationally inconvenient to provide an external loop, so let's redefine
UtilityContainsQuery as recursing down to a non-utility Query instead.
Noted by Rushabh Lathia. This is a somewhat cleaned-up version of his
proposed patch.

A similar change was made previously for pg_cancel_backend, so now it
all matches again.
Dan Farina, reviewed by Fujii Masao, Noah Misch, and Jeff Davis,
with slight kibitzing on the doc changes by me.

Cope with smaller-than-normal BLCKSZ setting in SPGiST indexes on text.

The original coding failed miserably for BLCKSZ of 4K or less, as reported
by Josh Kupershmidt. With the present design for text indexes, a given
inner tuple could have up to 256 labels (requiring either 3K or 4K bytes
depending on MAXALIGN), which means that we can't positively guarantee no
failures for smaller blocksizes. But we can at least make it behave sanely
so long as there are few enough labels to fit on a page. Considering that
btree is also more prone to "index tuple too large" failures when BLCKSZ is
small, it's not clear that we should expend more work than this on this
case.

While pg_dump has included dependency information in archive-format output
ever since 7.3, it never made any large effort to ensure that that
information was actually useful. In particular, in common situations where
dependency chains include objects that aren't separately emitted in the
dump, the dependencies shown for objects that were emitted would reference
the dump IDs of these un-dumped objects, leaving no clue about which other
objects the visible objects indirectly depend on. So far, parallel
pg_restore has managed to avoid tripping over this misfeature, but only
by dint of some crude hacks like not trusting dependency information in
the pre-data section of the archive.
It seems prudent to do something about this before it rises up to bite us,
so instead of emitting the "raw" dependencies of each dumped object,
recursively search for its actual dependencies among the subset of objects
that are being dumped.
Back-patch to 9.2, since that code hasn't yet diverged materially from
HEAD. At some point we might need to back-patch further, but right now
there are no known cases where this is actively necessary. (The one known
case, bug #6699, is fixed in a different way by my previous patch.) Since
this patch depends on 9.2 changes that made TOC entries be marked before
output commences as to whether they'll be dumped, back-patching further
would require additional surgery; and as of now there's no evidence that
it's worth the risk.

As of 9.2, with the --section option, it is very important that the concept
of "pre data", "data", and "post data" sections of the output be honored
strictly; else a dump divided into separate sectional files might be
unrestorable. However, the dependency-sorting logic knew nothing of
sections and would happily select output orderings that didn't fit that
structure. Doing so was mostly harmless before 9.2, but now we need to be
sure it doesn't do that. To fix, create dummy objects representing the
section boundaries and add dependencies between them and all the normal
objects. (This might sound expensive but it seems to only add a percent or
two to pg_dump's runtime.)
This also fixes a problem introduced in 9.1 by the feature that allows
incomplete GROUP BY lists when a primary key is given in GROUP BY.
That means that views can depend on primary key constraints. Previously,
pg_dump would deal with that by simply emitting the primary key constraint
before the view definition (and hence before the data section of the
output). That's bad enough for simple serial restores, where creating an
index before the data is loaded works, but is undesirable for speed
reasons. But it could lead to outright failure of parallel restores, as
seen in bug #6699 from Joe Van Dyk. That happened because pg_restore would
switch into parallel mode as soon as it reached the constraint, and then
very possibly would try to emit the view definition before the primary key
was committed (as a consequence of another bug that causes the view not to
be correctly marked as depending on the constraint). Adding the section
boundary constraints forces the dependency-sorting code to break the view
into separate table and rule declarations, allowing the rule, and hence the
primary key constraint it depends on, to revert to their intended location
in the post-data section. This also somewhat accidentally works around the
bogus-dependency-marking problem, because the rule will be correctly shown
as depending on the constraint, so parallel pg_restore will now do the
right thing. (We will fix the bogus-dependency problem for real in a
separate patch, but that patch is not easily back-portable to 9.1, so the
fact that this patch is enough to dodge the only known symptom is
fortunate.)
Back-patch to 9.1, except for the hunk that adds verification that the
finished archive TOC list is in correct section order; the place where
it was convenient to add that doesn't exist in 9.1.

Repeated execution of an uncorrelated ARRAY_SUBLINK sub-select (which
I think can only happen if the sub-select is embedded in a larger,
correlated subquery) would leak memory for the duration of the query,
due to not reclaiming the array generated in the previous execution.
Per bug #6698 from Armando Miraglia. Diagnosis and fix idea by Heikki,
patch itself by me.
This has been like this all along, so back-patch to all supported versions.

Before, some places didn't document the short options (-? and -V),
some documented both, some documented nothing, and they were listed in
various orders. Now this is hopefully more consistent and complete.

Since this is the easy way of doing it, it should be listed first. All
the old information is retained for those who want the more advanced way.
Also adds a subheading for compressing logs, that seems to have been missing

Because permissions are assigned to element types, not array types,
complaining about permission denied on an array type would be
misleading to users. So adjust the reporting to refer to the element
type instead.
In order not to duplicate the required logic in two dozen places,
refactor the permission denied reporting for types a bit.
pointed out by Yeb Havinga during the review of the type privilege
feature

Instead of identifying error locations only by line number (which could
be entirely unhelpful with long input lines), provide a fragment of the
input text too, placing this info in a new CONTEXT entry. Make the
error detail messages conform more closely to style guidelines, fix
failure to expose some of them for translation, ensure compiler can
check formats against supplied parameters.

This reverts commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724. Per
discussion, it does not seem like a good idea to allow committed changes to
go un-checkpointed indefinitely, as could happen in a low-traffic server;
that makes us entirely reliant on the WAL stream with no redundancy that
might aid data recovery in case of disk failure.
This re-introduces the original problem of hot-standby setups generating a
small continuing stream of WAL traffic even when idle, but there are other
ways to address that without compromising crash recovery, so we'll revisit
that issue in a future release cycle.

Aside from adjusting the documentation to say that these are deprecated,
we now report a warning (not an error) for use of GLOBAL, since it seems
fairly likely that we might change that to request SQL-spec-compliant temp
table behavior in the foreseeable future. Although our handling of LOCAL
is equally nonstandard, there is no evident interest in ever implementing
SQL modules, and furthermore some other products interpret LOCAL as
behaving the same way we do. So no expectation of change and no warning
for LOCAL; but it still seems a good idea to deprecate writing it.
Noah Misch

The simplest way to handle this is just to copy-and-paste the relevant
code block in fork_process.c, so that's what I did. (It's possible that
something more complicated would be useful to packagers who want to work
with either the old or the new API; but at this point the number of such
people is rapidly approaching zero, so let's just get the minimal thing
done.) Update relevant documentation as well.

In pg_upgrade, verify that the install user has the same oid on both clusters, and make sure the new cluster has no additional users.

Remove a couple of items that were actually back-patched bug fixes.
Add additional details to a couple of items which lacked a description.
Improve attributions for a couple of items I was involved with.
A few other miscellaneous corrections.

Commit aaa6e1def292cdacb6b27088898793b1b879fedf introduced multiple hazards
in the case where pg_ctl is executed with neither a -D switch nor any
PGDATA environment variable. It would dump core on machines which are
unforgiving about printf("%s", NULL), or failing that possibly give a
rather unhelpful complaint about being unable to execute "postgres -C",
rather than the logically prior complaint about not being told where the
data directory is.
Edmund Horner's report suggests that there is another, Windows-specific
hazard here, but I'm not the person to fix that; it would in any case only
be significant when trying to use a config-only PGDATA pointer.

getopt_long() allows abbreviating long options, so we might as well
give the option the full name, and users can abbreviate it how they
like.
Do some general polishing of the --help output at the same time.

This prevents a pg_basebackup backup session that just does a base
backup (no xlog involved at all) from becoming the synchronous slave
and thus blocking all access while it runs.
Also fixes the problem when a higher priority slave shows up it would
become the sync standby before it has reached the STREAMING state, by
making sure we can only switch to a walsender that's actually STREAMING.
Fujii Masao

Fix bug in early startup of Hot Standby with subtransactions. When HS startup is deferred because of overflowed subtransactions, ensure that we re-initialize KnownAssignedXids for when both existing and incoming snapshots have non-zero qualifying xids.

This provides a speedup of about 4X when NBuffers is large enough.
There is also a useful reduction in sinval traffic, since we
only do CacheInvalidateSmgr() once not once per fork.
Simon Riggs, reviewed and somewhat revised by Tom Lane

DropRelFileNodeBuffers, DropDatabaseBuffers, FlushRelationBuffers, and
FlushDatabaseBuffers have to scan the whole shared_buffers pool because
we have no index structure that would find the target buffers any more
efficiently than that. This gets expensive with large NBuffers. We can
shave some cycles from these loops by prechecking to see if the current
buffer is interesting before we acquire the buffer header lock.
Ordinarily such a test would be unsafe, but in these cases it should be
safe because we are already assuming that the caller holds a lock that
prevents any new target pages from being loaded into the buffer pool
concurrently. Therefore, no buffer tag should be changing to a value of
interest, only away from a value of interest. So a false negative match
is impossible, while a false positive is safe because we'll recheck after
acquiring the buffer lock. Initial testing says that this speeds these
loops by a factor of 2X to 3X on common Intel hardware.
Patch for DropRelFileNodeBuffers by Jeff Janes (based on an idea of
Heikki's); extended to the remaining sequential scans by Tom Lane

Wake WALSender to reduce data loss at failover for async commit. WALSender now woken up after each background flush by WALwriter, avoiding multi-second replication delay for an all-async commit workload. Replication delay reduced from 7s with default settings to 200ms and often much less, allowing significantly reduced data loss at failover.

In lazy_scan_heap, we could issue bogus warnings about incorrect
information in the visibility map, because we checked the visibility
map bit before locking the heap page, creating a race condition. Fix
by rechecking the visibility map bit before we complain. Rejigger
some related logic so that we rely on the possibly-outdated
all_visible_according_to_vm value as little as possible.
In heap_multi_insert, it's not safe to clear the visibility map bit
before beginning the critical section. The visibility map is not
crash-safe unless we treat clearing the bit as a critical operation.
Specifically, if the transaction were to error out after we set the
bit and before entering the critical section, we could end up writing
the heap page to disk (with the bit cleared) and crashing before the
visibility map page made it to disk. That would be bad. heap_insert
has this correct, but somehow the order of operations got rearranged
when heap_multi_insert was added.
Also, add some more comments to visibilitymap_test, lazy_scan_heap,
and IndexOnlyNext, expounding on concurrency issues.
Per extensive code review by Andres Freund, and further review by Tom
Lane, who also made the original report about the bogus warnings.

The original coding misbehaved if "char" is signed, and also made the
extremely poor decision to print control characters literally when trying
to complain about them. Report and patch by Shigeru Hanada.
In passing, also fix core dump risk in report_parse_error() should the
parse state be something other than what it expects.

It failed to check for error return from xsltApplyStylesheet(), as reported
by Peter Gagarinov. (So far as I can tell, libxslt provides no convenient
way to get a useful error message in failure cases. There might be some
inconvenient way, but considering that this code is deprecated it's hard to
get enthusiastic about putting lots of work into it. So I just made it say
"failed to apply stylesheet", in line with the existing error checks.)
While looking at the code I also noticed that the string returned by
xsltSaveResultToString was never freed, resulting in a session-lifespan
memory leak.
Back-patch to all supported versions.

Fix memory leaks in failure paths in buildACLCommands and parseAclItem.

This is currently only cosmetic, since all the call sites just curl up
and die in event of a failure return. It might be important for some
future use-case, though, and in any case it quiets warnings from the
clang static analyzer (as reported by Anna Zaks).
Josh Kupershmidt

In pg_upgrade, report pre-PG 8.1 plpython helper functions left in the public schema that no longer point to valid shared object libraries, and suggest a solution.

Avoid early reuse of btree pages, causing incorrect query results. When we allowed read-only transactions to skip assigning XIDs we introduced the possibility that a fully deleted btree page could be reused. This broke the index link sequence which could then lead to indexscans silently returning fewer rows than would have been correct. The actual incidence of silent errors from this is thought to be very low because of the exact workload required and locking pre-conditions. Fix is to remove pages only if index page opaque->btpo.xact precedes RecentGlobalXmin.

We allow non-superusers to create procedural languages (with restrictions)
and range datatypes. Previously, the automatically-created support
functions for these objects ended up owned by the creating user. This
represents a rather considerable security hazard, because the owning user
might be able to alter a support function's definition in such a way as to
crash the server, inject trojan-horse SQL code, or even execute arbitrary
C code directly. It appears that right now the only actually exploitable
problem is the infinite-recursion bug fixed in the previous patch for
CVE-2012-2655. However, it's not hard to imagine that future additions of
more ALTER FUNCTION capability might unintentionally open up new hazards.
To forestall future problems, cause these support functions to be owned by
the bootstrap superuser, not the user creating the parent object.

It's not very sensible to set such attributes on a handler function;
but if one were to do so, fmgr.c went into infinite recursion because
it would call fmgr_security_definer instead of the handler function proper.
There is no way for fmgr_security_definer to know that it ought to call the
handler and not the original function referenced by the FmgrInfo's fn_oid,
so it tries to do the latter, causing the whole process to start over
again.
Ordinarily such misconfiguration of a procedural language's handler could
be written off as superuser error. However, because we allow non-superuser
database owners to create procedural languages and the handler for such a
language becomes owned by the database owner, it is possible for a database
owner to crash the backend, which ideally shouldn't be possible without
superuser privileges. In 9.2 and up we will adjust things so that the
handler functions are always owned by superusers, but in existing branches
this is a minor security fix.
Problem noted by Noah Misch (after several of us had failed to detect
it :-(). This is CVE-2012-2655.

Expand the allowed range of timezone offsets to +/-15:59:59 from Greenwich.

We used to only allow offsets less than +/-13 hours, then it was +/14,
then it was +/-15. That's still not good enough though, as per today's bug
report from Patric Bechtel. This time I actually looked through the Olson
timezone database to find the largest offsets used anywhere. The winners
are Asia/Manila, at -15:56:00 until 1844, and America/Metlakatla, at
+15:13:42 until 1867. So we'd better allow offsets less than +/-16 hours.
Given the history, we are way overdue to have some greppable #define
symbols controlling this, so make some ... and also remove an obsolete
comment that didn't get fixed the last time.
Back-patch to all supported branches.

First, the previous code failed to account for the fact that, during Hot
Standby operation, the startup process takes AccessExclusiveLocks on
relations without setting MyDatabaseId. This resulted in fast path
strong lock counts failing to be incremented with the startup process
took locks, which in turn allowed conflicting lock requests to succeed
when they should not have. Report by Erik Rijkers, diagnosis by Heikki
Linnakangas.
Second, LockReleaseAll() failed to honor the allLocks and lockmethodid
restrictions with respect to fast-path locks. It's not clear to me
whether this produces any user-visible breakage at the moment, but it's
certainly wrong. Rearrange order of operations in LockReleaseAll to fix.
Noted by Tom Lane.

Overly tight coding caused the password transformation loop to stop
examining input once it had processed a byte equal to 0x80. Thus, if the
given password string contained such a byte (which is possible though not
highly likely in UTF8, and perhaps also in other non-ASCII encodings), all
subsequent characters would not contribute to the hash, making the password
much weaker than it appears on the surface.
This would only affect cases where applications used DES crypt() to encode
passwords before storing them in the database. If a weak password has been
created in this fashion, the hash will stop matching after this update has
been applied, so it will be easy to tell if any passwords were unexpectedly
weak. Changing to a different password would be a good idea in such a case.
(Since DES has been considered inadequately secure for some time, changing
to a different encryption algorithm can also be recommended.)
This code, and the bug, are shared with at least PHP, FreeBSD, and OpenBSD.
Since the other projects have already published their fixes, there is no
point in trying to keep this commit private.
This bug has been assigned CVE-2012-2143, and credit for its discovery goes
to Rubin Xu and Joseph Bonneau.

We used to mimic the way a stack is constructed when descending the tree
during normal GiST inserts, but that was quite complicated during a buffered
build. It was also wrong: in GiST, the left-to-right relationships on
different levels might not match each other, so that when you know the
parent of a child page, you won't necessarily find the parent of the page to
the right of the child page by following the rightlinks at the parent level.
This sometimes led to "could not re-find parent" errors while building a
GiST index.
We now use a simple hash table to track the parent of every internal page.
Whenever a page is split, and downlinks are moved from one page to another,
we update the hash table accordingly. This is also better for performance
than the old method, as we never need to move right to re-find the parent
page, which could take a significant amount of time for buffers that were
created much earlier in the index build.

Delete the temporary file used in buffered GiST build, after the build.

There were two bugs here: We forgot to call gistFreeBuildBuffers() function
at the end of build, and we passed interXact == true to BufFileCreateTemp,
so the file wasn't automatically cleaned up at end-of-transaction either.

The initial implementation of pg_dump's --section option supposed that the
existing --schema-only and --data-only options could be made equivalent to
--section settings. This is wrong, though, due to dubious but long since
set-in-stone decisions about where to dump SEQUENCE SET items, as seen in
bug report from Martin Pitt. (And I'm not totally convinced there weren't
other bugs, either.) Undo that coupling and instead drive --section
filtering off current-section state tracked as we scan through the TOC
list to call _tocEntryRequired().
To make sure those decisions don't shift around and hopefully save a few
cycles, run _tocEntryRequired() only once per TOC entry and save the result
in a new TOC field. This required minor rejiggering of ACL handling but
also allows a far cleaner implementation of inhibit_data_for_failed_table.
Also, to ensure that pg_dump and pg_restore have the same behavior with
respect to the --section switches, add _tocEntryRequired() filtering to
WriteToc() and WriteDataChunks(), rather than trying to implement section
filtering in an entirely orthogonal way in dumpDumpableObject(). This
required adjusting the handling of the special ENCODING and STDSTRINGS
items, but they were pretty weird before anyway.
Minor other code review for the patch, too.

The result of (maintenance_work_mem * 1024) / BLCKSZ doesn't fit in a signed
32-bit integer, if maintenance_work_mem >= 2GB. Use double instead. And
while we're at it, write the calculations in an easier to understand form,
with the intermediary steps written out and commented.

AbortOutOfAnyTransaction failed to do anything if the state it saw on
entry corresponded to failing partway through StartTransaction. I fixed
AbortCurrentTransaction to cope with that case way back in commit
60b2444cc3ba037630c9b940c3c9ef01b954b87b, but evidently overlooked that
AbortOutOfAnyTransaction should do likewise.
Back-patch to all supported branches. It's not clear that this omission
has any more-than-cosmetic consequences, but it's also not clear that it
doesn't, so back-patching seems the least risky choice.

This patch fixes three places (which AFAICT is all of them) where runtime
was O(N^2) in the number of TOC entries, by using an index array to replace
linear searches of the TOC list. This performance issue is a bit less bad
than those recently fixed, because it depends on the number of items dumped
not the number in the source database, so the problem can be dodged by
doing partial dumps.
The previous coding already had an instance of one of the two index arrays
needed, but it was only calculated in parallel-restore cases; now we need
it all the time. I also chose to move the arrays into the ArchiveHandle
data structure, to make this code a bit more ready for the day that we
try to sling multiple ArchiveHandles around in pg_dump or pg_restore.
Since we still need some server-side work before pg_dump can really cope
nicely with tens of thousands of tables, there's probably little point in
back-patching.

Drop special handling of host component with slashes to mean
Unix-domain socket. Specify it as separate parameter or using
percent-encoding now.
Allow omitting username, password, and port even if the corresponding
designators are present in URI.
Handle percent-encoding in query parameter keywords.
Alex Shulgin
some documentation improvements by myself

Write the file to a temporary name and then rename() it into the
permanent name, to ensure it can't end up half-written and corrupt
in case of a crash during shutdown.
Unlink the file after it has been read so it's removed from the data
directory and not included in base backups going to replication slaves.

The only interesting-for-performance case wherein we force heapscan here
is when we're rebuilding the relcache init file, and the only such case
that is likely to be examining a catalog big enough to be syncscanned is
RelationBuildTupleDesc. But the early-exit optimization in that code gets
broken if we start the scan at a random place within the catalog, so that
allowing syncscan is actually a big deoptimization if pg_attribute is large
(at least for the normal case where the rows for core system catalogs have
never been changed since initdb). Hence, prevent syncscan here. Per my
testing pursuant to complaints from Jeff Frost and Greg Sabino Mullane,
though neither of them seem to have actually hit this specific problem.
Back-patch to 8.3, where syncscan was introduced.

Fix string truncation to be multibyte-aware in text_name and bpchar_name.

Previously, casts to name could generate invalidly-encoded results.
Also, make these functions match namein() more exactly, by consistently
using palloc0() instead of ad-hoc zeroing code.
Back-patch to all supported branches.
Karl Schnaitter and Tom Lane

The previous coding presented a significant bottleneck when dumping
databases containing many thousands of schemas, since the total time
spent searching would increase roughly as O(N^2) in the number of objects.
Noted by Jeff Janes, though I rewrote his proposed patch to use the
existing findObjectByOid infrastructure.
Since this is a longstanding performance bug, backpatch to all supported
versions.

When backing up from a standby server, the backup process
will not automatically switch xlog segment. So we must
accept a partially transferred xlog file in this case, but
rename it into position anyway.
In passing, merge the two callbacks for segment end and
stop stream into a single callback, since their implementations
were close to identical, and rename this callback to
reflect that it stops streaming rather than continues it.
Patch by Magnus Hagander, review by Fujii Masao

On Windows, have pg_upgrade use different two files to log pg_ctl start/stop output, to fix file share error reported by Edmund Horner

zaptreesubs() was coded to unconditionally reset a capture subre's
corresponding pmatch[] entry. However, in regexes without backrefs, that
array is caller-supplied and might not have as many entries as the regex
has capturing parens. So check the array length and do nothing if there
is no corresponding entry, much as subset() does. Failure to check this
resulted in a stack clobber in the case reported by Marko Kreen.
This bug appears to have been latent in the regex library from the
beginning. It was not exposed because find() called dissect() not
cdissect(), and the dissect() code path didn't ever call zaptreesubs()
(formerly zapmem()). When I unified dissect() and cdissect() in commit
4dd78bf37aa29d04b3f358b08c4a2fa43cf828e7, the problem was exposed.
Now that I've seen this, I'm rather suspicious that we might need to
back-patch it; but will refrain for now, for lack of evidence that
the case can be hit in the previous coding.

If a seqscan encounters many consecutive pages containing only dead tuples,
it can remain in the loop in heapgettup for a long time, and there was no
CHECK_FOR_INTERRUPTS anywhere in that loop. This meant there were
real-world situations where a query would be effectively uncancelable for
long stretches. Add a check placed to occur once per page, which should be
enough to provide reasonable response time without adding any measurable
overhead.
Report and patch by Merlin Moncure (though I tweaked it a bit).
Back-patch to all supported branches.

When the column name is an unqualified name, rather than table.column,
the error message complains about too many dotted names, which is
wrong. Report by Peter Eisentraut based on examination of the
sepgsql regression test output, but the problem also affects COMMENT.
New wording as suggested by Tom Lane.

In commit d526575f893c1a4e05ebd307e80203536b213a6d, we changed things so
that buffer usage counts are incremented when the buffer is pinned, rather
than when it is unpinned, but the README file didn't get the memo.
Report by Amit Kapila.

Move postmaster’s RemovePgTempFiles call to a less randomly chosen place.

There is no reason to do this as early as possible in postmaster startup,
and good reason not to do it until we have completely created the
postmaster's lock file, namely that it might contribute to pg_ctl thinking
that postmaster startup has timed out. (This would require a rather
unusual amount of time to be spent scanning temp file directories, but we
have at least one field report of it happening reproducibly.)
Back-patch to 9.1. Before that, pg_ctl didn't wait for additional info to
be added to the lock file, so it wasn't a problem.
Note that this is not a complete fix to the slow-start issue in 9.1,
because we still had identify_system_timezone being run during postmaster
start in 9.1. But that's at least a reasonably well-defined delay, with
an easy workaround if needed, whereas the temp-files scan is not so
predictable and cannot be avoided.

Per discussion, we should explain that we follow RFC 3339 and not really
the letter of the ISO 8601 spec for timestamp output format. Mostly
Brendan Jurd's wording, though I tweaked it to clarify that we do take 'T'
on input. Minor additional copy-editing and markup-tweaking, too.

When we create a temporary copy of the old node buffer, in stack, we mustn't
leak that into any of the long-lived data structures. Before this patch,
when we called gistPopItupFromNodeBuffer(), it got added to the array of
"loaded buffers". After gistRelocateBuildBuffersOnSplit() exits, the
pointer added to the loaded buffers array points to garbage. Often that goes
unnotied, because when we go through the array of loaded buffers to unload
them, buffers with a NULL pageBuffer are ignored, which can often happen by
accident even if the pointer points to garbage.
This patch fixes that by marking the temporary copy in stack explicitly as
temporary, and refrain from adding buffers marked as temporary to the array
of loaded buffers.
While we're at it, initialize nodeBuffer->pageBlocknum to InvalidBlockNumber
and improve comments a bit. This isn't strictly necessary, but makes
debugging easier.

Per recent discussion, the error message for this was actually a trifle
inaccurate, since it said "cannot be cast" which might be incorrect.
Adjust that wording, and add a HINT suggesting that a USING clause might
be needed.

We have no need for a timeout here really, but some broken products from
Redmond seem to lose FD_READ events occasionally, and waking up and
retrying the recv() is the only known way to work around that. Perhaps
somebody will be motivated to figure out a better answer here; but not I.

The BSD-ish members of the buildfarm all seem to think removing this
was a bad idea. It looks to me like it resulted in omitting the system
header inclusion necessary to detect the fields of struct tm correctly.

Assert that WaitLatchOrSocket callers cannot wait only for writability.

Since we have chosen to report socket EOF and error conditions via the
WL_SOCKET_READABLE flag bit, it's unsafe to wait only for
WL_SOCKET_WRITEABLE; the caller would never be notified of the socket
condition, and in some of these implementations WaitLatchOrSocket would
busy-wait until something else happens. Add this restriction to the API
specification, and add Asserts to check that callers don't try to do that.
At some point we might want to consider adjusting the API to relax this
restriction, but until we have an actual use case for waiting on a
write-only socket, it seems premature to design a solution.

ENABLE_DTRACE unused as of a7b7b07af340c73adee9959edf260695591a9496
HAVE_ERR_SET_MARK unused as of 4ed4b6c54e5fab24ab2624d80e26f7546edc88ad
HAVE_FCVT unused as of 4553e1d80f824291932cfde30aa24a76dd8f1941
HAVE_STRUCT_SOCKADDR_UN unused as of b4cea00a1fc9d2270bfe9aeeee44915378d5f733
HAVE_SYSCONF unused as of f83356c7f574bc69969f29dc7b430b286a0cd9f4
TM_IN_SYS_TIME never used, obsolescent per Autoconf documentation

Test results from buildfarm members mastodon/narwhal (Windows Server 2003)
make it look like that platform just plain loses FD_READ events
occasionally, and the only reason our previous coding seemed to work was
that it timed out every couple of seconds and retried the whole operation.
Try to verify this by reinserting a finite timeout into the pgstat loop.
This isn't meant to be a permanent patch either, just to confirm or
disprove a theory.

This should get rid of the usage of pgwin32_waitforsinglesocket entirely,
and perhaps thereby remove the race condition that's evidently still
present on some versions of Windows. The previous arrangement was a bit
unsafe anyway, since waiting at the recv() would not allow pgstat to notice
postmaster death.

When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.
Noah Misch

Log main-loop blocking events and the results of inquiry messages.
This is to get some clarity as to what's happening on those Windows
buildfarm members that still don't like the latch-ified stats collector.
This bulks up the postmaster log a tad, so I won't leave it in place for
long.

If the tablespace directory is missing entirely, we allow DROP TABLESPACE
to go through, on the grounds that it should be possible to clean up the
catalog entry in such a situation. However, we forgot that the pg_tblspc
symlink might still be there. We should try to remove the symlink too
(but not fail if it's no longer there), since not doing so can lead to
weird behavior subsequently, as per report from Michael Nolan.
There was some discussion of adding dependency links to prevent DROP
TABLESPACE when the catalogs still contain references to the tablespace.
That might be worth doing too, but it's an orthogonal question, and in
any case wouldn't be back-patchable.
Back-patch to 9.0, which is as far back as the logic looks like this.
We could possibly do something similar in 8.x, but given the lack of
reports I'm not sure it's worth the trouble, and anyway the case could
not arise in the form the logic is meant to cover (namely, a post-DROP
transaction rollback having resurrected the pg_tablespace entry after
some or all of the filesystem infrastructure is gone).

Make sure WaitLatchOrSocket regards FD_CLOSE as a read-ready condition.
We might want to tweak this further, but it was surely wrong as-is.
Make pgwin32_waitforsinglesocket detach its private event object from the
passed socket before returning. I suspect that failure to do so leads
to race conditions when other code (such as WaitLatchOrSocket) attaches
a different event object to the same socket. Moreover, the existing
coding meant that repeated calls to pgwin32_waitforsinglesocket would
perform ResetEvent on an event actively connected to a socket, which
is rumored to be an unsafe practice; the WSAEventSelect documentation
appears to recommend against this, though it does not say not to do it
in so many words.
Also, uniformly use the coding pattern "WSAEventSelect(s, NULL, 0)" to
detach events from sockets, rather than passing the event in the second
parameter. The WSAEventSelect documentation says that the second parameter
is ignored if the third is 0, so theoretically this should make no
difference. However, elsewhere on the same reference page the use of NULL
in this context is recommended, and I have found suggestions on the net
that some versions of Windows have bugs with a non-NULL second parameter
in this usage.
Some other mostly-cosmetic cleanup, such as using the right one of
WSAGetLastError and GetLastError for reporting errors from these functions.

syslogger was coded to wake up once per second whether there was anything
useful to do or not. As part of our campaign to reduce the server's idle
power consumption, change it to use a latch for waiting. Now, in the
absence of any data to log or any signals to service, it will only wake up
at the programmed logfile rotation times (if any).

These were apparently never used. The AC_SUBST was probably just
added in a copy-and-paste manner. (The shell variables continue to be
used inside configure. The change is just that we don't need them
outside of configure.)

When using poll(), EOF on a socket is reported with the POLLHUP not
POLLIN flag (at least on Linux). WaitLatchOrSocket failed to check
this bit, causing it to go into a busy-wait loop if EOF occurs.
We earlier fixed the same mistake in the test for the state of the
postmaster_alive socket, but missed it for the caller-supplied socket.
Fortunately, this error is new in 9.2, since 9.1 only had a select()
based code path not a poll() based one.

The string representation of ImportError changed. Remove printing
that; it's not necessary for the test.
The order in which members of a dict are printed changed. But this
was always implementation-dependent, so we have just been lucky for a
long time. Do the printing the hard way to ensure sorted order.

We previously recognized that citext wouldn't get marked as collatable
during pg_upgrade from a pre-9.1 installation, and hacked its
create-from-unpackaged script to manually perform the necessary catalog
adjustments. However, we overlooked the fact that domains over citext,
as well as the citext[] array type, need the same adjustments. Extend
the script to handle those cases.
Also, the documentation suggested that this was only an issue in pg_upgrade
scenarios, which is quite wrong; loading any dump containing citext from a
pre-9.1 server will also result in the type being wrongly marked.
I approached the documentation problem by changing the 9.1.2 release note
paragraphs about this issue, which is historically inaccurate. But it
seems better than having the information scattered in multiple places, and
leaving incorrect info in the 9.1.2 notes would be bad anyway. We'll still
need to mention the issue again in the 9.1.4 notes, but perhaps they can
just reference 9.1.2 for fix instructions.
Per report from Evan Carroll. Back-patch into 9.1.

When inserting the downlinks for a split gist page, we used hold the locks
on the child pages until the insertion into the parent - and recursively its
parent if it had to be split too - were all completed. Change that so that
the locks on child pages are released after the insertion in the immediate
parent is done, before recursing further up the tree.
This reduces the number of lwlocks that are held simultaneously. Holding
many locks is bad for concurrency, and in extreme cases you can even hit
the limit of 100 simultaneously held lwlocks in a backend. If you're really
unlucky, you can hit the limit while in a critical section, which brings
down the whole system.
This fixes bug #6629 reported by Tom Forbes. Backpatch to 9.1. The page
splitting code was rewritten in 9.1, and the old code did not have this
problem.

This patch reverts commit 49340037ee3ab46cb24144a86705e35f272c24d5 and some
follow-on tweaking in pgstat.c. While the basic scheme of latch-ifying the
stats collector seems sound enough, it's failing on most Windows buildfarm
members for unknown reasons, and there's no time left to debug that before
9.2beta1. Better to ship a beta version without this improvement. I hope
to re-revert this once beta1 is out, though.

Per a suggestion from Peter Geoghegan, make WaitLatch responsible for
verifying that the WL_POSTMASTER_DEATH bit it returns is truthful (by
testing PostmasterIsAlive). Then simplify its callers, who no longer
need to do that for themselves. Remove weasel wording about falsely-set
result bits from WaitLatch's API contract.

The old way of implementing slicing support by implementing
PySequenceMethods.sq_slice no longer works in Python 3. You now have
to implement PyMappingMethods.mp_subscript. Do this by simply
proxying the call to the wrapped list of result dictionaries.
Consolidate some of the subscripting regression tests.
Jan Urbański

The original coding failed to reset ImmediateInterruptOK before returning,
which would potentially allow a subsequent query-cancel interrupt to be
accepted at an unsafe point. This is a really nasty bug since it's so hard
to predict the consequences, but they could be unpleasant.
Also, ensure that signal handlers are serviced before this function
returns, even if the semaphore is already set. This should make the
behavior more like Unix.
Back-patch to all supported versions.

Ensure that signal handlers are serviced before this function returns.
This should make the behavior more like Unix. Also, add some more
error checking, and make some other cosmetic improvements.
No back-patch since it's not clear whether this is fixing any live bug
that would affect 9.1. I'm more concerned about 9.2 anyway given our
considerable recent expansions in the usage of WaitLatch.

It was already on its last legs, and it turns out that it was
accidentally broken in commit 89e850e6fda9e4e441712012abe971fe938d595a
and no one cared. So remove the rest the support for it and update
the documentation to indicate that Python 2.3 is now required.

In checkpointer and walwriter, avoid calling PostmasterIsAlive unless
WaitLatch has reported WL_POSTMASTER_DEATH. This saves a kernel call per
iteration of the process's outer loop, which is not all that much, but a
cycle shaved is a cycle earned. I had already removed the unconditional
PostmasterIsAlive calls in bgwriter and pgstat in previous patches, but
forgot that WL_POSTMASTER_DEATH is supposed to be treated as untrustworthy
(per comment in unix_latch.c); so adjust those two cases to match.
There are a few other places where the same idea might be applied, but only
after substantial code rearrangement, so I didn't bother.

Commit 6d90eaaa89a007e0d365f49d6436f35d2392cfeb added a hibernation mode
to the bgwriter to reduce the server's idle-power consumption. However,
its interaction with the detailed behavior of BgBufferSync's feedback
control loop wasn't very well thought out. That control loop depends
primarily on the rate of buffer allocation, not the rate of buffer
dirtying, so the hibernation mode has to be designed to operate only when
no new buffer allocations are happening. Also, the check for whether the
system is effectively idle was not quite right and would fail to detect
a constant low level of activity, thus allowing the bgwriter to go into
hibernation mode in a way that would let the cycle time vary quite a bit,
possibly further confusing the feedback loop. To fix, move the wakeup
support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into
StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode
unless no buffer allocations have happened recently.
In addition, fix the delaying logic to remove the problem of possibly not
responding to signals promptly, which was basically caused by trying to use
the process latch's is_set flag for multiple purposes. I can't prove it
but I'm suspicious that that hack was responsible for the intermittent
"postmaster does not shut down" failures we've been seeing in the buildfarm
lately. In any case it did nothing to improve the readability or
robustness of the code.
In passing, express the hibernation sleep time as a multiplier on
BgWriterDelay, not a constant. I'm not sure whether there's any value in
exposing the longer sleep time as an independently configurable setting,
but we can at least make it act like this for little extra code.

Every time since the current rule for postgres.bki was put in place
when we change the major version, people complain that their tests
fail in strange ways. This is because the version number in
postgres.bki is not updated, because it has no dependency for that.
And you can't even force the rebuild manually if you don't happen to
know which file has the problem. Fix that now before it will happen
again.
The only remaining problem with switching major versions, as far as
the regression tests are concerned, is that contrib needs to be
rebuilt. But that's easily invoked, and in any case the failure modes
are more friendly if you forget that.

Create separate appendixes for contrib extensions and other server
plugins on the one hand, and utility programs on the other. Recast
the documentation of the latter as refentries, so that man pages are
generated.

Users of asynchronous-commit mode expect there to be a guaranteed maximum
delay before an async commit's WAL records get flushed to disk. The
original version of the walwriter hibernation patch broke that. Add an
extra shared-memory flag to allow async commits to kick the walwriter out
of hibernation mode, without adding any noticeable overhead in cases where
no action is needed.

Latch-ify the stats collector, so that it does not need an arbitrary wakeup
cycle to check for postmaster death. The incremental savings in idle power
is pretty marginal, since we only had it waking every two seconds; but I
believe that this patch may also improve the collector's performance under
load, by reducing the number of kernel calls made per message when messages
are arriving constantly (we now avoid a select/poll call except when we
need to sleep). The change also reduces the time needed for a normal
database shutdown on platforms where signals don't interrupt select().

Reduce idle power consumption of walwriter and checkpointer processes.

This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption. It reverts to
normal as soon as it's done any successful flushes. It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.
Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle. This also is primarily useful for
reducing the server's power consumption when idle.
In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch. Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.
Peter Geoghegan, somewhat simplified by Tom

This adds the variable COMP_KEYWORD_CASE, which controls in what case
keywords are completed. This is partially to let users configure the
change from commit 69f4f1c3576abc535871c6cfa95539e32a36120f, but it
also offers more behaviors than were available before.

Because they use their own compilation rule, they don't use the
dependency tracking logic from Makefile.global. To make sure that
dependency tracking works anyway for the *_srv.o files, depend on
their *.o siblings as well, which do have proper dependencies. It's a
hack that might fail someday if there is a *_srv.o without a
corresponding *.o, but it works for now (and those would probably go
into src/backend/port/ anyway).

According to the Autoconf documentation, there should be a make rule
pg_config.h: stamp-h
so that with the right setup around this, a change in pg_config.h.in
will trigger a rebuild of everything that depends on pg_config.h. But
this doesn't always work, sometimes you need to run make twice to get
everything up to date after a change of pg_config.h.in.
The fix is to write the rule as
pg_config.h: stamp-h ;
instead (with an empty command instead of no command). This is what
Automake-generated makefiles effectively do, so it seems safe to be on
this side.
It's not actually clear why this is (apparently) more correct. It's
been posted to
<http://lists.gnu.org/archive/html/help-make/2012-04/msg00058.html>
without response so far.

"Unexpected EOF on client connection" without an open transaction
is mostly noise, so turn it into DEBUG1. With an open transaction it's
still indicating a problem, so keep those as ERROR, and change the message
to indicate that it happened in a transaction.

Document that it is the pgsql version we are matching for psqlrc version-specific files, not the server version.

Commit 62c7bd31c8878dd45c9b9b2429ab7a12103f3590 had assorted problems, most
visibly that it broke PREPARE TRANSACTION in the presence of session-level
advisory locks (which should be ignored by PREPARE), as per a recent
complaint from Stephen Rees. More abstractly, the patch made the
LockMethodData.transactional flag not merely useless but outright
dangerous, because in point of fact that flag no longer tells you anything
at all about whether a lock is held transactionally. This fix therefore
removes that flag altogether. We now rely entirely on the convention
already in use in lock.c that transactional lock holds must be owned by
some ResourceOwner, while session holds are never so owned. Setting the
locallock struct's owner link to NULL thus denotes a session hold, and
there is no redundant marker for that.
PREPARE TRANSACTION now works again when there are session-level advisory
locks, and it is also able to transfer transactional advisory locks to the
prepared transaction, but for implementation reasons it throws an error if
we hold both types of lock on a single lockable object. Perhaps it will be
worth improving that someday.
Assorted other minor cleanup and documentation editing, as well.
Back-patch to 9.1, except that in the 9.1 branch I did not remove the
LockMethodData.transactional flag for fear of causing an ABI break for
any external code that might be examining those structs.

The default for the choice attribute of the <arg> element is "opt",
which would normally put the argument inside brackets. But the DSSSL
stylesheets contain a hack that treats <arg> directly inside <group>
specially, so that <group><arg>-x</arg><arg>-y</arg></group> comes out
as [ -x | -y ] rather than [ [-x] | [-y] ], which it would technically
be. But when building man pages, this doesn't work, and so the
command synopses on the man pages contain lots of extra brackets.
By putting choice="opt" or choice="plain" explicitly on every <arg>
and <group> element, we avoid any toolchain dependencies like that,
and it also makes it clearer in the source code what is meant.
In passing, make some small corrections in the documentation about
which arguments are really optional or not.

Remove BSD/OS (BSDi) port. There are no known users upgrading to Postgres 9.2, and perhaps no existing users either.

Add test cases for inline handler of plython2u (when using that
language name), and for result object element assignment. There is
now at least one test case for every top-level functionality, except
plpy.Fatal (annoying to use in regression tests) and result object
slice retrieval and slice assignment (which are somewhat broken).

Allocate PLyResultObject.tupdesc in TopMemoryContext, because its
lifetime is the lifetime of the Python object and it shouldn't be
freed by some other memory context, such as one controlled by SPI. We
trust that the Python object will clean up its own memory.
Before, this would crash the included regression test case by trying
to use memory that was already freed.
reported by Asif Naeem, analysis by Tom Lane

At the time we check whether the tuple is dead to all running
transactions, we've already verified that it isn't visible to our
scan, setting hint bits if appropriate. So there's no need to
recheck CLOG for the all-dead test we do just a moment later.
So, add HeapTupleIsSurelyDead() to test the appropriate condition
under the assumption that all relevant hit bits are already set.
Review by Tom Lane.

Remove the following ports:
- dgux
- nextstep
- sunos4
- svr4
- ultrix4
- univel
These are obsolete and not worth rescuing. In most cases, there is
circumstantial evidence that they wouldn't work anymore anyway.

Add more markup in particular so that the command options appear
consistently in monospace in the HTML output.
On the vacuumdb reference page, remove listing all the possible
options in the synopsis. They have become too many now; we have the
detailed options list for that.

This patch adjusts the core statistics views to match the decision already
taken for pg_stat_statements, that values representing elapsed time should
be represented as float8 and measured in milliseconds. By using float8,
we are no longer tied to a specific maximum precision of timing data.
(Internally, it's still microseconds, but we could now change that without
needing changes at the SQL level.)
The columns affected are
pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time
pg_stat_xact_user_functions.total_time
pg_stat_xact_user_functions.self_time
The first four of these are new in 9.2, so there is no compatibility issue
from changing them. The others require a release note comment that they
are now double precision (and can show a fractional part) rather than
bigint as before; also their underlying statistics functions now match
the column definitions, instead of returning bigint microseconds.

Get rid of the per-column documentation of underlying functions, which did
far more to clutter the view descriptions than it did to be helpful, and
was rather incomplete and typo-ridden anyway. Instead suggest that people
consult the definitions of the standard views to see the underlying
functions.
The older functions for obtaining individual facts about backends are now
somewhat obsoleted by pg_stat_get_activity, which means that they are not
documented by any standard view. So I put that information into a separate
table. (Maybe we should just deprecate them instead?)
In passing, fix a couple more documentation errors.

Change return type of ExceptionalCondition to void and mark it noreturn

Display total time and I/O timings in milliseconds, for consistency with
the units used for timings in the core statistics views. The columns
remain of float8 type, so that sub-msec precision is available. (At some
point we will probably want to convert the core views to use float8 type
for the same reason, but this patch does not touch that issue.)
This is a release-note-requiring change in the meaning of the total_time
column. The I/O timing columns are new as of 9.2, so there is no
compatibility impact from redefining them.
Do some minor copy-editing in the documentation, too.

Normally whole-row Vars are printed as "tabname.*". However, that does not
work at top level of a targetlist, because per SQL standard the parser will
think that the "*" should result in column-by-column expansion; which is
not at all what a whole-row Var implies. We used to just print the table
name in such cases, which works most of the time; but it fails if the table
name matches a column name available anywhere in the FROM clause. This
could lead for instance to a view being interpreted differently after dump
and reload. Adding parentheses doesn't fix it, but there is a reasonably
simple kluge we can use instead: attach a no-op cast, so that the "*" isn't
syntactically at top level anymore. This makes the printing of such
whole-row Vars a lot more consistent with other Vars, and may indeed fix
more cases than just the reported one; I'm suspicious that cases involving
schema qualification probably didn't work properly before, either.
Per bug report and fix proposal from Abbas Butt, though this patch is quite
different in detail from his.
Back-patch to all supported versions.

If it fails to open a new log file, the syslogger assumes there's something
wrong with its parameters (such as log_directory), and stops attempting
automatic time-based or size-based log file rotations. Sending it SIGHUP
is supposed to start that up again. However, the original coding for that
was really bogus, involving clobbering a couple of GUC variables and hoping
that SIGHUP processing would restore them. Get rid of that technique in
favor of maintaining a separate flag showing we've turned rotation off.
Per report from Mark Kirkwood.
Also, the syslogger will automatically attempt to create the log_directory
directory if it doesn't exist, but that was only happening at startup.
For consistency and ease of use, it should do the same whenever the value
of log_directory is changed by SIGHUP.
Back-patch to all supported branches.

The alternative of disallowing index-only scans in HS operation was
discussed, but the consensus was that it was better to treat marking
a page all-visible as a recovery conflict for snapshots that could still
fail to see XIDs on that page. We may in the future try to soften this,
so that we simply force index scans to do heap fetches in cases where
this may be an issue, rather than throwing a hard conflict.

Get rid of section 8.5.6 (Date/Time Internals), which appears to confuse
people more than it helps, and anyway discussion of Postgres' internal
datetime calculation methods seems pretty out of place here. Instead,
make datatype.sgml just say that we follow the Gregorian calendar (a bit
of specification not previously present anywhere in that chapter :-()
and link to the History of Units appendix for more info. Do some mild
editorialization on that appendix, too, to make it clearer that we are
following proleptic Gregorian calendar rules rather than anything more
historically accurate.
Per a question from Florence Cousin and subsequent discussion in
pgsql-docs.

bitmap_scan_cost_est() has to be able to cope with a BitmapOrPath, but
I'd taken a shortcut that didn't work for that case. Noted by Heikki.
Add some regression tests since this area is evidently under-covered.

Before 9.1, PL/Python functions returning composite types could return
a string and it would be parsed using record_in. The 9.1 changes made
PL/Python only expect dictionaries, tuples, or objects supporting
getattr as output of composite functions, resulting in a regression
and a confusing error message, as the strings were interpreted as
sequences and the code for transforming lists to database tuples was
used. Fix this by treating strings separately as before, before
checking for the other types.
The reason why it's important to support string to database tuple
conversion is that trigger functions on tables with composite columns
get the composite row passed in as a string (from record_out).
Without supporting converting this back using record_in, this makes it
impossible to implement pass-through behavior for these columns, as
PL/Python no longer accepts strings for composite values.
A better solution would be to fix the code that transforms composite
inputs into Python objects to produce dictionaries that would then be
correctly interpreted by the Python->PostgreSQL counterpart code. But
that would be too invasive to backpatch to 9.1, and it is too late in
the 9.2 cycle to attempt it. It should be revisited in the future,
though.
Reported as bug #6559 by Kirill Simonov.
Jan Urbański

We have been seeing intermittent buildfarm failures due to a query
sometimes not using an index-only scan plan, because a background
auto-ANALYZE prevented the table's all-visible bits from being set
immediately, thereby causing the estimated cost of an index-only scan
to go up considerably. Adjust the test case so that a bitmap index scan is
preferred instead, which serves equally well for the purpose the test case
is actually meant for. (Of course, it would be better to eliminate the
interference from auto-ANALYZE, but I see no low-risk way to do that,
so any such fix will have to be left for 9.3 or later.)

setrefs.c failed to do "rtoffset" adjustment of Vars in RETURNING lists,
which meant they were left with the wrong varnos when the RETURNING list
was in a subquery. That was never possible before writable CTEs, of
course, but now it's broken. The executor fails to notice any problem
because ExecEvalVar just references the ecxt_scantuple for any normal
varno; but EXPLAIN breaks when the varno is wrong, as illustrated in a
recent complaint from Bartosz Dmytrak.
Since the eventual rtoffset of the subquery is not known at the time
we are preparing its plan node, the previous scheme of executing
set_returning_clause_references() at that time cannot handle this
adjustment. Fortunately, it turns out that we don't really need to do it
that way, because all the needed information is available during normal
setrefs.c execution; we just have to dig it out of the ModifyTable node.
So, do that, and get rid of the kluge of early setrefs processing of
RETURNING lists. (This is a little bit of a cheat in the case of inherited
UPDATE/DELETE, because we are not passing a "root" struct that corresponds
exactly to what the subplan was built with. But that doesn't matter, and
anyway this is less ugly than early setrefs processing was.)
Back-patch to 9.1, where the problem became possible to hit.

Due to rather sloppy thinking (on my part, I'm afraid) about the
appropriate behavior for boundary conditions, pg_next_dst_boundary() gave
undefined, platform-dependent results when the input time is exactly the
last recorded DST transition time for the specified time zone, as a result
of fetching values one past the end of its data arrays.
Change its specification to be that it always finds the next DST boundary
*after* the input time, and adjust code to match that. The sole existing
caller, DetermineTimeZoneOffset, doesn't actually care about this
distinction, since it always uses a probe time earlier than the instant
that it does care about. So it seemed best to me to change the API to make
the result=1 and result=0 cases more consistent, specifically to ensure
that the "before" outputs always describe the state at the given time,
rather than hacking the code to obey the previous API comment exactly.
Per bug #6605 from Sergey Burladyan. Back-patch to all supported versions.

Prohibiting this outright would break dumps taken from older versions
that contain such casts, which would create far more pain than is
justified here.
Per report by Jaime Casanova and subsequent discussion.

We must set the visibility map bit before releasing our exclusive lock
on the heap page; otherwise, someone might clear the heap page bit
before we set the visibility map bit, leading to a situation where the
visibility map thinks the page is all-visible but it's really not.
This problem has existed since 8.4, but it wasn't critical before we
had index-only scans, since the worst case scenario was that the page
wouldn't get vacuumed until the next scan_all vacuum.
Along the way, a couple of minor, related improvements: (1) if we
pause the heap scan to do an index vac cycle, release any visibility
map page we're holding, since really long-running pins are not good
for a variety of reasons; and (2) warn if we see a page that's marked
all-visible in the visibility map but not on the page level, since
that should never happen any more (it was allowed in previous
releases, but not in 9.2).

Instead of an exact cost comparison, use a fuzzy comparison with 1e-10
delta after all other path metrics have proved equal. This is to avoid
having platform-specific roundoff behaviors determine the choice when
two paths are really the same to our cost estimators. Adjust the
recently-added test case that made it obvious we had a problem here.

The original syntax wasn't universally loved, and it didn't allow its
usage in CREATE TABLE, only ALTER TABLE. It now works everywhere, and
it also allows using ALTER TABLE ONLY to add an uninherited CHECK
constraint, per discussion.
The pg_constraint column has accordingly been renamed connoinherit.
This commit partly reverts some of the changes in
61d81bd28dbec65a6b144e0cd3d0bfe25913c3ac, particularly some pg_dump and
psql bits, because now pg_get_constraintdef includes the necessary NO
INHERIT within the constraint definition.
Author: Nikhil Sontakke
Some tweaks by me

For an initial relation that lacks any join clauses (that is, it has to be
cartesian-product-joined to the rest of the query), we considered only
cartesian joins with initial rels appearing later in the initial-relations
list. This creates an undesirable dependency on FROM-list order. We would
never fail to find a plan, but perhaps we might not find the best available
plan. Noted while discussing the logic with Amit Kapila.
Improve the comments a bit in this area, too.
Arguably this is a bug fix, but given the lack of complaints from the
field I'll refrain from back-patching.

This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.

Previously, we used SetBufferCommitInfoNeedsSave, but that's really
intended for dirty-marks we can theoretically afford to lose, such as
hint bits. As for 9.2, the PD_ALL_VISIBLE mustn't be lost in this
way, since we could then end up with a heap page that isn't
all-visible and a visibility map page that is all visible, causing
index-only scans to return wrong answers.

Mostly, this consists of adding support for fields which exist in the
structure but aren't handled by copy/equal/outfuncs; but the create
foreign table case can actually produce garbage output.
Noah Misch

A number of utility programs were rather careless about paremeters
that can be set via both an option argument and a positional
argument. This leads to results which can violate the Principal
Of Least Astonishment. These changes refuse to use positional
arguments to override settings that have been made via positional
arguments. The changes are backpatched to all live branches.

When using synchronous replication, we waited for the commit record to be
replicated, but if we our transaction didn't write any other WAL records,
that's not required because we don't even flush the WAL locally to disk in
that case. This lead to long waits when committing a transaction that only
modified a temporary table. Bug spotted by Thom Brown.

The header file is needed by any module that wants to use the PL/pgSQL
instrumentation plugin interface. Most notably, the pldebugger plugin needs
this. With this patch, it can be built using pgxs, without having the full
server source tree available.

The output of the new pg_xlog_location_diff function is of type numeric,
since it could theoretically overflow an int8 due to signedness; this
provides a convenient way to format such values.
Fujii Masao, with some beautification by me.

Add description of long options for ‘-c’, ‘-D’, ‘-l’ and ‘-s’. Per discussion of hackers list on 2012/3/10 “missing description initdb manual”.

Remove lots of outdated information that is duplicated by the
better-maintained SGML documentation. In particular, remove the
outdated listing of contrib modules. Update the installation
instructions to mention CREATE EXTENSION, but don't go into too much
detail.

So far as I can tell, it is no longer possible for this heuristic to do
anything useful, because the new weaker definition of
have_relevant_joinclause means that any relation with a joinclause must be
considered joinable to at least one other relation. It would still be
possible for the code block to be entered, for example if there are join
order restrictions that prevent any join of the current level from being
formed; but in that case it's just a waste of cycles to attempt to form
cartesian joins, since the restrictions will still apply.
Furthermore, IMO the existence of this code path can mask bugs elsewhere;
we would have noticed the problem with cartesian joins a lot sooner if
this code hadn't compensated for it in the simplest case.
Accordingly, let's remove it and see what happens. I'm committing this
separately from the prerequisite changes in have_relevant_joinclause,
just to make the question easier to revisit if there is some fault in
my logic.

We should be willing to cross-join two small relations if that allows us
to use an inner indexscan on a large relation (that is, the potential
indexqual for the large table requires both smaller relations). This
worked in simple cases but fell apart as soon as there was a join clause
to a fourth relation, because the existence of any two-relation join clause
caused the planner to not consider clauseless joins between other base
relations. The added regression test shows an example case adapted from
a recent complaint from Benoit Delbosc.
Adjust have_relevant_joinclause, have_relevant_eclass_joinclause, and
has_relevant_eclass_joinclause to consider that a join clause mentioning
three or more relations is sufficient grounds for joining any subset of
those relations, even if we have to do so via a cartesian join. Since such
clauses are relatively uncommon, this shouldn't affect planning speed on
typical queries; in fact it should help a bit, because the latter two
functions in particular get significantly simpler.
Although this is arguably a bug fix, I'm not going to risk back-patching
it, since it might have currently-unforeseen consequences.

Per mailing list discussion, we would like to keep the bytea functions
parallel to the text functions, so rename bytea_agg to string_agg,
which already exists for text.
Also, to satisfy the rule that we don't want aggregate functions of
the same name with a different number of arguments, add a delimiter
argument, just like string_agg for text already has.

cost_index's method for estimating per-tuple costs of evaluating filter
conditions (a/k/a qpquals) was completely wrong in the presence of derived
indexable conditions, such as range conditions derived from a LIKE clause.
This was largely masked in common cases as a result of all simple operator
clauses having about the same costs, but it could show up in a big way when
dealing with functional indexes containing expensive functions, as seen for
example in bug #6579 from Istvan Endredy. Rejigger the calculation to give
sane answers when the indexquals aren't a subset of the baserestrictinfo
list. As a side benefit, we now do the calculation properly for cases
involving join clauses (ie, parameterized indexscans), which we always
overestimated before.
There are still cases where this is an oversimplification, such as clauses
that can be dropped because they are implied by a partial index's
predicate. But we've never accounted for that in cost estimates before,
and I'm not convinced it's worth the cycles to try to do so.

Silently ignore any nonexistent schemas that are listed in search_path.

Previously we attempted to throw an error or at least warning for missing
schemas, but this was done inconsistently because of implementation
restrictions (in many cases, GUC settings are applied outside transactions
so that we can't do system catalog lookups). Furthermore, there were
exceptions to the rule even in the beginning, and we'd been poking more
and more holes in it as time went on, because it turns out that there are
lots of use-cases for having some irrelevant items in a common search_path
value. It seems better to just adopt a philosophy similar to what's always
been done with Unix PATH settings, wherein nonexistent or unreadable
directories are silently ignored.
This commit also fixes the documentation to point out that schemas for
which the user lacks USAGE privilege are silently ignored. That's always
been true but was previously not documented.
This is mostly in response to Robert Haas' complaint that 9.1 started to
throw errors or warnings for missing schemas in cases where prior releases
had not. We won't adopt such a significant behavioral change in a back
branch, so something different will be needed in 9.1.

postgres:// URIs are an attempt to "stop the bleeding" in this general
area that has been said to occur due to external projects adopting their
own syntaxes. The syntaxes supported by this patch:
postgres://[user[:pwd]@][unix-socket][:port[/dbname]][?param1=value1&...]
postgres://[user[:pwd]@][net-location][:port][/dbname][?param1=value1&...]
should be enough to cover most interesting cases without having to
resort to "param=value" pairs, but those are provided for the cases that
need them regardless.
libpq documentation has been shuffled around a bit, to avoid stuffing
all the format details into the PQconnectdbParams description, which was
already a bit overwhelming. The list of keywords has moved to its own
subsection, and the details on the URI format live in another subsection.
This includes a simple test program, as requested in discussion, to
ensure that interesting corner cases continue to work appropriately in
the future.
Author: Alexander Shulgin
Some tweaking by Álvaro Herrera, Greg Smith, Daniel Farina, Peter Eisentraut
Reviewed by Robert Haas, Alexey Klyukin (offlist), Heikki Linnakangas,
Marko Kreen, and others
Oh, it also supports postgresql:// but that's probably just an accident.

This definition is convenient when applying the function to the
reltablespace column of pg_class, since that's what zero means there;
and it doesn't interfere with any other plausible use of the function.
Per gripe from Bruce Momjian.

Fix pg_upgrade to properly upgrade a table that is stored in the cluster default tablespace, but part of a database that is in a user-defined tablespace. Caused “file not found” error during upgrade.

This patch reverts commit 191ef2b407f065544ceed5700e42400857d9270f
and thereby restores the pre-7.3 behavior of EXTRACT(EPOCH FROM
timestamp-without-tz). Per discussion, the more recent behavior was
misguided on a couple of grounds: it makes it hard to get a
non-timezone-aware epoch value for a timestamp, and it makes this one
case dependent on the value of the timezone GUC, which is incompatible
with having timestamp_part() labeled as immutable.
The other behavior is still available (in all releases) by explicitly
casting the timestamp to timestamp with time zone before applying EXTRACT.
This will need to be called out as an incompatible change in the 9.2
release notes. Although having mutable behavior in a function marked
immutable is clearly a bug, we're not going to back-patch such a change.

Point the URL to PL/py directly to the page about the procedural language.

It's still non-deterministic in some sense ... but given fixed settings
and identical planning problems, it will now always choose the same plan,
so we probably shouldn't tar it with that brush. Per bug #6565 from
Guillaume Cottenceau. Back-patch to 9.0 where the behavior was fixed.

Re-add documentation recommendation to use gzip/gunzip for archive file storage.

estimate_num_groups() gets unhappy with
create table empty();
select * from empty except select * from empty e2;
I can't see any actual use-case for such a query (and the table is illegal
per SQL spec), but it seems like a good idea that it not cause an assert
failure.

The case could not arise when this code was originally written, but it can
now (since we made zero-column MergeJoins work for the benefit of FULL JOIN
ON TRUE). I don't think there is any actual bug here, but we might as well
treat it consistently with other uses of COPY_POINTER_FIELD(). Per comment
from Ashutosh Bapat.

Save a few cycles while creating “sticky” entries in pg_stat_statements.

There's no need to sit there and increment the stats when we know all the
increments would be zero anyway. The actual additions might not be very
expensive, but skipping acquisition of the spinlock seems like a good
thing. Pushing the logic about initialization of the usage count down into
entry_alloc() allows us to do that while making the code actually simpler,
not more complex. Expansion on a suggestion by Peter Geoghegan.

This patch addresses a deficiency in the previous pg_stat_statements patch.
We want to give sticky entries an initial "usage" factor high enough that
they probably will stick around until their query is completed. However,
if the query never completes (eg it gets an error during execution), the
entry shouldn't persist indefinitely. Manage this by starting out with
a usage setting equal to the (approximate) median usage value within the
whole hashtable, but decaying the value much more aggressively than we
do for normal entries.
Peter Geoghegan

We used to only initialize the stack base pointer when starting up a regular
backend, not in other processes. In particular, autovacuum workers can run
arbitrary user code, and without stack-depth checking, infinite recursion
in e.g an index expression will bring down the whole cluster.
The comment about PL/Java using set_stack_base() is not yet true. As the
code stands, PL/java still modifies the stack_base_ptr variable directly.
However, it's been discussed in the PL/Java mailing list that it should be
changed to use the function, because PL/Java is currently oblivious to the
register stack used on Itanium. There's another issues with PL/Java, namely
that the stack base pointer it sets is not really the base of the stack, it
could be something close to the bottom of the stack. That's a separate issue
that might need some further changes to this code, but that's a different
story.
Backpatch to all supported releases.

XLOG_GIN_UPDATE_META_PAGE and XLOG_GIN_DELETE_LISTPAGE records were printed
with a list link field labeled as "blkno", which was confusing, especially
when the link was empty (InvalidBlockNumber). Print the metapage block
number instead, since that's what's actually being updated. We could
include the link values too as a separate field, but not clear it's worth
the trouble.
Back-patch to 8.4 where the dubious code was added.

Commit 337b6f5ecf05b21b5e997986884d097d60e4e3d0 contained the entirely
fanciful assumption that it had made comparetup_datum unreachable.
Reported and patched by Takashi Yamamoto.
Fix up some not terribly accurate/useful comments from that commit, too.

If we make the initially-called function return the table physical-size
estimate, acquire_inherited_sample_rows will be able to use that to
allocate numbers of samples among child tables, when the day comes that
we want to support foreign tables in inheritance trees.

ANALYZE now accepts foreign tables and allows the table's FDW to control
how the sample rows are collected. (But only manual ANALYZEs will touch
foreign tables, for the moment, since among other things it's not very
clear how to handle remote permissions checks in an auto-analyze.)
contrib/file_fdw is extended to support this.
Etsuro Fujita, reviewed by Shigeru Hanada, some further tweaking by me.

Somebody didn't bother to fix this comment while adding foreign table
support to the code below it.
In passing, remove the explicit calling-out of relkind letters, which adds
complexity to the comment but doesn't help in understanding the code.

The parser got confused if a cursor parameter had the same name as
a plpgsql variable. Reported and diagnosed by Yeb Havinga, though
this isn't exactly his proposed fix.
Also, some mostly-but-not-entirely-cosmetic adjustments to the original
named-cursor-parameter patch, for code readability and better error
diagnostics.

This patch provides a test case for libpq's row processor API.
contrib/dblink can deal with very large result sets by dumping them into
a tuplestore (which can spill to disk) --- but until now, the intermediate
storage of the query result in a PGresult meant memory bloat for any large
result. Now we use a row processor to convert the data to tuple form and
dump it directly into the tuplestore.
A limitation is that this only works for plain dblink() queries, not
dblink_send_query() followed by dblink_get_result(). In the latter
case we don't know the desired tuple rowtype soon enough. While hack
solutions to that are possible, a different user-level API would
probably be a better answer.
Kyotaro Horiguchi, reviewed by Marko Kreen and Tom Lane

Add a “row processor” API to libpq for better handling of large results.

Traditionally libpq has collected an entire query result before passing
it back to the application. That provides a simple and transactional API,
but it's pretty inefficient for large result sets. This patch allows the
application to process each row on-the-fly instead of accumulating the
rows into the PGresult. Error recovery becomes a bit more complex, but
often that tradeoff is well worth making.
Kyotaro Horiguchi, reviewed by Marko Kreen and Tom Lane

There is no existing or foreseeable case in which psql should see a
PGRES_COPY_BOTH PQresultStatus; and if such a case ever emerges, it's a
pretty good bet that these code fragments wouldn't do the right thing
anyway. Remove them, and let the existing default cases do the appropriate
thing, namely emit an "unexpected PQresultStatus" bleat.
Noted while working on libpq row processor patch, for which I was
considering adding a PGRES_SUSPENDED status code --- the same default-case
treatment would be appropriate for that.

The original coding of the syslogger had an arbitrary limit of 20 large
messages concurrently in progress, after which it would just punt and dump
message fragments to the output file separately. Our ambitions are a bit
higher than that now, so allow the data structure to expand as necessary.
Reported and patched by Andrew Dunstan; some editing by Tom

dblink_exec leaked temporary database connections if any error occurred
after connection setup, for example
SELECT dblink_exec('...connect string...', 'select 1/0');
Add a PG_TRY block to ensure PQfinish gets done when it is needed.
(dblink_record_internal is on the hairy edge of needing similar treatment,
but seems not to be actively broken at the moment.)
Also, in 9.0 and up, only one of the three functions using tuplestore
return mode was properly checking that the query context would allow
a tuplestore result.
Noted while reviewing dblink patch. Back-patch to all supported branches.

Combining the loop workspace with the record of already-processed objects
might have been a cute trick, but it behaves horridly if there are many
dependency loops to repair: the time spent in the first step of findLoop()
grows as O(N^2). Instead use a separate flag array indexed by dump ID,
which we can check in constant time. The length of the workspace array
is now never more than the actual length of a dependency chain, which
should be reasonably short in all cases of practical interest. The code
is noticeably easier to understand this way, too.
Per gripe from Mike Roest. Since this is a longstanding performance bug,
backpatch to all supported versions.

The loop that matched owned sequences to their owning tables required time
proportional to number of owned sequences times number of tables; although
this work was only expended in selective-dump situations, which is probably
why the issue wasn't recognized long since. Refactor slightly so that we
can perform this work after the index array for findTableByOid has been
set up, reducing the time to O(M log N).
Per gripe from Mike Roest. Since this is a longstanding performance bug,
backpatch to all supported versions.

ecpg and pg_dump each contain keyword arrays with structure similar
to the backend's keyword array. Up to now, we actually named those
arrays the same as the backend's and relied on parser/keywords.h
to declare them. This seems a tad too cute, though, and it breaks
now that we need to PGDLLIMPORT-decorate the backend symbols.
Rename to avoid the problem. Per buildfarm.
(It strikes me that maybe we should get rid of the separate keywords.c
files altogether, and just define these arrays in the modules that use
them, but that's a rather more invasive change.)

The DBLINK_GET_CONN and DBLINK_GET_NAMED_CONN macros did not set the
surrounding function's conname variable, causing errors to be incorrectly
reported as having occurred on the "unnamed" connection in some cases.
This bug was actually visible in two cases in the regression tests,
but apparently whoever added those cases wasn't paying attention.
Noted by Kyotaro Horiguchi, though this is different from his proposed
patch.
Back-patch to 8.4; 8.3 does not have the same type of error reporting
so the patch is not relevant.

It's actually more useful for the module to ignore these. Ignoring
EXECUTE (and not incrementing the nesting level) allows the executor
hooks to charge the time to the underlying prepared query, which
shows up as a stats entry with the original PREPARE as query string
(possibly modified by suppression of constants, which might not be
terribly useful here but it's not worth avoiding). This is much more
useful than cluttering the stats table with a distinct entry for each
textually distinct EXECUTE.
Experimentation with this idea shows that it's also preferable to ignore
PREPARE. If we don't, we get two stats table entries, one with the query
string hash and one with the jumble-derived hash, but with the same visible
query string (modulo those constants). This is confusing and not very
helpful, since the first entry will only receive costs associated with
initial planning of the query, which is not something counted at all
normally by pg_stat_statements. (And if we do start tracking planning
costs, we'd want them blamed on the other hash table entry anyway.)

When tracking nested statements, contrib/pg_stat_statements formerly
double-counted the execution costs of utility statements that directly
contain an executable statement, such as EXPLAIN and DECLARE CURSOR.
This was not obvious since the ProcessUtility and Executor hooks
would each add their measured costs to the same stats table entry.
However, with the new implementation that hashes utility and plannable
statements differently, this showed up as seemingly-duplicate stats
entries. Fix that by disabling the Executor hooks when the query has a
queryId of zero, which was the case already for such statements but is now
more clearly specified in the code. (The zero queryId was causing problems
anyway because all such statements would add to a single bogus entry.)
The PREPARE/EXECUTE case still results in counting the same execution
in two different stats table entries, but it should be much less surprising
to users that there are two entries in such cases.
In passing, include a CommonTableExpr's ctename in the query hash.
I had left it out originally on the grounds that we wanted to omit all
inessential aliases, but since RTE_CTE RTEs are hashing their referenced
names, we'd better hash the CTE names too to make sure we don't hash
semantically different queries the same.

Some Windows-only messages had apparently been forgotten so far.
Also make the wording of the messages more consistent with similar
messages other parts, such as pg_ctl and pg_regress.

Correct epoch of txid_current() when executed on a Hot Standby server. Initialise ckptXidEpoch from starting checkpoint and maintain the correct value as we roll forwards. This allows GetNextXidAndEpoch() to return the correct epoch when executed during recovery. Backpatch to 9.0 when the problem is first observable by a user.

Postmaster sets max_safe_fds by testing how many open file descriptors it
can open, and that is normally inherited by all child processes at fork().
Not so on EXEC_BACKEND, ie. Windows, however. Because of that, we
effectively ignored max_files_per_process on Windows, and always assumed
a conservative default of 32 simultaneous open files. That could have an
impact on performance, if you need to access a lot of different files
in a query. After this patch, the value is passed to child processes by
save/restore_backend_variables() among many other global variables.
It has been like this forever, but given the lack of complaints about it,
I'm not backpatching this.

pg_stat_statements now hashes selected fields of the analyzed parse tree
to assign a "fingerprint" to each query, and groups all queries with the
same fingerprint into a single entry in the pg_stat_statements view.
In practice it is expected that queries with the same fingerprint will be
equivalent except for values of literal constants. To make the display
more useful, such constants are replaced by "?" in the displayed query
strings.
This mechanism currently supports only optimizable queries (SELECT,
INSERT, UPDATE, DELETE). Utility commands are still matched on the
basis of their literal query strings.
There remain some open questions about how to deal with utility statements
that contain optimizable queries (such as EXPLAIN and SELECT INTO) and how
to deal with expiring speculative hashtable entries that are made to save
the normalized form of a query string. However, fixing these issues should
require only localized changes, and since there are other open patches
involving contrib/pg_stat_statements, it seems best to go ahead and commit
what we've got.
Peter Geoghegan, reviewed by Daniel Farina

Generally, the parse location assigned to a multiple-token construct is
the location of its leftmost token. This commit breaks that rule for
the syntaxes TYPENAME 'LITERAL' and CAST(CONSTANT AS TYPENAME) --- the
resulting Const will have the location of the literal string, not the
typename or CAST keyword. The cases where this matters are pretty thin on
the ground (no error messages in the regression tests change, for example),
and it's unlikely that any user would be confused anyway by an error cursor
pointing at the literal. But still it's less than consistent. The reason
for changing it is that contrib/pg_stat_statements wants to know the parse
location of the original literal, and it was agreed that this is the least
unpleasant way to preserve that information through parse analysis.
Peter Geoghegan

Add a queryId field to Query and PlannedStmt. This is not used by the
core backend, except for being copied around at appropriate times.
It's meant to allow plug-ins to track a particular query forward from
parse analysis to execution.
The queryId is intentionally not dumped into stored rules (and hence this
commit doesn't bump catversion). You could argue that choice either way,
but it seems better that stored rule strings not have any dependency
on plug-ins that might or might not be present.
Also, add a post_parse_analyze_hook that gets invoked at the end of
parse analysis (but only for top-level analysis of complete queries,
not cases such as analyzing a domain's default-value expression).
This is mainly meant to be used to compute and assign a queryId,
but it could have other applications.
Peter Geoghegan

Currently, the only way to see the numbers this gathers is via
EXPLAIN (ANALYZE, BUFFERS), but the plan is to add visibility through
the stats collector and pg_stat_statements in subsequent patches.
Ants Aasma, reviewed by Greg Smith, with some further changes by me.

It used to be case that lazy vacuum could call this function with only
a shared lock on the buffer, but neither lazy vacuum nor any other
code path does that any more. Simplify the code accordingly and clean
up some related, obsolete comments.

Fix COPY FROM for null marker strings that correspond to invalid encoding.

The COPY documentation says "COPY FROM matches the input against the null
string before removing backslashes". It is therefore reasonable to presume
that null markers like E'\\0' will work ... and they did, until someone put
the tests in the wrong order during microoptimization-driven rewrites.
Since then, we've been failing if the null marker is something that would
de-escape to an invalidly-encoded string. Since null markers generally
need to be something that can't appear in the data, this represents a
nontrivial loss of functionality; surprising nobody noticed it earlier.
Per report from Jeff Davis. Backpatch to 8.4 where this got broken.

Replace empty locale name with implied value in CREATE DATABASE and initdb.

setlocale() accepts locale name "" as meaning "the locale specified by the
process's environment variables". Historically we've accepted that for
Postgres' locale settings, too. However, it's fairly unsafe to store an
empty string in a new database's pg_database.datcollate or datctype fields,
because then the interpretation could vary across postmaster restarts,
possibly resulting in index corruption and other unpleasantness.
Instead, we should expand "" to whatever it means at the moment of calling
CREATE DATABASE, which we can do by saving the value returned by
setlocale().
For consistency, make initdb set up the initial lc_xxx parameter values the
same way. initdb was already doing the right thing for empty locale names,
but it did not replace non-empty names with setlocale results. On a
platform where setlocale chooses to canonicalize the spellings of locale
names, this would result in annoying inconsistency. (It seems that popular
implementations of setlocale don't do such canonicalization, which is a
pity, but the POSIX spec certainly allows it to be done.) The same risk
of inconsistency leads me to not venture back-patching this, although it
could certainly be seen as a longstanding bug.
Per report from Jeff Davis, though this is not his proposed patch.

For some reason, in the original coding of the PlaceHolderVar mechanism
I had supposed that PlaceHolderVars couldn't propagate into subqueries.
That is of course entirely possible. When it happens, we need to treat
an outer-level PlaceHolderVar much like an outer Var or Aggref, that is
SS_replace_correlation_vars() needs to replace the PlaceHolderVar with
a Param, and then when building the finished SubPlan we have to provide
the PlaceHolderVar expression as an actual parameter for the SubPlan.
The handling of the contained expression is a bit delicate but it can be
treated exactly like an Aggref's expression.
In addition to the missing logic in subselect.c, prepjointree.c was failing
to search subqueries for PlaceHolderVars that need their relids adjusted
during subquery pullup. It looks like everyplace else that touches
PlaceHolderVars got it right, though.
Per report from Mark Murawski. In 9.1 and HEAD, queries affected by this
oversight would fail with "ERROR: Upper-level PlaceHolderVar found where
not expected". But in 9.0 and 8.4, you'd silently get possibly-wrong
answers, since the value transmitted into the subquery wouldn't go to null
when it should.

We were doing the recursive simplification of function/operator arguments
in half a dozen different places, with rather baroque logic to ensure it
didn't get done multiple times on some arguments. This patch improves that
by postponing argument simplification until after we've dealt with named
parameters and added any needed default expressions.
Marti Raudsepp, somewhat hacked on by me

Fix loss of previous expression-simplification work when a transform
function fires: we must not simply revert to untransformed input tree.
Instead build a dummy FuncExpr node to pass to the transform function.
This has the additional advantage of providing a simpler, more uniform
API for transform functions.
Move documentation to a somewhat less buried spot, relocate some
poorly-placed code, be more wary of null constants and invalid typmod
values, add an opr_sanity check on protransform function signatures,
and some other minor cosmetic adjustments.
Note: although this patch touches pg_proc.h, no need for catversion
bump, because the changes are cosmetic and don't actually change the
intended catalog contents.

An incorrect and entirely unnecessary "safety check" in exec_stmt_getdiag()
caused the code to treat an assignment to a variable with dno zero as a
no-op. Unfortunately, that's a perfectly valid dno. This has been broken
since GET DIAGNOSTICS was invented. It's not terribly surprising that the
bug went unnoticed for so long, since in most cases you probably wouldn't
use the function's first-created variable (normally its first parameter)
as a GET DIAGNOSTICS target. Nonetheless, it's broken. Per bug #6551
from Adam Buraczewski.

Per a suggestion from Euler Taveira, it seems like a good idea to include
this information in \du (and \dg) output. This costs nothing for people
who are not using the VALID UNTIL feature, while for those who are, it's
rather critical information.
Fabrízio de Royes Mello

PGAC_PATH_COLLATEINDEX supposed that it could use AC_PATH_PROGS to search
for collateindex.pl, but that macro will only accept files that are marked
executable, and at least some DocBook installations don't mark the script
executable (a case the docs Makefile was already prepared for). Accept the
script if it's present and readable in $DOCBOOKSTYLE/bin, and otherwise
search the PATH as before.
Having fixed that up, we don't need the fallback case that was in the docs
Makefile, and instead can throw an understandable error if configure didn't
find the script. Per recent trouble report from John Lumby.

Clean up compiler warnings from unused variables with asserts disabled

Document that routine vacuuming is now also important for the purpose
of index-only scans; and mention in the section that describes the
visibility map that it is used to implement index-only scans.
Marti Raudsepp, with some changes by me.

This restores the pre-9.0 situation that it's possible to add new indexes
on pg_class and other mapped-but-not-shared catalogs, so long as you broke
the glass and flipped the big red Dont-Touch-Me switch. As before, there
are a lot of gotchas, and you'd have to be pretty desperate to try this
on a production database; but there doesn't seem to be a reason for
relmapper.c to be preventing such things all by itself. Per
experimentation with a case suggested by Cody Cutrer.

I broke this in commit 337b6f5ecf05b21b5e997986884d097d60e4e3d0, which
among other things arranged for quicksorts to CHECK_FOR_INTERRUPTS()
slightly less frequently. Sadly, it also arranged for heapsorts to
CHECK_FOR_INTERRUPTS() much less frequently. Repair.

Instead of just stopping after removing an arbitrary subset of orphaned
large objects, commit and start a new transaction after each -l objects.
This is just as effective as the original patch at limiting the number of
locks used, and it doesn't require doing the OID collection process
repeatedly to get everything. Since the option no longer changes the
fundamental behavior of vacuumlo, and it avoids a known server-side
limitation, enable it by default (with a default limit of 1000 LOs per
transaction).
In passing, be more careful about properly quoting the names of tables
and fields, and do some other cosmetic cleanup.

The old code was using exit_horribly or die_horribly other depending on
whether it had an ArchiveHandle on which to close the connection or not;
but there were places that were passing a NULL ArchiveHandle to
die_horribly, and other places that used exit_horribly while having an
AH available. So there wasn't all that much consistency.
Improve the situation by keeping only one of the routines, and instead
of having to pass the AH down from the caller, arrange for it to be
present for an on_exit_nicely callback to operate on.
Author: Joachim Wieland
Some tweaks by me
Per a suggestion from Robert Haas, in the ongoing "parallel pg_dump"
saga.

This was for demonstration only, and now it was creating compiler
warnings from zlib without an obvious fix (see also
d923125b77c5d698bb8107a533a21627582baa43), let's just remove it. The
"directory" format is presumably similar enough anyway.

Making this operation look like a utility statement seems generally a good
idea, and particularly so in light of the desire to provide command
triggers for utility statements. The original choice of representing it as
SELECT with an IntoClause appendage had metastasized into rather a lot of
places, unfortunately, so that this patch is a great deal more complicated
than one might at first expect.
In particular, keeping EXPLAIN working for SELECT INTO and CREATE TABLE AS
subcommands required restructuring some EXPLAIN-related APIs. Add-on code
that calls ExplainOnePlan or ExplainOneUtility, or uses
ExplainOneQuery_hook, will need adjustment.
Also, the cases PREPARE ... SELECT INTO and CREATE RULE ... SELECT INTO,
which formerly were accepted though undocumented, are no longer accepted.
The PREPARE case can be replaced with use of CREATE TABLE AS EXECUTE.
The CREATE RULE case doesn't seem to have much real-world use (since the
rule would work only once before failing with "table already exists"),
so we'll not bother with that one.
Both SELECT INTO and CREATE TABLE AS still return a command tag of
"SELECT nnnn". There was some discussion of returning "CREATE TABLE nnnn",
but for the moment backwards compatibility wins the day.
Andres Freund and Tom Lane

Failing to do so causes trigger invocation to fail when they are nested
within a function invocation that changes the current package.
Backpatch to 9.1; previous releases used a different method to obtain
_TD. Per bug report from Mark Murawski (bug #6511)
Author: Alex Hunsaker

In pg_upgrade, remove dependency on pg_config, as that might not be in the non-development install. Instead, use the LOAD mechanism to check for the pg_upgrade_support shared object, like we do for other shared object checks.

When converting source files, pg_regress' inputdir and outputdir options were
ignored when computing the locations of the destination files. In consequence,
these options were effectively unusable when the regression inputs need to
be adjusted by pg_regress. This patch makes pg_regress put the converted files
in the same place that these options specify non-converted input or results
files are to be found. Backpatched to all live branches.

For a little while there I thought match_pathkeys_to_index() was broken
because it wasn't trying to match index columns to pathkeys in order.
Actually that's correct, because GiST can support ordering operators
on any random collection of index columns, but it sure needs a comment.

In commit 57664ed25e5dea117158a2e663c29e60b3546e1c I tried to fix a bug
reported by Teodor Sigaev by making non-simple-Var output columns distinct
(by wrapping their expressions with dummy PlaceHolderVar nodes). This did
not work too well. Commit b28ffd0fcc583c1811e5295279e7d4366c3cae6c fixed
some ensuing problems with matching to child indexes, but per a recent
report from Claus Stadler, constraint exclusion of UNION ALL subqueries was
still broken, because constant-simplification didn't handle the injected
PlaceHolderVars well either. On reflection, the original patch was quite
misguided: there is no reason to expect that EquivalenceClass child members
will be distinct. So instead of trying to make them so, we should ensure
that we can cope with the situation when they're not.
Accordingly, this patch reverts the code changes in the above-mentioned
commits (though the regression test cases they added stay). Instead, I've
added assorted defenses to make sure that duplicate EC child members don't
cause any problems. Teodor's original problem ("MergeAppend child's
targetlist doesn't match MergeAppend") is addressed more directly by
revising prepare_sort_from_pathkeys to let the parent MergeAppend's sort
list guide creation of each child's sort list.
In passing, get rid of add_sort_column; as far as I can tell, testing for
duplicate sort keys at this stage is dead code. Certainly it doesn't
trigger often enough to be worth expending cycles on in ordinary queries.
And keeping the test would've greatly complicated the new logic in
prepare_sort_from_pathkeys, because comparing pathkey list entries against
a previous output array requires that we not skip any entries in the list.
Back-patch to 9.1, like the previous patches. The only known issue in
this area that wasn't caused by the ill-advised previous patches was the
MergeAppend planning failure, which of course is not relevant before 9.1.
It's possible that we need some of the new defenses against duplicate child
EC entries in older branches, but until there's some clear evidence of that
I'm going to refrain from back-patching further.

This is intended as infrastructure to allow sepgsql to cooperate with
connection pooling software, by allowing the effective security label
to be set for each new connection.
KaiGai Kohei, reviewed by Yeb Havinga.

Dave Malcolm of Red Hat is working on a static code analysis tool for
Python-related C code. It reported a number of problems in plpython,
most of which were failures to check for NULL results from object-creation
functions, so would only be an issue in very-low-memory situations.
Patch in HEAD and 9.1. We could go further back but it's not clear that
these issues are important enough to justify the work.
Jan Urbański

This replaces the former global variable PLy_curr_procedure, and provides
a place to stash per-call-level information. In particular we create a
per-call-level scratch memory context.
For the moment, the scratch context is just used to avoid leaking memory
from datatype output function calls in PLyDict_FromTuple. There probably
will be more use-cases in future.
Although this is a fix for a pre-existing memory leakage bug, it seems
sufficiently invasive to not want to back-patch; it feels better as part
of the major rearrangement of plpython code that we've already done as
part of 9.2.
Jan Urbański

pgstattuple: Use a BufferAccessStrategy object to avoid cache-trashing.

Extracted from a larger patch by Jaime Casanova, reviewed by Noah Misch.
I think this error message could use some more extensive revision, but
this at least makes the handling of spgist consistent with what we do for
other types of indexes that this code doesn't know how to handle.

A leaf tuple that we need to delete could get moved as a consequence of an
insertion happening concurrently with the VACUUM scan. If it moves from a
page past the current scan point to a page before, we'll miss it, which is
not acceptable. Hence, when we see a leaf-page REDIRECT that could have
been made since our scan started, chase down the redirection pointer much
as if we were doing a normal index search, and be sure to vacuum every page
it leads to. This fixes the issue because, if the tuple was on page N at
the instant we start our scan, we will surely find it as a consequence of
chasing the redirect from page N, no matter how much it moves around in
between. Problem noted by Takashi Yamamoto.

We have always created a whole-table dependency for the target relation,
but that's not really good enough, as it doesn't prevent scenarios such
as dropping an individual target column or altering its type. So we
have to create an individual dependency for each target column, as well.
Per report from Bill MacArthur of a rule containing UPDATE breaking
after such an alteration. Note that this patch doesn't try to make
such cases work, only to ensure that the attempted ALTER TABLE throws
an error telling you it can't cope with adjusting the rule.
This is a long-standing bug, but given the lack of prior reports
I'm not going to risk back-patching it. A back-patch wouldn't do
anything to fix existing rules' dependency lists, anyway.

This patch fixes the other major compatibility-breaking limitation of
SPGiST, that it didn't store anything for null values of the indexed
column, and so could not support whole-index scans or "x IS NULL"
tests. The approach is to create a wholly separate search tree for
the null entries, and use fixed "allTheSame" insertion and search
rules when processing this tree, instead of calling the index opclass
methods. This way the opclass methods do not need to worry about
dealing with nulls.
Catversion bump is for pg_am updates as well as the change in on-disk
format of SPGiST indexes; there are some tweaks in SPGiST WAL records
as well.
Heavily rewritten version of a patch by Oleg Bartunov and Teodor Sigaev.
(The original also stored nulls separately, but it reused GIN code to do
so; which required undesirable compromises in the on-disk format, and
would likely lead to bugs due to the GIN code being required to work in
two very different contexts.)

It now prints the argument that was at fault.
Also fix a small misbehavior where the error message issued by
getopt() would complain about a program named "--single", because
that's what argv[0] is in the server process.

The original API definition was incapable of supporting whole-index scans
because there was no way to invoke leaf-value reconstruction without
checking any qual conditions. Also, it was inefficient for
multiple-qual-condition scans because value reconstruction got done over
again for each qual condition, and because other internal work in the
consistent functions likewise had to be done for each qual. To fix these
issues, pass the whole scankey array to the opclass consistent functions,
instead of only letting them see one item at a time. (Essentially, the
loop over scankey entries is now inside the consistent functions not
outside them. This makes the consistent functions a bit more complicated,
but not unreasonably so.)
In itself this commit does nothing except save a few cycles in
multiple-qual-condition index scans, since we can't support whole-index
scans on SPGiST indexes until nulls are included in the index. However,
I consider this a must-fix for 9.2 because once we release it will get
very much harder to change the opclass API definition.

This allows loadable modules to get control at drop time, perhaps for the
purpose of performing additional security checks or to log the event.
The initial purpose of this code is to support sepgsql, but other
applications should be possible as well.
KaiGai Kohei, reviewed by me.

Further reflection shows that a single callback isn't very workable if we
desire to let FDWs generate multiple Paths, because that forces the FDW to
do all work necessary to generate a valid Plan node for each Path. Instead
split the former PlanForeignScan API into three steps: GetForeignRelSize,
GetForeignPaths, GetForeignPlan. We had already bit the bullet of breaking
the 9.1 FDW API for 9.2, so this shouldn't cause very much additional pain,
and it's substantially more flexible for complex FDWs.
Add an fdw_private field to RelOptInfo so that the new functions can save
state there rather than possibly having to recalculate information two or
three times.
In addition, we'd not thought through what would be needed to allow an FDW
to set up subexpressions of its choice for runtime execution. We could
treat ForeignScan.fdw_private as an executable expression but that seems
likely to break existing FDWs unnecessarily (in particular, it would
restrict the set of node types allowable in fdw_private to those supported
by expression_tree_walker). Instead, invent a separate field fdw_exprs
which will receive the postprocessing appropriate for expression trees.
(One field is enough since it can be a list of expressions; also, we assume
the corresponding expression state tree(s) will be held within fdw_state,
so we don't need to add anything to ForeignScanState.)
Per review of Hanada Shigeru's pgsql_fdw patch. We may need to tweak this
further as we continue to work on that patch, but to me it feels a lot
closer to being right now.

Phil Sorber reported that a rewriting ALTER TABLE within an extension
update script failed, because it creates and then drops a placeholder
table; the drop was being disallowed because the table was marked as an
extension member. We could hack that specific case but it seems likely
that there might be related cases now or in the future, so the most
practical solution seems to be to create an exception to the general rule
that extension member objects can only be dropped by dropping the owning
extension. To wit: if the DROP is issued within the extension's own
creation or update scripts, we'll allow it, implicitly performing an
"ALTER EXTENSION DROP object" first. This will simplify cases such as
extension downgrade scripts anyway.
No docs change since we don't seem to have documented the idea that you
would need ALTER EXTENSION DROP for such an action to begin with.
Also, arrange for explicitly temporary tables to not get linked as
extension members in the first place, and the same for the magic
pg_temp_nnn schemas that are created to hold them. This prevents assorted
unpleasant results if an extension script creates a temp table: the forced
drop at session end would either fail or remove the entire extension, and
neither of those outcomes is desirable. Note that this doesn't fix the
ALTER TABLE scenario, since the placeholder table is not temp (unless the
table being rewritten is).
Back-patch to 9.1.

In constructs such as "x IN (1,2,3,4)" and "x <> ALL(ARRAY[1,2,3,4])",
we formerly always used a general-purpose assumption that the probability
of success is independent for each comparison of "x" to an array element.
But in real-world usage of these constructs, that's a pretty poor
assumption; it's much saner to assume that the array elements are distinct
and so the match probabilities are disjoint. Apply that assumption if the
operator appears to behave as equality (for ANY) or inequality (for ALL).
But fall back to the normal independent-probabilities calculation if this
yields an impossible result, ie probability > 1 or < 0. We could protect
ourselves against bad estimates even more by explicitly checking for equal
array elements, but that is expensive and doesn't seem worthwhile: doing
it would amount to optimizing for poorly-written queries at the expense
of well-written ones.
Daniele Varrazzo and Tom Lane, after a suggestion by Ants Aasma

Multi-line "Inherits:" and "Child tables:" footers were misindented when
those strings' translations involved multibyte characters, because we were
using strlen() instead of an appropriate display width measurement.
In passing, avoid doing gettext() more than once per loop in these places.
While at it, fix pg_wcswidth(), which has been entirely broken since about
8.2, but fortunately has been unused for the same length of time.
Report and patch by Sergey Burladyan (bug #6480)

Add GetForeignColumnOptions() to foreign.c, and add some documentation.

GetForeignColumnOptions provides some abstraction for accessing
column-specific FDW options, on a par with the access functions that were
already provided here for other FDW-related information.
Adjust file_fdw.c to use GetForeignColumnOptions instead of equivalent
hand-rolled code.
In addition, add some SGML documentation for the functions exported by
foreign.c that are meant for use by FDW authors.
(This is the fdw_helper portion of the proposed pgsql_fdw patch.)
Hanada Shigeru, reviewed by KaiGai Kohei

Due to an apparent thinko, when printing a table in expanded mode
(\x), space would be allocated for 1 slot plus 1 byte per line,
instead of 1 slot per line plus 1 slot for the NULL terminator. When
the line count is small, reading or writing the terminator would
therefore access memory beyond what was allocated.

Now that cache invalidation callbacks get only a hash value, and not a
tuple TID (per commits 632ae6829f7abda34e15082c91d9dfb3fc0f298b and
b5282aa893e565b7844f8237462cb843438cdd5e), the only way they can restrict
what they invalidate is to know what the hash values mean. setrefs.c was
doing this via a hard-wired assumption but that seems pretty grotty, and
it'll only get worse as more cases come up. So let's expose a calculation
function that takes the same parameters as SearchSysCache. Per complaint
from Marko Kreen.

Use-cases for this include custom log filtering rules and custom log
message transmission mechanisms (for instance, lossy log message
collection, which has been discussed several times recently).
As is our common practice for hooks, there's no regression test nor
user-facing documentation for this, though the author did exhibit a
sample module using the hook.
Martin Pihlak, reviewed by Marti Raudsepp

This simplifies the code a little bit. The new rule is that to update
XLogCtl->LogwrtResult, you must hold both WALWriteLock and info_lck, whereas
before we had two copies, one that was protected by WALWriteLock and another
protected by info_lck. The code that updates them was already holding both
locks, so merging the two is trivial.
The third copy, XLogCtl->Insert.LogwrtResult, was not totally redundant, it
was used in AdvanceXLInsertBuffer to update the backend-local copy, before
acquiring the info_lck to read the up-to-date value. But the value of that
seems dubious; at best it's saving one spinlock acquisition per completed
WAL page, which is not significant compared to all the other work involved.
And in practice, it's probably not saving even that much.

It's harmless to do full page writes even when not strictly necessary, so
when turning full_page_writes on, we can set the global flag first, and then
call XLogInsert. Likewise, when turning it off, we can write the WAL record
first, and then clear the flag. This way XLogInsert doesn't need any special
handling of the XLOG_FPW_CHANGE record type. XLogInsert is complicated
enough already, so anything we can keep away from there is a good thing.
Actually I don't think the atomicity of the shared memory flag matters,
anyway, because we only write the XLOG_FPW_CHANGE at the end of recovery,
when there are no concurrent WAL insertions going on. But might as well make
it safe, in case we allow changing full_page_writes on the fly in the
future.

In pg_upgrade, only lock the old cluster if link mode is used, and do it right after we restore the schema (a common failure point), and right before we do the link operation.

The original API specification only allowed an FDW to create a single
access path, which doesn't seem like a terribly good idea in hindsight.
Instead, move the responsibility for building the Path node and calling
add_path() into the FDW's PlanForeignScan function. Now, it can do that
more than once if appropriate. There is no longer any need for the
transient FdwPlan struct, so get rid of that.
Etsuro Fujita, Shigeru Hanada, Tom Lane

In backup.sgml, point out that you need to be using the logging collector
if you want to log messages from a failing archive_command script. (This
is an oversimplification, in that it will work without the collector as
long as you're not sending postmaster stderr to /dev/null; but it seems
like a good idea to encourage use of the collector to avoid problems
with multiple processes concurrently scribbling on one file.)
In config.sgml, do some wordsmithing of logging_collector discussion.
Per bug #6518 from Janning Vygen

This patch installs significantly smarter penalty and picksplit functions
for ranges, making GiST indexes for them smaller and faster to search.
There is no on-disk format change, so no catversion bump, but you'd need
to REINDEX to get the benefits for any existing index.
Alexander Korotkov, reviewed by Jeff Davis

The code in this function that tried to cope with a missing count histogram
was quite ineffective for anything except a perfectly flat distribution.
Furthermore, since we were already punting for missing MCELEM slot, it's
rather useless to sweat over missing DECHIST: there are no cases where
ANALYZE will create the first but not the second. So just simplify the
code by punting rather than pretending we can do something useful.

This patch improves selectivity estimation for the array <@, &&, and @>
(containment and overlaps) operators. It enables collection of statistics
about individual array element values by ANALYZE, and introduces
operator-specific estimators that use these stats. In addition,
ScalarArrayOpExpr constructs of the forms "const = ANY/ALL (array_column)"
and "const <> ANY/ALL (array_column)" are estimated by treating them as
variants of the containment operators.
Since we still collect scalar-style stats about the array values as a
whole, the pg_stats view is expanded to show both these stats and the
array-style stats in separate columns. This creates an incompatible change
in how stats for tsvector columns are displayed in pg_stats: the stats
about lexemes are now displayed in the array-related columns instead of the
original scalar-related columns.
There are a few loose ends here, notably that it'd be nice to be able to
suppress either the scalar-style stats or the array-element stats for
columns for which they're not useful. But the patch is in good enough
shape to commit for wider testing.
Alexander Korotkov, reviewed by Noah Misch and Nathan Boley

gzFile is already a pointer, so code like
gzFile *handle = gzopen(...)
is wrong.
This used to pass silently because gzFile used to be defined as void*,
and you can assign a void* to a void**. But somewhere between zlib
versions 1.2.3.4 and 1.2.6, the definition of gzFile was changed to
struct gzFile_s *, and with that new definition this usage causes
compiler warnings.
So remove all those extra pointer decorations.
There is a related issue in pg_backup_archiver.h, where
FILE *FH; /* General purpose file handle */
is used throughout pg_dump as sometimes a real FILE* and sometimes a
gzFile handle, which also causes warnings now. This is not yet fixed
here, because it might need more code restructuring.

This fixes an oversight in commit 11cad29c91524aac1d0b61e0ea0357398ab79bf8,
which introduced MergeAppend plans. Before that happened, we never
particularly cared about the sort ordering of scans of inheritance child
relations, since appending their outputs together would destroy any
ordering anyway. But now it's important to be able to match child relation
sort orderings to those of the surrounding query. The original coding of
add_child_rel_equivalences skipped ec_has_const EquivalenceClasses, on the
originally-correct grounds that adding child expressions to them was
useless. The effect of this is that when a parent variable is equated to
a constant, we can't recognize that index columns on the equivalent child
variables are not sort-significant; that is, we can't recognize that a
child index on, say, (x, y) is able to generate output in "ORDER BY y"
order when there is a clause "WHERE x = constant". Adding child
expressions to the (x, constant) EquivalenceClass fixes this, without any
downside that I can see other than a few more planner cycles expended on
such queries.
Per recent gripe from Robert McGehee. Back-patch to 9.1 where MergeAppend
was introduced.

For those of us who prefer the formatting of the docs using the
website stylesheets. Use "make STYLE=website draft" (for example) to use.
The stylesheet itself is referenced directly to the website, so there
is currently no copy of it stored in the source repository. Thus, docs
built with it will only look correct if the browser can access the website
when viewing them.

When a GiST page is split during index build, it might not have a buffer.

Previously it was thought that it's impossible as the code stands, because
insertions create buffers as tuples are cascaded downwards, and index
split also creaters buffers eagerly for all halves. But the example from
Jay Levitt demonstrates that it can happen, when the root page is split.
It's in fact OK if the buffer doesn't exist, so we just need to remove the
sanity check. In fact, we've been discussing the possibility of destroying
empty buffers to conserve memory, which would render the sanity check
completely useless anyway.
Fix by Alexander Korotkov

The <literal> markup is not visible as distinct on man pages, which
creates a bit of confusion when looking at the documentation of the
pg_basebackup -l option. Rather than reinventing the entire font
system for man pages to remedy this, just put some quotes around this
particular case, which should also help in other output formats.

Running "psql -f -" used to print
psql:<stdin>:1: ERROR: blah
but that got broken between 8.4 and 9.0 (commit
b291c0fba83a1e93868e2f69c03be195d620f30c), and now it printed
psql:-:1: ERROR: blah
This reverts to the old behavior and cleans up some code that was left
dead or useless by the mentioned commit.

The only toastable column now is datacl, but we don't really support
long ACLs anyway. The TOAST table should have been removed when the
pg_db_role_setting catalog was introduced in commit
2eda8dfb52ed9962920282d8384da8bb4c22514d, but I forgot to do that.
Per -hackers discussion on March 2011.

Several places were still written as though standard_conforming_strings
didn't exist, much less be the default. Now that it is on by default,
we can simplify the text and just insert occasional notes suggesting that
you might have to think harder if it's turned off. Per discussion of a
suggestion from Hannes Frederic Sowa.
Back-patch to 9.1 where standard_conforming_strings was made the default.

A prepared transaction can get new conflicts in and out after preparing, so
we cannot rely on the in- and out-flags stored in the statefile at prepare-
time. As a quick fix, make the conservative assumption that after a restart,
all prepared transactions are considered to have both in- and out-conflicts.
That can lead to unnecessary rollbacks after a crash, but that shouldn't be
a big problem in practice; you don't want prepared transactions to hang
around for a long time anyway.
Dan Ports

This makes it much more convenient to build tools for Postgres that are
separately compiled and require a matching CRC implementation.
To prevent multiple copies of the CRC polynomial tables being introduced
into the postgres binaries, they are now included in the static library
libpgport that is mainly meant for replacement system functions. That
seems like a bit of a kludge, but there's no better place.
This cleans up building of the tools pg_controldata and pg_resetxlog,
which previously had to build their own copies of pg_crc.o.
In the future, external programs that need access to the CRC tables can
include the tables directly from the new header file pg_crc_tables.h.
Daniel Farina, reviewed by Abhijit Menon-Sen and Tom Lane

We don't need to constrain the other side of an indexable join clause to
not be below an outer join; an example here is
SELECT FROM t1 LEFT JOIN t2 ON t1.a = t2.b LEFT JOIN t3 ON t2.c = t3.d;
We can consider an inner indexscan on t3.d using c = d as indexqual, even
though t2.c is potentially nulled by a previous outer join. The comparable
logic in orindxpath.c has always worked that way, but I was being overly
cautious here.

psql backslash commands that deal with file or directory names require
quotes around those that have spaces, single quotes, or backslashes.
However, tab-completing such names does not provide said quotes, and is
thus almost useless with them.
This patch fixes the problem by having a wrapper function around
rl_filename_completion_function that dequotes on input and quotes on
output. This eases dealing with such names.
Author: Noah Misch

We already skip rewriting the table in these cases, but we still force a
whole table scan to validate the data. This can be skipped, and thus
we can make the whole ALTER TABLE operation just do some catalog touches
instead of scanning the table, when these two conditions hold:
(a) Old and new pg_constraint.conpfeqop match exactly. This is actually
stronger than needed; we could loosen things by way of operator
families, but it'd require a lot more effort.
(b) The functions, if any, implementing a cast from the foreign type to
the primary opcintype are the same. For this purpose, we can consider a
binary coercion equivalent to an exact type match. When the opcintype
is polymorphic, require that the old and new foreign types match
exactly. (Since ri_triggers.c does use the executor, the stronger check
for polymorphic types is no mere future-proofing. However, no core type
exercises its necessity.)
Author: Noah Misch
Committer's note: catalog version bumped due to change of the Constraint
node. I can't actually find any way to have such a node in a stored
rule, but given that we have "out" support for them, better be safe.

In commit 4016bdef8aded77b4903c457050622a5a1815c16 I fixed a bunch of
ginxlog.c bugs having to do with not handling XLogReadBuffer failures
correctly. However, in ginRedoUpdateMetapage and ginRedoDeleteListPages,
I unaccountably thought that failure to read the metapage would be
impossible and just put in an elog(PANIC) call. This is of course wrong:
failure is exactly what will happen if the index got dropped (or rebuilt)
between creation of the WAL record and the crash we're trying to recover
from. I believe this explains Nicholas Wilson's recent report of these
errors getting reached.
Also, fix memory leak in forgetIncompleteSplit. This wasn't of much
concern when the code was written, but in a long-running standby server
page split records could be expected to accumulate indefinitely.
Back-patch to 8.4 --- before that, GIN didn't have a metapage.

Claiming that the typevar argument to DefineCompositeType() is const
was a plain lie. A similar case in DefineVirtualRelation() was
already changed in passing in commit 1575fbcb. Also clean up the now
unnecessary casts that used to cast away the const.

This makes it easier to match a column name with the description of it,
and makes it possible to add more detailed documentation in the future.
This patch does not add that extra documentation at this point, only
the structure required for it.
Modeled on the changes already done to pg_stat_activity.

Merge dissect() into cdissect() to remove a pile of near-duplicate code.

The "uncomplicated" case isn't materially less complicated than the full
case, certainly not enough so to justify duplicating nearly 500 lines
of code. The only extra work being done in the full path is zaptreesubs,
which is very cheap compared to everything else being done here, and
besides that I'm less than convinced that it's not needed in some cases
even without backrefs.

In nested sub-regex trees, lower-level nodes created DFAs and then
destroyed them again before exiting, which is a bit dumb considering that
the recursive search is likely to call those nodes again later. Instead
cache each created DFA until the end of pg_regexec(). This is basically a
space for time tradeoff, in that it might increase the maximum memory
usage. However, in most regex patterns there are not all that many subre
nodes, so not that many DFAs --- and in any case, the peak usage occurs
when reaching the bottom recursion level, and except for alternation cases
that's going to be the same anyway.

Apparently some primordial version of Spencer's engine needed cdissect()
and child functions to be able to continue matching from a previous
position when re-called. That is dead code, though, since trivial
inspection shows that cdissect can never be entered without having
previously done zapmem which resets the relevant retry counter. I have
also verified experimentally that no case in the Tcl regression tests
reaches cdissect with a nonzero retry value. Accordingly, remove that
logic. This doesn't really save any noticeable number of cycles in itself,
but it is one step towards making dissect() and cdissect() equivalent,
which will allow removing hundreds of lines of near-duplicated code.
Since struct subre's "retry" field is no longer particularly related to
any kind of retry, rename it to "id". As of this commit it's only used
for identifying a subre node in debug printouts, so you might think we
should get rid of the field entirely; but I have a plan for another use.

Cases where a back-reference is part of a larger subexpression that
is quantified have never worked in Spencer's regex engine, because
he used a compile-time transformation that neglected the need to
check the back-reference match in iterations before the last one.
(That was okay for capturing parens, and we still do it if the
regex has *only* capturing parens ... but it's not okay for backrefs.)
To make this work properly, we have to add an "iteration" node type
to the regex engine's vocabulary of sub-regex nodes. Since this is a
moderately large change with a fair risk of introducing new bugs of its
own, apply to HEAD only, even though it's a fix for a longstanding bug.

pg_dump was incautious about sanitizing object names that are emitted
within SQL comments in its output script. A name containing a newline
would at least render the script syntactically incorrect. Maliciously
crafted object names could present a SQL injection risk when the script
is reloaded.
Reported by Heikki Linnakangas, patch by Robert Haas
Security: CVE-2012-0868

Remove arbitrary limitation on length of common name in SSL certificates.

Both libpq and the backend would truncate a common name extracted from a
certificate at 32 bytes. Replace that fixed-size buffer with dynamically
allocated string so that there is no hard limit. While at it, remove the
code for extracting peer_dn, which we weren't using for anything; and
don't bother to store peer_cn longer than we need it in libpq.
This limit was not so terribly unreasonable when the code was written,
because we weren't using the result for anything critical, just logging it.
But now that there are options for checking the common name against the
server host name (in libpq) or using it as the user's name (in the server),
this could result in undesirable failures. In the worst case it even seems
possible to spoof a server name or user name, if the correct name is
exactly 32 bytes and the attacker can persuade a trusted CA to issue a
certificate in which that string is a prefix of the certificate's common
name. (To exploit this for a server name, he'd also have to send the
connection astray via phony DNS data or some such.) The case that this is
a realistic security threat is a bit thin, but nonetheless we'll treat it
as one.
Back-patch to 8.4. Older releases contain the faulty code, but it's not
a security problem because the common name wasn't used for anything
interesting.
Reported and patched by Heikki Linnakangas
Security: CVE-2012-0867

Require execute permission on the trigger function for CREATE TRIGGER.

This check was overlooked when we added function execute permissions to the
system years ago. For an ordinary trigger function it's not a big deal,
since trigger functions execute with the permissions of the table owner,
so they couldn't do anything the user issuing the CREATE TRIGGER couldn't
have done anyway. However, if a trigger function is SECURITY DEFINER,
that is not the case. The lack of checking would allow another user to
install it on his own table and then invoke it with, essentially, forged
input data; which the trigger function is unlikely to realize, so it might
do something undesirable, for instance insert false entries in an audit log
table.
Reported by Dinesh Kumar, patch by Robert Haas
Security: CVE-2012-0866

This allows changing the location of the files that were previously
hard-coded to server.crt, server.key, root.crt, root.crl.
server.crt and server.key continue to be the default settings and are
thus required to be present by default if SSL is enabled. But the
settings for the server-side CA and CRL are now empty by default, and
if they are set, the files are required to be present. This replaces
the previous behavior of ignoring the functionality if the files were
not found.

When "vacuuming" a single btree page by removing LP_DEAD tuples, we are not
actually within a vacuum operation, but rather in an ordinary insertion
process that could well be running concurrently with a vacuum. So clearing
the cycleid is incorrect, and could cause the concurrent vacuum to miss
removing tuples that it needs to remove. This is a longstanding bug
introduced by commit e6284649b9e30372b3990107a082bc7520325676 of
2006-07-25. I believe it explains Maxim Boguk's recent report of index
corruption, and probably some other previously unexplained reports.
In 9.0 and up this is a one-line fix; before that we need to introduce a
flag to tell _bt_delitems what to do.

The syntax "\n*", that is a backref with a * quantifier directly applied
to it, has never worked correctly in Spencer's library. This has been an
open bug in the Tcl bug tracker since 2005:
https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894
The core of the problem is in parseqatom(), which first changes "\n*" to
"\n+|" and then applies repeat() to the NFA representing the backref atom.
repeat() thinks that any arc leading into its "rp" argument is part of the
sub-NFA to be repeated. Unfortunately, since parseqatom() already created
the arc that was intended to represent the empty bypass around "\n+", this
arc gets moved too, so that it now leads into the state loop created by
repeat(). Thus, what was supposed to be an "empty" bypass gets turned into
something that represents zero or more repetitions of the NFA representing
the backref atom. In the original example, in place of
^([bc])\1*$
we now have something that acts like
^([bc])(\1+|[bc]*)$
At runtime, the branch involving the actual backref fails, as it's supposed
to, but then the other branch succeeds anyway.
We could no doubt fix this by some rearrangement of the operations in
parseqatom(), but that code is plenty ugly already, and what's more the
whole business of converting "x*" to "x+|" probably needs to go away to fix
another problem I'll mention in a moment. Instead, this patch suppresses
the *-conversion when the target is a simple backref atom, leaving the case
of m == 0 to be handled at runtime. This makes the patch in regcomp.c a
one-liner, at the cost of having to tweak cbrdissect() a little. In the
event I went a bit further than that and rewrote cbrdissect() to check all
the string-length-related conditions before it starts comparing characters.
It seems a bit stupid to possibly iterate through many copies of an
n-character backreference, only to fail at the end because the target
string's length isn't a multiple of n --- we could have found that out
before starting. The existing coding could only be a win if integer
division is hugely expensive compared to character comparison, but I don't
know of any modern machine where that might be true.
This does not fix all the problems with quantified back-references. In
particular, the code is still broken for back-references that appear within
a larger expression that is quantified (so that direct insertion of the
quantification limits into the BACKREF node doesn't apply). I think fixing
that will take some major surgery on the NFA code, specifically introducing
an explicit iteration node type instead of trying to transform iteration
into concatenation of modified regexps.
Back-patch to all supported branches. In HEAD, also add a regression test
case for this. (It may seem a bit silly to create a regression test file
for just one test case; but I'm expecting that we will soon import a whole
bunch of regex regression tests from Tcl, so might as well create the
infrastructure now.)

While this doesn't save a huge amount of runtime, it still seems worth
doing, especially since I realized that the data copying I did in my first
draft was quite unnecessary. In this version, once we have the results
cached, getting them back for re-use is really very cheap.
Also, remove the hard-wired limitation to not consider wctype.h results for
character codes above 255. It turns out that we can't push the limit as
far up as I'd originally hoped, because the regex colormap code is not
efficient enough to cope very well with character classes containing many
thousand letters, which a Unicode locale is entirely capable of producing.
Still, we can push it up to U+7FF (which I chose as the limit of 2-byte
UTF8 characters), which will at least make Eastern Europeans happy pending
a better solution. Thus, this commit resolves the specific complaint in
bug #6457, but not the more general issue that letters of non-western
alphabets are mostly not recognized as matching [[:alpha:]].

Create src/backend/regex/README to hold an implementation overview of
the regex package, and fill it in with some preliminary notes about
the code's DFA/NFA processing and colormap management. Much more to
do there of course.
Also, improve some code comments around the colormap and cvec code.
No functional changes except to add one missing assert.

Some line feeds are added to target lists and from lists to make
them more readable. By default they wrap at 80 columns if possible,
but the wrap column is also selectable - if 0 it wraps after every
item.
Andrew Dunstan, reviewed by Hitoshi Harada.

In ecpglib rewrote code that used strtok_r to not use library functions anymore. This way we don’t have to worry which compiler on which OS offers which version of strtok.

Sync our regex code with upstream changes since last time we did this,
which was Tcl 8.5.0 (see commit df1e965e12cdd48c11057ee6e15346ee2b8b02f5).
There are no functional changes here; the main point is just to lay down
a commit-log marker that somebody has looked at this recently, and to do
what we can to keep the two codebases comparable.

The array intersection code would give wrong results if the first entry of
the correct output array would be "1". (I think only this value could be
at risk, since the previous word would always be a lower-bound entry with
that fixed value.)
Problem spotted by Julien Rouhaud, initial patch by Guillaume Lelarge,
cosmetic improvements by me.

Improve statistics estimation to make some use of DISTINCT in sub-queries.

Formerly, we just punted when trying to estimate stats for variables coming
out of sub-queries using DISTINCT, on the grounds that whatever stats we
might have for underlying table columns would be inapplicable. But if the
sub-query has only one DISTINCT column, we can consider its output variable
as being unique, which is useful information all by itself. The scope of
this improvement is pretty narrow, but it costs nearly nothing, so we might
as well do it. Per discussion with Andres Freund.
This patch differs from the draft I submitted yesterday in updating various
comments about vardata.isunique (to reflect its extended meaning) and in
tweaking the interaction with security_barrier views. There does not seem
to be a reason why we can't use this sort of knowledge even when the
sub-query is such a view.

Use exit_horribly() and ExecuteSqlQueryForSingleRow() in various
places where it's equivalent, or nearly equivalent, to the prior
coding. Apart from being more compact, this also makes the error
messages for the wrong-number-of-tuples case more consistent.

Parallel pg_dump wants to have multiple ArchiveHandle objects, and
therefore multiple PGconns, in play at the same time. This should
be just about the end of the refactoring that we need in order to
make that workable.

This extends the changes of commit 6252c4f9e201f619e5eebda12fa867acd4e4200e
so that we run the cleanup hook earlier for failure cases as well as
success cases. As before, the point is to avoid an assertion failure from
an Assert I added in commit a874fe7b4c890d1fe3455215a83ca777867beadd, which
was meant to check that no user-written code can be called during portal
cleanup. This fixes a case reported by Pavan Deolasee in which the Assert
could be triggered during backend exit (see the new regression test case),
and also prevents the possibility that the cleanup hook is run after
portions of the portal's state have already been recycled. That doesn't
really matter in current usage, but it foreseeably could matter in the
future.
Back-patch to 9.1 where the Assert in question was added.

This is some preliminary refactoring related to a pending patch
to allow sepgsql-enable sessions to make dynamic label transitions.
But this commit doesn't involve any functional change: it just puts
some bits of code in more logical places.
KaiGai Kohei

Per recent work by Peter Geoghegan, it's significantly faster to
tuplesort on a single sortkey if ApplySortComparator is inlined into
quicksort rather reached via a function pointer. It's also faster
in general to have a version of quicksort which is specialized for
sorting SortTuple objects rather than objects of arbitrary size and
type. This requires a couple of additional copies of the quicksort
logic, which in this patch are generate using a Perl script. There
might be some benefit in adding further specializations here too,
but thus far it's not clear that those gains are worth their weight
in code footprint.

The hstore and json datatypes both have record-conversion functions that
pay attention to column names in the composite values they're handed.
We used to not worry about inserting correct field names into tuple
descriptors generated at runtime, but given these examples it seems
useful to do so. Observe the nicer-looking results in the regression
tests whose results changed.
catversion bump because there is a subtle change in requirements for stored
rule parsetrees: RowExprs from ROW() constructs now have to include field
names.
Andrew Dunstan and Tom Lane

We don't normally allow quals to be pushed down into a view created
with the security_barrier option, but functions without side effects
are an exception: they're OK. This allows much better performance in
common cases, such as when using an equality operator (that might
even be indexable).
There is an outstanding issue here with the CREATE FUNCTION / ALTER
FUNCTION syntax: there's no way to use ALTER FUNCTION to unset the
leakproof flag. But I'm committing this as-is so that it doesn't
have to be rebased again; we can fix up the grammar in a future
commit.
KaiGai Kohei, with some wordsmithing by me.

If tuples were toasted, heap_multi_insert didn't update the ctid on the
original tuples. This caused a failure if there was an after trigger
(including a foreign key), on the table, and a tuple got toasted.
Per off-list report and test case from Ted Phelps

Silence warning about deprecated assignment to $[ in check_keywords.pl

Datatype I/O functions are allowed to leak memory in CurrentMemoryContext,
since they are generally called in short-lived contexts. However, plpgsql
calls such functions for purposes of type conversion, and was calling them
in its procedure context. Therefore, any leaked memory would not be
recovered until the end of the plpgsql function. If such a conversion
was done within a loop, quite a bit of memory could get consumed. Fix by
calling such functions in the transient "eval_econtext", and adjust other
logic to match. Back-patch to all supported versions.
Andres Freund, Jan Urbański, Tom Lane

If an extension has not been selected to be dumped (perhaps because of
a --schema or --table switch), the contents of its configuration tables
surely should not get dumped either. Per gripe from
Hubert Depesz Lubaczewski.

In pre-7.3 databases, pg_attribute.attislocal doesn't exist. The easiest
way to make sure the new inheritance logic behaves sanely is to assume it's
TRUE, not FALSE. This will result in printing child columns even when
they're not really needed. We could work harder at trying to reconstruct a
value for attislocal, but there is little evidence that anyone still cares
about dumping from such old versions, so just do the minimum necessary to
have a valid dump.
I had this correct in the original draft of the patch, but for some
unaccountable reason decided it wasn't necessary to change the value.
Testing against an old server shows otherwise...

Revise pg_dump's handling of inherited columns, which was last looked at
seriously in 2001, to eliminate several misbehaviors associated with
inherited default expressions and NOT NULL flags. In particular make sure
that a column is printed in a child table's CREATE TABLE command if and
only if it has attislocal = true; the former behavior would sometimes cause
a column to become marked attislocal when it was not so marked in the
source database. Also, stop relying on textual comparison of default
expressions to decide if they're inherited; instead, don't use
default-expression inheritance at all, but just install the default
explicitly at each level of the hierarchy. This fixes the
search-path-related misbehavior recently exhibited by Chester Young, and
also removes some dubious assumptions about the order in which ALTER TABLE
SET DEFAULT commands would be executed.
Back-patch to all supported branches.

Add ORDER BY to a query to prevent occasional regression test failures.

Add new psql settings and command-line options to support setting the
field and record separators for unaligned output to a zero byte, for
easier interfacing with other shell tools.
reviewed by Abhijit Menon-Sen

These were added to kwlist.h as unreserved keywords in separate patches,
but authors forgot to add them to the corresponding list in gram.y.
Because of that, even though they were supposed to be unreserved keywords,
they could not be used as identifiers. src/tools/check_keywords.pl is your
friend.

Various filters that were meant to prevent dumping of table data were not
being applied to extension config tables, notably --exclude-table-data and
--no-unlogged-table-data. We also would bogusly try to dump data from
views, sequences, or foreign tables, should an extension try to claim they
were config tables. Fix all that, and refactor/redocument to try to make
this a bit less fragile. This reverts the implementation, though not the
feature, of commit 7b070e896ca835318c90b02c830a5c4844413b64, which had
broken config-table dumping altogether :-(.
It is still the case that the code will dump config-table data even if
--schema is specified. That behavior was intentional, as per the comments
in getExtensionMembership, so I think it requires some more discussion
before we change it.

If somebody puts a window function in WHERE, we should complain about that
in so many words. The previous coding tended to complain about the window
function's arguments instead, which is likely to be misleading to users who
are unclear on the semantics of window functions; as seen for example in
bug #6440 from Matyas Novak.
Just another example of how "add new code at the end" is frequently a bad
heuristic.

Since bool_and() is equivalent to min(), and bool_or() to max(), we might
as well let them be index-optimized in the same way. The practical value
of this is debatable at best, but it seems nearly cost-free to enable it.
Code-wise, we need only adjust the entries in pg_aggregate. There is a
measurable planning speed penalty for a query involving one of these
aggregates, but it is only a few percent in simple cases, so that seems
acceptable.
Marti Raudsepp, reviewed by Abhijit Menon-Sen

Mark some more I/O-conversion-invoking functions as stable not volatile.

When written, textanycat, anytextcat, quote_literal, and quote_nullable
were marked volatile, because they could invoke arbitrary type-specific
output functions as part of casting their anyelement arguments to text.
Since then, we have defined a project policy that I/O functions must not
be volatile, as per commit aab353a60b95aadc00f81da0c6d99bde696c4b75.
So these functions can safely be downgraded to stable. Most of the time
this makes no difference since they'll get inlined anyway, but as noted
by Andrew Dunstan, there are cases where the volatile marking prevents
optimizations that the planner does before function inlining. (I think
I might have overlooked these functions in the earlier commit on the
grounds that inlining would make it moot, but not so --- tgl)
This change results in a change in the expected output of the json
regression tests, because the planner can now flatten a sub-select
that it failed to before. The old output is preferable, but getting
that back will require some as-yet-unfinished work on RowExpr handling.
Marti Raudsepp

The immediate impetus for this is that Noah Misch's patch to elide
unnecessary table and index rebuilds when changing typmod for temporal
types uses it; and this is extracted from that patch, with some
further commentary by me. But it seems logically separate from the
remainder of the patch, so I'm committing it separately; this is not
the first time someone has wanted fls() in the backend and probably
won't be the last.
If we end up using this in more performance-critical spots it may be
worthwhile to add some architecture-specific optimizations to our
src/port version of fls() - e.g. any x86 platform can implement this
using the assembly instruction BSRL. But performance won't matter
a bit for assessing typmod changes, so I'm not worried about that
right now.

This enables ALTER TABLE to skip table and index rebuilds when the
new type is unconstraint varbit, or when the allowable number of bits
is not decreasing.
Noah Misch, with review and a fix for an OID collision by me.

This enables ALTER TABLE to skip table and index rebuilds when a column
is changed to an unconstrained numeric, or when the scale is unchanged
and the precision does not decrease.
Noah Misch, with a few stylistic changes and a fix for an OID
collision by me.

Sometimes it may be useful to get actual row counts out of EXPLAIN
(ANALYZE) without paying the cost of timing every node entry/exit.
With this patch, you can say EXPLAIN (ANALYZE, TIMING OFF) to get that.
Tomas Vondra, reviewed by Eric Theise, with minor doc changes by me.

This is another round of refactoring to make things simpler for parallel
pg_dump. pg_dump.c now issues SQL queries through the relevant Archive
object, rather than relying on the global variable g_conn. This commit
isn't quite enough to get rid of g_conn entirely, but it makes a big
dent in its utilization and, along the way, manages to be slightly less
code than before.

Do not prompt when options were not specified. Assume --no-createdb,
--no-createrole, --no-superuser by default.
Also disable prompting for user name in dropdb, unless --interactive
was specified.
reviewed by Josh Kupershmidt

When building with LWLOCK_STATS, initialize the stats in LWLockWaitUntilFree.

If LWLockWaitUntilFree was called before the first LWLockAcquire call, you
would either crash because of access to uninitialized array or account the
acquisition incorrectly. LWLockConditionalAcquire doesn't have this problem
because it doesn't update the lwlock stats.
In practice, this never happens because there is no codepath where you would
call LWLockWaitUntilfree before LWLockAcquire after a new process is
launched. But that's just accidental, there's no guarantee that that's
always going to be true in the future.
Spotted by Jeff Janes.

The postmaster was coded to treat any unexpected exit of the startup
process (i.e., the WAL replay process) as a catastrophic crash, and not try
to restart it. This was OK so long as the startup process could not have
any sibling postmaster children. However, if a hot-standby backend
crashes, we SIGQUIT the startup process along with everything else, and the
resulting exit is hardly "unexpected". Treating it as such meant we failed
to restart a standby server after any child crash at all, not only a crash
of the WAL replay process as intended. Adjust that. Back-patch to 9.0
where hot standby was introduced.

Allow the connection keyword array to carry all seven items in ecpglib.

Although we will not even issue an XLOG_TBLSPC_DROP WAL record unless
removal of the tablespace's directories succeeds, that does not guarantee
that the same operation will succeed during WAL replay. Foreseeable
reasons for it to fail include temp files created in the tablespace by Hot
Standby backends, wrong directory permissions on a standby server, etc etc.
The original coding threw ERROR if replay failed to remove the directories,
but that is a serious overreaction. Throwing an error aborts recovery,
and worse means that manual intervention will be needed to get the database
to start again, since otherwise the same error will recur on subsequent
attempts to replay the same WAL record. And the consequence of failing to
remove the directories is only that some probably-small amount of disk
space is wasted, so it hardly seems justified to throw an error.
Accordingly, arrange to report such failures as LOG messages and keep going
when a failure occurs during replay.
Back-patch to 9.0 where Hot Standby was introduced. In principle such
problems can occur in earlier releases, but Hot Standby increases the odds
of trouble significantly. Given the lack of field reports of such issues,
I'm satisfied with patching back as far as the patch applies easily.

Instead, everything that needs the Archive object now gets it as a
parameter. This is necessary infrastructure for parallel pg_dump,
but is also amply justified by the ugliness of the current code
(though a lot more than this is needed to fix that problem).

Change various places in the code that are referencing the global
Archive object g_fout to instead reference the Archive object fout
which is already being passed as a parameter. For parallel pg_dump to
work, we're going to need multiple Archive(Handle) objects, so the
real solution here is to pass down the Archive object to everywhere
that it needs to go, but we might as well pick the low-hanging fruit
first.

Originally, most of this code assumed that no Postgres backends could be
running concurrently with it, and so no locking could be needed. That
assumption fails in Hot Standby. While it's still true that Hot Standby
backends should never change values like nextXid, they can examine them,
and consistency is important in some cases such as when computing a
snapshot. Therefore, prudence requires that WAL replay code obtain the
relevant locks when modifying such variables, even though it can examine
them without taking a lock. We were following that coding rule in some
places but not all. This commit applies the coding rule uniformly to all
updates of ShmemVariableCache and MultiXactState fields; a search of the
replay routines did not find any other cases that seemed to be at risk.
In addition, this commit fixes a longstanding thinko in replay of NEXTOID
and checkpoint records: we tried to advance nextOid only if it was behind
the value in the WAL record, but the comparison would draw the wrong
conclusion if OID wraparound had occurred since the previous value.
Better to just unconditionally assign the new value, since OID assignment
shouldn't be happening during replay anyway.
The additional locking seems to be more in the nature of future-proofing
than fixing any live bug, so I am not going to back-patch it. The NEXTOID
fix will be back-patched separately.

RestoreBkpBlocks was in the habit of zeroing and refilling the target
buffer; which was perfectly safe when the code was written, but is unsafe
during Hot Standby operation. The reason is that we have coding rules
that allow backends to continue accessing a tuple in a heap relation while
holding only a pin on its buffer. Such a backend could see transiently
zeroed data, if WAL replay had occasion to change other data on the page.
This has been shown to be the cause of bug #6425 from Duncan Rance (who
deserves kudos for developing a sufficiently-reproducible test case) as
well as Bridget Frey's re-report of bug #6200. It most likely explains the
original report as well, though we don't yet have confirmation of that.
To fix, change the code so that only bytes that are supposed to change will
change, even transiently. This actually saves cycles in RestoreBkpBlocks,
since it's not writing the same bytes twice.
Also fix seq_redo, which has the same disease, though it has to work a bit
harder to meet the requirement.
So far as I can tell, no other WAL replay routines have this type of bug.
In particular, the index-related replay routines, which would certainly be
broken if they had to meet the same standard, are not at risk because we
do not have coding rules that allow access to an index page when not
holding a buffer lock on it.
Back-patch to 9.0 where Hot Standby was added.

All other WAL redo routines either call RestoreBkpBlocks() or Assert that
they haven't been passed any backup blocks. Make this one do likewise.
Also, fix incorrect routine name in its failure message.

This reverts commit 500cf66d5522b39ddfdc26b309f8b5b0e385f42e. As was
more or less expected, a small minority of platforms won't accept
denormalized input even with the recent changes. It doesn't seem
especially helpful to test this if we're going to have to provide an
alternate expected-file to allow failure.

Further improve on commit c75e1436467f32a06b5ab9d594d2a390e7f4594d.
Instead of building both .o files and binaries in the same make rule,
just rely on the normal .c -> .o rule. This will ensure that
dependency tracking is used when enabled. To do this, disable the
implicit direct .c -> binary rule globally, which will also prevent
the original problem (*.dSYM junk) from reappearing elsewhere.

When testing bits (but not when setting or clearing them), we now
won't check whether the map has been extended. This significantly
improves performance in the case where the visibility map doesn't
exist yet, by avoiding an extra system call per tuple. To make
sure backends notice eventually, send an smgr inval on VM extension.
Dean Rasheed, with minor modifications by me.

On some platforms, strtod() reports ERANGE for a denormalized value (ie,
one that can be represented as distinct from zero, but is too small to have
full precision). On others, it doesn't. It seems better to try to accept
these values consistently, so add a test to see if the result value
indicates a true out-of-range condition. This should be okay per Single
Unix Spec. On machines where the underlying math isn't IEEE standard, the
behavior for such small numbers may not be very consistent, but then it
wouldn't be anyway.
Marti Raudsepp, after a proposal by Jeroen Vermeulen

In dry-run mode, just the name of the file to be removed is printed to
stdout; this is so the user can easily plug it into another program
through a pipe. If debug mode is also specified, a more verbose message
is printed to stderr.
Author: Gabriele Bartolini
Reviewer: Josh Kupershmidt

Like the XML data type, we simply store JSON data as text, after checking
that it is valid. More complex operations such as canonicalization and
comparison may come later, but this is enough for not.
There are a few open issues here, such as whether we should attempt to
detect UTF-8 surrogate pairs represented as \uXXXX\uYYYY, but this gets
the basic framework in place.

The sequence USAGE privilege is sufficiently similar to the SQL
standard that it seems reasonable to show in the information schema.
Also add some compatibility notes about it on the GRANT reference
page.

Add result object functions .colnames, .coltypes, .coltypmods to
obtain information about the result column names and types, which was
previously not possible in the PL/Python SPI interface.
reviewed by Abhijit Menon-Sen

In some hopeless situations, certain library functions in libpq and
libpgport quit the program. Use abort() for that instead of exit(),
so we don't interfere with the normal exit codes the program might
use, we clearly signal the abnormal termination, and the caller has a
chance of catching the termination.
This was originally pointed out by Debian's Lintian program.

When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.
This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.
Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.