Remove sanity test in XRecOffIsValid. Commit 061e7efb1b4c5b8a5d02122b7780531b8d5bf23d changed the rules for splitting xlog records across pages, but neglected to update this test. It's possible that there's some better action here than just removing the test completely, but this at least appears to get some of the things that are currently broken (like initdb on MacOS X) working again.
http://git.postgresql.org/pg/commitdiff/2dfa87bcb6eb3ee6e41c40ed4a8a43019a66bd38

Backport fsync queue compaction logic to all supported branches. This backports commit 7f242d880b5b5d9642675517466d31373961cf98, except for the counter in pg_stat_bgwriter. The underlying problem (namely, that a full fsync request queue causes terrible checkpoint behavior) continues to be reported in the wild, and this code seems to be safe and robust enough to risk back-porting the fix.
http://git.postgresql.org/pg/commitdiff/73a3b1d4589eaf3dc7f0a573727a2a6ac5279e6a

Reduce use of heavyweight locking inside hash AM. Avoid using LockPage(rel, 0, lockmode) to protect against changes to the bucket mapping. Instead, an exclusive buffer content lock is now viewed as sufficient permission to modify the metapage, and a shared buffer content lock is used when such modifications need to be prevented. This more relaxed locking regimen makes it possible that, when we're busy getting a heavyweight bucket on the bucket we intend to search or insert into, a bucket split might occur underneath us. To compenate for that possibility, we use a loop-and-retry system: release the metapage content lock, acquire the heavyweight lock on the target bucket, and then reacquire the metapage content lock and check that the bucket mapping has not changed. Normally it hasn't, and we're done. But if by chance it has, we simply unlock the metapage, release the heavyweight lock we acquired previously, lock the new bucket, and loop around again. Even in the worst case we cannot loop very many times here, since we don't split the same bucket again until we've split all the other buckets, and 2^N gets big pretty fast. This results in greatly improved concurrency, because we're effectively replacing two lwlock acquire-and-release cycles in exclusive mode (on one of the lock manager locks) with a single acquire-and-release cycle in shared mode (on the metapage buffer content lock). Testing shows that it's still not quite as good as btree; for that, we'd probably have to find some way of getting rid of the heavyweight bucket locks as well, which does not appear straightforward. Patch by me, review by Jeff Janes.
http://git.postgresql.org/pg/commitdiff/76837c1507cb5a5a0048046233568447729e66dd

Dramatically reduce System V shared memory consumption. Except when compiling with EXEC_BACKEND, we'll now allocate only a tiny amount of System V shared memory (as an interlock to protect the data directory) and allocate the rest as anonymous shared memory via mmap. This will hopefully spare most users the hassle of adjusting operating system parameters before being able to start PostgreSQL with a reasonable value for shared_buffers. There are a bunch of documentation updates needed here, and we might need to adjust some of the HINT messages related to shared memory as well. But it's not 100% clear how portable this is, so before we write the documentation, let's give it a spin on the buildfarm and see what turns red.
http://git.postgresql.org/pg/commitdiff/b0fc0df9364d2d2d17c0162cf3b8b59f6cb09f67

Make walsender more responsive. Per testing by Andres Freund, this improves replication performance and reduces replication latency and latency jitter. I was a bit concerned about moving more work into XLogInsert, but testing seems to show that it's not a problem in practice. Along the way, improve comments for WaitLatchOrSocket. Andres Freund. Review and stylistic cleanup by me.
http://git.postgresql.org/pg/commitdiff/f83b59997d29f06c3d67e7eb9a1f2c9cd017d665

Make commit_delay much smarter. Instead of letting every backend participating in a group commit wait independently, have the first one that becomes ready to flush WAL wait for the configured delay, and let all the others wait just long enough for that first process to complete its flush. This greatly increases the chances of being able to configure a commit_delay setting that actually improves performance. As a side consequence of this change, commit_delay now affects all WAL flushes, rather than just commits. There was some discussion on pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay, but in the absence of consensus I am leaving it alone for now. Peter Geoghegan, with some changes, mostly to the documentation, by me.
http://git.postgresql.org/pg/commitdiff/f11e8be3e812cdbbc139c1b4e49141378b118dee

Unify calling conventions for postgres/postmaster sub-main functions. There was a wild mix of calling conventions: Some were declared to return void and didn't return, some returned an int exit code, some claimed to return an exit code, which the callers checked, but actually never returned, and so on. Now all of these functions are declared to return void and decorated with attribute noreturn and don't return. That's easiest, and most code already worked that way.
http://git.postgresql.org/pg/commitdiff/eeece9e60984e76e5a41c1e2fa9efc5a1761e560

Make pg_dump emit more accurate dependency information. While pg_dump has included dependency information in archive-format output ever since 7.3, it never made any large effort to ensure that that information was actually useful. In particular, in common situations where dependency chains include objects that aren't separately emitted in the dump, the dependencies shown for objects that were emitted would reference the dump IDs of these un-dumped objects, leaving no clue about which other objects the visible objects indirectly depend on. So far, parallel pg_restore has managed to avoid tripping over this misfeature, but only by dint of some crude hacks like not trusting dependency information in the pre-data section of the archive. It seems prudent to do something about this before it rises up to bite us, so instead of emitting the "raw" dependencies of each dumped object, recursively search for its actual dependencies among the subset of objects that are being dumped. Back-patch to 9.2, since that code hasn't yet diverged materially from HEAD. At some point we might need to back-patch further, but right now there are no known cases where this is actively necessary. (The one known case, bug #6699, is fixed in a different way by my previous patch.) Since this patch depends on 9.2 changes that made TOC entries be marked before output commences as to whether they'll be dumped, back-patching further would require additional surgery; and as of now there's no evidence that it's worth the risk.
http://git.postgresql.org/pg/commitdiff/8a504a363925fc5c7af48cd723da3f7e4d7ba9e2

Improve pg_dump's dependency-sorting logic to enforce section dump order. As of 9.2, with the --section option, it is very important that the concept of "pre data", "data", and "post data" sections of the output be honored strictly; else a dump divided into separate sectional files might be unrestorable. However, the dependency-sorting logic knew nothing of sections and would happily select output orderings that didn't fit that structure. Doing so was mostly harmless before 9.2, but now we need to be sure it doesn't do that. To fix, create dummy objects representing the section boundaries and add dependencies between them and all the normal objects. (This might sound expensive but it seems to only add a percent or two to pg_dump's runtime.) This also fixes a problem introduced in 9.1 by the feature that allows incomplete GROUP BY lists when a primary key is given in GROUP BY. That means that views can depend on primary key constraints. Previously, pg_dump would deal with that by simply emitting the primary key constraint before the view definition (and hence before the data section of the output). That's bad enough for simple serial restores, where creating an index before the data is loaded works, but is undesirable for speed reasons. But it could lead to outright failure of parallel restores, as seen in bug #6699 from Joe Van Dyk. That happened because pg_restore would switch into parallel mode as soon as it reached the constraint, and then very possibly would try to emit the view definition before the primary key was committed (as a consequence of another bug that causes the view not to be correctly marked as depending on the constraint). Adding the section boundary constraints forces the dependency-sorting code to break the view into separate table and rule declarations, allowing the rule, and hence the primary key constraint it depends on, to revert to their intended location in the post-data section. This also somewhat accidentally works around the bogus-dependency-marking problem, because the rule will be correctly shown as depending on the constraint, so parallel pg_restore will now do the right thing. (We will fix the bogus-dependency problem for real in a separate patch, but that patch is not easily back-portable to 9.1, so the fact that this patch is enough to dodge the only known symptom is fortunate.) Back-patch to 9.1, except for the hunk that adds verification that the finished archive TOC list is in correct section order; the place where it was convenient to add that doesn't exist in 9.1.
http://git.postgresql.org/pg/commitdiff/a1ef01fe163b304760088e3e30eb22036910a495

Cope with smaller-than-normal BLCKSZ setting in SPGiST indexes on text. The original coding failed miserably for BLCKSZ of 4K or less, as reported by Josh Kupershmidt. With the present design for text indexes, a given inner tuple could have up to 256 labels (requiring either 3K or 4K bytes depending on MAXALIGN), which means that we can't positively guarantee no failures for smaller blocksizes. But we can at least make it behave sanely so long as there are few enough labels to fit on a page. Considering that btree is also more prone to "index tuple too large" failures when BLCKSZ is small, it's not clear that we should expend more work than this on this case.
http://git.postgresql.org/pg/commitdiff/757773602c424b1e51c2d9cad8a59398ba7f7b2c

Make UtilityContainsQuery recurse until it finds a non-utility Query. The callers of UtilityContainsQuery want it to return a non-utility Query if it returns anything at all. However, since we made CREATE TABLE Alexander Shulgin/SELECT INTO into a utility command instead of a variant of SELECT, a command like "EXPLAIN SELECT INTO" results in two nested utility statements. So what we need UtilityContainsQuery to do is drill down to the bottom non-utility Query. I had thought of this possibility in setrefs.c, and fixed it there by looping around the UtilityContainsQuery call; but overlooked that the call sites in plancache.c have a similar issue. In those cases it's notationally inconvenient to provide an external loop, so let's redefine UtilityContainsQuery as recursing down to a non-utility Query instead. Noted by Rushabh Lathia. This is a somewhat cleaned-up version of his proposed patch.
http://git.postgresql.org/pg/commitdiff/bde689f809027422d4c0baaa3e63b69ce5762e2c

Fix NOTIFY to cope with I/O problems, such as out-of-disk-space. The LISTEN/NOTIFY subsystem got confused if SimpleLruZeroPage failed, which would typically happen as a result of a write() failure while attempting to dump a dirty pg_notify page out of memory. Subsequently, all attempts to send more NOTIFY messages would fail with messages like "Could not read from file "pg_notify/nnnn" at offset nnnnn: Success". Only restarting the server would clear this condition. Per reports from Kevin Grittner and Christoph Berg. Back-patch to 9.0, where the problem was introduced during the LISTEN/NOTIFY rewrite.
http://git.postgresql.org/pg/commitdiff/ae90128dc5f8412c8ce31804bbf0a0f8ab345db1

Prevent CREATE TABLE LIKE/INHERITS from (mis) copying whole-row Vars. If a CHECK constraint or index definition contained a whole-row Var (that is, "table.*"), an attempt to copy that definition via CREATE TABLE LIKE or table inheritance produced incorrect results: the copied Var still claimed to have the rowtype of the source table, rather than the created table. For the LIKE case, it seems reasonable to just throw error for this situation, since the point of LIKE is that the new table is not permanently coupled to the old, so there's no reason to assume its rowtype will stay compatible. In the inheritance case, we should ideally allow such constraints, but doing so will require nontrivial refactoring of CREATE TABLE processing (because we'd need to know the OID of the new table's rowtype before we adjust inherited CHECK constraints). In view of the lack of previous complaints, that doesn't seem worth the risk in a back-patched bug fix, so just make it throw error for the inheritance case as well. Along the way, replace change_varattnos_of_a_node() with a more robust function map_variable_attnos(), which is capable of being extended to handle insertion of ConvertRowtypeExpr whenever we get around to fixing the inheritance case nicely, and in the meantime it returns a failure indication to the caller so that a helpful message with some context can be thrown. Also, this code will do the right thing with subselects (if we ever allow them in CHECK or indexes), and it range-checks varattnos before using them to index into the map array. Per report from Sergey Konoplev. Back-patch to all supported branches.
http://git.postgresql.org/pg/commitdiff/541ffa65c32b278da8ab2d19433fa6d37bd15c8d

Declare AnonymousShmem pointer as "void *". The original coding had it as "PGShmemHeader *", but that doesn't offer any notational benefit because we don't dereference it. And it was resulting in compiler warnings on some platforms, notably buildfarm member castoroides, where mmap() and munmap() are evidently declared to take and return "char *".
http://git.postgresql.org/pg/commitdiff/81e82643837de93909e0c5a8e14e805f3f69f41e

Suppress compiler warnings in readfuncs.c. Commit 7357558fc8866e3a449aa9473c419b593d67b5b6 introduced "(void) token;" into the READ_TEMP_LOCALS() macro, to suppress complaints from gcc 4.6 when the value of token was not used anywhere in a particular node-read function. However, this just moved the warning around: inspection of buildfarm results shows that some compilers are now complaining that token is being read before it's set. Revert the READ_TEMP_LOCALS() macro change and instead put "(void) token;" into READ_NODE_FIELD(), which is the principal culprit for cases where the warning might occur. In principle we might need the same in READ_BITMAPSET_FIELD() and/or READ_LOCATION_FIELD(), but it seems unlikely that a node would consist only of such fields, so I'll leave them alone for now.
http://git.postgresql.org/pg/commitdiff/39bfc94c86f1990e9db8ea3da0e82995cc1b76db

Fix race condition in enum value comparisons. When (re) loading the typcache comparison cache for an enum type's values, use an up-to-date MVCC snapshot, not the transaction's existing snapshot. This avoids problems if we encounter an enum OID that was created since our transaction started. Per report from Andres Freund and diagnosis by Robert Haas. To ensure this is safe even if enum comparison manages to get invoked before we've set a transaction snapshot, tweak GetLatestSnapshot to redirect to GetTransactionSnapshot instead of throwing error when FirstSnapshotSet is false. The existing uses of GetLatestSnapshot (in ri_triggers.c) don't care since they couldn't be invoked except in a transaction that's already done some work --- but it seems just conceivable that this might not be true of enums, especially if we ever choose to use enums in system catalogs. Note that the comparable coding in enum_endpoint and enum_range_internal remains GetTransactionSnapshot; this is perhaps debatable, but if we changed it those functions would have to be marked volatile, which doesn't seem attractive. Back-patch to 9.1 where ALTER TYPE ADD VALUE was added.
http://git.postgresql.org/pg/commitdiff/9ad45c18b6c8d03ce18a26223eb0d15e900c7a2c

Fix pg_upgrade, broken by the xlogid/segno -> 64-bit int refactoring. The xlogid + segno representation of a particular WAL segment doesn't make much sense in pg_resetxlog anymore, now that we don't use that anywhere else. Use the WAL filename instead, since that's a convenient way to name a particular WAL segment. I did this partially for pg_resetxlog in the original xlogid/segno -> uint64 patch, but I neglected pg_upgrade and the docs. This should now be more complete.
http://git.postgresql.org/pg/commitdiff/038f3a05092365eca070bdc588554520dfd5ffb9

Validate xlog record header before enlarging the work area to store it. If the record header is garbled, we're now quite likely to notice it before we try to make a bogus memory allocation and run out of memory. That can still happen, if the xlog record is split across pages (we cannot verify the record header until reading the next page in that scenario), but this reduces the chances. An out-of-memory is treated as a corrupt record anyway, so this isn't a correctness issue, just a case of giving a better error message. Per Amit Kapila's suggestion.
http://git.postgresql.org/pg/commitdiff/567787f216da750b3805aea6fd8aef19e8b882a1