Switch to 2.1 version of pg_bsd_indent. This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.
Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com

This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.
Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us

This code was still using the old style of forming a heap tuple rather
than using tuple slots. This would be less efficient if a non-heap
access method was used. And using tuple slots is actually quite a bit
faster when using heap as well.
Also add some test cases for generated columns with null values and
with varlena values. This lack of coverage was discovered while
working on this patch.
Discussion: https://www.postgresql.org/message-id/flat/20190331025744.ugbsyks7czfcoksd%40alap3.anarazel.de

Commit 41b54ba78e allowed not only VACUUM but also ANALYZE options
to take a boolean argument. But it forgot to update the documentation
for ANALYZE. This commit adds the descriptions about those ANALYZE
boolean options into the documentation.
This patch also updates tab-completion for ANALYZE boolean options.
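For illustration, both of the following spellings are now accepted and
documented (hypothetical table name):

    ANALYZE (VERBOSE on) tab;
    ANALYZE (VERBOSE off) tab;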
Reported-by: Kyotaro Horiguchi
Author: Fujii Masao
Reviewed-by: Masahiko Sawada, Michael Paquier
Discussion: https://postgr.es/m/CAHGQGwHTUt-kuwgiwe8f0AvTnB+ySqJWh95jvmh-qcoKW9YA9g@mail.gmail.com

The original coding of this view relied on a correlated IN sub-query.
Our planner is not very bright about correlated sub-queries, and even
if it were, there's no way for it to know that the output of
pg_get_publication_tables() is duplicate-free, making the de-duplicating
semantics of IN unnecessary. Hence, rewrite as a LATERAL sub-query.
This provides circa 100X speedup for me with a few hundred published
tables (the whole regression database), and things would degrade as
roughly O(published_relations * all_relations) beyond that.
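A sketch of the rewritten view body, assuming the catalog names used
here (close to, though not necessarily identical to, the committed SQL):

    SELECT P.pubname, N.nspname AS schemaname, C.relname AS tablename
    FROM pg_publication P,
         LATERAL pg_get_publication_tables(P.pubname) GPT,
         pg_class C JOIN pg_namespace N ON (N.oid = C.relnamespace)
    WHERE C.oid = GPT.relid;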
Because the rules.out expected output changes, force a catversion bump.
Ordinarily we might not want to do that post-beta1; but we already know
we'll be doing a catversion bump before beta2 to fix pg_statistic_ext
issues, so it's pretty much free to fix it now instead of waiting for v13.
Per report and fix suggestion from PegoraroF10.
Discussion: https://postgr.es/m/1551385426763-0.post@n3.nabble.com

That leads to unsatisfied external references if the C compiler fails
to elide unused static functions. Apparently, we have no buildfarm
members building HEAD that have that issue ... but such compilers still
exist in the wild. Need to do something about that.
In passing, fix Berkeley-era typo in comment.
Discussion: https://postgr.es/m/27054.1558533367@sss.pgh.pa.us

This uses a method similar to 68a7c24f, which guarantees that GRANT
commands using WITH GRANT OPTION are dumped in a way that respects
cascading dependencies. As databases do not support initial privileges
via pg_init_privs, we need to repeat the same ACL reordering method
here.
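For example, with hypothetical roles, a dump must emit these GRANTs in
exactly this order for the restore to succeed:

    GRANT CREATE ON DATABASE postgres TO alice WITH GRANT OPTION;
    SET ROLE alice;
    GRANT CREATE ON DATABASE postgres TO bob;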
ACLs for databases have been moved from pg_dumpall to pg_dump in v11, so
this impacts pg_dump for v11 and above, and pg_dumpall for v9.6 and
v10.
Discussion: https://postgr.es/m/15788-4e18847520ebcc75@postgresql.org
Author: Nathan Bossart
Reviewed-by: Haribabu Kommi
Backpatch-through: 9.6

Besides implementing the new statement, this change fixes some issues
with the parsing of PREPARE and EXECUTE statements. The different forms
of these statements are now all handled in a unified way.
Author: Matsumura-san <matsumura.ryo@jp.fujitsu.com>

When $(MAKE) is present in a rule, make assumes that target is a
submake, and it doesn't need to buffer its output. But in this case
it's a shell script that needs buffered output. Avoid that heuristic,
by referring to $(MAKE) via an indirection.
Discussion: https://postgr.es/m/20190521004717.qsktdsugj3shagco@alap3.anarazel.de

The use of "set -x" to echo a subset of the test's commands might've
been a good idea during development of this test, but it's been stable
for long enough now that the extra output isn't very useful. Also
our project expectations have been trending towards less output in
non-error cases; the fact that "set -x" produces output on stderr
is particularly annoying from that standpoint. So get rid of it.
Also, pass "-A trust" to initdb explicitly so that it won't issue
a warning about "trust" being an insecure default. This matches
what the TAP tests have done for a long time, and again gets rid
of some noise on stderr.
Discussion: https://postgr.es/m/21766.1558397960@sss.pgh.pa.us

We're seeing occasional instability in the plans generated for
parallel queries on the "a_star" table hierarchy. This suggests
that something is changing the planner's stats for those tables,
but that should not be happening within a regression test run.
To try to gather some information about what's happening, insert
additional queries to check the basic page/tuple counts for these
tables, as well as whether any vacuums or analyzes have happened
on them. (We expect that only the database-wide VACUUM in
sanity_check.sql will have touched them.)
I added the probes not only in select_parallel.sql itself, but
also in stats.sql, bearing in mind that the stats collector's
lag may prevent the initial query from reporting current truth.
If any extra vacuum/analyze has happened, the recheck in stats.sql
definitely ought to see it.
This commit can be reverted once we figure out what's going on.
Per suggestion from David Rowley, though I changed the queries around.
Discussion: https://postgr.es/m/CA+hUKG+0CxrKRWRMf5ymN3gm+BECHna2B-q1w8onKBep4HasUw@mail.gmail.com

This allows table AMs to completely suppress TOAST table creation, or
to modify the conditions under which they are created.
Patch by me. Reviewed by Andres Freund.
Discussion: http://postgr.es/m/CA+Tgmoa4O2n=yphqD2pERUnYmUO84bH1SqMsA-nSxBGsZ7gWfA@mail.gmail.com

Define the meanings of the POSIX-spec character classes in line,
rather than referring to the ctype(3) man page. That man page
doesn't even exist on many modern systems, and if it does exist
it probably says the wrong things about non-ASCII characters.
Also document our non-POSIX-spec "ascii" character class.
Also, point out here that this behavior is controlled by collation or
LC_CTYPE, since the existing text explaining that is pretty far away.
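For instance, the non-POSIX "ascii" class matches only 7-bit
characters:

    SELECT 'a' ~ '[[:alpha:]]';   -- true
    SELECT 'é' ~ '[[:ascii:]]';   -- false: 'é' is not 7-bit ASCII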
Per gripe from Geert Lobbestael. Given the lack of prior complaints,
I'm not excited about back-patching this.
Discussion: https://postgr.es/m/155837022049.1359.2948065118562813468@wrigleys.postgresql.org

This shouldn't have been committed without even running the tests (nor
were the tests added that were suggested). I'm fixing up the results
to get the buildfarm back to green; it's quite possible we'll want to
revert this later.

Commit 41b54ba78e allowed existing VACUUM options to take a boolean
argument. It's documented that valid boolean values that VACUUM can
accept are true, false, on, off, 1, and 0. But previously the parser
failed to accept 1 and 0 as boolean values in VACUUM syntax because
gram.y lacked a NumericOnly clause for vac_analyze_option_arg.
This commit adds such a clause so that VACUUM options can also take
1 and 0 as boolean values.
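For example, the following is now accepted, with 1 and 0 acting as
true and false (hypothetical table name):

    VACUUM (FREEZE 1, VERBOSE 0) tab;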
Discussion: https://postgr.es/m/CAHGQGwGYg82A8UCQxZe7Zn9MnyUBGdyB=1CNpKF3jBny+RbyfA@mail.gmail.com

For partial aggregation combine steps,
AggStatePerTrans->numTransInputs was set to the transition function's
number of inputs, rather than the combine function's number of
inputs (always 1).
That led partial aggregates with strict combine functions to perform
the strictness-required NOT NULL input check incorrectly. Unless the
aggregate was passed exactly one argument, the strictness check was
either omitted (in the 0-argument case) or checked too many arguments.
In the latter case we'd read beyond the end of
FunctionCallInfoData->args (only in master).
AggStatePerTrans->numTransInputs has actually been wrong since 9.6,
where partial aggregates were added. But it turns out to not be
an active problem in 9.6 and 10, because numTransInputs wasn't used at
all for combine functions: Before c253b722f6 there simply was no NULL
check for the input to strict trans functions, and after that the
check was simply hardcoded for the right offset in fcinfo, as it's
done by code specific to combine functions.
In bf6c614a2f2 (11) the strictness check was generalized, with common
code doing the strictness checks for both plain and combine transition
functions, based on numTransInputs. For combine functions this led to
not emitting an expression step to check for strict input in the
0-argument case, and in the > 1 argument case we'd check too many
arguments. Because the relevant fcinfo->isnull[2..] was
always zero-initialized (more or less by accident, by being part of
the AggStatePerTrans struct, which is palloc0'ed), there was no
observable damage in the latter case before a9c35cf85ca1f, we just
checked too many array elements.
Due to the changes in a9c35cf85ca1f, the > 1 argument bug became visible,
because these days fcinfo is a) dynamically allocated without being
zeroed b) exactly the length required for the number of specified
arguments (hardcoded to 2 in this case).
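As a sketch, a query like the following, with parallelism forced and
hypothetical table/column names, exercises the buggy path: corr() is a
two-argument aggregate with a strict combine function, so its combine
step checked two arguments instead of one:

    SET parallel_setup_cost = 0;
    SET parallel_tuple_cost = 0;
    SET min_parallel_table_scan_size = 0;
    SELECT corr(x, y) FROM big_table;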
This commit only contains a fairly minimal fix, setting numTransInputs
to a hardcoded 1 when building a pertrans for a combine function. It
seems likely that we'll want to clean this up further (e.g. the
arguments to build_pertrans_for_aggref() aren't particularly meaningful
for combine functions). But the wrap date for 12 beta1 is coming up
fast, so it seems good to have a minimal fix in place.
Backpatch to 11. While AggStatePerTrans->numTransInputs was set
wrongly before that, the value was not used for combine functions.
Reported-By: Rajkumar Raghuwanshi
Diagnosed-By: Kyotaro Horiguchi, Jeevan Chalke, Andres Freund, David Rowley
Author: David Rowley, Kyotaro Horiguchi, Andres Freund
Discussion: https://postgr.es/m/CAKcux6=uZEyWyLw0N7HtR9OBc-sWEFeByEZC7t-KDf15FKxVew@mail.gmail.com

The comment for SNAPSHOT_SELF was unfortunately explaining
SNAPSHOT_DIRTY, as reported by Sergei. Also expand a few comments, and
include a few more comments from heapam_visibility.c, so they're in an
AM independent place.
Reported-By: Sergei Kornilov
Author: Andres Freund
Discussion: https://postgr.es/m/9152241558192351@sas1-d856b3d759c7.qloud-c.yandex.net

Before this commit, when ANALYZE was run on a table and serializable
was used (either by virtue of an explicit BEGIN TRANSACTION ISOLATION
LEVEL SERIALIZABLE, or default_transaction_isolation being set to
serializable), a null pointer dereference led to a crash.
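A minimal reproducer, per the description above (hypothetical table
name):

    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    ANALYZE some_table;   -- previously crashed here
    COMMIT;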
The analyze scan doesn't need a snapshot (nor predicate locking), but
before this commit a scan only contained information about being a
bitmap or sample scan.
Refactor the option passing to the scan_begin callback to use a
bitmask instead. Alternatively we could have added a new boolean
parameter, but that seems harder to read. Even before this issue
various people (Heikki, Tom, Robert) suggested doing so.
These changes don't change the scan APIs outside of tableam. The flags
argument could be exposed, but that's not necessary to fix this
problem. Also, the wrapper table_beginscan* functions encapsulate most
of that complexity.
After these changes fixing the bug is trivial, just don't acquire
predicate lock for analyze style scans. That was already done for
bitmap heap scans. Add an assert that a snapshot is passed when
acquiring the predicate lock, so this kind of bug doesn't require
running with serializable.
Also add a comment about sample scans currently requiring predicate
locking the entire relation, that previously wasn't remarked upon.
Reported-By: Joe Wildish
Author: Andres Freund
Discussion:
https://postgr.es/m/4EA80A20-E9BF-49F1-9F01-5B66CAB21453@elusive.cx
https://postgr.es/m/20190411164947.nkii4gaeilt4bui7@alap3.anarazel.de
https://postgr.es/m/20190518203102.g7peu2fianukjuxm@alap3.anarazel.de

When this suite runs installcheck, redirect file creations from
src/test/regress to src/bin/pg_upgrade/tmp_check/regress. This closes a
race condition in "make -j check-world". If the pg_upgrade suite wrote
to a given src/test/regress/results file in parallel with the regular
src/test/regress invocation writing it, a test failed spuriously. Even
without parallelism, in "make -k check-world", the suite finishing
second overwrote the other's regression.diffs. This revealed test
"largeobject" assuming @abs_builddir@ is getcwd(), so fix that, too.
Buildfarm client REL_10, released forty-five days ago, supports saving
regression.diffs from its new location. When an older client reports a
pg_upgradeCheck failure, it will no longer include regression.diffs.
Back-patch to 9.5, where pg_upgrade moved to src/bin.
Reviewed by Andrew Dunstan.
Discussion: https://postgr.es/m/20181224034411.GA3224776@rfd.leadboat.com

Discussion of bug #15804 reveals that this test didn't really prove
that the syslogger child process ever launched successfully, much
less did anything. It was only checking that the expected log file
gets created, and that's done in the postmaster. Moreover, the
test assumed it could rename the log file, which is likely to fail
on Windows (cf. commit d611175e5).
Instead, use the default log file name pattern, which should result
in a new file name being chosen after 1 second, and verify that
rotation has occurred by checking for a new file name. Also add code
to test that messages actually do propagate through the syslogger.
In theory this version of the test should work on Windows, so
revert d611175e5.
Discussion: https://postgr.es/m/15804-3721117bf40fb654@postgresql.org

This commit reverts 57431a911d3a650451d198846ad3194900666152.
While that's still a good idea in the abstract, we found out
that there are multiple crasher bugs in it on Windows builds,
making the logging_collector option unusable on Windows.
There's no time left to fix these issues before 12beta1,
so revert the patch to allow Windows beta testing to proceed.
We'll try again at some future date.
Per bug #15804 from Yulian Khodorkovskiy and additional
investigation by Michael Paquier.
Discussion: https://postgr.es/m/15804-3721117bf40fb654@postgresql.org

It appears that some variants of the .** jsonpath accessor are
undocumented. In particular, the undocumented variants are:
.**{level}
.**{lower_level to upper_level}
.**{lower_level to last}
This commit adds missing documentation for them.
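For example, with the level-bounded forms (output shown for the default
lax mode):

    SELECT jsonb_path_query('{"a": {"b": 1}}', '$.**{2}');
    -- returns 1: only items exactly two levels below the root
    SELECT jsonb_path_query('{"a": {"b": 1}}', '$.**{1 to last}');
    -- returns {"b": 1} and 1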

We still had a couple of these left in ancient src/port/ files.
Convert them to modern style in preparation for switching to
a version of pg_bsd_indent that doesn't cope well with K&R style.
Discussion: https://postgr.es/m/16886.1558104483@sss.pgh.pa.us

If PrepareTempTablespaces() has never been called in the current
transaction, OpenTemporaryFile() will fall back to using the default
tablespace, which is a bug if the user wanted temp files placed elsewhere.
gistInitBuildBuffers() appears to have this disease already, and it
seems like an easy trap for future coders to fall into.
We discussed other ways to close this gap, but none of them are prettier
or more reliable than just having BufFileCreateTemp do it. In particular,
having fd.c do this creates layering issues that we could do without.
Per suggestion from Melanie Plageman. Arguably this is a bug fix, but
nobody seems very excited about back-patching, so change in HEAD only.
Discussion: https://postgr.es/m/CAAKRu_YwzjuGAmmaw4-8XO=OVFGR1QhY_Pq-t3wjb9ribBJb_Q@mail.gmail.com

Previously various parts of the code routed size requests through
RelationGetNumberOfBlocks[InFork]. That works if md.c is used by the
AM, but not otherwise.
Add a tableam callback to return the size of the table. As not every
AM will use postgres' BLCKSZ, have it return bytes, and have
RelationGetNumberOfBlocksInFork() round the byte size up into blocks.
To allow code outside of the AM to determine the actual relation size
map InvalidForkNumber to the total size of a relation, as not every AM
might need just the postgres-defined forks.
A few users of RelationGetNumberOfBlocks() ought to be converted away
from that. One case, the use of it to determine whether a tid is
valid, will be fixed in a follow up commit. Others will have to wait
for v13.
Author: Andres Freund
Discussion: https://postgr.es/m/20190423225201.3bbv6tbqzkb5w7cw@alap3.anarazel.de

Previously, gen_partprune_steps() always built executor pruning steps
using all suitable clauses, including those containing PARAM_EXEC
Params. This meant that the pruning steps were only completely safe
for executor run-time (scan start) pruning. To prune at executor
startup, we had to ignore the steps involving exec Params. But this
doesn't really work in general, since there may be logic changes
needed as well --- for example, pruning according to the last operator's
btree strategy is the wrong thing if we're not applying that operator.
The rules embodied in gen_partprune_steps() and its minions are
sufficiently complicated that tracking their incremental effects in
other logic seems quite impractical.
Short of a complete redesign, the only safe fix seems to be to run
gen_partprune_steps() twice, once to create executor startup pruning
steps and then again for run-time pruning steps. We can save a few
cycles however by noting during the first scan whether we rejected
any clauses because they involved exec Params --- if not, we don't
need to do the second scan.
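For example (hypothetical tables), a nestloop inner scan of a
partitioned table driven by the outer relation's column produces
clauses containing PARAM_EXEC Params; those can only be used for
run-time pruning, while startup pruning must use steps built without
them:

    SELECT * FROM outer_tab o JOIN parted p ON p.key = o.x;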
In support of this, refactor the internal APIs in partprune.c to make
more use of passing information in the GeneratePruningStepsContext
struct, rather than as separate arguments.
This is, I hope, the last piece of our response to a bug report from
Alan Jackson. Back-patch to v11 where this code came in.
Discussion: https://postgr.es/m/FAD28A83-AC73-489E-A058-2681FA31D648@tvsquared.com

75445c1 has caused various failures in tests across the tree after
updating some error messages, so fix the newly-expected output.
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/8332.1558048838@sss.pgh.pa.us

It's not safe for nbtree VACUUM to attempt to delete a target page whose
right sibling is already half-dead, since that would fail the
cross-check when VACUUM attempts to re-find a downlink to the right
sibling in the parent page. Logic to prevent this from happening was
added by commit 8da31837803, which addressed a bug in the overhaul of
page deletion that went into PostgreSQL 9.4 (commit efada2b8e92).
VACUUM was made to check the right sibling page, and back off when it
happened to be half-dead already.
However, it is only truly necessary to do the right sibling check on the
leaf level, since that transitively determines if the deletion target's
parent's right sibling page is itself undergoing deletion. Remove the
internal page level check, and add a comment explaining why the leaf
level check alone suffices.
The extra check is also unnecessary due to the fact that internal pages
that are marked half-dead are generally considered corrupt. Commit
efada2b8e92 established the principle that there should never be
half-dead internal pages (internal pages pending deletion are possible,
but that status is never directly represented in the internal page).
VACUUM will complain about corruption when it encounters half-dead
internal pages, so VACUUM is bound to raise an error one way or another
when an nbtree index has a half-dead internal page (contrib/amcheck will
also report that the page is corrupt).
It's possible that a pg_upgrade'd 9.3 database will still have half-dead
internal pages, so it may seem like there is an argument for leaving the
check in place to reliably get a cleaner error message that advises the
user to REINDEX. However, leaf pages are also deleted in the first
phase of deletion prior to PostgreSQL 9.4, so I believe we won't even
attempt to re-find the parent page anyway (we won't have the fully
deleted leaf page as the right sibling of our target page, so we won't
even try to find a downlink for it).
Discussion: https://postgr.es/m/CAH2-Wzm_ntmqJjWLRyKzimFmFvk+BnVAvUpaA4s1h9Ja58woaQ@mail.gmail.com

gen_prune_steps_from_opexps's notion of how to do this was overly
complicated and underly correct.
Per discussion of a report from Alan Jackson (though this fixes only one
aspect of that problem). Back-patch to v11 where this code came in.
Amit Langote
Discussion: https://postgr.es/m/FAD28A83-AC73-489E-A058-2681FA31D648@tvsquared.com

Cross-type comparison operators in a btree or hash opclass might be
only stable not immutable (this is true of timestamp vs. timestamptz
for example). partprune.c ignored this possibility and would perform
plan-time pruning with them anyway, possibly leading to wrong answers
if the environment changed between planning and execution.
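For example (hypothetical table partitioned on a timestamp column), the
timestamp-vs-timestamptz comparison operator is only stable, since its
result depends on the TimeZone setting, so the pruning decision cannot
safely be baked into the plan:

    SELECT * FROM parted_by_ts
    WHERE ts_col < timestamptz '2019-05-22 00:00:00+00';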
To fix, teach gen_partprune_steps() to do things differently when
creating plan-time pruning steps vs. run-time pruning steps.
analyze_partkey_exprs() also needs an extra check, which is rather
annoying but now is not the time to restructure things enough to
avoid that.
While at it, simplify the logic for the plan-time case a little
by insisting that the comparison value be a Const and nothing else.
This relies on the assumption that eval_const_expressions will have
reduced any immutable expression to a Const; which is not quite
100% true, but certainly any case that comes up often enough to be
interesting should have simplification logic there.
Also improve a bunch of inadequate/obsolete/wrong comments.
Per discussion of a report from Alan Jackson (though this fixes only one
aspect of that problem). Back-patch to v11 where this code came in.
David Rowley, with some further hacking by me
Discussion: https://postgr.es/m/FAD28A83-AC73-489E-A058-2681FA31D648@tvsquared.com

Remove a Berkeley-era comment above _bt_insertonpg() that admonishes the
reader to grok Lehman and Yao's paper before making any changes. This
made a certain amount of sense back when _bt_insertonpg() was
responsible for most of the things that are now spread across
_bt_insertonpg(), _bt_findinsertloc(), _bt_insert_parent(), and
_bt_split(), but it doesn't work like that anymore.
I believe that this comment alludes to the need to "couple" or "crab"
buffer locks as we ascend the tree as page splits cascade upwards. The
nbtree README already explains this in detail, which seems sufficient.
Besides, the changes to page splits made by commit 40dae7ec537 altered
the exact details of how buffer locks are retained during splits; Lehman
and Yao's original algorithm seems to release the lock on the left child
page/buffer slightly earlier than _bt_insertonpg()/_bt_insert_parent()
can.

Commit fab25024, which taught nbtree to choose candidate split points
more carefully, had _bt_findsplitloc() record all possible split points
in an initial pass over a page that is about to be split. The order
that candidate split points were processed and stored in was assumed to
match the offset number order of split points on an imaginary version of
the page that contains the same items as the original, but also fits
newitem (the item that provoked the split precisely because it didn't
fit).
However, the order of split points in the final array was not quite what
was expected: the split point that makes newitem the firstright item
came after the split point that makes newitem the lastleft item -- not
before. As a result, _bt_findsplitloc() could get confused about the
leftmost and rightmost tuples among all possible split points recorded
for the page. This seems to have no appreciable impact on the quality
of the final split point chosen by _bt_findsplitloc(), but it's still
wrong.
To fix, switch the order in which newitem candidate splits are recorded
in. This also makes it possible to describe candidate split points in
terms of which pair of adjoining tuples enclose the split point within
_bt_findsplitloc(), making it clearer why it's generally safe for
_bt_split() to expect lastleft and firstright tuples.

For some reason both the callsite and the heapam implementation had
the meaning inverted (i.e. succeeded == true was passed in case of
conflict). That's confusing.
I (Andres) briefly pondered whether it'd be better to rename
table_complete_speculative's argument to 'bool specConflict' or such,
but decided not to. The 'complete' in the function name for me makes
`succeeded` sound a bit better.
Reported-By: Ashwin Agrawal, Melanie Plageman, Heikki Linnakangas
Discussion:
https://postgr.es/m/CALfoeitk7-TACwYv3hCw45FNPjkA86RfXg4iQ5kAOPhR+F1Y4w@mail.gmail.com
https://postgr.es/m/97673451-339f-b21e-a781-998d06b1067c@iki.fi

This path previously was not reliably covered. There was some
heuristic coverage via insert-conflict-toast.spec, but that test is
not deterministic, and only tested for a somewhat specific bug.
Backpatch, as this is a complicated and otherwise untested code
path. Unfortunately 9.5 cannot handle two waiting sessions, and thus
cannot execute this test.
Triggered by a conversation with Melanie Plageman.
Author: Andres Freund
Discussion: https://postgr.es/m/CAAKRu_a7hbyrk=wveHYhr4LbcRnRCG=yPUVoQYB9YO1CdUBE9Q@mail.gmail.com
Backpatch: 9.5-

A src/test/isolation/expected/insert-conflict-specconflict.out
M src/test/isolation/isolation_schedule
A src/test/isolation/specs/insert-conflict-specconflict.spec

The original placement of this module in src/fe_utils/ is ill-considered,
because several src/common/ modules have dependencies on it, meaning that
libpgcommon and libpgfeutils now have mutual dependencies. That makes it
pointless to have distinct libraries at all. The intended design is that
libpgcommon is lower-level than libpgfeutils, so only dependencies from
the latter to the former are acceptable.
We already have the precedent that fe_memutils and a couple of other
modules in src/common/ are frontend-only, so it's not stretching anything
out of whack to treat logging.c as a frontend-only module in src/common/.
To the extent that such modules help provide a common frontend/backend
environment for the rest of common/ to use, it's a reasonable design.
(logging.c does not yet provide an ereport() emulation, but one can
dream.)
Hence, move these files over, and revert basically all of the build-system
changes made by commit cc8d41511. There are no places that need to grow
new dependencies on libpgcommon, further reinforcing the idea that this
is the right solution.
Discussion: https://postgr.es/m/a912ffff-f6e4-778a-c86a-cf5c47a12933@2ndquadrant.com

The existence of these files became rather confusing with the
introduction of a widely-known logging.h header in commit cc8d41511.
(Indeed, there's already some duplicative #includes here, perhaps
betraying such confusion.) The only thing left in them, after that
commit, is a progress-reporting function that's neither general-purpose
nor tied in any way to other logging infrastructure. Hence, let's just
move that function to pg_rewind.c, and get rid of the separate files.
Discussion: https://postgr.es/m/3971.1557787914@sss.pgh.pa.us

SQL's regular-expression substring() function is defined to have a
pattern argument that's separated into three subpatterns by escape-
double-quote markers; the function result is the part of the input
matching the second subpattern. The standard makes it clear that
if there is ambiguity about how to match the input to the subpatterns,
the first and third subpatterns should be taken to match the smallest
possible amount of text (i.e., they're "non greedy", in the terms of
our regex code). We were not doing it that way: the first subpattern
would eat the largest possible amount of text, causing the function
result to be shorter than what the spec requires.
Fix that by attaching explicit greediness quantifiers to the
subpatterns. (This depends on the regex fix in commit 8a29ed053;
before that, this didn't reliably change the regex engine's behavior.)
Also, by adding parentheses around each subpattern, we ensure that
"|" (OR) in the subpatterns behave sanely. Previously, "|" in the
first or third subpatterns didn't work.
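For reference, the canonical usage, where '#' is the escape character
and the result is the middle subpattern's match:

    SELECT substring('foobar' from '%#"o_b#"%' for '#');   -- oob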
This patch also makes the function throw error if you write more than
two escape-double-quote markers, and do something sane if you write
just one, and document that behavior. Previously, an odd number of
markers led to a confusing complaint about unbalanced parentheses,
while extra pairs of markers were just ignored. (Note that the spec
requires exactly two markers, but we've historically allowed there
to be none, and this patch preserves the old behavior for that case.)
In passing, adjust some substring() test cases that didn't really
prove what they said they were testing for: they used patterns
that didn't match the data string, so that the output would be
NULL whether or not the function was really strict.
Although this is certainly a bug fix, changing the behavior in back
branches seems undesirable: applications could perhaps be depending on
the old behavior, since it's not obviously wrong unless you read the
spec very closely. Hence, no back-patch.
Discussion: https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net

Previously, the code pointed the standard process-termination signals
to postgres.c's die(). That would typically result in an attempt to
execute a transaction abort, which is not possible in bootstrap mode,
leading to PANIC. This choice seems to be a leftover from an old code
structure in which the same signal-assignment code was used for many
sorts of auxiliary processes, including interactive standalone
backends. It's not very sensible for bootstrap mode, which has no
interest in either interactivity or continuing after an error. We can
get better behavior with less effort by just letting normal process
termination happen, after which the parent initdb process will clean up.
This is basically cosmetic in any case, since initdb will react the
same way whether bootstrap dies on a signal or abort(). Given the
lack of previous complaints, I don't feel a need to back-patch,
even though the behavior is old.
Discussion: https://postgr.es/m/3850b11a.5121.16aaf827e4a.Coremail.thunder1@126.com

Per previous convention (see
ace397e9d24eddc56e7dffa921f506117b602d78), drop SQL:2008 and only keep
the latest two standards and SQL-92.
Note: SQL:2016-2 lists a large number of non-reserved keywords that
are really just information_schema column names related to new
features. Those kinds of things have not previously been listed as
keywords, and this was apparently done here by mistake, since these
keywords have been removed again in post-2016 working drafts. So in
order to avoid bloating the keywords table unnecessarily, I have
omitted these erroneous keywords here.

As we descend the GiST tree during insertion, we modify any downlinks on
the way down to include the new tuple we're about to insert (if they don't
cover it already). Modifying an existing downlink might cause an internal
page to split, if the new downlink tuple is larger than the old one. If
that happens, we need to back up to the parent and re-choose a page to
insert to. We used to detect that situation, thanks to the NSN-LSN
interlock normally used to detect concurrent page splits, but that got
broken by commit 9155580fd5. With that commit, we now use a dummy constant
LSN value for every page during index build, so the LSN-NSN interlock no
longer works. I thought that was OK because there can't be any other
backends modifying the index during index build, but missed that the
insertion itself can modify the page we're inserting to. The consequence
was that we would sometimes insert the new tuple to an incorrect page, one
whose downlink doesn't cover the new tuple.
To fix, add a flag to the stack that keeps track of the state while
descending tree, to indicate that a page was split, and that we need to
retry the descend from the parent.
Thomas Munro first reported that the contrib/intarray regression test was
failing occasionally on the buildfarm after commit 9155580fd5. The failure
was intermittent, because the gistchoose() function is not deterministic,
and would only occasionally create the right circumstances for this bug to
cause the failure.
Patch by Anastasia Lubennikova, with some changes by me to make it work
correctly also when the internal page split also causes the "grandparent"
to be split.
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJRzLo7tZExWfSbwM3XuK7aAK7FhdBV0FLkbUG%2BW0v0zg%40mail.gmail.com

The conditions listed in this comment have changed several times, and at
some point the thing that the "if so" referred to was negated.
The text was OK up to 9.6. It was differently wrong in v10, v11 and
master, so fix in all those versions.

The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous. Only
ItemPointerData/ItemPointer variables should be called item pointers.
To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers". The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.
Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com

An upcoming HEAD-only patch will standardize the terminology around
ItemIdData variables/line pointers, ending the practice of referring to
them as "item pointers". Make the "Database Page Layout" docs
consistent with the new policy. The term "item identifier" is already
used in the same section, so stick with that.
Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
Backpatch: All supported branches.

Only hand-assigned type OIDs should be presumed to match across different
PG servers; those assigned during genbki.pl or during initdb are likely
to change due to addition or removal of unrelated objects.
This means that the cutoff should be FirstGenbkiObjectId (in HEAD)
or FirstBootstrapObjectId (before that), not FirstNormalObjectId.
Compare postgres_fdw's is_builtin() test.
It's likely that this error has no observable consequence in a
normally-functioning system, since ATM the only affected type OIDs are
system catalog rowtypes and information_schema types, which would not
typically be interesting for logical replication. But you could
probably break it if you tried hard, so back-patch.
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us

The FirstNormalObjectId test here is a kluge that needs to go away,
but the only substitute we can think of is to add a column to pg_class,
which will take more work than can be handled right now. Add some
commentary in the meanwhile.
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us

Commit 8fa30f906be reduced the elevel of a number of "can't happen"
_bt_split() errors from PANIC to ERROR. At the same time, the new right
page buffer for the split could continue to be acquired well before the
critical section. This was possible because it was relatively
straightforward to make sure that _bt_split() could not throw an error,
with a few specific exceptions. The exceptional cases were safe because
they involved specific, well understood errors, making it possible to
consistently zero the right page before actually raising an error using
elog(). There was no danger of leaving around a junk page, provided
_bt_split() stuck to this coding rule.
Commit 8224de4f, which introduced INCLUDE indexes, added code to make
_bt_split() truncate away non-key attributes. This happened at a point
that broke the rule around zeroing the right page in _bt_split(). If
truncation failed (perhaps due to palloc() failure), that would result
in an errant right page buffer with junk contents. This could confuse
VACUUM when it attempted to delete the page, and should be avoided on
general principle.
To fix, reorganize _bt_split() so that truncation occurs before the new
right page buffer is even acquired. A junk page/buffer will not be left
behind if _bt_nonkey_truncate()/_bt_truncate() raise an error.
Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Backpatch: 11-, where INCLUDE indexes were introduced.

The comment implies that a 1 in the null bitmap indicates a null value,
but actually a 0 in the null bitmap indicates a null value. Try to
be more clear.
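The behavior can be observed with the pageinspect extension
(hypothetical table; the exact t_bits output may vary with tuple
layout):

    CREATE EXTENSION pageinspect;
    CREATE TABLE t (a int, b int);
    INSERT INTO t VALUES (1, NULL);
    SELECT t_bits FROM heap_page_items(get_raw_page('t', 0));
    -- expect something like '10000000': the 0 in the second position
    -- marks column b as null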
Patch by me; proposed wording reviewed by Alvaro Herrera and Tom Lane.
Discussion: http://postgr.es/m/CA+TgmobHOP8r6cG+UnsDFMrS30-m=jRrCBhgw-nFkn0k9QnFsg@mail.gmail.com

pgtls_read_pending is declared to return bool, but what the underlying
SSL_pending function returns is a count of available bytes.
This is actually somewhat harmless if we're using C99 bools, but in
the back branches it's a live bug: if the available-bytes count happened
to be a multiple of 256, it would get converted to a zero char value.
On machines where char is signed, counts of 128 and up could misbehave
as well. The net effect is that when using SSL, libpq might block
waiting for data even though some has already been received.
Broken by careless refactoring in commit 4e86f1b16, so back-patch
to 9.5 where that came in.
Per bug #15802 from David Binderman.
Discussion: https://postgr.es/m/15802-f0911a97f0346526@postgresql.org

equalsJsonbScalarValue() has a boolean return type, yet one code path
returned -1, which is confusing. The confusion evidently arose because
this code was copy-pasted from compareJsonbScalarValue() when it was
introduced in d1d50bf.
No backpatch, as this is only cosmetic.
Author: Rikard Falkeborn
Discussion: https://postgr.es/m/CADRDgG7mJnek6HNW13f+LF6V=6gag9PM+P7H5dnyWZAv49aBGg@mail.gmail.com

A bounded quantifier with m = n = 1 might be thought a no-op. But
according to our documentation (which traces back to Henry Spencer's
original man page) it still imposes greediness, or non-greediness in the
case of the non-greedy variant "{1,1}?", on whatever it's attached to.
This turns out not to work though, because parseqatom() optimizes away
the m = n = 1 case without regard for whether it's supposed to change
the greediness of the argument RE.
We can fix this by just not applying the optimization when the greediness
needs to change; the subsequent general cases handle it fine.
The three cases in which we can still apply the optimization are
(a) no quantifier, or quantifier does not impose a preference;
(b) atom has no greediness property, implying it cannot match a
variable amount of text anyway; or
(c) quantifier's greediness is same as atom's.
Note that in most cases where one of these applies, we'd have exited
earlier in the "not a messy case" fast path. I think it's now only
possible to get to the optimization when the atom involves capturing
parentheses or a non-top-level backref.
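A sketch of the now-honored behavior (hypothetical example): with the
fix, the non-greedy group prefers the shortest possible match, so the
first query should return an empty string where it previously returned
'abc':

    SELECT substring('abcabc' from '(.*){1,1}?abc');  -- non-greedy
    SELECT substring('abcabc' from '(.*){1,1}abc');   -- greedy: abc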
Back-patch to all supported branches. I'd ordinarily be hesitant to
put a subtle behavioral change into back branches, but in this case
it's very hard to see a reason why somebody would write "{1,1}?" unless
they're trying to get the documented change-of-greediness behavior.
Discussion: https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net

The function had been interpreting SQL_ASCII messages as UTF8, throwing
an error when they were invalid UTF8. The new behavior is consistent
with pg_do_encoding_conversion(). This affects LOG_DESTINATION_STDERR
and LOG_DESTINATION_EVENTLOG, which will send untranslated bytes to
write() and ReportEventA(). On buildfarm member bowerbird, enabling
log_connections caused an error whenever the role name was not valid
UTF8. Back-patch to 9.4 (all supported versions).
Discussion: https://postgr.es/m/20190512015615.GD1124997@rfd.leadboat.com

We long ago decided to design the shared PgBackendStatus data structure to
minimize the cost of writing status updates, which means that writers just
have to increment the st_changecount field twice. That isn't hooked into
any sort of resource management mechanism, which means that if something
were to throw error between the two increments, the st_changecount field
would be left odd indefinitely. That would cause readers to lock up.
Now, since it's also a bad idea to leave the field odd for longer than
absolutely necessary (because readers will spin while we have it set),
the expectation was that we'd treat these segments like spinlock critical
sections, with only short, more or less straight-line, code in them.
That was fine as originally designed, but commit 9029f4b37 broke it
by inserting a significant amount of non-straight-line code into
pgstat_bestart(), code that is very capable of throwing errors, not to
mention taking a significant amount of time during which readers will spin.
We have a report from Neeraj Kumar of readers actually locking up, which
I suspect was due to an encoding conversion error in X509_NAME_to_cstring,
though conceivably it was just a garden-variety OOM failure.
Subsequent commits have loaded even more dubious code into pgstat_bestart's
critical section (and commit fc70a4b0d deserves some kind of booby prize
for managing to miss the critical section entirely, although the negative
consequences seem minimal given that the PgBackendStatus entry should be
seen by readers as inactive at that point).
The right way to fix this mess seems to be to compute all these values
into a local copy of the process' PgBackendStatus struct, and then just
copy the data back within the critical section proper. This plan can't
be implemented completely cleanly because of the struct's heavy reliance
on out-of-line strings, which we must initialize separately within the
critical section. But still, the critical section is far smaller and
safer than it was before.
In hopes of forestalling future errors of the same ilk, rename the
macros for st_changecount management to make it more apparent that
the writer-side macros create a critical section. And to prevent
the worst consequences if we nonetheless manage to mess it up anyway,
adjust those macros so that they really are a critical section, ie
they now bump CritSectionCount. That doesn't add much overhead, and
it guarantees that if we do somehow throw an error while the counter
is odd, it will lead to PANIC and a database restart to reset shared
memory.
Back-patch to 9.5 where the problem was introduced.
In HEAD, also fix an oversight in commit b0b39f72b: it failed to teach
pgstat_read_current_status to copy st_gssstatus data from shared memory to
local memory. Hence, subsequent use of that data within the transaction
would potentially see changing data that it shouldn't see.
Discussion: https://postgr.es/m/CAPR3Wj5Z17=+eeyrn_ZDG3NQGYgMEOY6JV6Y-WRRhGgwc16U3Q@mail.gmail.com

The buildfarm client uses TEMP_CONFIG to implement its extra_config
setting. Except for stats_temp_directory, extra_config now applies to
TAP suites; extra_config values seen in the past month are compatible
with this. Back-patch to 9.6, where PostgresNode was introduced, so the
buildfarm can rely on it sooner.
Reviewed by Andrew Dunstan and Tom Lane.
Discussion: https://postgr.es/m/20181229021950.GA3302966@rfd.leadboat.com

When failing to reindex a table or an index, reindexdb would generate an
extra error message related to a database failure, which is misleading.
Backpatch all the way down, as this was introduced by 85e9a5a0.
Discussion: https://postgr.es/m/CAOBaU_Yo61RwNO3cW6WVYWwH7EYMPuexhKqufb2nFGOdunbcHw@mail.gmail.com
Author: Julien Rouhaud
Reviewed-by: Daniel Gustafsson, Álvaro Herrera, Tom Lane, Michael
Paquier
Backpatch-through: 9.4

As none of the approaches for avoiding the deadlock issues seem
promising enough, and all the expected reindex related changes have
been made, apply 60c2951e1bab7e to master as well.
Discussion: https://postgr.es/m/4622.1556982247@sss.pgh.pa.us

There's a very old race condition in our code to see whether a pre-existing
shared memory segment is still in use by a conflicting postmaster: it's
possible for the other postmaster to remove the segment in between our
shmctl() and shmat() calls. It's a narrow window, and there's no risk
unless both postmasters are using the same port number, but that's possible
during parallelized "make check" tests. (Note that while the TAP tests
take some pains to choose a randomized port number, pg_regress doesn't.)
If it does happen, we treated that as an unexpected case and errored out.
To fix, allow EINVAL to be treated as segment-not-present, and the same
for EIDRM on Linux. AFAICS, the considerations here are basically
identical to the checks for acceptable shmctl() failures, so I documented
and coded it that way.
While at it, adjust PGSharedMemoryAttach's API to remove its undocumented
dependency on UsedShmemSegAddr in favor of passing the attach address
explicitly. This makes it easier to be sure we're using a null shmaddr
when probing for segment conflicts (thus avoiding questions about what
EINVAL means). I don't think there was a bug there, but it required
fragile assumptions about the state of UsedShmemSegAddr during
PGSharedMemoryIsInUse.
Commit c09850992 may have made this failure more probable by applying
the conflicting-segment tests more often. Hence, back-patch to all
supported branches, as that was.
Discussion: https://postgr.es/m/22224.1557340366@sss.pgh.pa.us

The description of the lock type for speculative insertions was
incorrect, being copy-pasted from another one.
As discussed, also move the description for all the fields of lock tag
types from the structure listing lock tag types to the set of macros
setting each LOCKTAG.
Author: John Naylor
Discussion: https://postgr.es/m/CACPNZCtA0-ybaC4fFfaDq_8p_TUOLvGxZH9Dm-=TMHZJarBa7Q@mail.gmail.com

This improves the user experience when it comes to restrict several
flavors of REINDEX CONCURRENTLY. First, for INDEX, remove a restriction
on shared relations, as we already check for catalog relations. Then,
for TABLE, add a proper error message when attempting to run the command
on system catalogs. The code path of CREATE INDEX CONCURRENTLY already
complains about that, but if a REINDEX is issued then the error
generated is confusing.
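For example, this now fails with a clear error message instead of a
confusing one:

    REINDEX TABLE CONCURRENTLY pg_catalog.pg_class;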
While at it, add more tests to check restrictions on catalog indexes and
on toast table/index for catalogs. Some error messages are improved,
with wording suggestion coming from Tom Lane.
Reported-by: Tom Lane
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/23694.1556806002@sss.pgh.pa.us

create_merge_append_plan failed to honor the CP_EXACT_TLIST flag:
it would generate the expected targetlist but then it felt free to
add resjunk sort targets to it. This demonstrably leads to assertion
failures in v11 and HEAD, and it's probably just accidental that we
don't see the same in older branches. I've not looked into whether
there would be any real-world consequences in non-assert builds.
In HEAD, create_append_plan has sprouted the same problem, so fix
that too (although we do not have any test cases that seem able to
reach that bug). This is an oversight in commit 3fc6e2d7f which
invented the CP_EXACT_TLIST flag, so back-patch to 9.6 where that
came in.
convert_subquery_pathkeys would create pathkeys for subquery output
values if they match any EquivalenceClass known in the outer query
and are available in the subquery's syntactic targetlist. However,
the second part of that condition is wrong, because such values might
not appear in the subquery relation's reltarget list, which would
mean that they couldn't be accessed above the level of the subquery
scan. We must check that they appear in the reltarget list, instead.
This can lead to dropping knowledge about the subquery's sort
ordering, but I believe it's okay, because any sort key that the
outer query actually has any interest in would appear in the
reltarget list.
This second issue is of very long standing, but right now there's no
evidence that it causes observable problems before 9.6, so I refrained
from back-patching further than that. We can revisit that choice if
somebody finds a way to make it cause problems in older branches.
(Developing useful test cases for these issues is really problematic;
fixing convert_subquery_pathkeys removes the only known way to exhibit
the create_merge_append_plan bug, and neither of the test cases added
by this patch causes a problem in all branches, even when considering
the issues separately.)
The second issue explains bug #15795 from Suresh Kumar R ("could not
find pathkey item to sort" with nested DISTINCT queries). I stumbled
across the first issue while investigating that.
Discussion: https://postgr.es/m/15795-fadb56c8e44ee73c@postgresql.org

In commit d50d172e51, which added support for LIMIT/OFFSET pushdown in
postgres_fdw, a new struct was introduced as the extra parameter of
GetForeignUpperPaths() set for UPPERREL_FINAL, but I forgot to update
the documentation to mention that.
Author: Etsuro Fujita
Discussion: https://postgr.es/m/CAPmGK17uSXQDe31oRb-z1nYyT6vVzkstZkA3_Wbq38U92b9BmQ%40mail.gmail.com

In commit 7012b132d0, which added support for aggregate pushdown in
postgres_fdw, the expense of evaluating the final scan/join target
computed by make_group_input_target() was not accounted for at all in
costing aggregate pushdown paths with local statistics. The right fix
for this would be to have a separate upper stage to adjust the final
scan/join relation (see comments for apply_scanjoin_target_to_paths());
but for now, fix by adding the tlist eval cost when costing aggregate
pushdown paths with local statistics.
Apply this to HEAD only to avoid destabilizing existing plan choices.
Author: Etsuro Fujita
Reviewed-By: Antonin Houska
Discussion: https://postgr.es/m/5C66A056.60007%40lab.ntt.co.jp

Commit bb16aba50 broke the code that maintains SxactGlobalXmin. It
could get stuck when a well-timed READ ONLY transaction runs. If
SxactGlobalXmin stops advancing, transactions on the
FinishedSerializableTransactions queue are never cleaned up, so
resources are effectively leaked. Revert that hunk of the commit.
Also revert another similar hunk that was probably harmless, but
unnecessary and unjustified, relating to the DOOMED flag in case of
RO_SAFE early release.
Author: Thomas Munro
Reported-by: Tom Lane
Discussion: https://postgr.es/m/16170.1557251214%40sss.pgh.pa.us

The right way for IsCatalogRelation/Class to behave is to return true
for OIDs less than FirstBootstrapObjectId (not FirstNormalObjectId),
without any of the ad-hoc fooling around with schema membership.
The previous code was wrong because (1) it claimed that
information_schema tables were not catalog relations but their toast
tables were, which is silly; and (2) if you dropped and recreated
information_schema, which is a supported operation, the behavior
changed. That's even sillier. With this definition, "catalog
relations" are exactly the ones traceable to the postgres.bki data,
which seems like what we want.
With this simplification, we don't actually need access to the pg_class
tuple to identify a catalog relation; we only need its OID. Hence,
replace IsCatalogClass with "IsCatalogRelationOid(oid)". But keep
IsCatalogRelation as a convenience function.
This allows fixing some arguably-wrong semantics in contrib/sepgsql and
ReindexRelationConcurrently, which were using an IsSystemNamespace test
where what they really should be using is IsCatalogRelationOid. The
previous coding failed to protect toast tables of system catalogs, and
also was not on board with the general principle that user-created tables
do not become catalogs just by virtue of being renamed into pg_catalog.
We can also get rid of a messy hack in ReindexMultipleTables.
While we're at it, also rename IsSystemNamespace to IsCatalogNamespace,
because the previous name invited confusion with the more expansive
semantics used by IsSystemRelation/Class.
Also improve the comments in catalog.c.
There are a few remaining places in replication-related code that are
special-casing OIDs below FirstNormalObjectId. I'm inclined to think
those are wrong too, and if there should be any special case it should
just extend to FirstBootstrapObjectId. But first we need to debate
whether a FOR ALL TABLES publication should include information_schema.
Discussion: https://postgr.es/m/21697.1557092753@sss.pgh.pa.us
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us

When running a batch of VACUUM or ANALYZE commands on a given database,
there were cases where vacuumdb would not report an error when it
actually should, leading to incorrect status results.
Author: Julien Rouhaud
Reviewed-by: Amit Kapila, Michael Paquier
Discussion: https://postgr.es/m/CAOBaU_ZuTwz7CtqLYJ1Ouuh272bTQPLN8b1bAPk0bCBm4PDMTQ@mail.gmail.com
Backpatch-through: 9.5

Commit dd299df8189, which added suffix truncation to nbtree, simplified
the WAL record format used by page splits. It became necessary to
explicitly WAL-log the new high key for the left half of a split in all
cases, which relieved the REDO routine from having to reconstruct a new
high key for the left page by copying the first item from the right
page. Remove a comment that referred to the previous practice.

Some messages related to foreign servers were reporting the server name
without quotes, or not at all; our style is to have all names be quoted,
and the server name already appears quoted in a few other messages, so
just add quotes and make them all consistent.
Remove an extra "s" in other messages (typos introduced by myself in
f56f8f8da6af).

Previously the documentation stated that use of replication functions
is restricted to superusers. This is true for the functions that use a
replication origin, but not for pg_logical_emit_message() and the
functions that use a replication slot. For example, not only
superusers but also users with the REPLICATION privilege are allowed
to use the replication-slot functions. This commit fixes the
documentation of the privileges required for those replication
functions.
Back-patch to 9.4 (all supported versions).
Author: Matsumura Ryo
Discussion: https://postgr.es/m/03040DFF97E6E54E88D3BFEE5F5480F74ABA6E16@G01JPEXMBYT04

REINDEX CONCURRENTLY locks tables with ShareUpdateExclusiveLock rather
than the ShareLock used by a plain REINDEX. However,
RangeVarCallbackForReindexIndex() was not updated for that and still
used the ShareLock only. This would lead to lock upgrades later,
leading to possible deadlocks.
Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/20190430151735.wi52sxjvxsjvaxxt%40alap3.anarazel.de

Commit c0985099, later adjusted by commit 4ab02e81, probed 0.0.0.0
in addition to 127.0.0.1, for the benefit of Windows build farm
animals. It isn't really useful on Unix systems, and turned out to
be a bit inconvenient to users of some corporate firewall software.
Switch back to probing just 127.0.0.1 on non-Windows systems.
Back-patch to 9.6, like the earlier changes.
Discussion: https://postgr.es/m/CA%2BhUKG%2B21EPwfgs4m%2BtqyRtbVqkOUvP8QQ8sWk9%2Bh55Aub1H3A%40mail.gmail.com

It is no longer possible under any circumstances for nbtree code to
reconstruct a strict lower bound key (parent page's pivot tuple key) for
a right sibling page by retrieving the first item in the right sibling
page.

Word "singleton" is hard for user understanding, especially taking into account
there is only one place it's used in the docs and there is even no definition.
Use more evident wording instead.
Discussion: https://postgr.es/m/23737.1556550645%40sss.pgh.pa.us

This commit adds a new parameter to the VACUUM command, TRUNCATE,
which specifies that VACUUM should attempt to truncate off
any empty pages at the end of the table and allow the disk space
for the truncated pages to be returned to the operating system.
This parameter, if specified, overrides the vacuum_truncate
reloption. If neither the reloption nor the VACUUM option is
used, the default is true, as before.
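For example (hypothetical table name):

    VACUUM (TRUNCATE false) tab;  -- keep trailing empty pages
    VACUUM (TRUNCATE) tab;        -- equivalent to TRUNCATE true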
Author: Fujii Masao
Reviewed-by: Julien Rouhaud, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoD+qtrSDL=GSma4Wd3kLYLeRC0hPna-YAdkDeV4z156vg@mail.gmail.com

On a 64-bit machine, if you set track_activity_query_size and
max_connections such that their product exceeds 1GB, shared memory
setup will still succeed (given enough RAM), but attempts to read
pg_stat_activity fail with "invalid memory alloc request size".
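For instance, with hypothetical settings whose product exceeds 1GB
(1048576 bytes * 1100 connections):

    ALTER SYSTEM SET track_activity_query_size = 1048576;
    ALTER SYSTEM SET max_connections = 1100;
    -- after a restart, SELECT * FROM pg_stat_activity; previously
    -- raised "invalid memory alloc request size"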
Work around that by using MemoryContextAllocHuge to allocate the
local copy of the activity strings. Using the "huge" API costs us
nothing extra in normal cases, and it seems better than throwing
an error and/or explaining to people why they can't do this.
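For illustration, hypothetical settings such as
    ALTER SYSTEM SET max_connections = 20000;
    ALTER SYSTEM SET track_activity_query_size = 100000;
would, after a restart, put the local copy of the activity strings at
roughly 2GB, past the old 1GB allocation limit.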
This situation seems insanely profligate today, but who knows what
people will consider normal in ten or twenty years? So let's fix it
in HEAD but not worry about a back-patch.
Per report from James Tomson.
Discussion: https://postgr.es/m/1CFDCCD6-B268-48D8-85C8-400D2790B2C3@pushd.com

The SQL keywords table in the documentation had until now been
generated by some ad hoc scripting outside the source tree once for
each major release. This changes it to an automated process.
We have the PostgreSQL keywords available in a parseable format in
parser/kwlist.h. For the relevant SQL standard versions, keep the
keyword lists in new text files. A new script
generate-keywords-table.pl pulls it all together and produces a
DocBook table.
The final output in the documentation should be identical after this
change.
Discussion: https://www.postgresql.org/message-id/flat/07daeadd-8c82-0d95-5e19-e350502cb749%402ndquadrant.com

M doc/src/sgml/.gitignore
M doc/src/sgml/Makefile
M doc/src/sgml/filelist.sgml
A doc/src/sgml/generate-keywords-table.pl
M doc/src/sgml/keywords.sgml
A doc/src/sgml/keywords/sql1992-nonreserved.txt
A doc/src/sgml/keywords/sql1992-reserved.txt
A doc/src/sgml/keywords/sql2008-02-nonreserved.txt
A doc/src/sgml/keywords/sql2008-02-reserved.txt
A doc/src/sgml/keywords/sql2008-09-nonreserved.txt
A doc/src/sgml/keywords/sql2008-09-reserved.txt
A doc/src/sgml/keywords/sql2008-14-nonreserved.txt
A doc/src/sgml/keywords/sql2008-14-reserved.txt
A doc/src/sgml/keywords/sql2011-02-nonreserved.txt
A doc/src/sgml/keywords/sql2011-02-reserved.txt
A doc/src/sgml/keywords/sql2011-09-nonreserved.txt
A doc/src/sgml/keywords/sql2011-09-reserved.txt
A doc/src/sgml/keywords/sql2011-14-nonreserved.txt
A doc/src/sgml/keywords/sql2011-14-reserved.txt

Revert "Avoid the creation of the free space map for small heap relations".

This feature was using a process-local map to track the first few blocks
in the relation. The map was reset each time we got a block with enough
free space. It was discussed that it would be better to track this map on
a per-relation basis in relcache and then invalidate it whenever
vacuum frees up some space in a page or when the FSM is created. The new
design would be better both in terms of API design and performance.
List of commits reverted, in reverse chronological order:
06c8a5090e Improve code comments in b0eaa4c51b.
13e8643bfc During pg_upgrade, conditionally skip transfer of FSMs.
6f918159a9 Add more tests for FSM.
9c32e4c350 Clear the local map when not used.
29d108cdec Update the documentation for FSM behavior.
08ecdfe7e5 Make FSM test portable.
b0eaa4c51b Avoid creation of the free space map for small heap relations.
Discussion: https://postgr.es/m/20190416180452.3pm6uegx54iitbt5@alap3.anarazel.de

This code was broken as of 582edc3, and is most likely not used anymore.
Note that pg_dump supports servers down to 8.0, and psql has code to
support servers down to 7.4.
Author: Julien Rouhaud
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAOBaU_Y5y=zo3+2gf+2NJC1pvMYPcbRXoQaPXx=U7+C8Qh4CzQ@mail.gmail.com

... and fallout (from branches 10, 11 and master). The change was
ill-considered, and it broke a few normal use cases; since we don't have
time to fix it, we'll try again after this week's minor releases.
Reported-by: Rushabh Lathia
Discussion: https://postgr.es/m/CAGPqQf0iQV=PPOv2Btog9J9AwOQp6HmuVd6SbGTR_v3Zp2XT1w@mail.gmail.com

This adds extra tests for the error message generated for partition
tuple routing in the executor, using more than three levels of
partitioning including partitioned tables with no partitions. These
tests have been added to fix CVE-2019-10129 on REL_11_STABLE. HEAD has
no active bugs in this area, but it lacked coverage.
Author: Michael Paquier
Reviewed-by: Noah Misch
Security: CVE-2019-10129

In examine_variable() and examine_simple_variable(), when checking the
user's table and column privileges to determine whether to grant
access to the pg_statistic data, use checkAsUser for the privilege
checks, if it's set. This will be the case if we're accessing the
table via a view, to indicate that we should perform privilege checks
as the view owner rather than the current user.
This change makes this planner check consistent with the check in the
executor, so the planner will be able to make use of statistics if the
table is accessible via the view. This fixes a performance regression
introduced by commit e2d4ef8de8, which affects queries against
non-security barrier views in the case where the user doesn't have
privileges on the underlying table, but the view owner does.
Note that it continues to provide the same safeguards controlling
access to pg_statistic for direct table access (in which case
checkAsUser won't be set) and for security barrier views, because of
the nearby checks on rte->security_barrier and rte->securityQuals.
Back-patch to all supported branches because e2d4ef8de8 was.
Dean Rasheed, reviewed by Jonathan Katz and Stephen Frost.

In commit e2d4ef8de8, security checks were added to prevent
user-supplied operators from running over data from pg_statistic
unless the user has table or column privileges on the table, or the
operator is leakproof. For a table with RLS, however, checking for
table or column privileges is insufficient, since that does not
guarantee that the user has permission to view all of the column's
data.
Fix this by also checking for securityQuals on the RTE, and insisting
that the operator be leakproof if there are any. Thus the
leakproofness check will only be skipped if there are no securityQuals
and the user has table or column privileges on the table -- i.e., only
if we know that the user has access to all the data in the column.
Back-patch to 9.5 where RLS was added.
Dean Rasheed, reviewed by Jonathan Katz and Stephen Frost.
Security: CVE-2019-10130

Noticed while reviewing nearby code. Given all the disclaimers about
this not being meant as user-facing code, I wonder whether we should
make these non-translatable? But in any case there's little excuse
for them not to be good English.

Project style is to check the success of SearchSysCacheN and friends
by applying HeapTupleIsValid to the result. A tiny minority of calls
creatively did it differently. Bring them into line with the rest.
This is just cosmetic, since HeapTupleIsValid is indeed just a null
check at the moment ... but that may not be true forever, and in any
case it puts a mental burden on readers who may wonder why these
call sites are not like the rest.
Back-patch to v11 just to keep the branches in sync. (The bulk of these
errors seem to have originated in v11 or v12, though a few are old.)
Per searching to see if anyplace else had made the same error
repaired in 62148c352.

Omitted in commit 05b38c7e6 (though it looks like the original blame
belongs to 9e9befac4). A failure is admittedly unlikely, but if it
did happen, SIGSEGV is not the approved method of reporting it.
Per Coverity. Back-patch to v11 where the broken code originated.

M src/backend/commands/indexcmds.c

Fix set of issues with memory-allocation system calls in frontend code

Like the backend, the frontend has wrappers on top of malloc() and such
whose use is recommended. Particularly, it is possible to do memory
allocation without issuing an error. Some binaries missed the use of
those wrappers, so let's fix the gap for consistency.
This also fixes two latent bugs:
- In pg_dump/pg_dumpall when parsing an ACL item, on an out-of-memory
error for strdup(), the code considered the failure as an ACL parsing
problem instead of an actual OOM.
- In pg_waldump, an OOM when building the target directory string would
cause a crash.
Author: Daniel Gustafsson
Discussion: https://postgr.es/m/gY0y9xenfoBPc-Tufsr2Zg-MmkrJslm0Tw_CMg4p_j58-k_PXNC0klMdkKQkg61BkXC9_uWo-DcUzfxnHqpkpoR5jjVZrPHqKYikcHIiONhg=@yesql.se

Commit 3f342839 corrected obsolete comments about buffer locks at the
main _bt_insert_parent() call site, but missed similar obsolete comments
above _bt_insert_parent() itself. Both sets of comments were rendered
obsolete by commit 40dae7ec537, which made the nbtree page split
algorithm more robust. Now fix the comments that were missed the
first time around.
In passing, refine a related _bt_insert_parent() comment about
re-finding the parent page to insert a new downlink.

In the wake of commit f912d7dec, RelationSetIndexList isn't used any
more. It was always a horrid wart, so getting rid of it is very nice.
We can also convert rd_indexvalid back to a plain boolean.
Discussion: https://postgr.es/m/28926.1556664156@sss.pgh.pa.us

Commits 3dbb317d3 et al failed under CLOBBER_CACHE_ALWAYS testing.
Investigation showed that to reindex pg_class_oid_index, we must
suppress accesses to the index (via SetReindexProcessing) before we call
RelationSetNewRelfilenode, or at least before we do CommandCounterIncrement
therein; otherwise, relcache reloads happening within the CCI may try to
fetch pg_class rows using the index's new relfilenode value, which is as
yet an empty file.
Of course, the point of 3dbb317d3 was that that ordering didn't work
either, because then RelationSetNewRelfilenode's own update of the index's
pg_class row cannot access the index, should it need to.
There are various ways we might have got around that, but Andres Freund
came up with a brilliant solution: for a mapped index, we can really just
skip the pg_class update altogether. The only fields it was actually
changing were relpages etc, but it was just setting them to zeroes which
is useless make-work. (Correct new values will be installed at the end
of index build.) All pg_class indexes are mapped and probably always will
be, so this eliminates the problem by removing work rather than adding it,
always a pleasant outcome. Having taught RelationSetNewRelfilenode to do
it that way, we can revert the code reordering in reindex_index. (But
I left the moved setup code where it was; there seems no reason why it
has to run without use of the old index. If you're trying to fix a
busted pg_class index, you'll have had to disable system index use
altogether to get this far.)
Moreover, this means we don't need RelationSetIndexList at all, because
reindex_relation's hacking to make "REINDEX TABLE pg_class" work is
likewise now unnecessary. We'll leave that code in place in the back
branches, but a follow-on patch will remove it in HEAD.
In passing, do some minor cleanup for commit 5c1560606 (in HEAD only),
notably removing a duplicate newrnode assignment.
Patch by me, using a core idea due to Andres Freund. Back-patch to all
supported branches, as 3dbb317d3 was.
Discussion: https://postgr.es/m/28926.1556664156@sss.pgh.pa.us

Commit d2599ecfcc74 introduced some contorted, confused code here:
readers would think that it's possible for HeapTupleHeaderGetXmin to
a non-frozen value for some frozen tuples, which would be disastrous.
There's no actual bug, but it seems better to make it clearer.
Per gripe from Tom Lane and Andres Freund.
Discussion: https://postgr.es/m/30116.1555430496@sss.pgh.pa.us

Commit dd299df8189, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe. In general, suffix truncation needs to have a small
amount of extra space available on the new left page when splitting a
leaf page. This is needed in case it turns out that truncation cannot
even "truncate away the heap TID column", resulting in a
larger-than-firstright leaf high key with an explicit heap TID
representation.
Despite all this, CREATE INDEX/nbtsort.c did not account for the
possible need for extra heap TID space on leaf pages when deciding
whether or not a new item could fit on the current page. This could lead
to "failed to add item to the index page" errors when CREATE
INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a
larger-than-firstright leaf high key (it only had space for the firstright
tuple, which was just short of what was needed following "truncation").
Several conditions needed to be met all at once for CREATE INDEX to
fail. The problem was in the hard limit on what will fit on a page,
which tends to be masked by the soft fillfactor-wise limit. The easiest
way to recreate the problem seems to be a CREATE INDEX on a low
cardinality text column, with tuples that are of non-uniform width,
using a fillfactor of 100.
To fix, bring nbtsort.c in line with nbtsplitloc.c, which already
pessimistically assumes that all leaf page splits will have high keys
that have a heap TID appended.
Reported-By: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena

The new nleft_dead_tuples and nleft_dead_itemids fields are confusing
and do not seem like the correct way forward. One of them is tested
via an assertion that can fail, as it has already done on buildfarm
member topminnow. Remove the assertion and the fields.
Change the logic for the case where a tuple is not initially pruned
by heap_page_prune but later diagnosed HEAPTUPLE_DEAD by
HeapTupleSatisfiesVacuum. Previously, tupgone = true was set in
that case, which leads to treating the tuple as one that will be
removed. In a normal vacuum, that's OK, because we'll remove
index entries for it and then the second heap pass will remove the
tuple itself, but when index cleanup is disabled, those things
don't happen, so we must instead treat it as a recently-dead
tuple that we have voluntarily chosen to keep.
Report and analysis by Tom Lane. This patch is loosely based on one
from Masahiko Sawada, but I changed most of it.

The message types for temp files and for checksum failures were missing
from the union. Due to the coding style used, there was no compiler error
when this happened. So change the code to actively use the union, thereby
producing a compiler error if the same mistake happens again, as suggested
by Tom Lane.
Author: Julien Rouhaud
Reported-By: Tomas Vondra
Discussion: https://postgr.es/m/20190430163328.zd4rrlnbvgaqlcdz@development

M src/backend/postmaster/pgstat.c
M src/include/pgstat.h

Run catalog reindexing test from 3dbb317d32 serially, to avoid deadlocks.

The tests turn out to cause deadlocks in some circumstances. Fairly
reproducibly so with -DRELCACHE_FORCE_RELEASE
-DCATCACHE_FORCE_RELEASE. Some of the deadlocks may be hard to fix
without disproportionate measures, but others probably should be fixed
- but not in 12.
We discussed removing the new tests until we can fix the issues
underlying the deadlocks, but results from buildfarm animal
markhor (which runs with CLOBBER_CACHE_ALWAYS) indicate that there
might be a more severe, as yet undiagnosed, issue (including on
stable branches) with reindexing catalogs. The failure is:
ERROR: could not read block 0 in file "base/16384/28025": read only 0 of 8192 bytes
Therefore it seems advisable to keep the tests.
It's not certain that running the tests in isolation removes the risk
of deadlocks. It's possible that additional locks are needed to
protect against a concurrent auto-analyze or such.
Per discussion with Tom Lane.
Discussion: https://postgr.es/m/28926.1556664156@sss.pgh.pa.us
Backpatch: 9.4-, like 3dbb317d3

While discussing comment improvements (see next commit) by Justin
Pryzby, Tom complained about a few details of the logic to infer the
length of the NULL bitmap when building the JITed tuple deforming
function. That bitmap allows us to avoid checking the tuple header's
natts, a check which often causes a pipeline stall.
Improvements:
a) As long as missing columns aren't taken into account, we can
continue to infer the length of the NULL bitmap from NOT NULL
columns following it. Previously we stopped at the first missing
column. It's unlikely to matter much in practice, but the
alternative would have been to document why we stop.
b) For robustness reasons it seems better to also check against
attisdropped - RemoveAttributeById() sets attnotnull to false, but
an additional check is trivial.
c) Improve related comments.
Discussion: https://postgr.es/m/20637.1555957068@sss.pgh.pa.us
Backpatch: -

M src/backend/jit/llvm/llvmjit_deform.c

Clean up handling of constraint_exclusion and enable_partition_pruning.

The interaction of these parameters was a bit confused/confusing,
and in fact v11 entirely misses the opportunity to apply partition
constraints when a partition is accessed directly (rather than
indirectly from its parent).
In HEAD, establish the principle that enable_partition_pruning controls
partition pruning and nothing else. When accessing a partition via its
parent, we do partition pruning (if enabled by enable_partition_pruning)
and then there is no need to consider partition constraints in the
constraint_exclusion logic. When accessing a partition directly, its
partition constraints are applied by the constraint_exclusion logic
only if constraint_exclusion = on.
In v11, we can't have such a clean division of these GUCs' effects,
partly because we don't want to break compatibility too much in a
released branch, and partly because the clean coding requires
inheritance_planner to have applied partition pruning to a partitioned
target table, which it doesn't in v11. However, we can tweak things
enough to cover the missed case, which seems like a good idea since
it's potentially a performance regression from v10. This patch keeps
v11's previous behavior in which enable_partition_pruning overrides
constraint_exclusion for an inherited target table, though.
In HEAD, also teach relation_excluded_by_constraints that it's okay to use
inheritable constraints when trying to prune a traditional inheritance
tree. This might not be thought worthy of effort given that that feature
is semi-deprecated now, but we have enough infrastructure that it only
takes a couple more lines of code to do it correctly.
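As a hypothetical sketch: if part_2018's partition constraint limits
ts to 2018, then with
    SET constraint_exclusion = on;
    EXPLAIN SELECT * FROM part_2018 WHERE ts >= '2019-01-01';
the planner can exclude the direct scan using the partition
constraint.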
Amit Langote and Tom Lane
Discussion: https://postgr.es/m/9813f079-f16b-61c8-9ab7-4363cab28d80@lab.ntt.co.jp
Discussion: https://postgr.es/m/29069.1555970894@sss.pgh.pa.us

Mistake in ab0dfc961b6a; progress reporting would have wrapped around
for indexes created with more than 2^31 tuples.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wz=WbNxc5ob5NJ9yqo2RMJ0q4HXDS30GVCobeCvC9A1L9A@mail.gmail.com

When reindexing individual indexes on pg_class it was possible to
trigger an assertion failure:
TRAP: FailedAssertion("!(!ReindexIsProcessingIndex(((index)->rd_id)))
That's because reindex_index() called SetReindexProcessing() - which
enables an assert ensuring no index insertions happen into the index
- before calling RelationSetNewRelfilenode(). That is not correct for
indexes on pg_class, because RelationSetNewRelfilenode() updates the
relevant pg_class row, which needs to update the indexes.
There are two reasons this wasn't noticed earlier. Firstly, the bug
doesn't trigger when reindexing all of pg_class, as reindex_relation
has code "hiding" all yet-to-be-reindexed indexes. Secondly, the bug
only triggers when the update to pg_class doesn't turn out to be a
HOT update - otherwise there's no index insertion to trigger the
bug. Most of the time there's enough space, making this bug hard to
trigger.
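With assertions enabled, the failure could be hit with something as
simple as
    REINDEX INDEX pg_class_oid_index;
provided the resulting pg_class update did not turn out to be HOT.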
To fix, move RelationSetNewRelfilenode() to before the
SetReindexProcessing() (and, together with some other code, to outside
of the PG_TRY()).
To make sure the error checking intended by SetReindexProcessing() is
more robust, modify CatalogIndexInsert() to check
ReindexIsProcessingIndex() even when the update is a HOT update.
Also add a few regression tests for REINDEXing of system catalogs.
The last two improvements would have prevented some of the issues
fixed in 5c1560606dc4c from being introduced in the first place.
Reported-By: Michael Paquier
Diagnosed-By: Tom Lane and Andres Freund
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz
Backpatch: 9.4-, the bug is present in all branches

Most of these stem from d25f519107 "tableam: relation creation, VACUUM
FULL/CLUSTER, SET TABLESPACE.".
1) To pass data to the relation_set_new_filenode() callback,
RelationSetNewRelfilenode() was made to update RelationData.rd_rel
directly. That's not OK however, as it makes the relcache entries
temporarily inconsistent. Among other scenarios, that is a problem
if a REINDEX targets an index on pg_class, because of the
CatalogTupleUpdate() in RelationSetNewRelfilenode(). Presumably
that was introduced because other places in the code do so - while
those aren't "good practice" they don't appear to be actively
buggy (e.g. because system tables may not be targeted).
I (Andres) should have caught this while reviewing and significantly
evolving the code in that commit, mea culpa.
Fix that by instead passing in the new RelFileNode as a separate
argument to relation_set_new_filenode() and rely on the relcache to
update the catalog entry. Also revert the change that made the
RelationMapUpdateMap() call immediate, and undo some other
unnecessary changes.
2) Document that the relation_set_new_filenode cannot rely on the
whole relcache entry to be valid. It might be worthwhile to
refactor the code to never have to rely on that, but given the way
heap_create() is currently coded, that'd be a large change.
3) ATExecSetTableSpace() shouldn't do FlushRelationBuffers() itself. A
table AM might not use shared buffers at all. Move to
index_copy_data() and heapam_relation_copy_data().
4) heapam_relation_set_new_filenode() previously sometimes accessed
rel->rd_rel->relpersistence rather than the `persistence`
argument. Code movement mistake.
5) Previously heapam_relation_set_new_filenode() re-opened the smgr
relation to create the init fork for it, if necessary. Instead have
RelationCreateStorage() return the SMgrRelation and use it to
create the init fork.
6) Add a note about the danger of modifying the relcache directly to
ATExecSetTableSpace() - it's currently not a bug because there's a
check ERRORing for catalog tables.
Regression tests and assertion improvements that together trigger the
bug described in 1) will be added in a later commit, as there is a
related bug on all branches.
Reported-By: Michael Paquier
Diagnosed-By: Tom Lane and Andres Freund
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz

Remove a comment that refers to a coding practice that was fully removed
by commit a8b8f4db, which introduced MarkBufferDirty(). It looks like
the comment was even obsolete before then, since it concerns
write-ordering dependencies with synchronous buffer writes.

This is quite unsafe, even for the case of ereport(FATAL) where we won't
return control to the interrupted code, and despite this code's use of
a flag to restrict the areas where we'd try to do it. It's possible
for example that we interrupt malloc or free while that's holding a lock
that's meant to protect against cross-thread interference. Then, any
attempt to do malloc or free within ereport() will result in a deadlock,
preventing the walreceiver process from exiting in response to SIGTERM.
We hypothesize that this explains some hard-to-reproduce failures seen
in the buildfarm.
Hence, get rid of the immediate-exit code in WalRcvShutdownHandler,
as well as the logic associated with WalRcvImmediateInterruptOK.
Instead, we need to take care that potentially-blocking operations
in the walreceiver's data transmission logic (libpqwalreceiver.c)
will respond reasonably promptly to the process's latch becoming
set and then call ProcessWalRcvInterrupts. Much of the needed code
for that was already present in libpqwalreceiver.c. I refactored
things a bit so that all the uses of PQgetResult use latch-aware
waiting, but didn't need to do much more.
These changes should be enough to ensure that libpqwalreceiver.c
will respond promptly to SIGTERM whenever it's waiting to receive
data. In principle, it could block for a long time while waiting
to send data too, and this patch does nothing to guard against that.
I think that that hazard is mostly theoretical though: such blocking
should occur only if we fill the kernel's data transmission buffers,
and we don't generally send enough data to make that happen without
waiting for input. If we find out that the hazard isn't just
theoretical, we could fix it by using PQsetnonblocking, but that
would require more ticklish changes than I care to make now.
This is a bug fix, but it seems like too big a change to push into
the back branches without much more testing than there's time for
right now. Perhaps we'll back-patch once we have more confidence
in the change.
Patch by me; thanks to Thomas Munro for review.
Discussion: https://postgr.es/m/20190416070119.GK2673@paquier.xyz

If a temporary table with an identity column and ON COMMIT DROP is
created in a single-statement transaction (not useful, but allowed),
it would leave the catalog corrupted. We need to add a
CommandCounterIncrement() so that PreCommit_on_commit_actions() sees
the created dependency between table and sequence and can clean it
up.
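A minimal reproducer, run outside any transaction block:
    CREATE TEMP TABLE t (a int GENERATED ALWAYS AS IDENTITY)
        ON COMMIT DROP;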
The analogous and more useful case of doing this in a transaction
block already runs some CommandCounterIncrement() before it gets to
the on-commit cleanup, so it wasn't a problem in practical use.
Several locations for placing the new CommandCounterIncrement() call
were discussed. This patch places it at the end of
standard_ProcessUtility(). That would also help if other commands
were to create catalog entries that some on-commit action would like
to see.
Bug: #15631
Reported-by: Serge Latyntsev <dnsl48@gmail.com>
Author: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>

M src/backend/tcop/utility.c

Do pre-release housekeeping on catalog data, and fix jsonpath send/recv.

Run renumber_oids.pl to move high-numbered OIDs down, as per pre-beta
tasks specified by RELEASE_CHANGES. (The only change is 8394 -> 3428.)
Also run reformat_dat_file.pl while I'm here.
While looking at the reformat diffs, I chanced to notice that type
jsonpath had typsend and typreceive = '-', which surely is not the
intention given that jsonpath_send and jsonpath_recv exist.
Fix that. It's safe to assume that these functions have never been
tested :-(. I didn't try, but somebody should.

Be more consistent about use of XXXGetDatum macros in new jsonpath
code. This is mostly to avoid having code that looks randomly
different from everyplace else that's doing the exact same thing.
In pg_regress.c, avoid an unreferenced-function warning from
compilers that don't understand pg_attribute_unused(). Putting
the function inside the same #ifdef as its only caller is more
straightforward coding anyway.
In be-secure-openssl.c, avoid use of pg_attribute_unused() on a label.
That's pretty creative, but there's no good reason to suppose that
it's portable, and there's absolutely no need to use goto's here in the
first place. (This wasn't actually causing any buildfarm complaints,
but it's new code in v12 so it has no portability track record.)

This fixes a couple of grammar mistakes, typos and inconsistencies in
the documentation. Particularly, the configuration parsing allows only
"kB" to mean kilobyte but there were references in the docs to "KB".
Some instances of the latter are still in the code comments. Some
parameter values were mentioned with "Minus-one", and using directly
"-1" with proper markups is more helpful to the reader.
Some of these have been pointed out by Justin, and some others are
things I bumped into.
Author: Justin Pryzby, Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/20190330224333.GQ5815@telsasoft.com

foreign_grouping_ok() is willing to put fairly arbitrary expressions into
the targetlist of a remote SELECT that's doing grouping or aggregation on
the remote side, including expressions that have no foreign component to
them at all. This is possibly a bit dubious from an efficiency standpoint;
but it rises to the level of a crash-causing bug if the expression is just
a Param or non-foreign Var. In that case, the expression will necessarily
also appear in the fdw_exprs list of values we need to send to the remote
server, and then setrefs.c's set_foreignscan_references will mistakenly
replace the fdw_exprs entry with a Var referencing the targetlist result.
The root cause of this problem is bad design in commit e7cb7ee14: it put
logic into set_foreignscan_references that IMV is postgres_fdw-specific,
and yet this bug shows that it isn't postgres_fdw-specific enough. The
transformation being done on fdw_exprs assumes that fdw_exprs is to be
evaluated with the fdw_scan_tlist as input, which is not how postgres_fdw
uses it; yet it could be the right thing for some other FDW. (In the
bigger picture, setrefs.c has no business assuming this for the other
expression fields of a ForeignScan either.)
The right fix therefore would be to expand the FDW API so that the
FDW could inform setrefs.c how it intends to evaluate these various
expressions. We can't change that in the back branches though, and we
also can't just summarily change setrefs.c's behavior there, or we're
likely to break external FDWs.
As a stopgap, therefore, hack up postgres_fdw so that it won't attempt
to send targetlist entries that look exactly like the fdw_exprs entries
they'd produce. In most cases this actually produces a superior plan,
IMO, with less data needing to be transmitted and returned; so we probably
ought to think harder about whether we should ship tlist expressions at
all when they don't contain any foreign Vars or Aggs. But that's an
optimization not a bug fix so I left it for later. One case where this
produces an inferior plan is where the expression in question is actually
a GROUP BY expression: then the restriction prevents us from using remote
grouping. It might be possible to work around that (since that would
reduce to group-by-a-constant on the remote side); but it seems like a
pretty unlikely corner case, so I'm not sure it's worth expending effort
solely to improve that. In any case the right long-term answer is to fix
the API as sketched above, and then revert this hack.
Per bug #15781 from Sean Johnston. Back-patch to v10 where the problem
was introduced.
Discussion: https://postgr.es/m/15781-2601b1002bad087c@postgresql.org

Recently added guidance on adding SVG images to the documentation
sources lacks advice on making the images responsive when rendered
in a variety of media types and viewports. Add some.
Patch by Jonathan Katz with some editorialization by me.
Author: Jonathan Katz
Discussion: https://postgr.es/m/6358ae6f-7191-a02b-e7b5-68050636ae71@postgresql.org

This corrects a small bug in zic that caused it to output an incorrect
year-2440 transition in the Africa/Casablanca zone.
More interestingly, zic has grown a "-r" option that limits the range of
zone transitions that it will put into the output files. That might be
useful to people who don't like the weird GMT offsets that tzdb likes
to use for very old dates. It appears that for dates before the cutoff
time specified with -r, zic will use the zone's standard-time offset
as of the cutoff time. So for example one might do
make install ZIC_OPTIONS='-r @-1893456000'
to cause all dates before 1910-01-01 to be treated as though 1910
standard time prevailed indefinitely far back. (Don't blame me for
the unfriendly way of specifying the cutoff time --- it's seconds
since or before the Unix epoch. You can use extract(epoch ...)
to calculate it.)
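For instance, the cutoff shown above can be computed with
    SELECT extract(epoch from timestamptz '1910-01-01 00:00:00+00');
which yields -1893456000.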
As usual, back-patch to all supported branches.

DST law changes in Palestine and Metlakatla.
Historical corrections for Israel.
Etc/UCT is now a backward-compatibility link to Etc/UTC, instead
of being a separate zone that generates the abbreviation "UCT",
which nowadays is typically a typo. Postgres will still accept
"UCT" as an input zone name, but it won't output it.

Fix DefineIndex so that it doesn't attempt to pass down a to-be-reused
index relfilenode to a child index creation, and fix TryReuseIndex
to not think that reuse is sensible for a partitioned index.
In v11, this fixes a problem where ALTER TABLE on a partitioned table
could assign the same relfilenode to several different child indexes,
causing very nasty catalog corruption --- in fact, attempting to DROP
the partitioned table then leads not only to a database crash, but to
inability to restart because the same crash will recur during WAL replay.
Either of these two changes would be enough to prevent the failure, but
since neither action could possibly be sane, let's put in both changes
for future-proofing.
In HEAD, no such bug manifests, but that's just an accidental consequence
of having changed the pg_class representation of partitioned indexes to
have relfilenode = 0. Both of these changes still seem like smart
future-proofing.
This is only a stop-gap because the code for ALTER TABLE on a partitioned
table with a no-op type change still leaves a great deal to be desired.
As the added regression tests show, it gets things wrong for comments on
child indexes/constraints, and it is regenerating child indexes it doesn't
have to. However, fixing those problems will take more work which may not
get back-patched into v11. We need a fix for the corruption problem now.
Per bug #15672 from Jianing Yang.
Patch by me, regression test cases based on work by Amit Langote,
who also did a lot of the investigative work.
Discussion: https://postgr.es/m/15672-b9fa7db32698269f@postgresql.org

Commit f831d4accda0 changed pg_dump to emit (and pg_restore to
understand) NULLs for unused members in ArchiveEntry structs, as a side
effect of some code beautification. That broke pg_restore of dumps
generated with older pg_dump, however, so it was reverted in
19455c9f5606. Since the archiver version number has been bumped in
3b925e905de3, we can put it back.
Author: Dmitry Dolgov
Discussion: https://postgr.es/m/CA+q6zcXx0XHqLsFJLaUU2j5BDiBAHig=YRoBC_YVq7VJGvzBEA@mail.gmail.com

Author: Laurenz Albe and Etsuro Fujita
Reviewed-by: Laurenz Albe and Amit Langote
Backpatch-through: 11 where support for that by FDWs was added
Discussion: https://postgr.es/m/bf36a0288e8f31b4f2f40952e225bf892dc1ffc5.camel@cybertec.at

Adopt a more defensive approach to accessing index tuples in
contrib/amcheck: verify that each line pointer looks sane before
accessing the associated tuple using pointer arithmetic based on the
line pointer's offset. This avoids undefined behavior and assertion
failures in cases where line pointers are corrupt.
Issue spotted following a complaint about an assertion failure by
Grigory Smolkin, which involved a test harness that deliberately
corrupts indexes.
This is arguably a bugfix, but no backpatch given the lack of field
reports beyond Grigory's.
Discussion: https://postgr.es/m/CAH2-WzmkurhCqnyLHxk0VkOZqd49+ZZsp1xAJOg7j2x7dmp_XQ@mail.gmail.com

When an existing index in a partition is attached to a new index on
its parent, we forgot to set the "relispartition" flag correctly, which
meant that it was not possible to find the index in various operations,
such as adding a foreign key constraint that references that partitioned
table. One of the four places that assign the parent index forgot
to do that, so fix by shifting responsibility for updating the
flag to the routine that changes the parent.
Author: Amit Langote, Álvaro Herrera
Reported-by: Hubert "depesz" Lubaczewski
Discussion: https://postgr.es/m/CA+HiwqHMsRtRYRWYTWavKJ8x14AFsv7bmAV46mYwnfD3vy8goQ@mail.gmail.com

Commit ca4103025dfe left a few loose ends. The most important one
(broken pg_dump output) is already fixed by virtue of commit
3b23552ad8bb, but some things remained:
* When ALTER TABLE rewrites tables, the indexes must remain in the
tablespace they were originally in. This didn't work because
index recreation during ALTER TABLE runs manufactured SQL (yuck),
which runs afoul of default_tablespace in competition with the parent
relation tablespace. To fix, reset default_tablespace to the empty
string temporarily, and add the TABLESPACE clause as appropriate.
* Setting a partitioned rel's tablespace to the database default is
confusing; if it worked, it would direct the partitions to that
tablespace regardless of default_tablespace. But in reality it does
not work, and making it work is a larger project. Therefore, throw
an error when this condition is detected, to alert the unwary.
Add some docs and tests, too.
Author: Álvaro Herrera
Discussion: https://postgr.es/m/CAKJS1f_1c260nOt_vBJ067AZ3JXptXVRohDVMLEBmudX1YEx-A@mail.gmail.com

Using PARTITION OF can result in column ordering different from that
of the database being dumped, if the partition uses a column layout
different from the parent's. It's not pg_dump's job to editorialize
on table definitions, so this is not acceptable; back-patch all the
way back to pg10, where partitioned tables were introduced.
This change also ensures that partitions end up in the correct
tablespace, if different from the parent's; this is an oversight in
ca4103025dfe (in pg12 only). Partitioned indexes (in pg11) don't have
this problem, because they're already created as independent indexes and
attached to their parents afterwards.
This change also has the advantage that the partition is restorable from
the dump (as a standalone table) even if its parent table isn't
restored.
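Schematically, instead of emitting (hypothetical names and bounds)
    CREATE TABLE p1 PARTITION OF parent FOR VALUES FROM (1) TO (10);
pg_dump now emits something like
    CREATE TABLE p1 (a integer);
    ALTER TABLE ONLY parent ATTACH PARTITION p1
        FOR VALUES FROM (1) TO (10);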
Author: David Rowley
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/CAKJS1f_1c260nOt_vBJ067AZ3JXptXVRohDVMLEBmudX1YEx-A@mail.gmail.com
Discussion: https://postgr.es/m/20190423185007.GA27954@alvherre.pgsql

In sigusr1_handler, don't ignore PMSIGNAL_ADVANCE_STATE_MACHINE based
on pmState. The restriction is unnecessary (PostmasterStateMachine
should work in any state), not future-proof (since it makes too many
assumptions about why the signal might be sent), and broken even today
because a race condition can make it necessary to respond to the signal
in PM_WAIT_READONLY state. The race condition seems unlikely, but
if it did happen, a hot-standby postmaster could fail to shut down
after receiving a smart-shutdown request.
In MaybeStartWalReceiver, don't clear the WalReceiverRequested flag
if the fork attempt fails. Leaving it set allows us to try
again in future iterations of the postmaster idle loop. (The startup
process would eventually send a fresh request signal, but this change
may allow us to retry the fork sooner.)
Remove an obsolete comment and unnecessary test in
PostmasterStateMachine's handling of PM_SHUTDOWN_2 state. It's not
possible to have a live walreceiver in that state, and AFAICT has not
been possible since commit 5e85315ea. This isn't a live bug, but the
false comment is quite confusing to readers.
In passing, rearrange sigusr1_handler's CheckPromoteSignal tests so that
we don't uselessly perform stat() calls that we're going to ignore the
results of.
Add some comments clarifying the behavior of MaybeStartWalReceiver;
I very nearly rearranged it in a way that'd reintroduce the race
condition fixed in e5d494d78. Mea culpa for not commenting that
properly at the time.
Back-patch to all supported branches. The PMSIGNAL_ADVANCE_STATE_MACHINE
change is the only one of even minor significance, but we might as well
keep this code in sync across branches.
Discussion: https://postgr.es/m/9001.1556046681@sss.pgh.pa.us

Commit 3d956d9562 added support for update row movement in postgres_fdw.
This patch fixes the following issues introduced by that commit:
* When a remote partition chosen to insert routed rows into was also an
UPDATE subplan target rel that would be updated later, the UPDATE that
used a direct modification plan modified those routed rows incorrectly
because those routed rows were visible to the later UPDATE command.
The right fix for this would be to have some way in postgres_fdw in
which the later UPDATE command ignores those routed rows, but it seems
hard to do so with the current infrastructure. For now throw an error
in that case.
* When a remote partition chosen to insert routed rows into was also an
UPDATE subplan target rel, fmstate created for the UPDATE that used a
non-direct modification plan was mistakenly overridden by another
fmstate created for inserting those routed rows into the partition.
This caused 1) a server crash when the partition was updated later,
and 2) a resource leak when the partition had already been updated. To
avoid that, adjust the treatment of the fmstate for the insertion. As
for #1, since we would also have the incorrectness issue as mentioned
above, error out in that case as well.
Update the docs to mention that postgres_fdw currently does not handle
the case where a remote partition chosen to insert a routed row into is
also an UPDATE subplan target rel that will be updated later.
Author: Amit Langote and Etsuro Fujita
Reviewed-by: Amit Langote
Backpatch-through: 11 where row movement in postgres_fdw was added
Discussion: https://postgr.es/m/21e7eaa4-0d4d-20c2-a1f7-c7e96f4ce440@lab.ntt.co.jp

This allows table AMs that don't need these horizons. This was already
documented in the tableam relation_set_new_filenode callback, but an
assert prevented if from actually working (the test AM code contained
the change itself). Defang the asserts in the general code, and move
the stronger ones into heap AM.
Relatedly, after CLUSTER/VACUUM, we'd always assign a relfrozenxid /
relminmxid. Change the table_relation_copy_for_cluster() interface to
allow the AM to overwrite the horizons that get set on the pg_class
entry. This'd also in the future allow AMs like heap to compute a
relfrozenxid during rewrite that's the table's actual minimum rather
than a pre-determined value. Arguably it'd have been better to move
the whole computation / setting of those values into the callback, but
it seems likely that for other reasons it'd be better to be able to
use one value to vacuum/cluster multiple tables (e.g. a toast's
horizon shouldn't be different than the table's).
Reported-By: Heikki Linnakangas
Author: Andres Freund
Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f122@iki.fi

cache_locale_time (extraction of LC_TIME-related info) had never been
taught the lessons we previously learned about extraction of info related
to LC_MONETARY and LC_NUMERIC. Specifically, commit 95a777c61 taught
PGLC_localeconv() that data coming out of localeconv() was in an encoding
determined by the relevant locale, but we didn't realize that there's a
similar issue with strftime(). And commit a4930e7ca hardened
PGLC_localeconv() against errors occurring partway through, but failed
to do likewise for cache_locale_time(). So, rearrange the latter
function to perform encoding conversion and not risk failure while
it's got the locales set to temporary values.
This time around I also changed PGLC_localeconv() to treat it as FATAL
if it can't restore the previous settings of the locale values. There
is no reason (except possibly OOM) for that to fail, and proceeding with
the wrong locale values seems like a seriously bad idea --- especially
on Windows where we have to also temporarily change LC_CTYPE. Also,
protect against the possibility that we can't identify the codeset
reported for LC_MONETARY or LC_NUMERIC; rather than just failing,
try to validate the data without conversion.
The user-visible symptom this fixes is that if LC_TIME is set to a locale
name that implies an encoding different from the database encoding,
non-ASCII localized day and month names would be retrieved in the wrong
encoding, leading to either unexpected encoding-conversion error reports
or wrong output from to_char(). The other possible failure modes are
unlikely enough that we've not seen reports of them, AFAIK.
The encoding conversion problems do not manifest on Windows, since
we'd already created special-case code to handle that issue there.
Per report from Juan José Santamaría Flecha. Back-patch to all
supported versions.
Juan José Santamaría Flecha and Tom Lane
Discussion: https://postgr.es/m/CAC+AXB22So5aZm2vZe+MChYXec7gWfr-n-SK-iO091R0P_1Tew@mail.gmail.com

Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate. The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on. The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.
Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values. Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.
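The problem case is easy to sketch with hypothetical names:
    CREATE TABLE t (a int);
    CREATE UNIQUE INDEX ON t (a);
    INSERT INTO t SELECT NULL FROM generate_series(1, 100000);
Every row is a "duplicate" as far as the index is concerned, yet none
of them conflict, since NULLs are never equal for unique-checking
purposes.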
To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely. _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.
Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com

Up to now, DefineIndex() was responsible for adding attnotnull constraints
to the columns of a primary key, in any case where it hadn't been
convenient for transformIndexConstraint() to mark those columns as
is_not_null. It (or rather its minion index_check_primary_key) did this
by executing an ALTER TABLE SET NOT NULL command for the target table.
The trouble with this solution is that if we're creating the index due
to ALTER TABLE ADD PRIMARY KEY, and the outer ALTER TABLE has additional
sub-commands, the inner ALTER TABLE's operations executed at the wrong
time with respect to the outer ALTER TABLE's operations. In particular,
the inner ALTER would perform a validation scan at a point where the
table's storage might be inconsistent with its catalog entries. (This is
on the hairy edge of being a security problem, but AFAICS it isn't one
because the inner scan would only be interested in the tuples' null
bitmaps.) This can result in unexpected failures, such as the one seen
in bug #15580 from Allison Kaptur.
To fix, let's remove the attempt to do SET NOT NULL from DefineIndex(),
reducing index_check_primary_key's role to verifying that the columns are
already not null. (It shouldn't ever see such a case, but it seems wise
to keep the check for safety.) Instead, make transformIndexConstraint()
generate ALTER TABLE SET NOT NULL subcommands to be executed ahead of
the ADD PRIMARY KEY operation in every case where it can't force the
column to be created already-not-null. This requires only minor surgery
in parse_utilcmd.c, and it makes for a much more satisfying spec for
transformIndexConstraint(): it's no longer having to take it on faith
that someone else will handle addition of NOT NULL constraints.
To make that work, we have to move the execution of AT_SetNotNull into
an ALTER pass that executes ahead of AT_PASS_ADD_INDEX. I moved it to
AT_PASS_COL_ATTRS, and put that after AT_PASS_ADD_COL to avoid failure
when the column is being added in the same command. This incidentally
fixes a bug in the only previous usage of AT_PASS_COL_ATTRS, for
AT_SetIdentity: it didn't work either for a newly-added column.
Playing around with this exposed a separate bug in ALTER TABLE ONLY ...
ADD PRIMARY KEY for partitioned tables. The intent of the ONLY modifier
in that context is to prevent doing anything that would require holding
lock for a long time --- but the implied SET NOT NULL would recurse to
the child partitions, and do an expensive validation scan for any child
where the column(s) were not already NOT NULL. To fix that, invent a
new ALTER subcommand AT_CheckNotNull that just insists that a child
column be already NOT NULL, and apply that, not AT_SetNotNull, when
recursing to children in this scenario. This results in a slightly laxer
definition of ALTER TABLE ONLY ... SET NOT NULL for partitioned tables,
too: that command will now work as long as all children are already NOT
NULL, whereas before it just threw up its hands if there were any
partitions.
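For instance, with a hypothetical partitioned table,
    ALTER TABLE ONLY parted ALTER COLUMN id SET NOT NULL;
now succeeds whenever every partition's id column is already NOT NULL,
instead of failing outright as soon as any partitions exist.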
In passing, clean up the API of generateClonedIndexStmt(): remove a
useless argument, ensure that the output argument is not left undefined,
update the header comment.
A small side effect of this change is that no-such-column errors in ALTER
TABLE ADD PRIMARY KEY now produce a different message that includes the
table name, because they are now detected by the SET NOT NULL step which
has historically worded its error that way. That seems fine to me, so
I didn't make any effort to avoid the wording change.
The basic bug #15580 is of very long standing, and these other bugs
aren't new in v12 either. However, this is a pretty significant change
in the way ALTER TABLE ADD PRIMARY KEY works. On balance it seems best
not to back-patch, at least not till we get some more confidence that
this patch has no new bugs.
Patch by me, but thanks to Jie Zhang for a preliminary version.
Discussion: https://postgr.es/m/15580-d1a6de5a3d65da51@postgresql.org
Discussion: https://postgr.es/m/1396E95157071C4EBBA51892C5368521017F2E6E63@G08CNEXMBPEKD02.g08.fujitsu.local

xml.c passed format = 1 to xmlNodeDump(), resulting in sometimes getting
extra whitespace (newlines + spaces) in the output. We don't really want
that, first because whitespace might be semantically significant in some
XML uses, and second because it happens only very inconsistently. Only
one case in our regression tests is affected.
This potentially affects the results of xpath() and the XMLTABLE construct,
when emitting nodeset values.
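For instance (hypothetical XML), a nodeset result such as
    SELECT xpath('/r/a', '<r><a><b>1</b></a></r>'::xml);
could previously come back with extra newlines and indentation inside
the <a> element; it is now emitted without the added whitespace.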
Note that the older code in contrib/xml2 doesn't do this; it seems
to have been an aboriginal bad decision in commit ea3b212fe.
While this definitely seems like a bug to me, the small number of
complaints to date argues against back-patching a behavioral change.
Hence, fix in HEAD only, at least for now.
Per report from Jean-Marc Voillequin.
Discussion: https://postgr.es/m/1EC8157EB499BF459A516ADCF135ADCE3A23A9CA@LON-WGMSX712.ad.moodys.net

This commit fixes a couple of issues related to the way password
verifiers hashed with MD5 or SCRAM-SHA-256 are detected, which made it
possible to store in the catalogs passwords that do not follow the
supported hash formats:
- An MD5-hashed entry was checked based on whether its header uses "md5"
and whether the string length matches what is expected. Unfortunately
the code never checked that the hash uses only hexadecimal characters,
as reported by Tom Lane.
- A SCRAM-hashed entry was checked based on only its header, which
should be "SCRAM-SHA-256$", but it never checked for any fields
afterwards, as reported by Jonathan Katz.
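For instance (hypothetical role name), a string with the right header
and length but bogus hex digits was previously accepted as an MD5
verifier:
    ALTER ROLE joe PASSWORD 'md5zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz';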
Backpatch down to v10, which is where SCRAM has been introduced, and
where password verifiers in plain format have been removed.
Author: Jonathan Katz
Reviewed-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/016deb6b-1f0a-8e9f-1833-a8675b170aa9@postgresql.org
Backpatch-through: 10

Due to parallel development, gist added the missing conflict
information in c952eae52a3, while 558a9165e08 moved that computation
to the primary for the index types that already had it. Thus adapt
gist to also compute on the primary, using
index_compute_xid_horizon_for_tuples() instead of its own copy of the
logic.
This also adds pg_waldump support for XLOG_GIST_DELETE records, which
was previously missing.
Bumps WAL version.
Author: Andres Freund
Discussion: https://postgr.es/m/20190406050243.bszosdg4buvabfrt@alap3.anarazel.de

This commit adds the description that "non-exclusive" pg_start_backup
and pg_stop_backup can be executed even during recovery. Previously
it was wrongly documented that those functions are not allowed to be
executed during recovery.
Back-patch to 9.6 where non-exclusive backup API was added.
Discussion: https://postgr.es/m/CAHGQGwEuAYrEX7Yhmf2MCrTK81HDkkg-JqsOUh8zw6+zYC5zzw@mail.gmail.com

The formulas used to calculate size while (de)serializing mvndistinct
and functional dependencies were based on offsetof() of the structs.
But that is incorrect, because the structures are not copied as a
whole; we copy the individual fields directly.
At the moment this works fine, because there is no alignment padding
on any platform we support. But it might break if we ever added some
fields to any of the structs, for example. It's also confusing.
Fixed by reworking the macros to directly sum the sizes of serialized
fields. The macros are now useful only for serialization, so there is
no point in keeping them in the public header file. So make them
private by moving them to the .c files.
Also adds a couple more asserts to check the serialization, and fixes
an incorrect allocation of MVDependency instead of (MVDependency *).
Reported-By: Tom Lane
Discussion: https://postgr.es/m/29785.1555365602@sss.pgh.pa.us

The GSSAPI encryption patch neglected to update the protocol
documentation to describe how to set up a GSSAPI encrypted connection
from a client to the server, so fix that by adding the appropriate
documentation to protocol.sgml.
The tests added for encryption support were overly long and couldn't be
run in parallel due to race conditions; this was largely because each
test was setting up its own KDC to perform the tests. Instead, merge
the authentication tests and the encryption tests into the original
test, where we only create one KDC to run the tests with. Also, have
the tests check, using the pg_stat_gssapi view, what the server's
opinion of the connection is and whether it was GSS authenticated or
encrypted.
In passing, fix the libpq label for GSSENC-Mode to be consistent with
the "PGGSSENCMODE" environment variable.
Missing protocol documentation pointed out by Michael Paquier.
Issues with the tests pointed out by Tom Lane and Peter Eisentraut.
Refactored tests and added documentation by me.
Reviewed by Robbie Harwood (protocol documentation) and Michael Paquier
(rework of the tests).

For amcanreorderby scans, nodeIndexscan.c's reorder queue holds
heap tuples, but the underlying table likely does not. Before this fix
we'd return different types of slots, depending on whether the tuple
came from the reorder queue, or from the index + table.
While that could be fixed by signalling that the node doesn't return a
fixed type of slot, it seems better to instead remove the separate
slot for the reorder queue, and use ExecForceStoreHeapTuple() to store
tuples from the queue. It's not particularly common to need
reordering, after all.
This reverts most of the iss_ReorderQueueSlot related changes to
nodeIndexscan.c made in 1a0586de3657cd3, except that now
ExecForceStoreHeapTuple() is used instead of ExecStoreHeapTuple().
Noticed when testing zheap against the in-core version of tableam.
Author: Andres Freund

As reported by Tom, when ExecStoreMinimalTuple() had to perform a
conversion to store the minimal tuple in the slot, it forgot to
respect the shouldFree flag, and leaked the tuple into the current
memory context if true. Fix that by freeing the tuple in that case.
Looking at the relevant code made me (Andres) realize that not having
the shouldFree parameter to ExecForceStoreHeapTuple() was a bad
idea. Some callers had to locally implement the necessary logic, and
in one case it was missing, creating a potential per-group leak in
non-hashed aggregation.
The choice to not free the tuple in ExecComputeStoredGenerated() is
not pretty, but not introduced by this commit - I'll start a separate
discussion about it.
Reported-By: Tom Lane
Discussion: https://postgr.es/m/366.1555382816@sss.pgh.pa.us

HoldPinnedPortals() did things in the wrong order: it must not mark
a portal autoHeld until it's been successfully held. Otherwise,
a failure while persisting the portal results in a server crash
because we think the portal is in a good state when it's not.
Also add a check that portal->status is READY before attempting to
hold a pinned portal. We have such a check before the only other
use of HoldPortal(), so it seems unwise not to check it here.
Lastly, rethink the responsibility for where to call HoldPinnedPortals.
The comment for it imagined that it was optional for any individual PL
to call it or not, but that cannot be the case: if some outer level of
procedure has a pinned portal, failing to persist it when an inner
procedure commits is going to be trouble. Let's have SPI do it instead
of the individual PLs. That's not a complete solution, since in theory
a PL might not be using SPI to perform commit/rollback, but such a PL
is going to have to be aware of lots of related requirements anyway.
(This change doesn't cause an API break for any external PLs that might
be calling HoldPinnedPortals per the old regime, because calling it
twice during a commit or rollback sequence won't hurt.)
Per bug #15703 from Julian Schauder. Back-patch to v11 where this code
came in.
Discussion: https://postgr.es/m/15703-c12c5bc0ea34ba26@postgresql.org

If contrib/pageinspect is not installed, this causes the test checking
the minimum recovery point to fail. The point is that the dependency
on pageinspect is not really necessary, as the test also performs all
its checks with an offline cluster by scanning the on-disk pages
directly, which is enough for the purpose of the test.
Per complaint from Tom Lane.
Author: Michael Paquier
Discussion: https://postgr.es/m/17806.1555566345@sss.pgh.pa.us

When such a trigger returns the old row version, it naturally gets
stored in the slot for the trigger result. When a table AM doesn't
store HeapTuples internally, ExecBRUpdateTriggers() frees the old row
version passed to triggers - but before this fix it might still be
referenced by the slot holding the new tuple.
Noticed when running the out-of-core zheap AM against the in-core
version of tableam.
Author: Andres Freund

Fix handling of temp and unlogged tables in FOR ALL TABLES publications

If a FOR ALL TABLES publication exists, temporary and unlogged tables
are ignored for publishing changes. But CheckCmdReplicaIdentity()
would still check in that case that such a table has a replica
identity set before accepting updates. To fix, have
GetRelationPublicationActions() return that such a table publishes no
actions.
Discussion: https://www.postgresql.org/message-id/f3f151f7-c4dd-1646-b998-f60bd6217dd3@2ndquadrant.com

* Remove one unnecessary pg_class join in SQL command. Not needed,
because we use a regclass cast instead.
* Doc: refer to "partitioned relations" rather than specifically tables,
since indexes are also displayed.
* Rename "On table" column to "Table", for consistency with \di.
Author: Justin Pryzby
Discussion: https://postgr.es/m/20190407212525.GB10080@telsasoft.com

Previously, the include actions include_dir, include_if_exists, and
include listed commented-out values that were not the defaults, which
was inconsistent with other entries. Instead, replace them with '',
the default value.
Reported-by: Emanuel Araújo
Discussion: https://postgr.es/m/CAMuTAkYMx6Q27wpELDR3_v9aG443y7ZjeXu15_+1nGUjhMWOJA@mail.gmail.com
Backpatch-through: 9.4
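
For reference, the adjusted sample entries look roughly like this
(shape approximate, not quoted from the commit):

    #include_dir = ''        # include files ending in '.conf' from a directory
    #include_if_exists = ''  # include file only if it exists
    #include = ''            # include file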

The previous paragraph trying to explain --check, --link, and no --link
modes and the various points of failure was too complex. Instead, use
bullet lists and sublists.
Reported-by: Daniel Gustafsson
Discussion: https://postgr.es/m/qtqiv7hI87s_Xvz5ZXHCaH-1-_AZGpIDJowzlRjF3-AbCr3RhSNydM_JCuJ8DE4WZozrtxhIWmyYTbv0syKyfGB6cYMQitp9yN-NZMm-oAo=@yesql.se
Backpatch-through: 9.4

The buildfarm points out that UINT64_FORMAT might not work with sscanf;
it's calibrated for our printf implementation, which might not agree
with the platform-supplied sscanf. Fall back to just accepting an
unsigned long, which is already more than the documentation promises.
Oversight in e6c3ba7fb; back-patch to v11, as that was.

I noted that some buildfarm members were complaining about %ld being
used to format values that are (probably) declared size_t. Use %zu
instead, and insert a cast just in case some versions of the GSSAPI
API declare the length field differently. While at it, clean up
gratuitous differences in wording of equivalent messages, show
the complained-of length in all relevant messages not just some,
include trailing newline where needed, adjust random deviations
from project-standard code layout and message style, etc.

Returning 0 could falsely indicate that there is no problem. NULL
correctly indicates that there is no information about potential
problems.
Also return 0 as numbackends instead of NULL for shared objects (as
no connection can be made directly to a shared object).
Author: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>

When saving a replication slot, failing to close the temporary path
used to save the slot information is considered a failure and reported
as such. However, the code forgot to return immediately, as other
failure paths do.
Noticed while looking at this area of the code for another patch.

Transient files and wait events normally get cleaned up when an
exception is raised (be it in the context of a transaction for a
backend or another process like the checkpointer), hence there is
little point in complicating error code paths to do this work. This
shaves a bit of code, and removes some extra errno handling that
needed to be preserved during the cleanup steps.
Reported-by: Masahiko Sawada
Author: Michael Paquier
Reviewed-by: Tom Lane, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoDhHYVq5KkXfkaHhmjA-zJYj-e4teiRAJefvXuKJz1tKQ@mail.gmail.com

Per discussion with others, allowing REINDEX INDEX CONCURRENTLY to work
for invalid indexes when working directly on them can have a lot of
value to unlock situations with invalid indexes without having to use a
dance involving DROP INDEX followed by an extra CREATE INDEX
CONCURRENTLY (which would not work for indexes with constraint
dependency anyway). This also does not create extra bloat on the
relation involved as this works on individual indexes, so let's enable
it.
Note that REINDEX TABLE CONCURRENTLY still bypasses invalid indexes as
we don't want to bloat the number of indexes defined on a relation in
the event of multiple and successive failures of REINDEX CONCURRENTLY.
More regression tests are added to cover those behaviors, using an
invalid index created with CREATE INDEX CONCURRENTLY.
Reported-by: Dagfinn Ilmari Mannsåker, Álvaro Herrera
Author: Michael Paquier
Reviewed-by: Peter Eisentraut, Dagfinn Ilmari Mannsåker
Discussion: https://postgr.es/m/20190411134947.GA22043@alvherre.pgsql
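
Illustrative usage (index name hypothetical):

    REINDEX INDEX CONCURRENTLY invalid_idx;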

Turn on the previously disabled automatic scaling of images in HTML
output. To avoid images looking too large on nowadays-normal screens,
restrict the width to 75% on such screens.
Some work is still necessary because SVG images without a viewBox
still won't scale, but that will be a separate patch.
Discussion: https://www.postgresql.org/message-id/flat/6d2442d1-84a2-36ef-e014-b6d1ece8a139%40postgresql.org

The regression tests added in commit 7300a69950 test cardinality
estimates using a function that extracts the interesting pieces
from the EXPLAIN output, instead of testing the whole plan. That
seems both easier to understand and less fragile, so this applies
the same approach to pre-existing tests of ndistinct coefficients
and functional dependencies.
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com

The memcpy() was copying type OIDs in the wrong direction, so the
deserialized MCV list always had them as 0. This is mostly harmless
except when printing the data in pg_mcv_list_items(), in which case
it reported
ERROR: cache lookup failed for type 0
Also added a simple regression test for pg_mcv_list_items() function,
printing a single-item MCV list.
Reported-By: Dean Rasheed
Discussion: https://postgr.es/m/CAEZATCX6T0iDTTZrqyec4Cd6b4yuL7euu4=rQRXaVBAVrUi1Cg@mail.gmail.com

Commit 5e1963fb7 overlooked two places in partbounds.c that now
need to pass a collation identifier to the hash functions for
a partition key column.
Amit Langote, per report from Jesper Pedersen
Discussion: https://postgr.es/m/a620f85a-42ab-e0f3-3337-b04b97e2e2f5@redhat.com

Concurrent autovacuum could result in a change in the order of the
live rows in timestamp_tbl. While this would not happen with the
default autovacuum parameters, it's fairly easy to hit if
autovacuum_vacuum_threshold is made small enough to allow autovac
to decide to process this table. That's a stumbling block for trying
to exercise autovacuum aggressively using the core regression tests.
To fix, replace an unqualified DELETE with a TRUNCATE. There's a
similar DELETE just above (and no order-sensitive queries between),
so this doesn't lose any test coverage and might indeed be argued
to improve it.
Discussion: https://postgr.es/m/17428.1555348950@sss.pgh.pa.us

Commit 3f2393edef changed ExecCleanupTupleRouting() so that it skipped
cleaning up subplan resultrels before calling EndForeignInsert(), but
that would cause an issue: when those resultrels were foreign tables,
the FDWs would fail to shut down. Repair by skipping it after calling
EndForeignInsert() as before.
Author: Etsuro Fujita
Reviewed-by: David Rowley and Amit Langote
Discussion: https://postgr.es/m/5CAF3B8F.2090905@lab.ntt.co.jp

The same code is used to handle both text and bytea, but bytea is not
collation-aware, so we shouldn't call get_collation_isdeterministic()
in that case, since that will error out with an invalid collation.
Reported-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/CAM2%2B6%3DWaf3qJ1%3DyVTUH8_yG-SC0xcBMY%2BSFLhvKKNnWNXSUDBw%40mail.gmail.com

Since Postgres 10, SHOW commands can be triggered with replication
connections in a WAL sender context; however, this code missed that a
transaction context is needed for syscache lookups. This commit makes
sure that the syscache lookups can happen correctly by setting a
transaction context when running SHOW commands in a WAL sender.
Superuser-only parameters can be displayed using SHOW commands not only
to superusers, but also to members of the system role
pg_read_all_settings. That requires a syscache lookup to check whether
the connected role is a member of this system role, and without a
transaction context the instance crashes. Superusers do not need the
syscache check, so that case worked correctly.
New tests are added to cover this issue.
Reported-by: Alexander Kukushkin
Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/15734-2daa8761eeed8e20@postgresql.org
Backpatch-through: 10

Commit c098509927f9a49ebceb301a2cb6a477ecd4ac3c changed
PostgresNode::get_new_node() to probe 0.0.0.0 instead of 127.0.0.1, but
the new test was less effective for Windows native Perl. This increased
the failure rate of buildfarm members bowerbird and jacana. Instead,
test 0.0.0.0 and concrete addresses. This restores the old level of
defense, but the algorithm is still subject to its longstanding time of
check to time of use race condition. Back-patch to 9.6, like the
previous change.
Discussion: https://postgr.es/m/GrdLgAdUK9FdyZg8VIcTDKVOkys122ZINEb3CjjoySfGj2KyPiMKTh1zqtRp0TAD7FJ27G-OBB3eplxIB5GhcQH5o8zzGZfp0MuJaXJxVxk=@yesql.se

Up to now the tests of pg_rewind have been using a superuser for all
their tests (which is the default of many tests actually, and something
that ought to be reviewed) when involving an online source server, yet
it is possible to use a non-superuser role for that as long as this
role is granted permissions to execute all the source-side functions
used for the rewind. This has been possible since v11, and was already
documented as of bfc8068.
PostgresNode::init is extended so that callers of this routine can add
extra options to configure the authentication of a new node; this
commit uses that, allowing the tests to work properly on Windows,
where SSPI is used.
This will make it easy to catch any change in pg_rewind if the tool
begins to use more backend-side functions, so that the properties
introduced by v11 are kept.
Per suggestion from Peter Eisentraut.
Author: Michael Paquier
Reviewed-by: Magnus Hagander
Discussion: https://postgr.es/m/20190411041336.GM2728@paquier.xyz

Per buildfarm member jacana, the former fails under msys Perl 5.8.8.
Back-patch to 9.6, like the code in question.
Discussion: https://postgr.es/m/GrdLgAdUK9FdyZg8VIcTDKVOkys122ZINEb3CjjoySfGj2KyPiMKTh1zqtRp0TAD7FJ27G-OBB3eplxIB5GhcQH5o8zzGZfp0MuJaXJxVxk=@yesql.se

The original coding of generate_partition_qual() just copied the list
of predicate expressions into the global CacheMemoryContext, making it
effectively impossible to clean up when the owning relcache entry is
destroyed --- the relevant code in RelationDestroyRelation() only managed
to free the topmost List header :-(. This resulted in a session-lifespan
memory leak whenever a table partition's relcache entry is rebuilt.
Fortunately, that's not normally a large data structure, and rebuilds
shouldn't occur all that often in production situations; but this is
still a bug worth fixing back to v10 where the code was introduced.
To fix, put the cached expression tree into its own small memory context,
as we do with other complicated substructures of relcache entries.
Also, deal more honestly with the case that a partition has an empty
partcheck list; while that probably isn't a case that's very interesting
for production use, it's legal.
In passing, clarify comments about how partitioning-related relcache
data structures are managed, and add some Asserts that we're not leaking
old copies when we overwrite these data fields.
Amit Langote and Tom Lane
Discussion: https://postgr.es/m/7961.1552498252@sss.pgh.pa.us

postmaster startup scrutinizes any shared memory segment recorded in
postmaster.pid, exiting if that segment matches the current data
directory and has an attached process. When the postmaster.pid file was
missing, a starting postmaster used weaker checks. Change to use the
same checks in both scenarios. This increases the chance of a startup
failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1
postmaster.pid` && rm postmaster.pid && pg_ctl -w start". A postmaster
will no longer stop if shmat() of an old segment fails with EACCES. A
postmaster will no longer recycle segments pertaining to other data
directories. That's good for production, but it's bad for integration
tests that crash a postmaster and immediately delete its data directory.
Such a test now leaks a segment indefinitely. No "make check-world"
test does that. win32_shmem.c already avoided all these problems. In
9.6 and later, enhance PostgresNode to facilitate testing. Back-patch
to 9.4 (all supported versions).
Reviewed (in earlier versions) by Daniel Gustafsson and Kyotaro HORIGUCHI.
Discussion: https://postgr.es/m/20190408064141.GA2016666@rfd.leadboat.com

This reverts commit d4e2a84, which added a new user with limited
permissions to run the TAP tests of pg_rewind. The Windows buildfarm
members jacana and bowerbird have been complaining about that: the new
role is not able to run the rewind because SSPI is not configured to
allow it.
Fixing the test requires passing the new user down directly to
pg_regress with --create-role so that SSPI can work properly.
Reported-by: Andrew Dunstan
Discussion: https://postgr.es/m/3cd43d33-f415-cc41-ade3-7230ab15b2c9@2ndQuadrant.com

This adds a row to the pg_stat_database view with datoid 0 and datname
NULL for those objects that are not in a database. This was added
particularly for checksums, but we were already tracking more
statistics for these objects, just not returning them.
Also add a checksum_last_failure column that holds the timestamptz of
the last checksum failure that occurred in a database (or in a
non-database file), if any.
Author: Julien Rouhaud <rjuju123@gmail.com>
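
A sketch of querying the new row (query mine, not from the commit):

    SELECT numbackends, checksum_failures, checksum_last_failure
      FROM pg_stat_database
     WHERE datname IS NULL;  -- the shared-objects row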

In case of a partition index, when swapping the old and new index, we
also need to attach the new index as a partition and detach the old
one. Also, to handle partition indexes, we not only need to change
dependencies referencing the index, but also dependencies of the index
referencing something else. The previous code did this only
specifically for a constraint, but we also need to do this for
partitioned indexes. So instead write a generic function that does it
for all dependencies.
Author: Michael Paquier <michael@paquier.xyz>
Author: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/DF4PR8401MB11964EDB77C860078C343BEBEE5A0%40DF4PR8401MB1196.NAMPRD84.PROD.OUTLOOK.COM#154df1fedb735190a773481765f7b874

Commit ad308058 switched to returning a FullTransactionId, but
failed to load the potentially updated value in the case where
xidVacLimit is reached and we release and reacquire the lock.
Repair, closing bug #15727.
While reviewing that commit, also fix the size computation used
by EstimateTransactionStateSize() and switch to the mul_size()
macro traditionally used in such expressions.
Author: Thomas Munro
Reported-by: Roman Zharkov
Discussion: https://postgr.es/m/15727-0be246e7d852d229%40postgresql.org

Up to now the tests of pg_rewind have been using a superuser for all
the tests (which is the default of many tests actually, and something
that ought to be reviewed) when involving an online source server, yet
it is possible to use a non-superuser role for that as long as this
role is granted permissions to execute all the source-side functions
used for the rewind. This has been possible since v11, and was already
documented as of bfc8068.
This will make it easy to catch any change in pg_rewind if the tool
begins to use more backend-side functions, so that the properties
introduced by v11 are kept.
Per suggestion from Peter Eisentraut.
Author: Michael Paquier
Reviewed-by: Magnus Hagander
Discussion: https://postgr.es/m/20190411041336.GM2728@paquier.xyz

Fix more strcmp() calls using boolean-like comparisons for result checks

Such calls can confuse the reader, as strcmp() returns an integer result.
The places patched here have been spotted by Thomas Munro, David Rowley
and myself.
Author: Michael Paquier
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/20190411021946.GG2728@paquier.xyz

Move the strings, numerology, insert, insert_conflict, select and
errors tests to be parts of nearby parallel groups, instead of
executing by themselves. (Moving "select" required adjusting the
constraints test, which uses a table named "tmp" as select also
does. There don't seem to be any other conflicts.)
Move psql and stats_ext to the next parallel group, where the rules
test also has a long runtime. To make it safe to run stats_ext in
parallel with rules, I adjusted the latter to only dump views/rules
from the pg_catalog and public schemas, which was what it was doing
anyway. stats_ext makes some views in a transient schema, which now
will not affect rules.
Reorder serial_schedule to match parallel_schedule.
Discussion: https://postgr.es/m/735.1554935715@sss.pgh.pa.us

This test script verifies that KNN searches of an SP-GiST index
produce the same sort order as a seqscan-and-sort. The FULL JOINs
used for that are exceedingly slow, however. Investigation shows
that the problem is that the initial join is on the rank() values,
and we have a lot of duplicates due to the data set containing 1000
duplicate points. We're therefore going to produce 1000000 join
rows that have to be thrown away again by the join filter.
We can improve matters by using row_number() instead of rank(),
so that the initial join keys are unique. The catch is that
this makes the results sensitive to the sorting of rows with
equal distances from the reference point. That doesn't matter
for the actually-equal points, but as luck would have it, the
data set also contains two distinct points that have identical
distances to the origin. So those two rows could legitimately
appear in either order, causing unwanted output from the check
queries.
However, it doesn't seem like it's the job of this test to
check whether the <-> operator correctly computes distances;
its charter is just to verify that SP-GiST emits the values
in distance order. So we can dodge the indeterminacy problem
by having the check compare only row numbers and distances,
not the actual point values.
This change reduces the run time of create_index_spgist by a good
three-quarters, on my machine, with ensuing beneficial effects on
the runtime of create_index (thanks to interactions with CREATE
INDEX CONCURRENTLY tests in the latter). I see a net improvement
of more than 2X in the runtime of their parallel test group.
Discussion: https://postgr.es/m/735.1554935715@sss.pgh.pa.us

The point of this change is to increase the potential for parallelism
while running the core regression tests. Most people these days are
using parallel testing modes on multi-core machines, so we might as
well try a bit harder to keep multiple cores busy. Hence, a test that
runs much longer than others in its parallel group is a candidate to
be sub-divided.
In this patch, create_index.sql and join.sql are split up.
I haven't changed the content of the tests in any way, just
moved them.
I moved create_index.sql's SP-GiST-related tests into a new script
create_index_spgist, and moved its btree multilevel page deletion test
over to the existing script btree_index. (btree_index is a more natural
home for that test, and it's shorter than others in its parallel group,
so this doesn't hurt total runtime of that group.) There might be
room for more aggressive splitting of create_index, but this is enough
to improve matters considerably.
Likewise, I moved join.sql's "exercises for the hash join code" into
a new file join_hash. Those exercises contributed three-quarters of
the script's runtime. Which might well be excessive ... but for the
moment, I'm satisfied with shoving them into a different parallel
group, where they can share runtime with the roughly-equally-lengthy
gist test.
(Note for anybody following along at home: there are interesting
interactions between the runtimes of create_index and anything running
in parallel with it, because the tests of CREATE INDEX CONCURRENTLY
in that file will repeatedly block waiting for concurrent transactions
to commit. As committed in this patch, create_index and
create_index_spgist have roughly equal runtimes, but that's mostly an
artifact of forced synchronization of the CONCURRENTLY tests; when run
serially, create_index is much faster. A followup patch will reduce
the runtime of create_index_spgist and thereby also create_index.)
Discussion: https://postgr.es/m/735.1554935715@sss.pgh.pa.us

The test for statement timeout has a 2-second timeout, which was only
moderately annoying when it was written, but nowadays it contributes
a pretty significant chunk of the elapsed time needed to run the core
regression tests on a fast machine. We can improve this situation by
pushing the test into a plpgsql-specific test file instead of having
it in a core regression test. That's a clean win when considering
just the core tests. Even when considering check-world or a buildfarm
test run, we should come out ahead because the core tests get run
more times in those sequences.
Furthermore, since the plpgsql tests aren't currently parallelized,
it seems likely that the timing problems reflected in commit f1e671a0b
(which increased that timeout from 1 sec to 2) will be much less severe
in this context. Hence, let's try cutting the timeout back to 1 second
in hopes of a further win for check-world. We can undo that if
buildfarm experience proves it to be a bad idea.
To give the new test file some modicum of intellectual coherency,
I moved the surrounding tests related to error-trapping along with
the statement timeout test proper. Those other tests don't run long
enough to have any particular bearing on test-runtime considerations.
The tests are the same as before, except with minor adjustments to
not depend on an externally-created table.
Discussion: https://postgr.es/m/735.1554935715@sss.pgh.pa.us

Code coverage comparisons confirm that the tests using
quad_poly_tbl_ord_seq1/quad_poly_tbl_ord_idx1 hit no code
paths not also covered by the similar tests using
quad_poly_tbl_ord_seq2/quad_poly_tbl_ord_idx2. Since these
test cases are pretty expensive, they need to contribute more
than zero benefit.
In passing, make quad_poly_tbl_ord_seq2 a temp table, since
there seems little reason to keep it around after the test.
Discussion: https://postgr.es/m/735.1554935715@sss.pgh.pa.us

indexing.sql's test for this feature was added along with the
feature in commit 2b2727343. However, shortly later that test was
rendered ineffective by commit 074251db6, which limited when the
optimization would be applied, so that the test didn't test it.
Since then, commit dd299df81 added new tests (in btree_index.sql)
that actually do test the feature. Code coverage comparisons
confirm that this test sequence adds no meaningful coverage, and
it's rather expensive, accounting for nearly half of the runtime
of indexing.sql according to my measurements. So let's remove it.
Per advice from Peter Geoghegan.
Discussion: https://postgr.es/m/735.1554935715@sss.pgh.pa.us

This style is frowned upon. I inadvertently introduced one in commit
fe0e0b4fc7f0. (My compiler does not complain about it, even though
-Wdeclaration-after-statement is specified. Weird.)
Author: Masahiko Sawada

Warnings about unary minus might have been wrong. It's a bit
surprising that nobody noticed yet ... probably the precedence-warning
feature hasn't really been used much in the field.
Rikard Falkeborn
Discussion: https://postgr.es/m/CADRDgG6fzA8A2oeygUw4=o7ywo4kvz26NxCSgpq22nMD73Bx4Q@mail.gmail.com

The transaction that is initiated by the parallel worker to cooperate
with the actual transaction started by the main backend to complete the
query execution should not be counted as a separate transaction. The
other internal transactions started and committed by the parallel worker
are still counted as separate transactions, as that is what we do in
other places like autovacuum.
This will partially fix the bloat in transaction stats due to additional
transactions performed by parallel workers. For a complete fix, we need to
decide how we want to show all the transactions that are started internally
for various operations, and that is a matter for a separate patch.
Reported-by: Haribabu Kommi
Author: Haribabu Kommi
Reviewed-by: Amit Kapila, Jamison Kirk and Rahila Syed
Backpatch-through: 9.6
Discussion: https://postgr.es/m/CAJrrPGc9=jKXuScvNyQ+VNhO0FZk7LLAShAJRyZjnedd2D61EQ@mail.gmail.com

This has to be prevented because inlining would result in multiple
self-references, which we don't support (and in fact that's disallowed
by the SQL spec, see statements about linearly vs. nonlinearly
recursive queries). Bug fix for commit 608b167f9.
Per report from Yaroslav Schekin (via Andrew Gierth)
Discussion: https://postgr.es/m/87wolmg60q.fsf@news-spur.riddles.org.uk

Commit 25ee70511ec2 introduced a memory leak in pgbench: some PGresult
structs were not being freed during error bailout, because we're now
doing more PQgetResult() calls than previously. Since there's more
cleanup code outside the discard_response() routine than in it, refactor
the cleanup code, removing the routine.
This has little effect currently, since we abandon processing after
hitting errors, but if we ever get further pgbench features (such as
testing for serializable transactions), it'll matter.
Per Coverity.
Reviewed-by: Michaël Paquier

We weren't testing anything involving EPQ on UPDATEs that move tuples
into different partitions. Depending on the implementation,
it might be that these cases aren't actually very interesting ...
but given our thin coverage of EPQ in general, I think it's a good
idea to have a test case.
Amit Langote, minor tweak by me
Discussion: https://postgr.es/m/7889df35-ad1a-691a-00e3-4d4b18f364e3@lab.ntt.co.jp

The MSVC build system already did this, and commit
617dc6d299c957e2784320382b3277ede01d9c63 used it in a second file.
Back-patch to 9.4, like that commit.
Discussion: https://postgr.es/m/CAA8=A7_1SWc3+3Z=-utQrQFOtrj_DeohRVt7diA2tZozxsyUOQ@mail.gmail.com

We've long had reports of intermittent "could not reattach to shared
memory" errors on Windows. Buildfarm member dory fails that way when
PGSharedMemoryReAttach() execution overlaps with creation of a thread
for the process's "default thread pool". Fix that by providing a second
region to receive asynchronous allocations that would otherwise intrude
into UsedShmemSegAddr. In pgwin32_ReserveSharedMemoryRegion(), stop
trying to free reservations landing at incorrect addresses; the caller's
next step has been to terminate the affected process. Back-patch to 9.4
(all supported versions).
Reviewed by Tom Lane. He also did much of the prerequisite research;
see commit bcbf2346d69f6006f126044864dd9383d50d87b4.
Discussion: https://postgr.es/m/20190402135442.GA1173872@rfd.leadboat.com

join_is_legal() needs to reject forming certain outer joins in cases
where that would lead the planner down a blind alley. However, it
mistakenly supposed that the way to handle full joins was to treat them
as applying the same constraints as for left joins, only to both sides.
That doesn't work, as shown in bug #15741 from Anthony Skorski: given
a lateral reference out of a join that's fully enclosed by a full join,
the code would fail to believe that any join ordering is legal, resulting
in errors like "failed to build any N-way joins".
However, we don't really need to consider full joins at all for this
purpose, because we effectively force them to be evaluated in syntactic
order, and that order is always legal for lateral references. Hence,
get rid of this broken logic for full joins and just ignore them instead.
This seems to have been an oversight in commit 7e19db0c0.
Back-patch to all supported branches, as that was.
Discussion: https://postgr.es/m/15741-276f1f464b3f40eb@postgresql.org

The es_root_result_relations array needs to be shallow-copied in the
same way as the main es_result_relations array, else EPQ rechecks on
partitioned result relations fail, as seen in bug #15677 from
Norbert Benkocs.
Amit Langote, isolation test case added by me
Discussion: https://postgr.es/m/15677-0bf089579b4cd02d@postgresql.org
Discussion: https://postgr.es/m/19321.1554567786@sss.pgh.pa.us

Explain that it is not enforced that querying a generated column
returns data that is consistent with the data that was stored. This
is similar to the note about constraints nearby.
Reported-by: Amit Langote <amitlangote09@gmail.com>

vacuum_truncate controls whether vacuum tries to truncate off
any empty pages at the end of the table. Previously vacuum always
tried to do the truncation. However, the truncation could cause
some problems; for example, ACCESS EXCLUSIVE lock needs to
be taken on the table during the truncation and can cause
the query cancellation on the standby even if hot_standby_feedback
is true. Setting this reloption to false can be helpful to avoid
such problems.
Author: Tsunakawa Takayuki
Reviewed-By: Julien Rouhaud, Masahiko Sawada, Michael Paquier, Kirk Jamison and Fujii Masao
Discussion: https://postgr.es/m/CAHGQGwE5UqFqSq1=kV3QtTUtXphTdyHA-8rAj4A=Y+e4kyp3BQ@mail.gmail.com
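
Illustrative usage (table name hypothetical):

    ALTER TABLE vac_test SET (vacuum_truncate = false);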

When using tableam, ExecFetchSlotHeapTuple() might return a separately
allocated tuple. We could use the shouldFree argument to explicitly
free it, but it seems more robust to protect against leaking it.
Also add a CHECK_FOR_INTERRUPTS() after each tuple. It's likely that
each AM has (heap does) a CFI somewhere in the relevant path, but it
seems more robust to have one in validateForeignKeyConstraint()
itself.
Note that this only affects the cases that couldn't be optimized to be
verified with a query.
Author: Andres Freund
Reviewed-By: Tom Lane (in an earlier version)
Discussion:
https://postgr.es/m/19030.1554574075@sss.pgh.pa.us
https://postgr.es/m/CAKJS1f_SHKcPYMsi39An5aUjhAcEMZb6Cx1Sj1QWEWSiKJkBVQ@mail.gmail.com
https://postgr.es/m/20180711185628.mrvl46bjgk2uxoki@alap3.anarazel.de

This commit fixes three, unfortunately related, issues:
1) Since 5db6df0c01, the introduction of DML via tableam, it was
possible to trigger "ERROR: unexpected table_lock_tuple status: 1"
when updating a row that was previously updated in the same
transaction - but only when the previously updated row had before that
been updated in a concurrent transaction (and READ COMMITTED was
used). The reason for that was that this case simply wasn't
expected. Fixing that led to:
2) Even before the above commit, there were error checks (introduced
in 6868ed7491b7) preventing a row being updated by different
commands within the same statement (say in a function called by an
UPDATE) - but that check wasn't performed when the row was first
updated in a concurrent transaction - instead the second update was
silently skipped in that case. After this change we throw the same
error as we'd without the concurrent transaction.
3) The error messages (introduced in 6868ed7491b7) preventing such
updates emitted the same error message for both DELETE and
UPDATE ("tuple to be updated was already modified by an operation
triggered by the current command"). While that could be changed
separately, it made it hard to write tests that verify the
correct behavior of the code.
This commit changes heap's implementation of table_lock_tuple() to
return TM_SelfModified instead of TM_Invisible (previously loosely
modeled after EvalPlanQualFetch), and teaches nodeModifyTable.c to
handle that in response to table_lock_tuple() and not just in response
to table_(delete|update).
Additionally it fixes the wrong error message (see 3 above). The
comment for table_lock_tuple() is also adjusted to state that
TM_Deleted won't return information in TM_FailureData - it'll not
always be available.
This also adds tests to ensure that DELETE/UPDATE correctly error out
when affecting a row that concurrently was modified by another
transaction.
Author: Andres Freund
Reported-By: Tom Lane, when investigating a bug fix to another bug
by Amit Langote
Discussion: https://postgr.es/m/19321.1554567786@sss.pgh.pa.us

As bug #15733 has proved, we are lacking coverage for partition tuple
routing with dropped attributes when involving three levels of
partitioning or more. The only active bug in this area was in v11,
and HEAD proves to handle those scenarios fine, but it still lacked
some coverage for the previous problem.
Author: Amit Langote, Michael Paquier
Discussion: https://postgr.es/m/15733-7692379e310b80ec@postgresql.org

pg_get_indexdef_worker carelessly fetched indoption entries even for
non-key index columns that don't have one. 99.999% of the time this
would be harmless, since the code wouldn't examine the value ... but
some fine day this will be a fetch off the end of memory, resulting
in SIGSEGV.
Detected through valgrind testing. Odd that the buildfarm's valgrind
critters haven't noticed.

The new command lists partitioned relations (tables and/or indexes),
possibly with their sizes, possibly including partitioned partitions;
their parents (if not top-level); for indexes, the tables they belong
to; and their descriptions.
While there are various possible improvements to this, having it in this
form is already a great improvement over not having any way to obtain
this report.
Author: Pavel Stěhule, with help from Mathias Brossard, Amit Langote and
Justin Pryzby.
Reviewed-by: Amit Langote, Mathias Brossard, Melanie Plageman,
Michaël Paquier, Álvaro Herrera
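
Illustrative psql usage (examples mine, not from the commit):

    \dP+    list all partitioned relations, with sizes
    \dPt    list only partitioned tables
    \dPi    list only partitioned indexes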

Before those commits, partitioning-related code in the executor could
assume that ModifyTableState.resultRelInfo[] contains only leaf partitions.
However, now a fully-pruned update results in a dummy ModifyTable that
references the root partitioned table, and that breaks some stuff.
In v11, this led to an assertion or core dump in the tuple routing code.
Fix by disabling tuple routing, since we don't need that anyway.
(I chose to do that in HEAD as well for safety, even though the problem
doesn't manifest in HEAD as it stands.)
In v10, this confused ExecInitModifyTable's decision about whether it
needed to close the root table. But we can get rid of that altogether
by being smarter about where to find the root table.
Note that since the referenced commits haven't shipped yet, this
isn't fixing any bug the field has seen.
Amit Langote, per a report from me
Discussion: https://postgr.es/m/20710.1554582479@sss.pgh.pa.us

This uses the same infrastructure that the CREATE INDEX progress
reporting uses. Add a column to pg_stat_progress_create_index to
report the OID of the index being worked on. This was not necessary
for CREATE INDEX, but it's useful for REINDEX.
Also edit the phase descriptions a bit to be more consistent with the
source code comments.
Discussion: https://www.postgresql.org/message-id/ef6a6757-c36a-9e81-123f-13b19e36b7d7%402ndquadrant.com
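
A hypothetical monitoring query against the extended view:

    SELECT pid, relid::regclass, index_relid::regclass, phase
      FROM pg_stat_progress_create_index;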

Fix some places where we might fail to do Py_DECREF() on a Python
object (thereby leaking it for the rest of the session). Almost
all of the risks were in error-recovery paths, which we don't really
expect to hit anyway. Hence, while this is definitely a bug fix,
it doesn't quite seem worth back-patching.
Nikita Glukhov, Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/28053a7d-10d8-fc23-b05c-b4749c873f63@postgrespro.ru

The foreign-key-checking loop in ATRewriteTables failed to ignore
relations without storage (e.g., partitioned tables), unlike the
initial loop. This accidentally worked as long as RI_Initial_Check
succeeded, which it does in most practical cases (including all the
ones exercised in the existing regression tests :-(). However, if
that failed, as for instance when there are permissions issues,
then we entered the slow fire-the-trigger-on-each-tuple path.
And that would try to read from the referencing relation, and fail
if it lacks storage.
A second problem, recently introduced in HEAD, was that this loop
had been broken by sloppy refactoring for the tableam API changes.
Repair both issues, and add a regression test case so we have some
coverage on this code path. Back-patch as needed to v11.
(It looks like this code could do with additional bulletproofing,
but let's get a working test case in place first.)
Hadi Moshayedi, Tom Lane, Andres Freund
Discussion: https://postgr.es/m/CAK=1=WrnNmBbe5D9sm3t0a6dnAq3cdbF1vXY816j1wsMqzC8bw@mail.gmail.com
Discussion: https://postgr.es/m/19030.1554574075@sss.pgh.pa.us
Discussion: https://postgr.es/m/20190325180405.jytoehuzkeozggxx%40alap3.anarazel.de

Similarly to the set of parameters for keepalive, a connection parameter
for libpq is added as well as a backend GUC, called tcp_user_timeout.
Increasing the TCP user timeout is useful to allow a connection to
survive extended periods without end-to-end connectivity, and decreasing
it allows applications to fail faster. By default, the parameter is 0,
which makes the connection use the system default, and its handling
follows a logic close to the keepalive parameters. When connecting
through a Unix domain socket, the parameters have no effect.
Author: Ryohei Nagaura
Reviewed-by: Fabien Coelho, Robert Haas, Kyotaro Horiguchi, Kirk
Jamison, Mikalai Keida, Takayuki Tsunakawa, Andrei Yahorau
Discussion: https://postgr.es/m/EDA4195584F5064680D8130B1CA91C45367328@G01JPEXMBYT04
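
Illustrative server-side usage (value hypothetical; the unit is
milliseconds):

    SET tcp_user_timeout = 5000;  -- give up on an unresponsive peer after 5s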

If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after starting up just
the first child node, not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com

This allows the user to create duplicates of existing replication slots,
either logical or physical, and even changing properties such as whether
they are temporary or the output plugin used.
There are multiple uses for this, such as initializing multiple replicas
using the slot for one base backup; when investigating logical
replication issues; and to select a different output plugin.
Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Andres Freund, Petr Jelinek
Discussion: https://postgr.es/m/CAD21AoAm7XX8y_tOPP6j4Nzzch12FvA1wPqiO690RCk+uYVstg@mail.gmail.com
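
An illustrative call (slot and plugin names hypothetical):

    SELECT pg_copy_logical_replication_slot('orig_slot', 'copy_slot',
                                            false, 'test_decoding');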

Prior to v12, if you used a collation-sensitive regex feature in a
pattern handled by processSQLNamePattern() (for instance, \d '\\w+'
in psql), the behavior you got matched the database's default collation.
Since commit 586b98fdf you'd usually get C-collation behavior, because
the catalog "name"-type columns are now marked as COLLATE "C". Add
explicit COLLATE specifications to restore the prior behavior.
(Note for whoever writes the v12 release notes: the need for this shows
that while 586b98fdf preserved pre-v12 behavior of "name" columns for
simple comparison operators, it changed the behavior of regex operators
on those columns. Although this patch fixes it for pattern matches
generated by our own tools, user-written queries will still be affected.
So we'd better mention this issue as a compatibility item.)
Daniel Vérité
Discussion: https://postgr.es/m/701e51f0-0ec0-4e70-a365-1958d66dd8d2@manitou-mail.org

The limitations that it is not allowed to create/attach a foreign table
as a partition of an indexed partitioned table were not documented.
Reported-By: Stepan Yankevych
Author: Etsuro Fujita
Reviewed-By: Amit Langote
Backpatch-through: 11 where partitioned index was introduced
Discussion: https://postgr.es/m/1553869152.858391073.5f8m3n0x@frv53.fwdcdn.com

This reverts commits 2f932f71d9f2963bbd201129d7b971c8f5f077fd,
16ee6eaf80a40007a138b60bb5661660058d0422 and
6f0e190056fe441f7cf788ff19b62b13c94f68f3. The buildfarm has revealed
several bugs. Back-patch like the original commits.
Discussion: https://postgr.es/m/20190404145319.GA1720877@rfd.leadboat.com

Commit 3eb77eba moved a _mdfd_getseg() call from mdsync() into a new
callback function mdsyncfiletag(), but didn't get the arguments quite
right. Without the EXTENSION_DONT_CHECK_SIZE flag we fail to open a
segment if lower-numbered segments have been truncated, and it wants
a block number rather than a segment number.
While comparing with the older coding, also remove an unnecessary
clobbering of errno, and adjust the code in mdunlinkfiletag() to
resemble the original code from mdpostckpt() more closely instead
of using an unnecessary call to smgropen().
Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BhUKGL%2BYLUOA0eYiBXBfwW%2BbH5kFgh94%3DgQH0jHEJ-t5Y91wQ%40mail.gmail.com

There was some confusion over the format of the error message returned
from the server during GSSAPI startup; specifically, it was expected
that a length would be returned when, in reality, at this early stage in
the startup sequence, no length is returned from the server as part of
an error message.
Correct the client-side code for dealing with error messages sent by the
server during startup by simply reading what's available into our
buffer, after we've discovered it's an error message, and then reporting
back what was returned.
In passing, also add in documentation of the environment variable
PGGSSENCMODE which was missed previously, and adjust the code to look
for the PGGSSENCMODE variable (the environment variable change was
missed in the prior GSSMODE -> GSSENCMODE commit).
Error-handling issue discovered by Peter Eisentraut, the rest were items
discovered during testing of the error handling.

Since 11, it is possible to use a non-superuser role when using an
online source cluster with pg_rewind as long as the role has proper
permissions to execute on the source all the functions used by
pg_rewind, yet the documentation stated that a superuser is necessary.
Let's add at the same time all the details needed to create such a
role.
A second point of confusion that comes up a lot with users is that it
is necessary to issue a checkpoint on a freshly-promoted standby so
that its control file has up-to-date timeline information, which is
used by pg_rewind to validate the operation. Let's document that
properly. This is back-patched down to 9.5 where pg_rewind has been
introduced.
Author: Michael Paquier
Reviewed-by: Magnus Hagander
Discussion: https://postgr.es/m/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb+Bu=h1m=rLdn=g@mail.gmail.com
Backpatch-through: 9.5
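
A sketch of such a role, per the functions listed in the documentation
(role name hypothetical):

    CREATE USER rewind_user LOGIN;
    GRANT EXECUTE ON function pg_catalog.pg_ls_dir(text, boolean, boolean)
      TO rewind_user;
    GRANT EXECUTE ON function pg_catalog.pg_stat_file(text, boolean)
      TO rewind_user;
    GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text)
      TO rewind_user;
    GRANT EXECUTE ON function
      pg_catalog.pg_read_binary_file(text, bigint, bigint, boolean)
      TO rewind_user;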

Previously it was allowed to set default_table_access_method to an
empty string. That makes sense for default_tablespace, where that was
copied from, as it signals falling back to the database's default
tablespace. As there is no equivalent for table AMs, forbid that.
Also make sure to throw a usable error when creating a table using an
index AM, by using get_am_type_oid() to implement get_table_am_oid()
instead of a separate copy. Previously we'd error out only later, in
GetTableAmRoutine().
Thirdly remove GetTableAmRoutineByAmId() - it was only used in an
earlier version of 8586bf7ed8.
Add tests for the above (some for index AMs as well).
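
For illustration (table name hypothetical; exact error wording may
differ), such a mistake is now caught up front:

    CREATE TABLE am_test (a int) USING btree;  -- fails: btree is an index AM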

Commit c1afd175, which added support for rootdescend verification to
amcheck, added only minimal regression test coverage. Address this by
making sure that rootdescend verification is run on a multi-level index.
In passing, simplify some of the regression tests that exercise
multi-level nbtree page deletion.
Both issues spotted while rereviewing coverage of the nbtree patch
series using gcov.
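
Illustrative invocation (index name hypothetical):

    SELECT bt_index_parent_check('multi_level_idx'::regclass,
                                 heapallindexed => true,
                                 rootdescend => true);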

This adds table_multi_insert(), and converts COPY FROM, the only user
of heap_multi_insert, to it.
A simple conversion of COPY FROM use slots would have yielded a
slowdown when inserting into a partitioned table for some
workloads. Different partitions might need different slots (both slot
types and their descriptors), and dropping / creating slots when
there's constant partition changes is measurable.
Thus instead revamp the COPY FROM buffering for partitioned tables to
allow to buffer inserts into multiple tables, flushing only when
limits are reached across all partition buffers. By only dropping
slots when there've been inserts into too many different partitions,
the aforementioned overhead is gone. By allowing larger batches, even
when there are frequent partition changes, we actually speed such cases
up significantly.
By using slots, COPY of very narrow rows into unlogged / temporary
tables might slow down very slightly (due to the indirect function
calls).
Author: David Rowley, Andres Freund, Haribabu Kommi
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de

This is intended for use mostly in test scripts for external tools,
which could do without cross-PG-version variations in error message
wording. Of course, the SQLSTATE isn't guaranteed stable either, but
it should be more so than the error message text.
Note: there's a bit of an ABI change for libpq here, but it seems
OK because if somebody compiles against a newer version of libpq-fe.h,
and then tries to pass PQERRORS_SQLSTATE to PQsetErrorVerbosity()
of an older libpq library, it will be accepted and then act like
PQERRORS_DEFAULT, thanks to the way the tests in pqBuildErrorMessage3
have historically been phrased. That seems acceptable.
Didier Gautheron, reviewed by Dagfinn Ilmari Mannsåker
Discussion: https://postgr.es/m/CAJRYxuKyj4zA+JGVrtx8OWAuBfE-_wN4sUMK4H49EuPed=mOBw@mail.gmail.com
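
Illustrative psql session (example mine, not from the commit):

    \set VERBOSITY sqlstate
    SELECT 1/0;
    ERROR:  22012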

The assertions added by commit b04aeb0a0 exposed that there are some
code paths wherein the executor will try to open an index without
holding any lock on it. We do have some lock on the index's table,
so it seems likely that there's no fatal problem with this (for
instance, the index couldn't get dropped from under us). Still,
it's bad practice and we should fix it.
To do so, remove the optimizations in ExecInitIndexScan and friends
that tried to avoid taking a lock on an index belonging to a target
relation, and just take the lock always. In non-bug cases, this
will result in no additional shared-memory access, since we'll find
in the local lock table that we already have a lock of the desired
type; hence, no significant performance degradation should occur.
Also, adjust the planner and executor so that the type of lock taken
on an index is always identical to the type of lock taken for its table,
by relying on the recently added RangeTblEntry.rellockmode field.
This avoids some corner cases where that might not have been true
before (possibly resulting in extra locking overhead), and prevents
future maintenance issues from having multiple bits of logic that
all needed to be in sync. In addition, this change removes all core
calls to ExecRelationIsTargetRelation, which avoids a possible O(N^2)
startup penalty for queries with large numbers of target relations.
(We'd probably remove that function altogether, were it not that we
advertise it as something that FDWs might want to use.)
Also adjust some places in selfuncs.c to not take any lock on indexes
they are transiently opening, since we can assume that plancat.c
did that already.
In passing, change gin_clean_pending_list() to take RowExclusiveLock
not AccessShareLock on its target index. Although it's not clear that
that's actually a bug, it seemed very strange for a function that's
explicitly going to modify the index to use only AccessShareLock.
David Rowley, reviewed by Julien Rouhaud and Amit Langote,
a bit of further tweaking by me
Discussion: https://postgr.es/m/19465.1541636036@sss.pgh.pa.us

This commit adds a new reloption, vacuum_index_cleanup, which
controls whether index cleanup is performed for a particular
relation by default. It also adds a new option to the VACUUM
command, INDEX_CLEANUP, which can be used to override the
reloption. If neither the reloption nor the VACUUM option is
used, the default is true, as before.
Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
The wording of the documentation is mostly due to me.
Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@mail.gmail.com
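
Illustrative usage (table name hypothetical):

    VACUUM (INDEX_CLEANUP FALSE) vac_test;
    ALTER TABLE vac_test SET (vacuum_index_cleanup = false);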

_bt_check_unique() failed to invalidate binary search bounds in the
event of a live conflict following commit e5adcb78. This resulted in
problems after waiting for the conflicting xact to commit or abort. The
subsequent call to _bt_check_unique() would restore the initial binary
search bounds, rather than starting a new search. Fix by explicitly
invalidating bounds when it becomes clear that there is a live conflict
that insertion will have to wait to resolve.
Ashutosh Sharma, with a few additional tweaks by me.
Author: Ashutosh Sharma
Reported-By: Ashutosh Sharma
Diagnosed-By: Ashutosh Sharma
Discussion: https://postgr.es/m/CAE9k0PnQp-qr-UYKMSCzdC2FBzdE4wKP41hZrZvvP26dKLonLg@mail.gmail.com

The be_gssapi_get_* prototypes were put close to similar ones for SSL,
but a bit too close since that meant they ended up only being included
for SSL-enabled builds. Move those to be under ENABLE_GSS instead.
Pointed out by Tom.

Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off fsync work in the same way.
For now only md.c uses the new interface, but other users are being
proposed. Since there may be use cases that are not strictly SMGR
implementations, use a new function table for sync handlers rather
than extending the traditional SMGR one.
Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key. This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse or high
segment numbers).
Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com

Commit 2f932f71d9f2963bbd201129d7b971c8f5f077fd added code that elicits
a warning on buildfarm member flaviventris. Back-patch to 9.4, like
that commit.
Reported by Andres Freund.
Discussion: https://postgr.es/m/20190404020057.galelv7by75ekqrh@alap3.anarazel.de

Buildfarm members idiacanthus and komodoensis, which share a host, both
executed this test in the same second. That failed. Back-patch to 9.6,
where the test first appeared.
Discussion: https://postgr.es/m/20190404020543.GA1319573@rfd.leadboat.com

c251336 has added some tests to check if a toast relation should be
empty or not, hardcoding the toast relation name when calling
pg_relation_size(). pg_class.reltoastrelid offers the same information,
so simplify the tests to use that.
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/20190403065949.GH3298@paquier.xyz
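
A sketch of the simplified lookup (table name hypothetical):

    SELECT pg_relation_size(reltoastrelid) = 0 AS toast_is_empty
      FROM pg_class
     WHERE oid = 'toast_test'::regclass;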

Before the pgwin32_signal_initialize() call, the backend version of
pg_usleep() has no effect. No in-tree code falls afoul of that today,
but temporary commit 23078689a9921968ac0873b017be6e7f772f10bc did so.
Discussion: https://postgr.es/m/20190402135442.GA1173872@rfd.leadboat.com

Handle USE_MODULE_DB for all tests able to use an installed postmaster.

When $(MODULES) and $(MODULE_big) are empty, derive the database name
from the first element of $(REGRESS) instead of using a constant string.
When deriving the database name from $(MODULES), use its first element
instead of the entire list; the earlier approach would fail if any
multi-module directory had $(REGRESS) tests. Treat isolation suites and
src/pl correspondingly. Under USE_MODULE_DB=1, installcheck-world and
check-world no longer reuse any database name in a given postmaster.
Buildfarm members axolotl, mandrill and frogfish saw spurious "is being
accessed by other users" failures that would not have happened without
database name reuse. (The CountOtherDBBackends() 5s deadline expired
during DROP DATABASE; a backend for an earlier test suite had used the
same database name and had not yet exited.) Back-patch to 9.4 (all
supported versions), except bits pertaining to isolation suites.
Concept reviewed by Andrew Dunstan, Andres Freund and Tom Lane.
Discussion: https://postgr.es/m/20190401135213.GE891537@rfd.leadboat.com

postmaster startup scrutinizes any shared memory segment recorded in
postmaster.pid, exiting if that segment matches the current data
directory and has an attached process. When the postmaster.pid file was
missing, a starting postmaster used weaker checks. Change to use the
same checks in both scenarios. This increases the chance of a startup
failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1
postmaster.pid` && rm postmaster.pid && pg_ctl -w start". A postmaster
will no longer recycle segments pertaining to other data directories.
That's good for production, but it's bad for integration tests that
crash a postmaster and immediately delete its data directory. Such a
test now leaks a segment indefinitely. No "make check-world" test does
that. win32_shmem.c already avoided all these problems. In 9.6 and
later, enhance PostgresNode to facilitate testing. Back-patch to 9.4
(all supported versions).
Reviewed by Daniel Gustafsson and Kyotaro HORIGUCHI.
Discussion: https://postgr.es/m/20130911033341.GD225735@tornado.leadboat.com

Query planning is affected by a number of configuration options, and it
may be crucial to know which of those options were set to non-default
values. With this patch you can say EXPLAIN (SETTINGS ON) to include
that information in the query plan. Only options affecting planning,
with values different from the built-in default are printed.
This patch also adds auto_explain.log_settings option, providing the
same capability in auto_explain module.
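As a quick illustration (a hypothetical session; the Settings line only
lists options whose values differ from the built-in defaults):

SET work_mem = '64MB';
EXPLAIN (SETTINGS ON) SELECT 1;
-- the plan output now ends with a line like:
--   Settings: work_mem = '64MB'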
Author: Tomas Vondra
Reviewed-by: Rafia Sabih, John Naylor
Discussion: https://postgr.es/m/e1791b4c-df9c-be02-edc5-7c8874944be0@2ndquadrant.com

This is useful to obtain a view of the different transaction types in an
application, regardless of the durations of the statements each runs.
Author: Adrien Nayrat
Reviewed-by: Masahiko Sawada, Hayato Kuroda, Andres Freund

Not required after nuking the zipfian thread-local cache.
Also add a comment about hazardous pointer punning in threadRun(),
and avoid using "thread" to refer to the threads array as a whole.
Fabien Coelho and Tom Lane, per suggestion from Alvaro Herrera
Discussion: https://postgr.es/m/alpine.DEB.2.21.1904032126060.7997@lancre

Commit ea4e1c0e8f resolved issues with memory alignment in serialized
pg_mcv_list values, but it required copying data to/from the varlena
buffer during serialization and deserialization. As the MCV lists may
be fairly large, the overhead (memory consumption, CPU usage) can get
rather significant too.
This change tweaks the serialization format so that the alignment is
correct with respect to the varlena value, and so the parts may be
accessed directly without copying the data.
Catversion bump, as it affects existing pg_statistic_ext data.

On both the frontend and backend, prepare for GSSAPI encryption
support by moving common code for error handling into a separate file.
Fix a TODO for handling multiple status messages in the process.
Eliminate the OIDs, which have not been needed for some time.
Add frontend and backend encryption support functions. Keep the
context initiation for authentication-only separate on both the
frontend and backend in order to avoid concerns about changing the
requested flags to include encryption support.
In postmaster, pull GSSAPI authorization checking into a shared
function. Also share the initiator name between the encryption and
non-encryption codepaths.
For HBA, add "hostgssenc" and "hostnogssenc" entries that behave
similarly to their SSL counterparts. "hostgssenc" requires either
"gss", "trust", or "reject" for its authentication.
Similarly, add a "gssencmode" parameter to libpq. Supported values are
"disable", "require", and "prefer". Notably, negotiation will only be
attempted if credentials can be acquired. Move credential acquisition
into its own function to support this behavior.
Add a simple pg_stat_gssapi view similar to pg_stat_ssl, for monitoring
if GSSAPI authentication was used, what principal was used, and if
encryption is being used on the connection.
Finally, add documentation for everything new, and update existing
documentation on connection security.
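For example, the encryption state of the current connection could be
inspected with a query like this (a sketch against the new view):

SELECT gss_authenticated, principal, encrypted
  FROM pg_stat_gssapi
 WHERE pid = pg_backend_pid();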
Thanks to Michael Paquier for the Windows fixes.
Author: Robbie Harwood, with changes to the read/write functions by me.
Reviewed in various forms and at different times by: Michael Paquier,
Andres Freund, David Steele.
Discussion: https://www.postgresql.org/message-id/flat/jlg1tgq1ktm.fsf@thriss.redhat.com

Previously, while primary keys could be made on partitioned tables, it
was not possible to define foreign keys that reference those primary
keys. Now it is possible to do that.
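A minimal sketch of what is now possible (hypothetical table names):

CREATE TABLE parted_pk (a int PRIMARY KEY) PARTITION BY RANGE (a);
CREATE TABLE parted_pk_1 PARTITION OF parted_pk
  FOR VALUES FROM (0) TO (1000);
CREATE TABLE referencing (a int REFERENCES parted_pk);  -- previously failed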
Author: Álvaro Herrera
Reviewed-by: Amit Langote, Jesper Pedersen
Discussion: https://postgr.es/m/20181102234158.735b3fevta63msbj@alvherre.pgsql

Instead of WAL-logging every modification during the build separately,
first build the index without any WAL-logging, and make a separate pass
through the index at the end, to write all pages to the WAL. This
significantly reduces the amount of WAL generated, and is usually also
faster, despite the extra I/O needed for the extra scan through the index.
WAL generated this way is also faster to replay.
For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
be marked with a valid (i.e. non-zero) LSN, so that the parent-child
LSN-NSN interlock works correctly. We now use magic value 1 for that during
index build. Change the fake LSN counter to begin from 1000, so that 1 is
safely smaller than any real or fake LSN. 2 would've been enough for our
purposes, but let's reserve a bigger range, in case we need more special
values in the future.
Author: Anastasia Lubennikova, Andrey V. Lepikhov
Reviewed-by: Heikki Linnakangas, Dmitry Dolgov

This uses the progress reporting infrastructure added by c16dc1aca5e0,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.
There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific. The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.
The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan. (The index validation index scan requires
patching each AM, which has not been included here.)
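Monitoring could then look like this (a sketch against the new
pg_stat_progress_create_index view; column list abridged):

SELECT pid, phase, blocks_done, blocks_total
  FROM pg_stat_progress_create_index;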
Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql

When asked for a slice of a TOAST entry, decompress enough to return the
slice instead of decompressing the entire object.
For use cases where the slice is at, or near, the beginning of the entry,
this avoids a lot of unnecessary decompression work.
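For instance, a prefix slice such as the following no longer pays for
decompressing the whole value (hypothetical names):

SELECT substr(big_text, 1, 32) FROM docs;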
This changes the signature of pglz_decompress() by adding a boolean to
indicate if it's ok for the call to finish before consuming all of the
source or destination buffers.
Author: Paul Ramsey
Reviewed-By: Rafia Sabih, Darafei Praliaskouski, Regina Obe
Discussion: https://postgr.es/m/CACowWR07EDm7Y4m2kbhN_jnys%3DBBf9A6768RyQdKm_%3DNpkcaWg%40mail.gmail.com

The upper-planner pathification allows FDWs to arrange to push down
different types of upper-stage operations to the remote side. This
commit teaches postgres_fdw to do it for the (FINAL, NULL) upperrel,
which is responsible for doing LockRows, LIMIT, and/or ModifyTable.
This provides the ability for postgres_fdw to handle SELECT commands
so that it 1) skips the LockRows step (if any) (note that this is
safe since it performs early locking) and 2) pushes down the LIMIT
and/or OFFSET restrictions (if any) to the remote side. This doesn't
handle the INSERT/UPDATE/DELETE cases.
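For example, with a sufficiently simple query the finishing steps can
now run on the remote side (hypothetical foreign table name):

SELECT * FROM remote_tbl ORDER BY a LIMIT 10;
-- the LIMIT (and any OFFSET) is now part of the remote query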
Author: Etsuro Fujita
Reviewed-By: Antonin Houska and Jeff Janes
Discussion: https://postgr.es/m/87pnz1aby9.fsf@news-spur.riddles.org.uk

This prevents the tests added by commit 4bbf6edfbd and adjusted by
commit 99f6a17dd6 from being made useless by plan changes created by
an upcoming commit.
Author: Etsuro Fujita
Discussion: https://postgr.es/m/87pnz1aby9.fsf@news-spur.riddles.org.uk

The upper-planner pathification allows FDWs to arrange to push down
different types of upper-stage operations to the remote side. This
commit teaches postgres_fdw to do it for the (ORDERED, NULL) upperrel,
which is responsible for evaluating the query's ORDER BY ordering.
Since postgres_fdw is already able to evaluate that ordering remotely
for foreign baserels and foreign joinrels (see commit aa09cd242f et al.),
this adds support for that for foreign grouping relations.
Author: Etsuro Fujita
Reviewed-By: Antonin Houska and Jeff Janes
Discussion: https://postgr.es/m/87pnz1aby9.fsf@news-spur.riddles.org.uk

When accessing a table with RLS via a view, the RLS checks are
performed as the view owner. However, the code neglected to propagate
that to any subqueries in the RLS checks. Fix that by calling
setRuleCheckAsUser() for all RLS policy quals and withCheckOption
checks for RTEs with RLS.
Back-patch to 9.5 where RLS was added.
Per bug #15708 from daurnimator.
Discussion: https://postgr.es/m/15708-d65cab2ce9b1717a@postgresql.org

This adds a new option to pg_checksums called -P/--progress, showing
every second some information about the computation state of an
operation for --check and --enable (--disable only updates the control
file and is quick). This requires a pre-scan of the data folder so that
the total size of checksummable items can be calculated, which is then
compared to the amount processed.
Similarly to what is done for pg_rewind and pg_basebackup, the
information printed in the progress report consists of the current
amount of data computed and the total amount of data to compute. This
could be extended later on.
Author: Michael Banck, Bernd Helmle
Reviewed-by: Fabien Coelho, Michael Paquier
Discussion: https://postgr.es/m/1535719851.1286.17.camel@credativ.de

On at least ZFS, it can be beneficial to create new WAL files every
time and not to bother zero-filling them. Since it's not clear which
other filesystems might benefit from one or both of those things,
add individual GUCs to control those two behaviors independently and
make only very general statements in the docs.
Author: Jerry Jelinek, with some adjustments by Thomas Munro
Reviewed-by: Alvaro Herrera, Andres Freund, Tomas Vondra, Robert Haas and others
Discussion: https://postgr.es/m/CACPQ5Fo00QR7LNAcd1ZjgoBi4y97%2BK760YABs0vQHH5dLdkkMA%40mail.gmail.com

Contrib modules pgrowlocks, pgstattuple and some functionality in
pageinspect currently only support the heap table AM. As they are all
concerned with low-level details that aren't reasonably exposed via
tableam, error out if invoked on a non-heap relation.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

This replaces the previous calls of heap_sync() in places using
bulk-insert. By passing in the flags used for bulk-insert the AM can
decide (first at insert time and then during the finish call) which of
the optimizations apply to it, and what operations are necessary to
finish a bulk insert operation.
Also change HEAP_INSERT_* flags to TABLE_INSERT, and rename hi_options
to ti_options.
These changes are made even in copy.c, which hasn't yet been converted
to tableam. There's no harm in doing so.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

Remove the code that supported zipfian distribution parameters less
than 1.0, as it had undocumented performance hazards, and it's not
clear that the case is useful enough to justify either fixing or
documenting those hazards.
Also, since the code path for parameter > 1.0 could perform badly
for values very close to 1.0, establish a minimum allowed value
of 1.001. This solution seems superior to the previous vague
documentation warning about small values not performing well.
Fabien Coelho, per a gripe from Tomas Vondra
Discussion: https://postgr.es/m/b5e172e9-ad22-48a3-86a3-589afa20e8f7@2ndquadrant.com

We can't call code that uses syscache while we hold buffer locks
on a catalog relation. If passed such a relation, just fall back
to the general effective_io_concurrency GUC rather than trying to
look up the containing tablespace's IO concurrency setting.
We might find a better way to control prefetching in follow-up
work, but for now this is enough to avoid the deadlock introduced
by commit 558a9165e0.
Reviewed-by: Andres Freund
Diagnosed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CA%2BhUKGLCwPF0S4Mk7S8qw%2BDK0Bq65LueN9rofAA3HHSYikW-Zw%40mail.gmail.com
Discussion: https://postgr.es/m/962831d8-c18d-180d-75fb-8b842e3a2742%40chrullrich.net

Add a section explaining how our XML features depart from current
versions of the SQL standard. Update and clarify the descriptions
of some XML functions.
Chapman Flack, reviewed by Ryan Lambert
Discussion: https://postgr.es/m/5BD1284C.1010305@anastigmatix.net
Discussion: https://postgr.es/m/5C81F8C0.6090901@anastigmatix.net
Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com

This unifies the various ad hoc logging (message printing, error
printing) systems used throughout the command-line programs.
Features:
- Program name is automatically prefixed.
- Message string does not end with newline. This removes a common
source of inconsistencies and omissions.
- Additionally, a final newline is automatically stripped, simplifying
use of PQerrorMessage() etc., another common source of mistakes.
- I converted error message strings to use %m where possible.
- As a result of the above several points, more translatable message
strings can be shared between different components and between
frontends and backend, without gratuitous punctuation or whitespace
differences.
- There is support for setting a "log level". This is not meant to be
user-facing, but can be used internally to implement debug or
verbose modes.
- Lazy argument evaluation, so no significant overhead if logging at
some level is disabled.
- Some color in the messages, similar to gcc and clang. Set
PG_COLOR=auto to try it out. Some colors are predefined, but can be
customized by setting PG_COLORS.
- Common files (common/, fe_utils/, etc.) can handle logging much more
simply by just using one API without worrying too much about the
context of the calling program, requiring callbacks, or having to
pass "progname" around everywhere.
- Some programs called setvbuf() to make sure that stderr is
unbuffered, even on Windows. But not all programs did that. This
is now done centrally.
Soft goals:
- Reduces vertical space use and visual complexity of error reporting
in the source code.
- Encourages more deliberate classification of messages. For example,
in some cases it wasn't clear without analyzing the surrounding code
whether a message was meant as an error or just an info.
- Concepts and terms are vaguely aligned with popular logging
frameworks such as log4j and Python logging.
This is all just about printing stuff out. Nothing affects program
flow (e.g., fatal exits). The uses are just too varied to do that.
Some existing code had wrappers that do some kind of print-and-exit,
and I adapted those.
I tried to keep the output mostly the same, but there is a lot of
historical baggage to unwind and special cases to consider, and I
might not always have succeeded. One significant change is that
pg_rewind used to write all error messages to stdout. That is now
changed to stderr.
Reviewed-by: Donald Dong <xdong@csumb.edu>
Reviewed-by: Arthur Zakirov <a.zakirov@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/6a609b43-4f57-7348-6480-bd022f924310@2ndquadrant.com

jsonb_path_match() checks whether a jsonb document matches a jsonpath
query, so the jsonpath query should return a single boolean. Currently,
if the result of the jsonpath is not a single boolean, NULL is returned
regardless of whether silent mode is on or off. But that appears to be
wrong when silent mode is off. This commit makes jsonb_path_match()
throw an error in this case.
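A sketch of the changed behavior (hypothetical values):

SELECT jsonb_path_match('{"a": 1}', '$.a == 1');         -- true
SELECT jsonb_path_match('{"a": 1}', '$.a');              -- now an error
SELECT jsonb_path_match('{"a": 1}', '$.a', '{}', true);  -- NULL in silent mode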
Author: Nikita Glukhov

Jsonpath currently accepts integers with leading zeroes and floats
starting with a dot. However, the SQL standard requires following the
JSON specification, which allows neither of these forms. Our json[b]
datatypes also restrict that. So, restrict it in jsonpath altogether.
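A sketch of the restriction, using numeric literals in a filter
expression:

SELECT '$ ? (@ > 0.1)'::jsonpath;  -- still accepted
SELECT '$ ? (@ > .1)'::jsonpath;   -- now rejected
SELECT '$ ? (@ > 01)'::jsonpath;   -- now rejected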
Author: Nikita Glukhov

The syntax
GENERATED BY DEFAULT AS (expr)
is not allowed but we have to accept it in the grammar to avoid
shift/reduce conflicts because of the similar syntax for identity
columns. The existing code just ignored this, incorrectly. Add an
explicit error check and a bespoke error message.
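For example (hypothetical table name):

CREATE TABLE gen_tab (a int,
  b int GENERATED BY DEFAULT AS (a + 1) STORED);
-- now raises an explicit error instead of being silently ignored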
Reported-by: Justin Pryzby <pryzby@telsasoft.com>

One should almost always terminate an old process, not use a manual
removal tool like ipcrm. Removal of the ipcclean script eleven years
ago (39627b1ae680cba44f6e56ca5facec4fdbfe9495) and its non-replacement
corroborate that manual shm removal is now a niche goal. Back-patch to
9.4 (all supported versions).
Reviewed by Daniel Gustafsson and Kyotaro HORIGUCHI.
Discussion: https://postgr.es/m/20180812064815.GB2301738@rfd.leadboat.com

This moves bitmap heap scan support to below an optional tableam
callback. It's optional as the whole concept of bitmap heapscans is
fairly block specific.
This basically moves the work previously done in bitgetpage() into the
new scan_bitmap_next_block callback, and the direct poking into the
buffer done in BitmapHeapNext() into the new scan_bitmap_next_tuple()
callback.
The abstraction is currently somewhat leaky because
nodeBitmapHeapscan.c's prefetching and visibilitymap based logic
remains - it's likely that we'll later have to move more into the
AM. But it's not trivial to do so without introducing a significant
amount of code duplication between the AMs, so that's a project for
later.
Note that now nodeBitmapHeapscan.c and the associated node types are a
bit misnamed. But it's not clear whether renaming wouldn't be a cure
worse than the disease. Either way, that'd be best done in a separate
commit.
Author: Andres Freund
Reviewed-By: Robert Haas (in an older version)
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

This moves sample scan support to below tableam. It's not optional as
there is, in contrast to e.g. bitmap heap scans, no alternative way to
perform tablesample queries. If an AM can't deal with the block based
API, it will have to throw an ERROR.
The tableam callbacks for this are block based, but given the current
TsmRoutine interface, that seems to be required.
The new interface doesn't require TsmRoutines to perform visibility
checks anymore - that requires the TsmRoutine to know details about
the AM, which we want to avoid. To continue to allow taking the
returned number of tuples into account, SampleScanState now has a donetuples
field (which previously e.g. existed in SystemRowsSamplerData), which
is only incremented after the visibility check succeeds.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

Commit 29b64d1d mishandled skipping over truncated high key attributes
during row comparisons. The row comparison key matching loop would loop
forever when a truncated attribute was encountered for a row compare
subkey. Fix by following the example of other code in the loop: advance
the current subkey, or break out of the loop when the last subkey is
reached.
Add test coverage for the relevant _bt_check_rowcompare() code path.
The new test case is somewhat tied to nbtree implementation details,
which isn't ideal, but seems unavoidable.

There was some debate about whether the code I'd added to remap
AppendRelInfos obtained from the initial SELECT planning run is
actually necessary. Add a test case demonstrating that it is.
Discussion: https://postgr.es/m/23831.1553873385@sss.pgh.pa.us

We can set this up once and for all in subquery_planner's initial survey
of the flattened rangetable, rather than incrementally adjusting it in
build_simple_rel. The previous approach made it rather hard to reason
about exactly when the value would be available, and we were definitely
using it in some places before the final value was computed.
Noted while fooling around with Amit Langote's patch to delay creation
of inheritance child rels. That didn't break this code, but it made it
even more fragile, IMO.

An anti-wraparound vacuum is by definition aggressive, as it needs
to work on all the pages of a relation. However, it can happen that,
due to some concurrent activity, an anti-wraparound vacuum is marked as
non-aggressive, which makes it redundant with a previous run and
actually useless, as an anti-wraparound vacuum should process all
the pages of a relation. This commit makes such vacuums be skipped.
A non-aggressive anti-wraparound vacuum can be produced easily by mixing
low values of autovacuum_freeze_max_age (to control anti-wraparound) and
autovacuum_freeze_table_age (to control the aggressiveness).
28a8fa9 added some extra logging printing all the possible
combinations of anti-wraparound and aggressive vacuums; this now gets
simplified, as a non-aggressive anti-wraparound vacuum gets skipped.
Per discussion mainly between Andrew Dunstan, Robert Haas, Álvaro
Herrera, Kyotaro Horiguchi, Masahiko Sawada, and myself.
Author: Kyotaro Horiguchi, Michael Paquier
Reviewed-by: Andrew Dunstan, Álvaro Herrera
Discussion: https://postgr.es/m/20180914153554.562muwr3uwujno75@alvherre.pgsql

This just moves the table/matview[/toast] determination of relation
size to a callback, and uses a copy of the existing logic to implement
that callback for heap.
It probably would make sense to also move the index specific logic
into a callback, so the metapage handling (and probably more) can be
index specific. But that's a separate task.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

This is a relatively straightforward move of the current
implementation to sit below tableam. As the current analyze sampling
implementation is pretty inherently block based, the tableam analyze
interface is as well. It might make sense to generalize that at some
point, but that seems like a larger project that shouldn't be
undertaken at the same time as the introduction of tableam.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

Previously, the planner created RangeTblEntry and RelOptInfo structs
for every partition of a partitioned table, even though many of them
might later be deemed uninteresting thanks to partition pruning logic.
This incurred significant overhead when there are many partitions.
Arrange to postpone creation of these data structures until after
we've processed the query enough to identify restriction quals for
the partitioned table, and then apply partition pruning before, not
after, creation of each partition's data structures. In this way
we need not open the partition relations at all for partitions that
the planner has no real interest in.
For queries that can be proven at plan time to access only a small
number of partitions, this patch improves the practical maximum
number of partitions from under 100 to perhaps a few thousand.
Amit Langote, reviewed at various times by Dilip Kumar, Jesper Pedersen,
Yoshikazu Imai, and David Rowley
Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp

Commit d85e0f366a tried to fix memory alignment issues in serialization
and deserialization of pg_mcv_list values, but it was a few bricks shy.
The arrays of uint16 indexes in serialized items were not aligned, and
both the values and isnull flags were using the same pointer.
Per investigation by Tom Lane on gaur.

While trying to plan a partitionwise join, we may be faced with cases
where one or both input partitions for a particular segment of the join
have been pruned away. In HEAD and v11, this is problematic because
earlier processing didn't bother to make a pruned RelOptInfo fully
valid. With an upcoming patch to make partition pruning more efficient,
this'll be even more problematic because said RelOptInfo won't exist at
all.
The existing code attempts to deal with this by retroactively making the
RelOptInfo fully valid, but that causes crashes under GEQO because join
planning is done in a short-lived memory context. In v11 we could
probably have fixed this by switching to the planner's main context
while fixing up the RelOptInfo, but that idea doesn't scale well to the
upcoming patch. It would be better not to mess with the base-relation
data structures during join planning, anyway --- that's just a recipe
for order-of-operations bugs.
In many cases, though, we don't actually need the child RelOptInfo,
because if the input is certainly empty then the join segment's result
is certainly empty, so we can skip making a join plan altogether. (The
existing code ultimately arrives at the same conclusion, but only after
doing a lot more work.) This approach works except when the pruned-away
partition is on the nullable side of a LEFT, ANTI, or FULL join, and the
other side isn't pruned. But in those cases the existing code leaves a
lot to be desired anyway --- the correct output is just the result of
the unpruned side of the join, but we were emitting a useless outer join
against a dummy Result. Pending somebody writing code to handle that
more nicely, let's just abandon the partitionwise-join optimization in
such cases.
When the modified code skips making a join plan, it doesn't make a
join RelOptInfo either; this requires some upper-level code to
cope with nulls in part_rels[] arrays. We would have had to have
that anyway after the upcoming patch.
Back-patch to v11 since the crash is demonstrable there.
Discussion: https://postgr.es/m/8305.1553884377@sss.pgh.pa.us

This is an SQL-standard feature that allows creating columns that are
computed from expressions rather than assigned, similar to a view or
materialized view but on a column basis.
This implements one kind of generated column: stored (computed on
write). Another kind, virtual (computed on read), is planned for the
future, and some room is left for it.
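A minimal example of the new syntax (the unit conversion is a
hypothetical illustration):

CREATE TABLE people (
    height_cm numeric,
    height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
);
INSERT INTO people (height_cm) VALUES (180);  -- height_in computed on write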
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b151f851-4019-bdb1-699e-ebab07d2f40a@2ndquadrant.com

Blind attempt at fixing ia64, hppa and sparc builds.
The serialized representation of MCV lists did not enforce proper memory
alignment for internal fields, resulting in deserialization issues on
platforms that are more sensitive to this (ia64, sparc and hppa).
This forces a catalog version bump, because the layout of serialized
pg_mcv_list changes.
Broken since 7300a699.

This commit reorders the paragraphs of the Notes section in order of
importance, and better clarifies the safe uses of pg_checksums for
replication setups.
Author: Fabien Coelho
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903231404280.18811@lancre

This makes VACUUM work more like EXPLAIN already does without changing
the meaning of any commands that already work. It is intended to
facilitate the addition of future VACUUM options that may take
non-Boolean parameters or that default to false.
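For example, the following are now equivalent (hypothetical table name;
booleans may be spelled out or omitted):

VACUUM (FREEZE, VERBOSE) vactbl;
VACUUM (FREEZE TRUE, VERBOSE TRUE) vactbl;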
Masahiko Sawada, reviewed by me.
Discussion: http://postgr.es/m/CA+TgmobpYrXr5sUaEe_T0boabV0DSm=utSOZzwCUNqfLEEm8Mw@mail.gmail.com
Discussion: http://postgr.es/m/CAD21AoBaFcKBAeL5_++j+Vzir2vBBcF4juW7qH8b3HsQY=Q6+w@mail.gmail.com

Especially, warn about the hazards of mishandling the backup_label
file. Adjust a couple of server messages to be more clear about
the hazards associated with removing backup_label files, too.
David Steele and Robert Haas, reviewed by Laurenz Albe, Martín
Marqués, Peter Eisentraut, and Magnus Hagander.
Discussion: http://postgr.es/m/7d85c387-000e-16f0-e00b-50bf83c22127@pgmasters.net

The previous code was adding pointers to transient variables to a
list, but by the time the list was read, the variable might be gone,
depending on the compiler. Fix it by making copies in the proper
memory context.

This adds the CONCURRENTLY option to the REINDEX command. A REINDEX
CONCURRENTLY on a specific index creates a new index (like CREATE
INDEX CONCURRENTLY), then renames the old index out of the way and the
new index into place, adjusts the dependencies, and then drops the old
index (like DROP INDEX CONCURRENTLY). The REINDEX command also has
the capability to run its other variants (TABLE, DATABASE) with the
CONCURRENTLY option (but not SYSTEM).
The reindexdb command gets the --concurrently option.
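For instance (hypothetical object names):

REINDEX INDEX CONCURRENTLY some_index;
REINDEX TABLE CONCURRENTLY some_table;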
Author: Michael Paquier, Andreas Karlsson, Peter Eisentraut
Reviewed-by: Andres Freund, Fujii Masao, Jim Nasby, Sergei Kornilov
Discussion: https://www.postgresql.org/message-id/flat/60052986-956b-4478-45ed-8bd119e9b9cf%402ndquadrant.com#74948a1044c56c5e817a5050f554ddee

This moves the responsibility for:
- creating the storage necessary for a relation, including creating a
new relfilenode for a relation with existing storage
- non-transactional truncation of a relation
- VACUUM FULL / CLUSTER's rewrite of a table
below tableam.
This is fairly straightforward, with a bit of complexity smattered in
to move the computation of xid / multixid horizons below the AM, as
they don't make sense for every table AM.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

There were multiple issues in deserialization of pg_mcv_list values.
Firstly, the data is loaded from syscache, but the deserialization was
performed after ReleaseSysCache(), at which point the data might have
already disappeared. Fixed by moving the calls in statext_mcv_load,
and using the same NULL-handling code as existing stats.
Secondly, the deserialized representation used pointers into the
serialized representation. But that is also unsafe, because the data
may disappear at any time. Fixed by reworking and simplifying the
deserialization code to always copy all the data.
And thirdly, when deserializing values for types passed by value, the
code simply did memcpy(d,s,typlen), which does not work on
big-endian machines. Fixed by using fetch_att/store_att_byval.

Provide GetTopFullTransactionId() and GetCurrentFullTransactionId().
The intended users of these interfaces are access methods that use
xids for visibility checks but don't want to have to go back and
"freeze" existing references some time later before the 32 bit xid
counter wraps around.
Use a new struct to serialize the transaction state for parallel
query, because FullTransactionId doesn't fit into the previous
serialization scheme very well.
Author: Thomas Munro
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com

Instead of inferring epoch progress from xids and checkpoints,
introduce a 64 bit FullTransactionId type and use it to track xid
generation. This fixes an unlikely bug where the epoch is reported
incorrectly if the range of active xids wraps around more than once
between checkpoints.
The only user-visible effect of this commit is to correct the epoch
used by txid_current() and txid_status(), also visible with
pg_controldata, in those rare circumstances. It also creates some
basic infrastructure so that later patches can use 64 bit
transaction IDs in more places.
The new type is a struct that we pass by value, as a form of strong
typedef. This prevents the sort of accidental confusion between
TransactionId and FullTransactionId that would be possible if we
were to use a plain old uint64.
Author: Thomas Munro
Reported-by: Amit Kapila
Reviewed-by: Andres Freund, Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com

To support building indexes over tables of different AMs, the scans to
do so need to be routed through the table AM. While moving a fair
amount of code, nearly all the changes are just moving code to below a
callback.
Currently the range-based interface wouldn't make much sense for
non-block-based table AMs, but that seems acceptable for now.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

Add infrastructure for having images in the documentation, in SVG
format. Add two images to start with. See the included README file
for instructions.
Author: Jürgen Purtz <juergen@purtz.de>
Author: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/aaa54502-05c0-4ea5-9af8-770411a6bf4b@purtz.de

The makefile rule for the (rarely used) plain-text output postgres.txt
was still written to use lynx, but in
96b8b8b6f9d8de4af01a77797273ad88c7a8e32e, where the INSTALL file was
switched to pandoc, the rest of the makefile support for lynx was
removed, so this was broken. Rewrite the rule to also use pandoc for
postgres.txt.

The MCV build should always call get_mincount_for_mcv_list(), as
there is no other logic to decide whether the MCV list represents all
the data. So just remove the (ngroups > nitems) condition.
Also, when building MCV lists, the number of items was limited by the
statistics target (i.e. up to 10000). But when deserializing the MCV
list, a different value (8192) was used to check the input, causing
an error. Simply ensure that the same value is used in both places.
This should have been included in 7300a69950, but I forgot to include it
in that commit.

Introduce a third extended statistic type, supported by the CREATE
STATISTICS command - MCV lists, a generalization of the statistic
already built and used for individual columns.
Compared to the already supported types (n-distinct coefficients and
functional dependencies), MCV lists are more complex, include column
values and allow estimation of much wider range of common clauses
(equality and inequality conditions, IS NULL, IS NOT NULL etc.).
Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.
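A minimal example (hypothetical table and column names):

CREATE TABLE mcv_tab (a int, b int);
CREATE STATISTICS mcv_stat (mcv) ON a, b FROM mcv_tab;
ANALYZE mcv_tab;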
Author: Tomas Vondra
Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com

In the dim past, the planner kept the fully-processed version of the query
targetlist (the result of preprocess_targetlist) in grouping_planner's
local variable "tlist", and only grudgingly passed it to individual other
routines as needed. Later we discovered a need to still have it available
after grouping_planner finishes, and invented the root->processed_tlist
field for that purpose, but it wasn't used internally to grouping_planner;
the tlist was still being passed around separately in the same places as
before.
Now comes a proposed patch to allow appendrel expansion to add entries
to the processed tlist, well after preprocess_targetlist has finished
its work. To avoid having to pass around the tlist explicitly, it's
proposed to allow appendrel expansion to modify root->processed_tlist.
That makes aliasing the tlist with assorted parameters and local
variables really scary. It would accidentally work as long as the
tlist is initially nonempty, because then the List header won't move
around, but it's not exactly hard to think of ways for that to break.
Aliased values are poor programming practice anyway.
Hence, get rid of local variables and parameters that can be identified
with root->processed_tlist, in favor of just using that field directly.
And adjust comments to match. (Some of the new comments speak as though
it's already possible for appendrel expansion to modify the tlist; that's
not true yet, but will happen in a later patch.)
Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp

The new function is only in charge of meta commands, not SQL commands.
This change makes the code a little clearer: now all the state changes
are effected by advanceConnectionState. It also removes one indent
level, which makes the diff look bulkier than it really is.
Author: Fabien Coelho
Reviewed-by: Kirk Jamison
Discussion: https://postgr.es/m/alpine.DEB.2.21.1811240904500.12627@lancre

Column references are not allowed in default expressions and partition
bound expressions, and are restricted as such once the transformation of
their expressions is done. However, trying to use more complex column
references can lead to confusing error messages. For example, trying to
use a two-field column reference name for default expressions and
partition bounds leads to "missing FROM-clause entry for table", which
makes no sense in their respective context.
In order to make the errors generated more useful, this commit adds more
verbose messages when transforming column references depending on the
context. This has one small consequence though: for example, an
expression using an aggregate with a column reference as argument now
causes an error to be generated for the column reference, whereas
before this commit the aggregate was the problem reported, because
column references get transformed first.
The confusion has existed for default expressions for a long time, while
the problem is new as of v12 for partition bounds. Still, per the lack
of complaints on the matter, no backpatch is done.
The patch has been written by Amit Langote and me, and Tom Lane has
provided improvements to the documentation for default expressions on
the CREATE TABLE page.
Author: Amit Langote, Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/20190326020853.GM2558@paquier.xyz

The transaction ID returned by GetNextXidAndEpoch() is in the future,
so we can't attempt to access its status or we might try to read a
CLOG page that doesn't exist. The > vs >= confusion probably stemmed
from the choice of a variable name containing the word "last" instead
of "next", so fix that too.
Back-patch to 10 where the function arrived.
Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BhUKG%2Buua_BV5cyfsioKVN2d61Lukg28ECsWTXKvh%3DBtN2DPA%40mail.gmail.com

Some code paths have been doing some allocations followed by an
immediate memset() to initialize the allocated area with zeros; this is
a bit overkill as there are already interfaces to do both things in one
call.
Author: Daniel Gustafsson
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/vN0OodBPkKs7g2Z1uyk3CUEmhdtspHgYCImhlmSxv1Xn6nY1ZnaaGHL8EWUIQ-NEv36tyc4G5-uA3UXUF2l4sFXtK_EQgLN1hcgunlFVKhA=@yesql.se

When invoked for the first time in a session, current_schema() and
current_schemas() can finish by creating a temporary schema. Currently
those functions are marked parallel-safe; however, if for one reason or
another they get launched across multiple parallel workers, they would
fail when attempting to create a temporary schema, as that is not
supported in parallel workers.
The original issue has been spotted by buildfarm members crake and
lapwing, after commit c5660e0 has introduced the first regression tests
based on current_schema() in the tree. After that, 396676b has
introduced a workaround to avoid parallel plans but that was not
completely right either.
Catversion is bumped.
Author: Michael Paquier
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/20190118024618.GF1883@paquier.xyz

Relations dropped in a single transaction are tracked in a list of
unowned relations. With a large number of dropped relations this resulted
in poor performance at the end of a transaction, when the relations are
removed from the singly linked list one by one.
Commit b4166911 attempted to address this issue (particularly when it
happens during recovery) by removing the relations in a reverse order,
resulting in O(1) lookups in the list of unowned relations. This did
not work reliably, though, and it was possible to trigger the O(N^2)
behavior in various ways.
Instead of trying to remove the relations in a specific order with
respect to the linked list, which seems rather fragile, switch to a
regular doubly linked list. That allows us to remove relations cheaply no
matter where in the list they are.
As b4166911 was a bugfix, backpatched to all supported versions, do the
same thing here.
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/80c27103-99e4-1d0c-642c-d9f3b94aaa0a%402ndquadrant.com
Backpatch-through: 9.4

Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to looks like. That was
easy enough before the introduction of tableam (we knew it had to be
heap, although some trickery around logging the heap relfilenodes
was required). But to properly handle table AMs we need
per-database catalog access to look up the AM handler, which
recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
decoding on standbys. When on a catalog table, we need to be able
to conflict with slots that have an xid horizon that's too old. But
computing the horizon by visiting the heap only works once
consistency is reached, while we always need to be able to detect
conflicts.
There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there's no connected backends).
Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.
To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers. As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency, when not globally enabled. That's
possibly not the perfect formula, but seems good enough for now.
Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.
Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

ALTER INDEX .. ATTACH PARTITION fails if the partitioned table where the
index is defined contains more dropped columns than its partition, with
this message:
ERROR: incorrect attribute map
The cause was that one caller of CompareIndexInfo was passing the number
of attributes of the partition rather than the parent, which confused
the length check. Repair.
This can cause pg_upgrade to fail when used on such a database. Leave
some more objects around after regression tests, so that the case is
detected by pg_upgrade test suite.
Remove some spurious empty lines noticed while looking for other cases
of the same problem.
Discussion: https://postgr.es/m/20190326213924.GA2322@alvherre.pgsql

Up to now, otherrel RelOptInfos were built at the same time as baserel
RelOptInfos, thanks to recursion in build_simple_rel(). However,
nothing in query_planner's preprocessing cares at all about otherrels,
only baserels, so we don't really need to build them until just before
we enter make_one_rel. This has two benefits:
* create_lateral_join_info did a lot of extra work to propagate
lateral-reference information from parents to the correct children.
But if we delay creation of the children till after that, it's
trivial (and much harder to break, too).
* Since we have all the restriction quals correctly assigned to
parent appendrels by this point, it'll be possible to do plan-time
pruning and never make child RelOptInfos at all for partitions that
can be pruned away. That's not done here, but will be later on.
Amit Langote, reviewed at various times by Dilip Kumar, Jesper Pedersen,
Yoshikazu Imai, and David Rowley
Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp

Commit caf626b2c missed that the relevant reloptions entry needs
to be moved from the intRelOpts[] array to realRelOpts[].
Somewhat surprisingly, it seems to work anyway, perhaps because
the desired default and limit values are all integers. We ought
to have either a simpler data structure or better cross-checking
here, but that's for another patch.
Nikolay Shaplov
Discussion: https://postgr.es/m/4861742.12LTaSB3sv@x200m

We've been creating duplicate RTEs for partitioned tables just
because we do so for regular inheritance parent tables. But unlike
regular-inheritance parents which are themselves regular tables
and thus need to be scanned, partitioned tables don't need the
extra RTE.
This makes the conditions for building a child RTE the same as those
for building an AppendRelInfo, allowing minor simplification in
expand_single_inheritance_child. Since the planner's actual processing
is driven off the AppendRelInfo list, nothing much changes beyond that,
we just have one fewer useless RTE entry.
Amit Langote, reviewed and hacked a bit by me
Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp

When used on a partition containing foreign keys coming from one of its
ancestors, \d would (rather unhelpfully) print the details about the
pg_constraint row in the partition. This becomes a bit frustrating when
the user tries things like dropping the FK in the partition; instead,
show the details for the foreign key on the table where it is defined.
Also, when a table is referenced by a foreign key on a partitioned
table, we would show multiple "Referenced by" lines, one for each
partition, which gets unwieldy pretty fast. Modify that so that it
shows only one line for the ancestor partitioned table where the FK is
defined.
Discussion: https://postgr.es/m/20181204143834.ym6euxxxi5aeqdpn@alvherre.pgsql
Reviewed-by: Tom Lane, Amit Langote, Peter Eisentraut

After 71bdc99d0d7, "tableam: Add helper for indexes to check if a
corresponding table tuples exist." there's no in-core user left. As
there's unlikely to be an external user, and such an external user
could easily be adjusted to use table_index_fetch_tuple_check(),
remove heap_hot_search().
Per complaint from Peter Geoghegan
Author: Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzn0Oq4ftJrTqRAsWy2WGjv0QrJcwoZ+yqWsF_Z5vjUBFw@mail.gmail.com

Since 7c079d7, partition bounds are able to use generalized expression
syntax when processed, treating "minvalue" and "maxvalue" as specific
cases as they get passed down for transformation as column references.
The checks for infinite bounds in range expressions have been lax
though, causing crashes when trying to use column reference names with
more than one field. Here is an example causing a crash:
CREATE TABLE list_parted (a int) PARTITION BY LIST (a);
CREATE TABLE part_list_crash PARTITION OF list_parted
FOR VALUES IN (somename.somename);
Note that the creation of the second relation should fail, as partition
bounds cannot have column references in their expressions; so, when
finding an expression which does not match the expected infinite bounds,
this commit lets the generic transformation machinery check it
afterwards. The error message generated in this case also references a
missing RTE, which is confusing. That problem, which has also affected
default expressions for some time, will be treated separately; for now
only the cases where a crash can happen are fixed.
While on it, extend the set of regression tests in place for list
partition bounds and add an extra set for range partition bounds.
Reported-by: Alexander Lakhin
Author: Michael Paquier
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/15668-0377b1981aa1a393@postgresql.org

This primarily is to allow WHERE CURRENT OF to continue to work as it
currently does. It's not clear to me that these semantics make sense
for every AM, but it works for the in-core heap, and the out of core
zheap. We can refine it further at a later point if necessary.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

This is, likely exclusively, useful to verify that conflicts detected
in a unique index are with live tuples, rather than dead ones.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

We were getting just DEFAULT_INEQ_SEL for comparisons such as
"ctid >= constant", but it's possible to do a lot better if we don't
mind some assumptions about the table's tuple density being reasonably
uniform. There are already assumptions much like that elsewhere in
the planner, so that hardly seems like much of an objection.
Extracted from a patch set that also proposes to introduce a special
executor node type for such queries. Not sure if that's going to make
it into v12, but improving the selectivity estimate is useful
independently of that.
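For example, a query of this shape now gets an estimate derived from
the table's physical size instead of DEFAULT_INEQ_SEL (hypothetical
names):

SELECT * FROM some_table WHERE ctid >= '(1000,1)';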
Edmund Horner, reviewed by David Rowley
Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com

If there's only one child relation, the Append or MergeAppend isn't
doing anything useful, and can be elided. It does have a purpose
during planning though, which is to serve as a buffer between parent
and child Var numbering. Therefore we keep it all the way through
to setrefs.c, and get rid of it only after fixing references in the
plan level(s) above it. This works largely the same as setrefs.c's
ancient hack to get rid of no-op SubqueryScan nodes, and can even
share some code with that.
Note the change to make setrefs.c use apply_tlist_labeling rather than
ad-hoc code. This has the effect of propagating the child's resjunk
and ressortgroupref labels, which formerly weren't propagated when
removing a SubqueryScan. Doing that is demonstrably necessary for
the [Merge]Append cases, and seems harmless for SubqueryScan, if only
because trivial_subqueryscan is afraid to collapse cases where the
resjunk marking differs. (I suspect that restriction could now be
removed, though it's unclear that it'd make any new matches possible,
since the outer query can't have references to a child resjunk column.)
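For instance (hypothetical partitioned table where pruning leaves a
single partition):

EXPLAIN (COSTS OFF) SELECT * FROM parted WHERE part_key = 42;
-- the plan now scans the one child directly, with no Append above it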
David Rowley, reviewed by Alvaro Herrera and Tomas Vondra
Discussion: https://postgr.es/m/CAKJS1f_7u8ATyJ1JGTMHFoKDvZdeF-iEBhs+sM_SXowOr9cArg@mail.gmail.com

Add additional heuristics to the algorithm for locating an optimal split
location. New logic identifies localized monotonically increasing
values in indexes with multiple columns. When this insertion pattern is
detected, page splits split just after the new item that provoked a page
split (or apply leaf fillfactor in the style of a rightmost page split).
This optimization is a variation of the long established leaf fillfactor
optimization used during rightmost page splits.
50/50 page splits are only appropriate with a pattern of truly random
insertions, where the average space utilization ends up at 65% - 70%.
Without this patch, affected cases have leaf pages that are no more than
about 50% full on average. Future insertions can never make use of the
free space left behind. With this patch, affected cases have leaf pages
that are about 90% full on average (assuming a fillfactor of 90).
Localized monotonically increasing insertion patterns are presumed to be
fairly common in real-world applications. There is a fair amount of
anecdotal evidence for this. Both pg_depend system catalog indexes
(pg_depend_depender_index and pg_depend_reference_index) are at least
20% smaller after the regression tests are run when the optimization is
available. Furthermore, many of the indexes created by a fair use
implementation of TPC-C for Postgres are consistently about 40% smaller
when the optimization is available.
Note that even pg_upgrade'd v3 indexes make use of this optimization.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com

The threadRun() function is very long and deeply nested. Extract the code to
print progress reports to a separate function, to make it slightly easier
to read.
Author: Fabien Coelho
Discussion: https://www.postgresql.org/message-id/alpine.DEB.2.21.1903101225270.17271%40lancre

Partial revert of commit 6260cc550b0e, "pgbench: add \cset and \gset
commands".
While \gset is widely considered a useful and necessary tool for user-
defined benchmarks, \cset does not have as much value, and its
implementation was considered "not to be up to project standards"
(though I, Álvaro, can't quite understand exactly how). Therefore,
remove \cset.
Author: Fabien Coelho
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903230716030.18811@lancre
Discussion: https://postgr.es/m/201901101900.mv7zduch6sad@alvherre.pgsql

This uses the same progress reporting infrastructure added in commit
c16dc1aca5e01e6acaadfcf38f5fc964a381dc62 and extends it to these
additional cases. We lack the ability to track the internal progress
of sorts and index builds so the information reported is
coarse-grained for some parts of the operation, but it still seems
like a significant improvement over having nothing at all.
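Progress could be watched from another session with something like this
(a sketch, assuming the new view is pg_stat_progress_cluster):

SELECT pid, command, phase, heap_tuples_scanned, heap_tuples_written
  FROM pg_stat_progress_cluster;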
Tatsuro Yamada, reviewed by Thomas Munro, Masahiko Sawada, Michael
Paquier, Jeff Janes, Alvaro Herrera, Rafia Sabih, and by me. A fair
amount of polishing also by me.
Discussion: http://postgr.es/m/59A77072.3090401@lab.ntt.co.jp

Non-backtracking flex parsers work faster than backtracking ones. So, this
commit gets rid of backtracking in jsonpath_scan.l. That required explicit
handling of some cases, as well as manual backtracking in others. More
regression tests for numerics are added.
Discussion: https://mail.google.com/mail/u/0?ik=a20b091faa&view=om&permmsgid=msg-f%3A1628425344167939063
Author: John Naylor, Nikita Glukhov, Alexander Korotkov

This commit includes formatting improvements, renamings and comments.
Also, it makes jsonpath_scan.l more uniform with our other lexers.
Firstly, state names are renamed to shorter alternatives. Secondly, the
<INITIAL> prefix is removed from the rules. The corresponding rules are
moved to the tail, so they would work only in the initial state anyway.
Author: Alexander Korotkov
Reviewed-by: John Naylor

Coverity complained that simple8b_encode() might read beyond the end of
the 'diffs' array, in the loop to encode the integers. That was a false
positive, because we never get into the loop in modes 0 or 1, and the
array is large enough for all the other modes. But I admit it's very
subtle, so it's not surprising that Coverity didn't see it, and it's not
very obvious to humans either. Refactor it, so that the second loop
re-computes the differences, instead of carrying them over from the first
loop in the 'diffs' array. This way, the 'diffs' array is not needed
anymore. It makes no measurable difference in performance, and seems more
straightforward this way.
Also, improve the comments in simple8b_encode(): fix the comment about its
return value that was flat-out wrong, and explain the condition when it
returns EMPTY_CODEWORD better.
In passing, move the 'selector' from the codeword's low bits to the
high bits. It doesn't matter much, but looking at the original paper, and
googling around for other Simple-8b implementations, that's how it's
usually done.
Per Coverity, and Tom Lane's report off-list.

This way the timestamps line up in a mix of "ok" and "FAILED" output.
Author: Christoph Berg <christoph.berg@credativ.de>
Discussion: https://www.postgresql.org/message-id/20190321115059.GF2687%40msg.df7cb.de

This is essentially the tableam version of heapam_fetch(),
i.e. fetching a tuple identified by a tid, performing visibility
checks.
Note that this is different from table_index_fetch_tuple(), which is for
index lookups. It therefore has to handle a tid pointing to an earlier
version of a tuple if the AM uses an optimization like heap's HOT. Add
comments to that end.
This commit removes the stats_relation argument from heap_fetch, as
it's been unused for a long time.
Author: Andres Freund
Reviewed-By: Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

Previously those directly performed a heap_insert(). Use
table_insert() instead. The input slot of those routines is not of
the target relation - we could fix that by copying if necessary, but
that'd not be beneficial for performance. As those codepaths don't
access any AM specific tuple fields (say xmin/xmax), there's no need
to use an AM specific slot.
Author: Andres Freund
Reviewed-By: Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de

Now that the ordering of DROP messages ought to be stable everywhere,
we should not need these kluges of hiding DETAIL output just to avoid
unstable ordering. Hiding it is not great for test coverage, so
let's undo that where possible.
In a small number of places, it's necessary to leave it in, for
example because the output might include a variable pg_temp_nnn
schema name. I also left things alone in places where the details
would depend on other regression test scripts, e.g. plpython_drop.sql.
Perhaps buildfarm experience will show this to be a bad idea,
but if so I'd like to know why.
Discussion: https://postgr.es/m/E1h6eep-0001Mw-Vd@gemulon.postgresql.org

Commit 8aa9dd74b didn't quite finish the job in this area after all,
because DROP ROLE has a code path distinct from DROP OWNED BY, and
it was still reporting dependent objects in whatever order the index
scan returned them in.
Buildfarm experience shows that index ordering of equal-keyed objects is
significantly less stable than before in the wake of using heap TIDs as
tie-breakers. So if we try to hide the unstable ordering by suppressing
DETAIL reports, we're just going to end up having to do that for every
DROP that reports multiple objects. That's not great from a coverage
or problem-detection standpoint, and it's something we'll inevitably
forget in future patches, leading to more iterations of fixing-an-
unstable-result. So let's just bite the bullet and sort here too.
Discussion: https://postgr.es/m/E1h6eep-0001Mw-Vd@gemulon.postgresql.org

It doesn't make sense to consider the possibility that there will only
be one candidate split point when choosing among split points to find
the split with the lowest penalty. This is a vestige of an earlier
version of the patch that became commit fab25024.
Issue spotted while rereviewing coverage of the nbtree patch series
using gcov.

The code would do "PQclear(res)" twice if lo_unlink failed, evidently
due to careless thinking about how far out a "break" would break.
Remove the extra PQclear and adjust the loop logic so that we'll fall
out of both levels of loop after an error, as was clearly the intent.
Spotted by Coverity. I have no idea why it took this long to notice,
since the bug has been there since commit 67ccbb080. Accordingly,
back-patch to all supported branches.

M contrib/vacuumlo/vacuumlo.c

Make current_logfiles use permissions assigned to files in data directory

Since its introduction in 19dc233c, current_logfiles has been assigned
the same permissions as a log file, which can be enforced with
log_file_mode. This setup can lead to incompatibility problems with
group access permissions as current_logfiles is not located in the log
directory, but at the root of the data folder. Hence, if group
permissions are used but log_file_mode is more restrictive, a backup
with a user in the group having read access could fail even if the log
directory is located outside of the data folder.
Per discussion with the folks mentioned below, we have concluded that
current_logfiles should not be treated as a log file as it only stores
metadata related to log files, and that it should use the same
permissions as all other files in the data directory. This solution has
the merit of being simple and fixes all the interaction problems between
group access and log_file_mode.
Author: Haribabu Kommi
Reviewed-by: Stephen Frost, Robert Haas, Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/CAJrrPGcEotF1P7AWoeQyD3Pqr-0xkQg_Herv98DjbaMj+naozw@mail.gmail.com
Backpatch-through: 11, where group access has been added.

Add command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN, which
start new transactions with the same transaction characteristics as the
just finished one, per SQL standard.
Support for transaction chaining in PL/pgSQL is also added. This
functionality is especially useful when running COMMIT in a loop in
PL/pgSQL.
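A minimal sketch of the new syntax (table name is arbitrary):
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    INSERT INTO t VALUES (1);
    COMMIT AND CHAIN;  -- the next transaction is also REPEATABLE READ
    INSERT INTO t VALUES (2);
    COMMIT;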
Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: https://www.postgresql.org/message-id/flat/28536681-324b-10dc-ade8-ab46f7645a5a@2ndquadrant.com

This adds new, required, table AM callbacks for insert/delete/update
and lock_tuple. To be able to reasonably use those, the EvalPlanQual
mechanism had to be adapted, moving more logic into the AM.
Previously both delete/update/lock call-sites and the EPQ mechanism had
to have awareness of the specific tuple format to be able to fetch the
latest version of a tuple. Obviously that needs to be abstracted
away. To do so, move the logic that finds the latest row version into
the AM. lock_tuple has a new flag argument,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last
version, rather than the current one. It'd have been possible to do
so via a separate callback as well, but finding the last version
usually also necessitates locking the newest version, making it
sensible to combine the two. This replaces the previous use of
EvalPlanQualFetch(). Additionally HeapTupleUpdated, which previously
signaled either a concurrent update or delete, is now split into two,
to avoid callers needing AM specific knowledge to differentiate.
The move of finding the latest row version into tuple_lock means that
encountering a row concurrently moved into another partition will now
raise an error about "tuple to be locked" rather than "tuple to be
updated/deleted" - which is accurate, as that always happens when
locking rows. While possibly slightly less helpful for users, it seems
like an acceptable trade-off.
As part of this commit HTSU_Result has been renamed to TM_Result, and
its members have been expanded to differentiate between updating and
deleting. HeapUpdateFailureData has been renamed to TM_FailureData.
The interface to speculative insertion is changed so nodeModifyTable.c
does not have to set the speculative token itself anymore. Instead
there's a version of tuple_insert, tuple_insert_speculative, that
performs the speculative insertion (without requiring a flag to signal
that fact), and the speculative insertion is either made permanent
with table_complete_speculative(succeeded = true) or aborted with
table_complete_speculative(succeeded = false).
Note that multi_insert is not yet routed through tableam, nor is
COPY. Changing multi_insert requires changes to copy.c that are large
enough to better be done separately.
Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED
VIEW are also only going to be adjusted in a later commit.
Author: Andres Freund and Haribabu Kommi
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql

In combination with the previous commit, this ensures that valid XML
data can always be dumped and reloaded, whether it is "document"
or "content".
Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com

Previously we were using the SQL:2003 definition, which doesn't allow
this, but that creates a serious dump/restore gotcha: there is no
setting of xmloption that will allow all valid XML data. Hence,
switch to the 2006 definition.
Since libxml doesn't accept <!DOCTYPE> directives in the mode we
use for CONTENT parsing, the implementation is to detect <!DOCTYPE>
in the input and switch to DOCUMENT parsing mode. This should not
cost much, because <!DOCTYPE> should be close to the front of the
input if it's there at all. It's possible that this causes the
error messages for malformed input to be slightly different than
they were before, if said input includes <!DOCTYPE>; but that does
not seem like a big problem.
In passing, buy back a few cycles in parsing of large XML documents
by not doing strlen() of the whole input in parse_xml_decl().
Back-patch because dump/restore failures are not nice. This change
shouldn't break any cases that worked before, so it seems safe to
back-patch.
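As a rough illustration of the new behavior (literal chosen arbitrarily):
    SET xmloption TO content;
    SELECT xml '<!DOCTYPE foo><foo/>';  -- now accepted; previously rejected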
Chapman Flack (revised a bit by me)
Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com

Suppress 3 lines of unstable DETAIL output from a DROP ROLE statement in
event_trigger.sql. This is further cleanup for commit dd299df8.
Note that the event_trigger test instability issue is very similar to
the recently suppressed foreign_data test instability issue. Both
issues involve DETAIL output for a DROP ROLE statement that needed to be
changed as part of dd299df8.
Per buildfarm member macaque.

Teach nbtree forward index scans to check the high key before moving to
the right sibling page in the hope of finding that it isn't actually
necessary to do so. The new check may indicate that the scan definitely
cannot find matching tuples to the right, ending the scan immediately.
We already opportunistically force a similar "continuescan orientated"
key check of the final non-pivot tuple when it's clear that it cannot be
returned to the scan due to being dead-to-all. The new high key check
is complementary.
The new approach for forward scans is more effective than checking the
final non-pivot tuple, especially with composite indexes and non-unique
indexes. The improvements to the logic for picking a split point added
by commit fab25024 make it likely that relatively dissimilar high keys
will appear on a page. A distinguishing key value that can only appear
on non-pivot tuples on the right sibling page will often be present in
leaf page high keys.
Since forcing the final item to be key checked no longer makes any
difference in the case of forward scans, the existing extra key check is
now only used for backwards scans. Backward scans continue to
opportunistically check the final non-pivot tuple, which is actually the
first non-pivot tuple on the page (not the last).
Note that even pg_upgrade'd v3 indexes make use of this optimization.
Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com

Previously there was basically no coverage for UPDATEs encountering
deleted rows, and no coverage for DELETE having to perform EPQ. That's
problematic for an upcoming commit in which EPQ is taught to integrate
with tableams. Also, there was no test for an UPDATE encountering a row
UPDATEd into another partition.
Author: Andres Freund

This is an option consistent with what pg_dump, pg_rewind and
pg_basebackup provide which is useful for leveraging the I/O effort when
testing things, not to be used in a production environment.
Author: Michael Paquier
Reviewed-by: Michael Banck, Fabien Coelho, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de

An offline cluster can now work with more modes in pg_checksums:
- --enable enables checksums in a cluster, updating all blocks with a
correct checksum, and updating the control file at the end.
- --disable disables checksums in a cluster, updating only the control
file.
- --check is an extra option able to verify checksums for a cluster, and
the default used if no mode is specified.
When running --enable or --disable, the data folder gets fsync'd for
durability, and then it is followed by a control file update and flush
to keep the operation consistent should the tool be interrupted, killed
or the host unplugged. If no mode is specified in the options, then
--check is used for compatibility with older versions of pg_checksums
(named pg_verify_checksums in v11 where it was introduced).
Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Magnus Hagander, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de

Postpone most of the effort of constructing PartitionedRelPruneInfos
until after we have found out whether run-time pruning is needed at all.
This costs very little duplicated effort (basically just an extra
find_base_rel() call per partition) and saves quite a bit when we
can't do run-time pruning.
Also, merge the first loop (for building relid_subpart_map) into
the second loop, since we don't need the map to be valid during
that loop.
Amit Langote
Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp

This is almost a straight revert of commit fff518d, which itself was a
revert of 7d3bf73ac.
It turns out that commit 8aa9dd74, which sorted dependent objects before
deletion in DROP OWNED BY, was not sufficient to make all remaining
unstable DETAIL output stable. Unstable DETAIL output from DROP ROLE
was not affected, because that happens to use a different code path. It
doesn't seem worthwhile to fix the other code path at this time.
Discussion: https://postgr.es/m/6226.1553274783@sss.pgh.pa.us

I (tgl) remain dubious that it's a good idea for PartitionDirectory
to hold a pin on a relcache entry throughout planning, rather than
copying the data or using some kind of refcount scheme. However, it's
certainly the responsibility of the PartitionDirectory code to ensure
that what it's handing back is a stable data structure, not that of
its caller. So this is a pretty clear oversight in commit 898e5e329,
and one that can cost a lot of performance when there are many
partitions.
Amit Langote (extracted from a much larger patch set)
Discussion: https://postgr.es/m/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com
Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp

There were more large constants that needed UINT64CONST. And one variable
was declared as "int", when it needed to be uint64. These bugs were only
visible on 32-bit systems; clearly I should've tested on one, given that
this code does a lot of work with 64-bit integers.
Also, in the test "huge distances" test, the code created some values with
random distances between them, but the test logic didn't take into account
the possibility that the random distance was exactly 1. That never actually
happens with the seed we're using, but let's be tidy.

To do this, we scan the GiST index twice. In the first pass we make note
of empty leaf pages and internal pages. In the second pass we scan through
the internal pages, looking for downlinks to the empty pages.
Deleting internal pages is still not supported: as in nbtree, the last
child of an internal page is never deleted. That means that if you have a
workload where new keys are always inserted into a different area than the
one where old keys are removed, the index will still grow without bound.
But the rate of growth will be an order of magnitude slower than before.
Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru

The set is implemented as a B-tree, with a compact representation of the
leaf items using the Simple-8b algorithm, so that clusters of nearby values
use less memory.
The IntegerSet isn't used for anything yet, aside from the test code, but
we have two patches in the works that would benefit from this: A patch to
allow GiST vacuum to delete empty pages, and a patch to reduce heap
VACUUM's memory usage, by storing the list of dead TIDs more efficiently
and lifting the 1 GB limit on its size.
This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you run
it manually, it can also print memory usage and execution time of some of
the tests.
Author: Heikki Linnakangas, Andrey Borodin
Reviewed-by: Julien Rouhaud
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb729f@iki.fi

M src/backend/lib/Makefile
M src/backend/lib/README
A src/backend/lib/integerset.c
A src/include/lib/integerset.h
M src/test/modules/Makefile
A src/test/modules/test_integerset/.gitignore
A src/test/modules/test_integerset/Makefile
A src/test/modules/test_integerset/README
A src/test/modules/test_integerset/expected/test_integerset.out
A src/test/modules/test_integerset/sql/test_integerset.sql
A src/test/modules/test_integerset/test_integerset--1.0.sql
A src/test/modules/test_integerset/test_integerset.c
A src/test/modules/test_integerset/test_integerset.control

This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.
This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.
The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).
This patch makes changes in three areas:
- CREATE COLLATION DDL changes and system catalog changes to support
this new flag.
- Many executor nodes and auxiliary code are extended to track
collations. Previously, this code would just throw away collation
information, because the eventually-called user-defined functions
didn't use it since they only cared about equality, which didn't
need collation information.
- String data type functions that do equality comparisons and hashing
are changed to take the (non-)deterministic flag into account. For
comparison, this just means skipping various shortcuts and tie
breakers that use byte-wise comparison. For hashing, we first need
to convert the input string to a canonical "sort key" using the ICU
analogue of strxfrm().
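A sketch of the intended usage, borrowing a typical ICU locale string:
    CREATE COLLATION case_insensitive (
        provider = icu,
        locale = 'und-u-ks-level2',
        deterministic = false
    );
    SELECT 'ABC' = 'abc' COLLATE case_insensitive;  -- true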
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com

Trying to call the function with the top-most parent of a partition tree
was leading to a crash. In this case the correct result is to return
the top-most parent itself.
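Assuming the function in question is pg_partition_root() (its name is not
spelled out above), a sketch of the fixed behavior with a hypothetical
partitioned table:
    SELECT pg_partition_root('top_parent');
    -- now returns 'top_parent' itself rather than crashing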
Reported-by: Álvaro Herrera
Author: Michael Paquier
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/20190322032612.GA323@alvherre.pgsql

When DefineIndex recurses to create constraints on partitions, it needs
to use the value returned by index_constraint_create to set up partition
dependencies. However, in the course of fixing the DEPENDENCY_INTERNAL_AUTO
mess, commit 1d92a0c9f7dd introduced some code to that function that
clobbered the return value, causing the recorded OID to be of the wrong
object. Close examination of pg_depend after creating the tables leads
to indescribable objects :-( My sin (in commit bdc3d7fa2376, while
preparing for DDL deparsing in event triggers) was to use a variable
name for the return value that's typically used for throwaway objects in
dependency-setting calls ("referenced"). Fix by changing the variable
names to match extended practice (the return value is "myself" rather
than "referenced".)
The pg_upgrade test notices the problem (in an indirect way: the pg_dump
outputs are in different order), but only if you create the objects in a
specific way that wasn't being used in the existing tests. Add a stanza
to leave some objects around that shows the bug.
Catversion bump because preexisting databases might have bogus pg_depend
entries.
Discussion: https://postgr.es/m/20190318204235.GA30360@alvherre.pgsql

These commands allow the argument type list to be omitted if there is
just one object that matches by name. However, if that syntax was
used with DROP IF EXISTS and there was more than one match, you got
a "function ... does not exist, skipping" notice message rather than a
truthful complaint about the ambiguity. This was basically due to
poor factorization and a rats-nest of logic, so refactor the relevant
lookup code to make it cleaner.
Note that this amounts to narrowing the scope of which sorts of
error conditions IF EXISTS will bypass. Per discussion, we only
intend it to skip no-such-object cases, not multiple-possible-matches
cases.
Per bug #15572 from Ash Marath. Although this definitely seems like
a bug, it's not clear that people would thank us for changing the
behavior in minor releases, so no back-patch.
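For example (hypothetical functions):
    CREATE FUNCTION ambig(int) RETURNS int LANGUAGE sql AS 'SELECT $1';
    CREATE FUNCTION ambig(text) RETURNS text LANGUAGE sql AS 'SELECT $1';
    DROP FUNCTION IF EXISTS ambig;
    -- now raises an ambiguity error instead of
    -- "function ... does not exist, skipping"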
David Rowley, reviewed by Julien Rouhaud and Pavel Stehule
Discussion: https://postgr.es/m/15572-ed1b9ed09503de8a@postgresql.org

LDAP servers can be advertised on a network with RFC 2782 DNS SRV
records. The OpenLDAP command-line tools automatically try to find
servers that way, if no server name is provided by the user. Teach
PostgreSQL to do the same using OpenLDAP's support functions, when
building with OpenLDAP.
For now, we assume that HAVE_LDAP_INITIALIZE (an OpenLDAP extension
available since OpenLDAP 2.0 and also present in Apple LDAP) implies
that you also have ldap_domain2hostlist() (which arrived in the same
OpenLDAP version and is also present in Apple LDAP).
Author: Thomas Munro
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/CAEepm=2hAnSfhdsd6vXsM6VZVN0br-FbAZ-O+Swk18S5HkCP=A@mail.gmail.com

This finishes a task we left undone in commit f1ad067fc, by extending
the delete-in-descending-OID-order rule to deletions triggered by
DROP OWNED BY. We've coped with machine-dependent deletion orders
one time too many, and the new issues caused by Peter G's recent
nbtree hacking seem like the last straw.
Discussion: https://postgr.es/m/E1h6eep-0001Mw-Vd@gemulon.postgresql.org

Unstable sort order related to changes to nbtree from commit dd299df8
can cause two lines of DETAIL output to be in opposite-of-expected
order. Suppress the output using the same VERBOSITY hack that is used
elsewhere in the foreign_data tests.
Note that the same foreign_data.out DETAIL output was mechanically
updated by commit dd299df8. Only a few such changes were required,
though.
Per buildfarm member batfish.
Discussion: https://postgr.es/m/CAH2-WzkCQ_MtKeOpzozj7QhhgP1unXsK8o9DMAFvDqQFEPpkYQ@mail.gmail.com

I unnecessarily removed this check in 3de241dba86f because I
misunderstood what the final representation of constraints across a
partitioning hierarchy was to be. Put it back (in both branches).
Discussion: https://postgr.es/m/201901222145.t6wws6t6vrcu@alvherre.pgsql

Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be re-found using a new search for each,
that starts from the root page. If a tuple cannot be re-found, report
that the index is corrupt.
The new "rootdescend" verification option is exhaustive, and can
therefore make a call to bt_index_parent_check() take a lot longer.
Re-finding tuples during verification is mostly intended as an option
for backend developers, since the corruption scenarios that it alone is
uniquely capable of detecting seem fairly far-fetched.
For example, "rootdescend" verification is much more likely to detect
corruption of the least significant byte of a key from a pivot tuple in
the root page of a B-Tree that already has at least three levels.
Typically, only a few tuples on a cousin leaf page are at risk of
"getting overlooked" by index scans in this scenario. The corrupt key
in the root page is only slightly corrupt: corrupt enough to give wrong
answers to some queries, and yet not corrupt enough to allow the problem
to be detected without verifying agreement between the leaf page and the
root page, skipping at least one internal page level. The existing
bt_index_parent_check() checks never cross more than a single level.
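A sketch of invoking the new option (index name is hypothetical):
    SELECT bt_index_parent_check('some_btree_idx'::regclass,
                                 heapallindexed => true,
                                 rootdescend => true);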
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-Wz=yTWnVu+HeHGKb2AGiADL9eprn-cKYAto4MkKOuiGtRQ@mail.gmail.com

Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are. This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.
The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point. When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty. Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes. This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.
The number of cycles added is not very noticeable. This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held. We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper. We
use a faster binary comparison instead.
Note that even pg_upgrade'd v3 indexes make use of these optimizations.
Benchmarking has shown that even v3 indexes benefit, despite the fact
that suffix truncation will only truncate non-key attributes in INCLUDE
indexes. Grouping relatively similar tuples together is beneficial in
and of itself, since it reduces the number of leaf pages that must be
accessed by subsequent index scans.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique. This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".
Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added. This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes. This can increase fan-out,
especially in a multi-column index. Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now. A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.
Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved). Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3. contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes. These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.
A later patch will enhance the logic used by nbtree to pick a split
point. This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at. Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.
The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits. The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised. However, there should be a compatibility note in the v12
release notes.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com

Use dedicated struct to represent nbtree insertion scan keys. Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions. This is based on a suggestion by Andrey
Lepikhov.
Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner. This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.
The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc(). This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column. Unique key binary
searches need to restore lower and upper bounds. They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).
Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com

The jsonpath grammar and scanner are both quite small, so it isn't worth
the complexity of compiling them separately. This commit makes the grammar
and scanner be compiled at once. Therefore, jsonpath_gram.h and
jsonpath_scanner.h are no longer needed. This commit also does some
reorganization of the code in jsonpath_gram.y.
Discussion: https://postgr.es/m/d47b2023-3ecb-5f04-d253-d557547cf74f%402ndQuadrant.com

There are 2-argument and 4-argument versions of jsonb_path_match() and
jsonb_path_exists(). But the 4-argument versions have optional 3rd and 4th
arguments, which leads to ambiguity. At the same time, the 2-argument
versions are needed only for the @@ and @? operators. So, rename the
2-argument versions to remove the ambiguity.
Catversion is bumped.

Originally, if libpq got a failure (e.g., ECONNRESET) while trying to
send data to the server, it would just report that and wash its hands
of the matter. It was soon found that that wasn't a very pleasant way
of coping with server-initiated disconnections, so we introduced a hack
(pqHandleSendFailure) in the code that sends queries to make it peek
ahead for server error reports before reporting the send failure.
It now emerges that related cases can occur during connection setup;
in particular, as of TLS 1.3 it's unsafe to assume that SSL connection
failures will be reported by SSL_connect rather than during our first
send attempt. We could have fixed that in a hacky way by applying
pqHandleSendFailure after a startup packet send failure, but
(a) pqHandleSendFailure explicitly disclaims suitability for use in any
state except query startup, and (b) the problem still potentially exists
for other send attempts in libpq.
Instead, let's fix this in a more general fashion by eliminating
pqHandleSendFailure altogether, and instead arranging to postpone
all reports of send failures in libpq until after we've made an
attempt to read and process server messages. The send failure won't
be reported at all if we find a server message or detect input EOF.
(Note: this removes one of the reasons why libpq typically overwrites,
rather than appending to, conn->errorMessage: pqHandleSendFailure needed
that behavior so that the send failure report would be replaced if we
got a server message or read failure report. Eventually I'd like to get
rid of that overwrite behavior altogether, but today is not that day.
For the moment, pqSendSome is assuming that its callees will overwrite,
not append to, conn->errorMessage.)
Possibly this change should get back-patched someday; but it needs
testing first, so let's not consider that till after v12 beta.
Discussion: https://postgr.es/m/CAEepm=2n6Nv+5tFfe8YnkUm1fXgvxR0Mm1FoD+QKG-vLNGLyKg@mail.gmail.com

Commit 6f6a6d8b1 introduced a delay of up to 2 seconds if we're trying
to request a checkpoint but the checkpointer hasn't started yet (or,
much less likely, our kill() call fails). However buildfarm experience
shows that that's not quite enough for slow or heavily-loaded machines.
There's no good reason to assume that the checkpointer won't start
eventually, so we may as well make the timeout much longer, say 60 sec.
However, if the caller didn't say CHECKPOINT_WAIT, it seems like a bad
idea to be waiting at all, much less for as long as 60 sec. We can
remove the need for that, and make this whole thing more robust, by
adjusting the code so that the existence of a pending checkpoint
request is clear from the contents of shared memory, and making sure
that the checkpointer process will notice it at startup even if it did
not get a signal. In this way there's no need for a non-CHECKPOINT_WAIT
call to wait at all; if it can't send the signal, it can nonetheless
assume that the checkpointer will eventually service the request.
A potential downside of this change is that "kill -INT" on the checkpointer
process is no longer enough to trigger a checkpoint, should anyone be
relying on something so hacky. But there's no obvious reason to do it
like that rather than issuing a plain old CHECKPOINT command, so we'll
assume that nobody is. There doesn't seem to be a way to preserve this
undocumented quasi-feature without introducing race conditions.
Since a principal reason for messing with this is to prevent intermittent
buildfarm failures, back-patch to all supported branches.
Discussion: https://postgr.es/m/27830.1552752475@sss.pgh.pa.us

Typedef names should be both unique and non-intersecting with variable
names across all the sources. That makes both pgindent and debuggers happy.
Discussion: https://postgr.es/m/23865.1552936099%40sss.pgh.pa.us

Running ALTER TABLE on any table will check if a TOAST table needs to be
added. On shared tables, this would previously fail, thus effectively
disabling ALTER TABLE for those tables. On (non-shared) system
catalogs, on the other hand, it would add a TOAST table, even though we
don't really want TOAST tables on some system catalogs. In some cases,
it would also fail with an error "AccessExclusiveLock required to add
toast table.", depending on what locks the ALTER TABLE actions had
already taken.
So instead, just ignore attempts to add TOAST tables to such tables,
outside of bootstrap mode, pretending they don't need one.
This allows running ALTER TABLE on such tables without messing up the
TOAST situation. Legitimate uses for ALTER TABLE on system catalogs
include setting reloptions (say, fillfactor or autovacuum settings).
(All this still requires allow_system_table_mods, which is independent
of this.)
Discussion: https://www.postgresql.org/message-id/flat/e49f825b-fb25-0bc8-8afc-d5ad895c7975@2ndquadrant.com

Unrecognized attribute names are supposed to be ignored. But the code
would error out on an unrecognized attribute value even if it did not
recognize the attribute name. So unrecognized attributes wouldn't
really be ignored unless the value happened to be one that matched a
recognized value. This would break some important cases where the
attribute would be processed by ucol_open() directly. Fix that and
add a test case.
The restructured code should also avoid compiler warnings about
initializing a UColAttribute value to -1, because the type might be an
unsigned enum. (reported by Andres Freund)

Commit 6776142a07afb4c28961f27059d800196902f5f1 failed to do this,
and the buildfarm broke.
Patch by me, per advice from Tom Lane and Michael Paquier.
Discussion: http://postgr.es/m/13988.1552960403@sss.pgh.pa.us

Aggregates have acquired a dozen or so optional attributes in recent
years for things like parallel query and moving-aggregate mode; the
lack of an OR REPLACE option to add or change these for an existing
agg makes extension upgrades gratuitously hard. Rectify.
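A sketch of the new syntax (the aggregate definition is illustrative):
    CREATE OR REPLACE AGGREGATE my_sum(int4) (
        sfunc = int4pl,
        stype = int4,
        parallel = safe  -- an attribute added without dropping the aggregate
    );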

Commit f2dec34e1 changed things so that printtup's output stringinfo
buffer was allocated outside the per-row temporary context, not inside
it. This creates a need to free that buffer explicitly when the temp
context is freed, but that was overlooked. In most cases, this is all
happening inside a portal or executor context that will go away shortly
anyhow, but that's not always true. Notably, the stringinfo ends up
getting leaked when JDBC uses row-at-a-time fetches. For a query
that returns wide rows, that adds up after awhile.
Per bug #15700 from Matthias Otterbach. Back-patch to v11 where the
faulty code was added.
Discussion: https://postgr.es/m/15700-8c408321a87d56bb@postgresql.org

We should try to prewarm each database only once. Otherwise, if
prewarming fails for some reason, it will just keep retrying in an
infinite loop. This can happen if, for example, the database has been
dropped. The existing code was intended to implement the try-once
behavior, but failed to do so because it neglected to set
worker.bgw_restart_time to BGW_NEVER_RESTART.
Mithun Cy, per a report from Hans Buschmann
Discussion: http://postgr.es/m/CA+hUKGKpQJCWcgyy3QTC9vdn6uKAR_8r__A-MMm2GYfj45caag@mail.gmail.com

Like commit f41551f61f9cf4eedd5b7173f985a3bdb4d9858c, this aims
to make it easier to add non-Boolean options to VACUUM (or, in
this case, to ANALYZE). Instead of building up a bitmap of
options directly in the parser, build up a list of DefElem
objects and let ExecVacuum() sort it out; right now, we make
no use of the fact that a DefElem can carry an associated value,
but it will be easy to make that change in the future.
Masahiko Sawada
Discussion: http://postgr.es/m/CAD21AoATE4sn0jFFH3NcfUZXkU2BMbjBWB_kDj-XWYA-LXDcQA@mail.gmail.com

Many places need both, so this allows a few functions to take one
fewer parameter. More importantly, as soon as we add a VACUUM
option that takes a non-Boolean parameter, we need to replace
'int options' with a struct, and it seems better to think
of adding more fields to VacuumParams rather than passing around
both VacuumParams and a separate struct as well.
Patch by me, reviewed by Masahiko Sawada
Discussion: http://postgr.es/m/CA+Tgmob6g6-s50fyv8E8he7APfwCYYJ4z0wbZC2yZeSz=26CYQ@mail.gmail.com

In RI_FKey_pk_upd_check_required(), we check among other things
whether the old and new key are equal, so that we don't need to run
cascade actions when nothing has actually changed. This was using the
equality operator. But the effect of this is that if a value in the
primary key is changed to one that "looks" different but compares as
equal, the update is not propagated. (Examples are float -0 and 0 and
case-insensitive text.) This appears to violate the SQL standard, and
it also behaves inconsistently if, in a multicolumn key, another column
is also updated in a way that causes the row to compare as not equal.
To fix, if we are looking at the PK table in ri_KeysEqual(), then do a
bytewise comparison similar to record_image_eq() instead of using the
equality operators. This only makes a difference for ON UPDATE
CASCADE, but for consistency we treat all changes to the PK the same. For
the FK table, we continue to use the equality operators.
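A sketch of the case being fixed (hypothetical tables, ON UPDATE CASCADE):
    CREATE TABLE pk (f float8 PRIMARY KEY);
    CREATE TABLE fk (f float8 REFERENCES pk ON UPDATE CASCADE);
    INSERT INTO pk VALUES ('0');
    INSERT INTO fk VALUES ('0');
    UPDATE pk SET f = '-0';  -- '-0' equals '0' per the operator, but is a
                             -- different datum; the change now cascades to fk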
Discussion: https://www.postgresql.org/message-id/flat/3326fc2e-bc02-d4c5-e3e5-e54da466e89a@2ndquadrant.com

ce6afc6 has begun the refactoring work by plugging pg_rewind into a
central routine to update the control file, and left around two extra
copies, with one in xlog.c for the backend and one in pg_resetwal.c. By
adding an extra option to the central routine in controldata_utils.c to
control whether the control file needs to be flushed, it proves to be
straightforward to make xlog.c and pg_resetwal.c use the central code
path, at the cost of moving the wait event tracking there. Hence, this
allows having only one central code path to update the control file,
removing the duplicated code.
This refactoring actually fixes a problem in pg_resetwal. Previously,
the control file was first removed before being recreated. So if a
crash happened between the moment the file was removed and the moment
the file was created, then it would have been possible to not have a
control file anymore in the database folder.
Author: Fabien Coelho
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903170935210.2506@lancre

This fixes an issue introduced by 266b6ac, which has added filters to
exclude file patterns on the target and source data directories to
reduce the number of files transferred. Filters get applied to both
the target and source data files, and include pg_internal.init, which is
present in each database directory once relations are created in it.
However, if the target differed from the source by at least one new
database containing relations, the rewind would fail due to the exclusion
filters applied to the target files: pg_internal.init would still be
present in the target database folder, whereas its contents should have
been completely removed so that nothing remains inside at the time of the
folder deletion.
Applying exclusion filters on the source files is fine, because this way
the amount of data copied from the source to the target is reduced. And
actually, not applying the filters on the target is what pg_rewind
should do, because this causes such files to be automatically removed
during the rewind on the target. Exclusion filters apply to paths which
are removed or recreated automatically at startup, so removing all those
files on the target during the rewind is a win.
The existing set of TAP tests already stresses the rewind of databases,
but it did not include any tables on those newly-created databases.
Creating extra tables in this case is enough to reproduce the failure,
so the existing tests are extended to close the gap.
Reported-by: Mithun Cy
Author: Michael Paquier
Discussion: https://postgr.es/m/CADq3xVYt6_pO7ZzmjOqPgY9HWsL=kLd-_tNyMtdfjKqEALDyTA@mail.gmail.com
Backpatch-through: 11

pg_checksums is compiled with a given block size and has a hard
dependency on it per the way checksums are calculated via
checksum_impl.h. Trying to use the tool on a data folder which does not
have the same block size would result in incorrect checksum calculations
and/or block read errors, implying that the data folder is corrupted.
This is harmless as checksums are only checked for now, but very
confusing for the user, so issue a proper error if the block size used at
compilation and the block size used in the data folder do not match.
Reported-by: Sergei Kornilov
Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Magnus Hagander
Discussion: https://postgr.es/m/20190317054657.GA3357@paquier.xyz
Backpatch-through: 11

Starting in ICU 54, collation customization attributes can be
specified in the locale string, for example
"@colStrength=primary;colCaseLevel=yes". Add support for this for
older ICU versions as well, by adding some minimal parsing of the
attributes in the locale string and calling ucol_setAttribute() on
them. This is essentially what newer ICU versions do internally in
ucol_open(). This way we can offer this functionality in a consistent
way in all ICU versions supported by PostgreSQL.
Also add some tests for ICU collation customization.
Reported-by: Daniel Verite <daniel@manitou-mail.org>
Discussion: https://www.postgresql.org/message-id/0270ebd4-f67c-8774-1a5a-91adfb9bb41f@2ndquadrant.com

It looks like we can leave in most of the test cases for Infinity/NaN
inputs, but buildfarm member jacana gets the wrong answer for acosh(Inf).
It's not worth carrying a variant expected file for that, so just disable
that one test.
Discussion: https://postgr.es/m/E1h3nUY-0000sM-Vf@gemulon.postgresql.org

The SQL:2016 standard, among other things, contains a set of SQL/JSON
features for JSON processing inside of a relational database. The core of
SQL/JSON is the JSON path language, which allows access to parts of JSON
documents and computations over them. This commit implements partial
support for the JSON path language as a separate datatype called
"jsonpath". The implementation is partial because it's lacking datetime
support and suppression of numeric errors. Missing features will be added
later by separate commits.
Support of SQL/JSON features requires implementation of separate nodes,
and it will be considered in subsequent patches. This commit includes the
following set of plain functions, allowing execution of jsonpath over
jsonb values:
* jsonb_path_exists(jsonb, jsonpath[, jsonb, bool]),
* jsonb_path_match(jsonb, jsonpath[, jsonb, bool]),
* jsonb_path_query(jsonb, jsonpath[, jsonb, bool]),
* jsonb_path_query_array(jsonb, jsonpath[, jsonb, bool]),
* jsonb_path_query_first(jsonb, jsonpath[, jsonb, bool]).
This commit also implements "jsonb @? jsonpath" and "jsonb @@ jsonpath", which
are wrappers over jsonpath_exists(jsonb, jsonpath) and jsonpath_predicate(jsonb,
jsonpath) correspondingly. These operators will have an index support
(implemented in subsequent patches).
Catversion bumped, to add new functions and operators.
Code was written by Nikita Glukhov and Teodor Sigaev, revised by me.
Documentation was written by Oleg Bartunov and Liudmila Mantrova. The work
was inspired by Oleg Bartunov.
Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com
Author: Nikita Glukhov, Teodor Sigaev, Alexander Korotkov, Oleg Bartunov, Liudmila Mantrova
Reviewed-by: Tomas Vondra, Andrew Dunstan, Pavel Stehule, Alexander Korotkov

When libpq is loaded in the server (for instance, by
libpqwalreceiver), it may use libpq environment variables set in the
postmaster environment for connection parameter defaults. This has
some confusing effects in our test suites. For example, the TAP test
infrastructure sets PGAPPNAME to allow identifying clients in the
server log. But this environment variable is also inherited by
temporary servers started with pg_ctl and is then in turn used by
libpqwalreceiver as the application_name for connecting to remote
servers where it then shows up in pg_stat_replication and is relevant
for things like synchronous_standby_names. Replication already has a
suitable default for application_name, and overriding that
accidentally then requires the individual test cases to re-override
that, which is all very confusing and unnecessary.
To fix, unset PGAPPNAME temporarily before running pg_ctl start or
restart in the tests.
More comprehensive approaches like unsetting all environment variables
in pg_ctl were considered but might be too complicated to achieve
portably.
The now unnecessary re-overriding of application_name by test cases is
also removed.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/flat/33383613-690e-6f1b-d5ba-4957ff40f6ce@2ndquadrant.com

Some buildfarm members using CLOBBER_CACHE_ALWAYS have been having OOM
problems of late. Commit 2455ab488 addressed this problem by recovering
space transiently used within RelationBuildPartitionDesc, but it turns
out that leaves quite a lot on the table, because other subroutines of
RelationBuildDesc also leak memory like mad. Let's move the temp-context
management into RelationBuildDesc so that leakage from the other
subroutines is also recovered.
I examined this issue by arranging for postgres.c to dump the size of
MessageContext just before resetting it in each command cycle, and
then running the update.sql regression test (which is one of the two
that are seeing buildfarm OOMs) with and without CLOBBER_CACHE_ALWAYS.
Before 2455ab488, the peak space usage with CCA was as much as 250MB.
That patch got it down to ~80MB, but with this patch it's about 0.5MB,
and indeed the space usage now seems nearly indistinguishable from a
non-CCA build.
RelationBuildDesc's traditional behavior of not worrying about leaking
transient data is of many years' standing, so I'm pretty hesitant to
change that without more evidence that it'd be useful in a normal build.
(So far as I can see, non-CCA memory consumption is about the same with
or without this change, which if anything suggests that it isn't useful.)
Hence, configure the patch so that we recover space only when
CLOBBER_CACHE_ALWAYS or CLOBBER_CACHE_RECURSIVELY is defined. However,
that choice can be overridden at compile time, in case somebody would
like to do some performance testing and try to develop evidence for
changing that decision.
It's possible that we ought to back-patch this change, but in the
absence of back-branch OOM problems in the buildfarm, I'm not in
a hurry to do that.
Discussion: https://postgr.es/m/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com

The trigger tests for PL/Tcl were spread around pltcl_setup.sql and
pltcl_queries.sql, mixed with other tests, which makes them hard to
follow and edit. Move all the trigger-related pieces to a new file
pltcl_trigger.sql. This also makes the test setup more similar to
plperl and plpython.

Add a separate walreceiver API function walrcv_server_version() to get
the version of the remote server, instead of doing it as part of
walrcv_identify_system(). This allows the server version to be
available even for uses that don't call IDENTIFY_SYSTEM, and it seems
cleaner anyway.
This is for an upcoming patch, not currently used.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/20190115071359.GF1433@paquier.xyz

Previously, the SERIALIZABLE isolation level prevented parallel query
from being used. Allow the two features to be used together by
sharing the leader's SERIALIZABLEXACT with parallel workers.
An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to
share, and new logic is introduced to coordinate the early release
of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE
optimization, as follows:
The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by
some other transaction) will 'partially release' the SERIALIZABLEXACT,
meaning that the conflicts and locks it holds are released, but the
SERIALIZABLEXACT itself will remain active because other backends
might still have a pointer to it.
Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears
its own MySerializableXact variable and frees local resources so that
it can skip SSI checks for the rest of the transaction. In the
special case of the leader process, it transfers the SERIALIZABLEXACT
to a new variable SavedSerializableXact, so that it can be completely
released at the end of the transaction after all workers have exited.
Remove the serializable_okay flag added to CreateParallelContext() by
commit 9da0cc35, because it's now redundant.
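As an illustrative consequence (query and table are arbitrary):
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    EXPLAIN (COSTS OFF) SELECT count(*) FROM big_tbl;  -- may now use a parallel plan
    COMMIT;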
Author: Thomas Munro
Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner
Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com

If a heap on the old cluster has 4 pages or fewer, and the old cluster
was PG v11 or earlier, don't copy or link the FSM. This will shrink
space usage for installations with large numbers of small tables.
This will allow pg_upgrade to take advantage of commit b0eaa4c51b where
we have avoided creation of the free space map for small heap relations.
Author: John Naylor
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CACPNZCu4cOdm3uGnNEGXivy7Gz8UWyQjynDpdkPGabQ18_zK6g%40mail.gmail.com

The previous test order had the effect that if something was wrong
with the identity functionality, the create_table_like test would
likely fail or crash first, which is confusing. Reorder so that the
identity test comes before create_table_like.

The idea was to generate all the junk in a destroyable subcontext rather
than leaking it in the caller's context, but partition_bounds_create was
still being called in the caller's context, allowing plenty of scope for
leakage. Also, get_rel_relkind() was still being called in the rel's
rd_pdcxt, creating a risk of session-lifespan memory wastage.
Simplify the logic a bit while at it. Also, reduce rd_pdcxt to
ALLOCSET_SMALL_SIZES, since it seems likely to not usually be big.
Probably something like this needs to be back-patched into v11,
but for now let's get some buildfarm testing on this.
Discussion: https://postgr.es/m/15943.1552601288@sss.pgh.pa.us

The assertions added by commits 34ea1ab7f et al found another problem:
set_dummy_rel_pathlist and mark_dummy_rel were failing to label
the dummy paths they create with the correct outer_relids, in case
the relation is necessarily parameterized due to having lateral
references in its tlist. It's likely that this has no user-visible
consequences in production builds, at the moment; but still an assertion
failure is a bad thing, so back-patch the fix.
Per bug #15694 from Roman Zharkov (via Alexander Lakhin)
and an independent report by Tushar Ahuja.
Discussion: https://postgr.es/m/15694-74f2ca97e7044f7f@postgresql.org
Discussion: https://postgr.es/m/7d72ab20-c725-3ce2-f99d-4e64dd8a0de6@enterprisedb.com

In normal builds, this isn't very important, because the leaks go
into fairly short-lived contexts, but under CLOBBER_CACHE_ALWAYS,
this can result in leaking hundreds of megabytes into MessageContext,
which probably explains recent failures on hyrax.
This may or may not be the best long-term strategy for dealing
with this leak, but we can change it later if we come up with
something better. For now, do this to make the buildfarm green
again (hopefully). Commit 898e5e3290a72d288923260143930fb32036c00c
seems to have exacerbated this problem for reasons that are not
quite clear, but I don't believe it's actually the cause.
Discussion: http://postgr.es/m/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com

Variables used after a longjmp() need to be declared volatile. In
case of a pointer, it's the pointer itself that needs to be declared
volatile, not the pointed-to value. So we need
PyObject *volatile items;
instead of
volatile PyObject *items; /* wrong */
Discussion: https://www.postgresql.org/message-id/flat/f747368d-9e1a-c46a-ac76-3c27da32e8e4%402ndquadrant.com

This fixes an oversight from 5c99513. This has no actual consequence as
PG_TEMP_FILE_PREFIX and PG_TEMP_FILES_DIR have the same value so when
bumping on a temporary path, the directory scan was still moving on to
the next entry instead of skipping the rest of the scan, but let's keep
the logic correct.
Author: Michael Banck
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20190314.115417.58230569.horiguchi.kyotaro@lab.ntt.co.jp
Backpatch-through: 11

Commit a6417078c missed updating some comments in transam.h about
reservation of high OIDs for development purposes. Also tamp down
an over-optimistic comment there about how easy it'd be to change
FirstNormalObjectId.
Earlier, commit 09568ec3d failed to update bki.sgml for the split
between genbki.pl-assigned OIDs and those assigned during initdb.
Also fix genbki.pl so that it will complain if it overruns
that split. It's possible that doing so would have no very bad
consequences, but that's no excuse for not detecting it.

A couple of queries are run on the primary to create and fill in a test
table, which gets checked on the standby afterwards. However the test
was not waiting for the confirmation that the necessary records have
been replayed on the standby, leading to spurious failures.
Per buildfarm member loach. Thanks to Thomas Munro for the report and
Tom Lane for the failure analysis.
Discussion: https://postgr.es/m/CA+hUKGLUpqG52xtriUz5RpmeKPoEfNxNc-CginG+Cx+X2-Ycew@mail.gmail.com

Preliminary results from the buildfarm suggest that no platform gets
commit c6f153dcf's test cases wrong by more than one or two units in
the last place, so setting extra_float_digits = 0 should be plenty
to hide the cross-platform variations.
Also, add tests for Infinity/NaN inputs. I think it highly likely
that we'll end up removing these again, rather than adding code to
make ancient platforms conform. But it seems useful to find out
just how many platforms have such issues before we make a decision.
Discussion: https://postgr.es/m/E1h3nUY-0000sM-Vf@gemulon.postgresql.org

The initial commit tried to test them on trivial cases such as 0,
reasoning that we shouldn't hit any portability issues that way.
The buildfarm immediately proved that hope ill-founded, and anyway
it's not a great testing scheme because it doesn't prove that we're
even calling the right library function for each SQL function.
Instead, let's test them at inputs such as 1 (or something within
the valid range, as needed), so that each function should produce
a different output.
As committed, this is just about certain to show portability
failures, because it's very unlikely that every platform computes
these functions the same as mine down to the last bit. However,
I want to put it through a buildfarm cycle this way, so that
we can see how big the variations are. The plan is to add
"set extra_float_digits = -1", or whatever we need in order to
hide the variations; but first we need data.
Discussion: https://postgr.es/m/E1h3nUY-0000sM-Vf@gemulon.postgresql.org

Previously we used a polling/sleeping loop to wait for checkpoints
to begin and end, which leads to up to a couple hundred milliseconds
of needless thumb-twiddling. Use condition variables instead.
Author: Thomas Munro
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CA%2BhUKGLY7sDe%2Bbg1K%3DbnEzOofGoo4bJHYh9%2BcDCXJepb6DQmLw%40mail.gmail.com

The buildfarm doesn't like this, because some buildfarm members have
log_statement = 'all'. We could change the log level of the messages
instead, but Tom doesn't like that. So let's do this instead, at
least for now.
Patch by Sergei Kornilov, applied here in reverse.
Discussion: http://postgr.es/m/2123251552490241@myt6-fe24916a5562.qloud-c.yandex.net

When creating a name for a foreign key constraint when none is
specified, use all column names instead of only the first one, similar
to how it is already done for index names.
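For example (hypothetical tables):
    CREATE TABLE pk_tbl (a int, b int, PRIMARY KEY (a, b));
    CREATE TABLE fk_tbl (a int, b int,
        FOREIGN KEY (a, b) REFERENCES pk_tbl (a, b));
    -- the generated constraint name is now fk_tbl_a_b_fkey, not fk_tbl_a_fkey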
Author: Paul Martinez <hellopfm@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/CAF+2_SFjky6XRfLNRXpkG97W6PRbOO_mjAxqXzAAimU=c7w7_A@mail.gmail.com

If existing CHECK or NOT NULL constraints preclude the presence
of nulls, we need not look to see whether any are present.
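For illustration, assuming this applies to the verification scan of
ALTER TABLE ... SET NOT NULL (names hypothetical):
    ALTER TABLE t ADD CONSTRAINT t_a_check CHECK (a IS NOT NULL);
    ALTER TABLE t ALTER COLUMN a SET NOT NULL;  -- table scan can be skipped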
Sergei Kornilov, reviewed by Stephen Frost, Ildar Musin, David Rowley,
and by me.
Discussion: http://postgr.es/m/81911511895540@web58j.yandex.ru

c186ba13 fixed an issue related to the updates of the minimum recovery
LSN across multiple processes on standbys, but we never had a test case
able to check its logic reliably.
This commit introduces a new test case to close the gap. The test checks
the consistency of data against the minimum recovery point set by either
the startup process or the checkpointer, for both an offline cluster
(by looking at the on-disk page headers) and an online cluster (using
pageinspect).
Note that with c186ba13 reverted, this test fails badly for both the
online and offline cases, as designed.
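For illustration, the online check might look roughly like this with
pageinspect (table name hypothetical):
    CREATE EXTENSION IF NOT EXISTS pageinspect;
    -- page LSNs on the standby must not exceed the minimum recovery point
    SELECT lsn FROM page_header(get_raw_page('test_tbl', 0));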
Author: Michael Paquier, Andrew Gierth
Reviewed-by: Andrew Gierth, Georgios Kokolatos, Arthur Zakirov
Discussion: https://postgr.es/m/20181108044525.GA17482@paquier.xyz

The current tool name is too restrictive and focuses only on verifying
checksums. As more options to control checksums for an offline cluster
are planned to be added, switch to a more generic name. Documentation
as well as all past references to the tool are updated.
Author: Michael Paquier
Reviewed-by: Michael Banck, Fabien Coelho, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de

pg_verify_checksums performs a read of the control file, and the data it
fetches should be from a data folder compatible with the major version
of Postgres the binary has been compiled with, but we never actually
checked that compatibility.
Reported-by: Sergei Kornilov
Author: Michael Paquier
Reviewed-by: Sergei Kornilov
Discussion: https://postgr.es/m/155231347133.16480.11453587097036807558.pgcf@coridan.postgresql.org
Backpatch-through: 11

Commit 40dae7ec537, which made the nbtree page split algorithm more
robust, made _bt_insert_parent() only unlock the right child of the
parent page before inserting a new downlink into the parent. Update a
comment from the Berkeley days claiming that both left and right child
pages are unlocked before the new downlink actually gets inserted.
The claim that it is okay to release both locks early based on Lehman
and Yao's say-so never made much sense. Lehman and Yao must sometimes
"couple" buffer locks across a pair of internal pages when relocating a
downlink, unlike the corresponding code within _bt_getstackbuf().

The SQL:2016 standard adds support for the hyperbolic functions
sinh(), cosh(), and tanh(). POSIX has long required libm to
provide those functions as well as their inverses asinh(),
acosh(), atanh(). Hence, let's just expose the libm functions
to the SQL level. As with the trig functions, we only implement
versions for float8, not numeric.
For the moment, we'll assume that all platforms actually do have
these functions; if experience teaches otherwise, some autoconf
effort may be needed.
SQL:2016 also adds support for base-10 logarithm, but with the
function name log10(), whereas the name we've long used is log().
Add aliases named log10() for the float8 and numeric versions.
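For illustration:
    SELECT sinh(1), cosh(1), tanh(1);        -- float8 wrappers around libm
    SELECT asinh(1), acosh(1), atanh(0.5);   -- the inverse functions
    SELECT log10(1000::float8), log10(1000::numeric);  -- both aliases return 3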
Lætitia Avrot
Discussion: https://postgr.es/m/CAB_COdguG22LO=rnxDQ2DW1uzv8aQoUzyDQNJjrR4k00XSgm5w@mail.gmail.com

In the v11-era commits that taught genbki.pl to resolve symbolic
OID references in the initial catalog data, we didn't bother to
make every last reference symbolic; some of the catalogs have so
few initial rows that it didn't seem worthwhile.
However, the new project policy that OIDs assigned by new patches
should be automatically renumberable changes this calculus.
A patch that wants to add a row in one of these catalogs would have
a problem when the OID it assigns gets renumbered. Hence, do the
mop-up work needed to make all OID references in initial data be
symbolic, and establish an associated project policy that we'll
never again write a hard-wired OID reference there.
No catversion bump since the contents of postgres.bki aren't
actually changed by this commit.
Discussion: https://postgr.es/m/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-=eaochT0dd2BN9CQ@mail.gmail.com

This commit adds a Perl script renumber_oids.pl, which can reassign a
range of manually-assigned OIDs to someplace else by modifying OID
fields of the catalog *.dat files and OID-assigning macros in the
catalog *.h files.
Up to now, we've encouraged new patches that need manually-assigned
OIDs to use OIDs just above the range of existing OIDs. Predictably,
this leads to patches stepping on each other's toes, as whichever
one gets committed first creates an OID conflict that other patch
author(s) have to resolve manually. With the availability of
renumber_oids.pl, we can eliminate a lot of this hassle.
The new project policy, therefore, is:
* Encourage new patches to use high OIDs (the documentation suggests
choosing a block of OIDs at random in 8000..9999).
* After feature freeze in each development cycle, run renumber_oids.pl
to move all such OIDs down to lower numbers, thus freeing the high OID
range for the next development cycle.
This plan should greatly reduce the risk of OID collisions between
concurrently-developed patches. Also, if such a collision happens
anyway, we have the option to resolve it without much effort by doing
an off-schedule OID renumbering to get the first-committed patch out
of the way. Or a patch author could use renumber_oids.pl to change
their patch's assignments without much pain.
This approach does put a premium on not hard-wiring any OID values
in places where renumber_oids.pl and genbki.pl can't fix them.
Project practice in that respect seems to be pretty good already,
but a follow-on patch will sand down some rough edges.
John Naylor and Tom Lane, per an idea of Peter Geoghegan's
Discussion: https://postgr.es/m/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-=eaochT0dd2BN9CQ@mail.gmail.com

In commit 960df2a971 ("Correctly assess parallel-safety of tlists when
SRFs are used."), the testing of scan/join target was done incorrectly,
which caused a plan-quality problem. Backpatch through to v11 where
the aforementioned commit went in, since this is a regression from v10.
Author: Etsuro Fujita
Reviewed-by: Robert Haas and Tom Lane
Discussion: https://postgr.es/m/5C75303E.8020303@lab.ntt.co.jp

In commit b0eaa4c51bb, we left out a test that used a vacuum to remove dead
rows, because the behavior of the test was not predictable. The test has been
rewritten to use fillfactor instead to control free space. Since we no
longer need to remove dead rows as part of the test, put the fsm regression
test in a parallel group.
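For illustration, the fillfactor approach (table name hypothetical):
    CREATE TABLE fsm_check (a int) WITH (fillfactor = 50);
    INSERT INTO fsm_check SELECT generate_series(1, 1000);
    -- each heap page predictably retains about half its space free,
    -- with no need to delete rows and run VACUUM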
Author: John Naylor
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1L=qWp_bJ5aTc9+fy4Ewx2LPaLWY-RbR4a60g_rupCKnQ@mail.gmail.com

This adds a new routine to src/common/ which is compatible with both the
frontend and backend code, able to update the control file's contents.
This is currently used only by pg_rewind, but some upcoming patches
which add more control over checksums for offline instances will make
use of it. This could also get used more by the backend, as xlog.c has
its own flavor of the same logic with some wait events and an additional
flush phase before closing the opened file descriptor, but that is left
as separate work.
Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de

Historically guc.c has just refused examples like set work_mem = '30.1GB',
but it seems more useful for it to take that and round off the value to
some reasonable approximation of what the user said. Just rounding to
the parameter's native unit would work, but it would lead to rather
silly-looking settings, such as 31562138kB for this example. Instead
let's round to the nearest multiple of the next smaller unit (if any),
producing 30822MB.
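For illustration:
    SET work_mem = '30.1GB';
    SHOW work_mem;  -- 30822MB: rounded to the next smaller unit, not 31562138kB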
Also, do the units conversion math in floating point and round to integer
(if needed) only at the end. This produces saner results for inputs that
aren't exact multiples of the parameter's native unit, and removes another
difference in the behavior for integer vs. float parameters.
In passing, document the ability to use hex or octal input where it
ought to be documented.
Discussion: https://postgr.es/m/1798.1552165479@sss.pgh.pa.us

COALESCE, GREATEST and LEAST all look like functions taking variable
numbers of arguments, but in fact they are not functions, and so
VARIADIC array arguments don't work with them. Add a note to the docs
explaining this fact.
The consensus is not to try to make this work, but just to document the
limitation.
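For illustration:
    SELECT concat(VARIADIC ARRAY['a','b','c']);  -- OK: concat() is a true variadic function
    SELECT coalesce(VARIADIC ARRAY[1, 2]);       -- fails: COALESCE is a parser construct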
Discussion: https://postgr.es/m/CAFj8pRCaAtuXuRtvXf5GmPbAVriUQrNMo7-=TXUFN025S31R_w@mail.gmail.com

Further buildfarm testing shows that on the machines that are failing
ac75959cd's test case, what we're actually getting from strtod("-infinity")
is a syntax error (endptr == value) not ERANGE at all. This test case
is not worth carrying two sets of expected output for, so just remove it,
and revert commit b212245f9's misguided attempt to work around the platform
dependency.
Discussion: https://postgr.es/m/E1h33xk-0001Og-Gs@gemulon.postgresql.org

Previously ParallelTableScanDescData was just a member in BTShared,
but after c2fe139c2 that doesn't guarantee sufficient alignment as
specific AMs might (and indeed are likely to) need atomic variables in the
struct.
One might think that MAXALIGNing would be sufficient, but as a
comment in shm_toc_allocate() explains, that's not enough. For now,
copy the hack described there.
For parallel sequential scans no such change is needed, as its
allocations go through shm_toc_allocate().
An alternative approach would have been to allocate the parallel scan
descriptor in a separate TOC entry, but there seems little benefit in
doing so.
Per buildfarm member dromedary.
Author: Andres Freund
Discussion: https://postgr.es/m/20190311203126.ty5gbfz42gjbm6i6@alap3.anarazel.de

To allow table accesses not to depend directly on heap, several new
abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on;
it's still widely used to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends needs
to be set up without requiring per-AM work from callers. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block-oriented,
block-oriented callbacks that can be shared between such AMs are
provided and used by heap: table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state that persists across individual heap
accesses, such as buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap, IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
As with table_scan_getnextslot(), it's undesirable to continue
returning HeapTuples. Thus index_fetch_heap (we might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AM's
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
would also have needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
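For illustration, the SQL-visible endpoint that this groundwork (together
with the surrounding pluggable-storage commits) enables; the access
method name is hypothetical:
    CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
    CREATE TABLE t (a int) USING heap2;
    SELECT * FROM t;  -- runs through the scan_begin/scan_getnextslot callbacks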
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql

pgbench's arbitrary limit of 10 arguments for SQL statements or
metacommands is far too low. Increase it to 256.
This results in a very modest increase in memory usage, not enough to
worry about.
The maximum includes the SQL statement or metacommand. This is reflected
in the comments and revised TAP tests.
Simon Riggs and Dagfinn Ilmari Mannsåker with some light editing by me.
Reviewed by: David Rowley and Fabien Coelho
Discussion: https://postgr.es/m/CANP8+jJiMJOAf-dLoHuR-8GENiK+eHTY=Omw38Qx7j2g0NDTXA@mail.gmail.com

... as well as its implementation from backend/access/hash/hashfunc.c to
backend/utils/hash/hashfn.c.
access/hash is the place for the hash index AM, not really appropriate
for generic facilities, which is what hash_any is; having things the old
way meant that anything using hash_any had to include the AM's include
file, pointlessly polluting its namespace with unrelated, unnecessary
cruft.
Also move the HTEqual strategy number to access/stratnum.h from
access/hash.h.
To avoid breaking third-party extension code, add an #include
"utils/hashutils.h" to access/hash.h. (An easily removed line by
committers who enjoy their asbestos suits to protect them from angry
extension authors.)
Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql

Instead, just proceed with the infinity or zero result that it should
return for overflow/underflow. This avoids a platform dependency,
in that various versions of strtod are inconsistent about whether they
signal ERANGE for a value that's specified as infinity.
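For illustration:
    SELECT 'infinity'::float8;   -- Infinity, whether or not strtod flags ERANGE
    SELECT '-infinity'::float8;  -- -Infinity, likewise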
It's possible this won't be enough to remove the buildfarm failures
we're seeing from ac75959cd, in which case I'll take out the infinity
test case that commit added. But first let's see if we can fix it.
Discussion: https://postgr.es/m/E1h33xk-0001Og-Gs@gemulon.postgresql.org

Fix a potential memory access violation in ecpg when the filename of an include file is shorter than 2 characters.

93473c6 removed openLogOff, changing along the way the error message
used to report partial writes to WAL segments. The newly-introduced
error message used the offset up to which the write had happened, while
always reporting the same total length to write. This changes the error
message so that the number of bytes left to write is reported.
Reported-by: Michael Paquier
Author: Robert Haas
Discussion: https://postgr.es/m/20190306235251.GA17293@paquier.xyz

This reverts commit bd09503e633b8077822bb4daf91625b71ac16253.
Per discussion, it seems like what we should do instead is to
reduce the default value of autovacuum_vacuum_cost_delay by the
same factor. That's functionally equivalent as long as the
platform can accurately service the smaller delay request, which
should be true on anything released in the last 10 years or more.
And smaller, more-closely-spaced delays are better in terms of
providing a steady I/O load.
Discussion: https://postgr.es/m/28720.1552101086@sss.pgh.pa.us

This change makes it possible to specify sub-millisecond delays,
which work well on most modern platforms, though that was not true
when the cost-delay feature was designed.
To support this without breaking existing configuration entries,
improve guc.c to allow floating-point GUCs to have units. Also,
allow "us" (microseconds) as an input/outpu