The logic in PQconnectPoll() did not take care to ensure that all of
a PGconn's internal state variables were reset before trying a new
connection attempt. If we got far enough in the connection sequence
to have changed any of these variables, and then decided to try a new
server address or server name, the new connection might be completed
with some state that really only applied to the failed connection.
While this has assorted bad consequences, the only one that is clearly
a security issue is that password_needed didn't get reset, so that
if the first server asked for a password and the second didn't,
PQconnectionUsedPassword() would return an incorrect result. This
could be leveraged by unprivileged users of dblink or postgres_fdw
to allow them to use server-side login credentials that they should
not be able to use.
Other notable problems include the possibility of forcing a v2-protocol
connection to a server capable of supporting v3, or overriding
"sslmode=prefer" to cause a non-encrypted connection to a server that
would have accepted an encrypted one. Those are certainly bugs but
it's harder to paint them as security problems in themselves. However,
forcing a v2-protocol connection could result in libpq having a wrong
idea of the server's standard_conforming_strings setting, which opens
the door to SQL-injection attacks. The extent to which that's actually
a problem, given the prerequisite that the attacker needs control of
the client's connection parameters, is unclear.
These problems have existed for a long time, but became more easily
exploitable in v10, both because it introduced easy ways to force libpq
to abandon a connection attempt at a late stage and then try another one
(rather than just giving up), and because it provided an easy way to
specify multiple target hosts.
Fix by rearranging PQconnectPoll's state machine to provide centralized
places to reset state properly when moving to a new target host or when
dropping and retrying a connection to the same host.
Tom Lane, reviewed by Noah Misch. Our thanks to Andrew Krasichkov
for finding and reporting the problem.
Security: CVE-2018-10915

When expanding an updatable view that is an INSERT's target, the rewriter
failed to rewrite Vars in the ON CONFLICT UPDATE clause. This accidentally
worked if the view was just "SELECT * FROM ...", as the transformation
would be a no-op in that case. With more complicated view targetlists,
this omission would often lead to "attribute ... has the wrong type" errors
or even crashes, as reported by Mario De Frutos Dieguez.
Fix by adding code to rewriteTargetView to fix up the data structure
correctly. The easiest way to update the exclRelTlist list is to rebuild
it from scratch looking at the new target relation, so factor the code
for that out of transformOnConflictClause to make it sharable.
In passing, avoid duplicate permissions checks against the EXCLUDED
pseudo-relation, and prevent useless view expansion of that relation's
dummy RTE. The latter is only known to happen (after this patch) in cases
where the query would fail later due to not having any INSTEAD OF triggers
for the view. But by exactly that token, it would create an unintended
and very poorly tested state of the query data structure, so it seems like
a good idea to prevent it from happening at all.
This has been broken since ON CONFLICT was introduced, so back-patch
to 9.5.
Dean Rasheed, based on an earlier patch by Amit Langote;
comment-kibitzing and back-patching by me
Discussion: https://postgr.es/m/CAFYwGJ0xfzy8jaK80hVN2eUWr6huce0RU8AgU04MGD00igqkTg@mail.gmail.com

6cb3372 enforces errno to ENOSPC when less bytes than what is expected
have been written when it is unset, though it forgot to properly reset
errno before doing a system call to write(), causing errno to
potentially come from a previous system call.
Reported-by: Tom Lane
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us

It's necessary to make sure that owning tables have a relcache
invalidation prior to advancing the command counter to make
newly-entered catalog tuples for the index visible. inval.c must be
able to maintain the consistency of the local caches in the event of
transaction abort. There is usually only a problem when CREATE INDEX
transactions abort, since there is a generic invalidation once we reach
index_update_stats().
This bug is of long standing. Problems were made much more likely by
the addition of parallel CREATE INDEX (commit 9da0cc35284), but it is
strongly suspected that similar problems can be triggered without
involving plan_create_index_workers(). (plan_create_index_workers()
triggers a relcache build or rebuild, which previously only happened in
rare edge cases.)
Author: Peter Geoghegan
Reported-By: Luca Ferrari
Diagnosed-By: Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAKoxK+5fVodiCtMsXKV_1YAKXbzwSfp7DgDqUmcUAzeAhf=HEQ@mail.gmail.com
Backpatch: 9.3-

The buffer usage stats is accounted only for the execution phase of the
node. For Gather and Gather Merge nodes, such stats are accumulated at
the time of shutdown of workers which is done after execution of node due
to which we missed to account them for such nodes. Fix it by treating
nodes as running while we shut down them.
We can also miss accounting for a Limit node when Gather or Gather Merge
is beneath it, because it can finish the execution before shutting down
such nodes. So we allow a Limit node to shut down the resources before it
completes the execution.
In the passing fix the gather node code to allow workers to shut down as
soon as we find that all the tuples from the workers have been retrieved.
The original code use to do that, but is accidently removed by commit
01edb5c7fc.
Reported-by: Adrien Nayrat
Author: Amit Kapila and Robert Haas
Reviewed-by: Robert Haas and Andres Freund
Backpatch-through: 9.6 where this code was introduced
Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info

In the leader backend, we don't track the buffer usage for ExecutorStart
phase whereas in worker backend we track it for ExecutorStart phase as
well. This leads to different value for buffer usage stats for the
parallel and non-parallel query. Change the code so that worker backend
also starts tracking buffer usage after ExecutorStart.
Author: Amit Kapila and Robert Haas
Reviewed-by: Robert Haas and Andres Freund
Backpatch-through: 9.6 where this code was introduced
Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info

Commits 742869946 et al turn out to be a couple bricks shy of a load.
We were dumping the stored values of GUC_LIST_QUOTE variables as they
appear in proconfig or setconfig catalog columns. However, although that
quoting rule looks a lot like SQL-identifier double quotes, there are two
critical differences: empty strings ("") are legal, and depending on which
variable you're considering, values longer than NAMEDATALEN might be valid
too. So the current technique fails altogether on empty-string list
entries (as reported by Steven Winfield in bug #15248) and it also risks
truncating file pathnames during dump/reload of GUC values that are lists
of pathnames.
To fix, split the stored value without any downcasing or truncation,
and then emit each element as a SQL string literal.
This is a tad annoying, because we now have three copies of the
comma-separated-string splitting logic in varlena.c as well as a fourth
one in dumputils.c. (Not to mention the randomly-different-from-those
splitting logic in libpq...) I looked at unifying these, but it would
be rather a mess unless we're willing to tweak the API definitions of
SplitIdentifierString, SplitDirectoriesString, or both. That might be
worth doing in future; but it seems pretty unsafe for a back-patched
bug fix, so for now accept the duplication.
Back-patch to all supported branches, as the previous fix was.
Discussion: https://postgr.es/m/7585.1529435872@sss.pgh.pa.us

pg_dump knew about printing ALTER TABLE ... REPLICA IDENTITY USING INDEX
for indexes declared as indexes, but it failed to print that for indexes
declared as unique or primary-key constraints. Per report from Achilleas
Mantzios.
This has been broken since the feature was introduced, AFAICS.
Back-patch to 9.4.
Discussion: https://postgr.es/m/1e6cc5ad-b84a-7c07-8c08-a4d0c3cdc938@matrix.gatewaynet.com

As written, this policy constrained only the post-image not the pre-image
of rows, meaning that users could delete other users' rows or take
ownership of such rows, contrary to what the docs claimed would happen.
We need two separate policies to achieve the documented effect.
While at it, try to explain what's happening a bit more fully.
Per report from Олег Самойлов. Back-patch to 9.5 where this was added.
Thanks to Stephen Frost for off-list discussion.
Discussion: https://postgr.es/m/3298321532002010@sas1-2b3c3045b736.qloud-c.yandex.net

Previously pg_upgrade checked for the pid file and started/stopped the
server to force a clean shutdown. However, "pg_ctl -m immediate"
removes the pid file but doesn't do a clean shutdown, so check
pg_controldata for a clean shutdown too.
Diagnosed-by: Vimalraj A
Discussion: https://postgr.es/m/CAFKBAK5e4Q-oTUuPPJ56EU_d2Rzodq6GWKS3ncAk3xo7hAsOZg@mail.gmail.com
Backpatch-through: 9.3

I blew the dust off a Bourne shell (file date 1996, yea verily) and
tried to run test.sh with it. It mostly worked, but I found that the
temp-directory creation code introduced by commit be76a6d39 was not
compatible, for a couple of reasons: this shell thinks "set -e" should
force an exit if a command within backticks fails, and it also thinks code
within braces should be executed by a sub-shell, meaning that variable
settings don't propagate back up to the parent shell. In view of Victor
Wagner's report that Solaris is still using pre-POSIX shells, seems like
we oughta make this case work. It's not like the code is any less
idiomatic this way; the prior coding technique appeared nowhere else.
(There is a remaining bash-ism here, which is that $RANDOM doesn't do
what the code hopes in non-bash shells. But the use of $$ elsewhere in
that path should be enough to ensure uniqueness and some amount of
randomness, so I think it's okay as-is.)
Back-patch to all supported branches, as the previous commit was.
Discussion: https://postgr.es/m/20180720153820.69e9ae6c@fafnir.local.vm

I was confused by what "intended to be parallel serially" meant, until
Robert Haas and David G. Johnston explained it. Rephrase the comment to
make it more clear, using David's suggested wording.
Discussion: https://www.postgresql.org/message-id/1fec9022-41e8-e484-70ce-2179b08c2092%40iki.fi

GatherMergePath (introduced in 10) and CustomPath (introduced in 9.5)
have gone missing. The order of the Path nodes was inconsistent with
what is listed in nodes.h, so make the order consistent at the same time
to ease future checks and additions.
Author: Sawada Masahiko
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAD21AoBQMLoc=ohH-oocuAPsELrmk8_EsRJjOyR8FQLZkbE0wA@mail.gmail.com

A collection of typos I happened to spot while reading code, as well as
grepping for common mistakes.
Backpatch to all supported versions, as applicable, to avoid conflicts
when backporting other commits in the future.

lca_inner() wasn't prepared for the possibility of getting no inputs.
Fix that, and make some cosmetic improvements to the code while at it.
Also, I thought the documentation of this function as returning the
"longest common prefix" of the paths was entirely misleading; it really
returns a path one shorter than the longest common prefix, for the typical
definition of "prefix". Don't use that term in the docs, and adjust the
examples to clarify what really happens.
This has been broken since its beginning, so back-patch to all supported
branches.
Per report from Hailong Li. Thanks to Pierre Ducroquet for diagnosing
and for the initial patch, though I whacked it around some and added
test cases.
Discussion: https://postgr.es/m/5b0d8e4f-f2a3-1305-d612-e00e35a7be66@qunar.com

When reading an existing FSM or VM page that was found to be corrupt by the
buffer manager, the code applied PageInit() to reinitialize the page, but
did so without any locking. There is thus a hazard that two backends might
concurrently do PageInit, which in itself would still be OK, but the slower
one might then zero over subsequent data changes applied by the faster one.
Even that is unlikely to be fatal; but it's not desirable, so add locking
to prevent it.
This does not add any locking overhead in the normal code path where the
page is OK. It's not immediately obvious that that's safe, but I believe
it is, for reasons explained in the added comments.
Problem noted by R P Asim. It's been like this for a long time, so
back-patch to all supported branches.
Discussion: https://postgr.es/m/CANXE4Te4G0TGq6cr0-TvwP0H4BNiK_-hB5gHe8mF+nz0mcYfMQ@mail.gmail.com

Explain that you can use any integer expression for the "count" in
pl/pgsql's versions of FETCH/MOVE, unlike the SQL versions which only
allow a constant.
Remove the duplicate version of this para under MOVE. I don't see
a good reason to maintain two identical paras when we just said that
MOVE works exactly like FETCH.
Per Pavel Stehule, though I didn't use his text.
Discussion: https://postgr.es/m/CAFj8pRAcvSXcNdUGx43bOK1e3NNPbQny7neoTLN42af+8MYWEA@mail.gmail.com

WAL senders sending logically-decoded data fail to properly report in
"streaming" state when starting up, hence as long as one extra record is
not replayed, such WAL senders would remain in a "catchup" state, which
is inconsistent with the physical cousin.
This can be easily reproduced by for example using pg_recvlogical and
restarting the upstream server. The TAP tests have been slightly
modified to detect the failure and strengthened so as future tests also
make sure that a node is in streaming state when waiting for its
catchup.
Backpatch down to 9.4 where this code has been introduced.
Reported-by: Sawada Masahiko
Author: Simon Riggs, Sawada Masahiko
Reviewed-by: Petr Jelinek, Michael Paquier, Vaishnavi Prabakaran
Discussion: https://postgr.es/m/CAD21AoB2ZbCCqOx=bgKMcLrAvs1V0ZMqzs7wBTuDySezTGtMZA@mail.gmail.com

We should only run apply_pathtarget_labeling_to_tlist if CP_LABEL_TLIST
was specified, because only in that case has use_physical_tlist checked
that the labeling will succeed; otherwise we may get an "ORDER/GROUP BY
expression not found in targetlist" error. (This subsumes the previous
test about gating_clauses, because we reset "flags" to zero earlier
if there are gating clauses to apply.)
The only known case in which a failure can occur is with a ProjectSet
path directly atop a table scan path, although it seems likely that there
are other cases or will be such in future. This means that the failure
is currently only visible in the v10 branch: 9.6 didn't have ProjectSet,
while in v11 and HEAD, apply_scanjoin_target_to_paths for some weird
reason is using create_projection_path not apply_projection_to_path,
masking the problem because there's a ProjectionPath in between.
Nonetheless this code is clearly wrong on its own terms, so back-patch
to 9.6 where this logic was introduced.
Per report from Regina Obe.
Discussion: https://postgr.es/m/001501d40f88$75186950$5f493bf0$@pcorp.us

Commit fafa374f2 caused _bt_getbuf() to possibly emit a WAL record for
a page that it was about to recycle. However, it failed to distinguish
all-zero pages from dead pages, which is important because only the
latter have valid btpo.xact values, or indeed any special space at all.
Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore
led to an Assert failure, or to emission of a WAL record containing a
bogus cutoff XID, which might lead to unnecessary query cancellations
on hot standby servers.
Per reports from Antonin Houska and 自己. Amit Kapila was first to
propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi
reviewed it at various times.
This is an old bug, so back-patch to all supported branches.
Discussion: https://postgr.es/m/2628.1474272158@localhost
Discussion: https://postgr.es/m/48875502.f4a0.1635f0c27b0.Coremail.zoulx1982@163.com

Back-patch commit dddfc4cb2, which broke LDFLAGS and related Makefile
variables into two parts, one for within-build-tree library references and
one for external libraries, to ensure that the order of -L flags has all
of the former before all of the latter. This turns out to fix a problem
recently noted on buildfarm member peripatus, that we attempted to
incorporate code from libpgport.a into a shared library. That will fail on
platforms that are sticky about putting non-PIC code into shared libraries.
(It's quite surprising we hadn't seen such failures before, since the code
in question has been like that for a long time.)
I think that peripatus' problem could have been fixed with just a subset
of this patch; but since the previous issue of accidentally linking to the
wrong copy of a Postgres shlib seems likely to bite people in the field,
let's just back-patch the whole change. Now that commit dddfc4cb2 has
survived some beta testing, I'm less afraid to back-patch it than I was
at the time.
This also fixes undesired inclusion of "-DFRONTEND" in pg_config's CPPFLAGS
output (in 9.6 and up) and undesired inclusion of "-L../../src/common" in
its LDFLAGS output (in all supported branches).
Back-patch to v10 and older branches; this is already in v11.
Discussion: https://postgr.es/m/20180704234304.bq2dxispefl65odz@ler-imac.local

A critical failure in some of the end-of-recovery actions before the
end-of-recovery record is written can cause PostgreSQL to react
inconsistently with the rest of the cluster in the event of a crash
before the final record is written. Two such failures are for example
an error while processing a two-phase state files or when operating on
recovery.conf. With this commit, the failures are still considered
FATAL, but the write of the timeline history file is delayed as much as
possible so as the window between the moment the file is written and the
end-of-recovery record is generated gets minimized. This way, in the
event of a crash or a failure, the new timeline decided at promotion
will not seem taken by other nodes in the cluster. It is not really
possible to reduce to zero this window, hence one could still see
failures if a crash happens between the history file write and the
end-of-recovery record, so any future code should be careful when
adding new end-of-recovery actions. The original report from Magnus
Hagander mentioned a renamed recovery.conf as original end-of-recovery
failure which caused a timeline to be seen as taken but the subsequent
processing on the now-missing recovery.conf cause the startup process to
issue stop on FATAL, which at follow-up startup made the system
inconsistent because of on-disk changes which already happened.
Processing of two-phase state files still needs some work as corrupted
entries are simply ignored now. This is left as a future item and this
commit fixes the original complain.
Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com

When performing pg_rewind, the presence of a read-only file which is not
accessible for writes will cause a failure while processing. This can
cause the control file of the target data folder to be truncated,
causing it to not be reusable with a successive run.
Also, when pg_rewind fails mid-flight, there is likely no way to be able
to recover the target data folder anyway, in which case a new base
backup is the best option. A note is added in the documentation as
well about.
Reported-by: Christian H.
Author: Michael Paquier
Reviewed-by: Andrew Dunstan
Discussion: https://postgr.es/m/20180104200633.17004.16377%40wrigleys.postgresql.org

Coverity complains that there is no protection in the code (at least in
non-assertion-enabled builds) against speculative insertion failing to
follow the expected protocol. Add an elog(ERROR) for the case.

Change a whole-database VACUUM into doing just pg_attribute, which is
the portion that verifies what we want it to do. The original
formulation wastes a lot of CPU time, which leads the test to fail when
runtime exceeds isolationtester timeout when it's super-slow, such as
under CLOBBER_CACHE_ALWAYS. Per buildfarm member friarbird.
It turns out that the previous shape of the test doesn't always detect
the condition it is supposed to detect (on unpatched reorderbuffer
code): the reason is that there is a good chance of encountering a
xl_running_xacts record (logged every 15 seconds) before the checkpoint
-- and because we advance the xmin when we receive that WAL record, and
we *don't* advance the xmin twice consecutively without receiving a
client message in between, that means the xmin is not advanced enough
for the tuple to be pruned from pg_attribute by VACUUM. So the test
would spuriously pass.
The reason this test deficiency wasn't detected earlier is that HOT
pruning removes the tuple anyway, even if vacuum leaves it in place, so
the test correctly fails (detecting the coding mistake), but for the
wrong reason.
To fix this mess, run the s0_get_changes step twice before vacuum
instead of once: this seems to cause the xmin to be advanced reliably,
wreaking havoc with more certainty.
Author: Arseny Sher
Discussion: https://postgr.es/m/87h8lkuxoa.fsf@ars-thinkpad

If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed. This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed. When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed. Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.
Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me. The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.
Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore. The test gets easily supported down to 10, still it
has been tested on all branches.
Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro
Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com

When deleting pages the nbtree code has to walk through siblings of a
tree node. When those sibling links are corrupted that can lead to
endless loops - which are currently not interruptible. This is
especially problematic if autovacuum is repeatedly blocked on such
indexes, as it can be hard to get out of that situation without
resorting to single user mode.
Thus add interrupt checks to appropriate places in such
loops. Unfortunately in one of the cases it's it's not easy to do so.
Between 9.3 and 9.4 the page deletion (and page split) code changed
significantly. Before it was significantly less robust against
interruptions. Therefore don't backpatch to 9.3.
Author: Andres Freund
Discussion: https://postgr.es/m/20180627191629.wkunw2qbibnvlz53@alap3.anarazel.de
Backpatch: 9.4-

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which led to scan shared_buffers
multiple times. Obviously this could cause to increase the WAL replay
time very much especially when shared_buffers was huge.
To alleviate this situation, this commit changes the recovery so that
it also calls smgrdounlinkall() only one time to delete multiple
relation files.
This is just fix for oversight of commit 279628a0a7, not new feature.
So, per discussion on pgsql-hackers, we concluded to backpatch this
to all supported versions.
Author: Fujii Masao
Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com

When these programs call pg_catalog.set_config, they need to check for
PGRES_TUPLES_OK instead of PGRES_COMMAND_OK. Fix for
5770172cb0c9df9e6ce27c507b449557e5b45124.
Reported-by: Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com>

search.cpan.org has been EOL'd, with metacpan.org being the official
replacement to which URLs now redirect. Update links to match the new
URL. Also update links to CPAN to use https as it will redirect from
http.
Author: Daniel Gustafsson
Discussion: https://postgr.es/m/B74C0219-6BA9-46E1-A524-5B9E8CD3BDB3@yesql.se

Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad

The backup history file has been no longer necessary for recovery
since the version 9.0. It's now basically just for informational purpose.
But previously the documentations still described that a recovery
requests the backup history file to proceed. The commit fixes this
documentation bug.
Back-patch to all supported versions.
Author: Yugo Nagata
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20180626174752.0ce505e3.nagata@sraoss.co.jp

On Windows, it is sometimes important for corresponding malloc() and
free() calls to be made from the same DLL, since some build options can
result in multiple allocators being active at the same time. For that
reason we already provided PQfreemem(). This commit adds a similar
function for freeing string results allocated by the pgtypes library.
Author: Takayuki Tsunakawa
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8AD5D6%40G01JPEXMBYT05

Standbys frequently need to release all locks held by a given xid.
Instead of searching one big list linearly, let's create one list
per xid and put them in a hash table, so we can find what we need
in O(1) time.
Earlier analysis and a prototype were done by David Rowley, though
this isn't his patch.
Back-patch all the way.
Author: Thomas Munro
Diagnosed-by: David Rowley, Andres Freund
Reviewed-by: Andres Freund, Tom Lane, Robert Haas
Discussion: https://postgr.es/m/CAEepm%3D1mL0KiQ2KJ4yuPpLGX94a4Ns_W6TL4EGRouxWibu56pA%40mail.gmail.com
Discussion: https://postgr.es/m/CAKJS1f9vJ841HY%3DwonnLVbfkTWGYWdPN72VMxnArcGCjF3SywA%40mail.gmail.com

System calls mixed up in error code paths are causing two issues which
several code paths have not correctly handled:
1) For write() calls, sometimes the system may return less bytes than
what has been written without errno being set. Some paths were careful
enough to consider that case, and assumed that errno should be set to
ENOSPC, other calls missed that.
2) errno generated by a system call is overwritten by other system calls
which may succeed once an error code path is taken, causing what is
reported to the user to be incorrect.
This patch uses the brute-force approach of correcting all those code
paths. Some refactoring could happen in the future, but this is let as
future work, which is not targeted for back-branches anyway.
Author: Michael Paquier
Reviewed-by: Ashutosh Sharma
Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz

The set-returning nature of these functions make their use unclear. The
modified paragraph was added in PG 9.4.
Reported-by: yshaladi@denodo.com
Discussion: https://postgr.es/m/152571684246.9460.18059951267371255159@wrigleys.postgresql.org
Backpatch-through: 9.4

Bring this transform function into sync with the policy established
by commit 3a382983d.
Also, fix it to make sure that what it drills down to is indeed a
hash, and not some other kind of Perl SV. Previously, the test
cases added here provoked crashes.
Because of the crash hazard, back-patch to 9.5 where this module
was introduced.
Discussion: https://postgr.es/m/28336.1528393969@sss.pgh.pa.us

When a standby's WAL receiver stops reading WAL from a WAL stream, it
writes data to the current WAL segment without having priorily zero'ed
the page currently written to, which can cause the WAL reader to read
junk data from a past recycled segment and then it would try to get a
record from it. While sanity checks in place provide most of the
protection needed, in some rare circumstances, with chances increasing
when a record header crosses a page boundary, then the startup process
could fail violently on an allocation failure, as follows:
FATAL: invalid memory alloc request size XXX
This is confusing for the user and also unhelpful as this requires in
the worst case a manual restart of the instance, impacting potentially
the availability of the cluster, and this also makes WAL data look like
it is in a corrupted state.
The chances of seeing failures are higher if the connection between the
standby and its root node is unstable, causing WAL pages to be written
in the middle. A couple of approaches have been discussed, like
zero-ing new WAL pages within the WAL receiver itself but this has the
disadvantage of impacting performance of any existing instances as this
breaks the sequential writes done by the WAL receiver. This commit
deals with the problem with a more simple approach, which has no
performance impact without reducing the detection of the problem: if a
record is found with a length higher than 1GB for backends, then do not
try any allocation and report a soft failure which will force the
standby to retry reading WAL. It could be possible that the allocation
call passes and that an unnecessary amount of memory is allocated,
however follow-up checks on records would just fail, making this
allocation short-lived anyway.
This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.
Backpatch down to 9.5, which is where palloc_extended has been
introduced.
Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05

M src/backend/access/transam/xlogreader.c

Use -Wno-format-truncation and -Wno-stringop-truncation, if available.

gcc 8 has started emitting some warnings that are largely useless for
our purposes, particularly since they complain about code following
the project-standard coding convention that path names are assumed
to be shorter than MAXPGPATH. Even if we make the effort to remove
that assumption in some future release, the changes wouldn't get
back-patched. Hence, just suppress these warnings, on compilers that
have these switches.
Backpatch to all supported branches.
Discussion: https://postgr.es/m/1524563856.26306.9.camel@gunduz.org

Use of strncpy with a length limit based on the source, rather than
the destination, is non-idiomatic and draws warnings from gcc 8.
Replace with memcpy, which does exactly the same thing in these cases,
but with less chance for confusion.
Backpatch to all supported branches.
Discussion: https://postgr.es/m/21789.1529170195@sss.pgh.pa.us

This could only cause an issue if strftime returned a ridiculously
long timezone name, which seems unlikely; and it wouldn't qualify
as a security problem even then, since pg_waldump (nee pg_xlogdump)
is a debug tool not part of the server. But gcc 8 has started issuing
warnings about it, so let's use snprintf and be safe.
Backpatch to 9.3 where this code was added.
Discussion: https://postgr.es/m/21789.1529170195@sss.pgh.pa.us

Documentation of word_similarity() and strict_word_similarity() functions
contains some vague wordings which could confuse users. This patch makes
those wordings more clear. word_similarity() was introduced in PostgreSQL 9.6,
and corresponding part of documentation needs to be backpatched.
Author: Bruce Momjian, Alexander Korotkov
Discussion: https://postgr.es/m/20180526165648.GB12510%40momjian.us
Backpatch: 9.6, where word_similarity() was introduced

M doc/src/sgml/pgtrgm.sgml

Fix bugs in vacuum of shared rels, by keeping their relcache entries current.

When vacuum processes a relation it uses the corresponding relcache
entry's relfrozenxid / relminmxid as a cutoff for when to remove
tuples etc. Unfortunately for nailed relations (i.e. critical system
catalogs) bugs could frequently lead to the corresponding relcache
entry being stale.
This set of bugs could cause actual data corruption as vacuum would
potentially not remove the correct row versions, potentially reviving
them at a later point. After 699bf7d05c some corruptions in this vein
were prevented, but the additional error checks could also trigger
spuriously. Examples of such errors are:
ERROR: found xmin ... from before relfrozenxid ...
and
ERROR: found multixact ... from before relminmxid ...
To be caused by this bug the errors have to occur on system catalog
tables.
The two bugs are:
1) Invalidations for nailed relations were ignored, based on the
theory that the relcache entry for such tables doesn't
change. Which is largely true, except for fields like relfrozenxid
etc. This means that changes to relations vacuumed in other
sessions weren't picked up by already existing sessions. Luckily
autovacuum doesn't have particularly longrunning sessions.
2) For shared *and* nailed relations, the shared relcache init file
was never invalidated while running. That means that for such
tables (e.g. pg_authid, pg_database) it's not just already existing
sessions that are affected, but even new connections are as well.
That explains why the reports usually were about pg_authid et. al.
To fix 1), revalidate the rd_rel portion of a relcache entry when
invalid. This implies a bit of extra complexity to deal with
bootstrapping, but it's not too bad. The fix for 2) is simpler,
simply always remove both the shared and local init files.
Author: Andres Freund
Reviewed-By: Alvaro Herrera
Discussion:
https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de
https://postgr.es/m/CAMa1XUhKSJd98JW4o9StWPrfS=11bPgG+_GDMxe25TvUY4Sugg@mail.gmail.com
https://postgr.es/m/CAKMFJucqbuoDRfxPDX39WhA3vJyxweRg_zDVXzncr6+5wOguWA@mail.gmail.com
https://postgr.es/m/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K+A@mail.gmail.com
Backpatch: 9.3-

This bug causes a lseek() failure to be reported as a "could not open"
failure in the error message, muddling bug reports. I introduced this
copy-and-pasteo in commit 78e122010422.
Noticed while reviewing code for bug report #15221, from lily liang. In
version 10 the affected function is only used by multixact.c and
commit_ts, and only in corner-case circumstances, neither of which are
involved in the reported bug (a pg_subtrans failure.)
Author: Álvaro Herrera

To distinguish SQL statements that are INSERT/UPDATE/DELETE from other
ones, exec_stmt_execsql looked at the post-rewrite form of the statement
rather than the original. This is problematic because it did that only
during first execution of the statement (in a session), but the correct
answer could change later due to addition or removal of DO INSTEAD rules
during the session. That could lead to an Assert failure, as reported
by Tushar Ahuja and Robert Haas. In non-assert builds, there's a hazard
that we would fail to enforce STRICT behavior when we'd be expected to.
That would happen if an initially present DO INSTEAD, that replaced the
original statement with one of a different type, were removed; after that
the statement should act "normally", including strictness enforcement, but
it didn't. (The converse case of enforcing strictness when we shouldn't
doesn't seem to be a hazard, as addition of a DO INSTEAD that changes the
statement type would always lead to acting as though the statement returned
zero rows, so that the strictness error could not fire.)
To fix, inspect the original form of the statement not the post-rewrite
form, making it valid to assume the answer can't change intra-session.
This should lead to the same answer in every case except when there is a
DO INSTEAD that changes the statement type; we will now set mod_stmt=true
anyway, while we would not have done so before. That breaks the Assert
in the SPI_OK_REWRITTEN code path, which expected the latter behavior.
It might be all right to assert mod_stmt rather than !mod_stmt there,
but I'm not entirely convinced that that'd always hold, so just remove
the assertion altogether.
This has been broken for a long time, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/CA+TgmoZUrRN4xvZe_BbBn_Xp0BDwuMEue-0OyF0fJpfvU2Yc7Q@mail.gmail.com

kern.ipc.shm_use_phys is not a sysctl on OpenBSD, and SEMMAP is not
a kernel configuration option. These were probably copy pasteos from
when the documentation had a single paragraph for *BSD.
Author: Daniel Gustafsson <daniel@yesql.se>

Collations, conversions, extended statistics objects (in >= v10),
and all four types of text search objects have schema-qualified names.
getObjectDescription() ignored that and would emit just the base name of
the object, potentially producing wrong or at least highly misleading
output. Fix it to add the schema name whenever the object is not "visible"
in the current search path, as is the rule for other schema-qualifiable
object types.
Although in common situations the output won't change, this seems to me
(tgl) to be a bug worthy of back-patching, hence do so.
Kyotaro Horiguchi, per a complaint from me
Discussion: https://postgr.es/m/20180522.182020.114074746.horiguchi.kyotaro@lab.ntt.co.jp

If echo = false, simple_prompt() is supposed to prevent echoing the
input (for password input). However, the Windows implementation applied
the mode change to STD_INPUT_HANDLE. That would not have the desired
effect if stdin isn't actually the terminal, for instance if the user
is piping something into psql. Fix it to apply the mode change to
the correct input file, so that passwords do not echo in such cases.
In passing, shorten and de-uglify this code by using #elif rather than
an #if nest and removing some duplicated code.
Back-patch to all supported versions. To simplify that, also back-patch
the portions of commit 9daec77e1 that got rid of an unnecessary
malloc/free in the same area.
Matthew Stickney (cosmetic changes by me)
Discussion: https://postgr.es/m/502a1fff-862b-da52-1031-f68df6ed5a2d@gmail.com

Because the code for the HEADER option skips a line when this counter
is zero, a very long COPY FROM WITH HEADER operation would drop a line
every 2^32 lines. A lesser but still unfortunate problem is that errors
would show a wrong input line number for errors occurring beyond the
2^31'st input line. While such large input streams seemed impractical
when this code was first written, they're not any more. Widening the
counter (and some associated variables) to uint64 should be enough to
prevent problems for the foreseeable future.
David Rowley
Discussion: https://postgr.es/m/CAKJS1f88yh-6wwEfO6QLEEvH3BEugOq2QX1TOja0vCauoynmOQ@mail.gmail.com

OFFSET <x> ROWS FETCH FIRST <y> ROWS ONLY syntax is supposed to accept
<simple value specification>, which includes parameters as well as
literals. When this syntax was added all those years ago, it was done
inconsistently, with <x> and <y> being different subsets of the
standard syntax.
Rectify that by making <x> and <y> accept the same thing, and allowing
either a (signed) numeric literal or a c_expr there, which allows for
parameters, variables, and parenthesized arbitrary expressions.
Per bug #15200 from Lukas Eder.
Backpatch all the way, since this has been broken from the start.
Discussion: https://postgr.es/m/877enz476l.fsf@news-spur.riddles.org.uk
Discussion: http://postgr.es/m/152647780335.27204.16895288237122418685@wrigleys.postgresql.org

This is the converse of the unsafe-usage-of-%m problem: the reason
ereport/elog provide that format code is mainly to dodge the hazard
of errno getting changed before control reaches functions within the
arguments of the macro. I only found one instance of this hazard,
but it's been there since 9.4 :-(.

Ancient HPUX, for one, does this. We hadn't noticed due to the lack
of regression tests that required a working strtoll.
(I was slightly tempted to remove the other historical spelling,
strto[u]q, since it seems we have no buildfarm members testing that case.
But I refrained.)
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org

Buildfarm member dromedary is still unhappy about the recently-added
ecpg "long long" tests. The reason turns out to be that it includes
"-ansi" in its CFLAGS, and in their infinite wisdom Apple have decided
to hide the declarations of strtoll/strtoull in C89-compliant builds.
(I find it pretty curious that they hide those function declarations
when you can nonetheless declare a "long long" variable, but anyway
that is their behavior, both on dromedary's obsolete macOS version and
the newest and shiniest.) As a result, gcc assumes these functions
return "int", leading naturally to wrong results.
(Looking at dromedary's past build results, it's evident that this
problem also breaks pg_strtouint64() on 32-bit platforms; but we
evidently have no regression tests that exercise that function with
values above 32 bits.)
To fix, supply declarations for these functions when the platform
provides the functions but not the declarations, using the same type
of mechanism as we use for some other similar cases.
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org

I don't think this is really the best long-term answer, and in
particular it doesn't fix the pre-existing hazard in sqltypes.h.
But for the moment let's just try to make the buildfarm green again.
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org

This will only actually exercise the "long long" code paths on platforms
where "long" is 32 bits --- otherwise, the SQL bigint type maps to
plain "long", and we will test that code path instead. But that's
probably sufficient coverage, and anyway we weren't testing either
code path before.
Dang Minh Huong, tweaked a bit by me
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org

This is needed for full support of "long long" variables in ecpg, but
the previous patch for bug #15080 (commits 51057feaa et al) missed it.
In MSVC versions where the functions don't exist under those names,
we can nonetheless use _strtoi64() and _strtoui64().
Like the previous patch, back-patch all the way.
Dang Minh Huong
Discussion: https://postgr.es/m/151935568942.1461.14623890240535309745@wrigleys.postgresql.org

canonicalize_ec_expression() is supposed to agree with coerce_type() as to
whether a RelabelType should be inserted to make a subexpression be valid
input for the operators of a given opclass. However, it did the wrong
thing with named-composite-type inputs to record_eq(): it put in a
RelabelType to RECORDOID, which the parser doesn't. In some cases this was
harmless because all code paths involving a particular equivalence class
did the same thing, but in other cases this would result in failing to
recognize a composite-type expression as being a member of an equivalence
class that it actually is a member of. The most obvious bad effect was to
fail to recognize that an index on a composite column could provide the
sort order needed for a mergejoin on that column, as reported by Teodor
Sigaev. I think there might be other, subtler, cases that result in
misoptimization. It also seems possible that an unwanted RelabelType
would sometimes get into an emitted plan --- but because record_eq and
friends don't examine the declared type of their input expressions, that
would not create any visible problems.
To fix, just treat RECORDOID as if it were a polymorphic type, which in
some sense it is. We might want to consider formalizing that a bit more
someday, but for the moment this seems to be the only place where an
IsPolymorphicType() test ought to include RECORDOID as well.
This has been broken for a long time, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/a6b22369-e3bf-4d49-f59d-0c41d3551e81@sigaev.ru

The impact of VARIADIC on the combine/serialize/deserialize support
functions of an aggregate wasn't thought through carefully. There is
actually no impact, because variadicity isn't passed through to these
functions (and it doesn't seem like it would need to be). However,
lookup_agg_function was mistakenly told to check things as though it were
passed through. The net result was that it was impossible to declare an
aggregate that had both VARIADIC input and parallelism support functions.
In passing, fix a runtime check in nodeAgg.c for the combine function's
strictness to make its error message agree with the creation-time check.
The previous message was actually backwards, and it doesn't seem like
there's a good reason to have two versions of this message text anyway.
Back-patch to 9.6 where parallel aggregation was introduced.
Alexey Bashtanov; message fix by me
Discussion: https://postgr.es/m/f86dde87-fef4-71eb-0480-62754aaca01b@imap.cc

DST law changes in North Korea. Redefinition of "daylight savings" in
Ireland, as well as for some past years in Namibia and Czechoslovakia.
Additional historical corrections for Czechoslovakia.
With this change, the IANA database models Irish timekeeping as following
"standard time" in summer, and "daylight savings" in winter, so that the
daylight savings offset is one hour behind standard time not one hour
ahead. This does not change their UTC offset (+1:00 in summer, 0:00 in
winter) nor their timezone abbreviations (IST in summer, GMT in winter),
though now "IST" is more correctly read as "Irish Standard Time" not "Irish
Summer Time". However, the "is_dst" column in the pg_timezone_names view
will now be true in winter and false in summer for the Europe/Dublin zone.
Similar changes were made for Namibia between 1994 and 2017, and for
Czechoslovakia between 1946 and 1947.
So far as I can find, no Postgres internal logic cares about which way
tm_isdst is reported; in particular, since commit b2cbced9e we do not
rely on it to decide how to interpret ambiguous timestamps during DST
transitions. So I don't think this change will affect any Postgres
behavior other than the timezone-view outputs.
Discussion: https://postgr.es/m/30996.1525445902@sss.pgh.pa.us

The regexes used in 102_vacuumdb_stages.pl to check the postmaster log
for expected output contained several places with ".*.*", which is
underdetermined and can cause exponential runtime growth in Perl's regex
matcher (since it's not bright enough not to waste time seeing whether
different splits of the same substring would allow a match). We were
fortunate that the amount of text in the postmaster log was generally not
enough to make the runtime go to the moon; although commit 6271fceb8 had
been on the hairy edge of an obvious problem, thanks to its increasing the
default log verbosity to DEBUG1. Experimentation shows that anyone who
tried to run this test case with an even higher log verbosity would have
been in for serious pain. But even at default logging level, fixing this
saves several hundred ms on my workstation, more on slower buildfarm
members.
Remove the extra ".*"s, restoring more-or-less-linear matching speed.
Back-patch to 9.4 where the test case was added, mostly in case anyone
tries to do related debugging in a back branch.
Discussion: https://postgr.es/m/32459.1525657786@sss.pgh.pa.us