What I wrote about a year ago felt like a far-distant goal — now it feels like we are almost there. Yes, it will take years to fully complete this feature, like all
our complex features, but it feels like we are very near to proof-of-concept and production deployments, at least for certain workloads. Postgres pulls off these
amazing feats regularly, and you would think I would get used to it and not be surprised, but I still am.

I have been with the project long enough to remember how the project got the name "PostgreSQL". In 1996, we inherited the name
"Postgres95" from the Berkeley team's remaining member Jolly Chen. Obviously the Postgres95 name wasn't going to last long-term, so a new
name had to be chosen. We could revert to the original Berkeley name "Postgres", or go with a name that more closely identified us as an
sql database and call it "PostgreSQL". To add complexity to the discussion, Berkeley Postgres (before Postgres95) used the
quel query language, so there were valid arguments that we needed to
advertise that the database now used sql. After heated debate, "PostgreSQL" was chosen, and "Postgres" was later
accepted as a valid alias.

I was always in the "Postgres" camp, but over time I have learned to appreciate the arguments of those who prefer "PostgreSQL". The two
strong arguments for "PostgreSQL" are:

Visually better than "Postgres"

Identifies it to new people as an sql database

The strong argument for "Postgres" is that "Postgres" is simpler to say, while "PostgreSQL" is complex and has several verbal options, e.g. "Postgres-Q-L," "Postgres Sequel," "Postgre Sequel."

What has really cemented my appreciation for both names is that the companies I have worked for have cycled through the two names because they saw value in each one.
What that tells me is that both names have unique value and that we have to do our best to leverage the strengths of both.

Also, from the even-more-obscure closet, here is the first
email suggesting that Postgres use an elephant as its
mascot:

but if you want an animal-based logo, how about some sort of elephant? After all, as the Agatha Christie title read, elephants can remember ...

Postgres 9.4 added the server variable
huge_pages. The Postgres documentation
does a good job explaining huge pages, "The use of huge pages results in smaller page tables and
less CPU time spent on memory management, increasing performance."

Specifically, all modern operating systems use virtual memory to map process-visible memory to physical ram.
Typically, the default granularity of virtual pages is four kilobytes. For a process using one megabyte of virtual memory, that is 256 page table entries mapping
virtual addresses to physical addresses. For a process using one gigabyte of virtual memory, that is 256k page table entries! While modern cpus
have translation lookaside buffers (tlb) to speed virtual/physical memory mapping, there is
still a lot of overhead for processes that use large memory spaces. Huge pages allow for less granular virtual pages, meaning fewer page table entries and fewer
lookups. The x86-64 cpu architecture supports 2MB and 1GB-sized huge pages.

Unfortunately Postgres only supports huge pages on Linux, and to make use of huge pages, you must enable huge pages in the operating
system. The Postgres documentation goes into
detail on how to do this. You can see how much memory is allocated for huge pages on Linux by looking in /proc/meminfo:

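Something like this hypothetical output (values vary by system):

AnonHugePages:    270336 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
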
The last line indicates that huge pages are 2MB in size. The first line indicates the anonymous or
transparent huge pages in use in your kernel. Transparent huge pages can cause
significant overhead when a background kernel thread tries to rearrange memory to coalesce it into 2MB chunks. It is recommended that transparent huge pages be disabled
when using Postgres, and only explicit huge pages be used. Fortunately, most modern Linux kernels have transparent huge pages disabled by default.

The Postgres community is often asked to provide automatic tracking of the creation and last modification times for objects, e.g. tables,
functions. File systems track file last-modification times, and some record creation time, so the logic is, why can't Postgres do this?
(Unix's inode modification time does not represent the creation time.)

For object creation-time tracking, we have to consider when creation time should be set. It is obvious that the creation time should be
set when the object is first created, but what about when the object's definition is altered, when its data is modified, or when it is dumped and restored?

Tracking all of these times is reasonable. However, there is significant code and performance overhead in doing it, and a given user is probably only interested in a
few of these. A detailed email by Tom Lane also explains
that even if we tracked all of this, it only records the change time, not what was changed.

A more holistic solution is auditing, which would contain change information as well as tracking of multiple change times. Effort is probably better spent adding
auditing to Postgres than adding fine-grained tracking of creation and modification times. Users who want to track such times are best served by writing custom
event triggers.
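
For example, here is a minimal sketch of such an event trigger (the table, function, and trigger names are hypothetical):

CREATE TABLE ddl_log (
    object_identity text,
    command_tag     text,
    changed_at      timestamptz DEFAULT current_timestamp
);

CREATE FUNCTION log_ddl() RETURNS event_trigger AS $$
BEGIN
    -- record the identity and time of every ddl command
    INSERT INTO ddl_log (object_identity, command_tag)
    SELECT object_identity, command_tag
    FROM pg_event_trigger_ddl_commands();
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER track_ddl ON ddl_command_end
    EXECUTE PROCEDURE log_ddl();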

An age-old question is whether it is better to put data in a database or a file system. Of course, the answer is "it depends," but
let's look at what it depends on.

First, Postgres stores its data in the file system, so the file system must be good for something. Postgres also stores some of its
configuration files in the file system, e.g. pg_hba.conf, so it isn't simply the case that once you have a database, everything is better
in a database.

It usually comes down to weighing database positives vs. negatives — these questions highlight the database storage positives:

Do you need to view the data in multiple ways?

Is synchronizing database data and file system data going to be difficult?

Do you need multi-object commit synchronization?

Do applications need a single, guaranteed-consistent view of their data?

It is often a question of object size. At the multi-megabyte size, having multiple copies of the data in the database while a query is executing is not a problem, and
the auto-synchronization of data in the database is a great benefit. At the hundreds of megabytes and larger, the overhead of moving around multiple copies of the
object in the database can become a performance problem. Of course, these problems only happen if you access the large data object in a query — if you don't query the
bytea column or access the large object in the query, there is no overhead except storage. Updates of rows containing large objects also don't suffer a
performance penalty if the update doesn't modify the large object. (The same is true for any long value, e.g. character strings, json.)

Also, if the object is not a binary blob to the database but has a structure the database can understand, then storing it in the database has other advantages. For
example, doing full text search or json lookups on data can be very beneficial.

Some of the same questions come up when deciding whether some of your data would be best stored in a different database technology and using
foreign data wrappers to unify data access. Again, the
answer is still "it depends," but hopefully this helps you judge the "depends" better.

I have attended 323 Postgres events in my career. While I have enjoyed almost all of
them, many had different focuses, so I thought I would share my experiences. First, there are a variety of conference types:

Vendor conferences: often in big cities, which focus on company-produced products

Business conferences: also often in big cities, which focus on business challenges and discussions, often with high attendance prices

Community conferences: led by people who care about open-source software and focus on software knowledge transfer

Hobbyist conferences: often in smaller cities, which focus on interpersonal relationship building with technology as a catalyst, often free

It would be nice if I could say which conference types are good or bad, but that isn't possible. Each conference targets an audience whose needs it seeks to fulfill. Let's look at
the needs that each fulfills:

Vendor conferences: If you are new to a technology and need people to help you navigate purchase options, these conferences are for
you.

Business conferences: If you are frequently challenged to make business decisions, but feel you have no one to share options with or
brainstorm, this type of conference can give you a framework to help you make your next complex business decision.

Community conferences: If you spend significant time solving technological problems, you can gain great insight and new approaches by
attending this type of conference.

Hobbyist conferences: If you are looking for emotional connections to people who share similar interests, this type of conference can
be personally rewarding.

Ideally everyone would go to conferences which match their interests, but what happens when they don't match? Here are some examples:

Vendor conferences: "Wow, this is boring. The booth staff don't even know about the technology they are selling. When will this be
over?"

Business conferences: "People are very passionate about the problems they are trying to solve. I am glad I don't have these problems
— they seem unsolvable."

Community conferences: "These people really care about the minutia of the software. When are they going to get a life?"

Hobbyist conferences: "Does this end with everyone sitting in a circle and roasting marshmallows over a cpu fan?"

Ninety-five percent of Postgres conferences are community conferences (the third type above). They have some of the other aspects, but that is not the focus. Open source vendor conferences,
e.g. LinuxWorld, used to be popular but are almost extinct. They were really focused on vendors, and
when attendees realized this and vendors didn't get the sales opportunities they wanted, demand collapsed. Postgres has few business-focused conferences —
Postgres Vision is an example. Hobbyist conferences still exist, and probably always will, though they
are, by necessity, small.

For me, conferences allow me to hear from and talk to people knowledgeable in a field I want to study. These interactions convey
information I can't get from reading books and articles. The best interactions are a fire hose of information that I can absorb and
probe. Such interactions give me information it would take me weeks or months to learn.

This Slashdot discussion covers some of the value in attending
conferences. Hopefully this blog post and that discussion will help you navigate the many conferences offered and help you get the best value from them.

Postgres is advertised as "reading never blocks writing and writing never
blocks reading." While this is true for selects and dml (insert, update, delete), it isn't true for
ddl.

Ddl, like alter table, can block reads and writes. Postgres has continued to reduce locking requirements for alter and other ddl
commands, and will continue to improve things. However, ultimately there is some ddl that will always require, at least briefly, the blocking of reads and
writes.

I had the pleasure of speaking at Alibaba's Computing Conference last week in Hangzhou, China. I gave
two presentations. The first covered Postgres 10 features (images, video). The second was a generic talk about the future of
open source databases (images, video in
English and
Chinese (starts at 153 minutes)).

This week I am attending a community Postgres conference in Shenzhen, China. Postgres users here have a
strong desire to increase Postgres adoption in China in the coming years.

Why would you do this? Normally this would be done using select ...
for update. However, what if you want to do the update without locking the row between select and update, and are
willing to discard the update if the row has already been updated by another session? This is a case where using xmin in
updates makes sense.
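
A minimal sketch of the idea (the table, column, and values are hypothetical):

SELECT xmin, balance FROM accounts WHERE id = 1;
-- remember the returned xmin, e.g. 12345; no row lock is held

UPDATE accounts SET balance = 200
WHERE id = 1 AND xmin = '12345';
-- zero rows updated means another session modified the row first,
-- so discard the update or retry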

cume_dist computes the fraction of partition rows that are less than or equal to the current row and its peers, while
percent_rank computes the fraction of partition rows that are less than the current row, assuming the current row does not
exist in the partition.

There are some key phrases in there. First, cume_dist computes the percentage of rows that are "less than or equal," while
percent_rank computes a similar percentage that are "less than" the current row, except it also assumes the current row isn't in
the partition.

That last issue of not counting the current row struck me as odd. I talked to some statisticians about this and did some web searches.
The best I can understand is that this special percent_rank behavior is designed to handle cases where there is a single
maximum value in the partition. For example:

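-- a hypothetical partition of five distinct values
SELECT x,
       percent_rank() OVER (ORDER BY x),
       cume_dist() OVER (ORDER BY x)
FROM (VALUES (1), (2), (3), (4), (5)) AS t(x);

 x | percent_rank | cume_dist
---+--------------+-----------
 1 |            0 |       0.2
 2 |         0.25 |       0.4
 3 |          0.5 |       0.6
 4 |         0.75 |       0.8
 5 |            1 |         1
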
In this example, what are the ideal ratios for the last line? Cume_dist is easy since it is "less than or equal" to the current
row, so that is clearly 1.00. What about percent_rank, which is "less than?" Without the "assume the current row isn't in the
partition" exclusion, it would show 0.80. That would be a silly answer since all the rows are less than the last row, except the last row
itself. I think this is why that exclusion was added.

This issue becomes even more complex when there are many duplicates. For example, imagine this data set:

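-- hypothetical scores: one low value, five duplicates, one high value
SELECT x,
       round((percent_rank() OVER (ORDER BY x))::numeric, 2) AS percent_rank,
       round((cume_dist() OVER (ORDER BY x))::numeric, 2) AS cume_dist
FROM (VALUES (0), (1), (1), (1), (1), (1), (2)) AS t(x);

 x | percent_rank | cume_dist
---+--------------+-----------
 0 |         0.00 |      0.14
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 2 |         1.00 |      1.00
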
Frankly, neither percent_rank nor cume_dist look like they produce reasonable results. If I got a score of 1, is
saying that a ratio of 0.17 rows are less than mine, or that 0.86 are equal or less than mine, useful? Imagine this was an
sat test and many people got 650. Wouldn't it be reasonable to count half of the
people with matching scores as less than or equal to mine, which would compute 0.50, rather than including all or none of the matching
scores?

Right now, there is no way in Postgres to do that, but I think it would be a nice thing to add. I assume a server-side function could be written to do this. Computing
the average of percent_rank and cume_dist would yield a similar result.
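
For example, a sketch using the same hypothetical data as above:

SELECT x,
       (percent_rank() OVER (ORDER BY x) +
        cume_dist()    OVER (ORDER BY x)) / 2 AS mid_rank
FROM (VALUES (0), (1), (1), (1), (1), (1), (2)) AS t(x);

-- mid_rank for the rows with value 1 is roughly 0.51, close to the
-- desired 0.50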

In a previous blog entry I suggested storing the original time zone
offset in a separate column if clients need to know the stored time in the original time zone. There is some more complexity to this
issue that I would like to cover.

First, when I suggested using select extract(timezone from current_timestamp), I assumed the user was also storing the
current_timestamp value in the database. If they were storing a past or future timestamp in the database, they would need to use
that value in extract, instead of current_timestamp.

Second, as was pointed out by Miha Vrhovnik in a blog
comment, things get more complicated if a future time is stored in the database and the future time zone rules change after the data is
stored. You might think this concern is only theoretical, but Russia made such changes in
2014 and
2016.

To get the proper behavior, you have to ask yourself, if you are storing a future timestamp, do you want to retain the same wall-clock
time? If you were recording a future doctor's appointment, odds are you would want to adjust the stored value to reflect the same time of
day (wall-clock time). If you were recording a future astronomical event, you would want to keep the same instant in time, even if the
visible time changes. The default timestamp with time
zone behavior is to retain the same instant in time, i.e., astronomical behavior.

To retain the same future wall-clock time after a time zone rule change, you would need to store the timestamp using
without time zone and store the time zone name in a separate column. You would then need to combine the timestamp
without time zone and the time zone name in each query to compute the instant in time based on the current time zone rules. Any
indexes that did such computations would also need to be reindexed when time zone rules change.

This method allows the data to adjust to future time zone rule changes by computing the time zone offset on demand, rather than being
locked into the time zone rules that were current at the time the data was entered, e.g.:

-- controls the OUTPUT time zone of AT TIME ZONE when passed a WITHOUT TIME ZONE value
SET TIME ZONE 'Europe/Moscow';
-- AT TIME ZONE specifies the time zone for the literal value
SELECT TIMESTAMP WITHOUT TIME ZONE '2012-03-08 09:00:00' AT TIME ZONE 'Europe/Moscow';
timezone
------------------------
2012-03-08 09:00:00+04
SELECT TIMESTAMP WITHOUT TIME ZONE '2022-03-08 09:00:00' AT TIME ZONE 'Europe/Moscow';
timezone
------------------------
2022-03-08 09:00:00+03

Basically, if the future time zone rules change, the output time of day would be the same, but the instant in time compared to
utc would change, e.g. the +03. Of course, if all events are in
the same time zone, and you don't want "astronomical behavior," then there is no need for time zone adjustments. Miha Vrhovnik is right
that using time zones can be more trouble than they are worth, especially if all entries are in the same time zone and future time zone
changes are a possibility.

I recently wrote about
PL/pgSQL's good alignment with the scripting needs of the sql
language. There have been several email threads in the past few months exploring this in much greater detail so I wanted to share them.

The first
email
thread defines "impedance mismatch" as a way of measuring how closely a language's features and syntax fit the requirements of
server-side functions. It cautions against always choosing familiar languages because sometimes using a language with a lower "impedance
mismatch", even if its syntax is unfamiliar, is better in the long run. It also has a long discussion about when server-side logic is
wise.

The second
email
thread talks about the mechanics of using server-side languages, and mentions
PL/Perl and PL/v8 as good for
compute-heavy server-side functions.

The final
thread
is a wish-list of how to improve PL/pgSQL. While there are many requests, they center on cases where PL/pgSQL is overly-verbose or the
defaults are non-optimal. (I was actually happy to see that the scope of requested improvements was so limited.)

I will close with a particularly-glowing post
about the community's accomplishments. As someone who sees complaints and bug reports all day, it is good to remember how much goes well
in our community and development processes.

I last blogged about Postgres's handling of case sensitivity in 2012. A
recent
email
thread requested that case sensitivity be configurable in Postgres.

Certainly, allowing case sensitivity to be configurable would help those porting from other databases. However, Postgres tools and
libraries operate inside other applications, so if there were a case-sensitivity setting, they could not change it. They would need to
support whatever case sensitivity the client had set, meaning that every identifier would have to be double-quoted. It was concluded that this
negative outweighs the portability benefit.

The Postgres hackers list is a steady-stream of great ideas and discussion, but
occasionally something comes along that really makes you sit back and think, "Wow, where did that come from?" Such was a February 2017
email
from Konstantin Knizhnik presenting a proof-of-concept vectorization optimization for the executor.

Konstantin's work adds vectorization to the executor, which could be revolutionary. Once Andres Freund's executor improvements are in, we can research how to make
the executor even faster using vectorization. This would open up Postgres to an entirely new class of big-data applications.

When setting up a database schema, indexing is always a consideration. While Postgres supports traditional
btree indexes for most data types, they can be quite heavy, often
requiring a significant percentage of the table size for index storage.

There are two options for schemas that need to index a large percentage of columns. The first is
brin indexes which allow for small, low-overhead indexes that can be
created on many columns. One downside of brin indexes is that they
must be defined separately on each column that needs one.

A more generic solution is to place all data in a jsonb
column and create a gin index. The
gin index indexes every json key and value, and only stores the keys and values once in the index. This is a great
"index everything" option for administrators who aren't sure of their indexing needs.

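A minimal sketch of that approach (the table and index names are hypothetical):

CREATE TABLE events (doc jsonb);
CREATE INDEX events_doc_idx ON events USING gin (doc);

-- the single index can then serve containment lookups on any key/value pair
SELECT * FROM events WHERE doc @> '{"status": "active"}';
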
A more sophisticated improvement, which bypasses the problem of inaccurate statistics, would be to have the executor send feedback on the selectivity
found during query execution back to the optimizer to improve future query plans.

You would think that the partitioning feature is now complete. Oh, no! It seems there are now more partitioning enhancement requests
than we had before adding the syntax. I knew there would be new optimizations once there was a canonical representation of partitions in
the backend, but I didn't expect this feature-request growth.

Ottawa's PGCon conference had a talk about
partitioning, but the major partition feature list was created during its
unconference. The good news is that
Amit Langote recorded all the
requests. I expect it will be a few years until we can say we
are done improving partitioning in Postgres.

Postgres has a well-deserved reputation for flexibility. You can often combine two features to get something very powerful. For example,
it was recognized several years ago that combining Pgpool and
streaming replication
creates a powerful, easy-to-administer read-scaling
solution.

When asked about creating a read-scaling solution for multi-terabyte systems, I came up with an enhanced option. The problem with
using streaming replication for read scaling is that you need a full copy of the entire database cluster on each standby. For
multi-terabyte clusters, that requires a lot of storage. Also, odds are you don't access the cluster's data uniformly — some
data is current and frequently accessed, some is archival and rarely accessed.

An interesting solution to this problem is to combine four Postgres features:

Pgpool

Streaming replication

Partitioning

Foreign data wrappers

First, use Pgpool and add a partitioned table on the primary, where some of the partitions are local and some are
foreign data wrapper references. The local partitions contain
current data and are copied to the top-level streaming standbys. Archival data is placed on a secondary set of streaming-replication
servers and accessed via foreign data wrappers from the top-level primary and its standbys. For read scaling, you might need many
top-level streaming replicas with current data, but only a few streaming replicas to handle archival-data requests.
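
Here is a sketch of the primary's schema, using Postgres 10 syntax (all names, ranges, and connection options are hypothetical):

CREATE EXTENSION postgres_fdw;
CREATE SERVER archive_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive-host', dbname 'db');
CREATE USER MAPPING FOR PUBLIC SERVER archive_server;

CREATE TABLE events (id int, created date) PARTITION BY RANGE (created);

-- current data is stored locally and streamed to the top-level standbys
CREATE TABLE events_current PARTITION OF events
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- archival data lives on the secondary replication set
CREATE FOREIGN TABLE events_archive PARTITION OF events
    FOR VALUES FROM ('2000-01-01') TO ('2017-01-01')
    SERVER archive_server;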

This solution uses four Postgres tools to provide flexible read scaling for large data sets. It is also transparent to applications
so administrators can rearrange the data without modifying applications.

You might know that Postgres supports materialized views and
foreign data wrappers (fdw). Briefly, materialized
views allow for queries to be materialized and refreshed on demand. Foreign data wrappers allow data to be pulled from foreign data
sources, like Nosql stores and other Postgres servers.

What you might not have considered is that materialized views and foreign data wrappers can be used together. Materialized views speed
data access by summarizing data so it isn't necessary to query the base tables that are referenced by materialized views. Foreign data
wrappers, because they are remote, can be slow to access. By combining the two features, you can get fast access to remote data.

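For example, Postgres 10's postgres_fdw can push an entire aggregate to the remote server; here is a hypothetical plan, assuming other_world is a postgres_fdw foreign table:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT count(*) FROM other_world;
                     QUERY PLAN
-----------------------------------------------------
 Foreign Scan
   Output: (count(*))
   Relations: Aggregate on (public.other_world)
   Remote SQL: SELECT count(*) FROM public.other_world
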
This reduces the amount of data that must be passed from the foreign data wrapper server, and offloads aggregate computation from the
requesting server. The postgres_fdw fdw is able to perform this optimization. There are also improvements in pushing down joins
involving extensions.

Postgres 10 performs foreign table aggregate selects faster than 9.6 did, though they are still slower than using materialized views.

You don't even need to use aggregates in materialized views — you can just copy the contents of a remote table into a materialized view
for faster access, and refresh the materialized view occasionally (though
logical replication in Postgres 10 does this better):

CREATE MATERIALIZED VIEW mat_view2 AS
SELECT *
FROM other_world;

Now we can run a performance test on the foreign table and its local copy:

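-- a sketch of such a comparison, with psql's \timing enabled;
-- actual numbers depend on the setup
SELECT count(*) FROM other_world;  -- remote access via postgres_fdw
SELECT count(*) FROM mat_view2;    -- local materialized copy
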
In summary, materialized views and foreign data wrappers are two features that work well together. Materialized views allow remote data to
be cached locally, either entire tables or aggregate summarizations. The cache can be refreshed using refresh materialized view.
Also, Postgres 10 speeds up aggregate queries on foreign tables.

Like many open-source projects, Postgres is written in the C programming
language. However, with the code base being 31 years old, a lot of
specialized C infrastructure has been added over the years to simplify server programming, e.g. memory management, caching, system catalog
access, tuple access, error handling.

While this infrastructure helps experienced developers be productive, it often confuses people studying the Postgres source code because
you have to understand the infrastructure to understand the code. Robert Haas makes this point
clear in an email post:

A system like PostgreSQL is almost a language of its own; we don't really code for PostgreSQL in C, but in "PG-C". Learning the
PG-specific idioms is arguably more work than learning C itself …

Autovacuum simplifies Postgres administration by
automating the cleanup of updated (expired) and
deleted rows and rows created by aborted transactions. It also updates optimizer statistics. It uses information gathered by the
statistics collector to launch
vacuum and
analyze operations at the appropriate time. Autovacuum can
respond much more promptly than an administrator.

However, there are cases where autovacuum isn't optimal, and for such cases there are two options. First, autovacuum can be controlled at
the global level by settings in postgresql.conf. A
lesser-known option is controlling autovacuum parameters at the table level via create/alter table storage options. The latter are
obviously more fine-grained and handle cases where the global options are too coarse. And, of course, these settings can all be adjusted to be
more aggressive during idle periods.
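
For example, a minimal sketch of table-level settings (the table name is hypothetical):

ALTER TABLE busy_table SET (autovacuum_vacuum_scale_factor = 0.01,
                            autovacuum_vacuum_threshold = 1000);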

Linux uses an
out-of-memory killer
(oom), which is designed to kill processes:

If your memory is exhaustively used up by the processes to the extent which can possibly threaten the stability of the
system, then the oom killer comes into picture. It is the task of the oom killer to kill the processes until enough
memory is freed for the smooth functioning of the rest of the process.

While killing processes is never good, it is better than having the system halt due to memory exhaustion. Sometimes the oom
killer kills Postgres, and that isn't a good thing either. This
email
thread explains how to use ulimit to cause Postgres
sessions that consume a lot of memory to fail due to excessive memory requests. This avoids having them continue and be killed by the
oom killer, which causes the
entire database
server to restart. The Postgres
documentation also explains the behavior of
the oom killer.

It is common knowledge that tablespaces allow you to place
frequently-accessed data on fast storage, and archive-level data on slower storage. This is even mentioned in our documentation.

One additional aspect I had never considered is that you can store archive-level data on slower and compressed file systems. Most file
systems support some kind of compression. While it doesn't make sense for frequently-accessed data, it seems perfect for archive-level
data.
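
A sketch, assuming a compressed file system is mounted at a hypothetical path:

CREATE TABLESPACE archive LOCATION '/mnt/compressed/pgarchive';
ALTER TABLE old_events SET TABLESPACE archive;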

Docker uses Linux containers (lxc)
to allow application deployment in a pre-configured environment. Multiple such isolated environments can run on top of the same
operating system. Docker is ideal for rapid, reproducible deployments.

How that relates to database deployments is an open question. The full power of Docker is that everything is in the container and the
container can be easily destroyed and recreated. Because databases require a persistent state, a hybrid approach is necessary. Typical
Docker database deployments use Docker for database installation and setup, and persistent storage for persistent state. This
email thread
explains the benefits of Docker for databases.

On the server side, high availability means having the ability to quickly
failover to standby hardware, hopefully with no data loss.
Failover behavior on the client side is more nuanced. For example, when failover happens, what happens to connected clients? If no
connection pooler is being used, clients connected to the failed
machine will need to reconnect to the new server to continue their database work. Failover procedures should guarantee that all
connections to the failed server are terminated and that no new connections happen.
(Reconnecting to the failed server could produce incorrect results and lost changes.) If a client is connected to a standby that is
promoted to primary,
existing client connections and new connections are read/write.

Clients connect to the new primary via operating-system-specific methods, usually either
virtual IP addresses (vip,
good blog entry) or dns entries with a short
time to live (ttl). This is normally accomplished using dedicated
high-availability or clustering software. Postgres 10 will also allow
multiple host names to be tried by clients.
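
For example, a hypothetical libpq connection string that tries hosts in order until it finds a read-write server (Postgres 10 syntax):

psql 'host=primary.example.com,standby.example.com dbname=mydb target_session_attrs=read-write'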

For clients using a connection pooler, things are even more complicated. Logically, you would think that, since clients didn't connect
directly to the failed server, they should be able to continue their queries in the same session uninterrupted. Generally, this is not
the case.

First, the client might have been running a query during the failure — therefore, a query error code needs to be returned to the client
so the client can retry the query (assuming query retry logic is built into the application). Second, the session might have been in the
middle of a transaction block, meaning it has to resubmit all the queries that were part of the transaction (again, assuming the client
has such retry logic). Third, several server configuration
parameters might have been modified at the session level. The client might be depending on these settings for proper operation, so they
would need to be re-modified in the new session. Fourth, there is additional session state that might need to be recreated. You can get
an idea of possible session state modifications by reviewing the
discard manual page.

If this sounds like a boat-load of challenges to allow clients using a connection pooler to continue uninterrupted, it is. None of the
Postgres pooling tools allows uninterrupted failover. Pgpool automatically
disconnects all clients and allows them to reconnect. (In
pgpool 3.6+, clients connected to still-running standby servers are not affected.) PgBouncer
does not detect failover.

To implement even a limited case where failover appears uninterrupted to clients, connection poolers would have to record session state
changes, client queries, and even client results because many queries depend on the results of previous queries, e.g. select ... for
update. Even if all this tracking was performed, changed query results would make uninterrupted behavior impossible. Given all this
complexity, it isn't surprising that the standard approach to failover is for clients to be programmed to handle such cases by
reconnecting, resetting their session state, and re-running any queries that future queries depend on.

With the addition of logical replication in Postgres 10, we get
a whole new set of replication capabilities. First, instead of having to replicate an entire cluster, as streaming replication does, you can
replicate specific tables. With this granularity, you can broadcast a single table to multiple Postgres databases, or
aggregate tables from multiple servers on a single server. This provides new data management opportunities.
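
A minimal sketch of the Postgres 10 commands (the names and connection string are hypothetical):

-- on the publishing server
CREATE PUBLICATION mypub FOR TABLE customers;

-- on the subscribing server
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=pub-host dbname=mydb'
    PUBLICATION mypub;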

Another big advantage of logical replication is migrating between major Postgres versions. If both major Postgres versions support
logical replication, you can set up logical replication between them and then switch over to the new major-version Postgres server with
only seconds of downtime. It also allows you to downgrade back to the old major version and replay any changes made on the upgraded
server, assuming logical replication is still working properly.

Quicker upgrade switching and the ability to downgrade in case of problems have been frequent feature requests that
pg_upgrade has been unable to fulfill. For users who need this, setting
up logical replication for major version upgrades will certainly be worth it.

For wal archiving, e.g.
archive_command, you are going to need to
store your wal files somewhere, and, depending on how often you take base backups, the archive might be very large.
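
For example, the classic local-copy setup from the documentation, in postgresql.conf (the archive path is hypothetical):

archive_mode = on
archive_command = 'cp %p /mnt/archive/%f'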

Most sites that require high availability have both a local standby in the same data center as the primary, and a remote standby in case
of data center failure. This brings up the question of where to store the wal archive files. If you store them in the local
data center, you get fast recovery because the files are quickly accessible, but if the entire data center goes down, you can't access
them from the remote standby, which is now probably the new primary. If you store your wal archive files remotely, it is
difficult to quickly transfer the many files needed to perform point-in-time recovery.

My guess is that most sites assume that they are only going to be using wal archive files for local point-in-time recovery
because if you are running on your backup data center, doing point-in-time recovery is probably not something you are likely to do soon
— you probably only want the most current data, which is already on the standby. However, this is something to consider, because
with lost wal files you will need to take a base backup soon so that you can do point-in-time recovery in the future.

When the ability to run queries on standby servers (called
hot_standby) was added to Postgres,
the community was well aware that there were going to be tradeoffs between replaying
wal cleanup records on the standby and canceling standby queries
that relied on records that were about to be cleaned up. The community added
max_standby_streaming_delay
and
max_standby_archive_delay
to control this tradeoff. To completely eliminate this tradeoff by delaying cleanup on the primary,
hot_standby_feedback was added.

So, in summary, there is no cost-free way to have the primary and standby stay in sync. The cost will be either:

Standby queries canceled due to the replay of wal cleanup records

Stale standby data caused by the delay of wal replay due to cleanup records

Delay of cleanup on the primary

The default is a mix of the first two options, i.e., to wait for 30 seconds before canceling queries that conflict with about-to-be-applied
wal records.

Unfortunately, there is always a cost in running hot standby queries, but at least with Postgres, you can control it. Fortunately, you
can also monitor replay delay using
pg_stat_replication.

In streaming replication, wal is
streamed from the primary to the standby. For a variety of
reasons, including
max_standby_streaming_delay,
there can be a significant delay for a commit on the primary to appear in read-only queries running on the standby. The
pg_stat_replication view shows the
possible delays:

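-- a sketch using Postgres 10 column names; the lag columns show how far
-- each standby trails the primary
SELECT application_name, sent_lsn, replay_lsn, replay_lag
FROM pg_stat_replication;
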
Hstore was added to Postgres in
8.2 (2006). I didn't appreciate the purpose of hstore when it was
first introduced. I knew it was a per-field key/value store, but not much else. People would come up to me at conferences and say
"Hstore is the best," but I could only reply with "Yeah." Only later did someone explain to me the usefulness of being able to store
unstructured data in a database.

Of course, with the popularity of NoSQL, everyone now understands the value of storing unstructured data. They probably also know that
being able to store structured (relational) and unstructured data in the same database, like Postgres, is a big win.

Unfortunately, hstore, the Postgres extension that started it, hasn't kept up. When
jsonb was added in Postgres
9.4 (2014), it superseded the capabilities of hstore. Jsonb
is a key/value store, but allows for hierarchical storage too, and uses the popular json
syntax. Some of the performance improvements made to the jsonb data type could not be backpatched to hstore, due to
compatibility requirements.

At this point, unless you have a backward-compatible need for hstore, you should be using jsonb. Existing users of hstore
would probably benefit from moving to jsonb, when convenient.

Now that everyone is using IPv6, it might be
time to start playing with it. Postgres has had full IPv6 support for years, so Postgres is a good place to start, particularly with
IPv6-aware data types.

Since IPv6 addresses are 128-bits
instead of IPv4's 32-bits, they can be quite long, e.g. 2001:0db8:85a3:0000:0000:8a2e:0000:7334. As you can see, it is made up of eight
quad-hex segments, separated by colons. To shorten the text representation, leading zeros in any quad-hex segment can be removed, though
an all-zero quad still requires a zero. In addition, the longest string of all-zero quads can be abbreviated with double colons. This
can be illustrated in Postgres:

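SELECT inet '2001:0db8:85a3:0000:0000:8a2e:0000:7334';
            inet
----------------------------
 2001:db8:85a3::8a2e:0:7334
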
In the output, 0db8 became db8, and :0000:0000: became ::, and the final 0000 became 0. An address with many leading zeros,
e.g. localhost (0000:0000:0000:0000:0000:0000:0000:0001), gets dramatically shortened using these rules:

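SELECT inet '0000:0000:0000:0000:0000:0000:0000:0001';
 inet
------
 ::1
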
This highlights perhaps the most confusing aspect of IPv6 addresses — you can't just visually compare two IPv6 addresses to check for
equality, like you can for IPv4. You must use the IPv6 rules for comparisons.

Use of colons is strictly for IPv6 addresses, i.e., 1::127 represents a 128-bit IPv6 value, not a 32-bit IPv4 one, as illustrated by the
family() function:

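SELECT family(inet '1::127');
 family
--------
      6
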
You might be aware of the interval data
type, which allows mathematical operations on date, time, and
timestamp values. This can lead to odd behavior, but this
email posting showed me a new oddity when dealing with
months containing a different number of days. First, let's summarize how many days are in the first five months of 2017:

2017-01 31
2017-02 28
2017-03 31
2017-04 30
2017-05 31

Let's add four months to the last day of January, 2017 in two different ways:

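-- a hypothetical reconstruction of the two calculations
SELECT '2017-01-31'::date + '4 months'::interval;
      ?column?
---------------------
 2017-05-31 00:00:00

SELECT '2017-01-31'::date + '1 month'::interval + '1 month'::interval +
       '1 month'::interval + '1 month'::interval;
      ?column?
---------------------
 2017-05-28 00:00:00
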
It seems that once the calculation lands on a lower-numbered day of a month, because the month doesn't have as many days as the previous
stop, it never advances past that day number, even if later months have more days. Let's be more overt about this:

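-- a hypothetical reconstruction; note the three different results
SELECT '2017-01-31'::date + '1 month'::interval + '3 months'::interval;
-- 2017-05-28 00:00:00
SELECT '2017-01-31'::date + ('1 month'::interval + '3 months'::interval);
-- 2017-05-31 00:00:00
SELECT '2017-01-31'::date + '3 months'::interval + '1 month'::interval;
-- 2017-05-30 00:00:00
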
All three queries produce different results. This shows that addition with such values is neither
associative (queries one and two) nor
commutative (queries one and three). Real number addition is both associative
and commutative, so it is confusing that addition with intervals is not for some values, though it is for others. (Other interval
operations have similar issues, e.g. subtraction.)

While the results are surprising, it is unclear how Postgres could do any better without adding some very complex logic — so, developers
beware.

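Consider this hypothetical session using a plain timestamp column:

SET TIME ZONE 'America/New_York';
CREATE TABLE tztest1 (ts timestamp);
INSERT INTO tztest1 VALUES ('2017-08-01 20:00:00');
SET TIME ZONE 'Asia/Tokyo';
SELECT * FROM tztest1;
         ts
---------------------
 2017-08-01 20:00:00
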
Notice that the time has not changed even though the session
time zone has changed. A timestamp
date type specification defaults to timestamp without time zone, to match the sql specification. Contrast the above output
with using timestamp with time zone:

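-- a hypothetical example, matching the session above
SET TIME ZONE 'America/New_York';
CREATE TABLE tztest2 (ts timestamp with time zone);
INSERT INTO tztest2 VALUES ('2017-08-01 20:00:00');
SET TIME ZONE 'Asia/Tokyo';
SELECT * FROM tztest2;
           ts
------------------------
 2017-08-02 09:00:00+09
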
Notice that when the session time zone changed, the date and hours were adjusted and the time zone offset suffix changed. This allows
users from multiple time zones to insert into the same table and for viewers to see all rows in their local time zone.

Timestamp with time zone is implemented by adjusting all values to
utc, meaning there is no storage of the original time zone name or
its offset. If you need to record the original time zone offset, you can store the output of SELECT EXTRACT(timezone FROM
CURRENT_TIMESTAMP) in a separate column.

PL/Java has been around since 2005, but it has regularly struggled to gain users. Unfortunately, a
lot of these problems are specific to the Java language and hamper its adoption.

First, there are a limited number of people who know both Java and the Postgres backend code. Fortunately
Chapman Flack has
recently resumed PL/Java development.

Second, there is the deployment complexity of binding PL/Java to a Java Runtime
Environment (jre) in an easily-deployed way. This 2014
email
thread discusses the problems of jre packaging and installation.

Third, PL/Java functions must be compiled into jar files and
installed on the server, similar to server-side
C functions. While this isn't a huge hurdle, it does hamper PL/Java when
competing against interpreted languages like PL/pgSQL and
PL/Perl.

Fourth, every stored procedure language has to
compete with
PL/pgSQL, which has good alignment with the scripting needs of the
sql language.

However, the problems go deeper. Because PL/Java isn't interpreted, it must be compiled, but compiling doesn't produce a binary that can
be run on the cpu. Instead, a Java Virtual Machine (jvm) is
required and it must be installed in a place that can easily receive function arguments and return results. An early attempt was
pl/j, which uses a single jvm process — each backend connects to the jvm process via
ipc and runs in its own thread. Unfortunately ipc
overhead, context switching, and security concerns doomed this attempt.

PL/Java takes a different approach by placing a jvm in each database backend process and using the
Java Native Interface (jni) to pass in function arguments and return
results. While this reduces communication overhead, it causes backend processes to use much more memory because each backend has
its own jvm, and each jvm must be started the first time a PL/Java function is called. (Connection pooling can reduce
the startup overhead.) The bottom line is that stored procedures have short run-times, and the heavy nature of Java isn't a good match.

Ultimately, with all these headwinds against PL/Java, it is hard for it to get traction. A lot of people like the idea of PL/Java because
they can reuse their Java skills and for portability with
other databases that support Java stored procedures. The problem is that once they hit these headwinds, they start looking around. They
realize that other languages are a better match for sql, that there isn't that much
code reuse from the client-side,
and they switch to another
server-side language. I am sure there are some things that only Java can do well, but those use-cases are not common in server-side
functions. (Contrast that with pl/r, which definitely does things no other server-side
language can do.)

In summary, the slow adoption of PL/Java isn't an accident, but the result of multiple challenges that hamper its adoption. If PL/Java is
to get more popular, these headwinds must be addressed.

The use of server-side logic, particularly stored procedures, has been a highly contentious topic among database professionals for
decades. The question has always been what amount of logic should be
encoded in the database vs. in client applications or application
servers.

Beyond using the database as a simple data store, there are three levels in which logic can be added to Postgres:

Database constraints are often seen as optional by database application developers. They don't really do anything except prevent
invalid data from being entered into the database. The overhead of
performing these checks is often seen as negative, particularly foreign key checks. (Default clauses that call stored procedures
are similar to triggers; the serial data type uses
this.)

Triggers are stored procedures assigned to tables that are executed automatically during insert, update, delete, or
truncate commands. They can be triggered before the command runs, usually for complex constraint checks or to modify incoming
data, e.g. capitalization. When triggered after the command, they perform post-command operations, such as adding entries to a log table.
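
For example, a minimal sketch of a before trigger that capitalizes incoming data (the table and column names are hypothetical):

CREATE FUNCTION capitalize_name() RETURNS trigger AS $$
BEGIN
    NEW.name := initcap(NEW.name);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER capitalize_name
    BEFORE INSERT OR UPDATE ON customers
    FOR EACH ROW EXECUTE PROCEDURE capitalize_name();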

Manually-called stored procedures are functions usually called in where clauses or in the target list of select
queries. Stored procedures, including ones used as triggers, can be written in
many languages.

Now, on to the question of when to use these features. Some Postgres users use none of these features for reasons of performance and
portability. Others use all of them, and require applications to call stored procedures to perform tasks, rather than issuing
sql statements. This allows database administrators to control every aspect of database access. For most users, the best
solution is something in the middle, and figuring out when to use what can be tricky.

For database constraints, it is often possible to perform constraint checks in the application, rather than in the database. While this
is possible for most check constraints, it is hard to do for unique constraints because multi-user applications rarely
share state except inside the database. If database constraints are not used, applications and manually-issued sql queries must
perform the checks, and any mistakes must be detected later and corrected. If multiple applications access the database, they must
perform the checks in the same way — this is particularly difficult if they are written in different languages. Application
upgrades also require constraint synchronization in multi-application environments.

Triggers can be avoided by having applications perform the checks and queries that triggers would have performed. Performing data checks
application-side has the same downsides as avoiding database constraints. Additional application queries necessary when triggers are
avoided can lead to slower performance due to network round-trip delays.

Avoiding manually-called stored procedures requires all logic to be in the application. This can lead to serious slowdowns because if a
function cannot be used in a where clause, the entire data set must be transferred to the client application where filtering can be
done.

Ultimately, the decision of when to use server-side logic revolves around efficiency — efficiency of hardware utilization,
efficiency of development, and efficiency of data management. Your use of server-side logic will be dependent on which area of efficiency
is most important to you. This
email thread outlines
many of the tradeoffs seen by Postgres users.

Oh, how I love the title of this 2014 Slashdot request,
"Which NoSQL Database For New
Project?" The user already knows the type of clients (iPhones and Android phones) and the middleware (Php/Symfony or
Ruby/Rails) and then comes the zinger, "I would like to start with a NoSQL solution for scaling …." In addition to the
question of whether Ruby on Rails is a scalable solution, the replies
are illuminating, and even funny, e.g. "I'm working on a new
independent project. It will soon become the new Facebook, and I'll be billionaire next quarter. The only problem is that I don't know
which luxury yacht to buy with all this money."

OK, now on to the serious replies, which are many and seem to come from seasoned Sql and NoSql veterans; they all fall
under the heading of premature optimization and scaling:

* NoSQL solutions can be ridiculously fast and scale beautifully over
billions of rows. Under a billion rows, though, and they're just different from normal databases in various arguably-broken ways. By the
time you need a NoSQL database, you'll be successful enough to have a well-organized team to manage the transition to a different backend.
For a new project, use a rdbms, and enjoy the ample documentation and resources available.

* However, like a lot of other posters, I'm very sceptical that NoSQL
is the place to start. SQL databases can do a LOT for you, are very robust and can scale very considerably. As your requirements grow you
might find yourself wanting things like indexes, transactions, referential integrity, the ability to manually inspect and edit data using
SQL and the ability to store and access more complex structures. You're likely to give yourself a lot of pain if you go straight for
NoSQL, and even if you DO need to scale later combining existing SQL and new NoSQL data stores can be a useful way to go.

* So many developers start with the phrase "I need NoSQL so I can
scale" and almost all of them are wrong. The chances are your project will never ever ever scale to the kind of size where the NoSQL
design decision will win. Its far more likely that NoSQL design choice will cause far more problems (performance etc), than the
theoretical scaling issues. … NoSQL does not guarantee scaling, in many cases it scales worse than an SQL based solution.
Workout what your scaling problems will be for your proposed application and workout when they will become a problem and will you ever
reach that scale. Being on a bandwagon can be fun, but you would be in a better place if you really think through any potential scaling
issues. NoSQL might be the right choice but in many places I've seen it in use it was the wrong choice, and it was chosen base on one
developers faith that NoSQL scales better rather than think through the scaling issues.

* Given 3 trillion users your options are pretty much limited to
horizontal scaling, no SQL etc. but most people never get that far with their applications and in that case, storing the data in a noSQL
database and then getting actionable information out of it (which is the hardest part IMO) is a lot of effort spent for something much
cheaper and easier done with an rdbms.

* "Why not" is because the cost/benefit analysis is not in NoSQL's
favor. NoSQL's downsides are a steeper learning curve (to do it right), fewer support tools, and a more specialized skill set. Its primary
benefits don't apply to you. You don't need ridiculously fast writes, you don't need schema flexibility, and you don't need to run complex
queries on previously-unknown keys.

* It's a mistake to think that "NoSQL" is a silver bullet for
scalability. You can scale just fine using MySQL (FlockDB) or Postresgl if you know what you're doing. On the other, if you don't know
what you're doing, NoSQL may create problems where you didn't have them.

* Databases don't scale for people who don't understand SQL, don't
understand data normalization, indexing and want to use them as flat files. Unfortunately, a way too common anti-pattern :(

* So default answer to "Which NoSQL database should I use?" is always
"Don't use NoSQL."

There were many positive comments about Postgres, both from a relational database perspective and touting its scalability and NoSQL-like
features. One poster even wrote about their accounting system using
Mongo that they are porting to Postgres.

In a February blog post I talked about the mismatch between what
people expect from Postgres in terms of hints, and what exists. In this blog post I would like to cover the more general case of when
people should expect feature parity with their previous database, and when such expectations are unreasonable.

First, imagine if every database had the features of every other database — that would be great for compatibility but terrible for
usability. You would have so many ways of doing something, with slightly different behaviors, that development and administration would
be much more difficult. Of course, Perl has its "There's more
than one way to do it," but at least that allows tradeoffs between clarity and conciseness, and allows different programming styles.
Having all features in every database would have few of those benefits. Also consider that some features are mutually exclusive, so this
would be impossible. Therefore, we have to make tradeoffs in the features each database system supports.

Let's think of Postgres compatibility at three levels: sql, tuning, and monitoring. At the sql level, you should
expect parity between Postgres and your previous database. The syntax might be different, but all the capabilities should be similar,
e.g. if your application used save-points with the previous database, Postgres should support that, and in almost all cases, it does.

For tuning, parity is more murky. The internals of database systems differ greatly, so the tuning requirements will differ — and let's
be frank — the less tuning you have to do to attain good performance, the better, so having a tuning item missing in Postgres might be
a good thing. Postgres might have good performance without requiring that tuning knob. Postgres might require tuning that wasn't required
in your previous database system — that is a negative. So, if Postgres doesn't have a tuning knob you had in your previous database,
that might be good (Postgres auto-tunes it), or it might be bad (you can't adjust Postgres to achieve good performance for your workload).
(I previously covered the tradeoffs of adding performance settings.)

Monitoring has a similar murky outlook. Sometimes monitoring is required to meet organizational needs, e.g. what are the currently
running queries. For most organizational monitoring needs, Postgres has the required features, whether it is
log_line_prefix, pg_stat_statements, or external tools like
pgBadger or PGAudit. Often, Postgres tooling isn't as polished or
as easy to use as tools in more-established database systems, but the tools exist, and are often more flexible.

Just like tuning, sometimes monitoring that was required in your previous database system isn't required in Postgres. For example,
Postgres's streaming replication is rock-solid, so there is no need to monitor for streaming replication corruption. Similarly, many
organizations built elaborate monitoring of their previous database systems to check for optimizer plan and statistics changes. These are
often to avoid optimizer bugs in their previous systems that caused problems. However, you can't assume this kind of monitoring is
necessary for Postgres. Postgres isn't bug-free, but it might require different monitoring than what was required in the previous system.

In summary, Postgres is different from your previous database system. Hopefully it supports all the necessary features your applications
require, but it will not exactly match your previous tuning and monitoring requirements. It is necessary to approach Postgres with an
open mind to make the best use of your new database system.

Historically, most relational database systems supported raw devices, i.e., the ability to write data directly to the storage subsystem to
avoid file system overhead. Raw device support was added to databases 20-30 years ago when cpus were much slower, and file
systems were much less efficient.

Modern recommendations discourage the use of raw devices unless every last bit of performance is required. Raw devices improve
performance perhaps a few percentage points, but are difficult to administer and resize. The other problem is that raw devices have to be
configured when the database cluster is initialized, meaning you can't easily remove raw devices or add them later. This makes it
difficult to know if raw devices would even help your workload.

Postgres has never supported raw devices, and probably never will. Modern file systems are so good, and the flexibility of file system
storage so powerful, that raw devices just don't make sense for modern Postgres.

Postgres does all heap and index page operations in a shared memory area called
shared buffers. Data is
read into shared buffers by requesting 8k blocks from the kernel, which will either satisfy reads from the kernel cache or retrieve them
from storage devices. Writes from shared buffers are sent to the kernel, which are eventually written to permanent storage. The
write-ahead log (wal) allows writes to be performed
asynchronously; this is illustrated in this presentation.

New users are often surprised that Postgres uses the kernel cache for reads and writes, but there are advantages. While the shared buffer
size is fixed at server start, the kernel cache is resized based on the amount of unused memory in the system. This
blog post explains how to determine the size of the Linux kernel cache.

Many database systems use
direct I/O
to read and write data, which bypasses the kernel cache. This has two advantages:

Avoids the double-copy of data from storage to the kernel cache and then from the kernel cache to shared buffers

Avoids double buffering (storing) of data in the kernel cache and shared buffers

However direct I/O has some disadvantages:

Prevents the kernel from reordering reads and writes to optimize performance

Does not allow free memory to be used as kernel cache

In summary, direct I/O would improve performance in cases where shared buffers is sized properly, but it would dramatically decrease
performance for workloads where shared buffers is missized. This
email report confirms this analysis. Postgres
does use direct I/O for wal writes
(if supported by the operating system) because wal must be flushed immediately to storage and is read only during crash recovery,
so kernel buffering is useless. (It is also read by the
walsender and
archive_command.)

Someday Postgres might support an option to use direct I/O for data files, but the downsides make it unlikely it would be enabled by
default.

Postgres has a flexible tablespace feature that allows you to
place data in any filesystem accessible from the database server, whether it is
directly attached storage (das), on a nas or
san, a directory in /tmp, or on a ramdisk in memory.
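For example, a minimal sketch (path and names hypothetical):

CREATE TABLESPACE fastspace LOCATION '/ssd1/postgresql/data';
CREATE TABLE big_table (x INTEGER) TABLESPACE fastspace;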

You might be tempted to put data in any of these places, with the idea that if one of the storage areas goes away, either through hardware
failure or server reboot, Postgres will continue to function. Unfortunately, that is not the case. If a tablespace disappears or is
erased, Postgres will have problems because system tables will now point to nonexistent files. Regular maintenance operations, like
autovacuum, will throw errors. The wal might also still contain references to the missing files, preventing crash recovery from
completing.

Perhaps someday Postgres will have the concept of transient tablespaces, but right now it doesn't, so only create tablespaces on durable
storage. Even temp_tablespaces (used
to store temporary objects) cannot be placed on transient
storage.

Effective_io_concurrency
controls how many concurrent requests can be made by
bitmap heap scans. The default
is 1, which means no concurrency. For magnetic disks, this can be increased, perhaps to 8, but not much higher because of the
physical limitations of magnetic disks. However, for ssds, this can be increased dramatically, perhaps in the hundreds.
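For example (the values are illustrative starting points, not recommendations):

-- magnetic disk
SET effective_io_concurrency = 8;
-- ssd
SET effective_io_concurrency = 200;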

I have reorganized my twenty active presentations to be easier to find. I was getting confused,
so I assume others were as well. The new categories seem clearer and more closely match the
categories I use for Postgres blog entries. I have reordered items within
categories. I have also indicated more presentations that can be given in pairs.

I have completed the draft version of the Postgres 10 release notes.
Consisting of 180 items, I think you will be impressed with the substance of the improvements. The release notes will be continually
updated until the final release, which is expected in September or October of this year. (I am speaking tomorrow about Postgres 10 in
Boston.)

Fyi, future major versions of Postgres will consist of a single number, e.g. Postgres 11 will be released in 2018. This is
mentioned in the release notes as well.

Modern systems offer several storage options, and databases are very sensitive to the I/O characteristics of these options. The simplest
is direct-attached storage (das), e.g.
sata, sas. It
is the simplest because it uses a dedicated connection between the server and storage.

A second option is a storage area network (san), which provides remote block storage over a dedicated storage network.

A third option is network-attached storage (nas), e.g.
nfs. While nas is networked storage like san, it
offers a remote file system to the server instead of remote storage blocks. (The fact that the san/nas acronym
letters are just reversed only adds to the confusion. I remember it as (block) "storage"
being the first letter of san.)

So, which is best for you? Well, das is the simplest topology because all storage is connected to a single server, and it is
also often the fastest. Sans add flexibility by replacing the direct connection with a network, e.g.
fcp,
tcp/ip. This allows for a larger number of devices to be attached
than das, and if a server fails another server can access the device's block storage and restart.

Nas is quite different because it exports a file system to the server. While nas-exported file systems can be mounted
by multiple servers simultaneously, that doesn't help Postgres because only a single server can safely access the Postgres data directory.
Nas usually includes sophisticated management tooling, backup capabilities, and caching layers. Because multiple servers usually
access a nas, contention can be a problem, though ease of administration can be a big benefit for large organizations.

With streaming replication, Postgres allows
sophisticated setups of primary and standby servers. There are two ways to promote a standby to be the new primary. A switchover is
when the change happens in a planned way:

1. All clients are disconnected from the master to prevent writes

2. A sufficient delay allows the final write-ahead log (wal)
records to be transferred to all standbys (also performed by step 3)

3. The primary is cleanly shut down

4. The standby is promoted to be the primary

A failover happens when the steps above can't be performed, usually because the primary has failed in some catastrophic way. The major
difficulty with failover is the possibility that some of the final database changes contained in the wal are not transferred to
standbys, unless
synchronous_standby_names
was used. When a standby is promoted to primary after a failover, the final missing wal records can cause problems:

Some transactions on the old primary that were acknowledged to clients might be lost

If the old master needs to be reconnected as a standby without reimaging, it might be necessary to use
pg_rewind

Make sure you practice both methods of promoting a standby so, when you have to do the promotion in production, you are ready.
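For reference, the promotion step itself is simple; it is everything around it that needs practice. A sketch, with a hypothetical data directory:

pg_ctl promote -D /usr/local/pgsql/data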

When using continuous archiving, you must restore a file system
backup before replaying the wal. If the file system backup was taken long ago, wal replay might take a long time. One
way to avoid this is to take file system backups frequently.

Another option is to perform an incremental file system backup that can be laid over the original file system backup, then replay
wal over that. This reduces restore time because you only need to replay wal from the start of the incremental backup,
not the start of the full backup. This also reduces the amount of wal that must be retained.

However, Postgres doesn't natively support incremental backup. The best you can do is to use a tool like
pgBackRest or
Barman that supports incremental backup at the file level. The only problem
is that the database files are potentially one gigabyte in size, so the granularity of the incremental backup isn't great. Ideally
solutions will be developed that do page-level (8k) incremental backups, which would be much smaller. The trick is finding an efficient
way to record which 8k pages have been changed since the last file system backup.

You probably have heard the term "checkpoint" before, or seen it
mentioned in the postgresql.conf
file. A checkpoint is a usually-invisible cleanup feature present in most database systems, but it is useful to know what it does.

This diagram illustrates checkpoints. At the top are three
Postgres database sessions. Each session reads and writes to the
shared buffer cache. Every
modification to shared buffers also causes a change record to be written to the
write-ahead log (wal,
blog entry). Over time the wal would grow unbounded in size if
it were not trimmed occasionally — that is what checkpoints do.

A checkpoint writes previously-dirtied shared buffers to durable storage over a period of several minutes, at which point the wal
representing those writes is no longer needed for crash recovery. (Hopefully
continuous archiving and
streaming replication have also processed those
wal files.) Therefore, the old wal can then be removed or recycled.

This diagram illustrates the process. In the diagram, 1
marks three dirty buffers in the shared buffer cache at the start of the checkpoint. During the checkpoint, additional buffers are
dirtied, 2, and the wal pointer is advanced. At the end of the checkpoint all dirty 1 buffers have been
written to durable storage and the old wal files are moved to the end, to be reused.

This all happens automatically, though it can be
tuned and
monitored. Fortunately, Postgres
doesn't have any pathological behavior related to checkpoints, so most administrators never have to think about it.
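For those who do want to adjust it, the commonly-tuned settings live in postgresql.conf; the values below are the defaults, shown as a sketch:

checkpoint_timeout = 5min               # maximum time between checkpoints
max_wal_size = 1GB                      # wal volume that forces a checkpoint
checkpoint_completion_target = 0.5      # fraction of the interval used to spread writes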

The write-ahead log (wal) file format changes with every major
release. Also, initdb, which is required to install the new major
version, starts wal file numbers at 000000010000000000000001.

For these reasons, if you are
archiving wal, it is wise
to use the major version number in the name of the wal archive directory, e.g. /archive/pgsql/9.6. This avoids the problem of
wal from an old Postgres major version conflicting with wal files from a new major version.
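For example (the archive path is hypothetical):

archive_mode = on
archive_command = 'cp %p /archive/pgsql/9.6/%f'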

You can see the insert into the heap, the index insert, and the transaction commit record. When viewing pg_xlogdump output, remember
that while transaction ids are assigned in start order, higher-numbered shorter transactions can commit before lower-numbered longer
transactions (see this slide). For example, transaction 32 commits before
transaction 30 because transaction 30 runs much longer than 32. (This can make choosing recovery_target_xid particularly tricky.)

To generate more detailed wal activity, set
wal_level to logical and use
test_decoding to view logical wal information:

You might be aware that the sql standard reserves certain identifiers that cannot be used for naming user objects. Postgres
follows that standard, with slight modifications. For example, you cannot create a table called all:

CREATE TABLE all (x INTEGER);
ERROR: syntax error at or near "all"
LINE 1: CREATE TABLE all (x INTEGER);
^

It is actually very easy to find what identifiers are reserved because they are listed in the
documentation. More interestingly, they are also accessible
via the sql function
pg_get_keywords():
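For example, a query like this lists only the fully-reserved keywords:

SELECT word FROM pg_get_keywords() WHERE catcode = 'R' ORDER BY 1;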

The first two are sql standard, the third one is a Postgres-ism that is often convenient, and the final one relies on the
existence of named functions to do the conversion. Some of the more complex data type specifications have shortcuts, e.g. timestamp
with time zone can use ::timestamptz.

While the sql standard allows multiple nulls in a unique column, and that is how Postgres behaves, some database systems
(e.g. ms sql)
allow only a single null in such cases. Users migrating from other database systems sometimes want to emulate this behavior in
Postgres. Fortunately, this can be done. First, let me show the default Postgres behavior:

The i_nulltest2 index allows only one null y value for each x value. This can actually be useful in certain data models.
This illustrates how expression and partial index features can be combined for some interesting effects.
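A sketch of the approach (the table contents are assumptions; i_nulltest2 matches the index discussed above):

CREATE TABLE nulltest (x INTEGER, y INTEGER);
-- allow only one null y in the entire table (expression plus partial index)
CREATE UNIQUE INDEX i_nulltest ON nulltest ((y IS NULL)) WHERE y IS NULL;
-- or allow only one null y value for each x value (partial index)
CREATE UNIQUE INDEX i_nulltest2 ON nulltest (x) WHERE y IS NULL;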

Many database administrators use databases as simple data stores. However, relational systems can do much more, with advanced querying,
analysis, and transaction control capabilities. Another area that is often overlooked is constraints. Constraints allow new and updated
data to be checked against defined rules, preventing changes that would violate those rules.

Constraints are odd in that they don't do anything if the data is consistent — it is more like an insurance policy against
invalid data being entered into the database. If constraints are missing, there often are no initial problems, but over time erroneous or
unexpected data gets in, causing problems with applications and reporting.

Do yourself a favor the next time you create a table —
take the insurance and create useful check, not null, default, unique, primary key, and
foreign key constraints. If your tables are already created, you can use
alter table to add constraints to existing tables.
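A sketch of a table using all of these (the tables and columns are hypothetical):

CREATE TABLE orders (
        id BIGINT PRIMARY KEY
);
CREATE TABLE order_item (
        id        BIGINT PRIMARY KEY,
        order_id  BIGINT NOT NULL REFERENCES orders (id),        -- foreign key
        sku       TEXT NOT NULL,                                 -- not null
        quantity  INTEGER NOT NULL DEFAULT 1 CHECK (quantity > 0),
        UNIQUE (order_id, sku)                                   -- unique
);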

SELECT random() AS confusion
FROM generate_series(1,10)
WHERE confusion > 0.5;
ERROR: column "confusion" does not exist
LINE 3: WHERE confusion > 0.5;

It is because the order in which select clauses are evaluated is specified by the sql standard, and it isn't top to
bottom. Tom Lane's excellent email post goes into the
details. The
thread
also mentions the
unfortunate effect that users of union often want the behavior of union all.

Understanding how character sets, encodings, and collations work together can be confusing. I would like to explain them in this blog
post:

Character Sets

Postgres databases can be initialized to support a variety of character sets. A character set is the full set of characters or glyphs
that can be represented. Popular character sets are ascii (128
characters), Latin1 (ISO 8859-1, 256 characters), and Unicode (128k+ characters).

Encodings

Encoding is the way a character set is stored. For single-byte character sets like ascii and Latin1, there is only a single way
to encode the character set, i.e., as a single byte. For more complex multi-byte character sets there can be several ways to encode the
character set. For example, Unicode can be encoded as UTF-8 (8-bit granularity), UTF-16
(16-bit), or UTF-32 (32-bit). (Postgres only supports UTF-8 for Unicode, server-side.) Asian languages also often support multiple
encodings for a single character set, e.g. Big5, GB2312, Shift-JIS, EUC-JP.

Collations

Collations specify the order of characters in a character set. Again, for single-byte character sets, there is usually only one possible
collation, which uses encoded byte values to provide the ordering, though this can lead to odd orderings, e.g. Z (uppercase) ordered
before a (lowercase), or z ordered before á. For a complex character set like Unicode, the user can often select the
desired collation.

It is possible to illustrate different collations of the same character set using Unicode:
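A rough sketch of such a comparison (the values are hypothetical, and the en_US locale is assumed to be installed):

SELECT s FROM (VALUES ('a'), ('B'), ('_c'), (' b')) AS t(s) ORDER BY s COLLATE "en_US";
SELECT s FROM (VALUES ('a'), ('B'), ('_c'), (' b')) AS t(s) ORDER BY s COLLATE "C";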

You can see in the first two queries that collation en_US ignores case in letter comparisons, and ignores underscores. The last query,
using C collation, reverses the order because the byte values for uppercase letters (C) are less than lowercase letters (b), and the
underscore character is lower than lowercase letters (a).

In the first query, while a single space is ordered before a, it does not cause space-b to order before a; it does in the C
collation. Things get more complicated with letters with accents, tildes, and pictographic languages.

Postgres relies on the operating system locale to support character sets, encodings, and collations. On Linux, you can see the list of
supported locales via locale -a. While
initdb sets the default locale for the cluster based on environment
variables, the locale can be overridden in a variety of ways, including by
CREATE TABLE.
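For example, a sketch that overrides the collation of a single column:

CREATE TABLE loctest (s TEXT COLLATE "C");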

Postgres supports both traditional join
syntax, which uses the where clause to specify joined columns, and ansi join syntax, that uses the word join
in the from clause. While both syntaxes can be used for inner joins, only the ansi join syntax supports outer joins in
Postgres.
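For example (tables hypothetical):

-- traditional join syntax
SELECT * FROM t1, t2 WHERE t1.x = t2.x;
-- ansi join syntax
SELECT * FROM t1 JOIN t2 ON (t1.x = t2.x);
-- outer joins require the ansi syntax
SELECT * FROM t1 LEFT JOIN t2 ON (t1.x = t2.x);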

Because column restrictions like col = 4 are not related to joins, you would think that restrictions have the same effect whether they
appear in the where clause or as part of a join clause, e.g. a outer join b on a.x = b.x and col =
4. However, this is not always true. Restrictions in the join clause are processed during joins, while where clause
restrictions are processed after joins.

This is only significant in outer joins (and
cross joins) because columns from
unjoined rows are manufactured by outer joins. Here is an example:
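A sketch consistent with the description below (the table contents are assumptions):

CREATE TABLE test1 (x INTEGER);
CREATE TABLE test2 (x INTEGER);
INSERT INTO test1 VALUES (1), (2), (3);
INSERT INTO test2 VALUES (1), (2);

-- no column restriction
SELECT * FROM test1 LEFT JOIN test2 ON (test1.x = test2.x);
-- restriction in the join clause, applied before unjoined rows are manufactured
SELECT * FROM test1 LEFT JOIN test2 ON (test1.x = test2.x AND test2.x <= 2);
-- restriction in the where clause, applied after the join
SELECT * FROM test1 LEFT JOIN test2 ON (test1.x = test2.x) WHERE test2.x <= 2;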

As you can see, the first select performs the outer join with no column restrictions. The second select returns the
same result because the column restriction matches all rows in test2 (before the outer join happens). The final select
operates on the result of the join and since null <= 2 returns null, which behaves like false, the third output row is
suppressed.

What is even more interesting is seeing how a column restriction can cause a query that would normally be fully joined to return unjoined
column values:

As open source gains popularity in the enterprise, there is increased study of open source communities and how they function. Those
studying such things often ask about Postgres because of its unusually healthy community and recent successes.

While I was in India in February, I was interviewed by an open source magazine; an excerpt from that interview is now
online. It covers open source leadership,
encouraging new developers, and healthy software ecosystems.

It is possible to define table columns in an order that minimizes padding. Someday Postgres might do this
automatically.
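A sketch of the effect using pg_column_size() (exact sizes vary by platform):

-- int2 before int8 forces alignment padding after each int2
SELECT pg_column_size(ROW(1::int2, 1::int8, 1::int2, 1::int8));
-- alignment-hungry columns first, so less padding and a smaller row
SELECT pg_column_size(ROW(1::int8, 1::int8, 1::int2, 1::int2));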

The 24-byte row header includes an 8-bit mask to record null values. You can see below that the 8-bit mask is sufficient for eight nulls,
but the ninth null requires the null bit mask to be expanded, with additional alignment:

Application_name might be one
of those Postgres settings that you have seen in various places but never understood its purpose. It is true that setting
application_name doesn't change the behavior of Postgres (with one small
exception), but it is
very useful for monitoring. Its value appears in
pg_stat_activity and can be prepended to
every log line with
log_line_prefix.

A more interesting use of application_name is to change it while the application is running. pgAdmin
updates the application_name when a user changes screens so administrators can know exactly what screen users are
on. While you can't change application_name while a query is running, you
can change it between queries, which allows you to monitor the progress of long-running batch jobs, e.g.:
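A sketch, with hypothetical job names:

SET application_name = 'batch job, step 1';
-- run the first part of the batch
SET application_name = 'batch job, step 2';
-- run the second part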

Some people might complain about the overhead of a separate query to update application_name. There are two solutions. One solution is
to send two queries in a single string, e.g. SET application_name = 'demo2'; SELECT 100. Some libraries like
libpq support this, but psql sends these as two separate queries
— you have to enable
log_statement to see how
queries are sent to the server.

Another approach is to bundle the setting of application_name inside the query:
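A sketch of the idea with two orderings, using a hypothetical table named test:

SELECT set_config('application_name', 'demo2', false), count(*) FROM test;
SELECT count(*), set_config('application_name', 'demo2', false) FROM test;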

Unfortunately, there is no way to guarantee that set_config() will be run first, e.g. in the first query, set_config() is run first,
and in the second query, run second, and the optimizer is allowed to run from-clause expressions in any order:

Postgres has many logging options, e.g.
log_statement, log_min_duration_statement, and log_line_prefix. These values can be set at various levels, which I have already
covered. One level that has particular flexibility is
postgresql.conf. Settings in
this file affect all running sessions, unless the values are overridden at lower levels.

When debugging, it is often hard to know which log settings to enable. One approach is to enable all settings beforehand, but that can
generate lots of log traffic and be hard to interpret. Ideally you can enable just the settings you need at the time you need them
— this is where postgresql.conf comes in. With postgresql.conf or
alter system, you can change any logging settings you want
and then signal a reload by either sending a
sighup signal to the
server, running "pg_ctl reload", or executing
"SELECT pg_reload_conf()". One nice thing is
that after a reload, all running sessions receive the new settings when their transactions complete. This allows logging settings to be
enabled and disabled as needed.
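For example, a sketch of a debugging session:

ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();
-- investigate, then revert
ALTER SYSTEM RESET log_min_duration_statement;
SELECT pg_reload_conf();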

It is cool to be able to stare at colorful graphs to see what Postgres is doing, but sometimes you just want to set up something, walk
away, and be informed when there is a problem. That is what check_postgres and
tail_n_mail are designed to do.

check_postgres is a script designed to be run from cron or a monitoring tool like
Nagios. It reports on areas in the database that need administrator attention. tail_n_mail has a similar
purpose, but monitors the Postgres log files for important messages.

Few people stare at their mobile phones waiting for something to happen — they configure their phones to notify them when
something important happens. Why not do that for your Postgres clusters? That's what check_postgres and tail_n_mail are meant to do.

Postgres often lacks the sophisticated reporting of more established enterprise relational database systems. Sometimes that is because
Postgres doesn't require as much monitoring and tuning, but there are legitimate monitoring needs which Postgres doesn't support.

Thanks to the wait_event_type and
wait_event columns added to the pg_stat_activity view in Postgres
9.6, it is possible to find which parts of the system are causing query
delays. In Postgres 9.6 wait_event_type can have four values:

LWLockNamed

LWLockTranche

Lock

BufferPin

and this table lists all the possible wait_event
column values, grouped by wait_event_type. Below is typical pg_stat_activity output during
pgbench:
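A query along these lines produces that output:

SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;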

Postgres 10 will have even more monitoring
capabilities, with at least four new wait event types. One nice thing about this feature is that it has almost no overhead so it is enabled
by default. Unfortunately, there is no efficient way to measure event duration except by periodic sampling of event states.

This excellent email post
from Robert Haas statistically analyzes wait events to show how different queries have different wait
behaviors, e.g. heavy wal writes, unlogged tables, all data in
shared buffers. The
results match what you would expect from these kinds of workloads, but it is much more detailed than you would get from guessing.

These new pg_stat_activity columns give us a
new
window into performance. Before this, we had to guess what the bottleneck was, and we were very good at it. This new instrumentation
gives us very accurate statistics on where we are losing performance. I expect this feature to yield major performance improvements in
the years to come.

You might have noticed that create index allows you to
specify asc or desc for each column being indexed, and you might have wondered why. Asc is the default, so
there is no need to use it. Desc actually does have a use, but it isn't obvious.

Postgres can access indexes forward or backward, so there is no need to specify ascending/descending if all columns are ascending or
descending. The use-case for specifying the index order is for multi-column indexes with mixed ordering, assuming queries also use the
same mixed ordering. For example, let's create a two-column table with 1000 rows, and an index in all-ascending order:
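A sketch of the setup (names hypothetical):

CREATE TABLE ordtest (a INTEGER, b INTEGER);
INSERT INTO ordtest SELECT i, i FROM generate_series(1, 1000) AS g(i);
CREATE INDEX i_ordtest ON ordtest (a, b);
-- a backward scan handles ORDER BY a DESC, b DESC, but not mixed ordering
EXPLAIN SELECT * FROM ordtest ORDER BY a, b DESC;
-- an index created with the same mixed ordering can satisfy it
CREATE INDEX i_ordtest_mixed ON ordtest (a, b DESC);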

Postgres has supported multi-column indexes since 1997, e.g. create index i_test on test (a, b, c). It can easily use
an index if the supplied columns are all at the front of the index, e.g. a and b in the previous index, but it can also use the index
if some of the indexed column values are not supplied, e.g. columns a and c in the previous index. It does this by looking up a in
the index, then looking through the index for matches of c, ignoring values of b, e.g.
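(a sketch using the three-column index mentioned above:)

CREATE INDEX i_test ON test (a, b, c);
EXPLAIN SELECT * FROM test WHERE a = 1 AND c = 5;   -- b is not supplied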

Of course, if possible, you should always put the most commonly-supplied columns first in the index because skipping columns during index
scans (called "index skip scans") is expensive.

However, I had assumed that if you supply all the column values referenced in the index, it doesn't matter what order the columns are
specified in the index, but Robert Haas recently mentioned this is not always true. For example, if a
has many duplicate values, and b has mostly unique values, having a at the start of the index is suboptimal — better to use b
first, which will more effectively narrow the search space when looking for matches on a.

A more specific example of this is range queries. In that case, if the range test of one column is less restrictive than the equality
test of another, it would be better for the equality test column to be first in an index. For example, in the queries below, the first
query uses an index because it is very restrictive on the first column, the second query uses an index because it is moderately
restrictive on the first column, while the third does not use an index because it is effectively unrestrictive on the first column:

Notice the increasing costs, even though all queries match one indexed row.

Obviously, in cases where you are not specifying all indexed columns in every query, you should put the most frequently referenced columns
first in the index to avoid the overhead of index skip scans. However, for cases where most indexed columns are going to be supplied in
queries, placing the most restrictive columns first in indexes is a win.

In my previous blog post, I showed how statistics generated on
expression indexes can be used to produce more accurate row counts, and potentially better plans. While I did show more accurate row
counts via explain, I did not show changed query plans. I
plan to do so in this blog post. First, the setup:

A nested loop join is used, which is suboptimal because the row count
for test1 is one hundred times too small. With proper statistics on the modulus operation on test1.x, a more efficient
hash join is used:

Notice the test1 row count is now much more accurate, and that analyzing the base table also analyzes the expression index. The total
cost is now slightly higher (2132.29 vs. 1959.02), but that is not because the hash join is more expensive. Rather, it is because the
nested loop misestimated how many rows it would need to process because it didn't know the selectivity of the modulus operation.

One thing I learned in researching this blog post is how much the optimizer "loves" hash joins. If test2 has three or more rows, or if
test1 has ten times more rows and parallelism is enabled,
a hash join is used even without expression index statistics. Hash joins are very robust despite misestimation so they are favored by the
optimizer. The takeaway is that the creation of expression indexes for statistical purposes is recommended only if testing shows they
actually improve query plans, i.e., improving explain row counts alone has little benefit.

Most people know that Postgres allows the creation of indexes on
expressions. This is helpful if you need index lookups of expressions used in where clauses.

However, there is another benefit to expression indexes, and that is optimizer statistics. Not only do expression indexes allow rapid
lookups of matching expressions, but they also provide optimizer statistics, which improve row estimates and hence query plans. Here is
an example:
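A sketch of such an example (names and sizes are hypothetical):

CREATE TABLE extest (x INTEGER);
INSERT INTO extest SELECT i FROM generate_series(1, 100000) AS g(i);
ANALYZE extest;
-- without expression statistics, the row estimate for the modulus is poor
EXPLAIN SELECT * FROM extest WHERE x % 100 = 0;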

The optimizer doesn't know the selectivity of the modulus operator, so it initially assumes
only one row is returned. Once an expression index is created and analyze statistics generated, the optimizer knows exactly how many rows
will be returned:
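Continuing the sketch:

CREATE INDEX i_extest ON extest ((x % 100));
ANALYZE extest;
-- the expression statistics now give an accurate row estimate
EXPLAIN SELECT * FROM extest WHERE x % 100 = 0;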

Interestingly, the optimizer used expression index statistics, even though the expression index itself was not used. In the example
above, the modulus operator is not selective enough to make the index useful, but expression statistics would be useful for more complex
queries, e.g. with joins. This method can also be used to create statistics on functions.

It is also possible to create an expression index that generates cross-columns statistics, the benefits of which I mentioned in an earlier
blog post. For example, this expression index would supply accurate
statistics for state/city combinations, but queries would need to use the exact concatenation construction:
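A sketch (the table and column names are hypothetical):

CREATE INDEX i_state_city ON customer ((state || ' ' || city));
ANALYZE customer;
-- only queries using the exact same concatenation benefit from the statistics
SELECT * FROM customer WHERE state || ' ' || city = 'NY New York City';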

So, it is 2017 and Postgres still doesn't support query hints like other relational databases? Yep, it's true, and probably will be
forever since "'Oracle-style' optimizer hints" is listed in the
"Features We Do Not Want" section of the Postgres todo list.
A wiki page outlines the reasons for this.

While this seems perfectly logical to people who have used Postgres for years, it strikes new users as rigid and extreme. There are
several reasons for this divergence.

First, what new users don't realize is that there are ways to control the optimizer, just not with inline query hints. The most useful
ways are by changing planner
constants and other planner
options, which allow you to tailor the optimizer to your specific hardware and query types. This is a more holistic approach to
optimizer tuning because, if set properly, they improve all queries, compared to "use this index"-style hints which must be added to
every query.

There are also several crude
enable settings that can be
changed at the session level to prevent the optimizer from considering certain executor capabilities, like join and scan types.
Unfortunately, these affect the entire query, i.e., you can't turn off sequential scan for just one table. It also doesn't allow the
specification of which index to use. These are really designed for debugging.

In the same vein, an area that often causes suboptimal plans is queries that reference several columns from the same table where column
values are correlated, e.g. a row with a US state value of 'NY' is more likely to have a city value of 'New York City' than a row
with a state value of 'AZ' is. While inline query hints might be able to fix this case for constants, they can't help when variables are
passed into the query, like with parameterized queries. A patch being considered for Postgres 10,
multivariate statistics, would benefit all queries by collecting statistics on column
combinations in addition to the single column statistics already
collected. This, again, is an example of a holistic approach to improving the optimizer.

There are two additional ways to control single queries that are often used in production.
Common table expressions allow a query to be broken up into stages
which are executed sequentially because each with clause acts as an optimization fence. Secondly, the use of
offset 0 prevents subqueries from being moved into the outer query.
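For example (tables hypothetical; this reflects the with-clause behavior of the Postgres versions discussed here):

-- each with clause is planned separately
WITH w AS (SELECT * FROM big_table WHERE x > 10)
SELECT * FROM w WHERE y = 1;

-- offset 0 keeps the subquery from being flattened into the outer query
SELECT * FROM (SELECT * FROM big_table WHERE x > 10 OFFSET 0) AS s WHERE y = 1;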

I expect future improvements to address other optimizer shortcomings. One area where inline query hints would really help is fixing
queries in an emergency, when you don't have time to research the cause. Postgres currently doesn't have an easy solution for that.

What do all those numbers mean? Well, the Linux getrusage() manual page has
a clue — these are kernel resource usage statistics. The first two lines show the user and kernel ("system") cpu time
used, as well as elapsed time. (The numbers in brackets on line #2 and following are totals for this process.) Line #3 shows I/O
from/to the storage device (not from the kernel cache). Line #4 covers memory pages faulted into the process address space and reclaimed.
Line #5 shows signal and ipc message activity. Line #6 shows process context switches.

Let's look at some more interesting queries by populating a new table:

By clearing the kernel buffers with echo 3 > /proc/sys/vm/drop_caches and restarting the server, we can see reads from the storage device
(via log_statement_stats) and reads from the kernel cache (via explain (buffers true)):

Notice the first count reads 1712 512-byte blocks from storage and 5 8kB blocks from the kernel cache into the
shared buffer cache. (The
unrecorded reads are probably system table information.) The second query shows no I/O from the storage device, and a hit on 5 8kB shared
buffer pages.

As you might know, Postgres executes queries in stages. Fortunately,
you can get rusage statistics for each stage with
log_parser_stats, log_planner_stats, and log_executor_stats. (explain (analyze true) also shows planning and execution durations, which you
can see above.)

As you can see, you can access a wealth of information about Postgres and how it is using the cpus, storage, and caches. These
have a large impact on how quickly queries execute.

I often get the question, "How do I get good performance out of Postgres?" and sometimes I am kind of
miffed because I get the feeling that the asker expects a short answer. Those who are familiar
with Postgres know there is no short answer, and it really isn't Postgres's fault because there isn't a simple answer to that question for
any relational database.

I guess it comes down to the complexity of relational systems. There is the sql parser, the
optimizer, and background workers. No matter how much auto-tuning we
do, there are still going to be things that administrators must adjust for optimal performance. And then there is the hardware, which is
stressed by all relational systems. Proper tuning of Postgres must happen at all three levels:

I just co-authored a blog for
ibm's Linux on Power developer site. While there is similar content in my
YeSQL slides, the ibm blog more concisely describes the four
options available to application developers, from pure relational to pure NoSQL, and the options in between possible with Postgres.

Postgres extensions are a powerful way to add functionality to Postgres and
dozens of extensions are available. However, it can be difficult to manage them. By default, extensions are loaded into the first
existing schema in search_path (usually the public
schema), e.g.:
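Using pgcrypto as an illustration (any extension behaves similarly):

CREATE EXTENSION pgcrypto;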

This loads 36 functions into the public schema. This can be difficult to manage, particularly if multiple extensions are loaded into
the same database.

One solution is to place each
extension into its own schema. This can be accomplished by setting search_path before loading the extension, or using the with
schema clause in the create extension command:
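For example:

CREATE SCHEMA pgcrypto;
CREATE EXTENSION pgcrypto WITH SCHEMA pgcrypto;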

Antivirus software helps to protect your computer from external attack.
Unfortunately, it can cause problems with Postgres. Antivirus software constantly scans for files and behavior that suggests attacks, but
Postgres can easily get caught in that net.

Postgres regularly creates heap, index, and wal files, and these
files are binary. Depending on what data is written, these files can contain patterns that are mistaken for viruses, causing the files to
be removed from Postgres-controlled directories and quarantined. Obviously, this causes bad things to happen to Postgres. Postgres sites
using antivirus software should disable antivirus scans of directories controlled by Postgres.

When the number of my blog entries surpassed 400, I realized that the categories I was using to group entries for easier reference were
too coarse. Therefore, I have added new blog categories to make future
reference easier. Also, I have dozens of new blog entries ready to post in the coming months.

Related to restructuring, I have updated many of the stock images in my presentations.

I did a 12-minute interview in Moscow in November and a
recording is now available online. The first four minutes cover Russian-specific aspects of Postgres, and the remainder discusses
Postgres's general direction, challenges, and its place in the database ecosystem.

Having covered ssl certificate creation and the use of
certificate authorities (ca), I would like to put it all
together and show how certificates and certificate authorities work to ensure trusted Postgres communication.

I have created a diagram showing server, client, and certificate authority
certificates. None of these certificates is secret, e.g. the server sends its ssl certificate to the client, and vice versa. In
the diagram, the server and client use the same certificate authority certificate.
(Intermediate certificate authorities could also be used.)

When the client connects, the server sends its certificate to the client. The client uses the public key in its certificate authority
certificate to verify that the server certificate was signed by its trusted certificate authority (the red line). It then uses the public
key in the server certificate to encrypt a secret key that is sent to the server. Only a server with the matching private key can reply
to generate a session key. It is not the possession of the server certificate that proves the server's identity but the possession of the
private key that matches the public key stored in the server's certificate. The same is true for client certificates used for client
host and user authentication (the blue line).

Hopefully this diagram helped you see how the same certificate authority certificate on the server and client allows for identity
verification. Interestingly, identity verification is not
required for two systems to communicate in an eavesdrop-proof manner, but if you can't
be sure who you are communicating with, it isn't very useful.

The first argument to the openssl command-line tool is always a subcommand, in this case req. (The openssl manual pages
are split based on the subcommand, so man req shows openssl arguments for the subcommand req.)

The common name (cn) shown above is potentially checked by the client. The encoded section ("certificate request")
can be suppressed with -noout. The file server.req already contains this information in text format because -text was specified
during its creation.
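For example, to view the request as text while suppressing the encoded section:

openssl req -in server.req -text -noout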

The server public and private keys stored in privkey.pem can also be viewed:
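e.g. (assuming an rsa key):

openssl rsa -in privkey.pem -text -noout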

All of this can be very complex, so I have created a diagram which illustrates what
is happening. At the top-left is the server key generated by openssl req. This command also creates the certificate signing request
(the csr, top-right) which contains:

Certificate signing information in X.509 format, e.g. common name (cn)

Public server key to be used by clients wishing to authenticate the server's identity

The above signed by the server's private key to prove that the server owner supplied this information

The certificate signing request (csr) can't be used as a certificate because it is unsigned. For self-signed certificates, the
server's private key is used to sign the csr. (These are the instructions given in the Postgres documentation.) A more powerful
option, as I mentioned in my previous blog post, is to use a local certificate authority. In my diagram, a certificate authority
(bottom-left) uses its private key to sign the csr and create a certificate (bottom-right).

Once the server has a certificate, any client who has the signer of the server's certificate in their
~/.postgresql/root.crt file can authenticate the
server certificate by using the sslmode connection
parameter verify-ca. They can also verify that the certificate was created for the specific server's host name by checking the common
name (cn) using verify-full. (Clients can record revoked certificates in ~/.postgresql/root.crl.)

Hopefully you can make better use of ssl server certificates now that you understand how they are created, inspected, and
verified.

If the certificate is to be self-signed, use the key created by the certificate signing request to
create a certificate. If using a local certificate
authority, sign the csr file with the local certificate authority's key.

The use of public certificate authorities doesn't make sense for most databases because it allows third parties to create trusted
certificates. They are only reasonable if you want public certificate authorities to be able to independently issue certificates that you
trust. This is necessary for browsers because they often connect to unaffiliated websites where trust must be established by
a third party. (Browsers include a list of public certificate authorities whose issued
website certificates they trust.)

During normal server shutdown, sessions are disconnected, dirty
shared buffers and pending
write-ahead log (wal) records are flushed to durable storage, and
a clean shutdown record is written to pg_control. During the next
server start, pg_control is checked, and if the previous shutdown was clean, startup can ignore the wal and start immediately.

Unfortunately, a clean shutdown can take some time, and impatient database administrators might get into the habit of using kill -9 or
pg_ctl -m immediate to quicken the shutdown. While this does have the intended effect, and you will not lose any committed
transactions, it greatly slows down the next database startup because all wal generated since the last completed checkpoint must
be replayed. You can identify an unclean shutdown by looking at the server logs for these two ominous lines:

LOG: database system was interrupted; last known up at 2016-10-25 12:17:28 EDT
LOG: database system was not properly shut down; automatic recovery in progress

These crash shutdowns can also happen if the computer crashes, the operating system crashes, or if a proper database shutdown
script isn't triggered on computer shutdown.

So, in summary, administrators should avoid Postgres crash shutdowns unless shutdown time is more precious than startup time. Every
time those log lines appear, Postgres is doing more work than would have been necessary if the previous shutdown had been clean.

Unlike other database systems, Postgres makes the cleanup process visible and tunable to users.
Autovacuum performs recycling of old rows and updates
optimizer statistics. It appears in ps command output, the
pg_stat_activity system view, and
optionally in the server logs via
log_autovacuum_min_duration.

Postgres also allows fine-grained control over the
autovacuum cleanup process. Occasionally users find that cleanup is slowing the system down, and rather than modifying the behavior of
autovacuum, they decide to turn it off via the
autovacuum setting.

However, turning off autovacuum can cause problems. Initially the system will run faster since there is no cleanup overhead, but after a
while old rows will clog up user tables and indexes, leading to increasing slowness. Once that happens, you can turn on autovacuum again,
and it will recycle the old rows and free up space, but there will be much unused space that can't be reused quickly, or perhaps ever.

Turning off autovacuum is kind of like someone trying to save time by not changing the engine oil in their car. They are correct that for
a while their car will be in the shop less often, but ultimately the car will be in the shop for a long time, or will stop working.
Therefore, don't turn off autovacuum.

Another problem is that while autovacuum was turned off, optimizer statistics were not being updated, perhaps causing slower execution
plans. Fortunately, restarting autovacuum fully fixes that problem.

Now that we know that turning off autovacuum is a bad idea, what are the options if administrators want to reduce the overhead of
autovacuum? Well, first, there are many autovacuum tuning
parameters that allow autovacuum activity to happen less often, or consume fewer resources while running.

A more aggressive, and perhaps creative, approach is to change when autovacuum runs. Most systems have busy times and idle times.
Sometimes this can be determined by the time of the day or the day of the week, e.g. Sundays, 0200-0500. In other cases it can be
determined by the system's load
average or number of active sql sessions in
pg_stat_activity.

No matter how it is determined, wouldn't it make sense for autovacuum to run during these idle periods, and not run during busy periods?
Well, it's not quite that simple. You don't really want to force autovacuum to run during idle times if there is no useful activity to
perform, nor do you want it to never run during busy times in case there is a massive delete that requires cleanup or a large
data change that requires updated optimizer statistics.

What you want is to encourage autovacuum to run during idle periods, and discourage it from running during busy times. Fortunately,
that is easy to do, especially with the alter system
command added in Postgres 9.4.

Let's assume you have your own method to determine busy and idle times. When an idle time starts, simply execute:
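The exact settings depend on your workload; here is a sketch that halves the default scale factors (matching the arithmetic below):

ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.1;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05;
SELECT pg_reload_conf();

When the busy period begins, reset them:

ALTER SYSTEM RESET autovacuum_vacuum_scale_factor;
ALTER SYSTEM RESET autovacuum_analyze_scale_factor;
SELECT pg_reload_conf();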

This will set the values back to their defaults. Pretty simple, huh? Yet effective. A more sophisticated approach would be to
proportionally decrease these settings based on the load on the system.

Let's walk through how this works. Suppose we don't change any settings during idle times and use the default
autovacuum_vacuum_scale_factor of 20%. If a table has 19.5% expired rows at midnight, autovacuum will not run. When the percentage of
expired rows reaches 20% at 11am, autovacuum will run and remove the expired rows. Now suppose that autovacuum_vacuum_scale_factor is
reduced by half at midnight — autovacuum will run and all tables will have less than 10% expired rows by the time the idle
period ends. Now, during the busy time autovacuum will only run if a table has increased from 10% expired rows to 20%.

It would be nice if this proportional behavior could be added to autovacuum but I am not sure how to implement that in a clean way.