I mentioned previously that I have been working on improving
pg_upgrade performance for Postgres 9.3. A few days ago I committed a
new patch to allow multiple
cpus to dump/restore databases in parallel. Since dump/restore of database schema takes the majority of execution time when
using --link mode, this provides a significant speedup:
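For example, a link-mode upgrade might use four parallel jobs like this (paths and job count are illustrative):

    $ pg_upgrade -b /usr/pgsql-9.2/bin -B /usr/pgsql-9.3/bin \
                 -d /var/lib/pgsql/9.2/data -D /var/lib/pgsql/9.3/data \
                 --link --jobs 4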

While users with a single large database will see no benefit from using the new --jobs parameter, those with many databases
will see significant improvement. This is an example of the type of
parallelism I am hoping we can add to Postgres in the coming years.

After my previous blog post about nulls, you might have
decided to avoid null completely by never assigning them to your database columns. While this will reduce the number of
nulls in your database, it will not eliminate them because various sql commands generate nulls as well. With
the new realization that nulls are unavoidable and you are going to have to understand them, let's look at how nulls get
generated. The first method is explicit null generation:
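A minimal sketch (table and column names are hypothetical):

    SELECT NULL;
    INSERT INTO customer (name) VALUES (NULL);
    UPDATE customer SET name = NULL;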

Of course, there are many other ways nulls can be generated —
outer joins can generate nulls for
columns of non-joined tables. Aggregates and window functions can
also generate null values, which can then be stored in database columns via an insert into ... select.
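For instance, here is a sketch of an outer join producing nulls (tables are hypothetical):

    -- customers with no matching order get NULL for orders.total
    SELECT customer.name, orders.total
    FROM customer LEFT JOIN orders ON customer.id = orders.customer_id;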

All storage of nulls is easily prevented by using a not null specification when creating each column:
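A sketch (column definitions are hypothetical):

    CREATE TABLE customer (
        id   INTEGER NOT NULL,
        name TEXT    NOT NULL
    );

    INSERT INTO customer VALUES (1, NULL);  -- fails: null value violates not-null constraint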

So, even if you never type "null", you can get nulls into your database. The use of not null when creating
columns is recommended, especially for numeric columns that should contain only non-null values. Ideally you could have not null
be the default for all columns and you would specify null for columns that can contain nulls, but that is not supported. If a
column is specified as the primary key, the column is also internally not null.

Sql nulls were designed to simplify modeling the real world in relational databases, but the debate about their
usefulness has never been resolved. Specifically, nulls were designed to represent:

unknown values

inapplicable values

empty placeholders

For example, a character field employee.spouse could be null to represent the spouse's name as unknown (#1) or the employee as
unmarried (no spouse) (#2). Alternatively a query could have generated a spouse column with no value (#3), e.g. via an outer
join. Without nulls, fields needing such values have to resort to special values like zero-length strings or "n/a" for
strings, "0" or "-99" for numerics, or "1901-01-1" for dates. Obviously, as complex as null handling is, using special values with little
documentation and wide variability between applications is even worse.

Using nulls in a database is like using flat-nose pliers to do repairs. Flat-nose
pliers are notorious for chewing up nuts and mangling everything they touch. If your first tool of choice is flat-nose pliers, something
is wrong. However, there is a small percentage of jobs where flat-nose pliers are ideal, so the trick is knowing when flat-nose pliers
make sense, and when they don't. Nulls are similar — as I stated above, there are valuable uses for nulls,
but if their use is not controlled, they can cause havoc in relational systems, and these "surprises" are often the basis of people trying
to avoid nulls completely.

In their book A Guide to Sybase and SQL Server, David McGoveran and C. J. Date said: "It is this writer's opinion that
NULLs, at least as currently defined and implemented in SQL, are far more trouble than they are worth and should be avoided; they display
very strange and inconsistent behavior and can be a rich source of error and confusion. (Please note that these comments and criticisms
apply to any system that supports SQL-style NULLs, not just to SQL Server specifically.)"

…

In the rest of this book, I will be urging you not to use them, which may seem contradictory, but it is not. Think of a NULL as a drug;
use it properly and it works for you, but abuse it and it can ruin everything. Your best policy is to avoid NULLs when you can and use
them properly when you have to.

In future blog entries, I will cover the places nulls often appear, and many of the surprises to avoid.

The B-52 Stratofortress was designed in the years after
World War II to serve as a heavy bomber for the Cold
War and took its maiden flight in 1952. The Cold War came and went, but the B-52 is still in active military service —
94 of the
102 H-model B-52 aircraft built in the 1960s remain on duty.

This year marks the sixtieth year of B-52 operation — that is a laudable milestone for any piece of technology, but even more so
considering the demanding reliability and performance requirements for military hardware. The plan is for the aircraft to remain in
service through 2044. This 2002
article has some relevant quotes:

In an era when a 9-month-old laptop already feels retro, when people who keep their cars for six years are considered quaint, the fact
that the most powerful military in the world relies on a fleet of 40-year-old bombers [now 50] is pretty astonishing.

And the B-52 is not some creaky relic that the military keeps around for air shows. The planes have seen
more combat in
this decade than in the previous three decades combined.

If the B-52 does remain in service for 80 years, it will be like using a weapon from the Civil War to win World War II.

"Right now, there are three generations of
pilots who have flown that plane -- grandfather, father, and son -- in the same family. If it lasts until 2040 [now 2044], five
generations will have flown the same plane."

So, how does the B-52 relate to Postgres? Ignoring the B-52's destructive purpose, there are some similarities:

Both were developed by the U.S. Defense Department (Postgres was initially
funded by darpa)

Both were developed decades ago and have remained in use far longer than expected

Both are easily enhanced with new technology, which has increased their longevity

Both operate in demanding environments where reliability and performance are critical

The B-52, being a machine, eventually will wear out and need to be replaced. Postgres, being software, never wears out — it
could go on forever, but right now, let's just plan for 2044, and we can reevaluate then.

Having just reported a rare but serious bug in
pg_upgrade, I wanted to
address the meticulous effort the Postgres community takes in developing software. While most end-users just see the software
releases and the
release notes, there is obviously a huge amount of effort that goes
into creating those releases. Postgres backend developers are acutely aware of that effort, as every patch they submit receives
detailed scrutiny to detect flaws or potential improvements.

Databases, like operating systems, are held to a high standard for reliability because failure often causes downtime for organizations.
Also, databases lock user data into their software to a degree beyond many operating system use cases, which is another reason that
database server work requires serious focus.

However, the focus on reliability can't be borne alone by the database server developers. Users must also provide a reliable platform to
run database software. Postgres can't maintain high reliability if it is dependent on an unreliable platform. Here are some of the
things users can do to provide a reliable platform for Postgres:

Choose an operating system whose focus is on reliability, rather than desktop user experience or supporting the latest hardware or
software (discussion thread)

Choose hardware designed for 24-hour operation in a demanding environment, not desktop hardware

The Postgres team does its best to provide a highly-reliable database. We hate to hear about database problems caused by running
Postgres on unreliable platforms. Though we are not perfect, we do our best to be as reliable as possible — please do your part to see
that the platform you use is also reliable.

With many companies committing to Postgres in the past year, Postgres has earned a reputation for high reliability. With our open source
development model and the development team growing, Postgres has the potential to set a new standard for database reliability, and with
everyone's help, we can make that happen.

Postgres, since version 8.2, has supported the ability to create indexes without blocking writes (insert, update, or
delete) on the table being indexed. This is done by adding the keyword concurrently to the
create index command. (Reads are never blocked by
create index.)
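For example (table and index names are hypothetical):

    CREATE INDEX CONCURRENTLY i_orders_date ON orders (order_date);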

Implementing this feature was very complex, and it has been recently
discovered that all versions of
pg_upgrade have a bug related to processing such indexes. The problem
revolves around the rare case when create index concurrently fails. When this happens, an invalid index remains in the system.
The invalid index is incomplete and is not used by queries. The pg_upgrade problem is that
pg_dump dumps it as a valid index. While a normal restore of pg_dump
output would create a valid index, pg_upgrade moves the old index file into place without recreating it — hence, an invalid
index file is upgraded as a valid index. In this case, the index could return invalid results or crash the server. The steps necessary
to trigger this bug are:

Use of create index concurrently

Failure of create index concurrently

Not dropping or recreating the index after failure

Use of pg_upgrade without the
patch applied on December 11,
2012. The patch uses this query to throw an error if invalid indexes are found before the upgrade begins (pg_upgrade --check also
performs this check):
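The check is a query along these lines (a reconstruction, not the exact text of the patch):

    SELECT n.nspname, c.relname
    FROM   pg_catalog.pg_class c,
           pg_catalog.pg_namespace n,
           pg_catalog.pg_index i
    WHERE  (i.indisvalid = false OR i.indisready = false) AND
           i.indexrelid = c.oid AND
           c.relnamespace = n.oid AND
           n.nspname != 'pg_catalog' AND
           n.nspname != 'information_schema' AND
           n.nspname != 'pg_toast';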

This bug remained unreported for so long because concurrent index creation is not common, and leaving invalid indexes in place is also
rare.

The next set of minor Postgres releases will throw an error if invalid indexes exist in the old cluster — until then, users can
run the above query manually in all databases to check for invalid indexes before using pg_upgrade. For those who have already used
pg_upgrade, you can check if invalid indexes were improperly upgraded by running the query in the old cluster. Unfortunately, if the old
cluster is not accessible, there is no known way to check for the existence of such indexes in the new cluster —
reindex can be used to recreate any suspect indexes.

While setting configuration variables at the SQL level is very easy to program, modifying the postgresql.conf file programmatically can
be more complex. (The postgresql.conf file exists in the cluster data directory.) Some administrators use sed or perl to modify
the file directly, but a simpler solution is to use the postgresql.conf include directive:

include 'filename'

By adding this to the bottom of the postgresql.conf file, any settings in filename replace values set earlier, such as in
postgresql.conf.

Once this is set up, programs can write into the include file to set any variables they wish. After the write, run pg_ctl restart for
variables requiring a restart (labeled with "change requires restart" in postgresql.conf), or pg_ctl reload for other variables.
Truncating the include file removes its settings.
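A minimal sketch of the workflow (file name and setting are illustrative):

    # at the bottom of postgresql.conf
    include 'override.conf'

    $ echo "log_statement = 'all'" > $PGDATA/override.conf
    $ pg_ctl reload
    $ : > $PGDATA/override.conf    # truncate the file to remove the override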

The Postgres development team is discussing adding an SQL
interface for changing postgresql.conf values, and it will use a similar mechanism.

As an example of why you might want to programmatically control postgresql.conf, you might want to modify
autovacuum's settings to increase its activity during quiet
periods, and revert the settings during normal load.
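A sketch of quiet-period settings that a program could write to the include file (values are illustrative, not recommendations):

    autovacuum_vacuum_cost_delay = 0
    autovacuum_vacuum_cost_limit = 10000

Truncating the file and reloading during business hours would restore the normal settings.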

Having returned to blogging after a two-month break, I wanted to point out two new things on my blog site. First, months ago, I started
categorizing my blog posts — the categories appear as small images next to the blog titles.

This has allowed me to automatically create a category index of all my posts.
(The index is accessible by clicking on View Index at the top of my
blog detail page, and then choosing
View Category Index.) For example, the page collects all my tips together for
easy review. (See "Tip" at the bottom of the page.) I wanted to mention this because those tips have been collected from discussions in
many of my training classes, and might be helpful for new users. My blog posts about conferences, pg_upgrade, and performance are
similarly organized.

Second, my old commenting service shut down, so I have migrated to a new blog commenting service. I was able to
import all 362 comments into the new service, so nothing was lost.

I began working on pg_upgrade (previously called pg_migrator) in
2009 after my employer EnterpriseDB
changed the license of their binary migration tool to BSD and instructed me to work on improving it. My initial goal was to produce a
reliable and popular method for binary upgrades (avoiding data
dump/restore). That goal was achieved in 2011, and
since then I have been focusing on usability and performance enhancements.

One repeated request has been for better performance, especially for databases with many tables. You might wonder, "Isn't pg_upgrade
already faster than dumping/reloading the database?" Yes, it is, but that doesn't mean it can't be made even faster.

After receiving scattered reports of link-mode upgrades taking 45 minutes or more, I dug in and made some major performance improvements
in Postgres 9.2. However, that wasn't sufficient, and I got slowness reports from users using pg_upgrade 9.2. Therefore, I have been
back on pg_upgrade performance duty this past month, and I got good results!

Pg_upgrade performance for Postgres 9.0 and 9.1 is similar because there was little focus on performance during those releases. Pg_upgrade
9.2 shows a doubling in performance, and 9.3 will have another 2-4x improvement. Notice that
SSDs still provide a major performance boost.

There are still a few more pg_upgrade performance improvements that might make it into the final 9.3 release.

I attended two interesting conferences in October. The first was the Highload++ conference in Moscow.
The two-day conference had one full day of Postgres talks, thanks to dedicated work by
Nikolay Samokhvalov. The conference has grown significantly in recent years, and I am
hopeful Postgres will continue to play a significant role.

Jonathan S. Katz's presentation, Marketing
PostgreSQL (slides), inspired me because he accurately portrayed the
methodical, results-driven approach organizers used to grow the New York PostgreSQL User Group. Rather than
do what was easy, they analyzed what efforts would lead to significant user group growth, and pursued them consistently.

Antonin Houska's talk,
pg_xnode - Alternative
implementation of XML (slides), showed a method for storing XML
data in binary format for easy indexing. This interested me because it addressed a general problem of how to efficiently store structured
data types. Right now XML and
JSON data types are stored as text strings, and this limits performance
for some use-cases.

The talk, Migrating Oracle
queries to PostgreSQL (slides), got my attention because,
while Oracle migration is very popular, I rarely see presentations on the topic. The presentation Full-text search in PostgreSQL in
milliseconds (slides)
promises dramatic full-text performance improvements.

Of course, my wife attending both conferences with me was the greatest highlight.

During my attendance at the SURGE conference last week, some people asked me about the new
JSON support in Postgres 9.2. As I
started to explain it, I realized that the description was quite complicated, and I agreed to post a blog entry about it — so
here it is. (Linux Weekly News also has an article that summarizes the 9.2
JSON features.)

The JSON features added in Postgres 9.2 were not major:

Create a JSON data type that stores JSON as a text string, and
checks that the string is valid JSON.
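A quick sketch of that validation (table name is hypothetical):

    CREATE TABLE docs (doc JSON);
    INSERT INTO docs VALUES ('{"name": "Alice"}');  -- accepted
    INSERT INTO docs VALUES ('{"name": ');          -- rejected: invalid JSON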

Seems pretty simple, and not a great leap forward. However, as with many things in Postgres, it is the coupling of these features with
Postgres extensions and externally-developed features that really makes a must-have feature-set.

First, there is the problem that the JSON value is stored as text, limiting the ability to quickly find matching values. (This is the
same way Postgres stores XML.) This was done to avoid the overhead of converting JSON to/from an indexed storage format. There are
several solutions to this problem:

Postgres's full text search capability allows JSON tokens to be indexed
and quickly retrieved. There is no record of which tokens are keys and which are values, so the output must still be filtered using
Javascript, but it certainly allows fast indexing.

Added to that, there is the stored procedure language PLV8
(summary), which allows Javascript programs to be run inside the database. This
Stackoverflow post goes
into details about indexing possibilities, particularly
expression indexes based on PLV8 function calls. You can
actually write stored procedure functions in any language that understands JSON; this
example is in PL/Perl.
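Here is a sketch of the expression-index approach (it assumes the PLV8 extension is installed; older PLV8 versions pass json arguments as strings, hence the JSON.parse call):

    CREATE EXTENSION plv8;

    CREATE FUNCTION get_name(doc JSON) RETURNS TEXT AS $$
        return JSON.parse(doc).name;  // extract one key from the document
    $$ LANGUAGE plv8 IMMUTABLE STRICT;

    CREATE INDEX docs_name_idx ON docs (get_name(doc));

    SELECT * FROM docs WHERE get_name(doc) = 'Alice';  -- can use the index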

As you can see, a validated JSON data type allows tools to then know the value is JSON, and operate on it in powerful ways.

A new data type that stores the start and stop values of a range in a single column

Operators that allow simple comparison of ranges, including overlaps, contains, and intersection; this avoids complex range comparison
logic in the application

Index support, which greatly improves range query performance

With these features, Jonathan was able to clearly show the usefulness of range types for an event planning application.
Jeff Davis, the author of range types, also gave a
tutorial showing people how to use range types for the temporal
recording of company asset usage. Both presentations were recorded so hopefully the videos will be online soon.

While range types are mostly considered for temporal data, even measurement data can benefit. Most measurements are not exact, e.g.
72°F degrees is not really 72°F, but rather 72°F plus-or-minus some amount, depending on the accuracy of the
measurement device. While many applications handle this by querying for a range of values, e.g. 71.5° to 72.5°, assuming
0.5° precision, this does not work well if the column contains values of varying precision. With range types, these measurement
ranges are easily stored in the database and queried.
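A sketch of that measurement use case (table and values are illustrative):

    CREATE TABLE readings (temp NUMRANGE);
    INSERT INTO readings VALUES (numrange(71.5, 72.5));  -- 72 plus-or-minus 0.5
    INSERT INTO readings VALUES (numrange(71.9, 72.1));  -- 72 plus-or-minus 0.1

    SELECT * FROM readings WHERE temp @> 72.0;           -- ranges containing 72

    CREATE INDEX readings_temp_idx ON readings USING gist (temp);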

Speaking of full text search, Dan Scott gave a great introductory
talk about how to use full text search. Again, some
Postgres features are so complex that watching someone explain them is often the most efficient way of getting started with new features.

In 1997, as part of learning the Postgres backend source code, I created an image of the backend source code flow and added hot-links to
the image so users could click on a section and get detailed information, including a link to the actual source code. This was on the
Postgres website for many years, but because of website changes and the switch to git, it was removed.

After recent requests, I have redone the flowchart and moved it out of our source tree; it now
exists on the main Postgres website, linked to from the
Developers/Coding section. The detail text has been moved to a
Postgres wiki page, to allow for easier community improvements.

So, if you were ever curious about how Postgres works, head on over to that web page
and start clicking.

I previously explained the ability to set Postgres configuration
variables at different levels. In this blog entry, I would like to explain how changes at the top level, postgresql.conf, propagate to
running sessions.

The postgresql.conf file is
usually stored at the top of the pgdata directory,
though it can be relocated. The most simplistic way to
modify the file is to open it with a text editor. (Tools like pgAdmin allow file modifications via a
gui.)

Once the file has been modified, you must signal that the configuration file should be reloaded and your modifications applied. There are
three methods to signal this:

send a sighup signal to the postmaster process, or sighup only individual backends

run pg_ctl reload from the command-line

call the sql function pg_reload_conf()

Several settings cannot be changed in this way — they are flagged with the comment "change requires restart". A warning message will
appear in the server log if you attempt to change one of these parameters:
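For example, changing shared_buffers and reloading produces a message like this (wording varies by version):

    LOG:  parameter "shared_buffers" cannot be changed without restarting the server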

For settings that can be changed via reload, it is obvious that new sessions will get updated values, but what about existing sessions?
Existing sessions get those changes too, as soon as they complete their transactions:
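A sketch of what this looks like from an existing session (values match the description below):

    SHOW work_mem;
     work_mem
    ----------
     1MB

    -- postgresql.conf is edited to set work_mem = '2MB', then pg_ctl reload is run

    SHOW work_mem;
     work_mem
    ----------
     2MB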

As you can see, with no client action, work_mem has changed from 1MB to 2MB. This allows you to change things like log_statement and
have the changes take effect without waiting for all sessions to restart. While this propagation to running sessions works for
postgresql.conf changes, it does not work for alter role/database-level changes — they require a client reconnection to take
effect.

I hope this blog post explains how to make effective configuration changes to Postgres.

All Postgres servers support tcp/ip connections, including localhost connections that allow clients to connect to servers on
the same machine. Unix-like operating systems also support local
or Unix-domain socket connections. These connections do not use the tcp/ip stack but rather a more efficient stack for local
connections. (I previously showed that Unix-domain socket communication
is measurably faster.)

Unix-domain socket connections require a socket file in the local file system. These are not normal files but more like entry points to
listening servers. By default, Postgres places these socket files in the /tmp directory:

srwxrwxrwx 1 postgres postgres 0 Jul 30 20:27 .s.PGSQL.5432=

Notice the file permissions section starts with an 's', and the file size is zero — these are always true for socket files. Clients can
connect to this socket file and then communicate with the Postgres server listening on port 5432. (5432 is the unique number used to
identify the server on the local machine, and is the same for tcp/ip and Unix-domain socket connections.)
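Clients select the socket directory using the same option used for a remote host name; for example, with the default directory:

    $ psql -h /tmp -p 5432 test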

The default directory for the Unix-domain socket is /tmp, though there has always been
concern about this socket location. Postgres does not require
root privileges, so by default it must locate the socket file in a directory writable by normal users, and /tmp is the logical directory
to use. Some installers that have elevated permissions place the socket file in a secure directory, like /var/run — this requires
modification of the
unix_socket_directory
setting, and the socket directory must be specified by clients making Unix-domain socket connections. Packagers who change the default
socket directory modify the server and client defaults so no special configuration is required by users.

The Unix-domain socket location is one of those settings that the community would like to make more secure, but can't without requiring
elevated permissions for installations. It is only possible for packagers or end-users, who potentially have access to elevated
permissions, to change this default.

Hard-coding database connection parameters in application code has many downsides:

changes require application modifications

changes are hard to deploy and customize

central connection parameter management is difficult

Libpq does support the setting of connection parameters via environment
variables, and this often avoids many of the downsides of hard-coding database connection parameters. (I already
covered the importance of
libpq as the common Postgres connection library used by all client interfaces
except jdbc, Npgsql, and a few less-common drivers.)
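For example, libpq-based clients honor variables like these (host and database names are illustrative):

    $ export PGHOST=db.example.com
    $ export PGDATABASE=test
    $ psql    # connects using the values above, with no command-line arguments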

However, there is another libpq feature that makes connection parameter sharing even easier:
pg_service.conf. This file allows you to name a group of
connection parameters and reference the parameters by specifying the name when connecting. By placing this file in a network storage
device, you can easily centrally-control application connections. Change the file, and every new database connection sees the changes.
While you can store passwords in pg_service.conf, everyone who can access the file can see those passwords, so you would probably be
better off using libpq's password file.
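A sketch of a service file entry and its use (names and values are illustrative):

    # pg_service.conf
    [myapp]
    host=db.example.com
    port=5432
    dbname=prod
    user=app

    $ psql service=myapp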

The first session shows an idle user postgres connected to database test; the second shows user demo doing a select in the
template1 database. These updates are controlled by
update_process_title,
which is enabled by default.

Another feature is that tools like top can also display this status information,
though such display is often disabled by default. For top, using the -c flag or typing 'c' will display the process status — this
is often helpful in obtaining information about a running session, perhaps one that is consuming much cpu or memory.

While the statistics views give in-depth reporting of database
activity at the sql level (particularly
pg_stat_activity), monitoring Postgres at
the command line is also useful. It allows database information to be integrated with operating system monitoring tools to offer superior
analysis about how the database is interacting with system resources.

During a recent conference, it was pointed out to me that there are contradictory recommendations
about the use of caching on storage devices. For magnetic disk drives, it is pretty clear — if the cache is volatile (which it usually
is on magnetic disk drives), switch the drive to write-through mode so all writes go to durable storage, i.e. the magnetic platters.
The Postgres manual has a detailed description of how to do this.

For solid-state drives (ssds), things are more complicated. If the cache is
volatile, you should switch it to write-through mode. Be aware that this will slow writes and
decrease the lifetime of the drive, but it is the only durable solution.
If the cache is non-volatile, you definitely don't want to do this, as it will give you all the negative aspects listed above without
improving durability, which is already guaranteed because the cache is non-volatile.

So, should the drive write cache be on or off? It depends on whether the cache is volatile; hopefully this helps explain which choice is
appropriate.

Have you ever run benchmarks that topped out at 250 Transactions per Second (tps)? What about 120 or 90? If so, you might have
been stumped about why you consistently hit these numbers and could never go higher.

These numbers are common rotational latency numbers for hard disks.
Specifically, 250, 120, and 90 represent rotational latencies for 15k, 7.2k, and 5.4k rpm drives, respectively. Basically, if
you have no durable cache between the cpu and your magnetic
disk drives, this is the fastest you can durably perform write transactions. Of course, you can reduce
durability requirements, and sometimes
writes can be combined into a single disk write, but in general, disk drive rotational latency can be a significant limiting factor for
any production server that lacks a durable cache layer, e.g.
bbu (Battery-Backed Unit),
non-volatile drive cache. Having a durable cache layer can improve
write transaction performance 10- or 100-fold.
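The arithmetic behind those numbers: each durable commit must wait for the platter to rotate under the disk head, so the rotation rate caps transactions per second:

    15,000 rpm / 60 sec = 250 rotations/sec = ~250 tps
     7,200 rpm / 60 sec = 120 rotations/sec = ~120 tps
     5,400 rpm / 60 sec =  90 rotations/sec =  ~90 tps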

How available do you want your services? A lot? Downtime is never desirable, so you want zero downtime, right? OK, that's impossible,
so what is reasonable, and how much are you willing to pay for it?

Uptime is often measured in 9's, meaning how many nines in a percentage of uptime is acceptable? 99%, 99.9%, 99.99%, or even 99.95% if
you want to be more precise. Of course, this is the desirable uptime — there will be years you are 100% available, then years you
will fall far short of your goal.

If you have many read-only servers, some of those higher-nine numbers are possible, e.g. if 10% of your users are on a server that is
down for 50 minutes, that might be only 5 minutes of downtime. If you have 10,000 servers, several of them could be down regularly while
still maintaining high availability, assuming something doesn't happen that makes them all unavailable, e.g. network connectivity, power.

For database systems, some of those higher-nine numbers are pretty hard to reach. Unless you
shard your data, writes are going to be centered on only a few
servers, and keeping a few servers running constantly is a challenge, considering all the
failures that can happen. Add to these failures the downtime associated
with database server maintenance, like schema changes, software deployment, and upgrades, and things really get complicated. Postgres can
help in some of these areas:

However, while Postgres has high reliability, high availability still requires serious planning. For example,
pg_upgrade is still too slow for some users. There is also no easy way to
upgrade the primary and have hot standby servers continue to function,
because they are required to run the same major version as the primary. (The community is currently
discussing possible solutions to reuse the majority of the standby
data directory.) The overhead of periodic
vacuum freeze operations also sometimes affects
downtime by dramatically slowing servers.

Postgres will continue to improve its high-availability options as we gain more users with these requirements.

Having just explained the value of using swap space, I would like to
comment on memory overcommit. Last month, Tom Lane posted a very clear explanation of how
copy-on-write and fork
interact to require either excess swap space or unrecoverable process failure during high memory usage. This
article explains historically how operating systems have dealt with this trade-off. (The
Postgres documentation explains how to
control the Linux out-of-memory (OOM) killer.)

In summary, this is another reason to have swap space, and to monitor it, because it supplies early warning that memory is in high demand.
If you have no swap space, you can't monitor its usage, and you lose valuable information about the health of your system.

I have occasionally heard students in my training classes mention that they
configure their servers with no swap space. The students have two reasons for this:

I have enough memory so I shouldn't be swapping

I don't want to swap because it is slow

Both of these reasons are inaccurate. Let's look at the first one, "I have enough memory so I shouldn't be swapping". While it is
generally true that large memory systems never fully use their ram, it isn't optimal to avoid swap. Applications often have
initialization data that can be written to swap so the memory can be used for other purposes. Kernel hacker
Andrew Morton once
stated:

My point is that decreasing the tendency of the kernel to swap stuff out is wrong. You really don't want hundreds of megabytes of
BloatyApp's untouched memory floating about in the machine. Get it out on the disk, use the memory for something useful.

Without swap space, moving unused data to swap isn't an option. Also, if the system over-commits memory and must swap, the kernel will
kill processes or panic if no swap is available — obviously not a desired outcome.

The second reason, "I don't want to swap because it is slow", is also wrong. Copying pages to swap is done by the kernel, and does not
slow down applications. What does slow down applications is copying pages from swap, and if the kernel correctly guessed that the data
is unreferenced, this is not a problem. By monitoring swap activity, both in and out, you will be alerted about high memory pressure.
This article has extensive details about how swapping works, and how to monitor it, though
free on Linux shows how much swap is in use, and vmstat has an
all-important swap-in column (si).
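A quick sketch of the monitoring commands mentioned above:

    $ free -m | grep Swap    # total, used, and free swap, in megabytes
    $ vmstat 1               # the si column shows pages swapped in per second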

So, how much storage should you allocate for swap space? This web page has
a complex formula for computing swap space, but any amount is probably sufficient, as long as you monitor it and increase it if necessary.

Ever need to generate random data? You can easily do it in client applications and server-side functions, but it is also possible to
generate random data in plain sql. The following query generates five lines of 40-character-length lowercase alphabetic strings:
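Here is a reconstruction of such a query (the b * 0 reference correlates the subquery with the outer query, forcing a fresh string to be computed for each row):

    SELECT (SELECT string_agg(x, '')
            FROM (SELECT chr(ascii('a') + floor(random() * 26)::integer)
                  FROM generate_series(1, 40 + b * 0)) AS y(x))
    FROM generate_series(1, 5) AS a(b);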

Most people know Heroku as a Platform as a Service
(PaaS) provider — they supply a git server, website for your application, and an optional Postgres database
(details). What is less well known is that Heroku also has a pure-database
offering, more like Database as a Service (DBaaS). This is similar to creating your own Postgres
cluster on Amazon Web Services (aws), except Heroku manages the database, and shares the costs.
What is really interesting is that if you create an account, you can create free databases (of limited size) that will be always available
(screenshot). (FYI, you can easily get
stuck in the PaaS section of the Heroku website trying to return to the DBaaS section.)

In contrast, EnterpriseDB (my employer) also offers a cloud database
product, but it is closer to a database cluster as a service (rather than just a database)
because you control the entire database cluster, including cluster configuration, scaling options, and have full ssh access to
the operating system (details). Unfortunately, their
free trial lasts only 24 hours, probably because of the cost
associated with Amazon instances.

Once you have created a free Heroku database, you can click on the database name to see database details, including instructions on how to
connect to the database, like from psql (screenshot,
screenshot; database subsequently destroyed). If you do a psql \l, you can see
databases owned by other users:

Because Postgres uses global objects to record user and database names,
Heroku has anonymized these identifiers.

So, if you want a free remote database for testing that is always available, try Heroku Postgres. Of course, you will still need a
client, like psql, to connect. Heroku does have something called
data clips which allows you to send live query
output to users via a url.

The talk was interesting in its description of Postgres in the new world order of NoSQL and Cloud. He felt that Postgres's features like
hstore and
JSON (coming in
Postgres 9.2) make it uniquely positioned to handle many NoSQL
workloads. He felt Postgres's open source nature allowed more confident use of Postgres in public clouds without vendor lock-in.

Overall, it was an inspiring talk that highlighted Postgres's ability to adapt to new environments.

I have received three phone calls in the last three days from job recruiters looking for Postgres employees. I usually get one call a
month, which I think means the Postgres job market is heating up.

To help employers find Postgres talent, I have written a wiki page explaining how to
post to the Postgres jobs group email list. If you are contacted by a recruiter, please
mention this web page to them. I have added a suggestion to the bottom of my
résumé page — hopefully that helps. If you are looking for Postgres
employment, please subscribe to that email list.

Suppose a manufacturer produces a product and makes a profit, yet manufacturing generates greater social harm than the value of the
product. How does the manufacturer stay in business? Because the manufacturing costs are distributed among many people and are not all
paid by the company. This is called an external cost or
externality and is commonly seen with pollution costs or natural resource depletion. It can
also apply to database index creation.

Huh? Database index creation? Yes. Consider this case: you are troubleshooting a query; it is slow. It is slow because there is no
index matching the query qualification, so you add one. The query is now much faster, so you are done. But are you? Your query is
faster, but what about other queries? What about insert and some
update operations?
They now have to update that index you created — that has to take some time. How much? It is hard to measure, because it is
distributed among many queries, and is probably small compared to the speed-up you saw in the query that prompted the index creation. But
a lot of small slowdowns can easily surpass the speed-up you saw in that one query.

How can you measure all those small slowdowns? I don't know, but I do know they exist, so be careful adding an index that has limited use
— you might find that externalities make your system slower.

Database application programming is never easy. Combine that with the use of server-side functions, and things can get quite complicated.
Why use server-side functions? As Joe Conway said, "Sometimes it is
useful to bring the data to the code, and other times it is useful to bring the code to the data."

Postgres minimizes the complexity of creating server-side functions by supporting server-side languages that are often used client-side:
PL/Perl, PL/Python,
PL/Ruby, PL/Java,
PL/R, PL/Tcl, and C. In fact, there are
18 supported server-side languages. (You could
argue that
PL/pgSQL matches sql, especially with
Common Table Expressions (ctes).)

This can allow near-seamless transfer of code to and from server-side functions, which allows seamless movement of code to wherever it can
most easily access data. This is another Postgres database feature unavailable in many other data systems.

I mentioned in March that Joe Celko, "Mr. SQL", attended our
Austin PGDay event. While he did not speak at that event, I am excited to
learn he will be
presenting a full-day SQL class at
PostgreSQL Conference Europe in Prague. It isn't every day you get to hear Joe Celko, and a full-day class is
alone enough to justify the trip for serious SQL users.

My great disappointment is that I am also scheduled to give a
class that day and cannot
attend Joe's training. Devrim Gündüz is co-presenting with me … perhaps no one would notice if …

The recent PGCon conference reminded me of two things that make open source development unique. First,
open source developers are really the face of the software — they give the presentations, write the blogs, and are the people users look
to for answers and software direction. Closed-source developers are normally hidden within companies — they are trotted out
occasionally, but marketing and sales really lead the show.

The second difference is operational — most closed-source developers have very little interaction with users — they must get user
information through a filter of layered support, and because of this, their ability to improve the software and fix bugs is severely
limited. Open source developers not only have direct contact with users, but they have a large team of other open source developers to
help them solve problems, and can often have users test fixes and new features rapidly, sometimes the same day. This is in contrast to
closed-source developers who often have to wait months or years for their fixes and features to reach a user.

This last aspect is how open source software, while vastly underfunded compared to closed-source software, can excel, because, while money
can help produce good software, clear communication channels between developers and users and a unified team can easily outperform more
rigid development environments.

Images from 2012 and 2010 have all people identified, though when you click on the image to enlarge it, you lose the identifications.
Year 2011 is an extra-wide version. In 2008, it seems we didn't take a group picture, so I have chosen four images that show most
attendees.

If you are a Postgres speaker and are traveling somewhere on vacation, have you ever considered making a Postgres presentation at your
vacation location? You might say, "I take vacations to get away from Postgres" — don't let me ever hear such crazy talk!

For example, six weeks before I left for a mission trip to the Dominican Republic, I sent an
email to the
advocacy list asking if anyone in the Dominican Republic would like to meet. (The
mission trip blog has some humorous mentions of me.) I got a private reply that our
public relations contact in that country would like a presentation or two. I ended up speaking to 85 students at a
university in the Dominican city of Santiago.

What are the advantages of taking vacation time to present a talk about Postgres? Well, many of my most memorable travel experiences have
been as a guest of Postgres community members — a dacha in Russia, a cruise in Brazil, an
elephant orphanage in Sri Lanka. These are not typical tourist sites, but are
easily accessible with local friends. Even ordinary locations, like an office building or restaurant in Tokyo, are special because you are
living as the Japanese do — not something most non-Japanese experience, even as tourists.

Anyway, this isn't for everyone — there certainly are downsides, particularly if you are traveling with a spouse or family, but it might
be worth considering — it could be the highlight of your vacation.

With the list of Postgres committers recently
updated, I wanted to mention a perplexing question I
occasionally get from companies, "How can I become a committer?". This is from companies that are just starting their contributions to
Postgres and have never submitted a patch. It is a perplexing question because, for Postgres, committers are people who have been around
for years and have submitted many patches — obviously there are mismatched expectations
here.

The cause is that these new companies are equating Postgres with open source communities with hundreds of committers, where you have to be
a committer to get anything significant done. The Postgres work-flow (diagram)
forces all work through community email lists, and a small group of committers apply
patches agreed-upon by the group. So, for Postgres, committing is a mechanical process, not a control process, i.e. contributor ≠
committer.

Companies looking to measure their success in the Postgres community should not focus on committer count, but on the number of
release note items and blog entries mentioning their work. I hope this
helps companies adjust more easily to the Postgres way of doing things.

Object Identifiers (oids) were added to Postgres as a way to uniquely
identify database objects, e.g. rows, tables, functions, etc. It is part of Postgres's
object-relational heritage.

Because oids were assigned to every data row by default, and were only four bytes in size, they were increasingly seen as
unnecessary. In Postgres 7.2 (2002), they were made optional, and in
Postgres 8.1 (2005), after much warning, oids were no longer
assigned to user tables by default. They are still used by system
tables, and can still be added to user tables using the with oids clause during
create table. Server parameter
default_with_oids
controls the default mode for table creation (defaults to "false").

Oids are still used extensively for system table rows, and are used to join system tables, e.g.:
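For example, a sketch of joining two system tables on oid:

    SELECT c.relname, n.nspname
    FROM pg_class c JOIN pg_namespace n ON (c.relnamespace = n.oid)
    LIMIT 5;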

While the English language is somewhat fluid in its use
of single and double quotation marks, sql is very rigid: single quotes are used to delimit text strings, and double quotes can
be used to delimit identifiers. What are
sql identifiers? They identify an sql object, e.g. if you create a table, the table name and column names are
identifiers.

Double quoting identifiers is usually optional, so either of these is valid:
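For example (a reconstruction matching the description below):

    CREATE TABLE My_table (A INTEGER);
    CREATE TABLE "My_table" ("A" INTEGER);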

Why are these different? Because Postgres automatically lowercases identifiers (contrary to the sql standard that
requires auto-uppercasing). This means that the first creates my_table with column a, while the second creates My_table with
column A. This becomes an issue when you need to access the table:
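A sketch of the access problem:

    SELECT * FROM My_table;    -- lowercased, so this accesses my_table
    SELECT * FROM "My_table";  -- quoting is required to access My_table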

Basically, once you use double quotes at object creation time (e.g. create table, create function, create
schema) and use a character that is affected by double quotes (uppercase, punctuation, spaces), you must use double quotes when
accessing the object. If you create a table named "Oh wow, this is super-cool!", you are going to need to double quote that table name
each time you use it.

Also be aware that many tools (e.g. pgAdmin) auto-double-quote supplied identifiers, so be careful when using
any double-quote-affected characters with these tools. In addition, Postgres's auto-lowercasing, rather than the auto-uppercasing the
standard requires, can cause additional quoting needs when porting applications.

Having just attended the Southeast LinuxFest, I was reminded of the elements that make a great
conference site:

Hotel connected to the conference venue

Affordable hotel

Conference venue walking distance to restaurants and entertainment

The Southeast LinuxFest location met all of these criteria, but that is rare — only one-third of conferences I attend meet these
criteria, and when these criteria are not met, you feel it.

When the primary hotel is not connected to the conference, attendees have to take everything with them to the conference because they
can't easily return to their rooms, and once they return to their rooms at night, it is often hard to get them to go out. Expensive
hotels often drastically reduce attendee count because many attendees must pay for conference expenses with personal funds. Conferences
not near food and attractions often lack the fun evening outings that attendees remember long after the conference ends.

Abstractly, you would think that conference technical content alone determines a conference's success, but it is often the nebulous
"hallway track" that is most memorable because of
the ability to interact with many people in situations similar to your own or who possess expert knowledge. The above conference elements
are key in making for meaningful "hallway track" interactions.

So, if you are a conference organizer and find a site that meets all these criteria, grab it — there are not many, but they do exist,
and having all these elements is half the battle of creating a successful conference.

I have accepted two speaking engagements in the Dominican Republic in mid-June; my
website has dates and cities. I don't have any web links for the
events, so if you want to attend, email me and I will get you in touch with the organizer.

I previously covered timing of queries from an external perspective.
However, it is also possible to time queries internally. Each data
manipulation language (dml) command (select, insert, update, delete) goes through three stages:

parser

planner

executor

You can actually time how long each stage takes using these server settings:
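The settings are (each defaults to off):

    log_parser_stats = on
    log_planner_stats = on
    log_executor_stats = on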

Log_parser_stats actually produces three sections of output (parser, parse analysis, and rewriter statistics), while the other settings
are specific to the planner and executor. These values are from a kernel function called getrusage. It returns verbose output about the
amount of cpu, i/o, page faults, and context switches made during each internal stage of query
execution. You can get cumulative rusage statistics using log_statement_stats = on.

The numbers can be overwhelming, but it does allow users to find out how long each part of a query took (see "elapsed"), and how much
parsing and planning (optimization) time contributed to the total query duration.

Ideally the majority of time should be spent in executor. If you add the parser and planner times, it gives you an idea of the reduction
in query time possible by using prepared statements, assuming
you run the statement more than once per session. The Genetic Query Optimizer
(geqo) is designed to short-circuit planning when planning overhead might significantly increase query duration.
Explain allows you to see the plan (generated by the planner)
that would be run by the executor.
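A sketch of a prepared statement that pays the parse/plan cost once per session (table is hypothetical):

    PREPARE get_customer (INTEGER) AS
        SELECT * FROM customer WHERE id = $1;
    EXECUTE get_customer(42);  -- reuses the stored parse/plan work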

Hopefully this helps you understand how to analyze a query by looking at the internal processes that contribute to total query duration.

There are actually several ways you can time a query, and each has value depending on what you want to measure. You can time a query at
the server (from the time it arrives until the time it completes) by setting the server parameter log_min_duration_statement = 0.
However, that does not measure the network overhead, which can be done using \timing in psql. Then there is the connection and
application overhead that is measured by time at the command-line. This command measures all three aspects of a query:
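The command was along these lines (a reconstruction; database name is illustrative):

    $ time psql test <<END
    \timing
    SELECT 1;
    END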

Notice the server-timed duration is 0.318, but with network overhead it is 0.435, an increase of 36% in total duration. Of course, this
is a simple query, but it does show that network overhead is not trivial. Including application/connection overhead shows 6ms,
overwhelming the previous measurements. That test was with Unix domain sockets —
here are the same measurements for tcp/ip sockets:

Same server timing, but the network overhead has increased 57%. If we look at just the network times (minus the server query time), we
get:

Communication method    time (ms)    Change from previous
Unix-domain socket      0.117        n/a
tcp/ip, no ssl          0.153        +31%
tcp/ip, ssl             0.420        +175%

With ssl enabled, total application execution time grew from 6ms to 24ms, a four-fold increase. This is expected, given
ssl's high overhead for secure key negotiation.

As you can see, network overhead, and certainly application and connection overhead, are significant factors in determining query
duration. When debugging slow queries, knowing what you are timing greatly increases the odds of success, and I hope this blog post
helped.

There are 500k to 1-million words in the English language,
so you would think that Postgres would be able to find a unique word for every aspect of the database, but unfortunately, that is not
true. There are two cases where Postgres uses a single word to mean two different things, sometimes leading to confusion.

The first word is "cluster". One meaning represents the data
directory and postmaster that controls it, e.g. a Postgres instance. One install of Postgres binaries can be used to create several
Postgres clusters on the same machine. The second usage of "cluster" is as an SQL command called
cluster, which reorders a table to match an
index (to improve performance). The good news is that the meaning of cluster is usually clear from the context.

The same is not true of our second case, "schema". Traditionally, when someone talks about a "database schema", they are referencing
table or object structure defined using a data-definition language (ddl) command, e.g. create table. The second use of
"schema" is to define a namespace to contain objects, e.g.
create schema. (I have already
covered the value of using such schemas.)

The double-meaning for the word "schema" is often more problematic than for "cluster". For example, the pg_dump manual page uses both meanings of the word "schema":

-N schema
--exclude-schema=schema
Do not dump any schemas matching the schema pattern.
-s
--schema-only
Dump only the object definitions (schema), not data.

Tell me that is not confusing! One hint that we are talking about two different meanings for "schema" here is that the first one is
abbreviated -N, a reference to "namespace". Internally, Postgres calls schema containers "namespaces", e.g. system table
pg_namespace holds the list of schema containers.

So, if you are having a conversation, and you hear the words "cluster" or "schema", don't be surprised if you get a little confused —
hopefully things soon get clarified and the conversation can continue. (Two more words for the Postgres
drinking game?)

Acid (atomicity, consistency, isolation, durability) is a fundamental aspect of
relational databases. It isn't easy to implement acid, and some database systems take shortcuts in their implementation,
particularly for Data Definition Language (ddl) commands. (Oracle
notoriously commits multi-statement transactions when a ddl command is issued; this
wiki page outlines the support for
transactional ddl among database systems.)

Postgres rarely takes shortcuts, and didn't take shortcuts in implementing transactional ddl. Just like
dml can be wrapped in a multi-statement transaction, so can
ddl. Why is this useful? Well, ddl operations are rarely isolated to a single object — they often are part of a
larger roll-out that affects many objects. By using transactional ddl, you allow the ddl to either be applied
atomically, or rolled back, which greatly simplifies deployments. Transactional ddl can also be used to replace objects
in-place, by performing the create, drop, and rename in the same transaction.
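A sketch of an in-place replacement (table names are hypothetical):

    BEGIN;
    CREATE TABLE orders_new (LIKE orders INCLUDING ALL);
    -- populate orders_new here
    DROP TABLE orders;
    ALTER TABLE orders_new RENAME TO orders;
    COMMIT;  -- or ROLLBACK, and none of the ddl takes effect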

This presentation I attended at
ConFoo praised Postgres's support for transactional ddl — it is no wonder it was called Painless,
Version-Controlled Database Refactoring. Hopefully you perform "painless" deployments too by using transactional ddl.

I often ask people about their backup configuration because I want to make sure my own server backup procedures are sufficient. I
remember asking someone a few years ago about their personal backup methods, and they said they used
raid1 (mirroring), and didn't need backups. That
seemed insufficient, and when I asked about recovery of deleted or improperly-modified files, I wasn't given a satisfactory answer. The
bottom line is that while raid1 protects against the failure of a drive, it does not address other causes of data
loss. Here is a table I created to illustrate the data loss and recovery options I use for my home
server:

Failure case                                     Recovery method
Deletion of file, discovered immediately         Restore from daily or two-hourly backup
Deletion of file, discovered < 30 days           Restore from monthly backup
Deletion of file, discovered >= 30, < 90 days    Restore from quarterly backup
Deletion of file, discovered >= 90 days          Unrecoverable
Drive failure                                    Restore from previous day
Server destroyed                                 Restore from previous month
Server and data safe destroyed                   Restore from quarterly backup

How likely are these failure cases? This report
(article) shows the statistics for data retention failure:

Failure cause           Pct.
Hardware failure        40%
Human error             29%
Software corruption     13%
Theft                    9%
Computer virus           6%
Hardware destruction     3%

As you can see, there are many failure cases from which raid1 cannot
recover. Don't believe it? Watch this video about a Pixar film almost lost due to file
deletion (and backup failure).

How does this relate to Postgres? Well, if you are only using
replication, you are effectively doing
raid1 at the server level — changes to one server are "mirrored" to the other server, and instead of recovery
from disk failure, you can recover from complete server failure. We already saw how limited raid1 is in covering
data retention failures, and those same limitations apply to most replication systems. Yes, there can be a slight replication delay, but
the delay is usually unpredictable. (The community has discussed the idea of adding a feature to specify a minimum streaming replication
delay.)

In summary, replication is not enough — it is necessary to use some other method to allow for recovery to some point in the past —
before the deletion, unintended modification, or corruption happened. Most users use
continuous archiving or
logical backups for this purpose. This is why the combination of streaming
replication and continuous archiving is so potent — it allows the standbys to disconnect and reconnect/catch-up easily,
and it allows for point-in-time recovery to address the data retention failures not covered by streaming replication alone.

How far back in time do you need to keep backups? Well, an ideal setup has many recent backups, then decreasing retention of backups,
perhaps perpetually retaining backups at some predefined interval, e.g. quarterly.

Hopefully this blog entry has helped stress the importance of addressing all data retention failures, and hopefully you will never have
to say that your backup plan is only raid1/replication.

If continuous upgrades can be done for a web browser, why not for a database? Well, OK, there are some good reasons a database shouldn't
do this, but it certainly is possible to do upgrades more cleanly than we do now. Postgres packagers are already starting to implement
upgrades using pg_upgrade. Some developers might prefer auto-upgrades so they are always developing against the most recent Postgres
version.

Automatic upgrades are something we probably never would enable unconditionally, but we might someday make it optional.

Pg_upgrade
(documentation,
presentation) was
created in 1998 to allow
in-place major version upgrades of Postgres. It was written as a shell script, as were many of our command-line administration tools at
the time. (Postgres didn't support Windows natively until 2005.) In
those early years, Postgres changed its on-disk format frequently in major releases, causing pg_upgrade to be only occasionally useful for
major version upgrades (6.5,
7.1, disabled in
2002).

Postgres feature additions soon made the shell script method unworkable, and binary-upgrades were unsupported for many years. In 2006,
EnterpriseDB took on the job of rewriting pg_upgrade in
C, a task that had been discussed many years on the mailing lists — this email
thread, titled "Upgrading rant", gives you an idea of the tone of
previous discussions.

In 2009, EnterpriseDB realized that a successful binary-upgrade utility had to be community-developed, so they changed the license from
GPL to BSD and encouraged me to work with the community to improve the tool. In 2010, pg_upgrade was
added to the official Postgres 9.0 release. Postgres 9.1 saw
only performance improvements.
Postgres 9.2 will improve the user interface and error
reporting. It is significant that pg_upgrade has not required large changes to support new major releases, only minor improvements.

Pg_upgrade's goal of allowing in-place upgrades is a bold one, but history shows that it has achieved that goal. Pg_upgrade has had its
share of bugs along the way, but their frequency is diminishing as more and more people use pg_upgrade successfully. The community is
increasingly supportive of providing in-place upgrade capabilities for users, so hopefully pg_upgrade will remain useful for many years to
come.

As part of yesterday's PGCon Developer Meeting, I hosted a discussion
about adding resource parallelism to Postgres, which I blogged about
previously. Josh Berkus has kindly summarized the
results of that discussion. I am hoping to help track and motivate progress in this area in the coming months; it is a multi-year
project to complete.

You might have heard of the template1 database before, or seen it in the output of pg_dumpall:

REVOKE ALL ON DATABASE template1 FROM PUBLIC;
…

It is one of those databases, like template0, that sit in the background and appear in psql \l output. While you can go for years
never knowing about template1, it does have a useful purpose.

As its name suggests, it is the default "template" used for creating new databases. This means, if you modify template1, every new
database created after the modification will contain the changes you made. So, if you need a table, schema, function, server-side
language, or extension in every database, add it to template1 and you will have it in every newly-created database. The Postgres
documentation covers this in detail.

I mentioned template1 is the default template database, but you can use other databases as templates for newly-created databases.
Create database has a
template option that allows you to specify a different database to copy from. For example, to create a copy of the
production database called demo, have everyone disconnect from the production database, and use production as the template to
create demo. The demo database can then be modified without affecting production. Create database does a
file-level copy, so all the I/O is sequential and much faster than a dump/restore of the database.
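
In SQL, the copy described above is as simple as this (with everyone first disconnected from production):

CREATE DATABASE demo TEMPLATE production;    -- file-level copy of "production" at this moment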

As a side note, template0's purpose is often mysterious too, particularly because you can't even connect to it:
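
$ psql template0
psql: FATAL:  database "template0" is not currently accepting connections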

There must be something special in there because it's inaccessible! Well, sorry to kill the mystery, but template0 is used solely by
pg_dumpall to properly dump out the customizations made to
template1. If template0 did not exist, it would be impossible for pg_dumpall to determine the customizations made to template1
and dump those out appropriately.

I have completed the Postgres 9.2 release notes I
started seven days ago. Writing the release notes is always a taxing
experience. I have to basically dedicate an entire week to the process of digesting 1100 commit messages to produce 3000 lines of SGML
text. Once I am done though, it is rewarding to see the finished product. It is like working on a painting all year, and spending a
frustrating week framing it and hanging it on the wall — once it is hung, you stand and admire it, and forget much of the sweat it took
to produce. I am sure many community members feel the same.

Curious how the 9.2 release item count compares to previous major releases? Here are the results:

Release    Items
9.2        241
9.1        203
9.0        237
8.4        314
8.3        214
8.2        215
8.1        174
8.0        230
7.4        263

Of course, this is just the first draft of the release notes; the 9.2 count will change regularly until the final release.

My children have traveled to many conferences with me, and have heard many webcasts and training calls at home. I guess, after hearing
about Postgres so often, they pick up certain phrases of interest, and the big one for them is "shared buffers". Anytime someone uses
those words, they start howling and make up some funny sentence using the words.

There are a variety of beer drinking games that trigger drinking when a
word is said — perhaps "shared buffers" is enough on its own to make a drinking game.

Having reported the methods for finding the size of the kernel cache on
Linux, I wish to highlight the importance of the postgresql.conf setting
effective_cache_size.

Unlike other memory settings that control how memory is allocated, effective_cache_size tells the optimizer how much cache is present in
the kernel. This is important for determining how expensive large index scans will be. The optimizer knows the size of
shared_buffers, but not the
kernel cache size, which affects the probability of expensive disk access.

The kernel cache size changes frequently, so run free during a period of normal system load and use that value to set effective_cache_size.
The value doesn't have to be perfect, but just a rough estimate of how much kernel memory is acting as secondary cache for the shared buffers.
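
For example, if free shows roughly 18GB of kernel cache during normal load, something along these lines is a reasonable starting point (the 16GB value is an assumption; adjust it for your system):

SET effective_cache_size = '16GB';    -- session-level experiment; set it in postgresql.conf to make it permanent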

Producing the release notes starts with a command that generates 11k lines of output, which I eventually reduce to 3k lines of SGML-marked-up release notes. I have
outlined the steps in the past, though git has improved the speed of
researching items.

Unfortunately, this brings up two more questions: Why is "MemFree" so small on this idle system, and what is the difference between
"Buffers" and "Cached"? Another way of displaying this information is using the Linux free command:

Addressing the first question, we see the same 3.8GB for free memory displayed on the first line under "free". This display is so often
misinterpreted that it necessitated the creation of an explanatory website.

The Mem line output by free shows memory from the kernel perspective. The free value of 3.8GB represents kernel memory not
allocated for any purpose, i.e. free, from the kernel's perspective. The bulk of the memory, 18GB, is shown as used for caching
("cached"). The second line, confusingly labeled "-/+ buffers/cache", represents memory from the process perspective. It is labeled that
way because the "buffers" and "cached" have been removed from the "used" column and added to the "free" column, i.e. -/+. From the process
perspective that memory is immediately available because it represents read cache that can be discarded as soon as memory is needed by
processes. This labeled output summarizes the columns (details):
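
             total     used     free   buffers   cached
Mem:         A+B+C+D   B+C+D    A      B         C
-/+ buffers/cache:     D        A+B+C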

"A" represents kernel free memory, and "D" represents memory used by processes. You can see that "B+C" is added to the "used" column in
line 1, and added to the "free" column in line 2. "total" represents all memory, less memory used for basic kernel operation. In fact, "total" always equals "used" + "free" for each line: (B+C+D) + A on the first line, and D + (A+B+C) on the second.

While free is confusing, its layout does suggest the meaning of various columns. Open source has often debated what "free software"
means, and it seems the Linux kernel also has multiple definitions of the word "free".

The second question regards the difference between "buffers" and "cache". (Josh Berkus recently
blogged about a kernel bug that prevented most
available ram from being used as cache.) The definitive answer for modern kernels comes from this source code
comment:

Buffers: Relatively temporary storage for raw disk blocks
shouldn't get tremendously large (20MB or so)
Cached: in-memory cache for files read from the disk (the
pagecache). Doesn't include SwapCached

The buffers and caches can be cleared using these commands (at least in
Linux 2.6.16 and later kernels):
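
sync                                  # flush dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches     # as root; 1 = page cache, 2 = dentries/inodes, 3 = both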

There are several levels of caching used in a typical server — here they are, in order of increasing distance from the
cpu:

1. Cpu cache

2. Random-Access Memory (ram)

3. Storage controller cache

4. Storage device (disk) cache

All of these are a fixed size and set at hardware installation time, i.e. you can't move cache from one level to another. For Postgres,
there is flexibility in how #2, random-access memory, is allocated, and this provides a never-ending opportunity for administrators to
optimize their systems. The three ram allocation possibilities are shared buffers, per-session local memory (e.g. work_mem), and the kernel cache.

The last item, kernel cache, isn't really an allocation but rather is based on the ram remaining from the previous
two allocations. Some database systems avoid the kernel cache by directly writing to storage (direct i/o or raw
disk access). While Postgres does allow direct i/o for writing of the
Write-Ahead Log (wal) (via
wal_sync_method), it does not support
direct i/o or raw device control of data buffers. The reason for this is that Postgres tries to be operating
system-neutral, and allows the kernel to do read-ahead and write combining. This is often more efficient, particularly if the storage
subsystem has both database and non-database activity. In such cases, only the kernel knows about all the storage activity, and can
better optimize the workload. This also helps Postgres run well in
virtualized environments. The big downside of using the kernel cache is
that it often contains data already in the shared buffer cache.

During the next two months I will be attending events in the following cities: New York City, Ottawa, Charlotte (North Carolina), and
Boston — the details are on my website. I will also be doing training in many of
these locations. And I spoke in Philadelphia this week — I
guess this is what the New Postgres Era looks like.

I mentioned that Postgres supports multiple clusters, databases, and
schemas, but an open question is why use multiple schemas, rather than placing everything in the "public" schema? By default,
search_path places everything
in the public schema (assuming a schema matching the current user name does not exist):
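
SHOW search_path;
  search_path
----------------
 "$user",public
(1 row)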

Of course, search_path can be set at various levels, e.g. user,
database, session.

So, why use more than the public schema?

Easier object management - Rather than having thousands of tables/objects in a single schema, arrange them in separate schemas, either
by user or purpose. FYI, almost every object exists in a schema, not just tables and indexes, e.g. functions exist in schemas.
(Technically, only languages and global objects are schemaless.)

Allow multiple objects with the same name to exist in the same database - For example, schemas allow for two applications in the same
database to both have a "parts" table, e.g. vendor_app.parts, and production.parts.

Permission Control - Schemas have both create and usage (search)
permissions that apply to all objects in the schema. It is possible to
create a schema that only certain roles can use, and to remove creation permission from the "public" schema for read-only users.
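
For example, assuming a hypothetical read-only role named app_reader, a sketch looks like this:

CREATE SCHEMA vendor_app;                          -- private namespace for the vendor application
GRANT USAGE ON SCHEMA vendor_app TO app_reader;    -- allow object lookup in the schema
REVOKE CREATE ON SCHEMA public FROM PUBLIC;        -- stop ordinary users from creating objects in public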

When Informix introduced schemas in the mid-1990's, I didn't understand schemas and why all the tables were now prefixed with "public", so
I never used this useful feature. Hopefully this blog post helps explain the usefulness of schemas and why ignoring schemas, as I did in
the past, is not always a good idea.

I recently mentioned the use of connection poolers to reduce the
overhead of server-side language initialization, but I believe the topic warrants fuller coverage.

Aside from the ability to reduce language initialization overhead, the two major advantages of connection pooling are:

Reducing session startup time by preallocating and reusing database sessions

Reducing session management overhead by reducing the number of idle sessions

While these advantages tend to get lumped together in people's minds, they actually address two different issues. The first, reducing
session startup time, reduces the delay in executing the first query of a session, and reduces process creation overhead on the server.
It is particularly useful for databases with many short-lived sessions, and for operating systems, i.e. Windows, that do not use fork
(details about fork).

The second advantage, reducing session management overhead, improves overall performance by reducing the server overhead involved in
managing many sessions, i.e. if many of those sessions are idle, it is possible to speed up the active sessions by reducing the total
number of database sessions connected to the server. This is obviously advantageous when there are many sessions, but most are inactive,
i.e. the database is more efficient executing 20 active sessions than running 20 active sessions and managing 400 inactive sessions.

Postgres has two popular connection poolers, pgpool-II and
PgBouncer. The names are a little confusing, because pgpool-II does a lot more than
pooling. (Josh Berkus covered this issue recently.)
Connection poolers are also built into many application-building tools, like PHP and Hibernate. Whichever one you use, they all provide
the benefits listed above.

One tricky limitation is that database sessions can only be pooled per-database, meaning that if you connect to many different databases
in an unpredictable pattern, a connection pooler will be less useful.

Postgres does not provide a built-in connection pooler, and that is by design — external connection poolers can be placed closer
to the client, or on inactive servers, and they can be used to redirect client connections to new servers after a fail-over — a
built-in connection pooler would have none of these advantages.

I previously explained that Postgres allows multiple databases per
cluster. The outstanding question might be, when should I use multiple clusters, multiple databases, or multiple schemas? The following
table outlines the advantages of the various container types:

Feature                       Cluster   Database   Schema
Isolated Server Start/Stop       ✓
Connection Control               ✓          ✓
Private System Tables            ✓          ✓
Private Plug-Ins                 ✓          ✓
Isolated Administration          ✓
Shared Administration                       ✓         ✓
Isolated Resource Usage          ✓
Shared Resource Usage(1)                    ✓         ✓
Data Isolation(2)                ✓          ✓
Cross-Container Queries                               ✓

(1) A large number of data containers increases the usefulness of resource sharing, e.g. shared_buffers. Resource sharing includes
log shipping and streaming replication sharing.

(2) User and database names, being global objects, are visible in all
databases. It is impossible to query across databases, except via an external database session, e.g.
dblink. Schema permissions allow data access control, but
pg_class still shows all tables defined in the database.

Hopefully this chart helps users choose the proper container for their data needs.

First, many people think that running pg_dumpall and running pg_dump on each database individually are equivalent — they are
not. While pg_dumpall does back up each database individually, it also dumps out
global objects that exist outside any specific database, e.g. roles,
tablespaces. If you prefer to use pg_dump to dump each database individually, be sure to run
pg_dumpall --globals-only as well. Failure to restore global
objects before per-database objects causes all restored data to be owned by the restore user and be in a single tablespace.

The second pitfall is to believe that a logical dump of the database has captured all cluster modifications. While all the database
objects have been dumped, configuration file changes are not included in a logical dump and must be backed up separately, e.g.
postgresql.conf,
pg_hba.conf. This is why many wise administrators and packaging
systems move these configuration files out of the
data directory and into a directory that is regularly backed up by the operating system, e.g. /etc. With the configuration files moved,
the data directory can be skipped entirely during operating system-level backups.

So, if you perform logical backups, make sure you backup your global objects and configuration files as part of your backup
process.

The Postgres system is set up differently than some other database systems, and while long-time Postgres users consider its layout natural,
people coming from other databases are often initially confused. Therefore, it is helpful to explicitly mention some of those differences.

First, Postgres allows multiple databases in a single cluster (which is controlled by a single postmaster). This often confuses Oracle
users because in Oracle you can only have one database in a cluster. This also confuses MySQL users because they have multiple
"databases", but these databases share a single system catalog and allow cross-"database" queries. Effectively, they are "schemas",
rather than databases. Microsoft SQL and DB2 have a similar facility to Postgres, with multiple databases in a cluster, and multiple
schemas in a database.

As part of this multi-database, multi-schema capability, there are some things that don't make sense to store in per-database tables, e.g.
roles, tablespaces. These are called global objects and a system catalog query can show them to us:
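
SELECT relname FROM pg_class WHERE relisshared AND relkind = 'r';    -- lists the shared (cluster-wide) system tables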

Looking at the list, it seems natural that they are not per-database — you certainly wouldn't want the list of available databases
stored in a single database. Role (user and group) information is also cluster-wide. Tablespaces are something that exists as part of
the file system, so it makes sense for them to be global too.

I hope this helps users who might have been confused by the way Postgres does things.

I have already blogged about
toast storage (and
TOAST queries), but I would like to highlight one aspect that is
easily overlooked by application programmers. There is an age-old debate over whether using select * is
good programming practice — many feel that wildcarding all columns of a table makes applications more brittle when columns are
added or removed, while others feel that specifying all columns in an application is more error-prone. If a table has only a few columns,
specifying them is pretty easy, but when the number of columns is large, specification of column names can be cumbersome. Of course,
having wide rows in tables has its own issues, but I don't want to get into that here — what I do want to cover is its effect
on toast'ed values.

As I mentioned in my previous blog post, long values are not stored in the main row, but in toast tables that have
additional access overhead. This is great when you only need the short values from a row, which is typical, but what happens when you use
select *? Well, if you actually need all the column values, using select
* or specifying the column name explicitly makes no difference, but many applications use select
* when they only need some of the column values, and this is where toast becomes an issue. If you
don't need to see some columns, and those columns might be long and hence toast'ed, it is wise to avoid selecting
them in a query — this will avoid toast table access and speed up queries significantly.
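
For example, given a hypothetical table docs(id, title, body), where body holds long, toast'ed text:

SELECT id, title FROM docs;    -- short columns only; no toast table access
SELECT * FROM docs;            -- also fetches body, forcing toast table reads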

With toast, some values are more expensive to access than others, so avoid unnecessarily accessing columns that
might contain toast'ed values. (Observation from Stephen
Frost.)

Unlike traditional database systems, Postgres is an object-relational
database system, meaning it relies heavily on plug-in objects to enable flexible behavior, and many of these objects are supplied as
shared object libraries (or dynamic-link libraries (DLLs) on Windows). Examples of external objects include the
PL/pgSQL server-side language, the
pgcrypto cryptographic library, and the
PostGIS geographic information system. These are all implemented as shared libraries that are
dynamically loaded into the database server when accessed. The process of loading a library into a running executable is called
dynamic loading and is the way most modern operating systems access libraries (the
non-dynamic method is called "static linking").

However, this dynamic loading is not free — it takes time to load a dynamic library the first time it is accessed. The library
must be mapped into the process address space, and its symbol table read, and this must happen before any functions in the dynamic library
can be executed. This is where shared_preload_libraries is useful — on operating systems where child processes are
forked rather than created from an executable (all but Windows),
shared_preload_libraries loads the shared library into the parent of all child processes (the
postmaster). This causes all child processes, i.e. new sessions, to
inherit an address space that already contains the needed shared library, speeding up calls to any functions in that library. (Fork is
illustrated in my Inside PostgreSQL Shared Memory presentation,
slide 12.)

In one test, I found that setting shared_preload_libraries = '$libdir/plpgsql' in postgresql.conf made a simple plpgsql function noticeably faster on its first call in a new session.
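
A trivial function of this kind is enough to show the effect (a sketch; the function name here is made up):

CREATE OR REPLACE FUNCTION preload_test() RETURNS integer AS $$
BEGIN
    RETURN 1;
END;
$$ LANGUAGE plpgsql;

SELECT preload_test();    -- the first call in a session loads plpgsql.so unless it was preloaded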

PL/Perl, being a more complex library, showed a more significant improvement,
and its high overhead suggests the advantage of using pooled connections so the Perl interpreter doesn't have to be started for every new
connection. Notice that loading unneeded libraries can add slight overhead, so use shared_preload_libraries selectively.

In conclusion, testing shows that shared_preload_libraries can improve system performance (except on Windows) if your server heavily uses
shared libraries.

There are three ways to install Postgres, and they are all listed on the Download menu on the
Postgres website. While the web authors did their best to explain the differences between the various installation methods, I thought it
would be helpful to more fully explain their advantages and disadvantages. The three methods are:

One-Click Installers

Advantages: Provide an easy-to-install Postgres environment with access to many add-on modules via
Stack Builder. This is ideal for first-time Postgres users. It is also possible
to use the installer in non-interactive mode. You
can even unzip the file and run the binaries manually.

Disadvantages: As mentioned on the Postgres web site, the one click installers do not integrate with platform-specific
packaging systems.

Platform-Specific Packages

Advantages: These are better integrated with other software installed on your operating system. This is ideal for
production servers that rely on operating-system-supplied tools.

Disadvantages: Requires work to identify which other packages are needed for a complete solution. Obtaining newer major versions
of Postgres on older operating systems might also require work.

Source Code

Advantages: Allows selection of specific configure and compile options for Postgres binaries, and allows the addition of patches
to enhance or fix Postgres. This is ideal for experienced users who can benefit from additional control of Postgres.

Having attended several conferences recently, I saw confirmation of my previous
observation that Postgres is poised for a new wave of adoption. The last
time I saw such an upturn in adoption was with the release of Postgres 8.0 in 2005, which
included a native port of Postgres to Windows. You can see the increase in the
volume of postings to the
Postgres jobs email list. (The spike in January of 2008 was Sun
buying MySQL.)

And that's not all — Robert Haas's recent blog
post about Postgres scaling linearly to 64-cores in upcoming Postgres 9.2 means that, by the end of the year, Postgres will be a major
contender on high-end hardware. We have always done well on small to medium-sized servers, but we are now poised to compete heavily on
the high-end.

Postgres adoption is probably five years behind Linux's adoption. This video explains the
increased popularity and development model used by Linux, which is similar to Postgres.

Most software developers agree that software patents are hopelessly broad and that violation is almost unavoidable for any non-trivial
software project. I won't even go into examples of software patents causing industry chaos — new high-profile patent attacks
appear every month. The risk of software patents to the software industry has been likened to a mine field, and for good reason
— it is impossible to guess when software patents will create a crisis.

The Postgres team has been circumspect about software patents — we address them when someone brings up a possible violation (as
happened in 2005 with a patent on queues), and we try to avoid
patented ideas in our code, as you can see from our recent
discussion about compression algorithms. Several years ago, one
of the Postgres support companies analyzed our source code for possible patent violations, and reported we were clean, but given the "mine
field" unpredictability of patents, it is hard to trust the report's accuracy.

If most people believe the software patent system is causing more harm than good, you would think it would have been corrected by now. Well, it has
been this way for at least a decade, and doesn't seem to be getting any better. In fact, the number of patent attacks, particularly by
competitors, has increased. Even companies that aren't excited about patents (e.g.
Google) are having to file or
purchase patents as a way
of defending themselves from competitors.

As a result, there's a deep and persistent rift between the community of computer programmers, who are overwhelmingly hostile to software
patents, and patent lawyers who seem mystified by all the outrage. The job of a patent lawyer gives him a systematically skewed
understanding of how the patent system affects the software industry.

Hopefully, someday, these two groups will get together and communicate in a way that finally improves software patents. Perhaps, if the
number of software patent attacks continues to increase, there will be so much disruption in the software industry that reform will be
demanded by everyone.

Still think the patent system can continue unchanged? Listen to the upsetting audio report,
When Patents Attack!
(transcript,
summary), about people attacked by patent trolls. This is a
different problem from competitors attacking with patents.

Postgres has been lucky to avoid attacks so far, and we are always vigilant, but in the current system, we can never relax because, in a
mine field, you never know where the mines are buried.

We just completed PGDay Austin, and it was a great event. It was a single-track conference in a serene
hotel — this gave
the event a polish and cohesion that I have rarely felt at Postgres events. Everyone knew where they should be, nothing felt rushed, and
the talks were all of high quality. I wouldn't be surprised to see double the attendance at next year's conference.

The most exciting talk for me was Will Leinweber's Schemaless SQL (down arrow key advances
slides). The talk started by explaining why developers like the document storage model, and Will explained his experience moving from
document storage databases to Postgres. Will found our hstore extension to
be ideal for storing the type of unstructured data he used to store in document databases. Hstore gave him the best of both relational
and document database worlds.

What really excited me was the combination of our
JSON data type coming in
Postgres 9.2 and the use of the Javascript stored procedure language PLv8 —
combining these gives you a data type (JSON) that has a stored procedure language designed to operate on that data type (Javascript). The
linkage of data type and stored procedure language is something new for Postgres and suggests a powerful new direction for application
programmers. The tools are not ready for production use yet, but I can see data type & language combinations being powerful.

One big highlight of the conference was the attendance of the SQL guru Joe Celko. Joe
lives in Austin and kindly attended, asked questions during the talks, and even went out with us for drinks after the conference. Magnus
Hagander, Greg Smith, and I are flying to Virginia tomorrow to attend PG Corridor Day DC, and then
PGDay NYC on Monday.

Documentation quality can make or break an open source project. Don't believe me? Ever used poor documentation? How did it make you
feel about your continued use of the software?

Just as Postgres's high code quality doesn't happen by accident, neither does its documentation quality. Our community is regularly
updating and improving our documentation — since 1998, our documentation has averaged 1.3 commits per day. Don't believe me?
See for yourself — that's 6913
documentation improvements over 14 years.

New features often require documentation improvements, but that's not all. We are constantly on the lookout for ways to improve our
documentation. Here is an email report (with patch) from last
Thursday indicating our documentation about numeric storage requirements was incorrect, and here is the
commit on the same day fixing
it. We are also on the lookout for people who are confused by our documentation, e.g. this
commit clarifies behavior
described in this bug report.

There are also more structured efforts to improve our documentation. A few years ago I started a
project to systematically improve our documentation, which has led to
significant stylistic improvements. PGCon, the undisputed Postgres hackers conference, this year is
sponsoring a documentation improvement day as part of their conference.

So, as you can see, it does take work to make high-quality documentation. You can help by reporting anything that is unclear, incorrect, or
in need of improvement — the community will do its job of making sure your report gets the attention it deserves, so we can
all continue to be proud, not only of the Postgres code, but of its documentation.

The Postgres documentation is extensive — no question about that. In fact it is so extensive (2704 US Letter pages in
PDF for Postgres 9.1.3) that it has discouraged the
publication of commercial Postgres books.

The Postgres documentation is available in two popular formats, HTML and PDF. The HTML format is most prominently
displayed on our web site (9.1
version). Thom Brown masterfully improved the layout of our
HTML documentation to be more esthetically pleasing about 18 months ago, and we made page navigation improvements about six months ago.

The PDF version is great for screen viewing and printing. If you are using a PDF viewer that is only showing you the PDF pages, you are
missing out — sequentially scanning 2704 pages looking for something is frustrating. To make the PDF usable, you have to enable
the left side-pane in your PDF viewer, and choose index view. For example,
this is the page-only view,
this is the thumbnail view, and
this is the index view. The index view allows you to see all the
chapters and sections present in the PDF, and you can click on a title to see the referenced text. You can also open
index chapters to see sections and subsections. Without the index view,
you easily get lost in PDF files; with an index view, PDFs become better than physical printed pages. And if you prefer printed pages,
you can always give the PDF file to a local printer like
Staples and have them print the Postgres manuals for less
than USD $100. However, make sure you tell them it is acceptable to use multiple bindings — most printers are unable to
single-bind books this large.

Why should you care about libpq? It sounds scary — not as scary as "PostgreSQL", but still scary.
Libpq is the C interface library for Postgres. Your reaction might be, "But I don't use C, I use
Python, PHP, etc., so libpq isn't useful to me." Hold on! Don't overlook libpq so quickly.

Libpq provides a C interface for clients communicating with Postgres servers over a network (and via Unix domain sockets). It provides
authentication handling, query submission, and query result processing, as well as other utility operations.

Why is it important for people programming in non-C languages? Well, the Perl, Python, Ruby, etc. interfaces to Postgres use
interface libraries (e.g. DBD::Pg, PyGreSQL), and those libraries need to communicate with Postgres servers too. It is inefficient for each
library to maintain its own code to do network communication with Postgres, so they often use libpq internally. In fact, the only two
popular client interface libraries that don't fully use libpq are the Postgres JDBC driver (because it
is a Type 4 driver, and hence written in pure Java), and the Postgres ODBC driver, which
mostly uses libpq but has its own network code to handle special ODBC requirements; all other interface libraries only use libpq.

So, what good is that to me, you might ask. Well, libpq has many features, covered in
70 pages of the Postgres manual. From those manual pages, you can learn more
about libpq features that might be exposed by your interface library, like
connection parameters and
SSL support. Some features are enabled in all interfaces, like the use of
environment variables (e.g. PGHOST, PGPORT) and
client password storage (the ~/.pgpass file). What is nice about the last two libpq features
is that they can be used without changing the application, so they are useful to users and developers.

So, as you can see, libpq is quite useful, even if you are not programming in C. Hopefully this blog entry has given you a few ideas of
how libpq can make your programming life easier and more productive.

Open source leadership might sound like an oxymoron — how can you have leadership when there is no organizational structure and
no direct compensation for work performed? Well, let me ask a different question — how can a group of mostly volunteers
create software that rivals that created by billion-dollar companies?

So, combining these questions, is it possible for volunteers to create great software without leadership? No, of course not, but with
leadership, it is certainly possible, as Postgres proves. So, where is the leadership? How does it work, and how effective is it?

Certainly, this is a baffling question, not only to outsiders, but often to people who have worked in the open source community for years.
You might have a clue, but it is very difficult to articulate. For me, it was unlocked by hearing John Maxwell's
Five Levels of Leadership
(book). I have heard John several times at annual
Chick-fil-A Leadercast conferences (next event, May 4), but last year he talked about
leadership in a way I had never heard before — his ideas are captured in this
video.

The lowest of John's five levels, position, does not exist in pure open source communities — even core status or committer
rights have little bearing on leadership capabilities. What is striking about the five levels is that open source communities have to
operate purely on the other four leadership levels, and while these levels are harder to accomplish, they produce teams that are much
more productive and efficient. In fact, in the second part of his presentation, he
states that the ability to lead volunteers is the true test of a leader.

The bottom line is that only great companies operate at the highest levels of leadership — our community is forced to operate at
those levels, and that is part of the magic that makes us great.

My recent blog entry about
pg_upgrade improvements in Postgres 9.2 prompted a community
discussion about the time-consuming requirement of generating
optimizer statistics once pg_upgrade completes. This requirement makes pg_upgrade less useful because some queries run slower while
optimizer statistics are missing. The agreed solution was to have pg_upgrade create a script that generates increasingly more-accurate
statistics — this allows the new cluster to generate minimal statistics in minutes, with full statistics taking perhaps an hour
or more to complete.

Fortunately, the script created automatically by pg_upgrade in Postgres 9.2 works just fine in earlier versions of Postgres, so I am
supplying links (Unix,
Windows) for users who are currently using pg_upgrade. (Those
upgrading from Postgres 8.3 will need to change the final "--analyze-only" to "--analyze"; pg_upgrade automatically handles this when it
creates the script.) I hope this helps, and I am interested in feedback about the script's effectiveness.

Only minor improvements are planned for pg_upgrade in Postgres 9.2, and
that is a good thing — it indicates that the pg_upgrade code has reached maturity and is now considered a reliable way to
perform rapid Postgres major upgrades. No additions are (yet) required to allow pg_upgrade to support all the new Postgres 9.2 features.
A few features planned for the future are being designed to allow
pg_upgrade to easily handle them.

The major improvement just
added to Postgres 9.2 is
improved logging, which will allow easier diagnosis of pg_upgrade failures. On a related note, I got a report from a happy tester that
pg_upgrade upgraded their 2.5 terabyte database to Postgres 9.1 in seven minutes, using link mode.

Recently three people told me that they started seeing increased interest in Postgres 2-3 months ago. I have seen increased interest in
Postgres at conferences over the years, but this was described as different. Rather than interest from developers and open source people,
who are the usual conference attendees, the recent increase has been described as organizations wishing to switch large parts of their
infrastructure to Postgres. This has caused a huge spike in the demand for Postgres services and employment. A good example is a recent
blog post (with
video) of two Microsoft SQL Server DBAs expressing their excitement about Postgres. This
is good news for the project, and foreshadows much excitement to come.

I am bursting with news from Montreal's ConFoo conference, which has about 600 attendees. This is the first
time I am attending, and I am told the first time there are Postgres-specific talks. I gave two Postgres talks yesterday (as I
mentioned), and Ilia Alshanetsky gave a great
Introduction to PostgreSQL talk
(slides). The talk was smoothly presented and well received. I have seen many presentations
about specific Postgres features, but few that give a general introduction to Postgres, particularly with an evangelical focus; in fact,
the last great one I saw was in 2005.

There are some amazing talks at this conference, and the percentage of talks that are helpful to me is very high. I saw
JVM Internals for Dummies, which explained how the
OpenJDK HotSpot optimizer chooses to inline function calls and
convert byte code to CPU instructions by using run-time profiling. This is similar to the
discussion the Postgres community had about inlining some of our sort
code for Postgres 9.2, except the JVM makes the decision at run-time using profiling, while we have to make the decision at compile time
because we use C.

Today, there was an excellent talk about
geolocation (slides) which had good things to
say about Postgres and PostGIS, as well as PostGIS users
OpenStreetMap and MapBox. There was also a good scaling
talk
(slides) about using cloud services, and Postgres got a good
mention there, though the uneven performance of Amazon's Elastic Block Store was highlighted.

And, finally, the hotel is quite elegant.
Because of the French influence here, the meals, breads, pastries, and desserts are fantastic. Also, from my hotel room, I have a
picturesque view of a family of ducks living in the snow-covered hotel courtyard, enjoying a heated pond and small shelter
(image). It certainly is relaxing to watch — what a great idea.

My motivation for writing the talk is that, while I have seen several Postgres common table expression (CTE) talks, they started at too
advanced a level for me to understand how all the CTE parts worked together. My talk starts at a basic level and adds features slowly
until complex CTE queries are shown. The audience seemed to like the talk and had good questions, so
I will take that as a good sign.
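
For readers who have not seen one, a minimal CTE, using a hypothetical employee table, looks like this:

WITH dept_total AS (
    SELECT dept, SUM(salary) AS total
    FROM employee
    GROUP BY dept
)
SELECT dept, total
FROM dept_total
ORDER BY total DESC;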

During recent training, I realized there are five ways to
control who can access Postgres, and it isn't totally clear in the documentation how they fit together. There are basically a series of
gates that users must pass through to connect to a Postgres database. I thought I would list them here, in the order they are applied:

listen_addresses: This controls which network
interfaces Postgres will listen on for connections. This defaults to "localhost" on most Postgres distributions, including the source
install. However, the click-through installers default to
"*", meaning they listen on all interfaces by default.

Today I realized that I have nine events completed or scheduled for 2012, and
they are all in the eastern part of North America (except for Austin, Texas). This is exciting because I am
on track to cover the same number of conferences as last year, but without the
crazy travel schedule. Not that I don't love going to distant lands, but it is
also nice to stay close to home for a while. This might become typical as the density of Postgres users increases.

I got many requests to hear my Virtualizing Postgres presentation,
so I have placed the first part of my talk online. Part 1 is a
virtualization primer. It took a lot of digging to find these details, and I have included many URLs for people who want more information.
Unfortunately, this part doesn't cover many Postgres-specific aspects, which will be covered in parts 2 and 3.

A few people have asked where I am presenting this talk; I have no idea, but perhaps it will be chosen by the organizers of one of the
upcoming USA one-day conferences.

I am excited to be attending all three events, along with Greg Smith and
Magnus Hagander. I will also be doing
training in Austin and New York City. (I am already doing
training next week in the Reston area.)

As you can see, the new patch shaves about 3 seconds off of my test. That isn't a huge win for sorts that have to hit the disk, but for
in-memory sorts (the last row), it yields a 20% improvement, which closely matches the ~25% reported by testers of the patch. This is one
of the many sort improvements coming in Postgres 9.2.

Postgres is an ideal database to run in a virtual environment or public/private cloud — one reason is that Postgres relies
heavily on the operating system, rather than using features like raw devices. Second, its license is obviously very flexible for virtual
deployments.

I am often asked about running Postgres in virtual environments, and I usually answer that it runs just fine — and it does. However, I
am starting to realize that I am not answering the more complex questions of which virtualization technology to choose, and what is the
performance and reliability impact of virtualization.

I have started writing a talk, Virtualizing Postgres, which will explore this. I had great trouble finding details of exactly how major
virtualization approaches differ, how CPU-accelerated virtualization works, how additional file system layers affect performance, and
how these affect Postgres. I hope to share these details, and more, in my talk.

Once I present the talk at a conference or webcast, I will add the presentation to my website.

As you can see, it uses 360MB for temporary sort files, and takes 68 seconds. I tried changing the magnetic disk's ext3 file system from
the default data=ordered to data=writeback and saw the time drop to 50 seconds. (data=writeback is the
recommended mount option for Postgres file systems). Running the same on
an Intel 320 SSD took 36 seconds.

This server has 24GB of RAM, mostly unused, so I was surprised that
there was a difference between magnetic and solid-state drives because there was no need to force anything to disk, except for file system
crash recovery. This suggests that creating a symbolic link from
PGDATA/base/pgsql_tmp to a memory-based file system (tmpfs) might
be wise; I did that and the test took 30 seconds.

But, of course, another solution would be to increase work_mem so that temporary files are unnecessary, as in this test:
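
A sketch of such a test, with assumed values (the original query here was a large sort):

SET work_mem = '512MB';    -- assumed value, large enough to avoid temporary files
EXPLAIN ANALYZE SELECT random() FROM generate_series(1, 10000000) ORDER BY 1;
-- the Sort node should report an in-memory quicksort rather than an external merge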

tmpfs is probably much slower than the high work_mem test because of the extra comparisons necessary to perform the sort in one-megabyte
batches (the default work_mem setting is '1MB').

Frankly, I am confused why a system with so much free RAM (and free kernel cache) was delayed by file system writes. I realize the data
eventually must be written, but why would it delay these queries? I wrote a small
test program to write 360MB but it consistently took only 0.8 seconds. Sorry,
but it looks like I am going to have to conclude this blog entry with an unanswered question.

The first thing that stands out is the high DRS and VSZ fields, representing the large amount of shared memory allocated by the parent
and shared by the children — 32MB of that is shared buffers, another 9MB is other System V shared memory allocation (as reported
by ipcs), 8MB is for postmaster-accessed local memory, and the remainder is probably for shared libraries that are in the virtual
address space but not accessed. It is also clear that the stats collector process is not attached to shared memory.

What is also interesting is that very little of that memory appears in the smem columns — this is because little of it has been
accessed — it is mostly just in the virtual address space.

For my first memory demonstration, I wanted to show how work_mem affects memory allocation. For this, I needed a query that would
allocate a lot of memory, but not return a lot of data, so I settled on this query:

SELECT random() FROM generate_series(1, 10000000) ORDER BY 1 LIMIT 1;

However, that doesn't allocate much memory at all, because of this optimization added in Postgres 8.3:

ORDER BY ... LIMIT can be done without sorting (Greg Stark)

This is done by sequentially scanning the table and tracking just the top N candidate rows, rather than performing a full sort of the
entire table. This is useful when there is no matching index and the LIMIT is not large.
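
You can see the optimization in EXPLAIN ANALYZE output, where the Sort node reports something like "Sort Method: top-N heapsort" with only kilobytes of memory used:

EXPLAIN ANALYZE
SELECT random() FROM generate_series(1, 10000000) ORDER BY 1 LIMIT 1;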

So, then, I had to place a WHERE clause between the LIMIT and the ORDER BY to disable the optimization:

SELECT *
FROM (SELECT random() FROM generate_series(1, 10000000) ORDER BY 1) AS f(x)
WHERE x < 0
LIMIT 1;

(But, of course, with that WHERE clause, the LIMIT is unnecessary.)

Great — I am monitoring process memory usage and can see the memory growing — but wait, the query is done and memory
usage is back to its original values — it isn't supposed to do that. In all the operating systems I have previously worked
with, allocated memory isn't returned to the operating system, and the virtual address space certainly doesn't shrink. A quick web search
turns up confirming evidence — but wait, there is a
mention of large allocations, and studying the Debian malloc manual page
explains the behavior:

Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating
blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous
mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2)
are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).

So how do I show the memory allocated by work_mem if it disappears when the query completes? I decided to use a cursor, and therefore a
simpler query:
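
Something along these lines (the cursor name is made up) keeps the sort memory allocated, and observable, until the cursor is closed:

SET work_mem = '1GB';    -- assumed large value so the sort stays in memory
BEGIN;
DECLARE memcur CURSOR FOR
    SELECT random() FROM generate_series(1, 10000000) ORDER BY 1;
FETCH 1 FROM memcur;     -- the sort runs here; its memory remains until the cursor is closed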

Memory is very important to databases — much more so than for typical applications
(presentation). Unfortunately, because memory allocation is so
complex, it is often hard to figure out how physical RAM is being used. There are several reasons for the complexity:

Sharing: physical RAM is often shared by multiple processes, either in read-only mode (program instructions), shared mode (read/write
of shared memory), or copy-on-write (create a new copy on write; used by
fork).

Robert Haas's excellent blog post highlighted much uncertainty about
how to analyze memory usage for specific processes, especially Postgres. I commented on his blog, as did others, and now have a much
clearer idea of how to study memory usage. A blog post by Chris
Siebenmann directly addresses some of my and Robert's questions, and suggests smem as a way to analyze
memory, especially the sharing of memory. It was interesting to learn that smem was designed specifically to address the problems Robert
outlined (2007, 2009).

To learn more, I could have studied the source code of how my particular operating system
(Debian 6.0) reports memory usage, but that probably would not have
helped others running different operating systems, or even a different version of the same operating system. Therefore, I wrote a tool,
pg_memalloc.c, which allocates memory in various ways and reports the memory
allocation values from ps and smem:

Line 1 represents the default output for zero allocated memory, so it is our baseline. Line 2 represents the default allocation of one
megabyte. Notice that the data size (DRS) and virtual memory size (VSZ) increase by one megabyte, and in fact change very little for
subsequent arguments. Similarly, the code/text size (TRS) and Swap remain unchanged.

The interesting values are the process-specific memory allocation (USS), proportional memory allocation (PSS), and resident size
(RSS), which are all shown by smem.

(FYI, though TRS and DRS suggest resident size ("RS"), they are really reporting virtual size for specific memory segments; this can
be proven because TRS + DRS is very close to VSZ, and clearly larger than RSS.)

Line 3 shows that shared memory in a single process has little effect on the reported memory usage.

Line 4 is the first case where a subprocess is involved; the interesting change here is that the process-specific memory allocation (USS)
is lower for the parent and child because the combined proportional memory allocation (PSS) is greater. This shows that the
parent and child are sharing memory pages. This sharing is happening as part of fork(), rather than using shared memory. Keep in mind
none of the one megabyte of allocated memory has been accessed yet, so it doesn't show up in these columns. These numbers are really
reflecting the executable and its dependent shared libraries. The child is also quite low because the parent has done much of the
necessary memory access during startup.

Line 8 is the first time we actually access the one megabyte of allocated memory, and the numbers clearly reflect that. This highlights
that you really need to access memory for it to show up, i.e. allocating it and not accessing it does not cause it to use physical RAM,
though it does affect the virtual memory size.

Line 9 shows that using shared memory instead has little effect on the numbers.

Lines 10 and 11 are where things really get interesting — we are creating a child process, but accessing memory only in the
parent. The parent process takes the memory hit, and the child process looks similar to line 5 where memory was not accessed. Lines 12
and 13 again show that allocating shared memory has little effect.

Lines 14 and 15 show that a parent and child both accessing one megabyte of memory each bear the full weight of the access.

Lines 16 and 17 are perhaps the most interesting — because shared memory is involved, there is only one copy of the one megabyte
in RAM, and the two processes split the cost of the access in PSS, e.g. 572 and 570.

The output basically confirms what Chris Siebenmann reported. In a follow-up posting, I plan to show these same fields for a running
Postgres server.

Update: This blog post has many details about process memory
segments.

What I found most interesting about his presentation was the comparison to dynamically-typed languages. When dynamically-typed scripting
languages started to become popular years ago, there was a sense that they were only for trivial applications, compared to compiled
language that were for serious work. As CPUs became faster, scripting languages were increasingly used for production applications,
particularly web applications that are modified frequently. There might be a similar pattern now in the (improper?) dismissal of NoSQL
for serious applications.

Frankly, I think the biggest limitation of NoSQL is the need to do more complex processing in application code because the powerful SQL
language is not available. Of course, the lack of transaction semantics and joins are also serious limitations for many workloads.
Having NoSQL map to JSON has benefits if your web application uses JSON, but it is hard to replace the powerful SQL features (e.g. GROUP BY)
in an application. (Postgres is working on a
native JSON storage data type.) Even performance on a single node isn't a reason to use NoSQL, since Postgres's
UNLOGGED tables give similar performance.

So, the big benefit for NoSQL is horizontal, read-write scaling and dynamic schemas. The big question for users considering NoSQL is
whether these features are worth the loss of traditional relational database capabilities. Certainly, for some users, it is worth it
— I think the question many relational database users are wondering is which current relational use-cases are better with
NoSQL.

I hope this blog entry has helped answer that question. Josh Berkus has an excellent feature
article that goes into the complexities of NoSQL database selection, and Gavin Roy has an interesting
presentation on the topic.

The term "scalability" gets injected into database conversations often, but scalability discussions often end without a clear consensus.
I believe this is because "scalability", like "replication", actually means different things to different people.

There are two dimensions to database scalability; the first dimension specifies if additional servers are involved:

vertical: The hardware capabilities of a single database server are increased to increase throughput

horizontal: The number of database servers is increased to increase throughput

Think of "vertical" as the server getting larger (taller), and horizontal as having more servers on the floor (wider). The second
dimension specifies the type of workload to be scaled:

read-only: Workload to be scaled is only reads

read-write: Workload to be scaled includes reads and writes

Keep in mind, the issue is not whether the server workload is read-only or read-write, but rather whether you need to scale writes as well
as reads. There are tools (e.g. pgpool-II, see below) that can split out read-only queries for scaling while allowing read-write queries
to go to a single master server.

With these terms defined, let's see how they are handled by Postgres. Vertical scalability is handled very well by Postgres for both
read-only and read-write workloads, and improvements are coming in
Postgres 9.2.
I think the only area we don't scale well in is using multiple CPUs for a single query, and I have already
blogged about that limitation.

Horizontal scaling is more complex because the server interconnect speed, even using
InfiniBand, is much slower than interprocess communication via shared memory. For read-only
workloads, a combination of pgpool-II and streaming
replication allows for simple and powerful read-only load balancing across multiple servers, e.g. horizontal scaling.

What is hard to do is read-write horizontal scaling, because writes to different servers can conflict with each other, and communication
between servers is relatively slow. One possible solution is Bucardo, which does not prevent
conflicts but rather provides a mechanism for conflicting committed
transactions to be ignored. The Postgres-XC team is working on a more sophisticated
read-write, horizontal scaling system that allows conflicts to be detected before commit, but the software is not production-ready.

So, the next time you talk about scalability, it might be helpful to explain exactly what type of scaling you need — it will
help to make the discussion a lot more fruitful.

Update: One other alternative for horizontal, read-write scaling is sharding, such as with PL/Proxy
(details). This basically avoids write
conflicts by spreading data across multiple servers. 2012-01-26

While database software can be the cause of outages, for Postgres, it is often not the software but the hardware that causes failures
— and storage is often the failing component. Magnetic disk is one of the few moving parts on a computer, and hence prone to
breakage, and solid-state drives (SSDs) have a finite write limit.

While waiting for storage to start making loud noises or fail is an option, a better option is to use some type of monitoring that warns
of storage failure before it occurs; enter SMART. SMART is a system developed by
storage vendors that allows the operating system to query diagnostics on the drive and warn of unusual storage behavior before failure
occurs. While read/write failures are reported by the kernel, SMART parameters often warn of danger before failure occurs. Below is the
SMART output from a Western Digital (WDC) WD20EARX
magnetic disk drive:

(These charts were generated using smartmontools.) The VALUE column ranges from
0-200, with higher values being better. The RAW_VALUE column is more difficult to interpret because in some cases high is good, and in
others, it is bad (e.g. temperature). The next chart is from an
Intel 320 Series SSD:

Notice there is no temperature value (probably unnecessary), but there are additional attributes like Media_Wearout_Indicator, which indicates the wear level of the
memory chips. It shows a wear value of "100", indicating no measurable wear.

SMART also records all errors generated by the drive, and allows drive self-tests to be run, independent of the operating system. This is
helpful in isolating error causes.
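
For example, the attribute table, error log, and self-test results can all be queried with the smartctl utility from smartmontools (the device path is illustrative):

    smartctl -A /dev/sda           # dump the vendor attribute table (VALUE, RAW_VALUE, ...)
    smartctl -l error /dev/sda     # show errors logged by the drive
    smartctl -t short /dev/sda     # start a short self-test in the background
    smartctl -l selftest /dev/sda  # view the self-test results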

While you could manually check the SMART values of your drives periodically, there are
tools, e.g.
smartd, that can automatically query drives at regular intervals, e.g. every 30 minutes, and warn
administrators of potential problems. System administrators who want to maximize uptime should always use such
tools to get warnings before failures happen.
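
A minimal smartd.conf sketch (the device path and email address are hypothetical):

    # monitor all attributes, email warnings, and run a short self-test
    # every day between 2 and 3 a.m.
    /dev/sda -a -m admin@example.com -s (S/../.././02)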

Postgres consists of roughly 1.1 million lines of C code, which is compiled into an executable with millions of CPU instructions. Of the
many CPU machine-language instructions in the Postgres server executable, which one is the most important? That might seem like an odd
question, and one that is hard to answer, but I think I know the answer.

You might wonder, "If Postgres is written in C, how would we find the most important machine-language instruction?" Well, there is a
trick to that. Postgres is not completely written in C. There is a small file (about one thousand lines) of C code that adds specific
assembly-language CPU instructions into the executable. This file is called
s_lock.h. It is an include file,
referenced in various parts of the server code, that provides very fast locking operations. The C language doesn't supply
fast-locking infrastructure, so Postgres must supply its own locking instructions for all twelve supported CPU architectures.
(Operating system kernels do supply locking primitives, but they are much too slow to be used by Postgres.)

The specific locking operation supported by the CPU is called test-and-set. It allows a
value to be stored in a memory location and, at the same time, the old value to be returned. This allows two sessions to
simultaneously request a lock while only one session successfully acquires it.
Spinlocks and other more complex locks are built upon test-and-set; see my
Inside PostgreSQL Shared Memory presentation for more details.

I have copied a small part of
s_lock.h to show this important
instruction on x86 architectures:
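
Below is a lightly reformatted reconstruction of that excerpt from the PostgreSQL source tree, with line numbers added to match the walkthrough that follows (the file's exact whitespace differs):

     1  static __inline__ int
     2  tas(volatile slock_t *lock)
     3  {
     4      register slock_t _res = 1;
     5
     6      __asm__ __volatile__(
     7          "   cmpb    $0,%1   \n"
     8          "   jne     1f      \n"
     9          "   lock            \n"
    10          "   xchgb   %0,%1   \n"
    11          "1: \n"
    12  :       "+q"(_res), "+m"(*lock)
    13  :
    14  :       "memory", "cc");
    15      return (int) _res;
    16  }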

Line 1 identifies the function as returning an integer ("int"), and "__inline__" causes the function body to likely be placed
at the call site, rather than generating a function call and storing the function body someplace else. Line 2 names the
function as "tas" (for Test-And-Set), and indicates it takes an "slock_t" memory address ("lock"). Line 4 defines a local variable
"_res" which will hold our initial lock value (1), and the function return value. Line 6 starts a block of assembly-language code
in GCC, with the code on lines 7-11. Line 12 indicates that the assembly language uses two input/output parameters: "_res"
(%0) and "*lock" (%1). Lines 7 and 8 check to see if the lock is already non-zero, indicating the lock is already held. If so, it jumps
to the end at line 11 ("1:"). Line 9 locks the memory address referenced by the next instruction. Line 10 is the big one, the most
important assembly language instruction in the Postgres server executable:
"xchgb". It does a test-and-set or exchange of the "_res"
(local register) and "*lock" (shared memory) values, swapping them in an atomic manner. Line 14 indicates that memory and the condition
code register are modified by the assembly language. Line 15 returns zero for success, and one for failure, which can occur if the test
on line 7 or the exchange on line 10 found the lock was already held.

So, there it is — the most important CPU instruction, at least for x86 CPUs. If you look through
s_lock.h, you will find similar
test-and-set instructions for all the other supported CPU architectures.

The new server is 2-10 times faster than my old 2003 server, but that
10x speedup is only possible for applications that:

Do lots of random I/O, thanks to the SSDs. Postgres already supports tablespace-specific
random_page_cost settings (see the sketch after this list), but it
would be interesting to see if there are cases that can be optimized for low random page costs. This is probably not an immediate
requirement because the in-memory algorithms already assume a low random page cost.

Can be highly parallelized. See my previous blog entry regarding
parallelism. The 16 virtual cores in this server certainly offer more parallelism opportunities than my old two-core system.
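
Here is a minimal sketch of the tablespace setting mentioned above (the tablespace name and path are hypothetical):

    -- place a tablespace on the SSD and lower its random I/O cost estimate
    CREATE TABLESPACE ssd LOCATION '/ssd/pgdata';
    ALTER TABLESPACE ssd SET (random_page_cost = 1.1);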

Other observations:

It takes serious money to do the job right, roughly USD $4k — hopefully increased productivity and reliability will pay back
this investment.

I actually started the upgrade two years ago by adjusting my scripts to be more portable; this made the migration go much more smoothly.
The same method can be used for migrations to Postgres: rewrite SQL queries to be more portable before the migration.

Reliable hardware is often the best way to ensure Postgres reliability.

My hot-swappable SATA-2 drive bays allow for a flexible hard-drive-based backup solution (no more magnetic tapes). File system
snapshots allow similar backups for Postgres tablespaces, but it would be good if this were more flexible. It would also be cool if you
could move a drive containing Postgres tablespaces from one server to another (perhaps after freezing the rows).

Hopefully I can translate some of these lessons into Postgres improvements in the coming years. With this new server, and my recent
upgrade to fiber-based Internet, my home feels like a data center.

A few weeks ago, I finally replaced my eight-year-old home server. The age of my server and its operating system
(BSD/OS, last officially updated in 2002)
was a frequent source of amusement among Postgres community members. The
new server is:

The server replacement took about 100 hours of my time over four months, covering specification, testing, and configuration. (Many thanks to
Greg Smith for helping select and
test the server.) Getting Postgres working on the new
Debian server was trivial — even the historically difficult
SGML documentation build process was
easy. The server required a lot of configuration because it has
twenty years of Unix customizations that do family calendaring, contacts,
home automation, telephone, typesetting,
web serving, email, and family media serving.

I learned a few things related to Postgres adoption in the process. We all have friends who are using databases other than Postgres.
They often would like to use Postgres, but need a reason to switch. Sometimes cost, features, or performance can motivate such a switch,
but without those, it is hard to recommend a time-consuming migration to Postgres. However, eventually, all hardware and software has to
be upgraded, and when that time came for me, I looked to friends with experience for advice. People considering upgrading their databases will
do the same — we just need to be ready to give advice when the time comes.

Of course, this is the internal, compressed representation. The chunk_id is the value stored in the main heap table to reference the
proper long toast value. The data is stored in the chunk_data field in roughly two-kilobyte chunks, ordered by chunk_seq.
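
As a sketch, the chunks can be inspected directly (the toast table's name depends on the main table's OID; 16384 here is hypothetical):

    -- find the toast table that backs table t
    SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 't';

    -- view the chunks; chunk_data is a bytea column
    SELECT chunk_id, chunk_seq, length(chunk_data)
    FROM pg_toast.pg_toast_16384
    ORDER BY chunk_id, chunk_seq;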

Using ALTER TABLE and doing another insert, it is possible to see the values in their uncompressed, hex format:
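
A sketch of the idea, assuming a table t with a text column y:

    -- disable compression so values land in the toast table verbatim
    ALTER TABLE t ALTER COLUMN y SET STORAGE EXTERNAL;
    INSERT INTO t VALUES (2, repeat('x', 1000000));
    -- chunk_data is bytea, which psql displays in hex by default (9.0+)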

Postgres typically uses an eight-kilobyte block size — you can verify this by running
pg_controldata:

Database block size: 8192

If that is true, how does Postgres support a field limit of one gigabyte, as mentioned in the Postgres
FAQ?

What is the maximum size for a row, a table, and a database?
…
Maximum size for a field? 1 GB

One solution would be to allow rows to span multiple 8k blocks, but this would introduce source code complexity and performance problems.
A better solution, implemented in Postgres since 2001, is
toast (The Oversized-Attribute Storage
Technique). Toast uses a backup table to store a row's long values. Its goal is to move enough row values into
toast tables so the row length is less than about 2 kilobytes. You can read the details in
tuptoaster.c.

Toast tables are actually hard to find — they don't show up when looking at the table via psql, even in
verbose mode:
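
Roughly, for a hypothetical table t with an integer column x and a text column y, psql's \d+ shows something like this (the exact header varies by version):

    => \d+ t
                         Table "public.t"
     Column |  Type   | Modifiers | Storage  | Description
    --------+---------+-----------+----------+-------------
     x      | integer |           | plain    |
     y      | text    |           | extended |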

The only hint of their existence is the Storage column. Column x is a short, fixed-length column, and hence is not considered for
toast storage (marked plain). Column y is a "text" column that can store very long
values and hence is considered for toast storage. The possible storage values are listed in the
toast documentation section and in the
alter table manual page. The options control
whether data is compressed, and whether it is considered for toast storage. I use the word "considered" here
because values are not unconditionally forced into compression or toast storage — they are only considered for
such storage if the row length is long enough.

The toast system has several advantages:

Storage of a value in a backup/toast table happens only when the row is long

Compression is attempted before moving a value to the toast table

Sequential scans that do not access toast columns do not need to read or process these long data values.

Updates under MVCC, which create a new copy of the updated row, can share the same toast table pointers unless
the toasted column values are changed.

Notice that though one million characters were inserted into the table, compression reduced that to 64k, made up of eight mostly-empty 8k
pages: one heap page, two toast pages, three toast free space map (fsm) pages, and two
toast index pages. We can use pg_relation_size() to see each piece separately:
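
A sketch of how each piece can be measured (again, the toast table name pg_toast.pg_toast_16384 depends on the main table's OID):

    SELECT pg_relation_size('t');                               -- heap
    SELECT pg_relation_size('pg_toast.pg_toast_16384');         -- toast data
    SELECT pg_relation_size('pg_toast.pg_toast_16384', 'fsm');  -- toast free space map
    SELECT pg_relation_size('pg_toast.pg_toast_16384_index');   -- toast index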

As you can see, toast works transparently. In fact, it works so well that the Postgres community hasn't even found
it necessary to provide tooling to study toast behavior. I hope this blog entry has provided interesting details
on how Postgres uses toast to store long values transparently.

As part of my server upgrade, I migrated to a newer version of
LyX and LaTeX, but most significantly, to a more modern and powerful LaTeX
document class, Beamer. All 1300 slides in my
presentations have been updated. If you see something that needs improvement, no matter how
minor, please let me know via email, chat, or blog comment.