At the Forge - PostgreSQL 9.0

If you want to build a Web application, you're going to need a few
pieces of software. You'll need a server operating system, such as
Linux. You'll also need a server-side language and framework. Although I
personally use Ruby on Rails nowadays, plenty of other good
choices exist. For client-side interactions, you'll need
to use a JavaScript framework, such as jQuery. And, to store
information about (and for) your users, you'll need to select a persistent
storage mechanism, otherwise known as a database.

Until just a few years ago, anyone using the term database
almost always was referring to a relational database—that is, one based
on linked two-dimensional tables, which could be queried using the
relatively standard SQL language. Sure, there were a variety of other
types of databases, but they weren't in widespread use or taken very
seriously.

In the past few years, however, there has been tremendous growth in
the use and mindshare of so-called NoSQL databases. These
databases—from Redis to MongoDB to CouchDB to Hadoop—have little in
common beyond the promise of greater scalability and flexibility.

Now, I'm not one to say these non-relational databases are bad.
On the contrary, I've used several of them in my own projects and
found them to be a perfect fit for certain needs. But for day-to-day
use, I continue to find relational databases an excellent fit.
And when I need a relational database, I always prefer to use
PostgreSQL. Indeed, when I'm working on a Web application, PostgreSQL
is the default option. Only if I see that it won't work well (for
whatever reason) do I switch some or all of the project to use a
different database.

Why do I love PostgreSQL so much? First, because it's rock
solid. The developers have an almost obsessive dedication to the
safety of the data you store in the database, and they do their best
to ensure that data won't ever be deleted or corrupted. This doesn't
mean such situations never happen, but they're quite rare.
PostgreSQL not only tries to ensure the safety of your data, but it also
provides you with a variety of tools to help you constrain the
values that may be stored in the database, ensuring that anything
stored is guaranteed to be valid.

Second, PostgreSQL offers a wealth of features that never cease to
impress and amaze me. Whether it's the number of languages in which
you can write server-side functions or the different ways
you can define indexes, the clever use of MVCC (multi-version
concurrency control) to avoid writers blocking readers or the ongoing
statistical analysis that feeds into the query optimizer, PostgreSQL
hasn't failed me in more than 15 years of day-to-day use.

Every time I use another relational database, I find myself wishing
for one or more functions that PostgreSQL provides or thinking that
PostgreSQL wouldn't even enter into certain problematic situations,
because of its high-quality architecture. This doesn't mean
PostgreSQL is perfect, but its mix of features has served me quite
well, and I'm not the only one to feel this way.

This month, I want to look at PostgreSQL 9.0, which was released
in late 2010. In particular, I want to consider what advantages it
offers over other open-source databases (mainly MySQL). For people
already using PostgreSQL, what new features does the 9.0 release bring
to the table?

I won't describe installing PostgreSQL, simply
because versions are available for all major Linux distributions.
Downloading and compiling the source code is relatively
straightforward—although doing so as the dedicated
“postgres” user,
rather than as root, will save you many headaches during the
installation process—but I have found binaries to be just fine for
my day-to-day needs.

Background

PostgreSQL is an open-source (BSD-licensed) relational database
developed by a core group of developers working for a number of
different companies. (Indeed, the rules of the core development group
forbid more than a handful of core developers from working for the
same company to ensure that no one organization controls
development directly.) It grew out of the Postgres Project developed by
Michael Stonebraker at the University of California, Berkeley.
Postgres itself was an attempt to improve upon the original Ingres
database Stonebraker had developed and commercialized.

Postgres used its own query language, known as QUEL, and had a
number of advanced ideas, including many taken from the world of
object-oriented programming. Several developers joined forces to take
the Postgres database, replace QUEL with SQL, stabilize the code and
release it under an open-source license. The first release of
PostgreSQL, under its new name, came in 1996. Informally, a large
number of PostgreSQL users and developers continue to call it
Postgres, although that technically refers to Stonebraker's project
at Berkeley and not the current implementation.

One of the main goals of the PostgreSQL team has been to adhere to the
SQL standard wherever possible. In addition, as I mentioned previously,
PostgreSQL developers pride themselves on a rock-solid implementation,
using the standard ACID (atomicity-consistency-isolation-durability)
paradigm for database storage under all circumstances. They further
try to balance between a powerful set of features and configurable
options and overwhelming people who don't want to be full-time
database administrators.

All PostgreSQL storage is transactional, using a system known as MVCC
(multi-version concurrency control). MVCC, which also is used in
Oracle and CouchDB, largely avoids conflicts between readers and
writers, ensuring that neither rows nor tables need to be locked under
most circumstances. MVCC does have the side effect of keeping “dead
rows” around in the database, which occasionally need to be
“vacuumed”, similar to garbage collection in many programming
languages. For many years, vacuuming needed to be done manually,
typically by running the vacuum command from a cron job. Nowadays,
the autovacuum dæmon runs in the background, marking space as
reusable once the number of updated or deleted rows in a table
reaches a configurable threshold.

Vacuum also can be run in “analyze” mode, in which case it examines
tables and indexes, updating the statistics that are used when
planning queries. This is one of the reasons PostgreSQL is able
to operate so efficiently, even when handed complex queries. By
keeping an up-to-date description of data stored in each table, it can
make an intelligent decision regarding how to optimize the query. If
its guesses turn out to be wrong for your particular needs, you can
tune the relevant configuration parameters.

PostgreSQL offers strong support for foreign keys, making it possible
to ensure that connections between tables exist and work in the right
way. That is, you can define a People table and then an Addresses
table that refers to it:
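
A minimal version of those two tables might look like this (the exact
column names are illustrative):

CREATE TABLE People (
    id             SERIAL PRIMARY KEY,
    name           TEXT NOT NULL,
    email_address  TEXT NOT NULL
);

CREATE TABLE Addresses (
    id         SERIAL PRIMARY KEY,
    person_id  INTEGER NOT NULL REFERENCES People,
    address    TEXT NOT NULL
);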

Because I have defined person_id as a foreign key, PostgreSQL won't
let me have an invalid value there, period. PostgreSQL also
will refuse to let you remove a record from the People table if there are
rows in the Addresses table referring to it. You can override such
restrictions with the CASCADE keyword, but the database never
will remove things on its own without your explicit approval.

You can, of course, also ensure that e-mail addresses are unique in your
People table:

CREATE UNIQUE INDEX people_email_idx ON People(email_address);

But wait, what if someone enters an e-mail address in capital
letters and then enters the same e-mail address in lowercase
letters?
You can ensure uniqueness by taking advantage of one of my favorite
PostgreSQL features, a functional index:
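
One way to do this is to index the lowercased value of the column
(the index name here is illustrative):

CREATE UNIQUE INDEX people_email_lower_idx
    ON People(lower(email_address));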

Now, PostgreSQL will enforce the address's uniqueness, regardless of case.
This feature also comes in handy if you're trying to index a column
containing long text strings. You can index only the first 1,000
characters:
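
Assuming a hypothetical Articles table with a long body column, such
an index might look like this:

CREATE INDEX articles_body_idx
    ON Articles(substring(body, 1, 1000));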

Finally, another powerful feature is CTEs (common table
expressions). If you often find yourself performing a query, putting
the results in a temporary table and then querying that temp table,
CTEs almost certainly are something you'll want to use. You basically
describe the temporary table query, give it a name using the
“with”
statement, and then query that pseudo table. For example:

WITH Mini_users
AS (SELECT id * 2 AS id2, email FROM Users)
SELECT id2 FROM Mini_users;

In the example above, I create a new temporary table named
Mini_users and define it, and then query it as if it already
existed. I already have found CTEs to be quite useful in simplifying
complex queries. PostgreSQL also makes it easy to define recursive
CTEs, allowing you to work with hierarchical data with a single,
simple query.
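
For example, assuming a hypothetical Employees table in which each
row points to its manager's id, a recursive CTE can walk an entire
reporting chain in one statement:

WITH RECURSIVE Subordinates AS (
    SELECT id, name, manager_id
      FROM Employees
     WHERE id = 1
    UNION ALL
    SELECT e.id, e.name, e.manager_id
      FROM Employees e
      JOIN Subordinates s ON e.manager_id = s.id
)
SELECT * FROM Subordinates;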

Advanced Features

One of the features I have grown to love over the years is the
ability to write my own server-side functions—analogous to “stored
procedures” in other databases—in a variety of languages. Most of
the time, I use the built-in PL/pgSQL language, which marries SQL
queries with a procedural language. But in cases when I need a bit
more power, or want to use existing libraries, I can write functions
in PL/Perl, PL/Python or a number of other languages. Regardless of
what language I use, these functions integrate seamlessly into the
database, working no differently from built-in functions and allowing
me to process data inside the database, rather than in my
application.

Over time, these server-side languages have become increasingly
sophisticated, and although the PL/pgSQL syntax is not the most modern or
expressive, it is easy to learn and handles a large number of cases
easily and quickly. I've seen a wide variety of uses for server-side
functions, from encapsulating most application logic in such
functions to handling specific data-manipulation routines that don't
logically belong in the application layer.

One of my favorite uses for functions is in triggers—actions the
database executes automatically when a certain action occurs. For
example, I once worked on a project in which we wanted to insert
URLs into a table, but also wanted a separate column containing the
hostname for each URL. I wrote a function that used regular
expressions to extract the hostname and then inserted it
automatically into the appropriate column. From the application's
perspective, it inserted only a URL, but could then retrieve either
the URL or its hostname. Triggers can be used to perform
all sorts of actions before or after an insert, delete or update
query is performed on a table.
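
A sketch of that sort of trigger, assuming a hypothetical Urls table
with url and hostname columns (and with a deliberately simplified
regular expression), might look like this:

CREATE FUNCTION set_hostname() RETURNS trigger AS $$
BEGIN
    -- Pull out whatever sits between "://" and the next "/"
    NEW.hostname := substring(NEW.url FROM '://([^/]+)');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER urls_set_hostname
    BEFORE INSERT OR UPDATE ON Urls
    FOR EACH ROW
    EXECUTE PROCEDURE set_hostname();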

One of the many advances in the 9.0 release was the improvement of
“window functions”, functions introduced in 8.4 that make it possible
to create aggregations (such as sum and average) over a portion of
other rows in the table, rather than over the entire table. Thus, you could
calculate the difference between someone's salary and other
salaries in that person's department, or rank the number of citations a
department has in a bibliographic index, or find which was the
longest-running action in each controller, within an MVC Web
application. I've only started to work with window functions, but
the combination of built-in functionality from 8.4, plus some
additions in 9.0 that make it easy to peek at preceding and following
rows, leads me to believe this is a particularly powerful feature.
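
For example, assuming a hypothetical Employees table with salary and
department columns, the salary comparison described above might be
written as:

SELECT name, department, salary,
       salary - AVG(salary) OVER (PARTITION BY department)
           AS diff_from_dept_avg
  FROM Employees;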

PostgreSQL 9.0

The above features are nice improvements, but they're icing on the cake when it
comes to reasons for a full-version upgrade to 9.0.

First, this version makes it fairly simple to upgrade. Earlier
versions required dumping the database to disk, upgrading the server,
and then restoring the database from its dumped form. This might not
be a big problem for small databases that can afford to be off-line for
a while, but it is a major issue for large sites that cannot afford
such a long shutdown period.

The core PostgreSQL developers heard the criticism and have responded
with pg_upgrade. Now, pg_upgrade still is considered to be somewhat
experimental, and it hasn't been given the official seal of approval by
the development team, so it is placed in the contrib directory,
rather than in any official location. However, pg_upgrade, which
allows you to upgrade your PostgreSQL server without a dump or
restore, is certainly one of the major innovations and accomplishments
of 9.0, and it has been greeted with great excitement by people who were
too nervous or busy to upgrade previously.

Another major feature—and perhaps the most impressive
accomplishment of this version—is what's known as “hot streaming
replication”. This feature is actually the combination of two
different features, which work together to make for an amazing new
backup and high-availability system.

The basic idea is as follows. When you commit a transaction to
PostgreSQL, it doesn't immediately update the tables on disk. Rather,
it writes a record of the transaction to a separate “write-ahead
log”,
or WAL, describing the change that should be made to the database.
Only after enough WALs have accumulated does PostgreSQL actually
update the data.

Starting in version 8.4, you could copy the WALs from one PostgreSQL
server to another, typically using a program like rsync. When the
WALs appeared on the second server, they were read into that system.
If the first server were ever to go down, the second server could be
put into active mode, picking up from where the first server had left
off.

Although this was better than nothing, it had at least two problems. Not
only was transferring the WALs via “log shipping” far from an ideal
mechanism, but the secondary server could not be used while it was
receiving data from the primary server.

Both of these problems have gone away in PostgreSQL 9.0, providing a
robust and easy-to-configure master-slave mechanism for high
availability. WALs now can be transferred from the primary server to
the secondary using the network by configuring the secondary as a
special kind of client program. As each transaction occurs, the WAL
can be sent over the network to one or more remote servers, allowing
them to be synchronized almost immediately with the primary. The
other improvement is that the secondary server can answer read-only
queries even while it's receiving data from the primary.
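
Configuring this comes down to a handful of settings; here is a
sketch, in which the hostname and user name are placeholders. On the
primary, in postgresql.conf:

wal_level = hot_standby
max_wal_senders = 3

On the secondary, set hot_standby = on in postgresql.conf, and create
a recovery.conf containing:

standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'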

Hot streaming replication, as the combination of these two features
is known, doesn't cover all situations. For example, I recently
worked with a company that wanted to consolidate data from multiple
PostgreSQL servers onto a single secondary server. For such
purposes, hot streaming replication won't work. Fortunately, other
solutions exist, such as Slony, that might be a better fit for the
company's needs. But, for many sites that want to ensure their
servers won't go down, this solution is more than adequate.

Conclusion

Am I gushing? Yes, absolutely. There are very few pieces of
software that give me such pleasure to use and on which I feel I can
utterly depend, and PostgreSQL is a select member of that club. It
keeps my data safe, offers me a huge number of features, allows me to
extend the database in whatever ways I want and need, and it gives me
excellent performance with a modest time investment. If you're
thinking of switching to an open-source database, or if you have
already been using one and want to try something powerful but still
rather easy to use, you would do well to try PostgreSQL.
Especially with the latest features added in version 9.0, I'm sure
you'll find something to like.

Resources

The home page for PostgreSQL is www.postgresql.org. From that
site, you can download the software, subscribe to a number of (heavily
trafficked and extremely friendly) e-mail lists and read through the
documentation. There is also a #postgresql IRC channel on Freenode
that is generally populated and available for answering questions.

One of the best-known commercial PostgreSQL support and product
companies, EnterpriseDB, provides binary packages for a number of
operating systems at its Web site, enterprisedb.com.
It
also offers a number of video and audio podcasts and screencasts,
including those by PostgreSQL core contributor Bruce Momjian, which are of
high quality.

Finally, two excellent books about PostgreSQL recently were published
by Packt Publishing, filling in a long gap since the excellent (but
outdated) PostgreSQL, 2nd edition was written by Korry Douglas a
number of years ago. The two new books are PostgreSQL 9
Administration Cookbook by Simon Riggs and Hannu Krosing, and
PostgreSQL 9.0 High Performance by Gregory Smith. I have learned a
great deal from both books, and although the latter book spends a bit
more time than I might have liked on hardware, it also taught me that
even a software guy like myself occasionally needs to consider the
physical side of software and database design. If you use PostgreSQL
on a regular basis, I strongly recommend these books.

Reuven M. Lerner is a longtime Web developer, architect and trainer.
He is a PhD candidate in learning sciences at Northwestern University,
researching the design and analysis of collaborative on-line
communities. Reuven lives with his wife and three children in
Modi'in, Israel.