Open-Source Databases, Part III: Choosing a Database

If you are an application developer, you're probably working with
large quantities of data. And, if that data is anything more complex
than a hash table, you might want to consider moving some or all
of it into a relational database. Relational databases are designed
for reliable and flexible retrieval of data. The magic of a
relational database is not the use of two-dimensional tables to store
all of the information, but it's the fact that tables can be combined in
many different ways and manipulated using the SQL query language.

As we saw in my database articles in the last two issues of
LJ, open-source programmers are
fortunate enough to have several database options at their disposal.
By far, the two most popular open-source relational databases are
MySQL and PostgreSQL. Each has a large and loyal following, and each
continues to improve with every successive version.

And, when I write “large and loyal following” above, I'm not
kidding. MySQL and PostgreSQL have long been at the center of a major flame war
within the Open Source world. If someone on Slashdot dares say
something about one of these products, you can be sure it won't
be long before someone writes a nasty (and often childish) note about
the other one. These disagreements often reflect the knee-jerk
attitudes of uninformed users, but there have been no shortage of
attacks from well-known and informed users of these products as well.

I believe there are circumstances when either
MySQL or PostgreSQL might be an appropriate choice. I've strongly
preferred PostgreSQL in my work during the last decade—yet, there
definitely are times when MySQL seems to be the more appropriate
solution.

So, despite my personal biases and the risk of opening a flame war
within the Open Source community, I now conclude this series about
open-source databases with a comparison between MySQL and PostgreSQL
in a number of different categories. I hope by the time you
finish reading this article, you understand that choosing a database
is almost never a matter of finding the “fastest” or
“best” product,
because there is no one way to measure the quality or appropriateness
of a relational database server. Rather, I hope you'll be able
to consider each of these on the basis of its own merits, rather than on the
propaganda that is so widespread.

Data Integrity

Perhaps the first and foremost task of a database is to store and
retrieve data reliably. Just as you wouldn't want to use a hard disk
that occasionally loses data, you don't want to put things into a
database that occasionally mangles its contents. This is true even if
the reliability comes at the expense of speed.

The gold standard for reliability in the database world has an
acronym, ACID (Atomicity, Consistency, Isolation and
Durability). This means that under all circumstances in the database,
the following hold true:

Atomicity: each query is guaranteed to complete or not,
without any possibility of halfway or incomplete states.

Consistency: the database is always in a legal state before
and after a transaction.

Isolation: each transaction occurs separately from other
actions, so that you can't have two transactions interfering
with one another.

Durability: transactions persist over time, typically by being
stored on a filesystem.

The attitude toward ACID within the PostgreSQL community has been
unchanged since I first started to use it a decade ago, placing it as
the highest possible priority. This doesn't mean PostgreSQL is
lacking in other features, but rather it means the developers have worked
to ensure that data stored in a PostgreSQL system will be consistent
and reliable, even if you do nasty things such as issue a kill
-9 or
pull the plug.

During the past few years, PostgreSQL has begun to offer even better
support for transactions and database stability, using write-ahead
logs (WALs) that describe each action taken by the database. These WAL
files can be used to recover from a disaster or even to recover the
database to an earlier point in its history—a feature known as
point-in-time recovery (PITR). Thus, if you know something
happened yesterday, but the database was working perfectly two
days ago, you could use PITR to recover to the earlier, stable state.
Recent versions of PostgreSQL also support two-phased commit, a type
of transaction you're likely to see in a distributed system where
multiple servers must coordinate their actions.

MySQL has had a mixed attitude toward ACID during the years. When I
first started to use MySQL in 1995, the authors' attitude was that
transactions should be handled by the application, not the database.
Indeed, as recently as 2000, the to-do list for MySQL included tasks
having to do with production-quality transaction-safe tables. This
has led to a great deal of bad blood between the MySQL and PostgreSQL
communities, with members of the latter sometimes claiming that no
critical data should ever be stored in MySQL.

The good news is that modern versions of MySQL do indeed support
transaction-safe tables, using InnoDB, a third-party product released
under the GPL that has been integrated into MySQL for several years.
Moreover, InnoDB appears to use techniques that PostgreSQL and Oracle
have used for years, such as MVCC (multi-version concurrency control).
The bad news is that at least some benchmarks I've seen indicate that
InnoDB has some problems scaling to large numbers of simultaneous
queries.

In addition, the company that develops InnoDB recently was bought by
Oracle, which might lead some people to worry about future licensing,
development and pricing issues. For the time being, this latter
issue does not appear to be a serious one, because Oracle and MySQL
signed a contract in 2006 extending the licensing for InnoDB. But,
MySQL does not appear to be taking any chances and has hired several
experts to create a new table structure that will be owned by MySQL
and thus be impervious to such business problems.

I'm personally of the persuasion that true ACID compliance is always a
good thing to have around, much like seat belts in a car. Sure, you
can drive without a seat belt, and the odds are that nothing will
happen to you. But, it's impossible to predict when something bad
might happen, and you really don't want to be without a seat belt
under such circumstances. In the same way, if your data is important
to you, it's best to ensure that it will persist with integrity.

A related problem has to do with the degree to which each database
enforces constraints and limits. PostgreSQL tends to be quite
stringent on such matters, refusing to accept illegal data. MySQL
tries to be more forgiving and flexible, but that can result in
strange and illegal data being stored.

For example, consider the following set of MySQL commands, in which we
create a table foo with a single column (named a) of type DATE: