Recent Articles

Batch Update

Performance consulting involves some tricks that you have to teach over and
over again. One of them is that SQL tends to be so much better at dealing
with plenty of rows in a single statement when compared to running as many
statements, each one against a single row.

Another kind of Batch to update

So when you need to
UPDATE a bunch of rows from a given source, remember
that you can actually use a
JOIN in the
update statement. Either the source
of data is already in the database, in which case it's as simple as using
the
FROM clause in the
update statement, or it's not, and we're getting back
to that in a minute.

UPDATE FROM

Using that, you can actually update thousands of rows in our
target table in
a single statement, and you can't really get faster than that.

Preparing the Batch

Now, if you happen to have the source data in your application process'
memory, the previous bits is not doing you any good, you think. Well, the
trick is that pushing your in-memory data into the database and then joining
against the now local source of data is generally faster than looping in the
application and having to do a whole network
round trip per row.

As we're talking about performances, the trick here is to use the
COPY
protocol to fill in the
temporary table we just create to hold our data. So
we're now sending the whole data set in a temporary location in the
database, then using that as the
UPDATE source. And that's way faster than
doing a separate
UPDATE statement per row in your batch, even for small
batches.

Also, rather than using the SQL
COPY command, you might want to look up the
docs of the PostgreSQL driver you are currently using in your application,
it certainly includes some higher level facilities to deal with pushing the
data into the streaming protocol.

Insert or Update

And now sometime some of the rows in the batch have to be
updated while some
others are new and must be inserted. How do you do that? Well, PostgreSQL
9.1 brings on the table
WITH support for all
DML queries, which means that
you can do the following just fine:

That query here is
updating all the rows that are known in both the
target
and the
source and returns what we took from the
source in the operation, so
that we can do an
anti-join in the next step of the query, where we're
inserting any row that was not taken care of in the
update part of the
statement.

Note that when the batch gets to bigger size it's usually better to join
against the
target table in the
INSERT statement, because that will have an
index on the join key.

Concurrency patterns

Now, you will tell me that we just solved the
UPSERT problem. Well what
happens if more than one transaction is trying to do the
WITH (UPDATE)
INSERT dance at the same time? It's a single
statement, so it's a single
snapshot. What can go wrong?

Concurrent processing

What happens is that as soon as the concurrent sources contain some data for
the same
primary key, you get a
duplicate key error on the insert. As both
the transactions are concurrent, they are seeing the same
target table where
the new data does not exists, and both will conclude that they need to
INSERT the new data into the
target table.

There are two things that you can do to avoid the problem. The first thing
is to make it so that you're doing only one
batch update at any time, by
architecting your application around that constraint. That's the most
effective way around the problem, but not the most practical.

The other thing you can do, is force the concurrent transactions to
serialize one after the other, using an
explicit locking statement:

LOCK TABLE target IN SHARE ROW EXCLUSIVE MODE;

That
lock level is not automatically acquired by any PostgreSQL command, so
the only way it helps you is when you're doing that for every transaction
you want to serialize. When you know you're not at risk (that is, when not
playing the
insert or update dance), you can omit taking that
lock.

Conclusion

The SQL language has its quirks, that's true. It's been made for efficient
data processing, and with recent enough
PostgreSQL releases you even have
some advanced pipelining facilities included in the language. Properly
learning how to make the most out of that old component of your programming
stack still makes a lot of sense today!