I have PostgreSQL, loaded some data, and have an application using it. Now what?

I had to deal with this question, or some version of it, quite a few times. So, I decided to write a summary of what one could (or should) do after the data is in the database and the application is running. Namely – set up some kind of replication and backups.

To make sure that the post is as usable as possible, I did not use my installation method – instead I used Ubuntu's pre-packaged Pg.

To get PostgreSQL I installed it using:

$ sudo apt-get install postgresql postgresql-contrib postgresql-plperl

which installed these packages:

libpq5 in version: 9.3.6-0ubuntu0.14.04

postgresql in version: 9.3+154

postgresql-9.3 in version: 9.3.6-0ubuntu0.14.04

postgresql-client-9.3 in version: 9.3.6-0ubuntu0.14.04

postgresql-client-common in version: 154

postgresql-common in version: 154

postgresql-contrib in version: 9.3+154

postgresql-contrib-9.3 in version: 9.3.6-0ubuntu0.14.04

postgresql-plperl-9.3 in version: 9.3.6-0ubuntu0.14.04

I need contrib and plperl for some of the databases that I have there, not for anything related to replication or backups. So if you don't need them for your code – you can skip installation of these packages.

Also – I didn't modify the default config in any way before starting work on this blogpost – this should make sure that every change necessary to get the described results is covered here.

Additionally – since this is a test environment, it doesn't have any real traffic. To make sure that there is some work being processed, I added a simple script which, in a loop, creates a table with some data and then drops it.
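
The script itself isn't shown in this excerpt; a minimal sketch of such a load generator (assuming a local connection as the postgres account, and a hypothetical table name) could be:

```shell
# crude load generator: repeatedly create a table with some data, then drop it
while true; do
    psql -qc "CREATE TABLE load_test AS SELECT i, repeat('x', 100) AS filler FROM generate_series(1, 100000) AS i;"
    psql -qc "DROP TABLE load_test;"
done
```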

One final note before I start describing what I did – I purposely include commands that will produce bad results. Not because I want you to fail, but because I want to show why certain things should be set. For everything I show that doesn't work, I also include, later on, a correct, sensible version. As bad as it sounds, it's only so you will actually understand what you're doing, instead of blindly copy-pasting examples.

Having a working database, we will first set up replication. A replica is good because you can offload some queries there, and you can (possibly, we'll see) make backups off the replica, to limit load on the master.

So we need a new machine. This one will be named “slave” – as this is the role it will be performing. I installed the same packages, and verified that they are the same versions.

Locations are the same in both cases – config files in /etc/postgresql/9.3/main/ and data in /var/lib/postgresql/9.3/main.

After installation on the slave, Pg creates some template databases, which I don't need – since I will be setting it up as a slave. So, I have to stop Pg (on the slave) and remove all data in its data directory:
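
The exact commands aren't quoted in this excerpt; on Ubuntu, they would be along these lines (run from a regular sudo-capable account, with the paths listed above):

```shell
# on slave: stop the cluster, then empty (but keep) the data directory
sudo service postgresql stop
sudo -u postgres bash -c 'rm -rf /var/lib/postgresql/9.3/main/*'
```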

Please note that I removed content of data directory, but not the directory itself.

You might ask – why bash -c “…”, and not just “sudo rm -rf /var/lib/postgresql/9.3/main/*”? The reason is that the * in this example would be expanded by my shell (running as the depesz account), but this account can't list files in the data directory, so rm would get the literal “/var/lib/postgresql/9.3/main/*” as the filename to remove, which it can't handle.

The bash -c trick makes sure that the * expansion happens under the postgres account, giving rm the correct list of files and directories to remove. There are other ways around this, of course, like sudo su - postgres, and then from that shell running rm -rf /var/lib/postgresql/9.3/main/*, or even smarter:
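
The smarter variant isn't quoted here; one possibility (my assumption, not necessarily what the original post used) is to let find do the deletion, which sidesteps the glob-expansion problem entirely:

```shell
# find expands nothing in the caller's shell – it lists and deletes
# everything below the data directory while running as postgres
sudo -u postgres find /var/lib/postgresql/9.3/main -mindepth 1 -delete
```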

Now, using the postgres account, I need to copy the base set of data – there is a great tool for it – pg_basebackup – so let's use it:

15:56:30 postgres@slave ~
=$ pg_basebackup --pgdata=/var/lib/postgresql/9.3/main/ --write-recovery-conf --checkpoint=fast --verbose --progress --host=master
pg_basebackup: could not connect to server: could not connect to server: Connection refused
	Is the server running on host "master" (172.28.173.160) and accepting
	TCP/IP connections on port 5432?

First problem. PostgreSQL on master doesn't listen on 5432? It does – but only on localhost, and we need it to listen on all interfaces. So, on master, I edit /etc/postgresql/9.3/main/postgresql.conf and change the line:
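
The line itself isn't quoted here; the setting in question is listen_addresses, which in the stock Ubuntu config is commented out (so Pg falls back to the compiled-in default of localhost). The change would be from:

```
#listen_addresses = 'localhost'		# what IP address(es) to listen on;
```

to:

```
listen_addresses = '*'
```

followed by a restart of PostgreSQL, since listen_addresses is only read at server start.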

OK. It's better – PostgreSQL on master listens, but refuses connections. Of course – in its pg_hba.conf file, I didn't include any information about connections from the slave. The default pg_hba.conf on Ubuntu looks like this:

16:05:36 depesz@master ~
=$ sudo grep -E '^[^#]' /etc/postgresql/9.3/main/pg_hba.conf
local all postgres peer
local all all peer
host all all 127.0.0.1/32 md5
host all all ::1/128 md5

Which means that only local connections (either using a unix socket or tcp/ip to localhost) will be allowed.

Before I make any changes, let's think for a moment. We need a connection to send the initial data, and to replicate it later on. This will be done using the same protocol/methods that normal PostgreSQL clients use. With this in mind, being a somewhat security-conscious person, I think it's better to set up a special database account for replication purposes. Luckily, it's simple:
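
The commands aren't quoted in this excerpt; the steps would be roughly these, on master (the slave's IP below is a placeholder – use your slave's actual address):

```shell
# create a role that can do nothing except connect for replication
sudo -u postgres psql -c "CREATE USER replicant WITH REPLICATION;"

# allow passwordless replication connections from the slave's IP
echo 'host replication replicant 172.28.173.161/32 trust' | \
    sudo tee -a /etc/postgresql/9.3/main/pg_hba.conf

# reload so Pg picks up the pg_hba.conf change
sudo service postgresql reload
```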

This created the user “replicant” – which doesn't have any great privileges, but can be used for replication. I also added a line to pg_hba.conf which allows passwordless authentication as “replicant” from the slave (using its IP address). And finally, to make sure that Pg will load the modified pg_hba.conf, I reloaded the config.

Please note that I added the --user=replicant option to make sure that pg_basebackup will use this new role. But we got an error anyway. Why? Well – it just so happens that, by default, you can't have any slaves (at least so-called streaming slaves) running – each of them uses some resources, and in the general case you don't need them. But now we do. Also – let's be smarter: we need one now, but let's set the new limit with some margin for the future. Let's say 5. To do so, on master, in /etc/postgresql/9.3/main/postgresql.conf, I changed the line:

#max_wal_senders = 0 # max number of walsender processes

into

max_wal_senders = 5 # max number of walsender processes

Also, while checking the docs for this parameter, I noticed that I also have to change wal_level to archive or higher. A quick read of the docs for that setting shows that I also need to enable archive_mode, which, in turn, requires a non-empty archive_command.

At the moment, I don't care about archiving, so let's use “/bin/true” as my archiving program – it will basically immediately report to PostgreSQL that archiving “succeeded”.

All changes in postgresql.conf, on master, now are:

wal_level – was commented out with value “minimal”; I uncommented it and changed the value to hot_standby, since I want to be able to query the slave
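
The full listing isn't reproduced in this excerpt; collecting the changes described so far, the non-default settings in the master's postgresql.conf should be roughly (a reconstruction, not a verbatim copy of the original file):

```
listen_addresses = '*'
wal_level = hot_standby
archive_mode = on
archive_command = '/bin/true'
max_wal_senders = 5
```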

What is the problem? PostgreSQL keeps WAL files only for a short time, after which they get recycled. How long it keeps them depends on the checkpoint_segments setting, and additionally on wal_keep_segments.

checkpoint_segments is used, as I described earlier, to keep up with concurrent writes (and the default value is too small, but I left it small purposely).

wal_keep_segments is used for exactly the reason we had problems – to keep some segments around, so that a newly built slave will be able to catch up with replication, and, additionally, to make sure that in case of a network hiccup or slave restart, we will not have to rebuild it from scratch.
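
The value chosen isn't shown at this point, but the summary later in the post mentions 2000 segments, so presumably, on master:

```
wal_keep_segments = 2000
```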

The added column pg_current_xlog_location shows the master's current WAL location. And sent_location/write_location/flush_location/replay_location show the state of replication to a given slave (identified by client_addr).

Why are there four values? Well, the master sends data (sent_location). The slave gets the data, and writes it to its own xlog (write_location). Every so often the data is flushed to disk (flush_location). And finally – once the data has been flushed to disk on the slave – so we have a 100% guarantee (usually, if the disks don't lie) – we can apply the WAL data to the actual data on the slave (replay_location).
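
The query producing this output isn't quoted in this excerpt; on the master, it would be something like:

```shell
psql -U postgres -c "
    SELECT pg_current_xlog_location(), client_addr,
           sent_location, write_location, flush_location, replay_location
    FROM pg_stat_replication;"
```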

One common way to check for replication lag is to use the pg_last_xact_replay_timestamp function, on the slave:
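
The query isn't quoted here; it is typically a simple delta against now(), run on the slave:

```shell
# on slave: time since the last replayed transaction committed on master
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```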

The problem with pg_last_xact_replay_timestamp is that if you have no traffic on the server, or only large, long transactions – it will show your slave as lagging, while it's not. This might or might not be a problem in your case, so you have to test.
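
An alternative is to compare WAL positions instead of timestamps. The exact query isn't quoted in this excerpt, but on the master it could be something like (pg_xlog_location_diff is available from Pg 9.2 on):

```shell
psql -U postgres -c "
    SELECT client_addr,
           pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), replay_location)) AS replay_delay
    FROM pg_stat_replication;"
```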

This shows the delay in WAL data – in this case around 18GB. The thing is that it will be accurate even in case of no traffic, or just a couple of long transactions.

It also shows that my test machines (well, virtual boxes) are not fast enough to handle the replication. The reason is most likely misconfiguration of checkpointing, but that's irrelevant for now.

To summarize, we now have:

working streaming replication

enabled, but unused archiving

slave is readable

replication will be OK with network problems or slave restarts, for up to 2000 WAL segments (i.e. the time it takes the master to write 2000 segments, plus whatever the checkpoint_segments/checkpoint_timeout settings add)

This is nice. Now we need backups.

Before I show you how to make backups, let me step back and think for a moment.

The simplest way of making a backup is to call pg_dump on every database (and possibly an additional pg_dumpall to get global data, like users). So, more or less this:
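
The script isn't included in this excerpt; a rough equivalent (the backup directory and dump format are my assumptions) would be:

```shell
#!/usr/bin/env bash
# dump global objects (roles, tablespaces) once
pg_dumpall --globals-only > /var/backups/globals.sql

# dump every connectable database separately, in pg_dump's custom format
for db in $(psql -Atqc "SELECT datname FROM pg_database WHERE datallowconn"); do
    pg_dump --format=custom --file="/var/backups/${db}.dump" "$db"
done
```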

For starters – these are dumps taken at a given moment in time, and you can't store increments. So if I had such a backup done every day at 6:00 am, and my database failed at 5:00 am, I would have to recover from a backup made 23 hours earlier!

What's more – restoring them takes quite a lot of time, since the data has to be actually inserted into the database, indexes created, constraints checked. This takes time, which you might not have.

And finally – let's assume you have a problem with the database that is not obvious – for example, bloated tables/indexes, or a damaged data file. The dump, if it works at all, will not contain the problem. The restored database (assuming it restores) will be without bloat, and all traces of the problem will be gone.

Luckily – there is another way of handling backups, which solves all of the above problems, with one drawback – backups are larger.

Instead of doing pg_dump, you can actually get a copy of the whole data directory (with tar/rsync/whatever). And if you also archive xlogs, you will be able to restore to any point in time that you have xlogs for.

So, in the scenario above – I make a full backup daily at 6:00 am, and store all xlogs – when the db is gone, I just load the last daily backup, apply all xlogs that have accumulated since, and get a very fresh copy of all data.
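
On the restore side, this boils down to unpacking the data-directory backup and pointing PostgreSQL at the xlog archive via recovery.conf (the paths here are hypothetical):

```
restore_command = 'cp /walarchive/9.3/%f "%p"'
# optionally, stop replay at a chosen moment instead of the end of WAL:
# recovery_target_time = '2015-03-05 05:00:00'
```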

To make it happen we'll need a sensible archive_command, and preferably a separate server for backups.

One final note – since we will be archiving all xlogs for the purpose of backup – we no longer need a large wal_keep_segments. Basically we just need it large enough to handle an occasional network problem; we don't need it to cover prolonged slave downtime, or building a new slave – the wal segments that would be needed will be in the walarchive anyway.

Why is that a good thing? Well, wal_keep_segments is not free – it costs disk space (16MB per segment, so 2000 of these is 32GB). And restoring from kept segments puts some additional load on the master.

Now, let's set it up.

There are many tools that can handle wal archiving and backups, but I like OmniPITR – I know it pretty well, given that I wrote most of it 🙂

So, let's install it. On both machines (master and slave), as the postgres account, in the postgres home directory (/var/lib/postgresql), I do:
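
The command isn't quoted in this excerpt; OmniPITR is distributed via git, so (assuming git is installed, and the project's usual GitHub location) it would be:

```shell
cd /var/lib/postgresql
git clone https://github.com/omniti-labs/omnipitr.git
# run the bundled sanity check afterwards
omnipitr/bin/sanity-check.sh
```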

The sanity-check command verifies that everything OmniPITR uses is in place, and in appropriate versions.

With OmniPITR in place, but not yet configured, I need a place to store backups and the wal archive. It could be on the slave server, but it makes more sense to put it on yet another machine. So, I made myself a third box, aptly named “storage”.
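
The storage-side setup isn't shown in this excerpt; presumably it involves a dedicated account on storage plus ssh keys on master and slave, roughly:

```shell
# on storage: dedicated account that will own backups and the walarchive
sudo useradd --create-home --shell /bin/bash pgarchive
sudo -u pgarchive mkdir -p /home/pgarchive/.ssh

# on master and on slave, as postgres: generate a key if there isn't one yet
test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
```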

Afterwards I copy the content of /var/lib/postgresql/.ssh/id_rsa.pub, from both servers, to /home/pgarchive/.ssh/authorized_keys on storage. You can use any method of copying you want; I just connected to all the places and copy/pasted the keys using normal terminal copy/paste (ctrl-c/ctrl-v).

While technically we don't need separate directories, and we don't need them versioned (named after the PostgreSQL version), I have found that it's a good idea to have at least the walarchive contain version information – it saves some headaches in case of upgrades.

If you decide you don't need that many logs, just remove --verbose from the omnipitr config line on master. You can also use --parallel-jobs to send to both places at the same time, or use pigz instead of gzip for compression. But that's not important now.
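
The archive_command itself isn't quoted in this excerpt; with OmniPITR it would look something along these lines (the destinations and exact option spelling are assumptions – check omnipitr-archive's own documentation; note that OmniPITR uses ^ instead of % for date escapes in log paths, since % is special inside archive_command):

```
archive_command = '/var/lib/postgresql/omnipitr/bin/omnipitr-archive \
    --log /var/log/postgresql/omnipitr-^Y-^m-^d.log \
    --state-dir /var/lib/postgresql/omnipitr-state \
    --dst-remote gzip=postgres@slave:/var/lib/postgresql/walarchive/9.3 \
    --dst-remote gzip=pgarchive@storage:/home/pgarchive/walarchive/9.3 \
    --verbose "%p"'
```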

Currently both walarchives on slave and backup are getting their wals:

data-dir – where the pg data dir is, so that the backup script knows what to back up

verbose – make the logs contain more information

log – well, where the logs go

dst-direct – this is actually pretty smart – it tells omnipitr-backup-slave where to save backups on a remote machine. But it does it in such a way that the backup is never saved to a file on the slave server – it gets made and delivered to the backup server “on the fly” – so it's faster, and doesn't use additional disk space on the slave.

call-master – since Pg 9.0, when you run a backup you have to call the pg_start_backup() and pg_stop_backup() functions on the master, even if the backup itself is made on a slave. That's what this option does.

digest – type of digest to calculate for backup files. You can have any type that is supported by Perl's Digest library. Usual choices are MD5 or SHA-1, but you can go wild with it, if you want 🙂

host – the hostname of the master to connect to (this is needed because of call-master above)

skip-xlogs – normally a backup contains the data tarball and, additionally, an xlogs tarball with all xlogs required to restore from it. But if you have a working wal archive, that's no longer needed, so we can skip creation of the xlog tarball.

pgcontroldata-path – unfortunately Ubuntu puts the pg_controldata program in a place that is not, by default, in the PATH environment variable, so I have to provide the path to it myself.
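
Putting the options described above together, the invocation on the slave would look roughly like this (the paths and the dst-direct destination are my assumptions):

```shell
/var/lib/postgresql/omnipitr/bin/omnipitr-backup-slave \
    --data-dir /var/lib/postgresql/9.3/main \
    --log /var/log/postgresql/omnipitr-backup-^Y-^m-^d.log \
    --dst-direct gzip=pgarchive@storage:/home/pgarchive/backups/9.3 \
    --call-master \
    --host master \
    --digest SHA-1 \
    --skip-xlogs \
    --pgcontroldata-path /usr/lib/postgresql/9.3/bin/pg_controldata \
    --verbose
```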

To restore from such a backup we need the data tarball, and all xlogs from 00000001000000BC000000F7 onward (Min-Xlog from the meta file).

That means that all older xlogs in the walarchive are useless – there is no backup to apply them to.

What's more – if we had multiple backups – we'd have to find a way to delete the old ones after some time, and all the older xlogs too. Luckily, there is a tool for that – in the OmniPITR bundle 🙂
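
The tool isn't named in this excerpt; OmniPITR bundles a cleanup script for this purpose (check its bin/ directory and docs for the exact options). The underlying idea, sketched with plain find for illustration only – a real setup should derive the cutoff from the oldest remaining backup's Min-Xlog, not from file age:

```shell
# hypothetical retention sketch: drop backups older than 7 days
find /home/pgarchive/backups/9.3 -type f -mtime +7 -delete
```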
