Services

April 28, 2016

PostgreSQL cloud backups with PGHoard

PGHoard is the cloud backup and restore solution we’re using in Aiven. We started PGHoard development in early 2015 when the Aiven project was launched as a way to provide real-time streaming backups of PostgreSQL to a potentially untrusted cloud object storage.

PGHoard has an extensible object storage interface, which currently works with the following cloud object stores:

Amazon Web Services S3

Google Cloud Storage

OpenStack Swift

Ceph’s RADOSGW utilizing either the S3 or Swift drivers

Microsoft Azure Storage (currently experimental)

Data integrity

PostgreSQL backups consist of full database backups, basebackups, plus write ahead logs and related metadata, WAL. Both basebackups and WAL are required to create and restore a consistent database.

PGHoard handles both the full, periodic backups (driving pg_basebackup](http://www.postgresql.org/docs/current/static/app-pgbasebackup.html)) as well as streaming the write-ahead-log of the database. Constantly streaming WAL as it’s generated allows PGHoard to restore a database to any point in time since the oldest basebackup was taken. This is used to implement Aiven’s Database Forks and Point-in-time-Recovery as described in our
PostgreSQL FAQ.

To save disk space and reduce the data that needs to be sent over the network (potentially incurring extra costs) backups are compressed by default using Google’s Snappy, a fast compression algorithm with a reasonable compression ratio. LZMA (a slower algorithm with very high compression ratio) is also supported.

To protect backups from unauthorized access and to ensure their integrity PGHoard can also transparently encrypt and authenticate the data using RSA, AES and SHA256. Each basebackup and WAL segments gets a unique random AES key which is encrypted with RSA. HMAC-SHA256 is used for file integrity checking.

Restoration is key

As noted in the opening paragraph, PGHoard is a backup and restore tool: backups are largely useless unless they can be restored. Experience tells us that backups, even if set up at some point, are usually not restorable unless restore is routinely tested, but experience also shows that backup restoration is rarely practiced unless it’s easy to do and automate.

This is why PGHoard also includes tooling to restore backups, allowing you to create new master or standby databases from the object store archives. This makes it possible to set up a new database replica with a single command, which first restores the database basebackup from object storage and then sets up PostgreSQL’s recovery.conf to fetch the remaining WAL files from the object storage archive and optionally connect to an existing master server after that.

Preparing PostgreSQL for PGHoard

First, we will need to create a replication user account. We’ll just use the psql command-line client for this:

We also need to allow this new user to make connections to the database. In PostgreSQL this is done by editing the
pg_hba.conf configuration file and adding a line something like this:

host replication backup 127.0.0.1/32 md5

We’ll also need to ensure our PostgreSQL instance is configured to allow WAL replication out from the server and it has the appropriate wal_level setting. We’ll edit
postgresql.conf and edit or add the following settings:

Finally, since we have modified PostgreSQL configuration files, we’ll need to restart PostgreSQL to take the new settings into use by running
pg_ctl restart,
systemctl restart postgresql or service postgresql restart, etc depending on the Linux distribution being used. Note that it’s not enough to “reload” PostgreSQL in case the WAL settings were changed.

Now we are ready on the PostgreSQL side and can move on to PGHoard.

Installing PGHoard

PGHoard’s source distribution includes packaging scripts for Debian, Fedora and Ubuntu. Instructions for building distribution specific packages can be found in the PGHoard README. As PGHoard is a Python package it can also be installed on any system with Python 3 by running
pip3 install pghoard.

Taking backups with PGHoard

PGHoard provides a number of tools that can be launched from the command-line:

pghoard - The backup daemon itself, can be run under systemd sysvinit

pghoard_restore - Backup restoration tool

pghoard_archive_sync - Command for verifying archive integrity

pghoard_create_keys - Backup encryption key utility

pghoard_postgres_command - Used as PostgreSQL’s archive_command and restore_command

First, we will launch the pghoard daemon to start taking backups. pghoard requires a small JSON configuration file that contains the settings for the PostgreSQL connection and for the target backup storage. We’ll name the file pghoard.json:

In the above file we just list where pghoard keep’s its local working directory (backup_location), our PostgreSQL connection settings (nodes) and where we want to store the backups (object_storage. In this example we’ll just write the backup files to a local disk instead of a remote cloud object storage.

Then we just need to run the pghoard daemon and point it to our configuration file:

PGHoard automatically connected to the PostgreSQL database server, noticed that we don’t have any backups and immediately created a new basebackup and started the realtime streaming of WAL files (which act as incremental backups). Each file stored in the backups was first compressed for optimizing the transfer and storage costs.

As long as you keep PGHoard running, it will make full backups using the default schedule (once per 24 hours) and continuously stream WAL files.

Looking at the contents of the backups directory, we see that our backups now contain a full database backup plus a couple of WAL files, and some metadata for each of the files:

The pghoard_restore command automatically chooses the latest available backup, downloads, unpacks (and decompresses and decrypts, when those options are used) it to the specified target directory. The end result will be a complete PostgreSQL data directory (e.g. something like
/var/lib/postgresql/9.5/main or /var/lib/pgsql/data, depending on the distro), ready to be used by a PostgreSQL instance.

There are more command-line options for more detailed control over the restoration process, for example restoring to a particular point in time or transaction (PITR) or choosing whether the restored database will be acting as a master or a standby.

Backup encryption

In order to encrypt our backups, we’ll need to create an encryption key pair. PGHoard provides a handy command for automatically creating a key pair and storing it into our configuration file:

$ pghoard_create_keys --key-id example --config pghoard.json
Saved new key_id 'example' for site 'example-site' in 'pghoard.json'
NOTE: The pghoard daemon does not require the 'private' key in its configuration file, it can be stored elsewhere to improve security.

Note that in most cases you will want to extract the private key away from the configuration file and store it safely elsewhere away from the machine that makes the backups. The pghoard daemon only needs the encryption public key during normal operation. The private key is only required by the restore tool and the daemon while restoring a backup.

Uploading backups to the cloud

Sending backups to an object storage in the cloud is simple: we just need the cloud’s access credentials and we’ll modify the
object_storage section of pghoard.json:

Today we released PGHoard version 1.2.0 with support for Python 3.3 and PostgreSQL 9.2 plus enhanced support for handling network outages. These features were driven by external users, in Aiven we always use the latest PostgreSQL versions (9.5.2 at the time of writing) and access object storages near the database machines.

PGHoard in Aiven.io

We’re happy to talk more about PGHoard and help you set up your backups with it. You can also sign up for a free trial of our aiven.io PostgreSQL service where PGHoard will take care of your backups.