Parallel Backup with gpbackup and gprestore

gpbackup and gprestore are Greenplum Database utilities
that create and restore backup sets for Greenplum Database. By default,
gpbackup stores only the object metadata files and DDL files for a backup
in the Greenplum Database master data directory. Greenplum Database segments use the
COPY ... ON SEGMENT command to store their data for backed-up tables in
compressed CSV data files, located in each segment's backups
directory.

The backup metadata files contain all of the information that gprestore
needs to restore a full backup set in parallel. Backup metadata also provides the framework
for restoring only individual objects in the data set, along with any dependent objects, in
future versions of gprestore. (See Understanding Backup Files for more information.) Storing the table data in CSV files also provides
opportunities for using other restore utilities, such as gpload, to load the
data either in the same cluster or another cluster. By default, one file is created for each
table on the segment. You can specify the --leaf-partition-data option with
gpbackup to create one data file per leaf partition of a partitioned table,
instead of a single file. This option also enables you to filter backup sets by leaf
partitions.

Each gpbackup task uses a single transaction in Greenplum Database. During
this transaction, metadata is backed up on the master host, and data for each table on each
segment host is written to CSV backup files using COPY ... ON SEGMENT
commands in parallel. The backup process acquires an ACCESS SHARE lock on
each table that is backed up.

For information about the gpbackup and gprestore utility
options, see gpbackup and
gprestore.

Requirements and Limitations

The gpbackup and gprestore utilities are available with
Greenplum Database 5.5.0 and later.

gpbackup and gprestore have the following limitations:

If you create an index on a parent partitioned table, gpbackup does
not back up that same index on child partitioned tables of the parent, as creating the
same index on a child would cause an error. However, if you exchange a partition,
gpbackup does not detect that the index on the exchanged partition is
inherited from the new parent table. In this case, gpbackup backs up
conflicting CREATE INDEX statements, which causes an error when you
restore the backup set.

You can execute multiple instances of gpbackup, but each execution
requires a distinct timestamp.

Database object filtering is currently limited to schemas and tables.

If you use the gpbackup --single-data-file option to combine table
backups into a single file per segment, you cannot perform a parallel restore operation
with gprestore (cannot set --jobs to a value higher
than 1).

You cannot use the --exclude-table-file option with
--leaf-partition-data. Although you can specify leaf partition names
in a file specified with --exclude-table-file,
gpbackup ignores the partition names.

Backing up a database with gpbackup while simultaneously running DDL
commands might cause gpbackup to fail, in order to ensure consistency
within the backup set. For example, if a table is dropped after the start of the backup
operation, gpbackup exits and displays the error message ERROR:
relation <schema.table> does not
exist.

gpbackup might fail when a table is dropped
during a backup operation due to table locking issues. gpbackup
generates a list of tables to back up and acquires an ACCESS SHARE
lock on the tables. If an EXCLUSIVE lock is held on a table,
gpbackup acquires the ACCESS SHARE lock after the
existing lock is released. If the table no longer exists when
gpbackup attempts to acquire a lock on the table,
gpbackup exits with an error message.

For tables that might
be dropped during a backup, you can exclude the tables from a backup with a
gpbackup table filtering option such as
--exclude-table or --exclude-schema.

Objects Included in a Backup or Restore

The following lists identify the objects that are backed up and restored with
gpbackup and gprestore. Database objects are backed up
for the database you specify with the --dbname option. Global objects
(Greenplum Database system objects) are also backed up by default, but they are restored
only if you include the --with-globals option with
gprestore.

Database objects include:

Triggers. (While Greenplum Database does not support triggers, any
trigger definitions that are present are backed up and restored.)

Rules

Domains

Operators, operator families, and operator classes

Conversions

Extensions

Text search parsers, dictionaries, templates, and
configurations

Global objects include:

Tablespaces

Databases

Database-wide configuration parameter settings (GUCs)

Resource group definitions

Resource queue definitions

Roles

GRANT assignments of roles to databases

Note: These schemas are not included in a backup.

gp_toolkit

information_schema

pg_aoseg

pg_bitmapindex

pg_catalog

pg_toast*

pg_temp*

When restoring to an existing database, gprestore assumes the
public schema exists when restoring objects to the
public schema. When restoring to a new database (with the
--create-db option), gprestore creates the
public schema automatically when creating a database with the
CREATE DATABASE command. The command uses the
template0 database that contains the public
schema.
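A basic backup command takes only the name of the database to back up. The sketch below uses a hypothetical database named demo:

```shell
# Back up all schemas and tables in the demo database
gpbackup --dbname demo
```
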

The above command creates a file that contains global and database-specific metadata on the
Greenplum Database master host in the default directory,
$MASTER_DATA_DIRECTORY/backups/<YYYYMMDD>/<YYYYMMDDHHMMSS>/. For
example:
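Assuming a hypothetical backup taken at timestamp 20180101010101, the metadata files would be placed in:

```shell
$MASTER_DATA_DIRECTORY/backups/20180101/20180101010101/
```
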

When performing a backup operation, you can use the --single-data-file option in
situations where the additional overhead of multiple files might be prohibitive. For
example, if you use a third-party storage solution such as Data Domain with backups.

Restoring from Backup

To use gprestore to restore from a backup set, you must use the
--timestamp option to specify the exact timestamp value
(YYYYMMDDHHMMSS) to restore. Include the --create-db
option if the database does not exist in the cluster. For
example:
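A sketch of such a restore, assuming a hypothetical backup timestamp of 20180101010101:

```shell
# Restore the backup set, creating the database if it does not exist
gprestore --timestamp 20180101010101 --create-db
```
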

gprestore does not attempt to restore global metadata for the Greenplum
Database system by default. If this is required, include the --with-globals
option.

By default, gprestore uses 1 connection to restore table data and
metadata. If you have a large backup set, you can improve performance of the restore by
increasing the number of parallel connections with the --jobs option. For
example:
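A sketch of a parallel restore, assuming the same hypothetical timestamp:

```shell
# Restore using 8 parallel connections
gprestore --timestamp 20180101010101 --create-db --jobs 8
```
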

Test the number of parallel connections with your backup set to determine the ideal
number for fast data recovery.

Note: You cannot perform a parallel restore operation with gprestore if the
backup combined table backups into a single file per segment with the
gpbackup option --single-data-file.

Report Files

When performing a backup or restore
operation, gpbackup and gprestore generate a report
file. When email notification is configured, the email sent contains the contents of the
report file. For information about email notification, see Configuring Email Notifications.

The report file is placed in the Greenplum Database master
backup directory. The report file name contains the timestamp of the operation. These are
the formats of the gpbackup and gprestore report file
names.

History File

When performing a backup operation, gpbackup appends backup information
in the gpbackup history file, gpbackup_history.yaml, in the Greenplum
Database master data directory. The file contains the backup timestamp, information about
the backup options, and backup set information for incremental backups. This file is not
backed up by gpbackup.

gpbackup uses the information in the file to find a matching backup for
an incremental backup when you run gpbackup with the
--incremental option and do not specify the
--from-timestamp option to indicate the backup that you want to use as
the latest backup in the incremental backup set. For information about incremental
backups, see Creating Incremental Backups with gpbackup and gprestore.

Return Codes

One of these codes is returned after gpbackup or
gprestore completes.

0 – Backup or restore completed with no problems

1 – Backup or restore completed with non-fatal errors. See log file for more
information.

2 – Backup or restore failed with a fatal error. See log file for more
information.

Filtering the Contents of a Backup or Restore

gpbackup backs up all schemas and tables in the specified database, unless
you exclude or include individual schema or table objects with schema level or table level
filter options.

The schema-level options are the --include-schema and
--exclude-schema command-line options to gpbackup. For
example, if the "demo" database includes only two schemas, "wikipedia" and "twitter," both
of the following commands back up only the "wikipedia"
schema:
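Given that "demo" contains only those two schemas, these two commands are equivalent:

```shell
# Back up only the wikipedia schema
gpbackup --dbname demo --include-schema wikipedia

# Back up everything except the twitter schema (leaving only wikipedia)
gpbackup --dbname demo --exclude-schema twitter
```
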

To filter the individual tables that are included in a backup set, or excluded from a
backup set, specify individual tables with the --include-table option or
the --exclude-table option. The table must be schema qualified,
<schema-name>.<table-name>. The individual table filtering options
can be specified multiple times. However, --include-table and
--exclude-table cannot both be used in the same command.

You can create a list of qualified table names in a text file. When listing tables in a
file, each line in the text file must define a single table using the format
<schema-name>.<table-name>. The file must not include trailing
lines. For example:

wikipedia.articles
twitter.message

If a table or schema name uses any character other than a lowercase letter, number, or an
underscore character, then you must include that name in double quotes. For
example:

beer."IPA"
"Wine".riesling
"Wine"."sauvignon blanc"
water.tonic

After creating the file, you can use it either to include or exclude tables with the
gpbackup options --include-table-file or
--exclude-table-file. For
example:
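A sketch using a hypothetical file path for the table list:

```shell
# Back up only the tables listed in the file
gpbackup --dbname demo --include-table-file /home/gpadmin/table-list.txt
```
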

You cannot combine --include-schema with --include-table
or --include-table-file, and you cannot combine
--exclude-schema with any table filtering option such as
--exclude-table or --include-table.

When you use --include-table or --include-table-file,
dependent objects are not automatically backed up or restored; you must explicitly specify
the dependent objects that are required. For example, if you back up or restore a view, you
must also specify the tables that the view uses. If you back up or restore a table that uses
a sequence, you must also specify the sequence.

Filtering by Leaf Partition

By default,
gpbackup creates one file for each table on a segment. You can specify
the --leaf-partition-data option to create one data file per leaf
partition of a partitioned table, instead of a single file. You can also filter backups to
specific leaf partitions by listing the leaf partition names in a text file to include.
For example, consider a table that was created using the
statement:
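A hypothetical range-partitioned table of this kind (table and column names are illustrative) might be created with:

```sql
-- A table partitioned by month; each month becomes a leaf partition
CREATE TABLE sales (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( START (date '2020-01-01') INCLUSIVE
  END (date '2021-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
```
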

When
you specify --leaf-partition-data, gpbackup generates one
data file per leaf partition when backing up a partitioned table. For example, this command
generates one data file for each leaf
partition:
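Using a hypothetical database named demo:

```shell
# Create one data file per leaf partition instead of one file per table
gpbackup --dbname demo --leaf-partition-data
```
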

When
leaf partitions are backed up, the leaf partition data is backed up along with the
metadata for the entire partitioned table.

Note: You cannot use the --exclude-table-file option with
--leaf-partition-data. Although you can specify leaf partition names in
a file specified with --exclude-table-file, gpbackup
ignores the partition names.

Filtering with gprestore

After creating a backup set with gpbackup, you can filter the schemas
and tables that you want to restore from the backup set using the
gprestore --include-schema and --include-table-file options. These
options work in the same way as their gpbackup counterparts, but have the
following restrictions:

The tables that you attempt to restore must not already exist in the database.

If you attempt to restore a schema or table that does not exist in the backup set,
gprestore does not execute.

If you use the --include-schema option, gprestore
cannot restore objects that have dependencies on multiple schemas.

If you use the --include-table-file option,
gprestore does not create roles or set the owner of the tables. The
utility restores table indexes and rules. Triggers are also restored but are not
supported in Greenplum Database.

The file that you specify with --include-table-file cannot include
a leaf partition name, as it can when you specify this option with
gpbackup. If you specified leaf partitions in the backup set,
specify the partitioned table to restore the leaf partition data.

When restoring a
backup set that contains data from some leaf partitions of a partitioned table, the
partitioned table is restored along with the data for the leaf partitions. For
example, you create a backup with the gpbackup option
--include-table-file and the text file lists some leaf partitions
of a partitioned table. Restoring the backup creates the partitioned table and
restores the data only for the leaf partitions listed in the file.

Configuring Email Notifications

gpbackup and gprestore can send email notifications after
a back up or restore operation completes.

To have gpbackup or gprestore send out status email
notifications, you must place a file named gp_email_contacts.yaml in the
home directory of the user running gpbackup or gprestore,
or in the same directory as the utilities ($GPHOME/bin). A utility issues a
message if it cannot locate a gp_email_contacts.yaml file in either
location. If both locations contain a .yaml file, the utility uses the file
in the user's home directory ($HOME).

The email subject line includes the utility name, timestamp, status, and the name of the
Greenplum Database master. This is an example subject line for a gpbackup
email.

gpbackup 20180202133601 on gp-master completed

The email contains summary information about the operation including options, duration, and
number of objects backed up or restored. For information about the contents of a
notification email, see Report Files.

Note: The UNIX mail utility must be running on the Greenplum Database host and must be
configured to allow the Greenplum superuser (gpadmin) to send email.

gpbackup and gprestore Email File Format

The gpbackup and gprestore email notification YAML file
gp_email_contacts.yaml uses indentation (spaces) to determine the
document hierarchy and the relationships of the sections to one another. The use of white
space is significant. White space should not be used simply for formatting purposes, and
tabs should not be used at all.

Note: If the status parameters are not specified correctly, the utility
does not issue a warning. For example, if the success parameter is
misspelled and is set to true, a warning is not issued and an email is
not sent to the email address after a successful operation. To ensure email notification
is configured correctly, run tests with email notifications configured.

This is the format of the gp_email_contacts.yaml YAML file for
gpbackup email notifications:
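The sketch below is reconstructed from the parameter descriptions in the next section; the top-level contacts key name is an assumption, and user@domain is a placeholder address:

```yaml
contacts:
  gpbackup:
  - address: user@domain
    status:
      success: [true | false]
      success_with_errors: [true | false]
      failure: [true | false]
  gprestore:
  - address: user@domain
    status:
      success: [true | false]
      success_with_errors: [true | false]
      failure: [true | false]
```
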

Email YAML File Sections

contacts

Required. The section that contains the gpbackup and
gprestore sections. The YAML file can contain a
gpbackup section, a gprestore section, or one of
each.

gpbackup

Optional. Begins the gpbackup email section.

address

Required. At least one email address must be specified. Multiple email
address parameters can be specified. Each
address requires a status section.

user@domain is a single, valid email
address.

status

Required. Specify when the utility sends an email to the specified email
address. The default is to not send email notification.

You specify sending email notifications based on the completion status of a
backup or restore operation. At least one of these parameters must be
specified and each parameter can appear at most once.

success

Optional. Specify if an email is sent if the operation completes
without errors. If the value is true, an email is sent
if the operation completes without errors. If the value is
false (the default), an email is not sent.

success_with_errors

Optional. Specify if an email is sent if the operation completes with
errors. If the value is true, an email is sent if the
operation completes with errors. If the value is false
(the default), an email is not sent.

failure

Optional. Specify if an email is sent if the operation fails. If the
value is true, an email is sent if the operation fails.
If the value is false (the default), an email is not
sent.

gprestore

Optional. Begins the gprestore email section. This section
contains the address and status parameters that are used to send an email
notification after a gprestore operation. The syntax is the same as
the gpbackup section.

Examples

This example YAML file specifies sending email to email addresses depending on the
success or failure of an operation. For a backup operation, an email is sent to a
different address depending on the success or failure of the backup operation. For a
restore operation, an email is sent to gpadmin@example.com only when
the operation succeeds or completes with
errors.
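A sketch of such a file (the second backup address, my_dba@example.com, is hypothetical):

```yaml
contacts:
  gpbackup:
  - address: gpadmin@example.com
    status:
      success: true
  - address: my_dba@example.com
    status:
      failure: true
  gprestore:
  - address: gpadmin@example.com
    status:
      success: true
      success_with_errors: true
```
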

Understanding Backup Files

Warning: All gpbackup metadata files are created with read-only
permissions. Never delete or modify the metadata files for a gpbackup
backup set. Doing so will render the backup files non-functional.

A complete backup set for gpbackup includes multiple metadata files,
supporting files, and CSV data files, each designated with the timestamp at which the backup
was created.

By default, metadata and supporting files are stored on the Greenplum Database master host
in the directory
$MASTER_DATA_DIRECTORY/backups/YYYYMMDD/YYYYMMDDHHMMSS/. If you
specify a custom backup directory, this same file path is created as a subdirectory of the
backup directory. The following table describes the names and contents of the metadata and
supporting files.

Table 2. gpbackup Metadata Files (master)

File name

Description

gpbackup_<YYYYMMDDHHMMSS>_metadata.sql

Contains global and database-specific metadata:

DDL for objects that are global to the Greenplum Database cluster, and not
owned by a specific database within the cluster.

DDL for objects in the backed-up database (specified with
--dbname) that must be created before restoring
the actual data, and DDL for objects that must be created after
restoring the data.

Global objects include:

Tablespaces

Databases

Database-wide configuration parameter settings (GUCs)

Resource group definitions

Resource queue definitions

Roles

GRANT assignments of roles to databases

Note: Global metadata is not restored by default. You must include
the --with-globals option to the gprestore
command to restore global metadata.

Database-specific objects that must be
created before restoring the actual data include:

Session-level configuration parameter settings (GUCs)

Schemas

Procedural language extensions

Types

Sequences

Functions

Tables

Protocols

Operators and operator classes

Conversions

Aggregates

Casts

Views

Constraints

Database-specific objects that must be created after restoring the
actual data include:

Indexes

Rules

Triggers. (While Greenplum Database does not support triggers, any
trigger definitions that are present are backed up and restored.)

gpbackup_<YYYYMMDDHHMMSS>_toc.yaml

Contains metadata for locating object DDL in the
_predata.sql and _postdata.sql files.
This file also contains the table names and OIDs used for locating the
corresponding table data in CSV data files that are created on each segment. See
Segment Data Files.

gpbackup_<YYYYMMDDHHMMSS>_report

Contains information about the backup operation that is used to populate the
email notice (if configured) that is sent after the backup completes. This file
contains information such as:

Segment Data Files

By default, each segment creates one compressed CSV file for each table that is backed up
on the segment. You can optionally specify the --single-data-file option
to create a single data file on each segment. The files are stored in
<seg_dir>/backups/YYYYMMDD/YYYYMMDDHHMMSS/.

If you specify a custom backup directory, segment data files are copied to this same file
path as a subdirectory of the backup directory. If you include the
--leaf-partition-data option, gpbackup creates one
data file for each leaf partition of a partitioned table, instead of just one data file
per table.

Each data file uses the file name format
gpbackup_<content_id>_<YYYYMMDDHHMMSS>_<oid>.gz where:

<content_id> is the content ID of the segment.

<YYYYMMDDHHMMSS> is the timestamp of the
gpbackup operation.

<oid> is the object ID of the table. The metadata file
gpbackup_<YYYYMMDDHHMMSS>_toc.yaml references this
<oid> to locate the data for a specific table in a
schema.

You can optionally specify the gzip compression level (from 1-9) using the
--compression-level option, or disable compression entirely with
--no-compression. If you do not specify a compression level,
gpbackup uses compression level 1 by default.
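For instance, using a hypothetical database named demo:

```shell
# Back up with gzip compression level 3
gpbackup --dbname demo --compression-level 3

# Back up with compression disabled
gpbackup --dbname demo --no-compression
```
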