System Expansion Overview

System Expansion Overview

Data warehouses typically grow over time, often at a continuous pace, as additional data is
gathered and the retention period increases for existing data. At times, it is necessary to
increase database capacity to consolidate disparate data warehouses into a single database.
The data warehouse may also require additional computing capacity (CPU) to accommodate added
analytics projects. It is good to provide capacity for growth when a system is initially
specified, but even if you anticipate high rates of growth, it is generally unwise to invest
in capacity long before it is required. Database expansion, therefore, is a project that you
should expect to have to execute periodically.

When you expand your database, you should expect the following qualities:

Scalable capacity and performance. When you add resources to a Greenplum Database, the
capacity and performance are the same as if the system had been originally implemented
with the added resources.

Uninterrupted service during expansion. Regular workloads, both scheduled and ad-hoc,
are not interrupted. A short, scheduled downtime period is required to initialize the new
servers, similar to downtime required to restart the system. The length of downtime is
unrelated to the size of the system before or after expansion.

Replication and disaster recovery. Any existing replication mechanisms continue to
function during expansion. Restore mechanisms needed in case of a failure or catastrophic
event remain effective.

Transparency of process. The expansion process employs standard Greenplum Database
mechanisms, so administrators can diagnose and troubleshoot any problems.

Configurable process. Expansion can be a long running process, but it can be fit into a
schedule of ongoing operations. Expansion control tables allow adminstrators to prioritize
the order in which tables are redistributed and the expansion activity can be paused and
resumed.

The planning and physical aspects of an expansion project are a greater share of the work
than expanding the database itself. It will take a multi-discipline team to plan and execute
the project. Space must be acquired and prepared for the new servers. The servers must be
specified, acquired, installed, cabled, configured, and tested. Consulting Greenplum Database
platform engineers in the planning stages will help to ensure a successful expansion project.
Planning New Hardware Platforms describes general
considerations for deploying new hardware.

After you provision the new hardware platforms and set up their networks, configure the
operating systems and run performance tests using Greenplum utilities.The Greenplum Database
software distribution includes utilities that are helpful to test and burn-in the new servers
before beginning the software phase of the expansion. See Preparing and Adding Nodes for steps to ready the new hosts for Greenplum Database.

Once the new servers are installed and tested, the software phase of the Greenplum Database
expansion process begins. The software phase is designed to be minimally disruptive,
transactionally consistent, reliable, and flexible.

There is a brief period of downtime while the new segment hosts are initialized and the
system is prepared for the expansion process. This downtime can be scheduled to occur
during a period of low activity to avoid disrupting ongoing business operations. During
the initialization process, the following tasks are performed:

Greenplum Database software is installed.

Databases and database objects are created on the new segment hosts.

An expansion schema is created in the master database to control the expansion
process.

The distribution policy for each table is changed to DISTRIBUTED
RANDOMLY.

The system is restarted and applications resume.

New segments are immediately available and participate in new queries and data
loads. The existing data, however, is skewed. It is concentrated on the original
segments and must be redistributed across the new total number of primary segments.

Because tables now have a random distribution policy, the optimizer creates query
plans that are not dependent on distribution keys. Some queries will be less efficient
because more data motion operators are needed.

Using the expansion control tables as a guide, tables and partitions are redistributed.
For each table:

An ALTER TABLE statement is issued to change the distribution
policy back to the original policy. This causes an automatic data redistribution
operation, which spreads data across all of the servers, old and new, according to the
original distribution policy.

The table's status is updated in the expansion control tables.

The query optimizer creates more efficient execution plans by including the
distribution key in the planning.

When all tables have been redistributed, the expansion is complete.

Redistributing data is a long-running process that creates a large volume of network and disk
activity. It can take days to redistribute some very large databases. To minimize the effects
of the increased activity on business operations, system administrators can pause and resume
expansion activity on an ad hoc basis, or according to a predetermined schedule. Datasets can
be prioritized so that critical applications benefit first from the expansion.

In a typical operation, you run the gpexpand utility four times with
different options during the complete expansion process.

gpexpand
creates a data directory, copies user tables from all existing databases on the new
segments, and captures metadata for each table in an expansion schema for status tracking.
After this process completes, the expansion operation is committed and
irrevocable.

At initialization,
gpexpand nullifies hash distribution policies on tables in all existing
databases, except for parent tables of a partitioned table, and sets the distribution
policy for all tables to random distribution.

To complete system expansion, you must
run gpexpand to redistribute data tables across the newly added segments.
Depending on the size and scale of your system, redistribution can be accomplished in a
single session during low-use hours, or you can divide the process into batches over an
extended period. Each table or partition is unavailable for read or write operations
during redistribution. As each table is redistributed across the new segments, database
performance should incrementally improve until it exceeds pre-expansion performance
levels.

You may need to run gpexpand several times to complete the
expansion in large-scale systems that require multiple redistribution sessions.
gpexpand can benefit from explicit table redistribution ranking; see
Planning Table Redistribution.

Users can
access Greenplum Database after initialization completes and the system is back online,
but they may experience performance degradation on systems that rely heavily on hash
distribution of tables. Normal operations such as ETL jobs, user queries, and reporting
can continue, though users might experience slower response times.

When a table has
a random distribution policy, Greenplum Database cannot enforce unique constraints (such
as PRIMARY KEY). This can affect your ETL and loading processes until
table redistribution completes because duplicate rows do not issue a constraint violation
error.

To remove the expansion schema:

gpexpand -c

For information about the gpexpand utility and the other utilities that are
used for system expansion, see the Greenplum Database Utility Guide.