Loading Data

INSERT Statement with Column Values

A singleton INSERT statement with values adds a single row to a table. The
row flows through the master and is distributed to a segment. This is the slowest method and
is not suitable for loading large amounts of data.
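
For example, a single-row INSERT into a hypothetical sales table:

INSERT INTO sales (id, sale_date, amount)
VALUES (1, '2024-05-01', 19.95);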

COPY Statement

The PostgreSQL COPY statement copies data from an external file into a
database table. It can insert multiple rows more efficiently than an INSERT
statement, but the rows are still passed through the master. All of the data is copied in
one command; it is not a parallel process.

Data input to the COPY command is from a file or the standard input. For
example:

COPY table FROM '/data/mydata.csv' WITH CSV HEADER;
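
To stream data through standard input from a client host instead, the psql \copy meta-command can be used; this sketch assumes the same CSV file resides on the client machine:

\copy table FROM '/local/path/mydata.csv' WITH CSV HEADER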

Use COPY to add relatively small sets of data, for example dimension tables
with up to ten thousand rows, or for one-time data loads.

Use COPY when scripting a process that loads small amounts of data, fewer
than ten thousand rows.

Since COPY is a single command, there is no need to disable autocommit when you use this
method to populate a table.

You can run multiple concurrent COPY commands to improve performance.

External Tables

External tables provide access to data in sources outside of Greenplum Database. They can
be accessed with SELECT statements and are commonly used with the Extract,
Load, Transform (ELT) pattern, a variant of the Extract, Transform, Load (ETL) pattern that
takes advantage of Greenplum Database's fast parallel data loading capability.

With ETL, data is extracted from its source, transformed outside of the database using
external transformation tools, such as Informatica or Datastage, and then loaded into the
database.

With ELT, Greenplum external tables provide access to data in external sources, which could
be read-only files (for example, text, CSV, or XML files), Web servers, Hadoop file systems,
executable OS programs, or the Greenplum gpfdist file server, described in
the next section. External tables support SQL operations such as select, sort, and join so
the data can be loaded and transformed simultaneously, or loaded into a load table
and transformed in the database into target tables.

The external table is defined with a CREATE EXTERNAL TABLE statement,
which has a LOCATION clause to define the location of the data and a
FORMAT clause to define the formatting of the source data so that the
system can parse the input data. Files use the file:// protocol, and must
reside on a segment host in a location accessible by the Greenplum superuser. The data can
be spread out among the segment hosts with no more than one file per primary segment on each
host. The number of files listed in the LOCATION clause is the number of
segments that will read the external table in parallel.
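
A minimal sketch of a file-based external table, assuming one CSV file staged on each of two segment hosts (host names, paths, and columns are illustrative):

CREATE EXTERNAL TABLE ext_expenses (name text, category text, amount numeric)
LOCATION ('file://sdw1/data/expenses1.csv',
          'file://sdw2/data/expenses2.csv')
FORMAT 'CSV';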

External Tables with Gpfdist

The fastest way to load large fact tables is to use external tables with
gpfdist. gpfdist is a file server program using an HTTP
protocol that serves external data files to Greenplum Database segments in parallel. A
gpfdist instance can serve 200 MB/second and many
gpfdist processes can run simultaneously, each serving up a portion of
the data to be loaded. When you begin the load using a statement such as INSERT INTO
<table> SELECT * FROM <external_table>, the INSERT
statement is parsed by the master and distributed to the primary segments. The segments
connect to the gpfdist servers and retrieve the data in parallel, parse and
validate the data, calculate a hash from the distribution key data and, based on the hash
key, send the row to its destination segment. By default, each gpfdist
instance will accept up to 64 connections from segments. With many segments and
gpfdist servers participating in the load, data can be loaded at very
high rates.
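
A minimal sketch of this pattern, assuming two ETL hosts each running a gpfdist instance (host names, ports, and table definitions are illustrative):

CREATE EXTERNAL TABLE ext_sales (txn_id int, store_id int, amount numeric)
LOCATION ('gpfdist://etl1:8081/sales/*.csv',
          'gpfdist://etl2:8081/sales/*.csv')
FORMAT 'CSV';

INSERT INTO sales SELECT * FROM ext_sales;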

Primary segments access external files in parallel when using gpfdist, up
to the value of gp_external_max_segs. When optimizing
gpfdist performance, maximize the parallelism as the number of segments
increases. Spread the data evenly across as many ETL nodes as possible. Split very large data
files into equal parts and spread the data across as many file systems as possible.

Run two gpfdist instances per file system. gpfdist tends
to be CPU bound on the segment nodes when loading, but if, for example, there are eight
racks of segment nodes, there is a lot of available CPU on the segments to drive more
gpfdist processes. Run gpfdist on as many interfaces as
possible. Be aware of bonded NICs and be sure to start enough gpfdist
instances to make full use of them.
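
For example, two instances on one ETL host, each serving a different file system, might be started with commands like these (directories, ports, and log paths are illustrative):

gpfdist -d /staging1/sales -p 8081 -l /var/log/gpfdist_8081.log &
gpfdist -d /staging2/sales -p 8082 -l /var/log/gpfdist_8082.log &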

It is important to keep the work even across all of these resources. The load is only as
fast as the slowest node, so skew in the load file layout will cause the overall load to
bottleneck on that resource.

The gp_external_max_segs configuration parameter controls the number of
segments each gpfdist process serves. The default is 64. You can set a
different value in the postgresql.conf configuration file on the master.
Always keep gp_external_max_segs and the number of gpfdist
processes in an even ratio; that is, the gp_external_max_segs value should be
a multiple of the number of gpfdist processes. For example, if there are 12
segments and 4 gpfdist processes, the planner round-robins the segment
connections as follows:
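
Segment 1  - gpfdist 1
Segment 2  - gpfdist 2
Segment 3  - gpfdist 3
Segment 4  - gpfdist 4
Segment 5  - gpfdist 1
Segment 6  - gpfdist 2
Segment 7  - gpfdist 3
Segment 8  - gpfdist 4
Segment 9  - gpfdist 1
Segment 10 - gpfdist 2
Segment 11 - gpfdist 3
Segment 12 - gpfdist 4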

External tables have no statistics to inform the optimizer. You can set rough estimates
for the number of rows and disk pages for the external table in the
pg_class system catalog with a statement like the
following:
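
UPDATE pg_class
SET reltuples = 400000000, relpages = 400000
WHERE relname = 'ext_sales';

Here ext_sales and the estimates are placeholders; substitute the name of your external
table and values that roughly reflect the size of the external data.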

When using gpfdist, maximize network bandwidth by running one
gpfdist instance for each NIC on the ETL server. Divide the source data
evenly between the gpfdist instances.

When using gpload, run as many simultaneous gpload
instances as resources allow. Take advantage of the CPU, memory, and networking resources
available to increase the amount of data that can be transferred from ETL servers to the
Greenplum Database.

Use the SEGMENT REJECT LIMIT clause of the COPY
statement to set a limit for the number or percentage of rows that can have errors before
the COPY FROM command is aborted. The reject limit is per segment; when
any one segment exceeds the limit, the command is aborted and no rows are added. Use the
LOG ERRORS clause to save error rows. If a row has formatting errors, for
example missing or extra values or incorrect data types, Greenplum Database stores the
error information and the row internally. Use the
gp_read_error_log() built-in SQL function to access this stored
information.
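
For example, using the illustrative sales table and file from earlier, a load that tolerates
up to 50 error rows per segment and logs them might look like this:

COPY sales FROM '/data/sales.csv'
WITH CSV HEADER
LOG ERRORS SEGMENT REJECT LIMIT 50 ROWS;

SELECT * FROM gp_read_error_log('sales');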

If the load has errors, run VACUUM on the table to recover space.

After you load data into a table, run VACUUM on heap tables, including
system catalogs, and ANALYZE on all tables. It is not necessary to run
VACUUM on append-optimized tables. If the table is partitioned, you can
vacuum and analyze just the partitions affected by the data load. These steps clean up any
rows left over from aborted loads, deletes, or updates, and they refresh the table statistics.
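
For example, after loading a hypothetical heap staging table and an append-optimized
sales table:

VACUUM sales_stage;
ANALYZE sales_stage;
ANALYZE sales;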

Recheck for segment skew in the table after loading a large amount of data. You can use
a query like the following to check for
skew:

SELECT gp_segment_id, count(*)
FROM schema.table
GROUP BY gp_segment_id ORDER BY 2;

By default, gpfdist assumes a maximum record size of 32K. To load data
records larger than 32K, you must increase the maximum row size parameter by specifying
the -m <bytes> option on the gpfdist command
line. If you use gpload, set the MAX_LINE_LENGTH
parameter in the gpload control file.
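
For example, to accept records up to 256 MB, a gpfdist instance might be started with a
command like this (directory, port, and size are illustrative):

gpfdist -d /staging/sales -p 8081 -m 268435456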

Note: Integrations with Informatica
Power Exchange are currently limited to the default 32K record length.

Additional Information

See the Greenplum Database Reference Guide for detailed instructions for loading
data using gpfdist and gpload.