Greenplum Database can read from and write to several types of external data sources,
including text files, Hadoop file systems, Amazon S3, and web servers.

The COPY SQL command transfers data between a Greenplum Database table
and an external text file on the master host, or multiple text files on the segment
hosts.
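For example, a minimal COPY load and unload might look like the following; the table name and file paths are illustrative:

```sql
-- Load rows from a file on the master host into a table:
COPY expenses FROM '/data/expenses.csv' WITH CSV HEADER;

-- Unload the table back out to a file on the master host:
COPY expenses TO '/data/expenses_out.csv' WITH CSV;
```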

Readable external tables allow you to query data outside of the database directly and in
parallel using SQL commands such as SELECT and JOIN, to sort external table
data, and to create views for external tables.
External tables are often used to load external data into a regular database table using a
command such as CREATE TABLE table AS SELECT * FROM
ext_table.
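A sketch of this pattern, assuming hypothetical host names, port, and file paths for two gpfdist instances:

```sql
-- A readable external table served in parallel by two gpfdist instances:
CREATE EXTERNAL TABLE ext_expenses (
    name   text,
    date   date,
    amount float4
)
LOCATION ('gpfdist://etl1:8081/expenses/*.csv',
          'gpfdist://etl2:8081/expenses/*.csv')
FORMAT 'CSV' (HEADER);

-- Load the external data into a regular table:
CREATE TABLE expenses AS SELECT * FROM ext_expenses;
```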

External web tables provide access to dynamic data. They can be backed with data from
URLs accessed using the HTTP protocol or by the output of an OS script running on one or
more segments.
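Both variants can be sketched as follows; the URL and script path are illustrative:

```sql
-- A web table backed by data from an HTTP URL:
CREATE EXTERNAL WEB TABLE ext_log_http (line text)
LOCATION ('http://intranet.example.com/logs/latest.csv')
FORMAT 'TEXT';

-- A web table backed by the output of an OS script,
-- run once on each segment host:
CREATE EXTERNAL WEB TABLE ext_log_script (line text)
EXECUTE '/var/load_scripts/get_log.sh' ON HOST
FORMAT 'TEXT';
```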

The gpfdist utility is the Greenplum Database parallel file
distribution program. It is an HTTP server used with external tables to allow
Greenplum Database segments to load external data in parallel from multiple file systems.
You can run multiple instances of gpfdist on different hosts and network
interfaces and access them in parallel.

The gpload utility automates the steps of a load task using
gpfdist and a YAML-formatted control file.
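A minimal control file for loading CSV files into a table might look like the following sketch; the database, host, table, and path values are all illustrative:

```yaml
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - etl1
        PORT: 8081
        FILE:
          - /var/load/data/*.csv
    - FORMAT: csv
    - DELIMITER: ','
  OUTPUT:
    - TABLE: expenses
    - MODE: insert
```

Running gpload with this file starts gpfdist, creates the external table, and performs the load in one step.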

The GemFire-Greenplum Connector allows the transfer of data between
a Pivotal GemFire region and a Greenplum Database table. Pivotal GemFire is an in-memory
data management system that provides reliable asynchronous event notifications and
guaranteed message delivery. For information about using the GemFire-Greenplum Connector, see
http://ggc.docs.pivotal.io/. For information about Pivotal GemFire, see http://gemfire.docs.pivotal.io/.

The Greenplum-Spark Connector provides high speed, parallel data
transfer between Pivotal Greenplum Database and Apache Spark. For information about using
the Greenplum-Spark Connector, refer to the documentation at https://greenplum-spark.docs.pivotal.io/.

The method you choose to load data depends on the characteristics of the source
data—its location, size, format, and any transformations required.

In the simplest case, the COPY SQL command loads data into a table from a
text file that is accessible to the Greenplum Database master instance. This requires no setup
and provides good performance for smaller amounts of data. With the COPY
command, the data copied into or out of the database passes between a single file on the
master host and the database. This limits the total size of the dataset to the capacity of the
file system where the external file resides and limits the data transfer to a single file
write stream.

More efficient data loading options for large datasets take advantage of the Greenplum
Database MPP architecture, using the Greenplum Database segments to load data in parallel.
These methods allow data to load simultaneously from multiple file systems, through multiple
NICs, on multiple hosts, achieving very high data transfer rates. External tables allow you to
access external files from within the database as if they are regular database tables. When
used with gpfdist, the Greenplum Database parallel file distribution program,
external tables provide full parallelism by using the resources of all Greenplum Database
segments to load or unload data.

Greenplum Database leverages the parallel architecture of the Hadoop Distributed File System
to access files on that system.

Transforming External Data with gpfdist and gpload
The gpfdist parallel file server allows you to set up transformations that enable Greenplum Database external tables to read and write files in formats that are not supported with the CREATE EXTERNAL TABLE command's FORMAT clause.

An input transformation reads a file in the foreign data format and outputs rows to gpfdist in the CSV or other text format specified in the external table's FORMAT clause. An output transformation receives rows from gpfdist in text format and converts them to the foreign data format.