Synopsis

Description

gpfdist is the Greenplum Database parallel file distribution program. It is
used by readable external tables and gpload to serve external table files
to all Greenplum Database segments in parallel. It is used by writable external tables to
accept output streams from Greenplum Database segments in parallel and write them out to a
file.

In order for gpfdist to be used by an external table, the
LOCATION clause of the external table definition must specify the
external table data using the gpfdist:// protocol (see the Greenplum
Database command CREATE EXTERNAL TABLE).

Note: If the --ssl option is specified to enable SSL security, create the
external table with the gpfdists:// protocol.

The benefit of using gpfdist is that all segments read from or write to
the external table in parallel, which gives the best performance and makes
external tables easier to administer.

For readable external tables, gpfdist parses and serves data files evenly
to all the segment instances in the Greenplum Database system when users
SELECT from the external table. For writable external tables,
gpfdist accepts parallel output streams from the segments when users
INSERT into the external table, and writes to an output file.

For readable external tables, if load files are compressed using gzip or
bzip2 (have a .gz or .bz2 file
extension), gpfdist uncompresses the files automatically before loading
provided that gunzip or bunzip2 is in your path.

Note: Currently, readable external tables do not support compression on Windows
platforms, and writable external tables do not support compression on any platform.
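
The PATH requirement above can be checked before a load starts. The following is a minimal sketch, not part of gpfdist itself: the function name missing_decompressors and the sample file names are hypothetical, and the extension-to-tool mapping is taken only from the paragraph above.

```python
import shutil

# Mapping taken from the text above: file extension -> tool that must be on PATH.
DECOMPRESSORS = {".gz": "gunzip", ".bz2": "bunzip2"}


def missing_decompressors(filenames, which=shutil.which):
    """Return the sorted tool names needed by `filenames` but not found on PATH.

    `which` is injectable so the check can be exercised without touching
    the real PATH.
    """
    needed = {
        tool
        for name in filenames
        for ext, tool in DECOMPRESSORS.items()
        if name.endswith(ext)
    }
    return sorted(tool for tool in needed if which(tool) is None)


if __name__ == "__main__":
    # Hypothetical load files; the result depends on the local PATH.
    print(missing_decompressors(["sales.csv.gz", "orders.txt.bz2", "plain.csv"]))
```

An empty list suggests that the compressed files in the set can be uncompressed on this host; it says nothing about the Windows limitation noted above.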

Most likely, you will want to run gpfdist on your ETL machines rather than
the hosts where Greenplum Database is installed. To install gpfdist on
another host, simply copy the utility over to that host and add gpfdist to
your $PATH.

Note: When using IPv6, always enclose the numeric IP address in brackets.
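
As an illustration of the bracket rule and of the two protocol names, here is a small sketch that assembles a LOCATION URL. The helper gpfdist_location, the host names, the port, and the file pattern are all hypothetical; only the gpfdist:// and gpfdists:// schemes and the IPv6 bracket rule come from the text above.

```python
import ipaddress


def gpfdist_location(host, port, path, ssl=False):
    """Build a LOCATION URL; numeric IPv6 hosts are wrapped in brackets."""
    scheme = "gpfdists" if ssl else "gpfdist"
    try:
        is_ipv6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_ipv6 = False  # not a numeric address, so treat it as a host name
    if is_ipv6:
        host = f"[{host}]"
    return f"{scheme}://{host}:{port}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(gpfdist_location("2620:0:170:610::11", 8081, "*.txt"))
    # gpfdist://[2620:0:170:610::11]:8081/*.txt
```

The resulting string is what would appear inside the LOCATION clause of CREATE EXTERNAL TABLE.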

Options

-d directory

The directory from which gpfdist will serve files for readable
external tables or create output files for writable external tables. If not specified,
defaults to the current directory.

-l log_file

The fully qualified path and log file name where standard output messages are to be
logged.

-p http_port

The HTTP port on which gpfdist will serve files. Defaults to
8080.

-t timeout

Sets the time allowed for Greenplum Database to establish a connection to a
gpfdist process. Default is 5 seconds. Allowed values are 2 to 7200
seconds (2 hours). May need to be increased on systems with a lot of network
traffic.

-m max_length

Sets the maximum allowed data row length in bytes. Default is 32768. Should be used
when user data includes very wide rows (or when line too long error
message occurs). Should not be used otherwise as it increases resource allocation. Valid
range is 32K to 256MB. (The upper limit is 1MB on Windows systems.)

Note: Memory issues might occur if you specify a large maximum row length and run a
large number of concurrent gpfdist connections. For example, setting
this value to the maximum of 256MB with 96 concurrent gpfdist
processes requires approximately 24GB of memory ((96 + 1) x
256MB).
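
The estimate in this note is a single multiplication. The sketch below reproduces it for the 256MB maximum and 96 connections; the function name is hypothetical, and it assumes "MB" means 2^20 bytes, which the text does not state.

```python
MIB = 1024 * 1024  # assumption: "MB" in the text means 2**20 bytes


def estimated_row_buffer_bytes(connections, max_length):
    """Apply the (connections + 1) x max_length estimate from the note above."""
    return (connections + 1) * max_length


total = estimated_row_buffer_bytes(96, 256 * MIB)

if __name__ == "__main__":
    print(f"{total / 1024**3:.2f} GiB")  # 24.25 GiB
```

Running the same function with the planned -m value and connection count gives a rough idea of whether the host has enough memory.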

-s

Enables simplified logging. When this option is specified, only messages with
WARN level and higher are written to the gpfdist
log file. INFO level messages are not written to the log file. If
this option is not specified, all gpfdist messages are written to the
log file.

You can specify this option to reduce the information written to the log file.

-S (use O_SYNC)

Opens the file for synchronous I/O with the O_SYNC flag. Any writes
to the resulting file descriptor block gpfdist until the data is
physically written to the underlying hardware.

-w time

Sets the number of seconds that Greenplum Database delays before closing a target file
such as a named pipe. The default value is 0, no delay. The maximum value is 7200
seconds (2 hours).

For a Greenplum Database with multiple segments, there might be a delay between
segments when writing data from different segments to the file. You can specify a time
to wait before Greenplum Database closes the file to ensure all the data is written to
the file.

--ssl certificate_path

Adds SSL encryption to data transferred with gpfdist. After executing
gpfdist with the --ssl certificate_path option, the only way to load data
from this file server is with the gpfdists:// protocol. For information on
the gpfdists:// protocol, see "Loading and Unloading Data" in the
Greenplum Database Administrator Guide.

The location specified in certificate_path must contain the
following files:

The server certificate file, server.crt

The server private key file, server.key

The trusted certificate authorities, root.crt

The root directory (/) cannot be specified as
certificate_path.
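
The requirements on certificate_path can be verified before starting gpfdist. This is a minimal sketch under stated assumptions: the function check_certificate_path is hypothetical, and it checks only that the three files exist and that the path is not the root directory, not that the certificates are valid.

```python
import os
import tempfile

# File names required by the text above.
REQUIRED_FILES = ("server.crt", "server.key", "root.crt")


def check_certificate_path(certificate_path):
    """Return a list of problems; an empty list means the layout looks usable."""
    problems = []
    if os.path.abspath(certificate_path) == os.path.abspath(os.sep):
        problems.append("certificate_path must not be the root directory")
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(certificate_path, name)):
            problems.append(f"missing {name}")
    return problems


if __name__ == "__main__":
    # Demonstration with a throwaway directory that holds only server.crt.
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "server.crt"), "w").close()
        print(check_certificate_path(d))
        # ['missing server.key', 'missing root.crt']
```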

-v (verbose)

Verbose mode shows progress and status messages.

-V (very verbose)

Very verbose mode shows all output messages generated by this utility.

-? (help)

Displays the online help.

--version

Displays the version of this utility.
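
The numeric limits listed above for -t, -m, and -w can be enforced before a command line is handed to the shell. The sketch below builds an argument list for a subset of the options; the helper names and the sample values are hypothetical, and it assumes the -m range of "32K to 256MB" means 32768 to 256 x 2^20 bytes.

```python
import shlex

MAX_ROW_BYTES = 256 * 1024 * 1024  # assumption: "256MB" means 256 * 2**20 bytes


def _check(name, value, low, high):
    """Return `value` as a string, or raise if it is outside [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return str(value)


def gpfdist_argv(port=None, log_file=None, timeout=None, max_length=None, wait=None):
    """Assemble a gpfdist argument list, applying the documented ranges."""
    argv = ["gpfdist"]
    if port is not None:
        argv += ["-p", str(port)]
    if log_file is not None:
        argv += ["-l", log_file]
    if timeout is not None:
        argv += ["-t", _check("-t", timeout, 2, 7200)]
    if max_length is not None:
        argv += ["-m", _check("-m", max_length, 32768, MAX_ROW_BYTES)]
    if wait is not None:
        argv += ["-w", _check("-w", wait, 0, 7200)]
    return argv


if __name__ == "__main__":
    print(shlex.join(gpfdist_argv(port=8081, log_file="/tmp/gpfdist.log", timeout=30)))
    # gpfdist -p 8081 -l /tmp/gpfdist.log -t 30
```

Returning a list rather than a string keeps the arguments safe to pass to subprocess.run without shell quoting; shlex.join is used only for display.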

Running gpfdist as a Windows Service

Greenplum Database Loaders allow gpfdist to run as a Windows Service.

Follow the instructions below to download, register and activate gpfdist
as a service:

Update your Greenplum Database Loader package to the latest version. This
package is available from Pivotal Network.