SFTP/FTP Client

The SFTP/FTP Client origin reads files from a server using the Secure File Transfer
Protocol (SFTP) or the File Transfer Protocol (FTP).

When you configure the SFTP/FTP Client origin, you
specify the URL where the files reside on the remote server. You can specify whether to
process files in subdirectories, a file name pattern, and the first file to process.

If the server requires authentication, configure the origin to use login credentials. If
using the SFTP protocol, you can also configure the origin for strict host checking.

You can configure the origin to download files to an archive directory if the origin
encounters errors while reading the files.

When the pipeline stops, the SFTP/FTP Client origin notes where it stops reading. When
the pipeline starts again, the origin continues processing from where it stopped by
default. You can reset the origin to process all requested files.

The origin can generate events for an event stream. For
more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Read Order

The SFTP/FTP Client origin reads files in ascending order
based on the last-modified timestamp. When the origin reads from a secondary location -
not the directory where the files are created and written - the last-modified timestamp
should reflect when each file was moved into the directory for processing.

Tip: Avoid moving files using commands that preserve the existing timestamp, such as
cp -p. Preserving the existing timestamp can be problematic in some cases, such as
moving files across time zones.

When ordering based on timestamp, any files with the same timestamp are read in
lexicographically ascending order based on the file names.

For example, when reading files with the log*.json file name pattern, the origin reads
the following files in the following order:

File Name          Last Modified Timestamp
log-1.json         APR 24 2016 14:03:35
log-0054.json      APR 24 2016 14:05:03
log-0055.json      APR 24 2016 14:45:11
log-2.json         APR 24 2016 14:45:11

Because log-0055.json and log-2.json share the same timestamp, they are read in
lexicographically ascending order: log-0055.json precedes log-2.json because 0 sorts
before 2.

First File for Processing

Configure a
first file for processing when you want the SFTP/FTP Client origin to ignore one or more
existing files in the directory.

When you define a first file to process, the origin starts processing with the specified
file and continues processing files in the expected read order: files that match the
file name pattern in ascending order based on the last-modified timestamp.

When you do not specify a first file, the origin processes the files in the directory
that match the file name pattern, starting with the earliest file and continuing in
ascending order.

For example, if you specify a first file with the last-modified timestamp of 6/01/2017
00:00:00, the origin starts processing with that file and ignores all older files in the
directory.

Note: When you restart a stopped pipeline, the origin ignores this
property. It starts where it left off regardless of the first file name unless you
reset the origin.

Credentials

If the remote server requires authentication, configure the authentication method that
the origin must use to log in to the remote server.

Configure one of the
following authentication methods:

None

The SFTP or FTP server does not require authentication.

Password

The SFTP or FTP server requires authentication using a user name and
password.

Private key

The SFTP server requires authentication using a private key file. Store the
private key file in a local directory. For the SFTP protocol only.

If using the SFTP protocol, you can also configure the origin to use strict host
checking. When enabled, the origin connects to the SFTP server only if the server is
listed in the known hosts file stored in a local directory. The known hosts file
contains the host keys for the approved SFTP servers.
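For example, a known hosts file typically uses the standard OpenSSH known_hosts format,
with one approved server per line. A minimal sketch, using a hypothetical host name and
a truncated key:

sftp.example.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAB...

Each entry lists the host name, the key type, and the base64-encoded host key for the
server.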

Record Header Attributes

The SFTP/FTP Client origin creates record header
attributes that include information about the originating file for
the record. When the origin processes Avro data, it includes the Avro schema in
an avroSchema record header attribute.

You can use the record:attribute or record:attributeOrDefault functions to
access the information in the attributes. For more information about working with record
header attributes, see Working with Header Attributes.
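For example, the following expressions reference attributes listed below. The first
returns the name of the originating file; the second returns the last-modified time,
or 0 when the mtime attribute is not set:

${record:attribute('filename')}

${record:attributeOrDefault('mtime', '0')}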

The SFTP/FTP Client origin creates the following record header
attributes:

avroSchema - When processing Avro data, provides the
Avro schema.

filename - Provides the name of the file where the record
originated.

file - Provides the file path and file name where the record
originated.

mtime - Provides the last-modified time for the file.

remoteUri - Provides the resource URL used by the stage.

Event Generation

The SFTP/FTP Client origin can generate events that you can use in an event stream. When
you enable event generation, the origin generates event records each time
the origin starts or completes reading a file. It can also generate events when it completes processing all available data and the
configured batch wait time has elapsed.

SFTP/FTP Client origin events can be used in any logical way. For example:

With the Pipeline Finisher executor to
stop the pipeline and transition the pipeline to a Finished state when
the origin completes processing available data.

When you restart a
pipeline stopped by the Pipeline Finisher executor, the origin
continues processing from the last-saved offset unless you reset
the origin.
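For example, because the origin also generates new-file and finished-file events, a
common approach is to add a precondition to the Pipeline Finisher executor so that only
the no-more-data event stops the pipeline. A sketch of such a precondition:

${record:eventType() == 'no-more-data'}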

Event Records

Event records generated by the SFTP/FTP Client
origin have the following event-related record header attributes. Record header
attributes are stored as String values:

sdc.event.type - Event type. Uses one of the following types:

new-file - Generated when the origin starts processing a new
file.

finished-file - Generated when the origin completes
processing a file.

no-more-data - Generated after the origin completes
processing all available files and the number of seconds
configured for Batch Wait Time has elapsed.

sdc.event.version - An integer that indicates the version of the event record type.

sdc.event.creation_timestamp - Epoch timestamp when the stage created the event.

The SFTP/FTP Client origin can generate the following types of event records:

new-file

The SFTP/FTP Client origin generates a new-file event record when it starts
processing a new file.

New-file event records have the sdc.event.type set to new-file and include the
following field:

filepath - Path and name of the file that the origin started
processing.

finished-file

The SFTP/FTP Client origin generates a finished-file event record when it
finishes processing a file.

Finished-file event records have the sdc.event.type set to finished-file and
include the following fields:

filepath - Path and name of the file that the origin finished
processing.

record-count - Number of records successfully generated from the
file.

error-count - Number of error records generated from the file.

no-more-data

The SFTP/FTP Client origin generates a no-more-data event record when the origin
completes processing all available records and the number of seconds configured
for Batch Wait Time elapses without new files appearing for processing.

No-more-data event records generated by the SFTP/FTP Client origin have the
sdc.event.type set to no-more-data and include the following fields:

record-count - Number of records successfully generated since the pipeline started
or since the last no-more-data event was created.

error-count - Number of error records generated since the pipeline started or since
the last no-more-data event was created.

file-count - Number of files that the origin attempted to process. Can include files
that could not be processed or were not fully processed.

Data Formats

The SFTP/FTP Client origin processes data differently based on the data format. The
origin processes the following types of data:

Avro

Generates a record for every Avro record. Includes a "precision"
and "scale" field attribute for each Decimal field. For more
information about field attributes, see Field Attributes.

The origin writes the Avro schema to an avroSchema record header
attribute. For more information about record header attributes,
see Record Header Attributes.

You can use one of the following methods to specify the location
of the Avro schema definition:

Message/Data Includes Schema -
Use the schema in the file.

In Pipeline Configuration - Use
the schema that you provide in the stage
configuration.

Confluent Schema Registry -
Retrieve the schema from Confluent Schema Registry.
The Confluent Schema Registry is a distributed
storage layer for Avro schemas. You can configure
the origin to look up the schema in the Confluent
Schema Registry by the schema ID or subject
specified in the stage configuration.

Using a schema in the stage configuration or retrieving a schema
from the Confluent Schema Registry overrides any schema that
might be included in the file and can improve performance.

The origin reads files compressed by Avro-supported compression
codecs without requiring additional configuration. To enable the
origin to read files compressed by other codecs, use the
compression format property in the stage.

Delimited

Generates a record for each delimited line. The origin supports
several delimited format types, including a custom format with
user-defined delimiter, escape, and quote characters.

You can use a list or list-map root field type for delimited data,
optionally including the header information when available. For
more information about the root field types, see Delimited Data Root Field Type.
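For example, given the following illustrative file with a header line:

id,name
1,abc

the List-Map root field type produces a record that maps each header to its
value, so fields can be referenced by name, such as /id. The List root field
type produces an indexed list in which each entry is a map holding a header
field and a value field.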

When using a header line, you can allow processing records with
additional columns. The additional columns are named using a
custom prefix and sequentially increasing integers, such as
_extra_1 and _extra_2. If you disallow additional columns,
records that include them are sent to error.

You can also replace a string constant with null values.

When a record exceeds the user-defined maximum record length, the
origin cannot continue processing data in the file. Records
already processed from the file are passed to the pipeline. The
behavior of the origin is then based on the error handling
configured for the stage:

Discard - The origin continues processing with the
next file, leaving the partially-processed file in
the directory.

To Error - The origin continues processing with the
next file. If a post-processing error directory is
configured for the stage, the origin moves the
partially-processed file to the error directory.
Otherwise, it leaves the file in the directory.

Stop Pipeline - The origin stops the pipeline.

JSON

Generates a record for each JSON object. You can process JSON
files that include multiple JSON objects or a single JSON
array.
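For example, both of the following illustrative file layouts produce two
records:

{"id": 1, "status": "new"}
{"id": 2, "status": "done"}

[{"id": 1, "status": "new"}, {"id": 2, "status": "done"}]

The first file contains multiple JSON objects; the second contains a single
JSON array.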

When an object exceeds the maximum object length defined for the
origin, the origin cannot continue processing data in the file.
Records already processed from the file are passed to the
pipeline. The behavior of the origin is then based on the error
handling configured for the stage:

Discard - The origin continues processing with the
next file, leaving the partially-processed file in
the directory.

To Error - The origin continues processing with the
next file. If a post-processing error directory is
configured for the stage, the origin moves the
partially-processed file to the error directory.
Otherwise, it leaves the file in the directory.

Stop Pipeline - The origin stops the pipeline.

Log

Generates a record for every log line.

When a line exceeds the user-defined maximum line length, the
origin truncates longer lines.

You can include the processed log line as a field in the record.
If the log line is truncated, and you request the log line in
the record, the origin includes the truncated line.

Protobuf

Protobuf messages must match the specified message type and be described
in the descriptor file.

When the data for a record exceeds 1 MB, the origin cannot continue
processing data in the file. The origin handles the file based on file
error handling properties and continues reading the next file.

Whole File

Streams whole files from the origin system to the destination
system. You can specify a transfer rate or use all available
resources to perform the transfer.

The origin generates two fields: one for a file reference and one
for file information. For more information, see Whole File Data Format.

XML

Generates records based on a user-defined delimiter element. Use
an XML element directly under the root element or define a
simplified XPath expression. If you do not define a delimiter
element, the origin treats the XML file as a single record.
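For example, in the following illustrative document, setting the
delimiter element to msg generates one record for each msg element.
Without a delimiter element, the entire document becomes a single
record:

<root>
  <msg><text>hello</text></msg>
  <msg><text>world</text></msg>
</root>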

Generated records include XML attributes and namespace
declarations as fields in the record by default. You can
configure the stage to include them in the record as field
attributes.

You can include XPath information for each parsed XML element and
XML attribute in field attributes. This also places each
namespace in an xmlns record header attribute.

Note: Field attributes and record header attributes are
written to destination systems automatically only when you use the SDC RPC
data format in destinations. For more information about working with field
attributes and record header attributes, and how to include them in records,
see Field Attributes and Record Header Attributes.

When a record exceeds the user-defined maximum record length, the
origin cannot continue processing data in the file. Records
already processed from the file are passed to the pipeline. The
behavior of the origin is then based on the error handling
configured for the stage:

Discard - The origin continues processing with the
next file, leaving the partially-processed file in
the directory.

To Error - The origin continues processing with the
next file. If a post-processing error directory is
configured for the stage, the origin moves the
partially-processed file to the error directory.
Otherwise, it leaves the file in the directory.

Stop Pipeline - The origin stops the pipeline.

For Avro data, on the Data Format tab, configure the following properties:

Avro Schema

Avro schema definition used to process the data.
Overrides any existing schema definitions associated with
the data. Using a schema defined in the stage configuration
or retrieved from Confluent Schema Registry can improve
performance.

You can optionally use the runtime:loadResource
function to use a schema definition stored in a runtime
resource file.
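For example, the following expression loads a schema definition from a
file named schema.avsc stored as a runtime resource; the file name is
illustrative. The second argument indicates whether the file is required
to have restricted permissions:

${runtime:loadResource('schema.avsc', false)}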

Schema Registry URLs

Confluent Schema Registry URLs used to look up the
schema. To add a URL, click Add. Use
the following format to enter the
URL:

http://<host name>:<port number>

Lookup Schema By

Method used to look up the schema in the Confluent Schema
Registry:

Subject - Look up the specified Avro schema
subject.

Schema ID - Look up the specified Avro schema ID.

Overrides any existing schema definitions associated
with the data.

Schema Subject

Avro schema subject to look up in the Confluent Schema
Registry.

If the specified subject has multiple schema
versions, the origin uses the latest schema version for
that subject. To use an older version, find the
corresponding schema ID, and then set the
Lookup Schema By property to
Schema ID.

Schema ID

Avro schema ID to look up in the Confluent Schema
Registry.

For delimited data, on the Data Format tab, configure the
following properties:

Header Line

Indicates whether a file contains a header line, and
whether to use the header line.

Allow Extra Columns

When processing data with a header line, allows
processing records with more columns than exist in the
header line.

Extra Column Prefix

Prefix to use for any additional columns. Extra columns
are named using the prefix and sequentially increasing
integers as follows: <prefix><integer>.

For example, _extra_1. Default is _extra_.

Max Record Length (chars)

Maximum length of a record in characters. Longer records
are not read.

This property can be limited by the Data Collector parser
buffer size. For more information, see Maximum Record Size.

Delimiter Character

Delimiter character for a custom delimiter format. Select
one of the available options or use Other to enter a custom
character.

You can enter a Unicode control character
using the format \uNNNN, where N is a
hexadecimal digit from the numbers 0-9 or the letters
A-F. For example, enter \u0000 to use the null character
as the delimiter or \u2028 to use a line separator as
the delimiter.

Default is the pipe character ( | ).

Escape Character

Escape character for a custom file type.

Quote Character

Quote character for a custom file type.

Root Field Type

Root field type to use:

List-Map - Generates an indexed list of data.
Enables you to use standard functions to process
data. Use for new pipelines.

List - Generates a record with an indexed list with
a map for header and value. Requires the use of
delimited data functions to process data. Use only
to maintain pipelines created before 1.1.0.

Lines to Skip

Lines to skip before reading data.

Parse NULLs

Replaces the specified string constant with null
values.

NULL Constant

String constant to replace with null values.

Charset

Character encoding of the files to be processed.

Ignore Ctrl Characters

Removes all ASCII control characters except for the tab, line feed, and carriage
return characters.

For XML data, on the Data Format tab, configure the
following properties:

Include Field XPaths

Includes the XPath to each parsed XML element and XML
attribute in field attributes. Also includes each namespace
in an xmlns record header attribute.

When not selected,
this information is not included in the record. By
default, the property is not selected.

Note: Field attributes and record header attributes are
written to destination systems automatically only when you use the SDC RPC
data format in destinations. For more information about working with field
attributes and record header attributes, and how to include them in records,
see Field Attributes and Record Header Attributes.

Namespaces

Namespace prefix and URI to use when parsing the XML
document. Define namespaces when the XML element being used
includes a namespace prefix or when the XPath expression
includes namespaces.

Output Field Attributes

Includes XML attributes and namespace declarations in the
record as field attributes. When not selected, XML
attributes and namespace declarations are included in the
record as fields.

Note: Field attributes are automatically included in
records written to destination systems only when you use the SDC RPC data
format in the destination. For more information about working with field
attributes, see Field Attributes.

By default, the property is not
selected.

Max Record Length (chars)

The maximum number of characters in a record. Longer
records are diverted to the pipeline for error handling.

This property can be limited by the Data Collector parser
buffer size. For more information, see Maximum Record Size.

Charset

Character encoding of the files to be processed.

Ignore Ctrl Characters

Removes all ASCII control characters except for the tab, line feed, and carriage
return characters.