Amazon S3

The Amazon S3 origin reads
objects stored in Amazon S3. The object names must share a prefix pattern, and the objects
should be fully written before the origin reads them. To read messages from Amazon SQS, use the Amazon SQS Consumer origin.

With the Amazon S3 origin, you define the region, bucket, prefix pattern, optional common
prefix, and read order. These properties determine the objects that the origin
processes. You can optionally include Amazon S3 object metadata in the record as record
header attributes.

After processing an object or upon encountering errors, the origin can keep, archive, or delete
the object. When archiving, the origin can copy or move the object.

When the pipeline stops, the Amazon S3 origin notes where it stops reading. When the pipeline
starts again, the origin continues processing from where it stopped by default. You can
reset the origin to process all requested objects.

You can configure the origin to decrypt data stored on Amazon S3 with server-side
encryption and customer-provided encryption keys. You can optionally use a proxy to
connect to Amazon S3.

The origin can generate events for an event stream. For
more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

AWS Credentials

When Data Collector reads data from Amazon S3 with the Amazon S3 origin, it must pass credentials to Amazon Web
Services.

Use one of the following
methods to pass AWS credentials:

IAM roles

When Data Collector runs
on an Amazon EC2 instance, you can use the AWS Management Console to
configure an IAM role for the EC2 instance. Data Collector uses
the IAM instance profile credentials to automatically connect to AWS.

When you use IAM roles, you do not need to specify the Access Key ID and
Secret Access Key properties in the origin.

For more information about assigning an IAM role to an EC2 instance, see
the Amazon EC2 documentation.

AWS access key pairs

When Data Collector does
not run on an Amazon EC2 instance or when the EC2 instance doesn’t
have an IAM role, you must specify the Access Key
ID and Secret Access Key
properties in the origin.

Common Prefix, Prefix Pattern, and Wildcards

The Amazon S3 origin appends the common prefix to the
prefix pattern to define the objects that the origin processes. You can specify an exact
prefix pattern or you can use Ant-style path patterns to read multiple objects
recursively.

Ant-style path patterns can include the following wildcards:

Question mark (?) to match a single character

Asterisk (*) to match zero or more characters

Double asterisks (**) to match zero or more directories

For example, to process all log files in US/East/MD/ and all nested
prefixes, you can use the following common prefix and prefix
pattern:

Common Prefix: US/East/MD/
Prefix Pattern: **/*.log
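
As a rough illustration of how these wildcards apply to object keys, the following Python sketch (not the origin's implementation; the pattern translation is an approximation) converts an Ant-style prefix pattern to a regular expression and tests a few hypothetical key names:

import re

def ant_to_regex(pattern):
    # Rough translation of an Ant-style pattern to a regular expression:
    # '**/' matches zero or more prefix levels, '*' matches within one level,
    # and '?' matches a single character.
    regex = ''
    i = 0
    while i < len(pattern):
        if pattern[i:i + 3] == '**/':
            regex += '(?:.*/)?'
            i += 3
        elif pattern[i:i + 2] == '**':
            regex += '.*'
            i += 2
        elif pattern[i] == '*':
            regex += '[^/]*'
            i += 1
        elif pattern[i] == '?':
            regex += '[^/]'
            i += 1
        else:
            regex += re.escape(pattern[i])
            i += 1
    return re.compile(regex + '$')

common_prefix = 'US/East/MD/'
matcher = ant_to_regex('**/*.log')

for key in ['US/East/MD/server1.log',
            'US/East/MD/web/2016/server1.log',
            'US/East/MD/web/readme.txt']:
    if key.startswith(common_prefix):
        print(key, bool(matcher.match(key[len(common_prefix):])))

The first two keys match; the third does not, because the prefix pattern only matches objects that end in .log.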

If the unnamed nested prefixes that you want to include appear earlier in the hierarchy,
such as US/**/weblogs/, you can include the nested prefixes in the
prefix pattern or define the entire hierarchy in the prefix pattern.
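
For example, either of the following configurations (illustrative values) processes .log objects under unnamed nested prefixes such as US/East/weblogs/ and US/West/weblogs/:

Common Prefix: US/
Prefix Pattern: **/weblogs/*.log

Common Prefix: (none)
Prefix Pattern: US/**/weblogs/*.log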

Record Header Attributes

When the
Amazon S3 origin processes Avro data, it includes the Avro schema in
an avroSchema record header attribute. You can also configure the origin to include Amazon S3 object metadata in record
header attributes.

You can use the record:attribute or record:attributeOrDefault functions to
access the information in the attributes. For more information about working with record
header attributes, see Working with Header Attributes.
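
For example, when object metadata is included in record header attributes, an expression such as ${record:attribute('Content-Length')} returns the value of the Content-Length attribute, and ${record:attributeOrDefault('Content-Length', '0')} returns 0 when the attribute is not present.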

Object Metadata in Record Header Attributes

You can include Amazon S3 object metadata in
record header attributes. Include metadata when you want to use the information to help
process records. For example, you might include metadata if you want to route records to
different branches of a pipeline based on the last-modified timestamp.

Use the Include Metadata property to include metadata in the
record header attributes. When you include metadata in record header attributes, the
Amazon S3 origin includes the following information:

System-defined metadata

The origin includes the following system-defined metadata:

Name - The object name. Bucket and prefix information is included as
follows:

<bucket>/<prefix>/<object_name>

Cache-Control

Content-Disposition

Content-Encoding

Content-Length

Content-MD5

Content-Range

Content-Type

ETag

Expires

Last-Modified

For more information about Amazon S3 system-defined metadata, see the Amazon
S3 documentation.

User-defined metadata

When available, the Amazon S3 origin also includes user-defined metadata in
record header attributes.

Amazon S3 requires user-defined metadata to be named with the following
prefix: x-amz-meta-.

When generating the record header attribute, the origin omits the prefix.

For example, if you have user-defined metadata called
"x-amz-meta-extraInfo", the origin names the record
header attribute as follows: extraInfo.
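
For example, the following Python sketch writes an object with user-defined metadata using boto3. The bucket, key, and metadata values are hypothetical, and boto3 is an assumption; any S3 client that sets user-defined metadata works the same way.

import boto3

s3 = boto3.client('s3')
s3.put_object(
    Bucket='my-bucket',
    Key='LogFiles/server1.log',
    Body=b'example contents',
    # Stored by Amazon S3 with the x-amz-meta- prefix.
    Metadata={'extraInfo': 'server1'},
)

When the origin reads this object with Include Metadata enabled, the value is available in an extraInfo record header attribute, without the x-amz-meta- prefix.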

Read Order

The Amazon S3 origin reads objects in
ascending order based on the object key name or the last modified timestamp.
For best performance when reading a large number of objects, configure the
origin to read objects based on the key name.

You can configure one of the following read orders:

Lexicographically Ascending Key Names

The Amazon S3 origin can read objects in lexicographically
ascending order based on key names. Note that
lexicographically ascending order reads the numbers 1
through 11 as follows:

1, 10, 11, 2, 3, 4... 9
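
The difference is easy to reproduce with a short Python snippet (illustration only, not part of the origin):

keys = [str(n) for n in range(1, 12)]
print(sorted(keys))
# ['1', '10', '11', '2', '3', '4', '5', '6', '7', '8', '9']

# Zero-padding the numeric portion of key names keeps lexicographic
# order aligned with numeric order:
print(sorted('log-{:03d}'.format(n) for n in range(1, 12)))
# ['log-001', 'log-002', ..., 'log-011']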

For example, when you configure the Amazon S3 origin to read from a bucket, common prefix,
and prefix pattern using lexicographically ascending order based on key names, the origin
processes the matching objects in the lexicographic order of their key names.

Last-Modified Timestamp

The Amazon S3 origin can read objects in ascending order
based on the last modified timestamp. When you start
a pipeline, the origin starts processing data with
the earliest object that matches the common prefix
and prefix pattern, and then progresses in
chronological order. If two or more objects have the
same timestamp, the origin processes the objects in
lexicographically ascending order by key name.

To process objects that include a timestamp earlier than
processed objects, reset the origin to read all
available objects.

For example, suppose you configure the origin to read from the
ServerEast bucket, using
LogFiles/ as the common prefix
and *.log as the prefix pattern,
and to process log files from two different servers
in ascending order based on the last modified timestamp.

After the origin has processed objects with later timestamps,
if a new object arrives with a timestamp of 04-29-2016
12:00:00, the Amazon S3 origin does not process the
object unless you reset the origin.

Buffer Limit and Error Handling

The Amazon S3 origin uses a buffer to read objects into memory to produce records. The
size of the buffer determines the maximum size of the record that can be processed.

The buffer limit helps prevent out of memory errors.
Decrease the buffer limit when memory on the Data Collector machine is
limited. Increase the buffer limit to process larger records when memory is available.

When a record is larger than the specified limit, the origin processes the object based on
the stage error handling:

Discard

The origin discards the record and all remaining records in the object, and then
continues processing the next object.

Send to Error

With a buffer limit error, the origin cannot send the record to the pipeline for error
handling because it is unable to fully process the record.

Instead, the origin
displays a message in Monitor mode indicating that a buffer
overrun error occurred. The message includes the object and offset
where the error occurred. The information appears as an alert when you
monitor the pipeline and is also written to the pipeline history.

If an error directory is configured for the stage, the origin
moves the object to the error directory and continues processing the next object.

Stop Pipeline

The origin stops the pipeline and displays a message in Monitor mode
indicating that a buffer overrun error occurred. The message includes the object
and offset where the buffer overrun error occurred. The information displays as an alert and in the pipeline history.

Note: You can also check the Data Collector log file for
error details.

Server Side Encryption

You can configure the origin to decrypt data
stored on Amazon S3 with Amazon Web Services server-side encryption.

When configured for server-side encryption, the origin uses customer-provided encryption
keys to decrypt the data. To use server-side encryption, provide the following
information:

Customer Encryption Key - A Base64 encoded 256-bit encryption key.

Customer Encryption Key MD5 - A Base64 encoded 128-bit MD5 digest of the encryption key, as described in RFC 1321.

Event Records

Event records generated by the Amazon S3 origin
have the following event-related record header attributes. Record header attributes are
stored as String values:

Record Header Attribute

Description

sdc.event.type

Event type. Uses the following event type:

no-more-data - Generated after the origin completes
processing all available objects and the number of seconds
configured for Batch Wait Time has elapsed.

sdc.event.version

An integer that indicates the version of the event record type.

sdc.event.creation_timestamp

Epoch timestamp when the stage created the event.

The Amazon S3 origin can generate the following event record:

no-more-data

The Amazon S3 origin generates a no-more-data event record when the origin
completes processing all available records and the number of seconds configured
for Batch Wait Time elapses without any new objects becoming available for processing.

No-more-data event records generated by the origin have the sdc.event.type set
to no-more-data and include the following fields:

Event Record Field

Description

record-count

Number of records successfully generated since the pipeline started or since the last
no-more-data event was created.

error-count

Number of error records generated since the pipeline started or since the last
no-more-data event was created.

file-count

Number of objects that the origin attempted to process.
The count can include objects that could not be processed or were
not fully processed.

Data Formats

The Amazon S3 origin processes data
differently based on the data format. The origin processes the following types of data:

Avro

Generates a record for every Avro record. Includes a "precision"
and "scale" field attribute for each Decimal field. For more
information about field attributes, see Field Attributes.

The origin writes the Avro schema to an avroSchema record header
attribute. For more information about record header attributes,
see Record Header Attributes.

You can use one of the following methods to specify the location
of the Avro schema definition:

Message/Data Includes Schema -
Use the schema in the file.

In Pipeline Configuration - Use
the schema that you provide in the stage
configuration.

Confluent Schema Registry -
Retrieve the schema from Confluent Schema Registry.
The Confluent Schema Registry is a distributed
storage layer for Avro schemas. You can configure
the origin to look up the schema in the Confluent
Schema Registry by the schema ID or subject
specified in the stage configuration.

Using a schema in the stage configuration or retrieving a schema
from the Confluent Schema Registry overrides any schema that
might be included in the file and can improve performance.

The origin reads files compressed by Avro-supported compression
codecs without requiring additional configuration. To enable the
origin to read files compressed by other codecs, use the
compression format property in the stage.

Delimited

Generates a record for each delimited line. You can process data
that uses a range of delimited format types, including custom formats.

You can use a list or list-map root field type for delimited data,
optionally including the header information when available. For
more information about the root field types, see Delimited Data Root Field Type.

When using a header line, you can allow processing records with
additional columns. The additional columns are named using a
custom prefix and integers in sequential increasing order, such
as _extra_1, _extra_2. When you disallow additional columns when
using a header line, records that include additional columns are
sent to error.

You can also replace a string constant with null values.

When a record exceeds the user-defined maximum record length, the
origin cannot continue processing data in the file. Records
already processed from the file are passed to the pipeline. The
behavior of the origin is then based on the error handling
configured for the stage:

Discard - The origin continues processing with the
next file, leaving the partially-processed file in
the directory.

To Error - The origin continues processing with the
next file. If a post-processing error directory is
configured for the stage, the origin moves the
partially-processed file to the error directory.
Otherwise, it leaves the file in the directory.

Stop Pipeline - The origin stops the pipeline.

JSON

Generates a record for each JSON object. You can process JSON
files that include multiple JSON objects or a single JSON
array.

When an object exceeds the maximum object length defined for the
origin, the origin cannot continue processing data in the file.
Records already processed from the file are passed to the
pipeline. The behavior of the origin is then based on the error
handling configured for the stage:

Discard - The origin continues processing with the
next file, leaving the partially-processed file in
the directory.

To Error - The origin continues processing with the
next file. If a post-processing error directory is
configured for the stage, the origin moves the
partially-processed file to the error directory.
Otherwise, it leaves the file in the directory.

Stop Pipeline - The origin stops the pipeline.

Log

Generates a record for every log line.

When a line exceeds the user-defined maximum line length, the
origin truncates longer lines.

You can include the processed log line as a field in the record.
If the log line is truncated, and you request the log line in
the record, the origin includes the truncated line.

Protobuf

Generates a record for every protobuf message. Protobuf messages must match the
specified message type and be described in the descriptor file.

When the data for a record exceeds 1 MB, the origin cannot continue
processing data in the file. The origin handles the file based on the file
error handling properties and continues reading the next file.

Whole File

Streams whole files from the origin system to the destination
system. You can specify a transfer rate or use all available
resources to perform the transfer.

The origin uses checksums to verify the integrity of data
transmission.

The origin generates two fields: one for a file reference and one
for file information. For more information, see Whole File Data Format.

XML

Generates records based on a user-defined delimiter element. Use
an XML element directly under the root element or define a
simplified XPath expression. If you do not define a delimiter
element, the origin treats the XML file as a single record.

Generated records include XML attributes and namespace
declarations as fields in the record by default. You can
configure the stage to include them in the record as field
attributes.

You can include XPath information for each parsed XML element and
XML attribute in field attributes. This also places each
namespace in an xmlns record header attribute.

Note: Field attributes and record header attributes are
written to destination systems automatically only when you use the SDC RPC
data format in destinations. For more information about working with field
attributes and record header attributes, and how to include them in records,
see Field Attributes and Record Header Attributes.

When a record exceeds the user-defined maximum record length, the
origin cannot continue processing data in the file. Records
already processed from the file are passed to the pipeline. The
behavior of the origin is then based on the error handling
configured for the stage:

Discard - The origin continues processing with the
next file, leaving the partially-processed file in
the directory.

To Error - The origin continues processing with the
next file. If a post-processing error directory is
configured for the stage, the origin moves the
partially-processed file to the error directory.
Otherwise, it leaves the file in the directory.

Stop Pipeline - The origin stops the pipeline.

When you configure the origin, you also define the following properties:

Read Order

The order in which the origin reads objects:

Lexicographically Ascending Key Names - Reads objects in
lexicographically ascending order based on key names.

Last-Modified Timestamp - Reads objects in ascending
order based on the last-modified timestamp. When
objects have matching timestamps, reads objects in
lexicographically ascending order based on key
names.

For best performance when reading a large number of
objects, use lexicographical order based on key
names.

File Pool Size

Maximum number of files that the origin stores in memory
for processing after loading and sorting all files present
on S3. Increasing this number can improve pipeline
performance when Data Collector resources permit.

Default is 100.

Buffer Limit (KB)

Maximum buffer size. The buffer size determines the size of the record that can be
processed.

Decrease when memory on the Data Collector machine is limited. Increase to
process larger records when memory is available.

Default is 128 KB.

Max Batch Size (records)

Maximum number of records processed at one time. Honors values up to the Data Collector maximum batch size.

Default is
1000. The Data Collector default is
1000.

Batch Wait Time (ms)

Number of milliseconds to wait before sending a partial or empty batch.

To use server-side encryption, on the SSE tab, configure
the following properties:

SSE Property

Description

Use Server-Side Encryption

Enables the use of server-side encryption.

Customer Encryption Key

A Base64 encoded 256-bit encryption key.

Customer Encryption Key MD5

A Base64 encoded 128-bit MD5 digest of the encryption key
using RFC 1321.
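
The encryption key and its MD5 digest can be generated ahead of time. The following Python sketch (a minimal example that assumes only the standard library) produces both values in the expected Base64 form:

import base64
import hashlib
import os

# Generate a random 256-bit (32-byte) customer-provided key.
raw_key = os.urandom(32)

# Base64 encoded 256-bit encryption key.
encryption_key = base64.b64encode(raw_key).decode('ascii')

# Base64 encoded 128-bit MD5 digest of the raw key (RFC 1321).
encryption_key_md5 = base64.b64encode(hashlib.md5(raw_key).digest()).decode('ascii')

print(encryption_key)
print(encryption_key_md5)

Store the key securely. Amazon S3 does not retain customer-provided keys, so the same key is required to decrypt the data.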

On the Error Handling tab, configure the following
properties:

Error Handling Property

Description

Error Handling Option

The action taken when an error occurs while processing an
object:

None - Keeps the object in place.

Archive - Copies or moves the object to another
prefix or bucket.

Delete - Deletes the object.

When archiving processed objects, best practice is
to also archive objects that cannot be processed.

Archiving Option

The action to take when archiving an object that cannot
be processed.

You can copy or move the object to another prefix or bucket.
When you use another prefix, enter the prefix. When you use another bucket, enter
a prefix and bucket.

Copying the object leaves the original object in place.

Error Prefix

Prefix for the objects that cannot be processed.

Error Bucket

Bucket for the objects that cannot be processed.

On the Post Processing tab, configure the following
properties:

Post Processing Property

Description

Post Processing Option

The action taken after successfully processing an object:

None - Keeps the object in place.

Archive - Copies or moves the object to another
location.

Delete - Deletes the object.

Archiving Option

The action to take when archiving a processed object.

You can copy or move the object to another prefix or bucket.
When you use another prefix, enter the prefix. When you use another bucket, enter
a prefix and bucket.

For Avro data, on the Data Format tab, configure the following properties:

Avro Schema Location

Location of the Avro schema definition used to process the data.
Using a schema in the stage configuration or in the
Confluent Schema Registry can improve
performance.

Avro Schema

Avro schema definition used to process the data.
Overrides any existing schema definitions associated with
the data.

You can optionally use the runtime:loadResource
function to use a schema definition stored in a runtime
resource file.

Schema Registry URLs

Confluent Schema Registry URLs used to look up the
schema. To add a URL, click Add. Use
the following format to enter the
URL:

http://<host name>:<port number>
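
For example (illustrative host name; 8081 is the Confluent Schema Registry default port):

http://schemaregistry.example.com:8081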

Lookup Schema By

Method used to look up the schema in the Confluent Schema
Registry:

Subject - Look up the specified Avro schema
subject.

Schema ID - Look up the specified Avro schema ID.

Overrides any existing schema definitions associated
with the data.

Schema Subject

Avro schema subject to look up in the Confluent Schema
Registry.

If the specified subject has multiple schema
versions, the origin uses the latest schema version for
that subject. To use an older version, find the
corresponding schema ID, and then set the
Lookup Schema By property to
Schema ID.

Schema ID

Avro schema ID to look up in the Confluent Schema
Registry.

For delimited data, on the Data Format tab, configure the
following properties:

Header Line

Indicates whether a file contains a header line, and
whether to use the header line.

Allow Extra Columns

When processing data with a header line, allows
processing records with more columns than exist in the
header line.

Extra Column Prefix

Prefix to use for any additional columns. Extra columns
are named using the prefix and sequential increasing
integers as follows:
<prefix><integer>.

For
example, _extra_1. Default is _extra_.

Max Record Length (chars)

Maximum length of a record in characters. Longer records
are not read.

This property can be limited by the Data Collector parser
buffer size. For more information, see Maximum Record Size.

Delimiter Character

Delimiter character for a custom delimiter format. Select
one of the available options or use Other to enter a custom
character.

You can enter a Unicode control character
using the format \uNNNN, where N is a
hexadecimal digit from the numbers 0-9 or the letters
A-F. For example, enter \u0000 to use the null character
as the delimiter or \u2028 to use a line separator as
the delimiter.

Default is the pipe character ( |
).

Escape Character

Escape character for a custom file type.

Quote Character

Quote character for a custom file type.

Root Field Type

Root field type to use:

List-Map - Generates an indexed list of data.
Enables you to use standard functions to process
data. Use for new pipelines.

List - Generates a record with an indexed list with
a map for header and value. Requires the use of
delimited data functions to process data. Use only
to maintain pipelines created before 1.1.0.

Lines to Skip

Lines to skip before reading data.

Parse NULLs

Replaces the specified string constant with null
values.

NULL Constant

String constant to replace with null values.

Charset

Character encoding of the files to be processed.

Ignore Ctrl Characters

Removes all ASCII control characters except for the tab, line feed, and carriage
return characters.

For XML data, on the Data Format tab, configure the
following properties:

Include Field XPaths

Includes the XPath to each parsed XML element and XML
attribute in field attributes. Also includes each namespace
in an xmlns record header attribute.

When not selected, this information is not included in the record. By
default, the property is not selected.

Note: Field attributes and record header attributes are
written to destination systems automatically only when you use the SDC RPC
data format in destinations. For more information about working with field
attributes and record header attributes, and how to include them in records,
see Field Attributes and Record Header Attributes.

Namespaces

Namespace prefix and URI to use when parsing the XML
document. Define namespaces when the XML element being used
includes a namespace prefix or when the XPath expression
includes namespaces.

Output Field Attributes

Includes XML attributes and namespace declarations in the
record as field attributes. When not selected, XML
attributes and namespace declarations are included in the
record as fields.

Note: Field attributes are automatically included in
records written to destination systems only when you use the SDC RPC data
format in the destination. For more information about working with field
attributes, see Field Attributes.

By default, the property is not
selected.

Max Record Length (chars)

The maximum number of characters in a record. Longer
records are diverted to the pipeline for error handling.

This property can be limited by the Data Collector parser
buffer size. For more information, see Maximum Record Size.

Charset

Character encoding of the files to be processed.

Ignore Ctrl Characters

Removes all ASCII control characters except for the tab, line feed, and carriage
return characters.