The EMRFS S3-optimized Committer and Multipart Uploads

To use the EMRFS S3-optimized committer, multipart uploads must be enabled in
Amazon EMR. Multipart uploads are enabled by default. If they have been
disabled, you can re-enable them. For more information, see "Configure
Multipart Upload for Amazon S3" in the
Amazon EMR Management Guide.
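
As a rough sketch of what re-enabling multipart uploads could look like, the Python snippet below builds a cluster configuration object and prints it as JSON. It assumes that the setting is the fs.s3n.multipart.uploads.enabled property in the core-site classification, as described in the Amazon EMR Management Guide; neither name appears in this section, so verify both against that guide. The function name is hypothetical.

```python
import json


def multipart_upload_configuration(enabled: bool = True) -> list:
    """Build an EMR configuration list that toggles multipart uploads.

    Assumption: the property name and the core-site classification are taken
    from the Amazon EMR Management Guide, not from this section.
    """
    return [
        {
            "Classification": "core-site",
            "Properties": {
                # EMR configuration property values are strings.
                "fs.s3n.multipart.uploads.enabled": "true" if enabled else "false",
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON that could be supplied as the cluster's configurations.
    print(json.dumps(multipart_upload_configuration(), indent=2))
```

The resulting JSON could be supplied as the configuration when you create the cluster.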

The EMRFS S3-optimized committer uses the transaction-like characteristics of
multipart uploads to ensure files written by task attempts only appear in the
job's output location upon task commit. By using multipart uploads in this way,
the committer improves task commit performance over the default
FileOutputCommitter algorithm version 2. When using the EMRFS S3-optimized
committer, there are some key differences from traditional multipart upload
behavior to consider:

Multipart uploads are always performed regardless of the file size.
This differs from the default behavior of EMRFS, where the
fs.s3n.multipart.uploads.split.size property controls
the file size at which multipart uploads are triggered.

Multipart uploads are left in an incomplete state for longer, until the
task commits or aborts. This differs from the default behavior of EMRFS,
where a multipart upload completes as soon as a task finishes writing a
given file.
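
The two differences above can be illustrated with a toy in-memory model. This is not the EMRFS implementation: ToyBucket and ToyTaskAttempt are hypothetical classes that only mimic the property that a multipart upload's data becomes visible when the upload is completed, and that the completion is deferred to task commit.

```python
class ToyBucket:
    """In-memory stand-in for an S3 bucket (hypothetical, illustration only)."""

    def __init__(self):
        self.objects = {}   # key -> bytes; what readers of the bucket can see
        self.pending = {}   # upload_id -> (key, parts); incomplete uploads
        self._next_id = 0

    def create_multipart_upload(self, key):
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self.pending[upload_id] = (key, [])
        return upload_id

    def upload_part(self, upload_id, data):
        self.pending[upload_id][1].append(data)

    def complete_multipart_upload(self, upload_id):
        # Only at this point does the object appear under its key.
        key, parts = self.pending.pop(upload_id)
        self.objects[key] = b"".join(parts)

    def abort_multipart_upload(self, upload_id):
        # Discards the uploaded parts; nothing becomes visible.
        self.pending.pop(upload_id)


class ToyTaskAttempt:
    """Writes every file as a multipart upload and defers completion."""

    def __init__(self, bucket):
        self.bucket = bucket
        self.upload_ids = []

    def write_file(self, key, data):
        # A multipart upload is used regardless of the file size.
        upload_id = self.bucket.create_multipart_upload(key)
        self.bucket.upload_part(upload_id, data)
        self.upload_ids.append(upload_id)  # left incomplete until commit/abort

    def commit(self):
        for upload_id in self.upload_ids:
            self.bucket.complete_multipart_upload(upload_id)
        self.upload_ids.clear()

    def abort(self):
        for upload_id in self.upload_ids:
            self.bucket.abort_multipart_upload(upload_id)
        self.upload_ids.clear()


if __name__ == "__main__":
    bucket = ToyBucket()
    attempt = ToyTaskAttempt(bucket)
    attempt.write_file("output/part-00000", b"rows")
    print(sorted(bucket.objects))  # [] : nothing is visible before commit
    attempt.commit()
    print(sorted(bucket.objects))  # ['output/part-00000']
```

In this model, an attempt that calls neither commit nor abort (for example, because its process was killed) leaves its entries in pending.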

Because of these differences, if a Spark Executor JVM crashes or is killed
while tasks are running and writing data to Amazon S3, incomplete multipart uploads
are more likely to be left behind. For this reason, when you use the
EMRFS S3-optimized committer, be sure to follow the best practices for
managing failed multipart uploads. For more information, see Best
Practices for working with Amazon S3 buckets in the
Amazon EMR Management Guide.
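
One common way to clean up incomplete multipart uploads is an Amazon S3 lifecycle rule that aborts them after a number of days. This section does not spell out that practice, so treat the following Python sketch as an assumption about what the best practices involve. It uses the boto3 put_bucket_lifecycle_configuration call; the function names, the rule ID, and the seven-day default are hypothetical choices.

```python
def abort_incomplete_uploads_rule(days_after_initiation: int = 7, prefix: str = "") -> dict:
    """Build a lifecycle configuration that aborts stale multipart uploads.

    The rule ID and the default of 7 days are illustrative, not recommendations.
    """
    if days_after_initiation < 1:
        raise ValueError("days_after_initiation must be at least 1")
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                # An empty prefix applies the rule to the whole bucket.
                "Filter": {"Prefix": prefix},
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": days_after_initiation
                },
            }
        ]
    }


def apply_rule(bucket_name: str, days_after_initiation: int = 7) -> None:
    """Apply the rule to a bucket. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the rule builder works without boto3

    s3 = boto3.client("s3")
    # Note: this call replaces the bucket's existing lifecycle configuration,
    # so merge in any rules the bucket already has before using it.
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=abort_incomplete_uploads_rule(days_after_initiation),
    )
```

Choose a retention period longer than your longest-running job, so that uploads belonging to tasks that are still running are not aborted.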
