Copying Data Between a Cluster and Amazon S3

DistCp is a utility for copying large data sets between distributed filesystems. You can use DistCp to copy data between your cluster’s HDFS and Amazon S3.

This section describes how to copy data from S3 to HDFS, from HDFS to Amazon S3, and between Amazon S3 buckets. It also includes tips for copying large data volumes and describes limitations of using DistCp with Amazon S3.

Invoking DistCp

To access the DistCp utility, SSH to any node in your cluster. By default,
DistCp is invoked against the cluster's default filesystem, which is defined in the configuration property fs.defaultFS in core-site.xml. For HDP clusters on AWS, the default filesystem is the deployed HDFS instance. This means that both of the following examples are valid:

hadoop distcp hdfs://source-folder s3a://destination-bucket

hadoop distcp /source-folder s3a://destination-bucket

Copying Data from HDFS to Amazon S3

To transfer data from HDFS to an Amazon S3 bucket, use the following syntax:

hadoop distcp hdfs://source-folder s3a://destination-bucket

Updating Existing Data

If you would like to transfer only the files that don't already exist in the target folder, add the -update option to improve the copy speed:

hadoop distcp -update hdfs://source-folder s3a://destination-bucket

Refer to the Apache DistCp documentation for a detailed explanation of how the handling of source paths varies depending on whether or not you add the -update option.

Note

When copying between Amazon S3 and HDFS, the -update check only compares file size and modification time; it does not use checksums to detect other changes in the data.

Copying Data from Amazon S3 to HDFS

To copy data from Amazon S3 to HDFS, list the path of the Amazon S3 data first. For example:

hadoop distcp s3a://hwdev-examples-ireland/datasets /tmp/datasets2

This downloads all files. You can add the -update option to download only data that has changed:

hadoop distcp -update s3a://hwdev-examples-ireland/datasets /tmp/datasets2

Copying Data Between Amazon S3 Buckets

You can copy data between two Amazon S3 buckets by listing the bucket URLs as the source and destination paths. In addition to copying data between buckets in a single AWS region, you can use this syntax to copy data between buckets hosted in different regions, simply by naming the bucket in the remote region.
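For example, a cross-bucket copy looks like any other DistCp invocation; the bucket names below are placeholders:

hadoop distcp s3a://source-bucket/datasets s3a://destination-bucket/datasets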

Irrespective of source and destination bucket locations, when copying data between Amazon S3 buckets,
all data passes through the Hadoop cluster: once to read, once to write.
This means that the time to perform the copy depends on the size of the Hadoop cluster, and the
bandwidth between it and the S3 buckets.
Furthermore, even when running within Amazon's own infrastructure, you are billed for your accesses to remote Amazon S3 buckets.

Specifying Bucket-Specific Options

If a bucket has different authentication or endpoint options, you can set those options for that bucket with a bucket-specific configuration property. For example, copying to a remote bucket that requires Amazon's V4 authentication API means declaring the explicit S3 endpoint. In the following example, the bucket name and endpoint are illustrative:

hadoop distcp -D fs.s3a.bucket.destination-bucket.endpoint=s3.eu-central-1.amazonaws.com hdfs://source-folder s3a://destination-bucket

Similarly, different credentials may be used when copying between buckets belonging to different accounts.
When performing such an operation, keep in mind that secrets passed on the command line
can be visible to other users on the system, and are therefore potentially insecure.

Using short-lived session keys reduces this exposure, while storing the
secrets in Hadoop JCEKS credential files is significantly more secure.
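As an illustration, per-bucket credentials can be supplied through bucket-specific properties; the bucket names and key values below are hypothetical:

hadoop distcp -D fs.s3a.bucket.destination-bucket.access.key=ACCESS-KEY -D fs.s3a.bucket.destination-bucket.secret.key=SECRET-KEY s3a://source-bucket/datasets s3a://destination-bucket/datasets

A more secure alternative is to place the secrets in a JCEKS credential file and reference it through the hadoop.security.credential.provider.path property, so that no secrets appear on the command line.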

Copying Data Within an Amazon S3 Bucket

Copy operations within a single object store still take place in the Hadoop cluster, even when the object store implements a more efficient copy operation internally. That is, an operation such as

hadoop distcp s3a://bucket/datasets/set1 s3a://bucket/datasets/set2

copies each byte down to the Hadoop worker nodes and back to the bucket. In addition to the operation being slow, it means that charges may be incurred.

Limitations

Consider the following limitations:

The -append option is not supported.

The -diff option is not supported.

The -atomic option causes a rename of the temporary data, so significantly
increases the time to commit work at the end of the operation. Furthermore,
as S3A does not offer atomic renames of directories, the -atomic
operation doesn't actually deliver what is promised. Avoid using this option.

All -p options, including those to preserve permissions, user and group information, attributes, checksums, and replication, are ignored.

CRC checking will not be performed, irrespective of the value of the -skipcrccheck flag.

Improving Performance When Copying Large Data Volumes

This section includes tips for improving performance when copying large volumes of data between Amazon S3 and HDFS.

The bandwidth between the Hadoop cluster and Amazon S3 is the upper limit to how fast data can be copied into S3. The further the Hadoop cluster is from the Amazon S3 installation, or the narrower the network connection is, the longer the operation will take. Even a Hadoop cluster deployed within Amazon's own infrastructure may encounter network delays from throttled VM network connections.

Network bandwidth limits notwithstanding, there are some options which can be used to tune the performance of an upload.

Working with Local S3 Buckets

A foundational step to getting good performance is working with buckets "close" to the Hadoop cluster, where "close" is measured in network terms.

Maximum performance in HDCloud for AWS is achieved from working with S3 buckets in the same AWS site as the HDCloud cluster. For example, if your cluster is in North Virginia ("US East"), you will achieve best performance if your S3 bucket is in the same region.

Each time you create a cluster in a new region, create a bucket in that same region. Likewise, make sure that buckets used for backup and cluster storage are on the same site as the clusters.

In addition to improving performance, working with local buckets ensures that no data transfer charges are incurred for reading from the bucket.

When working with S3 remotely (i.e. if your cluster is not on AWS), use the closest S3 site possible; this will reduce latency on all queries and reads, as well as bandwidth between the Hadoop cluster and the object store.

Accelerating File Listing

When data is copied between buckets, listing all the files
to copy can take a long time. In such cases, you can increase -numListstatusThreads from 1 (default) to 15. With this setting, multiple threads will be used for listing the contents of the source folder.
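For example, the listing phase of a bucket-to-bucket copy can be parallelized like this (the bucket names are placeholders):

hadoop distcp -numListstatusThreads 15 s3a://source-bucket/datasets s3a://destination-bucket/datasets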

Using the S3A Fast Uploader

Warning

These tuning recommendations are experimental and may change in the future.

If you are planning to copy large amounts of data, make sure to use the fast uploader by passing -D fs.s3a.fast.upload=true when invoking DistCp. For example:

hadoop distcp -D fs.s3a.fast.upload=true hdfs://source-folder s3a://destination-bucket

Controlling the Number of Mappers and Their Bandwidth

If you want to control the number of mappers launched for DistCp, you can add the -m option and set it to the desired number of mappers.
If you are using DistCp from a Hadoop cluster running in Amazon's infrastructure, increasing the number of mappers may speed up the operation.

Similarly, if copying to S3 from a remote site, it is possible that the
bandwidth from the Hadoop cluster to Amazon S3 is the bottleneck. In such
a situation, because the bandwidth is shared across all mappers, adding more
mappers will not accelerate the upload; it will merely slow all the mappers down.

The -bandwidth option sets the approximate maximum bandwidth for each mapper, in
megabytes per second. This is a floating-point number, so a value such as -bandwidth 0.5 allocates 0.5 MB/s
to each mapper.
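Combining these options, a hypothetical upload that runs 20 mappers, each throttled to roughly 10 MB/s, might look like this:

hadoop distcp -m 20 -bandwidth 10 hdfs://source-folder s3a://destination-bucket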