Using the S3A FileSystem Client

Hortonworks Data Cloud supports the Apache Hadoop S3A client. S3A is a filesystem client connector used by Hadoop to read and write data from Amazon S3 or a compatible service. The S3A filesystem uses Amazon's libraries to interact with Amazon S3. It uses the URI prefix s3a://.

S3A is backward compatible with its predecessor S3N (recognized by the s3n:// prefix in URLs), which shipped with earlier versions of Hadoop. Replacing the s3n:// prefix with s3a:// in URLs is sufficient to use the S3A connector in place of S3N.

S3A is implemented in hadoop-aws.jar. This library and its dependencies are automatically placed on the classpath.

Important

The Amazon JARs have proven very brittle: the version of the Amazon
libraries must match the versions against which the Hadoop binaries were built.

Hadoop FileSystem Shell Commands

Many of the standard Hadoop FileSystem shell commands that interact with HDFS
can also be used to interact with S3A.

Accessing S3A

By default, the Hadoop FileSystem shell assumes invocation against the cluster's
default filesystem, which is defined in the configuration property fs.defaultFS
in core-site.xml. For HDP clusters on AWS, the default filesystem is the
deployed HDFS instance.
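For illustration, a default filesystem entry in core-site.xml might look like the following; the NameNode host and port are placeholders:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>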

When running commands, provide a fully qualified URI with the s3a scheme and the bucket in the authority. For example, the following command lists all files in a directory called "dir1", which resides in a bucket called "bucket1":

hadoop fs -ls s3a://bucket1/dir1

The Hadoop FileSystem shell uses the configured AWS credentials to access the S3
bucket. For further discussion of
credential configuration and for additional examples of the Hadoop FileSystem shell invocation, refer to Amazon S3 Security.

Command Structure

The Hadoop FileSystem shell commands use the following syntax:

hadoop fs -<operation> s3a://<bucket>/dir1

where:

hadoop fs indicates that you want to perform an operation using the Hadoop FileSystem shell

<operation> indicates a particular action to be performed against a directory or a file

s3a:// is the prefix needed to access Amazon S3

<bucket> indicates a particular Amazon S3 bucket

Command Examples

You can use the Hadoop FileSystem shell
to list directories, create files, delete files, and so on.

You can create directories, and create or copy files into them. For example:
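The following sketch assumes a bucket named "bucket1" and a local file /tmp/hosts, both illustrative:

hadoop fs -mkdir s3a://bucket1/datasets
hadoop fs -put /tmp/hosts s3a://bucket1/datasets/
hadoop fs -ls s3a://bucket1/datasets/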

Commands That May Be Slower

Some commands tend to be significantly slower with Amazon S3 than when
invoked against HDFS or other filesystems. These include the find, mv, cp, and rm commands, as well as operations that rename or list files.

Renaming Files

Unlike in a normal filesystem, renaming files and directories in an object store
usually takes time proportional to the size of the objects being manipulated.
Because many of the filesystem shell operations use renaming as their final stage,
skipping that stage can avoid long delays.

In particular, we recommend that when using the put and copyFromLocal commands, you set the -d option for a direct upload. For example:
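A minimal sketch, assuming an illustrative local file /tmp/example.orc and the "bucket1" bucket:

hadoop fs -put -d /tmp/example.orc s3a://bucket1/datasets/
hadoop fs -copyFromLocal -d /tmp/example.orc s3a://bucket1/datasets/

The -d option skips the creation of a temporary file with the ._COPYING_ suffix and writes directly to the destination.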

Rename

The time to rename a file depends on its size. The time to rename a directory depends on the number and size of all files
beneath that directory. If the operation is interrupted, the object store will be in an undefined
state.

hadoop fs -mv s3a://bucket1/datasets s3a://bucket1/historical

Copy

The copy operation reads each file and then writes it back to the object store;
the time to complete depends on the amount of data to copy, and on the bandwidth
in both directions between the local computer and the object store.

hadoop fs -cp s3a://bucket1/datasets s3a://bucket1/historical

Note

The further the computer is from the object store, the longer the copy process takes.

Refer to the Amazon S3 Performance section for further discussion of S3A filesystem semantics and its impact on performance.

Unsupported Subcommands

S3A does not implement the same feature set as HDFS. The following FileSystem
shell subcommands are not supported with an S3A URI:

-appendToFile

-checksum

-chgrp

-chmod

-chown

-createSnapshot

-deleteSnapshot

-df

-getfacl

-getfattr

-renameSnapshot

-setfacl

-setfattr

-setrep

-truncate

Learn More

This section only covers how selected Hadoop FileSystem shell commands behave when invoked against data in Amazon S3. Refer to the Apache documentation
for more information on the Hadoop FileSystem shell commands.

Deleting Objects

The rm command deletes objects and directories full of objects.
If the object store is eventually consistent, hadoop fs -ls and other accessors
may briefly return the details of the now-deleted objects; this
is an artifact of object stores which cannot be avoided.
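For example, the following removes a directory tree of objects under an illustrative path:

hadoop fs -rm -r s3a://bucket1/datasets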

If the filesystem client is configured to copy files to a trash directory,
the trash directory is in the bucket. The rm operation then takes time proportional
to the size of the data. Furthermore, the deleted files continue to incur
storage costs.

To make sure that your deleted files are no longer incurring costs, you can do two things:

Use the -skipTrash option when removing files:

hadoop fs -rm -skipTrash s3a://bucket1/dataset

Use the expunge command to purge any data that has been previously moved to the .Trash directory:

hadoop fs -expunge -D fs.defaultFS=s3a://bucket1/

Because the expunge command only works with the default filesystem, you need to use the -D option to make the target object store the default filesystem for that invocation.

Overwriting Objects

Amazon S3 is eventually consistent, which means that an operation which
overwrites existing objects may not be immediately visible to all clients/queries.
As a result, later operations which query the same object's status or contents
may get the previous object; this can sometimes surface within the same client,
while reading a single object.

Avoid command sequences which overwrite objects and then immediately
work on the updated data; there is a risk that the previous data will be used
instead.
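A sketch of the pattern to avoid, with illustrative paths:

hadoop fs -put -f /tmp/update.csv s3a://bucket1/data/records.csv
hadoop fs -cat s3a://bucket1/data/records.csv

The -cat call may still return the contents of the previous object rather than the newly uploaded data.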

Timestamps

Timestamps of objects and directories in Amazon S3 do not follow the behavior of files
and directories in HDFS:

The creation time of an object is the time when the object was created in the object
store. This is at the end of the write process, not the beginning.

If an object is overwritten, the modification time is updated.

Directories may or may not have valid timestamps.

The atime (access time) feature is not supported by any of the object stores
found in the Apache Hadoop codebase.
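To check an object's modification time, you can use the -stat subcommand with the "%y" format string; the path here is illustrative:

hadoop fs -stat "%y" s3a://bucket1/datasets/example.orc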

Security Model and Operations

The security and permissions model of Amazon S3 is very different
from that of a UNIX-style filesystem: on Amazon S3, operations which query or manipulate
permissions are generally unsupported. Operations to which this applies include: chgrp, chmod, chown,
getfacl, and setfacl. The related attribute commands getfattr and setfattr
are also unavailable. In addition, operations which try to preserve permissions (for example fs -put -p)
do not preserve permissions.

Although these operations are unsupported, filesystem commands which list permission and user/group details usually
simulate these details. As a consequence, when interacting with read-only object stores, the permissions found in "list"
and "stat" commands may indicate that the user has write access when in fact they do not.

Amazon S3 has a permissions model of its own, which can be manipulated through store-specific tooling.
Be aware that some of the permissions which can be set, such as write-only paths
or various permissions on the root path, may
be incompatible with the S3A client: it expects full
read and write access to the entire bucket when trying to write data, and
may fail if it does not have these permissions.

Simulated Permissions

As an example of how permissions are simulated, here is a listing of Amazon's public,
read-only bucket of Landsat images:
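Such a listing can be produced with a command like the following, run against the public landsat-pds bucket (the listing output itself is omitted here):

hadoop fs -ls s3a://landsat-pds/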

The owner of all files and directories is declared to be the current user (mapred).

The timestamp of all directories is actually that of the time the -ls operation
was executed. This is because these directories are not actual objects in the store;
they are simulated directories based on the existence of objects under their paths.

When an attempt is made to delete one of the files, the operation fails — despite
the permissions shown by the ls command:
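A sketch of such an attempt, using an illustrative object name; the command fails with an access-denied error rather than deleting the object:

hadoop fs -rm s3a://landsat-pds/scene_list.gz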

This demonstrates that the listed permissions cannot
be taken as evidence of write access; only object manipulation can determine
this.

User-Agent Customization

By default, S3A includes the current Hadoop version in the User-Agent string
passed through the AWS SDK to the Amazon S3 service. You may also include optional
additional information to identify your application by setting the configuration
property fs.s3a.user.agent.prefix in core-site.xml or on the command line, as documented here:

<property>
  <name>fs.s3a.user.agent.prefix</name>
  <value></value>
  <description>
    Sets a custom value that will be prepended to the User-Agent header sent in
    HTTP requests to the S3 back-end by S3AFileSystem. The User-Agent header
    always includes the Hadoop version number followed by a string generated by
    the AWS SDK. An example is "User-Agent: Hadoop 2.8.0, aws-sdk-java/1.10.6".
    If this optional property is set, then its value is prepended to create a
    customized User-Agent. For example, if this configuration property was set
    to "MyApp", then an example of the resulting User-Agent would be
    "User-Agent: MyApp, Hadoop 2.8.0, aws-sdk-java/1.10.6".
  </description>
</property>
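The prefix can also be supplied for a single invocation through the generic -D option; the "MyApp" value and bucket name are illustrative:

hadoop fs -D fs.s3a.user.agent.prefix=MyApp -ls s3a://bucket1/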

The presence of "Hadoop" in the User-Agent identifies that the source of the
call is a Hadoop application, running a specific version of Hadoop. Setting a
custom prefix is optional, but it may assist AWS support with identifying
traffic originating from a specific application.