Accessing Data Stored in Amazon S3 through Spark

To access data stored in Amazon S3 from Spark applications, you use Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading
and writing RDDs, providing URLs of the form s3a://bucket_name/path/to/file. You can read and write Spark SQL DataFrames using the
Data Source API.

Important: Cloudera components that write data to S3 are constrained by Amazon S3's eventual consistency model. For more information, see Data Storage Considerations.

Specifying Credentials to Access S3 from Spark

You can access Amazon S3 from Spark in the following ways:

Note: If your S3 buckets have TLS enabled and you are using a custom jssecacerts truststore, make sure that your
truststore includes the root Certificate Authority (CA) certificate that signed the Amazon S3 certificate. For more information, see Amazon
Web Services (AWS) Security.

Without credentials:

Run EC2 instances with instance profiles associated with IAM roles that have the permissions you want. Requests from a machine with such a profile authenticate without credentials.

This mode of operation associates the authorization with individual EC2 instances instead of with each Spark app or the entire cluster.

With credentials:

You can use one of the following methods to set up AWS credentials.

Set up AWS Credentials Using the Hadoop Credential Provider - Cloudera recommends this method of setting up AWS access because it provides system-wide AWS access to a single predefined bucket without exposing the secret key in a configuration file or requiring it to be specified at runtime.

Create the Hadoop credential provider file with the necessary access and secret keys:
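For example, you can use the hadoop credential command to store both keys in a Java keystore file on HDFS; the keystore path and key values below are placeholders:

```shell
# Store the AWS access and secret keys in a JCEKS keystore on HDFS.
# Replace the key values and the keystore path with your own.
hadoop credential create fs.s3a.access.key -value your_access_key \
    -provider jceks://hdfs/user/root/awskeyfile.jceks
hadoop credential create fs.s3a.secret.key -value your_secret_key \
    -provider jceks://hdfs/user/root/awskeyfile.jceks
```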

AWS access for users can be set up in two ways. You can either provide a global credential provider file that allows all Spark users to submit S3 jobs, or have each user supply their own credentials every time they submit a job.

For Per-User Access - Provide the path to your specific credential store on the command line when submitting a Spark job. This means you do not need to
modify the global settings for core-site.xml. Each user submitting a job can provide their own credentials at runtime as follows:
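For example, a user might point spark-submit at their own credential store; the keystore path, application class, and jar name below are all illustrative:

```shell
# The credential store path, application class, and jar are placeholders.
spark-submit \
  --conf spark.hadoop.security.credential.provider.path=jceks://hdfs/user/alice/awskeyfile.jceks \
  --class com.example.MyS3App \
  my-s3-app.jar
```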

For System-Wide Access - Point to the Hadoop credential file created in the previous step using the Cloudera Manager Server:

Log in to the Cloudera Manager Admin Console.

On the main page under Cluster, click HDFS, then click Configuration. In the search box, enter core-site.

Click the + sign next to Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml. For Name, enter spark.hadoop.security.credential.provider.path; for Value, enter jceks://hdfs/path_to_hdfs_file (for example, jceks://hdfs/user/root/awskeyfile.jceks).

Click Save Changes, deploy the client configuration to all nodes of the cluster, and restart the affected services.

After the services restart, you can use the AWS filesystem with credentials supplied automatically through a secure mechanism.

(Optional) Configure Oozie to Run Spark S3 Jobs - Set spark.hadoop.security.credential.provider.path to the path of the .jceks file in Oozie's workflow.xml file under the
Spark Action's spark-opts section. This allows Spark to load AWS credentials from the .jceks file in HDFS.
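For example, the Spark action in workflow.xml might carry the provider path in its spark-opts element; the keyfile path shown is illustrative:

```xml
<spark xmlns="uri:oozie:spark-action:0.2">
    <!-- other Spark action elements (master, jar, class, and so on) -->
    <spark-opts>--conf spark.hadoop.security.credential.provider.path=jceks://hdfs/user/root/awskeyfile.jceks</spark-opts>
</spark>
```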

(Not Recommended) Specify the credentials at run time:

This mode of operation is the most flexible, because each application can access different S3 buckets. It might require extra work on your part to keep the secret key out of source code; for example, you might use a function call to retrieve the secret key from a secure location.
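For example, an application can set the keys on the Hadoop configuration at run time. In the sketch below, sc is the SparkContext and getSecret is a hypothetical helper that retrieves each key from a secure store instead of hard-coding it:

```scala
// `getSecret` is a hypothetical helper that fetches each key from a
// secure location rather than embedding it in source code.
sc.hadoopConfiguration.set("fs.s3a.access.key", getSecret("aws-access-key"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", getSecret("aws-secret-key"))
```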

Note:

This mechanism is not recommended for providing AWS credentials to Spark because the credentials are visible (unredacted) in application logs and event logs.

(Not Recommended) Specify the credentials in a configuration file, such as core-site.xml:
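For example, the S3A connector reads its keys from the following core-site.xml properties; the values shown are placeholders:

```xml
<property>
  <name>fs.s3a.access.key</name>
  <value>your_access_key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>your_secret_key</value>
</property>
```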

This mode of operation is convenient if all, or most, apps on a cluster access the same S3 bucket. Any apps that need different S3 credentials can use one of the other S3 authorization
techniques.

Note:

This mechanism is not recommended for providing AWS credentials to Spark because the credentials are visible (unredacted) in application logs and event logs.

Performance Considerations for Spark with S3

To improve the performance of writing to Hive tables on S3 from Spark, Cloudera recommends using the default output committer with algorithm version 2 ("algorithm=2"). In this case, you must also disable speculative execution by adding the following configuration settings.

This spark.hadoop.* parameter must be added to the YARN Advanced Configuration Snippet (Safety Valve) to take effect:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

This Spark configuration setting can be added to the Spark configuration as usual:

spark.speculation=false

Examples of Accessing S3 Data from Spark

The following examples demonstrate basic patterns of accessing data in S3 using Spark. The examples show the setup steps, application code, and input and output files located in S3.

Reading and Writing Data Sources From and To Amazon S3

The following example illustrates how to read a text file from Amazon S3 into an RDD, convert the RDD to a DataFrame, and then use the Data Source API to write the DataFrame into a
Parquet file on Amazon S3:
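A minimal Scala sketch of this pattern follows. The bucket name and paths are placeholders, and it assumes an existing SparkSession named spark (as provided by spark-shell, for example):

```scala
// Assumes a SparkSession named `spark`; bucket and paths are placeholders.
import spark.implicits._

// Read a text file from S3 into an RDD of lines.
val linesRDD = spark.sparkContext.textFile("s3a://bucket_name/input/sonnets.txt")

// Convert the RDD to a single-column DataFrame.
val linesDF = linesRDD.toDF("line")

// Use the Data Source API to write the DataFrame to S3 as Parquet.
linesDF.write.parquet("s3a://bucket_name/output/sonnets_parquet")
```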