Amazon EMR is a service that allows you to use distributed data processing frameworks such as Apache Hadoop, Apache Spark and Presto to process data on a managed cluster of EC2 instances.

Newer versions of EMR (3.10 and 4.x), allow you to use Amazon EBS volumes to increase the local storage of each instance. This works well with the existing set of supported instance types, and also gives you the ability to use the M4 and C4 instance types with EMR. Today I would like to tell you more about both of these features.

Increasing Instance Storage Using Amazon EBS EMR uses the local storage of each instance for HDFS (Hadoop Distributed File System) and to store intermediate files when processing data from S3 using EMRFS. You can now use EBS volumes to extend this storage. The EBS volumes are tied to the lifecycle of the associated instances and augment any existing storage on the instance. If you terminate a cluster, any associated EBS volumes are also deleted along with it.

You will benefit from the ability to customize the storage of your EMR instances if…

Your processing requirements demand a larger amount of HDFS (or local) storage than what is available by default on an instance. With support for EBS volumes, you will be able to customize the storage capacity on an instance relative to the compute capacity that the instance provides. Optimizing the storage on an instance will allow you to save costs.

You want to take advantage of the latest generation EC2 instance types such as the M4, C4, and R3 and need more storage than is available on these instance types. You can now add EBS volumes to customize the storage in order to better meet your needs. If you’re using the older M1 and M2 instances, you should be able to reduce costs and improve performance by moving to newer M4, C4 and R3 instances. We recommend that you benchmark your application to measure the impact on your specific workloads.

It’s important to note that the EBS volumes added to an Amazon EMR cluster do not persist data after the cluster is shutdown. EMR will automatically clean up the volumes when you terminate your cluster.

Adding EBS Volumes to a Cluster EMR currently groups the nodes in your cluster into 3 logical instance groups: a Master Group, which runs the YARN Resource Manager and the HDFS Name Node Service; a Core Group, which runs the HDFS DataNode Daemon and the YARN Node Manager Service; and Task Groups, which run the YARN Node Manager Service. EMR supports up to 50 instance groups per cluster and allows you to select an instance type for each group. You can now specify the amount of EBS storage you want to add to each instance in a given instance group. You can specify multiple EBS volumes, add EBS volumes to instances with instance storage, or even combine different volumes of different types. Here is how you specify your storage configuration in the EMR Console:

For example, if you configured a Core Group to use the m4.2xlarge instance, attached a pair of 1 TB gp2 (General Purpose SSD) volumes and want 10 instances in the group, the Core group would have 10 instances with a total of 20 volumes. Here’s how you would set that up:

To learn more, read the EBS FAQ. Support for EBS is available starting with AMI 3.10 and EMR Release 4.0.

EBS Volume Performance Characteristics Amazon EMR allows you to use several different EBS volume types: General Purpose SSD (GP2), Magnetic, and Provisioned IOPS (SSD). You can choose different types of volumes depending upon the nature of your job. Our internal testing suggests that the General Purpose SSD volumes should suffice for most of the workloads, however we recommend that you test against your own workload. One thing to note is that the General Purpose SSD volumes provide a baseline performance of 3 IOPS/GiB (up to 10,000 IOPS) with the ability to burst to 3,000 IOPS for volumes under 1,000 GiB. Please see I/O Credits and Burst Performance for more details. Here is a comparison of the volumes types:

General Purpose (SSD)

Provisioned IOPS

Magnetic

Storage Media

SSD-backed

SSD-backed

Magnetic-backed

Max Volume Size

16 TB

16 TB

1 TB

Max IOPS per Volume

10,000 IOPS

20,000 IOPS

~100 IOPS

Max IOPS Burst Performance

3000 IOPS for volumes <= 1TB

n/a

Hundreds

Max Throughput per Volume

160 MB/second

320 MB/second

40-90 MB/second

Max IOPS per Node (16K)

48,000

48,000

48,000

Max Throughput per Instance

800 MB/second

800 MB/second

800 MB/second

Latency (Random Read)

1-2 ms

1-2 ms

20-40 ms

API Name

gp2

io1

standard

Support for M4 and C4 Instances You can now launch EMR clusters that use M4 and C4 instances in regions where they are available. The M4 instances feature a custom Intel Xeon E5-2676 v3 (Haswell) processor and the C4 instances are based on the Intel Xeon E5-2666 v3 processor. These instances are designed to deliver the highest level of processor performance on EC2. Both types of instances offer Enhanced Networking which delivers up to 4 times the packet rate of instances without Enhanced Networking, while ensuring consistent latency, even when under high network I/O. Both the M4 and C4 instances are EBS-Optimized by default, with additional, dedicated network capacity for I/O operations. The instances support 64-bit HVM AMIs and can be launched only within a VPC.

Please see the Amazon EMR Pricing page for more details on the prices for these instances.

Productivity Tip You can generate a create-cluster command that represents the configuration of an existing EMR 4.x cluster, including the EBS volumes. This will allow you to recreate the cluster using the AWS Command Line Interface (CLI).

Available Now These new features are available now and you can start using them today!