AWS EMR High Performance Bootstrap Actions

In this post, I describe some EMR bootstrap actions that are especially helpful in keeping Hadoop clusters running well. As a general rule, I use c3.xlarge compute or m3.xlarge spot nodes and have been consistently deploying medium-sized clusters (>30 nodes).

Aggregated logging – To set up more advanced logging, use the configure-hadoop bootstrap action.
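As a minimal sketch of what such a bootstrap action could look like, the Python below builds the arguments for configure-hadoop in the dict shape that boto3's run_job_flow accepts under BootstrapActions. It assumes a Hadoop 2 (YARN) cluster, so it uses the standard YARN log-aggregation properties; the bucket name, the retention value and the helper name log_aggregation_action are placeholders of mine, not settings taken from this post.

```python
# Public S3 location of the configure-hadoop bootstrap action.
CONFIGURE_HADOOP = "s3://elasticmapreduce/bootstrap-actions/configure-hadoop"


def log_aggregation_action(remote_log_dir):
    """Build a bootstrap action that turns on YARN log aggregation.

    remote_log_dir is where aggregated container logs are written,
    for example an S3 URI (the one used below is hypothetical).
    """
    settings = {
        "yarn.log-aggregation-enable": "true",
        # Assumption: -1 disables deletion, so logs are kept indefinitely.
        "yarn.log-aggregation.retain-seconds": "-1",
        "yarn.nodemanager.remote-app-log-dir": remote_log_dir,
    }
    args = []
    for key, value in settings.items():
        # "-y" tells configure-hadoop to write the pair into yarn-site.xml.
        args += ["-y", f"{key}={value}"]
    return {
        "Name": "Enable YARN log aggregation",
        "ScriptBootstrapAction": {"Path": CONFIGURE_HADOOP, "Args": args},
    }


action = log_aggregation_action("s3://my-example-bucket/yarn-logs")
print(action["ScriptBootstrapAction"]["Args"])
```

The returned dict can be placed in the BootstrapActions list when the cluster is launched; adjust the retention value if you do not want to keep logs forever.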

Ensuring correct disk utilization – Since some of the inputs I use are very disk heavy, the first issue I faced was that the cluster would run out of disk. It soon became clear that this was due to incorrect disk utilization: all the output was written to the boot disk and not to the ephemeral storage disks (which had sufficient space, 2x40GB per node). It seems that the mount points were incorrectly set up, so I override them using the configure-hadoop bootstrap action. Note: my settings assume EC2 instances with two ephemeral disks per instance. Please adjust them if you have a different number of ephemeral disks.
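A sketch of such an override is shown below, again as a Python dict in the shape boto3's run_job_flow accepts under BootstrapActions. It relies on EMR mounting instance-store volumes at /mnt, /mnt1, and so on. The choice of properties (Hadoop 2 names), the subdirectories under each mount and the helper names are my assumptions for illustration, so check them against your AMI before relying on them.

```python
# Public S3 location of the configure-hadoop bootstrap action.
CONFIGURE_HADOOP = "s3://elasticmapreduce/bootstrap-actions/configure-hadoop"


def ephemeral_mounts(disk_count):
    """Return the mount points EMR uses for instance-store disks."""
    return ["/mnt" if i == 0 else f"/mnt{i}" for i in range(disk_count)]


def disk_layout_action(disk_count=2):
    """Build a bootstrap action that spreads Hadoop data over all disks."""
    mounts = ephemeral_mounts(disk_count)

    def spread(subdir):
        # One directory per disk, joined into a comma-separated list.
        return ",".join(f"{mount}/{subdir}" for mount in mounts)

    args = [
        # "-h" writes to hdfs-site.xml: where DataNodes store blocks.
        "-h", "dfs.datanode.data.dir=" + spread("var/lib/hadoop/dfs"),
        # "-m" writes to mapred-site.xml: MapReduce scratch space.
        "-m", "mapreduce.cluster.local.dir=" + spread("var/lib/hadoop/mapred"),
        # "-y" writes to yarn-site.xml: NodeManager container scratch space.
        "-y", "yarn.nodemanager.local-dirs=" + spread("var/lib/hadoop/yarn"),
    ]
    return {
        "Name": "Use all ephemeral disks",
        "ScriptBootstrapAction": {"Path": CONFIGURE_HADOOP, "Args": args},
    }


disk_action = disk_layout_action(disk_count=2)
for arg in disk_action["ScriptBootstrapAction"]["Args"]:
    print(arg)
```

For instance types with a different number of ephemeral disks, only the disk_count argument needs to change.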

Out Of Memory and GC errors – Out of Memory and GC Overhead exceeded errors are a common issue. A fairly easy fix is to raise the JVM memory settings. You can either apply them when bootstrapping the cluster (using configure-hadoop) or set them per job using Configuration.set(). In our case, we have set the mapper JVM to allow up to 4G of memory.
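As a rough sketch, the bootstrap variant of a 4G mapper heap could be built as follows, in the dict shape boto3's run_job_flow accepts under BootstrapActions. It assumes the Hadoop 2 property names mapreduce.map.java.opts and mapreduce.map.memory.mb. Sizing the YARN container so that the heap is about 80% of it is a rule of thumb I am assuming here, not a value from our cluster, and the helper name is hypothetical.

```python
# Public S3 location of the configure-hadoop bootstrap action.
CONFIGURE_HADOOP = "s3://elasticmapreduce/bootstrap-actions/configure-hadoop"


def mapper_memory_action(heap_mb=4096, heap_percent=80):
    """Build a bootstrap action that raises the mapper JVM heap.

    The YARN container must be larger than the heap to leave room for
    non-heap memory; heap_percent is the assumed heap share of it.
    """
    container_mb = heap_mb * 100 // heap_percent
    args = [
        # "-m" tells configure-hadoop to write into mapred-site.xml.
        "-m", f"mapreduce.map.java.opts=-Xmx{heap_mb}m",
        "-m", f"mapreduce.map.memory.mb={container_mb}",
    ]
    return {
        "Name": "Raise mapper memory",
        "ScriptBootstrapAction": {"Path": CONFIGURE_HADOOP, "Args": args},
    }


memory_action = mapper_memory_action(heap_mb=4096)
for arg in memory_action["ScriptBootstrapAction"]["Args"]:
    print(arg)
```

The same two key/value pairs can instead be set on the job's Configuration object if you prefer to tune memory per job rather than per cluster. Make sure the container size fits within the memory of the instance type you run on.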