Amazon AWS Tips and Gotchas – Part 7 – AWS EMR, Spot Instances & PGs

Continuing in this series of blog posts taking a bit of a “warts and all” view of a few Amazon AWS features, below are a handful more tips and gotchas when designing and implementing solutions on Amazon AWS, including EMR, Spot Instances and Placement Groups.

AWS Tips and Gotchas – Part 7

As detailed in the EMR FAQ, EMR does not support a multi-master configuration: each EMR cluster has only one master node (plus, of course, multiple slave nodes). If that master node goes offline, you lose the cluster and any data being processed at the time. The AWS-recommended workaround is to checkpoint your EMR cluster regularly, so that you can resume from the last checkpoint in the event of a failure.

Spot instances and sticky sessions do not play well together!!! If you use spot instances as a method for providing cheap burst resources, make sure your application is not dependent on sticky sessions.
If it is, you risk losing user sessions when the spot instances are terminated with only two minutes' notice.
There are a couple of ways to mitigate this. The best is simply not to use sticky sessions, and instead to store your session data in another system such as ElastiCache or DynamoDB (or both!).
Alternatively, you could set up a script within the EC2 guest OS to monitor the Spot Instance Termination Notifications (http://169.254.169.254/latest/meta-data/spot/termination-time) and devise a method to cleanly migrate any remaining sessions off your instance and remove it from the load balancer.
NOTE: It is best to avoid terminating your spot instances yourself. AWS will not charge you for the hour in which they terminate your instance, so you can save some budget compared with shutting your own instances down.
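
As a rough illustration of the monitoring script idea, here is a minimal Python sketch that polls the termination-time URL and calls a handler once a notice appears. Only the URL comes from this post. The function names (parse_notice, fetch_notice, wait_for_notice), the five second polling interval and the timestamp format are my own assumptions. It also assumes the metadata endpoint answers plain unauthenticated GET requests; an instance that requires metadata session tokens would need an extra step. What the handler does to drain sessions and deregister from the load balancer is left entirely to you.

```python
import time
import urllib.request
from datetime import datetime
from typing import Callable, Optional

# URL taken from the post; every other name here is illustrative.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def parse_notice(body: str) -> Optional[datetime]:
    """Return the termination time if body is a UTC timestamp, else None."""
    try:
        # Assumed format, e.g. 2015-01-05T18:02:00Z
        return datetime.strptime(body.strip(), "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None


def fetch_notice(url: str = TERMINATION_URL, timeout: float = 2.0) -> Optional[datetime]:
    """Query the metadata endpoint once; None means no notice was found."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_notice(resp.read().decode("utf-8"))
    except OSError:
        # HTTPError and URLError are OSError subclasses, so this covers
        # HTTP 404 (no notice yet), timeouts and an unreachable endpoint.
        return None


def wait_for_notice(
    on_notice: Callable[[datetime], None],
    fetch: Callable[[], Optional[datetime]] = fetch_notice,
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[datetime]:
    """Poll until a termination time appears, then call on_notice with it.

    Returns the termination time, or None if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        when = fetch()
        if when is not None:
            on_notice(when)  # e.g. drain sessions, leave the load balancer
            return when
        polls += 1
        sleep(poll_seconds)
    return None
```

To use it, you would pass your own drain routine, for example wait_for_notice(my_drain_handler), and run it as a background service on each spot instance. The fetch and sleep parameters exist only so the loop can be exercised without a real metadata endpoint.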

Placement groups were designed specifically for high-bandwidth applications that require low-latency, 10Gbps connectivity between instances. If you do not start all instances in a placement group at the same time, you cannot guarantee that they will end up optimally close to each other later. Indeed, as stated in the placement groups KB, “If you try to add more instances to the placement group later, or if you try to launch more than one instance type in the placement group, you increase your chances of getting an insufficient capacity error”.
If you do want to add more instances to your placement group later, the best thing to do is stop and restart all of your instances concurrently.
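
To make the "start everything at the same time" advice concrete, here is a small Python sketch that creates a placement group and requests all of its instances in a single call. It assumes you pass in a boto3 EC2 client (for example boto3.client("ec2")). The function name launch_group_together and all of its arguments are hypothetical placeholders, not anything from the AWS docs. Setting MinCount equal to MaxCount asks EC2 for all of the instances or none, which fits the goal of launching a single instance type together.

```python
from typing import Any, List


def launch_group_together(
    ec2: Any,            # assumed to be a boto3 EC2 client
    group_name: str,     # placeholder placement group name
    image_id: str,       # placeholder AMI ID
    instance_type: str,  # one instance type for the whole group
    count: int,
) -> List[str]:
    """Create a cluster placement group and launch count instances at once."""
    ec2.create_placement_group(GroupName=group_name, Strategy="cluster")
    response = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=count,  # equal to MaxCount: all instances or none
        MaxCount=count,
        Placement={"GroupName": group_name},
    )
    return [instance["InstanceId"] for instance in response["Instances"]]
```

If the request fails with an insufficient capacity error, nothing is launched, so you can retry the whole batch rather than ending up with a partly filled group.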