While the 100 hours or so I have spent labbing and interacting with AWS is certainly not 10,000, it has given me some valuable insights on both how absolutely AWSome (sorry – had to be done!) the platform is, as well as experiencing a few eye openers which I felt were worth sharing.

It would be very easy for me to extoll the virtues of AWS, but I don’t think there would be much benefit to that. Everyone knows it is a great platform (but maybe I’ll do it later anyway)! In the meantime, I thought it would be worthwhile taking a bit more of a “warts and all” view of a few features. Hopefully, this will avoid others stepping into the potential traps which have come up directly or indirectly through my recent training materials, as well as being a memory aid to myself!

The key thing is with all of these “gotchas”, they are not irreparable, and can generally be worked around by tweaking your infrastructure design. In addition, with the rate that AWS develop and update features on their platforms, it is likely that many of them will improve over the coming months / years anyway.

The general feeling around many of these “features” is that AWS are indirectly and gently encouraging you to avoid building your solutions on EC2 and other IaaS services, Instead, pushing you more towards using their more managed services such as RDS, Lambda, Elastic Beanstalk etc.

This did originally start off as a single “Top 10” post but realised quickly that there are a lot more than 10 items and some of them are pretty deep dive! As such, I have split the content into easily consumable chunks, with a few lightweight ones to get us started… keep your eyes open for a few whoppers later in the series!

AWS Tips and Gotchas – Part 1

Storage for any single instance may not exceed 20,000 IOPS and 320MB/sec per EBS volume. This is really only something which will impact very significant workloads. The current “recommended” workaround for this is to do some pretty scary things such as in-guest RAID / striping!
Doing this with RAID0 means you then immediately risk loss of the entire datastore if a single EBS volume in the set goes offline for even a few seconds. Alternatively, you can buy twice as much storage and waste compute resources doing RAID calculations. In addition, you then have to do some really kludgy things to get consistent snapshots from your volume, such as taking your service offline. In reality, only the most extreme workloads hit this kind of scale up. The real answer (which is probably better in the long term) is to refactor your application or database for scale-out, a far more cloudy design.

The internet gateway service does not provide a native method for capping of outbound bandwidth. It doesn’t take a genius to work out that when outbound bandwidth is chargeable, you could walk away with a pretty significant bandwidth bill should something decide to attack your platform with a high volume of traffic. One potential method to work around this would be to use NAT instances. You can then control the bandwidth using 3rd party software in the NAT instance OS.

There is no SLA for EC2 instances unless you run them across multiple Availability Zones. Of course with typical RTTs of a few milliseconds at most, there is very little reason not to stretch your solutions across multiple AZs. The only time you might keep in one AZ is if you have highly latency sensitive applications, or potentially the type of app which requires a serialised string of DB queries to generate a response to the end user.
In a way I actually quite like this SLA requirement as it pushes customers who might otherwise have accepted the risk of a single DC, into designing something more robust and accepting the (often minor) additional costs. With the use of Auto Scaling and Elastic Load Balancing there is often no reason you can’t have a very highly available application split across two or more AZs, whilst using roughly the same number of servers as a single site solution.

For example the following solution would be resilient to a single AZ failure, whilst using no more infrastructure than a typical resilient on-premises single site solution:No DR replication required, no crazy metro clustering setup, nothing; just a cost effective, scalable, highly resilient and simple setup capable of withstanding the loss of an entire data centre (though not a region, obviously).