4.
EC2 at OpenX
• end of 2008
• 100s then 1000s of instances
• one of the largest AWS customers at the time
• NAMING is very important
  • terminated a DB server by mistake
  • in an ideal world, naming doesn't matter

5.
EC2 at OpenX (cont.)
• Failures are very frequent at scale
• Forced to architect for failure and horizontal scaling
• Hard to scale all layers at the same time (scaling the app server layer can overwhelm the DB layer; you play whack-a-mole)
• Elasticity: easier to scale out than to scale back

7.
EC2 at OpenX (cont.)
• Hard to scale at the DB layer (MySQL)
  • mysql-proxy for r/w split
  • slaves behind HAProxy for reads
• HAProxy for LB, then ELB
  • ELB melted initially, had to be gradually warmed up
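The slave-pool setup above can be sketched as an HAProxy config fragment; hostnames, ports, and the check user are hypothetical placeholders, not the actual OpenX config:

```
# haproxy.cfg fragment -- illustrative only; hosts/ports are made up
listen mysql-read
    bind 0.0.0.0:3307
    mode tcp
    balance roundrobin
    option mysql-check user haproxy_check
    server slave1 10.0.0.11:3306 check
    server slave2 10.0.0.12:3306 check
```

Read traffic hits port 3307 and is spread round-robin across healthy slaves; `mysql-check` marks a slave down when it stops answering the MySQL handshake.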

8.
EC2 at Evite
• Sharded MySQL at the DB layer; application very write-intensive
• Didn't do proper capacity planning/dark launching; had to move quickly from the data center to EC2 to scale horizontally
• Engaged Percona at the same time
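A common way to route writes in a sharded MySQL layer is deterministic hashing on the entity key. A minimal sketch, assuming hash-based sharding (the shard DSNs and shard count here are invented for illustration, not Evite's actual layout):

```python
# Hash-based shard routing sketch: map a user id to one of N shards.
# DSNs are hypothetical; a real app would hold a connection pool per shard.
import hashlib

SHARDS = [
    "mysql://shard0.internal:3306/app",
    "mysql://shard1.internal:3306/app",
    "mysql://shard2.internal:3306/app",
    "mysql://shard3.internal:3306/app",
]

def shard_for(user_id: int) -> str:
    """Deterministically map a user id to a shard DSN."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the mapping is deterministic, every app server agrees on where a given user lives; the trade-off is that resharding (changing N) requires migrating data.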

10.
EC2 at Evite (cont.)
• EBS apocalypse in April 2011
• Hit us even with masters and slaves in different availability zones (but all in a single region - mistake!)
• IMPORTANT: rebuilding redundancy into your system is HARD
• For DB servers, reloading data on a new server is a lengthy process

11.
EC2 at Evite (cont.)
• General operation: very frequent failures (once a week); a nightmare for pager duty
• Got very good at disaster recovery!
  • Failover of master to slave
  • Rebuilding of slave from master (xtrabackup)
• Local disks striped in RAID0 performed better than EBS
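The slave-rebuild step can be outlined as a runbook; paths and hostnames are placeholders, and the flags assume Percona XtraBackup (per its documentation), not necessarily the exact procedure used at Evite:

```
# On the master: take a hot backup of the running MySQL data dir
xtrabackup --backup --target-dir=/backups/base

# Apply the redo log so the backup is consistent
xtrabackup --prepare --target-dir=/backups/base

# Ship it to the new slave (hostname is a placeholder)
rsync -a /backups/base/ slave1:/var/lib/mysql/

# On the slave: point replication at the master, then
#   CHANGE MASTER TO ...; START SLAVE;
```

The appeal of xtrabackup here is that the master stays online during the backup, which matters when rebuilds are a weekly occurrence.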

14.
EC2 at Evite (cont.)
• Didn't use provisioned IOPS for EBS
• Didn't use VPC
• Great experience with Elastic MapReduce, S3, Route 53 DNS
• Not-so-great experience with DynamoDB
• ELB OK, but still need HAProxy behind it

15.
EC2 at NastyGal
• VPC - really good idea!
  • Extension of data center infrastructure
  • Currently using it for dev/staging + some internal backend production
  • Challenging to set up VPN tunnels to various firewall vendors (Cisco, Fortinet) - not much debugging possible on the VPC side

17.
Proper infrastructure care and feeding
• Monitoring - alerting, logging, graphing
• It's not in production if it's not monitored and graphed
• Monitoring is for ops what testing is for dev
  • Great way to learn a new infrastructure
  • Dev and ops on pager
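The "alert on what you graph" idea reduces to comparing sampled metrics against thresholds. A toy sketch (metric names and thresholds are made up for illustration; real setups would use Nagios/Graphite-style tooling):

```python
# Minimal threshold-check sketch: one sampled metric value -> alert state.
from dataclasses import dataclass

@dataclass
class Check:
    metric: str        # metric name as it appears on the graph
    threshold: float   # value at or above which we page someone

def evaluate(check: Check, value: float) -> str:
    """Return the alert state for a single sampled value."""
    return "CRITICAL" if value >= check.threshold else "OK"

# Hypothetical example check
disk = Check(metric="disk_used_pct", threshold=90.0)
```

The point of the sketch: if a metric is worth graphing, it is usually worth a `Check` too, so the graph and the page come from the same data.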

24.
Is the cloud worth the hype?
• It's a game changer, but it's not magical; try before you buy! (benchmarks can surprise you)
• Cloud expert? Carry a pager or STFU
• Forces you to think about failure recovery, horizontal scalability, automation
• Something to be said for abstracting away the physical network - the most obscure bugs are network-related (ARP caching, routing tables)

25.
So...when should I use the cloud?
• Great for dev/staging/testing
• Great for layers of infrastructure that contain many identical nodes and are forgiving of node failures (web farms, Hadoop nodes, distributed databases)
• Not great for 'snowflake'-type systems
• Not great for RDBMS (esp. write-intensive)

26.
If you still want to use the cloud
• Watch that monthly bill!
• Use multiple cloud vendors
• Design your infrastructure to scale horizontally and to be portable across cloud vendors
  • Shared nothing
  • No SAN, NAS

27.
If you still want to use the cloud
• Don't get locked into vendor-proprietary services
  • EC2, S3, Route 53, EMR are OK
  • Data stores are not OK (DynamoDB)
  • OpsWorks - debatable (based on Chef, but still locks you in)
  • Wrap services in your own RESTful endpoints
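The "wrap services in your own endpoints" advice amounts to coding against a small interface of your own and keeping the vendor behind it. A minimal sketch (the interface and backend names are invented, not a real SDK):

```python
# Application code depends on BlobStore, never on a vendor SDK directly.
# Swapping S3 for another vendor means writing one new backend class.
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Our own storage interface; vendor specifics live behind it."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(BlobStore):
    """Stand-in backend for tests; a hypothetical S3Store would
    implement the same two methods using the vendor's API."""
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}
    def put(self, key: str, data: bytes) -> None:
        self._data[key] = data
    def get(self, key: str) -> bytes:
        return self._data[key]
```

The indirection costs a little code up front but keeps the portability the previous slide asks for: the vendor-specific surface area is one class, not the whole codebase.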

28.
Does EC2 have rivals?
• No (or at least not yet)
• Anybody use GCE?
• Other public clouds are either toys or smaller, with fewer features (no names named)
• Perception matters - not a contender unless featured on the High Scalability blog
• APIs matter less (can use multi-cloud libs)