Rent vs. Buy (or EC2 vs. building your own iron)

Over the past months Jeff Atwood (of Coding Horror fame) has been chronicling Stack Overflow’s quest for new hardware, starting with “Server Hosting – Rent vs. Buy?” and ending with some glamour shots. I’ve recently (along with others) built a setup for a .NET website in the same “too big for shared or low-end VPS hosting and (much) too small to have dedicated sysadmin staff” segment. We ended up going for Amazon EC2, so I thought I’d share our reasoning by comparing with the Stack Overflow setup.

UPDATE1: Atwood just gave another reason as to why EC2 may be attractive.

First some notes on pricing: Mr. Atwood’s three servers cost him a total of $6,000, on top of which come rack space rent, bandwidth and licenses (where he gets off very cheaply by taking advantage of Microsoft’s BizSpark program). We rent two large EC2 instances, one of them with a SQL Server Standard license, for $1.60 per hour, giving a total of about $14,000 per year (on top of which come bandwidth and Elastic Block Store usage). Mr. Atwood could buy all his gear (minus rack space) more than two times over for that money. And except for one important parameter, which I shall expand on later, his machines are much faster: the database server has eight cores and 24GB of memory, while the web servers have four cores and 8GB of memory. Our EC2 instances have to get by with just two cores and 7.5GB. An interesting aside is that exactly half of the $1.60 goes to licenses (compared with getting non-Windows large instances), most of it for SQL Server Standard.
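For anyone who wants to check the arithmetic, here is a minimal Python sketch of the comparison. It uses only the figures quoted above ($1.60 per hour, half of it licenses, $6,000 for the hardware). The function and variable names are made up for illustration, and the result ignores bandwidth, EBS usage and rack space, just as the paragraph does.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def yearly_rental_cost(hourly_rate_usd: float) -> float:
    """Cost of renting around the clock for one year at the given hourly rate."""
    return hourly_rate_usd * HOURS_PER_YEAR


# Figures taken from the text above.
ec2_yearly = yearly_rental_cost(1.60)      # both large instances, licenses included
license_yearly = yearly_rental_cost(0.80)  # the half of the rate that goes to licenses
hardware_purchase = 6000.0                 # one-off price of the three servers

# How many times the hardware could be bought for one year of EC2 rent.
purchases_per_year = ec2_yearly / hardware_purchase

print(f"EC2 per year:        ${ec2_yearly:,.0f}")
print(f"  of which licenses: ${license_yearly:,.0f}")
print(f"Hardware buy-outs per year of rent: {purchases_per_year:.2f}")
```

The yearly figure comes out slightly above the rounded $14,000, which is where the “more than two times over” comes from.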

Several commenters had some beefs with the disks in the new Stack Overflow database server and I agree they look rather dinky. The server has six 7200 RPM SATA drives in RAID1 and RAID10 arrays for OS/logs and data files respectively. While the drives are “Enterprise” branded, I’d hazard a guess that they are pretty much the same as desktop ones, except for a slightly higher MTBF promise and a better warranty from the manufacturer. At any rate, 7200 RPM drives can only sustain about 125 random IOs per second, and because of the RAIDing, the IO rate of the array is not six times that. On EC2 we have access to formidable Elastic Block Store volumes, which are capable of sustaining upwards of 1000 IOPS. Should we need more oomph, the volumes can be soft-raided together until the 1 Gbps link from the EC2 instance to the EBS volumes runs out of steam. (For completeness, I should note that the sequential IO performance of EBS volumes is not very good. That is irrelevant for most database workloads, however, since users generally don’t have the good manners to request data in the order it is placed on disk.) Mr. Atwood mentions that query execution time decreases nicely with CPU speed. This is obviously an important parameter when building a responsive web site, but I’d venture that query throughput is mostly related to disk performance and that we would have an edge here.
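To put rough numbers on the disk comparison, here is a back-of-the-envelope sketch in Python. It assumes the six drives are split into a two-drive RAID1 and a four-drive RAID10 (the only split that fits both arrays), and it uses the common rule of thumb that a RAID10 array can spread random reads over all its drives while every write has to land on both halves of a mirror. The 70% read share is a made-up workload mix, and real arrays with caching controllers will behave differently, so treat the output as an estimate and not a benchmark.

```python
DRIVE_IOPS = 125        # random IOs per second for one 7200 RPM drive (from the text)
EBS_VOLUME_IOPS = 1000  # "upwards of 1000 IOPS" per EBS volume (from the text)


def raid10_iops(drives: int, per_drive_iops: float, read_fraction: float) -> float:
    """Rule-of-thumb random IOPS for a RAID10 array at a given read/write mix."""
    if drives < 4 or drives % 2 != 0:
        raise ValueError("RAID10 needs an even number of drives, at least four")
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    read_capacity = drives * per_drive_iops          # any drive can serve a read
    write_capacity = (drives // 2) * per_drive_iops  # each write hits both mirrors
    # Blend the two capacities by the time spent on each kind of request.
    time_per_io = read_fraction / read_capacity + (1.0 - read_fraction) / write_capacity
    return 1.0 / time_per_io


# Assumed layout: four of the six drives form the RAID10 data array.
reads_only = raid10_iops(4, DRIVE_IOPS, read_fraction=1.0)
writes_only = raid10_iops(4, DRIVE_IOPS, read_fraction=0.0)
mixed = raid10_iops(4, DRIVE_IOPS, read_fraction=0.7)  # hypothetical 70/30 mix

print(f"RAID10 reads only:  {reads_only:.0f} IOPS")
print(f"RAID10 writes only: {writes_only:.0f} IOPS")
print(f"RAID10 70/30 mix:   {mixed:.0f} IOPS")
print(f"Single EBS volume:  {EBS_VOLUME_IOPS} IOPS or more")
```

Under these assumptions the data array lands somewhere between 250 and 500 random IOPS depending on the mix, which is the gap to a single EBS volume that the paragraph above is getting at.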

Another potential problem is the reliability of the drives, the longer warranty period notwithstanding. Let’s assume for a second (and I admit this is a pretty improbable scenario) that one of the drives in the Stack Overflow RAID10 array (holding SQL Server data files) copped it and went to the great disk array in the sky. Mr. Atwood would probably get a notification of this, and immediately initiate a backup operation to the good array (also holding OS and logs). Let’s also assume that at that very moment, the God of the datacenter decides to invoke Murphy’s law on the other disk in the mirror set, killing it and taking the array and the database with it. Stack Overflow stops flowing, blog posts are written (I shall magnanimously refrain), F# buffs recurse indefinitely trying to post a question about why Stack Overflow is down but finding that Stack Overflow is down. Reddit and Slashdot are notified, further swamping the exasperated web servers, unable as they are to get anything out of the database. Mr. Atwood, in the meantime, is cheering on SQL Server Management Studio to restore the latest backup as quickly as possible to the still-good array. He manages to bring the site up within a quarter of an hour, minus all activity since the last backup and running at a somewhat slower clip than usual. Having wiped the sweat from his brow, he still has to drive to the datacenter and swap the two bad drives (unless he trusts the datacenter dudes to do so), getting the usual datacenter tinnitus and a sniffle in the process.

If, on the other hand, the EBS volume holding our database were to die (an even more unlikely event), we would merely create a new volume, attach it to our database instance and restore from backup (conveniently located in nearby S3). Reaction time and data loss would be similar, but performance would not be degraded for any period. Also, I wouldn’t have to plod out to some datacenter and fuss around with a server. Instead I could concentrate on adding new features to the site.

Some people stress the “Elastic” part of EC2, claiming that it is mostly relevant if your hardware requirements are extremely variable or you expect them to increase very rapidly. I think the flexibility it affords is relevant in more modest scenarios too though. Some examples: Need more IPs? Click of a button. Need a test server to try out a new version of your site? Click of a button. Need to increase the size of your database drive? Grab a snapshot and use it to create a bigger volume. Plus all the other features such as a CDN, secure backup in S3 and redundant datacenters that Amazon offers without large upfront costs.

EC2 is no panacea for sure and I agree with Mr. Atwood that poring over specs and reviews and putting together your own gear on the cheap is extremely rewarding. If you value your time and need flexibility though, it might be worth it to limit yourself to building desktop systems and use something like EC2 for hosting.

Well, first things first: You don’t ever host on EC2, you compute on it. The bandwidth costs for EC2 are prohibitive in that regard when compared to Slicehost, Linode, etc.

Second things second: If Atwood wasn’t so afraid of the command line, he’d be saving himself a shitload of money. There is no conceivable advantage of the .NET platform for Stack Overflow beyond Atwood’s comfort with those particular development tools.

Third things third: In this day and age, the cost savings of colocating metal versus spreading it out over a semi-cloud solution (Slicehost, Linode, etc.) are negligible at best. Factor in maintenance (time and hardware) and it’s even less compelling.