So as I stated in the intro to my new untitled PHP app project, I’ve been doing a lot of research regarding scalable hosting solutions. Rhymes With Milk is currently hosted on Linode, which I love, and which is perfect for this piddly little WordPress site. But even Linode’s high-end dedicated servers wouldn’t be able to handle the kind of load a large project might require.

As an example, at my job we have several dedicated servers (not at Linode, btw) with really great performance specs. Just one of them is enough to host over 100 of our clients’ websites. One year we helped create an iPhone app for the Tour de France. The way it worked, I think, was that the Slipstream sag wagon carried an on-board GPS that would ping our server periodically to update an XML file tracking its location throughout the race. That worked just fine, until about a million requests per minute (that’s a speculative number — I actually don’t know the exact figures) were coming into the server for that one file as users tried to get updates on the location. The file itself was being requested directly by the app, so negligible processing on the server end went into fetching it. But Apache just couldn’t keep up with the number of connections coming in, so our server kept giving up. It pooped out.

This was where Amazon’s AWS really saved us. We were able to create an Amazon S3 bucket, drop that one XML file in there, and Amazon easily handled the request load.

So what is Amazon AWS?

Well, for starters it stands for Amazon Web Services. It’s taken a lot of research to answer that simple question beyond figuring out what it stands for. Amazon throws around a lot of acronyms, and has a pretty atypical pricing plan compared to more traditional hosts. So here’s a breakdown of their two main services as I think I understand them. I’m certainly no expert; I’m just a guy who’s done a lot of reading. I haven’t even set up an account and explored these services myself yet, so this is all what I’ve learned from the outside.

Amazon EC2

Let’s start with EC2, since that’s most like traditional hosting. EC2 is Amazon’s version of VPS hosting. You set up an account and select what size of virtual server you’d like (how much RAM, bandwidth, storage, how many cores, etc. you need). Just like many other VPS hosts (e.g. Linode), you can totally customize your server by selecting which flavor of Linux you’d like (you can also set up Windows servers) and where you want it to be hosted (e.g. east coast, Texas, etc.). Apparently when you go to select your Linux distribution, Amazon also lets you pick popular packages to start with — like whether it should have LAMP pre-installed, or a mail server included. There is also a community-driven list of packages. So if somebody decided it would be useful to have a server come pre-tuned for video delivery, or with Subversion installed, they could add that package to Amazon’s list. You select one of these, and Amazon creates your VPS with the package you selected.

Now here’s where things get a little cloud-computy. I keep using words like “VPS” and “packages” whereas Amazon uses the words “instances” and “AMIs”. When you create a new VPS, you’re actually creating a new EC2 instance, and those packages are what Amazon calls AMIs (Amazon Machine Images).

You can create a new instance whenever you’d like using one of Amazon’s preconfigured AMIs, or one from the community marketplace like I mentioned above. It’s automatically assigned a new DNS entry, so you can point your domain to it and start delivering content. Similarly, you can turn off an instance whenever you want. This isn’t used much in a single-server setup (we’ll get to multi-instance cases in a minute), but theoretically it’s possible. You can take a snapshot (I think Amazon has a special word/acronym/tool for these, but I don’t know what they are) of your existing setup, back it up, and turn off your server. This does destroy all of your data, so naturally this isn’t something you’d be doing often in a production environment.

I don’t understand multi-instance situations as well, but I think the point is to spawn new instances when the server is under strain. If you know your site was just Dugg or something, you can spawn a new instance and it will share the workload with the first (I think). In a more traditional setup, you might have a server just for routing traffic, and behind it multiple duplicate copies of the same server, each responding to only a fraction of the requests. Each instance is one of those duplicate servers. There are even services dedicated to just monitoring your server’s performance that automatically spawn new instances when they notice your machine needs help (I think Amazon’s CloudWatch is one name I’ve heard before, but that’s something for another post).
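To make that routing idea a little more concrete, here’s a tiny sketch of the round-robin part in Python. The hostnames are made up, and a real load balancer does far more than this; it’s just the “deal requests out to duplicate servers” idea in code:

```python
from itertools import cycle

# Hypothetical pool of duplicate app servers (these hostnames are made up).
backends = cycle(["app-1.example.com", "app-2.example.com", "app-3.example.com"])

def route(path):
    """Hand each incoming request to the next server in the rotation."""
    return f"http://{next(backends)}{path}"

# Six requests get dealt evenly across the three duplicates.
urls = [route("/race.xml") for _ in range(6)]
```

Spawning a new instance, in this picture, just means adding one more hostname to the pool.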

These multi-instance situations make it a little more clear why Amazon charges for these services hourly, not monthly. You might need your second instance for only one hour, so you get charged for two instances for that hour.
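Here’s a toy version of that hourly math (the rate is invented for illustration, not Amazon’s actual pricing):

```python
# Hypothetical rate in cents per instance-hour (NOT Amazon's real pricing).
rate_cents = 10

base_hours = 24 * 30   # one instance running around the clock for a 30-day month
surge_hours = 1        # plus a second instance for a single busy hour

total_cents = rate_cents * (base_hours + surge_hours)
# You're billed for 721 instance-hours, not for two month-long servers.
```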

By the way, I think there are ways you can change your instance’s specs on the fly instead of spawning a whole new one. For example, if you realize you need more RAM all the time, but not necessarily double the processing power, I think you can increase just that on your base instance. I’m not confident about that point, though.

Amazon S3

So what if you have a website with a lot of static resources? Things like images, videos, anything that might take up a lot of space or get requested a whole lot? That’s where S3 (Simple Storage Service) comes in. It’s where you can host all of those resources without putting the strain of storing and delivering them on your EC2 instance. Remember that XML file I talked about in the beginning? The one getting requested so often it crashed our large dedicated servers? Yeah, S3 was the solution for that. We dropped it into a bucket, and only paid a minimal fee for the number of requests it got. Storage and bandwidth are both really cheap, and you only pay for what you use. I’m about 90% sure web giants like Pinterest, Netflix, and Tumblr all use S3 to store and deliver their goods.
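If you’re wondering what “dropping it into a bucket” actually buys you: the file becomes reachable at a predictable S3 URL, so the app can fetch it straight from Amazon instead of from your server. A quick sketch (the bucket and file names here are invented):

```python
def s3_url(bucket, key):
    """Build the public URL for an object in an S3 bucket
    (the bucket-as-subdomain style Amazon uses)."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"

# The app polls this URL directly, so our own servers never see the traffic.
url = s3_url("tour-tracker", "rider-location.xml")
```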

—

Alright, that was a lot of info. I’m not sure how to conclude after that. I’ve got a lot more learning to do, and I’m planning on actually setting up an AWS account soon to do that learning. I’ll let you know what I find.

Lastly, on a somewhat unrelated note, I’ve been very curious about where large-scale web apps live. Tools like WhoIsHostingThis.com and Netcraft.com have come in handy. A few interesting finds: