A man had a dream. His dream was to blend a bunch of RSS/Atom/RDF feeds into a single feed. The man is Beau Lebens of Feedville, and like most dreamers he was a little short on coin. So he took refuge in the home of a cheap hosting provider and Beau realized his dream, creating FEEDblendr. But FEEDblendr chewed up so much CPU creating blended feeds that the cheap hosting provider ordered Beau to find another home. Where was Beau to go? He eventually found a new home in the virtual machine room of Amazon's EC2. This is the story of how Beau was finally able to create his one feed, safe within the cradle of affordable CPU cycles.

Site: http://feedblendr.com/

The Platform

EC2 (Fedora Core 6 Lite distro)

S3

Apache

PHP

MySQL

DynDNS (for round robin DNS)

The Stats

Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr.

FEEDblendr uses 2 EC2 instances. The same Amazon Machine Image (AMI) is used for both instances.

Over 10,000 blends have been created, containing over 45,000 source feeds.

Approx 30 blends created per day. Processors on the 2 instances are actually pegged pretty high (load averages at ~ 10 - 20 most of the time).

The Architecture

Round robin DNS is used to load balance between instances. The DNS is updated by hand: an instance is validated to be working correctly before the DNS is updated. Instances seem to be more stable now than they were in the past, but you must still assume they can be lost at any time and no data will be persisted between reboots.
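On the client side, round-robin DNS works because the hostname resolves to multiple A records (one per instance) and resolvers rotate among them. A minimal Python sketch of that behavior, resolving all addresses for a name and picking one at random (the hostname here is illustrative):

```python
import random
import socket

def pick_backend(hostname, port=80):
    """Resolve all IPv4 A records for a name and pick one at random,
    approximating what round-robin DNS does from the client's view."""
    addrs = {info[4][0]  # sockaddr is (ip, port) for AF_INET
             for info in socket.getaddrinfo(hostname, port, socket.AF_INET)}
    return random.choice(sorted(addrs))
```

With two instances published under one name, each new resolution has an even chance of landing on either machine, which is exactly the (coarse) balancing described above.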

The database is still hosted on an external service because EC2 does not have a decent persistent storage system.

The AMI is kept as minimal as possible. It is a clean instance with some auto-deployment code to load the application off of S3. This means you don't have to create new instances for every software release.

The deployment process is:

- Software is developed on a laptop and stored in Subversion.
- A makefile is used to get a revision, fix permissions, etc., then package and push to S3.
- When the AMI launches it runs a script to grab the software package from S3.
- The package is unpacked and a specific script inside is executed to continue the installation process.
- Configuration files for Apache, PHP, etc. are updated.
- Server-specific permissions, symlinks, etc. are fixed up.
- Apache is restarted and an email is sent with the IP of that machine. Then the DNS is updated by hand with the new IP address.
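The boot-time half of that process can be sketched as follows. This is a hypothetical reconstruction, not Beau's actual script: the bucket name, archive name, and `install.sh` entry point are all invented for illustration, and the fetch assumes a publicly readable S3 object.

```python
import subprocess
import tarfile
import urllib.request

S3_BASE = "https://s3.amazonaws.com"

def package_url(bucket, key):
    """Public S3 URL for a packaged release (bucket/key are assumptions)."""
    return f"{S3_BASE}/{bucket}/{key}"

def deploy(bucket, key, dest="/var/www"):
    """Fetch the release from S3, unpack it, run its installer, restart Apache."""
    archive = "/tmp/release.tar.gz"
    urllib.request.urlretrieve(package_url(bucket, key), archive)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)                        # unpack the release
    subprocess.check_call([f"{dest}/install.sh"])   # package-specific installer
    subprocess.check_call(["apachectl", "restart"])
```

The key design point is that the AMI itself stays generic; all release-specific logic lives inside the package pulled from S3.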

Feeds are intelligently cached independently on each instance. This is to reduce the costly polling for feeds as much as possible. S3 was tried as a common feed cache for both instances, but it was too slow. Perhaps feeds could be written to each instance so they would be cached on each machine?
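A per-instance feed cache like the one described can be as simple as an in-memory map with a time-to-live. This is a minimal sketch, assuming a 15-minute TTL (the actual caching policy isn't stated in the article):

```python
import time

class FeedCache:
    """Minimal per-instance feed cache with a TTL (a sketch, not the real code)."""

    def __init__(self, ttl=900):          # 15-minute TTL is an assumed value
        self.ttl = ttl
        self.store = {}                   # url -> (fetched_at, body)

    def get(self, url):
        """Return the cached body if it is still fresh, else None."""
        hit = self.store.get(url)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]
        return None

    def put(self, url, body):
        self.store[url] = (time.time(), body)
```

Each instance keeps its own cache, so a feed may be polled once per instance per TTL window rather than once per blend request.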

Lessons Learned

A low budget startup can effectively bootstrap using EC2 and S3.

For the budget conscious the free ZoneEdit service might work just as well as the $50/year DynDNS service (which works fine).

Round robin load balancing is slow and unreliable. Even with a short TTL for the DNS, some systems hold on to the IP addresses for a long time, so new machines are not load balanced to.

Many problems exist with RSS implementations that keep feeds from being effectively blended. A lot of CPU is spent reading and blending feeds unnecessarily because there's no reliable cross-implementation way to tell whether a feed has really changed.
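One way to work around unreliable change signals is to fingerprint the feed's actual items and only re-blend when the fingerprint differs from the cached one. A sketch of the idea (the entry representation as `(guid, title)` pairs is an assumption for illustration):

```python
import hashlib

def feed_fingerprint(entries):
    """Stable fingerprint of a feed's items, ignoring volatile headers
    like lastBuildDate that change even when content doesn't."""
    h = hashlib.sha1()
    for guid, title in sorted(entries):   # sort so item order doesn't matter
        h.update(guid.encode("utf-8"))
        h.update(title.encode("utf-8"))
    return h.hexdigest()

def has_changed(entries, cached_fingerprint):
    return feed_fingerprint(entries) != cached_fingerprint
```

This trades one cheap hash pass per poll for skipping the expensive blend step when nothing actually changed.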

It's really a big mindset change to consider that your instances can go away at any time. You have to change your architecture and design to live with this fact. But once you internalize this model, most problems can be solved.

EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be.

Use the AMI's ability to be passed a parameter to select which configuration to load from S3. This allows you to test different configurations without moving/deleting the current active one.
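The parameter mentioned here is the EC2 user-data string passed at launch, which the instance can read from the metadata endpoint. A sketch of mapping that parameter to an S3 configuration key (the key naming scheme and "production" default are invented for illustration):

```python
import urllib.request

# EC2's instance metadata service exposes launch parameters at this URL.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"

def config_key(user_data):
    """Map the launch parameter to an S3 config key (naming is an assumption)."""
    name = user_data.strip() or "production"   # default when no parameter given
    return f"configs/{name}.tar.gz"

def fetch_config_key():
    """Read the launch parameter from the metadata service on the instance."""
    with urllib.request.urlopen(USER_DATA_URL, timeout=2) as resp:
        return config_key(resp.read().decode("utf-8"))
```

Launching a test instance with user-data `staging` would then pull `configs/staging.tar.gz` without touching the active production configuration.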

Create an automated test system to validate an instance as it boots. Then automatically update the DNS if the tests pass. This makes it easy to create new instances and takes the slow human out of the loop.
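A minimal version of that validation step could hit the new instance directly by IP and check that the application responds before the IP is added to DNS. The pass criteria here (HTTP 200 plus the app name in the page) are assumptions, not FEEDblendr's actual checks:

```python
import urllib.request

def looks_healthy(status, body):
    """Pass criteria are assumptions: HTTP 200 and the app name in the page."""
    return status == 200 and "FEEDblendr" in body

def check_instance(ip, timeout=5):
    """Hit a freshly booted instance by IP before adding it to DNS."""
    try:
        with urllib.request.urlopen(f"http://{ip}/", timeout=timeout) as resp:
            return looks_healthy(resp.status, resp.read().decode(errors="replace"))
    except OSError:
        return False   # unreachable or timed out: fail validation
```

Only after `check_instance` returns True would the script call out to the DNS provider's update API, replacing the manual email-then-edit step described earlier.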

Always load software from S3. The last thing you want is for an instance to boot, fail for some reason to contact your SVN server, and thus fail to load properly. Putting the software in S3 virtually eliminates the chance of this occurring, because it's on the same network.

Reader Comments (6)

I might be missing something, but I don't see how this is an interesting example of "using EC2 to scale".

There appears to be no difference between using EC2 in the way Beau is using it and setting up two leased servers from a normal provider. In fact, getting leased servers might be better, since the cost might be lower (an EC2 instance costs $72/month + bandwidth) and the database would be on the same network.

Beau does not appear to be doing anything that takes advantage of EC2, such as dynamically creating and discarding instances based on demand.

Am I missing something here? Is this an interesting use of using EC2 to scale?

> I might be missing something, but I don't see how this is an interesting example of "using EC2 to scale".

I admit to being a bit polymorphously perverse with respect to finding things interesting, but from Beau's position, which many people are in, the drama is thrilling. The story starts with a conflict: how to implement this idea? The first option is the traditional cheap host, and for a long time that would have been the end of the story. Dedicated servers with higher end CPUs, RAM, and persistent storage are still not cheap. So if you aren't making money, that would have been where the story ended. Scaling by adding more and more dedicated servers would be impossible. Hopefully the new grid model will allow a lot of people to keep writing their stories.

His learning curve in creating the system is what was most interesting: figuring out how to set things up, load balance, load the software, test it, regular nuts-and-bolts development stuff. And that puts him in the position of being able to get more CPU immediately when and if the time comes. He'll be able to add that capacity quickly because he's already done the groundwork. But for now it's running fine.

The spanner in the plan was the database, and that points out the fatal flaw of EC2: the database. The plan would look a bit more successful if that part had worked out better, but it didn't, which is also interesting.

@Todd, thanks for the write-up, and a couple quick corrections/clarifications:

- "Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr." - Just to be clear, the learning curve was mostly in dealing with EC2 and how it works, not so much FeedBlendr, which at its core is relatively simple.

- "no data will be persisted between reboots" this is not exactly true. Rebooting will persist data, but a true "crash" or termination of your instance will discard everything.

- "The database is still hosted on an external service because EC2 does not have a decent persistent storage system" - more the case here is that I didn't want to have to deal with (or pay for) setting something up to cater to them not having persistent storage. It is being done by other people, and can be done, it just seemed like overkill for what I was doing.

- "EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be" - to be clear, EC2 has no inherent load balancing, so it's up to you (the developer/admin) to provide it yourself somehow. There are a number of different ways of doing it, but I choose dynamic DNS because it was something I was familiar with.

@Greg in response to your question - I suppose the point here is that even though FeedBlendr isn't currently a poster-child for scaling, that's also kind of the point. As Todd says, this is about the learning curve and trials and tribulations of getting to a point where it can scale. There is nothing stopping me (other than budget!) from launching an additional 5 instances right now and adding them into DNS, and then I've suddenly scaled. From there I can kill some instances off and scale back. This is all about getting to the point where I even have that option, and how it was done on EC2 in particular.

Nice article and an inspiring story. It's nice to read about the little guy building something that has the capability to scale.

If I could offer a point of feedback though, caching data independently on each machine will not scale as the application develops. That's going to cause problems. What about running an EC2 instance as a dedicated cache? It's not persistent; if it fails then you'll have to rebuild the cache. But assuming it's a simple storage mechanism, it should be pretty consistent. They have pretty generous storage allowances, I think.

Either way, the idea of being able to turn on another 3 instances in a few minutes is definitely nice, particularly if you get "slashdotted" / "dugg" / whatever. It'd be especially sweet if the instances could detect their own high load and launch new instances automatically.