How I'm using Amazon S3 to serve media files

As traffic to chicagocrime.org has steadily increased, I've been looking for ways to tweak the site's performance. The site runs on a rented dedicated server with Apache/mod_python, PostgreSQL and Django. (I'd love to bite the bullet and buy proper servers but haven't done so yet. Donations are welcome!)

One thing that's always bugged me is that chicagocrime.org's Apache instance serves both the dynamic, Django-powered pages and static media files such as the CSS and images. It's inefficient for a single Apache instance to act as both an application server (mod_python) and a media server. A bunch of Apache configuration tweaks can improve performance of one aspect of serving but are somewhat detrimental to the other aspect. For example, using the KeepAlive directive improves Apache's media-serving capabilities, but KeepAlive is detrimental in a server arrangement that mainly churns out dynamic pages. So if a single Apache instance does both media serving and dynamic page creation, you can't optimize for both cases.

(When I worked at LJWorld.com, we had the luxury of separate application, media and database servers, and we have a similar setup where I work now, but I can't afford separate servers for my little side projects.)

The solution hit me the other day: I can use Amazon's new S3 data-storage service to host chicagocrime.org's media files, so my own Apache server can focus on serving dynamic pages. S3 is very cheap -- 15 cents per gigabyte of storage per month (and I have only 936 KB of media files) and 20 cents per gigabyte of bandwidth. That's peanuts.

It was easy to get this working; it took less than an hour total. Here's what I did:

First, I signed up for an Amazon S3 account. Do that by clicking "Sign Up For Web Service" on the main S3 page. When you sign up, you get two codes: an access key ID and secret access key.

Next, I created an S3 "bucket" for my chicagocrime.org media files. An account can have multiple buckets; as far as I can tell, it's just a way of keeping your S3 stuff in separate containers. I did this by using the free S3 Python bindings. Just download the bindings, unzip them and put the file S3.py somewhere on your Python path. To create a bucket named 'mybucketname', do this:
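The call itself is tiny. A minimal sketch, assuming the sample S3.py bindings expose an AWSAuthConnection class as their documentation describes (substitute your own keys):

```python
# Sketch using Amazon's sample S3.py bindings; the class and method
# names here are assumptions based on that library -- check your copy.
import S3

conn = S3.AWSAuthConnection('your-access-key-id', 'your-secret-access-key')
conn.create_bucket('mybucketname')
```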

Next, I wrote a Python script that uploaded my media files to this bucket and made them publicly readable. S3 has a bunch of complex authentication stuff, but all I wanted to do was use S3, essentially, as a Web hosting service. Here's the script I used, and here's how to use it:

$ cd /directory/with/media/files/
$ find | python /path/to/update_s3.py

The script is kind of cool because it uses Python's mimetypes module to determine the content type of each file, in order to pass that to the S3 API. Otherwise it's pretty straightforward.
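For the curious, the heart of such a script can be sketched like this. The guess_content_type helper is the mimetypes trick; the bucket name, key names and the put() call are assumptions modeled on Amazon's sample S3.py bindings, not the script itself:

```python
import mimetypes
import os

def guess_content_type(filename):
    """Guess a Content-Type from the file extension, with a safe fallback."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type or 'application/octet-stream'

def upload_files(filenames, bucket='mybucketname'):
    """Upload each file to S3, marked world-readable. (Hypothetical names;
    the put() call mimics Amazon's sample S3.py bindings.)"""
    filenames = [f for f in filenames if os.path.isfile(f)]
    if not filenames:
        return  # `find` output may contain only directories
    import S3  # Amazon's sample bindings, somewhere on your Python path
    conn = S3.AWSAuthConnection('your-access-key-id', 'your-secret-access-key')
    for name in filenames:
        with open(name, 'rb') as f:
            data = f.read()
        key = name[2:] if name.startswith('./') else name  # strip find's './'
        conn.put(bucket, key, S3.S3Object(data),
                 {'Content-Type': guess_content_type(name),
                  'x-amz-acl': 'public-read'})  # make it publicly readable

# To run it as `find | python update_s3.py`:
#   import sys
#   upload_files(line.strip() for line in sys.stdin)
```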

Finally, it was just a matter of changing my chicagocrime.org templates to point to S3's URLs rather than my own URLs. That was a snap, thanks to Django's template inheritance and includes.
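For example, a base template can carry the media host in one place, so every page that extends it picks up the change automatically (the bucket name and filenames here are hypothetical):

```html
<!-- base.html: every template extends this, so the S3 URL lives here only -->
<link rel="stylesheet" href="http://s3.amazonaws.com/mybucketname/style.css">
<img src="http://s3.amazonaws.com/mybucketname/header-logo.gif" alt="chicagocrime.org">
```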

Now chicagocrime.org's media files are served directly off of S3, at a cost of 35 cents a month, and my Apache is happier.

Comments

I did something sort of similar with my wine blog (tastebudchicago.com), but with Flickr. When I write up a wine, I upload the pictures to Flickr and give the photoset a unique name. When I add the entry to the blog, there's a field for the photoset. It goes out, grabs the URLs, and attaches them to the entry. The blog is super-low traffic, but it felt pretty clever to save a little bit of bandwidth.

This is a great idea! I haven't signed up for an S3 account yet because I don't really need a place to store my files online. From the accounts others have given so far, that's all that I thought it was really capable of. Nobody mentioned that you could get a public URL for your content.

So have you noticed any changes in speed? Is it faster or slower than hosting the content on your own server?

Posted by Steve Holden on April 7, 2006, at 12:43 p.m.:

Wouldn't it be simpler just to run separate virtuals for the graphics and the mod_python services? Each can configure a separate KeepAliveTimeout, giving you the behaviour you want and saving a valuable 35c per month.

Scott: I can't tell any difference in the serving of the media files themselves, which suggests S3 is just as fast as serving them from chicagocrime.org directly.

Steve: That would still require the overhead of separate VirtualHosts, and my server would still be doing more work than necessary -- i.e., serving the media files in the first place, when it could be concentrating only on the dynamic content and database work.

Suhweeet! Thanks Adrian!! I've been shopping for ideas in this regard. I'd also like to hear more about performance and reliability. Just curious, why not run a lighttpd server instead [on your Apache box]? Is there a disadvantage to using lighttpd on the same box as Apache?

Bart: There's no inherent disadvantage in using lighttpd on the same box, but it wouldn't solve the problem of reducing server load. Whether you use Apache or lighttpd for serving files, either way your server is serving files -- and, hence, using extra overhead -- that could be offloaded to S3.

Isn't using Amazon S3 an expensive solution for hosting static content? You could get cheap 1and1 hosting, which offers 100GB of space and 1000GB/month of transfer for only 10 bucks -- usage that would cost about $215/month on Amazon S3.

Phillip, yes, that's true, I agree. But I'd guess that Amazon will (in the future) lower their bandwidth prices, at least for bulk orders. I thought their service was aimed more at enterprise use than end users (is it?).

Jeff, yes, I had problems with their customer support too. That's why their service is cheap, isn't it?

Posted by Graham Carlyle on April 8, 2006, at 12:24 p.m.:

Looks like you made one template change too many. The "crimes by district" page contains an iframe with a src attribute pointing to S3, so the map doesn't show and you get a popup from Google complaining about the API key.

1) You probably should not have Apache answering web requests directly -- slower connections will chew up resources for quite a while. Use a reverse proxy like Pound, Perlbal or Apache's mod_proxy to sit in front of the server actually doing the work. This allows the working Apache to serve the request and go on to the next request fairly quickly. (Use mod_rpaf to get the proxied Apache to log properly.)

2) Use the frontend reverse proxy to route requests by matching parts of the URL or whatever. You can use a much lighter httpd, such as thttpd, to serve static content. The load due to serving the static content should be close to zero, since sendfile() is insanely efficient.
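A sketch of the arrangement this commenter describes -- a frontend Apache proxying dynamic requests to a backend, with static paths excluded from proxying. Ports, paths and module locations here are illustrative only:

```apache
# Frontend httpd.conf fragment (illustrative). Dynamic requests go to the
# backend mod_python Apache on port 8080; /media/ is excluded from proxying.
# Note: the exclusion must come before the catch-all ProxyPass.
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so

ProxyPass        /media/ !
Alias            /media/ /var/www/media/
ProxyPass        /       http://127.0.0.1:8080/
ProxyPassReverse /       http://127.0.0.1:8080/
```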

Posted by Rajan on April 9, 2006, at 4:16 p.m.:

Posted by Brad P on April 9, 2006, at 4:24 p.m.:

So is S3 only a viable option for people who know how to program Python or whatever it is?

Posted by anonymous on April 9, 2006, at 5:16 p.m.:

Your Apache configuration must be fairly screwed up if you aren't willing to let it do what it's the best in the world at doing -- serving static files.

An Apache 2.0 installation using the worker MPM will saturate your bandwidth long before it maxes out your CPU, even if 95% of the CPU is already dedicated to generating dynamic content. Static content is so braindead easy for Apache that a Pentium 100 can fill a 10Mbit pipe without even blinking.

If your static files don't change very often, you can _additionally_ cache all of that 936K of data in RAM using mod_file_cache, which essentially removes the most expensive piece of an HTTP connection: the disk read. Using KeepAlive as a way to try to improve performance is like loading a bigger gun to shoot yourself in the foot. TCP/IP connections are only expensive when you have exceeded the amount of physical memory on your box, and just boosting the KeepAlive timeout doesn't help when people are only viewing one page.
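A sketch of what that might look like in an Apache 2.0 httpd.conf; the file paths and timeout values are illustrative, not a recommendation from this comment:

```apache
# Memory-map the small set of static files at startup (mod_file_cache),
# eliminating disk reads for them entirely.
LoadModule file_cache_module modules/mod_file_cache.so
MMapFile /var/www/media/style.css
MMapFile /var/www/media/logo.gif

# Keep KeepAlive modest rather than boosting it.
KeepAlive On
KeepAliveTimeout 2
MaxKeepAliveRequests 100
```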

This statement -- "So if a single Apache instance does both media serving and dynamic page creation, you can't optimize for both cases." -- is mostly wrong. You can set expire times for specific MIME types, you can use a multithreaded MPM, and you can reduce or remove disk hits with selective caching as you deem necessary. These are just a small number of a much larger pool of hints for both dynamic content served through an application server or builtin Apache module _and_ for static content.

1. worker mpm mode

2. mod_file_cache

3. remove unnecessary shared object modules.

If a site hasn't performed those three steps, it will see at least an order-of-magnitude performance improvement when it does.

I've proposed to Amazon that they add a simple extension to S3 so that it can act as a CDN/cache. See the writeup. Would be way cool. It would automatically extend the scale and reliability of any large-file asset server.

The Apache people always shout that it can easily saturate a standard dedicated network connection before it pins the CPU, but on every single server that I administer (heavily loaded or not), I've seen an improvement in performance and response times from Lighttpd, even when serving dynamic content. Sometimes (one popular blog) those results are extremely dramatic. I recommend it to everyone. The only drawback is the poor documentation for the configuration file.

The only problem, depending on your traffic, is that you might find you violate S3's TOS, as they have restrictions on the rate at which you can make requests to the service.

Note that the 1 call per second restriction does not apply to S3; please refer to the S3 portion of the license.

Posted by Peter Fein on April 13, 2006, at 11:03 a.m.:

After reading your article, we're thinking about using S3 for offsite backup. It looks like it's around 1/10th the cost of dedicated backup providers. We'd be encrypting files with GPG anyway; see 'man split' for working around the 5GB limit.
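The split-and-reassemble round trip looks roughly like this. A tiny demo file stands in for the real GPG-encrypted archive, and the chunk size would really be something like 1024m; all filenames are made up:

```shell
# Pretend this is the GPG-encrypted backup archive.
printf 'pretend this is backup.tar.gpg' > backup.tar.gpg

# Split into chunks small enough for S3's 5GB object limit.
split -b 5 backup.tar.gpg backup.tar.gpg.part-

# ...upload each backup.tar.gpg.part-* object to S3...

# To restore: concatenate the parts (shell glob order matches split's
# suffix order) and check the result against the original.
cat backup.tar.gpg.part-* > restored.tar.gpg
cmp backup.tar.gpg restored.tar.gpg
```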

Posted by Chris Sidi on April 16, 2006, at 11:24 p.m.:

It looks like they're doing virtual hosts at s3.amazonaws.com, so you can give it your own CNAME record. Looks a little nicer, and it may make it easier to move your serving elsewhere someday.
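If that's right, the DNS side would be a single record along these lines -- assuming a bucket named to match the hostname, which is hypothetical here:

```
media.chicagocrime.org.  IN  CNAME  s3.amazonaws.com.
```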

Posted by Bryan Batchelder on April 19, 2006, at 3:27 p.m.:

Very clever, but it sounds like this is a case of premature optimization. At the end of your post you say Apache is "happier" -- what are your metrics? Then later, in a comment, you say that you see no difference in performance. So I'm kind of confused.

Posted by user on May 2, 2006, at 2:57 a.m.:

The reason I think the bandwidth/storage costs are fair is that the service is effectively infinitely scalable at a consistent cost. What happens when you go over 1000GB at 1&1? What if it's a huge rip-off beyond that? Are you going to move everything you have stored over to somewhere new? I'm using this for a very, very small-scale project that could easily turn into a very, very large-scale one. So long as I can earn a decent amount per GB, I see no issues, thanks to its constancy.

Posted by Jon on May 26, 2006, at 1:23 p.m.:

Anonymous was right about pretty much everything except KeepAlive. Even if a person is viewing only one page, that page most likely has several images, scripts, ads, etc., which all constitute separate requests.

Posted by Dave, web developer on August 22, 2006, at 4:17 a.m.:

Does this all mean that you actually sped up the process? You haven't pointed that out. I'm sure that the only reason to change things is optimization, but I don't see it in what you've told us.