We have a new computer to serve the photo.net photo database. We thought we might do something intelligent, but we ended up instead buying a machine that will carry us forward another 6 months without us having to think. Here are the specs on the new server, which arrived yesterday from Silicon Mechanics:

2 Opteron 2212 2 GHz CPUs (each of which is dual core, so effectively I think that means that four threads can be running simultaneously)

4 GB RAM

hardware RAID of 8 750 GB Seagate SATA drives

CentOS 64-bit operating system

We need to serve a continuous stream of photos from this machine. The data are static and in the local file system. The current load is 2.5 million JPEGs per day, of which 1.2 million are small thumbnail images. We don’t need to query the relational database management system or do anything fancy, just serve the files via HTTP. So maybe, after 12 years, it is time to look beyond AOLserver! Should we consider lighttpd? Apache 2.0? Squid? Should we run just one process of any of these and let threads handle the multiple clients from a single Unix process? Or run multiple copies of the Web server program and tell our load balancer that two sources of these files are available?

[Oh yes, and what about the file system block size? The thumbnails are around 10k bytes.]

Assistance via email or comments here would be appreciated!

Thanks,

Philip

48 Comments

2.5m/day is an average of 29 per second, which is fast enough to be noticeable but not very extreme. Assuming that the average image is 500 KB or so, that’s a bit over 100 Mbps, which isn’t going to challenge this hardware either.

Your big issues will probably be disk I/O and maybe total connection count. I doubt you have under 4 GB of images, so some number of the requests will need to be read from disk. A modern SATA drive will probably let you read 10-20 images per second randomly, so 8 drives will be good for at least 60 requests per second (assuming unequal balancing, drive failures, etc). Assuming that at least half of your traffic can be served out of the cache, then you shouldn’t have disk problems–I’d expect to be able to handle over 100 images/second without too much pain, but that depends on cache hit rate, disk performance, and a bunch of other things that I can’t measure from here.
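The disk estimate above can be sketched in a few lines of Python. This is only a back-of-envelope model: the function name is made up for illustration, and the inputs (about 60 random reads per second from the derated array, half of all requests served from cache) are the comment's own guesses, not measurements.

```python
def sustainable_requests_per_sec(disk_reads_per_sec, cache_hit_rate):
    """Requests/sec the box can serve if only cache misses touch the disks."""
    if not 0 <= cache_hit_rate < 1:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    miss_rate = 1 - cache_hit_rate
    return disk_reads_per_sec / miss_rate


# Assumed inputs: 8 drives derated to ~60 random reads/sec in total,
# and half of the traffic answered from the OS file cache.
estimate = sustainable_requests_per_sec(60, 0.5)
```

With those inputs the model gives 120 requests per second, which is in line with the "over 100 images/second" guess; a lower hit rate pulls the figure back toward the raw disk rate.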

The final remaining issue is the number of open sockets needed to serve that many requests per second. Really slow downloads can clog up heavyweight servers by forcing them to blow a thread (plus state) per ongoing download. Big download servers can easily have over 10,000 active connections, which can be a pain to handle efficiently. Unless you’re handing out a lot of full-size images to dialup users, you probably won’t have issues with Apache or a similar server; your file size and transaction rate just aren’t that high. I’ve never benchmarked AOLserver; there’s a chance that it’d have issues.

Personally, I’d probably use Apache, just because it’s well-known, stable, and easy to configure. Lighttpd is supposed to be good, but I’ve heard from a few people who have hit weird corner cases with it when using it for Rails apps. For maximal performance, you’d probably be best off finding a single-threaded non-blocking server that uses epoll and sendfile and then running 4 copies, but that’s probably gross overkill here. I’d be amazed if Apache couldn’t handle this load just fine with 15 minutes of tuning.

Jim

Even assuming a relatively worst case in which you get 1/5 of your daily hits in the peak hour, that’s 0.5×10^6 files per hour, or 140 per second. At roughly 10 KB per file that yields 1400 KB/s or 1.4 MB/s, which is only slightly faster than old 10BASE-T Ethernet. With any reasonable disk caching, I don’t think you’re going to tax any reasonable web server on that hardware, assuming it’s multi-threaded.
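For anyone who wants to rerun that peak-hour arithmetic with their own numbers, here is a small sketch. The function name is hypothetical, and the 10 KB file size is an assumption: it is the figure implied by 1400 KB/s at 140 requests per second, and it matches the thumbnail size mentioned in the post, not the larger images.

```python
def peak_load(requests_per_day, peak_hour_fraction, avg_file_kb):
    """Return (requests/sec, KB/s, Mbit/s) during the busiest hour."""
    per_second = requests_per_day * peak_hour_fraction / 3600
    kb_per_second = per_second * avg_file_kb
    mbit_per_second = kb_per_second * 8 / 1000  # 1 Mbit = 1000 kbit
    return per_second, kb_per_second, mbit_per_second


# 2.5 million requests/day, 1/5 of them in the peak hour, ~10 KB each (assumed)
rate, kbps, mbps = peak_load(2_500_000, 0.2, 10)
```

This comes out at a little under 140 requests per second and a little over 11 Mbit/s. Plugging in a larger average file size scales the bandwidth linearly.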

You don’t state how many distinct thumbnails you have, but if photo.net has traffic patterns anything like my day-job site sees, you don’t have a problem.

If you’re doing this world-wide, have a budget to spend, and care about latency to non-US customers, we’ve been EXTREMELY happy with our Akamai edge servers in pass-through mode, and Akamai is actually cheaper than our own “source” bandwidth (for which we’re being overcharged for valid reasons that aren’t relevant to this discussion)

We’ve been using nginx at Yottamusic and have had great experiences with it so far. Fast and simple and it Just Works. It also lets you run multiple worker processes for SMP machines. English docs are at http://wiki.codemongers.com/Nginx .

My guess is that standard out-of-the-box Apache will handle it. Delivering medium-sized static files is mostly a matter of bandwidth. Disk I/O and CPU load are low by comparison. I looked at the site and found an average thumbnail size of less than 10kB and less than 100kB for the larger images. Using those conservative numbers, here’s my math. (Worth checking!)

A 100mbps controller (100baseT) can fairly easily handle about 50mbps, but to be safe you might want to add a second controller. That may also take more advantage of the multi-CPU architecture.

I’d install Apache 2.x on the box, load up the files and let it run. You could mess with different httpd’s or even Squid, but why bother if they’re not necessary? Premature optimization is the root of all evil.

In the past I’ve used fnord (http://www.fefe.de/fnord/) when I needed bone-simple, fast HTTP serving. It doesn’t really do much beyond GET requests and some CGI, but it does them well. It leverages Dan Bernstein’s tcpserver app, which is a deal breaker for many. However, given its incredibly small size, I think it’s a well-written app and might be worth considering in your case.

Jay L.

It’s a matter of optimization. If there are 4 processors, each with its own memory, then it’s best to run 4 processes of the http server. Each one will have multiple threads running within it. So, if there are 100 transactions (processes) taking place at a given time, each of the 4 processes should handle 25 on average.

If you have very large files (on the order of 100 MB), then choose a larger blocksize (10 KB is not large, but other images may be large). Or, if there are many transactions (on the order of 1000) going on at one time, increase the blocksize. Otherwise, just use the default. In general, you can’t go wrong with the system default blocksize.

Mike Scott

I currently administer three servers that each serve up to 25 million images per day, about 85% in small thumbnails (3-4KB) and the rest at 310 pixel width, so around 20KB.

They’re on old Sun boxes with only 4 x 440MHz CPU each attached to RAID arrays with 7 10K RPM FC disks, and they cope just fine, although we do cache a lot of them in memory, and it helps that the thumbnail size is smaller than the 4KB block size. We’re using Apache.

You’re going to have no CPU issues at all; you’ll probably be 95% or more idle under peak load. It’s the disk performance you want to watch out for, and so it doesn’t really matter what web server software you use, since a disk read is a disk read. But at only 2.5 million per day, an 8 disk array should cope just fine unless you’re expecting significant growth or your traffic pattern is very spiky.

There’s no point load balancing two web server instances on the same machine — it doesn’t really gain you any extra performance or resilience.

So just use whatever web server you’re most comfortable with, as long as it lets you cache the most frequently accessed thumbnails in memory. Apache 2.2 can certainly do that — I don’t know about other products.

Mike Scott

Oh, a couple of things that will make a difference to your disk performance are the filesystem that you use (I can’t comment meaningfully on the performance of different Linux filesystems) and the way the directories are laid out on the disk. Try to avoid having a single directory with hundreds of thousands of files in it.

Patrick C

Lighttpd is taking over in the static-file webserver space. YouTube uses it. The Wikipedia folks use a server running Fedora Core 3 and Lighttpd with two 1.5TB RAID 0 arrays, 8Gb of RAM, and two 3.4GHz Xeons for serving all static files (mostly photos) from upload.wikipedia.org. They also use lighttpd for download.wikipedia.org, which serves the world with 60GB database dumps of the English Wikipedia.

John Havard

Just stick with nsd. It’s always run faster, for me at least, on static files than any version of apache. Best of all, you can easily cheat when you need to cache some frequently requested files. Those fools that know only LAMP, and even the Java types would probably cry knowing they’ve wasted so much time reinventing what was a solid product in 1995.

David Magda

I don’t think block size will matter too much for performance, but proper size may help a bit in storage efficiency. If your block size is 4KB and your thumbnails are 10KB, then it will take 2.5 blocks to store each file. Since in most file systems you can only use whole blocks, it will take three full blocks to store the file–you’ll lose half of that third block. Most file systems also only allow power-of-2 block sizes (4096, 8192, etc.).
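The block arithmetic above is easy to check. Here is a minimal sketch, with function names invented for illustration and KB taken to mean 1024 bytes:

```python
def blocks_needed(file_bytes, block_bytes):
    """Whole blocks required to hold a file (integer ceiling division)."""
    return -(-file_bytes // block_bytes)


def slack_bytes(file_bytes, block_bytes):
    """Bytes allocated but unused in the file's last block."""
    return blocks_needed(file_bytes, block_bytes) * block_bytes - file_bytes


thumb = 10 * 1024                      # a 10 KB thumbnail
blocks = blocks_needed(thumb, 4096)    # 3 blocks
wasted = slack_bytes(thumb, 4096)      # 2048 bytes, half of the third block
```

The same functions show why a much larger block size would be costly for thumbnails: under this whole-block model, a 10 KB file on 64 KB blocks would leave 54 KB of its single block unused.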

I believe ReiserFS (and XFS?) doesn’t have this design ‘flaw’, but generally you shouldn’t worry about it. The default for Linux’s ext2/ext3 is 4096, and I think this should be sufficient.

What may give you a speed boost is setting the “noatime” mount option on the file system. In the file structure a record is kept for when the file is accessed, and when opening / reading a file it’s updated. This is handy for backups so you know which files have been active, but in your case it’s probably not worth it, so you can save some disk I/O by disabling it. If you’re doing benchmarking it would be worth spending some time testing this mount(8) option.

For security you may also want to add the “nodev”, “noexec”, and “nosuid” options, which disallow the presence of device nodes (like in /dev), any execution of files from that file system, and any set-UID programs (though a bit redundant because of noexec).

Dave Cheney

I’d put my vote in for lighttpd. It uses event driven async IO, rather than thread / process driven (apache worker/mpm) IO, so once the request has been parsed and the correct file located a tight select loop just pushes files from the disk to the nic. Lightty supports epoll or kevent for more efficient select loops and on modern kernels supports sendfile() for zero copy transfers.

Configuration is straightforward; the default configuration will work for you out of the box, but read through the sections in the wiki dealing with SMP (http://trac.lighttpd.net/trac/wiki/Docs%3AMultiProcessor). Even though lightty uses async IO there are a few operations that can still cause the process to block, so running 2-3 workers per cpu may give you better throughput.

As others have said before, you’ll probably be IO bound well before you even notice a blip on your load graphs. If you can afford the space I would recommend setting your RAID array up as RAID 1+0 (a stripe of mirrors); that would still give you close to 3 TB of raw disk space with more IO bandwidth than the bus you have the RAID card connected to.

An important problem with having a single volume this large is that, should you ever have to do an fsck on it, it will take hours (possibly even days). Journaling filesystems help, but having been bitten by the 180 day timer in an old redhat distro which forced a disk check, you have to consider that some downtime may be required just for housekeeping.

Using technologies like linux lvm on top of your raw raid volume would let you create say 16 devices for a hashed directory structure (you are hashing the directories your images are in, aren’t you?) where the top level hash is actually a separate filesystem.

Lvm would allow you to grow each filesystem individually as files were added to your image device, and if a filesystem check was required you could have a rolling outage for 1/16th of your images, without the expense of a full mirror somewhere else, or having to say ‘sorry, come back tomorrow’.
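For readers who have not hashed their image directories yet, here is one way the layout could look. This is a sketch under assumptions, not a description of how photo.net stores files: the /photos root, the file name, and the choice of SHA-256 are all hypothetical. The first hex digit of the hash picks one of 16 top-level directories (each of which could be its own filesystem), and the next two digits spread files across 256 subdirectories.

```python
import hashlib


def hashed_path(filename, root="/photos"):
    """Map a file name to root/<1 hex digit>/<2 hex digits>/<filename>."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    top_level = digest[0]      # 16 possible values: one per filesystem
    second_level = digest[1:3]  # 256 subdirectories under each
    return f"{root}/{top_level}/{second_level}/{filename}"


path = hashed_path("12345-sm.jpg")  # hypothetical thumbnail name
```

Because the mapping depends only on the file name, the web server (or a rewrite rule) can compute the path without any lookup, and no single directory ends up with hundreds of thousands of entries.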

I agree with posts 2 and 10 regarding the use of a single threaded server with select/poll. So nginx, lighttpd and some others should do. Sending static files to a network is an I/O-bound job which requires a high performance “I/O juggler”: a process that finds out what I/O has to be done (select/poll), instructs the OS to do it (with sendfile), and then sits waiting for the next thing to do.
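To make the “I/O juggler” idea concrete, here is a toy sketch in Python. It is not how nginx or lighttpd are implemented; it only shows the two steps named above, using the standard library’s selectors module (which picks epoll, kqueue or select depending on the platform) and socket.sendfile() (which uses the kernel’s sendfile where available). The function names, the socket pair standing in for a client connection, and the 10,000-byte temp file are all assumptions made for the demo.

```python
import os
import selectors
import socket
import tempfile


def serve_once(path, client_sock):
    """Wait until the socket is writable, then hand the file to the kernel."""
    sel = selectors.DefaultSelector()
    sel.register(client_sock, selectors.EVENT_WRITE)
    sent = 0
    try:
        # Step 1: find out what I/O can be done right now.
        for key, _events in sel.select(timeout=5):
            # Step 2: ask the OS to push the file to the socket.
            with open(path, "rb") as f:
                sent = key.fileobj.sendfile(f)
    finally:
        sel.unregister(client_sock)
        sel.close()
    return sent


def demo():
    """Send a small stand-in 'thumbnail' across a local socket pair."""
    server_side, client_side = socket.socketpair()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 10_000)
    try:
        sent = serve_once(tmp.name, server_side)
        server_side.close()
        received = b""
        while chunk := client_side.recv(65536):
            received += chunk
        return sent, len(received)
    finally:
        client_side.close()
        os.unlink(tmp.name)
```

A real server would keep one selector alive for thousands of sockets and track a file offset per connection; the sketch handles a single connection so the two steps stay visible.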

With the kind of CPUs that you have this should be an easy task. The only thing to check is “surprises” from the operating system. I mean: how does the OS scale on operations like select/poll/mmap/sendfile/etc when the number of open sockets and file descriptors is very high?

In 2003 a very good benchmark came out regarding this point: http://bulk.fefe.de/scalability/
The only problem is that it is old and doesn’t include other OS’s like Solaris. When the benchmark came out it stirred some discussion in the OpenBSD community because the benchmark showed poor scalability of this OS. The OpenBSD people have improved things since then. Anyway, Linux 2.6 came out well, so it should still be a good choice.

Linux caches file system blocks as much as possible so I would leave caching entirely to the OS.

At this point the bottleneck should be disk I/O. SCSI channels have high bandwidth, so the real problem is latency. What has to be avoided is file fragmentation (the blocks of a file being spread on different tracks). With larger jpgs and a lot of disk churn (files being added, copied, removed) this could become significant. A larger block size can reduce the problem, if you are willing to sacrifice a bit of disk space when the file size is smaller than the block size.

My starting point would be: 16kB for the file system containing the thumbnails and 128kB for the file system containing the larger images. This way, I guess, 90% or more of the files would be read in one disk operation.

Apologies. Maximum block size in XFS is 64kB, so 128kB is not possible. To get contiguous allocations of more than 64kB, one would have to create files/use extents in the real-time section of an XFS file system.

It claims that Apache 2 tends to die under 4000 concurrent sessions vs. YAWS at 80,000. This is single process / single instance.

On the other hand, rolling your own SQL interface for Erlang is not the easiest thing in the world, from what I understand (no native SQL interface libraries). But for serving static content, this should be pretty cool.

In the interest of disclosure, I must admit I have no experience with it. I have played with Apache, but am far from an expert.

David: Is this true about ext3 that it can’t put information from more than one file in a given block? So if we create a whole bunch of files, each with only 1 character, each will take up an entire 4K block? I would have thought the consequence of having, say, 100K block size, would be RAM cache inefficiency. You want one file and pull into RAM a huge block that contains fragments of unrelated files. (This is the issue with RDBMSes; I wonder if I have been coding for Oracle for so long that I’ve forgotten how operating system file systems work.)

I’m going to give a +1 to S3. You’re probably smart enough to set up something better just for yourselves, but that takes time and effort. S3 works, only costs for what you use, and takes little to no time and effort.

presidentpicker

Using Doug’s numbers, you are talking about less than 2 MB/s random I/O. Just about any filesystem/disk/cpu combination from last five years should be able to handle this load with lots of reserve capacity. If in doubt you can run IOzone benchmark http://www.iozone.org/.

2.6 Kernel supports asynchronous I/O (AIO) – a huge benefit for your case. But I don’t know which of the servers mentioned above can take advantage of it. I think Apache v3 will add AIO support as a new major feature.

Also look at using ext4 (make sure to enable extents). This would make this filesystem similar to (as good as?) ReiserFS for handling file fragmentation. EXT4 became part of 2.6.19 and later kernels.

Guys: I think we are going to use S3 for off-site backup, but I don’t think that we established that it is workable for our day-to-day page serving (I can’t remember if we concluded whether or not one of our Web pages could reference an IMG stored in S3 and have the bits go directly from Amazon to our reader without coming back into our cluster first).

philg: As long as you don’t need any goofy authentication/authorization, Amazon S3 serves static assets just fine, directly to the browser. That’s one of the advantages of Amazon S3 speaking HTTP. (It also speaks BitTorrent, interestingly enough.)

We use S3 for JPG Magazine and it’s rock solid, cheap as hell, easy to use and we aliased photos.jpgmag.com to it so the images even look like they’re coming from our own server farm. We have onsite backups of everything in case it fails but in 3 months of heavy use it hasn’t hiccuped for more than 15 or 20 seconds here or there. It’s amazing.

Oh yeah, in reference to your previous comment about coming back to your cluster first that’s what the url aliasing is for and if you set the files with global read perms then the client’s system goes straight to S3 for the image and leaves you out of the loop entirely. And did I mention it’s fast?

Thanks, Jason. I don’t think that we’ve demonstrated spectacular sysadmin competence in the past at photo.net (though we brought in some professionals in September 2006 and I’m hoping that things will improve), and that is not our core business, so maybe we should dump as much as possible onto Amazon. I guess we will try to do an off-site backup over there ASAP.

Jin

Leaving the files in S3 with global read perms would open up the possibility of a botnet attack whereby an attacker could simply request all your files continuously and thereby drain your bank account by using up bandwidth as fast as Amazon can deliver it. Is there any way to address this kind of abuse? Otherwise, agreed, serving directly from S3 would be a fine idea.

Mark

If you’re considering S3, even for backup, you might want to take a look at Joyent’s BingoDisk (or Accelerator) products, which at $200/yr for 100GB of HTTP/WebDAV accessible disk work out a lot cheaper than S3. B/W charge is the same as S3 (but you do get 100GB transfer included in the base fee).

I’m using it for offsite backup, and it’s been flawless.

Mark

Oh, I should also mention that I’m serving photos.nsmb.com via lighttpd and I’ve been happy. Less Apache memory bloat means more RAM available for the OS to cache static files. I’ve been testing nginx (the rewrite engine is more featureful than lighttpd) and it’s also looking very solid.

I suggest testing a separate AOLserver process on a separate IP address dedicated to just serving the static image files. Don’t load any of your nice TCL code into it, it should take only a few MB of RAM. You could try bumping up the thread count to say 50, and with so little TCL code in each thread the RAM usage will still be low.

Also you should definitely go to more than 4GB RAM. Opteron systems seem to do better with lots of RAM, provided you are running a 64bit kernel (I use Solaris 10 myself, but Centos x64 should be fine also). Be sure that the RAM is installed properly with equal amounts of RAM in each CPU’s bank of RAM.

I’m a fan of mathopd, though I confess that the last time I looked into the “fast static file webserver” space was circa 2003, and the game was pretty much between Apache 1.x, thttpd, mathopd, and maybe one other one I can’t recall. I read a lot of benchmarks other people had done (most of which showed poor understanding of statistical methods), set up my own non-scientific tests, and then read the code of parts of them, including thttpd and mathopd. I picked mathopd at the time because it seemed to win most for my use case and I could understand the code easily, something I couldn’t say for thttpd. Some thttpd security holes had me concerned as well. mathopd has not been free of them, but I sensed less opportunity for them to creep in given the way mathopd was written. None of this may matter much for your uses.

I haven’t compared mathopd to the newer servers others have mentioned here, including lighttpd and nginx, but if I do, I’ll let you know what I find out.

hyperion

I would not suggest using nginx or any single-threaded synchronous-IO (read(), sendfile()) webserver, as with a big fileset IO can become disk-seek speed bound instead of disk throughput bound. The solutions are 1) use a multithreaded webserver (which you already do), where the OS can reorder IO somewhat or 2) use an async-IO based webserver. The problem with async-IO is that the POSIX implementations are usually limited to a small number of ongoing IOs at a time, so the server would need to use an OS-specific AIO interface until this improves… The only httpd which I know of that can do this is the lighttpd 1.5 branch on Linux, which is not released yet unfortunately. I expect the current lighttpd 1.4 branch to perform worse than Apache 2.2, but that will change…

With S3 I’ve had some very bad experience. Bandwidth was very slow. Maybe they solved that now but is it really intended as a content distribution network? I think Limelight is better for content distribution (for example MySpace uses it for pictures). Can’t comment on Akamai.

Ata

I’ve used TUX in the past and served more than what you mention with much smaller servers (1GB, IDE). (Adult gallery hosting).

The file sizes were about the same, however I had only a couple thousand different files.

But as others have mentioned, you actually don’t even need TUX for that relatively small amount of hits/second.

But while you don’t NEED TUX it would still make your site *fly*, and leave a LOT of room for growth. Your only concern would be disk I/O (if the images don’t fit all into RAM, or each user requests wildly different images) and your server’s uplink to the internet (100mbps, 1gbps).

On Redhat/Centos, TUX is already there, and after

/etc/rc.d/init.d/tux start
/etc/rc.d/init.d/tux stop

you can start configuring it. You can be up and running within minutes. It runs very well along with Apache on the same machine, too, i.e. if you wanted to serve dynamic pages from the same server. In this case TUX will sit in front at port 80 and pass every file with extensions specified by you to Apache in the background at port 8080. A restart of xinetd is required to have Apache leave port 80 and let TUX go there. If it won’t work just reboot the whole server.

Only drawback is TUX doesn’t have any hotlink protection (no htaccess). But if that is a concern you can work around that by changing the path to the images with symlinks & cron jobs regularly. Have a symlink point to the real (internal) path, create a new random symlink every hour or day, and delete the old symlink after the new symlink has been active for some time (to prevent broken images for users).
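The symlink rotation described above could be scripted roughly like this and run from cron. It is only a sketch: the function name, the 16-character random link names, and the idea of keeping the two newest links alive are assumptions, and a real setup would also need to persist the list of live links between runs and have the pages emit the newest name.

```python
import os
import secrets


def rotate_links(real_dir, public_root, live_links, keep=2):
    """Add a fresh random symlink to real_dir and retire the oldest ones.

    live_links is the list of link names currently in public_root, oldest
    first. The returned list holds at most `keep` names, newest last, so the
    previous link keeps working for one more rotation period.
    """
    new_name = secrets.token_hex(8)
    os.symlink(real_dir, os.path.join(public_root, new_name))
    live_links = live_links + [new_name]
    while len(live_links) > keep:
        stale = live_links.pop(0)
        os.unlink(os.path.join(public_root, stale))
    return live_links
```

Called once an hour with keep=2, every image URL stays valid for between one and two hours, which covers pages that were rendered just before a rotation.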

Of course, if you push a lot of bandwidth, usually hotlinking won’t have any noticeable effect anyway, so I personally just leave out the whole path-changing part.

If you’re concerned about speed, and room for growth, look no further, TUX is second to none for static file serving and will put a smile on your face…

hyperion

Ata

Yes, after trying it out on Centos 4.4 I can confirm that Tux causes kernel panics on Linux 2.6 kernels. Several people on the Tux list report the same thing. I got it running for anything between 5 hrs and 2.5 days before a kernel panic occurred and the server went down. Too bad really. This happens on both SMP and non-SMP kernels. It worked wonderfully on 2.4 kernels though. I had it running non-stop without problems on Redhat 9 (!). So the only options for now would probably be to use Tux on an older kernel on a dedicated box, or skip Tux altogether and use Lighttpd/Nginx/Litespeed instead.

cmholm

thttpd: I’ve been using it on Linux and OSX for years. Most of my traffic comes from Yahoo and Google bots sucking up my photos during indexing.

nginx: I’d like to get some dynamic content rendered faster, so I’m playing with deploying it in front for static files, and using its url proxy feature to have it shoot the cgi/fastcgi/mod_whatever calls through to Apache.

lighttpd: I started to configure it for its fastcgi feature. However, many are the claims that it suffers significant memory leaks under heavy load. Of those making the claim but sticking with the product, the workaround was a cron job that reloaded the daemon at least nightly.

I tried lighttpd but I don’t like the configuration file or, in general, how you configure it. It is better than the crazy apache conf file but still not perfect. I am using MyServer now and I am much happier. I suggest it to everyone who wants a fast and simple to use server.