Diskless Web Serving for Fun and Profit

Wednesday 03 December, 2008

We’ve used network-booting diskless servers at Last.fm ever since we got our third web server back in 2004. I think it’s one of the best architectural decisions we’ve made, yet there’s precious little information around about running diskless servers. Hopefully this article will go some way towards rectifying that.

Why?

First, a summary of why diskless web serving rocks so much:

No disks mean fewer failures, hence less maintenance. I hate disks.

One single image means only one copy of your web serving environment to maintain.

There are situations where diskless web serving won’t work, primarily when the content you want to serve won’t fit into an economical amount of RAM. If your code base isn’t enormous and you serve your static assets separately (which you absolutely should be doing), you shouldn’t hit this problem.

Of course, you can use this for other purposes than just web serving - we also boot our Hadoop clusters off the network.

What you need

A server to boot from: the hardware requirements for this are minimal, but it must be reliable. If this dies, your entire web cluster does. Availability can be improved using HA-NFS and similar trickery.

Between 1 and N web nodes. The only hardware requirement for your web nodes is that they support PXE booting, but pretty much everything does these days.

I’m going to assume that you’re using Debian/Ubuntu on your machines. Debian’s installation tools are very handy for getting this running; your mileage may vary with other operating systems.

Booting

Here’s how a server will boot up into a web serving environment:

The machine powers on. The BIOS is configured to use PXE in its boot order.

The network card’s PXE code brings up the link and gets an IP via DHCP (usually this will be a static MAC->IP mapping).

The DHCP response contains the “next-server” and “filename” options. The PXE client connects to the next-server over TFTP, grabs the file (which happens to be PXELINUX), and executes it.

PXELINUX grabs the boot configuration for the machine, which names the Linux kernel and initrd to load. It fetches both over TFTP and boots Linux.

It’s a bit of a marathon, but the stages are actually quite simple. (And it’s magical when it works.) For the purposes of tying things together nicely, I’m going to run through configuring these steps roughly backwards.

Bootstrap a Linux install

You’ll need a Linux install for your webservers to run. This will be a full Linux filesystem which will exist on the boot server. To make this we use Debian’s excellent debootstrap tool. We’ll put a Debian Etch install in the directory /netboot/root:
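A minimal sketch of that step, wrapped in a function so nothing runs until you call it. The mirror URL is an assumption (use whichever mirror carries the release you want), and build_image is just an illustrative name:

```shell
# Assumed values: only the suite and target directory come from the text above.
SUITE=etch
TARGET=/netboot/root
MIRROR=http://archive.debian.org/debian/   # assumption: any mirror carrying the suite will do

build_image() {
    # debootstrap SUITE TARGET [MIRROR]; needs root and network access
    debootstrap "$SUITE" "$TARGET" "$MIRROR"
}

# On the boot server, as root:
#   build_image
```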

Once that finishes (it will take a while), you should have a pristine Debian install. One thing you should do immediately is set up the debian_chroot file so you know if you’re inside the image or not:
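Something along these lines should do it. Debian's default bash configuration reads /etc/debian_chroot and shows its contents in brackets at the start of the prompt; the label "netboot" and the helper name tag_image are assumptions chosen for illustration:

```shell
# tag_image ROOT LABEL: write LABEL into ROOT/etc/debian_chroot
tag_image() {
    echo "$2" > "$1/etc/debian_chroot"
}

# On the boot server, as root:
#   tag_image /netboot/root netboot
# A shell started with chroot /netboot/root then shows a prompt like
#   (netboot)root@bootserver:/#
```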

Tuning: Linux Swapless Memory Management

It’s worth noting that Linux’s memory management strategy doesn’t take kindly to being run under high memory pressure with no swap. People have tried various approaches to solving this in a diskless environment, even going so far as putting swap partitions on network block devices. We’ve found that it’s not too hard to keep things under control if you’re careful, even with PHP’s poor memory management.

There are two methods we use: firstly, leave a 10% safety margin when setting your Apache MaxClients. 10% “wasted” RAM may seem bad, but it’s a small price to pay for maintainability.
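As a rough illustration of the safety margin, here is one way to size Apache's process limit (MaxClients in Apache 2.2). The figures are made up, and taking the 10% off total RAM before subtracting fixed costs is an assumption about how to apply the margin:

```shell
# max_clients TOTAL_MB RESERVED_MB PER_CHILD_MB
#   TOTAL_MB      physical RAM in the node
#   RESERVED_MB   tmpfs root, codebase, memcached and anything else fixed
#   PER_CHILD_MB  worst-case resident size of one Apache child
max_clients() {
    total_mb=$1
    reserved_mb=$2
    per_child_mb=$3
    # keep 10% of total RAM back as the safety margin
    budget_mb=$(( total_mb * 90 / 100 - reserved_mb ))
    echo $(( budget_mb / per_child_mb ))
}

max_clients 8192 1024 40   # prints 158
```

Measure the per-child figure under real load rather than guessing; PHP processes tend to grow over their lifetime.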

I might document these better in a future post, but it’s at least a good start.

Bear in mind that in memory statistics, the space taken by the tmpfs root filesystem will not show up as “used” memory; it will show up as “cached”. Annoyingly, there’s no easy way of telling how much of your cached memory is essential and how much can be “swapped out”.

(Oh, and keep an eye out for shared memory leaks if you’re seeing inexplicable out-of-memory issues. That one kept me guessing for 4 months. Shared memory is also accounted for under “cached”.)
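Newer kernels make this slightly less painful: /proc/meminfo gained a Shmem line (it is absent on older kernels, so treat its presence as an assumption) that counts tmpfs and shared memory pages. A rough sketch that splits “cached” into that pinned part and the rest; cached_breakdown is a hypothetical helper name:

```shell
# cached_breakdown: reads /proc/meminfo-style text on stdin and reports how
# much of "Cached" is tmpfs/shared memory (Shmem) and how much is other cache.
cached_breakdown() {
    awk '
        $1 == "Cached:" { cached = $2 }
        $1 == "Shmem:"  { shmem = $2 }
        END {
            printf "cached_kb=%d shmem_kb=%d other_kb=%d\n", cached, shmem, cached - shmem
        }
    '
}

# On a node:
#   cached_breakdown < /proc/meminfo
```

The “other” figure is only an upper bound on what the kernel can drop, but a steadily growing shmem_kb is a decent hint that something is leaking shared memory.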

The kernel and initrd

Next, you should choose which kernel you want to use. It might be possible to use a distribution’s stock kernel, but it simplifies things a lot if you have a custom kernel with the modules you require statically compiled in. Generally this is pretty minimal: just Ethernet, NFS and possibly USB HID drivers are needed.

You now need to build a skeleton initrd, which you can do with mkinitrd, then mount it:

I’m going to skirt around the finer points of initrd construction here; the initrd mkinitrd provides is slight overkill for our needs. The important thing is that you add a custom linuxrc file into your initrd to configure the network and root filesystem. Here’s the one we use - it’s a little crude but it works (refinements welcome).

TFTP

Set up a TFTP server of your choice. I like atftpd, but I find it quite hard to get excited about TFTP servers so I won’t prescribe one. I’ll assume that it’s working and it’s serving from the directory /tftproot. Install PXELINUX into that directory, as well as your kernel (vmlinux) and initrd. You should have this directory structure:
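A sketch of one layout that PXELINUX will accept, assuming a single shared boot configuration. PXELINUX looks for its config under pxelinux.cfg/ and falls back to a file called default; the label name "web" and the initrd filename "initrd" are assumptions:

```shell
# Expected layout once everything is in place:
#   /tftproot/pxelinux.0
#   /tftproot/pxelinux.cfg/default
#   /tftproot/vmlinux
#   /tftproot/initrd

# stage_tftproot DIR: create the pxelinux.cfg directory and a default config.
# Copying pxelinux.0, the kernel and the initrd into DIR is left to you.
stage_tftproot() {
    root=$1
    mkdir -p "$root/pxelinux.cfg"
    cat > "$root/pxelinux.cfg/default" <<'EOF'
DEFAULT web
LABEL web
    KERNEL vmlinux
    APPEND initrd=initrd
EOF
}

# On the boot server:
#   stage_tftproot /tftproot
```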

DHCP

Lastly, you need to set up your DHCP server to send the correct boot options to your web nodes. I’ll assume you’re using ISC dhcpd v3, which seems to be a decent enough DHCP server. In dhcpd.conf, we create a separate group for web servers (this is just a snippet, it assumes you have a working config beforehand):
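A hedged sketch of what such a group might look like, printed by a small shell function so the boot server address only has to be given once. The addresses, MAC, host name and the function name web_group_conf are all made up for illustration; next-server, filename and option root-path are standard dhcpd settings:

```shell
# web_group_conf BOOT_SERVER_IP: print a dhcpd.conf group for the web nodes
web_group_conf() {
    boot_server=$1
    cat <<EOF
group {
    next-server $boot_server;
    filename "pxelinux.0";
    option root-path "$boot_server:/netboot/root,actimeo=120";

    # one host block per web node (hypothetical MAC and IP)
    host web1 {
        hardware ethernet 00:11:22:33:44:55;
        fixed-address 10.0.0.11;
    }
}
EOF
}

# Review the output, then paste it into dhcpd.conf:
#   web_group_conf 10.0.0.1
```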

Tuning: The NFS root filesystem

You can see ,actimeo=120 in the root-path option. This is a standard mount option for NFS, and it’s used to control the attribute (stat, or getattr) cache. On your web nodes, all your system files will be NFS-mounted. In some cases files in these directories will be hit very frequently (glibc loves statting /etc/localtime) - you don’t want to incur a network trip every time that happens. This setting keeps cached attributes for 120 seconds, so be aware that changes made on the boot server can take up to two minutes to show up on the nodes, which may cause some weirdness.

Icing

Provided I haven’t missed anything, you should be able to boot a node and have it load up your Linux install. That’s the hard part.

We have init scripts which copy our web codebase onto the tmpfs ramdisk, then launch Apache and Memcache with parameters appropriate to the machine spec. Those are pretty specialist, though, so I’m not publishing them.

To finish, by way of credits, here’s a quick list of the other things which keep our web cluster ticking over smoothly: