Oracle Blog

Rebooting is obsolete

Wednesday May 05, 2010

If you're a veteran system administrator, you might remember an era of extremely expensive hard disk storage, when any serious network would have a beefy central file server (probably accessed using the Network File System, NFS) that formed the lifeblood of its operations. It was a well-loved feature as early as Linux kernel 2.0 that you could actually boot your machine with a root filesystem in NFS and have no local disk at all. Hardware costs went down, similar machines could share large parts of their system binaries, upgrades could be done without touching anything but the central server—sysadmins loved this.

But that was then. Diskless booting these days seems a lot less common, even though the technology still exists. You hear about supercomputer clusters using it, but not the "typical" IT department. What happened?

Part of it, I'm sure, is that hard disks became speedier and cheaper more quickly than consumer network technology gained performance. With local disks, it's still difficult to roll out updates to a hundred or a thousand computers simultaneously, but many groups don't start with a hundred or a thousand computers, and multicast system re-imaging software like Norton Ghost prevents the hassle from being unbearable enough to force a switch.

More important, though, is that after a few years of real innovation, the de facto standard in network booting has been stagnant for over a decade. Back in 1993, when the fastest Ethernet anyone could use transferred a little over a megabyte of data per second and IDE hard drives didn't go much faster, network card manufacturers were already including boot ROMs on their expansion cards, each following its own proprietary protocol for loading and executing a bootstrap program. A first effort at standardization, Jamie Honan's "Net Boot Image Proposal", was informally published that year, and soon enough two open-source projects, Etherboot (1995) and Netboot (1996), were providing generic ROM images with pluggable driver support. (Full disclosure: I'm an Etherboot Project developer.) They took care of downloading and executing a boot file, but that file would have no way of going back to the network for more data unless it had a network card driver built in. These tools thus became rather popular for booting Linux, and largely useless for booting simpler system management utilities that couldn't afford the maintenance cost of their own network stack and drivers.

Around this time, Intel was looking at diskless booting from a more commercial point of view: it made management easier, consolidated resources, avoided leaving sysadmins at the mercy of users who broke their systems thinking themselves experts. They published a specification for the Preboot Execution Environment (PXE), as part of a larger initiative called Wired for Management. Network cards started replacing their proprietary boot ROMs with PXE, and things looked pretty good; the venerable SYSLINUX bootloader grew a PXELINUX variant for PXE-booting Linux, and a number of enterprise system management utilities became available in PXE-bootable form.

But, for whatever reason, the standard hasn't been updated since 1999. It still operates in terms of the ancient x86 real mode, only supports UDP and a "slow, simple, and stupid" file transfer protocol called TFTP, and officially limits boot program size to 32kB. For modern-day applications, this is less than ideal.

Luckily for us, the Etherboot Project still exists; Etherboot's successor, gPXE, has been picking up where Intel left off, and supports a number of more modern protocols. Between that, excellent support in recent Linux kernels for both accessing and serving SAN disks with high performance, and the flexibility gained by booting with an initial ramdisk, diskless booting is making a big comeback. It's not even very hard to set up: read on!

The general idea

PXE Booting Flowchart

While it can get a lot more complicated to support boot menus and proprietary operating systems, the basic netbooting process these days is pretty straightforward. The PXE firmware (usually burned into ROM on the network card) performs a DHCP request, just like most networked computers do to get an IP address. The DHCP server has been configured to provide additional "options" in its reply, specifying where to find boot files. All PXE stacks support booting from a file (a network bootstrap program, NBP); PXELINUX is the NBP most commonly used for booting Linux. The NBP can call back to the PXE stack to ask it to download more files using TFTP.

Alternatively, some PXE stacks (including gPXE) support booting from a networked disk, accessed using a SAN protocol like ATA over Ethernet or iSCSI. Since it's running in real mode, the firmware can hook the interrupt table to cause other boot-time programs (like the GRUB bootloader) to see an extra hard disk attached to the system; unbeknownst to these programs, requests to read sectors from that hard disk are actually being routed over the network.

If you want to boot a real-mode operating system like DOS, you can stop there; DOS never looks beyond the hooked interrupt, so it never has to know about the network at all. We're interested in booting Linux, though, and Linux has to manage the network card itself. When the kernel is booted, the PXE firmware becomes inaccessible, so it falls to the initial ramdisk (initrd or initramfs) to establish its own connection to the boot server so it can mount the root filesystem.

Setting up the client

We're going to walk through setting up network booting for a group of similar Ubuntu Lucid systems using disk images served over iSCSI. (The instructions should work with Karmic as well.) iSCSI runs on top of a TCP/IP stack, so it'll work fine within your current network infrastructure. Even over 100Mbps Ethernet, it's not significantly slower than a local disk boot, and certainly faster than a live CD. Rebooting may be obsolete, but short bootup times are still good to have!

You'll want to start by installing one Ubuntu system and configuring it how you'll want all of your diskless clients to be configured. There's room for individual configuration (like setting unique hostnames and maybe passwords) later on, but the more you can do once here, the less you'll have to do or script for all however-many clients you have. When they're booted, the clients will find your networked drive just like a real hard drive; it'll show up as /dev/sda, in /proc/scsi/scsi, and so forth, so you can pretty much configure them just like any local machine.

No matter what site-specific configuration choices you make, there are some steps you'll need to perform to make the image iSCSI bootable. First, you'll need to install the iSCSI initiator, which makes the connection to the server housing the boot disk image:

client# aptitude install open-iscsi

That connection will need to occur during the earliest stages of bootup, in the scripts on the initial ramdisk. open-iscsi can automatically update the initramfs to find and mount the iSCSI device, but it assumes you'll be setting a bunch of parameters in a configuration file to point it in the right place. It's quite cumbersome to set this up separately for every node, so I have prepared a patch that will make the initramfs automatically set these values based on the "boot firmware table" created by the iSCSI boot firmware from the information provided by your DHCP server. You should apply it now with
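A sketch of how that might look, assuming the patch applies against the initramfs-tools scripts and was saved as /root/iscsi-ibft.patch (both the path and the target directory are assumptions; adjust to wherever you downloaded the patch and whatever it says it applies against), followed by rebuilding the initramfs so the change takes effect:

```
client# cd /usr/share/initramfs-tools
client# patch -p1 < /root/iscsi-ibft.patch
client# update-initramfs -u
```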

Let's assume you've set up the prototype client as above, and you now have an image of its hard disk in a file somewhere. Because the disk-level utilities we'll be using don't know how to deal with files, it's necessary to create a loop device to bridge the two:

server# losetup -v -f /path/to/ubuntu-image
Loop device is /dev/loop0

If you get different output, or if the client disk image you created is already on a "real" block device (e.g. using LVM), replace /dev/loop0 with that device in the examples below.

You may be familiar with the Linux device mapper, probably as the backend behind LVM, but there's a lot more it can do. In particular, it gives us very easy copy-on-write (COW) semantics: you can create multiple overlays on a shared image, such that writes to the overlay get stored separately from the underlying image, reads of areas you've written come from the overlay, and reads of areas you've not modified come from the underlying image, all transparently. The shared image is not modified, and the overlays are only as large as is necessary to store each one's changes. Let's suppose you've got some space in /cow that you want to use for the overlay images; then you can create five of them named /cow/overlay-1.img through /cow/overlay-5.img with
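A sketch of one way to create them, using dd with count=0 and seek to make sparse files, so no disk space is consumed until the overlays are actually written to:

```
server# mkdir -p /cow
server# for i in $(seq 1 5); do
> dd if=/dev/zero of=/cow/overlay-$i.img bs=512 count=0 seek=10M
> done
```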

(10M blocks * 512 bytes per block = 5GB per overlay; this represents the amount of new data that can be written "on top of" the base image.)

Now for the fun part. The aforementioned snapshot functionality is provided by the dm-snapshot module; it's a standard part of the Linux device mapper, but you might not have it loaded if you've not used the snapshotting feature before. Rectify that if necessary:
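Loading the module is a one-liner:

```
server# modprobe dm-snapshot
```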

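With the snapshot target available, the overlays can be wired up in one loop; this is a sketch that assumes the base image is still attached at /dev/loop0 as above, and that the overlay files were created as /cow/overlay-n.img:

```
server# for i in $(seq 1 5); do
> cowdev=$(losetup -f)
> losetup $cowdev /cow/overlay-$i.img
> echo 0 $(blockdev --getsize /dev/loop0) snapshot /dev/loop0 $cowdev p 8 \
>   | dmsetup create image-$i
> done
```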
This sequence of commands assigns a loopback device to each COW overlay file (to make it look like a normal block device) and creates a bunch of /dev/mapper/image-n devices from which each client will boot. The 8 in the above line is the "chunk size" in 512-byte blocks, i.e. the size of the modified regions that the overlay device will record. A large chunk size wastes more disk space if you're only modifying a byte here and there, but may increase performance by lowering the overhead of the COW setup. p makes the overlays "persistent"; i.e., all relevant state is stored in the image itself, so it can survive a reboot.

You can tear down the overlays with dmsetup remove:

server# for i in $(seq 1 5); do
> dmsetup remove image-$i
> done

It's generally not safe to modify the base image when there are overlays on top of it. However, if you script the changes (hostname and such) that you need to make in the overlays, it should be pretty easy to just blow away the COW files and regenerate everything when you need to do an upgrade.

The loopback device and dmsetup configuration need to be performed again after every reboot, but you can reuse the /cow/overlay-n.img files.

Setting up the server for iSCSI

We now have five client images set up to boot over iSCSI; currently they're all passing reads through to the single prototype client image, but when the clients start writing to their disks they won't interfere with each other. All that remains is to set up the iSCSI server and actually boot the clients.

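If you haven't already, install the iSCSI Enterprise Target daemon on the server (aptitude install iscsitarget), then add one Target block per client image to /etc/ietd.conf. Something like the following sketch, using the example names discussed next:

```
Target iqn.2008-01.com.ksplice.servertest:client-1
        Lun 0 Path=/dev/mapper/image-1,Type=blockio
Target iqn.2008-01.com.ksplice.servertest:client-2
        Lun 0 Path=/dev/mapper/image-2,Type=blockio
# ...and so on, through client-5.
```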
Each Target line names an image that can be mounted over iSCSI, using a hierarchical naming scheme called the "iSCSI Qualified Name" or IQN. In the example above, the com.ksplice.servertest should be replaced with the reverse DNS name of your organization's domain, and 2008-01 with a year and month as of which that name validly referred to you. The part after the colon determines the specific resource being named; in this case these are the client drives client-1, client-2, etc. None of this is required—your clients will quite happily boot images introduced as Target worfle-blarg—but it's customary, and useful in managing large setups. The Lun 0 line specifies a backing store to use for the first SCSI logical unit number of the exported device. (Multi-LUN configurations are outside the scope of this post.)

Edit /etc/default/iscsitarget and change the one line in that file to:

ISCSITARGET_ENABLE=true

You can then start ietd with

server# /etc/init.d/iscsitarget start

To test that it's working, you can install open-iscsi and ask the server what images it's serving up:
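The discovery command and its output look roughly like this (the IP address and IQNs will be whatever you configured; the exact output format can vary by version):

```
server# iscsiadm -m discovery -t sendtargets -p 192.168.1.90
192.168.1.90:3260,1 iqn.2008-01.com.ksplice.servertest:client-1
192.168.1.90:3260,1 iqn.2008-01.com.ksplice.servertest:client-2
...
```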

The only piece that remains is somehow communicating to your clients what they'll be booting from; if they're diskless, they don't have any way to read that information locally. Luckily, you probably already have a DHCP server set up in your organization, and as we mentioned before, it can hand out boot information just as easily as it can hand out IP addresses. You need to have it supply the root-path option (number 17); detailed instructions for ISC dhcpd, the most popular DHCP server, are below.

In order to make sure each client gets the right disk, you'll need to know their MAC addresses; for this demo's sake, we'll assume the addresses are 52:54:00:00:00:0n where n is the client number (1 through 5). Then the lines you'll need to add to /etc/dhcpd.conf, inside the subnet block corresponding to your network, look like this:
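A sketch of those host declarations (one per client; 192.168.1.90 is a placeholder for your iSCSI server's address):

```
host client-1 {
    hardware ethernet 52:54:00:00:00:01;
    option root-path "iscsi:192.168.1.90::::iqn.2008-01.com.ksplice.servertest:client-1";
}
host client-2 {
    hardware ethernet 52:54:00:00:00:02;
    option root-path "iscsi:192.168.1.90::::iqn.2008-01.com.ksplice.servertest:client-2";
}
# ...and so on, through client-5.
```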

Replace 192.168.1.90 with the IP address of your iSCSI server. The syntax of the root-path option is actually iscsi:server:protocol:port:lun:iqn, but the middle three fields can be left blank because the defaults (TCP, port 3260, LUN 0) are exactly what we want.

Booting the clients

If your clients are equipped with particularly high-end, "server" network cards, you can likely boot them now and everything will Just Work. Most network cards, though, don't contain an iSCSI initiator; they only know how to boot files downloaded using TFTP. To bridge the gap, we'll be using gPXE.

gPXE is a very flexible open-source boot firmware that implements the PXE standard as well as a number of extensions: you can download files over HTTP, use symbolic DNS names instead of IP addresses, and (most importantly for our purposes) boot off a SAN disk served over iSCSI. You can burn gPXE into your network card, replacing the less-capable PXE firmware, but that's likely more hassle than you'd like to go to. You can start it from a CD or USB key, which is great for testing. For long-term use you probably want to set up PXE chainloading; the basic idea is to configure the DHCP server to hand out your root-path when it gets a DHCP request with user class "gPXE", and the gPXE firmware (in PXE netboot program format) when it gets a request without that user class (coming from your network card's simple PXE firmware).
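The chainloading logic can be sketched in ISC dhcpd configuration like this; gpxe.kpxe here is a hypothetical filename for a PXE-chainloadable gPXE build served from your TFTP server, and the root-path is the per-host value shown earlier:

```
if exists user-class and option user-class = "gPXE" {
    # Request came from gPXE itself: hand out the SAN root path.
    option root-path "iscsi:192.168.1.90::::iqn.2008-01.com.ksplice.servertest:client-1";
} else {
    # Plain PXE firmware: chainload gPXE over TFTP.
    filename "gpxe.kpxe";
}
```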

For now, let's go the easy-testing route and start gPXE from a CD. Download this 600kB ISO image, burn it to a CD, and boot one of your client machines using it. It will automatically perform DHCP and boot, yielding output something like the below:
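The exact messages depend on the gPXE version and your network details, but the transcript looks roughly like the following (the client IP address shown is an assumption):

```
gPXE 1.0.0 -- Open Source Boot Firmware -- http://etherboot.org
Features: AoE HTTP iSCSI DNS TFTP bzImage COMBOOT ELF Multiboot NBI PXE PXEXT

DHCP (net0 52:54:00:00:00:01).... ok
net0: 192.168.1.101/255.255.255.0 gw 192.168.1.1
Booting from root path "iscsi:192.168.1.90::::iqn.2008-01.com.ksplice.servertest:client-1"
Registered as BIOS drive 0x80
Booting from BIOS drive 0x80
```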

after which, thanks to the client setup performed earlier, the boot will proceed just like from a local hard disk. You can eject the CD as soon as you see the gPXE banner; it's just being used as an oversized ROM chip here.

You'll probably want to boot each client in turn and, at a minimum, set its hostname to something unique. It's also possible to script this on the server side by using kpartx on the /dev/mapper/image-n devices, mounting each client's root partition, and modifying the configuration files therein.

That's it: if you've followed these instructions, you now have a basic but complete architecture for network booting a bunch of similar clients. You've set up servers to handle iSCSI and DHCP, set up one prototype client from which client disks can be automatically generated, and can easily scale to hosting many more clients just by increasing the number 5 to something larger. (You'd probably want to switch to using LVM logical volumes instead of file-backed loopback devices for performance reasons, though.) The number of clients you can quickly provision is limited only by the capacity of your network. And the next time one of your users decides their computer is an excellent place to stick refrigerator magnets, they won't be creating any additional headaches for you.

About

Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.