Nodebootstates

A quick guide to state transitions in the Emulab boot process

Overview of the Process:

PC nodes are configured to boot using PXE via PXE-enabled NICs. PXE uses DHCP to determine what to boot next, and uses TFTP to download that next-level boot program. The Emulab "boss" node serves as the DHCP and TFTP server for this purpose. To ensure that Emulab always gains control of a node before anything else, PXE should only be enabled on the control network interface of a node. Having it enabled on experimental network interfaces as well would, at best, slow down the boot process and, at worst, cause nodes to respond to a DHCP server on one of the experimental networks.

The first-level boot program downloaded via TFTP (traditionally called "pxeboot") queries the boss node to find out what to do next. (The boss node is identified either explicitly by the next-server parameter in the DHCP reply or, if that is not set, implicitly by the host that responded to the DHCP request.) The "what to do" query is made through the Emulab-specific "bootinfo" protocol using the network routines provided by the PXE BIOS. The bootinfo server on boss can return a variety of responses:

A disk and partition number. In this case, pxeboot will perform a chain boot by using BIOS functions to load the first sector of the indicated disk/partition (the "partition boot block") into memory and then jump to that code.

A directory path on the TFTP server. In this case, the specified directory contains the information necessary to load and boot an OS kernel running from a memory-based filesystem image. These kernel/FS combos, called "MFSes", run entirely out of RAM and are used for loading, examining, or saving the contents of the local hard drive. Note that MFSes are strictly an Emulab infrastructure mechanism currently, though we would like to make this boot path available to users so that they can specify node OSes that run out of memory and not use the hard drive (except perhaps as a cache or for soft-state).

An order to wait. In this case, the node will wait for 1ms and query again. This command is returned to nodes which are currently free (not assigned to any experiment). Having free nodes sit in "PXE wait" allows for fast experiment startup in certain cases. Since free nodes have one or more default OS images loaded, an experiment requesting a node with an OS that has been pre-loaded requires only a change to what bootinfo returns to that node, giving almost instant startup.

For the disk/partition and MFS cases, bootinfo may also return a kernel command line to use when booting. Originally this was intended for network-booted, Multiboot-format, OSKit kernels (circa early-2000s), but that support atrophied long ago. We do still support command lines, even allowing them to be set per-node via the web interface. However, whether this command line feature works depends on the implementation of the boot process (see the Implementation section below) and on whether a disk or an MFS boot is being attempted. This option is used for optimization of "traffic-shaping" nodes, allowing their kernels to be booted at a higher HZ rate.
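The three kinds of reply, plus the optional command line, can be modeled roughly as follows. This is only an illustrative Python sketch of the decision logic described above, not the bootinfo wire protocol or the actual pxeboot code. The class names (BootPartition, BootMFS, Wait) and the function pxeboot_loop are hypothetical, and the query and sleep callables stand in for the real network query and delay.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class BootPartition:
    """Reply: chain boot from a disk/partition (hypothetical model)."""
    disk: int
    partition: int
    cmdline: Optional[str] = None


@dataclass
class BootMFS:
    """Reply: load a kernel and memory filesystem from a TFTP directory."""
    tftp_dir: str
    cmdline: Optional[str] = None


@dataclass
class Wait:
    """Reply: node is free; wait and ask again."""


Reply = Union[BootPartition, BootMFS, Wait]


def pxeboot_loop(query: Callable[[], Reply], sleep: Callable[[], None]) -> str:
    """Query until bootinfo returns something other than Wait.

    Returns a description of the boot action instead of performing it.
    """
    while True:
        reply = query()
        if isinstance(reply, Wait):
            sleep()          # free node: pause, then query again
            continue
        if isinstance(reply, BootPartition):
            action = f"chain boot disk {reply.disk:#x} partition {reply.partition}"
        else:
            action = f"load MFS from {reply.tftp_dir}"
        if reply.cmdline:
            # Whether the command line is honored depends on the loader.
            action += f" with command line {reply.cmdline!r}"
        return action
```

A free node in this model simply keeps receiving Wait until boss changes what bootinfo returns for it, at which point the loop exits with the new boot action.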

Implementations of the PXE-boot process:

We have implemented both FreeBSD and Linux based versions of pxeboot and the MFSes.

In the FreeBSD version, pxeboot is a PXE-savvy, modified version of the standard "second-level" FreeBSD boot loader (i.e., "/boot/loader"). It has been modified to support bootinfo queries using the PXE BIOS's network functions. The FreeBSD pxeboot will only pass command lines to FreeBSD kernels booted from a partition; otherwise the command line is ignored. This is in support of traffic-shaping nodes ("delay nodes") where we specify the kernel HZ value on the command line. There are three MFSes: a heavily scaled-back version of a FreeBSD installation used for imaging nodes (the "Frisbee MFS"), a more complete version used for creating node images and general examination of the disk (the "admin MFS"), and a version with specialized startup scripts used to allow unknown nodes to join the testbed (the "newnode MFS"). All three use a GENERIC configuration of the kernel.

In the Linux version, pxeboot is a PXE-savvy, modified version of grub2. There is only one MFS, based on a busybox configuration of ... The "role" of the MFS is determined by a new "elab_mode" kernel command line parameter.

Note that the versions of pxeboot and the MFSes are paired: you cannot mix and match FreeBSD pxeboot with Linux MFSes or vice versa. This is primarily due to laziness on our part and the fact that the loaders are best suited to loading their respective kernels. (Yes, Grub can be used to load lots of things, but it is mostly used for Linux.) I believe that the Grub-based pxeboot can pass command lines for either Linux or FreeBSD kernels booted from a partition. Again, the only Emulab infrastructure use of this is when booting traffic-shaping nodes.

Common workflows:

There is a big "circle of life" thing going on here with nodes, so we will arbitrarily start with:

Free node. A node enters the free state when it has gone through the "reloading" cycle after being freed from an experiment. The last step of the reload process is to reboot the node. When the node reboots, it PXEs, downloads pxeboot, queries boss and gets the "wait" response. The node continues to query at 1ms intervals until it gets a different response.

Node allocated to an experiment, booting a pre-installed OS. When a node is first allocated to an experiment, and the proper OS has been pre-installed, then the subsequent pxeboot query will return "boot from disk D, partition P". The disk is currently always 0 (or rather 0x80, as bootinfo uses BIOS ids) and the partition is between 1 and 4 (meaning, we assume MBR and only primary partitions). (In the past we pre-loaded both FreeBSD in partition 1 and Linux in partition 2 on free nodes, so P would be either 1 or 2.) Pxeboot will use BIOS routines to read the MBR partition table from the appropriate disk to find the start sector for the appropriate partition, and then load the first sector of the partition into memory and jump to it. There is also the undocumented partition "0" which means to boot through the MBR bootblock. This can be used for non-Linux/FreeBSD OSes (e.g., ESXi) which have special boot code or that do not support a partition bootblock. Here pxeboot reads the entire MBR sector into memory and jumps to the beginning of it.
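To make the partition lookup concrete, here is a minimal Python sketch of finding the start sector for partition P from a raw MBR sector. It is not the pxeboot code (which uses BIOS routines and runs before any OS); the function name partition_start_lba is hypothetical. The offsets are the standard MBR layout: four 16-byte entries starting at byte 446, the partition type at byte 4 of each entry, the starting LBA as a little-endian 32-bit value at byte 8, and the 0x55 0xAA signature at bytes 510-511.

```python
import struct

SECTOR_SIZE = 512
PART_TABLE_OFFSET = 446   # start of the four primary partition entries
PART_ENTRY_SIZE = 16
MBR_SIGNATURE = b"\x55\xaa"


def partition_start_lba(mbr: bytes, partition: int) -> int:
    """Return the sector to load for the given bootinfo partition number.

    partition 0 means "boot through the MBR bootblock", i.e. sector 0.
    partitions 1-4 are the primary MBR partitions.
    """
    if len(mbr) != SECTOR_SIZE or mbr[510:512] != MBR_SIGNATURE:
        raise ValueError("not a valid MBR sector")
    if partition == 0:
        return 0
    if not 1 <= partition <= 4:
        raise ValueError("only primary partitions 1-4 are supported")

    offset = PART_TABLE_OFFSET + (partition - 1) * PART_ENTRY_SIZE
    part_type = mbr[offset + 4]
    # Starting LBA and sector count are little-endian 32-bit integers.
    start_lba, num_sectors = struct.unpack_from("<II", mbr, offset + 8)
    if part_type == 0 or num_sectors == 0:
        raise ValueError(f"partition {partition} is empty")
    return start_lba
```

The returned value is the sector whose contents (the partition boot block, or the MBR itself for partition 0) would be loaded into memory and jumped to.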

Node allocated to an experiment, needs OS loaded. When a node is first allocated to an experiment, and the proper OS has NOT been pre-installed, then the subsequent pxeboot query will return "boot from the frisbee MFS". In this case, pxeboot, whether it be the FreeBSD loader or Grub, will read (via TFTP) a configuration file from the named directory on boss and use that to guide what kernel and root filesystem image to read (also via TFTP) and boot. Essentially, this is a regular boot where /boot just happens to be across the network in the named server directory--all the same files are used (e.g., grub/grub.cfg or loader.conf). Once booted, Emulab startup scripts inside the filesystem image will invoke Frisbee to image the local disk and then perform any necessary post-load customizations (e.g., fixing up /etc/fstab, setting the console type). Finally, it will reboot the node and we will have...

Normal reboot of a node allocated to an experiment. This behaves just like the pre-installed OS case above. The node PXE boots into pxeboot, pxeboot queries bootinfo, bootinfo returns "boot from disk D, partition P", and we are off to the races. Note that a user or Emulab admin could use the Emulab web interface to change what partition a particular node boots from (as long as that partition contains an OS) and that will be reflected in the bootinfo response when the node is next rebooted. A node can also be told, via the web interface or command line tool, to boot into the "admin MFS" which leads us to the final possibility...

Node placed in "node admin" mode. Here the node is rebooted and upon PXE boot and bootinfo query, gets back a "boot from an MFS" reply. The only difference is the server directory that is used, e.g., "/tftpboot/admin" rather than "/tftpboot/frisbee". For the FreeBSD-based implementation, the only difference in directory contents is the root FS image--the admin MFS using a larger, more general filesystem image. For the Linux-based implementation, the difference is the grub.cfg file that uses the same kernel and initrd as the disk loading MFS, but specifies a different elab_mode command line parameter.

Interactive use of pxeboot. One final workflow is interactive use of pxeboot. When pxeboot starts, it can be interrupted on the console so that you can override the default action that boss would otherwise dictate. At the pxeboot prompt, you can explicitly specify booting from a partition or any of the MFSes. You can also set the command line to use. This mode of operation is used strictly by admins for debugging.

State transitions:

When nodes are allocated to, or freed from, experiments, their state is monitored by the central "stated" daemon which makes sure that the node follows an expected series of steps required to get the node to a usable state or get it reloaded. Stated can perform actions on state transitions (correct or not) or when a node fails to make a transition within a reasonable amount of time. A typical action on an invalid transition or a timeout is to reboot the node (i.e., it assumes the failure is transient). While a node is in an experiment, it continues to be monitored, but the only events of significance are reboots (which trigger no actions), disk reloads (which trigger the state machine described below), or experiment swapouts (which trigger the disk reloading sequence as well).

The current state of a node is recorded in the database. State transitions may be generated by the node itself using "tmcc" or by the control infrastructure on behalf of a node. Control programs that generate events include: tmcd (events from the node), bootinfo (reboot related events), and node_reboot (ditto). State transition events are sent from these programs using the event system. Stated listens for all such events (TBNODESTATE). Nodes can query their state through tmcc. Control programs query state typically by accessing the state in the DB.

As mentioned, free nodes sit in pxeboot, looping and waiting for an order. This is the PXEWAIT state. When a node is allocated, the node_reboot script sends a wakeup (via the bootinfo protocol) and changes the node state to PXEWAKEUP. The node will then contact the bootinfo server on boss to see what to do. When the node makes this request, it is moved to the PXEBOOTING state (by the bootinfo server on behalf of the node). After bootinfo determines the response, and if it sends back a reply to either boot from disk or an MFS, it changes the node state to BOOTING. The node stays in this state until the node OS's Emulab client scripts start up. The first such script moves the state to TBSETUP and then the final script moves it to ISUP. If any script fails along the way, it instead moves the state to TBFAILED and aborts setup.

For a normal node reboot, a node will start out in the ISUP state. During the shutdown process, a node sends a shutdown event via tmcc, moving the node to the SHUTDOWN state. When the node reboots and gets into pxeboot, it makes a request to the bootinfo server and is moved to the PXEBOOTING state, and then follows the sequence above (to BOOTING, TBSETUP, and then to ISUP or TBFAILED).
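The allocation and reboot sequences above can be summarized as a transition table. The following Python sketch is only a reading of the prose, not the state machine definition that stated actually loads from the database; the names ALLOWED and first_invalid are hypothetical. The PXEBOOTING to PXEWAIT edge is our assumption for the case where bootinfo answers "wait".

```python
from typing import Dict, List, Optional, Set

# Transitions taken from the description above (an approximation).
ALLOWED: Dict[str, Set[str]] = {
    "PXEWAIT":    {"PXEWAKEUP"},            # node_reboot sends a wakeup
    "PXEWAKEUP":  {"PXEBOOTING"},           # node queries bootinfo
    "PXEBOOTING": {"BOOTING", "PXEWAIT"},   # PXEWAIT edge is an assumption
    "BOOTING":    {"TBSETUP"},              # first Emulab client script runs
    "TBSETUP":    {"ISUP", "TBFAILED"},     # final script, or a failure
    "ISUP":       {"SHUTDOWN"},             # shutdown event sent via tmcc
    "SHUTDOWN":   {"PXEBOOTING"},           # node is back in pxeboot
}


def first_invalid(states: List[str]) -> Optional[int]:
    """Return the index of the first state reached by a disallowed
    transition, or None if the whole sequence is valid."""
    for i in range(1, len(states)):
        if states[i] not in ALLOWED.get(states[i - 1], set()):
            return i
    return None
```

For example, the normal reboot path ISUP, SHUTDOWN, PXEBOOTING, BOOTING, TBSETUP, ISUP passes this check, while a node that jumps from ISUP straight to BOOTING would be flagged at index 1.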

When a node is freed via the swapout process, the node is set to use the RELOADING state machine that drives the node through a different set of transitions. First the node is rebooted and told to load and execute the "frisbee MFS." After the BOOTING state, the rc.frisbee client script uses tmcc to report states RELOADSETUP (when it starts), RELOADING (when it actually starts the frisbee client), and RELOADDONEV2 (when post-frisbee tweaks are finished). The final state tells stated to reboot the machine, from which it reenters the PXEBOOTING/PXEWAIT states. [ The reason the node does not reboot itself after finishing is to close a race condition where the node could get back to pxeboot before stated had the chance to receive the DONE message and/or clear the reload-related DB state. ]
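As a rough illustration of the reload sequence, the following Python sketch shows how a monitor in the role of stated might react to each state reported by rc.frisbee. It is a simplification under our own assumptions, not the stated implementation: the function name on_reload_report and the "reboot" action string are hypothetical, and timeouts are not modeled.

```python
from typing import List, Optional, Tuple

# Order of states during a reload, as described above.
RELOAD_SEQUENCE: List[str] = [
    "BOOTING", "RELOADSETUP", "RELOADING", "RELOADDONEV2",
]


def on_reload_report(current: str, reported: str) -> Tuple[bool, Optional[str]]:
    """Return (transition_is_valid, action) for a reported state.

    An unexpected report triggers a reboot (failure assumed transient).
    RELOADDONEV2 also triggers a reboot, but as the normal final step:
    the monitor, not the node, reboots the machine, which avoids the
    race described above.
    """
    try:
        expected = RELOAD_SEQUENCE[RELOAD_SEQUENCE.index(current) + 1]
    except (ValueError, IndexError):
        expected = None          # current state has no successor here
    if reported != expected:
        return (False, "reboot")
    if reported == "RELOADDONEV2":
        return (True, "reboot")
    return (True, None)
```

Note that both outcomes can end in a reboot; the first element of the result distinguishes a completed reload from a recovery attempt.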