Talking about design & implementation of solutions using modern OSS frameworks, tools, open standards and cutting-edge technology

Thursday, June 7, 2012

RAM-only PXE boot & the "smallest" diskless Linux box

This is a how-to for easily creating a very small Linux box that runs purely in RAM and boots via PXE. It is an introductory topic (not new at all) for further posts about scalability, load balancing and high availability. That's why I mention clustering so often, and why I start simple and small by preparing a single node that boots smoothly in a controlled environment.

Why RAM-only & diskless?

I must say that such a configuration of a Linux box can bring many advantages if you plan to assemble a cheap cluster with low maintenance costs and without persistence in mind. Consider the following:

RAM availability: RAM is cheaper and faster than in past years

modern hardware, big RAM: a lot of hardware supports a large amount of installed RAM, from 1 GB to 128 GB and beyond

Linux rocks: 64-bit Linux systems are able to manage a lot of RAM efficiently

HDD & the planet: electromechanical HDDs have serious implications for energy consumption, recycling and NOISE, and are more susceptible to failures

SSD & your wallet: SSDs are more advanced than their electromechanical counterparts (less susceptible to physical shock, silent, with lower access time and latency) but, at present market prices, more expensive per unit of storage. So if your problem is not storage, just processing, caching and networking, you are in the right place!

less is sometimes cheaper: it's not a bad idea if you have a good chance to buy cheaper nodes, in parts or complete, without HDDs

crashing doesn't matter: if a misbehaving node crashes you just need to restart it and it'll wake up again in a healthy state. A single node's state doesn't drift over time

scaling better: adding a node to the cluster is easy, just connect it, enable PXE boot and add an entry in DHCP config

network congestion is reduced: the RAM filesystem is copied once per boot to the target node

Life is easy and the cluster's maintenance costs are reduced, but remember that this only holds if you don't need persistence on every single node, just CPU power, networking and RAM.

When don't I need persistence?

I don't have a full inventory of persistence-less, memory-and-network-only scenarios, but here is a short practical list. I'm sure you can see the benefits:

cryptographic stuff, privacy: you need to run a cryptographic algorithm and ensure a full cleanup of private keys after the execution is complete. Formatting a HDD is sometimes not enough, while recovering data from RAM after a full power-off is very difficult, if not impossible. An encrypted filesystem on top of the RAM should also be challenging for hackers

caching efficiently: if you have enough RAM and your backend cluster is under constantly growing demand for static content, you can delegate all your caching needs to a dedicated frontend cluster running purely in RAM. This relieves the backend servers, which then process only dynamic content

time only algorithms: many algorithms need only processing power and a low/medium memory footprint; some of them only need volatile (non-persistent) memory for allocating data structures

display only apps: some software solutions only need to display incoming data via graphs, video streaming, etc. So a good display, a RAM-only system and a network are enough

What will I obtain at the end of this guide?
A Linux box, which I named rambox, running purely in RAM. That means the root (/) filesystem is mounted in RAM, which is why preserving memory is a priority, as is avoiding a filesystem full of never-used files that only increase memory usage.

We'll also compile a customized Kernel to shrink it, with a "minimal" set of features incorporated. Keeping it simple and small! At this point you should be careful not to omit mandatory Kernel features. There is another set of features that are not mandatory but are useful for getting the best performance. They mainly depend on your hardware, so take care of them.

What's a RAM filesystem? A filesystem mounted in RAM isn't a new invention; it's an awesome Kernel feature mostly used to load firmware/modules before starting the normal boot process. It's called initrd or initramfs. There are differences between the two (see references) and we'll be using initramfs.

NOTE: Notice that the BIOS used for QEMU and KVM virtual machines (SeaBIOS) supports an open source implementation of PXE named gPXE, so a KVM-based virtual machine is able to boot via the network. Nowadays almost any motherboard should have a BIOS with PXE support. Ensure that your rambox supports it by checking the BIOS setup.

the Linux PXE boot loader takes control and uses the same IP configuration to connect to the TFTP server and fetch two files: the kernel and the ramdisk

the Kernel takes control and configures its network interface, either statically or by performing a second round of DHCP requests, depending on its configuration

the Kernel uncompresses the ramdisk in memory

the RAM disk is mounted on / and the /init script gets invoked

What do we have to configure and where? Everything takes place on the PXE server computer:

Install and configure a DHCP server with support for PXE extensions

Install and configure a TFTP server

Create a reduced ramdisk with a minimal set of utils and programs

Compile and optionally shrink the Kernel to include support for Kernel-level IP configuration, including NIC drivers

Put everything in the correct place and wake up the rambox!

There are several detailed explanations of the Linux boot process; some of them are outdated but still useful. For now, I won't give a full description of every single step of the boot process, ramdisk, PXE, Kernel-level IP, etc. (see references)

NOTE: In this configuration the nodes will always get the same IP addresses, leased by MAC address, and nodes with an unknown hardware address will be rejected. You can easily change this behavior by replacing the "deny unknown-clients" directive with "allow unknown-clients" and deleting all the host entries.
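To make the NOTE above concrete, here is a minimal sketch of what such an ISC dhcpd configuration could look like. It is not my exact file: the subnet, the addresses, the MAC and the pxelinux.0 file name are all made-up examples, and write_dhcpd_sample is just a hypothetical helper that writes the sample to a path you choose so you can review it before touching the real config.

```shell
# Sketch only: every address, MAC and file name below is a made-up example.
write_dhcpd_sample() {
    # $1: path of the file to write (a scratch copy, not the live config)
    cat > "$1" <<'EOF'
# Reject clients that have no host entry below
deny unknown-clients;

subnet 192.168.1.0 netmask 255.255.255.0 {
    option routers 192.168.1.1;
    next-server 192.168.1.10;   # TFTP server address
    filename "pxelinux.0";      # boot loader fetched over TFTP
}

# One entry per node: the MAC pins the node to a fixed IP
host rambox {
    hardware ethernet 52:54:00:12:34:56;
    fixed-address 192.168.1.101;
}
EOF
}
```

Calling write_dhcpd_sample /tmp/dhcpd.conf.sample gives you a file to adapt. Adding a node then means adding one more host block with its MAC and address.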

Configuring TFTP

To enable the TFTP server, edit /etc/xinetd.d/tftp replacing the word yes on the disable line with the word no. Then save the file and exit the editor:
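If you prefer not to open an editor, the same change can be sketched with sed. This assumes GNU sed (for -i) and the usual "disable = yes" layout of the xinetd service file; enable_tftp is a hypothetical helper name, not part of any package.

```shell
enable_tftp() {
    # $1: xinetd service file, normally /etc/xinetd.d/tftp
    # Flip "disable = yes" to "disable = no", keeping the original spacing
    sed -i 's/^\([[:space:]]*disable[[:space:]]*=[[:space:]]*\)yes/\1no/' "$1"
}

# Typical use (as root), followed by a restart of xinetd so it rereads the file:
#   enable_tftp /etc/xinetd.d/tftp
```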

Configuring the PXE environment

Create the PXE config directory under the TFTP root. This directory will contain a single configuration file per node or per subnet:

$ sudo mkdir -p /var/lib/tftpboot/pxelinux.cfg

The Linux PXE boot loader uses its own IP address in uppercase hexadecimal format to look for a configuration file under the pxelinux.cfg directory. If it's not found, it removes the last hex digit and tries again, repeating until it runs out of digits, and finally falls back to a file named default. That's why I define a helper function to convert a dotted-decimal IPv4 address to a hexadecimal string:
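A minimal sketch of such a helper, plus a second function that lists the file names the boot loader would try for a given address. The names ip_to_hex and pxe_config_candidates are my own for this sketch. Note that PXELINUX drops one hexadecimal digit at a time, and that it tries UUID- and MAC-based names before the IP-based ones; those earlier lookups are left out here.

```shell
ip_to_hex() {
    # $1: dotted-decimal IPv4 address, e.g. 192.168.0.1
    # The unquoted substitution splits the four octets into separate arguments
    printf '%02X%02X%02X%02X\n' $(echo "$1" | tr '.' ' ')
}

pxe_config_candidates() {
    # Print the IP-based config names in lookup order, then "default"
    name=$(ip_to_hex "$1")
    while [ -n "$name" ]; do
        echo "$name"
        name=${name%?}      # drop the last hex digit
    done
    echo default
}
```

For example, ip_to_hex 192.168.0.1 prints C0A80001, so a per-node file would be pxelinux.cfg/C0A80001, while a file named C0A800 would match every node in 192.168.0.0/24.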

Creating a compressed root filesystem

The Kernel support for initramfs allows us to create a customizable boot process to load modules and provide a minimalistic shell that runs in RAM. An initramfs disk is nothing more than a compressed cpio archive that is either embedded directly into your kernel image or stored as a separate file which can be loaded by the Linux PXE boot loader. Embedded or not, it should always contain at least:

a minimum set of directories:

/sbin -- Critical system binaries

/bin -- Essential binaries considered part of the system

/dev -- Device files, required to perform I/O

/etc -- System configuration files

/lib, /lib32, /lib64 -- Shared libraries to provide run-time support

/mnt -- A mount point for maintenance and use after the boot/root system is running

/proc -- Directory stub under which the proc filesystem is mounted
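The directory list above can be created in one go. This is only a sketch: make_skeleton is a hypothetical helper and the target directory is whatever working area you choose.

```shell
make_skeleton() {
    # $1: empty working directory that will become the initramfs root
    for d in bin sbin dev etc lib lib32 lib64 mnt proc; do
        mkdir -p "$1/$d"
    done
}

# Example: make_skeleton "$HOME/rambox/initramfs"
```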

Is there any other simple method to create the RAM disk? An initramfs can also be created by copying the content of an already installed Linux distro into an empty directory and then packaging it, but be aware that you will carry along undesired and/or useless files. There are other methods, some of them simple, some of them not, but they are outside the scope of this guide, which aims to show you a handy approach to obtaining a lightweight RAM disk and Kernel

Use the following steps to create the initramfs:

Create a download cache and a working area, and define a helper command to download and cache archives:
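A minimal sketch of such a helper, assuming wget is installed. The CACHE_DIR location and the fetch_cached name are placeholders of mine; adapt them to your own layout.

```shell
# Hypothetical cache location; override CACHE_DIR to taste
CACHE_DIR="${CACHE_DIR:-$HOME/rambox/cache}"

fetch_cached() {
    # $1: URL of the archive. Prints the local path of the cached copy.
    file="$CACHE_DIR/$(basename "$1")"
    mkdir -p "$CACHE_DIR"
    if [ ! -s "$file" ]; then
        # Not cached yet (or empty): download it, and drop partial files on failure
        wget -q -O "$file" "$1" || { rm -f "$file"; return 1; }
    fi
    echo "$file"
}
```

Because the function prints the cached path, it can be used inline, e.g. tar xjf "$(fetch_cached "$url")", and a second run does not hit the network again.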

Busybox is a handy tool used very often in ramdisks and small devices with very limited resources, providing a self-contained and minimal set of POSIX-compatible Unix tools in a single executable file. I'll be using busybox in this guide. Get busybox and create the sh symbolic link:
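The symbolic link part can be sketched as follows. It assumes you have already placed a busybox binary at bin/busybox inside the initramfs working directory; link_busybox_sh is a hypothetical helper name.

```shell
link_busybox_sh() {
    # $1: initramfs working directory containing bin/busybox
    if [ ! -x "$1/bin/busybox" ]; then
        echo "no executable busybox in $1/bin" >&2
        return 1
    fi
    # Relative link, so it still resolves once the tree is mounted on /
    ln -sf busybox "$1/bin/sh"
}
```

The link is relative on purpose: an absolute link to the working directory would dangle inside the booted rambox.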

One of the most important phases is the execution of the /init script, a simple shell script that performs the whole initialization process on the ramdisk. It usually mounts all filesystems listed in fstab, creates device nodes (like the udev device manager does), loads device firmware and finally mounts the real root (/) from another device and launches that root's /sbin/init. This is the point where we intervene, by just launching the shell or by executing our own /sbin/init without remounting the root (/). So edit the init script and add the following content:

As you may notice, every single step is commented. Still, here is an overall explanation of the process:

/etc/profile is sourced to export PATH variable and make all executables reachable

All busybox's symbolic links are created

Some special devices are created by hand

All /etc/fstab filesystems are mounted

The rest of the devices are discovered and created by busybox's mdev

The Kernel command line located at /proc/cmdline is parsed to see if the shell parameter was supplied. If so, the shell is immediately launched, replacing the current process instance, hence everything else is ignored

The Kernel command line is checked again to see if the rambox parameter was supplied, indicating that we want to keep the ramdisk mounted at / and launch the normal /sbin/init process

If neither the shell nor the rambox parameter was supplied, try to mount the new root (/) and launch /sbin/init from this new location

Finally, if either the new root cannot be mounted or /sbin/init cannot be executed, a shell is launched indicating this situation

If an error occurs in any of these shell-launching steps, a Kernel panic is issued
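The decision made from the Kernel command line in the steps above can be sketched as a small function. This is an illustration of the logic, not the actual /init script: boot_mode and the "newroot" label are names I made up for the sketch.

```shell
boot_mode() {
    # $1: the Kernel command line, e.g. the content of /proc/cmdline
    # "shell" wins over everything else, so look for it first
    for word in $1; do
        [ "$word" = "shell" ] && { echo shell; return 0; }
    done
    # Then "rambox": keep the ramdisk as / and run the normal /sbin/init
    for word in $1; do
        [ "$word" = "rambox" ] && { echo rambox; return 0; }
    done
    # Neither was given: the caller should try to switch to a new root
    echo newroot
}

# Inside /init this could drive the final step, for example:
#   case "$(boot_mode "$(cat /proc/cmdline)")" in
#       shell)  exec /bin/sh ;;
#       rambox) exec /sbin/init ;;
#   esac
```

Matching whole words avoids false positives from parameters that merely contain the text, such as a hypothetical myshell=1.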

Add execute permission to /init:

$ chmod +x init

Change the ownership of everything to root:

$ sudo chown -R root:root *

Create the initramfs.cpio.gz compressed archive and copy it to tftp's root directory:
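A common way to do the packaging, sketched here under the assumption that GNU cpio and gzip are installed. The Kernel expects the "newc" cpio format for an initramfs. pack_initramfs is a hypothetical helper name, and the paths in the example are placeholders.

```shell
pack_initramfs() {
    # $1: initramfs working directory, $2: output file (keep it outside $1)
    # -print0/--null keep odd file names intact; -H newc is the format the Kernel reads
    ( cd "$1" && find . -print0 | cpio --null --quiet -o -H newc | gzip -9 ) > "$2"
}

# Example, then copy the result to the TFTP root:
#   pack_initramfs "$HOME/rambox/initramfs" /tmp/initramfs.cpio.gz
#   sudo cp /tmp/initramfs.cpio.gz /var/lib/tftpboot/
```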

Now the Kernel stuff:

What I am about to do with the Kernel is very simple: compile it using a minimal set of features that lets it boot and recognize MY hardware, mainly the NIC device. Hence, depending on your hardware, you will probably need a different selection of features when compiling the Kernel. So I recommend first doing a one-time installation of any modern Linux distribution (like I did), such as CentOS, Gentoo, Fedora, Debian or Ubuntu, with a modern Kernel version, and checking the modules loaded on boot using /sbin/lsmod. Then, using this list of modules, look for the corresponding Kernel options and INCLUDE them all in the Kernel, making it rock solid! That's what I did.
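To get a clean list to work from, something like the following sketch can help. lsmod takes its data from /proc/modules, so the function reads that file directly (or a saved copy of it); loaded_modules is a hypothetical name. Mapping each module to its Kernel option is still a manual search.

```shell
loaded_modules() {
    # $1 (optional): a saved copy of /proc/modules from the reference install
    # The module name is the first field of every line
    awk '{ print $1 }' "${1:-/proc/modules}" | sort
}

# Example: loaded_modules > modules.txt   # keep it next to your Kernel config
```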

NOTE: In our journey to make the Kernel simple and small, we should be careful not to omit critical Kernel features and lose hardware advantages, for example SMP. So if we really want to use it in a production environment, deep research and customization must be done first.

An ncurses menu dialog should open. Now check a "minimal" set of features and uncheck the unneeded ones. I'll only list what changes with respect to the clean configuration settings: [*] means explicitly checked, to be EMBEDDED into the Kernel, and [ ] means explicitly unchecked, so not included

File systems ---> (File systems are very important; which ones you support depends on your final goal: mounting a remote NFS share for shared storage? Using a GlusterFS / Ceph filesystem on top of a NAS? The configuration I used is the simplest one, with support only for initramfs and other pseudo filesystems. I recommend starting with this one, then gradually embedding your filesystems)

Power on the rambox and enjoy it! It should boot smoothly and launch the busybox's shell.

You will find the basic tools in /bin, /sbin, /usr/bin, /usr/sbin and /usr/local/sbin; all of these are in the PATH environment variable. To renew your IP address just run renew_ip. Finally, notice that no Kernel module is loaded, since everything you need is embedded.

Enjoy it!

Post install

Perform some checks after install to ensure that everything is OK and to measure resource consumption:
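One such check, sketched as a function that reads /proc/meminfo and reports how much RAM is in use. mem_used_mib is a hypothetical name. The figure is simply MemTotal minus MemFree, so it includes caches and whatever the RAM filesystem itself occupies, which is the number of interest on a rambox.

```shell
mem_used_mib() {
    # $1 (optional): a meminfo-formatted file, defaults to /proc/meminfo
    # Values in /proc/meminfo are in kB; print (MemTotal - MemFree) in MiB
    awk '/^MemTotal:/ { total = $2 }
         /^MemFree:/  { free = $2 }
         END { printf "%d\n", (total - free) / 1024 }' "${1:-/proc/meminfo}"
}
```

Running it right after boot and again after starting your workload gives a rough idea of how much headroom the node has left.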

Thanks much for this howto! I am a networking intern and I have had a similar idea though its implementation has been rough going... this is the 3rd howto I have tried to follow along these lines, but the only one that has gone fairly smooth! Thanks for the indepth explanations and for taking the time to write it all out. I will comment again when I am done (I'm compiling the kernel right now) and let you know if I could make it work for me. Thanks again for the time you have taken here, much appreciated!
