Yaird — Yet Another Mkinitrd

Erik van Konijnenburg

This memo discusses the design goals and implementation of
Yaird (Yet Another mkInitRD),
a proof-of-concept application to create an initial boot image,
a minimal filesystem used to bring a booting Linux kernel to a
level where it can access the root file system and use startup
scripts to bring the system to the normal run level. It differs
from earlier mkinitrd implementations in
that it leverages the information in sysfs to minimise the number
of special cases that the application has to know about, and in
that it uses a template system to separate the analysis of the
system from the building of the image.

Introduction

Yaird (Yet Another mkInitRD) is an
application to create an initial boot image, a minimal filesystem used
to bring a booting Linux kernel to a level where it can access
the root file system and use startup scripts to bring the system
to the normal run level.

It differs from earlier mkinitrd
implementations in that it attempts to leverage the information in
sysfs to minimise the number of special cases that the application
has to know about, and in that it uses a template system to separate
the analysis of the system from the building of the image.

This document gives an overview of the design and implementation
of Yaird; see the README file for
usage information. This text assumes familiarity with Linux
system administration and the basics of hotplug and sysfs.

This document describes version 0.0.12
of Yaird.
This is a rough proof-of-concept version.

Goals, features, to do

The purpose in life of a tool like Yaird
is to produce an initial boot image that loads the required modules
to allow a booting kernel to access the root file system and from
there use the startup scripts to get to the default run level.
This means that hardly any drivers need to be compiled into the kernel
itself, so a distribution can produce a kernel with a large number of
modules that will run unchanged on practically any hardware, without
introducing a large number of unused drivers that would waste RAM.
In a sense, the initial boot image customises the kernel to the hardware
it happens to be running on.

That purpose still leaves a lot of room to optimise for different
goals: as an example, you could attempt to make the generated
image as small as possible, or you could attempt to make the
generated image so flexible that it will boot on any hardware.
This chapter discusses the goals that determined the design, the
resulting features, and what's still left to do.

Be maintainable. Small functions with documented arguments
and result are better than a shell script full of constructs
like eval "awk | bash | tac 3>&7".

Be secure and reliable. The application should stop with an error
message at the slightest provocation, rather than run the
risk of producing a non-booting initrd image.
The application should not open loopholes that allow the 'bad
guys' to modify the image, gain access to raw devices or
overwrite system files.

Be distribution agnostic. Fedora and Debian run similar
kernels and similar startup scripts, so there's little
reason why the glue between the two levels should be
completely different.

Have limited footprint. The tools needed to build and run
the application should be few and widely available, with a
preference for tools that are installed anyway.

Be future proof. Future kernels may use different modules
and may change device numbers; the application should need
no changes to cope with such migrations.

Promote code reuse. Make functions side-effect free and
independent of context, so that it's easy to package the
core as a library that can be reused in other applications.

Generate small images. The application should accurately
detect what modules are needed to get the root file system
running and include only those modules on the generated
image.[2]

Requirements:

Linux 2.6.8 or later, both when running
yaird and when running the generated
image. By limiting the goal to support only recent kernels,
we can drastically reduce the number of special cases and
knowledge about modules in the application.

A version of modprobe suitable
for 2.6 kernels.

Sysfs and procfs, both on the old and on the
new kernel.

Perl and the HTML-Template module.

To achieve these goals, the following features are implemented:

Templating system to tune the generated image to a
given distribution; templates for Debian and Fedora FC3
included.

Interprets /etc/fstab, including
details such as octal escapes, ignore and
noauto keywords, and — for ext3 and reiser
file systems — label and uuid detection.
Where applicable, options in /etc/fstab
are used in the generated image.
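
The fstab handling described above can be sketched in shell; yaird's actual parser is Perl, and the sample fstab lines below are invented for illustration. Octal escapes such as \040 are decoded with printf '%b', and noauto and ignore entries are skipped:

```shell
# Sample fstab content (assumed, not a real /etc/fstab).
fstab='/dev/sda1 /boot ext3 defaults 0 2
/dev/sda2 /mnt/my\040disk ext3 noauto 0 0
LABEL=root / ext3 errors=remount-ro 0 1'
parsed=$(printf '%s\n' "$fstab" | while read -r dev mnt type opts dump pass; do
    # skip entries yaird would ignore
    case ",$opts," in *,noauto,*|*,ignore,*) continue ;; esac
    # %b decodes octal escapes like \040 in the mount point
    printf '%s %b %s\n' "$dev" "$mnt" "$type"
done)
printf '%s\n' "$parsed"
```

The noauto line is dropped, and the remaining entries keep their device (possibly a LABEL= form), mount point and file system type.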

Supports volume management via LVM2; activates only the volume
group required for the root file system.

Image generation understands how included executables may
depend on symbolic links and shared libraries. Shared libraries
work for both glibc and klibc.

Supports input devices such as USB keyboards, if the input
device supports sysfs.
Input devices are needed in the initial image to supply
a password for an encrypted root disk and to do debugging.

Basic support for kernel command line as passed by the boot
loader. Interprets init=, ro, rw.

Module aliases and options as specified in
/etc/modprobe.d are supported.

Interprets the blacklist information from hotplug.

Interprets the kernel configuration file that defines whether a
component is built in, available as a module or unavailable.
By maintaining a mapping between module name and config
parameter for selected modules, we avoid error messages if
for instance a required file system is built into the kernel.

Supports initramfs, both in Debian and Fedora versions.
An example template using the older initrd model
is included for Debian.

Does not require devfs in either the old or the new kernel.

Behaviour of the generated image can be tuned using
configuration files.

Obviously, this tool is far from complete. Here's a list of
features that still need to be implemented:

USB storage: no special provisions
are needed for code generation, but it is not tested yet.

Swsusp is not supported yet.

Firewire is not supported.

Loopback file systems are not supported yet.

Filesystems encrypted via loopaes are not supported yet.

Concepts

This section discusses the basic concepts underlying
yaird.
The main procedure of the program is this:

given some goals, make a plan with a number of actions
that the generated image should execute;

transform the plan to a detailed description of the image;

build and pack the image.

About Goals

The generated initial boot image should achieve a number of goals
before handing over control to the root file system. There is a
configuration file that determines what these goals are; the
default list of goals is as follows:

TEMPLATE name

Add the contents of the named template to the image.
It is not possible to pass arguments to the template.

MODULE name

Add the named module to the image.

INPUT

Add modules for every keyboard device found on the
system to the image.

NETWORK

Add modules for every ethernet device found on the
system to the image.

MOUNTDIR fsdir mountPoint

Given a directory that occurs in /etc/fstab,
get the underlying block device and file system type
working, then mount it at mountPoint.

MOUNTDEV blockDevice mountPoint

Given a block device that occurs in /etc/fstab,
get the block device and corresponding file system type
working, then mount it at mountPoint.
It is not possible to express activating a block device
without mounting it somewhere.

It is likely that new types of goal will need to be introduced
to support features such as software suspend.

Making the Plan

The goals listed in the configuration file need to be translated
into actions to be taken by the generated image.
As an example, before mounting a file system, the modules containing
the implementation of the file system need to be loaded.

To refine the goal of loading a kernel module,
the ModProbe module invokes the
modprobe command to find any
prerequisite modules, skipping any modules that are blacklisted
or compiled into the kernel. Aliases are handled transparently
by modprobe; module options are recorded to be included in the
initial image.
If the
modprobe command decides a module
needs an install command, an error is generated because we
cannot in general determine which executables the install
command would need to be on the initial boot image.
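
The shape of this processing can be sketched on sample modprobe --show-depends output (the module paths below are illustrative; yaird does the equivalent in Perl in its ModProbe module):

```shell
# Sample output of `modprobe --show-depends ext3` (assumed).
depends='insmod /lib/modules/2.6.8/kernel/fs/jbd/jbd.ko
insmod /lib/modules/2.6.8/kernel/fs/ext3/ext3.ko'
mods=$(printf '%s\n' "$depends" | while read -r verb path opts; do
    case $verb in
        # prerequisite modules come first, the requested module last
        insmod)  basename "$path" .ko ;;
        # an install command is a fatal error, as described above
        install) echo "install command not supported" >&2; exit 1 ;;
    esac
done)
printf '%s\n' "$mods"
```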

The KConfig module determines if loading a
module can be omitted because the module is hardcoded into the
kernel. As an example, it is aware of the fact that the module
ext3 is not needed if the new kernel configuration
contains CONFIG_EXT3_FS=y.[4]
Only a few modules are known: yaird
looks for modules such as ext3 when that filesystem
is used, so it makes sense to check whether a missing module
is compiled in. On the other hand, hardware modules that are
compiled in never show up in modules.pcimap
and friends, so they remain completely outside the view of
yaird.
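
The check that KConfig performs can be sketched as a lookup in the kernel configuration file (sample config content below; the real file is something like /boot/config-2.6.8, and the mapping from module name to config parameter is maintained by hand):

```shell
# Sample kernel configuration content (assumed).
config='CONFIG_EXT3_FS=y
CONFIG_REISERFS_FS=m'
state_of () {
    # classify a config parameter: builtin (=y), module (=m), or absent
    case $(printf '%s\n' "$config" | grep "^$1=" || :) in
        *=y) echo builtin ;;
        *=m) echo module ;;
        *)   echo absent ;;
    esac
}
ext3=$(state_of CONFIG_EXT3_FS)
reiser=$(state_of CONFIG_REISERFS_FS)
jfs=$(state_of CONFIG_JFS_FS)
printf 'ext3=%s reiserfs=%s jfs=%s\n' "$ext3" "$reiser" "$jfs"
```

Only the builtin case suppresses the "missing module" error; a module result means the module must go on the image as usual.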

Before a device as listed in /etc/fstab can be
mounted, that device needs to be enabled. That device could be an
NFS mount, a loopback mount or it could be a block device.
The loopback case is not supported yet, but block devices are.
This support is based on a number of sources of information:

Scanning the /dev directory gives us the
relation between all block special files and major/minor
numbers.

Scanning the /sys/block directory gives us the
relation between all major/minor numbers and kernel names
such as dm-0 or sda1; it
also gives the relation between partitions and complete
devices.

If there is a symlink in a /sys/block
subdirectory to the /sys/devices
directory, it also gives us the relation between a block
device and the underlying hardware.
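
The /sys/block scan can be sketched on a simulated tree (a temporary directory stands in for the real /sys; the real code reads the dev file in every /sys/block entry and partition subdirectory):

```shell
# Build a fake /sys/block with the dev files the scan relies on.
sys=$(mktemp -d)
mkdir -p "$sys/block/sda/sda1" "$sys/block/dm-0"
echo 8:0   > "$sys/block/sda/dev"
echo 8:1   > "$sys/block/sda/sda1/dev"
echo 254:0 > "$sys/block/dm-0/dev"
map=$(for f in "$sys"/block/*/dev "$sys"/block/*/*/dev; do
    [ -f "$f" ] || continue
    d=${f%/dev}
    # kernel name (directory basename) -> major:minor
    printf '%s %s\n' "${d##*/}" "$(cat "$f")"
done | sort)
rm -rf "$sys"
printf '%s\n' "$map"
```

The nesting of sda1 under sda is what gives the partition-to-disk relation mentioned above.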

Based on the kernel name and partition relationships of the device,
we determine the steps needed to activate the device. As an example,
to activate sda1, we need to activate sda,
then create a block special file for sda1. As
another example, to activate dm-0, our first bet is
to check whether this is an LVM logical volume; if so, we activate the
physical volumes underlying the volume group and finally run
vgchange -a y.
Otherwise, it could be an encrypted device, for which we
generate different code.
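
Rendered as the script fragments the image would contain, the two examples above look roughly like this (a dry-run wrapper is used so nothing is executed here; the device numbers and the volume group name are illustrative):

```shell
run () { echo "+ $*"; }            # dry-run: print instead of execute
plan=$(
    run mknod /dev/sda b 8 0       # activate the whole disk first
    run mknod /dev/sda1 b 8 1      # then create the partition's node
    run vgchange -a y rootvg       # or, for dm-0: activate the volume group
)
printf '%s\n' "$plan"
```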

Hardware Planning

Some devices, such as sdx or hdy, are
expected to have underlying hardware; as an example, sda
may be backed by pci0000:00/0000:00:1f.2/host0/0:0:0:0.
This represents a hardware path, in this case a controller on the PCI
bus that connects to a SCSI device. In order to use the device,
every component on the path needs to be activated, the component
closest to the CPU first.
Based on the pathname in /sys/devices and on
files within the directory for the component, we can determine
what kind of component we're dealing with, and how to find the
required modules.

Finding modules closely follows the methods used in the
hotplug package, and the
hotplug approach in turn is an almost
literal translation of the code that the kernel uses to find a
driver for a newly detected piece of hardware.

For components that talk some protocol over a bus, like SCSI or
IDE disks or CDROMs, this is a simple hard coded selection; as an
example, the ScsiDev module knows that a SCSI device
with a type file containing "5" is a CDROM,
and that sr-mod is the appropriate driver.

Devices such as PCI or USB devices cannot be classified into
a few simple categories. These devices have properties such
as "Vendor", "Device" and "Class" that are visible in sysfs.
The source code of kernel driver modules for these devices
contains a table listing which combination of properties mark a
device that the driver is prepared to handle. When the kernel
is compiled, these tables are summarised in a text file such
as modules.pcimap. Based on this table,
we find a driver module needed for the device and mark it for
inclusion on the image.
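
The matching rule can be sketched on one sample modules.pcimap line (the real maps live under /lib/modules/(version); 0xffffffff acts as a wildcard for a field):

```shell
# Map format: module vendor device subvendor subdevice class mask data.
pcimap='e1000 0x00008086 0x00001000 0xffffffff 0xffffffff 0x000000 0x000000 0x0'
# Properties as read from the sysfs directory of the device.
vendor=0x00008086
device=0x00001000
match=$(printf '%s\n' "$pcimap" | while read -r mod v d rest; do
    # a field matches if it is equal or the wildcard 0xffffffff
    { [ "$v" = "$vendor" ] || [ "$v" = 0xffffffff ]; } || continue
    { [ "$d" = "$device" ] || [ "$d" = 0xffffffff ]; } || continue
    echo "$mod"
done)
echo "$match"
```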

Multiple modules can match the same hardware: as an example,
usb-storage and ub both match an USB
stick. In such cases, we load all matching modules into the
kernel and leave it to the kernel to decide who gets to manage the
device. There's one complication: some modules, such as
usb-core, match any device (probably to maintain some
administration of their own, or to provide an ultra-generic
interface), but do not actually provide access to the device.
Such modules are weeded out by the Blacklist module,
based on information in
/etc/hotplug/blacklist and
/etc/hotplug/blacklist.d.

It turns out that the "load modules for every component in the sysfs
path" approach is not always sufficient: sometimes you have to load
siblings as well. As an example, consider a combined EHCI/UHCI
USB controller on a single chip. The same ports can show up as EHCI
or UHCI devices, different PCI functions in the same PCI slot, with
different sysfs directories, depending on what kind of hardware is
connected. Purely following the sysfs path, we would only need to load
the EHCI driver, but it appears that on this kind of chip, EHCI devices
are not reliably detected unless the UHCI driver is loaded as well.
For this reason, we extend the algorithm with a rule: "for PCI devices,
load modules for every function in the PCI slot".

That's actually a bit much: it would load all of ALSA if you have a
combined ISA/IDE/USB/Multimedia chipset. So we limit the above
to those PCI functions that provide USB ports.

Plan Transformation

The plan generated in the first phase is a collection of general
intentions, stuff like 'load this module', but it does not
specify exactly what files must be placed on the image and what
lines are to be added to the initialisation scripts.

The module ActionList represents this plan with a
list of hashes; every hash contains at least 'action' and
'target', with other keys added to provide extra information
as needed. If two steps in the plan have identical action and
target, the last one is considered redundant and silently omitted.
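
The de-duplication rule can be sketched on a sample plan, here written one "action target" pair per line (the real representation is a Perl list of hashes):

```shell
plan='insmod jbd
insmod ext3
insmod jbd
mount /dev/sda1'
# keep only the first occurrence of each action/target pair
dedup=$(printf '%s\n' "$plan" | awk '!seen[$1 FS $2]++')
printf '%s\n' "$dedup"
```

The second "insmod jbd" is dropped, so a module shared by several goals is loaded only once.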

This plan is transformed to an exact image description with
the help of templates.
These templates are read from a configuration file; for every
type of action they can contain:

files to be copied from the mother system to the
same location on the image;

directories to be created on the image; these do
not have to exist on the mother system;

trees to be copied recursively from the mother
system to the image;

script fragments: a few lines of code to be appended to
the named file on the image.

All of the above are fed through HTML-Template, with the hash
describing the action as parameters. In addition, the following
global parameters are available to every template:

The kernel version we're generating an image for.
Useful if you want your image to include a complete copy
of /lib/modules/(version)/kernel.

appVersion

The version of yaird used to
build the image.

auxDir

The directory where yaird
keeps executables intended to go on the image, such as
run_init.

Currently, there are templates for Debian and for Fedora, plus
a template showing how to use the older initrd approach.

Image Generation

The detailed image description consists of a collection of names of
files, directories, symbolic links and block or character devices,
plus a number of lines of shell script. The image description does
not contain permission or ownership information: files always have
mode 444, executables and directories always 555, devices always
mode 600,[5] and everything is owned by root.

The Image module contains the image description and
can write the image to a directory. It understands about symlinks:
if /sbin/vgscan is added to the image and it
happens to be a symlink to lvmiopversion, both
vgscan and lvmiopversion
will be added to the image. Shared libraries are supported
via the SharedLibraries module, as discussed in
the section called “Supporting Shared Libraries”. Invocations of other executables are not
recognised automatically: if lvmiopversion executes
/etc/lvm-200/vgscan, the latter needs to be
added explicitly to the image.

The copying of complete trees to the image is handled like the
copying of executables: if there is a symlink in the tree, its
target is also included on the image, but if the target is a
directory, its contents are not copied recursively. This
approach avoids loops in image generation.
Note that the target of a symlink must exist:
yaird refuses to copy dangling links.
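
The symlink rule can be simulated in a temporary directory: asking for vgscan pulls in its target as well, and a dangling link would be an error (the file names follow the example above; the real logic lives in the Image module):

```shell
d=$(mktemp -d)
echo fake-binary > "$d/lvmiopversion"
ln -s lvmiopversion "$d/vgscan"
files_for () {
    # list the file itself, plus the resolved target if it is a symlink
    echo "$1"
    if [ -h "$1" ]; then
        t=$(readlink "$1")
        case $t in /*) ;; *) t=$(dirname "$1")/$t ;; esac
        [ -e "$t" ] || { echo "dangling link: $1" >&2; return 1; }
        echo "$t"
    fi
}
files=$(files_for "$d/vgscan")
rm -rf "$d"
printf '%s\n' "$files"
```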

Packing the Image

The final step is packing the image in a format that the
bootloader can process; this is handled by the module
Pack. The following formats are supported:

cpio

A zipped cpio file (new ASCII format), required for the
initramfs model as used in the templates for Debian and
Fedora.

directory

An unpacked directory, good for debugging or manually
creating odd formats.

cramfs

A cramfs filesystem, used for Debian initrd images.

The interface between kernel and image

The initial boot image is supposed to load enough modules to let
the real root device be mounted cleanly. It starts up in a
very bare environment and it has to do tricky
stuff like juggling root filesystems; to pull that off successfully
it makes sense to take a close look at the environment that the
kernel creates for the image and what the kernel expects it to do.
This section contains raw design notes based on kernel 2.6.8.

The processing of the image starts even before the kernel is
activated. The bootloader, grub or lilo for example, reads two
files from the boot file system into RAM: the kernel and the image.
The bootloader somehow manages to set two variables in the kernel:
initrd_start and initrd_end; these variables
point to the copy of the image in RAM. The bootloader now
hands over control to the kernel.

During setup, the kernel creates a special file system, rootfs.
This mostly reuses ramfs code, but there are a few twists: it can
never be mounted from userspace, there's only one copy, and it's not
mounted on top of anything else. The existence of rootfs means that
the rest of the kernel can always assume there's a place to mount
other file systems. It also is a place where temporary files can
be created during the boot sequence.

In initramfs.c:populate_rootfs(), there are two
possibilities. If the image looks like a cpio.gz file, it is
unpacked into rootfs. If the file /init is
among the files unpacked from the cpio file, the initramfs model
is used; otherwise we get a more complex interaction between kernel
and initrd, discussed in the section called “Booting with initrd”.

Booting with Initramfs

If the image was a cpio file, and it contains a file
/init, the initramfs model is used.
The kernel does some basic setup and hands over control to
/init; it is then up to
/init to make a real root available and to
transfer control to the /sbin/init command
on the real root.

The tricky part is to do that in such a way that there
is no way for user processes to gain access to the rootfs
filesystem; and in such a way that rootfs remains empty and
hidden under the user root file system. This is best done
using some C code; yaird uses
run_init, a small tool based on
klibc.
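
A minimal /init along these lines mounts the real root and hands over to run_init; the sketch below uses a dry-run wrapper so nothing is actually mounted, and the device, file system type and run_init path are illustrative:

```shell
run () { echo "+ $*"; }            # dry-run: print instead of execute
boot=$(
    run mount -n -t proc proc /proc
    run mount -n -t ext3 -o ro /dev/sda1 /mnt
    run umount -n /proc
    # run_init empties rootfs, chroots into /mnt and execs /sbin/init
    run exec /sbin/run_init /mnt /sbin/init
)
printf '%s\n' "$boot"
```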

Booting with initrd

If the image was not a cpio file, the kernel copies the
initrd image from wherever the boot loader left it to
rootfs:/initrd.image, and frees the RAM used
by the bootloader for the initrd image.

After reading initrd, the kernel does more setup to the point where
we have:

working CPU and memory management

working process management

compiled in drivers activated

a number of support processes such as ksoftirqd are created.
(These processes have the rootfs as root; they can get a new
root when the pivot_root() system call is used.)

something like a console. Console_init() is
called before PCI or USB probes, so expect only compiled in
console devices to work.

At this point, in do_mounts.c:prepare_namespace(),
the kernel looks for a root filesystem to mount. That root file
system can come from a number of places: NFS, a raid device, a plain
disk or an initrd. If it's an initrd, the sequence is as follows
(where the devfs steps can fail if devfs is not compiled into the kernel):

Once that returns, in init/main.c:init(),
initialisation memory is freed and /sbin/init
is executed with /dev/console as file descriptor 0, 1
and 2. /sbin/init can be overruled with
an init=/usr/bin/firefox parameter passed to the
boot loader; if /sbin/init is not found,
/etc/init and a number of other fallbacks
are tried. We're in business.

The processing of initrd starts in
do_mounts_initrd.c:initrd_load(). It creates
rootfs:/dev/ram, then copies
rootfs:/initrd.image there and unlinks
rootfs:/initrd.image. Now we have the initrd
image in a block device, which is good for mounting. It calls
handle_initrd(), which does:

So initrd:/linuxrc runs in an environment where
initrd is the root, with devfs mounted if available, and rootfs is
invisible (except that there are open file handles to directories
in rootfs, needed to change back to the old environment).

Now the idea seems to have been that /linuxrc
would mount the real root and pivot_root into it, then start
/sbin/init. Thus, linuxrc would never return.
However, main.c:init() does some useful work only
after linuxrc returns: freeing init memory segments and starting the
NUMA policy; so in e.g. Debian and Fedora, /linuxrc
will end, and /sbin/init
is started by main.c:init().

After linuxrc returns, the variable real_root_dev
determines what happens. This variable can be read and written
via /proc/sys/kernel/real-root-dev. If it
is 0x0100 (the device number of /dev/ram0)
or something equivalent, handle_initrd() will change
directory to /old and return. If it is
something else, handle_initrd() will decode it, mount
it as root, mount initrd as /root/initrd,
and again start /sbin/init. (If mounting as
/root/initrd fails, the block device is freed.)

Remember handle_initrd() was called via
load_initrd() from prepare_namespace(),
and prepare_namespace() ends by chrooting into the
current directory: rootfs:/old.

Note that rootfs:/old was move-mounted
from '/' after /linuxrc returned.
When /linuxrc started, the root was
initrd, but /linuxrc may have done a
pivot_root(), replacing the root with a real root,
say /dev/hda1.

Thus:

/linuxrc is started with initrd
mounted as root.

There is working memory management, processes, compiled
in drivers, and stdin/out/err are connected to a console,
if the relevant drivers are compiled in.

Devfs may be mounted on /dev.

/linuxrc can pivot_root.

If you echo 0x0100 to
/proc/sys/kernel/real-root-dev,
the pivot_root will remain in effect after
/linuxrc ends.

After /linuxrc returns,
/dev may be unmounted and replaced
with devfs.

Thus a good strategy for /linuxrc is to
do as little as possible, and defer the real initialisation
to /sbin/init on the initrd; this
/sbin/init can then pivot_root
into the real root device.

Kernel command line parameters

The kernel passes more information than just an initial file system
to the initrd or initramfs image; there also are the kernel boot
parameters. The bootloader passes these to the kernel, and the kernel
in turn passes them on via /proc/cmdline.
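
The options yaird interprets (discussed below) can be picked out of /proc/cmdline along these lines; the sample command line is assumed:

```shell
# Sample contents of /proc/cmdline (assumed).
cmdline='root=/dev/sda1 ro init=/bin/sh ydebug'
init=/sbin/init rootmode=ro debug=no     # defaults
for opt in $cmdline; do
    case $opt in
        init=*) init=${opt#init=} ;;     # overrides /sbin/init
        ro)     rootmode=ro ;;           # mount root read-only (default)
        rw)     rootmode=rw ;;
        ydebug) debug=yes ;;             # yaird-specific, ignored by the kernel
    esac
done
printf 'init=%s rootmode=%s debug=%s\n' "$init" "$rootmode" "$debug"
```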

An old version of these parameters is documented in the
bootparam(7) manual page; more recent information is in the kernel
documentation file kernel-parameters.txt.
Mostly, these parameters are used to configure non-modular drivers,
and are thus not very interesting to yaird.
Then there are parameters such as noapic, which are
interpreted by the kernel core and also irrelevant to
yaird.
Finally there are a few parameters which are used by the kernel
to determine how to mount the root file system.

Whether the initial image should emulate these options or ignore them
is open to discussion; you can make a case that the flexibility these
options offer has become irrelevant now that initrd/initramfs offers
far more fine grained control over the way in which the system
is booted.
Support for these options is mostly a matter of tuning the
distribution specific templates, but it is possible that the
templates need an occasional hint from the planner.
To find out just how much "mostly" is, we'll try to implement
full support for these options and see where we run into
limitations.
An inventory of the relevant options follows.

ydebug

The kernel does not know about this option,
so we can use it to enable debugging in the generated image.

ide

These are options for the modular ide-core driver.
This could be supported by adding an attribute
"isIdeCore" to insmod actions, and expanding the ide
kernel options only for insmod actions where that
attribute is true.
It seems cleaner to support the options from
/etc/modprobe.conf.
Unsupported for now.

init

The first program to be started on the definitive root device,
default /sbin/init. Supported.

ro

Mount the definitive root device read only,
so that it can be submitted to fsck.
Supported; this is the default behaviour.

rw

Three guesses. Supported.

resume, noresume

Which device (not) to use for software suspend.
To be done.

root

The device to mount as root. This is a nasty one:
the planner by default only creates device nodes
that are needed to mount the root device, and even
if you were to put hotplug on the initial image
to create all possible device nodes, there's still
the matter of putting support for the proper file system
on the initial image.
We could make an option to
yaird to specify a list
of possible root devices and load the necessary
modules for all of them.
Unsupported until there's a clear need for it.

rootflags

Flags to use while mounting root file system.
Implement together with root option.

rootfstype

File system type for root file system.
Implement together with root option.

ip, nfsaddrs

These two are aliases, with "ip" being the preferred
form. This option may appear more than once.
It tells the kernel to configure a network device,
either based on values that are part of the option
string or based on values supplied by DHCP.

nfsroot

Where the root file system to be mounted is coming from.
If you don't give any options, we try first with NFS over
TCP, then over UDP, and finally NFSv2.
If DHCP specifies a root directory, server and root are
based on DHCP, but options in nfsroot are still applied.
If nfsroot does not give server-ip, the server IP given
by DHCP is used.

Supporting Raid Devices

This section discusses software raid devices from an initial boot
image perspective: how to get the root device up and running.
There are other aspects to consider, the bootloader for example:
if your root device is on a mirror for reliability, it would be
a disappointment if after the crash you still had a long downtime
because the MBR was only available on the crashed disk. Then there's
the issue of managing raid devices in combination with hotplugging:
once the system is operational, how should the raid devices that
the initial image left untouched be brought online?

Raid devices are managed via ioctls (mostly; there is something
called "autorun" in the kernel).
The interface from userland is simple: mknod a block device file,
send an ioctl to it specifying the devnos of the underlying block
devices and whether you'd like mirroring or striping, then send
a final ioctl to activate the device. This leaves the managing
application free to pick any unused device (minor) number and
has no assumptions about device file names.

Devices that take part in a raid set also have a "superblock",
a header at the end of the device that contains a uuid and indicates
how many drives and spares are supposed to take part in the raid set.
This can be used by the kernel to do consistency checking; it can also
be used by applications to scan for all disks belonging to a raid set,
even if one of the component drives is moved to another disk controller.

The fact that the superblock is at the end of a device has an obvious
advantage: if you somehow lose your raid software, the device
underlying a mirror can be mounted directly as a fallback measure.

If raid is compiled into the kernel rather than provided as a module,
the kernel uses superblocks at boot time to find raid sets and make
them available without user interaction. In this case the filename of
the created blockdevice is hardcoded: /dev/md\d.
This feature is intended for machines with root on a raid device
that don't use an initial boot image. This autorun feature is
also accessible via an ioctl, but it's not used in management
applications, since it won't work with an initial boot image and
it can be a nuisance if some daemon brought a raid set online just
after the administrator took it offline for replacement.

Finally, by picking a different major device number for the raid device,
the raid device can be made partitionable without use of LVM.

There are at least three different raid management applications
for Linux: raidtools, the oldest; mdadm, more modern; and EVMS, a
suite of graphical and command line tools that manages not only raid
but also LVM, partitioning and file system formatting. We'll only
consider mdadm for now. The use of mdadm is simple:

There's an option to create a new device from components,
building the superblock.

Another option assembles a raid device from components,
assuming the superblocks are already available.

Optionally, a configuration file can be used, specifying which
components make up a device, whether a device file should
be created or it is assumed to exist, whether it's stripe or
mirror, and the uuid. Also, a wildcard pattern can be given:
disks matching this pattern will be searched for superblocks.

Information given in the configuration file can be omitted
on the command line. If there's a wildcard, you don't even
have to specify the component devices of the raid device.
A typical command is mdadm --assemble /dev/md-root
--auto=md --uuid=..., which translates to "create
/dev/md-root with some unused minor number,
and put the components with matching uuid in it."

So far, raid devices look fairly simple to use; the complications
arise when you have to play nicely with all the other software
on the box. It turns out there are quite a lot of packages that
interact with raid devices:

When the md module is loaded, it registers 256 block devices
with devfs. These devices
are not actually allocated, they're just names set up to
allocate the underlying device when opened. These names in
devfs have no counterpart in sysfs.

When the LVM vgchange is started,
it opens all md devices to scan for headers, only to find the
raid devices have no underlying components and will return
no data. In this process, all these stillborn md devices get
registered with sysfs.

When udevstart is executed
at boot time, it walks over the sysfs tree and lets
udev create block devices files for
every block device it finds in sysfs. The name and permissions
of the created file are configurable, and there is a hook to
initialise SELinux access controls.

When mdadm is invoked with the auto
option, it will create a block device file with an unused
device number and put the requested raid volume under it.
The created device file is owned by whoever executed the
mdadm command, permissions are 0600
and there are no hooks for SELinux.

When the Debian installer builds a system with LVM and raid, the
raid volumes have names such as /dev/md0,
where there is an assumption about the device minor number in
the name of the file.

For the current Debian mkinitrd, this all works together in
a wonderful manner: devfs creates file names for raid devices,
LVM scans them, with the side effect of entering the devices in sysfs,
and after pivot_root, udevstart triggers
udev into creating block device files with proper permissions and
SELinux hooks. Later in the processing of rcS.d,
mdadm will put a raid device under the
created special file. Convoluted but correct, except for the fact
that out of 256 generated raid device files, up to 255 are unused.

In yaird, we do not use devfs.
Instead, we do a mknod before the
mdadm, taking care to use the same
device number that's in use in the running kernel. We expect
mdadm.conf to contain an auto=md
option for any raid device files that need to be created.
This approach should work regardless of whether the fstab uses
/dev/md\d or a device number independent name.
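The device number to reuse can be read from sysfs. A minimal sketch of this step (in Python for illustration; yaird itself is written in Perl):

```python
import os
import stat

def md_device_number(name, sysfs="/sys"):
    """Read the major:minor pair the running kernel assigned to an md
    device; /sys/block/md0/dev contains e.g. "9:0"."""
    with open("%s/block/%s/dev" % (sysfs, name)) as f:
        major, minor = f.read().strip().split(":")
    return int(major), int(minor)

def make_md_node(name, path):
    """Create the block special file before mdadm runs, reusing the
    device number that is in use in the running kernel (needs root)."""
    major, minor = md_device_number(name)
    os.mknod(path, 0o600 | stat.S_IFBLK, os.makedev(major, minor))
```

Because the node carries the same major:minor pair as the running kernel, it does not matter whether fstab refers to /dev/md0 or to a device number independent name.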

Supporting EVMS

The EVMS suite aims to be a complete disk management solution:
it recognises disk partitions, RAID configurations, concatenation
of disk partitions, and file systems. It does all of this using its
own plugin architecture, and is largely self-contained: in particular,
LVM, mdadm or libdevmapper are not required. There are some external
dependencies though: EVMS uses the same kernel modules to do RAID
that other packages use, and it uses an external
mkfs command to support file systems.

What can be moved out of the kernel has been moved out: as an
example, EVMS does not rely on code in the kernel to interpret
partition tables: a partition such as hda1 is
unused. Instead, EVMS uses the dm mechanism to present parts of a
physical disk as independent block devices. The advantage of this
approach is that new partition table formats can be supported
without kernel changes.

The plugin architecture provides three different user interfaces:
command line, curses based and GUI. There also is a
configuration and backup/restore mechanism, where plugins can send
and receive state related data. There does not seem to be a
central state file other than basic configuration: all state
information is kept with the plugins.

Plugins are implemented as shared libraries, but the relation
between library and plugin is not simple: there's no command
to determine which plugins are contained in a library.
This makes it difficult to determine in a maintainable way
what's the minimal set of plugins needed to boot the system;
the current implementation makes no attempt in that direction,
and just loads the lot of them.

Once the hardware is available and device drivers are loaded, the
EVMS system expects to take care of everything. This means
yaird support can be fairly simple:
once we find that a device is supported by EVMS (it is listed
by the command "evms_query volumes"), we determine the
underlying physical disks with the command "evms_query disks".
We then build a boot image that loads drivers for the physical
disk and afterwards runs the command "evms_activate" that will
recreate all volumes.

There's a twist: the volume may need RAID drivers; to accommodate
this, all RAID related modules are inserted into the kernel before
starting "evms_activate". A possible improvement is to include
modprobe on the image, and to let EVMS load only the required
modules. This would save RAM at the expense of a somewhat larger
initial boot image.

Note that some devices are visible in EVMS without actually
working; these normally are shown with device number 0:0.
This seems to happen mostly with devices that are not completely
under the control of EVMS.
I'm not sure whether this is a bug or a feature; but either way
yaird will need to be aware of such
devices and the fact that they may be visible, but that they are
not bootable.
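Such filtering might look as follows. Note that the line format assumed here for the output of "evms_query volumes" (volume name followed by major:minor) is a guess for illustration, not taken from EVMS documentation:

```python
def bootable_evms_volumes(query_output):
    """Filter the output of "evms_query volumes" down to volumes that
    can actually be activated at boot: volumes shown with device
    number 0:0 are visible but not fully under EVMS control, so they
    are excluded.  The assumed line format is "name major:minor"."""
    volumes = []
    for line in query_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, devno = fields[0], fields[1]
        if devno == "0:0":
            continue  # visible in EVMS, but not bootable
        volumes.append(name)
    return volumes
```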

Supporting Encrypted Disks

To protect the content of your disk against unwanted reading
even if the machine is stolen, it can make sense to encrypt the disk.
This section discusses Linux support for disk encryption and
the impact this has on the initial boot image.

The idea here is to encrypt the entire disk with a single key:
the kernel encrypts and decrypts all blocks on an underlying
device and presents it as a new ordinary block device, where
you can use mkfs and fsck as always. Thus an encrypted disk
only protects the confidentiality of your data in cases where
the hardware is first switched off and then taken away for later
perusal by the bad guys. It will not protect confidentiality
if the bad guy gains access to a running system, either through
an exploit or with a valid account.

There are different implementations of this idea. All implementations
use the kernel crypto modules (the same stuff that supports IPsec),
but they differ in how that cryptography is squeezed between userland
and the disk platter.[7]
Note that we do not compare how effective the various implementations are
at keeping your data secret: if your data is important enough to encrypt,
it's also important enough to do your own research into which implementation
is most robust.

cryptoloop

Is in the mainline kernel as of 2.6.10, but has reliability problems, such as
possible deadlocks. The cryptoloop maintainer:
"We should support cryptoloop. No new features, but working
well. At the same time we should declare it 'deprecated' and
provide dm-crypt as alternative."
See kerneltrap
for background.
The on-disk format is trivial: just the encrypted data.
When the device is initialised, the user enters a passphrase;
a hash of this phrase is used as the key for decryption.
If the result looks like a filesystem, the key was valid.

dm-crypt

Has been in the mainline kernel since 2.6.4. It uses device mapper
(the same framework that is also used by LVM), which makes
it more stable than cryptoloop.
See dm-crypt:
a device-mapper crypto target.
Dm-crypt can use the same on-disk format as cryptoloop, but
the device mapper makes it easy to reserve part of the disk
for a partition header with key material.

Such a partition header, LUKS,
is now under development; it will offer improved protection
against dictionary attacks and will make it easier to change
the password on an encrypted disk. Due to the way the device
mapper works, support for the partition header can be implemented
completely in userspace.

LUKS is integrated in Gentoo and
included in Fedora FC4 test1. A Debian package exists
(cryptsetup-luks),
but is not (yet) included in the main archive.

All these implementations need some kind of userspace tool to pass
key material to the kernel; this key material may come from lots of
places:

in the most simple case, it could be a hashed version of the
password

it could be a large random key stored in a gpg-encrypted file

for swap devices, it could be randomly regenerated on each reboot

for file systems other than the root, it could be from a file
with mode 600 on the root file system

the key could be stored on a USB stick, stored separately
from the machine.

An overview of relevant userspace tools:

the losetup command has an encryption option to use the
cryptoloop module. Note that this does not cause the cryptoloop
module to be loaded automatically.

versions of the mount command in Debian and Fedora have
a 'loop,encryption' option that will be passed to losetup
for use with cryptoloop, like so:

/dev/vg0/crwrap /crypt1 ext3 loop,encryption=aes,noauto 0 0

The dmsetup command can set and show parameters (including
key hashes!) for dm based devices, including dm-crypt and
LVM. With a bit of shell scripting, you can hash a password
and pass it on the command line to set up a dm-crypt device.

The cryptsetup command adds a friendly wrapper around this.
In particular, it has hashing of the passphrase built in.

A modified package cryptsetup-luks exists, that adds
extra options to (1) create a LUKS header for a partition and
(2) open a partition given one of a number of possible
passphrases.

The file /etc/crypttab is a Debian
extension to cryptsetup: it provides a list of crypted
devices, their underlying devices, corresponding cipher
and hash settings, plus the source for the passphrase:
either some file or the controlling terminal. This allows the devices
to be activated by /etc/init.d/cryptdisks. There is a
thread on adding /etc/crypttab to
Fedora: too late for FC3, to be considered again for FC4:
see here
and
here.
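A sketch of a crypttab parser, assuming the four-column Debian format (target name, source device, key source, comma-separated options); this is an illustration in Python, not yaird's actual code:

```python
def parse_crypttab(text):
    """Parse Debian /etc/crypttab: one device per line with fields
    target name, source device, key source ("none" means ask on the
    controlling terminal) and comma-separated options such as
    cipher=aes-cbc-essiv:sha256."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line
        entry = {
            "target": fields[0],
            "source": fields[1],
            "keyfile": fields[2] if len(fields) > 2 else "none",
            "options": {},
        }
        if len(fields) > 3:
            for opt in fields[3].split(","):
                key, _, value = opt.partition("=")
                entry["options"][key] = value
        entries.append(entry)
    return entries
```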

In order to activate an encrypted device with cryptsetup,
we need to detect:

which underlying device to use

which encryption and hash algorithm to use

where the passphrase comes from

whether we have a plain crypted partition or a LUKS partition

In order to determine all these points we need information from
/etc/crypttab; as a consistency check, we'll
compare this to the output from "cryptsetup
status".[8]

The resulting actions:

If the source of the passphrase is something other than
the console, abort. There are too many variables to support
this reliably.

For the passphrase hash algorithm, no modules need to be loaded,
since it is included by cryptsetup from a user space
library.

Make the underlying device available.

Modprobe the dm-crypt and the cipher (the module name
is the part of the cipher name before the first hyphen).
If the cipher block mode needs a hash, load that too.
Note that the cipher block mode hash is something
different from the passphrase hash: it's the part after
the colon in eg 'aes-cbc-essiv:sha256'.

Here the cryptsetup action will result in a script
fragment in /init that has "cryptsetup create" in a loop
until exit status is 0. For plain cryptsetup,
this only has effect in combination with the "verify"
option: exit status is 0 if the user gives the same
password twice in succession. With cryptsetup-luks,
this would test that the passphrase actually gives access
to the encrypted device.

For cryptsetup-luks, invoke a similar action with fewer
parameters, since so much of the required information
is already in the header.
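The module selection described above can be sketched as a small function; the cipher specification format follows the 'aes-cbc-essiv:sha256' example:

```python
def cipher_modules(cipher):
    """Given a crypttab cipher specification such as
    'aes-cbc-essiv:sha256', list the kernel modules to load before
    "cryptsetup create": dm-crypt itself, the cipher module (the part
    before the first hyphen) and, if the cipher block mode uses a
    hash (the part after the colon), that hash module as well."""
    modules = ["dm-crypt"]
    modules.append(cipher.split("-")[0])  # cipher module, e.g. "aes"
    ivhash = cipher.partition(":")[2]     # block mode hash, e.g. "sha256"
    if ivhash:
        modules.append(ivhash)
    return modules
```

The passphrase hash is deliberately absent from this list: as noted above, it comes from a user space library, not from a kernel module.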

Supporting NFS Root

It is possible to use an NFS share rather than a local disk
as root device; this is (obviously) useful for diskless terminals,
but it also can come in handy for recovery.

Examples of projects using NFS root for diskless work
are
LTSP,
Lessdisks and
Stateless Linux.
In these projects, the initial boot image comes with the distribution
and it must be sufficiently generic to support a wide range of
hardware; in particular it must probe for different network
cards. For yaird, we'll focus on recovery use, where the initial
boot image is tailored for a single computer.

Although in principle the kernel and initial boot image for an NFS root
system can be stored on a local disk, it's more common to have them
loaded over the network with TFTP. This means you'll need a boot loader
that can work over the network, such as pxelinux.
This takes place before the initial boot image takes over;
we won't dive into the details here.

There are a number of issues that make it impossible to automatically
determine exactly what is needed to do a network boot:

Not all interfaces are suitable for booting: think of
loopback devices, IPsec tunnels, 802.1Q endpoints.

Interfaces may be renamed by udev;
thus there is no link between the name while running
yaird and the name while
running the initial boot image.

Once the system is running, there is no way to determine
how an interface got its IP address: could be RARP, DHCP
or static.

An NFS share in /etc/fstab contains
a hostname and directory, with no portable indication how
that name is resolved to an IP address, whether that IP
address will be unchanged during the next reboot and whether
the route to that IP address will stay unchanged.

This means we cannot determine how to mount the NFS root using
only information that is readily available on the running system:
we'll need a hint. Rather than give that hint in the form of
yaird configuration options, we will
use the kernel command line.

The NFS part of the boot process takes place after
loading of keyboard drivers and before switching to the
final root. It has the following phases:

Load device drivers for every interface that is backed
by hardware: /sys/class/net/*/device.

Configure interfaces: get an IP address, netmask, broadcast,
gateway. As a side effect, get hostname, dns, rootserver,
rootpath.

Mount the NFS root.
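The first phase, finding interfaces backed by hardware, can be sketched as a scan for device symlinks (a Python illustration of the sysfs test named above):

```python
import os

def hardware_interfaces(sysfs="/sys"):
    """Return network interfaces backed by real hardware: those with
    a 'device' symlink in /sys/class/net/<interface>/.  Loopback and
    other purely virtual interfaces lack that link and are skipped."""
    net = os.path.join(sysfs, "class", "net")
    return sorted(name for name in os.listdir(net)
                  if os.path.exists(os.path.join(net, name, "device")))
```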

The last two steps are done by a single program,
trynfs. This is based on the klibc
components ipconfig and
nfsmount.
This program is only invoked if the kernel command line parameter
ip= (or its alias nfsaddrs=) is set. The kernel parameters ip=,
nfsaddrs=, nfsroot= are passed as arguments to
trynfs.
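Extracting those parameters from the kernel command line can be sketched as follows (an illustration; the real image does this with shell and klibc tools):

```python
def nfs_boot_args(cmdline):
    """Pick the parameters that are passed on to trynfs out of the
    kernel command line: ip= (alias nfsaddrs=) and nfsroot=."""
    args = {}
    for word in cmdline.split():
        for key in ("ip", "nfsaddrs", "nfsroot"):
            if word.startswith(key + "="):
                args[key] = word[len(key) + 1:]
    return args
```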

Earlier versions of Yaird had a command
line option "--nfs" to enable NFS code generation. Starting with
version 0.0.11, this option no longer is available. Instead, write
a configuration file based on Default.cfg that
uses the 'nfsstart' template to get an IP address and mount a root
file system. The reason the command line option is dropped is that
there are more ways to use NFS than can be expressed with a simple
command line option: some people need only a driver for a specific
card, others need lots of network drivers; you may or may not want
to use a local drive as backup if no network is available; using
a configuration file makes it possible to tune the generated image
exactly for the situation at hand.

NFS Pitfalls

Yaird can get the system to a state
where init is running from an NFS mounted root device, but that
is not always sufficient to get a reliable system: the init
scripts will also need to be written to work well in an NFS
mounted environment. This section discusses some potential
problems.

The Linux version of NFSv4 (Working Group,
Linux
reference implementation)
has a new channel of communication between the kernel and user
space: rpc_pipefs. This is normally mounted on
/var/lib/nfs/rpc_pipefs, and is used to
let a user space daemon do locking and Kerberos on behalf of the
kernel.

The rpc_pipefs support on a machine can interfere with
yaird. As an example, in Fedora,
/etc/modprobe.conf.dist has an 'install'
line for module 'sunrpc' that automatically mounts the
rpc_pipefs filesystem when the module is loaded. This means
the filesystem is not mounted if the sunrpc module happens
to be compiled into the kernel; it also can't be mounted if
sunrpc is loaded from the initial boot image, since there is no
/var/lib/nfs/rpc_pipefs yet to mount it on.
When yaird sees such an install line,
it can no longer determine what should go on the initial boot
image and terminates.

The workaround is to remove the 'install' line from
modprobe.conf and to do the mounting
in an /etc/init.d script before the
rpc.gssd and
rpc.statd daemons are started.

Note that using Kerberos with an NFS mounted root is of
questionable value: Kerberos relies on a secret file on the root
file system to guarantee the security of NFS, and if that secret
file is on an NFS file system that is itself not protected by
Kerberos, the guarantee loses value.

Another potential problem is dhclient, a tool to configure a
network interface with DHCP. This can call a user script
to manage DHCP state changes, and on FC4, that script happens
to stop and start the interface to get it to a known state.
Since the script itself is accessed over NFS via the interface,
the stopping works, but the starting doesn't ... By using a
fixed IP address you avoid this problem, but that is not a
generally applicable solution.

The upshot of all this seems to be that we can ignore the css0 and subchannel
directories, and should look up the required module in modules.ccwmap, in the
same way that lookup is done in usbmap and pcimap.

There also is the concept of "ccwgroup", where a single device uses a number of
S390 channels. There are no indications that this has implications for booting.

Supporting Input Devices

A working console and keyboard during the initial boot image execution
is needed to enter a password for encrypted file systems; it also
helps while debugging. This section discusses the kernel input
layer and how it can be supported during image generation.

The console is a designated terminal, where kernel output goes, and that
is the initial I/O device for /sbin/init. Like all
terminal devices, it provides a number of functions: you can read
and write to it, plus it has a number of ioctl()
functions to manage line buffering, interrupt characters and
baudrate or parity where applicable.

Terminals come in different types: a terminal can be a VT100 or terminal
emulator connected via an RS232 cable, or a combination
of a CRT and a keyboard. The keyboard can be connected via
USB or it can talk a byte oriented protocol via a legacy UART
chip.

The CRT is managed in two layers. The top layer, "virtual
terminal", manages a two dimensional array describing which letter
should go in which position of the screen. In fact, there are a
number of different arrays, and which one is actually visible on
the screen is selected by a keyboard combination.
Below the virtual terminals is a layer that actually places the
letters on the screen. This can be done a letter at a time,
using a VGA interface, or the letters can be painted pixel by
pixel, using a frame buffer.

Below the terminal concept we find the input layer. This provides a
unified interface to the various user input devices: mouse, keyboard,
PC speaker, joystick, tablet. These input devices not only
generate data, they can also receive input from the computer. As
an example, the keyboard needs computer input to operate the NUM
LOCK indicator. Hardware devices such as keyboards register
themselves with the input layer, describing their capabilities
(I can send relative position, have two buttons and no LEDs),
and the input layer assigns a handler to the hardware device.
The handler presents the device to upper layers, either as a char
special file or as the input part of a terminal device.
This is not a one-to-one mapping: every mouse gets its own
handler, but keyboard and PC speaker share a handler, so it looks
to userland like you have a keyboard that can do "beep".

In addition to handlers for specific types of upper layers (mouse,
joystick, touch screen) there is a generic handler that provides a
character device file such as /dev/input/event0
for every input device detected; input events are presented through
these devices in a unified format. The input layer generates
hotplug events for these generic event handlers; hotplug uses
modules.inputmap to load a module containing a
suitable upper layer event handler. The keyboard handler is a special
case that does not occur in this map, so for image generation there
is little to be learned from hotplug input support.

To guarantee a working console, yaird
should examine /dev/console, determine
whether it's RS232 or hardware directly connected to the computer,
and then load modules for either serial port, or for virtual
terminals, the input layer and any hardware underlying it.
Unfortunately, /dev/console does not give
a hint what is below the terminal interface, and unfortunately,
lots of input devices are legacy hardware that is hard to probe
and only sketchily described by sysfs in kernel 2.6.10.

This means that a guarantee for a working console cannot be made,
which is why distribution kernels come with components such as the
keyboard and serial port driver compiled into the kernel. We can
do something else though: include modules for keyboard devices
where the kernel provides correct information. That covers the
case of USB keyboards, something that is not compiled
into distribution kernels, so that the administrator has to add
modules explicitly in order to get the keyboard working in
the initial boot image.

Let's examine the sources of information we have to find which input
hardware we have to support.

In /sys/class/input, all input devices
are enumerated. Mostly, these only contain a
dev file containing major/minor number,
but USB devices also have a device
symlink into /sys/devices identifying
the underlying hardware.

In kernel 2.6.15, /sys/class/input
is far more complete. It has links from class device to
hardware devices, and hardware devices such as atkbd and
psmouse have a 'modalias' file that can be fed to modprobe.
This contains everything that's in
/proc/bus/input/devices,
in a nice accessible manner.

As an aside, can we do all device probing based on the
modalias file? This would mean we no longer would have
to distinguish between sysfs format for usb and pci,
making the code simpler. The tricky part is to distinguish
between modules compiled in and modules simply missing from
the kernel: dealing with "FATAL: Module ... not found".
As a first step, we could simply assume that aliases that cannot
be resolved refer to compiled in modules; this is in essence
what the current scan of eg modules.usbmap does.

In /boot/grub/menu.lst, kernel options
can be defined that determine whether to use a serial line as
console and whether to use a frame buffer. The consequence
is that it is fundamentally impossible to determine by looking
at the hardware alone what's needed to get an image that will
boot without problems. This probably means we'll have to consider
supplying some modules in the image that will only get loaded
depending on kernel options.

The file /proc/bus/input/devices gives
a formatted overview of all known input devices; entries look
like this (abbreviated):

I: Bus=0011 Vendor=0001 Product=0001 Version=ab41
N: Name="AT Translated Set 2 keyboard"
P: Phys=isa0060/serio0/input0
H: Handlers=kbd event0
B: EV=120013
B: KEY=...

Here the "I" line shows identification information passed to
the input layer by the hardware driver that is used to look
up the appropriate handler. "N" is a printable name provided
by the hardware driver. "P" is a hint at location in a bus
of the device; note how this line is completely unrelated to
the location of the hardware in
/sys/devices.
The H (Handlers) line is obvious; the B lines specify
capabilities of the device, plus extra information for each
capability. Known capabilities include:

SYN: Input event is completed

KEY: Key press/release event

REL: Relative measure, as in mouse movement

ABS: Absolute position, as in graphics tablet

MSC: Miscellaneous

SND: Beep

REP: Set hardware repeat

FF: Force feedback

PWR: Power event: on/off switch pressed

FF_STATUS: Force feedback status

Finally, let's consider some kernel configuration defines, the
corresponding modules and their function. This could be used as a
start to check whether all components required to make an
operational console are available on the generated image:

VT (bool): Support multiple virtual terminals, irrespective of what
hardware is used to display letters from the virtual terminal on
the CRT.

VT_CONSOLE (bool): Make the VT a candidate for console output. The
alternative is a serial line to a VT100 or terminal emulator.

VGA_CONSOLE (bool): Display a terminal on the CRT using the VGA
interface.

FRAMEBUFFER_CONSOLE (module fbcon): Display a terminal on a
framebuffer, painting letters a pixel at a time. This has to know
about fonts.

FB_VESA (module vesafb): Implement a framebuffer based on VESA (a
common standard for PC graphics cards), a place where an X server or
the framebuffer console can write pixels to be displayed on the CRT.
There are many different framebuffer modules that optimise for
different graphics cards. Note that while vesafb and other drivers
such as intelfb can be built as a module, they only function
correctly when built into the kernel. Most framebuffer modules
depend on three other modules to function correctly: cfbfillrect,
cfbcopyarea, cfbimgblt.

ATKBD (module atkbd): Interpret input from a standard AT or PS/2
keyboard. Other keyboards use other byte codes, see for example
the Acorn keyboard (rpckbd).

SERIO (module serio): Manages a stream of bytes from and to an IO
port. It includes a kernel thread (kseriod) that handles the queue
needed to talk to slow ports. It is normally used for dedicated IO
ports talking to a PS/2 mouse and keyboard, but can also be
interfaced to serial ports (COM1, COM2). The atkbd driver uses a
serio driver to communicate with the keyboard.

SERIO_I8042 (module i8042): Implement a serio stream on top of the
i8042 chip, the chip that connects the standard AT keyboard and
PS/2 mouse to the computer. This is legacy hardware: it's not
connected via PCI but directly to the 'platform bus'. When a chip
such as the i8042 that implements serio is detected, it registers
itself with the input layer. The input layer then lets drivers that
use serio (such as atkbd and psmouse) probe whether a known device
is connected via the chip; if such a device is found, it is
registered as a new input device.

SERIAL_8250 (module serial): Support for serial ports (COM1, COM2)
on PC hardware. Lots of other configuration options exist to
support multiple cards and fiddle with interrupts. If compiled in
rather than modular, a further option, SERIAL_8250_CONSOLE, allows
using the serial port as a console.

USB_HID (module usbhid): Driver for USB keyboards and mice. Another
define, USB_HIDINPUT, needs to be true for these devices to
actually work.

USB_KBD (module usbkbd): Severely limited form of USB keyboard;
uses the "boot protocol". This conflicts with the complete driver.

The following figure gives an example of how the various modules
can fit together.

Figure 1.
Module relation for common console setup

In practical terms, a first step toward a more robust boot image
is to support new keyboard types, such as USB keyboards.
The following algorithm should do that.

Interpret /proc/bus/input/devices.

Look for devices that have handler kbd and
that have buttons. Mice and the PC speaker don't match that
criterion; keyboards do.

You could interpret the name field of such devices if you're
interested in supporting legacy keyboards.

The devices that have handler 'kbd' also have a handler 'event\d',
where input is presented in a generalised event format;
look up this device in /sys/class/input/event\d/.
If that entry has a device symlink, it identifies
the underlying hardware, and the module for it can be found.

Otherwise it's presumably a legacy device; you could check for
the existence of
/sys/devices/platform/i8042/serio\d/,
or you could just assume the appropriate driver to be compiled in.

Implement support for
/etc/hotplug/blacklist:
some USB keyboards publish two interfaces (full HID
and the limited boot protocol), the input layer makes both
visible in /proc/bus/input/devices, and
the corresponding modules are mutually conflicting.
The blacklist is used to filter out one of these modules.
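The first two steps of this algorithm can be sketched as a scan of /proc/bus/input/devices; for brevity this Python illustration only checks the handler list, not the B: KEY capability line:

```python
def find_keyboards(devices_text):
    """Scan the contents of /proc/bus/input/devices for devices with
    handler 'kbd'.  Returns (name, event handler) pairs; the event
    handler can then be looked up in /sys/class/input/."""
    keyboards = []
    name, handlers = "", []
    for line in devices_text.splitlines() + [""]:
        if line.startswith("N: Name="):
            name = line.split("=", 1)[1].strip('"')
        elif line.startswith("H: Handlers="):
            handlers = line.split("=", 1)[1].split()
        elif not line:  # a blank line ends an entry
            if "kbd" in handlers:
                event = [h for h in handlers if h.startswith("event")]
                keyboards.append((name, event[0] if event else None))
            name, handlers = "", []
    return keyboards
```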

Supporting Shared Libraries

When an executable is added to the image, we want any required shared
libraries to be added automatically. The SharedLibraries
module determines which files are required. This section discusses
the features of kernel and compiler we need to be aware of in order
to do this reliably.

Linux executables today are in ELF format; it is defined in
Generic ELF Specification ELFVERSION,
part of the Linux Standard Base. This is based on part of the System
V ABI: Tool Interface Standard (TIS), Executable and Linking Format
(ELF) Specification.

ELF has consequences in different parts of the system: in
the link-editor, that needs to merge ELF object files into ELF
executables; in the kernel (fs/binfmt_elf.c),
that has to place the executable in RAM and transfer control to it,
and in the runtime loader, that is invoked when starting the
application to load the necessary shared libraries into RAM.
The idea is as follows.

Executables are in ELF format, with a type of either
ET_EXEC (executable) or ET_DYN (shared
library; yes, you can execute those.) There are other types of
ELF file (core files for example) but you can't execute them.

These files contain two kinds of headers: program headers and
section headers. Program headers define segments of the file that
the kernel should store consecutively in RAM; section headers define
parts of the file that should be treated by the link editor
as a single unit. Program headers normally point to a group
of adjacent sections.

The program may be statically linked or dynamically (with shared
libraries).
If it's statically linked, the kernel loads relevant segments,
then transfers control to main() in userland.

If it's dynamically linked, one of the program headers has type
PT_INTERP. It points to a segment that contains
the name of a (static) executable; this executable is loaded in
RAM together with the segments of the dynamic executable.

The kernel then transfers control to the userland
interpreter, passing program headers and related info in a
fourth argument to main(), after envp.

There's one interesting twist: one of the segments loaded
into RAM (linux-gate.so) does not
come from the executable, but is a piece of kernel mapped
into user space. It contains a subroutine that the kernel
provides to do a system call; the idea is that this way,
the C library does not have to know which calling convention
for system calls is supported by the kernel and optimal for
the current hardware. The link editor knows nothing about
this, only the interpreter knows that the kernel can pass the
address of this subroutine together with the program headers.
[9]

The interpreter interprets the .dynamic section of
the dynamic executable. This is a table containing various types
of info; if the type is DT_NEEDED, the info is the
name of a shared library that is needed to run the executable.
Normally, it's the basename.

The interpreter searches LD_LIBRARY_PATH for the
library and loads the first working version it finds, using a
breadth-first search. Once everything is loaded, the interpreter
hands over control to main in the executable.

Except that that's not how it really works: the path that glibc
uses depends on whether threads are supported, and klibc can
function as a PT_INTERP but will not load additional
libraries.

The ldd command finds the pathnames
of shared libraries used by an executable. This works
only for glibc: it invokes the interpreter
with the executable as argument plus an environment variable that
tells it to print the pathnames rather than load them. For other
C libraries, there's no guaranteed correct way to find the path of
shared libraries.

Update: ldd also works for another
C library, uclibc, unless you disable that support while building
the library by unsetting LDSO_LDD_SUPPORT.

Thus, to figure out what goes on the initial ram image, first try
ldd. If that gives an answer, good.
Otherwise, use a helper program to find PT_INTERP and
DT_NEEDED. If there's only PT_INTERP, good,
add it to the image. If there are DT_NEEDED libraries
as well, and they have relative rather than absolute pathnames,
we can't determine the full path, so don't generate an image.
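The first step, interpreting ldd output, can be sketched as follows (a Python illustration; the sample line format is the usual glibc ldd output):

```python
def ldd_libraries(ldd_output):
    """Extract absolute library paths from ldd output.  Lines look
    like '\tlibc.so.6 => /lib/libc.so.6 (0x...)'; the virtual
    linux-gate.so entry has no path and is skipped."""
    paths = []
    for line in ldd_output.splitlines():
        line = line.strip()
        if "=>" in line:
            target = line.split("=>", 1)[1].strip()
            if target.startswith("/"):
                paths.append(target.split()[0])
        elif line.startswith("/"):
            # the PT_INTERP interpreter is printed without '=>'
            paths.append(line.split()[0])
    return paths
```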

There are a number of options to build a helper to extract the relevant
information from the executable:

Build it in perl. The problem here is that unpacking 64-bit
integers is an optional part of the language.

Build a wrapper around objdump or
readelf. The drawback is that
these programs are not part of a minimal Linux distribution:
depending on them in yaird would
increase the footprint.

Building a C program using libbfd. This is a library
intended to simplify working with object files. Drawbacks
are that it adds complexity that is not necessary in our
context since it supports multiple executable formats;
furthermore, at least in Debian it is treated as internal
to the gcc tool chain, complicating packaging the tool.

Building a C program based on elf.h.
This turns out to be easy to do.

Yaird uses the last approach listed.
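The elf.h based approach can be illustrated with Python's struct module; this sketch handles only little-endian 64-bit ELF files, where the real helper must also cover 32-bit and big-endian variants:

```python
import struct

PT_INTERP = 3  # program header type for the interpreter path

def find_interp(data):
    """Return the PT_INTERP interpreter path of a little-endian
    64-bit ELF image, or None if there is none."""
    if data[:4] != b"\x7fELF" or data[4] != 2:  # magic, ELFCLASS64
        return None
    # ELF64 header: e_phoff at offset 32, e_phentsize/e_phnum at 54
    e_phoff, = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type, = struct.unpack_from("<I", data, off)
        if p_type == PT_INTERP:
            p_offset, = struct.unpack_from("<Q", data, off + 8)
            p_filesz, = struct.unpack_from("<Q", data, off + 32)
            raw = data[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode()
    return None
```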

Security

This section discusses security: avoiding downtime, avoiding revealing
sensitive information, avoiding unwanted modifications to the data;
either through accident or malice.
A good introduction to secure programming can be found in
Secure Programming for Linux and Unix HOWTO.

For yaird, security is not very
complicated: although it runs with root privileges, the program is
not setuid, and all external input comes from files or programs
installed by the administrator, so our main focus is on avoiding
downtime caused by ignored error codes.
A full blown risk assessment would be overkill, so we'll just use
the HOWTO as a checklist to verify that the basic precautions are
in place.

File contents in sysfs are verified.
Fstab entries are properly quoted.
TODO: check for spaces in names of LVM volumes or of
modules; these could end up in the generated /sbin/init.

Bad input

Verify locale settings

All locale related environment variables are wiped at
program startup.

Bad input

Verify character encoding

All IO is byte oriented.

Bad input

Buffer overflow

In perl?

Program structure

Separate data and control

Under this heading, the HOWTO discusses the dangers of
auto-executing macros in data files. The closest thing we
have to a data file are the templates that tune the image
to the distribution. We use a templating language that
does not allow code embedding, and the image generation
module does not make it possible for template output to
end up outside of the image. Conclusion: broken templates
can produce a broken image, but cannot affect the running
system.

Program structure

Minimize privileges

The user is supposed to bring his own root privileges to
the party; not much to be done here. A related issue
is the minimizing of privileges in the system that is
started with the generated image. This would include
starting SELinux at the earliest possible moment.
At least in Fedora, that earliest possible moment is
in rc.sysinit, well past the moment
where the initial boot image hands over control to the newly
mounted root file system. No yaird
support needed.

Program structure

Safe defaults

Configuration only specifies sources of information,
like /etc/hotplug; not much can go wrong here.

Program structure

Safe Initialisation

The location of the main configuration file is built
into the application as an absolute path.

Program structure

Fail safe

Planning and writing the image are separated;
writing only starts after planning has completed successfully.
Todo: consider backing out on write failure.

Program structure

Avoid race conditions

Temporary files and directories are created
with the File::Temp module, which is
resistant to name guessing attacks.
The completed image is installed with rename
rather than link; if an existing file is
overwritten, this guarantees there's no race where the
old image has been deleted but the new one is not yet in
place. (Note that there is no option in place yet which
allows overwriting of existing files.)
To do: examine File::Temp safe_level=HIGH.
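The tempfile-plus-rename pattern described above looks roughly as follows. This is a sketch of the general technique in Python, not yaird's perl code, and install_image is a made-up name:

```python
import os
import tempfile

def install_image(path, contents):
    """Write an image to a temp file in the target directory, then move
    it into place with rename(2), which is atomic within one file
    system: a reader sees either the complete old image or the complete
    new one, never a half-written file."""
    directory = os.path.dirname(path) or "."
    # mkstemp picks an unguessable name and opens with O_EXCL,
    # resisting name-guessing (symlink) attacks.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)  # back out: leave no stray temp file behind
        raise
```

Creating the temp file in the target's own directory keeps both names on one file system, which rename requires for atomicity.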

Underlying resources

Handle meta characters

Protection against terminal escape sequences in output
is not yet in place.

Underlying resources

Check system call results

Yes.

Language specific

Verify perl behaviour with taint.

Yes.

Language specific

Avoid perl open magic by using the 3-argument form of open.

Yes.

Tool Chain

This section discusses which tools are used in implementing
yaird and why.

The application is built as a collection of perl modules.
The use of a scripting language makes consistent error checking
and building sane data structures a lot easier than shell
scripting; using perl rather than python is mainly because in
Debian perl has 'required' status while python is only 'standard'.
The code follows some conventions:

Where there are multiple items of a kind, say fstab entries,
the perl module implements a class for individual items.
All classes share a common base class, Obj,
that handles constructor argument validation and that offers
a place to plug in debugging code.

Object attributes are used via accessor methods to catch
typos in attribute names.

Objects have a string method that returns
a string version of the object. Binary data is not
guaranteed to be absent from the string version.

Where there are multiple items of a kind, say fstab entries,
the collection is implemented as a module that is not a
class. There is a function all that returns a
list of all known items, and functions findByXxx
to retrieve an item where the Xxx attribute has a given
value. There is an init function that
initializes the collection; this is called automatically
upon first invocation of all or
findByXxx.
Collections may have convenience functions
findXxxByYyy: return attribute Xxx, given a
value for attribute Yyy.
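The conventions above can be sketched as follows. Python stands in for perl here, and the FsTabEntry class with its hard-coded entries is purely illustrative:

```python
class Obj:
    """Common base class: the constructor accepts exactly the
    attributes the subclass declares, catching misspelled or
    missing arguments."""
    ATTRIBUTES = ()

    def __init__(self, **kwargs):
        if set(kwargs) != set(self.ATTRIBUTES):
            raise ValueError("bad attributes: %r" % sorted(kwargs))
        # Plain attributes suffice here: unlike a perl hash, a typo'd
        # Python attribute read already raises AttributeError, which
        # is what the accessor-method convention buys in perl.
        self.__dict__.update(kwargs)

class FsTabEntry(Obj):
    ATTRIBUTES = ("device", "mount_point", "fs_type")

    def string(self):
        """Printable version of the object."""
        return "%s on %s (%s)" % (self.device, self.mount_point,
                                  self.fs_type)

# The collection is a module-level table, not a class, with lazy init.
_entries = None

def _init():
    global _entries
    # yaird would parse /etc/fstab here; two entries are hard-coded.
    _entries = [
        FsTabEntry(device="/dev/hda1", mount_point="/", fs_type="ext3"),
        FsTabEntry(device="/dev/hda2", mount_point="none",
                   fs_type="swap"),
    ]

def all():
    if _entries is None:
        _init()
    return list(_entries)

def find_by_mount_point(value):
    return next((e for e in all() if e.mount_point == value), None)

print(find_by_mount_point("/").string())  # → /dev/hda1 on / (ext3)
```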

The generated initrd image needs a command interpreter;
the choice of command interpreter is exclusively determined
by the image generation template.
At this point, both Debian and Fedora templates use the
dash shell, for historical reasons only.
Presumably busybox could be used to build a
smaller image. However, support for initramfs requires a complicated
construction involving a combination of mount, chroot and chdir;
to do that reliably, nash as used in Fedora
seems a more attractive option.

Documentation is in docbook format, since it's widely supported,
supports numerous output formats, has better separation between
content and layout than texinfo, and provides better guarantees
against malformed HTML than texinfo.

Autoconf

GNU automake is used to build and install the application,
where 'building' is perhaps too big a word for adding the location
of the underlying modules to the wrapper script.
The reasons for using automake: it provides packagers with a
well known mechanism for changing installation directories,
and it makes it easy for developers to produce a cruft-free
and reproducible tarball based on the tree extracted from
version control.

C Library

The standard C library under linux is glibc. This is big:
1.2 MB, whereas an alternative implementation, klibc, is only 28 KB.
The reason klibc can be so much smaller than glibc is that a
lot of features of glibc, like NIS support, are not relevant for
applications that need to do basic stuff like loading an IDE driver.

There are other small libc implementations: in the embedded world,
dietlibc and uClibc are popular. However, klibc was specifically
developed to support the initial image: it's intended to be included
with the mainline kernel and allow moving a lot of startup magic out
of the kernel into the initial image. See
LKML: [RFC] klibc requirements, round 2
for requirements on klibc; the
mailing list is the most current
source of information.

Recent versions of klibc (1.0 and later) include a wrapper around
gcc, named klcc, that will compile a program with klibc. This means
yaird does not need to include klibc,
but can easily be configured to use klibc rather than glibc.
Of course this will only pay off if every
executable on the initial image uses klibc.

Template Processing

This section discusses the templates used to transform
high-level actions to lines of script in the generated image.
These templates are intended to cope with small differences
between distributions: a shell that is named
dash in Debian and
ash in Fedora for example.
By processing the output of yaird
through a template, we can confine the tuning of
yaird for a specific distribution
to the template, without having to touch the core code.

One important function of a template library is to enforce
a clear separation between program logic and output formatting:
there should be no way to put perl fragments inside a template.
See StringTemplate
for a discussion of what is needed in a templating system, plus
a Java implementation.

Let's consider a number of possible templating solutions:

Template Toolkit:
widely used, not in perl core distribution, does not
prevent mixing of code and templates.

Text::Template:
not in perl core distribution, does not
prevent mixing of code and templates.

Some XSLT processor. Not in core distribution,
more suitable for file-to-file transformations
than for expanding in-process data; overkill.

HTML-Template:
not in perl core distribution,
prevents mixing of code and templates,
simple, no dependencies, dual GPL/Artistic license.
Available in Debian as
libhtml-template-perl,
in Fedora 2 as perl-HTML-Template, dropped from Fedora 3,
but available via
Fedora Extras.

A home grown templating system: a simple system such as the
HTML-Template module is over 100Kb. We can cut down on that
by dropping functions we don't immediately need, but the effort
to get a tested and documented implementation remains substantial.

The HTML-Template approach is the best match for our
requirements, so it is used in yaird.
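The discipline such a template system enforces can be seen in a few lines: the template language names variables and nothing else, so a broken template may produce broken output but cannot run code. A simplified sketch of the variable-substitution subset only (real HTML-Template syntax is richer, and this is not yaird's code):

```python
import re

def expand(template, values):
    """Expand <TMPL_VAR NAME=x> placeholders. The template can only
    name variables; there is no syntax for embedding code."""
    def lookup(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("template refers to unknown variable %r"
                           % name)
        return str(values[name])
    return re.sub(r"<TMPL_VAR NAME=(\w+)>", lookup, template)

line = expand("mount -t <TMPL_VAR NAME=fstype> <TMPL_VAR NAME=dev> /mnt",
              {"fstype": "ext3", "dev": "/dev/hda1"})
print(line)  # → mount -t ext3 /dev/hda1 /mnt
```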

Configuration Parsing

Yaird has a fair number of
configuration items: templates containing a list of files and
trees, named shell script fragments with a value that spans
multiple lines. If future versions of the application are going
to be more flexible, the number of configuration items is only
going to grow. Somehow this information has to be passed to the
application; here is an overview of the options.

Configuration as part of the program. Simply hard-code
all configuration choices, and structure the program so that
the configuration part is a well defined part of the
program. The advantage is that there is no need for any
infrastructure; the disadvantage is that there is no clear
boundary where problems can be reported, and that it
requires the user to be familiar with the programming
language.

AppConfig.
A mature perl module that parses configuration files in a
format similar to Win32 "INI" files. Widely used, stable,
flexible, well-documented, with as added bonus the fact that
it unifies options given on the command line and in the
configuration file. An ideal solution, except for the fact
that we need a more complex configuration than can
conveniently be expressed in INI-file format.

An XML based configuration format. XML parsers for perl are
readily available. The advantage is that it's an industry
standard; the disadvantage that the markup can get very
verbose and that support for input validation is limited
(XML::LibXML mentions a binding for RelaxNG, but the code is
missing, and defining an input format in XML-Schema ... just
say no).

YAML is a data
serialisation format that is a lot more readable than XML.
The disadvantage is that it's not as widely known as XML,
that it's an indentation based language (so confusion over tabs
versus spaces can arise) and that support for input validation
is completely missing.

A custom-made configuration language, based on
Parse::RecDescent,
a widely used, mature module to do recursive descent parsing
in perl. Using a custom language means we can structure the
language to minimise opportunities for mistakes, can provide
relevant error messages, can support complex configuration
structures and can easily parse the configuration file to a tree
format that's suitable for further processing. The disadvantage
is that a custom language is yet another syntax to learn.

Building a recursive descent parser seems the best match for this
application.
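Parse::RecDescent generates such a parser from a declarative grammar; the underlying technique is one method per grammar rule, each consuming the tokens it matches. A hand-written sketch for a made-up miniature configuration language (this is illustrative, not yaird's actual syntax):

```python
import re

# A miniature configuration grammar (illustrative only):
#   config  := section*
#   section := NAME '{' pair* '}'
#   pair    := NAME '=' STRING
TOKEN = re.compile(r'\s*([A-Za-z_]\w*|\{|\}|=|"[^"]*")')

def tokenize(text):
    tokens, pos, text = [], 0, text.strip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError("bad token at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    """One method per grammar rule; here only 'config' is non-trivial."""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, wanted):
        got = self.take()
        if got != wanted:
            raise SyntaxError("expected %r, got %r" % (wanted, got))

    def config(self):
        # Parse straight into a tree suitable for further processing.
        sections = {}
        while self.peek() is not None:
            name = self.take()
            self.expect("{")
            pairs = {}
            while self.peek() not in ("}", None):
                key = self.take()
                self.expect("=")
                pairs[key] = self.take().strip('"')
            self.expect("}")
            sections[name] = pairs
        return sections

tree = Parser(tokenize('goodTemplates { shell = "/bin/dash" }')).config()
print(tree)  # → {'goodTemplates': {'shell': '/bin/dash'}}
```

Because each rule is an ordinary function, error messages can name exactly what was expected where, which is the "relevant error messages" advantage claimed above.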

Authors

This is a place holder section.
Yaird was written by ...
website here ... comments to ... bug reports ...

License

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public
License as published by the Free Software Foundation;
either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public
License along with this program; you may also obtain
a copy of the GNU General Public License
from the Free Software Foundation by visiting their Web site or by writing to

Klibc code

Yaird contains code based on klibc; this code is made available
by the author under the following licence. The relevant source
files have this copyright notice included.

/* ----------------------------------------------------------------------- *
*
* Copyright 2004 H. Peter Anvin - All Rights Reserved
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or
* sell copies of the Software, and to permit persons to whom
* the Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall
* be included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*
* ----------------------------------------------------------------------- */

[1]
Well, not really. I started this thingy to show off a small
algorithm to find required modules based on sysfs information.
To make that a credible demonstration, the small algorithm
turned out to need a lot of scaffolding to turn it into a
working program ...

[2]
An alternative and equally interesting exercise would be
an attempt to generate a universal initrd that could be
distributed together with the kernel. Such an image
would most likely be based on udev/hotplug.

[3]
Except where the distribution depends on it;
there are some issues with mdadm in Debian.

[4]
Having knowledge of the relation between module names and
kernel defines hardcoded into yaird
is hardly elegant. Perhaps it is possible to generate this
mapping based on the kernel Makefiles when building the
kernel, but that's too complex just now.

[5]
Having device files on the image is wrong: it will
break if the new kernel uses different device numbers. Mostly
this can be avoided by using the dev
files provided by sysfs, but there is a bootstrap problem:
the mount command needed to
access sysfs assumes /dev/null and
/dev/console are available.

[6]
The idea that the "ip=" kernel command line option
implies mounting an NFS root is debatable. Since
the only use of the network for now is mounting NFS
we can get away with it, and it simplifies passing
a DHCP supplied boot path to the NFS mount code.
If we find situations where IP is needed but NFS is
not, we'll have to trigger NFS mount when
"root=/dev/nfs".