This document describes Linux i386 boot code,
serving as a study guide and source commentary.
In addition to C-like pseudocode source commentary, it also presents
keynotes of toolchains and specs related to kernel development.
It is designed to help:

1. Introduction

This document serves as a study guide and source commentary for
Linux i386 boot code.

1.1. Copyright and License

This document, Linux i386 Boot Code HOWTO,
is copyrighted (c) 2003, 2004 by Feiyun Wang.
Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation
License, Version 1.2 or any later version published
by the Free Software Foundation; with no Invariant Sections,
with no Front-Cover Texts, and with no Back-Cover Texts.
A copy of the license is available at
http://www.gnu.org/copyleft/fdl.html.

Linux is a registered trademark of Linus Torvalds.

1.2. Disclaimer

No liability for the contents of this document can be accepted.
Use the concepts, examples and information at your own risk.
There may be errors and inaccuracies which could be damaging to
your system. Proceed with caution; although damage is highly
unlikely, the author(s) do not take any responsibility.

All copyrights are held by their respective owners,
unless specifically noted otherwise. Use of a term in this
document should not be regarded as affecting the validity of any
trademark or service mark. Naming of particular products or
brands should not be seen as endorsements.

1.4. Feedback

1.5. Translations

English is the only version available now.

2. Linux Makefiles

Before perusing Linux code, we should get some basic idea about
how Linux is composed, compiled and linked.
A straightforward way to achieve this goal is to understand Linux makefiles.
Check
Cross-Referencing Linux if you prefer online source browsing.

Rules.make contains rules which are shared
between multiple Makefiles.

2.2. linux/arch/i386/vmlinux.lds

After compilation, ld combines a number of
object and archive files, relocates their data and
ties up symbol references.
linux/arch/i386/vmlinux.lds is designated by
linux/Makefile as the linker script used
in linking the resident kernel image linux/vmlinux.

piggy.o contains
variable input_len
and gzipped linux/vmlinux.
input_len is at the beginning of
piggy.o, and it is equal to the size of
piggy.o excluding
input_len itself. Refer to
Using LD, the GNU linker: Section Data Expressions
for "LONG(expression)" in piggy.o linker script.

To be exact, it is not linux/vmlinux itself
(in ELF format) that is gzipped but its binary image,
which is generated by the objcopy command.
Note that $(OBJCOPY) has been redefined by
linux/arch/i386/Makefile in
Section 2.3 to output raw binary
using the "-O binary" option.

When linking {bootsect, setup} or
{bbootsect, bsetup}, $(LD) specifies
"--oformat binary" option to output them in binary format.
When making zImage (or bzImage),
$(OBJCOPY) generates an intermediate binary output from
compressed/vmlinux
(or compressed/bvmlinux) too.
It is vital that all components in zImage or
bzImage are in raw binary format,
so that the image can run by itself without asking a loader
to load and relocate it.

Both vmlinux and bvmlinux
prepend head.o and misc.o
before piggy.o,
but they are linked against different start addresses (0x1000 vs 0x100000).

2.6. linux/arch/i386/tools/build.c

linux/arch/i386/tools/build.c is a host utility to
generate zImage or bzImage.

2.7. Reference

3. linux/arch/i386/boot/bootsect.S

Given that we are booting up bzImage, which is
composed of bbootsect, bsetup
and bvmlinux (head.o, misc.o, piggy.o),
the first floppy sector, bbootsect (512 bytes),
which is compiled from linux/arch/i386/boot/bootsect.S,
is loaded by BIOS to 07C0:0.
The rest of bzImage (bsetup
and bvmlinux) has not been loaded yet.

Make sure SP is initialized immediately after SS register.
The recommended method of modifying SS is to use "lss" instruction
according to
IA-32 Intel Architecture Software Developer's Manual
(Vol.3. Ch.5.8.3. Masking Exceptions and Interrupts When Switching Stacks).

Stack operations, such as push and pop, will be OK now.
The first 12 bytes of the disk parameter table have been copied to INITSEG:3FF4.

"lodsb" loads a byte from DS:[SI] to AL and increases SI automatically.

The number of sectors per track has been saved in variable
sectors.

3.3. Load Setup Code

bsetup (setup_sects sectors)
will be loaded right after bbootsect, i.e. SETUPSEG:0.
Note that INITSEG:0200==SETUPSEG:0 and
setup_sects has been changed
by tools/build to match
bsetup size
in Section 2.6.

3.7. Bootsect Helper

setup.S:bootsect_helper() is only used by
bootsect.S:read_it().

Because bbootsect and bsetup
are linked separately, they use offsets relative to
their own code/data segments.
We have to "call far" (lcall) to reach bootsect_helper()
in a different segment, and it must "return far" (lret) afterwards.
The far call changes CS, which makes CS!=DS, so
we have to use a segment override to access variables in
setup.S.

This "header" must conform to the layout pattern in
linux/Documentation/i386/boot.txt:

Offset/Size  Proto  Name         Meaning
01F1/1       ALL    setup_sects  The size of the setup in sectors
01F2/2       ALL    root_flags   If set, the root is mounted readonly
01F4/2       ALL    syssize      DO NOT USE - for bootsect.S use only
01F6/2       ALL    swap_dev     DO NOT USE - obsolete
01F8/2       ALL    ram_size     DO NOT USE - for bootsect.S use only
01FA/2       ALL    vid_mode     Video mode control
01FC/2       ALL    root_dev     Default root device number
01FE/2       ALL    boot_flag    0xAA55 magic number

3.9. Reference

As <IA-32 Intel Architecture Software Developer's Manual>
is widely referenced in this document, I will call it "IA-32 Manual"
for short.

4. linux/arch/i386/boot/setup.S

setup.S is responsible for getting the system data
from the BIOS and putting them into appropriate places in system memory.

Other boot loaders, like
GNU GRUB and
LILO,
can load bzImage too.
Such boot loaders should load bzImage into memory
and set up the "real-mode kernel header",
especially type_of_loader, then pass control
to bsetup directly.
setup.S assumes:

bsetup or setup may not be
loaded at SETUPSEG:0, i.e. CS may not be equal to SETUPSEG
when control is passed to setup.S;

The first 4 sectors of setup
are loaded right after bootsect.
The rest may be loaded at SYSSEG:0, preceding
vmlinux;
This assumption does not apply to bsetup.

4.1. Header

/* Signature words to ensure LILO loaded us right */
#define SIG1 0xAA55
#define SIG2 0x5A5A
INITSEG = DEF_INITSEG # 0x9000, we move boot here, out of the way
SYSSEG = DEF_SYSSEG # 0x1000, system loaded at 0x10000 (65536).
SETUPSEG = DEF_SETUPSEG # 0x9020, this is the current segment
# ... and the former contents of CS
DELTA_INITSEG = SETUPSEG - INITSEG # 0x0020
.code16
.text
///////////////////////////////////////////////////////////////////////////////
start:
{
goto trampoline(); // skip the following header
}
# This is the setup header, and it must start at %cs:2 (old 0x9020:2)
.ascii "HdrS" # header signature
.word 0x0203 # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
realmode_swtch: .word 0, 0 # default_switch, SETUPSEG
start_sys_seg: .word SYSSEG
.word kernel_version # pointing to kernel version string
# above section of header is compatible
# with loadlin-1.5 (header v1.5). Don't
# change it.
// kernel_version defined below
type_of_loader: .byte 0 # = 0, old one (LILO, Loadlin,
# Bootlin, SYSLX, bootsect...)
# See Documentation/i386/boot.txt for
# assigned ids
# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
LOADED_HIGH = 1 # If set, the kernel is loaded high
CAN_USE_HEAP = 0x80 # If set, the loader also has set
# heap_end_ptr to tell how much
# space behind setup.S can be used for
# heap purposes.
# Only the loader knows what is free
#ifndef __BIG_KERNEL__
.byte 0
#else
.byte LOADED_HIGH
#endif
setup_move_size: .word 0x8000 # size to move, when setup is not
# loaded at 0x90000. We will move setup
# to 0x90000 then just before jumping
# into the kernel. However, only the
# loader knows how much data behind
# us also needs to be loaded.
code32_start: # here loaders can put a different
# start address for 32-bit code.
#ifndef __BIG_KERNEL__
.long 0x1000 # 0x1000 = default for zImage
#else
.long 0x100000 # 0x100000 = default for big kernel
#endif
ramdisk_image: .long 0 # address of loaded ramdisk image
# Here the loader puts the 32-bit
# address where it loaded the image.
# This only will be read by the kernel.
ramdisk_size: .long 0 # its size in bytes
bootsect_kludge:
.word bootsect_helper, SETUPSEG
heap_end_ptr: .word modelist+1024 # (Header version 0x0201 or later)
# space from here (exclusive) down to
# end of setup code can be used by setup
# for local heap purposes.
// modelist is at the end of .text section
pad1: .word 0
cmd_line_ptr: .long 0 # (Header version 0x0202 or later)
# If nonzero, a 32-bit pointer
# to the kernel command line.
# The command line should be
# located between the start of
# setup and the end of low
# memory (0xa0000), or it may
# get overwritten before it
# gets read. If this field is
# used, there is no longer
# anything magical about the
# 0x90000 segment; the setup
# can be located anywhere in
# low memory 0x10000 or higher.
ramdisk_max: .long __MAXMEM-1 # (Header version 0x0203 or later)
# The highest safe address for
# the contents of an initrd

The __MAXMEM definition in
linux/asm-i386/page.h:

/*
* A __PAGE_OFFSET of 0xC0000000 means that the kernel has
* a virtual address space of one gigabyte, which limits the
* amount of physical memory you can use to about 950MB.
*/
#define __PAGE_OFFSET (0xC0000000)
/*
* This much address space is reserved for vmalloc() and iomap()
* as well as fixmap mappings.
*/
#define __VMALLOC_RESERVE (128 << 20)
#define __MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE)

It gives __MAXMEM = 1G - 128M.

The setup header must follow some layout pattern.
Refer to linux/Documentation/i386/boot.txt:

"hlt" instruction stops instruction execution and places the processor
in halt state.
The processor generates a special bus cycle to indicate that
halt mode has been entered.
When an enabled interrupt (including NMI) is issued,
the processor will resume execution after the "hlt" instruction,
and the instruction pointer (CS:EIP), pointing to the instruction
following the "hlt", will be saved to stack
before the interrupt handler is called.
Thus we need a "jmp" instruction after the "hlt" to put the processor
back to halt state again.

The setup code has been moved to the correct place.
Variable start_sys_seg points to
where real system code starts.
If "bad_sig" does not happen, start_sys_seg
remains SYSSEG.

Note that code32_start is initialized to
0x1000 for zImage, or
0x100000 for bzImage.
The code32 value will be used in passing control to
linux/arch/i386/boot/compressed/head.S in
Section 4.9.
If we boot up zImage, setup.S relocates
vmlinux to 0100:0;
if we boot up bzImage,
bvmlinux remains at start_sys_seg:0.
The relocation address must match the "-Ttext" option in
linux/arch/i386/boot/compressed/Makefile.
See Section 2.5.

Then it will relocate code from CS-DELTA_INITSEG:0
(bbootsect and bsetup)
to INITSEG:0, if necessary.

The far "jmp" instruction (0xea) updates CS register.
The contents of the remaining segment registers (DS, SS, ES, FS and GS)
should be reloaded later.
The operand-size prefix (0x66) forces the "jmp" to use
the 32-bit operand code32.
For operand-size prefix details, check IA-32 Manual
(Vol.1. Ch.3.6. Operand-size and Address-size Attributes, and
Vol.3. Ch.17. Mixing 16-bit and 32-bit Code).

Control is passed to
linux/arch/i386/boot/compressed/head.S:startup_32.
For zImage, it is at address 0x1000;
for bzImage, it is at 0x100000.
See Section 5.

ESI points to the memory area of collected system data.
It is used to pass parameters from the 16-bit real mode code of the kernel
to the 32-bit part.
See linux/Documentation/i386/zero-page.txt
for details.

4.11. Reference

Summary of empty_zero_page layout (kernel point of view):
linux/Documentation/i386/zero-page.txt

5. linux/arch/i386/boot/compressed/head.S

We are in bvmlinux now!
With the help of misc.c:decompress_kernel(),
we are going to decompress piggy.o
to get the resident kernel image linux/vmlinux.

This file contains pure 32-bit startup code.
Unlike the previous two files, it has no ".code16" statement in the source file.
Refer to
Using as: Writing 16-bit Code for details.

5.1. Decompress Kernel

The segment base addresses in the segment descriptors (which correspond to
segment selectors __KERNEL_CS and __KERNEL_DS) are equal to 0;
therefore, the logical address offset (in segment:offset format) will
be equal to its linear address if either of these segment selectors
is used.
For zImage, CS:EIP is at logical address 10:1000
(linear address 0x1000) now;
for bzImage, 10:100000 (linear address 0x100000).

5.2. gunzip()

decompress_kernel() calls
gunzip() -> inflate(), which are defined in
linux/lib/inflate.c,
to decompress the resident kernel image to
the low buffer (pointed to by output_data) and
the high buffer (pointed to by high_buffer_start, for
bzImage only).

We can see that the gzipped file begins at 0x4c50 in the above example.
The four bytes before "1f 8b 08 00" are input_len
(0x0011011e, in little endian), and 0x4c50+0x0011011e=0x114d6e equals
the size of bzImage
(/boot/vmlinuz-2.4.20-28.9).

When get_byte(), defined in
linux/arch/i386/boot/compressed/misc.c,
is called for the first time,
it calls fill_inbuf() to set up the input buffer
inbuf=input_data and
insize=input_len.
Symbols input_data and
input_len are defined in
the piggy.o linker script.
See Section 2.5.

free_mem_ptr is used in
misc.c:malloc() for dynamic memory allocation.
Before inflating each compressed block, gzip_mark()
saves the value of free_mem_ptr;
after inflation, gzip_release() will
restore this value.
This is how the memory allocated in
inflate_block() gets "freed".

Gzip uses
Lempel-Ziv coding (LZ77) to compress files.
The compressed data format is specified in
RFC 1951.
inflate_block() will inflate compressed blocks,
which can be treated as a bit sequence.

Note that data elements are packed into bytes starting from
Least-Significant Bit (LSB) to Most-Significant Bit (MSB), while
Huffman codes are packed starting with MSB.
Also note that literal/length values 286-287 and
distance codes 30-31 will never actually occur.

With the above data structure in mind and RFC 1951 at hand,
it is not too hard to understand inflate_block().
Refer to related paragraphs in RFC 1951 for Huffman coding and
alphabet table generation.

From a software point of view, in a multiprocessor system, BSP and APs
share the physical memory but use their own register sets.
BSP runs the kernel code first, sets up the OS execution environment and
triggers APs to run on it too.
An AP sleeps until BSP kicks it.

As pg0 is at offset 0x2000 of section
.text in
linux/arch/i386/kernel/head.o,
which is the first file to be linked for linux/vmlinux,
it will be at offset 0x2000 in output section .text.
Thus it will be located at address 0xC0000000+0x100000+0x2000 after linking.

In protected mode without paging enabled, linear address will be
mapped directly to physical address.
"movl $pg0-__PAGE_OFFSET,%edi" will set EDI=0x102000,
which is equal to the physical address of pg0
(as linux/vmlinux is relocated to 0x100000).
Without this "-__PAGE_OFFSET" adjustment, the instruction would access physical address
0xC0102000, which will be wrong and probably beyond RAM space.

mmu_cr4_features is in .bss
section and is located at physical address 0x376404 in the above example.

Page directory swapper_pg_dir (see definition in
Section 6.5), together with
page tables pg0 and pg1,
defines that both linear address 0..8M-1 and 3G..3G+8M-1 are mapped to
physical address 0..8M-1.
We can access kernel symbols without "-__PAGE_OFFSET" from now on,
because kernel space (which resides at linear addresses >= 3G) will
be correctly mapped to its physical address after paging is enabled.

"lss stack_start,%esp" (SS:ESP = *stack_start)
is the first example to reference a symbol without "-__PAGE_OFFSET",
which sets up a new stack.
For BSP, the stack is at the end of init_task_union.
For AP, stack_start.esp has been redefined by
linux/arch/i386/kernel/smpboot.c:do_boot_cpu() to be
"(void *) (1024 + PAGE_SIZE + (char *)idle)" in
Section 8.2.

The first CPU (BSP) will call
linux/init/main.c:start_kernel() and
the others (AP) will call
linux/arch/i386/kernel/smpboot.c:initialize_secondary().
See start_kernel() in Section 7
and initialize_secondary() in
Section 8.4.

init_task_union happens to be the task struct
for the first process, "idle" process (pid=0), whose stack grows
from the tail of init_task_union.
The following is the code related to init_task_union:

6.6. Reference

7. linux/init/main.c

I felt guilty writing this chapter as there are too many documents
about it, if not more than enough.
start_kernel() supporting functions
are changed from version to version, as they depend on
OS component internals, which are being improved all the time.
I may not have the time for frequent document updates,
so I decided to keep this chapter as simple as possible.

7.4. Reference

8. SMP Boot

There are a few SMP related macros, like CONFIG_SMP,
CONFIG_X86_LOCAL_APIC, CONFIG_X86_IO_APIC, CONFIG_MULTIQUAD
and CONFIG_VISWS.
I will ignore code that requires CONFIG_MULTIQUAD
or CONFIG_VISWS,
which most people don't care about (unless they use an IBM high-end multiprocessor
server or an SGI Visual Workstation).

IPI (InterProcessor Interrupt), CPU-to-CPU interrupt through local APIC,
is the mechanism used by BSP to trigger APs.

Be aware that "one local APIC per CPU is required" in an
MP-compliant system.
Processors do not share APIC local units address space (physical address
0xFEE00000 - 0xFEEFFFFF), but will share APIC I/O units
(0xFEC00000 - 0xFECFFFFF).
Both address spaces are uncacheable.

Don't confuse start_secondary() with
trampoline_data().
The former is AP "idle" process task struct EIP value, and the latter is
the real-mode code that AP runs after BSP kicks it
(using wakeup_secondary_via_INIT()).

8.3. linux/arch/i386/kernel/trampoline.S

This file contains the 16-bit real-mode AP startup code.
BSP reserved memory space trampoline_base in
start_kernel() -> setup_arch() -> smp_alloc_memory().
Before BSP triggers AP, it copies the trampoline code, between
trampoline_data and
trampoline_end,
to trampoline_base
(in do_boot_cpu() -> setup_trampoline()).
BSP sets up 0:467 to point to trampoline_base,
so that AP will run from here.

Note that BX=1 when AP jumps to
linux/arch/i386/kernel/head.S:startup_32(),
which is different from that of BSP (BX=0).
See Section 6.

8.4. initialize_secondary()

Unlike BSP, at the end of
linux/arch/i386/kernel/head.S:startup_32()
in Section 6.4,
AP will call initialize_secondary() instead of
start_kernel().

/* Everything has been set up for the secondary
* CPUs - they just need to reload everything
* from the task structure
* This function must not return. */
void __init initialize_secondary(void)
{
/* We don't actually need to load the full TSS,
* basically just the stack pointer and the eip. */
asm volatile(
"movl %0,%%esp\n\t"
"jmp *%1"
:
:"r" (current->thread.esp),"r" (current->thread.eip));
}

As BSP called do_boot_cpu() to set
thread.eip to start_secondary(),
control of AP is passed to this function.
AP uses a new stack frame, which was set up by BSP in
do_boot_cpu() -> fork_by_hand() -> do_fork().

8.5. start_secondary()

All APs wait for signal smp_commenced from BSP,
triggered in smp_init() -> smp_commence() (see Section 8.2).
After getting this signal, they will run "idle" processes.

C. GRUB and LILO

Both GNU GRUB and
LILO
understand the real-mode kernel header format and will load
the bootsect (one sector), setup code
(setup_sects sectors) and
compressed kernel image (syssize*16 bytes) into memory.
They fill out the loader identifier (type_of_loader)
and try to pass appropriate parameters and options to the kernel.
After they finish their jobs, control is passed to setup code.

map_add(), map_add_sector() and
map_add_zero() may call
map_register() to complete their jobs,
while map_register() will keep a list for
all (CX, DX, AL) triplets (data structure SECTOR_ADDR) used to
identify all registered sectors.