Secrets Revealed

To most, the Linux kernel is rife with mystery, a thing of unfathomable
complexity. It is with fear and great caution that most Linux users approach the kernel, daring not
to disturb the beasts within. Many users are content to simply say a prayer and proceed with the
occasional upgrade or superstitious rebuild. To be sure, the internals of the Linux kernel are not
for the faint-hearted, but deciphering the myriad secrets which the kernel contains is not as daunting a task as one might
imagine. The basic ideas are easy to grasp, and you are helped greatly by the fact that all of the
source code is at your disposal. Exploring the kernel’s internals, no matter how superficially, presents the opportunity to prod within the viscera of a
modern operating system, thereby obtaining what we can only call “kernel knowledge.” (I can hear the
groans already!)

In this article, I’d like to uncover some of the mysteries of the Linux kernel and explain its
innards from a very high-level perspective, and in the most intuitive manner possible. I will point
out the portions of the kernel source code which correspond to the ideas described herein. Before we
dive in, however, I must begin with an important disclaimer: this article greatly oversimplifies
many important details! If you are a rugged and seasoned kernel hacker, I beg that you excuse this
travesty (in fact, you should probably be reading something else). In order for things to make sense
at this level, sometimes I must resort to a few creative lies.

What is the kernel anyway?

Put simply, the kernel is the Linux operating system itself. The term “kernel” originates from
the early days of operating systems design, when the operative metaphor was one of a system which
consisted of many layers. User applications were on the outside and the operating system core
(hence, “kernel”) was in the center. This is still a fairly accurate representation, so the term
persists.

The kernel is responsible for controlling access to all of the machine’s resources: CPU, memory,
disk, network interfaces, graphics boards, keyboards, mice, and so forth. It’s the kernel’s job to
make these resources available to applications (such as Emacs, the GIMP, etc.) that wish to use them
(this is referred to as “multiplexing” system resources), and to prevent individual applications
from interfering with one another. Many of the hardware resources controlled by the kernel are
thought of as peripheral devices — such as disks, network interfaces, and so forth; we use the term
“device drivers” to describe those parts of the kernel which interface such devices to the
system.

So, the kernel has two primary jobs: to multiplex system resources and to protect applications
from interfering with one another’s use of those resources. The most straightforward example of
multiplexing is what most people call multitasking: allowing multiple applications (or “tasks”) to
share the same physical CPU, but giving each application the illusion that it has the entire CPU to
itself. Most modern operating systems (including Linux, all variants of UNIX, and even Windows 98)
provide some form of multitasking. An example of resource protection is the way in which the kernel
prevents two applications from reading or writing each other’s memory; for example, it shouldn’t be
possible for Emacs to corrupt the memory being used by the GIMP running on the same system. It’s the
kernel’s job to ensure that this is the case.

As it turns out, the kernel gets a lot of help from the system hardware when it comes to
multiplexing and memory protection. For example, the Intel x86 CPU architecture (everything from the
80386 on up) includes support for memory protection and CPU multitasking; in fact, without some
hardware support, it is very difficult to do these things. The Linux kernel relies on the fact that
the CPU will tell it when an application has made a bad memory reference (one which might be a
protection violation). Without this signal (called a “page fault”), the kernel would have no way to
enforce memory protection between processes.
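You can see this enforcement from user space. The toy program below (a hypothetical demo; fault_signal() is my own name, not a system routine) forks a child which touches an address that isn't mapped in its page tables; the kernel turns the resulting page fault into a SIGSEGV:

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork a child that touches an address its page tables don't map,
   and return the signal the kernel kills it with */
int fault_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        volatile int *bad = (volatile int *)0x1;  /* not mapped */
        *bad = 42;      /* the CPU reports a page fault here... */
        _exit(0);       /* ...so the child never reaches this line */
    }
    int status;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling fault_signal() from main() should hand back SIGSEGV: the parent survives, and only the misbehaving child is killed.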

What all of this technical mumbo-jumbo boils down to is that a Linux system can have many
separate applications running simultaneously, sharing the system’s CPU, memory, and other resources,
and it’s impossible for these applications to interfere with each other or otherwise cause damage to
the system. All of this is very good for application programmers, who can write programs with the
knowledge that their code is “fenced in” by the kernel, making it unlikely that a program which goes
haywire could cause any real harm.

The kernel structure

The Linux kernel consists of a number of components working together: there’s the virtual memory
subsystem (which implements both memory protection and “paging”, which allows disk to be used in
place of physical memory); the scheduler (which multiplexes the CPU across multiple applications);
the file systems (including the Linux ext2fs, NFS, ISO-9660, MS-DOS FAT, and other filesystem types);
the networking code (including the TCP/IP protocol stack, as well as code for PPP, SLIP, AppleTalk,
and other protocols); as well as a horde of device drivers for everything ranging from serial ports
to disk controllers to network cards.

Structurally, everything in the kernel is compiled together as one big program which is started
when the system boots. Because of this, Linux (as with other UNIX systems) is sometimes referred to
as a “monolithic” kernel design, as opposed to a “microkernel”-based system. In a microkernel-based
system, the OS is composed of a number of separate programs, each of which is structurally
independent of the others. Linux does have a mechanism by which new pieces of kernel code can be
added to the system dynamically — using so-called loadable kernel modules. However, once a kernel
module is loaded, it really becomes part of the “one big program”, no different than any other
kernel code.

Figure 1 shows the overall structure of the kernel, which sits between the user applications and
the system hardware. User applications and the kernel share the CPU and system memory; we say that
the applications live in “user space” while the kernel resides in “kernel space”. These “spaces”
imply more than just physical separation; they also refer to the privilege that each has. In short,
user applications are only able to access their limited memory space, a certain percentage of
overall CPU time, and so forth, while the kernel has the ultimate privilege to access any hardware
device, read or write any memory address, consume as much CPU time as it requires, and so forth.
This privilege distinction is important because it is this power which gives the kernel the ability
to protect and multiplex system resources between user processes. User-space code, on the other
hand, is subject to the limits placed upon it by the kernel.

In Figure 1, blue lines connecting the various components within the kernel (and to hardware
devices) indicate that those components directly interact in some way. For example, the TCP/IP stack
sends network packets through either the TCP or UDP code path, but both types of packets are
eventually handled by the IP layer. In this figure, “VFS” stands for the Virtual Filesystem layer,
which abstracts away the details of the particular filesystem types (such as ext2fs and ISO-9660, as
shown) from user applications. This means that applications need not know what type of filesystem is
being accessed when a file is opened, read, written, and so forth. “IPC” stands for Interprocess
Communication and includes various mechanisms user processes employ to “talk” to each other and
coordinate their activity. The component labeled “SMP” is the symmetric multiprocessing support
in the Linux kernel, which enables the use of systems with multiple CPUs.

Basic kernel operation: responding to events

I said above that the kernel is like “one big program” which manages system resources. Clearly,
however, it is different from other (user-space) programs, because the kernel is responsible for
allowing all other programs to run. How does this work?

One can think of the kernel as a large chunk of code which is always sitting in memory, ready to
be executed whenever it’s required. Most of the time, the CPU is busy executing user applications,
but occasionally, the kernel kicks in. This can happen in a number of cases:

* When a user application invokes a system call;

* When a hardware device issues an interrupt;

* When the timer interrupt goes off (really a special case of the above);

* When a page fault occurs.

You’ll notice that all of these cases are effectively external events which the kernel responds
to. So, during normal operation, the system is executing user space code — but the kernel is
lurking in the background, ready to respond to these events.

A system call is the most common way in which the kernel is invoked; this is a function which a
user application can call when it requires services from the kernel. Common examples include:

* Allocating memory (with the brk system call, which is called by the malloc C library
routine);

* Starting a new process (with the fork and exec system calls).

There are plenty of other system calls; the complete list can be found in the file /usr/include/asm/unistd.h on most systems.
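To make this concrete, here is a small userspace sketch of the fork and exec pair (run_fork_exec() and the /bin/echo path are my own choices for illustration, not anything from the kernel):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* duplicate this process, replace the child's image with /bin/echo,
   and report the child's exit status to the caller */
int run_fork_exec(void)
{
    pid_t pid = fork();          /* system call: clone the process */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* child: replace our image with a new program */
        execl("/bin/echo", "echo", "hello from the child", (char *)NULL);
        _exit(127);              /* only reached if exec failed */
    }
    int status;
    waitpid(pid, &status, 0);    /* parent: wait for the child */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that fork() and execl() each drop into the kernel via the trap mechanism described below; the C library versions are thin wrappers around the raw system calls.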

Because system calls invoke privileged code within the kernel, they’re unlike other library
routines which can be called by, say, a C program. When a system call is invoked by an application,
a special mechanism, called a “trap”, is used to make the transition between the user program and
the kernel code. The trap, among other things, guarantees that only particular system call entry
points within the kernel can be called by the user application — that is, the application is unable
to invoke arbitrary code inside of the kernel. This is accomplished by using a function lookup
table: the application specifies the system call number which it wishes to invoke, and this number
is used as an index into a table of allowable system calls. The actual code for this trap (for x86
systems) is found in arch/i386/kernel/entry.S in the kernel source tree.
It is coded in assembly language because it needs to do some fairly esoteric mucking about with
processor registers! (Most of the kernel is, however, coded in C. Only a few specialized routines
need to be implemented as assembly code.)

Interrupts are the other major source of user-to-kernel transitions. When a hardware device
requires the operating system to service it, it may issue an interrupt — a special hardware signal
which causes the CPU to immediately jump to a special routine, inside of the kernel, called the
interrupt handler. (On the Intel x86 architecture, a hardware interrupt is sometimes called an
“IRQ”, or “interrupt request”. There are sixteen “IRQ lines” on x86 systems which are given the
names IRQ 0 through IRQ 15. For example, when a hardware board is configured to use IRQ 5, that
means it will signal the kernel on interrupt line 5.) The interrupt handler is responsible for
servicing the hardware device which caused the interrupt (for example, a network board might trigger
an interrupt when a new packet arrives from the network); the interrupt handler would transfer the
new packet from the board to the kernel’s TCP/IP code.

One important interrupt is the timer or clock interrupt. This interrupt, which is on IRQ 0, is
triggered by the system’s hardware clock, and is programmed to fire 100 times a second. The timer
interrupt is important as it is responsible for triggering many time-based events in the system, not
the least of which is CPU scheduling. Every time the timer interrupt goes off, the scheduler might
need to choose a different process to run on the CPU. This is what ensures that all programs will
have a fair chance to run, and that no one process can hog the CPU indefinitely.
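As an aside, the kernel counts these ticks in a variable traditionally called jiffies; with a 100 Hz clock, converting ticks into wall-clock time is trivial (a throwaway sketch, not actual kernel code):

```c
/* On x86 Linux the timer fires HZ = 100 times per second,
   so each tick ("jiffy") is 10 milliseconds. */
#define HZ 100

/* convert a count of timer ticks to milliseconds */
long ticks_to_ms(long ticks)
{
    return ticks * (1000 / HZ);
}
```

So one second of wall-clock time corresponds to 100 ticks, each an opportunity for the scheduler to pick a new process.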

Page faults are caused by a CPU access to memory which is invalid for some reason. (Here, “page”
refers to a page of memory, which on the Intel x86 architecture is 4 kilobytes. The hardware only
deals with memory in page-sized chunks.) For example, if a user process attempts to read or write
data outside of the memory allocated to it by the kernel, a page fault will occur, and the kernel
will respond by killing the process (accompanied by the infamous error message, “segmentation
fault”). However, not all page faults are so violent in nature; in fact, many are necessary for
normal system operation. For example, if a process is consuming too much memory, some of its data
may be swapped (or “paged”) temporarily out to disk. When the process later attempts to access that
data, a page fault will occur. This time, however, the kernel can service the fault by reading the
swapped-out page back from disk — resuming the process where it left off once the page has been
retrieved. In this case the user process has no idea that this memory shuffling has taken place
behind its back!

As you can see, then, the kernel is primarily responsible for servicing all of the various
events which might cause it to jump to action. For this reason, reading kernel code for the first
time is often a bit tricky, because it’s not obvious where in the code those various events are
being handled. Here are a couple of hints:

* arch/i386/mm/fault.c contains the default page fault handler do_page_fault(). This routine figures out the source of the fault and calls
another handler to take care of it.

* arch/i386/kernel/entry.S contains the (assembly language) code for
dealing with system calls. The line ENTRY(system_call) is where all the fun
begins.

* arch/i386/kernel/irq.c contains the code which sets up interrupt
handlers — the tricky bit is that this is actually done by the BUILD_IRQ
macro called at the top of this file. The definition for this macro is contained in include/asm-i386/irq.h, which turns the
macro into a bunch of inline assembly code.

Device drivers use the request_irq and free_irq kernel routines to associate a C function with an interrupt handler; arch/i386/kernel/irq.c has the definitions for those routines as well. For
example, in drivers/net/tulip.c, a search for request_irq reveals that this driver associates the board’s IRQ with the routine tulip_interrupt(). Whenever the board interrupts the CPU, the tulip_interrupt() routine will be called.
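Under the hood, this registration amounts to little more than a table of function pointers indexed by IRQ number. Here is a toy userspace model of the idea (the toy_ names are invented; the real request_irq takes several more arguments):

```c
#include <stddef.h>

#define NR_IRQS 16      /* the sixteen IRQ lines on an x86 system */

typedef void (*irq_handler_t)(int irq);
static irq_handler_t handlers[NR_IRQS];

static int packets;     /* bumped by our sample handler below */

static void sample_handler(int irq)
{
    (void)irq;
    packets++;          /* pretend we pulled a packet off a board */
}

/* toy analogue of request_irq(): claim an IRQ line for a C function */
int toy_request_irq(int irq, irq_handler_t handler)
{
    if (irq < 0 || irq >= NR_IRQS || handlers[irq] != NULL)
        return -1;      /* bad line, or line already claimed */
    handlers[irq] = handler;
    return 0;
}

/* toy analogue of free_irq(): release the line again */
void toy_free_irq(int irq)
{
    handlers[irq] = NULL;
}

/* what the low-level interrupt stubs boil down to: dispatch */
void toy_do_irq(int irq)
{
    if (irq >= 0 && irq < NR_IRQS && handlers[irq] != NULL)
        handlers[irq](irq);
}
```

A second toy_request_irq() on the same line fails, mirroring the way the kernel refuses to let two drivers silently claim one interrupt.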

Now that we’ve covered the high-level view of the Linux kernel and how it works, let’s delve
deeper into the morass and look at one of the most important aspects of the system: virtual
memory.

Virtual Memory

Earlier, I said that the virtual memory subsystem in the kernel was responsible for memory
protection (preventing processes from interfering with each other’s memory) and paging (allowing
memory pages to be temporarily moved out to disk). This is easily one of the most complex aspects of
the Linux kernel design, and it bears taking a closer look at. Understanding how virtual memory
works in Linux is the key to understanding many other features, such as the filesystem layer.

This is a topic which has consumed a fair number of pages in both intermediate and advanced
operating systems textbooks (not to mention a slew of research papers and articles in magazines such
as Glamour), so I can’t expect to faithfully cover the bulk of virtual memory
management in this article. Hopefully, though, I can give you a clear picture of what the various
pieces are so you can look into them in more detail elsewhere. Different operating systems implement
virtual memory management in very different ways, so my treatment here will be particular to Linux.
It’s also helpful to focus on a particular hardware architecture, because Linux relies heavily (as
do other operating systems) on the memory-management support provided by the hardware. In this case,
I’ll focus on the Intel x86, but the discussion is similar for other systems.

The essence of the term virtual memory is that an application is given the illusion
that it has access to a much larger amount of memory than is actually present on the system.
Ideally, we’d like the application to believe it can access the entire range of memory addresses
allowed by the CPU — which on the Intel x86 is 2^32 bytes, or 4 gigabytes of virtual
memory. (This is because every pointer, or “virtual address”, is stored in 32 bits and each bit only
has 2 possible values: 0 or 1. 2^32, then, is the range of virtual memory which can be
addressed by a single pointer. We can write this range, in hexadecimal, as 0x00000000 to
0xffffffff.) Very few systems, however, have 4 gigabytes of physical memory installed. In addition,
multiple applications may be running on the system at once, and we’d like each application to be
able to access the entire 4-gigabyte “virtual address space” without interfering with one another.
How are we going to accomplish this?

Luckily, the Intel x86 architecture (as well as most other modern CPU architectures) includes
hardware support for implementing virtual memory, in the form of a “memory management unit”, or MMU.
The MMU is responsible for translating virtual memory accesses made by user programs into physical
memory accesses of the actual RAM in the machine. Of course, the MMU needs help from the operating
system to do this — which is where the Linux kernel virtual memory subsystem comes into play.
Before we get into all of that, however, let’s look at what happens when a user program reads or
writes an address in virtual memory.

Translating virtual addresses

Let’s say that a user process (such as Emacs) wants to read or write the memory at virtual
address 0x48b32f00 (this address is meaningless to a human being, of course, but to Emacs it might
represent the current position of the cursor in the window). In order to translate this memory
reference into an address in physical memory, the MMU uses a special set of structures, called the
page tables. These specify for each page (4 kilobytes) of virtual memory what physical memory
address (if any) should be used. We can think of a particular page table entry as containing the
following information:

virtual page address -> physical page address (+ other information)

So, there might be an entry in the page tables which looks like:

0x48b32000 -> 0x0028a000 (+ other information)

Note that the MMU deals with memory in page-sized chunks, which (among other things) reduces the
size of the page tables themselves. Since a page is 4 kilobytes, the last 3 digits (in hexadecimal)
of the virtual and physical page addresses are always “000”. The last three digits of an address
(such as 0x48b32f00) specify the offset into the page which the user is accessing; in this case that
offset is 0xf00. The MMU adds the page offset to the physical page address found in the page tables
to produce a complete physical memory address — voila!
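The split-and-combine arithmetic is easy to model in a few lines of C (a sketch of the idea only; real page table entries also carry permission bits and other state):

```c
#include <stdint.h>

#define PAGE_MASK 0xfffff000u   /* 4 KB pages: low 12 bits are offset */

/* the page a virtual address lives on, and the offset within it */
uint32_t page_of(uint32_t vaddr)   { return vaddr & PAGE_MASK; }
uint32_t offset_of(uint32_t vaddr) { return vaddr & ~PAGE_MASK; }

/* given the physical page the page tables hand back, the MMU just
   tacks the page offset onto it to form the physical address */
uint32_t translate(uint32_t vaddr, uint32_t phys_page)
{
    return phys_page | offset_of(vaddr);
}
```

Feeding in the example above, page_of(0x48b32f00) is 0x48b32000, the offset is 0xf00, and combining with the physical page 0x0028a000 yields 0x0028af00.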

Certainly not every virtual address in the entire 4 GB range corresponds to a unique physical
memory address — unless one has 4 GB of memory installed. So, some of these page table entries
might contain a physical page address which is marked as “invalid”, meaning that there is no
physical page corresponding to the virtual page. When an invalid entry is read by the MMU, a page
fault occurs. We’ll talk more about this case later.

Page tables and the TLB

Note that the page tables are actually stored in memory themselves. This raises an interesting
issue, as it means that the MMU might have to consult the page tables in order to look up something
else in the page tables. This issue is exacerbated by the fact that there is not just a single page
table — rather, there are three levels of page tables which must be traversed by the MMU in order
to translate a single virtual address into a physical address. This is done for space reasons; on
the Intel x86, a single page table entry actually consumes four bytes. If we have one page table
entry per page of virtual address space, that’s ((4 GB/4 KB) * 4 bytes) = 4 megabytes of memory,
just to hold the page tables for a single user process. Using multiple levels of page tables actually
allows the hardware to swap portions of the page tables out to disk when they’re not being used (I
warned you that this was going to get hairy!). For now, though, don’t worry about it — it suffices
to imagine that there is a single (in-memory) page table being consulted by the MMU for every memory
reference.
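If you'd like to check that arithmetic, it takes only a few lines (a throwaway sketch, nothing kernel-specific):

```c
#include <stdint.h>

/* size of a flat (single-level) page table for one 32-bit process */
uint64_t flat_page_table_bytes(void)
{
    uint64_t address_space = 1ULL << 32;     /* 4 GB of virtual space */
    uint64_t page_size     = 4 * 1024;       /* 4 KB pages */
    uint64_t entries       = address_space / page_size; /* ~1 million */
    return entries * 4;                      /* 4 bytes per entry */
}
```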

If the MMU is performing multiple memory operations just to translate the virtual address for
another memory operation, clearly this is going to be bad for performance. To remedy this problem,
the MMU hardware includes a special cache for virtual-to-physical memory translations, called the
translation lookaside buffer, or TLB. The TLB can be thought of as containing a very small portion
of the page tables in very fast RAM to speed up MMU address translations.

Figure 2: How the TLB is used to access physical memory from a virtual memory address.

Figure 2 shows what happens during a single memory access by a user application. On the upper
left of the figure we have the CPU, from which a user process wishes to read the virtual address
0x48b32f00. First, the TLB is consulted, and if a translation is found there, the physical address
is used to access memory directly. (We’ve drawn the TLB as being separate from the CPU itself, but
it’s actually part of the processor in most cases.) If the TLB does not contain a mapping for the
given virtual address, the MMU must then perform the arduous task of looking up the address in the
page tables. However, the result of this lookup will be saved in the TLB, thereby increasing
performance if another address on the same page is accessed again.
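To get a feel for this, here is a toy, direct-mapped TLB in C (real TLBs are associative hardware, and these function names are invented for illustration):

```c
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    int      valid;
    uint32_t vpage;   /* virtual page address (low 12 bits zero) */
    uint32_t ppage;   /* physical page it maps to */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* pick a slot from the virtual page number (direct-mapped) */
static int slot_for(uint32_t vpage) { return (vpage >> 12) % TLB_ENTRIES; }

/* fast path: is this translation cached?  Returns 1 on a hit. */
int tlb_lookup(uint32_t vpage, uint32_t *ppage)
{
    struct tlb_entry *e = &tlb[slot_for(vpage)];
    if (e->valid && e->vpage == vpage) {
        *ppage = e->ppage;
        return 1;                /* hit: no page-table walk needed */
    }
    return 0;                    /* miss: walk the page tables */
}

/* after a page-table walk, remember the result for next time */
void tlb_fill(uint32_t vpage, uint32_t ppage)
{
    struct tlb_entry *e = &tlb[slot_for(vpage)];
    e->valid = 1;
    e->vpage = vpage;
    e->ppage = ppage;
}
```

The first access to a page misses and forces the slow walk; the tlb_fill() afterwards means every later access to that page hits in fast RAM.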

So far I’ve talked about the Intel x86’s MMU hardware, not about the Linux kernel. Clearly, the
kernel isn’t involved every time a virtual address is translated; that would be far too slow. So,
what does the kernel have to do with virtual memory management?

The first job of the kernel is to set up and maintain the page tables which the MMU will
traverse when translating addresses. This requires the kernel to understand how physical memory is
laid out, to allocate portions of it to various user processes, and to create page table entries
allowing virtual addresses in the user process to eventually map onto physical RAM.

Handling page faults

Perhaps more importantly, though, the kernel must service page faults. As we’ve said before,
these occur whenever an invalid page table entry is encountered by the MMU. A page fault is a
hardware trap — much like an interrupt — which causes the kernel to jump to a page fault handler:
kernel code which is responsible for determining the source of the fault and dealing with it. For
example, if the fault was caused by a virtual memory reference to a bad address, the page fault
handler will kill the offending process and cause it to dump core. (The ol’ “segmentation fault”
error message again.)

For More Information

Now that I’ve hit you over the head with a whirlwind tour of the Linux kernel, you’re probably dying
to learn more. Here are some good places to start:

This is an excellent college-level textbook describing the fundamentals of operating systems
design. Contains all of the theoretical groundwork which one needs to really understand the guts of
a kernel.

This enormous tome (over 1,400 pages!) has literally everything about the Intel x86 architecture,
and everything that surrounds it in a real system: the BIOS, SCSI interfaces, PCI, IDE, parallel,
serial, and PCMCIA ports …you name it.

A comprehensive look at the Linux kernel internals from the Linux Documentation Project. Not as
useful to absolute beginners as to those who have done some O/S work in the past, but helps to
decipher the apparent madness.

An aging introduction to the Linux kernel which eventually turned into a quasi-collaborative
HyperNews archive. Somewhat out of date, but good for the details.

And, of course, there’s the Linux kernel source code itself — the original and still the best!

In other cases, however, a page fault may be caused by a valid virtual memory address, the
physical page for which was temporarily swapped out to disk. When this occurs, the page fault
handler must first determine where on disk the page can be found; this is done through a combination
of various table lookups and information saved in the page table entry. The page fault handler then
issues a call to retrieve the page from disk; because reading a block from disk is a relatively slow
operation, during this procedure the kernel scheduler runs another process on the CPU. (This is just
one example of how the various kernel components coordinate their activity.) What this means, of
course, is that when a process accesses a swapped-out page, it’s actually taken off of the CPU for a
while; of course, this is invisible both to the process and, hopefully, the user!

The code for handling page faults is in arch/i386/mm/fault.c. Under
normal conditions the page fault handler calls either do_no_page() or do_wp_page(), both of which are found in mm/memory.c.
The former of these is the generic handler which covers most cases; the latter is a special case
where a user process is attempting to write to a page which is being shared between multiple
processes. The do_wp_page() function implements what’s called a
“copy-on-write” scheme, where a copy is made of the (shared) page being written to, thus giving the
writing process a private copy to scribble on. This allows processes to share memory (such as common
data structures and code) as long as they are only reading that memory; a write triggers a copy.
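The heart of copy-on-write can be sketched in a few lines (a toy model: toy_page and cow_break() are invented names, and real pages are 4 KB, not 16 bytes):

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 16        /* tiny pages, to keep the toy small */

struct toy_page {
    int  refcount;              /* how many processes map this page */
    char data[TOY_PAGE_SIZE];
};

/* called from the write-fault handler: give the writer a page it may
   scribble on, copying only if the page is still shared */
struct toy_page *cow_break(struct toy_page *page)
{
    if (page->refcount == 1)
        return page;            /* sole owner: write in place */
    struct toy_page *copy = malloc(sizeof *copy);
    memcpy(copy->data, page->data, TOY_PAGE_SIZE);
    copy->refcount = 1;         /* the writer's new private copy */
    page->refcount--;           /* one fewer sharer of the original */
    return copy;
}
```

The copy happens only on the first write; readers keep sharing the single physical page indefinitely, which is what makes the scheme cheap.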

The swap daemon

Occasionally, the kernel sweeps through physical RAM and earmarks pages which haven’t been
accessed in a certain amount of time as candidates for paging out — that is, writing to the swap
space on disk. (The terms paging and swapping are often used interchangeably in the Linux world, but
traditionally, paging refers to moving individual pages to and from disk, while swapping refers to
moving entire processes. Linux never moves an entire process out to disk at once, although it might
do so in smaller chunks over time.) The code for this is in mm/vmscan.c,
starting with the kswapd routine. This routine is a kernel thread (the so-called “swap daemon”)
which wakes up occasionally to do the scan through memory, accomplished by repeatedly calling try_to_free_page().
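The spirit of that scan can be captured in a toy sweep (an invented sketch; the real code weighs many more factors when choosing victims):

```c
struct frame {
    int present;      /* is a page resident in this physical frame? */
    int referenced;   /* has it been touched since the last sweep? */
};

/* one pass of a toy swap daemon: frames whose pages went untouched
   since the last sweep become paging-out candidates; either way the
   referenced bits are cleared so the next sweep starts fresh */
int sweep(struct frame *frames, int n, int *candidates)
{
    int found = 0;
    for (int i = 0; i < n; i++) {
        if (frames[i].present && !frames[i].referenced)
            candidates[found++] = i;
        frames[i].referenced = 0;
    }
    return found;
}
```

A page must sit untouched across a full sweep interval before it is considered; actively used pages keep getting their referenced bits set and are left alone.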

Demand paging

Already, we can see that the kernel’s virtual memory system is somewhat complex; it’s arguably
the trickiest part of the kernel design. Still, it can do a lot of other tricks, one of which is
demand paging.

The concept behind demand paging is that pages aren’t read from disk into memory until they are
absolutely needed. When you run an application such as Emacs, all of the code is stored on disk, and
depending on how you use the program it may not be necessary to read all of that code into memory.
For example, if I never touch the menu bar at the top of the Emacs window, the code which deals with
the menu bar can be kept on disk, avoiding expensive disk reads. This is very similar to regular
paging, described above, and in fact the mechanisms for demand paging and regular paging are handled
by the same kernel code.

In the case of regular paging, the kernel sets up page table entries (and other information)
allowing it to recover swapped-out memory pages from disk on a page fault. In the case of demand
paging, the kernel simply sets up page table entries which allow it to read the data from wherever
the application executable is stored on disk. Say that Emacs is in /usr/bin/emacs
on your system; the first time you run Emacs, the kernel sets up tables which
allow it to fault in the code for the application from that location. A similar mechanism is used
for shared libraries, which contain code used by a large number of applications. As described above,
the virtual memory system also allows memory to be transparently shared between processes; this
means that if you run multiple copies of Emacs, say, only one copy of the application code is in
memory at one time.

Implementing protection

So far I’ve mostly been talking about a single user process; with multiple processes, however,
the kernel must enforce memory protection between them. With all of the above mechanisms in place,
this is very easy to do: each process has its own private set of page tables, which map to different
physical memory regions. When the scheduler selects a new process to run on the CPU, the MMU is told
to use the new page tables for all future memory references (on the x86 this is done by writing the
special register %cr3). In addition, entries in the TLB are “flushed”,
preventing the new process from using the old process’s cached virtual-to-physical translations. (If
the TLB weren’t flushed, it would be possible for a process to read or write the memory of another,
simply by accessing virtual addresses left over in the TLB.)

Processes can easily share memory, simply by having page table entries which map to the same
physical addresses. In order to implement the “copy-on-write” scheme described above, these page
table entries are marked as “read only”, which will cause a page fault if the process attempts to
write to any of the shared addresses. The fault handler then determines that this is a shared page
and makes a private copy for the writing process. Simple, no?
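You can watch page-table sharing from user space with mmap() (a hypothetical demo; MAP_ANONYMOUS is a Linux-specific flag, and shared_page_demo() is my own name):

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* map one shared page, fork, let the child write it, and report
   what the parent sees afterwards */
int shared_page_demo(void)
{
    int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;
    if (fork() == 0) {
        *shared = 42;        /* child writes through its own mapping */
        _exit(0);
    }
    wait(NULL);              /* both page tables point at the same
                                physical page, so the parent sees
                                the child's write */
    int seen = *shared;
    munmap(shared, 4096);
    return seen;
}
```

Had the mapping been MAP_PRIVATE instead, the child's write would have triggered the copy-on-write path and the parent would still see 0.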

I hope this overview will give you enough background to start digging into the kernel source on
your own. All right, it’s no picnic, but your buddies sure will be impressed when you can explain
the inner workings of a page fault handler. At least, my friends were, although this might just be
saying something about my friends…

Happy hacking!

Matt Welsh is a long-time Linux hacker and author of Running Linux, published by O’Reilly and Associates. His own Linux kernel hacking has remained fairly elusive,
but the proof is out there. He can be reached at mdw@cs.berkeley.edu.