Kernel Mode Linux

Now you don't have to write a module to run a program in kernel space. Run any program there with this patch.

What Kernel-Mode User Processes Cannot Do

Although kernel-mode user processes are ordinary user
processes, they have a few limitations. If a kernel-mode user
process violates them, the system ends up in an undefined state,
and in the worst case the whole system can crash.

Limitation 1: don't modify the CS, DS, SS or FS segment
registers. The current KML implementation for IA-32 uses these
registers internally and assumes that kernel-mode user processes
never modify them.

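Limitation 2: don't perform privileged actions improperly. In
kernel mode, programs can perform any privileged action. However,
if your program performs such actions in a way that is inconsistent
with the kernel, the system will be in an undefined state. For
example, suppose a kernel-mode user process executes the cli
instruction to disable interrupts and then loops forever. Because
it runs in kernel mode, the CPU executes cli without complaint,
but from then on timer interrupts can no longer reach that CPU, so
the kernel never regains control of it.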

In my experience, few applications violate these limitations;
among those that do are WINE and VMware. The limitations apply
only to kernel-mode user processes. Ordinary user processes are
never affected by them, even when running on a KML-capable
kernel.

KML Internals

On IA-32 CPUs, the privilege level of a running program is
determined by the privilege level of the code segment in which it
executes. Recall that the IA-32 program counter consists of a
segment, selected by the CS segment register, and an offset into
that segment, held in the EIP register. The privilege level of the
code segment is in turn defined by its segment descriptor, which
has a field, the descriptor privilege level (DPL), for exactly
this purpose.
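The following C sketch pulls these fields apart. It is a minimal illustration, not KML code: the bit positions follow the IA-32 selector and descriptor formats, while the sample values in the usage note (selectors 0x60 and 0x73, descriptors 0x00cf9a000000ffff and 0x00cffa000000ffff) are assumptions modeled on the flat kernel and user code segments of 2.6-era i386 Linux.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a segment selector, the 16-bit value loaded into CS:
 *   bits 15..3  index into the descriptor table
 *   bit  2      table indicator (0 = GDT, 1 = LDT)
 *   bits 1..0   requested privilege level (RPL)
 * For the selector currently in CS, the low two bits equal the CPU's
 * current privilege level (0 = kernel mode, 3 = user mode). */
unsigned selector_index(uint16_t sel) { return sel >> 3; }
unsigned selector_table(uint16_t sel) { return (sel >> 2) & 1u; }
unsigned selector_rpl(uint16_t sel)   { return sel & 3u; }

/* An 8-byte segment descriptor keeps its descriptor privilege level
 * (DPL) in bits 46..45; this field is what makes a code segment a
 * kernel-mode or a user-mode segment. */
unsigned descriptor_dpl(uint64_t desc)
{
    return (unsigned)((desc >> 45) & 3u);
}
```

With those assumed values, selector_rpl(0x73) returns 3 and selector_index(0x73) returns 14, while descriptor_dpl(0x00cf9a000000ffffULL) returns 0 (kernel mode) and descriptor_dpl(0x00cffa000000ffffULL) returns 3 (user mode).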

The Linux kernel basically sets up two code segments: the kernel
code segment and the user code segment. The kernel code segment is
used by the kernel itself, and its privilege level is kernel mode.
The user code segment is used by ordinary user processes, and its
privilege level is user mode. When a user process calls execve,
the unmodified Linux kernel sets the process's CS segment register
to the user code segment, so the new program runs in user mode.

To execute a user process as a kernel-mode user process, the
only thing KML does is set the CS register of the process to the
kernel code segment instead of the user code segment. The process
then runs in kernel mode. Thanks to this simple approach, a
kernel-mode user process can otherwise remain an ordinary user
process.
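At exec time, the difference described above comes down to a single choice. The sketch below is hypothetical: struct regs_sketch and start_thread_sketch() are illustrative names rather than the kernel's, the kernel_mode flag stands in for however KML decides which processes run in kernel mode, and the selector values 0x60 and 0x73 are assumptions modeled on 2.6-era i386 Linux. Other segment registers are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed selector values (2.6-era i386 Linux style), for illustration. */
#define SKETCH_KERNEL_CS 0x60u  /* descriptor DPL 0, RPL 0 */
#define SKETCH_USER_CS   0x73u  /* descriptor DPL 3, RPL 3 */

/* Hypothetical, cut-down view of a new process's saved registers. */
struct regs_sketch {
    uint32_t eip;  /* entry point of the new program */
    uint32_t esp;  /* top of the new program's stack */
    uint16_t cs;   /* code segment selector: decides the privilege level */
};

/* What execve conceptually does when starting a program. The stock
 * kernel always picks the user code segment; the KML idea is to pick
 * the kernel code segment instead. The rest of this sketch is the
 * same either way. */
void start_thread_sketch(struct regs_sketch *regs, uint32_t eip,
                         uint32_t esp, bool kernel_mode)
{
    assert(regs != NULL);
    regs->eip = eip;
    regs->esp = esp;
    regs->cs = kernel_mode ? SKETCH_KERNEL_CS : SKETCH_USER_CS;
}

/* The low two bits of CS give the level the process will run at. */
unsigned privilege_level(const struct regs_sketch *regs)
{
    return regs->cs & 3u;
}
```

A process started with kernel_mode set to false ends up at privilege level 3; with kernel_mode set to true, the same program starts at privilege level 0.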

The Stack Starvation Problem and Its Solution

As described in the previous section, the basic approach of
KML is quite simple. Its main difficulty is a problem called stack
starvation. First, I'll explain how the unmodified Linux kernel
handles exceptions (such as page faults) and interrupts (such as
timer interrupts) on IA-32 CPUs. Then, I'll describe the stack
starvation problem. Finally, I'll present my solution to it.

In the unmodified Linux kernel, interrupts and exceptions are
handled by routines registered as gates in the Interrupt
Descriptor Table (IDT). When an interrupt occurs, an IA-32 CPU
stops the running program, saves its execution context and jumps
to the interrupt handling routine.

How the CPU saves the execution context of the running program
depends on the program's privilege level. If the program runs in
user mode, the CPU automatically switches from the program's
memory stack to a kernel stack and then saves the execution
context (the EIP, CS, EFLAGS, ESP and SS registers) on that kernel
stack. If the program runs in kernel mode, on the other hand, the
CPU doesn't switch stacks; it saves a smaller context (the EIP, CS
and EFLAGS registers) on the running program's own memory stack.
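The two cases can be modeled in a few lines of C. This is a simplified, hypothetical model rather than real CPU or kernel code: it assumes the handler runs in kernel mode (as Linux's interrupt handlers do), ignores the error code that some exceptions push, and only records which values are saved and whether a stack switch happens. The push order (SS, ESP, EFLAGS, CS, EIP) follows the IA-32 convention.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU state at the moment an interrupt arrives. */
struct cpu_sketch {
    unsigned cpl;                      /* 0 = kernel mode, 3 = user mode */
    uint32_t eip, cs, eflags, esp, ss;
};

/* What the CPU saves, in push order, and whether it changed stacks. */
struct saved_context {
    int switched_to_kernel_stack;
    int count;
    uint32_t words[5];
};

static void push(struct saved_context *ctx, uint32_t value)
{
    assert(ctx->count < 5);
    ctx->words[ctx->count++] = value;
}

/* Models delivery through an interrupt gate whose handler runs in
 * kernel mode. Error codes pushed by some exceptions are not modeled. */
struct saved_context save_context(const struct cpu_sketch *cpu)
{
    struct saved_context ctx = {0};
    if (cpu->cpl != 0) {
        /* User mode: switch to the kernel stack first, then save the
         * user stack pointer (SS:ESP) so it can be restored later. */
        ctx.switched_to_kernel_stack = 1;
        push(&ctx, cpu->ss);
        push(&ctx, cpu->esp);
    }
    /* Both cases: save EFLAGS, CS and EIP. In kernel mode these land
     * on the interrupted program's own stack. */
    push(&ctx, cpu->eflags);
    push(&ctx, cpu->cs);
    push(&ctx, cpu->eip);
    return ctx;
}
```

For a user-mode program the model reports five saved values and a stack switch; for a kernel-mode program it reports three values and no switch, which is exactly the case where the program's own stack must be usable.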

What happens if a kernel-mode user process accesses a part of its
memory stack that is not mapped in the CPU's page tables? First, a
page fault occurs, and the CPU tries to interrupt the process and
jump to the page fault handler specified in the IDT. However, the
CPU can't do this, because there is no stack on which to save the
execution context: the process runs in kernel mode, so the CPU
never switches to the kernel stack, and the process's own stack is
the very memory that is missing. To signal this fatal situation,
the CPU tries to raise a special exception, a double fault. Again,
it can't deliver the double fault, because there is still no stack
for saving the execution context of the running process. Finally,
the CPU gives up and resets itself; this situation is known as a
triple fault.
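The escalation can be captured as a toy model in C. It is an illustration under stated assumptions, not kernel code: both the page fault and the double fault are assumed to be delivered through ordinary interrupt gates, the kernel stack is assumed to be always mapped, and the only inputs are whether the faulting program runs in kernel mode and whether its own stack is mapped.

```c
#include <assert.h>
#include <stdbool.h>

enum outcome {
    HANDLED,               /* page fault handler entered normally */
    DOUBLE_FAULT_HANDLED,  /* page fault failed, double fault handler entered */
    CPU_RESET              /* double fault failed too: the CPU resets */
};

/* Can the CPU save the interrupted context through an interrupt gate?
 * User-mode code: the CPU switches to the kernel stack (assumed to be
 * always mapped). Kernel-mode code: no switch, so the program's own
 * stack must be mapped. */
static bool can_save_context(bool kernel_mode, bool own_stack_mapped)
{
    return !kernel_mode || own_stack_mapped;
}

enum outcome deliver_page_fault(bool kernel_mode, bool own_stack_mapped)
{
    /* Attempt 0 delivers the page fault; attempt 1 the resulting double
     * fault. Both need the same stack, so in this model attempt 1 can
     * never succeed where attempt 0 failed: that is stack starvation. */
    for (int attempt = 0; attempt < 2; attempt++) {
        if (can_save_context(kernel_mode, own_stack_mapped))
            return attempt == 0 ? HANDLED : DOUBLE_FAULT_HANDLED;
    }
    return CPU_RESET;
}
```

In this model, deliver_page_fault(true, false), a kernel-mode process touching an unmapped stack, is the only combination that ends in CPU_RESET; a user-mode process with an unmapped stack is handled normally because its context goes onto the kernel stack.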

To solve this stack starvation problem, KML exploits the task
management facility of IA-32 CPUs. The facility was designed to
support process management in kernels: with it, a kernel can
switch between processes with a single instruction. However,
today's kernels don't use it for that purpose, because it is
slower than software-only approaches. As a result, the facility
is largely forgotten.

The strength of the task management facility is that it can also
be used to handle interrupts and exceptions. A task can be
registered in the IDT through a task gate. If an interrupt occurs
and a task is assigned to handle it, the CPU first saves the
execution context of the interrupted program into that program's
task data structure, the task state segment (TSS), instead of
onto a memory stack. Then, the CPU loads the handler's context
from the TSS designated by the IDT entry.
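A task-gate delivery can be sketched as a pair of copies between the CPU's registers and task data structures. The structures and function below are hypothetical and heavily trimmed: a real TSS also holds the general-purpose and segment registers, CR3, per-privilege-level stack pointers and more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily trimmed stand-in for a task state segment. */
struct tss_sketch {
    uint32_t eip, esp, eflags;
    uint16_t cs;
};

/* The registers the CPU is currently running with. */
struct cpu_regs {
    uint32_t eip, esp, eflags;
    uint16_t cs;
};

/* Models delivery through a task gate: save the running context into
 * the current task's TSS, then load the handler task's context from
 * the TSS that the IDT entry designates. No memory stack is touched. */
void task_gate_switch(struct cpu_regs *cpu, struct tss_sketch *current,
                      const struct tss_sketch *handler)
{
    assert(cpu != NULL && current != NULL && handler != NULL);

    current->eip = cpu->eip;        /* save the interrupted context */
    current->esp = cpu->esp;
    current->eflags = cpu->eflags;
    current->cs = cpu->cs;

    cpu->eip = handler->eip;        /* load the handler's context, */
    cpu->esp = handler->esp;        /* including its own, separate stack */
    cpu->eflags = handler->eflags;
    cpu->cs = handler->cs;
}
```

The only memory written here is the interrupted task's TSS, and the handler starts on a stack of its own, so the state of the interrupted program's stack does not matter.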

The most important point is that no memory stack is needed to
save the context when the task management facility handles an
interrupt. That is, if page fault exceptions were handled with the
facility, a kernel-mode user process could access its memory stack
safely.

However, handling all page faults with the facility would degrade
the performance of the whole system, because task-based interrupt
handling is slower than ordinary interrupt handling.

Therefore, we handle only double fault exceptions this way.
Because a page fault turns into a double fault only when the CPU
cannot save the execution context, only page faults caused by an
absent memory stack end up being handled by the task management
facility. In my experience, memory stacks rarely cause page
faults, so the performance penalty is negligible.
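The resulting arrangement can be modeled as an IDT in which only the double fault entry is a task gate. This is an illustrative model with hypothetical names, not KML's implementation; it only captures which gate type each exception uses and whether delivery can succeed. Vectors 14 (page fault) and 8 (double fault) are the standard IA-32 assignments.

```c
#include <assert.h>
#include <stdbool.h>

enum gate_kind { INTERRUPT_GATE, TASK_GATE };

enum kml_outcome {
    KML_HANDLED,            /* page fault handler entered normally */
    KML_DOUBLE_FAULT_TASK,  /* recovered through the double fault task */
    KML_CPU_RESET           /* delivery impossible (not reachable here) */
};

/* Hypothetical IDT contents: vector 14 (page fault) keeps a fast
 * interrupt gate; vector 8 (double fault) gets a task gate. */
static const enum gate_kind idt_sketch[15] = {
    [8]  = TASK_GATE,
    [14] = INTERRUPT_GATE,
};

/* An interrupt gate must save the context on a stack: the kernel stack
 * for user-mode code, the program's own stack for kernel-mode code.
 * A task gate saves the context into a TSS, so here it always works. */
static bool can_deliver(int vector, bool kernel_mode, bool own_stack_mapped)
{
    if (idt_sketch[vector] == TASK_GATE)
        return true;
    return !kernel_mode || own_stack_mapped;
}

enum kml_outcome kml_deliver_page_fault(bool kernel_mode, bool own_stack_mapped)
{
    if (can_deliver(14, kernel_mode, own_stack_mapped))
        return KML_HANDLED;            /* the common, fast path */
    /* Delivering the page fault failed, which escalates to a double fault. */
    if (can_deliver(8, kernel_mode, own_stack_mapped))
        return KML_DOUBLE_FAULT_TASK;  /* the rare stack-absence case */
    return KML_CPU_RESET;
}
```

In this model, every ordinary page fault takes the interrupt-gate path, and only a kernel-mode access to an unmapped stack falls through to the double fault task instead of resetting the CPU.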