
Linux Kernel Architecture

Let’s begin this section by discussing the architecture of the Linux
kernel, including responsibilities of the kernel, its organization and modules,
services of the kernel, and process management.

Kernel Responsibilities

The kernel (also called the operating system) has two major
responsibilities:

To interact with and control the system’s hardware
components

To provide an environment in which applications can run

Some operating systems allow applications to directly access hardware
components, although this capability is very uncommon nowadays. UNIX-like
operating systems hide all the low-level hardware details from an application.
If an application wants to make use of a hardware resource, it must make a
request to the operating system. The operating system then evaluates the request
and, only if the request is valid, interacts with the hardware component on
behalf of the application. To enforce this kind of scheme, the operating system
needs to depend on hardware capabilities that forbid applications from
interacting directly with hardware components.

Organization and Modules

Like many other UNIX-like operating systems, the Linux kernel is
monolithic. This means that even though Linux is divided into
subsystems that control various components of the system (such as memory
management and process management), all of these subsystems are tightly
integrated to form the whole kernel. In contrast, microkernel operating
systems provide bare, minimal functionality, and all other operating system
layers are implemented on top of the microkernel as processes. Microkernel
operating systems are generally slower because of message passing between the
various layers. However, microkernel operating systems can be extended very
easily.

Linux kernels can be extended by modules, a kernel feature that provides the
extensibility of a microkernel without the message-passing penalty. A module
is an object file that can be linked to the kernel at runtime.

Using Kernel Services

The kernel provides a set of interfaces for applications running in user mode
to interact with the system. These interfaces, also known as system calls, give
applications access to hardware and other kernel resources. System calls not
only provide applications with abstracted hardware, but also ensure security and
stability.

Most applications do not use system calls directly. Instead, they are
programmed to an application programming interface (API). It is important to
note that there is no one-to-one relation between API functions and system
calls. APIs are provided as part of libraries for applications to use, and a
given API function is generally implemented through one or more system calls.
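As a minimal sketch of this distinction, the following Python example contrasts a library-level write with the system call underneath it. It assumes a POSIX system. `os.pipe`, `os.read`, and `os.set_blocking` are Python's thin wrappers around kernel services, while the buffered file object returned by `os.fdopen` plays the role of the library API. The function name `demo_buffering` is made up for illustration.

```python
import os

def demo_buffering():
    """Show that a library-level write may not reach the kernel until flushed."""
    read_fd, write_fd = os.pipe()        # kernel-provided channel between two descriptors
    os.set_blocking(read_fd, False)      # make reads fail fast when the pipe is empty

    # Library API: a buffered file object layered on top of the raw descriptor.
    writer = os.fdopen(write_fd, "wb", buffering=4096)
    writer.write(b"hello")               # stays in the user-space buffer for now

    try:
        before_flush = os.read(read_fd, 16)
    except BlockingIOError:
        before_flush = b""               # nothing has been handed to the kernel yet

    writer.flush()                       # the library now issues the write system call
    after_flush = os.read(read_fd, 16)

    writer.close()
    os.close(read_fd)
    return before_flush, after_flush

if __name__ == "__main__":
    print(demo_buffering())
```

The library call and the system call are separate events: the data written through the file object becomes visible to the reader only after the library decides (or is told) to call into the kernel.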

/proc File System—External Performance View

The /proc file system provides the user with a view of internal kernel data
structures. It also lets you look at and change some of the kernel’s internal
data structures, thereby changing the kernel’s behavior. The /proc file system
provides an easy way to fine-tune system resources to improve the performance
not only of applications but of the overall system.

/proc is a virtual file system that is created dynamically by the
kernel to provide data. It is organized into various directories. Each of these
directories corresponds to tunables for a given subsystem. Appendix A explains
in detail how to use the /proc file system to fine-tune your system.
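Because /proc entries are read like ordinary text files, inspecting them takes very little code. The sketch below is illustrative only: `parse_meminfo` and `read_tunable` are hypothetical helper names, and the example assumes a Linux system that exposes /proc/meminfo and /proc/sys/vm/swappiness. It falls back gracefully when they are absent.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}; most values are in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue                      # skip lines that are not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def read_tunable(path):
    """Return the current value of a /proc tunable as a stripped string."""
    with open(path) as f:
        return f.read().strip()

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print("MemTotal:", info.get("MemTotal"), "kB")
        print("vm.swappiness:", read_tunable("/proc/sys/vm/swappiness"))
    except OSError as exc:
        print("/proc is not available here:", exc)
```

Changing a tunable works the same way in the other direction: a sufficiently privileged process writes a new value to the file.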

Another essential of the Linux system is memory management. In the next
section, we’ll cover five aspects of how Linux handles this
management.

Memory Management

The various aspects of memory management in Linux include address space,
physical memory, memory mapping, paging, and swapping.

Address Space

One of the advantages of virtual memory is that each process thinks it has
all the address space it needs. The virtual memory can be many times larger than
the physical memory in the system. Each process in the system has its own
virtual address space. These virtual address spaces are completely separate from
each other. A process running one application cannot affect another, and the
applications are protected from each other. The virtual address space is mapped
to physical memory by the operating system. From an application point of view,
this address space is a flat linear address space. The kernel, however, treats
the user virtual address space very differently.

The linear address space is divided into two parts: user address space and
kernel address space. The user address space can change every time a context
switch occurs, whereas the kernel address space remains constant. How much space is
allocated for user space and kernel space depends mainly on whether the system
is a 32-bit or 64-bit architecture. For example, x86 is a 32-bit architecture
and supports only a 4GB address space. Out of this 4GB, 3GB is reserved for user
space and 1GB is reserved for the kernel. The location of the split is
determined by the PAGE_OFFSET kernel configuration variable.
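The arithmetic behind the 3GB/1GB split can be sketched as follows. The value 0xC0000000 for PAGE_OFFSET is assumed here as the usual default on 32-bit x86, and the helper names are illustrative, not kernel functions.

```python
GB = 1 << 30
ADDRESS_SPACE = 4 * GB          # size of a 32-bit linear address space
PAGE_OFFSET = 0xC0000000        # assumed default split point on 32-bit x86

def split_sizes(page_offset=PAGE_OFFSET):
    """Return (user_bytes, kernel_bytes) for a given split point."""
    return page_offset, ADDRESS_SPACE - page_offset

def is_kernel_address(addr, page_offset=PAGE_OFFSET):
    """True if a linear address falls in the kernel part of the address space."""
    if not 0 <= addr < ADDRESS_SPACE:
        raise ValueError("address outside the 32-bit address space")
    return addr >= page_offset

if __name__ == "__main__":
    user, kernel = split_sizes()
    print(user // GB, "GB user,", kernel // GB, "GB kernel")
    print(is_kernel_address(0xC0100000))
```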

Physical Memory

Linux uses an architecture-independent way of describing physical memory in
order to support various architectures.

Physical memory can be arranged into banks, with each bank being a particular
distance from the processor. This type of memory arrangement is becoming very
common, with more machines employing NUMA (Nonuniform Memory Access) technology.
Linux VM represents this arrangement as a node. Each node is divided
into a number of blocks called zones that represent ranges within
memory. There are three different zones: ZONE_DMA,
ZONE_NORMAL, and ZONE_HIGHMEM. For example, x86 has the
following zones:

ZONE_DMA: first 16MB of memory

ZONE_NORMAL: 16MB to 896MB

ZONE_HIGHMEM: 896MB to end of memory

Each zone has its own use. Some of the legacy ISA devices have restrictions
on where they can perform I/O from and to. ZONE_DMA addresses those
requirements.

ZONE_NORMAL is used for all kernel operations and allocations. It is
extremely crucial for system performance.

ZONE_HIGHMEM is the rest of the memory in the system. It’s important
to note that ZONE_HIGHMEM cannot be used for kernel allocations and data
structures; it can only be used for user data.
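The x86 zone boundaries given above can be expressed as a small lookup. This is a sketch of the classification only, using the 16MB and 896MB limits from the text. The names `zone_for` and `zone_sizes` are hypothetical, and the real kernel tracks zones per node with considerably more state.

```python
MB = 1 << 20

# (name, start, end) in bytes; end=None means "to the end of physical memory".
ZONES = [
    ("ZONE_DMA", 0, 16 * MB),
    ("ZONE_NORMAL", 16 * MB, 896 * MB),
    ("ZONE_HIGHMEM", 896 * MB, None),
]

def zone_for(phys_addr):
    """Return the zone name that contains a physical address."""
    if phys_addr < 0:
        raise ValueError("physical addresses are non-negative")
    for name, _start, end in ZONES:
        if end is None or phys_addr < end:
            return name

def zone_sizes(total_bytes):
    """Return {zone: bytes} for a machine with total_bytes of physical memory."""
    sizes = {}
    for name, start, end in ZONES:
        upper = total_bytes if end is None else min(end, total_bytes)
        sizes[name] = max(0, upper - start)
    return sizes

if __name__ == "__main__":
    for name, size in zone_sizes(2048 * MB).items():
        print(name, size // MB, "MB")
```

On a machine with less than 896MB of memory, ZONE_HIGHMEM comes out empty, which matches the description of high memory as whatever is left above the directly usable range.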

Memory Mapping

While looking at how kernel memory is mapped, we will use x86 as an example
for better understanding. As mentioned earlier, the kernel has only 1GB of
virtual address space for its use. The other 3GB is reserved for user space. The
kernel maps the physical memory in ZONE_DMA and ZONE_NORMAL directly to its
address space. This means that the first 896MB of physical memory in the system
is mapped to the kernel’s virtual address space, which leaves only 128MB
of virtual address space. This 128MB of virtual space is used for operations
such as vmalloc and kmap.
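The direct mapping amounts to adding a constant offset, which the following sketch illustrates. It assumes the usual 32-bit x86 value of 0xC0000000 for PAGE_OFFSET and the 896MB limit described above. The function names are illustrative (the kernel itself provides macros such as `__va` and `__pa` for this conversion).

```python
MB = 1 << 20
ADDRESS_SPACE = 1 << 32          # 32-bit virtual address space
PAGE_OFFSET = 0xC0000000         # assumed start of the kernel's 1GB region
LOWMEM_LIMIT = 896 * MB          # physical memory that is mapped directly

def phys_to_virt(phys):
    """Kernel virtual address of a directly mapped physical address."""
    if not 0 <= phys < LOWMEM_LIMIT:
        raise ValueError("physical address is not in the directly mapped range")
    return phys + PAGE_OFFSET

def virt_to_phys(virt):
    """Physical address behind a directly mapped kernel virtual address."""
    if not PAGE_OFFSET <= virt < PAGE_OFFSET + LOWMEM_LIMIT:
        raise ValueError("virtual address is not in the direct mapping")
    return virt - PAGE_OFFSET

def remaining_kernel_space():
    """Kernel virtual space left over after the direct mapping."""
    return ADDRESS_SPACE - PAGE_OFFSET - LOWMEM_LIMIT

if __name__ == "__main__":
    print(hex(phys_to_virt(16 * MB)))
    print(remaining_kernel_space() // MB, "MB left for vmalloc, kmap, and so on")
```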

This mapping scheme works well as long as physical memory sizes are small
(less than 1GB). However, these days, many servers support tens of gigabytes of
memory. Intel has added PAE (Physical Address Extension) to its Pentium
processors to support up to 64GB of physical memory. Because of the preceding
memory mapping, handling physical memories in tens of gigabytes is a major
source of problems for x86 Linux. The Linux kernel handles high memory (all
memory above 896MB) as follows: When the Linux kernel needs to address a page in
high memory, it maps that page into a small virtual address space (kmap) window,
operates on that page, and unmaps the page. The 64-bit architectures do not have
this problem because their address space is huge.
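The map, operate, unmap sequence can be pictured with a toy model. Everything in the sketch below is hypothetical: `KmapWindow`, its slot count, and `with_high_page` only mimic the pattern and are not the kernel's interface. One simplification is that this model raises an error when the window is full, whereas a real implementation would have to wait for a slot or recycle one.

```python
class KmapWindow:
    """Toy model of a small window of virtual slots for high-memory pages."""

    def __init__(self, slots=4):
        self.slots = [None] * slots      # slot index -> mapped page frame, or None

    def kmap(self, page_frame):
        """Map a page frame into the first free slot and return the slot index."""
        for index, current in enumerate(self.slots):
            if current is None:
                self.slots[index] = page_frame
                return index
        raise RuntimeError("no free slot in the window")

    def kunmap(self, slot):
        """Release a slot so that another high-memory page can use it."""
        self.slots[slot] = None

def with_high_page(window, page_frame, operation):
    """Map a page, run operation(slot) on it, and always unmap afterwards."""
    slot = window.kmap(page_frame)
    try:
        return operation(slot)
    finally:
        window.kunmap(slot)

if __name__ == "__main__":
    window = KmapWindow(slots=2)
    print(with_high_page(window, 300000, lambda slot: "used slot %d" % slot))
```

Because the window is small, a mapping is held only for the duration of the operation and released immediately.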

Paging

Virtual memory is implemented in many ways, but the most effective way is
hardware-based. Virtual address space is divided into fixed-size chunks called
pages. Virtual memory references are translated into addresses in
physical memory using page tables. To support various architectures and page
sizes, Linux uses a three-level paging mechanism. The three types of page tables
are as follows:

Page Global Directory (PGD)

Page Middle Directory (PMD)

Page Table (PTE)
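A three-level lookup splits a virtual address into three table indexes plus an offset within the page. The sketch below assumes one concrete layout for illustration: a 32-bit address with 4KB pages, divided into 2 bits for the PGD index, 9 bits for the PMD index, 9 bits for the PTE index, and a 12-bit offset (the layout used by x86 with PAE). Other architectures use different widths. The nested dictionaries stand in for real page tables, and the function names are made up.

```python
PAGE_SHIFT = 12      # 4KB pages: low 12 bits are the offset within the page
PTE_BITS = 9         # assumed width of the page table index
PMD_BITS = 9         # assumed width of the page middle directory index
PGD_BITS = 2         # assumed width of the page global directory index

def split_virtual_address(vaddr):
    """Return (pgd_index, pmd_index, pte_index, offset) for a virtual address."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    pte = (vaddr >> PAGE_SHIFT) & ((1 << PTE_BITS) - 1)
    pmd = (vaddr >> (PAGE_SHIFT + PTE_BITS)) & ((1 << PMD_BITS) - 1)
    pgd = (vaddr >> (PAGE_SHIFT + PTE_BITS + PMD_BITS)) & ((1 << PGD_BITS) - 1)
    return pgd, pmd, pte, offset

def translate(pgd_table, vaddr):
    """Walk nested dicts {pgd: {pmd: {pte: frame}}}; None means 'not present'."""
    pgd, pmd, pte, offset = split_virtual_address(vaddr)
    try:
        frame = pgd_table[pgd][pmd][pte]
    except KeyError:
        return None                      # real hardware would raise a page fault here
    return (frame << PAGE_SHIFT) | offset

if __name__ == "__main__":
    tables = {0: {2: {1: 0x5}}}          # one mapped page, at physical frame 5
    print(split_virtual_address(0x00401234))
    print(hex(translate(tables, 0x00401234)))
```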

Address translation provides a way to separate the virtual address space of a
process from the physical address space. Each page of virtual memory can be
marked "present" or "not present" in the main memory. If a
process references an address in virtual memory that is not present, hardware
generates a page fault, which is handled by the kernel. The kernel handles the
fault and brings the page into main memory. In this process, the system might
have to replace an existing page to make room for the new one.

The replacement policy is one of the most critical aspects of the paging
system. Linux 2.6 fixed various problems surrounding the page selection and
replacement that were present in previous versions of Linux.
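To illustrate the fault-and-replace cycle, here is a toy pager that uses a plain least-recently-used (LRU) policy. LRU is chosen only because it is easy to follow. It is not the algorithm Linux uses, and the class and attribute names are hypothetical.

```python
from collections import OrderedDict

class ToyPager:
    """Toy demand pager with a fixed number of frames and LRU replacement."""

    def __init__(self, frames):
        self.frames = frames
        self.resident = OrderedDict()    # pages in memory, least recently used first
        self.faults = 0
        self.evicted = []

    def access(self, page):
        """Touch a page; return True if the access caused a page fault."""
        if page in self.resident:
            self.resident.move_to_end(page)    # mark as most recently used
            return False
        self.faults += 1                       # page is "not present"
        if len(self.resident) >= self.frames:
            victim, _ = self.resident.popitem(last=False)
            self.evicted.append(victim)        # make room for the new page
        self.resident[page] = None
        return True

if __name__ == "__main__":
    pager = ToyPager(frames=3)
    for page in [1, 2, 3, 1, 4, 2]:
        pager.access(page)
    print("faults:", pager.faults, "evicted:", pager.evicted)
```

With three frames and the access sequence 1, 2, 3, 1, 4, 2, page 2 is evicted just before it is needed again. A poor choice of victim turns directly into extra faults, which is why the replacement policy matters so much.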

Swapping

Swapping is the moving of an entire process to and from secondary
storage when the main memory is low. Many modern operating systems, including
Linux, do not use this approach, mainly because it makes context switches very
expensive. Instead, they use paging. In Linux, swapping is performed at the page
level rather than at the process level. The main advantage of swapping is that
it expands the address space that is usable by a process. As the kernel
needs to free up memory to make room for new pages, it may need to discard some
of the less frequently used or unused pages. Some pages cannot be freed
easily because they are not backed by a file on disk. Instead, they have to be
copied to a backing store (swap area) and read back from the backing store
when needed. One major disadvantage of swapping is speed. Generally, disks are
very slow, so swapping should be avoided whenever possible.
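The difference between pages that can simply be dropped and pages that must go to the swap area can be captured in a toy decision function. This is a simplified model under stated assumptions: the `Page` class, its two flags, and the action strings are invented for illustration, and the kernel's actual reclaim logic considers much more state.

```python
from dataclasses import dataclass

@dataclass
class Page:
    file_backed: bool    # True if the contents can be reread from a file on disk
    dirty: bool          # True if the page was modified since it was loaded

def reclaim_action(page):
    """Describe what has to happen before the page's memory can be reused."""
    if page.file_backed:
        if page.dirty:
            return "write back to file, then free"
        return "discard"                 # an identical copy already exists on disk
    # No file stands behind the page, so the swap area is the only backing store.
    return "copy to swap area, then free"

if __name__ == "__main__":
    for page in (Page(True, False), Page(True, True), Page(False, True)):
        print(page, "->", reclaim_action(page))
```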