Category: Paging

A process is a program in execution. Since Linux represents everything in terms of files, currently running processes are also represented through the /proc file system. The directory /proc/&lt;pid&gt; exposes a process's open files (including sockets), pending signals, process state, kernel data structures, threads of execution, and data section.

Processes provide two types of virtualization:

Virtual Processor

Virtual Memory

Processes are created on Linux using the fork() system call. When fork() is called, a new process is created as a child of the process that called fork(). It is called a child because it gets a copy of its parent's resources (data, variables, code, pages, sockets, etc.).

The list of processes is stored in a doubly linked list called the task list. The process descriptor has type struct task_struct, defined in &lt;linux/sched.h&gt;.

Processes are identified by a PID. A PID has type pid_t, an integer with a default maximum of 32,768. Low PID numbers are generally taken by kernel threads and system processes started at boot, with user processes taking the rest. PIDs are assigned sequentially; when the maximum is reached, allocation wraps around and restarts from the low end, skipping PIDs that are still in use. The PID limit can be raised by writing to /proc/sys/kernel/pid_max.

If the return value of fork() is 0, the fork succeeded and control is in the child process.

if (pid == 0)
{
    printf("child process");
}

If the return value of fork() is greater than 0, the fork succeeded and control is in the parent process; the value is the child's PID.

else if (pid > 0)
{
    printf("parent process");
}

If the return value of fork() is negative, the call failed.

else
{
    printf("fork failed");
}

fork() creates a child process that is a copy of the current task. It differs from the parent only in its PID, its PPID (its parent's PID), and certain resources and statistics (e.g. pending signals) that are not inherited.

Besides open files, the child inherits other properties of the parent, such as the current working directory, signal dispositions, and resource limits.

Process 0 and 1

Process 0 is the ancestor of all processes and is also known as the swapper process. It is built from scratch during boot: it initializes all the data structures needed by the kernel, enables interrupts, and creates process 1, the init process. init is invoked by the kernel at the end of the bootstrap procedure and is responsible for bringing up the system after the kernel has been bootstrapped. init usually reads the system-dependent initialization files (the /etc/rc* files, or /etc/inittab and the files in /etc/init.d) and brings the system to a certain state. Process 1 never dies. It is a normal user process, not a system process within the kernel, but it runs with superuser privileges.

State of a Process

A process can be in any one of the following states:

TASK_RUNNING

The process is either executing on a CPU or waiting in a run queue to be executed.

TASK_INTERRUPTIBLE

The process is sleeping, waiting for some condition; it wakes when the condition becomes true or when it receives a signal.

TASK_UNINTERRUPTIBLE

The process is in a sleeping state, but delivering a signal to it will not change its state.

TASK_STOPPED

Process execution has been stopped (e.g. by SIGSTOP).

TASK_TRACED

The process is being traced by a debugger.

EXIT_DEAD

The child process is being removed because the parent process issued wait4() or waitpid().

EXIT_ZOMBIE

The child process has terminated, but the parent has not yet issued wait() or waitpid().

Copy on Write – an optimization on fork()

Not all of the parent's data and resources are copied to the child up front; pages are copied in steps, as needed. This optimization delays (and often avoids) copying. Another reason: when the child immediately calls exec(), none of the parent's data is actually needed, so there should be no need to copy any pages from the parent.

Process Switch

During a process switch, the registers are saved in the kernel memory area for the process and loaded back when the process resumes on the CPU. This was done using the far jmp assembly instruction on x86 (hardware context switching). From Linux 2.6 onwards, software is used for context switching instead; software context switching is safer and faster.

Scheduling

Processes run on a time-slicing basis, giving each user the illusion of exclusive access to the CPU and of unlimited memory. The scheduling of processes depends on the scheduling policy or algorithm. The scheduling algorithm must give processes fast response and good throughput, make limited use of resources, and honor priorities.

Linux processes are preemptive: a higher-priority process runs first, interrupting a lower-priority process if one is running. There are the following classes of Linux processes:

SCHED_FIFO

SCHED_RR

SCHED_NORMAL

Scheduling factors for a process

static priority: between 100 (high priority) to 139 (low priority)

nice value: -20 (high priority) to 19 (low priority)

dynamic priority: between 100 (high priority) to 139 (low priority)

real-time priority: 1 (lowest priority) to 99 (highest priority)

Kernel Preemption

In non-preemptive kernels, kernel code runs until completion. That is, the scheduler cannot reschedule a task while it is in the kernel: kernel code is scheduled cooperatively, not preemptively. Kernel code runs until it finishes (returns to user-space) or explicitly blocks.

The Linux kernel (since 2.6), unlike most other Unix variants and many other operating systems, is a fully preemptive kernel. It is possible to preempt a task at any point, so long as the kernel is in a state in which it is safe to reschedule.

Threads

Linux threads are lightweight processes, i.e. Linux does not distinguish between threads and processes. Threads share resources with their parent process, such as heap space, file descriptors, and sockets, but each thread gets its own stack.

Each thread has a TCB, a thread control block, which contains the following:

Thread identifier: a unique id (tid) is assigned to every new thread

Stack pointer: points to the thread's stack in the process

Program counter: points to the current program instruction of the thread

State of the thread (running, ready, waiting, start, done)

Thread's register values

Pointer to the process control block (PCB) of the process that the thread belongs to

Threads are created with the clone() system call, like other processes, except that certain flags are passed to clone() to request sharing, for example:

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

Address Space of a Process

The address space of a process consists of the linear addresses the process is allowed to use. The linear address space is different for each process. Linear addresses are managed in the form of memory regions. The following are examples of system calls that manage memory regions:

brk() system call:

Of the common memory-management functions, brk() is the only one that is itself a system call; user-space allocators such as malloc() are implemented on top of brk() and mmap().

brk() allocates and deallocates whole pages, since it acts on a memory region (the heap).

System calls a.k.a syscalls:

Syscalls are implemented in the kernel and can be called from a process running in user space with appropriate privileges. There are around 330 system calls in Linux. The Linux API is written in C.

Process Termination

Process termination can be normal (the process finishes its intended code) or abnormal. Abnormal termination happens for reasons such as the following:

The kernel, the process itself, or another process sends a certain signal.

Segfault: the process tries to access a memory location it does not have the privilege to access.

Abnormal termination is indicated by a non-zero exit code.

On process exit Kernel releases the memory and other resources held by the process like file descriptors.

When a parent process dies, its children become orphan processes, and the init process becomes their parent. Sometimes parents do not wait for child processes after fork(); such child processes become zombie processes after termination. A process can wait (block) while its children are in the running state using wait or waitpid.

The wait function can block the caller until a child process terminates, whereas waitpid has an option that prevents it from blocking.

The waitpid function doesn't necessarily wait for the first child that terminates; it has a number of options that control which process it waits for.

What is Virtual Memory?

It allows us to run more applications on the system than physical memory alone could support. Each process gets its own virtual address space, backed by physical memory and, when needed, by swap space on disk.

Why Virtual Memory?

Most modern CPUs have a built-in memory management unit (MMU) that translates virtual addresses to physical ones. Using virtual memory has the following advantages:

Main memory is limited: even a small device such as a Raspberry Pi running Linux can run multiple processes, each with 32-bit addressing (4 GB per process).

Sharing: it makes it easy for library code, for example glibc, to be shared across applications.

Memory protection: memory corruption in one program does not affect the memory of another program.

Only part of a program needs to be in memory for execution.

Virtual Memory is implemented with

Demand Paging

Demand Segmentation

Physical memory management

Zones are defined using the data structure struct zone, which is defined in &lt;linux/mmzone.h&gt;. On x86 the physical memory is divided into zones as follows:

ZONE_DMA: first 16 MB of memory

This is used by legacy hardware that can address only the first 16 MB of memory.

ZONE_NORMAL: 16 MB – 896 MB

The DMA and NORMAL zone page frames can be accessed directly by Kernel.

ZONE_HIGHMEM: 896 MB – end

On 32-bit systems, page frames belonging to ZONE_HIGHMEM cannot be accessed directly by the kernel.

On 64-bit systems this problem does not exist, because the linear address space available is vastly larger than the RAM available on the system.

Each zone has a zone descriptor with following fields:

free_pages: Number of free pages

pages_min: Number of reserved pages

pages_low: Low watermark for reclaiming page frames

pages_high: High watermark for reclaiming page frames

free_area: Blocks of free pages

lru_lock: spinlock protecting the zone's LRU lists

wait_table: wait queues of processes waiting on one of the pages in the zone.

zone_mem_map: pointer to first page descriptor

name: “DMA”, “Normal” or “HighMem”

Zones in NUMA Systems

On NUMA systems, physical memory is partitioned into nodes, and each node has its own set of zones; a node is described by a node descriptor of type pg_data_t.

Memory Management in Linux

Some part of main memory is assigned permanently to the kernel.

The remaining memory is called dynamic memory.

The efficiency of the system depends upon how effectively this memory is managed.

Page Frame sizes

Linux uses a 4 KB page size by default, with 4 MB large pages also available.

With PAE (Physical Address Extension) enabled, it can instead support 2 MB large pages.

32-bit architectures typically have 4 KB pages.

Some 64-bit architectures use 8 KB pages.

This means that with 4 KB pages and 1 GB of memory, physical memory is divided into 262,144 distinct pages.

A reason to favor the small 4 KB page size:

Data transfer between disk and main memory is efficient when a smaller page size is used.

Page Descriptors

The page information is stored in struct page, which is defined in &lt;linux/mm_types.h&gt;.

Information about the state of the page frame is stored in page descriptor data structure.

Page descriptor store following information:

Flags

Various flags store the information about page.

The flag values are defined in <linux/page-flags.h>.

Frame’s reference counter

This is an important field. When its value is -1, the page frame is free and can be acquired by any process or by the kernel. A value of 0 or a positive number implies that the page is assigned to one or more processes.

Number of page table entries that refer to the frame

Index within the owning mapping

Pointers for the least-recently-used (LRU) doubly linked list

Memory Allocation

The zoned page frame allocator is used by the kernel to handle memory allocation requests for groups of contiguous page frames. The algorithm used is called the buddy system. The kernel provides various functions and macros for allocating page frames:

alloc_pages(gfp_mask, order)

alloc_page(gfp_mask)

__get_free_pages(gfp_mask, order)

__get_free_page(gfp_mask)

get_zeroed_page(gfp_mask)

__get_dma_pages(gfp_mask, order)

Deallocation of page frames is done using following:

__free_pages(page, order)

free_pages(addr, order)

__free_page(page)

free_page(addr)

High-Memory Page frame mappings:

There are 3 mechanisms to map page frames in high memory:

Permanent Kernel mapping

Temporary Kernel Mapping

Non Contiguous Memory Allocation

Fragmentation:

Frequent allocations and deallocations of page-frame groups of different sizes may leave many small blocks of free page frames scattered among allocated ones. It then becomes impossible to allocate a large chunk of contiguous page frames, even though enough free pages exist in total (external fragmentation).

To address this, the kernel keeps track of the existing blocks of free contiguous page frames and avoids the need to split up a large free block when a request arrives for a smaller chunk.

Buddy System

This is one of the algorithms for allocating memory from fixed-size segment consisting of physically-contiguous pages.

Memory is allocated from the segment using a power-of-2 allocator.

This method satisfies requests in units sized as powers of 2.

Requests are rounded up to the next highest power of 2.

When a smaller allocation is needed than is available, the current chunk is split into two buddies of the next lower power of 2.

This continues until an appropriately sized chunk is available.

For example, assume a 256 KB chunk is available and the kernel requests 21 KB (rounded up to 32 KB):

The chunk is split into A_L and A_R of 128 KB each.

One of these is further divided into B_L and B_R of 64 KB each.

One of these is further divided into C_L and C_R of 32 KB each, one of which is used to satisfy the request.

Advantage: unused chunks can quickly be coalesced into larger chunks.

Disadvantage: internal fragmentation (the 21 KB request consumes a 32 KB chunk).

Slab Allocator

There is an alternate strategy for Memory allocation Slab Allocator.

A slab allocator maintains lists of available, already allocated data structures.

When needed, a data structure is allocated from the slab; when no longer needed, it is returned to the slab rather than freed.

The slab allocator works as a generic data-structure caching layer.

A slab is one or more physically contiguous pages.

A cache consists of one or more slabs.

There is a single cache for each unique kernel data structure.

Each cache is filled with objects: instantiations of the kernel data structure.

Slab-allocation algorithm uses caches to store kernel objects

When a cache is created, it is filled with objects marked as free.

When a new object for a kernel data structure is needed, the allocator can assign any free object from the cache to satisfy the request.

The object assigned from the cache is marked as used.

A slab can be in one of 3 states: a full slab has no free objects, an empty slab has no allocated objects, and a partial slab has some free and some allocated objects.

If a slab is full of used objects, the next object is allocated from an empty slab.

If there are no empty slabs, a new slab is allocated from contiguous physical pages and assigned to the cache.

Benefits include reduced fragmentation and fast satisfaction of memory requests.

Demand Paging

Software programs are generally very large and cannot fit entirely in main memory. Hence only some of their pages are brought into main memory from disk, and only when absolutely required.

Advantages of demand paging

Similar to a paging system with swapping

Less I/O needed, no unnecessary I/O

Less memory needed. Example: Raspberry pi.

Faster response because of shared libraries.

More users can be using the operating system at a time

Details of demand paging

Extreme case: start a process with no pages in memory.

The OS sets the instruction pointer to the first instruction of the process; since that page is not memory-resident, a page fault occurs.

Likewise, every other page of the process faults the first time it is accessed.

Pure demand paging never brings a page into memory until it is required.

In reality, a given instruction could access multiple pages, causing multiple page faults:

one page for the instruction and many for data would mean unacceptable system performance.

Fortunately, this pain is decreased by locality of reference, which yields reasonable performance.