System Call Handler and Service Routines

When a User Mode process invokes a system call, the CPU switches to Kernel Mode and starts the execution of a kernel function. In Linux, a system call must be invoked by executing the int $0x80 assembly language instruction, which raises the programmed exception that has vector 128 (see Section 4.4.1 and Section 4.2.4, both in Chapter 4).

Since the kernel implements many different system calls, the process must pass a parameter called the system call number to identify the required system call; the eax register is used for this purpose. As we shall see in Section 9.2.3 later in this chapter, additional parameters are usually passed when invoking a system call.

All system calls return an integer value. The conventions for these return values are different from those for wrapper routines. In the kernel, positive or 0 values denote a successful termination of the system call, while negative values denote an error condition. In the latter case, the value is the negation of the error code that must be returned to the application program in the errno variable. The errno variable is not set or used by the kernel. Instead, the wrapper routines handle the task of setting this variable after a return from a system call.
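The return-value convention just described can be sketched in C. This is only an illustration, not the code of any real C library: wrapper_result( ) is a hypothetical helper, and it treats every negative value as an error exactly as stated above, whereas real wrapper routines typically restrict the check to a small range of negative values.

```c
#include <errno.h>

/*
 * Hypothetical helper (not part of any real library): converts the raw
 * value returned by the kernel into the convention seen by applications.
 * A negative kernel value is the negated error code.
 */
long wrapper_result(long kernel_ret)
{
    if (kernel_ret < 0) {
        errno = (int)-kernel_ret;   /* recover the positive error code */
        return -1;                  /* wrappers signal failure with -1 */
    }
    return kernel_ret;              /* 0 or positive: success, pass through */
}
```

For instance, a kernel return value of -ENOSYS would reach the application as a return value of -1 with errno set to ENOSYS.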

The system call handler, which has a structure similar to that of the other exception handlers, performs the following operations:

• Saves the contents of most registers in the Kernel Mode stack (this operation is common to all system calls and is coded in assembly language).

• Handles the system call by invoking a corresponding C function called the system call service routine.

• Exits from the handler by means of the ret_from_sys_call( ) function (which is coded in assembly language).

The name of the service routine associated with the xyz( ) system call is usually sys_xyz( ); there are, however, a few exceptions to this rule.

Figure 9-1 illustrates the relationships between the application program that invokes a system call, the corresponding wrapper routine, the system call handler, and the system call service routine. The arrows denote the execution flow between the functions.

Figure 9-1. Invoking a system call

To associate each system call number with its corresponding service routine, the kernel uses a system call dispatch table, which is stored in the sys_call_table array and has NR_syscalls entries (usually 256). The nth entry contains the service routine address of the system call having number n.

The NR_syscalls macro is just a static limit on the maximum number of implementable system calls; it does not indicate the number of system calls actually implemented. Indeed, any entry of the dispatch table may contain the address of the sys_ni_syscall( ) function, which is the service routine of the "nonimplemented" system calls; it just returns the error code -ENOSYS.
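A dispatch table of this kind can be modeled in C as an array of function pointers. The sketch below is a simplified user-space illustration, not the kernel's definition: all demo_ names are hypothetical, the service routines take no parameters, and slot 7 is an arbitrary choice for the one "implemented" call.

```c
#include <errno.h>

#define DEMO_NR_SYSCALLS 256            /* assumption: the usual table size */

typedef long (*demo_service_t)(void);   /* simplified: no parameters */

/* Stand-in for sys_ni_syscall( ): used by every unimplemented slot. */
static long demo_ni_syscall(void)
{
    return -ENOSYS;
}

/* Hypothetical implemented service routine. */
static long demo_sys_answer(void)
{
    return 42;
}

static demo_service_t demo_call_table[DEMO_NR_SYSCALLS];

/* Point every slot at the "nonimplemented" routine, then install one
 * real routine in slot 7. */
void demo_table_init(void)
{
    for (int n = 0; n < DEMO_NR_SYSCALLS; n++)
        demo_call_table[n] = demo_ni_syscall;
    demo_call_table[7] = demo_sys_answer;
}

/* Invoke the routine stored in the nth entry; the caller must guarantee
 * that n is smaller than DEMO_NR_SYSCALLS. */
long demo_call(unsigned int n)
{
    return demo_call_table[n]();
}
```

After demo_table_init( ), demo_call(7) runs the installed routine, while any other valid number yields -ENOSYS, which shows why the table size says nothing about how many calls are actually implemented.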

The call loads the following values into the gate descriptor fields (see Section 4.4.1):

Segment Selector

The __KERNEL_CS Segment Selector of the kernel code segment.

Offset

The pointer to the system_call( ) exception handler.

Type

Set to 15. Indicates that the exception is a Trap and that the corresponding handler does not disable maskable interrupts.

DPL (Descriptor Privilege Level)

Set to 3. This allows processes in User Mode to invoke the exception handler (see

9.2.2 The system_call( ) Function

The system_call( ) function implements the system call handler. It starts by saving on the stack the system call number and all the CPU registers that may be used by the exception handler, except for eflags, cs, eip, ss, and esp, which have already been saved automatically by the control unit (see Section 4.2.4). The SAVE_ALL macro, which was already discussed in Section 4.6.1.4, also loads the Segment Selector of the kernel data segment in ds and es.

The function also stores the address of the process descriptor in ebx. This is done by taking the value of the kernel stack pointer and rounding it down to a multiple of 8 KB, that is, by clearing its 13 least significant bits (see Section 3.2.2).
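The address computation amounts to a single mask operation. The following C sketch illustrates the arithmetic on a 32-bit value; demo_descriptor_base( ) and DEMO_STACK_AREA_SIZE are hypothetical names, and the code assumes that the process descriptor sits at the lowest address of the 8 KB area that also holds the Kernel Mode stack.

```c
#include <stdint.h>

#define DEMO_STACK_AREA_SIZE 8192u   /* 8 KB area shared by stack and descriptor */

/*
 * Hypothetical helper: given a kernel stack pointer value, return the
 * 8 KB-aligned address at or below it by clearing the low 13 bits.
 */
uint32_t demo_descriptor_base(uint32_t esp)
{
    return esp & ~(DEMO_STACK_AREA_SIZE - 1u);
}
```

With a stack pointer of 0xc0123f40, for example, the mask yields 0xc0122000.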

Next, the system_call( ) function checks whether the PT_TRACESYS flag included in the ptrace field of current is set, that is, whether the system call invocations of the executed program are being traced by a debugger. If this is the case, system_call( ) invokes the syscall_trace( ) function twice: once right before and once right after the execution of the system call service routine. This function stops current and thus allows the debugging process to collect information about it.
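The control flow of the tracing case can be sketched as follows. This is a user-space illustration only: the demo_ names are hypothetical, the trace hook merely counts its invocations instead of stopping a process, and the traced argument stands in for the flag test.

```c
/* Counts how many times the hypothetical trace hook has run. */
int demo_trace_stops = 0;

/* Stand-in for syscall_trace( ); the real function stops the process. */
static void demo_syscall_trace(void)
{
    demo_trace_stops++;
}

/* Hypothetical service routine used to exercise the sketch. */
long demo_sys_noop(void)
{
    return 7;
}

/*
 * Run a service routine; when the caller is being traced, invoke the
 * hook once right before and once right after the routine.
 */
long demo_traced_call(int traced, long (*routine)(void))
{
    long ret;

    if (traced)
        demo_syscall_trace();   /* stop before the service routine */
    ret = routine();
    if (traced)
        demo_syscall_trace();   /* stop again after it returns */
    return ret;
}
```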

A validity check is then performed on the system call number passed by the User Mode process. If it is greater than or equal to NR_syscalls, the system call handler terminates.

If the system call number is not valid, the function stores the -ENOSYS value in the stack location where the eax register has been saved (at offset 24 from the current stack top). It then jumps to ret_from_sys_call( ). In this way, when the process resumes its execution in User Mode, it will find a negative return code in eax.

Finally, the specific service routine associated with the system call number contained in eax is invoked.

Since each entry in the dispatch table is 4 bytes long, the kernel finds the address of the service routine to be invoked by multiplying the system call number by 4, adding the initial address of the sys_call_table dispatch table and extracting a pointer to the service routine from that slot in the table.
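The validity check and the table lookup can be illustrated together in C. The sketch below is not the kernel's assembly code: the demo_ names, the four-entry table, and the "addresses" stored in it are all made up for the example. It spells out the byte-offset arithmetic explicitly, although ordinary array indexing would produce the same result.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NR_SYSCALLS 4   /* assumption: tiny table for illustration */
#define DEMO_ENTRY_SIZE  4   /* bytes per table entry, as in the text */

/* Stand-in for sys_call_table: made-up 32-bit routine "addresses". */
static const uint32_t demo_table[DEMO_NR_SYSCALLS] = {
    0xc0100000u, 0xc0100040u, 0xc0100080u, 0xc01000c0u
};

/*
 * Hypothetical lookup: returns -ENOSYS for an invalid number; otherwise
 * stores the slot contents in *routine and returns 0. Treating the
 * number as unsigned also rejects values that would be negative.
 */
int demo_lookup(uint32_t nr, uint32_t *routine)
{
    const unsigned char *slot;

    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;

    /* table base + number * 4 gives the address of the slot */
    slot = (const unsigned char *)demo_table + nr * DEMO_ENTRY_SIZE;
    memcpy(routine, slot, sizeof *routine);
    return 0;
}
```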

When the service routine terminates, system_call( ) gets its return code from eax and stores it in the stack location where the User Mode value of the eax register is saved. It then jumps to ret_from_sys_call( ), which terminates the execution of the system call handler (see Section 4.8.3):

    movl %eax, 24(%esp)
    jmp ret_from_sys_call

When the process resumes its execution in User Mode, it finds the return code of the system call in eax.
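The offset 24 follows from the order in which the registers are saved on the Kernel Mode stack. The C sketch below models this under the assumption that six 4-byte registers (ebx, ecx, edx, esi, edi, ebp) lie between the stack top and the saved eax; struct demo_saved_regs and demo_store_return( ) are hypothetical names, and only the first part of the saved frame is shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout of the saved registers, starting from the stack top.
 * Six 4-byte slots precede eax, which therefore sits at offset 24.
 */
struct demo_saved_regs {
    uint32_t ebx, ecx, edx, esi, edi, ebp;
    uint32_t eax;
};

/* Overwrite the saved eax, as "movl %eax, 24(%esp)" does, so that the
 * value is restored into eax when the process returns to User Mode. */
void demo_store_return(struct demo_saved_regs *frame, int32_t ret)
{
    frame->eax = (uint32_t)ret;
}
```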