Sysenter Based System Call Mechanism in Linux 2.6

Starting with version 2.5, the Linux kernel introduced a new system call
entry mechanism for Pentium II and later processors. Because the
existing software interrupt method performed poorly on Pentium IV
processors, an alternative entry mechanism was implemented using the
SYSENTER/SYSEXIT instructions available on Pentium II and later
processors. This article explores the new mechanism. The discussion is
limited to the x86 architecture, and all source code listings are based
on Linux kernel 2.6.15.6.

1. What are system calls?

System calls give userland processes a way to request services from
the kernel. What kind of services? Services that are managed by the
operating system, such as storage, memory, networking, and process
management. For example, if a user process wants to read a file, it has
to make the 'open' and 'read' system calls. Processes generally do not
invoke system calls directly; the C library provides an interface to
all system calls.
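
As a small illustration of the C library's role, the sketch below
obtains the process ID twice: once through the ordinary getpid()
wrapper and once through the generic syscall() function, which takes
the system call number explicitly. It assumes a Linux system with
glibc; the helper names pid_via_wrapper and pid_via_syscall are made up
for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask for the process ID through the usual C library wrapper. */
static pid_t pid_via_wrapper(void)
{
    return getpid();
}

/* Ask for the same thing by passing the system call number to
 * syscall(), which still goes through the C library but makes the
 * number explicit. */
static pid_t pid_via_syscall(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0); /* getpid is documented as always successful */
    return (pid_t)ret;
}
```

Both helpers should report the same process ID, because the wrapper
ends up issuing the same system call.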

2. What happens in a system call?

A piece of kernel code runs at the request of a user process. This code
runs in ring 0 (with current privilege level -CPL- 0), which is the
highest level of privilege in the x86 architecture. All user processes
run in ring 3 (CPL 3). So, to implement a system call mechanism, we need
1) a way to call ring 0 code from ring 3 and 2) some kernel code to
service the request.

3. Good old way of doing it

Until some time ago, Linux implemented system calls on all x86
platforms using software interrupts. To execute a system call, the user
process copies the desired system call number to %eax and executes
'int 0x80'. This generates interrupt 0x80, and an interrupt service
routine is called. For interrupt 0x80, this routine is an "all system
calls handling" routine that executes in ring 0. The routine, defined in
the file /usr/src/linux/arch/i386/kernel/entry.S, saves the current
state and calls the appropriate system call handler based on the value
in %eax.
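
The userland side of this method can be sketched with GCC inline
assembly. This is only an illustration under stated assumptions: the
'int 0x80' path is compiled only on 32-bit x86 (__i386__), other
targets fall back to syscall() so the code still builds, and the helper
name getpid_int80 is invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue getpid the old way: system call number in %eax, then int 0x80.
 * The kernel leaves the return value in %eax. */
static long getpid_int80(void)
{
    long ret;

#if defined(__i386__)
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)          /* result comes back in %eax */
                      : "a"(SYS_getpid)    /* system call number in %eax */
                      : "memory");
#else
    /* Not 32-bit x86: int 0x80 is not the native entry path here, so
     * use the C library's generic entry point instead. */
    ret = syscall(SYS_getpid);
#endif
    assert(ret > 0);
    return ret;
}
```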

4. New shiny way of doing it

It turned out that the software interrupt method was much slower on
Pentium IV processors. To solve this issue, Linus implemented an
alternative system call mechanism that takes advantage of the
SYSENTER/SYSEXIT instructions provided by all Pentium II and later
processors. Before going further with this new mechanism, let's get
more familiar with these instructions.

4.1. SYSENTER/SYSEXIT instructions:

The SYSENTER instruction is part of the "Fast System Call" facility
introduced on the Pentium® II processor. The SYSENTER instruction is
optimized to provide the maximum performance for transitions to
protection ring 0 (CPL = 0). The SYSENTER instruction sets the
following registers according to values specified by the operating
system in certain model-specific registers.

CS register set to the value of (SYSENTER_CS_MSR)

EIP register set to the value of (SYSENTER_EIP_MSR)

SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)

ESP register set to the value of (SYSENTER_ESP_MSR)

It looks as if the processor is trying to help us. Let's quickly look at SYSEXIT as well:

The SYSEXIT instruction is part of the "Fast System Call" facility
introduced on the Pentium® II processor. The SYSEXIT instruction is
optimized to provide the maximum performance for transitions to
protection ring 3 (CPL = 3) from protection ring 0 (CPL = 0). The
SYSEXIT instruction sets the following registers according to values
specified by the operating system in certain model-specific or general
purpose registers.

CS register set to the sum of (16 plus the value in SYSENTER_CS_MSR)

EIP register set to the value contained in the EDX register

SS register set to the sum of (24 plus the value in SYSENTER_CS_MSR)

ESP register set to the value contained in the ECX register
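
The register loads described above can be mimicked in a tiny userland
model. This is a sketch, not processor-accurate emulation: the struct
and function names (sim_msrs, sim_cpu, sim_sysenter, sim_sysexit) are
invented here. One detail beyond the quoted text is that SYSEXIT also
forces the requested privilege level (the low two bits) of the new CS
and SS selectors to 3, as described in Intel's instruction reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three model-specific registers programmed by the operating system. */
struct sim_msrs {
    uint32_t sysenter_cs;
    uint32_t sysenter_esp;
    uint32_t sysenter_eip;
};

/* The handful of CPU registers that SYSENTER/SYSEXIT read or write. */
struct sim_cpu {
    uint16_t cs, ss;
    uint32_t eip, esp;
    uint32_t ecx, edx;
};

/* SYSENTER: every target register comes from the MSRs. */
static void sim_sysenter(struct sim_cpu *cpu, const struct sim_msrs *msr)
{
    assert(cpu != NULL && msr != NULL);
    cpu->cs  = (uint16_t)msr->sysenter_cs;
    cpu->eip = msr->sysenter_eip;
    cpu->ss  = (uint16_t)(msr->sysenter_cs + 8);
    cpu->esp = msr->sysenter_esp;
}

/* SYSEXIT: selectors are derived from SYSENTER_CS_MSR, while EIP and
 * ESP are taken from EDX and ECX. The "| 3" models the RPL being
 * forced to 3. */
static void sim_sysexit(struct sim_cpu *cpu, const struct sim_msrs *msr)
{
    assert(cpu != NULL && msr != NULL);
    cpu->cs  = (uint16_t)((msr->sysenter_cs + 16) | 3);
    cpu->eip = cpu->edx;
    cpu->ss  = (uint16_t)((msr->sysenter_cs + 24) | 3);
    cpu->esp = cpu->ecx;
}
```

With an illustrative SYSENTER_CS_MSR value of 0x60, the model yields
kernel selectors 0x60 and 0x68 on entry and user selectors 0x73 and
0x7b on exit. The point to note is that a single MSR value fixes the
relative layout of four segment descriptors.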

SYSENTER_CS_MSR, SYSENTER_ESP_MSR, and SYSENTER_EIP_MSR are not really
register names. Intel identifies these model-specific registers only by
address: 174h, 175h, and 176h respectively.

Please note that, in the kernel's setup code, 'tss' refers to the Task
State Segment (TSS), and tss->esp1 thus points to the kernel mode
stack. [4] explains the use of the TSS in Linux as follows:

The x86 architecture includes a specific segment type called the Task
State Segment (TSS), to store hardware contexts. Although Linux doesn't
use hardware context switches, it is nonetheless forced to set up a TSS
for each distinct CPU in the system. This is done for two main reasons:

- When an 80x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS.

- When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

So, during initialization, the kernel sets up these registers such
that, after a SYSENTER instruction, ESP points to the kernel mode stack
and EIP points to sysenter_entry.

The kernel also sets up the system call entry/exit points for user
processes. It creates a single page in memory and attaches it to every
process's address space when the process is loaded. This page contains
the actual implementation of the system call entry/exit mechanism. The
definition of this page can be found in the file
/usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S. The kernel calls
this page the virtual dynamic shared object (vdso). Its existence can
be confirmed by looking at the output of cat /proc/`pid`/maps.
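
The same check can be done from inside a program. The sketch below
scans /proc/self/maps for the vdso mapping and extracts its address
range. It assumes that the mapping is labeled "[vdso]" in the maps
output; the function names parse_vdso_line and find_vdso are made up
for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 and fill in the address range if the given maps line
 * describes the vdso mapping, 0 otherwise. */
static int parse_vdso_line(const char *line,
                           unsigned long *start, unsigned long *end)
{
    assert(line != NULL && start != NULL && end != NULL);
    if (strstr(line, "[vdso]") == NULL)
        return 0;
    /* Every maps line begins with "start-end" in hexadecimal. */
    return sscanf(line, "%lx-%lx", start, end) == 2;
}

/* Scan the calling process's own mappings for the vdso page. */
static int find_vdso(unsigned long *start, unsigned long *end)
{
    char line[512];
    int found = 0;
    FILE *maps = fopen("/proc/self/maps", "r");

    if (maps == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, maps) != NULL)
        found = parse_vdso_line(line, start, end);
    fclose(maps);
    return found;
}
```

A caller would pass two unsigned long variables to find_vdso() and
print them if it returns 1. The addresses differ from system to system.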

Initiation:
Userland processes (or the C library on their behalf) call
__kernel_vsyscall to execute system calls. The address of
__kernel_vsyscall is not fixed. The kernel passes this address to
userland processes using the AT_SYSINFO ELF parameter. AT_ ELF
parameters, a.k.a. ELF auxiliary vectors, are loaded onto the process
stack at startup, along with the process arguments and the environment
variables. See [1] for more information on ELF auxiliary vectors.
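
To make the auxiliary vector less abstract, the sketch below looks up
one entry by type. Rather than walking the process stack, it reads
/proc/self/auxv, which exposes the same (type, value) pairs as a flat
array; that choice, and the function name auxv_lookup, are assumptions
of this example and not something the kernel sources prescribe.

```c
#include <assert.h>
#include <elf.h>
#include <link.h>
#include <stdio.h>

/* Search the auxiliary vector of the calling process for an entry of
 * the given type (AT_PAGESZ, AT_SYSINFO, ...). Returns 1 and stores
 * the value on success, 0 if the entry is absent or the file cannot
 * be read. */
static int auxv_lookup(unsigned long type, unsigned long *value)
{
    ElfW(auxv_t) entry;   /* Elf32_auxv_t or Elf64_auxv_t as appropriate */
    int found = 0;
    FILE *f;

    assert(value != NULL);
    f = fopen("/proc/self/auxv", "rb");
    if (f == NULL)
        return 0;
    /* Entries are fixed-size records terminated by an AT_NULL entry. */
    while (fread(&entry, sizeof entry, 1, f) == 1 &&
           entry.a_type != AT_NULL) {
        if (entry.a_type == type) {
            *value = (unsigned long)entry.a_un.a_val;
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling auxv_lookup(AT_SYSINFO, &addr) on a 32-bit x86 system would be
expected to yield the address of __kernel_vsyscall. On other
architectures the entry may simply be absent, in which case the
function returns 0.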

After control moves to this address, the registers %ecx, %edx and %ebp
are saved on the user stack, and %esp is copied to %ebp before sysenter
is executed. This %ebp later helps the kernel restore the userland
stack. After the sysenter instruction executes, the processor starts
execution at sysenter_entry. sysenter_entry is defined in
/usr/src/linux/arch/i386/kernel/entry.S as follows (see my comments in [ ]):

Inside sysenter_entry: between lines 183 and 202, the kernel saves the current state by pushing register values onto the stack.

Observe that $SYSENTER_RETURN is the userland return address, as defined in /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S, and that %ebp contains the userland ESP, because %esp was copied to %ebp before sysenter was executed.

After saving the state, the kernel validates the system call number
stored in %eax. Finally, the appropriate system call is invoked using
the instruction:

210 call *sys_call_table(,%eax,4)

This is very similar to the old way.
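
The validate-then-dispatch step can be pictured in C as an array of
function pointers indexed by the system call number. The table,
handlers, and names below (demo_table, demo_dispatch, and so on) are
hypothetical stand-ins for this sketch, not the kernel's
sys_call_table. Returning -ENOSYS for an out-of-range number mirrors
what the kernel reports for an unknown system call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*demo_syscall_fn)(long a, long b, long c);

/* Two toy handlers standing in for real system calls. */
static long demo_getpid(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return 1234; /* made-up process ID */
}

static long demo_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

/* The table: the index is the system call number. */
static const demo_syscall_fn demo_table[] = { demo_getpid, demo_add };

#define DEMO_NR_SYSCALLS (sizeof demo_table / sizeof demo_table[0])

/* Validate the number first, then make an indirect call through the
 * table, as the 'call *sys_call_table(,%eax,4)' instruction does. */
static long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;
    assert(demo_table[nr] != NULL);
    return demo_table[nr](a, b, c);
}
```

The scale factor 4 in the assembly instruction corresponds to the size
of one table entry on a 32-bit system. In C, the array indexing hides
that multiplication.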

After the system call completes, the processor resumes execution at line 211. Looking further into the sysenter_entry definition:

This code copies the value in %eax (the system call's return value) to
the stack. The userland return address (to-be EIP) and the userland ESP
are then copied from the kernel stack to %edx and %ecx respectively,
which is where SYSEXIT expects them. Observe that the userland return
address, $SYSENTER_RETURN, was pushed onto the stack in line 187. After
that, 0x28 bytes were pushed onto the stack. That's why 0x28(%esp)
points to $SYSENTER_RETURN.
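
The 0x28 offset can be double-checked with a little arithmetic: ten
4-byte values sit below the return address on the stack. The struct
below is a sketch of such a saved-register frame, listed from the
lowest address upwards. The field order is assumed to follow the i386
pt_regs layout, and the names demo_frame and demo_load_word are
invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Saved registers as they sit on the kernel stack, starting at %esp.
 * Each slot is one 32-bit push. */
struct demo_frame {
    uint32_t ebx, ecx, edx, esi, edi, ebp, eax; /* 7 general registers */
    uint32_t xds, xes;                          /* 2 segment registers */
    uint32_t orig_eax;                          /* system call number  */
    uint32_t eip;                               /* return address      */
    uint32_t xcs, eflags, esp, xss;             /* rest of the frame   */
};

/* Read the 32-bit word found at a byte offset from the frame base,
 * which is what an operand such as 0x28(%esp) does. */
static uint32_t demo_load_word(const struct demo_frame *frame, size_t offset)
{
    uint32_t value;

    assert(frame != NULL && offset + sizeof value <= sizeof *frame);
    memcpy(&value, (const unsigned char *)frame + offset, sizeof value);
    return value;
}
```

Ten slots of 4 bytes each precede eip, so its offset is 10 * 4 = 40 =
0x28, in agreement with the text. In this sketch the saved userland ESP
lands three slots later, at offset 0x34.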

After that, the SYSEXIT instruction is executed. As we know from the
previous section, sysexit copies the value in %edx to EIP and the value
in %ecx to ESP. sysexit transfers the processor back to ring 3, and
execution resumes in userland.

This performs the getpid() system call (__NR_getpid is 20) using
__kernel_vsyscall instead of int 0x80. Why %gs:0x10? Parsing the
process stack to find the value of AT_SYSINFO can be cumbersome. So,
when libc.so (the C library) is loaded, it copies the value of
AT_SYSINFO from the process stack to the TCB (Thread Control Block).
The segment register %gs refers to the TCB.

Please note that the offset 0x10 is not fixed across systems. I found
it for my system using GDB. A system-independent way to find
AT_SYSINFO is given in [1].
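
On systems whose C library provides getauxval() (glibc 2.16 and later,
so newer than the kernel discussed here), the address can also be
obtained without any hard-coded TCB offset. The sketch below is an
illustration under those assumptions: the call through
__kernel_vsyscall is compiled only for 32-bit x86, other targets fall
back to syscall(), and the helper name getpid_vsyscall is made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <elf.h>
#include <sys/auxv.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Perform getpid through the kernel-provided entry point when its
 * address is available, otherwise through the C library. */
static long getpid_vsyscall(void)
{
    long ret;

#if defined(__i386__)
    /* AT_SYSINFO carries the address of __kernel_vsyscall. */
    unsigned long entry = getauxval(AT_SYSINFO);

    if (entry != 0) {
        __asm__ volatile ("call *%1"
                          : "=a"(ret)                    /* result in %eax */
                          : "r"(entry), "a"(SYS_getpid)  /* number in %eax */
                          : "memory", "cc");
        return ret;
    }
#endif
    /* No AT_SYSINFO entry (or not 32-bit x86): use the generic path. */
    ret = syscall(SYS_getpid);
    assert(ret > 0);
    return ret;
}
```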