Anatomy of a Syscall

NaCl syscalls are the interface between untrusted code and the trusted codebase. They are the means by which a NaCl process can execute code outside the inner sandbox. This is kind of a big deal, because the entire point of NaCl is to prevent untrusted code from getting out of the inner sandbox. Accordingly, the design and implementation of the syscall interface are a crucial part of the NaCl system.

The purpose of a syscall is to transfer control from an untrusted execution context to a trusted one, so that the thread can execute trusted code. The details of this implementation vary from platform to platform, but the general flow is the same. This figure shows the flow of control:

The syscall starts as a call from untrusted code to a trampoline, which is a tiny bit of code (less than one NaCl bundle) that resides at the bottom of the untrusted address space. Each syscall has its own trampoline, but all trampolines are identical--in fact, they're all generated by the loader from a simple template. The trampoline does at most two things:

Exits the hardware sandbox (on non-SFI implementations) by restoring the original system value of %ds.

Calls the context switch function.

The call to the context switch function does not return. Instead, when the syscall is finished, the flow of control is transferred directly back to the code that called the trampoline. The return address is still pushed on the stack as part of the call instruction, though. This value is used by the dispatcher to identify which trampoline initiated the syscall.
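As a rough illustration of how a loader might stamp identical trampolines out of one template, here is a minimal sketch in C. It is not the NaCl loader's code: the function name, the 32-byte bundle size, and the choice to pad unused bytes with the x86 HLT opcode are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values for illustration only. */
enum { BUNDLE_SIZE = 32, HLT_OPCODE = 0xF4 };

/*
 * Hypothetical helper: copy `tmpl` (tmpl_len bytes) into each of `count`
 * consecutive bundle-sized slots starting at `region`. The rest of each
 * slot is filled with HLT so that a stray jump into the padding faults.
 * Returns 0 on success, -1 if the template does not fit in one bundle.
 */
int install_trampolines(uint8_t *region, size_t count,
                        const uint8_t *tmpl, size_t tmpl_len) {
    if (tmpl_len > BUNDLE_SIZE) {
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        uint8_t *slot = region + i * BUNDLE_SIZE;
        memset(slot, HLT_OPCODE, BUNDLE_SIZE);
        memcpy(slot, tmpl, tmpl_len);
    }
    return 0;
}
```

Because every slot has the same size and the same contents, the only thing that distinguishes one trampoline from another is its address.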

Context Switch

The next step is to switch the execution context. Each thread in the NaCl process owns a trusted context as well as an untrusted context. Untrusted code cannot read the trusted stack, and trusted code can't use the untrusted stack, so nothing that uses the stack can run until the context switch takes place. For this reason, the context switch must be the first thing to run when execution enters trusted code, and the last thing to run before execution leaves trusted code.

The context switch function performs the following steps:

Read TLS to find the index in the saved context array that belongs to the current thread

Save the current context into the untrusted context array (nacl_user)

Load the trusted context from the trusted context array (nacl_sys)

Move arguments from the untrusted stack into registers (x86-64 only)

Call the syscall dispatcher function

Switching between the two contexts is similar to a thread or fiber switch: the current register set is saved, and a new set of registers is loaded. The set of registers is slightly different from a traditional thread switch. The program counter doesn't need to be saved, but the segment registers (on non-SFI systems) do. The contexts themselves are saved in a location pointed to by thread local storage. This requires some platform-dependent work, because TLS implementations differ--the Windows implementation in particular is unusually complex.
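The save-and-load step can be sketched in C as follows. This is only a model: the real switch has to be written so that it does not touch the stack, and the register set shown here is hypothetical. The array names nacl_user and nacl_sys come from the text; the struct layout, MAX_THREADS, thread_idx, and switch_to_trusted are assumptions.

```c
#include <stdint.h>

enum { MAX_THREADS = 4 };  /* assumed capacity, for illustration */

/* Hypothetical register set; the real layout is platform-specific. */
struct thread_context {
    uintptr_t stack_ptr;
    uintptr_t frame_ptr;
    uint16_t  ds;  /* segment register, relevant on non-SFI systems */
};

/* Saved contexts, one slot per thread (array names follow the text). */
struct thread_context nacl_user[MAX_THREADS];
struct thread_context nacl_sys[MAX_THREADS];

/* Stands in for the TLS lookup that yields this thread's slot index. */
_Thread_local unsigned thread_idx;

/*
 * Save the current (untrusted) context into this thread's nacl_user slot
 * and return the trusted context that should be loaded in its place.
 */
const struct thread_context *
switch_to_trusted(const struct thread_context *current) {
    unsigned idx = thread_idx;
    nacl_user[idx] = *current;
    return &nacl_sys[idx];
}
```

Note that no program counter is stored in the struct, matching the observation that it does not need to be saved on the way in.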

The x86-64 ABI expects some parameters to be loaded into registers; these parameters need to be moved from the untrusted context into the trusted context. The current implementation loads these values from the untrusted stack.

The last thing the context switch function does is transfer the flow of control to the syscall dispatcher. This function call does not return. Instead, the switch back to the untrusted function is handled by a different function (NaClSwitch(), currently).

Dispatcher

Once the context switch succeeds, the code becomes a lot more straightforward. The dispatcher does the following:

Determine which syscall was called based on the address of the trampoline

Fix ABI mismatches on the stack

Look up the syscall implementation in the dispatch table

Call the syscall

Sandbox the return address

Initiate the switch back to untrusted code

The dispatcher determines which syscall was called by reading the trampoline return address from the untrusted stack. Since the trampolines are evenly spaced in memory, the return address can be used to determine the ordinal position of the trampoline that initiated the syscall. The ordinal position is then used as a lookup into a dispatch table.
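The address-to-ordinal calculation and the table lookup might look like the sketch below. The base address, the 32-byte spacing, the table size, and the two placeholder syscalls are assumptions chosen for the example, not values taken from the NaCl source. The sketch also assumes the return address falls strictly inside the trampoline's slot, which holds when the trampoline is smaller than its slot.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
enum {
    TRAMPOLINE_BASE   = 0x10000,  /* address of the first trampoline */
    TRAMPOLINE_STRIDE = 32,       /* spacing between trampolines */
    SYSCALL_COUNT     = 4
};

typedef int32_t (*syscall_fn)(void);

/* Placeholder syscall implementations (hypothetical). */
int32_t sys_null(void) { return 0; }
int32_t sys_unimplemented(void) { return -1; }

const syscall_fn dispatch_table[SYSCALL_COUNT] = {
    sys_null, sys_unimplemented, sys_unimplemented, sys_unimplemented
};

/* Map a trampoline return address to its ordinal, or -1 if out of range. */
int trampoline_index(uintptr_t return_addr) {
    if (return_addr < (uintptr_t)TRAMPOLINE_BASE) {
        return -1;
    }
    uintptr_t idx =
        (return_addr - (uintptr_t)TRAMPOLINE_BASE) / (uintptr_t)TRAMPOLINE_STRIDE;
    return idx < (uintptr_t)SYSCALL_COUNT ? (int)idx : -1;
}

/* Look up the handler for a return address; NULL if none matches. */
syscall_fn lookup_syscall(uintptr_t return_addr) {
    int idx = trampoline_index(return_addr);
    return idx < 0 ? NULL : dispatch_table[idx];
}
```

The range check matters here: the return address is read from the untrusted stack, so it cannot be trusted to point at a real trampoline.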

The dispatcher also needs to ensure that the stack is laid out in the way that the trusted codebase expects. This can be tricky, because while the untrusted code is compiled with a standard Unix-style toolchain, the trusted code is compiled with the native platform compilers and follows the native ABI. For example, the Windows x86-64 calling convention is very different from the Linux x86-64 convention. The dispatcher is responsible for fixing the stack to comply with the target platform's alignment and padding rules.
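To make the alignment and padding rules concrete, here is a small sketch of the arithmetic involved on x86-64. Both the System V and the Windows conventions expect the stack pointer to be 16-byte aligned at a call, and the Windows convention additionally requires the caller to reserve 32 bytes of shadow space. The function name and the is_win64 flag are hypothetical; the actual fix-up in NaCl is not shown in this document and may differ.

```c
#include <stdint.h>

enum { STACK_ALIGN = 16, WIN64_SHADOW_BYTES = 32 };

/*
 * Hypothetical helper: given a stack pointer, return the value to use
 * immediately before calling into natively compiled code. Reserves the
 * Win64 shadow space when requested, then rounds down to a 16-byte
 * boundary (stacks grow toward lower addresses).
 */
uintptr_t prepare_call_stack(uintptr_t sp, int is_win64) {
    if (is_win64) {
        sp -= WIN64_SHADOW_BYTES;
    }
    return sp & ~(uintptr_t)(STACK_ALIGN - 1);
}
```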

Once the stack has been fixed, the dispatcher calls the syscall function pointer that it retrieved from the dispatch table. This call returns normally. The last thing the dispatcher does is mask the user return pointer and call the trusted-to-untrusted context switch function. That call does not return.
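Masking the user return pointer can be pictured as two bitwise operations: one that confines the address to the untrusted region, and one that forces it onto a bundle boundary. The sketch below assumes a 256 MiB untrusted region starting at address zero and 32-byte bundles; both values and the function name are illustrative assumptions, and the real masking is platform-dependent.

```c
#include <stdint.h>

/* Assumed parameters, for illustration only. */
#define SANDBOX_SIZE ((uint32_t)1 << 28)  /* 256 MiB untrusted region */
#define BUNDLE_BYTES ((uint32_t)32)

/*
 * Hypothetical helper: clamp a return address supplied by untrusted code
 * so that it lies inside the sandbox and on a bundle boundary.
 */
uint32_t sandbox_return_addr(uint32_t addr) {
    addr &= SANDBOX_SIZE - 1;     /* keep only in-sandbox address bits */
    addr &= ~(BUNDLE_BYTES - 1);  /* round down to a bundle boundary */
    return addr;
}
```

A well-formed return address passes through unchanged; a forged one is coerced into a location that is still inside the sandbox.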

Validation and Implementation

Now the syscall is almost ready to execute. The last thing that needs to be done is to unpack the parameters and validate them. The syscall parameters are stored, along with other useful data, in a NaClAppThread structure which is passed to the syscall function. Most of the NaCl syscall implementations are wrapped within functions that decode and validate the parameters before calling the internal implementation.

The wrappers also call NaClSysCommonThreadSyscallEnter() before calling the internal implementation, and NaClSysCommonThreadSyscallLeave() after the internal implementation completes. The primary responsibility of this pair of functions is to acquire and release a mutex that prevents concurrent access to the trusted codebase. This helps eliminate possible race condition exploits.
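The wrapper pattern described above can be sketched like this. The names syscall_enter, syscall_leave, sum_bytes_impl, and sum_bytes_syscall are hypothetical stand-ins, not the real NaCl functions, and the single global pthread mutex is an assumption about how the lock might be modeled. The example syscall simply sums a range of bytes in the untrusted region.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* One lock guarding the trusted codebase (assumed model). */
pthread_mutex_t trusted_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-ins for the enter/leave pair named in the text. */
void syscall_enter(void) { pthread_mutex_lock(&trusted_lock); }
void syscall_leave(void) { pthread_mutex_unlock(&trusted_lock); }

/* Internal implementation: trusts its arguments completely. */
int32_t sum_bytes_impl(const uint8_t *buf, size_t len) {
    int32_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += buf[i];
    }
    return total;
}

/*
 * Wrapper: takes the lock, checks that the untrusted (offset, len) pair
 * stays inside the untrusted memory region, and only then calls the
 * internal implementation. Returns -1 if validation fails.
 */
int32_t sum_bytes_syscall(const uint8_t *mem, size_t mem_size,
                          uint32_t offset, uint32_t len) {
    int32_t result = -1;
    syscall_enter();
    /* Written to avoid overflow in offset + len. */
    if (offset <= mem_size && len <= mem_size - offset) {
        result = sum_bytes_impl(mem + offset, len);
    }
    syscall_leave();
    return result;
}
```

Keeping validation in the wrapper means the internal implementation never sees an address that has not been checked, and the lock is released on every path.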

Leaving the Syscall

When the syscall returns, the dispatcher function sandboxes the return address and calls a function to switch back to untrusted code. That function (NaClSwitchToApp) does the following:

Writes the user return address into the untrusted context; this will become the new untrusted program counter

Calls the trusted-to-untrusted context switch function

The trusted-to-untrusted context switch function does the following:

Restores the untrusted context

Jumps to the return address (SFI) or the springboard (non-SFI)

Springboard

On SFI systems, the trusted-to-untrusted context switch returns directly to untrusted code. On non-SFI systems, however, one more function is needed. This function is the mirror image of the trampoline function that was called when the syscall was initiated. It also lives at the bottom of the untrusted address space and is automatically written by the loader. To differentiate this incoming function from the outgoing trampoline, the incoming function is called the springboard.