December 15, 2010

From ls(1) to the kernel and back again.

Too often sysadmins are afraid to dive into the source code of our core
utilities to see how they really work. We're happy to edit our scripts but we
don't do the same with our command line utilities, libraries, and
kernel. Today we're going to do some source diving in
those core components. We'll answer the age-old interview question, "What
happens when you type ls at the command line and press enter?" The answer to this question
has infinite depth, so I'll leave out some detail, but I'll capture the essence
of what is going on, and I'll show the source in each component as we go. The
pedants in the crowd may find much to gripe about but hopefully they'll do so
by posting further detail in the comments.

Requirements

It'll be helpful if you install the source on your machine for the software
we'll be looking at. Below are the commands I used to get the source for the
needed packages on Ubuntu 9.10, and similar packages are available for your
Linux distribution.

I'm using linux-source version 2.6.31.22.35, coreutils (for the code to
ls) version 7.4-2ubuntu1, bash version 3.5.21, and libc6 version
2.10.1-0ubuntu18, and finally manpages-dev to get the programmer's man pages.

Starting Out - strace & bash

One of the most useful tools in the sysadmin's arsenal is strace, a command
that will show you the system calls a program makes while it executes (its
sibling ltrace covers library calls). We'll use this tool extensively to figure out what
code we are looking for in each component.

Let's start by strace'ing bash when it runs ls. To do so, we'll start a
new instance of bash under strace. Note that I'll be cutting the output of
strace down a lot in the post for readability.

... and that's where the output stops. If you're new to strace, the key to
reading it is to make liberal use of man pages to figure out what each
system call does. Be aware that the relevant pages are in section 2 of the
man pages, so you'll need to do man 2 read to find the page on read; this is
because many system calls share a name with regular commands, which are
documented in section 1 of the man pages.

The read call is waiting for input on file descriptor 0, which is standard
input. So we type ls and hit enter (you'll see more read & write calls as
you type).

There's a lot of output, but we know we want to see ls related output, so
let's do the simple thing and look at the lines that have ls in them:

If we man 2 stat we see that stat returns information about a file if it
can find it, and an error if it can't (much more on stat later). In this case
what bash is doing is searching my $PATH environment variable in hopes of
finding an executable file with the name ls. Bash will stat the candidate
path in each $PATH directory in turn, and if none of them exists it reports
command not found. In this case, Bash found ls in /bin, and then that's the
last we see of the string ls in our output.
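To make the lookup concrete, here is a minimal sketch of a $PATH search in C. It is not bash's actual code (bash also hashes command locations and checks the mode bits itself); find_in_path is a hypothetical helper I made up for illustration, and it uses access(2) as a shortcut for the executable check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: walk a colon-separated PATH string, stat() each
   candidate, and copy the first regular, executable match into `out`.
   Returns 0 on success, -1 if nothing was found ("command not found"). */
int find_in_path(const char *path_var, const char *name,
                 char *out, size_t outlen)
{
    assert(path_var != NULL && name != NULL && out != NULL);

    const char *dir = path_var;
    while (*dir != '\0') {
        size_t dirlen = strcspn(dir, ":");   /* length of this PATH entry */

        /* An empty entry means the current directory. */
        int n = (dirlen == 0)
              ? snprintf(out, outlen, "./%s", name)
              : snprintf(out, outlen, "%.*s/%s", (int)dirlen, dir, name);

        struct stat st;
        if (n > 0 && (size_t)n < outlen &&
            stat(out, &st) == 0 &&           /* the call strace showed us */
            S_ISREG(st.st_mode) &&
            access(out, X_OK) == 0)
            return 0;

        dir += dirlen;
        if (*dir == ':')
            dir++;                           /* skip the separator */
    }
    if (outlen > 0)
        out[0] = '\0';
    return -1;
}
```

Calling find_in_path(getenv("PATH"), "ls", buf, sizeof buf) should leave the full path to ls in buf on a typical system, and under strace you would see the same run of failed stat calls followed by one that succeeds.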

We don't see ls in our output anymore because once Bash knows it can execute
the program it spawns a child process to execute that program, and we haven't
told strace to follow children of the command it is tracing. It's the next
few lines of strace that give this spawning away:

If we man 2 pipe and man 2 clone we see that bash is creating a pipe (a pair
of file descriptors, one for reading and one for writing; this is how a shell
links commands' input and output together when you give it a | character) and
clone'ing itself so that there are two copies of bash running. Remember, every
UNIX process (other than init) is a child of another process, and a brand new
process starts out as a copy of its parent. So when does ls actually happen?
Let's strace ls and find out!

That first line is the key. execve is the system call to load and run a new
executable. Once execve runs we're actually ls (well, the loader runs
first, but that's another article). Interestingly, the call to execve is in
the bash source code, not the ls source code. Let's find it in the bash
code:

/* Execute a simple command that is hopefully defined in a disk file
somewhere.
1) fork ()
2) connect pipes
3) look up the command
4) do redirections
5) execve ()
6) If the execve failed, see if the file has executable mode set.
If so, and it isn't a directory, then execute its contents as
a shell script.
[...]
*/

And looking at line 4665 we do see the call to execve. Take a look at the code
around execve - it's a bunch of error handling but nothing too hard to
understand. What's interesting is what is not there: the code only handles
errors, and nothing handles success. That is because execve will only
return if there's a failure, which makes sense - a successful call to execve
means we're running something completely different!
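The fork, pipe, and execve steps from the bash comment can be sketched in a few dozen lines of C. This is a simplified illustration, not bash's implementation: run_and_capture is a hypothetical helper, it skips redirections and the shell-script fallback, and it borrows the shell convention of exit status 127 for "could not execute".

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run `path` with `argv` in a child process, capture up
   to outlen-1 bytes of its stdout in `out`, and return its exit status
   (127 if execve failed, -1 on pipe/fork/wait errors). */
int run_and_capture(const char *path, char *const argv[],
                    char *out, size_t outlen)
{
    assert(path != NULL && argv != NULL && out != NULL && outlen > 0);

    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();                /* 1) fork: now there are two of us */
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                    /* child */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);   /* 2) connect the pipe to stdout */
        close(fds[1]);
        execve(path, argv, environ);   /* 5) become the new program */
        _exit(127);                    /* reached only if execve failed */
    }

    close(fds[1]);                     /* parent keeps only the read end */
    size_t used = 0;
    char chunk[256];
    ssize_t n;
    while ((n = read(fds[0], chunk, sizeof chunk)) > 0) {
        size_t room = outlen - 1 - used;
        size_t take = (size_t)n < room ? (size_t)n : room;
        memcpy(out + used, chunk, take);   /* extra output is discarded */
        used += take;
    }
    out[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Notice that the line after execve has no "success" branch, just like the bash code: if execve worked, the child is no longer running this program at all.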

Look around execute_cmd.c at the code around calls to shell_execve and
you'll see that that code is fairly straightforward.

Inside ls(1)

Let's look at what ls is doing by creating a single file in our directory and
ls'ing that file under strace.

Right at the end of the strace output we see bar a few times. It looks like
bar gets passed to stat, lstat, and write. Working backwards, we can man 2 write
to figure out that write sends data to a file descriptor, in this case
standard out, which is our screen. So the call to write is just ls printing out
bar. The other two system calls, stat and lstat, share a man page. The
difference is that stat follows a symbolic link and reports on the file it
points to, while lstat reports on the link itself. Let's look in the ls
source code for these calls to see why ls does both lstat and stat:

That call to lstat stands out amongst the other calls, and so it is a pretty
good guess that lstat happens for some exceptional reason that the programmer
would note with a comment. Looking at line 2749 in ls.c we see an interesting
comment a few lines above:

/* stat failed because of ENOENT, maybe indicating a dangling
symlink. Or stat succeeded, ABSOLUTE_NAME does not refer to a
directory, and --dereference-command-line-symlink-to-dir is
in effect. Fall through so that we call lstat instead. */
}
default: /* DEREF_NEVER */
err = lstat (absolute_name, &f->stat);
do_deref = false;
break;
}

That comment means that if stat failed with ENOENT, or stat succeeded and
we're not talking about a directory, ls calls lstat to see if we are looking
at a symlink. We can see that this is true by ls'ing a directory under strace:
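The same fall-through can also be tried outside of ls. Below is a minimal sketch that mirrors the logic of the excerpt above; stat_like_ls and report are hypothetical names of my own, and the real ls code handles more cases (other dereference modes, for one) than this does.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper mirroring the fall-through in the ls.c excerpt:
   try stat() first, then fall back to lstat() when stat() failed with
   ENOENT or the target is not a directory. Returns 0 on success, -1 on
   error; *is_link says whether the name itself is a symbolic link. */
int stat_like_ls(const char *name, struct stat *st, int *is_link)
{
    assert(name != NULL && st != NULL && is_link != NULL);

    int err = stat(name, st);                 /* follows symlinks */
    if (err == 0 && S_ISDIR(st->st_mode)) {
        *is_link = 0;                         /* directory: no lstat needed */
        return 0;
    }
    if (err == 0 || errno == ENOENT) {
        err = lstat(name, st);                /* looks at the link itself */
        if (err == 0) {
            *is_link = S_ISLNK(st->st_mode) ? 1 : 0;
            return 0;
        }
    }
    return -1;
}

/* Print what the helper found; purely for demonstration. */
void report(const char *name)
{
    struct stat st;
    int is_link;
    if (stat_like_ls(name, &st, &is_link) == 0)
        printf("%s: %s\n", name, is_link ? "symlink" : "not a symlink");
    else
        perror(name);
}
```

A dangling symlink is the interesting case: stat fails with ENOENT because the target is missing, but lstat still succeeds because the link itself exists.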

Where We Are Going There Is No strace

It is time to bid our friend strace a fond farewell as he doesn't have what
it takes to show us what stat is doing. For that we need to look into the
standard library, or as it is commonly known, libc.

The libc code provides a common API for UNIX programs, and a portion of that
API is a set of wrappers around the system calls. These are functions that provide a way for a
programmer to ask the kernel for a resource that is managed by the kernel,
including the resource we're interested in: the filesystem. The code we'd like
to look at is for the system call stat. However, because kernels are very
dependent on the hardware architecture they run on, and libc needs to talk to
the kernel, much of the code you'll find in the libc source is organized by
architecture. This makes finding the code for stat tricky; if we look in
io/stat.c we see basically a single line of code that calls a function called
__xstat. If we find . -name xstat.c we'll see that we want
./sysdeps/unix/sysv/linux/i386/xstat.c, which is the implementation of stat for
Linux on i386.

The code in xstat.c that isn't a reference to a C #include looks like:

Reading the code we can see a stat64 variant, which returns 64-bit file sizes
and inode numbers (it exists on 32-bit platforms too). We'll stick to plain
stat for now, but either way we need to figure out
what INLINE_SYSCALL is. A convention in C programming is that FUNCTIONS IN ALL
CAPS are pre-processor macros, which means you can typically find out what
those macros are by grep'ing for define <macroname>:

Looking at the code the call to INTERNAL_SYSCALL stands out - it appears that
all INLINE_SYSCALL is doing is calling INTERNAL_SYSCALL. Conveniently we can
scroll down in sysdep.h to find the definition of INTERNAL_SYSCALL:

/* Define a macro which expands inline into the wrapper code for a system
call. This use is for internal calls that do not need to handle errors
normally. It will never touch errno. This returns just what the kernel
gave back.
The _NCS variant allows non-constant syscall numbers but it is not
possible to use more than four parameters. */
#undef INTERNAL_SYSCALL
#ifdef I386_USE_SYSENTER
# ifdef SHARED
# define INTERNAL_SYSCALL(name, err, nr, args...) \

... but it appears to define INTERNAL_SYSCALL a few times, and I'm not sure
which one is actually used.

A good practice in a situation like this is to stop looking at the code and
instead take some time to understand the concept the code is trying to
implement. Googling for something like i386 system calls linux gets us to
Implementing A System Call On i386 Linux
(http://tldp.org/HOWTO/html_single/Implement-Sys-Call-Linux-2.6-i386/), which
says:

A system call executes in the kernel mode. Every system call has a number
associated with it. This number is passed to the kernel and that's how the
kernel knows which system call was made. When a user program issues a system
call, it is actually calling a library routine. The library routine issues a
trap to the Linux operating system by executing INT 0x80 assembly instruction.
It also passes the system call number to the kernel using the EAX register. The
arguments of the system call are also passed to the kernel using other
registers (EBX, ECX, etc.). The kernel executes the system call and returns the
result to the user program using a register. If the system call needs to supply
the user program with large amounts of data, it will use another mechanism
(e.g., copy_to_user call).

Okay, so I think the implementation of INTERNAL_SYSCALL we'll want will have 0x80 in it
and some assembly code that puts stuff in the eax register (newer x86 machines can
use sysenter instead of int 0x80 to make syscalls).
Line 419 in sysdep.h does the trick:

If we go back to xstat.c we see that the name we pass to INTERNAL_SYSCALL is
stat, and in the code above the name argument will expand from __NR_##name
to __NR_stat. The web page we found describing syscalls says that syscalls are
represented by a number, so there has to be some piece of code that turns __NR_stat into a
number. However, when I grep through all of the libc6 source I can't find any
definition of __NR_stat for i386.
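If the ## operator is unfamiliar, the token-pasting step can be tried in isolation. This is a toy sketch: the DEMO_ names and their numbers are made up for illustration and have nothing to do with real syscall numbers.

```c
#include <assert.h>

/* Hypothetical stand-ins for the __NR_* constants; the numbers are
   invented for this sketch. */
#define DEMO_NR_alpha 1
#define DEMO_NR_beta  2

/* `##` pastes the argument onto the prefix before the result is expanded,
   so DEMO_SYSCALL_NR(beta) becomes DEMO_NR_beta, which then becomes 2. */
#define DEMO_SYSCALL_NR(name) DEMO_NR_##name

/* Checked at compile time: the pasted name resolved to the right value. */
static_assert(DEMO_SYSCALL_NR(beta) == 2,
              "token pasting picked the wrong constant");

int demo_number_for_alpha(void) { return DEMO_SYSCALL_NR(alpha); }
int demo_number_for_beta(void)  { return DEMO_SYSCALL_NR(beta); }
```

The macro never sees a number; it only builds a name, which is why some other header has to supply the actual value.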

It turns out that the code that translates __NR_stat into a number is inside
the Linux kernel:

The Amulet Of Yendor: Inside The Kernel

The syscall number definitions being inside the kernel makes sense, as the kernel is
the owner of the syscall API and as such
will have the final say on what numbers get assigned to each syscall. As we're
running on 32-bit Linux, it appears the syscall number that libc is going to
put in eax is 106.
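You can poke at syscall numbers from user space without writing any assembly. The sketch below uses syscall(2) and the SYS_ names from sys/syscall.h, which are built from the kernel's __NR_ definitions. I'm using getpid rather than stat here as a deliberate simplification, because a raw stat call fills in a kernel-layout structure that libc normally converts for you; raw_getpid is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_* names, derived from the kernel's __NR_* */
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our PID by syscall number, bypassing the getpid()
   wrapper in libc. */
pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);
    assert(ret > 0);       /* getpid cannot fail */
    return (pid_t)ret;
}
```

raw_getpid() should return the same value as getpid(), and printing SYS_getpid on different architectures is a quick way to see that the numbers are not portable.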

The table in unistd_32.h is great (look at all those syscalls!) but it
doesn't tell us where the code for handling a call to stat actually lives in
the kernel. find is our friend again:

Looks like this is just a wrapper around vfs_stat, which is also in stat.c and
is a wrapper around vfs_fstatat, which again is in stat.c and is a wrapper
around two functions, user_path_at() and vfs_getattr(). We'll ignore
user_path_at() for now (it resolves the path and figures out if the file
exists) and instead follow vfs_getattr():

One thing that is helpful to do in a case like this is to look back at any
documentation I have on the function whose implementation I'm tracking down,
which in this case is the stat system call. Back to man 2 stat we see:

In the first attempt to fill stat, vfs_getattr checks to see if this
inode struct has a special function defined to fill the stat structure.
Each inode has an i_op struct which can have a getattr function, if
needed. This getattr function is not defined in fs.h but rather is defined
by the specific file system the inode is on. This makes good sense as it
allows the application programmer to call libc's stat without caring if the
underlying file system is ext2, ext3, NTFS, NFS, etc. This abstraction layer
is called the 'Virtual File System' and is why the functions above are prefixed with 'vfs'.
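The dispatch pattern is easier to see in a user-space toy. Everything below is a hypothetical, heavily simplified model (the demo_ types are mine, not the kernel's): an inode carries a table of function pointers, and the generic code calls the filesystem's getattr hook if there is one, or copies the fields itself if there isn't.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical, simplified stand-ins for kernel types. */
struct demo_stat { unsigned long ino; long long size; };
struct demo_inode;

struct demo_inode_ops {
    /* Optional per-filesystem hook; NULL means "use the generic copy". */
    int (*getattr)(const struct demo_inode *inode, struct demo_stat *st);
};

struct demo_inode {
    unsigned long ino;
    long long size;
    const struct demo_inode_ops *i_op;
};

/* Generic fallback: copy the cached inode fields straight across. */
static void demo_generic_fillattr(const struct demo_inode *inode,
                                  struct demo_stat *st)
{
    st->ino = inode->ino;
    st->size = inode->size;
}

/* The dispatcher: callers never need to know which filesystem they hit. */
int demo_vfs_getattr(const struct demo_inode *inode, struct demo_stat *st)
{
    assert(inode != NULL && st != NULL);
    if (inode->i_op != NULL && inode->i_op->getattr != NULL)
        return inode->i_op->getattr(inode, st);
    demo_generic_fillattr(inode, st);
    return 0;
}

/* A pretend network filesystem that overrides the size it reports. */
static int demo_netfs_getattr(const struct demo_inode *inode,
                              struct demo_stat *st)
{
    demo_generic_fillattr(inode, st);
    st->size = 4096;   /* pretend the server reported a fresher size */
    return 0;
}

const struct demo_inode_ops demo_netfs_ops = { .getattr = demo_netfs_getattr };
const struct demo_inode_ops demo_plain_ops = { .getattr = NULL };
```

An inode using demo_plain_ops gets its fields copied verbatim, while one using demo_netfs_ops gets whatever its own hook decides to report, and the caller's code is identical in both cases.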

Some filesystems, like NFS, implement a specific getattr handler, but the
filesystem I'm running (ext3) does not. In the case where there is no special
getattr function defined vfs_getattr will call generic_fillattr
(helpfully defined in stat.c) which simply copies the relevant data from the
inode struct to the stat struct:

If you squint a little bit at this struct you'll see all the fields you can get
out of a single ls command! Our adventure into the kernel has yielded fruit.

Just One More Turn...

If you'd like to keep going, the next thing to figure out is how the inode
struct gets populated (hint: ext3_iget) and updated, and from there figure
out how the kernel reads that data from the block device (and then how the
block device talks to the disk controller, and how the disk controller finds
the data on the disk, and so on).

I hope this has been instructive. Digging through the actual source code to a
program isn't as easy as reading a summary of how something works, but it is
more rewarding and you'll know how the program actually works. Don't be
intimidated by an unknown language or concept! We found our way through the
internals of the kernel with strace, find and grep, tools a sysadmin uses
every day.


User Rhombold on reddit pointed out that modern Linux systems no longer use the int 0x80 method to make syscalls; they instead use the method described in section 4.6 of this page: http://www.win.tue.nl/~aeb/linux/lk/lk-4.html

The way I show above is for older architectures. I'll pretend I left finding the new method of making syscalls out as an exercise for the reader, rather than just not finding that part of the code.