Anatomy of a Read and Write Call

We look at three different tactics for optimizing read and write performance under Linux.

A few years ago I was tasked with making
the Spec96 benchmark suite produce the fastest numbers possible
using the Solaris Intel operating system and Compaq ProLiant
servers. We were given all the resources that Sun Microsystems and
Compaq Computer Corporation could muster to help take both
companies to the next level in Unix computing on the Intel
architecture. Sun had just announced its flagship operating system
on the Intel platform and Compaq was in a heated race with Dell for
the best departmental servers. UnixWare and SCO were the primary
challengers since Windows NT 3.5 was not very stable at the time
and no one had ever heard of an upstart graduate student from
overseas who thought that he could build a kernel that rivaled
those of multi-billion dollar corporations.

Now many years later, Linux has gained considerable market
share and is the de facto Unix for all the major hardware
manufacturers on the Intel architecture. In this article, I will
attempt to take the lessons learned from this tuning exercise and
show how they can be applied to the Linux operating system.

As it turned out, the gcc benchmark was the one where everyone
seemed to find the most room for improvement. As we analyzed what the
benchmark was doing, we found that it basically opened a file,
read its contents, created a new file, wrote new contents, then
closed both files, over and over again. File
operations proved to be the bottleneck in performance. We tried
faster processors with insignificant improvement. We tried
processors with huge (at the time) level 1 and level 2 cache and
still found no significant improvement. We tried using a gigabyte
of memory and found little or no improvement. By using the vmstat
command, we found that the processor was relatively idle, little
memory was being used, but we were getting a significant amount of
reads and writes to the root disk. Using the same hardware and the same
test programs, UnixWare was 25% faster than Solaris Intel.
Initially, we decided that Solaris was just really slow.
Unfortunately, I was working for Sun at the time, and that was not
an answer we could take to management. We had to figure
out why it was slow and make recommendations on how to improve the
performance. The target was 25% faster than UnixWare, not
slower.

The first thing that we did was to look at the
configurations. It turned out that the two systems were identical
hardware; we simply booted from a different disk to bring up the other
operating system. The UnixWare system was configured with /tmp as a
tmpfs, whereas the Solaris system had /tmp on the root file system.
We changed the Solaris configuration to use tmpfs, but it did not
significantly improve performance. Later, we found that this was
due to a bug in the tmpfs implementation on Solaris Intel. By
breaking down the file operation, we decided to focus on three
areas: the libc interface, the inode/dentry layer, and the device
drivers managing the disk. In this article, we will look at these
three layers, talk about how to improve performance at each,
and show how the lessons apply specifically to Linux.

Test Program

If we take a characteristic program and look at what it does,
we can drill a little deeper into the operating system on each
pass. The program that we will use is relatively simple:
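
The original listing is short; based on the description earlier
(open a file, read its contents, create a new file, write the new
contents, close both, and repeat), a minimal reconstruction looks
something like the following. The file names, buffer size and pass
count are arbitrary choices for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define PASSES 1000

int main(void)
{
    char buf[8192];
    int i;

    for (i = 0; i < PASSES; i++) {
        /* open an existing file and create a fresh output file */
        int in = open("input.dat", O_RDONLY);
        int out = creat("output.dat", 0644);
        ssize_t n;

        if (in < 0 || out < 0) {
            perror("open/creat");
            exit(1);
        }
        /* copy the contents, then close both files */
        while ((n = read(in, buf, sizeof(buf))) > 0)
            write(out, buf, n);
        close(in);
        close(out);
    }
    return 0;
}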

When we compile this program, the ldd command shows
that the routines creat, write, close, open and read all come
from libc. On a Red Hat 7.3 system, ldd reported that
/lib/i686/libc.so.6 and the loader were the only libraries pulled
in at link time. Further investigation with the nm command
shows that we actually link against symbols versioned GLIBC_2.0,
which correspond to the C library that came with the gcc compiler
we used to build the program and not a separate libc belonging to
the operating system. Since libc is basically part of
the operating system, it would seem there is not much we can do.

Fortunately, it turns out that there are a variety of options
available. Initially, for our benchmark, we tried statically linking
our program, which yielded a marginal improvement but nothing substantial.
We then tried using the libc that came with the gcc compiler. It
showed a noticeable improvement in performance but not as much as we
wanted. By mistake, we tried the UnixWare libc dynamically linked
to the Solaris binary and got 30% better performance than with the
Solaris libc. We had a substantial improvement in
performance without changing anything, and at first we did not know
why. Since we did not have the source to UnixWare but did have the
source to Solaris and the gcc libc, we did a comparison. It turned
out that the Solaris implementation performed substantially more
checking and imposed significant overhead between the user's program
and the system call into the kernel. A substantial amount of
code in the Solaris libraries was there to make sure that buffers
did not overflow and that pointers did not run off into the stack.
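
To make that overhead concrete, here is a hypothetical sketch of the
kind of per-call checking a defensive write() wrapper might perform
before trapping into the kernel. This is purely illustrative; it is
not Solaris libc source:

#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Illustrative defensive wrapper: validate the arguments in user
 * space before making the system call. Every test here runs on
 * every call, which is where the per-call overhead comes from.
 */
ssize_t checked_write(int fd, const void *buf, size_t count)
{
    if (fd < 0) {                      /* reject bogus descriptors */
        errno = EBADF;
        return -1;
    }
    if (buf == NULL && count != 0) {   /* reject NULL buffers */
        errno = EFAULT;
        return -1;
    }
    return syscall(SYS_write, fd, buf, count);
}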

Basically, what happens at the libc layer is that the arbitrary
input from the user program is copied and checked to make sure it is
not malicious, for example, arguments crafted to gain root access.
A software interrupt is then generated, requesting that control be
transferred from the user process into the kernel; on the i386
architecture this is the int 0x80 instruction, with the system call
number and its arguments passed in registers. The code for the
interrupt handler can be found in
/usr/src/linux/arch/i386/kernel/entry.S. This handler either
decides that the request is a valid call into the operating system
and transfers control to the kernel routine that processes it, or
decides that the call is invalid and returns an error.
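
On Linux/i386, the stub that libc provides for a call like write()
boils down to loading a few registers and executing int 0x80. The
sketch below assumes a 32-bit x86 build (gcc -m32 on a 64-bit
machine), where __NR_write is 4:

/*
 * What the libc write() stub reduces to on Linux/i386: the system
 * call number goes in %eax, the arguments in %ebx, %ecx and %edx,
 * and "int $0x80" traps into the handler in
 * arch/i386/kernel/entry.S.
 */
static long raw_write(int fd, const void *buf, unsigned long count)
{
    long ret;
    __asm__ volatile ("int $0x80"
                      : "=a" (ret)
                      : "a" (4),              /* __NR_write */
                        "b" (fd), "c" (buf), "d" (count)
                      : "memory");
    return ret;
}

int main(void)
{
    raw_write(1, "hello\n", 6);   /* write to stdout */
    return 0;
}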

If the kernel sees that the request is to create a file,
it goes into the routines that deal with file systems. This is
done through the sys_call_table entry for sys_creat, which takes
you to the file /usr/src/linux/fs/open.c and the sys_creat routine.
This routine resolves the path name and either truncates a file
that already exists or creates the new name in the directory name
space. If the kernel sees a read from a file on a file system, the
sys_call_table entry for sys_read takes you to
/usr/src/linux/fs/read_write.c, which handles both reads and
writes. The sys_read routine looks up the file_operations structure
for the file and calls the read function of the file system on
which the file resides; that function in turn drives the device
driver that eventually controls the hardware for the attached disk.
Similar entries exist for write and close.
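
The dispatch itself is easy to picture as an array of function
pointers indexed by the system call number in %eax. The toy model
below mimics what entry.S does with sys_call_table; it is
illustrative user-space code, not kernel source:

#include <stdio.h>

typedef long (*syscall_fn)(long, long, long);

static long toy_read(long fd, long buf, long count)
{
    return printf("sys_read(fd=%ld, count=%ld)\n", fd, count);
}

static long toy_write(long fd, long buf, long count)
{
    return printf("sys_write(fd=%ld, count=%ld)\n", fd, count);
}

/* the table: slot 3 is __NR_read, slot 4 is __NR_write on i386 */
static syscall_fn sys_call_table[] = {
    [3] = toy_read,
    [4] = toy_write,
};

static long dispatch(long nr, long a, long b, long c)
{
    long nslots = sizeof(sys_call_table) / sizeof(sys_call_table[0]);

    if (nr < 0 || nr >= nslots || sys_call_table[nr] == NULL)
        return -38;               /* -ENOSYS for a bad call number */
    return sys_call_table[nr](a, b, c);
}

int main(void)
{
    dispatch(4, 1, 0, 6);         /* like int $0x80 with %eax == 4 */
    return 0;
}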

It turns out that the libc on Solaris had substantial error
checking, boundary checking and stack controls that prevented
users from hijacking the operating system. UnixWare and GNU did not
check as meticulously for these error conditions and thus were
substantially faster. Since our intent was to produce the fastest
benchmark numbers possible, we went with the UnixWare libc and
continued our optimizations. Eventually we had optimized everything
we could in user space: running the application as a real-time
thread, running /tmp in tmpfs, dynamically linking with a fast
libc, and running the test three times so that all of the code fit
into cache and remained memory resident for subsequent runs. At
that point, we were ready to figure out how to optimize the kernel.
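
As a sketch of the first of those tricks, a Linux process can
request a real-time scheduling class with sched_setscheduler(). The
priority value of 50 is an arbitrary choice for illustration, and
root privileges are required:

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* ask for SCHED_FIFO so ordinary time-sharing processes
       cannot preempt the benchmark while it runs */
    struct sched_param sp = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    /* ... run the timed workload here ... */
    return 0;
}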

The decision that we made at this point was that performance
was the most important objective. Security and reliability were
demoted to secondary objectives, and stability mattered only in
that the system could not crash before or during our tests. If your
intent is truly to take the fastest path from a read call into the
kernel, you might look at bypassing the libc read wrapper and
invoking the system call directly, as sketched below. This is a bit
risky and does reduce functionality, but for raw reads and writes
it produces optimal code.
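
A minimal sketch of that approach uses the syscall() wrapper to trap
straight into sys_read and sys_write, skipping whatever checking the
libc read and write wrappers would otherwise perform:

#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    long n;

    /* read from stdin and echo to stdout, with no libc wrapper
       logic between us and the kernel */
    n = syscall(SYS_read, 0, buf, sizeof(buf));
    if (n > 0)
        syscall(SYS_write, 1, buf, n);
    return 0;
}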

If the system depends on parameter checking for system calls to be in libc, then the system is not secure. The kernel must do these checks... after all, a program can bypass the C library and do a system call itself.

Also, checking for NULL buffers is silly. A program that reads into a NULL or writes from a NULL is broken, and trying to make it look like it is doing the right thing is probably a bad idea.
