Improving Compilation Times: A Case Study in Performance Analysis

The software engineers at e-Gizmo, a software development firm
developing B2C applications for the emerging, high-performance canine
consumables market, are not particularly happy--a library critical to
their environment is taking longer to compile than they'd like.

Their development server, polaris, consists of a Sun Enterprise 2 server with a pair of 250MHz UltraSPARC II processors, 512MB of memory, and two 18GB disks--one for 1GB of swap and the root
file system and the other for /export/src, where all their
compilation work is done. The system is running Solaris 7 with all the
current patches, and the developers are using the Forte Developer 6 update 2 compiler suite. They haven't done any tuning work.

The complaint of the software developers is that compiling one of
their products, which consists of about 100,000 lines of C, takes too
long--they've timed it at about 18 minutes. They want to see compilation
times around 12 minutes, about a one-third reduction, without any change to the
compiler output. Furthermore, management has established that there are no
funds available for this project, so they'll have to make do with the hardware they have.

Because this piece of software is critical to the performance
of their entire application suite, it's compiled at a high
optimization level: -fast -xarch=v8plusa.
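In a Makefile, those flags would typically be carried like this (a sketch; the variable names are conventional, and only the flags themselves come from the text):

```make
# Hypothetical Makefile fragment -- only the flags are from the text.
CC     = cc
CFLAGS = -fast -xarch=v8plusa
```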

Before we start looking at the system's workload, let's
figure out what we might suspect to be bottlenecks and plan a tuning
strategy. Based on how we think compilation stresses the system, we
suspect that we will need to look at memory consumption, CPU usage, and
disk I/O. Since NFS isn't involved, networking isn't likely to be a
problem. So, we can formulate three questions:

Are we running short of memory? Our best evidence for a memory
shortage is going to come from the sr field in the output of
vmstat. This is discussed in more detail in Chapter 4 of System Performance Tuning, 2nd Edition.

Are we running into disk I/O problems? Compilation is
often very disk-intensive. We'll look at this through the iostat
command. Disk I/O is covered in Chapter 5.

Are we saturating the CPUs? We are in a flat-out race to
compile this library as fast as possible, so we want both CPUs to be
working 100% of the time. For a more in-depth discussion of processor
performance, see Chapter 3.

Let's take a look at the data that we actually get during our first
trial run. I'll summarize representative output from vmstat 5,
iostat -xnP 30, and mpstat 5 in terms of the three questions we posed above.
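The data gathering itself can be scripted. Here is a sketch of a capture harness, assuming the build is driven by make; the log file names are illustrative, while the statistics commands and intervals are the ones named above:

```shell
#!/bin/sh
# Run the three statistics collectors for the duration of one trial
# build, then stop them. Log names and the build command are illustrative.
vmstat 5        > vmstat.out  & vm=$!
iostat -xnP 30  > iostat.out  & io=$!
mpstat 5        > mpstat.out  & mp=$!

/bin/time make                  # the trial compile

kill $vm $io $mp                # stop the collectors
```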

Our top indicator of a memory shortage is the sr column, but it's consistently zero. If we were short of memory, the first thing we'd try is turning on priority paging. One thing that does look suspicious is the fact that we have some idle time, as shown by the id column; we'll have to take a look at that.
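For reference, on Solaris 7 priority paging is switched on with a kernel tunable in /etc/system (a config sketch; the setting takes effect at the next reboot):

```
* /etc/system fragment: enable priority paging (Solaris 7)
set priority_paging=1
```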

It looks like we're only consuming one of our processors with much useful work. On a multiprocessor system like the one that e-Gizmo is using for software development, we can use dmake, which parallelizes the compilation process. Let's see if we can use that to get the CPU utilization up.

To test this, we should start by running make clean to clear out all the
old object files. Then we'll unmount and remount /export/src to flush any cached file data. This way, we'll be running the same test again, but this time with a parallel make. We'll start with two threads of compilation, since we have two processors.
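Written out, the reset-and-rerun sequence looks like this (the path and commands come from the text; running dmake against the default target is an assumption):

```shell
#!/bin/sh
# Rerun the trial from a cold start, this time in parallel.
make clean              # remove the old object files
umount /export/src      # unmount and remount to flush cached file data
mount /export/src
/bin/time dmake -j 2    # two compilation threads, one per processor
```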

Our performance target is the wall-clock compilation time, which /bin/time reports as "real," so that's pretty good--we've already beaten our performance target by almost two minutes! A quick look at the output of mpstat shows that we have indeed gotten rid of almost all the idle time.
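Since the target is stated in wall-clock terms, it helps to reduce /bin/time's "real" readings to seconds before comparing runs. A small sketch, with illustrative (not measured) times standing in for the serial make and the dmake -j 2 run:

```shell
#!/bin/sh
# Convert a /bin/time "real" reading (min:sec or plain seconds) to
# seconds, then compare two runs. Both sample times are illustrative,
# not measurements from polaris.
to_secs() {
    echo "$1" | awk '{
        n = split($NF, t, ":")
        secs = (n == 2) ? t[1] * 60 + t[2] : t[1]
        print secs
    }'
}

serial=$(to_secs 'real    18:06.4')     # hypothetical serial make
parallel=$(to_secs 'real    10:13.7')   # hypothetical dmake -j 2

awk -v s="$serial" -v p="$parallel" 'BEGIN {
    printf "saved %.0f seconds (%.0f%% faster)\n", s - p, (s - p) / s * 100
}'
```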

At this point, we're starting to see diminishing returns. We could probably
experiment some more and drive the compilation time down by another few
seconds, but the time that e-Gizmo is willing to give us for testing is
starting to run awfully short.

So, it looks like -j 4 is going to be our best bet. Finally, we'll check the output of iostat and vmstat to make sure that we haven't got another easy tuning target elsewhere in the system, but it looks like we're out of luck.

Nope, no easy performance pickings left, and we're about out of time on the
client system, so we'll call it a day. The e-Gizmo engineers are pretty
happy with their performance boost.

As we look back on this, it was a pretty straightforward problem with a
pretty straightforward solution, but we know they aren't all this easy. The important thing to take away from this case study isn't how much speedup we achieved by using dmake to split the compilation across multiple CPUs, and especially not the exact value of -j that we found to be optimal--it's how we approached the problem. We figured out what the performance requirements were, formulated some hypotheses about where we might look for tuning opportunities, gathered data while the system was running, and then acted on that data.