gprof, bprof and Time Profilers

Mr. Vaught shows programmers a few commands to determine which sections of their code need optimization.

How the Profilers Work and Why They Sometimes Don't

The profilers under Linux work by examining the program
counter at regular intervals to see where in the code the program
is actually working. Although easy to implement, there is a
randomness to this process that results in a certain amount of
noise in the measurements. Over “long” intervals, the amount of
good data overwhelms the noise. This is usually good enough in
practice, since we are only interested in finding the
bottlenecks.

Both gprof and bprof use timers that only run while your
program is actually running. One of the ways that the kernel can
use a lot of time on your program's behalf is by reading or writing
huge amounts of data. These times do not accumulate into the
profile. A more subtle way in which the kernel can eat a good deal
of time is if your program uses so much memory that the kernel must
swap part of your program in and out of memory. This situation is
called page thrashing because you can usually hear your disk
thrashing around.
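
To make the mechanism concrete, here is a minimal sketch of this
kind of statistical sampling in C, using the POSIX setitimer()
virtual timer. The 10ms interval and all of the names are my own
invention, and a real profiler would record the interrupted program
counter into a histogram rather than keeping a bare count:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t ticks;    /* samples taken so far */

static void profile_tick(int sig)
{
    /* A real profiler would look at the interrupted program
       counter here and bump a histogram bucket for that address;
       we settle for counting the ticks. */
    (void) sig;
    ticks++;
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it;
    volatile long i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = profile_tick;
    sigaction(SIGVTALRM, &sa, NULL);

    /* ITIMER_VIRTUAL counts down only while the process executes
       in user mode, so time the kernel spends doing I/O or
       swapping on our behalf never produces a sample; this is
       exactly the weakness described above. */
    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 10000;    /* sample every 10ms */
    it.it_value = it.it_interval;
    setitimer(ITIMER_VIRTUAL, &it, NULL);

    for (i = 0; i < 100000000; i++)    /* busy work to be sampled */
        ;

    printf("%ld samples taken\n", (long) ticks);
    return 0;
}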

A simple way of checking the system time is to run your
program with this command:

time <program>

After your program finishes, three times are printed: user
time (the time the CPU spent running your program), system time
(the time the CPU spent in the kernel serving your program) and
elapsed time (real time, sometimes referred to as “wall clock”
time). By comparing these times you can get a rough view of how
much work the kernel is having to do for you.
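
For example, under bash a run might print something like this (the
exact format varies from shell to shell, and these numbers are
invented purely for illustration):

$ time ./myprog

real    0m5.72s
user    0m4.31s
sys     0m1.05s

Here the kernel spent about a second working on the program's
behalf, roughly a fifth of the total CPU time.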

My favorite profiler is a program called
pixie, which is unfortunately not
available for Linux. Pixie works by reading the executable and
inserting counting code into “basic blocks”: sections of code that
can be entered only at the start and exited only at the end.
Support exists in gcc today for counting execution of these basic
blocks (the -a option), but getting actual times
for each block is not yet supported.

Using the Profile

So now you know the location of the bottleneck in your
program. There are a couple of simple techniques for making things
go faster. The easiest is of course to use the
-O flag of gcc to optimize the code. Be warned
that optimizers are notorious for generating bad code.

It is often possible to decrease running time in exchange for
an increase in space. Consider the FORTRAN code fragment in
Listing 2.
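
Listing 2 itself is not reproduced here, but based on the
discussion that follows, a C rendering of the shape of the loop
might look like this (cx, cy, cz and tcentr() come from the text;
the node arrays n1 through n4 and everything else are invented):

/* Inner loop, called very frequently. The center (cx, cy, cz)
   of each cell is recomputed from its four corner nodes on every
   iteration: four array references, three floating-point
   additions and one multiplication per coordinate. */
for (t = 0; t < ncells; t++) {
    cx = 0.25 * (x[n1[t]] + x[n2[t]] + x[n3[t]] + x[n4[t]]);
    cy = 0.25 * (y[n1[t]] + y[n2[t]] + y[n3[t]] + y[n4[t]]);
    cz = 0.25 * (z[n1[t]] + z[n2[t]] + z[n3[t]] + z[n4[t]]);
    /* ... cx, cy and cz then feed a distance calculation that
       ends in a square root ... */
}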

Profiling revealed that the program was spending over a
quarter of its time in this loop, not just because the loop is
slow, but also because it is called frequently. On every iteration,
the variables cx, cy and cz are recalculated from scratch. If we
precalculate these values into the tcentr() array,
four array references, three floating-point additions and a
multiplication are replaced by a single array reference,
eliminating a good deal of code from a critical loop.

The cherry on top is moving the multiplication by 0.25
(multiplying by 0.25 is faster than dividing by 4.0) out of the
loop altogether; instead of multiplying each term of the sum by
0.25, we multiply the completed sum by 0.25 once. Since the loop
executes about 100,000 times, this eliminates 99,999 floating-point
multiplies. The new code is shown in Listing 3.
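
Listing 3 is likewise not reproduced here, but the two
transformations it applies can be sketched in C the same way
(again, only tcentr() and the 0.25 factor come from the text):

/* Done once, up front: precompute every center into tcentr[], so
   the hot loop needs a single array reference where it used to
   need four references, three additions and a multiplication. */
for (t = 0; t < ncells; t++)
    tcentr[t] = 0.25 * (x[n1[t]] + x[n2[t]] + x[n3[t]] + x[n4[t]]);

/* Hoisting the constant: instead of scaling every term of the
   sum by 0.25 inside the loop ... */
sum = 0.0;
for (t = 0; t < niter; t++)
    sum += 0.25 * term[t];

/* ... accumulate the raw terms and scale the completed sum once,
   saving one multiply per iteration. */
sum = 0.0;
for (t = 0; t < niter; t++)
    sum += term[t];
sum *= 0.25;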

At this point, the program was profiled again, and this
subroutine had dropped to 15% of the total time, with 10% taken up
by the square root calculation. Since a different subroutine was
now dominating the run time, the focus of the optimization effort
moved away from this subroutine.

There is one more semi-easy thing that can be done for this
code. The square roots happen to be needed only for a very limited
range of values, so we can replace the square root function with
one that looks up precalculated values in a table, interpolating
between table entries for values that fall between them. Again, we
have traded space for time.
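
As a sketch of what such a table-driven square root might look
like in C (the range, table size and names are all invented; the
original was FORTRAN and tuned to its own range of inputs):

#include <math.h>

#define TBL_SIZE 1024
#define LO 1.0          /* smallest input we will ever see */
#define HI 100.0        /* largest input we will ever see */

static double tbl[TBL_SIZE + 1];

void init_sqrt_table(void)
{
    int i;
    for (i = 0; i <= TBL_SIZE; i++)
        tbl[i] = sqrt(LO + (HI - LO) * i / TBL_SIZE);
}

double fast_sqrt(double x)      /* valid only for LO <= x <= HI */
{
    double t = (x - LO) * (TBL_SIZE / (HI - LO));
    int i = (int) t;            /* nearest table entry below x */
    if (i >= TBL_SIZE)          /* keep i+1 inside the table */
        i = TBL_SIZE - 1;
    return tbl[i] + (t - i) * (tbl[i + 1] - tbl[i]);  /* interpolate */
}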

Another common way of speeding up a program is to replace
array references with pointers. In the example at the start of this
article, the line that accounted for 98% of the run time of my
friend's program looked like:

int i, array[1000000];
...
i = 0;
while(array[i] == 0) i++;   /* scan for the first non-zero element */

This line searches for the next non-zero element of the array.
Each array reference multiplies i by a scaling
factor (implemented as a binary left shift), adds the result to the
start address of array[] and fetches from that location. The code
was replaced with a version written in terms of a pointer.
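
That replacement listing has not survived in this copy of the
article; the sketch below is a hypothetical reconstruction based on
the description in the next paragraph (the pointer p and the
four-way unrolling factor are my own):

int *p = array;
for (;;) {
    /* The first three branches are usually not taken... */
    if (p[0] != 0) break;
    if (p[1] != 0) { p += 1; break; }
    if (p[2] != 0) { p += 2; break; }
    if (p[3] != 0) { p += 3; break; }
    p += 4;   /* ...one backward branch per four elements scanned */
}
i = p - array;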

The idea here is that on certain types of machines, taking a
branch is expensive while rejecting a branch is cheap. In a very
tight loop, the overhead of the loop can end up being a significant
part of the total time. The code in the second example has been
rewritten so that most branches are not taken and that more work is
done in the body of the loop for each iteration of the loop.

As it turned out, this “optimization” didn't speed anything
up on the machine we were using. A good compiler compiling with
-O will unroll short loops for you. It is
important to profile before and after to see if what you've done
has helped or hurt.

The other main option for optimizing code consists of simply
looking for a better algorithm. For example, suppose we want to
search an array for a particular entry. If the array is very small,
we can simply check each element in turn. When the number of
elements becomes large, hash tables are a quick and easy way to
prune the number of elements that must be searched. For data that
cannot be hashed, tree searches provide another alternative. Hashing
and trees are beyond the scope of this article, but should be a
part of any programmer's bag of tricks. Any good book on data
structures can show you how they work.

Machine-specific optimizations are generally best left to the
compiler. Compilers are becoming quite good with the simplest sort
of optimizations, and gcc is one of the best. Once the profiler has
located the slow portion of the program, the best way to optimize
it is to simply imagine having to calculate everything by hand.
Hopefully, you will notice improvements that a compiler will
miss.

After all, the perfect compiler will never be written. There
is an old joke that once a computer is built that can write code as
well as a person, that computer will expect to be paid for its
work. Hey, it's job security.

Andy Vaught
is currently a Ph.D. candidate in
computational physics at Arizona State University and has been
running Linux since 1.1. He enjoys flying with the Civil Air Patrol
as well as skiing. He can be reached at
andy@maxwell.la.asu.edu.
