Performance tools

Frequently, we need to identify slow portions of our programs so we can improve performance. A number of tools are available to profile programs and identify how much time is spent where. The most common of these sample the program periodically: a recording phase gathers data while the program runs, and a later analysis phase examines it. We will use two common tools to analyze a simple program: Google pprof and Linux perf.

Google pprof

Google pprof is a tool available as part of the Google perftools package. It is used with
libprofiler, a sampling-based profiler that is linked into your binary. There are three steps to using pprof: linking libprofiler into the binary, generating profile output, and analyzing the output. The following links a binary with libprofiler:

% gcc main.c -lprofiler

For any binary linked with libprofiler, setting the environment variable CPUPROFILE enables profiling and specifies the output file. The following command runs ./a.out and writes profiling data to out.prof:

% CPUPROFILE=out.prof ./a.out

We can now analyze this file using pprof. Below, we output the sample counts for all the functions in a.out:
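A typical invocation looks like the following; --text selects pprof's textual report mode, which lists each function with its sample counts (exact flags can vary between perftools versions):

% pprof --text ./a.out out.prof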

Linux perf

On Linux, the perf system is a powerful tool for analyzing program / system performance. It provides some nice abstractions over tracking hardware counters on different CPUs. It defines a number of events to be tracked and recorded. Run perf list to see a list of the events allowed on your system.
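As a sketch of the usual workflow (assuming a stock perf installation), we record samples while the program runs and then browse them interactively:

% perf record ./a.out
% perf report

Alternatively, perf stat ./a.out prints aggregate counter totals for the whole run (cycles, instructions, stalled cycles, and so on) rather than per-function samples.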

Our Investigation

try_perf.c is a C program that counts the number of even values
in an array of random numbers. We run with 8 threads that all increment a global
counter every time they see an even number. We compile the program with -lprofiler so we can generate profile output to examine.

The output above is actually misleading: sampling profilers attribute a sample to the instruction the program counter points at when the interrupt fires, which is often the instruction after a long-latency operation (a phenomenon known as skid). If you look at the assembly (shown below), the instruction immediately after the atomic instruction (the addq $0x1,-0x8(%rbp) after the lock addq $0x1,(%rax)) gets excess hits that count towards the for loop when they should probably count towards the atomic instruction.

Wow, that's a lot of stalled instructions! The 8 threads are sharing the same counter, so every increment bounces the counter's cache line between cores, generating a lot of memory traffic. We modify the program so each thread uses its own counter, and we aggregate the per-thread counts at the end (with private counters, we no longer need the atomic instruction).