Measuring and Improving Application Performance with PerfSuite

Get a realistic view of how your program runs on real hardware, so you can find small changes that make a big performance difference.

At some point, all developers of software
applications, whether targeted to Linux or not,
are likely to spend at least a small amount of time
focusing on the performance of their applications.
The reason is simple:
many potential benefits can be gained from tuning software
for improved performance. For example, in the
scientific and engineering arenas, performance
gains can make the difference between running
smaller-scale simulations and running larger,
potentially more accurate models that
improve the scientific quality of the results.
Applications that are more user-oriented also stand
to benefit from improvements that result in faster
responsiveness to the user and an improved overall
user experience.

Although microprocessor improvements over the past
decade or so have made clock speeds well in excess of
the gigahertz range commonplace, most developers are
aware that a tenfold increase in processor frequency
does not guarantee a tenfold reduction
in the runtime of an application. Additionally,
for those developing software for distribution
to others, attention to performance and responsiveness
can pay big dividends when you consider that your end
user may be running your application on a mid-1990s
era 100MHz Pentium processor.

This article is an introduction to a set of open-source software tools
called PerfSuite that can help you to understand and possibly improve
the performance of your application under Linux. PerfSuite consists
of several related tools and libraries targeted at several
different activities useful in performance-oriented analysis.

The development of PerfSuite was motivated by my own experiences in
working not only with applications that I had developed, but also with a
number of large supercomputer-class applications in both
academic and corporate settings. After having worked with several
research groups, I realized that developers often take advantage of
only a small subset of the tools available to them. They typically rely
on traditional time-based statistical profiling techniques
such as gprof.

Of course, gprof-style profiles are invaluable and
should be the mainstay of any developer's performance
toolbox. However, the microprocessors of today,
such as those on which you probably are using Linux,
offer advanced features that can provide alternative
insights into characteristics that directly affect
the performance of your software. In particular,
nearly all microprocessors in common use today
incorporate hardware-based performance measurement
support in their designs. This support can provide an
alternative viewpoint of your software's performance.
While time-based profiles tell you where your software
spends its time, hardware performance measurements
can help you understand what the processor is doing and
how effectively the processor is being utilized. Hardware measurements
can also help pinpoint particular reasons why the CPU is stalling
rather than accomplishing useful work.

Hardware Performance Counter Basics

The first time I encountered the term hardware performance counters, it
was in the context of having access to multimillion-dollar supercomputers
where every CPU cycle is critical and research teams spend substantial
amounts of time tweaking their codes in order to extract maximum
performance from the system. Often, software is tailored explicitly
for each type of computer on which it is to be run. Research teams
sometimes pore over the numbers generated by these performance counters
to measure the exact performance of their applications and to ferret
out places where they might gain additional speedup. Needless to say,
this all sounded exotic to me. But the purpose and function of
the counters turned out to be simple: they are extra
logic added to the CPU that tracks low-level operations or events
within the processor accurately and with minimal overhead.

For example, even if you're not an expert in computer architecture,
you probably already know that nearly all processors in common use are
cache-based machines. Caches, which offer much higher-speed access
to data and instructions than what is possible with main memory, are based on
the principles of temporal and spatial locality. Put another way, cache
designs hope to take advantage of many applications' tendency to
reuse blocks of data not long after first use (temporal locality) and to
also access data items near those already used (spatial locality).
If your application follows these patterns, you have a much
greater chance of achieving high performance on a cache-based processor.
If not, your performance may be disappointing. If you're interested in
improving a poorly performing application, your next task is to try
to determine why the processor is stalling instead of completing useful
work. This is where performance counters may help.
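
To make the locality argument concrete, here is a minimal sketch in
Python of a toy direct-mapped cache. It is a simulation, not a hardware
measurement, and everything in it is an assumption chosen for
illustration: the count_misses function, the 64-byte line size, the
64-line cache and the 256x256 array of 8-byte values. It walks the same
array twice, once in memory order and once with a large stride, and
counts the misses for each.

```python
def count_misses(addresses, line_bytes=64, num_lines=64):
    """Count misses in a toy direct-mapped cache for a sequence of byte addresses."""
    tags = [None] * num_lines          # the memory block currently held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block that contains this byte
        index = block % num_lines      # cache line that this block maps to
        if tags[index] != block:       # block is not resident: count a miss and load it
            misses += 1
            tags[index] = block
    return misses


# Hypothetical array: 256x256 values of 8 bytes each, stored row by row.
ROWS, COLS, ELEM_BYTES = 256, 256, 8


def address(row, col):
    """Byte offset of element (row, col) in a row-major layout."""
    return (row * COLS + col) * ELEM_BYTES


# Same elements, two visiting orders.
row_major = [address(r, c) for r in range(ROWS) for c in range(COLS)]
col_major = [address(r, c) for c in range(COLS) for r in range(ROWS)]

row_misses = count_misses(row_major)
col_misses = count_misses(col_major)

print("row-major misses:   ", row_misses)
print("column-major misses:", col_misses)
```

In this toy model, the row-major walk misses once per 64-byte line
(8,192 misses for 65,536 accesses), while the column-major walk misses
on every access. Real caches are larger and usually set-associative, so
actual ratios differ, but the miss counters described next expose the
same kind of gap on real hardware.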

It takes a little research to learn which performance counters are
available to you on a particular processor. Each CPU has
a different set of available performance counters, usually with different
names. In fact, different models in the same processor family
can differ substantially in the specific performance counters available.
In general, however, the counters measure similar types of things. For example,
they can record the absolute number of cache misses,
the number of instructions issued, the number of floating-point
instructions executed and the number of vector instructions (such as
SSE or MMX). The best reference for the counters available on
your processor is the vendor's technical reference manual,
often available on the Web.
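
Raw counts like these are usually easier to interpret once they are
combined into ratios. The following Python sketch shows the arithmetic
only. The derived_metrics function, the dictionary keys and the sample
values are all hypothetical and do not come from any real processor or
tool, and the MFLOPS figure assumes that each floating-point instruction
performs a single operation.

```python
def derived_metrics(counts, wall_seconds):
    """Turn raw event counts into a few commonly used ratios.

    `counts` is a dict of hypothetical event totals for one run; the key
    names are illustrative, not real counter names.
    """
    instructions = counts["instructions"]
    return {
        # Instructions per cycle: how much work each clock tick accomplishes.
        "ipc": instructions / counts["cycles"],
        # Level 1 data cache misses per 1,000 instructions.
        "l1_misses_per_1000_instr": 1000.0 * counts["l1_data_misses"] / instructions,
        # Millions of floating-point instructions per second
        # (assumes one operation per floating-point instruction).
        "mflops": counts["fp_instructions"] / wall_seconds / 1e6,
    }


# Made-up counts for a run that took one second of wall-clock time.
sample = {
    "cycles": 2_000_000_000,
    "instructions": 1_500_000_000,
    "l1_data_misses": 30_000_000,
    "fp_instructions": 400_000_000,
}

metrics = derived_metrics(sample, wall_seconds=1.0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

A low instructions-per-cycle figure combined with a high miss rate is
the kind of pattern that suggests the processor is waiting on memory
rather than doing useful work.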

Another complication is that kernel-level support is needed to access
the performance counters. Although the Itanium (IA-64) kernel provides
this support through the perfmon driver in the official kernel (authored by Stephane Eranian
of HP Research), the standard x86 Linux tree
currently does not.

Fortunately, efforts are underway to address these issues. The first is
the development of a performance monitoring driver for the x86 kernel
called perfctr. This is a very stable kernel patch developed by Mikael
Pettersson of Uppsala University in Sweden. The perfctr kernel patch
is becoming more widely adopted by the community and is continually
improved and maintained. The second is an effort from the Innovative
Computing Laboratory at the University of Tennessee-Knoxville called PAPI
(Performance Application Programming Interface). PAPI defines a standard
set of cross-platform performance monitoring events and
a standard API that allows measurement using hardware counters
in a portable way. The PAPI Project provides implementations for the
library on several current processors and operating systems, including
Intel/AMD x86 processors, Itanium systems and, most recently, AMD's
x86-64 CPUs. On Linux, PAPI uses the perfmon and perfctr drivers as
appropriate. Refer to the on-line Resources for references where you can
learn much more about perfctr, perfmon and PAPI.
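
The idea of a standard set of cross-platform events can be pictured as
a lookup from one portable name to whatever each processor calls the
equivalent counter. The Python sketch below is conceptual only and is
not how PAPI is implemented. The portable names are PAPI preset event
names, but the CPU family labels, the native event names and the
resolve_event function are placeholders invented for illustration.

```python
# Portable names are PAPI preset event names. The CPU labels and native
# event names are made-up placeholders, NOT real names for any processor.
NATIVE_EVENTS = {
    "cpu_family_a": {
        "PAPI_TOT_CYC": "A_CLOCK_TICKS",
        "PAPI_TOT_INS": "A_INSTR_RETIRED",
        "PAPI_L1_DCM": "A_L1D_MISS",
        "PAPI_FP_INS": "A_FLOAT_INSTR",
    },
    "cpu_family_b": {
        "PAPI_TOT_CYC": "B_CYCLES",
        "PAPI_TOT_INS": "B_INST_COMPLETED",
        "PAPI_L1_DCM": "B_DCACHE_L1_MISSES",
        # No floating-point entry: not every event exists on every CPU.
    },
}


def resolve_event(cpu, portable_name):
    """Return the native event name for a portable name on a given CPU."""
    try:
        return NATIVE_EVENTS[cpu][portable_name]
    except KeyError:
        raise LookupError(f"{portable_name} is not available on {cpu}") from None


# The measurement code asks for the same portable name on both CPUs.
for cpu in NATIVE_EVENTS:
    print(cpu, "->", resolve_event(cpu, "PAPI_TOT_INS"))
```

Code written against the portable names does not need to change when it
moves to another processor, although it still has to handle the case
where a particular event is not available.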

PerfSuite, discussed in the remainder of this article, builds upon PAPI,
perfmon and perfctr to provide developers with an even higher-level user interface as
well as additional functionality. A main focus
of PerfSuite is ease of use. Based on my experiences in working with
developers interested in performance analysis, it became clear that
an ideal solution would require little or no extra work from users
who simply want to know how well an application is
performing on a computer. They want to know this without having to learn
many details about how to configure or access the performance data at a
low level.