Performance Tuning with GCC, Part 1

by Diego Novillo

Introduction

The GNU Compiler Collection
(GCC) is one of the most popular compilers available
today. Over the last 15-20 years, GCC has evolved from a
relatively modest C compiler to a sophisticated system that
supports many languages (C, C++, Java, Objective C, Objective
C++, Ada, Fortran 95) and about 30 different architectures.

When GCC optimizes your application, the original program goes through
an assortment of optimizing transformations before generating the
final binary. Since optimizations are usually expensive, the
optimizer is organized in levels of aggressiveness. You can select
from -O1 for quick and light transformations up to
-O3 for heavy-duty optimizations. In theory, the higher
the optimization level, the faster your application will run.
Unfortunately, things are not always so simple. Generating perfect
target code is almost impossible, for a multitude of reasons:
compiler optimizations usually interfere with each other in almost
infinite ways, heuristics and cost models may be wrong, the code may
be obscure enough to confuse the optimizers, and so on.

Optimizing transformations present even more problems. For instance,
debugging optimized code is often an exercise in frustration.
Variables disappear, code is executed in seemingly random order,
entire function bodies go missing, etc. Usually, none of this means
that the compiler is buggy; these are simply side effects of the
transformations done to the code. It also happens that not all
possible transformations are included with the standard
-Ox flags, so getting the utmost
performance from your code may mean experimenting with the myriad of
-f and
-m flags.

This article provides an overview of the different flags controlling
optimization in GCC and some hints on how to use them to get the most
performance out of your application. In particular, it discusses
some of the new optimization features of the GCC 4.x series included
in Fedora™ Core 4 and the upcoming Red Hat® Enterprise Linux® versions.

Optimization flags

Optimizing transformations are performed at various levels of
abstraction. In general, optimizations can be categorized into
two groups: those that apply to all targets and those that are
target-specific. The former are controlled by -f flags;
the latter are controlled by -m flags.

When invoking GCC, the order in which flags are specified is not
important. You can add as many flags as you wish, in any order.
A typical command line looks like the following:

gcc -O2 [other -f or -m flags] -o file file.c

For a complete listing of all the available options and their
meaning, refer to the GCC
documentation.

GCC implements many transformations (over 100 passes in GCC 4.0), and
most of them can be controlled by a single flag. Specifying all these
options individually would clearly be overkill. Therefore, GCC
implements the notion of optimization levels (-Ox), which
are umbrella options that automatically enable individual
transformations. These optimization levels are described in Table 1, “GCC optimization levels”.

Flag

Description

-O0

Barely any transformations are done to the code, just code
generation. At this level, the target code can be debugged with
no loss of information.

-O1

Performs some transformations that preserve execution ordering.
Debuggability of the generated code is hardly affected: user variables
should not disappear, and function inlining is not done.

-O2

More aggressive transformations that may affect execution
ordering and usually provide faster code. Debuggability may be
somewhat compromised by disappearing user variables and function
bodies.

-O3

Very aggressive transformations that may or may not provide
better code. Some semantics may be modified (particularly
floating point). Order of execution completely distorted.
Debuggability is seriously compromised.

-Os

Optimize for size. This option enables transformations that
reduce the size of the generated code. Ironically, this may
sometimes improve application performance because there simply is
less code to execute. This may lead to reduced memory footprints
which may produce fewer page faults.

-O4

There is no -O4. Anything above
-O3 is treated as -O3.

Table 1. GCC optimization levels

In theory, it should be possible to drive GCC's optimizer at a very
fine-grained level of detail using -f/-m flags alone.
However, -Ox options enable other internal options and
code paths in the compiler that cannot be otherwise controlled. By
default, if no -Ox flag is specified then
-O0 is assumed.

At the same time, the converse also holds. Not all the possible
transformations are enabled by the -Ox flags (not even
-O3). The reason for this is usually either because the
specific transformation is still relatively immature or because it
only benefits few programs. So, when experimenting with different
optimization levels, it is common to add other -f/-m
flags.

It would be pointless to describe each and every -f/-m
flag. Instead of a laundry list of all available options, we will
discuss some of the more important flags in these areas: aliasing,
profiling, and vectorization. Unless otherwise noted, the flags
described in this article must be specified in addition to the main
optimization level (-Ox), as they are not enabled by
default.

Aliasing

When the same memory location may be accessed using two or more
different names, we say that those names are
aliased. For instance, in the code snippet
below, *p is aliased with variables a and
b.

Alias analysis helps the compiler decide whether pointer dereferences
may affect other variables. Without alias analysis, variables
a and b would appear to be dead to the
optimizers, and dead code elimination would remove the assignments
a = 39 and b = 21. So, alias analysis is
not really used for optimization, rather it prevents the optimizers
from doing transformations in the presence of memory stores that may
have hidden side effects.

However, the problem with alias analysis is that it is fantastically
difficult to compute exactly, so compilers must settle for
approximations. As a result, the compiler may assume alias relations
between pointers and variables that never actually occur at runtime.

From a user's perspective, there are not many things that can be done
about this problem. In some cases, the code can be re-arranged to
help the analyzer: for instance, by not taking the address of
variables unless absolutely necessary, or by using local unaddressable
variables to hold the values of globals or other addressable symbols.
However, this is usually infeasible in real life. In general, if you
have code that is performing poorly and you suspect that the compiler
is making poor aliasing decisions, you can ask the compiler to dump
the results of alias analysis with the flag
-fdump-tree-alias and analyze the files
file.c.tNN.aliasN.

-fargument-alias, -fargument-noalias,
-fargument-noalias-global

These three flags affect the interaction between function
arguments and global variables. The first specifies that arguments
may alias each other and other globals, the second that arguments
may not alias each other but may alias other globals, and the third
that arguments may neither alias each other nor other globals.

GCC will choose a default value according to the corresponding
language rules (in C, the default is
-fargument-alias). But if your code does not need
such a strict limitation, you can relax the requirements and give
more freedom to the optimizers.

-fstrict-aliasing

This flag is enabled by default at -O2 and up.
It means that the compiler will apply strict aliasing rules as
defined by the corresponding language standard. In previous
versions of GCC this flag was disabled by default because it used
to cause major grief in some code bases. Part of the problem was
bugs in GCC's alias analysis, but mostly it exposed bugs in
non-conformant code.

This flag is enabled by default in GCC 4.x. If you have code
that used to run fine when compiled with previous versions of
GCC but behaves erratically with GCC 4.x, try specifying
-fno-strict-aliasing. If the code works fine,
your program probably violates some language aliasing rule
(try the -Wstrict-aliasing switch).

An alternative to specifying -fno-strict-aliasing
is to use the attribute may_alias on an individual
type. If variables of a certain type are mis-compiled, you can
add the may_alias attribute to that type.

Attribute malloc

This can be used in functions that return pointers. If the
function is guaranteed to always return a pointer that cannot
possibly point to any other variable, then adding this attribute
to the function may help the aliaser reduce the number of alias
relations.

Table 2. Alias analysis flags and attributes

Profiling

Profile-directed optimizations can produce quite noticeable
performance improvements, at the cost of increased compilation time.
They work in three phases:

1. Profile code generation. Using the
   -fprofile-generate flag, the compiler inserts
   instrumentation code in the program that collects statistics
   at runtime about execution frequencies over the different code
   paths.

2. Training run. Once instrumented, the user program is executed and
   profile information is saved to a file. This file describes where
   the program spent most of its execution time.

3. Feedback optimization. Using -fprofile-use, the user
   program is compiled a second time. The profile information saved
   in the previous step is used to guide the optimizers. Currently,
   there are several optimizers that will use profile information,
   including value profiling, branch prediction, loop
   unrolling/peeling, and tail duplication.

Profile-driven optimizations are very powerful because the optimizers
can work with pretty detailed and accurate cost models based on the
exact behavior of your program. The obvious disadvantage of this
scheme is that the compilation cycle is substantially more expensive.
Not only will you need to compile your application twice, you also need to
run it with training data. This training data should represent the
common case that your program will handle in general.

In some cases, using profile-driven optimizations may be either
unfeasible or too inconvenient, but if you need the utmost performance
and your code behaves uniformly on most inputs,
-fprofile-generate and -fprofile-use may be
a significant win [1].

Vectorization

Vectorization is a technique that takes advantage of the ability of
some processors to apply the same operation on multiple pieces of data
in a single step. Modern architectures have marketed this feature as
multimedia instructions or streaming
instructions. Other commonly used terms include
SSE (Streaming SIMD Extensions),
SIMD (Single Instruction Multiple Data), and
AltiVec (Velocity Engine).

All these names describe the same idea: vector-capable hardware has a
collection of simple processing units that can operate on multiple data
simultaneously. These processing units are most commonly used for
operating on vectors of values (hence the name vector processing).

The more common exploitable code patterns are loops that operate on
arrays. In GCC 4.x, vectorization is supported on most of the targets
that offer multimedia instructions, including i686, PowerPC, and AMD64.
To enable the vectorizer, use the flag -ftree-vectorize
and -msse2 (i686/AMD64) or -maltivec
(PowerPC). For instance:

for (i = 0; i < N; i++)
  a[i] = (b[i] * c[i]) + d[i];

When compiled on an i686 machine with -ftree-vectorize
-msse2, the compiler generates vector code that uses SSE2
instructions to operate on several array elements at a time.

A word of caution about vectorization. It is not a magic bullet and
it may even degrade the performance of your application. Vector
operations require very specific data size and alignments, which can
force the compiler to add code to guarantee alignment. If the work
done inside the loop is not sufficiently expensive to justify the
added instructions, your application will run slower.

Target-specific flags

Perhaps the most immediate performance boost can be obtained by using
target-specific options when compiling your application. The most
common ones are instruction selection and scheduling for the target
processor. Since GCC can generate code for a wide variety of
architectures, when invoked with default options, GCC will generate
generic code for the processor family that it was configured with.
So, if you know that your application will always run on a specific
processor, you can tell GCC to fine tune its scheduler and instruction
selection machinery to emit code specific to your target.

-mtune=cpu-name

Tells GCC to tune generated code (instruction selection and
scheduling) to run on cpu-name. -mtune
will not affect the ABI (Application Binary Interface) nor will it
generate instructions that are specific to cpu-name.
The ABI is a set of conventions dictating how function calls are
made, the stack is organized, values are returned, default
alignments, etc. Specifying -mtune will allow you to
modify how GCC generates code for your application while
maintaining ABI compatibility with other libraries that may have
not been compiled with your target.

-march=cpu-name

Same as -mtune, and it also tells GCC to use the
ABI and instruction set specific to cpu-name. If
your application is standalone and does not need to interface with
other code, this is the more powerful switch of the two.

Table 3. Switches to control target code generation

The GCC
documentation
describes all the possible values that these
flags accept. You will also find other target-specific flags that you
may want to experiment with. For instance, if your code makes
heavy use of floating point but is not sensitive to loss of precision,
you can use the -ffast-math flag. This allows GCC to perform
floating point transformations that may break IEEE/ISO rules regarding
floating point arithmetic. This usually translates to faster floating
point code.

Loop optimizations

In GCC 4.x, we overhauled GCC's internal architecture so that we could
implement more powerful optimizations, like the auto vectorization
support we discussed earlier. The new infrastructure also allowed us
to implement some higher-level loop transformations.

Keep in mind that much like the vectorization feature, these
optimizations are brand new in GCC 4.0, so they may not be as powerful
as they could be. Also, loop transformations do not always generate
better code, so you have to experiment with them. Some of these
options already existed in previous versions of GCC, but they have
been adapted to the new infrastructure and in some cases are able to
do a better job.

Option

Description

-ftree-loop-linear

Applies some linear transformations on loops (reversal,
interchange, scaling, and skewing) that may improve cache
utilization and simplify the loops to allow other optimizations to
take place. In particular, loop interchange may change loop
nesting for loops that walk through matrices.

-ftree-loop-im

Removes loop-invariant code from inside loops. For instance, the
statement a = 30 produces the same result no matter how many times
the loop executes, so it can be moved out of the loop.

-funswitch-loops

Somewhat similar to -ftree-loop-im but for
conditionals. If there is an if() with a
loop-invariant predicate inside the loop, the code is rearranged so
that the conditional jump is done outside the loop.

-funroll-loops

One of the better-known loop transformations. As the name
indicates, if the compiler can determine how many iterations the
loop will execute, and that number is some small constant N, it
emits N copies of the loop body. This may produce faster code at
the expense of increased size. There are several associated
parameters that can be tweaked with this option [2].

Table 4. Loop optimization options

Conclusions

Compilers usually offer a staggering number of flags and options that
affect optimization behavior. GCC is no exception, so getting it to
generate the absolute best code for your application may involve some
experimentation. This article provides a survey of some of the
options that may help the most.

In most cases, obtaining the very last nanosecond of performance is
not terribly important. Optimization follows a distinct curve of
diminishing returns: most applications will thrive with the default
transformations done at -O2. Since some of the more
esoteric flags are seldom exercised, enabling them may make your code
behave erratically because of latent bugs in your application (for
example, the code may be violating language aliasing rules), though
sometimes the bug may be in the compiler itself. If the speedup is
not measurable or is insignificant, it may not be worth the pain.

An even bigger problem is the interactions between optimizing
transformations. It is not uncommon for different optimizations to
interfere or even cancel each other out, so by combining many
different flags, you may be causing more harm than good. This problem
is the current focus of active research. One of the proposed
approaches is called adaptive compilation,
which is roughly a souped up variant of profile-directed compilation.
An external tool drives the compiler through a series of different
mixtures of optimization options, until it finds a combination
producing the best code [3]. Needless to say, this can become very
expensive but it may produce pretty good results.

About the Author

Diego Novillo was born in Córdoba, Argentina. In 1993 he moved to Canada
to obtain a PhD degree at the University of Alberta, Edmonton. He
graduated with a PhD in Computing Science in 2000. He is a member of the
compiler group at Red Hat Canada. One of the main missions of this group
is to improve GCC both in developing new ports and implementing new
analyses and optimizations. He is one of the main architects of Tree SSA,
a new global optimization framework for GCC.