Improving GCC's Infrastructure
(Control Flow Graph)

This page describes ongoing work to improve GCC's infrastructure
for RTL based optimizers. The work is done on a branch in GCC's CVS
tree called cfg-branch, which is available for
experimentation. As the patches stabilize, they will be submitted for review and
acceptance into the mainline development tree. Please contact Jan Hubicka <jh@suse.cz> if you
would like to contribute.

Background & Rationale

The purpose of work on the CFG branch is to update GCC's older
optimization passes to use the modern infrastructure which was not
available when they were written. A second purpose is to continue
improving the infrastructure for RTL-based optimizations.

Currently, work concentrates on the control flow graph (CFG) data
structure, originally contributed by Richard Henderson. We would like
to convert all of the RTL-based optimizer passes to use it and keep it
accurate. This will allow us to store more information directly in
this structure, instead of using "notes" in the instruction stream
which are difficult to keep accurate during optimization.

Goals

In addition to improving the infrastructure and updating the
existing RTL-based optimization passes, we are implementing new
optimization passes. We plan first to implement optimizations that
perform code duplication, such as superblock formation and loop
peeling. We also plan to reimplement the loop unrolling pass.

Detailed Description

Some ideas that the cfg-branch will implement are known only in the
compiler research literature but are new concepts for GCC. We'd like
to explain the concepts briefly, give an overview of what we want to
achieve in GCC and describe the current status.

Loop Peeling

Theory

Loop peeling is a form of loop unrolling in which the
first iterations of a loop are unrolled: it removes
i iterations from the beginning of the loop and places
i copies of the loop body directly before the loop's
entry.
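As a source-level illustration (a hand-written sketch, not GCC output), peeling i = 2 iterations transforms a loop like this:

```c
/* Original loop: sums the first n elements of a. */
int sum_loop(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same loop with the first two iterations peeled: the peeled
   copies execute straight-line before the loop is entered. */
int sum_peeled(const int *a, int n)
{
    int s = 0;
    int i = 0;
    if (i < n) { s += a[i]; i++; }  /* peeled iteration 1 */
    if (i < n) { s += a[i]; i++; }  /* peeled iteration 2 */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

When the loop is predicted to run only one or two iterations, the peeled copies handle the common case without executing a back branch at all.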

Implementation in GCC

The loop peeling optimization will be done as part of the
superblock formation. The tracer should peel as many iterations as
can be predicted. Zdenek has also implemented simple loop peeling
as part of the new loop optimizer.

Status

The code is on the branch.

The tracer peels loops with one predicted iteration. We should try
to peel more than one iteration.

Loop optimizer re-implementation

Theory

The current loop optimizer uses information passed by the front end
to discover loop constructs, in order to simplify flow analysis.
That information is difficult to keep up to date, and nowadays it is
easy to implement loop discovery directly on the CFG.

Implementation in GCC

We want to use Michael Hayes' loop discovery code and gradually
replace the existing features of the loop optimizer with new ones.
So far Zdenek has modified the data structure to allow easier
updating, and has implemented loop unrolling, peeling and unswitching
on top of it.
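As a source-level sketch of loop unswitching (a hand-written illustration; GCC performs the transformation on RTL), a loop-invariant condition is hoisted out of the loop, leaving two specialized loop copies:

```c
/* Before: the test on 'flag' is invariant but evaluated every iteration. */
int sum_cond(const int *a, int n, int flag)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (flag)          /* loop-invariant condition */
            s += a[i];
        else
            s -= a[i];
    }
    return s;
}

/* After unswitching: the invariant test runs once, and each loop
   copy has a branch-free body. */
int sum_unswitched(const int *a, int n, int flag)
{
    int s = 0;
    if (flag)
        for (int i = 0; i < n; i++) s += a[i];
    else
        for (int i = 0; i < n; i++) s -= a[i];
    return s;
}
```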

Status

The main changes to the loop infrastructure have been merged into
mainline. The rest is being tuned on the branch.

Software Trace Cache

Theory

Software Trace Cache is a basic block layout algorithm
that also performs code duplication in the process. While optimal
code layout is an NP-complete problem, allowing a limited amount of
duplication should eliminate the common drawbacks of current
algorithms while keeping code size under control. The purpose of this
pass is to minimize the number of branches and cache misses. It uses
code duplication to avoid jumps; a trivial example is copying a return
instruction instead of jumping to it. See [6] for a
detailed description.

Implementation in GCC

The implementation will only do intraprocedural optimization, since
interprocedural optimization is currently out of reach for GCC.

Status

The code is on the branch. The exact benefits still need to be
measured, but on non-PDO runs it brings a 5% improvement on the eon
benchmark (from SPEC CPU2000).

Web Construction Pass

This pass constructs webs as commonly used for register allocation
purposes. Each web is then assigned an individual pseudo register.
This allows our register allocation pass to operate on pseudos
directly, but it also strengthens several other optimization passes,
such as CSE, the loop optimizer and the trivial dead code remover.

While future SSA passes will have the same effect on the low-level
RTL representation, we still believe this pass to be useful right
before the register allocation pass, or after code duplication passes
that need to unshare temporaries, such as the tracer or the loop
unroller.
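A source-level sketch of what a web is (a hand-written illustration, not compiler output): below, the single variable i carries two unrelated values, i.e. two webs, which the pass would rename into separate pseudos:

```c
/* Before web construction: one name, two independent live ranges. */
int before_webs(void)
{
    int i, r = 0;
    i = 10;      /* first web of i */
    r += i;
    i = 20;      /* second web of i: independent of the first */
    r += i;
    return r;
}

/* After: each web gets its own pseudo, so the register allocator and
   passes such as CSE can treat the two values independently. */
int after_webs(void)
{
    int i1 = 10, i2, r = 0;
    r += i1;
    i2 = 20;
    r += i2;
    return r;
}
```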

Implementation in GCC

The implementation of the web pass uses Michael Hayes' dataflow
module to construct du-chains, and union-find to produce the webs.

Status

The code is on the branch.

This pass improves SPECint2000 performance by roughly 1%. We need
to work on updating debug information in this pass.

Register coalescing Pass

This pass coalesces multiple registers into a single one in order to
avoid register-to-register copies, which our register allocator does
not deal with very well. It is a temporary solution until the new
register allocator is in place; at that point the benefits are
questionable, but it is still more effective (and probably less
expensive) than the current copy propagation implementation.

Implementation in GCC

It is designed as a stand-alone pass, run after GCSE, that
constructs a conflict graph and coalesces registers.

Status

The code is on the branch.

This pass improves some SPECint2000 tests and degrades others, mostly
because of the lack of live range splitting and the increased register
pressure in the final program. It helps by about one point on peak
SPECint runs.

Jump Combining Pass

Theory

This optimization is very close to instruction combining. The
compiler discovers conditionals containing common expressions, like
the patterns (a || b) and (a && b), and attempts
to create compound conditions.

When condition register arithmetic is available this is always
possible, but it is not always profitable on all architectures.
Trivial, yet common, cases can be handled. Some examples of these
conversions are: (a || b) can be converted to (a | b),
(a == b && c == d) to ((a ^ b) | (c ^ d)) == 0, and
(a == 3 || a == 4) to ((unsigned)(a - 3) < 2).
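The identities can be checked directly in C (a hand-written illustration; note that the xor form corresponds to the && combination, since ((a ^ b) | (c ^ d)) == 0 holds exactly when a == b && c == d):

```c
/* (a == b && c == d)  vs.  ((a ^ b) | (c ^ d)) == 0: xor is zero
   exactly when the operands are equal, and or-ing the two xors
   merges both tests into one compare against zero. */
int eq_pair_orig(int a, int b, int c, int d) { return a == b && c == d; }
int eq_pair_comb(int a, int b, int c, int d) { return ((a ^ b) | (c ^ d)) == 0; }

/* (a == 3 || a == 4)  vs.  (unsigned)(a - 3) < 2: subtracting the
   lower bound turns the range test into a single unsigned compare,
   since values below 3 wrap around to huge unsigned numbers. */
int range_orig(int a) { return a == 3 || a == 4; }
int range_comb(int a) { return (unsigned)(a - 3) < 2; }
```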

Implementation in GCC

This is implemented as part of the if-conversion pass. It
basically finds patterns of two jumps in the CFG that can be
combined and then attempts to apply several simplifications.
Profile data is used to control the amount of optimization performed.

Status

The code is on the branch. Benefits for Athlon CPU are about 0.5%.
The code is being reviewed for inclusion by Richard Henderson.

Thread Safe Profiling

Theory

GCC currently does not support profiling of threads.

Implementation in GCC

This is only needed for machines that have no atomic add and for
SMP machines. We currently plan to do this with per-thread counters.
At the start of each function, a function from GCC's runtime
environment will be called that returns a pointer to the current
thread's counters.
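A minimal sketch of the per-thread counter idea, assuming C11 _Thread_local storage; the names __bb_counters, __bb_fetch_counters and __bb_count are hypothetical, not GCC's actual runtime interface:

```c
#define N_COUNTERS 4

/* Each thread gets its own counter block, so plain increments are
   race-free without atomic operations. */
static _Thread_local long __bb_counters[N_COUNTERS];

/* Hypothetical runtime hook called at function entry: returns the
   calling thread's counter block. */
long *__bb_fetch_counters(void)
{
    return __bb_counters;
}

/* Instrumented code then bumps a slot in the returned block. */
void __bb_count(int i)
{
    __bb_fetch_counters()[i]++;
}
```

The per-thread blocks would then still have to be combined when the profile data is written out.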

Status

A first version has been written and installed on the branch.
Merging it into mainline is problematic and the code may be dropped.

Procedure splitting

To shorten branches and avoid instruction cache pollution, it is
desirable to place frequently executed procedures together and to
split off their infrequently executed parts.

Status

We don't split function bodies yet, but we do categorize whole
functions into different sections.
A first version has been written and installed on the branch.
It adds about 1% to SPEC2000 runs with profile feedback.

MIDlevel RTL

Theory

The current RTL representation is too low-level for a number of
optimizations that we want to implement, making their implementation
difficult and in some cases even ineffective. MidRTL will essentially
be an RTL representation of the program stripped of the mass of
architectural detail, which makes transformations of the instruction
stream much easier and so provides the basis for new optimization
passes.

Implementation in GCC

The following outline has been written by Jeff Law:

Particularly for the task of building the machine description for the generic
RTL and the translation from generic RTL into target specific RTL. Those two
(closely related) tasks are the biggest hunks of infrastructure we need.
Conceptually what we want is a md file which describes the set of generic
RTL patterns. We're not precisely sure on what predicates to use for
operands, so invent something like "generic_operand" which we will more
precisely define later. Make all operands use "generic_operand". At
least initially accept MEMs, REGs and all constants.
The only "tricky" part of "generic_operand" is memory addresses; at least
at this stage, generic_operand should accept a MEM with a "generic_address".
A "generic address" should be any "generic operand", except another MEM.
(ie we don't allow (MEM (MEM)) at this stage.) We will likely refine this
later.
Once you've got a rough framework in place, you have to build out some
infrastructure for lowering. Specifically we'll need the ability to have
two recognizers. Early in compilation we'll only recognize the generic
patterns. When we start lowering we want to recognize the target specific
expanders, patterns, splitters, etc. So you'll probably have to hack up
genrecog to build two insn-recog files with different prefixes for anything
with external scope and some infrastructure to switch between them based
on a state variable which indicates we've started the lowering process.
For the actual lowering process, we want to avoid twiddling the existing
back ends as much as possible. If I recall, the general idea Richard and
I came up with was:
For each insn:
Try to see if it matches a pattern in the target's md file. If it does,
then you're done. Similarly, if it matches a splitter, then run the
splitter, match the resulting insns in the target md file, and you're done.
Else generic.md should call into optabs.c to expand operations -- this will
in large part allow target specific lowering using existing mechanisms.
(ie, lowering of operands, running target expanders, etc etc). It's likely
we'll need to refine this some, but this is the general idea.

Status

A first version bootstraps and will hopefully be included in the branch
soon. It is far from complete, though.

Miscellaneous Patches

Achievements that have been merged into mainline already:

Easier to use CFG routines

Before branching, we reorganized and cleaned up the CFG routines
to make them easier to use. Each RTL instruction now contains a
pointer to its basic block. It is therefore easy to adjust basic
block boundary notes when instructions are reordered.

New code layout functions (file cfglayout.c)
based on the basic block reordering pass and capable of code
duplication.

A fix to loop discovery code, by Josef Zlomek.

CFG based jump threading.

Cross-jumping for computed jumps and trapping instructions.

Subprojects already merged into mainline tree in 3.2 development period

Tracer / Superblock Formation

Theory

The tracer, also known as the tail duplication or
superblock formation pass, identifies commonly executed
sequences of basic blocks, so-called traces, and duplicates
their blocks in order to avoid side entrances into those sequences,
forming so-called superblocks. Superblocks are
easier to optimize by conventional passes such as CSE and the scheduler.
In case no extra optimizations apply, our cross-jumping pass merges the
code paths again after register allocation.

Implementation in GCC

We plan to do loop peeling and superblock formation in a single
pass, as described in [3].

Enhancements that are currently only in the cfg-branch:

Avoid rebuilding the CFG for toplev.c.

Opcode heuristics tweak:
avoid predicting comparisons for equality with 0 and floating point
comparisons, to increase the hit rate of the opcode heuristic.

Flexible and Safe GCOV Format

Theory

Currently the profile data is fragile, since there is no verification
that the compiled program matches the profiled data. Since the
profiler eliminates redundancy in the data, a mismatch is often not
discovered at all, causing completely nonsensical results.

Implementation in GCC

We plan to calculate a checksum (CRC) of each graph when it is
constructed. Also the format will be extended so that more
information can be recorded, such as histograms for the number of
iterations for loops that can then be used for loop peeling and loop
unrolling.
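A rough sketch of the verification idea (the checksum function and its inputs are hypothetical; the plan is to use a real CRC over the constructed graph):

```c
/* Hypothetical: fold the words describing a function's CFG (e.g. its
   edge list) into a checksum at compile time.  The same value would
   be stored in the profile data file, so mismatched data can be
   rejected instead of silently producing nonsensical counts. */
unsigned cfg_checksum(const unsigned *words, int n)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum = sum * 31 + words[i];   /* simple stand-in for a CRC */
    return sum;
}

/* Counts are trusted only when the checksums agree. */
int profile_matches(unsigned compiled_sum, unsigned file_sum)
{
    return compiled_sum == file_sum;
}
```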

We also want to add versioning and further information to make it
look like a real file format.

There are a few references to much more advanced profiling systems
in [3].

This is done using NOTE_INSN_PREDICTION notes emitted into the
instruction stream and converted to REG_PREDICTION notes later. For
instance, the compiler can now predict branches using goto as
infrequent. It also allows the compiler to handle, and specially
predict, constructs such as returns of (negative) constants,
out-of-range code and exception handling, where the front end knows
that these are infrequent but the back end has no information about it.

We can also use it to preserve more information about high-level
loops. For instance, a loop with a continue statement looks like two
nested loops to the back end, but the optimizer may be able to estimate
that the inner (continue) loop is not executed that frequently.

Branch Status

The branch was created from the development mainline on 12 November
2001. Its CVS tag is called cfg-branch.

Contributing

Check out the cfg-branch branch with the command

cvs co -r cfg-branch gcc

then configure and build in the normal way.

Please post patches in the usual way
to the gcc-patches list, marked [cfg-branch]. As this is
a branch the usual maintainer rules do not apply. This branch is
maintained by Jan Hubicka
<jh@suse.cz>. Approval from the usual maintainers will be
needed when submitting patches from the branch for consideration for
the mainline.

For questions related to the use of GCC,
please consult these web pages and the
GCC manuals. If that fails,
the gcc-help@gcc.gnu.org
mailing list might help.
Comments on these web pages and the development of GCC are welcome on our
developer list at gcc@gcc.gnu.org.
All of our lists
have public archives.

Copyright (C)
Free Software Foundation, Inc.
Verbatim copying and distribution of this entire article is
permitted in any medium, provided this notice is preserved.