\input cyginfo @c -*-texinfo-*-
@c $Id: insn-scheduler.texi,v 1.1 1991/09/25 01:03:21 pesch Exp $
@tex
\def\$#1${{#1}} % Kluge: collect RCS revision info without $...$
\xdef\manvers{GNU Instruction Scheduler\qquad\$Revision: 1.1 $} % For use in headers, footers too
@end tex
@settitle The GNU Instruction Scheduler
@setfilename insn-sched.info
@setchapternewpage off
@finalout
@titlepage
@repno{0}
@title{The GNU Instruction Scheduler}
@subtitle June 16, 1989
@subtitle (Updated July, 1989)
@c ?? @center CS 343
@c FIXME fill in Stanford TR #. Same, or earlier draft?
This paper is also available as @*
Stanford Technical Report number ??.
@unnumbered Abstract
The GNU C compiler is a mostly portable, optimizing compiler
written by Richard Stallman and others to provide the GNU project with a
free C compiler. The GNU Instruction Scheduler is a
new pass written by the author for the GNU C compiler. This report
begins with a brief overview of the architecture of the GNU C
compiler, followed by a description of list-based instruction
scheduling. Design goals of the scheduler and accomplishments to date
are presented, followed by a section on work yet to be done. This
report concludes with an evaluation of the scheduler, its limitations
and its strengths.
@author Michael D. Tiemann
@end titlepage
@chapter The Architecture of the GNU C Compiler
The GNU C compiler exists to provide the GNU project and others with a
free C compiler. Of the roughly 65,000 lines of code implementing GNU
C, about 15,000 lines of code implement the front-end, and the remaining
50,000 lines of code implement the back end. The compiler is
surprisingly portable: most machine descriptions (with optimizations)
can be written in 2000 lines of the GNU machine description format, a
combination of Register Transfer Language (RTL) and C, plus a 1000 line
(more or less) header file. There is a little more to the story, but
the main point is that with a ratio better than 15:1 (machine
independent code to machine dependent code), the compiler is quite
portable.
Just as the back end of the compiler was designed to be portable across
different register-based machine platforms, the front-end language
processor was designed to support different languages. Both the
front-end and back-end achieve their power by implementing a set of
powerful primitives. For example, there is only one looping construct
in the front-end; this construct is called the ``loop statement.'' When
a C @code{while} or @code{for} statement is read in, it is represented
as a combination of a loop statement and a few other statement types
perhaps put around the loop statement. In the case of a @code{for},
where there is an initialization, it is represented as a compound of
``initialize'' followed by ``loop.'' The end tests built into while
and for statements are represented using one ``exit the loop''
construct. Such organization makes life easier for the programmer
writing a new front-end, but may make life more difficult for the
programmer writing an optimizer to perform optimizations based on
elaborate loop analysis.
The GNU C compiler, like most optimizing compilers, runs in many passes.
The front end builds the syntax tree bottom up, using an LALR parser
built by Bison [Corbett85]. When expression syntax grows to statement
syntax, the statement is translated from its tree representation to its
RTL representation. RTL representation looks like lisp lists; RTL
objects are self-describing, and can be printed, copied, and traversed
in much the same way that lisp lists are. The RTL representation for
instructions is the aggregation of a Unique Identifier (UID), pointers
to previous and next RTL instruction objects (making the insn list a
doubly linked list, hence efficiently supporting insertion and deletion
of other insns), the instruction pattern, the instruction code (a
memoized value which associates the instruction object with a specific
instruction template in the machine description), the instruction's
logical links (pointers to instructions which feed this instruction via
dataflow), and register notes (a list of assertions about the properties
of the registers used by this RTL instruction object).
Once the primitive instructions (insns) have been emitted in response to
requests from the front end, the passes of the back end go to work. The
first pass is jump optimization, which is powerful and fast. When first
run, its purpose is to remove dead code, so that the compiler has less
code it must slog through. After initial jump optimization has been
performed, common subexpressions are eliminated. Common subexpression
elimination (CSE) is performed for each extended basic block (a sequence
of code with a unique entry point, but possibly many exit points). If
after performing common subexpression elimination, branch statements
have been affected, jump optimization is run again, to remove any newly
discovered dead code. CSE is only half-smart about memory: it knows to
distinguish scalar variables from aggregates, and unknown addresses from
either, but its ability to understand memory references is limited. A
consequence is that writes to independent memory locations often
destroy valid common subexpressions.
Having simplified the code with jump and CSE optimizations, loop
optimizations are performed. Loop invariant code motion, loop
inversion, induction variable elimination, strength reduction, and use
of special machine constructs (such as decrement and branch) are applied
heuristically. Loop optimizations not performed include loop peeling,
loop fission and loop fusion. Loop unrolling, perhaps most important
for superscalar machines, is also not implemented. Loop optimizations
may have again affected the branching behavior of the program; if so,
jump optimization is again performed.
As discussed previously, all the different kinds of loops provided by
e.g. C are mapped to a single kind of RTL-level looping structure. This
makes some kinds of loop-based optimizations harder to do. For example,
to get better performance from a superscalar machine, the compiler
should unroll loops which do not have inter-iteration dependencies to
achieve some degree of software pipelining. To compute such
dependencies, it is very helpful to know the precise nature of the loop:
its beginning, its body, its iteration step code, and its termination
condition. As things stand right now, the loop optimizer would have to
derive all that information itself from the RTL--a non-trivial task.
Common subexpression elimination and loop optimizations can take place
without knowledge about basic blocks. The remaining optimizations
depend on basic-block level data-flow analysis. A pass computes the
basic blocks, initializing vectors which describe characteristics of
basic blocks (live registers on entry), characteristics of registers
within and across basic blocks (number of times set, number of times
referenced, live across how many instructions, live in which basic
blocks, number of calls crossed), and characteristics of insns (what
registers die, what insns are data-flow sources for this insn).
Data-flow analysis computes live variables, and hence can detect which
stores become dead; these dead stores can then be eliminated.
Having computed such detailed information, the compiler is prepared to
perform Davidson-Fraser style code optimization. The essence of this
technique is to look for two or three insns related by data-flow,
and to see whether the insns can be rewritten as a single insn which
results from performing algebraic substitution.
The following example shows how the machine description supports
an accumulate-multiply instruction. A complete description of the
machine description syntax and semantics are beyond the scope of this
paper, but the interested reader is encouraged to consult ``Using and
Porting GNU CC'' [Stallman89] for more information. If the machine
description contains, e.g., the following pattern:
@example
@group
;; FPA Add and multiply
(define_insn ""
  [(set (match_op:DF 0 "register_op" "=x,y,y")
        (mult:DF (plus:DF (match_op:DF 1 "register_op" "%xH,y,y")
                          (match_op:DF 2 "general_op" "x,y,rmF"))
                 (match_op:DF 3 "general_op" "xH,rmF,y")))]
  "TARGET_FPA"
  "*
{
  if (which_alternative == 0)
    return \"fpam%.d %2,%w1,%w3,%0\";
  return \"fpam%.d %x2,%1,%x3,%0\";
}")
@end group
@end example
@noindent
Then when the combiner comes across these two (possibly non-adjacent)
patterns:
@example
@group
(insn 78 77 79
  (set (reg:DF 99)
       (plus:DF (reg:DF 88)
                (reg:DF 89))) -1 (nil)
  (expr_list:REG_DEAD (reg 89) (nil)))

(...)

(insn 92 91 93
  (set (reg:DF 100)
       (mult:DF (reg:DF 99)
                (reg:DF 80))) -1 (insn_list 78 (nil))
  (nil))
@end group
@end example
@noindent
it can substitute the source of the set for pseudoregister 99 in insn 78
for the use of pseudoregister 99 in insn 92, leading to:
@example
@group
(note 78 77 79 "" NOTE_INSN_DELETED)
(...)
(insn 92 91 93
  (set (reg:DF 100)
       (mult:DF (plus:DF (reg:DF 88)
                         (reg:DF 89))
                (reg:DF 80))) -1 (insn_list 78 (nil))
  (expr_list:REG_DEAD (reg 89) (nil)))
@end group
@end example
@noindent
This technique can be thought of as performing peephole optimization in
data-flow space. A much later pass handles peephole optimization in the
conventional manner, by attempting to combine physically adjacent
instructions.
After instruction selection has been performed, the compiler goes about
allocating registers. Register allocation is performed in two passes.
The first pass allocates registers within basic blocks. Only registers
local to a single basic block, and which do not cross calls are
allocated in this pass. The algorithm is efficient, and does a good,
though limited, job. Register targeting is performed, which means that
if the machine needs a particular register to contain a computed value,
the register allocator tries to make the computed value wind up in the
desired register. Its strategy for doing so is simple, however, and it
sometimes fails, resulting in register shuffling.
After basic block register allocation has been performed, the compiler
allocates the remaining registers by building a conflict matrix, and
proceeds to allocate registers by graph coloring. Heuristics use a
priority function to determine which quantities should be register
allocated and which quantities should go to memory. The priority
function assigns each quantity a priority equal to the number of times
its register is used, times the logarithm of the number of insns across
which the register is live, divided by that same number of
insns. Quantities in loops have a weight
which makes them more likely to be register allocated.
It is possible that during global register allocation, temporary
registers are needed to compute an intermediate result. The register
allocator does not know how to find a ``global'' register which may
happen to be free to perform this computation, and all ``local''
registers may have been allocated by the previous phase. In such a case,
a local register must be spilled. This means throwing away all
quantities previously held by that register within that basic block;
the quantities thus spilled will have to be computed again using
different registers.
Once all registers have been allocated, the compiler can perform branch
scheduling. Not all machines have delayed branches; for machines that
do not, this pass is not run. The branch scheduler walks through basic
blocks looking for unfilled delay slots. When one is found, it searches
the basic block for instructions which could be put in its delay slot.
The branch scheduler is quite general. It was initially designed to
support a machine with zero to three delay slots, with or without
nullification. Adapting it to schedule only one delay slot is
straightforward.
A final jump optimization combined with peephole optimization is run.
During this jump optimization phase, cross jump optimizations are
performed. The peephole optimizer runs concurrently, and recognizes
machine idioms which are difficult or impossible to represent using
data-flow relationships.
Finally, assembly code is produced. At this time, redundant conditional
tests are removed, and other machine-specific optimizations can be
performed by the assembly output routines themselves.
@chapter List-based Instruction Scheduling
Instruction scheduling is a technique to exploit instruction level
parallelism exposed by pipelined machines. The idea behind list-based
scheduling is to simultaneously make the critical path of computation
run as quickly as possible while maximizing parallel instruction
execution. Scheduling only the critical path of a computation (critical
path scheduling) does not account for instructions which are not
immediately on the critical path, but were they to execute, would allow
more instructions to be executed in parallel. Scheduling to achieve
maximum parallelism is also not good, because it can very quickly use
more registers than most machines have available, and because the
schedule does not take into account an important performance constraint:
the critical path. A later example details these problems. List
scheduling is a simple way to balance these two approaches, and usually
outperforms both.
@tex
The algorithm for list scheduling is as follows: starting with a basic
block, compute instruction priorities. An instruction with priority
$P1$ must be scheduled to execute later than any instruction with
priority $P2$ if $P1 > P2$.
@end tex
An instruction I1 which depends on an instruction I2 cannot execute
before I2. By giving I1 a larger priority number than I2, and always
placing higher-priority instructions later in the schedule than
lower-priority ones, we preserve correct execution order.
Instructions which take more time to compute are also given larger
priority numbers. This has the effect of exposing the critical path:
the instruction with the largest priority number is the last
instruction on the critical path. After computing all priorities, the
scheduler marks all instructions on which no instructions depend
``ready''. These instructions are then sorted by priority order. The
last instruction in the basic block is always on the ready list, but it
need not always have the highest priority number.
The algorithm then starts scheduling by issuing the instruction with the
highest priority. This procedure builds the instruction block
from last instruction to first. Each time we schedule an instruction,
we check to see whether instructions on which it depended now have no
more dependers. For each such instruction, we can mark it ``ready'',
and add it to the ready list in priority order. When all instructions
have been scheduled, the algorithm terminates, and the next block, if
there is one, can be scheduled.
In a vacuum, the implementation of instruction scheduling is quite
straightforward, amounting to no more than performing a reverse
topological sort on instruction dependencies (or a topological sort on
anti-dependencies, depending on one's point of view), adding weights to
break ties between some topologically equivalent instructions. The
first implementation of list scheduling for the GNU C compiler was
essentially this algorithm, but the results were not very good:
scheduling a matrix multiply program in which the inner multiply loop
had been unrolled four times reduced execution speed by nearly a factor
of two!
What went wrong?
@chapter Smarter List Scheduling
The problem with the list scheduling algorithm presented in the previous
section is that it is too simple. For example, when an add instruction
can be scheduled, two multiply instructions on which it depends may
become ready. They may have identical priorities, which means that both
multiplications will be performed very close to one another, which is
desirable if the multiply unit is pipelined. Each of these multiply
instructions may depend on some memory read instructions, which may all
have the same priority. All the memory read instructions will then be
scheduled very close to one another. Now each memory read may depend on
instructions which compute memory addresses, all of which may have the
same priorities. If this loop is unrolled four times, the schedule then
becomes: compute eight memory addresses, perform eight memory reads,
perform four multiplies, perform two additions, and finally perform one
addition. If the machine has enough registers and
functional units, this is a great schedule. If not, then register
spills and interlocks take their toll.
To handle this problem, the priority function for certain instructions
was modified as follows. When the scheduler chooses an instruction in
which a register dies, the scheduler becomes sensitive to the
instruction where that register was born. As soon as that instruction
becomes ``ready'' (i.e., it has no further dependers), its priority is
increased so that it will be the next instruction scheduled. The effect
of this modification is to reduce register lives without greatly
affecting parallelism.
Unfortunately, though this modification halved the number of register
spills introduced by the scheduler, it did not eliminate all of them.
This was because when several register-birthing instructions became
ready simultaneously, they all had the same high priority, and hence
could still interfere with one another. The solution to this problem
was to introduce a gradient on the priorities: the first birthing
instruction got highest priority, and subsequent birthing instructions
got lower priorities. This caused the scheduler to generate the
schedule ``compute, load, compute, load, ...'' instead of ``compute,
compute, ..., load, load, ...'', and all spills were eliminated.
@chapter Design Goals
As has been stated before, the GNU C compiler is a mostly portable
compiler. To maintain that property, it was essential to design the
instruction scheduler to be as machine-independent as possible. The GNU
C compiler can generate optimized code which retains symbol-table and
debugging information; this is a very nice feature of that compiler, and
one which should be compromised as little as possible in the presence of
instruction scheduling. Finally, the GNU C compiler generates optimized
code very quickly. For example, it can generate better code than
industry-standard AT&T compilers, yet do so about three times faster.
While compilation time should be a secondary criterion when evaluating
optimizing compilers, it would be nice to keep the compiler fast.
The GNU C compiler makes a number of simplifying assumptions which
inhibit instruction scheduling freedom. For example, it is assumed that
any instruction which tests condition codes immediately follows the
instruction which set them. Similarly, it is assumed that before
register allocation has been performed, machine registers (not pseudo
registers) are only used for specific purposes, and the compiler takes
shortcuts based on such assumptions. For example, when performing a
subroutine call on a machine which passes arguments in registers, it is
assumed that no instructions intervene between the loading of the
parameter registers and the use of the function return value.
A final design goal of the instruction scheduler was to design it to be
easily understood and improved upon. If it is to gain wide acceptance
and use across many machines, its design must be simple and robust. For
example, it should be possible to bootstrap the compiler with
instruction scheduling enabled in a relatively short amount of time.
It should also be cooperative with other passes of the compiler. A
scheduler which only ``just works'' is not good enough; it must be
sufficiently general and robust that others can modify other passes of
the compiler, and their modifications will work, or will fail to work in
easily understood, or at least documented ways.
@chapter Implementation
The optimizations performed by an instruction scheduler are necessarily
machine specific. Therefore, to design a machine-independent
instruction scheduler, it was necessary to abstract away as much of the
machine as possible. This was done by using a program to build
encapsulations of machine-dependent information that could be carried by
instructions. The essential information used by the instruction
scheduler presented here is the number of cycles an instruction expects
to execute, and the resources used by the instruction during its
execution. At such a level of abstraction, the scheduler is
sufficiently machine-independent.
Preserving debugging information is a hard problem, especially when
instructions are moved across binding contours, e.g.,
@example
@group
Original Code                 Scheduled Code

@{                            @{
  double i = 3.14;              double i(1) = 3.14;
  int b;                        int b(1);
  @{                            @{
    int i = foo ();               int i(2) = foo ();
    int *p;                       int *p(1);
    b *= i;                       i(1) *= 2;
    *p -= i;                      b(1) *= i(2);
  @}                            @}
  b -= 2;                       *p(1) -= i(2);
  i *= 2;                       b(1) -= 2;
@}                            @}
@end group
@end example
@noindent
The problem is, how can the debugger maintain correspondence between the
binding contours of the program and the instructions in the program if
instructions can move freely across them? I do not have a good answer
for this, but I have preserved all debugging information when
instructions do not move across binding contours. It should be pointed
out that most binding contours appear at program control points, such as
loops and conditionals. Since instructions are only scheduled within a
basic block, the distance an instruction can travel is limited to remain
within the control construct, and hence the binding contour.
The instruction scheduler that I implemented is currently one of the
faster passes of the compiler. Mysteriously, the compiler is sometimes
faster when performing instruction scheduling than when not. From
profiling information, it appears that scheduled instructions utilize
registers in a way that simplifies the task of local register
allocation. Time spent scheduling is often offset by time saved in
register allocation. For such a fast instruction scheduler, the
performance is pretty good. On a Sun 4/110, where loads and stores take
three cycles when an interlock must be used, or two cycles if not, a
large application (GNU CC itself) shows a speedup of 8%. Assuming that
loads and stores constitute 25% of the instruction mix [Hennessey88],
the maximum performance improvement achievable from just scheduling
loads and stores is 0.25 (frequency) * 0.50 (speedup) = 0.125, or 12.5%.
On the Sun 4/110, floating point operations are not pipelined, so there
is no benefit to be gained there. Thus 8% is not bad, and shows that
good instruction schedulers can be made fast.
The instruction scheduler was implemented to write out data dependency
information to assist the compiler writer in analyzing its decisions.
Instructions which are ready to execute, instruction priorities, stalls,
and successfully filled delay slots are all printed prior to dumping
the function's RTL. It is hoped that this information will be helpful
to people who are interested in experimenting with, or measuring the
performance of, the priority function.
Alas, the instruction scheduler and the branch scheduler do not yet work
together. The main problem is caused by the fact that in order to get
the branch scheduler to use instructions which come after the branch,
data-flow information is needed. Unfortunately, the GNU C compiler does
not keep its data flow information up to date as the compiler
progresses. Specifically, optimizations such as cross-jumping
completely change the landscape of the code, altering basic blocks in
ways not
accounted for. I was able to reconstruct most of the data needed to
still do a decent job filling branch delay slots, but in so doing,
introduced new data dependencies which violated assumptions made by the
instruction scheduler. The compiler has been bootstrapped with
instruction scheduling and branch scheduling as mutually exclusive
options, but it does not yet work with both options enabled at once.
@chapter Performance Analysis
It is interesting to compare the speedup gained by instruction
scheduling as compared with that gained by branch scheduling. While
many people do not make the distinction between the two, there is an
important difference which should be recognized. Instruction scheduling
has to do with data dependencies, while branch scheduling has to do with
control dependencies. List scheduling is an algorithm which attacks the
problem of the many alternatives possible due to data parallelism. For
scalar and superscalar machines, control parallelism is converted to
data parallelism (by e.g. loop unrolling), but there still are slots to
fill in control transfer instructions. Algorithms for filling these
slots can be much more aggressive about finding instructions to fill
these slots than one can be with data-parallel instructions, especially
if the control constructs can annul speculative computation. The GNU C
compiler had a primitive branch scheduler which worked by finding
instructions which had delay slots, then looking for an earlier
instruction which could be placed in the delay slot. This simple
strategy always improves code when it applies; the problem is that it
does not apply with great frequency.
I extended the branch scheduler by adding a second pass to fill slots
left previously unfilled (by the original branch scheduler) with
instructions taken either from the target or following the branch.
Strategies and tradeoffs for ``eager'' branch scheduling can be found in
[Hennessey88]. After making this work (which was harder than
anticipated), I compared the performance of an application (the compiler
itself) compiled with instruction scheduling versus that same
application compiled with branch scheduling.
Under test, branch scheduling showed an 18% performance improvement,
while instruction scheduling showed only about 8%. Why such a large
difference if there are about as many branch delays to fill as there are
load and store delays? Without precise measurements, only an
approximate answer can be given. First, the branch scheduler reduces
the cost of a branch instruction from two cycles to one, where the
instruction scheduler reduces the cost of loads and stores from three
cycles to two. Second, many memory delay slots may be filled just by
luck, say about one fourth of the time. Such accounting would make a
perfectly efficient instruction scheduler give a performance improvement
of 14%, versus 18% for a perfectly efficient branch scheduler. The
branch scheduler often separates loads from
uses across the branch instruction, which is pure fortuity. Therefore,
in the estimated scheduler performance, the cost of memory operations is
somewhat reduced. Also, the compiler has some ability to fill branch
delay slots using peephole optimizations, independent of the branch
scheduler, hence in the second table, branch delay is estimated to be on
average 1.9 cycles per instruction.
@example
@group
                Perfect Schedulers

     CPI         unscheduled  i-scheduled  b-scheduled
 25  load/store  2.8  ~/insn  2.0  ~/insn  2.7  ~/insn
 25  branch      1.9  ~/insn  1.9  ~/insn  1.0  ~/insn
 50  ALU         1.0  ~/insn  1.0  ~/insn  1.0  ~/insn
100  insns       1.68 ~/insn  1.48 ~/insn  1.42 ~/insn

Performance improvement: 1.00     1.14         1.18

               Estimated Schedulers

     CPI         unscheduled  i-scheduled  b-scheduled
 25  load/store  2.8  ~/insn  2.3  ~/insn  2.7  ~/insn
 25  branch      1.9  ~/insn  1.9  ~/insn  1.2  ~/insn
 50  ALU         1.0  ~/insn  1.0  ~/insn  1.0  ~/insn
100  insns       1.68 ~/insn  1.55 ~/insn  1.47 ~/insn

Performance improvement: 1.00     1.08         1.14
@end group
@end example
@noindent
If the analysis is accurate, the best performance improvement one can
expect to see on a Sun4/110 using GNU C's scheduling optimizers would be
34%; using the estimates in the second table, a 23% performance
improvement can be expected using both schedulers.
@chapter Extensions and Future Work
Before talking about future work, some past work should be presented.
That work should give the reader some context and understanding of how
to approach doing future work should s/he be so inclined.
There are two major aspects of the instruction scheduler: the work done
for the scheduler and the work done by the scheduler. The work done for
the scheduler involved modifying the data flow analysis pass of the
compiler to compute dependency information. Initially,
inter-instruction dependencies were only recorded between the setting of
a register and its first use. This had to be modified so that all uses
depended on the preceding set, and that following sets depended on the
preceding set as well. Hard registers and function calls complicated
matters further: dependencies had to be added to preserve call order,
and to ensure that parameter passing for one function was not
interleaved with parameter passing for another. Adjustments to the
stack pointer rounded out the kind of operations about which the
scheduler had to be cautious.
Live register information is often represented in compilers by
bitvectors, and GNU C is no exception. The total space consumed by bits
is the number of basic blocks times the number of pseudoregisters used
by the function. Originally, flow analysis only needed to allocate from
the heap the bits recording the registers live on entry to a basic block.
It could allocate the bits for registers live on exit from a basic block
from the stack. When performing instruction scheduling, both must be
allocated from the heap, because accurate live register information must
be maintained after scheduling for register allocation.
The additional overhead of adding all these dependencies and extra
bitvectors slows down flow analysis, already one of the slowest passes
in the compiler, by about 10%.
Collecting the dependencies was a straightforward task, but keeping the
information accurate was not so simple. When two instructions are
combined to make a new instruction, flow information must be propagated
from the instructions disappearing to the new instruction. Consider a
multiply instruction I1 and an add instruction I2 becoming a
multiply-accumulate instruction. If a third instruction, I3, had to
execute after I1, how will it know that it must now execute after I2?
To keep the scheduler's memory utilization within reasonable limits, I
utilized transitive dependencies whenever possible. This meant that
after the combiner ran, the dependencies could be in shambles. It is
not desirable to punt and re-run flow analysis when interesting
instruction reorganizations are performed, since this would have to be
done twice--both before and after instruction scheduling, slowing the
compiler by an additional 40%. Instead, the combiner and the scheduler
were modified to incrementally fix most dependency information that they
broke. The additional time spent by the combiner doing these fixups is
virtually unmeasurable. For the scheduler, the additional time is also
quite small, though the solution is much trickier.
The scheduler has to find the new instructions made by the combiner
based on dependencies pointing to deleted instructions. In many cases
the deleted instruction became part of the current instruction, so no
dependency is needed at all. The remaining cases involve searching for
the merged instruction. The search uses a table to record all
dependency information derived while searching for some particular
dependence. As the table is built, searches become shorter.
The instruction scheduler runs before register allocation. This gives
the scheduler much more flexibility in moving instructions around, but
it can lead to worse register allocation. For example, nothing
prevents the scheduler from moving instructions across calls. Such code
motion may be detrimental if moving the instruction causes a register
not previously live across a call to become so. Running the scheduler
before register allocation can also result in incomplete instruction
scheduling: if a quantity fails to be register allocated (i.e., is spilled
to memory), the memory access may affect the schedule. It would be
interesting to design an incremental scheduler which could take care of
such problems without rescheduling the whole block.
To save time, the scheduler does nothing if the instructions are already
in their priority order. The scheduler could look for other blocks with
which to merge in order to improve the schedules of both blocks. Selecting
the blocks with which to merge, and setting thresholds on code
duplication both probably need good heuristics. Merging would also mean
extending the incremental life analysis performed by the scheduler.
The scheduler has a very simple model of memory accesses. Three
functions, @code{true_dependence}, @code{anti_dependence}, and
@code{output_dependence}, are predicate functions which return the truth
value of the relationship between two memory references for true,
anti, and output dependencies, respectively. They are currently
implemented by simply returning @code{TRUE} for all references. Making
these functions return more accurate answers would give the scheduler
more freedom to schedule, since there would be fewer dependencies acting
as constraints.
The loop optimization pass runs well before the scheduling pass, so
the scheduler has no input on how loops should be unrolled. It would be
interesting to see how one can reconcile the independence of these two
interdependent passes.
Some machines do not have primitive support for basic algebraic
operations such as multiplication. When this is the case, the compiler
writer must trade off exposing the machine's weakness against hiding the
actual machine implementation. For example, for a machine
without multiplication, the compiler can implement multiplication by
constants as a sequence of shifts and additions. If multiplication is
transformed into shifts and adds too early, the algebraic simplifiers
may miss some simplifications, whereas if it is transformed too late,
the sequence of instructions will not be available for instruction
scheduling.
To make good use of instruction scheduling, basic blocks should be large
enough to expose a good deal of instruction level parallelism. This
generally means performing procedure integration and loop unrolling.
The mechanics of implementing these methods in GNU C are quite simple; in
fact, procedure integration is already done. The main difficulty is
deciding which procedure to integrate, or deciding which loop to unroll
and how many times to do so. Good loop unrolling is probably the next
important optimization that should be implemented in the GNU C compiler.
For the basic blocks where there is a fair amount of instruction level
parallelism, there is still about a 50-50 chance that the scheduler will
build a worse schedule rather than a better one. With the many
registers available on a Sparc (due to its use of register windows), the
downside is often not seen, but here is an example which exposes it:
@example
static void
decode (shorts, low)
     short *shorts;
     int *low;
@{
*low = (shorts[3] << 24)
| (shorts[2] << 16)
| (shorts[1] << 8)
| shorts[0];
@}
@end example
The GNU C compiler will, on its own, generate nice code for this, while
the scheduler will try to do too much in parallel, and get lost:
@example
Unscheduled             Scheduled
ldsh [%o0+6],%o3        add %sp,-88,%sp
sll %o3,24,%o3          st %l0,[%sp+80+0]
ldsh [%o0+4],%o4        ldsh [%o0+2],%o4
sll %o4,16,%o4          ldsh [%o0],%o5
or %o3,%o4,%o3          sll %o4,8,%o4
ldsh [%o0+2],%o4        ldsh [%o0+4],%l0
sll %o4,8,%o4           ldsh [%o0+6],%o3
or %o3,%o4,%o3          sll %l0,16,%l0
ldsh [%o0],%o4          sll %o3,24,%o3
or %o3,%o4,%o3          or %o3,%l0,%o3
st %o3,[%o1]            or %o3,%o4,%o3
retl                    or %o3,%o5,%o3
nop                     st %o3,[%o1]
                        ld [%sp+80+0],%l0
                        retl
                        add %sp,88,%sp
@end example
@noindent
The unscheduled code has stalls due to interlocks, but the scheduled
code executes an extra load, an extra store, and a stack manipulation
operation, resulting in slower overall code. It would be nice to see
future work improve the priority assignment function so that the
scheduler makes only code-improving transformations.
Scheduling across basic blocks is another way to achieve more
parallelism. This technique was not attempted for this project, but
should extend from this work in a straightforward manner. As with loop
unrolling and procedure integration, cross-basic-block scheduling can
result in code explosion. The heuristics of the GNU C compiler perform
their optimizations based only on the code being compiled. While they
tend to do a good job, a number of these optimizations could benefit
from profiling information. In particular, profile information should
be used to guide cross-basic-block instruction scheduling.
The instruction scheduler should be modified to work after register
allocation. Register spills become loads and stores, and can change the
characteristics of the instruction stream enough to warrant
rescheduling. It would be trivial to run the scheduler after register
allocation, just to take care of the cases of register spills. What
would be more interesting is to modify the scheduler and/or the register
allocator to note the cause of the spill. If the spill occurred because a
register's life was wantonly extended (perhaps across a call or two),
the scheduler should attempt to shorten that register's life when next
it builds its schedule. Live-range splitting is not implemented in GNU
C, but it would be logical to have the live-range splitter and the
scheduler coordinate their actions.
@chapter Conclusion
The difficult work of interfacing an instruction scheduler to the GNU C
compiler has been performed. The compiler has been successfully
bootstrapped using the instruction scheduler presented here, and an 8%
performance improvement was observed. The compiler has also been
successfully bootstrapped using a branch scheduler, resulting in a 20%
performance improvement. Work on making the two cooperate is almost
complete; when this is accomplished (expected in the next few days) the
opportunity will exist for others to make scheduling-related
improvements, such as those suggested in the previous section, without
having to worry about too many other compiler bookkeeping details.
The GNU C compiler has been a willing partner in this undertaking. It
showed fine sportsmanship in permitting itself to be bootstrapped with
the two schedulers presented here. It was a disappointment that it did
not work with both schedulers enabled, but it is expected to do so soon.
@page
@unnumbered Appendix
@example
Here are the numbers for GCC compiled without GCC instruction scheduling:
Cold Warm
time in parse: 4.750000 time in parse: 4.490000
time in integration: 0.000000 time in integration: 0.000000
time in jump: 0.970000 time in jump: 1.000000
time in cse: 4.600000 time in cse: 4.310000
time in loop: 1.030000 time in loop: 1.050000
time in flow: 5.140000 time in flow: 5.560000
time in combine: 1.470000 time in combine: 1.430000
time in sched: 0.940000 time in sched: 0.890000
time in local-alloc: 3.520000 time in local-alloc: 3.400000
time in global-alloc: 3.880000 time in global-alloc: 3.870000
time in dbranch: 0.000000 time in dbranch: 0.000000
time in final: 1.910000 time in final: 1.830000
time in varconst: 0.060000 time in varconst: 0.030000
time in symout: 0.000000 time in symout: 0.000000
time in dump: 0.000000 time in dump: 0.000000
28.1u 0.5s 0:32 88% 0+1232k 20+21io 89pf+0w
27.8u 0.4s 0:30 91% 0+1240k 3+18io 4pf+0w
With GCC instruction scheduling:
time in parse: 4.470000 time in parse: 4.500000
time in integration: 0.000000 time in integration: 0.000000
time in jump: 0.830000 time in jump: 0.850000
time in cse: 4.080000 time in cse: 4.090000
time in loop: 0.920000 time in loop: 0.940000
time in flow: 5.510000 time in flow: 4.720000
time in combine: 1.420000 time in combine: 1.370000
time in sched: 0.860000 time in sched: 0.920000
time in local-alloc: 3.290000 time in local-alloc: 2.990000
time in global-alloc: 3.650000 time in global-alloc: 3.630000
time in dbranch: 0.000000 time in dbranch: 0.000000
time in final: 1.800000 time in final: 1.810000
time in varconst: 0.010000 time in varconst: 0.040000
time in symout: 0.000000 time in symout: 0.000000
time in dump: 0.000000 time in dump: 0.000000
26.6u 0.5s 0:30 88% 0+1256k 0+18io 6pf+0w
25.8u 0.3s 0:28 93% 0+1256k 0+18io 0pf+0w
Performance difference: 27.8 / 25.8 = 7.8%
@end example
@page
@example
With GCC branch scheduling:
time in parse: 4.580000 time in parse: 4.100000
time in integration: 0.000000 time in integration: 0.000000
time in jump: 0.860000 time in jump: 0.740000
time in cse: 3.980000 time in cse: 4.070000
time in loop: 0.940000 time in loop: 0.840000
time in flow: 5.630000 time in flow: 4.250000
time in combine: 1.340000 time in combine: 1.420000
time in sched: 0.000000 time in sched: 0.000000
time in local-alloc: 2.980000 time in local-alloc: 2.920000
time in global-alloc: 3.430000 time in global-alloc: 3.490000
time in dbranch: 0.000000 time in dbranch: 0.000000
time in final: 1.880000 time in final: 1.770000
time in varconst: 0.020000 time in varconst: 0.020000
time in symout: 0.000000 time in symout: 0.000000
time in dump: 0.000000 time in dump: 0.000000
25.4u 0.5s 0:33 77% 0+1184k 20+21io 89pf+0w
23.5u 0.4s 0:26 89% 0+1184k 0+19io 0pf+0w
Performance difference: 27.8 / 23.5 = 18.3%
@end example
@page
@unnumbered Updated Results
The paper was written under strict time constraints. The instruction
and branch schedulers were successfully merged in July 1989. At that
time, performance improvement was approximately 25% for the GNU C
benchmark. The following shows the performance comparison.
@example
With GCC branch and instruction scheduling:
time in parse: 4.370000 time in parse: 4.270000
time in integration: 0.000000 time in integration: 0.000000
time in jump: 0.930000 time in jump: 0.900000
time in cse: 3.900000 time in cse: 3.760000
time in loop: 0.860000 time in loop: 0.840000
time in flow: 4.060000 time in flow: 3.960000
time in combine: 1.260000 time in combine: 1.320000
time in sched: 0.000000 time in sched: 0.000000
time in local-alloc: 2.820000 time in local-alloc: 2.810000
time in global-alloc: 3.350000 time in global-alloc: 3.440000
time in dbranch: 0.000000 time in dbranch: 0.000000
time in final: 1.650000 time in final: 1.670000
time in varconst: 0.040000 time in varconst: 0.050000
time in symout: 0.000000 time in symout: 0.000000
time in dump: 0.000000 time in dump: 0.000000
23.0u 0.5s 0:27 86% 0+1176k 22+22io 90pf+0w
22.9u 0.4s 0:24 96% 0+1184k 1+19io 0pf+0w
Performance difference: 27.8 / 22.9 = 21.4%
@end example
@page
@example
GCC Compiled by Sun CC (-O2, SunOS 4.0.1 distribution)
time in parse: 5.910000 time in parse: 5.100000
time in integration: 0.000000 time in integration: 0.000000
time in jump: 0.940000 time in jump: 0.870000
time in cse: 5.130000 time in cse: 5.300000
time in loop: 1.000000 time in loop: 1.010000
time in flow: 4.620000 time in flow: 4.620000
time in combine: 1.380000 time in combine: 1.310000
time in sched: 0.000000 time in sched: 0.000000
time in local-alloc: 4.640000 time in local-alloc: 5.070000
time in global-alloc: 3.720000 time in global-alloc: 3.700000
time in dbranch: 0.000000 time in dbranch: 0.000000
time in final: 1.980000 time in final: 1.850000
time in varconst: 0.040000 time in varconst: 0.050000
time in symout: 0.000000 time in symout: 0.000000
time in dump: 0.000000 time in dump: 0.000000
29.2u 0.5s 0:32 90% 0+1416k 6+21io 88pf+0w
28.8u 0.3s 0:30 96% 0+1424k 1+19io 0pf+0w
Performance difference: 28.8 / 22.9 = 25.8%
@end example
@bye