Document transcript

This chapter deals with real-life examples of compilers. For each compiler, this scribe will discuss three subjects:



- A brief history of the compiler.

- The structure of the compiler, with emphasis on the back end.

- Optimizations performed on two programs. It must be noted that this test can't be used to measure and compare the performance of the compilers.

The compilers examined are the following:



- Sun compilers for SPARC Versions 8 and 9.

- IBM XL compilers for the POWER and PowerPC architectures. POWER and PowerPC are classes of architectures; the processors are sold in different configurations.

- Digital compiler for Alpha. (The Alpha processor was bought by Intel.)

- Intel reference compiler for the 386.

Historically, compilers were built for specific processors. Today this is no longer a given: companies use other developers' compilers. For example, Intel uses IBM's compiler for the Pentium processor.

The compilers will compile two programs: a C program and a Fortran 77 program.

The C program:

int length, width, radius;
enum figure {RECTANGLE, CIRCLE};

main()
{
    int area = 0, volume = 0, height;
    enum figure kind = RECTANGLE;

    for (height = 0; height < 10; height++)
    {
        if (kind == RECTANGLE) {
            area += length * width;
            volume += length * width * height;
        }
        else if (kind == CIRCLE) {
            area += 3.14 * radius * radius;
            volume += 3.14 * radius * height;
        }
    }
    process(area, volume);
}

Possible optimizations:

1. The value of 'kind' is constant and equal to RECTANGLE. Therefore the 'else' branch is dead code, and the first 'if' test is also redundant.

2. 'length * width' is loop-invariant and can be computed before the loop.

3. Because 'length * width' is loop-invariant, 'area' can be computed with a single multiplication: 10 * length * width.

4. The calculation of 'volume' in the loop can be done using addition instead of multiplication.

5. The call to 'process()' is a tail call. This fact can be used to avoid creating a stack frame.

6. Compilers will probably use loop unrolling to increase pipeline utilization.

Note: Without the call to 'process()', all the code is dead, because 'area' and 'volume' aren't used.
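To make optimizations 1-4 concrete, here is a hand-written C sketch of roughly what the transformed program computes. The stand-in process() body and the names original/optimized are invented for this illustration; this is not actual compiler output.

```c
int length, width;

/* Hypothetical stand-in for the external process() routine, so the
 * results can be observed. */
static int saved_area, saved_volume;
static void process(int area, int volume)
{
    saved_area = area;
    saved_volume = volume;
}

/* The original loop, reduced to the live RECTANGLE path. */
static void original(void)
{
    int area = 0, volume = 0, height;
    for (height = 0; height < 10; height++) {
        area += length * width;
        volume += length * width * height;
    }
    process(area, volume);
}

/* After dead-code elimination (1), loop-invariant code motion (2),
 * reduction of 'area' to a single multiplication (3), and strength
 * reduction of the 'volume' update (4). */
static void optimized(void)
{
    int lw = length * width;        /* hoisted loop invariant */
    int area = 10 * lw;             /* 10 iterations, one multiply */
    int volume = 0, term = 0, height;
    for (height = 0; height < 10; height++) {
        volume += term;             /* term == lw * height */
        term += lw;                 /* addition replaces multiplication */
    }
    process(area, volume);
}
```

Both versions leave the same values in saved_area and saved_volume; the optimized one performs two multiplications in total instead of two per iteration.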

The Fortran 77 program:

      integer a(500, 500), k, l
      do 20 k = 1, 500
        do 20 l = 1, 500
          a(k, l) = k + l
20    continue
      call s1(a, 500)
      end

      subroutine s1(a, n)
      integer a(500, 500), n
      do 100 i = 1, n
        do 100 j = i + 1, n
          do 100 k = 1, n
            l = a(k, i)
            m = a(k, j)
            a(k, j) = l + m
100   continue
      end

Possible optimizations:

1. The address of a(k,j) is calculated twice. This can be prevented by common-subexpression elimination.

2. The call to 's1' is a tail call. Because the compiler has the source of 's1', it can be inlined into the main program, which enables further optimization of the resulting code. Most compilers will leave the original copy of 's1' intact.

3. If the procedure is not inlined, interprocedural constant propagation can be used to discover that 'n' is a constant equal to 500.

4. Accesses to 'a' are computed using multiplication. This can be avoided using addition: the compiler "knows" how the array is laid out in memory (in Fortran, arrays are stored in column-major order), so it can add the correct number of bytes to the address each time instead of recomputing it.

5. After 4, the loop counters aren't needed, and the conditions in the loop can be replaced by tests on the address. That's done using linear-function test replacement.

6. Again, loop unrolling will be used, according to the architecture.
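Optimizations 4 and 5 can be sketched in C. The Fortran array is modeled here as a column-major block; the function names and the small size N are invented for this illustration, not taken from the compilers' output.

```c
#define N 8  /* small stand-in for the 500 in the example */

/* Column-major "Fortran-style" array a(N, N), stored in one block. */
static int a[N * N];

/* Naive addressing: every access recomputes row + col * N. */
static long sum_column_indexed(int col)
{
    long s = 0;
    int row;
    for (row = 0; row < N; row++)
        s += a[row + col * N];   /* multiplication on every iteration */
    return s;
}

/* After strength reduction (4) and linear-function test replacement (5):
 * the address advances by addition, and the loop test compares the
 * pointer against a precomputed end address instead of a counter. */
static long sum_column_pointer(int col)
{
    long s = 0;
    int *p = &a[col * N];
    int *end = p + N;            /* loop bound folded into an address */
    for (; p != end; p++)        /* test on the address, no counter */
        s += *p;
    return s;
}
```

In column-major order a whole column is contiguous, which is exactly what makes the pointer walk legal here.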

Sun SPARC

The SPARC architecture

SPARC has two major versions of the architecture, Version 8 and Version 9.

SPARC Version 8 has the following features:

- A 32-bit superscalar, pipelined RISC system.

- Integer and floating-point units.

- The integer unit has a set of 32-bit general registers and executes load, store, arithmetic, logical, shift, branch, call and system-control instructions. It also computes addresses (register + register or register + displacement). There are 8 global general-purpose integer registers in the integer unit; the first has a constant value of zero (r0 = 0).

- Three-address instructions of the form: Instruction Src1, Src2, Result.

- Several overlapping register windows of 24 registers each, spilled by the OS. These are used to save work on procedure calls: when there aren't enough registers, the processor raises an interrupt and the OS handles saving the registers to memory and refilling them with the necessary values.

SPARC 9 is a 64-bit version, fully upward-compatible with Version 8.

The assembly language guide is on pages 748-749 of the course book, tables A.1, A.2 and A.3.

The SPARC compilers

General

Sun SPARC compilers originated from the Berkeley 4.2 BSD UNIX software distribution and have been developed at Sun since 1982. The original back end was for the Motorola 68010 and was migrated successively to later members of the M68000 family and then to SPARC. Work on global optimization began in 1984, and on interprocedural optimization and parallelization in 1989. The optimizer is organized as a mixed model. Today Sun provides front ends, and thus compilers, for C, C++, Fortran 77 and Pascal.

The structure

The four compilers (C, C++, Fortran 77 and Pascal) share the same back end. The front ends produce Sun IR, an intermediate representation discussed later. The back end consists of two parts:

- yabe, "Yet Another Back End", which creates relocatable code without optimization.

- An optimizer.

The optimizer is divided into the following components:

- The automatic inliner. This part works only at optimization level O4 (discussed later). It replaces some calls to routines within the same compilation unit with inline copies of the routines' bodies. Next, tail-recursion elimination is performed and other tail calls are marked for the code generator to optimize.

- The aliaser. The aliaser uses information provided by the language-specific front end to determine which sets of variables may, at some point in the procedure, map to the same memory location. The aliaser's aggressiveness is determined by the optimization level. Aliasing information is attached to each triple that requires it, for use by the global optimizer.

- IRopt, the global optimizer.

- The code generator.
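A small invented C example of why the aliaser's information matters: if the compiler cannot prove that two names never map to the same memory location, it must reload a value after any store through a possibly-aliasing pointer, instead of keeping it in a register.

```c
/* Without alias information the compiler must assume *p may overwrite
 * *x, so the value of *x cannot be cached in a register across the
 * store through p; it has to be reloaded for the return. */
static int blocked(int *x, int *p)
{
    *x = 1;
    *p = 2;          /* may or may not alias *x */
    return *x;       /* 1 if p does not alias x, 2 if it does */
}
```

If the aliaser can prove that x and p never alias, the reload disappears and the function can simply return the constant 1.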

[Diagram: the front end emits Sun IR; yabe translates Sun IR directly to relocatable code without optimization, while the optimizing path runs the automatic inliner, the aliaser, iropt (global optimization) and the code generator to produce optimized relocatable code.]

The Sun IR

The Sun IR represents a program as a linked list of triples representing executable operations, plus several tables representing declarative information. For example:

ENTRY "s1_" {IS_EXT_ENTRY, ENTRY_IS_GLOBAL}
        GOTO LAB_32
LAB_32: LTEMP.1 = (.n {ACCESS V41});
        i = 1;
        CBRANCH(i <= LTEMP.1, 1: LAB_36, 0: LAB_35);
LAB_36: LTEMP.2 = (.n {ACCESS V41});
        j = i + 1;
        CBRANCH(j <= LTEMP.2, 1: LAB_41, 0: LAB_40);
LAB_41: LTEMP.3 = (.n {ACCESS V41});
        k = 1;
        CBRANCH(k <= LTEMP.3, 1: LAB_46, 0: LAB_45);
LAB_46: l = (.a[k, i] {ACCESS V20});
        m = (.a[k, j] {ACCESS V20});
        *(a[k, j] = l + m {ACCESS V20, INT});
LAB_34: k = k + 1;
        CBRANCH(k > LTEMP.3, 1: LAB_45, 0: LAB_46);
LAB_45: j = j + 1;
        CBRANCH(j > LTEMP.2, 1: LAB_40, 0: LAB_41);
LAB_40: i = i + 1;
        CBRANCH(i > LTEMP.1, 1: LAB_35, 0: LAB_36);
LAB_35:

CBRANCH is a generic conditional branch, not tied to any particular architecture. It provides two targets: the first is taken when the expression is true, the second when it is false.

This IR is somewhere between LIR and MIR. It isn't LIR because there are no registers. It isn't MIR because memory is accessed using the compiler's memory organization, namely the LTEMP temporaries.
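As a hand-worked illustration (not compiler output), the loop structure in the IR above can be mimicked in C with explicit tests and gotos. Each CBRANCH(cond, 1: A, 0: B) becomes a conditional goto whose untaken edge falls through; the label names below are invented.

```c
/* A single counted loop, written in the branch style of the Sun IR.
 * The conditions are inverted so that the false edge is the
 * fallthrough, exactly as a code generator would lay them out. */
static int sum_to(int n)
{
    int s = 0;
    int i = 1;
    if (i > n) goto LAB_done;   /* CBRANCH(i <= n, 1: body, 0: done) */
LAB_body:
    s += i;
    i = i + 1;
    if (i <= n) goto LAB_body;  /* CBRANCH(i > n, 1: done, 0: body) */
LAB_done:
    return s;
}
```

This is the same pattern as the IR's LAB_41/LAB_34 pair: one test guards loop entry, and a second, inverted test at the bottom forms the back edge.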

Optimization levels

There are four optimization levels:

O1
Limited optimizations. This level invokes only certain optimization components of the code generator.

O2
This and higher levels invoke both the global optimizer and the optimizer components of the code generator. At this level, expressions that involve global or equivalenced variables, aliased local variables, or volatile variables are not candidates for optimization. Automatic inlining, software pipelining, loop unrolling, and the early phase of instruction scheduling are not done.

O3
This level optimizes expressions that involve global variables but makes worst-case assumptions about potential aliases caused by pointers, and omits early instruction scheduling and automatic inlining. This level gives the best results.

O4
This level aggressively tracks what pointers may point to, making worst-case assumptions only where necessary. It depends on the language-specific front ends to identify potentially aliased variables, pointer variables, and a worst-case set of potential aliases. It also does automatic inlining and early instruction scheduling. This level turned out to be very problematic because of bugs in the front ends.

The global optimizer

The optimizer's input is Sun IR, and its output is Sun IR. The global optimizer performs the following on that input:

- Control-flow analysis is done by identifying dominators and back edges, except that the parallelizer does structural analysis for its own purposes.

- The parallelizer searches for commands the processor can execute in parallel. In practice it doesn't improve execution time much (the Alpha processor is where it has an effect, if any); most of the time the point is simply not to disrupt the processor's own parallelism.

- The global optimizer processes each procedure separately, using basic blocks. It first computes additional control-flow information. In particular, loops are identified at this point, including both explicit loops (for example, 'do' loops in Fortran 77) and implicit ones constructed from 'if's and 'goto's.

- Then a series of data-flow analyses and transformations is applied to the procedure. All data-flow analysis is done iteratively. Each transformation phase first computes (or recomputes) data-flow information if needed. The transformations are performed in this order:

1. Scalar replacement of aggregates, and expansion of Fortran arithmetic on complex numbers into sequences of real-arithmetic operations.

The dependence-based analysis and transformation phase is designed to support parallelization and data-cache optimization, and may be done (under control of a separate option) when the optimization level selected is O3 or O4. It comprises a sequence of steps performed in order.

After global optimization has been completed, the code generator first translates the Sun IR input into a representation called 'asm+', which consists of assembly-language instructions plus structures that represent control-flow and data-dependence information. An example is available on page 712. The code generator then performs a series of phases in a fixed order.
Optimizations performed on the C program

- (4) Strength reduction of 'height': instead of multiplying by 'height', the previous value is updated by addition.

- (6) Loop unrolling by a factor of four ('cmp %lo,3').

- Local variables are kept in registers.

- All computations are done in registers.

- (5) The tail call is identified and optimized by eliminating the stack frame.

Missed optimizations on the C program

- Removal of … computation.

- (3) Computing 'area' in one instruction.

- Completely unrolling the loop: only the first 8 iterations were unrolled.

Optimizations performed on the Fortran 77 program

- (2) Procedure integration of 's1'. The compiler can then make use of the fact that n = 500 to unroll the loop, which it did.

- (1) Common-subexpression elimination of 'a[k,j]'.

- Loop unrolling, from label .L900000112 to .L900000113.

- Local variables kept in registers.

- Software pipelining. Note, for example, the load just above the starting label of the loop.

An example of software pipelining. Consider the following sequence of commands, assuming each depends on the previous one:

Load
Add
Store

The add can't be started until the load is finished, and the store can't be started until the add is finished. The compiler can improve this code by rewriting it as:

Load
*Load
Add
*Store
Store

The commands marked with * belong to a different iteration and are needed later; the compiler interleaves them so that by the time the add starts executing, the result of the first load is already available, and similarly for the store with respect to the add.
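The same idea can be sketched at the source level (the function name and array size are invented for this illustration): the load that feeds the next iteration is issued before the current iteration's add and store complete, so the add never has to wait on the load that feeds it.

```c
/* A software-pipelined copy loop: iteration i's store overlaps with
 * the load for iteration i+1. Requires n >= 1. */
static void add_one(const int *src, int *dst, int n)
{
    int cur = src[0];               /* prologue: first load */
    int i;
    for (i = 0; i < n - 1; i++) {
        int next = src[i + 1];      /* load for iteration i+1 */
        dst[i] = cur + 1;           /* add + store for iteration i */
        cur = next;
    }
    dst[n - 1] = cur + 1;           /* epilogue: last add + store */
}
```

The prologue and epilogue outside the loop are characteristic of software pipelining: the loop body works on a "steady state" where one iteration's results are already in flight.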

Missed optimizations on the Fortran 77 program

- Eliminating 's1'. The compiler produced code for 's1()' although the main routine is the only one calling 's1()'.

- Eliminating an addition in the loop via linear-function test replacement. This would have eliminated one of the additions in the resulting code.

POWER/PowerPC

The POWER/PowerPC architecture

The POWER architecture is an enhanced 32-bit RISC machine with the following features:

- It consists of branch, fixed-point, floating-point and storage-control processors.

- Individual implementations may have multiple processors of each sort, except that the registers are shared among them and there may be only one branch processor in a system. That is, a processor is configurable and may be purchased with different numbers of each unit.

- The branch processor includes the condition, link and count registers and executes conditional and unconditional branches and calls, system calls, and condition-register move and logical operations.

- The fixed-point processor contains 32 32-bit integer general-purpose registers, with register gr0 delivering the value zero when used as an operand in an address computation (gr0 = 0). It implements loads and stores; arithmetic, logical, compare, shift, rotate and trap instructions; and system-control instructions. There are two addressing modes, register + register and register + displacement, plus the capability to update the base register with the computed address.

- The storage-control processor provides for segmented main storage, interfaces with the caches and translation look-aside buffer, and does virtual address translation.

- Instructions typically have three operands, two sources and one result. The operand order is the opposite of SPARC's, first the result and then the sources: Instruction result, src1, src2.

The PowerPC architecture is a nearly upward-compatible extension of POWER that allows for 32- and 64-bit implementations. It isn't 100% compatible because, for example, some instructions that were troublesome corner cases have been made invalid.

The assembly language guide is on page 750 of the course book, table A.4.

The IBM XL compilers

General

The compilers for these architectures are known as the XL family. The XL family originated in 1983, as a project to provide compilers for an IBM RISC architecture that was an intermediate stage between the IBM 801 and POWER, but that was never released as a product. It was an academic project. The first compilers created were an optimizing Fortran compiler for the PC RT, released to a selected few customers, and a C compiler for the PC RT used only for internal IBM development. The compilers were created with interchangeable back ends, so today they generate code for POWER, Intel 386, SPARC and PowerPC. The compilers were written in PL.8.

The compilers don't perform interprocedural optimizations. Almost all optimizations are performed on a proprietary low-level IR called XIL. Some optimizations that require a higher-level IR, for example optimizations on arrays, are performed on YIL, a higher-level representation created from XIL.

The structure

Each compiler consists of a front end called a translator, a global optimizer, an instruction scheduler, a register allocator, an instruction selector, and a phase called final assembly that produces the relocatable image and assembly-language listings. The root-services module interacts with all phases and serves to make the compilers compatible with multiple operating systems by, for example, holding information about how to produce listings and error messages.

The translator and XIL

A translator converts the source language to XIL using calls to XIL library routines. The XIL generation routines do not merely generate instructions; they may perform a few optimizations, for example generating a constant in place of an instruction that would compute the constant. A translator may consist of a front end that translates the source language to a different IR, followed by a translator from that intermediate form to XIL.

[Diagram: the translator produces XIL, which flows through the optimizer, the instruction scheduler, the register allocator, a second instruction-scheduling pass and instruction selection to final assembly, which emits the relocatable; the root-services module interacts with all phases.]

The illustration shows the relationships among the XIL data structures. This organization may save memory space while compiling, but it makes debugging the compiler more difficult. The data structures are:

- A procedure descriptor table, which holds information about each procedure, such as the size of its stack frame and the global variables it affects, and a pointer to the representation of its code.

- A procedure list. The code representation of each procedure consists of a procedure list that comprises pointers into the computation table.

- A computation table. Each instruction is represented as an entry in this table. The computation table is an array of variable-length records that represent preorder traversals of the intermediate code for the instructions.

- A symbolic register table. Variables and intermediate results are represented by symbolic registers, each of which has an entry in this table. Each entry points to the computation-table entry that defines it.

An example of XIL is on page 721.

TOBEY

The compiler back end (all the phases except the source-to-XIL translator) is named TOBEY, an acronym for TOronto Back End with Yorktown, indicating the heritage of the two groups which created the back end.

The TOBEY optimizer

The optimizer does the following:

- YIL is used for storage-related optimization.

  o YIL is created by TOBEY from XIL and includes, in addition to the structures in XIL, representations for looping constructs, assignment statements, subscripting operations, and conditional control flow at the level of 'if' statements.

  o It also represents the code in SSA form.

  o The goal is to produce code that is appropriate for dependence analysis and loop transformations.

  o After the analysis and transformations, the YIL is translated back to XIL.

- Alias information is provided by the translator to the optimizer, by calls from the optimizer to front-end routines.

- Control-flow analysis uses basic blocks. It builds the flow graph within a procedure, uses DFS to construct a search tree, and divides it into intervals.

- Data-flow analysis is done by interval analysis, an older method than the dominator method for finding loops. The iterative form is used for irreducible intervals.

- Optimization is performed on each procedure separately.

The register allocator

TOBEY includes two register allocators:

- A "quick and dirty" local allocator, used when optimization is not requested.

- A global graph-coloring allocator based on Chaitin's, but with spilling done in the style of Briggs's work.

The instruction scheduler



- Performs basic-block and branch scheduling.

- Performs global scheduling.

- Runs again after register allocation if any spill code has been generated.

The final assembly

The final assembly phase makes two passes over the XIL:

- peephole optimizations, such as removing compares;

- generating the relocatable image and listings.

Compilation results

The assembly code for the C program appears in the book on page 724.

The assembly code for the Fortran 77 program appears in the book on pages 724-725.

The numbers in parentheses are according to the numbering of possible optimizations for each program.

Optimizations performed on the C program



- (1) The constant value of 'kind' has been propagated into the conditional and the dead code eliminated.

The Intel 386 architecture

The Intel 386 architecture includes the Intel 386 and its successors: the 486, Pentium, Pentium Pro, and so on. The architecture is a thoroughly CISC design; however, some implementations use RISC principles such as pipelining and superscalar execution.

It has the following characteristics:

- There are eight 32-bit integer registers.

- It supports 16-bit and 8-bit subregisters.

- There are six segment registers used in computing addresses.

- Some registers have dedicated purposes (e.g., pointing to the top of the current stack frame).

- There are many addressing modes.

- There are eight 80-bit floating-point registers.

The assembly language guide is on pages 752-753 of the course book, tables A.7 and A.8.

The structure of the compilers, which use the mixed model of optimizer organization, is as follows:

[Diagram: the front end produces IL-1; the interprocedural optimizer, memory optimizer and global optimizer pass IL-1 + IL-2 between them; the code generator (code selector, register allocator, instruction scheduler) produces the relocatable.]

The front end is derived from work done at Multiflow and the Edison Design Group. The front ends produce a medium-level intermediate code called IL-1.

The interprocedural optimizer operates across modules. It performs a series of optimizations that include inlining, procedure cloning, parameter substitution, and interprocedural constant propagation.

The output of the interprocedural optimizer is a lowered version of IL-1, called IL-2, along with IL-1's program-structure information; this intermediate form is used by the remainder of the major components of the compiler, down through the input to the code generator.

The memory optimizer improves the use of memory and caches, mainly by performing loop transformations. It first does SSA-based sparse conditional constant propagation and then data-dependence analysis.

The global optimizer does the following optimizations:

- constant propagation

- dead-code elimination

- local common-subexpression elimination

- copy propagation

- partial-redundancy elimination

- a second pass of copy propagation

- a second pass of dead-code elimination
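Two of these passes, copy propagation and dead-code elimination, can be illustrated with a small invented C example showing a function before and after the transformations (hand-worked, not compiler output):

```c
/* Before: 't' is a plain copy of 'x', and 'unused' is computed but
 * never used. */
static int before(int x, int y)
{
    int t = x;           /* copy */
    int unused = x * y;  /* result never used: dead code */
    (void)unused;
    return t + y;        /* uses the copy */
}

/* After copy propagation ('t' replaced by 'x') and dead-code
 * elimination (the unused multiply removed): */
static int after(int x, int y)
{
    return x + y;
}
```

Note the ordering in the list above: copy propagation runs before dead-code elimination precisely because propagating a copy is what turns its defining assignment into dead code.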

Compilation results

The assembly code for the C program appears in the book on page 741.

The assembly code for the Fortran 77 program appears in the book on pages 742-743.

The numbers in parentheses are according to the numbering of possible optimizations for each program.

Optimizations performed on the C program

- (1) The constant value of 'kind' has been propagated into the conditional and the dead code eliminated.

- (2) The loop invariant 'length * width' has been removed from the loop.

- Strength reduction of 'height'.

- The local variables have been allocated to registers.

- Instruction scheduling has been performed.

Missed optimizations on the C program

- (6) Loop unrolling.

- (5) Tail-call optimization.

- (3) Accumulation of 'area' into a single multiplication.

Optimizations performed on the Fortran 77 program

- (2) 's1()' has been inlined, and it is therefore discovered that n = 500.

- (1) Common-subexpression elimination of 'a[k,j]'.

- (5) Linear-function test replacement.

- Local variables allocated to registers.

Missed optimizations on the Fortran 77 program

- (6) Loop unrolling.

Compilers comparison

The performance of each of the compilers on the C example is summarized in the following table:

optimization                      Sun SPARC    IBM XL    Intel 386 family
constant propagation of 'kind'    yes          yes       yes
dead-code elimination             almost all   yes       yes
loop-invariant code motion        yes          yes       yes
strength reduction of 'height'    yes          yes       yes
reduction of 'area' computation   no           no        no
loop unrolling factor             4            2         none
rolled loop                       yes          yes       yes
register allocation               yes          yes       yes
instruction scheduling            yes          yes       yes
stack frame eliminated            yes          no        no
tail call optimized               yes          no        no

The performance of each of the compilers on the Fortran example is summarized in the following table:

optimization                              Sun SPARC    IBM XL    Intel 386 family
address of a(i) a common subexpression    yes          yes       yes
procedure integration of s1()             yes          no        yes
loop unrolling factor                     4            2         none
rolled loop                               yes          yes       yes
instructions in innermost loop            21           9         4
linear-function test replacement          no           no        yes
software pipelining                       yes          no        no
register allocation                       yes          yes       yes
instruction scheduling                    yes          yes       yes
elimination of s1() subroutine            no           no        no

Future trends

There are several clear main trends developing for the near future of advanced compiler design and implementation:

- SSA form is being used more and more:

  o It allows methods designed for basic blocks and extended basic blocks to be applied to whole procedures.
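A hand-worked sketch of what SSA construction does (the names x1, x2, x3 are invented; this is not compiler output): every variable receives a fresh version at each assignment, and a phi-function selects the correct version where control paths join. In C, the phi can be mimicked with a conditional:

```c
/* Original code:              SSA form:
 *   x = a;                      x1 = a;
 *   if (a > 0)                  if (a > 0)
 *       x = x + 1;                  x2 = x1 + 1;
 *   return x;                   x3 = phi(x2, x1); return x3;
 *
 * Written out in C, with each SSA name as a distinct variable: */
static int ssa_example(int a)
{
    int x1 = a;
    int x2 = 0;
    int taken = (a > 0);
    if (taken)
        x2 = x1 + 1;
    int x3 = taken ? x2 : x1;   /* the phi: pick the version by path */
    return x3;
}
```

Because every SSA name has exactly one definition, local techniques such as value numbering see each value's whole history, which is what lets basic-block methods scale to whole procedures.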