4205: Compiler Construction

Introduction

The structure of a compiler

The compiling process has two parts. The analysis part breaks up the
source program into its constituent pieces and imposes a grammatical
structure on them. In addition, it reports lexically or syntactically
invalid input to the user and fills in the symbol table. The synthesis
part constructs the target program from the intermediate representation
and the symbol table. The two parts are called the front end and the
back end, respectively.

Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the
symbol table to check the source program for semantic consistency. It
also gathers type information and saves it in the symbol table. An
important part here is type checking, where the analyzer verifies that
operators have matching operands. It might also handle coercions, i.e.
transforming INT + FLOAT into FLOAT + FLOAT.

Programming Language Basics

Aliasing

There is an interesting consequence of call-by-reference (references
passed by value): two formal parameters can refer to the same location,
in which case they are aliases of each other. You need to understand
this in order to optimize a program, because certain optimizations may
only be done if one can be sure there is no aliasing.
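
A small C sketch of the problem; the function name is illustrative:

```c
/* If p and q may alias, the compiler cannot fold the final load of *p
   to the constant 1; it must reload the value from memory. */
int store_twice(int *p, int *q) {
    *p = 1;
    *q = 2;
    return *p;   /* 2 if p and q alias, 1 otherwise */
}
```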

A Simple Syntax-Directed Translator

Introduction

Intermediate code generation comes in two common forms: abstract syntax
trees, which represent the hierarchical syntactic structure of the
source program, and three-address instructions.
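
For example, the expression a + b * c could become the following
three-address instructions, where each instruction has at most one
operator on its right-hand side:

```
t1 = b * c
t2 = a + t1
```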

Syntax-Directed Translation

Syntax-directed translation is done by attaching rules or program
fragments to productions in a grammar.

Synthesized Attributes

The idea of associating quantities with programming constructs can be
expressed in terms of grammars. We attach rules to productions of the
grammar. A syntax-directed definition associates:

With each grammar symbol, a set of attributes

With each production, a set of semantic rules for computing values
of the attributes associated with the symbols appearing in the
production.

An attribute is synthesized if its value at a parse-tree node N is
determined from attribute values at the children of N and N itself, and
can be evaluated by a single bottom-up traversal.
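
A small syntax-directed definition with the synthesized attribute val;
num.lexval denotes the value supplied by the lexer:

```
E -> E1 + T   { E.val = E1.val + T.val }
E -> T        { E.val = T.val }
T -> num      { T.val = num.lexval }
```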

A Translator for Simple Expressions

A syntax-directed translation scheme often serves as the specification
for a translator. There may be a conflict between a grammar that is
suitable for translation and one that is suitable for parsing.

Abstract and Concrete Syntax

A useful starting point for creating a translator. ASTs resemble parse
trees to an extent, but the two are conceptually different. In a parse
tree the interior nodes represent non-terminals, whereas in a syntax
tree they represent programming constructs. Many productions correspond
to programming constructs, but many are just there to simplify parsing;
the helpers are dropped in the syntax tree.

Lexical Analysis

A lexical analyzer reads characters from the input and groups them into
“token objects” with attribute values. A character sequence that
comprises a single token is called a lexeme. For example, a num
token can carry a value attribute and an id token can carry its lexeme
as an attribute.

Removal of White Space and Comments

White space and comments can be removed by the lexer, so the parser does
not have to deal with them.

Intermediate Code Generation

Construction of Syntax Trees

Syntax trees can be created for any construct, not just expressions.
Each construct is represented by a node, with children for the
semantically meaningful components of the construct. These nodes can be
represented by a class structure with an arbitrary number of children.
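
A minimal sketch of such a node in C; the field names are illustrative,
not from any particular compiler:

```c
#include <stdlib.h>

/* One node per construct; kids holds its semantically meaningful parts. */
typedef struct Node {
    const char   *op;      /* construct label, e.g. "+", "if", "assign" */
    int           nkids;   /* number of children */
    struct Node **kids;    /* child pointers */
} Node;

Node *new_node(const char *op, int nkids) {
    Node *n = malloc(sizeof *n);
    n->op    = op;
    n->nkids = nkids;
    n->kids  = calloc((size_t)nkids, sizeof *n->kids);
    return n;
}
```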

Lexical Analysis

The Lexical-Analyzer Generator Lex

This tool allows one to specify a lexical analyzer by writing regular
expressions to describe tokens. The expressions are transformed into
deterministic finite automata. Use:

Lex source program $\to$ Lex compiler $\to$ C source program

C source program $\to$ C compiler $\to$ Lexer executable

Input stream $\to$ Lexer executable $\to$ Sequence of tokens
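
A minimal Lex source program, assuming the flex implementation (for the
%option line), that tokenizes numbers and identifiers:

```lex
%option noyywrap
%{
#include <stdio.h>
%}
digit   [0-9]
id      [a-zA-Z_][a-zA-Z0-9_]*
%%
{digit}+    { printf("NUM(%s)\n", yytext); }
{id}        { printf("ID(%s)\n", yytext); }
[ \t\n]+    { /* strip white space */ }
.           { printf("OTHER(%s)\n", yytext); }
%%
int main(void) { yylex(); return 0; }
```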

Syntax Analysis

Every programming language has rules that describe the syntactic
structure of its programs. These rules are normally expressed as
context-free grammars. Grammars have significant benefits:

A grammar gives a precise, yet easy-to-understand syntactic
specification of a programming language.

Certain classes of grammars can be parsed deterministically and
efficiently, and the process of constructing the parser might reveal
ambiguous constructs in the language.

Top-Down Parsing

Top-down parsing is the problem of constructing a parse tree for the
input string, starting from the root and creating the nodes of the
parse tree in preorder (depth-first). This generates a leftmost
derivation. At each step the problem is to determine which production
to apply for the current non-terminal. A predictive recursive-descent
parser can choose productions by looking at the next k input symbols;
we call this the LL(k) class.
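
A minimal predictive recursive-descent sketch in C for the grammar
expr $\to$ term { '+' term }, term $\to$ digit, emitting postfix as it
parses; error handling is reduced to exiting:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static int lookahead;                 /* single symbol of lookahead */

static void error(void) { fprintf(stderr, "syntax error\n"); exit(1); }

static void match(int t) {            /* consume t or fail */
    if (lookahead == t) lookahead = getchar();
    else error();
}

static void term(void) {              /* term -> digit */
    if (isdigit(lookahead)) { putchar(lookahead); match(lookahead); }
    else error();
}

static void expr(void) {              /* expr -> term { '+' term } */
    term();
    while (lookahead == '+') { match('+'); term(); putchar('+'); }
}

int main(void) {
    lookahead = getchar();
    expr();                           /* "9+5+2" prints "95+2+" */
    putchar('\n');
    return 0;
}
```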

Bottom-Up Parsing

A bottom-up parse corresponds to the construction of a parse tree for an
input string that begins at the leaves (the bottom) and works up
towards the root (the top). A general style of bottom-up parsing is
shift-reduce parsing, and such parsers can be built for LR grammars.

Reductions

We can think of bottom-up parsing as the process of reducing a string
w to the start symbol of the grammar. Each reduction step matches a
substring against the body of a production and replaces it with the
non-terminal at its head. By definition, a reduction is the reverse of a
step in a derivation. The goal is to construct a derivation in reverse,
which in fact will be a rightmost derivation.

Using Ambiguous Grammars

No ambiguous grammar can be LR. Ambiguous grammars might still be
useful because of their expressiveness. We can specify disambiguating
rules, although these should be used with care.

Precedence and Associativity to Resolve Conflicts

Instead of using the ambiguous grammar E $\to$ E + E | E * E | (E) | id
directly, we might rewrite the productions so that the grammar itself
encodes precedence. This is not always the optimal approach, because the
parser will have more states, and it is hard to modify the precedence
later. Alternatively, we can keep the ambiguous grammar and resolve
conflicts between different operators with precedence declarations, and
conflicts between occurrences of the same operator with associativity.

Parser Generators

yacc (yet another compiler-compiler) can construct a translator for a
programming language using a specification. It will parse with the LALR
method. It is used the same way as lex (from specification to program
that can read an input stream). The yacc specification file has three
parts:

Declarations

Translation rules

Supporting C routines

The Declarations Part

Consists of two sections: ordinary C declarations and declarations of
grammar tokens. The grammar tokens are then available in the second and
third parts. You can make the tokens available to Lex as well.

The parser's default conflict resolutions might not be what the compiler
writer wants, so we can declare associativity (%left '+' '-') or force a
token to be non-associative (%nonassoc). Precedence can be forced as
well.
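
A minimal yacc specification sketching a desk calculator; the ambiguous
expression grammar is disambiguated by the %left declarations, and the
tiny hand-written yylex is only illustrative:

```yacc
%{
#include <ctype.h>
#include <stdio.h>
int yylex(void);
int yyparse(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token DIGIT
%left '+' '-'
%left '*' '/'
%%
line : expr '\n'      { printf("%d\n", $1); }
     ;
expr : expr '+' expr  { $$ = $1 + $3; }
     | expr '-' expr  { $$ = $1 - $3; }
     | expr '*' expr  { $$ = $1 * $3; }
     | expr '/' expr  { $$ = $1 / $3; }
     | '(' expr ')'   { $$ = $2; }
     | DIGIT
     ;
%%
int yylex(void) {
    int c = getchar();
    if (c == EOF) return 0;                       /* end of input */
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;
}
int main(void) { return yyparse(); }
```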

Intermediate-Code Generation

Type Checking

To do type checking a compiler needs to assign a type expression to
each component of the source program. It must then determine that these
types conform to the rules (the type system) of the language. Type
checking has the potential of catching many errors. A sound type system
eliminates the need for dynamic checking for type errors. A language is
strongly typed if the compiler guarantees that accepted programs run
without type errors.

Static/dynamic typing is about WHEN type information is acquired
(compile time or run time).

Strongly/weakly typed is about how strictly types are distinguished.

Rules for Type Checking

Type checking comes in two forms: type synthesis, which builds up the
type of an expression from the types of its subexpressions, and type
inference, which determines the type of a language construct from the
way it is used. The latter can be used for languages that permit names
to be used without declaration but still perform type checking. void
is a special type that denotes the absence of a value.

Type Conversions

Conversions are either widening (preserve information) or narrowing
(may lose information). If type conversions are done automatically by
the compiler, they are implicit. Such conversions are also known as
coercions and are normally limited to widening conversions. Two helper
functions are used: max(a,b) finds the higher of two types in the
hierarchy, and widen() widens a value to a given type.
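
A sketch of these helpers in C, assuming a small numeric hierarchy
encoded as increasing enum values; the "emitted" conversion is just
printed:

```c
#include <stdio.h>

/* Widening hierarchy encoded as increasing values:
   CHAR < SHORT < INT < FLOAT < DOUBLE. */
typedef enum { T_CHAR, T_SHORT, T_INT, T_FLOAT, T_DOUBLE } Type;

/* max(a, b): the least type that both operands can be widened to. */
Type max_type(Type a, Type b) { return a > b ? a : b; }

/* widen(addr, t, w): emit a conversion of addr from type t to the
   wider type w when one is needed; here we just print it. */
void widen(const char *addr, Type t, Type w) {
    if (t != w)
        printf("t1 = widen(%s, %d -> %d)\n", addr, (int)t, (int)w);
}
```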

Control Flow

Backpatching

A key problem when generating code for boolean expressions and flow of
control statements is that of matching a jump instruction with the
target of the jump. An approach to this is called backpatching, in
which lists of jumps are passed as synthesized attributes. When a jump
is generated, the target of the jump is temporarily left unspecified.

One-Pass Code Generation using Backpatching

Backpatching can be used to generate code for boolean expressions and
flow-of-control statements in one pass. We use the synthesized
attributes truelist and falselist to manage the labels in jumping
code for boolean expressions.

Backpatching for Boolean Expressions
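
A small worked example in the dragon-book style: for x < 100 || x > 200,
jumping code is generated with the jump targets temporarily left blank.
The synthesized truelist and falselist record which instructions must be
patched once the true and false targets become known:

```
100: if x < 100 goto _    // on the truelist
101: goto 102             // fall through to the second test
102: if x > 200 goto _    // on the truelist
103: goto _               // on the falselist
```

Here truelist = {100, 102} and falselist = {103}; backpatching fills in
the blanks once the targets are determined.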

Flow-of-Control Statements

Run-Time Environments

A compiler must implement the abstractions in the source language. In
addition to names, scopes, bindings etc., it must also cooperate with
the operating system and other system software. To do this, the compiler
creates a run-time environment in which it assumes the program is
executed. This environment deals with storage, naming, mechanisms to
access variables, parameters, return values, the interface to the
operating system, and so on.

Storage Organization

From the perspective of the compiler writer, the program runs in its own
logical address space, where the OS maps from logical to physical
addresses. The normal layout is Code $\to$ Static $\to$ Heap
$\to$ Free Memory $\to$ Stack. The amount of storage needed is
determined by the types in the program. Many machines require data to
be aligned, e.g. that addresses must be divisible by 4. The compiler
may avoid the resulting wasted space by packing data. Code size and
static data size are known at compile time.\
\
There are two areas for dynamic storage allocation, the stack and the
heap, whose sizes can change during program execution. Some languages
manage the heap automatically with garbage collection; in others the
programmer must explicitly allocate (and free) space on the heap.

Static Versus Dynamic Storage Allocation

A storage allocation decision is static (compile time) if it can be
made just by looking at the text of the program. It is dynamic if it
can be decided only while the program is running. Two techniques:

Stack

Heap. For data that might outlive the call to the procedure that
created it.

Stack Allocation of Space

Almost all compilers for languages with procedures/methods use the
stack. Stack allocation ensures that storage for variables whose
lifetimes do not overlap can be reused, and that relative addresses
within an activation record are always the same.

Activation Records

Procedure calls and returns are usually managed by a run-time stack
(the control stack). Each live activation has an activation record
(frame), which might include:

Temporary values

Local data

A saved machine status

Access link

Control link, pointing to the activation record of the caller.

Space for return value

The actual parameters used by the calling procedure.
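
The list above can be pictured as a struct. This is only an illustrative
sketch in C; real frames are laid out by the compiler, and the field
types here are placeholders:

```c
typedef double Value;                        /* placeholder value type */
typedef struct { void *return_address; } SavedStatus;

struct ActivationRecord {
    Value  *actual_params;                  /* parameters from the caller */
    Value   return_value;                   /* space for the return value */
    struct ActivationRecord *control_link;  /* caller's activation record */
    struct ActivationRecord *access_link;   /* record of enclosing scope */
    SavedStatus saved_status;               /* saved machine status */
    Value   locals_and_temps[];             /* local data and temporaries */
};
```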

Heap Management

For data that lives indefinitely, or until the program explicitly
deletes it.

The Memory Manager

The memory manager keeps track of all the free space in heap storage at
all times. It performs:

Allocation. Must provide a memory region of the requested size.

Deallocation. For reusing space.

Code Generation

The final phase in the compiler is the code generator. It takes IR as
input and produces a semantically equivalent target program. It must

Preserve semantic meaning

Make efficient use of resources on target machine

The code generator itself must run efficiently

Generating optimal code is undecidable, but many useful heuristic
methods exist that can improve program performance by large factors.

Addresses in the Target Code

Instruction Selection by Tree Rewriting

Instruction selection can be a large combinatorial task, especially on
CISC machines. We treat instruction selection as a tree-rewriting
problem, for which efficient algorithms can be constructed.

Tree-Translation Schemes

The input to code generation is a sequence of trees at the semantic
level. We apply a sequence of tree-rewriting rules to reduce the input
tree to a single node. Each rule represents the translation of the
portion of the tree matched by its template. The replacement is called
tiling of the subtree.

Code Generation by Tiling an Input Tree

Given an input tree, the templates in the tree-rewriting rules are
applied to tile its subtrees. When a match is found, the matched subtree
is replaced with the replacement node of the rule and the action
associated with the rule is performed (generating code). Two issues:

How is the tree-pattern matching to be done?

What if more than one template matches?

There must always be at least one match, and the rewriting must
terminate. You can create a postfix representation of the tree and a
grammar for it; on a reduce-reduce conflict, take the longest
production, and on a shift-reduce conflict, shift. This is known as
maximal munch[^1]. It is a good and well-understood method, but it
has its challenges.

Machine-Independent Optimizations

High-level language constructs can introduce substantial run-time
overhead if we naively translate each construct independently. We have
three forms:

Local code optimization - within a block

Global code optimization - across basic blocks

Interprocedural code optimization - across functions and procedures

Most global optimizations are based on data-flow analyses.

The Principal Sources of Optimization

A compiler optimization must preserve the semantics of the original
program, and the optimizer is usually only able to apply low-level
semantic transformations.

Causes of redundancy

Sometimes redundancy is present at the source level, but often it is
introduced because the source is written in a high-level language
(array accesses, for instance, cause redundant address computations).
By having the compiler eliminate the redundancies, we get the best of
both worlds: programs that are both efficient and easy to maintain.

A Running Example: Quicksort

blabla

Semantics-Preserving Transformations

A compiler can improve a program without changing the function it
computes.

Global Common Subexpressions

An occurrence of an expression $E$ is called a common subexpression if
$E$ was previously computed and the values of the variables in $E$ have
not changed since the previous computation.

Copy Propagation

A copy statement is an assignment of the form $u = v$. The idea is to
use $v$ instead of $u$ in subsequent statements that use $u$. It may not
appear to be an improvement by itself, but it gives dead-code
elimination the opportunity to get rid of the copy statement.
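
A small three-address illustration:

```
// before
u = v
w = u + 1
// after copy propagation (u = v is now dead if u has no other uses)
u = v
w = v + 1
```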

Dead-Code Elimination

A variable is live at a point in a program if its value can be used
subsequently; otherwise it is dead. We would like to remove dead
statements. Copy propagation enables dead-code elimination to delete
many copy statements.\
\
Deducing at compile time that the value of an expression is a constant,
and using the constant instead, is known as constant folding.

Code Motion

Loops are a very important target for optimization. Code motion takes an
expression that evaluates to the same value on every iteration (a
loop-invariant computation) and lifts it out of the loop.
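
A small C illustration; the transformation assumes the compiler can
prove that limit is not modified inside the loop:

```c
/* Before: limit - 2 is recomputed on every loop test. */
int sum_before(const int *a, int limit) {
    int s = 0;
    for (int i = 0; i <= limit - 2; i++)
        s += a[i];
    return s;
}

/* After code motion: the loop-invariant limit - 2 is hoisted. */
int sum_after(const int *a, int limit) {
    int s = 0;
    int t = limit - 2;
    for (int i = 0; i <= t; i++)
        s += a[i];
    return s;
}
```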

Induction Variables and Reduction in Strength

A variable x is said to be an induction variable if there is a
positive or negative constant c such that each time x is assigned,
its value increases by c. Induction variables can be computed with a
single increment per loop iteration. Replacing an expensive operation
with a cheaper one is called strength reduction, and it can be applied
to induction variables. It is normal to work from the inside out when
optimizing loops. When there are two or more induction variables in a
loop, it may be possible to get rid of all but one.
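
A sketch in three-address form: j is an induction variable, and so is
t4 = 4 * j; the multiplication can then be replaced by a subtraction:

```
// inside the loop, before
j  = j - 1
t4 = 4 * j
// after strength reduction (t4 initialized to 4 * j before the loop)
j  = j - 1
t4 = t4 - 4
```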

Introduction to Data-Flow Analysis

Data-flow analysis refers to a body of techniques that derive
information about the flow of data along program execution paths. It
can answer many questions relevant to optimization, such as global
common subexpression elimination and dead-code elimination.

The Data-Flow Abstraction

You can view the program as a series of transformations, where the
states exist between the statements. Control flow goes from statement
to statement within a single block, and between blocks if there is an
edge between them. An execution path is the sequence of instructions
executed, and normally there is an infinite number of possible paths.
Different analyses may choose to abstract out different information.\
\
The definitions that may reach a program point along some path are known
as reaching definitions.

The Data-Flow Analysis Schema

At each program point we associate a data-flow value, which is the set
of all possible program states that can be observed at that point. We
denote the data-flow values before and after each statement $s$ by
$IN[s]$ and $OUT[s]$, and the data-flow problem is to find a solution
to a set of constraints on these values for all statements $s$.

Transfer Functions

The relationship between the data-flow values before and after an
assignment statement is known as a transfer function. The analysis can
go in either of two directions (forward or backward analysis).

Control-Flow Constraints

You can create $IN[B]$ and $OUT[B]$ by running through all the
statements.

Data-Flow Schemas on Basic Blocks

Going through a block is easy, just follow each statement, and we get:

$$OUT[B] = f_B(IN[B])$$

If we do constant propagation, we are interested in the sets of
constants that may be assigned to a variable; then we have a forward
flow with union as the meet operator:

$$IN[B] = \bigcup_{P\ \mathrm{a\ predecessor\ of}\ B} OUT[P]$$

There is no unique solution; we seek the most precise solution that
satisfies the control-flow and transfer constraints.

Reaching Definitions

One of the most common and useful data-flow schemas. We say a definition
$d$ reaches a point $p$ if there is a path from the point immediately
following $d$ to $p$ such that $d$ is not killed along that path.
Because there may be aliases, we must be conservative.

Transfer Equation for Reaching Definitions

$$f_d(x) = gen_d \cup (x - kill_d)$$

The gen-set contains the definitions that are visible immediately after
the block (only the last definition of each variable in the block),
while the kill-set contains every other definition, anywhere in the
program, of the variables assigned in the block (even definitions
earlier in the same block).
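
A hypothetical block to make the sets concrete, assuming d1, d2 and d3
are the only definitions of a and b in the whole program:

```
B:  d1: a = 1
    d2: a = 2
    d3: b = a

gen_B  = {d2, d3}    // last definition of each variable in B
kill_B = {d1}        // d2 kills d1; no other definitions of a or b exist
```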

Control-Flow Equations

Union is the meet operator for reaching definitions. It creates a
summary of the contributions from different paths at the confluence of
those paths.

Live-Variable Analysis

Useful for register allocation in basic blocks (dead values need not be
stored). We use a backward analysis and define:

$def_B$ as the set of variables defined in the block prior to any use
there.

$use_B$ as the set of variables used in the block prior to any
definition there.

The schema is about the same as for reaching definitions, except in the
opposite direction; the meet is a union over successors,

$$OUT[B] = \bigcup_{S\ \mathrm{a\ successor\ of}\ B} IN[S]$$

and within a block

$$IN[B] = use_B \cup (OUT[B] - def_B)$$

Available Expressions

An expression $x + y$ is available at a point $p$ if every path from
the entry node to $p$ evaluates $x + y$, and after the last such
evaluation prior to reaching $p$, there are no subsequent assignments to
$x$ or $y$. A block kills expression $x + y$ if it assigns (or may
assign) $x$ or $y$ and does not subsequently recompute $x + y$. A block
generates an expression $x + y$ if it evaluates $x + y$ and does not
subsequently redefine $x$ or $y$. The primary use of available
expressions is detecting global common subexpressions. At each step
($x = y + z$) the set $S$ of available expressions is updated:

Add to $S$ the expression $y + z$.

Delete from $S$ any expression involving $x$.

The two steps must be done in this order, since $x$ could be the same
as $y$ or $z$. The data-flow problem can be set up the same way as for
reaching definitions (a forward analysis), but the meet operator is
intersection rather than union: an expression is available at the top
of a block only if it is available at the end of every predecessor
block.\
\
In reaching definitions, we start with a set that is too small and work
up; in available expressions, we start with a set that is too large and
work down. We begin with the assumption that everything is available
everywhere and shrink the sets, which keeps the optimization
conservative.

Loops in Flow Graphs

Loops are important when considering optimization because programs
spend most of their time executing them. Additionally, program analysis
runs faster on a flow graph without loops, because a single pass over
it suffices.

Dominators

A node d of a flow graph dominates node n, written d dom n, if every
path from the entry node of the flow graph to n goes through d. Every
node dominates itself. The information can be presented in a dominator
tree, where entry is the root and each node dominates its descendants.
The problem can be formulated as a forward data-flow analysis.

Edges in a Depth-First Spanning Tree

We can traverse the flow graph with a depth-first search, obtaining the
following edge types:

Advancing edges

Retreating edges

Cross edges

Interprocedural Analysis

An interprocedural analysis operates across an entire program. One
simple and useful technique is inlining of functions, where the
function invocation is replaced with the body of the function itself.
Recursive methods cannot be inlined immediately, and inlining may expand
the code size exponentially (causing cache misses). The advantage of
function inlining is improved speed, since the method-invocation
overhead disappears.

Other definitions

Loop Unrolling

Loop unrolling creates more instructions in the loop body, permitting
global scheduling algorithms to find more parallelism. It will increase
code size, and may thereby introduce more cache misses and memory
references.
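
A minimal C sketch of unrolling by a factor of four, assuming n is a
multiple of 4 so that no cleanup loop is needed:

```c
/* Original: one element per iteration. */
void add_arrays(int *a, const int *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* Unrolled by four: fewer branches and more independent operations
   per iteration for the scheduler to overlap. */
void add_arrays_unrolled(int *a, const int *b, int n) {
    for (int i = 0; i < n; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
}
```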

[^1]: From Wikipedia: in computer programming and computer science,
“maximal munch” or “longest match” is the principle that when
creating some construct, as much of the available input as possible
should be consumed.