Tuesday, August 29, 2017

The functionality that cannot be handled by the machine description in machine.md is implemented using target macros and functions in machine.h and machine.c. Starting from an existing back end that is similar to you target will give you a reasonable implementation for most of these, but you will need to update the implementation related to the register file and addressing modes to correctly describe your architecture. This blog post describes the relevant macros and functions, with examples from the Moxie back end.

Inspecting the RTL

Some of the macros and functions are given RTL expressions that they need to inspect in order to do what is expected. RTL expressions are represented by a type rtx that is accessed using macros. Assume we have a variable

rtx x;

containing the expression

(plus:SI (reg:SI 100) (const_int 42))

We can now access the different parts of it as

GET_CODE(x) returns the operation PLUS

GET_MODE(x) returns the operation’s machine mode SImode

XEXP(x,0) returns the first operand – an rtx corresponding to (reg:SI 100)

XEXP(x,1) returns the second operand – an rtx corresponding to (const_int 42)

XEXP() returns an rtx which is not right for accessing the value of a leaf expression (such as reg) – those need to be accessed using a type-specific macro. For example, the integer value from (const_int 42) is accessed using INTVAL(), and the register number from (reg:SI 100) is accessed using REGNO().

For an example of how this is used in the back end, suppose our machine have load and store instructions that accept an address that is a constant, a register, or a register plus a constant in the range [-32768, 32767], and we need a function that given an address expression tells if it is a valid address expression for the target. We could write that function as

The names are used for generating the assembly files, and for the GCC extension letting programmers place variables in specified registers

registerint *foo asm ("$r12");

The Moxie architecture does only have 18 registers, but it adds two fake registers ?fp and ?ap for the frame pointer and argument pointer (used to access the function’s argument lists). This is a common strategy in GCC back ends to simplify code generation and elimination of unneeded frame pointer and argument pointers, and there there is a mechanism (ELIMINABLE_REGS) that rewrites these fake registers to real registers (such as the stack pointer).

The GCC RTL contains virtual registers (which are called pseudo registers in GCC) before the real registers (called hard registers) are allocated. The registers are represented as an integer, where 0 is the first hard register, 1 is the second hard register, etc., and the pseudo registers follows after the last hard register. The back end specifies the number of hard registers by specifying the first pseudo register

#define FIRST_PSEUDO_REGISTER 20

Some of the registers, such as the program counter $pc, cannot be used by the register allocator, so we need to specify which registers the register allocator must avoid

It is possible to specify the order in which the register allocator allocates the registers. The Moxie back end does not do this, but it could have done it with something like the code below that would use register 2 ($r0) for the first allocated register, register 3 ($r1) for the second allocated register, etc.

For an example of where this can be used, consider an architecture with 16 registers and “push multiple” and “pop multiple” instructions that push/pop the last \(n\) registers. The instructions are designed to store call-saved registers when calling a function, and it makes sense to allocate the call-saved registers in reverse order so the push/pop instructions save as few registers as possible

There are usually restrictions on how registers can be used (e.g. the Moxie $pc register cannot be used in arithmetic instructions), and these restrictions are described by register classes. Each register class defines a set of registers that can be used in the same way (for example, that can be used in arithmetic instructions). There are three standard register classes that must be defined

ALL_REGS – containing all of the registers.

NO_REGS – containing no registers.

GENERAL_REGS – used for the ‘r’ and ‘g’ constraints.

and the back end can add its own as needed. The register classes are implemented as

REG_CLASS_CONTENTS consists of a bit mask for each register class, telling with registers are in the set. Architectures having more than 32 registers use several values for each register class, such as

{ 0xFFFFFFFF, 0x000FFFFF } /* All registers */ \

The back end also need to implement the REGNO_REG_CLASS macro that returns the register class for a register. There are in general several register classes containing the registers, and the macro should return a minimal class (i.e. one for which no smaller class contains the register)

The back end may also need to set a few target macros detailing how values fit in registers, such as

/* A C expression that is nonzero if it is permissible to store a value of mode MODE in hard register number REGNO (or in several registers starting with that one). */ #define HARD_REGNO_MODE_OK(REGNO,MODE) 1

Addressing modes

Addressing modes varies a bit between architectures – most can do “base + index” addressing, but there are often restrictions on the format of the index etc. The general functionality is described by five macros:

/* Specify the machine mode that pointers have. After generation of rtl, the compiler makes no further distinction between pointers and any other objects of this machine mode. */#define Pmode SImode/* Maximum number of registers that can appear in a valid memory address. */#define MAX_REGS_PER_ADDRESS 1/* A macro whose definition is the name of the class to which a valid base register must belong. A base register is one used in an address which is the register value plus a displacement. */#define BASE_REG_CLASS GENERAL_REGS/* A macro whose definition is the name of the class to which a valid index register must belong. An index register is one used in an address where its value is either multiplied by a scale factor or added to another register (as well as added to a displacement). */#define INDEX_REG_CLASS NO_REGS/* A C expression which is nonzero if register number NUM is suitable for use as a base register in operand addresses. */#ifdef REG_OK_STRICT#define REGNO_OK_FOR_BASE_P(NUM) \ (HARD_REGNO_OK_FOR_BASE_P(NUM) \ || HARD_REGNO_OK_FOR_BASE_P(reg_renumber[(NUM)]))#else#define REGNO_OK_FOR_BASE_P(NUM) \ ((NUM) >= FIRST_PSEUDO_REGISTER || HARD_REGNO_OK_FOR_BASE_P(NUM))#endif/* A C expression which is nonzero if register number NUM is suitable for use as an index register in operand addresses. */#define REGNO_OK_FOR_INDEX_P(NUM) 0

The macros REGNO_OK_FOR_BASE_P and REGNO_OK_FOR_INDEX_P have two versions – the normal version that accepts all pseudo registers as well as the valid hard registers, and one “strict” version that only accepts the valid hard registers. The strict version is used from within the register allocator, so it needs to check pseudo registers against the reg_number array to accept cases where the pseudo register is in the process of being allocated to a hard register.

The back end also needs to implement the TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P hook that is used by the compiler to check that the address expression is valid (this is needed as the hardware may have more detailed constraints that cannot be expressed by the macros above). This is done similarly to the valid_address_p example in “inspecting the IR” above, but it needs to handle the “strict” case too. The Moxie back end implements this as

Tuesday, August 22, 2017

Compilers are usually described as having three parts – a front end that handles the source language, a middle that optimizes the code using a target-independent representation, and a back end doing the target-specific code generation. GCC is no different in this regard – the front end generates a target-independent representation (GIMPLE) that is used when optimizing the code, and the result is passed to the back end that converts it to its own representation (RTL) and generates the code for the program.

The back end IR

The back end’s internal representation for the code consists of a linked list of objects called insns. An insn corresponds roughly to an assembly instruction, but there are insns representing labels, dispatch tables for switch statements, etc.. Each insnsn is constructed as a tree of expressions, and is usually written in an LISP-like syntax. For example,

(reg:mn)

is an expression representing a register access, and

(plus:mxy)

represents adding the expressions x and y. An insn adding two registers may combine these as

(set (reg:mr0) (plus:m (reg:mr1) (reg:mr2)))

The m in the expressions denotes a machine mode that defines the size and representation of the data object or operation in the expression. There are lots of machine modes, but the most common are

QI – “Quarter-Integer” mode represents a single byte treated as an integer.

HI – “Half-Integer” mode represents a two-byte integer.

SI – “Single Integer” mode represents a four-byte integer.

DI – “Double Integer” mode represents an eight-byte integer.

CC – “Condition Code” mode represents the value of a condition code (used to represent the result of a comparison operation).

GCC supports architectures where a byte is not 8 bits, but this blog series will assume 8-bit bytes (mostly as I often find it clearer to talk about a 32-bit value in examples, instead of a more abstract SImode value).

Overview of the back end operation

The back end runs a number of passes over the IR, and GCC will output the resulting RTL for each pass when -fdump-rtl-all is passed to the compiler.

The back end starts by converting GIMPLE to RTL, and a small GIMPLE function such as

The generated RTL corresponds mostly to real instructions even at this early stage in the back end, but the generated code it is inefficient and registers are still virtual.

The next step is to run optimization passes on the RTL. This is the same kind of optimization passes that have already been run on GIMPLE (constant folding, dead code elimination, simple loop optimizations, etc.), but they can do a better job with knowledge of the target architecture. For example, loop optimizations may transform loops to better take advantage of loop instructions and addressing modes, and dead code elimination may see that the operations working on the upper part of

on a 32-bit architecture are not needed (as the returned value is always 0) and can thus be eliminated.

After this, instructions are combined and spit to better instruction sequences by peep-hole optimizations, registers are allocated, the instructions are scheduled, and the resulting RTL dump contains all the information about which instructions are selected and registers allocated

which defines an insn with a name mulsi3 that generates a mul instruction for a 32-bit multiplication.

The first operand in the instruction pattern is a name that is used in debug dumps and when writing C++ code that generates RTL. For example,

emit_insn (gen_mulsi3 (dst, src1, src2));

generates a mulsi3 insn. The back end does not in general need to generate RTL, but the rest of GCC does, and the name tells GCC that it can use the pattern to accomplish a certain task. It is therefore important that the named patterns implement the functionality that GCC expects, but names starting with * are ignored and thus safe to use for non-standard instruction patterns. The back end should only implement the named patterns that make sense for the architecture – GCC will do its best to emit code for missing patterns using other strategies. For example, a 32-bit multiplication will be generated as a call to __mulsi3 in libgcc if the target does not have a mulsi3 instruction pattern.

The next part of the instruction pattern is the RTL template that describes the semantics of the instruction that is generated by the insn

This says that the instruction multiplies two registers and places the result in a register. This example is taken from RISC-V that has a multiplication instruction without any side-effects, but some architectures (such as X86_64) sets condition flags as part of the operation, and that needs to be expressed in the RTL template, such as

where the parallel expresses that the two operations are done in as a unit.

The instruction’s operands are specified with expressions of the form

(match_operand:SI 1 "register_operand" "r")

consisting of four parts:

match_operand: followed by the machine mode of the operand.

The operand number used as an identifier when referring the operand.

A predicate telling what kind of operands are valid ("register_operand" means that it must be a register).

A constraint string describing the details of the operand ("r" means it must be a general register). These are the same constraints as are used in inline assembly (but the instruction patterns support additional constraints that are not allowed in inline assembly).

The predicate and constraint string contain similar information, but they are used in different ways:

The predicate is used when generating the RTL. As an example, when generating RTL for a GIMPLE function

then the GIMPLE to RTL converter will generate the multiplication as a mulsi3 insn. The predicates will be checked for the result _2 and the operands a_1(D) and 42, and it is determined that 42 is not valid as only registers are allowed. GCC will, therefore, insert a movsi insn that moves the constant into a register.

The constraints are used when doing the register allocation and final instruction selection. Many architectures have constraints on the operands, such as m68k that has 16 registers (a0–a7 and d0–d7), but only d0–d7 are allowed in a muls instruction. This is expressed by a constraint telling the register allocator that it must use register d0–d7.

The string after the RTL template contains C++ code that disables the instruction pattern if it evaluates to false (an empty string is treated as true). This is used when having different versions of the architecture, such as one for small CPUs that do not have multiplication instructions, and one for more advanced cores that can do multiplication. This is typically handled by letting machine.opt generate a global variable TARGET_MUL that can be set by an option such as -mmul and -march, and this global variable is placed in the condition string

so that the instruction pattern it is disabled (and the compiler will thus generate a call to libgcc) when multiplication is not available.

The resulting instruction is emitted using the output template

"mul\t%0,%1,%2"

where %0, %1, ..., are substituted with the corresponding operands.

More about define_insn

Many architectures need more complex instruction handling than the RISC-V mul instruction described above, but define_insn is flexible enough to handle essentially all cases that occur for real CPUs.

Let’s say our target can multiply a register and an immediate integer and that this requires the first operand to be an even-numbered register, while multiplying two registers requires that the first operand is an odd-numbered register (this is not as strange as it may seem – some CPU designs use such tricks to save one bit in the instruction encoding). This is easily handled by defining a new predicate in predicates.md

The constraint strings now list two alternatives – one for the register/register case and one for the register/integer case. And there is a % character added to tell the back end that the operation is commutative (so that the code generation may switch the order of the operands if the integer is the first operand, or help the register allocation for cases such as

_1 = _2 * 42;
_3 = _2 * _4;

to let it change the order of arguments on the second line to avoid inserting an extra move to make _2 an even register on the first line and an odd register on the second line).

Sometimes the different alternatives need to generate different instructions, such as the instruction multiplying two registers being called mulr and the multiplication with an integer being called muli. This can be handled by starting the output template string with an @ character and listing the different alternatives in the same order as in the constraint strings

This is usually not that useful for define_insn but may reduce the number of instruction patterns when the instruction names depend on the configuration. For example, mulsi3 in the RISC-V back end must generate mulw in 64-bit mode and mul in 32-bit mode, which is implemented as

Sunday, August 13, 2017

Most CPU architectures have a common subset – they have instructions doing arithmetics and bit operations on a few general registers, an instruction that can write a register to memory, and an instruction that can read from memory and place the result in a register. It is therefore easy to make a compiler that can compile simple straight-line functions by taking an existing back end and restricting it to this common subset. This is enough to start running the test suite, and it is then straightforward to address one deficiency at a time (adding additional instructions, addressing modes, ABI, etc.).

My original thought was that the RISC-V back end would be a good choice as a starting point – the architecture is fully documented, and it is a new, actively maintained, backend that does not use legacy APIs. But the RISC-V back end has lots of functionality (such as support for multiple ISA profiles, 32- and 64-bit modes, and features such as position-independent code, exception handling and debug information) and the work of reducing it became unnecessarily complicated when I tried...

I now think it is better to start from one of the minimal back ends, such as the back end for the Moxie architecture. Moxie seems to be a good choice as there is also a blog series “How To Retarget the GNU Toolchain in 21 Patches” describing step-by-step how it was developed. The blog series is old, but GCC has a very stable API, so it is essentially the same now (I once updated a GCC backend from GCC 4.3 to GCC 4.9, which were released 6 years apart, and only a few lines needed to be modified...).

Sunday, August 6, 2017

The GCC back end is configured in gcc/config.host and the implementation is placed in directories machine under gcc/config and gcc/common/config where “machine” is the name of the back end (for example, i386 for the x86 architecture).

The back end places some functionality in libgcc. For example, architectures that do not have an instruction for integer division will instead generate a call to a function __divsi3 in libgcc. libgcc is configured in libgcc/config.host and target-specific files are located in a directory machine under libgcc/config.

gcc/config.gcc

config.gcc is a shell script that parses the target string (e.g. x86_64-linux-gnu) and sets variables pointing out where to find the rest of the back end and how to compile it. The variables that can be set are documented at the top of the config.gcc file.

The only variable that must be set is cpu_type that specifies machine. Most targets also set extra_objs that specifies extra object files that should be linked into the compiler, tmake_file that contains makefile fragments that compiles those extra objects (or sets makefile variables modifying the build), and tm_file that adds header files containing target-specific information.

A typical configuration for a simple target (such as ft32-unknown-elf) looks something like

cpu_type=ft32
tm_file="dbxelf.h elfos.h newlib-stdint.h ${tm_file}"

gcc/config/machine

The main part of the back end is located in gcc/config/machine. It consists of eight different components, each implemented in a separate file:

machine.h is included all over the compiler and contains macros defining properties of the target, such as the size of integers and pointers, number of registers, data alignment rules, etc.

GCC implements a generic backend where machine.c can override most of the functionality. The backend is written in C,1 so the virtual functions are handled manually with function pointers in a structure, and machine.c overrides the defaults using code of the form

adds a command-line option -msmall-data-limit that has a default value 8, and is generated as an unsigned variable named g_switch_value.

machine.md, predicates.md, and constraints.md contain the machine description consisting of rules for instruction selection and register allocation, pipeline description, and peephole optimizations. These will be covered in detail in parts 3–7 of this series.

machine-modes.def defines extra machine modes for use in the low-level IR (a “machine mode” in the GCC terminology defines the size and representation of a data object. That is, it is a data type.). This is typically used for condition codes and vectors.

The GCC configuration is very flexible and everything can be overridden, so some back ends look slightly different as they, for example, add several .opt files by settingextra_options in config.gcc.

gcc/common/config/machine

The gcc/common/config/machine directory contains a file machine-common.c that can add/remove optimization passes, change the defaults for --param values, etc.

Many back ends do not need to do anything here, and this file can be disabled by setting

target_has_targetm_common=no

in config.gcc.

libgcc/config.host

The libgcc config.host works in the same way as config.gcc, but with different variables.

The only variable that must be set is cpu_type that specifies machine. Most targets also set extra_parts that specifies extra object files to include in the library and tmake_file that contains makefile fragments that add extra functionality (such as soft-float support).

libgcc/config/machine

The libgcc/config/machine directory contains extra files that may be needed for the target architecture. Simple implementations typically only contain a crti.S and crtn.S (crtbegin/crtend and the makefile support for building all of these have default implementation) and a file sfp-machine.h containing defaults for the soft-float implementation.

1. GCC is written in C++03 these days, but the structure has not been changed since it was written in C.

Friday, August 4, 2017

It is surprisingly easy to design a CPU (see for example Colin Riley’s blog series) and I was recently asked how hard it is to write a GCC back end for your new architecture. That too is easy — provided you have done it once before. But the first time is quite painful...

I plan to write some blog posts the coming weeks that will try to ease the pain by showing what is involved in creating a “working” back end that is capable of compiling simple functions, give some pointers to how to proceed to make this production-ready, and in general provide the overview I would have liked before I started developing my backend (GCC has a good reference manual, “GNU Compiler Collection Internals”, describing everything you need to know, but it is a bit overwhelming when you start...)

The series will cover the following (I’ll update the list with links to the posts as they become available)