Writing a GCC Front End

Language designers rejoice! Now it's easier to put a front end for your language onto GCC.

GCC, the premier free software compiler suite, has undergone many
changes in the last few years. One change in particular, the merging
of the tree-ssa branch, has made it much simpler to write a new GCC
front end.

GCC always has had two different internal representations, trees and
RTL. RTL, the register transfer language, is the low-level
representation that GCC uses when generating machine code.
Traditionally, all optimizations were done in RTL. Trees are a higher-level
representation; traditionally, they were less documented and less
well known than RTL.

The tree-ssa Project, a long-term reworking of GCC internals
spearheaded by Diego Novillo, changes all that. Now, trees are much
better documented, although still imperfectly, and many optimizations
are done at the tree level. A side effect of this work on trees was
the clear specification of a tree-based language called GENERIC. All GCC
front ends generate GENERIC, which is later lowered to another
tree-based representation called GIMPLE, and from there it goes to RTL.
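
To make the idea of lowering concrete, here is a minimal sketch of flattening a nested expression into statements that each perform at most one operation, which is roughly the shape GIMPLE takes. It is a toy model and not GCC code: Expr, leaf, binop, lower and lower_assign are hypothetical names invented for this illustration, and GCC's real trees and its gimplifier are considerably more involved.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression node: a leaf (variable name) or a binary operation.
// This is NOT GCC's tree type; it only models the idea of lowering.
struct Expr {
    std::string name;               // used when op == 0 (leaf)
    char op = 0;                    // '+', '*', ... for a binary node
    std::unique_ptr<Expr> lhs, rhs; // operands of a binary node
};

std::unique_ptr<Expr> leaf(std::string n) {
    auto e = std::make_unique<Expr>();
    e->name = std::move(n);
    return e;
}

std::unique_ptr<Expr> binop(char op, std::unique_ptr<Expr> l,
                            std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->op = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Flatten e into one-operation statements appended to out, and return the
// name (variable or temporary) that holds the value of e.
std::string lower(const Expr& e, std::vector<std::string>& out) {
    if (e.op == 0)
        return e.name;
    std::string l = lower(*e.lhs, out);
    std::string r = lower(*e.rhs, out);
    // Each temporary adds exactly one statement, so the count is unique.
    std::string tmp = "t" + std::to_string(out.size() + 1);
    out.push_back(tmp + " = " + l + " " + e.op + " " + r);
    return tmp;
}

// Lower "target = value" into a list of simple statements.
std::vector<std::string> lower_assign(const std::string& target,
                                      const Expr& value) {
    std::vector<std::string> out;
    std::string v = lower(value, out);
    out.push_back(target + " = " + v);
    return out;
}
```

Given the expression a = b + c * d, lower_assign produces three statements: the multiplication into a temporary, the addition into a second temporary and the final assignment. A front end never has to do this step itself; it hands GENERIC to GCC, and GCC performs the lowering.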

What this means to you is that it is much, much simpler to write a new
front end for GCC. In fact, it now is feasible to write a front end for
GCC without any knowledge of RTL whatsoever. This article provides a tour of how you
would go about connecting your own compiler front end to GCC. The
information in this article is specific to GCC 4.0, due to be released
in 2005.

Representing the Program

For our purposes, compilation is done in two phases: parsing and
semantic analysis, and then code generation. GCC handles the second
phase for you, so the question is, what is the best way to implement
phase one?

Traditional GCC front ends, such as the C and C++ front ends, generate
trees during parsing. Front ends like these typically add their own
tree codes for language-specific constructs. Then, after semantic
analysis has completed, these trees are lowered to GENERIC by
replacing high-level, language-specific trees with lower-level
equivalents. One advantage of this approach is that the language-specific
trees usually are nearly GENERIC already, so the lowering phase often can
avoid generating too much garbage.

The primary disadvantage of this approach is that trees are typed dynamically.
In theory, this might not seem so bad—many dynamically
typed environments exist that can be used efficiently by
developers, including Lisp and Python. However,
these are complete environments, and GCC's heavily
macro-ized C code doesn't confer the same advantages.
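
The following sketch shows what dynamic typing through macros looks like in practice. It is a self-contained imitation, not GCC source: ToyNode, toy_check and the TOY_ macros are hypothetical stand-ins for GCC's tree nodes and accessor macros. To my knowledge, GCC can perform comparable run-time checks when it is built with checking enabled, but the details here are assumptions made for illustration.

```cpp
#include <stdexcept>

// Toy tree codes. GCC's real codes live in gcc/tree.def; these are stand-ins.
enum ToyCode { TOY_INTEGER_CST, TOY_VAR_DECL, TOY_PLUS_EXPR };

// One node type for everything: which fields are meaningful depends on code.
struct ToyNode {
    ToyCode code;
    long value;        // valid only for TOY_INTEGER_CST
    const char* name;  // valid only for TOY_VAR_DECL
    ToyNode* op0;      // valid only for TOY_PLUS_EXPR
    ToyNode* op1;      // valid only for TOY_PLUS_EXPR
};

// Run-time check: the only thing standing between a caller and a wrong field.
inline ToyNode* toy_check(ToyNode* n, ToyCode expected) {
    if (n->code != expected)
        throw std::logic_error("toy tree check failed");
    return n;
}

// Accessor macros in the style the article describes.
#define TOY_INT_CST_VALUE(n) (toy_check((n), TOY_INTEGER_CST)->value)
#define TOY_OPERAND0(n)      (toy_check((n), TOY_PLUS_EXPR)->op0)
#define TOY_OPERAND1(n)      (toy_check((n), TOY_PLUS_EXPR)->op1)

// Adds two constant operands. This compiles for any ToyNode*, even when an
// operand turns out not to be a constant; the mistake surfaces only when
// the code actually runs on such a node.
long toy_fold_plus(ToyNode* plus) {
    return TOY_INT_CST_VALUE(TOY_OPERAND0(plus)) +
           TOY_INT_CST_VALUE(TOY_OPERAND1(plus));
}
```

Calling toy_fold_plus on a node whose operands are both constants returns their sum. Calling it on a node with a variable operand compiles just as happily and throws only at run time, which is the kind of error a statically typed representation would reject during compilation.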

My preferred approach to writing a front end is to have a strongly
typed, language-specific representation of the program, called an
abstract syntax tree (AST). This is the approach used by the Ada front
end and by gcjx, a rewrite of the front end for the
Java programming language.

For instance, gcjx is written in C++ and has a class hierarchy that
models the elements of the Java programming language. This code
actually is independent of GCC and can be used for other purposes. In
gcjx's case, the model can be lowered to GENERIC, but it also can be used
to generate bytecode or JNI header files. In addition, it could be used for
code introspection of various kinds; in practice, the front end
is a reusable library.
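
As a rough illustration of this design, the sketch below defines a tiny strongly typed model and two independent consumers of it. None of it is taken from gcjx: the class names, the visitor arrangement and the stack-machine output are assumptions chosen to keep the example short. The stack code stands in for something like bytecode generation, and a GENERIC-lowering visitor would be one more class alongside the two shown.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct IntLiteral;
struct Add;

// Every consumer of the model implements this interface.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visit(const IntLiteral& n) = 0;
    virtual void visit(const Add& n) = 0;
};

// Base class of the language model; it knows nothing about any back end.
struct Expr {
    virtual ~Expr() = default;
    virtual void accept(Visitor& v) const = 0;
};

struct IntLiteral : Expr {
    int value;
    explicit IntLiteral(int v) : value(v) {}
    void accept(Visitor& v) const override { v.visit(*this); }
};

struct Add : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    void accept(Visitor& v) const override { v.visit(*this); }
};

// Consumer 1: emits instructions for a toy stack machine.
struct StackCodeGen : Visitor {
    std::vector<std::string> code;
    void visit(const IntLiteral& n) override {
        code.push_back("push " + std::to_string(n.value));
    }
    void visit(const Add& n) override {
        n.lhs->accept(*this);
        n.rhs->accept(*this);
        code.push_back("add");
    }
};

// Consumer 2: prints the expression back as source-like text.
struct Printer : Visitor {
    std::string text;
    void visit(const IntLiteral& n) override {
        text += std::to_string(n.value);
    }
    void visit(const Add& n) override {
        text += "(";
        n.lhs->accept(*this);
        text += " + ";
        n.rhs->accept(*this);
        text += ")";
    }
};
```

Because an Add can be built only from Expr operands, and each visitor must handle every node class, mistakes that a tree-based front end would discover at run time are caught by the C++ compiler instead.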

This approach provides all the usual advantages of a strongly typed
design, and in the GCC context, it results in a program that is easier
to understand and debug. The relative independence of the
resulting front end from the rest of GCC also is an advantage, because
GCC changes rapidly and this loose coupling minimizes your exposure.

Potential disadvantages of this approach are the possibilities that
your compiler might do more work than is strictly needed or use more
memory. In practice, this doesn't seem to be too important.

Before we talk about some details of interfacing your front end to
GCC, let's take a look at some of the documentation and source files
you need to know. Because it hasn't been a priority in
the GCC community to make it simpler to write front ends, some things
you need to know are documented only in the source. The documentation
references here refer to info pages and not URLs, because GCC 4.0 has
not yet been released. Thus, the Web pages reflect earlier versions. Your
best bet is to check out a copy of GCC from CVS and dig around in the source.

gcc/c.opt: describes command-line options used by the C family of front
ends. More importantly, it describes the format of the .opt files.
You'll be writing one of these.

gcc/tree.def, gcc/tree.h: some attributes of trees don't seem to be documented, and reading these files can help.
tree.def defines all the tree codes and consists, in large part, of explanatory comments. tree.h defines the tree node
structures and the many accessor macros, and declares functions that
are useful in building trees of various types.

libcpp/include/line-map.h: line maps are used to represent source
code locations in GCC. You may or may not use these in your front
end—gcjx does not. Even if you do not use them, you need to build them when
lowering to GENERIC, as information in line maps is used when generating
debug information.

gcc/errors.h, gcc/diagnostic.h: define the interface to GCC's error formatting functions, which you may choose to use.

gcc/gdbinit.in: defines some GDB commands that are handy when
debugging GCC. For instance, the pt command
prints a textual representation
of a tree. The file .gdbinit also is made in the GCC build
directory; if you debug there, the macros immediately are
available.

gcc/langhooks.h: lang hooks are a mechanism GCC uses to allow front ends to control some aspects of GCC's behavior. Each front end must define its own
copy of the langhooks structures; these structures consist largely
of function pointers. GCC's middle and back ends call these
functions to make language-specific decisions during compilation.
The langhooks structures do change from time to time, but due to the
way GCC expects front ends to initialize these structures, you largely
are insulated from these changes at the source level. Some
of these lang hooks are not optional, so your front end must
implement them. Others are ad hoc additions for particular
problems. For instance, the can_use_bit_fields_p hook was introduced
solely to work around an optimization problem with the current gcj
front end.
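
The insulation described for langhooks comes from initializing the structure through macros with default values, so a front end names only the hooks it overrides. GCC's real defaults live in gcc/langhooks-def.h. The sketch below imitates that pattern in a self-contained way, and every name in it (toy_lang_hooks, the TOY_HOOKS_ macros, the mylang functions) is hypothetical.

```cpp
#include <string>

// A structure made largely of function pointers, as the article describes.
struct toy_lang_hooks {
    const char* name;
    bool (*init)();
    bool (*can_use_bit_fields_p)();
};

// Default implementations, used when a front end does not override a hook.
inline bool toy_default_init() { return true; }
inline bool toy_default_can_use_bit_fields_p() { return true; }

// Default macro values, normally provided by a shared header.
#define TOY_HOOKS_NAME "unknown"
#define TOY_HOOKS_INIT toy_default_init
#define TOY_HOOKS_CAN_USE_BIT_FIELDS_P toy_default_can_use_bit_fields_p

// The initializer lists every field in order. If a field is added to the
// structure, only this macro and the defaults change, not the front ends.
#define TOY_HOOKS_INITIALIZER \
    { TOY_HOOKS_NAME, TOY_HOOKS_INIT, TOY_HOOKS_CAN_USE_BIT_FIELDS_P }

// --- The front end's part: override only what it cares about. ---
inline bool mylang_can_use_bit_fields_p() { return false; }

#undef TOY_HOOKS_NAME
#define TOY_HOOKS_NAME "mylang"
#undef TOY_HOOKS_CAN_USE_BIT_FIELDS_P
#define TOY_HOOKS_CAN_USE_BIT_FIELDS_P mylang_can_use_bit_fields_p

// Macros expand where they are used, so the overrides above take effect.
const toy_lang_hooks toy_hooks = TOY_HOOKS_INITIALIZER;

// A stand-in for middle-end code asking the language for a decision.
bool toy_middle_end_wants_bit_fields() {
    return toy_hooks.can_use_bit_fields_p();
}
```

Here the hypothetical mylang front end replaces the name and one hook, and silently inherits the default init hook. A hook added to the structure later would also be picked up from the defaults without touching the front end's source.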

Tom Tromey reported here in 2005 (April 6th):
"Writing the Driver
Currently GCC requires your front end to be visible at build time—there is no way to write a front end that is built separately and linked against an installed GCC. "

Has that situation changed, or is it still the case that if I change even one variable declaration in a new front end's source, the WHOLE (if only core) GCC needs to be rebuilt from source?