I'm working on GCC in my spare time, starting with helping Zack revamp the preprocessor. The target is an integrated, token-based preprocessor and front end, with some kind of facility for handling precompiled headers.

I love Japan and most things Japanese, and have lived here for over five years. Sadly, that is about to end. I like computers too; hence my login name. If you've not been to Akihabara in Tokyo (or worse, never heard of it), then you haven't experienced one of life's great pleasures. There is nothing like it anywhere else; nowhere I know of comes close. Book a trip at your travel agent tomorrow; you won't regret it.

Recent blog entries by akihabara

Spent the last week adding preprocessor testcases for every
bit of odd behaviour I can dream up. Tidying up the #define
directive parser at the moment, removing a malloc performance
bottleneck. Zack's just completed a nice tidy-up of the
macro-expanding code, removing excessive recursive calls. I
suspect the current code is now faster than the old cpplib and
cccp; certainly there is little reason for it to be slower.

We should be able to scrap support for -traditional (though
not -Wtraditional, I expect), since we're now bundling an old
preprocessor, tradcpp, just for that job. A token-based
preprocessor proved too fundamentally different from K&R for
the integration to be sustainable, and it was getting in the
way.
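
One classic incompatibility, for anyone who hasn't fought with
it (toy macro names):

    /* K&R-era preprocessors deleted comments outright, so this
       glued two tokens together: */
    #define PASTE_KR(a, b) a/**/b    /* K&R: ab    ISO: a b */

    /* ISO C pastes with the ## operator instead: */
    #define PASTE_ISO(a, b) a ## b

    int PASTE_ISO(my, var) = 1;      /* declares "myvar" */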

Cpplib is beginning to look quite clean in most places, and
should not be too hard to read. It's almost at the stage of
being a piece of code to be proud of. A notable exception is
the lexer, which still needs a lot of cleaning up and work on
improving performance. Lexers tend to be ugly by their very
nature, though.

Hopefully we can soon start to think about front-end
integration and precompiled headers, which will be fun to work
on and should give us some really nice performance
improvements. The C and C++ front ends should be able to all
but abandon their existing lexers, save crannies like
interpreting numbers and merging adjacent string literals.
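
By merging adjacent string literals I just mean translation
phase 6:

    const char *msg = "Hello, " "world";  /* one string: "Hello, world" */

    /* Mostly useful with macros: */
    #define NAME "cpplib"
    const char *banner = "using " NAME "\n";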

In a few days I'm going to be offline for a month or three,
so Zack will be working on it alone for a while. I think he's
forgotten his Advogato password, though <g>.

Finally got the new expander and lexer live today. A lot of
cleanup and optimisation remains to be done, but the
immediate priority is comprehensive testsuites so we can be
sure not to introduce regressions when improving the code
base.

-traditional is not supported fully at present, but we're
working on a solution.

At last, the new macro stuff is nearly done, thanks to some
work by Zack yesterday. We bootstrap and pass the tests in the
testsuite, and are more precise about corner cases than
before. Just the -traditional stuff to go, and we should be
able to apply it to CVS. If you use non-ISO stuff like the GNU
## extension to delete the previous token, or token pasting
that yields a non-token (remember, we're grown-up and
token-based now), you'll get warnings telling you to clean up
your act.
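
For the record, these are the two constructs I mean (toy
uses):

    #include <stdio.h>

    /* 1. The GNU extension: ## before a rest argument deletes the
       preceding comma when the argument is left empty. */
    #define eprintf(fmt, args...) fprintf(stderr, fmt, ##args)

    /* 2. Pasting that doesn't yield a single valid token: */
    #define CAT(a, b) a ## b
    /* CAT(var, 1) -> var1 : fine, a valid identifier        */
    /* CAT(., .)   -> ..   : not a token; undefined in ISO C */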

A lot of ugliness remains, but that will be easier to clean
up once we're happy we've got working code and have binned the
old text-based expander. Many areas are much cleaner; for
example, the three places (#assert, #unassert and #if/#elif)
that need to parse assertions all use the same code now,
rather than each having its own slightly different version to
handle the slight differences in syntax.
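
For anyone who hasn't met GCC's assertion extension, all three
sites parse the same predicate(answer) form:

    #assert machine(vax)      /* make the assertion */

    #if #machine(vax)         /* test it in a conditional */
      /* VAX-specific code */
    #endif

    #unassert machine(vax)    /* retract that answer again */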

The token-based macro expansion process is quite simple in
concept, but the reality is a bit messy and hard to understand
from the source code. I'll try to clean it up and comment it
once we're sure it's working and it's in CVS.
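
The concept, boiled down to a toy that handles object-like
macros only (nothing like the real code, which also copes
with arguments, pasting and whole token lists):

    #include <stdio.h>
    #include <string.h>

    struct macro { const char *name; const char **body; int len; int busy; };

    /* Hypothetical macro table: FOO -> BAR 1, BAR -> FOO. */
    static const char *foo_body[] = { "BAR", "1" };
    static const char *bar_body[] = { "FOO" };
    static struct macro table[] = {
        { "FOO", foo_body, 2, 0 },
        { "BAR", bar_body, 1, 0 },
    };

    static struct macro *lookup(const char *tok)
    {
        size_t i;
        for (i = 0; i < sizeof table / sizeof table[0]; i++)
            if (!strcmp(table[i].name, tok))
                return &table[i];
        return NULL;
    }

    /* Expand one token: if it names a macro that isn't already being
       expanded, mark it busy (C99's no-recursion rule), expand each
       body token in turn, then unmark it.  Otherwise emit the token. */
    static void expand(const char *tok)
    {
        struct macro *m = lookup(tok);
        if (m && !m->busy) {
            int i;
            m->busy = 1;
            for (i = 0; i < m->len; i++)
                expand(m->body[i]);
            m->busy = 0;
        } else
            printf("%s ", tok);  /* real code appends to a token list */
    }

    int main(void)
    {
        expand("FOO");           /* prints "FOO 1": inner FOO stays put */
        putchar('\n');
        return 0;
    }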

After -traditional, the next stage is probably to get cpp
re-integrated with the front ends, as a library rather than a
separate process. This will cut out a lot of overhead: an
extra exec(), writing out the preprocessed file, the front end
reading it back in, and re-tokenising.

Putting the finishing touches on a macro expander that uses
the new lexer. Like the lexer, it is token-based; the lexer
and macro expander currently in the tree are both text-based.

Getting this to work has been a very frustrating
experience. Macro expansion is a hairy and convoluted
process, and stringification and token-pasting just add to
the confusion. A dense and strangely-worded C99
specification doesn't help :-)
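
The two operators in question, for reference; the double-macro
dance below is exactly the sort of corner the spec buries:

    #include <stdio.h>

    #define STR(x)  #x        /* stringify the argument's tokens */
    #define XSTR(x) STR(x)    /* extra level so x is expanded first */
    #define CAT(a, b) a ## b  /* paste two tokens into one */

    #define VERSION 3

    int main(void)
    {
        int CAT(var, 1) = 42;  /* declares "var1" */
        puts(STR(VERSION));    /* prints "VERSION": no expansion */
        puts(XSTR(VERSION));   /* prints "3": expand, then stringify */
        printf("%d\n", var1);
        return 0;
    }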

We just have a single token list, and the lexer lexes all the
tokens of the next logical line into it. However, a
function-like macro invocation can cross multiple logical
source lines. So that we don't write over the original token
list and cause chaos, in that case we append to it instead.
However, the appending can cause a realloc of the tokens
(they're stored consecutively in memory), and arguments to
macros are stored as lists of pointers to the original tokens
(which needn't be consecutive), so those pointers need to be
fixed up whenever we realloc. Other things still to do include
fixing bogus line numbers in errors and the final output, and
squeezing tokens back into 16 bytes on both 32-bit and 64-bit
architectures. We need to run it against a macro abuser like
glibc to try to turn up obscure cases we've missed.
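
The fix-up dance looks roughly like this (a toy sketch with
invented names, not the cpplib code):

    #include <stdio.h>
    #include <stdlib.h>

    struct token { const char *spelling; };

    struct token_list {
        struct token *toks;   /* tokens stored consecutively */
        size_t count, cap;
    };

    /* Append one token.  If growing the array moves it, re-point the
       argument pointers, which reference tokens inside the array. */
    static void append(struct token_list *l, const char *sp,
                       struct token **args, size_t nargs)
    {
        if (l->count == l->cap) {
            size_t i, *off = malloc(nargs * sizeof *off);
            if (nargs && !off)
                abort();
            for (i = 0; i < nargs; i++)  /* save positions as offsets */
                off[i] = (size_t) (args[i] - l->toks);
            l->cap = l->cap ? 2 * l->cap : 4;
            l->toks = realloc(l->toks, l->cap * sizeof *l->toks);
            if (!l->toks)
                abort();
            for (i = 0; i < nargs; i++)  /* fix up after a move */
                args[i] = l->toks + off[i];
            free(off);
        }
        l->toks[l->count++].spelling = sp;
    }

    int main(void)
    {
        struct token_list line = { NULL, 0, 0 };
        struct token *arg[1];

        append(&line, "f", NULL, 0);
        append(&line, "(", NULL, 0);
        append(&line, "x", NULL, 0);
        arg[0] = &line.toks[2];      /* the argument: points at "x" */

        /* The invocation continues on the next logical line, so we
           keep appending; the grow may realloc, hence pass the args. */
        append(&line, ")", arg, 1);
        append(&line, ";", arg, 1);

        printf("argument still points at \"%s\"\n", arg[0]->spelling);
        free(line.toks);
        return 0;
    }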

Ah, almost forgot the gem that is -traditional support. Not
sure what's best there; I think getting everything right would
need a separate pre-pass that does traditional macro text
splicing. However, that would lose line and column information
and just be a maintenance headache. It's probably best just to
support everything we reasonably can in the token-based
environment, and drop the really weird stuff like half-strings
and macro expansion within strings.
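
Macro expansion within strings being things like the
following, which a token-based preprocessor simply can't see
inside:

    /* Traditional cpp substituted parameters even inside string
       literals, so with -traditional:

           #define str(x) "x"
           str(hello)            ->  "hello"

       ISO C never looks inside a string; you stringify explicitly: */
    #define str(x) #x
    /* str(hello)                ->  "hello" */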