>It has been warned many times in books such as by Appel and the red
>dragon book that compilers need to be conservative. If I can get more
>ideas on what might be hindering compilers from applying optimizations,
>then I could rewrite the code to make the work easier for the
>compilers, or to manually perform such optimizations. After all, I do
>not need to be conservative. I do not even need to preserve the
>semantics of the program: if I find that I can get similar semantics
>which is just as correct but results in faster executables, then I
>will.

The biggest problem for developers on modern OSes is the failure to
recognize the impact of virtual memory on large data sets.

Tricky coding and trying to outguess the compiler can easily result in
_worse_ performance. Better algorithms, clean simple code, making
sure data is properly aligned and reorganizing data to better match
VMM paging behavior (tiling, etc.) buys you much more than playing at
being a compiler.
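To make the tiling point concrete, here is a minimal sketch (my example, not from the original post) of reorganizing a matrix transpose so its working set matches cache and page behavior; the names and sizes are illustrative:

```c
#include <stddef.h>

#define N 1024
#define TILE 64   /* chosen so a TILE x TILE block fits in a small working set */

/* Naive transpose: the writes to dst stride down a column, touching a
 * different cache line (and periodically a different page) on nearly
 * every iteration. */
void transpose_naive(double dst[N][N], double src[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Tiled transpose: works on TILE x TILE blocks so both src and dst stay
 * within a small, cache- and page-friendly footprint at any moment. */
void transpose_tiled(double dst[N][N], double src[N][N])
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

Both functions compute the same result; only the memory access order differs, which is exactly the kind of "clean, simple" restructuring that beats trying to outguess the compiler.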

It is true that the compiler must be conservative ... it is not
supposed to alter the program's semantics (modulo compile "modes" and
conditional compilation) ... but it is also true that good compilers
are smarter and more observant than all but the very best developers.
If a good compiler passes up an opportunity to apply some meaningful
optimization it is *usually* because it has seen a problem the
developer has overlooked.

>Various tricks in tiger books; the red dragon book; and "Computer
>Architecture: A Quantitative Approach" (or superior variants of those
>tricks as appeared in the books "More Tricks of the Game Programming
>Gurus" and the excellent "Black Art of Macintosh Game Programming" by
>Kevin Tieskoetter) have helped me to speed up an as-yet unreleased
>version of the Monte Carlo particle simulator Cosima
>( WWW.MPE.MPG.De/MEGA/megalib.html ).

That's great. But keep in mind that *many* of those "tricks" are
special purpose and/or architecture specific ... moving up or down one
generation in CPU or changing the number of cores may invalidate
everything and result in a net loss of performance.

CPU specific optimizations apart from data alignment are much more
difficult to analyze for value and overall are worth less than a more
general code optimization in the face of multi-tasking. For example,
it is feasible to statically determine optimal data placements so as
to minimize direct mapped cache conflicts ... but there's little point
to doing this complicated analysis if the cache is expected to be
polluted by unrelated programs.
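For what it's worth, the static-placement idea looks roughly like this sketch (mine, with a hypothetical 32 KB direct-mapped cache and 64-byte lines assumed): two arrays placed exactly a cache-size apart map to the same sets, and one line of padding separates them.

```c
#include <stddef.h>

/* Hypothetical parameters: a 32 KB direct-mapped cache, 64-byte lines.
 * If a and b sat exactly CACHE_BYTES apart, a[i] and b[i] would map to
 * the same cache set and evict each other on every lockstep iteration. */
#define CACHE_BYTES (32 * 1024)
#define LINE_BYTES  64
#define NELEMS      (CACHE_BYTES / sizeof(double))

struct vectors {
    double a[NELEMS];
    char   pad[LINE_BYTES];   /* shifts b into different cache sets */
    double b[NELEMS];
};

/* A loop that touches both arrays in lockstep; with the padding, a[i]
 * and b[i] no longer collide in the cache. */
double dot(const struct vectors *v)
{
    double s = 0.0;
    for (size_t i = 0; i < NELEMS; i++)
        s += v->a[i] * v->b[i];
    return s;
}
```

And, as noted above, the analysis only pays off if the cache isn't being thrashed by unrelated processes anyway.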

>Using C++ was really not a good idea. Expect to see a version with
>a lot more Ada on the Internet later this year.

YMMV but I think you are naïve to expect a whole lot of performance
improvement to come simply from switching to Ada. Once you step
outside the restricted real-time subset, Ada and C++ provide equal
opportunity for a programmer to derail potential optimizations and
produce slow, bloated code. Ada's stricter type system *occasionally*
can provide an opportunity for increased optimization vs. C++, but
such opportunities are relatively rare and always are architecture
specific.

>We unfortunately target x86s, but we will consider targeting other
>types of architectures such as FPGAs; GPUs; and supercomputers. As we
>already have x86s and might adopt something much faster, I do not
>expect that we shall target RISC machines though they probably would
>have been better to start with than x86s.

There is little internal difference between a modern RISC and a modern
x86 - the x86 has too few programmer visible register names and a
complex instruction set, but behind the instruction decoder it is an
out-of-order superscalar RISC with hundreds of rename registers.

In actual fact, the densely packed complex instruction set of the x86
helps reduce memory pressure vs. the fixed width instructions of a
traditional RISC (not that any modern RISCs are "traditional" - they
all sport variable length instructions now).

There is an ongoing thread in comp.arch talking about this right now.

The one place that x86 compilers routinely do stumble is in stack
handling. The x86 prefers certain data alignment but none is actually
*required* (even for SSE) ... and it is quite hard to guarantee that a
value on the stack will always be properly aligned for most efficient
handling - particularly when some performance obsessed developer has
switched off stack framing and function preambles and makes
(in)judicious use of alloca() (or local equivalent).
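Rather than hoping the stack cooperates, you can request alignment explicitly. A minimal sketch (mine, using C11 `alignas`; the function name is made up):

```c
#include <stdalign.h>

/* A plain float array on the stack is only guaranteed 4-byte alignment;
 * SSE's aligned loads (movaps) want 16 bytes.  alignas(16) turns the
 * alignment into something the compiler can *prove*, not just hope for. */
float sum16(void)
{
    alignas(16) float v[16];   /* address of v is a multiple of 16 by construction */
    float s = 0.0f;
    for (int i = 0; i < 16; i++)
        v[i] = (float)i;
    for (int i = 0; i < 16; i++)
        s += v[i];
    return s;
}
```

With the guarantee in place, the compiler is free to emit aligned SSE loads for the summation loop instead of defensive unaligned ones.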

GPUs and separate memory multi-computers are a different ball game.
They require very different algorithms from general shared memory
machines. GPUs themselves are not all alike - they all try to exploit
massive data parallelism but their particular ISAs may be focused on
either SIMD or MIMD style processing or, in some cases, may be able to
go either way.

FPGAs implement algorithms directly in hardware (more or less - many
now have scratchpad RAM so you don't have to implement register level
logic) and are, in general, unlike any kind of CPU programming.

>Feedback [on] the following books ...

I haven't read many of these, but if you've studied compiler basics
already, then you can safely ignore any intro books (if a book has
chapters on parsing, you can pretty much skip it).

For optimization purposes you need to concentrate on various forms of
dependence analysis. It's important to know what the compiler looks
for - and what it does *not* want to find - when it tries to optimize
code.
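A minimal sketch of what "what it does *not* want to find" means in practice (my example; the function names are illustrative):

```c
/* Two superficially similar loops.  The first has no loop-carried
 * dependence: iterations are independent, and the restrict qualifiers
 * rule out aliasing between a and b, so the compiler can vectorize it.
 * The second carries a dependence from a[i-1] to a[i]: each iteration
 * reads the previous iteration's write, so no legal reordering exists
 * and it must run serially. */
void scale(float *restrict a, const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 2.0f;        /* each iteration stands alone */
}

void prefix_scale(float *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] * 2.0f;    /* loop-carried dependence */
}
```

Without `restrict`, possible aliasing between `a` and `b` is precisely the kind of problem a good compiler sees and a developer overlooks.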

By far, the bulk of research on performance optimization has been done
for Fortran. You'll need to be able to read Fortran and transliterate
Fortran examples into your preferred implementation language(s).
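One transliteration pitfall worth flagging (my sketch, not from the post): Fortran arrays are column-major, C arrays are row-major, so a literal index-for-index translation inverts the memory access pattern.

```c
/* The Fortran idiom
 *
 *     DO 10 J = 1, N
 *       DO 10 I = 1, N
 * 10      A(I,J) = 0.0
 *
 * walks memory contiguously because Fortran is column-major.  A literal
 * C translation of the same loop nest would stride badly; the row-major
 * C equivalent swaps the loop nesting instead: */
void zero_matrix(int n, double a[n][n])
{
    for (int i = 0; i < n; i++)        /* rows outermost ...              */
        for (int j = 0; j < n; j++)    /* ... contiguous elements innermost */
            a[i][j] = 0.0;
}
```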

>Monica S. Lam, "A Systolic Array Optimizing Compiler"
Interesting but irrelevant unless you actually have a systolic array.
SAs are essentially a fixed pipeline of (relatively) simple processing
steps that together perform some complex computation on streaming
data. They are almost always dedicated, single purpose machines and
are not very common.

>John R. Ellis, "Bulldog: A Compiler for VLIW Architectures"
Interesting, but focused on low level instruction selection which is
way below the level of optimizations you want to target.

>Leland Beck, "System Software: An Introduction to Systems Programming"
Haven't read this one, but Beck's stuff tends to be good.

>Alpha Architecture Handbook, DEC
The Alpha is dead.

>Pappas & Murray, "80386 Microprocessor Handbook"
Way out of date. The 486 is still used in embedded systems, but it
has a different internal architecture and optimization for 486 is
quite different from that for the 386. The 486 and original Pentium have much
more in common internally than do the 386 and 486 (selecting
Pentium/P5 as the target is a cheap way to make some 486 programs run
faster).

>"Games Programming Gems 4", maybe other books in the series
The "Programming Gems" books, in general, are worth reading ... but
keep in mind that many of their tricks are architecture and
program/purpose specific rather than general principles to follow.

>Kevin Dowd, "High Performance Computing", O'Reilly & Associates,
>we own a copy of this but I have not had time to look at it yet: if it
>is worthwhile then we will probably want to buy another copy
Worthwhile.

>Hatcher, "Data-Parallel Programming on MIMD Computers", The MIT Press,
>we own a copy of this but I have not had time to look at it yet: if it
>is worthwhile then we will probably want to buy another copy
Haven't read this one but it might be worthwhile.

Many of the others you listed (did you cut-n-paste a biblio search?)
are too old, too basic, too general, or just irrelevant to your stated
purpose.

Unless you have a very specific task that needs a special purpose
architectural tweak to achieve your performance goal, you will want to
stay with more general knowledge of what compilers are looking for.
Follow the path of dependence analysis and you won't go wrong.