>> Maybe. You've set up a straw-man where you have tried to deny all
>> ways that changing the order of function calls might affect
>> performance and then asked will performance remain unchanged.
...

> I agree with you on that. That's obvious. But I was actually talking
> about the rearrangement of functions in memory while the order of
> function invocations remains the same.

If you are talking simply about re-ordering the location of functions
in memory, but not about re-ordering the calls, then cache (paging,
TLB, and related) penalties are in fact the main concern. (And my
apologies for misunderstanding previously.) One is trying to keep
rarely executed locations in the instruction area out of the working
set. For those cases cache (et al.) effects are truly important.
Don't dismiss the significance of that.

One of the key results I note in that area comes from Steve Sharp who
works at Cadence Design on the nc-verilog compiler. Verilog
applications are very large (relative to other apps) with a very large
working set, and overflow the cache on even the largest processors.
Moreover, the nature of Verilog programs makes it appear that the code
"randomly" wanders about its instruction space, rather than staying in
some "small" tight loops. As such, the best optimization algorithms
to apply to code generated for Verilog programs are ones that reduce
the code or data size, because recovering from a cache miss takes
about 50 cycles, and thus reducing the frequency of cache misses is
the most important thing one can do. In fact, in Verilog the problem
is so extreme that one can waste a few instruction executions in each
code sequence to minimize the total code or data footprint and still
get a positive overall effect. There are very few optimizations
(other than those applied to inner loops of array algorithms) that
will save you 50 cycles.

What applies to Verilog applies to a lesser extent to most programs.
If you can reduce the frequency of cache misses (or page faults, less
frequent but more expensive, or TLB misses, somewhere in the middle),
then you will be saving large numbers of cycles. For the code outside
the inner loops, that is likely to be at least as big a factor as
removing instructions from the code path.

> How do compiler designers decide which combinations/sequences
> of optimizations seem to be promising and are put together into an
> optimization level (like gcc's -O)?
>
> Is this a trial-and-error approach assisted with some experience?

In general, one only has a limited set of optimizations that one
actually knows how to implement. Thus, one implements them all.
There are some heuristics (and background knowledge) that help one
decide the order in which to apply optimizations and which ones to
turn on at which levels.

For example, one generally does the "local" (within a basic block)
optimizations first (at the lowest optimization level) as they have
the most payback for the fewest resources and are also the safest.
Global "loop" optimizations are generally next, and there is some
ordering amongst them (e.g. common-subexpression elimination is
usually one of the first global optimizations). Once that set has
been mined, one finally applies the interprocedural optimizations.

Thus, one usually applies more local optimizations before more global
ones. Much of the effectiveness of the global optimizations depends
on the local optimizations having done the correct thing first. One
also generally does simple optimizations before complex ones, again
because of the dependencies. If you don't get the local simple stuff
right, changing the global complicated stuff becomes more a case of
rolling the dice.