Mono is a JIT compiler and as such compiles a method only
when needed: the moment the flow of execution first requires
it. This mode of execution greatly improves application
startup time and is implemented with a simple trick: when a
call to a method is compiled, the generated native code
can't transfer execution to that method's native code
address, because the method hasn't been compiled yet.
Instead, execution goes through a magic trampoline: a chunk
of code that knows which method is about to be executed,
compiles it, and jumps to the generated code.

The way the trampoline knows which method to compile is
pretty simple: for each method a small specific trampoline
is created that passes a pointer to the method to the real
worker, the magic trampoline.

Different architectures implement this trampoline in
different ways, but each aims to reduce its size: many
trampolines are generated, so together they use quite a bit
of memory.

Mono in svn has quite a few improvements in this area
compared to Mono 1.2.5, which was released just a few weeks
ago. I'll try to detail the major changes below.

The first change is related to how the memory for the
specific trampolines is allocated: this is executable
memory, so it is not allocated with malloc, but with a
custom allocator called the Mono Code Manager. Since the
code manager is used primarily for methods, it allocates
chunks of memory aligned to multiples of 8 or 16 bytes,
depending on the architecture: this allows the CPU to fetch
the instructions faster. But the specific trampolines are
not performance critical (we'll spend lots of time JITting
the method anyway), so they can tolerate a smaller
alignment. Considering that most trampolines are allocated
one after the other and that on most architectures they are
10 or 12 bytes, this change alone saved about 25% of the
memory used (they used to be aligned up to 16 bytes).

To give a rough idea of how many trampolines are generated,
a few examples: starting up MonoDevelop creates about 21000
specific trampolines, an IronPython run about 17000, and
even a simple hello world program about 800.

In the first case this change saved more than 80 KB of
memory (plus about the same again, because reviewing the
code also allowed me to fix a related overallocation issue).

So reducing the size of the trampolines is great, but it's
really not possible to shrink them much further, if at all.
The next step is to avoid creating them in the first place.
There are two primary reasons a trampoline is generated:
either a direct call to the method is compiled, or a
virtual table slot is filled with a trampoline for the case
when the method is invoked through a virtual call. I'll note
here that in both cases, after compiling the method, the
magic trampoline makes the needed changes so that the
trampoline is not executed again and execution goes directly
to the newly compiled code. In the direct call case the
callsite is changed so that the branch or call instruction
transfers control to the new address. In the virtual call
case the magic trampoline changes the virtual table slot
directly.

The sequence of instructions the JIT uses to implement a
virtual call is well known, so the magic trampoline (by
inspecting the registers and the code sequence) can easily
recover the virtual table slot used for the invocation. The
idea then is: if we know the virtual table slot, we also
know the method that is supposed to be compiled and
executed, since the class loader assigns each vtable slot a
unique method. This simple fact allows us to use a
completely generic trampoline in the virtual table slots,
avoiding the creation of many method-specific trampolines.

In the cases above, the number of generated trampolines goes
from 21000 to 7700 for MonoDevelop (saving 160 KB of
memory), from 17000 to 5400 for the IronPython case and from
800 to 150 for the hello world case.

I'll describe more optimizations (both already committed and
forthcoming) in the next blog posts.