This morning Google Reader brought up this
blog post about a technique used internally by the MS CLR
version 4.0.
It is basically a way to share complex code between
different architectures, thus reducing the maintenance
burden and allowing better compatibility in the tricky area
of marshalling (when the runtime goes from managed code to
unmanaged code it needs to perform many operations,
including massaging the data exchanged between the two
worlds).

At the beginning Mono had specialized x86 chunks of code to
do the work, but we soon realized the problems with that.
At the time I proposed to use a small, specialized bytecode
set (after all the operations involved are very few), which
would have been great for the interpreter I was working on.
Dietmar instead pushed for the use of IL bytecode: this
would have been a slower solution for the interpreter, but
it would allow access to a wider set of operations and it
would reduce the burden on the JIT backend, which, after
all, already had an IL frontend.
We went for the IL solution (the time frame was the
beginning of 2002) and we're glad to see that MS chose to
adopt the same technique a few years later.

There are two important considerations about this story in
the context of Mono development:

First, we started to always choose techniques that would
favour the JIT over the interpreter, both in terms of speed
and maintenance burden. This was a path that eventually led
to the discontinuation of the interpreter.

Second, we built a codebase that is easily portable across
architectures, to the point that we consider a JIT port only
slightly more complicated than an interpreter port (because
of the many features of Mono/CLR even an interpreter needs
quite a bit of very low-level knowledge about how an
architecture works).

As an extension, we later used the same IL-based technique
to implement many other runtime helper methods that would
otherwise have had to be written in low-level
architecture-specific code: remoting helpers, garbage
collection fast paths, delegate runtime methods etc.
For the curious, most of this code is in metadata/marshal.c
in the Mono source code.

One of the things that have been sitting on my TODO list for
a few years is to improve the performance of the Regex
engine in Mono, both by speeding up the interpreter and by
compiling the regular expressions to IL code so that the JIT
could optimize it. This seemed like a good project for hack
week and Zoltan joined me in doing the implementation.

I worked on a new compiler/interpreter combination that uses
simplified opcodes, with the aim of making their execution
faster and also making the translation to IL code easier. As
an example, the old interpreter used a 'Char' opcode to
match a single char, but this opcode has several runtime
options to ignore the case, negate the assertion and go
backward in the string. Each option required decoding and
conditional branching at runtime, so removing this overhead
should improve performance significantly.
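
To make the decoding overhead concrete, here is a minimal sketch in Python (not the Mono code, which is C#). The flag names, the function names and the idea of returning an (opcode, operand) pair are all assumptions made for illustration. It contrasts one general 'Char' opcode that inspects its options on every execution with specialized opcodes picked once, when the regex is compiled.

```python
# Hypothetical option flags for the old-style 'Char' opcode.
IGNORE_CASE, NEGATE, REVERSE = 1, 2, 4


def char_generic(text, pos, ch, flags):
    """Old style: one opcode, every option is decoded on each execution.

    Returns the new position on a match, or None on failure.
    """
    if flags & REVERSE:
        pos -= 1                      # look at the char before the cursor
        if pos < 0:
            return None
        c, new_pos = text[pos], pos
    else:
        if pos >= len(text):
            return None
        c, new_pos = text[pos], pos + 1
    if flags & IGNORE_CASE:
        c, ch = c.lower(), ch.lower()
    matched = (c == ch)
    if flags & NEGATE:
        matched = not matched
    return new_pos if matched else None


def char_exact(text, pos, ch):
    """New style: forward, case-sensitive, not negated. Nothing to decode."""
    return pos + 1 if pos < len(text) and text[pos] == ch else None


def char_ignore_case(text, pos, ch_lower):
    """New style: the operand was lowercased once, at compile time."""
    return pos + 1 if pos < len(text) and text[pos].lower() == ch_lower else None


def compile_char(ch, flags):
    """Pick a specialized opcode once, when the regex is compiled."""
    if flags == 0:
        return char_exact, ch
    if flags == IGNORE_CASE:
        return char_ignore_case, ch.lower()
    # Remaining combinations fall back to the generic opcode in this sketch.
    return (lambda text, pos, c: char_generic(text, pos, c, flags)), ch
```

With this shape, `op, operand = compile_char("a", IGNORE_CASE)` resolves the options once and every later `op(text, pos, operand)` call does only the comparison.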

Zoltan made a clever observation: he could reuse the new
interpreter for bytecodes that the IL engine he was working
on couldn't yet handle and so he used dynamic methods with
the same signature as the method in the interpreter that
evaluates the regex bytecode. This design also has the nice
property that compiled regular expressions will be garbage
collected (as opposed to the MS runtime that, at least in
the 1.1 version, will leak in this case).

IL-compiling regular expressions has several benefits: it
completely removes dispatch and decoding overhead and the
JIT will use very fast instructions like compare with
immediate to implement the 'Char' bytecode.

I used a few microbenchmarks to test the speed of the new
engines, but I'll report here just the results of running
the regexdna test from the language shootout: in other cases
the speedup is even bigger.

Old interpreter: 10.1 seconds

New interpreter: 5.4 seconds

IL engine: 1.3 seconds.

Most of the new code is in svn, though it's not enabled
since it's still incomplete. We'd need another week or so to
make it usable instead of the old engine and we haven't
allocated the time yet to complete this work, but it sure
looks promising.

C# as a programming language is still young and has evolved
nicely in the last few years: most of the new features take
it closer to a high-level language, allowing shorter and
nicer code for common patterns. A language needs to be able
to cover a wide spectrum of usages to be considered mature
but there is one side of the spectrum that has been
neglected in C# for far too long: the low level side close
to IL code.

Someone already hacked up a program that uses a preprocessor
and ilasm to inject IL into C# code, but this approach has
many drawbacks (too many to list them all here).

Inline IL code should integrate with the rest of the C#
program as closely as possible, allowing, for example,
branches in IL code to reference labels defined in C# code,
or opcodes like ldloc and ldarg to use the names of local
variables and arguments.

The proposal here is to allow IL code as a new statement in
the form:

unsafe LITERAL_STRING;

This is similar to the traditional way inline assembly has
been done (gcc's __asm__ ("code") statement). It's very
unlikely to clash with other possible uses of the unsafe
keyword and it also conveys the notion that IL code may break
type safety, IL language rules etc. It also has the added
property that all the code needed to implement this support
could be easily copied in a separate library and used in
standalone programs to, say, simplify code emission for
Reflection.Emit (this inline IL support has been implemented
inside the Mono C# compiler, so it's C# code that uses
Reflection.Emit as the backend).

So, without further ado, the standard sample program written
with inline IL:

Ok, that kind of code is written more easily in C# proper,
so what about things that IL code can do, but that C# code
can't? Ever wanted to be able to change the value of a boxed
integer? In C# you can't, but this is very easy with inline IL:

Note that in this case, the compiler won't emit warnings
about field1 and localvar being never used and of course
you'll get an error if you misspell the field name in IL code
as you would in C# code.

The main usage of the new feature would be for some corlib
methods in mono or for more easily implementing test cases
for the JIT and runtime test suites: some specific IL code
patterns (ones that cannot be expressed in C#, or that the
C# compiler is not guaranteed to emit) can be
easily written while the rest of the boilerplate code needed
by the unit testing program can be written in much more
readable C#. That said, this opens many possibilities for a
creative hacker, finally free of the constraints of C#.

Mono is a JIT compiler and as such it compiles a method only
when needed: the moment the execution flow requires the
method to execute. This mode of execution greatly improves
startup time of applications and is implemented with a
simple trick: when a method call is compiled, the generated
native code can't transfer execution to the method's native
code address, because it hasn't been compiled yet. Instead
it will go through a magic trampoline: this chunk of code
knows which method is going to be executed, so it will
compile it and jump to the generated code.

The way the trampoline knows which method to compile is
pretty simple: for each method a small specific trampoline
is created that will pass the pointer to the method to
execute to the real worker, the magic trampoline.
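
The mechanism can be sketched in a few lines of Python. This is only a model under stated assumptions: `Method`, `jit_compile`, `compile_log` and the `entry` field are hypothetical names, and "compiling" just returns the function body. The real trampolines are small chunks of native code. The sketch also includes the step, described further down in this post, where the call target is patched after compilation.

```python
compile_log = []          # records every (simulated) JIT compilation


class Method:
    def __init__(self, name, body):
        self.name = name
        self.body = body      # stands in for the method's IL
        self.entry = None     # where calls to this method currently land


def jit_compile(method):
    """Hypothetical stand-in for the JIT: 'compiling' just returns the body."""
    compile_log.append(method.name)
    return method.body


def magic_trampoline(method, *args):
    """The shared worker: compile, patch the entry point, run the new code."""
    native = jit_compile(method)
    method.entry = native     # later calls skip the trampoline entirely
    return native(*args)


def make_specific_trampoline(method):
    """Small per-method stub whose only job is to pass `method` along."""
    def specific_trampoline(*args):
        return magic_trampoline(method, *args)
    return specific_trampoline


square = Method("square", lambda x: x * x)
square.entry = make_specific_trampoline(square)

first = square.entry(3)    # goes through the trampolines and compiles
second = square.entry(4)   # goes straight to the compiled code
```

Only the first call pays for compilation. One specific trampoline exists per method, which is why their size and number matter for memory use.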

Different architectures implement this trampoline in
different ways, but each with the aim of reducing its size:
the reason is that many trampolines are generated and so
they use quite a bit of memory.

Mono in svn has quite a few improvements in this area
compared to mono 1.2.5, which was released just a few weeks
ago. I'll try to detail the major changes below.

The first change is related to how the memory for the
specific trampolines is allocated: this is executable memory
so it is not allocated with malloc, but with a custom
allocator, called Mono Code Manager. Since the code manager
is used primarily for methods, it allocates chunks of memory
that are aligned to multiples of 8 or 16 bytes depending on
the architecture: this allows the cpu to fetch the
instructions faster. But the specific trampolines are not
performance critical (we'll spend lots of time JITting the
method anyway), so they can tolerate a smaller alignment.
Considering the fact that most trampolines are allocated one
after the other and that in most architectures they are 10
or 12 bytes, this change alone saved about 25% of the memory
used (they used to be aligned up to 16 bytes).

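To give a rough idea of how many trampolines are generated,
I'll give a few examples: about 21000 for MonoDevelop, about
17000 for the IronPython case and about 800 for a hello
world program.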

This change in the first case saved more than 80 KB of
memory (plus about the same again, because reviewing the code
also allowed me to fix a related overallocation issue).
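
The arithmetic behind these figures can be checked with a short Python sketch. Two things here are assumptions: the new alignment of 4 bytes (the text only says "smaller"), and that "the first case" is the MonoDevelop run with about 21000 trampolines, the count quoted further down in this post.

```python
def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


TRAMPOLINE_SIZE = 12      # bytes, one of the sizes quoted in the text
OLD_ALIGNMENT = 16        # bytes, from the text
NEW_ALIGNMENT = 4         # bytes, ASSUMED: the text only says "smaller"
TRAMPOLINES = 21000       # MonoDevelop count quoted in this post

old_footprint = align_up(TRAMPOLINE_SIZE, OLD_ALIGNMENT)   # 16
new_footprint = align_up(TRAMPOLINE_SIZE, NEW_ALIGNMENT)   # 12
saved_fraction = (old_footprint - new_footprint) / old_footprint
saved_bytes = TRAMPOLINES * (old_footprint - new_footprint)

print(f"{saved_fraction:.0%} saved per trampoline, {saved_bytes} bytes overall")
```

Under those assumptions each trampoline shrinks from 16 to 12 bytes, which is the 25% mentioned above, and 21000 trampolines give back 84000 bytes, in line with "more than 80 KB".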

So reducing the size of the trampolines is great, but it's
really not possible to reduce them much further in size, if
at all. The next step is trying just not to create them.
There are two primary ways a trampoline is generated: a
direct call to the method is made or a virtual table slot is
filled with a trampoline for the case when the method is
invoked using a virtual call. I'll note here that in both
cases, after compiling the method, the magic trampoline will
do the needed changes so that the trampoline is not
executed again, but execution goes directly to the newly
compiled code. In one case the callsite is changed so that
the branch or call instruction will transfer control to the
new address. In the virtual call case the magic trampoline
will change the virtual table slot directly.

The sequence of instructions used by the JIT to implement a
virtual call is well known and the magic trampoline
(inspecting the registers and the code sequence) can easily
get the virtual table slot that was used for the invocation.
The idea here then is: if we know the virtual table slot we
know also the method that is supposed to be compiled and
executed, since each vtable slot is assigned a unique method
by the class loader. This simple fact allows us to use a
completely generic trampoline in the virtual table slots,
avoiding the creation of many method-specific trampolines.
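
Here is a rough Python model of that idea, with hypothetical names throughout (`VTable`, `generic_trampoline`, `virtual_call`, `jit_compile`). In native code the trampoline recovers the vtable and slot from the registers and the call sequence. The sketch cannot do that, so it assumes a calling convention where both are passed along on every call.

```python
compile_log = []


def jit_compile(body):
    """Hypothetical stand-in for the JIT; wraps the body as 'native code'."""
    compile_log.append(body.__name__)

    def native(vtable, slot, *args):
        return body(*args)
    return native


def generic_trampoline(vtable, slot, *args):
    """One trampoline shared by every vtable slot.

    The slot alone identifies the method, so no per-method stub is needed.
    """
    body = vtable.methods[slot]
    native = jit_compile(body)
    vtable.slots[slot] = native      # patch the slot: next call is direct
    return native(vtable, slot, *args)


class VTable:
    def __init__(self, methods):
        self.methods = list(methods)                       # slot -> method
        self.slots = [generic_trampoline] * len(methods)   # all share one stub


def virtual_call(vtable, slot, *args):
    # Native code recovers (vtable, slot) from registers and the call
    # sequence; this sketch simply passes them along on every call.
    return vtable.slots[slot](vtable, slot, *args)


def to_string(x):
    return str(x)


def get_hash(x):
    return x % 7


vt = VTable([to_string, get_hash])
result = virtual_call(vt, 1, 23)     # compiles get_hash only
```

However many slots the table has, a single trampoline object is shared by all of them, and a slot is replaced only when its method is first called.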

In the cases above, the number of generated trampolines goes
from 21000 to 7700 for MonoDevelop (saving 160 KB of
memory), from 17000 to 5400 for the IronPython case and from
800 to 150 for the hello world case.

I'll describe more optimizations (both already committed and
forthcoming) in the next blog posts.

I guess some of you expected a blog entry about the
generational GC in Mono, given the title. From my
understanding many have the expectation that the new GC will
solve all the issues they think are caused by the GC so they
await with trepidation.
As a matter of fact, from my debugging of all or almost all
those issues, the existing GC is not the culprit. Sometimes
there is an unmanaged leak, sometimes a managed or unmanaged
excessive retention of objects, but basically 80% of those
issues that get attributed to the GC are not GC issues at
all.
So, instead of waiting for the holy grail, provide test
cases or as much data as you can for the bugs you
experience, because chances are that the bug can be fixed
relatively easily without waiting for the new GC to
stabilize and get deployed.
Now, this is not to say that the new GC won't bring great
improvements, but that those improvements are mainly in
allocation speed and mean pause time, both of which, while
measurable, are not bugs per se and so are not part of the
few issues that people hit with the current Boehm-GC based
implementation.

After the long introduction, let's go to the purpose of this
entry: svn Mono now can perform an object allocation
entirely in managed code. Let me explain why this is
significant.

The Mono runtime (including the GC) is written in C code and
this is called unmanaged code as opposed to managed code
which is all the code that gets JITted from IL opcodes.
The JIT and the runtime cooperate so that managed code is
compiled in a way that lets the runtime inspect it, inject
exceptions, unwind the stack and so on. The unmanaged code,
on the other hand, is compiled by the C compiler and on most
systems and architectures, there is no info available on it
that would allow the same operations. For this reason,
whenever a program needs to make a transition from managed
code to unmanaged (for example for an internal call
implementation or for calling into the GC) the runtime needs
to perform some additional bookkeeping, which can be relatively
expensive, especially if the amount of code to execute in
unmanaged land is tiny.

For a while now we have made use of the Boehm GC's ability to
allocate objects in a thread-local fast-path, but we
couldn't take the full benefit of it because the cost of the
managed to unmanaged and back transition was bigger than the
allocation cost itself.
Now the runtime can create a managed method that performs
the allocation fast-path entirely in managed code, avoiding
the cost of the transition in most cases. This
infrastructure will also be used for the generational GC
where it will be more important: the allocation fast-path
sequence there is 4-5 instructions vs the dozen or more of
the Boehm GC thread local alloc.
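
To show why such a fast path is small enough to be worth keeping in managed code, here is a Python sketch of a bump-pointer allocator with a thread-local buffer. This is an assumption about the general shape of a generational allocation fast path, not Mono's code, and the Boehm thread-local path is different and longer, as noted above. All names (`ThreadLocalBuffer`, `fast_path_alloc`, `slow_path_alloc`) and sizes are hypothetical.

```python
ALIGNMENT = 8
CHUNK_SIZE = 64           # deliberately tiny so the slow path is easy to hit


class ThreadLocalBuffer:
    """Per-thread allocation area: a cursor and a limit, nothing else."""
    def __init__(self):
        self.next = 0
        self.limit = 0
        self.heap_top = 0         # where the next fresh chunk would start
        self.slow_path_calls = 0  # counts simulated managed->unmanaged trips


def slow_path_alloc(buf, size):
    """Hypothetical stand-in for calling into the runtime: grab a new chunk."""
    buf.slow_path_calls += 1
    chunk = max(CHUNK_SIZE, size)
    buf.next = buf.heap_top
    buf.limit = buf.next + chunk
    buf.heap_top = buf.limit
    return fast_path_alloc(buf, size)


def fast_path_alloc(buf, size):
    """The part worth keeping in managed code: add, compare, store, return."""
    size = (size + ALIGNMENT - 1) & ~(ALIGNMENT - 1)
    new_next = buf.next + size
    if new_next <= buf.limit:
        address = buf.next
        buf.next = new_next
        return address
    return slow_path_alloc(buf, size)
```

Most calls take only the first branch of `fast_path_alloc`. The expensive transition, modelled by `slow_path_alloc`, happens once per chunk instead of once per object.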

As for actual numbers, a benchmark that repeatedly allocates
small objects is now more than 20% faster overall (overall
includes the time spent collecting the garbage objects, the
actual allocation speed increase is much bigger).

I uploaded version 0.2 of the monocov coverage tool for Mono
here.
It is also available from the monocov svn module from the
usual Mono svn server.

The release features an improved Gtk# GUI, fixes to html
rendering and other minor improvements.
The usage is pretty simple, just run your program or test
suite with the following command after having installed monocov:

mono --debug --profile=monocov program.exe

The coverage information will be output to the
program.exe.cov file. Now you can load this file in the GUI
with:

monocov program.exe.cov

and browse the namespaces for interesting types you want to
check code coverage for. Double clicking on a method will
bring up a viewer with the source file of the method with
the lines of code not reached by execution highlighted in
red.

To limit the collection of data to a specific assembly you
can specify it as an argument to the profiler. For example,
to consider only the code in mscorlib, use:

mono --debug --profile=monocov:+[mscorlib] test-suite.exe

To be able to easily collect coverage information from the
unit tests in the mono mcs directory you can also run the
test suite as follows, for example in mcs/class/corlib:

make run-test RUNTIME_FLAGS="--profile=monocov:outfile=corlib.cov,+[mscorlib]"

Monocov can also generate a set of HTML pages that display
the coverage data. Here are the files generated when running
the nunit-based test suite for mono's mscorlib with the
following command:

monocov --export-html=/tmp/corlib-cov corlib.cov

Hopefully this tool will help both new and old contributors
to easily find untested spots in our libraries and
contribute tests for them.
Happy testing!

I just committed to svn a small function that can be used to
help debug deadlocks that result from the incorrect use of
managed locks.

Managed locks (implemented in the Monitor class and usually
invoked with the lock () construct in C#) are subject to the
same incorrect uses of normal locks, though they can be
safely taken recursively by the same thread.

One of the obviously incorrect ways to use locks is to have
multiple locks and acquire them in different orders in
different codepaths. Here is an example:

I added an explicit Sleep () call to make the race condition
happen almost every time you run such a program. The issue
with such deadlocks is that usually the race time window is
very small and it will go unnoticed during testing. The new
feature in the mono runtime is designed to help find the
issue when a process is stuck and we don't know why.

Now you can attach to the stuck process using gdb and issue
the following command:

We can see that there are three locks currently held by
three different threads. The first has been recursively
acquired 2 times. The other two are more interesting because
they each have a thread waiting on a semaphore associated
with the lock structure: they must be the ones involved in
the deadlock.

Once we know the threads that are deadlocking and the
objects that hold the lock we might have a better idea of
where exactly to look in the code for incorrect ordering of
lock statements.

In this particular case it's pretty easy since the objects
used for locking are static fields. The easy way to get the
class is to notice that the object which is locked twice
(0x2ffd8) is of the same class as the static fields:

Starting with Mono version 1.2.1, the Mono JIT supports the
new ARM ABI (also called gnueabi or armel). This is the same
ABI used by the 2006 OS update of the Nokia 770 and it
should be good news for all the people that asked me about
having Mono run on their newly-flashed devices.

The changes involved enhancing the JIT to support soft-float
targets (this work will also help people porting mono to
other embedded architectures without a hardware floating
point instruction set) as well as the ARM-specific call
convention changes. There was also some hair-pulling
involved, since the gcc version provided with scratchbox
goes into an infinite loop while compiling the changed
mini.c sources when optimizations are enabled, but I'm sure
you don't want to know the details...

This was not enough, though, to be able to run Gtk#
applications on the Nokia 770. When I first ran a simple
Gtk# test app I got a SIGILL inside gtk_init() in a
seemingly simple instruction. Since this happened inside a
gcc-compiled binary I had no idea what the JIT could have
been doing wrong. Then this morning I noticed that the
instructions in gtk_init() were two bytes long: everything
became clear again, I needed to implement interworking with
Thumb code in the JIT. This required a few changes in how
the call instructions are emitted and in callsite patching.
The result is that now Mono can P/Invoke shared libraries
compiled in Thumb mode (mono itself must still be compiled
in ARM mode: this should be easy to fix, but there is no
immediate need now for it). Note that this change didn't
make it to the mono 1.2.1 release, you'll have to use mono
from svn.

As part of this work, I also added an option to mono's
configure to disable the compilation of the mcs/ directory,
which would require running mono in emulation by qemu inside
scratchbox. The new option is --disable-mcs-build. This can
also be useful when building the runtime on slow boxes, if
the building of the mcs/ dir is not needed (common for
embedded environments where the managed assemblies are
simply copied from an x86 box).

There are not yet packages ready for the Nokia 770, though
I'll provide a rough tarball of binaries soon: the issue is
that at least my version of scratchbox has a qemu build that
fails to emulate some syscalls used by mono, so it's hard to
build packages that require mono or mcs to be run inside
scratchbox. I'm told this bug has been fixed in more recent
versions, so I'll report how well jitted code runs in qemu
once I install a new scratchbox. This is not the best way to
handle this, though, because even if qemu can emulate
everything mono does, it would be very slow and silly to run
it that way: we should run mono on the host, just like we run
the cross-compiling gcc on the host from inside scratchbox
and make it appear as a native compiler. From a quick look
at the documentation, it should be possible to build a mono
devkit for scratchbox that does exactly this. This would be
very nice for building packages like Gtk# that involve both
managed assemblies and unmanaged shared libraries (the Gtk#
I used for testing required lots of painful switches between
scratchbox for compiling with gcc and another terminal for
running the C#-based build helper tools and mcs...). So, if
anyone has time and skills to develop such a devkit, it
will be much appreciated! Alternatively, we could wait for
debian packages to be built as part of the debian project's
port to armel, which will use armel build boxes.

This afternoon Jonathan Pryor pasted on the mono IRC channel a benchmarklet that showed some surprising results.
It came from Rico Mariani at http://blogs.msdn.com/ricom/archive/2006/03/09/548097.aspx as a performance quiz. The results are non-intuitive, since it makes it appear that using a simple array is slower than using the List<T> generic implementation (which internally is supposed to use an array itself).

On mono, using the simple array was about 3 times slower than using the generics implementation, so I ran the profiler to find out why.

It turns out that in the implementation of the IList<T> interface methods we used a special generic internal call to access the array elements: this internal call is implemented by a C function that needs to cope with any array element type. But since it is an internal call and the JIT knows what it is supposed to do, I quickly wrote the support to recognize it and inline the instructions to access the array elements. This makes the two versions of the code run with about the same speed (with mono from svn, of course).

The interesting fact is that the MS runtime behaves similarly, with the simple array test running about 3 times slower than the IList<T> implementation. If you're curious about why the MS implementation is so slow, follow the link above: I guess sooner or later some MS people will explain it.