Qualitative Code Differences in Managed Code

My colleague Vance Morrison wrote an internal paper on code quality issues in our current system. I thought there were some excellent items discussed in his paper so with his kind permission I've edited/summarized it for a general audience. Thank you Vance.

Qualitative Code Differences in Managed/Unmanaged Code

If you were to compare the assembly code of an equivalent managed and unmanaged program, you would find the differences break down into three broad categories: Intrinsic Features, Optional Features, and JIT Compiler Limitations.

In contrast, things like local variable access, argument access, flow of control, method calls, instance field accesses, as well as all primitive arithmetic are largely unchanged in managed code. This is very nice since this is the heart of most performance-sensitive programs. So there’s a great base for raw computation problems. See Jan Gray’s paper “Writing Faster Managed Code: Know What Things Cost.”

Intrinsic Runtime Features

These are the things that don’t exist in the unmanaged world, such as garbage collection (GC), appdomains, and code-access security. This is the most worrisome set of differences between managed and unmanaged code because you really can’t “opt-out” of these features – they represent the intrinsic cost of using the runtime.

GC Information – To do a garbage collection, all pointers to the GC heap must be identified (and possibly updated). This includes all pointers on the execution stack (local variables, arguments, register spills) for every thread in the system, as well as any pointers in CPU registers themselves. This requires the JIT compiler to generate GC tracking information sufficient to walk the stack at (roughly) arbitrary times. This extra information is the most fundamental difference between unmanaged and managed code. While the GC tracking requirement does not affect code quality at all, it does mean that every method has a table associated with it that is typically 15% of the size of the code (on x86). Luckily, this table is only accessed for methods active during a GC, so it generally has a small effect on working set. There is also a small working set overhead (~ 1 DWORD per method) to link the method and its GC information. The good news is that all of this has no effect at all on code quality and only a small effect on working set.

Write Barriers – The runtime uses a “generational” GC which improves GC performance by only collecting part of the heap most of the time. To implement this, every write of a GC pointer that resides in the GC heap needs to be logged as a potential root of a partial GC. This bookkeeping adds an additional 4-10 cycles for every such write in the common case, see “Garbage Collector Basics and Performance Hints.” Write barriers are a concern, but the overhead is not huge. A pointer write goes from about 1 cycle to an average of 6 or 7 cycles for pointers on the GC heap – and the hottest pointers are typically on the stack where there is no penalty at all. The effect of write barriers is often measurable (a few percent or so), and can be more significant in certain tight loops.
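As a rough illustration of which stores the write barrier described above applies to, here is a minimal C# sketch. The Node class and the WriteBarrierSketch name are hypothetical, made up for this example, and the comments restate the rule from the paragraph above rather than measured behavior of any particular JIT.

```csharp
using System;

// Hypothetical type for illustration only.
class Node
{
    public Node Next;   // reference field that lives in a GC heap object
    public int Value;   // primitive field: not a GC pointer
}

static class WriteBarrierSketch
{
    public static bool Run()
    {
        Node head = new Node();   // 'head' is a local: a plain store
        Node tail = new Node();

        head.Next = tail;         // GC pointer written into a heap object: write barrier
        head.Value = 42;          // primitive written into a heap object: no barrier

        Node cursor = head;       // pointer written to a local: no barrier
        cursor = cursor.Next;     // still a local: no barrier

        return ReferenceEquals(cursor, tail);
    }

    static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

Only the store to head.Next falls into the category that the paragraph prices at several extra cycles, which is why code that keeps its hot pointers in locals sees little of this cost.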

Static Field Access – The runtime supports a lightweight process-like environment called an AppDomain. Each AppDomain has its own copy of all static variables. Because of this, any domain neutral code must use 5-10 instructions to access static fields instead of just 1. The JIT can optimize many cases (allowing one fetch of AppDomain variables to serve many static field fetches in the same method), but there are cases when no optimization can be done. Domain neutral code is more common in Whidbey. Static field access overhead is actually worse than write barriers in the worst case: the static field access goes from one cycle to roughly ten cycles. However, because the overhead of field fetches can be combined (and pulled out of loops), the impact of slower field fetch is generally less than that of write barriers. It has no measurable impact at all in many scenarios (for instance the framework code tends to not use statics much).

Interop with existing unmanaged code – Transitions to unmanaged code minimally must be marked on the stack to allow garbage collections to happen correctly, and there can be security checks and/or argument conversion necessary (if the types don’t exactly match operating system types). In the best case (no security concerns, simplest kind of call) the overhead is 10-20 instructions. Costs can increase dramatically if argument conversion is needed.

“Optional” Features

These are features that developers can avoid if they wish to (e.g. array bounds checks, run time casts), though for the most part we encourage developers to use them universally. These features can be avoided in particular cases if needed (e.g. by using “unsafe” code).

Ease of use, safety, and simplicity are weighted heavily in making design decisions for most managed code users, including our framework, so most code takes advantage of these “optional” features as a matter of course. Where these costs are hard to bear because the code is highly performance critical you can opt-out if necessary. Opting out with due caution is our normal recommendation.

Managed code strongly encourages code to be verifiably type safe (which means the CLR can prove all references are to instances of the statically declared type). This leads to a bunch of small overheads that can add up.

Bounds checks on many array accesses (by default, every access has a length check at a cost of 2 instructions). You can opt out by using unsafe code.

Type checks on every set to an array of objects to ensure that the value being set is compatible with the array being updated. You can opt-out by using unsafe code.

Type checks when extracting data from type neutral containers and APIs. You can opt out by using unsafe code.

Boxing (wrapping a primitive type in a GC heap object) when inserting primitive types into type neutral containers and APIs. You can opt out by using generics or generating a container for the specific primitive type.

Non-mutable strings. The basic string type is not mutable, which often means more data copying (but sometimes less). You can opt out of this by manipulating character arrays or special classes like StringBuilder, but when you interface with APIs that expect strings, you need to make a copy.

Delegates. Managed code has a type-safe notion of a function pointer called a delegate. Delegates are more powerful than C function pointers because they carry state and can dispatch to multiple targets. This increases overhead. You can opt out by using unsafe function pointers.
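To make the boxing and type-check items in the list above concrete, here is a minimal C# sketch comparing a type neutral container with a generic one. The BoxingSketch class and its method names are hypothetical, and the comments describe where the overhead arises in principle, not how much it costs on a given runtime.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class BoxingSketch
{
    // Type neutral container: each int is boxed on Add and unboxed on read.
    public static int SumBoxed(int count)
    {
        ArrayList list = new ArrayList();
        for (int i = 0; i < count; i++)
            list.Add(i);              // boxes i into a GC heap object
        int sum = 0;
        foreach (object o in list)
            sum += (int)o;            // run time type check plus unbox
        return sum;
    }

    // Generic container: List<int> stores the ints directly, with no boxing or casts.
    public static int SumGeneric(int count)
    {
        List<int> list = new List<int>();
        for (int i = 0; i < count; i++)
            list.Add(i);
        int sum = 0;
        foreach (int v in list)
            sum += v;
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumBoxed(100) == SumGeneric(100));
    }
}
```

Both methods compute the same result; the generic version simply avoids one heap allocation per element on insert and one cast per element on read.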

The runtime has an extensive set of reflection APIs that allow code to introspect on the running code. It is relatively easy to probe for types at runtime, traverse inheritance hierarchies, set fields by string name, call methods by string name, and even generate new methods on the fly. These are powerful features (really not available at all in the unmanaged world), but they have a significant cost compared to precompiled code. A careful engineering tradeoff has to be made by the users of these features to ensure the benefit of this introspection is worth it.
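As an illustration of the "set fields by string name" case mentioned above, the following C# sketch sets the same field directly and through reflection. The Settings class and the ReflectionSketch helper are hypothetical names for this example; the reflection calls used (Type.GetField, FieldInfo.SetValue, FieldInfo.GetValue) are the standard System.Reflection ones.

```csharp
using System;
using System.Reflection;

// Hypothetical type for illustration only.
class Settings
{
    public int Retries;
}

static class ReflectionSketch
{
    // Direct access: resolved at compile time, a single store at run time.
    public static int SetDirect(Settings s, int value)
    {
        s.Retries = value;
        return s.Retries;
    }

    // Reflective access: the field is looked up by name at run time,
    // and the value is passed as object (boxed) and type-checked on assignment.
    public static int SetByName(Settings s, string fieldName, int value)
    {
        FieldInfo field = typeof(Settings).GetField(fieldName);
        if (field == null)
            throw new ArgumentException("No such public field: " + fieldName);
        field.SetValue(s, value);
        return (int)field.GetValue(s);
    }

    static void Main()
    {
        Settings s = new Settings();
        Console.WriteLine(SetDirect(s, 3));
        Console.WriteLine(SetByName(s, "Retries", 5));
    }
}
```

The reflective path does a name lookup, boxing, and run time type checks for work that the direct path does in one instruction, which is the kind of tradeoff the paragraph asks users of these APIs to weigh.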

Managed code tends to have more extensibility points than the equivalent unmanaged counterpart. Developers use object oriented techniques, using virtual functions, interfaces, and the reflection APIs to achieve this. These extensibility points can cost significant amounts of performance and have to be carefully weighed by framework designers.

Managed code supports Custom Attributes on IL entities (Types, Methods, Fields, etc.). This has been valuable for adding new features to the system (e.g. hosting, interop, security, or reliability information) but the attributes are relatively expensive to access at run time. This expense has to be factored into the cost of these new added features.

Managed code tends to allocate more heap objects (i.e. more methods tend to return new objects rather than modify one that was passed in). Reusing objects in place can cut down on the allocation overhead, but sometimes the locality benefits of nice compact allocations trump other considerations. Managed allocations are also closer in speed to a custom unmanaged allocator than to a raw malloc(). So allocation considerations are a subtle topic at best.

Compilers can make expensive features very easy or even implicit (e.g. transitioning to unmanaged code, anonymous delegates) which magnifies their use tremendously.

Managed libraries often do extensive precondition checking to give detailed errors on API misuse, for example checking for null object references and throwing an ArgumentException. This is great for developers but hurts performance. Obviously this was a choice made by the library designers (end users can’t opt out, except by re-implementing, but library designers can).

JIT Compiler Limitations

The final category of code generation differences consists of artifacts of the current JIT compiler rather than inherent trade-offs in the managed system.

The current just in time (JIT) compiler is more limited than a typical commercial quality unmanaged compiler, partly because it needs to be smaller and faster and partly because it just isn’t as mature. Some of the larger issues include:

Inlining – The inlining subsystem could use additional work to handle larger inlining cases – this is getting more important as more complex properties become more common and require inlining for performance.

Analysis caps – For the sake of speed the JIT places arbitrary caps on the size of analysis data. For large methods, the JIT does not have the information necessary to do a really good job.

Value Types (structs) – Value types are not handled as well as reference types. For example the inliner does not inline functions with value type parameters.

Exception Handling – The code generated for exceptions is based on the assumption that exception handling is rare. This assumption is turning out to be false as users write code with increasingly rich exception semantics.

I personally believe that the lack of inlining for Value Types is the biggest issue in current JIT performance. I think that Value Types in general need much more aggressive inlining than reference types. If performance were not an issue, nobody would be using Value Types in the first place.

Special-casing of generics for Value Types is great (at the cost of a bigger working set), but the lack of inlining makes it difficult/unworthwhile to create efficient lightweight wrappers for other primitive types that add additional features (like say a wrapper for Int32 that restricted its values, or implemented a generic interface such as IArithmetic<T> for doing math in generics). It also adds penalties when creating new primitive-like types (such as a Complex number).

Personally, I think that there should be really aggressive inlining on overloaded operators, property accessors, and constructors defined for Value Types, even if it increases the initial cost of JITing a Value Type a bit.

Most programmers do not "see" the added cost of calling an overloaded operator, and the runtime (or maybe even the C# compiler) should work as hard as possible to inline them.

Even System.Decimal would benefit from such changes to the JIT, and it is very frequently used in business applications.
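For readers who have not written such a type, here is a minimal C# sketch of the kind of primitive-like value type this comment has in mind. The Complex struct and the ValueTypeSketch helper are hypothetical and written only for this illustration; they are not the framework's own types.

```csharp
using System;

// Hypothetical primitive-like value type with overloaded operators.
struct Complex
{
    public readonly double Re;
    public readonly double Im;

    public Complex(double re, double im) { Re = re; Im = im; }

    public static Complex operator +(Complex a, Complex b)
    {
        return new Complex(a.Re + b.Re, a.Im + b.Im);
    }

    public static Complex operator *(Complex a, Complex b)
    {
        return new Complex(a.Re * b.Re - a.Im * b.Im,
                           a.Re * b.Im + a.Im * b.Re);
    }
}

static class ValueTypeSketch
{
    // Each iteration makes two operator calls and two constructor calls,
    // all taking or returning a value type. None of this is visible in the
    // source expression, which reads like ordinary arithmetic.
    public static Complex SumOfSquares(Complex[] values)
    {
        Complex total = new Complex(0, 0);
        foreach (Complex z in values)
            total = total + z * z;
        return total;
    }

    static void Main()
    {
        Complex r = SumOfSquares(new[] { new Complex(1, 1), new Complex(0, 2) });
        Console.WriteLine(r.Re + " " + r.Im);
    }
}
```

Every method body here is a handful of floating point operations, so whether the JIT inlines the operator and constructor calls largely determines how the loop performs.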

With regard to interop with unmanaged code, I wonder if you’d consider writing a post on suggestions for performance regarding crossing the managed/unmanaged boundary? We have a slight performance issue with the following, and I’m sure it must be a common scenario (even if not exactly the same).

We started our application (in C#) pretty much as soon as .NET 1.0 appeared. Front-end and business objects are written entirely in C#. We reused an in-house object/relational system that sits on top of the database, which we upgraded from straight C++. In the middle is an auto-generated data-access layer, with a pair of object/collection classes per table, and a pair of get/set methods per column. Business objects are stateless and only contain a reference to the data-access object. These d/a methods have the form:

So obviously the interface is too ‘chatty’ but that’s hindsight and can’t be easily changed. How can we know if we’re making the cheapest possible call here (you mentioned argument conversion and security)?

> The current just in time (JIT) compiler is more limited than a typical commercial quality unmanaged compiler, partly because it needs to be smaller and faster and partly because it just isn’t as mature. Some of the larger issues include.

Not true. Exceptions have a cost, granted. But the thing is, if they are used to actually handle errors (1/1000 rule from Rico’s other post), the cost is not important. Why? Well, because if you have an error, the amount of processing to do is normally MUCH less than what’s needed for normal operation (we write code to do stuff and not to handle errors, don’t we? BTW, that’s also why exceptions are good: they let us write less code for error handling, which leaves us more time to write code that does stuff).

The important overhead, then, is the "static" one for exception handling init/cleanup. But, there, do not forget that you compare the compiler-generated code that you don’t see, with code for error handling that you write. So, it’s not that you have hidden overhead in case of exception-enabled environment as opposed to NOTHING in exception-free (C code, anyone?) environments.

My vote for most pressing performance issue also goes to inlining value types. This should have been a priority from the outset, not an afterthought.

By their nature, value types tend to be used for small, fundamental types like complex numbers, intervals, quantities (size + unit), etc. Method bodies are usually very small. These value types are used in calculations, very often inside loops.

A second issue concerns mutable reference types with value semantics. These are sometimes necessary, but seem to run against the .NET philosophy and their use is discouraged at every opportunity. One example: C# doesn’t allow overloading of compound assignment operators. It just assumes your objects are immutable and therefore "operator synthesis" (x@=y -> x=x@y) is all you need.

The String/Builder pattern can’t be used for objects like vectors and matrices. Like value types, these objects are used in calculations, implying that their component values change often. Recycling is a must. Bounds checking can often be eliminated if certain class-level invariants can be verified.

Jeffery Sax’s thoughts are closely aligned with mine. Of the JIT issues, the inliner is the thing I would most like to see improved and handling of value types in the inliner is doubly important. I think value types are largely under-used and perhaps they might be used more often if you could cash in, in practice, on the gains that they offer in theory.

But of course all the areas identified as weaknesses in the JIT above are obviously on our minds.

Not all exceptions are thrown by code deep down in the call stack. Exceptions are often raised by the CLR because of an issue with code inside a method. For example: OverflowException on a checked block.
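A minimal C# sketch of the situation described above, where the CLR raises the exception and the same method catches it. The CheckedSketch class and TryAdd method are hypothetical names chosen for this example.

```csharp
using System;

static class CheckedSketch
{
    // The overflow is detected by the runtime inside the checked block and
    // caught a few lines away, in the same method, with no stack to unwind.
    public static bool TryAdd(int a, int b, out int result)
    {
        try
        {
            checked { result = a + b; }   // throws OverflowException on overflow
            return true;
        }
        catch (OverflowException)
        {
            result = 0;
            return false;
        }
    }

    static void Main()
    {
        int sum;
        Console.WriteLine(TryAdd(1, 2, out sum));
        Console.WriteLine(TryAdd(int.MaxValue, 1, out sum));
    }
}
```

The catch block here never looks at the exception object, which is the pattern where skipping its construction would matter most.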

A form of ‘light-weight exception handling’ for these situations, where the JIT bypasses the full exception handling mechanism, would be very welcome. I.e. if a CLR exception is thrown by code inside a try block, and the exception is caught in a corresponding catch block, the overhead of building a full-featured exception object could be eliminated.

This is especially important since the CLI spec states that certain tests throw an exception if the test fails rather than branch if it succeeds. Example: the ckfinite instruction. In other words, without this type of optimization, the CLI *imposes* severe performance degradation.

I’m aware that exception handling code can be extremely complicated, but that does not mean that the simple cases cannot be optimized.