Common Type System (CTS): One Platform to Rule Them All

The Common Language Runtime (CLR), or more precisely any implementation of the Common Language Infrastructure (CLI) specification, executes code inside the bounds of a well-defined type system, called the Common Type System (CTS). The CTS is part of the CLI and is maintained via the ECMA and International Organization for Standardization (ISO) standards bodies. It defines a set of structures and services that programs targeting the CLR may use, including a rich type system for building abstractions out of built-in and custom abstract data types. The CTS constitutes the interface between managed programs and the runtime itself, in a language-agnostic manner.

As a brief illustration of the diversity of languages that the CTS supports, consider four, each of which has a publicly available compiler targeting the CLR: C#, C++/CLI, Python, and F#:

C# is a (mostly) statically typed, imperative, C-style language. It offers very few features that step outside the CLR's verifiable type safety and takes a heavily object-oriented view of the world. C# also offers some interesting functional features, such as first-class functions and their close cousins, closures, and continues to move in this direction with the addition of, for example, type inferencing and lambdas in newer versions of the language. At the time of this writing, it is the most popular programming language on the CLR platform.

C++/CLI is an implementation of the C++ language targeting the CTS. Programmers in this language often step outside the bounds of verifiable type safety, directly manipulating pointers and memory segments, although the compiler does support compilation options to restrict programs to a verifiable subset of the language. The ability to bridge the managed and unmanaged worlds with C++ is powerful, enabling many existing unmanaged programs to be recompiled under the CLR's control, with the benefits of garbage collection and (mostly) verifiable IL.

Python, like C#, deals with data in an object-oriented fashion. But unlike C#, and much like Visual Basic, it prefers to infer as much as possible and to defer until runtime many decisions that would traditionally have been resolved at compile time. Programmers in this language never deal directly with raw memory and always live inside the safe confines of verifiable type safety. Productivity and ease of programming are often of utmost importance for such dynamic languages, making them amenable to scripting and lightweight program extension. But they must still produce code that resolves typing and other CLR-related mapping issues somewhere between compile time and runtime. Some say that dynamic languages are the way of the future; thankfully, the CLR supports them just as well as any other kind of language.

Lastly, F# is a typed, functional language derived from OCaml (which is itself derived from Standard ML) that offers type inferencing and scripting-like interoperability features. F# certainly exposes a very different syntax than, say, C#, VB, or Python; in fact, many programmers with a background in C-style languages may find it quite uncomfortable at first. It offers a mathematical style of type declarations and manipulations, along with many other useful features that are more prevalent in functional languages, such as pattern matching. F# is a great language for scientific and mathematical programming.

Each of these languages exposes a different view of the type system, sometimes extreme, often subtle, yet all compile into abstractions from the same CTS and instructions from the same Common Intermediate Language (CIL). Libraries written in one language can be consumed from another. A single program can even be composed from multiple parts, each written in whatever language is most appropriate, and combined into a single managed assembly. Notice also that the idea of verification makes it possible to prove type safety, yet to work around entire portions of the CTS when necessary (such as manipulating raw memory pointers in C++); the security system provides facilities for placing restrictions on the execution of unverifiable code.

The Importance of Type Safety

Not so long ago, unmanaged assembly, C, and C++ programming were the de facto standard in industry, and types, when present, weren't much more than ways to name memory offsets. A C structure, for example, is really just a big sequence of bits with names (its fields) used to access precise offsets from the base address. Pointers to structures can be made to point at incompatible instances, and data can be indexed into and manipulated freely. C++ is admittedly a huge step in the right direction, but there generally wasn't any runtime system enforcing that memory accesses followed the type system's rules. In all unmanaged languages, there was a way to get around the illusion of type safety.
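The "named offsets" view can be made concrete with a short, intentionally type-unsafe C sketch. The struct and function names here are hypothetical, invented for illustration; the reinterpreting cast is, strictly speaking, undefined behavior in C, which is exactly the hazard being described:

```c
struct point { int x; int y; };           /* two named offsets from the base address */
struct pair  { unsigned a; unsigned b; }; /* same layout, different names            */

/* View a point's bits through an incompatible structure type.  C permits
   the cast, and nothing at runtime checks that the pointer really refers
   to a pair; the field names are mere conveniences over raw memory. */
unsigned first_field_as_unsigned(struct point *p)
{
    struct pair *q = (struct pair *)p;
    return q->a;    /* the bits of p->x, now read as an unsigned value */
}
```

On typical platforms a point holding -1 in its x field reads back as the largest unsigned value, with no complaint from compiler or runtime.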

This approach to programming has proven to be quite error prone, leading to hard-to-find bugs and a movement toward completely type-safe languages. (To be fair, languages with memory safety were available well in advance of C. LISP, for instance, uses a virtual machine and garbage-collected environment similar to the CLR's.) Over time, safe languages and compilers have grown in popularity, as has the use of static analysis to notify developers about operations that could lead to memory errors. Other languages, such as VB6 and Java, fully enforce type safety through a runtime, increasing programmer productivity and the robustness of programs. Where language constructs are permitted to bypass compile-time type checking, the runtime catches and deals with illegal casts in a controlled manner, for instance by throwing an exception. The CLR follows in this spirit.

Proving Type Safety

The CLR execution environment takes responsibility for ensuring that type safety is proven prior to executing any code. This safety cannot be subverted by untrusted malicious programs, ensuring that memory corruption is not possible. Strictly speaking, this guarantee applies only to verifiable code; by using unverifiable constructs, you can create programs that violate these restrictions wholesale. Doing so generally means that your programs won't be able to execute in partial trust without a special security policy.

There are also situations where unmanaged interoperability supplied by a trusted library can be tricked into performing incorrect operations. For example, if a trusted managed API in the Base Class Libraries (BCL) blindly accepts an integer and passes it to a bit of unmanaged code, that unmanaged code might use the integer to index into an array. A malicious user could intentionally pass an invalid index to provoke a buffer overflow. It is the responsibility of trusted library developers to ensure that such errors are not present.
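The defensive pattern such a library needs can be sketched in a few lines of C. The function name and signature here are hypothetical; the point is that the caller-supplied index is validated before it is ever used as an offset:

```c
#include <stddef.h>

/* Validate a caller-supplied index before trusting it, returning 0 on
   success and -1 on a bad argument, rather than indexing blindly. */
int get_element(const int *arr, size_t len, size_t index, int *out)
{
    if (arr == NULL || out == NULL || index >= len)
        return -1;          /* reject the request instead of overflowing */
    *out = arr[index];
    return 0;
}
```

A trusted wrapper that performs this check closes the hole; one that forwards the raw integer straight to unmanaged code does not.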


An Example of Type-Unsafe Code (in C)

Consider a C program that manipulates some data in an unsafe way, a situation that generally leads either to a memory access violation at runtime or to silent data corruption. An access violation (sometimes just called an AV) happens when protected memory is read from or written to by accident; this is generally more desirable (and more debuggable) than blindly overwriting memory your program does own. The following snippet clobbers the stack, meaning that the control flow of your program and various bits of data, including the return address for the current function, could be overwritten. It's bad:

Our main function allocates two items on its stack: an integer x and a 16-character array named buffer. It then passes a pointer to buffer (remember, it's on the stack) to fill_buffer, which uses the size and character c parameters to fill the buffer with that character. Unfortunately, main passed 32 instead of 16, meaning that we write 32 char-sized pieces of data onto the stack, 16 more than we should have. The result can be disastrous. Depending on the stack layout the compiler chooses, this might not be so bad (we could simply overwrite x), but it can be horrific if we end up overwriting the return address. It is possible only because we are permitted to access raw memory entirely outside the confines of C's primitive type system.

Static and Dynamic Typing

Type systems are often categorized using a single pivot: static versus dynamic. The reality is that type systems vary quite a bit more than being just one or the other. Nonetheless, the CTS provides capabilities for both, giving languages the responsibility of choosing how to expose the CLR's features. There are strong proponents of both styles, although many programmers feel most comfortable somewhere in the middle. Regardless of your favorite language, the CLR runs code in a strongly typed environment. This means that your language can avoid dealing with types at compile time, but ultimately it will end up having to work within the type system at runtime. Everything has a type, whether a language designer surfaces this to users or not.

Key Differences in Typing Strategies

Static typing seeks to prove program safety at compile time, eliminating a whole category of runtime failures having to do with type mismatches and memory access violations. C# programs are mostly statically typed, although some features, like casting, enable you to relax or escape static typing in favor of dynamism; in such cases, the runtime ensures that types are compatible at runtime. Other examples of statically typed languages include Java, Haskell, Standard ML, and F#. C++ is much like C# in that it uses a great deal of static typing, although several areas can cause failures at runtime, notably type-unsafe memory manipulation, as in old-style C.

Some people feel that static typing forces a more verbose and less exploratory programming style. Type declarations are often littered throughout programs, for instance, even in cases where a more intelligent compiler could infer them. The benefit, of course, is finding more errors at compile time, but in some scenarios the burden of having to play the "beat the compiler" game is simply too great. Dynamic languages defer to runtime many of the correctness checks that static languages perform at compile time. Some take this to the extreme and defer all checks, while others employ a mixture of static and dynamic checking. Languages like VB, Python, Common LISP, Scheme, Perl, and Ruby fall into this category.
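The deferral is easy to see in Python, one of the dynamic languages just listed. Nothing validates this call until the offending line actually executes (the function here is a hypothetical illustration):

```python
def shout(value):
    # No static annotation: whether `value` has an upper() method is
    # not checked until this line runs at runtime.
    return value.upper() + "!"

print(shout("hi"))   # a str has upper(), so this succeeds and prints HI!
# shout(42) would fail only here, at runtime, with an AttributeError,
# where a statically typed language would have rejected it at compile time
```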

Late binding is a form of dynamic programming in which the exact types and target methods to invoke are not decided until runtime. Statically compiled programs bind to a precise metadata token directly in the IL; dynamic languages, however, perform this binding very late, often just prior to dispatching a method call.
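Python's built-in getattr makes the "bind just before dispatch" step explicit. The class below is hypothetical, invented for illustration:

```python
class Greeter:
    def hello(self, name):
        return "hello, " + name

target = Greeter()
method_name = "hello"   # could just as well come from user input or config

# The target method is looked up by name immediately before the call is
# dispatched, rather than being fixed to a metadata token at compile time.
bound = getattr(target, method_name)
print(bound("world"))   # prints: hello, world
```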


The Language Spectrum

The CLR supports the entire spectrum of languages, from static to dynamic and everywhere in between. The Framework itself in fact provides an entire library for late-bound, dynamic programming, called reflection. Reflection exposes the entire CTS through a set of APIs in the System.Reflection namespace, offering functionality that helps compiler authors implement dynamic languages and enables everyday developers to exploit some of the power of dynamic programming.
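Reflection itself is a .NET API, but its flavor can be sketched with Python's own introspection facilities, which dynamic-language implementations rely on in much the same way. The Account class here is hypothetical:

```python
import inspect

class Account:
    """A small class whose members we will discover at runtime."""
    def __init__(self, balance):
        self.balance = balance
    def deposit(self, amount):
        self.balance += amount

# Enumerate the callable members of a type at runtime, in the spirit of
# what System.Reflection (e.g., Type.GetMethods) offers on the CLR.
methods = [name for name, _ in inspect.getmembers(Account, inspect.isfunction)]
print(methods)   # includes '__init__' and 'deposit'
```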

Let's take a brief look at some example languages from this spectrum. Below are four small programs, each printing out the 10th element of the Fibonacci series, computed with the well-known naïve recursive algorithm. Two of these examples are written in statically typed languages (C# and F#), one in a language in between (VB), and one in a dynamically typed language (Python). The primary differences you will notice immediately are stylistic, but one deeply ingrained difference is whether the IL they emit is statically typed or instead relies on dynamic type checking and binding.
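As one concrete instance, the dynamically typed (Python) variant of such a program might read as follows, indexing the series from 1 so that the 10th element is 55:

```python
def fib(n):
    # Naive recursive definition: exponential time, but it mirrors the
    # textbook recurrence directly, with no type annotations anywhere.
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)

print(fib(10))   # prints 55, the 10th element of 1, 1, 2, 3, 5, 8, ...
```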

You'll notice the C# version is the only one that mentions that we're working with 32-bit int values. These static type annotations are needed for the compiler to prove type soundness at compile time. Many static languages, F# among them, instead use a technique called type inferencing, avoiding the need for annotations wherever they can be deduced, for example from the use of literals. F# actually emits IL similar to C#'s, working with statically typed ints, although we never specified this in the source code; in other words, it infers the type of a variable by examining its usage. Languages that infer types ordinarily require annotations only where a type can't be deduced solely from usage.

The other languages shown, VB and Python, emit code that works with Object, the root of the CTS type hierarchy, and bind strongly at runtime. They do so by emitting calls into their own runtime libraries, which are based on reflection. Clearly, the performance of statically typed programs will often win out over that of dynamic ones, simply because they can emit raw IL instructions instead of relying on additional function calls to, for example, late-binding libraries. Some degree of clever runtime caching can significantly narrow this difference, however.

Wrapping Up

As we've seen, the CLR is a great platform for language diversity. No single language is perfect for all jobs, and many programmers actually bounce between languages, tailoring the choice to the specific project they are working on.