Monday, May 28, 2007

I just read a blog post recommending that developers broaden their language horizons, with a particular emphasis on Ruby. The author attempts to explain why expressiveness is an important metric for a language:

What is this obssesion[sic] with "expressiveness"? Go write poertry [sic] if you want to be expressvive.[sic]

Remember that ultimately our jobs are (usually) to solve some kind of business problem. We're aiming for a finish line, a goal. The programmer's job is translate the language of the business person to the language of the computer.

The whole point of compilers, interpreters, layers of abstraction and what-not are to shorten the semantic distance between our intent and the way the computer thinks of things.

To be honest, this is not very convincing; the moment you mention "semantics", is the moment many developers will close your blog and go do something "productive".

The argument for expressiveness is ultimately quite simple: the more expressive the language, the more you can do in fewer lines of code. This means that 3 lines of Ruby code might take 12 lines in C#, and 15 lines of C# could be compressed to 2 lines of Ruby while retaining readability and maintainability [1].

Consider viewing Ruby as a C# code generator, where 2 lines of Ruby code can generate 12 lines of C#. That actually sounds like a pretty good idea doesn't it? You would never write a generator to go the other way around though would you?

More expressive languages also tend to be simpler and more coherent. There are all sorts of little ad-hoc rules to Java and C# that you would not find in most functional languages.

You can readily see the differences in expressiveness at the Alioth language shootout. Set the metrics to CPU:0, memory:0, GZIP:1, which means we only care about the GZIP'd size of the source code. You'll see that functional languages tend to come out on top. Ironically, Ruby is first lending weight to the above blog post on Ruby's expressiveness.

Expressiveness is the whole driving force behind DSLs: a DSL is more expressive in solving the problems it was designed for. For instance, a relational database DSL specific to managing blogs could generate perhaps 100 lines of SQL per single line of DSL code. It would take you far more code to emulate that 1 DSL line in C#.

So the expressiveness argument is quite compelling if you simply view it as a code generation metric: the more expressive the language, the more code it can generate for you for the same number of characters typed. That means you do your job faster, and more importantly, with fewer errors.

Why fewer errors? Studies have shown that developers generally write about 20-30 correct lines of code per day. That means 20 lines of correct C# code or 20 lines of correct Ruby code. It just so happens that those 20 lines of correct Ruby can generate 100 lines of correct C#, which means you're now 5x more productive than your C#-only developer next door. Do you see the advantages now?

[1] These numbers are completely bogus and are used purely for illustrative purposes.

Wednesday, May 9, 2007

GEXL is a general expression library providing core primitives for building and processing expression trees. I had abandoned it awhile ago because I quickly ran into a serious, well-known issue: The Expression Problem.

Basically, the expression problem boils down to the fact that typical programming solutions are extensible in only a single direction: either the data types or the operations defined on them, but not both, can be extended without modifying existing code. It takes some sophisticated type machinery to safely solve the expression problem, and various languages, including OCaml, Haskell, Beta/gBeta, and Scala, have acceptable solutions. Languages supporting multimethods support less type safe extension, so I exclude them here (although Nice might do it properly).

So GEXL, which aims to provide both data and their operations as data and visitors respectively, runs smack into the expression problem. Fortunately, Mads Torgersen's paper, The Expression Problem Revisited, provides 4 new solutions by exploiting the generics of C# and Java. The first 3 solutions are still biased towards data-centric or operation-centric extensions, but the last solution provides full extensibility in both directions in ordinary C# with generics. GEXL is saved! There's just one catch: it's not type safe since it requires 2 casts. Fortunately, these casts are encapsulated in the core data and visitor classes, which are parameterized by the types they must check and cast to. The developer extending the core will also need to be a bit careful to make sure he's parameterizing the visitor and expression types at the same level as the extension he's writing.

These core classes are now implemented in GEXL, together with two interpreters and pretty printers written for the untyped lambda calculus, one implemented as data-centric extension of the core library, and the other as an operation-centric extension; the two total about 300 lines of excessively verbose C#, including comments. The paper provides a mixed example of data and operation extensions if you're interested in extending in both directions.

There is a disadvantage of this solution however: visitors can no longer return the results of their computation, due to a type dependency between the Data and Op; if we were to parameterize IVisitor with a return type, IVisitor<R>, as I described earlier, Op must also be parameterized by R, which forces Data to also be parameterized by R since Data's type is constrained by IVisitor.

This means that the entire expression tree will be parameterized by the return type of the operation that we wish to perform, R, which means it can only be traversed by operations that return R! No more typechecking returning bools and tree transforms returning new trees on the same expression tree. Type safety has backed us into a corner; this is in fact a GADT problem, and as covered previously, C# generics fall just short of GADTs. If C# were to be augmented by full equational constraints, instead of the limited form they have now, this problem would be solvable, since we could simply parameterize the methods themselves with the appropriate constraints.

In fact, full equational constraints aren't needed; there are only two limitations of C#'s type system impeding this solution:

Limitations of type constraints to derivations: a constraint placed on a generic parameter must be a non-sealed class or an interface, so a constraint limiting a type parameter to a string is not possible. This should be relaxed so that the type constraint is a full subtype relation. The current design is simply a consequence of #2.

Lack of type constraint refinements: a constraint cannot be refined to a more specific type when subtyping. For instance, the declaration "class Foo where T : S", cannot be refined to "class Bar : Foo where T : U" where U is a subtype of S.

As it stands, the least offensive solution is for each visitor to retain its computed value as a private field.

So now that GEXL lives, perhaps we'll also see the eventual revival of Orc.NET. In the meantime, I'll try porting the interpreter for my language to GEXL and see how painful it turns out to be. :-)

P.S. There's a further interesting post in the above LTU thread: it provides a solution to the expression problem using just sum and recursive types; it requires significant boilerplate, but the extensions are still possible, which is impressive.

Thursday, May 3, 2007

I work with .NET everyday, and it's a decent virtual machine from the developer's perspective, and a slight overall improvement over Java in a number of respects.

But for a language implementor, .NET continues a tradition of mistakes that began with Java, which inhibit it from achieving its much touted goal as the Common Language Runtime (CLR). These issues are divided into two categories: blatant mistakes, and nice to haves. The former requires no cutting edge features or research to implement, and the CLR should have had them from the get-go. The latter might have required some research to get right (like they did with generics), but which are ultimately required for a truly secure and flexible virtual machine.

Blatant Mistakes

Insufficiently powerful primitives for execution and control flow. The most glaring of these omissions are first class function pointers and verifiable indirect calls. On .NET a function pointer is reified as an IntPtr, which is an opaque struct representing a native pointer reified as a native integer. calli is the VM instruction that indirectly invokes a function via such an IntPtr. Because an IntPtr is opaque, the VM can't really know whether it's pointing to a function, or a data blob held outside the VM heap, and calli is consequently an unverifiable instruction. calli is thus available only under full trust since the VM will crash if it calli's into a data blob. Many truly interesting languages would require calli for efficient execution, and the full trust requirement means that partial trust environments, such as cheap web hosts, won't be able to run assemblies generated from these languages. Microsoft consulted with a number of language researchers in the design for .NET and they responded enthusiastically with a "verifiable calli" proposal for just the above reasons. Unfortunately, MS ignored them.

Stack walking was a huge mistake inherited from Java. Stack walking prevents a number of useful optimizations, such as efficient tail calls, and alternative security architectures and execution models from being implemented. As it is, we're stuck with the horrible Code Access Security (CAS), when a far simpler model was already built into the VM. MS's new Silverlight supposedly introduces a new security framework based solely on named "trusted" assemblies, so perhaps we can expect this to be partly fixed. Efficient tail calls in particular are essential for functional languages.

Continuations. I was never sold on continuations until I started considering scheduling and scalability issues, and subsequently read a number of papers on concurrency; in summary, continuations are essential for scalable, schedulable execution subject to configurable user-level policies. Actually, continuations are not strictly necessary, as an expensive global CPS-transform of a program coupled with efficient tail calls can achieve the same effect; however, CPS has a number of performance side-effects that may not be acceptable for certain languages. Thus, continuations or something equally expressive, like delimited continuations, are essential for a true CLR. Coupled with efficient tail calls, we can easily achieve highly scalability architectures with low memory use, as SEDA and Erlang have shown. Many people believe that threads subsume continuations, and some believe the contrary, but in fact, both abstractions are necessary; a thread is a "virtual processor", and a continuation is the processor context. If .NET had continuations, we could build operating system-like execution models that, in theory, can achieve optimal CPU utilization and parallelism with minimal resource use via non-blocking I/O, dataflow variables, and a number of threads equal to the number of real CPUs on the machine. This is achievable only with continuations or the equivalent as a global CPS transform.

Poor security architecture. CAS is a hack bolted on to a VM that had it omitted a single feature and tweaked another feature, wouldn't have needed any security architecture at all. If static class fields were not available, then the native execution model of the .NET VM reduces to capability-based security. It's possible to include static fields if the fields have certain properties (transitive immutability for instance), but conceptually, just imagine a VM whereby the only way to obtain access to an object is being given access to it such as in a method call. There is a further condition to satisfy capability security that has consequences for P/Invoke: all functionality accessible via P/Invoke must be reified as a reference (ie. an object). This means that static, globally accessible operations such as "FileInfo File.Open(string name)" would not be permitted since they subvert the security model by calling out to the OS to convert a string into authority to a File; instead, at the very least File.Open would be reified as a FileOpener object, and this object is "closely held" by trusted code. In effect, this forces all P/Invoke functions to be minimally capability secure. There is also an alternative approach to security with which we can have our fully mutable static fields, and have our isolation and security too; in fact, it's a widely used abstraction invented decades ago which provides full fault isolation and a number of other useful properties: lightweight language process; like OS processes they provide full isolation, but it's a process that exists only in the VM; in effect, it's similar to processes as found in the pi-calculus. Static fields are thus scoped to the lightweight process. Processes and their properties are explained further under "nice to haves".

Nice to Haves

Lightweight process model. Yes, I'm aware that .NET has the AppDomain which is sort of like a lightweight process, but it's isolation properties are inadequate. Microsoft's Singularity operating system takes the base AppDomain abstraction, and extends it into a full "software isolated process" (SIP). Processes should be single-threaded, and should additionally permit fault handling and resumption via Keepers. Like in OSs, portions of a process can potentially be paged in and out, serialized, persisted, or otherwise manipulated, all transparently. Call them Vats as in E, processes as in Erlang, or SIPs as in Singularity, but the process abstraction is critical for full VM isolation. AppDomains are fairly good start however, and I'm researching ways to exploit AppDomains to enhance .NET's security.

Resource accounting. Resource accounting includes managing memory, scheduling concurrent execution, etc. Without properly reifying these abstractions as explicit objects subject to configurable policies, the VM is vulnerable to denial of service (DoS) attacks from malicious code. As it stands, .NET cannot load and run potentially malicious code in the same VM as benign code. For instance, it should be possible to set a quota for memory use, and to schedule and interleave execution so that code cannot exhaust VM memory and "steal the CPU". However, you can imagine that passing around quotas in code would be quite unwieldy; fortunately, this requirement interacts nicely with another abstraction we already need: processes. Thus, the process is also the unit of resource accounting, and spawning a new process starts a new heap, potentially with a quota. Processes can then be scheduled via a VM-level scheduler, and each VM can also schedule its own execution as it sees fit (threaded, event-loops, etc.). Interprocess communication (IPC) can be managed with an exchange heap as in Singularity, or via copying as is done in most microkernels.

Concurrency model. This has more of an effect on the class library than the base VM design, but plan interference is a huge source of bugs in concurrent code, and it should be tackled by every language. VM processes provide a minimal, tried and tested concurrency and isolation abstraction, but within a process there are a wide range of options available, including event-loop concurrency, or language enforced concurrency models. Controlled sharing of state between processes needs to be tackled, such as the exchange heap in Singularity, though my current inclination is to prefer copying for safety reasons.

As you can see, .NET doesn't need much to turn it into a flexible, industrial strength, high security VM. I'm going to follow this post with a detailed description of what the ideal minimal VM should have, and I will even describe how I believe it can be implemented safely and efficiently.