In this position paper we propose to extend an existing delegation-based machine model with concurrency primitives. The original machine model, which is built on the concepts of objects, messages, and delegation, provides support for languages enabling multi-dimensional separation of concerns (MDSOC). We propose to extend this model with an actor-based concurrency model, allowing for both true parallelism and lightweight concurrency primitives such as coroutines. In order to demonstrate its expressiveness, we informally describe how three high-level languages supporting different concurrency models can be mapped onto our extended machine model. We also provide an outlook on the extended model’s potential to support concurrency-related MDSOC features.

Transactional memory (TM) is an emerging concurrency control mechanism that provides a simple and composable programming model. Unfortunately, transactions violate the semantics of mutual exclusion locks when they execute concurrently. Due to the prevalence of locks, transactions must be made lock-aware, enabling them to interoperate correctly with locks.

We present a lock-aware transactional memory (LATM) system that employs a unique communication method using local knowledge of locks coupled with granularity-based policies. Our system allows higher concurrent throughput than prior systems because it only prevents truly conflicting critical sections from executing concurrently. Furthermore, our system relaxes the prior requirement of transaction isolation when executing conflicting transactional critical sections and instead runs these transactions as irrevocable, improving transaction concurrency. We demonstrate our performance improvements mathematically and empirically.

Our system also advances LATM research in terms of program consistency. This is achieved by detecting potential deadlocks at run-time and aborting the programs that contain them. Prior systems break deadlocks, which reveals partially executed critical sections to other threads, thereby violating mutual exclusion. Because our system disallows deadlocks, it does not suffer from mutual exclusion violations, improving program consistency.

10:00

Blurring the line between Compiler and Runtime
Eric Jul
DIKU, Dept. of Computer Science, University of Copenhagen

A traditional implementation of a language often strives for a clean interface between the compiler and the runtime system, where a short, concise interface description is considered good. However, for efficiency it can be of great advantage to purposefully blur the line between compiler and runtime by letting the compiler and runtime cooperate across an otherwise well-defined interface boundary. This paper presents a few examples and is intended to generate a discussion of these and other examples.

We attempt to apply the technique of Tracing JIT Compilers in the context of the PyPy project, i.e., to programs that are interpreters for some dynamic languages, including Python. Tracing JIT compilers can greatly speed up programs that spend most of their time in loops in which they take similar code paths. However, applying an unmodified tracing JIT to a program that is itself a bytecode interpreter results in very limited or no speedup. In this paper we show how to guide tracing JIT compilers to greatly improve the speed of bytecode interpreters. One crucial point is to unroll the bytecode dispatch loop, based on two hints provided by the implementer of the bytecode interpreter. We evaluate our technique by applying it to two PyPy interpreters: one is a small example, and the other one is the full Python interpreter.
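As an illustration of where such hints could sit, here is a minimal sketch of a bytecode dispatch loop. The hint names `jit_merge_point` and `can_enter_jit`, the three opcodes, and the accumulator machine are assumptions made for this example; the hints are no-op stand-ins, so the code runs as an ordinary interpreter and does not perform any tracing.

```python
# Hypothetical stand-ins for the two hints; they do nothing here.
# In a tracing JIT they would tell the tracer where the loops of the
# *interpreted* program are, rather than the loop of the interpreter.
def jit_merge_point(pc):
    """Marks the top of the dispatch loop (assumed name, no-op)."""


def can_enter_jit(pc):
    """Marks a backward jump in the interpreted program (assumed name, no-op)."""


def interpret(bytecode, acc):
    """Run a tiny accumulator machine and return the final accumulator."""
    pc = 0
    while pc < len(bytecode):
        jit_merge_point(pc)
        op, arg = bytecode[pc]
        if op == "ADD":
            acc += arg
            pc += 1
        elif op == "JUMP_IF_NEG":
            if acc < 0:
                if arg <= pc:
                    # A backward jump closes a loop in the user program.
                    can_enter_jit(arg)
                pc = arg
            else:
                pc += 1
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return acc


# Interpreted program: add 1 until the accumulator is no longer negative.
PROGRAM = [("ADD", 1), ("JUMP_IF_NEG", 0), ("HALT", 0)]
result = interpret(PROGRAM, -3)
```

Without such markers, a tracer would see only one hot loop (the `while` above), and each iteration would take a different path depending on the opcode, which is the situation the abstract describes as giving little or no speedup.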

The Common Language Infrastructure (CLI) is a virtual machine expressly designed for implementing statically typed languages such as C#; as a result, programs written in dynamically typed languages are typically much slower than C# programs when executed on .NET.

Recent developments show that Just In Time (JIT) compilers can exploit runtime type information to generate quite efficient code. Unfortunately, writing a JIT compiler is far from being simple.

In this paper we report our positive experience with automatic generation of JIT compilers as supported by the PyPy infrastructure, by focusing on JIT compilation for .NET. Following this approach, we have in fact added a second layer of JIT compilation, by allowing dynamic generation of more efficient .NET bytecode, which in turn can be compiled to machine code by the .NET JIT compiler.

The main and novel contribution of this paper is to show that this two-layer JIT technique is effective, since programs written in dynamic languages can run on .NET as fast as (and in some cases even faster than) the equivalent C# programs.

The practicality of the approach is demonstrated by showing some promising experiments done with benchmarks written in a simple dynamic language.

There are several languages that target bytecodes and the JVM™ machine as their new "assembler," including Scala, Clojure, Jython, JRuby, the JavaScript™ programming language/Rhino, and JPC. This presentation takes a quick look at how well these languages sit on a JVM machine, what their performance is, where it goes, and why.

Some of the results are surprising: Clojure's STM ran a complex concurrent problem with 600 parallel worker threads with perfect scaling on an Azul box without modification. Some of the results are less surprising: fixnum/bignum math ops take a substantial toll on the benefit of entirely transparent integer math, and a lack of tail-call optimization gives some languages fits. Some of the languages can get "to the metal," and sometimes performance takes a backseat to other concerns. This session, for non-Java™ platform JVM machine users, is a JVM machine's-eye-view of bytecodes, JITs, and code-gen and will give you insight into why a language is (or is not!) as fast as you might expect.

14:30

Compiling Structural Types on the JVM: A Comparison of Reflective and Generative Techniques from Scala's Perspective
Gilles Dubochet, Martin Odersky
École Polytechnique Fédérale de Lausanne

This article describes Scala’s compilation technique of structural types for the JVM. The technique uses Java reflection and polymorphic inline caches. Performance measurements of this technique are presented and analysed. Further measurements compare Scala’s reflective technique with the “generative” technique used by Whiteoak to compile structural types. The article ends with a comparison of reflective and generative techniques for compiling structural types. It concludes that generative techniques may, in specific cases, exhibit higher performance than reflective approaches, but that reflective techniques are easier to implement and have fewer restrictions.
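The combination of reflective lookup and a per-call-site cache can be sketched as follows. This is only an illustration of the caching idea, not Scala's implementation: Python's `getattr` stands in for Java reflection, and the `CallSite`, `File`, and `Socket` names are hypothetical.

```python
class CallSite:
    """One call site for a structural call such as `x.close()`."""

    def __init__(self, method_name):
        self.method_name = method_name
        self.cache = {}    # receiver class -> resolved function
        self.lookups = 0   # how many slow reflective lookups happened

    def invoke(self, receiver, *args):
        cls = type(receiver)
        method = self.cache.get(cls)
        if method is None:
            # Slow path: look the method up by name, then remember it
            # for every later receiver of the same class.
            method = getattr(cls, self.method_name)
            self.cache[cls] = method
            self.lookups += 1
        return method(receiver, *args)


# Two unrelated classes that happen to share a `close` method.
class File:
    def close(self):
        return "file closed"


class Socket:
    def close(self):
        return "socket closed"


site = CallSite("close")
results = [site.invoke(r) for r in (File(), Socket(), File(), Socket())]
```

The call site is polymorphic because it keeps one cache entry per receiver class: four calls above trigger only two by-name lookups.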

Compilation of polymorphic code through type erasure gives compact code but significantly hurts performance on primitive types. Full specialization gives good performance, but at the cost of increased code size and compilation time. Instead we propose a mixed approach, which allows the programmer to decide what code to specialize. Our approach supports separate compilation, allows mixing of specialized and generic code, and gives very good results in practice.

The insertion of read and write barriers into managed code is a typical runtime compilation task of a Virtual Machine. As part of our current work in applying Thread-Level Speculation (TLS) to Java, we insert a high density of barriers that are conditionally executed based on the identity of the running thread and current execution context. Rather than perform runtime tests, it is more profitable for our TLS system to maintain thread and execution-context specific versions of methods that are compiled with unconditional barriers, and then rely on modified dispatch semantics to ensure conditional execution.

In this paper, we extract the method versioning system from our TLS implementation and present it in a general form, which we call Dynamic Method Versioning (DMV). DMV allows thread and execution-context specific versions of Java methods to be dynamically generated and compiled, with inter-version dispatch managed by a runtime policy. We describe our technique via its implementation within the Jikes Research Virtual Machine, and present initial measurements of its runtime overheads.
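The dispatch idea can be illustrated with a small sketch in which a policy function picks a method version from the calling context. Everything here is an assumption made for illustration: the `VersionedMethod` class, the dictionary-based context, and the hand-registered versions (DMV generates and compiles versions dynamically inside the virtual machine, which this sketch does not attempt).

```python
class VersionedMethod:
    """A method with several versions; a policy selects one per call."""

    def __init__(self, default, policy):
        self.versions = {"default": default}
        self.policy = policy  # callable: context -> version key

    def add_version(self, key, fn):
        self.versions[key] = fn

    def __call__(self, context, *args):
        key = self.policy(context)
        # Fall back to the default version if none is registered.
        fn = self.versions.get(key, self.versions["default"])
        return fn(*args)


barrier_log = []


def store_plain(heap, field, value):
    """Version without barriers, for non-speculative execution."""
    heap[field] = value


def store_with_barrier(heap, field, value):
    """Version with an unconditional write barrier."""
    barrier_log.append(field)
    heap[field] = value


def policy(context):
    """Hypothetical policy: speculative contexts get the barrier version."""
    return "speculative" if context.get("speculative") else "default"


store = VersionedMethod(store_plain, policy)
store.add_version("speculative", store_with_barrier)

heap = {}
store({"speculative": False}, heap, "x", 1)
store({"speculative": True}, heap, "y", 2)
```

Neither version tests the context inside its body; the only conditional is in the dispatch step, which mirrors the trade-off described above.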

16:25

Using Program Metadata to Support SDT in Object-Oriented Applications
Daniel Williams, Jason D. Hiser, Jack W. Davidson
University of Virginia

Software dynamic translation (SDT) is a powerful technology that enables software malleability and adaptivity at the instruction level by providing facilities for run-time monitoring and code modification. SDT has been used as the basis for many valuable tools, including dynamic optimizers, profilers, security policy enforcement, and binary translation, to name a few. However, modern object-oriented programming techniques and their implementations (e.g., virtual functions, exceptions, dynamic code, etc.) pose unique challenges to high-performing SDT systems. In this paper, we present Metaman, a generalized program metadata manager that stores and manages program information so that it can be efficiently accessed by emerging SDT systems to improve overall runtime performance of a managed executable. Using the information collected by Metaman, the run-time performance of an existing SDT system was improved by 22%, making its execution speed only 3% slower than native (i.e., non-managed) execution.

In the past decade, processors were improved by adding vector instructions called SIMD instructions. Those vector instructions have dramatically enhanced the performance of many multimedia applications. This paper studies Leupers’ code selection technique, which is capable of generating SIMD instructions automatically, in the context of dynamic compilation. It develops a portable implementation using loop unrolling and tree pattern matching techniques, applied in the optimizing compiler of Jikes RVM in the phase that converts Lower Intermediate Representation code into Machine-specific Intermediate Representation using the BURS system. This implementation adds new BURS rules capable of generating SIMD instructions that manipulate subword data. Applying the suggested implementation in Jikes RVM on the IA-32 architecture results in an overall speedup at runtime despite the runtime overhead of the compilation phase.

This paper presents a Just-In-Time compilation system for ARM processors. The complete architecture is described, starting from static compilation of the sources into CIL (Common Intermediate Language) bytecode. The intermediate languages that are used are explained, together with the instruction selection and code generation techniques. Finally, some experimental results are presented, comparing them with those of our best open source competitor: Mono.