Abstract
Software-based Thread-Level Speculation (TLS) requires speculative threads
executing ahead of normal execution be isolated from main memory until
validation. The resulting read/write buffering requirements, however, mean
speculative execution proceeds slower than unbuffered, non-speculative
execution, resulting in poor parallel coverage of loops as the non-speculative
threads catches up to and prematurely joins with its speculative children.
In this work we investigate an "asymmetric partitioning" strategy, modifying
speculative code generation to better partition loops, balancing the load
assuming a range of constant factor slowdowns on speculative threads. An
implementation within the LLVM-based MUTLS system shows a significant benefit
to memory intensive benchmarks is possible, although it is dependent on
relatively precise estimates of the memory access rate that induces buffering
slowdown for individual benchmarks.

Abstract
Most modern web applications are written in JavaScript. However, the
demand for web applications that require more numerically-intensive
calculations, such as 3D gaming or photo-editing, has increased. This has
also increased the demand for code that runs near native speeds.
PNaCl is a toolchain that allows native C/C++ code to be run in the browser.
This paper provides a comparison of the performance of PNaCl to native code
and JavaScript. Using a benchmark suite that covers a representative set of
numerical computations, it is shown on average, that the performance PNaCl
is within 9% of native C code.

Abstract
Learning how to program has increasingly become a more important skill for
non-programmers in the tech industry or researchers outside of computer
science. Newer programming languages such as MATLAB language have grown into
industrial strength languages, and many industries and academic fields outside
of math and computer science have found uses for it. Thus, it is essential for
many more people to learn MATLAB. However, many people often learn it without
fundamental knowledge in programming concepts that are rooted in computer
science. Thus, a new tutorial, McTutorial addresses this issue.

Abstract
From its modest beginnings as a tool to validate forms, JavaScript is now an industrial-strength language used to power online applications such as spreadsheets, IDEs, image editors and even 3D games. Since all modern web browsers support JavaScript, it provides a medium that is both easy to distribute for developers and easy to access for users. This paper provides empirical data to answer the question: Is JavaScript suitable for numerical computations? By measuring and comparing the runtime performance of benchmarks representative of a wide variety of scientific applications, we show that for sequential JavaScript is within a factor of 2 of native code. Parallel code using WebCL shows speed improvements of up to 2.28 over JavaScript for the majority of the benchmarks.

Abstract
Developing compilers that allow scientific programmers to use multicores
and GPUs is of increasing interest, however building such compilers
requires considerable effort. We present Velociraptor: a
portable compiler toolkit that can be used to easily build
compilers for numerical programs targeting multicores and GPUs.
Velociraptor provides a new high-level IR called VRIR which has been
specifically designed for numeric computations, with rich support for
arrays, plus support for high-level parallel and accelerator constructs.
A compiler developer uses Velociraptor by generating VRIR for key parts
of an input program. Velociraptor does the rest of the work by
optimizing the VRIR code, and generating LLVM for CPUs and OpenCL for
GPUs. Velociraptor also provides a smart runtime
system to manage GPU resources and task dispatch.
To demonstrate Velociraptor in action, we present two case studies: a
proof-of-concept Python compiler targeting CPUs and GPUs, and
a GPU extension for a MATLAB JIT.

Abstract
MATLAB is a dynamic numerical scripting language widely used by scientists,
engineers and students. While MATLAB's high-level syntax and dynamic types
makes it ideal for prototyping, programmers often prefer using high-performance
static programming languages such as Fortran for their final distributable
code. Rather than requiring programmers to rewrite their code by hand, our
solution is to provide a tool that automatically translates the original MATLAB
program to produce an equivalent Fortran program. There are several important
challenges for automatically translating MATLAB to Fortran, such as correctly
estimating the static type characteristics of all the variables in a MATLAB
program, mapping MATLAB built-in functions, and effectively mapping MATLAB
constructs to Fortran constructs.
In this paper, we introduce Mc2For, a tool which automatically translates
MATLAB to Fortran. This tool consists of two major parts. The first part is an
interprocedural analysis component to estimate the static type characteristics,
such as array shape and the range value information, which are used to generate
variable declarations in the translated Fortran program. The second part is an
extensible Fortran code generation framework to automatically transform MATLAB
constructs to corresponding Fortran constructs. This work has been implemented
within the McLab framework, and we demonstrate the performance of the
translated Fortran code for a collection of MATLAB benchmark programs.

Abstract
MATLAB is a popular dynamic array-based language commonly used by
students, scientists and engineers, who appreciate the interactive
development style, the rich set of array operators, the extensive
builtin library, and the fact that they do not have to declare static
types. Even though these users like to program in MATLAB, their
computations are often very compute-intensive and are potentially very
good applications for high-performance languages such as X10.
To provide a bridge between MATLAB and X10, we are developing
MiX10, a source-to-source compiler that translates MATLAB to X10.
This paper provides an overview of the initial design of the MiX10
compiler, presents a template-based specialization approach to compiling
the builtin MATLAB operators, and provides translation rules for the
key sequential MATLAB constructs with a focus on those which are
challenging to convert to semantically-equivalent X10. An initial
core compiler has been implemented, and preliminary results are
provided.

Abstract
High-performance computing systems today include a variety of compute
devices such as multi-core CPUs, GPUs and many-core accelerators.
OpenCL allows programming different types of compute devices using a
single API and kernel language. However, there is no standard matrix
operations library in OpenCL for operations such as matrix
multiplication that works well on a variety of hardware from multiple
vendors. We implemented an OpenCL auto-tuning library for real and complex variants of general
matrix multiply (GEMM) and present detailed performance results and
analysis on a variety of GPU and CPU architectures. The library provides
good performance on all the architectures tested, and is competitive
with vendor libraries on some architectures (such as AMD Radeons) We
also present brief analysis for related kernels such as matrix transpose
and matrix-vector multiply.

MATLAB is a popular dynamically-typed array-based language. The
built-in function feval is an important MATLAB feature for certain
classes of numerical programs and solvers which benefit from having functions
as parameters. Programmers may pass a function name or function handle to
the solver and then the solver uses feval to indirectly call the
function. In this paper, we show that although feval provides an
acceptable abstraction mechanism for these types of applications, there
are significant performance overheads for function calls via feval, in
both MATLAB interpreters and JITs. The paper then proposes, implements
and compares two on-the-fly mechanisms for specialization of feval
calls. The first approach specializes calls of functions with feval
using a combination of runtime input argument types and values. The second
approach uses on-stack replacement technology, as supported by McVM/McOSR.
Experimental results on seven numerical solvers show that the techniques
provide good performance improvements.

Abstract
Matlab is a dynamic scientific language used by
scientists, engineers and students world- wide. Although Matlab is
very suitable for rapid prototyping and development, Matlab users
often want to convert their final Matlab programs to a static language
such as Fortran. This paper presents an extensible object-oriented
toolkit for supporting the generation of static programs from dynamic
Matlab programs. Our open source toolkit, called the Matlab Tamer,
identifies a large tame subset of Matlab, supports the generation of a
specialized Tame IR for that subset, provides a principled approach to
handling the large number of builtin Matlab functions, and supports an
extensible interprocedural value analysis for estimating Matlab types
and call graphs.

Abstract
On-stack replacement (OSR) is a technique which allows a virtual machine
to interrupt running code during the execution of a function/method, to
reoptimize the function on-the-fly using an optimizing JIT compiler, and
then to resume the interrupted function at the point and state at which
it was interrupted. OSR is particularly useful programs with
potentially long-running loops, as it allows dynamic optimization of
those loops as soon as they become hot.

In this paper, we present a modular approach to implementing on-stack
replacement that can be used by any system that targets the LLVM SSA
intermediate representation, and we demonstrate the approach by using it
to support dynamic inlining in McVM. McVM is a virtual machine for
MATLAB which uses a LLVM-based JIT compiler. MATLAB is a popular
dynamic language for scientific and engineering applications which
typically manipulate large matrices and often contain long-running
loops, and is thus an ideal target for dynamic JIT compilation and OSRs.

Using our McVM example, we examine the overheads of using OSR on a suite
of MATLAB benchmarks, and we show the potential performance
improvements when using it to perform dynamic inlining.

Abstract
Matlab is a dynamic scientific language used by
scientists, engineers and students world- wide. Although Matlab is
very suitable for rapid prototyping and development, Matlab users
often want to convert their final Matlab programs to a static language
such as Fortran. This paper presents an extensible object-oriented
toolkit for supporting the generation of static programs from dynamic
Matlab programs. Our open source toolkit, called the Matlab Tamer,
identifies a large tame subset of Matlab, supports the generation of a
specialized Tame IR for that subset, provides a principled approach to
handling the large number of builtin Matlab functions, and supports an
extensible interprocedural value analysis for estimating Matlab types
and call graphs.

Abstract
Thread-level Speculation (TLS) is a technique for automatic
parallelization that has shown excellent results in hardware
simulation studies. Existing studies, however, typically require a
full stack of analyses, hardware components, and performance
assumptions in order to demonstrate and measure speedup, limiting the
ability to vary fundamental choices and making basic design
comparisons difficult. Here we approach the problem analytically,
abstracting several variations on a general form of TLS (method-level
speculation) and using our abstraction to model the performance of TLS
on common coding idioms. Our investigation is based on exhaustive
exploration, and we are able to show how optimal performance is
strongly limited by program structure and core choices in speculation
design, irrespective of data dependencies. These results provide new,
high-level insight into where and how thread-level speculation can and
should be applied in order to produce practical speedup.

Abstract
MATLAB is a very popular dynamic "scripting" language for numerical
computations used by scientists, engineers and students world-wide.
MATLAB programs are often developed incrementally using a mixture of
MATLAB scripts and functions and frequently build upon existing code
which may use outdated features. This results in programs that could
benefit from refactoring, especially if the code will be reused and/or
distributed. Despite the need for refactoring there appear to be no
MATLAB refactoring tools available. Furthermore, correct refactoring of
MATLAB is quite challenging because of its non-standard rules for binding
identifiers. Even simple refactorings are non-trivial.

This paper presents the important challenges of refactoring MATLAB along
with automated techniques to handle a collection of refactorings for
MATLAB functions and scripts including: function and script inlining,
converting scripts to functions, and converting dynamic feval calls
to static function calls. The refactorings have been implemented using the
MATLAB compiler framework, and an evaluation is given on a large set of
MATLAB benchmarks which demonstrates the effectiveness of our approach.

Abstract
MATLAB is an extremely popular programming language used by scientists,
engineers, researchers and students world-wide. Despite its popularity, it has
received very little attention from compiler researchers. This paper
introduces McSAF, an open-source static analysis framework which is intended
to enable more compiler research for MATLAB and extensions of MATLAB. The
framework is based on an intermediate representation (IR) called McLAST, which
has been designed to capture all the key features of MATLAB, while at the same
time as being simple for program analysis. The paper describes both the IR and
the procedure for creating the IR from the higher-level AST. The analysis
framework itself provides visitor-based traversals including fixed-point-based
traversals to support both forwards and backwards analyses. McSAF has been
implemented as part of the McLAB project, and the framework has already been
used for a variety of analyses, both for MATLAB and the AspectMATLAB
extension.

Abstract
AspectMatlab is an aspect-oriented extension of the MATLAB programming language. In this
paper we present some important extensions to the original implementation of AspectMatlab. These
extensions enhance the expressiveness of the aspect pattern definition and widen support to certain
MATLAB native functions. Of particular importance are the new operator patterns, which permit
matching on binary operators such as "*" and "+". Correct support for the MATLAB "clear"
command and the "end" expression were also added. Finally, we documented previously hidden
features, most notably a negation operator on pattern. This new extended version of MATLAB
thus supports important added functionality.

Abstract
Parallelization and optimization of the MATLAB programming language
presents several challenges due to the dynamic nature of MATLAB. Since
MATLAB does not have static type declarations, neither the shape and size of
arrays, nor the loop bounds are known at compile-time. This means that many
standard array dependence tests and associated transformations cannot be applied
straight-forwardly. On the other hand, many MATLAB programs operate on arrays
using loops and thus are ideal candidates for loop transformations and possibly
loop vectorization/parallelization.
This paper presents a new framework,McFLAT,which uses profile-based training
runs to determine likely loop-bounds ranges for which specialized versions of the
loops may be generated. The main idea is to collect information about observed
loop bounds and hot loops using training data which is then used to heuristically
decide upon which loops and which ranges are worth specializing using a variety
of loop transformations.

Our McFLAT framework has been implemented as part of the McLAB extensible
compiler toolkit. Currently, McFLAT, is used to automatically transform ordinary
MATLAB code into specialized MATLAB code with transformations applied
to it. This specialized code can be executed on any MATLAB system, and we report
results for four execution engines, Mathwork.s proprietary MATLAB system,
the GNU Octave open-source interpreter, McLAB.s McVM interpreter and the
McVM JIT. For several benchmarks, we observed significant speedups for the
specialized versions, and noted that loop transformations had different impacts
depending on the loop range and execution engine.

Abstract
MATLAB has gained widespread acceptance among engineers and scientists.
Several aspects of the language such as dynamic loading and typing, safe
updates, and copy semantics for arrays contribute to its appeal, but at
the same time provide many challenges to the compiler and virtual
machine. One such problem, minimizing the number of copies and copy
checks for MATLAB programs has not received much attention. Existing
MATLAB systems rely on reference-counting schemes to create copies only
when a shared array representation is updated. This reduces array
copies, but increases the number of runtime checks. In addition, this
sort of reference-counted approach does not work in a garbage-collected
system.

In this paper we present a staged static analysis approach that does
not require reference counts, thus enabling a garbage-collected virtual
machine. Our approach eliminates both unneeded array copies and
does not require runtime checks. The first stage combines two
simple, yet fast, intraprocedural analyses to eliminate unnecessary
copies. The second stage is comprised of two analyses that
together determine whether a copy should be performed before an array is
updated: the first, necessary copy analysis, is a forward flow
analysis and determines the program points where array copies are required
while the second, copy placement analysis, is a backward analysis
that finds the optimal points to place copies, which also guarantee safe
array updates.

We have implemented our approach in the McVM JIT, which is part of a
garbage-collected virtual machine for MATLAB. Our results demonstrate
that there are significant overheads for both existing reference-counted and
naive copy-insertion approaches. We also show that our staged approach is
effective. In some cases the first stage is sufficient, but in many
cases the second stage is required. Overall, our approach required no
dynamic checks and successfully eliminated all unnecessary copies, for
our benchmark set.

Abstract
Return value prediction (RVP) is a technique for guessing the return
value from a function before it actually completes, enabling a number
of program optimizations and analyses. However, despite the
apparent usefulness, RVP and value prediction in general have seen
limited uptake in practice. Hardware proposals have been
successful in terms of speed and prediction accuracy, but the cost of
dedicated circuitry is high, the available memory for prediction is
low, and the flexibility is negligible. Software solutions are
inherently much more flexible, but can only achieve high accuracies in
exchange for reduced speed and increased memory consumption. In
this work we first express many different existing prediction
strategies in a unification framework, using it as the basis for a
software implementation. We then explore an adaptive software
RVP design that relies on simple object-orientation in a hybrid
predictor. It allocates predictors on a per-callsite basis
instead of globally, and frees the resources associated with unused
hybrid sub-predictors after an initial warmup period. We find
that these techniques dramatically improve speed and reduce memory
consumption while maintaining high prediction accuracy.

Abstract
Method level speculation (MLS) is an optimistic technique for
parallelizing imperative programs, for which a variety of MLS systems
and optimizations have been proposed. However, runtime
performance strongly depends on the interaction between program
structure and MLS system design choices, making it difficult to
compare approaches or understand in a general way how programs behave
under MLS. In this work we seek to establish a basic framework
for understanding MLS designs. We first present a stack-based
abstraction of MLS that encompasses major design choices such as
in-order and out-of-order nesting and is also suitable for
implementations. We then use this abstraction to develop the
structural operational semantics for a series of progressively more
flexible MLS models. Assuming the most flexible such model, we
provide transition-based visualizations that reveal the speculative
execution behaviour for a number of simple imperative programs. These
visualizations show how specific parallelization patterns can be
induced by combining common programming idioms with precise decisions
about where to speculate. We find that the runtime
parallelization structures are complex and non-intuitive, and that
both in-order and out-of-order nesting are important for exposing
parallelism. The overwhelming conclusion is that either
programmer or compiler knowledge of how implicit parallelism will
develop at runtime is necessary to maximize performance.

Abstract
MATLAB has gained widespread acceptance among engineers and scientists
as a platform for programming engineering and scientific applications.
Several aspects of the language such as dynamic loading and typing,
safe updates, and copy semantics for arrays contribute to its appeal,
but at the same time provide many challenges to the compiler and
virtual machine. One such problem, minimizing the number of copies
and copy checks for MATLAB programs has not received much attention.
Existing MATLAB systems use a reference counting scheme to create
copies only when a shared array representation is updated. This reduces
array copies, but increases the number of runtime checks. In this paper
we present a static analysis approach that eliminates both the unneeded
array copies and the runtime checks. Our approach comprises of two
analyses that together determine whether a copy should be performed
before an array is updated: the first, necessary copy analysis,
is a forward flow analysis and determines the program points where array
copies are required while the second, copy placement analysis,
is a backward analysis that finds the optimal points to place copies,
which also guarantee safe array updates.
We have implemented our approach as part of the McVM JIT and compared
our approach to Mathwork's commercial MATLAB implementation and the
open-source Octave implementation. The results show that, on our
benchmark set, our approach is effective at minimizing the number of
copies without requiring run-time checks.

Abstract
This paper introduces a new aspect-oriented programming language,
AspectMatlab. Matlab is a dynamic scientific programming
language that is commonly used by scientists because of its
convenient and high-level syntax for arrays, the fact that type
declarations are not required, and the availability of a rich
set of application libraries. AspectMatlab introduces key
aspect-oriented features in a way that is both accessible to
scientists and where the aspect-oriented features concentrate
on array accesses and loops, the core computation elements in
scientific programs. Introducing aspects into a dynamic
language such as Matlab also provides some new challenges.
In particular, it is difficult to statically determine precisely
where patterns match, resulting in many dynamic checks in the
woven code. Our compiler includes flow analyses which are used
to eliminate many of those dynamic checks. This paper reports
on the language design of AspectMatlab, the amc compiler
implementation and related optimizations, and also provides an
overview of use cases that are specific to scientific
programming.

Abstract
Method level speculation (MLS) is an optimistic technique for
parallelizing imperative programs, for which a variety of MLS
systems and optimizations have been proposed. However, runtime
performance strongly depends on the interaction between program
structure and MLS system design choices, making it difficult to
compare approaches or understand in a general way how programs
behave under MLS. Here we develop an abstract list-based model of
speculative execution that encompasses several MLS designs, and a
concrete stack-based model that is suitable for implementations.
Using our abstract model, we show equivalence and correctness for a
variety of MLS designs, unifying in-order and out-of-order execution
models. Using our concrete model, we directly explore the execution
behaviour of simple imperative programs, and show how specific
parallelization patterns are induced by combining common programming
idioms with precise speculation decisions. This basic groundwork
establishes a common basis for understanding MLS designs, and
suggests more formal directions for optimizing MLS behaviour and
application.

Abstract
Return value prediction (RVP) is a useful technique that enables a
number of program optimizations and analyses. Potentially high
overhead, however, as well as a dependence on novel hardware support
remain significant barriers to method level speculation and other
applications that depend on low cost, high accuracy RVP. Here we
investigate the feasibility of software-based RVP through a
comprehensive software study of RVP behaviour. We develop a
structured framework for RVP design and use it to experimentally
analyze existing and novel predictors in the context of non-trivial
Java programs. As well as measuring accuracy, time, and memory
overhead, we show that an object-oriented adaptive hybrid predictor
model can significantly improve performance while maintaining high
accuracy levels. We consider the impact on speculative
parallelism, and show further application of RVP to program
understanding. Our results suggest that software return value
prediction can play a practical role in further program optimization
and analysis.

Abstract
The memory alignment of executable code can have significant yet hard to predict effects on
the run-time performance of programs, even in the presence of existing aggressive optimization.
We present an investigation of two different approaches to instruction cache optimization\97the
first is an implementation of a well-known and established compile-time heuristic, and the second
is our own stochastic, genetic search algorithm based on active profiling. Both algorithms were
tested on 7 real-world programs and speed gains of up to 10% were observed. Our results show
that, although more time consuming, our approach yields higher average performance gains than
established heuristic solutions. This suggests there is room for improvement, and moreover that
stochastic and learning-based optimization approaches like our own may be worth investigating
further as a means of better exploiting hardware features.

Abstract
Speculative multithreading is a promising technique for automatic
parallelization. However, our experience with a software
implementation indicates that there is significant overhead involved
in managing the heap memory internal to the speculative system and
significant complexity in correctly managing the call stack memory
used by speculative tasks. We first describe a specialized heap
management system that allows the data structures necessary to support
speculation to be quickly recycled. We take advantage of
constraints imposed by our speculation model to reduce the complexity
of this otherwise general producer/consumer memory problem. We
next provide a call stack abstraction that allows the local variable
requirements of in-order and out-of-order nested method level
speculation to be expressed and hence implemented in a relatively
simple fashion. We believe that heap and stack memory management
issues are important to efficient speculative system design. We
also believe that abstractions such as ours help provide a framework
for understanding the requirements of different speculation
models.

Software engineering tools often deal with the source code of programs retrieved
from the web or source code repositories. Typically, these tools only have
access to a subset of the programs' source code (one file or a subset of files)
which makes it difficult to build a complete and typed intermediate
representation (IR). Indeed, for incomplete object-oriented programs, it is not
always possible to completely disambiguate the syntactic constructs and to
recover the declared type of certain expressions because the declaration of many
types and class members are not accessible.

We present a framework that performs partial type inference and uses heuristics
to recover the declared type of expressions and resolve ambiguities in partial
Java programs. Our framework produces a complete and typed IR suitable for
further static analysis. We have implemented this framework and used it in an
empirical study on four large open source systems which shows that our system
recovers most declared types with a low error rate, even when only one class is
accessible.

In runtime monitoring, a programmer specifies
code to execute whenever a sequence of events occurs during program execution.
Previous and related work has shown that runtime monitoring techniques can be
useful in order to validate or guarantee the safety and security of running
programs.
Those techniques have however not been incorporated in
everyday software development processes.
One problem that hinders industry adoption is that the required
specifications use a cumbersome, textual notation.
As a consequence, only verification experts, not programmers, can
understand what a given specification means and in particular,
whether it is correct.
In 2001, researchers at Bell Labs proposed the Timeline formalism.
This formalism was designed with ease of use in mind,
for the purpose of static verification (and not, as in our work,
for runtime monitoring).

In this article, we describe how software safety specifications
can be described visually in the Timeline formalism and subsequently
transformed into finite automata suitable for runtime monitoring,
using our meta-modelling and model transformation tool \At.
The synthesized automata are subsequently fed into an
existing monitoring back-end that
generates efficient runtime monitors for them.
Those monitors can then automatically be applied to Java programs.

Our work shows that the transformation of Timeline models to automata is
not only feasible in an efficient and sound way but also
helps programmers identify correspondences between the original
specification and the generated monitors.
We argue that visual specification of safety criteria and subsequent
automatic synthesis of runtime monitors will help users reason about the
correctness of their specifications on the one hand and effectively
deploy them in industrial settings on the other hand.

AbstractCode-copying is a very attractive, simple-JIT, highly portable fast bytecode execution technique with
implementation costs close to a classical interpreter. The idea is to reuse, at Virtual Machine (VM)
runtime, chunks of VM binary code, copy them, concatenate and execute together at a greater speed.
Unfortunatelly code-copying makes anwarranted assumptions about the code generated by the compiler
used to compile the VM. It is therefore necessary to find which VM code chunks can be safely used
after being copied and which can not. Previously used manual testing and assembly analysis were highly
ineffective and error prone. In this work we present a test suite, Bytecode Testing Framework (BTF) that
contains a comprehensive set of fine-grained Java assembly tests. These tests are then used in conjunction
with SableVM JVM testing mode to semi-automatically detect one-by-one which code chunks are safe
to use. With this techunique we ported SableVM\E2s code-copying engine to several architectures and had
external contributors with no VM knowledge port SableVM to new architectures within as little as one
hour. The presented Bytecode Testing Framework greatly improved reliability and ease of porting of
SableVM's code-copying engine, thus its practical usability.

Abstract
Pointer analyses enable many subsequent program analyses and
transformations, since they enable compilers to statically
disambiguate references to the heap. Extra precision enables pointer
analysis clients to draw stronger conclusions about
programs. Flow-sensitive pointer analyses are typically quite precise.
Unfortunately, flow-sensitive pointer analyses are also often too expensive
to run on whole programs. This paper therefore describes a technique
which sharpens results from a whole-program flow-insensitive points-to
analysis using two flow-sensitive intraprocedural analyses: a
must-not-alias analysis and a must-alias analysis. The main technical
idea is the notion of instance keys, which enable client
analyses to disambiguate object references. We distinguish two kinds of keys:
weak and strong. Strong instance keys guarantee that
different keys represent different runtime objects, allowing
instance keys to almost transparently substitute for runtime objects in static
analyses. Weak instance keys may represent multiple runtime objects, but
still enable disambiguation of compile-time values.
We assign instance keys based on the results of our analyses. We found that
the use of instance keys greatly simplified the design and implementation of
subsequent analysis phases.

Abstract
Virtual Machine authors face a difficult choice: to settle for low performance, cheap interpreter, or to
write a specialized and costly compiler. One of the methods to bridge the gap between these two distant
solutions is to use the existing code-copying technique that reuses chunks of VM\E2s binary code creating a
simple JIT. While simple in principle this technique is not reliable without a compiler that can guarantee
that copied chunks are functionally equivalent, which is often not the case due to aggressive optimizations.
We present a proof-of-concept, minimal-impact modification of a highly optimizing compiler,
GCC. It allows a VM programmer to mark specific chunks of VM source code as copyable. The chunks
of native code resulting from compilation of the marked source become addressable and self-contained.
Chunks can be safely copied at VM runtime, concatenated and executed together. With minimal impact
on compiler maintenance we guarantee the necessary safety and correctness properties of chunks. This
allows code-copying VMs to safely achieve performance improvement up to 200%, 67% average, over
direct interpretation. ensured thanks to chunks integrity verification. This maintanable enhancement
makes the code-copying technique reliable and thus practially usable.

Abstract
Software engineers often need to perform static analysis on a subset of a program source
code. Unfortunately, static analysis frameworks usually fail to complete their analyses be-
cause the definitions of some types used within a partial program are unavailable and the
complete program type hierarchy cannot be reconstructed. We propose a technique, Partial
Program Analysis, to generate and infer type facts in incomplete Java programs to allow static
analysis frameworks to complete their analyses. We describe the various levels of inference
soundness our technique provides, and then, we cover the main algorithm and type inference
strategies we use. We conclude with a detailed case study showing how our technique can
provide more precise type facts than standard parsers and abstract syntax tree generators.

Abstract
This document is an updated and translated version of the German paper
Arithmetische Kodierung from 2002.
It tries to be a comprehensive guide to the art of arithmetic coding.
First we give an introduction to the mathematic principles involved. These build the
foundation for the next chapters, where we describe the encoding and decoding
algorithms for different numerical systems. Here we also mention various problems one can
come across as well as solutions for those.
This is followed by a proof of uniqueness and an estimation of the efficiency of the
algorithm. In the end we briefly mention different kinds of statistical models, which are
used to actually gain compression through the encoding. Throughout this paper we
occasionally make some comparisons to the related Huffman encoding algorithm. Though, some
rudimentary knowledge about Huffman encoding should suffice for the reader to follow the
line of reasoning.
This paper is mainly based on the works on Sayood and Bell. On the latter we
base our implementation which is included in the appendix as full C++ source code. We also
make use of parts of this code during some of our examples. The mathematical model we use,
however, is strongly based on Sayood and Fano. Also we employ the well-known
Shannon-Theorem, which proofs the entropy to be the bound of feasible
lossless compression.

Abstract
Modern JIT compilers often employ multi-level recompilation
strategies as a means of ensuring the most used code is also the most
highly optimized, balancing optimization costs and expected future
performance. Accurate selection of code to compile and
level of optimization to apply is thus important to performance. In this paper
we investigate the effect of an improved recompilation strategy for
a Java virtual machine. Our design makes use of a lightweight, low-level
profiling mechanism to detect high-level, variable length phases in
program execution. Phases are then used to guide adaptive
recompilation choices, improving performance. We develop both an offline implementation
based on trace data and a self-contained online version.
Our offline study shows an average speedup of 8.5% and up to 21%, and our online
system achieves an average speedup of 4.5%, up to 18%.
We subject our results to extensive analysis and show that our
design achieves good overall performance with high consistency
despite the existence of many complex and interacting factors in such
an environment.

Abstract
The choice of lock objects in concurrent programs can affect both
performance and correctness, a burden of complexity for programmers.
Recently, various automated lock allocation and assignment techniques
have been proposed, each aiming primarily to minimize the number of
conflicts between critical sections. However, practical performance
depends on a number of important factors, including the nature of
concurrent interaction, the accuracy of the program analyses used to
support the lock allocation, and the underlying machine hardware. We
introduce component-based lock allocation, which starts by
analysing data dependences and automatically assigns lock objects with
tunable granularity to groups of interfering critical sections.
Our experimental results show that while a single global lock is
usually suboptimal, high accuracy in program analysis is not always
necessary to achieve generally good performance. Our work provides
empirical evidence as to sufficient strategies for efficient lock
allocation, and details the requirements of an adaptable
implementation.

Abstract
Detecting repetitive ``phases'' in program execution is helpful for
program understanding, runtime optimization, and for reducing
simulation/profiling workload. The nature of the phases that may be
found, however, depend on the kinds of programs, as well how programs
interact with the underlying hardware. We present a technique to
detect long term and variable length repetitive program behaviour by
monitoring microarchitecture-level hardware events. Our approach results
in high quality repetitive phase detection; we propose quantitative
methods of evaluation, and show that our design accurately calculates
phases with a 92% ``confidence'' level. We further validate our design
through an implementation and analysis of phase-driven, runtime
profiling, showing a reduction of about 50% of the profiling workload
while still preserving 94% accuracy in the profiling results. Our work
confirms that it is practical to detect high-level phases from
lightweight hardware monitoring, and to use such information to improve
runtime performance.

Abstract Speculative multithreading (SpMT), or thread level
speculation (TLS), is a dynamic parallelisation technique that uses
out-of-order execution and memory buffering to achieve speedup. The
complexity of implementing software SpMT results in long development
lead times, and the lack of reuse between different software SpMT
systems makes comparisons difficult. In order to address these
problems we have created libspmt, a new language independent library
for speculative multithreading. It provides a virtualization of
speculative execution components that are unavailable in commercial
multiprocessors, and enables method level speculation when fully
integrated. We have isolated a clean and minimal library interface,
and the corresponding implementation is highly modular. It can
accommodate hosts that implement garbage collection, exceptions, and
non-speculative multithreading. We created libspmt by refactoring a
previous implementation of software SpMT for Java, and by aiming for
modularity have exposed several new opportunities for optimization.

Software obfuscators are used to transform code so that it becomes more
difficult to understand and harder to reverse engineer. Obfuscation of
Java programs is particularly important since Java's binary form, Java
bytecode, is relatively high-level and susceptible to high-quality
decompilation. The objective of our work is to develop and study
obfuscation techniques that produce obfuscated bytecode that is very
hard to reverse engineer while at the same time not significantly
degrading performance.
We present three kinds of obfuscation techniques that: (1) obscure intent
at the operational level; (2) obfuscate program structure
such as complicating control flow and confusing object-oriented design;
and (3) exploit the semantic gap between what is allowed at the
Java source level and what is allowed at the bytecode level.
We have implemented all of our techniques as a tool called JBCO (Java
Byte Code Obfuscator), which is built on top of the Soot framework. We
developed a benchmark suite for evaluating our obfuscations and we
examine runtime performance, control flow graph complexity and the
negative impact on decompilers. These results show that most of the
obfuscations do not degrade performance significantly and many increase
complexity, making reverse engineering more difficult. The impact on
decompilers was two-fold. For those obfuscations that can be
decompiled, readability is greatly reduced. Otherwise, the decompilers
fail to produce legal source code or crash completely.

Java developers often use decompilers to aid reverse engineering and
obfuscators to prevent reverse engineering. Decompilers translate
low-level
class files to Java source and often produce ``good" output.
Obfuscators
rewrite class files into semantically-equivalent class files that are
either: (1) difficult to decompile, or (2) decompilable, but
produce ``hard-to-understand" Java source.
This paper presents a set of metrics we have developed to quantify
the
effectiveness of decompilers and obfuscators. The metrics include
some
selective size and counting metrics, a expression complexity metric
and
a metric for evaluating the complexity of identifier names.
We have applied these metrics to evaluate a collection of
decompilers. By
comparing metrics of the original Java source and the decompiled
source we can
show when the decompilers produced ``good" code (as compared to the
original
code). We have also applied the metrics on the code produced by
obfuscators
and show how the metrics indicate that the obfuscators produced
``hard-to-understand" code (as compared to the unobfuscated code).
Our work provides a first step in evaluating the effectiveness of
decompilers
and obfuscators and we plan to apply these techniques to evaluate new
decompiler and obfuscator tools and techniques.

The AspectBench Compiler (abc) is an extensible AspectJ compiler and built based on Polyglot and Soot. It generates optimized Java bytecode which can run faster in many cases when compared with ajc. However, the compilation speed of abc can be substantially slower than ajc. Typically it is roughly 4 times slower than ajc. By using various profiling tools, we observed some bottlenecks during the process of compiling AspectJ source code. We modified some data structures related to forward and backward flow analysis and changed some algorithms such as the variable name generator and around inlining to remove the bottlenecks or relieve the severity of those bottlenecks. The experimental results on selected benchmarks shows that the combined modifications reduce overall by 8% compilation time as compared with original abc. The speed up is especially notable for the benchmarks with around advice applied many places. At the same time, we also reduce the class file size for the benchmarks with around advices applied many places greatly, around 28%.

Java decompilers convert Java class files to Java source. Java class files
may be created by a number of different tools including standard Java
compilers, compilers for other languages such as AspectJ, or other tools
such as optimizers or obfuscators. There are two kinds of Java decompilers,
javac-specific decompilers that assume that the class file was
created by a standard javac compiler and
tool-independent decompilers that can decompile arbitrary class files,
independent of the tool that created the class files. Typically
javac-specific decompilers produce more readable code, but they
fail to decompile many class files produced by other tools.

This paper tackles the problem of how to make a tool-independent
decompiler, Dava, produce Java source code that is programmer-friendly.
In past work it has been shown that Dava can decompile arbitrary
class files, but often the output, although correct, is very different
from what a programmer would write and is hard to understand.
Furthermore, tools like obfuscators
intentionally confuse the class files and this also leads
to confusing decompiled source files.

Given that Dava already produces correct Java abstract syntax trees (ASTs)
for arbitrary class files, we provide a new back-end for Dava. The
back-end rewrites the ASTs to semantically equivalent
ASTs that correspond to code that is easier
for programmers to understand. Our new back-end includes a new
AST traversal framework, a set of simple pattern-based transformations,
a structure-based data flow analysis framework and a collection of more
advanced AST transformations that use flow analysis information. We
include several illustrative examples including the use of advanced
transformations to clean up obfuscated code.

Most programs exhibit significant repetition of behaviour, and
detecting program phases is an increasingly
important aspect of dynamic, adaptive program optimizations.
Phase-based optimization can also reduce profiling overhead,
save simulation time, and help in program understanding.
Using a novel state space characterization, we give an an overview of
state-of-the-art phase detection and prediction techniques.
We use this systematic exposition to situate our own long-range
phase analysis, and evaluate our approach using a novel phase
prediction evaluation metric. Our results show that phases with
relatively long periods can be predicted with
very high accuracy. This new technique, as well as the
unexplored regions in our high level characterization suggest
that further improvements and variations on program phase analysis
are both possible and potentially quite useful.

Many new Java runtime optimizations report relatively small,
single-digit performance improvements. On modern virtual and actual
hardware, however, the performance impact of an optimization
can be influenced by a variety of factors in the underlying systems.
Using a
case study of a new garbage collection optimization in two different
Java virtual machines, we show the various issues that must be taken
into consideration when claiming an improvement. We examine the
specific and overall performance changes due to our optimization and
show how unintended side-effects can contribute to, and distort the
final assessment. Our experience shows that VM and hardware concerns
can generate variances of up to 9.5% in whole program execution time.
Consideration of these confounding effects is critical to a good
understanding of Java performance and optimization.

Analysis of dynamic data structure usage is useful for both program
understanding and for improving the accuracy of other program
analyses. Static shape analysis techniques, however, suffer from
reduced accuracy in complex situations. We have designed and
implemented a dynamic shape analysis system that allows one to examine
and analyze how Java programs build and modify data structures. Using
a complete execution trace from a profiled run of the program, we
build an internal representation that mirrors the evolving runtime
data structures. The resulting series of representations can then be
analyzed and visualized, and we show how to use our approach to help
understand how programs use data structures, the precise effect of
garbage collection, and to establish limits on static techniques for
shape analysis. A deep understanding of dynamic data structures is
particularly important for modern, object-oriented languages that make
extensive use of heap-based data structures.

We present the results of an empirical study evaluating the precision
of subset-based points-to analysis with several variations of context
sensitivity on Java benchmarks of significant size. We compare the use
of call site strings as the context abstraction, object sensitivity,
and the BDD-based context-sensitive algorithm proposed by Zhu and
Calman, and by Whaley and Lam. Our study includes analyses that
context-sensitively specialize only pointer variables, as well as
ones that also specialize the heap abstraction. We measure both
characteristics of the points-to sets themselves, as well as effects
on the precision of client analyses. To guide development of efficient
analysis implementations, we measure the number of contexts, the number
of distinct contexts, and the number of distinct points-to sets that
arise with each context sensitivity variation. To evaluate precision, we
measure the size of the call graph in terms of methods and edges, the
number of devirtualizable call sites, and the number of casts statically
provable to be safe.

The results of our study indicate that object-sensitive analysis
implementations are likely to scale better and more predictably than the
other approaches; that object-sensitive analyses are more precise than
comparable variations of the other approaches; that specializing the
heap abstraction improves precision more than extending the length of
context strings; and that the profusion of cycles in Java call graphs
severely reduces precision of analyses that forsake context sensitivity
in cyclic regions.

Abstract
Speculative multithreading (SpMT) is a dynamic program
parallelisation technique that promises dramatic speedup of
irregular, pointer-based programs as well as numerical, loop-based
programs.
We present the design and implementation of software-only SpMT for
Java at the virtual machine level. We take the full Java language
into account and we are able to run and analyse real world
benchmarks in reasonable execution times on commodity multiprocessor
hardware.
We provide an experimental analysis of benchmark behaviour,
uncovered parallelism, the impact of return value prediction,
processor scalability, and a breakdown of overhead costs.

Abstract
An aspect observes the execution of a base program; when certain actions occur, the aspect runs some extra code of its own. In the AspectJ language, the observations that an aspect can make are confined to the current action: it is not possible to directly observe the history of a computation.

Recently there have been several interesting proposals for new history-based language features, most notably by Douence et al. and also by Walker and Viggers. In this paper we present a new history-based language feature called tracematches, where the programmer can trigger the execution of extra code by specifying a regular pattern of events in a computation trace. We have fully designed and implemented tracematches as a seamless extension to AspectJ.

A key innovation in our tracematch approach is the introduction of free variables in the matching patterns. This enhancement enables a whole new class of applications where events can be matched not only by the event kind, but also by the values associated with the free variables. We provide several examples of applications enabled by this feature.

After introducing and motivating the idea of tracematches via examples, we present a detailed semantics of our language design, and we derive an implementation from that semantics. The implementation has been realised as an extension of the abc compiler for AspectJ.

Abstract
AspectJ, an aspect-oriented extension of Java, is becoming increasingly
popular. However, not much work has been
directed at optimising compilers for AspectJ.
Optimising AOP languages provides many new and interesting
challenges for compiler writers, and this paper identifies and
addresses three such challenges.

First, compiling around advice efficiently is particularly challenging.
We provide a new code generation strategy for around
advice, which (unlike previous implementations) both avoids
the use of excessive inlining and the use of closures. We show it
leads to more compact code, and can also improve run-time
performance. Second, woven code sometimes includes run-time tests
to determine whether advice should execute. One important
case is the cflow pointcut which uses information about the
dynamic calling context. Previous techniques for cflow
were very costly in terms of both time and space. We
present new techniques to minimize or eliminate the overhead of
cflow using both intra- and inter-procedural analyses.
Third, we have addressed the general problem of how to structure an
optimising compiler so that traditional analyses can be
easily adapted to the AOP setting.

We have implemented all of the techniques in this paper
in abc, our AspectBench Compiler for AspectJ, and
we demonstrate significant speedups with empirical results.
Some of our techniques have already been integrated into the production
AspectJ compiler, ajc 1.2.1.

Abstract
We describe the effect of a particular form of "noise" in
benchmarking. We investigate the source of anomalous measurement data
in a series of optimization strategies that attempt to improve data
cache performance in the garbage collector of a Java virtual
machine. The results of our experiments can be explained in terms of
the difference in code positioning, and hence instruction and data
cache behaviour. We show that unintended changes in code positioning
due to code modifications as trivial as symbol renaming can contribute
up to 2.7% of measured machine cycle cost, 20% in data cache misses,
and 37% in instruction cache misses.

Abstract
We present an algorithmic approach to dataflow
analysis of the pi-calculus that unifies previous such analyses under
a single framework. The structure of the algorithms presented
lead to a dissociation of correctness proofs from the analysis itself,
and this has led to the development of an algorithm that improves the
accuracy of previous approaches by considering the operation of all
process operators including sequencing and replication. Our
analysis solution is also independent of any concurrent context in
which the analyzed process runs.

Abstract
Speculative method-level parallelism has been shown to benefit from
return value prediction. In this paper we propose and analyse two
compiler analyses designed to improve the cost and performance of a
hybrid return value predictor in a Java virtual machine setting. A
return value use analysis determines which return values are consumed,
and enables us to eliminate 2.6% of all non-void dynamic method calls
in the SPEC JVM98 benchmarks as prediction candidates. An extension
of this analysis detects return values used only inside boolean and
branch expressions, and allows us to safely substitute incorrect
predictions for 14.1% of return values at runtime, provided all future
use expressions are satisfied correctly. We find an average 3.2%
reduction in memory and up to 7% increase in accuracy. A second,
interprocedural parameter dependence analysis reduces the number of
inputs to a memoization-based sub-predictor for 10.2% of all method
calls, and the combination of both analyses reduces overall memory
requirements by 5.3%.

Abstract
Aspect-oriented programming and the development of aspect-oriented
languages is rapidly gaining momentum, and the advent of this new kind
of programming language provides interesting challenges for compiler
developers. Aspect-oriented language features require new compilation
approaches, both in the frontend semantic analysis and in the backend
code generation. This paper is about the design and implementation of
the abc compiler for the aspect-oriented language AspectJ.

One important contribution of this paper is to show how we can leverage
existing compiler technology by combining Polyglot (an extensible
compiler framework for Java) in the frontend and Soot (a framework for
analysis and transformation of Java) in the backend. In particular,
the extra semantic checks needed to compile AspectJ are implemented
by extending and augmenting the compiler passes provided by Polyglot.
All AspectJ-specific language constructs are then extracted from
the program to leave a pure Java program that can be compiled using
the existing Java code generation facility of Soot. Finally, all
code generation and transformation is performed on Jimple, Soot's
intermediate representation, which allows for a clean and convenient
method of applying aspects to source code and class files alike.

A second important contribution of the paper is that we describe our
implementation strategies for the new challenges that are specific to
aspect-oriented language constructs. Although our compiler is targeted
towards AspectJ, many of these ideas apply to aspect-oriented languages
in general.

Our abc compiler implements the full AspectJ language as defined by ajc
1.2.0 and is freely available under the GNU LGPL.

Abstract
Inter-procedural analyses such as side-effect analysis can provide
information useful for performing aggressive optimizations. We present
a study of whether side-effect information improves performance in
just-in-time (JIT) compilers, and if so, what level of analysis
precision is needed.

We used Spark, the inter-procedural analysis component of the Soot Java
analysis and optimization framework, to compute side-effect information
and encode it in class files. We modified Jikes RVM, a research JIT,
to make use of side-effect analysis in local common sub-expression
elimination, heap SSA, redundant load elimination and loop-invariant
code motion. On the SpecJVM98 benchmarks, we measured the static number
of memory operations removed, the dynamic counts of memory reads
eliminated, and the execution time.

Our results show that the use of side-effect analysis increases the
number of static opportunities for load elimination by up to 98%, and
reduces dynamic field read instructions by up to 27%. Side-effect
information enabled speedups in the range of 1.08x to 1.20x for some
benchmarks. Finally, among the different levels of precision of
side-effect information, a simple side-effect analysis is usually
sufficient to obtain most of these speedups.

Abstract
Method inlining is one of the most important optimizations for JIT
compilers in Java virtual machines. In order to increase the number of
inlining opportunities, a type analysis can be used to identify
monomorphic virtual calls. In a JIT environment, the compiler and
type analysis must also handle dynamic class loading properly because
class loading can invalidate previous analysis results and invalidate
some speculative inlining decisions. To date, a very simple type
analysis, class hierarchy analysis (CHA), has been used successfully
in JIT compilers for speculative inlining with invalidation techniques
as backup.

This paper seeks to determine if more powerful dynamic type analyses
could further improve inlining opportunities in a JIT compiler. To
achieve this goal we developed a general dynamic type analysis
framework which we have used for designing and implementing dynamic
versions of several well-known static type analyses, including CHA,
RTA, XTA and VTA.

Surprisingly, the simple dynamic CHA is nearly as good as an ideal
type analysis for inlining virtual method calls. There is little
room for further improvement. On the other hand, only
a reachability-based interprocedural type analysis (VTA) is able to
capture the majority of monomorphic interface calls.

We also found that memory overhead is the biggest issue with dynamic
whole-program analyses. To address this problem, we outline the
design of a demand-driven interprocedural type analysis for inlining
hot interface calls.

Abstract
Research in the design of aspect-oriented programming languages
requires a workbench that facilitates easy experimentation with
new language features and implementation techniques. In particular,
new features for AspectJ have been proposed that require extensions
in many dimensions: syntax, type checking and code generation, as well as
data flow and control flow analyses.

The AspectBench Compiler (abc) is an implementation of such a workbench.
The base version of abc implements the full AspectJ language.
Its frontend is built, using the Polyglot framework, as a modular
extension of the Java language. The use of Polyglot
gives flexibility of syntax and type checking.
The backend is built using the Soot framework, to give modular code
generation and analyses.

In this paper, we outline the design of abc, focusing mostly on how
the design supports extensibility. We then provide a general overview of how
to use abc to implement an extension. Finally,
we illustrate the extension mechanisms of abc through a number of
small, but non-trivial, examples. abc is freely available under
the GNU LGPL.

Abstract
In this paper we present an implementation of May Happen in
Parallel analysis for Java that attempts to address some of the
practical implementation concerns of the original work. We describe
a design that incorporates techniques for aiding a feasible
implementation and expanding the range of acceptable inputs.
We provide experimental results showing the utility and impact
of our approach and optimizations using a variety of concurrent
benchmarks.

Abstract
This paper proposes and implements a rigorous method for studying the
dynamic behaviour of AspectJ programs. As part of this methodology
several new metrics specific to AspectJ programs are proposed and
tools for collecting the relevant metrics are presented. The major
tools consist of: (1) a modified version of the AspectJ compiler that tags
bytecode instructions with an indication of the cause of their generation,
such as a particular feature of AspectJ; and (2) a modified version of the
*J dynamic metrics collection tool which is composed of a JVMPI-based
trace generator and an analyzer which propagates tags and
computes the proposed metrics. This dynamic propagation is essential,
and thus this paper contributes not only
new metrics, but also non-trivial ways of computing them.

We furthermore
present a set of benchmarks that exercise a wide range of AspectJ's
features, and the metrics that we measured on these
benchmarks. The results provide guidance to AspectJ users on how to avoid
efficiency pitfalls, to AspectJ implementors on promising
areas for future optimization, and to tool builders on ways to
understand runtime behaviour of AspectJ.

Abstract We present a new algorithm for computing
control flow analysis on the pi-calculus. Our approach is strictly
more accurate then existing algorithms, while maintaining a polynomial
running time. Our algorithm is also partially compositional, in the
sense that the representational structure that results from analyzing a
process can be efficiently reused in an analysis of the same process in
another context.

Abstract
In this paper we present Jedd, a language extension to Java that supports
a convenient way of programming with Binary Decision Diagrams (BDDs).
The Jedd language abstracts BDDs as database-style relations and operations
on relations, and provides static type rules to ensure that relational
operations are used correctly.

The paper provides a description of the Jedd language and reports on the
design and implementation of the Jedd translator and associated runtime
system. Of particular interest is the approach to assigning attributes
from the high-level relations to physical domains in the underlying BDDs, which
is done by expressing the constraints as a SAT problem and using a modern
SAT solver to compute the solution. Further, a runtime system is
defined that handles memory management issues and supports a browsable
profiling tool for tuning the key BDD operations.

The motivation for designing Jedd was to support the development of whole
program analyses based on BDDs, and we have used Jedd to express five
key interrelated whole program analyses in our Soot compiler framework.
We provide some examples of this application and discuss our experiences
using Jedd.

Abstract
Basic problems encountered when trying to accurately and reasonably
measure dynamic properties of a program are discussed. These include
problems in determining and assessing specific, desirable metric
qualities that may be perturbed by subtle and unexpected program
behaviour, as well as technical limitations on data collection. Some
interpreter or Java-specific problems are examined, and the general
issue of how to present metrics is also discussed.

Abstract
This paper presents a new, inexpensive, mechanism for constructing
a complete call graph for Java programs at runtime, and provides an
example of using the mechanism for implementing a dynamic reachability-based
interprocedural analysis (IPA), namely dynamic XTA.

Reachability-based IPAs, such as points-to
analysis, type analysis, and escape analysis, require a
context-insensitive call graph of the analyzed program.
Computing a call graph at runtime presents several challenges. First,
the overhead must be low. Second, when implementing the mechanism for
languages such as Java, both polymorphism and lazy class loading must
be dealt with correctly and efficiently.
We propose a new mechanism with low costs for constructing runtime
call graphs in a JIT environment. The mechanism uses a profiling code
stub to capture the fist-time execution of a call edge, and adds at
most one more instruction to the repeated call edge invocations.
Polymorphism and lazy class loading are handled transparently. The
call graph is constructed incrementally, and it supports optimistic
analysis and speculative optimizations with invalidations.

We also developed a dynamic, reachability-based type analysis, dynamic XTA,
as an application of runtime call graphs. It also serves as
an example of handling lazy class loading in dynamic IPAs.

The dynamic call graph construction algorithm and dynamic
version of XTA have been
implemented in Jikes RVM, and we present empirical measurements of the
overhead of call graph construction and the characteristics of
dynamic XTA.

Abstract
This paper presents the integration of Soot, a byte-code analysis and
transformation framework, with an integrated development environment (IDE),
Eclipse. Such an integrated toolkit is useful for both the compiler
developer, to aid in understanding and debugging new analyses,
and also for the end-user of the IDE, to aid in program
understanding by exposing semantic information gathered by the advanced
compiler analyses. The paper discusses these advantages and provides
concrete examples of its usefulness.

There are several major challenges to overcome in developing the integrated
toolkit, and the paper discusses three major challenges and the solutions
to those challenges. An overview of Soot and the integrated toolkit is
given, followed by a more detailed discussion of the fundamental components.
The paper concludes with several illustrative examples of using the
integrated toolkit along with a discussion of future plans and research.

Abstract
To date, Soot has employed very conservative analyses of
exceptional control flow, producing control flow
graphs which include an edge to each exception handler from every
statement within the block of code protected by
that handler, including statements which cannot
throw any exceptions caught by the handler. This
report describes some of the difficulties and
subtleties of analyzing exceptional control flow,
and describes an attempt to refine Soot's
analysis, by determining the types of exceptions
each statement has the potential to throw, and drawing edges to
exception handlers only from protected statements
which may throw exceptions the handlers would catch.

Abstract
We present a flow and context insensitive control flow analysis of the
pi-calculus which computes a superset of the names that can possibly be
transmitted over each channel. The analysis is a simpler version of the
constraint-based analysis of Bodei et al based on techniques inspired
from points-to analyis, but requires no extension to the basic calculus.
The algorithm computes the same solution as the previous one, but
with better space efficiency. Our approach is closely related to
techniques used in compiler optimizers to do points-to analyses, and is a
simpler presentation, provides insight into the inner workings of the
process, and immediately suggests other possible analyses on
pi-calculus processes.

Abstract
In order to perform meaningful experiments in optimizing compilation and
run-time system design, researchers usually rely on a suite of benchmark
programs of interest to the optimization technique under consideration.
Programs are described as numeric, memory-intensive,
concurrent, or object-oriented, based on a qualitative
appraisal, in some cases with little justification. We believe it is
beneficial to quantify the behaviour of programs with a concise and precisely
defined set of metrics, in order to make these intuitive notions of program
behaviour more concrete and subject to experimental validation. We therefore
define and measure a set of unambiguous, dynamic, robust and
architecture-independent metrics that can be used to categorize programs
according to their dynamic behaviour in five areas: size, data structure,
memory use, concurrency, and polymorphism. A framework computing some of
these metrics for Java programs is presented along with specific results
demonstrating how to use metric data to understand a program's behaviour, and
both guide and evaluate compiler optimizations.

Abstract
Existing visualization tools typically do not allow easy extension by
new visualization techniques, and are often coupled with inflexible
data input mechanisms. This paper presents EVolve, a flexible and
extensible framework for visualizing program characteristics and
behaviour. The framework is flexible in the sense that it can
visualize many kinds of data, and it is extensible in the sense that
it is quite straightforward to add new kinds of visualizations.

The overall architecture of the framework consists of
the core EVolve platform that communicates with data sources via a
well defined data protocol and which communicates with visualization methods
via a visualization protocol.

Given a data source, an end-user can use EVolve as a stand-alone tool
by interactively creating, configuring and modifying
visualizations. A variety of visualizations are provided in the
current EVolve library, with features that facilitate the comparison
of multiple views on the same execution data.
We demonstrate EVolve in the context of visualizing execution behaviour
of Java programs.

Abstract
In order to perform meaningful experiments in optimizing compilation
and run-time system design, researchers usually rely on a suite of
benchmark programs of interest to the optimization technique under
consideration. Programs are described as numeric, memory-intensive,
concurrent, or object-oriented, based on a qualitative appraisal, in
some cases with little justification. We believe it is beneficial to
quantify the behavior of programs with a concise and precisely defined
set of metrics, in order to make these intuitive notions of program
behavior more concrete and subject to experimental validation. We
therefore define a set of unambiguous, dynamic, robust and
architecture-independent metrics that can be used to categorize
programs according to their dynamic behavior in five areas: size, data
structure, memory use, concurrency, and polymorphism. A framework
computing some of these metrics for Java programs is presented along
with specific results.

Abstract
This paper reports on a new approach to solving a subset-based
points-to analysis for Java using Binary Decision Diagrams (BDDs).
In the model checking community, BDDs have been shown very effective for
representing large sets and solving very large verification problems.
Our work shows that BDDs can also be very effective for developing a
points-to analysis that is simple to implement and that
scales well, in both space and time, to large programs.

The paper first introduces BDDs and operations on BDDs using some
simple points-to examples. Then, a complete subset-based points-to
algorithm is presented, expressed completely using BDDs and BDD
operations. This algorithm is then refined by finding appropriate
variable orderings and by making the algorithm incremental, in order
to arrive at a very efficient algorithm. Experimental results are
given to justify the choice of variable ordering, to demonstrate the
improvement due to incrementalization, and to compare the performance
of the BDD-based solver to an efficient hand-coded graph-based solver.
Finally, based on the results of the BDD-based solver, a variety of
BDD-based queries are presented, including the points-to query.

Abstract
Most points-to analysis research has been done on different systems by
different groups, making it difficult to compare results, and to understand
interactions between individual factors each group studied.
Furthermore, points-to analysis for Java has been studied much less
thoroughly than for C, and the tradeoffs appear very different.

We introduce Spark, a flexible framework for experimenting with
points-to analyses for Java. Spark supports equality- and subset-based
analyses, variations in field sensitivity, respect for declared types,
variations in call graph construction, off-line simplification, and
several solving algorithms. Spark is composed of building blocks on
which new analyses can be based.

We demonstrate Spark in a substantial study of factors affecting
precision and efficiency of subset-based points-to analyses, including
interactions between these factors. Our results show that Spark is
not only flexible and modular, but also offers superior time/space
performance when compared to other points-to analysis implementations.

Abstract
Dynamic program optimization is becoming increasingly important for
achieving good runtime performance. One of the key issues in such
systems is how it selects which code to optimize. One approach is
to dynamically detect traces, long sequences of instructions which
are likely to execute to completion. Such traces can be stored in
a trace cache and dispatched one trace at a time (rather than one
instruction or one basic block at a time). Furthermore, traces have
been shown to be a good unit for optimization.

This paper reports on a new approach for dynamically detecting, creating
and storing traces in a Java virtual machine.
We first elucidate four important design criteria for a successful trace
strategy: good instruction stream coverage, low dispatch rate, cache
stability and optimizability of traces. We then present our approach which is
based on branch correlation graphs. Branch correlation graphs
store information about the correlation between pairs of branches, as
well as additional state information.

We present the complete design for an efficient implementation of the system,
including a detailed discussion of the trace cache and profiling mechanisms.
We have implemented an experimental framework to measure the traces
generated by our approach in a direct-threaded-inlining Java VM
(SableVM) and we present experimental results to show that the traces we
generate meet the design criteria.

Abstract
Traditional tracing systems are often limited to recording a fixed set of basic
program events. These limitations can frustrate an application or compiler
developer who is trying to understand and characterize the complex behavior of
software systems such as a Java program running on a Java Virtual Machine. In
the past, many developers have resorted to specialized tracing systems that
target a particular type of program event. This approach often results in an
obscure and poorly documented encoding format which can limit the reuse and
sharing of potentially valuable information. To address this problem, we present
STEP, a system designed to provide profiler developers with a standard method
for encoding general program trace data in a flexible and compact format. The
system consists of a trace data definition language along with a compiler and
architecture that simplifies the client interface by encapsulating the details
of encoding and interpretation.

Abstract
Existing visualization tools typically do not allow easy extension by
new visualization techniques, and are often coupled with inflexible
data input mechanisms. This paper presents EVolve, a flexible and
extensible framework for visualizing program characteristics and
behaviour. The framework is flexible in the sense that it can
visualize many kinds of data, and it is extensible in the sense that
it is quite straightforward to add new kinds of visualizations.

The overall architecture of the framework consists of the core EVolve
platform that communicates with data sources via a well defined data
protocol and which communicates with visualization methods via a
visualization protocol.

Given a data source, an end-user can use EVolve as a stand-alone
tool by interactively creating, configuring and modifying one or more
visualizations. A variety of visualizations are provided in the
standard EVolve visualization library. EVolve can be used to build
custom visualizers by implementing new data sources and/or new kinds
of visualizations.

The paper presents an overview of the system, examples of its use,
and discusses our experiences using the tool both as an end-user and
as a developer building custom visualizers.

Abstract
List scheduling is a well-known O(n^2) algorithm for performing instruction scheduling during
compiler code generation on modern, superscalar architecture. An asymptotically more efficient
O(n*log(n)) variation is presented; this algorithm achieves its lower time bounds by asserting
practical constraints on the structure of instruction dependencies.

Abstract
The task of developing, tuning, and debugging compiler
optimizations is a difficult one which can be facilitated by
software visualization. There are many characteristics of the
code which must be considered when studying the kinds of
optimizations which can be performed. Both static data collected
at compile-time and dynamic runtime data can reveal opportunities
for optimization and affect code transformations. Visualization
of such complex systems must include as much information as
possible, and accommodate the different sources from which this
information is acquired.

This paper presents a visualization framework designed to address
these issues. The framework is based on a new, extensible
language called JIL which provides a common format for
encapsulating intermediate representations and associating them
with compile-time and runtime data. Custom visualization
interfaces can then combine JIL data from separate tools, exposing
both static and dynamic characteristics of the underlying code.

Abstract
The Java Intermediate Language (JIL) is a subset of XML and SGML
described in this document. Its goal is to provide an
intermediate representation of Java source code suitable for
machine use. JIL benefits from the features of XML, such as
extensibility and portability, while providing a common ground for
software tools. The following document discusses the design
issues and overall framework for such a language, including a
description of all fundamental elements.

Abstract
Object-oriented languages, like Java, encourage the use of many small
objects linked together by field references, instead of a few monolithic
structures. While this practice is beneficial from a program design
perspective, it can slow down program execution by incurring many
pointer indirections. One solution to this problem is object inlining:
when the compiler can safely do so, it fuses small objects together,
thus removing the reads/writes to the removed field, saving the memory
needed to store the field and object header, and reducing the number of
object allocations.

The objective of this paper is to measure the potential for object
inlining by studying the run-time behavior of a comprehensive set of
Java programs. We study the traces of program executions in order to
determine which fields behave like inlinable fields. Since we are using
dynamic information instead of a static analysis, our results give
an upper bound on what could be achieved via a static compiler-based
approach. Our experimental results measure the potential improvements
attainable with object inlining, including reductions in the numbers of
field reads and writes, and reduced memory usage.

Our study shows that some Java programs can benefit significantly from
object inlining, with close to a 10% speedup. Somewhat to our surprise,
our study found one case, the db benchmark, where the most important
inlinable field was the result of unusual program design, and fixing
this small flaw led to both better performance and clearer program
design. However, the opportunities for object inlining are highly
dependent on the individual program being considered, and are in many
cases very limited. Furthermore, fields that are inlinable also have
properties that make them potential candidates for other optimizations
such as removing redundant memory accesses. The memory savings possible
through object inlining are moderate.

Abstract
This paper introduces an adaptive, region-based allocator for Java.
The basic idea is to allocate non-escaping objects in local regions,
which are allocated and freed in conjunction with their associated
stack frames. By releasing memory associated with these stack frames,
the burden on the garbage collector is reduced, possibly resulting in
fewer collections.

The novelty of our approach is that it does not require static escape
analysis, programmer annotations, or special type systems. The
approach is transparent to the Java programmer and relatively simple
to add to an existing JVM. The system starts by assuming that all
allocated objects are local to their stack region, and then catches
escaping objects via write barriers. When an object is caught
escaping, its associated allocation site is marked as a non-local
site, so that subsequent allocations will be put directly in the
global region. Thus, as execution proceeds, only those allocation
sites that are likely to produce non-escaping objects are allocated to
their local stack region.

The paper presents the overall idea, and then provides details of a
specific design and implementation. In particular, we present a paged
region allocator and the necessary modifications of the Jikes RVM
baseline JIT and a copying collector. Our experimental study
evaluates the idea using the SPEC JVM98 benchmarks, plus one other large
benchmark. We show that a region-based allocator is a reasonable
choice, that overheads can be kept low, and that the adaptive system
is successful at finding local regions that contain no escaping
objects.

Abstract The performance and behaviour of large
object-oriented software systems can be difficult to
profile. Furthermore, existing profiling systems are often limited
by a fixed set of events that can be profiled and a fixed set of
visualizations. This paper presents the Sable Toolkit for
Object-Oriented Profiling (STOOP), which we developed to support a
wide variety of profiling needs. The toolkit makes it simple to
build new profilers which can be used for both standard and less
conventional profiling tasks.

At the core of the toolkit is its event pipe. A profiling agent
produces events for the event pipe and a visualizer consumes events
from the pipe. To provide a general purpose event pipe we have
defined a language for defining events and a compiler which
automatically produces the event pipe interfaces for the specified
events. Many different profiling agents can produce events for the
event pipe, including JVMPI agents, instrumented VMs and instrumented
bytecode. It is also simple to adapt a variety of visualizers to
read events from the event pipe. We have worked with two
visualizers, EVolve and JIMPLEX. We summarize each visualizer and
show how easy it was to adapt them to use events produced from STOOP.

Abstract
Java virtual machines execute Java bytecode instructions. Since this
bytecode is a higher level representation than traditional object code, it
is possible to decompile it back to Java source. Many such decompilers
have been developed and the conventional wisdom is that decompiling Java
bytecode is relatively simple. This may be true when decompiling bytecode
produced directly from a specific compiler, most often Sun's javac
compiler. In this case it is really a matter of inverting a known
compilation strategy. However, there are many problems, traps and
pitfalls when decompiling arbitrary verifiable Java bytecode. Such
bytecode could be produced by other Java compilers, Java bytecode
optimizers or Java bytecode obfuscators. Java bytecode can also be
produced by compilers for other languages, including Haskell, Eiffel, ML,
Ada and Fortran. These compilers often use very different code generation
strategies from javac.

This paper outlines the problems and solutions we have found in our
development of Dava, a decompiler for arbitrary Java bytecode. We first
outline the problems in assigning types to variables and literals, and the
problems due to expression evaluation on the Java stack. Then, we look at
finding structured control flow with a particular emphasis on how to deal
with Java exceptions and synchronized blocks. Throughout the paper we
provide small examples which are not properly decompiled by commonly used
decompilers.

Abstract
This paper reports on a new approach to eliminating array bounds checks in
Java. Our approach is based upon a flow-sensitive intraprocedural analysis
called variable constraint analysis (VCA). This analysis builds a
small constraint graph for each important point in a method, and then uses
the information encoded in the graph to infer the relationship between
array index expressions and the bounds of the array. Using VCA as the
base analysis, we also show how two further analyses can improve the
results of VCA. Array field analysis is applied
on each class and provides information about some arrays stored in fields,
while rectangular array analysis is an interprocedural analysis to approximate
the shape of arrays, and is useful for finding rectangular (non-ragged) arrays.

We have implemented all three analyses using the Soot bytecode
optimization/annotation framework and we transmit the results of the
analysis to virtual machines using class file attributes. We have
modified the Kaffe JIT, and IBM's High Performance
Compiler for Java (HPCJ) to make use of these attributes, and we
demonstrate significant speedups.

Abstract
SableVM is an open-source virtual machine for Java, intended as a research framework for efficient
execution of Java bytecode.
The framework is essentially composed
of an extensible bytecode interpreter using state-of-the-art
and innovative techniques.
Written in the C programming language, and assuming
minimal system dependencies, the interpreter emphasizes high-level
techniques to support efficient execution.
In particular, we introduce new data
layouts for classes, virtual tables and object instances that
reduce the cost of interface method calls to that of normal virtual calls,
allow efficient garbage collection and light synchronization, and make
effective use of memory space.

Abstract
This paper presents a framework for supporting the optimization of
Java programs using attributes in Java class files. We show how
class file attributes may be used to convey both optimization
opportunities and profile information to a variety of Java virtual
machines including ahead-of-time compilers and just-in-time compilers.

We present our work in the context of Soot, a framework that supports
the analysis and transformation of
Java bytecode (class files).
We demonstrate the framework with attributes for elimination of array bounds
and null pointer checks, and we provide experimental results for the
Kaffe just-in-time compiler, and IBM's High Performance Compiler for Java
ahead-of-time compiler.

Abstract
Everyone who knows something about optimizing compilers knows about
common subexpression elimination. It is used to remove redundant
computations, which usually improves the execution time of a program.
The transformation is quite simple to describe, and so common
subexpression elimination must, of course, therefore be trivial to
implement. In this work, we discuss an implementation of common
subexpression elimination for Java, using the Soot framework. Our
implementation passes the Soot sentinel suite. Experimental
results are presented, showing that common subexpression elimination
slightly speeds up Java programs.

Abstract
This paper addresses the problem of resolving virtual method and
interface calls in Java bytecode.
The main focus is on a new practical technique that can
be used to analyze large applications while at the same time providing
more accurate results than class hierarchy analysis and rapid type
analysis.
We present two variations of our new technique, variable-type analysis
and a coarser-grain version called declared-type analysis.
Both of these analyses are inexpensive, easy to implement,
and our experimental results show that they scale linearly in
the size of the program.

We have implemented our new analyses
using the Soot framework, and we report on
empirical results for seven benchmarks.
We have used our techniques to build
accurate call graphs for complete applications (including libraries)
and we show that compared to a conservative call graph built
using class hierarchy analysis, our new variable-type analysis
can remove a significant number of nodes (methods) and call edges.
Further, our results show that we can improve upon the compression obtained
using rapid type analysis.

We also provide dynamic measurements of monomorphic call sites, focusing
on the benchmark code excluding libraries. We demonstrate that when
considering only the benchmark code,
both rapid type analysis and our new declared-type analysis do not add much
precision over class hierarchy analysis. However, our finer-grained
variable-type analysis does resolve significantly more
call sites, particularly for very object-oriented programs.

Abstract
This paper presents Soot, a framework for optimizing Java(tm) bytecode.
The
framework is implemented in Java and supports three intermediate
representations
for representing Java bytecode: Baf, a streamlined representation of
Java's
stack-based bytecode;
Jimple, a typed three-address intermediate
representation suitable for optimization; and Grimp, an aggregated
version of
Jimple.
Our approach to class file optimization is to first convert the
stack-based
bytecode into Jimple, a three-address form more amenable to traditional
program
optimization, and then convert the optimized Jimple back to bytecode.

In order to demonstrate that our approach is feasible, we present
experimental results showing the effects of processing class files
through
our framework. In particular, we study the techniques necessary to
effectively
translate Jimple back to bytecode, without losing performance.
Finally, we
demonstrate that class file optimization can be quite effective by
showing the results of some basic optimizations using our framework.
Our experiments
were done on ten benchmarks, including seven SPECjvm98 benchmarks, and
were
executed on five different Java virtual machine implementations.

Abstract
Although Java
bytecode has a significant amount of type information embedded in it,
there are no explicit types for local variables.
However, knowing types for local variables is very useful for both program
optimization and decompilation.
In this paper, we present a practical algorithm for
inferring static types for local variables
in a 3-address, stackless, representation of Java bytecode.

By decoupling the type inference problem from the
low level bytecode representation, and abstracting it into a
constraint system, we are able to construct a sound and efficient
algorithm for inferring the type of most bytecode found in practice.
Further, for bytecode that cannot be typed, we suggest
transformations that make it typeable.

Using this same constraint system, we show that this type inference
problem is NP-hard in general. However, this hard case does not seem
to arise in practice and our experimental results show
that all of the 17,000 methods used in our tests were successfully
and efficiently typed by our algorithm.

Abstract
Java programmers write applications and applets in plain English-like
text, and then apply a java compiler to the text to obtain {\em class
files}. Class files, which are typically transmitted across the web,
are a low-level representation of the original text; they are not
human-readable. Consider a compiler as a function from text to class
files. My goal is to compute the inverse function: given the compiled
class file, I wish to find the best approximation to the original text
possible. This is called decompilation.

Given a class file, one can build an unstructured graph representation
of the program. The main goal of my work is to develop graph
algorithms to convert these unstructured graphs into structured graphs
corresponding to high-level program constructs, and I will discuss
these in detail. I shall also mention some results concerning
possibility and impossibility which show that decompilation is always
possible if the program may be lengthened.

Abstract
In this paper we present Jimple, a 3-address intermediate representation that has been designed
to simplify analysis and transformation of Java bytecode. We motivate the need for a new
intermediate representation by illustrating several difficulties with optimizing the stack-based
Java bytecode directly. In general, these difficulties are due to the fact that bytecode
instructions affect an expression stack, and thus have implicit uses and definitions of stack
locations. We propose Jimple as an alternative representation, in which each statement refers
explicitly to the variables it uses. We provide both the definition of Jimple and a complete
procedure for translating from Java bytecode to Jimple. This definition and translation have
been implemented using Java, and finally we show how this implementation forms the heart of the
Sable research projects.

Abstract
Java programs can greatly differ in terms of required runtime support. C-like Java programs which
are usually computation intensive tend to require little, whereas heavily object oriented programs such
as compilers, can spend most of their execution time in runtime handlers. This report begins to
explore the differences in instruction level parallelism between these two extreme types of Java
programs. No solid conclusions can be drawn, however, due to an oversimplified execution model and
the scarcity of appropriate benchmarks.

Note: This report is obsolete. It covers an extremely old version of
the Jimple framework which is now called Soot.

Abstract
The Java Virtual Machine is a stack based architecture. This makes compiler analyses
and transformations more complex because of the implicit participation of the operand stack in
instructions. To simplify these tasks, we developed an alternative representation for Java
bytecode based on 3-address code and a framework to manipulate Java classfiles in this
form. This report presents an overview of the Jimple framework and the algorithms used
in the jimplification process.