Parallel hardware is today’s reality, and language extensions that ease exploiting its performance are flourishing. For most mainstream languages, one or more tailored solutions exist that address the specific needs of the language in accessing parallel hardware. Yet one widely used language is still stuck in the sequential past: JavaScript, the lingua franca of the web.

Our position is that existing solutions do not transfer well to the world of JavaScript, due to differences in programming models, the additional requirements of the web such as safety, and developer expectations. To address this, we propose River Trail, a new parallel programming API designed specifically for JavaScript, and we show how it satisfies the needs of the web. To prove that our approach is viable, we have implemented a prototype JIT compiler in Firefox that shows an order-of-magnitude performance improvement for a realistic web application.

Mobile web browsing is slow. With advances in networking technology, future mobile web browsing is increasingly limited by serial CPU performance. Researchers have proposed techniques for improving browser CPU performance by parallelizing browser algorithms and subsystems. We propose an alternative approach: parallelizing web pages rather than browser algorithms and subsystems. We present a prototype, called Adrenaline, to perform a preliminary evaluation of our position. Adrenaline consists of a server and a web browser for parallelizing web workloads. The Adrenaline system parallelizes current web pages automatically and on the fly, maintaining identical abstractions for both end users and web developers.

Our preliminary experience with Adrenaline is encouraging. We find that Adrenaline is a perfect fit for modern browsers’ plug-in architectures, requiring only minimal changes to implement in commodity browsers. We evaluate the performance of Adrenaline on a quad-core ARM system for 170 popular web sites. In one experiment, Adrenaline speeds up web browsing by 3.95×, reducing page load latency by 14.9 seconds. Among the 170 popular web sites we test, Adrenaline speeds up 151 (89%) and reduces latency by two seconds or more for 39 (23%).

High-performance computing has been on an inexorable march from gigascale to tera- and petascale, with many researchers now actively contemplating exascale (10^18, or a million trillion operations per second) systems. This progression is being accelerated by the rapid increase in multi- and many-core processors, which allow even greater opportunities for parallelism. Such densities, though, give rise to a new cohort of challenges: for example, containing system software overhead, dealing with large numbers of schedulable entities, and maintaining energy efficiency.

We are studying software and processor-architectural features that will allow us to achieve these goals. We believe that exascale operation will require significant “out of the box” thinking, specifically in terms of the role of operating systems and system software. We describe some of our research into how these goals can be achieved.

Sankaralingam Panneerselvam and Michael M. Swift, University of Wisconsin, Madison

The inexorable demand for computing power has led to increasing interest in accelerator-based designs. An accelerator is a specialized hardware unit that can perform a set of tasks with much higher performance or power efficiency than a general-purpose CPU. Accelerators may be embedded in the pipeline as a functional unit, as with SIMD instructions, or attached to the system as a separate device, as with a cryptographic co-processor.

Current operating systems provide little support for accelerators: whether integrated into a processor or attached as a device, they are treated as a CPU or a device and given no additional consideration. However, future processors may have designs that require more management by the operating system. For example, heterogeneous processors may provision only some cores with accelerators, and IBM’s wire-speed processor allows user-mode code to launch computations on a shared accelerator without kernel involvement. In such systems, the OS can improve performance by allocating accelerator resources and scheduling access to the accelerator, as it does for memory and CPU time.

In this paper, we discuss the challenges presented by adopting accelerators as an execution resource managed by the operating system. We also present the initial design of our system, which provides flexible control over where and when code executes and can apply power and performance policies. It presents a simple software interface that can leverage new hardware interfaces as well as the sharing of specialized units in a heterogeneous system.
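As a rough illustration of what treating an accelerator as an OS-scheduled resource could mean, the toy Python model below (our own sketch with invented names, not the design described above) queues requests for a shared accelerator unit and grants them in turn instead of letting callers grab the device ad hoc:

```python
# Hypothetical sketch of OS-style accelerator allocation; all names
# are illustrative, not the paper's interface.
import queue

class AcceleratorScheduler:
    def __init__(self, units):
        self.free = units             # number of accelerator units available
        self.pending = queue.Queue()  # FIFO of tasks waiting for a unit

    def request(self, task):
        """Grant a unit immediately if one is free, otherwise queue."""
        if self.free > 0:
            self.free -= 1
            return f"{task}: granted"
        self.pending.put(task)
        return f"{task}: queued"

    def release(self):
        """Hand a freed unit to the next waiter, or mark it idle."""
        if not self.pending.empty():
            return f"{self.pending.get()}: granted"
        self.free += 1
        return "unit idle"

sched = AcceleratorScheduler(units=1)
print(sched.request("crypto"))  # crypto: granted
print(sched.request("xml"))     # xml: queued
print(sched.release())          # xml: granted
```

A real design would also weigh power and performance policies when choosing the next waiter, rather than serving strictly FIFO.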

This paper presents a lightweight task framework and accompanying type system that statically guarantee deterministic execution. The framework is based on the familiar model of fork-join parallelism, but with two important twists. First, child tasks do not begin execution immediately upon creation; rather, they are both scheduled and joined as one atomic action, which prevents the parent task from racing with its children. Second, the body of a child task is specified as a parallel closure.

Parallel closures are a novel variation on traditional closures in which the data inherited from the environment is read-only. Parallel closures have the important property that they can be executed in parallel with one another without creating data races, even if they share the same environment. We also have a controlled means to grant mutable access to data in the environment where necessary.

We have implemented a prototype of our framework in Java. The prototype includes a typechecker that enforces the constraint that parallel closures cannot modify their environment. The paper describes how the prototype has been used to implement a number of realistic examples and also explains how parallel closures can support the creation of structured parallel programming abstractions.
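To convey the flavor of the read-only-environment property outside the Java prototype, here is a hedged Python approximation (all names are ours, not the framework's API): the closure captures an immutable snapshot of its inherited environment, so children executed in parallel cannot race on it.

```python
# Illustrative approximation of a parallel closure: the inherited
# environment is frozen at creation, so concurrent executions of the
# closure cannot create data races on it.
from concurrent.futures import ThreadPoolExecutor

def parallel_closure(env, body):
    """Bind `body` to an immutable snapshot of `env`."""
    frozen = tuple(env)  # read-only view of the inherited environment
    return lambda arg: body(frozen, arg)

# Environment shared by all child tasks, inherited read-only.
weights = [2, 3, 5]
task = parallel_closure(weights, lambda env, x: sum(w * x for w in env))

# Fork-join: the children are scheduled and joined together, so the
# parent does not run (and mutate `weights`) while they execute.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(task, [1, 2, 3]))

print(results)  # [10, 20, 30]
```

The paper's type system enforces this read-only capture statically; Python can only imitate it by convention, which is why the snapshot here is an explicit `tuple` copy.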

The dominant view of computing is based on sequential processing as the “normal” case, with parallelism considered to be optional “acceleration.” In this position paper, we argue that parallel is becoming the norm, so we must re-examine the primacy of serialized computations in time, simultaneous action-at-a-distance, and total ordering in our thinking and our engineering practices.

We focus in this paper on some of the problematic ideas and the design and implementation structures that must be deprecated, and the cleaner modularity concepts that must be supported, if we in the systems part of the computing community are to unshackle parallel computing from its misplaced entanglement with a sequential model of computing. In the end, ours is a call to action: not revolutionary action that discards the past, but rapid evolutionary change that abandons obsolete ideas.

Stephen Heumann and Vikram Adve, University of Illinois at Urbana-Champaign

Concurrent programming has become ubiquitous, but today’s widely-used concurrent programming models provide few safety guarantees, making it easy to write code with subtle errors. Models that do give strong guarantees often can only express a relatively limited class of programs. We argue that a concurrent programming model should offer strong safety guarantees, while still providing the flexibility and performance needed to support the many ways that concurrency is used in complex, interactive applications.

To achieve this, we propose a new programming model based on tasks with effects. In this model, the core unit of work is a dynamically-created task. The key feature of our model is that each task has programmer-specified, statically-checked effects, and a runtime scheduler is used to ensure that two tasks are run concurrently only if their effects are non-interfering. Our model guarantees strong safety properties, including data race freedom and a form of atomicity. We describe this programming model and its properties, and propose several research questions related to it.
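The admission rule at the heart of the model can be sketched in a few lines. The task and effect representations below are our own invention for illustration, not the paper's API:

```python
# Minimal sketch of effect-based scheduling: a task declares read and
# write effect sets, and two tasks may run concurrently only if
# neither writes data the other reads or writes.
def interferes(a, b):
    """True if running a and b concurrently could race."""
    return bool(a["writes"] & (b["reads"] | b["writes"]) or
                b["writes"] & (a["reads"] | a["writes"]))

t1 = {"reads": {"x"}, "writes": {"y"}}
t2 = {"reads": {"x"}, "writes": {"z"}}
t3 = {"reads": {"y"}, "writes": set()}

# t1 and t2 only share a read of "x", so they may run in parallel;
# t3 reads "y", which t1 writes, so the scheduler must serialize it
# with respect to t1.
print(interferes(t1, t2))  # False
print(interferes(t1, t3))  # True
```

In the paper's model the effect declarations are statically checked against the task bodies, so the runtime can trust them when making this concurrency decision.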

We present a virtual machine prototype, called HydraVM, that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including code profiling, data dependency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM and modifying its baseline compiler, and it exploits software transactional memory to manage concurrent and out-of-order memory accesses. Our experimental studies show up to 5× speedup on the JOlden benchmark.

As highly parallel heterogeneous computers become commonplace, automatic parallelization of software is an increasingly critical unsolved problem. Continued progress on this problem will require large quantities of information about the runtime structure of sequential programs to be stored and reasoned about. Manually formalizing all this information through traditional approaches, which rely on semantic analysis at the language or instruction level, has historically proved challenging. We take a lower level approach, eschewing semantic analysis and instead modeling von Neumann computation as a dynamical system, i.e., a state space and an evolution rule, which gives a natural way to use probabilistic inference to automatically learn powerful representations of this information. This model enables a promising new approach to automatic parallelization, in which probability distributions empirically learned over the state space are used to guide speculative solvers. We describe a prototype virtual machine that uses this model of computation to automatically achieve linear speedups for an important class of deterministic, sequential Intel binary programs through statistical machine learning and a speculative, generalized form of memoization.

General purpose GPU (GPGPU) programming frameworks such as OpenCL and CUDA allow running individual computation kernels sequentially on a device. However, in some cases it is possible to utilize device resources more efficiently by running kernels concurrently. This raises questions about load balancing and resource allocation that have not previously warranted investigation. For example, what kernel characteristics impact the optimal partitioning of resources among concurrently executing kernels? Current frameworks do not provide the ability to easily run kernels concurrently with fine-grained and dynamic control over resource partitioning. We present KernelMerge, a kernel scheduler that runs two OpenCL kernels concurrently on one device. KernelMerge furnishes a number of settings that can be used to survey concurrent or single kernel configurations, and to investigate how kernels interact and influence each other, or themselves. KernelMerge provides a concurrent kernel scheduler compatible with the OpenCL API.

We present an argument for the benefits of running kernels concurrently. We demonstrate how to use KernelMerge to increase throughput for two kernels that use device resources efficiently when run concurrently, and we establish that some kernels show worse performance when running concurrently. We also outline a method for using KernelMerge to investigate how concurrent kernels influence each other, with the goal of predicting concurrent runtimes from individual kernel runtimes. Finally, we suggest GPU architectural changes that would improve such concurrent schedulers in the future.

For communication-intensive applications on distributed-memory systems, performance is bounded by remote memory accesses. Task migration is a potential candidate for reducing network traffic in such applications, thereby improving performance. We seek to answer the question: can a runtime profitably predict when it is better to move the task to the data than to move the data to the task? Using a simple model in which local work is free and data transferred over the network is costly, we show that a best-case task migration schedule can transfer up to 3.5× less total data than no migration for some benchmarks. Given this observation, we develop and evaluate two online task migration policies: Stream Predictor, which uses only immediate remote access history, and Hindsight Migrate, which tracks instruction addresses where task migration is predicted to be beneficial. These predictor policies provide a benefit over execution with no migration for small or moderate-size tasks in our tested applications.
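The cost model lends itself to a tiny worked example. The access trace and byte counts below are invented for illustration; only the "local work is free, network bytes are costly" accounting comes from the model described above:

```python
# Toy instance of the migration cost model: a task starts on node A,
# and each memory access names the home node of the data it touches.
accesses = [("B", 64)] * 10 + [("A", 64)] * 2  # (home node, bytes)

# No migration: the task stays on A and pays for every B-homed access.
no_migration = sum(b for node, b in accesses if node != "A")

# Migration: move the task (say, 128 bytes of state, an assumed figure)
# to B, after which the B-homed accesses are local and A's are remote.
task_state = 128
with_migration = task_state + sum(b for node, b in accesses if node != "B")

print(no_migration, with_migration)  # 640 256
```

Here migrating wins because the run of B-homed accesses (640 bytes) outweighs the migration cost plus the now-remote A accesses (256 bytes); an online policy like the paper's has to predict such runs without seeing the future.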

Many multicore multiprocessors have a non-uniform memory architecture (NUMA), and for good performance, data and computations must be partitioned so that (ideally) all threads execute on the processor that holds their data. However, many multithreaded applications make heavy use of shared data structures that are accessed by all threads of the application. Automatic data placement and thread scheduling for these applications is (still) difficult.

We present a template library for shared data structures that allows a programmer to express both the data layout (how the data space is partitioned) and the thread mapping and scheduling (when and where a thread is executed). The template library supports programmers in dividing computations and data to reduce the percentage of costly remote memory accesses on NUMA multicore multiprocessors. Initial experience with ferret, a program with irregular memory access patterns from the PARSEC benchmark suite, shows that this approach can reduce remote accesses from 42% to 10%, yielding a performance improvement of 3% without overwhelming the programmer.

Maze solving has been used as an example parallel programming problem for some years. Suggested solutions are often based on a sequential program, using work queues to allow multiple threads to explore different portions of the maze concurrently. This paper analyzes such an implementation, but also explores an alternative implementation based on strategies long used by human maze solvers. This alternative implementation outperforms the conventional approach on average, and furthermore exhibits large superlinear speedups. The paper uses insights into the cause of these superlinear speedups to derive a faster sequential algorithm, and finally considers further implications and future work.
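One human-inspired strategy of the kind discussed, searching from both ends of the maze and stopping when the two frontiers meet, can be sketched as follows. This is our own illustrative code on a toy grid, not the paper's implementation:

```python
# Bidirectional search on a small grid maze: '#' is a wall, and the
# two breadth-first frontiers (from start and end) alternate steps.
from collections import deque

maze = ["S.#.",
        ".##.",
        "....",
        ".#.E"]
rows, cols = len(maze), len(maze[0])

def neighbors(cell):
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != "#":
            yield nr, nc

def bidirectional_steps(start, end):
    """Alternate one BFS step per end; return total cells expanded."""
    frontiers = [deque([start]), deque([end])]
    seen = [{start}, {end}]
    expanded = 0
    while all(frontiers):
        for side in (0, 1):
            cell = frontiers[side].popleft()
            expanded += 1
            for n in neighbors(cell):
                if n in seen[1 - side]:   # frontiers met: a path exists
                    return expanded
                if n not in seen[side]:
                    seen[side].add(n)
                    frontiers[side].append(n)
    return expanded

print(bidirectional_steps((0, 0), (3, 3)))  # 10
```

Each frontier only has to cover roughly half the path length, which hints at why such strategies can beat a single-frontier search even when run sequentially by alternation.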

Alex Rubinsteyn, Eric Hielscher, Nathaniel Weinman, and Dennis Shasha, New York University

High-level productivity languages such as Python or Matlab enable the use of computational resources by non-expert programmers. However, these languages often sacrifice program speed for ease of use.

This paper proposes Parakeet, a library that provides a just-in-time (JIT) parallel accelerator for Python. Parakeet bridges the gap between the usability of Python and the speed of code written in efficiency languages such as C++ or CUDA. Parakeet accelerates data-parallel sections of Python that use the standard NumPy scientific computing library. Parakeet JIT-compiles efficient versions of Python functions and automatically manages their execution on both GPUs and CPUs. We assess Parakeet on a pair of benchmarks and achieve significant speedups.
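To show the shape of code such an accelerator targets, here is a purely elementwise function in plain Python (our own sketch; it does not use Parakeet's API or NumPy): every output element is independent of the others, which is what lets a data-parallel JIT map the work across CPU or GPU lanes.

```python
# An elementwise "alpha*x + y" computation: each element is computed
# independently, so the loop is trivially data-parallel.
def saxpy(alpha, xs, ys):
    return [alpha * x + y for x, y in zip(xs, ys)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

In NumPy this would be written as array arithmetic (`alpha * xs + ys`), and it is exactly such array expressions that a JIT of this kind can compile to parallel code.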

Just as errors in sequential programs can lead to security exploits, errors in concurrent programs can lead to concurrency attacks. Questions such as whether these attacks are feasible and what characteristics they have remain largely unanswered. In this paper, we present a preliminary study of concurrency attacks and the security implications of real-world concurrency errors. Our study yields several interesting findings. For instance, we observe that the exploitability of a concurrency error depends on the duration of the timing window within which the error may occur. We further observe that attackers can increase this window through carefully crafted inputs. We also find that four out of five commonly used sequential defenses become unsafe when applied to concurrent programs. Based on our findings, we propose new defense directions and fixes to existing defenses.

Jacob Burnim, Tayfun Elmas, George Necula, and Koushik Sen, University of California, Berkeley

Testing is the most widely-used methodology for software validation. However, due to the nondeterministic interleavings of threads, traditional testing for concurrent programs is not as effective as for sequential programs. To attack the nondeterminism problem, software model checking techniques have been used to systematically enumerate all possible thread schedules of a test program. But such systematic and exhaustive exploration is typically too time-consuming for many test programs. We believe that the programmer’s help to guide the model checker towards interesting executions is critical to circumvent this problem.

We propose a testing technique and a supporting tool called CONCURRIT, which provides a model checker that can be guided programmatically within test code. While writing a test, the programmer specifies a particular thread interleaving scenario using an embedded domain-specific language (DSL), and CONCURRIT explores all and only the executions realizing the intended scenario. During the exploration, the programmer is also able to observe the execution (e.g., assert invariants) and constrain the future decisions of the model checker, all within the test code. We believe that giving the programmer the ability to observe and control the exploration of executions will lead to more effective and efficient testing of concurrent programs.

In the multi-core era, it is critical to effectively test multi-threaded software and expose concurrency bugs before software release. Previous work has made substantial progress in exercising the interleaving space and detecting concurrency bugs under a given input. Unfortunately, since software often has many test inputs and faces constant pressure to release new versions, existing techniques are still too expensive in practice. In this position paper, we use open-source software to study how interleavings, particularly data races and atomicity violations, overlap across test inputs and software versions. We also conduct preliminary explorations to improve the testing efficiency of multi-threaded software by avoiding redundant analysis across inputs and software versions.