SEPS 2015 – Proceedings

Frontmatter

Welcome to the second international workshop on Software Engineering for Parallel Systems (SEPS) held in Pittsburgh, PA, USA on October 27, 2015 and co-located with the ACM SIGPLAN conference on Systems, Programming, Languages and Applications: Software for Humanity (SPLASH 2015). The purpose of this workshop is to provide a stable forum for researchers and practitioners dealing with compelling challenges of the software development life cycle on modern parallel platforms.

Profiling and Program Analysis

As modern memory subsystems have become more complex, tuning application code for their deeper memory hierarchies is critical to realizing their potential performance. However, such tuning has depended on time-consuming, empirical work done by hand by domain experts. To assist this performance tuning process, we have been developing an application analysis tool called Exana and have attempted to automate parts of the process. Taking already compiled executable binary code as input, Exana can transparently analyze program structures, data dependences, memory access characteristics, and cache hit/miss statistics across program execution. In this paper, we demonstrate the usefulness and productivity of these analyses and evaluate their overheads. After demonstrating that our analysis is feasible and useful for actual HPC application programs, we show that the overheads of Exana's analyses are much lower than those of existing architectural simulators.

Understanding and identifying performance problems is difficult for parallel applications, but is an essential part of software development for parallel systems. In addition to the same problems that exist when analysing sequential programs, software development tools for parallel systems must handle the large number of execution engines (cores) that result in different (possibly non-deterministic) schedules for different executions. Understanding where exactly a concurrent program spends its time (especially if some aspects of the program paths depend on input data) is the first step towards improving program quality. State-of-the-art profilers, however, aid developers in performance diagnosis by providing hotness information at the level of a class or method (function) and usually report data for just a single program execution. This paper presents a profiling and analysis technique that consolidates execution information for multiple program executions. Currently, our tool's focus is on execution time (CPU cycles), but other metrics (stall cycles for functional units, cache miss rates, etc.) are possible, provided such data can be obtained from the processor's monitoring unit. To detect the location of performance anomalies that are worth addressing, the average amount of time spent inside a code block, along with the statistical range of the minimum and maximum amount of time spent, is taken into account. The technique identifies performance bottlenecks at the fine-grained level of a basic block. It can indicate the probability of such a performance bottleneck appearing during actual program executions. The technique utilises profiling information across a range of inputs and tries to induce performance bottlenecks by delaying random memory accesses. The approach is evaluated by performing experiments on the data compression tool pbzip2, the multi-threaded download accelerator axel, the open source security scanner Nmap and the Apache httpd web server.
An experimental evaluation shows the tool to be effective in detecting performance bottlenecks at the level of a basic block. Modifications in the block that is identified by the tool result in a performance improvement of over 2.6x in one case, compared to the original version of the program. The performance overhead incurred by the tool is a reasonable 2-7x in the majority of cases.
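The consolidation step described in this abstract can be illustrated with a minimal sketch. It assumes each run has already been reduced to a mapping from basic-block label to cycle count; the block labels, the cycle values, and the choice to rank blocks by the max-min spread are all hypothetical and are not taken from the paper.

```python
from statistics import mean

def consolidate(runs):
    """Merge per-block cycle counts from several runs into summary statistics."""
    samples = {}
    for run in runs:
        for block, cycles in run.items():
            samples.setdefault(block, []).append(cycles)
    return {
        block: {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "spread": max(values) - min(values),
        }
        for block, values in samples.items()
    }

def rank_by_spread(stats):
    """Order blocks by max-min spread (an assumed anomaly score, largest first)."""
    return sorted(stats, key=lambda block: stats[block]["spread"], reverse=True)

# Hypothetical cycle counts for two basic blocks over three executions.
runs = [
    {"bb_12": 100, "bb_47": 900},
    {"bb_12": 110, "bb_47": 2500},
    {"bb_12": 105, "bb_47": 1100},
]
stats = consolidate(runs)
ranking = rank_by_spread(stats)
```

With these made-up inputs, the block whose timing varies most between runs is listed first, which is the kind of candidate a developer would inspect.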

Investigating runtime behavior is one of the most important steps in performance tuning on a computer system. Profiling tools have been widely used to detect hot-spots in a program. In addition, tracing tools produce valuable information, especially for parallelized programs, such as thread scheduling, barrier synchronizations, context switching, thread migration, and jitter caused by interrupts. Using the obtained information, users can optimize the runtime system and hardware configuration in addition to the program itself. However, existing tools provide information per process or per function. Finer-grained information, at task or loop granularity, is required to understand program behavior more precisely. This paper proposes a tracing tool, Annotatable Systrace, based on an extended Linux ftrace, to investigate the runtime execution behavior of a parallelized program. Annotatable Systrace can add arbitrary annotations to a trace of a target program. As an evaluation, the proposed tool is applied to traces of 183.equake, 179.art, and mpeg2enc on Intel Xeon X7560 and ARMv7. The evaluation shows that the tool enables us to observe load imbalance along with the program execution. It can also generate a trace with the inserted annotations even on a 32-core machine. The overhead of one annotation is 1.07 us on Intel Xeon and 4.44 us on ARMv7.

Modeling Techniques for Parallel Software

Concurrency errors, like data races and deadlocks, are difficult to find due to the large number of possible interleavings in a parallel program. Dynamic tools analyze a single observed execution of a program, and even with multiple executions they cannot reveal possible errors in other reorderings. This work takes a single program observation and produces a set of alternative orderings of the synchronization primitives that lead to a concurrency error. The new reorderings are enforced under a happens-before detector to discard reorderings that are infeasible or do not produce any error report. We evaluate our approach against multiple repetitions of a state-of-the-art happens-before detector. The results show that through interleaving inference more errors are found and the counterexamples enable easier reproducibility by the developer.
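Why a happens-before detector can miss a race in one observed ordering yet report it in another can be shown with a small sketch. The code below is a generic vector-clock detector written for illustration, not the inference algorithm from the paper; the trace format, the thread and lock names, and the hand-written alternative ordering are all assumptions.

```python
from collections import defaultdict

def find_races(trace):
    """Report pairs of conflicting accesses not ordered by happens-before.

    trace: list of (thread, op, target), op in acquire/release/read/write.
    Returns (earlier_index, later_index) pairs of racing accesses.
    """
    clocks = defaultdict(lambda: defaultdict(int))  # thread -> vector clock
    lock_clocks = defaultdict(dict)                 # lock -> clock at last release
    accesses = defaultdict(list)                    # variable -> earlier accesses
    races = []
    for index, (thread, op, target) in enumerate(trace):
        vc = clocks[thread]
        vc[thread] += 1
        if op == "acquire":
            # Join with the clock published by the last release of this lock.
            for other, count in lock_clocks[target].items():
                vc[other] = max(vc[other], count)
        elif op == "release":
            lock_clocks[target] = dict(vc)
        else:
            is_write = op == "write"
            for other, stamp, other_write, other_index in accesses[target]:
                ordered = stamp <= vc[other]
                if other != thread and (is_write or other_write) and not ordered:
                    races.append((other_index, index))
            accesses[target].append((thread, vc[thread], is_write, index))
    return races

# Observed run: T1's lock release happens before T2's acquire, hiding the race.
observed = [
    ("T1", "write", "x"), ("T1", "acquire", "L"), ("T1", "release", "L"),
    ("T2", "acquire", "L"), ("T2", "release", "L"), ("T2", "write", "x"),
]
# Alternative ordering of the same events: T2 takes the lock first.
reordered = [
    ("T2", "acquire", "L"), ("T2", "release", "L"), ("T1", "write", "x"),
    ("T1", "acquire", "L"), ("T1", "release", "L"), ("T2", "write", "x"),
]
races_observed = find_races(observed)
races_reordered = find_races(reordered)
```

Both traces preserve each thread's program order, but only the second leaves the two writes to `x` unordered, so only it yields a report.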

Writing parallel programs is hard, especially for inexperienced programmers. Parallel language features are still being added on a regular basis to most modern object-oriented languages and this trend is likely to continue. Being able to support developers with tools for writing and optimizing parallel programs requires a deep understanding of how programmers approach and implement parallelism. We present an empirical study of 135 parallel open-source projects in Java, C# and C++ ranging from small (< 1000 lines of code) to very large (> 2M lines of code) codebases. We examine the projects to find out how language features, synchronization mechanisms, parallel data structures and libraries are used by developers to express parallelism. We also determine which common parallel patterns are used and how the implemented solutions compare to typical textbook advice. The results show that similar parallel constructs are used equally often across languages, but usage also depends heavily on how easy a given language feature is to use. Patterns that do not map well to a language are much rarer in that language than in others. Bad practices are prevalent in hobby projects but also occur in larger projects.

The Model-Driven Engineering (MDE) paradigm has been successfully embraced for producing maintainable software in several domains while decreasing cost and effort. One of its principal concepts is rule-based Model Transformation (MT), which enables automated processing of models for different purposes. The user-friendly syntax of MT languages is designed to let users specify and execute these operations with little effort. Existing MT engines, however, cannot complete complex transformations in an acceptable time. Worse, when faced with large amounts of data, these tools crash with an out-of-memory exception. In this paper, we introduce ATL-MR, a tool to automatically distribute the execution of model transformations written in a popular MT language, ATL, on top of a well-known distributed programming model, MapReduce. We briefly present an overview of our approach, describe the changes with respect to the standard ATL transformation engine, and finally show experimentally the scalability of this solution.

Performance Tuning and Auto-tuning

Performance tuning on a single CPU remains an essential foundation for massively parallel applications in the upcoming exascale era if they are to approach peak performance. In this paper, we investigate the room for performance improvement by searching over possible memory layout optimizations. The target application is a stencil computation, and we use the roofline model as its performance model. Analysis of the application with the roofline model and with the performance analysis tool that we have been developing predicts that padding applied in the source code will improve performance. We therefore explore memory layouts that improve application performance by a brute-force search over 1,000 randomly generated patterns drawn from the possible padding parameters. An evaluation measuring application performance on a single node shows that the application using the selected memory layout runs 4.3 times faster than the original.
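The two ingredients named in this abstract, the roofline bound and a search over randomly sampled padding parameters, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the machine numbers, and the `toy_cost` stand-in (which replaces timing a real stencil run) are hypothetical and do not come from the paper.

```python
import itertools
import random

def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: min(compute peak, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

def search_padding(measure, pad_ranges, samples=1000, seed=0):
    """Evaluate up to `samples` randomly chosen padding tuples; return the cheapest.

    measure: callable mapping a padding tuple to a cost (e.g. measured runtime).
    pad_ranges: one iterable of candidate pad sizes per array dimension.
    """
    rng = random.Random(seed)
    candidates = list(itertools.product(*pad_ranges))
    trial = rng.sample(candidates, min(samples, len(candidates)))
    return min(trial, key=measure)

# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
memory_bound = roofline_gflops(100.0, 50.0, 0.5)   # low intensity, bandwidth limited
compute_bound = roofline_gflops(100.0, 50.0, 4.0)  # high intensity, peak limited

def toy_cost(pads):
    """Stand-in for a benchmark run; its minimum at (3, 5) is arbitrary."""
    return abs(pads[0] - 3) + abs(pads[1] - 5)

best = search_padding(toy_cost, [range(16), range(16)])
```

In a real study `measure` would compile and time the stencil kernel with the given padding, which is where almost all of the search cost lies.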

To improve performance on multi-core processors, efficient utilization of thread-level parallelism is essential. However, conventional parallel processing cannot efficiently exploit the potential parallelism within program code, because the various dependencies within the code must be strictly preserved. Speculative parallel processing is considered useful for this problem. However, to realize the method on conventional commercial multi-core processors, speculative data must be managed in software, since these processors have no hardware support for speculative data. Practical performance improvement has been difficult to attain because the resulting runtime overhead is large. In this research, we focus on hardware transactional memory (HTM), which is provided on commercial multi-core processors recently released by Intel. This research aims to use HTM to reduce the runtime overhead caused by the management of speculative data in speculative parallel processing, and thereby to achieve efficient speculative parallel processing on commercial multi-core processors. In this paper, as a first step, we evaluate the performance of the pre-computation method using a helper thread, which is a speculative parallel processing technique, using a Dijkstra program that solves the shortest path problem on graph data. We quantitatively show the effect of HTM on parallel processing.

Linear algebra provides the building blocks for a wide variety of scientific and engineering simulation codes. Users face a world of continuously developing new algorithms and high-performance implementations of these fundamental calculations. In this paper, we describe new capabilities of our Lighthouse framework, whose goal is to match specific problems in the area of high-performance numerical computing with the best available solutions developed by experts. Lighthouse provides a searchable taxonomy of popular but difficult-to-use numerical software for dense and sparse linear algebra. Because multiple algorithms and implementations of the same mathematical operations are available, Lighthouse also classifies algorithms based on their performance. We introduce the design of Lighthouse and show some examples of the taxonomy interfaces and algorithm classification results for the preconditioned iterative linear solvers in the Portable, Extensible Toolkit for Scientific Computation (PETSc).