Mailing lists

Workshop description

We are pleased to announce another gathering of GNU tools developers. The basic format of this meeting will be similar to the previous meetings.

The purpose of this workshop is to gather all GNU tools developers to discuss current and future work, coordinate efforts, exchange reports on ongoing efforts, discuss development plans for the next 12 months, hold developer tutorials and have any other related discussions.

This time we will meet again at the Lesser Town Campus of Charles University in Prague (Malostranske Namesti 25, Prague, Czech Republic; map1, map2). This is the same location as GNU Tools Cauldron 2012.

We are inviting every developer working in the GNU toolchain: GCC, GDB, binutils, runtimes, etc. In addition to discussion topics selected at the conference, we are looking for advance submissions.

If you have a topic that you would like to present, please submit an abstract describing what you plan to present. We are accepting three types of submissions:

Prepared presentations: demos, project reports, etc.

BoFs: coordination meetings with other developers.

Tutorials for developers. No user tutorials, please.

Note that we will not be doing in-depth reviews of the presentations. Mainly we are looking for applicability and to decide scheduling. There will be time at the conference to add other topics of discussion, similarly to what we did at the previous meetings.

If you intend to participate, but not necessarily present, please let us know as well. Send a message to tools-cauldron-admin@googlegroups.com stating your intent to participate. Please indicate your affiliation, dietary requirements and t-shirt size.

Abstracts

Talks

Andreas Arnez: Debugging versus hardware transactional memory

For a few years now, some commercial server CPUs, like IBM zEC12, Intel Haswell, and IBM POWER8, have implemented "hardware transactional memory", a feature that lets a sequence of operations appear as a single atomic transaction. This feature is focused on simplifying and/or speeding up certain synchronization scenarios in parallel computing. But what about debuggability? When a transaction is interrupted, it is rolled back, and the state at the time of interruption is lost. This leads to various difficulties with breakpoints, watchpoints, single-stepping, etc. The talk outlines these issues and discusses possible solutions.

Andreas Arnez: Debugging Linux kernel dumps with GDB?

GDB can in principle be used for analyzing Linux kernel dumps, but there are significant limitations. By contrast, GDB contains special logic for the user-space runtime, such as for the threading library and for the dynamic loader. Analogous functionality is currently lacking for the kernel runtime. Also, GDB does not offer any capabilities specific to kernel dump debugging, like reading compressed kdumps or deploying C-like dumper functions that can be derived from existing code in the Linux kernel sources. In general, there are various possible ways of improving GDB's Linux kernel dump support. Such improvements may also benefit kernel live debugging, such as with a JTAG probe or with Qemu's gdbserver.

James Pallister, Jeremy Bennett: GNU Superoptimizer 2.0

For nearly 20 years, GSO has been the reference superoptimizer, and it has proved successful in uncovering new peephole optimizations for compilers. The code has been relatively stable for some years. In this talk we'll discuss our work on a new version of GSO, drawing on more recent research in the field. We'll cover:

How to modify GSO for use in a highly parallel environment - in this case the 100,000 node supercomputer at STFC Daresbury

How to implement stochastic superoptimization in GSO.

How to implement cost functions other than those based on code size.

How to address the challenges of memory access, floating point, loops and multiple results.

The result is a complete rewrite - GSO 2.0. We won't have finished the job by the time of the Cauldron, but we'll give a progress update.

Jeremy Bennett: Keeping other compilers honest: How to validate LLVM with the GCC

LLVM comes with two test suites. The regression test suite is used to validate the compiler, and does not involve any execution tests. The LLVM nightly tests are a collection of applications which can be used to test execution of generated code on larger targets. However LLVM lacks any large body of small tests to exercise all aspects of a compiler. By comparison the latest GCC regression test suite includes around 75,000 C tests and 50,000 C++ tests, many of which are execution tests.

For a long time, Apple and Intel have used the GCC 4.2.2 and GDB 6.3 regression tests to validate their LLVM implementations. However these are heavily hand-modified versions, and it is hard to either roll forward to newer tests or to different architectures. So, for example, there is no testing of the 2011 and 2014 C/C++ standards.

In this short talk, I'll present our experience of using the GCC regression tests with an LLVM compiler for a deeply embedded target. I'll outline a unified solution, which makes it feasible to use the latest GCC regression test suite with other compilers and other architectures. It requires a generic patch to DejaGnu (in gdb.exp), which provides a mechanism for a central database to control which tests should run and their expected output.

The result is that only tests which should give a result on the architecture are run. There is no problem with large numbers of tests timing out, and a good compiler should be able to achieve zero FAIL, XPASS and UNRESOLVED results. The approach is generic, so it can be applied to other GNU regression test suites, such as GDB and binutils, providing further validation.

The intention of this talk is to stimulate discussion about one aspect of interaction between LLVM and GCC. LLVM is in many areas led by GCC - if only because it needs to be able to compile code which has historically been compiled with GCC. To what extent should GCC actively support LLVM? For example, should DejaGnu tests include dg- directives to indicate relevance to LLVM?

Bin Cheng: IVOPTs current implementation and challenges

The talk consists of two parts. The first part is an overview of how IVO currently works. It has changed a lot since tree-level IVO was first merged in the early 4.* versions. This might also result in a refined top-level comment for tree-ssa-loop-ivopts.c. The second part is about some non-trivial problems that the current IVO doesn’t handle very well (in other words, points where I think IVO could be improved):

Register pressure modeling. Though IVO has a basic register pressure model, it doesn’t work very well, especially for cases with high register pressure. In such cases, the IVOed program generally has many more memory spills than the non-IVO version.

IVO corrupts loop-invariant opportunities. Actually this is a loop-nest optimization problem. An invariant generated by IVO for an inner loop should be hoisted out of the loop, but it is kept in the loop because IVO for the outer loop rewrites it into a cheaper form. This causes conflicts between IVO and LIM and results in worse code because of high register pressure and additional instructions.

Too many candidates. IVO’s candidate selection algorithm has at least cubic behavior, so it avoids compile-time issues by falling back to another primitive but fast algorithm. The point is that some candidates are unnecessary and can be removed.

This problem list might change since I am still working on and learning IVO, but these are the major problems with IVO that I have noticed over the last two years. Though most of them are difficult, I expect that I can fix some before the Cauldron. I will talk about each problem (and a possible solution) in detail with the help of examples. As for the outcome of this talk, I hope to get feedback for further IVO work.
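For readers unfamiliar with the pass, the kind of rewrite that induction variable optimization performs can be sketched by hand in C. This is an illustration only, not GCC output; the function names are hypothetical, and both functions assume the array holds at least n * stride elements.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: every iteration recomputes the address a + i * stride. */
long sum_indexed(const long *a, size_t n, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i * stride];
    return sum;
}

/* Hand-rewritten form: the multiplication is replaced by a pointer
   induction variable, and the counter i is eliminated in favour of a
   comparison against a precomputed end pointer. */
long sum_strength_reduced(const long *a, size_t n, size_t stride)
{
    assert(stride > 0);  /* with stride 0 the end-pointer test would be wrong */
    long sum = 0;
    const long *end = a + n * stride;  /* at most one past the end of the array */
    for (const long *p = a; p != end; p += stride)
        sum += *p;
    return sum;
}
```

The second form trades the counter for two pointers (p and end) that stay live across the loop, which hints at why register pressure modeling matters when choosing among candidates.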

Ilya Enkovich: Vectorization for Intel AVX-512

Intel AVX-512 is a major instruction set extension which introduces new 512-bit vector registers, mask registers, embedded masking, embedded rounding and exception control, new instructions, etc. This presentation focuses on the new masking feature and how it can be used by GCC to improve loop vectorization for Intel Architecture.

This presentation comprises two parts. First, we want to share with the GCC community a number of aspects of our experience of porting data layout optimizations (aka struct-reorg) to the LTO/WHOPR infrastructure (version 4.9.0). The second part shows our exploratory work on extending the original idea of "changing C-like structure memory layout" to "changing object memory layout" in the C++ language paradigm.

Having previously been intended to run under the whole-program flag, the data layout optimizations required restructuring to run under LTO/WHOPR, with a strict division into three separate optimization phases: (1) compile-time collection of the data, (2) propagation and analysis of the combined data, and (3) the transformation phase. Technical routines were added for streaming both the collected data and the optimization decisions in and out of a data-layout-specific section in the LTO object file.

Symbol resolution, provided by the linker at phase (2), substantially simplified the data collection and analysis phases ((1) and (2)), mostly by relaxing the conservative assumptions of type-escape analysis. If all symbols related to a candidate type (variables, or functions with parameters or return values of the candidate type) are defined inside the current compilation (i.e. resolved as PREVAILING_DEF_IRONLY), then we can transform the candidate type safely, provided it was never cast. In the previous version of type-escape analysis we conservatively assumed that an address-taken operation causes a candidate type to "escape" (since, for example, its address might be passed as an actual parameter to a function defined externally). With known symbol resolutions it is possible to transform candidate types across function boundaries even when function parameters are pointers to a candidate type, as happens, for example, in the SPEC 2006 462.libquantum benchmark. However, reasons such as a custom-defined malloc, the presence of bitfields, inline assembly, or pointer arithmetic on a candidate type might still prevent a candidate type from being transformed.

When dealing with phase (3), we used both function and variable transforms, extending the compiler pass manager with variable_transform () support. In our case, the generation of new global variables should precede the transformation of individual functions, where these variables should be visible. We leveraged the existing jump-function mechanism to change function definitions. For example, in the case of structure peeling, if a function parameter is a pointer to a candidate type, then the function prototype is changed to receive multiple parameters that are pointers to the new peeled types.
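The prototype change described for structure peeling can be illustrated with a hand-written C sketch. The type and function names below are hypothetical and the code is not output of the pass; it only shows one pointer parameter being replaced by one pointer per peeled type.

```c
#include <assert.h>
#include <stddef.h>

/* Original layout: a frequently read field and a rarely read field
   share one structure. */
struct node {
    int    key;      /* read on every traversal */
    double payload;  /* read rarely */
};

/* Peeled layout: each field gets its own type, so arrays of the two
   types are stored separately. */
struct node_hot  { int key; };
struct node_cold { double payload; };

/* Before peeling: a single pointer to the candidate type. */
long sum_keys(const struct node *nodes, size_t n)
{
    assert(nodes != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += nodes[i].key;
    return sum;
}

/* After peeling: the prototype receives one pointer per peeled type. */
long sum_keys_peeled(const struct node_hot *hot,
                     const struct node_cold *cold, size_t n)
{
    assert(hot != NULL || n == 0);
    (void)cold;  /* this loop no longer touches the cold data */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += hot[i].key;
    return sum;
}
```

In the peeled version the loop walks a dense array of keys, so the payload fields no longer occupy cache lines that the traversal pulls in.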

Finally, we extended the original set of data layout optimizations, comprising structure splitting, peeling and reordering, with structure inlining. Being a combination of full peeling of a substructure with subsequent inlining into the containing structure, this optimization has the definite benefit of reducing the number of indirections required for individual field accesses. It also opens further opportunities for reordering the containing structure. As a result of applying this optimization to the SPEC 2006 libquantum benchmark, we achieved approximately +30% run time improvement on Intel Xeon and armv7 platforms, and as high as +90% on the armv8 platform, which emphasizes the efficiency of this optimization for platforms with a medium-size L2 cache.

The idea of extending structure layout reorganization to C++ objects came up in the analysis of multiple applications. It became clear that the order in which data members are specified inside a class bears no particular meaning for developers, who take logical convenience into account rather than application performance. The purpose of the initial experiment was to estimate the efficiency of simple reordering of data members, done in a manner similar to structure reordering, when guided by profile information. Further experiments aimed at reordering virtual table entries based on profile and/or trace information. We present the results of these experiments applied to pagerank <http://en.wikipedia.org/wiki/PageRank>, a multithreaded and distributed application based on the GraphLab <https://dato.com/products/create/open_source.html> library.

For both parts of our presentation we are looking forward to an open discussion and feedback from the GCC community, and would like to invite its members to actively participate in their development.

Jan Hubicka: Types and type based optimizations in GCC

The GCC middle-end is able to represent types of C, C++, Java, Fortran, Ada, and Go. A number of interesting questions arise during link-time optimization when types originating from different languages are merged into a single translation unit and unified semantics need to be established. I will discuss current representations, issues and future plans. I will also cover current type-based optimizations (alias analysis and devirtualization) and possible new uses in the future.

Martin Jambor: Compiling for HSA accelerators with GCC

The talk will describe the HSA development branch of gcc. We will describe what it can and cannot do, how it is structured so that it does not need LTO, and how it is going to co-exist with the other LTO-based accelerators. Because our effort primarily targets OpenMP 4.0, we will also describe what changes we deemed necessary in OpenMP expansion so that we can generate efficient GPGPU code. We will conclude by presenting plans to merge the branch to trunk.

Martin Liska: Inter-procedural Identical Code Folding in GCC

I will talk about the initial implementation of the IPA ICF pass, which is part of the GCC compiler starting from version 5.0. The presentation will include a comparison with the current implementation in the GOLD linker and unexpected issues observed during development of the pass. Moreover, I would like to introduce possible improvements which can make ICF even more powerful. Finally, I will explain how we can adapt the current infrastructure to replace the comparison engine of the tree-ssa-tail-merge pass.
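As a minimal illustration of what identical code folding targets, consider two C functions whose names differ but whose bodies are semantically identical. The function names are hypothetical, and scale_value_folded is a hand-written stand-in for the effect of folding, not what the pass emits. Whether GCC actually merges such functions depends on the compiler version and options (the pass is controlled by -fipa-icf).

```c
#include <assert.h>

/* Two functions with different names and parameter names,
   but identical bodies. */
int area_rect(int w, int h)
{
    assert(w >= 0 && h >= 0);
    return w * h;
}

int scale_value(int value, int factor)
{
    assert(value >= 0 && factor >= 0);
    return value * factor;
}

/* Hand-written equivalent of the folded result: one body is kept and
   the other function simply forwards to it. */
int scale_value_folded(int value, int factor)
{
    return area_rect(value, factor);
}
```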

Mikhail Maltsev: High Level Loop Optimizations in GCC

We present a simple framework for performing iteration domain modifying loop transformations with an implementation of loop splitting and thoughts on how to implement loop fusion.
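Loop splitting, one of the transformations mentioned, can be sketched by hand in C as follows. The function names are hypothetical and the code illustrates the transformation itself, not the framework presented in the talk.

```c
#include <assert.h>

/* Original loop: the condition i < split is evaluated on every iteration. */
void fill_original(int *out, int n, int split)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (i < split)
            out[i] = 1;
        else
            out[i] = 2;
    }
}

/* Split loops: the iteration domain [0, n) is divided at mid, so each
   loop body is free of the condition. */
void fill_split(int *out, int n, int split)
{
    assert(n >= 0);
    /* Clamp the split point into [0, n]. */
    int mid = split < 0 ? 0 : (split > n ? n : split);
    for (int i = 0; i < mid; i++)
        out[i] = 1;
    for (int i = mid; i < n; i++)
        out[i] = 2;
}
```

Each resulting loop has a branch-free body, which can make later optimizations such as vectorization easier to apply.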

Michael Meissner: GNU PowerPC support in 2015

This talk will cover changes that we have made for PowerPC support in 2014-2015. Among other things this talk will include:

Details of adding full support for the Go language to the PowerPC environment.

Changes to flesh out support for the Power8 architecture, including changes for better little-endian support.

IEEE 128-bit floating point support.

Advanced fusion support and scalar register support, which involves reimplementing the RTL addressing modes to make them more general.

The POWER instruction set architecture is designed to support both big-endian and little-endian memory models. However, many of the instructions designed for vector support assume that vector elements in registers appear in big endian order, that is, with the lowest-numbered vector element in the most significant portion of the register. This is not particularly natural for programmers used to vector programming on little-endian architectures such as x86. We have designed a vector programming model that provides more natural interfaces for porting from standard little-endian environments, and that facilitates writing vector library code that runs in both endian modes with minimal changes. We also have an alternate model to facilitate porting existing big-endian POWER vector code to little-endian. This talk will outline some of the issues faced in designing a sensible vector programming model on a bi-endian architecture with a big-endian bias, and how we've addressed them. We will also discuss some of the more interesting implementation and performance issues we've encountered.

Siddhesh Poyarekar: Tunables for the C Library

The GNU C library has a number of magic constants that were decided based on performance and resource data available when they were first introduced. Those constants may be suboptimal for some loads and may have even been rendered incorrect due to advances in other components or hardware. Further, there are a number of global configuration variables that were added over the years to work around the problems posed by such magic constants (the MMAP_THRESHOLD in malloc is one such example). These variables have ad hoc names and each has its own scheme of initialization and maintenance.

A tunables framework aims to provide a layer that manages such global configuration and to provide a unified interface for programmers, system administrators and integrators to tweak this configuration. This talk describes the architecture of this layer and the interface it provides. If the feature is not ready by then, this would be a BoF to decide on the architecture and interface of the tunables layer.
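To make the idea of a unified configuration layer concrete, here is a minimal sketch in C of a table-driven tunables registry with bounds checking. Everything in it is an assumption for illustration: the tunable names, default values, bounds and the tunable_get/tunable_set functions are hypothetical and do not describe the interface proposed in the talk or any glibc API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per tunable: a uniform name, the current value and bounds. */
struct tunable {
    const char *name;
    long value;
    long min;
    long max;
};

/* Hypothetical tunables with illustrative defaults and bounds. */
static struct tunable tunables[] = {
    { "example.malloc.mmap_threshold", 131072, 0, 33554432 },
    { "example.malloc.arena_max",      8,      1, 1024 },
};

/* Look up a tunable by name; returns NULL if it is not registered. */
struct tunable *tunable_find(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof tunables / sizeof tunables[0]; i++)
        if (strcmp(tunables[i].name, name) == 0)
            return &tunables[i];
    return NULL;
}

/* Set a tunable; returns 0 on success, -1 if the name is unknown or
   the value is out of bounds. */
int tunable_set(const char *name, long value)
{
    struct tunable *t = tunable_find(name);
    if (t == NULL || value < t->min || value > t->max)
        return -1;
    t->value = value;
    return 0;
}

/* Read a tunable; returns fallback if the name is unknown. */
long tunable_get(const char *name, long fallback)
{
    const struct tunable *t = tunable_find(name);
    return t != NULL ? t->value : fallback;
}
```

Keeping every knob in one table gives all of them the same naming scheme, initialization and validation, which is the contrast with ad hoc global variables drawn above.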

Hafiz Abid Qadeer: What is new in DWARF5

Version 5 of the DWARF standard is expected to be published later this year. In this talk, I will talk about the new features of DWARF5 and where these features can be helpful for consumers of debug information.

Torvald Riegel: Updating glibc concurrency

I will give an overview of recent and future changes to concurrent code in glibc. In particular, I will cover (1) the transition to a C11-like memory model and data-race-freedom, (2) updates to the futex documentation and how this relates to POSIX/C++ mutex destruction requirements, and (3) the new semaphore and condition variable algorithms. I will also give an outlook on ongoing or future work: read-write lock scalability and spinning vs. blocking.

Note: We also thought about perhaps proposing a glibc BoF. This presentation could also be a BoF with a presentation side to it. I'm not sure whether you'd want to provide different slots for presentations and BoFs, or put one of those in parallel tracks but not the other. Therefore, if you think a BoF should be better, just let me know.

Deshpande Sameera: Improving the Effectiveness and Generality of GCC Auto-Vectorization

The presentation will demonstrate the approach to improve efficiency and generality of vectorization in GCC by

Analysing GIMPLE statements in the loop together and

Utilising that information

To transform the computations by coalescing ASTs and

To decide where and which permutations are populated based on knowledge of available target instructions.

Dodji Seketeli and Sinny Kumari: ABI comparison with Libabigail based tools: state of the onion

Many interesting developments have occurred in the Libabigail space since our last presentation at the 2014 edition of GNU Cauldron in Cambridge.

The purpose of this talk is to walk the audience through the main achievements, provide guidelines about the ways upstream projects and distributions can now include continuous ABI comparison in their workflow, and give hints about the new challenges that we see coming next in this area.

Ulrich Weigand: Supporting the new IBM z13 mainframe and its SIMD vector unit

The IBM z13, the latest model of the IBM z Systems line of mainframe computers, has been recently announced. For the first time in the history of z/Architecture, this model provides a Single Instruction Multiple Data (SIMD) vector unit, intended to speed workloads such as analytics and mathematical modeling.

Supporting a significant new architecture feature like this on Linux requires changes across the stack, starting from the kernel and system libraries, through assemblers and related binary utilities, up to all compilers and debugging tools.

In this talk I'll give an overview of the z13 architecture changes, in particular the integer, floating-point, and string vector instructions. I'll also describe the ABI choices we made to support SIMD, as well as the language extensions we defined to allow source code to exploit vector instructions across the various compilers on the platform. In particular, I'll address similarities and differences to vector extensions on other platforms, like VMX/VSX on Power.

Finally, I'll report on where we stand in implementing those new features across the Linux on z ecosystem, with particular focus on the implementation in the GNU toolchain, and address a couple of challenges that still need to be resolved.

Kirill Yukhin: OpenMP 4 Offloading Features implementation in GCC

GCC 5 was released with support for OpenMP 4.0 offloading to the Intel Xeon Phi (Knights Landing) target. The offloading infrastructure was implemented in a very generic way, so support for almost any accelerator can be integrated easily (provided the corresponding backend is contributed). This talk presents a high-level overview of the offloading internals. Xeon Phi is taken as an example of the target card.

Claudiu Zissulescu: Scheduling for ARC HS cores

Synopsys's ARC HS Family processors are 32-bit high-performance CPUs that can be customized for a wide range of uses, from deeply embedded to high-performance host applications. To achieve the desired performance level, we need to properly schedule the instruction stream on two ALUs designed for a low-latency configuration. The present talk will cover the GCC backend port modifications that were required to obtain the desired performance. We will cover the following topics:

HS ALUs configuration

Adding support in GCC for the HS CPUs

Future work.

Tutorials

Torvald Riegel: Modern concurrent code in C/C++

In this tutorial, I will present foundations, tools, and guidelines for how to write modern concurrent code in C and C++. I will (1) give a brief introduction to concurrency and the kind of reasoning necessary to write correct concurrent code, (2) explain the C11/C++11 memory model and data-race-freedom, why it should be used as foundation, and tools that can make this easier, (3) discuss trade-offs between complexity and performance of different synchronization programming abstractions, and (4) propose guidelines for how to document concurrent code so that it is easier to maintain for other people.
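As a small taste of the C11 memory model, the following sketch shows the common release/acquire publication pattern using <stdatomic.h>. The function and variable names are hypothetical and the code is not taken from the tutorial. In real use, publish and try_consume would be called from different threads; this sketch supports a single publication only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;        /* plain, non-atomic data */
static atomic_bool ready;  /* publication flag, initially false */

/* Producer side: write the data, then publish it with a release store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: an acquire load that observes the flag as true
   synchronizes with the release store, so the subsequent read of
   payload is free of data races. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

The non-atomic payload is only read after the flag has been observed, which is what makes the program data-race-free under the C11 model.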

BoFs

Peter Bergner: PowerPC BOF

Carlos O'Donell: GNU C Library BOF.

The GNU C Library is used as the C library in the GNU systems and most systems with the Linux kernel. The library is primarily designed to be a portable and high performance C library. It follows all relevant standards including ISO C11 and POSIX.1-2008. It is also internationalized and has one of the most complete internationalization interfaces known.

This BOF aims to bring together developers of other components that have dependencies on glibc and glibc developers to talk about the following topics:

Planning for glibc 2.23 and what work needs to be done in the August to January 2016 timeframe.

Planning for glibc 2.24 and what work needs to be done between January 2016 and July 2016.

Jan Hubicka: LTO BoF

The aim of this BoF is to discuss the direction of the benchmarks going forward and also come up with a framework for whole system benchmarking that feeds back into the glibc development to help us decide on algorithmic tweaks and also tweaks to tunables within the library.

Ramana Radhakrishnan: BoF for the ARM / AArch64 ports

Aditya Kumar, Sebastian Pop: Loop optimizer and vectorization BOF

We would like to discuss the state of GCC's vectorizer compared to other compilers, and areas that need improvement. We will present testcases and performance differences for the opportunities of vectorization.

The second point to be discussed is how to use Graphite to enable more loops to be vectorized in a similar way as Polly drives the vectorizer of LLVM. We will lay out a plan of action to get better loop transforms for vectorization.

Roland McGrath: BoF for glibc hackers.

Carlos O'Donell, Marek Polacek: Continuous Integration

The topics are Continuous Integration (Carlos O'Donell), news from libabigail (Dodji Seketeli), and I might utter a few words about how we do Fedora mass rebuilds.

Martin Jambor: Accelerator BoF

BoF to bring together all those involved in supporting compilation for accelerators so that we can coordinate and share experience and expectations.

Accommodation

The conference venue can be conveniently reached by public transport, either by Metro (subway, underground train) line A (green line) to the Malostranská station and then a short walk, or by tramway lines No. 12, 20 or 22 to the Malostranské náměstí stop. The tramway stop is situated right across the square from the conference venue. Public transport schematics can be downloaded at http://www.dpp.cz/en/transport-around%20prague/transit-schematics/.

Because the venue is located right in the center of Prague, it is easy to check lodging options on common booking sites, like http://www.marys.cz/.