Technical Details

Machine Learning Optimization

All modern compilers have hundreds of optimization passes, and most of those passes have tuning parameters that adjust their precise behavior. The general-purpose optimization levels -O1, -O2, -Os, etc. are collections of those optimizations that have proved generally effective across a wide range of programs.

But these are general-purpose flags. For any individual program, they will not give the best possible optimization. Most programmers have had the experience of -Os (optimize for small code) giving a faster program than -O3 (optimize aggressively for speed).

This is even more the case when we look at the compilation of individual functions.

The only way to solve this is to take a program and try the individual optimization flags, a technique known as iterative compilation. However, trying all possible combinations of hundreds of optimizations and parameters would take billions of years, and it is not sufficient to try each optimization in isolation, because optimizations interact. The best that can be achieved is for an experienced engineer to try large numbers of likely flag combinations in search of the best. While this can double the performance of a program, it can take weeks to achieve for a single program.
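As a sketch of why this search is intractable, consider exhaustively trying every subset of a handful of flags. The flag names are real GCC options, but the cost function below is a stand-in for the real compile-and-measure step, with an invented interaction between two passes:

```python
from itertools import combinations

# Illustrative subset of optimization flags; a real compiler has hundreds,
# so the 2**N search space below explodes far too fast to enumerate.
FLAGS = ["-funroll-loops", "-finline-functions", "-ftree-vectorize"]

def measure(flag_set):
    """Stand-in for compiling and timing a program with these flags.
    A real implementation would invoke the compiler and run the binary."""
    score = 100.0
    if "-funroll-loops" in flag_set:
        score -= 10
    if "-finline-functions" in flag_set:
        score -= 5
    if "-funroll-loops" in flag_set and "-finline-functions" in flag_set:
        score += 12  # the two passes interact badly in this toy model
    return score

# Exhaustive search over all 2**N subsets -- only feasible for tiny N.
best = min(
    (frozenset(c) for r in range(len(FLAGS) + 1)
     for c in combinations(FLAGS, r)),
    key=measure,
)
print(sorted(best))  # the winner is not simply "all flags on"
```

Because the passes interact, the best subset here is not the union of the individually beneficial flags, which is exactly why per-flag testing is insufficient.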

The solution is based on the observation that “similar” programs tend to yield similar sets of optimization flags and parameters when subjected to iterative compilation. Thus, if we apply iterative compilation to a wide range of programs, we can derive sets of optimization flags and parameters suited to each class of similar programs.
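A minimal sketch of that idea: represent each trained program as a feature vector, and reuse the flags that iterative compilation found for the most similar trained program. The features and flag sets below are invented for illustration; this uses simple nearest-neighbor matching, one of many possible learners:

```python
import math

# Hypothetical training data: program features (e.g. loop count, branch
# density, memory-op ratio) mapped to the flag set iterative compilation
# found for that program. All values are invented for illustration.
TRAINED = {
    (12.0, 0.30, 0.10): ("-O2", "-funroll-loops"),
    ( 2.0, 0.05, 0.60): ("-Os",),
    ( 8.0, 0.25, 0.15): ("-O2", "-ftree-vectorize"),
}

def predict_flags(features):
    """Return the flag set of the nearest trained program (1-NN)."""
    nearest = min(TRAINED, key=lambda f: math.dist(f, features))
    return TRAINED[nearest]

# A new, unseen program whose features resemble the first trained entry.
print(predict_flags((11.0, 0.28, 0.12)))
```

The unseen program inherits the flags of its nearest neighbor, without itself being put through iterative compilation.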

This is a classic machine learning problem. The technique was first proved practical by the MILEPOST project. Subsequent research led by Embecosm, first through the MAGEEC project and currently as part of the TSERO project, has advanced the technology so that it is now ready for commercial deployment.

Under MAGEEC, this technology has matured to the point where it is ready for deployment as part of commercial compiler development. With TSERO the technology is being further advanced, in particular to improve ease of use.

MILEPOST

MILEPOST (MAchIne Learning for Embedded PrOgrammeS OpTimization) ran from 2006 to 2009. It aimed to demonstrate that existing research on iterative compilation, feedback-directed optimization and machine learning could be implemented in a production compiler, GCC, and be effective in a range of systems from embedded processors to HPC.

The project was successful in demonstrating that a machine learning approach could match iterative compilation. However at the end of the project a number of issues remained:

The machine learning framework was hard-coded into one specific release of GCC (4.4) and a small number of architectures. Embecosm subsequently migrated the work to GCC 4.5, but it became clear that so large a patch was not maintainable in the long term.

The choice of features for learning was arbitrary, based on the designers’ experience. This was recognized as a limitation of the work. Although a technique exists for determining whether a feature is appropriate (Principal Component Analysis), it requires a larger learning dataset than was available to the MILEPOST team.

While the technique was effective within the training set, it was less effective on programs outside the training set. This suggests that a wider-ranging training set is required.

MAGEEC

MAGEEC (MAchine Guided Energy Efficient Compilation) was an Innovate UK funded project between Embecosm (as lead partner) and Bristol University. It ran from June 2013 to November 2014.

MAGEEC’s primary objective was to generalize the techniques pioneered by MILEPOST so they could be used for any optimization criterion, with any compiler and using any machine learning technique. It was to develop the technology to the stage where it could be deployed commercially.

The optimization criterion used to evaluate MAGEEC was energy efficiency of the generated code, and the technology was demonstrated on both GCC and LLVM with a range of machine learning techniques.

The decision to optimize for energy was based on the results of an earlier study funded by Embecosm, which showed that choice of compiler options could have a significant impact on energy usage by compiled code. While energy usage and execution performance are closely correlated, there is not a complete correlation, and therefore optimizing for energy can do better than optimizing for speed.

The work proved successful, although more work is needed on the infrastructure for training and machine learning, to make it easier to obtain machine learning databases for new architectures and environments. A number of Embecosm customers are now working on MAGEEC deployment within their compiler tool chains. It is also increasingly clear that the feature set used (currently identical to that of MILEPOST) is not appropriate. However, the availability of per-function data means that the training set may now be large enough for Principal Component Analysis to be effective.

Optimization Criteria

MAGEEC was specifically set up to optimize for energy. However it is deliberately a general-purpose system, and can just as well optimize for code size and speed. Indeed these are easier targets, since they are simpler to measure than energy consumption. The first commercial development of MAGEEC is for code size.

Future Work

We identified two core challenges for future work:

The biggest challenge for MAGEEC was constructing the infrastructure for training the compiler. This needs to be able to run very large numbers of programs, collecting the data in a flexible way, and dealing with failures cleanly.

The second challenge was the use of the pre-existing MILEPOST feature set. This very much feels like a set of features that are easy to collect, rather than the set that best predicts the optimization criterion. To solve this we need to carry out Principal Component Analysis (see above), but this in turn needs a larger data set.

Both these issues will be addressed in the TSERO project.

Secondary Benefits

A number of secondary benefits arose out of this project, which is always a good indicator of successful research work.

Per function optimization

The decision to integrate into the pass manager means that we can offer custom optimizations not just at the file level, but at the individual function level. Whatever the optimization criterion, this leads to better code. There are, however, sometimes challenges in obtaining the necessary per-function training data, particularly for energy.

Optimizations specifically for energy

One of the reasons that optimization for energy and optimization for execution speed are not perfectly correlated is that most existing optimizations target either code speed or code size.

Under the MAGEEC project we explored two optimizations specifically designed to reduce energy. The first of these was aimed at small embedded processors executing code from flash memory. It aligned small inner loops so that the code resided in a single bit-line of the flash memory. This reduced the energy consumed by the loop by up to 12%, with commensurate benefits on programs containing such loops.
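The alignment arithmetic behind the first optimization can be sketched as follows. The 16-byte bit-line width is a hypothetical figure for illustration; real flash parts differ:

```python
def align_for_bitline(loop_start, loop_size, bitline_bytes=16):
    """Return the padding needed so a small loop fits entirely within one
    flash bit-line. bitline_bytes is a hypothetical line width; real
    parts differ. Returns None if the loop cannot fit in a single line."""
    if loop_size > bitline_bytes:
        return None
    # Offset of the loop within its current bit-line.
    offset = loop_start % bitline_bytes
    if offset + loop_size <= bitline_bytes:
        return 0                        # already fits in the current line
    return bitline_bytes - offset       # pad up to the next line boundary

# An 8-byte loop starting at 0x3A straddles a line boundary: pad 6 bytes.
print(align_for_bitline(loop_start=0x3A, loop_size=8))
```

Keeping the loop within one bit-line means the flash line is read once per iteration rather than twice, which is where the energy saving comes from.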

The second optimization was based on the observation that executing from RAM is usually much more energy efficient than executing from flash memory. Thus it can be energy efficient to copy a loop from flash into RAM and then execute it there, on processors which permit this, such as the ARM Cortex-M series.

The MAGEEC Wand energy measurement board

MAGEEC was always based on the idea that energy optimization must be based on the actual energy consumed by programs. Even the best high-level models of energy consumption are not sufficiently accurate, and in any case contain assumptions about processor behavior that may not be correct and can skew optimization. There are very accurate low-level energy models, but these typically operate at well under 1 Hz and require access to the synthesized netlist of the processor, neither of which is practical for measuring statistically significant numbers of programs.

As part of the MAGEEC project, Simon Hollis, then of Bristol University, designed a small daughter board for an ST Discovery board, with three energy sampling points. The MAGEEC Wand can sample voltage and current to an accuracy of 1% up to two million times per second at each sampling point. This is the performance and accuracy necessary for successful compiler optimization.
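Turning such samples into an energy figure is a simple numerical integration of instantaneous power over time. A minimal sketch, assuming a fixed sample rate (the trace values below are invented):

```python
def energy_joules(samples, sample_rate_hz=2_000_000):
    """Integrate instantaneous power over time.

    samples: iterable of (voltage_V, current_A) pairs taken at a fixed
    rate; at the Wand's top rate that is one pair every 0.5 microseconds.
    """
    dt = 1.0 / sample_rate_hz
    return sum(v * i for v, i in samples) * dt

# Toy trace: a steady 3.3 V rail drawing 10 mA for four samples (2 us).
trace = [(3.3, 0.010)] * 4
print(energy_joules(trace))
```

Running the same program repeatedly and integrating over the whole run averages out sampling noise, which is what makes per-program (and per-function) energy comparisons statistically meaningful.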

Experimental design for compiler optimization

With hundreds of optimizations in modern compilers, all of which influence each other, it is not feasible to try all possible combinations. The interaction between optimizations means it is also not sufficient to just try each optimization on its own.

The technique of fractional factorial design (FFD), which is well known in other branches of engineering and mathematics, has not previously been widely used in computer science. FFD allows selection of an optimal set of test runs to maximize the information about the impact of each optimization, even where those optimizations interact. Meaningful information about optimizations can be obtained with thousands of runs, rather than billions.

Even FFD is not powerful enough to deal with hundreds of optimizations. However, the related technique of Plackett-Burman analysis, which assumes that all the optimizations are independent, can be used as an initial filter to remove optimizations which have no effect (in which case interactions are not an issue) in a few hundred test runs. FFD can then provide a detailed analysis of the remaining optimizations.
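A minimal sketch of the screening step. Two-level Plackett-Burman designs of size 2^k are equivalent to columns of a Hadamard matrix, so we build one by the Sylvester construction and estimate each factor's main effect from just 8 runs instead of 2^7; the cost function is a toy in which only two of the seven factors matter:

```python
def hadamard(n):
    """Sylvester construction; n must be a power of two. The non-constant
    columns form a two-level screening design of that size."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + \
            [row + [-x for x in row] for row in H]
    return H

def screen(cost, n_factors=7):
    """Estimate each factor's main effect from 8 runs instead of 2**7."""
    H = hadamard(8)
    effects = []
    for j in range(1, n_factors + 1):   # column 0 is the all-ones column
        # Main effect = mean response at +1 minus mean response at -1.
        hi = sum(cost(row[1:]) for row in H if row[j] == 1) / 4
        lo = sum(cost(row[1:]) for row in H if row[j] == -1) / 4
        effects.append(hi - lo)
    return effects

# Toy response: only factors 0 and 3 actually influence the outcome.
def cost(x):
    return 10 + 4 * x[0] - 2 * x[3]

effects = screen(cost)
active = [j for j, e in enumerate(effects) if abs(e) > 0.5]
print(active)  # only the genuinely active factors survive the filter
```

The inactive factors come out with a zero effect estimate and can be dropped, leaving a small enough set for a full fractional factorial analysis of the interactions.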

The Bristol/Embecosm Embedded Benchmark Suite

To train the compiler we need a suitably representative set of benchmarks. Very few freely available benchmark suites exist for this. Almost all rely on the ability to write to a console (printf) and link in significant components of the C library, which can dominate any computation.

To solve this, we created a new set of benchmarks specifically geared to deeply embedded systems, BEEBS. Drawing on freely available benchmarks from MiBench, WCET and others, BEEBS is designed to be a representative mix of programs. In particular we have looked at the proportion of instructions in four categories: 1) flow-of-control; 2) memory access; 3) integer arithmetic; and 4) floating point arithmetic. BEEBS contains a balanced mix of programs, some heavy in one particular type of instruction, others more evenly balanced.

The philosophy behind BEEBS is described in a paper based on the 10 programs which formed BEEBS 1.0. We now have BEEBS 2.0, with approximately 80 programs.

TSERO

TSERO (Total Software Energy Reporting and Optimization) is the follow-on project to MAGEEC. It aims to develop the MAGEEC techniques for high performance computing (HPC) and datacenters. A key part of this will be to use commercially available energy measurement of processor cores, since it is not feasible to attach a MAGEEC Wand to each core in a supercomputer.

This is an Innovate UK funded collaboration between Embecosm, Allinea, Concertim and the Hartree Centre at STFC Daresbury. The project began in June 2015 and will run until May 2017.

It should be noted that TSERO also advances the use of superoptimization for energy usage (see superoptimization page).

Technical description

For TSERO, we are moving away from a plugin based interface to the compiler. Whilst the plugin technique is powerful and flexible, it has a number of weaknesses.

It only gives control over whether passes run or not. It does not give control over the various heuristics that adjust the behavior of each pass.

It has the potential to create invalid combinations of passes. The machine learning framework has to filter out combinations which just crash the compiler.

While the plugin interface is a defined part of the compiler API, the internal state it exposes may not be, and so can change without warning between releases.

Instead TSERO will work through the command line of the compiler. This is a stable API, and is also unlikely to allow creation of impossible pass combinations which crash the compiler.

The approach does depend on the compiler exposing sufficient control through its command line. This is true of GCC, which allows control over almost all passes — whether they run or not — and also their various heuristics. For LLVM, the standard compiler driver deliberately does not expose this low-level detail. In this case we have to drive the component parts of the compiler (clang, opt, llc) individually, which gives access to many more of the pass control parameters.
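Driving the components separately amounts to a three-stage pipeline: clang lowers C to LLVM IR, opt runs a chosen pass sequence, and llc generates the object file. The sketch below only constructs the command lines rather than running them, and the pass names passed to opt are illustrative stand-ins for a trained selection:

```python
def llvm_pipeline(source, passes):
    """Build the three command lines for driving LLVM's components
    individually, rather than through the clang driver. The pass names
    are illustrative; real runs would use the trained selection."""
    ir, opt_ir, obj = "prog.ll", "prog.opt.ll", "prog.o"
    return [
        ["clang", "-S", "-emit-llvm", source, "-o", ir],   # C -> IR
        ["opt", "-S"] + [f"-{p}" for p in passes]          # chosen passes
            + [ir, "-o", opt_ir],
        ["llc", "-filetype=obj", opt_ir, "-o", obj],       # IR -> object
    ]

cmds = llvm_pipeline("prog.c", ["mem2reg", "loop-unroll"])
for c in cmds:
    print(" ".join(c))
```

A real harness would hand each list to something like subprocess.run; keeping the stages explicit is what exposes opt's per-pass control that the clang driver hides.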

There still remains the question of how to determine the features in a program. We may continue to use the plugin approach for this, but will also explore the potential for extracting such information in other ways. The challenge is that using any other parser has the potential to reject, or at least be confused by, some programs.

A side effect of having control over both passes and heuristic parameters is that the number of variables to be trained becomes even higher (by a factor of around 3). The number of training runs and the size of the training set are already limiting factors, but we believe that using Plackett-Burman analysis to eliminate options which have no effect will keep the experimental design at a size that remains tractable for fractional factorial design.

Energy measurement for TSERO

It is not feasible to connect MAGEEC Wands to each core in an HPC system. Instead we will have to rely on the existing energy measurement facilities. For Intel-based systems this is RAPL, offering energy data every millisecond. This coarser-grained measurement will require longer runs to be statistically significant, but this is offset by the very large number of cores available.
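Unlike the Wand's direct voltage/current samples, RAPL exposes a cumulative energy counter, so power is recovered from counter deltas. A minimal sketch, assuming microjoule counter units (the sample values are invented; real code would also handle counter wraparound):

```python
def average_power_watts(readings):
    """Compute average power from RAPL-style cumulative energy counters.

    readings: list of (time_s, energy_uJ) samples, e.g. one per
    millisecond. RAPL exposes a cumulative microjoule counter, so power
    is the energy delta divided by the time delta. Counter wraparound
    is ignored in this sketch.
    """
    (t0, e0), (t1, e1) = readings[0], readings[-1]
    return (e1 - e0) * 1e-6 / (t1 - t0)

# Toy trace: the counter advances 15000 uJ over 1 ms => 15 W.
samples = [(0.000, 1_000_000), (0.001, 1_015_000)]
print(average_power_watts(samples))
```

Because each reading covers a whole millisecond of package activity, attributing the energy to a specific code region requires the longer, repeated runs mentioned above.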

Rather than integrating with RAPL directly, we will use the MAP system from our partner Allinea to provide the processed energy data. This will also make porting the system to other architectures easier.

Operating system considerations

With MAGEEC we were concerned with deeply embedded systems, where typically there is a single thread of control throughout, so attributing energy measurements to code is (relatively) straightforward.

HPC systems typically run multi-tasking operating systems, so the data will have to be analyzed within the context of the operating system. For HPC we have the advantage that any core is usually running a single code base, so this will be more a matter of removing communication overhead than of disambiguating multiple processes.

Training data set

In general BEEBS programs are too small to be representative training programs for HPC. We are working with our TSERO partners at the STFC Hartree Centre to construct a more appropriate open benchmark set. We may still use some BEEBS programs, scaled up for HPC, where they are appropriate.