Compiler Optimisations for Energy Efficiency – Part 1

James Pallister

Over the summer, Embecosm funded a research project with the University of Bristol into how compiler optimisations affect the energy consumption of applications. That project is now complete, and the benchmarks and academic paper are available for download. In this blog post I will introduce the main aims of the project and discuss only the high level results in any depth. In future posts I will explore more technical topics, such as fractional factorial designs and the case studies.

Introduction

Selecting the right compiler optimisations to apply to your program is not easy: they often have unexpected effects or interact in unforeseen ways. Furthermore, making a small change to the structure of your code, or running it on a slightly different architecture, can change how well it performs, potentially making a different set of optimisations the better choice.

The aim of this project was to answer a few questions about the effectiveness and composability of optimisations:

Are execution time and energy consumption correlated?

Are there any optimisations that lower the energy consumption or execution time for all benchmarks and platforms?

How does the processor architecture affect which optimisations will be effective?

Can we discover what the most effective optimisations are without exhaustively searching through every combination?

We hope that answering these questions will let us choose better sets of optimisations and produce more energy efficient applications.

Processor architectures

We wanted to explore how the architecture affects which optimisations work and which don’t, so we chose a range of development kits built around different processors and instrumented each with power measurement circuitry. We selected the following platforms:

This gave a nice spread of different pipeline architectures, memory and cache sizes, instruction sets and clock speeds.

High level results

In the first experiment we ran each benchmark on each platform several times, once for each optimisation level (O1, O2, O3, O3+LTO, Os), and recorded the energy consumption and execution time for each run. From this we can compare how much of an effect the optimisations had on each of these metrics. We can also make comparisons between the processors and the benchmarks, because we look at the relative change in each metric (whether energy or execution time) for each experiment.
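To make the idea of relative change concrete, here is a minimal Python sketch. The numbers are invented for illustration, and the choice of O1 as the baseline is an assumption made for this example rather than something taken from the project's analysis; `relative_to_baseline` is a hypothetical helper name.

```python
# Illustrative only: the measurements below are made up, and using O1 as
# the baseline is an assumption for this sketch.

def relative_to_baseline(measurements, baseline="O1"):
    """Express each level's (energy, time) as a fraction of the baseline level."""
    base_energy, base_time = measurements[baseline]
    return {
        level: (energy / base_energy, time / base_time)
        for level, (energy, time) in measurements.items()
    }

# (energy in millijoules, time in milliseconds) for one benchmark on one platform
example = {
    "O1": (10.0, 200.0),
    "O2": (8.0, 160.0),
    "O3": (7.5, 150.0),
}

for level, (rel_energy, rel_time) in relative_to_baseline(example).items():
    print(f"{level}: energy x{rel_energy:.2f}, time x{rel_time:.2f}")
```

Because every value is divided by the baseline for the same benchmark and platform, the resulting ratios can be compared across processors whose absolute energy and time figures differ by orders of magnitude.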

Using all of this data we can build an overall picture of how the platforms, the benchmarks and the optimisation level affect energy consumption. First, the execution time on its own:

Execution time, FDCT, Cortex-M0 & Cortex-A8

This shows how the execution time changed with increasing optimisation level for the FDCT benchmark on the Cortex-M0 and Cortex-A8 platforms. The difference between the platforms is large, with the higher optimisation levels having a much greater effect on the Cortex-A8. Adding energy consumption:

Execution time and energy consumption, FDCT, Cortex-M0 & Cortex-A8

The energy consumption for each point is now layered on top. This line closely follows the execution time for the Cortex-M0, but diverges at higher optimisation levels for the Cortex-A8. This is most likely due to optimisations such as instruction scheduling, which allow the superscalar pipeline to be used more effectively.

Energy, time and power, FDCT, Cortex-M0 and Cortex-A8

The graph now also shows the average power. Since the execution time and energy consumption are almost perfectly correlated in the Cortex-M0’s case, this line is quite flat. The Cortex-A8 again shows more interesting behaviour, with the average power rising when more parts of the pipeline are in simultaneous use.
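The link between the three quantities is simply that average power is energy divided by time. The short Python sketch below uses invented figures (not measurements from the project) to show why the power line stays flat when energy and time fall together, and why it rises when time falls faster than energy; `average_power` is a hypothetical helper name.

```python
# Illustrative only: all figures are invented to show the arithmetic.

def average_power(energy_joules, time_seconds):
    """Average power in watts: total energy divided by elapsed time."""
    if time_seconds <= 0:
        raise ValueError("time_seconds must be positive")
    return energy_joules / time_seconds

# Case 1: energy and time both fall by 20%, so average power is unchanged.
flat_before = average_power(0.010, 0.200)
flat_after = average_power(0.008, 0.160)

# Case 2: time falls by 50% but energy only by 30%, so average power rises.
rising_before = average_power(1.0, 2.0)
rising_after = average_power(0.7, 1.0)

print(f"flat:   {flat_before:.3f} W -> {flat_after:.3f} W")
print(f"rising: {rising_before:.3f} W -> {rising_after:.3f} W")
```

The first case resembles the behaviour described for the Cortex-M0, the second the behaviour described for the Cortex-A8.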

We made one of these line graphs for each combination of benchmark and processor, shown below:

High level energy consumption for all platforms and benchmarks.

Along the Y-axis are all the benchmarks we examined, while along the X-axis are the different processors. By looking across horizontally we can see commonalities between the same benchmark executing on different platforms. For example, the CRC32 benchmark does not get optimised well on any platform. This is due to the code already being simple and well optimised, not leaving the compiler with much to do.

By looking vertically we can see what similarities the platforms impose for different benchmarks. This is most evident for the Cortex-A8 processor, where we see the execution time and energy consumption lines diverging as optimisation level increases. As mentioned earlier, this is most likely due to the superscalar nature of the core – when multiple instructions are executed at once the execution time will be lower, but more resources of the processor are being used so the energy consumption doesn’t drop as much.

The relationship between architecture and energy

The results appear to answer the first question quite well: energy consumption and execution time are correlated in most cases. The degree of correlation depends on the complexity of the processor being targeted. Simple processors such as the Cortex-M0 have very few cases in which an optimisation has a larger effect on energy than on time. However, these results are for the overall optimisation levels, and in a future blog post I will examine whether this also holds for individual optimisations.
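One common way to put a number on "correlated" is the Pearson correlation coefficient. The sketch below is only an illustration of that idea: the post does not state which measure was used in the project, the data are invented, and `pearson` is a hypothetical helper written out by hand so the example has no dependencies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)

# Invented relative values across four optimisation levels.
time_simple = [1.0, 0.8, 0.6, 0.5]
energy_simple = [1.0, 0.8, 0.6, 0.5]    # energy tracks time exactly

time_complex = [1.0, 0.8, 0.6, 0.5]
energy_complex = [1.0, 0.9, 0.85, 0.9]  # energy falls less than time

r_simple = pearson(time_simple, energy_simple)
r_complex = pearson(time_complex, energy_complex)
print(f"simple core:  r = {r_simple:.2f}")
print(f"complex core: r = {r_complex:.2f}")
```

A coefficient close to 1 corresponds to the flat power line seen on the simple core; a lower value reflects energy and time diverging, as described for the more complex core.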

The Cortex-A8 is by far the most complex processor we used. This shows up in the results as a large relative drop in execution time, and to a lesser extent energy consumption, when optimisations are enabled, as the compiler is able to take advantage of specialised features such as the NEON SIMD unit and superscalar pipeline.

It should be noted that all of the optimisations used here are targeted towards decreasing the execution time of the application. It may therefore not be surprising that energy and time are correlated: we spend less time calculating, so our total energy expenditure is lower. As optimisations specifically for energy are introduced into compilers, we may see energy consumption dropping by more than execution time.

In future posts I will go into more detail about which optimisations each of these levels enables, and how we can find the most effective optimisations in that set.