Approaching Peak Theoretical Performance with Standard C

Introduction
Embedded processors are typically assigned engineering and scientific tasks. Examples include real-time processing of multimedia streams, structural analysis, and health and medical applications. Some mathematical algorithms are common to these tasks and are considered building blocks for many applications.

GCC Status
State-of-the-art compilers like GCC are very effective at optimizing performance through common-subexpression elimination, loop unrolling, pipeline scheduling, and register allocation. The combination of the GCC compiler and a computer architecture optimized for GCC enables very high program efficiency from compiled C code. The current Epiphany compiler is based on GCC 4.7, whose optimizer introduces some new optimization methods that greatly improve performance relative to previous revisions. This paper describes the execution-time results achieved with the new compiler through proper Standard-C coding and choice of compiler options.

Benchmarks
Traditionally, the performance of such processors is measured as the number of clock cycles required to execute these building blocks. The Epiphany framework provides leading-edge efficiency in both power consumption and execution of compiled code. Efficient compiled code enables rapid application development with an easy migration path from available Standard-C source code, without the need to rewrite libraries and applications. The Epiphany compiler offers a best-in-class code optimizer that, with proper C coding, can generate code approaching the efficiency of hand-written assembly.

To test the compiler's performance, four benchmarks were programmed. These are:
1. 16-Tap FIR filter
2. Dual BiQuad IIR filter
3. 8×8 matrix multiplication
4. Vector dot product (scalar product)
All benchmarks were compiled as single-threaded programs and run on a single Epiphany core. Each program was first written in a naïve form and then rewritten to take advantage of optimization techniques such as loop unrolling and circular buffer elimination. In addition, optimizer command-line options were chosen to further enhance the optimization.
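As an illustration of the kind of rewrite described above, the following is a minimal sketch of a 16-tap FIR filter in two forms. It is not the benchmark source used for this paper; the function names, the buffer layout, and the choice of two accumulators are assumptions made for the example. The first form keeps a circular delay line and wraps its index on every tap. The second processes a block from a linear buffer, which removes the wrap-around arithmetic and lets the inner loop be partially unrolled.

```c
#include <stddef.h>

#define NTAPS 16

/* Naive form (hypothetical): circular delay line, index wrapped with a
 * modulo on every tap. Returns one output sample per call. */
float fir_naive(const float coeff[NTAPS], float delay[NTAPS],
                size_t *head, float sample)
{
    float acc = 0.0f;
    size_t i;

    delay[*head] = sample;                      /* overwrite the oldest sample */
    for (i = 0; i < NTAPS; i++) {
        size_t idx = (*head + NTAPS - i) % NTAPS;   /* position of x[n - i] */
        acc += coeff[i] * delay[idx];
    }
    *head = (*head + 1) % NTAPS;
    return acc;
}

/* Restructured form (hypothetical): block processing over a linear buffer.
 * Assumed layout: in[] holds NTAPS-1 history samples (oldest first)
 * followed by n new samples, so no index ever wraps. The inner loop is
 * unrolled by two with independent accumulators. */
void fir_block(const float coeff[NTAPS], const float *in, float *out, int n)
{
    int k, i;

    for (k = 0; k < n; k++) {
        const float *x = in + k + NTAPS - 1;    /* newest sample for output k */
        float acc0 = 0.0f, acc1 = 0.0f;

        for (i = 0; i < NTAPS; i += 2) {
            acc0 += coeff[i]     * x[-i];
            acc1 += coeff[i + 1] * x[-i - 1];
        }
        out[k] = acc0 + acc1;
    }
}
```

Both functions compute the same convolution; the block form trades a small amount of extra input storage for a loop body with no modulo operations. Because the two forms add the products in a different order, their floating-point results may differ in the last bits for general inputs.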

Results

The Nominal figures are the total number of multiplies required to calculate the result. This is a theoretical absolute lower bound for the cycle count and is achievable only if the system can perform all data loads and stores, looping, buffering and other arithmetic operations in parallel with the multiplier unit. This is rarely the case, and actual cycle counts are higher.
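To make the multiply-count definition concrete, the sketch below counts multiplies for the four benchmark kernels. The function names and the size parameters (number of samples, vector length) are hypothetical, since the data sizes are not stated here. It also assumes direct-form biquad sections with the leading feedback coefficient normalized to 1, and the schoolbook matrix product.

```c
/* FIR filter: one multiply per tap per output sample. */
unsigned long nominal_fir(unsigned long n_taps, unsigned long n_samples)
{
    return n_taps * n_samples;
}

/* Cascaded biquad IIR: 5 multiplies per section per sample
 * (b0, b1, b2, a1, a2), assuming a0 is normalized to 1. */
unsigned long nominal_biquad(unsigned long n_sections, unsigned long n_samples)
{
    return 5UL * n_sections * n_samples;
}

/* Square n x n matrix product, schoolbook algorithm: n^3 multiplies. */
unsigned long nominal_matmul(unsigned long n)
{
    return n * n * n;
}

/* Dot product of two length-n vectors: n multiplies. */
unsigned long nominal_dot(unsigned long n)
{
    return n;
}
```

Under these assumptions, an 8×8 matrix multiplication has a nominal count of 512, a 16-tap FIR filter needs 16 multiplies per output sample, and a dual biquad needs 10 per sample.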
The Efficiency figures express the Nominal count as a percentage of the actual Optimized-C cycle count. Thus, 50% efficiency on a nominal count of 200 cycles means an actual execution time of 400 cycles. The Gain figures represent the relative reduction in cycle count of the Optimized-C implementation with respect to the Naïve-C implementation.
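The two metrics can be written out as a short sketch. The function names are hypothetical, and the Gain formula reads "relative reduction" as (naïve - optimized) / naïve, which is an assumption about the intended definition.

```c
/* Efficiency in percent: nominal multiply count relative to the
 * measured cycle count of the optimized implementation. */
double efficiency_pct(unsigned long nominal_cycles, unsigned long actual_cycles)
{
    return 100.0 * (double)nominal_cycles / (double)actual_cycles;
}

/* Gain in percent, assumed here to mean the fraction of the naive
 * cycle count that the optimized implementation removes. */
double gain_pct(unsigned long naive_cycles, unsigned long optimized_cycles)
{
    return 100.0 * (double)(naive_cycles - optimized_cycles)
                 / (double)naive_cycles;
}
```

With the example from the text, efficiency_pct(200, 400) evaluates to 50.0.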

By comparison, practice shows that SIMD architectures with optimizing C compilers reach efficiencies of 10%-20%. It is therefore very hard to implement an effective low-power system on a SIMD coprocessor using Standard-C code alone; specialized ported libraries are required.

One observation is that the best performance required a combination of some degree of explicit loop unrolling and the -funroll-loops compiler option. Some trial and error was required here, but once the unrolled loop skeleton was coded, it was easy to change the unrolling factor.
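One way such an unrolled loop skeleton might look is sketched below for the vector dot product. This is an illustrative assumption, not the code used for the benchmarks: the UNROLL macro, the function name and the accumulator array are all hypothetical. The inner loop has a constant trip count, so an optimizing compiler can flatten it completely, and changing the unrolling factor means editing a single macro.

```c
#define UNROLL 4   /* hypothetical unrolling factor; change and re-measure */

/* Dot product with explicit unrolling. Each of the UNROLL accumulators
 * forms an independent dependency chain, which gives the scheduler room
 * to overlap loads with multiply-accumulate operations. */
float dot_unrolled(const float *a, const float *b, int n)
{
    float acc[UNROLL] = {0.0f};
    float sum = 0.0f;
    int i, u;

    /* Main body: UNROLL elements per iteration. */
    for (i = 0; i + UNROLL <= n; i += UNROLL)
        for (u = 0; u < UNROLL; u++)        /* constant trip count */
            acc[u] += a[i + u] * b[i + u];

    /* Remainder when n is not a multiple of UNROLL. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    /* Combine the partial sums. */
    for (u = 0; u < UNROLL; u++)
        sum += acc[u];

    return sum;
}
```

Building this with optimization enabled and -funroll-loops added (for example, -O3 -funroll-loops with a GCC-based toolchain) lets the compiler unroll further on top of the explicit factor. Which combination is fastest has to be found by measurement, and because the partial sums change the order of the floating-point additions, results can differ slightly from a simple sequential loop.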

It should be emphasized that by taking advantage of the Epiphany’s parallel execution philosophy and framework it is possible to accelerate the calculation of these programs even further.

Conclusion
In conclusion, this paper has shown that Epiphany’s leading-edge compiler, combined with proper Standard-C coding, can produce very high performance compiled code for basic mathematical tasks. This can greatly reduce the time required to develop an embedded application on the Epiphany platform, both by eliminating the need to migrate existing assembly code to a new architecture and by allowing an existing Standard-C code base to be reused without the performance penalty often associated with compiled code.