
OpenMP Parallel computing in Raspberry Pi

This article examines how to improve software execution speed on embedded devices with a multi-core ARM mobile processor, by migrating existing software code to parallel computing with the OpenMP API. The improvements are implemented and tested on Raspberry Pi 2 and mobile Android platforms, although the same principles and technologies apply directly to optimizing desktop and server software as well.

Some time ago I was migrating audio processing software to Android devices and faced the challenge of making an intensive calculation algorithm run fast enough even on low-end Android devices.

The case concerned an audio signal processing algorithm that runs very smoothly on a regular laptop computer, occupying about one percent of the laptop’s CPU capacity while transposing the pitch of multi-channel audio in real time. When ported to a low-end smartphone, the same algorithm was on the very verge of executing fast enough for a smooth real-time audio experience.

This performance difference was of course due to different CPU capabilities. Despite the very impressive development of low-power mobile CPUs during the past decade or so, the processors used in mobile/embedded devices are optimized for low cost and low power consumption, and their true performance still falls rather far behind the processors used in personal computers, which allow much more permissive cost and power design budgets.

Mobile trend towards multi-core CPUs

The mobile device world has recently gone through a similar transition from single-core to multi-core CPUs as the PC world saw during the past decade: while single-core CPUs were still mainstream in smart handsets a few years ago, nowadays even low-end handsets feature two or more CPU cores, flagship smartphone models from Samsung, LG, Sony etc. already feature eight-core CPUs, and no doubt we shall see even higher core counts in future products. The same trend applies to embedded microcontroller families.

This puzzle led to an examination of how to improve the execution performance of C++ software originally written for traditional single-CPU execution, by modifying the processing algorithms to utilize several CPU cores in parallel. The OpenMP API was selected as the parallel programming technology due to its transparent support for different operating systems, compiler tools and hardware environments, so that it nowadays works even on mobile devices.

Parallel programming for multi-core CPUs


Packing several CPU cores into a multi-core unit provides a significant theoretical increase in computation horsepower, yet the full power of a multi-core unit gets harnessed only if the target software is also (re)designed to utilize multiple parallel CPU cores.

The key to utilizing multiple CPU cores is to design the program so that it divides the execution of algorithms into several parallel threads that run simultaneously. This is called parallel programming.

Designing an efficient parallel algorithm is a non-trivial task, because algorithms typically require specific stages to be performed in a specific, well-defined order. There are practical challenges in arranging the different parts of an algorithm to execute efficiently in parallel, while ensuring that the partial results can eventually be combined in the correct order at the end.

Luckily there are software tools designed to ease the parallel programming task.

OpenMP

OpenMP is among the most prominent SMP parallel programming technologies. OpenMP is a specification of programming language extensions that allow the software developer to tag hints into the program source code about which functions and loops can benefit from splitting the execution path into several parallel threads. Based on these tags added by the developer, the compiler then takes care of the rest: it splits execution into separate parallel threads for multi-core execution, and synchronizes the final results of each thread back together in a consistent way.

OpenMP has several benefits:

High-level abstraction makes it easy to use in new and existing software, without the developer needing to worry too much about the details of how the threaded execution gets managed internally

Widely adopted: the OpenMP specification is developed by a non-profit organization consisting of industrial corporations and institutions, so it is widely supported, and reasonably recent versions of common C/C++ and Fortran compilers support at least the elementary OpenMP API features. For example, GCC from version 4.2 onward and Microsoft Visual Studio from 2008 onward support OpenMP

Supports various platforms: the same optimizations work across different compilers and hardware/OS platforms, thanks to the standard API specification. The same source code can thus be compiled for desktop, mobile and embedded platforms alike.

A transparent API allows building the same software with and without parallel optimizations enabled. This is beneficial in some cases; for example, it can be desirable to use the same source code also on platforms that have no OpenMP-capable compiler available. In such a case the legacy compiler simply ignores the OpenMP directives and the software executes normally on a single CPU. In some cases it is also useful to work on a non-parallel build for easier debugging or comparison.

While OpenMP is an easy and versatile approach to parallel programming on SMP multi-core/multi-CPU computers, other parallel programming techniques exist too. For example:

MPI (Message Passing Interface), a technology for distributed parallel programming. MPI enables building large-scale parallel programs that can expand a common calculation task beyond a shared-CPU unit to a whole cluster of networked computers. MPI can be used together with OpenMP, so that OpenMP performs parallel execution inside a single SMP computer unit, while MPI distributes execution across a cluster of such units.

CUDA, an Nvidia-sponsored programming model for using GPUs (graphics processing units) for general-purpose computing. Current graphics processing units in PCs and mobile devices consist of a wide set of multi-core execution units that can nowadays be used for general-purpose calculation besides drawing graphics. The hardware and programming model of GPUs is much more restricted than what OpenMP enables, but because a single GPU can contain from tens up to thousands of parallel execution units, certain special routine calculation tasks can gain remarkable performance acceleration with CUDA.

Optimizing software runtime performance

The principle of software performance optimization is straightforward: Detect where the program spends most of the execution time, and improve these parts to run faster.


Typically relatively few hot-spot routines use up most of the software execution time, so that just a handful of critical functions consume >80% of the overall execution time. The key to software performance optimization is thus to identify these “hot-spot” routines within the software, and focus the optimization effort on rewriting these few routines to execute faster, which then results in faster overall execution of the software.

Let’s follow this approach for migrating the existing C/C++ software to execute in parallel CPU cores.

The main steps for performance optimization are the following:

Step 1: Define a performance benchmark. To measure and identify which parts of the software use the most CPU time, i.e. hold the most optimization potential, it’s necessary to define a realistic and repeatable benchmark scenario for measuring execution performance before and after the optimization improvements

Step 2: Measure how the software uses the execution time. Run the software through the benchmark scenario while using a software profiling tool to record information about the software execution.

Step 3: Analyse the results to identify which routines to improve. Use the profiling tool to analyse the recorded data and see in which routines the CPU spent most of its time

Step 4: Implement code improvements. Knowing the hot-spot routines with the biggest contribution to the software runtime, improve these routines to execute faster. In this case, this step involves making the hot-spot routines utilize multiple parallel CPU cores.

Step 5: Verify by comparing the results before/after the optimization changes. Repeat the benchmark scenario to get feedback on the efficiency of the optimizations, and to see how much further optimization potential may remain

Let’s do this!

Step 1. Benchmark setup

Target setup

Raspberry Pi 2 was chosen as working environment for the OpenMP optimization:

It runs a full-fledged Linux OS so it has a versatile set of free software development tools available and thus allows easy testing and tuning of the software

With its four-core ARM CPU, Raspberry Pi 2 makes a good example of a product that introduces multi-core CPUs into ever more affordable electronics

Common smartphone devices (Android, iPhone and Windows Phone) use similar multi-core ARM processors as the Raspberry Pi, so performance improvements like those achieved on the Raspberry Pi 2 platform expectedly translate also to smartphones and tablets.

As of writing, the GCC compiler version 4.6 included in Raspberry Pi 2’s “Raspbian” operating system distribution supports OpenMP 3.0 specifications; recent enough for this examination.

The target software

The target software to optimize is SoundTouch, an open-source audio processing library and application written in C/C++. It makes a good example for parallel optimization because it’s openly available for several platforms, it was originally designed and written for traditional single-core execution, and it involves relatively calculation-intensive DSP routines.

SoundTouch processes digital audio streams, so to give a fair amount of work for the benchmark, let the benchmark scenario be retuning a 1000-second CD-quality (44.1kHz/16bit/stereo) audio track from a 440Hz to a 432Hz base frequency with the SoundStretch utility. Converted to the semitone scale, this corresponds to lowering the key by 0.318 semitones.

The shell command to run the benchmark scenario in Raspberry / Linux is accordingly:

time ./soundstretch original.wav /dev/null --pitch=-0.318

In this case we direct the output to /dev/null instead of a real output file to minimize the impact of mass storage I/O on the benchmark result. This is because Raspberry Pi 2 uses a consumer-grade flash memory SD card as mass storage, whose relatively low write speed might unnecessarily bias the processing times.

The “time” prefix before the main command is a standard GNU command that reports the total execution time after the main program finishes.

Steps 2 & 3. Profile the benchmark run

Profiling the benchmark run shows that the first two functions together used over 90% of all the execution time:

TDStretch::calcCrossCorr() : 63.3% of CPU time / 77.173 seconds

FIRFilter::evaluateFilterStereo() : 27.6% of CPU time / 33.66 seconds

This is a good example of the optimization rule of thumb discussed earlier: typically just a few hot-spot functions consume most of the CPU time. It’s clear that the optimization effort should focus on these two functions; cutting down the execution time of these two hot-spots will provide a good overall reduction in execution time.

On the other hand, all the other functions in the program together consume less than 10% of the execution time, so even very thorough optimization of any or all of them would yield only a fractional improvement in the overall execution time.

Step 4. Implement OpenMP improvements

Based on the “before” profiling results we know that about 63% of the execution time is spent within the single function calcCrossCorr. By examining the source code we notice that this function is called from within a “for” loop in one other function of the program.

Let’s now see how we can use OpenMP to optimize this loop to execute in several CPU cores in parallel.

OpenMP Syntax

There are good OpenMP programming tutorials available online, describing in detail how to use the various features of the OpenMP API. See for example the Tutorials at OpenMP.org

All OpenMP source-code level instructions begin with #pragma omp. #pragma is a standard C/C++ directive for special compiler definitions. This approach has the advantage that compilers that support OpenMP know what to do with the #pragma omp directives, while compilers that don’t can still compile the same source code correctly and simply ignore the OpenMP #pragma statements.

In this case we need only two of OpenMP’s #pragma directives. The first of these is:

#pragma omp parallel for

This directive instructs the compiler to split execution of the following for loop into several parallel threads, allowing each thread to utilize a separate CPU core. After the for loop, program execution continues as usual in a single thread.
The directive is used by placing it immediately before a ‘for’ loop.

compiler switches

Next, let’s add the proper compiler switches to the project’s Makefile to build an OpenMP-enabled compilation. For the GCC compiler the flag is -fopenmp, and it needs to be added to both the compiler and linker options.

These can be added to the Makefile CXXFLAGS/LDFLAGS variables, or alternatively given on the command line when launching the build:

make -j CXXFLAGS=-fopenmp LDFLAGS=-fopenmp

That’s it! In principle this is all that is required to make the given ‘for’ loop execute its iterations in parallel. This single directive would in this case already be enough to divide the execution into multiple threads and thus run on multiple CPU cores in parallel!

…however: multi-thread access protection

However, there’s still a small catch. Once the contents of the “for” loop execute in several parallel threads simultaneously, it’s possible that several threads attempt to modify the shared variables bestCorr and bestOffs at the same time. This is called a multi-thread access conflict: two threads reading and modifying shared variable values at the very same time, which can cause unpredictable results.

What is typically seen when multiple threads modify shared variable values simultaneously is that the produced calculation results are incorrect, and also unpredictable, so that the same algorithm may produce randomly different results on different runs even with the same input and parameter data.

The second OpenMP directive we need addresses this: #pragma omp critical, a directive for controlled access to shared data. This directive ensures that the “critical section” immediately following it can be executed by only a single thread at a time.

If several threads reach the critical section simultaneously, only one thread at a time gets to execute the code in it; all other threads wanting to enter the critical section pause and wait their turn until the previous thread(s) have exited the critical section block.

critical section performance consideration

Apply the critical section directive with some caution to avoid degrading performance, because a critical section can cause parallel threads to pause and wait. If the parallel threads end up spending too much of their time waiting at a critical section, the resulting code may run no faster than a single-thread/single-core execution, or in the worst case even slower than the original single-threaded version.

For this reason we add an additional “if” check before the critical section, so that a thread enters the critical section only when it actually needs to modify the shared variable value.

In a similar fashion we improve the other detected hot-spot function, evaluateFilterStereo, by again adding the directive #pragma omp parallel for in front of a for loop. In that case it’s not even necessary to add a critical section, because in this function two parallel threads never try to modify the same data simultaneously.

The table below compares the before/after results. We can conclude that the OpenMP optimizations reduced the real-time execution duration from about 122 seconds to just below 40 seconds, meaning better than a three-fold improvement in execution time. That’s quite a good improvement from adding just a few optimization clauses into two functions!

              Before         With OpenMP
Real time     121.939 sec    39.647 sec
User time     121.210 sec    156.040 sec
System time   0.610 sec      0.630 sec

Execution times at the hot-spot function level are as below. We can see that the execution times of the two OpenMP-optimized functions have decreased to about one-fourth of the original, which is expected as they now run on four CPU cores in parallel!

                                              Before      With OpenMP
soundtouch::TDStretch::calcCrossCorr          77.13 sec   19.34 sec
gomp_barrier_wait_end                         –           10.58 sec
soundtouch::FIRFilter::evaluateFilterStereo   33.66 sec   8.27 sec
gomp_team_barrier_wait_end                    –           0.85 sec

Result considerations

Surveying the results above, we notice that while Raspberry Pi 2 has four CPU cores and the optimized functions indeed now spend only a fourth of the original time, the overall run-time improvement is about three-fold, not the four-fold that perfect scaling would give.

The reason for the difference is that splitting the execution of an algorithm into four parallel threads unavoidably introduces some multi-threading overhead and thread wait times for synchronizing results between the threads. We can see this overhead in the profile: the results now contain two new functions that were not present in the original profiling results, gomp_barrier_wait_end and gomp_team_barrier_wait_end.

These new functions are OpenMP thread synchronization routines that manage thread access synchronization before the critical section, and also the waiting at the end of parallel sections until all threads have completed their work and the results from each parallel task can be pulled together. Due to this additional threading overhead, the overall OpenMP improvement in this case is about a three-fold reduction in runtime, not four-fold.

We can see the same OpenMP overhead in yet another way by looking at the overall “user time” figures. The “user time” means how much CPU time the program occupied across all utilized CPU cores together. In this case the accumulated CPU user time of the parallel OpenMP run increased to 156 seconds from the 121 seconds of the original run: spread over four cores, those 156 seconds of CPU work complete in about 39 seconds of real time, roughly three times faster than the 121-second single-core run. The parallelized version thus completes the work much faster, yet it also consumes some 35 seconds of extra CPU work for managing the threading overhead.

Conclusion

We just improved existing C++ software, originally written for single-threaded execution, to utilize several parallel CPU cores by adding just a few simple OpenMP API directives into the source code. The same sources can now be compiled with OpenMP parallel processing support, or without OpenMP if so desired.

Based on the benchmark tests, these OpenMP improvements yield a three-fold improvement in real-world processing speed on the Raspberry Pi 2 platform with its four-core ARM processor. What an easy yet remarkable improvement!

Raspberry Pi clusters can make nice academic platforms for examining how to build, develop and optimize software for large computer clusters, as Raspberry Pi setups support SIMD-level, thread-level and cluster-level optimization techniques (particularly the Pi models 2 & 3 that have multi-core CPUs).

However, Raspberry Pi boards are optimized for low-cost/low-power use and thus feature somewhat feeble CPUs. If looking for serious real-world cluster calculation horsepower, one will find much better performance for the price in setups based on more advanced CPUs. Currently the ultimate choice would be a Xeon Phi system, yet one might also be able to build an interesting academic computing cluster using <100USD Intel x86 stick PCs...