Documentation

Overview

FFTW is perhaps the most widely used fast Fourier transform library available today. It builds upon decades of research and innovation in spectral analysis to provide a reliable, adaptive, and extensible DFT / DCT / DST / RDFT solution. FFTW provides the backend for both the MATLAB and GNU Octave "fft" commands, and it has been integrated into other applications such as GNU Radio.

Historically, FFTW was only optimized for personal computers based on the x86 or PowerPC architectures. Recently, support for the Cell Broadband Engine was also added. However, there was a distinct lack of ARM support. Since ARM devices have begun approaching the GHz range, they have become more suitable for personal computing devices - not just embedded controllers. The primary reason for the delay was that it was traditionally rare to find ARM processors with a dedicated floating-point (FP) unit (essentially a prerequisite for FFTW).

However, with the introduction of the ARMv7 architecture and the Advanced SIMD extensions (NEON), high-throughput floating-point processing became a practical reality. The BeagleBoard made it possible and affordable for the masses to start tinkering with the Cortex-A8 and NEON via the OMAP3 series of SoCs from Texas Instruments (who continues to be a world-class leader in producing ARM silicon... wink, wink).

Motivation

I've been working with ARM devices for several years, and have never lost one iota of interest. However, most of the devices I'd used for work (mainly ARMv4t and ARMv5te) were distinctly less powerful than the OMAP3. So once I heard about the project (a few years ago already), I picked up a BeagleBoard as soon as they became available and have been happy with it ever since.

Now that I had this powerhouse of an SBC (single-board computer), I could realize one of the cool ideas that I had had for some time. Ever since I started my bachelor's degree, started working, and then subsequently started my master's, I always thought it would be really neat to have the best graphing calculator money could buy for doing things like spectral analysis. Seriously - how many people would let their jaws drop in awe if you were to draw your mobile and suddenly run a quick simulation and spectral analysis of a radio channel, or provide a real-time scope reflecting the sounds that are all around you!?

Ok, well... maybe not everyone thinks that would be cool, but I do. So, like any self-respecting open-source junkie / embedded systems engineer, I immediately thought of GNU Octave and FFTW. Hence, the goal of this project was to speed up FFTW performance on NEON-enabled ARM devices. That involved three primary feature additions and a demonstration.

In general though, as next-generation chips become available (e.g. the OMAP4 series), ARM-based personal computing devices are coming closer and closer. For example, although Intel-based netbooks are great, I always found that they consumed too much power, or the fan was too loud, or they became too hot, or they were just too bulky for my purposes. Now, we're obviously seeing things like the iPad and OLPC tablets. In my opinion, a fanless, low-power ARM chip would most certainly dominate the netbook / tablet market. Having a versatile FFT library certainly makes that a more attractive option (particularly for students), along with lower power consumption, less noise, longer battery life, less weight, ...

List of Goals

extending the FFTW SIMD interface to support the NEON instruction set

adding a performance counter, so that the FFTW planner could accurately determine which algorithm was faster (rather than using approximation methods)

Status

Details

FFTW SIMD Interface

The NEON SIMD interface that I implemented can be configured with the '--enable-neon' option for FFTW. By default, it uses hand-optimized inline-assembler routines, but you can change that with '--enable-neon-intrinsics' if you prefer (discouraged). The routines are used by FFTW's codelets in dft/simd/codelets and rdft/simd/codelets.

Performance Counter

The cycle counter (sometimes called performance counter) was originally given to me in a demo app by my mentor, Mans, and I basically just had to coerce it into the FFTW codebase and ensure that it was getting turned off and on correctly during regular FFTW usage (which required a bit of debugging). You can enable the cycle counter with '--enable-armv7a-cycle-counter'. The cycle counter is crucial to having an optimally performing FFTW library. If it is not enabled, then FFTW purely uses estimation methods for optimization.

Speed, Speed and more Speed

Speed. That's what is most important after all, right? I spent quite a bit of time at the beginning of the project familiarizing myself with the many intricacies of the NEON instruction set, getting the correct habits down for alignment, scheduling, and so on to produce the best throughput for parallel arithmetic. After implementing the NEON SIMD interface in FFTW, the initial results were not particularly impressive and only made for a 1.5x to 2x increase in MFLOPS (as estimated by benchfft-3.1). I was aiming for a 10x increase in performance. To better gauge the potential, I put together an interface for FFTW to use the well-known and very fast power-of-two (POT) transforms of FFMPEG (see libavcodec/avfft.h from the ffmpeg git repository). This was done in an architecture-agnostic manner so that it could be used on any supported platform (not just ARM / NEON).

The FFMPEG routines blew way past what my SIMD interface had done, which was foreseeable, since all of the ffmpeg routines for NEON were written in hand-crafted assembler. Rather late in my project, I realized that the real bottleneck was poorly implemented memory copies. To be a bit clearer, you might need to read up on the Cooley-Tukey algorithm (the basis of the mixed-radix DFT). Essentially, when one computes the DFT of, say, a length-N signal using two composite transforms of lengths N1 and N2 such that N1*N2=N, the inputs for the N1 and N2 transforms are not contiguous in memory. The solution to that problem is to either a) copy the input to a contiguous section of memory, or b) use somewhat more complicated indexing routines. However, since option 'b' could require different indexing systems for virtually any composite transform, since it could also incur cache penalties, and since it made life very difficult for writing efficient SIMD code, option 'a' is really the only alternative. So after implementing the SIMD interface, the remaining problem boiled down to using NEON instructions to optimize memory transfers to and from contiguous regions (what FFTW calls a rank-0 transform). I was working on that part up until the last day, and most of the groundwork is laid out, but I just didn't have time to get to the end (my MSc thesis demanded the major part of my time).
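To make the rank-0 transform concrete, here is a minimal C sketch (my own illustration, not FFTW's actual implementation) of the strided gather that such a copy performs; vectorizing exactly this kind of loop with NEON bulk loads and stores on aligned data is the optimization described above:

```c
#include <complex.h>
#include <stddef.h>

/* Copy a strided sub-sequence into a contiguous buffer. For a
 * Cooley-Tukey split N = N1 * N2, the N1-point sub-transform at
 * offset n2 reads in[n2 + N2*n1], i.e. a stride of N2 elements,
 * so its input is not contiguous in memory. */
static void rank0_gather(const float complex *in, float complex *out,
                         size_t count, size_t stride)
{
    for (size_t i = 0; i < count; i++)
        out[i] = in[i * stride];
}
```

A scalar loop like this leaves most of the memory bandwidth on the table; NEON can move several complex elements per instruction, which is where the remaining speedup lives.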

So, as it stands, POT transforms are ~10x faster because they simply fire off a call to FFMPEG's routines, but non-power-of-two (NPOT) transforms (which FFTW is specifically renowned for) only exhibit the ~2x increase gained from my NEON SIMD interface. The good news is that the NPOT transform algorithms generally use a recursive or composite strategy to calculate the transform using composites of small prime numbers (such as 2, 3, 5, 7, 11, 13, etc.). In fact, the NPOT algorithms that FFTW utilizes theoretically work at approximately the same asymptotic complexity as the POT algorithms. So, once again, the problem simply boiled down to optimizing rank-0 transforms (strided memory copies).

Demonstration

The demonstration consists of graphs showing performance (see my project blog) and ultimately a screencast or video to show the real-world applicability of NEON optimizations to something like GNU Octave.

As for the video demonstration, within the next few days I'm planning on putting together a small screencast to show a real-world application (yaay!) using GNU Octave. The time it takes for the 'fft' command to execute will then be noticeably less, which is great if you're planning to use GNU Octave on a mobile device. Naturally, Octave has about a billion other things that still need to be optimized for ARM / NEON, but that is another project entirely.

This step is required for the GNU build system to recognize the additional configure and build options. Simply run

sh bootstrap.sh

and ignore any messages that you see. The additional configure options are '--enable-armv7a-cycle-counter', '--enable-neon', '--enable-neon-intrinsics', '--enable-ffmpeg', and '--enable-ffmpeg-test'. I would strongly suggest that you enable the cycle counter, otherwise your FFTW build will always use ESTIMATE mode, which will likely result in degraded performance.

IMPORTANT

If you choose to use '--enable-armv7a-cycle-counter', you will need a recent kernel with Mans Rullgard's USER_PMON patch applied. You can obtain his patch from here.

Patch FFTW

Change to the base directory of your FFTW source and run the following.

Alternatively, instead of --enable-neon, you can choose to use --enable-neon-intrinsics, but I would strongly discourage that.

Configure FFTW (with ffmpeg)

Linking to FFMPEG requires that the appropriate C header locations and library locations are passed to gcc. If you choose to link with ffmpeg, then you may also use the '--enable-ffmpeg' and '--enable-ffmpeg-test' configure parameters. The latter parameter disables almost all other DFT algorithms in FFTW, so it is only really useful for testing, not for general purposes. Also, feel free to use '--enable-neon' or '--enable-neon-intrinsics' as well.

To build the source and install it to a temporary location, use the following command.

make && make DESTDIR="/tmp/my-install" install

Strip

strip --strip-unneeded $(find /tmp/my-install -name '*.so*')

Install

In order to install the binaries (and possibly headers, documentation, etc) that were built, a good method is to simply use rsync. Assuming that your target device resides at IP 192.168.254.202, use the following command.

Alternatively, you could simply use Octave's 'fft' command. I've set up a fairly straightforward demonstration that will compare the average execution times for power-of-two cases. For this demonstration, there are a few prerequisites; you need to have GNU Octave, fftw3, and a recent version of ffmpeg (from the git repository) installed. The first two should be fairly simple to test, but to check the ffmpeg version (specifically libavcodec), use the command below and ensure that it returns a valid symbol.

ldd /usr/lib/libavcodec.so* | grep "av_fft_init"

If you do indeed have a recent version of ffmpeg, then download my Octave demo to your BeagleBoard, extract it in the root directory, and then run the demo script.

tar xpjf fftw-neon-demo-octave*.tar.bz2 -C /
./fftw-neon-demo-octave

FAQ

Why would someone want to go through FFTW just to get to the good stuff in FFMPEG?

That's a really good question, and I have an equally good answer. The main drawback of using FFMPEG is that it only works with POT transforms (and works really well at that). However, when working with very large datasets or multidimensional data (neither of which FFMPEG provides an interface for), you will quickly run out of memory using the traditional POT algorithms, because the only way to compute an NPOT transform with FFMPEG is to zero-pad it to a POT length. So, for example, if I wanted to compute the transform of a length-32769 ((2^15)+1) signal without the side effects of windowing, FFMPEG would require that I pad the signal to a length of 65536! This can have a major impact when one has limited memory resources (and cache size), like on the BeagleBoard or virtually any other ARM-based device.
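The padding cost is easy to verify. The helper below is a hypothetical illustration (not part of FFTW or FFMPEG) that computes the length a POT-only FFT would require:

```c
/* Smallest power of two >= n: the length a POT-only FFT (such as
 * FFMPEG's) would require after zero-padding an NPOT input. */
static unsigned next_pot(unsigned n)
{
    unsigned p = 1;
    while (p < n)
        p <<= 1;          /* double until we reach or pass n */
    return p;
}
```

For n = 32769 this returns 65536, i.e. adding a single sample past a power of two nearly doubles the working set.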

On the other hand, FFTW has the ability to split the work of a length-32769 transform into composite transforms of lengths 3, 3, 11, and 331. Recursively, FFTW can also reduce the prime-length-331 transform into smaller transforms using Rader's algorithm. FFTW continues this recursion, measuring the execution time of several different strategies (not unlike Dijkstra's algorithm for computing the shortest path through a directed graph), until it reduces the problem to several small, prime-sized sub-problems. In this case, it could fire off a POT problem to FFMPEG, use the internal codelets to compute a length-3 composite transform, or use virtually any other algorithm that it sees fit. FFTW then becomes a dispatcher of sorts, and a compiler of fast DFT algorithms, even for prime and NPOT lengths. Furthermore, it then stores the strategy so that if you have to perform the same computation in the future, you already have a plan to do it quickly.
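The first step of that strategy, splitting N into small prime factors, can be sketched in a few lines of C. This is a toy trial-division routine of my own, not FFTW's planner, but it shows where the 3, 3, 11, 331 decomposition comes from:

```c
#include <stddef.h>

/* Trial-division factorization: fills factors[] with the prime
 * factors of n in ascending order and returns how many were written.
 * e.g. 32769 -> 3, 3, 11, 331. Each composite factor becomes a small
 * sub-transform; a large leftover prime (like 331) is the case that
 * Rader's algorithm handles. */
static size_t factorize(unsigned n, unsigned *factors)
{
    size_t count = 0;
    for (unsigned p = 2; p * p <= n; p++)
        while (n % p == 0) {
            factors[count++] = p;
            n /= p;
        }
    if (n > 1)                /* leftover prime factor */
        factors[count++] = n;
    return count;
}
```

FFTW's real planner does far more than this (it times candidate decompositions and caches the winner as "wisdom"), but the factor list is the search space it starts from.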