Annex D - When does FFT Convolution/Correlation Pay?

Although it is widely known that FFTs provide a fast way to implement long convolutions or correlations (a correlation is just a convolution with the impulse response reversed and conjugated), using them requires longer and more complex code. So the question arises: when is it worth it? Just how long does a convolution have to be before using FFTs provides significant savings? For some reason this issue is not addressed in most DSP/FFT literature, despite its importance to a DSP system designer. To get a feel for the answer, we shall consider the implementation of a complex 1D FIR filter on an 'infinite' 1D complex data stream by two methods: direct convolution, and radix-2 FFT convolution using 50% overlap between FFTs.

We shall assume that the filter contains 2^p + 1 taps (its impulse response is 2^p + 1 samples long). This length has been chosen to ensure that when convolved with 2^p signal points, the resulting sequence has length (2^p + 1) + (2^p) - 1 = 2^(p+1), which fits a radix-2 FFT comfortably.

In each case, we shall quantify the algorithm in terms of the number of complex additions NA(p) and the number of complex multiplies NM(p) required to generate 2^p output points.

Direct Convolution.

Direct convolution can be performed point by point or in blocks. In either case, each output point will require 2^p + 1 complex multiplies and 2^p complex additions. Therefore, for 2^p output points:

    NA(p) = 2^p * 2^p = 4^p
    NM(p) = 2^p * (2^p + 1)
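As an illustration, the point-by-point form is just one multiply-accumulate per tap per output sample. The following is a minimal Python sketch for clarity, not optimised DSP code; the function name direct_fir is ours:

```python
def direct_fir(x, h):
    """Direct-form complex FIR filter: one complex MAC per tap
    for every output sample (len(h) multiplies, len(h)-1 adds)."""
    taps = len(h)
    # Zero history so the first outputs see a full delay line.
    padded = [0j] * (taps - 1) + list(x)
    y = []
    for n in range(len(x)):
        acc = 0j
        for k in range(taps):
            acc += h[k] * padded[taps - 1 + n - k]  # one complex MAC
        y.append(acc)
    return y
```

For example, direct_fir([1, 2, 3], [1, 1]) returns the running sums 1, 3, 5 (as complex values).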

FFT Convolution.

This scheme works as follows:

1. Divide the input stream into adjacent blocks of 2^p points.

2. Perform a 2^(p+1)-point DIF FFT of each block, padded with 2^p zeros.

3. Multiply by the 2^(p+1)-point FFT of the impulse response (a constant look-up table, which may include scaling).

4. Perform a 2^(p+1)-point IFFT of the resulting block, generating 2^(p+1) output points.

5. Add the first 2^p output points to the last 2^p output points of the previous convolved block. (This data is the filter output.) Store the last 2^p output points for the next convolved block.
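The steps above can be sketched with NumPy as follows. This is an illustrative implementation under our own naming (fft_overlap_add); a production DSP version would use an in-place fixed-point FFT:

```python
import numpy as np

def fft_overlap_add(x, h, p):
    """Overlap-add convolution following the steps above:
    blocks of 2**p samples, filter of 2**p + 1 taps,
    FFTs of 2**(p+1) points."""
    B, N = 2 ** p, 2 ** (p + 1)
    assert len(h) == B + 1
    H = np.fft.fft(h, N)               # constant look-up table (step 3)
    tail = np.zeros(B, dtype=complex)  # carry-over between blocks
    out = []
    for start in range(0, len(x), B):
        block = np.asarray(x[start:start + B], dtype=complex)
        spectrum = np.fft.fft(block, N) * H  # steps 2 and 3 (zero-padded)
        y = np.fft.ifft(spectrum)            # step 4: 2**(p+1) output points
        y[:B] += tail                        # step 5: overlap-add
        out.append(y[:B])                    # filter output for this block
        tail = y[B:]                         # stored for the next block
    return np.concatenate(out)
```

For input lengths that are a multiple of 2^p, the result matches the leading points of a full linear convolution, since the FFT size 2^(p+1) equals the linear convolution length of each zero-padded block.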

The following optimisations are assumed:

1. Each FFT/IFFT makes full use of the 'trivial' butterflies, thereby avoiding unnecessary complex multiplies.

2. The forward (DIF) FFT can avoid 2^p complex additions in the first pass, because the 'bottom' half of the array is zero.

We may note that the saving presented in optimisation 2 above effectively cancels out the cost of the additions in step 5, so we are left with two 2^(p+1)-point FFTs (or IFFTs) and 2^(p+1) complex multiplies. We have derived the number of butterflies required by a 2^p-point radix-2 FFT (2^(p-1) butterflies per pass, over p passes):

    p * 2^(p-1)

The number of 'trivial' butterflies (which require no multiplication, their twiddle factors being trivial values such as 1 and -j) is (valid for p > 2):

    3 * 2^(p-1) - 2
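This count is easy to confirm by brute force: enumerate every twiddle factor of a 2^p-point decimation-in-time FFT and tally those that require no multiplication. A quick Python sanity check (butterfly_counts is our name):

```python
def butterfly_counts(p):
    """Count total and trivial butterflies in a 2**p-point
    radix-2 DIT FFT by enumerating the twiddle factors."""
    total = trivial = 0
    for s in range(1, p + 1):          # stage s uses twiddles W(2**s)**k
        reps = 2 ** (p - s)            # each distinct twiddle recurs this often
        for k in range(2 ** (s - 1)):
            total += reps
            # Twiddle is exactly 1 (k = 0) or -j (k = 2**s / 4) -- the
            # only trivial values that occur with this indexing.
            if k == 0 or 4 * k == 2 ** s:
                trivial += reps
    return total, trivial
```

For example, butterfly_counts(4) returns (32, 22): a 16-point FFT has 32 butterflies, of which 22 are trivial, leaving the well-known 10 non-trivial complex multiplies.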

Each butterfly requires 2 complex additions, and each non-trivial butterfly also requires 1 complex multiply. So, for a single 2^(p+1)-point FFT (valid for p > 1):

    NA(p) = (p + 1) * 2^(p+1)
    NM(p) = (p + 1) * 2^p - (3 * 2^p - 2) = (p - 2) * 2^p + 2

So for the entire algorithm (two 2^(p+1)-point FFTs (or IFFTs) and 2^(p+1) complex multiplies):

    NA(p) = 2 * (p + 1) * 2^(p+1) = (p + 1) * 2^(p+2)
    NM(p) = 2 * ((p - 2) * 2^p + 2) + 2^(p+1) = (p - 1) * 2^(p+1) + 4
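In code, these counts are (a short Python mirror of the formulas; the function names are ours):

```python
def single_fft_ops(p):
    """Complex (adds, multiplies) for one 2**(p+1)-point radix-2 FFT,
    with the trivial butterflies costed at zero multiplies."""
    butterflies = (p + 1) * 2 ** p   # half the transform length, p+1 passes
    trivial = 3 * 2 ** p - 2         # butterflies needing no multiply
    return 2 * butterflies, butterflies - trivial

def fft_algorithm_ops(p):
    """Whole scheme: two FFTs/IFFTs plus 2**(p+1) spectral multiplies,
    producing 2**p output points per block."""
    na, nm = single_fft_ops(p)
    return 2 * na, 2 * nm + 2 ** (p + 1)
```

As a spot check, single_fft_ops(2) gives (24, 2): an 8-point FFT needs 24 complex additions but only 2 non-trivial complex multiplies.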

Comparing the Results.

To make the comparison simpler, we shall assume that real multiplies and adds take the same time (as they often do), and combine the figures for NA(p) and NM(p) as NOP(p) operations on real numbers. A complex addition costs 2 real additions; a complex multiply costs 4 real multiplies and 2 real additions, i.e. 6 real operations. So:

    NOP(p) = 2 * NA(p) + 6 * NM(p)

So for direct convolution we have:

    NOP(p) = 2 * 4^p + 6 * 2^p * (2^p + 1) = 8 * 4^p + 6 * 2^p

For FFT convolution we have:

    NOP(p) = 2 * (p + 1) * 2^(p+2) + 6 * ((p - 1) * 2^(p+1) + 4) = (5p - 1) * 2^(p+2) + 24

The ratio of these two is what we really want:

    R(p) = (8 * 4^p + 6 * 2^p) / ((5p - 1) * 2^(p+2) + 24)

This tells us (roughly) how much slower direct convolution (or correlation) is compared to FFT convolution. Let's tabulate a few values:

    p     FIR Taps (= 2^p + 1)     (Direct/FFT) Ratio

    3              9                      1.2
    4             17                      1.7
    5             33                      2.7
    6             65                      4.5
    7            129                      7.6
    8            257                     13.2
    9            513                     23.3
   10           1025                     41.8
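The table can be regenerated directly from the operation counts (a short Python check; nop_direct and nop_fft are our names):

```python
def nop_direct(p):
    """Real operations for direct convolution of 2**p output points:
    a complex add costs 2 real ops, a complex multiply costs 6."""
    na = 4 ** p                  # 2**p complex adds per point, 2**p points
    nm = 2 ** p * (2 ** p + 1)   # 2**p + 1 complex multiplies per point
    return 2 * na + 6 * nm

def nop_fft(p):
    """Real operations for the FFT scheme over the same 2**p points."""
    na = (p + 1) * 2 ** (p + 2)
    nm = (p - 1) * 2 ** (p + 1) + 4
    return 2 * na + 6 * nm

for p in range(3, 11):
    print(p, 2 ** p + 1, round(nop_direct(p) / nop_fft(p), 1))
```

Each printed row reproduces a row of the table above.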

For large p, a good approximation (keeping only the dominant term of numerator and denominator) is:

    R(p) ~ (8 * 4^p) / (5p * 2^(p+2)) = 2^(p+1) / (5p)

So it would appear that even for filters as short as 17 taps, FFT-based convolution looks attractive (in terms of the number of arithmetic operations). However, a few notes of caution are appropriate:

The number of arithmetic operations required is not necessarily a good indicator of execution time. For DSP instruction sets, the availability of single-cycle multiply/accumulate (MAC) instructions means that many of the adds come for free (so to speak). This is especially effective in direct convolution, which can be accomplished using nothing other than a long sequence of MACs.

In this example we have assumed a complex filter and signal. If we chose a real filter and real signal, direct convolution would be sped up by a factor of 4 (or so), whereas FFT-based convolution would only be sped up by a factor of 2 (or so).

For most DSPs, direct convolution routines can keep intermediate results in an extended-precision accumulator. The precision of the FFT-based intermediate results will be limited by the DSP word size, so care is needed with scaling and 'trace' calculations. This is an especially important consideration for fixed-point arithmetic.

As usual, if you really want to know which method is fastest for your particular application on your particular processor, the best advice is to try them both.