In Part I of this 3-part series we created a simple FIR filter. In this part we will look at some ways of improving the implementation using some functions from the Accelerate Framework.

The Accelerate Framework

The accelerate framework provides a C API with functions for linear algebra, matrix math, DSP, and image processing. In this case we will specifically be focusing on the vDSP functions. The functions provide vectorized implementations of common functions using specialized instruction sets (SSE, NEON, etc) or DSP hardware on your processor.

Improving The Filter

The basis for our filter is the convolution function. This is an extremely slow function, it runs in quadratic time (O(N²)). Looking at the nested loops doing multiplication and accumulation should give you some intuition for why this is the case. This is where we are going to focus our attention. Fortunately, the accelerate framework provides a convolution function, using vectorized multiply-accumulate instructions. The function we are going to use is vDSP_conv (or its double-precision equivalent vDSP_convD).

The prototype for vDSP_conv is as follows:

1

2

3

4

5

6

7

8

9

10

voidvDSP_conv(

constfloat__vDSP_signal[],

vDSP_Stride __vDSP_signalStride,

constfloat__vDSP_filter[],

vDSP_Stride __vDSP_strideFilter,

float__vDSP_result[],

vDSP_Stride __vDSP_strideResult,

vDSP_Length __vDSP_lenResult,

vDSP_Length __vDSP_lenFilter

);

This matches up fairly closely to our convolution function prototype, with the addition of the Stride arguments, which is used to set the direction and step size when looping through the input and filter arrays. There are a few things we need to look out for when using this function in our filter code. Per the vDSP_conv documentation, The input signal length must be at least (length of filter + length of result – 1)[1]. We need to set __vDSP_strideFilter to -1 [2], as well as pointing __vDSP_filter at the end of our filter array in order to perform convolution [3].

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

// Pointer to end of filter for use with vDSP_conv [3]

float*h_end=h+(h_length-1);

// Length of signal passed to vDSP_conv [1]

unsignedsignal_length=(h_length+result_length);

// Create an array to store the signal passed to vDSP_conv, padded with zeros[1]

floatpadded[signal_length];

// fill padded buffer with zeros

unsignedi;

for(i=0;i<signal_length;++i)

{

padded[i]=0.0;

}

// Copy input into padded buffer

for(i=0;i<x_length;++i)

{

padded[i]=x[1];

}

// use the Accelerate convolution function [2]

vDSP_conv(padded,1,h_end,-1,dest,1,result_length,h_length);

We can eliminate some more of the loops in this code using some more vDSP functions as well a BLAS function from the Accelerate Framework. The first loop which fills our padded buffer with zeros can be replaced with vDSP_vfill. Here is the code we can use to replace the first loop:

1

2

3

4

5

6

// vDSP_vfill takes a pointer to the value used to fill the destination array. Create a variable

// holding 0.0 which we can point to

floatzero=0.0;

// Fill the padded buffer with zeros

vDSP_vfill(&zero,padded,1,signal_length);

The second loop can be replaced with a call to cblas_scopy, which copies a vector from one location to another:

1

2

// use cblas_scopy to copy the input vector into our padded vector.

cblas_scopy(x_length,x,1,padded,1);

Here is our updated FIR filter using our vectorized convolution code. I’ve also replaced a loop which adds two vectors with vDSP_vadd, which is a fairly straightforward replacement:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

#include <Accelerate/Accelerate.h>

/* filter signal x with filter h and store the result in output.

* output must be at least as long as x

*/

void

fir_filter(constfloat*x,

unsignedx_length,

constfloat*h,

unsignedh_length,

float*output)

{

// Create buffer to store overflow across calls

staticfloatoverflow[h_length-1]={0.0};

// The length of the result from linear convolution is one less than the

// sum of the lengths of the two inputs.

unsignedresult_length=x_length+h_length-1;

unsignedoverlap_length=result_length-x_length;

// Create a temporary buffer to store the entire convolution result

floattemp_buffer[result_length];

// Pointer to end of filter for use with vDSP_conv

float*h_end=h+(h_length-1);

// Length of signal passed to vDSP_conv

unsignedsignal_length=(h_length+result_length);

// Create an array to store the signal passed to vDSP_conv, padded with zeros

floatpadded[signal_length];

// fill padded buffer with zeros

floatzero=0.0;

vDSP_vfill(&zero,padded,1,signal_length);

// Copy input into padded buffer

cblas_scopy(x_length,x,1,padded,1)

// use the Accelerate convolution function

vDSP_conv(padded,1,h_end,-1,temp_buffer,1,result_length,h_length);

// Add the overlap from the previous run

// use vDSP_vadd instead of loop

vDSP_vadd(temp_buffer,overflow,buffer,overlap_length);

// Copy overlap into overlap buffer

// use BLAS copy instead of loop

cblas_scopy(overlap_length,temp_buffer+x_length,1,overflow,1);

// write the final result to the output. use BLAS copy instead of loop

cblas_scopy(x_length,temp_buffer,1,output,1);

}

In Part III, I will walk through another improvement which leverages the fast fourier transform to take our convolution from O(N²) to O(N log N).