1) Does vDSP_fft_zrip() only access the data within fftSetup (or the data pointed to by it) in a "read-only" fashion? Or are there perhaps some temporary buffers (scratch space) within fftSetup that are written to by vDSP_fft_zrip() in performing its FFT computations?

2) If data like that in fftSetup is being accessed in a "read-only" fashion, is it okay for multiple processes/threads/tasks/blocks to access it simultaneously? (I am thinking of the case where it is possible for more than one process to open the same file for reading, though not necessarily for writing or appending. Is this analogy appropriate?)

On a related note, just how much memory is being taken up by the FFTSetup data structure? Is there any way to find out? (It is an opaque data type.)

Then OP should log a documentation bug, because the intended use isn't clear.
–
Rhythmic Fistman, Jul 13 '12 at 15:34

Yes, please, file bugs for any documentation that is not clear. There is some bureaucracy between the software engineers and the documentation writers. In the case of the new vDSP_DFT* routines, we have put some documentation directly in the header file (/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Headers/vDSP.h), and it is explicit about using the DFT routines with multithreading. However, the DFT routines currently support only lengths of 3*2^n and 5*2^n, so they do not replace the FFT routines yet.
–
Eric Postpischil, Jul 13 '12 at 15:51
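
To make the length restriction mentioned in the comment above concrete, here is a minimal C sketch that checks whether a length has the form 3*2^n or 5*2^n. The helper name `is_3_or_5_times_pow2` is hypothetical and is not part of vDSP. Accepting any n >= 0 is also an assumption: the real routines may impose a minimum n, and later versions of the framework may accept other lengths.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a vDSP function): returns true when len can be
 * written as 3 * 2^n or 5 * 2^n for some n >= 0. */
bool is_3_or_5_times_pow2(size_t len)
{
    if (len == 0) {
        return false;
    }
    /* Strip every factor of two; what remains is the odd part of len. */
    while (len % 2 == 0) {
        len /= 2;
    }
    return len == 3 || len == 5;
}
```

Under this check a length such as 48 (3*2^4) or 80 (5*2^4) passes, while a plain power of two such as 1024 does not, which is why the comment says these routines do not replace the FFT routines yet.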

Once prepared, the setup structure can be used repeatedly by FFT
functions (which read the data in the structure and do not alter it)
for any (power of two) length up to that specified when you created
the structure.

so

conceptually, vDSP_fft_zrip should not need to modify the weight array and so it would appear to be one of the FFT functions that do not alter the FFTSetup (I haven't seen any that do apart from create/destroy), however there are no guarantees on what the actual implementation does - it could do anything.

If vDSP_fft_zrip truly accesses its FFTSetup in a read-only fashion, then it's fine to do that from multiple threads.

As for memory usage, the FFT weight array is e^{i*k*2*M_PI/N} for k = [0..N-1], which are N complex float values, so that would be 2*N*sizeof(float) bytes.

But those complex exponentials are very symmetric so who knows, under the hood the implementation could require less memory. Or more!

In your case, N = 2^16, so it wouldn't be strange to see up to 256k being used.
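
The back-of-the-envelope arithmetic above can be written down as a short C sketch. It only computes the size of a hypothetical table of complex float weights. It says nothing about the real FFTSetup, which is opaque and whose layout is implementation-defined. The function name `complex_float_bytes` is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed to store `count` complex values, each
 * held as two floats (real part and imaginary part). */
size_t complex_float_bytes(size_t count)
{
    return count * 2 * sizeof(float);
}
```

With a 4-byte float, `complex_float_bytes(65536)` is 524,288 bytes (512 KiB) for N = 2^16 complex weights, and `complex_float_bytes(32768)` is 262,144 bytes (256 KiB) if only N/2 weights were stored, as a packed real transform of N samples might need. Both figures are estimates under those assumptions, not measurements.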

Where does that leave you? I think it seems reasonable that the FFTSetup be accessible from multiple threads, but it appears to be undocumented. You could be lucky. Or unlucky and unpleasantly surprised now or in a future version of the framework.

vDSP_fft_zrip does not use a weight array as you describe. There are several arrays, used for different portions of the FFT. Some of the arrays supply the weights in orders convenient to the implementation, and some may supply indices used in the bit-reversal permutation. The setup may differ from Intel to ARM, even from Intel 32-bit to Intel 64-bit, and Apple may change it from time to time. You cannot expect any particular amount of space to be used. (I wrote the code.)
–
Eric Postpischil, Jul 13 '12 at 14:43

I wouldn't attempt any explicit concurrency with the vDSP functions, or any other function in the Accelerate framework (of which vDSP is a part) for that matter. Why? Because Accelerate is already designed to take advantage of multiple cores, as well as specific nuances of a given processor implementation, on your behalf - see http://developer.apple.com/library/mac/#DOCUMENTATION/Darwin/Reference/ManPages/man7/vecLib.7.html. You may end up essentially re-parallelizing already parallel computations that are internal to the implementation (if not now, then possibly in a later version). The best approach to the Accelerate framework is generally to assume that it's more clever than you are and just use it in the simplest way possible, then do your performance measurements. If those measurements reflect a level of performance that is somehow insufficient for your needs, then try your own optimizations (and/or file a bug report against the Accelerate framework at http://bugreport.apple.com since the authors of that framework are always interested in knowing where or if their efforts somehow fell short of developer requirements).

The vImage portion of Accelerate distributes work to multiple processors, given jobs of sufficient size, unless asked not to. However, vDSP_fft_zrip is in the vecLib portion of Accelerate and operates single-threaded.
–
Eric Postpischil, Jul 13 '12 at 14:57

Sure, which is why I made the proviso "if not now, then possibly in a later version" - it would be better over the long term for developers to use these APIs naively and let Apple's own engineers find the hot spots and apply optimizations (whether they be multi-core or "other") as needed than have later implementations fight with the developer's own attempts at optimizing performance.
–
jkh, Jul 15 '12 at 15:38

Writing code to call the current single-threaded routines and hoping they become multi-threaded later is not a useful technique, especially for anybody who wants their software to be excellent now. If an application would benefit from multi-threading with vDSP routines, I suggest implementing multi-threading in the application and setting the VECLIB_MAXIMUM_THREADS environment variable to 1 to prevent vDSP from interfering with multi-threading in the future.
–
Eric Postpischil, Jul 18 '12 at 14:18
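
One way to follow the suggestion in the comment above from inside the program is to set the variable with the POSIX `setenv` call before any Accelerate routine is used. This is only a sketch: the helper name `limit_veclib_threads` is hypothetical, and it assumes the library reads VECLIB_MAXIMUM_THREADS no earlier than its first use. If that assumption does not hold, setting the variable in the environment that launches the process is the more conservative choice.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: ask vecLib to stay single-threaded so that the
 * application's own threading is the only threading in play.
 * Returns 0 on success and -1 on failure, as setenv does. */
int limit_veclib_threads(void)
{
    /* The final argument 1 overwrites any value already present. */
    return setenv("VECLIB_MAXIMUM_THREADS", "1", 1);
}
```

Call `limit_veclib_threads()` early in `main`, before creating worker threads or calling any vDSP function, and check its return value.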