I think both issues are due to the same fact: the device (GPU) cannot allocate its own memory at run time.

When you try to call a subroutine with an array slice, the compiler (most likely) wants to make a temporary copy of the slice and pass a reference to that copy to the subroutine. Since the device can't allocate that temporary, you get an error.[1]
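
For example, a device-side call like this (the names are made up) will trip it:

Code:

! inside a kernel: passing the row slice A(i,:) would require a temporary copy
call dowork(A(i,:))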

The second case hits the same issue: automatic arrays are allocated on entry to the subprogram, and the GPU can't do that either. In my own code, I often do exactly what you're hesitant to do, which is declare local, per-thread arrays at compile time with some maximum fixed size I know a priori (the number of levels in the system, say, which we know roughly), set with the preprocessor in my Makefile. I'm lucky that my "M" is fairly small (O(100)), so I can afford those arrays; if your "M" is big, it might not work. But if you can do it, and you aren't using much shared memory, you can also tell the compiler to prefer L1 cache (make it 48 KB), which will increase your chances of getting a hit on L1-cached local memory.
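
A minimal sketch of that fixed-size trick (perthread and MAXM are made up; I'd put something like -DMAXM=128 in my Makefile):

Code:

! compiled with -DMAXM=128, or whatever bound you know a priori
attributes(device) subroutine perthread(x, m)
  implicit none
  integer, value :: m       ! size actually used; we know m <= MAXM
  real, intent(inout) :: x
  real :: tmp(MAXM)         ! fixed-size per-thread local array: no device-side allocation
  integer :: i
  do i = 1, m
     tmp(i) = real(i) * x
  end do
  x = tmp(m)
end subroutine perthread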

But if you don't want to do that, or can't because of the size of M, the only other thought I have is to pass in a reference to all of A along with the thread number:

Code:

call mysub(A, thread)

and then, inside that subroutine, do all your work on A(thread,:). It's not ideal, but try it out; you might find there isn't much of a performance hit at all.
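
Here's a sketch of what that subroutine might look like (names made up; I pass the dimensions explicitly rather than trusting assumed-shape arguments in device code, so the call becomes call mysub(A, n, m, thread)):

Code:

attributes(device) subroutine mysub(A, n, m, thread)
  implicit none
  integer, value :: n, m, thread
  real :: A(n, m)            ! the whole array comes in; no temporary is made
  integer :: j
  do j = 1, m
     A(thread, j) = 2.0 * A(thread, j)   ! each thread touches only its own row
  end do
end subroutine mysub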

Matt

[1] Note: I think this is also why you can't do math in subroutine call arguments:

Code:

call mysub(2*A, ...)

since the compiler would try to make a temporary array B = 2*A and pass that instead.
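
To tell the compiler to prefer L1 (per my remark above), you can set the cache configuration either device-wide or per-kernel. Roughly (a sketch from memory, so check the CUDA Fortran reference for the exact interfaces):

Code:

istat = cudaDeviceSetCacheConfig(cacheconfig)       ! device-wide preference
istat = cudaFuncSetCacheConfig(func, cacheconfig)   ! per-kernel preference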

where "cacheconfig" is cudaFuncCachePreferNone, cudaFuncCachePreferShared, or cudaFuncCachePreferL1 (pretty self-explanatory). For the FuncSet, func is usually decorated by the module name, so subroutine dothis in module mymodule, would (I think) be mymodule_dothis.

Obviously, the device-wide set is probably not the best idea if most of your code uses close to 48 KB of shared memory, but the per-function version might still help.

But benchmark, as always: some codes might respond, some might not. Per the spec, this is only the configuration the user would "prefer"; I think the CUDA runtime can always choose its own configuration when it determines it is better. Or when it's Tuesday. I'm not sure I've ever seen how it decides this (probably it looks at how many registers spilled into local memory, &c.).

Oh, and if PGI chimes in here saying differently, believe them more than me!

Matt is correct that the problem is the lack of dynamic allocation from device code. That may change with CUDA 5 and the Kepler K20 GPUs, but for now we're stuck.

Though, another thing to try is automatic arrays in shared memory. The third argument in the kernel launch chevron is the number of bytes to allocate dynamically in shared memory, and the compiler can map that dynamic shared memory onto automatic arrays in device code. The glitch is that the automatic arrays are shared by all the threads in the block, and the amount of shared memory is relatively small.
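
A sketch of what I mean (mykernel and the launch parameters are made up; the kernel lives in a module, and I assume blockdim%x >= m and 4-byte reals for the byte count):

Code:

! device code, inside a module
attributes(global) subroutine mykernel(A, n, m)
  implicit none
  integer, value :: n, m
  real :: A(n, m)
  real, shared :: tmp(m)    ! automatic shared array: its storage comes from the
                            ! dynamic shared memory given in the chevron, and all
                            ! threads in the block share this one copy
  integer :: j
  j = threadidx%x
  if (j <= m) tmp(j) = A(blockidx%x, j)        ! one block works on one row
  call syncthreads()
  if (j <= m) A(blockidx%x, j) = 2.0 * tmp(j)
end subroutine mykernel

! host code: the third chevron argument is the byte count for tmp
call mykernel<<<n, 128, 4*m>>>(A_d, n, m)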