On Jul 18, 2013, at 10:46 PM, Michael McNeil Forbes
<michael.forbes+python(a)gmail.com> wrote:
> What is the recommended way of preparing ElementwiseKernel instances for
> repeated calling on the same GPU arrays for performance? Here is one
> attempt... is this headed in the right direction, or is there a
> better/safer way?

I'm not in principle opposed to including such a thing. But I do have
one question: Have you measured that this is really a
performance-limiting issue for you?
Andreas

Okay, my bad. I was only looping 40 times, so the initial invocation was eating all
the time. Iterating 4000 times through the loop gives much more reasonable per-call
times -- @memoize_method is indeed working.
That being said, having a way to explicitly prepare the function before using it could
still be helpful. One use case is to facilitate profiling loops... :-)
Sorry for the red herring.
Michael.
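The warm-up idiom Michael alludes to is simply to invoke the kernel once outside the timed region, so the one-time build cost does not pollute the per-call numbers. A generic sketch of the pattern (FakeKernel is a hypothetical stand-in for an ElementwiseKernel, not PyCUDA API):

```python
import time

class FakeKernel:
    """Stand-in for an ElementwiseKernel: the first call is expensive."""
    def __init__(self):
        self._built = False

    def __call__(self, x):
        if not self._built:
            time.sleep(0.05)  # simulate one-time kernel compilation
            self._built = True
        return x * 2

kernel = FakeKernel()
kernel(1.0)  # warm-up call: pay the build cost before the timing loop

t0 = time.perf_counter()
for _ in range(4000):
    kernel(1.0)
per_call = (time.perf_counter() - t0) / 4000
# per_call now reflects steady-state cost, not compilation
```

Without the warm-up call, the 0.05 s build would be averaged into the loop and dominate it at small iteration counts, which is exactly the 40-vs-4000 effect described above.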
On Jul 19, 2013, at 11:50 AM, Andreas Kloeckner <lists(a)coyote.tiker.net> wrote:

Michael McNeil Forbes <michael.forbes+python(a)gmail.com> writes:
> Here is the profile of the slow __call__. All the time is spent in
> generate_stride_kernel_and_types:
>
> Line #   Hits    Time  Per Hit  % Time  Line Contents
> ==============================================================
>    192                                  def __call__(self, *args, **kwargs):
>    193     78     145      1.9     0.1      vectors = []
> ...
>    204     78     104      1.3     0.1      func, arguments = self.generate_stride_kernel_and_types(
>    205     78  199968   2563.7    97.3          range_ is not None or slice_ is not None)
>    206
>    207    156     354      2.3     0.2      for arg, arg_descr in zip(args, arguments):
> ...
>    241
>    242     78    2780     35.6     1.4      func.prepared_async_call(grid, block, stream, *invocation_args)
Now this is just confusing to me. generate_stride_kernel_and_types has a
@memoize_method decorator, which should take care of caching the built
kernel. Unless you're instantiating a new ElementwiseKernel for each
call, generate_stride_kernel_and_types should only ever get called
once. The default (cached) case should amount to one dictionary lookup,
so I'm confused as to how that would eat up so much time. Can you
perhaps create a small reproducer for this?
Thanks,
Andreas
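For context, the @memoize_method decorator discussed above (PyCUDA takes it from pytools) amounts to per-instance result caching along these lines. This is a simplified sketch of the idea, not the actual pytools implementation, which handles argument hashing more robustly:

```python
import functools

def memoize_method(method):
    """Cache a method's return value per instance and per argument tuple."""
    cache_name = "_memoize_cache_" + method.__name__

    @functools.wraps(method)
    def wrapper(self, *args):
        cache = getattr(self, cache_name, None)
        if cache is None:
            cache = {}
            setattr(self, cache_name, cache)
        if args not in cache:
            cache[args] = method(self, *args)  # expensive build, done once
        return cache[args]  # later calls: one dictionary lookup
    return wrapper

class Kernel:
    build_count = 0  # counts how often the "build" actually runs

    @memoize_method
    def generate(self, use_range):
        Kernel.build_count += 1  # stands in for the expensive kernel build
        return "compiled-%s" % use_range

k = Kernel()
for _ in range(4000):
    k.generate(False)
print(Kernel.build_count)  # -> 1: built once despite 4000 calls
```

This is why, once the warm-up call is excluded, the 4000-iteration loop shows a small per-call time: every call after the first reduces to the dictionary lookup in wrapper. Note that the cache lives on the instance, so constructing a new Kernel (or ElementwiseKernel) per call would rebuild every time.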