Hi Pauli
2009/7/9 Pauli Virtanen <pav+sp@iki.fi>:
> Unfortunately, improving the performance using the above scheme
> comes at the cost of some slightly murky heuristics. I didn't
> manage to come up with an optimal decision rule, so they are
> partly empirical. There is one parameter tuning the cross-over
> between minimizing stride and avoiding small dimensions. (This is
> more or less straightforward.) Another empirical decision is
> required in choosing whether to use the usual reduction loop,
> which is better in some cases, or the blocked loop. How to make
> this latter choice is not so clear to me.
I know very little about cache optimality, so excuse the triviality of
this question: Is it possible to design this loop optimally (taking
into account certain parameters that can be measured at build time),
or is it the kind of thing that can only be discovered by empirical
tuning at compile time?
ATNumPy... scary :-)
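
To make the question concrete, here is a rough sketch of what I mean by
the first option. It is only an illustration, not what your patch does:
cache_bytes stands in for a hypothetical parameter measured at build
time, and choose_block and blocked_sum_axis0 are names I made up. The
rule for the block width is an assumption too (one block of input plus
its accumulators should fit in cache), and pure Python will not show
any real cache effect; the point is just the shape of the decision.

```python
def choose_block(cache_bytes, itemsize, n_rows):
    """Hypothetical rule: pick a column-block width so that one block of
    input (n_rows * width items) plus its width accumulators fits in
    cache_bytes. Always returns at least 1."""
    width = cache_bytes // (itemsize * (n_rows + 1))
    return max(1, width)


def blocked_sum_axis0(data, n_rows, n_cols, block):
    """Sum a row-major n_rows x n_cols buffer over axis 0, visiting the
    columns one block at a time so each block's accumulators are reused
    across all rows before moving on."""
    out = [0.0] * n_cols
    for start in range(0, n_cols, block):
        stop = min(start + block, n_cols)
        for i in range(n_rows):
            base = i * n_cols  # offset of row i in the flat buffer
            for j in range(start, stop):
                out[j] += data[base + j]
    return out


if __name__ == "__main__":
    n_rows, n_cols = 3, 4
    data = [float(k) for k in range(n_rows * n_cols)]
    # 32 KiB is an assumed cache size, 8 bytes an assumed item size.
    block = choose_block(32 * 1024, 8, n_rows)
    print(block, blocked_sum_axis0(data, n_rows, n_cols, block))
```

The second option would replace choose_block with a loop that times a
few candidate block widths and keeps the fastest one.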
Cheers
Stéfan