Re: SSE Level 3 drop in gemm

Camm,
>> Neither the timer nor tester uses ATL_Cachelen; the tester doesn't measure
>> performance, and the kernels *should* work with any alignment legal for the
>> type, so it seems like this is the right thing to do to me. The lack of using
>
>Well, in the case at hand, certain significantly performance enhancing
>asm instructions will segfault if the data isn't aligned to 16 bytes.
>Is this an illegal kernel? These instructions are only used on a and
>b, which seem to be internal to atlas, and therefore aligned however
>we would wish, no?
Ah, a tough question. At present, it is not quite legal with the existing
code, which has a case where A and B are sometimes not copied. That case is
almost never invoked, since it requires that the data already be in the correct
format (for instance, for A not to be copied, it needs to be in transpose
format, and have K == NB && lda == NB). We can add the requirement that A have
the correct alignment as well, so that this kernel would become legal. However, this
requires that we make certain guarantees about how the stuff is copied
(i.e., that assuming the NB is a multiple of the blocking factor, each
block, or partial block, remains on this alignment), which always makes
me a little nervous. However, my guess is that SSE will not be the only
specialized kernel in the world that benefits from known alignment, so I'm
inclined to put in the change . . . Anybody want to argue for/against this?
Regardless, since C is not copied, it *must* support any legal alignment
(as you noted) . . .
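
To make the proposed change concrete, here is a rough sketch of what the
no-copy test for A might look like with the alignment condition added. This is
not the actual ATLAS code: the names can_skip_copy_A, is_aligned, and
KERNEL_ALIGN, and the NB value, are all made up for illustration, and the
16-byte figure is simply the SSE requirement mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration; not ATLAS's real configuration. */
enum { NB = 40, KERNEL_ALIGN = 16 };

/* Returns 1 if ptr lies on an align-byte boundary (align must be a power of two). */
int is_aligned(const void *ptr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)ptr & (align - 1)) == 0;
}

/*
 * Hypothetical predicate: may A be passed to the kernel without copying?
 * The first three conditions are the ones described in the text (transpose
 * format, K == NB, lda == NB); the alignment check is the proposed addition.
 */
int can_skip_copy_A(const float *A, int is_transposed, int K, int lda)
{
    return is_transposed && K == NB && lda == NB
        && is_aligned(A, KERNEL_ALIGN);
}
```

With a check like this, an A that is otherwise in the right format but starts
one float past a 16-byte boundary would fall back to the copy path, so the
kernel would never see misaligned A.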
>> Now, as far as ATLAS is concerned, it only aligns the *mallocs* to
>> ATL_Cachelen; multiple blocks are stored contiguously, which means that
>> you'll do best if NB*sizeof(TYPE) is a multiple of your ATL_Cachelen . . .
>>
>OK, so if I demand that mb,nb,kb be a multiple of 4, and set
>ATL_Cachelen appropriately, then I can be assured that the A and B
>passed to the kernel will be 16 byte aligned, as will each
>column/row. Is this right? It seems to work that way anyway.
Yep, that should work, but of course your kernel will not be usable for
cleanup except in cases where the remainder is evenly divisible by 4;
it's not a big deal, but just so you know . . .
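
The arithmetic behind this can be checked with a small sketch. It assumes
single precision with sizeof(float) == 4, a base pointer already on a 16-byte
boundary (from the ATL_Cachelen-aligned malloc), and blocks stored
contiguously as nb columns of kb floats each. The function names are made up
for illustration and are not part of ATLAS.

```c
#include <assert.h>
#include <stddef.h>

enum { SSE_ALIGN = 16 };  /* alignment the SSE kernel is assumed to need */

/* Byte offset, from the aligned base, of column j of block b, where blocks
 * are contiguous and each holds nb columns of kb floats. */
size_t column_offset_bytes(size_t b, size_t j, size_t nb, size_t kb)
{
    return (b * nb * kb + j * kb) * sizeof(float);
}

/* Returns 1 if every column of the first nblocks blocks starts at a
 * multiple of SSE_ALIGN bytes from the base, 0 otherwise. */
int all_columns_aligned(size_t nblocks, size_t nb, size_t kb)
{
    for (size_t b = 0; b < nblocks; b++)
        for (size_t j = 0; j < nb; j++)
            if (column_offset_bytes(b, j, nb, kb) % SSE_ALIGN != 0)
                return 0;
    return 1;
}

/* Returns 1 if the part of dim left over after full blocks of size nb is a
 * multiple of 4, i.e. a multiple-of-4 kernel could also handle the cleanup. */
int kernel_handles_remainder(size_t dim, size_t nb)
{
    return (dim % nb) % 4 == 0;
}
```

For example, all_columns_aligned(3, 40, 40) holds, while kb = 38 fails at the
second column (152 bytes from the base). Likewise a dimension of 100 with
nb = 40 leaves a remainder of 20, which the kernel could handle, whereas 103
leaves 23, which it could not.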
Cheers,
Clint