On Fri, 23 Nov 2001, Don Holmgren wrote:
> At the very bottom of the page,
>   http://qcdhome.fnal.gov/sse/
> I have a table with cycle counts posted for a number of matrix-matrix
> and matrix-vector routines as measured on a P-III (Coppermine), P4, and
> an Athlon MP. Times are posted for both a pure-C version of each
> routine, built with gcc, as well as for an SSE version. The sources
> for each are available at
>   http://qcdhome.fnal.gov/sse/catalog.html
>
> The results are a mixed bag, with each flavor processor sometimes first,
> second, or third. I'm using only a small subset of SSE - mostly shufps,
> addps, mulps, with a few xorps, movaps, and movups thrown in. I haven't
> timed individual instructions on all three processors.
>
> Don Holmgren
> Fermilab
Awesomely useful, Don, thanks.
Do you have any idea what the overall marginal benefit is of using your
hand-optimized routines when working on large datasets (too big to fit
into cache)? In particular, does performance devolve to
memory-bandwidth-bound behavior (and hence end up being the same for
MILC and SSE and dominated by the memory bus speed)?
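
To make the question concrete, here is a rough sketch of the bound I have in
mind: the cost of one call is taken to be whichever is slower, the arithmetic
or the memory traffic. Every number below (cycle counts, bytes moved per call,
bytes delivered per cycle) is a made-up placeholder, not one of Don's
measurements, and the function names are hypothetical.

```python
def cycles_per_call(compute_cycles, bytes_moved, bytes_per_cycle):
    """Crude lower bound: the slower of compute time and memory-traffic time."""
    memory_cycles = bytes_moved / bytes_per_cycle
    return max(compute_cycles, memory_cycles)


def sse_speedup(c_cycles, sse_cycles, bytes_moved, bytes_per_cycle):
    """Ratio of the C bound to the SSE bound for the same memory traffic."""
    c_bound = cycles_per_call(c_cycles, bytes_moved, bytes_per_cycle)
    sse_bound = cycles_per_call(sse_cycles, bytes_moved, bytes_per_cycle)
    return c_bound / sse_bound


# Placeholder inputs (assumptions, not measured values):
# a routine taking 200 cycles in C and 100 in SSE, moving 120 bytes per call.
in_cache = sse_speedup(200, 100, 120, bytes_per_cycle=16.0)     # fast cache
out_of_cache = sse_speedup(200, 100, 120, bytes_per_cycle=0.5)  # slow memory bus

print(in_cache)      # 2.0: compute-bound, the SSE advantage survives
print(out_of_cache)  # 1.0: memory-bound, both versions wait on the bus
```

In this toy model the SSE advantage vanishes once bytes_moved / bytes_per_cycle
exceeds the C cycle count; whether real lattice sizes fall on that side of the
line is exactly what I am asking.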
rgb
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu