For AMD-based systems, get ACML and gcc, PGI, or PathScale.
For Intel-based systems, get MKL and the Intel compiler.
Run a problem size N that uses around 90% of memory, i.e. about a 1.8 GB per-core memory footprint.
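
A minimal sketch of that sizing rule, assuming the usual HPL accounting of one N x N matrix of 8-byte doubles and rounding N down to a multiple of NB. The function name and the 8 GiB example are illustrative, not something from the runs described here:

```python
import math


def hpl_problem_size(total_mem_bytes, fraction=0.9, nb=192):
    """Return the largest N (a multiple of nb) such that an N x N matrix
    of 8-byte doubles fits in `fraction` of the given memory."""
    usable_bytes = int(total_mem_bytes * fraction)
    # N*N*8 <= usable_bytes  =>  N <= sqrt(usable_bytes / 8)
    max_n = math.isqrt(usable_bytes // 8)
    # Round down to a multiple of the blocking factor NB.
    return (max_n // nb) * nb


if __name__ == "__main__":
    # Hypothetical node: 4 cores with 2 GiB each (8 GiB total).
    print(hpl_problem_size(8 * 2**30, fraction=0.9, nb=192))
```

Pass the total memory of all nodes taking part in the run, since the matrix is distributed across them.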
Run NB = 192 on AMD. I don't know the best blocking factor for MKL; I've tried
the same 192 and it does fairly well.
Set affinity for the MPI processes, even for single-socket runs.
Run PxQ as 2x2, 2x4, 4x4, ... depending on the number of cores.
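
To pick PxQ for an arbitrary core count, a small sketch like the following lists the possible grids and takes the most nearly square one with P <= Q. That is a common rule of thumb and matches the 2x2, 2x4, 4x4 progression above, but it is an assumption, not a tuned result. The function names are illustrative:

```python
import math


def process_grids(num_procs):
    """List every (P, Q) with P * Q == num_procs and P <= Q,
    ordered from flattest (1 x N) to most nearly square."""
    grids = []
    for p in range(1, math.isqrt(num_procs) + 1):
        if num_procs % p == 0:
            grids.append((p, num_procs // p))
    return grids


def squarest_grid(num_procs):
    """Return the most nearly square (P, Q) grid with P <= Q."""
    return process_grids(num_procs)[-1]


if __name__ == "__main__":
    for cores in (4, 8, 16):
        p, q = squarest_grid(cores)
        print(f"{cores} cores: P x Q = {p} x {q}")
```

It is still worth timing the neighbouring grids from process_grids(), since the best shape can depend on the interconnect.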
With the above you should get at least 77% efficiency on both AMD and Intel.
As suggested by Tom, the Goto library will give you good performance as well.
You can also try the multithreaded version: use PxQ=1x1 and
OMP_NUM_THREADS=4 for a single-socket quad-core.
Reduce TLB misses with huge pages.
If you get below 75% efficiency, you are doing something wrong.
If you get more than 85% on a quad-core, please let me know :)
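
For reference, efficiency here means measured HPL Gflops divided by theoretical peak. A rough sketch, assuming peak = cores x clock (GHz) x double-precision flops per cycle per core. The flops-per-cycle figure depends on the CPU and is something you must supply; the value 4 used in the example is an assumption for the Phenom 9600 result quoted in the original message below:

```python
def hpl_efficiency(rmax_gflops, cores, ghz, flops_per_cycle):
    """Return measured HPL performance as a fraction of theoretical peak.

    flops_per_cycle is the double-precision flops each core can retire
    per clock cycle; it is CPU-specific and must be supplied by the caller.
    """
    rpeak_gflops = cores * ghz * flops_per_cycle
    return rmax_gflops / rpeak_gflops


if __name__ == "__main__":
    # Phenom 9600 figure from the quoted message: 11.9 Gflops, 4 cores, 2.3 GHz.
    # flops_per_cycle=4 is an assumed value, not taken from the message.
    eff = hpl_efficiency(11.9, cores=4, ghz=2.3, flops_per_cycle=4)
    print(f"efficiency: {eff:.1%}")
```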
Regards,
Joshua
------ Original Message ------
Received: Wed, 02 Apr 2008 12:33:25 PM PDT
From: Ellis Wilson <xclski at yahoo.com>
To: beowulf at beowulf.org
Subject: Re: [Beowulf] HPL Benchmarking and Optimization
> Ellis Wilson wrote:
> > Currently I get these kinds of numbers from tested computers using the
> > same environment (Gentoo, Fortran in GCC, HPL, all same compilation
> > options):
> > 1 x Core2Duo (2.1 GHz/core, 2 GB RAM) - 2.3 Gflops
> > 1 x Athlon 64 3500+ (2.2 GHz, 1 GB RAM) - 1.0 Gflops
> > 4 x Core2Duo (2.1 GHz/core for a total of 8 cores, 2 GB RAM/node,
> > 100 Mbit Ethernet interconnect) - 6.7 Gflops
>
> Sorry to double post, all; however, I realized my issue involved running
> HPL on the reference BLAS library, which is generic for every
> architecture, and I didn't want to waste anyone's time. Giving Portage the
> benefit of the doubt, I had failed to check that its dependencies were
> best for HPL. Following an install of ATLAS and relinking to its
> libraries, I've gotten the following numbers:
> 1 x Athlon 64 3500+ (2.2 GHz, 1 GB RAM) - 3.6 Gflops
> 1 x Phenom 9600 quad-core (2.3 GHz/core, 2 GB RAM) - 11.9 Gflops
>
> I'll likely try MKL soon for the Intel processors I'm interested in.
>
> The Phenom 9600 had previously only gotten 4.5 Gflops, and when I tested
> it the second time I simply used the same environment I had compiled for
> the Athlon 64. Certainly compiling ATLAS natively on the Phenom will
> increase the result, hopefully about 350%, as with the Athlon 64 (though
> I suspect things will be interesting due to bandwidth, etc. for quad-cores).
>
> Anyway, not to end the thread, I am still wondering:
>
> Do those of you who have professional installations, or even simply large
> setups, and who are unsure of the exact code that will be run on your
> cluster, use compilation options such as -O3, -funroll-loops,
> -fomit-frame-pointer, etc.?
>
> Thanks,
>
> Ellis