From jkrauska at cisco.com Sun Jul 3 00:47:23 2005
From: jkrauska at cisco.com (Joel Krauska)
Date: Sat, 02 Jul 2005 21:47:23 -0700
Subject: [Beowulf] Application Profiling Request
Message-ID: <42C76DDB.40101@cisco.com>
Hello.
I work in a HPC cluster application profiling lab at Cisco.
We have a 64-node dual-opteron cluster running beowulf, and are growing
the cluster to 128 nodes in the coming months.
We've spent plenty of time running the stock benchmarks like linpack and
pallas and hpcc, but we're looking for some more /actual/ sample
applications that get run in the real world.
We would like to have your application with real workloads running in
our lab. In exchange we can provide you performance results (privately
if you desire) on traffic patterns and how your application scales on a
variety of Cisco GigE, 10GigE (including some using RNIC technologies),
and now IB switches.
If instead of a special app, you've got a dataset/workload that you run
on any of the commonly available apps, we'd be happy to run that too.
If you're interested in participating, please drop me a line.
Thanks,
Joel Krauska
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Sun Jul 3 23:12:10 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Sun, 3 Jul 2005 20:12:10 -0700
Subject: [Beowulf] Shared memory
In-Reply-To: <77673C9ECE12AB4791B5AC0A7BF40C8F1544B8@exchange02.fed.cclrc.ac.uk>
References: <77673C9ECE12AB4791B5AC0A7BF40C8F1544B8@exchange02.fed.cclrc.ac.uk>
Message-ID: <20050704031210.GA19958@greglaptop.hsd1.ca.comcast.net>
On Mon, Jun 27, 2005 at 12:25:13PM +0100, Kozin, I (Igor) wrote:
> I think MPI/OpenMP has its niche.
I think it's a tiny one. Modern interconnects like InfiniPath are
getting to such low latencies that the spinlocks needed for a fully
threaded MPI are very expensive. And a single thread can't necessarily
max out the interconnect performance.
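(As a toy illustration of that locking cost -- not MPI internals, just the
price of an uncontended spinlock around every operation, which is roughly
what a fully thread-safe MPI pays per call; a hypothetical microbenchmark
using POSIX spinlocks:)

/* toy_spinlock_cost.c -- toy illustration only, not MPI code.
 * Build: gcc -O2 toy_spinlock_cost.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    pthread_spinlock_t lock;
    volatile long counter = 0;
    long i, n = 100000000;
    double t0, t1, t2;

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    t0 = now();
    for (i = 0; i < n; i++) counter++;          /* unlocked increments */
    t1 = now();
    for (i = 0; i < n; i++) {                   /* same, behind a spinlock */
        pthread_spin_lock(&lock);
        counter++;
        pthread_spin_unlock(&lock);
    }
    t2 = now();
    printf("plain: %.1f ns/op  locked: %.1f ns/op\n",
           (t1 - t0) / n * 1e9, (t2 - t1) / n * 1e9);
    return 0;
}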
> BTW, "taskset" worked fine with MPI but could not get a grip on OpenMP
> threads on a dual core.
You didn't say which compiler you were using, but in the PathScale
case, our compiler default is to set process affinity for you. Our
manual describes how you can turn this off, but you probably don't
want to.
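(For reference, pinning by hand on Linux comes down to sched_setaffinity(),
the same call taskset uses underneath; a minimal sketch, with an arbitrary
core number -- not PathScale-specific code:)

/* pin_sketch.c -- minimal sketch: pin the calling process to one core
 * on Linux. Core 0 is arbitrary. Build: gcc -O2 pin_sketch.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                     /* run only on CPU 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... the computation now stays on CPU 0 ... */
    return 0;
}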
> Unfortunately I can't recommend a simple established code or benchmark
> which would allow transparent comparison of MPI versus OpenMP/MPI.
MM5 runs both ways... and it's faster as pure MPI. If OpenMP+MPI
doesn't have some special benefit such as accelerating convergence,
it's not going to be a win.
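(For concreteness, the hybrid style under discussion looks roughly like
this -- a minimal sketch, not MM5's actual code: one MPI rank per node with
OpenMP threads inside it. Build: mpicc -fopenmp hybrid_sketch.c)

/* hybrid_sketch.c -- minimal MPI+OpenMP hybrid sketch, illustration only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, i;
    double local = 0.0, total;

    /* FUNNELED: only the master thread calls MPI, so the library
     * does not need a lock around every MPI call. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel for reduction(+:local)
    for (i = 0; i < 10000000; i++)
        local += 1.0 / (i + 1.0);             /* stand-in for real work */

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}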
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From federico.ceccarelli at techcom.it Fri Jul 1 03:38:26 2005
From: federico.ceccarelli at techcom.it (Federico Ceccarelli)
Date: Fri, 01 Jul 2005 09:38:26 +0200
Subject: [Beowulf] WRF model on linux cluster: Mpi problem
In-Reply-To: <42C443AE.2060706@penguincomputing.com>
References: <3.0.32.20050630145204.011261c0@pop3.xs4all.nl>
<42C443AE.2060706@penguincomputing.com>
Message-ID: <1120203507.5114.9.camel@localhost.localdomain>
yes,
I will remove openmosix.
I patched the kernel with openmosix because I also used the cluster for
other smaller applications, so the load balancing was useful to me.
I already tried to switch off openmosix with
> service openmosix stop
but nothing seems to change...
Do you think it would make a difference to remove it completely, replacing
the kernel with a new one without the openmosix patch?
thanks...
federico
On Thu, 30-06-2005 at 12:10 -0700, Michael Will wrote:
> Vincent is on target here:
>
> If your application already uses MPI as middleware assuming
> distributed memory, then you should definitely use a beowulf-style
> setup rather than openmosix with its pseudo-shared memory model.
>
> Look at Rocks 4.0.0 (http://www.rocksclusters.org/Rocks/), which
> is free and based on CentOS 4, which in turn is a free rebuild of RHEL4.
>
> Michael
>
> Vincent Diepeveen wrote:
>
> >At 02:34 PM 6/30/2005 +0200, Federico Ceccarelli wrote:
> >
> >
> >>Thanks for your answer, Vincent,
> >>
> >>my network cards are Intel Pro 1000, Gigabit.
> >>
> >>Yes, I did a 72h (real-time) simulation that lasted 20h on 4 cpus... same
> >>behaviour...
> >>
> >>I'm thinking about a bandwidth problem...
> >>
> >>....maybe due to hardware failure of some network card, or switch (3com
> >>-Baseline switch 2824).
> >>
> >>Or the PCI risers for the network cards (I have a 2-unit rack, so I
> >>cannot mount the network cards directly in the PCI slot)...
> >>
> >>
> >
> >because the gigabit cards have such horrible one-way ping-pong latencies
> >compared to the high-end cards (Myrinet, Dolphin, Quadrics and, relatively
> >speaking, also InfiniBand), the PCI bus is not your biggest problem here.
> >
> >The card's specifications are so restricted that PCI is not the problem at all.
> >
> >There are many tests out there. You should try a one-way ping-pong test.
> >
> >By the way, the reason I run neither openmosix nor similar single-system-image
> >software is that it has such an ugly effect on latencies, and the way it pages
> >shared-memory communication between nodes is really slow and bad for this type
> >of software. There is also something called OpenSSI, which is under pretty
> >active development. It has the same problem.
> >
> >Vincent
> >
> >
> >
> >>Did you experience problems with PCI risers?
> >>
> >>Can you suggest a bandwidth benchmark?
> >>
> >>thanks again...
> >>
> >>federico
> >>
> >>On Thu, 30-06-2005 at 12:44 +0200, Vincent Diepeveen wrote:
> >>
> >>
> >>>Hello Federico,
> >>>
> >>>I hope you can find contacts with colleagues.
> >>>
> >>>A few questions.
> >>> a) what kind of interconnects does the cluster have (network cards, and
> >>>which type?)
> >>> b) if you run a simulation that eats a few hours instead of a few
> >>>seconds, do you get the same speed outcome difference?
> >>>
> >>>I see the program is pretty big for open-source calculating software, about
> >>>1.9MB of Fortran code, so a bit time-consuming to figure out for someone who
> >>>isn't a meteorological expert.
> >>>
> >>>E:\wrf>dir *.f* /s /p
> >>>..
> >>> Total Files Listed:
> >>> 141 File(s) 1,972,938 bytes
> >>>
> >>>Best regards,
> >>>Vincent
> >>>
> >>>At 06:56 PM 6/29/2005 +0200, federico.ceccarelli wrote:
> >>>
> >>>
> >>>>Hi!
> >>>>
> >>>>I would like to get in touch with people running numerical meteorological
> >>>>models on a linux cluster (16 cpus), distributed memory (1GB per node),
> >>>>diskless nodes, Gigabit lan, mpich and openmosix.
> >>>>
> >>>>I'm trying to run the WRF model, but the MPI version parallelized on 4, 8,
> >>>>or 16 nodes runs slower than the single-node one! It runs correctly but so
> >>>>slow...
> >>>>
> >>>>When I run wrf.exe on a single processor, the cpu time for every timestep
> >>>>is about 10s for my configuration.
> >>>>
> >>>>When I switch to np=4, 8 or 16, the cpu time for a single step is sometimes
> >>>>faster (as it should always be, for example 3 sec on 4 cpus), but often it
> >>>>is slower and slower (60 sec and more!). The overall time of the simulation
> >>>>is bigger than for the single-node run...
> >>>>
> >>>>has anyone experienced the same problem?
> >>>>
> >>>>thanks in advance to everybody...
> >>>>
> >>>>federico
> >>>>
> >>>>
> >>>>
> >>>>Dr. Federico Ceccarelli (PhD)
> >>>>-----------------------------
> >>>> TechCom snc
> >>>>Via di Sottoripa 1-18
> >>>>16124 Genova - Italia
> >>>>Tel: +39 010 860 5664
> >>>>Fax: +39 010 860 5691
> >>>>http://www.techcom.it
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From Karsten.Petersen at megware.de Fri Jul 1 03:55:40 2005
From: Karsten.Petersen at megware.de (Karsten Petersen)
Date: Fri, 01 Jul 2005 09:55:40 +0200
Subject: [Beowulf] more news on the Cell
In-Reply-To: <20050630141244.GL25947@leitl.org>
References: <20050630141244.GL25947@leitl.org>
Message-ID: <42C4F6FC.3080205@megware.de>
Hi
Eugen Leitl wrote:
> http://www-128.ibm.com/developerworks/power/library/pa-cell/?ca=dgr-pow03SpufsCell
>
> also, see http://www.research.scea.com/research/html/CellGDC05/index.html
Regarding the Cell: I think it was pretty telling that IBM showed the
Cell chip at LinuxTag 2005 but not at ISC 2005 (although IBM was the
main sponsor and even had a BlueGene box there). Asked about this, the
IBM guys conceded that they do not really see the Cell chip within the
HPC market.
Some participants at ISC also discussed whether it will be possible to
port e.g. BLAS/LAPACK to the Cell architecture and to what extent it will
be able to do double-precision float operations. (It seems to be optimized
for single precision!)
BTW: When it comes to raw number-crunching power, the ClearSpeed
coprocessor card looks promising, too.
Best wishes
Karsten
--
HPC System Engineer
MEGWARE Computer GmbH
http://www.megware.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From i.kozin at dl.ac.uk Fri Jul 1 05:30:38 2005
From: i.kozin at dl.ac.uk (Kozin, I (Igor))
Date: Fri, 1 Jul 2005 10:30:38 +0100
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core Opteron
275performance]
Message-ID: <77673C9ECE12AB4791B5AC0A7BF40C8F1544C9@exchange02.fed.cclrc.ac.uk>
It's great to see someone is brave enough to publish
a Gaussian benchmark.
On the other hand the results are predictable:
since the Xeon scales linearly from one to two
you'd expect the Opteron to scale well too, wouldn't you?
So the factor 1.95 comes from a comparison of four
and two "cores" on a test which apparently performs
well out of cache.
> To add to the discussion about the performance of new dual-core
> processors for computational chemistry applications,
>
> the comparison of Intel and AMD dual-CPU based computers is shown at:
>
> http://www.sg-chem.net/cluster/
>
> As can be seen from the graph, the Gaussian 03 execution
> speed (test job
> 397) on dual-core dual-CPU Opteron 275 workstation is faster
> by a factor of 1.95
> as compared to the dual-CPU Xeon 3.2GHz 800MHz FSB machine.
>
> -----------------
>
> I would like to thank Ed Gasiorowski (AMD) and Mike Fay (Colfax
> International) for their support.
>
> Serge Gorelsky
>
> ----------------------------------------------------------------
> Dr S.I. Gorelsky, Department of Chemistry, Stanford University
> Box 155, 333 Campus Drive, Stanford, CA 94305-5080 USA
> Phone: (650) 723-0041. Fax: (650) 723-0852.
> ----------------------------------------------------------------
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Mon Jul 4 03:48:47 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Mon, 04 Jul 2005 08:48:47 +0100
Subject: [Beowulf] WRF model on linux cluster: Mpi problem
In-Reply-To: <1120203507.5114.9.camel@localhost.localdomain>
References: <3.0.32.20050630145204.011261c0@pop3.xs4all.nl>
<42C443AE.2060706@penguincomputing.com>
<1120203507.5114.9.camel@localhost.localdomain>
Message-ID: <1120463328.22587.13.camel@vigor11>
On Fri, 2005-07-01 at 09:38 +0200, Federico Ceccarelli wrote:
> yeas,
>
> I will remove openmosix.
> I patched the kernel with openmosix because I used the cluster also for
> other smaller applications, so the load balance was useful to me.
>
> I already tried to switch off openmosix with
>
> > service openmosix stop
Having a small amount of Openmosix experience, I'd say that should work.
Have you used the little graphical tool to display the loads on each
node? (can't remember the name).
Anyway, I go along with the earlier advice to look at the network card
performance.
Do an lspci -vv on all nodes to check that your riser cards are running
at full speed.
What I would do is break this problem down.
Start by running the Pallas benchmark, on one node, then two, then four
etc. See if a pattern develops.
The same with your model, if it is possible to cut down the problem
size. Run on one node (two processors), then two then four.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Mon Jul 4 11:59:40 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Mon, 04 Jul 2005 17:59:40 +0200
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core
Opteron275performance]
Message-ID: <3.0.32.20050704175935.0123a9d0@pop3.xs4all.nl>
For most workloads the Opteron of course scales better than the Xeon.
A quad Xeon has 1 memory controller; a dual-core dual Opteron has 2.
The Opteron has higher bandwidth and a faster TLB-thrashing latency.
Effectively, if you run the following test:
take a memory buffer of, say, 400MB, randomly read 8 bytes from the
buffer, and then see which machine does it faster, the Opteron
eats the Xeon alive, of course.
Testing methodology: each processor allocates a buffer of n bytes and
cross-attaches to the other processors' buffers.
Of course we take a large buffer. Around 400MB is the working-set size of
the hashtable which I use for my chess software (which randomly reads
8-64 bytes from that cache).
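(For reference, the core of such a test is just a loop like the one below
-- a minimal single-process sketch with arbitrary sizes; the full
multi-process, cross-attaching program (RASML) is posted later in this
thread, and unlike this sketch it measures and subtracts the RNG overhead:)

/* randread_sketch.c -- minimal single-process random-read latency sketch.
 * Build: gcc -O2 -std=c99 randread_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t nbytes = 400UL * 1024 * 1024;          /* ~400MB working set */
    size_t nwords = nbytes / sizeof(unsigned long long);
    unsigned long long *buf = malloc(nbytes);
    unsigned long long dummy = 0;
    if (!buf) return 1;
    for (size_t i = 0; i < nwords; i++) buf[i] = i;  /* touch all pages */

    size_t n = 50000000;                          /* 50 mln random reads */
    unsigned long long x = 88172645463325252ULL;  /* xorshift64 RNG state */
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;  /* next pseudo-random index */
        dummy ^= buf[x % nwords];                 /* random 8-byte read */
    }
    clock_t t1 = clock();
    printf("%.1f ns per read (RNG overhead included), dummy=%llu\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)n, dummy);
    return 0;
}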
Results:
single cpu A64 : 91 ns (cl2 memory)
single cpu P4 : 220 ns (cl2 memory, bus overclocked)
dual opteron : 120 ns
quad opteron : 133 ns
dual xeon : 280 ns (800Mhz bus)
dual xeon : 400 ns (533Mhz bus)
So obviously, for things that do not fit in the L2 cache, the Opteron runs
away with it. Only if the executable in question has been optimized by the
Intel C++ compiler, which will have done things to make it run faster on
Intel processors than on the Opteron, do the results not look too bad for
the P4. Yet that's a matter of optimizing it better for the Opteron, which
most software dudes do NOT do, as Intel delivers good support and AMD
historically didn't deliver *any* kind of support (they are improving now,
but even then their math libraries are so pathetic compared to the ease of
the Intel libraries that I can imagine at least *that* part of the problems).
Objectively, however, there is no question that on TLB-thrashing random
lookups of 8-64 bytes from a big cache, the Opteron is over 2 times faster
than the Xeon.
For 3 reasons:
 a) it has an on-die memory controller,
 b) so it has MORE memory controllers,
 c) the on-die memory controller is clocked higher than the memory
chipset from Intel.
Vincent
At 10:30 AM 7/1/2005 +0100, Kozin, I \(Igor\) wrote:
>
>It's great to see someone is brave enough to publish
>a Gaussian benchmark.
>
>On the other hand the results are predictable:
>since the Xeon scales linearly from one to two
>you'd expect the Opteron to scale well too, wouldn't you?
>So the factor 1.95 comes from a comparison of four
>and two "cores" on a test which apparently performs
>well out of cache.
>
>
>> To add to the discussion about the performance of new dual-core
>> processors for computational chemistry applications,
>>
>> the comparison of Intel and AMD dual-CPU based computers is shown at:
>>
>> http://www.sg-chem.net/cluster/
>>
>> As can be seen from the graph, the Gaussian 03 execution
>> speed (test job
>> 397) on dual-core dual-CPU Opteron 275 workstation is faster
>> by a factor of 1.95
>> as compared to the dual-CPU Xeon 3.2GHz 800MHz FSB machine.
>>
>> -----------------
>>
>> I would like to thank Ed Gasiorowski (AMD) and Mike Fay (Colfax
>> International) for their support.
>>
>> Serge Gorelsky
>>
>> ----------------------------------------------------------------
>> Dr S.I. Gorelsky, Department of Chemistry, Stanford University
>> Box 155, 333 Campus Drive, Stanford, CA 94305-5080 USA
>> Phone: (650) 723-0041. Fax: (650) 723-0852.
>> ----------------------------------------------------------------
>>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Mon Jul 4 13:48:28 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Mon, 04 Jul 2005 21:48:28 +0400
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core
Opteron275performance]
In-Reply-To: <3.0.32.20050704175935.0123a9d0@pop3.xs4all.nl>
Message-ID:
In message from Vincent Diepeveen (Mon, 04 Jul 2005
17:59:40 +0200):
> ...
> ...
>Of course we take a large buffer. Around 400MB is the working-set size of
>the hashtable which I use for my chess software (which randomly reads
>8-64 bytes from that cache).
>
>Results:
> single cpu A64 : 91 ns (cl2 memory)
> single cpu P4 : 220 ns (cl2 memory, bus overclocked)
> dual opteron : 120 ns
> quad opteron : 133 ns
> dual xeon : 280 ns (800Mhz bus)
> dual xeon : 400 ns (533Mhz bus)
The latencies should depend on the processor frequencies (although
the RAM part is much larger),
so what were the frequencies for the A64/P4/Opteron/Xeon?
And do I understand correctly that you have 1/2/4 threads which
perform "random" reads of some bytes from main memory?
>
>So obviously, for things that do not fit in the L2 cache, the Opteron runs
>away with it. Only if the executable in question has been optimized by the
>Intel C++ compiler, which will have done things to make it run faster on
>Intel processors than on the Opteron, do the results not look too bad for
>the P4.
If the results above are for a "bad" (badly optimizing) compiler,
in some sense it's the compiler's problem :-) Yes, old binary
software will run slowly. But many, many HPC applications can be compiled
from source.
BTW, are the better results for icc only - do you know
anything about the PGI and PathScale compilers?
> Yet that's a matter of optimizing it better for the Opteron, which most
>software dudes do NOT do, as Intel delivers good support and AMD
>historically didn't deliver *any* kind of support (they are improving now,
>but even then their math libraries are so pathetic compared to the ease of
>the Intel libraries that I can imagine at least *that* part of the problems).
ACML 2.1 gives me a set of good results for the Opteron in comparison
with MKL.
Yours
Mikhail
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Mon Jul 4 15:04:20 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Mon, 04 Jul 2005 21:04:20 +0200
Subject: [Beowulf] RASM : random memory latency test
Message-ID: <3.0.32.20050704210420.012fcbf0@pop3.xs4all.nl>
Hi,
I wrote this program to test random memory-lookup latencies on SSI (single
system image) machines using shared memory.
Latencies are NOT dependent upon CPU frequency, just dependent upon the
speed the RAM works at (which sometimes IS related to CPU frequency in the
case of the Opteron, as the faster Opterons allow 400MHz RAM).
Attached is the program to test with. Just compile it and then start it,
for example:
gcc -O2 -o lat latencylinux.c
(as root) :
echo 450000000 > /proc/sys/kernel/shmmax
now you can run it on, for example, 2 processors, each with 200MB of shared
memory that the other processor(s) attach to:
./lat 200000000 2
if you want to run it on 4 processors (it then uses 800MB in total):
./lat 200000000 4
And so on. In fact the program has been bugfixed to run up to 500 processors.
Of course random access to memory is SLOWER in general when you take a
larger buffer. That's not only because of the L2 cache. So testing with
the same buffer size on each processor, a lot bigger than the L2 cache,
say 200MB, is a good thing to do.
The reason I wrote this test is a bit sad, but here it is for historical
consumption.
SGI was claiming 460 nanoseconds as its worst case on a 512-processor
partition, even when TLB-thrashing and doing reads from and to random
processors; which obviously wasn't true.
Yet for computer chess, and search algorithms in general, it matters a
lot, as huge caches get used.
I measured it at 460 processors: 5800 nanoseconds on average to get 8
bytes from a random processor at a random memory location, let alone the
worst case.
The Altix 3000 is no better on this test, to put it politely.
Amazingly, it's pretty accurate for testing PCs too.
Take into account that it is just doing READS. It's not WRITING.
So from an Intel viewpoint this is an Intel-friendly test, because when
sharing a memory controller, reading can usually be done in parallel, but
writing can't.
So if for your software you notice a difference, that's probably because
you also WRITE to memory at random.
At 09:48 PM 7/4/2005 +0400, Mikhail Kuzminsky wrote:
>In message from Vincent Diepeveen (Mon, 04 Jul 2005
>17:59:40 +0200):
>> ...
>> ...
>>Of course we take a large buffer. Around 400MB is the working-set size of
>>the hashtable which I use for my chess software (which randomly reads
>>8-64 bytes from that cache).
>>
>>Results:
>> single cpu A64 : 91 ns (cl2 memory)
>> single cpu P4 : 220 ns (cl2 memory, bus overclocked)
>> dual opteron : 120 ns
>> quad opteron : 133 ns
>> dual xeon : 280 ns (800Mhz bus)
>> dual xeon : 400 ns (533Mhz bus)
>The latencies should depend on the processor frequencies (although
>the RAM part is much larger),
>so what were the frequencies for the A64/P4/Opteron/Xeon?
>
>And do I understand correctly that you have 1/2/4 threads which
>perform "random" reads of some bytes from main memory?
>
>>
>>So obviously, for things that do not fit in the L2 cache, the Opteron runs
>>away with it. Only if the executable in question has been optimized by the
>>Intel C++ compiler, which will have done things to make it run faster on
>>Intel processors than on the Opteron, do the results not look too bad for
>>the P4.
>If the results above are for a "bad" (badly optimizing) compiler,
>in some sense it's the compiler's problem :-) Yes, old binary
>software will run slowly. But many, many HPC applications can be compiled
>from source.
>BTW, are the better results for icc only - do you know
>anything about the PGI and PathScale compilers?
>
>> Yet that's a matter of optimizing it better for the Opteron, which most
>>software dudes do NOT do, as Intel delivers good support and AMD
>>historically didn't deliver *any* kind of support (they are improving now,
>>but even then their math libraries are so pathetic compared to the ease of
>>the Intel libraries that I can imagine at least *that* part of the problems).
>ACML 2.1 gives me a set of good results for the Opteron in comparison
>with MKL.
>
>Yours
>Mikhail
>
>
-------------- next part --------------
/*-----------------10-6-2003 3:48-------------------*
*
* This program rasml.c measures the Random Average Shared Memory Latency (RASML)
* Thanks to Agner Fog for his excellent random number generator.
*
* This testset is using a 64 bits optimized RNG of Agner Fog's ranrot generator.
*
* Created by Vincent Diepeveen who hereby releases this under GPL
* Feel free to look at the FSF (free software foundation) for what
* GPL is and its conditions.
*
* Please don't confuse the times achieved here with two times the one
* way pingpong latency, though at
* ideal scaling supercomputers/clusters they will be close. There is a few
* differences:
* a) this is TLB trashing
* b) this test tests ALL processors at the same time and not
* just 2 cpu's while the rest of the entire cluster is idle.
* c) this test ships 8 bytes whereas one way pingpong typical also
* gets used to test several kilobyte sizes, or just returns a pong.
* d) this doesn't use MPI but shared memory and the way such protocols are
* implemented matters possibly for latency.
*
* Vincent Diepeveen diep at xs4all.nl
* Veenendaal, The Netherlands 10 june 2003
*
* First a few lines about the random number generator. Note that I modified Agner Fog's
* RanRot very slightly. Basically its initialization has been done better and some dead
* slow FPU code rewritten to fast 64 bits integer code.
*/
#define UNIX 1 /* put to 1 when you are under unix or using gcc a look like compilers */
#define IRIX 1 /* this value only matters when UNIX is set to 1. For Linux set to 0.
 * Basically, allocating shared memory in linux is implemented
 * in a pretty buggy way in its kernel.
*
* Therefore you might want to do 'cat /proc/sys/kernel/shmmax'
* and look for yourself how much shared memory YOU can allocate in linux.
*
* If that is not enough to benchmark this program then try modifying it with:
 * echo 450000000 > /proc/sys/kernel/shmmax (450MB, for example)
* Be sure you are root when doing that each time the system boots.
*/
#define FREEBSD 0 // be sure to not use more than 2 GB memory with freebsd with this test. sorry.
#if UNIX
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>
#else
#include <windows.h>
#include <winbase.h> // for GetTickCount()
#include <process.h> // _spawnl
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
#define SWITCHTIME 60000 /* in milliseconds. Modify this to let a test run longer or shorter.
* basically it is a good idea to use about the cpu number times
* thousand for this. 30 seconds is fine for PC's, but a very
 * bad idea for supercomputers. I recommend several minutes
 * there, and at least a few hours for big supers if the partition isn't started yet.
 * If the partition is started, starting this test at 460 processors (SGI) should
 * take 10 minutes; otherwise it takes 3 hours to attach all.
 * Of course that lets a test take way, way longer.
*/
#define MAXPROCESSES 512 /* this test can go up to this amount of processes to be tested */
#define CACHELINELENGTH 128 /* cache line length at the machine. Modify this if you want to */
#if UNIX
#include <unistd.h>
#define FORCEINLINE __inline
/* UNIX and such this is 64 bits unsigned variable: */
#define BITBOARD unsigned long long
#else
#define FORCEINLINE __forceinline
/* in WINDOWS we also want to be 64 bits: */
#define BITBOARD unsigned _int64
#endif
#define STATUS_NOTSTARTED 0
#define STATUS_ATTACH 1
#define STATUS_GOATTACH 2
#define STATUS_ATTACHED 3
#define STATUS_STARTREAD 4
#define STATUS_READ 5
#define STATUS_MEASUREREAD 6
#define STATUS_MEASUREDREAD 7
#define STATUS_QUIT 10
struct ProcessState {
volatile int status; /* 0 = not started yet
* 1 = ready to start reading
*
* 10 = quitted
* */
/* now the numbers each cpu gathers. The name of the first number is what
* cpu0 is doing and the second name what all the other cpu's were doing at that
* time
*/
volatile BITBOARD readread; /* */
char dummycacheline[CACHELINELENGTH];
};
typedef struct {
BITBOARD nentries; // number of entries of 64 bits used for cache.
struct ProcessState ps[MAXPROCESSES];
} GlobalTree;
void RanrotAInit(void);
float ToNano(BITBOARD);
int GetClock(void);
float TimeRandom(void);
void ParseBuffer(BITBOARD);
void ClearHash(void);
void DeAllocate(void);
int DoNrng(BITBOARD);
int DoNreads(BITBOARD);
int DoNreadwrites(BITBOARD);
//void TestLatency(float);
int AllocateTree(void);
void InitTree(int);
void WaitForStatus(int,int);
void PutStatus(int,int);
int CheckStatus(int,int);
int CheckAllStatus(int,int);
void Slapen(int);
float LoopRandom(void);
/* define parameters (R1 and R2 must be smaller than the integer size): */
#define KK 17
#define JJ 10
#define R1 5
#define R2 3
/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
/*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
};
int r_p1, r_p2; /* indexes into history buffer */
/* global variables RASML */
BITBOARD *hashtable[MAXPROCESSES],nentries,globaldummy=0;
GlobalTree *tree;
int ProcessNumber,
cpus; // number of processes for this test
#if UNIX
int shm_tree,shm_hash[MAXPROCESSES];
#endif
char rasmexename[2048];
/******************************************************** AgF 1999-03-03 *
* Random Number generator 'RANROT' type B *
* by Agner Fog *
* *
* This is a lagged-Fibonacci type of random number generator with *
* rotation of bits. The algorithm is: *
* X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b *
* *
* The last k values of X are stored in a circular buffer named *
* randbuffer. *
* *
* This version works with any integer size: 16, 32, 64 bits etc. *
* The integers must be unsigned. The resolution depends on the integer *
* size. *
* *
* Note that the function RanrotAInit must be called before the first *
* call to RanrotA or iRanrotA *
* *
* The theory of the RANROT type of generators is described at *
* www.agner.org/random/ranrot.htm *
* *
*************************************************************************/
FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return((x<<r)|(x>>(64-r)));}
/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
/* generate next random number */
BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
/* rotate list pointers */
if( --r_p1 < 0)
r_p1 = KK - 1;
if( --r_p2 < 0 )
r_p2 = KK - 1;
return x;
}
/* this function initializes the random number generator. */
void RanrotAInit(void) {
int i;
/* one can fill the randbuffer here with possible other values here */
randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);
/* initialize pointers to circular buffer */
r_p1 = 0;
r_p2 = JJ;
/* randomize */
for( i = 0; i < 300; i++ )
(void)RanrotA();
}
/* Now the RASML code */
char *To64(BITBOARD x) {
static char buf[256];
char *sb;
sb = &buf[0];
#if UNIX
sprintf(buf,"%llu",x);
#else
sprintf(buf,"%I64u",x);
#endif
return sb;
}
int GetClock(void) {
/* The accuracy is measured in milliseconds. The function used is very accurate according
 * to the NT team, way more accurate nowadays than mentioned in the MSDN manual. The accuracy
* for linux or unix we can only guess. Too many experts there.
*/
#if UNIX
struct timeval timeval;
struct timezone timezone;
gettimeofday(&timeval, &timezone);
return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
#else
return((int)GetTickCount());
#endif
}
float ToNano(BITBOARD nps) {
/* convert something from times a second to nanoseconds.
 * NOTE THAT THERE ARE COMPILER BUGS SOMETIMES IN OLD COMPILERS,
* SO THAT'S WHY MY CODE ISN'T A 1 LINE RETURN HERE. PLEASE DO
* NOT MODIFY THIS CODE */
float tn;
tn = 1000000000/(float)nps;
return tn;
}
float TimeRandom(void) {
/* timing the random number generator is very easy of course. Returns
* number of random numbers a second that can get generated
*/
BITBOARD bb=0,i,value,nps;
float ns_rng;
int t1,t2,took;
printf("Benchmarking Pseudo Random Number Generator speed, RanRot type 'B'!\n");
printf("Speed depends upon CPU and compile options from RASML,\n therefore we benchmark the RNG\n");
printf("Please wait a few seconds.. "); fflush(stdout);
value = 100000;
took = 0;
while( took < 3000 ) {
value <<= 2; // x4
t1 = GetClock();
for( i = 0; i < value; i++ ) {
bb ^= RanrotA();
}
t2 = GetClock();
took = t2-t1;
}
nps = (1000*value)/(BITBOARD)took;
#if UNIX
printf("..took %i milliseconds to generate %llu numbers\n",took,value);
printf("Speed of RNG = %llu numbers a second\n",nps);
#else
printf("..took %i milliseconds to generate %I64 numbers\n",took,value);
printf("Speed of RNG = %I64u numbers a second\n",nps);
#endif
ns_rng = ToNano(nps);
printf("So 1 RNG call takes %f nanoseconds\n",ns_rng);
return ns_rng;
}
void ParseBuffer(BITBOARD nbytes) {
tree->nentries = nbytes/sizeof(BITBOARD);
#if UNIX
printf("Trying to allocate %llu entries. ",tree->nentries);
printf("In total %llu bytes\n",tree->nentries*(BITBOARD)sizeof(BITBOARD));
#else
printf("Trying to allocate %s entries. ",To64(tree->nentries));
printf("In total %s bytes\n",To64(tree->nentries*(BITBOARD)sizeof(BITBOARD)));
#endif
}
void ClearHash(void) {
BITBOARD *hi,i,nentries = tree->nentries;
/* clearing hashtable */
printf("Clearing hashtable for processor %i\n",ProcessNumber);
fflush(stdout);
hi = hashtable[ProcessNumber];
for( i = 0 ; i < nentries ; i++ ) /* very unoptimized way of clearing */
hi[i] = i;
}
void DeAllocate(void) {
int i;
#if UNIX
shmctl(shm_tree,IPC_RMID,0);
for( i = 0; i < cpus; i++ ) {
shmctl(shm_hash[i],IPC_RMID,0);
}
#else
UnmapViewOfFile(tree);
for( i = 0; i < cpus; i++ ) {
UnmapViewOfFile(hashtable[i]);
}
#endif
}
int DoNrng(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2,ncpu;
ncpu = cpus;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD rani=RanrotA(),index=rani%nents;
unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
dummyres ^= (index+(BITBOARD)i2);
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
int DoNreads(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2,ncpu;
ncpu = cpus;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD rani=RanrotA(),index=rani%nents;
unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
dummyres ^= hashtable[i2][index];
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
#if 0
int DoNreadwrites(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD index = RanrotA()%nents;
dummyres ^= hashtable[ProcessNumber][index];
hashtable[ProcessNumber][index] = dummyres;
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
void TestLatency(float ns_rng) {
BITBOARD n,nps_read,nps_rw,nps_rng;
float ns,fns;
int timetaken;
printf("Doing random RNG test. Please wait..\n");
n = 50000000; // 50 mln
timetaken = DoNrng(n);
nps_rng = (1000*n) / (BITBOARD)timetaken;
fns = ToNano(nps_rng);
printf("Machine needs %f ns for RND loop\n",fns);
/* READING SINGLE CPU RANDOM ENTRIES */
printf("Doing random read tests single cpu. Please wait..\n");
n = 100000000; // 100 mln
timetaken = DoNreads(n);
nps_read = (1000*n) / (BITBOARD)timetaken;
ns = ToNano(nps_read);
printf("Machine needs %f ns for single cpu random reads.\nExtrapolated=%f nanoseconds a read\n",ns,ns-fns);
/* READING AND THEN WRITING SINGLE CPU RANDOM ENTRIES */
printf("Doing random readwrite tests single cpu. Please wait..\n");
n = 100000000; // 100 mln
timetaken = DoNreadwrites(n);
nps_rw = (1000*n) / (BITBOARD)timetaken;
ns = ToNano(nps_rw);
printf("Machine needs %f ns for single cpu random readwrites.\n",ns);
printf("Extrapolated=%f nanoseconds a readwrite (to the same slot)\n\n",ns-fns);
printf("So far the useless tests.\nBut we have vague read/write nodes a second numbers now\n");
}
#endif
int AllocateTree(void) { /* initialize the tree. returns 0 if error */
#if UNIX
shm_tree = shmget(
ftok(".",'t'),
sizeof(GlobalTree),IPC_CREAT|0777);
if( shm_tree == -1 )
return 0;
tree = (GlobalTree *)shmat(shm_tree,0,0);
if( tree == (GlobalTree *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
if( !ProcessNumber ) {
HANDLE TreeFileMap;
TreeFileMap = CreateFileMapping((HANDLE)0xFFFFFFFF,NULL,PAGE_READWRITE,0,
(DWORD)sizeof(GlobalTree),"RASM_Tree");
if( TreeFileMap == NULL )
return 0;
tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( tree == NULL )
return 0;
}
else { /* Slaves attach also try to attach to the tree */
HANDLE TreeFileMap;
TreeFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Tree");
if( TreeFileMap == NULL )
return 0;
tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( tree == NULL )
return 0;
}
#endif
return 1;
}
int AttachAll(void) {
#if UNIX
#else
HANDLE HashFileMap;
#endif
char hashname2[32] = {"RASM_Hash00"},hashname[32];
int i,r;
for( r = 0; r < cpus; r++ ) {
i = ProcessNumber+r;
i %= cpus;
if( i == ProcessNumber )
continue;
#if UNIX
shm_hash[i] = shmget(
#if IRIX
ftok(".",200+i),
#else
ftok(".",(char)i),
#endif
tree->nentries*8,IPC_CREAT|0777);
if( shm_hash[i] == -1 )
return 0;
hashtable[i] = (BITBOARD *)shmat(shm_hash[i],0,0);
if( hashtable[i] == (BITBOARD *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
strcpy(hashname,hashname2);
hashname[9] += (i/10);
hashname[10] += (i%10);
HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,hashname);
if( HashFileMap == NULL )
return 0;
hashtable[i] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[i] == NULL )
return 0;
#endif
}
return 1;
}
int AllocateHash(void) { /* initialize the hashtable (cache). returns 0 if error */
char hashname[32] = {"RASM_Hash00"};
#if UNIX
shm_hash[ProcessNumber] = shmget(
#if IRIX
ftok(".",200+ProcessNumber),
#else
ftok(".",(char)ProcessNumber),
#endif
tree->nentries*8,IPC_CREAT|0777);
if( shm_hash[ProcessNumber] == -1 )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)shmat(shm_hash[ProcessNumber],0,0);
if( hashtable[ProcessNumber] == (BITBOARD *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
//if( !ProcessNumber ) {
HANDLE HashFileMap;
hashname[9] += (ProcessNumber/10);
hashname[10] += (ProcessNumber%10);
HashFileMap = CreateFileMapping((HANDLE)0xFFFFFFFF,NULL,PAGE_READWRITE,0,
(DWORD)tree->nentries*8,hashname);
if( HashFileMap == NULL )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[ProcessNumber] == NULL )
return 0;
//}
//else { /* Slaves attach also try to attach to the tree */
/* HANDLE HashFileMap;
HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Hash");
if( HashFileMap == NULL )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[ProcessNumber] == NULL )
return 0;*/
//}
#endif
return 1;
}
int StartProcesses(int ncpus) {
char buf[256];
int i;
/* returns 1 if ncpus-1 started ok */
if( ncpus == 1 )
return 1;
for( i = 1 ; i < ncpus ; i++ ) {
sprintf(buf,"%i_%i",i+1,ncpus);
#if UNIX
if( !fork() )
execl(rasmexename,rasmexename,buf,NULL);
#else
(void)_spawnl(_P_NOWAIT,rasmexename,rasmexename,buf,NULL);
#endif
}
return 1;
}
void InitTree(int ncpus) {
int i;
for( i = 0 ; i < ncpus ; i++ ) {
tree->ps[i].status = STATUS_NOTSTARTED;
tree->ps[i].readread = 0;
}
}
void WaitForStatus(int ncpus,int waitforstate) {
/* wait for all processors to have the same state */
int i,badluck=1;
while( badluck ) {
badluck = 0;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != waitforstate )
badluck = 1;
}
}
}
void PutStatus(int ncpus,int statenew) {
int i;
for( i = 0 ; i < ncpus ; i++ ) {
tree->ps[i].status = statenew;
}
}
int CheckStatus(int ncpus,int statenew) {
/* returns false when not all cpu's are in the new state */
int i;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != statenew )
return 0;
}
return 1;
}
int CheckAllStatus(int ncpus,int status) {
/* Tries with a single loop to determine whether the other cpu's also finished
*
* returns:
* true ==> when all the processes have this status
* false ==> when 1 or more are still busy measuring
*/
int i,badluck=1;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != status ) {
badluck = 0;
break;
}
}
return badluck;
}
void Slapen(int ms) {
#if UNIX
usleep(ms*1000); /* usleep takes microseconds, hence ms*1000 */
#else
Sleep(ms); /* Sleep takes milliseconds */
#endif
}
float LoopRandom(void) {
BITBOARD n,nps_rng;
float fns;
int timetaken;
printf("Benchmarking random RNG test. Please wait..\n");
n = 25000000; // 25 mln, doubled until the loop takes long enough
timetaken = 0;
while( timetaken < 500 ) {
n += n;
timetaken = DoNrng(n);
}
printf("timetaken=%i\n",timetaken);
nps_rng = (1000*n) / (BITBOARD)timetaken;
fns = ToNano(nps_rng);
printf("Machine needs %f ns for RND loop\n",fns);
return fns;
}
/* Example showing how to use the random number generator: */
int main(int argc,char *argv[]) {
/* allocate a big memory buffer parameter is in bytes.
* don't hesitate to MODIFY this to how many gigabytes
* you want to try.
* The more the better i keep saying to myself.
*
* Note that under linux your maximum shared memory limit can be set with:
*
 * echo 450000000 > /proc/sys/kernel/shmmax (or whatever limit you need)
*
* and under IRIX it is usually 80% from the total RAM onboard that can get allocated
*/
BITBOARD nbytes,firstguess;
float ns_rng,f_loop;
int tottimes,t1,t2;
if( argc <= 1 ) {
printf("Latency test usage is: latency \n");
printf("Where 'buffer' is the buffer in number of bytes to allocate PRO PROCESSOR\n");
printf("and where 'cpus' is the number of processes that this test will try to use (1 = default) \n");
return 1;
}
/* parse the input */
nbytes = 0;
cpus = 1; // default
if( strchr(argv[1],'_') == NULL ) { /* main startup process */
int np = 0;
#if UNIX
#if FREEBSD
nbytes = (BITBOARD)atoi(argv[1]); // freebsd doesn't support > 2 GB memory
#else
nbytes = (BITBOARD)atoll(argv[1]);
#endif
#else
nbytes = (BITBOARD)_atoi64(argv[1]);
#endif
printf("Welcome to RASM Latency!\n");
printf("RASML measures the RANDOM AVERAGE SHARED MEMORY LATENCY!\n\n");
if( argc > 2 ) {
cpus = 0;
do {
cpus *= 10;
cpus += (int)(argv[2][np]-'1')+1;
np++;
} while( argv[2][np] >= '0' && argv[2][np] <= '9' );
}
//printf("Master: buffer = %s bytes. #CPUs = %i\n",To64(nbytes),cpus);
ProcessNumber = 0;
/* check whether we are not getting out of bounds */
if( cpus > MAXPROCESSES ) {
printf("Error: Recompile with a bigger stack for MAXPROCESSES. %i processors is too much\n",cpus);
return 1;
}
/* find out the file name */
#if UNIX
strcpy(rasmexename,argv[0]);
#else
GetModuleFileName(NULL,rasmexename,2044);
#endif
printf("Stored in rasmexename = %s\n",rasmexename);
}
else { // latency 2_452 ==> means processor 2 out of 452.
int np = 0;
ProcessNumber = 0;
do {
ProcessNumber *= 10;
ProcessNumber += (argv[1][np]-'1')+1; // n
np++;
} while( argv[1][np] >= '0' && argv[1][np] <= '9' );
ProcessNumber--; // 1 less because of ProcessNumber ==> [0..n-1]
np++; // skip underscore
cpus = 0;
do {
cpus *= 10;
cpus += (argv[1][np]-'1')+1; // n
np++;
} while( argv[1][np] >= '0' && argv[1][np] <= '9' );
//printf("Slave: ProcessNumber=%i cpus=%i\n",ProcessNumber,cpus);
}
/* first we setup the random number generator. */
RanrotAInit();
/* initialize shared memory tree; it gets used for communication between the processes */
if( !AllocateTree() ) {
printf("Error: ProcessNumber %i could not allocate the tree\n",ProcessNumber);
return 1;
}
if( !ProcessNumber )
ParseBuffer(nbytes);
nentries = tree->nentries;
/* Now some stuff only the Master has to do */
if( !ProcessNumber ) {
/* Master: now let's time the pseudo random generators speed in nanoseconds a call */
ns_rng = TimeRandom();
f_loop = LoopRandom();
printf("Trying to Allocate Buffer\n");
t1 = GetClock();
if( !AllocateHash() ) {
printf("Error: Could not allocate buffer!\n");
return 1;
}
t2 = GetClock();
printf("Took %i.%03i seconds to allocate Hash\n",(t2-t1)/1000,(t2-t1)%1000);
ClearHash(); // local hash
t1 = GetClock();
printf("Took %i.%03i seconds to clear Hash\n",(t1-t2)/1000,(t1-t2)%1000);
/* so now hashtable is setup and we know quite some stuff. So it is time to
* start all other processes */
InitTree(cpus);
printf("Starting Other processes\n");
t1 = GetClock();
if( !StartProcesses(cpus) ) {
printf("Error: Could not start processes\n");
DeAllocate();
}
t2 = GetClock();
printf("Took %i milliseconds to start %i additional processes\n",t2-t1,cpus-1);
t1 = GetClock();
}
else { /* all Slaves do this */
if( !AllocateHash() ) {
printf("Error: slave %i Could not allocate buffer!\n",ProcessNumber);
return 1;
}
ClearHash(); // local hash
}
tree->ps[ProcessNumber].status = STATUS_ATTACH;
if( ! ProcessNumber ) {
WaitForStatus(cpus,STATUS_ATTACH);
t2 = GetClock();
printf("Took %i milliseconds to synchronize %i additional processes\n",t2-t1,cpus-1);
t1 = GetClock();
/* now we can continue with the next phase that is attaching all the segments */
PutStatus(cpus,STATUS_GOATTACH);
}
else {
while( tree->ps[ProcessNumber].status == STATUS_ATTACH ) {
Slapen(500);
}
}
if( !AttachAll() ) {
printf("Error: process %i Could not attach correctly!\n",ProcessNumber);
return 1;
}
tree->ps[ProcessNumber].status = STATUS_ATTACHED;
if( ! ProcessNumber ) {
WaitForStatus(cpus,STATUS_ATTACHED);
t2 = GetClock();
printf("Took %i milliseconds to ATTACH. %llu total RAM\n",t2-t1,(BITBOARD)cpus*tree->nentries*8);
PutStatus(cpus,STATUS_STARTREAD);
printf("Read latency measurement STARTS NOW using steps of 2 * %i.%03i seconds :\n",
(SWITCHTIME/1000),(SWITCHTIME%1000));
}
else {
while( tree->ps[ProcessNumber].status == STATUS_ATTACHED ) {
Slapen(500);
}
}
tree->ps[ProcessNumber].status = STATUS_READ;
firstguess = 200000;
tottimes = 0;
for( ;; ) {
int timetaken = 0;
if( tree->ps[ProcessNumber].status == STATUS_MEASUREREAD ) {
/* this really MEASURES the readread */
BITBOARD ntried = 0,avnumber;
int totaltime=0;
while( totaltime < SWITCHTIME ) { /* go measure around switchtime seconds */
totaltime += DoNreads(firstguess);
ntried += firstguess;
}
/* now put the average number of readreads into the shared memory */
avnumber = (ntried*1000) / (BITBOARD)totaltime;
tree->ps[ProcessNumber].readread = avnumber;
/* show that it is finished */
tree->ps[ProcessNumber].status = STATUS_MEASUREDREAD;
/* now keep doing the same thing until status gets modified */
while( tree->ps[ProcessNumber].status == STATUS_MEASUREDREAD ) {
(void)DoNreads(firstguess);
if( !ProcessNumber ) {
if( CheckAllStatus(cpus,STATUS_MEASUREDREAD) ) {
PutStatus(cpus,STATUS_QUIT);
break;
}
}
}
}
else if( tree->ps[ProcessNumber].status == STATUS_READ ) {
BITBOARD nextguess;
/* now software must try to determine how many reads a seconds are possible for that
* process
*/
//printf("proc=%i trying %s reads\n",ProcessNumber,To64(firstguess));
timetaken = DoNreads(firstguess);
/* try to guess such that next test takes 1 second, or if test was too inaccurate
* then double the number simply. also prevents divide by zero error ;)
*/
if( timetaken < 400 )
nextguess = firstguess*2;
else
nextguess = (firstguess*1000)/(BITBOARD)timetaken;
firstguess = nextguess;
if( !ProcessNumber ) {
tottimes += timetaken;
if( tottimes >= SWITCHTIME ) { // 30 seconds to a few minutes
tottimes = 0;
if( CheckStatus(cpus,STATUS_READ) ) {
PutStatus(cpus,STATUS_MEASUREREAD);
} /* waits another SWITCH time before starting to measure */
}
}
}
else if( tree->ps[ProcessNumber].status == STATUS_QUIT )
break;
}
/* now do the latency tests
*/
//TestLatency(ns_rng);
tree->ps[ProcessNumber].status = STATUS_QUIT;
if( !ProcessNumber ) {
BITBOARD averagereadread;
int i;
averagereadread = 0;
WaitForStatus(cpus,STATUS_QUIT);
printf("the raw output\n");
for( i = 0; i < cpus ; i++ ) {
BITBOARD tr=tree->ps[i].readread;
averagereadread += tr;
printf("%llu ",tr);
}
printf("\n");
averagereadread /= (BITBOARD)cpus;
printf("Raw Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread));
printf("Now for the final calculation it gets compensated:\n");
printf(" Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread)-f_loop);
}
DeAllocate();
return 0;
}
/* EOF latencyC.c */
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Mon Jul 4 16:40:52 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Mon, 4 Jul 2005 13:40:52 -0700
Subject: [Beowulf] RASM : random memory latency test
In-Reply-To: <3.0.32.20050704210420.012fcbf0@pop3.xs4all.nl>
References: <3.0.32.20050704210420.012fcbf0@pop3.xs4all.nl>
Message-ID: <20050704204052.GB19958@greglaptop.hsd1.ca.comcast.net>
On Mon, Jul 04, 2005 at 09:04:20PM +0200, Vincent Diepeveen wrote:
> Latencies are NOT dependent upon CPU frequency, just dependent upon the
> speed the RAM works at (which sometimes IS related to CPU frequency in the
> case of the Opteron, as the faster Opterons allow 400MHz RAM).
This is simply wrong -- on the Opteron, if you hold the memory
frequency constant, the memory bandwidth and latency improve by a
modest amount as you increase the CPU clock.
This test does have the advantage that it uses all the cpus, which is
more realistic than using just one.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From scheinin at crs4.it Tue Jul 5 08:07:20 2005
From: scheinin at crs4.it (Alan Louis Scheinine)
Date: Tue, 05 Jul 2005 14:07:20 +0200
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core
Opteron 275performance]
In-Reply-To: <77673C9ECE12AB4791B5AC0A7BF40C8F1544C9@exchange02.fed.cclrc.ac.uk>
References: <77673C9ECE12AB4791B5AC0A7BF40C8F1544C9@exchange02.fed.cclrc.ac.uk>
Message-ID: <42CA77F8.5000302@crs4.it>
With regard to the performance of the dual-core Opteron, your mileage may vary.
I just finished a benchmark using a program that needs a high
bandwidth to memory. A quad-CPU board with single-core Opteron was
nearly twice as fast as a dual-CPU board with dual-core Opteron,
in each case four MPI processes, compiled with MPICH and PGI,
in every case at 2.2 GHz CPU frequency. The program is a meteorology
code for short-term weather prediction.
-- Alan
Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
Center for Advanced Studies, Research, and Development in Sardinia
Postal Address: | Physical Address for FedEx, UPS, DHL:
--------------- | -------------------------------------
Alan Scheinine | Alan Scheinine
c/o CRS4 | c/o CRS4
C.P. n. 25 | Loc. Pixina Manna Edificio 1
09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
Email: scheinin at crs4.it
Phone: 070 9250 238 [+39 070 9250 238]
Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
Operator at reception: 070 9250 1 [+39 070 9250 1]
Mobile phone: 347 7990472 [+39 347 7990472]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From Andrei.Maslennikov at caspur.it Tue Jul 5 04:18:24 2005
From: Andrei.Maslennikov at caspur.it (Andrei Maslennikov)
Date: Tue, 5 Jul 2005 10:18:24 +0200 (MEST)
Subject: [Beowulf] Bonding-alb and NAS performance
In-Reply-To:
References:
Message-ID:
On Tue, 5 Jul 2005, Maurice Volaski wrote:
> Hi, I saw your post below and was just wondering how you made out. For
> example, did you find out whether the Cisco sends packets back to the bonded
> NIC? I am preparing a setup here and we have a Cisco 6509. I am thinking
> of trying out balance-alb first because I think the server would have
> the most control, but I don't know how performance will be affected.
Hi Maurice,
we have followed the advice of Jay Vosburgh, and I must say that
it was the right advice. We were able to get an excellent symmetric
throughput of 220+ MB/sec with bonding-alb and a Cisco switch.
With 4 high-end servers totalling 8 GigE NICs we have seen
aggregate NAS speeds in excess of 800 MB/sec. It seems that
"fat" NAS servers capable of delivering 200+ MB/sec per unit may
soon become the platform of choice for cluster storage.
You may also want to check our recent presentation at HEPiX, see:
http://hepix.fzk.de/upload/lectures/Maslennikov-storage-2005.pdf
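(For anyone trying the same thing: balance-alb needs no switch-side
configuration. A typical 2.6-era Linux setup looks roughly like the sketch
below -- module options, the address and the interface names are
site-specific placeholders, so adapt them:)

# /etc/modprobe.conf -- sketch only; adapt names and options
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

# bring up the bond and enslave the GigE NICs
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1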
Andrei.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From federico.ceccarelli at techcom.it Mon Jul 4 15:13:39 2005
From: federico.ceccarelli at techcom.it (Federico Ceccarelli)
Date: Mon, 04 Jul 2005 21:13:39 +0200
Subject: [Beowulf] WRF model on linux cluster: Mpi problem
In-Reply-To: <1120463328.22587.13.camel@vigor11>
References: <3.0.32.20050630145204.011261c0@pop3.xs4all.nl>
<42C443AE.2060706@penguincomputing.com>
<1120203507.5114.9.camel@localhost.localdomain>
<1120463328.22587.13.camel@vigor11>
Message-ID: <1120504420.5106.63.camel@localhost.localdomain>
Hi,
I did the Pallas benchmark, after removing openmosix. Here are the
ping-pong and ping-ping results for 2 processes.
What do you think about them?
Why does the bandwidth rise and fall many times as the message size
grows?
thanks again...
federico
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date : Mon Jul 4 15:20:32 2005
# Machine : i686
# System : Linux
# Release : 2.4.26-om1
# Version : #3 mer feb 23 04:32:26 CET 2005
#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Alltoall
# Bcast
# Barrier
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 6 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 109.00 0.00
1 1000 109.43 0.01
2 1000 138.81 0.01
4 1000 238.29 0.02
8 1000 246.77 0.03
16 1000 246.26 0.06
32 1000 273.79 0.11
64 1000 250.73 0.24
128 1000 250.98 0.49
256 1000 250.73 0.97
512 1000 250.74 1.95
1024 1000 250.23 3.90
2048 1000 251.99 7.75
4096 1000 256.01 15.26
8192 1000 500.27 15.62
16384 1000 785.51 19.89
32768 1000 15087.75 2.07
65536 640 33256.60 1.88
131072 320 5399.92 23.15
262144 160 95577.23 2.62
524288 80 102396.36 4.88
1048576 40 529898.21 1.89
2097152 20 89600.72 22.32
4194304 10 794578.55 5.03
#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
# ( 6 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 94.45 0.00
1 1000 94.92 0.01
2 1000 94.07 0.02
4 1000 95.82 0.04
8 1000 95.33 0.08
16 1000 105.89 0.14
32 1000 117.57 0.26
64 1000 120.45 0.51
128 1000 124.39 0.98
256 1000 136.02 1.79
512 1000 171.28 2.85
1024 1000 185.80 5.26
2048 1000 238.80 8.18
4096 1000 256.54 15.23
8192 1000 381.98 20.45
16384 1000 13932.86 1.12
32768 1000 42027.47 0.74
65536 640 45166.66 1.38
131072 320 9002.89 13.88
262144 160 194274.79 1.29
524288 80 773914.26 0.65
1048576 40 85866.48 11.65
2097152 20 839526.30 2.38
4194304 10 310144.00 12.90
On Mon, 04-07-2005 at 08:48 +0100, John Hearns wrote:
> On Fri, 2005-07-01 at 09:38 +0200, Federico Ceccarelli wrote:
> > yes,
> >
> > I will remove openmosix.
> > I patched the kernel with openmosix because I also used the cluster for
> > other smaller applications, so the load balancing was useful to me.
> >
> > I already tried to switch off openmosix with
> >
> > > service openmosix stop
> Having a small amount of Openmosix experience, I'd say that should work.
>
> Have you used the little graphical tool to display the loads on each
> node? (can't remember the name).
>
> Anyway, I go along with the earlier advice to look at the network card
> performance.
> Do an lspci -vv on all nodes to check that your riser cards are running
> at full speed.
>
> What I would do is break this problem down.
> Start by running the Pallas benchmark, on one node, then two, then four
> etc. See if a pattern develops.
> The same with your model, if it is possible to cut down the problem
> size. Run on one node (two processors), then two then four.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From regatta at gmail.com Mon Jul 4 01:02:44 2005
From: regatta at gmail.com (regatta)
Date: Mon, 4 Jul 2005 08:02:44 +0300
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core Opteron 275
performance]
In-Reply-To: <20050701084941.GF25947@leitl.org>
References: <20050701084941.GF25947@leitl.org>
Message-ID: <5a3ed5650507032202108310b6@mail.gmail.com>
Hi,
I was happy to see this benchmark (although we have our own benchmark),
but be serious: you are comparing a 2.4 kernel with a 2.6 kernel!!!
On 7/1/05, Eugen Leitl wrote:
> ----- Forwarded message from "S.I.Gorelsky" -----
>
> From: "S.I.Gorelsky"
> Date: Thu, 30 Jun 2005 23:00:42 -0700 (PDT)
> To: chemistry at ccl.net
> Cc: ed.gasiorowski at amd.com, mike at colfax-intl.com
> Subject: CCL:dual-core Opteron 275 performance
> Reply-To: chemistry at ccl.net
>
>
> To add to the discussion about the performance of new dual-core
> processors for computational chemistry applications,
>
> the comparison of Intel and AMD dual-CPU based computers is shown at:
>
> http://www.sg-chem.net/cluster/
>
> As can be seen from the graph, the Gaussian 03 execution speed (test job
> 397) on dual-core dual-CPU Opteron 275 workstation is faster by a factor of 1.95
> as compared to the dual-CPU Xeon 3.2GHz 800MHz FSB machine.
>
> -----------------
>
> I would like to thank Ed Gasiorowski (AMD) and Mike Fay (Colfax
> International) for their support.
>
> Serge Gorelsky
>
> ----------------------------------------------------------------
> Dr S.I. Gorelsky, Department of Chemistry, Stanford University
> Box 155, 333 Campus Drive, Stanford, CA 94305-5080 USA
> Phone: (650) 723-0041. Fax: (650) 723-0852.
> ----------------------------------------------------------------
>
>
>
>
>
>
> -= This is automatically added to each message by the mailing script =-
> To send e-mail to subscribers of CCL put the string CCL: on your Subject: line
> and send your message to: CHEMISTRY at ccl.net
>
> Send your subscription/unsubscription requests to: CHEMISTRY-REQUEST at ccl.net
> HOME Page: http://www.ccl.net | Jobs Page: http://www.ccl.net/jobs
>
> If your is mail bouncing from ccl.net domain due to spam filters, please
> use the Web based form from CCL Home Page
> -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
>
>
>
>
> ----- End forwarded message -----
> --
> Eugen* Leitl leitl
> ______________________________________________________________
> ICBM: 48.07100, 11.36820 http://www.leitl.org
> 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>
>
--
Best Regards,
--------------------
-*- If Linux doesn't have the solution, you have the wrong problem -*-
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Tue Jul 5 12:33:56 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Tue, 05 Jul 2005 20:33:56 +0400
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core Opteron
275 performance]
In-Reply-To: <42CA77F8.5000302@crs4.it>
Message-ID:
In message from Alan Louis Scheinine (Tue, 05 Jul
2005 14:07:20 +0200):
>With regard to performance of dual-core Opteron, your milage may
>vary.
>I just finished a benchmark using a program that needs a high
>bandwidth to memory.
Yes, it's a good idea.
> A quad-CPU board with single-core Opteron was
>nearly twice as fast as a dual-CPU board with dual-core Opteron,
But this result means that 4 Opteron cores are "equal in performance"
to 2 single-core Opterons. If that holds *exactly*, your program looks
to be working "only" with RAM (I assume that memory throughput does not
scale from a single-core Opteron to a two-core chip, which is, generally
speaking, not quite correct), and there are practically no
"memory-independent" computations!
But I understand that some typical finite-element loops
may contain a lot of memory references.
I plan to receive a dual-CPU Opteron 275 server in a couple of days,
and also to perform a set of tests.
Yours
Mikhail
>in each case four MPI processes; compiled with MPICH with PGI;
>in every case 2.2 GHz CPU frequency. The program is meteorology,
>short-term weather prediction.
>
>-- Alan
>
> Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
> Center for Advanced Studies, Research, and Development in Sardinia
>
> Postal Address: | Physical Address for FedEx, UPS,
>DHL:
> --------------- |
> -------------------------------------
> Alan Scheinine | Alan Scheinine
> c/o CRS4 | c/o CRS4
> C.P. n. 25 | Loc. Pixina Manna Edificio 1
> 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
>
> Email: scheinin at crs4.it
>
> Phone: 070 9250 238 [+39 070 9250 238]
> Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
> Operator at reception: 070 9250 1 [+39 070 9250 1]
> Mobile phone: 347 7990472 [+39 347 7990472]
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Tue Jul 5 13:04:03 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue, 5 Jul 2005 13:04:03 -0400 (EDT)
Subject: [Beowulf] problem : mpi dynamic scheduling ??
In-Reply-To: <20050704073344.75237.qmail@web32807.mail.mud.yahoo.com>
Message-ID:
> i m using mpiJava and i need to know ,Do MPI 0ffers Api's / Functions for
> dynamic scheduling , Process migration , Load and balancing?? .
not generally, and I think there are good reasons, not just laziness.
to make any large, tightly-coupled application run efficiently, each proc
needs to avoid being preempted. that means that you really can't do any
load-balancing, since you have to set aside a whole cpu for a proc, and hope
that the kernel and random daemons don't interfere too much. (indeed, there
are systems which try to micro-kernel-ize the environment. it's also
traditional to gang-schedule, so that if you really need to run a daemon, you
do it on all nodes at once.)
in other words, large-tight apps are incompatible with meaningful
load-balancing.
not all the world consists of large, tightly coupled MPI applications, of
course. but I'm pretty convinced that for looser applications, especially in
the presence of dynamic parallelism, and migration, I'd try to build a sort
of peer-to-peer version of Linda, rather than use MPI.
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From waheed751 at yahoo.com Mon Jul 4 03:33:44 2005
From: waheed751 at yahoo.com (waheed iqbal)
Date: Mon, 4 Jul 2005 00:33:44 -0700 (PDT)
Subject: [Beowulf] problem : mpi dynamic scheduling ??
Message-ID: <20050704073344.75237.qmail@web32807.mail.mud.yahoo.com>
Hi,
I am using mpiJava and I need to know: does MPI offer APIs/functions for
dynamic scheduling, process migration, and load balancing?
Please give some suggestions.
Thanks,
Waheed
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From laurence at scalablesystems.com Tue Jul 5 04:45:59 2005
From: laurence at scalablesystems.com (Laurence Liew)
Date: Tue, 05 Jul 2005 16:45:59 +0800
Subject: [Beowulf] ANNOUCEMENT - SGE6 (Update 4) SRPMS available for download
Message-ID: <42CA48C7.40206@scalablesystems.com>
Hi all
We are pleased to make available the SGE6 (update 4) SRPMS for the community.
It has been prepared for RH9, FC2, RHEL3 and RHEL4.
Please download from http://www.scalablesystems.com [Downloads] section.
Build instructions:
for rh9:
# rpmbuild --rebuild --define 'dist 1.rh9' --define 'rh9 1' gridengine-6.0u4-1.src.rpm
for rhel3:
# rpmbuild --rebuild --define 'dist 1.el3' --define 'el3 1' gridengine-6.0u4-1.src.rpm
for fc2:
# rpmbuild --rebuild --define 'dist 1.fc2' --define 'fc2 1' gridengine-6.0u4-1.src.rpm
for rhel4:
# rpmbuild --rebuild --define 'dist 1.el4' --define 'el4 1' gridengine-6.0u4-1.src.rpm
Enjoy!
laurence
--
Laurence Liew, CTO Email: laurence at scalablesystems.com
Scalable Systems Pte Ltd Web : http://www.scalablesystems.com
(Reg. No: 200310328D)
7 Bedok South Road Tel : 65 6827 3953
Singapore 469272 Fax : 65 6827 3922
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From Hakon.Bugge at scali.com Mon Jul 4 04:49:46 2005
From: Hakon.Bugge at scali.com (=?iso-8859-1?Q?H=E5kon?= Bugge)
Date: Mon, 04 Jul 2005 10:49:46 +0200
Subject: [Beowulf] Shared memory
Message-ID: <6.2.3.3.0.20050704104635.03697d40@mail.scali.com>
On Sun, 3 Jul 2005 20:12:10 -0700, Greg Lindahl wrote:
> > Unfortunately I can't recommend a simple established code or benchmark
> > which would allow transparent comparison of MPI versus OpenMP/MPI.
>
>MM5 runs both ways... and it's faster as pure MPI. If OpenMPI+MPI
>doesn't have some special benefit such as accellerating convergence,
>it's not going to be a win.
>
>-- greg
Hmm, the last time I measured I got ~20% speedup using OpenMP+MPI on MM5.
Internode message exchange is less frequent and uses larger messages;
both good things for most clusters...
-Hakon
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From joachim at ccrl-nece.de Wed Jul 6 04:05:52 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Wed, 06 Jul 2005 10:05:52 +0200
Subject: [Beowulf] WRF model on linux cluster: Mpi problem
In-Reply-To: <1120504420.5106.63.camel@localhost.localdomain>
References: <3.0.32.20050630145204.011261c0@pop3.xs4all.nl> <42C443AE.2060706@penguincomputing.com> <1120203507.5114.9.camel@localhost.localdomain> <1120463328.22587.13.camel@vigor11>
<1120504420.5106.63.camel@localhost.localdomain>
Message-ID: <42CB90E0.1080801@ccrl-nece.de>
Federico Ceccarelli wrote:
>
> Hi,
>
> I did the Pallas benchmark...after removing openmosix...here are the
> ping-pong and ping-ping results...for 2 processes
> What do you think about them?
> Why the bandwidth is raising and decreasing many times as the #bytes
> grow?
The latency is quite high and grows too fast from 0 to 8 bytes. But most of
all, the bandwidth should be roughly constant for large messages, at a minimum
of 80MB/s and preferably >100MB/s. Something is wrong, probably with your
ethernet setup. I remember seeing such effects with a half/full-duplex
mismatch caused by 100Mb autonegotiation problems between NIC and switch.
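A quick way to check that on each node (a sketch, assuming the interface is
eth0 and that ethtool or mii-tool is installed):
# negotiated speed/duplex as the driver sees it
ethtool eth0 | grep -E 'Speed|Duplex|Auto-negotiation'
# or, with older tools:
mii-tool -v eth0
# RX/TX error counters that climb during a run point at a duplex mismatch
ifconfig eth0 | grep -E 'errors|collisions'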
Joachim
--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From scheinin at crs4.it Wed Jul 6 10:12:25 2005
From: scheinin at crs4.it (Alan Louis Scheinine)
Date: Wed, 06 Jul 2005 16:12:25 +0200
Subject: [Beowulf] [gorelsky@stanford.edu:
CCL:dual-core Opteron 275 performance]
In-Reply-To:
References:
Message-ID: <42CBE6C9.1070406@crs4.it>
I wrote:
> > A quad-CPU board with single-core Opteron was
> > nearly twice as fast as a dual-CPU board with dual-core Opteron,
Mikhail Kuzminsky wrote:
> But this result means, that 4 cores of Opteron are "equal by performance"
> to 2 "single core" Opterons. If it'll be *exactly*,
> your program looks as working "only" w/RAM (I suppose that
> memory throughput don't scale from single core Opteron to 2-cores chip,
> what is, generally speaking, incorrect), and there is
> practically no "memory-independed" computations !
I did some other benchmarking tests: a two-chip board with dual-core chips,
that is, 4 cores on the board, was in other cases 20 percent and 40 percent
slower than two nodes of a cluster, each node with two single-core chips.
Really, the first program is very dependent on main memory.
It is a bit of an exaggeration to say that such a program has "practically
no 'memory-independent' computations". Since both the level 1 and level 2
caches are necessary on the Opteron, it seems evident that bandwidth to main
memory is much less than the computational potential. There might be reuse of
variables and some memory-independent computations in the program, but still
the bandwidth to main memory is relatively narrow compared to the potential of
the arithmetic units.
My main point is, as I wrote, "your mileage may vary." I've heard from various
people that "everybody is going to dual-core". I simply want to emphasize that
the dual-core choice is not for everybody. In particular, I looked at
profiling done by the PGI compiler, pgf90; it managed to vectorize some rather
complicated arithmetic expressions. This suggests to me that more programs
than in the past will efficiently use very long vectors, for which the memory
bandwidth is important.
On this same theme, the programs that are impacted by bandwidth to main memory
seem to hit a limit for single-core CPUs of about 2.0 GHz. Aside from the
question of dual-core, what has been the experience of other people with
regard to very fast single-core CPUs? For programs that have vectors longer
than the size of L2 cache, is there a speed grade above which no gain is seen?
Alan
--
Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
Center for Advanced Studies, Research, and Development in Sardinia
Postal Address: | Physical Address for FedEx, UPS, DHL:
--------------- | -------------------------------------
Alan Scheinine | Alan Scheinine
c/o CRS4 | c/o CRS4
C.P. n. 25 | Loc. Pixina Manna Edificio 1
09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
Email: scheinin at crs4.it
Phone: 070 9250 238 [+39 070 9250 238]
Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
Operator at reception: 070 9250 1 [+39 070 9250 1]
Mobile phone: 347 7990472 [+39 347 7990472]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From scheinin at crs4.it Wed Jul 6 11:27:24 2005
From: scheinin at crs4.it (Alan Louis Scheinine)
Date: Wed, 06 Jul 2005 17:27:24 +0200
Subject: [Beowulf] [gorelsky@stanford.edu:
CCL:dual-core Opteron 275 performance]
In-Reply-To: <20050706144300.GN422@unthought.net>
References: <42CBE6C9.1070406@crs4.it>
<20050706144300.GN422@unthought.net>
Message-ID: <42CBF85C.1000403@crs4.it>
Jakob Oestergaard wrote:
> Guys, the results are very interesting - it would be very interesting to
> know, too, which kernels you have been running on.
Four-cpu single core (Celestica): Linux version 2.6.7
Two nodes of a cluster (Tyan 2882), each node single core,
sometimes the fastest of all, they have: Linux version 2.6.10
Two "chips" Opteron 875 on a board Tyan 2882, each chip dual core: Linux version 2.6.9-11.ELsmp
(The last from CentOS 4.1 with Red Hat June update. By the way, the board has the 875
because the 275 was not available.)
In all cases the memory was 400 MHz DDR Registered ECC.
I'm not giving all the details such as CPU speed because I mostly wanted
to give a word of warning: when benchmarking on dual core, take the memory
bandwidth into consideration. Towards the end of the day someone came
into my office to say that, instead of running one MPI job with 4 processes,
if he ran two MPI jobs each with 2 processes, the dual-core machine was
much better. Evidently there is more to learn.
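If anyone wants to reproduce that observation, something along these lines
should work (a sketch only; job_a and job_b are placeholders, and the numactl
option spellings vary between versions):
# run two independent 2-process MPI jobs, each bound to one chip
# (one NUMA node), so that each job keeps its own memory controller
numactl --cpubind=0 --membind=0 mpirun -np 2 ./job_a &
numactl --cpubind=1 --membind=1 mpirun -np 2 ./job_b &
wait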
Jakob wrote: I'm pretty sure at least 2.6.12 had NUMA/dual-core fixes in it.
CentOS 4.1 is very recent; being a rebuild, perhaps Rocks Clusters
and Scientific Linux also have 2.6.9-11.ELsmp -- I didn't check. It is useful
to learn that we should upgrade to at least 2.6.12 to get the important
NUMA fixes.
--
Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
Center for Advanced Studies, Research, and Development in Sardinia
Postal Address: | Physical Address for FedEx, UPS, DHL:
--------------- | -------------------------------------
Alan Scheinine | Alan Scheinine
c/o CRS4 | c/o CRS4
C.P. n. 25 | Loc. Pixina Manna Edificio 1
09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
Email: scheinin at crs4.it
Phone: 070 9250 238 [+39 070 9250 238]
Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
Operator at reception: 070 9250 1 [+39 070 9250 1]
Mobile phone: 347 7990472 [+39 347 7990472]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From jakob at unthought.net Wed Jul 6 10:43:00 2005
From: jakob at unthought.net (Jakob Oestergaard)
Date: Wed, 6 Jul 2005 16:43:00 +0200
Subject: [Beowulf] [gorelsky@stanford.edu:
CCL:dual-core Opteron 275 performance]
In-Reply-To: <42CBE6C9.1070406@crs4.it>
References: <42CBE6C9.1070406@crs4.it>
Message-ID: <20050706144300.GN422@unthought.net>
On Wed, Jul 06, 2005 at 04:12:25PM +0200, Alan Louis Scheinine wrote:
...
> My main point is, as I wrote, "your milage may vary." I've heard from
> various
> people that "everybody is going to dual-core". I simply want to emphasize
> that
> the dual-core choice is not for everybody. In particular, I looked at
> profiling
Guys, the results are very interesting - it would be very interesting to
know, too, which kernels you have been running on.
For some time there have frequently been fixes to the NUMA code in
general, and there's definitely been some changes for handling dual-core
processors in the opteron case.
If, and I don't know if this is the case, the kernels you have tested
with do not properly understand the topology of your NUMA setup in
dual-core (two processors (cores) per node (memory controller)),
performance should definitely be expected to degrade noticeably.
I'm pretty sure at least 2.6.12 had NUMA/dual-core fixes in it. Anyway,
if you're running older 2.6 kernels it would probably be interesting to
try a more up-to-date kernel.
Thanks for the results and observations - it's a good read :)
--
/ jakob
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From Daniel.G.Roberts at sanofi-aventis.com Wed Jul 6 09:08:59 2005
From: Daniel.G.Roberts at sanofi-aventis.com (Daniel.G.Roberts at sanofi-aventis.com)
Date: Wed, 6 Jul 2005 09:08:59 -0400
Subject: [Beowulf] Beowulf and Ganglia config help please
Message-ID:
Hello all
I have some general Beowulf/Ganglia configuration woes that I am seeking help with!
1>I have two beowulf style clusters.
I would like to use cluster A to monitor Cluster B. Cluster A is 18 nodes cluster B is 90 nodes.
Monitoring on Cluster A is no problem. But on Cluster B, for whatever reason, the gmetd that is running on the headnode only "sees" about half of the gmonds running on the corresponding compute nodes. I know the gmonds are running on each of the 90 compute nodes, as a simple ps tells me so. Further, I can go to each compute node in turn, connect to localhost 8649, and see the XML spew. Yet the gmetd on the headnode only sees about half of the compute nodes. Any idea why?
2> Does a gmetd need to be running on the headnode of cluster B if I wish to monitor Cluster B from Cluster A? Also, in general, should a gmond be running on my headnodes? I have seen that when a gmond is running on the headnode as well, the corresponding gmetd ignores all the other gmonds and only reports the one on the headnode.
3> On cluster B, should I put the IP addresses of all the corresponding compute nodes on the data_source line in the gmetd.conf file? I seem to get a variety of results and behaviors depending on what I put there.
4> The ganglia conf files seem much happier if I use IP addresses instead of FQDNs. Is this really the case?
5> In general, what should be on the data_source line of my gmetd.conf file? The IP addresses of every single gmond running on my compute nodes?
If you have some general docs on how to correctly set up ganglia on a grid of beowulf clusters, that would be great to have!
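(For what it's worth, a data_source sketch with hypothetical hostnames --
note the meta daemon is usually spelled gmetad, with config file gmetad.conf:)
# one or two reachable gmonds per cluster are normally enough; gmetad
# polls them, and each gmond already hears about its whole cluster
# over multicast:
#   data_source "clusterA" 15 a-head:8649
#   data_source "clusterB" 15 b-node01:8649 b-node02:8649
# quick check of how many hosts a given gmond actually reports (needs netcat):
nc b-node01 8649 | grep -c '<HOST '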
Thanks for any and all help!
Sincerely
Dan Roberts
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Wed Jul 6 14:02:31 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed, 06 Jul 2005 22:02:31 +0400
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core Opteron
275 performance]
In-Reply-To: <42CBE6C9.1070406@crs4.it>
Message-ID:
In message from Alan Louis Scheinine (Wed, 06 Jul
2005 16:12:25 +0200):
>
>I wrote:
> > > A quad-CPU board with single-core Opteron was
> > > nearly twice as fast as a dual-CPU board with dual-core Opteron,
>Mikhail Kuzminsky wrote:
> > But this result means, that 4 cores of Opteron are "equal by
>performance"
> > to 2 "single core" Opterons. If it'll be *exactly*,
> > your program looks as working "only" w/RAM (I suppose that
> > memory throughput don't scale from single core Opteron to 2-cores
>chip,
> > what is, generally speaking, incorrect), and there is
> > practically no "memory-independed" computations !
>
>I did some other benchmarking tests, a two-chip board with dual-core,
>that is, 4 cores on the board, was in other cases 20 percent and 40
>percent
>slower than two nodes of a cluster, each node with two single-core
>chips.
>Really, the first program is very dependent on main memory.
>It is a bit of an exaggeration to say that such a program has
>"practically
>no 'memory-independent' computations". Since both level 1 and level
>2 cache
>are necessary on the Opteron, it seems evident that bandwidth to main
>memory
>is much less than the computational potential. There might be reuse
>of
>variables and some memory-independent computations in the program,
>but still
>the bandwidth to main memory is relatively narrow compared to the
>potential of
>the arithmetic units.
>
>My main point is, as I wrote, "your milage may vary." I've heard
>from various
>people that "everybody is going to dual-core". I simply want to
>emphasize that
>the dual-core choice is not for everybody.
Ehh, it'll be for everybody, simply because there will be *no* single-core
server microprocessors :-)
But I absolutely agree with you about memory bandwidth-limited
applications.
Today we still have a choice.
Yours
Mikhail
>In particular, I looked
>at profiling
>done by the compiler from PGI, pgf90, it managed to vectorized some
>rather
>complicated arithmetic expressions. This suggests to me that more
>programs
>than in the past will efficiently use very long vectors for which the
>memory
>bandwidth is important.
>
>On this same theme, the programs that are impacted by bandwidth to
>main memory
>seem to hit a limit for single-core CPUs of about 2.0 GHz. Aside
>from the
>question of dual-core, what has been the experience of other people
>with
>regard to very fast single-core CPUs? For programs that have vectors
>longer
>than the size of L2 cache, is there a speed grade above which no gain
>is seen?
>
>Alan
>--
>
> Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
> Center for Advanced Studies, Research, and Development in Sardinia
>
> Postal Address: | Physical Address for FedEx, UPS,
>DHL:
> --------------- |
> -------------------------------------
> Alan Scheinine | Alan Scheinine
> c/o CRS4 | c/o CRS4
> C.P. n. 25 | Loc. Pixina Manna Edificio 1
> 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
>
> Email: scheinin at crs4.it
>
> Phone: 070 9250 238 [+39 070 9250 238]
> Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
> Operator at reception: 070 9250 1 [+39 070 9250 1]
> Mobile phone: 347 7990472 [+39 347 7990472]
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Wed Jul 6 13:59:33 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed, 06 Jul 2005 21:59:33 +0400
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core Opteron
275 performance]
In-Reply-To: <20050706144300.GN422@unthought.net>
Message-ID:
In message from Jakob Oestergaard (Wed, 6 Jul
2005 16:43:00 +0200):
>On Wed, Jul 06, 2005 at 04:12:25PM +0200, Alan Louis Scheinine wrote:
>...
>> My main point is, as I wrote, "your milage may vary." I've heard
>>from
>> various
>> people that "everybody is going to dual-core". I simply want to
>>emphasize
>> that
>> the dual-core choice is not for everybody. In particular, I looked
>>at
>> profiling
>
>Guys, the results are very interesting - it would be very interesting
>to
>know, too, which kernels you have been running on.
>
>For some time there have frequently been fixes to the NUMA code in
>general, and there's definitely been some changes for handling
>dual-core
>processors in the opteron case.
>
>If, and I don't know if this is the case, the kernels you have tested
>with do not properly understand the topology of your NUMA setup in
>dual-core (two processors (cores) per node (memory controller)),
>performance should definitely be expected to degrade noticably.
>
>I'm pretty sure at least 2.6.12 had NUMA/dual-core fixes in it.
>Anyway,
>if you're running older 2.6 kernels it would probably be interesting
>to
>try a more up-to-date kernel.
I'll try to test dual-core Opterons with both a 2.4 kernel and a 2.6 kernel,
but if I'm correct, SuSE 9.3 (which we plan to use) ships a somewhat older
2.6 :-(
BTW, might there be some performance difference between 2.4 and
the older 2.6 kernels?
Yours
Mikhail
>
>Thanks for the results and observations - it's a good read :)
>
>--
>
> / jakob
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From landman at scalableinformatics.com Wed Jul 6 14:47:52 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 6 Jul 2005 14:47:52 -0400
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core Opteron
275 performance]
In-Reply-To:
References: <20050706144300.GN422@unthought.net>
Message-ID: <20050706184349.M78132@scalableinformatics.com>
Hi Mikhail and Jakob:
We have done quite a bit of testing with dual core on a number of other
chemistry and informatics apps at customer sites, and written a white paper
about it. The summary (punch line) that we saw for Amber8 and GAMESS was that
dual-CPU dual-core results were effectively the same as quad-CPU single-core.
There are some cases where we expected memory contention to be an issue,
though we did not see this in our measurements.
I am sure it is possible to find codes and test cases which will fill the
bandwidth for the memory controller. The usual suspects (Amber8, GAMESS,
BLAST, HMMer) did not.
Joe
On Wed, 06 Jul 2005 21:59:33 +0400, Mikhail Kuzminsky wrote
> In message from Jakob Oestergaard (Wed, 6 Jul
> 2005 16:43:00 +0200):
> >On Wed, Jul 06, 2005 at 04:12:25PM +0200, Alan Louis Scheinine wrote:
> >...
> >> My main point is, as I wrote, "your milage may vary." I've heard
> >>from
> >> various
> >> people that "everybody is going to dual-core". I simply want to
> >>emphasize
> >> that
> >> the dual-core choice is not for everybody. In particular, I looked
> >>at
> >> profiling
> >
> >Guys, the results are very interesting - it would be very interesting
> >to
> >know, too, which kernels you have been running on.
> >
> >For some time there have frequently been fixes to the NUMA code in
> >general, and there's definitely been some changes for handling
> >dual-core
> >processors in the opteron case.
> >
> >If, and I don't know if this is the case, the kernels you have tested
> >with do not properly understand the topology of your NUMA setup in
> >dual-core (two processors (cores) per node (memory controller)),
> >performance should definitely be expected to degrade noticably.
> >
> >I'm pretty sure at least 2.6.12 had NUMA/dual-core fixes in it.
> >Anyway,
> >if you're running older 2.6 kernels it would probably be interesting
> >to
> >try a more up-to-date kernel.
> I'll try to test dual cores Opteron w/2.4 kernel and 2.6 kernel,
> but if I'm correct, SuSE 9.3 (we plan to use) has a bit more old 2.6
> :-(
> BTW, may be there some difference in tests performance between 2.4
> and more old 2.6 kernels ?
>
> Yours
> Mikhail
>
> >
> >Thanks for the results and observations - it's a good read :)
> >
> >--
> >
> > / jakob
> >
> >_______________________________________________
> >Beowulf mailing list, Beowulf at beowulf.org
> >To change your subscription (digest mode or unsubscribe) visit
> >http://www.beowulf.org/mailman/listinfo/beowulf
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
--
Scalable Informatics LLC
http://www.scalableinformatics.com
phone: +1 734 786 8423
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 6 21:28:02 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 6 Jul 2005 18:28:02 -0700
Subject: [Beowulf] [gorelsky@stanford.edu:
CCL:dual-core Opteron 275 performance]
In-Reply-To: <42CBE6C9.1070406@crs4.it>
References: <42CBE6C9.1070406@crs4.it>
Message-ID: <20050707012802.GC5018@greglaptop.internal.keyresearch.com>
On Wed, Jul 06, 2005 at 04:12:25PM +0200, Alan Louis Scheinine wrote:
> On this same theme, the programs that are impacted by bandwidth to
> main memory seem to hit a limit for single-core CPUs of about 2.0
> GHz.
I don't think so. For a given speed of memory, the STREAM bandwidth on
Opteron increases slightly as the cpu gets faster. So there's no magic
line at any cpu speed, it's all about what speed of memory you can
use.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From landman at scalableinformatics.com Wed Jul 6 22:21:00 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 6 Jul 2005 22:21:00 -0400
Subject: [Beowulf] need some testers for mpich2 rpms
Message-ID: <20050707021233.M48193@scalableinformatics.com>
Hi beowulfers:
I am working on debugging some RPMs we built for a project we are helping
out with. If you have time and a need to play with mpich2, please browse over
to http://downloads.scalableinformatics.com/downloads/mpich and pull down your
relevant mpich2 rpm (or the source rpm). Please send me email offline if you
get any build errors, install errors, or related problems. Yes, I know the
spec file is a hack of a hack. If you have any sage advice/patches, those
would be appreciated -- offline, of course, to avoid spamming folks here.
I am working on an Opteron build bug right now (doesn't work out of the box
on AMD64, due entirely to my hacking, and not an AMD64 issue), and should get
resolution on that soon. Our Itanium2 was knocked out by a recent storm, so
we may have some lag in getting an ia64 build tested.
Thanks.
Joe
--
Scalable Informatics LLC
http://www.scalableinformatics.com
phone: +1 734 786 8423
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mathog at mendel.bio.caltech.edu Thu Jul 7 14:23:19 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Thu, 07 Jul 2005 11:23:19 -0700
Subject: [Beowulf] CCL:dual-core Opteron
Message-ID:
> I am sure it is possible to find codes and test cases which will fill the
> bandwidth for the memory controller. The usual suspects (Amber8, GAMESS,
> BLAST, HMMer) did not.
Regarding HMMer, did you try running tests where you step through
a range of query sequence sizes (all vs. the same HMM database)?
On the Athlon MP 2200+ machines in our cluster there was
a large variation in run time resulting from complex cache interactions.
This processor has a relatively small cache and is presumably much
more sensitive to this than would be the Opterons. A substantial
amount of reorganization of the data structures minimized this effect
but at least initially going from n residues to n+1 residues in the
same sequence could change the run time for the query by a factor of 3.
You may be able to trigger the same effect on an Opteron by using
a largish query sequence, say 4000 amino acids or so. Admittedly
there are not very many proteins which are that large.
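If anyone wants to try this, a sketch (assuming HMMER 2's hmmpfam, a
single-sequence FASTA query, a Pfam-style HMM database, and GNU time --
all the file names here are placeholders):
#!/bin/sh
# time the same HMM database against growing prefixes of one query sequence
SEQ=query.fa
DB=Pfam_ls
for LEN in 1000 2000 3000 4000 4001; do
    # keep the FASTA header, truncate the residues to LEN
    awk -v n=$LEN 'NR==1 {print; next} {s = s $0}
                   END {print substr(s, 1, n)}' $SEQ > q_$LEN.fa
    /usr/bin/time -f "$LEN residues: %e s" hmmpfam $DB q_$LEN.fa > /dev/null
done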
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From ilya at gray-world.net Thu Jul 7 00:48:05 2005
From: ilya at gray-world.net (Ilya)
Date: Thu, 07 Jul 2005 09:48:05 +0500
Subject: [Beowulf] Bonding on the Gigabit Ethernet
Message-ID:
Hi all!
A little paper about Bonding on the Gigabit Ethernet :
http://tom.imm.uran.ru/~u1330/bonding.pdf
---------------------------------------
Audi, Vide, Tace, si vis vivere in pace
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lcsc-snicinteraction at nsc.liu.se Thu Jul 7 04:19:02 2005
From: lcsc-snicinteraction at nsc.liu.se (lcsc-snicinteraction at nsc.liu.se)
Date: Thu, 07 Jul 2005 10:19:02 +0200
Subject: [Beowulf]
LCSC and SNIC Interaction: Call for Abstracts (due Aug 25, 2005)
Message-ID:
=========================================================================
Call for Abstracts
Workshop on Linux Clusters for Super Computing (LCSC)
and
SNIC Interaction
17-19 October 2005
National Supercomputer Centre, Linköping, Sweden
http://www.nsc.liu.se/lcsc/
=========================================================================
LCSC and SNIC interaction welcome submission of abstracts in the field
of high performance computing.
LCSC gathers both people with experience of building and maintaining
clusters and people using or interested in using cluster and grid
resources. It serves as a meeting place for people from academia,
government and industry. The workshop will address the hardware and
software issues encountered during design as well as the efficiency,
portability and utilization of deployed applications. Topics of
interest include: new cluster and grid technologies for computing and
storage, software tools, scalability issues and benchmarking,
scheduling and accounting.
SNIC Interaction provides an open forum for users of the Swedish
National Infrastructure for Computing (SNIC). The meeting is intended to
stimulate interaction among the users of SNIC centers and between the
users of the center organizations. It will include: state of the art
HPC applications, software development, application specific
scalability issues and benchmarking, information on ongoing and future
development of SNIC infrastructure.
Abstract submissions are due August 25, 2005. Submissions are limited
to 300 word abstracts. The abstract may be submitted in one of the
following formats: plain text, PDF or MS Word. Send the abstract via
email to . Please indicate in your
submission if you prefer an oral (20 minutes) or a poster
presentation. The LCSC and SNIC interaction program committee will
select about 10 abstracts for oral presentations and the remaining
accepted contributions for poster presentations. For further
information please contact the Program Chair Peter Münger
.
More information about the conference is available at
http://www.nsc.liu.se/lcsc/
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Thu Jul 7 16:37:58 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Thu, 7 Jul 2005 13:37:58 -0700
Subject: [Beowulf] problem : mpi dynamic scheduling ??
In-Reply-To:
References: <20050704073344.75237.qmail@web32807.mail.mud.yahoo.com>
Message-ID: <20050707203758.GA3324@greglaptop.internal.keyresearch.com>
On Tue, Jul 05, 2005 at 01:04:03PM -0400, Mark Hahn wrote:
> in other words, large-tight apps are incompatible with meaningful
> load-balancing.
Mark,
There are a couple of examples of tightly coupled apps that do either
static or dynamic load balancing, but in most cases this load
balancing was programmed in by the programmer above the MPI level. For
example, ocean codes don't have to compute on land, so you can static
load balance by using different-sized rectangles for your data
decomposition. A system that does this invisibly is Charm++, used by
namd, a molecular dynamics code. In namd's case the workload at a
given point varies over time.
Another classic example which isn't solved is in weather forecasting.
Air columns with cumulus action (convection) take more time to compute
than air columns without. But the ratio of computation to data size
is small enough that it would require a really good interconnect to
enable load-balancing this part of the computation.
This is the kind of stuff that you'd never expect Mosix to magically
load-balance.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Thu Jul 7 17:38:38 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Thu, 7 Jul 2005 17:38:38 -0400 (EDT)
Subject: [Beowulf] problem : mpi dynamic scheduling ??
In-Reply-To: <20050707203758.GA3324@greglaptop.internal.keyresearch.com>
Message-ID:
> > in other words, large-tight apps are incompatible with meaningful
> > load-balancing.
>
> Mark,
>
> There are a couple of examples of tightly coupled apps that do either
> static or dynamic load balancing, but in most cases this load
> balancing was programmed in by the programmer above the MPI level. For
> example, ocean codes don't have to compute on land, so you can static
> load balance by using different-sized rectangles for your data
> decomposition.
absolutely - I meant "load balance" in the context of the original poster,
who was talking about openmosix moving programs to less-loaded nodes.
that is, process load-balancing, rather than domain-decomp-type balancing.
> A system that does this invisibly is Charm++, used by
> namd, a molecular dynamics code. In namd's case the workload at a
> given point varies over time.
indeed - my experience is that any *serious* simulation-type code
needs to do this kind of data-load balancing (adaptive domain decomposition).
often, the picture gets really interesting when you make timesteps
adaptive as well as spatial dimensions ;)
really, my comment is mainly to emphasize that for intensive MPI apps
(which means significant in size and fairly tightly-coupled)
you have to avoid jitter at all costs. here's a BEAUTIFUL paper:
http://www.sc-conference.org/sc2003/paperpdfs/pap301.pdf
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From tmattox at gmail.com Thu Jul 7 17:15:35 2005
From: tmattox at gmail.com (Tim Mattox)
Date: Thu, 7 Jul 2005 17:15:35 -0400
Subject: [Beowulf] Bonding on the Gigabit Ethernet
In-Reply-To:
References:
Message-ID:
Thank you for posting your paper.
If you still have your testing rig available, I would be interested to see
what happens to the performance numbers as you vary the value of
/proc/sys/net/ipv4/tcp_reordering
The node will presume that a packet has been lost if it sees
"tcp_reordering" packets arrive out of order.
tcp_reordering defaults to 3, and can go as high as 127.
At least this is my understanding of how this sysctl works. I haven't
had the appropriate test rig (or time) to really find out.
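The sweep could look like this (a sketch; run as root, and
./run_bonding_test stands in for whatever benchmark you use):
cat /proc/sys/net/ipv4/tcp_reordering        # default is 3
for n in 3 7 15 31 63 127; do
    echo $n > /proc/sys/net/ipv4/tcp_reordering
    echo "tcp_reordering = $n"
    ./run_bonding_test
done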
On 7/7/05, Ilya wrote:
> Hi all!
>
> A little paper about Bonding on the Gigabit Ethernet :
>
> http://tom.imm.uran.ru/~u1330/bonding.pdf
>
> ---------------------------------------
> Audi, Vide, Tace, si vis vivere in pace
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
--
Tim Mattox - tmattox at gmail.com
http://homepage.mac.com/tmattox/
I'm a bright... http://www.the-brights.net/
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From ilya at gray-world.net Fri Jul 8 00:04:16 2005
From: ilya at gray-world.net (Ilya)
Date: Fri, 08 Jul 2005 09:04:16 +0500
Subject: [Beowulf] Bonding on the Gigabit Ethernet
In-Reply-To:
Message-ID:
Yes, I've played with that when I realized that packets were arriving out of
order. But it doesn't help us much, for two reasons:
1) additional processor time is consumed sorting packets in the buffer on the
recipient side
2) the "congestion window". This parameter defines how many packets can be
sent without an ACK, so it influences the overall throughput of the TCP
stream. While the stream sees no DUP ACKs (or losses) the congestion window
keeps increasing, but when we receive a DUP ACK the congestion window
decreases, even without a retransmission.
Even artificially blocking the DUP ACKs doesn't help.
From the paper: "Experiments have shown that with artificial blocking of the
duplicated acknowledgements the throughput of bonding on two Gigabit
interfaces may rise from 0.9 Gb/sec up to 1.3 Gb/sec. The expected 1.8 Gb/sec
was not reached due to the large amount of time spent sorting packets in the
buffer on the recipient side. Thus, for a system containing 2 processors of
2.4 GHz each, the throughput of a network with two Gigabit interfaces and
standard bonding peaks at 1.3 Gb/sec." That is why it is so important that
the recipient side can receive the TCP packets in order.
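(For anyone repeating this, the receiver-side effects show up in the stack
counters; a sketch -- the exact strings differ between kernel versions:)
# watch the retransmit/reordering counters grow while the stream runs
netstat -s | egrep -i 'retrans|reorder'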
On 7/7/2005, "Tim Mattox" wrote:
>Thank you posting your paper.
>
>If you still have your testing rig available, I would be interested to see
>what happens to the performance numbers as you vary the value of
>
>/proc/sys/net/ipv4/tcp_reordering
>
>The node will presume that a packet has been lost if it
>sees "tcp_reording" number of packets out of order.
>tcp_reordering defaults to 3, and can go as high as 127.
>At least this is my understanding of how this sysctl works. I haven't
>had the appropriate test rig (or time) to really find out.
>
>On 7/7/05, Ilya wrote:
>> Hi all!
>>
>> A little paper about Bonding on the Gigabit Ethernet :
>>
>> http://tom.imm.uran.ru/~u1330/bonding.pdf
>>
>> ---------------------------------------
>> Audi, Vide, Tace, si vis vivere in pace
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>--
>Tim Mattox - tmattox at gmail.com
> http://homepage.mac.com/tmattox/
> I'm a bright... http://www.the-brights.net/
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
---------------------------------------
Audi, Vide, Tace, si vis vivere in pace
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Fri Jul 8 19:59:32 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 8 Jul 2005 16:59:32 -0700
Subject: [Beowulf] BWBUG July 12th -- new meeting location (downtown
Washington, DC)
Message-ID: <20050708235932.GA3685@greglaptop.internal.keyresearch.com>
Mike Fitzmaurice asked me to forward this note. I'm not so sure I can
agree with the "most sought after" bit, but I can promise that the
talk will be interesting.
----------------------------------------------------------------------
The next meeting will be at a new location: Tuesday, July 12th, from 2:00
to 5:00 PM, at 2025 M Street NW, Washington DC 20036. The meeting
will feature a series of technical briefings featuring Greg Lindahl,
Distinguished Engineer and Founder at PathScale. Greg is one of the most
sought-after speakers on HPC. Greg will also speak about the new Dual
Core Opteron Processor. This will be the seventh event in a series of
what we describe as "THE LINUX CLUSTER FORUM" which will focus on
complete Linux Cluster solutions for mission critical applications and
production research.
Please Note Venue Change for the July 12th Meeting which will be at
Radio Free Asia in downtown DC. This is the same site that hosts the
DCLUG meetings. Directions are available on the DCLUG site:
Live video stream in ogg theora format will be available at:
http://streamer0.rfa.org:8000/bwbug_video.ogg
The camera operator will be available in #bwbug on irc.freenode.net to
relay questions from remote attendees.
Please click on the link to register
http://bwbug.org/Meeting_Registration.php
Michael Fitzmaurice
BWBUG
703-502-2904 (o)
703-973-9054 (c)
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From turuncu at be.itu.edu.tr Fri Jul 8 16:49:06 2005
From: turuncu at be.itu.edu.tr (turuncu at be.itu.edu.tr)
Date: Fri, 8 Jul 2005 23:49:06 +0300 (EEST)
Subject: [Beowulf] hybrid (openmp+mpi) job submit
Message-ID: <49533.85.96.105.231.1120855746.squirrel@www.be.itu.edu.tr>
hi,
I am trying to run a job that is parallelized using the OpenMP and MPI
programming interfaces (hybrid). I need the MPI job on each node to run as an
OpenMP job; for this reason, I have to define the OMP_NUM_THREADS environment
variable on each of the nodes. First I tried to put it into the .profile
file, but that was not successful. I also tried to write an LSF job script
and failed there too. The LSF script follows:
#!/bin/ksh
#BSUB -J MM5_RUN # job name
#BSUB -n 2 # sum of number of tasks
#BSUB -R "span[ptile=1]" # number of processes per node
#BSUB -m "cn07 cn08" # run host
#BSUB -o mm5lsf.out # output file name
#BSUB -q cigq # queue name
#BSUB -L /bin/bash #
#BSUB -E "export OMP_NUM_THREADS=2"
. ${PWD}/mm5.deck.par
time mpirun -np 2 -machinefile ../machfile ./mm5.mpp
In this case the job runs on each of the specified nodes in single-processor
mode (except on the execution host, because that is the same machine I am
logged in to, so the OMP_NUM_THREADS environment variable comes from the
.profile file).
How can I run a command (or script) on each node just before running the MPI
executable?
thanks,
Ufuk Utku Turuncoglu
Istanbul Technical University
Informatics Institute
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From green at sectorb.msk.ru Sun Jul 10 06:13:24 2005
From: green at sectorb.msk.ru (Alexander Zubkov)
Date: Sun, 10 Jul 2005 14:13:24 +0400
Subject: [Beowulf] queue management systems survey
Message-ID: <42D0F4C4.50209@sectorb.msk.ru>
Hi all!
I am doing an analysis of queue management systems and would be glad to
hear answers to the survey I have prepared. I know that specifications
and characteristics can be "extracted" from the official manuals, but here
I am interested in practical cases. Thanks in advance.
1) Which queuing system do you use on your cluster (OpenPBS, Torque, Sun
Grid Engine, ...)? The main interest is the queuing system, but the
scheduler is interesting too.
2) Characteristics of your cluster:
- number of nodes
- type of nodes (workstations/dedicated, homogeneous/heterogeneous, cpus)
3) Queue characteristics:
- number of users
- average number of tasks in queue
- average number of running tasks
- only processor time taken into account or other resources too (memory,
disk space, ...)
4) Common task characteristics:
- how many processors are required
- how much time is required
- type of parallelism: exclusive use of the assigned processors / something
like the SETI@home or United Devices projects
5) What are the pros and cons of this queuing system? Why was it chosen
over others for your cluster?
PS. Some of these questions relate to the scheduler too, but I need them
mostly to gauge the capabilities of the queuing system. I.e., even if
we have an outstanding scheduler and the queuing system is bad, the result
will be bad too.
----
Alexander Zubkov
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hendrata at ilovejesus.net Mon Jul 11 05:56:28 2005
From: hendrata at ilovejesus.net (hendra tampang allo)
Date: 11 Jul 2005 16:56:28 +0700
Subject: [Beowulf] [Fwd: Problem with Beowulf]
Message-ID: <1121075786.28653.5.camel@master>
-------------- next part --------------
An embedded message was scrubbed...
From: 13200178 Hendra Tampang Allo
Subject: Problem with Beowulf
Date: Mon, 11 Jul 2005 16:37:26 +0700 (WIT)
Size: 4697
URL:
-------------- next part --------------
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mjoyner at vbservices.net Sun Jul 10 19:55:14 2005
From: mjoyner at vbservices.net (Michael Joyner)
Date: Sun, 10 Jul 2005 19:55:14 -0400
Subject: [Beowulf] SuSE 9.3
Message-ID: <42D1B562.90104@vbservices.net>
Is there any how-to or guide for setting up a cluster using SuSE 9.2 or 9.3?
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From cap at nsc.liu.se Mon Jul 11 04:49:58 2005
From: cap at nsc.liu.se (Peter =?iso-8859-1?q?Kjellstr=F6m?=)
Date: Mon, 11 Jul 2005 10:49:58 +0200
Subject: [Beowulf] WRF model on linux cluster: Mpi problem
In-Reply-To: <42CB90E0.1080801@ccrl-nece.de>
References: <3.0.32.20050630145204.011261c0@pop3.xs4all.nl>
<1120504420.5106.63.camel@localhost.localdomain>
<42CB90E0.1080801@ccrl-nece.de>
Message-ID: <200507111049.59273.cap@nsc.liu.se>
I agree with Joachim, something's wrong. A decent network should give you < 35
us latency and > 70 MiB/s bandwidth at larger packet sizes. Here are some
things to check (a command-line sketch follows the list):
* if you use e1000, set InterruptThrottleRate=0
* if you force speed and duplex, make sure it's a 100% forced configuration
on both ends; don't mix in autoneg in any way
* have a look at ifconfig (error counters) and check dmesg for ugliness
* test between some other nodes; is it the same?
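For example (a sketch only; the interface name, speed, and option spelling
are assumptions to adapt to your setup):
# e1000: turn off interrupt throttling via the module option
# (do this from the console, not over that NIC)
rmmod e1000 && modprobe e1000 InterruptThrottleRate=0
# force speed/duplex the same way on the NIC *and* on the switch port
ethtool -s eth0 speed 100 duplex full autoneg off
# then look at error counters and kernel messages
ifconfig eth0 | grep -i errors
dmesg | tail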
good luck,
Peter
On Wednesday 06 July 2005 10.05, Joachim Worringen wrote:
> Federico Ceccarelli wrote:
> > Hi,
> >
> > I did the Pallas benchmark...after removing openmosix...here are the
> > ping-pong and ping-ping results...for 2 processes
> > What do you think about them?
> > Why the bandwidth is raising and decreasing many times as the #bytes
> > grow?
>
> The latency is quite high and grows too fast from 0 to 8 bytes. But most of
> all, the bandwidth should be constant with at least 80MB/s, better >100MB/s
> for large messages. Something is wrong, probably with your ethernet setup.
> I remember such effects for half/full-duplex mismatch with 100Mb auto
> negotiation problems between NIC and switch.
>
> Joachim
--
------------------------------------------------------------
Peter Kjellström |
National Supercomputer Centre |
Sweden | http://www.nsc.liu.se
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From toon.knapen at fft.be Mon Jul 11 08:23:53 2005
From: toon.knapen at fft.be (Toon Knapen)
Date: Mon, 11 Jul 2005 14:23:53 +0200
Subject: [Beowulf] MorphMPI idea in reality ?
Message-ID: <42D264D9.4020003@fft.be>
Hi all,
You might remember the very intriguing discussions we had on an MPI ABI.
During this discussion the idea of a 'MorphMPI' was launched
(http://www.open-mpi.org/community/lists/users/2005/03/0028.php).
I for one like the idea because it allows us to launch applications
using MPI implementations that are different from the MPI implementation
used for compiling the application.
We are seriously looking into going this route, but first I wanted to ask
whether anybody has already implemented such a solution (and possibly
open-sourced it).
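To make the idea concrete, the workflow we are hoping for would look
roughly like this (every name below is hypothetical, since as far as I
know no MorphMPI implementation exists yet):

  # build once against the thin fixed-ABI layer instead of a real MPI
  cc -o app app.c -lmorphmpi

  # at run time MorphMPI loads the real MPI library and translates
  # calls and handles, so the very same binary runs on either one:
  MORPHMPI_BACKEND=libmpich.so ./app
  MORPHMPI_BACKEND=liblam.so   ./app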
toon
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From cap at nsc.liu.se Mon Jul 11 04:43:45 2005
From: cap at nsc.liu.se (Peter =?iso-8859-9?q?Kjellstr=F6m?=)
Date: Mon, 11 Jul 2005 10:43:45 +0200
Subject: [Beowulf] hybrid (openmp+mpi) job submit
In-Reply-To: <49533.85.96.105.231.1120855746.squirrel@www.be.itu.edu.tr>
References: <49533.85.96.105.231.1120855746.squirrel@www.be.itu.edu.tr>
Message-ID: <200507111043.55492.cap@nsc.liu.se>
Hello,
There are quite a few ways to set the environment for remote mpi-processes in
batch-jobs but they are often ugly or specific (or both). The root problem
here is that simple batch-systems simply run your batch script on the first
node and that's it. Then mpirun is responsible for starting the rest of
the processes (often using ssh or, horrible thought, rsh..).
Here are a few ways to make it work (setting the environment variable that is,
not actually running a hybrid mpi/openmp app ;-). See also the wrapper
sketch below.
* try setting it hard in your .bashrc if you use bash (ugly)
* use LSF's tight mpi-integration (specific to LSF)
* use an MPI with an mpirun that propagates the environment
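As a minimal sketch of the wrapper-script variant (file name and thread
count are just examples; forwarding "$@" matters because some mpiruns pass
extra arguments to the program they start):

  #!/bin/sh
  # wrap.sh -- set per-node environment, then exec the real binary
  export OMP_NUM_THREADS=2
  exec ./mm5.mpp "$@"

and then:  mpirun -np 2 -machinefile ../machfile ./wrap.sh
Since mpirun starts the wrapper on every node, the variable gets set
everywhere, independent of .profile/.bashrc quirks.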
-- tosses .02 swedish kronor into the thread
On Friday 08 July 2005 22.49, turuncu at be.itu.edu.tr wrote:
> hi,
>
> i try to run a job that is parallelized using openmp and mpi programming
> interfaces (hybrid). I need to run mpi jobs in each node as an openmp job.
> for this reason, i have to define OMP_NUM_THREADS environment variable for
> each one of the node. first i try to put it into .profile file but it is
> not sucessful. also i try to write an LSF job script and i fail too. The
> LSF script as fallows,
>
> #!/bin/ksh
> #BSUB -J MM5_RUN # job name
> #BSUB -n 2 # sum of number of tasks
> #BSUB -R "span[ptile=1]" # number of processes per node
> #BSUB -m "cn07 cn08" # run host
> #BSUB -o mm5lsf.out # output file name
> #BSUB -q cigq # queue name
> #BSUB -L /bin/bash #
> #BSUB -E "export OMP_NUM_THREADS=2"
>
> . ${PWD}/mm5.deck.par
> time mpirun -np 2 -machinefile ../machfile ./mm5.mpp
>
> in this case. job run in each of the specified node as a single processor
> mode (except execution host, because it is same machine which is login in
> and OMP_NUM_THREADS environment variable comes from .profile file).
>
> how can i run a command (or script) in each node just before runing mpi
> executable?
>
> thanks,
>
> Ufuk Utku Turuncoglu
> Istanbul Technical University
> Informatics Institute
>
--
------------------------------------------------------------
Peter Kjellström |
National Supercomputer Centre |
Sweden | http://www.nsc.liu.se
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Mon Jul 11 10:27:07 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Mon, 11 Jul 2005 16:27:07 +0200 (CEST)
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <42D1B562.90104@vbservices.net>
Message-ID: <20050711162555.O95360-100000@xs2.xs4all.nl>
Suse 9.3 failed to install on this quad Opteron dual-core 1.8GHz.
Suse 9.0 works fine for beowulf (I'm connected from that machine).
On Sun, 10 Jul 2005, Michael Joyner wrote:
> Is there any how-to or guide for setting up a cluster using SuSE 9.2 or 9.3?
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From scheinin at crs4.it Mon Jul 11 11:40:14 2005
From: scheinin at crs4.it (Alan Louis Scheinine)
Date: Mon, 11 Jul 2005 17:40:14 +0200
Subject: [Beowulf] [gorelsky@stanford.edu:
CCL:dual-core Opteron 275performance]
In-Reply-To: <42CBF85C.1000403@crs4.it>
References: <42CBE6C9.1070406@crs4.it>
<20050706144300.GN422@unthought.net> <42CBF85C.1000403@crs4.it>
Message-ID: <42D292DE.5060209@crs4.it>
Jakob Oestergaard wrote:
> Guys, the results are very interesting - it would be very interesting to
> know, too, which kernels you have been running on.
> ... I'm pretty sure at least 2.6.12 had NUMA/dual-core fixes in it.
In all cases, 4 MPI processes on a machine with 4 cores (two dual-core CPUs).
                                           CPU time   real time (seconds)
Meteorology program 1, "bolam"
  Linux kernel 2.6.9-11.ELsmp                 122          128
  Linux kernel 2.6.12.2                        64           77
Meteorology program 2, "non-hydrostatic"
  Linux kernel 2.6.9-11.ELsmp                 598          544
  Linux kernel 2.6.12.2                       430          476
The new information is the kernel 2.6.12.2 that I installed today.
Yes, the new kernel helps quite a bit.
-- Alan
Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
Center for Advanced Studies, Research, and Development in Sardinia
Postal Address: | Physical Address for FedEx, UPS, DHL:
--------------- | -------------------------------------
Alan Scheinine | Alan Scheinine
c/o CRS4 | c/o CRS4
C.P. n. 25 | Loc. Pixina Manna Edificio 1
09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
Email: scheinin at crs4.it
Phone: 070 9250 238 [+39 070 9250 238]
Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
Operator at reception: 070 9250 1 [+39 070 9250 1]
Mobile phone: 347 7990472 [+39 347 7990472]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Mon Jul 11 13:37:16 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Mon, 11 Jul 2005 21:37:16 +0400
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core Opteron
275performance]
In-Reply-To: <42D292DE.5060209@crs4.it>
Message-ID:
In message from Alan Louis Scheinine (Mon, 11 Jul
2005 17:40:14 +0200):
>
>Jakob Oestergaard wrote:
> > Guys, the results are very interesting - it would be very interesting to
> > know, too, which kernels you have been running on.
> > ... I'm pretty sure at least 2.6.12 had NUMA/dual-core fixes in it.
>
>In all cases, 4 MPI processes on a machine with 4 cores (two dual-core CPUs).
>
>Meteorology program 1, "bolam" CPU time, real time (in seconds)
> Linux kernel 2.6.9-11.ELsmp 122 128
> Linux kernel 2.6.12.2 64 77
>
>Meteorology program 2, "non-hydrostatic"
> Linux kernel 2.6.9-11.ELsmp 598 544
> Linux kernel 2.6.12.2 430 476
>
>The new information is the kernel 2.6.12.2 that I installed today.
>Yes, the new kernel helps quite a bit.
Sorry, do you have "node interleave memory" (in BIOS) switched off?
Yours
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow
>
>-- Alan
>
> Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
> Center for Advanced Studies, Research, and Development in Sardinia
>
> Postal Address: | Physical Address for FedEx, UPS,
>DHL:
> --------------- |
> -------------------------------------
> Alan Scheinine | Alan Scheinine
> c/o CRS4 | c/o CRS4
> C.P. n. 25 | Loc. Pixina Manna Edificio 1
> 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
>
> Email: scheinin at crs4.it
>
> Phone: 070 9250 238 [+39 070 9250 238]
> Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
> Operator at reception: 070 9250 1 [+39 070 9250 1]
> Mobile phone: 347 7990472 [+39 347 7990472]
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From brian at cypher.acomp.usf.edu Mon Jul 11 11:17:04 2005
From: brian at cypher.acomp.usf.edu (Brian R Smith)
Date: Mon, 11 Jul 2005 11:17:04 -0400
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <42D1B562.90104@vbservices.net>
References: <42D1B562.90104@vbservices.net>
Message-ID: <1121095025.16903.32.camel@daemon.acomp.usf.edu>
Michael,
I would forget trying to find a single comprehensive guide for a
specific distribution as many of these guides, from what I have seen,
make certain assumptions about your situation that simply aren't
correct.
Usually, setting up a cluster is fairly distribution-independent, meaning
it really doesn't matter if you're using SuSE or RedHat or one of the
myriad distributions out there.
SuSE does come with a few helpful packages like mpich/lam and queuing
software like OpenPBS, but in my experience, you are always better off
following a more generic model: build it yourself.
If you're running with SuSE, check out autoyast2
http://forgeftp.novell.com////yast/doc/SLES9/autoinstall/ref.html for
building your compute nodes. If you need to run parallel software over
MPI, there are many resources available for setting up MPI and many of
these offer help with configuring basic services like rsh/ssh to work
correctly in these setups. Here, google is your friend. Look for
'mpich'. There is plenty of documentation goodness on job scheduling
(and its respective software(s)) and a good deal of general systems
administration principles still apply to running a cluster.
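(As a minimal sketch of the ssh part, assuming home directories are shared
across the nodes via NFS:

  # generate a passphrase-less key and authorize it on all nodes
  ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys

Without shared homes you would copy the public key to each node instead.)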
Also, read Robert Brown's book
http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php
and get an idea of exactly what it is you need to do because "setting up
a cluster using x" is a very generic question. I've made the assumption
that you meant "HPC" or "High-Performance Computing", but we all know
about assuming things... Is this HPC, HTC, or HA? Are you trying to run
scientific codes in parallel or are you trying to cluster an Oracle
database?
-Brian
On Sun, 2005-07-10 at 19:55 -0400, Michael Joyner wrote:
> Is there any how-to or guide for setting up a cluster using SuSE 9.2 or 9.3?
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mjoyner at vbservices.net Mon Jul 11 11:32:36 2005
From: mjoyner at vbservices.net (Michael Joyner)
Date: Mon, 11 Jul 2005 11:32:36 -0400
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <1121095025.16903.32.camel@daemon.acomp.usf.edu>
References: <42D1B562.90104@vbservices.net>
<1121095025.16903.32.camel@daemon.acomp.usf.edu>
Message-ID: <42D29114.4060103@vbservices.net>
Brian R Smith wrote:
> Michael,
>
> I would forget trying to find a single comprehensive guide for a
> specific distribution as many of these guides, from what I have seen,
> make certain assumptions about your situation that simply aren't
> correct.
>
> Usually, setting up a cluster is fairly distribution-independent meaning
> it really doesn't matter if you're using SuSE or RedHat or one of the
> many myriads of distributions out there.
After talking it over with the professor, I think we are going to use
Fedora or Mandriva + OSCAR.
>
> SuSE does come with a few helpful packages like mpich/lam and queuing
> software like OpenPBS, but in my experience, you are always better off
> following a more generic model: build it yourself.
We were initially looking at SuSE because that is what we have
everywhere else. :)
>
> If you're running with SuSE, check out autoyast2
> http://forgeftp.novell.com////yast/doc/SLES9/autoinstall/ref.html for
> building your compute nodes. If you need to run parallel software over
> MPI, there are many resources available for setting up MPI and many of
> these offer help with configuring basic services like rsh/ssh to work
> correctly in these setups. Here, google is your friend. Look for
> 'mpich'. There is plenty of documentation goodness on job scheduling
> (and its respective software(s)) and a good deal of general systems
> administration principles still apply to running a cluster.
>
> Also, read Robert Brown's book
> http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php
> and get an idea of exactly what it is you need to do because "setting up
> a cluster using x" is a very generic question. I've made the assumption
> that you meant "HPC" or "High-Performance Computing", but we all know
> about assuming things... Is this HPC, HTC, or HA? Are you trying to run
> scientific codes in parallel or are you trying to cluster an Oracle
> database?
>
>
>
> -Brian
>
> On Sun, 2005-07-10 at 19:55 -0400, Michael Joyner wrote:
>>Is there any how-to or guide for setting up a cluster using SuSE 9.2 or 9.3?
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From gerry.creager at tamu.edu Mon Jul 11 16:28:06 2005
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Mon, 11 Jul 2005 15:28:06 -0500
Subject: [Beowulf]
[gorelsky@stanford.edu: CCL:dual-core Opteron 275performance]
In-Reply-To: <42D292DE.5060209@crs4.it>
References:
<42CBE6C9.1070406@crs4.it> <20050706144300.GN422@unthought.net>
<42CBF85C.1000403@crs4.it> <42D292DE.5060209@crs4.it>
Message-ID: <42D2D656.4020600@tamu.edu>
Hoowa! Excellent information.
Thanks,
Gerry
Alan Louis Scheinine wrote:
>
> Jakob Oestergaard wrote:
> > Guys, the results are very interesting - it would be very interesting to
> > know, too, which kernels you have been running on.
> > ... I'm pretty sure at least 2.6.12 had NUMA/dual-core fixes in it.
>
> In all cases, 4 MPI processes on a machine with 4 cores (two dual-core
> CPUs).
>
> Meteorology program 1, "bolam" CPU time, real time (in seconds)
> Linux kernel 2.6.9-11.ELsmp 122 128
> Linux kernel 2.6.12.2 64 77
>
> Meteorology program 2, "non-hydrostatic"
> Linux kernel 2.6.9-11.ELsmp 598 544
> Linux kernel 2.6.12.2 430 476
>
> The new information is the kernel 2.6.12.2 that I installed today.
> Yes, the new kernel helps quite a bit.
>
> -- Alan
>
> Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
> Center for Advanced Studies, Research, and Development in Sardinia
>
> Postal Address: | Physical Address for FedEx, UPS, DHL:
> --------------- | -------------------------------------
> Alan Scheinine | Alan Scheinine
> c/o CRS4 | c/o CRS4
> C.P. n. 25 | Loc. Pixina Manna Edificio 1
> 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
>
> Email: scheinin at crs4.it
>
> Phone: 070 9250 238 [+39 070 9250 238]
> Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
> Operator at reception: 070 9250 1 [+39 070 9250 1]
> Mobile phone: 347 7990472 [+39 347 7990472]
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020
FAX: 979.847.8578 Pager: 979.228.0173
Office: 903A Eller Bldg, TAMU, College Station, TX 77843
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Mon Jul 11 17:06:54 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Mon, 11 Jul 2005 22:06:54 +0100
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <42D29114.4060103@vbservices.net>
References: <42D1B562.90104@vbservices.net>
<1121095025.16903.32.camel@daemon.acomp.usf.edu>
<42D29114.4060103@vbservices.net>
Message-ID: <1121116014.5923.1.camel@vigor13>
On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
> Brian R Smith wrote:
>
> >
> > SuSE does come with a few helpful packages like mpich/lam and queuing
> > software like OpenPBS, but in my experience, you are always better off
> > following a more generic model: build it yourself.
> We were initially looking at SuSE because that is what we have
> everywhere else. :)
Well, use SuSE on your cluster then, if that is the distro which you are
most used to.
Personally, I would shy away from Fedora, much though I have a liking
for Redhat and have used it for years.
I agree with the advice though to build your own packages rather than
relying on the RPMs.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mjoyner at vbservices.net Mon Jul 11 17:40:41 2005
From: mjoyner at vbservices.net (Michael Joyner)
Date: Mon, 11 Jul 2005 17:40:41 -0400
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <1121116014.5923.1.camel@vigor13>
References: <42D1B562.90104@vbservices.net> <1121095025.16903.32.camel@daemon.acomp.usf.edu> <42D29114.4060103@vbservices.net>
<1121116014.5923.1.camel@vigor13>
Message-ID: <42D2E759.1090008@vbservices.net>
After discussing it with the physics professor, we have decided to try
Fedora 2 + OSCAR.
Wish me luck! :)
John Hearns wrote:
> On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
>>Brian R Smith wrote:
>
>>>SuSE does come with a few helpful packages like mpich/lam and queuing
>>>software like OpenPBS, but in my experience, you are always better off
>>>following a more generic model: build it yourself.
>>We were initially looking at SuSE because that is what we have
>>everywhere else. :)
> Well, use SuSE on your cluster then, if that is the distro which you are
> most used to.
> Personally, I would shy away from Fedora, much though I have a liking
> for Redhat hand have used it for years.
>
> I agree with the advice though to build your own packages rather than
> relying on the RPMs.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kinghorn at pqs-chem.com Mon Jul 11 19:04:35 2005
From: kinghorn at pqs-chem.com (Don Kinghorn)
Date: Mon, 11 Jul 2005 18:04:35 -0500
Subject: [Beowulf] SuSE 9.3 and dual core Opterons
Message-ID: <200507111804.35862.kinghorn@pqs-chem.com>
I'll start this thread to kill two birds with one stone ... gee that's not a
very nice old saying is it :-)
To answer the question about using SuSE 9.3 -- yes, you can use it in general,
and yes, you can use it with dual-core Opterons too.
However, I had to jump through some insane hoops to get it running on a Tyan
S2891 board with two dual-core 2.0GHz Opterons and 4GB mem ....
The results were quite satisfying though. We benchmarked our computational
chemistry software (PQS) against a 2 node dual Opteron setup with the same
general configuration. [2x Tyan S2875 and 2.0GHz Opterons, 2GB per board, SuSE
9.2]
We got better performance for most 4 process parallel (pvm) jobs on the
dual-core system than on the two-node setup. The only slower jobs were MP2
jobs that did heavy disk i/o but this was expected. ... I have SATA-II drives
on order :-)
The main problem with the SuSE 9.3 install on the dual-core setup was that the
install didn't create the correct initrd to boot the system. I've worked
around it using the SuSE 9.3 Live CD (which does boot and run fine), by
mounting the installed system, chroot'ing into it and fixing the initrd.
From now on I'll clone it.
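For the archives, the repair went roughly like this (device names are from
my box, yours will differ):

  # booted from the SuSE 9.3 Live CD
  mount /dev/sda2 /mnt              # the installed root partition
  mount --bind /dev /mnt/dev
  mount -t proc proc /mnt/proc
  chroot /mnt /bin/bash
  mkinitrd                          # rebuild /boot/initrd for the installed kernel
  exit
  umount /mnt/proc /mnt/dev /mnt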
I'll be doing our cluster systems with the new Tyan board and dual-core
Opterons in a configuration running SuSE 9.3 for our next hardware
iteration starting in September.
Hope this is helpful info.
Best wishes
-Don
--
Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC
http://www.pqs-chem.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Tue Jul 12 02:58:23 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Tue, 12 Jul 2005 07:58:23 +0100
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <42D2E759.1090008@vbservices.net>
References: <42D1B562.90104@vbservices.net>
<1121095025.16903.32.camel@daemon.acomp.usf.edu>
<42D29114.4060103@vbservices.net> <1121116014.5923.1.camel@vigor13>
<42D2E759.1090008@vbservices.net>
Message-ID: <1121151504.5923.3.camel@vigor13>
On Mon, 2005-07-11 at 17:40 -0400, Michael Joyner wrote:
> After discussing it with the physics professor, we have decided to try
> Fedora 2 + OSCAR.
>
> Wish me luck! :)
Luck. But you are doolally.
Fedora is a distribution with a short lifetime.
And believe me I know - I have used Fedora since the beginning.
I currently run Fedora 3 on this laptop, which was installed in
December. It is already superseded.
Please, please consider using Scientific Linux
http://www.scientificlinux.org
you'll get updates for that for several years.
Version 4 is out with a 2.6 kernel.
(as an aside, unless you have an overwhelming need, why use a 2.4 kernel
on a new cluster?)
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From becker at scyld.com Tue Jul 12 03:14:45 2005
From: becker at scyld.com (Donald Becker)
Date: Tue, 12 Jul 2005 03:14:45 -0400 (EDT)
Subject: [Beowulf] July 12 2005 BWBUG / Linux Cluster Forum meeting,
2pm D.C.
Message-ID:
Baltimore Washington Beowulf User Group
presents the 7th
LINUX CLUSTER FORUM
BWBUG / LINUX CLUSTER FORUM
July 2005 Meeting
--- Special Notes:
- This month's meeting is at *** Radio Free Asia in downtown DC ***
- Live video streaming is planned
- See http://www.bwbug.org/ for full information and any corrections
BWBUG Meeting
Date: Tuesday 12 July 2005 at 2:00 pm.
Location: Radio Free Asia 2025 M Street NW Washington DC 20036
Speakers: Greg Lindahl of PathScale
Essential questions:
Need to be a member?
No, guests are welcome. Please register on the web site
Parking and parking fees:
Easy access by D.C. Metro on the blue, orange and red lines
Paid lot and metered on-street parking
Also as usual, the organizer and host for the meeting is
T. Michael Fitzmaurice
The next meeting will be at a new location. On Tuesday July 12th the
time will be 2:00 to 5:00 PM at 2025 M Street NW Washington DC 20036 and
will feature a series of technical briefings featuring Greg Lindahl,
Distinguished Engineer and Founder at PathScale. Greg is one of the most
sought-after speakers on HPC. Greg will also speak about the new Dual
Core Opteron Processor. This will be the seventh event in a series of
what we describe as "THE LINUX CLUSTER FORUM" which will focus on
complete Linux Cluster solutions for mission critical applications and
production research.
Please Note Venue Change for the July 12th Meeting which will be at
Radio Free Asia in downtown DC. This is the same site that hosts the
DCLUG meetings. Directions are available on the DCLUG site:
Live video stream in ogg theora format will be available at:
http://streamer0.rfa.org:8000/bwbug_video.ogg
The camera operator will be available in #bwbug on irc.freenode.net to
relay questions from remote attendees.
Please click on the link to register
http://bwbug.org/Meeting_Registration.php
Donald Becker becker at scyld.com
Scyld Software Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220 www.scyld.com
Annapolis MD 21403 410-990-9993
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Tue Jul 12 05:08:26 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Tue, 12 Jul 2005 11:08:26 +0200
Subject: [Beowulf] SuSE 9.3
Message-ID: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
Fedora Core 2 I already tried, and when installed on my dual it was wasting
cpu time for nothing. The worst distribution ever. Not worth downloading if
your intentions are more than 'just run linux'. If you need to run
applications that will eat system time, Fedora Core is the worst choice.
In general Suse and Redhat are deteriorating; only their commercial product
lines might be doing fine, which are, what is it, $1500 a piece or so in
the case of Redhat?
Suse 9.3 was a waste of money. It doesn't even install correctly. Either you
get 'kernel panic', or some file system stuff goes wrong.
Amazingly, Suse 9.0 on the same machine worked fine (but of course 2.4.x is
the wrong kernel for a quad Opteron, so I must upgrade that).
Has anyone actually tried opensolaris.org and downloaded their compiler at
http://opensolaris.org/os/community/tools/sun_studio_tools/
Or is this all a big commercial show from Sun?
At 05:40 PM 7/11/2005 -0400, Michael Joyner wrote:
>After discussing it with the physics professor, we have decided to try
>Fedora 2 + OSCAR.
>
>Wish me luck! :)
>
>John Hearns wrote:
>> On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
>>>Brian R Smith wrote:
>>
>>>>SuSE does come with a few helpful packages like mpich/lam and queuing
>>>>software like OpenPBS, but in my experience, you are always better off
>>>>following a more generic model: build it yourself.
>>>We were initially looking at SuSE because that is what we have
>>>everywhere else. :)
>> Well, use SuSE on your cluster then, if that is the distro which you are
>> most used to.
>> Personally, I would shy away from Fedora, much though I have a liking
>> for Redhat hand have used it for years.
>>
>> I agree with the advice though to build your own packages rather than
>> relying on the RPMs.
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Tue Jul 12 05:22:11 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Tue, 12 Jul 2005 11:22:11 +0200
Subject: [Beowulf] dual core Opteron performance - re suse 9.3
Message-ID: <3.0.32.20050712112207.01331800@pop3.xs4all.nl>
Good morning Don,
If you did get better performance, that's possibly because
you now have some 2.6.x kernel, allowing NUMA, and a newer
compiler version of gcc like 4.0.1, which has been bugfixed more than
the very buggy 3.3.x and 3.4.x series.
Can you show us the differences between the compiler versions and kernel
versions you had and whether it's NUMA?
Also, how are your memory banks configured: for 64-bit usage or 128-bit
single-cpu usage, or are all banks filled up?
When using all cpu's simultaneously obviously 64 bits is faster.
I know for Diep the NUMA part at dual opteron matters about 50% in speed.
For the version in question it was around 170-180k nps versus NUMA 255k nps.
Windows XP 64 NUMA reached 270k nps, but that's because the compiler is
superior to GCC.
Did you try opensolaris already and their free compiler tools?
www.opensolaris.org
http://opensolaris.org/os/community/tools/sun_studio_tools/
I'm interested in speed comparisons of it from objective (academic) sources;
what Sun will claim themselves I can guess :)
Thanks,
Vincent
At 06:04 PM 7/11/2005 -0500, Don Kinghorn wrote:
>I'll start this thread to kill two birds with one stone ... gee that's not a
>very nice old saying is it :-)
>
>To answer the question about using SuSE 9.3. -- Yes you can use it in general
>and yes, you can use it with dual-core Opterons too.
>
>However, I had to jump through some insane hoops to get it running on a Tyan
>S2891 board with dual-dual-core 2.0GHz Opterons, 4GB mem ....
>
>The results were quite satisfying though. We benchmarked our computational
>chemistry software (PQS) against a 2 node dual Opteron setup with the same
>general configuration. [2x Tyan S2875 and 2.0GHz Opterons 2GB/per board, SuSE
>9.2]
>
>We got better performance for most 4 process parallel (pvm) jobs on the
>dual-core system than on the 2 dual-node setup. The only slower jobs were MP2
>jobs that did heavy disk i/o but this was expected. ... I have SATA-II drives
>on order :-)
>
>
>The main problem with the SuSE 9.3 install on the dual-core setup was that the
>install didn't create the correct initrd to boot the system. I've worked
>around it using the SuSE 9.3 Live CD, (which does boot and run fine), by
>mounting the installed system and chroot'ing into it and fixing the initrd.
>From now on I'll clone it.
>
>I'll be doing our cluster systems with the new Tyan board and dual-core
>Opterons with a configuration running SuSE 9.3 for our next hardware
>iteration starting in Sept..
>
>Hope this is helpfull info.
>
>Best wishes
>-Don
>
>--
>Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC
>http://www.pqs-chem.com
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Tue Jul 12 05:52:42 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Tue, 12 Jul 2005 10:52:42 +0100
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
References: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
Message-ID: <1121161962.5923.39.camel@vigor13>
On Tue, 2005-07-12 at 11:08 +0200, Vincent Diepeveen wrote:
> Fedora core 2 i already tried and when installed at my dual, it was wasting
> cpu time for nothing. The worst distribution ever. Not worth downloading if
> your intentions are more than 'just run linux'. If you need to run
> applications that will eat system time, Fedora Core is the worst choice.
>
> In general Suse and Redhat are deteriorating, only their commercial product
> lines might be doing fine, which are what is it, $1500 a piece or so in
> case of Redhat?
I hesitate to get embroiled in this one.
Redhat do not sell Linux. They can't - it is GPLed.
What they do sell is support and binary updates; that is what you pay
for.
And many commercial companies - finance, oil and gas, engineering,
decide that they need support, and just as importantly a distribution
which their applications are certified to run on.
Also remember that it is common to have compute nodes running
Workstation, which is much cheaper.
Redhat play fair and release the source for updates.
> Suse 9.3 was a waste of money. It doesn't even install correct. Either you
> get 'kernel panic', or some file system stuff is going wrong.
That's your choice, and your finding with your hardware.
We are very happily shipping SuSE 9.3 on Opteron clusters, including
dual cores.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From scheinin at crs4.it Tue Jul 12 06:24:27 2005
From: scheinin at crs4.it (Alan Louis Scheinine)
Date: Tue, 12 Jul 2005 12:24:27 +0200
Subject: [Beowulf]
[gorelsky@stanford.edu: CCL:dual-core Opteron 275performance]
In-Reply-To:
References:
Message-ID: <42D39A5B.8050300@crs4.it>
1) Gerry Creager wrote "Hoowa!"
Since the results seem useful, I would like to add the following.
On dual-CPU boards with Athlon32 CPUs, the program "bolam" was slow if
both CPUs on the board were used; it was better to have one MPICH process
per compute node. This problem did not appear in another cluster that had
Opteron dual-CPU boards (single-core), that is, two processes for each node
did not cause a slowdown. This is an indication that "bolam" is at a
threshold for memory access being a bottleneck. A complication for this
interpretation is that the Athlon32 nodes use Linux kernel 2.4.21.
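(For reference, forcing one MPICH process per node is just a matter of the
machinefile -- list each host once, so that "-np 4" across 4 hosts lands a
single process on each; host names below are made up:

  node01
  node02
  node03
  node04

  mpirun -np 4 -machinefile machines ./bolam )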
2) Mikhail Kuzminsky asked: do you have "node interleave memory" switched off?
Reading the BIOS:
Bank interleaving "Auto", there are two memory modules per CPU so there
should be bank interleaving.
Node interleaving "Disable"
3) In an email Guy Coates asked
> Did you need to use numa-tools to specify the CPU placement, or did the
> kernel "do the right thing" by itself?
The kernel did the right thing by itself.
I have a question: what are numa-tools?
On the computer I find
man -k numa
numa (3) - NUMA policy library
numactl(8) - Control NUMA policy for processes or shared memory
rpm -qa | grep -i numa
numactl-0.6.4-1.13
Is numactl the "numa-tools"? Is there another package to consider installing?
I see that numactl has many "man" pages.
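For what it's worth, typical invocations look like this (option names have
varied between numactl versions, so check numactl(8) first):

  numactl --hardware                      # show nodes, memory and distances
  numactl --cpubind=0 --membind=0 ./app   # run on node 0, node-0 memory only
  numactl --interleave=all ./app          # interleave pages over all nodes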
Reference, previous message:
>In all cases, 4 MPI processes on a machine with 4 cores (two dual-core CPUs).
>Meteorology program 1, "bolam" CPU time, real time (in seconds)
> Linux kernel 2.6.9-11.ELsmp 122 128
> Linux kernel 2.6.12.2 64 77
>
>Meteorology program 2, "non-hydrostatic"
> Linux kernel 2.6.9-11.ELsmp 598 544
> Linux kernel 2.6.12.2 430 476
--
Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
Center for Advanced Studies, Research, and Development in Sardinia
Postal Address: | Physical Address for FedEx, UPS, DHL:
--------------- | -------------------------------------
Alan Scheinine | Alan Scheinine
c/o CRS4 | c/o CRS4
C.P. n. 25 | Loc. Pixina Manna Edificio 1
09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
Email: scheinin at crs4.it
Phone: 070 9250 238 [+39 070 9250 238]
Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
Operator at reception: 070 9250 1 [+39 070 9250 1]
Mobile phone: 347 7990472 [+39 347 7990472]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From gerry.creager at tamu.edu Tue Jul 12 07:47:55 2005
From: gerry.creager at tamu.edu (Gerry Creager N5JXS)
Date: Tue, 12 Jul 2005 06:47:55 -0500
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
References: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
Message-ID: <42D3ADEB.8000808@tamu.edu>
We have also become fond of CentOS (specifically, v4.0).
gerry
Vincent Diepeveen wrote:
> Fedora core 2 i already tried and when installed at my dual, it was wasting
> cpu time for nothing. The worst distribution ever. Not worth downloading if
> your intentions are more than 'just run linux'. If you need to run
> applications that will eat system time, Fedora Core is the worst choice.
>
> In general Suse and Redhat are deteriorating, only their commercial product
> lines might be doing fine, which are what is it, $1500 a piece or so in
> case of Redhat?
>
> Suse 9.3 was a waste of money. It doesn't even install correct. Either you
> get 'kernel panic', or some file system stuff is going wrong.
>
> Amazingly Suse 9.0 at the same machine worked fine (but of course 2.4.x is
> wrong kernel for a quad opteron so i must upgrade that).
>
> Anyone tried opensolaris.org actually and download their compiler at
> http://opensolaris.org/os/community/tools/sun_studio_tools/
>
> Or is this all a big commercial show from Sun?
>
> At 05:40 PM 7/11/2005 -0400, Michael Joyner wrote:
>
>>After discussing it with the physics professor, we have decided to try
>>Fedora 2 + OSCAR.
>>
>>Wish me luck! :)
>>
>>John Hearns wrote:
>>
>>>On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
>>>
>>>>Brian R Smith wrote:
>>>
>>>>>SuSE does come with a few helpful packages like mpich/lam and queuing
>>>>>software like OpenPBS, but in my experience, you are always better off
>>>>>following a more generic model: build it yourself.
>>>>
>>>>We were initially looking at SuSE because that is what we have
>>>>everywhere else. :)
>>>
>>>Well, use SuSE on your cluster then, if that is the distro which you are
>>>most used to.
>>>Personally, I would shy away from Fedora, much though I have a liking
>>>for Redhat hand have used it for years.
>>>
>>>I agree with the advice though to build your own packages rather than
>>>relying on the RPMs.
>>>
>>>_______________________________________________
>>>Beowulf mailing list, Beowulf at beowulf.org
>>>To change your subscription (digest mode or unsubscribe) visit
>
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit
>
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
Page: 979.228.0173
Office: 903A Eller Bldg, TAMU, College Station, TX 77843
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From rgb at phy.duke.edu Tue Jul 12 08:57:35 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 12 Jul 2005 08:57:35 -0400
Subject: [Beowulf] SuSE 9.3
References: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
<42D3ADEB.8000808@tamu.edu>
Message-ID:
Gerry Creager N5JXS writes:
> We have also become fond of CentOS (specifically, v4.0).
Where it is worth parenthetically noting that Centos 4 "is" RHEL 4 which
"is" Fedora 4, only frozen. Also that (IIRC) Scientific Linux is built
on top of RHEL 4 (and hence is "like" Centos plus add-ons if it doesn't
actually share de-RH-logified rpms).
Regarding FC -- FC 1 sucked -- sort of a destabilized RH 9 and no (good)
support for x86-64. FC 2 was pretty good and by the time it got turned
into Linux at Duke (FC2 plus enhancements and fixes) it was very stable and
has run on both cluster nodes and desktops for a long time, since we are
updating only every other FC release (and will have a linux at duke based
on FC 4 "soon" this summer). FC 3 is running on the laptop I'm typing
this on (and a few other systems in my house) and seems to work very
well and contain significant enhancements of various sorts relative to
FC 2.
However, my broader experience is that with distros your mileage ALWAYS
may vary. People tend to have a negative experience (often because of a
quirk in their particular combination of hardware) and then write a
distro off, but if one perseveres and gets a clean install it will
probably run just fine -- not that crazy given the tremendous overlap in
source and build across distros. For example, saying that you "like"
only some of Centos, RHEL, SL, or FC but not the rest is almost
certainly due to user error or because you dislike something about the
philosophy of one or the other, not because there are deep substantive
differences in install, basic package selection, build methodology, etc.
I personally think that FC is only marginally less stable than the RHEL
clones, for example, and in anything but a brand-new FC release the
update stream almost certainly fixes those relatively few initial
problems. This makes yum a key component of any install, but WITH yum
one has a truly impressive range of prebuilt RPMs available with the
various add-on repos.
The dark side of the RHEL clones is the slowness of their advances.
Centos 3 was running GSL in some really early version LONG after
significant new functionality and bug fixes were available in the STABLE
RELEASE version in FC. Stability and update stream are just great, but
I personally think RHEL may be carrying the stability thing to a fault.
The kernel, also, can be a real problem if "stabilized" for too long --
two years is a LONG time in hardware space; lots of products released
and supported in more aggressive kernel update streams, lots of
improvements in the kernel itself.
rgb
>
> gerry
>
> Vincent Diepeveen wrote:
>> Fedora core 2 i already tried and when installed at my dual, it was wasting
>> cpu time for nothing. The worst distribution ever. Not worth downloading if
>> your intentions are more than 'just run linux'. If you need to run
>> applications that will eat system time, Fedora Core is the worst choice.
>>
>> In general Suse and Redhat are deteriorating, only their commercial product
>> lines might be doing fine, which are what is it, $1500 a piece or so in
>> case of Redhat?
>>
>> Suse 9.3 was a waste of money. It doesn't even install correct. Either you
>> get 'kernel panic', or some file system stuff is going wrong.
>>
>> Amazingly Suse 9.0 at the same machine worked fine (but of course 2.4.x is
>> wrong kernel for a quad opteron so i must upgrade that).
>>
>> Anyone tried opensolaris.org actually and download their compiler at
>> http://opensolaris.org/os/community/tools/sun_studio_tools/
>>
>> Or is this all a big commercial show from Sun?
>>
>> At 05:40 PM 7/11/2005 -0400, Michael Joyner wrote:
>>
>>>After discussing it with the physics professor, we have decided to try
>>>Fedora 2 + OSCAR.
>>>
>>>Wish me luck! :)
>>>
>>>John Hearns wrote:
>>>
>>>>On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
>>>>
>>>>>Brian R Smith wrote:
>>>>
>>>>>>SuSE does come with a few helpful packages like mpich/lam and queuing
>>>>>>software like OpenPBS, but in my experience, you are always better off
>>>>>>following a more generic model: build it yourself.
>>>>>
>>>>>We were initially looking at SuSE because that is what we have
>>>>>everywhere else. :)
>>>>
>>>>Well, use SuSE on your cluster then, if that is the distro which you are
>>>>most used to.
>>>>Personally, I would shy away from Fedora, much though I have a liking
>>>>for Redhat hand have used it for years.
>>>>
>>>>I agree with the advice though to build your own packages rather than
>>>>relying on the RPMs.
>>>>
>>>>_______________________________________________
>>>>Beowulf mailing list, Beowulf at beowulf.org
>>>>To change your subscription (digest mode or unsubscribe) visit
>>
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>>_______________________________________________
>>>Beowulf mailing list, Beowulf at beowulf.org
>>>To change your subscription (digest mode or unsubscribe) visit
>>
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> Gerry Creager -- gerry.creager at tamu.edu
> Texas Mesonet -- AATLT, Texas A&M University
> Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
> Page: 979.228.0173
> Office: 903A Eller Bldg, TAMU, College Station, TX 77843
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From gerry.creager at tamu.edu Tue Jul 12 10:42:05 2005
From: gerry.creager at tamu.edu (Gerry Creager N5JXS)
Date: Tue, 12 Jul 2005 09:42:05 -0500
Subject: [Beowulf] SuSE 9.3
In-Reply-To:
References: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
<42D3ADEB.8000808@tamu.edu>
Message-ID: <42D3D6BD.7040807@tamu.edu>
Howdy!
Robert G. Brown wrote:
> Gerry Creager N5JXS writes:
>
>> We have also become fond of CentOS (specifically, v4.0).
>
>
> Where it is worth parenthetically noting that Centos 4 "is" RHEL 4 which
> "is" Fedora 4, only frozen. Also that (IIRC) Scientific Linux is built
> on top of RHEL 4 (and hence is "like" Centos plus add-ons if it doesn't
> actually share de-RH-logified rpms).
...
> I personally think that FC is only marginally less stable than the RHEL
> clones, for example, and in anything but a brand-new FC release the
> update stream almost certainly fixes those relatively few initial
> problems. This makes yum a key component of any install, but WITH yum
> one has a truly impressive range of prebuilt RPMs available with the
> various add-on repos.
I will agree with your characterization of FC1, but I found FC2 to lack
significant benefits for far too long. FC3, well, I'm pretty happy with
it. We made the move to CentOS 4, realizing it was a frozen FC4, but
I'm also of the opinion that the FCs are a little more than beta but
not quite gold, and the RHEL/CentOS versions are the result of a little
more testing and learning about consequences.
We're going with "more stable" confident that we can, if needed, update
from tarballs or FC RPMs if something's really broken and not in the
rapid update cycle.
gerry
>> Vincent Diepeveen wrote:
>>
>>> Fedora core 2 i already tried and when installed at my dual, it was
>>> wasting
>>> cpu time for nothing. The worst distribution ever. Not worth
>>> downloading if
>>> your intentions are more than 'just run linux'. If you need to run
>>> applications that will eat system time, Fedora Core is the worst choice.
>>>
>>> In general Suse and Redhat are deteriorating, only their commercial
>>> product
>>> lines might be doing fine, which are what is it, $1500 a piece or so in
>>> case of Redhat?
>>> Suse 9.3 was a waste of money. It doesn't even install correct.
>>> Either you
>>> get 'kernel panic', or some file system stuff is going wrong.
>>>
>>> Amazingly Suse 9.0 at the same machine worked fine (but of course
>>> 2.4.x is
>>> wrong kernel for a quad opteron so i must upgrade that).
>>>
>>> Anyone tried opensolaris.org actually and download their compiler at
>>> http://opensolaris.org/os/community/tools/sun_studio_tools/
>>>
>>> Or is this all a big commercial show from Sun?
>>>
>>> At 05:40 PM 7/11/2005 -0400, Michael Joyner wrote:
>>>
>>>> After discussing it with the physics professor, we have decided to
>>>> try Fedora 2 + OSCAR.
>>>>
>>>> Wish me luck! :)
>>>>
>>>> John Hearns wrote:
>>>>
>>>>> On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
>>>>>
>>>>>> Brian R Smith wrote:
>>>>>
>>>>>
>>>>>>> SuSE does come with a few helpful packages like mpich/lam and
>>>>>>> queuing
>>>>>>> software like OpenPBS, but in my experience, you are always
>>>>>>> better off
>>>>>>> following a more generic model: build it yourself.
>>>>>>
>>>>>>
>>>>>> We were initially looking at SuSE because that is what we have
>>>>>> everywhere else. :)
>>>>>
>>>>>
>>>>> Well, use SuSE on your cluster then, if that is the distro which
>>>>> you are
>>>>> most used to.
>>>>> Personally, I would shy away from Fedora, much though I have a liking
>>>>> for Redhat hand have used it for years.
>>>>>
>>>>> I agree with the advice though to build your own packages rather than
>>>>> relying on the RPMs.
>>>>>
>>>>> _______________________________________________
>>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>
>>>
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>
>>>
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>> --
>> Gerry Creager -- gerry.creager at tamu.edu
>> Texas Mesonet -- AATLT, Texas A&M University
>> Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
>> Page: 979.228.0173
>> Office: 903A Eller Bldg, TAMU, College Station, TX 77843
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
Page: 979.228.0173
Office: 903A Eller Bldg, TAMU, College Station, TX 77843
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Tue Jul 12 11:44:22 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue, 12 Jul 2005 11:44:22 -0400 (EDT)
Subject: [Beowulf] SuSE 9.3
In-Reply-To:
Message-ID:
> support for x86-64. FC 2 was pretty good and by the time it got turned
> into Linux at Duke (FC2 plus enhancements and fixes) it was very stable and
> has run on both cluster nodes and desktops for a long time, since we are
I have an x86 cluster which is quite happy on FC2, though I'm
pretty tempted to jump to FC4.
> quirk in their particular combination of hardware) and then write a
> distro off, but if one perseveres and gets a clean install it will
right - this sort of ridiculously shallow bigotry always pisses me off.
sure, the first time you ever run the installer for a distro,
you'll probably screw up. and sure, if you have weird (new/old/fringe)
hardware, you'll certainly increase the chances of a problem.
but come on - updating the boot kernel on an install CD is not
rocket science!
> I personally think RHEL may be carrying the stability thing to a fault.
I hate it when people conflate various meanings of stability.
yes, lack of crashes is good stability. yes, the glibc and kernel
ABI's should not change rapidly.
but all too often, "stability" means "lazy admin" - that is, lack
of updates. if it's not broke, sure. but updates that improve performance
are important too.
finally, I don't really know why anyone cares that much about the distro.
I don't even tell my users which Linux a cluster uses, because they shouldn't
have to know. if the kernel works, and glibc is not broken, and the
compiler produces decently efficient code, they're happy. occasionally
someone will complain that python/java/random-library is too old,
but that's to be expected (in this case, the cluster was installed
April 2003 and has been running and growing the whole time. it's due
for a bit of an upgrade, I think.)
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Tue Jul 12 12:04:40 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Tue, 12 Jul 2005 18:04:40 +0200
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
Message-ID: <3.0.32.20050712180440.01288de0@pop3.xs4all.nl>
Hi Don,
A few questions.
Did you use PGO (profile guided optimizations) with gcc 3.3.4 for your code?
PGO is broken in 3.3.4 for my software: when I deterministically compare the
single-cpu output of a 3.3.4+PGO build, it differs. Did you deterministically
compare both executables with each other (when running single cpu) and see
whether the output is 100% equal?
Note 3.3.4-suse gcc may not be 100% identical to 3.3.4 gcc. However, the same
bug is there in the 3.3.1-3.3.x series of Suse GCC.
In 4.0.0 the PGO works better and creates the same output; YMMV there.
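For reference, the PGO cycle I mean with the 3.3/3.4 series is (file names
are examples; newer gccs also accept -fprofile-generate/-fprofile-use
instead):

  gcc -O2 -fprofile-arcs -o diep.prof diep.c     # instrumented build
  ./diep.prof < training.input                   # run, writes .da profile data
  gcc -O2 -fbranch-probabilities -o diep diep.c  # rebuild using the profile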
At 10:36 AM 7/12/2005 -0500, Don Kinghorn wrote:
>Hi Vincent, ...all,
>
>The code was built on a SuSE9.2 machine with gcc/g77 3.3.4. The same
>executable was run on both systems.
>
>Kernel for the 2 dual-node setup was SuSE stock 2.6.8-24-smp
>for the 9.3 setup with the dual-core cpus it was the stock install kernel
>2.6.11.4-21.7-smp
>Memory was fully populated on the 2 node setup -- 4 one GB modules per board,
>there are only 4 slots on the Tyan 2875 (I had mistakenly reported yesterday
I'm not seeing anywhere at Tyan an indication that this board can take
advantage of NUMA.
It looks like there is one shared memory pool; correct me if I'm wrong. It's
not showing the RAM as belonging to one cpu, but rather to both.
>that there was only 2GB/per board for the benchmark numbers)
>The dual-core system had 4 one GB modules arranged 2 for each cpu.
So you compared a dual Opteron dual core (non-Tiger board)
with a dual Opteron (Tiger).
I assume you used 2 cpus on both machines to compare the speed of your code.
Currently setting up gentoo on the quad.
>Important(?) bios settings were;
>
>Bank interleaving "Auto"
>Node interleaving "Auto"
>PowerNow "Disable"
>MemoryHole "Disabled" for both hardware and software settings
>
>The speedup we saw on the dual-core was less than 10% for the most jobs. MP2
>jobs with heavy i/o (worst case) was around a %20 hit (there were twice as
>many processes hitting the raid scratch space at the same time)
Are you now comparing a 4-core machine (dual Opteron dual core) against a
dual Opteron Tiger, with the added 2 cores giving a 10% speedup?
That's an ugly speedup in that case; perhaps improve the code?
Blaming the memory controllers is not a good excuse. The 2 memory
controllers can deliver more data per second than the cpus deliver gflops
per second.
As you can see at sudhian, Diep has a speedup of 3.92 out of 4 cores.
Of course that took years of hard programming.
>I still have lots of testing and tuning to do. These tests were just to see if
>it was going to work and how much trouble it was going to be. (It was a LOT of
>trouble getting SuSE9.3 installed but I think worth it in the end)
Setting up gentoo 2005.0 amd64 universal here now. Will go fine.
>Best to all
>-Don
>
>> If you 'did get better performance', that's possibly because
>> you have some kernel 2.6.x now, allowing NUMA, and a new
>> compiler version of gcc like 4.0.1 that has been bugfixed more than
>> the very buggy series 3.3.x and 3.4.x
>>
>> Can you show us the differences between the compiler versions and kernel
>> versions you had and whether it's NUMA?
>>
>> Also how is your memory banks configured, for 64 bits usage or 128 bits
>> single cpu usage, or are all banks filled up?
>--
>Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC
>http://www.pqs-chem.com
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Tue Jul 12 17:40:58 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue, 12 Jul 2005 17:40:58 -0400 (EDT)
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To: <3.0.32.20050712180440.01288de0@pop3.xs4all.nl>
Message-ID:
> >there are only 4 slots on the Tyan 2875 (I had mistakenly reported yesterday
>
> I'm not seeing anywhere at Tyan an indication this board can take advantage
> of NUMA.
node interleave is meaningless for the 2875, since the board only has
memory attached to one CPU. while the bios probably does include the
ACPI table that informs the kernel's k8-numa code, it's moot, since
there's no way to arrange cpu-proc affinity to minimize non-local
accesses. (except by not using the second socket, of course!)
I'd expect NUMA support to make more of a difference on 4-socket systems,
since on them, a process can be >1 hop away from memory. on a 2-socket
system, it's probably still worth doing, but can't be all that critical.
naturally, latency-sensitive codes (big but with poor locality) will
show a bigger difference.
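(fwiw, if numactl happens to be installed, something like

    numactl --hardware

should show whether the kernel actually sees one memory node or two on
a given board; the exact output varies with the numactl version.)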
> >Bank interleaving "Auto"
I tried to measure this on a dual, and couldn't. it's hard to see,
based on the low-level hardware specs, why it would matter much.
yes, bank interleave should reduce the amount of time waiting on
bank misses, but it's certainly not visible to Stream.
> >Node interleaving "Auto"
turning this on essentially defeats NUMA; it could be the right thing
for some codes/systems, since it means that no process has any special
affinity for a particular socket.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From rss000f at smsu.edu Tue Jul 12 12:38:33 2005
From: rss000f at smsu.edu (Randall S. Sexton)
Date: Tue, 12 Jul 2005 11:38:33 -0500
Subject: [Beowulf] MPI/Fortran programming problems
Message-ID:
I'm getting the same thing; did you find an answer?
Sending bcast
8 - MPI_BCAST : Message truncated
[8] Aborting program !
[8] Aborting program!
p8_3396: p4_error: : 14
rm_l_8_3397: p4_error: interrupt SIGINT: 2
P4 procgroup file is procgroup.
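(From what I can tell, this error usually means the ranks disagree about
the count or datatype passed to MPI_BCAST. A minimal C illustration of
the mismatch -- purely a sketch, not our actual code:)

    /* Sketch only: the root broadcasts 100 ints while the other ranks
       expect 10, which makes MPICH report "MPI_BCAST : Message
       truncated" on the receiving ranks. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf[100], count;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        count = (rank == 0) ? 100 : 10;   /* mismatched counts */
        MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }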
Best Regards,
Randall S. Sexton
Computer Information Systems
Southwest Missouri State University
901 South National
Springfield, MO 65804
Work: 417-836-6453
Fax: 417-836-6907
Web: http://www.faculty.smsu.edu/r/rss000f/
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From John.Brookes at compusys.co.uk Tue Jul 12 10:07:13 2005
From: John.Brookes at compusys.co.uk (John Brookes)
Date: Tue, 12 Jul 2005 15:07:13 +0100
Subject: [Beowulf] SuSE 9.3
Message-ID: <340E909D2A730C409DE89E2DD629403EAE736C@asset51023.compusys.co.uk>
My tuppenny worth:
On the subject of Fedora, I've had no major problems with FC3, had
'some' with FC2 (but mainly on laptops, really) and waaaaaay too many
with FC1.
> Where it is worth parenthetically noting that Centos 4 "is" RHEL 4
> which "is" Fedora 4, only frozen. Also that (IIRC) Scientific Linux is
> built on top of RHEL 4 (and hence is "like" Centos plus add-ons if it
> doesn't actually share de-RH-logified rpms).
You remember accurately. Scientific Linux is
built from (open) source, minus anything that _may_ cause licensing
issues.
> Stability and update stream are just great, but I personally think
> RHEL may be carrying the stability thing to a fault.
It's not entirely fair to criticise RH for the relative sloth of the
release of new RHEL versions. It _is_ 'Enterprise Linux', after all, and
I'd guess the air would be thick with lawsuits if J Random MegaCorp were
left exposed due to insufficient testing on RH's part (or Oracle's, or
Fluent's etc on the certification front). In this realm, though, I agree
that it can be an issue.
John Brookes
Senior Technical Consultant
COMPUSYS PLC
DD: +44 (0)1296 505348
Mob: +44 (0)7789278947
Tel: +44 (0)1296 505100
Fax: +44 (0)1296 424165
Web: www.compusys.co.uk
This email is confidential and intended solely for the use of the
individual to whom it is addressed. Any views or opinions presented are
solely those of the author and do not necessarily represent those of
Compusys or any of its affiliates. If you are not the intended
recipient, be advised that you have received this email in error and
that any use, dissemination, forwarding, printing, or copying of this
email is strictly prohibited. If you have received this email in error
please notify Compusys Customer Services by telephone on +44(0)1296
505140
-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
On Behalf Of Robert G. Brown
Sent: 12 July 2005 13:58
To: Gerry Creager N5JXS
Cc: johnh at streamline-computing.com; beowulf at beowulf.org
Subject: Re: [Beowulf] SuSE 9.3
Gerry Creager N5JXS writes:
> We have also become fond of CentOS (specifically, v4.0).
Where it is worth parenthetically noting that Centos 4 "is" RHEL 4 which
"is" Fedora 4, only frozen. Also that (IIRC) Scientific Linux is built
on top of RHEL 4 (and hence is "like" Centos plus add-ons if it doesn't
actually share de-RH-logified rpms).
Regarding FC -- FC 1 sucked -- sort of a destabilized RH 9 and no (good)
support for x86-64. FC 2 was pretty good and by the time it got turned
into Linux at Duke (FC2 plus enhancements and fixes) it was very stable and
has run on both cluster nodes and desktops for a long time, since we are
updating only every other FC release (and will have a linux at duke based
on FC 4 "soon" this summer). FC 3 is running on the laptop I'm typing
this on (and a few other systems in my house) and seems to work very
well and contain significant enhancements of various sorts relative to
FC 2.
However, my broader experience is that with distros your mileage ALWAYS
may vary. People tend to have a negative experience (often because of a
quirk in their particular combination of hardware) and then write a
distro off, but if one perseveres and gets a clean install it will
probably run just fine -- not that crazy given the tremendous overlap in
source and build across distros. For example, saying that you "like"
only some of Centos, RHEL, SL, or FC but not the rest is almost
certainly due to user error or because you dislike something about the
philosophy of one or the other, not because there are deep substantive
differences in install, basic package selection, build methodology, etc.
I personally think that FC is only marginally less stable than the RHEL
clones, for example, and in anything but a brand-new FC release the
update stream almost certainly fixes those relatively few initial
problems. This makes yum a key component of any install, but WITH yum
one has a truly impressive range of prebuilt RPMs available with the
various add-on repos.
The dark side of the RHEL clones is the slowness of their advances.
Centos 3 was running GSL in some really early version LONG after
significant new functionality and bug fixes were available in the STABLE
RELEASE version in FC. Stability and update stream are just great, but
I personally think RHEL may be carrying the stability thing to a fault.
The kernel, also, can be a real problem if "stabilized" for too long --
two years is a LONG time in hardware space; lots of products released
and supported in more aggressive kernel update streams, lots of
improvements in the kernel itself.
rgb
>
> gerry
>
> Vincent Diepeveen wrote:
>> I already tried Fedora core 2, and when installed on my dual it was
>> wasting cpu time for nothing. The worst distribution ever. Not worth
>> downloading if your intentions are more than 'just run linux'. If you
>> need to run applications that will eat system time, Fedora Core is the
>> worst choice.
>>
>> In general Suse and Redhat are deteriorating; only their commercial
>> product lines might be doing fine, and those cost what, $1500 apiece
>> or so in the case of Redhat?
>>
>> Suse 9.3 was a waste of money. It doesn't even install correctly.
>> Either you get a 'kernel panic', or some file system stuff goes wrong.
>>
>> Amazingly, Suse 9.0 worked fine on the same machine (but of course
>> 2.4.x is the wrong kernel for a quad opteron, so I must upgrade that).
>>
>> Has anyone actually tried opensolaris.org and downloaded their
>> compiler at
>> http://opensolaris.org/os/community/tools/sun_studio_tools/
>>
>> Or is this all a big commercial show from Sun?
>>
>> At 05:40 PM 7/11/2005 -0400, Michael Joyner wrote:
>>
>>>After discussing it with the physics professor, we have decided to
try
>>>Fedora 2 + OSCAR.
>>>
>>>Wish me luck! :)
>>>
>>>John Hearns wrote:
>>>
>>>>On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
>>>>
>>>>>Brian R Smith wrote:
>>>>
>>>>>>SuSE does come with a few helpful packages like mpich/lam and
queuing
>>>>>>software like OpenPBS, but in my experience, you are always better
off
>>>>>>following a more generic model: build it yourself.
>>>>>
>>>>>We were initially looking at SuSE because that is what we have
>>>>>everywhere else. :)
>>>>
>>>>Well, use SuSE on your cluster then, if that is the distro which you
are
>>>>most used to.
>>>>Personally, I would shy away from Fedora, much though I have a
liking
>>>>for Redhat hand have used it for years.
>>>>
>>>>I agree with the advice though to build your own packages rather
than
>>>>relying on the RPMs.
>>>>
>>>>_______________________________________________
>>>>Beowulf mailing list, Beowulf at beowulf.org
>>>>To change your subscription (digest mode or unsubscribe) visit
>>
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>>_______________________________________________
>>>Beowulf mailing list, Beowulf at beowulf.org
>>>To change your subscription (digest mode or unsubscribe) visit
>>
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> Gerry Creager -- gerry.creager at tamu.edu
> Texas Mesonet -- AATLT, Texas A&M University
> Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
> Page: 979.228.0173
> Office: 903A Eller Bldg, TAMU, College Station, TX 77843
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From jlb17 at duke.edu Tue Jul 12 05:35:03 2005
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Tue, 12 Jul 2005 05:35:03 -0400 (EDT)
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
References: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
Message-ID:
On Tue, 12 Jul 2005 at 11:08am, Vincent Diepeveen wrote
> I already tried Fedora core 2, and when installed on my dual it was
> wasting cpu time for nothing. The worst distribution ever. Not worth
> downloading if your intentions are more than 'just run linux'. If you
> need to run applications that will eat system time, Fedora Core is the
> worst choice.
>
> In general Suse and Redhat are deteriorating; only their commercial
> product lines might be doing fine, and those cost what, $1500 apiece or
> so in the case of Redhat?
Don't forget the various RHEL rebuild distros out there. We use centos
here, but (IIRC) Scientific Linux is also based off of RHEL.
--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From laurence at scalablesystems.com Tue Jul 12 10:12:29 2005
From: laurence at scalablesystems.com (Laurence Liew)
Date: Tue, 12 Jul 2005 22:12:29 +0800
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
References: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
Message-ID: <42D3CFCD.5070406@scalablesystems.com>
See http://www.redhat.com/software/rhel/hpc/
Red Hat HPC pricing is US$79 per 2-way server and US$158 per 4-way
compute node. You buy in quantities of 8.
This is reasonable for organisations requiring the assurance of a
certified platform, and support ... though I'm not sure how much HPC
support Red Hat can provide.. :-)
cheers!
laurence
Vincent Diepeveen wrote:
> I already tried Fedora core 2, and when installed on my dual it was
> wasting cpu time for nothing. The worst distribution ever. Not worth
> downloading if your intentions are more than 'just run linux'. If you
> need to run applications that will eat system time, Fedora Core is the
> worst choice.
>
> In general Suse and Redhat are deteriorating; only their commercial
> product lines might be doing fine, and those cost what, $1500 apiece or
> so in the case of Redhat?
>
> Suse 9.3 was a waste of money. It doesn't even install correctly.
> Either you get a 'kernel panic', or some file system stuff goes wrong.
>
> Amazingly, Suse 9.0 worked fine on the same machine (but of course
> 2.4.x is the wrong kernel for a quad opteron, so I must upgrade that).
>
> Has anyone actually tried opensolaris.org and downloaded their compiler
> at http://opensolaris.org/os/community/tools/sun_studio_tools/
>
> Or is this all a big commercial show from Sun?
>
> At 05:40 PM 7/11/2005 -0400, Michael Joyner wrote:
>
>>After discussing it with the physics professor, we have decided to try
>>Fedora 2 + OSCAR.
>>
>>Wish me luck! :)
>>
>>John Hearns wrote:
>>
>>>On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
>>>
>>>>Brian R Smith wrote:
>>>
>>>>>SuSE does come with a few helpful packages like mpich/lam and queuing
>>>>>software like OpenPBS, but in my experience, you are always better off
>>>>>following a more generic model: build it yourself.
>>>>
>>>>We were initially looking at SuSE because that is what we have
>>>>everywhere else. :)
>>>
>>>Well, use SuSE on your cluster then, if that is the distro which you are
>>>most used to.
>>>Personally, I would shy away from Fedora, much though I have a liking
>>>for Redhat and have used it for years.
>>>
>>>I agree with the advice though to build your own packages rather than
>>>relying on the RPMs.
>>>
>>>_______________________________________________
>>>Beowulf mailing list, Beowulf at beowulf.org
>>>To change your subscription (digest mode or unsubscribe) visit
>
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit
>
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>
--
Laurence Liew, CTO Email: laurence at scalablesystems.com
Scalable Systems Pte Ltd Web : http://www.scalablesystems.com
(Reg. No: 200310328D)
7 Bedok South Road Tel : 65 6827 3953
Singapore 469272 Fax : 65 6827 3922
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From brian at cypher.acomp.usf.edu Tue Jul 12 10:23:44 2005
From: brian at cypher.acomp.usf.edu (Brian R Smith)
Date: Tue, 12 Jul 2005 10:23:44 -0400
Subject: [Beowulf] SuSE 9.3
In-Reply-To:
References: <3.0.32.20050712110822.01331800@pop3.xs4all.nl>
<42D3ADEB.8000808@tamu.edu>
Message-ID: <1121178224.18795.11.camel@daemon.acomp.usf.edu>
We've been a CentOS shop since version 3.0 and haven't looked back. The
fact that they offer update support for 5 years allows us more
flexibility in planning our upgrades. I've noticed that a lot of people
on the Fedora front do the "every other release" upgrade cycle, which
probably works out perfectly, as support for each release is continued
up until the second subsequent release. It is then transferred to
Fedora Legacy, so you don't really lose support altogether (though my
experience with them for earlier RH distros wasn't that inspiring).
Basically, the choice is yours. FC2 was really impressive, but it is
also getting up there in age, mostly because of Fedora's rapid release
cycle, and is, I believe, being transferred to Fedora Legacy already.
You should always take that into consideration.
But in any case, good luck with your build.
-Brian
On Tue, 2005-07-12 at 08:57 -0400, Robert G. Brown wrote:
> Gerry Creager N5JXS writes:
>
> > We have also become fond of CentOS (specifically, v4.0).
>
> Where it is worth parenthetically noting that Centos 4 "is" RHEL 4 which
> "is" Fedora 4, only frozen. Also that (IIRC) Scientific Linux is built
> on top of RHEL 4 (and hence is "like" Centos plus add-ons if it doesn't
> actually share de-RH-logified rpms).
>
> Regarding FC -- FC 1 sucked -- sort of a destabilized RH 9 and no (good)
> support for x86-64. FC 2 was pretty good and by the time it got turned
> into Linux at Duke (FC2 plus enhancements and fixes) it was very stable and
> has run on both cluster nodes and desktops for a long time, since we are
> updating only every other FC release (and will have a linux at duke based
> on FC 4 "soon" this summer). FC 3 is running on the laptop I'm typing
> this on (and a few other systems in my house) and seems to work very
> well and contain significant enhancements of various sorts relative to
> FC 2.
>
> However, my broader experience is that with distros your mileage ALWAYS
> may vary. People tend to have a negative experience (often because of a
> quirk in their particular combination of hardware) and then write a
> distro off, but if one perseveres and gets a clean install it will
> probably run just fine -- not that crazy given the tremendous overlap in
> source and build across distros. For example, saying that you "like"
> only some of Centos, RHEL, SL, or FC but not the rest is almost
> certainly due to user error or because you dislike something about the
> philosophy of one or the other, not because there are deep substantive
> differences in install, basic package selection, build methodology, etc.
>
> I personally think that FC is only marginally less stable than the RHEL
> clones, for example, and in anything but a brand-new FC release the
> update stream almost certainly fixes those relatively few initial
> problems. This makes yum a key component of any install, but WITH yum
> one has a truly impressive range of prebuilt RPMs available with the
> various add-on repos.
>
> The dark side of the RHEL clones is the slowness of their advances.
> Centos 3 was running GSL in some really early version LONG after
> significant new functionality and bug fixes were available in the STABLE
> RELEASE version in FC. Stability and update stream are just great, but
> I personally think RHEL may be carrying the stability thing to a fault.
> The kernel, also, can be a real problem if "stabilized" for too long --
> two years is a LONG time in hardware space; lots of products released
> and supported in more aggressive kernel update streams, lots of
> improvements in the kernel itself.
>
> rgb
>
> >
> > gerry
> >
> > Vincent Diepeveen wrote:
> >> I already tried Fedora core 2, and when installed on my dual it was
> >> wasting cpu time for nothing. The worst distribution ever. Not worth
> >> downloading if your intentions are more than 'just run linux'. If you
> >> need to run applications that will eat system time, Fedora Core is the
> >> worst choice.
> >>
> >> In general Suse and Redhat are deteriorating; only their commercial
> >> product lines might be doing fine, and those cost what, $1500 apiece
> >> or so in the case of Redhat?
> >>
> >> Suse 9.3 was a waste of money. It doesn't even install correctly.
> >> Either you get a 'kernel panic', or some file system stuff goes wrong.
> >>
> >> Amazingly, Suse 9.0 worked fine on the same machine (but of course
> >> 2.4.x is the wrong kernel for a quad opteron, so I must upgrade that).
> >>
> >> Has anyone actually tried opensolaris.org and downloaded their
> >> compiler at http://opensolaris.org/os/community/tools/sun_studio_tools/
> >>
> >> Or is this all a big commercial show from Sun?
> >>
> >> At 05:40 PM 7/11/2005 -0400, Michael Joyner wrote:
> >>
> >>>After discussing it with the physics professor, we have decided to try
> >>>Fedora 2 + OSCAR.
> >>>
> >>>Wish me luck! :)
> >>>
> >>>John Hearns wrote:
> >>>
> >>>>On Mon, 2005-07-11 at 11:32 -0400, Michael Joyner wrote:
> >>>>
> >>>>>Brian R Smith wrote:
> >>>>
> >>>>>>SuSE does come with a few helpful packages like mpich/lam and queuing
> >>>>>>software like OpenPBS, but in my experience, you are always better off
> >>>>>>following a more generic model: build it yourself.
> >>>>>
> >>>>>We were initially looking at SuSE because that is what we have
> >>>>>everywhere else. :)
> >>>>
> >>>>Well, use SuSE on your cluster then, if that is the distro which you are
> >>>>most used to.
> >>>>Personally, I would shy away from Fedora, much though I have a liking
> >>>>for Redhat and have used it for years.
> >>>>
> >>>>I agree with the advice though to build your own packages rather than
> >>>>relying on the RPMs.
> >>>>
> >>>>_______________________________________________
> >>>>Beowulf mailing list, Beowulf at beowulf.org
> >>>>To change your subscription (digest mode or unsubscribe) visit
> >>
> >> http://www.beowulf.org/mailman/listinfo/beowulf
> >>
> >>>_______________________________________________
> >>>Beowulf mailing list, Beowulf at beowulf.org
> >>>To change your subscription (digest mode or unsubscribe) visit
> >>
> >> http://www.beowulf.org/mailman/listinfo/beowulf
> >>
> >>>
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org
> >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> >
> > --
> > Gerry Creager -- gerry.creager at tamu.edu
> > Texas Mesonet -- AATLT, Texas A&M University
> > Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
> > Page: 979.228.0173
> > Office: 903A Eller Bldg, TAMU, College Station, TX 77843
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kinghorn at pqs-chem.com Tue Jul 12 11:36:15 2005
From: kinghorn at pqs-chem.com (Don Kinghorn)
Date: Tue, 12 Jul 2005 10:36:15 -0500
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To: <3.0.32.20050712112207.01331800@pop3.xs4all.nl>
References: <3.0.32.20050712112207.01331800@pop3.xs4all.nl>
Message-ID: <200507121036.15586.kinghorn@pqs-chem.com>
Hi Vincent, ...all,
The code was built on a SuSE9.2 machine with gcc/g77 3.3.4. The same
executable was run on both systems.
Kernel for the 2 dual-node setup was SuSE stock 2.6.8-24-smp
for the 9.3 setup with the dual-core cpus it was the stock install kernel
2.6.11.4-21.7-smp
Memory was fully populated on the 2 node setup -- 4 one GB modules per board,
there are only 4 slots on the Tyan 2875 (I had mistakenly reported yesterday
that there was only 2GB/per board for the benchmark numbers)
The dual-core system had 4 one GB modules arranged 2 for each cpu.
Important(?) bios settings were;
Bank interleaving "Auto"
Node interleaving "Auto"
PowerNow "Disable"
MemoryHole "Disabled" for both hardware and software settings
The speedup we saw on the dual-core was less than 10% for most jobs. MP2
jobs with heavy i/o (the worst case) took around a 20% hit (there were
twice as many processes hitting the raid scratch space at the same time)
I still have lots of testing and tuning to do. These tests were just to
see if it was going to work and how much trouble it was going to be. (It
was a LOT of trouble getting SuSE9.3 installed, but I think worth it in
the end)
Best to all
-Don
> If you 'did get better performance', that's possibly because
> you have some kernel 2.6.x now, allowing NUMA, and a new
> compiler version of gcc like 4.0.1 that has been bugfixed more than
> the very buggy series 3.3.x and 3.4.x
>
> Can you show us the differences between the compiler versions and kernel
> versions you had and whether it's NUMA?
>
> Also how is your memory banks configured, for 64 bits usage or 128 bits
> single cpu usage, or are all banks filled up?
--
Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC
http://www.pqs-chem.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From david.n.lombard at intel.com Tue Jul 12 19:13:37 2005
From: david.n.lombard at intel.com (Lombard, David N)
Date: Tue, 12 Jul 2005 16:13:37 -0700
Subject: [Beowulf] SuSE 9.3
Message-ID: <187D3A7CAB42A54DB61F1D05F01257220657E7A8@orsmsx402.amr.corp.intel.com>
From: John Brookes on Tuesday, July 12, 2005 7:07 AM
> > Stability and update stream are just great, but I personally think
> > RHEL may be carrying the stability thing to a fault.
>
> It's not entirely fair to criticise RH for the relative sloth of the
> release of new RHEL versions. It _is_ 'Enterprise Linux', after all,
> and I'd guess the air would be thick with lawsuits if J Random MegaCorp
> were left exposed due to insufficient testing on RH's part (or
> Oracle's, or Fluent's etc on the certification front). In this realm,
> though, I agree that it can be an issue.
>
Don't be so quick to blame the lawyers, well, not on this one.
Corporate users and ISVs don't want to see the OS revised more than once
a year. On the user side, it takes a lot of effort to install, certify,
and then deploy a new OS. On the ISV side, there is little to no
revenue associated with certifying an already released app on a new OS
-- it's purely a cost factor unless the ISV is releasing new code.
This is much more a case of having better things to do with the staff,
time, and costs.
If it ain't broke, ...
And if it is broke, that's what the user is paying for, isn't it?
--
dnl
My comments represent my opinions, not those of Intel Corporation.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Tue Jul 12 20:26:32 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue, 12 Jul 2005 20:26:32 -0400 (EDT)
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <187D3A7CAB42A54DB61F1D05F01257220657E7A8@orsmsx402.amr.corp.intel.com>
Message-ID:
> Corporate users and ISVs don't want to see the OS revised more than once
> a year.
which is sad, really. they've been so traumatized by the dominant
platform that they expect that changing anything will break everything.
the very concept of a standard, let alone an interface standard (ABI)
is foreign to this mentality.
> On the user side, it takes a lot of effort to install, certify,
> and then deploy a new OS. On the ISV side, there is little to no
certification is bad for users. instead of an ISV stepping up to the
plate and saying "our application requires Linux ABI 3.14 and we will
fix bugs where it doesn't", the ISV just anoints a particular config
(OS release, firmware revision, disk setup, these magic three patches,
phase of moon). certification is really an admission that the app is
buggy in indeterminate ways, and that the ISV doesn't care.
> revenue associated with certifying an already released app on a new OS
> -- it's purely a cost factor unless the ISV is releasing new code.
managing cost is a good thing. but doing so does not necessarily mean
that you also have to screw your customers. hey, with a little spin
we can even make certification sound like we're _doing_them_a_favor_!
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 13 00:00:16 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Tue, 12 Jul 2005 21:00:16 -0700
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To: <200507121036.15586.kinghorn@pqs-chem.com>
References: <3.0.32.20050712112207.01331800@pop3.xs4all.nl>
<200507121036.15586.kinghorn@pqs-chem.com>
Message-ID: <20050713040016.GA6658@greglaptop.ip3networks.com>
On Tue, Jul 12, 2005 at 10:36:15AM -0500, Don Kinghorn wrote:
> The dual-core system had 4 one GB modules arranged 2 for each cpu.
To be anal-hyphen-retentive, don't you mean "2 for each socket"?
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Wed Jul 13 02:18:25 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Wed, 13 Jul 2005 07:18:25 +0100
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To: <20050713040016.GA6658@greglaptop.ip3networks.com>
References: <3.0.32.20050712112207.01331800@pop3.xs4all.nl>
<200507121036.15586.kinghorn@pqs-chem.com>
<20050713040016.GA6658@greglaptop.ip3networks.com>
Message-ID: <1121235506.5923.151.camel@vigor13>
On Tue, 2005-07-12 at 21:00 -0700, Greg Lindahl wrote:
> On Tue, Jul 12, 2005 at 10:36:15AM -0500, Don Kinghorn wrote:
>
> > The dual-core system had 4 one GB modules arranged 2 for each cpu.
>
> To be anal-hyphen-retentive, don't you mean "2 for each socket"?
>
Acktcherly....
we do need to decide on a terminology here.
I recently did a response to a tender for a prospective customer.
I was tying myself in knots getting the correct terminology,
for questions such as "the systems MUST have xxx gigabytes of RAM per
processor"
I went with reading that as 'per socket' in the case of dual cores.
Also talking about 'dual nodes' is going to be more tricky.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From landman at scalableinformatics.com Wed Jul 13 08:37:59 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 13 Jul 2005 08:37:59 -0400
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To: <1121235506.5923.151.camel@vigor13>
References: <3.0.32.20050712112207.01331800@pop3.xs4all.nl> <200507121036.15586.kinghorn@pqs-chem.com> <20050713040016.GA6658@greglaptop.ip3networks.com>
<1121235506.5923.151.camel@vigor13>
Message-ID: <42D50B27.4090905@scalableinformatics.com>
Hi John:
We find that we are talking about "per core" to our customers now. I
explain that previously, there has been an implicit 1-to-1 mapping
between processor cores and chips, so that you could talk about either
one and mean the other. Now however, we are talking about per core, as
things like licensing (lmgrd) aren't going to count chips or sockets,
but will count cores.
AMD uses the terminology of
Np processors / Mc cores
so a dual Opteron 275 would look like
2p/4c
system. I prefer the converse of this, 4c/2p, but that's just me. I
don't know what (if any) terminology Intel uses for this.
Per socket is the same as per chip. The issue is that the terminology
may not shift if you are talking about single core, dual core, quad
core, ... N core. From an end user perspective, the cores are real,
full-fledged CPUs that happen to share the same physical die as one or
more other cores. That is, with a little thought, the end user can
break the 1-to-1 mapping and talk in terms of cores, in which case
specifications start to make sense again. 1 GB per socket doesn't make
much sense if each core needs 1 GB for a particular calculation.
Joe
John Hearns wrote:
> On Tue, 2005-07-12 at 21:00 -0700, Greg Lindahl wrote:
>
>>On Tue, Jul 12, 2005 at 10:36:15AM -0500, Don Kinghorn wrote:
>>
>>
>>>The dual-core system had 4 one GB modules arranged 2 for each cpu.
>>
>>To be anal-hyphen-retentive, don't you mean "2 for each socket"?
>>
>
> Acktcherly....
> we do need to decide on a terminology here.
>
> I recently did a response to a tender for a prospective customer.
> I was tying myself in knots getting the correct terminology,
> for questions such as "the systems MUST have xxx gigabytes of RAM per
> processor"
> I went with reading that as 'per socket' in the case of dual cores.
> Also talking about 'dual nodes' is going to be more tricky.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From david.n.lombard at intel.com Wed Jul 13 10:23:57 2005
From: david.n.lombard at intel.com (Lombard, David N)
Date: Wed, 13 Jul 2005 07:23:57 -0700
Subject: [Beowulf] SuSE 9.3
Message-ID: <187D3A7CAB42A54DB61F1D05F0125722065B397F@orsmsx402.amr.corp.intel.com>
From: Mark Hahn on Tuesday, July 12, 2005 5:27 PM
>
> > Corporate users and ISVs don't want to see the OS revised more than
> > once a year.
>
> which is sad, really. they've been so traumatized by the dominant
> platform that they expect that changing anything will break
> everything.
> the very concept of a standard, let alone an interface standard (ABI)
> is foreign to this mentality.
Theory v. practice. Implementation (i.e., compiler and linker output)
can substantially impact performance and correctness.
As an example, the 2.4.6-2.4.18 range of kernels saw a steady rise in
short/small I/O performance at the expense of a steady and significant
loss of large-I/O performance. In a cluster that we built, we had to
have an I/O monster app run on 2.6.3 nodes and another app run on
2.6.10? nodes so that both apps made their performance targets.
> > On the user side, it takes a lot of effort to install, certify,
> > and then deploy a new OS. On the ISV side, there is little to no
>
> certification is bad for users. instead of an ISV stepping up to the
> plate and saying "our application requires Linux ABI 3.14 and we will
> fix bugs where it doesn't", the ISV just anoints a particular config
> (OS release, firmware revision, disk setup, these magic three patches,
> phase of moon). certification is really an admission that the app is
> buggy in indeterminate ways, and that the ISV doesn't care.
When I was responsible for such a statement, I had a "batch" app that
was truly only sensitive to kernel and glibc -- and that's EXACTLY the
requirement I enumerated, a minimal set of kernel and glibc
requirements; eventually it needed to become a bounded range. I also
had a substantial history with the app and a sufficient understanding of
the kernel and glibc to confidently make the claim.
A sister app, a large graphics app, was much more sensitive to the
environment, e.g., C++ libs, graphics device drivers, yada, yada, yada.
Despite my best efforts, they stuck with RHL x.y requirements. They too
had a substantial history with their app that supported their position;
they also thought I was out of my mind--perhaps, but not to this point.
> > revenue associated with certifying an already released app on a new
> > OS -- it's purely a cost factor unless the ISV is releasing new code.
>
> managing cost is a good thing. but doing so does not necessarily mean
> that you also have to screw your customers. hey, with a little spin
> we can even make certification sound like we're _doing_them_a_favor_!
Many customers DEMAND certification; most accepted policy explanations
and our continued specific performance in lieu of certification. But,
even so, on a couple of occasions I was ultimately required to provide
distro-release.version-specific certification to specific customers
based solely on their insistence -- this I do blame on the lawyers...
--
dnl
My comments represent my opinions, not those of Intel Corporation.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Wed Jul 13 10:52:08 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed, 13 Jul 2005 18:52:08 +0400
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To: <42D50B27.4090905@scalableinformatics.com>
Message-ID:
In message from Joe Landman (Wed, 13
Jul 2005 08:37:59 -0400):
>Hi John:
>
> We find that we are talking about "per core" to our customers now. I
>explain that previously, there has been an implicit 1-to-1 mapping
>between processor cores and chips, so that you could talk about
>either one and mean the other. Now however, we are talking about per
>core, as things like licensing (lmgrd) aren't going to count chips or
>sockets, but will count cores.
>
> AMD uses the terminology of
>
> Np processors / Mc cores
>
>so a dual Opteron 275 would look like
>
> 2p/4c
>
>system.
I've read some articles which also used MPU (Microprocessor Unit,
if I remember correctly) alongside CPU,
i.e., MPU as the equivalent of a chip, and CPU as the equivalent of a core.
Mikhail
> I prefer the converse of this, 4c/2p, but that's just me. I
>don't know what (if any) terminology Intel uses for this.
>
>Per socket is the same as per chip. The issue is that the terminology
>may not shift if you are talking about single core, dual core, quad
>core, ... N core. From an end user perspective, the cores are real,
>full-fledged CPUs that happen to share the same physical die as one or
>more other cores. That is, with a little thought, the end user can
>break the 1-to-1 mapping and talk in terms of cores, in which case
>specifications start to make sense again. 1 GB per socket doesn't
>make much sense if each core needs 1 GB for a particular calculation.
>
>Joe
>John Hearns wrote:
>> On Tue, 2005-07-12 at 21:00 -0700, Greg Lindahl wrote:
>>
>>>On Tue, Jul 12, 2005 at 10:36:15AM -0500, Don Kinghorn wrote:
>>>
>>>
>>>>The dual-core system had 4 one GB modules arranged 2 for each cpu.
>>>
>>>To be anal-hyphen-retentive, don't you mean "2 for each socket"?
>>>
>>
>> Acktcherly....
>> we do need to decide on a terminology here.
>>
>> I recently did a response to a tender for a prospective customer.
>> I was tying myself in knots getting the correct terminology,
>> for questions such as "the systems MUST have xxx gigabytes of RAM
>>per
>> processor"
>> I went with reading that as 'per socket' in the case of dual cores.
>> Also talking about 'dual nodes' is going to be more tricky.
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>>http://www.beowulf.org/mailman/listinfo/beowulf
>
>--
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web : http://www.scalableinformatics.com
>phone: +1 734 786 8423
>fax : +1 734 786 8452
>cell : +1 734 612 4615
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Wed Jul 13 10:58:32 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed, 13 Jul 2005 18:58:32 +0400
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To:
Message-ID:
In message from Mark Hahn (Tue, 12 Jul 2005
17:40:58 -0400 (EDT)):
>> >there are only 4 slots on the Tyan 2875 (I had mistakenly reported
>> >yesterday
>>
>> I'm not seeing anywhere at Tyan an indication that this board can
>> take advantage of NUMA.
>
>node interleave is meaningless for the 2875, since the board only has
>memory attached to one CPU. while the bios probably does include the
>ACPI table that informs the kernel's k8-numa code, it's moot, since
>there's no way to arrange cpu-proc affinity to minimize non-local
>accesses. (except by not using the second socket, of course!)
>
>I'd expect NUMA support to make more of a difference on 4-socket
>systems, since on them, a process can be >1 hop away from memory. on a
>2-socket system, it's probably still worth doing, but can't be all
>that critical.
>
>naturally, latency-sensitive codes (big but with poor locality) will
>show a bigger difference.
>
>> >Bank interleaving "Auto"
>
>I tried to measure this on a dual, and couldn't. it's hard to see,
>based on the low-level hardware specs, why it would matter much.
>yes, bank interleave should reduce the amount of time waiting on
>bank misses, but it's certainly not visible to Stream.
>
>> >Node interleaving "Auto"
>
>turning this on essentially defeats NUMA; it could be the right thing
>for some codes/systems, since it means that no process has any special
>affinity for a particular socket.
The main practical result for a dual-CPU single-core Opteron server
is: if I turn "Node interleaving" ON, the STREAM results become much
worse, so some applications will run more slowly.
But what happens in the case of setting it to "AUTO"? Who decides
(and how) what the actual setting will be?
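(For context, what STREAM stresses is just a bandwidth-bound loop like
the triad below -- a minimal sketch of my own for illustration, not the
official benchmark code:)

    /* minimal triad sketch (not the official STREAM source): node
       interleaving hurts exactly this kind of bandwidth-bound loop */
    #include <stdio.h>
    #include <sys/time.h>

    #define N 2000000
    static double a[N], b[N], c[N];

    int main(void)
    {
        long i;
        struct timeval t0, t1;
        double secs;
        const double scalar = 3.0;

        for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        gettimeofday(&t0, NULL);
        for (i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];      /* the triad */
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
        /* three 8-byte arrays are touched once per iteration */
        printf("%.1f MB/s (a[0] = %f)\n", 3.0 * 8.0 * N / secs / 1e6, a[0]);
        return 0;
    }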
Yours
Mikhail
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Wed Jul 13 11:17:58 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed, 13 Jul 2005 19:17:58 +0400
Subject: [Beowulf] SuSE 9.3
In-Reply-To: <187D3A7CAB42A54DB61F1D05F0125722065B397F@orsmsx402.amr.corp.intel.com>
Message-ID:
In message from "Lombard, David N" (Wed,
13 Jul 2005 07:23:57 -0700):
>From: Mark Hahn on Tuesday, July 12, 2005 5:27 PM
>>
>> > Corporate users and ISVs don't want to see the OS revised more
>> > than once a year.
>>
>> which is sad, really. they've been so traumatized by the dominant
>> platform that they expect that changing anything will break
>> everything.
>> the very concept of a standard, let alone an interface standard (ABI)
>> is foreign to this mentality.
>
>Theory v. practice. Implementation (i.e., compiler and linker output)
>can substantially impact performance and correctness.
>
>As an example, the 2.4.6-2.4.18 range of kernels saw a steady rise in
>short/small I/O performance at the expense of a steady and significant
>loss of large-I/O performance.
Sorry, is that incorrect for the latest 2.4.x kernels, for example
2.4.21? BTW, if you are talking about ext2fs/ext3fs and need huge I/O,
why not use the xfs file system?
Yours
Mikhail
> In a cluster that we built, we had to
>have an I/O monster app run on 2.6.3 nodes and another app run on
>2.6.10? nodes so that both apps made their performance targets.
>
>> > On the user side, it takes a lot of effort to install, certify,
>> > and then deploy a new OS. On the ISV side, there is little to no
>>
>> certification is bad for users. instead of an ISV stepping up to
>>the
>> plate and saying "our application requires Linux ABI 3.14 and we
>>will
>> fix bugs where it doesn't", the ISV just anoints a particular
>>config
>> (OS release, firmware revision, disk setup, these magic three
>>patches,
>> phase of moon). certification is really an admission that the app
>>is
>> buggy in indeterminate ways, and that the ISV doesn't care.
>
>When I was responsible for such a statement, I had a "batch" app that
>was truly only sensitive to kernel and glibc -- and that's EXACTLY
>the
>requirement I enumerated, a minimal set of kernel and glibc
>requirements; eventually it needed to become a bounded range. I also
>had a substantial history with the app and a sufficient understanding
>of
>the kernel and glibc to confidently make the claim.
>
>A sister app, a large graphic app was much more sensitive to the
>environment, e.g., C++ libs, graphics device drivers, yada, yada,
>yada.
>Despite my best efforts, they stuck with RHL x.y requirements. They
>too
>had a substantial history with their app that supported their
>position;
>they also thought I was out of my mind--perhaps, but not to this
>point.
>
>> > revenue associated with certifying an already released app on a
>>new
>OS
>> > -- it's purely a cost factor unless the ISV is releasing new code.
>>
>> managing cost is a good thing. but doing so does not necessarily
>>mean
>> that you also have to screw your customers. hey, with a little spin
>> we can even make certification sound like we're
>>_doing_them_a_favor_!
>
>Many customers DEMAND certification, most accepted policy
>explanations
>and our continued specific performance in lieu of certification.
> But,
>even so, on a couple of occasions I was ultimately required to
>provide
>distro-release.version-specific certification to specific customers
>based solely on their insistence--this I do blame on the lawyers...
>
>--
>dnl
>
>My comments represent my opinions, not those of Intel Corporation.
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Wed Jul 13 12:31:48 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed, 13 Jul 2005 20:31:48 +0400
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core Opteron
275performance]
In-Reply-To: <42D39A5B.8050300@crs4.it>
Message-ID:
In message from Alan Louis Scheinine (Tue, 12 Jul
2005 12:24:27 +0200):
> 1) Gerry Creager wrote "Hoowa!"
> Since the results seem useful, I would like to add the following.
> On dual-CPU boards with Athlon32 CPUs, the program "bolam" was slow if
> both CPUs on the board were used; it was better to have one MPICH
> process per compute node. This problem did not appear in another
> cluster that had Opteron dual-CPU boards (single-core), that is, two
> processes for each node did not cause a slowdown. This is an
> indication that "bolam" is at a threshold for memory access being a
> bottleneck.
The original post by S.Gorelsky (re-sent by E.Leitl) was about good
scalability of a 4-core/dual-chip Opteron 275 server on the Gaussian 03
DFT/test397 test. I'm testing right now a similar Supermicro server
w/2*Opteron 275, but w/DDR333 instead of the DDR400 used by S.Gorelsky.
I used SuSE 9.0 w/2.4.21 kernel.
I understand that the original results of S.Gorelsky were probably
obtained with shared-memory parallelization! If I use G03 w/Linda (which
is the main parallelization tool for G03 - parallelization in the shared
memory model of G03 is available only for a more restricted subset
of quantum-chemical methods), then the results are much worse.
On 4 cores I obtained a speedup of only 2.95 for Linda vs 3.6 for
shared memory. The difference, as I understand it, is simply due to
data exchanges through RAM in the Linda case; in the shared memory
model such memory traffic is absent.
FYI: the speedup by S.Gorelsky for 4 CPUs is 3.4 (I hope I calculated
properly :-)).
I also obtained similar results for other quantum-chemical methods,
which show that using Linda/G03 may give bad scalability on
dual-core Opterons.
We also have a quantum-chemical application (developed by us) which
is bandwidth-limited under parallelization, and using 1 CPU (1 MPI
process) per dual-Xeon node with Myrinet/MPICH is strongly preferred.
In the case of nodes with dual single-core Opterons the situation is
better.
But now, on a 4-core/2-chip Opteron node, forcing the use of only 2 of
the 4 cores (one per chip) will require CPU affinity support in Linux.
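(A rough, untested sketch of what I mean; it assumes a kernel and glibc
that export sched_setaffinity(), and core number 0 is only an example:)

    /* untested sketch: pin the calling process to core 0, i.e. the
       first core of the first chip, before starting the computation */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);                 /* allow core 0 only */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
            perror("sched_setaffinity");
        /* ... real computation starts here ... */
        return 0;
    }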
Yours
Mikhail
> A complication for this interpretation is that the Athlon32 nodes use
> Linux kernel 2.4.21.
> 2) Mikhail Kuzminsky asked "do you have "node interleave memory"
> switched off?
> Reading the BIOS:
> Bank interleaving "Auto", there are two memory modules per CPU so
> there should be bank interleaving.
> Node interleaving "Disable"
> 3) In an email Guy Coates asked
> > Did you need to use numa-tools to specify the CPU placement, or did
> > the kernel "do the right thing" by itself?
> The kernel did the right thing by itself.
> I have a question: what are numa-tools?
> On the computer I find
> man -k numa
> numa (3) - NUMA policy library
> numactl(8) - Control NUMA policy for processes or shared memory
> rpm -qa | grep -i numa
> numactl-0.6.4-1.13
> Is numactl the "numa-tools"? Is there another package to consider
> installing?
> I see that numactl has many "man" pages.
>
>Reference, previous message:
> >In all cases, 4 MPI processes on a machine with 4 cores (two
> >dual-core CPUs).
> >Meteorology program 1, "bolam" CPU time, real time (in seconds)
> > Linux kernel 2.6.9-11.ELsmp 122 128
> > Linux kernel 2.6.12.2 64 77
> >
> >Meteorology program 2, "non-hydrostatic"
> > Linux kernel 2.6.9-11.ELsmp 598 544
> > Linux kernel 2.6.12.2 430 476
>
>
>--
>
> Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
> Center for Advanced Studies, Research, and Development in Sardinia
>
> Postal Address: | Physical Address for FedEx, UPS, DHL:
> --------------- | -------------------------------------
> Alan Scheinine | Alan Scheinine
> c/o CRS4 | c/o CRS4
> C.P. n. 25 | Loc. Pixina Manna Edificio 1
> 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
>
> Email: scheinin at crs4.it
>
> Phone: 070 9250 238 [+39 070 9250 238]
> Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
> Operator at reception: 070 9250 1 [+39 070 9250 1]
> Mobile phone: 347 7990472 [+39 347 7990472]
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From landman at scalableinformatics.com Wed Jul 13 12:56:09 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 13 Jul 2005 12:56:09 -0400
Subject: [Beowulf]
[gorelsky@stanford.edu: CCL:dual-core Opteron 275performance]
In-Reply-To:
References:
Message-ID: <42D547A9.50208@scalableinformatics.com>
Hi Mikhail:
If you use numactl, you should have control over processor affinity
for a particular process. I am not sure how this ties in to MPI though,
so there may need to be some work there.
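For example (untested, and the flag names are from the 0.6.x-era numactl
man page, so check your version), launching each rank under numactl with
matching cpu and memory nodes should keep a process and its pages on the
same chip:

    numactl --cpubind=0 --membind=0 ./your_app   # chip 0 + its local memory
    numactl --cpubind=1 --membind=1 ./your_app   # chip 1 + its local memory

Here ./your_app stands in for whatever the MPI launcher actually starts
on the node.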
Joe
Mikhail Kuzminsky wrote:
> In message from Alan Louis Scheinine (Tue, 12 Jul
> 2005 12:24:27 +0200):
>
>> 1) Gerry Creager wrote "Hoowa!"
>> Since the results seem useful, I would like to add the following.
>> On dual-CPU boards with Athlon32 CPUs, the program "bolam" was
>> slow if
>> both CPUs on the board were used, it was better to have one MPICH
>> process
>> per compute node. This problem did not appear in another cluster
>> that had
>> Opteron dual-CPU boards (single-core), that is, two processes for
>> each node
>> did not cause a slowdown. This is an indication that "bolam" is at a
>> threshold for memory access being a bottleneck.
>
> The original post by S.Gorelsky (re-sent by E.Leitl) was about good
> scalability of a 4-core/dual-chip Opteron 275 server on the Gaussian 03
> DFT/test397 test. I'm testing right now a similar Supermicro server
> w/2*Opteron 275, but w/DDR333 instead of the DDR400 used by S.Gorelsky.
> I used SuSE 9.0 w/2.4.21 kernel.
>
> I understand that the original results of S.Gorelsky were probably
> obtained with shared-memory parallelization! If I use G03 w/Linda (which
> is the main parallelization tool for G03 - parallelization in the shared
> memory model of G03 is available only for a more restricted subset
> of quantum-chemical methods), then the results are much worse.
>
> On 4 cores I obtained a speedup of only 2.95 for Linda vs 3.6 for
> shared memory. The difference, as I understand it, is simply due to
> data exchanges through RAM in the Linda case; in the shared memory
> model such memory traffic is absent.
> FYI: the speedup by S.Gorelsky for 4 CPUs is 3.4 (I hope I calculated
> properly :-)).
>
> I also obtained similar results for other quantum-chemical methods,
> which show that using Linda/G03 may give bad scalability on
> dual-core Opterons.
> We also have a quantum-chemical application (developed by us) which
> is bandwidth-limited under parallelization, and using 1 CPU (1 MPI
> process) per dual-Xeon node with Myrinet/MPICH is strongly preferred.
> In the case of nodes with dual single-core Opterons the situation is
> better.
>
> But now, on a 4-core/2-chip Opteron node, forcing the use of only 2 of
> the 4 cores (one per chip) will require CPU affinity support in Linux.
>
> Yours
> Mikhail
>
>> A complication for this
>> interpretation is that the Athlon32 nodes use Linux kernel 2.4.21.
>> 2) Mikhail Kuzminsky asked "do you have "node interleave memory"
>> switched off?
>> Reading the BIOS:
>> Bank interleaving "Auto", there are two memory modules per CPU so
>> there
>> should be bank interleaving.
>> Node interleaving "Disable"
>> 3) In an email Guy Coates asked
>> > Did you need to use numa-tools to specify the CPU placement, or
>> did the
>> > kernel "do the right thing" by itself?
>> The kernel did the right thing by itself.
>> I have a question: what are numa-tools?
>> On the computer I find
>> man -k numa
>> numa (3) - NUMA policy library
>> numactl(8) - Control NUMA policy for processes or shared memory
>> rpm -qa | grep -i numa
>> numactl-0.6.4-1.13
>> Is numactl the "numa-tools"? Is there another package to consider
>> installing?
>> I see that numactl has many "man" pages.
>>
>> Reference, previous message:
>> >In all cases, 4 MPI processes on a machine with 4 cores (two
>> dual-core CPUs).
>> >Meteorology program 1, "bolam" CPU time, real time (in seconds)
>> > Linux kernel 2.6.9-11.ELsmp 122 128
>> > Linux kernel 2.6.12.2 64 77
>> >
>> >Meteorology program 2, "non-hydrostatic"
>> > Linux kernel 2.6.9-11.ELsmp 598 544
>> > Linux kernel 2.6.12.2 430 476
>>
>>
>> --
>>
>> Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
>> Center for Advanced Studies, Research, and Development in Sardinia
>>
>> Postal Address: | Physical Address for FedEx, UPS, DHL:
>> --------------- | -------------------------------------
>> Alan Scheinine | Alan Scheinine
>> c/o CRS4 | c/o CRS4
>> C.P. n. 25 | Loc. Pixina Manna Edificio 1
>> 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy
>>
>> Email: scheinin at crs4.it
>>
>> Phone: 070 9250 238 [+39 070 9250 238]
>> Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220]
>> Operator at reception: 070 9250 1 [+39 070 9250 1]
>> Mobile phone: 347 7990472 [+39 347 7990472]
>>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From James.P.Lux at jpl.nasa.gov Wed Jul 13 13:29:04 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed, 13 Jul 2005 10:29:04 -0700
Subject: [Beowulf] SuSE 9.3
Message-ID: <6.1.1.1.2.20050713101908.02611420@mail.jpl.nasa.gov>
At 07:23 AM 7/13/2005, Lombard, David N wrote:
>From: Mark Hahn on Tuesday, July 12, 2005 5:27 PM
> >
> > > Corporate users and ISVs don't want to see the OS revised more than
> > > once a year.
> >
> > which is sad, really. they've been so traumatized by the dominant
> > platform that they expect that changing anything will break everything.
> > the very concept of a standard, let alone an interface standard (ABI)
> > is foreign to this mentality.
Aside from compatibility issues, there's a not-insignificant cost to
rolling out a change to thousands of desktops, some small fraction of which
WILL break for one reason or another, no matter what OS you're
running. Say you're a corporate IT manager responsible for 10,000 desktop
machines. If 0.1% of those machines break (which is a very small number),
you've got to handle 10 calls, each of which probably costs you somewhere
between $500-1000 (staff to handle it, the lost productivity of the person
whose machine broke, etc.).
Say you roll out the change in the daytime.. there's going to be some
disruption of what's going on with each desktop. Say it costs 15 minutes
for each user.. times 10,000 users, that's 2500 work hours, conservatively
well over $100K worth. Assuming all goes well. If some fraction of those
users decide to call the help line because "something weird is going on
with my PC", you've just radically increased the cost of the roll out.
So, you say, roll it out at night. Then, some fairly significant fraction
of the machines won't get the update because the user has turned it off
(despite broadcast messages and exhortations to "please leave your computer
on tonight").
This is more a manifestation of having thousands of machines in the hands
of unsophisticated users, than any particular OS choice. By the way,
sophisticated users are actually worse: They notice that something weird is
going on and call to ask; They're more likely to have changed the "default
configuration" of the system; They're more likely to have installed some
other software, outside the official configuration management regime.
So, moderate to big shops tend to want to avoid willy-nilly rearrangement
of the computing configuration. Once a year is nice.. You budget $500K or
so for the rollout and its costs (pretesting, support, organization, etc.)
and you're done with it.
James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 13 14:57:40 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 13 Jul 2005 20:57:40 +0200
Subject: [Beowulf] SuSE 9.3
Message-ID: <3.0.32.20050713205739.012988d8@pop3.xs4all.nl>
At 10:29 AM 7/13/2005 -0700, Jim Lux wrote:
>[...]
>This is more a manifestation of having thousands of machines in the hands
>of unsophisticated users, than any particular OS choice. By the way,
>sophisticated users are actually worse: They notice that something weird is
Actually freaks (as I call those 'sophisticated users') are the worst of
them all, because they will complain loudly and request feature X,
which is of only limited use to this person and perhaps someone at
the North Pole, but they still request it, and one in so many requests
convinces even managers and their programming teams to implement that nerd
feature X. That's usually bad, as those freaks make thousands of requests a day.
Convincing programmers to make something that the average person likes is
really difficult, as in their own world the number of features supported is
more important than the ease of use of just a few features.
Vincent
>going on and call to ask; They're more likely to have changed the "default
>configuration" of the system; [...]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 13 14:52:42 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 13 Jul 2005 20:52:42 +0200
Subject: [Beowulf][gorelsky@stanford.edu:CCL:dual-coreOpteron275performance]
Message-ID: <3.0.32.20050713205238.012988d8@pop3.xs4all.nl>
Within 1 node there is no need to use MPI; that's just unnecessary memory
overhead.
Processor affinity comes automatically with a NUMA kernel. More
important for the software than processor affinity is that processors do
their work in memory on their own memory controller, and that they don't work
on data from remote memory controllers.
Vincent
At 12:56 PM 7/13/2005 -0400, Joe Landman wrote:
>Hi Mikhail:
>
> If you use numactl, you should have control over processor affinity
>for a particular process. I am not sure how this ties in to MPI though,
>so there may need to be some work there.
>
>Joe
>
>Mikhail Kuzminsky wrote:
>> [...]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 13 14:50:45 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 13 Jul 2005 20:50:45 +0200
Subject: [Beowulf]
[gorelsky@stanford.edu:CCL:dual-coreOpteron275performance]
Message-ID: <3.0.32.20050713205044.00d37b60@pop3.xs4all.nl>
Hello Mikhail,
AFAIK 2.4 kernels (except some SGI-patched ones) do not have much NUMA
support for dual Opterons.
Is the software doing this (which is how you optimize for NUMA):
- each processor starts, allocates its own shared memory, and
puts data in THAT shared memory (allocating without writing doesn't
make sense, as data gets allocated practically at the moment of writing);
it then attaches to the other processors' shared memory, so all 4 CPUs can
write to each other's RAM. Each CPU then does its calculating in its own
memory. The same goes for cores. What matters is that core X at
memory controller Y should not feed on data from memory controller Z,
as that slows things down significantly.
Is that how the software works?
In that case you should be getting close to a 4.0 scaling,
somewhere in the 3.9x range when using kernel 2.6.x with NUMA turned on.
Diep does exactly the above and gets 3.93 scaling on a quad Opteron,
and 3.92 on a dual-Opteron dual-core (4 cores in total). See sudhian.com
for the exact test, and also for the poor scaling of the dual-core P4.
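(To illustrate: a minimal sketch of that first-touch pattern with SysV shared
memory. The segment size is made up; the point is only that Linux places a
page on the memory controller of the CPU that first writes it.)

#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>

#define SEGSZ (64*1024*1024)   /* illustrative size */

int main(void)
{
    /* each process creates its own segment ... */
    int id = shmget(IPC_PRIVATE, SEGSZ, IPC_CREAT | 0600);
    char *mem = (char *) shmat(id, NULL, 0);

    /* ... and writes it NOW, so the pages land in local RAM
       (first touch); other processes attach later by id and
       read/write this memory remotely when exchanging data */
    memset(mem, 0, SEGSZ);

    /* compute mostly in mem[], on the local controller */

    shmdt(mem);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}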
Best regards,
Vincent
At 08:31 PM 7/13/2005 +0400, Mikhail Kuzminsky wrote:
>[...]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From david.n.lombard at intel.com Wed Jul 13 11:37:33 2005
From: david.n.lombard at intel.com (Lombard, David N)
Date: Wed, 13 Jul 2005 08:37:33 -0700
Subject: [Beowulf] SuSE 9.3
Message-ID: <187D3A7CAB42A54DB61F1D05F0125722065B3A25@orsmsx402.amr.corp.intel.com>
From: Mikhail Kuzminsky Wednesday, July 13, 2005 8:18 AM
>
> In message from "Lombard, David N" (Wed, 13 Jul 2005 07:23:57 -0700):
> >From: Mark Hahn on Tuesday, July 12, 2005 5:27 PM
> >>
> >> > Corporate users and ISVs don't want to see the OS revised more
> >> > than once a year.
> >>
> >> which is sad, really. they've been so traumatized by the dominant
> >> platform that they expect that changing anything will break
> >> everything.
> >> the very concept of a standard, let alone an interface standard (ABI)
> >> is foreign to this mentality.
> >
> >Theory v. practice. Implementation (i.e., compiler and linker output)
> >can substantially impact performance and correctness.
> >
> >As an example, the 2.4.6-2.4.18 range of kernels saw a steady rise in
> >short/small I/O performance at the expense of a steady and
> >significant loss of large-I/O performance.
> Sorry, is that incorrect for the latest 2.4.x, for example 2.4.21?
Sorry, I didn't explicitly state this, but at 2.4.18 (maybe [.17,.19]?)
the large-I/O slowdown was corrected.
> BTW, if you are talking about ext2fs/ext3fs and need huge I/O, why not use
> the XFS file system?
Be assured, XFS *was* used, with very careful tuning to the app.
--
dnl
My comments represent my opinions, not those of Intel Corporation.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From sdm900 at gmail.com Wed Jul 13 19:21:21 2005
From: sdm900 at gmail.com (Stuart Midgley)
Date: Thu, 14 Jul 2005 07:21:21 +0800
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To:
References:
Message-ID:
Actually, this statement isn't quite correct. I have managed to
increase the single-cpu stream number by turning interleaving on.
Streams is effectively getting the benefit of the other memory
controllers. This was very prominent on an HP DL585. I was playing
with the numa-tools and page placement quite a bit and it was very
obvious that as I spread the memory across more numa-nodes the
bandwidth went up.
Of course, your statement is true if you run the streams benchmark
on more than 1 cpu.
Stu.
>>
> The main practical result for a dual-CPU single-core Opteron server
> is: if I turn "Node interleaving" ON, STREAM results will be
> much worse. Therefore some applications will run more slowly.
>
> But what happens in the case of setting it to "AUTO"? "Who"
> (and how) decides the real setting?
>
> Yours
> Mikhail
>
--
Dr Stuart Midgley
sdm900 at gmail.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From sdm900 at gmail.com Wed Jul 13 19:31:35 2005
From: sdm900 at gmail.com (Stuart Midgley)
Date: Thu, 14 Jul 2005 07:31:35 +0800
Subject: [Beowulf]
[gorelsky@stanford.edu: CCL:dual-core Opteron 275performance]
In-Reply-To: <42D547A9.50208@scalableinformatics.com>
References: <42D547A9.50208@scalableinformatics.com>
Message-ID:
We use numa tools to lock each process in an MPI job to a particular
cpu and to set its memory affinity to the directly attached memory.
The result is very effective. We are able to get very reliable
performance.
It required that someone in the group write a new mpirun to do all
the placement (which required lots of hooks into the queue to
determine exactly which cpu's to run on, etc.) and memory policies,
etc. We also use the "job" kernel module to place each task in a job,
which means that a task can't escape our control.
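(For illustration, a minimal sketch of the placement step such a wrapper
performs, using libnuma; the helper name and node choice are hypothetical,
and it assumes the numa.h that ships with numactl.)

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/* pin the calling process (one MPI rank) to one NUMA node:
   both its CPU affinity and its memory allocations */
static void place_on_node(int node)
{
    if (numa_available() < 0) {
        fprintf(stderr, "kernel has no NUMA support\n");
        exit(1);
    }
    numa_run_on_node(node);    /* run only on this node's cores */
    numa_set_preferred(node);  /* prefer this node's local memory */
}

A wrapper mpirun could call something like this with node = rank modulo the
node count before exec'ing the application binary.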
To segue to another discussion, we are using SUSE and it is quite
nice. It is very progressive in the kernel modules it includes and
works very well (it also happens to be the only Linux that SGI
supports). Of course, running YAST completely blows away our
customisations, so we avoid using any of the configuration tools and
just modify the files ourselves.
Stu.
On 14/07/2005, at 0:56, Joe Landman wrote:
> Hi Mikhail:
>
> If you use numactl, you should have control over processor
> affinity for a particular process. I am not sure how this ties in
> to MPI though, so there may need to be some work there.
>
> Joe
>
--
Dr Stuart Midgley
sdm900 at gmail.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From joachim at ccrl-nece.de Thu Jul 14 03:20:45 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 14 Jul 2005 09:20:45 +0200
Subject: [Beowulf] Open position in Clustercomputing/MPI R&D (NEC Europe)
Message-ID: <42D6124D.4050300@ccrl-nece.de>
Dear all,
although I rarely see job offers on this list, I dare to point you to a current
open position in our lab in the area of Linux Clustercomputing/MPI R&D (see
http://www.ccrl-nece.de). I hope you don't mind.
thanks, Joachim
--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Thu Jul 14 08:46:48 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Thu, 14 Jul 2005 16:46:48 +0400
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
Message-ID:
In message from "S.I.Gorelsky" (Wed, 13 Jul
2005 12:21:32 -0700 (PDT)):
>
>
>>The original post by S.Gorelsky (re-sent by E.Leitl) was about good
>>scalability of a 4-core/dual-CPU Opteron 275 server on the Gaussian 03
>>DFT/test397 test. I'm just now testing a similar Supermicro server
>>w/2*Opteron 275, but w/DDR333 instead of the DDR400 used by S.Gorelsky.
>>I used SuSE 9.0 w/2.4.21 kernel.
>
>>I understand that the original results of S.Gorelsky were probably
>>obtained with shared-memory parallelization!
>
>This is correct. I did not use Linda parallelization. The reason
>for that is Linda's "far-from-best" scalability with Gaussian 03.
It depends. DFT, for example (and not only DFT), has good scalability
under Linda also (about 1.8-1.9 for every doubling of the number of
CPUs).
And I don't remember: what besides HF and DFT is parallelized in
G03 for shared memory?
Yours
Mikhail
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From ctibirna at giref.ulaval.ca Wed Jul 13 15:50:46 2005
From: ctibirna at giref.ulaval.ca (Cristian Tibirna)
Date: Wed, 13 Jul 2005 15:50:46 -0400
Subject: was OT -> Re: [Beowulf] SuSE 9.3
In-Reply-To: <6.1.1.1.2.20050713101908.02611420@mail.jpl.nasa.gov>
References: <6.1.1.1.2.20050713101908.02611420@mail.jpl.nasa.gov>
Message-ID: <200507131550.47063.ctibirna@giref.ulaval.ca>
On 13 July 2005 13:29, Jim Lux wrote:
> [...]
> So, moderate to big shops tend to want to avoid willy-nilly rearrangement
> of the computing configuration. Once a year is nice.. You budget $500K or
> so for the rollout and its costs (pretesting, support, organization, etc.)
> and you're done with it.
>
>
So long live braindead terminals connected via X, VNC, NX (especially the
latter) to the _very_ few central servers, which can be -- yes, let's come
back on topic -- Linux clusters, even Beowulf ones, even running the other
damn OS in virtual machines; clusters equipped with HA (APM, backup,
whatever). And all these costs you mention will simply go away. _And_ you
gain the freedom of having users run roaming sessions, even from their car in
the parking lot outside if needed, as well as the freedom of running huge
jobs on readily available numerical behemoths if you ever need to (granted that
all code would finally be written with parallelism in mind).
It's a bitter irony that after 35 years (yeah, almost as long as I have
actually lived) we come back to the realization that, at least in IT, some
amount of centralization is better. Then upgrades don't depend on freaks or
dummies or on factor-10e5 plannings.
--
Cristian Tibirna (418) 656-2131 / 4340
Laval University - Québec, CAN ... http://www.giref.ulaval.ca/~ctibirna
Research professional - GIREF ... ctibirna at giref.ulaval.ca
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kinghorn at pqs-chem.com Wed Jul 13 11:06:09 2005
From: kinghorn at pqs-chem.com (Don Kinghorn)
Date: Wed, 13 Jul 2005 10:06:09 -0500
Subject: [Beowulf] Tip for SuSE9.3 + dual-core Opteron
Message-ID: <200507131006.09263.kinghorn@pqs-chem.com>
Hi All,
I just wanted to pass on this message that I got from Tyan yesterday. It's
from a FAQ for the S2891 motherboard ...
"
Why does the installation of SuSE 9.3 complete successfully, but when you
try to boot, only a black screen appears?
This is a kernel bug. You have to boot with the kernel parameter "maxcpus=1",
then rebuild with kernel version 2.6.12-rc3:
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.12-rc3.tar.bz2
"
--
Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC
http://www.pqs-chem.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From i.kozin at dl.ac.uk Thu Jul 14 06:25:12 2005
From: i.kozin at dl.ac.uk (Kozin, I (Igor))
Date: Thu, 14 Jul 2005 11:25:12 +0100
Subject: [Beowulf]
[gorelsky@stanford.edu: CCL:dual-core Opteron275performance]
Message-ID: <77673C9ECE12AB4791B5AC0A7BF40C8F0147D611@exchange02.fed.cclrc.ac.uk>
> But now for 4cores/2CPUs per Opteron node to force the using of
> only 2 cores (from 4), by 1 for each chip, we'll need to have
> cpu affinity support in Linux.
Mikhail,
you can use "taskset" for that purpose.
For example, (perhaps not in the most elegant form)
mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
But I doubt you want to let the idle cores do something else
in the meantime. However small, you will generally see an increase
in performance if you use all the cores.
Best,
Igor
I. Kozin (i.kozin at dl.ac.uk)
CCLRC Daresbury Laboratory
tel: 01925 603308
http://www.cse.clrc.ac.uk/disco
Distributed Computing Forum
http://www.cse.clrc.ac.uk/disco/forums/ubbthreads.php?Cat=0
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kinghorn at pqs-chem.com Wed Jul 13 12:49:49 2005
From: kinghorn at pqs-chem.com (Don Kinghorn)
Date: Wed, 13 Jul 2005 11:49:49 -0500
Subject: [Beowulf] Re: Beowulf Digest, Vol 17, Issue 17
In-Reply-To: <200507122222.j6CMM6LD022620@bluewest.scyld.com>
References: <200507122222.j6CMM6LD022620@bluewest.scyld.com>
Message-ID: <200507131149.49847.kinghorn@pqs-chem.com>
Hi Mark,
Yes, for the most part the BIOS settings on the 2875 don't have much impact on
performance. [It was an inexpensive ATX form-factor board which I needed at
the time, and real performance with our code was pretty good.] The 2891 looks
to be a MUCH better board (and gives marginally better performance with our
code, as is expected).
I had to set both interleaving settings to "Auto" on the 2875 to get it to
play nice with 4GB memory when the AMD 248 Opterons changed from "model 5
stepping 10" to "model 37 stepping 0". It was just a painful trial-and-error
process to get things to work ... no performance impact ... no real
sensible reason.
"you do that voodoo so well" :-)
-Don
> > >there are only 4 slots on the Tyan 2875 (I had mistakenly reported
> > > yesterday
> >
> > I'm not seeing anywhere at Tyan an indication this board can take
> > advantage of NUMA.
>
> node interleave is meaningless for the 2875, since the board only has
> memory attached to one CPU. while the bios probably does include the
> ACPI table that informs the kernel's k8-numa code, it's moot, since
> there's no way to arrange cpu-proc affinity to minimize non-local
> accesses. (except by not using the second socket, of course!)
>
> I'd expect NUMA support to make more of a difference on 4-socket systems,
> since on them, a process can be >1 hop away from memory. on a 2-socket
> system, it's probably still worth doing, but can't be all that critical.
>
> naturally, latency-sensitive codes (big but with poor locality) will
> show a bigger difference.
>
> > >Bank interleaving "Auto"
>
> I tried to measure this on a dual, and couldn't. it's hard to see,
> based on the low-level hardware specs, why it would matter much.
> yes, bank interleave should reduce the amount of time waiting on
> bank misses, but it's certainly not visible to Stream.
>
> > >Node interleaving "Auto"
>
> turning this to on essentially defeats NUMA; it could be the right thing
> for some codes/systems, since it means that no process has any special
> affinity for a particular socket.
--
Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC
http://www.pqs-chem.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From turuncu at be.itu.edu.tr Thu Jul 14 04:29:01 2005
From: turuncu at be.itu.edu.tr (Ufuk Utku Turuncoglu)
Date: Thu, 14 Jul 2005 11:29:01 +0300
Subject: [Beowulf] Re: Contents of Beowulf digest Vol 17, Issue 12,
hybrid (openmp+mpi) job submit (solution)
Message-ID: <42D6224D.3020606@be.itu.edu.tr>
hi,
I finally found the solution to my problem. The problem was related
to the default shell (/bin/ksh, in NIS).
I changed my default shell to tcsh and edited the .tcshrc file to
define OMP_NUM_THREADS as 2.
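(As a quick check that the variable really reaches every rank, a small hybrid
test program helps; this is just a sketch and assumes an OpenMP-capable MPI
compiler wrapper such as mpicc with the OpenMP flag enabled.)

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* omp_get_max_threads() reflects OMP_NUM_THREADS as this
       rank sees it on its own node */
    printf("rank %d sees %d OpenMP threads\n", rank, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}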
that is all,
thanks for your help,
ufuk
my previous mail in list,
I am trying to run a job that is parallelized using the OpenMP and MPI
programming interfaces (hybrid). I need to run the MPI job on each node as an
OpenMP job; for this reason, I have to define the OMP_NUM_THREADS environment
variable on each one of the nodes. First I tried to put it into the .profile
file, but that was not successful. I also tried to write an LSF job script
and failed too. The LSF script is as follows:
#!/bin/ksh
#BSUB -J MM5_RUN # job name
#BSUB -n 2 # sum of number of tasks
#BSUB -R "span[ptile=1]" # number of processes per node
#BSUB -m "cn07 cn08" # run host
#BSUB -o mm5lsf.out # output file name
#BSUB -q cigq # queue name
#BSUB -L /bin/bash #
#BSUB -E "export OMP_NUM_THREADS=2"
. ${PWD}/mm5.deck.par
time mpirun -np 2 -machinefile ../machfile ./mm5.mpp
In this case, the job runs on each of the specified nodes in single-processor
mode (except on the execution host, because that is the same machine I am
logged in on, so the OMP_NUM_THREADS environment variable comes from the
.profile file).
How can I run a command (or script) on each node just before running the MPI
executable?
thanks,
Ufuk Utku Turuncoglu
Istanbul Technical University
Informatics Institute
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kinghorn at pqs-chem.com Wed Jul 13 12:22:29 2005
From: kinghorn at pqs-chem.com (Don Kinghorn)
Date: Wed, 13 Jul 2005 11:22:29 -0500
Subject: [Beowulf] Re: Beowulf Digest, Vol 17, Issue 18
In-Reply-To: <200507131440.j6DEePZ2012450@bluewest.scyld.com>
References: <200507131440.j6DEePZ2012450@bluewest.scyld.com>
Message-ID: <200507131122.29340.kinghorn@pqs-chem.com>
... phhhtttt! :-)
something like that ... you know what I mean :-)
-Don
> > The dual-core system had 4 one GB modules arranged 2 for each cpu.
>
> To be anal-hyphen-retentive, don't you mean "2 for each socket"?
>
> -- greg
--
Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC
http://www.pqs-chem.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From gorelsky at stanford.edu Wed Jul 13 15:21:32 2005
From: gorelsky at stanford.edu (S.I.Gorelsky)
Date: Wed, 13 Jul 2005 12:21:32 -0700 (PDT)
Subject: [Beowulf] Opteron 275 performance
Message-ID:
>The original post by S.Gorelsky (re-sent by E.Leitl) was about good
>scalability of a 4-core/dual-CPU Opteron 275 server on the Gaussian 03
>DFT/test397 test. I'm just now testing a similar Supermicro server
>w/2*Opteron 275, but w/DDR333 instead of the DDR400 used by S.Gorelsky.
>I used SuSE 9.0 w/2.4.21 kernel.
>I understand that the original results of S.Gorelsky were probably
>obtained with shared-memory parallelization!
This is correct. I did not use Linda parallelization. The reason
for that is Linda's "far-from-best" scalability with Gaussian 03.
Thus, we only use shared memory parallelization for Gaussian 03 and only
those results are shown on http://www.sg-chem.net/cluster/
Best regards,
Serge Gorelsky
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From jbowen at hpcsystems.com Wed Jul 13 11:33:53 2005
From: jbowen at hpcsystems.com (Jim Bowen)
Date: Wed, 13 Jul 2005 08:33:53 -0700
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
In-Reply-To: <42D50B27.4090905@scalableinformatics.com>
References: <3.0.32.20050712112207.01331800@pop3.xs4all.nl>
<200507121036.15586.kinghorn@pqs-chem.com>
<20050713040016.GA6658@greglaptop.ip3networks.com>
<1121235506.5923.151.camel@vigor13>
<42D50B27.4090905@scalableinformatics.com>
Message-ID: <1121268833.22359.16.camel@pacifica>
On Wed, 2005-07-13 at 08:37 -0400, Joe Landman wrote:
> Hi John:
>
> We find that we are talking about "per core" to our customers now. I
> explain that previously, there has been an implicit 1-to-1 mapping
> between processor cores and chips, so that you could talk about either
> one and mean the other. Now however, we are talking about per core, as
> things like licensing (lmgrd) aren't going to count chips or sockets,
> but will count cores.
>
Unfortunately it isn't going to be that straightforward ... some vendors
are going to go by sockets/chips (Microsoft, Red Hat) and others by
cores (your example of lmgrd and I believe Oracle). Confusion will
abound. Oh well.
--
Jim Bowen
Dir. Systems Engineering
HPC Systems, Inc.
http://www.hpcsystems.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From david.n.lombard at intel.com Thu Jul 14 09:59:21 2005
From: david.n.lombard at intel.com (Lombard, David N)
Date: Thu, 14 Jul 2005 06:59:21 -0700
Subject: [Beowulf] Re: dual core Opteron performance - re suse 9.3
Message-ID: <187D3A7CAB42A54DB61F1D05F0125722065B4218@orsmsx402.amr.corp.intel.com>
From: Jim Bowen on Wednesday, July 13, 2005 8:34 AM
> On Wed, 2005-07-13 at 08:37 -0400, Joe Landman wrote:
> >
> > We find that we are talking about "per core" to our customers now. I
> > explain that previously, there has been an implicit 1-to-1 mapping
> > between processor cores and chips, so that you could talk about either
> > one and mean the other. Now however, we are talking about per core, as
> > things like licensing (lmgrd) aren't going to count chips or sockets,
> > but will count cores.
>
> Unfortunately it isn't going to be that straightforward ... some vendors
> are going to go by sockets/chips (Microsoft, Red Hat) and others by
> cores (your example of lmgrd and I believe Oracle). Confusion will
> abound. Oh well.
I suspect MS & RH and Macrovision (FLEXlm) & Oracle were actually
talking about different things, related to *what* their licensing is
counting.
I haven't seen a lot of confusion about socket (or chip) as it refers to
a physical thing; similarly, core also appears reasonably well
understood.
The term "CPU" appears to be the problem...
--
dnl
My comments represent my opinions, not those of Intel Corporation.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Thu Jul 14 11:29:40 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Thu, 14 Jul 2005 19:29:40 +0400
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-core
Opteron275performance]
In-Reply-To: <77673C9ECE12AB4791B5AC0A7BF40C8F0147D611@exchange02.fed.cclrc.ac.uk>
Message-ID:
In message from "Kozin, I \(Igor\)" (Thu, 14 Jul
2005 11:25:12 +0100):
>> But now for 4cores/2CPUs per Opteron node to force the using of
>> only 2 cores (from 4), by 1 for each chip, we'll need to have
>> cpu affinity support in Linux.
>
>Mikhail,
>you can use "taskset" for that purpose.
>For example, (perhaps not in the most elegant form)
> mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
>But I doubt you want to let the idle cores do something else
>in the meantime. However small, you will generally see an increase
>in performance if you use all the cores.
Thanks!
AFAIK, taskset isn't part of the numactl package. Sorry, where is it
possible to download taskset?
As for using all the cores, you are, of course, in general
right ;-)
But there may be some cases where using "pairs of cores" on Opteron
is bad under parallelization. For example, test178/RHF on the same
G03 w/Linda gave worse performance on 2 cores than on 1 core
(because of extra memory traffic).
And, again at least theoretically, I may "occupy" 2 (free) of the total
4 cores in a 2-chip server with, for example, some independent
cache-friendly tasks.
Yours
Mikhail
>
>Best,
>Igor
>
>
>I. Kozin (i.kozin at dl.ac.uk)
>CCLRC Daresbury Laboratory
>tel: 01925 603308
>http://www.cse.clrc.ac.uk/disco
>Distributed Computing Forum
>http://www.cse.clrc.ac.uk/disco/forums/ubbthreads.php?Cat=0
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From i.kozin at dl.ac.uk Thu Jul 14 11:42:00 2005
From: i.kozin at dl.ac.uk (Kozin, I (Igor))
Date: Thu, 14 Jul 2005 16:42:00 +0100
Subject: [Beowulf] [gorelsky@stanford.edu: CCL:dual-coreOpteron275performance]
Message-ID: <77673C9ECE12AB4791B5AC0A7BF40C8F0147D612@exchange02.fed.cclrc.ac.uk>
taskset is part of schedutils by Robert Love http://rlove.org/schedutils/
> AFAIK, taskset isn't part of the numactl package. Sorry, where is it
> possible to download taskset?
> [...]
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Thu Jul 14 11:55:19 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Thu, 14 Jul 2005 19:55:19 +0400
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
Message-ID:
In message from "S.I.Gorelsky" (Thu, 14 Jul
2005 07:18:22 -0700 (PDT)):
>
>> It depends. DFT, for example (and not only DFT), has good scalability
>> under Linda also (about 1.8-1.9 for every doubling of the number of
>> CPUs).
>
>What Gaussian version did you test? Gaussian 98 had a good Linda
>parallelization (at least for DFT).
Of course we had a lot of tasks on G98 also, but ...
>With Gaussian 03, I do not think this is the case.
... I have no data showing that G03 is worse with Linda than G98 was.
We have G03 on a cluster with only 3 nodes (dual-Opteron), and at least for
this small cluster G03/Linda scalability is acceptable.
Did you use the NoFMM G03 keyword for your cluster runs?
In any case, for G03 on a cluster there is no alternative to Linda.
*I forgot to say*: the best way in this case is to use SMP (*within* the
nodes) + Linda between the nodes; this is allowed by G03.
>> And I don't remember: what besides HF and DFT is parallelized in
>> G03 for shared memory?
>Most jobs, not just HF and DFT, can be run in SMP. This is not a
>problem.
Ehh, it looks like my data about Gaussian SMP parallelization is
out of date: in the old days, AFAIK, things like MP2 were not
SMP-parallelized. Now HF, DFT, and CIS (maybe MP2 also) are
SMP-parallelized. What else?
Yours
Mikhail
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From reuti at staff.uni-marburg.de Thu Jul 14 16:28:58 2005
From: reuti at staff.uni-marburg.de (Reuti)
Date: Thu, 14 Jul 2005 22:28:58 +0200
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
References:
Message-ID: <1121372938.42d6cb0ac8c82@home.staff.uni-marburg.de>
Hi Mikhail,
> And I don't remember: what besides HF and DFT is parallelized in
> G03 for shared memory?
the "old" shared-memory model as in G98 is no longer used in G03 C.02; only
OpenMP is supported. In B.05 it was still possible to change the makefile to
get the shared-memory behavior back but, as said, not in C.02. For this I got
confirmation from Gaussian.
Cheers - Reuti
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Thu Jul 14 18:10:45 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Thu, 14 Jul 2005 18:10:45 -0400 (EDT)
Subject: [Beowulf] [gorelsky@stanford.edu:
CCL:dual-coreOpteron275performance]
In-Reply-To: <77673C9ECE12AB4791B5AC0A7BF40C8F0147D612@exchange02.fed.cclrc.ac.uk>
Message-ID:
> taskset is part of schedutils by Robert Love http://rlove.org/schedutils/
it's also worth knowing that sched_setaffinity does all the work (not much!).
for most normal systems, setting affinity (especially to a single core)
is probably going to have bad effects sometimes - for instance, on interrupt
handling, which has a similar bitmask-based interface:
echo 1 > /proc/irq/0/smp_affinity
I've wondered whether this makes sense to do on Opteron systems where
the IO bridge is attached to a single socket.
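(For reference, a minimal sketch of the call itself; it assumes the glibc
cpu_set_t interface available with 2.6-era kernels.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);    /* start from an empty affinity mask */
    CPU_SET(0, &mask);  /* allow core 0 only */
    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* from here on the scheduler keeps this process on core 0 */
    return 0;
}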
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From rupinder.bhangu at gmail.com Mon Jul 18 01:03:38 2005
From: rupinder.bhangu at gmail.com (rupinder bhangu)
Date: Sun, 17 Jul 2005 22:03:38 -0700
Subject: [Beowulf] help a newbie
Message-ID: <87f1c381050717220375a693a6@mail.gmail.com>
hi
I am Rupinder. I am a final-year student. I have planned to work on the topic
of Beowulf clusters during my six months of training. I have also gone through
some of the sites and other material on the Internet to gather the basic
information about Beowulfs, because I had to convince my teachers to allow me
to work on this topic. Having done that successfully, I would now like to
have help from people who are experienced in this field. I am really
a newbie in this field, but I want to do it. Could you please tell me where
to start, how to proceed, and what related material you think would be useful
for me? Could you also tell me whether a period of 6 months is adequate for a
person like me to build a cluster with 3-4 nodes successfully?
Thanks
Rupinder Kaur
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kc40 at hw.ac.uk Sun Jul 17 22:29:25 2005
From: kc40 at hw.ac.uk (Cheng, Kevin )
Date: Mon, 18 Jul 2005 03:29:25 +0100
Subject: [Beowulf] Dynamic Processes woes :S
Message-ID: <2C104E0B5F7AFE4CA6D266BE280EF72B23BBB6@ex1.mail.win.hw.ac.uk>
I have read the install and user guides for MPICH2 and could not find any
information on how to "dynamically create/destroy individual MPI
processes on specific host machines" - as specified in the new features of
MPICH2.
Can someone please tell me how to dynamically create/destroy MPI processes
on specific host machines from one host?
Also, does MPICH-G (the grid version) contain the above functionality as
well? Please tell me more details.
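(For what it's worth, the MPI-2 interface for this is MPI_Comm_spawn; below is
a minimal sketch. The "host" info key and the ./worker binary are illustrative
assumptions - whether placement is honored depends on the process manager.)

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    /* ask the process manager to place the children on one host */
    MPI_Info_set(info, "host", "node07");

    /* start 2 copies of ./worker, connected by an intercommunicator */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);

    /* "destroying" dynamic processes just means the children exit via
       MPI_Finalize; the parent then disconnects from them */
    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}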
Many thanks
Kev
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Mon Jul 18 10:38:48 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Mon, 18 Jul 2005 16:38:48 +0200
Subject: [Beowulf] Re: dual core (latency)
Message-ID: <3.0.32.20050718163844.0127d450@pop3.xs4all.nl>
I've been toying a bit with numactl on dual-core and it doesn't
really seem to help much. It helps 0.00.
System: Ubuntu on a quad-Opteron dual-core 1.8GHz, 2.6.10-5 SMP kernel.
Latencies as measured by my own program (TLB-thrashing read of 8 bytes,
each cpu a 250MB buffer):
#cpu latency
1 144-147 ns
2 174 ns
4 206 ns
8 234 ns
That single-cpu figure is pretty ugly, if I may say so.
All kinds of NUMA calls just didn't help a thing. I tried, for example:
if( numa_available() < 0 ) {
setitnuma = 0;
}
else {
int i, back;
nodemask_t nt, rnm;
maxnodes = numa_max_node()+1; // numa_max_node() returns 3 when 4 controllers
printf("numa=%i maxnodes=%i\n", setitnuma, maxnodes);
nt = numa_get_interleave_mask();
for( i = 0 ; i < maxnodes ; i++ )
printf("node = %i in interleave mask = %i\n", i, nodemask_isset(&nt, i));
nodemask_zero(&nt);            /* empty mask: turn interleaving off */
numa_set_interleave_mask(&nt);
nt = numa_get_interleave_mask();
for( i = 0 ; i < maxnodes ; i++ )
printf("checking memory interleave node = %i: %i\n", i, nodemask_isset(&nt, i));
rnm = numa_get_run_node_mask();
for( i = 0 ; i < maxnodes ; i++ )
printf("may run on node %i: %i\n", i, nodemask_isset(&rnm, i));
back = numa_run_on_node(0);    /* bind this process to node 0 */
if( !back )
printf("set to run on node 0\n");
else
printf("failed to set run on node 0\n");
}
Whatever I try, single-cpu latency stays at 144-147 ns.
A dual-Opteron dual-core with 2.2GHz dual-core controllers shows similar
latencies: 200 ns, for example, when running 4 processes with the same
test program.
This single-cpu latency behaviour of the dual-core Opteron is ugly bad
compared to other dual Opterons which are not dual-core.
A nearly identical Tyan mainboard with dual Opteron 2.2GHz gives, single-cpu,
with the SAME kernel and the SAME program, 115 ns latency. When turning off
ECC on that dual Opteron it gets down to 113 ns even.
The frustrating thing is, the dual Opteron 2.2GHz has PC2700,
whereas the quad-Opteron dual-core has all banks filled
with PC3200 registered RAM, A-brand.
At 07:29 PM 7/14/2005 +0400, Mikhail Kuzminsky wrote:
>[...]
-------------- next part --------------
/*-----------------10-6-2003 3:48-------------------*
*
* This program rasml.c measures the Random Average Shared Memory Latency (RASML)
* Thanks to Agner Fog for his excellent random number generator.
*
* This testset uses a 64 bits optimized RNG based on Agner Fog's ranrot generator.
*
* Created by Vincent Diepeveen who hereby releases this under GPL
* Feel free to look at the FSF (free software foundation) for what
* GPL is and its conditions.
*
* Please don't confuse the times achieved here with two times the one
* way pingpong latency, though at
* ideally scaling supercomputers/clusters they will be close. There are a few
* differences:
* a) this is TLB thrashing
* b) this test tests ALL processors at the same time and not
* just 2 cpu's while the rest of the entire cluster is idle.
* c) this test ships 8 bytes whereas one way pingpong typically also
* gets used to test several kilobyte sizes, or just returns a pong.
* d) this doesn't use MPI but shared memory, and the way such protocols are
* implemented possibly matters for latency.
*
* Vincent Diepeveen diep at xs4all.nl
* Veenendaal, The Netherlands 10 june 2003
*
* First a few lines about the random number generator. Note that I modified Agner Fog's
* RanRot very slightly. Basically its initialization has been done better and some dead
* slow FPU code rewritten to fast 64 bits integer code.
*/
#define UNIX 1 /* put to 1 when you are under unix or using gcc or look-alike compilers */
#define IRIX 1 /* this value only matters when UNIX is set to 1. For Linux put to 0
* basically allocating shared memory in linux is done pretty buggily in
* its kernel.
*
* Therefore you might want to do 'cat /proc/sys/kernel/shmmax'
* and look for yourself how much shared memory YOU can allocate in linux.
*
* If that is not enough to benchmark this program then try modifying it with:
* echo <bytes> > /proc/sys/kernel/shmmax
* Be sure you are root when doing that each time the system boots.
*/
#define FREEBSD 0 // be sure to not use more than 2 GB memory with freebsd with this test. sorry.
#if UNIX
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <unistd.h>
#else
#include <windows.h> // for GetTickCount()
#include <process.h> // _spawnl
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#define SWITCHTIME 60000 /* in milliseconds. Modify this to let a test run longer or shorter.
* basically it is a good idea to use about the cpu number times
* thousand for this. 30 seconds is fine for PC's, but a very
* bad idea for supercomputers. I recommend several minutes
* there, and at least a few hours for big supers if the partition isn't started yet.
* If the partition is already started, starting at 460 processors (SGI) should
* take 10 minutes, otherwise it takes 3 hours to attach all.
* Of course that lets a test take way, way longer.
*/
#define MAXPROCESSES 512 /* this test can go up to this amount of processes to be tested */
#define CACHELINELENGTH 128 /* cache line length at the machine. Modify this if you want to */
#if UNIX
#define FORCEINLINE __inline
/* UNIX and such this is 64 bits unsigned variable: */
#define BITBOARD unsigned long long
#else
#define FORCEINLINE __forceinline
/* in WINDOWS we also want to be 64 bits: */
#define BITBOARD unsigned _int64
#endif
#define STATUS_NOTSTARTED 0
#define STATUS_ATTACH 1
#define STATUS_GOATTACH 2
#define STATUS_ATTACHED 3
#define STATUS_STARTREAD 4
#define STATUS_READ 5
#define STATUS_MEASUREREAD 6
#define STATUS_MEASUREDREAD 7
#define STATUS_QUIT 10
struct ProcessState {
volatile int status; /* 0 = not started yet
* 1 = ready to start reading
*
* 10 = quitted
* */
/* now the numbers each cpu gathers. The name of the first number is what
* cpu0 is doing and the second name what all the other cpu's were doing at that
* time
*/
volatile BITBOARD readread; /* */
char dummycacheline[CACHELINELENGTH];
};
typedef struct {
BITBOARD nentries; // number of entries of 64 bits used for cache.
struct ProcessState ps[MAXPROCESSES];
} GlobalTree;
void RanrotAInit(void);
float ToNano(BITBOARD);
int GetClock(void);
float TimeRandom(void);
void ParseBuffer(BITBOARD);
void ClearHash(void);
void DeAllocate(void);
int DoNrng(BITBOARD);
int DoNreads(BITBOARD);
int DoNreadwrites(BITBOARD);
//void TestLatency(float);
int AllocateTree(void);
void InitTree(int);
void WaitForStatus(int,int);
void PutStatus(int,int);
int CheckStatus(int,int);
int CheckAllStatus(int,int);
void Slapen(int);
float LoopRandom(void);
/* define parameters (R1 and R2 must be smaller than the integer size): */
#define KK 17
#define JJ 10
#define R1 5
#define R2 3
/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
/*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
};
int r_p1, r_p2; /* indexes into history buffer */
/* global variables RASML */
BITBOARD *hashtable[MAXPROCESSES],nentries,globaldummy=0;
GlobalTree *tree;
int ProcessNumber,
cpus; // number of processes for this test
#if UNIX
int shm_tree,shm_hash[MAXPROCESSES];
#endif
char rasmexename[2048];
/******************************************************** AgF 1999-03-03 *
* Random Number generator 'RANROT' type B *
* by Agner Fog *
* *
* This is a lagged-Fibonacci type of random number generator with *
* rotation of bits. The algorithm is: *
* X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b *
* *
* The last k values of X are stored in a circular buffer named *
* randbuffer. *
* *
* This version works with any integer size: 16, 32, 64 bits etc. *
* The integers must be unsigned. The resolution depends on the integer *
* size. *
* *
* Note that the function RanrotAInit must be called before the first *
* call to RanrotA or iRanrotA *
* *
* The theory of the RANROT type of generators is described at *
* www.agner.org/random/ranrot.htm *
* *
*************************************************************************/
FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return((x<<r)|(x>>(64-r)));}
/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
/* generate next random number */
BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
/* rotate list pointers */
if( --r_p1 < 0)
r_p1 = KK - 1;
if( --r_p2 < 0 )
r_p2 = KK - 1;
return x;
}
/* this function initializes the random number generator. */
void RanrotAInit(void) {
int i;
/* one can fill the randbuffer with other values here */
randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);
/* initialize pointers to circular buffer */
r_p1 = 0;
r_p2 = JJ;
/* randomize */
for( i = 0; i < 300; i++ )
(void)RanrotA();
}
/* Now the RASML code */
char *To64(BITBOARD x) {
static char buf[256];
char *sb;
sb = &buf[0];
#if UNIX
sprintf(buf,"%llu",x);
#else
sprintf(buf,"%I64u",x);
#endif
return sb;
}
int GetClock(void) {
/* The accuracy is measured in milliseconds. The function used is very accurate according
* to the NT team, way more accurate nowadays than mentioned in the MSDN manual. The accuracy
* for linux or unix we can only guess. Too many experts there.
*/
#if UNIX
struct timeval timeval;
struct timezone timezone;
gettimeofday(&timeval, &timezone);
return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
#else
return((int)GetTickCount());
#endif
}
float ToNano(BITBOARD nps) {
/* convert something from times a second to nanoseconds.
* NOTE THAT THERE ARE SOMETIMES COMPILER BUGS IN OLD COMPILERS,
* WHICH IS WHY MY CODE ISN'T A 1 LINE RETURN HERE. PLEASE DO
* NOT MODIFY THIS CODE */
float tn;
tn = 1000000000/(float)nps;
return tn;
}
float TimeRandom(void) {
/* timing the random number generator is very easy of course. Returns
* number of random numbers a second that can get generated
*/
BITBOARD bb=0,i,value,nps;
float ns_rng;
int t1,t2,took;
printf("Benchmarking Pseudo Random Number Generator speed, RanRot type 'B'!\n");
printf("Speed depends upon CPU and compile options from RASML,\n therefore we benchmark the RNG\n");
printf("Please wait a few seconds.. "); fflush(stdout);
value = 100000;
took = 0;
while( took < 3000 ) {
value <<= 2; // x4
t1 = GetClock();
for( i = 0; i < value; i++ ) {
bb ^= RanrotA();
}
t2 = GetClock();
took = t2-t1;
}
nps = (1000*value)/(BITBOARD)took;
#if UNIX
printf("..took %i milliseconds to generate %llu numbers\n",took,value);
printf("Speed of RNG = %llu numbers a second\n",nps);
#else
printf("..took %i milliseconds to generate %I64 numbers\n",took,value);
printf("Speed of RNG = %I64u numbers a second\n",nps);
#endif
ns_rng = ToNano(nps);
printf("So 1 RNG call takes %f nanoseconds\n",ns_rng);
return ns_rng;
}
void ParseBuffer(BITBOARD nbytes) {
tree->nentries = nbytes/sizeof(BITBOARD);
#if UNIX
printf("Trying to allocate %llu entries. ",tree->nentries);
printf("In total %llu bytes\n",tree->nentries*(BITBOARD)sizeof(BITBOARD));
#else
printf("Trying to allocate %s entries. ",To64(tree->nentries));
printf("In total %s bytes\n",To64(tree->nentries*(BITBOARD)sizeof(BITBOARD)));
#endif
}
void ClearHash(void) {
BITBOARD *hi,i,nentries = tree->nentries;
/* clearing hashtable */
printf("Clearing hashtable for processor %i\n",ProcessNumber);
fflush(stdout);
hi = hashtable[ProcessNumber];
for( i = 0 ; i < nentries ; i++ ) /* very unoptimized way of clearing */
hi[i] = i;
}
void DeAllocate(void) {
int i;
#if UNIX
shmctl(shm_tree,IPC_RMID,0);
for( i = 0; i < cpus; i++ ) {
shmctl(shm_hash[i],IPC_RMID,0);
}
#else
UnmapViewOfFile(tree);
for( i = 0; i < cpus; i++ ) {
UnmapViewOfFile(hashtable[i]);
}
#endif
}
int DoNrng(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2,ncpu;
ncpu = cpus;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD rani=RanrotA(),index=rani%nents;
unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
dummyres ^= (index+(BITBOARD)i2);
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
int DoNreads(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2,ncpu;
ncpu = cpus;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD rani=RanrotA(),index=rani%nents;
unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
dummyres ^= hashtable[i2][index];
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
#if 0
int DoNreadwrites(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD index = RanrotA()%nents;
dummyres ^= hashtable[ProcessNumber][index];
hashtable[ProcessNumber][index] = dummyres;
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
void TestLatency(float ns_rng) {
BITBOARD n,nps_read,nps_rw,nps_rng;
float ns,fns;
int timetaken;
printf("Doing random RNG test. Please wait..\n");
n = 50000000; // 50 mln
timetaken = DoNrng(n);
nps_rng = (1000*n) / (BITBOARD)timetaken;
fns = ToNano(nps_rng);
printf("Machine needs %f ns for RND loop\n",fns);
/* READING SINGLE CPU RANDOM ENTRIES */
printf("Doing random read tests single cpu. Please wait..\n");
n = 100000000; // 100 mln
timetaken = DoNreads(n);
nps_read = (1000*n) / (BITBOARD)timetaken;
ns = ToNano(nps_read);
printf("Machine needs %f ns for single cpu random reads.\nExtrapolated=%f nanoseconds a read\n",ns,ns-fns);
/* READING AND THEN WRITING SINGLE CPU RANDOM ENTRIES */
printf("Doing random readwrite tests single cpu. Please wait..\n");
n = 100000000; // 100 mln
timetaken = DoNreadwrites(n);
nps_rw = (1000*n) / (BITBOARD)timetaken;
ns = ToNano(nps_rw);
printf("Machine needs %f ns for single cpu random readwrites.\n",ns);
printf("Extrapolated=%f nanoseconds a readwrite (to the same slot)\n\n",ns-fns);
printf("So far the useless tests.\nBut we have vague read/write nodes a second numbers now\n");
}
#endif
int AllocateTree(void) { /* initialize the tree. returns 0 if error */
#if UNIX
shm_tree = shmget(
ftok(".",'t'),
sizeof(GlobalTree),IPC_CREAT|0777);
if( shm_tree == -1 )
return 0;
tree = (GlobalTree *)shmat(shm_tree,0,0);
if( tree == (GlobalTree *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
if( !ProcessNumber ) {
HANDLE TreeFileMap;
TreeFileMap = CreateFileMapping((HANDLE)0xFFFFFFFF,NULL,PAGE_READWRITE,0,
(DWORD)sizeof(GlobalTree),"RASM_Tree");
if( TreeFileMap == NULL )
return 0;
tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( tree == NULL )
return 0;
}
else { /* Slaves attach also try to attach to the tree */
HANDLE TreeFileMap;
TreeFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Tree");
if( TreeFileMap == NULL )
return 0;
tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( tree == NULL )
return 0;
}
#endif
return 1;
}
int AttachAll(void) {
#if UNIX
#else
HANDLE HashFileMap;
#endif
char hashname2[32] = {"RASM_Hash00"},hashname[32];
int i,r;
for( r = 0; r < cpus; r++ ) {
i = ProcessNumber+r;
i %= cpus;
if( i == ProcessNumber )
continue;
#if UNIX
shm_hash[i] = shmget(
#if IRIX
ftok(".",200+i),
#else
ftok(".",(char)i),
#endif
tree->nentries*8,IPC_CREAT|0777);
if( shm_hash[i] == -1 )
return 0;
hashtable[i] = (BITBOARD *)shmat(shm_hash[i],0,0);
if( hashtable[i] == (BITBOARD *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
strcpy(hashname,hashname2);
hashname[9] += (i/10);
hashname[10] += (i%10);
HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,hashname);
if( HashFileMap == NULL )
return 0;
hashtable[i] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[i] == NULL )
return 0;
#endif
}
return 1;
}
int AllocateHash(void) { /* initialize the hashtable (cache). returns 0 if error */
char hashname[32] = {"RASM_Hash00"};
#if UNIX
shm_hash[ProcessNumber] = shmget(
#if IRIX
ftok(".",200+ProcessNumber),
#else
ftok(".",(char)ProcessNumber),
#endif
tree->nentries*8,IPC_CREAT|0777);
if( shm_hash[ProcessNumber] == -1 )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)shmat(shm_hash[ProcessNumber],0,0);
if( hashtable[ProcessNumber] == (BITBOARD *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
//if( !ProcessNumber ) {
HANDLE HashFileMap;
hashname[9] += (ProcessNumber/10);
hashname[10] += (ProcessNumber%10);
HashFileMap = CreateFileMapping((HANDLE)0xFFFFFFFF,NULL,PAGE_READWRITE,0,
(DWORD)tree->nentries*8,hashname);
if( HashFileMap == NULL )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[ProcessNumber] == NULL )
return 0;
//}
//else { /* Slaves attach also try to attach to the tree */
/* HANDLE HashFileMap;
HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Hash");
if( HashFileMap == NULL )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[ProcessNumber] == NULL )
return 0;*/
//}
#endif
return 1;
}
int StartProcesses(int ncpus) {
char buf[256];
int i;
/* returns 1 if ncpus-1 started ok */
if( ncpus == 1 )
return 1;
for( i = 1 ; i < ncpus ; i++ ) {
sprintf(buf,"%i_%i",i+1,ncpus);
#if UNIX
if( !fork() )
execl(rasmexename,rasmexename,buf,NULL);
#else
(void)_spawnl(_P_NOWAIT,rasmexename,rasmexename,buf,NULL);
#endif
}
return 1;
}
void InitTree(int ncpus) {
int i;
for( i = 0 ; i < ncpus ; i++ ) {
tree->ps[i].status = STATUS_NOTSTARTED;
tree->ps[i].readread = 0;
}
}
void WaitForStatus(int ncpus,int waitforstate) {
/* wait for all processors to have the same state */
int i,badluck=1;
while( badluck ) {
badluck = 0;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != waitforstate )
badluck = 1;
}
}
}
void PutStatus(int ncpus,int statenew) {
int i;
for( i = 0 ; i < ncpus ; i++ ) {
tree->ps[i].status = statenew;
}
}
int CheckStatus(int ncpus,int statenew) {
/* returns false when not all cpu's are in the new state */
int i;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != statenew )
return 0;
}
return 1;
}
int CheckAllStatus(int ncpus,int status) {
/* Tries with a single loop to determine whether the other cpu's also finished
*
* returns:
* true ==> when all the processes have this status
* false ==> when 1 or more are still busy measuring
*/
int i,badluck=1;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != status ) {
badluck = 0;
break;
}
}
return badluck;
}
void Slapen(int ms) {
#if UNIX
usleep(ms*1000); /* usleep takes microseconds, so this sleeps ms milliseconds */
#else
Sleep(ms); /* Sleep takes milliseconds directly */
#endif
}
float LoopRandom(void) {
BITBOARD n,nps_rng;
float fns;
int timetaken;
printf("Benchmarking random RNG test. Please wait..\n");
n = 25000000; // 25 mln
timetaken = 0;
while( timetaken < 500 ) {
n += n;
timetaken = DoNrng(n);
}
printf("timetaken=%i\n",timetaken);
nps_rng = (1000*n) / (BITBOARD)timetaken;
fns = ToNano(nps_rng);
printf("Machine needs %f ns for RND loop\n",fns);
return fns;
}
/* Example showing how to use the random number generator: */
int main(int argc,char *argv[]) {
/* allocates a big memory buffer; the parameter is in bytes.
* don't hesitate to MODIFY this to how many gigabytes
* you want to try.
* The more the better i keep saying to myself.
*
* Note that under linux your maximum shared memory limit can be set with:
*
* echo <bytes> > /proc/sys/kernel/shmmax
*
* and under IRIX it is usually 80% from the total RAM onboard that can get allocated
*/
BITBOARD nbytes,firstguess;
float ns_rng,f_loop;
int tottimes,t1,t2;
if( argc <= 1 ) {
printf("Latency test usage is: latency \n");
printf("Where 'buffer' is the buffer in number of bytes to allocate PRO PROCESSOR\n");
printf("and where 'cpus' is the number of processes that this test will try to use (1 = default) \n");
return 1;
}
/* parse the input */
nbytes = 0;
cpus = 1; // default
if( strchr(argv[1],'_') == NULL ) { /* main startup process */
int np = 0;
#if UNIX
#if FREEBSD
nbytes = (BITBOARD)atoi(argv[1]); // freebsd doesn't support > 2 GB memory
#else
nbytes = (BITBOARD)atoll(argv[1]);
#endif
#else
nbytes = (BITBOARD)_atoi64(argv[1]);
#endif
printf("Welcome to RASM Latency!\n");
printf("RASML measures the RANDOM AVERAGE SHARED MEMORY LATENCY!\n\n");
if( argc > 2 ) {
cpus = 0;
do {
cpus *= 10;
cpus += (int)(argv[2][np]-'1')+1;
np++;
} while( argv[2][np] >= '0' && argv[2][np] <= '9' );
}
//printf("Master: buffer = %s bytes. #CPUs = %i\n",To64(nbytes),cpus);
ProcessNumber = 0;
/* check whether we are not getting out of bounds */
if( cpus > MAXPROCESSES ) {
printf("Error: Recompile with a bigger stack for MAXPROCESSES. %i processors is too much\n",cpus);
return 1;
}
/* find out the file name */
#if UNIX
strcpy(rasmexename,argv[0]);
#else
GetModuleFileName(NULL,rasmexename,2044);
#endif
printf("Stored in rasmexename = %s\n",rasmexename);
}
else { // latency 2_452 ==> means processor 2 out of 452.
int np = 0;
ProcessNumber = 0;
do {
ProcessNumber *= 10;
ProcessNumber += (argv[1][np]-'1')+1; // n
np++;
} while( argv[1][np] >= '0' && argv[1][np] <= '9' );
ProcessNumber--; // 1 less because of ProcessNumber ==> [0..n-1]
np++; // skip underscore
cpus = 0;
do {
cpus *= 10;
cpus += (argv[1][np]-'1')+1; // n
np++;
} while( argv[1][np] >= '0' && argv[1][np] <= '9' );
//printf("Slave: ProcessNumber=%i cpus=%i\n",ProcessNumber,cpus);
}
/* first we setup the random number generator. */
RanrotAInit();
/* initialize shared memory tree; it gets used for communication between the processes */
if( !AllocateTree() ) {
printf("Error: ProcessNumber %i could not allocate the tree\n",ProcessNumber);
return 1;
}
if( !ProcessNumber )
ParseBuffer(nbytes);
nentries = tree->nentries;
/* Now some stuff only the Master has to do */
if( !ProcessNumber ) {
/* Master: now let's time the pseudo random generators speed in nanoseconds a call */
ns_rng = TimeRandom();
f_loop = LoopRandom();
printf("Trying to Allocate Buffer\n");
t1 = GetClock();
if( !AllocateHash() ) {
printf("Error: Could not allocate buffer!\n");
return 1;
}
t2 = GetClock();
printf("Took %i.%03i seconds to allocate Hash\n",(t2-t1)/1000,(t2-t1)%1000);
ClearHash(); // local hash
t1 = GetClock();
printf("Took %i.%03i seconds to clear Hash\n",(t1-t2)/1000,(t1-t2)%1000);
/* so now hashtable is setup and we know quite some stuff. So it is time to
* start all other processes */
InitTree(cpus);
printf("Starting Other processes\n");
t1 = GetClock();
if( !StartProcesses(cpus) ) {
printf("Error: Could not start processes\n");
DeAllocate();
}
t2 = GetClock();
printf("Took %i milliseconds to start %i additional processes\n",t2-t1,cpus-1);
t1 = GetClock();
}
else { /* all Slaves do this */
if( !AllocateHash() ) {
printf("Error: slave %i Could not allocate buffer!\n",ProcessNumber);
return 1;
}
ClearHash(); // local hash
}
tree->ps[ProcessNumber].status = STATUS_ATTACH;
if( ! ProcessNumber ) {
WaitForStatus(cpus,STATUS_ATTACH);
t2 = GetClock();
printf("Took %i milliseconds to synchronize %i additional processes\n",t2-t1,cpus-1);
t1 = GetClock();
/* now we can continue with the next phase that is attaching all the segments */
PutStatus(cpus,STATUS_GOATTACH);
}
else {
while( tree->ps[ProcessNumber].status == STATUS_ATTACH ) {
Slapen(500);
}
}
if( !AttachAll() ) {
printf("Error: process %i Could not attach correctly!\n",ProcessNumber);
return 1;
}
tree->ps[ProcessNumber].status = STATUS_ATTACHED;
if( ! ProcessNumber ) {
WaitForStatus(cpus,STATUS_ATTACHED);
t2 = GetClock();
printf("Took %i milliseconds to ATTACH. %llu total RAM\n",t2-t1,(BITBOARD)cpus*tree->nentries*8);
PutStatus(cpus,STATUS_STARTREAD);
printf("Read latency measurement STARTS NOW using steps of 2 * %i.%03i seconds :\n",
(SWITCHTIME/1000),(SWITCHTIME%1000));
}
else {
while( tree->ps[ProcessNumber].status == STATUS_ATTACHED ) {
Slapen(500);
}
}
tree->ps[ProcessNumber].status = STATUS_READ;
firstguess = 200000;
tottimes = 0;
for( ;; ) {
int timetaken = 0;
if( tree->ps[ProcessNumber].status == STATUS_MEASUREREAD ) {
/* this really MEASURES the readread */
BITBOARD ntried = 0,avnumber;
int totaltime=0;
while( totaltime < SWITCHTIME ) { /* go measure around switchtime seconds */
totaltime += DoNreads(firstguess);
ntried += firstguess;
}
/* now put the average number of readreads into the shared memory */
avnumber = (ntried*1000) / (BITBOARD)totaltime;
tree->ps[ProcessNumber].readread = avnumber;
/* show that it is finished */
tree->ps[ProcessNumber].status = STATUS_MEASUREDREAD;
/* now keep doing the same thing until status gets modified */
while( tree->ps[ProcessNumber].status == STATUS_MEASUREDREAD ) {
(void)DoNreads(firstguess);
if( !ProcessNumber ) {
if( CheckAllStatus(cpus,STATUS_MEASUREDREAD) ) {
PutStatus(cpus,STATUS_QUIT);
break;
}
}
}
}
else if( tree->ps[ProcessNumber].status == STATUS_READ ) {
BITBOARD nextguess;
/* now the software must try to determine how many reads a second are possible for this
* process
*/
//printf("proc=%i trying %s reads\n",ProcessNumber,To64(firstguess));
timetaken = DoNreads(firstguess);
/* try to guess such that the next test takes 1 second, or if the test was too inaccurate
* then simply double the number. This also prevents a divide by zero error ;)
*/
if( timetaken < 400 )
nextguess = firstguess*2;
else
nextguess = (firstguess*1000)/(BITBOARD)timetaken;
firstguess = nextguess;
if( !ProcessNumber ) {
tottimes += timetaken;
if( tottimes >= SWITCHTIME ) { // 30 seconds to a few minutes
tottimes = 0;
if( CheckStatus(cpus,STATUS_READ) ) {
PutStatus(cpus,STATUS_MEASUREREAD);
} /* waits another SWITCH time before starting to measure */
}
}
}
else if( tree->ps[ProcessNumber].status == STATUS_QUIT )
break;
}
/* now do the latency tests
*/
//TestLatency(ns_rng);
tree->ps[ProcessNumber].status = STATUS_QUIT;
if( !ProcessNumber ) {
BITBOARD averagereadread;
int i;
averagereadread = 0;
WaitForStatus(cpus,STATUS_QUIT);
printf("the raw output\n");
for( i = 0; i < cpus ; i++ ) {
BITBOARD tr=tree->ps[i].readread;
averagereadread += tr;
printf("%llu ",tr);
}
printf("\n");
averagereadread /= (BITBOARD)cpus;
printf("Raw Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread));
printf("Now for the final calculation it gets compensated:\n");
printf(" Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread)-f_loop);
}
DeAllocate();
return 0;
}
/* EOF latencyC.c */
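A typical build and run on Linux, for reference (not part of the original
posting; the gcc flags are illustrative, and the arguments mirror the
250MB-per-process, 4-process figures quoted in this thread):
gcc -O2 -o latency latencyC.c
./latency 250000000 4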
-------------- next part --------------
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From sdm900 at gmail.com Mon Jul 18 20:17:17 2005
From: sdm900 at gmail.com (Stuart Midgley)
Date: Tue, 19 Jul 2005 08:17:17 +0800
Subject: [Beowulf] Re: dual core (latency)
In-Reply-To: <3.0.32.20050718163844.0127d450@pop3.xs4all.nl>
References: <3.0.32.20050718163844.0127d450@pop3.xs4all.nl>
Message-ID:
The numactl tools won't generally help latency. Latency isn't the
issue with Opteron based systems (or any system with multiply
connected distributed memory controllers).
The real issue is page locality (which is the case with most numa
based systems).
If you run 2 processes on a dual cpu (single core) system and they
both happen to allocate their pages on the same memory controller,
they will each see only 1/2 the memory bandwidth while 1 controller
sits idle. That's the real issue (and the extreme pathological case).
Linux 2.6 generally does a good job of putting the pages on the memory
controller attached to the cpu that the process is running on. However,
it can't get it perfect. There is always more than 1 process/cpu on
a system, so there is always a little noise... so there is always the
chance that some pages get spread around. Also, the system buffer
cache will get spread around, affecting everyone.
Add into the mix the possibility of suspending processes and you can
end up with a process's pages all over the place. Since Linux
doesn't yet have page migration, once a page is allocated it won't be
moved to a different memory controller unless it is swapped out.
With the numactl tools you can force the pages to be allocated on the
right memory/cpu. The process's buffer cache will also be locked
down (which is another VERY important issue)...
I have used numa tools to double the performance of some codes (or
perhaps it's more correct to say, to get back to the correct performance).
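For example, binding one process per node with node-local memory (a
sketch, not from Stu's post; it assumes numactl's --cpubind/--membind
options, with ./mycode standing in for your binary):
numactl --cpubind=0 --membind=0 ./mycode &
numactl --cpubind=1 --membind=1 ./mycode &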
Stu.
On 18/07/2005, at 22:38, Vincent Diepeveen wrote:
> I've been toying some with the numactl at dual core and it doesn't
> really seem to help much. It helps 0.00
>
> System: Ubuntu at a quad opteron dual core 1.8Ghz 2.6.10-5 smp
> kernel.
>
> Latencies as measured by my own program (TLB thrashing read of 8 bytes,
> each cpu 250MB buffer):
>
> #cpu latency
> 1 144-147 ns
> 2 174 ns
> 4 206 ns
> 8 234 ns
>
> That single cpu figure is pretty ugly bad if i may say so.
>
> All kind of numa calls just didn't help a thing. I've tried for
> example:
>
> if(numa_available() < 0 ) {
> setitnuma = 0;
> }
> else {
> int i,back;
> nodemask_t nt,n2,rnm;
> maxnodes = numa_max_node()+1; // () returns 3 when 4 controllers
> printf("numa=%i maxnodes=%i\n",setitnuma,maxnodes);
>
> nt = numa_get_interleave_mask();
> for( i = 0 ; i < maxnodes ; i++ ) {
> printf("node = %i mask = %i\n",i,nt.n[i]);
> nt.n[i] = 0;
> n2.n[i] = 0;
> }
> numa_set_interleave_mask(&nt);
> nt = numa_get_interleave_mask();
> for( i = 0 ; i < maxnodes ; i++ )
> printf("checking memory interleave node = %i mask = %i
> \n",i,nt.n[i]);
>
> rnm = numa_get_run_node_mask();
> printf("numa get run node mask = %i\n",rnm);
> back = numa_run_on_node(0);
> if( !back )
> printf("set to run on node 0\n");
> else
> printf("failed to set run on node 0\n");
>
> }
>
> Whatever i try, single cpu latency keeps 144-147 ns.
>
> A dual opteron dual core with 2.2Ghz dual core controllers shows
> similar
> latencies. 200 ns for example when running 4 processes with the same
> testprogram.
>
> This single cpu latency behaviour of dual core opteron is ugly bad
> compared to other dual opterons which are not dual core.
>
> Nearly identical Tyan mainboard with dual opteron 2.2Ghz gives
> single cpu
> with SAME kernel, with SAME program 115 ns latency. When turning
> off ECC at
> that dual opteron it gets down to 113 ns even.
>
> The frustrating thing is, the dual opteron 2.2Ghz has pc2700,
> whereas the quad opteron dual core has all banks filled
> with pc3200 registered ram, a-brand.
>
> Vincent
--
Dr Stuart Midgley
sdm900 at gmail.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From sdm900 at gmail.com Mon Jul 18 23:05:44 2005
From: sdm900 at gmail.com (Stuart Midgley)
Date: Tue, 19 Jul 2005 11:05:44 +0800
Subject: [Beowulf] Re: dual core (latency)
In-Reply-To: <3.0.32.20050719045009.012813f0@pop3.xs4all.nl>
References: <3.0.32.20050719045009.012813f0@pop3.xs4all.nl>
Message-ID: <234C21E7-B790-4EC4-BAE7-4E26955685FB@gmail.com>
The first thing to note is that as you add cpu's the cost of the
cache snooping goes up dramatically. The latency of a 4 cpu (single
core) opteron system is (if my memory serves me correctly) around
120ns, which is significantly higher than the latency of a dual
processor system (I think it scales roughly as O(n^2) where n is the
number of cpu's).
Now, with a dual core system, you are effectively halving the
bandwidth/cpu over the hyper transport AND increasing the cpu count,
thus increasing the amount of cache snooping required. The end
result is drastically blown-out latencies.
Stu.
On 19/07/2005, at 10:50, Vincent Diepeveen wrote:
> Hello Stuart,
>
> Thanks for your answer regarding numactl tools.
>
> Your answer doesn't necessarily explain why the dual core latency
> (with or
> without numactl) is far worse, yes 30%+ worse, than that of single cpu
> opterons of the same speed, when benchmarking just 1 core (so the
> others
> sitting idle).
>
> Any thoughts on that?
>
> Thanks,
> Vincent
>
--
Dr Stuart Midgley
Industry Uptake Program Leader
iVEC, 'The hub of advanced computing in Western Australia'
26 Dick Perry Avenue, Technology Park
Kensington WA 6151
Australia
Phone: +61 8 6436 8545
Fax: +61 8 6436 8555
Email: industry at ivec.org
WWW: http://www.ivec.org
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Mon Jul 18 22:50:09 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Tue, 19 Jul 2005 04:50:09 +0200
Subject: [Beowulf] Re: dual core (latency)
Message-ID: <3.0.32.20050719045009.012813f0@pop3.xs4all.nl>
Hello Stuart,
Thanks for your answer regarding numactl tools.
Your answer doesn't necessarily explain why the dual core latency (with or
without numactl) is far worse, yes 30%+ worse, than that of single cpu
opterons of the same speed, when benchmarking just 1 core (so the others
sitting idle).
Any thoughts on that?
Thanks,
Vincent
At 08:17 AM 7/19/2005 +0800, Stuart Midgley wrote:
>The numactl tools won't generally help latency. Latency isn't the
>issue with Opteron based systems (or any system with multiply
>connected distributed memory controllers).
>
>The real issue is page locality (which is the case with most numa
>based systems).
>
>If you run 2 processes on a dual cpu (single core) systems and they
>both happen to allocate their pages on the same memory controller,
>they will each only see 1/2 the memory bandwidth and 1 controller
>sits idle. That's the real issue (and the extreme pathalogical case).
>
>Linux2.6 generally does a good job of putting the pages on the memory
>controller attached to cpu that the process is running on. However,
>it can't get it perfect. There are always more than 1process/cpu on
>a system, so there is always a little noise... so there is always the
>chance that some pages can be spread around. Also, the system buffer
>cache will get spread around effecting everyone.
>
>Add into the mix the possibility of suspending processes and you can
>end up with a processes pages all over the place. Since Linux
>doesn't yet have make migration, once a page is allocated it won't be
>moved to a different memory controller unless it is swapped out.
>
>With numactl tools you will force the pages to be allocated on the
>right memory/cpu. The processes buffer cache will also be locked
>down (which is another VERY important issue)...
>
>I have used numa tools to double the performance of some codes (or
>perhaps its more correct to say to get back to the correct performance).
>
>Stu.
>
>
>On 18/07/2005, at 22:38, Vincent Diepeveen wrote:
>
>> I've been toying some with the numactl at dual core and it doesn't
>> really seem to help much. It helps 0.00
>>
>> System: Ubuntu at a quad opteron dual core 1.8Ghz 2.6.10-5 smp
>> kernel.
>>
>> Latencies as measured by my own program (TLB thrashing read of 8 bytes,
>> each cpu 250MB buffer):
>>
>> #cpu latency
>> 1 144-147 ns
>> 2 174 ns
>> 4 206 ns
>> 8 234 ns
>>
>> That single cpu figure is pretty ugly bad if i may say so.
>>
>> All kind of numa calls just didn't help a thing. I've tried for
>> example:
>>
>> if(numa_available() < 0 ) {
>> setitnuma = 0;
>> }
>> else {
>> int i,back;
>> nodemask_t nt,n2,rnm;
>> maxnodes = numa_max_node()+1; // () returns 3 when 4 controllers
>> printf("numa=%i maxnodes=%i\n",setitnuma,maxnodes);
>>
>> nt = numa_get_interleave_mask();
>> for( i = 0 ; i < maxnodes ; i++ ) {
>> printf("node = %i mask = %i\n",i,nt.n[i]);
>> nt.n[i] = 0;
>> n2.n[i] = 0;
>> }
>> numa_set_interleave_mask(&nt);
>> nt = numa_get_interleave_mask();
>> for( i = 0 ; i < maxnodes ; i++ )
>> printf("checking memory interleave node = %i mask = %i
>> \n",i,nt.n[i]);
>>
>> rnm = numa_get_run_node_mask();
>> printf("numa get run node mask = %i\n",rnm);
>> back = numa_run_on_node(0);
>> if( !back )
>> printf("set to run on node 0\n");
>> else
>> printf("failed to set run on node 0\n");
>>
>> }
>>
>> Whatever i try, single cpu latency keeps 144-147 ns.
>>
>> A dual opteron dual core with 2.2Ghz dual core controllers shows
>> similar
>> latencies. 200 ns for example when running 4 processes with the same
>> testprogram.
>>
>> This single cpu latency behaviour of dual core opteron is ugly bad
>> compared to other dual opterons which are not dual core.
>>
>> Nearly identical Tyan mainboard with dual opteron 2.2Ghz gives
>> single cpu
>> with SAME kernel, with SAME program 115 ns latency. When turning
>> off ECC at
>> that dual opteron it gets down to 113 ns even.
>>
>> The frustrating thing is, the dual opteron 2.2Ghz has pc2700,
>> whereas the quad opteron dual core has all banks filled
>> with pc3200 registered ram, a-brand.
>>
>> Vincent
>
>
>--
>Dr Stuart Midgley
>sdm900 at gmail.com
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Tue Jul 19 00:42:02 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Tue, 19 Jul 2005 06:42:02 +0200
Subject: [Beowulf] Re: dual core (latency)
Message-ID: <3.0.32.20050719064158.012813f0@pop3.xs4all.nl>
At 11:05 AM 7/19/2005 +0800, Stuart Midgley wrote:
>The first thing to note is that as you add cpu's the cost of the
>cache snooping goes up dramatically. The latency of a 4 cpu (single
>core) opteron system is (if my memory serves me correctly) around
>120ns. Which is significantly higher than the latency of a dual
>processor system (I think it scales roughly as O(n^2) where n is the
>number of cpu's).
>
>Now, with a dual core system, you are effectively halving the
>bandwidth/cpu over the hyper transport AND increasing the cpu count,
>thus increasing the amount of cache snooping required. The end
>result is drastically blown-out latencies.
>
>Stu.
This doesn't even remotely answer things accurately.
A) my test is doing no WRITES, just READS.
B) snooping might be for free.
C) all other cores are just idle when such a latency test for just 1 core
happens and the rest of the system is idle.
D) in all cases a dual core processor has a HIGHER latency, and that doesn't
make sense.
E) you don't seem to grasp the difference between LATENCY and BANDWIDTH;
For example your BANDWIDTH to Mars might be GREAT, but your LATENCY to Mars
is real ugly, as it takes 200 years for an answer to return.
You keep mixing latency and bandwidth. That's ugly, to say it politely.
I'm speaking of LATENCY here, not bandwidth.
The total BANDWIDTH that my program uses at a dual core is, to be precise:
8 bytes per 147 ns = 8 * 10^9/147 bytes/s = 54MB/s
In fact with some luck your gigabit ethernet card might be able to handle
54MB/s.
Vincent
>
>On 19/07/2005, at 10:50, Vincent Diepeveen wrote:
>
>> Hello Stuart,
>>
>> Thanks for your answer regarding numactl tools.
>>
>> Your answer doesn't necessarily explain why the dual core latency
>> (with or
>> without numactl) is far worse, yes 30%+ worse, than that of single cpu
>> opterons of the same speed, when benchmarking just 1 core (so the
>> others
>> sitting idle).
>>
>> Any thoughts on that?
>>
>> Thanks,
>> Vincent
>>
>
>
>--
>Dr Stuart Midgley
>Industry Uptake Program Leader
>iVEC, 'The hub of advanced computing in Western Australia'
>26 Dick Perry Avenue, Technology Park
>Kensington WA 6151
>Australia
>
>Phone: +61 8 6436 8545
>Fax: +61 8 6436 8555
>Email: industry at ivec.org
>WWW: http://www.ivec.org
>
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Tue Jul 19 01:31:34 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Mon, 18 Jul 2005 22:31:34 -0700
Subject: [Beowulf] Re: dual core (latency)
In-Reply-To: <3.0.32.20050719064158.012813f0@pop3.xs4all.nl>
References: <3.0.32.20050719064158.012813f0@pop3.xs4all.nl>
Message-ID: <20050719053134.GA1649@greglaptop.hsd1.ca.comcast.net>
On Tue, Jul 19, 2005 at 06:42:02AM +0200, Vincent Diepeveen wrote:
> You keep mixing latency and bandwidth. That's ugly, to say polite.
Vincent,
If you're through berating Stuart, can you be bothered to explain why
you think TLB misses should be so cheap? Many architectures have much
worse TLB miss performance than the Opteron. Go bother the Power5
newsgroup.
-- g
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From sdm900 at gmail.com Tue Jul 19 01:22:09 2005
From: sdm900 at gmail.com (Stuart Midgley)
Date: Tue, 19 Jul 2005 13:22:09 +0800
Subject: [Beowulf] Re: dual core (latency)
In-Reply-To: <3.0.32.20050719064158.012813f0@pop3.xs4all.nl>
References: <3.0.32.20050719064158.012813f0@pop3.xs4all.nl>
Message-ID:
I like your email style :)
a) reading doesn't prevent snooping, it causes it. You need to snoop
all the caches to make sure the cache line isn't on some other cpu
before you go to main memory
b) nothing is free - cache snooping costs a lot (even more advanced
methods like page caches - see SGI Altix systems - cost a lot)
c) cores being idle has absolutely nothing to do with cache snooping
(unless you have to flush from a higher level cache or register). A
cpu doesn't know a priori that another cpu doesn't have a process on it
or that it isn't holding an old cache line.
d) I would expect dual cores to have a larger latency... as per my
previous argument
e) I guess this is an interesting point.
Actually, you would be surprised how MUCH bandwidth and latency have
to do with each other in computers. They are VERY tightly coupled.
For example... you have a cpu with dual channel DDR3200 memory
attached. So you think your bandwidth is 6.4GB/s... then why does
streams show a maximum of around 3-4GB/s? Where did the other ~2.5GB/s go?
Now, if you look at the actual bandwidth of loading a single cache line:
a cache line is 128 bytes which can be accessed at 6.4GB/s, so it takes
128/(6.4*2^30) s = 18.6ns to get.
Take into account the ~125ns latency and you can get the 128 byte
cache line in about 143ns, which gives a bandwidth of 0.93GB/s.
Now, given that the pentium can have 4 outstanding cache load misses,
you can in effect overlay 4 operations and quarter the latency to around
45ns, to give around 2.4GB/s to get the same 128 byte cache line.
Now, take into account all the other outstanding factors: some memory
is already in fast caches; you can't quite quarter the latency; 4
operations don't quite happen simultaneously due to the 18ns it takes
to get the data, etc.
The end result is that latency has a MASSIVE impact on real bandwidth.
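A minimal C sketch of that arithmetic (an illustration, not code from this
thread; the 128 byte line, the 6.4GB/s peak and the ~125ns latency are the
figures above, and perfect overlap of the outstanding misses is an
idealization):
/* effective.c - effective bandwidth of fetching cache lines when each
 * miss pays the full memory latency and n outstanding misses overlap. */
#include <stdio.h>
int main(void) {
  double line_bytes  = 128.0;                   /* cache line size in bytes */
  double peak_bw     = 6.4*1073741824.0;        /* 6.4 GB/s peak, in bytes/s */
  double latency_ns  = 125.0;                   /* memory latency in ns */
  double transfer_ns = line_bytes/peak_bw*1e9;  /* ~18.6ns to move one line */
  int n;
  for( n = 1 ; n <= 4 ; n++ ) {                 /* n outstanding misses */
    double per_line_ns = latency_ns/n + transfer_ns;
    printf("%d outstanding: %.1f ns/line -> %.2f GB/s\n",
           n, per_line_ns, line_bytes/per_line_ns); /* bytes per ns = GB/s */
  }
  return 0;
}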
Stu.
>
> This doesn't answer even remotely accurate things.
>
> A) my test is doing no WRITES, just READS.
> B) snooping might be for free.
> C) all other cores are just idle when such a latency test for just
> 1 core
> happens and the rest of the system is idle.
> D) in all cases a dual core processor has a SLOWER latency and it
> doesn't
> make sense.
> E) you don't seem to grasp the difference between LATENCY and
> BANDWIDTH;
>
> For example your BANDWIDTH to Mars might be GREAT, but your LATENCY
> to Mars
> is real ugly, as it takes 200 years for them to return.
>
> You keep mixing latency and bandwidth. That's ugly, to say polite.
>
> I'm speaking of LATENCY here, not bandwidth.
>
> The total BANDWIDTH that my program takes at a dual core is to be
> correct:
>
> 8 bytes * 1 billion (1/ns) / 147 (ns) = 54MB/s
>
> In fact with some luck your gigabit ethernet card might be able to
> handle
> 54MB/s.
>
> Vincent
>
>
--
Dr Stuart Midgley
sdm900 at gmail.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kus at free.net Tue Jul 19 08:17:34 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Tue, 19 Jul 2005 16:17:34 +0400
Subject: [Beowulf] Re: dual core (latency)
In-Reply-To: <234C21E7-B790-4EC4-BAE7-4E26955685FB@gmail.com>
Message-ID:
In message from Stuart Midgley (Tue, 19 Jul 2005
11:05:44 +0800):
>The first thing to note is that as you add cpu's the cost of the cache
>snooping goes up dramatically. The latency of a 4 cpu (single core)
>opteron system is (if my memory serves me correctly) around
> 120ns.
AFAIK, cache coherence for the dual core Athlon64 is resolved using
the SRQ/SRI, i.e. it doesn't involve the switch.
Opteron, I believe, has the same possibility (if I remember correctly,
I asked this question on comp.arch, but the answer was "probably yes",
not 100% yes :-)). Then, theoretically, "2-level cache snooping" may be
realized: the answer to a broadcast request from the 2nd core of
the same chip may be returned using the SRI, while other cores send answers
through the switch.
If such a scheme is realized, cache snoop traffic
through the switch decreases for 4 cores/2 CPUs in comparison with 4 single
core CPUs (about 2/3 of usual - 1/3 of the coherence traffic on the switch is
absent), but the available throughput to the switch is only half of the usual
for 4 single core CPUs.
This gives an increase of cache snoop traffic of 30+%, which looks like
what Vincent observes. Interesting - is my estimation really right, or am I
wrong somewhere?
Yours
Mikhail
> Which is significantly higher than the latency of a dual
> processor system (I think it scales roughly as O(n^2) where n is the
> number of cpu's).
>
>Now, with a dual core system, you are effectively halving the
>bandwidth/cpu over the hyper transport AND increasing the cpu count,
>thus increasing the amount of cache snooping required. The end
> result is drastically blown-out latencies.
>
>Stu.
>
>
>On 19/07/2005, at 10:50, Vincent Diepeveen wrote:
>
>> Hello Stuart,
>>
>> Thanks for your answer regarding numactl tools.
>>
>> Your answer doesn't necessarily explain why the dual core latency
>> (with or
>> without numactl) is far worse, yes 30%+ worse, than that of single
>>cpu
>> opterons of the same speed, when benchmarking just 1 core (so the
>> others
>> sitting idle).
>>
>> Any thoughts on that?
>>
>> Thanks,
>> Vincent
>>
>
>
>--
>Dr Stuart Midgley
>Industry Uptake Program Leader
>iVEC, 'The hub of advanced computing in Western Australia'
>26 Dick Perry Avenue, Technology Park
>Kensington WA 6151
>Australia
>
>Phone: +61 8 6436 8545
>Fax: +61 8 6436 8555
>Email: industry at ivec.org
>WWW: http://www.ivec.org
>
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Tue Jul 19 08:27:48 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Tue, 19 Jul 2005 14:27:48 +0200
Subject: [Beowulf] Re: dual core (latency)
Message-ID: <3.0.32.20050719142745.012813f0@pop3.xs4all.nl>
At 10:31 PM 7/18/2005 -0700, Greg Lindahl wrote:
>On Tue, Jul 19, 2005 at 06:42:02AM +0200, Vincent Diepeveen wrote:
>
>> You keep mixing latency and bandwidth. That's ugly, to say polite.
>
>Vincent,
>
>If you're through berating Stuart, can you be bothered to explain why
>you think TLB misses should be so cheap? Many architectures have much
>worse TLB miss performance than the Opteron. Go bother the Power5
>newsgroup.
The point is that the latency to a single core in a dual core dual system is far
worse than to a single core in a dual system or a single core in a quad system.
Ditto, a single core in a quad dual core system is also dead slow to access.
In short, a TLB miss at a dual core has a larger latency than a TLB miss at
a single core, also when the other cores aren't busy.
This is for a read, which seemingly cannot be done in parallel by dual cores.
>-- g
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Tue Jul 19 20:55:55 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Tue, 19 Jul 2005 17:55:55 -0700
Subject: [Beowulf] New HPCC results, and an MX question
Message-ID: <20050720005555.GA4234@greglaptop.internal.keyresearch.com>
First off, I'd like to announce that we've started publishing public
benchmark data for InfiniPath; for example, we've now got a data point
listed at the HPC Challenge website:
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
In particular I'd like to point out our "Random Ring Latency" number
of 1.31 usec. This benchmark is a lot more realistic than the usual
ping-pong latency, because it uses all the cpus on all the nodes,
instead of just 1 cpu on each of 2 nodes. If you examine other
interconnects, you'll note that many of them get a much worse random
ring latency than ordinary ping-pong.
Second, I have a question about Myrinet MX performance. Myricom has
better things to do than answer my performance queries (no surprise,
every company prefers to answer customer queries first). With GM,
Myricom published the raw output from the Pallas benchmark, and that
was very useful for doing comparisons. With MX, Myricom hasn't
published the raw data, but they did publish graphs. The claimed
0-byte latency is 2.6 usec, with no explanation of what benchmark was
used. The graph at:
http://www.myri.com/myrinet/performance/MPICH-MX/
for Pallas pingpong latency is a log/log scale, so it's hard to see
what latency it got without having the detailed results, which are not
provided. But if you look at the bandwidth chart, it's semi-log. So at
32 byte payloads, the bandwidth looks to me like it's 9 or 10
MB/s. That corresponds to a 3.1 to 3.4 usec 0-byte latency. The
bandwidth for 64 bytes and 128 bytes seem to support this number, too.
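(Spelling out the back-calculation, which the graphs leave implicit: for
tiny messages the reported bandwidth is roughly message size divided by
per-message time, so 10 MB/s at a 32 byte payload implies about
32 / 10^7 s = 3.2 usec per message.)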
So, the question is, am I full of it? Wait, don't answer that! The
question is, can someone using MX please run Pallas pingpong and
publish the raw chart?
To be fair, we don't have these details for InfiniPath up on our
website yet, so here's what we get on our 2.6 Ghz dual-cpu systems.
We're about 30 nanoseconds slower on this pingpong than the number
we get from the osu_latency pingpong.
-- greg
#---------------------------------------------------
# Benchmarking PingPong
# ( #processes = 2 )
# ( 30 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.35 0.00
1 1000 1.36 0.70
2 1000 1.36 1.41
4 1000 1.34 2.85
8 1000 1.35 5.66
16 1000 1.59 9.58
32 1000 1.63 18.75
64 1000 1.68 36.38
128 1000 1.79 68.20
256 1000 2.04 119.47
512 1000 2.53 192.73
1024 1000 3.51 277.86
2048 1000 5.57 350.71
4096 1000 7.46 523.45
8192 1000 11.70 668.02
16384 1000 21.49 727.14
32768 1000 42.89 728.55
65536 640 88.76 704.17
131072 320 161.42 774.36
262144 160 308.38 810.68
524288 80 582.13 858.92
1048576 40 1146.71 872.06
2097152 20 2253.23 887.62
4194304 10 4452.19 898.43
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Tue Jul 19 22:27:01 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Tue, 19 Jul 2005 19:27:01 -0700
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <42DDB306.7010700@myri.com>
References: <20050720005555.GA4234@greglaptop.internal.keyresearch.com>
<42DDB306.7010700@myri.com>
Message-ID: <20050720022701.GA5030@greglaptop.internal.keyresearch.com>
On Tue, Jul 19, 2005 at 10:12:22PM -0400, Patrick Geoffray wrote:
> > interconnects, you'll note that many of them get a much worse random
> > ring latency than ordinary ping-pong.
>
> Nope. It's worse because:
> * they use much larger clusters: when the size of the cluster increases,
> the number of hops increases, thus the worst case latency increases. 16
> nodes is a tiny cluster with just one hop worst case.
> * they use older hardware: 2.6 GHz Opterons are not very old.
> * they use older drivers: because customers have other things to do than
> running benchmarks on a carefully crafted environment with a carefully
> optimized driver/lib.
Patrick,
I am referring to a comparison of the HPCC "random ring latency" to
the HPCC "average ping-pong" on the same hardware, with the same
driver, at the same cluster size. I was not referring to the absolute
numbers, which of course are dependent on cluster size, host cpu
clock, and driver version.
> By the way, could you point me to the raw performance data on the
> pathscale web pages ?
As I said, it is in the process of being published, and I attached
the relevant info to my posting.
> >published the raw data, but they did publish graphs. The claimed
> >0-byte latency is 2.6 usec, with no explanation of what benchmark was
> >used. The graph at:
>
> From the page: "Performance data is presented for the Pallas MPI
> Benchmark Suite, Version 2.2". It's in bold, but maybe we should write
> in red, blinking...
I was referring to the 2.6 usec claim at:
http://www.myri.com/myrinet/performance/
That page makes no reference to Pallas. The page you're referring to is
http://www.myri.com/myrinet/performance/MPICH-MX/
which doesn't include a 2.6 usec claim, but does say that it's Pallas
results.
> Anyway, the cluster I ran Pallas on had a 0-byte MPI latency of 2.9 us.
> Why ? Because it's a production cluster, deployed over a year ago, with
> 1.4 GHz Opteron CPUs (compare that with your 2.6 GHz).
Thank you for the number. Does your latency change significantly with
faster cpus? Ours does (from 1.50 usec at 2.0 Ghz to 1.32 usec at 2.6
Ghz), but my impression was that your number ought to be relatively
insensitive to the host cpu speed.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From patrick at myri.com Tue Jul 19 23:11:33 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 19 Jul 2005 23:11:33 -0400
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <20050720022701.GA5030@greglaptop.internal.keyresearch.com>
References: <20050720005555.GA4234@greglaptop.internal.keyresearch.com> <42DDB306.7010700@myri.com>
<20050720022701.GA5030@greglaptop.internal.keyresearch.com>
Message-ID: <42DDC0E5.3050906@myri.com>
Greg Lindahl wrote:
> I am referring to a comparison of the HPCC "random ring latency" to
> the HPCC "average ping-pong" on the same hardware, with the same
The random ring latency will increase with the size of the cluster,
whereas the average pingpong will not, as the pairs of nodes are ordered
and ordered nodes are likely to be on the same crossbar. If you
randomize the machine list, then there is no difference between the
random ring latency and the average pingpong.
On a tiny cluster, all nodes are on the same crossbar, so it does not
matter whether the pairs are ordered or not.
>>By the way, could you point me to the raw performance data on the
>>pathscale web pages ?
>
>
> As I said, it is in the process of being published, and I attached
> the relevant info to my posting.
I know, tongue-in-cheek. Will you publish the raw numbers on the web
site eventually ?
> I was referring to the 2.6 usec claim at:
>
> http://www.myri.com/myrinet/performance/
This is Pallas too. I will ask to add a reference to it: Pallas between
2 nodes Opteron 2GHz, on the same crossbar with E cards.
>>Anyway, the cluster I ran Pallas on had a 0-byte MPI latency of 2.9 us.
>>Why ? Because it's a production cluster, deployed over a year ago, with
>>1.4 GHz Opteron CPUs (compare that with your 2.6 GHz).
> Thank you for the number. Does your latency change significantly with
> faster cpus? Ours does (from 1.50 usec at 2.0 Ghz to 1.32 usec at 2.6
> Ghz), but my impression was that your number ought to be relatively
> insensitive to the host cpu speed.
No, we do PIO for small messages too, but not for medium/large messages
(CPU cycles start to get expensive when you push data through a slow bus
like PCI-X; on PCI-Express or HT, the picture is different). So the CPU
clock will affect the latency up to 127 bytes (the threshold may
change). Write combining affects the latency too.
It also depends on the architecture (Opterons are better than EM64T, for
example, but I suspect the PCI-E/PCI-X bridge to be the culprit) and on the
cost of pthread mutexes (MX is threadsafe).
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From patrick at myri.com Tue Jul 19 22:12:22 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 19 Jul 2005 22:12:22 -0400
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <20050720005555.GA4234@greglaptop.internal.keyresearch.com>
References: <20050720005555.GA4234@greglaptop.internal.keyresearch.com>
Message-ID: <42DDB306.7010700@myri.com>
Greg Lindahl wrote:
> interconnects, you'll note that many of them get a much worse random
> ring latency than ordinary ping-pong.
Nope. It's worse because:
* they use much larger clusters: when the size of the cluster increases,
the number of hops increases, thus the worst case latency increases. 16
nodes is a tiny cluster with just one hop worst case.
* they use older hardware: 2.6 GHz Opterons are not very old.
* they use older drivers: because customers have other things to do than
running benchmarks on a carefully crafted environment with a carefully
optimized driver/lib.
> Second, I have a question about Myrinet MX performance. Myricom has
> better things to do than answer my performance queries (no surprise,
No. Myricom answered your query, not in the way you wanted, but we
replied to you (the same day).
> every company prefers to answer customer queries first). With GM,
> Myricom published the raw output from the Pallas benchmark, and that
> was very useful for doing comparisons. With MX, Myricom hasn't
That was very useful for competitors publishing bogus comparisons of
data from different configurations (different hardware, different
software). That's why we stopped publishing the raw data.
By the way, could you point me to the raw performance data on the
PathScale web pages?
> published the raw data, but they did publish graphs. The claimed
> 0-byte latency is 2.6 usec, with no explanation of what benchmark was
> used. The graph at:
From the page: "Performance data is presented for the Pallas MPI
Benchmark Suite, Version 2.2". It's in bold, but maybe we should write
it in red, blinking...
> MB/s. That corresponds to a 3.1 to 3.4 usec 0-byte bandwidth. The
> bandwidth for 64 bytes and 128 bytes seem to support this number, too.
>
> So, the question is, am I full of it? Wait, don't answer that! The
Full of what? I can think of a few things...
Anyway, the cluster I ran Pallas on had a 0-byte MPI latency of 2.9 us.
Why? Because it's a production cluster, deployed over a year ago, with
1.4 GHz Opteron CPUs (compare that with your 2.6 GHz).
> question is, can someone using MX please run Pallas pingpong and
> publish the raw chart?
And please don't forget to turn write combining (WC) on.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Tue Jul 19 23:53:41 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Tue, 19 Jul 2005 20:53:41 -0700
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <42DDC0E5.3050906@myri.com>
References: <20050720005555.GA4234@greglaptop.internal.keyresearch.com>
<42DDB306.7010700@myri.com>
<20050720022701.GA5030@greglaptop.internal.keyresearch.com>
<42DDC0E5.3050906@myri.com>
Message-ID: <20050720035341.GA1159@greglaptop.hsd1.ca.comcast.net>
On Tue, Jul 19, 2005 at 11:11:33PM -0400, Patrick Geoffray wrote:
> If you randomize the machine list, then there is no difference
> between the random ring latency and the average pingpong.
Patrick,
There likely will be a difference, because average pingpong doesn't
run on all the cpus. On a 4-cpu node, that can make a big difference.
To give you an example, look at the Quadrics reported numbers for
random ring latency of 11.4568 usec and average ping-pong of 1.552
usec. This is on a 2-cpu node (I think). I'd bet that most of this
difference has nothing to do with machine size. But I'd be happy to be
proven wrong.
Hopefully someone will publish a Myrinet MX-based set of HPCC results
soon. (hint, hint!)
> >As I said, it is in the process of being published, and I attached
> >the relevant info to my posting.
>
> I know, tongue-in-cheek. Will you publish the raw numbers on the web
> site eventually ?
Yes. That's what I meant in the first place.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From patrick at myri.com Wed Jul 20 00:38:35 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 20 Jul 2005 00:38:35 -0400
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <20050720035341.GA1159@greglaptop.hsd1.ca.comcast.net>
References: <20050720005555.GA4234@greglaptop.internal.keyresearch.com> <42DDB306.7010700@myri.com> <20050720022701.GA5030@greglaptop.internal.keyresearch.com> <42DDC0E5.3050906@myri.com>
<20050720035341.GA1159@greglaptop.hsd1.ca.comcast.net>
Message-ID: <42DDD54B.2030805@myri.com>
Greg Lindahl wrote:
> On Tue, Jul 19, 2005 at 11:11:33PM -0400, Patrick Geoffray wrote:
>
>
>>If you randomize the machine list, then there is no difference
>>between the random ring latency and the average pingpong.
>
>
> Patrick,
>
> There likely will be a difference, because average pingpong doesn't
> run on all the cpus. On a 4-cpu node, that can make a big difference.
I believe the difference will not be that big. I will get my hands on a
quad in the next couple of weeks and look into it.
> To give you an example, look at the Quadrics reported numbers for
> random ring latency of 11.4568 usec and average ping-pong of 1.552
> usec. This is on a 2-cpu node (I think). I'd bet that most of this
> difference has nothing to do with machine size. But I'd be happy to be
> proven wrong.
I would think the 1.5 is shared memory in this case (all pairs are
ordered and they end up being on the same nodes). This is one of the
things I don't like about HPCC: so much variation in the results
depending on the size of the cluster, process mapping, and
order/topology.
> Hopefully someone will publish a Myrinet MX-based set of HPCC results
> soon. (hint, hint!)
I don't have time to do that. At least, not as long as HPCC, like HPL,
takes a gazillion parameters. Give me HPCC with no parameters and I will
take 5 minutes to start it. I was promised it would be this way
eventually.
I don't believe much in synthetic benchmarks. HPL can yield 90% of
peak if rewritten for modern MPI implementations, Pallas is nice for
finding out when something is very wrong but not much more, and the NAS
benchmarks are marginally more interesting.
I prefer benchmarking real codes, and we will publish that, but 10G is
taking most of my time these days (got to get something for you to
compare against).
>>I know, tongue-in-cheek. Will you publish the raw numbers on the web
>>site eventually ?
>
>
> Yes. That's what I meant in the first place.
I bet the next time you won't :-\
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 20 09:06:08 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 20 Jul 2005 15:06:08 +0200
Subject: [Beowulf] New HPCC results, and an MX question
Message-ID: <3.0.32.20050720150607.0110d930@pop3.xs4all.nl>
At 12:38 AM 7/20/2005 -0400, Patrick Geoffray wrote:
>Greg Lindahl wrote:
>> On Tue, Jul 19, 2005 at 11:11:33PM -0400, Patrick Geoffray wrote:
>>
>>
>>>If you randomize the machine list, then there is no difference
>>>between the random ring latency and the average pingpong.
>>
>>
>> Patrick,
>>
>> There likely will be a difference, because average pingpong doesn't
>> run on all the cpus. On a 4-cpu node, that can make a big difference.
>
>I believe the difference will not be that big. I will get my hands on a
>quad in the next couple of weeks, I will look into int.
The difference will be huge, of course: network processors have a switch
latency. That's why.
If it must switch at the wrong moment, that will cost 50 us or something
on certain network chips.
Additionally, there will be software layers that have to lock in some way.
Locking + unlocking is already like half a microsecond extra, just like that.
Tests at all processors at the same time make major sense.
Any denial in advance that it will be the same speed is just baloney.
>> To give you an example, look at the Quadrics reported numbers for
>> random ring latency of 11.4568 usec and average ping-pong of 1.552
>> usec. This is on a 2-cpu node (I think). I'd bet that most of this
>> difference has nothing to do with machine size. But I'd be happy to be
>> proven wrong.
>
>I would think 1.5 is shared memory in this case (all pairs are ordered
>and they end up being on the same nodes). This is one of the thing I
>don't like with HPCC, so much variation in results depending on size of
>clusters, process mapping, order/topology.
>
>> Hopefully someone will publish a Myrinet MX-based set of HPCC results
>> soon. (hint, hint!)
>
>I don't have time to do that. At least, as long as HPCC, like HPL, take
>a gazillions parameters. Give me HPCC with no parameters and I will take
>5 minutes to start it. I was promised it would be this way eventually.
>
>I don't believe much in any analytic benchmarks. HPL can yield 90% of
>peak if rewritten for modern MPI implementations, Pallas is nice to find
>out when something is very wrong, but not much more, and the NAS are
>marginaly more interesting.
>
>I prefer benchmarking real codes, and we will publish that, but 10G is
>taking most of my time these days (got to get something for you to
>compare against).
>
>>>I know, tongue-in-cheek. Will you publish the raw numbers on the web
>>>site eventually ?
>>
>>
>> Yes. That's what I meant in the first place.
>
>I bet the next time you won't :-\
>
>Patrick
>--
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From bill at princeton.edu Wed Jul 20 09:48:08 2005
From: bill at princeton.edu (Bill Wichser)
Date: Wed, 20 Jul 2005 09:48:08 -0400
Subject: [Beowulf] Performance issue - CPU Intel 00/02
Message-ID: <42DE5618.9000109@princeton.edu>
System: 128 node Intel 2.4GHz P4
MBO: Tyan S2099, i845E
OS: RedHat 8.0, kernel 2.4.18-18.8.0 (but 2.4.20-28.8 changes nothing)
Problem: Performance drops to one third within 60 minutes of a
reload/reboot on a number of nodes, as determined by an xhpl run
---
On SuperBowl Sunday, the lights went out on this cluster. Ever since
that time performance has suffered. Initially, when running xhpl, there
was a 3x performance difference between a "good" node and a "bad" node.
A reboot solved the problem, or so I thought.
This summer, having more time to investigate the problem, I found that
some nodes exhibit this degradation after a power cycle while others
don't. I've used strace and ptrace, watched memory usage statistics,
etc., but the only thing that ever changed was that all of these calls
suffered a 3x performance hit on a bad node.
At first I thought it might be cooling, knowing that these Intel
processors throttle down when they reach a set temperature. But watching
the temperatures revealed that all nodes were effectively running the
same way. And once performance dropped, they never returned to normal.
By accident I discovered that of these 128 nodes, 50 of them show a
strange value in /proc/cpuinfo for the model name. On a good node the
CPUs reveal themselves as "Intel(R) Pentium(R) 4 CPU 2.40GHz", while on
a "bad" node they call themselves "00/02". Yet when checking the BIOS
(all the nodes have the same configuration, I believe, although I
neglected to record the BIOS level this last go-round), they reveal
themselves correctly as Intel(R) Pentium(R) 4 CPU 2.40GHz.
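For anyone who wants to sweep their own nodes for this, a minimal check
of the reported model name follows (a sketch; the "00/02" substring
match is specific to the symptom above, and fanning it out over the
nodes via rsh/pdsh is assumed):

/* cpucheck.c -- print this node's CPU model name and flag the bogus
 * "00/02" string; a minimal sketch, run on each node via rsh/pdsh. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    int bad = 0;

    if (!f) { perror("/proc/cpuinfo"); return 2; }
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "model name", 10) == 0) {
            fputs(line, stdout);
            if (strstr(line, "00/02"))  /* the string seen on bad nodes */
                bad = 1;
        }
    }
    fclose(f);
    return bad;  /* nonzero exit marks a suspect node */
}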
Now I'm stuck. I don't know how to proceed. I see the symptom but find
it hard to believe that 40% of the CPUs have somehow become defective.
Yet the software is all the same, and reloads on a good node or a bad
node produce no changes whatsoever. Only a reboot on a bad node seems to
cure the performance problem, albeit for some short duration.
My next step will be to swap two CPUs, one from a known good into a
known bad and see if anything changes. But before I go that route I
just wanted to ask the advice of this group, hoping that someone might
have seen this before and offer a solution.
Thanks,
Bill
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Wed Jul 20 10:51:11 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Wed, 20 Jul 2005 15:51:11 +0100
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To: <42DE5618.9000109@princeton.edu>
References: <42DE5618.9000109@princeton.edu>
Message-ID: <1121871072.8056.6.camel@vigor13>
On Wed, 2005-07-20 at 09:48 -0400, Bill Wichser wrote:
>
> By accident I discovered that of these 128 nodes, 50 of them show some
> strange value in /proc/cpuinfo for model name. On a good node these
> reveal themselves as "Intel(R) Pentium(R) 4 CPU 2.40GHz" while on a
> "bad" node they call themselves "00/02" yet when checking the BIOS, and
> all the nodes have the same configuration I believe although I neglected
> to gather the level this last go round, they reveal themselves correctly
> as Intel(R) Pentium(R) 4 CPU 2.40GHz.
>
Jumping in with both feet,
could you try running Dave Jones' x86info on good/bad nodes?
http://www.codemonkey.org.uk/projects/x86info/
Also, does the bogomips rating differ? (That's a real stupid question, I
know.)
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From bill at princeton.edu Wed Jul 20 11:11:05 2005
From: bill at princeton.edu (Bill Wichser)
Date: Wed, 20 Jul 2005 11:11:05 -0400
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To: <1121871072.8056.6.camel@vigor13>
References: <42DE5618.9000109@princeton.edu> <1121871072.8056.6.camel@vigor13>
Message-ID: <42DE6989.3020604@princeton.edu>
bogomips shows the same numbers. Interesting find, though, with the
x86info testing. Can it be that this has been bad all along? I'll need
to pull some CPUs to see if they are indeed Celerons, which I seriously
doubt.
BAD
Found 1 CPU
--------------------------------------------------------------------------
Family: 15 Model: 2 Stepping: 4 Type: 0 Brand: 15
CPU Model: Celeron (P4 core) [B0] Original OEM
Feature flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflsh dtes acpi mmx fxsr sse sse2 selfsnoop ht acc
Extended feature flags:
Instruction trace cache:
Size: 12K uOps 8-way associative.
L1 Data cache:
Size: 8KB Sectored, 4-way associative.
line size=64 bytes.
L2 unified cache:
Size: 512KB Sectored, 8-way associative.
line size=64 bytes.
Instruction TLB: 4K, 2MB or 4MB pages, fully associative, 64 entries.
Data TLB: 4KB or 4MB pages, fully associative, 64 entries.
The physical package supports 1 logical processors
GOOD
Found 1 CPU
--------------------------------------------------------------------------
Family: 15 Model: 2 Stepping: 4 Type: 0 Brand: 9
CPU Model: Pentium 4 (Northwood) [B0] Original OEM
Processor name string: Intel(R) Pentium(R) 4 CPU 2.40GHz
Feature flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflsh dtes acpi mmx fxsr sse sse2 selfsnoop ht acc
Extended feature flags:
Instruction trace cache:
Size: 12K uOps 8-way associative.
L1 Data cache:
Size: 8KB Sectored, 4-way associative.
line size=64 bytes.
L2 unified cache:
Size: 512KB Sectored, 8-way associative.
line size=64 bytes.
Instruction TLB: 4K, 2MB or 4MB pages, fully associative, 64 entries.
Data TLB: 4KB or 4MB pages, fully associative, 64 entries.
The physical package supports 1 logical processors
John Hearns wrote:
> On Wed, 2005-07-20 at 09:48 -0400, Bill Wichser wrote:
>
>
>>By accident I discovered that of these 128 nodes, 50 of them show some
>>strange value in /proc/cpuinfo for model name. On a good node these
>>reveal themselves as "Intel(R) Pentium(R) 4 CPU 2.40GHz" while on a
>>"bad" node they call themselves "00/02" yet when checking the BIOS, and
>>all the nodes have the same configuration I believe although I neglected
>>to gather the level this last go round, they reveal themselves correctly
>>as Intel(R) Pentium(R) 4 CPU 2.40GHz.
>>
>
> Jumping in with both feet,
> could you try running Dave Jones' x86info on good/bad nodes?
> http://www.codemonkey.org.uk/projects/x86info/
> Also does the bogomips rating differ (that's a real stupid question, I
> know)
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mathog at mendel.bio.caltech.edu Wed Jul 20 11:40:10 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Wed, 20 Jul 2005 08:40:10 -0700
Subject: [Beowulf] Performance issue - CPU Intel 00/02
Message-ID:
Bill Wichser wrote:
> System: 128 node Intel 2.4GHz P4
> MBO: Tyan S2099, i845E
> OS: RedHat 8.0, kernel 2.4.18-18.8.0 (but 2.4.20-28.8 changes nothing)
>
> This summer, having more time to investigate the problem, I found that
> some nodes exhibit this degradation after a power cycle while others
> didn't.
We have some Asus motherboards which drop the CPU speed to a
"safe" value after a power failure or other unplanned shutdown
event. The symptoms are exactly as you describe - perfectly good
machines mysteriously running much slower than other identical
machines.
It seems likely that the Tyan board may be doing something
similar. Try getting into the BIOS on one of these and see what
it's done to the clock speed. Unfortunately for the Asus boards
the only way to fix this is in the BIOS using the keyboard. That
would be rather painful if you had 300 nodes to deal with. You
might want to see if there is a switch in the Tyan BIOS to disable
this feature.
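One way to check the effective clock from user space, without visiting
each BIOS, is to count timestamp-counter ticks over a known wall-clock
interval; a minimal sketch (assumes x86 and gcc, and that the TSC ticks
at the core clock, which holds for these P4s):

/* clockcheck.c -- estimate the effective CPU clock by counting TSC
 * ticks over a one-second wall-clock interval. A sketch for x86/gcc. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = rdtsc();
    sleep(1);                       /* one second of wall-clock time */
    printf("effective clock: ~%.0f MHz\n", (rdtsc() - t0) / 1e6);
    return 0;
}

A node whose BIOS has fallen back to a "safe" clock will report well
under the nominal 2400 MHz.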
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From bill at princeton.edu Wed Jul 20 12:03:25 2005
From: bill at princeton.edu (Bill Wichser)
Date: Wed, 20 Jul 2005 12:03:25 -0400
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To: <1121871072.8056.6.camel@vigor13>
References: <42DE5618.9000109@princeton.edu> <1121871072.8056.6.camel@vigor13>
Message-ID: <42DE75CD.3050800@princeton.edu>
The CPU is verified as a P4. The problem does NOT move with the CPU but
stays with the motherboard. I will attempt to upgrade the BIOS to a
version more recent than 1.04 in the next week, as soon as I can locate
an external USB drive, to see if this doesn't correct the situation.
Again, according to the initial BIOS screen, it sees the chip as a P4.
But maybe something is corrupt in the firmware, telling the Linux kernel
otherwise when queried. Also, I see no indication of any slowness until
sometime after a reboot, so I have no way of getting into the BIOS to
test and change settings while a node is in the bad state.
Bill
John Hearns wrote:
> On Wed, 2005-07-20 at 09:48 -0400, Bill Wichser wrote:
>
>
>>By accident I discovered that of these 128 nodes, 50 of them show some
>>strange value in /proc/cpuinfo for model name. On a good node these
>>reveal themselves as "Intel(R) Pentium(R) 4 CPU 2.40GHz" while on a
>>"bad" node they call themselves "00/02" yet when checking the BIOS, and
>>all the nodes have the same configuration I believe although I neglected
>>to gather the level this last go round, they reveal themselves correctly
>>as Intel(R) Pentium(R) 4 CPU 2.40GHz.
>>
>
> Jumping in with both feet,
> could you try running Dave Jones' x86info on good/bad nodes?
> http://www.codemonkey.org.uk/projects/x86info/
> Also does the bogomips rating differ (that's a real stupid question, I
> know)
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Wed Jul 20 13:27:47 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Wed, 20 Jul 2005 18:27:47 +0100
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To:
References:
Message-ID: <1121880467.8056.14.camel@vigor13>
On Wed, 2005-07-20 at 08:40 -0700, David Mathog wrote:
>
> It seems likely that the Tyan board may be doing something
> similar. Try getting into the BIOS on one of these and see what
> it's done to the clock speed. Unfortunately for the Asus boards
> the only way to fix this is in the BIOS using the keyboard. That
> would be rather painful if you had 300 nodes to deal with. You
> might want to see if there is a switch in the Tyan BIOS to disable
> this feature.
I agree with what Dave says.
If you have lots of sick machines, you could try making BIOS
settings using the nvram module
On a good node: modprobe nvram; cat /dev/nvram > biosfile
On a bad one: modprobe nvram; cat biosfile > /dev/nvram
YMMV. The management accept no liability for lost bits.
Seriously, give it a try on one node.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 20 13:18:56 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 20 Jul 2005 10:18:56 -0700
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <3.0.32.20050720150607.0110d930@pop3.xs4all.nl>
References: <3.0.32.20050720150607.0110d930@pop3.xs4all.nl>
Message-ID: <20050720171856.GB1164@greglaptop.internal.keyresearch.com>
On Wed, Jul 20, 2005 at 03:06:08PM +0200, Vincent Diepeveen wrote:
> Additionally, there will be software layers that have to lock in some way.
Vincent, nobody builds networks this way, at least nobody building a
high performance network. What everyone does is give N processes their
own virtual copy of the chip, generally called a "port". Myrinet
implements ports in software on their Lanai chip; we do them in
hardware. In regular InfiniBand, the separate processes get separate
queues.
You are correct that software locking on the host cpu would be
expensive, and that's why "threaded MPI" is a bad idea.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From david.n.lombard at intel.com Wed Jul 20 13:47:34 2005
From: david.n.lombard at intel.com (Lombard, David N)
Date: Wed, 20 Jul 2005 10:47:34 -0700
Subject: [Beowulf] Performance issue - CPU Intel 00/02
Message-ID: <187D3A7CAB42A54DB61F1D05F01257220664B0B7@orsmsx402.amr.corp.intel.com>
From: John Hearns on Wednesday, July 20, 2005 10:28 AM
> On Wed, 2005-07-20 at 08:40 -0700, David Mathog wrote:
> > It seems likely that the Tyan board may be doing something
> > similar. Try getting into the BIOS on one of these and see what
> > it's done to the clock speed. Unfortunately for the Asus boards
> > the only way to fix this is in the BIOS using the keyboard. That
> > would be rather painful if you had 300 nodes to deal with. You
> > might want to see if there is a switch in the Tyan BIOS to disable
> > this feature.
> I agree with what Dave says.
>
> If you have lots of sick machines, you could try making BIOS
> settings using the nvram module
>
> On a good node: modprobe nvram; cat /dev/nvram > biosfile
>
> On a bad one: modprobe nvram; cat biosfile > /dev/nvram
>
> YMMV. The management accept no liability for lost bits.
> Seriously, give it a try on one node.
Before you do this, you may want to check the BIOS levels on a good vs.
bad system and try a BIOS reset on the bad machine. Even a batch of
systems delivered at the same time can have different BIOS levels. I've
also seen a BIOS reset cure these odd problems when the systems were all
already at the same version.
The final bit, which is more in the voodoo camp, is the battery...
If the BIOS versions differ, would copying the nvram from one system to
another be an issue?
--
dnl
My comments represent my opinions, not those of Intel Corporation.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Wed Jul 20 16:08:44 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Wed, 20 Jul 2005 16:08:44 -0400 (EDT)
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To: <187D3A7CAB42A54DB61F1D05F01257220664B0B7@orsmsx402.amr.corp.intel.com>
Message-ID:
> seen a BIOS reset cure these odd problems if they were all already at
> the same version.
that's what I'd do. remember that you can use pxe to boot a bios-update
image, so you don't necessarily have to touch each machine.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From joachim at ccrl-nece.de Wed Jul 20 16:42:03 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Wed, 20 Jul 2005 22:42:03 +0200
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <42DDC0E5.3050906@myri.com>
References: <20050720005555.GA4234@greglaptop.internal.keyresearch.com> <42DDB306.7010700@myri.com> <20050720022701.GA5030@greglaptop.internal.keyresearch.com>
<42DDC0E5.3050906@myri.com>
Message-ID: <42DEB71B.3090106@ccrl-nece.de>
Patrick Geoffray wrote:
> Greg Lindahl wrote:
>
>> I am referring to a comparison of the HPCC "random ring latency" to
>> the HPCC "average ping-pong" on the same hardware, with the same
>
>
> The random ring latency will increase with the size of the cluster,
> whereas the average pingpong will not as the pair of nodes are ordered
> and ordered nodes are likely to be in the same crossbar. If you
> randomize the machine list, then there is no difference between the
> random ring latency and the average pingpong.
>
> On a tiny cluster, all nodes are on the same crossbar, so it does not
> matter if the pair are ordered or not.
All this is true, but the MPI library plays an important role, too. On
our 32-node dual-Athlon cluster, the HPCC random ring latency with the
latest MPICH-GM (using shared memory for intra-node) was 27.9us, while
our MPI/PC-32 has 16.8us. This is on the exact same hardware, using the
same GM driver (measured Sept. 2004), with 32 MPI processes on 16 nodes.
And our MPI is (sort of) a swiss-army-knife MPI... I assume that the
intra-node communication makes the difference here.
Joachim
--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From joachim at ccrl-nece.de Wed Jul 20 16:46:46 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Wed, 20 Jul 2005 22:46:46 +0200
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <3.0.32.20050720150607.0110d930@pop3.xs4all.nl>
References: <3.0.32.20050720150607.0110d930@pop3.xs4all.nl>
Message-ID: <42DEB836.6020104@ccrl-nece.de>
Vincent Diepeveen wrote:
> Additionally, there will be software layers that have to lock in some way.
I'd say all relevant contemporary high-performance interconnects don't
do any locking for MPI communication. What are you referring to, exactly?
Joachim
--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 20 16:58:57 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 20 Jul 2005 13:58:57 -0700
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <42DDD54B.2030805@myri.com>
References: <20050720005555.GA4234@greglaptop.internal.keyresearch.com>
<42DDB306.7010700@myri.com>
<20050720022701.GA5030@greglaptop.internal.keyresearch.com>
<42DDC0E5.3050906@myri.com>
<20050720035341.GA1159@greglaptop.hsd1.ca.comcast.net>
<42DDD54B.2030805@myri.com>
Message-ID: <20050720205857.GA3015@greglaptop.internal.keyresearch.com>
> >To give you an example, look at the Quadrics reported numbers for
> >random ring latency of 11.4568 usec and average ping-pong of 1.552
> >usec. This is on a 2-cpu node (I think). I'd bet that most of this
> >difference has nothing to do with machine size. But I'd be happy to be
> >proven wrong.
>
> I would think 1.5 is shared memory in this case
Patrick,
That's too high: if you look at the "minimum ping-pong" of 0.937 usec,
that is their shared memory number. (The Quadrics guys are a lot
smarter than 1.5!)
> I prefer benchmarking real codes, and we will publish that, but 10G is
> taking most of my time these days (got to get something for you to
> compare against).
I'll look forward to it. We've published several application
benchmarks for you to compare to; a whitepaper is linked at the bottom
of: http://pathscale.com/infinipath-perf.html
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 20 17:14:22 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 20 Jul 2005 23:14:22 +0200
Subject: [Beowulf] New HPCC results, and an MX question
Message-ID: <3.0.32.20050720231422.0128c498@pop3.xs4all.nl>
Greg,
My chessprogram Diep uses multiple processes, but about everything
around me is multithreaded.
Multithreading, for example, is supported better by Windows than
multiprocessing.
If you share a big amount of RAM in Windows XP and store pointers in it,
meaning you have to map it at the same virtual address in every process,
then Microsoft is up to a factor of 100 slower there than it should be.
Linux, by default, allows 32MB of shared memory when you boot the
kernel. You have to tell the kernel, as 'root', to allow more before you
can run a program which eats more than 32MB of shared RAM.
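The Linux limit in question is kernel.shmmax, which defaulted to 32MB on
kernels of that era; a minimal sketch of bumping into it, where the 64MB
request and the sysctl value in the comment are arbitrary illustrations:

/* shmtest.c -- try to create a System V shared memory segment larger
 * than the default SHMMAX (32MB); a sketch. Raise the limit as root,
 * e.g. sysctl -w kernel.shmmax=268435456, before this succeeds. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t size = 64 * 1024 * 1024;   /* 64MB, above the 32MB default */
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);

    if (id == -1) {
        perror("shmget");             /* EINVAL when size > SHMMAX */
        return 1;
    }
    printf("got %zu-byte segment, id %d\n", size, id);
    shmctl(id, IPC_RMID, NULL);       /* clean up the segment */
    return 0;
}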
Arguably, multiprocessing works better for freak software, but about all
teachers and professors have been pushing students towards
multithreading for 10 years already.
Unless you want a very select group to use your software, you'll have
to take care that multithreaded software also works fast on clusters.
Now you'll argue that MPI in itself can already start many processes and
that there is no way out.
But take, for example, most chessprograms: they are multithreaded, so if
they go run on a cluster, they want 1 process started on each node, with
a few more threads than that node has cores.
This is a logical way to get more speedup out of a cluster.
Additionally, they would be near insane to first rewrite their already
well-working SMP algorithm from multithreading to multiprocessing, and
*then* add a 2nd layer of MPI parallel search.
If you consider that the majority of jobs on supercomputers are jobs of
4-8 cores, you'll also realize how big multithreading is in the
scientific world.
MPI is the big reason why not everyone who needs a lot of cpu power has
a cluster at home. If there were a layer on top of it providing the same
functionality, but in a kind of SSI form, there would be more software
running on clusters.
At 10:18 AM 7/20/2005 -0700, Greg Lindahl wrote:
>On Wed, Jul 20, 2005 at 03:06:08PM +0200, Vincent Diepeveen wrote:
>
>> Additionally, there will be software layers that have to lock in some way.
>
>Vincent, nobody builds networks this way, at least nobody building a
>high performance network. What everyone does is give N processes their
>own virtual copy of the chip, generally called a "port". Myrinet
>implements ports in software on their Lanai chip, we do them in
>hardware. In regular InfiniBand, the separate processes get separate
>queues.
>
>You are correct that software locking on the host cpu would be
>expensive, and that's why "threaded MPI" is a bad idea.
>
>-- greg
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 20 17:28:41 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 20 Jul 2005 14:28:41 -0700
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <3.0.32.20050720231422.0128c498@pop3.xs4all.nl>
References: <3.0.32.20050720231422.0128c498@pop3.xs4all.nl>
Message-ID: <20050720212841.GB3015@greglaptop.internal.keyresearch.com>
> Unless you want a very selected group to use your software, you'll have
> to take care that multithreaded software also works fast for clusters.
Vincent,
Funny, in all my years of activity in the high performance computing
niche, it's fairly rare that people run multi-threaded programs
accessing MPI. Pure MPI is the most widely used programming model,
followed as a distant second by OpenMP programs where only the main
thread does MPI (so a thread-safe MPI is not required), followed by a
few, very rare codes which are multi-threaded with either OpenMP or
Posix threads, where all the threads call MPI. Admittedly, this last
choice can be a good one if you have a multi-threaded program and
want to add MPI. But there is a speed hit with low latency networks
due to the extra locking needed. And that was the small point I was
attempting to make.
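For reference, these models map directly onto MPI-2's thread-support
levels; a minimal sketch of how a code requests and checks a level,
using only standard MPI-2 calls:

/* threadlevel.c -- request full thread support and report what the MPI
 * library actually provides; a minimal MPI-2 sketch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* MPI_THREAD_FUNNELED: only the main thread calls MPI (OpenMP+MPI);
       MPI_THREAD_MULTIPLE: any thread may call MPI (needs internal
       locking, which is where the latency hit comes from) */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0 && provided < MPI_THREAD_MULTIPLE)
        printf("library only provides thread level %d; "
               "restrict MPI calls to the main thread\n", provided);
    MPI_Finalize();
    return 0;
}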
Perhaps you can point me at some huge existing market of threaded MPI
programs? I'm sure you can speak with authority about your chess
program, but I didn't realize you were an expert on the general MPI
marketplace. With the advent of dual-core cpus, I figured the most
likely future was that most people were going to continue to run their
pure MPI programs as pure MPI programs.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 20 19:31:34 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Thu, 21 Jul 2005 01:31:34 +0200
Subject: [Beowulf] New HPCC results, and an MX question
Message-ID: <3.0.32.20050721013130.0128c498@pop3.xs4all.nl>
At 02:28 PM 7/20/2005 -0700, Greg Lindahl wrote:
>> Unless you want a very selected group to use your software, you'll have
>> to take care that multithreaded software also works fast for clusters.
>
>Vincent,
>
>Funny, in all my years of activity in the high performance computing
>niche, it's fairly rare that people run multi-threaded programs
It's the other way around.
MPI is so primitive that it doesn't support multithreaded programs very
well, and it forces people to use a single process for calculations.
This is a severe handicap, as the average student assumes that
multithreading is superior to multiprocessing.
You put it the other way around now.
You look in a tiny bottle and say: "it's all multiprocessing water".
Better to look at the real world and see that there is much more water
in the ocean, which could all be in your bottle.
Vincent
>accessing MPI. Pure MPI is the most widely used programming model,
>followed as a distant second by OpenMP programs where only the main
>thread does MPI (so a thread-safe MPI is not required), followed by a
>few, rare very codes which are multi-threaded with either OpenMP or
>Posix threads, and all the threads call MPI. Admittedly, this last
>choice is can be a good one if you have a multi-threaded program and
>want to add MPI. But there is a speed hit with low latency networks
>due to the extra locking needed. And that was the small point I was
>attemping to make.
>
>Perhaps you can point me at some huge existing market of threaded MPI
>programs? I'm sure you can speak with authority about your chess
>program, but I didn't realize you were an expert on the general MPI
>marketplace. With the advent of dual-core cpus, I figured the most
>likely future was that most people were going to continue to run their
>pure MPI programs as pure MPI programs.
>
>-- greg
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 20 21:24:07 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 20 Jul 2005 18:24:07 -0700
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <3.0.32.20050721013130.0128c498@pop3.xs4all.nl>
References: <3.0.32.20050721013130.0128c498@pop3.xs4all.nl>
Message-ID: <20050721012407.GD4727@greglaptop.internal.keyresearch.com>
On Thu, Jul 21, 2005 at 01:31:34AM +0200, Vincent Diepeveen wrote:
> MPI is so primitive that it doesn't allow multithreaded programs very well
> and forces people to use a single process for calculations.
Vincent,
The first half is true, but that has nothing to do with what I said.
I guess you were just looking for an excuse to grind your favorite
axe. I apologize to the entire mailing list for giving him an
opportunity -- I'll try to be more careful in the future.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From patrick at myri.com Wed Jul 20 21:49:09 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 20 Jul 2005 21:49:09 -0400
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <3.0.32.20050720150607.0110d930@pop3.xs4all.nl>
References: <3.0.32.20050720150607.0110d930@pop3.xs4all.nl>
Message-ID: <42DEFF15.1010902@myri.com>
Vincent Diepeveen wrote:
>>>There likely will be a difference, because average pingpong doesn't
>>>run on all the cpus. On a 4-cpu node, that can make a big difference.
>>
>>I believe the difference will not be that big. I will get my hands on a
>>quad in the next couple of weeks, I will look into int.
>
>
> The difference will be huge of course, network processors have a switch
> latency. That's why.
>
> If it must switch at the wrong moment that'll cost 50 us or something at
> certain network chips.
Switch latency is negligible in this problem, and in any event 50us is
not a realistic switch latency for modern hardware.
The real question is the following: do 4 processes running on 4
different CPUs greatly affect the latency when sending small messages to
other nodes, compared with only one process running on one CPU?
The answer, I argue, is "not much". Assuming that all processes send at
the exact same time, access to the PCI bus will be serialized, NIC
processing will be serialized, and access to the wire will be serialized.
The most expensive resource in this pipeline for 0-byte messages is
likely to be the NIC. So it boils down to the NIC overhead per send (or
recv), and that is not big with MX (and will be further reduced in the
future). In any event, it is not on the order of 10us. With GM, it's a
different story, as GM does not do PIO for small messages.
> Additionally, there will be software layers that have to lock in some way.
You don't have to lock when doing OS-bypass. At least, you don't have to
lock against other processes (which is the expensive kind of locking).
We take a spinlock because we have at least one other thread in the lib.
The gain of having such a thread outweighs the cost of the spinlock, no
question about that.
> Locking + unlocking is already like half a microsecond extra, just like that.
Taking a spinlock on an Opteron is ~50 ns. On Xeon or Nocona, it's a bit
more (~150 ns).
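For scale, the uncontended case is easy to measure directly; a minimal
pthread mutex sketch (the iteration count is arbitrary, and a pthread
spinlock could be substituted where available):

/* lockcost.c -- average cost of an uncontended pthread mutex
 * lock/unlock pair; a sketch. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timeval t0, t1;
    long i, iters = 10 * 1000 * 1000;
    double usec;

    gettimeofday(&t0, NULL);
    for (i = 0; i < iters; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    gettimeofday(&t1, NULL);

    usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.1f ns per lock/unlock pair (uncontended)\n",
           usec * 1000.0 / iters);
    return 0;
}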
> Tests at all processors at the same time make major sense.
Yes and no. Most networking people believe the job of a node is to send
messages. Actually, it's mainly to compute, and sometimes to send
messages. So, would running a ping-pong test on multiple processors at
the same time, sharing a NIC, be an interesting benchmark? Not really;
it won't happen much in real codes that compute most of the time. I
prefer to optimize other things that help the host compute faster.
> Any denial in advance that it will be the same speed is just baloney.
And I thought I was the biggest bully on this list...
I just give my opinion, and at least my opinion is backed up by
first-hand experience. I don't know how to play chess, but I know my stuff.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 20 23:07:08 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Thu, 21 Jul 2005 05:07:08 +0200
Subject: [Beowulf] New HPCC results and the Myri viewpoint
Message-ID: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
>> Tests at all processors at the same time make major sense.
>
>Yes and no. Most networking people believe the job of a node is to send
>messages. Actually, it's mainly to compute, and sometimes sends
>messages. So, would running a pingpong test on multiple processors at
>the same time sharing a NIC an interesting benchmark ? Not really, it
>won't happen much on real codes that compute most of the time. I prefer
>to optimize other things that help the host compute faster.
If most of the time they are 'just computing', then it just doesn't make
sense to have a high-end network. A $10 gigabit network will do in that
case.
Reality is however different. Reality is that you simply stress the
network until it wastes, say, 10-20% of your system time, up to a
maximum of 50%. 50% scaling of applications is acceptable if it
exponentially speeds up other calculations.
For 100% embarrassingly parallel calculations you can, so to speak, make
a distributed project with 1 internet server and distributed software.
In short, if you deliver high-end NICs, ASSUME they get used.
At least *i* understand that principle.
Weirdly enough, the manufacturer of the product assumes his stuff isn't
going to get used.
Why make it for your users then?
You try to sell a product without your users using it?
Vincent
>> Any denial in advance that it will be the same speed is just baloney.
>
>And I thought I was the bulliest on this list...
>
>I just give my opinion and at least my opinion is backed up by
>first-hand experience. I don't know how to play chess, but I know my stuff.
>
>Patrick
>--
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From patrick at myri.com Wed Jul 20 23:48:15 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 20 Jul 2005 23:48:15 -0400
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
Message-ID: <42DF1AFF.6030104@myri.com>
Vincent Diepeveen wrote:
>
>>>Tests at all processors at the same time make major sense.
>>
>>Yes and no. Most networking people believe the job of a node is to send
>>messages. Actually, it's mainly to compute, and sometimes sends
>>messages. So, would running a pingpong test on multiple processors at
>>the same time sharing a NIC an interesting benchmark ? Not really, it
>>won't happen much on real codes that compute most of the time. I prefer
>>to optimize other things that help the host compute faster.
>
>
> If most of the time they are 'just computing', then it just doesn't make
> sense to have a highend network. A $10 gigabit network will do in that case.
And it does for many people. What is the most used interconnect in the
cluster market? GigE.
> Reality is however different. Reality is that you simply stress the network
> until it wastes say 10-20% of your system time until a maximum of 50%.
What do you know about my reality? Your reality is an 8x8 chessboard.
Have you looked at a trace of one of the 10 ISV codes that make up the
majority of applications running on real-world clusters? Yes, they do
communicate, but they compute most of the time.
Your reality is very unusual: your problem size is tiny, and you add
nodes to go faster, not bigger. If you added nodes to go bigger, you
would realize that your compute/communicate ratio (usually) increases.
You have rambled on this list about parallel machines not being suited
to your usage. Maybe it's the other way around; maybe nobody thinks
about chess when they buy a cluster.
> In short, if you deliver highend nic's, ASSUME they get used.
Of course they will get used, that's not the question! It's about what
is important. Tuning for a pattern that is not common has little return.
An example for your curious and open mind: many interconnect people
advertise the streamed bandwidth curve, where the sender just keeps
sending messages as fast as possible. How often does this communication
pattern happen in my reality? Never. I have never seen an application
sending enough messages back to back to fill up the pipeline. So why
optimize for this case? Because the curve looks good and people like
to think they have a bigger pipe than their friends.
> At least *i* understand that principle.
Good for you. It must be lonely up there, so many stupid people around.
> Weirdly enough the manufacturer of a product assumes his stuff isn't going
> to get used.
>
> Why make it then for your users?
>
> You try to sell a product without your users using it?
What was that procmail filter again? I just remember the "idiot" part.
Got to look in the archives...
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From sdm900 at gmail.com Thu Jul 21 00:59:55 2005
From: sdm900 at gmail.com (Stuart Midgley)
Date: Thu, 21 Jul 2005 12:59:55 +0800
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <42DF1AFF.6030104@myri.com>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
<42DF1AFF.6030104@myri.com>
Message-ID: <924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
Actually, I tend to disagree with your comment here. The curve tells
you one of the characteristics of the network, which is VERY useful
in evaluating a network before you expend time/effort testing your
code on it (assuming you know your code well). On its own (without
lots of other micro-benchmarks) I agree that it is useless.
In my own experience, I tend to find that most codes are not latency
sensitive (that is, QsNetII, Infinipath, Myricom etc. are effectively
the same, in a latency sense, for most codes)... until they try to
scale to 1000's of cpus. All of a sudden, simple things like
barriers and synchronisation can become expensive on networks
with higher latencies. Things that the software writer assumed weren't
expensive start to dominate the code. Hence, the ping-pong
latencies and ring latencies are useful in giving you an idea of how
well the larger codes will scale.
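Barrier cost as a function of job size is cheap to measure directly; a
minimal MPI sketch (the iteration count is arbitrary):

/* barriertime.c -- average MPI_Barrier cost across the whole job;
 * a minimal sketch. Run at increasing node counts to see the trend. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 1000;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);        /* warm up and align the ranks */
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("average barrier: %.2f usec\n", (t1 - t0) * 1e6 / iters);
    MPI_Finalize();
    return 0;
}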
> An example for your curious and open mind: many interconnect people
> advertize the streamed bandwidth curve, where the sender just keeps
> sending messages as fast as possible. How often does this
> communication pattern happens in my reality ? Never. I have never
> seen an application sending enough messages back to back to fill up
> the pipeline. So why optimizing for this case ? because the curve
> looks good and people likes to think they have a bigger pipe than
> their friends.
--
Dr Stuart Midgley
sdm900 at gmail.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From patrick at myri.com Thu Jul 21 01:37:27 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Thu, 21 Jul 2005 01:37:27 -0400
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl> <42DF1AFF.6030104@myri.com>
<924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
Message-ID: <42DF3497.7070701@myri.com>
Hi Stuart,
Stuart Midgley wrote:
> Actually, I tend to disagree with your comment here. The curve tells
> you one of the characteristics of the network, which is VERY useful in
> evaluating a network before you expend time/effort testing your code on
> it (assuming you know your code well). On its own (without lots of
> other micro benchmarks) I agree that it is useless.
Yes, Keith noted it also: it's useful for evaluating the receive rate of
an N-to-1 pattern. I meant that it's useless to optimize the send side
for this case.
> In my own experience, I tend to find that most codes are not latency
> sensitive (that is, QsNetII, Infinipath, Myricom etc are effectively
> the same, on a latency sense, to most codes)... until they try and
> scale to the 1000's of cpu's. All of a sudden simple things like
> barriers and synchronisation etc can become expensive on networks with
> higher latencies. Things that the software writer wasn't expensive
> start to dominate their code. Hence, the ping-pong latencies and ring
> latencies are useful in giving you an idea of how well the larger codes
> will scale.
In my experience, the main source of delay at synchronization points
as the number of nodes increases is jitter between computation phases:
one node will be late entering the collective and delay the whole
sub-tree. The other source is contention in the fabric, especially at
1000's of nodes, which ring latency tests don't really exercise.
Ring latencies are a step in the right direction, though; it's still
quite synthetic, IMHO.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From gmpc at sanger.ac.uk Thu Jul 21 04:42:27 2005
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Thu, 21 Jul 2005 09:42:27 +0100 (BST)
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To:
References:
Message-ID:
>
> that's what I'd do. remember that you can use pxe to boot a bios-update
> image, so you don't necessarily have to touch each machine.
>
If you are stuck with BIOS updates that only come as DOS floppy images,
you can use bpbatch to turn them into PXE-bootable images.
http://cui.unige.ch/info/pc/remote-boot/soft/
Guy
--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From gmpc at sanger.ac.uk Thu Jul 21 09:55:38 2005
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Thu, 21 Jul 2005 14:55:38 +0100 (BST)
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To:
References:
Message-ID:
>
> interesting. this appears to be a project that eventually went commercial.
> however, I had no trouble booting to a floppy image using pxelinux, which is
> quite nice for booting clusters.
>
Do you have a config I can steal? I could never get pxebooting DOS disk
images to work; I ended up with all sorts of interesting hangs and
crashes.
Cheers,
Guy
--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Thu Jul 21 09:54:11 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Thu, 21 Jul 2005 09:54:11 -0400 (EDT)
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To:
Message-ID:
> > that's what I'd do. remember that you can use pxe to boot a bios-update
> > image, so you don't necessarily have to touch each machine.
> >
> If you are stuck with bios updates that only comes as DOS floppy images,
> you can use bpbatch to turn them into PXE-bootable images.
>
> http://cui.unige.ch/info/pc/remote-boot/soft/
interesting. this appears to be a project that eventually went commercial.
however, I had no trouble booting to a floppy image using pxelinux, which is
quite nice for booting clusters.
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Thu Jul 21 10:27:16 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Thu, 21 Jul 2005 10:27:16 -0400 (EDT)
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To:
Message-ID:
> > interesting. this appears to be a project that eventually went commercial.
> > however, I had no trouble booting to a floppy image using pxelinux, which is
> > quite nice for booting clusters.
> >
> Do you have a config I can steal? I could never get pxebooting DOS disk
> images to work; I ended up with all sorts of interesting hangs and
> crashes.
well, I didn't see any issues, but I used it in a homogeneous 96x cluster:
default nfsroot
#default flash
#default fromdisk
label nfsroot
    kernel /vmlinuz-2.6.11.11-x
    append root=/dev/ram initrd=image.gz init=/linuxrc vga=ext panic=60
    ipappend 1
label flash
    kernel /memdisk
    append initrd=/dl145-2004.10.08.img
label fromdisk
    kernel /test
    append root=/dev/hda2 vga=ext
    ipappend 1
regards, mark.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From gdjacobs at gmail.com Tue Jul 19 00:56:16 2005
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Mon, 18 Jul 2005 23:56:16 -0500
Subject: [Beowulf] help a newbie
In-Reply-To: <87f1c381050717220375a693a6@mail.gmail.com>
References: <87f1c381050717220375a693a6@mail.gmail.com>
Message-ID: <42DC87F0.6060709@gmail.com>
rupinder bhangu wrote:
> hi,
> I am Rupinder. I am a final-year student. I have planned to work on
> the topic of Beowulf clusters during my six months of training. I have
> also gone through some of the sites & other material on the Internet
> to gather the basic info regarding Beowulfs, because I had to convince
> my teachers to allow me to work on this topic. Having done that job
> successfully, I would now like help from people who are experienced in
> this field. I am really a newbie here, but I want to do it. Could you
> please tell me where to start, how to work, & any related help that
> you think would be useful for me? Could you also tell me whether a
> period of 6 months is adequate for a person like me to build a cluster
> with 3-4 nodes successfully?
> Thanks,
> Rupinder Kaur
Depending on your skill level, you can possibly do this in a matter of
days. Your task involves three elements.
1) Hardware
Gather together, test, and configure any hardware you need (including
networking).
2) System Software
OS installation, network share configuration, passwordless ssh/rsh
configuration (sketched at the end of this message), any libraries you
need for intercommunication (MPI, PVM, an OpenMosix-patched kernel),
routing, and a batch system (not likely needed with a small setup).
3) User Software
Learning the API for intercommunication and building a non-trivial
program to prove you are a cool parallel programmer (this is hard).
Brownie points if you build a program relevant to the people evaluating
you. You can skip this step if you are working in genomics or
meteorology (lots of free software in these areas), can use pre-canned
libraries like ScaLAPACK, or have scads of money to buy Gaussian
licenses (and you like chemistry), etc. You get the idea. Lots of ifs.
Using something like ROCKS will simplify 2), and you can often trick
someone into helping you with 1). Unfortunately, 3) is something you
usually have to do yourself.
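As an aside, the passwordless-ssh piece of 2) is only a few commands.
A sketch, assuming OpenSSH and home directories that are either shared
over NFS or copied to each node:
# as the user who will run jobs, on the head node:
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
# if home directories are not shared, push the keys to each node:
scp -r ~/.ssh node1: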
--
Geoffrey D. Jacobs
MORE CORE AVAILABLE, BUT NONE FOR YOU.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From landman at scalableinformatics.com Tue Jul 19 06:30:08 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 19 Jul 2005 06:30:08 -0400
Subject: [Beowulf] Re: dual core (latency)
In-Reply-To: <3.0.32.20050719064158.012813f0@pop3.xs4all.nl>
References: <3.0.32.20050719064158.012813f0@pop3.xs4all.nl>
Message-ID: <20050719101634.M54357@scalableinformatics.com>
On Tue, 19 Jul 2005 06:42:02 +0200, Vincent Diepeveen wrote
> At 11:05 AM 7/19/2005 +0800, Stuart Midgley wrote:
> >The first thing to note is that as you add cpu's the cost of the
> >cache snooping goes up dramatically. The latency of a 4 cpu (single
> >core) opteron system is (if my memory serves me correctly) around
> >120ns. Which is significantly higher than the latency of a dual
> >processor system (I think it scales roughly as O(n^2) where n is the
> >number of cpu's).
> >
> >Now, with a dual core system, you are effectively halving the
> >bandwidth/cpu over the hyper transport AND increasing the cpu count,
> >thus increasing the amount of cache snooping required. The end
> >result is drastically blown-out latencies.
> >
> >Stu.
>
> This doesn't answer things even remotely accurately.
Actually it was a very well written and quite accurate discussion of what you
were seeing.
> A) my test is doing no WRITES, just READS.
Doesn't matter, unless you turn off all cache effects on the memory you are
dealing with. A memory write is a read-modify-write operation, and a memory
read is a read operation. You still require that initial "snoop" to grab the
cache line. You basically ask all the other processors that have the
potential of sharing that cache line to look at which lines they have in
cache, and if they have the line in question, to please flush that line if it
is dirty (i.e. a pending but uncommitted write exists). Otherwise, please hand
over the cache line with all due speed.
It's not "complex" with 2 CPUs, just a little costly. It gets complex and
time consuming with 4. At 4 and higher it is one of the issues you take into
consideration when optimizing code. This is also why processor affinity is so
important: you can (to a degree) pre-bias where the pages (and hence cache
lines) sit relative to the CPU, and tie the memory and processor
together. This increases the likelihood of a line being local, and
potentially decreases the likelihood of the line being needed remotely.
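As a concrete illustration (taskset here is the util-linux tool on a
2.6 kernel; ./mycode and the PID are placeholders):
# pin a process to CPU 0 at launch, so its pages stay local:
taskset -c 0 ./mycode
# or pin an already-running process by PID:
taskset -p -c 1 12345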
> B) snooping might be for free.
Absolutely not.
> C) all other cores are just idle when such a latency test for just 1
> core happens and the rest of the system is idle.
The only way you can guarantee that the other cores are "idle" is to turn them
off.
> D) in all cases a
> dual core processor has a SLOWER latency and it doesn't make sense.
Makes a great deal of sense, as Stuart has pointed out. The snooping
cost scales somewhat better than O(N**2) on a system with a directory;
without a directory it is closer to O(N**2). The more snooping you need to do
before getting a cache line, the more latency you pay to get that initial
cache line. A directory-based system is effectively a hash table.
> E) you don't seem to grasp the difference between LATENCY and BANDWIDTH;
Hmmmm. I think Stuart gets it very well. I am not convinced that you get the
issue of how important and expensive cache line processing via snoopy
algorithms is, and what its impact upon overall processing time is.
Joe Landman
--
Scalable Informatics LLC
http://www.scalableinformatics.com
phone: +1 734 786 8423
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From ballen at gravity.phys.uwm.edu Wed Jul 20 23:20:52 2005
From: ballen at gravity.phys.uwm.edu (Bruce Allen)
Date: Wed, 20 Jul 2005 22:20:52 -0500 (CDT)
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To: <187D3A7CAB42A54DB61F1D05F01257220664B0B7@orsmsx402.amr.corp.intel.com>
Message-ID:
> > If you have lots of sick machines, you could try making BIOS
> > settings using the nvram module
> >
> > On a good node: modprobe nvram; cat /dev/nvram > biosfile
> >
> > On a bad one: modprobe nvram; cat biosfile > /dev/nvram
> >
> > YMMV. The management accept no liability for lost bits.
> > Seriously, give it a try on one node.
>
> Before you do this, you may want to check BIOS levels on a good v. bad
> system and try a BIOS reset on the bad machine. Even a batch of systems
> delivered at the same time can have different BIOS levels. I've also
> seen a BIOS reset cure these odd problems if they were all already at
> the same version.
>
> The final bit, which is more in the voodoo camp, is the battery...
>
> If the BIOS versions differ, would copying the nvram from one system to
> another be an issue?
Even if the BIOS versions are the same, this won't copy the settings
reliably. We tried this several years ago using some identical Intel
motherboards and discovered that there are lots of BIOS settings that are
not kept in the handful of nvram bytes that /dev/nvram has access to.
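A quick way to see how little that is; the nvram module exposes only the
legacy CMOS area (114 bytes on i386):
modprobe nvram
wc -c < /dev/nvram     # typically prints 114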
Bruce
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From tbole1 at umbc.edu Thu Jul 21 00:55:05 2005
From: tbole1 at umbc.edu (Timothy Bole)
Date: Thu, 21 Jul 2005 00:55:05 -0400 (EDT)
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <42DF1AFF.6030104@myri.com>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
<42DF1AFF.6030104@myri.com>
Message-ID:
I must side with Patrick on this issue.
A GigE network works just fine for me, running embarrassingly parallel
Monte Carlo simulations. I haven't seen RGB weigh in on this, so I'll try
to make the point which I think would be his. A Beowulf is engineered to
solve a problem, not the other way around.
I don't know how much time parallelized chess programs spend passing
messages, but I can tell you that it seems to be a small market in the
grand scheme of Beowulfery. Most clusters that I have seen are devoted to
large scale simulations or numerical analysis. For these types of uses,
it seems a good bet that most time is spent on computation, not message
passing.
just my US$0.02
-tim
On Wed, 20 Jul 2005, Patrick Geoffray wrote:
> Vincent Diepeveen wrote:
> >
> >>>Tests at all processors at the same time make major sense.
> >>
> >>Yes and no. Most networking people believe the job of a node is to send
> >>messages. Actually, it's mainly to compute, and sometimes to send
> >>messages. So, would running a pingpong test on multiple processors at
> >>the same time sharing a NIC be an interesting benchmark? Not really; it
> >>won't happen much on real codes that compute most of the time. I prefer
> >>to optimize other things that help the host compute faster.
> >
> >
> > If most of the time they are 'just computing', then it just doesn't make
> > sense to have a highend network. A $10 gigabit network will do in that case.
>
> And it does for many people. What is the most used interconnect in the
> cluster market? GigE.
>
> > Reality is however different. Reality is that you simply stress the network
> > until it wastes say 10-20% of your system time, up to a maximum of 50%.
>
> What do you know about my reality? Your reality is an 8x8 chessboard.
> Have you looked at a trace of one of the 10 ISV codes that are the
> majority of applications running on real world clusters ? Yes they do
> communicate, but they compute most of the time.
>
> Your reality is very unusual: your problem size is tiny; you add nodes
> to go faster, not bigger. If you added nodes to go bigger, then you
> would realize that your compute/communicate ratio (usually) increases.
>
> You have rambled on this list about parallel machines not being suited
> to your usage. Maybe it's the other way around: maybe nobody thinks about
> chess when they buy a cluster.
>
> > In short, if you deliver highend nic's, ASSUME they get used.
>
> Of course they will get used; that's not the question! It's about what
> is important. Tuning for a pattern that is not common has little return.
>
> An example for your curious and open mind: many interconnect people
> advertise the streamed bandwidth curve, where the sender just keeps
> sending messages as fast as possible. How often does this communication
> pattern happen in my reality? Never. I have never seen an application
> sending enough messages back to back to fill up the pipeline. So why
> optimize for this case? Because the curve looks good and people like
> to think they have a bigger pipe than their friends.
>
> > At least *i* understand that principle.
>
> Good for you. It must be lonely up there, so many stupid people around.
>
> > Weirdly enough the manufacturer of a product assumes his stuff isn't going
> > to get used.
> >
> > Why make it then for your users?
> >
> > You try to sell a product without your users using it?
>
> What was that procmail filter again? I just remember the "idiot" part.
> Got to look in the archives...
>
> Patrick
> --
>
> Patrick Geoffray
> Myricom, Inc.
> http://www.myri.com
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
=========================================================================
Timothy W. Bole a.k.a valencequark
Graduate Student
Department of Physics
UMBC
http://www.beowulf.org
=========================================================================
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hendrata at students.ee.itb.ac.id Wed Jul 13 11:29:27 2005
From: hendrata at students.ee.itb.ac.id (13200178 Hendra Tampang Allo)
Date: Wed, 13 Jul 2005 22:29:27 +0700 (WIT)
Subject: [Beowulf] Clustermatic on red-hat 9.0
Message-ID: <20050713222649.N19856@students.ee.itb.ac.id>
Hi everyone. I'm sorry, I'm a newbie. Can I install Clustermatic on
Red Hat 9.0? Why do I always read in Google search results that
Clustermatic 3 is used on Red Hat 8?
Thanks a lot.
Soli Deo Gloria & Sola Christa Eterna
Hendra/EL-00
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hendrata at students.ee.itb.ac.id Wed Jul 20 04:48:06 2005
From: hendrata at students.ee.itb.ac.id (13200178 Hendra Tampang Allo)
Date: Wed, 20 Jul 2005 15:48:06 +0700 (WIT)
Subject: [Beowulf] help a newbie
In-Reply-To: <87f1c381050717220375a693a6@mail.gmail.com>
References: <87f1c381050717220375a693a6@mail.gmail.com>
Message-ID: <20050720154342.B74634@students.ee.itb.ac.id>
I am also a newbie like you, but I think it's not hard to build a cluster.
Just search on Google and you will find many ways of making a Beowulf
cluster. But what kind of jobs will you run on your cluster? I am
building a cluster to run NAMD (a molecular dynamics code) and I was
advised to use Red Hat 8 + Clustermatic 3.
Soli Deo Gloria & Sola Christa Eterna
Hendra/EL-00
On Sun, 17 Jul 2005, rupinder bhangu wrote:
> hi
> I am Rupinder. I am a final-year student. I have planned to work on the topic
> of Beowulf clusters during my six months of training. I have also gone through
> some of the sites & other material on the Internet to gather basic info
> regarding Beowulfs, because I had to convince my teachers to allow me to
> work on this topic. Having done that successfully, I would now like to
> have help from people who are experienced in this field. I am really
> a newbie in this field, but I want to do it. Could you please tell me where
> to start, how to work, & what related help you think would be useful for
> me? Could you also tell me whether a period of 6 months is adequate for a
> person like me to build a cluster with 3-4 nodes successfully?
> Thanks
> Rupinder Kaur
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From cjtan at optimanumerics.com Thu Jul 21 01:49:52 2005
From: cjtan at optimanumerics.com (C J Kenneth Tan -- OptimaNumerics)
Date: Thu, 21 Jul 2005 06:49:52 +0100 (BST)
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
Message-ID:
On 2005-07-21 05:07 +0200 Vincent Diepeveen (diep at xs4all.nl) wrote:
> Date: Thu, 21 Jul 2005 05:07:08 +0200
> From: Vincent Diepeveen
> To: Patrick Geoffray
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] New HPCC results and the Myri viewpoint
>
> 50% scaling of applications is acceptable, if it exponentially speeds up
> other calculations.
>
I do find this statement rather interesting and amusing. I certainly
haven't seen such cases in numerical code. Is this supposed to be new
gospel, or is this coffee table talk?
Kenneth Tan
--------------------------------------------------------------------------
C J Kenneth Tan, PhD
OptimaNumerics Ltd Telephone: +44 798 941 7838
E-mail: cjtan at OptimaNumerics.com Telephone: +44 871 504 3328
Web: http://www.OptimaNumerics.com Facsimile: +44 289 066 3015
--------------------------------------------------------------------------
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From bropers at cct.lsu.edu Thu Jul 21 12:43:57 2005
From: bropers at cct.lsu.edu (Brian D. Ropers-Huilman)
Date: Thu, 21 Jul 2005 11:43:57 -0500
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To:
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl> <42DF1AFF.6030104@myri.com>
Message-ID: <42DFD0CD.9020208@cct.lsu.edu>
Timothy Bole said the following on 2005.07.20 23:55:
> A Beowulf is engineered to
> solve a problem, not the other way around.
I think we'd all agree that this is the way it should be, but I contend it
is rarely the case. I certainly don't have the luxury of designing systems
to match a problem, though I so wish I could. Rather, I have a very generic
resource available to a multitude of researchers with many and varied problems.
In this case, I _encourage_ our users, to the extent that is possible, to
optimize their code for our platform.
Likewise, our applications framework, Cactus, is heavily optimized for
various systems, so no matter where it runs, it's going to run as
optimally as possible.
Picking a nit, of sorts, but I think it's an important point to make.
--
Brian D. Ropers-Huilman .:. Asst. Director .:. HPC and Computation
Center for Computation & Technology (CCT) bropers at cct.lsu.edu
Johnston Hall, Rm. 350 +1 225.578.3272 (V)
Louisiana State University +1 225.578.5362 (F)
Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mathog at mendel.bio.caltech.edu Thu Jul 21 12:56:37 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Thu, 21 Jul 2005 09:56:37 -0700
Subject: PXE bios update, was RE: [Beowulf] Performance issue - CPU Intel 00/02
Message-ID:
Guy Coates wrote:
>
> >
> > that's what I'd do. remember that you can use pxe to boot a bios-update
> > image, so you don't necessarily have to touch each machine.
> >
> > If you are stuck with bios updates that only come as DOS floppy images,
> you can use bpbatch to turn them into PXE-bootable images.
>
> http://cui.unige.ch/info/pc/remote-boot/soft/
That's all very interesting, but I've seen all sorts of whacko things
happen with BIOS upgrades. For instance, the loss of all
previous settings (the default in the last Tyan BIOS upgrader
I used; there was an optional command-line switch to retain them) or
the redefinition of existing settings to have different
meanings (also Tyan, where Quickboot Enabled on 4.03 had to
be changed to Quickboot Disabled on 4.06). If you're doing this
via PXE, how do you review the settings that exist after the
BIOS is upgraded?
Actually this is one of my long-standing pet peeves about the BIOS
in general. Even updating straight from a floppy, I've always wanted
to be able to do something like:
DOS> BIOSUPGRADE BIOS123.IMG SETTINGS.TXT
where the text file SETTINGS.TXT contains all of the settings
for the BIOS that's being loaded. Unfortunately, I've yet to encounter
a single BIOS utility that provides anything remotely near this level
of functionality.
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Thu Jul 21 12:23:38 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Thu, 21 Jul 2005 17:23:38 +0100
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To:
References:
Message-ID: <1121963018.8056.138.camel@vigor13>
On Thu, 2005-07-21 at 14:55 +0100, Guy Coates wrote:
> >
> > interesting. this appears to be a project that eventually went commercial.
> > however, I had no trouble booting to a floppy image using pxelinux, which is
> > quite nice for booting clusters.
> >
> Do you have a config I can steal? I could never get pxebooting DOS disk
> images to work; I ended up with all sorts of interesting hangs and
> crashes.
We've done this, generally using a FreeDOS image.
The hangs might be down to your particular motherboard.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Thu Jul 21 13:28:01 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Thu, 21 Jul 2005 10:28:01 -0700
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
<42DF1AFF.6030104@myri.com>
<924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
Message-ID: <20050721172800.GA1882@greglaptop.internal.keyresearch.com>
On Thu, Jul 21, 2005 at 12:59:55PM +0800, Stuart Midgley wrote:
> Actually, I tend to disagree with your comment here. The curve tells
> you one of the characteristics of the network,
In particular, the ping-pong bandwidth curve is the lower bound, and
the streaming curve is the upper bound, of what you'll see in real
life.
> In my own experience, I tend to find that most codes are not latency
> sensitive (that is, QsNetII, Infinipath, Myricom etc
... did I miss you signing up to run on our customer benchmark
cluster? ;-) In any case, I can point you to quite a few sets of
online performance numbers where scalability is being hurt by
short-message performance, which is the thing that everyone means when
they say "latency sensitive". For example, quantum chemistry is very
well known for being hard to scale. Climate involves running small
datasets with large numbers of timesteps, and that inevitably ends up
being latency sensitive. And so on.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From bill at princeton.edu Thu Jul 21 14:27:06 2005
From: bill at princeton.edu (Bill Wichser)
Date: Thu, 21 Jul 2005 14:27:06 -0400
Subject: [Beowulf] Performance issue - CPU Intel 00/02
In-Reply-To:
References:
Message-ID: <42DFE8FA.7010101@princeton.edu>
Here are my findings to date:
Problem does NOT move with CPU but remains on motherboard.
Resetting CMOS via jumper on board changes nothing.
Exchanging batteries, both supplying correct voltage, between bad/good
nodes changes nothing.
Updating the BIOS firmware, making sure to reset CMOS in the process,
changes nothing.
The BIOS at bootup always displays the processor as a Pentium 4; x86info
still displays the bad node's CPU as Celeron (P4 core) [B0] Original OEM.
Again, I'm out of ideas. Wish me luck when trying to deal with Tyan!
Thanks for all your pointers/suggestions thus far.
Bill
> Bill Wichser wrote:
>
>
>>System: 128 node Intel 2.4GHz P4
>>MBO: Tyan S2099, i845E
>>OS: RedHat 8.0, kernel 2.4.18-18.8.0 (but 2.4.20-28.8 changes nothing)
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From gmpc at sanger.ac.uk Thu Jul 21 14:44:30 2005
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Thu, 21 Jul 2005 19:44:30 +0100 (BST)
Subject: PXE bios update, was RE: [Beowulf] Performance issue - CPU Intel
00/02
In-Reply-To:
References:
Message-ID:
> be changed to Quickboot Disabled on 4.06). If you're doing this
> via PXE how do you review the settings that exist after the
> BIOS is upgraded?
You are right; PXE+BIOS does give you the ability to brick many machines
in a highly efficient and spectacular manner.
We actually do main BIOS updates from within Linux where we can. IBM and
HP kit both ship utilities that allow you to flash, query, and change
BIOS settings from within Linux, so you can do as you describe.
The PXE-boot/DOS approach only gets used where we have no other choice. The
last time we did it in anger was to update the firmware on 560 Broadcom
NICs, which were locking up under high network load. That flash procedure
is (probably) a lot safer than a BIOS upgrade, as there are no
user settings which need to be kept during the upgrade.
Cheers,
Guy
--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From patrick at myri.com Thu Jul 21 15:11:15 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Thu, 21 Jul 2005 15:11:15 -0400
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <1121972605.12616.8.camel@fpga>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
<42DF1AFF.6030104@myri.com>
<924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
<42DF3497.7070701@myri.com> <1121972605.12616.8.camel@fpga>
Message-ID: <42DFF353.2000709@myri.com>
Keith D. Underwood wrote:
>
>>Yes, Keith noted it also, it's useful to evaluate the receive rate of
>>a
>>N-to-1 pattern. I meant that it's useless to optimize the send side in
>>this case.
>
>
> Except that many nodes do both an N-to-1 and a 1-to-N, i.e. they
> receive from N neighbors and send to those same N neighbors in each time
> step. That means that both the send and receive sides of the streaming
> bandwidth issue matter.
I have never seen a code with 1-to-N where N is big enough to fill up
the pipeline. It's this type of pattern (N-to-1 is much worse than
1-to-N) that kills scalability in large configurations, and developers
know that. Even in a classic all-to-all, most implementations use a
spanning log tree, some even with a fan-out limit, just to limit the damage.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kdunder at sandia.gov Thu Jul 21 15:03:25 2005
From: kdunder at sandia.gov (Keith D. Underwood)
Date: Thu, 21 Jul 2005 13:03:25 -0600
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <42DF3497.7070701@myri.com>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
<42DF1AFF.6030104@myri.com>
<924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
<42DF3497.7070701@myri.com>
Message-ID: <1121972605.12616.8.camel@fpga>
> Yes, Keith noted it also, it's useful to evaluate the receive rate of
> a
> N-to-1 pattern. I meant that it's useless to optimize the send side in
> this case.
Except that many nodes do both an N-to-1 and a 1-to-N, i.e. they
receive from N neighbors and send to those same N neighbors in each time
step. That means that both the send and receive sides of the streaming
bandwidth issue matter.
Keith
--
Keith D. Underwood
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kdunder at sandia.gov Thu Jul 21 15:07:46 2005
From: kdunder at sandia.gov (Keith D. Underwood)
Date: Thu, 21 Jul 2005 13:07:46 -0600
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <20050721172800.GA1882@greglaptop.internal.keyresearch.com>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
<42DF1AFF.6030104@myri.com>
<924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
<20050721172800.GA1882@greglaptop.internal.keyresearch.com>
Message-ID: <1121972866.12616.12.camel@fpga>
> ... did I miss you signing up to run on our customer benchmark
> cluster? ;-) In any case, I can point you to quite a few sets of
> online performance numbers where scalability is being hurt by
> short-message performance, which is the thing that everyone means when
> they say "latency sensitive". For example, quantum chemistry is very
> well known for being hard to scale. Climate involves running small
> datasets with large numbers of timesteps, and that inevitably ends up
> being latency sensitive. And so on.
In many of those cases, the latency impact on throughput (i.e. the
streaming numbers) is what is important. I don't know of anything that
does a tight loop of "small amount of work, ping-pong with one other
processor, repeat". In most networks, message throughput == 1/latency
(approximately, anyway). It doesn't have to be that way.
Keith
--
Keith D. Underwood
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From john.hearns at streamline-computing.com Thu Jul 21 15:46:14 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Thu, 21 Jul 2005 20:46:14 +0100
Subject: PXE bios update, was RE: [Beowulf] Performance issue - CPU
Intel 00/02
In-Reply-To:
References:
Message-ID: <1121975174.8056.160.camel@vigor13>
On Thu, 2005-07-21 at 09:56 -0700, David Mathog wrote:
> Guy Coates wrote:
>
> >
> > >
> > > that's what I'd do. remember that you can use pxe to boot a bios-update
> > > image, so you don't necessarily have to touch each machine.
> > >
> > If you are stuck with bios updates that only come as DOS floppy images,
> > you can use bpbatch to turn them into PXE-bootable images.
> >
> > http://cui.unige.ch/info/pc/remote-boot/soft/
>
> That's all very interesting but I've seen all sorts of whacko things
> happen with BIOS upgrades. For instance, the loss of all
> previous settings (the default in the last Tyan BIOS upgrader
I agree. We've had problems with a certain motherboard when the BIOS is
redirected to serial, and yes, it's common for the BIOS settings to be
reset when you do an upgrade.
Roll on LinuxBIOS, or any replacement that makes it possible to manage
large farms and update via a network boot.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Thu Jul 21 17:55:14 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Thu, 21 Jul 2005 14:55:14 -0700
Subject: [Beowulf] New HPCC results and the Myri viewpoint
In-Reply-To: <1121972866.12616.12.camel@fpga>
References: <3.0.32.20050721050704.0128e7f8@pop3.xs4all.nl>
<42DF1AFF.6030104@myri.com>
<924E32F7-97E9-45BF-A817-460BF0A2664A@gmail.com>
<20050721172800.GA1882@greglaptop.internal.keyresearch.com>
<1121972866.12616.12.camel@fpga>
Message-ID: <20050721215514.GB2674@greglaptop.internal.keyresearch.com>
On Thu, Jul 21, 2005 at 01:07:46PM -0600, Keith D. Underwood wrote:
> I don't know of anything that
> does a tight loop of "small amount of work, ping-pong with one other
> processor, repeat".
Right, it's typically "small amount of work, ping-pong with N other
processors, repeat", where N=2, 4, 8, 9, 27... Solidly in the middle
between ping-pong with 1 node and streaming. This describes pretty
much all time-explicit finite-difference codes decomposed in 1-3
dimensions.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From erwan at seanodes.com Fri Jul 22 04:08:37 2005
From: erwan at seanodes.com (Velu Erwan)
Date: Fri, 22 Jul 2005 10:08:37 +0200
Subject: PXE bios update, was RE: [Beowulf] Performance issue - CPU
Intel 00/02
In-Reply-To:
References:
Message-ID: <1122019717.10675.26.camel@R1.seanodes.com>
> where the text file settings.txt contains all of the settings
> for the BIOS that's being loaded. Unfortunately I've yet to encounter
> a single BIOS utility that provides anything remotely near this level
> of functionality.
I was discussing this with a Tyan tech, who told me Tyan was about to
release such a tool... That was a few months ago. Still nothing...
BIOS quality, and the quality of the associated tools, is not very good
for many manufacturers. How many allow you to flash the BIOS from Linux?
How many allow you to configure it via Linux? How many fill the DMI
table with useful information? How many ship a BIOS with a wrong DSDT
table? (Not directly useful on clusters, but linked to the BIOS quality
problem; cf. http://acpi.sf.net for more information.)
In the cluster world these issues are really important, and I'm always
disappointed when I discover a big BIOS bug. If I'm able to find it in
10 minutes, the BIOS manufacturer is also able to. It seems motherboard
manufacturers release too many boards to provide a good-quality BIOS
for each; many of them release 5 or 6 versions before having a stable
BIOS!
As John Hearns said, LinuxBIOS could be very interesting, but too few
motherboards are supported yet :(
I hope, as RMS said in http://www.fsf.org/campaigns/supportlinuxbios.html,
that many manufacturers will help the development of the LinuxBIOS
project. It could become a nice alternative for issues like keeping your
configuration after a BIOS update.
One more thing: if your systems are strictly identical, you can update
one by hand, set its configuration, and save the NVRAM using the dd
command: dd if=/dev/nvram of=nvram. Then you update the BIOS of the
next machine and restore the NVRAM by dd'ing the copy you saved:
dd if=nvram of=/dev/nvram.
Then you reboot. If your systems really are the same, it may work; if
not, the BIOS will say that your checksum is invalid (and you must
reconfigure it).
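A sketch of that procedure pushed across a set of identical nodes; the
node names, and the assumption of passwordless root ssh, are mine:
# on the reference node, after configuring its BIOS by hand:
modprobe nvram
dd if=/dev/nvram of=nvram.good
# replicate to identical, already-flashed nodes:
for n in node01 node02 node03; do
    scp nvram.good $n:/tmp/nvram.good
    ssh $n 'modprobe nvram; dd if=/tmp/nvram.good of=/dev/nvram'
done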
Regards,
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hendrata at students.ee.itb.ac.id Fri Jul 22 10:16:31 2005
From: hendrata at students.ee.itb.ac.id (13200178 Hendra Tampang Allo)
Date: Fri, 22 Jul 2005 21:16:31 +0700 (WIT)
Subject: [Beowulf] help a newbie
In-Reply-To: <42DC87F0.6060709@gmail.com>
References: <87f1c381050717220375a693a6@mail.gmail.com>
<42DC87F0.6060709@gmail.com>
Message-ID: <20050722211022.B77181@students.ee.itb.ac.id>
I am Hendra, and I have a similar final project to Rupinder's. I have
successfully installed Fedora Core 3 on two PCs, but I don't know how
to set up passwordless login between the two machines. Can anyone help
me?
Thanks a lot.
Soli Deo Gloria & Sola Christa Eterna
Hendra/EL-00
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From jmdavis at mail2.vcu.edu Mon Jul 25 11:32:25 2005
From: jmdavis at mail2.vcu.edu (Mike Davis)
Date: Mon, 25 Jul 2005 11:32:25 -0400
Subject: [Beowulf] VASP
Message-ID: <42E50609.4030403@mail2.vcu.edu>
Hello all,
I have spent 3 days trying to compile VASP 4.6 for parallel operation
using PGF. I have compiled MPICH with PGF, compiled VASP in serial mode,
and flailed at trying to get it to compile with MPI.
I moved mpif.h to the VASP directory, but it still won't work and seems
to be breaking on the MPI commands.
Any advice?
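For reference, one quick way to sanity-check the MPI Fortran toolchain
itself, independent of VASP (the MPICH install path is an assumption;
substitute your own):
cat > hello.f <<'EOF'
      program hello
      include 'mpif.h'
      integer ierr, rank
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      print *, 'hello from rank', rank
      call MPI_FINALIZE(ierr)
      end
EOF
/usr/local/mpich/bin/mpif77 -o hello hello.f
/usr/local/mpich/bin/mpirun -np 2 ./hello
If that works, pointing the FC variable in VASP's makefile at the same
mpif77/mpif90 wrapper is usually cleaner than copying mpif.h around,
since the wrapper supplies the include and library paths itself.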
Mike Davis
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From eugen at leitl.org Tue Jul 26 06:20:11 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Tue, 26 Jul 2005 12:20:11 +0200
Subject: [Beowulf] [richard@SCL.UTAH.EDU: MacEnterprise Closing & Parallel
Computing Session Online]
Message-ID: <20050726102011.GN2259@leitl.org>
----- Forwarded message from Richard Glaser -----
From: Richard Glaser
Date: Mon, 25 Jul 2005 22:17:14 -0600
To: MACENTERPRISE at LISTSERV.CUNY.EDU
Subject: MacEnterprise Closing & Parallel Computing Session Online
Reply-To: Mac OS X enterprise deployment project
FYI:
The following sessions from the MacEnterprise Day from June 5th 2005
are available for viewing.
MacEnterprise Day - Closing
---------------------------
http://data.scl.utah.edu/fmi/xsl/stream/details.xsl?-recid=225
Priorities to bring to Apple in the Enterprise environment. Bring
your unanswered questions and innovative solutions for an open floor
discussion! And awards to top contributors to project.
Parallel Computing
------------------
http://data.scl.utah.edu/fmi/xsl/stream/details.xsl?-recid=228
by Dean Dauger, Dauger Research
Charles Parnot, Stanford University
James Reynolds, University of Utah
The three participants will present their work in Parallel computing
with the Macintosh:
Combining powerful, numerically-intensive parallel computing clusters
with the famed ease-of-use of the Macintosh, Pooch is the only
solution that merges a modern graphical user interface with
supercomputer-compatible parallel computing. This software enables
users worldwide, without any OS X expertise, to develop and run
parallel code efficiently and productively.
This session will also include discussion of a particular biological
model and the architecture of the Xgrid-aware application built to
run computations used to analyze biophysical studies done in the lab,
towards the goal of better understanding one receptor involved in
heart regulation.
Since the official release of Xgrid, it has been used to render
POV-Ray and Maya animations with more accuracy and ease, using
modifications of Apple's Xgrid sample code to submit jobs.
Experiences and future plans for releasing a submission and job queue
engine for Maya and POV-Ray will be shared.
--
Thanks:
Richard Glaser
University of Utah - Student Computing Labs
richard at scl.utah.edu
801-585-8016
_____________________________________________________
Subscription Options and Archives
http://listserv.cuny.edu/archives/macenterprise.html
----- End forwarded message -----
--
Eugen* Leitl leitl
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 27 10:37:44 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 27 Jul 2005 07:37:44 -0700
Subject: [Beowulf] New HPCC results, and an MX question
In-Reply-To: <20050720022701.GA5030@greglaptop.internal.keyresearch.com>
References: <20050720005555.GA4234@greglaptop.internal.keyresearch.com>
<42DDB306.7010700@myri.com>
<20050720022701.GA5030@greglaptop.internal.keyresearch.com>
Message-ID: <20050727143744.GB1416@greglaptop>
Last week, Patrick asked:
> By the way, could you point me to the raw performance data on the
> pathscale web pages ?
This information is now published on our webpages:
http://pathscale.com/infinipath-perf.html
Go to the pallas page, then at the bottom there's a link to the
"Original Data".
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From spacetiller at 163.com Mon Jul 25 22:30:42 2005
From: spacetiller at 163.com (Zhang Hui)
Date: Tue, 26 Jul 2005 10:30:42 +0800
Subject: [Beowulf] Mii interface not supported
Message-ID: <200507260214.j6Q2EDEr026389@bluewest.scyld.com>
Hello,
I have written a NIC monitoring function using MII, but it does not work; even when I use the mii-tool command in the shell, it says "mii interface, operation not supported."
Later I found that some NICs do not support MII. But that's not the end of it: I still need the monitoring function, very urgently. So can anyone tell me whether there is some other, better method to monitor the NIC?
Thanks in advance. Great appreciation for any reply.
Zhang Hui
spacetiller at 163.com
2005-07-26
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From josip at lanl.gov Wed Jul 27 12:26:26 2005
From: josip at lanl.gov (Josip Loncaric)
Date: Wed, 27 Jul 2005 10:26:26 -0600
Subject: [Beowulf] Opteron memory rank limits with DDR-400
Message-ID: <42E7B5B2.1010600@lanl.gov>
Hello,
Can anyone confirm that Opteron processors Rev. E and later can operate
four dual-rank 2GB memory modules (8 ranks total) at full DDR-400 speed?
AMD used to recommend no more than 4 ranks of DDR-400 memory. See
http://forums.amd.com/lofiversion/index.php/t39745.html where the
relevant quote from AMD technical service reads:
"AMD does recommend to downclock the memory of the system to 333MHz,
if more than 4 ranks is used in the DIMM slots. What this means is
that only 2 sticks of 2 rank memory is recommended to run at the full
400MHz or 4 sticks of 1 rank memory. There is a memory timing issue
with more than 4 ranks of memory, which is a limitation of the memory
controller on the Opteron chips."
In the past, this downclocking was automatically enforced by some
BIOSes, but supposedly there is no need to do so with currently shipping
Opteron Rev. E and later, provided that the motherboard also allows full
8 ranks at DDR-400.
I'd just like to be sure... Also, has anyone observed increased memory
latency with dual-rank modules?
Sincerely,
Josip
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From cousins at limpet.umeoce.maine.edu Wed Jul 27 12:43:32 2005
From: cousins at limpet.umeoce.maine.edu (Steve Cousins)
Date: Wed, 27 Jul 2005 12:43:32 -0400 (EDT)
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To: <200507141900.j6EJ084D017387@bluewest.scyld.com>
Message-ID:
On Thu, 14 Jul 2005 11:25:12 +0100 Igor Kozin wrote:
> But now, for 4 cores/2 CPUs per Opteron node, to force the use of
> > only 2 cores (from 4), 1 per chip, we'll need to have
> > cpu affinity support in Linux.
>
> Mikhail,
> you can use "taskset" for that purpose.
> For example, (perhaps not in the most elegant form)
> mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
> But I doubt you want to let the idle cores to do something else
> in the mean time. However small, you will generally see an increase
> in performance if you use all the cores.
We are considering getting a Dual Dual-Core Opteron system vs. two Dual
Opteron systems. We like the ability to use all four cores on one model
but a lot of what we'll do is have two models running at the same time,
each using two cores.
We are worried that running two models on one system with four cores (each
model using two cores) will not work as well as using two systems, each
with two cores/CPUs. Is this what you were referring to (Igor) when you
wrote:
> But I doubt you want to let the idle cores to do something else
> in the mean time.
We have an 8 CPU SGI Origin 3200 that has no problem doing this sort of
thing. I'm just curious what the implications are of doing this with the
Dual Core Opteron cpu's.
Thanks,
Steve
______________________________________________________________________
Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Jul 27 12:43:45 2005
From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed, 27 Jul 2005 18:43:45 +0200 (CEST)
Subject: [Beowulf] Mii interface not supported
In-Reply-To: <200507260214.j6Q2EDEr026389@bluewest.scyld.com>
Message-ID:
On Tue, 26 Jul 2005, Zhang Hui wrote:
> I have written a NIC monitoring function using MII, but it does not
> work; even when I use the mii-tool command in the shell it says "mii
> interface, operation not supported."
mii-tool or mii-diag require the presence of some MII-specific ioctls
in each network driver. Some drivers have these ioctls, some don't.
More recently (2.6 kernels), the MII-specific ioctls seem to have fallen
out of fashion, being replaced by ethtool ones, which are used by the
"ethtool" utility, available from:
http://sourceforge.net/projects/gkernel/
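For example, either of these works from a script, provided the driver
implements the corresponding ioctls:
ethtool eth0 | grep 'Link detected'    # ethtool ioctls
mii-tool eth0                          # older MII ioctls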
> Later I found that some NICs do not support MII.
Most NICs these days do support MII; what might be missing is Linux
driver support for making MII data available to user tools.
> But it's not over, I still need the monitoring function, very
> urgently.
I'm sure that there are companies that can offer such services for a
fee and can deliver whatever you need "very urgently"...
> So can anyone tell me whether there is some other, better method to
> monitor the NIC?
MII data is THE way of monitoring the link. If you are interested in
aspects other than the link, then this is very much NIC- and
driver-specific, and should probably be asked on the Linux network
developers mailing list (netdev at vger dot kernel dot org).
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 27 13:39:06 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 27 Jul 2005 10:39:06 -0700
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
References: <200507141900.j6EJ084D017387@bluewest.scyld.com>
Message-ID: <20050727173905.GA1966@greglaptop>
On Wed, Jul 27, 2005 at 12:43:32PM -0400, Steve Cousins wrote:
> We are considering getting a Dual Dual-Core Opteron system vs. two Dual
> Opteron systems.
The dual-core system should be cheaper. But the two dual Opterons will
have a faster clock and twice as much memory bandwidth. So depending on
how much your code likes memory bandwidth...
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 27 14:09:32 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 27 Jul 2005 20:09:32 +0200
Subject: [Beowulf] Re: Opteron 275 performance
Message-ID: <3.0.32.20050727200930.0129cc58@pop3.xs4all.nl>
Steve,
Things depend on what your software is sensitive to.
If that's CPU speed, it will run great on all these systems,
and you can skip the lines below.
If the most important issue is memory latency, then read on:
Memory latency, when all 4 processors are busy with their own
TLB-thrashing job in their own memory, is about 200 ns on the Opteron
(1GB of RAM).
The Origin 3000 series, when testing just 1 CPU against its own memory,
has a memory latency of 280 ns.
So your jobs will do just fine, provided memory latency is the issue
for you.
Please note that at 16 CPUs of an Altix 3000/3800 the latency to RAM,
when taking 250MB of RAM per CPU, grows to around 700 ns,
whereas a quad Opteron with 1.8GHz dual-core CPUs has 234 ns latency
(250MB of RAM per CPU).
We are running a plain Ubuntu distribution and only upgraded its
kernel to the latest default SMP kernel Ubuntu compiled for AMD64,
which is, by the way:
diep at ubuntu:/egtb$ uname -a
Linux ubuntu 2.6.10-5-amd64-k8-smp #1 SMP Fri Jun 24 17:23:48 UTC 2005
x86_64 GNU/Linux
In reality it's a NUMA kernel, so "SMP" is a confusing name for it.
I have to add that single-CPU latency to memory from the Opteron's
SINGLE core is a LOT better. I measure 111 ns for a single CPU on a
dual Opteron 2.2GHz single-core with the same Ubuntu and kernel
installed.
So if your only worry is TLB-thrashing main memory, then 2 dual-Opteron
machines will outperform anything, thanks to the memory latency.
On the other hand, if only the speed of the CPU matters, then there is
nothing to worry about: a quad Opteron dual-core 2.2GHz will simply
outperform an 8-processor Itanium 2 for the average application.
Scaling of the quad Opteron dual-core for an 8-CPU job will be a tad
less than on the Itanium 2: something like 7.80 versus 8.0 for the
Altix 3200.
Yet the nps for Diep at the 1.5GHz Itanium 2 is around 800k,
versus 1+ million at a quad Opteron dual-core 1.8GHz.
At 12:43 PM 7/27/2005 -0400, Steve Cousins wrote:
>
>On Thu, 14 Jul 2005 11:25:12 +0100 Igor Kozin wrote:
>
>> But now, for 4 cores/2 CPUs per Opteron node, to force the use of
>> > only 2 cores (from 4), 1 per chip, we'll need to have
>> > cpu affinity support in Linux.
>>
>> Mikhail,
>> you can use "taskset" for that purpose.
>> For example, (perhaps not in the most elegant form)
>> mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
>> But I doubt you want to let the idle cores to do something else
>> in the mean time. However small you will generally see an increase
>> in performance if you use all the cores.
>
>We are considering getting a Dual Dual-Core Opteron system vs. two Dual
>Opteron systems. We like the ability to use all four cores on one model
>but a lot of what we'll do is have two models running at the same time,
>each using two cores.
>
>We are worried that running two models on one system with four cores (each
>model using two cores) will not work as well as using two systems, each
>with two cores/CPUs. Is this what you were referring to (Igor) when you
>wrote:
>
>> But I doubt you want to let the idle cores to do something else
>> in the mean time.
>
>We have an 8 CPU SGI Origin 3200 that has no problem doing this sort of
>thing. I'm just curious what the implications are of doing this with the
>Dual Core Opteron cpu's.
>
>Thanks,
>
>Steve
>______________________________________________________________________
> Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
> Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
> Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
>
>
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 27 15:08:16 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 27 Jul 2005 12:08:16 -0700
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To: <3.0.32.20050727200930.0129cc58@pop3.xs4all.nl>
References: <3.0.32.20050727200930.0129cc58@pop3.xs4all.nl>
Message-ID: <20050727190816.GA2400@greglaptop>
On Wed, Jul 27, 2005 at 08:09:32PM +0200, Vincent Diepeveen wrote:
> Steve,
> Things depend upon what your software is sensitive to.
Vincent,
I know you think everything is a nail because you like hammers, but if
Steve's problem doesn't thrash the TLB, your entire email is not
relevant. Few programs thrash the TLB. There are some programs
sensitive to memory latency, but you only consider memory latency with
TLB thrashing, which is much less common.
In short: this probably isn't a nail.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 27 16:58:53 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 27 Jul 2005 22:58:53 +0200
Subject: [Beowulf] Re: Opteron 275 performance
Message-ID: <3.0.32.20050727225847.0129cc58@pop3.xs4all.nl>
In short, you didn't even read the FIRST TWO lines of the email I shipped.
quote: "Steve,
Things depend upon what your software is sensitive to.
If that's cpu speed, it will run at all these systems great,
and you can skip the below lines i wrote."
This tells a lot about you. Really a lot. It isn't positive.
At 12:08 PM 7/27/2005 -0700, Greg Lindahl wrote:
>On Wed, Jul 27, 2005 at 08:09:32PM +0200, Vincent Diepeveen wrote:
>
>> Steve,
>> Things depend upon what your software is sensitive to.
>
>Vincent,
>
>I know you think everthing is a nail because you like hammers, but if
>Steve's problem doesn't thrash the TLB, your entire email is not
>relevant. Few programs thrash the TLB. There are some programs
>sensitive to memory latency, but you only consider memory latency with
>TLB thrashing, which is much less common.
>
>In short: this probably isn't a nail.
>
>-- greg
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From diep at xs4all.nl Wed Jul 27 17:12:54 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 27 Jul 2005 23:12:54 +0200
Subject: [Beowulf] Opteron memory rank limits with DDR-400
Message-ID: <3.0.32.20050727231252.0129cc58@pop3.xs4all.nl>
Quad Opteron dual-core 1.8GHz.
dmesg gives:
"AMD Opteron(tm) Processor 865 stepping 00"
All 16 banks are filled with 256MB registered+ECC PC3200 memory.
How do I check what clock the memory runs at?
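One way, assuming the BIOS fills the DMI table sanely (not a given, as
noted elsewhere on this list), is to read the programmed DIMM speed
from the SMBIOS tables:
dmidecode --type 17 | grep -i speed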
Latency timings as measured with 250MB of RAM per CPU (so that's 2GB
with 8 cores):
1 cpu  : 144-147 ns
2 cpus : 174 ns
4 cpus : 206 ns
8 cpus : 234 ns
To test it with this program, do:
gcc -O2 -o lat latencylinux.c
./lat 250000000     // single cpu eating 250MB
./lat 250000000 2   // dual, eating 500MB in total
./lat 250000000 4   // quad
./lat 250000000 8   // 8 cpus
etc.
Confirmed working up to 500 CPUs.
At 10:26 AM 7/27/2005 -0600, Josip Loncaric wrote:
>Hello,
>
>Can anyone confirm that Opteron processors Rev. E and later can operate
>four dual-rank 2GB memory modules (8 ranks total) at full DDR-400 speed?
>
>AMD used to recommend no more than 4 ranks of DDR-400 memory. See
>http://forums.amd.com/lofiversion/index.php/t39745.html where the
>relevant quote from AMD technical service reads:
>
>"AMD does recommend to downclock the memory of the system to 333MHz,
>if more than 4 ranks is used in the DIMM slots. What this means is
>that only 2 sticks of 2 rank memory is recommended to run at the full
>400MHz or 4 sticks of 1 rank memory. There is a memory timing issue
>with more than 4 ranks of memory, which is a limitation of the memory
>controller on the Opteron chips."
>
>In the past, this downclocking was automatically enforced by some
>BIOSes, but supposedly there is no need to do so with currently shipping
>Opteron Rev. E and later, provided that the motherboard also allows full
>8 ranks at DDR-400.
>
>I'd just like to be sure... Also, has anyone observed increased memory
>latency with dual-rank modules?
>
>Sincerely,
>Josip
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
-------------- next part --------------
/*-----------------10-6-2003 3:48-------------------*
*
* This program rasml.c measures the Random Average Shared Memory Latency (RASML)
* Thanks to Agner Fog for his excellent random number generator.
*
* This testset is using a 64 bits optimized RNG of Agner Fog's ranrot generator.
*
* Created by Vincent Diepeveen who hereby releases this under GPL
* Feel free to look at the FSF (free software foundation) for what
* GPL is and its conditions.
*
* Please don't confuse the times achieved here with two times the one
* way pingpong latency, though at
* ideal scaling supercomputers/clusters they will be close. There is a few
* differences:
* a) this is TLB trashing
* b) this test tests ALL processors at the same time and not
* just 2 cpu's while the rest of the entire cluster is idle.
* c) this test ships 8 bytes whereas one way pingpong typical also
* gets used to test several kilobyte sizes, or just returns a pong.
* d) this doesn't use MPI but shared memory and the way such protocols are
* implemented matters possibly for latency.
*
* Vincent Diepeveen diep at xs4all.nl
* Veenendaal, The Netherlands 10 june 2003
*
* First a few lines about the random number generator. Note that I modified Agner Fog's
* RanRot very slightly. Basically its initialization has been done better and some dead
* slow FPU code rewritten to fast 64 bits integer code.
*/
#define UNIX 1 /* set to 1 when you are under unix or using gcc or look-alike compilers */
#define IRIX 1 /* this value only matters when UNIX is set to 1. For Linux set to 0.
* Basically, allocating shared memory in linux is done in a pretty buggy
* way by its kernel.
*
* Therefore you might want to do 'cat /proc/sys/kernel/shmmax'
* and look for yourself how much shared memory YOU can allocate in linux.
*
* If that is not enough to benchmark this program then try modifying it with:
* echo <bytes> > /proc/sys/kernel/shmmax
* Be sure you are root when doing that each time the system boots.
*/
#define FREEBSD 0 // be sure to not use more than 2 GB memory with freebsd with this test. sorry.
#if UNIX
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <unistd.h>
#else
#include <windows.h>
#include <winbase.h> // for GetTickCount()
#include <process.h> // _spawnl
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define SWITCHTIME 60000 /* in milliseconds. Modify this to let a test run longer or shorter.
* basically it is a good idea to use about the cpu number times
* thousand for this. 30 seconds is fine for PC's, but a very
* bad idea for supercomputers. I recommend several minutes
* there, and at least a few hours for big supers if the partition isn't started yet.
* If the partition is started, starting it at 460 processors (SGI) should
* take 10 minutes; otherwise it takes 3 hours to attach all.
* Of course that lets a test take way, way longer.
*/
#define MAXPROCESSES 512 /* this test can go up to this amount of processes to be tested */
#define CACHELINELENGTH 128 /* cache line length at the machine. Modify this if you want to */
#if UNIX
#include <unistd.h>
#define FORCEINLINE __inline
/* UNIX and such this is 64 bits unsigned variable: */
#define BITBOARD unsigned long long
#else
#define FORCEINLINE __forceinline
/* in WINDOWS we also want to be 64 bits: */
#define BITBOARD unsigned _int64
#endif
#define STATUS_NOTSTARTED 0
#define STATUS_ATTACH 1
#define STATUS_GOATTACH 2
#define STATUS_ATTACHED 3
#define STATUS_STARTREAD 4
#define STATUS_READ 5
#define STATUS_MEASUREREAD 6
#define STATUS_MEASUREDREAD 7
#define STATUS_QUIT 10
struct ProcessState {
volatile int status; /* 0 = not started yet
* 1 = ready to start reading
*
* 10 = quitted
* */
/* now the numbers each cpu gathers. The name of the first number is what
* cpu0 is doing and the second name what all the other cpu's were doing at that
* time
*/
volatile BITBOARD readread; /* */
char dummycacheline[CACHELINELENGTH];
};
typedef struct {
BITBOARD nentries; // number of entries of 64 bits used for cache.
struct ProcessState ps[MAXPROCESSES];
} GlobalTree;
void RanrotAInit(void);
float ToNano(BITBOARD);
int GetClock(void);
float TimeRandom(void);
void ParseBuffer(BITBOARD);
void ClearHash(void);
void DeAllocate(void);
int DoNrng(BITBOARD);
int DoNreads(BITBOARD);
int DoNreadwrites(BITBOARD);
//void TestLatency(float);
int AllocateTree(void);
void InitTree(int);
void WaitForStatus(int,int);
void PutStatus(int,int);
int CheckStatus(int,int);
int CheckAllStatus(int,int);
void Slapen(int);
float LoopRandom(void);
/* define parameters (R1 and R2 must be smaller than the integer size): */
#define KK 17
#define JJ 10
#define R1 5
#define R2 3
/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
/*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
};
int r_p1, r_p2; /* indexes into history buffer */
/* global variables RASML */
BITBOARD *hashtable[MAXPROCESSES],nentries,globaldummy=0;
GlobalTree *tree;
int ProcessNumber,
cpus; // number of processes for this test
#if UNIX
int shm_tree,shm_hash[MAXPROCESSES];
#endif
char rasmexename[2048];
/******************************************************** AgF 1999-03-03 *
* Random Number generator 'RANROT' type B *
* by Agner Fog *
* *
* This is a lagged-Fibonacci type of random number generator with *
* rotation of bits. The algorithm is: *
* X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b *
* *
* The last k values of X are stored in a circular buffer named *
* randbuffer. *
* *
* This version works with any integer size: 16, 32, 64 bits etc. *
* The integers must be unsigned. The resolution depends on the integer *
* size. *
* *
* Note that the function RanrotAInit must be called before the first *
* call to RanrotA or iRanrotA *
* *
* The theory of the RANROT type of generators is described at *
* www.agner.org/random/ranrot.htm *
* *
*************************************************************************/
FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return((x<<r)|(x>>(64-r)));}
/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
/* generate next random number */
BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
/* rotate list pointers */
if( --r_p1 < 0)
r_p1 = KK - 1;
if( --r_p2 < 0 )
r_p2 = KK - 1;
return x;
}
/* this function initializes the random number generator. */
void RanrotAInit(void) {
int i;
/* one can fill the randbuffer with other values here */
randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);
/* initialize pointers to circular buffer */
r_p1 = 0;
r_p2 = JJ;
/* randomize */
for( i = 0; i < 300; i++ )
(void)RanrotA();
}
/* Now the RASML code */
char *To64(BITBOARD x) {
static char buf[256];
char *sb;
sb = &buf[0];
#if UNIX
sprintf(buf,"%llu",x);
#else
sprintf(buf,"%I64u",x);
#endif
return sb;
}
int GetClock(void) {
/* The accuracy is measured in milliseconds. The function used is very accurate according
* to the NT team, way more accurate nowadays than mentioned in the MSDN manual. The accuracy
* for linux or unix we can only guess. Too many experts there.
*/
#if UNIX
struct timeval timeval;
struct timezone timezone;
gettimeofday(&timeval, &timezone);
return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
#else
return((int)GetTickCount());
#endif
}
float ToNano(BITBOARD nps) {
/* convert something from times a second to nanoseconds.
* NOTE THAT THERE ARE SOMETIMES COMPILER BUGS IN OLD COMPILERS
* SO THAT'S WHY MY CODE ISN'T A 1 LINE RETURN HERE. PLEASE DO
* NOT MODIFY THIS CODE */
float tn;
tn = 1000000000/(float)nps;
return tn;
}
float TimeRandom(void) {
/* timing the random number generator is very easy of course. Returns
* the number of random numbers per second that can be generated
*/
BITBOARD bb=0,i,value,nps;
float ns_rng;
int t1,t2,took;
printf("Benchmarking Pseudo Random Number Generator speed, RanRot type 'B'!\n");
printf("Speed depends upon CPU and compile options from RASML,\n therefore we benchmark the RNG\n");
printf("Please wait a few seconds.. "); fflush(stdout);
value = 100000;
took = 0;
while( took < 3000 ) {
value <<= 2; // x4
t1 = GetClock();
for( i = 0; i < value; i++ ) {
bb ^= RanrotA();
}
t2 = GetClock();
took = t2-t1;
}
nps = (1000*value)/(BITBOARD)took;
#if UNIX
printf("..took %i milliseconds to generate %llu numbers\n",took,value);
printf("Speed of RNG = %llu numbers a second\n",nps);
#else
printf("..took %i milliseconds to generate %I64 numbers\n",took,value);
printf("Speed of RNG = %I64u numbers a second\n",nps);
#endif
ns_rng = ToNano(nps);
printf("So 1 RNG call takes %f nanoseconds\n",ns_rng);
return ns_rng;
}
void ParseBuffer(BITBOARD nbytes) {
tree->nentries = nbytes/sizeof(BITBOARD);
#if UNIX
printf("Trying to allocate %llu entries. ",tree->nentries);
printf("In total %llu bytes\n",tree->nentries*(BITBOARD)sizeof(BITBOARD));
#else
printf("Trying to allocate %s entries. ",To64(tree->nentries));
printf("In total %s bytes\n",To64(tree->nentries*(BITBOARD)sizeof(BITBOARD)));
#endif
}
void ClearHash(void) {
BITBOARD *hi,i,nentries = tree->nentries;
/* clearing hashtable */
printf("Clearing hashtable for processor %i\n",ProcessNumber);
fflush(stdout);
hi = hashtable[ProcessNumber];
for( i = 0 ; i < nentries ; i++ ) /* very unoptimized way of clearing */
hi[i] = i;
}
void DeAllocate(void) {
int i;
#if UNIX
shmctl(shm_tree,IPC_RMID,0);
for( i = 0; i < cpus; i++ ) {
shmctl(shm_hash[i],IPC_RMID,0);
}
#else
UnmapViewOfFile(tree);
for( i = 0; i < cpus; i++ ) {
UnmapViewOfFile(hashtable[i]);
}
#endif
}
int DoNrng(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2,ncpu;
ncpu = cpus;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD rani=RanrotA(),index=rani%nents;
unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
dummyres ^= (index+(BITBOARD)i2);
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
int DoNreads(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2,ncpu;
ncpu = cpus;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD rani=RanrotA(),index=rani%nents;
unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
dummyres ^= hashtable[i2][index];
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
#if 0
int DoNreadwrites(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD index = RanrotA()%nents;
dummyres ^= hashtable[ProcessNumber][index];
hashtable[ProcessNumber][index] = dummyres;
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
void TestLatency(float ns_rng) {
BITBOARD n,nps_read,nps_rw,nps_rng;
float ns,fns;
int timetaken;
printf("Doing random RNG test. Please wait..\n");
n = 50000000; // 50 mln
timetaken = DoNrng(n);
nps_rng = (1000*n) / (BITBOARD)timetaken;
fns = ToNano(nps_rng);
printf("Machine needs %f ns for RND loop\n",fns);
/* READING SINGLE CPU RANDOM ENTRIES */
printf("Doing random read tests single cpu. Please wait..\n");
n = 100000000; // 100 mln
timetaken = DoNreads(n);
nps_read = (1000*n) / (BITBOARD)timetaken;
ns = ToNano(nps_read);
printf("Machine needs %f ns for single cpu random reads.\nExtrapolated=%f nanoseconds a read\n",ns,ns-fns);
/* READING AND THEN WRITING SINGLE CPU RANDOM ENTRIES */
printf("Doing random readwrite tests single cpu. Please wait..\n");
n = 100000000; // 100 mln
timetaken = DoNreadwrites(n);
nps_rw = (1000*n) / (BITBOARD)timetaken;
ns = ToNano(nps_rw);
printf("Machine needs %f ns for single cpu random readwrites.\n",ns);
printf("Extrapolated=%f nanoseconds a readwrite (to the same slot)\n\n",ns-fns);
printf("So far the useless tests.\nBut we have vague read/write nodes a second numbers now\n");
}
#endif
int AllocateTree(void) { /* initialize the tree. returns 0 if error */
#if UNIX
shm_tree = shmget(
ftok(".",'t'),
sizeof(GlobalTree),IPC_CREAT|0777);
if( shm_tree == -1 )
return 0;
tree = (GlobalTree *)shmat(shm_tree,0,0);
if( tree == (GlobalTree *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
if( !ProcessNumber ) {
HANDLE TreeFileMap;
TreeFileMap = CreateFileMapping((HANDLE)0xFFFFFFFF,NULL,PAGE_READWRITE,0,
(DWORD)sizeof(GlobalTree),"RASM_Tree");
if( TreeFileMap == NULL )
return 0;
tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( tree == NULL )
return 0;
}
else { /* Slaves attach also try to attach to the tree */
HANDLE TreeFileMap;
TreeFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Tree");
if( TreeFileMap == NULL )
return 0;
tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( tree == NULL )
return 0;
}
#endif
return 1;
}
int AttachAll(void) {
#if UNIX
#else
HANDLE HashFileMap;
#endif
char hashname2[32] = {"RASM_Hash00"},hashname[32];
int i,r;
for( r = 0; r < cpus; r++ ) {
i = ProcessNumber+r;
i %= cpus;
if( i == ProcessNumber )
continue;
#if UNIX
shm_hash[i] = shmget(
#if IRIX
ftok(".",200+i),
#else
ftok(".",(char)i),
#endif
tree->nentries*8,IPC_CREAT|0777);
if( shm_hash[i] == -1 )
return 0;
hashtable[i] = (BITBOARD *)shmat(shm_hash[i],0,0);
if( hashtable[i] == (BITBOARD *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
strcpy(hashname,hashname2);
hashname[9] += (i/10);
hashname[10] += (i%10);
HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,hashname);
if( HashFileMap == NULL )
return 0;
hashtable[i] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[i] == NULL )
return 0;
#endif
}
return 1;
}
int AllocateHash(void) { /* initialize the hashtable (cache). returns 0 if error */
char hashname[32] = {"RASM_Hash00"};
#if UNIX
shm_hash[ProcessNumber] = shmget(
#if IRIX
ftok(".",200+ProcessNumber),
#else
ftok(".",(char)ProcessNumber),
#endif
tree->nentries*8,IPC_CREAT|0777);
if( shm_hash[ProcessNumber] == -1 )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)shmat(shm_hash[ProcessNumber],0,0);
if( hashtable[ProcessNumber] == (BITBOARD *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
//if( !ProcessNumber ) {
HANDLE HashFileMap;
hashname[9] += (ProcessNumber/10);
hashname[10] += (ProcessNumber%10);
HashFileMap = CreateFileMapping((HANDLE)0xFFFFFFFF,NULL,PAGE_READWRITE,0,
(DWORD)tree->nentries*8,hashname);
if( HashFileMap == NULL )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[ProcessNumber] == NULL )
return 0;
//}
//else { /* Slaves attach also try to attach to the tree */
/* HANDLE HashFileMap;
HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Hash");
if( HashFileMap == NULL )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[ProcessNumber] == NULL )
return 0;*/
//}
#endif
return 1;
}
int StartProcesses(int ncpus) {
char buf[256];
int i;
/* returns 1 if ncpus-1 started ok */
if( ncpus == 1 )
return 1;
for( i = 1 ; i < ncpus ; i++ ) {
sprintf(buf,"%i_%i",i+1,ncpus);
#if UNIX
if( !fork() )
execl(rasmexename,rasmexename,buf,NULL);
#else
(void)_spawnl(_P_NOWAIT,rasmexename,rasmexename,buf,NULL);
#endif
}
return 1;
}
void InitTree(int ncpus) {
int i;
for( i = 0 ; i < ncpus ; i++ ) {
tree->ps[i].status = STATUS_NOTSTARTED;
tree->ps[i].readread = 0;
}
}
void WaitForStatus(int ncpus,int waitforstate) {
/* wait for all processors to have the same state */
int i,badluck=1;
while( badluck ) {
badluck = 0;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != waitforstate )
badluck = 1;
}
}
}
void PutStatus(int ncpus,int statenew) {
int i;
for( i = 0 ; i < ncpus ; i++ ) {
tree->ps[i].status = statenew;
}
}
int CheckStatus(int ncpus,int statenew) {
/* returns false when not all cpu's are in the new state */
int i;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != statenew )
return 0;
}
return 1;
}
int CheckAllStatus(int ncpus,int status) {
/* Tries with a single loop to determine whether the other cpu's also finished
*
* returns:
* true ==> when all the processes have this status
* false ==> when 1 or more are still busy measuring
*/
int i,badluck=1;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != status ) {
badluck = 0;
break;
}
}
return badluck;
}
void Slapen(int ms) {
#if UNIX
usleep(ms*1000); /* argument to usleep is in microseconds */
#else
Sleep(ms); /* argument to Sleep is in milliseconds */
#endif
}
float LoopRandom(void) {
BITBOARD n,nps_rng;
float fns;
int timetaken;
printf("Benchmarking random RNG test. Please wait..\n");
n = 25000000; // 25 mln
timetaken = 0;
while( timetaken < 500 ) {
n += n;
timetaken = DoNrng(n);
}
printf("timetaken=%i\n",timetaken);
nps_rng = (1000*n) / (BITBOARD)timetaken;
fns = ToNano(nps_rng);
printf("Machine needs %f ns for RND loop\n",fns);
return fns;
}
/* Example showing how to use the random number generator: */
int main(int argc,char *argv[]) {
/* allocate a big memory buffer parameter is in bytes.
* don't hesitate to MODIFY this to how many gigabytes
* you want to try.
* The more the better i keep saying to myself.
*
* Note that under linux your maximum shared memory limit can be set with:
*
* echo <bytes> > /proc/sys/kernel/shmmax
*
* and under IRIX it is usually 80% from the total RAM onboard that can get allocated
*/
BITBOARD nbytes,firstguess;
float ns_rng,f_loop;
int tottimes,t1,t2;
if( argc <= 1 ) {
printf("Latency test usage is: latency \n");
printf("Where 'buffer' is the buffer in number of bytes to allocate PRO PROCESSOR\n");
printf("and where 'cpus' is the number of processes that this test will try to use (1 = default) \n");
return 1;
}
/* parse the input */
nbytes = 0;
cpus = 1; // default
if( strchr(argv[1],'_') == NULL ) { /* main startup process */
int np = 0;
#if UNIX
#if FREEBSD
nbytes = (BITBOARD)atoi(argv[1]); // freebsd doesn't support > 2 GB memory
#else
nbytes = (BITBOARD)atoll(argv[1]);
#endif
#else
nbytes = (BITBOARD)_atoi64(argv[1]);
#endif
printf("Welcome to RASM Latency!\n");
printf("RASML measures the RANDOM AVERAGE SHARED MEMORY LATENCY!\n\n");
if( argc > 2 ) {
cpus = 0;
do {
cpus *= 10;
cpus += (int)(argv[2][np]-'1')+1;
np++;
} while( argv[2][np] >= '0' && argv[2][np] <= '9' );
}
//printf("Master: buffer = %s bytes. #CPUs = %i\n",To64(nbytes),cpus);
ProcessNumber = 0;
/* check whether we are not getting out of bounds */
if( cpus > MAXPROCESSES ) {
printf("Error: Recompile with a bigger stack for MAXPROCESSES. %i processors is too much\n",cpus);
return 1;
}
/* find out the file name */
#if UNIX
strcpy(rasmexename,argv[0]);
#else
GetModuleFileName(NULL,rasmexename,2044);
#endif
printf("Stored in rasmexename = %s\n",rasmexename);
}
else { // latency 2_452 ==> means processor 2 out of 452.
int np = 0;
ProcessNumber = 0;
do {
ProcessNumber *= 10;
ProcessNumber += (argv[1][np]-'1')+1; // n
np++;
} while( argv[1][np] >= '0' && argv[1][np] <= '9' );
ProcessNumber--; // 1 less because of ProcessNumber ==> [0..n-1]
np++; // skip underscore
cpus = 0;
do {
cpus *= 10;
cpus += (argv[1][np]-'1')+1; // n
np++;
} while( argv[1][np] >= '0' && argv[1][np] <= '9' );
//printf("Slave: ProcessNumber=%i cpus=%i\n",ProcessNumber,cpus);
}
/* first we setup the random number generator. */
RanrotAInit();
/* initialize shared memory tree; it gets used for communication between the processes */
if( !AllocateTree() ) {
printf("Error: ProcessNumber %i could not allocate the tree\n",ProcessNumber);
return 1;
}
if( !ProcessNumber )
ParseBuffer(nbytes);
nentries = tree->nentries;
/* Now some stuff only the Master has to do */
if( !ProcessNumber ) {
/* Master: now let's time the pseudo random generators speed in nanoseconds a call */
ns_rng = TimeRandom();
f_loop = LoopRandom();
printf("Trying to Allocate Buffer\n");
t1 = GetClock();
if( !AllocateHash() ) {
printf("Error: Could not allocate buffer!\n");
return 1;
}
t2 = GetClock();
printf("Took %i.%03i seconds to allocate Hash\n",(t2-t1)/1000,(t2-t1)%1000);
ClearHash(); // local hash
t1 = GetClock();
printf("Took %i.%03i seconds to clear Hash\n",(t1-t2)/1000,(t1-t2)%1000);
/* so now hashtable is setup and we know quite some stuff. So it is time to
* start all other processes */
InitTree(cpus);
printf("Starting Other processes\n");
t1 = GetClock();
if( !StartProcesses(cpus) ) {
printf("Error: Could not start processes\n");
DeAllocate();
}
t2 = GetClock();
printf("Took %i milliseconds to start %i additional processes\n",t2-t1,cpus-1);
t1 = GetClock();
}
else { /* all Slaves do this */
if( !AllocateHash() ) {
printf("Error: slave %i Could not allocate buffer!\n",ProcessNumber);
return 1;
}
ClearHash(); // local hash
}
tree->ps[ProcessNumber].status = STATUS_ATTACH;
if( ! ProcessNumber ) {
WaitForStatus(cpus,STATUS_ATTACH);
t2 = GetClock();
printf("Took %i milliseconds to synchronize %i additional processes\n",t2-t1,cpus-1);
t1 = GetClock();
/* now we can continue with the next phase that is attaching all the segments */
PutStatus(cpus,STATUS_GOATTACH);
}
else {
while( tree->ps[ProcessNumber].status == STATUS_ATTACH ) {
Slapen(500);
}
}
if( !AttachAll() ) {
printf("Error: process %i Could not attach correctly!\n",ProcessNumber);
return 1;
}
tree->ps[ProcessNumber].status = STATUS_ATTACHED;
if( ! ProcessNumber ) {
WaitForStatus(cpus,STATUS_ATTACHED);
t2 = GetClock();
printf("Took %i milliseconds to ATTACH. %llu total RAM\n",t2-t1,(BITBOARD)cpus*tree->nentries*8);
PutStatus(cpus,STATUS_STARTREAD);
printf("Read latency measurement STARTS NOW using steps of 2 * %i.%03i seconds :\n",
(SWITCHTIME/1000),(SWITCHTIME%1000));
}
else {
while( tree->ps[ProcessNumber].status == STATUS_ATTACHED ) {
Slapen(500);
}
}
tree->ps[ProcessNumber].status = STATUS_READ;
firstguess = 200000;
tottimes = 0;
for( ;; ) {
int timetaken = 0;
if( tree->ps[ProcessNumber].status == STATUS_MEASUREREAD ) {
/* this really MEASURES the readread */
BITBOARD ntried = 0,avnumber;
int totaltime=0;
while( totaltime < SWITCHTIME ) { /* go measure around switchtime seconds */
totaltime += DoNreads(firstguess);
ntried += firstguess;
}
/* now put the average number of readreads into the shared memory */
avnumber = (ntried*1000) / (BITBOARD)totaltime;
tree->ps[ProcessNumber].readread = avnumber;
/* show that it is finished */
tree->ps[ProcessNumber].status = STATUS_MEASUREDREAD;
/* now keep doing the same thing until status gets modified */
while( tree->ps[ProcessNumber].status == STATUS_MEASUREDREAD ) {
(void)DoNreads(firstguess);
if( !ProcessNumber ) {
if( CheckAllStatus(cpus,STATUS_MEASUREDREAD) ) {
PutStatus(cpus,STATUS_QUIT);
break;
}
}
}
}
else if( tree->ps[ProcessNumber].status == STATUS_READ ) {
BITBOARD nextguess;
/* now software must try to determine how many reads a seconds are possible for that
* process
*/
//printf("proc=%i trying %s reads\n",ProcessNumber,To64(firstguess));
timetaken = DoNreads(firstguess);
/* try to guess such that next test takes 1 second, or if test was too inaccurate
* then double the number simply. also prevents divide by zero error ;)
*/
if( timetaken < 400 )
nextguess = firstguess*2;
else
nextguess = (firstguess*1000)/(BITBOARD)timetaken;
firstguess = nextguess;
if( !ProcessNumber ) {
tottimes += timetaken;
if( tottimes >= SWITCHTIME ) { // 30 seconds to a few minutes
tottimes = 0;
if( CheckStatus(cpus,STATUS_READ) ) {
PutStatus(cpus,STATUS_MEASUREREAD);
} /* waits another SWITCH time before starting to measure */
}
}
}
else if( tree->ps[ProcessNumber].status == STATUS_QUIT )
break;
}
/* now do the latency tests
*/
//TestLatency(ns_rng);
tree->ps[ProcessNumber].status = STATUS_QUIT;
if( !ProcessNumber ) {
BITBOARD averagereadread;
int i;
averagereadread = 0;
WaitForStatus(cpus,STATUS_QUIT);
printf("the raw output\n");
for( i = 0; i < cpus ; i++ ) {
BITBOARD tr=tree->ps[i].readread;
averagereadread += tr;
printf("%llu ",tr);
}
printf("\n");
averagereadread /= (BITBOARD)cpus;
printf("Raw Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread));
printf("Now for the final calculation it gets compensated:\n");
printf(" Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread)-f_loop);
}
DeAllocate();
return 0;
}
/* EOF latencyC.c */
-------------- next part --------------
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From cousins at limpet.umeoce.maine.edu Wed Jul 27 19:34:08 2005
From: cousins at limpet.umeoce.maine.edu (Steve Cousins)
Date: Wed, 27 Jul 2005 19:34:08 -0400 (EDT)
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
Message-ID:
On Wed, 27 Jul 2005, Ben Mayer wrote:
> Correct me if I am wrong but I am under the impression that each CPU
> has one memory controller, independent of the number of cores. So if
> you go dual core those cores are sharing a memory controller. What you
> have in the end if you go dual core with 2 CPUs is 4 cores with 2
> memory controllers.
Yes. This is the sort of thing I am worried about. I remember running
into problems with our Dual PIII cluster where there was a significant
difference in performance when we ran our models using 8 nodes using both
CPU's per node vs. 16 nodes using one CPU per node. Something like 40%
faster on the single CPU per node runs. I guess the only sure-fire way to
know is to get our hands on a Dual Core system and see how our model
performs. Anyone got one available for testing... ;^)
Steve
______________________________________________________________________
Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From cousins at limpet.umeoce.maine.edu Wed Jul 27 19:59:00 2005
From: cousins at limpet.umeoce.maine.edu (Steve Cousins)
Date: Wed, 27 Jul 2005 19:59:00 -0400 (EDT)
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
Message-ID:
On Wed, 27 Jul 2005, Joe Landman wrote:
> Hi Steve:
>
> Not knowing the details of your calculations might be an issue, but
> you can read about our experiences with a number of chemistry and
> informatics codes on dual core Opteron systems. See
> http://enterprise2.amd.com/downloadables/Dual_Core_Performance.pdf for
> more details.
>
> Joe
Hi Joe,
Thanks a lot. I just took a look and it seems to make a good case for
getting the Dual Dual Core machine.
I'm fairly certain that the memory latency issue that Vincent was warning
about won't be an issue, although I'm a bit clueless about how to know for
sure. How would I go about finding out if our model is TLB thrashing main
memory? I feel like I just bit the hook... I don't want to start a huge
discussion on this but if there are some quick tell-tale signs of it I'd
be interested to find out.
Thanks,
Steve
> Steve Cousins wrote:
> >> On Thu, 14 Jul 2005 11:25:12 +0100 Igor Kozin wrote:
> >>
> >>
> >>> But now for 4cores/2CPUs per Opteron node to force the using of
> >>>
> >>>>only 2 cores (from 4), by 1 for each chip, we'll need to have
> >>>>cpu affinity support in Linux.
> >>>
> >>>Mikhail,
> >>>you can use "taskset" for that purpose.
> >>>For example, (perhaps not in the most elegant form)
> >>> mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
> >>>But I doubt you want to let the idle cores to do something else
> >>>in the mean time. However small you will generally see an increase
> >>>in performance if you use all the cores.
> >>
> >>
> >> We are considering getting a Dual Dual-Core Opteron system vs. two Dual
> >> Opteron systems. We like the ability to use all four cores on one model
> >> but a lot of what we'll do is have two models running at the same time,
> >> each using two cores.
> >>
> >> We are worried that running two models on one system with four cores (each
> >> model using two cores) will not work as well as using two systems, each
> >> with two cores/cpu's. Is this what you were referring to (Igor) when you
> >> wrote:
> >>
> >>
> >>>But I doubt you want to let the idle cores to do something else
> >>>in the mean time.
> >>
> >>
> >> We have an 8 CPU SGI Origin 3200 that has no problem doing this sort of
> >> thing. I'm just curious what the implications are of doing this with the
> >> Dual Core Opteron cpu's.
> >>
> >> Thanks,
> >>
> >> Steve
> >> ______________________________________________________________________
> >> Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
> >> Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
> >> Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org
> >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> >
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 734 786 8452
> cell : +1 734 612 4615
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 27 21:36:14 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 27 Jul 2005 18:36:14 -0700
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To: <3.0.32.20050727225847.0129cc58@pop3.xs4all.nl>
References: <3.0.32.20050727225847.0129cc58@pop3.xs4all.nl>
Message-ID: <20050728013614.GA1202@greglaptop>
> In short you didn't even read the FIRST TWO lines of the email i shipped.
Vincent,
I did.
It's probably best to assume that people disagreeing with you are
reading your emails; it saves a lot of pointless emails like this one.
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at pathscale.com Wed Jul 27 21:32:01 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 27 Jul 2005 18:32:01 -0700
Subject: [Beowulf] Opteron memory rank limits with DDR-400
In-Reply-To: <3.0.32.20050727231252.0129cc58@pop3.xs4all.nl>
References: <3.0.32.20050727231252.0129cc58@pop3.xs4all.nl>
Message-ID: <20050728013201.GA2447@greglaptop>
On Wed, Jul 27, 2005 at 11:12:54PM +0200, Vincent Diepeveen wrote:
> How do i check what clock it runs the memory?
Use the lmbench lat_mem_rd program instead of your program, and
compare to the numbers reported by other people. You're just asking
for trouble when you measure several things at once, when all you want
to know is the clock that the RAM is running at, not anything about
TLB fills.
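For reference, a minimal sketch of the pointer-chasing idea lat_mem_rd is
built on (this is not the lmbench source; the real tool varies working-set
size and stride, and the 64 MB set, 128-byte stride and iteration count here
are arbitrary choices):
/* Pointer-chase latency probe: every load depends on the previous one,
 * so the loop time is dominated by memory latency, not bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main(void) {
    size_t n = 64 * 1024 * 1024 / sizeof(void *); /* 64 MB working set */
    size_t stride = 128 / sizeof(void *);         /* one cache line apart */
    size_t i;
    void **chain = malloc(n * sizeof(void *));
    if (chain == NULL) return 1;
    for (i = 0; i < n; i++)                       /* simple strided cycle */
        chain[i] = &chain[(i + stride) % n];
    long k, iters = 20 * 1000 * 1000;
    void **p = chain;
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (k = 0; k < iters; k++)
        p = (void **)*p;                          /* serialized dependent loads */
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%.1f ns per load (checksum %p)\n", secs / iters * 1e9, (void *)p);
    free(chain);
    return 0;
}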
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From rgb at phy.duke.edu Wed Jul 27 22:54:35 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 27 Jul 2005 22:54:35 -0400
Subject: [Beowulf] Re: Opteron 275 performance
References:
Message-ID:
Steve Cousins writes:
>
> On Wed, 27 Jul 2005, Ben Mayer wrote:
>
>> Correct me if I am wrong but I am under the impression that each CPU
>> has one memory controller, independent of the number of cores. So if
>> you go dual core those cores are sharing a memory controller. What you
>> have in the end if you go dual core with 2 CPUs is 4 cores with 2
>> memory controllers.
>
> Yes. This is the sort of thing I am worried about. I remember running
> into problems with our Dual PIII cluster where there was a significant
> difference in performance when we ran our models using 8 nodes using both
> CPU's per node vs. 16 nodes using one CPU per node. Something like 40%
> faster on the single CPU per node runs. I guess the only sure-fire way to
> know is to get our hands on a Dual Core system and see how our model
> performs. Anyone got one available for testing... ;^)
Dead on the money on all accounts.
I'd be surprised if you don't get offers from several vendors. Michael
Will of Penguin is on list, for example. Most of the serious
linux/cluster vendors are happy enough to provide access to a system or
cluster for testing purposes, especially if they sense a sale at the end
of it. Or even a possible sale;-).
rgb
>
> Steve
> ______________________________________________________________________
> Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
> Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
> Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From rgb at phy.duke.edu Wed Jul 27 23:25:03 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 27 Jul 2005 23:25:03 -0400
Subject: [Beowulf] Re: Opteron 275 performance
References:
Message-ID:
Steve Cousins writes:
> On Wed, 27 Jul 2005, Joe Landman wrote:
>
>> Hi Steve:
>>
>> Not knowing the details of your calculations might be an issue, but
>> you can read about our experiences with a number of chemistry and
>> informatics codes on dual core Opteron systems. See
>> http://enterprise2.amd.com/downloadables/Dual_Core_Performance.pdf for
>> more details.
>>
>> Joe
>
> Hi Joe,
>
> Thanks a lot. I just took a look and it seems to make a good case for
> getting the Dual Dual Core machine.
>
> I'm fairly certain that the memory latency issue that Vincent was warning
> about won't be an issue, although I'm a bit clueless about how to know for
> sure. How would I go about finding out if our model is TLB thrashing main
> memory? I feel like I just bit the hook... I don't want to start a huge
> discussion on this but if there are some quick tell-tale signs of it I'd
> be interested to find out.
Why bother with tell-tale signs? Like I said, your previous post was
dead on the money. Get a loaner (which can physically be far far away
and should be "free"), install YOUR application and run the only
benchmark or test that matters.
On paper, the memory access schemes used by the Opterons should largely
ameliorate the kind of difficulty encountered with the dual PIII's --
they ought to do better than just divide single processor bandwidth
between two processors at any rate. You can visit the hypertransport
site and look at white papers, e.g. --
http://www.hypertransport.org/tech/tech_whitepapers.cfm
or look at multicore hype (with some useful info mixed in) here:
http://multicore.amd.com/WhatIsMC/
especially its generic description of DCA (Direct Connect Architecture).
The design was driven by the desire to reduce latency in shared access
situations; HT does this by a fairly complicated interleaving on a
request queue (as best I can tell). With a dual core dual CPU design in
particular you have to worry about connecting four distinct cores to
each other and to memory and to peripherals. This ultimately makes it
very difficult to predict whether any given application will scale the
way it "should" in an ideal universe.
I honestly think that the only way to be SURE your particular
application fits in the probably very broad category of applications
that can scale from one to four cores in nearly constant time is to try
it. Preferably at several mixes of program scales to "force" the
processors to interleave memory in all the ways it might ever need to in
the application. Eventually enough may be learned about the architecture so
that somebody can say "yeah, run the X benchmark, and if it does well so
will your application" but I think we aren't quite there yet.
rgb
>
> Thanks,
>
> Steve
>
>
>
>> Steve Cousins wrote:
>> >> On Thu, 14 Jul 2005 11:25:12 +0100 Igor Kozin wrote:
>> >>
>> >>
>> >>> But now for 4cores/2CPUs per Opteron node to force the using of
>> >>>
>> >>>>only 2 cores (from 4), by 1 for each chip, we'll need to have
>> >>>>cpu affinity support in Linux.
>> >>>
>> >>>Mikhail,
>> >>>you can use "taskset" for that purpose.
>> >>>For example, (perhaps not in the most elegant form)
>> >>> mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
>> >>>But I doubt you want to let the idle cores to do something else
>> >>>in the mean time. However small you will generally see an increase
>> >>>in performance if you use all the cores.
>> >>
>> >>
>> >> We are considering getting a Dual Dual-Core Opteron system vs. two Dual
>> >> Opteron systems. We like the ability to use all four cores on one model
>> >> but a lot of what we'll do is have two models running at the same time,
>> >> each using two cores.
>> >>
>> >> We are worried that running two models on one system with four cores (each
>> >> model using two cores) will not work as well as using two systems, each
>> >> with two cores/cpu's. Is this what you were referring to (Igor) when you
>> >> wrote:
>> >>
>> >>
>> >>>But I doubt you want to let the idle cores to do something else
>> >>>in the mean time.
>> >>
>> >>
>> >> We have an 8 CPU SGI Origin 3200 that has no problem doing this sort of
>> >> thing. I'm just curious what the implications are of doing this with the
>> >> Dual Core Opteron cpu's.
>> >>
>> >> Thanks,
>> >>
>> >> Steve
>> >> ______________________________________________________________________
>> >> Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
>> >> Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
>> >> Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
>> >>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Beowulf mailing list, Beowulf at beowulf.org
>> >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>> >
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: landman at scalableinformatics.com
>> web : http://www.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax : +1 734 786 8452
>> cell : +1 734 612 4615
>>
>>
>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From cousins at limpet.umeoce.maine.edu Thu Jul 28 09:24:45 2005
From: cousins at limpet.umeoce.maine.edu (Steve Cousins)
Date: Thu, 28 Jul 2005 09:24:45 -0400 (EDT)
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
Message-ID:
On Wed, 27 Jul 2005, Robert G. Brown wrote:
> Steve Cousins writes:
>
> >> I'm fairly certain that the memory latency issue that Vincent was warning
> >> about won't be an issue, although I'm a bit clueless about how to know for
> >> sure. How would I go about finding out if our model is TLB thrashing main
> >> memory? I feel like I just bit the hook... I don't want to start a huge
> >> discussion on this but if there are some quick tell-tale signs of it I'd
> >> be interested to find out.
> >
> Why bother with tell-tale signs? Like I said, your previous post was
> dead on the money. Get a loaner (which can physically be far far away
> and should be "free"), install YOUR application and run the only
> benchmark or test that matters.
Hi Robert,
I mostly agree. I was curious about how one goes about doing this, and if
I find out that our program doesn't scale well to four cores then I'd like
to know what the problem is. The first step is to just see though.
Hopefully it will scale well and it will be a moot point.
As you predicted, I have heard from a couple of vendors. We'll soon see
how it goes, and I'll send an update to the list when I find out.
Thanks everyone for your help.
Steve
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From landman at scalableinformatics.com Wed Jul 27 17:31:57 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 27 Jul 2005 17:31:57 -0400
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
References:
Message-ID: <42E7FD4D.1000507@scalableinformatics.com>
Hi Steve:
Not knowing the details of your calculations might be an issue, but
you can read about our experiences with a number of chemistry and
informatics codes on dual core Opteron systems. See
http://enterprise2.amd.com/downloadables/Dual_Core_Performance.pdf for
more details.
Joe
Steve Cousins wrote:
> On Thu, 14 Jul 2005 11:25:12 +0100 Igor Kozin wrote:
>
>
>> But now for 4cores/2CPUs per Opteron node to force the using of
>>
>>>only 2 cores (from 4), by 1 for each chip, we'll need to have
>>>cpu affinity support in Linux.
>>
>>Mikhail,
>>you can use "taskset" for that purpose.
>>For example, (perhaps not in the most elegant form)
>> mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
>>But I doubt you want to let the idle cores to do something else
>>in the mean time. However small you will generally see an increase
>>in performance if you use all the cores.
>
>
> We are considering getting a Dual Dual-Core Opteron system vs. two Dual
> Opteron systems. We like the ability to use all four cores on one model
> but a lot of what we'll do is have two models running at the same time,
> each using two cores.
>
> We are worried that running two models on one system with four cores (each
> model using two cores) will not work as well as using two systems, each
> with two cores/cpu's. Is this what you were referring to (Igor) when you
> wrote:
>
>
>>But I doubt you want to let the idle cores to do something else
>>in the mean time.
>
>
> We have an 8 CPU SGI Origin 3200 that has no problem doing this sort of
> thing. I'm just curious what the implications are of doing this with the
> Dual Core Opteron cpu's.
>
> Thanks,
>
> Steve
> ______________________________________________________________________
> Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
> Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
> Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From landman at scalableinformatics.com Wed Jul 27 23:40:41 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 27 Jul 2005 23:40:41 -0400
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
References:
Message-ID: <42E853B9.3050305@scalableinformatics.com>
Robert G. Brown wrote:
> Steve Cousins writes:
>
[...]
>> Hi Joe,
>>
>> Thanks a lot. I just took a look and it seems to make a good case for
>> getting the Dual Dual Core machine.
>> I'm fairly certain that the memory latency issue that Vincent was warning
>> about won't be an issue, although I'm a bit clueless about how to know
>> for
>> sure. How would I go about finding out if our model is TLB thrashing main
>> memory? I feel like I just bit the hook... I don't want to start a huge
>> discussion on this but if there are some quick tell-tale signs of it I'd
>> be interested to find out.
There are some good analysis tools out there you can use for setting up
and watching various processor counters. Have a look at
http://user.it.uu.se/~mikpe/linux/perfctr/ and the announcements
http://user.it.uu.se/~mikpe/linux/perfctr/current/ANNOUNCE-2.6.15 .
There are numerous others such as http://icl.cs.utk.edu/papi/ .
Oprofile is focused upon a slightly different performance measurement.
That said, the only thing that really will give you convincing data will
be running your code and seeing if it has performance anomalies. It's
when you have those anomalies that you need to start looking at why you
are having them. That's when it makes sense to start looking at
Perfctr etc.
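As a concrete sketch of the PAPI route (assuming PAPI is installed and that
the PAPI_TLB_DM preset is supported on the CPU -- check with papi_avail;
the workload function here is just a stand-in for the model's hot loop):
/* Count data TLB misses around a workload with PAPI. */
#include <stdio.h>
#include <papi.h>
static long a[1 << 22];                 /* 32 MB of longs: beyond TLB reach */
static void workload(void) {            /* stand-in for the real model loop */
    long i, s = 0;
    for (i = 0; i < (1 << 22); i++)     /* strided, TLB-unfriendly accesses */
        s += a[(i * 1021) & ((1 << 22) - 1)];
    a[0] = s;                           /* keep the result live */
}
int main(void) {
    int es = PAPI_NULL;
    long long misses = 0;
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    if (PAPI_create_eventset(&es) != PAPI_OK) return 1;
    if (PAPI_add_event(es, PAPI_TLB_DM) != PAPI_OK) {  /* data TLB miss preset */
        fprintf(stderr, "PAPI_TLB_DM not available on this CPU\n");
        return 1;
    }
    PAPI_start(es);
    workload();
    PAPI_stop(es, &misses);
    printf("data TLB misses: %lld\n", misses);
    return 0;
}
Running it once with a cache-friendly loop and once with the real access
pattern gives a quick feel for whether TLB misses are in play.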
> Why bother with tell-tale signs? Like I said, your previous post was
> dead on the money. Get a loaner (which can physically be far far away
> and should be "free"), install YOUR application and run the only
> benchmark or test that matters.
Agreed.
> On paper, the memory access schemes used by the Opterons should largely
> ameliorate the kind of difficulty encountered with the dual PIII's --
> they ought to do better than just divide single processor bandwidth
> between two processors at any rate. You can visit the hypertransport
> site and look at white papers, e.g. --
I am of the opinion that the proof is always in the pudding as it were.
That is, regardless of what the white papers (or benchmarks) say, your
code is going to beat on the processor in its preferred manner, which
may or may not mesh with what is in the benchmark, whitepaper, etc.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From i.kozin at dl.ac.uk Thu Jul 28 06:49:53 2005
From: i.kozin at dl.ac.uk (Kozin, I (Igor))
Date: Thu, 28 Jul 2005 11:49:53 +0100
Subject: [Beowulf] Re: Opteron 275 performance
Message-ID: <77673C9ECE12AB4791B5AC0A7BF40C8F0147D654@exchange02.fed.cclrc.ac.uk>
> http://multicore.amd.com/WhatIsMC/
That was fun. Thanks Robert! No music though :(
It seems there is an increasingly dominant opinion that the second
core is meant to run anti-virus protection software 24 hours a day.
> I honestly think that the only way to be SURE your particular
> application fits in the probably very broad category of applications
> that can scale from one to four cores in nearly constant time
> is to try it.
I absolutely and unequivocally agree with this.
> >> >> We are considering getting a Dual Dual-Core Opteron system vs. two Dual
> >> >> Opteron systems. We like the ability to use all four cores on one model
> >> >> but a lot of what we'll do is have two models running at the same time,
> >> >> each using two cores.
> >> >>
> >> >> We are worried that running two models on one system with four cores (each
> >> >> model using two cores) will not work as well as using two systems, each
> >> >> with two cores/cpu's. Is this what you were referring to (Igor) when you
> >> >> wrote:
> >> >>
> >> >>>But I doubt you want to let the idle cores to do something else
> >> >>>in the mean time.
> >> >>
> >> >> We have an 8 CPU SGI Origin 3200 that has no problem doing this sort of
> >> >> thing. I'm just curious what the implications are of doing this with the
> >> >> Dual Core Opteron cpu's.
In your Origin system a pair of cpus is connected to memory via a router chip.
If by "no problem" you mean that your applications run on 0,2,4,6 processors
as well as on 0,1,2,3 then the chances are good that a dual-core Opteron
will be fine too.
Dual dual-core will also give you a single system image which you are probably
used to on the Origin. E.g. Sun offers Opteron based workstations.
On the other hand two single-core duals will most likely have higher raw
performance for the same money (albeit higher electricity bill).
You can connect them using a crossover cable and enter the world of clustering!
> >> >> Thanks,
> >> >>
> >> >> Steve
> >> >>
Good luck,
Igor
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From bmayer at gmail.com Wed Jul 27 13:07:24 2005
From: bmayer at gmail.com (Ben Mayer)
Date: Wed, 27 Jul 2005 17:07:24 +0000
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To:
References: <200507141900.j6EJ084D017387@bluewest.scyld.com>
Message-ID:
> We are considering getting a Dual Dual-Core Opteron system vs. two Dual
> Opteron systems. We like the ability to use all four cores on one model
> but a lot of what we'll do is have two models running at the same time,
> each using two cores.
>
> We are worried that running two models on one system with four cores (each
> model using two cores) will not work as well as using two systems, each
> with two cores/cpu's. Is this what you were refering to (Igor) when you
> wrote:
>
Correct me if I am wrong but I am under the impression that each CPU
has one memory controller, independent of the number of cores. So if
you go dual core those cores are sharing a memory controller. What you
have in the end if you go dual core with 2 CPUs is 4 cores with 2
memory controllers.
If on the other hand you get two systems with two single-core CPUs each,
then you will have a total of 4 cores and 4 memory controllers.
So it would really depend on the price of the systems and on whether your
core needs the full memory bandwidth of a controller.
--
Benjamin Mayer
University of Minnesota
Ph.D. Student, Computer Science
HPC, Data Mining for Medical Informatics and Network Intrusion Detection
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kc40 at hw.ac.uk Wed Jul 27 15:33:04 2005
From: kc40 at hw.ac.uk (Cheng, Kevin )
Date: Wed, 27 Jul 2005 20:33:04 +0100
Subject: [Beowulf] Help with MPICH programming
Message-ID: <2C104E0B5F7AFE4CA6D266BE280EF72B23BBB8@ex1.mail.win.hw.ac.uk>
I've been looking and looking on the internet for documentation on programming in MPI but couldn't find any.
What I need to know is how to change the communication group for MPICH-1.2.1, basically so that when the master process 0 broadcasts a message it will only be heard by the processes in a specific communication group.
Many thanks
Kev
PS: Did anyone find out how to dynamically create/destroy MPI processes on-the-fly / in-real-time for MPICH-2?
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From turuncu at be.itu.edu.tr Thu Jul 28 04:42:02 2005
From: turuncu at be.itu.edu.tr (Ufuk Utku Turuncoglu)
Date: Thu, 28 Jul 2005 11:42:02 +0300
Subject: [Beowulf] top command problem in 2 and 4 way nodes !!!
Message-ID: <42E89A5A.706@be.itu.edu.tr>
hi,
I am trying to test the performance of the cluster. I am using the simple "top"
command to get load information for the CPUs. My problem is as follows:
when I run an OpenMP job on 2-way nodes (OMP_NUM_THREADS=2), the output
of "top" shows ~90% load on each CPU (cpu 0 and 1) and
overall CPU usage of ~180-190% for the job.
But on 4-way nodes (OMP_NUM_THREADS=4) the result of "top"
is different. The load on each CPU is 25% (total 100% for 4
CPUs). But I expect the total load of the node to be 400%, or close to that
value, matching the behaviour of "top" on the 2-way nodes.
The version of the top command is the same (top -v, procps version 2.0.17).
I think this is related to the kernel versions of the operating systems. Any
suggestions and information would be helpful.
Architecture, HP cluster 2 way and 4 way nodes
Operating System, Redhat Linux AS v3.0
Detail,
2 way nodes, uname -a command output,
Linux cn07 2.4.21-20.0.1.ELSFS #1 SMP Wed Mar 30 09:12:30 GMT+2 2005
x86_64 x86_64 x86_64 GNU/Linux
4way nodes,
Linux cn41 2.4.21-27.ELSFS #1 SMP Mon Jun 13 13:57:23 EEST 2005 x86_64
x86_64 x86_64 GNU/Linux
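For a controlled check, a trivial OpenMP burner like the sketch below (build
with your compiler's OpenMP flag; the iteration count is arbitrary) should
drive every CPU to ~100% in top when OMP_NUM_THREADS matches the node. If the
4-way node still reports 25% per CPU for this, the problem is top/kernel
accounting rather than the application:
/* Pure CPU burner to check per-CPU load accounting in top. */
#include <stdio.h>
#include <omp.h>
int main(void) {
    double sink = 0.0;
    #pragma omp parallel reduction(+:sink)
    {
        long i;
        double x = omp_get_thread_num() + 1.0;
        for (i = 0; i < 1000000000L; i++)   /* pure arithmetic, no memory traffic */
            x = x * 0.9999999 + 0.5;
        sink += x;
    }
    printf("done (%f)\n", sink);            /* keep the result live */
    return 0;
}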
Thanks,
Ufuk Utku Turuncoglu
Istanbul Technical University
Informatics Institute, HPCC Group
Turkey
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From cap at nsc.liu.se Fri Jul 22 07:38:32 2005
From: cap at nsc.liu.se (Peter =?iso-8859-1?q?Kjellstr=F6m?=)
Date: Fri, 22 Jul 2005 13:38:32 +0200
Subject: [Beowulf] help a newbie
In-Reply-To: <20050720154342.B74634@students.ee.itb.ac.id>
References: <87f1c381050717220375a693a6@mail.gmail.com>
<20050720154342.B74634@students.ee.itb.ac.id>
Message-ID: <200507221338.40517.cap@nsc.liu.se>
Hello,
Might I suggest that if you're building something from scratch with no legacy
stuff to port, you please use something less old/dead/antique/EOL'ed/...
Clustermatic5 was released in November 2004 if I remember correctly and as for
rh8... It's so old that not even Fedora legacy supports it anymore.
Maybe you should try Clustermatic5 on Centos-4.1 (or Centos-3.5). That will
give you something with working update streams and clustermatic on a
redhat-ish dist.
regards,
Peter
On Wednesday 20 July 2005 10.48, 13200178 Hendra Tampang Allo wrote:
> I am also a newbie like you but I think it's not hard to build a cluster.
> Just search it on google and you will find many ways of making beowulf
> cluster. But what kind of job will you process on your cluster? I am also
> building a cluster to run namd (a molecular dynamics code) and I was suggested
> to use redhat 8 + clustermatic 3.
>
> Soli Deo Gloria & Sola Christa Eterna
> Hendra/EL-00
>
> On Sun, 17 Jul 2005, rupinder bhangu wrote:
> > hi
> > I am Rupinder.I am a final year student.I have planned to work on the
> > topic of Beowulf clusters during my six months training.I have also gone
> > through some of the sites & the other stuff on the Internet to gather the
> > basic info regarding beowulfs, because I had to convince my teachers for
> > allowing me to work on this topic.Having done that job successfully, I
> > would now like to have the help from the people who are experienced in
> > this field. I am really a newbie in this field, but I want to do it.
> > Could you please tell me where to start, how to work & the related help
> > that you think would be useful for me?Could you also tell that whether a
> > period of 6 months is adequate for a person like me to build a cluster
> > with 3-4 nodes successfully?
> > Thanks
> > Rupinder Kaur
--
------------------------------------------------------------
Peter Kjellstr?m |
National Supercomputer Centre |
Sweden | http://www.nsc.liu.se
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mark.westwood at ohmsurveys.com Thu Jul 28 10:58:02 2005
From: mark.westwood at ohmsurveys.com (Mark Westwood)
Date: Thu, 28 Jul 2005 15:58:02 +0100
Subject: [Beowulf] Help with MPICH programming
In-Reply-To: <2C104E0B5F7AFE4CA6D266BE280EF72B23BBB8@ex1.mail.win.hw.ac.uk>
References: <2C104E0B5F7AFE4CA6D266BE280EF72B23BBB8@ex1.mail.win.hw.ac.uk>
Message-ID: <42E8F27A.3090401@ohmsurveys.com>
Kevin
the MPICH home page is at: http://www-unix.mcs.anl.gov/mpi/mpich/ and
will lead you to most of what you need to know about MPI and MPICH. You
should browse the documentation looking for MPI_COMM_CREATE or
MPI_COMM_SPLIT. These probably give you the facility you want. You
could also investigate MPI GROUPs.
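A minimal sketch of the MPI_Comm_split route (the even/odd grouping here is
an arbitrary example): processes passing the same color end up in the same
sub-communicator, and a broadcast on that communicator is heard only by its
members.
/* Restrict a broadcast to a subgroup via MPI_Comm_split. */
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
    int wrank, color, data = 0;
    MPI_Comm subcomm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    color = wrank % 2;                      /* arbitrary grouping rule */
    MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &subcomm);
    if (wrank == 0) data = 42;              /* world rank 0 is rank 0 of its subgroup */
    /* broadcast from rank 0 *of the sub-communicator*: only that group hears it */
    MPI_Bcast(&data, 1, MPI_INT, 0, subcomm);
    printf("world rank %d (color %d) has data %d\n", wrank, color, data);
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}
As for your PS: the MPI-2 standard defines MPI_Comm_spawn for creating
processes at run time, which MPICH2 implements; MPICH-1.2.x does not support
dynamic process management.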
Regards
Mark
PS not me
Cheng, Kevin wrote:
> I've been looking and looking on the internet for documentation on
> programming in MPI but couldn't find any.
>
> What am needing to know is how to change the communication group for
> MPICH-1.2.1. Basically so that when the master process 0 broadcasts a
> message it will only be heard by the processes
> in a specific communication group.
>
> Many thanks
> Kev
>
> PS: Did anyone find out how to dynamically create/destroy MPI processes
> on-the-fly / in-real-time for MPICH-2?
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
--
Mark Westwood
Parallel Programmer
OHM Ltd
The Technology Centre
Offshore Technology Park
Claymore Drive
Aberdeen
AB23 8GD
United Kingdom
+44 (0)870 429 6586
www.ohmsurveys.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From josip at lanl.gov Thu Jul 28 18:39:10 2005
From: josip at lanl.gov (Josip Loncaric)
Date: Thu, 28 Jul 2005 16:39:10 -0600
Subject: [Beowulf] Opteron memory rank limits with DDR-400
In-Reply-To: <3.0.32.20050727231252.0129cc58@pop3.xs4all.nl>
References: <3.0.32.20050727231252.0129cc58@pop3.xs4all.nl>
Message-ID: <42E95E8E.8050609@lanl.gov>
Vincent Diepeveen wrote:
> Quad opteron dual core 1.8Ghz
It sounds like you've got 16 single-rank memory modules (4 per Opteron),
so you would not see the issue I'm concerned about.
The Opteron memory controller apparently needs either downclocking to
DDR333, or a recent Opteron revision plus an extra cycle of memory-access
latency, when driving 8 ranks of DDR400 memory. Downclocking is
automatically enforced by some BIOSes.
See the discussion at http://forums.amd.com/index.php?showtopic=49274
for more detail... There is much confusion on this topic. The most
interesting comment therein is:
> Of course Opterons can handle more than 4 banks, but then only at DDR333.
>
> The exception to this is the new E Revision CPUs, which are able to
> handle more than 4 banks with DDR400. That is, you can run DDR400 timings
> with 4 dual-rank modules, but it will add another cycle to the memory
> access. This is referred to as "Command Rate" in most BIOSes and can be
> 1T or 2T. In the case of 4 dual-rank modules, DDR400 (and an E-Rev. CPU),
> it must be set at 2T. So you get more bandwidth, but worse access
> timings compared to DDR333 and 1T.
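To spell out the arithmetic for the two cases: your 16 single-rank
modules work out to 4 modules x 1 rank = 4 ranks per memory controller,
within the 4-rank limit, so DDR400 at a 1T command rate should be safe.
It is 4 dual-rank modules per controller (8 ranks) that forces either
DDR333 or, on a Rev E part, DDR400 at a 2T command rate.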
Sincerely,
Josip
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From landman at scalableinformatics.com Thu Jul 28 22:25:35 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 28 Jul 2005 22:25:35 -0400
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To: <42E8FCF0.5010105@cea.fr>
References:
<42E7FD4D.1000507@scalableinformatics.com>
<42E8FCF0.5010105@cea.fr>
Message-ID: <42E9939F.9080703@scalableinformatics.com>
Hi Philippe:
I don't have an accurate measurement at this moment; I have been
working on measurements using the performance counters.
What I did not run, but would like to see, is a set of STREAM numbers
per core, for 1, 2, 3, and 4 active cores on the system.
Joe
Philippe Blaise wrote:
> Hopefully, you show that dual core is quite superior to
> "hyper"-threading for some scientific programs, and is cost-friendly;
> very nice! But on page 14 you write:
>
> "It is worth noting that since AMBER8 does consume significant memory
> bandwidth, the
> memory contention issue that could reduce overall performance on a AMD
> Opteron 275
> processor based 2p/4c system under heavy memory usage shows a small
> effect, at most".
>
> Please, what do you mean by "significant memory bandwidth"? Maybe no
> more than 50%?
> Do you have an estimate?
>
> Philippe.
>
> Joe Landman wrote:
>
>> Hi Steve:
>>
>> Not knowing the details of your calculations might be an issue, but
>> you can read about our experiences with a number of chemistry and
>> informatics codes on dual core Opteron systems. See
>> http://enterprise2.amd.com/downloadables/Dual_Core_Performance.pdf for
>> more details.
>>
>> Joe
>>
>> Steve Cousins wrote:
>>
>>> On Thu, 14 Jul 2005 11:25:12 +0100 Igor Kozin wrote:
>>>
>>>
>>>>> But now, for 4 cores/2 CPUs per Opteron node, to force the use of
>>>>> only 2 cores (out of 4), one per chip, we'll need to have
>>>>> CPU affinity support in Linux.
>>>>
>>>>
>>>>
>>>> Mikhail,
>>>> you can use "taskset" for that purpose. For example, (perhaps not in
>>>> the most elegant form)
>>>> mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
>>>> But I doubt you want to let the idle cores do something else in
>>>> the meantime. However small, you will generally see an increase in
>>>> performance if you use all the cores.
>>>
>>>
>>>
>>>
>>> We are considering getting a Dual Dual-Core Opteron system vs. two Dual
>>> Opteron systems. We like the ability to use all four cores on one model
>>> but a lot of what we'll do is have two models running at the same time,
>>> each using two cores. We are worried that running two models on one
>>> system with four cores (each model using two cores) will not work as
>>> well as using two systems, each with two cores/CPUs. Is this what you
>>> were referring to (Igor) when you
>>> wrote:
>>>
>>>
>>>> But I doubt you want to let the idle cores to do something else
>>>> in the mean time.
>>>
>>>
>>>
>>>
>>> We have an 8 CPU SGI Origin 3200 that has no problem doing this sort of
>>> thing. I'm just curious what the implications are of doing this with
>>> the dual-core Opteron CPUs. Thanks,
>>>
>>> Steve
>>> ______________________________________________________________________
>>> Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
>>> Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
>>> Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>>
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Fri Jul 29 01:35:30 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Fri, 29 Jul 2005 01:35:30 -0400 (EDT)
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To: <77673C9ECE12AB4791B5AC0A7BF40C8F0147D654@exchange02.fed.cclrc.ac.uk>
Message-ID:
> It seems there is an increasingly dominating opinion that the second
> core is meant to run anti-virus protection software 24 hours a day.
only in the sense that all CPUs are intended to run desktop-windows :|
seriously, if that were the case, then there would be little point to
providing both CPUs with giant 1MB caches. ("giant" here is relative to
the amount of die area devoted to computation, of course!)
I find that most software has a pretty high flops-per-byte ratio, at least
as compared to Stream/daxpy. dual-core K8's seem like a pretty clear win,
though memory contention and higher single-core clocks can argue against.
(I'm about to receive 1536 single-core AMD's...)
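for concreteness, the daxpy yardstick I mean is just this (a sketch):

/* daxpy: the canonical bandwidth-bound kernel. each iteration does
   2 flops (one multiply, one add) against 24 bytes of memory traffic
   (load x[i], load y[i], store y[i], 8 bytes each), i.e. about 1/12
   flop per byte. most real codes sit well above that ratio, which is
   why the second core usually pays its way. */
void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}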
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hunting at ix.netcom.com Fri Jul 29 02:21:02 2005
From: hunting at ix.netcom.com (Michael Huntingdon)
Date: Thu, 28 Jul 2005 23:21:02 -0700
Subject: [Beowulf] IB Cluster Configuration Options
Message-ID: <6.2.1.2.2.20050728231344.01d8b650@pop.sbcglobal.yahoo.com>
We're working on a 64-node dual-core Opteron cluster with a wide variety of
MPI-heavy applications and have run into the typical funding issues that
are forcing some configuration restrictions (downsizing the wish list).
The original concept included a 64-node, dual-processor, dual-core cluster
with an InfiniBand 9288 or 9096.
As a means to reduce our cost, the suggestion was raised to cascade four
9024 switches, or to reduce the node count, Opteron clock rate, and memory.
Just wondering how severe the hop penalty might be in the case of cascading
switches?
ciao~
michael
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Fri Jul 29 10:29:27 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Fri, 29 Jul 2005 10:29:27 -0400 (EDT)
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To: <77673C9ECE12AB4791B5AC0A7BF40C8F0147D65A@exchange02.fed.cclrc.ac.uk>
Message-ID:
> > I find that most software has a pretty high flops-per-byte ratio, at
> > least as compared to Stream/daxpy. dual-core K8's seem like a pretty
> > clear win, though memory contention and higher single-core clocks can
> > argue against.
> > (I'm about to receive 1536 single-core AMD's...)
>
> But I also do understand why you did prefer the single cores :)
because I have apps which saturate an SC memory bus, of course!
recall that Amdahl's law still applies, so there are cases where
1x2.6 GHz does better than 2x2.2 GHz, even without the memory issue.
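to put a number on that: with serial fraction s, 1x2.6 GHz beats
2x2.2 GHz whenever 1/2.6 < (s + (1-s)/2)/2.2, which solves to
s > 2*(2.2/2.6) - 1, about 0.69. so (back of the envelope, ignoring
memory effects entirely) a code that is roughly 70% serial already
prefers the single faster clock.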
and actually we're getting about 2K DC chips too,
though not at the same site, thank goodness!
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From hahn at physics.mcmaster.ca Fri Jul 29 11:08:43 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Fri, 29 Jul 2005 11:08:43 -0400 (EDT)
Subject: [Beowulf] IB Cluster Configuration Options
In-Reply-To: <6.2.1.2.2.20050728231344.01d8b650@pop.sbcglobal.yahoo.com>
Message-ID:
> As a means to reduce our cost, the suggestion was raised to cascade four
> 9024 switches, or to reduce the node count, Opteron clock rate, and
> memory. Just wondering how severe the hop penalty might be in the case
> of cascading switches?
IB switch vendors claim quite low per-hop latencies (the voltaire 9024
spec says 140 ns, for instance). of course, the real problem is that
the base of a non-fat switching tree is highly contended.
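to put that 140 ns in perspective: even three cascaded hops only add
3 x 140 ns = 0.42 us, which is small next to the several-microsecond
end-to-end MPI latencies typical of IB stacks. so the hop penalty per
se is minor; what hurts is every pair of leaf switches fighting over
the thin links between them.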
the question is: how much are you willing to pay for good
bisection-bandwidth? personally, I cannot imagine buying IB except for
extremely bandwidth-intensive apps, in which case, you probably want
a non-blocking (full bisection) switching network. the 9024 doesn't
seem to be offered with all 12x ports, which is kind of odd, since
a classic fat tree would be most natural with 8 12x ports at the top.
but a set of 5 9024-12 switches would get you a pretty well-connected
set of 60 nodes. a hypercube with express links would use 8 switches
and have great bandwidth and at most 3 switches on any path (I think!).
maybe you should consider alternatives (Myri 10G seems extremely attractive,
for instance).
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From fant at pobox.com Fri Jul 29 11:57:22 2005
From: fant at pobox.com (Andrew Fant)
Date: Fri, 29 Jul 2005 11:57:22 -0400
Subject: [Beowulf] cluster toolkit comparison questions
Message-ID: <42EA51E2.9090605@pobox.com>
Afternoon all,
I am in the process of making some assessments for architecture on a
new cluster we are building and I have a couple of questions comparing
OSCAR vs. Rocks. I certainly don't mind having other options presented,
but diskless options like Warewulf are already precluded by
administrative fiat.
1) How well can OSCAR or Rocks be integrated with an LDAP directory?
Do either of them have some mechanism already available for proxying
LDAP queries? On our current cluster, we have had to maintain a local
password/group repository, which is something that management has been
less than happy about, given the investment they have made in enterprise
directory services.
2) Do OSCAR and/or Rocks have support for multiple head nodes? I've
been in the habit of using 2 head nodes, one for administrative
functions, and one for user access, and would really like to maintain
this practice. It keeps the administrative tools out of the sight of
users, it gives me a second gateway into the cluster for redundancy of
access, and it makes it harder for users to accidentally interfere with
administrative functions by starting processes on the head node when
they "forget" that they aren't supposed to.
3) Does either toolkit have problems with home directories coming off
from a separate NFS appliance instead of living on a filesystem on the
head node that gets exported to the compute nodes directly from there?
Thanks for your help,
Andy
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From harrysnotter at gmail.com Thu Jul 28 17:31:39 2005
From: harrysnotter at gmail.com (Ik Ben)
Date: Thu, 28 Jul 2005 23:31:39 +0200
Subject: [Beowulf] mpich2 mutex
Message-ID:
Hi,
I'm reading the book Using MPI-2 by William Gropp.
The book explains how to create a mutex for RMA using the
lock/unlock method, and mentions functions such as MPE_Counter_nxtval
for building it.
I'm French and find it very difficult to merge the necessary pieces
together correctly.
Can someone please show an example of the minimum code for creating a
mutex using MPI-2? (I'm planning to use SuSE, but I don't think that
matters...)
I know this seems like I'm too lazy to do my own homework, but it's not...
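To show how far I have got, here is my reconstruction of the basic
passive-target lock/unlock mechanics that the mutex is built from
(only the building block, not the full MPE counter code from the book;
corrections welcome):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0, one = 1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every process calls MPI_Win_create; only rank 0 exposes memory */
    MPI_Win_create(rank == 0 ? &value : NULL,
                   rank == 0 ? sizeof(int) : 0,
                   sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 1) {
        /* an exclusive lock keeps any other passive-target access to
           rank 0's window from interleaving with this epoch */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Put(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);  /* the put is complete at rank 0 here */
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        /* lock the local window before reading it directly, so the
           remote update is guaranteed to be visible */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        printf("value is now %d\n", value);
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}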
Thanks in advance;
Gilles Roman.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From i.kozin at dl.ac.uk Fri Jul 29 05:42:13 2005
From: i.kozin at dl.ac.uk (Kozin, I (Igor))
Date: Fri, 29 Jul 2005 10:42:13 +0100
Subject: [Beowulf] Re: Opteron 275 performance
Message-ID: <77673C9ECE12AB4791B5AC0A7BF40C8F0147D65A@exchange02.fed.cclrc.ac.uk>
> > It seems there is an increasingly dominating opinion that the second
> > core is meant to run anti-virus protection software 24 hours a day.
>
> only in the sense that all CPUs are intended to run desktop-windows :|
>
> seriously, if that were the case, then there would be little point to
> providing both CPUs with giant 1MB caches. ("giant" here is
> relative to
> the amount of die area devoted to computation, of course!)
I was not saying that was my opinion. I was referring to all those
popular adverts ("IMAGINE A WORLD" type of thing) where they can't
think of anything else for you to do with the second core.
Oh, the other thing is downloading. Most likely the patches for
the anti-virus software.
On the contrary, I am seeing quite a good benefit for certain types
of applications.
> I find that most software has a pretty high flops-per-byte ratio, at
> least as compared to Stream/daxpy. dual-core K8's seem like a pretty
> clear win, though memory contention and higher single-core clocks can
> argue against.
> (I'm about to receive 1536 single-core AMD's...)
But I do also understand why you preferred the single cores :)
> regards, mark hahn.
Best,
Igor
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From philippe.blaise at cea.fr Thu Jul 28 11:42:40 2005
From: philippe.blaise at cea.fr (Philippe Blaise)
Date: Thu, 28 Jul 2005 17:42:40 +0200
Subject: [Beowulf] Re: Opteron 275 performance
In-Reply-To: <42E7FD4D.1000507@scalableinformatics.com>
References:
<42E7FD4D.1000507@scalableinformatics.com>
Message-ID: <42E8FCF0.5010105@cea.fr>
Hopefully, you show that dual core is quite superior to
"hyper"-threading for some scientific programs, and is cost-friendly;
very nice! But on page 14 you write:
"It is worth noting that since AMBER8 does consume significant memory
bandwidth, the
memory contention issue that could reduce overall performance on a AMD
Opteron 275
processor based 2p/4c system under heavy memory usage shows a small
effect, at most".
Please, what do you mean by "significant memory bandwidth"? Maybe no
more than 50%?
Do you have an estimate?
Philippe.
Joe Landman wrote:
> Hi Steve:
>
> Not knowing the details of your calculations might be an issue, but
> you can read about our experiences with a number of chemistry and
> informatics codes on dual core Opteron systems. See
> http://enterprise2.amd.com/downloadables/Dual_Core_Performance.pdf for
> more details.
>
> Joe
>
> Steve Cousins wrote:
>
>> On Thu, 14 Jul 2005 11:25:12 +0100 Igor Kozin wrote:
>>
>>
>>>> But now, for 4 cores/2 CPUs per Opteron node, to force the use of
>>>> only 2 cores (out of 4), one per chip, we'll need to have
>>>> CPU affinity support in Linux.
>>>
>>>
>>> Mikhail,
>>> you can use "taskset" for that purpose. For example, (perhaps not in
>>> the most elegant form)
>>> mpiexec -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
>>> But I doubt you want to let the idle cores do something else in
>>> the meantime. However small, you will generally see an increase in
>>> performance if you use all the cores.
>>
>>
>>
>> We are considering getting a Dual Dual-Core Opteron system vs. two Dual
>> Opteron systems. We like the ability to use all four cores on one model
>> but a lot of what we'll do is have two models running at the same time,
>> each using two cores.
>> We are worried that running two models on one system with four cores
>> (each model using two cores) will not work as well as using two
>> systems, each with two cores/CPUs. Is this what you were referring
>> to (Igor) when you
>> wrote:
>>
>>
>>> But I doubt you want to let the idle cores to do something else
>>> in the mean time.
>>
>>
>>
>> We have an 8 CPU SGI Origin 3200 that has no problem doing this sort of
>> thing. I'm just curious what the implications are of doing this with
>> the dual-core Opteron CPUs.
>> Thanks,
>>
>> Steve
>> ______________________________________________________________________
>> Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu
>> Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu
>> Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302
>>
>>
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From kc40 at hw.ac.uk Thu Jul 28 13:57:09 2005
From: kc40 at hw.ac.uk (Cheng, Kevin )
Date: Thu, 28 Jul 2005 18:57:09 +0100
Subject: [Beowulf] uninstalling MPICH2 woes :S
Message-ID: <2C104E0B5F7AFE4CA6D266BE280EF72B23BBBC@ex1.mail.win.hw.ac.uk>
Does anyone know how to fully/safely uninstall MPICH2? I think it's installed into /usr/local/; however, I think there are other applications stored in /usr/local/ too. MPICH2 is installed to /usr/local/bin, /usr/local/include, etc., I think.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From asada at dsee.fee.unicamp.br Thu Jul 28 14:11:47 2005
From: asada at dsee.fee.unicamp.br (Eduardo N. Asada)
Date: Thu, 28 Jul 2005 15:11:47 -0300
Subject: [Beowulf] pvm freezes
Message-ID: <200507281511.47731.asada@dsee.fee.unicamp.br>
Hi,
I have noticed with my PVM programs, and also with the examples provided
with the package, that when I run them several times, the program generally
freezes the third time (or the second). Then I have to interrupt it and run
it again, and this behavior repeats.
What is the reason for this? Is it related to the network card? Does it
happen with MPI also?
Regards.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From llw7 at ix.netcom.com Thu Jul 28 17:31:54 2005
From: llw7 at ix.netcom.com (Laura Winkelbauer)
Date: Thu, 28 Jul 2005 14:31:54 -0700
Subject: [Beowulf] Infiniband Configuration Options
Message-ID: