I would add to this:
"How sure are we that a process (or thread) that allocated,
initialized, and writes to memory on one specific memory node
also keeps getting scheduled on a core of that memory node?"
It seems to me that sometimes (every second or so) threads jump
from one memory node to another. I could be wrong,
but I certainly have that impression with the Linux kernels.
That said, it has improved a lot; now all we need is a better
compiler for Linux. For my chess program, GCC generates an
executable that searches 22% fewer positions per second than the
one built with Visual C++ 2005.
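
One way to check the thread-jumping impression is to pin the process to a
single core and sample which core it actually runs on. What follows is only
a rough sketch, assuming Linux and Python 3; the helper names current_cpu
and pin_to_cpu are made up for illustration. It reads the "processor" field
(field 39) of /proc/self/stat, which for a single-threaded program reports
the CPU the main thread last ran on. Note that this only shows CPU placement;
it says nothing about which node the memory pages themselves live on.

```python
import os


def current_cpu():
    """Return the CPU the main thread last ran on (Linux /proc only)."""
    with open("/proc/self/stat") as f:
        stat = f.read()
    # Field 2 (the command name) is in parentheses and may contain spaces,
    # so only split what follows the last ')'.
    fields = stat.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so field 39 (processor) is index 36.
    return int(fields[36])


def pin_to_cpu(cpu):
    """Restrict the calling process to one CPU (pid 0 means 'myself')."""
    os.sched_setaffinity(0, {cpu})


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)

    # Unpinned: the scheduler is free to move us between allowed CPUs.
    seen_unpinned = {current_cpu() for _ in range(1000)}
    print("unpinned, CPUs seen:", sorted(seen_unpinned))

    # Pinned: pick one allowed CPU (hypothetical choice: the lowest one).
    target = min(allowed)
    pin_to_cpu(target)
    seen_pinned = {current_cpu() for _ in range(1000)}
    print("pinned to", target, "CPUs seen:", sorted(seen_pinned))

    # Restore the original mask so later work is not restricted.
    os.sched_setaffinity(0, allowed)
```

If the unpinned run reports more than one CPU, the mapping of CPUs to
memory nodes can be looked up in /sys/devices/system/node/node*/cpulist to
see whether those migrations cross a node boundary.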
Thanks,
Vincent
On Jun 23, 2008, at 4:01 PM, Mikhail Kuzminsky wrote:
> I'm testing my 1st dual-socket quad-core Opteron 2350-based server.
> Let me assume that the RAM used by kernel and system processes is
> zero, there is no physical RAM fragmentation, and the affinity of
> processes to CPU cores is maintained. I assume also that both the
> nodes are populated w/equal number of the same DIMMs.
>
> If I run a thread-parallelized (for example, w/OpenMP) application w/
> 8 threads (8 = number of server CPU cores), the ideal case for all
> the ("equal") threads is: the shared memory used by each of 2 CPUs
> (by each of 2 processes "quads") should be divided equally between
> 2 nodes, and the local memory used by each process should be mapped
> analogously.
> Theoretically, such an ideal case may be realized if my application (8
> threads) uses practically all the RAM and uses only shared memory
> (I assume here also that all the RAM addresses have the same load,
> and the size of program codes is zero :-) ).
>
> The questions are
> 1) Is there some way to distribute analogously the local memory of
> threads (I assume that it has the same size for each thread) using
> "reasonable" NUMA allocation?
>
> 2) Is it right that using numactl for applications may give
> performance improvements in the following case:
> the number of application processes is equal to the number of cores
> of one CPU *AND* the RAM amount needed by the application can be
> placed on the DIMMs of one node (I assume that RAM is allocated
> "continuously").
>
> What will happen to performance (when using numactl) if the RAM
> size required is higher than the RAM available on one node, and
> therefore the program cannot take advantage of (load-balanced)
> simultaneous use of the memory controllers on both CPUs?
> (I also assume that RAM is allocated continuously.)
>
> 3) Is there some reason to use things like
> mpirun -np N /usr/bin/numactl <numactl_parameters> my_application ?
>
> 4) If I use malloc() and don't use numactl, how can I find out
> from which node Linux will begin the real memory allocation? (I
> remind you that I assume that all the RAM is free.) And how can I
> find out which DIMMs correspond to the higher RAM addresses and
> which to the lower RAM addresses?
>
> 5) In which cases is it reasonable to switch on "Node memory
> interleaving" (in BIOS) for an application which uses more memory
> than is present on the node?
> And BTW: if I use taskset -c CPU1,CPU2, ... <program_file>
> and the program_file creates some new processes, will all these
> processes run only on the same CPUs defined in the taskset command?
>
> Mikhail Kuzminsky
> Computer Assistance to Chemical Research Center,
> Zelinsky Institute of Organic Chemistry
> Moscow
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
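
On the taskset question at the end of the quoted message: on Linux the CPU
affinity mask is inherited by children created with fork() and is preserved
across execve(), so processes started by a taskset-restricted program begin
with the same CPU set. The sketch below is one way to confirm this on a
given machine, assuming Linux and Python 3; the function name child_affinity
is made up for illustration. It restricts the current process, starts a
child, and returns the mask the child reports.

```python
import os
import subprocess
import sys


def child_affinity(cpus):
    """Restrict this process to `cpus`, start a child process, and return
    the set of CPUs the child reports as its own affinity mask."""
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)  # same effect as launching under taskset
    try:
        result = subprocess.run(
            [sys.executable, "-c",
             "import os; print(*sorted(os.sched_getaffinity(0)))"],
            capture_output=True, text=True, check=True,
        )
    finally:
        # Put the parent's original mask back whatever happens.
        os.sched_setaffinity(0, original)
    return {int(tok) for tok in result.stdout.split()}


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    subset = {min(allowed)}  # hypothetical choice: lowest allowed CPU
    print("parent restricted to:", sorted(subset))
    print("child reports:       ", sorted(child_affinity(subset)))
```

The inherited mask is only a starting point: a child that calls
sched_setaffinity() itself can change its own mask afterwards, so "only on
the same CPUs" holds as long as the program does not reset its affinity.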