On Mon, 15 Nov 2010 22:05:52 +0100,
Philippe Wang <mail@philippewang.info> wrote:
> Take the current Apple Mac Pro for instance (I take this reference
> because it's easy to find and it doesn't evolve very often), with
> 12-core configuration.
> - Two 2.93GHz 6-Core Intel Xeon "Westmere" (12 cores)
> - 1333MHz DDR3 ECC SDRAM (whatever the capacity)
> => with HT, there are 24 logical units, which all share a tiny
> bandwidth for CPU<->RAM communications.
> Let's say bandwidth is about 2400MHz : 2400MHz/24Thread =
> 100MHz/Thread. It's kind of "ridiculous..."
You're assuming that there'd be a lot of communication between cores
and RAM. That is not (or should not be) the case in well-written
multithreaded programs.
> OCaml is not (at least not yet) a language for HPC (high performance
> computing), it is very efficient (compared to so many other languages)
> and yet doesn't take advantage of SMP. Well, sooner or later it
> will actually probably need to support SMP. (But somehow it already
> does, via C code boxed in "blocking sections").
Which is a pity, since functional languages in particular are much
better placed to parallelize tasks implicitly.
> Well, if you take casual OCaml programs, and put them on SMP
> architectures (on which indeed they often already are) while giving
> them capacity to take advantage of SMP (via POSIX-C threads in
> blocking sections, message-passing style, or OCaml-for-multicore, or
> whatever else), they quickly become less efficient because there is a
> bottleneck on the CPU<->RAM bus.
Suppose you were to implement a convolution in n dimensions on a large
data set. This is a prime example of where multithreading can help and
where main-memory bandwidth is not the limiting factor. One can split
the whole task into small tasklets and dispatch them to individual
cores. As long as a tasklet's working set, i.e. the payload data
(input, convolution kernel, and output buffer if not in-place) plus code,
fits into the L1 cache, everything is executed on-cache. On current
Intel CPUs that is 32kB, on AMD even 64kB -- per core!
And the cores on the same die share a higher-level cache (L2 or L3,
depending on the CPU), which has far more bandwidth, about an order of
magnitude, than system memory. Modern OS schedulers thus try to keep
threads of the same process together on one CPU die, and further group
them by NUMA node.
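
To make the tasklet idea concrete, here is a minimal sketch in C, reduced
to one dimension. The names convolve_tile and convolve_tiled and the TILE
size are made up for illustration, not taken from any library. The OpenMP
pragma is only one possible way to dispatch the tiles to cores; without
OpenMP support the loop simply runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size: 2048 floats = 8 kB of output per tasklet, so that
   input tile, kernel and output tile together stay below a 32 kB L1 cache
   (assuming a short kernel). */
#define TILE 2048

/* One tasklet: compute out[i] for i in [begin, end).
   "Valid" convolution: out[i] = sum_k in[i + klen - 1 - k] * kernel[k]. */
static void convolve_tile(const float *in, const float *kernel, size_t klen,
                          float *out, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < klen; ++k)
            acc += in[i + klen - 1 - k] * kernel[k];
        out[i] = acc;
    }
}

/* Split the output range into independent tiles. The tiles share no
   writable data, so they can run on different cores without locking.
   `out` must have room for n_in - klen + 1 floats. */
void convolve_tiled(const float *in, size_t n_in,
                    const float *kernel, size_t klen, float *out)
{
    assert(klen > 0 && n_in >= klen);
    size_t n_out = n_in - klen + 1;
    long n_tiles = (long)((n_out + TILE - 1) / TILE);

#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (long t = 0; t < n_tiles; ++t) {
        size_t begin = (size_t)t * TILE;
        size_t end = begin + TILE < n_out ? begin + TILE : n_out;
        convolve_tile(in, kernel, klen, out, begin, end);
    }
}
```

Each tile reads a contiguous slice of the input plus klen - 1 samples of
overlap and writes only its own slice of the output, which is what keeps
the per-core traffic inside the cache.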
> I want to believe you're right to ask for SMP support, even if now I'm
> pretty convinced that current state of OCaml is not compatible with "I
> want to write HPC programs in pure OCaml". (One should implement a
> brand new compiler maybe??)
This is not just about HPC but about resource utilization. A single
core running at full speed consumes far more power than 4 cores
clocked down to minimal frequency. Even worse, only the most recent CPU
generations can clock cores individually. So on anything older, a single
core running at full speed pulls the other cores up with it and
significantly increases power consumption (and thermal output).
> There are people studying how to have HPC with OCaml, but it has quite
> a little to do with SMP matters. Instead, it's more about (static or
> dynamic) specialized-code generation for GPUs etc. We'll see in some
> time what it produces...
For the time being I'm more interested in what's actually preventing
proper SMP in OCaml right now. I've read something about issues with
the garbage collector, which surprises me, as I switched to using
Boehm-GC in my C programs to resolve problems with memory deallocation in
multithreaded programs -- this of course was possible only after
Boehm-GC became thread-safe.
Wolfgang