Vincent Diepeveen wrote:
>>>There likely will be a difference, because average pingpong doesn't
>>>run on all the cpus. On a 4-cpu node, that can make a big difference.
>>I believe the difference will not be that big. I will get my hands on a
>>quad in the next couple of weeks, I will look into it.
> The difference will be huge of course, network processors have a switch
> latency. That's why.
> If it must switch at the wrong moment that'll cost 50 us or something at
> certain network chips.
Switch latency is negligible in this problem, and in any event 50 us is
not a realistic switch latency with modern hardware.
The real question is the following: do 4 processes running on 4
different CPUs greatly affect the latency when sending small messages to
other nodes, compared to only one process running on one CPU?
The answer, I argue, is "not much". Assuming that all processes send at
the exact same time, access to the PCI bus will be serialized, NIC
processing will be serialized and access to the wire will be serialized.
The most expensive resource in this pipeline for 0-byte messages is
likely to be the NIC. So, it boils down to the NIC overhead per send (or
recv) and that is not big with MX (and will be further reduced in the
future). In any event, not on the order of 10 us. With GM, it's a
different story as it does not do PIO for small messages.
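To put the serialization argument in numbers, here is a rough sketch. It assumes every message goes through the same three stages one at a time, so the slowest stage sets the spacing between messages. The function name and the per-stage costs are made up for illustration; they are placeholders, not measured MX figures.

```python
# Back-of-the-envelope model of the serialization argument above.
# All stage costs are hypothetical placeholders, not measured MX figures.

def worst_case_extra_latency_us(n_procs, stage_costs_us):
    """Extra delay seen by the last of n_procs simultaneous senders.

    Each stage (PCI, NIC, wire) handles one message at a time, so in a
    pipeline of identical messages the slowest stage sets the spacing
    between consecutive messages.
    """
    bottleneck = max(stage_costs_us.values())
    return (n_procs - 1) * bottleneck

# Hypothetical per-message costs in microseconds for a 0-byte send.
stages = {"pci": 0.2, "nic": 0.5, "wire": 0.1}

extra = worst_case_extra_latency_us(4, stages)
print(f"last of 4 senders waits an extra {extra:.1f} us")
```

With these placeholder numbers the unluckiest of the 4 senders waits three NIC slots. The point is only that the penalty scales with the per-send NIC overhead, whatever its real value is.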
> Additional there will be software layers that have to lock in some way.
You don't have to lock when doing os-bypass. At least, you don't have to
lock with other processes (which is kinda expensive). We take a spinlock
because we have at least another thread in the lib. The gain of having
such a thread outweighs the cost of the spinlock, no question about that.
> Locking + unlocking is already like half a microsecond extra, just like that.
Taking a spinlock on Opteron is ~50 ns. On Xeon or Nocona, it's a bit
more (~150 ns).
> Tests at all processors at the same time make major sense.
Yes and no. Most networking people believe the job of a node is to send
messages. Actually, it's mainly to compute, and sometimes to send
messages. So, would running a pingpong test on multiple processors
sharing a NIC at the same time be an interesting benchmark? Not really,
it won't happen much in real codes that compute most of the time. I prefer
to optimize other things that help the host compute faster.
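A crude way to see why simultaneous sends are rare in codes that mostly compute is sketched below. It assumes the processes on a node are independent and each spends a fixed fraction of its time communicating. That is a strong assumption (codes that synchronize their communication phases will collide more often), and the fractions are hypothetical, not taken from any real application.

```python
# Rough estimate of how often sends from processes sharing a NIC overlap.
# The communication fractions are hypothetical, chosen only for illustration.

def overlap_probability(n_procs, comm_fraction):
    """Chance that, when one process sends, at least one of the other
    n_procs - 1 processes is also sending, assuming independent processes
    that each spend comm_fraction of their time in communication."""
    return 1.0 - (1.0 - comm_fraction) ** (n_procs - 1)

for frac in (0.01, 0.05, 0.20):
    p = overlap_probability(4, frac)
    print(f"comm fraction {frac:.0%}: overlap probability {p:.1%}")
```

Under this model the overlap probability stays low as long as the communication fraction is small, which is the case I care about.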
> Any denial in advance that it will be the same speed is just ballony.
And I thought I was the bulliest on this list...
I just give my opinion and at least my opinion is backed up by
first-hand experience. I don't know how to play chess, but I know my stuff.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com