On Fri, 10 Jan 2003, Mike Eggleston wrote:
> That does help, thanks. Currently I am seeing 10000 units of work
> completing in ~300 seconds on a single-cpu box with the multi-threaded
> app. That works out to ~33 work units per second. For my app I would like to
Multi-threaded app?  Task swapping between lots of threads adds
significant kernel overhead.  You'd much rather work through the work
units in completely serial fashion and avoid that overhead -- forking
is expensive, and so are context switches and the like.
> see 10000 units complete in ~20 seconds or 500 units/sec. If the amount
Well, in principle you can achieve this on 10Base -- that is about 500
KB/sec of bandwidth required by the master to deliver the data and
retrieve results, out of about 1.25 MB/sec of theoretical bandwidth.  I'm
pretty sure you're safely within the practical latency limits as well --
the 1000-byte packet should be no problem, as it is close to the MTU and
hence bandwidth dominated; the small data packet might well be latency
dominated, but you should be OK.
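The bandwidth arithmetic above can be sanity-checked in a couple of lines (the figures are the ones from this thread; the assumed payload of ~1000 bytes per unit in each direction is my reading of the earlier posts):

```python
# Rough bandwidth check for the 500 units/sec target on 10Base,
# assuming ~1000 bytes of data per unit each way between master and node.

TARGET_RATE = 500        # work units per second (the design goal)
BYTES_PER_UNIT = 1000    # approximate payload per unit, one way

required_bps = TARGET_RATE * BYTES_PER_UNIT  # bytes/sec, per direction
raw_10base = 10_000_000 / 8                  # ~1.25 MB/sec theoretical

print(f"required: ~{required_bps / 1e3:.0f} KB/sec per direction")
print(f"raw 10Base: ~{raw_10base / 1e6:.2f} MB/sec")
print(f"fraction of wire used: {required_bps / raw_10base:.0%}")
```

So each direction wants about 40% of the theoretical wire, which is why the estimate is "in principle achievable" rather than comfortable.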
> of work is consistent on each node that would mean ~15 nodes? So if on
> 15 nodes each node processed 667 units, would that amount of network
> traffic saturate the cluster's lan such that in the end it would have
> been better to stay on a single (or smp) box?
I "think" you'll be able to scale to fifteen or sixteen nodes.  Right
now you're working about 30,000 microseconds per task.  Your packet
exchange "should" take on the order of 1000 microseconds on 10BT (feel
free to correct this estimate) -- roughly one microsecond per byte,
although of course you do NOT send it a byte at a time to achieve this
but all at once in a single packet.  10Base is slow enough that you can
probably neglect TCP and PVM overhead relative to the physical network
time.  The additional overhead required by the communications will slow
your task completion rate down to less than 33/second (maybe you'll get
28-30), but that should still provide you with a generous amount of
parallel speedup.
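As a back-of-the-envelope check of that estimate (the numbers are the ones used above; the sketch assumes all communication serializes at the master, which is the pessimistic case):

```python
# Estimated sustained rate on 10BT with the figures from this thread:
# 30,000 us of compute per task plus ~1000 us of wire time per exchange.

T_COMPUTE_US = 30_000   # per-task compute time on one node
T_COMM_US = 1_000       # per-task packet exchange on 10BT (rough)

per_node = 1_000_000 / (T_COMPUTE_US + T_COMM_US)   # ~32 tasks/sec/node
master_ceiling = 1_000_000 / T_COMM_US              # ~1000 exchanges/sec
aggregate_16 = 16 * per_node

print(f"per node, comm included: ~{per_node:.1f} tasks/sec")
print(f"16 nodes, aggregate: ~{aggregate_16:.0f} tasks/sec")
print(f"master comm ceiling: ~{master_ceiling:.0f} tasks/sec")
```

Sixteen nodes at ~32 tasks/sec each lands right around the 500/sec design goal, which is why 15-16 nodes is the edge of the envelope on 10BT.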
> I have seen the time required to process 10000 units to reach 2200-2400
> seconds, or 4.5 units per second.
> There is no barrier within the 10000 units, but all 10000 must complete
> before the next round of 10000 are worked. On a previous attempt using
> pvm I implemented a round-robin deal where the head sends a unit to
> A, then B, then C, and so on. The first node that returned its results
> received the next unit. The head continued down the list of units until
> reaching the bottom. At the bottom loop to the top and look for any
> lingering units that have not received results, sending those units to
> waiting nodes. In this way I could deal with nodes of different speeds
> and even nodes that crashed as not every node would crash at the same
> time.
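The round-robin/work-pool scheme described above can be sketched in a few lines (this is a single-process simulation of the scheduling policy only, not PVM code; the function name and the "crashed" model are invented for illustration):

```python
from collections import deque

def farm(units, workers, crashed=frozenset()):
    """Simulate the work-pool policy: deal each unit to the next free
    worker; whichever worker returns first gets the next unit.  Units
    handed to 'crashed' workers never come back, so they linger and
    get re-issued to a live worker on a later pass."""
    results = {}
    todo = deque(units)
    free = deque(workers)             # round-robin order: A, B, C, ...
    while todo:
        unit = todo.popleft()
        worker = free.popleft()
        if worker in crashed:
            todo.append(unit)         # lingering unit: re-issue later;
        else:                         # the dead worker leaves rotation
            results[unit] = worker    # result comes back, and the
            free.append(worker)       # worker rejoins the free list
        if not free:
            raise RuntimeError("no live workers remain")
    return results

out = farm(range(6), ["A", "B", "C"], crashed={"B"})
print(sorted(out))          # all six units completed despite B crashing
print(set(out.values()))    # only by the live workers
```

The point of the policy is exactly what you describe: faster nodes naturally pull more units, and a crashed node just strands a few units that get swept up on the re-issue pass.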
> Oh, and I do mean 10baseT as a 10Mb/s network. I can go 100Mb/s if I
> needed.
Things that can mess up the "simple" estimate of (say) 30 task
units/second sustained include overhead and networking inefficiencies.
A lot of them might end up connected to your cycle -- you really want
the master node to be idle when each node tries to send back its last
result and collect its next unit.  You also definitely want an ethernet
switch, not a hub, as collisions in 10BT can significantly attenuate
real-world bandwidth.
At that point, you have to face the economics of 10 vs 100. 100base
cards cost $10-50 each, and even a rotten card like a cheapo RTL8139
(what you're likely to get for $10) will in principle outperform any
10base card. A sixteen port fast ethernet switch costs what, $80?
$100? Again, less than $10/port -- little enough that I have one in my
house for my personal LAN.
So for an investment of $320 to maybe $700, your nodes could all be on
switched 100BT. Suddenly, your bandwidth would be about 10 MB/sec (you
only need 0.5 MB/sec) and the time to send your packets would be down in
the 100-200 microsecond range. The probability of collision would be
greatly reduced, and the time that such a collision would affect traffic
ditto. You could also at least consider scaling up BEYOND just 16
nodes.
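The same arithmetic at switched 100BT (again using this thread's figures; "wire time" here is the quoted 100-200 microsecond range and ignores switch and stack latency):

```python
# Same per-task budget as before, but with the exchange cost dropped
# to the ~100-200 us range quoted for switched 100BT.

T_COMPUTE_US = 30_000
T_COMM_US = 150            # midpoint of the 100-200 us estimate

per_node = 1_000_000 / (T_COMPUTE_US + T_COMM_US)   # ~33 tasks/sec/node
master_ceiling = 1_000_000 / T_COMM_US              # exchanges/sec
max_nodes = master_ceiling / per_node               # before master saturates

print(f"per node: ~{per_node:.1f} tasks/sec")
print(f"nodes before the master saturates: ~{max_nodes:.0f}")
```

Communication essentially disappears from the per-task budget, and the master's communication ceiling moves from ~16 nodes out to a couple of hundred -- which is what makes scaling beyond 16 nodes thinkable at all.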
Either way you're going to want to try to keep task distribution and
execution as smooth and synchronous as possible. But I >>think<< you
could scale out as far as 15-16 nodes on 10BT with at least positive
gain (maybe or maybe not reaching your design goal), and am pretty sure
you could reach your design goal with 100BT. You're coming fairly close
to saturating 10Base with your data stream, and 10Base cards (especially
ISA cards, which I HOPE are not involved!) aren't really stellar
performers and tend to require a lot of system resources as they
function (i.e. they may not support any sort of DMA and may require the
full attention of the CPU/kernel while sending or receiving a stream).
rgb
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu