3 The Dragonfly Network Topology

- A two-level, directly connected topology
- Uses high-radix routers
  - Large number of ports per router
  - Each port has moderate bandwidth
- p: number of compute nodes connected to a router
- a: number of routers in a group
- h: number of global channels per router
- Router radix: k = a + p + h - 1
- Recommended configuration: a = 2p = 2h

Using high-radix routers reduces the diameter of the network. Increasing the degree of the router lowers the hop count, which leads to low latency and low network cost. Because global channels can be expensive, high-radix routers also help limit the number of global channels traversed by a packet.

Comments by Chris: The network size grows with the fourth power of p, the number of compute nodes per router. Substituting the recommended configuration into N = p*a*g, where g is the number of groups, gives the precise equation N = 4*p^4 + 2*p^2. So, to get a billion-node dragonfly network, p only needs to be about 128. This is interesting because a torus grows as K^D, where K is the arity and D is the number of dimensions. For example, Blue Gene/L was a 32^3 system by design, providing up to 64K cores; IBM later pushed the design out to allow a much longer Z dimension, so K was no longer the same in each dimension. The billion-node torus networks we ran before were 32^6, which is very close to the p = 128 dragonfly network.
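The sizing rule in the comments can be checked with a short sketch. It assumes the recommended configuration a = 2p = 2h and a maximum-size network with g = a*h + 1 groups; that group count is not stated on the slide but is the value that reproduces N = 4*p^4 + 2*p^2. The names `DragonflySize` and `balanced_dragonfly` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DragonflySize:
    p: int      # compute nodes per router
    a: int      # routers per group
    h: int      # global channels per router
    k: int      # router radix
    g: int      # number of groups
    nodes: int  # total compute nodes, N = p * a * g


def balanced_dragonfly(p: int) -> DragonflySize:
    """Size a dragonfly using the recommended configuration a = 2p = 2h."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    a = 2 * p
    h = p
    k = a + p + h - 1
    # Assumption: maximum-size network, one global channel between
    # every pair of groups, which gives g = a*h + 1.
    g = a * h + 1
    return DragonflySize(p=p, a=a, h=h, k=k, g=g, nodes=p * a * g)


if __name__ == "__main__":
    size = balanced_dragonfly(128)
    print(size.nodes)   # dragonfly node count for p = 128
    print(32 ** 6)      # 6-D torus with 32 nodes per dimension, for comparison
```

For p = 128 the node count is 4*128^4 + 2*128^2, slightly above the 32^6 torus, which matches the comparison made in the comments.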

5 Dragonfly Model Configuration

- Traffic patterns
  - Uniform Random traffic (UR)
  - Nearest Neighbor traffic (or Worst Case traffic, WC)
- Virtual channels
  - To avoid deadlocks
- Credit-based flow control
  - Upstream nodes/routers keep track of buffer slots
- An input-queued virtual channel router
  - Each router port supports up to v virtual channels

Uniform Random traffic: each packet generated by the model chooses a destination node at random from a uniform distribution.

Nearest Neighbor traffic: each node in a group sends a message to a random node in the neighboring group.

Credit-based flow control: the upstream node keeps a count of free buffer slots in the downstream VCs.

Router model: an input-queued virtual channel router with a specified number of input and output ports, each port supporting up to v virtual channels.
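A minimal sketch of the credit bookkeeping described on this slide, from the upstream side. It is an illustration only, not the ROSS model's code; the class name `CreditTracker` and its methods are hypothetical, and one credit is assumed to correspond to one buffer slot.

```python
class CreditTracker:
    """Upstream-side count of free buffer slots in each downstream VC."""

    def __init__(self, num_vcs: int, slots_per_vc: int) -> None:
        if num_vcs < 1 or slots_per_vc < 1:
            raise ValueError("num_vcs and slots_per_vc must be positive")
        self.capacity = slots_per_vc
        self.credits = [slots_per_vc] * num_vcs

    def can_send(self, vc: int) -> bool:
        # A packet may be sent only if the downstream VC has a free slot.
        return self.credits[vc] > 0

    def send(self, vc: int) -> None:
        # Sending consumes one credit (one downstream buffer slot).
        if not self.can_send(vc):
            raise RuntimeError(f"no credits left on VC {vc}")
        self.credits[vc] -= 1

    def credit_returned(self, vc: int) -> None:
        # The downstream router freed a slot and sent a credit back.
        if self.credits[vc] >= self.capacity:
            raise RuntimeError(f"credit overflow on VC {vc}")
        self.credits[vc] += 1


if __name__ == "__main__":
    tracker = CreditTracker(num_vcs=2, slots_per_vc=2)
    tracker.send(0)
    tracker.send(0)
    print(tracker.can_send(0))  # VC 0 is full until a credit comes back
    tracker.credit_returned(0)
    print(tracker.can_send(0))
```

Because the upstream side never sends without a credit, the finite downstream buffers cannot overflow.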

7 Dragonfly Model Routing Algorithms

- Minimal routing (MIN)
  - Uniform random traffic: high throughput, low latency
  - Nearest neighbor traffic: causes congestion, high latency, low throughput
- Non-minimal routing (VAL)
  - Half the throughput of MIN under UR traffic
  - Nearest neighbor traffic: optimal performance (about 50% throughput)
- Global adaptive routing
  - Chooses between MIN and VAL by sensing the traffic conditions on the global channels

With uniform random traffic, MIN gives the optimal throughput because the traffic is scattered over the entire network. With nearest neighbor traffic, MIN congests the single global channel between the two groups because it always prefers the shortest path.

Non-minimal routing first diverts traffic to a randomly selected intermediate group and then forwards it to the destination group. Under UR traffic this gives half the throughput of MIN. With nearest neighbor traffic it gives the best possible throughput, because the traffic is spread over intermediate groups instead of a single global channel.
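The difference between MIN and VAL can be sketched at the group level. This is a simplified illustration that ignores hops inside a group; the function names are hypothetical, and excluding the source and destination groups from the intermediate choice is an assumption, not something the slide specifies.

```python
import random


def min_group_path(src_group: int, dst_group: int) -> list[int]:
    """Minimal routing: go straight to the destination group."""
    if src_group == dst_group:
        return [src_group]
    return [src_group, dst_group]


def val_group_path(src_group: int, dst_group: int, num_groups: int,
                   rng: random.Random) -> list[int]:
    """Non-minimal routing: detour through a random intermediate group."""
    if src_group == dst_group:
        return [src_group]
    # Assumption: the intermediate group differs from source and destination.
    candidates = [g for g in range(num_groups)
                  if g not in (src_group, dst_group)]
    if not candidates:
        return [src_group, dst_group]
    return [src_group, rng.choice(candidates), dst_group]


if __name__ == "__main__":
    rng = random.Random(42)
    print(min_group_path(0, 1))
    print(val_group_path(0, 1, num_groups=9, rng=rng))
```

The VAL path crosses two global channels instead of one, which is consistent with it reaching half the throughput of MIN under uniform random traffic.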

10 Dragonfly Model Validation

- Dragonfly network topologies in design
  - PERCS network topology
  - Machines from the Echelon project
- Booksim
  - A cycle-accurate simulator with a dragonfly model
  - Used by Dally et al. to validate the dragonfly topology proposal
  - Runs in serial mode only
  - Supports minimal and global adaptive routing
  - Performance results shown on 1,024 nodes and 264 routers
- We validated our ROSS dragonfly model against booksim

The IBM PERCS network, which was intended to be part of the Blue Waters system, has a topology similar to the dragonfly.

We validated the correctness of our dragonfly model against booksim. Booksim is an open-source, cycle-accurate simulator used by Kim, Dally et al. to validate the dragonfly topology proposal.

Similarities between ROSS and booksim:
- Both support virtual channels, credit-based flow control, and finite buffers.
- Both support uniform random and nearest neighbor traffic patterns.
- Both use single-flit packets.

The router arbitration policy in ROSS is FCFS because it is simple to implement in a discrete event simulation. As the results show, changing the router arbitration policy does not significantly affect the results.

11 Global Adaptive Routing: Threshold Selection (ROSS vs. Booksim)

- Booksim uses an adaptive threshold to bias the UGAL algorithm towards minimal or non-minimal routing
- We incorporated a similar threshold in ROSS
- We ran experiments to find the threshold value that best biases traffic towards non-minimal routing
- The value that yields the maximum number of non-minimal packets is -180

Global adaptive routing decision:

  if min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold then
    route minimally
  else
    route non-minimally
  end if

To load balance nearest neighbor traffic, we use the UGAL routing algorithm, which is based on the decision rule above. Booksim's adaptive threshold is set to a positive value (currently 30) to bias the algorithm towards minimal routing under uniform random traffic. ROSS uses a similar adaptive threshold to bias the routing decision.

Because booksim does not specify the optimal threshold value for global adaptive routing, we searched for the threshold value, for both ROSS and booksim, that best biases traffic towards non-minimal routing. We ran experiments with different negative threshold values and recorded the number of minimal and non-minimal packets for each. We selected a threshold of -180 for both ROSS and booksim because it sends the maximum number of packets along non-minimal paths and gives the minimum latency.
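The threshold comparison on this slide can be exercised directly. The sketch below implements only that inequality; the function name `choose_route` and the queue sizes in the usage example are hypothetical, while the thresholds 30 and -180 are the values quoted on the slide.

```python
def choose_route(min_queue_size: int, nonmin_queue_size: int,
                 adaptive_threshold: int) -> str:
    """Return "MIN" or "VAL" using the slide's queue-size comparison."""
    if min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold:
        return "MIN"   # route minimally
    return "VAL"       # route non-minimally


if __name__ == "__main__":
    # Hypothetical queue occupancies, for illustration only.
    min_q, nonmin_q = 10, 20
    print(choose_route(min_q, nonmin_q, adaptive_threshold=30))
    print(choose_route(min_q, nonmin_q, adaptive_threshold=-180))
```

A positive threshold enlarges the right-hand side, so the minimal path wins more often. A large negative threshold such as -180 makes the inequality hard to satisfy, which pushes most packets onto non-minimal paths.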

12 ROSS vs. booksim: Uniform Random Traffic

- With minimal routing (MIN), ROSS has an average difference of 4.2% and a maximum difference of 7% from booksim results.
- With global adaptive routing (UGAL), ROSS has an average difference of 3% and a maximum difference of 7.8% from booksim results.

13 ROSS vs. booksim: Nearest Neighbor Traffic

- Nearest neighbor traffic yields very high latency and low throughput with minimal routing.
- This traffic pattern can be load balanced by either non-minimal or adaptive routing.
- Non-minimal routing gives slightly under 50% throughput with nearest neighbor traffic, which is the maximum throughput that can be achieved under this kind of traffic.

Summary of the comparison:
- Both ROSS and booksim give low latency and high throughput for MIN routing under UR traffic.
- Both simulators give high latency and low throughput for MIN routing under WC traffic.
- In both simulators, UGAL behaves like MIN routing under UR traffic.
- Both simulators yield slightly under 50% throughput for UGAL under WC traffic.
- The small differences between the simulators can be due to:
  - We approximated the internal speedup for booksim.
  - The simulators use different random number generators.

15 Dragonfly Performance: ROSS vs. booksim

- We compared the performance of ROSS and booksim by measuring the simulation execution time
- As booksim runs serially, we configured ROSS in its serial mode
- Both simulators ran a warm-up phase of 30,000 cycles and a measurement phase of 30,000 cycles
- Tests were carried out on dual core Intel X5650s running at 2.67 GHz
- ROSS attains the following speedups over booksim
  - MIN routing: minimum of 5x, maximum of 11x
  - Global adaptive routing: minimum of 5.3x, maximum of 12.38x

17 ROSS Dragonfly Model on BG/P and BG/Q

- We evaluated the strong scaling characteristics of the dragonfly model on
  - Argonne Leadership Computing Facility (ALCF) IBM Blue Gene/P system (Intrepid)
  - Computational Center for Nanotechnology Innovations (CCNI) IBM Blue Gene/Q
- We scheduled 64 MPI tasks per node on BG/Q and 4 MPI tasks per node on BG/P
- Performance was evaluated through the following metrics
  - Committed event rate
  - Percentage of remote events
  - ROSS event efficiency
  - Simulation run time

Intrepid's BG/P has 40 racks, each with 1,024 nodes. It is equipped with a 3D torus network used for point-to-point communication and collective operations.

The CCNI BG/Q has 1 rack with 1,024 nodes. It is equipped with a 5D torus network. Each node has 16 cores, and each core supports 4 threads.

18 ROSS Parameters

- ROSS employs the Time Warp optimistic synchronization protocol
- To reduce state-saving overheads, ROSS employs an event rollback mechanism
- ROSS event efficiency determines the amount of useful work performed by the simulation
- Global Virtual Time (GVT) imposes a lower bound on the simulation time
- GVT is controlled by the batch and gvt-interval parameters
  - batch is the number of events that ROSS processes before returning to the top of the scheduling loop and checking for the arrival of remote events and messages
  - gvt-interval is the number of iterations of the main event scheduling loop that ROSS performs before initiating a GVT computation
  - On average, batch * gvt-interval events are processed between GVT epochs
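The relation between batch and gvt-interval can be illustrated with a toy scheduler loop. This is a sketch of the counting argument only, not ROSS code; the function name and the parameter values in the usage example are hypothetical.

```python
def events_between_gvt(batch: int, gvt_interval: int,
                       pending_events: int) -> int:
    """Count events processed in one GVT epoch of a simplified loop."""
    if batch < 1 or gvt_interval < 1 or pending_events < 0:
        raise ValueError("batch and gvt_interval must be positive, "
                         "pending_events must be non-negative")
    processed = 0
    for _ in range(gvt_interval):
        # One pass of the scheduling loop handles at most `batch` events.
        processed += min(batch, pending_events - processed)
        # A real simulator would poll for remote events and messages here.
    # After gvt_interval passes, a GVT computation would be initiated.
    return processed


if __name__ == "__main__":
    # Hypothetical settings, chosen only for illustration.
    print(events_between_gvt(batch=16, gvt_interval=512,
                             pending_events=1_000_000))
```

When enough events are pending, the count equals batch * gvt-interval. With fewer pending events the epoch processes fewer, which is why the slide describes the product as an average.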

19 ROSS Dragonfly Performance Results on BG/P vs. BG/Q

- Event efficiency drops and total rollbacks increase on BG/P after 16K MPI tasks
- Less off-node communication on BG/Q than on BG/P
- Each MPI task has more processing power on BG/P, so the simulation advances quickly

Because BG/Q runs 64 MPI tasks per node versus 4 MPI tasks per node on BG/P, there is more off-node communication on BG/P than on BG/Q. We conjecture that sending a message through memory on BG/Q takes less time than sending an off-node message on BG/P. Therefore, the probability of a message arriving late is lower on BG/Q than on BG/P.

Each BG/Q core has 1.6 GHz of processing power divided among 4 threads, so each thread gets 400 MHz, versus 850 MHz on BG/P. This causes MPI ranks on BG/Q to advance more slowly through simulated time and also lowers the potential for rollbacks relative to BG/P.

20 ROSS Dragonfly Performance Results on BG/P vs. BG/Q

- This increases computation and dominates the number of events being rolled back.

The event efficiency stays high on both BG/P and BG/Q because each MPI task has a substantial workload. The computation performed by each MPI task dominates the number of rolled-back events.