Parallel Algorithms.

2 Computation Models
The goal of a computation model is to provide a realistic representation of the costs of programming.
A model gives algorithm designers and programmers a measure of algorithm complexity, which helps them decide what is “good” (i.e. performance-efficient).

3 Goal for Modeling
We want to develop computational models which accurately represent the cost and performance of programs.
If the model is poor, the optimum in the model may not coincide with the optimum observed in practice.
[Figure: "Model" vs. "Real World", with the optimum marked in each]

4 Models of Computation
What’s a model good for?
Provides a way to think about computers. Influences the design of:
  Architectures
  Languages
  Algorithms
Provides a way of estimating how well a program will perform.
The cost in the model should be roughly the same as the cost of executing the program.

5 The Random Access Machine Model
RAM model of serial computers:
  Memory is a sequence of words, each capable of containing an integer.
  Each memory access takes one unit of time.
  Basic operations (add, multiply, compare) take one unit of time.
  Instructions are not modifiable.
  Read-only input tape, write-only output tape.

6 Has RAM influenced our thinking?
Language design:
  No way to designate registers, cache, DRAM.
  Most convenient disk access is as streams.
  How do you express atomic read/modify/write?
Machine & system design:
  It’s not very easy to modify code.
  Systems pretend instructions are executed in order.
Performance analysis:
  Primary measures are operations/sec (MFlop/sec, MHz, ...).
  What’s the difference between Quicksort and Heapsort?

7 What about parallel computers?
The RAM model is generally considered a very successful “bridging model” between programmer and hardware.
“Since RAM is so successful, let’s generalize it for parallel computers ...”

9 PRAM model of computation
Shared memory
p processors, each with local memory
Synchronous operation
Shared memory reads and writes
Each processor has a unique id in the range 1 to p

10 Characteristics
At each unit of time, a processor is either active or idle (depending on its id).
All processors execute the same program.
At each time step, all processors execute the same instruction on different data (“data-parallel”).
Focuses on concurrency only.

14 Why study PRAM algorithms?
Well-developed body of literature on the design and analysis of such algorithms
Baseline model of concurrency
Explicit model
  Specify operations at each step
  Scheduling of operations on processors
Robust design paradigm

20 Points to note about the WT program
Global program: no references to processor id
Contains both serial and concurrent operations
Semantics of forall
Order of additions differs from the sequential order: associativity is critical
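The forall semantics and the reordering of additions can be illustrated with a small sketch. The Python below simulates a tree-shaped sum in which each pass of the outer loop stands for one forall step. The function name `parallel_sum` and the sequential inner loop are assumptions made for illustration; this is not the WT program from the slides.

```python
def parallel_sum(values):
    """Simulate a tree-shaped reduction; returns (total, number_of_rounds).

    Hypothetical helper: the inner for-loop stands in for a PRAM "forall",
    where all additions of one round would happen in a single time step.
    """
    a = list(values)
    n = len(a)
    rounds = 0
    stride = 1
    while stride < n:
        # One "forall" step: each pair (i, i + stride) is independent of the
        # others, because a[i + stride] is never written during this round.
        for i in range(0, n - stride, 2 * stride):
            a[i] = a[i] + a[i + stride]
        stride *= 2
        rounds += 1
    return (a[0] if a else 0), rounds


if __name__ == "__main__":
    print(parallel_sum(range(1, 9)))  # (36, 3)
```

The additions are grouped as ((1+2)+(3+4))+((5+6)+(7+8)) rather than left to right, which gives the same result only because the operator is associative.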

24 List Ranking
List ranking problem
  Given a singly linked list L with n objects, compute for each node the distance to the end of the list.
  If d denotes the distance:
    node.d = 0                 if node.next = nil
    node.d = node.next.d + 1   otherwise
Serial algorithm: O(n)
Parallel algorithm
  Assign one processor to each node (assume there are as many processors as list objects).
  Initialize i.d = 0 if i.next = nil, and i.d = 1 otherwise.
  For each node i, repeat while i.next != nil:
    i.d = i.d + i.next.d
    i.next = i.next.next   // pointer jumping
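A minimal sketch of list ranking by pointer jumping, assuming the list is given as an array of successor indices with `None` at the tail. The name `list_rank` and the array representation are illustrative choices. The synchronous PRAM step is imitated by reading from the old arrays and writing to fresh copies.

```python
def list_rank(next_):
    """Return (d, rounds), where d[i] is the distance from node i to the tail.

    next_[i] is the index of the successor of node i, or None for the tail.
    """
    n = len(next_)
    nxt = list(next_)
    # Initialisation: the tail is 0 steps from the end, every other node 1.
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    rounds = 0
    while any(p is not None for p in nxt):
        # Synchronous step: all reads use the old arrays, all writes go to
        # copies, so the loop order over i does not matter.
        new_d, new_nxt = list(d), list(nxt)
        for i in range(n):
            if nxt[i] is not None:
                new_d[i] = d[i] + d[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]  # pointer jumping
        d, nxt = new_d, new_nxt
        rounds += 1
    return d, rounds


if __name__ == "__main__":
    # List 0 -> 1 -> 2 -> 3 -> 4
    print(list_rank([1, 2, 3, 4, None]))  # ([4, 3, 2, 1, 0], 3)
```

Each round doubles the span covered by every remaining pointer, so a list of 5 nodes finishes in 3 rounds.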

32 Pointer jumping
Fast parallel processing of linked data structures (lists, trees)
Convention: draw trees with edges directed from children to parents
Example: finding the roots of a forest represented as a parent array P
  P[i] = j if and only if (i, j) is a forest edge
  P[i] = i if and only if i is a root
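The root-finding example can be sketched as follows, assuming the forest is given as a Python list `parent` that follows the convention above (`parent[i] == i` marks a root). The function name `find_roots` is hypothetical, and the list comprehension stands in for one synchronous step over all nodes.

```python
def find_roots(parent):
    """Return (roots, rounds), where roots[i] is the root of node i's tree."""
    p = list(parent)
    rounds = 0
    while True:
        # One synchronous step: every node i replaces its pointer by its
        # grandparent pointer. Several nodes may read the same p[j], which
        # is why this needs concurrent reads on a PRAM.
        new_p = [p[p[i]] for i in range(len(p))]
        if new_p == p:
            return p, rounds
        p = new_p
        rounds += 1


if __name__ == "__main__":
    # Two trees: 3 -> 2 -> 1 -> 0 (root 0) and 5 -> 4 (root 4).
    print(find_roots([0, 0, 1, 2, 4, 4]))  # ([0, 0, 0, 0, 4, 4], 2)
```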

39 Concurrent Read: Finding Roots
This is a CREW algorithm.
Suppose exclusive read is used. What will the running time be?
  Initially only one node i has the root information.
  First iteration: another node reads from node i, so two nodes in total are filled.
  Second iteration: another two nodes can read from those two nodes, so four nodes in total are filled.
  k-th iteration: another 2^(k-1) nodes read, so 2^k nodes in total are filled.
  If there are n nodes, k = log n.
So Find_root with exclusive read takes O(log n).
O(log log n) with concurrent read vs. O(log n) with exclusive read
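The doubling argument can be checked with a short simulation. The sketch below assumes n >= 1 cells and a single initial copy of the root information in cell 0. The name `erew_broadcast` and the pairing of reader `filled + j` with source cell `j` are illustrative assumptions.

```python
def erew_broadcast(value, n):
    """Copy `value` into n cells with at most one reader per cell per round.

    Returns (cells, rounds). Assumes n >= 1.
    """
    cells = [None] * n
    cells[0] = value
    filled, rounds = 1, 0
    while filled < n:
        # Each already-filled cell j is read by exactly one new cell,
        # filled + j, so no cell is read twice in the same round.
        readers = min(filled, n - filled)
        for j in range(readers):
            cells[filled + j] = cells[j]
        filled += readers
        rounds += 1
    return cells, rounds


if __name__ == "__main__":
    print(erew_broadcast("r", 8)[1])  # 3
```

For n = 8 the filled count goes 1, 2, 4, 8, so three rounds are needed, matching k = log n.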

42 Computing the Depth
Problem definition
  Given a binary tree with n nodes, compute the depth of each node.
  The serial algorithm takes O(n) time.
A simple parallel algorithm
  Starting from the root, compute the depths level by level.
  Still O(n), because the height of the tree could be as large as n.
Euler tour algorithm
  Uses parallel prefix computation.

43 Computing the Depth (2)
Euler tour: a cycle that traverses each edge exactly once in a graph.
  It is taken on a directed version of the tree: regard each undirected edge as two directed edges.
  Any directed version of a tree has an Euler tour, obtained by traversing the tree in DFS order, which forms a linked list.
Employ 3*n processors.
  Each node i has fields i.parent, i.left, i.right.
  Each node i has three processors, i.A, i.B, and i.C.
The three processors of each node are linked as follows:
  i.A = i.left.A     if i.left != nil
        i.B          if i.left = nil
  i.B = i.right.A    if i.right != nil
        i.C          if i.right = nil
  i.C = i.parent.B   if i is the left child
        i.parent.C   if i is the right child
        nil          if i.parent = nil

44 Computing the Depth (3)
Algorithm
  Construct the Euler tour for the tree: O(1) time.
  Assign 1 to all A processors, 0 to B processors, and -1 to C processors.
  Perform a parallel prefix computation.
  The depth of each node resides in its C processor.
Running time: O(log n)
  Actually log 3n, since the list has 3n elements.
  EREW, because there is no concurrent read or write.
Speedup: S = n / log n
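A minimal sketch of the Euler tour depth computation, assuming the tree is given as three index arrays `left`, `right`, and `parent` with `None` for nil. The function name and this representation are illustrative. For brevity the prefix sums are accumulated by walking the list sequentially; on a PRAM this walk would be replaced by the O(log n) parallel prefix computation.

```python
def depths_by_euler_tour(left, right, parent):
    """Return depth[i] for every node i of a binary tree (root has depth 0)."""
    n = len(parent)
    A, B, C = 0, 1, 2
    # Build the successor of each of the 3n "processors" (node, kind).
    succ = {}
    for i in range(n):
        succ[(i, A)] = (left[i], A) if left[i] is not None else (i, B)
        succ[(i, B)] = (right[i], A) if right[i] is not None else (i, C)
        if parent[i] is None:
            succ[(i, C)] = None               # end of the tour at the root
        elif left[parent[i]] == i:
            succ[(i, C)] = (parent[i], B)     # i is the left child
        else:
            succ[(i, C)] = (parent[i], C)     # i is the right child
    weight = {A: 1, B: 0, C: -1}
    depth = [None] * n
    running = 0
    node = (parent.index(None), A)            # the tour starts at root.A
    # Sequential stand-in for the parallel prefix computation.
    while node is not None:
        running += weight[node[1]]
        if node[1] == C:
            depth[node[0]] = running          # the depth resides in i.C
        node = succ[node]
    return depth


if __name__ == "__main__":
    # Tree: 0 has children 1 (left) and 2 (right); 1 has left child 3.
    left = [1, 3, None, None]
    right = [2, None, None, None]
    parent = [None, 0, 0, 1]
    print(depths_by_euler_tour(left, right, parent))  # [0, 1, 1, 2]
```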

49 Simulating CRCW with EREW
CRCW algorithms can be faster than EREW algorithms. How much faster?
Theorem
  A p-processor CRCW algorithm can be no more than O(log p) times faster than the best p-processor EREW algorithm.
Proof by simulating CRCW steps with EREW steps
  Assumption: a parallel sort takes O(log n) time with n processors.
  When CRCW processor p_i writes a datum x_i into location l_i, EREW processor p_i writes the pair (l_i, x_i) into a separate location A[i].
    Note that the EREW write is exclusive, while the CRCW write may be concurrent.
  Sort A by l_i.
    O(log p) time by assumption.
  Compare adjacent elements in A.
  For each group of elements with the same l_i, only one processor, say the first, writes x_i into global memory location l_i.
    Note that this is also exclusive.
Total time complexity: O(log p) per simulated CRCW step
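The simulation of one concurrent write step can be sketched as follows. The function name `simulate_concurrent_write` is hypothetical, Python's built-in sort stands in for the assumed O(log p) parallel sort, and ties are broken by processor id so that "the first" processor of each group is the one with the lowest id (an assumption; the slide does not fix the choice).

```python
def simulate_concurrent_write(memory, requests):
    """Apply one concurrent write step using only exclusive writes.

    requests[i] = (location, value) that processor i wants to write.
    Returns the updated memory list.
    """
    # Step 1: processor i writes its pair into its own cell A[i] (exclusive).
    a = [(loc, i, val) for i, (loc, val) in enumerate(requests)]
    # Step 2: sort A by location (stand-in for the parallel sort).
    a.sort(key=lambda t: (t[0], t[1]))
    # Step 3: each processor compares its entry with the previous one; only
    # the first entry of each group of equal locations performs the write,
    # so every memory cell is written by at most one processor.
    for k, (loc, _proc, val) in enumerate(a):
        if k == 0 or a[k - 1][0] != loc:
            memory[loc] = val
    return memory


if __name__ == "__main__":
    reqs = [(2, "a"), (0, "b"), (2, "c"), (0, "d"), (3, "e")]
    print(simulate_concurrent_write([0] * 4, reqs))  # ['b', 0, 'a', 'e']
```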