4
Fundamental Issue #1: Naming
- Naming:
  - what data is shared
  - how it is addressed
  - what operations can access data
  - how processes refer to each other
- Choice of naming affects code produced by a compiler:
  - via load, where the compiler just remembers the address, or
  - keep track of processor number and local virtual address for message passing
- Choice of naming affects replication of data:
  - via load in the cache memory hierarchy, or
  - via SW replication and consistency
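The two naming styles above can be contrasted in a minimal sketch. Everything here (`shared_memory`, `local_memories`, `RemoteName`, `load`, `fetch`, and the addresses) is a hypothetical illustration, not part of the original slides: a shared address space names a datum by its address alone, while message passing names it by processor number plus local virtual address.

```python
from dataclasses import dataclass

# Shared address space: one global address is the whole name.
shared_memory = {0x1000: 42}              # address -> value

def load(addr):
    """A plain load: the compiler only has to remember the address."""
    return shared_memory[addr]

# Message passing: the name is (processor number, local virtual address).
@dataclass(frozen=True)
class RemoteName:
    proc: int
    local_addr: int

local_memories = {0: {}, 1: {0x20: 42}}   # proc -> (local address -> value)

def fetch(name):
    """Stands in for a request message to name.proc and its reply."""
    return local_memories[name.proc][name.local_addr]
```

In the first style the hardware (caches, address translation) can replicate the datum transparently; in the second, any replication has to be done in software because the name itself carries the location.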

5
Fundamental Issue #1: Naming
- Global physical address space: any processor can generate, address, and access it in a single operation
  - memory can be anywhere: virtual address translation handles it
- Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program
- Segmented shared address space: locations are named uniformly for all processes of the parallel program

8
Scalable Machines
- What are the design trade-offs for the spectrum of machines in between?
  - specialized or commodity nodes?
  - capability of node-to-network interface
  - supporting programming models?
- What does scalability mean?
  - avoids inherent design limits on resources
  - bandwidth increases with P
  - latency does not
  - cost increases slowly with P

9
Bandwidth Scalability
- What fundamentally limits bandwidth?
  - single set of wires
- Must have many independent wires
- Connect modules through switches
- Bus vs. network switch?
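A back-of-the-envelope sketch of why a single set of wires limits bandwidth. This is an idealized model, not a measurement: it assumes every link has the same bandwidth `link_bw`, that a bus is one shared link, and that a switched network gives each of the `p` nodes its own link with no conflicting traffic. The function names are hypothetical.

```python
def bus_bandwidth_per_node(link_bw, p):
    """One shared set of wires: p nodes divide a fixed bandwidth."""
    return link_bw / p

def switched_aggregate_bandwidth(link_bw, p):
    """Independent wires through switches: in the ideal case every
    node drives its own link at once, so the total grows with p."""
    return link_bw * p
```

With `link_bw = 1.0` and `p = 4`, each node on the bus gets a quarter of the link, while the idealized switched network offers four links' worth in aggregate.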

12
Key Property
- Large number of independent communication paths between nodes
  => allow a large number of concurrent transactions using different wires
  - initiated independently
  - no global arbitration
  - effect of a transaction only visible to the nodes involved
    - effects propagated through additional transactions

16
Key Properties of Shared Address Abstraction
- Source and destination data addresses are specified by the source of the request
  - a degree of logical coupling and trust
- No storage logically "outside the address space"
  - may employ temporary buffers for transport
- Operations are fundamentally request-response
- Remote operation can be performed on remote memory
  - logically does not require intervention of the remote processor
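The request-response property can be sketched as follows. This is a toy model under stated assumptions: the `Request`/`Response` message layout, the `Node` class, the `memory_controller` method, and the block-interleaved mapping from global address to home node (`WORDS_PER_NODE`) are all hypothetical. The point it illustrates is that the requester supplies the address and the remote memory is accessed without running any code on the remote processor.

```python
from dataclasses import dataclass

WORDS_PER_NODE = 4   # hypothetical: each node owns 4 words of the global space

@dataclass
class Request:
    op: str          # "read" or "write"
    src: int         # requesting node
    addr: int        # global address, chosen by the requester
    value: int = 0   # payload for writes

@dataclass
class Response:
    dst: int
    addr: int
    value: int

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = [0] * WORDS_PER_NODE

    def memory_controller(self, req):
        """Serves the request directly against local memory; the node's
        processor is not involved in this path."""
        offset = req.addr % WORDS_PER_NODE
        if req.op == "write":
            self.memory[offset] = req.value
        return Response(dst=req.src, addr=req.addr, value=self.memory[offset])

def remote_access(nodes, req):
    """Route the request to the home node and return the reply's value."""
    home = nodes[req.addr // WORDS_PER_NODE]
    return home.memory_controller(req).value
```

For example, with three nodes, a write to global address 5 lands in word 1 of node 1, and a later read of address 5 from any node returns the written value.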

26
Challenges in Realizing Prog. Models in the Large
- One-way transfer of information
- No global knowledge, nor global control
  - barriers, scans, reduce, global-OR give fuzzy global state
- Very large number of concurrent transactions
- Management of input buffer resources
  - many sources can issue a request and over-commit the destination before any see the effect
- Latency is large enough that you are tempted to "take risks"
  - optimistic protocols
  - large transfers
  - dynamic allocation
- Many, many more degrees of freedom in the design and engineering of these systems

34
Generic Solution: Directories
- Maintain state vector explicitly
  - associate with memory block
  - records state of block in each cache
- On miss, communicate with directory
  - determine location of cached copies
  - determine action to take
  - conduct protocol to maintain coherence

35
Administrative Break
- Project descriptions due today
- Properties of a good project
  - There is an idea
  - There is a body of background work
  - There is something that differentiates the idea
  - There is a reasonable way to evaluate the idea

36
A Cache Coherent System Must:
- Provide set of states, state transition diagram, and actions
- Manage coherence protocol
  - (0) Determine when to invoke coherence protocol
  - (a) Find info about state of block in other caches to determine action
    - whether need to communicate with other cached copies
  - (b) Locate the other copies
  - (c) Communicate with those copies (inval/update)
- (0) is done the same way on all systems
  - state of the line is maintained in the cache
  - protocol is invoked if an "access fault" occurs on the line
- Different approaches distinguished by (a) to (c)

37
Bus-based Coherence
- All of (a), (b), (c) done through broadcast on bus
  - faulting processor sends out a "search"
  - others respond to the search probe and take necessary action
- Could do it in scalable network too
  - broadcast to all processors, and let them respond
- Conceptually simple, but broadcast doesn't scale with p
  - on bus, bus bandwidth doesn't scale
  - on scalable network, every fault leads to at least p network transactions
- Scalable coherence:
  - can have same cache states and state transition diagram
  - different mechanisms to manage protocol

38
One Approach: Hierarchical Snooping
- Extend snooping approach: hierarchy of broadcast media
  - tree of buses or rings (KSR-1)
  - processors are in the bus- or ring-based multiprocessors at the leaves
  - parents and children connected by two-way snoopy interfaces
    - snoop both buses and propagate relevant transactions
  - main memory may be centralized at root or distributed among leaves
- Issues (a) - (c) handled similarly to bus, but not full broadcast
  - faulting processor sends out "search" bus transaction on its bus
  - propagates up and down hierarchy based on snoop results
- Problems:
  - high latency: multiple levels, and snoop/lookup at every level
  - bandwidth bottleneck at root
- Not popular today

39
Scalable Approach: Directories
- Every memory block has associated directory information
  - keeps track of copies of cached blocks and their states
  - on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary
  - in scalable networks, communication with directory and copies is through network transactions
- Many alternatives for organizing directory information
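A rough count of network transactions per miss shows why talking only to the nodes that hold copies matters. The message counts below are a simplifying assumption for illustration (one probe and one response per other processor for broadcast; one request and one reply for the directory, plus an invalidation and an acknowledgement per sharer); real protocols differ in detail, and the function names are hypothetical.

```python
def broadcast_transactions(p):
    """Search sent to each of the other p - 1 processors,
    each of which sends a response."""
    return 2 * (p - 1)

def directory_transactions(sharers):
    """Request to the directory and its reply, plus an invalidation
    and an acknowledgement for each node that actually holds a copy."""
    return 2 + 2 * sharers
```

Under these assumptions, a miss on a 64-processor machine costs 126 transactions with broadcast, but only 6 with a directory when two nodes hold copies; the directory cost depends on the number of sharers, not on the machine size.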

40
Basic Operation of Directory
- k processors
- With each cache-block in memory: k presence-bits, 1 dirty-bit
- With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit
- Read from main memory by processor i:
    if dirty-bit OFF then { read from main memory; turn p[i] ON; }
    if dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }
- Write to main memory by processor i:
    if dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
    ...
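The read and write cases above can be sketched as a small single-block simulation. This is an illustrative sketch, not the slides' definitive protocol: `DirectoryEntry`, `Cache`, `read_miss`, and `write_miss` are hypothetical names, each cache holds only one line, clearing the presence bits of invalidated caches is an assumption, and the write case with the dirty-bit ON is elided in the source, so it is left unimplemented here.

```python
class DirectoryEntry:
    """Directory state kept with one memory block, for k processors."""
    def __init__(self, k, value=0):
        self.presence = [False] * k   # p[0..k-1]
        self.dirty = False
        self.memory_value = value

class Cache:
    """A single cache line: valid bit, dirty (owner) bit, data."""
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.value = None

def read_miss(entry, caches, i):
    if entry.dirty:
        # Recall the line from the dirty processor (its state becomes
        # shared), update memory, and turn the dirty-bit OFF.
        owner = entry.presence.index(True)
        caches[owner].dirty = False
        entry.memory_value = caches[owner].value
        entry.dirty = False
    entry.presence[i] = True
    caches[i].valid, caches[i].dirty = True, False
    caches[i].value = entry.memory_value   # supply data to i
    return caches[i].value

def write_miss(entry, caches, i, new_value):
    if entry.dirty:
        raise NotImplementedError("dirty-bit ON write case is elided in the source")
    invalidated = []
    for j, present in enumerate(entry.presence):
        if present and j != i:
            caches[j].valid = False        # invalidation message
            entry.presence[j] = False      # assumption: bit cleared on invalidation
            invalidated.append(j)
    entry.dirty = True
    entry.presence[i] = True
    caches[i].valid, caches[i].dirty = True, True
    caches[i].value = new_value            # i now owns the only valid copy
    return invalidated
```

A short trace with four processors: two reads set `p[0]` and `p[1]`; a write by processor 2 invalidates both and sets the dirty-bit; a later read by processor 3 recalls the line from processor 2, updates memory, and leaves processors 2 and 3 sharing the block.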