CSE431 Chapter 7A (Irwin, PSU, 2008)

Slide 7A.2: The Big Picture: Where are We Now?

- Multiprocessor: a computer system with at least two processors
  - Can deliver high throughput for independent jobs via job-level parallelism or process-level parallelism
  - Can also improve the run time of a single program that has been specially crafted to run on a multiprocessor: a parallel processing program

[Figure: processors, each with a cache, connected through an interconnection network to shared memory and I/O]

Slide 7A.3: Multicores Now Common

- The power challenge has forced a change in the design of microprocessors
  - Since 2002 the rate of improvement in program response time has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year
- Today's microprocessors typically contain more than one core (Chip Multicore microProcessors, CMPs) in a single IC
  - The number of cores is expected to double every two years

  Product         | AMD Barcelona | Intel Nehalem | IBM Power 6 | Sun Niagara 2
  Cores per chip  | 4             | 4             | 2           | 8
  Clock rate      | 2.5 GHz       | ~2.5 GHz?     | 4.7 GHz     | 1.4 GHz
  Power           | 120 W         | ~100 W?       | ~100 W?     | 94 W

Slide 7A.4: Other Multiprocessor Basics

- Some of the problems that need higher performance can be handled simply by using a cluster: a set of independent servers (or PCs) connected over a local area network (LAN) and functioning as a single large multiprocessor
  - Search engines, Web servers, email servers, databases, ...
- A key challenge is to craft parallel (concurrent) programs that have high performance on multiprocessors as the number of processors increases, i.e., programs that scale
  - Scheduling, load balancing, time for synchronization, overhead for communication


Slide 7A.8: Example 1: Amdahl's Law

Speedup w/ E = 1 / ((1-F) + F/S), where F is the fraction of execution time that can use the enhancement and S is the speedup of the enhanced fraction.

- Consider an enhancement which runs 20 times faster but which is usable only 25% of the time.
  Speedup w/ E = 1 / (0.75 + 0.25/20) = 1.31
- What if it's usable only 15% of the time?
  Speedup w/ E = 1 / (0.85 + 0.15/20) = 1.17
- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
- To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less:
  Speedup w/ E = 1 / (0.001 + 0.999/100) = 90.99
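The slide's arithmetic can be checked with a few lines of Python (the function name `speedup` is just for illustration):

```python
def speedup(f, s):
    """Amdahl's Law: f is the fraction of execution time that can use the
    enhancement, s is the speedup of that fraction."""
    return 1.0 / ((1.0 - f) + f / s)

print(round(speedup(0.25, 20), 2))    # usable 25% of the time -> 1.31
print(round(speedup(0.15, 20), 2))    # usable 15% of the time -> 1.17
print(round(speedup(0.999, 100), 2))  # 0.1% scalar, 100 processors -> 90.99
```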

Slide 7A.11: Scaling

- Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  - Strong scaling: speedup is achieved on a multiprocessor without increasing the size of the problem
  - Weak scaling: speedup is achieved on a multiprocessor by increasing the size of the problem proportionally to the increase in the number of processors
- Load balancing is another important factor: just a single processor with twice the load of the others cuts the speedup almost in half
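The load-balancing claim is easy to check numerically. A minimal sketch, assuming the total work is 1 and the runtime is set by the most heavily loaded processor:

```python
def speedup_with_imbalance(p, heavy):
    """p processors; one carries `heavy` times the load of each of the others.
    Parallel runtime is determined by the most heavily loaded processor."""
    share = 1.0 / ((p - 1) + heavy)   # load on each lightly loaded processor
    return 1.0 / (heavy * share)      # sequential time (1) / parallel time

print(round(speedup_with_imbalance(10, 1), 2))  # perfectly balanced -> 10.0
print(round(speedup_with_imbalance(10, 2), 2))  # one processor, twice the load -> 5.5
```

With 10 processors and one carrying twice the load of the rest, speedup drops from 10 to 5.5, i.e., "almost in half" as the slide states.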

Slide 7A.12: Multiprocessors/Clusters: Key Questions

- Q1: How do they share data?
- Q2: How do they coordinate?
- Q3: How scalable is the architecture? How many processors can be supported?

Slide 7A.13: Shared Memory Multiprocessor (SMP)

- Q1: A single address space is shared by all processors
- Q2: Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks) that allow access to data by only one processor at a time
- SMPs come in two styles
  - Uniform memory access (UMA) multiprocessors
  - Nonuniform memory access (NUMA) multiprocessors
- Programming NUMAs is harder, but NUMAs can scale to larger sizes and have lower latency to local memory
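Coordinating access to a shared variable with a lock can be sketched with Python threads standing in for processors (the names here are illustrative, not from the slides):

```python
import threading

counter = 0              # shared variable in the single address space
lock = threading.Lock()  # synchronization primitive

def worker(n):
    global counter
    for _ in range(n):
        with lock:       # only one thread may update the shared data at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- without the lock, updates could be lost
```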

Slide 7A.17: Process Synchronization

- Need to be able to coordinate processes working on a common task
- Lock variables (semaphores) are used to coordinate or synchronize processes
- Need an architecture-supported arbitration mechanism to decide which processor gets access to the lock variable
  - A single bus provides the arbitration mechanism, since the bus is the only path to memory: the processor that gets the bus wins
- Need an architecture-supported operation that locks the variable
  - Locking can be done via an atomic swap operation (on the MIPS we have ll and sc). Test-and-set is one example, where a processor can both read a location and set it to the locked state in the same bus operation.
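The spin-on-test-and-set idea can be sketched in Python, where `Lock.acquire(blocking=False)` plays the role of the hardware's atomic read-and-set; this is an analogy, not real MIPS ll/sc:

```python
import threading

class SpinLock:
    """Sketch: acquire(blocking=False) atomically tests the flag and, if it
    was free, sets it to the locked state -- the role test-and-set plays
    in hardware."""
    def __init__(self):
        self._flag = threading.Lock()

    def lock(self):
        while not self._flag.acquire(blocking=False):
            pass                 # spin until the test-and-set succeeds

    def unlock(self):
        self._flag.release()     # write the "unlocked" value

shared = []
sl = SpinLock()

def append_items(tag):
    for i in range(100):
        sl.lock()
        shared.append((tag, i))  # critical section
        sl.unlock()

ts = [threading.Thread(target=append_items, args=(t,)) for t in range(3)]
for t in ts:
    t.start()
for t in ts:
    t.join()
print(len(shared))  # 300
```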

Slide 7A.20: An Example with 10 Processors

[Figure: ten processors P0-P9 each produce a partial sum sum[P0]..sum[P9]; the partial sums are then combined in a tree, with half the processors (P0..P4) doing the first round of combining]

- synch(): processors must synchronize before the "consumer" processor tries to read the results from the memory location written by the "producer" processor
  - Barrier synchronization: a synchronization scheme where processors wait at the barrier, not proceeding until every processor has reached it
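The synch() step maps directly onto a barrier. A sketch of the 10-processor sum using Python threads and threading.Barrier (the slide's tree reduction is collapsed into a single combining step for brevity):

```python
import threading

P = 10
data = list(range(1000))
chunk = len(data) // P
psum = [0] * P                 # sum[P0]..sum[P9]
barrier = threading.Barrier(P)
result = []

def processor(p):
    psum[p] = sum(data[p * chunk:(p + 1) * chunk])  # producer: local partial sum
    barrier.wait()   # synch(): no one proceeds until every partial sum is written
    if p == 0:       # consumer combines the partial sums after the barrier
        result.append(sum(psum))

threads = [threading.Thread(target=processor, args=(p,)) for p in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result[0])  # 499500 == sum(range(1000))
```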

Slide 7A.22: Spin-Locks on Bus-Connected ccUMAs

- With a bus-based cache coherency protocol (write invalidate), spin-locks allow processors to wait on a local copy of the lock in their caches
  - This reduces bus traffic. Once the processor holding the lock releases it (writes a 0), all other caches see that write and invalidate their old copy of the lock variable. Unlocking restarts the race to acquire the lock; the winner gets the bus and writes the lock back to 1. The other caches then invalidate their copy of the lock and, on the next lock read, fetch the new lock value (1) from memory.
- This scheme has problems scaling up to many processors because of the communication traffic when the lock is released and contested

Slide 7A.28: Pros and Cons of Message Passing

- Message sending and receiving is much slower than, for example, an addition
- But message-passing multiprocessors are much easier for hardware designers to design
  - They don't have to worry about cache coherency, for example
- The advantage for programmers is that communication is explicit, so there are fewer "performance surprises" than with the implicit communication in cache-coherent SMPs
  - Message passing standard: MPI-2 (www.mpi-forum.org)
- However, it's harder to port a sequential program to a message-passing multiprocessor, since every communication must be identified in advance
  - With cache-coherent shared memory the hardware figures out what data needs to be communicated
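Explicit send/receive can be sketched with Python's multiprocessing module, where a Queue plays the role of the interconnect; this is a sketch of the programming model, not MPI:

```python
import multiprocessing as mp

def producer(q):
    q.put([1, 2, 3])   # communication is explicit: the sender names the data
    q.put(None)        # sentinel: no more messages

def consume():
    q = mp.Queue()
    p = mp.Process(target=producer, args=(q,))
    p.start()
    received = []
    while True:
        msg = q.get()  # explicit (blocking) receive
        if msg is None:
            break
        received.append(msg)
    p.join()
    return received

if __name__ == "__main__":
    print(consume())  # [[1, 2, 3]]
```

Contrast with the shared-memory version: here nothing is communicated unless the programmer explicitly puts it on the queue.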

Slide 7A.31: Multithreading on a Chip

- Find a way to "hide" true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of the stalling instructions
- Hardware multithreading: increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor
  - The processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer per thread
  - The caches, TLBs, BHT, BTB, and RUU can be shared (although the miss rates may increase if they are not sized accordingly)
  - Memory can be shared through the virtual memory mechanisms
  - The hardware must support efficient thread context switching

Slide 7A.32: Types of Multithreading

- Fine-grain: switch threads on every instruction issue
  - Round-robin thread interleaving (skipping stalled threads)
  - The processor must be able to switch threads on every clock cycle
  - Advantage: can hide throughput losses that come from both short and long stalls
  - Disadvantage: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
- Coarse-grain: switch threads only on costly stalls (e.g., L2 cache misses)
  - Advantages: thread switching doesn't have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  - Disadvantage: limited in its ability to overcome throughput loss, due to pipeline start-up costs
    - The pipeline must be flushed and refilled on thread switches

Slide 7A.35: Simultaneous Multithreading (SMT)

- A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled (superscalar) processor to exploit both program ILP and thread-level parallelism (TLP)
  - Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., than they have ILP)
  - With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    - Needs separate rename tables (RUUs) for each thread, or the ability to indicate which thread each entry belongs to
    - Needs the capability to commit from multiple threads in one cycle
- Intel's Pentium 4 SMT is called hyperthreading
  - Supports just two threads (doubles the architecture state)