3
6.004 –Fall /03/0L09 -Pipelining 3
One load at a time
Everyone knows that the real reason MIT students put off doing laundry for so long is not that they procrastinate, are lazy, or even have better things to do. The fact is, doing one load at a time is not smart.
Step 1: washer. Step 2: dryer.
Total = Washer PD + Dryer PD = 90 mins

4
Doing N loads of laundry
Here’s how they do laundry at Harvard, the “combinational” way. (Of course, this is just an urban legend. No one at Harvard actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry, and return it all pressed and starched in time for afternoon tea.)
Steps 1, 2, 3, 4, …: wash, dry, wash, dry, … (each load finishes before the next one starts).
Total = N*(Washer PD + Dryer PD) = N*90 mins

6
Some definitions
Latency: The delay from when an input is established until the output associated with that input becomes valid.
(Harvard Laundry = 90 mins)
(MIT Laundry = 120 mins, assuming that the wash is started as soon as possible and waits (wet) in the washer until the dryer is available)
Throughput: The rate at which inputs or outputs are processed.
(Harvard Laundry = 1/90 outputs/min)
(MIT Laundry = 1/60 outputs/min)
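The two definitions can be checked with a few lines of arithmetic. The sketch below is illustrative only: the stage times (30-minute washer, 60-minute dryer) are assumptions inferred from the 90-minute total and the 1/60 throughput, and the function names are made up for this example.

```python
from fractions import Fraction

# Assumed stage times in minutes (not stated on this slide; inferred from
# the 90-minute total and the 1/60 outputs/min throughput).
WASHER_PD = 30
DRYER_PD = 60

def combinational(stage_delays):
    """One load runs start to finish before the next load begins."""
    latency = sum(stage_delays)
    return latency, Fraction(1, latency)

def pipelined(stage_delays):
    """All stages advance in lock step, paced by the slowest stage."""
    period = max(stage_delays)
    latency = len(stage_delays) * period
    return latency, Fraction(1, period)

harvard = combinational([WASHER_PD, DRYER_PD])
mit = pipelined([WASHER_PD, DRYER_PD])
print("Harvard (latency mins, loads/min):", harvard)
print("MIT     (latency mins, loads/min):", mit)
```

The pipelined latency exceeds the sum of the stage delays because the faster stage (the washer) is padded out to the period of the slower one.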

7
Okay, back to circuits…
For combinational logic: latency = t_PD, throughput = 1/t_PD.
We can’t get the answer faster, but are we making effective use of our hardware at all times?
F & G are “idle”, just holding their outputs stable while H performs its computation.

8
Pipelined Circuits
Use registers to hold H’s input stable!
Now F & G can be working on input X_{i+1} while H is performing its computation on X_i. We’ve created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock cycle j+2.
Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers:
                   latency (ns)   throughput (1/ns)
unpipelined        45             1/45
2-stage pipeline   50 (worse)     1/25 (better)
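The table entries follow directly from the three propagation delays. As a minimal sketch, assuming F and G run in parallel and both feed H (which is what the 45 ns unpipelined figure implies), and using hypothetical function names:

```python
from fractions import Fraction

def unpipelined(t_f, t_g, t_h):
    """F and G in parallel, then H: one long combinational path."""
    t_pd = max(t_f, t_g) + t_h
    return t_pd, Fraction(1, t_pd)

def two_stage(t_f, t_g, t_h):
    """Ideal registers after F and G (stage 1) and after H (stage 2)."""
    period = max(max(t_f, t_g), t_h)  # the clock must cover the slowest stage
    return 2 * period, Fraction(1, period)

# Delays in ns, taken from the slide.
print("unpipelined (latency, throughput):", unpipelined(15, 20, 25))
print("2-stage     (latency, throughput):", two_stage(15, 20, 25))
```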

10
Pipeline Conventions
DEFINITION: A K-Stage Pipeline (“K-pipeline”) is an acyclic circuit having exactly K registers on every path from an input to an output. A COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.
CONVENTION: Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input).
ALWAYS: The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths PLUS (input) register t_PD PLUS (output) register t_SETUP.
The LATENCY of a K-pipeline is K times the period of the clock common to all registers.
The THROUGHPUT of a K-pipeline is the frequency of the clock.
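These rules reduce to a short calculation once the per-stage logic delays and the register parameters are known. The sketch below is an illustration, not part of the lecture: the function name is hypothetical, delays are integer nanoseconds, and the register values (t_PD = 1 ns, t_SETUP = 2 ns) are made-up numbers.

```python
from fractions import Fraction

def k_pipeline_timing(stage_logic_delays, reg_t_pd, reg_t_setup):
    """Clock period, latency and throughput of a K-pipeline.

    stage_logic_delays: worst-case combinational delay of each stage (ns).
    reg_t_pd, reg_t_setup: register propagation delay and setup time (ns).
    """
    k = len(stage_logic_delays)
    # The period must cover the slowest stage plus register overhead.
    period = max(stage_logic_delays) + reg_t_pd + reg_t_setup
    return {
        "K": k,
        "period": period,
        "latency": k * period,               # K times the clock period
        "throughput": Fraction(1, period),   # the clock frequency
    }

# Hypothetical register parameters: t_PD = 1 ns, t_SETUP = 2 ns.
timing = k_pipeline_timing([20, 25], reg_t_pd=1, reg_t_setup=2)
print(timing)
```

With both register parameters set to zero, the same call gives the period and latency of an ideal-register pipeline.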

11
Ill-formed pipelines
Consider a BAD job of pipelining:
For what value of K is the following circuit a K-Pipeline? ANS: none
Problem: Successive inputs get mixed, e.g., B(A(X_{i+1}), Y_i). This happened because some paths from inputs to outputs had 2 registers, and some had only 1!
Can this happen in a well-formed K-pipeline?
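The "same number of registers on every path" condition can be checked mechanically. The sketch below assumes the circuit is given as an acyclic graph whose wires are annotated with the number of registers they pass through. The representation, the function name, and the small example circuit are all hypothetical; the example is not the circuit drawn on the slide.

```python
from functools import lru_cache

def pipeline_depth(edges, inputs, outputs):
    """Return K if every input-to-output path has exactly K registers, else None.

    edges: dict mapping a node to a list of (successor, registers_on_wire).
    The graph must be acyclic.
    """
    outputs = set(outputs)

    @lru_cache(maxsize=None)
    def counts_from(node):
        # All register counts seen on paths from this node to any output.
        found = set()
        if node in outputs:
            found.add(0)
        for succ, regs in edges.get(node, []):
            for count in counts_from(succ):
                found.add(count + regs)
        return frozenset(found)

    all_counts = set()
    for src in inputs:
        all_counts |= counts_from(src)
    return all_counts.pop() if len(all_counts) == 1 else None

# Hypothetical ill-formed circuit: X reaches OUT through 2 registers, Y through 1.
bad = {"X": [("A", 0)], "A": [("B", 1)], "Y": [("B", 0)], "B": [("OUT", 1)]}
# Same circuit with a register added on the Y wire.
fixed = {"X": [("A", 0)], "A": [("B", 1)], "Y": [("B", 1)], "B": [("OUT", 1)]}

print(pipeline_depth(bad, ["X", "Y"], ["OUT"]))
print(pipeline_depth(fixed, ["X", "Y"], ["OUT"]))
```

A result of `None` corresponds to the "ANS: none" case: no single K describes the circuit, so successive inputs can get mixed.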

12
A pipelining methodology
Step 1: Draw a line that crosses every output in the circuit, and mark the endpoints as terminal points.
Step 2: Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction. These lines demarcate pipeline stages.
Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline.
STRATEGY: Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS).

14
Pipelining Summary
Advantages:
– Allows us to increase throughput by breaking up long combinational paths and (hence) increasing clock frequency.
Disadvantages:
– May increase latency...
– Only as good as the weakest link: the slowest step constrains system throughput.
Isn’t there a way around this “weak link” problem?

15
Pipelined Components
Pipelined systems can be hierarchical:
– Replacing a slow combinational component with a k-pipe version may increase clock frequency.
– Must account for new pipeline stages in our plan.
(Figure: 4-stage pipeline, throughput = 1)
But... how can one pipeline a clothes dryer???

16
How do Aces do Laundry?
They work around the bottleneck. First, they find a place with twice as many dryers as washers.
(Figure: Steps 1 to 4 of the laundry schedule.)
Throughput = 1/30 loads/min
Latency = 90 mins/load
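A small schedule simulation reproduces these numbers. This is a sketch under stated assumptions: one 30-minute washer, 60-minute dryers used in round-robin order, and a load that waits (wet) in the washer until its dryer is free. The stage times and the function name are assumptions, not values given on this slide.

```python
def simulate_laundry(n_loads, n_dryers, washer_pd=30, dryer_pd=60):
    """Return a list of (wash_start, dry_done) times in minutes, one per load."""
    washer_free = 0
    dryer_free = [0] * n_dryers
    schedule = []
    for i in range(n_loads):
        d = i % n_dryers                      # round-robin dryer assignment
        wash_start = washer_free
        wash_done = wash_start + washer_pd
        # The load stays in the washer until its dryer is available.
        dry_start = max(wash_done, dryer_free[d])
        dry_done = dry_start + dryer_pd
        washer_free = dry_start               # washer is emptied at this point
        dryer_free[d] = dry_done
        schedule.append((wash_start, dry_done))
    return schedule

def steady_state(schedule):
    """Latency of the last load and the interval between the last two finishes."""
    (start, done), (_, prev_done) = schedule[-1], schedule[-2]
    return done - start, done - prev_done

print("two dryers:", steady_state(simulate_laundry(6, n_dryers=2)))
print("one dryer: ", steady_state(simulate_laundry(6, n_dryers=1)))
```

With two dryers a load finishes every 30 minutes and no load ever waits, so the latency stays at the 90 minutes of a single wash plus dry.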

17
Circuit Interleaving
We can simulate a pipelined version of a slow component by replicating the critical element and alternating inputs between the various copies.
(Figure note: the control is a simple 2-state FSM that alternates between 0 and 1 on each clock.)

18
Circuit Interleaving
When Q is 1, the lower path is combinational (its latch is open), while the output of the upper path is enabled onto the input of the output register, ready for the NEXT clock edge. Meanwhile, the other latch maintains the input from the last clock.
“It acts like a 2-stage pipeline.”
(Timing diagram traces: C1 output, Mux output.)
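The behaviour can be sketched at the clock-cycle level. The model below is a simplification, not the gate-level circuit from the slide: `q` stands for the 2-state FSM, `held` for the two latches, and `out_reg` for the output register. All names are hypothetical, and `None` is used as an "empty" marker, so real inputs must not be `None`.

```python
def simulate_interleaving(inputs, slow_fn):
    """Cycle-level model of two interleaved copies of a slow component.

    Each copy gets two clock cycles to compute. Returns the output
    register value visible during each cycle.
    """
    held = [None, None]   # input currently held by each copy's latch
    out_reg = None        # output register contents
    q = 0                 # 2-state FSM: which latch is open this cycle
    trace = []
    for x in list(inputs) + [None, None]:   # two extra cycles to drain
        trace.append(out_reg)               # value visible during this cycle
        held[q] = x                         # open latch: copy q starts on x
        other = held[1 - q]                 # other copy has had a full cycle already
        # Clock edge: the mux routes the finishing copy to the output register.
        out_reg = slow_fn(other) if other is not None else None
        q ^= 1                              # FSM alternates between 0 and 1
    return trace

trace = simulate_interleaving([1, 2, 3], lambda v: v * v)
print(trace)
```

In this model the result for the input presented in cycle j is visible in cycle j+2, and a new input is accepted every cycle, which is the sense in which the arrangement acts like a 2-stage pipeline.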

20
Combining techniques
We can combine interleaving and pipelining. Here, C’ interleaves two C elements with a propagation delay of 8 ns. The resulting C’ circuit has a throughput of one result every 4 ns (1/(4 ns)) and a latency of 8 ns.
This can be considered as an extra pipelining stage that passes through the middle of the C’ module. One of our separation lines must pass through this pipeline stage.
By combining interleaving with pipelining we move the bottleneck from the C element to the F element.
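The effect on the bottleneck can be illustrated numerically. In the sketch below only C's 8 ns delay comes from the slide. The 5 ns delay for F is a made-up value chosen so that F is slower than the interleaved C’, and the helper names are hypothetical.

```python
from fractions import Fraction

def interleave(t_pd, copies):
    """Timing of `copies` interleaved instances of a component with delay t_pd."""
    return {
        "stages": copies,                       # counts as this many pipeline stages
        "stage_time": Fraction(t_pd, copies),   # one result every t_pd/copies
        "latency": t_pd,                        # each result still takes t_pd
    }

def bottleneck(stage_times):
    """Name of the element that limits the clock period."""
    return max(stage_times, key=stage_times.get)

c_prime = interleave(8, copies=2)

# F = 5 ns is a hypothetical value; C = 8 ns is from the slide.
before = {"F": 5, "C": 8}
after = {"F": 5, "C'": c_prime["stage_time"]}

print("C' timing:", c_prime)
print("bottleneck before:", bottleneck(before))
print("bottleneck after: ", bottleneck(after))
```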

26
Control Structure Taxonomy
Synchronous, Globally Timed: Centralized clocked FSM generates all control signals. Easy to design, but the fixed-sized interval can be wasteful (no data-dependencies in timing).
Asynchronous, Globally Timed: Central control unit tailors current time slice to current tasks. Large systems lead to very complicated timing generators… just say no!
Synchronous, Locally Timed: Start and Finish signals generated by each major subsystem, synchronously with global clock. The best way to build large systems that have independently-timed components.
Asynchronous, Locally Timed: Each subsystem takes asynchronous Start, generates asynchronous Finish (perhaps using local clock). The “next big idea” for the last several decades: a lot of design work to do in general, but the extra work is worth it in special cases.