Pipelines --------- typical RISC 5-stage pipeline 1. IF - fetch instrtuction 2. ID - decode and read from register file 3. EX - execute ALU operation, which may be calculation of address for ld/st 4. MEM - read/write data for ld/st, or bypass result from ALU for other insts. 5. WB - write back result of ld or ALU op stages: IF ID EX MEM WB latches: IF/ID ID/EX EX/MEM MEM/WB ++ ++ ++ ++ ||rs1+----+ || +- || || +--+ +------+ ||-->|reg.|-->||----->| \ || +------+ || mux |PC|-*->|icache|->||rs2|file| || ` |->||-*->|dcache|->||->++ +--+ | +------+ ||-->| |-->||->++ ' | || | +------+ || ||-. ^ v || +----+ || ||->| / || | || || | | +4 || immediate || || +- || `----------->||->++ | | | ||-->extend-->||->++ ALU || bypass || | `--' ||rd || mux || || | ||----------->||----------->||------------->||--. | ++ ^ ^ ++ ++ ++ | | | | | | | `--------------------------------------' | `--------------------------------------------' result and rd sent back to register file IF - fetch instruction from icache ID - phi-1: (if WB active) write result value into register rd phi-2: read values from registers rs1 and rs2, and sign extend immediate value field from instruction EX - ALU operation, which may have two register values as inputs or one register value and a sign-extended immediate value as inputs MEM - read cache on load or bypass WB - phi-1: (if not a store or nop) write either the cache result or the bypassed result back into the register file example instructions lw r2,0(r1) // r2 <- memory[r1+0] IF - fetch inst ID - read value from r1 base register and sign-extend 0 displacement EX - form effective address as base+displacement using ALU MEM - read data from data cache WB - write loaded value into r2 add r5,r3,r4 // r5 <- r3 + r4 IF - fetch inst ID - read values from r3 and r4 EX - obtain sum of values using ALU MEM - bypass sum around data cache WB - write sum into r5 left as exercise - where do you get the value to write for a store? what if it is a SPARC-like store with three registers - st %o0,[%o1+%o2] left as exercise - how do you accomplish branching? 3-ported register file (2 read, 1 write), with writes done before reads |<-- one cycle -->| multiported register file (Figs. B.8.8 and B.8.9) phi_1 phi_2 read rs1----------------. .________. . v clock| | | +----+ .-. | |________| | r0 |---*->| | . . . write rd-->| r1 |--*+->|m|--> rs1 value IF . . . rd value-->| r2 |-*++->|u| . . . | ...| |||..|x| ID . .read rs1. +----+ ||| `-' . .read rs2. ||| .-. EX . . . ||`->| | . . . |`-->|m|--> rs2 value MEM . . . `--->|u| . . . ...|x| WB .write rd. . `-' ^ read rs2----------------' data dependency producer-consumer relationship true data dependency or RAW (read-after-write) dependency alu-op dependency and r2,r0,r1 // r2 <- r0 and r1 or r4,r2,r3 // r4 <- r2 or r3 - and is the "producer" and defines r2 - or is the dependent instruction, the "consumer", and uses r2 load dependency lw r2,0(r1) // r2 <- memory[r1+0] add r4,r2,r3 // r4 <- r2 + r3 - lw is the "producer" and defines r2 - add is the dependent instruction, the "consumer", and uses r2 simple register scoreboard to handle true data dependencies - vector of "busy" bits, one per register - stall instruction in ID until all source registers are free - set busy bit of destination register for instruction exiting ID - reset busy bit of destination register after writing in WB lw r2,0(r1) IF ID. EX MEM WB. \ . . add r4,r2,r3 IF. --- --- --. ID EX MEM WB . . v v set.......reset r2 busy saves one stall cycle by writing rd in first half-cycle and then reading rs and rt in second half-cycle lw r2,0(r1) IF ID. EX MEM WB \ . . add r4,r2,r3 IF. --- --- ID EX MEM WB . . v v set.....reset r2 busy without forwarding: two-cycle penalty for dependent instruction pairs lw r2,0(r1) IF ID EX MEM WB add r4,r2,r3 IF --- --- ID EX MEM WB (wait on r2) sub r6,r4,r5 IF --- --- --- --- ID EX MEM WB (wait on r4) add forwarding paths to bring a value from EX/MEM latch or MEM/WB latch to muxes in front of A and B legs of ALU lw r2,0(r1) IF ID EX MEM WB \ \ <-- forward from MEM stage add r4,r2,r3 IF ID --- EX MEM WB \ \ <-- forward from EX stage sub r6,r4,r5 IF --- ID EX MEM WB but note that loads still have an inherent load-use penalty forwarding adds extra paths in datapath .------------------------------------------------------------. _ |.-------------------------------. | | | || __ _ | _ | | | |`1->| \ .---. | | | | | | | | '-2->|mux-------->|A \ | | | | | | | |----3->|__/ \ \ | | | .-----. | | | | | __ |alu|->| |---*-8->|data | | | __ | | |----4-------->| \ / / | | .-|-9->|cache|->| |-11->| \ | | | __ |mux->|B / | | | | `-----' | | |mux--* | |----5->| \ .>|__/ `---' | | | *-10--------->| |-10->|__/ | | | .-6->|mux' | | | | | | | |_| |.7->|__/`-------9-------->|_|-' | bypass |_| | ID/EX || EX/MM | MM/WB | |`-------------------------------' | ... <-*------------------------------------------------------------' 1 - ALU result to ALU A-leg forwarding 2 - MEM result to ALU A-leg forwarding 3 - register value to ALU A-leg 4 - sign-extended immediate value to ALU B-leg 5 - register value to ALU B-leg 6 - MEM result to ALU B-leg forwarding 7 - ALU result to ALU B-leg forwarding 8 - effective address (from ALU to dcache) 9 - store value (from lower mux in EX stage to dcache) 10 - ALU result bypass around dcache 11 - loaded value from dcache control dependency some RISC designs used delayed branches X: IF ID* EX MEM WB <-- branch resolves in ID stage ('*') . X+4: IF. ID EX MEM WB <-- next inst. (in "delay slot") executed . Y: `>IF ID EX MEM WB <-- taken branch causes fetch at target to be delayed by one cycle two parts to branch resolution - taken/untaken decision - branch target address calculation (e.g., PC+offset) 12-stage Amdahl 470 (1975) two stages per clock cycle (phi-1 and phi-2) 1. IA - instruction address generation 2+3. IF - instruction fetch from cache 4. D - instruction decode 5. R - register fetch 6. AG - effective address generation 7+8. DF - data fetch from cache 9. EX1 - first stage of execute 10. EX2 - second stage of execute; result write to cache 11. C - error check 12. W - result write to registers instructions can start processing on each clock cycle 5-stage Intel 486 (1989) 1. IF - when needed, fetch 16 bytes of instructions into one of two prefetch buffers (reduces conflicts with data accesses to the single cache and obtains approx. five insts.) 2. D1 - main decoding stage - processes up to three instruction bytes at a time; determines the length of the instruction 3. D2 - secondary decoding stage - effective address calculation; extra cycles when inst. has both an immediate constant and a memory displacement or when an index register must be added to a base register and a displacement 4. EX - execution - includes register operand fetch and data cache access a. cache hit takes one cycle b. alu operations with all operands/results in registers can be performed in one cycle c. extra cycles required for complex instructions e.g., reg-to-memory (M[x]<-reg+M[x]) requires three EX cycles .- fetch operand from cache (r) | .- ALU operation | | .- store result to cache (w) v v v IF D1 D2 EXr EX EXw WB D1 D2 --- --- EX WB <-- subsequent inst. stalled 5. WB - write back 4-ported register file (3 read, 1 write) no load-use penalty except for register values needed in address calculation IF D1 D2 EXr WB <-- load -- move memory to register in x86 \ IF D1 D2 EX WB <-- use -- e.g., add registers IF D1 D2 EXr WB <-- load -- move memory to register in x86 \ IF D1 --- D2 EX WB <-- use -- e.g., address = register + displ. this kind of design is sometimes called an AGI pipeline (AGI = address generation interlock) branches are predict-untaken (even for unconditional jumps) with 2-cycle mispredict penalty since change in the PC (caused by a taken conditional jump or an unconditional jump) is determined during the EX stage; contents of D1 and D2 must be flushed; one cycle of penalty is saved by prefetching down the taken path once the branch is decoded IF D1 D2 EX* WB <-- branch resolves in EX stage ('*') D1 D2. x until then follow untaken path D1. x (which is flushed on mispredict) IF. D1 D2 EX WB <-- start taken path ^ ^ ^ | | `- switch to alternate buffer | | | `- fetch down taken path into alternate prefetch buffer | `- fetch (if becessary) down untaken path into primary prefetch buffer 8-stage FP pipeline with integer stages fetch/D1/D2/EX followed by FP stages X1 (execute-1), X2 (execute-2), WF (FP write-back), and ER (error reporting) classic Pentium added: - separate icache and dcache - two integer pipelines (U and V) and floating-point pipeline - 256-entry branch target buffer - each entry contains inst. address, target address, and two bits of history Pentium w/ MMX used a two-level branch prediction scheme similar to the Pentium Pro and added an extra stage to the pipeline to do this