Homework 3 examples Purposes: (1) work with control signals and datapath design (2) work with pipelining concepts (2) work with branch prediction 1. For the MIPS instruction sw Rt, Offset(Rs) // Mem[ Reg[Rs] + sign_ext(Offset) ] <- Reg[Rt] Give values for all control signals in Fig. 4.2 as well as whether each of the three muxes selects its upper or lower input to pass through. For ALU operation, use one of the function names in the table on p. 316. For other control signals, use either 0 or 1, and assume Zero = 0. ALU operation = Add Branch = 0 MemRead = 0 MemWrite = 1 RegWrite = 0 Zero = 0 Mux1 (upper left; output to PC) = lower Mux2 (upper middle; output to Register File) = don't care Mux3 (lower middle; output to B-leg of ALU) = lower 2. For the data path depicted below, assume the following latencies (delays): 1 ns temporary registers W,Y,Z (setup and hold) 2 ns register file R0-R3 (setup and hold) 3 ns bus 4 ns incrementer 5 ns adder What is the critical path? +------+ .-. +-------------+ | R0 |<-->| |-->| incrementer |--. +------+ | | +-------------+ | | R1 |<-->| | v +------+ | | +-------+ | R2 |<-->| |<---------------| W | +------+ | | +-------+ | R3 |<-->| | +------+ | |--------------------. | | | | | +-------+ | | |-->| Y | | | | +-------+ | | | v v | | ----- ----- | | \ \______/ / | | \ / | | \ adder / | | \__________/ bus | | v | | +-------+ | |<---------| Z | `-' +-------+ Consider the possible paths and determine the path delay for each. Typically multiple individual paths will be topologically identical and can be combined into a single generic path. (Ri or W or Z) to bus to Rj = 3 + 2 = 5 ns (Ri or W or Z) to bus to incrementer to W = 3 + 4 + 1 = 8 ns (Ri or W or Z) to bus to adder to Z = 3 + 5 + 1 = 9 ns (Ri or W or Z) to bus to Y = 3 + 1 = 4 ns Y to adder to Z = 5 + 1 = 6 ns The critical path is: (Ri or W or Z) to bus to adder to Z. 3. For the MIPS instruction sequence below, identify the dependencies. Additionally, show the pipeline cycle diagrams for the 5-stage pipeline without and with forwarding. Assume register file writes occur in the first half cycle and reads in the second half cycle. i1: add $1, $2, $3 // Reg[1] <- Reg[2] + Reg[3] i2: sub $4, $5, $6 // Reg[4] <- Reg[5] + Reg[6] i3: add $7, $1, $4 // Reg[7] <- Reg[1] + Reg[4] dependencies: i3 RAW-dependent on i1 through register $1 i3 RAW-dependent on i2 through register $4 if shown in the form of a dependence graph, it becomes clear that i1 and i2 are independent $2 $3 $5 $6 | | | | v v v v .--------. .--------. | i1:add | | i2:sub | `--------' `--------' \ / \ / $1/RAW \ /$4/RAW \ / v v .--------. | i3:add | `--------' | v $7 cycle diagram without forwarding - two stall cycles while i3 waits for values in $1 (available in cycle 5) and in $4 (available in cycle 6) 1 2 3 4 5 6 7 8 9 i1: IF ID EX MM WB i2: IF ID EX MM WB i3: IF - - ID EX MM WB equivalent code with nops i1: add $1, $2, $3 // Reg[1] <- Reg[2] + Reg[3] i2: sub $4, $5, $6 // Reg[4] <- Reg[5] + Reg[6] nop nop i3: add $7, $1, $4 // Reg[7] <- Reg[1] + Reg[4] 1 2 3 4 5 6 7 8 9 i1: IF ID EX MM WB i2: IF ID EX MM WB nop IF ID EX MM WB nop IF ID EX MM WB i3: IF ID EX MM WB cycle diagram with forwarding - no stalls since the value from i1 is forwarded to the EX stage of i3 from the MM/WB pipepine latch and the value from i2 is forwarded to the EX stage of i3 from the EX/MM pipeline latch 1 2 3 4 5 6 7 i1: IF ID EX MM WB i2: IF ID EX MM WB i3: IF ID EX MM WB if given in a more detailed cycle diagram 1 2 3 4 5 6 7 i1: IF ID EX MM... WB | i2: IF ID EX. | MM WB | | (value from i2) \ \ (value from i1) \ \ i3: IF ID EX MM WB or in a subset of a pipeline datapath diagram i3 in EX i2 in MM i1 in WB add $7,$1,$4 sub $4,$5,$6 add $1,$2,$3 .------------------------------------------------. | ...<------. | | _ | _ | | .---. | | | | | | forwarded '->[mux]->| \ | | | | | | value from i1 \ \ | | | .------. | | | |alu |->| |--*->|dcache|->| |-. | / / | | | `------' | | [mux]--* .->[mux]->| / | | *----------->| |-' | | `---' |_| | bypass |_| | | EX/MM | MM/WB | `------------------------' | forwarded value from i2 ... <--' 4. Download EduMIPS64. Create a file, e.g., test.s, and enter the following program. .data value: .word 1 .text main: ld r1,value(r0) daddi r1,r1,1 daddi r1,r1,2 finish: halt This program will execute as follows: ld r1,value(r0) ; r1 <- memory[value] daddi r1,r1,1 ; r1 <- r1 + 1 daddi r1,r1,2 ; r1 <- r1 + 2 halt ; end execution Run EduMIPS64, e.g., java -jar edumips64-0.5.2.jar. Click "file" on the upper menu bar; click "open" and load test.s. Click "configure" on the upper menu bar and disable forwarding. Step through the program using F7 until the program will no longer advance. Note the register stalls between the ld instruction and the first daddi instruction, and between the first and second daddi instructions. You should see that these four instructions take 13 cycles to execute without forwarding. (a) How can the ld instruction's WB stage and the first daddi instruction's ID stage overlap? The ld instruction reaches the WB stage in cycle 5 and thus writes the results into R1 in the first phase of cycle 5. The daddi instruction stalls in the ID stage until it can read the updated R1 in the second phase of cycle 5. Click file on the upper menu bar; click reset. Then click configure on the upper menu bar and enable forwarding. Step through the program using F7 until the program will no longer advance. (b) Explain why there is a register stall between ld instruction and the first daddi instruction but not between the first and second daddi instructions. There is a load-use penalty of one for the ld instructions and the first daddi instruction, since the loaded value is not available for forwarding until the end of the MEM stage for the ld instruction. There is no penalty for data forwarding between the two daddi instructions since the value produced by the first daddi instruction is available at the end of the EX stage. (c) How many cycles are saved by the use of forwarding for this program? Three. 5. For the following instruction set workload, find the extra CPI for a branch prediction scheme with 50% accuracy where the branch outcome is available in the EX stage of the standard 5-stage pipeline. type | freq -------+------ alu | 0.5 branch | 0.2 ld/st | 0.3 for mispredicted branches, there is a two-cycle penalty 1 2 3 4 5 6 7 8 bc target: IF ID EX MM WB i2: IF ID * i3: IF * target: IF ID EX MM WB note that for correctly predicted branches, there is no penalty 1 2 3 4 5 6 bc target: IF ID EX MM WB target: IF ID EX MM WB extra CPI = (branch freq.)*(misprediction freq.)*(mispredict penalty) = 0.2 * 0.5 * 2 = 0.2 cycle increase in CPI for this scheme 6. Consider a one-bit history for branch prediction. It records the state of the last branch as taken (T) or untaken (U) and predicts the next branch will be the same. Assume the bit is initialized to U. Determine the prediction accuracy on the following branch trace; include all trace entries in your calculation. T T T T U T T T T U T T T T U T T T T U T T T T U ^ ^ ^ ^ ^ (carets mark entry into loop) trace: T T T T U T T T T U T T T T U T T T T U T T T T U history: (U)T T T T U T T T T U T T T T U T T T T U T T T T prediction? M M M M M M M M M M There are two mispredicts per loop execution.