Homework 3 examples Purposes: (1) work with control signals and datapath design (2) work with pipelining concepts (2) work with branch prediction 1. For the MIPS instruction sw Rt, Offset(Rs) // Mem[ Reg[Rs] + sign_ext(Offset) ] <- Reg[Rt] Give values for all control signals in Fig. 4.2 as well as whether each of the three muxes selects its upper or lower input to pass through. For ALU operation, use one of the function names in the table on p. 316. For other control signals, use either 0 or 1, and assume Zero = 0. ALU operation = Add Branch = 0 MemRead = 0 MemWrite = 1 RegWrite = 0 Zero = 0 Mux1 (upper left; output to PC) = lower Mux2 (upper middle; output to Register File) = don't care Mux3 (lower middle; output to B-leg of ALU) = lower 2. For the data path depicted below, assume the following latencies (delays): 1 ns temporary registers W,Y,Z (setup and hold) 2 ns register file R0-R3 (setup and hold) 3 ns bus 4 ns incrementer 5 ns adder What is the critical path? +------+ .-. +-------------+ | R0 |<-->| |-->| incrementer |--. +------+ | | +-------------+ | | R1 |<-->| | v +------+ | | +-------+ | R2 |<-->| |<---------------| W | +------+ | | +-------+ | R3 |<-->| | +------+ | |--------------------. | | | | | +-------+ | | |-->| Y | | | | +-------+ | | | v v | | ----- ----- | | \ \______/ / | | \ / | | \ adder / | | \__________/ bus | | v | | +-------+ | |<---------| Z | `-' +-------+ Consider the possible paths and determine the path delay for each. Typically multiple individual paths will be topologically identical and can be combined into a single generic path. (Ri or W or Z) to bus to Rj = 3 + 2 = 5 ns (Ri or W or Z) to bus to incrementer to W = 3 + 4 + 1 = 8 ns (Ri or W or Z) to bus to adder to Z = 3 + 5 + 1 = 9 ns (Ri or W or Z) to bus to Y = 3 + 1 = 4 ns Y to adder to Z = 5 + 1 = 6 ns The critical path is: (Ri or W or Z) to bus to adder to Z. 3. For the MIPS instruction sequence below, identify the dependencies. Additionally, show the pipeline cycle diagrams for the 5-stage pipeline without and with forwarding. Assume register file writes occur in the first half cycle and reads in the second half cycle. i1: add $1, $2, $3 // Reg[1] <- Reg[2] + Reg[3] i2: sub $4, $5, $6 // Reg[4] <- Reg[5] + Reg[6] i3: add $7, $1, $4 // Reg[7] <- Reg[1] + Reg[4] dependencies: i3 RAW-dependent on i1 through register $1 i3 RAW-dependent on i2 through register $4 if shown in the form of a dependence graph, it becomes clear that i1 and i2 are independent $2 $3 $5 $6 | | | | v v v v \ / $1\ /$4 v v | v $7 cycle diagram without forwarding - two stall cycles while i3 waits for values in $1 (available in cycle 5) and in $4 (available in cycle 6) 1 2 3 4 5 6 7 8 9 i1: IF ID EX MM WB i2: IF ID EX MM WB i3: IF - - ID EX MM WB equivalent code with nops i1: add $1, $2, $3 // Reg[1] <- Reg[2] + Reg[3] i2: sub $4, $5, $6 // Reg[4] <- Reg[5] + Reg[6] nop nop i3: add $7, $1, $4 // Reg[7] <- Reg[1] + Reg[4] 1 2 3 4 5 6 7 8 9 i1: IF ID EX MM WB i2: IF ID EX MM WB nop IF ID EX MM WB nop IF ID EX MM WB i3: IF ID EX MM WB cycle diagram with forwarding - no stalls since the value from i1 is forwarded to the EX stage of i3 from the MM/WB pipepine latch and the value from i2 is forwarded to the EX stage of i3 from the EX/MM pipeline latch 1 2 3 4 5 6 7 i1: IF ID EX MM WB i2: IF ID EX MM WB i3: IF ID EX MM WB if given in a more detailed cycle diagram 1 2 3 4 5 6 7 i1: IF ID EX MM... WB | i2: IF ID EX. | MM WB | | (value from i2) \ \ (value from i1) \ \ i3: IF ID EX MM WB or in a subset of a pipeline datapath diagram i3 in EX i2 in MM i1 in WB add $7,$1,$4 sub $4,$5,$6 add $1,$2,$3 .------------------------------------------------. | ...<------. | | _ | _ | | .---. | | | | | | forwarded '->[mux]->| \ | | | | | | value from i1 \ \ | | | .------. | | | |alu |->| |--*->|dcache|->| |-. | / / | | | `------' | | [mux]--* .->[mux]->| / | | *----------->| |-' | | `---' |_| | bypass |_| | | EX/MM | MM/WB | `------------------------' | forwarded value from i2 ... <--' 4. Download WinDLX on a Windows computer. Copy test.s to another file, e.g., test2.s, and change the contents to the following. (The DLX instructions are similar to MIPS instructions.) .data value: .word 1 .text .global main main: lw r1,value addi r1,r1,1 addi r1,r1,2 finish: trap 0 This program will execute as follows: lw r1,value ; r1 <- memory[value] addi r1,r1,1 ; r1 <- r1 + 1 addi r1,r1,2 ; r1 <- r1 + 2 trap 0 ; end execution Run WinDLX. I suggest tiling the code box in the upper left, the pipeline box in the upper middle, the statistics box in the upper right, and the clock cycle diagram across the middle and bottom. (See screen capture http://www.cs.clemson.edu/~mark/330/windlx.gif) Click configuration on the upper menu bar; set absolute cycle count and disable forwarding. Click file on the upper menu bar; click load code or data; you should see test2.s in a menu; click test2.s; click select; click load; info box should appear asking to reset DLX; click yes. Step through the program using F7 until info box appear stating trap #0 occured. Note register stalls between lw and first addi, and between first and second addi's. You should see that these four instructions take 12 cycles. (The trap invokes the OS starting on cycle 12.) (a) How can the lw's WB and the first addi's ID overlap? The lw writes the results into R1 in the first phase of cycle 5, and the addi stalls until it can read R1 in the second phase of cycle 5. (b) Click configuration on the upper menu bar; enable forwarding; an info box should ask if you want to reset DLX; click yes. Step through the program using F7 until info box appear stating trap #0 occured. Note a register stall between lw and first addi but not between the first and second addi's. Explain. There is a load-use penalty of one for the lw and first addi pair, since the loaded value is not available for forwarding until the end of the MEM stage. (c) How many cycles are saved by the use of forwarding for the program? Three. 5. For the following instruction set workload, find the extra CPI for a branch prediction scheme with 50% accuracy where the branch outcome is available in the EX stage of the standard 5-stage pipeline. type | freq -------+------ alu | 0.5 branch | 0.2 ld/st | 0.3 for mispredicted branches, there is a two-cycle penalty 1 2 3 4 5 6 7 8 bc target: IF ID EX MM WB i2: IF ID * i3: IF * target: IF ID EX MM WB note that for correctly predicted branches, there is no penalty 1 2 3 4 5 6 bc target: IF ID EX MM WB target: IF ID EX MM WB extra CPI = (branch freq.)*(misprediction freq.)*(mispredict penalty) = 0.2 * 0.5 * 2 = 0.2 cycle increase in CPI for this scheme 6. Consider a one-bit history for branch prediction. It records the state of the last branch as taken (T) or untaken (U) and predicts the next branch will be the same. Assume the bit is initialized to U. Determine the prediction accuracy on the following branch trace; include all trace entries in your calculation. T T T T U T T T T U T T T T U T T T T U T T T T U ^ ^ ^ ^ ^ (carets mark entry into loop) trace: T T T T U T T T T U T T T T U T T T T U T T T T U history: U T T T T U T T T T U T T T T U T T T T U T T T T prediction? M M M M M M M M M M There are two mispredicts per loop execution.