CPSC 330 - Spring 2006 Study Guide for Exam 3 (revised with added pipeline material - see second half) This exam will cover the RAM part of Chapter 5, Chapter 7, and the scalar pipeline part of Chapter 8. 1. Be able to define or match these terms random access memory (RAM) read-only memory (ROM) static RAM (SRAM) dynamic RAM (DRAM) refresh memory array row decoder row address strobe row buffer column decoder column address strobe page-mode DRAM programmable ROM (PROM) erasable PROM (EPROM) memory bus split transaction bus centralized arbitration daisy chain arbitration decentralized arbitration ring (token) arbitration register internal CPU bus tri-state buffer CPU data path control control signal hardwired control microprogrammed control control store 2. Be able to: A. Draw a block diagram (high-level circuit) showing the two-dimensional organization of a RAM and/or identify the components and signals required to access RAM. B. Draw a timing diagram of bus and RAM chip activity for a memory read. Be able to identify differences for burst transfer mode. C. Given a high-level RTL statement, derive the necessary step-by-step RTL and/or control signals to implement that statement on a given datapath. D. Given a user-level instruction, derive the necessary step-by-step RTL and/or control signals to implement the fetch/decode/execute of that instruction on a given datapath. E. Explain the purpose/use of the fields in a microinstruction in the microprogramming example we studied in class. Be prepared to work problems as given in homework 3. Example questions 1. Consider the microprogrammed implementation of the simple CPU. cycle PC IR MAR MDR ACC TMP CSAR CSIR ----------------------------------------------------------------- 1: 0 0 0 0 0 0 0 00000100010000000100 MAR_in PC_out 2: 0 0 0 0 0 0 2 00000000001100000110 pc_incr read 3: 1 0 0 8 0 0 3 00010001000000001000 IR_in MDR_out 4: 1 8 0 8 0 0 4 00000000000000100000 br_table 5: 1 8 0 8 0 0 5 00001100000000001100 IR_out MAR_in 6: 1 8 8 8 0 0 6 00000000000100001110 read 7: 1 8 8 2 0 0 7 10000001000000000000 ACC_in MDR_out ----------------------------------------------------------------- (a) The control store held 16 words of 20 bits each, yet only the first 14 bits were used as control signals sent to the datapath. What is the general purpose for including an extra 6 bits in the fields br_table, next_address, and or_address? [Hint: a one word answer suffices.] (b) What effect did the br_table have, and why was it used? 2. Write a C function that could be used in a memory simulator that prints the word addresses of memory words transferred on a burst transfer of 16 bytes (four words). The addresses printed must start with a word aligned on a 16-byte boundary and must include the address passed to the function. (Addresses may be up to 32 bits.) The prototype of the function is void burst_addrs( unsigned int word_addr ); burst_addrs( 0x100 ) prints: 0x100 0x104 0x108 0x10c burst_addrs( 0x104 ) prints: 0x100 0x104 0x108 0x10c burst_addrs( 0xff5 ) prints: 0xff0 0xff4 0xff8 0xffc burst_addrs( 0xff8 ) prints: 0xff0 0xff4 0xff8 0xffc 3. Explain why a burst transfer is more efficient than transferring the four words separately. Draw bus timing diagrams of the two different approaches. ----- 1. Be able to define or match these terms. instruction cache multi-ported register file sign-extension ALU data cache pipelining pipeline stages pipeline latches structural hazard dependence (data dependency) data hazard forwarding load-use data hazard pipeline stall register scoreboard control hazard (branch hazard) branch target address (BTA) branch taken branch not taken (untaken) delayed branch branch delay slot branch prediction misprediction misprediction recovery flushing pipeline stages dynamic branch prediction branch target address cache (BTAC) branch history table (BHT) branch target buffer (BTB) branch history shift register (BHSR) gshare branch prediction algorithm 2. Be able to: A. Describe in general what each stage in the 5-stage pipeline does. B. Identify and discuss hardware and software solutions to hazards. C. Given a code sequence, draw a data dependency graph. D. Given a code sequence, identify where any stalls occur and determine their duration on the 5-stage pipeline. E. Given a code sequence, identify where any forwarding actions occur on the 5-stage pipeline. F. Given a code sequence, draw the pipeline cycle diagram (stairstep diagram) for that code executing on the 5-stage pipeline. G. Explain how data hazards are detected and how forwarding paths work. H. Draw forwarding paths on a pipeline diagram. I. Describe how delayed branches work in the 5-stage pipeline. J. Describe how a BTB (or BTAC and BHT) provides a predicted next instruction address for branch instructions. K. Calculate the specific or average CPI for a given branching scheme. Alternatively, calculate misprediction penalties. L. Calculate the prediction accuracy for a given branch prediction scheme. Example questions Associate each term or statement below with one or more types of dependence or hazard. Circle C, D, or S for Control, Data, or Structural, respectively. Note some questions may require two or all three to be circled. 1. C D S providing an L1 instruction cache and a separate L1 data cache avoids an instance of this type of hazard in a scalar pipeline 2. C D S a load-use penalty may occur when you have this type of hazard 3. C D S delayed branches deal with this type of dependence 4. C D S forwarding paths within a pipeline are used to avoid this type of hazard Associate each term or statement below with aspects of branching. Circle A or P, for Address or Prediction, respectively. Note some questions may require both to be circled. 5. A P BTAC 6. A P BHT 7. A P BTB 8. A P gshare 9. Identify in order and briefly explain the five stages in the pipeline we studied. 10. Draw a diagram of the five stages you listed in question 9 and draw the forwarding path that is used when the following pair of instructions is executed (the destination register is listed first). I1: add R3, R1, R2 I2: sub R5, R3, R4 11. Why are current machines using longer pipelines than five stage? 12. Identify two major drawbacks to using longer pipelines. 13. Consider two types of load instructions: a regular load (ld) and an indirect load (ldi), which perform the following actions: ld rd, offset(rs) => reg[rd] <- memory[ reg[rs] + offset ]; ldi rd, offset(rs) => reg[rd] <- memory[ memory[ reg[rs] + offset ] ]; Can the load indirect instruction be easily implemented on the five-stage pipeline? If so, identify what is done in each pipeline stage. If not, explain why not and what would have to be changed to implement this instruction. For questions 14-17, consider the following code. Destination registers are listed first. I1: mov R1, 0x1000 I2: mov R2, 0x2000 loop: I3: ld R3, 0(R1) I4: add R1, R1, 4 I5: add R4, R3, 1 I6: st 0(R2), R4 I7: add R2, R2, 4 I8: cmp R1, 0x1100 I9: blt loop // delayed branch on less than I10: nop // delay slot 14. Draw the dependency diagram for instructions I3-I9. 15. Consider the five-stage pipeline. Registers are written back in the the first half of a clock cycle, and registers are read in the second half of a clock cycle. Forwarding paths are also available. Draw a pipeline cycle (stairstep) diagram that shows the execution of the ten instructions on the five-stage pipeline. cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 I1: mov R1,0x1000 ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I2: mov R2,0x2000 ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I3: ld R3,0(R1) ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I4: add R1,R1,4 ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I5: add R4,R3,1 ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I6: st 0(R2),R4 ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I7: add R2,R2,4 ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I8: cmp R1,0x1100 ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I9: blt loop ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ I10: nop ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ 16. List any instruction pairs (by numbers) that require a stall cycle. (Consider all instructions, I1-I10). 17. List any instruction pairs (by numbers) that require a forwarding action. (Consider all instructions, I1-I10). (8 pts.) 18. Explain why branches tend to make make a computer run slower. 19. Identify and briefly explain a software technique that reduces the impact of branches on system performance. 20. Identify and briefly explain a hardware technique that reduces the impact of branches on system performance. 21. Consider a one-bit history for branch prediction. It records the state of the last branch as taken (T) or untaken (U) and predicts the next branch will be the same. Assume the bit is initialized to U. Determine the prediction accuracy on the following branch trace; include all trace entries in your calculation. (a) T T T T U T T T T U T T T T U T T T T U (b) T U T U T U T U T U T U T U T U T U T U 22. An HP processor uses a three-bit branch history shift register in each branch history table entry and uses a "majority vote" algorithm to predict whether the next branch is taken (T) or untaken (U). Assume an entry is initialized to UUU. Determine the prediction accuracy on the following branch trace; include all trace entries in your calculation. (a) T T T T U T T T T U T T T T U T T T T U (b) T U T U T U T U T U T U T U T U T U T U 23. Consider gshare with 3 bits from 16-bit PC (bits 4-6 when numbered in little endian), a 3-bit branch history shift register containing 0x3, and a pattern history table (PHT) of saturating two-bit counters (containing values shown below). Give the PHT entry index, value, and taken/not-taken prediction for the following branch addresses (for the prediction, start each part with the counter values as given, and use the most significant bit of the 2bc with 1=taken). 15 654 0 a) PC = 0x0100 => PHT entry index is __________ +----------------+ PHT PC|---------...----| +--+ PHT entry value is __________ +----------------+ 0|00| `|' 1|10| branch prediction is __________ v 2|11| (x)------> 3|01| ^ 4|00| | 5|01| b) PC = 0x2454 => PHT entry index is __________ +---+ 6|11| BHSR|011| 7|10| PHT entry value is __________ +---+ +--+ branch prediction is __________ 24. Consider branch prediction for the five-stage pipeline we studied. Let the base CPI be one. If branches are 20% of the workload, and the average branch prediction accuracy is 90% and the misprediction penalty is 2 cycles, show the effect of imperfect branch prediction on overall performance.