CPSC 330 - Spring 2008
Study Guide for Final Exam

The final will be approximately 66% cumulative through Exam 2, and approximately 33% on pipelining, Amdahl's Law, and MIPS.

Pipelining

1. Be able to define or match these terms.
    instruction cache
    multi-ported register file
    sign-extension
    ALU
    data cache
    pipelining
    pipeline stages
    pipeline latches
    structural hazard
    data dependencies
        RAW (read after write ordering must be preserved) - true dependency
        WAR (write after read ordering must be preserved) - anti-dependency
        WAW (write after write ordering must be preserved) - output dependency
    load-use data hazard
    pipeline stall
    register scoreboard
    forwarding
    control hazard (branch hazard)
    branch target address (BTA)
    branch taken
    branch not taken (untaken)
    delayed branch
    branch delay slot
    branch prediction
    misprediction
    misprediction recovery
    flushing pipeline stages
    dynamic branch prediction
    branch target address cache (BTAC)
    branch history table (BHT)
    branch target buffer (BTB)
    branch history shift register (BHSR)
    gshare branch prediction algorithm

2. Be able to:
    A. Describe in general what each stage in the 5-stage pipeline does.
    B. Identify and discuss hardware and software solutions to hazards.
    C. Given a code sequence, draw a data dependency graph.
    D. Given a code sequence, identify where any stalls occur and determine their duration on the 5-stage pipeline.
    E. Given a code sequence, identify where any forwarding actions occur on the 5-stage pipeline.
    F. Given a code sequence, draw the pipeline cycle diagram (stairstep diagram) for that code executing on the 5-stage pipeline.
    G. Explain how data hazards are detected and how forwarding paths work.
    H. Draw forwarding paths on a pipeline diagram.
    I. Describe how delayed branches work in the 5-stage pipeline.
    J. Describe how a BTB (or BTAC and BHT) provides a predicted next instruction address for branch instructions.
    K. Calculate the specific or average CPI for a given branching scheme. Alternatively, calculate misprediction penalties.
    L. Calculate the prediction accuracy for a given branch prediction scheme.

Example questions

Associate each term or statement below with one or more types of dependence or hazard. Circle C, D, or S for Control, Data, or Structural, respectively. Note some questions may require two or all three to be circled.

1. C D S   providing an L1 instruction cache and a separate L1 data cache avoids an instance of this type of hazard in a scalar pipeline
2. C D S   a load-use penalty may occur when you have this type of hazard
3. C D S   delayed branches deal with this type of dependence

Associate each term or statement below with aspects of branching. Circle A or P, for Address or Prediction, respectively. Note some questions may require both to be circled.

4. A P   BTAC
5. A P   BHT
6. A P   BTB
7. A P   gshare

8. Identify in order and briefly explain the five stages in the pipeline we studied.

9. Draw a diagram of the five stages you listed in question 8 and draw the forwarding path that is used when the following pair of instructions is executed (the destination register is listed first).
    I1: add R3, R1, R2
    I2: sub R5, R3, R4

10. Why do current machines use pipelines longer than five stages?

11. Identify two major drawbacks to using longer pipelines.
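To go with items 2.G-2.H and question 9 above, here is a minimal C sketch of the textbook-style forwarding conditions for one ALU source operand. The latch and field names are invented for illustration, not taken from the course materials: forward the newer result waiting in the EX/MEM latch, otherwise the older result in MEM/WB, otherwise read the register file.

    /* Forwarding-condition sketch for the first ALU source operand.
       Field names (reg_write, rd, id_ex_rs) are invented for illustration. */
    #include <stdio.h>
    #include <stdbool.h>

    enum fwd { FROM_REGFILE, FROM_EX_MEM, FROM_MEM_WB };

    struct latch { bool reg_write; int rd; };   /* relevant pipeline-latch fields */

    static enum fwd forward_a(int id_ex_rs, struct latch ex_mem, struct latch mem_wb)
    {
        /* Newer result: the instruction that just finished EX (I1 in question 9).
           The rd != 0 test assumes an R0 hardwired to zero, as in MIPS. */
        if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == id_ex_rs)
            return FROM_EX_MEM;
        /* Older result: the instruction now finishing MEM/entering WB. */
        if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == id_ex_rs)
            return FROM_MEM_WB;
        return FROM_REGFILE;   /* no hazard: the register file value is current */
    }

    int main(void)
    {
        /* Question 9 scenario: I1 (add R3,R1,R2) is in MEM while I2 (sub R5,R3,R4)
           is in EX, so I2's source R3 comes from the EX/MEM latch. */
        struct latch ex_mem = { .reg_write = true,  .rd = 3 };
        struct latch mem_wb = { .reg_write = false, .rd = 0 };
        printf("forward source = %d (1 = EX/MEM latch)\n", forward_a(3, ex_mem, mem_wb));
        return 0;
    }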
12. Consider two types of load instructions: a regular load (ld) and an indirect load (ldi), which perform the following actions:
        ld  rd, offset(rs)  =>  reg[rd] <- memory[ reg[rs] + offset ]
        ldi rd, offset(rs)  =>  reg[rd] <- memory[ memory[ reg[rs] + offset ] ]
    Can the load indirect instruction be easily implemented on the five-stage pipeline? If so, identify what is done in each pipeline stage. If not, explain why not and what would have to be changed to implement this instruction.

For questions 13-16, consider the following code. Destination registers are listed first.

    I1:  mov R1, 0x1000
    I2:  mov R2, 0x2000
    loop:
    I3:  ld  R3, 0(R1)
    I4:  add R1, R1, 4
    I5:  add R4, R3, 1
    I6:  st  0(R2), R4
    I7:  add R2, R2, 4
    I8:  cmp R1, 0x1100
    I9:  blt loop       // delayed branch on less than
    I10: nop            // delay slot

13. Draw the dependency diagram for instructions I3-I9.

14. Consider the five-stage pipeline. Registers are written back in the first half of a clock cycle, and registers are read in the second half of a clock cycle. Forwarding paths are also available. Draw a pipeline cycle (stairstep) diagram that shows the execution of the code above on the five-stage pipeline.

    cycle               1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    I1: mov R1,0x1000  ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
    I2: mov R2,0x2000  ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
    I3: ld  R3,0(R1)   ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
    I4: add R1,R1,4    ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
    I5: add R4,R3,1    ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
    I6: st  0(R2),R4   ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
    I7: add R2,R2,4    ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
    I8: cmp R1,0x1100  ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___

15. List any instruction pairs (by number) that require a stall cycle. (Consider instructions I1-I8.)

16. List any instruction pairs (by number) that require a forwarding action. (Consider instructions I1-I8.)

17. Explain why branches tend to make a computer run slower.

18. Identify and briefly explain a software technique that reduces the impact of branches on system performance.

19. Identify and briefly explain a hardware technique that reduces the impact of branches on system performance.

20. Consider a one-bit history for branch prediction. It records the state of the last branch as taken (T) or untaken (U) and predicts the next branch will be the same. Assume the bit is initialized to U. Determine the prediction accuracy on the following branch traces; include all trace entries in your calculation.
    (a) T T T T U T T T T U T T T T U T T T T U
    (b) T U T U T U T U T U T U T U T U T U T U

21. An HP processor uses a three-bit branch history shift register in each branch history table entry and uses a "majority vote" algorithm to predict whether the next branch is taken (T) or untaken (U). Assume an entry is initialized to UUU. Determine the prediction accuracy on the following branch traces; include all trace entries in your calculation.
    (a) T T T T U T T T T U T T T T U T T T T U
    (b) T U T U T U T U T U T U T U T U T U T U
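Questions 20 and 21 can be checked with a short simulation. The sketch below is my own illustration (the function names are not from the course): it walks a trace string of 'T'/'U' outcomes through a 1-bit last-outcome predictor and a 3-bit-history majority-vote predictor and reports the fraction of correct predictions.

    /* Branch-prediction accuracy sketch for questions 20-21. */
    #include <stdio.h>
    #include <string.h>

    /* 1-bit predictor: predict that the next branch repeats the last outcome;
       the bit starts as 'U' (untaken). */
    static double one_bit_accuracy(const char *trace)
    {
        char state = 'U';
        int correct = 0, n = (int)strlen(trace);
        for (int i = 0; i < n; i++) {
            if (state == trace[i]) correct++;
            state = trace[i];                 /* remember the newest outcome */
        }
        return (double)correct / n;
    }

    /* 3-bit history with majority vote: predict taken when at least two of the
       last three outcomes were taken; the history starts as UUU. */
    static double majority_accuracy(const char *trace)
    {
        char hist[3] = { 'U', 'U', 'U' };
        int correct = 0, n = (int)strlen(trace);
        for (int i = 0; i < n; i++) {
            int takens = (hist[0] == 'T') + (hist[1] == 'T') + (hist[2] == 'T');
            char pred = (takens >= 2) ? 'T' : 'U';
            if (pred == trace[i]) correct++;
            hist[0] = hist[1]; hist[1] = hist[2]; hist[2] = trace[i];  /* shift in newest */
        }
        return (double)correct / n;
    }

    int main(void)
    {
        const char *a = "TTTTUTTTTUTTTTUTTTTU";   /* trace (a) */
        const char *b = "TUTUTUTUTUTUTUTUTUTU";   /* trace (b) */
        printf("1-bit:         (a) %.2f  (b) %.2f\n", one_bit_accuracy(a), one_bit_accuracy(b));
        printf("majority vote: (a) %.2f  (b) %.2f\n", majority_accuracy(a), majority_accuracy(b));
        return 0;
    }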
22. Consider gshare with 3 bits taken from a 16-bit PC (bits 4-6, numbered little-endian so that bit 0 is the least significant bit), a 3-bit branch history shift register (BHSR) containing 0x3, and a pattern history table (PHT) of saturating two-bit counters containing the values shown below. Give the PHT entry index, value, and taken/not-taken prediction for the following branch addresses. (For each part, start with the counter values as given, and use the most significant bit of the two-bit counter as the prediction, with 1 = taken.)

    [Figure: PC bits 6-4 are XORed with the BHSR (011) to form the PHT index.]

    PHT entry | 2-bit counter
    ----------+--------------
        0     |     00
        1     |     10
        2     |     11
        3     |     01
        4     |     00
        5     |     01
        6     |     11
        7     |     10

    a) PC = 0x0100  =>  PHT entry index is __________
                        PHT entry value is __________
                        branch prediction is __________

    b) PC = 0x2454  =>  PHT entry index is __________
                        PHT entry value is __________
                        branch prediction is __________

23. Consider branch prediction for the five-stage pipeline we studied. Let the base CPI be one. If branches are 20% of the workload, the average branch prediction accuracy is 90%, and the misprediction penalty is 2 cycles, show the effect of imperfect branch prediction on overall performance.

Amdahl's Law and MIPS

1. Be able to define or match these terms
    Amdahl's Law
    MIPS
    MFLOPS

2. Be able to:
    A. Show how Amdahl's Law is derived.
    B. Calculate overall speedup using Amdahl's Law, or the fraction of the original program execution time that must be enhanced in order to meet an overall speedup goal, given a particular enhancement.
    C. Explain how Amdahl's Law applies to parallel systems.
    D. Give the two formulae for MIPS.
    E. Calculate MIPS or MFLOPS.

Example questions

1. Explain the difference between the measures of MIPS and MFLOPS.

2. If you knew a computer system was rated at 1000 MIPS, what would you estimate its MFLOPS rating to be, and why?

3. Give Amdahl's Law (that is, the overall speedup in terms of f and s).

4. Consider an enhancement to a computer system that provides a three-times speedup when in use. What is the overall speedup if the enhancement can be used for only one half of a program's normal execution time?

5. a) Find the execution time for a program that executes 1 billion instructions on a processor with an average CPI of 3.0 and a clock cycle time of 33.3 nsec.
   b) What is the MIPS value for the processor and program in part (a)?

6. For the following instruction-set workload and cycle counts, find the average CPI. If the clock rate is 800 MHz, what is the MIPS value?

    type   | freq | cycles
    -------+------+-------
    alu    | 0.5  |   2
    branch | 0.2  |   6
    ld/st  | 0.3  |   6
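Items 2.B, 2.D, and 2.E above come down to two standard formulas: overall speedup = 1 / ((1 - f) + f/s), and MIPS = clock rate in MHz / average CPI. The sketch below is only an illustration of setting up that arithmetic for questions 4 and 6; it is not a required solution format.

    /* Amdahl's Law and CPI/MIPS arithmetic sketch. */
    #include <stdio.h>

    /* Overall speedup when fraction f of the original execution time
       is sped up by a factor of s. */
    static double amdahl(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void)
    {
        /* Question 4: a 3x enhancement usable for half of the run time. */
        printf("overall speedup = %.2f\n", amdahl(0.5, 3.0));

        /* Question 6: alu 0.5 @ 2 cycles, branch 0.2 @ 6, ld/st 0.3 @ 6, 800 MHz. */
        double cpi  = 0.5 * 2 + 0.2 * 6 + 0.3 * 6;   /* average CPI (weighted sum) */
        double mips = 800.0 / cpi;                   /* MIPS = clock (MHz) / CPI   */
        printf("average CPI = %.1f, MIPS = %.0f\n", cpi, mips);
        return 0;
    }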
----- below is not covered on Spring 2008 final exam -----

Parallelism

1. Be able to define or match these terms
    scalar pipeline
    multiple pipelines
    multiple instruction issue
    superscalar
    very long instruction word (VLIW)
    explicitly parallel instruction computing (EPIC)
    data dependencies
        RAW (read after write ordering must be preserved) - true dependency
        WAR (write after read ordering must be preserved) - anti-dependency
        WAW (write after write ordering must be preserved) - output dependency
    false dependencies
    register renaming
    architectural registers
    reservation station
    reorder buffer (completion buffer)
    out-of-order execution
    in-order commit (retire)
    speculative execution
    load speculation (across stores and across branches)
    loop unrolling
    predication
    multithreading
    multicore
    vector processor
    Flynn notation (SISD, SIMD, MIMD)
    single-instruction-multiple-data (SIMD)
        - multiple processor units operating in lockstep
        - subword parallelism (partitioned ALU)
    shared memory
    symmetric multiprocessor (SMP)
    non-uniform memory access (NUMA)
    massively parallel processing (MPP)
    cache coherency
    bus snooping
    memory consistency
    message passing
    cluster

2. Be able to:
    A. Distinguish among superscalar, EPIC, and VLIW in terms of dependency checking, function unit assignment, and execution scheduling.
    B. Explain the hardware and software tradeoffs between a superscalar and a VLIW (e.g., control logic complexity, compiler complexity, code compatibility).
    C. Given a code sequence and functional unit latencies, schedule the operations into a set of very long instruction words.
    D. Identify and explain at least three advantages of cluster systems.

Example questions

Associate each term or statement below with aspects of ILP. Circle S or V, for Superscalar or VLIW, respectively. Note some questions may require both to be circled.

1. S V   often uses hardware register renaming
2. S V   the compiler makes decisions on what is in an issue packet
3. S V   can benefit from loop unrolling optimizations performed by a compiler

4. Identify superscalar, EPIC, and VLIW (leaving one blank empty).

                    dependency   fn. unit     time to start
                    checking     assignment   execution
                    ----------   ----------   -------------
    _____________   hardware     hardware     hardware
    _____________   software     hardware     hardware
    _____________   software     software     hardware
    _____________   software     software     software

5. Identify the type of organization that is used by each of the following processors: scalar pipeline, superscalar, VLIW, EPIC.
    486          _________________
    Itanium      _________________
    Pentium III  _________________
    Pentium 4    _________________

6. Predication changes a ________ into a ________ dependency.

7. What are the two major characteristics of an "embarrassingly parallel" workload?

8. Explain how a cluster differs from an SMP in terms of ld/st memory accesses and processor interconnections.

Fill in the blanks for the processing order and names of pipeline stages in a dynamically-scheduled processor such as the Intel P6.

Steps in the front end of the pipeline are done in program order:
 9. _instruction_fetch__   obtain instructions from the icache
10. _decode_____________   determine the operation and operands
11. _register_renaming__   map register operand names to tags or physical registers
12. _dispatch___________   allocate an ROB entry and send the decoded and renamed instruction to the instruction window

Steps in the middle of the pipeline are done out of program order:
13. _issue______________   search the instruction window and send ready instructions to execution units
14. _execute____________   execution units perform the required operation
15. _write_(or_complete)_  results from execution units are written to the ROB

Steps at the end of the pipeline are done in program order:
16. _retire_(or_commit)__  results from the oldest completed ROB entries are examined for exceptions and branch mispredictions, and if none are found, the results are written to the register file
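As a companion to the fill-in list above (especially steps 12, 15, and 16), here is a minimal C sketch of how a reorder buffer supports out-of-order completion with in-order retirement. The structure and field names are invented for illustration, and the exception/misprediction checks performed at retire are omitted.

    /* Reorder-buffer sketch: in-order dispatch, out-of-order complete,
       in-order retire.  All names are illustrative. */
    #include <stdio.h>
    #include <stdbool.h>

    #define ROB_SIZE 8

    struct rob_entry {
        int  dest_reg;   /* architectural destination register */
        long value;      /* result written back by an execution unit */
        bool done;       /* result has been written to the ROB? */
        bool valid;      /* entry is allocated? */
    };

    static struct rob_entry rob[ROB_SIZE];
    static int head = 0, tail = 0;          /* oldest entry, next free entry */

    /* Dispatch (cf. step 12): allocate the next ROB entry in program order;
       the returned index serves as the rename tag for the destination. */
    static int dispatch(int dest_reg)
    {
        int idx = tail;
        rob[idx] = (struct rob_entry){ .dest_reg = dest_reg, .valid = true };
        tail = (tail + 1) % ROB_SIZE;
        return idx;
    }

    /* Write/complete (cf. step 15): a result arrives, possibly out of order. */
    static void complete(int idx, long value)
    {
        rob[idx].value = value;
        rob[idx].done  = true;
    }

    /* Retire (cf. step 16): commit only from the oldest entry, and only once
       it has completed; younger completed entries must wait their turn. */
    static void retire(long regfile[])
    {
        while (rob[head].valid && rob[head].done) {
            regfile[rob[head].dest_reg] = rob[head].value;
            rob[head].valid = false;
            head = (head + 1) % ROB_SIZE;
        }
    }

    int main(void)
    {
        long regs[32] = { 0 };
        int t1 = dispatch(3);      /* older instruction, writes R3 */
        int t2 = dispatch(5);      /* younger instruction, writes R5 */
        complete(t2, 99);          /* the younger one finishes first ... */
        retire(regs);              /* ... but nothing retires yet        */
        complete(t1, 42);
        retire(regs);              /* now both retire, in program order  */
        printf("R3=%ld R5=%ld\n", regs[3], regs[5]);
        return 0;
    }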