CPSC 330 Spring 2009 -- Final Exam Name: ____________________ No calculators. 1. Fill in with the power of ten. (1 pt. each) exa _______ milli _______ peta _______ pico _______ tera _______ 2. Give the CPU time equation in terms of clock frequency and define the terms you use. (4 pts.) 3. Consider the following instruction frequencies, CPI values, and number of floating-point operations per instruction. inst. type freq. cpi flt. pt. ops / inst. ------------------------------------------------------------ load/store .30 2 - integer ALU op .20 1 - FP add .05 3 1 FP multiply .05 3 1 FP fused multiply-add .10 3 2 branch .30 2 - a. What is the average CPI? (2 pts.) Consider further that the program with the above workload has a total of 100 million instructions and that the processor on which it executes has a clock rate of 2 GHz. b. What is the CPU time required for the program execution? (3 pts.) c. What is the MIPS for this program running on this processor? (3 pts.) d. What is the MFLOPS for this program running on this processor? (3 pts.) 4. Give the equation for Amdahl's law in terms of "f" and "s" and define the terms you use. (4 pts.) 5. What is the overall speedup if an enhancement with speedup of 16 can be used 80% of the time? (3 pts.) 6. What fraction of normal execution time must an 16x enhancement be used to achieve an overall speedup of 8? Show the result as a fraction. (3 pts.) 7. Give the full truth table for a multiplexer and the circuit implementation. "s" is the selection control signal that chooses between inputs i0 and i1. (4 pts.) s i0 i1 | out ------------+------ | | | 8. Give the truth table for a majority-vote circuit (i.e., output is one if two or more of the inputs are one). Also give the circuit implementation using a PLA. (4 pts.) i0 i1 i2 | out -------------+------ | | | 9. Consider a two-bit saturating up-down counter. (10 pts.) (a) Draw the state transition diagram. Label the up transitions with '1' and the down transitions with '0'. (b) Give the state transition table. (c) Give a circuit implementation for this counter. 10. Contrast hardwiring and microprogramming control in terms of (a) speed and (b) flexibility. (6 pts.) Extra Credit: Was the Intel P6 hardwired, microprogrammed, or a combination of both? Explain your answer and cite any details you remember from reading about the P6. (up to 3 pts.) 11. Draw a data dependency diagram for the following code segment and label data dependencies as RAW, WAR, or WAW. (Destination registers are listed first, except for the store.) (10 pts.) lw r8,0(r1) // r8 <- mem[r1+0] (load) lw r9,0(r2) // r9 <- mem[r2+0] (load) mult r10,r8,r9 // r10 <- r8*r9 sw r10,0(r2) // mem[r2+0] <- r10 (store) addi r1,r1,4 // r1 <- r1+4 addi r2,r2,4 // r2 <- r2+4 Extra Credit: Consider a wide-issue superscalar with one-cycle execution for any operation and register renaming (like the DFA in the P6 project). What is the execution trace and the max and average IPC values for the above code segment? (up to 2 pts.) 12. Draw a pipeline cycle diagram (stairstep diagram) for the following code segment as it would be executed on the standard five-stage pipeline (which is single-issue). There is a load-use penalty of one cycle; mult requires three cycles in the EX stage; and, addi requires one cycle. Assume standard data forwarding is available in the five-stage pipeline. (The code is the same as in question 11.) (6 pts.) lw r8,0(r1) lw r9,0(r2) mult r10,r8,r9 sw r10,0(r2) addi r1,r1,4 addi r2,r2,4 13. Explain the difference between the three types of cache misses: compulsory, conflict, and capacity. (6 pts.) 14. Consider a byte-addressed computer system with four-byte instructions. If a program has an initial sequential execution section and then a loop with instruction addresses as follows .-----. | 0-8 | `-----' .---->| | .-----. | |12-40| three iterations | `-----' `-----' then the instruction fetch address stream is 0,4,8,12,16,20,24,28,32,36,40,12,16,20,24,28,32,36,40,12,16,20,24,28,32,36,40 \_____iteration 1_____/ \_____iteration 2_____/ \_____iteration 3_____/ Consider a 32-byte direct-mapped instruction cache with 8 bytes per line. Determine the cache hits and misses to execute the program above. (10 pts.) 15. Give an example of an "embarrassingly parallel" workload and explain what makes it so parallel. (4 pts.) 16. Contrast a NUMA system and a cluster in terms of (a) load/store memory accesses and (b) processor interconnections. (3 pts.) 17. Explain why write-invalidate is typically chosen over write-update in a cache coherency protocol. (3 pts.) 18. Compare and contrast (a) historical SIMD and (b) what Intel describes as SIMD in its Pentium III/4 and Core processors. (4 pts.) Extra Credit: Identify and explain the meaning of the states of the four- state cache coherency protocol we studied. (up to 5 pts.) Extra Credit: Explain the purpose and major features of the roofline model. (up to 5 pts.)