330 Fall 2010 -- Final Exam Name: ____________________ No calculators. Most problems can be worked with simple fractions. 1. a. Find the execution time of a program that executes 4 billion instructions on a processor with an average CPI of 0.5 and a clock frequency of 2 GHz. (2 pts.) b. What is the MIPS value for the machine described in part (a)? (2 pts.) 2. Use Amdahl's Law to determine what fraction of time a ten times speedup must be used to obtain an overall speedup of five? Give your answer as a fraction. (3 pts.) 3. Give a circuit implementation for an equality circuit. (2 pts.) a b | equality -------+------------- 0 0 | 1 0 1 | 0 1 0 | 0 1 1 | 1 4. Consider a pipelined computer system with this workload and CPIs. inst. type freq. cpi ------------------------- load/store .3 2 integer ALU op .4 1 taken branch .2 4 untaken branch .1 2 You are given three suggestions on how to improve the performance. Evaluate each suggestion quantitatively in terms of speedup with respect to the original system, and rank them in recommended order. (12 pts.) (a) Increase the clock frequency by a factor of 1.2. (b) Implement an optimizing compiler that changes (i.e., reduces) the instruction count by a factor of 8/9 and also alters the instruction frequencies as follows: inst. type freq. cpi ------------------------- load/store .3 2 integer ALU op .5 1 taken branch .15 4 untaken branch .05 2 (c) Implement a dynamic branch prediction scheme that alters the frequencies and CPI values as follows: inst. type freq. cpi ------------------------------------- load/store .3 2 integer ALU op .4 1 correctly predicted branch .29 1 mispredicted branch .01 21 5. Consider the design of a serial comparator takes that two inputs A and B. The comparator has three valid states: state meaning 00 A==B 01 AB 11 never reached from stating state of 00 Assume the comparator starts in state 00 and changes state as each pair of bits, A and B, are examined. (Two unsigned words, a and b, are processed from the least significant bit to most significant bit. Thus, in the first step, a0 and b0 are supplied to the A and B inputs. In the second step, a1 and b1 are supplied. Etc.) Thus, comparing a=1001 with b=0011 means you will run through the following steps: initially: state starts at 00 (a==b) 0: compare A=a0=1 and B=b0=1: state stays at 00 (a==b) 1: compare A=a1=0 and B=b1=1: state changes to 01 (ab) thus, for these example four-bit unsigned values a=1001 and b=0011, a>b Note: transitions from state 11 are don't cares (d's). Also, there is no need to design a state-reset capability for this problem. a) give the state diagram (4 pts.) b) give the truth table (there is no need for output columns) (4 pts.) c) design a circuit (output is merely the state values) (4 pts.) 6. Identify the five stages of the simple scalar pipeline we studied and explain what each stage does when processing the following instruction. (5 pts.) load R2,0(R1) // R2 <- memory[R1+0] 7. Draw the IF stage of a pipeline with branch prediction and indicate how the PC, icache, BHT, and BTAC are connected. Explain how branches are predicted. (4 pts.) 8. Associate each term or statement below with a type of dependency. Circle one or more of RAW, WAR, or WAW. (Destination registers are listed first.) (4 pts.) a. RAW WAR WAW true data dependency b. RAW WAR WAW false data dependency c. RAW WAR WAW load R2,0(R1) followed by add R1,R1,4 d. RAW WAR WAW load R2,0(R1) followed by add R2,R2,4 9. Consider the statement "The load-use penalty in the simple five-stage pipeline is one cycle." (a) Does this apply to the pipeline when using forwarding or when not using forwarding? Explain. (2 pts.) (b) What is a software technique to avoid a load-use penalty? Explain. (2 pts.) 10. Explain how the P6 or other modern microprocessor can support out-of-order execution yet provide for precise exceptions. (5 pts.) 11. Associate each term or statement below with aspects of ILP. Circle S or V, for Superscalar or VLIW, respectively. (4 pts.) a. S V typically uses hardware register renaming b. S V places complexity in the compiler rather than the hardware c. S V provides code compatibility across different implementations that have different number of pipelines and/or latencies d. S V design approach of current Intel x86 processors such as the i7 12. Identify at least three difference between DRAM and SRAM. (3 pts.) 13. Consider a 4 GB byte-addressable main memory with a four-way set- associative cache of 1 MB and 16 bytes per line. a) How many total lines are there in cache? (not just per bank) (1 pt.) b) Show how the main memory address is partitioned into fields for the cache access and give the bit lengths of those fields. (4 pts.) c) How many total tag comparators are there in the cache? (1 pt.) 14. Assume a 256-byte main memory and a four-line cache with two bytes per line. The cache is initially empty. For the byte address reference stream (reads) given below circle which of the references are hits for the different cache placement schemes. Also, show the final contents of the cache. (The byte addresses are in decimal.) a) direct-mapped (6 pts.) 2, 10, 3, 26, 12, 4, 12, 10, 5, 12 b) two-way set associative with LRU replacement (6 pts.) 2, 10, 3, 26, 12, 4, 12, 10, 5, 12 c) fully-associative with FIFO replacement (6 pts.) 2, 10, 3, 26, 12, 4, 12, 10, 5, 12 15. Assume you are running the following program segment on a system with a 32 KB two-way set associative data cache with 64-byte line size. double a[2048],b[2048],c[2048],d[2048]; /* each double is 8 bytes */ ... for(i=0;i<2048;i++){ a[i] = 0.0; b[i] = 0.0; c[i] = 0.0; d[i] = 0.0; } A friend suggests that it would be faster to write the program segment in this way: double a[2048],b[2048],c[2048],d[2048]; /* each double is 8 bytes */ ... for(i=0;i<2048;i++){ a[i] = 0.0; } for(i=0;i<2048;i++){ b[i] = 0.0; } for(i=0;i<2048;i++){ c[i] = 0.0; } for(i=0;i<2048;i++){ d[i] = 0.0; } Is your friend correct? Why or why not? (3 pts.) 16. Identify two reasons why a cache line is usually larger than one word in size. (2 pts.) 17. Explain the following three terms for parallel processor systems. (3 pts.) (a) UMA (b) NUMA (c) NORMA 18. Explain how a cluster differs from an SMP in terms of ld/st memory accesses and processor interconnections. (3 pts.) 19. Classify a GPU in Flynn notation (SISD, SIMD, MIMD) and explain your choice. (3 pts.)