CPSC 330 Fall 2011 -- Final Exam Name: ____ No calculators. 1. Give the CPU time equation in terms of clock frequency and define the terms you use. (3 pts.) CPU time = IC * CPI / CR IC = instruction count CPI = cycles per instruction CR = clock rate (a.k.a. clock frequency) or CPU time = IC * CPI * CCT CCT = clock cycle time = 1/CR 2. Consider the following instruction frequencies and CPI values. inst. type freq. cpi ------------------------------ load/store .30 2 integer ALU op .40 1 FP op .10 6 branch .20 2 a. What is the average CPI? (2 pts.) avg CPI = 0.30*2 + 0.40*1 + 0.10*6 + 0.20*2 = 0.6 + 0.4 + 0.6 + 0.4 = 2.0 Consider further that the program with the above workload has a total of 100 million instructions and that the processor on which it executes has a clock rate of 2 GHz. b. What is the CPU time required for the program execution? (2 pts.) CPU time = ( 100x10^6 insts * 2 cycle/inst )/( 2x10^9 cycles/sec ) = 100 x 10^(-3) sec = 100 msec c. What is the MIPS for this program running on this processor? (2 pts.) MIPS = 100x10^6 insts / ( 100x10^(-3) sec * 10^6 ) = 1000 MIPS or MIPS = 2 GHz / (2 CPI * 10^6) = 1000 MIPS d. What is the MFLOPS for this program running on this processor? (2 pts.) floating-point operations are 10% of instructions executed MFLOPS = 0.1 flop/inst * 1000 MIPS = 100 MFLOPS 3. Give the equation for Amdahl's law in terms of "f" and "s" and define the terms you use. (3 pts.) overall speedup = 1 / ( ( 1 - f ) + ( f / s ) ) f = fraction of program execution time that will be affected by speedup s = speedup 4. Consider the computer in question 2 above being redesigned in such a way that FP ops will have a cpi of 3 rather than 6. If we were to use Amdahl's law to determine the overall speedup from this change, what are the appropriate values of f and s? (3 pts.) f = FP fraction of execution time = FP CPI / avg CPI = 0.60 / 2.0 = 0.3 s = speedup for FP = original FP CPI / new FP CPI = 6 / 3 = 2 5. What is the overall speedup if an enhancement with speedup of 10 can be used 4/5ths of the time? (3 pts.) overall speedup = 1 / ( ( 1 - 4/5 ) + ( (4/5) / 10 ) ) = 1 / ( 1/5 + 4/50 ) = 1 / ( 10/50 + 4/50 ) = 1 / ( 14/50 ) = 50/14 = 3.57 6. Give the truth table and a sum-of-products logic expression for a two-input multiplexer, which is a function of three single-bit input values, A, B, and S, and the output is set to the value of A when the value of S equals zero and to the value of B when the value of S equals one. (4 pts.) S A B | out \ AB ------+---- A \ 00 01 11 10 S 0 0 0 | 0 +----+----+====+====+ | 0 0 1 | 0 0 | 0 | 0 || 1 | 1 || +---+ 0 1 0 | 1 +----+====+====+====+ A--|0 | 0 1 1 | 1 1 | 0 || 1 | 1 || 0 | | |-- out 1 0 0 | 0 +----+====+====+----+ B--|1 | 1 0 1 | 1 +---+ 1 1 0 | 0 out = (~S)*A + S*B 1 1 1 | 1 7. Design a circuit with inputs A, B, and S, and outputs, C and D, such that C=A and D=B when S=0, and C=B and D=A when S=1. (4 pts.) S | C D C = (~S)*A + S*B => can use mux --+---- S 0 | A B D = (~S)*B + S*A => can use mux | 1 | B A with reversed +---+ inputs A--| |-- C | | S B--| |-- D | +---+ *----------. | | +--------+ | A ------*--|0 2-to-1|------|------- C .--|--|1 mux | | | | +--------+ | | | | | | +--------+ B ---*--|-------------|0 2-to-1|--- D `-------------|1 mux | +--------+ 8. Consider a registered output, in which a signal X is routed to both a two-input multiplexer and to a D flip-flop, and the output of the D flip-flop is routed to the second input of the multiplexer. A select signal, S, is used to choose between the signal X for S=0 and the output of the D flip-flop for S=1. (10 pts.) S | .-------------. X --*---------------|0 | D(t+1) = X | .------. | 2-to-1 mux |---- Z '--| D-FF |-----|1 | Z = ~S*X + S*D(t) `------' `-------------' (a) Give the truth table for this circuit. (b) Give the state diagram for this circuit. (a) S X D(t) | D(t+1) Z ---------+--------- 0 0 0 | 0 0 0 0 1 | 0 0 0 1 0 | 1 1 0 1 1 | 1 1 ---------+--------- 1 0 0 | 0 0 1 0 1 | 0 1 1 1 0 | 1 0 1 1 1 | 1 1 (b) 01/1, 11/0 .---. .--------. .---. / v.===./ v.===./ \ 00/0, | | 0 | | 1 | | 01/1, 10/0 \ /`==='^ /`==='^ / 11/1 `---' `--------' `---' 00/0, 10/1 9. Consider the control sequences for the simple CPU we studied. load instruction: T0: PC_out, MAR_in T1: read, pcincr T2: MDR_out, IR_in T3: time step for decoding opcode in IR T4: IR_out(addr part), MAR_in T5: read T6: MDR_out, ACC_in, reset to T0 add instruction: T0: PC_out, MAR_in T1: read, pcincr T2: MDR_out, IR_in T3: time step for decoding opcode in IR T4: IR_out(addr part), MAR_in T5: read T6: ACC_out, aluadd T7: TEMP_out, ACC_in, reset to T0 store instruction: T0: PC_out, MAR_in T1: read, pcincr T2: MDR_out, IR_in T3: time step for decoding opcode in IR T4: IR_out(addr part), MAR_in T5: ACC_out, MDR_in T6: write, reset to T0 brz instruction: T0: PC_out, MAR_in (branch on zero) T1: read, pcincr T2: MDR_out, IR_in T3: time step for decoding opcode in IR T4: if (acceq0) then { IR_out(addr part), PC_in } T5: reset to T0 Give the logic expression in sum of products form for each of the following control signals as would be implemented in hardwired control circuitry for the simple CPU. (2 pts. each) (a) ACC_in (b) IR_out(addr part) (a) ACC_in = load * T6 + add * T7 (b) IR_out(addr_part) = load * T4 + add * T4 + store * T4 + brz * acceq0 * T4 10. Why do most pipelines rely on having both an instruction cache and a data cache? (3 pts.) allows instruction fetch in IF pipe stage to overlap with memory read/write in MEM stage 11. Explain why the standard five-stage pipeline has a load-use data hazard even with data forwarding. (3 pts.) the value obtained from the cache is not available until the end of the MEM stage load instruction IF ID EX MEM WB \ dependent instruction IF ID -- EX MEM WB 12. For the MIPS instruction sequence below, identify the dependencies in a data dependency diagram. (8 pts.) i1: add $1, $2, $3 // Reg[1] <- Reg[2] + Reg[3] i2: ld $4, 0($1) // Reg[4] <- Memory[ 0 + Reg[1] ] i3: sub $5, $1, $4 // Reg[5] <- Reg[1] - Reg[4] $2 $3 v v .--------. | i1:add | `--------' *------. | | | v $1/RAW | .--------. | | i2:ld | | `--------' | | $1/RAW v v $4/RAW .--------. | i3:sub | `--------' v $5 13. Explain what the term "false dependency" means and how it can impact performance in an out-of-order superscalar processor. What can be done in hardware to eliminate false dependencies? (4 pts.) A false dependency is a WAR or WAW dependency that results from register reuse rather than passing a value between two instructions. False dependencies inhibit out-of-order execution but can be removed by register renaming. 14. Explain how the gshare branch prediction scheme works. Please draw a diagram that illustrates the components required. (4 pts.) +---+ | | +------------+ +---+ PC | | | | +------------+ +---+ v index | | xor (+)-------------->+---+ PHT (table of predictors) ^ into | | +------+ +---+ BHSR | | | | +------+ +---+ | | +---+ | | +---+ | | +---+ Bits from a subfield of the PC (program counter) are xor'ed with bits from the BHSR (branch history shift register), and the result is used as an index into the PHT (pattern history table). Each entry of the PHT is an individual branch predictor (e.g., a two-bit counter). 15. Explain why a set-associative cache is typically preferred to a direct- mapped cache, even though a direct-mapped cache has a slightly faster hit time. (4 pts.) set-associative caches have fewer conflict misses than direct-mapped 16. Explain where flash memory fits in a computer system memory hierarchy. (4 pts.) between main memory and disk 17. Give the name for each of three types of policies for the cache level of the memory hierarchy and give two example policy choices each. (2 pts. pe policy type blank and 1 pt. per choice blank) policy type choice a choice b 1) ____________________ - a) ___________________, b) ___________________ 2) ____________________ - a) ___________________, b) ___________________ 3) ____________________ - a) ___________________, b) ___________________ fetch - demand, prefetch placement - direct-mapped, set-associative, fully-associative replacement - LRU, FIFO, random, ... write - write-through, write-back write miss - write-allocate, no-write-allocate write coherency - invalidate, update 18. Assume you are running the following program segment on a system with a 32 KB two-way set associative data cache with 64-byte line size. double a[2048],b[2048],c[2048],d[2048]; /* each double is 8 bytes */ ... for(i=0;i<2048;i++){ a[i] = 0.0; b[i] = 0.0; } for(i=0;i<2048;i++){ c[i] = 0.0; d[i] = 0.0; } A friend suggests that the program would run faster if you wrote the program segment in this way: double a[2048],b[2048],c[2048],d[2048]; /* each double is 8 bytes */ ... for(i=0;i<2048;i++){ a[i] = 0.0; } for(i=0;i<2048;i++){ b[i] = 0.0; } for(i=0;i<2048;i++){ c[i] = 0.0; } for(i=0;i<2048;i++){ d[i] = 0.0; } Is your friend correct? Why or why not? (6 pts.) No. There are no cache conflicts in your code, so there is no memory locality improvement in splitting the loops any further. In fact, splitting leads to higher loop overhead. 19. Explain how a cluster differs from SMP and NUMA systems in terms of load/store memory accesses. (4 pts.) A cluster implements a separate memory for each node rather than providing shared memory across the nodes. Thus, in a cluster, one node cannot access another node's memory using load and store instructions. 20. Associate each application below with aspects of parallelism. Circle C or F for coarse-grained and fine-grained parallelism, respectively. (6 pts.) a. C F quicksort b. C F matrix multiply of dense (i.e., non-sparse) matrice c. C F ray tracer a. C b. F c. F XC. Explain what each of the four states represent in the MESI cache coherency protocol. (up to 12 pts.) M - modified - cache line is dirty and is in this cache only E - exclusive - cache line is clean and is in this cache only S - shared - cache line is clean and is (or was) in another cache as well as this one I - invalid - this line is empty