CPSC 330 - Spring 2007 Study Guide for Exam 2 Coverage: memory and CPU implementation 1. Be able to define or match these terms random access memory (RAM) static RAM (SRAM) dynamic RAM (DRAM) refresh memory cell array row decoder row address strobe (RAS) row buffer column decoder column address strobe (CAS) page-mode DRAM read-only memory (ROM) programmable ROM (PROM) erasable PROM (EPROM) memory bus centralized arbitration daisy chain arbitration decentralized arbitration ring (token) arbitration memory hierarchy locality spatial locality temporal locality working set hit rate miss rate hit time miss penalty multilevel cache primary cache (level one, L1) secondary cache (level two, L2) fetch policy (demand fetch or prefetch) placement policy (fully associative, set-associative, direct-mapped) replacement policy (LRU, random, etc.) cache miss refill burst transfer over memory bus for refill write-hit policy (write-through or write-back) write-miss policy (write-allocate or write-no-allocate) cache lines (often called cache blocks) tag valid bit direct-mapped cache set-associative cache fully associative cache compulsory miss capacity miss conflict miss thrashing write-through write-back dirty bit (also known as modified bit and changed bit) least recently used (LRU) register internal CPU bus tri-state buffer CPU data path control control signal hardwired control microprogrammed control control store 2. Be able to: A. Draw a block diagram (high-level circuit) showing the two-dimensional organization of a RAM and/or identify the components and signals required to access RAM. B. Draw a timing diagram of bus and RAM chip activity for a memory read. Be able to identify differences for burst transfer mode. C. Give typical capacity, latency, and block size values for the memory hierarchy components. D. Identify at least one example each of temporal and spatial locality of memory references. E. Given memory and cache parameters, give the tag, index, and offset field sizes within the main memory address. F. Given an address stream and cache parameters, determine the number of misses. G. Explain how row versus column access to a matrix affects cache misses. (E.g., discuss how misses were reduced in the matrix multiply example.) H. Given a high-level RTL statement, derive the necessary step-by-step RTL and/or control signals to implement that statement on a given datapath. I. Given a user-level instruction, derive the necessary step-by-step RTL and/or control signals to implement the fetch/decode/execute of that instruction on a given datapath. J. Explain the purpose/use of the fields in a microinstruction from the microprogramming example we studied in class. Be prepared to work problems as given in homework 2. Example questions 1. Matching. Write the correct term from the list into each blank. static RAM (SRAM) centralized arbitration memory bus dynamic RAM (DRAM) decentralized arbitration internal CPU bus erasable PROM (EPROM) hardwired control data path control store microprogrammed control control (a) _____________________ a data transfer path within the CPU involving registers such as PC, IR, etc. (b) _____________________ a type of semiconductor memory that needs to be periodically refreshed (c) _____________________ generation of control signals from random logic or a PLA (d) _____________________ a token ring arrangement among memory bus masters is an example of this (e) _____________________ a data transfer path involving only the MAR and MDR registers within the CPU 2. Write a C function that could be used in a memory simulator that prints the word addresses of memory words transferred on a burst transfer of 16 bytes (four words). The addresses printed must start with a word aligned on a 16-byte boundary and must include the address passed to the function. (Addresses may be up to 32 bits.) The prototype of the function is void burst_addrs( unsigned int word_addr ); burst_addrs( 0x100 ) prints: 0x100 0x104 0x108 0x10c burst_addrs( 0x104 ) prints: 0x100 0x104 0x108 0x10c burst_addrs( 0xff5 ) prints: 0xff0 0xff4 0xff8 0xffc burst_addrs( 0xff8 ) prints: 0xff0 0xff4 0xff8 0xffc 3. Explain why a burst transfer is more efficient than transferring the four words separately. Draw bus timing diagrams of the two different approaches. 4. T / F Locality of reference is a property of the CPU on which a program is running (e.g., size of the cache, degree of set associativity). 5. T / F A direct-mapped cache needs an index field to find a cache line, whereas a set-associative doesn't. 6. T / F An n-way set associative cache requires n to be a power of two (i.e., 2-way, 4-way, 8-way, ...) 7. T / F A valid bit is also sometimes called a dirty (or modified) bit. 8. T / F A write-back cache needs a cache line replacement (or eviction) policy, but a write-through cache doesn't. 9. T / F Steps for accessing a fully-associative cache are 1) index lookup, 2) tag match, 3) routing, and 4) offset selection. 10. T / F Fully associative caches can have conflict and capacity misses but not compulsory misses. 11. If the main memory access time is 200 ns, and the cache measurements are: hit time of 20 ns, miss time of 220 ns, and hit rate of 80% - what is the memory access speedup of using a cache versus not using a cache? (Assume that the miss time includes all cache actions necessary to determine the miss, refill the line, and forward the data to the CPU.) 12. Explain why a cache line would ever be larger than a single word. 13. Consider a 4 GB byte-addressable main memory with a three-way set-associative cache of 3 MB and 64 bytes per line. a) How many total lines are there in cache? (not just per bank) b) Show how the main memory address is partitioned. c) How many total tag comparators are there in the cache? 14. Assume a 256-byte main memory and a four-line cache with two bytes per line. The cache is initially empty. For the byte address reference stream (reads) given below circle which of the references are hits for the different cache placement schemes. a) direct-mapped 0, 8, 1, 25, 9, 16, 8, 17, 24, 9 b) two-way set associative with LRU replacement 0, 8, 1, 25, 9, 16, 8, 17, 24, 9 c) fully-associative with FIFO replacement 0, 8, 1, 25, 9, 16, 8, 17, 24, 9 15. Consider the following datapath. (Assume all registers are edge-triggered and thus immune from races.) Control signal names are given near the in and out control points of the registers. Additional control signals include memory signals R and W and ALU function selector F. A.bus B.bus _ _ PC.A.in | |--o->+-----+<-o--| | PC.B.in | | | PC | | | PC.A.out | |<-o--+-----+--o->| | PC.B.out | | | | IR.A.in | |--o->+-----+<-o--| | IR.B.in | | | IR | | | IR_addr.A.out | |<-o--+-----+--o->| | IR_addr.B.out | | | | MDR.A.in | |--o->+-----+<-o--| | MDR.B.in | | | MDR | | | MDR.A.out | |<-o--+-----+--o->| | MDR.B.out | | | ^ | | | | v | | | memory functions | | R->(memory)<-W | | ------------------- | | ^ | | R = read, W = write | | | | | MAR.A.in | |--o->+-----+<-o--| | MAR.B.in | | | MAR | | | MAR.A.out | |<-o--+-----+--o->| | MAR.B.out | | | | ACC.A.in | |--o->+-----+<-o--| | ACC.B.in | | | ACC | | | ACC.A.out | |<-o--+-----+--o->| | ACC.B.out | | | | Y.A.in | |--o->+-----+<-o--| | Y.B.in | | | Y | | | Y.A.out | |<-o--+-----+--o->| | Y.B.out | | | | | | | | .---|_| | | | | | | Y v B v ALU functions (three-bit F field) | | +---+ +---+ --------------------------------- | | \ V / 000 = B+Y 100 = B-Y | | F->\ / 001 = B 101 = not B | | +---+ 010 = B+1 110 = B-1 | | | 011 = B<<1 111 = B>>1 ALU.A.out |_|<-o-------+ Give the step-by-step RTL and the control signal sequence for an accumulator machine subtract to fetch and execute the instruction SUB X. Assume the instruction has an opcode and address field placed together in a single word and that the memory is word addressable. 16. Consider the microprogrammed implementation of the simple CPU. cycle PC IR MAR MDR ACC TMP CSAR CSIR ----------------------------------------------------------------- 1: 0 0 0 0 0 0 0 00000100010000000100 MAR_in PC_out 2: 0 0 0 0 0 0 2 00000000001100000110 pc_incr read 3: 1 0 0 8 0 0 3 00010001000000001000 IR_in MDR_out 4: 1 8 0 8 0 0 4 00000000000000100000 br_table 5: 1 8 0 8 0 0 5 00001100000000001100 IR_out MAR_in 6: 1 8 8 8 0 0 6 00000000000100001110 read 7: 1 8 8 2 0 0 7 10000001000000000000 ACC_in MDR_out ----------------------------------------------------------------- (a) The control store held 16 words of 20 bits each, yet only the first 14 bits were used as control signals sent to the datapath. What is the general purpose for including an extra 6 bits in the fields br_table, next_address, and or_address? [Hint: a one word answer suffices.] (b) What effect did the br_table (branch via table) have, and why was it used?