Pipelining Review -- Mark Smotherman

Clemson University
CPSC 464/664 Lecture Notes
Fall 2003
Mark Smotherman

Pipelining Review (background material from Appendix A)

Hardware review
1. datapath - registers, ALU, shifters, internal busses, and control points
2. control signals activate control points to allow register contents to be copied onto busses, to select ALU/shifter operations, and to allow values on busses to replace register contents (called register transfers or micro-operations)
3. instruction fetch/execute cycle governed by activation of control signals in appropriate sequence
  1. IF - memory access for instruction fetch (can update PC in parallel)
  2. ID - instruction decode (register fetch in parallel if fixed-field format)
  3. EA - effective address calculation (possible memory access or register fetch)
  4. OF - operand fetch (memory access or register fetch)
  5. EX - execute
  6. OS - operand store (memory access or register write back)
  7. IC - if enabled, check for interrupts
4. load/store architectures combine address calculation with execution and allow only one data memory access per inst. (IF, ID/RF, EX/EA, OF/OS, WB) -- thus no structural hazards for competing data accesses by EA (when indirect addressing is being done), OF, and OS
5. hardwired control unit
  1. input = clock phases or state, IR, condition codes; output = control signals
  2. control signals are sums of products (e.g. Rin = ADD*T3 + SUB*T3 + ...)
  3. can use gates to generate control signals (random logic) or PLA
6. microprogrammed control (see http://www.cs.clemson.edu/~mark/uprog.html)
  1. package all control signals for a given clock cycle into a microinstruction (relatively unencoded format called a "horizontal" uinst.; highly encoded format called a "vertical" uinst.)
  2. microinstruction sequencing can be similar to machine language sequencing
  3. instruction decode by using the opcode as index into a jump table
  4. parallelism by packing several datapath operations in the same microinstruction
  5. microprogramming provides flexibility - updates to change or correct instruction set
  6. in 1970 control stores were 10 times faster than main memory - this made microprogramming attractive; now, icaches make it less attractive (RISC can be seen as "compiling to microcode")
7. now most designs are cache-based and have gone back to hardwired control
8. minimize CPI by minimizing the number of clocks/states/microinstructions in execution path
9. minimize clock cycle time by redesigning critical paths in execution path
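
The sum-of-products form in item 5 can be sketched in software. This is only an illustration: the table holds just the two `Rin` product terms quoted above (the elided terms stay elided), and the names `PRODUCT_TERMS` and `asserted` are hypothetical rather than taken from any real design.

```python
# Hypothetical PLA-style table: each control signal is a list of
# (opcode, timing state) product terms that are ORed together.
PRODUCT_TERMS = {
    # Rin = ADD*T3 + SUB*T3 (further terms are elided in the notes)
    "Rin": [("ADD", 3), ("SUB", 3)],
}


def asserted(opcode, t):
    """Return the set of control signals active for this opcode in timing state t."""
    return {
        signal
        for signal, terms in PRODUCT_TERMS.items()
        if any(opcode == op and t == state for op, state in terms)
    }


if __name__ == "__main__":
    for opcode, t in [("ADD", 3), ("SUB", 3), ("ADD", 2)]:
        print(opcode, f"T{t}", sorted(asserted(opcode, t)))
```

Each inner tuple plays the role of one AND gate (a product term) and the `any(...)` plays the role of the OR gate; a PLA stores the same information as a personality matrix.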

Technology changes

| | 1995 | 2005 |
|---|---|---|
| memory latency | low (50 cycles) | high (100-500 cycles) |
| L1 cache latency | low (1 cycle) | multi-cycle (2-3 cycles) |
| pipelines | short (5 stages) | long (20-50 stages) |
| operand forwarding paths | full network | clustered subnets, multi-cycle |

Pipelining
1. exploits parallelism of instruction execution stages to improve throughput
  1. divide instruction into pipe stages, speedup in best case approx. equal to the # of pipe stages
  2. goals
    1. balanced stages (each takes approx. same time since slowest stage limits clock rate)
    2. fixed ordering of stages (each instruction does approx. the same work and follows the same path through the stages, no empty stages if possible)
    3. independent stages (so stages don't need to stall waiting for values)
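
The effect of unbalanced stages on the best-case speedup can be checked with a few lines of arithmetic. This is a rough sketch that assumes steady-state operation with no stalls; the stage delays and the latch overhead in the examples are made-up numbers, and `pipeline_speedup` is a hypothetical helper name.

```python
def pipeline_speedup(stage_delays_ns, latch_overhead_ns=0.0):
    """Best-case speedup of a pipelined design over an unpipelined one.

    Unpipelined: one instruction takes the sum of all stage delays.
    Pipelined: one instruction completes per clock, and the clock period
    is set by the slowest stage plus any latch overhead.
    """
    unpipelined_time = sum(stage_delays_ns)
    clock_period = max(stage_delays_ns) + latch_overhead_ns
    return unpipelined_time / clock_period


if __name__ == "__main__":
    print(pipeline_speedup([10, 10, 10, 10, 10]))      # perfectly balanced
    print(pipeline_speedup([10, 8, 10, 10, 7], 1.0))   # slightly unbalanced, with overhead
    print(pipeline_speedup([5, 5, 20, 5, 5]))          # one slow stage dominates
```

With five perfectly balanced stages the speedup equals the number of stages; a single stage that is four times slower than the others cuts it to 2 in this model.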
2. RISC caters to pipelining
  1. uniform instruction length (for ease of determining and fetching next inst.)
  2. simple, fixed-field instruction formats (for ease of decoding and overlapping reg. fetch with decode)
  3. load/store architecture (so at most one memory data access per inst.)
  4. simple, single-cycle operations (for single execute stage)
  5. simple addressing modes (esp. no indirect addressing)
  6. explicit specification of resources (for ease of dependency checking)
  7. one result per instruction (avoid condition code as side-effect)
  8. CISC note: pipeline frequent-case, RISC-like inst. subset; trap and emulate on other insts. (MicroVAX-32, Motorola ColdFire)
3. example pipeline for RISC architecture
```
  IF | ID | EX | DC | WB
```
  1. IF - instruction fetch and update PC
  2. ID - instruction decode and read operands from register file (fixed-field decoding)
  3. EX - execute (alu inst.) or calculate effective address for memory access (ld/st)
  4. DC - data cache access (ld/st inst., will be empty for alu inst.)
  5. WB - write back results to register file (for ld/alu inst., empty for st inst.)
4. resource requirements
```
                 .--.
  IF  ID  EX  DC |WB|
      IF  ID  EX |DC| WB
          IF  ID |EX| DC  WB
              IF |ID| EX  DC  WB
                 |IF| ID  EX  DC  WB
                 `--'
```
  1. IF+DC => 2 memory access ports (Harvard architecture - icache and dcache)
  2. IF+EX => PC needs own incrementer since EX must use ALU
  3. ID+WB => register file needs two read ports and one write port
5. hazards - conditions that prevent the next instruction from executing
  1. hazards reduce ability to complete one instruction each cycle, and thus speedup scales as: pipeline depth / (1 + avg. pipeline stall cycles per inst.)
  2. structural hazards - due to resource conflicts (two insts. need to use same unit)
    1. soln. 1: stall second instruction - increases CPI so do this only for rare events
    2. soln. 2: duplicate the resource (e.g., icache + dcache) - can be costly
    3. avoid structural hazard if each instruction uses a resource at most once, always in the same pipeline stage, and for only one cycle => can do this for RISCs but not all CISCs
  3. data hazards - due to operand conflicts (two insts. will access same operand location; must appear that insts. are executed sequentially)
    1. true data dependency - RAW - read must happen after write (or read gets the wrong old value)
      1. soln. 1: detect and stall, e.g., register scoreboard
        
        vector of "busy" bits, one per register
        stall instruction in ID until all source registers are free
        set busy bit of destination register for instruction exiting ID
        reset busy bit of destination register after writing in WB
        
        ld   r2,0(r1)   IF  ID. EX  DC  WB.
        dadd r4,r2,r3       IF. --  --  --. ID  EX  DC  WB
                              set-------reset   r2 busy
      2. soln. 2: detect and stall, but forward through the register file
        
        clock phase 1: WB writes into register file
        clock phase 2: ID reads from register file
        
        ld   r2,0(r1)   IF  ID. EX  DC  WB
        dadd r4,r2,r3       IF. --  --  ID  EX  DC  WB
                              set-----reset   r2 busy
      3. soln. 3: detect and forward the data from an earlier instruction
        
        add forwarding paths to bring values from EX/MEM and MEM/WB latches to muxes in front of the A and B legs of the ALU
        note that loads still have an inherent load-use penalty
        
        ld   r2,0(r1)   IF  ID  EX  DC  WB
                                      \
        dadd r4,r2,r3       IF  --  ID  EX  DC  WB
                                          \
        dsub r6,r4,r5           --  IF  ID  EX  DC  WB
    2. anti-dependency - WAR - write after read (or read gets wrong new value); won't occur in scalar pipeline if all reads in ID
    3. output dependency - WAW - write after write (or correct value overwritten); won't occur in scalar pipeline if all writes in WB
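
The stall counts for the three RAW solutions above can be reproduced with a small timing model. This is a sketch under simplifying assumptions, not a cycle-accurate simulator: it models only the five-stage scalar pipeline of these notes, tracks RAW dependences only, and expresses each scheme as the minimum distance between a producer's ID cycle and a consumer's ID cycle. The names `MIN_ID_GAP` and `id_cycles` are hypothetical.

```python
# Minimum distance from a producer's ID cycle to a dependent consumer's ID cycle.
# Each entry is (gap after an ALU producer, gap after a load producer).
MIN_ID_GAP = {
    "scoreboard": (4, 4),  # soln. 1: consumer's ID must follow the producer's WB
    "regfile":    (3, 3),  # soln. 2: consumer's ID shares a cycle with the producer's WB
    "forwarding": (1, 2),  # soln. 3: ALU result feeds the next EX; a load needs DC first
}


def id_cycles(program, scheme):
    """Return the cycle in which each instruction occupies ID for the last time.

    program is a list of (dest_register, source_registers, is_load) tuples.
    """
    alu_gap, load_gap = MIN_ID_GAP[scheme]
    ready = {}       # register -> earliest ID cycle for an instruction reading it
    cycles = []
    previous_id = 1  # so the first instruction has IF in cycle 1 and ID in cycle 2
    for dest, sources, is_load in program:
        cycle = previous_id + 1  # in-order issue: at best one cycle behind
        for reg in sources:
            cycle = max(cycle, ready.get(reg, 0))
        cycles.append(cycle)
        ready[dest] = cycle + (load_gap if is_load else alu_gap)
        previous_id = cycle
    return cycles


if __name__ == "__main__":
    program = [
        ("r2", ["r1"], True),         # ld   r2,0(r1)
        ("r4", ["r2", "r3"], False),  # dadd r4,r2,r3
        ("r6", ["r4", "r5"], False),  # dsub r6,r4,r5
    ]
    for scheme in MIN_ID_GAP:
        print(scheme, id_cycles(program, scheme))
```

For the `ld`/`dadd` pair the model gives three stall cycles with the scoreboard, two with forwarding through the register file, and one (the load-use penalty) with full forwarding, in agreement with the diagrams.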
  4. control hazards - due to timing conflicts in changing instruction sequence
    1. branches appear about every five insts., and a control flow change occurs about every ten
      1. (note: these frequencies still need to be updated to the 3rd ed. of the text)
      2. 10% of insts. are untaken conditional branches (SPECint92 frequencies for DLX)
      3. 6% of insts. are taken conditional branches
      4. 4% of insts. are (taken) unconditional branches, calls, returns
    2. waiting for conditional branches to execute typically not done, too slow
```
                                          branch CPI = 5 (taken or untaken)
  a:   br t   IF  ID  EX  DC  WB
  t:              --  --  --  --  IF  ID  EX  DC  WB
```
    3. predict untaken, advantage is that it uses the normally-updated PC
```
  a:   br t   IF  ID  EX  DC  WB.         untaken branch CPI = 1
  a+4:            IF  ID  EX  DC|           taken branch CPI = 5 (or less if
  a+8:                IF  ID  EX|                                you accelerate
  a+12:                   IF  ID|                                BTA calc. and
  a+16:                       IF|                                decision)
                                v
  t:                              IF  ID  EX  DC  WB
```
    4. delayed branch
      1. accelerate decision (taken/untaken) - perform simple comparison (such as compare to 0) in earlier pipe stage (e.g., ID)
      2. accelerate BTA calculation - use extra address adder to determine BTA in earlier pipe stage (e.g., ID)
```
  a:   br t   IF  ID. EX  DC  WB          branch CPI = 1 (taken or untaken)
  a+4:            IF| ID  EX  DC  WB      delay slot inst. always executed
                    v
  t:                  IF  ID  EX  DC  WB
```
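
The branch CPIs in the three diagrams above can be combined with the branch frequencies and the speedup relation from the start of the hazards item. This is a back-of-the-envelope sketch: it assumes branches are the only source of stalls, that every delay slot holds a useful instruction, and that unconditional branches pay the taken-branch cost. The names `FREQ`, `EXTRA_CYCLES`, `cpi`, and `speedup` are hypothetical.

```python
PIPELINE_DEPTH = 5

# Fraction of all instructions, from the SPECint92/DLX figures quoted in the notes.
FREQ = {"untaken_cond": 0.10, "taken_cond": 0.06, "uncond": 0.04}

# Extra cycles per branch (branch CPI minus 1), read off the diagrams.
EXTRA_CYCLES = {
    "stall_until_done": {"untaken_cond": 4, "taken_cond": 4, "uncond": 4},
    "predict_untaken":  {"untaken_cond": 0, "taken_cond": 4, "uncond": 4},
    "delayed_branch":   {"untaken_cond": 0, "taken_cond": 0, "uncond": 0},
}


def cpi(scheme):
    """Average CPI when branch stalls are the only stalls."""
    return 1 + sum(FREQ[kind] * EXTRA_CYCLES[scheme][kind] for kind in FREQ)


def speedup(scheme):
    """Pipeline depth / (1 + average stall cycles per instruction)."""
    return PIPELINE_DEPTH / cpi(scheme)


if __name__ == "__main__":
    for scheme in EXTRA_CYCLES:
        print(f"{scheme:18s} CPI = {cpi(scheme):.2f}  speedup = {speedup(scheme):.2f}")
```

Under these assumptions the CPI is about 1.8 when every branch waits for completion, about 1.4 with predict-untaken, and 1.0 with delayed branches, so the speedup over the unpipelined machine rises from roughly 2.8 to 3.6 to 5.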
    5. changes to ISA to reduce effect of branches
      1. delayed branches (possibly with annulling)
      2. conditional moves (sometimes called partial predication)
      3. full instruction set predication (e.g., IA-64)
      4. prepare to branch / multiway branch with just one change in PC
    6. branch prediction
6. instruction scheduling by compiler
  1. scheduling for multi-cycle operations (e.g., load-use), interacts with register allocation (increases live ranges and thus increases register pressure)
  2. scheduling of branch delay slots - independent inst. from before branch, benign inst. from taken or untaken path, or nop
  3. compiler branch elimination
    1. if-conversion - change code to use predication or conditional moves
    2. unrolling loops
    3. hand-crafted code (e.g., anding results with all-0 or all-1 masks)
    4. superoptimizer - near-exhaustive searches for equivalent but branchless code segments (e.g., Denali, also see Massalin's original paper and Granlund and Kenner's paper)
  4. different optimizations required for different family members, e.g., FP code on the FP-stack-based x86:
    1. 486 - avoid an excessive number of FXCH instructions since they require execution by the FPU and thus reduce the FP performance
    2. classic Pentium - FXCH instructions can be strategically paired with other instructions to improve performance
    3. Pentium II/III (P6 core) - FXCHs are essentially free (they are handled by the register renaming hardware)
    4. Pentium 4 - avoid FXCHs since they consume slots in the trace cache and there are also issue slot restrictions (use SSE2 instructions instead where possible)
  5. basic block scheduling (one entry, one exit) (see example)
  6. global scheduling - across branches
  7. superblock / hyperblock scheduling (one entry, multiple exits)
7. interrupts/faults/exceptions complicate pipeline design
  1. an external interrupt can occur between instructions
    1. immediately flush instructions in pipe (e.g., disable WB) or let pipe drain
```
                    !
  IF  ID  EX  DC  WB|
      IF  ID  EX  DC|flush   <-- save this address to resume
          IF  ID  EX|flush
              IF  ID|flush
                  IF|flush
                    v
                     TRAP ID  EX  DC  WB
                          IF  ID  EX  DC  WB        instructions
                              IF  ID  EX  DC  WB    from handler
or
                    !
  IF  ID  EX  DC  WB|
      IF  ID  EX  DC| WB              \
          IF  ID  EX| DC  WB           | but what happens to exceptions
              IF  ID| EX  DC  WB       | that occur in these instructions?
                  IF| ID  EX  DC  WB  /
                    v                 <-- save normal next IF address to resume
                     TRAP ID  EX  DC  WB
                          IF  ID  EX  DC  WB          instructions
                              IF  ID  EX  DC  WB    from handler
```
    2. run handler (perhaps by inserting trap instruction into the pipe at IF)
    3. restart the interrupted instruction sequence
  2. to support an interrupt between a delayed branch and the delay slot instruction, you must keep multiple PCs
```
  x:   delayed branch to y  <-- interrupt after this
  x+4: delay slot
  y:   target instruction

  sequence w/o interrupt:   x, x+4, y, y+4, ...

  sequence after interrupt:    x+4, y, y+4, ...
    (but x+4 is not the branch to y!)

  so must save PC = x+4 and NPC = y
```
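
The need for both PC and NPC can be seen by stepping a tiny model of delayed-branch sequencing. This is an illustrative sketch only: the addresses 100 (for x) and 200 (for y) are arbitrary, and `run` is a hypothetical helper that models a branch as updating NPC while the delay slot executes from the old NPC.

```python
def run(pc, npc, branch_targets, steps):
    """Return the addresses executed, given a starting PC/NPC pair.

    branch_targets maps the address of a delayed branch to its target.
    """
    trace = []
    for _ in range(steps):
        trace.append(pc)
        target = branch_targets.get(pc)
        # A delayed branch changes the instruction after the delay slot,
        # so it overrides the next NPC rather than the next PC.
        next_npc = target if target is not None else npc + 4
        pc, npc = npc, next_npc
    return trace


if __name__ == "__main__":
    branches = {100: 200}                 # x = 100 branches to y = 200
    print(run(100, 104, branches, 4))     # uninterrupted sequence
    print(run(104, 200, branches, 3))     # resume with saved PC and NPC
    print(run(104, 108, branches, 3))     # resume with PC only: target is lost
```

Resuming with only PC = x+4 falls through to x+8, because the instruction at x+4 is not itself a branch; restoring NPC = y as well reproduces the original sequence.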
  3. a fault/exception can occur within an instruction
    1. let prior instructions finish
    2. flush faulting instruction and subsequent ones from pipe
    3. restore any changed registers (e.g., page fault after autoincrement)
    4. run handler (perhaps by inserting trap instruction into the pipe)
    5. restart the faulting instruction and subsequent ones
  4. for longer-running instructions such as string move
    1. ISA specifies use of registers for count, source address, and destination address
    2. leave registers as-is upon occurrence of interrupt/fault/exception
    3. after handler runs, use "continue" rather than "restart" actions
  5. out-of-order interrupts (e.g., inst. page fault in inst. i+1 seen before data page fault in inst. i)
    1. soln. 1: handle upon occurrence, restart all insts. that were in the pipe at that time - disadvantage is that interrupts are not handled as in sequential machine
```
  IF  ID  EX [DC] ..             <-- data page fault
     [IF] ..  ..| ..  ..         <-- inst page fault
        v       |
         TRAP ID| EX  DC  WB     <-- start inst page fault handler
              IF| ID  EX  DC  WB
                v
                ?TRAP?           <-- start data page fault handler?
                                     (or wait until after inst page
                                      fault handler completes?)
```
    2. soln. 2: status vector per instruction, set by faulting stage, inspected in WB - disadvantage is extra latency for inst. to get to WB stage, advantage is in-order handling
```
  IF  ID  EX [DC] ..                 <-- data page fault, seen in WB
     [IF] ..  ..| ..  ..             <-- inst page fault, ignored b/c WB disabled
         [IF] ..| ..  ..  ..         <-- ignored because WB disabled
             [IF| ..  ..  ..  ..     <-- ignored because WB disabled
                v
                 TRAP ID  EX  DC  WB <-- start data page fault handler

  after handler runs, resume with inst. that caused data page fault
```
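
The difference between the two solutions comes down to which fault is selected first. The sketch below assumes one instruction enters the five-stage pipe per cycle with no stalls, so instruction i (counting from 0) reaches a stage in cycle i + 1 + the stage's position; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "DC", "WB"]


def detection_cycle(fault):
    """Cycle in which a fault is raised; fault is (instruction index, stage)."""
    index, stage = fault
    return index + 1 + STAGES.index(stage)


def handled_first_on_occurrence(faults):
    """Soln. 1: the fault that happens earliest in time is handled first."""
    return min(faults, key=detection_cycle)


def handled_first_with_status_vector(faults):
    """Soln. 2: faults are examined in WB, so the oldest instruction wins."""
    return min(faults, key=lambda f: (f[0], STAGES.index(f[1])))


if __name__ == "__main__":
    # Inst. i has a data page fault in DC; inst. i+1 has an inst. page fault in IF.
    faults = [(0, "DC"), (1, "IF")]
    print(handled_first_on_occurrence(faults))
    print(handled_first_with_status_vector(faults))
```

Handling on occurrence picks the instruction page fault of the younger instruction, while the status-vector scheme picks the data page fault of the older one, which is what a sequential machine would report.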
  6. the frequent case should be fast, but the rare case MUST BE CORRECT

Multiple-cycle instructions (e.g., integer multiply, floating point) complicate pipeline design
1. soln. 1: eliminate them (e.g., SPARC version 7 IU had no integer multiply or divide)
2. soln. 2: stall pipeline while instruction spends multiple cycles in EX
```
    IF|ID|EX|DC|WB

  IF  ID  E1  E2  E3  DC  WB                // fp
      IF  ID  --  --  EX  DC  WB            // int
          IF  --  --  ID  EX  DC  WB        // int
                      IF  ID  EX  DC  WB    // ld/st
```
3. soln. 3: split pipeline into multiple fixed-length segments (guarantee in-order WB)
  1. segments must be fully pipelined to avoid structural hazards
  2. implies empty stages, but advantage is in-order completion (UltraSPARC)
  3. complicates forwarding (many more paths)
```
         |EX|~~|~~|WB  (single-cycle integer)
    IF|ID|AC|DC|~~|WB  (load/stores)
         |E1|E2|E3|WB  (fp and multi-cycle integer)


  IF  ID  E1  E2  E3  WB                // fp
      IF  ID  EX  ~~  ~~  WB            // int
          IF  ID  EX  ~~  ~~  WB        // int
              IF  ID  AC  DC  ~~  WB    // ld/st
```
4. soln. 4: split pipeline into multiple variable-length segments
  1. must deal with possible simultaneous WB from multiple segments
    1. soln. a: reservation scheduling in ID - reserve a WB slot in a reservation vector or table or stall until an available slot can be obtained
    2. soln. b: arbitrate among the segments for use of the WB bus(es), perhaps stalling some segments
  2. out-of-order completion => can lead to imprecise exceptions (esp. FP)
  3. must deal with out-of-order writes to same register (e.g., you can stall or squash a tardy write)
```
         |EX|WB        (single-cycle integer)
    IF|ID|AC|DC|WB     (load/stores)
         |E1|E2|E3|WB  (fp and multi-cycle integer)

  IF  ID  E1  E2  E3  WB
      IF  ID  EX  WB          <-- out-of-order completion
          IF  ID  EX  WB      <-- competition for WB port / resource conflict?
              IF  ID  AC  DC  WB
```
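
Reservation scheduling in ID (soln. a above) can be sketched with a set of already-claimed WB cycles. This is a simplified model, assuming a single WB port and in-order issue; the ID-to-WB distances (4 for fp, 2 for single-cycle integer, 3 for load/store) are read off the segment diagram, and `schedule_wb` is a hypothetical name.

```python
def schedule_wb(latencies, first_id_cycle=2):
    """Assign ID and WB cycles so that no two instructions share a WB cycle.

    latencies lists, per instruction in program order, the number of cycles
    from ID to WB. Returns a list of (id_cycle, wb_cycle) pairs.
    """
    reserved = set()  # WB cycles already claimed (the reservation vector)
    schedule = []
    cycle = first_id_cycle
    for latency in latencies:
        # Stall in ID until the needed WB slot is free.
        while cycle + latency in reserved:
            cycle += 1
        reserved.add(cycle + latency)
        schedule.append((cycle, cycle + latency))
        cycle += 1  # the next instruction can enter ID one cycle later at best
    return schedule


if __name__ == "__main__":
    # fp, int, int, ld/st as in the diagram above
    print(schedule_wb([4, 2, 2, 3]))
```

In the diagram's sequence the second integer instruction would reach WB in the same cycle as the fp instruction; with the reservation check it stalls one cycle in ID, and the load behind it is delayed by the same amount.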

Precise exceptions for out-of-order completion (esp. important for FP)

soln. 0: allow imprecise FP exceptions (IBM S/360 M91)

         |EX|WB
    IF|ID|E1|E2|E3|WB
         |X1|X2|X3|X4|X5|WB

              exception in b
                           V
  a: IF  ID  X1  X2  X3  X4|     <-- not completed!
  b:     IF  ID  E1  E2  E3|     <-- resume here or where after handler?
  c:         IF  ID  EX  WB|     <-- completed!
  d:             IF  ID  EX| ..  ..
  e:                 IF  ID| ..  ..  ..
  f:                     IF| ..  ..  ..  ..
                           v
                            TRAP

  consider if inst c overwrites a source, e.g., ADDI R1,R1,#1

soln. 1: provide run-slow mode that will force in-order FP completion (IBM RS/6000)

  a: IF  ID  X1  X2  X3  X4  X5  WB
  b:     IF  ID  --  --  --  --  --  E1  E2  E3.  <-- resume here
  c:         IF  --  --  --  --  --  ID  --  --|
  d:                                 IF  --  --|
                                               v
                                                TRAP

soln. 2: provide exception barrier insts. to force machine to wait until all possible points of exceptions have been passed (DEC Alpha; compiler opt. can reduce barriers to one per basic block)

  b:         IF  ID  E1  E2  E3.   <-- resume here
  barrier        IF  ID  --  --|
  c:                 IF  --  --|   <-- never executed
                               v
                                TRAP

soln. 3: provide precise FP exceptions using combined hardware/software buffer scheme; FP exception occurs and is queued; recognized later upon inst. issue to FPU; exception handler must fix up and/or simulate faulting inst. (Intel 486 delayed exception, SPARC FPQ)

  b:     IF  ID  E1  E2  E3* ..    <-- queued exception
  c:         IF  ID  EX  WB
  d:             IF  ID  EX  WB
  e:                 IF  ID  E1.   <-- earlier exception recognized =>
  f:                     IF  ID|        handle and restart here at e
  g:                         IF|
                               v
                                TRAP

soln. 4: provide precise FP exceptions using overlapped execution only when FP insts. are guaranteed to be without exceptions (MIPS R2000, Intel Pentium SIR/safe instruction recognition)
```
  b:     IF  ID  E1  E2  E3.   <-- resume here
  c:             IF  --  --|   <-- never executed
                           v
                            TRAP
```

soln. 5: provide precise exceptions by saving previous contents of destination registers into a history buffer; roll back upon exception (MC88110)

  a: IF  ID  X1  X2  X3  X4  X5  WB      <-- allow to drain
  b:     IF  ID  E1  E2  E3.             <-- resume here
  c:         IF  ID  EX  WB| RB          <-- roll back dest reg to previous value
  d:             IF  ID  EX| ..  ..      <-- flushed
  e:                 IF  ID| ..  ..  ..
  f:                     IF| ..  ..  ..  ..
                           v
                            TRAP

soln. 6: provide precise exceptions by writing results from out-of-order completion into a reorder buffer and then examining exception status on sequential retirement of results from buffer into registers (PowerPC 603/604/750, Intel P6 core, MIPS R10000, etc.)

  a: IF  ID  X1  X2  X3  X4  X5  WB  RT  <-- retires from buffer to reg file
  b:     IF  ID  E1  E2  E3.             <-- resume here
  c:         IF  ID  EX  WB| ..          <-- completes but doesn't retire
  d:             IF  ID  EX| ..  ..      <-- flushed
  e:                 IF  ID| ..  ..  ..
  f:                     IF| ..  ..  ..  ..
                           v
                            TRAP
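
The reorder-buffer idea in soln. 6 can be sketched as a queue that accepts results in any order but updates the register file only from its head. This is a minimal illustration, not the mechanism of any of the processors named above; the class and field names are hypothetical, and register renaming and operand bypassing from the buffer are left out.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left

    def issue(self, name, dest):
        """Allocate an entry in program order."""
        entry = {"name": name, "dest": dest, "done": False, "value": None, "fault": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, fault=False):
        """Record a result (or an exception); may be called out of order."""
        entry.update(done=True, value=value, fault=fault)

    def retire(self, regs):
        """Retire finished entries in order; return the faulting inst. name, if any."""
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["fault"]:
                self.entries.clear()  # flush all younger instructions
                return entry["name"]
            regs[entry["dest"]] = entry["value"]
        return None


if __name__ == "__main__":
    regs = {"f0": 0.0, "f2": 0.0, "r1": 7}
    rob = ReorderBuffer()
    a, b, c = rob.issue("a", "f0"), rob.issue("b", "f2"), rob.issue("c", "r1")
    rob.complete(c, value=8)          # c finishes first but cannot retire
    rob.complete(a, value=1.5)
    rob.complete(b, fault=True)
    print(rob.retire(regs), regs)
```

Because c's result never leaves the buffer, r1 still holds its old value when b's exception is recognized, so the handler sees the state a sequential machine would have produced.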

soln. 7: provide precise exceptions by writing results from out-of-order completion into a future file using age bits to eliminate tardy writes (UltraSPARC III, US Patent 5,964,862)

Case study: MIPS R2000 (1986)
1. five-stage integer pipeline
  1. IF - translate PC into physical address in first clock phase, and start fetch from icache in second phase
  2. RD - finish fetch from icache in first clock phase, and decode and read register operands (fixed-field decoding) in second phase; for branch determine next PC in second phase
  3. ALU - execute ALU instruction; for ld/st determine effective address in first phase, and translate effective address into physical address in second phase
  4. DC - access dcache
  5. WB - in first phase (note how write-back occurs in first phase of WB and register read occurs in second phase of RD - this obviates a forwarding path from WB to RD)
2. delayed branches (note how branch target address is available at end of RD), this is a case where the pipelined implementation "shows through" the architecture
3. 3-ported register file (2 read, 1 write)
4. chip architects were Craig Hansen, John Moussouris, Tom Riordan, and Chris Rowen (MIPS-I architect was Craig Hansen)
5. separate FPU chip (R2010) - Craig Hansen, Ed Hudson, Mark Johnson, and others
6. J. Moussouris, et al., "A CMOS RISC processor with integrated system function," IEEE COMPCON Spring '86, San Francisco, March 1986, pp. 126-131.
7. (see MIPS R4300i datasheet (pdf) for an even cleaner presentation of the five-stage pipeline)

Case study: MIPS R4000/R4400 (1991)
1. "superpipelined" eight-stage pipeline - see section A.6 in text
2. branch delay is three cycles, so taken branch has delay slot cycle plus two stall cycles (example of the architecture lasting longer than an implementation)
3. note speculative use of data prior to cache tag check (i.e., direct-mapped dcache)
4. chip architects were Peter Davies, Earl Killian, and Tom Riordan
5. A. Bashteen, I. Lui, and J. Mullan, "A superpipeline approach to the MIPS architecture," IEEE COMPCON Spring '91, San Francisco, 1991, pp. 8-12.
6. S. Mirapuri, M. Woodacre, and N. Vasseghi, "The MIPS R4000 processor," IEEE Micro, 12, 2, April 1992, pp. 10-22.
7. other R4x00 family members (R4200/R4600) go back to simpler 5-stage pipeline

Case study: Fujitsu TurboSPARC (1996)
1. early SPARCs had four-stage pipelines
2. TurboSPARC has nine-stage combined integer/FP pipeline
  1. I - inst. address given to virtually-indexed/virtually-tagged direct-mapped 16 KB icache
  2. F - fetch instruction pair, also buffers an instruction pair at branch target address
  3. D - decode single instruction and read two registers; compute branch target address; perform FP structural and data hazard detection for the three nonpipelined FP units (FALU, FMUL, FDIV), including avoidance of simultaneous FP writebacks
  4. E - integer execute, most are single-cycle but integer multiply/divide stalls the pipeline
  5. M - memory access to virtually-indexed/physically-tagged direct-mapped 16 KB dcache
  6. R - register-defer stage - to cancel writes to integer register file
  7. W - write-back to integer register file and issue any pending traps, long-running (25-cycle) FP div or sqrt instructions are placed in a special holding buffer
  8. FR - FP-defer stage - normal 3-cycle FP insts. complete, if no exceptions in this or prior FP instructions route result to FP register file, else route info to FPQ
  9. FW - FP-write-back - write FP register or FPQ with posted results and set FP condition code
3. delayed branches
4. 4-ported register file (3 read, 1 write)

Case study: Intel i486 (1989)
1. five-stage integer pipeline (approach is called an AGI pipeline)
  1. fetch - fetch 16 bytes of instructions from the single physically-addressed 4-way set associative 8KB cache into a prefetch buffer (providing about five instructions per fetch); use the two 16-byte buffers in a double buffered manner or use one for prefetching down a branch target path
  2. D1 (main decoding stage) - processes up to three instruction bytes at a time; determines the length of the instruction and causes the prefetch buffer to step to the next instruction; extra cycles for prefix bytes or two-byte opcodes
  3. D2 (secondary decoding stage) - includes effective address calculation; extra cycles when inst. has both an immediate constant and a memory displacement or when an index register must be added to a base register and a displacement
  4. EX (execution) - includes register operand fetch and data cache access
    1. data cache hit for either a load or store operation (i.e., MOV mem to reg, MOV reg to mem) can be accomplished by the EX stage in one cycle
    2. alu operations with all operands/results in registers can be performed in one cycle
    3. extra cycles required for complex instructions, e.g. reg-to-memory add requires three EX cycles: one for data fetch from cache, one for the add itself, and one for result store to cache -- this type of instruction is common in x86-style code
    4. using forwarding, a loaded result is available for use in the very next cycle; however, because address calculation occurs in a previous stage (D2), there can be pointer load delays (that is, a sequence of a load and then an instruction that uses the loaded register as a base or index register will encounter a one-cycle stall)
  5. WB - write back to registers
2. branches
  1. predict-untaken (even for unconditional jumps)
  2. two-cycle mispredict penalty since change in the PC (caused by a taken conditional jump or an unconditional jump) is determined during the EX stage; contents of D1 and D2 must be flushed
3. 4-ported register file (3 read, 1 write)
4. eight-stage FP pipeline with integer stages fetch/D1/D2/EX followed by FP stages X1 (execute-1), X2 (execute-2), WF (FP write-back), and ER (error reporting)
5. chief designer was John Crawford (IA-32 architects were John Crawford and Patrick Gelsinger)
6. J. Crawford, "The i486 CPU: Executing Instructions in One Clock Cycle," IEEE Micro, February 1990, pp. 27-36.
7. B. Fu, A. Saini, and P. Gelsinger, "Performance and Microarchitecture of the i486 Microprocessor," Intl. Conf. Computer Design, 1989, pp. 182-187.
8. E. Grochowski and K. Shoemaker, "Issues in the Implementation of the i486 Cache and Bus," Intl. Conf. Computer Design, 1989, pp. 193-198.

Case study: IBM z900 (2002)
1. seven-stage pipeline (optimized for RX format instructions Reg <- Reg op M[B+X+D])
  1. fetch (may take several cycles)
  2. decode
  3. read base and index registers, generate address
  4. cache stage 1
  5. cache stage 2
  6. execute
  7. write back
2. dual 256 KB L1 caches
3. 8K-entry BTB, keeps only taken branches
4. fetch buffers to predict through three branches
5. I-unit resolves load, load address, and branch on index instructions early
6. E-unit has integer ("fixed-point") and floating-point units
7. E. Schwarz, et al., "The microarchitecture of the IBM eServer z900 processor," IBM Jrnl. Res. Dev., 46, 4/5, 2002, pp. 381-396.

mark@cs.clemson.edu