Clemson University
CPSC 464/664 Lecture Notes
Spring 2003
Mark Smotherman


Chapter 4 Lecture Outline

  1. Amount of ILP available in programs
    1. early studies
      1. measured amount of parallelism in basic blocks (using code compiled by compilers of that era)
        1. 1.86 IPC - Tjaden and Flynn (1970, IBM 7094 programs)
        2. 1.72 IPC - Foster and Riseman (1972, CDC 3600 programs)
      2. Foster and Riseman also studied the effect of perfect branch prediction (by executing all paths) and found IPC of 51
    2. later studies include
      1. 90 IPC - Nicolau and Fisher (1984, idealized assumptions)

  2. Software scheduling
    1. scheduling finds independent instructions to execute during latency of operations (e.g., branch delays, load delays)
    2. basic block scheduling limited to 5-10 instructions, which often have multiple dependencies
    3. scheduling needs larger basic blocks or to go beyond basic blocks to exploit ILP
      1. if-conversion removes branches (by using predication)
      2. hyperblock scheduling (one entry, multiple exits)
      3. percolation scheduling uses transformation rules for global code motion
      4. trace scheduling
        1. schedule code as if frequent path ("trace") is one big basic block
        2. add compensation code to off-trace paths based on code motions in current trace
        3. trace scheduling example (Univ. of Minn.)
    4. loops
      1. loop unrolling
        1. reduces number of branches
        2. the larger pool of independent instructions available to cover long-latency operations is often a bigger performance win than the reduced branch count
        3. unrolling is limited by number of architected registers (different subset needed for each unrolled iteration to provide independent instructions)
        4. unrolling is affected by loop-carried dependencies
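The unrolling points above can be sketched in Python (a toy illustration, not compiler output): a 4x-unrolled reduction keeps four independent accumulators, mirroring the separate register subset each unrolled iteration needs, and takes the loop branch only once per four elements.

```python
# Sketch: 4x unrolling of a reduction, assuming len(a) is a multiple of 4.
def sum_unrolled(a):
    # Four independent accumulators stand in for the distinct
    # register subset each unrolled iteration must use; a single
    # "s += a[i]" chain would be a loop-carried RAW dependence.
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, len(a), 4):   # one loop branch per 4 elements
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
    return s0 + s1 + s2 + s3

print(sum_unrolled([1, 2, 3, 4, 5, 6, 7, 8]))  # -> 36.0
```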
      2. loop optimizations
      3. software pipelining
        1. essentially convert long RAW latencies into WAR dependencies within the loop body (RAW dependencies now become loop-carried)
        2. new loop structure
          1. prolog
          2. kernel
          3. epilog
        3. assisted by predication and rotating registers
        4. software pipelining for IA-64 (Univ. of Alberta)
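The prolog/kernel/epilog structure above can be sketched as a toy Python version of a[i] = b[i] + 1 with a three-stage (load, add, store) software pipeline; function and variable names are invented for illustration. Note how, in the kernel, the values carried in `loaded` and `added` come from earlier iterations, i.e., the original RAW chain is now loop-carried.

```python
# Sketch: software-pipelined a[i] = b[i] + 1 with stages
# LOAD -> ADD -> STORE overlapped across iterations.
def pipelined_add_one(b):
    n = len(b)
    assert n >= 2          # sketch assumes at least two iterations
    a = [None] * n
    # prolog: fill the pipeline
    loaded = b[0]          # LOAD  iteration 0
    added = loaded + 1     # ADD   iteration 0
    loaded = b[1]          # LOAD  iteration 1
    # kernel: steady state, one result retired per pass
    for i in range(2, n):
        a[i - 2] = added   # STORE iteration i-2
        added = loaded + 1 # ADD   iteration i-1
        loaded = b[i]      # LOAD  iteration i
    # epilog: drain the pipeline
    a[n - 2] = added
    a[n - 1] = loaded + 1
    return a
```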
    5. interprocedural optimizations
      1. inlining
      2. cloning, e.g., hot/cold versions
    6. enabling techniques
      1. large set of registers
      2. register renaming (by compiler)
      3. predication
        1. partial (conditional move instruction)
        2. full
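A minimal sketch of if-conversion: the branch in "if x < 0 then y = -x else y = x" is removed by executing both arms and selecting the result with a predicate, as a conditional-move or a fully predicated machine would. The function name is invented for illustration.

```python
# Sketch: if-converted absolute value, no branch in the "schedule".
def abs_if_converted(x):
    p = int(x < 0)          # predicate: 1 if x < 0, else 0
    # both arms execute; the predicate selects the result,
    # mimicking a conditional-move instruction
    return p * (-x) + (1 - p) * x
```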
      4. alias analysis
      5. speculative loads (e.g., Itanium)
        1. data speculation - across stores
          1. advanced load instruction - ld.a
          2. advanced load address table (ALAT) watches for stores to recorded addresses
          3. load check instruction - ld.c.clr (reloads value if ALAT entry invalid)
          4. check instruction - chk.a.clr (branches to fixup routine)
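The data-speculation mechanism above can be modeled with a toy Python ALAT: an advanced load records its address, an intervening store invalidates any matching entry, and the check-load reloads only when the entry was lost. The function names echo the ld.a/ld.c mnemonics, but the model itself is invented for illustration.

```python
# Toy model of data speculation across stores (ALAT-style).
mem = {0x10: 7, 0x20: 9}
alat = set()

def ld_a(addr):                 # advanced load: load early, record address
    alat.add(addr)
    return mem[addr]

def st(addr, value):            # store: invalidate any matching ALAT entry
    alat.discard(addr)
    mem[addr] = value

def ld_c(addr, spec_value):     # check load: reload if entry was lost
    return spec_value if addr in alat else mem[addr]

v = ld_a(0x10)      # load hoisted above the store
st(0x10, 42)        # store aliases the load -> ALAT entry dropped
v = ld_c(0x10, v)   # conflict detected, value reloaded: v == 42
```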
        2. control speculation - across branches - need to preserve exception behavior
          1. speculative load instruction - ld.s
          2. poison bits for general registers - NaT ("not a thing")
          3. invalid exponent value for fp registers - NaTVal
          4. special context switching needed to preserve NaT and NaTVal
          5. exception upon use
          6. check instruction - chk.s (branches to fixup routine)
      6. rotating registers

  3. VLIW
    1. history
      1. ties to horizontal microcode (Wilkes suggested parallel microoperations in 1953)
      2. Alan Charlesworth - FPS attached array processors
      3. Josh Fisher - ELI-512 at Yale (then Multiflow)
      4. Bob Rau - Polycyclic Processor at TRW (then Cydra 5 at Cydrome)
      5. other VLIW efforts (iWARP, CHoPP, etc.)
      6. LIW - Apollo DN10000 and Intel i860 (dual instruction mode)
    2. compiler prepares fixed packets of multiple operations that give the full "plan of execution"
      1. dependencies are determined by compiler and used to schedule according to function unit latencies
      2. function units are assigned by compiler and correspond to the position within the instruction packet ("slotting")
      3. compiler produces fully-scheduled, hazard-free code => hardware doesn't have to "rediscover" dependencies or schedule
      4. unscheduled events (e.g., cache miss) stall entire processor
    3. compatibility across implementations is a major problem
      1. VLIW code won't run properly with different number of function units or different latencies
      2. proposals to reschedule at page-fault time, etc. (e.g., Conte)
    4. code density is a major problem
      1. low slot utilization (mostly nops)
      2. reduce nops by compression ("flexible VLIW", "variable-length VLIW")
        1. multiple predefined instruction formats (e.g., Cydrome multi-op and uni-op formats)
        2. format-selection index into a special programmable format memory used during decoding (e.g., De Gloria)
        3. decompress to full-length format during icache refill (e.g., Multiflow)
        4. decompress to full-length format in special stage after icache fetch (e.g., Trimedia)
        5. route individual ops out of icache through a crossbar using packing/stop bits and pipeline ids (e.g., IBM SCISM, Intergraph Clipper 5 patents)
        6. "indirect VLIW" (iVLIW) - preload separate VLIW instruction matrix entry-by-entry and later issue one row at a time (e.g., Iizuka and Pechanek patents, BOPS ManArray)
      3. eliminate instructions with nops in all fields
        1. multi-cycle nop
        2. register scoreboard (but introduces dynamic scheduling)
          1. Sun MAJC - load-use and long-latency-operation register scoreboard
          2. TigerSHARC

  4. EPIC - explicitly parallel instruction computing
    1. history
      1. Lee Higbie - concurrency control bits added to instruction format, 1978
      2. Burton Smith - lookahead count in Horizon processor, 1988
      3. see also EPIC historical precedents
    2. compiler provides information on dependencies
      1. extra bits or counts to indicate independence among instructions
      2. hardware doesn't have to "rediscover" dependencies, but function unit assignment and scheduling done by hardware
      3. avoids the scaling problem of superscalar dependency checking but retains code compatibility across implementations

                       dependency    fn. unit   time to start
                        checking    assignment    execution
                       ----------   ----------  -------------
      
        superscalar     hardware     hardware     hardware
                       ...........
        EPIC            software  .  hardware     hardware
                                   ............
        dynamic VLIW    software     software  .  hardware
                                                ............
        VLIW            software     software     software
      
      
      
          code generation
                |                 superscalar
                |`----------------------------------.
                |                                   |
        .-------|-----------------.         .-------|-----------------.
        |       v                 |         |       v                 |
        | dependency checking     |         | dependency checking     |
        |     O(n^2)              |         |       |                 |
        |       |                 |  EPIC   |       |                 |
        |       |`----------------------------------.                 |
        |       v                 |         |       v                 |
        | fn. unit assignment     |         | fn. unit assignment     |
        |       |                 | dynamic |       |                 |
        |       |                 |  VLIW   |       |                 |
        |       |`----------------------------------.                 |
        |       v                 |         |       v                 |
        | time to start execution |         | time to start execution |
        |       |                 |         |       |                 |
        |       `                 |  VLIW   |       |                 |
        |        `----------------------------------.                 |
        |                         |         |       |                 |
        `-------------------------'         `-------|-----------------'
            compiler scheduling                     |  hardware control
                                                    |
                                                    v
                                               hardware fn. units
      

    3. B. Rau and J. Fisher, "Instruction-Level Parallel Processing: History, Overview and Perspective," HP Tech Report, 1992.
    4. M. Schlansker and B. Rau, "EPIC: An Architecture for Instruction-Level Parallel Processors," HP Tech Report, 1999.
    5. M. Smotherman, "Understanding EPIC Architectures and Implementations."

  5. Case study: Multiflow
    1. 1983 - Josh Fisher describes ELI-512 VLIW design and trace scheduling
    2. 1984-1990 - Multiflow works on VLIW design called the Trace, but the company folds in 1990
    3. in-memory compression scheme - VLIW instructions are expanded during i-cache miss and stored in VLIW format in the i-cache
    4. slots in the instruction word have a fixed assignment to first cycle of execution or second cycle of execution
    5. mnop - multicycle nop to halt instruction fetch for specified number of cycles to save space in the i-cache
    6. Multiflow /500 design recognized a need to start loads as early as possible but also to discard ("crush") remaining, unnecessary loads when "long loops are winding down"
    7. R. Colwell, et al., "A VLIW architecture for a trace scheduling compiler," IEEE Trans. on Computers, August 1988, pp. 967-979.
    8. R. Colwell, et al., "Architecture and Implementation of a VLIW Supercomputer," Proc. Supercomputing, 1990, pp. 910-919.

  6. Case study: Cydrome
    1. VLIW
    2. 1981 - Bob Rau leads Polycyclic Architecture project at TRW/ESL
    3. 1983-1988 - Cydrome works on VLIW design called the Cydra 5, but the company folds in 1988
    4. normal VLIW multi-op format (256-bit instruction word, seven fields)
    5. for compression, added a uni-op format which contained routing fields to specify which function units were used (six 40-bit uni-op instructions are held in a 256-bit instruction word)
    6. mnoop was a special uni-op instruction that halted instruction execution for a specified number of cycles to allow for the cases where no instructions were ready to execute (and thus avoid a series of empty multi-ops or uni-ops)
    7. memory latency register specifies latency used by compiler when scheduling; hardware buffers values from any loads that complete earlier or stalls the processor if any loads complete later than the specified number of cycles
    8. rotating registers are used for software pipelining - the idea is later adopted by IA-64
    9. B. Rau, D. Yen, W. Yen, and R. Towle, "The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-offs," IEEE Computer, January 1989, pp. 12-35.
    10. G. Beck, D. Yen, and T. Anderson, "The Cydra 5 minisupercomputer: Architecture and implementation," Journal of Supercomputing, May 1993, pp. 143-180.

  7. Case study: TI C6x
    1. design started in 1992, chief architect is Ray Simar
    2. compressed VLIW
      1. 8-instruction-wide fetch packet
      2. execute packet defined by link bits in instructions, can have 1-8 execute packets per fetch packet
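The fetch-packet/execute-packet split can be sketched in Python. Each instruction is modeled as a (name, p_bit) pair; on the C6x the p ("parallel") bit of an instruction says whether the next instruction executes in the same cycle, so a p-bit of 0 closes the current execute packet. (This sketch handles a single fetch packet; the example packet contents are invented.)

```python
# Sketch: carving an 8-instruction fetch packet into execute packets
# using the p (link) bits; p = 1 means the NEXT instruction runs in
# parallel with this one.
def split_execute_packets(fetch_packet):
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:             # p = 0 closes the execute packet
            packets.append(current)
            current = []
    if current:                    # flush any trailing partial packet
        packets.append(current)
    return packets

fp = [("A", 1), ("B", 0), ("C", 0), ("D", 1), ("E", 1), ("F", 0),
      ("G", 0), ("H", 0)]
# -> [['A', 'B'], ['C'], ['D', 'E', 'F'], ['G'], ['H']]
```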
    3. 11-stage pipeline: 4-stage fetch, 2-stage dispatch and decode, 5 stages for load execution
    4. multicycle nop
    5. Tutorial from Micro 31

  8. Case study: Transmeta
    1. Crusoe
      1. 4-way VLIW native architecture
        1. load/store
        2. alu0
        3. alu1/shifts/fp/mmx
        4. branch/immediate
      2. six instruction bundle formats: AA, AB, AI, LA, LAAB, LAAI
      3. code morphing from x86 insts. to native bundles
      4. working and shadow ("known good") registers
        1. 64 integer regs., first 48 are shadowed (r0 = eax, r1 = ecx, etc.)
        2. 32 fp regs., first 16 are shadowed
      5. alias hardware for data-speculative loads
        1. load-and-protect records address and data size
        2. store-under-alias-mask checks for match
    2. seven-stage pipeline, register scoreboard
    3. commit bit in each bundle defines a commit point: the point at which n x86 insts. have been fully translated into m VLIW insts., where working registers are copied to shadow registers
    4. exception causes rollback to last commit point and then x86 inst.-by-inst. emulation
    5. Efficeon
      1. 8-way VLIW native architecture
        1. two load/store/add
        2. two alu
        3. fp/mmx/sse/sse2
        4. mmx/sse/sse2
        5. address alias checking
        6. control

  9. Case study: Itanium
    1. history
      1. Bill Worley, HP PA-Wide Word, 1990
      2. HP Play-Doh, 1993
      3. HP approached Intel in 1994
      4. MPF presentation in 1997
    2. instruction format
      1. three 41-bit instruction syllables per 128-bit "bundle"
      2. each bundle contains 5 "template bits" which specify syllable types and independence of syllables (within bundle and between bundles)
      3. each syllable is predicated
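The bundle layout above (5 template bits + 3 x 41-bit syllables = 128 bits) can be sketched as integer bit-field extraction, assuming the template occupies the low five bits with syllable 0 immediately above it; the example bundle value is invented.

```python
# Sketch: carving a 128-bit bundle (as a Python integer) into its
# 5 template bits and three 41-bit syllables.
MASK41 = (1 << 41) - 1

def decode_bundle(bundle):
    template = bundle & 0x1F                          # low 5 bits
    syllables = [(bundle >> (5 + 41 * i)) & MASK41    # bits 5-45, 46-86, 87-127
                 for i in range(3)]
    return template, syllables

# round-trip check with a hypothetical bundle
template, slots = 0b10110, [1, 2, 3]
bundle = template
for i, s in enumerate(slots):
    bundle |= s << (5 + 41 * i)
assert decode_bundle(bundle) == (0b10110, [1, 2, 3])
```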
    3. register rich
      1. 128 integer registers (64 bits each)
      2. 128 floating-point registers
      3. 64 predicate registers (1 bit each)
    4. multiway branch architecture
    5. rotating register file
    6. variable-sized register windows with RSE (register stack engine) protocol that will allow implementations to perform memory traffic in background
    7. W.-M. Hwu at MPF 2001 (pdf) reported Itanium IPC ranged from low of 0.1 to high of 1.9 on SPEC benchmarks
    8. HP collection of IPF papers and presentations

