Prediction and Profile Buffers

Mark Smotherman
Last updated: May 2004

UNDER CONSTRUCTION

Summary: ... tbd ...

... intro to be written ...

... explain purpose ... importance of dynamic branch prediction ... but beyond that high perf. impls. will record and exploit other dynamic behavior of programs ... look at tables and caches that assist inst. fetch and decoding ...

... this survey not oriented to covering all the specific BP algorithms ... (so find cite to paper or web page that is)

... note difference between table structures (mere lookup) versus caches (hit/miss via tagging) ... offer rationale ...

Design Tree

An extensive design tree was posted by Andy Glew in comp.arch in July 1998 and reprinted in CAN Internet Nuggets, Vol. 26, No. 5, December 1998, p. 34.

From: "Andy Glew" 
Newsgroups: comp.arch
Subject: Re: Branch Target Cache (&& AMD 29K)
Date: 15 Jul 1998 23:02:42 GMT
Organization: Intel & University of Wisconsin


At the risk of reducing the number of future patent plaques on my wall,
let me enumerate some possible combinations of the above that I
consider "obvious to someone skilled in the art".


Index structure by:
---------------------------

instruction address
branch target address
branch from address
path (sequence of branch from/to addresses)
branch history (taken/not-taken)
procedure call address
procedure call point
branch address (or procedure address) in combination with certain parameter values
basic block number (i.e. transformed branch target address)
data address
address of data referencing instruction (e.g. load at address A accesses B)
memory addressing mode 
memory addressing mode and parameters
register version numbers or timestamps
any hash or checksum of the above
        e.g bitfield extracts, shifts, XORs, etc.
linear offsets of any of the above
        denominated in a variety of units such as time, etc.
        e.g. branch from address - 2 instructions
transitive closures of the above
        denominated in a variety of units such as time, same unit, etc.
        e.g. the branch from address 2 branches before the instruction


What is cached
-----------------------

raw instructions
data
branch target address
instructions at branch target
decoded instructions
sequence of instructions at branch target
        possibly beyond multiple branches, e.g. a trace
optimized sequence of instructions at branch target
        where optimizations may include
        a) pre-renaming (relative to trace)
        b) rearranging instructions for schedule
        c) eliminating dead code
        d) common sub expressions detected by hardware
        e) predication to eliminate some branches
type information (e.g. this value is an integer; this value is a pointer)
skip list information
future branch history
future branch path
past branch history
past branch path
schedules indicating expected order of fetch or execution of instructions
cache miss addresses
        absolute, or relative to given data values
compressed versions of the above
combinations of the above - e.g. return both a trace of instructions,
        and a expected cache miss addresses

Branch Prediction Terminology

IBM mainframe systems and patents use the terms

"BTB" for what I would call an instruction prefetch buffer
"BHT" for what I would call a BTAC
"DHT" for what I would call a BHT

So here are my terms.

     branch   branch   target   target
     address  history  address  insts.
     -------  -------  -------  -------
BHT             yes                      branch history table
BPC   tags      yes      yes      yes    branch prediction cache
BTAC  tags               yes             branch target address cache
BTB   tags      yes      yes             branch target buffer
BTIC  tags                        yes    branch target instruction cache

(My use of BTB and BPC is more a concession to historical usage rather than seeing them as completely descriptive terms.)

For another discussion of terminology, see sections 8.4.4 and 8.4.5 of D. Sima, T. Fountain, and P. Kacuk, Advanced Computer Architecture: A Design Space Approach, Addison-Wesley, 1997 and chapter 4 of A. Omondi, The Microarchitecture of Pipelined and Superscalar Computers, Kluwer, 1999. Note also that branch instruction identification can be accelerated without using a prediction buffer/cache by partial decoding of instruction fetch buffers ("look-ahead branch detection" section 8.4.2 of Sima, et al.).

Actual Designs and Implementations

... to be done / very incomplete ...
... many descriptions of current processors ...
... aim this list at identifying earliest examples in each category ...

branch direction
- history held in icache
  - LLNL S-1 (1978) - two-bit predictor held in icache for each instruction
    - prediction bit ("jump bit") and dynamic reverse bit ("wrong bit"), which was set whenever the branch was mispredicted; two mispredicts in a row caused the jump bit to toggle
    - A.J. Smith cites an unpublished memo by L.C. Widdoes, Jr., at Stanford in February 1977 that outlines the scheme
    - this scheme has a state diagram just like the two-bit predictor given in Hennessy and Patterson but with the strongly-taken state at the top left having the bit pattern of 10 (rather than 11) and weakly-taken having 11 (rather than 10)
  - Jim Smith (1981) - mentions this in "A study of branch prediction strategies," Proc. ISCA, 1981, pp. 135-148.
  - Mike Johnson's scheme and AMD K5 (see US Patent 5,136,697)
  - Alpha 21064
- to do: classify patents
  - US 3,551,895 G.C. Driscoll, Jr., "Look-Ahead Branch Detection System," granted to IBM in 1970
- BHT
  - Jim Smith, US Patent 4,370,711 (filed 1980, granted to CDC 1983)
    - Smith recounts his work in 1979-1980 on a 64-entry, 2-bit predictor BHT for the CDC Cyber 180/900 in his retrospective on "A study of branch prediction strategies" in 25 Years of ISCA (Sohi, ed.); the 180/900 performed speculative execution by having the predicted branch reserve all the result bus slots until it was resolved
  - Losq, et al., US Patent 4,477,872 - called "DHT" (filed 1982, granted to IBM 1984)
  - ... and others ...
branch target address capture
- BTAC
  - IBM ACS (late 1960s) - prefetch sequence control registers
    - Ed Sussenguth, US Patent 3,559,183 (filed 1968, granted to IBM 1971)
  - MU5 (late 1960s / early 1970s) - jump trace buffer
    - eight entries (jump-from,jump-to)
    - branch is predicted taken if it is stored in the buffer, no correction on mispredict, FIFO replacement, virtually addressed so flush on process switch
    - simulated a BTIC in L.A. Taylor, "Instruction Accessing in High Speed Computers," M.Sc. Thesis, Univ. of Manchester, 1969, but ruled it out for the MU5 due to hardware cost.
    - R.N. Ibbett, "The MU5 instruction pipeline," The Computer Journal, vol. 15, February 1972, pp. 42-50.
    - R.W. Holgate and Roland N. Ibbett, "An Analysis of Instruction-Fetching Strategies in Pipelined Computers," IEEE Transactions on Computers, 29(4), 1980, pp. 325-329.
    - see also Morris and Ibbett, pp. 59-60.
- decoded instruction cache
  - M68060 (1995) - zero-cycle taken branch
    - target info stored using address of instruction prior to the predict-taken branch
    - entry includes the condition code setting needed for execution unit to check the accuracy of the taken prediction and to start misprediction recovery if necessary
  - ... and others ... (see decoded instruction caches)
branch target fetch
- BTIC
  - loop-catching instruction buffers
    - CDC 6600 (1964)
      - 8-word instruction stack
      - D (depth) and L
      - catches jump within stack
    - CDC 7600 (1968)
      - 12-word instruction stack
      - keep address with each word in stack (thus loop can have non-contiguous instructions)
    - S/360 Model 91 (1967)
      - 8-doubleword instruction stack
      - backward branch to a target address within 8 doublewords of the branch (called a "short loop") causes entry into loop mode; decoder tags taken-path instructions as conditional and sends them to wait at the execution units for branch resolution
      - otherwise decoder predicts untaken and tags the fall-through instructions as conditional and sends them to wait at the execution units for branch resolution
      - "BTB" - instruction buffer for use in prefetching at target address in case a predict-untaken branch is taken
      - method was described in 1964 by T.C. Chen at AFIPS
  - AMD 29K (1987) - "BTC"
    - 32 entries of four instructions each, two-way set associative
    - Phil Frieden, US 4,933,837 (filed 1986, granted to AMD 1990)
    - Baror, et al., US 4,926,323 (filed 1988, granted to AMD 1990)
  - Motorola 88110 (1991) - "TIC"
    - 32 entries of two instructions each, fully associative
- combinations
  - US 3,940,741 Horikoshi, et al., granted to Hitachi in February 1976; describes "route index memory" (BTAC) and "route buffer memory" (BTIC)
instruction prefetch
- tbd
dynamic code optimization
- dynamic inlining, re-layout of hot spots
- buffers
  - profile buffer, Conte et al., Micro-29
  - branch behavior buffer, Merten et al., ISCA '99 / ISCA ' 00 / IEEE TC June 2001
- related: Dynamo, rePLay, DELI, DISE, etc.
data prefetch
- ... tbd ...
- J. Pomerene, T. Puzak, R. Rechtschaffen, and F. Sparacio. Prefetching System for a Cache Having a Second Directory for Sequentially Accessed Blocks. U.S. Patent 4,807,110, February 1989.
- RPT - Tien-Fu Chen and Jean-Loup Baer, "Effective Hardware-Based Data Prefetching for High-Performance Processors," IEEE Transactions on Computers, Vol. 44, No. 5, pp. 609-623, May 1995.
- IRB - Sharad Mehrotra, "Data Prefetch Mechanisms For Accelerating Symbolic And Numeric Computation," Illinois, 1996.
- Amir Roth, Andreas Moshovos, and Gurindar S. Sohi, "Dependence Based Prefetching for Linked Data Structures," ASPLOS, October 1998.
- Surendra Byna, Yong Chen, and Xian-He Sun, "A Taxonomy of Data Prefetching Mechanisms," Proc. Intl. Symp. on Parallel Architectures, Algorithms, and Networks (ISPAN), 2008, pp. 19-24.
value prediction
- ... tbd ...
- Mikko H. Lipasti, Christopher B. Wilkerson, and John Paul Shen, "Value Locality and Load Value Prediction", ASPLOS, 1996.
- Mikko H. Lipasti and John Paul Shen, "Exceeding the Dataflow Limit via Value Prediction", Micro, 1996.
instruction reuse
- ... tbd ...
- value cache - S.P. Harbison, "An Architectural Alternative to Optimizing Compilers," ASPLOS, 1982.
- result cache - S.E. Richardson, "Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation," Tech. Rept., Sun Microsystems Labs, 1992.
- Avinash Sodani and Gurindar S. Sohi, "Dynamic Instruction Reuse", ISCA, 1997.
- value reuse table - J. Yi, R. Sendag, and D. Lilja, "Increasing Instruction-Level Parallelism with Instruction Precomputation", Euro-Par, 2002.
- memo tables - Daniel Citron
- data flow manipulation - Stephan Jourdan

(US Patents - for branching see subclasses 238 and 240 under class 712)

[History page] [Mark's homepage] [CPSC homepage] [Clemson Univ. homepage]

mark@cs.clemson.edu