Historical background for EPIC instruction set architectures
Mark Smotherman
Last updated: November 2023
Summary: The design style of EPIC (explicitly parallel instruction computing)
did not appear instantaneously, like Athena springing from Zeus' head. Instead,
EPIC is a compendium of ideas that have been percolating in computer
architecture for years.
See a partial writeup of this material in
M. Smotherman, "Understanding EPIC Architectures and Implementations" (pdf)
from ACM Southeast Conference, 2002.
Intel/HP EPIC - Explicitly Parallel Instruction Computing
There are several principles behind EPIC:
- start loads early
- predication to eliminate many conditional branches
- register rich
- independence architecture
- uncoupled branch architecture
- rotating register file
In the HP/Intel Itanium (IA-64), these influences are seen in the following
ways.
- start loads early
- advance loads - move above stores when alias analysis is incomplete
- speculative loads - move above branches
- predication to eliminate many conditional branches
- 64 predicate registers
- almost every instruction is predicated
- register rich
- 128 integer registers (64 bits each)
- 128 floating-point registers
- independence architecture
- VLIW flavor, but fully interlocked (i.e., no delay slots)
- three 41-bit instruction syllables per 128-bit "bundle"
- each bundle contains 5 "template bits" which specify independence
of following syllables (within bundle and between bundles)
- uncoupled branch architecture
- eight branch registers
- multiway branches
- rotating register files
- upper 48 of the predicate registers (p16-p63) rotate
- upper 96 of the integer registers (r32-r127) can rotate
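As a rough illustration of the bundle arithmetic listed above (three 41-bit syllables plus 5 template bits filling 128 bits), the following Python sketch packs and splits a bundle held as an integer. The function and constant names are hypothetical, and the field order (template in the low-order five bits, followed by slots 0 through 2) is an assumption of this sketch rather than something stated in these notes.

```python
SYLLABLE_BITS = 41   # width of one instruction syllable
TEMPLATE_BITS = 5    # width of the template field
BUNDLE_BITS = TEMPLATE_BITS + 3 * SYLLABLE_BITS  # 5 + 123 = 128

def pack_bundle(template, slots):
    """Pack a template and three syllables into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three syllables")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for i, syllable in enumerate(slots):
        if not 0 <= syllable < (1 << SYLLABLE_BITS):
            raise ValueError("syllable must fit in 41 bits")
        # Assumed layout: slot i sits above the template and lower slots.
        bundle |= syllable << (TEMPLATE_BITS + i * SYLLABLE_BITS)
    return bundle

def split_bundle(bundle):
    """Return (template, [slot0, slot1, slot2]) for a 128-bit bundle."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SYLLABLE_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SYLLABLE_BITS)) & mask
             for i in range(3)]
    return template, slots
```

Round-tripping a value through pack_bundle and split_bundle returns the original template and syllables, which is a quick check that the three fields exactly tile the 128 bits.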
Sidebar: IA-64 History
- IA-64 joint ACM committee (architecture, compilers, microarchitecture)
- five Intel members:
- John Crawford (chief architect for overall effort)
- Hans Mulder (architecture)
- Harsh Sharangpani (microarchitecture and x86 floating point
compatibility)
- Kent Fielden (compilers)
- Jack Mills (architecture and performance evaluation)
- five HP members:
- Jerry Huck (lead architect for HP)
- Rajiv Gupta (architecture, Wide-Word background)
- David Fotland (microarchitecture, PA-RISC background)
- Dale Morris (architecture, PA-RISC background)
- Carol Thompson (compilers)
- Timeline
- 1981 - Bob Rau leads Polycyclic Architecture project at TRW/ESL
- 1983 - Josh Fisher describes ELI-512 VLIW design and trace scheduling
- 1983-1988 - Rau at Cydrome works on VLIW design called the Cydra-5,
but the company folds in 1988
- 1984-1990 - Fisher at Multiflow works on VLIW design called the Trace,
but the company folds in 1990
- 1988 - Dick Lampman at HP hires Bob Rau and
Mike Schlansker from Cydrome and also gets IP rights from Cydrome
- 1989 - Rau and Schlansker begin the FAST (Fine-grained Architecture
and Software Technologies) research project at HP; they later develop
the HP PlayDoh architecture
- 1990-1993 - Bill Worley leads PA-WW (Precision Architecture Wide-Word)
effort at HP Labs to be the successor to the PA-RISC architecture;
it was also called SP-PA (Super-Parallel Processor Architecture)
and SWS (Super WorkStation)
- HP hires Josh Fisher, input to PA-WW
- input to PA-WW from Hitachi team, led by Yasuyuki Okada
- November 1991 - Hans Mulder joins Intel to start work on a 64-bit
architecture
- July 1992 - Worley recommends HP seek a semiconductor manufacturing
partner
- 1993 - HP starts effort to develop PA-WW as a product
- December 1993 - HP investigates partnership with Intel
- June 1994 - announcement of cooperation between HP and Intel;
PA-WW used as starting point for joint design; John Crawford of Intel
leads the joint team
- 1997 - the term EPIC is coined
- October 1997 - Microprocessor Forum presentations by Intel and HP
- July 1998 - Carole Dulong of Intel publishes
"The IA-64 Architecture at Work," IEEE Computer, pp. 24-32.
- February 1999 - release of ISA details of IA-64
- 2001 - Intel marketing prefers IPF (Itanium Processor Family) to IA-64
- May 2001 - Itanium (Merced)
- July 2002 - Itanium 2 (McKinley)
- References
- Itanium history page at HPL
- Russ Britt,
"The Birth of a New Processor,"
Electronic Business, January 2000.
- Mike Schlansker and Bob Rau,
"EPIC: Explicitly Parallel Instruction Computing" (pdf),
IEEE Computer, February 2000, pp. 37-45.
- Mike Schlansker and Bob Rau,
"EPIC: An Architecture for Instruction-Level Parallel Processors"
(pdf),
HP Labs Technical Report HPL-1999-111, February 2000.
- John Crawford,
"Introducing the Itanium Processors,"
IEEE Micro, September-October 2000, pp. 9-11.
- Harsh Sharangpani,
"Intel Itanium Processor Core" (pdf), Hot Chips 12 slides,
August 2000.
- Itanium home page at Intel
- Wen-mei Hwu, et al., "Itanium Performance Insights"
(Univ. of Illinois, IMPACT compiler project),
Microprocessor Forum, 2001.
- Jay Bharadwaj, et al.,
"The Intel IA-64 Compiler Code Generator",
IEEE Micro, September-October 2000, pp. 44-53.
- Rumi Zahir, Dale Morris, Jonathan Ross, and Drew Hess,
"OS and Compiler Considerations in the Design of the
IA-64 Architecture,"
ASPLOS-IX, November 2000, pp. 212-221.
- Charles Gray, Matthew Chapman, Peter Chubb, David Mosberger-Tang,
and Gernot Heiser,
"Itanium - A System Implementor's Tale" (pdf),
USENIX Annual Technical Conference, April 2005, pp. 264-278.
- John Sias,
"A Systematic Approach to Delivering Instruction-Level
Parallelism in EPIC Systems" (pdf), Ph.D. Dissertation,
Univ. Illinois at Urbana-Champaign, 2005.
- see also the set of links (some now dead) collected for
"Itanium: An EPIC Architecture,"
CS 854 Advanced Computer Architecture class project,
Univ. of Virginia, 2001.
Historical precedents for load speculation
- several designs recognized the need for early initiation of loads
- Zuse Z4, 1940s - the instruction stream was read two instructions
in advance, and if a load was detected it was started early to reduce
the impact of the slow cycle time of the memory.
- IBM Stretch, 1961 - a separate indexing unit pre-processes the
instruction stream to decode arithmetic instructions and start memory
loads early (it also executes index-register-related operations and
branches); decoded instructions and data values loaded from memory
are placed in a four-element lookahead buffer between the indexing
unit and the arithmetic unit
- IBM S/360 Model 91, 1967; IBM S/370 Model 165, 1970; IBM 3033, 1978;
IBM 3090, 1986 - these IBM mainframes use overlapped I and E units in a
manner similar to Stretch: the I unit fetches and decodes instructions,
performs address calculations, and starts memory loads
- decoupled access/execute (DAE) architectures split the instruction
stream and allow the memory-access machine to run ahead of the execute
machine and to pre-load memory data into FIFO buffers
- three approaches to load speculation
- hardware speculative loads
- hardware speculative loads are normal loads executed on a hardware
platform that provides branch prediction and speculative execution;
any addressing exception will only be recognized if the predicted
path is confirmed as taken
- speculative execution dates back to the IBM Stretch of 1961, which
also provided address monitoring using two boundary registers
- if address monitoring is enabled and a hardware speculative load
failed a boundary check, it appears that the instruction would be
changed into a no-op by the lookahead logic ("instruction rejection"
or "cancel") and the matching lookahead level indicator bit would
be set (i.e., the "data fetch" indicator)
- it appears that once this no-op became the oldest instruction
in the lookahead without being deleted during branch misprediction
recovery ("housekeeping"), then the CPU indicator bit would be set
according to the lookahead level indicator bit and an interrupt
would occur
- the most detailed source for speculative execution in Stretch
appears to be W.C. Stetler, "Lookahead Section of the Sigma
Computer," Jan. 19, 1960 (Sigma was the internal project name for
the high-performance scientific processor that became Stretch);
the papers by R.T. Blosk on the Stretch Instruction Unit mention
address boundary checking and the setting of indicator bits but
do not go into the same level of detail; it is possible that the
actual Stretch implementation was somewhat simpler than that
described by Stetler
- software speculative loads
- software speculative loads are normal loads executed on a hardware
platform that does not provide speculative execution but that are
scheduled by a compiler prior to the basic block in which the loaded
value will be used, e.g., from a computed address like an array
reference or a pointer chain you are chasing; in most (but not all)
environments, these are impractical because they will constantly
cause addressing exceptions and the resulting unacceptable
performance degradation
- Josh Fisher included loads as being among the instructions to be
moved entirely ahead of branches in his description of Trace
Scheduling in 1979, and this was implemented in 1982 in his group's
Bulldog Compiler and described in John Ellis's dissertation in 1985
- dismissible loads (also called silent or non-faulting loads)
- "dismissible loads" are software speculative loads in which the
exception problem is handled by some hardware implementation
technique, allowing control flow to continue until the hardware
figures out whether the exception should be taken
- starting with the design of the Multiflow Trace 7/200 and its
compiler in 1984, mechanisms were developed to avoid the penalty
inherent in software speculative loads causing frequent unwanted
exceptions; DLD, for Dismissible Load, was the opcode used
in the Multiflow Trace's instruction set
- from P.G. Lowney, S.M. Freudenberger, T.J. Karzes, W.D. Lichtenstein,
R.P. Nix, J.S. O'Donnell, and J.C. Ruttenberg, "The Multiflow Trace
Scheduling Compiler," The Journal of Supercomputing, May 1993:
To prevent unwarranted memory faults on speculative loads, the
compiler uses the dismissable load operation. If a dismissable load
traps, the trap code does not signal an exception, but returns a NAN
or integer zero, and computation continues; if necessary, a translation
buffer miss or a page fault is serviced. NANs are propagated by the
floating units, and checked only when they are written to memory or
converted to integers or booleans. Correct programs exhibit correct
behavior with speculative execution on the Trace, but an incorrect
program may not signal an exception that it would have signaled if
compiled without speculative execution.
- the design of the Key Computer Laboratories K-1 (ca. 1988)
included an early load (eload) instruction and an
optional early load check (echk) instruction; each register
had an early load flag bit, which was set on an illegal
memory access by an early load
- other work
- IBM VLIW efforts
- 33rd bit (to indicate the "bottom value") in registers
- extra bit in the instruction opcode to indicate a dismissible
or non-dismissible version; exceptions occurred when a
non-dismissible instruction computed a bottom value
- K. Ebcioglu, "Some Design Ideas for a VLIW Architecture for
Sequential Natured Software," in Parallel Processing
(Proceedings of IFIP WG 10.3 Working Conference on Parallel
Processing), 1988
- similar design proposed an extra field in a register to hold the
address of the excepting instruction
- K. Ebcioglu and R. Groves, "Some Global Compiler Optimizations
and Architectural Features for Improving Performance of
Superscalars," Research Report no. RC16145, IBM T.J. Watson
Research Center, Yorktown Heights, NY, 1990
- precise exceptions from compiler techniques
- G.M. Silberman and K. Ebcioglu, "An Architectural Framework for
Supporting Heterogeneous Instruction-Set Architectures," IEEE
Computer, June 1993 (first version published in G.M. Silberman
and K. Ebcioglu, "An Architectural Framework for Migration from
CISC to Higher Performance Platforms," Intl. Conf.
on Supercomputing, 1992)
- IBM patents
- 5,542,075 - Method and apparatus for improving performance of
out of sequence load operations in a computer system
- 5,625,835 - Method and apparatus for reordering memory operations
in a superscalar or very long instruction word processor
- 5,799,179 - Handling of exceptions in speculative instructions
- Smith, Lam, and Horowitz -- "Boosting Beyond Static Scheduling in a
Superscalar Processor," ISCA 1990
- "best aspects of static and dynamic scheduling"
- static branch prediction, encoded in branch op codes
- shadow register file and shadow store buffer
- move instructions up before one branch, mark as boosted, and
access shadow registers; any exceptions are deferred until
boosted instruction commits
- if branch prediction is correct:
- move results from shadow registers into registers
- any boosted instructions still in pipe are unmarked and accesses
to shadow registers are then changed to corresponding registers
- if branch prediction is incorrect:
- flush shadow structures
- squash any boosted instructions still in pipe
- speedups
          | (1 ld/st per cycle) | basic block     | ---- fetch 2 ----         | ---- fetch 4 ----
          | "max speedup"       | scheduling only | (no load/store reorg.)    | (load/store reorg.)
benchmark |                     |                 | dynamic sched. | boosting | dynamic sched. | boosting
awk       | 3.91                | 1.17            | 1.41           | 1.49     | 1.86           | 1.52
ccom      | 3.03                | 1.11            | 1.41           | 1.52     | 1.97           | 1.57
espresso  | 4.19                | 1.22            | 1.51           | 1.70     | 1.97           | 1.79
irsim     | 2.84                | 1.11            | 1.42           | 1.55     | 1.94           | 1.64
latex     | 2.88                | 1.16            | 1.43           | 1.56     | 1.95           | 1.63
- A. Rogers and K. Li, "Software support for speculative loads,"
ASPLOS, 1992
- added a presence bit and a poison bit to each register
- S. Mahlke, W. Chen, W.-m. Hwu, B.R. Rau and M. Schlansker,
"Sentinel scheduling for VLIW and superscalar processors," ASPLOS,
1992
- M. Franklin and G. Sohi, "A new paradigm for exploiting fine-grain
parallelism," Hawaii Intl. Conf. on System Sciences, 1992
- HP patents
- 5,278,985 - Software method for implementing dismissible instructions
on a computer, 1994
- 5,692,169 - Method and system for deferring exceptions
generated during speculative execution
- 5,596,733 - System for exception recovery using a conditional
substitution instruction which inserts a replacement result in
the destination of the excepting instruction
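The deferred-exception idea that runs through the dismissible-load, early-load-flag, and poison-bit designs above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not a description of any one machine: the Reg class, the MEMORY table, and the spec_load, add, and check helpers are all hypothetical names chosen for illustration.

```python
class Reg:
    """A register value plus a one-bit poison (deferred exception) flag."""
    def __init__(self, value=0, poison=False):
        self.value = value
        self.poison = poison

# Hypothetical memory: only these addresses are legal.
MEMORY = {0x100: 7, 0x104: 35}

def spec_load(addr):
    """Speculative load: an illegal address poisons the result instead of trapping."""
    if addr in MEMORY:
        return Reg(MEMORY[addr])
    return Reg(0, poison=True)

def add(a, b):
    """Arithmetic propagates poison from either source operand."""
    return Reg(a.value + b.value, a.poison or b.poison)

def check(reg):
    """Non-speculative use point: the deferred exception is signaled here."""
    if reg.poison:
        raise MemoryError("deferred addressing exception")
    return reg.value
```

A load hoisted above a branch can call spec_load freely; only if control actually reaches the original use does check run, so an illegal address on a path that is never taken costs nothing.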
Historical precedents for predication (conditional execution)
- predication dates back to at least the early 1950s
- Wilkes lecture on control unit design, 1951 - "some of the
micro-orders can be made conditional in their action as well
as (or instead of) conditional as regards the switching of
micro-control"
- IBM 604, 1952 - each instruction had a suppression bit, which
controls whether it is executed or not
- Zemanek's MAILÜFTERL, 1954 - each instruction could be
made dependent on one of 15 conditions (e.g., if the value in
the ACC is negative) specified by a four-bit field in the
instruction format
- Zuse Z22, 1955 - each instruction could be made dependent on
a condition specified in a five-bit field in the instruction
- van der Poel's ZEBRA, 1958 - each instruction could be made
dependent on a condition specified by a three-bit field in
the instruction (this is a refinement of his 1952 ZERO
instruction set in which non-branch instructions could be
made conditional as a side effect of an unusual branching
scheme)
- Electrologica X-1, 1959 - the basic instruction format had two
"precondition bits" that specify whether the instruction should
be executed or not, and two "post condition" bits that specify
how the condition codes should be set after execution
- IBM ACS, 1967 - a set of 24 condition code registers allowed
precalculation of branch conditions and also supported logical
operations between condition codes; this is similar to the eight
independent condition codes in the IBM RS/6000 and PowerPC;
a 'skip flag' bit in each instruction was used along with a
conditional 'skip' instruction to replace regular conditional
branches
- CDC Flexible Processor, 1976 - each microinstruction is
conditionally executed based on three bits in the microinstruction
format (e.g., selecting among dozens of conditions including
sign of a result, arithmetic overflow, I/O conditions,
and loop control)
- Key Computer Laboratories K-1, 1988 - "The select instruction
allows for the complete elimination of branches which are used
to select between two results."
- ARM, ca. 1986 - each instruction is predicated
- Cydra 5, 1988 - each instruction is predicated
- HARP VLIW design, 1988 - each instruction is predicated
- Key Computer K-1, 1988 - most instructions are predicated
- Multiflow /500, 1990 - each floating-point operation or store
could be made conditional (Colwell, et al., Supercomputing '90)
- HP PlayDoh, 1993 - experimental predicated instruction set
architecture
- Mahlke, et al., "Comparison ..." paper (ISCA, 1995)
- TI VelociTI VLIW architecture, 1997 - each instruction is predicated
- (and lots of architectures have added a conditional move instruction)
- Mahlke, et al., "Comparison of Full and Partial Predicated Execution
Support," ISCA 1995
- limited branch resources restrict # branches handled per cycle
- imperfect branch prediction reduces performance by factor of 2 to 10
- eliminate branches by predication
- partial predication - conditional moves
- full predication - every instruction, but adds another source operand
- compiler performs if-conversion
- processor will fetch instructions from both paths but only allow
instructions with true predicates to issue/complete
- partial predication changes dynamic instruction count by a factor of 0.93 to 2.1
- full predication changes dynamic instruction count by a factor of 0.83 to 1.29
- advantages
- decreases # of branches so limited branch resources are not a problem
- decreases # of mispredicted branches so performance impact is lessened
- exposes multiple execution paths to hardware
- table of Million branches (Million mispredicts)
         | superblock only | conditional move | full predication
grep     | .66 (.01)       | .17 (.02)        | .17 (.02)
yacc     | 12 (.52)        | 5.9 (.45)        | 5.9 (.43)
espresso | 75 (3.4)        | 38 (2.1)         | 33 (1.0)
eqntott  | 315 (42)        | 53 (6.7)         | 51 (6.9)
ear      | 1539 (66)       | 443 (16)         | 442 (15)
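If-conversion, as summarized in the Mahlke et al. notes above, can be illustrated with a small Python model. This is only a sketch under stated assumptions: the three function names are hypothetical, the guarded-operation list stands in for predicated machine instructions, and the `if guard` test models hardware nullification of an operation rather than a branch.

```python
def branching(a, b):
    """Original control flow: one conditional branch, one arm executed."""
    if a < b:
        r = b - a
    else:
        r = a - b
    return r

def full_predication(a, b):
    """If-converted form: both arms are fetched; a false guard nullifies the op."""
    p1 = a < b          # the compare writes a predicate ...
    p2 = not p1         # ... and its complement
    regs = {"r": 0}
    ops = [
        (p1, "r", lambda: b - a),   # (p1) sub r = b, a
        (p2, "r", lambda: a - b),   # (p2) sub r = a, b
    ]
    executed = 0
    for guard, dest, compute in ops:
        executed += 1               # every op is fetched and counted
        if guard:                   # models nullification, not a branch
            regs[dest] = compute()
    return regs["r"], executed

def partial_predication(a, b):
    """Conditional-move form: compute both arms, then select one result."""
    t1 = b - a
    t2 = a - b
    r = t2
    r = t1 if a < b else r          # cmov r = t1 if (a < b)
    return r
```

All three forms compute the same absolute difference. The conditional-move version needs extra temporaries and a select step, which is one way to see why partial predication tends to grow the dynamic instruction count more than full predication does.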
Possible insight into register size choice
- Mahlke, Chen, Gyllenhaal, and Hwu, "Compiler code transformations for
superscalar-based high-performance systems," Supercomputing '92,
Minneapolis, Nov. 1992, pp. 808-817
- discusses 2-way issue, 4-way issue, and 8-way issue
"superscalar/VLIW" processors running 40 loop nests from Perfect Club
benchmarks, SPEC-FP, and vector library functions
- to get maximum effectiveness of the ILP, several compiler optimizations
need to be performed (e.g., loop unrolling, variable renaming, variable
expansion, tree-height reduction)
- each optimization has the effect of increasing the number of registers
needed
- concluding sentence: "37 of the 40 loops require fewer than 128 total
registers after all transformations"
Historical precedents for independence architectures
- the name of this architectural category is due to Josh Fisher and Bob Rau
- explicitly encoded information on instruction independence is
placed in the instruction format by the compiler; difference between
independence architecture and VLIW (and esp. compressed VLIW) is that
in the former the hardware does the scheduling of which instructions
will execute together
- early examples
- NBS PILOT, ready signal, 1958 - bit 65 in the 68-bit instruction
format of the primary computer can be set to indicate that the
program in the primary computer should stop and wait until a
secondary computer has produced previously requested data,
A.L. Leiner, et al., "Concurrently operating computer systems,"
Proc. UNESCO Conference on Information Processing, Paris, June 1959,
pp. 353-361.
- Lee Higbie, concurrency control bits, 1978 -
bits added to instruction format and set by programmer or compiler to
indicate that the execution of an instruction should be delayed until
a specified function unit has produced an operand,
"Overlapped operation with microprogramming," IEEE Trans. on Computers,
March 1978, pp. 270-275. [written while he was at U. Mass. Amherst
about work on a signal processing computer at Sanders]
- Burton Smith's Horizon, lookahead, 1988 - field in instruction format
is set to minimum distance to next dependent instruction (over all
branch paths)
- LIW (long instruction word)
- original Stanford MIPS, 1984 - underpipelined and could pack an ALU op
and a load/store op together into a single machine instruction,
e.g., see Steven Przybylski, et al., "Organization and VLSI
implementation of MIPS," Advances in VLSI and Computer Systems, 1984
- Apollo DN10000, 1988 - "FP companion" bit is leftmost bit of integer
instruction format and is used to indicate if a paired floating-point
instruction follows and is to be issued in parallel; the integer/FP
pair must start on an 8-byte boundary, and an FP instruction cannot
appear without the paired integer instruction; the five-operand
version of the FP instruction format can specify both a multiply and
an independent add/sub/truncate (thus, with the integer operation,
the Apollo can execute a peak of three operations/cycle)
- Intel i860, 1988 - "dual instruction mode (DIM)" bit ("D-bit") in
floating-point instruction format to indicate if aligned pairs of
independent floating-point and integer instructions are to be issued
in parallel (see Kohn, US 5,241,636); because of pipelining the bit
has a two-cycle delayed effect and governs the dual issue of the
instruction pair two cycles later; also, the i860 allowed multiple
ways of specifying the execution of a floating-point addition and
multiply at the same time, thus up to three operations could be
performed per cycle ("dual operation", see Kohn, US 5,204,828)
- CMU iWarp, 1988 - two instruction formats: short (32 bits, loop-back
bit and one operation) and long (96 bits, loop-back bit and either
three floating-point operations or two floating-point operations and
two integer/memory-access operations); references to queue-pointer
registers implicitly resulted in memory loads and stores
- Stanford TORCH, 1990 - two instructions issued together (to "A side"
and "B side" with some slotting restrictions) unless a dynamic nop bit
is set in either instruction's extension byte (see
TORCH architectural specifications)
- Fujitsu VPP500 scalar processor, 1994 - up to three operations per
instruction word; the first four bits of the 64-bit instruction word
serve as the format selector (see Y. Nakashima, et al.,
"Scalar processor of the VPP500 parallel supercomputer,"
Proc. ICS, 1995)
- traditional VLIW
- roots of VLIW lie in horizontal microprogramming (e.g., Josh Fisher's
work in trace scheduling was done for horizontal microcode)
- see, e.g., van der Poel's "Microprogramming and Trickology", 1962
- other horizontal microcode history and VLIW "pre-history"
- Turing's ACE (1946)
- IBM SSEC (1948) - two instructions in a "line of sequence",
which could be used to specify two separate operations
within the same program or duplicate operations using separate
resources to provide checking [see US 2,636,672]
- Elliott 152 (1950) and Elliott 153 (1954) - the 153 had a
64-bit instruction specifying multiple register transfers
(ALU, multiplier, I/O, branching, and control of two
scratchpad memories)
- Wilkes and Stringer paper (1953) - suggesting horizontal
microcode
- array processors, including
IBM 2938 Array Processor (1969),
IBM 3838 Array Processor (1974),
and FPS AP-120B (1975)
- P.M. Melliar-Smith, "A design for a fast computer
for scientific calculations," in 1969 AFIPS FJCC,
pp. 201-208. He proposes "direct functional control"
for inner loops in array processing applications, by
which he means a noninterlocked VLIW design with
exposed pipelining. (He's writing in reaction to
execution resources "squandered" and "wasted" by a
Tomasulo-like E-box coupled with a
one-instruction-decode-per-cycle I-box.)
- Culler patent (1973) - "Data processor with parallel
operations per instruction" [US 3,771,141]
- Pomerene patent (1981) - "Machine for multiple instruction
execution" [US 4,295,193]
- Rau's Polycyclic Architecture project at TRW/ESL (1981)
- Fisher's ELI-512 design (1983)
- see J. Fisher, P. Faraboschi, and C. Young,
"VLIW processors: Once blue sky, now commonplace,"
IEEE Solid-State Circuits Magazine, vol. 1, no. 2, Spring 2009,
pp. 10-17.
- compressed VLIW / flexible VLIW
- variable-length encoding of VLIW programs
- Multiflow, 1988
- in-memory compression scheme - VLIW instructions are expanded
during i-cache miss and stored in VLIW format in the i-cache
- a single instruction encodes first-beat and second-beat operations
(slots in the instruction word have a fixed assignment to first
cycle of execution or second cycle of execution)
- mnop - multicycle nop to halt instruction fetch for specified
number of cycles to save space in the i-cache
- Colwell/et al., "A VLIW architecture for a trace scheduling
compiler," IEEE Trans. on Computers, August 1988, pp. 967-979.
- Colwell/et al., "Architecture and Implementation of a VLIW
Supercomputer," Proc. Supercomputing, 1990, pp. 910-919.
- Cydrome, 1988
- normal VLIW multi-op format (256-bit instruction word, seven fields)
- for compression, added a uni-op format which contained routing fields
to specify which function units were used (six 40-bit uni-op
instructions are held in a 256-bit instruction word)
- mnoop was a special uni-op instruction that halted instruction
execution for a specified number of cycles to allow for the cases
where no instructions were ready to execute (and thus avoid a
series of empty multi-ops or uni-ops)
- memory latency register specifies latency used by compiler when
scheduling; hardware buffers values from any loads that complete
earlier or stalls the processor if any loads complete later than
the specified number of cycles
- had plans for an in-memory compression scheme (vari-ops) for
second generation design (Cydra-10); similar to Multiflow since
vari-ops would be expanded into one or more multi-op instructions
during i-cache miss processing
- Beck/Yen/Anderson, "The Cydra 5 minisupercomputer: Architecture and
implementation," Journal of Supercomputing, May 1993, pp. 143-180.
- Intergraph Clipper 5 (U.S. patent 5,560,028, 1996)
- called a "software scheduled superscalar" architecture but more
accurately classified as a compressed VLIW scheme
- tags are added for multiple-issue group identification
along with routing tags for function unit assignment; the tags
control a crossbar switch
- later paper mentions use of a register scoreboard to determine
when to issue the next group
- Arya/Sachs/Duvvuru, "An architecture for high instruction level
parallelism," 28th Hawaii Intl. Conf. Syst. Sci., 1995, pp. 153-162.
- [Arya worked for Higbie while they were at Gould in 1980s]
- Philips Trimedia, 1996
- a compressed instruction format is stored in the cache as well
as memory and is expanded by a decompressor unit (decompression
takes place during one pipeline stage of instruction fetch)
- the encoding eliminates nops by using a header that includes a
count of operations in that instruction
- an uncompressed instruction has five operation slots, each of
which contains an execution unit identifier that is used to
route that operation to the appropriate execution unit
- three generations: TM-1, TM-1000, and CPU64
- TI VelociTI VLIW architecture, 1997
- design started in 1992, chief architect is Ray Simar
- fetch packet of 8 instructions
- one to eight variable-length, multiple-issue execute packets can
be contained within each fetch packet; they are delimited by
"parallel instruction" link bits in the instruction format
- 5 delay slots per branch and 4 per load
- multicycle nop
- TI C6x family: C62x, C64x, and C67x
- Starcore, 1998
- 16-bit instruction formats
- VLES (variable length execution set) - two options:
- serial - a two-bit field is allocated in the instruction format of
a subset of the instructions; "00" indicates that the current
instruction is included with the next, other values indicate a stop
- prefix - instructions can also be grouped using one or two prefix
words; the prefix contains a set count and also provides for
conditional execution, access to more registers, and looping
- SC140, 1998 - up to six instructions in an execution set
- SC110, 200x - up to three instructions in an execution set
- execution set determined from encoding during dispatch stage
in a 5-stage pipeline ( prefetch / fetch / dispatch / address
generation / execute )
- each execution set advances as a unit; thus, the longest running
instruction determines the number of cycles its execution set
occupies the execution pipeline stage
- TigerSHARC, 1998
- "static superscalar" - one to four 32-bit instructions can be executed
each cycle from a 128-bit instruction line, most significant bit of
each instruction acts as a stop bit
- minor slotting restrictions (e.g., a conditional or program sequencer
instruction must be placed in the first slot of a line)
- no memory alignment restrictions for instruction lines
- register scoreboard, stalls complete line
- Sun MAJC, 1999
- one to four instructions, count field in first instruction
- retains slotted assignment to function units, each of which is
general purpose
- function units have separate set of local registers and share
a common set of global registers
- load-use and long-latency-operation register scoreboard
- after fetch, align stage prepares for 1-to-4-way issue based
on count field in first unissued instruction
- Tremblay/Chan/Chaudhry/Conigliaro/Tse, "The MAJC architecture:
A synthesis of parallelism and scalability," IEEE Micro,
November-December 2000, pp. 12-25.
- Fujitsu FR-V family, 1999
- each 32-bit instruction has a 1-bit packing flag, acts as a stop bit
- up to four instructions in parallel, "nop insertion and slot
distribution" occur after fetching from the i-cache
- fairly general functional units so slot assignment is not a big issue
- Sukemura, "FR500 VLIW-architecture high-performance embedded
microprocessor," Fujitsu Sci. Tech. Jrnl., June 2000, pp. 31-38.
- Suga/Matsunami, "Introducing the FR500 embedded microprocessor,"
IEEE Micro, July-August 2000, pp. 21-27.
- Aditya/Mahlke/Rau, "Code size minimization and retargetable assembly
for custom EPIC and VLIW instruction formats," HP technical report
HPL-2000-141, Oct. 2000.
- IBM SCISM, early 1990s
- compound units "reflect the parallel issue of instructions"
- 3 instructions per compounded unit (provision made to jump into
the middle of a compound unit)
- compounding can be done by the compiler, at the time of a
page fault, or at the time of i-cache refill
- compound units include tag bits that can indicate dependency
info, e.g., for interlock-collapsing function units
- Vassiliadis/Blaner/Eickemeyer, "SCISM: A scalable compound
instruction set machine," IBM JRD, 38/1, Jan. 1994, pp. 59-78
- apparently never built, but lots of patents
- Transmeta Crusoe, 2000 - six instruction formats (2-4 instructions)
- AA - two ALU instructions
- AB - ALU instruction and branch
- AI - ALU instruction with 32-bit immediate value
- LA - load/store and ALU instruction
- LAAB - load/store, two ALU instructions, branch
- LAAI - load/store, two ALU instructions (one w/ 32-bit immediate)
(Several patents, including U.S. 6,031,992, 2000)
- retrofitting: examples of hardware marking of independence
internally via predecoding and retaining the marking within a decoded
i-cache (i.e., when you move the dependency detection out of the fetch
and decode pipeline stages but not all the way back to compile time due
to instruction set compatibility)
- NS Swordfish, 1991 - instruction pair dependency bit is contained
in each decoded i-cache entry; it is set on i-cache refill by predecode
hardware and yields LIW issue of independent instruction pairs; no bits
are used in the normal instruction format.
- Minagawa/Saito/Aikawa, 1991 - "Pre-decoding mechanism for superscalar
architecture," IEEE Pacific Rim Conf. on Comm., Comp., and Sig. Proc.,
pp. 22-24; on i-cache miss, a predecoder adds instruction grouping
("priority") and function unit assignment fields.
(see also US Patent 5,163,139, "Instruction preprocessor for
conditionally combining short memory instructions into virtual
long instructions")
- HP 7200, 1995 - six predecode bits are added for each double word in
the i-cache; they encode resource conflicts and data dependencies and
are set by a predecoder on i-cache refill.
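The predecode idea in the retrofitting examples above can be sketched as follows. This is a minimal illustration, not the actual NS Swordfish or HP 7200 logic: the `Insn` record, the two-hazard check (read-after-write and write-after-write), and the cycle count are all assumptions made for the example. It shows only the general shape, in which a dependency bit is computed once per instruction pair at refill time and then consulted at issue time.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register, a tuple of sources.
Insn = namedtuple("Insn", "dest srcs")

def pair_independent(first, second):
    """Assumed rule: the pair may issue together if the second instruction
    neither reads (RAW) nor overwrites (WAW) the first one's destination."""
    raw = first.dest is not None and first.dest in second.srcs
    waw = first.dest is not None and first.dest == second.dest
    return not (raw or waw)

def predecode_line(insns):
    """Model an i-cache refill: store each aligned pair with its dependency bit."""
    entries = []
    for i in range(0, len(insns) - 1, 2):
        a, b = insns[i], insns[i + 1]
        entries.append((a, b, pair_independent(a, b)))
    return entries

def issue_cycles(entries):
    """Independent pairs issue in one cycle; dependent pairs take two."""
    return sum(1 if bit else 2 for _, _, bit in entries)

line = [
    Insn("r1", ("r2", "r3")),
    Insn("r4", ("r5", "r6")),   # independent of the previous instruction
    Insn("r7", ("r1",)),
    Insn("r8", ("r7",)),        # reads r7, so this pair is dependent
]
entries = predecode_line(line)
bits = [bit for _, _, bit in entries]
cycles = issue_cycles(entries)
```

Because the bit lives only in the decoded i-cache entry, the instruction format seen by software is unchanged, which is the compatibility point made above.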
Historical precedents for prepare-to-branch
- Four aspects of conditional branching: (1) condition setting, (2) decision,
(3) branch target address, (4) change of PC
- Condition setting
- condition storage
- single set of bits in PSR or flags register for integer conditions
- second set of bits in FP status register for flt. pt. conditions
- high-perf. implementation problem for inst. sets that use single,
serialized resources [cf. Sites on design of Alpha]
- multiple sets of bits (e.g., IBM RS/6000, Key K-1)
- use of general registers (e.g., MC88110)
- specification of comparands
- explicit compare instruction (basically a subtraction)
- side effect of ALU operation (setting by ALU op is optional in SPARC)
- Decision - logical relation between comparands (eq, ne, lt, le, gt, ge,
flt. pt. unordered)
- Branch target address
- Change PC
- immediate effect
- delayed effect - next one or so sequential instructions have already
been fetched and will be executed regardless of branch decision
- delayed effect with annulling/squashing - sequential instructions already
fetched may be executed or optionally purged when the branch is untaken (e.g., SPARC)
- Packaging these aspects
- compare (1), then conditional branch (2+3+4)
- ALUop side effect (1), then conditional branch (2+3+4)
- compare and branch (1+2+3+4) - may need multiple comparand specifiers
plus the branch address field, although often use reg. vs. 0
- [IBM ACS, 1967] prepare to branch (1+2+3), then exit (4)
- special branch registers to hold BTAs (accomplishing 3 and allowing
for instruction prefetch when loaded), with remaining steps packaged
as (1) then (2+4), or together as (1+2+4)
- ISA add-ons
- [TI ASC, 1972] prepare to branch --
redundant specification of aspect 3 (branch target address), for prefetch
- [PIPE, 1985 (Pleszkun and Farrens)] prepare to branch --
intended as generalized delayed branch technique where
the PTB instruction would specify the number of delay slots after a
branch instruction (0-7)
- hardware retrofit
- US 3,553,655, "Short forward conditional skip hardware,"
inventors were IBM S/360 Model 91 team members Anderson and Sparacio
- see also (among others to be listed)
- US 6,662,294, Kahle and Moore,
"Converting short branches to predicated instructions"
- US 7,409,534, Uht, Morano, and Kaeli,
"Automatic and transparent hardware conversion of
traditional control flow to predicates"
- US patent application 20100262813, Brown, et al.,
"Detecting and handling short forward branch conversion
candidates"
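The packaging options listed in this section can be compared with a small model. The sketch below is illustrative only: the `Machine` class, its method names, and the register name `b0` are hypothetical, and it does not model any particular machine. It contrasts a conventional compare followed by a conditional branch with a variant in which the target is placed in a branch register ahead of time, so that the fetch unit learns the target early.

```python
import operator

# Aspect 2: the logical relations a branch decision can test.
RELATIONS = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
             "le": operator.le, "gt": operator.gt, "ge": operator.ge}

class Machine:
    def __init__(self):
        self.pc = 0
        self.cond = None       # stored comparands (condition storage)
        self.breg = {}         # branch registers holding target addresses
        self.prefetched = []   # targets made known to the fetch unit early

    def compare(self, a, b):
        """Aspect 1: condition setting by an explicit compare."""
        self.cond = (a, b)
        self.pc += 1

    def prepare_to_branch(self, reg, target):
        """Aspect 3 done early: loading a branch register enables prefetch."""
        self.breg[reg] = target
        self.prefetched.append(target)
        self.pc += 1

    def branch(self, relation, target):
        """Aspects 2+3+4 packaged as one conventional conditional branch."""
        taken = RELATIONS[relation](*self.cond)
        self.pc = target if taken else self.pc + 1

    def branch_via(self, relation, reg):
        """Aspects 2+4 only: the target comes from a branch register."""
        taken = RELATIONS[relation](*self.cond)
        self.pc = self.breg[reg] if taken else self.pc + 1

def run_conventional(a, b):
    m = Machine()
    m.compare(a, b)
    m.branch("lt", 40)
    return m

def run_prepared(a, b):
    m = Machine()
    m.prepare_to_branch("b0", 40)
    m.compare(a, b)
    m.branch_via("lt", "b0")
    return m
```

Both sequences reach the same target when the branch is taken; the difference is that the prepared version exposes the target address before the decision is known.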
Historical precedents for rotating register files
- "A different, programmatically controlled register renaming scheme is
obtained by providing rotating register files, that is, base-displacement
indexing into the register file using an instruction-provided displacement
off a dedicated base register.
Although applicable only for renaming registers across multiple
iterations of a loop, rotating registers have the advantage of being
considerably less expensive in their implementation than are other
renaming schemes." - Rau and Fisher, Jrnl. Supercomputing, 1993, p 22.
- scratch-pad in AP-120B/FPS-164, 1976 (Charlesworth)
- compacting FIFO structure in Polycyclic Architecture at TRW/ESL, 1981 (Rau)
- rotating registers in Cydrome Cydra-5, 1988 (Rau)
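The base-displacement indexing described in the Rau and Fisher quotation can be sketched in a few lines. This is a minimal model under stated assumptions: the class and method names are hypothetical, the file size is arbitrary, and the choice to decrement the base register on each rotation is an assumption of this sketch rather than something stated in the quotation. With that choice, a value written at displacement d in one loop iteration is read at displacement d+1 in the next.

```python
class RotatingRegisterFile:
    def __init__(self, size):
        self.size = size
        self.regs = [0] * size
        self.base = 0  # dedicated base register

    def _physical(self, disp):
        # Base-displacement indexing: the instruction supplies disp,
        # the hardware adds the base and wraps around the file.
        return (self.base + disp) % self.size

    def read(self, disp):
        return self.regs[self._physical(disp)]

    def write(self, disp, value):
        self.regs[self._physical(disp)] = value

    def rotate(self):
        # Assumed direction: decrement the base once per loop iteration.
        self.base = (self.base - 1) % self.size

def demo(n=4, size=8):
    """Each iteration writes displacement 0 and reads displacement 1,
    which names the value produced by the previous iteration."""
    rf = RotatingRegisterFile(size)
    seen = []
    for i in range(n):
        rf.write(0, (i + 1) * 10)
        seen.append(rf.read(1))
        rf.rotate()
    return seen
```

The loop body uses the same two displacements every iteration, yet each iteration gets a fresh physical register, which is the renaming effect across iterations that the quotation describes.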
Historical precedents for register stack engine
- Dick Sites' dribble-back registers (1979)
- Hitachi SR2201 preload and poststore ("slide-windowed registers")
My thanks to Harsh Sharangpani for his help; Jason Eckhardt for help with
the i860 and AP-120B descriptions; Norm Hardy for pointing me to the Gray,
et al., paper; and Josh Fisher for help with the history of load speculation.
mark@cs.clemson.edu