Historical background for EPIC instruction set architectures

Mark Smotherman
Last updated: November 2023

Summary: The design style of EPIC (explicitly parallel instruction computing) did not appear instantaneously, like Athena springing from Zeus' head. Instead, EPIC is a compendium of ideas that have been percolating in computer architecture for years.

See a partial writeup of this material in M. Smotherman, "Understanding EPIC Architectures and Implementations" (pdf) from ACM Southeast Conference, 2002.

Intel/HP EPIC - Explicitly Parallel Instruction Computing

There are several principles behind EPIC:

start loads early
predication to eliminate many conditional branches
register rich
independence architecture
uncoupled branch architecture
rotating register file

In the HP/Intel Itanium (IA-64), these influences are seen in the following ways.

start loads early
1. advance loads - move above stores when alias analysis is incomplete
2. speculative loads - move above branches
predication to eliminate many conditional branches
1. 64 predicate registers
2. almost every instruction is predicated
register rich
1. 128 integer registers (64 bits each)
2. 128 floating-point registers
independence architecture
1. VLIW flavor, but fully interlocked (i.e., no delay slots)
2. three 41-bit instruction syllables per 128-bit "bundle"
3. each bundle contains 5 "template bits" which specify independence of following syllables (within bundle and between bundles)
uncoupled branch architecture
1. eight branch registers
2. multiway branches
rotating register files
1. upper 48 of the predicate registers (p16-p63) rotate
2. a programmable portion of the upper 96 integer registers (starting at r32) rotates
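
The bundle arithmetic listed under "independence architecture" (three 41-bit syllables plus 5 template bits filling a 128-bit bundle) can be checked with a small sketch. This is only an illustration: the function names are hypothetical, and the layout assumes the template occupies the low-order five bits, followed by slots 0 through 2.

```python
SYLLABLE_BITS = 41
TEMPLATE_BITS = 5
SYLLABLE_MASK = (1 << SYLLABLE_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, syllables):
    """Pack a 5-bit template and three 41-bit syllables into one integer."""
    if len(syllables) != 3:
        raise ValueError("a bundle holds exactly three syllables")
    if template & ~TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for slot, syllable in enumerate(syllables):
        if syllable & ~SYLLABLE_MASK:
            raise ValueError("syllable must fit in 41 bits")
        # Assumed layout: slot 0 sits just above the template bits.
        bundle |= syllable << (TEMPLATE_BITS + slot * SYLLABLE_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a packed bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    syllables = [
        (bundle >> (TEMPLATE_BITS + slot * SYLLABLE_BITS)) & SYLLABLE_MASK
        for slot in range(3)
    ]
    return template, syllables
```

Since 5 + 3 x 41 = 128, a bundle whose last syllable has its top bit set uses all 128 bits.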

Sidebar: IA-64 History

IA-64 joint ACM committee (architecture, compilers, microarchitecture)
- five Intel members:
  - John Crawford (chief architect for overall effort)
  - Hans Mulder (architecture)
  - Harsh Sharangpani (microarchitecture and x86 floating point compatibility)
  - Kent Fielden (compilers)
  - Jack Mills (architecture and performance evaluation)
- five HP members:
  - Jerry Huck (lead architect for HP)
  - Rajiv Gupta (architecture, Wide-Word background)
  - David Fotland (microarchitecture, PA-RISC background)
  - Dale Morris (architecture, PA-RISC background)
  - Carol Thompson (compilers)
Timeline
- 1981 - Bob Rau leads Polycyclic Architecture project at TRW/ESL
- 1983 - Josh Fisher describes ELI-512 VLIW design and trace scheduling
- 1983-1988 - Rau at Cydrome works on VLIW design called the Cydra-5, but the company folds in 1988
- 1984-1990 - Fisher at Multiflow works on VLIW design called the Trace, but the company folds in 1990
- 1988 - Dick Lampman at HP hires Bob Rau and Mike Schlansker from Cydrome and also gets IP rights from Cydrome
- 1989 - Rau and Schlansker begin the FAST (Fine-grained Architecture and Software Technologies) research project at HP; they later develop the HP PlayDoh architecture
- 1990-1993 - Bill Worley leads PA-WW (Precision Architecture Wide-Word) effort at HP Labs to be the successor to the PA-RISC architecture; it was also called SP-PA (Super-Parallel Processor Architecture) and SWS (Super WorkStation)
- HP hires Josh Fisher, input to PA-WW
- input to PA-WW from Hitachi team, led by Yasuyuki Okada
- November 1991 - Hans Mulder joins Intel to start work on a 64-bit architecture
- July 1992 - Worley recommends HP seek a semiconductor manufacturing partner
- 1993 - HP starts effort to develop PA-WW as a product
- December 1993 - HP investigates partnership with Intel
- June 1994 - announcement of cooperation between HP and Intel; PA-WW used as starting point for joint design; John Crawford of Intel leads the joint team
- 1997 - the term EPIC is coined
- October 1997 - Microprocessor Forum presentations by Intel and HP
- July 1998 - Carole Dulong of Intel publishes "The IA-64 Architecture at Work," IEEE Computer, pp. 24-32.
- May 1999 - release of ISA details of IA-64
- 2001 - Intel marketing prefers IPF (Itanium Processor Family) to IA-64
- May 2001 - Itanium (Merced)
- July 2002 - Itanium 2 (McKinley)
References
- Itanium history page at HPL
- Russ Britt, "The Birth of a New Processor," Electronic Business, January 2000.
- Mike Schlansker and Bob Rau, "EPIC: Explicitly Parallel Instruction Computing" (pdf), IEEE Computer, February 2000, pp. 37-45.
- Mike Schlansker and Bob Rau, "EPIC: An Architecture for Instruction-Level Parallel Processors" (pdf), HP Labs Technical Report HPL-1999-111, February 2000.
- John Crawford, "Introducing the Itanium Processors," IEEE Micro, September-October 2000, pp. 9-11.
- Harsh Sharangpani, "Intel Itanium Processor Core" (pdf), Hot Chips 12 slides, August 2000.
- Itanium home page at Intel
- Wen-mei Hwu, et al., "Itanium Performance Insights" (Univ. of Illinois, IMPACT compiler project), Microprocessor Forum, 2001.
  - MPF 2001 slides (pdf)
  - IMPACT group publications page
- Jay Bharadwaj, et al., "The Intel IA-64 Compiler Code Generator", IEEE Micro, September-October 2000, pp. 44-53.
- Rumi Zahir, Dale Morris, Jonathan Ross, and Drew Hess, "OS and Compiler Considerations in the Design of the IA-64 Architecture," ASPLOS-IX, November 2000, pp. 212-221.
- Charles Gray, Matthew Chapman, Peter Chubb, David Mosberger-Tang, and Gernot Heiser, "Itanium - A System Implementor's Tale" (pdf), USENIX Annual Technical Conference, April 2005, pp. 264-278.
- John Sias, "A Systematic Approach to Delivering Instruction-Level Parallelism in EPIC Systems" (pdf), Ph.D. Dissertation, Univ. Illinois at Urbana-Champaign, 2005.
- see also the set of links (some now dead) collected for "Itanium: An EPIC Architecture," CS 854 Advanced Computer Architecture class project, Univ. of Virginia, 2001.

Historical precedents for load speculation

several designs recognized the need for early initiation of loads
- Zuse Z4, 1940s - the instruction stream was read two instructions in advance, and if a load was detected it was started early to reduce the impact of the slow cycle time of the memory.
- IBM Stretch, 1961 - a separate indexing unit pre-processes the instruction stream to decode arithmetic instructions and start memory loads early (it also executes index-register-related operations and branches); decoded instructions and data values loaded from memory are placed in a four-element lookahead buffer between the indexing unit and the arithmetic unit
- IBM S/360 Model 91, 1967; IBM S/370 Model 165, 1970; IBM 3033, 1978; IBM 3090, 1986 - these IBM mainframes use overlapped I and E units in a manner similar to Stretch: the I unit fetches and decodes instructions, performs address calculations, and starts memory loads
- decoupled access/execute (DAE) architectures split the instruction stream and allow the memory-access machine to run ahead of the execute machine and to pre-load memory data into FIFO buffers
  - Culler-7, 1986
  - Astronautics ZS-1, 1987
three approaches to load speculation
- hardware speculative loads
  - hardware speculative loads are normal loads executed on a hardware platform that provides branch prediction and speculative execution; any addressing exception will only be recognized if the predicted path is confirmed as taken
  - speculative execution dates back to the IBM Stretch of 1961, which also provided address monitoring using two boundary registers
    - if address monitoring is enabled and a hardware speculative load failed a boundary check, it appears that the instruction would be changed into a no-op by the lookahead logic ("instruction rejection" or "cancel") and the matching lookahead level indicator bit would be set (i.e., the "data fetch" indicator)
    - it appears that once this no-op became the oldest instruction in the lookahead without being deleted during branch misprediction recovery ("housekeeping"), then the CPU indicator bit would be set according to the lookahead level indicator bit and an interrupt would occur
    - the most detailed source for speculative execution in Stretch appears to be W.C. Stetler, "Lookahead Section of the Sigma Computer," Jan. 19, 1960 (Sigma was the internal project name for the high-performance scientific processor that became Stretch); the papers by R.T. Blosk on the Stretch Instruction Unit mention address boundary checking and the setting of indicator bits but do not go into the same level of detail; it is possible that the actual Stretch implementation was somewhat simpler than that described by Stetler
- software speculative loads
  - software speculative loads are normal loads executed on a hardware platform that does not provide speculative execution; the compiler schedules them ahead of the basic block in which the loaded value will be used, e.g., a load from a computed address such as an array reference or a link in a pointer chain; in most (but not all) environments, these are impractical because they frequently cause addressing exceptions, and the resulting performance degradation is unacceptable
  - Josh Fisher included loads as being among the instructions to be moved entirely ahead of branches in his description of Trace Scheduling in 1979, and this was implemented in 1982 in his group's Bulldog Compiler and described in John Ellis's dissertation in 1985
- dismissible loads (also called silent or non-faulting loads)
  - "dismissible loads" are software speculative loads in which the exception problem is handled by some hardware implementation technique, allowing control flow to continue until the hardware figures out whether the exception should be taken
  - starting with the design of the Multiflow Trace 7/200 and its compiler in 1984, mechanisms were developed to avoid the penalty inherent in software speculative loads causing frequent unwanted exceptions; DLD, for Dismissible Load, was the opcode used in the Multiflow Trace's instruction set
  - from P.G. Lowney, S.M. Freudenberger, T.J. Karzes, W.D. Lichtenstein, R.P. Nix, J.S. O'Donnell, and J.C. Ruttenberg, "The Multiflow Trace Scheduling Compiler," The Journal of Supercomputing, May 1993:
    To prevent unwarranted memory faults on speculative loads, the compiler uses the dismissable load operation. If a dismissable load traps, the trap code does not signal an exception, but returns a NAN or integer zero, and computation continues; if necessary, a translation buffer miss or a page fault is serviced. NANs are propagated by the floating units, and checked only when they are written to memory or converted to integers or booleans. Correct programs exhibit correct behavior with speculative execution on the Trace, but an incorrect program may not signal an exception that it would have signaled if compiled without speculative execution.
  - the design of the Key Computer Laboratories K-1 (ca. 1988) included an early load (eload) instruction and an optional early load check (echk) instruction; each register had an early load flag bit, which was set on an illegal memory access by an early load
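
The deferred-exception idea behind dismissible loads can be sketched in a few lines. The model below is loosely patterned on the early load / early load check pair and the per-register flag bit described above. It is not a description of any actual machine: the class and function names are hypothetical, memory is a plain dictionary, and an "illegal access" is simply an address missing from that dictionary.

```python
class DeferredFault(Exception):
    """Raised only when a flagged value is actually needed."""


class Register:
    def __init__(self):
        self.value = 0
        self.early_load_flag = False  # hypothetical per-register flag bit


def early_load(memory, address, dest):
    """Speculative load: an illegal address sets the flag instead of trapping."""
    if address in memory:
        dest.value = memory[address]
        dest.early_load_flag = False
    else:
        dest.value = 0                # placeholder value; no exception yet
        dest.early_load_flag = True


def early_load_check(reg):
    """Check placed in the block where the loaded value is really used."""
    if reg.early_load_flag:
        raise DeferredFault("speculative load referenced an illegal address")
    return reg.value
```

A compiler could hoist `early_load` above a branch and leave `early_load_check` in the original block, so a bad address on the path not taken never raises.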

other work

IBM VLIW efforts
- 33rd bit (to indicate the "bottom value") in registers
  - extra bit in the instruction opcode to indicate a dismissible or non-dismissible version; exceptions occurred when a non-dismissible instruction computed a bottom value
  - K. Ebcioglu, "Some Design Ideas for a VLIW Architecture for Sequential Natured Software," in Parallel Processing (Proceedings of IFIP WG 10.3 Working Conference on Parallel Processing), 1988
- similar design proposed an extra field in a register to hold the address of the excepting instruction
  - K. Ebcioglu and R. Groves, "Some Global Compiler Optimizations and Architectural Features for Improving Performance of Superscalars," Research Report no. RC16145, IBM T.J. Watson Research Center, Yorktown Heights, NY, 1990
- precise exceptions from compiler techniques
  - G.M. Silberman and K. Ebcioglu, "An Architectural Framework for Supporting Heterogeneous Instruction-Set Architectures," IEEE Computer, June 1993 (first version published in G.M. Silberman and K. Ebcioglu, "An Architectural Framework for Migration from CISC to Higher Performance Platforms," Intl. Conf. on Supercomputing, 1992)
- IBM patents
  - 5,542,075 - Method and apparatus for improving performance of out of sequence load operations in a computer system
  - 5,625,835 - Method and apparatus for reordering memory operations in a superscalar or very long instruction word processor
  - 5,799,179 - Handling of exceptions in speculative instructions

Smith, Lam, and Horowitz -- "Boosting Beyond Static Scheduling in a Superscalar Processor," ISCA 1990

"best aspects of static and dynamic scheduling"
static branch prediction, encoded in branch op codes
shadow register file and shadow store buffer
move instructions up before one branch, mark as boosted, and access shadow registers; any exceptions are deferred until boosted instruction commits
if branch prediction is correct:
- move results from shadow registers into registers
- any boosted instructions still in pipe are unmarked and accesses to shadow registers are then changed to corresponding registers
if branch prediction is incorrect:
- flush shadow structures
- squash any boosted instructions still in pipe
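
The commit and squash behavior listed above can be illustrated with a small model of a shadow register file. This is a simplified sketch, not the design from the paper: it covers only one level of boosting, omits the shadow store buffer and deferred exceptions, and all names are hypothetical.

```python
class BoostedRegisterFile:
    """Register file with a shadow copy for boosted (speculative) writes."""

    def __init__(self, num_regs):
        self.regs = [0] * num_regs
        self.shadow = {}  # register index -> speculative value

    def write(self, index, value, boosted=False):
        if boosted:
            self.shadow[index] = value   # held until the branch resolves
        else:
            self.regs[index] = value

    def read(self, index, boosted=False):
        # Boosted instructions see earlier boosted results first.
        if boosted and index in self.shadow:
            return self.shadow[index]
        return self.regs[index]

    def resolve_branch(self, prediction_correct):
        if prediction_correct:
            # Commit: move shadow results into the architectural registers.
            for index, value in self.shadow.items():
                self.regs[index] = value
        # On a misprediction the shadow state is simply discarded.
        self.shadow.clear()
```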

speedups

	max speedup	basic block	fetch 2, no load/store reorg.	fetch 2, no load/store reorg.	fetch 4, load/store reorg.	fetch 4, load/store reorg.
	(1 ld/st per cycle)	scheduling only	dynamic sched.	boosting	dynamic sched.	boosting
awk	3.91	1.17	1.41	1.49	1.86	1.52
ccom	3.03	1.11	1.41	1.52	1.97	1.57
espresso	4.19	1.22	1.51	1.70	1.97	1.79
irsim	2.84	1.11	1.42	1.55	1.94	1.64
latex	2.88	1.16	1.43	1.56	1.95	1.63

A. Rogers and K. Li "Software support for speculative loads," ASPLOS, 1992
- added a presence and poison bit to each register
S. Mahlke, W. Chen, W.-m. Hwu, B.R. Rau and M. Schlansker, "Sentinel scheduling for VLIW and superscalar processors," ASPLOS, 1992
M. Franklin and G. Sohi, "A new paradigm for exploiting fine-grain parallelism," Hawaii Intl. Conf. on System Sciences, 1992
- load/store reordering
HP patents
- 5,278,985 - Software method for implementing dismissible instructions on a computer, 1994
- 5,692,169 - Method and system for deferring exceptions generated during speculative execution
- 5,596,733 - System for exception recovery using a conditional substitution instruction which inserts a replacement result in the destination of the excepting instruction

Historical precedents for predication (conditional execution)

predication dates back to the early 1950s
- Wilkes lecture on control unit design, 1951 - "some of the micro-orders can be made conditional in their action as well as (or instead of) conditional as regards the switching of micro-control"
- IBM 604, 1952 - each instruction had a suppression bit, which controls whether it is executed or not
- Zemanek's MAILÜFTERL, 1954 - each instruction could be made dependent on one of 15 conditions (e.g., if the value in the ACC is negative) specified by a four-bit field in the instruction format
- Zuse Z22, 1955 - each instruction could be made dependent on a condition specified in a five-bit field in the instruction
- van der Poel's ZEBRA, 1958 - each instruction could be made dependent on a condition specified by a three-bit field in the instruction (this is a refinement of his 1952 ZERO instruction set in which non-branch instructions could be made conditional as a side effect of an unusual branching scheme)
- Electrologica X-1, 1959 - the basic instruction format had two "precondition bits" that specify whether the instruction should be executed or not, and two "post condition" bits that specify how the condition codes should be set after execution
- IBM ACS, 1967 - a set of 24 condition code registers allowed precalculation of branch conditions and also supported logical operations between condition codes; this is similar to the eight independent condition codes in the IBM RS/6000 and PowerPC; a 'skip flag' bit in each instruction was used along with a conditional 'skip' instruction to replace regular conditional branches
- CDC Flexible Processor, 1976 - each microinstruction is conditionally executed based on three bits in the microinstruction format (e.g., selecting among dozens of conditions including sign of a result, arithmetic overflow, I/O conditions, and loop control)
- Key Computer Laboratories K-1, 1988 - "The select instruction allows for the complete elimination of branches which are used to select between two results."
- ARM, ca. 1986 - each instruction is predicated
- Cydra 5, 1988 - each instruction is predicated
- HARP VLIW design, 1988 - each instruction is predicated
- Key Computer K-1, 1988 - most instructions are predicated
- Multiflow /500, 1990 - each floating-point operation or store could be made conditional (Colwell, et al., Supercomputing '90)
- HP PlayDoh, 1993 - experimental predicated instruction set architecture
- Mahlke, et al., "Comparison ..." paper (ISCA, 1995)
- TI VelociTI VLIW architecture, 1997 - each instruction is predicated
- (and lots of architectures have added a conditional move instruction)

Mahlke, et al., "Comparison of Full and Partial Predicated Execution Support," ISCA 1995

limited branch resources restrict # branches handled per cycle
imperfect branch prediction reduces performance by factor of 2 to 10
eliminate branches by predication
- partial predication - conditional moves
- full predication - every instruction, but adds another source operand
compiler performs if-conversion
- processor will fetch instructions from both paths but only allow instructions with true predicates to issue/complete
- partial predication changes dynamic instruction count by a factor of 0.93 to 2.1
- full predication changes dynamic instruction count by a factor of 0.83 to 1.29
advantages
- decreases # of branches so limited branch resources are not a problem
- decreases # of mispredicted branches so performance impact is lessened
- exposes multiple execution paths to hardware
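
The difference between the original branching code, full predication, and partial predication (conditional moves) can be sketched on a tiny example. This is an illustrative model only, not code from the paper: the function names are hypothetical, and guarded operations are modeled as Python closures rather than machine instructions.

```python
def branching_version(a, b):
    """Original control flow: one conditional branch, one path executed."""
    if a > b:
        return a - b
    return b - a


def fully_predicated_version(a, b):
    """If-converted form: one compare sets two complementary predicates and
    every operation carries a guard; both paths are fetched."""
    p_true = a > b          # hypothetical predicate register
    p_false = not p_true    # its complement, set by the same compare
    guarded_ops = [
        (p_true, lambda: a - b),
        (p_false, lambda: b - a),
    ]
    result, fetched = None, 0
    for guard, operation in guarded_ops:
        fetched += 1
        if guard:           # only true-predicate operations take effect
            result = operation()
    return result, fetched


def conditional_move_version(a, b):
    """Partial predication: compute both candidates unconditionally, then
    pick one with a single conditional move."""
    t_then = a - b
    t_else = b - a
    result = t_else
    result = t_then if a > b else result  # models the conditional move
    return result
```

The branch is gone in both converted forms, but operations from both paths are fetched, which is why the dynamic instruction count can rise.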

table of Million branches (Million mispredicts)

	superblock	conditional	full
	only	move	predication
grep	.66 (.01)	.17 (.02)	.17 (.02)
yacc	12 (.52)	5.9 (.45)	5.9 (.43)
espresso	75 (3.4)	38 (2.1)	33 (1.0)
eqntott	315 (42)	53 (6.7)	51 (6.9)
ear	1539 (66)	443 (16)	442 (15)

Possible insight into register size choice

Mahlke, Chen, Gyllenhaal, and Hwu, "Compiler code transformations for superscalar-based high-performance systems," Supercomputing '92, Minneapolis, Nov. 1992, pp. 808-817
- discusses 2-way issue, 4-way-issue, and 8-way issue "superscalar/VLIW" processors running 40 loop nests from Perfect Club benchmarks, SPEC-FP, and vector library functions
- to get maximum effectiveness of the ILP, several compiler optimizations need to be performed (e.g., loop unrolling, variable renaming, variable expansion, tree-height reduction)
- each optimization has the effect of increasing the number of registers needed
- concluding sentence: "37 of the 40 loops require fewer than 128 total registers after all transformations"

Historical precedents for independence architectures

the name of this architectural category is due to Josh Fisher and Bob Rau
explicitly encoded information on instruction independence is placed in the instruction format by the compiler; difference between independence architecture and VLIW (and esp. compressed VLIW) is that in the former the hardware does the scheduling of which instructions will execute together
early examples
- NBS PILOT, ready signal, 1958 - bit 65 in the 68-bit instruction format of the primary computer can be set to indicate that the program in the primary computer should stop and wait until a secondary computer has produced previously requested data, A.L. Leiner, et al., "Concurrently operating computer systems," Proc. UNESCO Conference on Information Processing, Paris, June 1959, pp. 353-361.
- Lee Higbie, concurrency control bits, 1978 - bits added to instruction format and set by programmer or compiler to indicate that the execution of an instruction should be delayed until a specified function unit has produced an operand, "Overlapped operation with microprogramming," IEEE Trans. on Computers, March 1978, pp. 270-275. [written while he was at U. Mass. Amherst about work on a signal processing computer at Sanders]
- Burton Smith's Horizon, lookahead, 1988 - field in instruction format is set to minimum distance to next dependent instruction (over all branch paths)
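
A lookahead field of the kind described for Horizon can be illustrated for straight-line code. The sketch below is a rough model under stated assumptions, not the actual Horizon encoding: it ignores branch paths, treats read-after-write, write-after-write, and write-after-read as dependences, and uses a hypothetical cap `max_lookahead` on the field value.

```python
def lookahead_fields(instructions, max_lookahead=7):
    """For each (dest, sources) instruction, count how many following
    instructions are independent of it before the first dependent one."""
    fields = []
    for i, (dest, sources) in enumerate(instructions):
        count = 0
        for later_dest, later_sources in instructions[i + 1:]:
            depends = (
                dest in later_sources      # read-after-write
                or dest == later_dest      # write-after-write
                or later_dest in sources   # write-after-read
            )
            if depends or count == max_lookahead:
                break
            count += 1
        fields.append(count)
    return fields


# r1 = r2 + r3; r4 = r5 + r6; r7 = r1 + r4; r8 = r7 + r7
program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),
    ("r7", ("r1", "r4")),
    ("r8", ("r7",)),
]
fields = lookahead_fields(program)
```

Here the first instruction gets a field of 1 because only the second instruction can safely overlap with it.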
LIW (long instruction word)
- original Stanford MIPS, 1984 - underpipelined and could pack an ALU op and a load/store op together into a single machine instruction, e.g., see Steven Przybylski, et al., "Organization and VLSI implementation of MIPS," Advances in VLSI and Computer Systems, 1984
- Apollo DN10000, 1988 - "FP companion" bit is leftmost bit of integer instruction format and is used to indicate if a paired floating-point instruction follows and is to be issued in parallel (the integer/FP pair must start on an 8-byte boundary, and an FP instruction cannot appear without the paired integer instruction); the five-operand version of the FP instruction format can specify both a multiply and an independent add/sub/truncate (thus, with the integer operation, the Apollo can execute a peak of three operations/cycle)
- Intel i860, 1988 - "dual instruction mode (DIM)" bit ("D-bit") in floating-point instruction format to indicate if aligned pairs of independent floating-point and integer instructions are to be issued in parallel (see Kohn, US 5,241,636); because of pipelining the bit has a two-cycle delayed effect and governs the dual issue of the instruction pair two cycles later; also, the i860 allowed multiple ways of specifying the execution of a floating-point addition and multiply at the same time, thus up to three operations could be performed per cycle ("dual operation", see Kohn, US 5,204,828)
- CMU iWarp, 1988 - two instruction formats: short (32 bits, loop-back bit and one operation) and long (96 bits, loop-back bit and either three floating-point operations or two floating-point operations and two integer/memory-access operations); references to queue-pointer registers implicitly resulted in memory loads and stores
- Stanford TORCH, 1990 - two instructions issued together (to "A side" and "B side" with some slotting restrictions) unless a dynamic nop bit is set in either instruction's extension byte (see TORCH architectural specifications)
- Fujitsu VPP500 scalar processor, 1994 - up to three operations per instruction word; the first four bits of the 64-bit instruction word serve as the format selector (see Y. Nakashima, et al., "Scalar processor of the VPP500 parallel supercomputer," Proc. ICS, 1995)
traditional VLIW
- roots of VLIW lie in horizontal microprogramming (e.g., Josh Fisher's work in trace scheduling was done for horizontal microcode)
- see, e.g., van der Poel's "Microprogramming and Trickology", 1962
- other horizontal microcode history and VLIW "pre-history"
  - Turing's ACE (1946)
  - IBM SSEC (1948) - two instructions in a "line of sequence", which could be used to specify two separate operations within the same program or duplicate operations using separate resources to provide checking [see US 2,636,672]
  - Elliott 152 (1950) and Elliott 153 (1954) - the 153 had a 64-bit instruction specifying multiple register transfers (ALU, multiplier, I/O, branching, and control of two scratchpad memories)
  - Wilkes and Stringer paper (1953) - suggesting horizontal microcode
  - array processors, including IBM 2938 Array Processor (1969), IBM 3838 Array Processor (1974), and FPS AP-120B (1975)
  - P.M. Melliar-Smith, "A design for a fast computer for scientific calculations," in 1969 AFIPS FJCC, pp. 201-208. He proposes "direct functional control" for inner loops in array processing applications, by which he means a noninterlocked VLIW design with exposed pipelining. (He's writing in reaction to execution resources "squandered" and "wasted" by a Tomasulo-like E-box coupled with a one-instruction-decode-per-cycle I-box.)
  - Culler patent (1973) - "Data processor with parallel operations per instruction" [US 3,771,141]
  - Pomerene patent (1981) - "Machine for multiple instruction execution" [US 4,295,193]
  - Rau's Polycyclic Architecture project at TRW/ESL (1981)
  - Fisher's ELI-512 design (1983)
- see J. Fisher, P. Faraboschi, and C. Young, "VLIW processors: Once blue sky, now commonplace," IEEE Solid-State Circuits Magazine, vol. 1, no. 2, Spring 2009, pp. 10-17.
compressed VLIW / flexible VLIW - variable-length encoding of VLIW programs
- Multiflow, 1988
  - in-memory compression scheme - VLIW instructions are expanded during i-cache miss and stored in VLIW format in the i-cache
  - a single instruction encodes first-beat and second-beat operations (slots in the instruction word have a fixed assignment to first cycle of execution or second cycle of execution)
  - mnop - multicycle nop to halt instruction fetch for specified number of cycles to save space in the i-cache
  - Colwell/et al., "A VLIW architecture for a trace scheduling compiler," IEEE Trans. on Computers, August 1988, pp. 967-979.
  - Colwell/et al., "Architecture and Implementation of a VLIW Supercomputer," Proc. Supercomputing, 1990, pp. 910-919.
- Cydrome, 1988
  - normal VLIW multi-op format (256-bit instruction word, seven fields)
  - for compression, added a uni-op format which contained routing fields to specify which function units were used (six 40-bit uni-op instructions are held in a 256-bit instruction word)
  - mnoop was a special uni-op instruction that halted instruction execution for a specified number of cycles to allow for the cases where no instructions were ready to execute (and thus avoid a series of empty multi-ops or uni-ops)
  - memory latency register specifies latency used by compiler when scheduling; hardware buffers values from any loads that complete earlier or stalls the processor if any loads complete later than the specified number of cycles
  - had plans for an in-memory compression scheme (vari-ops) for second generation design (Cydra-10); similar to Multiflow since vari-ops would be expanded into one or more multi-op instructions during i-cache miss processing
  - Beck/Yen/Anderson, "The Cydra 5 minisupercomputer: Architecture and implementation," Journal of Supercomputing, May 1993, pp. 143-180.
- Intergraph Clipper 5 (U.S. patent 5,560,028, 1996)
  - called a "software scheduled superscalar" architecture but more accurately classified as a compressed VLIW scheme
  - tags are added for multiple-issue group identification along with routing tags for function unit assignment; the tags control a crossbar switch
  - later paper mentions use of a register scoreboard to determine when to issue the next group
  - Arya/Sachs/Duvvuru, "An architecture for high instruction level parallelism," 28th Hawaii Intl. Conf. Syst. Sci., 1995, pp. 153-162.
  - [Arya worked for Higbie while they were at Gould in 1980s]
- Philips Trimedia, 1996
  - a compressed instruction format is stored in the cache as well as memory and is expanded by a decompressor unit (decompression takes place during one pipeline stage of instruction fetch)
  - the encoding eliminates nops by using a header that includes a count of operations in that instruction
  - an uncompressed instruction has five operation slots, each of which contains an execution unit identifier that is used to route that operation to the appropriate execution unit
  - three generations: TM-1, TM-1000, and CPU64
- TI VelociTI VLIW architecture, 1997
  - design started in 1992, chief architect is Ray Simar
  - fetch packet of 8 instructions
  - one to eight variable-length, multiple-issue execute packets can be contained within each fetch packet; they are delimited by "parallel instruction" link bits in the instruction format
  - 5 delay slots per branch and 4 per load
  - multicycle nop
  - TI C6x family: C62x, C64x, and C67x
- Starcore, 1998
  - 16-bit instruction formats
  - VLES (variable length execution set) - two options:
    - serial - a two-bit field is allocated in the instruction format of a subset of the instructions; "00" indicates that the current instruction is included with the next, other values indicate a stop
    - prefix - instructions can also be grouped using one or two prefix words; the prefix contains a set count and also provides for conditional execution, access to more registers, and looping
  - SC140, 1998 - up to six instructions in an execution set
  - SC110, 200x - up to three instructions in an execution set
  - execution set determined from encoding during dispatch stage in a 5-stage pipeline ( prefetch / fetch / dispatch / address generation / execute )
  - each execution set advances as a unit; thus, the longest running instruction determines the number of cycles its execution set occupies the execution pipeline stage
  - see also US Patent 6,418,527 B1, "Data processor instruction system for grouping instructions with or without a common prefix and data processing system that uses two or more instruction grouping methods"
- TigerSHARC, 1998
  - "static superscalar" - one to four 32-bit instructions can be executed each cycle from a 128-bit instruction line, most significant bit of each instruction acts as a stop bit
  - minor slotting restrictions (e.g., a conditional or program sequencer instruction must be placed in the first slot of a line)
  - no memory alignment restrictions for instruction lines
  - register scoreboard, stalls complete line
- Sun MAJC, 1999
  - one to four instructions, count field in first instruction
  - retains slotted assignment to function units, each of which is general purpose
  - function units have separate set of local registers and share a common set of global registers
  - load-use and long-latency-operation register scoreboard
  - after fetch, align stage prepares for 1-to-4-way issue based on count field in first unissued instruction
  - Tremblay/Chan/Chaudhry/Conigliaro/Tse, "The MAJC architecture: A synthesis of parallelism and scalability," IEEE Micro, November-December 2000, pp. 12-25.
- Fujitsu FR-V family, 1999
  - each 32-bit instruction has a 1-bit packing flag, acts as a stop bit
  - up to four instructions in parallel, "nop insertion and slot distribution" occur after fetching from the i-cache
  - fairly general functional units so slot assignment is not a big issue
  - Sukemura, "FR500 VLIW-architecture high-performance embedded microprocessor," Fujitsu Sci. Tech. Jrnl., June 2000, pp. 31-38.
  - Suga/Matsunami, "Introducing the FR500 embedded microprocessor," IEEE Micro, July-August 2000, pp. 21-27.
- Aditya/Mahlke/Rau, "Code size minimization and retargetable assembly for custom EPIC and VLIW instruction formats," HP technical report HPL-2000-141, Oct. 2000.
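
Several of the compressed encodings above mark the end of an issue group with a stop bit and expand groups back to full width after fetch. A minimal sketch of that idea follows. It is not the encoding of any particular machine: the function names and the `max_width` limit are assumptions, and a link bit of opposite polarity (marking "issue with the next instruction") would need the test inverted.

```python
def split_into_issue_groups(instructions, max_width=4):
    """Group (mnemonic, stop_bit) pairs; a set stop bit ends the group."""
    groups, current = [], []
    for mnemonic, stop_bit in instructions:
        current.append(mnemonic)
        # Assumption: a full-width group also ends even without a stop bit.
        if stop_bit or len(current) == max_width:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def expand_to_fixed_width(groups, width=4):
    """Pad each group with nops, as a decompressor would after fetch."""
    return [group + ["nop"] * (width - len(group)) for group in groups]


compressed = [
    ("add", False), ("ld", True),
    ("mul", True),
    ("sub", False), ("st", False), ("br", True),
]
groups = split_into_issue_groups(compressed)
expanded = expand_to_fixed_width(groups)
```

The six stored instructions expand to three four-slot words; the nops exist only after expansion, which is the space saving the compressed formats aim for.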
IBM SCISM, early 1990s
- compound units "reflect the parallel issue of instructions"
- 3 instructions per compounded unit (provision made to jump into the middle of a compound unit)
- compounding can be done by the compiler, at the time of a page fault, or at the time of i-cache refill
- compound units include tag bits that can indicate dependency info, e.g., for interlock-collapsing function units
- Vassiliadis/Blaner/Eickemeyer, "SCISM: A scalable compound instruction set machine," IBM JRD, 38/1, Jan. 1994, pp. 59-78
- apparently never built, but lots of patents
Transmeta Crusoe, 2000 - six instruction formats (2-4 instructions)
- AA - two ALU instructions
- AB - ALU instruction and branch
- AI - ALU instruction with 32-bit immediate value
- LA - load/store and ALU instruction
- LAAB - load/store, two ALU instructions, branch
- LAAI - load/store, two ALU instructions (one w/ 32-bit immediate)
(Several patents, including U.S. 6,031,992, 2000)
retrofitting: examples of hardware marking of independence internally via predecoding and retaining the marking within a decoded i-cache (i.e., when you move the dependency detection out of the fetch and decode pipeline stages but not all the way back to compile time due to instruction set compatibility)
- NS Swordfish, 1991 - instruction pair dependency bit is contained in each decoded i-cache entry; it is set on i-cache refill by predecode hardware and yields LIW issue of independent instruction pairs; no bits are used in the normal instruction format.
- Minagawa/Saito/Aikawa, 1991 - "Pre-decoding mechanism for superscalar architecture," IEEE Pacific Rim Conf. on Comm., Comp., and Sig. Proc., pp. 22-24; on i-cache miss, a predecoder adds instruction grouping ("priority") and function unit assignment fields.
  (see also US Patent 5,163,139, "Instruction preprocessor for conditionally combining short memory instructions into virtual long instructions")
- HP 7200, 1995 - six predecode bits are added for each double word in the i-cache; they encode resource conflicts and data dependencies and are set by a predecoder on i-cache refill.
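The retrofitting idea in the examples above, computing independence marks once at i-cache refill rather than on every fetch, can be sketched as follows. This is a simplified model under stated assumptions, not the logic of any of the listed machines: the Instr record, the register names, and the rule that a pair is independent when the second instruction neither reads nor overwrites the first's destination are all hypothetical.

```python
from collections import namedtuple

# Hypothetical decoded-instruction record: one destination, a tuple of sources.
Instr = namedtuple("Instr", "dest srcs")


def pair_independent(first, second):
    """True when second neither reads nor overwrites first's destination."""
    if first.dest is None:
        return True
    return first.dest not in second.srcs and first.dest != second.dest


def predecode_line(instrs):
    """Return one independence bit per aligned instruction pair.

    Models work done once at i-cache refill; the bits would be stored
    alongside the line, not in the instruction format itself.
    """
    return [
        pair_independent(instrs[i], instrs[i + 1])
        for i in range(0, len(instrs) - 1, 2)
    ]


line = [
    Instr("r1", ("r2", "r3")),  # r1 = r2 op r3
    Instr("r4", ("r1",)),       # reads r1, so it depends on the first
    Instr("r5", ("r6",)),
    Instr("r7", ("r8",)),       # unrelated to its pair partner
]
bits = predecode_line(line)
print(bits)
```

The issue stage then only has to read the stored bit to decide whether a pair can go together, which is the point of moving the check out of the fetch and decode stages.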

Historical precedents for prepare-to-branch

Four aspects of conditional branching
1. Condition setting
  - condition storage
    - single set of bits in PSR or flags register for integer conditions
    - second set of bits in FP status register for flt. pt. conditions
    - high-perf. implementation problem for inst. sets that use single, serialized resources [cf. Sites on design of Alpha]
    - multiple sets of bits (e.g., IBM RS/6000, Key K-1)
    - use of general registers (e.g., MC88110)
  - specification of comparands
    - explicit compare instruction (basically a subtraction)
    - side effect of ALU operation (setting by ALU op is optional in SPARC)
2. Decision - logical relation between comparands (eq, ne, lt, le, gt, ge, flt. pt. unordered)
3. Branch target address
4. Change PC
  - immediate effect
  - delayed effect - next one or so sequential instructions have already been fetched and will be executed regardless of branch decision
  - delayed effect with annulling/squashing - sequential instructions have already been fetched and may be executed or optionally purged when the branch is untaken (e.g., SPARC)
Packaging these aspects
- compare (1), then conditional branch (2+3+4)
- ALUop side effect (1), then conditional branch (2+3+4)
- compare and branch (1+2+3+4) - may need multiple comparand specifiers plus the branch address field, although often a register is compared against 0
- [IBM ACS, 1967] prepare to branch (1+2+3), then exit (4)
- special branch registers to hold BTAs (accomplishing 3 and allowing for instruction prefetch when loaded), with remaining steps packaged as (1) then (2+4), or together as (1+2+4)
ISA add-ons
- [TI ASC, 1972] prepare to branch -- redundant specification of 3, for prefetch
- [PIPE, 1985 (Pleszkun and Farrens)] prepare to branch -- intended as generalized delayed branch technique where the PTB instruction would specify the number of delay slots after a branch instruction (0-7)
hardware retrofit
- US 3,553,655, "Short forward conditional skip hardware," inventors were IBM S/360 Model 91 team members Anderson and Sparacio
- see also (among others to be listed)
  - US 6,662,294, Kahle and Moore, "Converting short branches to predicated instructions"
  - US 7,409,534, Uht, Morano, and Kaeli, "Automatic and transparent hardware conversion of traditional control flow to predicates"
  - US patent application 20100262813, Brown, et al., "Detecting and handling short forward branch conversion candidates"

Historical precedents for rotating register files

"A different, programmatically controlled register renaming scheme is obtained by providing rotating register files, that is, base-displacement indexing into the register file using an instruction-provided displacement off a dedicated base register. Although applicable only for renaming registers across multiple iterations of a loop, rotating registers have the advantage of being considerably less expensive in their implementation than are other renaming schemes." - Rau and Fisher, Jrnl. Supercomputing, 1993, p 22.
scratch-pad in AP-120B/FPS-164, 1976 (Charlesworth)
compacting FIFO structure in Polycyclic Architecture at TRW/ESL, 1981 (Rau)
rotating registers in Cydrome Cydra-5, 1988 (Rau)
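The base-displacement indexing that Rau and Fisher describe can be illustrated with a small model. It is a sketch only: the class name RotatingRegisterFile, the file size, and the choice to decrement the base on each rotation are assumptions for the example and are not taken from any of the machines listed above.

```python
class RotatingRegisterFile:
    """Register file addressed as (base + displacement) mod size."""

    def __init__(self, size):
        self.regs = [0] * size
        self.base = 0  # the dedicated base register

    def _physical(self, disp):
        # Instruction-provided displacement off the base register.
        return (self.base + disp) % len(self.regs)

    def read(self, disp):
        return self.regs[self._physical(disp)]

    def write(self, disp, value):
        self.regs[self._physical(disp)] = value

    def rotate(self):
        # Assumed direction: decrementing the base makes a value written
        # at displacement d visible at d + 1 in the next iteration.
        self.base = (self.base - 1) % len(self.regs)


rrf = RotatingRegisterFile(8)
rrf.write(0, 10)   # iteration 0 writes its result at displacement 0
rrf.rotate()
rrf.write(0, 20)   # iteration 1 uses the same displacement, new register
rrf.rotate()
print(rrf.read(2), rrf.read(1))
```

Each loop iteration uses the same displacement in its instructions yet lands in a different physical register, so values from overlapping iterations do not overwrite one another. Only a base register and an adder are needed, which is why the quoted passage calls the scheme inexpensive relative to other renaming schemes.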

Historical precedents for register stack engine

Dick Sites' dribble-back registers (1979)
Hitachi SR2201 preload and poststore ("slide-windowed registers")

My thanks to Harsh Sharangpani for his help; Jason Eckhardt for help with the i860 and AP-120B descriptions; Norm Hardy for pointing me to the Gray, et al., paper; and Josh Fisher for help with the history of load speculation.

mark@cs.clemson.edu