Was Stretch Superscalar?

Mark Smotherman
March 15, 2010
(updated December 30, 2016)

Summary: The IBM Stretch used several aggressive parallel processing techniques to enhance its performance, but I do not believe that it could achieve an instruction throughput rate greater than one instruction per cycle. While my definition of a superscalar processor includes the goal of achieving a throughput rate greater than one, others do not require this when applying the label "superscalar". Thus, depending upon the choice of definition, some might label Stretch as a superscalar processor. However, I believe the term superscalar is better reserved for multiple decoding designs like the IBM Project Y Floating Point Processor and the IBM ACS.

What is a "superscalar" processor?

Different authors have used the term "superscalar" to mean different things.

The definition of a superscalar processor that I and others use is a computer designed to accept a single, sequential instruction stream and fetch, decode, execute, and write back multiple instructions from that stream each clock cycle. That is, a superscalar processor will be able to concurrently fetch multiple instructions per clock cycle, decode multiple instructions per clock cycle, issue or dispatch (i.e., start the execution of) multiple instructions per cycle, and write back the results of multiple instructions per cycle. This leads to an instruction throughput rate (instructions per cycle, or IPC) that is greater than one.

There have been other definitions of "superscalar" over the past twenty or so years. I list some of the more salient definitions in Appendix A, especially those that mention Stretch. The table below summarizes these definitions; my inferences regarding some of the included processor types are listed in italics. I also list the definitions of other related terms in Appendix B.

Author(s)	Summary of definition	Includes Stretch?	Includes VLIW?	Includes in-order?
Agerwala and Cocke (1987)	Dispatch multiple instructions every cycle	No	No	No
Lam (1990)	Execute multiple operations in parallel	Yes	Yes	Yes
Rau and Fisher (1992)	Issue an instruction every cycle, and further goal is to issue multiple instructions every cycle	Unclear^*	No	Yes
Smith and Sohi (1995)	Initiate multiple instructions in same cycle, breaking the single-instruction-per-cycle bottleneck	No	No	No
Hennessy and Patterson (2007 and 2009)	Issue multiple instructions in a cycle, allowing the instruction execution rate to exceed the clock rate	No	No	Yes

^* In a footnote Rau and Fisher say they include "look-ahead processors" from 1960s as superscalars. If they are using Keller's definition of look-ahead (see Appendix B), then this might exclude Stretch. Rau and Fisher do mention Stretch explicitly but in a separate section on overlapping and pipelining (see Appendix A).

It should be evident from the table above that various authors differ on what they mean by the term "superscalar". Some define it to mean a broad collection of different processor types, with Monica Lam explicitly identifying Stretch as a superscalar.

According to my definition (which matches that of Hennessy and Patterson), Stretch would not be classified as a superscalar. Instead I would describe Stretch as a decoupled access/execute design with look ahead and a decoupled microarchitecture (see Appendix B).

Instead, I would classify the IBM Project Y FPP and the derivative ACS design as the first superscalar designs.

If pressed to use Monica Lam's definition, then I would qualify my description of FPP/ACS as a multiple decoding or multiple-issue superscalar design ("seven-issue") or as a superscalar that could sustain an instruction throughput rate of greater than one per cycle. (FPP/ACS also share the decoupled access/execute approach of Stretch.)

I note in passing that I attributed the idea of superscalar processors to John Cocke and ACS in 1999 when a Russian journalist claimed that Russian computer designers had invented superscalar first. See Mike Magee, "Battle royal breaks out over Russian chip claim," The Register, June 8, 1999, available on-line as "http://www.theregister.co.uk/1999/06/08/battle_royal_breaks_out_over/".

Mark Smotherman, writing from a US university, said: "I would like to correct, for the record, the statement in 'Intel uses Russia military technologies' by Andrei Fatkullin in which he says: 'Superscalar architecture was invented in Russia.' The first superscalar design was the IBM ACS-1 supercomputer, designed in Menlo Park, California, in the mid-1960's at IBM's Advanced Computing Systems by a team that included John Cocke. In fact, the vision for a computer that decoded and issued multiple instructions per cycle was due to Cocke."

(The Elbrus-1 processor for which the Russians claimed credit had a decoupled microarchitecture in which up to one architected instruction from a Burroughs-like stack-machine instruction set could be issued per cycle but with multiple derived microoperations executing in parallel. The NexGen F86, ca. 1989, was a similar externally-scalar/internally-superscalar design intended for executing programs compiled to the Intel x86 instruction set.)

I also attributed the idea of superscalar to John Cocke in my 2007 article in Computer Engineering Handbook and in my chapter on "Survey of Superscalar Processors" in the textbook by John Shen and Miko Lipasti (Modern Processor Design: Fundamentals of Superscalar Processors, 2005, chapter 8, pp. 369-451). The following excerpt is from the 2007 article:

The idea of a superscalar computer originated with John Cocke at IBM in the 1960s. Cocke has said that Gene Amdahl, architect of the IBM 704 and one of the architects of the IBM S/360, postulated a bound on computer performance that included an assumption of a maximum decoding rate of one instruction per cycle on a single processor. Cocke felt that this was not an inherent limit for a single processor. His ideas about multiple decoding became an important part of the IBM ACS-1 supercomputer design, which was started in 1965 but ultimately cancelled in 1969. In this design, up to 16 instructions would be decoded and checked for dependencies each cycle and up to seven instructions would be issued to function units [6].
[p. 2-8]

Instruction Throughput Rate for the IBM Stretch

I have described the parallelism of Stretch in a web page entitled "IBM Stretch (7030) -- Aggressive Uniprocessor Parallelism," available on-line as http://www.cs.clemson.edu/~mark/stretch.html.

CPI Analysis from MIPS Estimates

I have searched for published MIPS (millions of instructions per second) rates for Stretch and various other processors. For Stretch, I have found a range from approximately 500 KIPS (see, e.g., Dag Spicer, "It's Not Easy Being Green (or 'Red'): The IBM Stretch Project," Dr. Dobb's Journal, April 2000, available on-line as http://www.drdobbs.com/184404433) to 1.2 MIPS in the Stretch Wikipedia article (available on-line as http://en.wikipedia.org/wiki/IBM_7030_Stretch). For comparison to other delivered machines I used John McCallum's collection of MIPS estimates (see http://www.jcmit.com/cpu-performance.htm), and I used estimates for ACS. (McCallum has 484 KIPS for Stretch.)

CCT - clock cycle time (* control cycle time for Stretch)
MCT - memory cycle time
MIPS - millions of instructions per second
CPI - cycles per instruction (* control cycles per instruction for Stretch)

date	machine	CCT^*	MCT	MIPS	CPI^*	max decode rate	out-of-order?
1961	Stretch	600 ns*	2180 ns	500 KIPS to 1.2 MIPS	3.3 to 1.4	1	preexecution by I-unit
1964	CDC 6600	100 ns	1000 ns	5.36 MIPS	1.9	1	yes
1967	IBM S/360 M91	60 ns	780 ns	5 MIPS	3.3	1	FP unit
1969	CDC 7600	27.5 ns	275 ns	14.3 MIPS	2.5	1	yes
197x	IBM ACS	10 ns (goal)	(cache)	>100 MIPS	<1	7 (3+3+branch)	A-unit

There may be some confusion in defining the Stretch "machine clock cycle". The Stretch master clock operated at 3.3 MHz = 300 nsec cycle time. However, Bob Blosk's technical paper on the I-unit goes into detail about the clocking. In particular, the Stretch I-unit control cycle uses timing signals derived from two master clock cycles to avoid race conditions. So, for the Stretch I-unit, a control cycle, or equivalently a "machine cycle", equals 600 nsec. See pages 21-24 of "Design and Performance Goals of the STRETCH Computer Instruction Unit," TR00.722, May 1960, available on-line as a pdf file http://www.textfiles.com/bitsavers/pdf/ibm/7030/TR00.722_Stretch_I-Unit_Design_Mar60.pdf. (Note at the time of writing of the report, the Stretch master clock was assumed to be operating at 4 MHz = 250 nsec cycle time.)

If a clock cycle time of 300 nsec is used for Stretch, then the CPI values double to a range from 6.6 to 2.8. Even these are impressive for a 1950's design, but none of the possible Stretch CPI values are fractional, as I would want to see in order to classify it as a superscalar. Only ACS from the table above, had it been built and reached its goals, would have had a fractional CPI (or IPC greater than one).
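
The CPI column follows from dividing the cycle rate by the instruction rate. The short Python sketch below reproduces that arithmetic for illustration only; the function name cpi and the data layout are assumptions made for this example, and the inputs are the cycle times and MIPS estimates quoted above.

```python
def cpi(cycle_time_ns: float, mips: float) -> float:
    """Cycles per instruction from a cycle time (ns) and a rate (MIPS)."""
    cycles_per_second = 1e9 / cycle_time_ns
    instructions_per_second = mips * 1e6
    return cycles_per_second / instructions_per_second


# (machine, cycle time in ns, MIPS estimate), taken from the table above.
# Stretch appears twice to cover the low and high published estimates.
machines = [
    ("Stretch (500 KIPS)", 600.0, 0.5),
    ("Stretch (1.2 MIPS)", 600.0, 1.2),
    ("CDC 6600", 100.0, 5.36),
    ("IBM S/360 M91", 60.0, 5.0),
    ("CDC 7600", 27.5, 14.3),
]

for name, cct_ns, mips in machines:
    print(f"{name:20s} CPI = {cpi(cct_ns, mips):.1f}")

# The same Stretch estimates measured against the 300 ns master clock
# instead of the 600 ns control cycle.
for mips in (0.5, 1.2):
    print(f"Stretch at 300 ns, {mips} MIPS: CPI = {cpi(300.0, mips):.1f}")
```

Computed without intermediate rounding, the 300 nsec case at 500 KIPS comes out near 6.7; the 6.6 quoted above is twice the rounded 3.3. At the ACS goal of a 10 nsec cycle, any rate above 100 MIPS gives a CPI below one.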

Timing Diagrams and Performance Characterizations

From the Blosk technical report (op. cit.), I can find no indication of the ability to perform multiple-instruction-decoding or multiple-instruction-issue (apart from the I-unit being able to predecode up to two half-word instructions at a time).

On pages 28-30, Blosk shows the possibility of a sustained rate of one half-word FP instruction every two control cycles and mentions the possibility of a short peak rate of one per cycle when indexing is not required.

A timing diagram showing the preparation of continuous FP instructions is shown in Figure 6. ... It can be seen how successive instruction fetches, preparations, and lookahead loads are overlapped to achieve a performance goal of one FP instruction every two cycles (one microsecond). Each instruction is assumed to require indexing. Otherwise instantaneous rates would achieve one FP instruction every half microsecond; the average rate, however, would still remain at one per microsecond. This is because the maximum rate of instruction fetches is balanced with the maximum indexing rate and the lookahead rate.
[p. 28]

With the subsequent change from 4 MHz clocking to 3.3 MHz, this rate matches the description in the March 1961 "7030 Performance Characteristics" manual, vol. 1 (available on-line as http://www.textfiles.com/bitsavers/pdf/ibm/7030/7030_Performance_Characteristics_Vol1_Mar61.pdf):

For a sequence of floating-point instructions the decoding rate is one instruction per 1.2 μs. With few exceptions, this is faster than the execution rate. Thus the decoding time for average floating point sequences is completely overlapped by concurrent E-box execution time. In other words, floating point instructions are E-box limited.
[p. 3-17]
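
The match between the two documents can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: it assumes, as described above, that a control cycle spans two master clock cycles and that the sustained rate is one FP instruction every two control cycles. The constant and function names are hypothetical.

```python
# Assumptions for this sketch: a control cycle is two master clock cycles,
# and the sustained FP rate is one instruction every two control cycles.
MASTER_CYCLES_PER_CONTROL_CYCLE = 2
CONTROL_CYCLES_PER_FP_INSTRUCTION = 2


def fp_decode_interval_ns(master_cycle_ns: float) -> float:
    """Time between successive FP instruction decodes, in nanoseconds."""
    control_cycle_ns = master_cycle_ns * MASTER_CYCLES_PER_CONTROL_CYCLE
    return control_cycle_ns * CONTROL_CYCLES_PER_FP_INSTRUCTION


for label, master_cycle_ns in (
    ("4 MHz clock assumed in the 1960 report", 250.0),
    ("3.3 MHz master clock", 300.0),
):
    interval_us = fp_decode_interval_ns(master_cycle_ns) / 1000.0
    print(f"{label}: one FP instruction per {interval_us:.1f} microseconds")
```

In control-cycle terms the sustained rate is 0.5 instructions per cycle under either clock, and the short peak rate mentioned by Blosk is one per cycle.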

I also corresponded with Bob Blosk, Fred Brooks, and Dick Holleran. None could remember details, but Fred Brooks doubted that Stretch could execute more than one instruction per cycle.

Serialization Points in Stretch

To further explain my understanding that Stretch cannot reach an instruction throughput rate greater than one, I believe that there are at least two points of instruction serialization in Stretch:

All instructions are routed one-at-a-time through the I-unit execution register (ZR) and then into one or more lookahead levels.
Every instruction is placed in this register from the Y registers for indexing, decoding, execution, and lookahead loading.
[Blosk, tech. rept., op. cit., p. 13]

I also note that the 1960 EJCC conference version of the paper has this sentence:
All instructions eventually are loaded into lookahead.
[Blosk, 1960 EJCC, top of col. 2, page 300]

The Performance Characteristics manual explicitly mentions indexing instructions:
Each I-box instruction results in one (sometimes more) level of LA being loaded.
[7030 Perf. Char., op. cit., bottom of page 3-22]
The E-unit processes the oldest lookahead level one-at-a-time, including indexing instructions.
In order to know the memory address of every instruction as it is interrupt tested, the I unit loads the instruction counter value into lookahead with each instruction.
[Blosk, tech. rept., op. cit., p. 5]

The LA-Unit 18 supplies OP codes and operands or data one at a time to the E-Unit 22. Each time E-Unit 22 finishes an operation from one level of LA-Unit, LA-Unit supplies the E-Unit with the OP code and operand from its next level.
...
Before the E-Unit is allowed to complete its operation (modify addressable registers) it must wait for a signal "OK to modify addressable registers" (OK MAR) from LA. This signal is a result of LA's testing the status of the interrupt mechanism.
[Bahnsen and Dirac, US 3,156,897, 6:65-69; 10:38-42]

The decoding time of I-box instructions is 0.6 μs. Since these instructions are executed in the I-box, the decoding time has been included in the execution times. The processing time of I-box instructions can be largely overlapped by concurrent E-box action, since their E-box time is only 1.2 μs.
[7030 Perf. Char., op. cit., p. 3-17]
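
The serialization argument can be stated as a simple bound: the sustained instruction rate of a chain of stages cannot exceed the capacity of its narrowest stage. The Python sketch below is a toy model under that assumption, not a simulation of either machine. The stage names and the per-cycle widths are simplifications drawn from the descriptions in this article (predecode of up to two half-word instructions, one instruction at a time through ZR, one lookahead level at a time to the E-unit, and for ACS up to 16 decoded and up to seven issued per cycle).

```python
def throughput_bound(stage_widths):
    """Upper bound on sustained instructions per cycle: the narrowest stage."""
    return min(stage_widths.values())


# Simplified, hypothetical stage widths (instructions per control cycle).
stretch_like = {
    "I-unit predecode": 2,
    "ZR execution register": 1,
    "lookahead to E-unit": 1,
}

# ACS goals as quoted earlier: decode up to 16, issue up to 7 per cycle.
acs_like = {
    "decode": 16,
    "issue": 7,
}

print("Stretch-like bound:", throughput_bound(stretch_like))
print("ACS-like bound:", throughput_bound(acs_like))
```

Under this model a single width-one stage anywhere in the path is enough to cap the rate at one instruction per cycle, regardless of how much overlap exists elsewhere.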

Conclusion

I think the Stretch was a marvelous machine with design aspects such as speculative execution that could readily be said to be ten to thirty years ahead of their time. However, I think it would be a mistake to use a broad-brush definition and label Stretch as a superscalar. I think instead that the Project Y FPP and the ACS deserve the label of superscalar, and that excluding Stretch from the superscalar category does not diminish its stature.

Appendix A: A collection of definitions of the term superscalar

Agerwala and Cocke (1987)

I believe the first use of the term "superscalar" occurred in the 1980s during talks by Tilak Agerwala and John Cocke, both of IBM, as they drew a distinction between multiple-issue RISC processor design and vector processors. The following quotes come from their joint technical report, "High Performance Reduced Instruction Set Processors," IBM Watson Research Center, RC 12434, 1987.

One approach to obtaining high performance without vector instructions is to dispatch multiple instructions to the execution units every cycle. This corresponds to a more general though also more limited form of concurrency than vector instructions. By allowing out-of-sequence execution, special handling of branch instructions, and providing pipelined execution hardware, very high performance can be obtained on unstructured code. We call such machines superscalar processors.
[p. 48]
To explain the overall superscalar approach, we will start with the machine organization in Figure 8. On the surface, this is a fairly common block diagram. There are several units: branch, fixed point, floating point, etc. Each unit has a queue to hold incoming instructions, a decoder, registers, and pipelined execution hardware. The goal is to fetch and dispatch N instructions every cycle, one to each unit.
[p. 49]
The superscalar approach can be summarized as follows. The architecture largely preserves the basic Single Instruction Stream Single Data Stream (SISD) model. As a result, standard optimizations and locality (instruction buffers, data caching, register usage) can be exploited. Fundamental bottlenecks to instruction dispatching are removed. As a result, the machine is flooded with instructions (the active number is a function of the amount of buffering provided). The instructions then execute based on the availability of operands and execution resources.
[p. 56]

They are describing what I would call an out-of-order superscalar processor.

Lam (1990)

The following excerpt is from Monica Lam, then at Stanford University, in her paper "Compiler Optimizations for Superscalar Computers," in Proceedings of the Ninth International Conference on Computing Methods in Applied Sciences and Engineering, R. Glowinski and A. Lichnewsky (eds.), SIAM, 1990, pp. 360-377.

Superscalar machines existed long before the term was coined. The IBM Stretch [1], the CDC 6600 [2] and the IBM 360/91 [3] are all superscalar architectures that can execute multiple operations in parallel. These machines all implement a sequential instruction set with hardware that schedules the instructions dynamically. Instructions can also be scheduled by software. Epitomizing the class of superscalar machines that rely on software for scheduling instructions is the VLIW (Very Long Word Instruction) architecture [4]. Each wide word specifies the operations to be executed in parallel.
[p. 361]

I would characterize Dr. Lam's use of "superscalar" in her paper as a broad umbrella term that covers all non-vector processors with multiple functional units. She explicitly labels the IBM Stretch as a superscalar.

Rau and Fisher (1992)

Bob Rau and Josh Fisher, who were part of VLIW startup companies Cydrome and Multiflow, respectively, published a thorough history of instruction-level parallelism (ILP) techniques in 1992. (Their bibliography contains 225 citations.)

The following excerpts come from their "Instruction-Level Parallel Processing: History, Overview and Perspective," Hewlett-Packard, Computer Systems Laboratory, HPL-92-132, October, 1992 (available on-line as http://www.hpl.hp.com/techreports/92/HPL-92-132.pdf; later published in The Journal of Supercomputing, vol. 7, no. 1, January 1993).

In 1963, Control Data Corporation started delivering its CDC 6600 [4, 5], which had 10 functional units -- integer add, shift, increment (2), multiply (2), logical, branch, floating-point add and divide. Any one of these could start executing in a given cycle whether or not others were still processing data-independent earlier operations. In this machine the hardware decided, as the program executed, which operation to issue in a given cycle; its model of execution was well along the way toward what we would today call superscalar.
[p. 4]
Also during the 1960s, IBM introduced, and in 1967-8 delivered, the 360/91 [6]. This machine, based partly on IBM's instruction-level parallel experimental Stretch processor, offered less instruction-level parallelism than the CDC-6600, having only a single integer adder, a floating-point adder, and a floating-point multiply/divide. But it was far more ambitious than the CDC 6600 in its attempt to rearrange the instruction stream to keep these functional units busy--a key technology in today's superscalar designs. ... As with the CDC 6600, this ILP pioneer started a chain of superscalar architectures that has lasted into the 1990s.
[p. 5]
A superscalar processor¹ strives to issue an instruction every cycle, so as to execute many instructions in parallel, even though the hardware is handed a sequential program.
¹ The first machines of this type that were built in the 1960s were referred to as look-ahead processors. Subsequently, machines that performed out-of-order execution, while issuing multiple operations per cycle, came to be termed superscalar processors. Since look-ahead processors are only quantitatively different from superscalar processors, we shall drop the distinction and refer to them, too, as superscalar processors
[p. 10 with footnote 1]
The further goal of a superscalar processor is to issue multiple instructions every cycle.
...
Note that an ILP processor need not issue multiple operations per cycle in order to achieve a certain level of performance. For instance, instead of a processor capable of issuing five instructions per cycle, the same performance could be achieved by pipelining the functional units and instruction issue hardware five times as deeply, speeding up the clock rate by a factor of five but issuing only one instruction per cycle. This strategy, which has been termed superpipelining ([43]), goes full circle back to the single-issue, superscalar processing of the 1960s. Superpipelining may result in some parts of the processor (such as the instruction unit and communications busses) being less expensive and better utilized and other parts (such as the execution hardware) being more costly and less well used.
[p. 11, emphasis in original]
The related techniques of pipelining and overlapped execution were employed as early as in the late 1950s in computers such as IBM's STRETCH computer [65, 66] and UNIVAC's LARC [67]. Traditionally, overlapped execution refers to the parallelism that results from multiple active instructions, each in a different one of the phases of instruction fetch, decode, operand fetch, and execute whereas pipelining is used in the context of functional units such as multipliers and floating-point adders [68, 69]. (A potential source of confusion is that, in the context of RISC processors, overlapped execution and pipelining, especially when the integer ALU is pipelined, have been referred to as pipelining and superpipelining, respectively [43].)
[p. 17]

I will note in passing that Rau and Fisher were either unaware of the multiple decoding and instruction issue of the IBM ACS or perhaps gave publication date priority to Tjaden and Flynn's 1970 paper:

The first consideration given to the possibility of issuing multiple instructions per cycle from a sequential program was by Tjaden and Flynn [76]. ... This idea, of multiple instruction issue of sequential programs, was probably first referred to as superscalar execution by Agerwala and Cocke [82].
[p. 18]

Smith and Sohi (1995)

The following excerpts are from Jim Smith and Guri Sohi (University of Wisconsin), "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, vol. 83, no. 12, December 1995, pp. 1609-1624.

1. Introduction
Superscalar processing, the ability to initiate multiple instructions during the same clock cycle, is the latest in a long series of architectural innovations aimed at producing ever faster microprocessors. ...
A typical superscalar processor fetches and decodes the incoming instruction stream several instructions at a time. As part of the instruction fetching process, the outcomes of conditional branch instructions are usually predicted in advance to ensure an uninterrupted stream of instructions. The incoming instruction stream is then analyzed for data dependences, and instructions are distributed to functional units, often according to instruction type. Next, instructions are initiated for execution in parallel, based primarily on the availability of operand data, rather than their original program sequence. This important feature, present in many superscalar implementations, is referred to as dynamic instruction scheduling. Upon completion, instruction results are re-sequenced so that they can be used to update the process state in the correct (original) program order in the event that an interrupt condition occurs. Because individual instructions are the entities being executed in parallel, superscalar processors exploit what is referred to as instruction level parallelism (ILP).
1.1. Historical Perspective
Instruction level parallelism in the form of pipelining has been around for decades. A pipeline acts like an assembly line with instructions being processed in phases as they pass down the pipeline. With simple pipelining, only one instruction at a time is initiated into the pipeline, but multiple instructions may be in some phase of execution concurrently.
Pipelining was initially developed in the late 1950s [8] and became a mainstay of large scale computers during the 1960s. The CDC 6600 [61] used a degree of pipelining, but achieved most of its ILP through parallel functional units. Although it was capable of sustained execution of only a single instruction per cycle, the 6600's instruction set, parallel processing units, and dynamic instruction scheduling are similar to the superscalar microprocessors of today. Another remarkable processor of the 1960s was the IBM 360/91 [3]. The 360/91 was heavily pipelined, and provided a dynamic instruction issuing mechanism, known as Tomasulo's algorithm [63] after its inventor. As with the CDC 6600, the IBM 360/91 could sustain only one instruction per cycle and was not superscalar, but the strong influence of Tomasulo's algorithm is evident in many of today's superscalar processors.
The pipeline initiation rate remained at one instruction per cycle for many years and was often perceived to be a serious practical bottleneck. Meanwhile other avenues for improving performance via parallelism were developed, such as vector processing [28, 49] and multiprocessing [5, 6]. Although some processors capable of multiple instruction initiation were considered during the '60s and '70s [50,62], none were delivered to the market. Then, in the mid-to-late 1980s, superscalar processors began to appear [21,43,54]. By initiating more than one instruction at a time into multiple pipelines, superscalar processors break the single-instruction-per-cycle bottleneck.

Reference [50] is a citation of Herb Schorr's 1971 paper on the IBM ACS, and reference [62] is a citation of Tjaden and Flynn's 1970 study of a multi-issue 7094 design.

In a more recent paper ("Characterizing the branch misprediction penalty," ISPASS 2006), Smith and colleagues write:

The basis for the model is that a superscalar processor is designed to stream instructions through its various pipelines and functional units, and under optimal conditions, it sustains a level of performance equal to its issue (or commit) width.

Hennessy and Patterson (2007 and 2009)

One of the most widely used and respected textbooks in computer architecture these days is Computer Architecture: A Quantitative Approach, 4th ed., 2007, by John Hennessy and Dave Patterson.

The goal of the multiple-issue processors, discussed in the next few sections, is to allow multiple instructions to issue in a clock cycle. Multiple-issue processors come in three major flavors:

statically scheduled superscalar processors,
VLIW (very long instruction word) processors, and
dynamically scheduled superscalar processors.
The two types of superscalar processors issue varying numbers of instructions per clock and use in-order execution if they are statically scheduled or out-of-order execution if they are dynamically scheduled.
[p. 114, emphasis in original]

In their undergraduate textbook, Computer Organization and Design: The Hardware/Software Interface, 4th ed., 2009, they write about superscalars in a similar manner:

Pipelining exploits the parallelism among instructions. This parallelism is called instruction-level parallelism (ILP). There are two primary methods for increasing the potential amount of instruction-level parallelism. The first is increasing the depth of the pipeline to overlap more instructions. ... Another approach is to replicate the internal components of the computer so that it can launch multiple instructions in every stage. The general name for this technique is multiple-issue. ... Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, stated alternatively, the CPI to be less than 1.
[p. 391, emphasis in original]

Dynamic multiple-issue processors are also known as superscalar processors, or simply superscalars.
[p. 397, emphasis in original]

I believe that these excerpts agree with my definition, but, of course, various alternate definitions can be found in a search of other textbooks. (Disclosure: I was a technical reviewer for these books.)

Appendix B: Definitions of related terms

Confluent SISD

Mike Flynn used the term "confluent SISD" (single instruction stream, single data stream) in his 1966 paper on "Very High-Speed Computing Systems," Proceedings of the IEEE, December 1966, pp. 1901-1910.

The confluent SISD processor (IBM STRETCH [7], CDC 6600 series[8], IBM 360/90 series [2]-[5]) achieves its power by overlapping the various sequential decision processes which make up the execution of the instruction (Figs. 1 and 2). In spite of the various schemes for achieving arbitrarily high memory bandwidth and execution bandwidth, there remains an essential constraint in this type of organization. As we implied before, this bottleneck is the decoding of one instruction in a unit time, thus no more than one instruction can be retired in the same time quantum, on the average. If one were to try to extend this organization by taking two, three, or n different instructions in the same decode cycle, and no limitations were placed on instruction interdependence, the number of instruction types to be classified would be increased by the combinatorial amount (M different instructions taken n at a time represents Mⁿ different outcomes) and the decoding mechanism would be correspondingly increased in complexity.
[p. 1907]
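
Flynn's combinatorial point is easy to see numerically. The Python sketch below simply evaluates M to the power n; the opcode count of 64 is a hypothetical value chosen for illustration and does not describe any machine discussed here.

```python
def decode_outcomes(m: int, n: int) -> int:
    """Number of distinct cases when n instructions, each one of m types,
    are classified together in a single decode cycle (M to the power n)."""
    return m ** n


M = 64  # hypothetical number of instruction types, for illustration only

for n in (1, 2, 3):
    print(f"n = {n}: {decode_outcomes(M, n)} cases")
```

With these assumed values the count grows from 64 cases for one instruction to 4096 for two and 262144 for three, which is the growth in decoder complexity that Flynn describes when no limits are placed on instruction interdependence.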

Decoupled Access/Execute (DAE)

Jim Smith defined the phrase "Decoupled Access/Execute Computer Architectures" in a 1982 ISCA paper of that same title. The phrase is used to describe the decoupling of operand access and instruction execution. He credited Stretch and other IBM processors as early examples.

This paper discussed a new type of processor architecture which separates its processing into two parts: access to memory to fetch operands and store results, and operand execution to produce the results. By architecturally decoupling data access from execution, it is possible to construct implementations that provide much of the performance improvement offered by complex issuing schemes, but without significant design complexity.
...
The architecture proposed here represents an evolutionary step, since a similar, but more restricted, separation of tasks appeared as early as STRETCH [6], and has been employed to some degree in several high performance processors, including those from IBM, Amdahl, CDC and Cray.
[p. 112]

Decoupled Microarchitecture

Processors for complex instruction sets are sometimes designed to break down the more complex instructions into simpler units of execution for the underlying microarchitecture, such as micro-operations. Thus, the microarchitecture is said to be "decoupled" from the instruction set architecture. Various terms have been used to describe the breakdown process, including instruction "cracking", "decoding", and "fission". Stretch had a similar process of breaking down a complex instruction and loading multiple levels of its Look Ahead for a single instruction; this was described as "elementalizing" the instruction by Ralph Bahnsen and Jules Dirac (US Patent 3,156,897, 10:3-4; 12:73).

Dynamic Instruction Scheduling

In the 1969 IBM Watson Research Center Technical Report RJ565, entitled "Dynamic Instruction Scheduling", Lynn Conway, Brian Randell, Don Rozenberg, and Don Senzig describe the ability of a processor to execute instructions in an out-of-order manner and cite the CDC 6600 and the S/360 Model 91 as other examples. (The report was originally drafted in 1966 as part of the ACS project.) From their conclusion:

In this paper we have described a dynamic scheduling mechanism for providing a capability which enables the execution of instructions to be initiated out-of-sequence. In addition, the mechanism is capable of controlling the simultaneous initiation of two or more instructions.
[p. 14]

Jim Smith published "Dynamic Instruction Scheduling and the Astronautics ZS-1," in IEEE Computer, July 1989, pp. 21-35. He also describes dynamic instruction scheduling as out-of-order processing.

Many features of the pioneering CDC 6600 have found their way into modern pipelined processors. One noteworthy exception is the reordering of instructions at runtime, or dynamic instruction scheduling. The CDC 6600 scoreboard allowed hardware to reorder instruction execution, and the memory system stunt box allowed reordering of some memory references as well. Another innovative computer of considerable historical interest, the IBM 360/91, used dynamic scheduling methods even more extensive than the CDC 6600.
[p. 21, emphasis in original]

Look-Ahead

Like the term superscalar, the term look-ahead has different definitions depending on the author. Look-ahead in Stretch was a well-known and crucial part of its organization. However, when Robert Keller wrote the article entitled "Look-Ahead Processors" for ACM Computing Surveys in December 1975, he described what we would today call "out-of-order" processors and highlighted the CDC 6600 and IBM S/360 Model 91. Keller made no mention of Stretch except by listing the Buchholz book in the supplementary references section.

The term look-ahead derives from a class of schemes in which programs for the processor are specified in a conventional, serial manner; however, the processor can look ahead during execution and execute instructions out of sequence, provided no logical inconsistencies arise as a result of doing so. The advantage of look-ahead is that several instructions can be executed concurrently, assuming the processor has sufficient capabilities. Designs of specific look-ahead processors have been presented in [AST, Th, To].
...
It is the task of the look-ahead processor to determine which instructions can be executed concurrently without changing the semantics.
[pp. 177-178; AST and To are the 1967 M91 papers, Th is the Thornton book on the 6600]

Multiple Decoding

Multiple instruction decoding occurs when a processor examines multiple instructions at one time and checks for resource and data dependencies that would prevent multiple instruction issue. Multiple decoding was part of the Floating Point Processor design for Project Y; the August 1965 Arden House presentation on Project Y included this description:

FPP Features
...
Multiple decoding
...
More than one instruction can be executed per machine cycle. - Average ~ 1.5
[pp. 5-6]

Herb Schorr described the ACS, the continuation of Project Y, in this manner in his 1968 IBM Programming Symposium paper:

Multiple decode is a function new to the ACS-1 CPU. By multiple decoding it is possible in every cycle to initiate the execution of three index type operations and three of the next eight arithmetic or logical type instructions.
[p. 429]

Peter Capek and Bruce Shriver interviewed John Cocke in 1999 regarding his career and asked about multiple decode ("Just Curious: An Interview with John Cocke," IEEE Computer, November 1999):

Computer: One of ACS's most important concepts is the decoding and issuing of multiple instructions per cycle. How did you arrive at this idea?
Cocke: I credit Gene Amdahl with that idea. He wrote a paper that said the fastest single-instruction-counter machine has an upper bound on its performance. I wanted to make a faster machine. So we looked at his paper, which said you can only decode and issue one instruction per cycle, and we decided to get around that limitation.
[p. 37]
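
As a rough illustration of the check described at the start of this section, the following Python sketch examines a window of already-fetched instructions and returns the leading group that could be issued in the same cycle. It is a deliberately simplified, in-order model of my own and is not the Project Y or ACS mechanism: the names Instr and issue_group, the register names, and the per-cycle table of free functional units are all assumptions made for the example, and write-after-read hazards are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction used only for this illustration."""
    unit: str     # class of functional unit required, e.g. "add" or "mul"
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def issue_group(window, free_units):
    """Return the leading instructions of `window` that can issue together.

    Scanning stops at the first instruction that has a data dependency on an
    earlier instruction in the group or that finds no free functional unit.
    """
    available = dict(free_units)   # copy so the caller's table is untouched
    written = set()                # destinations of instructions in the group
    group = []
    for instr in window:
        read_after_write = any(src in written for src in instr.srcs)
        write_after_write = instr.dest in written
        no_unit = available.get(instr.unit, 0) == 0
        if read_after_write or write_after_write or no_unit:
            break                  # in-order model: nothing later may issue
        available[instr.unit] -= 1
        written.add(instr.dest)
        group.append(instr)
    return group


if __name__ == "__main__":
    window = [
        Instr("add", "r1", ("r2", "r3")),
        Instr("mul", "r4", ("r5", "r6")),
        Instr("add", "r7", ("r1", "r4")),   # needs r1 and r4 from above
    ]
    group = issue_group(window, {"add": 2, "mul": 1})
    print(len(group), "instructions can issue this cycle")
```

In this example the first two instructions are independent and can issue together, while the third must wait because it reads the results of the other two. A design like the ACS, which could select from among the next eight arithmetic or logical instructions, would continue scanning past a blocked instruction rather than stopping at it.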

Scalar pipelined processor

A scalar pipelined processor differs from a superscalar processor in that while it is also designed to accept a single, sequential instruction stream, it will overlap the fetch, decoding, execution, and write back of separate instructions from that stream, with at most one instruction occupying each stage. Thus a scalar pipeline can approach but not exceed an instruction throughput rate of one instruction per cycle.
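
The bound of one instruction per cycle can be seen with a little arithmetic. The Python sketch below assumes an idealized pipeline with no stalls, in which the first instruction needs one cycle per stage and each later instruction completes one cycle after its predecessor; the function name and its parameters are my own and are chosen only for illustration.

```python
from fractions import Fraction


def scalar_pipeline_ipc(num_instructions, num_stages):
    """Instructions per cycle for an idealized, stall-free scalar pipeline."""
    if num_instructions < 1 or num_stages < 1:
        raise ValueError("need at least one instruction and one stage")
    # The first instruction completes after num_stages cycles; every later
    # instruction completes one cycle after the one before it.
    total_cycles = num_stages + num_instructions - 1
    return Fraction(num_instructions, total_cycles)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        ipc = scalar_pipeline_ipc(n, num_stages=4)
        print(f"{n} instructions, 4 stages: IPC = {float(ipc):.3f}")
```

Under these assumptions the throughput is N / (k + N - 1) for N instructions and k stages, which grows toward one as N increases but never exceeds it.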

VLIW processor

A VLIW (very long instruction word) processor differs from a superscalar processor in that the VLIW processor is designed to execute one pre-packaged, multi-operation instruction per clock cycle. While each operation can appear to be a traditional, sequential instruction, a compiler or other equivalent VLIW-instruction-preparation mechanism has pre-packaged the operations together into the single VLIW instruction. Thus, like a superscalar processor, a VLIW processor can potentially achieve a throughput of multiple operations executed per cycle. However, unlike a superscalar processor, a VLIW processor does not have to discover the groups of operations that can be processed in parallel. Because of the prepackaging of these operations, the "instruction" throughput rate is not more than one VLIW instruction per cycle.
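
The distinction between instruction throughput and operation throughput can be made concrete with a small Python sketch. It assumes one VLIW instruction is executed per cycle with no stalls, represents each instruction as a tuple of operation slots, and uses None for an empty slot; the function name and the example operations are hypothetical.

```python
def vliw_throughput(bundles):
    """Return (instructions per cycle, operations per cycle) for a program.

    Each element of `bundles` is one pre-packaged VLIW instruction, given as
    a tuple of operation slots; None marks a slot left empty by the compiler.
    """
    if not bundles:
        raise ValueError("need at least one VLIW instruction")
    cycles = len(bundles)  # assumed: one VLIW instruction per cycle
    operations = sum(1 for bundle in bundles for op in bundle if op is not None)
    return len(bundles) / cycles, operations / cycles


if __name__ == "__main__":
    program = [
        ("add", "mul", None),
        ("load", "add", "store"),
    ]
    instr_rate, op_rate = vliw_throughput(program)
    print(f"instructions per cycle: {instr_rate}, operations per cycle: {op_rate}")
```

Here two VLIW instructions carry five operations, so the operation rate is 2.5 per cycle while the instruction rate stays at one per cycle.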