IBM Advanced Computing Systems -- Organization sections

Mark Smotherman
last updated January 2, 2017

Logical Organization of the MPM

The design included a powerful highly-concurrent CPU, new channels, a storage hierarchy with relocation and overlay hardware, a new maintenance architecture and new software.
For the mid 70s, then, it appeared appropriate to seek design principles that could achieve a CPU performance of 1,000 x IBM 7090 performance. CPU performance can be characterized as a function of circuit speed x number of circuits x architecture factor.... It was assumed that circuit speed would improve by an order of magnitude over the 7090 so that the problem faced was to increase the number of circuits x architecture factor by two orders of magnitude over the 7090. This can only be obtained by employing some form of parallel execution -- either: (1) extensions of the look-ahead principles of the IBM 7030 (Stretch), 360/91, 360/195; (2) vector or array processors; or (3) multiple instruction counters of some type. Use of the last type of parallelism on a single problem involved the solution of very difficult software problems. Vector and array processors were thought to be too specialized for effective use in general purpose areas. Therefore, attention quickly focused upon a single instruction counter machine with hardware controlled parallelism and an unconventional memory organization.
-- Herb Schorr, "Design principles for a high-performance system" (1971) (emphasis in original)

ACS built upon the foundations of Project Y work at Watson Lab. The design for Project Y presented at the Arden House in late summer 1965 included:

FPP (floating-point processor)
- 10 nsec cycle time
- 64,000 word instruction memory, 4-way interleave
- 250,000 word data memory, 16-way interleave, 6-8 cycle access time
- single instruction counter
- 32 arithmetic registers and 31 index registers
- pipelined arithmetic and logic units
- multiple decoding and transfer anticipation
- concurrent execution among three unit types: (1) arithmetic and logical, (2) indexing and load/store, and (3) transfers
- instruction scan mechanism to separate the three instruction type streams and pull out transfers in advance
- average of 1.5 instructions executed per cycle within the FPP
- queued storage requests
UP (universal processor) for system control, compilation, file maintenance, etc.
2,000,000 word bulk store, 64-way interleave, 25 cycle access time
I/O processor for channel control and interface to low-speed devices
head-per-track disks

The idea of a universal processor and a separate floating-point processor had its origins in some of the design proposals for Stretch, but was discarded in favor of a single processor. The central processor in ACS was called the Main Processing Module (MPM). As in the Stretch, it was expected that the index part of the processor would run ahead of the arithmetic part, and therefore loads would be started early enough to hide the memory latency. (This general technique would acquire the name "decoupled access-execute architecture" from Jim Smith in 1982; in his research and subsequent work on the Astronautics ZS-1, FIFO buffers in between the two units would take the place of the backup register scheme.)

A block diagram of the ACS-1 MPM:

As discussed below, the instruction buffer and dispatch registers are contained in the Instruction Sequence & Fetch block, the contender stacks used for dynamic instruction issue are contained in the Decode blocks, and the functional units are contained in the Index Unit (X unit) and the Arithmetic Unit (A unit) blocks.

The ACS-1 MPM organization went through several revisions, including changes in the instruction issue rates. The information below is derived from the Arden House presentations of Project Y, a January 1967 presentation of ACS-1 to the IBM Science Advisory Board, an April 1967 presentation of ACS-1 to Dave Woods and Jack Worlton of Los Alamos, a November 1967 presentation of ACS-1, Herb Schorr's 1971 paper of the ACS-1 design circa spring of 1968, Ed Sussenguth's September 1968 description of the ACS/360 design, and a 1972 IBM TDB document on the ACS-1 instruction sequencing control.

Instruction Sequence and Fetch

As part of the ACS effort, Ed Sussenguth invented a branch history table for prefetching instructions; see US Patent 3,559,183, "Instruction sequence control," filed February 1968, and issued January 1971. A detailed description of the instruction prefetching and dispatching design for the circa 1968 ACS-1 design can be found in Bruce Beebe, John Cocke, John Earle, Chuck Holleran, Merle Homan, Russ Robelen, and Ed Sussenguth, "Instruction Sequencing Control", IBM Technical Disclosure Bulletin, vol. 14, no. 12, May 1972, pp. 3599-3611.

presentation to the IBM Scientific Advisory Board in January 1967:
ACS-1 presentation in November 1967:
1972 TDB document:

The instruction buffer has 12 quadword entries (i.e., 12 x 192 bits). Since ACS-1 instructions were either a halfword (24 bits) or word (48 bits) in length, up to 96 halfword instructions could be held in the instruction buffer. (Word-length instructions are allowed to span across two instruction buffer entries.) A major goal of the instruction buffer was to capture and hold small loop bodies, thus allowing it to act as an instruction cache. This is similar to the the function of the eight-word (8 x 60-bit) instruction stack in the CDC 6600 and the eight-doubleword (8 x 64-bit) instruction buffer in the S/360 Model 91.

The prefetch sequence control registers hold the quadword address results of the eight most recent transfers of control from an exit instruction within one quadword to a branch target address at the start of or within another quadword. (The ACS-1 memory was halfword addressable, and the three least-significant bits of the exit instruction address and the target address are dropped in order to obtain the addresses of the respective instruction quadwords.) After each fetch of an instruction buffer entry, the next fetch is made according to the transfer of control mapping held in the prefetch sequence control registers, if one exists, or sequentially.

Instruction buffer entries can be sent independently to the A-unit predecode and the X-unit predecode. The predecoders mark the instruction boundaries of up to eight instructions per cycle and mark whether the instructions are for the A unit, X unit, or both. A decoding order table keeps track of the instruction buffer entries sent to the two predecoders, and an instruction buffer entry is overwritten only when it has been sent to both predecoders (or found to be on a mispredicted branch path).

After predecoding, instruction blocks are placed into dispatch registers. There are two quadword registers for each unit. Instructions are then dispatched from the dispatch registers into the contender stacks for decode and issue. Although not described in the 1972 TDB document, April 1967 notes made by Harwood Kolsky and the November 1967 diagram shown above indicate that three X instructions and four A instructions can be dispatched each cycle (up from two and two in the January 1967 diagram).

Branch instructions are dispatched to the X unit only. Exit instructions are not dispatched to either unit but are instead held in the dispatch registers of each unit to prevent further dispatch until the branch instructions are resolved. When the branch sequencing is resolved¹, the exit instruction is removed from the X-unit dispatch register and dispatch can resume either sequentially after the untaken exit or at the branch target address for a taken exit. The decoding order table is arranged to indicate predicted transfers of control so that the second dispatch register is loaded with the either the next sequential quadword or the quadword containing the branch target address. If the prediction is not correct, recovery actions update the prefetch control registers, the decoding order table, and the dispatch registers.

The TDB document describes a six-entry exit history table that is used to coordinate A-unit handling of exit instructions with the resolution of exit instructions by the X unit. When encountering an exit instruction, the A unit checks the top entry in the exit history table. When that entry is marked as resolved, the A unit can remove that entry from the exit history table, remove the exit insruction from the A unit dispatch registers, and proceed according to the resolved path. The overall design appears to allow the X unit to get ahead of the A unit by up to 80 instructions (i.e., ten instruction buffer entries) or up to six exit instructions. The A unit can get ahead of the X unit only when no exit instructions have been encountered.

¹ The TDB document discusses the use of a single EBA (effective branch address) register and a branch state of unresolved, successful, or unsuccessful. The EBA is loaded and marked valid by the first successful branch instruction. Subsequent branch instructions are then ignored until an exit instruction is reached, and the exit instruction then uses the EBA to transfer control if the branch state is successful and the EBA is valid. The successful exit also resets the branch status to unsuccessful. If the branch state is unsuccessful when an exit instruction is reached, the exit instruction is ignored. It is unclear how the unresolved branch state is maintained in the case of two or more unresolved branch instructions when the first of which resolves as unsuccessful. It is also unclear how the approach prevents or handles the case of out-of-order completion of two or more successful branch instructions. (The earlier patent uses an in-order single-issue embodiment and thus does not contemplate out-of-order multiple-issue cases.) For simplicity, the branch handling approach may have serialized the dispatch and resolution of branch instructions. However, this simple approach would appear to limit performance for code sequences with multiple branch instructions occuring between exit instructions without a sufficient number of independent instructions to hide the latency of the branch handling. This supports Gene Amdahl's critique of ACS-1 to Harwood Kolsky in December 1967 in which Amdahl claimed that the minimum time for ACS-1 to prepare to branch was seven cycles and thus needed 14-21 instructions between transfers of control to maintain performance.

Instruction Decode and Issue

John Cocke had urged Project Y team members to investigate the "open question" of implementing instruction lookahead and out-of-order issue of multiple instructions each cycle. Cocke arrived at the idea of multiple decoding for Project Y in response to an early 1960s IBM internal report written by Gene Amdahl on how fast a single-instruction-counter machine could execute (see Sidebar). In this report, Amdahl postulated one instruction decode per cycle as one of the fundamental limits on obtainable performance. Cocke wanted to test each supposed fundamental limitation and decided that multiple decoding was possible.

Previous to Project Y, there had been a proposal within the Stretch project in 1956 by Ernie Stephens, Bill Stringfellow, and Vaughn Winkler to consider decoding up to six instructions per cycle as an advanced version of the "look ahead" facility for Stretch. Part of their motivation was to provide a control unit that would be able to string together several of the Stretch one-address instructions into the equivalent of a more complex multiple-address instruction. They use terms such as a "family" or group of interdependent instructions assembled into a single "thought".

The Project Y plans differed in that multiple independent instructions would be issued to the functional units each cycle. This required hardware to determine dependencies and interlocks to enforce sequential execution of dependent instructions.

The plans for instruction issuing went through several revisions during the project, as sketched below.

Arden House / Project Y presentation in August 1965:
- five-way superscalar - 2X, 2A, and one branch
- execute-ahead registers can be preloaded with instructions to execute during the time of a branch (today we might view these delay slot instructions, however, they had to be placed in special registers rather than appearing in the instruction stream after a delayed branch)
presentation to the IBM Scientific Advisory Board in January 1967:
- five-way superscalar - 2X, 2A, and one exit instruction
- centralized set of 24-instruction dispatch registers
- separate X and A contender stacks, six entries in each with issue from the top two entries in each buffer, possibly out-of-order
- branch/exit/skip scheme
  - exit instructions are held in the dispatch registers until branch resolution and never dispatched to the contender stacks
  - skipped instructions are dispatched but are deleted from the contender stacks without issue
ACS-1 presentation in November 1967:
- seven-way superscalar - 3X, 3A, and one exit instruction
- separate sets of dispatch registers for X and A units
- 3-entry X contender stack (see X-unit diagram below), issue up to three instructions per cycle, possibly out-of-order
- 8-entry A contender stack (see A-unit diagram below), issue up to three instructions per cycle, possibly out-of-order
- branch/exit/skip scheme (as in January 1967 with provision of exit history between the sets of dispatch registers so that resolution of exit instructions by the X unit are visible to the A unit)
ACS/360 in September 1968:
- three-way superscalar - 1F, 1G, and one branch in I-unit
- 4-entry G contender stack (G = general, takes role of X unit)
- 6-entry F contender stack (F = float, takes role of A unit)
- speculative instruction prefetch past up to two conditional branches (two conditional bits added to each instruction)

Sidebar: I haven't been able to track down the report mentioned by John Cocke, and Gene Amdahl told me that he didn't remember it. However, it may have been contained in an analysis for the Project X design (a.k.a. NPL 604 scientific computer and 604X). For example, see the September 16, 1963, memo from T.C. Chen to J.C. Logue on "604X Design": "Major credit of the design should go to Dr. G. Amdahl. ... The arithmetic unit is designed to allow several operations (about 4 maximum) to occur at once. ... With more extra hardware it is possible to perform operations out of sequence. The limit is clearly 1 instruction per machine cycle."

Note also the discussion of the practical limit of one instruction per cycle in the 1967 IBM JRD articles on the Model 91. These articles were published late in the ACS project lifetime, but I believe they indicate the type of thinking and goal setting in the other parts of the company circa 1963-1965.

"It is easily shown that issuing at a rate in excess of one instruction per cycle leads to a rapid expansion of hardware and complexity. (Variable-length instructions, adjacent instruction interdependencies, and storage requirements are prime factors involved.) A one-cycle maximum rate is thereby established, but it too presents difficulties." p. 15, D.W. Anderson, F.J. Sparacio, and R.M. Tomasulo, "The IBM System/360 Model 91: Machine Philosophy and Instruction Handling," IBM JRD, vol. 11, no. 1, 1967 (submitted for publication in September 1965).
"The concept of processing one instruction per cycle has now become the basis for the entire system..." p. 3, M.J. Flynn and P.R. Low, "The IBM System/360 Model 91: Some Remarks on System Development." IBM JRD, vol. 11, no. 1, 1967 (submitted for publication in January 1966.)
"It has become evident that the target design goal of the execution of one instruction per cycle was a prudent one: targets much in excess of this figure would require exponentially more hardware. We infer from this that, in the future, order-of-magnitude improvements in computer performance due to systems organization alone will be unlikely, at least where emphasis on simplex instruction processing (single-line instruction processing against multi-processing)." p. 5, ibid.

In his 1966 paper, Mike Flynn states that the "bottleneck is the decoding of one instruction in a unit time,... If one were to try to extend this organization by taking two, three, or n different instructions in the same decode cycle, and no limitations were placed on instruction interdependence, the number of instruction types to be classified would be increased by the combinatorial amount ... and the decoding mechanism would be correspondingly increased in complexity." p. 1907, M.J. Flynn, "Very high-speed computing systems," Proceedings of IEEE, vol. 54, no. 12, Dec. 1966. Flynn and his student Tjaden also published a famous ILP limit study in 1970 that reported an average of 1.86 instructions per cycle for a model of a multiple-issue 7094.

Dynamic Instruction Scheduling

A generalized scheme for parallel decoding of multiple instructions and out-of-order issue from the contender stacks was proposed by Lynn Conway in late 1965. Conway, who joined the project in 1964 after finishing graduate work at Columbia, was assigned with Don Rozenberg to develop the machine simulator. Approaching the problem of instruction decoding and issuing from a software and systems point of view, she hit upon a general scheme for multiple issue that, due to its regularized structure, could be implemented within the maximum number of gate levels of the ACS machine cycle. Conway speculates that preoccupation with assumed gate-level limitations may have constrained design options considered by earlier designers to less ambitious out-of-order schemes. Her multiple-decode and multiple-issue scheme was developed in late 1965 and was accepted by the ACS architecture group in 1966 It appears to be the first implementation of the generalized dynamic instruction scheduling (i.e., multiple-decode, multiple-issue) that is now common in modern processors. See L. Conway, B. Randell, D. Rozenberg, and D. Senzig, "Dynamic Instruction Scheduling," February 23, 1966. (pdf)

The ACS/360 approach to instruction issue for the floating-point unit appears to be described in US Patent 3,718,912, entitled "Instruction execution unit," (Leo Hasbrouck, Bill Madden, Robert Rew, Ed Sussenguth, and John Wierzbicki; filed December 1970 and issued February 1973.) This patent describes the ability to issue a floating-point instruction out-of-order from a buffer to one or more functional units. A major point of the design is to rename a floating-point register whenever a load occurs. (This renaming scheme replaces the ACS-1 backup register design and was also used twenty years later in the IBM RS/6000.)

Functional Units

In January 1967, there were to be five functional units for index operations: two adders, integer multiply and divide, shift, and compare (test), as shown below on the left. (The "Instr Buffer" shown in this diagram is the X-unit contender stack.) In November 1967, there were to be six functional units for index operations: three adders (including support for branch address add and logic and effective address addition), integer multiply and divide, shift, and compare, as shown below in the middle. By 1968, this would be adjusted to: two effective address adders, branch address add and logic, multiply and add, shift, and compare, as shown on the right.

X Unit in January 1967 X Unit in November 1967 X Unit in 1968
In January 1967, there were to be six functional units for arithmetic operations: two adders, multiply, divide, logic and shift, as shown below on the left. (The "Instr Buffer" shown in this diagram is the A-unit contender stack.) In November 1967, there were to be seven functional units for arithmetic operations: two adders, multiply, divide, logic, shift, and compare, as shown below in the middle. By 1968, this would be adjusted to compare, add (DP/SP/integer), DP add/integer multiply/divide(DP/SP/integer), logic, SP add, shift, and SP multiply, as shown on the right.

A Unit in January 1967 A Unit in November 1967 A Unit in 1968

Cycle time diagram of instruction issue from April 1967 notes of Harwood Kolsky:

cycle time diagram

Kolsky's notes indicate that both single-precision floating-point addition and single-precision multiplication were fully-pipelined with three-cycle execution. Double-precision multiplication required six cycles of execution. For double precision, the multiplier was partially pipelined and could start a new double-precision multiply operation every three cycles. Single-precision division required 10 cycles, and double-precision division required 19 cycles. Logic and shift operations had single-cycle execution. (The November 1967 presentation indicates that double-precision addition was fully-pipelined with four-cycle execution.)

The A-unit design provided 31 "backup registers", each one being paired with a corresponding arithmetic register. This provided a form of register renaming for overlapped execution of different loop iterations, so that a load or writeback could occur to the backup register whenever a dependency on the previous register value was still outstanding. For more information on the backup register scheme, see the two 1971 IBM TDB entries, both entitled "System for interlocking between asynchronously operating indexing and arithmetic units":

Merle Homan and Ed Sussenguth, IBM Technical Disclosure Bulletin, vol. 14, no. 2, July 1971, pp. 656-659.
Charlie Freiman, Merle Homan, Gerry Paul, and C.C. Wang, IBM Technical Disclosure Bulletin, vol. 14, no. 2, July 1971, pp. 660-662.

A similar backup-register scheme for the X unit is noted in the January 1967 presentation, but only A-unit backup registers are listed in Schorr's 1971 paper.

Memory Hierarchy

Cache memory (called "high speed storage" (HSS)) was introduced within IBM in 1965, leading to the announcement of the S/360 Model 85 in 1968.

Initial plans for ACS called for the compiler to generate code to manage the transfer of data and programs from a large main memory to a small amount of high-speed storage. Herb Schorr and Dick Arnold reviewed the progress of this approach, and they concluded that the problem was intractable for the compiler and system software. They then designed a hardware-managed two-way set associative cache. The cache was 32K to 64K-word in total size with 32 words per line. The cache contained both instructions and data and used LRU replacement. A cache hit required 5 cycles, and a cache miss required 30-40 cycles. (Note: sources vary about the cache hit timing; some say 2 cycles, others say 4 cycles.) I/O was performed to and from the cache.

Main memory was pipelined and 16-way interleaved. A store buffer was proposed and provided address-collision interlock for load bypass and load forwarding. The storage control unit was called the bus and lining module (BLM), and it provided for virtual address translation and memory protection. Virtual memory pages were 64 words in size, and the BLM implemented a hashed page table that could support one page per entry or blocks of 4, 16, or 64 contiguous pages per entry. See US Patent 3,675,215, Dick Arnold, Phil Dauber, and Ed Sussenguth, "Pseudo-random code implemented variable block size storage mapping device and method," 1972.

Interrupts

In his 1971 paper, Schorr states that up to 50 instructions could be in some stage of execution at any given time. Thus, interrupts and exceptions would be costly. Most external interrupts were converted by the hardware into specially-marked branches to the appropriate interrupt handler routines and then inserted into the instruction stream to allow previously-issued instructions to complete. (This was called "soft-interrupts".)

To handle hard interrupts, program execution was immediately stopped and all important internal processor state (including register contents, contender stacks, and scheduling matrices) was saved to memory. A special privileged instruction, SCAN, is provided for the operating system to reload the internal data and restart the processor. (On the later CDC STAR-100 and Cyber 200, this approach is called an "invisible exchange package".)

Estimates were that soft interrupts would have a variable response latency, from 0.1 up to 15 microseconds depending on the instruction mix in flight at the time of the special branch insertion, while hard interrupts would have a fixed response latency of 2 microseconds.

The instruction prefetch controls were designed to defer instruction page faults until the sequencing was resolved.

Maintenance

Simulators were to be part of the system maintenance. The plan was to build an unchecked processor with the operating system running periodic diagnostics. If one of the diagnostics failed, the operating system would step through the failing diagnostic program and compare the latch scan-out results with those from a processor simulator running on another machine. Once a difference was found, the maintenance system would search for the logic trees associated with the incorrect latches. If there were no board crossings, the errant board could be replaced. Otherwise, a customer engineer would scope the board crossings and determine whether the sending board or the receiving board was faulty.

This diagram shows an ACS-1 system with an external maintenance computer:

The Instruction Sequence and Fetch unit is referred to as Y-unit in the diagram above.

Navigation within IBM ACS pages:

Back to first ACS page

Next sections: Circuits / Chips and Packaging / Cooling