IBM Advanced Computing Systems -- Compiler / Performance sections

Mark Smotherman
last updated December 30, 2016

Optimizing Compiler

From the beginning of the project it was recognized that the only way to realistically realize the performance goals and make them accessible to the user was to design the compiler and the computer at the same time. In this way features would not be put in the hardware that the software could not use or which the software could support more effectively.
In order to isolate the effects of changes in the CPU design and to modularize the experiment, the ACS compiler classified the optimizations as machine independent or machine dependent. ... With the exception of [instruction] scheduling, most of the other analyses and transformations had appeared in previous IBM compilers ... [but in ACS] the techniques were generalized and isolated as much as possible from source and machine constructs.
Out of this project, therefore, came numerous advances in compiler technology, and, more importantly, the foundations of the theory of program analysis and optimization.
-- Fran Allen, "The history of language processor technology in IBM" (1981)

An aggressive optimizing compiler was an integral part of the Project Y and ACS plans, with machine-independent and machine-dependent optimizations as well as profile-guided optimization (PGO). The compiler was planned to include front-ends for PL/I and an extended FORTRAN.

The FORTRAN extensions included many of the OS/360 FORTRAN IV extensions, as well as:

non-privileged machine operations, e.g., X.AU.Y would provide the FORTAN programmer with access to the unnormalized add operation
bit strings and packed words
pointers
recursion
macros

The compiler included the following techniques.

during parsing normal operator precedence was modified to produce more parallel code, e.g., an unparenthesized expression A+B+C+D was parsed as (A+B)+(C+D) rather than a left-to-right scan ((A+B)+C)+D that yields a sequential set of additions
subroutine inlining and partial inlining
machine-independent optimizations
- basic block analysis and optimization
  - common subexpression elimination (CSE) within basic blocks
  - constant propagation for singly-defined variables
- control flow analysis using intervals (see the paper by Fran Allen in the 1970 SIGPLAN compiler conference)
- global analysis and optimization
  - global code motion; hoisting
  - common subexpression elimination (CSE) across basic blocks (see the paper by John Cocke in the 1970 SIGPLAN compiler conference)
  - chain uses and definitions across basic blocks
  - constant propagation across basic blocks
- strength reduction
- dead variable and expression elimination
machine-dependent optimizations
- skip and branch analysis, which merges basic blocks to improve the instruction scheduling and to eliminate branches
- blocks are reordered to reduce the number of taken branches
- instruction scheduling
- load and store motion
- register allocation
code generation

Fran Allen published some of her work on the compiler in 1966 in an article in the Annual Review of Automatic Programming. John Cocke later worked with Jacob Schwartz, and they wrote a famous textbook draft entitled "Programming languages and their compilers". Jim Beatty worked on optimization techniques and register allocation and published some of his work in the Journal of the ACM and the book Design and Optimization of Compilers in 1972 and in the IBM Journal of Research and Development in 1974.

ACS-1 Operating System

The ACS-1 was designed in a period of increasing software expectations on the part of the high-performance user community. This was unlike earlier in the 1960s when initial users of the CDC 6600 eagerly accepted a role in developing the software infrastructure for the system.¹

A new kernel-based operating system was planned for ACS-1 with much of the work contracted to the Computer Sciences Corporation. The OS design envisioned a protected-mode kernel, called the "nucleus", and a user-mode set of additional services, called the "extended nucleus". It was anticipated that many of the customers might want to change parts of the extended nucleus, so the nucleus was defined to be the essential OS requirements across all customers.

The nucleus was to include task management, interrupt handling, explicit memory hierarchy management and protection, I/O processor interface, and file handling and protection. The extended nucleus was to include job scheduling, an I/O control system, and data management (e.g., access methods, blocking, and buffering). In some instances, documents classified the language processors (e.g., compilers, loader, link editor, library routines, command language interpreter) as part of the extended nucleus services.

An interesting design decision was that demand paging was to be the responsibility of the extended nucleus rather than the nucleus. Extended nucleus services like demand paging would have to invoke the protected-mode nucleus services in order to make changes in the memory hierarchy placement of objects. (A similar design approach was taken some twenty years later with the Mach kernel at CMU; however, the IPC calls required to implement external pagers in Mach typically resulted in poor performance.)

ACS-1 Software Design Schedule

ACS-1 hardware was not anticipated to be available to coders until 1970, so most of the coding was to be done using the ACS simulator running on the project's S/360 Model 75.

The ACS-1 architecture team was planning to be extensively involved with specifying and coding the nucleus, and then specifying and monitoring the progress and quality of the extended nucleus and language processors. The plans called for most of the OS and language processing modules to be written in the ACS extended FORTRAN.

Estimates in 1967 called for 21K lines of code for the nucleus, over 70K lines for the extended nucleus, and 60K lines for the language processors. Man-years for the software were estimated at 45 for the ACS architecture team members and 200 for CSC contract employees, with a labor cost estimate of $11 million.

As of November 1967, the software schedule was:

1Q68 - assembler and debugging aids running on S/360
1Q69 - checkout compiler on S/360
3Q69 - nucleus
1Q71 - extended nucleus
2H72 - full set of language processors

Performance Analysis

-- Performance goal from Schorr's 1968 paper

It would have accomplished four to five instructions per cycle on linear algebra-type problems. But, because of cache latency and the inability to almost always anticipate the correct branch flow, we would not have achieved near that rate on more general problems.
-- John Cocke, Turing Award Lecture (CACM, March 1988)

Special emphasis was given to several Livermore and Los Alamos benchmark kernels by both the design and compiler groups. Performance comparisons were made with the IBM 7090, CDC 6600, and IBM System/360 Model 91 on Lagrangian Hydrodynamics Calculation (LHC) and Neutron Diffusion (ND), and these were presented in January 1967 to the IBM Science Advisory Board. At this point in time, the issue rates were projected to be 2-out-of-6 for both the index and arithmetic units.

1967 Performance Comparisons IBM 7090 CDC 6600 S/360 M91 ACS-1

Relative performance on LHC 1 50 110 2500

Relative performance on ND 1 21 72 1286

1967 Performance Comparisons	IBM 7090	CDC 6600	S/360 M91	ACS-1
Relative performance on LHC	1	50	110	2500
Relative performance on ND	1	21	72	1286

The following table gives statistics projected in 1967 for the ND kernel.

1967 Detailed Performance
Comparison using ND Kernel IBM 7090 CDC 6600 S/360 M91 ACS-1

Arithmetic instructions 26 38 24 32

Index instructions 2 5 3 6

Load/store instructions 61 29 31 14

Branch instructions 11 9 11 7

Total instructions 78 81 63 59

Storage references 139 59 57 14

Total cycles 295 296 148 45

Cycles in critical path 110 76 30 37

Percentage in critical path 37% 26% 20% 82%

Sustained insts. per cycle (IPC) 0.26 0.27 0.4 1.3

Relative performance 1 21 72 1286

Performance limiter sequential machine inst. fetch branches arithmetic

1967 Detailed Performance Comparison using ND Kernel	IBM 7090	CDC 6600	S/360 M91	ACS-1
Arithmetic instructions	26	38	24	32
Index instructions	2	5	3	6
Load/store instructions	61	29	31	14
Branch instructions	11	9	11	7
Total instructions	78	81	63	59
Storage references	139	59	57	14
Total cycles	295	296	148	45
Cycles in critical path	110	76	30	37
Percentage in critical path	37%	26%	20%	82%
Sustained insts. per cycle (IPC)	0.26	0.27	0.4	1.3
Relative performance	1	21	72	1286
Performance limiter	sequential machine	inst. fetch	branches	arithmetic

A 1967 summary of MFLOPS projections was split into two cases, independent operations and sequentially-dependent operations, and reveals the major impact of overlap in the high-end machines.

1967 MFLOPS Comparison IBM 7090 CDC 6600 S/360 M91 ACS-1

MFLOPS (independent) .046 4 11 160

MFLOPS (seq. dependent) .046 1.3 6.7 27

Overlap factor 1 3 1.6 5.9

1967 MFLOPS Comparison	IBM 7090	CDC 6600	S/360 M91	ACS-1
MFLOPS (independent)	.046	4	11	160
MFLOPS (seq. dependent)	.046	1.3	6.7	27
Overlap factor	1	3	1.6	5.9

By 1968, performance projections were significantly higher with the 3-of-8 Index and 3-of-8 Arithmetic issue rates. IPC on the ND kernel had increased to 1.8 and relative performance to 1608.

1968 Performance Comparison
using ND Kernel IBM 7090 CDC 6600 S/360 M91 ACS-1

Sustained insts. per cycle (IPC) 0.26 0.27 0.4 1.8

Relative performance 1 21 72 1608

1968 Performance Comparison using ND Kernel	IBM 7090	CDC 6600	S/360 M91	ACS-1
Sustained insts. per cycle (IPC)	0.26	0.27	0.4	1.8
Relative performance	1	21	72	1608

An analysis of several machines was performed that normalized relative performance with respect to the circuit speed of each machine and the number of circuits in each machine. The result was called a relative "architectural performance" factor, although average memory access times (and thus the presence or absence of cache memory) affected the results. For example, note the effect of the M195 having cache memory whereas the M91 does not. (M195 also has a 10% faster clock than the M91: 54 nsec cycle time versus 60 nsec.) Based on this analysis, the superscalar issue, decoupled access and execute, branch avoidance techniques, and cache memory of the ACS-1 design gave a it a factor of 5 "architectural" improvement.

Design Comparison 7090 Stretch CDC 6600 S/360 M91 S/360 M195 ACS-1

Relative performance 1 3.5 23 61 96 1230

Performance normalized to circuit speed 1 3.5 11.5 14.6 23 98

Performance normalized to circuit speed and count 1 1.2 1.1 1.1 1.7 5.2

Design Comparison	7090	Stretch	CDC 6600	S/360 M91	S/360 M195	ACS-1
Relative performance	1	3.5	23	61	96	1230
Performance normalized to circuit speed	1	3.5	11.5	14.6	23	98
Performance normalized to circuit speed and count	1	1.2	1.1	1.1	1.7	5.2

Sidebar - Benchmark programs

LHC was part of the Livermore CORONET program. The kernel of CORONET became loop 18 in the Livermore Loops. The 1981 Fortran version is:

c
c*******************************************************************************
c***  KERNEL 18     2-D EXPLICIT HYDRODYNAMICS FRAGMENT
c*******************************************************************************
c
c
 1018           T= 0.003700d0
                S= 0.004100d0
               KN= 6
               JN= n
         DO 70  k= 2,KN
         DO 70  j= 2,JN
          ZA(j,k)= (ZP(j-1,k+1)+ZQ(j-1,k+1)-ZP(j-1,k)-ZQ(j-1,k))
     1            *(ZR(j,k)+ZR(j-1,k))/(ZM(j-1,k)+ZM(j-1,k+1))
          ZB(j,k)= (ZP(j-1,k)+ZQ(j-1,k)-ZP(j,k)-ZQ(j,k))
     1            *(ZR(j,k)+ZR(j,k-1))/(ZM(j,k)+ZM(j-1,k))
   70    CONTINUE
c
         DO 72  k= 2,KN
         DO 72  j= 2,JN
          ZU(j,k)= ZU(j,k)+S*(ZA(j,k)*(ZZ(j,k)-ZZ(j+1,k))
     1                    -ZA(j-1,k) *(ZZ(j,k)-ZZ(j-1,k))
     2                    -ZB(j,k)   *(ZZ(j,k)-ZZ(j,k-1))
     3                    +ZB(j,k+1) *(ZZ(j,k)-ZZ(j,k+1)))
          ZV(j,k)= ZV(j,k)+S*(ZA(j,k)*(ZR(j,k)-ZR(j+1,k))
     1                    -ZA(j-1,k) *(ZR(j,k)-ZR(j-1,k))
     2                    -ZB(j,k)   *(ZR(j,k)-ZR(j,k-1))
     3                    +ZB(j,k+1) *(ZR(j,k)-ZR(j,k+1)))
   72    CONTINUE
c
         DO 75  k= 2,KN
         DO 75  j= 2,JN
          ZR(j,k)= ZR(j,k)+T*ZU(j,k)
          ZZ(j,k)= ZZ(j,k)+T*ZV(j,k)
   75    CONTINUE

Donald MacKenzie briefly discusses CORONET in "The Influence of the Los Alamos and Livermore National Laboratories on the Development of Supercomputing," IEEE Annals of the History of Computing, Vol. 13, No. 2, April 1991, as well as the use of the kernel in specifying the Livermore workload as part of the design of the CDC Star-100 in the late 1960s. MacKenzie states that the CORONET kernel was "judged untypical of Livermore weapons code because it contained no data-dependent branches."

ND was part of the Los Alamos DTF code, which is a Fortran version of Bengt Carlson's difference-equation method to solve a linear, time-independent Boltzmann equation for particle transport.

¹ Boelie Elzen and Donald MacKenzie, in "The social limits of speed: The development and use of supercomputers," IEEE Annals of the History of Computing, vol. 16, no. 1, 1994, discuss how this change in expectations of supercomputer manufacturers additionally providing system software affected Cray Research, Inc., in the late 1970s and early 1980s.

Navigation within IBM ACS pages:

Back to first ACS page

Next section: end of ACS project