Culler-7

Mark Smotherman. Last updated February 1998

The Culler-7 was a decoupled access-execute (DAE) system from Culler Scientific Systems. Glen Culler was the chief architect and designer (and founder of the company). He was assisted in the design by Bob Pearson, John Richardson, Mike McCammon, and Dave Probert. (Glen Culler was also the original designer of the integer array processor, the AP120. This machine was purchased by Floating-Point Systems, an Oregon company selling floating-point accelerators to the oil industry in Texas. FPS re-engineered the AP120 into a floating-point version, the AP120B.)

Design work on the system started in January of 1983. Working versions of the Culler-7 were installed at a couple of beta sites, and a half dozen machines were sold to distributors in England and Japan.

The design is quite complicated, using a Harvard architecture (i.e., a separate program memory), an "A" machine to control program sequencing and data memory addressing and access, and a microcoded "X" machine for floating-point computations that could run in parallel with the A machine. The A machine had 33 registers, while the X machine had eight individual registers, augmented by two 4K-entry scratchpad memories (called "XYMEM"). There were in fact two copies of each scratchpad memory, thus allowing one to be in use by the current process while the contents of the other were destaged into memory for the previous process and then reloaded for the next process (cf. the design of the old MIT CTSS time-sharing system, which switched in a similar manner between two banks of user memory).
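The two-copy scratchpad arrangement can be pictured with a small sketch. This is only an illustration of the bank-swapping idea described above, not the Culler-7 hardware interface: the class name, the `saved` dictionary standing in for main memory, and the `switch` arguments are all hypothetical, and the save and reload that the hardware overlapped with execution are done sequentially here.

```python
class BankedScratchpad:
    """Two banks of one scratchpad: one in use, one being saved and reloaded."""

    def __init__(self, size=4096):
        self.size = size
        self.banks = [[0] * size, [0] * size]
        self.active = 0   # index of the bank the running process sees
        self.saved = {}   # hypothetical stand-in for main memory, keyed by process id

    def current(self):
        """Return the bank visible to the running process."""
        return self.banks[self.active]

    def switch(self, prev_pid, upcoming_pid):
        """Flip banks, then recycle the idle bank.

        prev_pid is the process being switched out; upcoming_pid is the
        process that will run after the one being switched in now.
        """
        idle = self.active
        self.active ^= 1  # the new process starts on the other bank at once
        # Destage the old contents for the previous process ...
        self.saved[prev_pid] = list(self.banks[idle])
        # ... then preload the idle bank for the upcoming process.
        restored = self.saved.get(upcoming_pid)
        self.banks[idle] = list(restored) if restored is not None else [0] * self.size
```

With two processes alternating, the process switched out is also the next one to run after the incoming process, so each finds its own data waiting in the idle bank when it is switched back in.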

Woody Lichtenstein, in describing the design, writes,

Multiple memories are important, because they help prevent computations from becoming memory bound. Multiple flexible address controllers are needed to help prevent computations from becoming address compute bound.

The design was also a multiprocessor, with one Kernel Processor (based on the Sun 2 using an MC68010 processor), and up to four User Processors. The main memory (built from SRAM) could be partitioned by page tables into a globally-accessible region and multiple UP-local regions. [and/or physically partitioned by different interconnect speeds?]

The program memory contained a sequence of X-machine instructions, sometimes paired and trailed with A-machine instructions. X-machine instructions were lookups into a control store of microcode routines; these routines were sequences of horizontal microcode words, which specified operations for the floating-point multiplier, floating-point adder, XYMEM, X-machine registers, and busses. Single-precision floating-point operations were single-cycle, and double-precision floating-point operations were two cycles. Many of the microcode routines were only one word long (e.g., start a floating multiply), but others were many words long (e.g., x_sin). X instructions could also start user-microcoded subroutines. Thus, the X instructions could be single- or multiple-cycle. The A instructions were all single-cycle.

X/A pairs were fetched/executed together when available, but a series of XXXX or AAAA instructions would result in single instruction fetches. A common sequence was XAA... in which the first X/A pair was fetched/executed simultaneously, and then the subsequent A instructions would be fetched/executed in an overlapping manner with the multiple-cycle X instruction.
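The fetch rule above can be sketched as a small grouping function. It is a simplified reading of the description, assuming that a pair is always an X instruction immediately followed by an A instruction; the function name and the string encoding of the instruction stream are illustrative only.

```python
def fetch_groups(stream):
    """Split a string of 'X' and 'A' instruction tags into fetch groups.

    Assumption: an X immediately followed by an A is fetched as one pair;
    every other instruction is fetched on its own.
    """
    groups = []
    i = 0
    while i < len(stream):
        if stream[i] not in "XA":
            raise ValueError(f"unknown instruction tag: {stream[i]!r}")
        if stream[i] == "X" and i + 1 < len(stream) and stream[i + 1] == "A":
            groups.append("XA")   # paired fetch
            i += 2
        else:
            groups.append(stream[i])  # single fetch
            i += 1
    return groups
```

Under this reading, the common sequence XAAA yields one paired fetch followed by two single A fetches, while XXXX and AAAA each yield four single fetches.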

The X and A machines transmitted data through buffers. The A-machine instructions placed the data needed by the X machine into a 3-word input FIFO and took X-machine results from a 1-word output buffer. The placement of the A instructions in the instruction stream was therefore not arbitrary; however, hardware interlock was provided. Thus, when the X machine needed data but found an empty FIFO, it stalled waiting for the appropriate A instruction to execute (e.g., the A instruction might have been delayed by a page fault). When the A machine encountered a full FIFO, it stalled until data was removed by microoperations controlled by the current X instruction. A similar interlock was provided for the output buffer.
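The interlock on the input FIFO can be illustrated with a cycle-by-cycle sketch. Only the 3-word depth comes from the description above; the operation names (`push`, `pop`, `nop`), the rule that both machines see the FIFO occupancy as of the start of the cycle, and the function name are assumptions made for the example. The 1-word output buffer is not modeled.

```python
from collections import deque

FIFO_DEPTH = 3  # depth of the A-to-X input FIFO, as given in the text


def run_interlocked(a_ops, x_ops):
    """Step two op lists in lockstep and count interlock stalls.

    a_ops holds 'push' or 'nop'; x_ops holds 'pop' or 'nop'.
    Returns (cycles, stalls) or raises RuntimeError if neither
    machine can ever make progress.
    """
    fifo = deque()
    a_pc = x_pc = 0
    stalls = {"A": 0, "X": 0}
    cycles = 0
    while a_pc < len(a_ops) or x_pc < len(x_ops):
        cycles += 1
        occupancy = len(fifo)  # assumed: both sides see the start-of-cycle state
        progressed = False
        if x_pc < len(x_ops):
            if x_ops[x_pc] == "pop" and occupancy == 0:
                stalls["X"] += 1  # X waits on an empty FIFO
            else:
                if x_ops[x_pc] == "pop":
                    fifo.popleft()
                x_pc += 1
                progressed = True
        if a_pc < len(a_ops):
            if a_ops[a_pc] == "push" and occupancy == FIFO_DEPTH:
                stalls["A"] += 1  # A waits on a full FIFO
            else:
                if a_ops[a_pc] == "push":
                    fifo.append(a_pc)
                a_pc += 1
                progressed = True
        if not progressed:
            raise RuntimeError("deadlock: a machine is blocked and nothing can unblock it")
    return cycles, stalls
```

For example, if the A side issues five pushes while the X side idles for four cycles before popping five times, the A side stalls once the FIFO fills and resumes as the X side drains it. If the X side asks for a word that the A side never supplies, the sketch reports a deadlock rather than looping forever.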

The programmer/compiler was responsible for avoiding deadlock (which could result from, e.g., omission of an A instruction before the next X instruction); this typically involved making sure the loop count(s) inside the X-instruction-initiated microprogram matched the loop count(s) in the trailing A instructions. The micro-compiler that Culler shipped generated software-pipelined loops of the form XAA... automatically.

Both the A and X machines used three-stage pipelines: fetch/decode/execute. The A machine implemented delayed branches with two delay slots. The delay slots could cross page boundaries.
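The effect of two delay slots can be shown with a toy interpreter. This is a generic sketch of delayed branching, not the A-machine instruction set: the tuple encoding of instructions and the function name are hypothetical, and a branch placed inside a delay slot is simply rejected because the description above does not say how the hardware treated that case. Two slots fit a three-stage pipeline in which a branch takes effect at execute, by which time the next two instructions are presumably already in fetch and decode.

```python
DELAY_SLOTS = 2  # number of delay slots given in the text


def run_delayed(program, max_steps=100):
    """Execute a list of ('op', name) or ('br', target) tuples.

    A taken branch redirects control only after the next DELAY_SLOTS
    instructions have executed. Returns the executed instructions in order.
    """
    trace = []
    pc = 0
    pending = None  # (slots_remaining, target) while a branch is in flight
    steps = 0
    while pc < len(program) and steps < max_steps:
        steps += 1
        kind, arg = program[pc]
        if kind == "br" and pending is not None:
            raise ValueError("branch in a delay slot is not modeled in this sketch")
        next_pc = pc + 1
        if pending is not None:
            slots_left, target = pending
            slots_left -= 1
            if slots_left == 0:
                next_pc = target  # last delay slot done: the branch takes effect
                pending = None
            else:
                pending = (slots_left, target)
        if kind == "br":
            pending = (DELAY_SLOTS, arg)
            trace.append(f"br->{arg}")
        else:
            trace.append(arg)
        pc = next_pc
    return trace
```

A program with a branch at index 1 targeting index 5 therefore executes the instructions at indices 2 and 3 before control reaches index 5, and the instruction at index 4 is skipped.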

Dave Probert designed the Kernel Processor. The port of SunOS was done by Dave Probert, Mark Lucovsky, Jeff Berkowitz, and Dave McMillen. Compiler developers were Steve Byrne (C), Mike McCammon (f77), Joe Bonasera (code generator), Woody Lichtenstein and Steve Pearson (microcode generation), and Cleo O'Brien (libraries).

Although the Culler-7 was developed at about the same time, its designers apparently had no previous knowledge of Jim Smith's DAE papers or the Astronautics ZS-1. However, the need for starting and controlling memory accesses as early as possible can be seen in the late 1950s IBM Stretch computer, which in a way can be regarded as the "first" DAE design.

Patents on Culler-7 techniques

Patents on earlier FPS techniques

References


Acknowledgements

My thanks to Woody Lichtenstein and Dave Probert for their help in understanding the Culler-7.



mark@cs.clemson.edu