HARP - Hatfield Advanced RISC Processor

Mark Smotherman. Last updated July 2011

HARP is a VLIW architecture dating from the late 1980s that has been cited in many papers and patents, and it may have influenced the Itanium design. The HARP execution model characteristics include: [incomplete]

A brief aside on names: [incomplete]

The people involved in the HARP design include: [incomplete]

The HARP design was started in the late 1980s by Gordon Steven at Hatfield Polytechnic with the goals of "execut[ing] non-scientific programs at a sustained instruction execution rate in excess of one instruction per cycle" and "exploit[ing] the low-level parallelism available in systems programs and general purpose computations". [Microproc. and Microprog. paper, 1990] Professor Steven and his students followed the approach of a VLIW-like machine model (described above) and an optimizing compiler.

The HRC (HARP Research Compiler) compiled a subset of Modula-2 and consisted of three major phases: sequential code generation, local compaction (basic block scheduling), and conditional compaction (global scheduling). A gcc port was later developed to generate sequential HARP code that could then be run through the two compaction phases.
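The local-compaction phase, which packs a basic block's sequential operations into long instruction words, can be illustrated with a greedy list scheduler. This is only a sketch of the general technique, not the HRC's actual algorithm; the operation representation below is invented for illustration.

```python
# Greedy list scheduling of one basic block into long instruction words:
# each cycle packs up to WIDTH operations whose register conflicts with all
# earlier, not-yet-scheduled operations have been resolved.
# (Illustrative only; not the actual HRC algorithm.)

WIDTH = 4  # pipeline slots per long instruction word

def conflicts(a, b):
    """True if two operations (dest, {sources}) cannot execute in parallel
    or out of order: flow, anti, or output dependence on a register."""
    da, sa = a
    db, sb = b
    return da == db or da in sb or db in sa

def compact(block):
    """block: list of (dest, {sources}) operations in sequential order.
    Returns a list of instruction words (lists of operation indices)."""
    done, words, remaining = set(), [], set(range(len(block)))
    while remaining:
        # an op is ready when every earlier conflicting op is already done
        word = [i for i in sorted(remaining)
                if all(j in done
                       for j in range(i) if conflicts(block[j], block[i]))][:WIDTH]
        done |= set(word)
        remaining -= set(word)
        words.append(word)
    return words

block = [("a", set()), ("b", {"a"}), ("c", set()), ("d", {"c"})]
print(compact(block))   # [[0, 2], [1, 3]]
```

The two independent chains (a→b and c→d) compact from four sequential operations into two long instruction words.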

David Whale wrote a C-based generic instruction set simulator as part of his BSc project that was then used to implement the complete HARP model to run the SPEC benchmarks for a number of the later papers. The simulation model allowed the user to vary features such as the number of pipelines and register bank sizes, and allowed the team to explore the design space.

Students from the EE department at Hatfield Polytechnic, including Simon Trainis, designed a VLSI implementation of a four-pipeline instance of HARP. This implementation was called the iHARP and featured reduced register counts (32 general registers and 8 Boolean registers) and slotting of some of the pipeline functions.

pipeline 0        pipeline 1             pipeline 2        pipeline 3
computational     computational          computational     computational
relational        relational             relational        relational
memory reference                         memory reference
                  branch (1st priority)                    branch (2nd priority)
special purpose
                  32-bit literal                           32-bit literal
                  for pipeline 0                           for pipeline 2

There could be at most two branches per long instruction word, with pipeline 1 checked first and thus given priority when both branches evaluated as taken. Also, although two load/store operations were allowed in a given long instruction word, the compiler was expected to generate predicated code such that only one data cache access occurred at run time. Likewise, predication and write-back permission limited each long instruction word to at most two register write-backs at run time. (The register file had ten read ports and two write ports.) The four ALUs had a complete set of forwarding paths to each other. [Note that the pipeline functions and branch priority assignment differ among the various papers; the above description comes from the 1995 IEE Proceedings paper.]
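The run-time constraints above can be sketched as a simple validity check on a candidate long instruction word. This is a minimal illustration of the rules as described, not iHARP's actual encoding; the operation categories and field names are invented.

```python
# Illustrative check of iHARP long-instruction-word constraints as described
# above: at most two branches, and two load/stores only when predicated so
# that a single data cache access occurs at run time.
# (Operation representation is hypothetical.)

def valid_instruction_word(ops):
    """ops: list of dicts, one per pipeline slot, e.g.
    {"kind": "memory", "predicated": True} or {"kind": "branch"}."""
    branches = [op for op in ops if op["kind"] == "branch"]
    memories = [op for op in ops if op["kind"] == "memory"]
    # At most two branches per long instruction word.
    if len(branches) > 2:
        return False
    # Two load/stores may be encoded, but both must be predicated so that
    # only one accesses the data cache at run time.
    if len(memories) == 2 and not all(op.get("predicated") for op in memories):
        return False
    return True

ok = [{"kind": "memory", "predicated": True},
      {"kind": "memory", "predicated": True}]
print(valid_instruction_word(ok))   # True
```

A real encoder would also bound register write-backs to the two write ports; that check would follow the same pattern.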

A Resource Limited Scheduler (RLS) was subsequently developed specifically for the iHARP, and it incorporated loop unrolling and interprocedural scheduling as well as local and global compaction. An evaluation in 1994 of a simulated iHARP configuration reported a 1.76 speedup over a simulated single-pipeline version. [EuroMicro94] A slightly later study reported a 1.8 speedup. [IEE Proceedings 1995]

The 1994 evaluation found that the scheduled iHARP code was 134% larger than code for a single-pipeline version of HARP, mainly because of the nop-padding required for the long instruction words. Because of that increase, an in-order superscalar version was also evaluated; it showed only an 18% code-size increase over the single-pipeline version.

The HARP team also investigated speculative loads and ALU operations using "pollution" bits added to the general registers (and also to the Boolean registers) to indicate delayed exceptions.

In 1992, the research team changed its name from HARP to HSP (Hatfield Superscalar Processor) and then to HSA (Hatfield Superscalar Architecture). As part of this effort, a variable-length branch delay slot technique was proposed that uses a count field within each branch operation. Floating-point operations, as well as integer multiply and divide, were included. The HSA execution model was also extended from strictly in-order to also encompass out-of-order techniques. The Hatfield Superscalar Scheduler (HSS) was developed for the new execution model.
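The count-field idea can be illustrated with a toy interpreter: a taken branch with count N lets the next N sequential instructions complete before control transfers to the target. This is only a sketch of the semantics; the instruction format is invented for illustration.

```python
# Toy model of a variable-length branch delay slot selected by a per-branch
# count field (instruction encoding invented for illustration).

def run(program):
    """program: list of tuples; ("branch", target, count) transfers control
    to index `target` after `count` more instructions execute.
    Returns the trace of executed non-branch operations."""
    trace, pc = [], 0
    pending = None                    # [target, remaining delay count]
    while pc < len(program):
        instr = program[pc]
        pc += 1
        if instr[0] == "branch":
            _, target, count = instr
            if count == 0:            # no delay slots: redirect immediately
                pc = target
            else:
                pending = [target, count]
            continue
        trace.append(instr)
        if pending is not None:
            pending[1] -= 1
            if pending[1] == 0:       # delay slots exhausted: redirect
                pc = pending[0]
                pending = None
    return trace

prog = [("add", 1), ("branch", 5, 2), ("add", 2), ("add", 3),
        ("add", 4), ("add", 9)]
print(run(prog))   # [('add', 1), ('add', 2), ('add', 3), ('add', 9)]
```

With count 2, the two instructions after the branch execute before the target instruction at index 5, and the instruction at index 4 is skipped.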

An asynchronous processor with a five-stage pipeline was also investigated under the project name "Hades".

Some of the more recent papers from the HSA team are listed on the web page for the Compiler Technology and Computer Architecture Research Group (CTCA) at Hertfordshire.


[Note: there is a fair amount of repeated material across many of the papers, so the best papers to read first are marked with **.]

Journal, conference, and periodical papers (not exhaustive)

Additional technical reports (not exhaustive and does not include the technical report versions of the published papers above)


My thanks to Colin Egan for his help in collecting this information and to David Whale for information about the simulator.
