Multiple Instruction Issue

  goal is IPC > 1  (IPC = instructions per cycle = reciprocal of CPI)

Superscalar - multiple instructions issued to multiple pipelines

  a varying number of instructions is issued each clock cycle, based on
  hardware-checked dependencies and, in the more sophisticated processors,
  dynamic instruction scheduling

  key advantage is code compatibility - can run the same machine code as a
  regular scalar-pipeline processor

  integer pipeline stages for classic Pentium (dual pipelines, U and V)

         IF         inst. fetch from icache (can get two insts. in one fetch)
         D1         first decode - determine inst. lengths and possibility
        /  \                       of dual issue
      D2    D2      second decode - read registers and calculate effective address
      EX    EX      execute - both ALU and dcache access (may require multiple cycles)
      WB    WB      writeback

  UltraSPARC-I and II

    nine-stage, in-order pipeline, up to four instructions per cycle

           F              1 - fetch
           D              2 - decode (instruction buffer)
      _____G_____         3 - grouping
      E   E   R   R       4 - int: execution     / fp: register fetch
      C   C   X1  X1      5 - int: cache access  / fp: execution first stage
      N1  N1  X2  X2      6 - int: pass through  / fp: execution second stage
      N2  N2  X3  X3      7 - int: pass through  / fp: execution third stage
      N3  N3  N3  N3      8 - exception checking
      W   W   W   W       9 - writeback

Data dependencies

  1) RAW (read after write ordering must be preserved)  - true dependency
  2) WAR (write after read ordering must be preserved)  - anti-dependency
  3) WAW (write after write ordering must be preserved) - output dependency

  dependency-checking comparisons

    +----+-------+-------+-------+         The number of dependencies varies as
    | op | Rdest | Rsrc1 | Rsrc2 |         the square of the number of instructions
    +----+-------+-------+-------+         that must be compared in a single cycle
    |       WAR=?   WAR=?                  of decoding:
    |        |   |      |
    |        |   .-----*-------'             2 insts.  =>  5 comparators
    |        |   *- | ----*-------.          3 insts.  => 15 comparators
    |        |      |     |       |          4 insts.  => 30 comparators
    |      WAW=?  RAW=?  RAW=?    |
    +----+-------+-------+-------+           ... etc.  => 5*n*(n-1)/2 comparators
    | op | Rdest | Rsrc1 | Rsrc2 |
    +----+-------+-------+-------+

    (A research project called Ultrascalar at Yale provided better scaling)

  WAR and WAW are called false dependencies since they arise from reuse of a
  register and thus artificially limit parallelism; consider a code sequence:

    // multiple R1 dependencies           (dest. reg. listed first)
    mul  R1,_,_         e.g.,    mul R1,R2,R3     // first update to R1
    add  _,R1,_                  add R4,R1,R5     // (use of R1)
    xor  R1,_,_                  xor R1,R6,R7     // second update to R1
    and  _,R1,_                  and R8,R1,R9     // (use of R1)

  the xor cannot write to R1 until the add has read the value - a WAR dep.

Register renaming - removing false dependencies

  either the software or the hardware can rename the registers; in the latter
  case, the hardware can have more physical registers than appear in the ISA

    mul  PR10,_,_         decode, assign phy. reg. PR10 to architected R1
     |  .                   (mapping table now has R1 -> PR10)
      \ add  _,PR10,_     decode and rename source reg. R1 to PR10

    xor  PR12,_,_         decode, assign phy. reg. PR12 to architected R1
     |  .                   (mapping table now has R1 -> PR12)
      \ and  _,PR12,_     decode and rename source reg. R1 to PR12

  the xor is no longer dependent on either the mul or the add and can execute
  and even complete prior to either the mul or the add
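  a rough C sketch of the renaming idea above (illustrative only - the
  register counts, the function name rename_inst, and the trivial free list
  are assumptions for this sketch, not any real processor's design):

    #include <stdio.h>

    #define NUM_ARCH_REGS 16       /* architected registers R0..R15 (assumed) */
    #define NUM_PHYS_REGS 64       /* more physical regs than architected     */

    static int map[NUM_ARCH_REGS];         /* arch reg -> current phys reg    */
    static int next_free = NUM_ARCH_REGS;  /* toy free "list": just a counter */

    /* rename one instruction: sources read the current mapping, then the
       destination gets a brand-new physical register, so a later writer of
       the same arch register (WAW) and a later reader of the old value (WAR)
       no longer conflict - only the true RAW dependencies remain            */
    static void rename_inst(const char *op, int rd, int rs1, int rs2)
    {
        int ps1 = map[rs1];                 /* rename sources first           */
        int ps2 = map[rs2];
        int pd  = next_free++;              /* new phys reg for the dest      */
        map[rd] = pd;                       /* later readers of rd see pd     */
        printf("%-4s PR%d, PR%d, PR%d\n", op, pd, ps1, ps2);
    }

    int main(void)
    {
        for (int r = 0; r < NUM_ARCH_REGS; r++)
            map[r] = r;                     /* initially Rn -> PRn            */

        /* the sequence from above: two updates of R1 separated by uses */
        rename_inst("mul", 1, 2, 3);        /* mul R1,R2,R3                   */
        rename_inst("add", 4, 1, 5);        /* add R4,R1,R5 (reads 1st R1)    */
        rename_inst("xor", 1, 6, 7);        /* xor R1,R6,R7 -> new phys reg   */
        rename_inst("and", 8, 1, 9);        /* and R8,R1,R9 (reads 2nd R1)    */
        return 0;
    }

  running it shows the two writers of R1 landing in different physical
  registers, which is why the xor can finish ahead of the mul/add pair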
Static instruction scheduling

  example of DEC Alpha instruction scheduler for two-way in-order superscalar

  basic block before code scheduling - 24 cycles, one double issue

  |01234567890123456789|  Cycle   Offset    Number  Instruction
  |I                   |    0     00000000    0     LDD    F0,0(R16)
  |.I                  |    1     00000004    1     LDD    F8,0(R0)
  |  I                 |    2     00000008    2     LDD    F1,8(R16)
  |  .I                |    3     0000000C    3     LDD    F9,8(R0)
  |    I               |    4     00000010    4     MULG   F0,F5,F0
  |    ......I         |   10     00000014    5     ADDG   F8,F0,F8
  |           I        |   11     00000018    6     MULG   F1,F5,F1
  |           ......I  |   17     0000001C    7     ADDG   F9,F1,F9
  |                  I |   18     00000020    8     LDA    R16,10(R16)
  |                  .I|   19     00000024    9     CMPLT  R16,R10,R11
  |I                   |   20     00000028   10     LDA    R0,10(R0)
  |..I                 |   22     0000002C   11     STD    F8,0(R0)
  |   I                |   23     00000030   12     STD    F9,8(R0)
  |   I                |   23     00000034   13     BNE    R11,0
  |01234567890123456789|  Cycle   Offset    Number  Instruction

  basic block after code scheduling - 15 cycles, four double issues

  |01234567890123456789|  Cycle   Offset    Number  Instruction
  |I                   |    0     00000000    0     LDD    F0,0(R16)
  |.I                  |    1     00000004    2     LDD    F1,8(R16)
  |  I                 |    2     00000008    1     LDD    F8,0(R0)
  |  I                 |    2     0000000C    8     LDA    R16,10(R16)
  |   I                |    3     00000010    4     MULG   F0,F5,F0
  |   I                |    3     00000014    3     LDD    F9,8(R0)
  |    I               |    4     00000018    6     MULG   F1,F5,F1
  |    I               |    4     0000001C   10     LDA    R0,10(R0)
  |     I              |    5     00000020    9     CMPLT  R16,R10,R11
  |     ....I          |    9     00000024    5     ADDG   F8,F0,F8
  |          I         |   10     00000028    7     ADDG   F9,F1,F9
  |          ...I      |   13     0000002C   11     STD    F8,0(R0)
  |              I     |   14     00000030   12     STD    F9,8(R0)
  |              I     |   14     00000034   13     BNE    R11,0
  |01234567890123456789|  Cycle   Offset    Number  Instruction
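  the reordering above is done entirely by the compiler; a toy C sketch of
  the underlying idea (greedy list scheduling with a two-per-cycle issue
  limit) is below - the four-instruction example, its latencies, and its
  dependence matrix are made-up illustrations, not the actual Alpha scheduler:

    #include <stdio.h>

    #define N     4        /* tiny example: two loads, a multiply, an add  */
    #define WIDTH 2        /* at most two instructions issued per cycle    */

    static const char *name[N]  = { "LDD F0", "LDD F8", "MULG F0", "ADDG F8" };
    static const int latency[N] = { 3, 3, 6, 6 };
    static const int dep[N][N]  = {          /* dep[i][j]=1: i needs j's result */
        {0,0,0,0},
        {0,0,0,0},
        {1,0,0,0},                           /* MULG uses the first load        */
        {0,1,1,0},                           /* ADDG uses the 2nd load and MULG */
    };

    int main(void)
    {
        int ready[N], scheduled[N] = {0};    /* ready[i] = cycle result is available */
        int cycle = 0, done = 0;

        while (done < N) {
            int issued = 0;
            for (int i = 0; i < N && issued < WIDTH; i++) {
                if (scheduled[i]) continue;
                int ok = 1;                  /* all of i's inputs available yet? */
                for (int j = 0; j < N; j++)
                    if (dep[i][j] && (!scheduled[j] || ready[j] > cycle))
                        ok = 0;
                if (ok) {
                    ready[i] = cycle + latency[i];
                    scheduled[i] = 1;
                    done++;
                    issued++;
                    printf("cycle %2d: issue %s\n", cycle, name[i]);
                }
            }
            cycle++;
        }
        return 0;
    }

  a production scheduler also models which instruction pairs may dual issue
  and chooses among ready instructions using priorities such as critical-path
  length, but the dependence-plus-latency bookkeeping is the same idea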
Dynamic instruction scheduling

   <-------- "front end" -------->  buffer  <--------- "back end" --------->
                                             "restricted data flow"
       --- in program order ---      ------ out of order ------     in order
      /                        \    /                          \   /        \

   .------.   .------.   .------.   .------.   .---------.   .------.   .----.
   |      |   |      |   |      |   |inst. |   |         |   |result|   |reg.|
   |memory|-->|icache|-->|decode|-->|window|-->|execution|-->|window|-->|file|
   `------'   `------'   `------'   `------'   `---------'   `------'   `----'
          \          \          \          \             \          \
         refill     fetch      issue     dispatch      complete    retire

                           or, depending on manufacturer,
                          ("dispatch")  ("issue"/      ("finish")  ("commit"/
                                         "schedule")                "graduate")

  instruction window often called reservation station(s)
  result window often called reorder buffer or completion buffer

  as the decoder places the next instruction into the instruction window, it
  also reserves a slot for the result of the instruction in the result window;
  these slots will be kept in program order within the result window, but the
  instructions can be scheduled from the instruction window out-of-order and
  thus execute and complete out of program order

  for example, consider the instruction sequence A,B,C,D,E,F

  a possible internal state within the processor:

    instruction window/                           result window/
    reservation stations     execution units      reorder buffer
    --------------------     ---------------      --------------
                                                   A (done)
                                                   B (executing)
                                                   C (wait on B result)
                                                   D (done)
                                                   E (executing)
                                                   F (executing)

  D will retire (i.e., make its results visible in the architectural state)
  only after A, B, and C have retired; however, any instructions dependent on
  D's result can access that result earlier than D's retirement via operand
  forwarding

  branch mispredictions can be handled in a simple fashion by flushing the
  rest of the result window and restarting down the correct path once a
  mispredicted branch attempts to retire (can also add checks to catch a
  mispredicted branch earlier and reduce the misprediction penalty)

    for example, if B in the example above was a mispredicted branch and if
    C,D,E,F were on the mispredicted path, then the processor can flush
    C,D,E,F and restart on the correct path

  a precise exception model can be supported in a similar manner

    for example, if E in the example above causes an exception, then the
    processor can flush F, save the address of E, and start fetching
    instructions from the exception handler; B and C will be allowed to
    finish before the exception is taken at the time E tries to retire
    (as with mispredictions, some processors may check for exceptions
    during execution rather than at retirement)
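  a compact C sketch of the retire logic just described, using a circular
  reorder buffer - the entry fields, the buffer size, and the dispatch/retire
  function names are invented for illustration, not a particular design:

    #include <stdio.h>
    #include <string.h>

    #define ROB_SIZE 8

    struct rob_entry {
        char name;         /* which instruction (A, B, C, ...)              */
        int  valid;        /* slot in use                                   */
        int  done;         /* finished executing (result held in the entry) */
        int  mispredict;   /* set if this is a branch found mispredicted    */
    };

    static struct rob_entry rob[ROB_SIZE];
    static int head, tail;              /* head = oldest = next to retire    */

    /* decode reserves a result slot in program order */
    static void dispatch(char name)
    {
        rob[tail] = (struct rob_entry){ .name = name, .valid = 1 };
        tail = (tail + 1) % ROB_SIZE;
    }

    /* retire from the head only, stopping at the first unfinished entry, so
       results become architecturally visible strictly in program order      */
    static void retire(void)
    {
        while (rob[head].valid && rob[head].done) {
            printf("retire %c\n", rob[head].name);
            if (rob[head].mispredict) {         /* flush everything younger  */
                printf("flush younger entries, restart fetch\n");
                memset(rob, 0, sizeof rob);
                head = tail = 0;
                return;
            }
            rob[head].valid = 0;
            head = (head + 1) % ROB_SIZE;
        }
    }

    int main(void)
    {
        for (char c = 'A'; c <= 'F'; c++)
            dispatch(c);
        rob[0].done = 1;                /* A finishes                        */
        rob[3].done = 1;                /* D finishes out of order           */
        retire();                       /* retires only A; D must wait       */
        rob[1].done = 1;                /* B finishes                        */
        rob[2].done = 1;                /* C finishes                        */
        retire();                       /* now B, C, D retire in order       */
        return 0;
    }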
Example of dynamic instruction scheduling

  out-of-order execution - restricted data flow engine
  (out-of-order often abbreviated as OoO or o-o-o)

  advantages
    * tolerates cache misses
    * provides hardware loop unrolling

  example loop

    ld   F0,0(R1)      // 2-cycle latency,   F0 <- memory[ R1 + 0 ]
    addf F4,F2,F0      // 4-cycle execution, F4 <- F2 + F0
    st   F4,0(R1)      // 2-cycle latency,   memory[ R1 + 0 ] <- F4
    sub  R1,R1,8       // 1-cycle execution, R1 <- R1 - 8
    bne  R1,R2,loop    // 1-cycle execution, branch to loop if R1 != R2

  cycle diagram for simple 5-stage pipeline with forwarding, with an integer
  ALU and a floating-point ALU (in the same stage, but designated by an
  integer instruction executing as 'E' and an FP instruction executing as 'X')

             E
    F | D | / | M | W
             X

     0/LD:  FDEMW                       // first iteration
     1/AD:   FD-XXXXMW                  // addf dependent on ld (executes on FP pipe)
     2/ST:    F-D---EMW                 // st dependent on addf
     3/SU:     F---DEMW
     4/BN:         FDEMW                // bne dependent on sub
                    F                   // delay slot (could move store here)
     5/LD:            FDEMW             // second iteration
     6/AD:             FD-XXXXMW
     7/ST:              F-D---EMW
     8/SU:               F---DEMW
     9/BN:                   FDEMW
                              F
    10/LD:                      FDEMW
    11/AD:                       FD-XXXXMW
    12/ST:                        F-D---EMW
    13/SU:                         F---DEMW
    14/BN:                             FDEMW
      ...                               ...
                     |         |
                      1234567890        = 1 store each 10 cycles

    F-Fetch; D-decode; E/X-execute; M-memory; W-writeback

  cycle diagram for 2-wide in-order superscalar
    (omit M stage from FP and branch pipes)

     0/LD:  FDEMW                       // first iteration
     1/AD:  FD--XXXXW                   // addf dependent on ld
     2/ST:   F--DE--MW                  // st dependent on addf
     3/SU:   F--DE---W
     4/BN:    F-DE--W                   // bne dependent on sub
                                        // empty slot
     5/LD:         FDEMW                // second iteration
     6/AD:         FD--XXXXW
     7/ST:          F--DE--MW
     8/SU:          F--DE---W
     9/BN:           F-DE--W
    10/LD:                FDEMW
    11/AD:                FD--XXXXW
    12/ST:                 F--DE--MW
    13/SU:                 F--DE---W
    14/BN:                  F-DE--W
      ...                    ...
                   |      |
                    1234567             = 1 store each 7 cycles
                                          (could be reduced to 6 cycles
                                           with branch prediction)

    F-Fetch; D-decode; E/X-execute; M-memory; W-writeback

  cycle diagram for 2-wide o-o-o superscalar - assume max issue of 2 and
    perfect branch prediction; note o-o-o means more pipe stages

     0/LD:  FIDEMCR                     // first iteration
     1/AD:  FI DXXXXCR                  // addf dependent on ld
     2/ST:   FI DEMCR                   // st dependent on addf
     3/SU:   FIDEC R                    // sub can start early
     4/BN:    FI DEC R                  // bne dependent on sub
                                        // empty slot
     5/LD:     FI DEMC R                // second iteration -- note that
     6/AD:     FI DXXXXC R              //   2nd ld starts before 1st st
     7/ST:      FI DEMCR
     8/SU:      FIDEC R
     9/BN:       FI DEC R
    10/LD:        FI DEMC R
    11/AD:        FI DXXXXC R
    12/ST:         FI DEMCR
    13/SU:         FIDEC R
    14/BN:          FI DEC R
      ...            ...
              |  |
               123                      = 1 store each 3 cycles

    F-Fetch; I-decode/issue; D-dispatch; E/X-execute; M-memory;
    C-complete; R-retire
    (note that two integer ALUs and two FP ALUs are needed)
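  for reference, a plausible C-level version of the example loop (an
  assumption about what the assembly computes: F2 is taken to be a
  loop-invariant addend and R1/R2 to delimit an array of doubles that is
  walked backward 8 bytes at a time):

    /* r1 walks backward through the array one double (8 bytes) per
       iteration, r2 marks where to stop, f2 is the value being added */
    void add_constant(double *r1, double *r2, double f2)
    {
        do {
            double f0 = *r1;        /* ld   F0,0(R1)   */
            double f4 = f2 + f0;    /* addf F4,F2,F0   */
            *r1 = f4;               /* st   F4,0(R1)   */
            r1 = r1 - 1;            /* sub  R1,R1,8    */
        } while (r1 != r2);         /* bne  R1,R2,loop */
    }

  the "hardware loop unrolling" above is the processor overlapping successive
  iterations of this loop on its own: in the o-o-o diagram the second
  iteration's load starts before the first iteration's store has finished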
Pipeline stages of Intel P6 core (Pentium Pro / II / III)

    front-end                          execution core           back-end
    ---------                          --------------           --------
    I1 - next IP (inst. ptr.)     1   <--.
    I2 - Icache 1                 2      |
    I3 - Icache 2 / inst. length  3      |
    I4 - Decode 1 / rotate        4      |
    I5 - Decode 2                 5      |
    I6 - Decode 3                 6      |
    I7 - Register rename          7      |
    I8 - RS write                 8      |
                                         |   O1 - RS schedule    9
                                         |   O2 - RS dispatch   10
                                         `---- branch mispredict ---- O3 - execute
                                                                 R1 - Retire 1
                                                                 R2 - Retire 2
                                                                 R3 - RRF write

  8-stage in-order pipeline - decodes up to three x86 insts. per cycle, but
    only one can be complex (e.g., memory-to-reg); each x86 inst. is
    translated into one or more fixed-length uops; uops are placed into a
    20-entry reservation station (RS)

  3-stage out-of-order core - issues up to four uops per cycle; results from
    completed uops are written to a 40-entry reorder buffer

  3-stage retirement pipeline - retires up to three uops per cycle into the
    retirement register file (RRF)

  decoder "shortstop" predicts a branch based on its displacement when it is
    not found in the BTB; untaken forward branches yield no penalty, taken
    backward branches yield a 5-6 cycle penalty, and a misprediction yields
    a 10-15 cycle penalty

  20 (vs. 10 for the P6 core above) branch-mispredict stages for initial
    versions of Pentium 4
  30+ branch-mispredict stages for the Prescott version of Pentium 4

  Pentium 4 added a trace cache to collect groups of translated uops
    (avoids the cost of retranslation)

  4-stage pipeline for PowerPC 750 (see manual)
  16-stage pipeline for PowerPC 970

VLIW - eliminate dependency-checking and scheduling hardware; instead, let
  the compiler prepare fixed packages of multiple operations per cycle

  dependencies are determined by the compiler and used to schedule according
  to function unit latencies

  function units are assigned by the compiler and correspond to the position
  within the instruction packet ("slotting")

       ld/st 0        ld/st 1        alu             br
    .------------.------------.--------------.--------------.
    | load r1,A  | load r2,B  | nop          | nop          |
    .------------.------------.--------------.--------------.
    | nop        | nop        | nop          | nop          |
    .------------.------------.--------------.--------------.
    | nop        | nop        | add r3,r1,r2 | nop          |
    .------------.------------.--------------.--------------.
    | store r3,C | nop        | nop          | nop          |
    `------------.------------.--------------.--------------'

  problems
    1) compatibility across different implementations
    2) code density (high percentage of nops)
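  to make the slotting and the code-density problem concrete, a C sketch of
  what a fixed-format packet might look like to the compiler - the slot
  layout, field names, and register encoding here are invented, not any real
  machine's format:

    #include <stdio.h>

    enum opcode { NOP, LOAD, STORE, ADD };

    struct op     { enum opcode opc; int dest, src1, src2; };
    struct packet { struct op slot[4]; };   /* slot position selects the unit:
                                               0,1 = ld/st, 2 = alu, 3 = br  */

    /* the four packets from the table above: C = A + B */
    static const struct packet program[4] = {
        {{ {LOAD,  1, 0, 0}, {LOAD, 2, 0, 0}, {NOP},          {NOP} }},
        {{ {NOP},            {NOP},           {NOP},          {NOP} }},  /* load delay */
        {{ {NOP},            {NOP},           {ADD, 3, 1, 2}, {NOP} }},
        {{ {STORE, 0, 3, 0}, {NOP},           {NOP},          {NOP} }},  /* store r3   */
    };

    int main(void)
    {
        int nops = 0;
        for (int i = 0; i < 4; i++)
            for (int s = 0; s < 4; s++)
                if (program[i].slot[s].opc == NOP)
                    nops++;
        printf("%d of 16 slots are nops\n", nops);   /* 12 of 16: poor density */
        return 0;
    }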
Transmeta Crusoe

  compatibility - code morphing from x86 insts. to native 4-way VLIW

      Load/store     Alu0         Alu1/sh/fp/mmx    Br/Imm
    .------------.------------.--------------.--------------.
    |            |            |              |              |
    `------------.------------.--------------.--------------'

  improve code density
    - register scoreboard (a form of "dynamic VLIW" since extra hardware)
    - six instruction bundle formats: AA, AB, AI, LA, LAAB, LAAI

  commit bit in each bundle to define a commit point, which maps n x86 insts.
  to m VLIW insts. by copying working registers to shadow registers

EPIC - Explicitly Parallel Instruction Computing
  - compiler provides information on dependencies

  extra bits or counts to indicate independence among instructions

  hardware doesn't have to "rediscover" dependencies, but function unit
  assignment and scheduling are done by hardware

  avoids the scaling problem of superscalar dependency checking but retains
  code compatibility across implementations

Categorizing the approaches

                  dependency      fn. unit        execution
                  checking        assignment      scheduling
                  ----------      ----------      ------------

                 ~~~~~~~~~~~~    ~~~~~~~~~~~~    ~~~~~~~~~~~~
   superscalar   ~ hardware ~    ~ hardware ~    ~ hardware ~
                 ~~~~~~~~~~~~    ~~~~~~~~~~~~    ~~~~~~~~~~~~

                                 ~~~~~~~~~~~~    ~~~~~~~~~~~~
   EPIC           software       ~ hardware ~    ~ hardware ~
                                 ~~~~~~~~~~~~    ~~~~~~~~~~~~

   VLIW           software        software        software

                    code generation
                           |
      superscalar          |`------------------------------------------.
                           |                                           |
       .-------------------|-----.                 .-------------------|-----.
       |                   v     |                 |                   v     |
       |  dependency checking    |                 |  dependency checking    |
       |       O(n^2)            |                 |       O(n^2)            |
       |                   |     |   EPIC          |                   |     |
       |                   |`------------------------------------------.     |
       |                   |     |                 |                   |     |
       |                   v     |                 |                   v     |
       |  fn. unit assignment    |                 |  fn. unit assignment    |
       |                   |     |   dynamic       |                   |     |
       |                   |     |   VLIW          |                   |     |
       |                   |`------------------------------------------.     |
       |                   |     |                 |                   |     |
       |                   v     |                 |                   v     |
       |  time to start execution|                 |  time to start execution|
       |                   |     |                 |  (e.g., reg. scoreboard)|
       |                   |     |   VLIW          |                   |     |
       |                   `-------------------------------------------.     |
       |                         |                 |                   |     |
       `-------------------------'                 `-------------------|-----'
            compiler scheduling                       hardware control |
                                                                        v
                                                               hardware fn. units

Intel Itanium Processor Family (IPF)

  also known as Intel Architecture, 64 bits (IA-64)

  instruction format - three 41-bit instruction syllables per 128-bit "bundle"
    - each bundle contains 5 template bits, which specify the syllable types
      and the independence of the syllables (idea of "stop bits" within a
      bundle and between bundles, seen in Itanium assembly language as ";;")
    - each instruction syllable is predicated

  register rich
    - 128 integer registers (64 bits each) (also a 65th bit called NaT on each)
    - 128 floating-point registers (can have special NaTVal encoding)
    - 64 predicate registers (1 bit each)
    - 8 branch target registers

  predicates provide conditional execution and also are used to turn
  unconditional branches into conditional branches

  overlapping register windows for procedure calls

  rotating register file for software pipelining

  memory addressing uses the contents of a single register, with optional
  update by another register or an immediate value

EPIC compared to dynamically-scheduled RISC - Harsh Sharangpani HC12 slide

  Bottleneck        Dynamic RISC Approach         Itanium EPIC Approach
  ----------        ---------------------         ---------------------
  Scheduling        Traditional compiler          Entire compilation scope
  Scope             + limited hardware window

  Memory latency    Hardware scheduling           Control speculation across
  & control         across dynamic                compiler scope;
  flow barriers     window assisted by            data speculation for
                    memory order buffer           undisambiguated memory;
                                                  extensive memory hints

  Control flow      Large dynamic branch          Predication for flaky
  disruptions       predictors;                   branches;
                    1 branch/clock                extensive branch/prefetch
                                                  hints;
                                                  superscalar branching

  Operand           Small architectural           Large register file, with
  delivery          file with                     register stacking & rotation
                    rename tables

  Interprocedural   Require spill/fill to         Stacking for parameter
  overhead          memory or registers           passing

      if( a == b ) c++; else d++;

    on SPARC                      on Itanium

      cmp  %a_r, %b_r               cmp.eq p1,p2 = a_r, b_r
      bne  else                     ;;
      nop                           (p1) add c_r = 1, c_r
    then:                           (p2) add d_r = 1, d_r
      inc  %c_r
      ba   done
      nop
    else:
      inc  %d_r
    done:

    (note: SPARC v9 has a conditional move which could be used to help
     reduce the number of branches)
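  a C-level view of what the if-conversion above accomplishes (illustrative
  only): the branch disappears because both adds are issued, each guarded by
  a one-bit predicate

    /* branching version: control dependence on the compare */
    void with_branch(int a, int b, int *c, int *d)
    {
        if (a == b) (*c)++; else (*d)++;
    }

    /* predicated version, mirroring cmp.eq p1,p2 / (p1) add / (p2) add:
       both sides "execute" and each result is enabled by its predicate  */
    void with_predicates(int a, int b, int *c, int *d)
    {
        int p1 = (a == b);          /* cmp.eq p1,p2 = a_r,b_r */
        int p2 = !p1;
        *c += p1;                   /* (p1) add c_r = 1,c_r   */
        *d += p2;                   /* (p2) add d_r = 1,d_r   */
    }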
example Itanium code

  http://software.intel.com/en-us/articles/recognizing-efficient-use-of-caches-in-code-for-the-itaniumr-processor-family/

  example shows explicit grouping into three-instruction bundles
  (assembler has an option to do this for you)

    { .mfb                          // bundle type is mem/fp/branch
      add r3=-1,r33                 // r3 <- r33 - 1
      nop.f 0
      nop.b 0 ;;                    // ;; indicates end of group
    }{ .mii                         // bundle type is mem/int/int
      alloc r11=ar.pfs,2,6,0,8      // alloc for register window
      cmp4.eq.unc p7,p6=r33,r0      // set predicates p6 and p7
      add r34=0,r0
    }{ .mfb
      nop.m 0
      nop.f 0
      (p7) br.cond.dpnt .b1_1 ;;    // - predicate makes branch conditional;
    }                               // - special case for cmp and branch
                                    //   in same instruction group;
                                    // - "dpnt" stands for dynamic prediction
                                    //   initialized to predict not taken

load/store hints

  (none)  - temporal locality in all levels is assumed as default
  .nt1    - non-temporal level 1
  .nt2    - non-temporal level 2
  .nta    - non-temporal all levels

speculative load - move a load above a branch (control speculation)

  want to perform the load earlier
  - want to ignore any exceptions (TLB miss, page fault, access violation)
    caused by an early load (since it was not on the taken path)
  - but must preserve any exceptions on the original path

  (this approach substitutes for the cache miss tolerance of OoO superscalar)

  for ld.s, the NaT bit on the register is set if an exception occurs; chk.s
  checks the NaT bit and branches to a recovery block if set

    load control dependent      speculative load can be used prior to branch
    on branch                   (recovery block used if needed)

                                ld4.s r1 = [r2]        | recovery:
    (p1) br.cond label          (p1) br.cond label     |   ld4 r1=[r2]
    ...                         ...                    |   br use_r1
    ld4 r1 = [r2]               chk.s r1,recovery      |
                                use_r1: ...

  NaT bits propagate during ALU operations, so you can arrange to combine
  multiple loads into one check and one recovery block

    loads control dependent     speculative loads can be used prior to branch
    on branch                   (recovery block used if needed)

                                ld4.s r1 = [r2]        | recovery:
    (p1) br.cond label          ld4.s r3 = [r4] ;;     |   ld4 r1=[r2]
    ...                         add r5 = r1,r3         |   ld4 r3=[r4];;
    ld4 r1 = [r2]               (p1) br.cond label     |   add r5=r1,r3
    ld4 r3 = [r4] ;;            ...                    |   br use_r5
    add r5 = r1,r3              chk.s r5,recovery      |
                                use_r5: ...

advance load - move a load above a store (data speculation)

  ld.a places the address in the ALAT (advance load address table); can use
  ld.c (reload if needed) or chk.a (branch to a recovery block)

    load may be dependent       advance load can be used prior
    on store (that is,          to the store, and the ld.a's ALAT
    the compiler may not        entry is invalidated if a store
    be able to prove            occurs to that address
    that r2 != r3)
                                ld4.a r1 = [r2]    // early load
    ...                         ...
    st4 [r3] = r4               st4 [r3] = r4
    ...                         ...
    ld4 r1 = [r2]               ld4.c r1 = [r2]    // nop if okay
    ...                         ...                // can use r1 here either way
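  the "compiler may not be able to prove that r2 != r3" situation is ordinary
  pointer aliasing; a small C illustration (hypothetical function, for
  exposition only):

    /* if dst and src might point to the same word, the compiler cannot just
       hoist the load of *src above the store to *dst; an advance load (ld.a)
       performs the load early anyway, the ALAT catches the case where the
       store really did hit the same address, and ld.c / chk.a then redoes
       the load                                                              */
    int maybe_aliased(int *dst, int *src, int value)
    {
        *dst = value;       /* st4 [r3] = r4                        */
        return *src;        /* ld4 r1 = [r2] - candidate for ld.a   */
    }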