Overview of exploiting parallelism in processors

Goal: CPI = 1

  Scalar pipeline (e.g., 5 stages)

  Problems:
    1) resource conflicts
    2) data dependencies
    3) branches

  Software solutions - all revolve around scheduling instructions

  Hardware solutions
    1) resource conflicts - replicate resources
    2) data dependencies - forwarding
    3) branches - branch prediction

  Instruction set solution to branches - delayed branches (popular in
  1980s-era RISC designs, but actually a design mistake to tie an ISA
  so closely to one implementation)

Goal: CPI < 1 (or, more commonly phrased, IPC > 1)

  Multiple pipelines

  Same problems as above

  Hardware approach to controlling the pipelines => superscalar
  Software approach to controlling the pipelines => VLIW
  Later hybrid approach => EPIC (e.g., Intel Itanium)

  scalar RISC (no stalls)

    IF ID EX MM WB
       IF ID EX MM WB
          IF ID EX MM WB

  3-way superscalar RISC (diagram shows sustained execution, no stalls)

    IF ID EX MM WB
    IF ID EX MM WB
    IF ID EX MM WB
       IF ID EX MM WB
       IF ID EX MM WB
       IF ID EX MM WB
          IF ID EX MM WB
          IF ID EX MM WB
          IF ID EX MM WB

  3-way VLIW

    IF ID EX __ WB    alu slot
    "  "  EX MM WB    ld/st slot
    "  "  EX __ __    branch slot
       IF ID EX __ WB
       "  "  EX MM WB
       "  "  EX __ __
          IF ID EX __ WB
          "  "  EX MM WB
          "  "  EX __ __

  vector processor

    IF ID EX WB          element 1    (multiple execution cycles for
          RF EX WB       element 2     each vector instruction fetched)
             RF EX WB    element 3
                ...
       IF ID EX WB       next vector instruction
             RF EX WB
                RF EX WB
                   ...
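The IPC implied by the timing diagrams above can be worked out with a small
sketch (a simplified model assuming a full issue group every cycle and no
stalls; the function names here are mine, chosen for illustration):

```python
import math

def pipeline_cycles(n_insts, stages=5, width=1):
    # The first group of `width` instructions completes after `stages`
    # cycles; each later group drains one cycle behind it (no stalls).
    groups = math.ceil(n_insts / width)
    return stages + groups - 1

def ipc(n_insts, stages=5, width=1):
    return n_insts / pipeline_cycles(n_insts, stages, width)

# scalar 5-stage pipeline: IPC approaches 1 as the pipeline fills
print(ipc(1000))            # ~0.996  (1000 insts / 1004 cycles)

# 3-way superscalar: IPC approaches 3
print(ipc(1000, width=3))   # ~2.96   (1000 insts / 338 cycles)
```

This makes the "Goal: CPI = 1" vs. "Goal: CPI < 1" distinction concrete: widening the pipeline is the only way, in this idealized model, to push steady-state IPC above 1.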
Types of Parallelism

  ILP (instruction-level parallelism) => superscalar
                                         EPIC
                                         VLIW

  TLP (thread-level parallelism) => multithreading
                                    multicore

  DLP (data-level parallelism) => vector processors (e.g., Cray-1) - historical
                                  SIMD multimedia ISA extensions (MMX, SSE)
                                  stream processors (graphics processors)

  combine ILP and TLP => SMT (simultaneous multithreading)
                         Intel calls it "hyperthreading"

  PLP (packet-level parallelism) => special form of DLP for network processors

  multiprocessors (MIMD) => shared or distributed memories
                            interconnection network
    clusters are multiprocessors with distributed memory and slower
    LAN technology used for interconnection

  comparison of approaches (from Williams and Patterson)

                 expressed at      discovered
                 compile time      at run time
         ....................................
    ILP  .  VLIW           .  superscalar    .
         ....................................
    DLP  .  SIMD, vectors  .  GPUs, streams  .
         ....................................

  modern processors typically combine different approaches

                                        Pentium 4 HT    Transmeta Crusoe
    ILP => superscalar . . . . . . . . . .  yes
           EPIC  . . . . . . . . . . . . .
           VLIW  . . . . . . . . . . . . .                    yes
    TLP => multithreading  . . . . . . . .  yes
           multicore . . . . . . . . . . .
    DLP => multimedia ISA ext. . . . . . .  yes               yes
           GPU/stream processors . . . . .

  execution models (from Sarkar, 2009)

    model              device trend    architecture trend   software trend
    -----              ------------    ------------------   --------------
    von Neumann        vacuum tubes    scalar insts.        scalar compilers
                         to SSI
    vector             MSI             vector insts.        vectorizing
      parallelism                                             compilers
    shared memory      VLSI micro-     cache coherence      multithreaded
      parallelism        processors                           OS and runtime
    bulk-synchronous   clusters        interconnects        message-passing
      (distributed                                            libraries (MPI)
       memory)
    NEW MODEL          multicore       power constraints    lightweight
                                                              asynchronous
                                                              tasks
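The DLP row of the comparison above (SIMD multimedia extensions) can be
illustrated with a minimal sketch: one "instruction" operates on a fixed
number of lanes at once, and longer arrays are strip-mined over it. The
lane count of 4 matches a 128-bit SSE register holding 32-bit elements;
the function names are mine, for illustration only:

```python
LANES = 4  # assumption: 128-bit SSE register, 4 x 32-bit lanes

def simd_add(a, b):
    # one SIMD "instruction": element-wise add across all lanes at once
    return [x + y for x, y in zip(a, b)]

def vector_add(a, b):
    # strip-mine an arbitrary-length loop over the fixed lane width,
    # the way a compiler or hand-written SSE code would
    out = []
    for i in range(0, len(a), LANES):
        out.extend(simd_add(a[i:i + LANES], b[i:i + LANES]))
    return out

print(vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# -> [11, 22, 33, 44, 55]
```

The key point of DLP is visible here: the parallelism is expressed in the
operation itself (one add over many elements), not discovered at run time
as in a superscalar pipeline.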