Multiprocessors

Types of workloads
  regular (e.g., dense linear algebra) vs. irregular (e.g., trees, hash
    tables, etc.)
  synchronous (lockstep) vs. asynchronous
  coarse-grained (thread-level) vs. fine-grained (instruction-level)
  competitive (e.g., reader/writer) vs. cooperative (e.g., barrier synch.)
  communication-dominant vs. compute-dominant
  latency-tolerant vs. latency-intolerant
  "embarrassingly parallel" - easily partitionable; requires very little
    communication and synchronization to execute

Communication models
  shared memory - processes communicate via loads and stores (e.g., OpenMP)
  message passing - processes communicate via sends and receives (e.g., MPI)

Interconnections
  types
    shared bus
    hierarchical buses
    crossbar switch
    ring(s)
    multistage networks
    n-cube/mesh/torus
  concerns
    scaling (costs, bandwidth)
    communications latency

Memory access
  UMA - uniform memory access
    standard bus-based multiprocessor
    SMP = symmetric multiprocessor
    synchronization is typically through shared variables in memory

      +-----+     +-----+
      | CPU |     | CPU |
      +-----+ ... +-----+
      |cache|     |cache|
      +-----+     +-----+
         |           |
      +--------------------+
      |    interconnect    |
      +--------------------+
                |
           +--------+    (memory is interleaved for multiple accesses per
           | memory |    cycle; apart from bank conflicts, load/store
           +--------+    accesses from any CPU will execute at same speed)

  NUMA - nonuniform memory access
    single address space but distributed memory
    synchronization is typically through shared variables in memory

      +--------------+     +--------------+
      | CPU (or SMP) |     | CPU (or SMP) |
      +--------------+     +--------------+
             |                    |
      +-----+ +------+ ... +-----+ +------+   all memory can be
      | hub |-|memory|     | hub |-|memory|   accessed by load/store
      +-----+ +------+     +-----+ +------+   instructions on any CPU
         |                    |               (non-local accesses will
      +----------------------------+          of course be slower)
      |         interconnect       |
      +----------------------------+

    choices:
      no caches
      noncoherent caches
      CC-NUMA - coherent caches - add a cache directory to each hub
      COMA - cache-only memory architecture
        some research efforts and a few commercial machines built

  NORMA - no remote memory access
    MPP (massively parallel processor)
    conventional wisdom: the shared memory programming model doesn't scale
      to large numbers of processors
    message passing - communication occurs through run-time library
      routines and OS calls (MPI msg. latency may reach 3000 cycles)
    programmer must explicitly distribute data to the different memories
    synchronization occurs on message sends and receives

      +-----+     +-----+
      | CPU |     | CPU |
      +-----+     +-----+
         |           |
      +-----+ +------+ ... +-----+ +------+   memories are local and
      | hub |-|memory|     | hub |-|memory|   cannot be accessed by
      +-----+ +------+     +-----+ +------+   load/store instructions
         |                    |               from other CPUs
      +----------------------------+
      |         interconnect       |
      +----------------------------+

    variants
      DSM (distributed shared memory) - software simulates a NUMA-like
        platform
      clusters - commodity nodes and a LAN-technology-based interconnect
        (also, see below)
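    example - a minimal MPI send/receive sketch in C, assuming two
      processes; the variable name "value" and the tag 0 are arbitrary
      choices for illustration:

      /* compile with mpicc, run with mpirun -np 2 */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char *argv[])
      {
          int rank, value;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0) {
              value = 42;
              /* data moves between memories only via an explicit send */
              MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
          } else if (rank == 1) {
              /* the blocking receive is also a synchronization point:
                 rank 1 cannot proceed until the message has arrived */
              MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              printf("rank 1 received %d\n", value);
          }

          MPI_Finalize();
          return 0;
      }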
Flynn notation
  SISD - single instruction stream, single data stream
  SIMD - single instruction stream, multiple data streams
    single control unit, many custom processing units
    all processors do the same work on separate data, varying only based on
      local condition codes
    synchronization is easy since all units operate in lockstep
  MIMD - multiple instruction streams, multiple data streams
    all processors execute their own instruction stream
    processors can be commodity parts (i.e., inexpensive)
    synchronization between cooperating programs can be a problem
  MISD - multiple instruction streams, single data stream

Cache coherency
  shared data - multiple processors use the same word(s) in a cache line;
    caching eliminates read-shared hot spots in memory
  multiple readers, exclusive writer - eliminate the problem of stale data
    with a write-invalidate or write-update cache protocol
    write-invalidate is typically preferred
    - write-update requires more bandwidth, e.g., each update (write hit)
      requires a bus (or other interconnect) message even when reads of the
      shared data by other processors are infrequent
    - write-update doesn't adapt to workload changes, e.g., when a thread
      is rescheduled (migrates) onto another CPU, updates continue to be
      sent to the "apparently shared" data; in contrast, write-invalidate
      cleans leftover, unused data out of the previous cache
    - write-update causes problems for memory consistency (see below) in
      scalable NUMA systems
  false sharing - multiple processors use different words in the same cache
    line (but the hardware works only at the granularity of a whole line,
    so a ping-pong effect occurs with write-invalidate; a C sketch
    demonstrating this follows the snoopy protocol notes below)

Snoopy protocols
  each cache snoops (listens to) the bus traffic
  - will often have a second set of cache tags dedicated to snooping
  MESI protocol
    states are represented by bits associated with each cache line
    M = modified data in this cache line. The line is incoherent with
      memory, so the cache is said to own the line. No other cache in the
      system may have a copy of this line.
    E = exclusive cache line. The line is coherent with memory and is held
      unmodified in only one cache. The cache owns the line and can modify
      it without having to notify the rest of the system. No other cache in
      the system may have a copy of this line.
    S = shared cache line. The line is coherent with memory and may be
      present in several caches. Caches must notify the rest of the system
      about any changes to this line. Main memory owns this cache line.
    I = invalid line. The line is not cached.
  snoopy cache actions
    Read With Intent to Modify - If the address on the bus matches a Shared
      or Exclusive line, the line is invalidated. If it matches a Modified
      line, the cache must cause the bus action to abort, write the
      modified line back to memory, invalidate the line, and then allow the
      bus read to retry. Alternatively, the owning cache can supply the
      line directly to the requestor across the bus.
    Read - If the address on the bus matches a Shared line, there is no
      change. If the line is Exclusive, the state changes to Shared. If the
      line is Modified, the cache must cause the bus action to abort, write
      the modified line back to memory, change the line to Shared, and then
      allow the bus read to retry. Alternatively, the owning cache can
      supply the line to the requestor directly and change its own state to
      Shared.
  snoopy schemes are limited to small numbers of processors, e.g., on the
    order of 32
  multi-level inclusion - only on a snoop hit in L2 do you query L1
  MIPS R4000 provides per-page selection among five protocols to reduce bus
    traffic
    noncacheable - I/O device registers
    noncoherent (no invalidations) - CPU interrupt stack, local run queue,
      kernel text, read-only user text
    exclusive (MEI) - user data, process info that migrates (e.g., u-area,
      kernel stack)
    shared (MESI) - kernel data structures, user shared memory
    write-update - data constantly read by multiple processes (e.g., global
      run queue, some high-contention locks)
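  example - a minimal C/pthreads sketch of false sharing; the 64-byte line
    size, the struct layout, and the iteration count are assumptions for
    illustration:

    /* two threads increment independent counters; with the padding, each
       counter sits in its own (assumed 64-byte) cache line, so there is
       no ping-pong; remove the pad field to put both counters in one line
       and watch the write-invalidate traffic slow things down */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    struct counters {
        volatile long a;
        char pad[64 - sizeof(long)];  /* remove to provoke false sharing */
        volatile long b;
    };

    static struct counters c;

    static void *inc_a(void *arg)
    {
        for (long i = 0; i < ITERS; i++)
            c.a++;
        return NULL;
    }

    static void *inc_b(void *arg)
    {
        for (long i = 0; i < ITERS; i++)
            c.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, inc_a, NULL);
        pthread_create(&t2, NULL, inc_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", c.a, c.b);
        return 0;
    }

    on typical hardware the unpadded version is noticeably slower even
    though the threads never touch each other's data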
Directory-based protocols
  state bits are kept at the memory controllers for each memory block
  eliminates broadcast messages across the whole interconnect and instead
    uses point-to-point messages based on a list of which processors hold
    the cache line

Memory consistency
  rules on the reordering of loads/stores as seen by other processors
  can hide memory write latency by allowing write buffering, but then need
    explicit drain operations (fence, memory barrier) to force ordering
  example -- with read-write reordering, this code can print "0 0"

      thread 1        thread 2
      a = 0           b = 0
      store a=1       store b=1
      print b         print a

    if you place memory barriers between the store and print instructions,
    the code can only produce "0 1", "1 0", or "1 1" (a C sketch using C11
    fences follows the table below)
  note that Dekker's algorithm will fail under reordered references
  consistency models

                                |  allowed reordering  | read early  | special
                                |w-to-r w-to-w r-to-rw |other w own w|  ops
    ----------------------------+------.------.--------+-------.-----+-------
    SC  - sequential consistency|  -   |  -   |   -    |   -   | yes |   -
    TSO - total store order     | yes  |  -   |   -    |   -   | yes | MB,rmw
    PC  - processor consistency | yes  |  -   |   -    |  yes  | yes | MB,rmw
    PSO - partial store order   | yes  | yes  |   -    |   -   | yes | SB,rmw
    Alpha                       | yes  | yes  |  yes   |   -   | yes | MB,WMB
    PowerPC                     | yes  | yes  |  yes   |  yes  | yes | SYNC
    ----------------------------+------.------.--------+-------.-----+-------
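  example - the reordering example above in C11 atomics with pthreads; a
    minimal sketch, assuming relaxed atomics stand in for the plain stores
    and loads and atomic_thread_fence supplies the memory barrier:

    /* without the two fences, "0 0" is a permitted outcome on machines
       that let a store be reordered past a later load; with the fences,
       only "0 1", "1 0", and "1 1" remain possible */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int a, b;                  /* both initially 0 */

    static void *thread1(void *arg)
    {
        atomic_store_explicit(&a, 1, memory_order_relaxed);   /* store a=1 */
        atomic_thread_fence(memory_order_seq_cst);            /* barrier */
        printf("%d\n",
               atomic_load_explicit(&b, memory_order_relaxed)); /* print b */
        return NULL;
    }

    static void *thread2(void *arg)
    {
        atomic_store_explicit(&b, 1, memory_order_relaxed);   /* store b=1 */
        atomic_thread_fence(memory_order_seq_cst);            /* barrier */
        printf("%d\n",
               atomic_load_explicit(&a, memory_order_relaxed)); /* print a */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

    Dekker's algorithm needs the same kind of fence between the store to
    its own flag and the read of the other thread's flag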
Chip multiprocessing (CMP)
  better performance when the lower level(s) of on-chip cache are shared
  growing interest in multicore chips with asymmetric cores (but each core
    having the same instruction set)
    Sony/Toshiba/IBM Cell processor had a PowerPC processor along with a
      number of synergistic processing elements (SPEs) with a different
      instruction set; programming the Cell for best performance was
      difficult

Multithreading (MT)

Clusters
  collection of interconnected whole computers (workstations or blades)
    used as a single resource
  disadvantages of other systems
    shared-memory SMP on a single memory bus doesn't scale well
    shared-memory NUMA is expensive, especially with the faster
      interconnects
    distributed-memory MPP is expensive because of the fast custom
      interconnection network attached to the nodes' memory buses
  advantages of clusters
    less expensive than a traditional vector supercomputer or MPP system,
      since they use commercial off-the-shelf processors and are
      interconnected by commercial LAN technology
    high aggregate performance
    easily scalable
    good availability (fault tolerance)
    ease of use - system code and special applications must be written in a
      distributed manner, but traditional user code will also run
  Beowulf clusters, from 1994
  - the original system had 16 nodes, each with a 100-MHz 486, running
    Linux and interconnected by Ethernet
  - obtained 74 MFlops at under $50K

Grid computing - cooperating computers spread across multiple LANs