Sun T1 (Niagara)

  optimized for
    - throughput in a commercial server environment (e.g., DB, web services)
    - low power (esp. for data centers)

  workload
    - large working sets, poor locality => lots of cache misses
    - data-dependent branches => lots of branch mispredicts
    - resulting ILP will be low
    - exploit TLP instead
    - there will be data sharing (and communication) among the threads,
      so use a shared L2 cache

  SMP server on chip

    8 cores, 4 threads per core = 32 threads

    each core has L1 icache (16 KB) and L1 write-through dcache (8 KB)

    all cores share a banked 3 MB L2 cache using queued crossbar interconnect
      - directory-based cache coherence
      - data sharing accomplished through L2 to eliminate invalidate
        traffic over a memory bus

    one encryption engine

    only one FPU (with 40 cycle latency)

  each core has a single-issue, six-stage pipeline

    +---------+
    | IF   TS |  ID   EX   MEM   WB
    +---------+

    TS = thread select

    separate set of IF/TS stages per thread

    separate set of store buffers per thread

    instructions are predecoded as they are placed into the icache;
      an extra bit is set for long latency instructions (load, multiply,
      divide, and branch) to cause a thread switch when issued

  TLB and cache details

    iTLB      - 64 entries, fully associative, pseudo-random replacement
    dTLB      - 64 entries, fully associative, pseudo-random replacement
    L1 icache - 16 KB, 4-way set associative, 32 byte lines, pseudo-LRU
    L1 dcache - 8 KB, 4-way set associative, 16 byte lines, pseudo-LRU,
                write-through, non-allocate
    L2 cache  - 3 MB, 12-way set associative, 64 byte lines, pseudo-LRU,
                write-back, allocate, inclusive, 64 miss buffers,
                four independent banks interleaved on line boundaries
                (effectively a four-ported cache)


Sun T2

  8 cores, 8 threads each = 64 threads

  eight FPUs, one per core

  eight encryption engines, one per core

  4 MB L2

  each core has a single-issue, eight-stage pipeline

    +--------------+
    | IF   IC   TS |  ID   EX   MEM   BP   WB
    +--------------+

    BP = bypass
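Worked example (a sketch, not from the original notes): the T1 cache parameters listed above fix the number of sets in each cache, since sets = size / (line size x ways). The Python below does that arithmetic and also shows one way "interleaved on line boundaries" could map an address to an L2 bank. The names Cache and l2_bank are hypothetical, and choosing the bank from the low-order bits of the line address is an assumption; the notes only say that the four banks are interleaved by line.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class Cache:
    """Set-associative cache geometry (illustrative helper, not a Sun API)."""
    name: str
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def lines(self) -> int:
        # Total number of cache lines the cache can hold.
        return self.size_bytes // self.line_bytes

    @property
    def sets(self) -> int:
        # Each set holds `ways` lines.
        return self.lines // self.ways


# Parameters taken from the "TLB and cache details" list for the T1.
T1_L1I = Cache("T1 L1 icache", 16 * KB, ways=4, line_bytes=32)
T1_L1D = Cache("T1 L1 dcache", 8 * KB, ways=4, line_bytes=16)
T1_L2 = Cache("T1 L2 cache", 3 * MB, ways=12, line_bytes=64)


def l2_bank(addr: int, banks: int = 4, line_bytes: int = 64) -> int:
    """Bank holding `addr`, ASSUMING consecutive lines rotate through the banks."""
    return (addr // line_bytes) % banks


if __name__ == "__main__":
    for c in (T1_L1I, T1_L1D, T1_L2):
        print(f"{c.name}: {c.lines} lines, {c.sets} sets")
    # Four consecutive 64-byte lines fall in four different banks.
    print([l2_bank(a) for a in (0, 64, 128, 192, 256)])
```

Note that 12-way associativity is what lets a 3 MB cache keep a power-of-two number of sets (4096), so the set index is still a plain bit field of the address. Under the interleaving assumed here, a sequential sweep through memory spreads its accesses evenly over the four banks, which is the sense in which the L2 behaves like a four-ported cache.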