Multithreading and Multi-Core Processors
COMP381 Tutorial 8
Instruction-Level Parallelism (ILP)
• Extracts parallelism within a single program
• Superscalar processors have multiple execution units working in parallel
• The challenge is to find enough instructions that can be executed concurrently
• Out-of-order execution => instructions are sent to execution units based on instruction dependences rather than program order
Performance Beyond ILP
• Much higher natural parallelism in some applications
 – Databases, web servers, or scientific codes
• Explicit thread-level parallelism
• Thread: has its own instructions and data
 – May be part of a parallel program or an independent program
 – Each thread has all the state (instructions, data, PC, register state, and so on) needed to execute
• Multithreading: thread-level parallelism within a processor
Thread-Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP is explicitly represented by multiple threads of execution that are inherently parallel
• Goal: use multiple instruction streams to improve
 – Throughput of computers that run many programs
 – Execution time of multi-threaded programs
• TLP can be more cost-effective to exploit than ILP
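The "multiple independent instruction streams" idea above can be sketched in software. This is a hypothetical illustration (names and workload are my own, not from the slides): four independent work items, each with its own state, run as explicit threads for throughput.

```python
# Hypothetical illustration of explicit TLP: four independent instruction
# streams, each with all the state it needs, interleaved for throughput.
from concurrent.futures import ThreadPoolExecutor

def count_multiples(start, stop, k):
    # Independent work item: no state shared between threads.
    return sum(1 for n in range(start, stop) if n % k == 0)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Four independent streams; the runtime interleaves their execution.
    futures = [pool.submit(count_multiples, 0, 100_000, k) for k in (3, 5, 7, 11)]
    results = [f.result() for f in futures]
```

Because the streams share no data, no synchronization is needed, which is exactly the "inherently parallel" case the slide describes.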
Fine-Grained Multithreading
• Switches between threads on each instruction, interleaving the execution of multiple threads
 – Usually done round-robin, skipping stalled threads
• CPU must be able to switch threads on every clock cycle
• Pro: hides the latency of both short and long stalls
 – Instructions from other threads are always available to execute
 – Easy to insert on short stalls
• Con: slows down the execution of individual threads
 – A thread ready to execute without stalls is delayed by instructions from other threads
• Used on Sun's Niagara
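The cycle-by-cycle, round-robin, skip-stalled policy can be modeled with a small toy scheduler. This is my own illustrative sketch, not hardware from the slides; `ready` is an assumed callback that says whether a thread can issue in a given cycle.

```python
# Toy model of fine-grained multithreading: one issue slot per cycle,
# round-robin over threads, skipping any thread stalled this cycle.
def fine_grained_schedule(ready, cycles, n_threads):
    """ready(tid, cycle) -> bool: can thread tid issue this cycle?
    Returns the per-cycle issue trace (thread id, or None for a bubble)."""
    trace = []
    nxt = 0  # next thread to consider, for round-robin fairness
    for c in range(cycles):
        issued = None
        for i in range(n_threads):
            tid = (nxt + i) % n_threads
            if ready(tid, c):  # skip stalled threads
                issued = tid
                nxt = (tid + 1) % n_threads
                break
        trace.append(issued)  # None = bubble: every thread was stalled
    return trace

# Thread 1 is stalled on even cycles; threads 0 and 2 are always ready.
trace = fine_grained_schedule(lambda t, c: not (t == 1 and c % 2 == 0), 6, 3)
```

The trace shows both slide points: thread 1's stalls are hidden by work from threads 0 and 2, but each individual thread advances more slowly than it would running alone.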
Coarse-Grained Multithreading
• Switches threads only on costly stalls
 – e.g., L2 cache misses
• Pro: no switching on each clock cycle
 – Relieves the need for very fast thread switching
 – No slowdown for ready-to-go threads
• Other threads issue instructions only when the main one would stall (for a long time) anyway
• Con: limited ability to hide shorter stalls
 – The pipeline must be emptied or frozen on a stall, since the CPU issues instructions from only one thread
 – A new thread must fill the pipeline before its instructions can complete
 – Thus, better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
• Used in the IBM AS/400
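The switch-on-costly-stall policy, including the pipeline-refill penalty, can also be sketched as a toy model (my own assumed model; `events` marks the cycles where the running thread takes a long-latency miss, and `refill` approximates the cycles lost refilling the pipeline after a switch).

```python
# Toy model of coarse-grained multithreading: keep issuing from the current
# thread, and switch only when it hits a costly stall (e.g., an L2 miss).
def coarse_grained_schedule(events, cycles, n_threads, refill=2):
    """events: set of (tid, cycle) pairs marking a long-latency stall.
    Returns the per-cycle issue trace (thread id, or None for a bubble)."""
    trace, cur, fill = [], 0, 0
    for c in range(cycles):
        if (cur, c) in events:
            cur = (cur + 1) % n_threads  # switch threads on the costly stall
            fill = refill                # new thread must refill the pipeline
        if fill > 0:
            trace.append(None)           # pipeline-refill bubble
            fill -= 1
        else:
            trace.append(cur)
    return trace

# Thread 0 takes an L2 miss at cycle 3; two refill bubbles, then thread 1 runs.
trace = coarse_grained_schedule({(0, 3)}, 8, 2)
```

The bubbles after the switch illustrate the slide's point: the approach only pays off for stalls much longer than the refill time.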
Multithreading Paradigms
[Figure: issue slots of four function units (FU1–FU4) over execution time, comparing five paradigms — conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP, i.e., today's multi-core processors), and simultaneous multithreading (SMT, e.g., Intel's HT). Shading distinguishes unused slots and Threads 1–5.]
Simultaneous Multithreading (SMT)
• Exploits TLP at the same time it exploits ILP
• Intel's HyperThreading (2-way SMT)
• Others: IBM Power5 and Intel's future multicore (8-core, 2-thread 45nm Nehalem)
• Basic idea: conventional MT + simultaneous issue + sharing of common resources
[Figure: SMT datapath — a shared fetch unit and I-cache feed multiple per-thread PCs; decode and a register renamer map threads onto a shared RS & ROB plus physical register file; per-thread register files feed shared function units (ALU1, ALU2, FAdd (2 cycles), FMult (4 cycles), unpipelined FDiv (16 cycles), variable-latency Load/Store) and a shared D-cache.]
Simultaneous Multithreading (SMT)
• Insight: a dynamically scheduled processor already has many HW mechanisms to support multithreading
 – A large set of virtual registers that can hold the register sets of independent threads
 – Register renaming provides unique register identifiers
  • Instructions from multiple threads can be mixed in the datapath
  • Without confusing sources and destinations across threads!
 – Out-of-order completion allows the threads to execute out of order and achieve better utilization of the HW
• Just add a per-thread renaming table and keep separate PCs
 – Independent commitment can be supported via a separate reorder buffer for each thread
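The per-thread renaming table can be sketched in a few lines. This is an assumed, highly simplified model (class and method names are my own, and real renamers also handle free-list recycling at commit): each thread gets its own map from architectural to physical registers drawn from one shared physical pool.

```python
# Sketch of per-thread register renaming: one rename table per thread over a
# shared physical register pool, so instructions from different threads can
# mix in the datapath without confusing sources and destinations.
class Renamer:
    def __init__(self, n_threads, n_phys):
        self.free = list(range(n_phys))             # shared physical registers
        self.maps = [{} for _ in range(n_threads)]  # one rename table per thread

    def rename_dest(self, tid, arch_reg):
        phys = self.free.pop(0)        # allocate a fresh physical register
        self.maps[tid][arch_reg] = phys
        return phys

    def rename_src(self, tid, arch_reg):
        return self.maps[tid][arch_reg]  # read the thread's current mapping

r = Renamer(n_threads=2, n_phys=8)
p0 = r.rename_dest(0, "r1")  # thread 0 writes architectural r1
p1 = r.rename_dest(1, "r1")  # thread 1 writes its own r1: distinct phys reg
assert p0 != p1
```

Because each thread's `r1` maps to a different physical register, the shared execution units never confuse the two threads' values, which is exactly the mechanism the slide credits for making SMT cheap to add.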
SMT Pipeline
[Figure: pipeline stages Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with per-thread PCs and register maps, a shared I-cache, D-cache, and physical registers. Data from Compaq.]
Power 4
• Single-threaded predecessor to Power 5
• 8 execution units in the out-of-order engine; each can issue an instruction every cycle
[Figure: Power 4 pipeline diagram.]
Power 5
• 2 fetch stages (2 PCs), 2 initial decode stages
• 2 commits (two architected register sets)
[Figure: Power 5 pipeline diagram — the 2-way SMT version of the Power 4 pipeline.]
Pentium 4 Hyperthreading: Performance Improvements
Multi-Core Processor
Intel Core 2 Duo
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared on-die cache memory
• Traditional I/O
[Figure: Core 2 Duo die diagram — two classic OOO cores (reservation stations, issue ports, schedulers, etc.) sharing a large, set-associative L2 cache with prefetching. Source: Intel Corp.]
• Key Core microarchitecture features: Advanced Smart Cache, Wide Dynamic Execution, Macro Fusion, Smart Memory Access