Multithreading And Multi-core Processors

Multithreading and Multi-Core Processors COMP381 Tutorial 8

Instruction-Level Parallelism (ILP) • Extract parallelism in a single program • Superscalar processors have multiple execution units working in parallel • Challenge to find enough instructions that can be executed concurrently • Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order
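The issue policy described above can be illustrated with a minimal list-scheduling sketch. This is a hypothetical model, not a real ISA or a specific processor: each instruction issues as soon as its dependencies have completed, limited only by the number of execution units per cycle.

```python
# Hypothetical sketch of out-of-order issue: instructions go to execution
# units when their operands are ready, not in program order. `units` models
# the superscalar issue width.

def schedule(instrs, units=2):
    """instrs: list of (name, set_of_dependency_names) -> {name: issue_cycle}."""
    done, cycle, issued = set(), 0, {}
    while len(done) < len(instrs):
        # An instruction is ready when all of its dependencies have issued.
        ready = [n for n, deps in instrs if n not in done and deps <= done]
        for name in ready[:units]:        # at most `units` issues per cycle
            issued[name] = cycle
            done.add(name)
        cycle += 1
    return issued

# i2 depends on i1; i3 is independent, so it can issue alongside i1.
prog = [("i1", set()), ("i2", {"i1"}), ("i3", set())]
print(schedule(prog))
```

Here i1 and i3 issue together in cycle 0 even though i3 comes after i2 in program order, which is exactly the reordering the slide describes.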

Performance Beyond ILP • Much higher natural parallelism in some applications – Database, web servers, or scientific codes

• Explicit Thread-Level Parallelism • Thread: has own instructions and data – May be part of a parallel program or independent programs – Each thread has all state (instructions, data, PC, register state, and so on) needed to execute

• Multithreading: Thread-Level Parallelism within a processor

Thread-Level Parallelism (TLP) • ILP exploits implicit parallel operations within loop or straight-line code segment • TLP is explicitly represented by multiple threads of execution that are inherently parallel • Goal: Use multiple instruction streams to improve – Throughput of computers that run many programs – Execution time of multi-threaded programs

• TLP could be more cost-effective to exploit than ILP
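The throughput goal above can be made concrete with a small Python sketch. It is only an analogy at the software level (the slides are about hardware threads): each "request" spends most of its time stalled on simulated I/O, and running the requests in threads overlaps those stalls.

```python
# Sketch: thread-level parallelism improving throughput for latency-bound
# work (e.g., a web server waiting on I/O). The sleep models a stall.
import threading
import time

def handle_request(results, i):
    time.sleep(0.05)          # simulated I/O stall
    results[i] = i * i        # small amount of actual computation

results = [None] * 8
threads = [threading.Thread(target=handle_request, args=(results, i))
           for i in range(8)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(results)                 # all 8 requests served
print(elapsed < 0.05 * 8)      # stalls overlapped: far less than 8 x 0.05s
```

Serially the eight requests would take about 0.4 s; with the stalls overlapped, total time is close to a single stall.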

Fine-Grained Multithreading • Switches between threads on each instruction, interleaving execution of multiple threads – Usually done round-robin, skipping stalled threads

• CPU must be able to switch threads on every clock cycle • Pro: Hide latency of both short and long stalls – Instructions from other threads always available to execute – Easy to insert on short stalls

• Con: Slow down execution of individual threads – Thread ready to execute without stalls will be delayed by instructions from other threads

• Used on Sun’s Niagara
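The round-robin-with-skipping policy can be sketched as a toy cycle-by-cycle model (hypothetical, one instruction slot per cycle): each cycle the scheduler rotates through the threads and issues from the first one that is not stalled.

```python
# Hypothetical model of fine-grained multithreading: round-robin over
# threads each cycle, skipping any thread that is stalled.

def fine_grained(threads, stalled_until, cycles):
    """threads: thread ids; stalled_until[t]: first cycle t is ready again."""
    trace, rr = [], 0
    for cycle in range(cycles):
        for k in range(len(threads)):   # probe threads in round-robin order
            t = threads[(rr + k) % len(threads)]
            if cycle >= stalled_until.get(t, 0):
                trace.append((cycle, t))
                rr = (threads.index(t) + 1) % len(threads)
                break
        else:
            trace.append((cycle, None))  # all threads stalled: pipeline bubble
    return trace

# Thread B is stalled until cycle 2, so A fills B's slots in the meantime.
print(fine_grained(["A", "B"], {"B": 2}, 4))
```

The trace shows both effects from the slide: B's stall is hidden by A's instructions, but once B is ready, A runs only every other cycle, i.e., individual threads are slowed down.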

Coarse-Grained Multithreading • Switches threads only on costly stalls – e.g., L2 cache misses

• Pro: No switching each clock cycle

– Relieves need to have very fast thread switching – No slow down for ready-to-go threads

• Other threads only issue instructions when the main one would stall (for long time) anyway

• Con: Limitation in hiding shorter stalls

– Pipeline must be emptied or frozen on stall, since CPU issues instructions from only one thread – New thread must fill pipe before instructions can complete – Thus, better for reducing penalty of high-cost stalls where pipeline refill << stall time

• Used in IBM AS/400
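By contrast, the coarse-grained policy can be sketched as follows (again a hypothetical one-slot-per-cycle model): the processor stays on one thread until it hits a long stall, then switches and pays a pipeline-refill penalty of bubble cycles.

```python
# Hypothetical model of coarse-grained multithreading: switch threads only
# on a long stall (e.g., an L2 miss), paying `refill` bubble cycles per switch.

def coarse_grained(miss_cycles, cycles, refill=3):
    """miss_cycles: per-thread set of cycles at which it takes a long stall."""
    trace, current, penalty = [], 0, 0
    for cycle in range(cycles):
        if penalty:                        # still refilling the pipeline
            trace.append((cycle, None))
            penalty -= 1
            continue
        if cycle in miss_cycles[current]:  # long stall: switch threads
            current = (current + 1) % len(miss_cycles)
            penalty = refill - 1
            trace.append((cycle, None))
        else:
            trace.append((cycle, current))
    return trace

# Thread 0 misses at cycle 2; after 3 bubble cycles, thread 1 takes over.
print(coarse_grained([{2}, set()], 8))
```

The bubbles in the middle of the trace are the refill penalty from the slide: worthwhile only when the stall being hidden is much longer than the refill time.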

Multithreading Paradigms

[Figure: issue-slot diagrams over four functional units (FU1–FU4) vs. execution time, with unused slots and instructions from Threads 1–5, comparing: Conventional Superscalar (single-threaded); Fine-Grained Multithreading (cycle-by-cycle interleaving); Coarse-Grained Multithreading (block interleaving); Chip Multiprocessor (CMP), called Multi-Core Processors today; and Simultaneous Multithreading (SMT), a.k.a. Intel’s HT.]

Simultaneous Multithreading (SMT) • Exploits TLP at the same time it exploits ILP • Intel’s HyperThreading (2-way SMT) • Others: IBM Power5 and Intel’s future multicore (8-core, 2-thread, 45nm Nehalem) • Basic ideas: conventional MT + simultaneous issue + sharing of common resources

[Figure: SMT microarchitecture. Per-thread PCs feed a shared Fetch Unit and I-Cache; Decode and a shared Register Renamer feed RS & ROB plus a physical register file. Shared execution units: FDiv (unpipelined, 16 cycles), FMult (4 cycles), FAdd (2 cycles), ALU1, ALU2, and Load/Store (variable latency) backed by the D-Cache. Each thread has its own architectural register file.]

Simultaneous Multithreading (SMT) • Insight: a dynamically scheduled processor already has many of the HW mechanisms needed to support multithreading – Large set of virtual registers that can hold the register sets of independent threads – Register renaming provides unique register identifiers • Instructions from multiple threads can be mixed in the datapath • Without confusing sources and destinations across threads!

– Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW

• Just add per-thread renaming table and keep separate PCs – Independent commitment can be supported via separate reorder buffer for each thread
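The per-thread renaming idea can be sketched in a few lines. This is a deliberately simplified model (no freeing of registers at retirement, FIFO free list): each (thread, architectural register) pair maps to its own physical register drawn from one shared pool, so same-named registers in different threads never collide.

```python
# Hypothetical sketch: per-thread rename tables over a shared physical
# register file, the mechanism that lets SMT mix threads in the datapath.

class Renamer:
    def __init__(self, n_phys):
        self.free = list(range(n_phys))   # shared pool of physical registers
        self.table = {}                   # (thread, arch_reg) -> phys_reg

    def rename_dest(self, thread, arch_reg):
        phys = self.free.pop(0)           # allocate a fresh physical register
        self.table[(thread, arch_reg)] = phys
        return phys

    def rename_src(self, thread, arch_reg):
        # A source always reads the physical register its own thread mapped.
        return self.table[(thread, arch_reg)]

r = Renamer(8)
p0 = r.rename_dest(0, "r1")   # thread 0 writes its r1
p1 = r.rename_dest(1, "r1")   # thread 1 also writes "r1" -- no conflict
print(p0, p1, p0 != p1)       # distinct physical registers
```

Because each thread consults only its own mapping, later reads of "r1" in thread 0 and thread 1 resolve to different physical registers, which is what keeps sources and destinations from being confused across threads.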

SMT Pipeline

[Figure: SMT pipeline stages: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire. Per-thread PCs drive fetch from the I-cache; a register map and register files sit between decode/map and execution. Data from Compaq.]

Power 4 • Single-threaded predecessor to Power 5 • 8 execution units in the out-of-order engine; each can issue an instruction each cycle

Power 5 • 2 fetch units (PCs), 2 initial decodes • 2 commits (two architected register sets)

[Figure: Power 4 vs. Power 5 pipeline diagrams.]

Pentium 4 Hyperthreading: Performance Improvements

Multi-Core Processor

Intel Core 2 Duo • Homogeneous cores • Bus-based on-chip interconnect • Shared on-die cache memory • Traditional I/O • Classic OOO core: reservation stations, issue ports, schedulers, etc. • Large, shared, set-associative cache with prefetching

[Figure: Core 2 Duo block diagram (Source: Intel Corp.), highlighting Advanced Smart Cache, Wide Dynamic Execution, Macro Fusion, and Smart Memory Access.]
