Ch6 Pipeline

  • Uploaded by: Swarup Epari
  • 0
  • 0
  • June 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Ch6 Pipeline as PDF for free.

More details

  • Words: 4,113
  • Pages: 20
Computer Architecture

Pipelining and Instruction Level Parallelism–An Introduction

Ad ap S l ide 1

ted

f

ro

m

CO

D

2e

by

H

enn

es

s

y

Outline of This Lecture Introduction to the Concept of Pipelined Processor – Pipelined Datapath and Pipelined Control – Pipeline Example: Instructions Interaction

Pipeline Hazards – Forwarding – Stalls

Introduction to Instruction Level Parallelism – Superscalar, VLIW – Out-of-order execution – Branch Prediction – Future

Ch ap ter 6 - P ip eli ni ng Ba s ics S l ide 2

&

P

at

ter

s

on

The Five Stages of Load IF: Instruction Fetch – Fetch the instruction from the Instruction Memory

RF/ID: Registers Fetch and Instruction Decode EX: Calculate the memory address MEM: Read the data from the Data Memory WB: Write the data back to the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Load IF RF/ID EX MEM WB

Ch ap ter 6 - P ip eli ni ng Ba s ics S l ide 3

Key Ideas Behind Pipelining Analogy–Grading the mid term exams: – 6 problems, six people grading the exam – Each person grades ONE problem – Pass exam to next person as soon as one finishes her part – Assume each problem takes 0.15 hour to grade • Each individual exam still takes 0.9 hours to grade • But with 6 people, all exams can be graded much quicker: – 100 exams: 90 hours, vs. 90 hrs x 6 = 540 hours

The load instruction has 5 stages:

– Five independent functional units to work on each stage • Each functional unit is used only once

– Another load can start as soon as 1st finishes its IF stage – Each load still takes five cycles to complete – The throughput, however, is much higher

A d apt ed fr o m CO D2 e by He nn es s y & P att er s on

Ch apt er 6 - Pi pe lin in g Bas i cs S li de 4

Pipelining the Load Instruction Five independent functional units in pipeline are: – Instruction Memory for the IF stage – Register file’s read ports for the RF/ID stage – ALU for the EX stage – Data Memory for the MEM stage – Register File’s Write port (bus W) for the WB stage

1 instruction enters the pipeline every cycle – 1 instruction comes out of pipeline (completes) every cycle – “Effective” Cycles per Instruction (CPI) is 1 Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Clock 1st lw IF RF/ID EX MEM WB 2nd lw IF RF/ID EX MEM WB 3rd lw

IF RF/ID EX MEM WB

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C ha pte r 6 - P i pel in in g B as ic s S lid e 5

Four Stages of R-type IF: Instruction Fetch – Fetch the instruction from the Instruction Memory

RF/ID: Registers Fetch and Instruction Decode EX: ALU operates on the two register operands WB: Write the ALU output back to the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4

R-type IF RF/ID EX WB

A da pte d f r om CO D2 e b y Hen ne s s y & Pa tte rs o n

Ch ap ter 6 - P ip eli ni ng Ba si cs S l ide 6

Pipelining R-type + Load We have a problem: – Two instructions try to write to register file at same time! Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock Ops! We have a problem!

R-type IF RF/ID EX WB R-type

IF RF/ID EX WB Load

IF RF/ID EX MEM WB R-type IF RF/ID EX WB R-type IF RF/ID EX WB

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 7

Important Observation A functional unit can be used once per instruction Each functional unit must be used at same stage for all instructions: – Load uses Register File’s Write Port during its 5th stage – • – –

12345 Load IF RF/ID EX MEM WB



– R-type uses Register File’s Write Port during its 4th stage 1234 R-type IF RF/ID EX WB

A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n

C ha pte r 6 - P i pel in in g B as ic s S lid e 8

Solution: Delay R-type WB a Cycle Delay R-type’s register write by one cycle: – R-type instructions also use Reg File’s write port at Stage 5 – MEM stage is a NOOP stage: nothing is being done 12345 R-type IF RF/ID EX WB

MEM

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type IF RF/ID EX WB MEM R-type

IF RF/ID EX WB MEM Load

IF RF/ID EX MEM WB R-type

IF RF/ID EX WBMEM R-type

IF RF/ID EX WB MEM

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 9

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

A Pipelined Datapath Clk IF RF/ID EX MEM WB ALUOp

RegWr ExtOp

Branch

1 0 Imm16 PC+4 PC+4 Rs

A

busA busB EX Imm16

Ra Rb

Rt I

Unit

RFile

Rt Rw Di Rd

1 0 RegDst

A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n

ALUSrc

Zero Data ME M WA Di RA Do

MemWr

1 0

MemtoReg C ha pte r 6 - P i pel in in g B as ic s S lid e 1 0

How About Control Signals? Control Signals at Stage N = Func (Instr. at Stage N) – N = EX, MEM, or WB

Example: Controls Signals at EX Stage – Func(Load’s EX) IF RF/ID EX MEM WB RegWr ExtOp=1

ALUOp=Add

Branch

1 0 Imm16 PC+4 Rs PC+4

A

Rt I

Zero Data ME M

busA busB Imm16 EX

Ra Rb

Unit

RFile

Rt Rw Di Rd

1 0 RegDst=0

1

WA Di RA Do

ALUSrc=1 MemWr

0

MemtoReg C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 1

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

Pipeline Control The Main Control generates the control signals during RF/ID – Control signals for EX (ExtOp, ALUSrc, ...) used 1 cycle later – Control signals for MEM (MemWr, Branch) used 2 cycles later – Control signals for WB (MemtoReg MemWr) used 3 cycles later RF/ID EX MEM ExtOp ALUSrc ALUOp

Main RegDst Control

Branch MemWr MemtoReg RegWr

A da pte d f r om CO D2 e b y Hen ne s s y & Pa tte rs o n

WB ExtOp ALUSrc ALUOp RegDst Branch MemWr MemtoReg RegWr

MemWr Branch MemtoReg RegWr

MemtoReg RegWr

Ch apt er 6 - Pi pe lin in g Bas i cs S li de 12

Single Cycle, Multi-Cycle, Pipelined Cycle 1 Clk Single Cycle Implementation:

Cycle 2

Load

Store Waste

Cycle 1Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load IF Reg EX MEM WB

Store IF Reg EX MEM

R-type IF

Pipeline Implementation: Load IF Reg EX MEM WB Store IF Reg EX MEM WB R-type IF Reg EX MEM WB A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 3

Hazards–Challenge to Pipelining Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle – structural hazards: HW cannot support this combination of

instructions

• earlier case of load and R-typ like a structural hazard, but

normally cannot fix by retiming instruction. – data hazards: instruction depends on result of prior

instruction still in the pipeline – control hazards: pipelining of branches & other

instructionsCommon solution is to stall the later part of the pipeline until the hazard pipeline

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 4

Data Hazard on r1 Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB

I

add r1,r2,r3

Im Reg Dm Reg

n s sub r4,r1,r3

t r.

and r6,r1,r7

Im Reg Dm Reg Im Reg Dm Reg

O r

d or r8,r1,r9 e r

Im

xor r10,r1,r11

Reg Dm Reg Im Reg Dm Reg

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 5

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

HW Stalls to Resolve Hazard Dependencies backwards in time are hazards – eliminate “reverse time” by a stall Time (clock cycles) I

Reg Dm Reg add r1,r2,r3 IF ID/RFIm EX MEM WB

n s sub r4, r1,r3

t r.

and r6,r1,r7

O r

d or r8,r1,r9 e r

xor r10,r1,r11

A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n

Im bubble bubble bubble

Reg Dm Reg

Im Reg Dm Im Reg Im Reg

C ha pte r 6 - P i pel in in g B as ic s S lid e 1 6

Insight: Data is available! Pipeline registers already contain needed data – “Forward” the data to the appropriate unit Time (clock cycles) IF ID/RF EX MEM WB

I

add r1,r2,r3

Im Reg Dm Reg

n s sub r4,r1,r3

t r.

and r6,r1,r7

O r

d or r8,r1,r9 e r

xor r10,r1,r11

Im Reg Dm Reg Im Reg Dm Reg Im

Reg Dm Reg Im Reg Dm Reg

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 7

HW for “Forwarding” (Bypassing) Increase multiplexors to add paths from registers – Assumes register read during write gets new value

(otherwise more results to be forwarded)

A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n

C ha pte r 6 - P i pel in in g B as ic s S lid e 1 8

Forwarding Cannot Hide All Hazards Time (clock cycles) IF ID/RF EX MEM WB

I

lw r1, 0(r2)

n s sub r4,r1,r6

t r.

O r

Im Reg Dm Reg Im Reg Dm Reg

and r6,r1,r7

Im Reg Dm Reg

d or r8,r1,r9 e

Im

Reg Dm Reg

r

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 9

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

Option: HW Stalls to Resolve Hazard “Interlock”: checks for hazard & stalls Time (clock cycles) IF ID/RF EX MEM WB

I

n s

t r.

O r

lw r1, 0(r2)

Im Reg Dm Reg

stall sub r4,r1,r3

d and r6,r1,r7 e r

or r8,r1,r9

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

Im

bubble bubble bubble bubble Im Reg Dm Reg Im Reg Dm Reg Im

Reg Dm Reg

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 0

Option: SW resolves hazard SW inserts independent instuctions – Worst case: performance no better/worse Time (clock cycles) IF ID/RF EX MEM WB

I

n s

t r.

O r

lw r1, 0(r2)

Im Reg Dm Reg

unrelated instruction

sub r4,r1,r3

d and r6,r1,r7 e r

or r8,r1,r9

Im Reg Dm Reg Im Reg Dm Reg Im Reg Dm Reg Im

Reg Dm Reg

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 1

Control Hazard on Branches

A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n

C ha pte r 6 - P i pel in in g B as ic s S lid e 2 2

Hazards on Branches Time (clock cycles) IF ID/RF EX MEM WB

beq r1,r2,L

Im Reg Dm Reg

sub r4,r1,r3

Im Reg Dm Reg

and r6,r2,r7

Im Reg Dm Reg

or r8,r7,r9

Im

Reg Dm Reg

L: add r1,r2,r1 Stall for two cycles on every branch! A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 3

Branch Stall Impact CPI Impact: – If CPI = 1, 30% branch, Stall 2 cycles => new CPI = 1.6!

Reducing the branch penalty – MIPS branch already more aggressive than most – limited eq/neq allows us to determine branch condition early

(after EX), rather than later (e.g., after MEM) – doing better • use separate comparator rather than ALU and move branch

decision to RF (hard!!!)

• reduces penalty to 1 cycle

Going further

– Variety of techniques: • separating branch and destination • separating branch condition and branch decision • hardware prediction of branche

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 4

When is pipelining hard? Interrupts: 5 instructions executing in 5 stage pipeline – How to stop the pipeline? – Restrart? – Who caused the interrupt?

Stage Problem interrupts occurring IF Page fault on instruction fetch; misaligned memory access; memory-protection violation ID Undefined or illegal opcode EX Arithmetic interrupt MEM Page fault on data fetch; misaligned memory access; memory-protection violation Load with data page fault, Add with instruction page fault? Solution 1: interrupt vector/instruction, restart everything incomplete

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 5

First Generation RISC Pipelines All instructions: 1 pipeline order (“static schedule”). Register write in last stage + reads performed in first stage after issue. – Simpliy/eliminate hazards

Memory access in stage 4 – Avoid all memory hazards

Control hazards use delayed branch (with fast path) RAW hazards use bypass, except on load results – Load resolved by delayed load or stall

Good pipeline performance at little cost/complexity.

Ad apt ed f ro m CO D 2e by H en nes s y & P at ter s o n

C hap ter 6 - P ip el in ing B as ic s S lid e 2 6

Summary of Pipelining Basics Speed Up = Pipeline Depth Hazards limit performance on computers: – structural: need more HW resources – data: need forwarding, compiler scheduling – control: early evaluation & PC, delayed branch, prediction

Increasing length of pipe increases hazards – since pipelining helps instruction bandwidth, not latency

Compilers can reduce cost of data & control hazards – load delay slots – branch delay slots

Exceptions (also FP, ISA) make pipelining harder

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 7

Advanced Pipelining Pipelining exploits parallelism among instructions by overlapping them – Called Instruction Level Parallelism (ILP) – Limited by a variety of things: • parallelism in the program • compiler technology in exposing parallelism • functional unit capability: how many ovrlapping instructions • ability of hardware to find instructions to run in parallel

Exploiting ILP is “hot topic” in processor design: – Lots of different approaches • Multiple instuctions/cycle – compiler vs. HW for scheduling instructions

• Both architecture approaches and compiler approaches

A d apt ed fr o m CO D2 e by He nn es s y & P att er s on

C ha pte r 6 - P i pel in in g B as ic s S li de 28

Exploiting Available ILP Technique Pipelining

IF D Ex M W IF D Ex M W IF D Ex M W IF D Ex M W

Super-scalar Issue multiple scalar instructions per cycle

VLIW Each instruction specifies multiple scalar operations

IF D Ex M W IF D Ex M W IF D Ex M W IF D Ex M W IF D Ex M W Ex M W Ex M W Ex M W

HW Limitation Issue rate, FU stalls, FU depth Hazard resolution

Packing

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 9

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

Easy Superscalar I-Cache Int Reg Inst Issue and Bypass FP Reg

Int Unit

Load / Store Unit

FP Add FP Mul

D-Cache

Issue integer and FP operations in parallel! – potential hazards? – expected speedup? – what combinations of instructions make sense? A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 0

Issuing Multiple Instruction/ Cycle Superscalar: 2 instructions, 1 FP & 1 anything else – Fetch 64-bits/clock cycle; Int on left, FP on right – Can only issue 2nd instruction if 1st instruction issues – More ports for FP registers to do FP load & FP op in a pair

Type Pipe Stages Int. instruction IF ID EX MEM W FP instruction IF ID EX MEM WB Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB

1 cycle load delay expands to 3 instruction in SS – instruction in right half can’t use result, nor can either

instruction in next slot

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 1

Dynamic Branch Prediction Predict direction of branches on past behavior – keep a cache of branch behavior, look up prediction

Performance = f(accuracy, cost of misprediction) Branch prediction buffer: – lower bits of PC address index table of 1-bit values – says whether or not branch taken last time – evaluate actual banch condition, if prediction incorrect: • recover by flushing pipeline, restarting fetch • reset prediction

A da pte d f r om CO D2 e b y Hen ne s s y & Pa tte rs o n

Ch apt er 6 - Pi pe lin in g Bas i cs S li de 32

Speculative Superscalar Execution look ahead and prefetch instructions

Get all available parallelism – across branches

Instruction Fetch Decode

– in face cache misses – limited only by data dependences

Goal: resources and available bandwidth are only HW limit Branch prediction

Instruction Window

– execute instructions speculatively

Hazard detection and aggressive resolution

Execution Units

– out-of-order execution (dynamic

scheduling)

– in-order completion

Issue multiple instructions to Execution Units when inputs are available

• Exception handling easier • handles incorrect speculation C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 3

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

Variety of Modern Microprocessor Processor Instruction Completion Rate PowerPC 604 4 Dynamic,

Scheduling of pipeline

nonspeculative

MIPS R10000 4 Dynamic,

Branch prediction HW

HW speculative

Pentium II 4 Dynamic,

HW nonspeculative

UltraSPARC 4 Static HW

Merced ? Static? Static? A da pt ed fr o m CO D2 e b y He nn es s y & P att er s on

C hap ter 6 - P ip el ini ng B as ics Sl id e 3 4

Limits to Multi-Issue Machines Inherent limitations of ILP – 1 branch in 5 => 5-way VLIW busy? – Latencies of units => many operations must be scheduled – Need about Pipeline Depth x No. Functional Units of

independentDifficulties in building HW – Duplicate FUs to get parallel execution – Increase ports to Register File (3 x integer/FP rate) – Increase ports to memory – Decoding challenge and impact on clock rate, pipeline depth

Limitations specific to either SS or VLIW implementation – Decode issue in SS – VLIW code size: unroll loops + wasted fields in VLIW – VLIW lock step => 1 hazard & all instructions stall – VLIW & binary compatibility

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 5

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

Summary Instruction Level Parallelism in SW or HW Loop level parallelism is easiest to see SW dependencies/Compiler sophistication determine if compiler can unroll loops SW Scheduling HW scheduling Branch Prediction SuperScalar and VLIW – CPI < 1 – Dynamic issue vs. Static issue – More instructions issue/clock, larger penalty of hazards

Future? Stay tuned…

A da pte d f r om CO D2 e b y Hen ne s s y & Pa tte rs o n

Ch apt er 6 - Pi pe lin in g Bas i cs S li de 36

Duplicate to Resolve Hazard Single Memory=>Structural Hazard Time (clock Separate Instruction Cache (Im) & cycles) Data Cache (Dm) Time (clock cycles) I I

n Load MEM Reg MEM Reg s Im Reg Dm RegReg MEM Reg MEM

n t Load Instr 1 s r.

t Instr 1 r. O Instr 2 r Instr 2 3 O d Instr r e d r

Im Reg DmMEM Reg Reg MEM Reg Im Reg DmMEM Reg

Instr 3 4 Instr e r Instr 4

Im

Reg MEM Reg

RegMEM Dm Reg Reg MEM Reg Im Reg Dm Reg

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 97

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

Stall to resolve Structural Hazard Time (clock cycles)

I

n Load s

t r.

O r

Instr 1

MEM Reg MEM Reg MEM Reg MEM Reg

Instr 2

MEM Reg MEM Reg

d Instr 3(stall) e r

Instr 4

A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n

bubble MEM Reg MEM Reg MEM Reg MEM Reg

C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 8

Related Documents

Ch6 Pipeline
June 2020 47
Ch6
November 2019 42
Ch6
November 2019 42
Ch6
October 2019 33
Ch6
April 2020 31
Ch6
October 2019 29

More Documents from ""