Computer Architecture
Pipelining and Instruction Level Parallelism–An Introduction
Ad ap S l ide 1
ted
f
ro
m
CO
D
2e
by
H
enn
es
s
y
Outline of This Lecture Introduction to the Concept of Pipelined Processor – Pipelined Datapath and Pipelined Control – Pipeline Example: Instructions Interaction
Pipeline Hazards – Forwarding – Stalls
Introduction to Instruction Level Parallelism – Superscalar, VLIW – Out-of-order execution – Branch Prediction – Future
Ch ap ter 6 - P ip eli ni ng Ba s ics S l ide 2
&
P
at
ter
s
on
The Five Stages of Load IF: Instruction Fetch – Fetch the instruction from the Instruction Memory
RF/ID: Registers Fetch and Instruction Decode EX: Calculate the memory address MEM: Read the data from the Data Memory WB: Write the data back to the register file
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Load IF RF/ID EX MEM WB
Ch ap ter 6 - P ip eli ni ng Ba s ics S l ide 3
Key Ideas Behind Pipelining Analogy–Grading the mid term exams: – 6 problems, six people grading the exam – Each person grades ONE problem – Pass exam to next person as soon as one finishes her part – Assume each problem takes 0.15 hour to grade • Each individual exam still takes 0.9 hours to grade • But with 6 people, all exams can be graded much quicker: – 100 exams: 90 hours, vs. 90 hrs x 6 = 540 hours
The load instruction has 5 stages:
– Five independent functional units to work on each stage • Each functional unit is used only once
– Another load can start as soon as 1st finishes its IF stage – Each load still takes five cycles to complete – The throughput, however, is much higher
A d apt ed fr o m CO D2 e by He nn es s y & P att er s on
Ch apt er 6 - Pi pe lin in g Bas i cs S li de 4
Pipelining the Load Instruction Five independent functional units in pipeline are: – Instruction Memory for the IF stage – Register file’s read ports for the RF/ID stage – ALU for the EX stage – Data Memory for the MEM stage – Register File’s Write port (bus W) for the WB stage
1 instruction enters the pipeline every cycle – 1 instruction comes out of pipeline (completes) every cycle – “Effective” Cycles per Instruction (CPI) is 1 Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Clock 1st lw IF RF/ID EX MEM WB 2nd lw IF RF/ID EX MEM WB 3rd lw
IF RF/ID EX MEM WB
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C ha pte r 6 - P i pel in in g B as ic s S lid e 5
Four Stages of R-type IF: Instruction Fetch – Fetch the instruction from the Instruction Memory
RF/ID: Registers Fetch and Instruction Decode EX: ALU operates on the two register operands WB: Write the ALU output back to the register file
Cycle 1 Cycle 2 Cycle 3 Cycle 4
R-type IF RF/ID EX WB
A da pte d f r om CO D2 e b y Hen ne s s y & Pa tte rs o n
Ch ap ter 6 - P ip eli ni ng Ba si cs S l ide 6
Pipelining R-type + Load We have a problem: – Two instructions try to write to register file at same time! Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock Ops! We have a problem!
R-type IF RF/ID EX WB R-type
IF RF/ID EX WB Load
IF RF/ID EX MEM WB R-type IF RF/ID EX WB R-type IF RF/ID EX WB
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 7
Important Observation A functional unit can be used once per instruction Each functional unit must be used at same stage for all instructions: – Load uses Register File’s Write Port during its 5th stage – • – –
12345 Load IF RF/ID EX MEM WB
–
– R-type uses Register File’s Write Port during its 4th stage 1234 R-type IF RF/ID EX WB
A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n
C ha pte r 6 - P i pel in in g B as ic s S lid e 8
Solution: Delay R-type WB a Cycle Delay R-type’s register write by one cycle: – R-type instructions also use Reg File’s write port at Stage 5 – MEM stage is a NOOP stage: nothing is being done 12345 R-type IF RF/ID EX WB
MEM
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type IF RF/ID EX WB MEM R-type
IF RF/ID EX WB MEM Load
IF RF/ID EX MEM WB R-type
IF RF/ID EX WBMEM R-type
IF RF/ID EX WB MEM
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 9
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
A Pipelined Datapath Clk IF RF/ID EX MEM WB ALUOp
RegWr ExtOp
Branch
1 0 Imm16 PC+4 PC+4 Rs
A
busA busB EX Imm16
Ra Rb
Rt I
Unit
RFile
Rt Rw Di Rd
1 0 RegDst
A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n
ALUSrc
Zero Data ME M WA Di RA Do
MemWr
1 0
MemtoReg C ha pte r 6 - P i pel in in g B as ic s S lid e 1 0
How About Control Signals? Control Signals at Stage N = Func (Instr. at Stage N) – N = EX, MEM, or WB
Example: Controls Signals at EX Stage – Func(Load’s EX) IF RF/ID EX MEM WB RegWr ExtOp=1
ALUOp=Add
Branch
1 0 Imm16 PC+4 Rs PC+4
A
Rt I
Zero Data ME M
busA busB Imm16 EX
Ra Rb
Unit
RFile
Rt Rw Di Rd
1 0 RegDst=0
1
WA Di RA Do
ALUSrc=1 MemWr
0
MemtoReg C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 1
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
Pipeline Control The Main Control generates the control signals during RF/ID – Control signals for EX (ExtOp, ALUSrc, ...) used 1 cycle later – Control signals for MEM (MemWr, Branch) used 2 cycles later – Control signals for WB (MemtoReg MemWr) used 3 cycles later RF/ID EX MEM ExtOp ALUSrc ALUOp
Main RegDst Control
Branch MemWr MemtoReg RegWr
A da pte d f r om CO D2 e b y Hen ne s s y & Pa tte rs o n
WB ExtOp ALUSrc ALUOp RegDst Branch MemWr MemtoReg RegWr
MemWr Branch MemtoReg RegWr
MemtoReg RegWr
Ch apt er 6 - Pi pe lin in g Bas i cs S li de 12
Single Cycle, Multi-Cycle, Pipelined Cycle 1 Clk Single Cycle Implementation:
Cycle 2
Load
Store Waste
Cycle 1Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load IF Reg EX MEM WB
Store IF Reg EX MEM
R-type IF
Pipeline Implementation: Load IF Reg EX MEM WB Store IF Reg EX MEM WB R-type IF Reg EX MEM WB A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 3
Hazards–Challenge to Pipelining Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle – structural hazards: HW cannot support this combination of
instructions
• earlier case of load and R-typ like a structural hazard, but
normally cannot fix by retiming instruction. – data hazards: instruction depends on result of prior
instruction still in the pipeline – control hazards: pipelining of branches & other
instructionsCommon solution is to stall the later part of the pipeline until the hazard pipeline
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 4
Data Hazard on r1 Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB
I
add r1,r2,r3
Im Reg Dm Reg
n s sub r4,r1,r3
t r.
and r6,r1,r7
Im Reg Dm Reg Im Reg Dm Reg
O r
d or r8,r1,r9 e r
Im
xor r10,r1,r11
Reg Dm Reg Im Reg Dm Reg
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 5
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
HW Stalls to Resolve Hazard Dependencies backwards in time are hazards – eliminate “reverse time” by a stall Time (clock cycles) I
Reg Dm Reg add r1,r2,r3 IF ID/RFIm EX MEM WB
n s sub r4, r1,r3
t r.
and r6,r1,r7
O r
d or r8,r1,r9 e r
xor r10,r1,r11
A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n
Im bubble bubble bubble
Reg Dm Reg
Im Reg Dm Im Reg Im Reg
C ha pte r 6 - P i pel in in g B as ic s S lid e 1 6
Insight: Data is available! Pipeline registers already contain needed data – “Forward” the data to the appropriate unit Time (clock cycles) IF ID/RF EX MEM WB
I
add r1,r2,r3
Im Reg Dm Reg
n s sub r4,r1,r3
t r.
and r6,r1,r7
O r
d or r8,r1,r9 e r
xor r10,r1,r11
Im Reg Dm Reg Im Reg Dm Reg Im
Reg Dm Reg Im Reg Dm Reg
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 7
HW for “Forwarding” (Bypassing) Increase multiplexors to add paths from registers – Assumes register read during write gets new value
(otherwise more results to be forwarded)
A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n
C ha pte r 6 - P i pel in in g B as ic s S lid e 1 8
Forwarding Cannot Hide All Hazards Time (clock cycles) IF ID/RF EX MEM WB
I
lw r1, 0(r2)
n s sub r4,r1,r6
t r.
O r
Im Reg Dm Reg Im Reg Dm Reg
and r6,r1,r7
Im Reg Dm Reg
d or r8,r1,r9 e
Im
Reg Dm Reg
r
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 1 9
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
Option: HW Stalls to Resolve Hazard “Interlock”: checks for hazard & stalls Time (clock cycles) IF ID/RF EX MEM WB
I
n s
t r.
O r
lw r1, 0(r2)
Im Reg Dm Reg
stall sub r4,r1,r3
d and r6,r1,r7 e r
or r8,r1,r9
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
Im
bubble bubble bubble bubble Im Reg Dm Reg Im Reg Dm Reg Im
Reg Dm Reg
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 0
Option: SW resolves hazard SW inserts independent instuctions – Worst case: performance no better/worse Time (clock cycles) IF ID/RF EX MEM WB
I
n s
t r.
O r
lw r1, 0(r2)
Im Reg Dm Reg
unrelated instruction
sub r4,r1,r3
d and r6,r1,r7 e r
or r8,r1,r9
Im Reg Dm Reg Im Reg Dm Reg Im Reg Dm Reg Im
Reg Dm Reg
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 1
Control Hazard on Branches
A dap te d f r om C OD 2 e b y H en ne s s y & Pa tte rs o n
C ha pte r 6 - P i pel in in g B as ic s S lid e 2 2
Hazards on Branches Time (clock cycles) IF ID/RF EX MEM WB
beq r1,r2,L
Im Reg Dm Reg
sub r4,r1,r3
Im Reg Dm Reg
and r6,r2,r7
Im Reg Dm Reg
or r8,r7,r9
Im
Reg Dm Reg
L: add r1,r2,r1 Stall for two cycles on every branch! A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 3
Branch Stall Impact CPI Impact: – If CPI = 1, 30% branch, Stall 2 cycles => new CPI = 1.6!
Reducing the branch penalty – MIPS branch already more aggressive than most – limited eq/neq allows us to determine branch condition early
(after EX), rather than later (e.g., after MEM) – doing better • use separate comparator rather than ALU and move branch
decision to RF (hard!!!)
• reduces penalty to 1 cycle
Going further
– Variety of techniques: • separating branch and destination • separating branch condition and branch decision • hardware prediction of branche
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 4
When is pipelining hard? Interrupts: 5 instructions executing in 5 stage pipeline – How to stop the pipeline? – Restrart? – Who caused the interrupt?
Stage Problem interrupts occurring IF Page fault on instruction fetch; misaligned memory access; memory-protection violation ID Undefined or illegal opcode EX Arithmetic interrupt MEM Page fault on data fetch; misaligned memory access; memory-protection violation Load with data page fault, Add with instruction page fault? Solution 1: interrupt vector/instruction, restart everything incomplete
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 5
First Generation RISC Pipelines All instructions: 1 pipeline order (“static schedule”). Register write in last stage + reads performed in first stage after issue. – Simpliy/eliminate hazards
Memory access in stage 4 – Avoid all memory hazards
Control hazards use delayed branch (with fast path) RAW hazards use bypass, except on load results – Load resolved by delayed load or stall
Good pipeline performance at little cost/complexity.
Ad apt ed f ro m CO D 2e by H en nes s y & P at ter s o n
C hap ter 6 - P ip el in ing B as ic s S lid e 2 6
Summary of Pipelining Basics Speed Up = Pipeline Depth Hazards limit performance on computers: – structural: need more HW resources – data: need forwarding, compiler scheduling – control: early evaluation & PC, delayed branch, prediction
Increasing length of pipe increases hazards – since pipelining helps instruction bandwidth, not latency
Compilers can reduce cost of data & control hazards – load delay slots – branch delay slots
Exceptions (also FP, ISA) make pipelining harder
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 7
Advanced Pipelining Pipelining exploits parallelism among instructions by overlapping them – Called Instruction Level Parallelism (ILP) – Limited by a variety of things: • parallelism in the program • compiler technology in exposing parallelism • functional unit capability: how many ovrlapping instructions • ability of hardware to find instructions to run in parallel
Exploiting ILP is “hot topic” in processor design: – Lots of different approaches • Multiple instuctions/cycle – compiler vs. HW for scheduling instructions
• Both architecture approaches and compiler approaches
A d apt ed fr o m CO D2 e by He nn es s y & P att er s on
C ha pte r 6 - P i pel in in g B as ic s S li de 28
Exploiting Available ILP Technique Pipelining
IF D Ex M W IF D Ex M W IF D Ex M W IF D Ex M W
Super-scalar Issue multiple scalar instructions per cycle
VLIW Each instruction specifies multiple scalar operations
IF D Ex M W IF D Ex M W IF D Ex M W IF D Ex M W IF D Ex M W Ex M W Ex M W Ex M W
HW Limitation Issue rate, FU stalls, FU depth Hazard resolution
Packing
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 2 9
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
Easy Superscalar I-Cache Int Reg Inst Issue and Bypass FP Reg
Int Unit
Load / Store Unit
FP Add FP Mul
D-Cache
Issue integer and FP operations in parallel! – potential hazards? – expected speedup? – what combinations of instructions make sense? A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 0
Issuing Multiple Instruction/ Cycle Superscalar: 2 instructions, 1 FP & 1 anything else – Fetch 64-bits/clock cycle; Int on left, FP on right – Can only issue 2nd instruction if 1st instruction issues – More ports for FP registers to do FP load & FP op in a pair
Type Pipe Stages Int. instruction IF ID EX MEM W FP instruction IF ID EX MEM WB Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB
1 cycle load delay expands to 3 instruction in SS – instruction in right half can’t use result, nor can either
instruction in next slot
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 1
Dynamic Branch Prediction Predict direction of branches on past behavior – keep a cache of branch behavior, look up prediction
Performance = f(accuracy, cost of misprediction) Branch prediction buffer: – lower bits of PC address index table of 1-bit values – says whether or not branch taken last time – evaluate actual banch condition, if prediction incorrect: • recover by flushing pipeline, restarting fetch • reset prediction
A da pte d f r om CO D2 e b y Hen ne s s y & Pa tte rs o n
Ch apt er 6 - Pi pe lin in g Bas i cs S li de 32
Speculative Superscalar Execution look ahead and prefetch instructions
Get all available parallelism – across branches
Instruction Fetch Decode
– in face cache misses – limited only by data dependences
Goal: resources and available bandwidth are only HW limit Branch prediction
Instruction Window
– execute instructions speculatively
Hazard detection and aggressive resolution
Execution Units
– out-of-order execution (dynamic
scheduling)
– in-order completion
Issue multiple instructions to Execution Units when inputs are available
• Exception handling easier • handles incorrect speculation C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 3
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
Variety of Modern Microprocessor Processor Instruction Completion Rate PowerPC 604 4 Dynamic,
Scheduling of pipeline
nonspeculative
MIPS R10000 4 Dynamic,
Branch prediction HW
HW speculative
Pentium II 4 Dynamic,
HW nonspeculative
UltraSPARC 4 Static HW
Merced ? Static? Static? A da pt ed fr o m CO D2 e b y He nn es s y & P att er s on
C hap ter 6 - P ip el ini ng B as ics Sl id e 3 4
Limits to Multi-Issue Machines Inherent limitations of ILP – 1 branch in 5 => 5-way VLIW busy? – Latencies of units => many operations must be scheduled – Need about Pipeline Depth x No. Functional Units of
independentDifficulties in building HW – Duplicate FUs to get parallel execution – Increase ports to Register File (3 x integer/FP rate) – Increase ports to memory – Decoding challenge and impact on clock rate, pipeline depth
Limitations specific to either SS or VLIW implementation – Decode issue in SS – VLIW code size: unroll loops + wasted fields in VLIW – VLIW lock step => 1 hazard & all instructions stall – VLIW & binary compatibility
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 5
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
Summary Instruction Level Parallelism in SW or HW Loop level parallelism is easiest to see SW dependencies/Compiler sophistication determine if compiler can unroll loops SW Scheduling HW scheduling Branch Prediction SuperScalar and VLIW – CPI < 1 – Dynamic issue vs. Static issue – More instructions issue/clock, larger penalty of hazards
Future? Stay tuned…
A da pte d f r om CO D2 e b y Hen ne s s y & Pa tte rs o n
Ch apt er 6 - Pi pe lin in g Bas i cs S li de 36
Duplicate to Resolve Hazard Single Memory=>Structural Hazard Time (clock Separate Instruction Cache (Im) & cycles) Data Cache (Dm) Time (clock cycles) I I
n Load MEM Reg MEM Reg s Im Reg Dm RegReg MEM Reg MEM
n t Load Instr 1 s r.
t Instr 1 r. O Instr 2 r Instr 2 3 O d Instr r e d r
Im Reg DmMEM Reg Reg MEM Reg Im Reg DmMEM Reg
Instr 3 4 Instr e r Instr 4
Im
Reg MEM Reg
RegMEM Dm Reg Reg MEM Reg Im Reg Dm Reg
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 97
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
Stall to resolve Structural Hazard Time (clock cycles)
I
n Load s
t r.
O r
Instr 1
MEM Reg MEM Reg MEM Reg MEM Reg
Instr 2
MEM Reg MEM Reg
d Instr 3(stall) e r
Instr 4
A dap te d f r om C OD 2 e b y H en ne ss y & P at ter s o n
bubble MEM Reg MEM Reg MEM Reg MEM Reg
C hap ter 6 - P ip eli ni ng Ba s ics Sl id e 3 8