Computer Architecture CS1601
(Dr. Hari T.S. Narayanan)
Fundamentals of Computer Design
Unit 1
Contents
Measuring and Reporting Performance
Quantitative Principles of Computer Design
Classifying Instruction Set Architectures
Memory Addressing
Addressing Modes
Type and Size of Operands
Operations in the Instruction Set
Operands and Operations for Signal Processing
Instructions for Control Flow
Encoding an Instruction Set
Example Architectures – MIPS and TM32
Measuring and Reporting Performance
Response Time or Execution Time
Throughput
Benchmarks are used to measure and compare the performance of different systems
Response Time Break up
Response time: same as elapsed time or wall-clock time. It includes the following components:
User time
System time (or privileged time)
Waiting time and I/O time
Finding User time in Windows
Choosing Programs to Evaluate Performance
1. Real applications (C compiler, text processing)
2. Modified (or scripted) applications
3. Kernels (Linpack, Livermore Loops)
4. Toy benchmarks (Quicksort, Sieve of Eratosthenes)
5. Synthetic benchmarks (Whetstone & Dhrystone)
Comparing & Summarizing Response Time

Computer | Program 1 | Program 2 | Total
A        | 1         | 1000      | 1001
B        | 10        | 100       | 110
C        | 20        | 20        | 40

Ways to summarize: Total Execution Time, Weighted Execution Time, Normalized Execution Time with respect to a benchmark
Problem

Computer | P1 | P2   | Total
A        | 1  | 1000 | 1001
B        | 10 | 100  | 110
C        | 20 | 20   | 40

Compare the performance for the following scenario: for every P2 there are 10 P1s.
Some Related Terminologies
Million Instructions Per Second (MIPS)
FLOPS (Floating-Point Operations Per Second)
TeraFLOPS
Quantitative Principles
Principle of locality
Make the common case fast
Amdahl's Law
CPU performance equation
Take advantage of parallelism
Principle of Locality
Spatial locality
Temporal locality
80-20 rule
Applies to instructions and data
More applicable to instructions than data!
[Diagram: instruction and data accesses (add, sto, data values) plotted against time and memory space, illustrating temporal locality (the same locations reused over time) and spatial locality (nearby locations used together)]
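A minimal sketch (not from the slides) of how the two kinds of locality show up in ordinary C code; the array size and traversal order are illustrative choices.

/* Spatial and temporal locality in a simple array traversal. */
#include <stdio.h>

#define N 1024

static int a[N][N];

int main(void) {
    long sum = 0;

    /* Spatial locality: row-major traversal touches consecutive addresses,
       so each cache line fetched is fully used before moving on. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Temporal locality: the loop counters and 'sum' are reused on every
       iteration, so they stay in registers or the nearest cache level. */
    printf("%ld\n", sum);
    return 0;
}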
Amdahl’s Law
Performance gain (speedup)
= Performance for entire task with enhancement / Performance for entire task without enhancement
or
= Execution time of the entire task without enhancement / Execution time of the entire task with enhancement
Note: Performance ∝ 1 / Execution time
Enhancing the Execution time of a task
Each task is divided into two parts
Part A – whose performance cannot be improved/enhanced
Part B – whose performance can be improved/enhanced
Amdahl’s law relies on this classification
Amdahl's Law – Example
Problem from page 41 of your textbook:
Total time for normal execution = 0.6 + 0.4 = 1.0 s
Fraction-enhanced = 0.4
Speedup-enhanced = 10
Total time for enhanced execution = 0.6 + 0.4/10 = 0.64 s
Speedup-overall = 1.0/0.64 ≈ 1.56
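A small sketch of the same calculation in C; the function name amdahl_speedup is mine, and the numbers are just the page-41 example above.

#include <stdio.h>

/* fraction_enhanced: portion of the original execution time that benefits
   speedup_enhanced:  how much faster that portion becomes */
double amdahl_speedup(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

int main(void) {
    /* 0.4 of the time sped up 10x -> overall speedup ~1.56 */
    printf("%.2f\n", amdahl_speedup(0.4, 10.0));
    /* Letting the enhancement speedup grow without bound approaches 1/0.6 ~ 1.67 */
    printf("%.2f\n", amdahl_speedup(0.4, 1e9));
    return 0;
}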
Amdahl’s Law - Review
Let Speedup-enhanced be n. As we increase n, 0.4/n approaches 0. Thus the limit of the overall speedup = 1/(0.6 + 0.4/n) → 1/0.6 ≈ 1.67
Short Problems
1. A task with Part A – 0.9, Part B – 0.1
2. Another task with Part A – 0.1, Part B – 0.9
The CPU Performance Equations
1. CPU time for a program = CPU clock cycles × Clock cycle time
2. CPU time = CPU clock cycles for a program / Clock rate
3. CPIavg = CPU clock cycles for a program / Instruction count   (CPI – Clock Cycles Per Instruction)
4. CPU time = (Instructions/Program) × (Clock cycles/Instruction) × (Seconds/Clock cycle) = Seconds/Program
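A minimal sketch of equation 4 in C; the instruction count, CPI, and clock rate below are invented values for illustration only.

#include <stdio.h>

int main(void) {
    double instruction_count = 2e9;   /* instructions per program (assumed) */
    double cpi_avg = 1.5;             /* average clock cycles per instruction (assumed) */
    double clock_rate_hz = 2e9;       /* 2 GHz clock */

    /* CPU time = IC x CPI x clock cycle time = IC x CPI / clock rate */
    double cpu_time = instruction_count * cpi_avg / clock_rate_hz;
    printf("CPU time = %.2f s\n", cpu_time);   /* 2e9 * 1.5 / 2e9 = 1.50 s */
    return 0;
}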
Take Advantage of Parallelism
Instruction-level parallelism using pipelines
Memory access using multiple banks
Parallel computation of two or more possible outcomes
Assignment
1. Discuss why MIPS is not an accurate measure for comparing performance among computers.
2. Software design patterns
3. Problem 1.16 from your textbook
4. Problem on pp. 44-45 – do this problem in class without referring to your book
Instruction Set Architecture
Unit I
Content
Internal Storage
Reading
Memory Addressing
Assignment
Addressing Modes for Signal Processing
Types and Size of Operands
Operands for Media and Signal Processing
Operations in the Instruction Set
Operations for Media and Signal Processing
Instructions for Control Flow
Encoding an Instruction Set
Role of Compilers
Internal Storage
[Diagram: the four ISA classes – stack (operands at top of stack, TOS), accumulator (one operand in the ACC), register-memory, and register-register/load-store – showing for each where the ALU's operands come from (stack, accumulator, registers, or memory)]
Code Sequence with Different Internal Storage (C = A + B)

Stack   | Accumulator | Register (register-memory) | Register (load-store)
Push A  | Load A      | Load R1,A                  | Load R1,A
Push B  | Add B       | Add R3,R1,B                | Load R2,B
Add     | Store C     | Store R3,C                 | Add R3,R1,R2
Pop C   |             |                            | Store R3,C
Memory Operands and Register Operands

Number of memory addresses | Maximum number of operands allowed | Type of architecture | Examples
0 | 3 | Register-register | Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, TriMedia TM5200
1 | 2 | Register-memory   | IBM 360/370, Intel 80x86, Motorola 68000, TI TMS320C54x
2 | 2 | Memory-memory     | VAX (also has three-operand formats)
3 | 3 | Memory-memory     | VAX (also has two-operand formats)
Comparison of Reg-Reg with Reg-Memory

Type | Advantages | Disadvantages
Register-register (0,3) | Simple, fixed-length instruction encoding; simple code-generation model; instructions take similar numbers of clocks to execute | Higher instruction count than architectures with memory references in instructions; more instructions and lower instruction density lead to larger programs
Register-memory (1,2) | Data can be accessed without a separate load instruction first; the format tends to be easy to encode and yields good density | Operands are not equivalent, since a source operand in a binary operation is destroyed; encoding a register number and a memory address in each instruction may restrict the number of registers; clocks per instruction vary by operand location
Memory Addressing
Byte Order
Big Endian
Little Endian
Network Byte Order (Big Endian)
Addressing Modes
Little-endian and Big-endian: the 16-bit value is 0x0201

Little Endian: Address 0 → 0x01 (byte 0, low byte);  Address 1 → 0x02 (byte 1, high byte)
Big Endian:    Address 0 → 0x02 (byte 1, high byte); Address 1 → 0x01 (byte 0, low byte)
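A small C check (my illustration, not from the slides) showing how the 16-bit value 0x0201 from the example is laid out in memory on whatever machine runs it.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t value = 0x0201;
    uint8_t *bytes = (uint8_t *)&value;

    /* Little endian: bytes[0] == 0x01 (low byte stored first).
       Big endian:    bytes[0] == 0x02 (high byte stored first). */
    printf("byte 0 = 0x%02X, byte 1 = 0x%02X -> %s endian\n",
           bytes[0], bytes[1], bytes[0] == 0x01 ? "little" : "big");
    return 0;
}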
Alignment

Alignment is normally at a 2-byte or 4-byte boundary.
Packing data without alignment optimizes storage but makes the CPU logic more complex and slower.
[Diagram: a 4-byte integer shown aligned and misaligned within byte addresses 0-4, alongside char/int fields]
Addressing Modes

Addressing mode   | Instruction      | Meaning
Register          | Add R4,R3        | Regs[R4] ← Regs[R4] + Regs[R3]
Immediate         | Add R4,#3        | Regs[R4] ← Regs[R4] + 3
Displacement      | Add R4,100(R1)   | Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]
Register indirect | Add R4,(R1)      | Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
Indexed           | Add R3,(R1+R2)   | Regs[R3] ← Regs[R3] + Mem[Regs[R1] + Regs[R2]]
Addressing Modes (Continued)

Addressing mode    | Instruction        | Meaning
Direct or absolute | Add R4,(1001)      | Regs[R4] ← Regs[R4] + Mem[1001]
Memory indirect    | Add R4,@(R3)       | Regs[R4] ← Regs[R4] + Mem[Mem[Regs[R3]]]
Auto-increment     | Add R4,(R2)+       | Regs[R4] ← Regs[R4] + Mem[Regs[R2]]; Regs[R2] ← Regs[R2] + d
Auto-decrement     | Add R1,-(R2)       | Regs[R2] ← Regs[R2] - d; Regs[R1] ← Regs[R1] + Mem[Regs[R2]]
Scaled             | Add R1,100(R2)[R3] | Regs[R1] ← Regs[R1] + Mem[100 + Regs[R2] + Regs[R3] × d]
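A rough C-level illustration (not from the slides) of where several of these modes typically come from in compiled code; the struct, array, and function names are invented.

#include <stdio.h>

struct point { int x, y; };

int sum_y(struct point *p, int idx[], int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        /* p[idx[i]].y usually compiles to:
           - displacement/register-indirect addressing to load idx[i] (base register + offset),
           - scaled or indexed addressing to reach the element p[idx[i]],
           - displacement addressing to reach the .y field inside the struct. */
        s += p[idx[i]].y;
    }
    return s;
}

int main(void) {
    struct point pts[3] = {{1, 2}, {3, 4}, {5, 6}};
    int order[3] = {2, 0, 1};
    printf("%d\n", sum_y(pts, order, 3));  /* 6 + 2 + 4 = 12 */
    return 0;
}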
Type and Size of Operands
Fixed size – rarely used
Encoded in the operation code
Tag – rarely used
Data widths – 16, 24, 32 bits
Accumulator widths – 32, 40, 56, 64 bits
Integers – 8 bits, 16 bits, 32 bits; floating point
Floating Point Number Representation (single precision: 1 sign bit, 8 exponent bits, 23 mantissa bits)

Type                 | Exponent     | Mantissa
Zeros                | 0            | 0
Denormalized numbers | 0            | non-zero
Normalized numbers   | 1 to 2^8 − 2 | any
Infinities           | 2^8 − 1      | 0
NaNs                 | 2^8 − 1      | non-zero
Encoding −118.625 using the IEEE 754 single-precision format:
1. Negative number, so set the sign bit to 1.
2. Write the number in binary: 1110110.101.
3. Move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 × 2^6. This is a normalized floating-point number.
4. Compute the biased exponent: the bias is 127, so 6 + 127 = 133 (10000101 in binary). The stored fraction is the bits after the leading 1, padded to 23 bits: 11011010100000000000000.
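A small C sketch (mine, not from the text) that checks the −118.625 example by reinterpreting the float's bits and printing the sign, biased exponent, and fraction fields.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -118.625f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the bits safely */

    uint32_t sign     = bits >> 31;          /* expect 1 */
    uint32_t exponent = (bits >> 23) & 0xFF; /* expect 133 = 6 + 127 */
    uint32_t fraction = bits & 0x7FFFFF;     /* expect 0b110110101 followed by zeros */

    printf("bits = 0x%08X sign = %u exponent = %u fraction = 0x%06X\n",
           bits, sign, exponent, fraction);  /* 0xC2ED4000 1 133 0x6D4000 */
    return 0;
}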
Terminologies
Precision – resolution
Rounding error
Radix – exponent
Mantissa – significand
Normalization
Denormals
Banker's rounding
64-bit floating point representation
Double Precision Floating Point Representation
Operations – Grouping

Operator type          | Examples
Arithmetic and logical | Integer arithmetic and logical operations: add, subtract, and, or, multiply, divide
Data transfer          | Loads-stores (move instructions on computers with memory addressing)
Control                | Branch, jump, procedure call and return, traps
System                 | Operating system call, virtual memory management instructions
Floating point         | Floating-point operations: add, multiply, divide, compare
Decimal                | Decimal add, decimal multiply, decimal-to-character conversions
String                 | String move, string compare, string search
Graphics               | Pixel and vertex operations, compression/decompression operations
Top 10 Frequently Used 80x86 Instructions

Rank | Instruction            | Integer average (% total executed)
1    | load                   | 22%
2    | conditional branch     | 20%
3    | compare                | 16%
4    | store                  | 12%
5    | add                    | 8%
6    | and                    | 6%
7    | sub                    | 5%
8    | move register-register | 4%
9    | call                   | 1%
10   | return                 | 1%
     | Total                  | 96%
Instructions for Control Flow
Four different types of control flow changes:
Conditional branches
Jumps
Procedure calls
Procedure returns
Addressing modes for these changes
Three Types of Instruction Encoding

(a) Variable (e.g. VAX, Intel 80x86):
    Operation and no. of operands | Address specifier 1 | Address field 1 | … | Address specifier n | Address field n

(b) Fixed (e.g. Alpha, ARM, MIPS, PowerPC, SPARC, SuperH):
    Operation | Address field 1 | Address field 2 | Address field 3

(c) Hybrid (e.g. IBM 360/370, MIPS16, Thumb):
    Operation | Address specifier | Address field
    Operation | Address specifier 1 | Address specifier 2 | Address field
    Operation | Address specifier | Address field 1 | Address field 2
Topics to Prepare and Present
1. RISC versus CISC
2. IEEE 754 floating-point standard
3. Floating-point arithmetic
4. DEC VAX assembler instruction set
5. Benchmarking – Whetstone, Dhrystone, Linpack, SPEC, & TPC
6. Stack-based versus register architectures
7. Problem – example on page 106
Instruction Level Parallelism Unit 2
What is Instruction-level parallelism (ILP)?
Overlapping the execution of instructions in assembled code
Pipeline - Overlap
Compare & contrast:
1. Serving an individual to completion
2. Serving in a pipeline as shown above
Second case: time to serve an individual vs. time to serve 100 people
What is pipelining?
Pipelining – Advantage, Speedup, Effect, & Balance
Instruction Fetch (IF)
Instruction Decode (ID)
Execution (EX)
Memory Access/Data Access (MEM/DA)
Write-back Cycle (WB)
Classic Five-stage Pipeline for a RISC Machine

                 Clock number
Instruction      | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8   | 9
Instruction i    | IF | ID | EX | MEM | WB  |     |     |     |
Instruction i+1  |    | IF | ID | EX  | MEM | WB  |     |     |
Instruction i+2  |    |    | IF | ID  | EX  | MEM | WB  |     |
Instruction i+3  |    |    |    | IF  | ID  | EX  | MEM | WB  |
Instruction i+4  |    |    |    |     | IF  | ID  | EX  | MEM | WB
Pipelining Data Paths

[Figure: program execution order (in instructions) against time (in clock cycles CC1-CC9); each instruction flows through IM (instruction memory), Reg (register read), ALU, DM (data memory), and Reg (register write-back), with successive instructions offset by one clock cycle]
Basic Performance Issues in Pipelining
Pipelining slows down the individual instruction slightly, but it increases the overall execution speed (throughput) of your program!
Problem 1
Consider the un-pipelined processor section. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20% and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
Assume 5-stage pipeline
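A worked sketch of Problem 1 in C (my calculation, not the textbook's solution text): the unpipelined average instruction time is the frequency-weighted cycle count times the 1 ns clock; the pipelined machine issues one instruction per 1.2 ns cycle.

#include <stdio.h>

int main(void) {
    double clock_ns = 1.0;
    double freq[]   = {0.40, 0.20, 0.40};   /* ALU, branch, memory */
    double cycles[] = {4.0,  4.0,  5.0};

    double unpipelined = 0.0;
    for (int i = 0; i < 3; i++)
        unpipelined += freq[i] * cycles[i] * clock_ns;   /* 4.4 ns average */

    double pipelined = clock_ns + 0.2;                   /* 1 CPI, 0.2 ns overhead */
    printf("speedup = %.2f\n", unpipelined / pipelined); /* about 3.67 */
    return 0;
}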
Problem 2
Consider the un-pipelined processor section. Assume that it has a 2 GHz clock and that it uses 4 cycles for ALU operations, 3 cycles for branches, and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 50%, 25% and 25%, respectively. Suppose that due to clock skew and setup, the pipelined processor runs at half the clock rate. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
Assume 5-stage pipeline
The Hazards of Pipelining
What is a pipeline hazard?
What is a pipeline stall?
Types of hazards:
Structural Data Control
Effect of Stalls on Pipeline Performance

Speedup from pipelining
= Avg instruction time unpipelined / Avg instruction time pipelined
= (CPI unpipelined × Clock period unpipelined) / (CPI pipelined × Clock period pipelined)
= (CPI unpipelined / CPI pipelined) × (Clock period unpipelined / Clock period pipelined)

Speedup due to improved CPI:
CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction = 1 + Pipeline stall cycles per instruction
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
        = Pipeline depth / (1 + Pipeline stall cycles per instruction)
(since the unpipelined CPI equals the pipeline depth when each stage takes one cycle)
Pipeline Implementation with MIPS – IF Cycle
Instruction fetch cycle (IF):
IR  ← Mem[PC]
NPC ← PC + 4
Pipeline Implementation with MIPS – ID Cycle
Instruction decode/register fetch cycle (ID):
A   ← Regs[rs]
B   ← Regs[rt]
IMM ← sign-extended immediate field of IR
Execution/effective address cycle (EX)
The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the MIPS instruction type:
Memory reference:                   ALUOutput ← A + IMM
Register-register ALU instruction:  ALUOutput ← A func B
Register-immediate ALU instruction: ALUOutput ← A op IMM
Branch:                             ALUOutput ← NPC + (IMM << 2); Cond ← (A == 0)
Pipeline Implementation with MIPS – Memory Access Cycle
Memory access/branch completion cycle (MEM). The PC is updated for all instructions.
Memory reference:
LMD ← Mem[ALUOutput]   (load)   or   Mem[ALUOutput] ← B   (store)
Branch:
if (cond) PC ← ALUOutput; else PC ← NPC
Pipeline Implementation with MIPS – Write Back
Register-register ALU instruction:  Regs[rd] ← ALUOutput
Register-immediate ALU instruction: Regs[rt] ← ALUOutput
Load instruction:                   Regs[rt] ← LMD
Instruction Level Parallelism
Basic block
Parallelism within a basic block; its limitations
Parallelism across basic blocks; its limitations
[Diagram: a basic block – a straight-line sequence of instructions (Inst 1 … Inst 4) with a single entry and a single exit]
MIPS Program Characteristics
[Example: a numbered instruction listing (1-10, …, 92-101) in which branch instructions (BE, BNEZ, JUMP, BNE) appear every few instructions, so basic blocks are short]
Vector Processors – Parallelism Across Basic Blocks
for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];
Vector processors don't scale well across different problem sizes!
ILP processors are different from vector processors.
Data Dependences and ILP
Two independent instructions can be completely overlapped Two dependent instructions cannot be completely overlapped
Relationship between …
Data dependence → Hazard → Stall
[Diagram: a sequence of instructions Inst 1 … Inst 6 with dependences between them]
Data Dependences
Three types of dependences:
(True) data dependence
Name dependence
Control dependence
True Dependence
Instruction i produces a result that may be used by instruction j, or
instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
Output of instruction i → input of instruction j
Code Sequence with Data Dependence

Loop: L.D     F0, 0(R1)
      ADD.D   F4, F0, F2
      S.D     F4, 0(R1)
      DADDUI  R1, R1, #-8
      BNE     R1, R2, Loop
Data dependence can be addressed by:
1. Maintaining the dependence but avoiding a hazard
2. Eliminating the dependence by transforming the code
Name Dependence
Anti-dependence: When instruction j writes to a register or memory location that instruction i reads. Output dependence: Instruction i and instruction j write to the same register or memory location.
Control Dependence
if (p1) { S1; }
if (p2) { S2; }
Data Hazards
RAW (Read After Write)
WAW (Write After Write)
WAR (Write After Read)
RAR (Read After Read)
Read After Write - RAW
Wrong execution: j tries to read a source before i writes to it.
Arises from true dependence.
Example:
Load R1, A
Add  R1, R2, R1
Write After Write
Arises from Output Dependence Correct order between writes to be maintained Appears in only those pipelines where there are multiple write stages Example:
DMUL  F4, F3, F4
DLOAD F4, A
Write After Read
Arises from anti-dependence
An instruction writes to a destination before it is read by another instruction, making the latter read the new (incorrect) value
Does not occur in static pipelines
Read After Read
Not a hazard!
Dynamic Scheduling
Summarize: data dependence and structural dependence
Executing instructions out of program order
Example:
      L    R1, 0(R2)
      SUB  R1, R3
      BEQZ Same
      S    R1, 0(R2)
      JUMP End
Same: S    R1, 0(R4)
Control Dependence
if (p1) { S1; }
if (p2) { S2; }

    ADD  R2, R3, R4
    BEQZ R2, L1
    LW   R1, 0(R2)
    RET
L1: ADD  R1, R2, R3
Two Constraints
1. An instruction that is control dependent on a branch cannot be moved before the branch.
2. An instruction that is not control dependent on a branch cannot be moved after the branch (converted into a control-dependent instruction).
Example (X marks illegal code motion):
    ADD  R2, R3, R4
    BEQZ R2, L1
    LW   R1, 0(R2)     ; X
    RET
L1: ADD  R1, R2, R3    ; X
Preserving Control & Data Dependence
    ADD  R2, R3, R4
    BEQZ R2, L1
    LW   R1, 0(R2)
L1:
Alternative to Control & Data Dependence
Sequential execution of a program guarantees its intended correctness
Preserving control and data dependence guarantees correctness in dynamic scheduling
Preserving these two dependences is an easier way to guarantee correctness
But it is not the best way in terms of throughput!
Preserving exception behavior and data flow guarantees correctness and offers better throughput
What is Exception Behavior?
    ADD  R2, R3, R4
    BEQZ R2, L1
    LW   R1, 0(R2)
    RET
L1: ADD  R1, R2, R3
What is Data Flow?
    ADD  R1, R2, R3
    BEQZ R4, L
    SUB  R1, R5, R6
    OR   R7, R1, R8
L:
To prove control dependence is not critical:
      ADD   R1, R2, R3
      BEQZ  R12, Skip
      DSUB  R4, R5, R6
      ADD   R5, R4, R9
Skip: OR    R7, R8, R9
Scheduling & Speculation
Executing DSUB R4, R5, R6 ahead of BEQZ.
Scheduling: reordering the instructions to improve throughput.
Speculation: scheduling instructions ahead of their program order based on a prediction.
Dynamic Scheduling vs Static Scheduling
Static scheduling
Decision at compile time
No change to hardware
Limited to compile-time data

Dynamic scheduling
Decision at run time
Complex hardware
Better throughput
Limitations of Static Scheduling
Unnecessary stalling of non-dependent instructions:
DIV.D  F0, F2, F4
ADD.D  F10, F0, F8
SUB.D  F12, F8, F14
Issues with Dynamic Scheduling
Hazards – WAR and WAW; exception handling
DIV.D  F0, F2, F4
ADD.D  F6, F0, F8
SUB.D  F8, F10, F14   ; WAR with ADD.D on F8
MUL.D  F6, F10, F8    ; WAW with ADD.D on F6
Simple 5-stage Pipeline with Integer ALU
IF (instructions enter) → ID → EX (integer ALU) → MEM → WB (instructions leave)

5-Stage Pipeline with Looped FP Functional Units
IF → ID → EX (integer unit, FP multiplier, FP adder, FP divider; each FP unit loops until its operation completes) → MEM → WB

5-Stage Pipeline with Pipelined FP Units
IF → ID → EX (integer unit, pipelined FP multiplier, pipelined FP adder, FP divider) → MEM → WB

5-Stage Pipeline with Multiple Pipelined FP Units
IF → ID → EX (multiple integer units, FP multiplier, FP adder, FP divider) → MEM → WB
Hazards in Longer Latency Pipelines
Because divide is not fully pipelined, structural hazards can occur.
Because instructions have varying running times, the number of register writes required in a cycle can be more than one.
Instructions can complete in a different order than they were issued
WAW Hazard possible
WAR Hazard less likely
Stalls due to RAW hazards will be more frequent because of the longer latencies
Dynamic Scheduling Stages
Dynamic scheduling implies out-of-order execution
To allow this, we split the ID stage of the 5-stage pipeline into two stages:
Issue: Decode instructions, check for structural hazards.
Read operands: Wait until no data hazards, then read operands
IF (instructions enter) → Issue → Read Operands (RO) → EX → MEM → WB (instructions leave)
(Issue and Read Operands together replace the ID stage.)
Variations and other characteristics
Instruction is fetched into a register
Or
Instruction is fetched into a queue
Execution may take a different number of cycles depending on the instruction
EX stage follows read operand stage
Multiple instructions in EX stage at the same time using multiple FUs
Stall and Bypass
Dynamic Scheduling using scoreboard
All instructions come to the issue stage in order (set by the compiler)
They can stall or bypass each other in the read operand stage and enter execution out-of-order.
Scoreboarding is one technique for scheduling instructions out of order; Tomasulo's algorithm is another.
Scoreboard …
Objective:
To execute one instruction per clock cycle
From the CDC 6600.
Recall the WAR hazard in:
DIV.D  F0, F2, F4
ADD.D  F10, F0, F8
SUB.D  F8, F8, F14

Pipeline: IF (instructions enter) → Issue → RO → EX → MEM → WB (instructions leave)
Scoreboard content
Instruction Status
Indicates the current stage for an instruction
Functional Unit Status
Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
Register Result Status
Indicates which functional unit will write each register
Scoreboard Building Blocks
[Diagram: registers and functional units (FUs) connected to the scoreboard, which exchanges control/status information with each of them]
Tomasulo’s Approach to Dynamic Scheduling
Addresses WAR and WAW hazards Based on renaming of registers
Original code:
DIV.D  F0, F2, F4
ADD.D  F6, F0, F8
S.D    F6, 0(R1)
SUB.D  F8, F10, F14
MUL.D  F6, F10, F8

After renaming (temporaries S and T):
DIV.D  F0, F2, F4
ADD.D  S, F0, F8
S.D    S, 0(R1)
SUB.D  T, F10, F14
MUL.D  F6, F10, T
3-Stages of Tomasulo’s Pipeline
Issue Execute Write Result
Basic Structure of Tomasulo's Algorithm
[Figure: an instruction queue fed from the instruction unit; an address unit with load buffers and store buffers for load-store operations; the FP registers; reservation stations in front of the FP adders and FP multipliers; operand and operation buses; the memory unit; and the Common Data Bus (CDB) carrying results from the functional units back to the reservation stations, store buffers, and registers]
Advantages of Tomasulo’s algorithm
Distribution of the hazard detection logic
No head-of-the-queue blocking – eliminates stalls due to WAW and WAR hazards
It can also eliminate name dependences that cannot be eliminated by a compiler
The Common Data Bus (CDB) makes it easier to copy one destination result to multiple source registers
Reservation Station – Data Fields
Instruction Status
Indicates the current stage for an instruction
Reservation Station Status
Busy, Op, Vj, Vk, Qj, Qk, A
Register Status
Indicates which functional unit will write to each register
Show what the information tables look like for the following code sequence when only the first load has completed and written its result:
1. L.D    F6, 34(R2)
2. L.D    F2, 45(R3)
3. MUL.D  F0, F2, F4
4. SUB.D  F8, F2, F6
5. DIV.D  F10, F0, F6
6. ADD.D  F6, F8, F2
Example – Instruction Status

Instruction     | Issue | Execute | Write Result
L.D   F6,34(R2) | √     | √       | √
L.D   F2,45(R3) | √     | √       |
MUL.D F0,F2,F4  | √     |         |
SUB.D F8,F2,F6  | √     |         |
DIV.D F10,F0,F6 | √     |         |
ADD.D F6,F8,F2  | √     |         |
Example – Reservation Stations

Name  | Busy | Op   | Vj | Vk                 | Qj    | Qk    | A
Load1 | no   |      |    |                    |       |       |
Load2 | yes  | Load |    |                    |       |       | 45 + Regs[R3]
Add1  | yes  | SUB  |    | Mem[34 + Regs[R2]] | Load2 |       |
Add2  | yes  | ADD  |    |                    | Add1  | Load2 |
Add3  | no   |      |    |                    |       |       |
Mult1 | yes  | MUL  |    | Regs[F4]           | Load2 |       |
Mult2 | yes  | DIV  |    | Mem[34 + Regs[R2]] | Mult1 |       |
Example – Register Status

Field | F0    | F2    | F4 | F6   | F8   | F10   | F12 | … | F30
Qi    | Mult1 | Load2 |    | Add2 | Add1 | Mult2 |     |   |
Steps in Algorithm
To understand the full power of eliminating WAW and WAR hazards through dynamic renaming of registers, we must look at a loop. Consider the following simple sequence for multiplying the elements of an array by the scalar in F2:
Loop: L.D     F0, 0(R1)
      MUL.D   F4, F0, F2
      S.D     F4, 0(R1)
      DADDUI  R1, R1, #-8
      BNE     R1, R2, Loop    ; branches if R1 ≠ R2
Steps in Algorithm – Issue

Instruction state    | Wait until      | Action or bookkeeping
Issue, FP operation  | Station r empty | if (RegisterStat[rs].Qi ≠ 0) {RS[r].Qj ← RegisterStat[rs].Qi} else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0}; if (RegisterStat[rt].Qi ≠ 0) {RS[r].Qk ← RegisterStat[rt].Qi} else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0}; RS[r].Busy ← yes; RegisterStat[rd].Qi ← r
Issue, Load or Store | Buffer r empty  | if (RegisterStat[rs].Qi ≠ 0) {RS[r].Qj ← RegisterStat[rs].Qi} else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0}; RS[r].A ← imm; RS[r].Busy ← yes
Load only            |                 | RegisterStat[rt].Qi ← r
Store only           |                 | if (RegisterStat[rt].Qi ≠ 0) {RS[r].Qk ← RegisterStat[rt].Qi} else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0}
Steps in Algorithm (continued)

Instruction state          | Wait until                                         | Action or bookkeeping
Execute, FP operation      | RS[r].Qj = 0 and RS[r].Qk = 0                      | Compute the result using the operands in Vj and Vk
Execute, Load/Store step 1 | RS[r].Qj = 0 and r is head of the load-store queue | RS[r].A ← RS[r].Vj + RS[r].A
Reducing Branch Cost
Control (branch) hazards can cause a greater performance loss for the MIPS pipeline than data hazards.
Branch prediction is a mechanism to resolve the outcome of a branch early, thus preventing control dependences from causing stalls.
It is based on the fact that branches are most often associated with loops.
Two types of branch prediction
Static prediction
Dynamic prediction
Terminologies related to Branch Prediction
Control or branch hazard
Taken (if a branch changes the PC to its target address)
Not taken, or untaken (otherwise)
Freeze or flush
Predicted-not-taken
   ADD  R1, R2, R3
   BEQZ R4, L
   SUB  R1, R5, R6
   ST   R1, A
   OR   R7, R1, R8
L:
Branch Taken or Untaken

High-level program segment:
for (j = 0; j < 10; j++)
    for-body
end-for
nextinst

Assembly program segment:
          check & branch to for-loop
RET:      nextinst
          …
for-loop: …
          …
          JUMP RET
Basic Branch Prediction – Prediction buffer
Using a branch-history table (prediction buffer):

Branch address | Taken or not (1 bit)
baddr1         | 1
baddr2         | 0
baddr3         | 1

2-bit prediction scheme:
[State diagram: Predict taken (11) ↔ Predict taken (10) ↔ Predict not taken (01) ↔ Predict not taken (00); each taken branch moves the state toward 11 and each not-taken branch moves it toward 00, so two consecutive mispredictions are needed to flip the prediction]
Typical prediction accuracy: roughly 80-90%.
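A minimal C sketch of the 2-bit saturating counter in the state diagram above; the table size and the use of the branch address as an index are illustrative choices, not part of the slides.

#include <stdio.h>
#include <stdint.h>

#define TABLE_SIZE 1024
static uint8_t counters[TABLE_SIZE];   /* 0,1 = predict not taken; 2,3 = predict taken */

int predict(uint32_t branch_addr) {
    return counters[branch_addr % TABLE_SIZE] >= 2;   /* 1 = predict taken */
}

void update(uint32_t branch_addr, int taken) {
    uint8_t *c = &counters[branch_addr % TABLE_SIZE];
    if (taken  && *c < 3) (*c)++;      /* move toward "strongly taken" (11) */
    if (!taken && *c > 0) (*c)--;      /* move toward "strongly not taken" (00) */
}

int main(void) {
    uint32_t baddr = 0x400100;             /* hypothetical branch address */
    int outcomes[] = {1, 1, 1, 0, 1, 1};   /* a loop branch: mostly taken */
    int correct = 0;

    for (int i = 0; i < 6; i++) {
        correct += (predict(baddr) == outcomes[i]);
        update(baddr, outcomes[i]);
    }
    printf("correct predictions: %d of 6\n", correct);
    return 0;
}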
Static Branch Prediction
Useful when branch behavior is highly predictable at compile time
It can also assist dynamic predictors
Loop unrolling is an example of this
Branch Causing 3-Cycle Stall

[Table: the branch instruction proceeds IF, ID, EX, MEM, WB in cycles 1-5; the branch successor repeats its IF until the branch is resolved and only then continues with ID, EX, …, so three clock cycles are lost; successors +1 and +2 are delayed in the same way]

Example:
   ADD  R1, R2, R3
   BEQZ R4, L
   SUB  R1, R5, R6
   ST   R1, A
   …
   OR   R7, R1, R8
L:
Reducing pipeline branch penalties – Compile time schemes
Freeze or flush the pipeline
Assume that every branch is not taken
Assume that every branch is taken
Delayed branch
Delayed Branch
Code segment:
      branch instruction
      sequential successor    ← branch delay slot
      …
Loop: branch target if taken

The instruction in the branch delay slot is always executed.
The compiler can schedule the instruction(s) for the branch delay slot so that their execution is useful even when the prediction is wrong.
Branch Delay of 1

Not taken:
Untaken branch instruction   | IF | ID | EX | MEM | WB
Branch delay instruction j+1 |    | IF | ID | EX  | MEM | WB
Instruction j+2              |    |    | IF | ID  | EX  | MEM | WB
Instruction j+3              |    |    |    | IF  | ID  | EX  | MEM | WB
Instruction j+4              |    |    |    |     | IF  | ID  | EX  | MEM | WB

Taken:
Taken branch instruction     | IF | ID | EX | MEM | WB
Branch delay instruction j+1 |    | IF | ID | EX  | MEM | WB
Branch target                |    |    | IF | ID  | EX  | MEM | WB
Branch target + 1            |    |    |    | IF  | ID  | EX  | MEM | WB
Branch target + 2            |    |    |    |     | IF  | ID  | EX  | MEM | WB
Scheduling the Branch Delay Slot

(a) From before (the best choice when an independent instruction before the branch is available):
    DADD R1, R2, R3
    if R2 = 0 then
        (delay slot)
becomes:
    if R2 = 0 then
        DADD R1, R2, R3

(b) From target (used when the branch is taken with high probability):
    DSUB R4, R5, R6
    …
    DADD R1, R2, R3
    if R1 = 0 then
        (delay slot)
becomes:
    DADD R1, R2, R3
    if R1 = 0 then
        DSUB R4, R5, R6

(c) From fall-through:
    DADD R1, R2, R3
    if R1 = 0 then
        (delay slot)
    OR   R7, R8, R9
    DSUB R4, R5, R6
becomes:
    DADD R1, R2, R3
    if R1 = 0 then
        OR   R7, R8, R9
    DSUB R4, R5, R6
Branch Prediction – Correlating Predictors

Prediction based on correlation:
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { … }

     SUB  R3, R1, #2
     BNEZ R3, L1        ; branch b1 (aa != 2)
     ADD  R1, R0, R0    ; aa = 0
L1:  SUB  R3, R2, #2
     BNEZ R3, L2        ; branch b2 (bb != 2)
     ADD  R2, R0, R0    ; bb = 0
L2:  SUB  R3, R1, R2    ; R3 = aa - bb
     BEQZ R3, L3        ; branch b3 (aa == bb)

If the first and second branches are both not taken, then the last branch will be taken.
ILP Software based approaches
Basic Compiler Technique
Basic Pipeline Scheduling and Loop Unrolling

Source loop:
for (j = 1000; j > 0; j--)
    x[j] = x[j] + s;

MIPS code:
L:  L.D    F0, 0(R1)    ; F0 = array element
    ADD.D  F4, F0, F2   ; add scalar in F2
    S.D    F4, 0(R1)    ; store result
    ADDI   R1, R1, #-8  ; decrement pointer
    BNE    R1, R2, L    ; branch if R1 != R2
Latencies of FP operations used

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0

Assume an integer load latency of 1 and an integer ALU operation latency of 0.

L:  L.D    F0, 0(R1)    ; F0 = array element
    ADD.D  F4, F0, F2   ; add scalar in F2
    S.D    F4, 0(R1)    ; store result
    ADDI   R1, R1, #-8  ; decrement pointer
    BNE    R1, R2, L    ; branch if R1 != R2
How many clock cycles does each iteration take?
Without Any Scheduling
                              Clock cycle issued
L:  L.D    F0, 0(R1)          1
    stall                     2   ; NOP
    ADD.D  F4, F0, F2         3
    stall                     4   ; NOP
    stall                     5   ; NOP
    S.D    F4, 0(R1)          6
    ADDI   R1, R1, #-8        7
    stall                     8   ; latency of 1 between the ALU op and BNE
    BNE    R1, R2, L          9
    stall                     10  ; NOP
10 clock cycles per element; 7 of them (ADDI, BNE, stalls) are loop overhead.
With Scheduling
                              Clock cycle issued
L:  L.D    F0, 0(R1)          1
    ADDI   R1, R1, #-8        2
    ADD.D  F4, F0, F2         3
    stall                     4
    BNE    R1, R2, L          5
    S.D    F4, 8(R1)          6   ; store moved into the branch delay slot; offset adjusted to 8
6 clock cycles per element; 3 of them (ADDI, BNE, stall) are loop overhead (compare with the 10-cycle unscheduled version above).
With Unrolling (by 4)
L:  L.D    F0, 0(R1)
    L.D    F6, -8(R1)
    L.D    F10, -16(R1)
    L.D    F14, -24(R1)
    ADD.D  F4, F0, F2
    ADD.D  F8, F6, F2
    ADD.D  F12, F10, F2
    ADD.D  F16, F14, F2
    S.D    F4, 0(R1)
    S.D    F8, -8(R1)
    ADDI   R1, R1, #-32
    S.D    F12, 16(R1)
    BNE    R1, R2, L
    S.D    F16, 8(R1)
14 clock cycles / 4 elements; 2 instructions (ADDI, BNE) are loop overhead.
With Unrolling & Scheduling
L:  L.D    F0, 0(R1)
    L.D    F6, -8(R1)
    L.D    F10, -16(R1)
    L.D    F14, -24(R1)
    L.D    F18, -32(R1)
    ADD.D  F4, F0, F2
    ADD.D  F8, F6, F2
    ADD.D  F12, F10, F2
    ADD.D  F16, F14, F2
    ADD.D  F20, F18, F2
    S.D    F4, 0(R1)
    S.D    F8, -8(R1)
    S.D    F12, -16(R1)
    ADDI   R1, R1, #-40
    S.D    F16, 16(R1)
    BNE    R1, R2, L
    S.D    F20, 8(R1)
12 clock cycles / 5 elements; 2 instructions (ADDI, BNE) are loop overhead.
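A source-level C sketch of the same unrolling idea (mine, not the slides' MIPS code); it assumes the trip count is a multiple of 4, as in the 1000-element example.

/* Unroll x[i] = x[i] + s by 4: one backward pass, four elements per trip,
   so there is only one branch and one counter update per four adds. */
void scale_unrolled(double *x, double s, int n) {
    for (int i = n; i > 0; i -= 4) {
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
        x[i - 4] += s;
    }
}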
The VLIW Approach
The first multiple-issue processors required the instruction stream to be explicitly organized.
Used wide instructions with multiple operations per instruction.
This architectural approach was named VLIW (Very Long Instruction Word).
Advantage of VLIW Approach
The advantage of a VLIW increases as the maximum issue rate grows.
VLIW approaches make sense for wider processors.
For example, a VLIW processor might have instructions that each contain five operations: one integer, two floating-point, and two memory operations.
The Basic VLIW Approach
There is no fundamental difference between the two approaches.
The responsibility for choosing which instructions to issue simultaneously falls on the compiler, so the hardware a superscalar needs to make these issue decisions is unneeded.
The Basic VLIW Approach
Scheduling Techniques
Local Scheduling
Global Scheduling
Example
Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore the branch delay slot.
VLIW instructions that occupy the inner loop and replace the unrolled sequence

Memory reference 1 | Memory reference 2 | FP operation 1   | FP operation 2   | Integer operation/branch
L.D F0,0(R1)       | L.D F6,-8(R1)      |                  |                  |
L.D F10,-16(R1)    | L.D F14,-24(R1)    |                  |                  |
L.D F18,-32(R1)    | L.D F22,-40(R1)    | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |
L.D F26,-48(R1)    |                    | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |
                   |                    | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |
S.D F4,0(R1)       | S.D F8,-8(R1)      | ADD.D F28,F26,F2 |                  |
S.D F12,-16(R1)    | S.D F16,-24(R1)    |                  |                  | DADDUI R1,R1,#-56
S.D F20,24(R1)     | S.D F24,16(R1)     |                  |                  |
S.D F28,8(R1)      |                    |                  |                  | BNE R1,R2,Loop
Problems of VLIW Model
They are:
Technical problems
Logistical Problems
Technical Problems
Increase in code size and the limitations of lockstep.
Two different elements combine to increase code size substantially for a VLIW.
Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops, thereby increasing code size.
Whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding.
Logistical Problems And Solution
Logistical Problems
Binary code compatibility has been a major problem.
The different numbers of functional units and unit latencies require different versions of the code.
Migration problems
Solution
Approach:
Use object-code translation or emulation. This technology is developing quickly and could play a significant role in future migration schemes.
Temper the strictness of the approach so that binary compatibility is still feasible.
Advantages of multiple-issue versus vector processor
The advantages are twofold:
A multiple-issue processor has the potential to extract some amount of parallelism from less regularly structured code.
It has the ability to use a more conventional, and typically less expensive, cache-based memory system.
Advanced compiler support for exposing and exploiting ILP
Compiler technology for increasing the amount of parallelism
Defining when a loop is parallel and how dependence can prevent a loop from being parallel.
Eliminating some types of dependences
Detecting and Enhancing Loop-level Parallelism
Loop-level parallelism is normally analyzed at the source level or close to it.
while most analysis of ILP is done once instructions have been generated by the compiler.
It involves determining what dependences exist among the operand in a loop across the iterations of that loop.
Detecting and Enhancing Loop-level Parallelism
The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence.
Loop-level parallelism at the source representation:
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
Example 1
Consider a loop: for (i= 1; i <= 100; i = i + 1) { A [ i + 1] = A [ i ] + C [ i ] ; B [ i + 1] = B [ i ] + A [ i + 1 ] ; }
Loop-carried dependence : execution of an instance of a loop requires the execution of a previous instance.
Example 2
Consider a loop: for (i= 1; i <= 100; i = i + 1) { A [ i ] = A [ i ] + B [ i ] ; /* S1 */ B [ i + 1] = C [ i ] + D [ i ] ;/* S2 */ } What are the dependences between S1 and S2? Is this loop parallel? If not, show how to make it parallel.
Scheduling

Iteration 1:   A[1] = A[1] + B[1];      B[2] = C[1] + D[1]
Iteration 2:   A[2] = A[2] + B[2];      B[3] = C[2] + D[2]
Iteration 3:   A[3] = A[3] + B[3];      B[4] = C[3] + D[3]
Iteration 4:   A[4] = A[4] + B[4];      B[5] = C[4] + D[4]
…
Iteration 99:  A[99] = A[99] + B[99];   B[100] = C[99] + D[99]
Iteration 100: A[100] = A[100] + B[100]; B[101] = C[100] + D[100]
Removing Loop-carried Dependency
Consider the same example:
A[1] = B[1] + C[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
Example 3
Loop-carried dependences: the dependence information is inexact; a dependence may only possibly exist.
Consider an example:
for (i = 1; i <= 100; i = i + 1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
Recurrence
Loop-carried dependences:
for (i = 2; i <= 100; i = i + 1) {
    Y[i] = Y[i-1] + Y[i];
}
for (i = 6; i <= 100; i = i + 1) {
    Y[i] = Y[i-5] + Y[i];
}

Dependence distance: an iteration i of the loop depends on the completion of an earlier iteration i - n; the dependence distance is n.
Finding Dependences
Three tasks
Good scheduling of code
Determining which loops might contain parallelism
Eliminating name dependences
Finding Dependence
A dependence exists if two conditions hold:
A dependence exists if two conditions hold:
1. There are two iteration indices, j and k, both within the limits of the for loop; that is, m <= j <= n and m <= k <= n.
2. The loop stores into an array element indexed by a*j + b and later fetches from the same array element when it is indexed by c*k + d; that is, a*j + b = c*k + d.
Example 1
Use the GCD test to determine whether dependences exist in the loop:
for (i = 1; i <= 100; i = i + 1) {
    X[2*i + 3] = X[2*i] * 5.0;
}
a = 2, b = 3, c = 2, d = 0.
Since GCD(a, c) = 2 does not divide d - b = -3, no dependence is possible.
(GCD – Greatest Common Divisor)
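A small C sketch of the GCD test for accesses of the form a*j + b (store) and c*k + d (load); the function names gcd and gcd_test are mine, not from the text.

#include <stdio.h>

int gcd(int x, int y) {
    if (x < 0) x = -x;
    if (y < 0) y = -y;
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* Returns 1 if a dependence is possible, 0 if the GCD test rules it out. */
int gcd_test(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}

int main(void) {
    /* X[2*i + 3] = X[2*i] * 5.0  ->  a = 2, b = 3, c = 2, d = 0 */
    printf("dependence possible? %d\n", gcd_test(2, 3, 2, 0));   /* 0: ruled out */
    return 0;
}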
Example 2
The loop has multiple types of dependences. Find all the true dependences and antidependences, and eliminate the output dependences and antidependences by renaming.
for (i = 1; i <= 100; i = i + 1) {
    y[i] = x[i] / c;   /* S1 */
    x[i] = x[i] + c;   /* S2 */
    z[i] = y[i] + c;   /* S3 */
    y[i] = c - y[i];   /* S4 */
}
Dependence Analysis
There are a wide variety of situations in which array-oriented dependence analysis cannot tell us what we might want to know, including:
When objects are referenced via pointers rather than array indices.
When array indexing is indirect through another array, which happens with many representations of sparse arrays.
When a dependence may exist for some values of the inputs, but does not exist in actuality when the code is run since the inputs never take on those values.
When an optimization depends on knowing more than just the possibility of a dependence, but needs to know on which write of a variable a read of that variable depends.
Basic Approach used in points-to analysis
The basic approach used in points-to analysis relies on information from three major sources:
Type information, which restricts what a pointer can point to. Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to. For example if p always points to an object allocated in a given source line and q never points to that object, then p and q can never point to the same object. Information derived from pointer assignments. For example , if p may be assigned the value of q, then p may point to anything q points to.
Analyzing pointers
There are several cases where analyzing pointers has been successfully applied and is extremely useful:
When pointers are used to pass the address of an object as a parameter, it is possible to use points-to analysis to determine the possible set of objects referenced by a pointer. One important use is to determine whether two pointer parameters may designate the same object.
When a pointer can point to one of several types, it is sometimes possible to determine the type of the data object that the pointer designates at different parts of the program.
It is often possible to separate out pointers that may only point to a local object from those that may point to a global one.
Different types of limitations
There are two different types of limitations that affect our ability to do accurate dependence analysis for large programs. The first arises from restrictions in the analysis algorithms; often we are limited by the lack of applicability of the analysis rather than a shortcoming in dependence analysis per se. The second is the need to analyze behavior across procedure boundaries to get accurate information.
Eliminating dependent computations
Compilers can reduce the impact of dependent computations so as to achieve more ILP. The key technique is to eliminate or reduce a dependent computation by back substitution, which increases the amount of parallelism and sometimes increases the amount of computation required. These techniques can be applied both within a basic block and within loops.
Copy propagation
Within a basic block, algebraic simplifications of expressions and an optimization called copy propagation, which eliminates operations that copy values, can be used to simplify sequences like:
DADDUI R1, R2, #4
DADDUI R1, R1, #4
which becomes:
DADDUI R1, R2, #8
Example
Consider the code sequence:
ADD R1, R2, R3
ADD R4, R1, R6
ADD R8, R4, R7
This sequence requires at least three execution cycles, since each instruction depends on its immediate predecessor. By taking advantage of associativity, transform the code and rewrite it as:
ADD R1, R2, R3
ADD R4, R6, R7
ADD R8, R1, R4
This sequence can be computed in two execution cycles. When loop unrolling is used, opportunities for these types of optimizations occur frequently.
Recurrences
Recurrences are expressions whose value on one iteration is given by a function that depends on the previous iterations.
When a loop with a recurrence is unrolled, we may be able to algebraically optimize the unrolled loop, so that the recurrence need only be evaluated once per unrolled iteration.
Type of recurrences
One common type of recurrence arises from an explicit program statement:
Sum = Sum + X ;
Unrolled recurrence – unoptimized (5 dependent additions):
Sum = Sum + X1 + X2 + X3 + X4 + X5;
Optimized: with 3 dependent operations: Sum = ((Sum + X1) + (X2 + X3) ) + (X4 + X5);
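A C sketch of the same transformation (mine, for illustration): the first form is a serial chain of additions, the second re-associates them so independent additions can overlap.

double sum_chain(double sum, const double x[5]) {
    return sum + x[0] + x[1] + x[2] + x[3] + x[4];          /* fully serial chain */
}

double sum_tree(double sum, const double x[5]) {
    return ((sum + x[0]) + (x[1] + x[2])) + (x[3] + x[4]);  /* about 3 dependent levels */
}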
Software Pipelining : Symbolic Loop Unrolling
There are two important techniques:
Software pipelining
Trace scheduling
Reading Assignment
Trace Scheduling
Software Pipelining versus Loop Unrolling
Software Pipelining
This is a technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.
This loop interleaves instructions from different iterations without unrolling the loop.
This technique is the software counterpart to what Tomasulo’s algorithm does in hardware.
Software-Pipelined Iteration
[Diagram: iterations 0-4 of the original loop overlap in time; a software-pipelined iteration is assembled from instructions taken from several different original iterations]
Example: Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2:
Loop: L.D     F0, 0(R1)
      ADD.D   F4, F0, F2
      S.D     F4, 0(R1)
      DADDUI  R1, R1, #-8
      BNE     R1, R2, Loop
You can omit the start-up and clean-up code.
Software Pipeline

Iteration i:   L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
Iteration i+1: L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
Iteration i+2: L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)

(Load M[i] into F0; add to F0 and hold the result in F4; meanwhile load M[i-1]; and so on.)

Software-pipelined loop body:
Loop: S.D     F4, 16(R1)   ; stores into M[i]
      ADD.D   F4, F0, F2   ; adds to M[i-1]
      L.D     F0, 0(R1)    ; loads M[i-2]
      DADDUI  R1, R1, #-8
      BNE     R1, R2, Loop

(Wind-down: store F4 into M[1]; add F4 ← F0 + F2; store it into M[0].)
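A C-level sketch of the rotated loop body above (my illustration, not the slides' MIPS code): each trip stores element i, computes the add for element i-1, and loads element i-2, so the three operations come from three different original iterations.

void add_scalar_swp(double *x, double s, int n) {
    if (n < 2) {                      /* fall back for tiny arrays */
        if (n == 1) x[0] += s;
        return;
    }
    /* start-up code: the result for the last element, and the next load */
    double result = x[n - 1] + s;     /* ADD for element n-1 */
    double loaded = x[n - 2];         /* LOAD of element n-2 */

    for (int i = n - 1; i >= 2; i--) {
        x[i] = result;                /* STORE element i      (iteration i)   */
        result = loaded + s;          /* ADD for element i-1  (iteration i-1) */
        loaded = x[i - 2];            /* LOAD element i-2     (iteration i-2) */
    }
    /* wind-down code */
    x[1] = result;
    x[0] = loaded + s;
}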
Major Advantage of Software Pipelining
The major advantage of software pipelining over straight loop unrolling is that software pipelining consumes less code space.
In addition to yielding a better-scheduled inner loop, each technique reduces a different type of overhead.
Loop unrolling reduces the overhead of the loop: the branch and counter-update code.
Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and the end.
Software Pipelining vs. Unrolled Loop
[Figure: (a) software pipelining – the number of overlapped operations stays constant over time, with start-up and wind-down code only at the beginning and end of the loop; (b) loop unrolling – the overlap between unrolled iterations rises and falls within each unrolled body, and the peak overlap is proportional to the number of unrolls]
Software Pipelining Difficulty
In practice, applying software pipelining is quite difficult for several reasons:
Many loops require significant transformation before they can be software pipelined, the trade-offs in terms of overhead versus efficiency of the software-pipelined loop are complex, and the issue of register management creates additional complexities. To help deal with these issues, the IA-64 added extensive hardware support for software pipelining. Although this hardware can make it more efficient to apply software pipelining, it does not eliminate the need for complex compiler support, or the need to make difficult decisions about the best way to compile a loop.
Global Code Scheduling
Global code scheduling aims to compact a code fragment with internal control structure into the shortest possible sequence that preserves the data and control dependences. Data dependences force a partial order on operations and are overcome by unrolling and, in the case of memory operations, by using dependence analysis to determine whether two references refer to the same address. Finding the shortest possible sequence for a piece of code means finding the shortest sequence for the critical path, which is the longest sequence of dependent instructions.
Global Code Scheduling ….
Control dependences dictate the branches across which code cannot easily be moved. Control dependences arising from loop branches are reduced by unrolling. Global code scheduling can reduce the effect of control dependences arising from conditional non-loop branches by moving code. Since moving code across branches will often affect the frequency of execution of such code, effectively using global code motion requires estimates of the relative frequency of different paths.
Global Code Scheduling ….
Although global code motion cannot guarantee faster code, if the frequency information is accurate, the compiler can determine whether such code movement is likely to lead to faster code.
Global code motion is important since many inner loops contain conditional statements.
Global code scheduling also requires complex trade-offs to make code motion decisions.
A Typical Code Fragment

Source:
A[i] = A[i] + B[i];
if (A[i] == 0)
    B[i] = …;
else
    X;
C[i] = …;

Assembly:
          LD    R4, 0(R1)     ; load A
          LD    R5, 0(R2)     ; load B
          DADDU R4, R4, R5    ; add to A
          SD    R4, 0(R1)     ; store A
          …
          BNEZ  R4, elsepart  ; test A
          …                   ; then part
          SD    …, 0(R2)      ; store to B
          …
          J     join          ; jump over else
elsepart: …                   ; else part
          X                   ; code for X
          …
join:     …                   ; after if
          SD    …, 0(R3)      ; store C[i]
Factors for the Compiler
Consider the factors that the compiler would have to weigh in moving the computation and assignment of B:
What are the relative execution frequencies of the then case and the else case in the branch? If the then case is much more frequent, the code motion may be beneficial. If not, it is less likely, although not impossible, that the code motion is worthwhile.
What is the cost of executing the computation and assignment to B on the branch? It may be that there are a number of empty instruction issue slots in the code on the branch and that the instructions for B can be placed into slots that would otherwise go empty. This opportunity makes the computation of B "free" at least to first order.
Factors for the Compiler (continued)
How will the movement of B change the execution time for the then case? If B is at the start of the critical path for the then case, moving it may be highly beneficial.
Is B the best code fragment that can be moved above the branch? How does it compare with moving C or other statements within the then case?
What is the cost of the compensation code that may be necessary for the else case? How effectively can this code be scheduled, and what is its impact on execution time?
Global code principle
The two methods rely on two principles:
Focus the attention of the compiler on a straight-line code segment representing what is estimated to be the most frequently executed code path.
Unrolling is used to generate the straight-line code, but, of course, the complexity arises in how conditional branches are handled.
In both cases, branches are effectively straightened by choosing and scheduling the most frequent path.
Trace Scheduling : focusing on the critical path
It is useful for processors with a large number of issues per clock, where conditional or predicated execution is inappropriate or unsupported, and where simple loop unrolling may not be sufficient by itself to uncover enough ILP to keep the processor busy. It is a way to organize the global code motion process so as to simplify code scheduling by incurring the costs of possible code motion on the less frequent paths.
Steps to Trace Scheduling
There are two steps to trace scheduling:
Trace selection: tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions; this sequence is called a trace.
Trace compaction: tries to squeeze the trace into a small number of wide instructions. Trace compaction is code scheduling; hence it attempts to move operations as early as it can in the sequence (trace), packing the operations into as few wide instructions (or issue packets) as possible.
Advantage of Trace Scheduling
Its advantage is that it simplifies the decisions concerning global code motion.
In particular, branches are viewed as jumps into or out of the selected trace, which is assumed to be the most probable path.
When code is moved across such trace entry and exit points, additional bookkeeping code will often be needed at the entry or exit point.
Key Assumption:
The key assumption is that the trace is so much more probable than the alternatives that the cost of the book-keeping code need not be a deciding factor:
If an instruction can be moved and thereby make the main trace execute faster; it is moved.
Although trace scheduling has been successfully applied to scientific code with its intensive loops and accurate profile data, it remains unclear whether this approach is suitable for programs that are less simply characterized and less loop-intensive.
In such programs, the significant overheads of compensation code may make trace scheduling an unattractive approach, or, at best, its effective use will be extremely complex for the compiler.
Trace in the Program Fragment
[Figure: the code A[i] = A[i] + B[i]; if (A[i] == 0) … unrolled twice; the most likely path through the unrolled code forms the trace, with trace exits at the conditional branches and a trace entrance where off-trace code (B[i] = …, C[i] = …) rejoins]
Drawback of trace scheduling
One of the major drawbacks of trace scheduling is that the entries and exits into the middle of the trace cause significant complications, requiring the compiler to generate and track the compensation code and often making it difficult to assess the cost of such code.
Superblocks
Superblocks are formed by a process similar to that used for traces, but are a form of extended basic blocks, which are restricted to a single entry point but allow multiple exits. How can a superblock with only one entrance be constructed? The answer is to use tail duplication to create a separate block that corresponds to the portion of the trace after the entry. The superblock approach reduces the complexity of book-keeping and scheduling versus the more general trace generation approach, but may enlarge code size more than a trace- based approach.
Superblock Resulting from Unrolling the Code
[Figure: the unrolled code A[i] = A[i] + B[i]; if (A[i] == 0) … forms a superblock with a single entry at the top and multiple exits; the superblock exits (with n = 2 and n = 1 iterations remaining) branch to tail-duplicated copies of the residual code (B[i] = …, X, C[i] = …)]
Exploiting ILP
Loop unrolling, software pipelining, trace scheduling, and superblock scheduling all aim at trying to increase the amount of ILP that can be exploited by a processor issuing more than one instruction on every clock cycle.
The effectiveness of each of these techniques and their suitability for various architectural approaches are among the hottest topics being pursued by researchers and designers of high-speed processors.
Hardware Support for Exposing More Parallelism at Compile Time
Techniques such as loop unrolling, software pipelining, and trace scheduling can be used to increase the amount of parallelism available when the behavior of branches is fairly predictable at compile time. The first approach discussed here is an extension of the instruction set to include conditional or predicated instructions. Such instructions can be used to eliminate branches, converting a control dependence into a data dependence and potentially improving performance. Such approaches are useful with both hardware-intensive and software-intensive speculation schemes; in either case, predication can be used to eliminate branches.
Conditional or Predicated Instructions
The concept behind conditional instructions is quite simple: an instruction refers to a condition, which is evaluated as part of the instruction's execution. If the condition is true, the instruction is executed normally; if the condition is false, the execution continues as if the instruction were a no-op. Many newer architectures include some form of conditional instruction.
Example: Consider the following code:
if (A == 0) { S = T; }
Assuming that registers R1, R2 and R3 hold the values of A, S and T respectively, show the code for this statement with the branch and with the conditional move.
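A C-level sketch of the two versions (mine, not the textbook's solution): whether the second form actually becomes a conditional-move instruction depends on the target ISA and compiler; the function names are invented.

#include <stdio.h>

int with_branch(int A, int S, int T) {
    if (A == 0)                /* compiles to a compare and a conditional branch */
        S = T;
    return S;
}

int with_cmov(int A, int S, int T) {
    return (A == 0) ? T : S;   /* a compiler may lower this to a conditional-move-style instruction */
}

int main(void) {
    printf("%d %d\n", with_branch(0, 1, 2), with_cmov(5, 1, 2));  /* 2 1 */
    return 0;
}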
Conditional Moves and Predication
Conditional moves are the simplest form of conditional or predicated instructions and, although useful for short sequences, have limitations. In particular, using conditional moves to eliminate branches that guard the execution of large blocks of code can be inefficient, since many conditional moves may need to be introduced. To remedy the inefficiency of conditional moves, some architectures support full predication, whereby the execution of all instructions is controlled by a predicate. When the predicate is false, the instruction becomes a no-op. Full predication allows us to simply convert large blocks of code that are branch dependent. Predicated instructions can also be used to speculatively move an instruction that is time critical, but may cause an exception if moved before a guarding branch.
Example:
Here is a code sequence for a two-issue superscalar that can issue a combination of one memory reference and one ALU operation, or a branch by itself, every cycle:

First instruction slot | Second instruction slot
LW   R1, 40(R2)        | ADD R3, R4, R5
                       | ADD R6, R3, R7
BEQZ R10, L            |
LW   R8, 0(R10)        |
LW   R9, 0(R8)         |
This sequence wastes a memory operation slot in the second cycle and will incur a data dependence stall if the branch is not taken, since the second LW after the branch depends on the prior load. Show how the code can be improved using a predicated form of LW.
Predicate instructions
When we convert an entire code segment to predicated execution or speculatively move an instruction and make it predicated, we remove a control dependence. Correct code generation and the conditional execution of predicated instructions ensure that we maintain the data flow enforced by the branch. To ensure that the exception behavior is also maintained, a predicated instruction must not generate an exception if the predicate is false.
Complication and Disadvantage
The major complication in implementing predicated instructions is deciding when to annul an instruction. Predicated instructions may either be annulled during instruction issue or later in the pipeline before they commit any results or raise an exception. Each choice has a disadvantage. If predicated instructions are annulled in pipeline, the value of the controlling condition must be known early to prevent a stall for a data hazard. Since data-dependent branch conditions, which tends to be less predictable, are candidates for conversion to predicated execution, this choice can lead to more pipeline stalls.
Complication of Predication
Because of this potential for data hazard stalls, no design with predicated execution (or conditional move) annuls instructions early. Instead , all existing processors annul instructions later in the pipeline , which means that annulled instructions will consume functional unit resources and potentially have negative impact on performance . A variety of other pipeline implementation techniques, such as forwarding, interact with predicated instructions, further complicating the implementation.
Factors Limiting Predicated or Conditional Instructions
Predicated or conditional instructions are extremely useful for implementing short alternative control flows, for eliminating some unpredictable branches, and for reducing the overhead of global code scheduling. Nonetheless, the usefulness of conditional instructions is limited by several factors: Predicated instructions that are annulled (I.e., whose conditions are false) still take some processor resources. An annulled predicated instruction requires fetch resources at a minimum, and in most processors functional unit execution time. Therefore, moving an instruction across a branch and making it conditional will slow the program down whenever the moved instruction would not have been normally executed.
Factors Limiting Predicated or Conditional Instructions
Likewise, predicating a control-dependent portion of code and eliminating a branch may slow down the processor if that code would not have been executed. An important exception to these situations occurs when the cycles used by the moved instruction when it is not performed would have been idle anyway. Moving an instruction across a branch or converting a code segment to predicated execution is essentially speculating on the outcome of the branch. Conditional instructions make this easier but do not eliminate the execution time taken by an incorrect guess. In simple cases, where we trade a conditional move for a branch and a move, using conditional moves or predication is almost always better.
Factors Limiting Predicated or Conditional Instructions…
When longer code sequences are made conditional, the benefits are more limited. Predicated instructions are most useful when the predicate can be evaluated early. If the condition evaluation and the predicated instructions cannot be separated (because of a data dependence in determining the condition), then a conditional instruction may result in a stall for a data hazard. With branch prediction and speculation, such stalls can be avoided, at least when the branches are predicted accurately.
Factors Limiting Predicated or Conditional Instructions…
The use of conditional instructions can be limited when the control flow involves more than a simple alternative sequence. For example, moving an instruction across multiple branches requires making it conditional on both branches, which requires two conditions to be specified or additional instructions to compute the controlling predicate. If such capabilities are not present, the overhead of if conversion will be larger, reducing its advantage. Conditional instructions may have some speed penalty compared with unconditional instructions. This may show up as a higher cycle count for such instructions or a slower clock rate overall. If conditional instructions are more expensive, they will need to be used judiciously.
Compiler Speculation With Hardware Support
Many programs have branches that can be accurately predicted at compile time, either from the program structure or by using a profile. In such cases, the compiler may want to speculate either to improve the scheduling or to increase the issue rate. Predicated instructions provide one method to speculate, but they are really more useful when control dependences can be completely eliminated by if conversion. In many cases, we would like to move speculated instructions not only before the branch, but before the condition evaluation, and predication alone cannot achieve this.
Speculate Ambitiously To speculate ambitiously requires three capabilities: The ability of the compiler to find instructions that, with the possible use of register renaming, can be speculatively moved and not affect the program data flow. The ability to ignore exceptions in speculated instructions, until we know that such exceptions should really occur. The ability to speculatively interchange loads and stores, or stores and stores, which may have address conflicts. The first of these is a compiler capability, while the last two require hardware support.
Hardware Support for Preserving Exception Behavior
To speculate ambitiously, we must be able to move any type of instruction and still preserve its exception behavior.
The key to being able to do this is to observe that the results of a speculated sequence that is mispredicted will not be used in the final computation, and such a speculated instruction should not cause an exception.
Methods Investigated for Supporting More Ambitious Speculation
There are four methods that have been investigated for supporting more ambitious speculation without introducing erroneous behavior: The hardware and operating system cooperatively ignore exceptions for speculative instructions. This approach preserves exception behavior for correct programs, but not for incorrect ones. This approach may be viewed as unacceptable for some programs, but it has been used, under program control, as a "fast mode" in several processors. Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.
Methods Investigated for Supporting More Ambitious Speculation
A set of status bits, called poison bits, are attached to the result registers written by speculated instructions when the instructions cause exceptions. The poison bits cause a fault when a normal instruction attempts to use the register.
A mechanism is provided to indicate that an instruction is speculative, and the hardware buffers the instruction result until it is certain that the instruction is no longer speculative.
Schemes
To understand these schemes, we need to distinguish between exceptions that indicate a program error and would normally cause termination, such as a memory protection violation, and those that are handled and normally resumed, such as a page fault. Exceptions that can be resumed can be accepted and processed for speculative instructions just as if they were normal instructions. If the speculative instruction should not have been executed, handling the unneeded exception may have some negative performance effects, but it cannot cause incorrect execution.
Schemes..
The cost of these exceptions may be high, however, and some processors use hardware support to avoid taking such exceptions, just as processors with hardware speculation may take some exceptions in speculative mode, while avoiding others until an instruction is known not to be speculative.
Exceptions that indicate a program error should not occur in correct programs, and the result of a program that gets such an exception is not well defined, except perhaps when the program is running in a debugging mode.
If such exceptions arise in speculated instructions, we cannot take the exception until we know that the instruction is no longer speculative.
Schemes …
In the simplest method for preserving exceptions, the hardware and the operating system handle all resumable exceptions when the exception occurs and simply return an undefined value for any exception that would cause termination. If the instruction generating the terminating exception was not speculative, then the program is in error. Note that instead of terminating the program, the program is allowed to continue, although it will almost certainly generate incorrect results. If the instruction generating the terminating exception is speculative, then the program may be correct and the speculative result will simply be unused.
Schemes …
Thus, returning an undefined value for the instruction cannot be harmful. This scheme can never cause a correct program to fail, no matter how much speculation is done. An incorrect program, which formerly might have received a terminating exception, will get an incorrect result. This is acceptable for some programs, assuming the compiler can also generate a normal version of the program, which does not speculate and can receive a terminating exception.
Schemes …
In such a scheme, it is not necessary to know that an instruction is speculative. Indeed, it is helpful only when a program is in error and receives a terminating exception on a normal instruction; in such cases, if the instruction were not marked as speculative, the program could be terminated. A second approach to preserving exception behavior when speculating introduces speculative versions of instructions that do not generate terminating exceptions, together with instructions to check for such exceptions. This combination preserves the exception behavior exactly.
Schemes …
A third approach for preserving exception behavior tracks exceptions as they occur but postpones any terminating exception, although not in a completely precise fashion. The scheme is simple: a poison bit is added to every register, and another bit is added to every instruction to indicate whether the instruction is speculative. The poison bit of the destination register is set whenever a speculative instruction results in a terminating exception; all other exceptions are handled immediately. If a speculative instruction uses a register with a poison bit turned on, the destination register of the instruction simply has its poison bit turned on. If a normal instruction attempts to use a register source with its poison bit turned on, the instruction causes a fault. In this way, any program that would have generated an exception still generates one, albeit at the first instance where a result is used by an instruction that is not speculative.
Schemes …
Since poison bits exist only on register values and not memory values, stores are never speculative and thus trap if either operand is "poison". One complication that must be overcome is how the OS saves the user registers on a context switch if the poison bit is set. A special instruction is needed to save and reset the state of the poison bits to avoid this problem. The fourth and final approach listed earlier relies on a hardware mechanism that operates like a reorder buffer. In such an approach, instructions are marked by the compiler as speculative and include an indicator of how many branches the instruction was speculatively moved across and what branch action (taken/not taken) the compiler assumed. This last piece of information basically tells the hardware the location of the code block where the speculated instruction originally was.
Schemes …
In practice, the original location of the speculated instruction is marked by a sentinel, which tells the hardware that the earlier speculative instruction is no longer speculative and values may be committed. All instructions are placed in a reorder buffer when issued and are forced to commit in order, as in the hardware speculation approach. (Notice, though, that no actual speculative branch prediction or dynamic scheduling occurs.) The reorder buffer tracks when instructions are ready to commit and delays the "write-back" portion of any speculative instruction.
Schemes …
Speculative instructions are not allowed to commit until the branches that have been speculatively moved over are also ready to commit, or, alternatively, until the corresponding sentinel is reached. At that point, we know whether the speculated instruction should have been executed or not. If it should have been executed and it generated a terminating exception, then we know that the program should be terminated. If the instruction should not have been executed, then the exception can be ignored.
Example 1
Consider the code fragment from an if-then-else statement of the form: if (A==0) A=B; else A=A+4; where A is at 0(R3) and B is at 0(R2). Assume the then clause is almost always executed. Compile the code using compiler-based speculation. Assume R14 is unused and available.
        LD     R1,0(R3)    ; load A
        BNEZ   R1,L1       ; test A
        LD     R1,0(R2)    ; then clause
        J      L2          ; skip else
L1:     DADDI  R1,R1,#4    ; else clause
L2:     SD     R1,0(R3)    ; store A
Example 2
Show how the code below can be rewritten using a speculative load (sLD) and a speculation check instruction (SPECCK) to completely preserve exception behavior. Assume R14 is unused and available.

        LD     R1,0(R3)    ; load A
        BNEZ   R1,L1       ; test A
        LD     R1,0(R2)    ; then clause
        J      L2          ; skip else
L1:     DADDI  R1,R1,#4    ; else clause
L2:     SD     R1,0(R3)    ; store A
Example 3
Consider the code fragment below and show how it would be compiled with speculative instructions and poison bits. Show where an exception for the speculative memory reference would be recognized. Assume R14 is unused and available.

        LD     R1,0(R3)    ; load A
        BNEZ   R1,L1       ; test A
        LD     R1,0(R2)    ; then clause
        J      L2          ; skip else
L1:     DADDI  R1,R1,#4    ; else clause
L2:     SD     R1,0(R3)    ; store A
Hardware support for memory Reference Speculation
Moving loads across stores is usually done when the compiler is certain the addresses do not conflict. A special instruction to check for address conflicts can be included in the architecture. The special instruction is left at the original location of the load instruction (and acts like a guardian), and the load is moved up across one or more stores. When the speculated load is executed, the hardware saves the address of the accessed memory location. If a subsequent store writes to that location before the check instruction, then the speculation has failed.
Hardware support for memory Reference Speculation …
If the location has not been touched, then the speculation is successful. Speculation failure can be handled in two ways. If only the load instruction was speculated, then it suffices to redo the load at the point of the check instruction (which could supply the target register in addition to the memory address). If additional instructions that depend on the load were also speculated, then a fix-up sequence that re-executes all the speculated instructions starting with the load is needed. In this case, the check instruction specifies the address where the fix-up code is located.
Sections To be covered from Chapter 4
Section 4.5 Software versus Hardware based scheduling
Memory Hierarchy
Memory Hierarchy & Cache Memory
Entry Quiz 1.
Primary cache or level 1 cache is implemented in a separate chip
a. True
b. False
Entry Quiz 2.
SRAM is implemented using
a. Flip-Flop
b. Magnetic core
c. Capacitor
d. Non-volatile technology
Entry Quiz 3.
Main memory (200 ns) is slower compared to a register (0.2 ns) by an order of
a. 3
b. 4
c. 1000
d. 10000
Entry Quiz 4.
Virtual Memory is
a. Same as caching
b. Same as associative memory
c. Different from caching
d. Same as disk memory
Entry Quiz 5.
Cache Miss occurs when the
a. Required instruction is not found in the cache
b. Required data is not found in the cache
c. Required instruction or data is not found in the cache
d. Required instruction or data is not found in the main memory
e. For all of the above conditions
Module Objective
To understand
1. Memory requirements of different computers
2. Memory hierarchy and the motivation behind it
3. Moore's Law
4. Principles of Locality
5. Cache Memory and its implementation
6. Cache Performance
7. Terms: Cache, Cache Miss, Cache Hit, Latency, Bandwidth, SRAM, DRAM, by an order of, Direct Mapping, Associative Mapping, Set Mapping, Write Through, Write Allocated, Write Back, Dirty Bit, and Valid Bit
Memory Requirements
In general we would like to have Faster Memory (lower access time or latency) Larger (capacity and bandwidth) Memory Simpler Memory
Memory Requirements - Server, Desktop, and Embedded Devices
Desktop: Lower access time, Larger memory
Server: Lower access time, Higher bandwidth*, Better protection*, Larger memory
Embedded: Lower access time, Simpler memory*
Processor-Memory Gap
Moore’s Law
Transistor density on a chip die doubles roughly every 1.5 years.
Short reference: http://en.wikipedia.org/wiki/Moore's_law
What is Memory Hierarchy and Why?
(Figure: CPU registers (0.25 ns) -> Main Memory (250 ns) -> Storage & I/O devices via bus adapter (2,500,000 ns!).)
Memory Hierarchy & Cache
Cache
Cache is a smaller, faster, and more expensive memory. It improves the throughput/latency of the slower memory next to it in the memory hierarchy. Blocking reads and delaying writes to the slower memory offers better performance. There are two cache memories, L1 and L2, between the CPU and main memory. L1 is built into the CPU. L2 is an SRAM. Data, instructions, and addresses are cached.
Cache Operation
Cache Hit: CPU finds the required data item (or instruction) in the cache. Cache Miss: CPU does not find the required item in the cache.
The CPU stalls while hardware loads the entire block that contains the required data item from memory into the cache. The CPU continues to execute once the cache is loaded.
Spatial Locality Principle of Locality
Hit and Miss
Hit
Miss
Hit Rate Hit Time Miss Rate Miss Penalty
Hit Time << Miss Penalty
Higher Level Lower Level
Cache Performance Program Execution Time
CPU Clock Cycle, Cycle Time, IC – Instruction Count
Program Execution Time (simple model) = (Useful CPU Cycles + Stalled CPU Cycles) x Cycle Time

Stalled CPU Cycles
Stalled CPU Cycles = Number of Cache Misses x Miss Penalty
                   = IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
Note: Miss Penalty is in CPU cycles; IC x (Memory Accesses / Instruction) is the number of cache references.

Separating Read Miss from Write Miss
Stalled CPU Cycles = IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
                   = IC x (Read Accesses / Instruction) x Read Miss Rate x Read Penalty
                   + IC x (Write Accesses / Instruction) x Write Miss Rate x Write Penalty
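A minimal C sketch of these formulas follows; the instruction count and rates are illustrative values, not from the text.

#include <stdio.h>

/* Combined form: IC x (accesses per instruction) x miss rate x miss penalty. */
double stall_cycles(double ic, double acc_per_instr,
                    double miss_rate, double miss_penalty) {
    return ic * acc_per_instr * miss_rate * miss_penalty;
}

/* Split form: read and write misses charged separately. */
double stall_cycles_split(double ic,
                          double read_acc, double read_miss, double read_pen,
                          double write_acc, double write_miss, double write_pen) {
    return ic * (read_acc * read_miss * read_pen +
                 write_acc * write_miss * write_pen);
}

int main(void) {
    double ic = 1000.0;   /* illustrative instruction count */
    printf("combined: %.1f stall cycles\n", stall_cycles(ic, 1.5, 0.02, 25.0));
    printf("split   : %.1f stall cycles\n",
           stall_cycles_split(ic, 1.25, 0.02, 25.0, 0.25, 0.02, 50.0));
    return 0;
}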
Example 1
Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. If the miss penalty is 25 clock cycles and miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
Example 1 … 1.
Execution time for the program in a computer with 100% hits
= (Useful CPU Cycles + Stalled CPU Cycles) x Cycle Time
= (IC x CPI + 0) x Cycle Time
= IC x 1.0 x Cycle Time

Example 1 … 2.
Stall cycles when there are cache misses
= IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
= IC x (1 + 0.5) x 0.02 x 25 x Cycle Time
= IC x 0.75 x Cycle Time
3. Total execution time
= (IC x 1 + IC x 0.75) x Cycle Time
= 1.75 x IC x Cycle Time

Example 1 …
4. Ratio = (1.75 x IC x Cycle Time) / (IC x 1.0 x Cycle Time) = 1.75
The computer with no misses is 1.75 times faster.
Example 2
Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. Miss penalty for read miss is 25 clock cycles and miss penalty for write miss is 50 clock cycles. Miss rate is 2% and out of this 80% is Read miss. How much faster would the computer be if all instructions were cache hits?
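A minimal sketch of a solution, following the same steps as Example 1 and treating the numbers and the 1 + 0.5 access split as the assumptions stated in the problem:

#include <stdio.h>

int main(void) {
    double cpi_ideal  = 1.0;
    double acc_per_i  = 1.0 + 0.5;           /* 1 fetch + 0.5 data accesses */
    double miss_rate  = 0.02;
    double read_share = 0.80, write_share = 0.20;
    double read_pen   = 25.0, write_pen   = 50.0;

    /* Average stall cycles per instruction. */
    double stall_per_i = acc_per_i * miss_rate *
                         (read_share * read_pen + write_share * write_pen);
    double cpi_real    = cpi_ideal + stall_per_i;

    printf("stall cycles/instr  = %.2f\n", stall_per_i);            /* 0.90  */
    printf("CPI with misses     = %.2f\n", cpi_real);               /* 1.90  */
    printf("speedup if all hits = %.2f\n", cpi_real / cpi_ideal);   /* 1.90x */
    return 0;
}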
Cache Operation
(Figure: a cache of C = 16 lines, each holding a tag and a block of K words; main memory holds 2^N blocks of K words each. The CPU presents an address, the line number selects a cache line, and the tag identifies which memory block that line currently holds.)
Elements of Cache Design
Cache Size Block Size Mapping Function Replacement Algorithm Write Policy Write Miss Number of caches Split versus Unified/Mixed Cache
Mapping Functions
(Figure, direct mapping: the CPU address is split into Tag, Line, and Word fields. The Line field selects one of the m cache lines (Line 0 .. Line 3 here); the stored tag is compared with the address tag; a hit loads the word, a miss goes to main memory, shown as Blocks 0 .. n.)
Mapping Function
Direct: the Line value in the address uniquely points to one line in the cache (1 tag comparison).
Set Associative: the Line value in the address points to a set of lines in the cache (typically 2/4/8, so 2/4/8 tag comparisons). This is known as 2/4/8-way set associative.
Associative: the Line value is always 0, which means the address can map to any line in the cache (m tag comparisons, 4 here). Uses Content Addressable Memory (CAM) for comparison. Needs a non-trivial replacement algorithm.
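A minimal C sketch of how a direct-mapped cache splits an address into tag, line, and word fields; the cache geometry (4 lines of 8 words) is an assumption for illustration, not taken from the slide.

#include <stdio.h>
#include <stdint.h>

#define WORD_BITS 3               /* 8 words per line (assumed)  */
#define LINE_BITS 2               /* 4 lines in the cache (assumed) */

int main(void) {
    uint32_t addr = 0x1F7;        /* any word address */
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);

    /* Direct mapping: exactly one line to check, so one tag comparison.
     * A k-way set-associative cache would use the same split, with the
     * line field selecting a set of k lines (k tag comparisons). */
    printf("addr 0x%X -> tag %u, line %u, word %u\n", addr, tag, line, word);
    return 0;
}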
(Figure, 2-way set-associative example: the CPU address splits into Tag, Line/Set, and Word fields. The Set field selects a set of two lines (SET 0 or SET 1); both tags are compared; a hit loads the word, a miss goes to main memory, shown as Blocks 0 .. 127 holding words 0 .. 511.)
Mapping Function Comparison

Cache Type               Hit Ratio                          Search Speed
Direct Mapping           Good                               Best
Fully Associative        Best                               Moderate
N-Way Set Associative    Very good, better as N increases   Good, worse as N increases
Replacement Algorithm
Least Recently Used First In First Out Least Frequently Used Random
Write Policy
Write-through: Information is written to both the cache and the memory.
Write-back: Information is written only to the cache. The content of the cache block is written to main memory only when the block is replaced or the program terminates. A dirty bit is used to indicate that a cache block needs a write back.
Write Miss
Write-Allocate No Write-Allocate
Number of Caches
Primary Cache (CPU) Secondary Cache (SRAM) L3 Cache (Cheaper SRAM)
Split versus Unified/Mixed Cache
Single or two level, unified or split.
(Table: misses per 1000 instructions for various cache sizes; e.g., an 8 KB instruction cache has about 8.1 misses per 1000 instructions.)
Improving Cache Performance
Reducing Penalty, Reducing Misses
Compiler optimizations that attempt to reduce cache misses fall under this category.
Stalled Cycles = IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
Improving Cache Performance Using Compiler
Compilers are built with the following optimizations:
Instructions: reordering instructions to avoid conflict misses
Data: Merging Arrays, Loop Interchange, Loop Fusion, Blocking
Merging Arrays
/* Conflict: two separate arrays */
int key[size];
int value[size];

/* Instead, no conflict: key, value pairs */
struct merge {
    int key;
    int value;
};
struct merge merged_array[size];
Loop Interchange
/* Before: the innermost loop strides through rows of x, giving poor spatial locality */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* Instead: interchange the loops so the innermost loop walks x in row order */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

(Memory layout: x[0][0], x[0][1], ..., x[0][4999], x[1][0], x[1][1], ... at increasing addresses.)
Blocking
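The Blocking slide gives only the name of the technique; below is a minimal, commonly used illustration of loop blocking (tiling) for matrix multiply. The matrix size N and block size B are assumed tuning parameters, not values from the text.

#define N 512
#define B 64            /* block (tile) size, chosen to fit in the cache */

/* y = y + a * x, computed one BxB tile at a time so the tiles of a, x,
 * and y currently being worked on stay resident in the cache. */
void matmul_blocked(double y[N][N], double a[N][N], double x[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += a[i][k] * x[k][j];
                    y[i][j] += r;
                }
}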
Loop Fusion
/* Before: two separate passes over a and b */
for (i = 0; i < 5000; i = i+1)
    a[i] = i;
...
for (i = 0; i < 5000; i = i+1)
    b[i] = b[i] + a[i];

/* Instead: one fused loop, so a[i] is still in the cache when b[i] is updated */
for (i = 0; i < 5000; i = i+1) {
    a[i] = i;
    b[i] = b[i] + a[i];
}
Example 2
Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets) What are the number of hits and misses using no-write allocate versus write allocate? WriteMem[100]; WriteMem[100]; ReadMem[200]; WriteMem[200]; WriteMem[100];
Exit Quiz 1.
In the memory hierarchy, the top layer is occupied by the
a. Fastest and the most expensive memory
b. Slowest and the most expensive memory
c. Highest bandwidth and the fastest memory
d. Slowest and the least expensive memory
Exit Quiz 2.
The memory technology suitable for L2 cache is
a. EPROM
b. DRAM
c. SRAM
d. Flash
Exit Quiz 3.
Server's memory requirements are different from desktop's memory requirements
a. True
b. False
Exit Quiz 4.
Hit Rate is (1 – Miss Rate)
a. True
b. False

Exit Quiz 5.
Gordon Moore's law states that the transistor density
a. Triples every year
b. Doubles every two years
c. Doubles every 1.5 years
d. Doubles every year
Improving Cache Performance
Two options
By Reducing Miss Penalty By Reducing Miss Rate
Reducing Cache Miss Penalty
1. Multilevel caches
2. Critical word first & early restart
3. Priority for read misses over writes
4. Merging write buffers
5. Victim cache
Reducing Miss Penalty - Multilevel Caches (1)
(Figure: three configurations against a 100 ns, 128 MB main memory: a small, fast cache only (2 ns, 16 KB, miss rate 0.04); a larger, slower single cache (10 ns, 512 KB, miss rate 0.02); and Cache 1 plus Cache 2 together (528 KB in total), which is effectively both faster and larger.)
Reducing Miss Rate & Global Miss Rate (1)
(Figure: CPU with L1 (2 ns, 16 KB, miss rate 0.05) backed by L2 (10 ns, 512 KB, miss rate 0.02) and a 100 ns, 128 MB main memory; the combination behaves like a faster and larger cache of 528 KB with a 2.6 ns effective access time.)
Average memory access time (effective) = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
Global Miss Rate = Miss Rate_L1 x Miss Rate_L2
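A minimal sketch of these formulas in C; the hit times and miss rates below are illustrative assumptions, not the figure's values.

#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_l1 = 0.04;   /* cycles, local miss rate (assumed) */
    double hit_l2 = 10.0, miss_l2 = 0.20;
    double mem    = 100.0;                  /* main-memory penalty in cycles     */

    double penalty_l1  = hit_l2 + miss_l2 * mem;      /* L2 services an L1 miss  */
    double amat        = hit_l1 + miss_l1 * penalty_l1;
    double global_miss = miss_l1 * miss_l2;           /* fraction reaching memory */

    printf("AMAT             = %.2f cycles\n", amat);        /* 1 + 0.04*30 = 2.2 */
    printf("global miss rate = %.4f\n", global_miss);        /* 0.008             */
    return 0;
}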
Reducing Miss Penalty – Critical Word First and Early Restart (2)
Critical word first Early restart Lets discuss the following:
How is Miss Penalty improved? Under what conditions this is useful? Is it worth going through all the trouble?
Reducing Miss Penalty – Priority To Read Miss (3)
Read Miss takes priority Write to Memory is put on hold Lets discuss the following:
Is there a problem? How is Miss Penalty improved? How this is done in Write-through? How this is done in Write-back? Under what conditions this is useful? Is it worth going through all the trouble?
Reducing Miss Penalty – Merging Write Buffer (4) Write-through is sent to a write buffer Buffer & Write possible options:
1. Buffer is empty – write the address and data to the buffer
2. Buffer is not full – write or merge the data
3. Buffer is full – stall until the buffer is written to memory.
Note: A block write to memory is more efficient than multiple individual writes.
Lets Discuss the following:
How many different ways do we optimize Miss Penalty in this scheme? Explain the Fig 5.12 in the book When there is no block write (like I/O device), no merging
Reducing Miss Penalty – Fully Associative Victim Cache (5)
Key word is Recycle Another implicit key word is complex (hardware) Aimed at handling conflict misses 1 to 6 victim caches are ideal
Reducing Misses
Misses are due to the 3 Cs: Compulsory, Capacity, and Conflict.
Techniques for reducing misses:
1. Larger lines/blocks
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler techniques
Reducing Cache Misses – Larger Block Size (1)
Larger block improves number of misses up to a point! Points to discuss
Why number of misses starts increasing for larger blocks? Low latency encourages smaller blocks and higher latency encourages larger blocks. Low bandwidth encourages smaller blocks and higher bandwidth encourages larger blocks. Effect on Miss Penalty with larger blocks
What is the other name for 1-way set associative mapping?
Reducing Cache Misses – Larger Caches (2)
Obviously!
Reducing Cache Misses – Higher Associativity (3)
Miss rate improves with associativity Points to discuss
Complexity of set associative mapping versus the improvement. Rule of thumb 1: 8-way associative is as good as fully associative. Rule of thumb 2: 2-way associative is as good as direct mapping. Greater associativity increases Hit time
Reducing Cache Misses – PseudoAssociative Caches (4)
Lets understand this using an 8-way set associative Instruction Cache. Instruction exhibits better locality Each access to instruction cache normally needs 8 comparisons. Using locality predict the next block to access in the set to reduce the number of comparisons. Effect on Hit-time and Cache Miss
(Figure: the ith set of an 8-way set-associative instruction cache, shown as eight tagged blocks.)
Reducing Hit Time – Small and Simple Cache (1)
Tag comparison is complex, specifically in associative mapping Tag checking can be overlapped with data transfer to reduce the hit time. Tag checking in CPU chip, cache in a chip by itself. Provides better hit time and larger cache capacity
Reducing Hit Time – Virtual Cache (2)
Addressing
VA -> Physical Address -> cache
Skip two levels, VA maps to cache Problems:
No page boundary checks Building direct mapping between VA and Cache for every process is not easy.
Reducing Hit Time – Pipelined Cache (3)
Increases the cache throughput not access time
Reducing Hit Time – Trace Cache
Increasing instruction level parallelism Instead of 4 consecutive locations of cache, load the next 4 instruction required by the CPU using trace. Folding branch prediction into cache
Improving Parallelism with Cache
Cache Hit under Cache Miss Cache Miss under Cache Miss Both require non-blocking cache, out of order execution CPU Other methods to improve performance using parallelism:
Hardware pre-fetching of instruction Software (compiler) pre-fetching of data
Preparation: Write-back
Complicates the cache coherency problem
Low memory access overhead
Better cache access time than write-through
Requires higher memory bandwidth if blocking

Write-through
Simplifies the cache coherency problem
High memory access overhead
If the cache is blocking, higher access time
Requires lower memory bandwidth

Note: Write-through does not buy anything extra for a single-processor system due to the absence of a cache coherency problem.
CPU Execution Time
CPU Execution Time = (CPU Cycles + Stalled Cycles) x Cycle Time
Stalled Cycles = Misses x Penalty
Misses are given either as misses/1000 instructions or as misses/memory access (the miss rate). Instruction count, cycles per instruction, and misses are also required to compute CPU execution time.
Average Access Time with Cache
Average Access Time = Hit Time + Miss Rate x Penalty
Multi-level Cache:
Avg Access Time = Hit Time_L1 + Miss Rate_L1 x Penalty_L1
Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Penalty_L2
Addressing
(Figure: the CPU address splits into Tag, Set, and Word fields. The Set field selects a set (Set 0 or Set 1) of two lines; both tags are compared; a hit loads the word, a miss goes to main memory, shown as Blocks 0 .. 127 holding words 0 .. 511.)
Assignment I – Due same day next week
Mapping functions Replacement algorithms Write policies Write Miss policies Split Cache versus Unified Cache Primary Cache versus Secondary Cache Compiler cache optimization techniques with examples
Assignment II - Due same day next week
Multilevel Cache Cache Inclusion/Exclusion Property Thumb rules of cache Compiler pre-fetch Multi-level Caching + one another Miss Penalty Optimization technique Two miss Rate Optimization Techniques
Assignment III - Due 2nd class of next week
All odd numbered problems from cache module of your text book.
Assignment IV - Due 2nd class of next week
All even numbered problems from cache module of your text book.
CPU Execution Time & Average Access Time
(Figure: CPU - Cache (1 CC) - Memory (100 CC); with a multi-level cache: CPU - L1 (1 CC) - L2 (10 CC) - Memory (100 CC).)
Memory Hierarchy
Main Memory
Main Memory
Module Objective
To understand Main memory latency and bandwidth Techniques to improve latency and bandwidth
Memory Hierarchy & Cache
Main Memory – Cache – I/O
(Figure: CPU and cache (0.25 ns) - Main Memory (250 ns) - Storage & I/O devices via bus adapter (2,500,000 ns!).)
Cache prefers low latency main memory I/O & Multiprocessors prefer high bandwidth/throughput main memory
Main Memory Access Time
(Figure: CPU/cache connected to main memory by an address bus and a data bus; 4 CC to send the address, 56 CC memory access time, 4 CC to transfer the data.)
Access time per word = 4 + 56 + 4 = 64 CC. One word is 8 bytes, so the effective rate is about 1 bit/CC. (CC – clock cycle)
Improving Memory Performance
Improving Latency ( time to access 1 memory unit - word) Improving bandwidth (bytes accessed in unit time)
Improving Memory Bandwidth
(Figure: three organizations. Simple design: CPU - 64-bit bus - cache - 64-bit bus - memory; a cache block is 4 words, one word is 8 bytes. Wider bus: the cache-memory path is widened to a full block (4 x 64 bits). Interleaved: four memory banks behind a 64-bit bus, with word addresses spread across the banks: Bank 0 holds 0, 4, 8, 12; Bank 1 holds 1, 5, 9, 13; Bank 2 holds 2, 6, 10, 14; Bank 3 holds 3, 7, 11, 15.)
Topics: bandwidth, latency, penalty, interleaving factor, address allocation with interleaving.
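A minimal C sketch of the interleaved address allocation shown above: with 4-way interleaving, consecutive word addresses go to consecutive banks. The bank count is taken from the figure; everything else is illustrative.

#include <stdio.h>

#define NUM_BANKS 4

int main(void) {
    for (unsigned addr = 0; addr < 16; addr++) {
        unsigned bank   = addr % NUM_BANKS;   /* which bank holds this word   */
        unsigned offset = addr / NUM_BANKS;   /* position within that bank    */
        printf("word %2u -> bank %u, offset %u\n", addr, bank, offset);
    }
    return 0;
}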
Problem 1 Interleaving factor
What is the optimal interleaving factor if the memory cycle is 8 CC (1+6+1)? Assume 4 banks
(Figure: a cycle-by-cycle timeline, cycles 1-20, showing banks b1-b4 servicing the 1st, 2nd, and 3rd read requests; a 3rd request issued too early cannot be accepted and succeeds only once its bank is free.)
Problem 2 (page 452 in your text book)
Block size = 1 word Memory bus size = 1 word Miss rate = 3% CPU Memory access per instruction = 1.2 Cache miss Penalty = 64 CC Avg Cycles per instruction = 2
Simple Memory Cache 64 CC
Assume 1000 instructions in your Program If no miss then execution time is 2000 CC One instruction needs 1.2 memory accesses, 1000 instruction -> 1200 accesses. If miss rate is 3% then number of misses for 1200 accesses is 36. the execution time is = 2000+36x64 = 4304 CC Average cycles per instruction = 4304/1000 = 4.3
Problem 2 (wider bus – 2 words)
Block size = 4 word Memory bus size = 2 word Miss rate = 2.5% CPU Memory access per instruction = 1.2 Cache miss Penalty = 128 CC Avg Cycles per instruction = 2
Interleaved Memory Cache 64 CC
Assume 1000 instructions in your Program If no miss then execution time is 2000 CC One instruction needs 1.2 memory accesses, 1000 instruction -> 1200 accesses. If miss rate is 2.5% then number of misses for 1200 accesses is 30. the execution time is = 2000+30x128 = 5840 CC Average cycles per instruction = 5840/1000 = 5.84
Problem 2 (2-way interleaving)
Block size = 4 word Memory bus size = 1 word Miss rate = 2.5% CPU Memory access per instruction = 1.2 Cache miss Penalty = 68x2 CC Avg Cycles per instruction = 2
Interleaved Cache 64 CCMemory
Assume 1000 instructions in your program. If there is no miss, then the execution time is 2000 CC. One instruction needs 1.2 memory accesses, so 1000 instructions -> 1200 accesses. If the miss rate is 2.5%, then the number of misses for 1200 accesses is 30, and the execution time is 2000 + 30 x 68 x 2 = 6080 CC. Average cycles per instruction = 6080/1000 = 6.08.
DRAM (Dynamic Random Access Memory)
Address Multiplexing RAS and CAS Dynamic refreshing and implications
Data must be written back after a READ (DRAM reads are destructive).
Amdahl's rule of thumb: a machine with 1000 MIPS should have 1000 MB of memory.
Read your text book for DRAM, SRAM, RAMBUS, and Flash.
SRAM (Static Random Access Memory)
Uses six transistors per bit. Static:
NO NEED to write data after a READ
Comparison versus DRAM
More transistors
No Refresh
Less capacity (1/4 to 1/8 of DRAM), 8 to 10 times more expensive, 8 to 16 times faster
Used in Cache instead of main memory
Flash
Similar to EEPROM Ideal choice for embedded systems
Comparison versus EEPROM
Finer and simpler control Better capacity
WRITE operation – entire block to be erased and replaced.
Low power requirement, no moving parts Faster READ time, slower WRITE time
Writes are 50 to 100 times slower than READ
NOR type flash and NAND type flash
Virtual Memory
(Figure: virtual pages A, B, C, D in a 0K-28K virtual address space are mapped either to the smaller 0K-12K physical main memory or held on disk.)
Virtual Memory versus First-Level Cache

Parameter         First-level cache            Virtual memory
Block/page size   16-128 B                     4 KB-64 KB
Hit time          1-3 CC                       50-150 CC
Miss penalty      8-150 CC                     1M-10M CC
  Access time     6-130 CC                     0.8M-8M CC
  Transfer time   2-20 CC                      0.2M-2M CC
Miss rate         0.1-10%                      0.00001-0.001%
Address mapping   25-45 bit physical address   32-64 bit virtual address
                  to 14-20 bit cache address   to 25-45 bit physical address
Memory Hierarchy Questions
Where can a block be placed? How is a block found if it is in? Which block should be replaced? What happens on a write?
Segmentation versus Paging (code and data)

Remark                    Page                            Segment
Words per address         One                             Two (segment and offset)
Programmer visible        Invisible to the programmer     Visible to the programmer
Replacing a block         Trivial                         Hard
Memory use inefficiency   Internal fragmentation          External fragmentation
Efficient disk traffic    Yes (easy to tune page size)    Not always (small segments)
CPU Execution Time
CPU Execution Time = (CPU Cycles + Stalled Cycles) x Cycle Time
Stalled Cycles = Misses x Penalty
Misses are given either as misses/1000 instructions or as misses/memory access (the miss rate). Instruction count, cycles per instruction, and misses are also required to compute CPU execution time.
Average Access Time with Cache
Average Access Time = Hit Time + Miss Rate x Penalty
Multi-level Cache:
Avg Access Time = Hit Time_L1 + Miss Rate_L1 x Penalty_L1
Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Penalty_L2
I/O Systems
Objective
To understand mainly disk storage technologies.
Storage Systems
Individual Disks SCSI Subsystem RAID Storage Area Networks (NAS & SAN)
Disk Storage
Disk storage slower than Memory. Disk offers better volume at a lower unit price. Nicely fit into the Virtual Memory concept. Critical piece of the performance puzzle.
Courtesy: www.shortcourses.com/ choosing/storage/06.htm
Disk System
Disk pack & cache, disk embedded controller, SCSI host adapter, OS (kernel/driver), user program.
(Figure: the host's SCSI adapter sits on the host I/O bus and talks over the SCSI bus, terminated at the far end, to the embedded controller of each disk.)
Individual Disks
Reference: http://www.stjulians.com/cs/diskstoragenotes.html
Components of Disk Access Time
Seek Time Rotational Latency Internal Transfer Time Other Delays That is,
Avg Access Time = Avg Seek Time + Avg Rotational Delay + Transfer Time + Other overhead
Problem (page 684)
Seek time = 5 ms (average)
RPM = 10,000
Transfer rate = 40 MB/sec
Other delays = 0.1 ms
Sector size = 512 bytes

Average Access Time =
Average seek time (5 ms)
+ average rotational delay (time for 1/2 revolution)
+ transfer time (512 / (40 x 10^6) s)
+ other overhead (0.1 ms)
= 5 + (0.5 x 60 x 10^3 / 10000) + (512 x 10^3 / (40 x 10^6)) + 0.1 ms
= 5 + 3 + 0.0128 + 0.1
= 8.1128 ms
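A minimal C sketch of the same calculation; the values follow the worked example above.

#include <stdio.h>

/* Average disk access time = avg seek + avg rotational delay
 *                            + transfer time + controller overhead. */
int main(void) {
    double seek_ms     = 5.0;
    double rpm         = 10000.0;
    double xfer_rate   = 40e6;            /* bytes per second */
    double sector      = 512.0;           /* bytes            */
    double overhead_ms = 0.1;

    double rot_ms  = 0.5 * (60.0 / rpm) * 1000.0;   /* half a revolution */
    double xfer_ms = sector / xfer_rate * 1000.0;

    printf("access time = %.4f ms\n",
           seek_ms + rot_ms + xfer_ms + overhead_ms);   /* about 8.11 ms */
    return 0;
}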
RAID
RAID stands for Redundant Array of Inexpensive/Individual Disks Uses a number of little disks instead of one large disk. Six+ Types of RAID (0-5).
Terminologies
Mean Time Between Failure (MTBF) Mean Time To Data Loss (MTTDL) Mean Time To Data Inaccessability (MTTDI)
RAID-0 Striping
(Figure: a logical volume of chunks 01-12 is striped with a given stripe width across n independent disks; with four disks, disk 1 holds 01, 05, 09, disk 2 holds 02, 06, 10, disk 3 holds 03, 07, 11, and disk 4 holds 04, 08, 12.)
n = number of independent disks
RAID-0 Performance
Throughput: best case, nearly n x the single-disk value. Utilization: worst case, nearly (1/n) x the single-disk value. Data reliability: r^n, where r is the reliability of one disk (r ≤ 1). Sequential access: fast. Random access: multithreaded random access offers better performance. When r = 0.8 and n = 2, reliability is 0.64.
RAID-1 Mirroring
Issue Addressed: RAID-0 Reliability Problem. Shadowing or mirroring is used. Data is not lost when a disk goes down. One or more disks can be used to mirror primary disk. Writes are posted to primary and shadowing disks. Read from any of them.
(Figure: chunks 01-03 are written identically to the primary disk and the mirror disk.)
RAID-1 Performance
Reliability is improved with mirroring:
(1- (1-r)(1-r))
Example: when r is 0.8, the reliability of RAID-1 is .96. Writes are more complex – must be committed to primary and all shadowing disks. Writes much slower due to atomicity requirement. Expensive – due to 1-to-1 redundancy. Throughput: Same as single disk value Utilization: ½ of single disk value
RAID 0+1 - Striping and Mirroring
(Figure: a RAID 1 mirror of two RAID 0 arrays; each RAID 0 array stripes chunks 01-06 across n disks, and the two arrays hold identical data.)
Performance RAID-0+1
Let the reliability of a RAID 0 sub-tree be R'. Then the reliability of the RAID 1 tree = 1 – (1 – R')(1 – R').
Reliability R' is R' = r^2 (where the reliability of a single disk is r). Throughput is the same as RAID-0, but with 2 x n disks. Utilization is lower than RAID-0 due to mirroring. Writes are marginally slower due to atomicity. When r = 0.9, R' = 0.81, and R = 1 – (0.19)^2 ≈ 0.96.
RAID 1+0 - Mirroring and Striping
(Figure: a RAID 0 stripe across RAID 1 mirrored pairs; each mirrored pair holds identical copies of its chunks, e.g., one pair holds 01, 03, 05 and the other holds 02, 04, 06.)
Performance RAID-1+0
Let the reliability of a RAID 1 sub-tree be R'. Then the reliability of the RAID 0 tree = (R')^2.
Reliability R' is R' = 1 – (1 – r)^2 (where the reliability of a single disk is r). Throughput is the same as RAID-0, but with 2 x n disks. Utilization is lower than RAID-0 due to mirroring. Writes are marginally slower due to atomicity. When r = 0.9, R' = 0.99, and R = (0.99)^2 ≈ 0.98.
RAID-2 Hamming Code Arrays
Low commercial interest due to complex nature of Hamming code computation.
RAID-3 Striping with Parity
(Figure: single bits or words of the logical storage (01-12) are striped across four data disks, with disk 1 holding 01, 05, 09, disk 2 holding 02, 06, 10, disk 3 holding 03, 07, 11, disk 4 holding 04, 08, 12, and a dedicated parity disk holding P0, P1, P2.)
RAID-3 Operation
Based on the principle of a reversible parity computation, where
Parity P = C0 ⊕ C1 ⊕ … ⊕ Cn-1 ⊕ Cn
and a missing stripe Cm is recovered as
Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm-1 ⊕ Cm+1 ⊕ … ⊕ Cn-1 ⊕ Cn
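A minimal C sketch of this XOR parity scheme; the number of data disks and the chunk size are illustrative assumptions.

#include <stdio.h>
#include <stdint.h>

#define NDATA 4          /* data disks (assumed)   */
#define CHUNK 8          /* bytes per stripe unit  */

/* Parity is the XOR of all data chunks. */
void compute_parity(uint8_t data[NDATA][CHUNK], uint8_t parity[CHUNK]) {
    for (int b = 0; b < CHUNK; b++) {
        parity[b] = 0;
        for (int d = 0; d < NDATA; d++)
            parity[b] ^= data[d][b];
    }
}

/* A missing chunk is recovered by XORing the parity with the survivors. */
void rebuild(uint8_t data[NDATA][CHUNK], uint8_t parity[CHUNK],
             int failed, uint8_t out[CHUNK]) {
    for (int b = 0; b < CHUNK; b++) {
        out[b] = parity[b];
        for (int d = 0; d < NDATA; d++)
            if (d != failed)
                out[b] ^= data[d][b];
    }
}

int main(void) {
    uint8_t data[NDATA][CHUNK] = {{1,2,3,4,5,6,7,8},
                                  {9,8,7,6,5,4,3,2},
                                  {0,1,0,1,0,1,0,1},
                                  {7,7,7,7,7,7,7,7}};
    uint8_t parity[CHUNK], rebuilt[CHUNK];
    compute_parity(data, parity);
    rebuild(data, parity, 2, rebuilt);              /* recover disk 2 */
    printf("recovered byte 1 = %u (expected %u)\n", rebuilt[1], data[2][1]);
    return 0;
}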
RAID-3 Performance
RAID-1's 1-to-1 redundancy issue is addressed by one parity disk for n data disks. Less expensive than RAID-1. The rest of the performance is similar to RAID-0. RAID-3 can withstand the failure of one of its disks.
Reliability = P(all disks working) + P(exactly one disk failed)
            = r^n + n x r^(n-1) x (1 - r)
When r = 0.9 and n = 5:
            = 0.9^5 + 5 x 0.9^4 x (1 - 0.9)
            = 0.59 + 5 x 0.656 x 0.1
            = 0.59 + 0.33
            ≈ 0.92
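A minimal C sketch of the reliability models used on these slides, with r = 0.9 as an illustrative disk reliability:

#include <stdio.h>
#include <math.h>

int main(void) {
    double r = 0.9;      /* reliability of a single disk (assumed) */
    int    n = 5;

    double raid0 = pow(r, n);                                  /* all n disks up    */
    double raid1 = 1.0 - (1.0 - r) * (1.0 - r);                /* primary + mirror  */
    double raid3 = pow(r, n) + n * pow(r, n - 1) * (1.0 - r);  /* at most 1 failure */

    printf("RAID-0 (n=%d): %.3f\n", n, raid0);   /* 0.590      */
    printf("RAID-1       : %.3f\n", raid1);      /* 0.990      */
    printf("RAID-3 (n=%d): %.3f\n", n, raid3);   /* about 0.92 */
    return 0;
}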
RAID-4 Performance
Similar to RAID-3, but supports larger chunks. Performance measures are similar to RAID-3.
RAID-5 (Distributed Parity)
(Figure: chunks 01-12 of the logical storage are striped across the disks, with the parity chunks P0, P1, P2 rotated among all disks rather than kept on a dedicated parity disk.)
RAID-5 Performance
In RAID-3 and RAID-4, Parity Disk is a bottleneck for Write operations. This issue is addressed in RAID-5.
Reading Assignment
Optical Disk RAID study material Worked out problems from the text book: p 452, p 537, p 539, p561, p 684, p 691, p 704
Amdahl's Speedup for Multiprocessors
Speedup = 1 / (Fraction_e / Speedup_e + (1 – Fraction_e))
Speedup_e – number of processors
Fraction_e – fraction of the program that runs in parallel on Speedup_e processors
Assumption: the program either runs in fully parallel (enhanced) mode, making use of all the processors, or in non-enhanced mode.
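A minimal C sketch of this formula; the fractions and processor counts are illustrative, not from the text.

#include <stdio.h>

/* Speedup = 1 / (f/p + (1 - f)), where f is the parallel fraction and
 * p the number of processors (Speedup_e above). */
double speedup(double f, double p) {
    return 1.0 / (f / p + (1.0 - f));
}

int main(void) {
    printf("f=0.90, p=16: %.2f\n", speedup(0.90, 16));   /* about 6.4  */
    printf("f=0.99, p=16: %.2f\n", speedup(0.99, 16));   /* about 13.9 */
    return 0;
}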
Multiprocessor Architectures
Single Instruction Stream, Single Data Stream (SISD):
Single Instruction Stream, Multiple Data Stream (SIMD)
Multiple Instruction Streams, Single Data Stream (MISD)
Multiple Instruction Streams, Multiple Data Streams (MIMD)
Classification of MIMD Architectures
Shared Memory:
Centralized shared memory architecture OR Symmetric (shared-memory) Multiprocessors (SMP) OR Uniform Memory Access (UMA). Distributed shared memory architecture OR NonUniform Memory Access (NUMA) architecture.
Message Passing:
Multiprocessor Systems based on messaging
Shared Memory versus Message Passing
Shared Memory:
1. Compatibility with the well-understood shared-memory mechanism used in centralized multiprocessor systems.
2. Simplifies compiler design.
3. No need to learn a messaging protocol.
4. Low communication overhead.
5. Caching can be used to improve latency.
Message Passing:
1. Simpler hardware compared to scalable shared memory.
2. Communication is explicit, which makes it easier to understand.
3. Improved modularity.
4. No need for expensive and complex synchronization mechanisms similar to the ones used in shared memory.
Two Types of Shared Memory Architectures
(Figure: Centralized shared-memory architecture (SMP): several processors, each with its own cache, share one main memory and I/O system over a common bus. Distributed shared-memory architecture: each node pairs a processor+cache with local memory and I/O, and the nodes communicate over an interconnection network.)
Symmetric versus Distributed Memory MP
SMP: uses shared memory for inter-process communication.
Advantages: close coupling due to shared memory; sharing of data between processors is fast.
Disadvantages: scaling is limited because memory becomes a bottleneck; high and unpredictable memory latency.
Distributed: uses message passing for inter-process communication.
Advantages: low memory latency; scales better than SMP.
Disadvantages: control and management are more complex due to distributed memory.
Performance Metrics
Communication bandwidth Communication Latency: Ideally latency is as low as possible. Communication latency is = Sender Overhead + Time of flight + Transmission Time + Receiver Overhead Communication Latency Hiding
Cache Coherence Problem
Time   Event                  Cache contents   Cache contents   Memory contents
                              for CPU A        for CPU B        for location X
0                                                               1
1      CPU A reads X          1                                 1
2      CPU B reads X          1                1                1
3      CPU A stores 0 in X    0                1                0
Cache Coherence A Memory System is coherent if 1.
A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. Defines coherent view of memory.
2.
Write to the same location are serialized; that is two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
Coherency and Consistency
Coherency and consistency are complementary.
Coherency defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to other memory locations.
Features supported by Coherent Multiprocessors
Migration: Shared data in a cache can be moved to another cache directly (without going through the main memory). This is referred to as migration. It is used in a transparent fashion. It reduces both latency (going to another cache every time the data item is accessed) and precious memory bandwidth.
Replication: Cache also provides replication for shared data items that are being simultaneously read, since the caches make a copy of the data item in the local cache. Replication reduces both latency of access and contention for a read shared data item.
Migration and Replication in a Centralized Shared-Memory Architecture (SMP)
(Figure: each processor has its own cache with a cache controller (cc); the caches, main memory, and I/O systems are connected by a shared bus.)
Cache Coherent (CC) Protocols
CC protocol implement cache coherency Two types:
Snooping (Replicated): There are multiple copies of the sharing status. Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, and no centralized state is kept. Directory based (logically centralized): There is only one copy of the sharing status of a block of physical memory. All the processors use this one copy. This copy could be in any of the participating processors.
Snooping Protocol in a Centralized Shared-Memory Architecture (SMP)
(Figure: each processor's cache controller (cc) snoops the shared bus that connects the caches, main memory, and I/O systems.)
Two ways: 1. Write Invalidation 2. Write Broadcast
Invalidation with a Snooping Protocol

Processor activity       Bus activity           Contents of      Contents of      Memory contents
                                                CPU A's cache    CPU B's cache    for location X
                                                                                  0
CPU A reads X (block)    Cache miss for X       0                                 0
CPU B reads X (block)    Cache miss for X       0                0                0
CPU A writes 1 to X      Invalidation for X     1                                 0
CPU B reads X            Cache miss for X       1                1                1
Write Broadcast with a Snooping Protocol

Processor activity       Bus activity             Contents of      Contents of      Memory contents
                                                  CPU A's cache    CPU B's cache    for location X
                                                                                    0
CPU A reads X (block)    Cache miss for X         0                                 0
CPU B reads X (block)    Cache miss for X         0                0                0
CPU A writes 1 to X      Write broadcast for X    1                1                1
CPU B reads X            (no bus activity)        1                1                1
Reading Assignment
Section 7.4 (Reliability, Availability, & Dependability) in your text book Page 554 and 555 Section 7.11 I/O Design – attempt all the 5 problems in this section.
Invalidation versus Write Update (Broadcast)
Multiple writes to the same data item with no intervening reads require multiple write broadcast in an update protocol, but only one initial invalidation in a write invalidate protocol.
With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol.
An invalidation protocol works on cache blocks, while an update protocol must work on individual words.
The delay between writing a word in one processor and reading the written value in another processor is usually less in a write update scheme, since written data are immediately updated in the reader’s cache. By comparison, in an invalidation protocol, the reader is invalidated first, then later reads the data and is stalled until a copy can be read and returned to the processor.
Coherent Misses
Write Invalidate causes Read cache misses
True Miss: If an item X in a block is made exclusive by write invalidate and the read of the same item in another processor causes read miss then it is a true miss. False Miss: If non-exclusive items in the same block cause the miss then it is a false miss.
Use of Valid, Shared, and Dirty bits
Valid bit: Every time a block is loaded into a cache from memory, the tag for the block is saved in the cache and the valid bit is set to TRUE. A write to the same block in a different processor may reset this valid bit due to write invalidate. Thus, when a cache block is accessed for READ or WRITE, the tag should match AND the value of the valid bit should be TRUE. If the tag matches but the valid bit is reset, then it is a cache miss.
Shared bit: When a memory block is loaded into a cache block for the first time the shared bit is set to FALSE. When some other cache loads the same block, it is turned into TRUE. When this block is updated, write invalidate uses the value of shared bit to decide whether to send write-invalidate message or not. If the shared bit is set then an invalidate message is sent, otherwise not.
Dirty bit: Dirty bit is set to FALSE when a block is loaded into a cache memory. It is set to TRUE when the block is updated the first time. When another processor wants to load this block, this block is migrated to the processor instead of loading from the memory.
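A minimal C sketch of the per-block bookkeeping just described; the field widths and 64-byte block size are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

/* One cache block's state. The coherence protocol consults these bits:
 *   valid  - a tag match counts as a hit only if valid is true
 *   shared - an invalidate is sent on a write only if the block is shared
 *   dirty  - the block is migrated/written back instead of re-read from memory */
struct cache_line {
    uint32_t tag;
    bool     valid;
    bool     shared;
    bool     dirty;
    uint8_t  data[64];    /* block payload (assumed size) */
};

/* Hit test used on every access: the tag must match AND the line must be valid. */
static inline bool is_hit(const struct cache_line *l, uint32_t tag) {
    return l->valid && l->tag == tag;
}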
Summary of Snooping Mechanism

Request      Source      State of addressed cache block   Function and explanation
Read hit     processor   shared or exclusive              Read data in cache.
Read miss    processor   invalid                          Place read miss on bus.
Read miss    processor   shared                           Address conflict miss: place read miss on bus.
Read miss    processor   exclusive                        Address conflict miss: write back block, then place read miss on bus.
Write hit    processor   exclusive                        Write data in cache.
Write hit    processor   shared                           Place write miss on bus.
Write miss   processor   invalid                          Place write miss on bus.
Write miss   processor   shared                           Address conflict miss: place write miss on bus.
Write miss   processor   exclusive                        Address conflict miss: write back block, then place write miss on bus.
Read miss    bus         shared                           No action; allow memory to service the read miss.
Read miss    bus         exclusive                        Attempt to share data: place cache block on bus and change state to shared.
Write miss   bus         shared                           Attempt to write shared block: invalidate the block.
Write miss   bus         exclusive                        Attempt to write a block that is exclusive elsewhere: write back the cache block and make its state invalid.
State Transition (requests from the CPU)
(Figure: cache state transitions among Invalid, Shared (read only), and Exclusive (read/write). A CPU read of an invalid block places a read miss on the bus and moves to Shared; a CPU write places a write miss on the bus and moves to Exclusive; read hits in Shared and read/write hits in Exclusive stay put; address conflict misses leaving Exclusive write back the block before placing the new miss on the bus.)
State Transition (requests from the bus)
(Figure: a write miss on the bus for a block held Shared invalidates it; a write miss for a block held Exclusive forces a write back, aborts the memory access, and invalidates the block; a read miss for a block held Exclusive forces a write back and a transition to Shared.)
Some Terminologies
Polling: A process periodically checks if there is a message that it needs to handle. This method of awaiting a message is called polling. Polling reduces processor utilization. Interrupt: A process is notified when a message arrives using a built-in interrupt mechanism. Interrupts increase processor utilization in comparison to polling. Synchronous: A process sends a message and waits for the response before sending another message or carrying out other tasks. This way of waiting is referred to as synchronous communication. Asynchronous: A process sends a message and continues to carry out other tasks while the requested message is processed. This is referred to as asynchronous communication.
Communication Infrastructure
Multiprocessor Systems with shared memory can have two types of communication infrastructure:
Shared Bus Interconnect Interconnect Shared Bus
Directory Based Protocol
(Figure: nodes Node 1, Node 2, …, Node n, each with its own directory, connected by an interconnect.)
State of the Block
Shared: One or more processors have the block cached, and the value in memory is up to date. Uncached: No processor has a copy of the block. Exclusive: Exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. The processor is called the owner of the block.
Local, Remote, Home Node
Local Node: This is the node where the request originates. Home Node: This is the node where the memory location and the directory entry of an address reside. Remote Node: A remote node is a node that has a copy of the cache block.
Shared State Operation
Read-miss: The requesting processor is sent the requested data from memory, and the requestor is added to the sharing set. Write-miss: The requesting processor is sent the value. All the processors in the Sharers set are sent invalidate messages, and the Sharers set is updated to contain only the identity of the requesting processor. The state of the block is made exclusive.
Uncached State Operation
Read-miss: The requesting processor is sent the requested data from memory and the requestor is made the only sharing node. The state of the block is made shared. Write-miss: The requesting processor is sent the requested data and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. Sharer indicates the identity of the owner.
Exclusive State Operation
Read-miss: The owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
Data write back: The owner processor is replacing the block and therefore must write it back. This write back makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached and the Sharer is empty.
Write-miss: The block has a new owner. A message is sent to the old owner causing the cache to invalidate the block and send the value to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharer is set to the identity of the new owner, and the state of the block remains exclusive.
Cache State Transition (Directory Protocol)
(Figure: directory state transitions for a block among the states Uncached, Shared (read only), and Exclusive (read/write). A read miss on an uncached or shared block returns a data value reply with Sharers = {P} or Sharers += {P}; a write miss returns the data, invalidates any sharers, sets Sharers = {P}, and moves to Exclusive; a data write back from the owner returns the block to Uncached with Sharers = {}; a read miss on an exclusive block fetches the data from the owner, replies, and moves to Shared with Sharers += {P}.)
True Sharing Miss & False Sharing Miss

Time   Processor P1   Processor P2
1      Write X1
2                     Read X2
3      Write X1
4                     Write X2
5      Read X2
Synchronization
We need a set of hardware primitives with the ability to atomically read and modify a memory location.
Example 1: Atomic Exchange
Interchanges a value in a register for a value in memory. You could build a lock with this: if the memory location contains 1, then the lock is held.
(Figure: register A holding 0 is exchanged with a main-memory location holding 1.)
Other Operations
Test-and-Set Fetch-and-Increment
Load Linked & Set Conditional
Load Linked: Load linked is a primitive operation that loads the content of a specified location into cache. Set Conditional: Set conditional operation is related to Load Linked operation by operating on the same memory location. This operation sets the location to a new value only if the location contains the same value read by Load Linked, otherwise it will not.
Operation of LL & SC

try:  LL     R2,0(R1)    ; load linked
      DADDUI R3,R2,#1    ; increment
      SC     R3,0(R1)    ; store conditional
      BEQZ   R3,try      ; branch if store fails
1. Address of the memory location loaded by LL is kept in a register 2. If there is an interrupt or invalidate then this register is cleared 3. SC checks if this register is zero. a. If it is then it fails b. Otherwise it simply Stores the content of R3 in that memory address
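For comparison, a minimal C11 sketch of the same retry loop using atomic compare-exchange, which plays the role of the LL/SC pair here (this is an analogue, not the hardware mechanism itself):

#include <stdatomic.h>
#include <stdio.h>

/* Read the location, compute the new value, and retry if another
 * processor changed it in between (like the LL ... SC ... BEQZ loop). */
void atomic_increment(atomic_int *p) {
    int old = atomic_load(p);                                  /* like LL */
    while (!atomic_compare_exchange_weak(p, &old, old + 1))    /* like SC */
        ;   /* on failure, 'old' is reloaded with the current value; retry */
}

int main(void) {
    atomic_int counter = 0;
    atomic_increment(&counter);
    printf("%d\n", atomic_load(&counter));   /* prints: 1 */
    return 0;
}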
Single Processor System
Concurrent execution is the source of the problem (X = 0 in memory, two processes).
(Figure: Process 1 executes READ X and BNEZ TOP, is then preempted; Process 2 executes READ X, BNEZ TOP, and WRITE 1; when Process 1 is rescheduled it also executes WRITE 1, so both processes believe they hold the lock.)
With Interrupt Mask (X = 0 in memory)
(Figure: each process wraps the READ X / BNEZ TOP / WRITE 1 sequence with MASK and UNMASK, so on a single processor it cannot be preempted between testing and setting the lock.)
Multiprocessor System
Masking the interrupt will not work (X = 0 in shared memory).
(Figure: Processor 1 and Processor 2 each run the READ X / BNEZ TOP / MASK / WRITE 1 / UNMASK sequence; masking interrupts on one processor does not stop the other processor from reading and writing X at the same time.)
Multiprocessor System
Uses Cache Coherence
Bus Based Systems or Interconnect Based System
Coherence system arbitrates writes (Invalidation/write distribute) Thus, serializes writes!
Spin Locks
Locks that a processor continuously tries to acquire, spinning around a loop until it succeeds. A spin lock is used when a lock is expected to be held for a short amount of time.
Spin Lock implementation
lockit:  LD     R2,0(R1)    ; load of lock
         BNEZ   R2,lockit   ; not available - spin
         DADDUI R2,R0,#1    ; load locked value
         EXCH   R2,0(R1)    ; swap
         BNEZ   R2,lockit   ; branch if lock wasn't 0
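A minimal C11 analogue of this spin lock, built on atomic exchange (a sketch for illustration, not the text's implementation): spin on the cached copy while the lock reads 1, then try to swap in a 1; if the swap returns 0, the lock was acquired.

#include <stdatomic.h>
#include <stdio.h>

typedef struct { atomic_int locked; } spinlock_t;

void spin_lock(spinlock_t *l) {
    for (;;) {
        while (atomic_load(&l->locked) != 0)
            ;                                     /* spin on the local cached copy */
        if (atomic_exchange(&l->locked, 1) == 0)
            return;                               /* swap returned 0: we own the lock */
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);
}

int main(void) {
    spinlock_t l = { 0 };
    spin_lock(&l);
    /* critical section */
    spin_unlock(&l);
    printf("lock acquired and released\n");
    return 0;
}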
Cache Coherence Steps (spin lock example)

Step  P0             P1                               P2                               Coherence state of lock   Bus/Directory activity
1     Has lock       Spins, testing if lock = 0       Spins, testing if lock = 0       Shared                    None
2     Set lock to 0  Invalidate received              Invalidate received              Exclusive (P0)            Write invalidate of lock variable from P0
3                    Cache miss                       Cache miss                       Shared                    Bus/Directory services P2 cache miss; write back from P0
4                    Waits for bus                    Lock = 0                         Shared                    Cache miss for P2 satisfied
5                    Lock = 0                         Executes swap, gets cache miss   Shared                    Cache miss for P1 satisfied
6                    Executes swap, gets cache miss   Completes swap: returns 0        Exclusive (P2)            Bus/Directory services P2 cache miss; generates invalidate
                                                      and sets lock = 1
7                    Swap completes and returns 1,    Enters critical section          Exclusive (P1)            Bus/Directory services P1 cache miss; generates write back
                     and sets lock = 1
8                    Spins, testing if lock = 0                                                                  None
Multi-threading
Multi-threading – why and how? Fine-grain
Almost round robin. Higher overhead (compared to coarse grain). Better throughput than coarse grain.
Coarse-grain
Threads switched only on long stalls (L2 cache miss) Low overhead Individual threads get better performance
Reading Assignment
Memory Consistency
Notes (Answers)
Problem on page 596 An example that uses LL (Load Linked) and Set Conditional (SC).
End!
(Chart: instructions between mispredictions for the SPEC92 benchmarks (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor), comparing a predicted-taken scheme with profile-based prediction.)