Computer Architecture CS1601

(Dr. Hari T.S. Narayanan)

Fundamentals of Computer Design

Unit 1

Contents

• Measuring and Reporting performance
• Quantitative principles of computer design
• Classifying instruction set architecture
• Memory addressing
• Addressing Modes
• Type and size of operands
• Operations in the instruction set
• Operands and operations for signal processing
• Instruction for control flow
• Encoding an instruction set
• Example architecture – MIPS and TM32

Measuring and Reporting Performance

• Response Time or Execution Time
• Throughput
• Benchmarks are used to measure and compare performance of different systems

Response Time Break-up

• Response time
  • Same as Elapsed time or Wall Clock time
  • Includes the following components:
    • User Time
    • System Time or Privileged Time
    • Waiting time and I/O time

Finding User time in Windows

Choosing Programs to Evaluate Performance

1. Real Applications (C compiler, Text Processing)
2. Modified (or Scripted) Applications
3. Kernels (Linpack, Livermore Loops)
4. Toy benchmarks (Quicksort, Sieve of Eratosthenes)
5. Synthetic benchmarks (Whetstone & Dhrystone)

Comparing & Summarizing Response Time

Computer    Program 1 (s)    Program 2 (s)    Total (s)
A           1                1000             1001
B           10               100              110
C           20               20               40

Ways to summarize:
• Total Execution Time
• Weighted Execution Time
• Normalized Execution Time with respect to a benchmark machine

Problem

Using the table above, compare the performance of A, B, and C for the following scenario:
• For every P2 there are 10 P1s
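A minimal sketch (not part of the slides) of the arithmetic behind the "weighted execution time" summary for this problem: each machine's P1 time is weighted by 10 runs per run of P2, using the times from the table above.

```c
/* Sketch: total and weighted execution times for machines A, B, C
 * when every run of P2 comes with 10 runs of P1 (values from the table). */
#include <stdio.h>

int main(void) {
    const char *name[] = {"A", "B", "C"};
    double p1[] = {1, 10, 20};      /* seconds for Program 1 */
    double p2[] = {1000, 100, 20};  /* seconds for Program 2 */
    for (int i = 0; i < 3; i++) {
        double total    = p1[i] + p2[i];
        double weighted = 10 * p1[i] + p2[i];   /* 10 P1 runs per P2 run */
        printf("%s: total = %4.0f s, weighted = %4.0f s\n", name[i], total, weighted);
    }
    return 0;
}
```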

Some Related Terminologies

• Million Instructions Per Second (MIPS)
• FLOPS (Floating Point Operations Per Second)
• TeraFLOPS

Quantitative Principles

• Principle of Locality
• Make the common case fast
• Amdahl’s Law
• CPU Performance Equation
• Take advantage of parallelism

Principle of Locality

• Spatial Locality
• Temporal Locality
• 80-20 Rule
• Applies to Instruction and Data
• More applicable to Instruction than data!

[Figure: instruction references (add, sto) and data references clustering in memory over time and space]

Amdahl’s Law

Performance Gain (Speedup)
  = Performance for entire task with enhancement / Performance for entire task without enhancement
  = Execution time of the entire task without enhancement / Execution time of the entire task with enhancement

Note: Performance ∝ 1 / execution time

Enhancing the Execution Time of a Task

• Each task is divided into two parts:
  • Part A – whose performance cannot be improved/enhanced
  • Part B – whose performance can be improved/enhanced
• Amdahl’s law relies on this classification

Amdahl’s Law - Example

Problem from Page 41 of your text book:
  Total time for Normal Execution = 0.6 + 0.4 = 1.0 s
  Fraction-enhanced = 0.4 s
  Speedup-enhanced = 10 times
  Total time for Enhanced execution = 0.6 + 0.4/10 s = 0.64 s
  Speedup-overall = 1.0/0.64 ≈ 1.56
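A small sketch (not from the textbook) that reproduces the Amdahl's Law calculation above in code, using the same fraction and enhancement speedup.

```c
/* Sketch of the Amdahl's Law example: 0.4 of the task sped up 10x. */
#include <stdio.h>

int main(void) {
    double frac_enhanced    = 0.4;   /* fraction of time that can be enhanced */
    double speedup_enhanced = 10.0;  /* speedup of the enhanced part */
    double t_old = 1.0;              /* normalized original execution time */
    double t_new = (1.0 - frac_enhanced) + frac_enhanced / speedup_enhanced;
    printf("Enhanced time = %.2f s, overall speedup = %.2f\n",
           t_new * t_old, t_old / t_new);   /* 0.64 s, ~1.56 */
    return 0;
}
```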

Amdahl’s Law - Review

• Let Speedup-enhanced be n
• As we increase n, 0.4/n approaches 0
• Thus, the limit of the overall speedup = 1/(0.6 + 0.4/n) → 1/0.6 ≈ 1.67

Short Problems

• A task with
  • Part A – 0.9
  • Part B – 0.1
• Another task with
  • Part A – 0.1
  • Part B – 0.9

The CPU Performance Equations

1. CPU Time for a program = CPU Clock Cycles × Clock Cycle Time
2. CPU Time = CPU clock cycles for a program / Clock rate
3. CPIavg = CPU clock cycles for a program / Instruction Count   (CPI – Clock Cycles Per Instruction)
4. CPU Time = (Instructions / Program) × (Clock Cycles / Instruction) × (Seconds / Clock Cycle) = Seconds / Program
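A minimal sketch of equation 4 in code. The instruction count, CPI and clock rate below are made-up illustrative values, not from the slides.

```c
/* Sketch: CPU time = IC x CPI x clock cycle time (= IC x CPI / clock rate). */
#include <stdio.h>

int main(void) {
    double instruction_count = 2e9;    /* instructions in the program (assumed) */
    double cpi               = 1.5;    /* average clock cycles per instruction (assumed) */
    double clock_rate_hz     = 2e9;    /* 2 GHz clock (assumed) */
    double cpu_time = instruction_count * cpi / clock_rate_hz;  /* seconds */
    printf("CPU time = %.2f s\n", cpu_time);                    /* 1.50 s */
    return 0;
}
```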

Take Advantage of Parallelism

• Instruction level parallelism using pipelining
• Memory access using multiple banks
• Parallel computation of two or more possible outcomes

Assignment

1. Discuss why MIPS is not an accurate measure for comparing performance among computers.
2. Software Design Patterns
3. Problem 1.16 from your text book
4. Problem on p.44-45 – Do this problem in the class without referring to your book

Instruction Set Architecture

Unit I

Content

• Internal Storage
• Reading
• Memory Addressing
• Assignment
• Addressing Mode for Signal Processing
• Types and Size of Operands
• Operands for Media and Signal Processing
• Operations in the Instruction Set
• Operations for Media and Signal Processing
• Instruction for Control Flow
• Encoding an Instruction Set
• Role of Compilers

Internal Storage

[Figure: the four ISA classes – stack (TOS), accumulator (ACC), register-memory, and register-register/load-store – showing where the ALU operands live in the processor and in memory]

Code Sequence with different Internal Storage

Stack      Accumulator    Register (register-memory)    Register (load-store)
Push A     Load A         Load R1,A                     Load R1,A
Push B     Add B          Add R3,R1,B                   Load R2,B
Add        Store C        Store R3,C                    Add R3,R1,R2
Pop C                                                   Store R3,C

Memory Operands and Register Operands

Number of          Maximum number of    Type of             Example
memory addresses   operands allowed     architecture
0                  3                    Register-register   Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, Trimedia TM5200
1                  2                    Register-memory     IBM 360/370, Intel 80x86, Motorola 68000, TI TMS320C54x
2                  2                    Memory-memory       VAX (also has three-operand formats)
3                  3                    Memory-memory       VAX (also has two-operand formats)

Comparison of Reg-Reg with Reg-Memory

Register-register (0,3)
  Advantages:
  1. Simple, fixed-length instruction encoding.
  2. Simple code generation model.
  3. Instructions take similar numbers of clocks to execute.
  Disadvantages:
  1. Higher instruction count than architectures with memory references in instructions.
  2. More instructions and lower instruction density lead to larger programs.

Register-memory (1,2)
  Advantages:
  1. Data can be accessed without a separate load instruction first.
  2. Instruction format tends to be easy to encode and yields good density.
  Disadvantages:
  1. Operands are not equivalent since a source operand in a binary operation is destroyed.
  2. Encoding a register number and a memory address in each instruction may restrict the number of registers.
  3. Clocks per instruction vary by operand location.

Memory Addressing

• Byte Order
  • Big Endian
  • Little Endian
  • Network Byte Order (Big Endian)
• Addressing Modes

Little-endian and Big-endian

Value is 0x0201

Little Endian                      Big Endian
Address   Value                    Address   Value
0         0x01 (byte 0)            0         0x02 (byte 1)
1         0x02 (byte 1)            1         0x01 (byte 0)
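A small sketch (not part of the slides): store the 16-bit value 0x0201 from the example and inspect the bytes in memory to see which byte order the machine running the code uses.

```c
/* Sketch: detect the byte order of the host by examining how 0x0201 is stored. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t value = 0x0201;
    unsigned char *bytes = (unsigned char *)&value;
    printf("byte at address 0: 0x%02X\n", bytes[0]);  /* 0x01 on little endian, 0x02 on big endian */
    printf("byte at address 1: 0x%02X\n", bytes[1]);
    printf("host is %s endian\n", bytes[0] == 0x01 ? "little" : "big");
    return 0;
}
```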

Alignment

• Alignment is normally at a
  • 2-byte boundary
  • 4-byte boundary
• Allowing misaligned data optimizes storage but makes the CPU logic more complex & slower

[Figure: byte positions 4..0, showing a 4-byte normal integer and a char/int field]

Addressing Modes

Addressing Mode      Instruction          Meaning
Register             Add R4,R3            Regs[R4] ← Regs[R4] + Regs[R3]
Immediate            Add R4,#3            Regs[R4] ← Regs[R4] + 3
Displacement         Add R4,100(R1)       Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]
Register indirect    Add R4,(R1)          Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
Indexed              Add R3,(R1+R2)       Regs[R3] ← Regs[R3] + Mem[Regs[R1] + Regs[R2]]

Addressing Modes (Continued)

Addressing Mode      Instruction           Meaning
Direct or absolute   Add R4,(1001)         Regs[R4] ← Regs[R4] + Mem[1001]
Memory indirect      Add R4,@(R3)          Regs[R4] ← Regs[R4] + Mem[Mem[Regs[R3]]]
Auto-increment       Add R4,(R2)+          Regs[R4] ← Regs[R4] + Mem[Regs[R2]];  Regs[R2] ← Regs[R2] + d
Auto-decrement       Add R1,-(R2)          Regs[R2] ← Regs[R2] - d;  Regs[R1] ← Regs[R1] + Mem[Regs[R2]]
Scaled               Add R1,100(R2)[R3]    Regs[R1] ← Regs[R1] + Mem[100 + Regs[R2] + Regs[R3]*d]

Type and Size of Operands

• Fixed Size – rarely used
• Encoded in the operation code
• Tag – rarely used
• Data widths – 16, 24, 32
• Accumulator width – 32, 64, 56, 40
• Integers – 8 bits, 16 bits, 32 bits, FP

Floating Point Number Representation

Single precision: 8-bit exponent, 23-bit mantissa

Type                    Exponent        Mantissa
Zeros                   0               0
Denormalized Numbers    0               non-zero
Normalized Numbers      1 to 2^8 - 2    any
Infinities              2^8 - 1         0
NaNs                    2^8 - 1         non-zero

Encoding −118.625 using the IEEE 754 system.
1. Negative number, so set the sign bit to "1".
2. Write the number using binary notation. The result is 1110110.101.
3. Move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 × 2^6. This is a normalized floating point number.
4. Compute the exponent field – the bias is 127, so 6 + 127 = 133 = 10000101.
5. The mantissa field is the fraction 110110101 padded with zeros to 23 bits.
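A small sketch (not from the slides) that prints the single-precision fields of −118.625 so the hand encoding above can be checked against the hardware.

```c
/* Sketch: dump the IEEE 754 bit fields of -118.625f. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = -118.625f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* reinterpret the float's bit pattern */
    printf("bits     = 0x%08X\n", bits);       /* 0xC2ED4000 */
    printf("sign     = %u\n", bits >> 31);              /* 1 */
    printf("exponent = %u\n", (bits >> 23) & 0xFF);      /* 133 = 6 + 127 */
    printf("fraction = 0x%06X\n", bits & 0x7FFFFF);      /* 110110101 followed by zeros */
    return 0;
}
```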

Terminologies

• Precision – Resolution
• Round-off Error
• Radix – Exponent
• Mantissa – Significand
• Normalization
• Denormals
• Banker’s rounding

64-bit floating point representation

Double Precision Floating Point Representation

Operations - grouping

Operator Type            Examples
Arithmetic and logical   Integer arithmetic and logical operations: add, subtract, and, or, multiply, divide
Data Transfer            Loads-stores (move instructions on computers with memory addressing)
Control                  Branch, jump, procedure call and return, traps
System                   Operating system call, virtual memory management instructions
Floating Point           Floating-point operations: add, multiply, divide, compare
Decimal                  Decimal add, decimal multiply, decimal-to-character conversions
String                   String move, string compare, string search
Graphics                 Pixel and vertex operations, compression/decompression operations

Top 10 Frequently Used Instructions

Rank   80x86 instruction         Integer average (% total executed)
1      load                      22%
2      conditional branch        20%
3      compare                   16%
4      store                     12%
5      add                       8%
6      and                       6%
7      sub                       5%
8      move register-register    4%
9      call                      1%
10     return                    1%
       Total                     96%

Instructions for Control Flow

• Four different types of control flow changes:
  • Conditional Branches
  • Jumps
  • Procedure calls
  • Procedure returns
• Addressing modes for these changes

Three Types of Instruction Encoding

(a) Variable (e.g. VAX, Intel 80x86):
    [operation and no. of operands][address specifier 1][address field 1] … [address specifier n][address field n]

(b) Fixed (e.g. Alpha, ARM, MIPS, PowerPC, SPARC, SuperH):
    [operation][address field 1][address field 2][address field 3]

(c) Hybrid (e.g. IBM 360/70, MIPS16, Thumb):
    [operation][address specifier][address field]
    [operation][address specifier 1][address specifier 2][address field]
    [operation][address specifier][address field 1][address field 2]

Topics to prepare and present

1. RISC versus CISC
2. IEEE 754 Floating point standard
3. Floating point Arithmetic
4. DEC VAX Assembler Instruction Set
5. Benchmarking – Whetstone, Dhrystone, Linpack, SPEC, & TPC
6. Stack based versus Register architecture
7. Problem – Example on page 106

Instruction Level Parallelism Unit 2

What is Instruction-level parallelism (ILP)? 

Overlapping the execution of instructions Assembled code --------------------------------------------

Pipeline - Overlap

• Compare & contrast:
  1. Serving each individual to completion
  2. Serving in a pipeline as shown above
• For the second case, compare:
  • Time to serve an individual
  • Time to serve 100 people

What is pipelining?

Pipelining – Advantage, Speedup, Effect, & Balance

• Instruction Fetch (IF)
• Instruction Decode (ID)
• Execution (EX)
• Memory Access/Data Access (MEM/DA)
• Write-back Cycle (WB)

Classic Five-stage Pipeline for a RISC m/c

                        Clock Number
Instruction         1    2    3    4    5    6    7    8    9
Instruction i       IF   ID   EX   MEM  WB
Instruction i+1          IF   ID   EX   MEM  WB
Instruction i+2               IF   ID   EX   MEM  WB
Instruction i+3                    IF   ID   EX   MEM  WB
Instruction i+4                         IF   ID   EX   MEM  WB

Pipelining Data Paths

[Figure: pipelined data path over clock cycles CC1–CC9 – successive instructions in program execution order, each one cycle behind the previous, flow through IM (instruction memory), Reg (register read), ALU, DM (data memory), and Reg (register write-back)]

Basic Performance Issues in Pipelining

• Pipelining slows down individual instructions, but it increases the overall speed of execution of your program!

Problem 1

• Consider the un-pipelined processor section. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20% and 40% respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
• Assume a 5-stage pipeline.
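A worked sketch (not from the slides) of Problem 1: the unpipelined average instruction time is the frequency-weighted cycle count times the clock period, and the pipelined machine completes one instruction per (lengthened) clock.

```c
/* Sketch of Problem 1: speedup = avg unpipelined instruction time / pipelined cycle time. */
#include <stdio.h>

int main(void) {
    double clock_ns = 1.0;                                       /* unpipelined clock cycle */
    double avg_unpipelined = clock_ns * (0.4 * 4 + 0.2 * 4 + 0.4 * 5);  /* ALU, branch, memory mix */
    double pipelined_cycle = clock_ns + 0.2;                     /* 0.2 ns skew/setup overhead */
    printf("avg unpipelined time = %.2f ns\n", avg_unpipelined); /* 4.40 ns */
    printf("speedup = %.2f\n", avg_unpipelined / pipelined_cycle);  /* ~3.67 */
    return 0;
}
```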

Problem 2

• Consider the un-pipelined processor section. Assume that it has a 2 GHz clock and that it uses 4 cycles for ALU operations, 3 cycles for branches, and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 50%, 25% and 25% respectively. Suppose that due to clock skew and setup, the pipelined processor is reduced to working at half the clock rate. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
• Assume a 5-stage pipeline.

The Hazards of Pipelining

• What is a pipeline hazard?
• What is a pipeline stall?
• Types of hazards:
  • Structural
  • Data
  • Control

Effect of Stalls on Pipeline Performance

Speedup from pipelining
  = Avg instruction time unpipelined / Avg instruction time pipelined
  = (CPI unpipelined × Clock period unpipelined) / (CPI pipelined × Clock period pipelined)
  = (CPI unpipelined / CPI pipelined) × (Clock period unpipelined / Clock period pipelined)

Speedup due to Improved CPI

  CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction
                = 1 + Pipeline stall cycles per instruction

  Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
          = Pipeline depth / (1 + Pipeline stall cycles per instruction)
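A minimal sketch of the final relation above: speedup from pipelining given the pipeline depth and the average number of stall cycles per instruction (ideal CPI assumed to be 1, as in the slide).

```c
/* Sketch: speedup = pipeline depth / (1 + stall cycles per instruction). */
#include <stdio.h>

static double pipeline_speedup(double depth, double stalls_per_instr) {
    return depth / (1.0 + stalls_per_instr);
}

int main(void) {
    printf("depth 5, 0 stalls   : %.2f\n", pipeline_speedup(5, 0.0));  /* 5.00 */
    printf("depth 5, 0.5 stalls : %.2f\n", pipeline_speedup(5, 0.5));  /* 3.33 */
    return 0;
}
```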

Pipeline Implementation with MIPS – IF Cycle

• Instruction fetch cycle (IF):
    IR  ← Mem[PC];
    NPC ← PC + 4;

Pipeline Implementation with MIPS – ID Cycle

• Instruction decode/register fetch cycle (ID):
    A   ← Regs[rs];
    B   ← Regs[rt];
    Imm ← sign-extended immediate field of IR;

Execution/Effective Address Cycle (EX)

• The ALU operates on the operands prepared in the prior cycle, performing one of the following depending on the MIPS instruction type:
    Memory reference:                   ALUOutput ← A + Imm;
    Register-Register ALU instruction:  ALUOutput ← A func B;
    Register-Immediate ALU instruction: ALUOutput ← A op Imm;
    Branch:                             ALUOutput ← NPC + (Imm << 2);  Cond ← (A == 0)

Pipeline Implementation with MIPS – Memory Access Cycle

• Memory access/branch completion cycle (MEM). The PC is updated for all instructions:
    Memory reference:  LMD ← Mem[ALUOutput]   or   Mem[ALUOutput] ← B;
    Branch:            if (cond) PC ← ALUOutput   else PC ← NPC

Pipeline Implementation with MIPS – Write Back

• Register-Register ALU instruction:   Regs[rd] ← ALUOutput;
• Register-Immediate ALU instruction:  Regs[rt] ← ALUOutput;
• Load instruction:                    Regs[rt] ← LMD;

What is Instruction-level parallelism (ILP)? 

Overlapping the execution of instructions Assembled code --------------------------------------------

Instruction Level Parallelism

• Basic block
• Parallelism within a basic block
• Limitation of parallelism within a basic block
• Parallelism across basic blocks
• Limitation of parallelism across basic blocks

[Figure: a basic block – a straight-line sequence Inst1…Inst4 with a single entry and a single exit]

MIPS Program Characteristics

[Figure: a MIPS instruction stream (instructions 1–10 and 92–101) in which branch instructions (BE, BNEZ, JUMP, BNE) appear every few instructions, delimiting small basic blocks]

Vector Processors – Parallelism across basic blocks

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

• Vector processors don’t scale well for different problem sizes!
• ILP processors are different from vector processors

Data Dependences and ILP

• Two independent instructions can be completely overlapped
• Two dependent instructions cannot be completely overlapped
• Relationship between:
  • Data Dependence
  • Hazard
  • Stall

Data Dependences

• Three types of dependences:
  • True (data) dependence
  • Name dependence
  • Control dependence

True Dependence

• Instruction i produces a result that may be used by instruction j, or
• Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• Output from instruction i → input to instruction j

Code Sequence with Data Dependence

    Loop:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     F4, 0(R1)
           DADDUI  R1, R1, #-8
           BNE     R1, R2, Loop

(The L.D → ADD.D → S.D chain through F0 and F4, and the DADDUI → BNE chain through R1, are the data dependences in this loop.)

Data dependence can be addressed by:

1. Maintaining the dependence but avoiding a hazard
2. Eliminating the dependence by transforming the code

Name Dependence

• Anti-dependence: instruction j writes to a register or memory location that instruction i reads.
• Output dependence: instruction i and instruction j write to the same register or memory location.

Control Dependence

    if p1 { S1; }
    if p2 { S2; }

Data Hazards

• RAW (Read After Write)
• WAW (Write After Write)
• WAR (Write After Read)
• RAR (Read After Read)

Read After Write - RAW

• Wrong execution: j tries to read a source before i writes to it
• Arises from a true dependence
• Example:
    Load  R1, A
    Add   R1, R2, R1

Write After Write - WAW

• Arises from an output dependence
• The correct order between writes must be maintained
• Appears only in pipelines that have multiple write stages
• Example:
    DMUL   F4, F3, F4
    DLOAD  F4, A

Write After Read - WAR

• Arises from an anti-dependence
• An instruction writes to a destination before it is read by another instruction, making the latter read the new (incorrect) value
• Does not occur in static pipelines

Read After Read - RAR

• Not a hazard!

Dynamic Scheduling

• Summarize: data dependence & structural dependence
• Executing instructions in non-sequential order
• Example:
          L     R1, 0(R2)
          SUB   R1, R3
          BEQZ  Same
          S     R1, 0(R2)
          JUMP  End
    Same: S     R1, 0(R4)

Control Dependence

    if p1 { S1; }
    if p2 { S2; }

          ADD   R2, R3, R4
          BEQZ  R2, L1
          LW    R1, 0(R2)
          RET
    L1:   ADD   R1, R2, R3

Two Constraints

• An instruction that is control dependent on a branch cannot be moved before the branch.
• An instruction that is not control dependent on a branch cannot be converted to a control-dependent one.

          ADD   R2, R3, R4
          BEQZ  R2, L1
          LW    R1, 0(R2)    ✗ cannot be moved before the branch
          RET
    L1:   ADD   R1, R2, R3   ✗ cannot be moved before the branch

Preserving Control & Data Dependence

          ADD   R2, R3, R4
          BEQZ  R2, L1
          LW    R1, 0(R2)
    L1:

Alternative to Control & Data Dependence

• Sequential execution of a program guarantees its intended correctness
• Preserving control and data dependence guarantees correctness in dynamic scheduling
• Preserving the two dependences is an easier way to guarantee correctness
• But it is not the best way in terms of throughput!
• Preserving exception behavior and data flow also guarantees correctness and offers better throughput!

What is Exception Behavior?

          ADD   R2, R3, R4
          BEQZ  R2, L1
          LW    R1, 0(R2)
          RET
    L1:   ADD   R1, R2, R3

What is Data Flow?

          ADD   R1, R2, R3
          BEQZ  R4, L
          SUB   R1, R5, R6
    L:    OR    R7, R1, R8

To prove control dependence is not critical

          ADD   R1, R2, R3
          BEQZ  R12, Skip
          DSUB  R4, R5, R6
          ADD   R5, R4, R9
    Skip: OR    R7, R8, R9

Scheduling & Speculation

• Executing DSUB R4, R5, R6 ahead of BEQZ.
• Scheduling: reordering the instructions to improve throughput.
• Speculation: scheduling instructions ahead of their order based on a prediction.

Dynamic Scheduling vs Static Scheduling

• Static Scheduling
  • Decision at compile time
  • No change to hardware
  • Limited to compile-time data
• Dynamic scheduling
  • Decision at run time
  • Complex hardware
  • Better throughput

Limitations of Static Scheduling

• Unnecessary stalling of non-dependent instructions:

    DIV.D  F0, F2, F4
    ADD.D  F10, F0, F8
    SUB.D  F12, F8, F14

Issues with Dynamic Scheduling

• Hazards – WAR and WAW
• Exception handling

    DIV.D  F0, F2, F4
    ADD.D  F6, F0, F8      ; WAR on F8 with SUB.D, WAW on F6 with MUL
    SUB.D  F8, F10, F14
    MUL    F6, F10, F8

Simple 5-stage Pipeline with Integer ALU

Instructions enter → IF → ID → EX (ALU) → MEM → WB → instructions leave

5-Stage Pipeline with looped FP FUs

[Figure: IF → ID → one of {EX integer unit, EX FP multiplier, EX FP adder, EX FP divider} → MEM → WB; the FP functional units are not pipelined]

5-Stage Pipeline with pipelined FP units

[Figure: the same structure, but the FP functional units are themselves pipelined]

5-Stage Pipeline with pipelined multiple FP units

[Figure: the same structure with multiple pipelined functional units of each kind]

Hazards in Longer Latency Pipelines

• Divide is not fully pipelined; this can lead to a structural hazard.
• Instructions have varying running times, so the number of register writes required in a cycle can be more than 1.
• Instructions can complete in a different order than they were issued.
• WAW hazards are possible.
• WAR hazards are less likely.
• Stalls due to RAW hazards will be more frequent because of the longer latencies.

Dynamic Scheduling Stages

• Dynamic scheduling implies out-of-order execution
• To allow this, the ID stage of the 5-stage pipeline is split into two stages:
  • Issue: decode instructions, check for structural hazards
  • Read operands (RO): wait until there are no data hazards, then read operands

Instructions enter → IF → Issue → RO → EX → MEM → WB → instructions leave
                         (Issue + RO replace the old ID stage)

Variations and other characteristics

• Instruction is fetched into a register, or
• Instruction is fetched into a queue
• Execution may take a different number of cycles depending upon the instruction
• The EX stage follows the read-operand stage
• Multiple instructions can be in the EX stage at the same time, using multiple functional units
• Stall and bypass

Dynamic Scheduling using scoreboard 

All instructions come to the issue stage in order (set by the compiler)



They can stall or bypass each other in the read operand stage and enter execution out-of-order.



The scoreboard is one technique to schedule instructions out of order; Tomasulo’s algorithm is another.

Scoreboard …

• Objective: to execute one instruction per clock cycle
• From the CDC 6600
• Recall the WAR hazard:

    DIV.D  F0, F2, F4
    ADD.D  F10, F0, F8
    SUB.D  F8, F8, F14

Instructions enter → IF → Issue → RO → EX → MEM → WB → instructions leave

Scoreboard content

• Instruction Status
  • Indicates the current stage for an instruction
• Functional Unit Status
  • Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
• Register Result Status
  • Indicates which functional unit will write each register

Scoreboard Building Blocks

[Figure: the registers connected to several functional units (FUs), with the scoreboard exchanging control/status signals with the registers and with each FU]

Tomasulo’s Approach to Dynamic Scheduling

• Addresses WAR and WAW hazards
• Based on renaming of registers

Original code:                    After renaming (S, T are temporaries):
    DIV.D  F0, F2, F4                 DIV.D  F0, F2, F4
    ADD.D  F6, F0, F8                 ADD.D  S,  F0, F8
    S.D    F6, 0(R1)                  S.D    S,  0(R1)
    SUB.D  F8, F10, F14               SUB.D  T,  F10, F14
    MUL.D  F6, F10, F8                MUL.D  F6, F10, T

3 Stages of Tomasulo’s Pipeline

• Issue
• Execute
• Write Result

Basic Structure of Tomasulo’s Algorithm

[Figure: the instruction queue (fed from the instruction unit) sends floating-point operations to reservation stations (3 in front of the FP adders, 2 in front of the FP multipliers) and load-store operations to an address unit with load buffers and store buffers; operand and operation buses connect the FP registers to the reservation stations, the memory unit supplies data, and the Common Data Bus broadcasts results back to the reservation stations, the buffers, and the FP registers]

Advantages of Tomasulo’s algorithm

• Distribution of the hazard detection logic
• No head-of-the-queue blocking – eliminates stalls due to WAW and WAR hazards
• It can also eliminate name dependences that cannot be eliminated by a compiler
• The Common Data Bus (CDB) makes it easy to deliver one result to multiple waiting source registers

Reservation Station – Data Fields

• Instruction Status
  • Indicates the current stage for an instruction
• Reservation Station Status
  • Busy, Op, Vj, Vk, Qj, Qk, A
• Register Status
  • Indicates which functional unit will write to each register



Example

Show what the information tables look like for the following code sequence when only the first load has completed and written its result:

1.  L.D    F6, 34(R2)
2.  L.D    F2, 45(R3)
3.  MUL.D  F0, F2, F4
4.  SUB.D  F8, F2, F6
5.  DIV.D  F10, F0, F6
6.  ADD.D  F6, F8, F2

Example – Instruction Status

Instruction           Issue   Execute   Write Result
L.D    F6, 34(R2)     √       √         √
L.D    F2, 45(R3)     √       √
MUL.D  F0, F2, F4     √
SUB.D  F8, F2, F6     √
DIV.D  F10, F0, F6    √
ADD.D  F6, F8, F2     √

Example – Reservation Stations

Name    Busy   Op     Vj   Vk                    Qj      Qk      A
Load1   no
Load2   yes    Load                                              45 + Regs[R3]
Add1    yes    SUB         Mem[34 + Regs[R2]]    Load2
Add2    yes    ADD                                Add1    Load2
Add3    no
Mult1   yes    MUL         Regs[F4]              Load2
Mult2   yes    DIV         Mem[34 + Regs[R2]]    Mult1

Example – Register Status

Field   F0      F2      F4    F6     F8     F10
Qi      Mult1   Load2         Add2   Add1   Mult2

Steps in Algorithm

• To understand the full power of eliminating WAW and WAR hazards through dynamic renaming of registers, we must look at a loop. Consider the following simple sequence for multiplying the elements of an array by a scalar in F2:

    LOOP:  L.D     F0, 0(R1)
           MUL.D   F4, F0, F2
           S.D     F4, 0(R1)
           DADDUI  R1, R1, #-8
           BNE     R1, R2, LOOP    ; branches if R1 ≠ R2

Steps in Algorithm – Issue

Instruction state   Wait until         Action or bookkeeping
FP operation        Station r empty    if (RegisterStat[rs].Qi ≠ 0) {RS[r].Qj ← RegisterStat[rs].Qi}
                                         else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0};
                                       if (RegisterStat[rt].Qi ≠ 0) {RS[r].Qk ← RegisterStat[rt].Qi}
                                         else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0};
                                       RS[r].Busy ← yes; RegisterStat[rd].Qi ← r;
Load or store       Buffer r empty     if (RegisterStat[rs].Qi ≠ 0) {RS[r].Qj ← RegisterStat[rs].Qi}
                                         else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0};
                                       RS[r].A ← imm; RS[r].Busy ← yes;
Load only                              RegisterStat[rt].Qi ← r;
Store only                             if (RegisterStat[rt].Qi ≠ 0) {RS[r].Qk ← RegisterStat[rt].Qi}
                                         else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0};

Steps in Algorithm – Execute

Instruction state   Wait until                                  Action or bookkeeping
FP operation        (RS[r].Qj = 0) and (RS[r].Qk = 0)           Compute result using operands in Vj and Vk
Load-store step 1   RS[r].Qj = 0 & r is head of load-store      RS[r].A ← RS[r].Vj + RS[r].A
                    queue
Reducing Branch Cost 

Control hazard can cause a greater loss for MIPS pipeline than data hazards.



Branch Prediction is the mechanism to resolve the outcome of a branch early, thus preventing control dependences from causing stalls.



Based on the fact that more often branches are associated with loops



Two types of branch prediction 

Static prediction



Dynamic prediction

Terminologies related to Branch Prediction

• Control or branch hazard
• Taken (if a branch changes the PC to its target address)
• Not taken or untaken (otherwise)
• Freeze or flush
• Predicted-not-taken

        ADD   R1, R2, R3
        BEQZ  R4, L
        SUB   R1, R5, R6
        ST    R1, A
        ...
    L:  OR    R7, R1, R8

Branch Taken or Untaken

High-level program segment:

    for (j = 0; j < 10; j++)
        for-body
    end-for
    nextinst

Assembly program segment:

              check & branch to for-loop
    RET:      nextinst
              ...
    for-loop: ...
              ...
              JUMP RET

Basic Branch Prediction – Prediction Buffer

• Using a Branch History Table (prediction buffer), indexed by branch address:

    Branch Address    Taken or Not (1 bit)
    baddr1            1
    baddr2            0
    baddr3            1

• 2-bit prediction scheme (saturating counter): states 11 and 10 predict taken, states 01 and 00 predict untaken; each taken outcome moves the counter up one state (saturating at 11), and each not-taken outcome moves it down one state (saturating at 00).

• Prediction accuracy: 80%–90%
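A small sketch (not part of the slides) of the 2-bit saturating counter described above: the counter predicts taken when it is 2 or 3 and moves one step toward each actual outcome. The sample outcome sequence is made up for illustration.

```c
/* Sketch: one 2-bit saturating-counter branch predictor entry. */
#include <stdio.h>

static unsigned counter = 3;                 /* start in "predict taken (11)" */

static int predict(void) { return counter >= 2; }   /* 1 = predict taken */

static void update(int taken) {              /* move one step toward the outcome */
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    int outcomes[] = {1, 1, 0, 1, 1, 1, 0, 1};   /* a loop branch: mostly taken (assumed) */
    int correct = 0;
    for (int i = 0; i < 8; i++) {
        if (predict() == outcomes[i]) correct++;
        update(outcomes[i]);
    }
    printf("%d of 8 predictions correct\n", correct);
    return 0;
}
```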

Static Branch Prediction

• Useful when branch behavior is highly predictable at compile time
• It can also assist dynamic predictors
• Loop unrolling is an example of this

Branch Causing 3-Cycle Stall

[Figure: pipeline timing in which the branch's successors repeat IF and stall for three cycles until the branch is resolved]

        ADD   R1, R2, R3
        BEQZ  R4, L
        SUB   R1, R5, R6
        ST    R1, A
        ...
    L:  OR    R7, R1, R8

Reducing pipeline branch penalties – Compile-time schemes

• Freeze or flush
• Assume that every branch is not taken
• Assume that every branch is taken
• Delayed branch

Delayed Branch

• Code segment:
    branch instruction
    sequential successor      ← branch delay slot
    ...
    Loop: branch target if taken
• The instruction in the branch delay slot is always executed
• The compiler can schedule an instruction into the branch delay slot so that its execution is useful even when the prediction is wrong

Branch Delay of 1

Not taken:

Untaken branch instruction      IF   ID   EX   MEM   WB
Branch delay instruction (j+1)       IF   ID   EX    MEM   WB
Instruction j+2                           IF   ID    EX    MEM   WB
Instruction j+3                                IF    ID    EX    MEM   WB
Instruction j+4                                      IF    ID    EX    MEM   WB

Taken:

Taken branch instruction        IF   ID   EX   MEM   WB
Branch delay instruction (j+1)       IF   ID   EX    MEM   WB
Branch target                             IF   ID    EX    MEM   WB
Branch target + 1                              IF    ID    EX    MEM   WB
Branch target + 2                                    IF    ID    EX    MEM   WB

Scheduling the branch delay slot

(a) From before:
        DADD R1, R2, R3
        if R2 = 0 then
          [delay slot]
    becomes:
        if R2 = 0 then
          [delay slot: DADD R1, R2, R3]

(b) From target:
        DSUB R4, R5, R6
        ...
        DADD R1, R2, R3
        if R1 = 0 then
          [delay slot]
    becomes:
        DADD R1, R2, R3
        if R1 = 0 then
          [delay slot: DSUB R4, R5, R6]

(c) From fall-through:
        DADD R1, R2, R3
        if R1 = 0 then
          [delay slot]
        OR   R7, R8, R9
        DSUB R4, R5, R6
    becomes:
        DADD R1, R2, R3
        if R1 = 0 then
          [delay slot: OR R7, R8, R9]
        DSUB R4, R5, R6

(a) is the best choice when it is possible.

Branch Prediction – Correlating Prediction

• Prediction based on correlation between branches:

    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ... }

        SUB   R3, R1, #2
        BNEZ  R3, L1        ; branch b1 (aa != 2)
        ADD   R1, R0, R0    ; aa = 0
    L1: SUB   R3, R2, #2
        BNEZ  R3, L2        ; branch b2 (bb != 2)
        ADD   R2, R0, R0    ; bb = 0
    L2: SUB   R3, R1, R2    ; R3 = aa - bb
        BEQZ  R3, L3        ; branch b3 (aa == bb)

• If both the first and second branches are not taken, then the last branch will be taken.

ILP Software based approaches

Basic Compiler Technique

• Basic pipeline scheduling and loop unrolling

    for (j = 1000; j > 0; j--)
        x[j] = x[j] + s;

    L:  L.D    F0, 0(R1)     ; F0 = array element
        ADD.D  F4, F0, F2    ; add scalar in F2
        S.D    F4, 0(R1)     ; store result
        ADDI   R1, R1, #-8   ; decrement pointer
        BNE    R1, R2, L     ; branch if R1 != R2

Latencies of FP operations used

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0

Assume an integer load latency of 1 and an integer ALU operation latency of 0.

Without Any Scheduling

                                Clock cycle issued
    L:  L.D    F0, 0(R1)        1
        stall                   2
        ADD.D  F4, F0, F2       3
        stall                   4
        stall                   5
        S.D    F4, 0(R1)        6
        ADDI   R1, R1, #-8      7
        stall                   8    ; latency of 1 between the ALU op and BNE
        BNE    R1, R2, L        9
        stall                   10

10 clock cycles/element; 7 of them (ADDI, BNE, stalls) are loop overhead.

With Scheduling

                                Clock cycle issued
    L:  L.D    F0, 0(R1)        1
        ADDI   R1, R1, #-8      2
        ADD.D  F4, F0, F2       3
        stall                   4
        BNE    R1, R2, L        5
        S.D    F4, 8(R1)        6    ; S.D moved into the branch delay slot, offset adjusted

6 clock cycles/element; 3 of them (ADDI, BNE, stall) are loop overhead.

With Unrolling

    L:  L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        ADDI   R1, R1, #-32
        S.D    F12, 16(R1)
        BNE    R1, R2, L
        S.D    F16, 8(R1)

14 clock cycles / 4 elements; 2 instructions (ADDI, BNE) are loop overhead.
I n str u c ti o n P r o d u c i n g R e su l t I n str u c ti o n U si n g RL ea su te nl tc y i n C l o c k C y c l e s FP A LU op A no ther F P A L U op 3 FP A LU op S t o re d o u b le 2 L o a d D o u b le FP A LU op 1 L o a d D o u b le S t o re d o u b le 0

With Unrolling & Scheduling (dual issue: one integer/memory and one FP instruction per cycle)

    Integer/memory instruction        FP instruction
    L:  L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)           ADD.D  F4, F0, F2
        L.D    F14, -24(R1)           ADD.D  F8, F6, F2
        L.D    F18, -32(R1)           ADD.D  F12, F10, F2
        S.D    F4, 0(R1)              ADD.D  F16, F14, F2
        S.D    F8, -8(R1)             ADD.D  F20, F18, F2
        S.D    F12, -16(R1)
        ADDI   R1, R1, #-40
        S.D    F16, 16(R1)
        BNE    R1, R2, L
        S.D    F20, 8(R1)

12 clock cycles / 5 elements; 2 instructions (ADDI, BNE) are loop overhead.

The VLIW Approach

• The first multiple-issue processors required the instruction stream to be explicitly organized.
• They used a wide instruction with multiple operations per instruction.
• This architectural approach was named VLIW (Very Long Instruction Word).

Advantage of VLIW Approach

• The advantage of VLIW increases as the maximum issue rate grows.
• VLIW approaches make sense for wider processors.
• For example, a VLIW processor might have instructions that contain five operations, including one integer, two floating-point, and two memory references.

The Basic VLIW Approach

• There is no fundamental difference between the superscalar and VLIW approaches.
• The responsibility for choosing the instructions to be issued simultaneously falls on the compiler, so the hardware a superscalar needs to make these issue decisions is unneeded.

The Basic VLIW Approach 

Scheduling Techniques 

Local Scheduling



Global Scheduling

Example

• Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore the branch delay slot.

VLIW instructions occupy the inner loop and replace the unrolled sequence

Memory reference 1    Memory reference 2    FP operation 1        FP operation 2        Integer operation/branch
L.D  F0, 0(R1)        L.D  F6, -8(R1)
L.D  F10, -16(R1)     L.D  F14, -24(R1)
L.D  F18, -32(R1)     L.D  F22, -40(R1)     ADD.D F4, F0, F2      ADD.D F8, F6, F2
L.D  F26, -48(R1)                           ADD.D F12, F10, F2    ADD.D F16, F14, F2
                                            ADD.D F20, F18, F2    ADD.D F24, F22, F2
S.D  F4, 0(R1)        S.D  F8, -8(R1)       ADD.D F28, F26, F2
S.D  F12, -16(R1)     S.D  F16, -24(R1)
S.D  F20, 24(R1)      S.D  F24, 16(R1)                                                  DADDUI R1, R1, #-56
S.D  F28, 8(R1)                                                                          BNE R1, R2, Loop

Problems of VLIW Model 

They are: 

Technical problems



Logistical Problems

Technical Problems 

Increase in code size and the limitations of lockstep.



Two different elements combine to increase code size substantially for a VLIW. 

Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops, thereby increasing code size.



Whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding.

Logistical Problems And Solution 

Logistical Problems 

Binary code compatibility has been a major problem.



The different numbers of functional units and unit latencies require different versions of the code.



Migration problems



Solution



Approach: 

Use object-code translation or emulation. This technology is developing quickly and could play a significant role in future migration schemes.



Temper the strictness of the approach so that binary compatibility is still feasible.

Advantages of multiple-issue versus vector processor 

Twofold. 

A multiple-issue processor has the potential to extract some amount of parallelism from less regularly structured code.



It has the ability to use a more conventional, and typically less expensive, cache-based memory system.

Advanced compiler support for exposing and exploiting ILP

Compiler technology for increasing the amount of parallelism:



Defining when a loop is parallel and how dependence can prevent a loop from being parallel.



Eliminating some types of dependences

Detecting and Enhancing Loop-level Parallelism 

Loop-level parallelism is normally analyzed at the source level or close to it.



while most analysis of ILP is done once instructions have been generated by the compiler.



It involves determining what dependences exist among the operand in a loop across the iterations of that loop.

Detecting and Enhancing Loop-level Parallelism 

The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence.



Loop-level parallel at the source representation:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Example 1 

Consider a loop: for (i= 1; i <= 100; i = i + 1) { A [ i + 1] = A [ i ] + C [ i ] ; B [ i + 1] = B [ i ] + A [ i + 1 ] ; }



Loop-carried dependence : execution of an instance of a loop requires the execution of a previous instance.

Example 2 

Consider a loop: for (i= 1; i <= 100; i = i + 1) { A [ i ] = A [ i ] + B [ i ] ; /* S1 */ B [ i + 1] = C [ i ] + D [ i ] ;/* S2 */ } What are the dependences between S1 and S2? Is this loop parallel? If not, show how to make it parallel.

Scheduling

Iteration 1:    A[1]   = A[1]   + B[1];    B[2]   = C[1]   + D[1]
Iteration 2:    A[2]   = A[2]   + B[2];    B[3]   = C[2]   + D[2]
Iteration 3:    A[3]   = A[3]   + B[3];    B[4]   = C[3]   + D[3]
Iteration 4:    A[4]   = A[4]   + B[4];    B[5]   = C[4]   + D[4]
   ...
Iteration 99:   A[99]  = A[99]  + B[99];   B[100] = C[99]  + D[99]
Iteration 100:  A[100] = A[100] + B[100];  B[101] = C[100] + D[100]

(B[i+1], produced by iteration i, is consumed by iteration i+1, so the dependence is carried across iterations through the B array.)

Removing Loop-carried Dependency 

Consider the same example, rewritten to remove the loop-carried dependence:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[101] = C[100] + D[100];

Example 3 

Loop-carried dependences: this dependence information is inexact; a dependence may or may not actually exist.



Consider a example: for (i= 1; i <= 100; i = i + 1) { A[i]=B[i] +C[i]; D[i]=A[i] *E[i ]; }

Recurrence

• Loop-carried dependences:

    for (i = 2; i <= 100; i = i + 1) {
        Y[i] = Y[i - 1] + Y[i];
    }

    for (i = 6; i <= 100; i = i + 1) {
        Y[i] = Y[i - 5] + Y[i];
    }

• Dependence distance: iteration i of the loop depends on the completion of an earlier iteration i − n; the dependence distance is n.

Finding Dependences 

Three tasks 

Good scheduling of code



Determining which loops might contain parallelism



Eliminating name dependences

Finding Dependence 

A dependence exists if two conditions hold: 



There are two iteration indices, j and k , both within the limits of the for loop. That is, m<=j <= n, m<=k <= n. The loop stores into an array element indexed by a*j+b and later fetches from the same array element when it is indexed by c*k+d. That is, a*j+b=c*k+d.

Example 1

• Use the GCD test to determine whether dependences exist in the loop:

    for (i = 1; i <= 100; i = i + 1) {
        X[2*i + 3] = X[2*i] * 5.0;
    }

  A = 2, B = 3, C = 2, D = 0.
  Since 2 (GCD(A, C)) does not divide -3 (D - B), no dependence is possible.
  (GCD – Greatest Common Divisor)
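A small sketch (not from the slides) of the GCD test: for a store to index a*j + b and a load from index c*k + d, a loop-carried dependence is possible only if GCD(a, c) divides d − b.

```c
/* Sketch: the GCD dependence test applied to X[2*i+3] = X[2*i] * 5.0. */
#include <stdio.h>

static int gcd(int x, int y) {
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x < 0 ? -x : x;
}

static int dependence_possible(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;        /* possible only if GCD(a,c) divides d-b */
}

int main(void) {
    /* X[2*i+3] = X[2*i] * 5.0  ->  a=2, b=3, c=2, d=0 */
    printf("dependence possible: %s\n",
           dependence_possible(2, 3, 2, 0) ? "yes" : "no");   /* no */
    return 0;
}
```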

Example 2 

The loop has multiple types of dependences. Find all the true dependences and antidependences, and eliminate the output dependences and antidependences by renaming.

    for (i = 1; i <= 100; i = i + 1) {
        y[i] = x[i] / c;  /* S1 */
        x[i] = x[i] + c;  /* S2 */
        z[i] = y[i] + c;  /* S3 */
        y[i] = c - y[i];  /* S4 */
    }

Dependence Analysis 

There are a wide variety of situations in which array-oriented dependence analysis cannot tell us what we might want to know, including:

• When objects are referenced via pointers rather than array indices.
• When array indexing is indirect through another array, which happens with many representations of sparse arrays.
• When a dependence may exist for some values of the inputs, but does not exist in actuality when the code is run since the inputs never take on those values.
• When an optimization depends on knowing more than just the possibility of a dependence, but needs to know on which write of a variable a read of that variable depends.

Basic Approach used in points-to analysis 

The basic approach used in points-to analysis relies on information from three major sources:  



 

Type information, which restricts what a pointer can point to. Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to. For example if p always points to an object allocated in a given source line and q never points to that object, then p and q can never point to the same object. Information derived from pointer assignments. For example , if p may be assigned the value of q, then p may point to anything q points to.

Analyzing pointers 

There are several cases where analyzing pointers has been successfully applied and is extremely useful: 





• When pointers are used to pass the address of an object as a parameter, it is possible to use points-to analysis to determine the possible set of objects referenced by a pointer. One important use is to determine if two pointer parameters may designate the same object.
• When a pointer can point to one of several types, it is sometimes possible to determine the type of the data object that the pointer designates at different parts of the program.
• It is often possible to separate out pointers that may only point to a local object from those that may point to a global one.

Different types of limitations 

There are two different types of limitations that affect our ability to do accurate dependence analysis for large programs.
• The first limitation arises from restrictions in the analysis algorithms. Often, we are limited by the lack of applicability of the analysis rather than a shortcoming in dependence analysis per se.
• The second limitation is the need to analyze behavior across procedure boundaries to get accurate information.

Eliminating dependent computations 



Compilers can reduce the impact of dependent computations so as to achieve more ILP. The key technique is to eliminate or reduce a dependent computation by back substitution, which increases the amount of parallelism and sometimes increases the amount of computation required. These techniques can be applied both within a basic block and within loops.

Copy propagation 

Within a basic block, algebraic simplification of expressions and an optimization called copy propagation, which eliminates operations that copy values, can be used to simplify sequences. For example,

    DADDUI  R1, R2, #4
    DADDUI  R1, R1, #4

becomes

    DADDUI  R1, R2, #8

Example 





Consider the code sequence:

    ADD R1, R2, R3
    ADD R4, R1, R6
    ADD R8, R4, R7

Notice that this sequence requires at least three execution cycles, since all the instructions depend on their immediate predecessor. By taking advantage of associativity, transform the code and rewrite it as:

    ADD R1, R2, R3
    ADD R4, R6, R7
    ADD R8, R1, R4

This sequence can be computed in two execution cycles. When loop unrolling is used, opportunities for these types of optimizations occur frequently.

Recurrences 

Recurrences are expressions whose value on one iteration is given by a function that depends on the previous iterations.



When a loop with a recurrence is unrolled, we may be able to algebraically optimize the unrolled loop, so that the recurrence need only be evaluated once per unrolled iterations.

Type of recurrences 

One common type of recurrence arises from an explicit program statement:

Sum = Sum + X ; 

Unroll a recurrence – Unoptimized (5 dependent instructions) Sum = Sum + X1 + X2 + X3 + X4 + X5 ;



Optimized: with 3 dependent operations: Sum = ((Sum + X1) + (X2 + X3) ) + (X4 + X5);
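A sketch (not from the slides) of the two forms above in C: the unrolled recurrence written as a serial chain, where every add depends on the previous one, versus the reassociated form, which shortens the chain of dependent operations.

```c
/* Sketch: serial vs. reassociated evaluation of the unrolled recurrence. */
double sum_serial(double sum, double x1, double x2, double x3, double x4, double x5) {
    return sum + x1 + x2 + x3 + x4 + x5;            /* 5 dependent adds in a chain */
}

double sum_reassociated(double sum, double x1, double x2, double x3, double x4, double x5) {
    return ((sum + x1) + (x2 + x3)) + (x4 + x5);    /* only 3 adds on the critical path */
}
```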

Software Pipelining : Symbolic Loop Unrolling 

There are two important techniques: 

Software pipelining



Trace scheduling



Reading Assignment 

Trace Scheduling



Software Pipelining versus Loop Unrolling

Software Pipelining 

This is a technique for reorganizing loops such that each iteration in the software-pipelined code is made from instruction chosen from different iterations of the original loop.



This loop interleaves instructions from different iterations without unrolling the loop.



This technique is the software counterpart to what Tomasulo’s algorithm does in hardware.

Software-Pipelined Iteration

[Figure: iterations 0–4 of the original loop overlapped in time; a software-pipelined iteration is assembled from one instruction taken from each of several different original iterations]

Example

Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2:

    Loop:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     F4, 0(R1)
           DADDUI  R1, R1, #-8
           BNE     R1, R2, Loop

You can omit the start-up and clean-up code.

Software Pipeline

Iteration i:    L.D F0, 0(R1)   ADD.D F4, F0, F2   S.D F4, 0(R1)
Iteration i+1:                  L.D F0, 0(R1)      ADD.D F4, F0, F2   S.D F4, 0(R1)
Iteration i+2:                                     L.D F0, 0(R1)      ADD.D F4, F0, F2   S.D F4, 0(R1)

Start-up: load M[i] into F0; add, holding the result in F4; load M[i-1] into F0.

    Loop:  S.D     F4, 16(R1)    ; stores into M[i]
           ADD.D   F4, F0, F2    ; adds to M[i-1]
           L.D     F0, 0(R1)     ; loads M[i-2]
           DADDUI  R1, R1, #-8
           BNE     R1, R2, Loop

Wind-down: store F4 into M[1]; ADD.D F4, F0, F2; store the result into M[0].

Major Advantage of Software Pipelining

• The major advantage of software pipelining over straight loop unrolling is that software pipelining consumes less code space.
• In addition to yielding a better-scheduled inner loop, each technique reduces a different type of overhead.
• Loop unrolling reduces the overhead of the loop: the branch and counter-update code.
• Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and the end.

Software pipelining and Unrolled Loop

[Figure: (a) software pipelining – the number of overlapped operations over time is constant except for the start-up code and wind-down code at the beginning and end of the loop; (b) loop unrolling – the overlap occurs between unrolled iterations, so the time spent at less than full overlap is proportional to the number of unrolls]

Software Pipelining Difficulty 

In practice, compilation using software pipelining is quite difficult for several reasons:





Many loops require significant transformation before they can be software pipelined, the trade-offs in terms of overhead versus efficiency of the software-pipelined loop are complex, and the issue of register management creates additional complexities. To help deal with two of these issues, the IA-64 added extensive hardware support for software pipelining. Although this hardware can make it more efficient to apply software pipelining, it does not eliminate the need for complex compiler support, or the need to make difficult decisions about the best way to compile a loop.

Global Code Scheduling 



Global code scheduling aims to compact a code fragment with internal control structure into the shortest possible sequence that preserves the data and control dependence . Data Dependence:  This force a partial order on operations and overcome by unrolling and, in case of memory operations , using dependence analysis to determine if two references refer to the same address.  Finding the shortest possible sequence for a piece of code means finding the shortest sequence for the critical path, which is the longest sequence of dependent instructions.

Global Code Scheduling …. 

Control dependence
• It dictates the instructions across which code cannot be easily moved.
• Control dependences arising from loop branches are reduced by unrolling.
• Global code scheduling can reduce the effect of control dependences arising from conditional non-loop branches by moving code.
• Since moving code across branches will often affect the frequency of execution of such code, effectively using global code motion requires estimates of the relative frequency of different paths.

Global Code Scheduling …. 

Although global code motion cannot guarantee faster code, if the frequency information is accurate, the compiler can determine whether such code movement is likely to lead to faster code.



Global code motion is important since many inner loops contain conditional statements.



Global code scheduling also requires complex trade-offs to make code motion decisions.

A Typical Code Fragment

    A[i] = A[i] + B[i];
    if (A[i] == 0)
        B[i] = ...;
    else
        X;
    C[i] = ...;

              LD     R4, 0(R1)       ; load A
              LD     R5, 0(R2)       ; load B
              DADDU  R4, R4, R5      ; add to A
              SD     R4, 0(R1)       ; store A
              ...
              BNEZ   R4, elsepart    ; test A
              ...                    ; then part
              SD     ..., 0(R2)      ; stores to B
              ...
              J      join            ; jump over else
    elsepart: ...                    ; else part
              X                      ; code for X
              ...
    join:     ...                    ; after if
              SD     ..., 0(R3)      ; store C[i]

Factors for the compiler

• Consider the factors that the compiler would have to weigh in moving the computation and assignment of B:
  • What are the relative execution frequencies of the then case and the else case in the branch? If the then case is much more frequent, the code motion may be beneficial. If not, it is less likely, although not impossible, that the code motion is beneficial.
  • What is the cost of executing the computation and assignment to B above the branch? It may be that there are a number of empty instruction issue slots in the code above the branch and that the instructions for B can be placed into slots that would otherwise go empty. This opportunity makes the computation of B “free”, at least to first order.

Factors for the compiler …





How will the movement of B change the execution time for the then case? If B is at the start of the critical path for the then case, moving it may be highly beneficial. Is B the best code fragment that can be moved above the branch? How does it compare with moving C or other statements within the then case? What is the cost of the compensation code that may be necessary for the else case? How effectively can this code be scheduled, and what is its impact on execution time?

Global code principle

• The two methods rely on the same principle:
  • Focus the attention of the compiler on a straight-line code segment representing what is estimated to be the most frequently executed code path.
  • Unrolling is used to generate the straight-line code but, of course, the complexity arises in how conditional branches are handled.
  • In both cases, the conditional branches are effectively straightened by choosing and scheduling the most frequent path.

Trace Scheduling: focusing on the critical path

• It is useful for processors with a large number of issues per clock, where conditional or predicated execution is inappropriate or unsupported, and where simple loop unrolling may not be sufficient by itself to uncover enough ILP to keep the processor busy.
• It is a way to organize the global code motion process, so as to simplify the code scheduling by incurring the costs of possible code motion on the less frequent paths.

Steps to Trace Scheduling

• There are two steps to trace scheduling:
  • Trace selection: it tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions; this sequence is called a trace.
  • Trace compaction: it tries to squeeze the trace into a small number of wide instructions. Trace compaction is code scheduling; hence, it attempts to move operations as early as it can in the sequence (trace), packing the operations into as few wide instructions (or issue packets) as possible.

Advantage of Trace Scheduling 

Its advantage is that it simplifies the decisions concerning global code motion.



In particular , branches are viewed as jumps into or out of the selected trace, which is assumed to be the most probable path.



When the code is moved across such trace entry and exit points, additional book-keeping code will often be needed on entry or exit point.

Key Assumption: 

The key assumption is that the trace is so much more probable than the alternatives that the cost of the book-keeping code need not be a deciding factor: 



If an instruction can be moved and thereby make the main trace execute faster, it is moved.

Although trace scheduling has been successfully applied to scientific code with its intensive loops and accurate profile data, it remains unclear whether this approach is suitable for programs that are less simply characterized and less loop-intensive.



In such programs, the significant overheads of compensation code may make trace scheduling an unattractive approach or, at best, one whose effective use will be extremely complex for the compiler.

Trace in the program fragment

[Figure: the fragment A[i] = A[i] + B[i]; if (A[i] == 0) … repeated twice; the selected trace follows the frequent path through the then part (B[i] = …, C[i] = …), with the branch to the else part forming a trace exit and the join point after the if forming a trace entrance]

Drawback of trace scheduling 

One of the major drawbacks of trace scheduling is that entries into and exits from the middle of the trace cause significant complications, requiring the compiler to generate and track the compensation code and often making it difficult to assess the cost of such code.

Superblocks 





Superblocks are formed by a process similar to that used for traces, but are a form of extended basic block, which is restricted to a single entry point but allows multiple exits. How can a superblock with only one entrance be constructed? The answer is to use tail duplication to create a separate block that corresponds to the portion of the trace after the entry. The superblock approach reduces the complexity of bookkeeping and scheduling versus the more general trace generation approach, but may enlarge code size more than a trace-based approach.

Superblock results from unrolling the code

[Figure: the fragment A[i] = A[i] + B[i]; if (A[i] == 0) … unrolled twice; the frequent path forms a single-entry superblock with exits (at n = 2 and n = 1) at each branch to the else part (X), and tail duplication supplies a residual copy of the code for executions that leave the superblock]

Exploiting ILP

• Loop unrolling, software pipelining, trace scheduling, and superblock scheduling all aim at trying to increase the amount of ILP that can be exploited by a processor issuing more than one instruction on every clock cycle.
• The effectiveness of each of these techniques and their suitability for various architectural approaches are among the hottest topics being pursued by researchers and designers of high-speed processors.

Hardware Support for Exposing More Parallelism at Compile Time

• Techniques such as loop unrolling, software pipelining, and trace scheduling can be used to increase the amount of parallelism available when the behavior of branches is fairly predictable at compile time.
• The first technique is an extension of the instruction set to include conditional or predicated instructions.
• Such instructions can be used to eliminate branches, converting a control dependence into a data dependence and potentially improving performance.
• Such approaches are useful with either the hardware-intensive or the software-intensive schemes; in both, predication can be used to eliminate branches.

Conditional or Predicated Instructions 

The concept behind instructions is quite simple:  An instruction refers to a condition, which is evaluated as part of the instruction execution.  If the condition is true, the instruction is executed normally;  If the condition is false, the execution continues as if the instruction were a no-op.  Many newer architectures include some form of conditional instructions.

Example: Consider the following code: if (A==0) {S=T;} Assuming that registers R1,R2 and R3 hold the values of A,S and T respectively, show the code for this statement with the branch and with the conditional move. 
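A C-level sketch (not the slide's expected assembly answer) of the two forms the example asks for: the branch version and the branch-free form that a compiler can map onto a conditional-move instruction (e.g. MIPS MOVZ, which copies a register only when another register is zero). The function names are illustrative; register assignments R1 = A, R2 = S, R3 = T are taken from the slide.

```c
/* Sketch: if (A == 0) S = T; with a branch versus with a conditional move. */
void with_branch(int A, int *S, int T) {
    if (A == 0) {              /* branch around the move when A != 0      */
        *S = T;                /* plain move of T into S                  */
    }
}

void with_conditional_move(int A, int *S, int T) {
    *S = (A == 0) ? T : *S;    /* branch-free select; a compiler can emit
                                  a conditional move such as MOVZ R2, R3, R1 */
}
```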

Conditional Moves and Predication









Conditional moves are the simplest form of conditional or predicated instructions, and although useful for short sequences, they have limitations. In particular, using conditional moves to eliminate branches that guard the execution of large blocks of code can be inefficient, since many conditional moves may need to be introduced. To remedy the inefficiency of using conditional moves, some architectures support full predication, whereby the execution of all instructions is controlled by a predicate. When the predicate is false, the instruction becomes a no-op. Full predication allows us to simply convert large blocks of code that are branch dependent. Predicated instructions can also be used to speculatively move an instruction that is time critical, but may cause an exception if moved before a guarding branch.

Example: 

Here is a code sequence for a two-issue superscalar that can issue a combination of one memory reference and one ALU operation, or a branch by itself, every cycle:

    First instruction slot      Second instruction slot
    LW    R1, 40(R2)            ADD  R3, R4, R5
                                ADD  R6, R3, R7
    BEQZ  R10, L
    LW    R8, 0(R10)
    LW    R9, 0(R8)

• This sequence wastes a memory operation slot in the second cycle and will incur a data dependence stall if the branch is not taken, since the second LW after the branch depends on the prior load.
• Show how the code can be improved using a predicated form of LW.

Predicated Instructions 





When we convert an entire code segment to predicated execution or speculatively move an instruction and make it predicated, we remove a control dependence. Correct code generation and the conditional execution of predicated instructions ensure that we maintain the data flow enforced by the branch. To ensure that the exception behavior is also maintained, a predicated instruction must not generate an exception if the predicate is false.

Complication and Disadvantage 



 



The major complication in implementing predicated instructions is deciding when to annul an instruction. Predicated instructions may be annulled either during instruction issue or later in the pipeline, before they commit any results or raise an exception. Each choice has a disadvantage. If predicated instructions are annulled early in the pipeline, the value of the controlling condition must be known early to prevent a stall for a data hazard. Since data-dependent branch conditions, which tend to be less predictable, are candidates for conversion to predicated execution, this choice can lead to more pipeline stalls.

Complication of Predication 





Because of this potential for data hazard stalls, no design with predicated execution (or conditional move) annuls instructions early. Instead , all existing processors annul instructions later in the pipeline , which means that annulled instructions will consume functional unit resources and potentially have negative impact on performance . A variety of other pipeline implementation techniques, such as forwarding, interact with predicated instructions, further complicating the implementation.

Factors Predicated or Conditional Instruction 

Predicated or conditional instructions are extremely useful for implementing short alternative control flows, for eliminating some unpredictable branches, and for reducing the overhead of global code scheduling. Nonetheless, the usefulness of conditional instructions is limited by several factors:  Predicated instructions that are annulled (I.e., whose conditions are false) still take some processor resources.  An annulled predicated instruction requires fetch resources at a minimum, and in most processors functional unit execution time.  Therefore, moving an instruction across a branch and making it conditional will slow the program down whenever the moved instruction would not have been normally executed.

Factors Predicated or Conditional Instruction 









Likewise, predicating a control-dependent portion of code and eliminating a branch may slow down the processor if that code would not have been executed. An important exception to these situations occurs when the cycles used by the moved instruction when it is not performed would have been idle anyway. Moving an instruction across a branch or converting a code segment to predicated execution is essentially speculating on the outcome of the branch. Conditional instructions make this easier but do not eliminate the execution time taken by an incorrect guess. In simple cases, where we trade a conditional move for a branch and a move, using conditional moves or predication is almost always better.

Factors Predicated or Conditional Instruction… 







When longer code sequences are made conditional, the benefits are more limited. Predicated instructions are most useful when the predicate can be evaluated early. If the condition evaluation and predicated instructions cannot be separated (because of data dependence in determining the condition) , then a conditional instruction may result in a stall for a data hazard. With branch prediction and speculation, such stalls can be avoided, at least when the branches are predicted accurately.

Factors Predicated or Conditional Instruction… 











The use of conditional instructions can be limited when the control flow involves more than a simple alternative sequence. For example, moving an instruction across multiple branches requires making it conditional on both branches, which requires two conditions to be specified or requires additional instructions to compute the controlling predicate. If such capabilities are not present, the overhead of if-conversion will be larger, reducing its advantage. Conditional instructions may have some speed penalty compared with unconditional instructions. This may show up as a higher cycle count for such instructions or a slower clock rate overall. If conditional instructions are more expensive, they will need to be used judiciously.

Compiler Speculation With Hardware Support 







Many programs have branches that can be accurately predicted at compile time, either from the program structure or by using a profile. In such cases, the compiler may want to speculate either to improve the scheduling or to increase the issue rate. Predicated instructions provide one method to speculate, but they are really more useful when control dependences can be completely eliminated by if-conversion. In many cases, we would like to move speculated instructions not only before the branch, but also before the condition evaluation, and predication cannot achieve this.

Speculate Ambitiously
To speculate ambitiously requires three capabilities:
 The ability of the compiler to find instructions that, with the possible use of register renaming, can be speculatively moved and not affect the program data flow.
 The ability to ignore exceptions in speculated instructions, until we know that such exceptions should really occur.
 The ability to speculatively interchange loads and stores, or stores and stores, which may have address conflicts.
 The first of these is a compiler capability, while the last two require hardware support.

Hardware Support for Preserving Exception Behavior 

To speculate ambitiously, we must be able to move any type of instruction and still preserve its exception behavior.



The key to being able to do this is to observe that the results of a speculated sequence that is mispredicted will not be used in the final computation , and such a speculated instruction should not cause an exception.

Investigated for supporting more ambitious speculation 

There are four methods that have been investigated for supporting more ambitious speculation without introducing erroneous behavior:
 The hardware and operating system cooperatively ignore exceptions for speculative instructions. This approach preserves exception behavior for correct programs, but not for incorrect ones. It may be viewed as unacceptable for some programs, but it has been used, under program control, as a "fast mode" in several processors.
 Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.

Investigated for supporting more ambitious speculation 

A set of status bits, called poison bits, are attached to the result registers written by speculated instructions when the instructions cause exceptions. The poison bits cause a fault when a normal instruction attempts to use the register.



A mechanism is provided to indicate that an instruction is speculative , and the hardware buffers the instruction result until it is certain that the instruction is no longer speculative.

Schemes 





To understand these schemes, we need to distinguish between exceptions that indicate a program error and would normally cause termination, such as a memory protection violation, and those that are handled and normally resumed, such as a page fault. Exceptions that can be resumed can be accepted and processed for speculative instructions just as if they were normal instructions. If the speculative instruction should not have been executed, handling the unneeded exception may have some negative performance effect, but it cannot cause incorrect execution.

Schemes.. 

The cost of these exceptions may be high, however, and some processors use hardware support to avoid taking such exceptions, just as processors with hardware speculation may take some exceptions in speculative mode, while avoiding others until an instruction is known not to be speculative.



Exceptions that indicate a program error should not occur in correct programs, and the result of a program that gets such a exception is not well defined, except perhaps when the program is running in a debugging mode.



If such exceptions arise in speculated instructions, we cannot take the exception until we know that the instruction is no longer speculative.

Schemes … 







In the simplest method for preserving exceptions, the hardware and the operating system handle all resumable exceptions when the exception occurs and simply return an undefined value for any exception that would cause termination. If the instruction generating the terminating exception was not speculative , then the program is in error. Note the instead of terminating the program, the program is allowed to continue, although it will almost certainly generate incorrect results. If the instruction generating the terminating exception is speculative, then the program may be correct and the speculative result will simply be unused.

Schemes … 







Thus, returning an undefined value for the instruction cannot be harmful. This scheme can never cause a correct program to fail, no matter how much speculation is done. An incorrect program, which formerly might have received a terminating exception, will get an incorrect result. This is acceptable for some programs, assuming the compiler can also generate a normal version of the program, which does not speculate and can receive a terminating exception.

Schemes … 



In such a scheme, it is not necessary to know that an instruction is speculative. Indeed, it is helpful only when a program is in error and receives a terminating exception on a normal instruction; in such cases, if the instruction were not marked as speculative, the program could be terminated. A second approach to preserving exception behavior when speculating introduces speculative versions of instructions that do not generate terminating exceptions, together with instructions to check for such exceptions. This combination preserves the exception behavior exactly.

Schemes … 









A third approach for preserving exception behavior tracks exceptions as they occur but postpones any terminating exception, although not in a completely precise fashion. The scheme is simple: a poison bit is added to every register, and another bit is added to every instruction to indicate whether the instruction is speculative. The poison bit of the destination register is set whenever a speculative instruction results in a terminating exception; all other exceptions are handled immediately. If a speculative instruction uses a register with its poison bit turned on, the destination register of the instruction simply has its poison bit turned on. If a normal instruction attempts to use a register source with its poison bit turned on, the instruction causes a fault. In this way, any program that would have generated an exception still generates one, albeit at the first instance where a result is used by an instruction that is not speculative.
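A very rough sketch of the poison-bit idea in C (illustrative only, not an ISA definition; the register-file layout, the faulting-load stand-in, and the function names are all assumptions made for the sketch):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { long value; bool poison; } Reg;

static Reg regs[32];

/* Speculative load: instead of trapping on a faulting access,
   poison the destination register and carry on.                */
static void spec_load(int rd, const long *addr) {
    if (addr == NULL) {            /* stand-in for a faulting access */
        regs[rd].poison = true;
        return;
    }
    regs[rd].value  = *addr;
    regs[rd].poison = false;
}

/* Non-speculative use: reading a poisoned source raises the
   deferred exception; if the value is never used, no fault.    */
static long use(int rs) {
    if (regs[rs].poison) {
        fprintf(stderr, "deferred exception: poisoned register R%d\n", rs);
        exit(EXIT_FAILURE);
    }
    return regs[rs].value;
}

int main(void) {
    long a = 42;
    spec_load(1, &a);      /* correctly speculated load: no poison      */
    spec_load(2, NULL);    /* mis-speculated, faulting load: R2 poisoned */
    printf("R1 = %ld\n", use(1));
    /* use(2) would terminate here, i.e. the fault surfaces only at the
       first non-speculative use of the poisoned value.                  */
    return 0;
}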

Schemes … 











Since poison bits exist only on register values and not memory values, stores are never speculative and thus trap if either operand is "poison". One complication that must be overcome is how the OS saves the user registers on a context switch if the poison bit is set; a special instruction is needed to save and restore the state of the poison bits to avoid this problem. The fourth and final approach listed earlier relies on a hardware mechanism that operates like a reorder buffer. In such an approach, instructions are marked by the compiler as speculative and include an indicator of how many branches the instruction was speculatively moved across and what branch action (taken/not taken) the compiler assumed. The last piece of information basically tells the hardware the location of the code block where the speculated instruction originally was.

Schemes … 





In practice, the original location of the speculated instruction is marked by a sentinel, which tells the hardware that the earlier speculative instruction is no longer speculative and values may be committed. All instructions are placed in a reorder buffer when issued and are forced to commit in order, as in the hardware speculation approach. (Notice, though, that no actual speculative branch prediction or dynamic scheduling occurs.) The reorder buffer tracks when instructions are ready to commit and delays the "write-back" portion of any speculative instruction.

Schemes … 







Speculative instructions are not allowed to commit until the branches that have been speculatively moved over are also ready to commit, or, alternatively, until the corresponding sentinel is reached. At that point, we know whether the speculated instruction should have been executed or not. If it should have been executed and it generated a terminating exception, then we know that the program should be terminated. If the instruction should not have been executed, then the exception can be ignored.

Example 1 

Consider the code fragment from an if-then-else statement of the form: if (A==0) A=B; else A=A+4; where A is at 0(R3) and B is at 0(R2). Assume the then clause is almost always executed. Compile the code using compiler-based speculation. Assume R14 is unused and available.

        LD      R1,0(R3)    ; load A
        BNEZ    R1,L1       ; test A
        LD      R1,0(R2)    ; then clause
        J       L2          ; skip else
L1:     DADDI   R1,R1,#4    ; else clause
L2:     SD      R1,0(R3)    ; store A

Example 2 

Show how the code can be rewritten using a speculative load (sLD) and a speculation check instruction (SPECCK) to completely preserve the exception behavior. Assume R14 is unused and available.

        LD      R1,0(R3)    ; load A
        BNEZ    R1,L1       ; test A
        LD      R1,0(R2)    ; then clause
        J       L2          ; skip else
L1:     DADDI   R1,R1,#4    ; else clause
L2:     SD      R1,0(R3)    ; store A

Example 3 

Consider the code fragment and show how it would be compiled with speculative instructions and poison bits. Show where an exception for the speculative memory reference would be recognized. Assume R14 is unused and available.

        LD      R1,0(R3)    ; load A
        BNEZ    R1,L1       ; test A
        LD      R1,0(R2)    ; then clause
        J       L2          ; skip else
L1:     DADDI   R1,R1,#4    ; else clause
L2:     SD      R1,0(R3)    ; store A

Hardware support for memory Reference Speculation 









Moving loads across stores is usually done when the compiler is certain the addresses do not conflict. A special instruction to check for address conflicts can be included in the architecture. The special instruction is left at the original location of the load instruction (and acts like a guardian), and the load is moved up across one or more stores. When the speculated load is executed, the hardware saves the address of the accessed memory location. If a subsequent store writes to that address before the check instruction, then the speculation has failed.

Hardware support for memory Reference Speculation … 



If the location has not been touched, then the speculation is successful. Speculation failure can be handled in two ways.  If only the load instruction was speculated, then it suffices to redo the load at the point of the check instruction (which could supply the target register in addition to the memory address).  If additional instructions that depend on the load were also speculated , then a fix-up sequence that reexecutes all the speculation instructions starting with the load is needed.  In this case, the check instruction specifies the address where the fix-up code is loaded.

Sections To be covered from Chapter 4  

Section 4.5 Software versus Hardware based scheduling

Memory Hierarchy

Memory Hierarchy & Cache Memory

Entry Quiz 1.

Primary cache or level 1 cache is implemented in a separate chip a. b.

True False

Entry Quiz 2.

SRAM is implemented using a. b. c. d.

Flip-Flop Magnetic core Capacitor Non-volatile Technology

Entry Quiz 3.

Main memory (200 ns) is slower compared to a register (0.2 ns) by an order of

3 4 1000 10000

Entry Quiz 4.

Virtual Memory is a. b. c. d.

Same as caching Same as associative memory Different from caching Same as disk memory

Entry Quiz 5.

Cache Miss occurs when the a.

b. c.

d.

e.

Required instruction is not found in the cache Required data is not found in the cache Required instruction or data is not found in the cache Required instruction or data is not found in the main memory For all of the above conditions

Module Objective
To understand:
1. Memory requirements of different computers
2. Memory hierarchy and the motivation behind it
3. Moore's Law
4. Principles of Locality
5. Cache memory and its implementation
6. Cache performance
7. Terms: Cache, Cache Miss, Cache Hit, Latency, Bandwidth, SRAM, DRAM, "by an order of", Direct Mapping, Associative Mapping, Set Mapping, Write Through, Write Allocate, Write Back, Dirty Bit, and Valid Bit

Memory Requirements
In general we would like to have:
- Faster memory (lower access time or latency)
- Larger memory (capacity and bandwidth)
- Simpler memory

Memory Requirements - Server, Desktop, and Embedded Devices
- Desktop: lower access time, larger memory
- Server: lower access time, higher bandwidth*, better protection*, larger memory
- Embedded: lower access time, simpler memory*

Processor-Memory Gap

Moore’s Law 

Transistor density on a chip die doubles about every 1.5 years.



Short reference: http://en.wikipedia.org/wiki/Moore's_law

What is Memory Hierarchy and Why?
[Figure: CPU registers (0.25 ns) – main memory (250 ns) – bus adapter – storage & I/O devices (2,500,000 ns!).]

Memory Hierarchy & Cache

Cache  





  

Cache is a smaller, faster, and more expensive memory. It improves the throughput/latency of the slower memory next to it in the memory hierarchy. Blocking reads and delaying the writes to slower memory offers better performance. There are two cache memories, L1 and L2, between the CPU and main memory. L1 is built into the CPU; L2 is an SRAM. Data, instructions, and addresses are cached.

Cache Operation 



Cache Hit: CPU finds the required data item (or instruction) in the cache. Cache Miss: CPU does not find the required item in the cache.  



 

The CPU is stalled while hardware loads the entire block that contains the required data item from memory into the cache. The CPU continues to execute once the cache is loaded.

Spatial Locality Principle of Locality

Hit and Miss 

Hit  



Miss  



Hit Rate Hit Time Miss Rate Miss Penalty

Hit Time << Miss Penalty

Higher Level Lower Level

Cache Performance Program Execution Time    

CPU Clock Cycle Cycle Time IC – Instruction Count Program Execution Time (Simple model) = (Useful CPU Cycles + Stalled CPU Cycles) x Cycle Time

Stalled CPU Cycles
Stalled CPU Cycles = Number of Cache Misses x Miss Penalty
                   = IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
Note: Miss Penalty is in CPU cycles; IC x (Memory Accesses / Instruction) is the number of cache references.

Separating Read Miss from Write Miss
Stalled CPU Cycles = IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
                   = IC x (Reads / Instruction) x Read Miss Rate x Read Miss Penalty
                   + IC x (Writes / Instruction) x Write Miss Rate x Write Miss Penalty

Example 1 

Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. If the miss penalty is 25 clock cycles and miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

Example 1 … 1.

Execution time for the program in a computer with 100 % hits = (Useful CPU Cycles + Stalled CPU Cycles) x Cycle Time = (IC X CPI + 0) X Cycle Time = IC X 1.0 X Cycle Time

Example 1 …
2. Stall cycles when there are cache misses:
   = IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
   = IC x (1 + 0.5) x 0.02 x 25
   = 0.75 x IC clock cycles

3. Total execution time:
   = (IC x 1.0 + IC x 0.75) x Cycle Time
   = 1.75 x IC x Cycle Time

Example 1 …
4. Ratio = (1.75 x IC x Cycle Time) / (1.0 x IC x Cycle Time) = 1.75

The computer with all cache hits is therefore 1.75 times faster.
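The same arithmetic written out as a small C program (the inputs are the ones given in Example 1; nothing new is assumed):

#include <stdio.h>

int main(void) {
    double ic          = 1.0;   /* instruction count (cancels out of the ratio) */
    double cpi         = 1.0;   /* base CPI with all cache hits                 */
    double mem_per_ins = 1.5;   /* 1 instruction fetch + 0.5 data accesses      */
    double miss_rate   = 0.02;
    double penalty     = 25.0;  /* clock cycles                                 */

    double stall       = ic * mem_per_ins * miss_rate * penalty; /* 0.75 * IC */
    double with_misses = ic * cpi + stall;                       /* 1.75 * IC */
    double all_hits    = ic * cpi;                                /* 1.00 * IC */

    printf("speedup with a perfect cache = %.2f\n", with_misses / all_hits); /* 1.75 */
    return 0;
}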

Example 2 

Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. Miss penalty for read miss is 25 clock cycles and miss penalty for write miss is 50 clock cycles. Miss rate is 2% and out of this 80% is Read miss. How much faster would the computer be if all instructions were cache hits?

Cache Operation
[Figure: a cache of C = 16 lines, each holding a tag and a block of K words, backed by a main memory of 2^N blocks of the same block length.]

Elements of Cache Design        

Cache Size Block Size Mapping Function Replacement Algorithm Write Policy Write Miss Number of caches Split versus Unified/Mixed Cache

Mapping Functions
[Figure: direct mapping — the CPU address is split into tag, line, and word fields; the line field selects one cache line, the stored tag is compared with the address tag, and on a hit the word is loaded while a miss goes to main memory.]

Mapping Function
 Direct: the line field of the address uniquely selects one line in the cache, so only 1 tag comparison is needed (see the address-decomposition sketch below).
 Set Associative: the line (set) field selects a set of lines (typically 2/4/8), so 2/4/8 tag comparisons are needed. This is known as 2/4/8-way set associative.
 Associative: there is no line field (the line value is effectively always 0), so a block may be placed in any line and all m tag comparisons are needed. Uses Content Addressable Memory (CAM) for the comparison and needs a non-trivial replacement algorithm.
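As a rough illustration of how a direct-mapped cache decodes an address, here is a sketch with an assumed geometry (16-byte blocks, 64 lines, 32-bit byte addresses) that is not taken from the slides:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 4   /* 16-byte block  -> word/offset field (assumed) */
#define LINE_BITS  6   /* 64 lines       -> line/index field  (assumed) */

int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset =  addr & ((1u << BLOCK_BITS) - 1);
    uint32_t line   = (addr >> BLOCK_BITS) & ((1u << LINE_BITS) - 1);
    uint32_t tag    =  addr >> (BLOCK_BITS + LINE_BITS);

    /* Direct mapped: the line field selects exactly one cache line,
       and a hit requires the tag stored there to equal this tag.    */
    printf("tag=0x%x line=%u offset=%u\n", tag, line, offset);
    return 0;
}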

[Figure: 2-way set associative mapping — the CPU address is split into tag, set, and word fields; the set field selects a set of two lines, both stored tags are compared with the address tag, and on a hit the word is loaded while a miss goes to main memory (blocks 0–127 of memory map to sets 0 and 1 in the example).]

Mapping Function Comparison

Cache type            | Hit ratio                        | Search speed
Direct mapping        | Good                             | Best
Fully associative     | Best                             | Moderate
N-way set associative | Very good, better as N increases | Good, worse as N increases

Replacement Algorithm    

Least Recently Used First In First Out Least Frequently Used Random

Write Policy 

Write-through 



Write back 



Information written to cache and the memory

Information written only to cache. Content of the cache is written to the main memory only when this cache block is replaced or the program terminates.

Dirty bit is used to indicate that a cache block needs “Write back”

Write Miss  

Write-Allocate No Write-Allocate

Number of Caches   

Primary Cache (CPU) Secondary Cache (SRAM) L3 Cache (Cheaper SRAM)

Split versus Unified/Mixed Cache   

Single or two Level Unified or Split Misses per 1000 instructions with various Cache size

[Table: misses per 1000 instructions for instruction, data, and unified caches at various cache sizes (only the first entry — 8 KB instruction cache, 8.1 misses — survives in these notes).]

Improving Cache Performance  

Reducing Penalty Reducing Misses 

Compiler optimizations that attempt to reduce cache misses fall under this category.

Stalled Cycles = IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty

Improving Cache Performance Using Compiler 

Compilers are built with the following optimization: Instructions: Reordering instructions to avoid conflict misses Data : Merging Arrays Loop Interchange Loop Fusion Blocking

Merging Arrays

/* Conflict: the two arrays can map to the same cache lines */
int key[SIZE];
int value[SIZE];

/* Instead: no conflict — each key and its value travel together in one block */
struct merge {
    int key;
    int value;
};
struct merge merged_array[SIZE];

Loop Interchange

for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* Instead: interchange the loops so the innermost loop walks memory
   sequentially (row-major layout) */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Blocking
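The Blocking slide's figure did not survive in these notes; as a stand-in, here is a minimal sketch of loop blocking (tiling) for matrix multiplication. The matrix size N and block size B are assumptions chosen so that roughly three B x B sub-blocks fit in the cache, not values taken from the course:

#include <stdio.h>

#define N 512
#define B 32    /* assumed block size */

static double x[N][N], y[N][N], z[N][N];

/* x = x + y * z, computed over B x B sub-blocks so the same pieces of
   y and z are reused while they are still resident in the cache.      */
static void blocked_mm(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = x[i][j];
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] = r;
                }
}

int main(void) {
    blocked_mm();
    printf("x[0][0] = %f\n", x[0][0]);   /* 0.0 for the all-zero inputs */
    return 0;
}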

Loop Fusion

for (i = 0; i < 5000; i = i+1)
    a[i] = i;
…
for (i = 0; i < 5000; i = i+1)
    b[i] = b[i] + a[i];

/* Instead: fuse the loops so a[i] is still in the cache when it is reused */
for (i = 0; i < 5000; i = i+1) {
    a[i] = i;
    b[i] = b[i] + a[i];
}

Example 2 



Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets) What are the number of hits and misses using no-write allocate versus write allocate? WriteMem[100]; WriteMem[100]; ReadMem[200]; WriteMem[200]; WriteMem[100];

Exit Quiz 1.

In memory hierarchy top layer is occupied by the a. b. c. d.

Fastest and the most expensive memory Slowest and the most expensive memory High band width and the fastest memory Slowest and the least expensive memory

Exit Quiz 2.

The memory technology suitable for L2 cache is a. b. c. d.

EPROM DRAM SRAM Flash

Exit Quiz 3.

Server’s memory requirements are different from desktop’s memory requirements a. b.

True False

Exit Quiz 4.

Hit Rate is (1 – Miss Rate) a. b.

5.

True False

Gordon Moore’s law states that the transistor density a. b. c. d.

Triples every year Doubles every two years Doubles every 1.5 years Doubles every year

Improving Cache Performance 

Two options  

By Reducing Miss Penalty By Reducing Miss Rate

Reducing Cache Miss Penalty 1. 2. 3. 4. 5.

Multilevel caches Critical word first & early restart Priority for Read misses over Write Merging Write Buffers Victim Cache

Reducing Miss Penalty - Multilevel Caches (1)
[Figure: three organizations in front of a 100 ns, 128 MB main memory — (a) a single 2 ns, 16 KB cache with miss rate 0.04 (faster but smaller); (b) a single 10 ns, 512 KB cache with miss rate 0.02 (larger but slower); (c) the 16 KB L1 backed by a 512 KB L2 (528 KB in total), giving both speed and capacity.]

Reducing Miss Rate & Global Miss Rate (1)
[Figure: CPU -> L1 (2 ns, 16 KB, local miss rate 0.05) -> L2 (10 ns, 512 KB, local miss rate 0.02) -> main memory (100 ns, 128 MB); effective access time is about 2.6 ns for a 528 KB total cache.]

Average memory access time (effective) = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
Global Miss Rate = Miss Rate_L1 x Miss Rate_L2
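A small C sketch of the two-level average-access-time formula, using the hit times and local miss rates from the figure above:

#include <stdio.h>

int main(void) {
    double hit_l1 = 2.0,  miss_l1 = 0.05;   /* ns, local miss rate */
    double hit_l2 = 10.0, miss_l2 = 0.02;
    double mem    = 100.0;                  /* main memory access time, ns */

    double penalty_l1 = hit_l2 + miss_l2 * mem;          /* 12 ns  */
    double amat       = hit_l1 + miss_l1 * penalty_l1;   /* 2.6 ns */
    double global_mr  = miss_l1 * miss_l2;               /* 0.001  */

    printf("AMAT = %.1f ns, global miss rate = %.3f\n", amat, global_mr);
    return 0;
}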

Reducing Miss Penalty – Critical Word First and Early Restart (2)   

Critical word first Early restart Lets discuss the following:   

How is Miss Penalty improved? Under what conditions this is useful? Is it worth going through all the trouble?

Reducing Miss Penalty – Priority To Read Miss (3)   

Read Miss takes priority Write to Memory is put on hold Lets discuss the following:      

Is there a problem? How is Miss Penalty improved? How this is done in Write-through? How this is done in Write-back? Under what conditions this is useful? Is it worth going through all the trouble?

Reducing Miss Penalty – Merging Write Buffer (4) Write-through is sent to a write buffer Buffer & Write possible options:

  1. 2. 3. 

1. Buffer is empty – write the address and data to the buffer.
2. Buffer is not full – write, or merge the data with an existing entry.
3. Buffer is full – stall until the buffer is written to memory.
 Note: a block write to memory is more efficient than multiple individual writes.

Lets Discuss the following:

 

 

How many different ways do we optimize Miss Penalty in this scheme? Explain the Fig 5.12 in the book When there is no block write (like I/O device), no merging

Reducing Miss Penalty – Fully Associative Victim Cache (5)  

 

Key word is Recycle Another implicit key word is complex (hardware) Aimed at handling conflict misses 1 to 6 victim caches are ideal

Reducing Misses 



Misses are due to the 3 Cs: Compulsory, Capacity, and Conflict.
Techniques for reducing misses:
1. Larger lines/blocks
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler techniques


Reducing Cache Misses – Larger Block Size (1) 



Larger block improves number of misses up to a point! Points to discuss 





 

Why number of misses starts increasing for larger blocks? Low latency encourages smaller blocks and higher latency encourages larger blocks. Low bandwidth encourages smaller blocks and higher bandwidth encourages larger blocks. Effect on Miss Penalty with larger blocks

What is the other name for 1-way set associative mapping?

Reducing Cache Misses – Larger Caches (2) 

Obviously!

Reducing Cache Misses – Higher Associativity (3)  

Miss rate improves with associativity Points to discuss 







Complexity of set associative mapping versus the improvement. Rule of thumb 1: 8-way associative is as good as fully associative. Rule of thumb 2: 2-way associative is as good as direct mapping. Greater associativity increases Hit time

Reducing Cache Misses – PseudoAssociative Caches (4) 

 





Lets understand this using an 8-way set associative Instruction Cache. Instruction exhibits better locality Each access to instruction cache normally needs 8 comparisons. Using locality predict the next block to access in the set to reduce the number of comparisons. Effect on Hit-time and Cache Miss

[Figure: the i-th set of an 8-way set associative instruction cache — eight blocks, each with its own tag.]

Reducing Hit Time – Small and Simple Cache (1) 





Tag comparison is complex, specifically in associative mapping Tag checking can be overlapped with data transfer to reduce the hit time. Tag checking in CPU chip, cache in a chip by itself. Provides better hit time and larger cache capacity

Reducing Hit Time – Virtual Cache (2) 

Addressing 

 

VA -> Physical Address -> cache

Skip two levels, VA maps to cache Problems:  

No page boundary checks Building direct mapping between VA and Cache for every process is not easy.

Reducing Hit Time – Pipelined Cache (3) 

Increases the cache throughput not access time

Reducing Hit Time – Trace Cache 





Increasing instruction level parallelism Instead of 4 consecutive locations of cache, load the next 4 instruction required by the CPU using trace. Folding branch prediction into cache

Improving Parallelism with Cache   



Cache Hit under Cache Miss Cache Miss under Cache Miss Both require non-blocking cache, out of order execution CPU Other methods to improve performance using parallelism:  

Hardware pre-fetching of instruction Software (compiler) pre-fetching of data

Preparation: Write-back versus Write-through

Write-back
- Complicates the cache coherency problem
- Low memory-access overhead (writes go only to the cache)
- Better cache access time than write-through
- Uses less memory bandwidth, but writes back whole blocks on replacement

Write-through
- Simplifies the cache coherency problem
- High overhead: every write also goes to memory
- If the cache is blocking, writes increase the access time
- Requires more memory bandwidth

Note: Write-through does not buy anything extra for a single-processor system, since there is no cache coherency problem.

CPU Execution Time
CPU Execution Time = (CPU Cycles + Stall Cycles) x Cycle Time
- Stall Cycles = Misses x Miss Penalty
- Misses are given either as misses per 1000 instructions or as misses per memory access (the miss rate).
- Instruction count, cycles per instruction, and the miss statistics are also required to compute CPU execution time.

Average Access Time with Cache
Average Access Time = Hit Time + Miss Rate x Miss Penalty
Multi-level cache:
Avg Access Time = Hit Time_L1 + Miss Rate_L1 x Penalty_L1
Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Penalty_L2

Addressing
[Figure: set associative addressing — the CPU address is split into tag, set, and word fields; the set field selects a set, every line in the set is compared against the address tag, and on a hit the word is loaded while a miss goes to main memory.]

Assignment I – Due same day next week       

Mapping functions Replacement algorithms Write policies Write Miss policies Split Cache versus Unified Cache Primary Cache versus Secondary Cache Compiler cache optimization techniques with examples

Assignment II - Due same day next week     



Multilevel Cache Cache Inclusion/Exclusion Property Thumb rules of cache Compiler pre-fetch Multi-level Caching + one another Miss Penalty Optimization technique Two miss Rate Optimization Techniques

Assignment III - Due 2nd class of next week 

All odd numbered problems from cache module of your text book.

Assignment IV - Due 2nd class of next week 

All even numbered problems from cache module of your text book.

CPU Execution Time & Average Access Time
[Figure: single-level cache — CPU with a 1 CC cache in front of a 100 CC main memory.]
[Figure: multi-level cache — CPU with a 1 CC L1 and a 10 CC L2 in front of a 100 CC main memory.]

Memory Hierarchy

Main Memory

Main Memory 

Module Objective 



To understand Main memory latency and bandwidth Techniques to improve latency and bandwidth

Memory Hierarchy & Cache

Main Memory – Cache – I/O
[Figure: CPU and cache (0.25 ns) – main memory (250 ns) – bus adapter – storage & I/O devices (2,500,000 ns!).]
 The cache prefers low-latency main memory.
 I/O and multiprocessors prefer high-bandwidth (high-throughput) main memory.

Main Memory Access Time
[Figure: 4 CC to send the address on the address bus, 56 CC for the memory access, 4 CC to return the data on the data bus.]
Access time per word = 4 + 56 + 4 = 64 CC. One word is 8 bytes.

CC – Clock Cycle

Improving Memory Performance 



Improving Latency ( time to access 1 memory unit - word) Improving bandwidth (bytes accessed in unit time)

Improving Memory Bandwidth
[Figure: three organizations — (a) simple design: one 64-bit bus between CPU, cache, and memory; (b) wider bus/memory: a whole 4-word cache block transferred at once; (c) interleaved memory: four banks with word addresses allocated round-robin (Bank 0: 0, 4, 8, 12; Bank 1: 1, 5, 9, 13; Bank 2: 2, 6, 10, 14; Bank 3: 3, 7, 11, 15).]
Discussion: bandwidth, latency, miss penalty, interleaving factor, and address allocation with interleaving.

Problem 1 Interleaving factor



What is the optimal interleaving factor if the memory cycle is 8 CC (1+6+1)? Assume 4 banks





[Timing diagram: with 4 banks (b1–b4) and an 8 CC memory cycle, successive read requests are issued to the banks in turn and their accesses overlap across cycles 1–20, so once the banks are busy a new word can be returned nearly every cycle.]

Problem 2 (page 452 in your text book)      

Block size = 1 word Memory bus size = 1 word Miss rate = 3% CPU Memory access per instruction = 1.2 Cache miss Penalty = 64 CC Avg Cycles per instruction = 2

Simple Memory Cache 64 CC

Assume 1000 instructions in your Program If no miss then execution time is 2000 CC One instruction needs 1.2 memory accesses, 1000 instruction -> 1200 accesses. If miss rate is 3% then number of misses for 1200 accesses is 36. the execution time is = 2000+36x64 = 4304 CC Average cycles per instruction = 4304/1000 = 4.3

Problem 2 (wider bus – 2 words)      

Block size = 4 word Memory bus size = 2 word Miss rate = 2.5% CPU Memory access per instruction = 1.2 Cache miss Penalty = 128 CC Avg Cycles per instruction = 2

Interleaved Memory Cache 64 CC

Assume 1000 instructions in your Program If no miss then execution time is 2000 CC One instruction needs 1.2 memory accesses, 1000 instruction -> 1200 accesses. If miss rate is 2.5% then number of misses for 1200 accesses is 30. the execution time is = 2000+30x128 = 5840 CC Average cycles per instruction = 5840/1000 = 5.84

Problem 2 (2-way interleaving)      

Block size = 4 word Memory bus size = 1 word Miss rate = 2.5% CPU Memory access per instruction = 1.2 Cache miss Penalty = 68x2 CC Avg Cycles per instruction = 2

Interleaved Cache 64 CCMemory

Assume 1000 instructions in your program. If there are no misses, the execution time is 2000 CC. One instruction needs 1.2 memory accesses, so 1000 instructions -> 1200 accesses. With a miss rate of 2.5%, the number of misses for 1200 accesses is 30, and the execution time is = 2000 + 30 x 68 x 2 = 6080 CC. Average cycles per instruction = 6080/1000 = 6.08
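The arithmetic for the three organizations, collected in one small C sketch (values taken from the preceding slides; the 2-way interleaved case uses the stated 68 x 2 CC penalty, which gives 6.08 cycles per instruction):

#include <stdio.h>

static double avg_cpi(double miss_rate, double penalty) {
    double instructions = 1000.0;
    double accesses     = 1.2 * instructions;   /* 1.2 memory accesses per instruction */
    double cycles       = 2.0 * instructions + miss_rate * accesses * penalty;
    return cycles / instructions;
}

int main(void) {
    printf("1-word block, 1-word bus : CPI = %.2f\n", avg_cpi(0.03, 64.0));      /* 4.30 */
    printf("4-word block, 2-word bus : CPI = %.2f\n", avg_cpi(0.025, 128.0));     /* 5.84 */
    printf("4-word block, interleaved: CPI = %.2f\n", avg_cpi(0.025, 68.0 * 2));  /* 6.08 */
    return 0;
}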

DRAM (Dynamic Random Access Memory)   

Address Multiplexing RAS and CAS Dynamic refreshing and implications 



Amdahl’s rule of thumb 



Data to be written after a READ! 1000 MIPS should have 1000 M memory

Read your text book for DRAM, SRAM, RAMBUS, and Flash.

SRAM (Static Random Access Memory)  

Uses six transistor per bit. Static 



NO NEED to write data after a READ

Comparison versus DRAM 

More transistors  



No Refresh 



Less memory – ¼ to 1/8 Expensive 8 - 10 times Faster 8-16 times

Used in Cache instead of main memory

Flash  

Similar to EEPROM Ideal choice for embedded systems  



Comparison versus EEPROM  



Finer and simpler control Better capacity

WRITE operation – entire block to be erased and replaced. 



Low power requirement, no moving parts Faster READ time, slower WRITE time

Writes are 50 to 100 times slower than READ

NOR type flash and NAND type flash

Virtual Memory
[Figure: virtual pages A–D of the virtual address space mapped either to frames of the physical main memory or out to disk.]

Virtual Memory Versus First-Level Cache

Parameter        | First-level cache                                      | Virtual memory
Block/page size  | 16-128 B                                               | 4 KB - 64 KB
Hit time         | 1-3 CC                                                 | 50-150 CC
Miss penalty     | 8-150 CC                                               | 1-10 M CC
  (access time)  | 6-130 CC                                               | 0.8-8 M CC
  (transfer time)| 2-20 CC                                                | 0.2-2 M CC
Miss rate        | 0.1-10%                                                | 0.00001-0.001%
Address mapping  | 25-45 bit physical address to 14-20 bit cache address  | 32-64 bit virtual address to 25-45 bit physical address

Memory Hierarchy Questions    

Where can a block be placed? How is a block found if it is in? Which block should be replaced? What happens on a write?

Segmentation versus Paging

Remark                  | Page                              | Segment
Words per address       | One                               | Two (segment and offset)
Programmer visible      | Invisible to the programmer       | Visible to the programmer
Replacing a block       | Trivial                           | Hard
Memory use inefficiency | Internal fragmentation            | External fragmentation
Efficient disk traffic  | Yes (easy to tune the page size)  | Not always (small segments)

CPU Execution Time (recap)
CPU Execution Time = (CPU Cycles + Stall Cycles) x Cycle Time
- Stall Cycles = Misses x Miss Penalty
- Misses are given either as misses per 1000 instructions or as misses per memory access (the miss rate).
- Instruction count, cycles per instruction, and the miss statistics are also required to compute CPU execution time.

Average Access Time with Cache (recap)
Average Access Time = Hit Time + Miss Rate x Miss Penalty
Multi-level cache:
Avg Access Time = Hit Time_L1 + Miss Rate_L1 x Penalty_L1
Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Penalty_L2

I/O Systems 

Objective 

To understand mainly disk storage technologies.

Storage Systems    

Individual Disks SCSI Subsystem RAID Storage Area Networks (NAS & SAN)

Disk Storage









Disk storage slower than Memory. Disk offers better volume at a lower unit price. Nicely fit into the Virtual Memory concept. Critical piece of the performance puzzle.

Courtesy: www.shortcourses.com/ choosing/storage/06.htm

Disk System     

Disk pack & cache, disk embedded controller, SCSI host adapter, OS (kernel/driver), user program.
[Figure: a host with a SCSI host adapter on the host I/O bus, connected over the SCSI bus to several disks, each with its own embedded controller.]

Individual Disks

Reference: http://www.stjulians.com/cs/diskstoragenotes.html

Components of Disk Access Time     

Seek Time Rotational Latency Internal Transfer Time Other Delays That is, 

Avg Access Time = Avg Seek Time + Avg Rotational Delay + Transfer Time + Other overhead

Problem (page 684)
- Average seek time = 5 ms
- RPM = 10,000
- Transfer rate = 40 MB/sec
- Other (controller) delays = 0.1 ms
- Sector size = 512 bytes (the value used in the calculation below)

Average Access Time
= Average Seek Time + Average Rotational Delay (time for 1/2 revolution) + Transfer Time + Other Overhead
= 5 ms + 0.5 x (60 / 10,000) s + 512 / (40 x 10^6) s + 0.1 ms
= 5 + 3 + 0.0128 + 0.1
= 8.1128 ms
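The same disk-access arithmetic as a small C sketch (using the 512-byte sector assumed in the calculation above):

#include <stdio.h>

int main(void) {
    double seek_ms      = 5.0;
    double rpm          = 10000.0;
    double transfer_Bps = 40e6;      /* 40 MB/sec                  */
    double sector_bytes = 512.0;     /* as used in the calculation */
    double overhead_ms  = 0.1;

    double rotation_ms = 0.5 * (60.0 / rpm) * 1000.0;          /* 3 ms      */
    double transfer_ms = sector_bytes / transfer_Bps * 1000.0; /* 0.0128 ms */

    printf("average access time = %.4f ms\n",
           seek_ms + rotation_ms + transfer_ms + overhead_ms); /* 8.1128 ms */
    return 0;
}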

RAID 





RAID stands for Redundant Array of Inexpensive/Individual Disks Uses a number of little disks instead of one large disk. Six+ Types of RAID (0-5).

Terminologies   

Mean Time Between Failure (MTBF) Mean Time To Data Loss (MTTDL) Mean Time To Data Inaccessability (MTTDI)

RAID-0 Striping
[Figure: logical storage of chunks 01–12 striped across n independent disks — a stripe spans all the disks, e.g. disk 1 holds 01, 05, 09; disk 2 holds 02, 06, 10; disk 3 holds 03, 07, 11; disk 4 holds 04, 08, 12.]

RAID-0 Performance 





 



 Throughput: best case – nearly n x the single-disk value.
 Utilization: worst case – nearly (1/n) x the single-disk value.
 Data reliability: r^n, where r is the reliability of one disk (r <= 1).
 Sequential access: fast.
 Random access: multithreaded random access offers better performance.
 When r = 0.8 and n = 2, reliability is 0.64.

RAID-1 Mirroring 











Issue Addressed: RAID-0 Reliability Problem. Shadowing or mirroring is used. Data is not lost when a disk goes down. One or more disks can be used to mirror primary disk. Writes are posted to primary and shadowing disks. Read from any of them.

[Figure: RAID-1 — blocks 01–03 written identically to the primary disk and to its mirror.]

RAID-1 Performance 

Reliability is improved with mirroring: 





   

(1- (1-r)(1-r))

Example: when r is 0.8, the reliability of RAID-1 is .96. Writes are more complex – must be committed to primary and all shadowing disks. Writes much slower due to atomicity requirement. Expensive – due to 1-to-1 redundancy. Throughput: Same as single disk value Utilization: ½ of single disk value

RAID 0+1 – Striping & Mirroring
[Figure: RAID 0+1 — a RAID 1 mirror of two RAID 0 stripe sets; each stripe set of n disks holds chunks 01–06, and the second set is an exact copy of the first.]

Performance RAID-0+1 



 Let the reliability of a RAID 0 sub-tree be R'. Then the reliability of the RAID 1 tree = 1 - (1-R')(1-R').
 Reliability R' is R' = r^2 (where r is the reliability of a single disk).
 Throughput is the same as RAID-0, however with 2 x n disks.
 Utilization is lower than RAID-0 due to mirroring.
 "Write" is marginally slower due to atomicity.
 When r = 0.9, R' = 0.81, and R = 1 - (0.19)^2 = 0.96.



  

RAID 1+0 – Mirroring & Striping
[Figure: RAID 1+0 — a RAID 0 stripe across RAID 1 mirror pairs; each mirror pair holds identical copies of its chunks (01/03/05 on one pair, 02/04/06 on the other).]

Performance RAID-1+0 



 Let the reliability of a RAID 1 sub-tree be R'. Then the reliability of the RAID 0 tree = (R')^2.
 Reliability R' is R' = 1 - (1-r)^2 (where r is the reliability of a single disk).
 Throughput is the same as RAID-0, however with 2 x n disks.
 Utilization is lower than RAID-0 due to mirroring.
 "Write" is marginally slower due to its atomicity.
 When r = 0.9, R' = 0.99, and R = (0.99)^2 = 0.98.

   

RAID-2 Hamming Code Arrays 

Low commercial interest due to complex nature of Hamming code computation.

RAID-3 Striping with Parity
[Figure: single bits or words of chunks 01–12 striped across the data disks, with a dedicated parity disk holding P0, P1, P2 — one parity value per stripe.]

RAID-3 Operation 

 

Based on the principle of a reversible parity computation, where
  Parity  P  = C0 ⊕ C1 ⊕ … ⊕ Cn-1 ⊕ Cn
  Missing stripe  Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm-1 ⊕ Cm+1 ⊕ … ⊕ Cn-1 ⊕ Cn

RAID-3 Performance 

  

RAID-1's 1-to-1 redundancy issue is addressed by one parity disk for n data disks. Less expensive than RAID-1; the rest of the performance is similar to RAID-0. The array can withstand the failure of any one of its disks.
Reliability = P(all disks working) + P(exactly one disk failed)
            = r^n + nC1 x r^(n-1) x (1 - r)

When r = 0.9 and n = 5:
 = 0.9^5 + 5 x 0.9^4 x (1 - 0.9) = 0.59 + 5 x 0.656 x 0.1 ≈ 0.59 + 0.33 = 0.92
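A small C sketch that evaluates the reliability expressions from these RAID slides; the single-disk reliability r = 0.9 and the n = 5 data disks for RAID-3 follow the slides' examples, and the other values are computed from the same formulas:

#include <math.h>
#include <stdio.h>

int main(void) {
    double r = 0.9;
    int    n = 5;                                  /* data disks for RAID-3 */

    double raid0  = pow(r, 2);                               /* 2-disk stripe        */
    double raid1  = 1.0 - (1.0 - r) * (1.0 - r);             /* mirror pair          */
    double raid01 = 1.0 - pow(1.0 - r * r, 2);               /* stripe, then mirror  */
    double raid10 = pow(1.0 - pow(1.0 - r, 2), 2);           /* mirror, then stripe  */
    double raid3  = pow(r, n) + n * pow(r, n - 1) * (1 - r); /* survives one failure */

    printf("RAID-0 %.2f  RAID-1 %.2f  RAID-0+1 %.2f  RAID-1+0 %.2f  RAID-3 %.2f\n",
           raid0, raid1, raid01, raid10, raid3);  /* 0.81 0.99 0.96 0.98 0.92 */
    return 0;
}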

RAID-4 Performance 



Similar to RAID-3, but supports larger chunks. Performance measures are similar to RAID-3.

RAID-5 (Distributed Parity)
[Figure: chunks 01–12 striped across the disks as in RAID-3/4, but the parity blocks P0, P1, P2 are rotated across all the disks instead of living on a dedicated parity disk.]

RAID-5 Performance 



In RAID-3 and RAID-4, Parity Disk is a bottleneck for Write operations. This issue is addressed in RAID-5.

Reading Assignment   

Optical Disk RAID study material Worked out problems from the text book: p 452, p 537, p 539, p561, p 684, p 691, p 704

Amdahl's Speedup for Multiprocessors
Speedup = 1 / (Fraction_enhanced / Speedup_enhanced + (1 - Fraction_enhanced))
 Speedup_enhanced – number of processors
 Fraction_enhanced – fraction of the program that runs in parallel on Speedup_enhanced processors
 Assumption: the program either runs in fully parallel (enhanced) mode making use of all the processors, or in non-enhanced (serial) mode.
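A small C sketch of this formula; the parallel fraction of 0.95 and the processor counts are assumed example values, not taken from the slides:

#include <stdio.h>

static double speedup(double fraction_parallel, double processors) {
    return 1.0 / (fraction_parallel / processors + (1.0 - fraction_parallel));
}

int main(void) {
    double f  = 0.95;                 /* assumed parallel fraction */
    int p[] = {2, 4, 16, 100};
    for (int i = 0; i < 4; i++)
        printf("%3d processors: speedup = %.2f\n", p[i], speedup(f, p[i]));
    return 0;   /* tends toward 1/(1-f) = 20 as the processor count grows */
}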

Multiprocessor Architectures 

Single Instruction Stream, Single Data Stream (SISD):



Single Instruction Stream, Multiple Data Stream (SIMD) 



Multiple Instruction Streams, Single Data Stream (MISD)  



Multiple Instruction Streams, Multiple Data Streams (MIMD)

Classification of MIMD Architectures 

Shared Memory: 





Centralized shared memory architecture OR Symmetric (shared-memory) Multiprocessors (SMP) OR Uniform Memory Access (UMA). Distributed shared memory architecture OR NonUniform Memory Access (NUMA) architecture.

Message Passing: 

Multiprocessor Systems based on messaging

Shared Memory versus Message Passing

Shared memory advantages:
1. Compatibility with the well-understood shared-memory mechanisms used in centralized multiprocessor systems.
2. Simplifies compiler design.
3. No need to learn a messaging protocol.
4. Low communication overhead.
5. Caching can be used to improve latency.

Message passing advantages:
1. Simpler hardware compared to scalable shared memory.
2. Communication is explicit, which means it is easier to understand.
3. Improved modularity.
4. No need for the expensive and complex synchronization mechanisms used in shared memory.

Two Types of Shared Memory Architectures
[Figure: centralized shared-memory (SMP) — several processors, each with a private cache, sharing one main memory and I/O system over a bus.]
[Figure: distributed shared-memory — processor + cache nodes, each with local memory and I/O, connected by an interconnection network.]

Symmetric versus Distributed Memory MP 

SMP: Uses shared memory for Inter-process Communication Advantage:

  



Disadvantage  



Close coupling due to shared memory Sharing of data is faster between processors Scaling: Memory is a bottleneck High and unpredictable Memory Latency 

Distributed : Uses message passing for Inter-process Communication Advantage:

  

Low Memory Latency Scaling: Scales better than SMP

Disadvantage

 

Control and management are more complex due to distributed memory

Performance Metrics  



Communication bandwidth   Communication Latency: Ideally latency is as low as possible. Communication latency is = Sender Overhead + Time of flight + Transmission Time + Receiver Overhead Communication Latency Hiding

Cache Coherence Problem

Time | Event                | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
0    |                      |                          |                          | 1
1    | CPU A reads X        | 1                        |                          | 1
2    | CPU B reads X        | 1                        | 1                        | 1
3    | CPU A stores 0 in X  | 0                        | 1                        | 0

Cache Coherence A Memory System is coherent if 1.

A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. Defines coherent view of memory.

2.

Write to the same location are serialized; that is two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.

Coherency and Consistency 

Coherency and consistency are complementary.



Coherency defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to other memory locations.

Features supported by Coherent Multiprocessors 

Migration: Shared data in a cache can be moved to another cache directly (without going through the main memory). This is referred to as migration. It is used in a transparent fashion. It reduces both latency (going to another cache every time the data item is accessed) and precious memory bandwidth.



Replication: Cache also provides replication for shared data items that are being simultaneously read, since the caches make a copy of the data item in the local cache. Replication reduces both latency of access and contention for a read shared data item.

Migration and Replication
[Figure: centralized shared-memory architecture (SMP) with a cache controller (cc) attached to each cache and to the shared bus, main memory, and I/O system.]

Cache Coherent (CC) Protocols  

CC protocol implement cache coherency Two types: 



Snooping (Replicated): There are multiple copies of the sharing status. Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, and no centralized state is kept. Directory based (logically centralized): There is only one copy of the sharing status of a block of physical memory. All the processors use this one copy. This copy could be in any of the participating processors.

Snooping Protocol
[Figure: centralized shared-memory architecture (SMP) in which each cache controller (cc) snoops the shared bus to main memory and the I/O system.]
Two ways: 1. Write Invalidation  2. Write Broadcast

Invalidation of Snooping Protocol

Processor activity      | Bus activity          | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
(initially)             |                       |                          |                          | 0
CPU A reads X (block)   | Cache miss for X      | 0                        |                          | 0
CPU B reads X (block)   | Cache miss for X      | 0                        | 0                        | 0
A writes 1 to X         | Invalidation for X    | 1                        |                          | 0
B reads X               | Cache miss for X      | 1                        | 1                        | 1

Write Broadcast of Snooping Protocol

Processor activity      | Bus activity           | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
(initially)             |                        |                          |                          | 0
CPU A reads X (block)   | Cache miss for X       | 0                        |                          | 0
CPU B reads X (block)   | Cache miss for X       | 0                        | 0                        | 0
A writes 1 to X         | Write broadcast for X  | 1                        | 1                        | 1
B reads X               |                        | 1                        | 1                        | 1

Reading Assignment 

 

Section 7.4 (Reliability, Availability, & Dependability) in your text book Page 554 and 555 Section 7.11 I/O Design – attempt all the 5 problems in this section.

Invalidation versus Write Update (Broadcast) 

Multiple writes to the same data item with no intervening reads require multiple write broadcast in an update protocol, but only one initial invalidation in a write invalidate protocol.

  

With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol.

  

An invalidation protocol works on cache blocks, while an update protocol must work on individual words.

  

The delay between writing a word in one processor and reading the written value in another processor is usually less in a write update scheme, since written data are immediately updated in the reader’s cache. By comparison, in an invalidation protocol, the reader is invalidated first, then later reads the data and is stalled until a copy can be read and returned to the processor.

Coherent Misses 

Write Invalidate causes Read cache misses 



True Miss: If an item X in a block is made exclusive by write invalidate and the read of the same item in another processor causes read miss then it is a true miss. False Miss: If non-exclusive items in the same block cause the miss then it is a false miss.

Use of Valid, Shared, and Dirty bits 

Valid bit: Every time a block is loaded into a cache from memory, the tag for the block is saved in cache and the valid bit is set to TRUE. A write update to the same block in a different processor may reset this valid bit due to write invalidate. Thus, when a cache block is accessed for READ or WRITE, tag should match AND the value of valid bit should be TRUE. If the tag matches but the valid bit is reset, then its a cache miss.

  

Shared bit: When a memory block is loaded into a cache block for the first time the shared bit is set to FALSE. When some other cache loads the same block, it is turned into TRUE. When this block is updated, write invalidate uses the value of shared bit to decide whether to send write-invalidate message or not. If the shared bit is set then an invalidate message is sent, otherwise not.



Dirty bit: Dirty bit is set to FALSE when a block is loaded into a cache memory. It is set to TRUE when the block is updated the first time. When another processor wants to load this block, this block is migrated to the processor instead of loading from the memory.

Summary of Snooping Mechanism

Request    | Source    | State of addressed cache block | Function and explanation
Read hit   | processor | shared or exclusive            | Read data in cache
Read miss  | processor | invalid                        | Place read miss on bus
Read miss  | processor | shared                         | Address conflict miss: place read miss on bus
Read miss  | processor | exclusive                      | Address conflict miss: write back block, then place read miss on bus
Write hit  | processor | exclusive                      | Write data in cache
Write hit  | processor | shared                         | Place write miss on bus
Write miss | processor | invalid                        | Place write miss on bus
Write miss | processor | shared                         | Address conflict miss: place write miss on bus
Write miss | processor | exclusive                      | Address conflict miss: write back block, then place write miss on bus
Read miss  | bus       | shared                         | No action; allow memory to service read miss
Read miss  | bus       | exclusive                      | Attempt to share data: place cache block on bus and change state to shared
Write miss | bus       | shared                         | Attempt to write shared block; invalidate the block
Write miss | bus       | exclusive                      | Attempt to write block that is exclusive elsewhere: write back the cache block and make its state invalid

State Transition (requests from the CPU)
[Figure: cache block state transition diagram driven by CPU requests, with states Invalid, Shared (read only), and Exclusive (read/write). A CPU read miss moves an Invalid block to Shared (place read miss on bus); a CPU write or write miss moves the block to Exclusive (place write miss on bus, writing back the block first on an address-conflict miss); read and write hits in the matching state cause no transition.]

State Transition (requests from the bus)
[Figure: cache block state transitions driven by snooped bus requests — a write miss for this block invalidates a Shared copy; a write miss for an Exclusive block forces a write-back (aborting the memory access) before invalidating it; a read miss for an Exclusive block forces a write-back and a transition to Shared.]

Some Terminologies 







Polling: A process periodically checks if there is a message that it needs to handle. This method of awaiting a message is called polling. Polling reduces processor utilization. Interrupt: A process is notified when a message arrives using built-in interrupt register. Interrupt increases processor utilization in comparison to polling.  Synchronous: A process sends a message and waits for the response to come before sending another message or carrying out other tasks. This is way of waiting is referred to as synchronous communication. Asynchronous: A process sends a message and continues to carry out other tasks while the requested message is processed. This is referred as asynchronous communication.

Communication Infrastructure 

Multiprocessor Systems with shared memory can have two types of communication infrastructure:  

Shared Bus Interconnect Interconnect Shared Bus

Directory Based Protocol
[Figure: nodes 1 … n, each holding part of the directory, connected by an interconnection network.]

State of the Block 

 

Shared: One or more processors have the block cached, and the value in memory is up to date. Uncached: No processor has a copy of the block. Exclusive: Exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. The processor is called the owner of the block.

Local, Remote, Home Node 





Local Node: the node where the request originates. Home Node: the node where the memory location and the directory entry of an address reside. Remote Node: a node that has a copy of the cache block.

Shared State Operation 



Read miss: The requesting processor is sent the requested data from memory, and the requestor is added to the sharing set. Write miss: The requesting processor is sent the value. All the processors in the Sharers set are sent invalidate messages, and the Sharers set is set to contain only the identity of the requesting processor. The state of the block is made exclusive.

Uncached State Operation 



Read-miss: The requesting processor is sent the requested data from memory and the requestor is made the only sharing node. The state of the block is made shared. Write-miss: The requesting processor is sent the requested data and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. Sharer indicates the identity of the owner.

Exclusive State Operation 

Read miss: The owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).



Data write back: The owner processor is replacing the block and therefore must write it back. This write back makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached and the Sharer is empty.



Write-miss: The block has a new owner. A message is sent to the old owner causing the cache to invalidate the block and send the value to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharer is set to the identity of the new owner, and the state of the block remains exclusive.
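Putting the three cases together, here is a condensed sketch of what a home node's directory controller might do on a read miss or a write miss. The message helpers (send_data, send_invalidate, fetch_from_owner, fetch_invalidate) are hypothetical stand-ins for interconnect messages, not a real API.

/* Sketch of a home-node directory controller for the protocol above. */
#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } state_t;
typedef struct { state_t state; uint32_t sharers; } dir_entry_t;

static void send_data(int node)        { printf("data value reply -> node %d\n", node); }
static void send_invalidate(int node)  { printf("invalidate       -> node %d\n", node); }
static void fetch_from_owner(int node) { printf("fetch            -> node %d\n", node); }
static void fetch_invalidate(int node) { printf("fetch/invalidate -> node %d\n", node); }

static int owner_of(const dir_entry_t *e) {          /* index of the single sharer */
    for (int i = 0; i < 32; i++) if (e->sharers & (1u << i)) return i;
    return -1;
}

static void read_miss(dir_entry_t *e, int p) {
    if (e->state == EXCLUSIVE)                       /* owner writes back, goes Shared */
        fetch_from_owner(owner_of(e));
    send_data(p);
    e->sharers |= 1u << p;                           /* requester joins the sharing set */
    e->state = SHARED;
}

static void write_miss(dir_entry_t *e, int p) {
    if (e->state == SHARED) {                        /* invalidate every other sharer  */
        for (int i = 0; i < 32; i++)
            if ((e->sharers & (1u << i)) && i != p) send_invalidate(i);
    } else if (e->state == EXCLUSIVE) {              /* old owner gives the block up   */
        fetch_invalidate(owner_of(e));
    }
    send_data(p);
    e->sharers = 1u << p;                            /* requester is the only copy now */
    e->state = EXCLUSIVE;
}

int main(void) {
    dir_entry_t e = { UNCACHED, 0 };
    read_miss(&e, 1);    /* Uncached -> Shared, Sharers = {1}        */
    read_miss(&e, 2);    /* Shared, Sharers = {1, 2}                 */
    write_miss(&e, 3);   /* invalidate 1 and 2; Exclusive, owner 3   */
    return 0;
}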

Cache State Transition (directory protocol)

[State diagram residue. Recoverable content: states Uncached, Shared (read only), and Exclusive (read/write). A read miss on an Uncached block gets a data value reply with Sharers = {P}; on a Shared block, Sharers += {P}. A write miss gets a data value reply with Sharers = {P} and moves the block to Exclusive. While Exclusive: a write miss triggers fetch/invalidate at the old owner, a read miss triggers a fetch and moves the block to Shared, and a data write-back makes the block Uncached with Sharers = {}.]

True Sharing Miss & False Sharing Miss

Time    Processor P1    Processor P2
 1      Write X1
 2                      Read X2
 3      Write X1
 4                      Write X2
 5      Read X2
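In the classic version of this exercise, X1 and X2 sit in the same cache block: a miss is a true sharing miss when the word written or read really is used by the other processor, and a false sharing miss when it occurs only because two unrelated words happen to share a block. The C sketch below, assuming POSIX threads and that x1 and x2 land in one 64-byte cache line, shows how false sharing arises in practice; the struct and function names are made up for the example.

/* Sketch of false sharing: x1 and x2 are logically independent, but because
 * they share a cache block, each write by one thread invalidates the other
 * thread's copy of the block. Padding each field to its own 64-byte line
 * (e.g. char pad[56] between them) would remove the false sharing. */
#include <pthread.h>
#include <stdio.h>

static struct { volatile long x1, x2; } shared_block;   /* same cache line */

static void *writer_x1(void *arg) {
    (void)arg;
    for (long i = 0; i < 10 * 1000 * 1000; i++) shared_block.x1 = i;
    return NULL;
}

static void *writer_x2(void *arg) {
    (void)arg;
    for (long i = 0; i < 10 * 1000 * 1000; i++) shared_block.x2 = i;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, writer_x1, NULL);
    pthread_create(&b, NULL, writer_x2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x1=%ld x2=%ld\n", shared_block.x1, shared_block.x2);
    return 0;
}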

Synchronization 

We need a set of hardware primitives with the ability to atomically read and modify a memory location.

Example 1: Atomic Exchange

 Interchanges a value in a register with a value in memory, as a single atomic operation.
 You could build a lock with this: if the memory location contains 1, the lock is held; to acquire, swap a 1 in and check whether the old value was 0 (see the sketch below).

[Figure: register A holding 0 being exchanged with a physical main memory location holding 1.]
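A minimal sketch of such a lock using C11 atomics; the names my_lock, lock_acquire, and lock_release are invented for the example.

/* Lock built from atomic exchange (C11 <stdatomic.h>). */
#include <stdatomic.h>

static atomic_int my_lock = 0;          /* 0 = free, 1 = held */

static void lock_acquire(void) {
    /* Atomically swap 1 into the lock; if the old value was 1,
       someone else holds it, so try again. */
    while (atomic_exchange(&my_lock, 1) == 1)
        ;                               /* spin */
}

static void lock_release(void) {
    atomic_store(&my_lock, 0);
}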

Other Operations

 Test-and-Set
 Fetch-and-Increment
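Sketches of both primitives using C11 atomics: atomic_flag gives a test-and-set, and atomic_fetch_add gives a fetch-and-increment. The names ts_lock, counter, and the functions are invented for illustration.

/* Test-and-set and fetch-and-increment with C11 atomics. */
#include <stdatomic.h>

static atomic_flag ts_lock = ATOMIC_FLAG_INIT;
static atomic_int  counter = 0;

static void ts_acquire(void) {
    /* test-and-set: atomically sets the flag and returns its old value */
    while (atomic_flag_test_and_set(&ts_lock))
        ;                                   /* spin while it was already set */
}

static void ts_release(void) { atomic_flag_clear(&ts_lock); }

static int next_ticket(void) {
    /* fetch-and-increment: atomically adds 1 and returns the old value */
    return atomic_fetch_add(&counter, 1);
}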

Load Linked & Store Conditional

 Load Linked (LL): a primitive operation that loads the contents of a specified memory location.
 Store Conditional (SC): paired with Load Linked on the same memory location. It stores a new value into the location only if nothing has intervened (no interrupt and no invalidation of the block) since the Load Linked; otherwise the store fails.

Operation of LL & SC

try:  LL     R2, 0(R1)    ; load linked
      DADDUI R3, R2, #1   ; increment
      SC     R3, 0(R1)    ; store conditional
      BEQZ   R3, try      ; branch if store fails

1. The address of the memory location loaded by LL is kept in a register (the link register).
2. If there is an interrupt or an invalidate of the block, this register is cleared.
3. SC checks whether this register is zero:
   a. If it is, the store fails (SC leaves 0 in R3).
   b. Otherwise, SC stores the contents of R3 at that memory address (and leaves 1 in R3).
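C does not expose LL/SC directly, but a compare-and-swap loop expresses the same atomic read-modify-write pattern; on LL/SC machines the compiler typically emits a loop very much like the assembly above. A minimal C11 sketch, with atomic_increment as an invented name:

/* Atomic increment via compare-and-swap, mirroring the LL/SC loop. */
#include <stdatomic.h>

static void atomic_increment(atomic_int *p) {
    int old = atomic_load(p);                                /* like LL       */
    /* Try to publish old+1; on failure, old is reloaded with *p's
       current value and we retry, like BEQZ back to try.                     */
    while (!atomic_compare_exchange_weak(p, &old, old + 1))  /* like SC       */
        ;
}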

Single Processor System

 Concurrent execution is the source of the problem (X = 0 in memory; Process 1 and Process 2 both run the same lock-acquire code):

time    P1                      P2
        TOP: READ X
             BNEZ TOP
        (P2 is scheduled)
                                TOP: READ X
                                     BNEZ TOP
                                     WRITE 1
        (P1 is scheduled)
             WRITE 1

 P1 reads X = 0 and passes the test, but is preempted before it writes. P2 then also reads X = 0, passes the test, and writes 1. When P1 resumes, it writes 1 as well, so both processes believe they hold the lock.

With Interrupt Mask

 On a single processor the race can be closed by masking interrupts around the critical sequence (X = 0 in memory; Process 1 and Process 2 each execute):

TOP:    MASK
        READ X
        BNEZ TOP
        WRITE 1
        UNMASK

 While interrupts are masked, the scheduler cannot switch to the other process in the middle of the read-test-write sequence, so on a single processor the sequence behaves atomically.

Multiprocessor System

 Masking the interrupt will not work (X = 0 in shared memory; the two processes now run on different processors at the same time):

        Processor 1 (P1)            Processor 2 (P2)
TOP:    MASK                TOP:    MASK
        READ X                      READ X
        BNEZ TOP                    BNEZ TOP
        WRITE 1                     WRITE 1
        UNMASK                      UNMASK

 Masking only prevents a context switch on the local processor. Both processors can still read X = 0 at the same time and both write 1, so masking provides no mutual exclusion across processors.

Multiprocessor System

 Uses cache coherence:
 Bus-based systems or interconnect-based systems
 The coherence system arbitrates writes (write invalidate or write update)
 Thus, it serializes writes!

Spin Locks

 Locks that a processor continuously tries to acquire, spinning around a loop until it succeeds.
 Used when a lock is expected to be held for only a short amount of time.

Spin Lock implementation

lockit: LD     R2, 0(R1)   ; load of lock
        BNEZ   R2, lockit  ; not available - spin
        DADDUI R2, R0, #1  ; load locked value
        EXCH   R2, 0(R1)   ; swap
        BNEZ   R2, lockit  ; branch if lock wasn't 0
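A C11 sketch of the same idea: spin on ordinary loads while the lock looks busy (so the cached copy stays shared and the bus stays quiet), and only then attempt the atomic exchange. The names lock_var, spin_lock, and spin_unlock are invented for the example.

/* Test-then-exchange spin lock, mirroring the assembly above. */
#include <stdatomic.h>

static atomic_int lock_var = 0;     /* 0 = free, 1 = held */

static void spin_lock(void) {
    for (;;) {
        while (atomic_load(&lock_var) != 0)      /* LD / BNEZ: spin on a read */
            ;
        if (atomic_exchange(&lock_var, 1) == 0)  /* EXCH: try to grab it      */
            return;                              /* old value 0 => we own it  */
        /* old value 1 => lost the race; go back to spinning on reads */
    }
}

static void spin_unlock(void) { atomic_store(&lock_var, 0); }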

Cache Coherence steps (three processors competing for the lock)

Step 1: P0 has the lock; P1 and P2 spin, testing if lock = 0.
        Lock state: Shared. Bus/directory activity: none.
Step 2: P0 sets lock to 0; P1 and P2 receive the invalidate.
        Lock state: Exclusive (P0). Activity: write invalidate of the lock variable from P0.
Step 3: P1 and P2 both take a cache miss.
        Lock state: Shared. Activity: bus/directory services P2's miss; write back from P0.
Step 4: P1 waits for the bus; P2 reads lock = 0.
        Lock state: Shared. Activity: cache miss for P2 satisfied.
Step 5: P1 reads lock = 0; P2 executes the swap and gets a cache miss.
        Lock state: Shared. Activity: cache miss for P1 satisfied.
Step 6: P1 executes the swap and gets a cache miss; P2 completes its swap, which returns 0, and sets lock = 1.
        Lock state: Exclusive (P2). Activity: bus/directory services P2's miss; generates invalidate.
Step 7: P1's swap completes, returns 1, and sets lock = 1; P2 enters the critical section.
        Lock state: Exclusive (P1). Activity: bus/directory services P1's miss; generates write back.
Step 8: P1 goes back to spinning, testing if lock = 0.
        Activity: none.

Multi-threading

 Multi-threading – why and how?
 Fine-grain
 Thread switching is (almost) round robin
 Higher overhead (compared to coarse grain)
 Better throughput than coarse grain
 Coarse-grain
 Threads are switched only on long stalls (e.g., an L2 cache miss)
 Low overhead
 Individual threads get better performance

Reading Assignment

 Memory Consistency

Notes (Answers)
 Problem on page 596
 An example that uses LL (Load Linked) and SC (Store Conditional)

End!

IETF – Internet Engineering Task Force

 Loosely self-organized groups (working groups) of …
 Two types of documents: I-Ds and RFCs
 An I-D's life is shorter than an RFC's life
 RFCs include proposed standards and standards
 RFC examples: RFC 1213, RFC 2543
 I-D -> Proposed Standard -> Standard

SIP Evolution

 SIP I-D v1 ’97, v2 ’98
 SIP RFC 2543 ’99
 SIP RFC 3261 ’02, obsoletes RFC 2543
 SIP working groups – SIPPING, SIMPLE, PINT, SPIRITS, …
 SIPit – SIP interoperability test

Predictor for SPEC92 Benchmarks

[Bar chart: instructions between mispredictions (0–300) for the SPEC92 benchmarks ear, doduc, gcc, hydro2d, mdljdp, su2cor, compress, eqntott, espresso, and li, comparing a predicted-taken scheme with a profile-based scheme.]
