Computer Architecture

Tomasulo's Approach to Dynamic Scheduling
• Addresses WAR and WAW hazards
• Based on renaming of registers

Original code:

    DIV.D   F0, F2, F4
    ADD.D   F6, F0, F8
    S.D     F6, 0(R1)
    SUB.D   F8, F10, F14
    MUL.D   F6, F10, F8

With renaming (S and T are temporary registers):

    DIV.D   F0, F2, F4
    ADD.D   S,  F0, F8
    S.D     S,  0(R1)
    SUB.D   T,  F10, F14
    MUL.D   F6, F10, T

3-Stages of Tomasulo’s Pipeline • Issue • Execute • Write Result

Basic Structure of Tomasulo's Algorithm

[Figure: the instruction queue (fed from the instruction unit) issues floating-point operations to the reservation stations in front of the FP adders and FP multipliers, and load-store operations to the address unit with its load buffers and store buffers; operands reach the reservation stations from the FP registers over the operand buses and the operation bus; addresses and data go through the memory unit; every result is broadcast on the Common Data Bus (CDB) to the reservation stations, the FP registers, and the store buffers.]

Advantages of Tomasulo's Algorithm
• Hazard detection logic is distributed across the reservation stations
• No head-of-the-queue blocking – stalls due to WAW and WAR hazards are eliminated
• It can also eliminate name dependences that cannot be eliminated by a compiler
• The Common Data Bus (CDB) broadcasts a result to all waiting reservation stations and registers at once, rather than copying it through a single destination register

Reservation Station – Data Fields • Instruction Status – Indicates the current stage for an instruction

• Reservation Station Status – Busy, Op, Vj, Vk, Qj, Qk, A

• Register Status – Indicates which functional unit will write to each register

Show what the information tables look like for the following code sequence when only the first load has completed and written its result:

1. L.D    F6, 34(R2)
2. L.D    F2, 45(R3)
3. MUL.D  F0, F2, F4
4. SUB.D  F8, F2, F6
5. DIV.D  F10, F0, F6
6. ADD.D  F6, F8, F2

Example – Instruction Status

Instruction            Issue   Execute   Write Result
L.D    F6, 34(R2)        √        √          √
L.D    F2, 45(R3)        √        √
MUL.D  F0, F2, F4        √
SUB.D  F8, F2, F6        √
DIV.D  F10, F0, F6       √
ADD.D  F6, F8, F2        √

Example – Reservation Stations

Name    Busy   Op     Vj                   Vk                   Qj       Qk      A
Load1   no
Load2   yes    Load                                                              45 + Regs[R3]
Add1    yes    SUB                         Mem[34 + Regs[R2]]   Load2
Add2    yes    ADD                                              Add1     Load2
Add3    no
Mult1   yes    MUL                         Regs[F4]             Load2
Mult2   yes    DIV                         Mem[34 + Regs[R2]]   Mult1

Example – Register Status

Field   F0      F2      F4    F6      F8      F10     F12 … F30
Qi      Mult1   Load2         Add2    Add1    Mult2

Steps in Algorithm
• To understand the full power of eliminating WAW and WAR hazards through dynamic renaming of registers, we must look at a loop. Consider the following simple sequence for multiplying the elements of an array by the scalar in F2:

Loop:  L.D     F0, 0(R1)
       MUL.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop    ; branches if R1 != R2

Steps in Algorithm

Instruction state: Issue (FP operation)
Wait until: reservation station r is empty
Action or bookkeeping:
    if (RegisterStat[rs].Qi != 0)
        RS[r].Qj <- RegisterStat[rs].Qi;
    else { RS[r].Vj <- Regs[rs]; RS[r].Qj <- 0; }
    if (RegisterStat[rt].Qi != 0)
        RS[r].Qk <- RegisterStat[rt].Qi;
    else { RS[r].Vk <- Regs[rt]; RS[r].Qk <- 0; }
    RS[r].Busy <- yes; RegisterStat[rd].Qi <- r;

Instruction state: Issue (Load or Store)
Wait until: buffer r is empty
Action or bookkeeping:
    if (RegisterStat[rs].Qi != 0)
        RS[r].Qj <- RegisterStat[rs].Qi;
    else { RS[r].Vj <- Regs[rs]; RS[r].Qj <- 0; }
    RS[r].A <- imm; RS[r].Busy <- yes;

Load only:   RegisterStat[rt].Qi <- r;
Store only:  if (RegisterStat[rt].Qi != 0)
                 RS[r].Qk <- RegisterStat[rt].Qi;
             else { RS[r].Vk <- Regs[rt]; RS[r].Qk <- 0; }
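Not from the slides – a minimal C sketch of the issue-stage bookkeeping above, using the reservation-station fields Busy, Op, Vj, Vk, Qj, Qk and the register-status field Qi; the sizes and names (NUM_RS, issue_fp_op) are invented for the illustration:

    #include <stdio.h>

    #define NUM_REGS 32
    #define NUM_RS   8          /* number of reservation stations (arbitrary for the sketch) */

    typedef struct {
        int    busy;            /* station holds an instruction that has not finished        */
        char   op[8];           /* operation to perform (ADD, MUL, ...)                      */
        double vj, vk;          /* source values, valid only when qj/qk are 0                */
        int    qj, qk;          /* producing station (1..NUM_RS); 0 means value is in vj/vk  */
        long   a;               /* immediate / effective address for loads and stores       */
    } RS;

    static RS     rs[NUM_RS + 1];      /* station 0 unused so that 0 can mean "ready"        */
    static double regs[NUM_REGS];      /* register file                                      */
    static int    regstat[NUM_REGS];   /* Qi: station that will write the register, 0 = none */

    /* Issue step for an FP operation rd <- rs_ op rt_ into station r
       (mirrors the "Issue / FP operation" row of the table above).   */
    static void issue_fp_op(int r, const char *op, int rd, int rs_, int rt_)
    {
        if (regstat[rs_] != 0) rs[r].qj = regstat[rs_];        /* operand still being computed */
        else { rs[r].vj = regs[rs_]; rs[r].qj = 0; }           /* operand available now        */

        if (regstat[rt_] != 0) rs[r].qk = regstat[rt_];
        else { rs[r].vk = regs[rt_]; rs[r].qk = 0; }

        rs[r].busy = 1;
        snprintf(rs[r].op, sizeof rs[r].op, "%s", op);
        regstat[rd] = r;               /* rename: rd will be produced by station r */
    }

    int main(void)
    {
        /* Issue ADD F6 <- F0 + F8 into station 1, then MUL F6 <- F10 * F8 into station 2.
           The second issue simply overwrites Qi of F6, so the WAW hazard disappears.     */
        issue_fp_op(1, "ADD", 6, 0, 8);
        issue_fp_op(2, "MUL", 6, 10, 8);
        printf("F6 will be written by station %d\n", regstat[6]);   /* prints 2 */
        return 0;
    }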

Steps in Algorithm

Instruction state: Execute (FP operation)
Wait until: (RS[r].Qj = 0) and (RS[r].Qk = 0)
Action or bookkeeping: compute result using the operands in Vj and Vk

Instruction state: Execute (Load/Store, step 1)
Wait until: RS[r].Qj = 0 and r is head of the load-store queue
Action or bookkeeping: RS[r].A <- RS[r].Vj + RS[r].A
Reducing Branch Cost
• Control hazards can cause a greater performance loss for the MIPS pipeline than data hazards.
• Branch prediction is the mechanism that resolves the outcome of a branch early, preventing control dependences from causing stalls.
• It exploits the fact that branches are most often associated with loops.
• Two types of branch prediction
  – Static prediction
  – Dynamic prediction

Terminologies related to Branch Prediction
• Control or branch hazard
• Taken (the branch changes the PC to its target address)
• Not taken or untaken (otherwise)
• Freeze or flush
• Predicted-not-taken

      ADD   R1, R2, R3
      BEQZ  R4, L
      SUB   R1, R5, R6
      ST    R1, A
      ...
L:    OR    R7, R1, R8

Branch Taken or Untaken

High-level program segment:

    for (j = 0; j < 10; j++)
        ...   /* for-body */

Assembly program segment (schematic):

RET:       check & branch to for-loop    ; taken while the loop condition holds
nextinst:  ...                           ; executed when the branch falls through
           ...
for-loop:  ...                           ; for-body
           ...
           JUMP  RET                     ; end-for: jump back to the check

Basic Branch Prediction – Prediction Buffer
• Uses a branch history table (prediction buffer) indexed by the branch address; in the simplest form each entry is a single bit recording whether the branch was last taken.

[Figure: a 1-bit branch history table – entries such as baddr1 = 1, baddr2 = 0, baddr3 = 1, indexed by branch address, giving a taken / not-taken prediction.]

[Figure: 2-bit prediction scheme – a saturating counter with states 11 and 10 (predict taken) and 01 and 00 (predict untaken); a taken branch moves the counter toward 11 and a not-taken branch toward 00, so the prediction only changes after two successive mispredictions.]
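Not from the slides – a minimal C sketch of the 2-bit saturating counter in the state diagram above; the table size and the function names (predict_taken, update_predictor) are invented for the illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_SIZE 4096                     /* entries in the branch history table (arbitrary) */

    /* 2-bit counters: values 3 and 2 predict taken; 1 and 0 predict not taken */
    static uint8_t bht[BHT_SIZE];

    static unsigned bht_index(uint64_t branch_pc)
    {
        return (branch_pc >> 2) & (BHT_SIZE - 1);    /* low-order address bits index the table */
    }

    bool predict_taken(uint64_t branch_pc)
    {
        return bht[bht_index(branch_pc)] >= 2;       /* states 11 and 10 predict taken */
    }

    void update_predictor(uint64_t branch_pc, bool taken)
    {
        uint8_t *ctr = &bht[bht_index(branch_pc)];
        if (taken)  { if (*ctr < 3) (*ctr)++; }      /* saturate at 11 */
        else        { if (*ctr > 0) (*ctr)--; }      /* saturate at 00 */
    }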

Static Branch Prediction • Useful when branch behavior is highly predictable at compile time • It can also assist dynamic predictors • Loop unrolling is an example of this

Branch Causing a 3-Cycle Stall

[Figure: pipeline timing – the branch instruction proceeds IF, ID, EX, MEM, WB; the fetch of its successor is repeated until the branch resolves, so the successor does not start ID until three cycles later, and successors +1 and +2 are pushed back behind it.]

      ADD   R1, R2, R3
      BEQZ  R4, L
      SUB   R1, R5, R6
      ST    R1, A
      ...
L:    OR    R7, R1, R8

Reducing pipeline branch penalties – Compile-time schemes
• Freeze or flush
• Assume that every branch is not taken
• Assume that every branch is taken
• Delayed branch

Delayed Branch
• Code segment:

        branch instruction
        sequential successor      ; branch delay slot
        ...
Loop:   branch target if taken

• The instruction in the branch delay slot is always executed
• The compiler can schedule an instruction into the branch delay slot so that its execution is useful even when the prediction is wrong

Branch Delay of 1

Not taken:

Untaken branch instruction       IF   ID   EX   MEM  WB
Branch delay instruction (j+1)        IF   ID   EX   MEM  WB
Instruction j + 2                          IF   ID   EX   MEM  WB
Instruction j + 3                               IF   ID   EX   MEM  WB
Instruction j + 4                                    IF   ID   EX   MEM  WB

Taken:

Taken branch instruction         IF   ID   EX   MEM  WB
Branch delay instruction (j+1)        IF   ID   EX   MEM  WB
Branch target                              IF   ID   EX   MEM  WB
Branch target + 1                               IF   ID   EX   MEM  WB
Branch target + 2                                    IF   ID   EX   MEM  WB

Scheduling the branch delay slot

(a) From before:

        DADD  R1, R2, R3
        if R2 = 0 then
            [delay slot]

    becomes

        if R2 = 0 then
            DADD  R1, R2, R3

(b) From target:

        DSUB  R4, R5, R6
        ...
        DADD  R1, R2, R3
        if R1 = 0 then
            [delay slot]

    becomes

        DADD  R1, R2, R3
        if R1 = 0 then
            DSUB  R4, R5, R6

(c) From fall-through:

        DADD  R1, R2, R3
        if R1 = 0 then
            [delay slot]
        OR    R7, R8, R9
        DSUB  R4, R5, R6

    becomes

        DADD  R1, R2, R3
        if R1 = 0 then
            OR    R7, R8, R9
        DSUB  R4, R5, R6

Best choice: (a), since the instruction moved from before the branch is useful regardless of the branch outcome.

Branch Prediction – Correlating Prediction
• Prediction based on correlation between branches

    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ... }

        SUB   R3, R1, #2
        BNEZ  R3, L1          ; branch b1 (aa != 2)
        ADD   R1, R0, R0      ; aa = 0
L1:     SUB   R3, R2, #2
        BNEZ  R3, L2          ; branch b2 (bb != 2)
        ADD   R2, R0, R0      ; bb = 0
L2:     SUB   R3, R1, R2      ; R3 = aa - bb
        BEQZ  R3, L3          ; branch b3 (aa == bb)

• If the first and second branches are both not taken, then the last branch will be taken, so its behavior is correlated with theirs.
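Not from the slides – a rough C sketch of the simplest correlating predictor, a (1,1) scheme, for illustration: one bit of global history (the outcome of the previous branch) selects between two per-branch prediction bits; the sizes and names (PHT_SIZE, corr_predict) are made up here.

    #include <stdbool.h>
    #include <stdint.h>

    #define PHT_SIZE 1024                      /* entries per history value (arbitrary)           */

    static bool     pht[2][PHT_SIZE];          /* last-outcome bit, selected by 1 bit of history   */
    static unsigned ghist;                     /* global history: outcome of the last branch       */

    static unsigned pht_index(uint64_t pc) { return (pc >> 2) & (PHT_SIZE - 1); }

    bool corr_predict(uint64_t pc)
    {
        return pht[ghist & 1][pht_index(pc)];  /* prediction depends on the previous branch */
    }

    void corr_update(uint64_t pc, bool taken)
    {
        pht[ghist & 1][pht_index(pc)] = taken; /* remember the outcome for this history */
        ghist = taken;                         /* shift the new outcome into the history */
    }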

ILP: Software-Based Approaches

Basic Compiler Technique

• Basic pipeline scheduling and loop unrolling

    for (j = 1000; j > 0; j--)
        x[j] = x[j] + s;

L:  L.D    F0, 0(R1)      ; F0 = array element
    ADD.D  F4, F0, F2     ; add scalar in F2
    S.D    F4, 0(R1)      ; store result
    ADDI   R1, R1, #-8    ; decrement pointer
    BNE    R1, R2, L      ; branch if R1 != R2

Latencies of FP operations used

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

Assume an integer load latency of 1 and an integer ALU operation latency of 0.

L:  L.D    F0, 0(R1)      ; F0 = array element
    ADD.D  F4, F0, F2     ; add scalar in F2
    S.D    F4, 0(R1)      ; store result
    ADDI   R1, R1, #-8    ; decrement pointer
    BNE    R1, R2, L      ; branch if R1 != R2

Without Any Scheduling

                                Clock cycle issued
L:  L.D    F0, 0(R1)             1
    stall                        2    ; NOP
    ADD.D  F4, F0, F2            3
    stall                        4    ; NOP
    stall                        5    ; NOP
    S.D    F4, 0(R1)             6
    ADDI   R1, R1, #-8           7
    stall                        8    ; latency of 1 between the ALU op and BNE
    BNE    R1, R2, L             9
    stall                       10    ; NOP

10 clock cycles per element; 7 of them (ADDI, BNE, and the stalls) are loop overhead.

Assume an integer load latency of 1 and an integer ALU operation latency of 0.

With Scheduling (using the FP latencies from the table above)

                                Clock cycle issued
L:  L.D    F0, 0(R1)             1
    ADDI   R1, R1, #-8           2
    ADD.D  F4, F0, F2            3
    stall                        4
    BNE    R1, R2, L             5
    S.D    F4, 8(R1)             6    ; store in the branch delay slot; offset adjusted for the early ADDI

6 clock cycles per element; 3 of them (ADDI, BNE, and the stall) are loop overhead, compared with 10 cycles per element without scheduling.

With Unrolling (four copies of the loop body)

L:  L.D    F0, 0(R1)
    L.D    F6, -8(R1)
    L.D    F10, -16(R1)
    L.D    F14, -24(R1)
    ADD.D  F4, F0, F2
    ADD.D  F8, F6, F2
    ADD.D  F12, F10, F2
    ADD.D  F16, F14, F2
    S.D    F4, 0(R1)
    S.D    F8, -8(R1)
    ADDI   R1, R1, #-32
    S.D    F12, 16(R1)     ; 16 - 32 = -16
    BNE    R1, R2, L
    S.D    F16, 8(R1)      ; 8 - 32 = -24, fills the branch delay slot

14 clock cycles per 4 elements; 2 of them (ADDI, BNE) are loop overhead.

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

With Unrolling & Scheduling L:

L.D L.D L.D -16(R1) L.D -24(R1) L.D -32(R1) S.D S.D S.D -16(R1) ADDI 40 S.D BNE S.D 8(R1)

F0, 0(R1) F6, -8(R1) F10, F14, F18, F4, 0(R1) F8, -8(R1) F12, R1, R1, #F16, 16(R1) R1, R2, L F20,

i

ADD.D ADD.D ADD.D ADD.D ADD.D

F4, F0, F2 F8, F6, F2 F12, F10, F2 F16, F14, F2 F20, F18, F2

12 clock cycles/5 elements, 2 (ADDI, BNE) are loop overhead
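Putting the cycle counts quoted on these slides together: without scheduling the loop takes 10 cycles per element; scheduling alone brings this to 6; unrolling four times gives 14/4 = 3.5; and unrolling five times with scheduling gives 12/5 = 2.4 cycles per element.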

The VLIW Approach
• The first multiple-issue processors required the instruction stream to be explicitly organized by the compiler.
• They used a wide instruction with multiple independent operations per instruction.
• This architectural approach was named VLIW (Very Long Instruction Word).

Advantage of the VLIW Approach
• The advantage of a VLIW grows as the maximum issue rate grows.
• VLIW approaches make sense for wider processors.
• For example, a VLIW processor might have instructions that contain five operations: one integer operation, two floating-point operations, and two memory references.

The Basic VLIW Approach
• There is no fundamental difference between the VLIW and superscalar approaches in the code they can execute.
• The task of choosing which instructions to issue simultaneously falls on the compiler, so the hardware a superscalar needs to make these issue decisions is unneeded.

The Basic VLIW Approach
• Scheduling techniques
  – Local scheduling
  – Global scheduling

Example • Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[I]=x[I]+s for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore the branch delay slot.

Unrolled and scheduled VLIW code (the unrolled sequence replaces the inner loop):

Memory reference 1     Memory reference 2     FP operation 1       FP operation 2       Integer op / branch
L.D  F0, 0(R1)         L.D  F6, -8(R1)
L.D  F10, -16(R1)      L.D  F14, -24(R1)
L.D  F18, -32(R1)      L.D  F22, -40(R1)      ADD.D F4, F0, F2     ADD.D F8, F6, F2
L.D  F26, -48(R1)                             ADD.D F12, F10, F2   ADD.D F16, F14, F2
                                              ADD.D F20, F18, F2   ADD.D F24, F22, F2
S.D  F4, 0(R1)         S.D  F8, -8(R1)        ADD.D F28, F26, F2
S.D  F12, -16(R1)      S.D  F16, -24(R1)                                                DADDUI R1, R1, #-56
S.D  F20, 24(R1)       S.D  F24, 16(R1)
S.D  F28, 8(R1)                                                                          BNE R1, R2, Loop

Problems of VLIW Model • They are: – Technical problems – Logistical Problems

Technical Problems
• Increase in code size and the limitations of lockstep operation.
• Two different elements combine to increase code size substantially for a VLIW:
  – Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops, thereby increasing code size.
  – Whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding.

Logistical Problems and Solutions
• Logistical problems
  – Binary code compatibility has been a major problem.
  – Different numbers of functional units and different unit latencies require different versions of the code.
  – Migration problems between implementations.

• Solution approaches:
  – Use object-code translation or emulation. This technology is developing quickly and could play a significant role in future migration schemes.
  – Temper the strictness of the approach so that binary compatibility is still feasible.

Multiple issue versus a vector processor
• The advantages are twofold:
  – A multiple-issue processor has the potential to extract some amount of parallelism from less regularly structured code.
  – It has the ability to use a more conventional, and typically less expensive, cache-based memory system.

Compiler techniques for exposing and exploiting ILP
• Compiler technology for increasing the amount of parallelism that can be exploited.
• Defining when a loop is parallel and how dependences can prevent a loop from being parallel.
• Eliminating some types of dependences.

Detecting and Enhancing Loop-level Parallelism
• Loop-level parallelism is normally analyzed at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler.
• It involves determining what dependences exist among the operands in a loop across the iterations of that loop.

Detecting and Enhancing Loop-level Parallelism
• The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations depend on data values produced in earlier iterations; such a dependence is called a loop-carried dependence.
• A loop that is parallel at the source representation:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Example 1
• Consider the loop:

    for (i = 1; i <= 100; i = i + 1) {
        A[i + 1] = A[i] + C[i];
        B[i + 1] = B[i] + A[i + 1];
    }

• Loop-carried dependence: execution of an instance of a loop requires the execution of a previous instance.

Example 2
• Consider the loop:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i + 1] = C[i] + D[i];  /* S2 */
    }

What are the dependences between S1 and S2? Is this loop parallel? If not, show how to make it parallel.

Scheduling

Iteration 1:    A[1] = A[1] + B[1];        B[2] = C[1] + D[1];
Iteration 2:    A[2] = A[2] + B[2];        B[3] = C[2] + D[2];
Iteration 3:    A[3] = A[3] + B[3];        B[4] = C[3] + D[3];
Iteration 4:    A[4] = A[4] + B[4];        B[5] = C[4] + D[4];
                ...
Iteration 99:   A[99] = A[99] + B[99];     B[100] = C[99] + D[99];
Iteration 100:  A[100] = A[100] + B[100];  B[101] = C[100] + D[100];

Because the B[i + 1] produced by S2 in iteration i is used by S1 only in iteration i + 1, S2 of one iteration can be overlapped with S1 of the next.

Removing the Loop-carried Dependency
• The same example can be rewritten as:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[101] = C[100] + D[100];

Example 3
• Loop-carried dependence information can be inexact: the analysis only tells us that a dependence may exist.
• Consider the example:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

Recurrence
• Loop-carried dependences can form a recurrence:

    for (i = 2; i <= 100; i = i + 1) {
        Y[i] = Y[i - 1] + Y[i];
    }

    for (i = 6; i <= 100; i = i + 1) {
        Y[i] = Y[i - 5] + Y[i];
    }

• Dependence distance: an iteration i of the loop depends on the completion of an earlier iteration i - n; the dependence distance is n. In the second loop above the distance is 5, so up to five iterations can be in flight at once.

Finding Dependences
• Dependence analysis serves three tasks:
  – Good scheduling of code
  – Determining which loops might contain parallelism
  – Eliminating name dependences

• A dependence exists if two conditions hold:
  – There are two iteration indices, j and k, both within the limits of the for loop; that is, m <= j <= n and m <= k <= n.
  – The loop stores into an array element indexed by a*j + b and later fetches from the same array element when it is indexed by c*k + d; that is, a*j + b = c*k + d.

Example 1
• Use the GCD test to determine whether dependences exist in the loop:

    for (i = 1; i <= 100; i = i + 1) {
        X[2*i + 3] = X[2*i] * 5.0;
    }

• Here a = 2, b = 3, c = 2, d = 0. Since GCD(a, c) = 2 does not divide d - b = -3, no dependence is possible.
  (GCD = greatest common divisor)
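Not from the slides – a small C sketch of the GCD test for the affine references X[a*i + b] (store) and X[c*i + d] (load); the function name gcd_test_may_depend is invented here.

    #include <stdio.h>
    #include <stdlib.h>

    /* greatest common divisor of two non-negative integers */
    static int gcd(int x, int y)
    {
        while (y != 0) { int t = x % y; x = y; y = t; }
        return x;
    }

    /* GCD test: a dependence between X[a*i + b] (store) and X[c*i + d] (load)
       is possible only if GCD(a, c) divides (d - b). The test is conservative:
       "may depend" does not guarantee a dependence within the loop bounds.     */
    int gcd_test_may_depend(int a, int b, int c, int d)
    {
        int g = gcd(abs(a), abs(c));
        return g == 0 || (d - b) % g == 0;
    }

    int main(void)
    {
        /* X[2*i + 3] = X[2*i] * 5.0;  ->  a = 2, b = 3, c = 2, d = 0 */
        printf("dependence possible? %s\n",
               gcd_test_may_depend(2, 3, 2, 0) ? "yes" : "no");   /* prints "no" */
        return 0;
    }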

Example 2
• The following loop has multiple types of dependences. Find all the true dependences and antidependences, and eliminate the output dependences and antidependences by renaming.

    for (i = 1; i <= 100; i = i + 1) {
        y[i] = x[i] / c;   /* S1 */
        x[i] = x[i] + c;   /* S2 */
        z[i] = y[i] + c;   /* S3 */
        y[i] = c - y[i];   /* S4 */
    }
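Not worked out on the slides – one way to do the renaming, sketched in C: the value defined by S1 is renamed to a fresh array t1, and x written by S2 is renamed to x1 (both names invented for this sketch). This removes the output dependence S1/S4 and the antidependences S1/S2 and S3/S4, while the true dependences S1->S3 and S1->S4 remain; code after the loop that needs the new x must use x1.

    #define N 100

    void renamed(double *x, double *y, double *z, double c)
    {
        static double t1[N + 1], x1[N + 1];   /* fresh names introduced by renaming */
        for (int i = 1; i <= N; i = i + 1) {
            t1[i] = x[i] / c;     /* S1: source of the remaining true dependences    */
            x1[i] = x[i] + c;     /* S2: antidependence on x removed by renaming      */
            z[i]  = t1[i] + c;    /* S3: true dependence on S1                        */
            y[i]  = c - t1[i];    /* S4: output dep. and antidep. on y removed        */
        }
    }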

Dependence Analysis – Limitations
• There are a wide variety of situations in which array-oriented dependence analysis cannot tell us what we might want to know, including:
  – When objects are referenced via pointers rather than array indices.
  – When array indexing is indirect through another array, which happens with many representations of sparse arrays.
  – When a dependence may exist for some value of the inputs but does not exist in actuality when the code is run, since the inputs never take on those values.
  – When an optimization depends on knowing more than just the possibility of a dependence, but needs to know on which write of a variable a read depends.

Basic Approach used in Points-to Analysis
• The basic approach used in points-to analysis relies on information from three major sources:
  – Type information, which restricts what a pointer can point to.
  – Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to. For example, if p always points to an object allocated in a given source line and q never points to that object, then p and q can never point to the same object.
  – Information derived from pointer assignments. For example, if p may be assigned the value of q, then p may point to anything q points to.

Analyzing Pointers
• There are several cases where analyzing pointers has been successfully applied and is extremely useful:
  – When pointers are used to pass the address of an object as a parameter, it is possible to use points-to analysis to determine the possible set of objects referenced by a pointer. One important use is to determine whether two pointer parameters may designate the same object.
  – When a pointer can point to one of several types, it is sometimes possible to determine the type of the data object that the pointer designates at different parts of the program.
  – It is often possible to separate out pointers that may only point to a local object from those that may point to a global one.

Different Types of Limitations
• Two different types of limitations affect our ability to do accurate dependence analysis for large programs:
  – Limitations arising from restrictions in the analysis algorithms. Often we are limited by the lack of applicability of the analysis rather than a shortcoming in dependence analysis per se.
  – The need to analyze behavior across procedure boundaries to get accurate information.

Eliminating Dependent Computations
• Compilers can reduce the impact of dependent computations so as to achieve more ILP.
• The key technique is to eliminate or reduce a dependent computation by back substitution, which increases the amount of parallelism and sometimes increases the amount of computation required. These techniques can be applied both within a basic block and within loops.

Copy Propagation
• Within a basic block, algebraic simplification of expressions and an optimization called copy propagation, which eliminates operations that copy values, can be used to simplify sequences such as:

    DADDUI  R1, R2, #4
    DADDUI  R1, R1, #4

  which becomes

    DADDUI  R1, R2, #8

Example
• Consider the code sequence:

    ADD  R1, R2, R3
    ADD  R4, R1, R6
    ADD  R8, R4, R7

• This sequence requires at least three execution cycles, since every instruction depends on its immediate predecessor. By taking advantage of associativity, transform the code and rewrite it as:

    ADD  R1, R2, R3
    ADD  R4, R6, R7
    ADD  R8, R1, R4

• This sequence can be computed in two execution cycles. When loop unrolling is used, opportunities for these types of optimizations occur frequently.

Recurrences
• Recurrences are expressions whose value on one iteration is given by a function that depends on previous iterations.
• When a loop with a recurrence is unrolled, we may be able to algebraically optimize the unrolled loop so that the recurrence need only be evaluated once per unrolled iteration.

Types of Recurrences
• One common type of recurrence arises from an explicit program statement:

    Sum = Sum + X;

• Unrolling the recurrence five times gives, unoptimized (5 dependent operations):

    Sum = Sum + X1 + X2 + X3 + X4 + X5;

• Optimized, with 3 dependent operations:

    Sum = ((Sum + X1) + (X2 + X3)) + (X4 + X5);
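Not from the slides – a small C sketch of the same re-association applied inside an unrolled summation loop; the array and function names are made up for the illustration. (Note that re-associating floating-point adds can change rounding, so a compiler only does this when it is allowed to.)

    /* Sum of n doubles, unrolled by 5 (n is assumed to be a multiple of 5 here). */
    double sum_reassociated(const double *x, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i += 5) {
            /* naive form:  sum = sum + x[i] + x[i+1] + x[i+2] + x[i+3] + x[i+4];
               every add depends on the previous one (a chain of 5 dependent adds). */
            /* re-associated: the two parenthesised pairs are independent of sum,
               so the dependence chain is only 3 adds long.                          */
            sum = ((sum + x[i]) + (x[i + 1] + x[i + 2])) + (x[i + 3] + x[i + 4]);
        }
        return sum;
    }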

Software Pipelining: Symbolic Loop Unrolling
• Two important techniques:
  – Software pipelining
  – Trace scheduling
• Reading assignment: Trace Scheduling; Software Pipelining versus Loop Unrolling

Software Pipelining
• A technique for reorganizing loops such that each iteration of the software-pipelined code is made from instructions chosen from different iterations of the original loop.
• It interleaves instructions from different iterations without unrolling the loop.
• This technique is the software counterpart to what Tomasulo's algorithm does in hardware.

Software-Pipelined Iteration

[Figure: iterations 0 through 4 of the original loop shown as overlapping boxes; a software-pipelined iteration is assembled by taking one slice from each of several original iterations.]

Example
• Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2:

Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

  You may omit the start-up and clean-up code.

Software Pipeline

Iteration i:      L.D  F0, 0(R1)    ADD.D  F4, F0, F2    S.D  F4, 0(R1)
Iteration i + 1:  L.D  F0, 0(R1)    ADD.D  F4, F0, F2    S.D  F4, 0(R1)
Iteration i + 2:  L.D  F0, 0(R1)    ADD.D  F4, F0, F2    S.D  F4, 0(R1)

Start-up: load M[i] into F0; add to F0 and hold the result in F4; load M[i-1] into F0.

Loop:  S.D     F4, 16(R1)     ; store into M[i]
       ADD.D   F4, F0, F2     ; add to M[i - 1]
       L.D     F0, 0(R1)      ; load M[i - 2]
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

Clean-up: store F4 into M[1]; ADD.D F4, F0, F2; store the result into M[0].

Major Advantage of Software Pipelining
• The major advantage of software pipelining over straight loop unrolling is that software pipelining consumes less code space.
• In addition to yielding a better-scheduled inner loop, each technique reduces a different type of overhead:
  – Loop unrolling reduces the overhead of the loop – the branch and counter-update code.
  – Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and the end.

Software Pipelining versus the Unrolled Loop

[Figure: (a) software pipelining – the number of overlapped operations over time is constant except for start-up code at the beginning and wind-down code at the end; (b) loop unrolling – overlap between unrolled iterations recurs over time, and the overlapped region is proportional to the number of unrolls.]

Software Pipelining Difficulty
• In practice, software pipelining is quite difficult to apply for several reasons:
  – Many loops require significant transformation before they can be software pipelined, the trade-offs in terms of overhead versus efficiency of the software-pipelined loop are complex, and the issue of register management creates additional complexities.
  – To help deal with these issues, the IA-64 added extensive hardware support for software pipelining.
  – Although this hardware can make it more efficient to apply software pipelining, it does not eliminate the need for complex compiler support, or the need to make difficult decisions about the best way to compile a loop.

Global Code Scheduling
• Global code scheduling aims to compact a code fragment with internal control structure into the shortest possible sequence that preserves the data and control dependences.
• Data dependences force a partial order on operations; they are overcome by unrolling and, in the case of memory operations, by using dependence analysis to determine whether two references refer to the same address.
• Finding the shortest possible sequence for a piece of code means finding the shortest sequence for the critical path, which is the longest sequence of dependent instructions.

Global Code Scheduling (continued)
• Control dependences
  – They dictate instructions across which code cannot easily be moved.
  – Control dependences arising from loop branches are reduced by unrolling.
  – Global code scheduling can reduce the effect of control dependences arising from conditional non-loop branches by moving code.
  – Since moving code across branches will often affect the frequency of execution of such code, effectively using global code motion requires estimates of the relative frequency of different paths.
  – Although global code motion cannot guarantee faster code, if the frequency information is accurate, the compiler can determine whether such code movement is likely to lead to faster code.
  – Global code motion is important since many inner loops contain conditional statements.
  – Global code scheduling also requires …

A Typical Code Fragment

    A[i] = A[i] + B[i];
    if (A[i] == 0)
        B[i] = ...;     /* then part */
    else
        X               /* else part */
    C[i] = ...;         /* after if  */

            LD      R4, 0(R1)       ; load A
            LD      R5, 0(R2)       ; load B
            DADDU   R4, R4, R5      ; add to A
            SD      R4, 0(R1)       ; store A
            ...
            BNEZ    R4, elsepart    ; test A
            ...                     ; then part
            SD      ..., 0(R2)      ; store to B
            ...
            J       join            ; jump over else
elsepart:   ...                     ; else part
            X                       ; code for X
            ...
join:       ...                     ; after if
            SD      ..., 0(R3)      ; store C[i]

Factors the Compiler Must Consider
• Consider the factors that the compiler would have to weigh in moving the computation and assignment of B:
  – What are the relative execution frequencies of the then case and the else case in the branch? If the then case is much more frequent, the code motion may be beneficial. If not, it is much less likely, although not impossible, that the code motion is worthwhile.
  – What is the cost of executing the computation and assignment to B above the branch? It may be that there are a number of empty instruction issue slots in the code above the branch and that the instructions for B can be placed into slots that would otherwise go empty. This opportunity makes the computation of B "free", at least to first order.
  – How will the movement of B change the execution time for the then case? If B is at the start of the critical path for the then case, moving it may be highly beneficial.
  – Is B the best code fragment that can be moved above the branch? How does it compare with moving C or other statements within the then case?
  – What is the cost of the compensation code that may be necessary for the else case? How effectively can this code be scheduled, and what is its impact on execution time?

Global Code Scheduling – Principle
• The two methods (trace scheduling and superblock scheduling) rest on a simple principle:
  – Focus the attention of the compiler on a straight-line code segment representing what is estimated to be the most frequently executed code path.
  – Unrolling is used to generate the straight-line code, but, of course, the complexity arises in how conditional branches are handled.
• In both cases, the branches are effectively straightened by choosing and scheduling the most frequent path.

Trace Scheduling: Focusing on the Critical Path
• Trace scheduling is useful for processors with a large number of issues per clock, where conditional or predicated execution is inappropriate or unsupported, and where simple loop unrolling may not be sufficient by itself to uncover enough ILP to keep the processor busy.
• It is a way to organize the global code motion process, so as to simplify code scheduling by incurring the costs of possible code motion on the less frequent paths.

Steps in Trace Scheduling
• There are two steps to trace scheduling:
  – Trace selection: tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions; this sequence is called a trace.
  – Trace compaction: tries to squeeze the trace into a small number of wide instructions. Trace compaction is code scheduling; hence, it attempts to move operations as early as it can in the sequence (trace), packing the operations into as few wide instructions (or issue packets) as possible.

Advantage of Trace Scheduling
• Its advantage is that it simplifies the decisions concerning global code motion.
• In particular, branches are viewed as jumps into or out of the selected trace, which is assumed to be the most probable path.
• When code is moved across such trace entry and exit points, additional bookkeeping code will often be needed at the entry or exit point.

Key Assumption
• The key assumption is that the trace is so much more probable than the alternatives that the cost of the bookkeeping code need not be a deciding factor:
  – If an instruction can be moved and thereby make the main trace execute faster, it is moved.
• Although trace scheduling has been successfully applied to scientific code with its intensive loops and accurate profile data, it remains unclear whether this approach is suitable for programs that are less simply characterized and less loop-intensive.
• In such programs, the significant overheads of compensation code may make trace scheduling an unattractive approach, or, at best, its effective use will be extremely complex for the compiler.

Trace in the Program Fragment

[Figure: the code fragment A[i] = A[i] + B[i]; if (A[i] == 0) B[i] = ...; C[i] = ... unrolled twice, with the frequent path through both copies selected as the trace; the off-trace branch paths are marked as trace exits, and the points where the else parts rejoin the trace are marked as trace entrances.]

Drawback of Trace Scheduling
• One of the major drawbacks of trace scheduling is that entries into and exits out of the middle of the trace cause significant complications, requiring the compiler to generate and track the compensation code and often making it difficult to assess the cost of such code.

Superblocks
• Superblocks are formed by a process similar to that used for traces, but they are a form of extended basic block, restricted to a single entry point but allowing multiple exits.
• How can a superblock with only one entrance be constructed? The answer is to use tail duplication to create a separate block that corresponds to the portion of the trace after the entry.
• The superblock approach reduces the complexity of bookkeeping and scheduling versus the more general trace generation approach, but it may enlarge code size more than a trace-based approach.
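Not from the slides – a C-level sketch of tail duplication: the code after the side entrance is duplicated so that the frequent path becomes a single-entry superblock; all function names here are invented stubs for the illustration.

    #include <stdio.h>

    static void then_part(void) { puts("then"); }
    static void else_part(void) { puts("else"); }
    static void tail(void)      { puts("tail"); }
    static void tail_copy(void) { puts("tail (duplicated)"); }

    /* Before: the code after the if/else (the "tail") has two entry points,
       one from the frequent then-path and one from the rare else-path.       */
    void before(int c)
    {
        if (c) then_part(); else else_part();
        tail();                                /* join point with two predecessors      */
    }

    /* After tail duplication: the frequent path then_part(); tail() forms a
       single-entry superblock; the rare path gets its own copy of the tail.  */
    void after(int c)
    {
        if (c) { then_part(); tail(); }        /* superblock: one entry, may exit early */
        else   { else_part(); tail_copy(); }   /* duplicated tail for the rare path     */
    }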

Superblock Resulting from Unrolling the Code

[Figure: the fragment A[i] = A[i] + B[i]; if (A[i] == 0) B[i] = ...; C[i] = ... unrolled twice and formed into a superblock along the frequent path; the else paths leave through superblock exits (with n = 2 and n = 1 iterations remaining) into the residual code X.]

Exploiting ILP
• Loop unrolling, software pipelining, trace scheduling, and superblock scheduling all aim to increase the amount of ILP that can be exploited by a processor issuing more than one instruction on every clock cycle.
• The effectiveness of each of these techniques and their suitability for various architectural approaches are among the hottest topics being pursued by researchers and designers of high-speed processors.

Exposing More Parallelism at Compile Time
• Techniques such as loop unrolling, software pipelining, and trace scheduling can be used to increase the amount of parallelism available when the behavior of branches is fairly predictable at compile time.
• The first additional technique is an extension of the instruction set to include conditional or predicated instructions.
• Such instructions can be used to eliminate branches, converting a control dependence into a data dependence and potentially improving performance.
• Such approaches are useful with both hardware-intensive and software-intensive schemes; in either case, predication can be used to eliminate branches.

Conditional or Predicated Instructions
• The concept behind conditional instructions is quite simple:
  – An instruction refers to a condition, which is evaluated as part of the instruction's execution.
  – If the condition is true, the instruction is executed normally.
  – If the condition is false, the execution continues as if the instruction were a no-op.
  – Many newer architectures include some form of conditional instructions.

Example
• Consider the following code:

    if (A == 0) { S = T; }

  Assuming that registers R1, R2, and R3 hold the values of A, S, and T respectively, show the code for this statement with a branch and with a conditional move.
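Not worked out on the slides – a C-level sketch of the two alternatives; the branchless form is written so that a compiler can map it onto a single conditional-move instruction (for example, a move-on-zero instruction such as MIPS MOVZ):

    /* With a branch: the assignment is guarded by a conditional jump.            */
    void with_branch(int A, int *S, int T)
    {
        if (A == 0)        /* maps to a compare-and-branch around the move        */
            *S = T;
    }

    /* Branchless: select the new value unconditionally; a compiler can turn the
       ternary into one conditional move, removing the control dependence.        */
    void with_cmov(int A, int *S, int T)
    {
        *S = (A == 0) ? T : *S;
    }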

Conditional Moves and Predication
• Conditional moves are the simplest form of conditional or predicated instructions, and although useful for short sequences, they have limitations.
• In particular, using conditional moves to eliminate branches that guard the execution of large blocks of code can be inefficient, since many conditional moves may need to be introduced.
• To remedy this inefficiency, some architectures support full predication, whereby the execution of all instructions is controlled by a predicate.
• When the predicate is false, the instruction becomes a no-op. Full predication allows us to simply convert large blocks of code that are branch dependent.
• Predicated instructions can also be used to speculatively move an instruction that is time critical but might cause an exception if moved above a guarding branch.

Example
• Here is a code sequence for a two-issue superscalar that can issue a combination of one memory reference and one ALU operation, or a branch by itself, every cycle:

    First instruction slot        Second instruction slot
    LW    R1, 40(R2)              ADD  R3, R4, R5
                                  ADD  R6, R3, R7
    BEQZ  R10, L
    LW    R8, 0(R10)
    LW    R9, 0(R8)

• This sequence wastes a memory operation slot in the second cycle and will incur a data dependence stall if the branch is not taken, since the second LW after the branch depends on the prior load. Show how the code can be improved using a predicated form of LW.

Predicate instructions • When we convert an entire code segment to predicated execution or speculatively move an instruction and make it predicated, we remove a control dependence. • Correct code generation and the conditional execution of predicated instructions ensure that we maintain the data flow enforced by the branch. • To ensure that the exception behavior is also maintained, a predicated instruction must not generate an exception if the predicate is false.

Complication and Disadvantage
• The major complication in implementing predicated instructions is deciding when to annul an instruction.
• Predicated instructions may be annulled either during instruction issue or later in the pipeline, before they commit any results or raise an exception.
• Each choice has a disadvantage:
  – If predicated instructions are annulled early in the pipeline, the value of the controlling condition must be known early to prevent a stall for a data hazard.
  – Since data-dependent branch conditions, which tend to be less predictable, are candidates for conversion to predicated execution, this choice can introduce data hazard stalls.

Complication of Predication
• Because of this potential for data hazard stalls, no design with predicated execution (or conditional moves) annuls instructions early.
• Instead, all existing processors annul instructions later in the pipeline, which means that annulled instructions consume functional unit resources and can have a negative impact on performance.
• A variety of other pipeline implementation techniques, such as forwarding, interact with predicated instructions, further complicating the implementation.



Factors Limiting Predicated or Conditional Instructions
• Predicated or conditional instructions are extremely useful for implementing short alternative control flows, for eliminating some unpredictable branches, and for reducing the overhead of global code scheduling. Nonetheless, their usefulness is limited by several factors:
  – Predicated instructions that are annulled (i.e., whose conditions are false) still take some processor resources.
  – An annulled predicated instruction requires fetch resources at a minimum, and in most processors functional unit execution time as well.
  – Therefore, moving an instruction across a branch and making it conditional will slow the program down whenever the moved instruction would not have been executed normally.

Factors Limiting Predicated or Conditional Instructions (continued)
  – Likewise, predicating a control-dependent portion of code and eliminating a branch may slow down the processor if that code would not have been executed.
  – An important exception to these situations occurs when the cycles used by the moved instruction would have been idle anyway.
  – Moving an instruction across a branch or converting a code segment to predicated execution is essentially speculating on the outcome of the branch.
  – Conditional instructions make this easier but do not eliminate the execution time taken by an incorrect guess.
  – In simple cases, where we trade a conditional move for a branch and a move, using conditional moves is usually worthwhile.

Factors Limiting Predicated or Conditional Instructions (continued)
  – When longer code sequences are made conditional, the benefits are more limited.
  – Predicated instructions are most useful when the predicate can be evaluated early.
  – If the condition evaluation and the predicated instructions cannot be separated (because of a data dependence in determining the condition), then a conditional instruction may result in a stall for a data hazard.
  – With branch prediction and speculation, such stalls can be avoided, at least when the branches are predicted accurately.

Factors Limiting Predicated or Conditional Instructions (continued)
  – The use of conditional instructions can be limited when the control flow involves more than a simple alternative sequence.
  – For example, moving an instruction across multiple branches requires making it conditional on both branches, which requires two conditions to be specified or requires additional instructions to compute the controlling predicate.
  – If such capabilities are not present, the overhead of if-conversion will be larger, reducing its advantage.
  – Conditional instructions may have some speed penalty compared with unconditional instructions.
  – This may show up as a higher cycle count for such instructions or a slower clock rate overall.
  – If conditional instructions are more expensive, they will need to be used judiciously.

Compiler Speculation with Hardware Support
• Many programs have branches that can be accurately predicted at compile time, either from the program structure or by using a profile.
• In such cases, the compiler may want to speculate, either to improve the scheduling or to increase the issue rate.
• Predicated instructions provide one method to speculate, but they are really more useful when control dependences can be completely eliminated by if-conversion.
• In many cases, we would like to move speculated instructions not only before the branch, but before the condition evaluation, and predication cannot achieve this.
