Pipeline
A substantial improvement in performance can be achieved by overlapping the execution of basic steps; this technique is called pipelining.
Example: Suppose that we want to perform the combined multiply and add operation

    Ai * Bi + Ci    for i = 1, 2, 3, ..., 7

with a stream of numbers.
Each suboperation is implemented in a segment within a pipeline. The suboperations performed in each segment of the pipeline are as follows:

Segment 1:  R1 ← Ai, R2 ← Bi          (input Ai and Bi)
Segment 2:  R3 ← R1 * R2, R4 ← Ci     (multiply and input Ci)
Segment 3:  R5 ← R3 + R4              (add Ci to the product)
(Figure: pipeline for Ai * Bi + Ci. Ai and Bi enter registers R1 and R2; the multiplier output is latched into R3 while Ci is loaded into R4; the adder output is latched into R5.)
Contents of the registers in the pipeline:

Clock pulse | Segment 1  | Segment 2      | Segment 3
number      | R1    R2   | R3       R4    | R5
1           | A1    B1   | -        -     | -
2           | A2    B2   | A1*B1    C1    | -
3           | A3    B3   | A2*B2    C2    | A1*B1 + C1
4           | A4    B4   | A3*B3    C3    | A2*B2 + C2
5           | A5    B5   | A4*B4    C4    | A3*B3 + C3
6           | A6    B6   | A5*B5    C5    | A4*B4 + C4
7           | A7    B7   | A6*B6    C6    | A5*B5 + C5
8           | -     -    | A7*B7    C7    | A6*B6 + C6
9           | -     -    | -        -     | A7*B7 + C7
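The register transfers in this table can be reproduced with a short simulation. The following Python sketch (not from the original text; the operand values are made up) prints the contents of R1 through R5 at each clock pulse:

    # Simulate the three-segment pipeline for A_i * B_i + C_i, i = 1..7.
    A = [1, 2, 3, 4, 5, 6, 7]
    B = [10, 20, 30, 40, 50, 60, 70]
    C = [100, 200, 300, 400, 500, 600, 700]
    n = len(A)

    R1 = R2 = R3 = R4 = R5 = None  # interface registers, initially empty
    for clock in range(1, n + 3):  # k + (n - 1) = 3 + 6 = 9 clock pulses
        # All transfers occur on the same clock edge, so the new values are
        # computed from the old register contents before overwriting them.
        new_R5 = R3 + R4 if R3 is not None else None            # segment 3: add
        new_R3 = R1 * R2 if R1 is not None else None            # segment 2: multiply
        new_R4 = C[clock - 2] if 2 <= clock <= n + 1 else None  # segment 2: input C_i
        new_R1 = A[clock - 1] if clock <= n else None           # segment 1: input A_i
        new_R2 = B[clock - 1] if clock <= n else None           # segment 1: input B_i
        R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
        print(f"pulse {clock}: R1={R1} R2={R2} R3={R3} R4={R4} R5={R5}")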
Example: Suppose n tasks are to be executed on a k-segment pipeline with a clock cycle time of tp. The first task T1 requires a time equal to k·tp to complete its operation, since it passes through all k segments. The remaining n − 1 tasks emerge from the pipe at the rate of one task per clock cycle, so they are completed after a time equal to (n − 1)·tp. Therefore, to complete n tasks using a k-segment pipeline requires k + (n − 1) clock cycles.
SPEEDUP
The speedup of pipelined processing over equivalent non-pipelined processing is defined by the ratio

    S = n·tn / ((k + n − 1)·tp)

where tn is the time required to complete one task without pipelining. As the number of tasks increases, n becomes much larger than k − 1, and k + n − 1 approaches the value of n. Under this condition the speedup becomes

    S = tn / tp

If we assume that the time it takes to process a task is the same in the pipelined and non-pipelined circuits, we have tn = k·tp. With this assumption the speedup reduces to

    S = k·tp / tp = k

This shows that the theoretical maximum speedup a pipeline can provide is k, where k is the number of segments in the pipeline.

For example: there are 4 segments with t1 = 60 ns, t2 = 70 ns, t3 = 100 ns, t4 = 80 ns, and the interface registers have a delay of tr = 10 ns.
Using the pipeline: the clock cycle is chosen to be tp = t3 + tr = 100 + 10 = 110 ns.
Using no pipeline: tn = t1 + t2 + t3 + t4 + tr = 60 + 70 + 100 + 80 + 10 = 320 ns.
The pipelined adder therefore has a speedup of 320/110 = 2.9.
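The arithmetic of this example is easy to verify programmatically. This Python sketch (illustrative only; the task count n is an assumption) evaluates tp, tn, and the speedup formula:

    # Worked example: four segments with delays 60, 70, 100, 80 ns and a
    # 10 ns interface-register delay.
    delays = [60, 70, 100, 80]         # segment delays t1..t4 (ns)
    tr = 10                            # interface register delay (ns)

    tp = max(delays) + tr              # pipeline clock cycle: 110 ns
    tn = sum(delays) + tr              # non-pipelined time per task: 320 ns

    k, n = len(delays), 100            # k segments, e.g. n = 100 tasks
    S = (n * tn) / ((k + n - 1) * tp)  # speedup formula
    print(f"tp = {tp} ns, tn = {tn} ns")
    print(f"speedup for n = {n}: {S:.2f}; limit tn/tp = {tn / tp:.2f}")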
For example: If a pipeline executes many tasks, the overhead of start-up and drainage can be ignored and the effective throughput of the pipeline taken to be 1 task per cycle. In terms of instruction execution, this corresponds to an effective CPI (cycles per instruction) of 1.
Linear Pipeline Processor
Linear pipelines are applied for instruction execution, arithmetic computation, and memory-access operations. A linear pipeline processor is constructed with k processing stages. External inputs (operands) are fed into the pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1 for all i = 1, 2, ..., k − 1. The final result emerges from the pipeline at the last stage Sk. Depending on the control of data flow along the pipeline, we model linear pipelines in two categories:
1. Asynchronous
2. Synchronous
Asynchronous Model:
1. Data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking protocol.
2. When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.
3. Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers.
4. Asynchronous pipelines may have a variable throughput rate.
5. Different amounts of delay may be experienced in different stages.

Synchronous Model:
1. Clocked latches are used to interface between stages.
2. The latches are made with master-slave flip-flops, which can isolate inputs from outputs.
3. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
The pipeline stages are combinational logic circuits. It is desirable to have approximately equal delays in all stages. These delays determine the clock period and thus the speed of the pipeline.

Clocking and Timing Control
The clock cycle τ of a pipeline is determined as follows. Let τi be the time delay of the circuitry in stage Si and d the time delay of a latch, as shown in Fig. 6.1b.

Clock Cycle and Throughput
Denote the maximum stage delay as τm; then we can write τ as

    τ = max{τi : 1 ≤ i ≤ k} + d = τm + d

At the rising edge of the clock pulse, the data is latched into the master flip-flops of each latch register. The clock pulse has a width equal to d. In general, τm >> d by one to two orders of magnitude, which implies that the maximum stage delay τm dominates the clock period. The pipeline frequency is defined as the inverse of the clock period:

    f = 1/τ
If one result is expected to come out of the pipeline per cycle, f represents the maximum throughput of the pipeline.

Speedup, Efficiency, and Throughput
Ideally, a linear pipeline of k stages can process n tasks in k + (n − 1) clock cycles: k cycles are needed to complete the execution of the very first task, and the remaining n − 1 tasks require n − 1 more cycles. Thus the total time required is

    Tk = [k + (n − 1)]·τ

where τ is the clock period. Consider an equivalent-function nonpipelined processor which has a flow-through delay of kτ. The amount of time it takes to execute n tasks on this nonpipelined processor is T1 = n·k·τ.

Speedup Factor
The speedup factor of a k-stage pipeline over an equivalent nonpipelined processor is defined as

    Sk = T1 / Tk = n·k·τ / ((k + n − 1)·τ) = n·k / (k + n − 1)
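A small numeric sketch ties these formulas together. The stage delays, latch delay, and task count in the Python below are assumptions chosen for illustration, not values from the text:

    # Clock period, frequency, and speedup factor for a k-stage pipeline.
    stage_delays = [9.0, 7.5, 10.0, 8.0]  # tau_i for stages S1..S4 (ns)
    d = 1.0                               # latch delay (ns)
    k = len(stage_delays)

    tau = max(stage_delays) + d           # tau = tau_m + d = 11 ns
    f = 1.0 / tau                         # maximum throughput (results per ns)

    n = 1000                              # number of tasks
    Sk = (n * k) / (k + n - 1)            # speedup factor, approaches k
    print(f"tau = {tau} ns, f = {f:.4f} results/ns, Sk = {Sk:.2f}")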
Optimal Number of Stages
The optimal choice of the number of pipeline stages should maximize a performance/cost ratio. Let t be the total time required for a nonpipelined sequential program of a given function. To execute the same program on a k-stage pipeline with an equal flow-through delay t, one needs a clock period of p = t/k + d, where d is the latch delay. Thus, the pipeline has a maximum throughput of f = 1/p = 1/(t/k + d). The total pipeline cost is roughly estimated by c + k·h, where c covers the cost of all logic stages and h represents the cost of each latch. A pipeline performance/cost ratio (PCR) has been defined by Larson (1973):

    PCR = f / (c + k·h) = 1 / ((t/k + d)·(c + k·h))

Figure 2 plots the PCR as a function of k. The peak of the PCR curve corresponds to the optimal choice for the number of pipeline stages:

    k0 = √(t·c / (d·h))

where t is the total flow-through delay of the pipeline. The total stage cost c, the latch delay d, and the latch cost h can be adjusted to achieve the optimal value k0.
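The PCR and the optimal stage count can be evaluated directly. The parameter values in this Python sketch (t, d, c, h) are made up purely for illustration:

    from math import sqrt

    t = 320.0   # total flow-through delay (ns)
    d = 10.0    # latch delay (ns)
    c = 500.0   # cost of all logic stages (arbitrary units)
    h = 50.0    # cost of each latch (arbitrary units)

    def pcr(k):
        """PCR = f / (c + k*h) = 1 / ((t/k + d) * (c + k*h))."""
        return 1.0 / ((t / k + d) * (c + k * h))

    k0 = sqrt((t * c) / (d * h))   # optimal number of stages
    print(f"k0 = {k0:.2f}")        # about 17.9 for these values
    print(f"PCR at k = 4: {pcr(4):.2e}, at k = 18: {pcr(18):.2e}")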
Nonlinear Pipeline Processor
The traditional linear pipelines are static pipelines because they are used to perform fixed functions.
1. A dynamic pipeline can be reconfigured to perform variable functions at different times.
2. A dynamic pipeline allows feedforward and feedback connections in addition to the streamline connections. For this reason, some authors call such a structure a nonlinear pipeline.
Pipeline conflicts: There are three major difficulties that cause the instruction pipeline to deviate from its normal operation:
1. Resource conflicts
2. Data dependency
3. Branch difficulties
Resource conflict: A resource conflict is caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
Data dependency: A data dependency arises when an instruction depends on the result of a previous instruction, but this result is not yet available.
Branch difficulties: Branch difficulties arise from branch and other instructions that change the value of the PC.
Example:
Timing of the instruction pipeline when instruction 3 is a branch (FI = fetch instruction, DA = decode and calculate effective address, FO = fetch operand, EX = execute):

Step:              1   2   3   4   5   6   7   8   9   10
Instr 1:           FI  DA  FO  EX
Instr 2:               FI  DA  FO  EX
Instr 3 (branch):          FI  DA  FO  EX
Instr 4:                       -   -   FI  DA  FO  EX
Instr 5:                               -   FI  DA  FO  EX

Instructions 4 and 5 cannot proceed (marked -) until the branch is resolved in the EX step of instruction 3; fetching then resumes from the branch target.
Hazard:
1. In a pipelined system, instructions Ik+1, Ik+2 may be started and completed before instruction Ik is completed.
2. This difference can cause problems if not properly considered in the design of the control.
3. The existence of such dependencies causes what is called a "hazard".
4. The hardware technique that detects and resolves hazards is called an interlock.
Types of hazards:
1. Instruction hazards
2. Data hazards
3. Branching hazards
4. Structural hazards
1. INSTRUCTION HAZARDS:
1. An instruction RAW hazard occurs when the instruction to be initiated in the pipeline is fetched from a location that is yet to be updated by some uncompleted instruction in the pipeline.
2. To handle this hazard, a centralized controller is required to keep the address range sets for the instructions inside the pipeline.
3. On every instruction fetch, the PC must be compared for a possible match with the addresses in the address range sets used by the subsequent stages. A match with any of the addresses means there is an instruction RAW hazard.
Types of data hazards:
1. RAW (read after write) or true dependency hazard
2. WAR (write after read) or anti-dependency hazard
3. WAW (write after write) or output dependency hazard
We demonstrate the three hazards with the help of a sample program:

Instruction no.   Code
1.                STORE R4, A
2.                SUB   R3, A
3.                STORE R3, A
4.                ADD   R3, A
5.                STORE R3, A
RAW hazard: In the above code, the second instruction must use the value of A updated by the first instruction. If the second instruction (SUB) reads the value of A before instruction 1 has a chance to update it, a wrong value will be used by the CPU. This situation is called a RAW hazard.
WAR hazard: A WAR hazard between two instructions i and j occurs when instruction j attempts to write to some object that is being read by instruction i. A WAR hazard exists between instructions 2 and 3, since an attempt by instruction 3 to record a value in A before instruction 2 has read the old value is clearly wrong.
WAW hazard: A WAW hazard between two instructions i and j occurs when instruction j attempts to write to some object that is also to be modified by instruction i. A WAW hazard occurs between instructions 3 and 5, since an attempt by instruction 5 to store into A before the store of instruction 3 is clearly incorrect.
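These three hazards can be detected mechanically by intersecting the read and write sets of instruction pairs. The Python sketch below applies such a check to the sample program; the exact operands each instruction reads are assumptions, and the pairwise test deliberately ignores intervening writes:

    # Each entry: (text, write set, read set). Memory location A is treated
    # like a register name for the purpose of the check.
    program = [
        ("STORE R4,A", {"A"},  {"R4"}),  # 1
        ("SUB   R3,A", {"R3"}, {"A"}),   # 2
        ("STORE R3,A", {"A"},  {"R3"}),  # 3
        ("ADD   R3,A", {"R3"}, {"A"}),   # 4
        ("STORE R3,A", {"A"},  {"R3"}),  # 5
    ]

    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            _, wi, ri = program[i]
            _, wj, rj = program[j]
            if wi & rj:   # j reads what i writes
                print(f"RAW hazard between {i + 1} and {j + 1} on {wi & rj}")
            if ri & wj:   # j writes what i reads
                print(f"WAR hazard between {i + 1} and {j + 1} on {ri & wj}")
            if wi & wj:   # both write the same object
                print(f"WAW hazard between {i + 1} and {j + 1} on {wi & wj}")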
2. DATA HAZARDS:
Data hazards occur when data is modified. Ignoring potential data hazards can result in race conditions. There are three situations in which a data hazard can occur:

1. RAW (read after write): an operand is modified and read soon after.
Example:
Instr 1: R3 ← R1 + R2
Instr 2: R5 ← R3 + R2
The first instruction calculates a value to be saved in register R3, and the second is going to use this value to compute a result for register R5.

2. WAR (write after read):
Example:
Instr 1: R1 ← R2 + R3
Instr 2: R3 ← R4 + R5
We must ensure that instruction 2 does not store its result in register R3 before instruction 1 has had a chance to fetch its operands.

3. WAW (write after write): two instructions that write the same operand are performed.
For example:
Instr 1: R2 ← R1 + R3
Instr 2: R2 ← R4 + R5
We must delay the WB (write back) of instruction 2 until instruction 1 has completed execution.

Eliminating hazards: There are several established techniques for preventing hazards:
1. Bubbling the pipeline (pipeline break or pipeline stall)
2. Eliminating data hazards (e.g., through data forwarding)
1. Pipeline stall/bubbles:
Data dependency in pipelines (execution of one instruction depending on completion of a previous instruction) can cause pipeline stalls, which diminish performance. Two common cases are:
Read-after-compute: a register is accessed after being updated with a computed value.
Read-after-load: a register is accessed after being updated with data from memory.
As an instruction is fetched, control logic determines whether a hazard could or will occur. If so, the control logic inserts NOPs (bubbles) into the pipeline. Thus, before the next instruction is executed, the previous one has sufficient time to complete, preventing the hazard.

Bubble insertion:
1. Insertion of bubbles in a pipeline implies reduced throughput.
2. An objective of pipeline designers is to minimize the instances of bubble insertion.
3. An intelligent compiler can also help mitigate the performance penalty.
This is shown in the figure, where the third instruction uses the value that the second instruction loads into register $8. Without data forwarding, the third instruction must read register $8 in cycle 7. This data dependency problem can be solved by bubble insertion or via data forwarding. (sw = store word, lw = load word)
Data Forwarding:
Although the result of the second instruction is not yet stored in register $8 by the time the third instruction needs it, the result is in fact available at the output of the ALU. Thus, the needed value can be passed (via a bypass path) from the second instruction to the third one.
For example: say we want to write the value 3 to register 1, and then add 7 to register 1 and store the result in register 2:
Instr 0: register 1 = 6 (initial value)
Instr 1: register 1 = 3
Instr 2: register 2 = register 1 + 7 = 10
1. Following execution, register 2 should contain the value 10.
2. However, if instruction 1 does not completely exit the pipeline before instruction 2 starts execution, register 1 does not yet contain the new value when instruction 2 performs its addition.
3. In such an event, instruction 2 adds 7 to the old value of register 1, and register 2 would contain 13.
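The arithmetic of this example is mimicked by the following Python sketch; it is only an illustration of the idea that the bypass path supplies the ALU output directly:

    regs = {1: 6, 2: 0}        # register file: R1 initially holds 6

    # Instr 1: R1 <- 3. The result appears at the ALU output immediately,
    # but is written back to the register file only some cycles later.
    alu_output = 3

    # Instr 2: R2 <- R1 + 7, issued before instr 1 has written back.
    without_forwarding = regs[1] + 7   # stale R1 = 6, gives 13 (wrong)
    with_forwarding = alu_output + 7   # bypass path, gives 10 (correct)

    print(f"without forwarding: R2 = {without_forwarding}")
    print(f"with forwarding:    R2 = {with_forwarding}")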
Operand hazard:
1. Logic to detect hazards in operand fetches can be worked out as follows from the definitions of the range and domain sets.
2. We can devise a mechanism to keep track of the domain and range sets for the various instructions passing through the pipeline stages.
3. Each stage will be associated with one set of storage registers for the domain-set addresses and another for the range-set addresses.
4. This storage is required for all stages beyond the decode stage.
5. When an instruction moves from stage to stage, it also carries its range and domain set information.
Scalar processor: Scalar processors represent the simplest class of microprocessors. A scalar processor processes one data item at a time.
Vector processor: A vector processor, or array processor, is a central processing unit design that can run mathematical operations on multiple data elements simultaneously.
4.2.1 Superscalar Processor
1. Superscalar processors are designed to exploit more instruction-level parallelism in user programs.

Pipelining in Superscalar Processors
The fundamental structure of a superscalar pipeline is illustrated in Fig. 4. A superscalar processor of degree m can issue m instructions per cycle.
(Figure: superscalar pipeline design; multiple instructions flow through the IF, ID, EX, MEM, and WB stages of parallel pipelines in overlapped fashion.)

2. In the design of Fig. 28a, the processor can issue two instructions per cycle if there is no resource conflict and no data dependence problem.
3. For simplicity, we assume that each pipeline stage requires one cycle, except the execute stage, which may require a variable number of cycles.
4. Four functional units (multiplier, adder, logic unit, and load unit) are available for use in the execute stage.
5. The multiplier itself has three pipeline stages, the adder has two stages, and the others each have only one stage.
Lookahead window: There is a lookahead window with its own fetch and decoding logic. This window is used for instruction lookahead in case out-of-order instruction issue is desired to achieve better pipeline throughput.

Limitations
The available performance improvement from superscalar techniques is limited by two key areas:
a) The degree of intrinsic parallelism in the instruction stream, i.e., the limited amount of instruction-level parallelism.
b) The complexity and time cost of the dispatcher and the associated dependency-checking logic.
Data Dependences: Consider the example program in Fig. 28b. A dependence graph is drawn to indicate the relationships among the instructions.
1. Because the register content of R1 is loaded by I1 and then used by I2, we have a flow dependence: I1 → I2.
2. Because the result in register R4 after executing I4 may affect the operand register R4 used by I3, we have an antidependence: I3 → I4.
3. Since both I5 and I6 modify register R6, and R6 supplies an operand for I6, we have both a flow dependence and an output dependence: I5 → I6 (flow) and I5 → I6 (output), as shown in the dependence graph.
Pipeline Stalling:
1. This is a problem which may seriously lower pipeline utilization.
2. Proper scheduling avoids pipeline stalling.
3. Stalling can be caused by data dependences or by resource conflicts among instructions already in the pipeline or about to enter the pipeline.
We use an example to illustrate the conditions causing pipeline stalling. Consider the scheduling of two instruction pipelines in a two-issue superscalar processor. Figure 29a shows the case of no data dependence on the left and a flow dependence (I1 → I2) on the right. Without data dependence, all pipeline stages are utilized without idling.
With the dependence, instruction I2 entering the second pipeline must wait for two cycles (shaded time slots) before entering the execution stages. This delay may also pass to the next instruction I4 entering the pipeline. In Fig. 29b, we show the effect of branching (instruction I2). A delay slot of four cycles results from a branch taken by I2 at cycle 5. Therefore, both pipelines must be flushed before the target instructions I3 and I4 can enter the pipelines from cycle 6.
In Fig. 29c, we show a combined problem involving both a resource conflict and data dependence. Instructions I1 and I2 need to use the same functional unit, and the dependence I2 → I4 exists.
The net effect is that I2 must be scheduled one cycle behind, because the two pipeline stages (e1 and e2) of the same functional unit must be used by I1 and I2 in an overlapped fashion. For the same reason, I3 is also delayed by one cycle. Instruction I4 is delayed by two cycles due to the flow dependence on I2. The shaded boxes in all the timing charts correspond to idle stages.

Multipipeline Scheduling: Instruction issue and completion policies are critical to superscalar processor performance. Three scheduling policies are introduced below.
In-order issue: when instructions are issued in program order, we call it in-order issue.
Out-of-order issue: when program order is violated, out-of-order issue is being practiced.
In-order completion: if the instructions must be completed in program order, it is called in-order completion.
Out-of-order completion: otherwise, completion is out of order.
In-order issue is easier to implement but may not yield optimal performance. In-order issue may result in either in-order or out-of-order completion. Out-of-order issue usually ends up with out-of-order completion. The purpose of out-of-order issue and completion is to improve performance.

In-Order Issue: Figure 30a shows a schedule for the six instructions being issued in program order I1, I2, ..., I6. Pipeline 1 receives I1, I3, and I5, and pipeline 2 receives I2, I4, and I6 in three consecutive cycles. Due to I1 → I2, I2 has to wait one cycle to use the data loaded in by I1.
I3 is delayed one cycle for the same adder used by I2. I6 has to wait for the result of I5 before it can enter the multiplier stages. In order to maintain in-order completion, I5 is forced to wait two cycles before coming out of pipeline 1. In total, nine cycles are needed and five idle cycles (shaded boxes) are observed.

In Fig. 30b, out-of-order completion is allowed even though in-order issue is practiced. The only difference between this out-of-order schedule and the in-order schedule is that I5 is allowed to complete ahead of I3 and I4, which are totally independent of I5. The total execution time does not improve; however, the pipeline utilization rate does. Only three idle cycles are observed. Note that in Figs. 30a and 30b, we did not use the lookahead window. In order to shorten the total execution time, the window can be used to reorder the instruction issues.

Out-of-Order Issue: By using the lookahead window, instruction I5 can be decoded in advance because it is independent of all the other instructions. The six instructions are issued in three cycles as shown: I5 is fetched and decoded by the window, while I3 and I4 are decoded concurrently. This is followed by issuing I6 and I1 at cycle 2, and I2 at cycle 3. Because the issue is out of order, the completion is also out of order, as shown in Fig. 30c. Now the total execution time has been reduced to seven cycles, with no idle stages during the execution of these six instructions.

In-order issue and completion is the simplest policy to implement. It is rarely used today, even in a conventional scalar processor, due to some unnecessary delays in maintaining program order. However, in a multiprocessor environment, this policy is still attractive. Out-of-order completion can be found in both scalar and superscalar processors. Simple data dependence checking, a small lookahead window, and scoreboarding mechanisms are needed, along with an optimizing compiler, to exploit instruction parallelism in a superscalar processor.
Definitions:
• Operation latency (OL): the number of cycles until the result of an instruction is available.
• (Instruction) issue latency (IL): the number of cycles required between issuing two consecutive instructions.
• (Instruction) issue rate or issue parallelism (IP): the maximum number of instructions that can be issued in a cycle (the degree).
• Maximum instruction-level parallelism (MILP): the maximum number of simultaneously executing instructions.

Base scalar performance:
1. OL = 1 cycle (one-cycle latency for a simple operation)
2. IL = 1 cycle (a base scalar processor issues one instruction per cycle)
3. IP = 1 instruction/cycle
4. MILP = 1 · k = k
The instruction pipeline can be fully utilized if successive instructions can enter it continuously at the rate of one per cycle, as shown in Fig. 1.

Time required to execute N instructions:

    T(1,1) = k + (N − 1) = 4 + (10 − 1) = 13 base cycles

(for k = 4 stages and N = 10 instructions).
Superscalar Performance:
To compare the relative performance of a superscalar processor with that of a scalar base machine, we estimate the ideal execution time of N independent instructions through the pipeline. The time required by the scalar base machine is

    T(1,1) = k + N − 1 (base cycles)

For an m-issue superscalar machine:
• OL (operation latency) = 1 cycle
• IL (instruction issue latency) = 1 cycle
• IP (instruction issue rate) = m instructions/cycle (the degree)
• MILP (maximum number of simultaneously executing instructions) = m·k
The ideal execution time required by an m-issue superscalar machine is

    T(m,1) = k + (N − m)/m (base cycles)

For example, with k = 4, N = 12, and m = 3: T(3,1) = 4 + (12 − 3)/3 = 7 base cycles.

Here k is the time required to execute the first m instructions through the m pipelines simultaneously, and the second term corresponds to the time required to execute the remaining N − m instructions, m per cycle, through m pipelines. The ideal speedup of the superscalar machine over the base machine is

    S(m,1) = T(1,1)/T(m,1) = (N + k − 1)/(N/m + k − 1) = m(N + k − 1)/(N + m(k − 1))
As N → ∞, the speedup limit S(m,1) → m, as expected.

Superpipelined Design
In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle. As a comparison, while a fixed-point add takes one cycle in the base scalar processor, the same operation takes n short cycles in a superpipelined processor implemented with the same technology. Superpipelining is not possible without a high-speed clocking mechanism.
• OL (operation latency) = n minor cycles (a single operation takes n pipeline cycles, equivalent to 1 baseline cycle; the pipeline cycle time is 1/n of the base cycle)
• IL (instruction issue latency) = 1 minor cycle = 1/n baseline cycle
• IP (issue rate) = 1 instruction/minor cycle = n instructions/baseline cycle
• MILP (maximum number of simultaneously executing instructions) = n·k
Superpipeline Performance:
The minimum time required to execute N instructions on a superpipelined machine of degree n with k stages in the pipeline is

    T(1,n) = k + (N − 1)/n (base cycles)

Thus, the potential speedup of a superpipelined machine over the base machine is

    S(1,n) = T(1,1)/T(1,n) = (k + N − 1)/(k + (N − 1)/n) = n(k + N − 1)/(nk + N − 1)
The speedup S(1,n) → n as N → ∞.

Superpipelined Superscalar Design:
The superpipeline and superscalar approaches can be combined to build so-called superpipelined superscalar processors. A superpipelined superscalar processor of degree (m,n) is shown below with (m,n) = (3,3). This machine executes m instructions every pipeline cycle, with a pipeline cycle that is 1/n of the base cycle. A simple operation latency is n pipeline cycles. The level of parallelism required to fully utilize this machine is mn instructions.
• OL = n minor cycles (a single operation takes n pipeline cycles, equivalent to 1 baseline cycle)
• IL = 1 minor cycle = 1/n baseline cycle
• IP = m instructions/minor cycle = mn instructions/baseline cycle
• MILP (maximum number of simultaneously executing instructions) = n·m·k

The time required to execute N instructions is

    T(m,n) = k + (N − m)/(mn) (base cycles)
Thus the speedup over the base machine is

    S(m,n) = T(1,1)/T(m,n) = (k + N − 1)/(k + (N − m)/(mn)) = mn(k + N − 1)/(mnk + N − m)

The speedup limit S(m,n) → mn as N → ∞.
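The four execution-time models can be packaged into one small calculator. The following Python sketch (k, N, m, and n are arbitrary illustrative choices) confirms that the three speedups approach m, n, and mn for large N:

    def T_base(k, N):              # scalar base machine: T(1,1)
        return k + N - 1

    def T_superscalar(k, N, m):    # m-issue superscalar: T(m,1)
        return k + (N - m) / m

    def T_superpipeline(k, N, n):  # degree-n superpipeline: T(1,n)
        return k + (N - 1) / n

    def T_combined(k, N, m, n):    # superpipelined superscalar: T(m,n)
        return k + (N - m) / (m * n)

    k, N, m, n = 4, 1200, 3, 3
    base = T_base(k, N)
    print(f"S(m,1) = {base / T_superscalar(k, N, m):.2f} (limit {m})")
    print(f"S(1,n) = {base / T_superpipeline(k, N, n):.2f} (limit {n})")
    print(f"S(m,n) = {base / T_combined(k, N, m, n):.2f} (limit {m * n})")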
Pipeline Scheduling and Reservation Analysis
A pipeline may have a feedback structure; that is, the data paths among function units may contain directed loops, as shown in the figure (example of a feedback pipeline).
The timing information of a pipeline is fully described by a simple table called a reservation table, which specifies the function units that are busy at each clock when a task is processed without overlapping execution.

Example of a reservation table:

clock   0  1  2  3  4  5  6
unit0   X  .  .  .  X  X  .
unit1   .  X  .  .  .  .  .
unit2   .  .  X  .  .  .  .
unit3   .  .  .  X  .  .  .
unit4   .  .  .  .  .  .  X
In reservation tables, 'X' means the function unit is busy at that clock and '.' means it is not busy. Once a task enters the pipeline, it is processed by unit0 at the first clock, by unit1 at the second clock, and so on. It takes seven clock cycles to perform a task.

Example of a conflict:

clock   0  1  2  3  4  5  6  7
unit0   0  1  .  .  0  C  1  .
unit1   .  0  1  .  .  .  .  .
unit2   .  .  0  1  .  .  .  .
unit3   .  .  .  0  1  .  .  .
unit4   .  .  .  .  .  .  0  1

(The '0's and '1's in this table, other than the clock numbers in the header row, represent tasks 0 and 1, respectively, and 'C' marks the conflict.) Notice that no special hardware is provided to avoid simultaneous use of the same function unit. Therefore, a task must not be started if it would conflict with any task already being processed. For instance, if two tasks, say task 0 and task 1, were started at clock 0 and clock 1, respectively, a conflict would occur on unit0 at clock 5. This means that two tasks must not be started with a single-cycle interval. This invalid schedule is depicted in the process table above, which is obtained by overlapping two copies of the reservation table with one copy shifted to the right by 1 clock.
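The set of forbidden latencies (initiation intervals that cause a collision) can be derived from the reservation table automatically. This Python sketch encodes the table above; the forbidden-latency computation is standard pipeline-scheduling practice rather than something defined in this text:

    # Relative clocks at which each unit is busy (from the reservation table).
    table = {
        "unit0": {0, 4, 5},
        "unit1": {1},
        "unit2": {2},
        "unit3": {3},
        "unit4": {6},
    }

    forbidden = set()
    for busy in table.values():
        for a in busy:
            for b in busy:
                if a < b:
                    forbidden.add(b - a)  # distance between two uses of one unit

    print(f"forbidden latencies: {sorted(forbidden)}")  # [1, 4, 5]
    # unit0 is busy at clocks 0, 4, and 5, so starting a second task 1, 4,
    # or 5 cycles later collides on unit0; latency 1 is exactly the conflict
    # at clock 5 shown in the process table above.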
Reservation Analysis
1. In a static pipeline, it is easy to partition a given function into a sequence of linearly ordered subfunctions.
2. Function partitioning in a dynamic pipeline becomes quite involved because the pipeline stages are interconnected with loops in addition to streamline connections.

A multifunction dynamic pipeline is shown in Fig. 3a. This pipeline has three stages. Besides the streamline connections from S1 to S2 and from S2 to S3, there is a feedforward connection from S1 to S3 and two feedback connections from S3 to S2 and from S3 to S1.
With these connections, the output of the pipeline is not necessarily from the last stage. In fact, following different dataflow patterns, one can use the same pipeline to evaluate different functions.

Reservation Tables
1. The reservation table for a static linear pipeline is trivial in the sense that dataflow follows a linear streamline.
2. The reservation table for a dynamic pipeline becomes more interesting because a nonlinear pattern is followed.
Given a pipeline configuration, multiple reservation tables can be generated for the evaluation of different functions. Two reservation tables are given in Figs. 3b and 3c, corresponding to a function X and a function Y, respectively. Each function evaluation is specified by one reservation table. A static pipeline is specified by a single reservation table; a dynamic pipeline may be specified by more than one reservation table. For example, function X requires eight clock cycles to evaluate and function Y requires six cycles, as shown in Figs. 3b and 3c, respectively.

Static versus dynamic reservation tables:
1. A pipeline initiation table corresponds to each function evaluation. All initiations to a static pipeline use the same reservation table.
2. A dynamic pipeline may allow different initiations to follow a mix of reservation tables.
The Instruction Execution Unit
Instructions are executed in hardware as follows: the content of the program counter (PC) is supplied to the instruction cache, and an instruction word is read out from the specified location. Once an instruction has been read out from the instruction cache, its various fields are separated and each is dispatched to the appropriate place:
1. The op and fn fields go to the control unit, whereas rs, rt, and rd are sent to the register file.
2. rd specifies the destination register for register-type instructions, while rt designates the destination for immediate-type instructions.
3. The upper input of the ALU always comes from register rs, whereas its lower input can be either the content of rt or the immediate field of the instruction.
4. For many instructions, the output of the ALU is stored in a register; in such cases, the data cache is bypassed.
A Single-Cycle Data Path
The data path is composed of:
1. the program counter
2. the instruction cache
3. the register file
4. the ALU
5. the data cache
There are three multiplexers, used at:
1. the input to the register file
2. the lower input of the ALU
3. the outputs of the ALU and data cache

1. Input to the register file:
(i) Allows rt, rd, or $31 to be used as the index of the destination register into which a result will be written.
(ii) The selection is made by a pair of logic signals, RegDst, supplied by the control unit:
   a) RegDst = 00 selects rt
   b) RegDst = 01 selects rd
   c) RegDst = 10 selects $31
(iii) Of course, not every instruction writes a value into a register. Writing into a register requires that the RegWrite control signal be asserted by the control unit; otherwise, regardless of the state of RegDst, nothing is written into the register file.
(iv) The instruction cache does not receive any control signal, because an instruction is read out in every cycle (i.e., the clock signal serves as the read control for the instruction cache).

2. Lower input of the ALU:
a) Allows the control unit to choose either the content of rt or the 32-bit sign-extended version of the 16-bit immediate operand.
b) The first (top) input always comes from rs.
c) The choice is controlled by asserting or deasserting the control signal ALUSrc. If this signal is deasserted (has the value 0), the content of rt is used as the lower input to the ALU; otherwise, the immediate operand, sign-extended to 32 bits, is used.

3. Outputs of the ALU and data cache:
a) Allows the word supplied by the data cache, the output of the ALU, or the incremented PC value to be sent to the register file for writing; the selection is effected by a pair of control signals, RegInSrc.
b) RegInSrc is set to 00 for choosing the data cache output, 01 for the ALU output, and 10 for the incremented PC value coming from the next-address block.

This data path is capable of executing one instruction per clock cycle; hence the name "single-cycle data path". The ALUFunc control signal bundle contains 5 bits: 1 bit for adder control (Add/Sub), 2 bits for controlling the logic unit (logic function), and 2 bits for controlling the rightmost multiplexer. The ALU also produces an output signal indicating overflow in addition or subtraction. In the case of arithmetic and logic instructions, the ALU result must be stored in the destination register and is thus forwarded to the register file through the feedback path. In the case of memory-access instructions, the ALU output is a data address, used for writing into the data cache (DataWrite asserted) or reading from it (DataRead asserted). In the latter case, the data cache output, which appears after a short latency, is sent through the lower feedback path to the register file for writing.
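The behavior of the three multiplexers can be captured in a few helper functions. The sketch below follows the RegDst, ALUSrc, and RegInSrc encodings described above; the function names and example values are invented for illustration:

    def dest_register(RegDst, rt, rd):
        """Destination index: 00 -> rt, 01 -> rd, 10 -> $31."""
        return {0b00: rt, 0b01: rd, 0b10: 31}[RegDst]

    def alu_lower_input(ALUSrc, rt_value, imm16):
        """0 -> content of rt; 1 -> sign-extended 16-bit immediate."""
        if ALUSrc == 0:
            return rt_value
        return imm16 - (1 << 16) if imm16 & 0x8000 else imm16

    def register_write_data(RegInSrc, cache_out, alu_out, incr_pc):
        """00 -> data cache output, 01 -> ALU output, 10 -> incremented PC."""
        return {0b00: cache_out, 0b01: alu_out, 0b10: incr_pc}[RegInSrc]

    # e.g. an R-type ALU instruction: destination rd, ALU result written back
    print(dest_register(0b01, rt=9, rd=10))              # -> 10
    print(alu_lower_input(1, rt_value=0, imm16=0xFFFF))  # -> -1 (sign-extended)
    print(register_write_data(0b01, 0, 42, 0))           # -> 42 (ALU output)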
PIPELINED DATA PATHS
The five steps of instruction execution are:
(1) instruction fetch
(2) instruction decode and register access
(3) ALU operation
(4) data memory access
(5) register write-back
Figure 1 shows the execution of five instructions in such a pipelined fashion.

There are at least two strategies for achieving greater performance:
Strategy 1: Use multiple independent data paths that can accept several instructions read out at once; this leads to a multiple-instruction-issue or superscalar organization.
Strategy 2: Overlap the execution of several instructions in the single-cycle design, starting the next instruction before the previous one has run to completion; this leads to a pipelined or superpipelined organization.

Pipeline Timing and Performance
A q-stage pipeline ideally increases instruction execution throughput by a factor of q. This is because, without any overhead or dependency of any kind, and assuming that a task of latency t is divisible into q stages having equal latencies t/q, a new task can be initiated every t/q time units instead of every t time units without pipelining. In this section, we determine the actual performance gain by modeling the effects of overheads and dependencies.
(Figure 6: pipelined form of a function unit with latching of results between stages 1 through q; each stage has delay t/q plus the latching overhead.)
The overhead due to unequal stage latencies and the latching of signals between stages can be modeled by taking the stage delay to be t/q + r instead of the ideal t/q, as in Figure 6. Then, the increase in throughput drops from the ideal q to

    Throughput increase in a q-stage pipeline = t / (t/q + r) = q / (1 + qr/t)

So the ideal throughput enhancement factor of q can be approached, provided the cumulative pipeline latching overhead qr is considerably smaller than t. Figure 7 plots the throughput improvement as a function of q for different overhead ratios r/t. We note that the per-stage overhead ratio r/t limits the maximum performance that can be attained: as q grows, the throughput increase approaches t/r.
For example, suppose we have t = 8 ns (corresponding to the latency of the single-cycle implementation), q = 5, and qr = 2 ns. This leads to a throughput improvement factor of 5/(1 + 2/8) = 4 with our 5-stage pipeline. Note, however, that we have not yet considered the effects of data and control dependencies.
1. To keep our analysis simple, we assume that most data dependencies are resolved through data forwarding, with a single bubble inserted in the pipeline only for a read-after-load data dependency (Figure 4).
2. Similarly, a fraction of branch instructions (those for which the compiler does not succeed in filling the branch delay slot with a useful instruction) lead to the insertion of a single bubble in the pipeline.
3. In any case, no type of dependency results in the insertion of more than one bubble.
Let the fraction of all instructions that are followed by a bubble in the pipeline be β. Such bubbles effectively reduce the throughput by a factor of 1 + β, making the throughput equation

    Throughput increase with dependencies = q / ((1 + qr/t)(1 + β))
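The complete throughput model is easy to evaluate. This Python sketch plugs in the earlier example values (t = 8 ns, q = 5, qr = 2 ns) together with an assumed bubble fraction β:

    t, q = 8.0, 5
    r = 2.0 / q    # per-stage latching overhead: 0.4 ns

    def throughput_gain(q, r, t, beta=0.0):
        """Gain of a q-stage pipeline with overhead r and bubble fraction beta."""
        return q / ((1 + q * r / t) * (1 + beta))

    print(throughput_gain(q, r, t))             # 4.0, as in the example above
    print(throughput_gain(q, r, t, beta=0.2))   # about 3.33 with 20% bubbles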