Parallel Processing

The Need For High Performance Computers

Many of today's applications, such as weather prediction, aerodynamics and artificial intelligence, are very computationally intensive and require vast amounts of processing power. To calculate a 24 hour weather forecast for the UK requires about 10^12 operations to be performed. This would take about 2.7 hours on a Cray-1 (capable of 10^8 operations per second). So to give accurate long range forecasts (e.g. a week), much more powerful computers are needed.

One way of achieving this is to use faster electronic components, but here the limiting factor is the speed of light (3 * 10^8 m/s). Consider two electronic devices, each capable of performing 10^12 operations per second, placed 0.5 mm apart: it takes longer for a signal to travel between them than it takes either of them to process it (10^-12 seconds). So producing ever faster components is ultimately a dead end.

It appears that the only way forward is to use PARALLELISM. The idea is that if several operations can be performed simultaneously then the total computation time is reduced.

The parallel version has the potential of being 3 times as fast as the sequential machine.

Classification of Parallel Machines

Models of Computation (Flynn 1966)

Any computer, whether sequential or parallel, operates by executing instructions on data. A stream of instructions (the algorithm) tells the computer what to do; a stream of data (the input) is affected by these instructions. Depending on whether there is one or several of each of these streams, we have four classes of computers; there is also an additional 'pseudo-machine', SPMD.

1. Single Instruction Stream, Single Data Stream: SISD
2. Multiple Instruction Stream, Single Data Stream: MISD
3. Single Instruction Stream, Multiple Data Stream: SIMD
4. Multiple Instruction Stream, Multiple Data Stream: MIMD
5. Single Program, Multiple Data: SPMD

SISD Computers

This is the standard sequential computer. A single processing unit receives a single stream of instructions that operate on a single stream of data.

To compute the sum of N numbers a1, a2, ..., aN, the processor needs to gain access to memory N consecutive times (to receive one number each time). N-1 additions are also executed in sequence, so the computation takes O(N) steps. In other words, algorithms for SISD computers do not contain any parallelism; there is only one processor.

MISD Computers

N processors, each with its own control unit, share a common memory.

There are N streams of instructions (algorithms / programs) and one stream of data. Parallelism is achieved by letting the processors do different things at the same time on the same datum. MISD machines are useful in computations where the same input is to be subjected to several different operations.

Example: checking whether a number Z is prime. A simple solution is to try all possible divisors of Z. Assume the number of processors is N = Z - 2. All processors take Z as input and each tries to divide it by its associated divisor, so it is possible to check in one step whether Z is prime. More realistically, if N < Z - 2 then a subset of divisors would be assigned to each processor.

For most applications MISD machines are very awkward to use and no commercial machines exist with this design.

SIMD Computers

All N identical processors operate under the control of a single instruction stream issued by a central control unit. (To ease understanding, assume that each processor holds the same identical program.) There are N data streams, one per processor, so different data can be used in each processor.

The processors operate synchronously and a global clock is used to ensure lockstep operation: at each step (global clock tick) all processors execute the same instruction, each on a different datum. [SPMD operates asynchronously by running the same program on different data using an MIMD machine.]

Array processors such as the ICL DAP (Distributed Array Processor) and pipelined vector computers such as the CRAY 1 & 2 and CYBER 205 fit into the SIMD category.

SIMD machines are particularly useful for solving problems which have a regular structure, where the same instruction can be applied to subsets of the data.

Example: adding two matrices, A + B = C. Say we have two matrices A and B of order 2 and we have 4 processors.

A11 + B11 = C11    A12 + B12 = C12
A21 + B21 = C21    A22 + B22 = C22

The same instruction is issued to all 4 processors (add the two numbers) and all processors execute it simultaneously. It takes one step, as opposed to four steps on a sequential machine (a small data-parallel sketch of this example is given after this list).

An instruction could be a simple one (e.g. adding two numbers) or a complex one (e.g. merging two lists of numbers). Similarly the datum may be simple (one number) or complex (several numbers). Sometimes it may be necessary to have only a subset of the processors execute an instruction, i.e. only some data needs to be operated on for that instruction. This information can be encoded in the instruction itself, indicating whether

1. the processor is active (execute the instruction), or
2. the processor is inactive (wait for the next instruction).
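The matrix addition example can also be written in a data-parallel style. A minimal sketch using NumPy's vectorized operations, which apply one 'add' to every element pair at once; this illustrates the SIMD idea rather than the programming model of any particular array processor.

```python
import numpy as np

# The 2 x 2 matrix addition example above, in data-parallel form:
# one "add" operation is applied to all element pairs at once.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A + B          # C[i][j] = A[i][j] + B[i][j] for all i, j, conceptually in one step
print(C)
```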

In most problems to be solved on SIMD (and MIMD) computers it is useful for the processors to be able to communicate with each other to exchange data or results. This can be done in two ways: by using a shared memory and shared variables, or by using some form of interconnection network and message passing (distributed memory).

MIMD Computers (multiprocessors / multicomputers)

This is the most general and most powerful class of our classification. We have N processors, N streams of instructions and N streams of data.

Each processor operates under the control of an instruction stream issued by its own control unit, i.e. each processor is capable of executing its own program on different data. This means that the processors typically operate asynchronously, i.e. they can be doing different things on different data at the same time.

As with SIMD computers, communication of data or results between processors can be via a shared memory or an interconnection network.

MIMD computers with shared memory are known as multiprocessors or tightly coupled machines. Examples are the ENCORE MULTIMAX and the SEQUENT BALANCE.

MIMD computers with an interconnection network are known as multicomputers or loosely coupled machines. Examples are the INTEL iPSC, the NCUBE/7 and transputer networks.

Note: multicomputers are sometimes referred to as distributed systems. This is INCORRECT.

Distributed systems should, for example, refer to a network of personal workstations (such as SUNs); even though the number of processing units can be quite large, the communication in such systems is currently too slow to allow close cooperation on a single job.

SPMD - Single Program, Multiple Data

The SIMD paradigm is an example of synchronous parallelism: each processor operates in lockstep. A related asynchronous version is SPMD, in which the same program is run on the processors of an MIMD machine. SPMD is not a hardware paradigm; it is the software equivalent of SIMD. Because an entire program is executed on separate data, it is possible that different branches are taken, leading to asynchronous parallelism. The processors no longer do the same thing (or nothing) in lockstep; they are busy executing different instructions within the same program. Consider

IF X = 0 THEN S1 ELSE S2

Assume that

    X = 0 on P1
    X != 0 on P2

Now P1 executes S1 at the same time as P2 executes S2 (which could not happen on an SIMD machine).
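A minimal SPMD-flavoured sketch, assuming Python's multiprocessing as the underlying MIMD substrate (real SPMD codes would typically use MPI): every process runs the same program but takes a different branch depending on its own data, exactly the X = 0 / X != 0 situation above.

```python
from multiprocessing import Pool

def same_program(x):
    # Every process runs this same program but takes a different branch
    # depending on its own data -- the IF X = 0 THEN S1 ELSE S2 case above.
    if x == 0:
        return "S1 executed (x = 0)"
    return f"S2 executed (x = {x})"

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # P1 receives x = 0, P2 receives x = 7 (illustrative values).
        print(pool.map(same_program, [0, 7]))
```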

Fundamentals of Interprocessor Communication

This section gives an overview of the two forms of interprocessor communication; further details can be found in the next two sections, Shared Memory and Message Passing and Interconnection Networks.

Where there are N processors, each with its own individual data stream (i.e. SIMD and MIMD), it is usually necessary to communicate data or results between processors. This can be done in two main ways.

1. Using a SHARED MEMORY and SHARED VARIABLES

This consists of a global address space which is accessible by all N processors. A processor can communicate with another by writing into the global memory, where the second processor can read it.

Shared memory solves the interprocessor communication problem but introduces the problem of simultaneous access to the same location in memory.

Consider the following: x is a shared variable accessible by both P1 and P2, with initial value 0, and P1 executes x := x + 1 while P2 executes x := x + 2. Depending on the relative timing, the final result may be x = 1, x = 2 or x = 3.

1. If P1 executes and completes x := x + 1 before P2 reads the value of x from memory then x = 3; similarly, if P2 executes and completes x := x + 2 before P1 reads the value of x from memory then x = 3.

2. If P1 and P2 both read the value of x before either has updated it, then the processor which finishes last determines the value of x: if P1 finishes last then x = 1, and if P2 finishes last then x = 2.

In a multiuser, real-time environment the processor which finishes last would vary from run to run, so the final value would vary. Also, even if they finish at the same time, only one value of x can be stored in the location for x.

This gives rise to NON-DETERMINACY: a parallel program with the same input data yields different results on different runs. Non-determinacy is caused by race conditions. A race occurs when two statements in concurrent tasks access the same memory location, at least one of the accesses is a write, and there is no guaranteed execution ordering between the accesses.

The problem of non-determinacy can be solved by synchronising the use of the shared data: if x := x + 1 and x := x + 2 were mutually exclusive statements, i.e. could not be executed at the same time, then x = 3 always.
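The race just described can be reproduced with two threads. A small sketch: the sleep call is only there to widen the read-modify-write window so the lost update shows up reliably, and the lock makes the two updates mutually exclusive, giving x = 3 every time.

```python
import threading
import time

x = 0                      # shared variable, initially 0
lock = threading.Lock()    # used only in the synchronised version

def add(amount, use_lock):
    global x
    if use_lock:
        with lock:                  # mutual exclusion: the read-modify-write is atomic
            tmp = x
            time.sleep(0.01)
            x = tmp + amount
    else:
        tmp = x                     # read x ...
        time.sleep(0.01)            # ... while the other thread also reads it
        x = tmp + amount            # write back: one of the two updates is lost

for use_lock in (False, True):
    x = 0
    t1 = threading.Thread(target=add, args=(1, use_lock))   # plays the role of P1: x := x + 1
    t2 = threading.Thread(target=add, args=(2, use_lock))   # plays the role of P2: x := x + 2
    t1.start(); t2.start(); t1.join(); t2.join()
    print("with lock:" if use_lock else "without lock:", "x =", x)   # 1 or 2, then always 3
```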

Shared memory computers (e.g. SEQUENT, ENCORE) are often implemented by incorporating a fast bus to connect the processors to the memory. However, because the bus has a finite bandwidth, i.e. a finite amount of data it can carry at any instant, contention for the bus becomes a problem as the number of processors increases. So it is only feasible to allow P processors to access P memory locations simultaneously for relatively small P (< 30).

Comparison Table

Distributed Memory                        Shared Memory
------------------------------------------------------------------------
Large number of processors                Modest number of processors
(100's - 1000's)                          (10's - 100's)

High theoretical power                    Modest power

Unlimited expansion                       Limited expansion

Difficult to fully utilise                Easy (?) to fully utilise

Revolutionary parallel programming        Evolutionary parallel programming


2. Using an INTERCONNECTION NETWORK and MESSAGE PASSING

Here each processor has its own private (local) memory and there is no global, shared memory. Therefore the processors need to be connected in some way to allow communication of data to take place.

If a processor requires data contained in another processor then it must be explicitly passed, using instructions supplied for communication, e.g. send and receive operations:

P1: receive (x, P2)
P2: send (x, P1)

The value of x is explicitly passed from P2 to P1. This is known as message passing.

In addition to the extreme cases of shared memory and distributed memory there are possibilities for hybrid designs that combine features of both, e.g. clusters of processors, where a high speed bus serves for intracluster communication and an interconnection network is used for intercluster communication.
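The send/receive pair maps naturally onto message-passing primitives. A minimal sketch using a Python multiprocessing Pipe; on a real multicomputer this would be an MPI-style send/recv, but the pattern is the same.

```python
from multiprocessing import Process, Pipe

def p2(conn):
    x = 42                 # value held in P2's private (local) memory
    conn.send(x)           # send (x, P1)

def p1(conn):
    x = conn.recv()        # receive (x, P2): blocks until the message arrives
    print("P1 received x =", x)

if __name__ == "__main__":
    end1, end2 = Pipe()    # a point-to-point link between the two processes
    procs = [Process(target=p1, args=(end1,)), Process(target=p2, args=(end2,))]
    for p in procs: p.start()
    for p in procs: p.join()
```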

An example follows; further details on interconnection networks are given in a later section.

Example: Summing m Numbers, (a) using shared memory, (b) using distributed memory

On a sequential computer we have

sum := A0
FOR i = 1 TO m-1
    sum := sum + Ai
ENDFOR

i.e. [((A0 + A1) + A2) + A3] + ..., giving Theta(m) time. Is addition inherently sequential? If we have N processors then each processor can calculate the sum of m/N numbers, and the sum of these partial sums then gives the final sum.

Using Shared Memory

The m numbers are held in the global shared memory.

Global_sum := 0
FOR all Pi WHERE 0 <= i <= N-1
    Local_sum := 0
    Calculate the local sum of m/N numbers
    LOCK
        Global_sum := Global_sum + Local_sum
    UNLOCK
ENDFOR

Since Global_sum is a shared variable, each processor must have mutually exclusive access to it; otherwise the final answer may be incorrect. The running time (algorithm time complexity) is now Theta(m/N + N) + S, where m/N comes from adding m/N numbers in parallel, N comes from adding the N partial sums in sequence, and S is any time required for synchronization.
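A sketch of this shared-memory sum, assuming Python's multiprocessing: a shared Value plays the role of Global_sum and a Lock provides the LOCK/UNLOCK mutual exclusion.

```python
from multiprocessing import Process, Value, Lock

def worker(chunk, global_sum, lock):
    local_sum = sum(chunk)           # each processor sums its own m/N numbers
    with lock:                       # LOCK ... UNLOCK: mutually exclusive update
        global_sum.value += local_sum

if __name__ == "__main__":
    m, N = 1000, 4
    data = list(range(m))                        # the m numbers
    global_sum = Value('d', 0.0)                 # shared variable in "global memory"
    lock = Lock()
    chunks = [data[i::N] for i in range(N)]      # roughly m/N numbers per processor
    procs = [Process(target=worker, args=(c, global_sum, lock)) for c in chunks]
    for p in procs: p.start()
    for p in procs: p.join()
    print(global_sum.value, sum(data))           # both are 499500
```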

Using Distributed Memory

Say we have a "square mesh" of processors.

Each processor finds the local sum of its m/N numbers. Then each processor passes its local sum to another processor (at the correct time) until finally the global sum is contained in processor P11.

FORALL Pij                                -- all processors active
    FIND local_sum of m/N numbers
ENDFOR

1. FORALL Pij, j = sqrt(N) DOWNTO 2       -- all column j processors
       Pij passes its local_sum to Pi,j-1
       Pi,j-1 calculates local_sum = new_local_sum + old_local_sum
   ENDFOR

2. FORALL Pi1, i = sqrt(N) DOWNTO 2       -- one processor in column 1 active at each step
       Pi1 passes its local_sum to Pi-1,1
       Pi-1,1 calculates local_sum = new_local_sum + old_local_sum
   ENDFOR

There are (sqrt(N) - 1) + (sqrt(N) - 1) additions and communications, so the total time complexity is Theta(m/N + 2 sqrt(N) - 2 + C), where C is the time taken for communication.
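The two reduction stages on the square mesh can be simulated to check the step count. A sketch, assuming N is a perfect square (here N = 16, so there are 2*sqrt(N) - 2 = 6 combining steps):

```python
import math

def mesh_sum(data, N):
    # Simulate summing on a sqrt(N) x sqrt(N) mesh of processors.
    side = int(math.isqrt(N))
    assert side * side == N, "N must be a perfect square for a square mesh"

    # Local sums: processor P[i][j] holds the sum of its share of the numbers.
    chunks = [data[k::N] for k in range(N)]
    local = [[sum(chunks[i * side + j]) for j in range(side)] for i in range(side)]

    steps = 0
    # Stage 1: each row reduces towards column 1.
    for j in range(side - 1, 0, -1):
        for i in range(side):
            local[i][j - 1] += local[i][j]      # Pij passes its sum to Pi,j-1
        steps += 1
    # Stage 2: column 1 reduces towards P11.
    for i in range(side - 1, 0, -1):
        local[i - 1][0] += local[i][0]          # Pi1 passes its sum to Pi-1,1
        steps += 1

    return local[0][0], steps                   # global sum and 2*sqrt(N) - 2 steps

total, steps = mesh_sum(list(range(1000)), N=16)
print(total, steps)    # 499500 and 6
```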


Shared Memory and Message Passing

Interprocessor Communication

Where there are N processors, each with its own individual data stream (i.e. SIMD and MIMD), it is usually necessary to communicate data between processors. This is done in two ways:

1. using SHARED MEMORY and SHARED VARIABLES, or
2. via an INTERCONNECTION NETWORK.

Shared Memory

Here each processor has access to any variable residing in the shared memory. So if processor x wishes to pass a number to processor y, it must be done in two steps:

1. processor x writes the number into the shared memory at a given location accessible to processor y;
2. processor y then reads the number from that location.

During the execution of a parallel algorithm, the N processors can access the shared memory for reading/writing data and/or results. All processors can gain access to the shared memory simultaneously if the memory locations they are trying to read from or write into are different. However, problems arise when two or more processors require access to the same memory location simultaneously.

Let x be a variable that can be accessed by two processors P1 and P2, and consider the assignment x := x + 1, which is normally done in three stages:

1. copy the value of x into some register
2. add 1 to the value in the register
3. store the value in the register at the address of x

Say now that P1 and P2 both execute such an assignment, with x = 0 initially. What is the final result?

P1 copies the value of x (= 0) into its register; P2 copies the value of x (= 0) into its register (these two at the same time).
P1 adds 1 to its register; P2 adds 1 to its register (these two at the same time).
P1 stores the value of x (= 1); P2 stores the value of x (= 1).

This gives a result of 1 rather than 2, because P2 reads the value of x (= 0) before P1 has updated it.

Therefore, depending on whether 2 or more processors can gain access to the same memory location simultaneously, we have 4 subclasses of shared memory computers:

1. Exclusive Read, Exclusive Write (EREW) SM Computers Access to memory locations is exclusive i.e. no 2 processors are allowed to simultaneously read from or write into the same location.

2. Concurrent Read, Exclusive Write (CREW) SM Computers Multiple processors are allowed to read from the same location, but write access is still exclusive, i.e. no 2 processors are allowed to write into the same location simultaneously.

3. Exclusive Read, Concurrent Write (ERCW) SM Computers Multiple processors are allowed to write into the same memory location but read access remains exclusive.

4. Concurrent Read, Concurrent Write (CRCW) SM Computers Both multiple read and multiple write privileges are allowed.

NOTES

Allowing concurrent read access to the same address should pose no problems (except perhaps to the result of a calculation, as in the example above). Conceptually, each of the several processors reading from that location makes a copy of its contents and stores it in its own register.

Problems arise, however, with concurrent write access. If several processors are trying to simultaneously store (potentially different) data at the same address, which of them should succeed? We need a deterministic way of specifying the contents of a memory location after a concurrent write operation. Some ways of resolving write conflicts include:

A) assign priorities to the processors and accept the value from the highest priority processor;
B) all the processors are allowed to write, provided that the quantities they are attempting to store are equal, otherwise access is denied to ALL processors;
C) the max / min / sum / average of the values is stored (numeric data).

Despite this, it is only feasible to allow P processors to access P memory locations simultaneously for relatively small P (< 30), usually because the cost of the communication hardware is too high, as is the physical size of the device used for a memory location.

Generally SIMD machines, because they can use very simple processors (since these have no control unit), typically need large numbers of processors (> 1000) to achieve high performance. So shared memory SIMD machines are unrealistic and no commercial machines exist with this design. However, MIMD machines use much more powerful processors, and shared memory MIMD systems do exist with small numbers of processors (2 - 30).

To illustrate the theoretical potential of the four different subclasses of shared memory, consider the following example.

We have N processors to search a list S = {L1, L2, ..., Ln} for the index of a given element x, where 1 < N <= n. Assume x may appear several times in the list and any index will do.

ALGORITHM: Procedure SM_search(S, x, k)

Step 1: FOR i = 1 TO N DO in parallel
            read x
        ENDFOR

Step 2: FOR i = 1 TO N DO in parallel
            Si := {L((i-1)n/N + 1), ..., L(in/N)}
            perform a sequential search on sublist Si
            (returns Ki = -1 if x is not in the sublist, or the index of x if it is)
        ENDFOR

Step 3: FOR i = 1 TO N DO in parallel
            IF Ki > 0 THEN k := Ki ENDIF
        ENDFOR
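A sketch of the same idea in Python: the list is split into N sublists that are searched in parallel by a pool of worker processes, and the 'any index will do' combination step picks any positive Ki. The division into sublists follows Step 2 above.

```python
from multiprocessing import Pool

def search_sublist(args):
    # Step 2 for one processor: sequential search of its own sublist.
    sublist, offset, x = args
    for j, item in enumerate(sublist):
        if item == x:
            return offset + j + 1      # 1-based index into the full list
    return -1                          # Ki = -1: x is not in this sublist

def sm_search(S, x, N):
    n = len(S)
    bounds = [(i * n // N, (i + 1) * n // N) for i in range(N)]
    tasks = [(S[lo:hi], lo, x) for lo, hi in bounds]
    with Pool(N) as pool:
        results = pool.map(search_sublist, tasks)    # the N searches run in parallel
    # Step 3: any positive Ki will do.
    return next((k for k in results if k > 0), -1)

if __name__ == "__main__":
    S = [7, 3, 9, 3, 5, 8, 2, 6]
    print(sm_search(S, x=3, N=4))      # 2 here; any index of an occurrence is acceptable
```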

Assuming the sequential search procedure takes O(n/N) time in the worst case, what is the time complexity of running this algorithm on the four subclasses of shared memory machine?

EREW: Step 1 takes O(N) time (N reads, one at a time)
      Step 2 takes O(n/N) time (reading the sublist and the sequential search)
      Step 3 takes O(N) time
      i.e. overall O(N) + O(n/N) time

ERCW: Step 1 takes O(N) time
      Step 2 takes O(n/N) time
      Step 3 takes constant time (with an appropriate conflict resolution rule)
      i.e. overall O(N) + O(n/N) time

CREW: Step 1 takes constant time
      Step 2 takes O(n/N) time
      Step 3 takes O(N) time
      i.e. overall O(N) + O(n/N) time

CRCW: Step 1 takes constant time
      Step 2 takes O(n/N) time
      Step 3 takes constant time
      i.e. overall O(n/N) time

Interconnection Networks

Introduction

We have seen that one way for processors to communicate data is to use a shared memory and shared variables. However this is unrealistic for large numbers of processors. A more realistic assumption is that each processor has its own private memory and data communication takes place using message passing via an INTERCONNECTION NETWORK.

Interconnection Networks

The interconnection network plays a central role in determining the overall performance of a multicomputer system. If the network cannot provide adequate performance for a particular application, nodes will frequently be forced to wait for data to arrive. Some of the more important networks include (those not covered in detail here should be investigated by the reader):

• Fully connected or all-to-all
• Mesh
• Rings
• Hypercube
• X-Tree
• Shuffle Exchange
• Butterfly
• Cube Connected Cycles

Metrics for comparing and evaluating these networks are discussed at the end of this section.

Fully Connected or All-to-All

This is the most powerful interconnection network (topology): each node is directly connected to ALL other nodes.

Each node has N-1 connections (N-1 nearest neighbours), giving a total of N(N-1)/2 connections for the network. Even though this is the best network to have, the high number of connections per node means it can only be implemented for small values of N. Therefore some form of limited interconnection network must be used.

Mesh (Torus)

In a mesh network the nodes are arranged in a k-dimensional lattice of width w, giving a total of w^k nodes. [Usually k = 1 (linear array) or k = 2 (2D array), e.g. the ICL DAP.] Communication is allowed only between neighbouring nodes; all interior nodes are connected to 2k other nodes.

Rings

A simple ring is just a linear array with the end nodes linked.

It is equivalent to a 1D mesh with wraparound connections. One drawback of this network is that some data transfers may require up to N/2 links to be traversed, e.g. nodes A and B above (3 links apart). This can be reduced by using a chordal ring, which is a simple ring with cross or chordal links between nodes on opposite sides.

Hypercube Connection (Binary n-Cube)

Hypercube networks consist of N = 2^k nodes arranged in a k-dimensional hypercube. The nodes are numbered 0, 1, ..., 2^k - 1 and two nodes are connected if their binary labels differ by exactly one bit.

A k-dimensional hypercube is formed by combining two (k-1)-dimensional hypercubes and connecting corresponding nodes, i.e. hypercubes are recursive. Each node is connected to k other nodes, i.e. each node is of degree k. The departmental NCUBE is based on this topology, i.e. a 5 dimensional hypercube (64 nodes).
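The 'differ by exactly one bit' rule is easy to check with an XOR. A small sketch (the 4-dimensional example is purely illustrative):

```python
def hypercube_neighbours(a, b):
    # Two hypercube nodes are connected iff their binary labels differ in exactly one bit.
    return bin(a ^ b).count("1") == 1

k = 4                                         # a 4-dimensional hypercube: 2**4 = 16 nodes
nodes = range(2 ** k)
print(hypercube_neighbours(0b0101, 0b0100))   # True: labels differ in one bit
print(hypercube_neighbours(0b0101, 0b0110))   # False: labels differ in two bits
# Each node has exactly k neighbours, i.e. each node is of degree k:
print([sum(hypercube_neighbours(u, v) for v in nodes) for u in nodes])   # [4, 4, ..., 4]
```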


Metrics for Interconnection Networks

Metrics provide a framework to compare and evaluate interconnection networks. The metrics we will use are:

1. Network connectivity
2. Network diameter
3. Narrowness
4. Network expansion increments

For examples of their calculation, refer to your own lecture notes.

Network Connectivity

Network nodes and communication links sometimes fail and must be removed from service for repair. When components do fail, the network should continue to function with reduced capacity. Network connectivity measures the resiliency of a network and its ability to continue operation despite disabled components, i.e. connectivity is the minimum number of nodes or links that must fail to partition the network into two or more disjoint networks. The larger the connectivity, the better the network is able to cope with failures.

Network Diameter

The diameter of a network is the maximum internode distance, i.e. the maximum number of links that must be traversed to send a message to any node along a shortest path. The lower the diameter of a network, the shorter the time to send a message from one node to the node farthest away from it.
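A small sketch that computes the diameter of a network by breadth-first search from every node; the ring and 3-cube adjacency lists are built by hand purely for illustration.

```python
from collections import deque

def diameter(adj):
    # Maximum, over all source nodes, of the longest shortest-path length (in links).
    def farthest(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(farthest(s) for s in adj)

N = 6
ring = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}        # simple 6-node ring
cube = {u: [u ^ (1 << b) for b in range(3)] for u in range(8)}  # 3-dimensional hypercube

print(diameter(ring))   # 3, i.e. N/2 links in the worst case, as noted for rings above
print(diameter(cube))   # 3, i.e. k links for a k-dimensional hypercube
```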

Narrowness

This is a measure of congestion in a network and is calculated as follows. Partition the network into two groups of processors, A and B, where the number of processors in each group is Na and Nb, and assume Nb <= Na. Count the number of interconnections between A and B; call this I. The narrowness of the network is the maximum value of Nb / I over all partitionings of the network. The idea is that if the narrowness is high (Nb > I), then if the group B processors want to send messages to group A, congestion in the network will be high (since there are fewer links than processors).

Network Expansion Increments

A network should be expandable, i.e. it should be possible to create larger and more powerful multicomputer systems by simply adding more nodes to the network. For reasons of cost it is better to have the option of small increments, since this allows you to upgrade your network to the size you require (i.e. flexibility) within a particular budget. For example, an 8 node linear array can be expanded in increments of 1 node, but a 3-dimensional hypercube can be expanded only by adding another 3D hypercube (i.e. 8 nodes).

Parallel Algorithm Construction

Classification of MIMD Algorithms

Parallel algorithms for MIMD machines can be divided into 3 categories, which are expanded in the next sections:

1. Pipelined Algorithms / Algorithmic Parallelism
2. Partitioned Algorithms / Geometric Parallelism
3. Asynchronous / Relaxed Algorithms

Pipelined Algorithms / Algorithmic Parallelism

A pipelined algorithm is an ordered set of (possibly different) processes in which the output of each process is the input to its successor.

The input to the first process is the input to the algorithm; the output from the last process is the output of the algorithm. Typically each processor forms part of a pipeline and performs only a small part of the algorithm. Data then flows through the system (pipeline), being operated on by each processor in succession.

To illustrate pipelining, consider the following. Say it takes 3 steps, A, B and C, to assemble a widget, and assume each step takes one unit of time.

Sequential widget assembly machine: spends 1 unit of time doing step A, followed by 1 unit of time doing step B, followed by 1 unit of time doing step C. So a sequential widget assembler produces 1 widget in 3 time units, 2 in 6 time units, etc., i.e. one widget every 3 units.

Pipelined widget assembly machine: say we use a 3 segment pipeline where each of the subtasks (A, B or C) is assigned to a segment, i.e. the machine is split into 3 smaller machines, one for step A, one for step B and one for step C, which can operate simultaneously. The first machine performs step A on a new widget every time step and passes the partially assembled widget to the second machine, which performs step B. This is then passed on to the third machine to perform step C. This produces the first widget in 3 time units (as the sequential machine does), but after this initial startup time one widget appears every time step, i.e. the second widget appears at time 4, the third widget appears at time 5, and so on.


In general, if L is the number of steps to be performed, T is the time for each step and n is the number of items (widgets), then

Time Sequential = L T n
Time Pipelined  = (L + n - 1) T

For example, with T = 1, L = 100 and n = 10^6:

Tseq  = 10^8
Tpipe = 100 + 10^6 - 1 = 10^6 + 99

Speedup = Tseq / Tpipe = 10^8 / (10^6 + 99) ≈ 100

i.e. a 100-fold increase in speed. In general, as n tends to infinity the speedup tends to L.
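The formulas are easy to check directly; a quick sketch reproducing the 100-fold figure:

```python
def pipeline_times(L, T, n):
    # Sequential vs. pipelined completion times for n items through L steps of length T.
    t_seq = L * T * n
    t_pipe = (L + n - 1) * T
    return t_seq, t_pipe, t_seq / t_pipe

t_seq, t_pipe, speedup = pipeline_times(L=100, T=1, n=10**6)
print(t_seq, t_pipe, round(speedup, 2))   # 100000000 1000099 99.99 -- roughly L for large n
```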

Geometric Parallelism / Partitioned Algorithms

These algorithms arise when there is a natural way to decompose the data set into smaller "chunks" of data, which are then allocated to individual processors. Thus each processor contains more or less the same code but operates on a subset of the total data.

The solutions to these subproblems are then combined to form the complete solution. Depending on the algorithm, this combining of solutions usually implies communication and synchronization among the processors. Synchronization means constraining a particular ordering of events: if data needs to be communicated between processors after each iteration of a numerical calculation, then this implies synchronization between the processes. [We will see a particular example of this in the matrix multiplication algorithm later on.] Thus partitioned algorithms are sometimes called synchronous algorithms.

To illustrate the difference between pipelined and partitioned algorithms, consider the following. Say an algorithm consists of 4 parts, A, B, C and D, and is to operate on a data set E consisting of 4 subsets E1, E2, E3 and E4 (e.g. a matrix divided into submatrices). The pipelined algorithm would consist of 4 processors performing A, B, C or D; the complete data set would then pass through all 4 processors.

However, in the partitioned algorithm the four processors all perform A, B, C and D, but only on a subset of the data.

In other words, in pipelined algorithms the algorithm is distributed among the processors, whereas in partitioned algorithms the data is distributed among the processors.

Say we want to calculate Fi = cos(sin(e^(xi^2))) for x1, x2, ..., x6 using 4 processors.

Pipelined Version

Each of the 4 processors performs one stage of the calculation (square, exponential, sine or cosine) and the values x1, ..., x6 flow through the pipeline. F1 is produced in 4 time units and F2 at time 5. Since the sequential time is 6 values * 4 operations = 24 units, we have

time = 4 + (6 - 1) = 9 units  ==>  SPEEDUP = 24 / 9 ≈ 2.67  ==>  EFFICIENCY ≈ 67%

Partitioned Version

This time each processor performs the complete algorithm, i.e. cos(sin(e^(x^2))), but on its own data. With 6 values shared among 4 processors, the busiest processors handle 2 values each, i.e.

time = 2 * 4 = 8 units  ==>  SPEEDUP = 24 / 8 = 3  ==>  EFFICIENCY = 3 / 4 = 75%
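A sketch of the partitioned version, assuming Python's multiprocessing Pool: each worker applies the complete function to its own share of the x values (the numerical values of x1, ..., x6 are illustrative).

```python
import math
from multiprocessing import Pool

def F(x):
    # The complete algorithm, run by every processor on its own data.
    return math.cos(math.sin(math.exp(x * x)))

if __name__ == "__main__":
    xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]          # x1 ... x6
    with Pool(processes=4) as pool:              # 4 processors, data partitioned among them
        results = pool.map(F, xs, chunksize=2)   # at most 2 values per processor
    print(results)
```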

Asynchronous / Relaxed Parallelism

In relaxed algorithms there is no explicit dependency between processes, as occurs in synchronized algorithms. Instead, relaxed algorithms never wait for input: if they are ready, they use the most recently available data.

To illustrate this, consider the following. Say we have two processors A and B. A produces a sequence of numbers a1, a2, ...; B inputs ai and performs some calculation F which uses ai. Say that B runs much faster than A.

Synchronous Operation

A produces a1 and passes it to B, which calculates F1; A produces a2 and passes it to B, which calculates F2; and so on. B waits for A to finish each value (since B is faster than A).

Asynchronous Operation

A produces a1 and passes it to B, which calculates F1. But now A is still in the process of computing a2, so instead of waiting B carries on and calculates F2 (based on the old data a1, and therefore possibly not the same as the F2 above), and it continues to calculate F using the old data until a new input arrives, e.g. Fnew = Fold + ai.

The idea in using asynchronous algorithms is that all processors are kept busy and never remain idle (unlike synchronous algorithms), so speedup is maximized. A drawback is that they are difficult to analyse (because we do not know what data is being used), and an algorithm that is known to work (e.g. converge) in synchronous mode may not work (e.g. may diverge) in asynchronous mode. [Examples of calculations that do run successfully as asynchronous algorithms are the Gauss-Seidel and Jacobi methods for solving linear systems of equations.]

Consider the Newton-Raphson iteration for solving F(x) = 0, where F is some non-linear function:

Xn+1 = Xn - F(Xn) / F'(Xn)    ...(1)

This generates a sequence of approximations to the root, starting from a value X0. Say we have 3 processors:

P1: given x, P1 calculates F(x) in time t1 units and sends it to P3.
P2: given y, P2 calculates F'(y) in time t2 units and sends it to P3.
P3: given a, b, c, P3 calculates d = a - b/c in time t3 units; if |d - a| > epsilon then d is sent to P1 and P2, otherwise d is output.

Serial Mode

P1 computes F(Xn), then P2 computes F'(Xn), then P3 computes Xn+1 using (1). So the time per iteration is t1 + t2 + t3, and if k iterations are necessary for convergence then the total time is k(t1 + t2 + t3).

Synchronous Parallel Mode

P1 and P2 compute F(Xn) and F'(Xn) simultaneously, and when BOTH have finished, the values F(Xn) and F'(Xn) are used by P3 to compute Xn+1. The time per iteration is max(t1, t2) + t3. Again k iterations will be necessary, so the total time is k[max(t1, t2) + t3].

X1 = X0 - F(X0)/F'(X0), etc.

Asynchronous Parallel Mode

P1 and P2 begin computing as soon as a new input value is made available by P3 and they are ready to receive it. P3 computes a new value using (1) as soon as EITHER P1 OR P2 provides a new input, i.e. (1) is now of the form

Xn+1 = Xn - F(Xi)/F'(Xj)

The time per iteration is at most min(t1, t2) + t3 [but may be as low as t3, e.g. if P1 and P2 complete at consecutive time steps]. We cannot predict how many iterations will be necessary.

Say t1 = 2, t2 = 3 and t3 = 1.

TIME          2         3         4         5         6         7         8         9         10        11        12        ...
P1 (2 units)  F(X0)     -         -         C1        F(X1)     -         C2        F(X2)     C3        F(X3)     C4 or C5  ...
P2 (3 units)  C0        F'(X0)    -         C1        C1        F'(X1)    C2        C2        F'(X2)    C3 or C4  ...
P3 (1 unit)   -         -         X1        -         -         X2        X3        -         X4        X5        X6        ...

X1 = X0 - F(X0)/F'(X0)
X2 = X1 - F(X1)/F'(X0)
X3 = X2 - F(X1)/F'(X1)
X4 = X3 - F(X2)/F'(X1)
X5 = X4 - F(X2)/F'(X2)
X6 = X5 - F(X3)/F'(X2)

Here '-' indicates that the processor is idle, Ci indicates that the processor is using Xi in its calculation, and an entry such as F(Xi) or Xi+1 marks the time step at which that value is completed.

NOTE: At time 11, P2 has the choice of using X3 to calculate F'(X3), or X4 to calculate F'(X4) (i.e. omitting X3). Which choice is made should be determined experimentally to see which gives the best results: we could relax the parameter (i.e. use X4) or we could synchronise with the parameter (i.e. use X3).

Factors That Limit Speedup

This section is divided into two parts:

1. Defining Speedup and Efficiency
2. Factors That Limit Speedup

Defining Speedup and Efficiency

A parallel algorithm is an algorithm for the execution of a program which involves the running of two or more processes on two or more processors simultaneously. Two important measures of the quality of parallel algorithms are speedup and efficiency.

If Ts is the time taken to run the fastest serial algorithm on one processor, and Tp is the time taken by a parallel algorithm on N processors, then

Speedup = SN = Ts / Tp
Efficiency = EN = SN / N

For example, if the best known serial algorithm takes 8 seconds (Ts = 8) while a parallel algorithm takes 2 seconds using 5 processors, then SN = Ts / Tp = 8 / 2 = 4 and EN = SN / N = 4 / 5 = 0.8 = 80%, i.e. the parallel algorithm exhibits a speedup of 4 with 5 processors, giving an 80% efficiency.

NOTE: Care should be taken as to exactly what is meant by Ts, i.e. the time taken to run the fastest serial algorithm on one processor. Which processor? We can use one processor of the parallel computer, or we can use the fastest serial machine available. The latter is the fairest way to compare parallel algorithms, but it is unrealistic in practice, since most people do not have access to the fastest serial machines, making it impossible to make a claim about speedup. (Researchers also do not like this approach because the speedup is reduced, since Ts is lower.)

A slightly different definition of speedup also exists: the time taken by the parallel algorithm on one processor divided by the time taken by the parallel algorithm on N processors. However, this is misleading, since many parallel algorithms contain extra operations to accommodate the parallelism (e.g. the communication), so Ts is increased, thus exaggerating the speedup.

Whichever definition is used, the ideal is to produce linear speedup, i.e. a speedup of N using N processors and an efficiency of 1 (100%).

However in practice the speedup is reduced from its ideal value of N (the efficiency is bounded from above by 1 ).

Factors That Limit Speedup

1. Software Overhead

Even with a completely equivalent algorithm, software overhead arises in the concurrent implementation, e.g. there may be additional index calculations necessitated by the manner in which the data are "split up" among processors. In general there are more lines of code to be executed in the parallel program than in the sequential program.

2. Load Balancing

Speedup is generally limited by the speed of the slowest node. So an important consideration is to ensure that each node performs the same amount of work, i.e. that the system is load balanced.

3. Communication Overhead

Assuming that communication and calculation cannot be overlapped, any time spent communicating data between processors directly degrades the speedup (because the processors are not calculating while they communicate). Because of this, a goal of the parallel algorithm designer should be to make the grain size (the relative amount of work done between synchronizations / communications) as large as possible, while keeping all the processors busy.

The effect of communication on speedup is reduced, in relative terms, as the grain size increases. Another important parameter when considering communication is the machine-dependent ratio tcomm / tcalc, where tcomm is the time to transfer a single word between two nodes and tcalc is the time to perform some floating point calculation.

4. Amdahl's Law

This states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially, i.e. by its serial fraction.

Let S be the amount of time spent (by one processor) on the serial parts of the program, and P be the amount of time spent (by one processor) on the parts of the program that could be done in parallel, i.e.

Tseq = S + P    and    Tpar = S + P/N    ...(0)

where N is the number of processors. Therefore

SPEEDUP = Tseq / Tpar = (S + P) / (S + P/N)    ...(1)

Say we have a program containing 100 operations, each of which takes 1 time unit. If 80 operations can be done in parallel (P = 80) and 20 operations must be done sequentially (S = 20), then using 80 processors

Speedup = 100 / (20 + 80/80) = 100 / 21 < 5

i.e. a speedup of at most 5 is possible no matter how many processors are available.

If we define the serial fraction F to be F = S / Tseq, so that P = (1 - F) Tseq, then equation (1) can be written as

SPEEDUP = 1 / (F + (1 - F)/N)    ...(2)

So if F = 0, i.e. there is no serial part, then speedup = N (the ideal value); if F = 1, i.e. the program is completely serial, then speedup = 1 no matter how many processors are used. Consider the effect of the serial fraction F on the speedup produced for N = 10 and N = 1024.

If 1% of a parallel program involves serial code (F = 0.01), the maximum speedup that can be attained is about 9 using 10 processors, but only about 91 using 1024 processors. Therefore Amdahl's Law tells us that the serial fraction F places a severe constraint on the speedup as the number of processors increases. Since most parallel programs contain a certain amount of sequential code, a possible conclusion of Amdahl's Law is that it is not cost effective to build systems with large numbers of processors, because sufficient speedup will never be produced. However, most of the important applications that need to be parallelised contain very small serial fractions (F < 0.001).
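Equation (2) is easy to evaluate; a quick sketch reproducing the 9 / 91 figures and showing the effect of a much smaller serial fraction:

```python
def amdahl_speedup(F, N):
    # Equation (2): predicted speedup for serial fraction F on N processors.
    return 1.0 / (F + (1.0 - F) / N)

for N in (10, 1024):
    print(N, round(amdahl_speedup(0.01, N), 1))    # 10 -> 9.2, 1024 -> 91.2
# With a very small serial fraction the picture changes dramatically:
print(round(amdahl_speedup(0.001, 1024), 1))       # 506.2
```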

Scaled Speedup

Say we have a problem which takes Tseq seconds to solve: S seconds on the parts of the program that must be done serially and P (= Tseq - S) seconds on the parts that could have been done in parallel, i.e. Tseq = S + P, with serial fraction F = S / Tseq. If we solve the same problem on a parallel machine, i.e. the problem size is fixed, then the speedup can be predicted by Amdahl's Law as

Speedup = (S + P) / (S + P/N) = 1 / (F + (1 - F)/N)

However, we could argue instead that the run time is fixed, i.e. that the problem size is not fixed but increases in proportion to N, AND that the serial fraction is not constant but decreases as the problem size increases (i.e. S stays constant while P grows).

As noted above, Amdahl's Law tells us that the serial fraction F places a severe constraint on the speedup as the number of processors increases, suggesting that it is not cost effective to build systems with large numbers of processors because sufficient speedup will never be produced. However, Amdahl's Law is only valid for problems in which the serial fraction F does not vary with the problem size, i.e. problems where, as the problem grows, the times Tseq and S both increase, keeping F = S / Tseq constant.

NOTE Applications that parallelise well are ones with very small serial fractions.

Using the Serial Fraction in Measuring Performance

Given the serial fraction F and the number of processors N, we can calculate the predicted speedup using equation (2), i.e. speedup = 1 / (F + (1 - F)/N). Conversely, if we run the program and find the actual speedup from Tseq / Tpar, we can rearrange (2) to find the actual (experimentally determined) serial fraction F:

F = (1/Speedup - 1/N) / (1 - 1/N)

The value of F is useful because equation (0), i.e. Tpar = S + P/N, is idealised: it assumes all processors compute for the same amount of time (perfect load balancing), and the overhead of communication and synchronization between processors is not included.

Load balancing effects are likely to result in an irregular change in F as N increases. Say we have 12 pieces of work, each taking the same amount of time: we have perfect load balancing for N = 2, 3, 4, 6 and 12 processors, but not for N = 5, 7, 8, 9, 10 or 11. Since a larger load imbalance results in a larger F, problems can be identified that are not apparent from the speedup or efficiency alone.

The overhead of communication and synchronization tends to increase as N increases. Since increasing the overhead decreases the speedup, the value of F increases smoothly as N increases. So a smoothly increasing F is a warning that the grain size is too small.

Consider the following results:

N    Speedup    Efficiency (%)    F
2    1.95       97                0.024
3    2.88       96                0.021
4    3.76       94                0.021
8    6.96       87                0.021

Without looking at the serial fraction we cannot tell whether these results are good or not, e.g. why does the efficiency decrease? Since F is almost constant, we can conclude that it is due to the limited parallelism of the program rather than to growing overheads.
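A small helper implementing the rearranged formula; the sample calls reuse two of the measured speedups from the table above.

```python
def serial_fraction(speedup, N):
    # Experimentally determined serial fraction, from rearranging equation (2).
    return (1.0 / speedup - 1.0 / N) / (1.0 - 1.0 / N)

print(round(serial_fraction(2.88, 3), 3))   # 0.021, matching the table above
print(round(serial_fraction(6.96, 8), 3))   # 0.021
```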
