Parallel Programming and MPI



Parallel Programming and MPI
A course for IIT-M. September 2008
R Badrinath, STSD Bangalore ([email protected])

© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

Context and Background
• IIT-Madras has recently added a good deal of compute power.
• Why –
  − Further R&D in sciences, engineering
  − Provide computing services to the region
  − Create new opportunities in education and skills
  − …
• Why this course –
  − Update skills to program modern cluster computers
• Length – 2 theory and 2 practice sessions, 4 hrs each

Audience Check


Contents
1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Create_comm
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather
………………

Instead we:
• Understand Issues
• Understand Concepts
• Learn enough to pick up from the manual
• Go by motivating examples
• Try out some of the examples

Outline
• Sequential vs Parallel programming
• Shared vs Distributed Memory
• Parallel work breakdown models
• Communication vs Computation
• MPI Examples
• MPI Concepts
• The role of IO

Sequential vs Parallel
• We are used to sequential programming – C, Java, C++, etc. E.g., Bubble Sort, Binary Search, Strassen Multiplication, FFT, BLAST, …
• Main idea – Specify the steps in perfect order
• Reality – We are used to parallelism a lot more than we think – as a concept; not for programming
• Methodology – Launch a set of tasks; communicate to make progress. E.g., sorting 500 answer papers by making 5 equal piles, having them sorted by 5 people, and merging them together.

Shared vs Distributed Memory Programming
• Shared Memory – All tasks access the same memory, hence the same data. (pthreads)
• Distributed Memory – All memory is local. Data sharing is by explicitly transporting data from one task to another (send-receive pairs in MPI, e.g.)
  [Figure: each task has its own program and memory; tasks exchange data over a communications channel]
• HW – Programming model relationship – Tasks vs CPUs; SMPs vs Clusters

Designing Parallel Programs


Simple Parallel Program – sorting numbers in a large array A
• Notionally divide A into 5 pieces [0..99; 100..199; 200..299; 300..399; 400..499].
• Each part is sorted by an independent sequential algorithm and left within its region.
• The resultant parts are merged by simply reordering among adjacent parts.
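A minimal sequential sketch of this breakdown (no MPI yet). The sample data, the use of qsort for each piece, and the shortcut taken for the final collation step are illustrative assumptions, not part of the slide:

#include <stdio.h>
#include <stdlib.h>

#define N 500
#define PIECES 5

static int cmp_int(const void *x, const void *y) {
    return *(const int *)x - *(const int *)y;
}

int main(void) {
    int A[N];
    for (int i = 0; i < N; i++)
        A[i] = rand() % 1000;                       /* sample data */

    /* sort each 100-element piece independently; these five calls are the
       natural candidates to hand to five parallel tasks */
    for (int p = 0; p < PIECES; p++)
        qsort(&A[p * (N / PIECES)], N / PIECES, sizeof(int), cmp_int);

    /* the merge among adjacent parts would go here; a plain re-sort is used
       only to keep the sketch short */
    qsort(A, N, sizeof(int), cmp_int);

    printf("A[0]=%d ... A[%d]=%d\n", A[0], N - 1, A[N - 1]);
    return 0;
}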

What is different – Think about…
• How many people doing the work. (Degree of Parallelism)
• What is needed to begin the work. (Initialization)
• Who does what. (Work distribution)
• Access to work part. (Data/IO access)
• Whether they need info from each other to finish their own job. (Communication)
• When are they all done. (Synchronization)
• What needs to be done to collate the result.

Work Break-down
• Parallel algorithm
• Prefer simple intuitive breakdowns
• Usually highly optimized sequential algorithms are not easily parallelizable
• Breaking work often involves some pre- or post- processing (much like divide and conquer)
• Fine vs large grain parallelism and relationship to communication

Digression – Let’s get a simple MPI Program to work

#include <mpi.h>
#include <stdio.h>

int main() {
    int total_size, my_rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("\n Total number of programs = %d, out of which rank of this process is %d\n",
           total_size, my_rank);
    MPI_Finalize();
    return 0;
}

Getting it to work
• Compile it:
  − mpicc –o simple simple.c
  − # If you want HP-MPI, set your path to include /opt/hpmpi/bin
• Run it – this depends a bit on the system
  − mpirun -np 2 simple
  − qsub –l ncpus=2 –o simple.out /opt/hpmpi/bin/mpirun /simple
  − [Fun: qsub –l ncpus=2 –I hostname ]
• Results are in the output file.
• What is mpirun?
• What does qsub have to do with MPI?... More about qsub in a separate talk.

What goes on
• Same program is run at the same time on 2 different CPUs
• Each is slightly different in that each returns different values for some simple calls like MPI_Comm_rank.
• This gives each instance its identity
• We can make different instances run different pieces of code based on this identity difference
• Typically it is an SPMD model of computation
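A tiny sketch of that idea, extending simple.c above (the printed messages are arbitrary and not prescribed by the slide):

#include <mpi.h>
#include <stdio.h>

int main() {
    int total_size, my_rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* same program everywhere; the rank decides which branch each instance runs */
    if (my_rank == 0)
        printf("Rank 0: I could coordinate the work\n");
    else
        printf("Rank %d of %d: I could do a worker's share\n", my_rank, total_size);

    MPI_Finalize();
    return 0;
}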

Continuing work breakdown…

Simple Example: Find shortest distances
• PROBLEM: Find shortest path distances in a graph.
• Let nodes be numbered 0, 1, …, n-1
• Let us put all of this in a matrix: A[i][j] is the distance from i to j
[Figure: a small example graph with edge weights, and its distance matrix A]

Floyd’s (sequential) algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + a[k][j]);

Parallelizing Floyd
• Actually we just need n² tasks, with each task iterating n times (once for each value of k).
• After each iteration we need to make sure everyone sees the matrix.
• ‘Ideal’ for shared memory programming
• What if we have less than n² tasks?... Say p
• Need to divide the work among the p tasks.
• We can simply divide up the rows.

Dividing the work
• Each task gets [n/p] rows, with the last possibly getting a little more.
[Figure: rows are assigned in blocks; task T0 owns the first block and task Tq's block starts at row q x [n/p]; any i-th or k-th row lives in whichever task owns that block]

Remember the observation

/* "id" is the TASK NUMBER; each node has only the part of A that it owns.
   This is approximate code. Note that each node calls its own matrix by the
   same name a[][] but has only [n/p] rows. */
/* The MPI model: all nodes run the same code – p replica tasks (distributed
   memory model); sometimes they need to do different things. */

for (k = 0; k < n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j = 0; j < n; j++)
            rowk[j] = a[k_here][j];
    }
    /* rowk is broadcast by the owner and received by the others..
       The MPI code will come here later */
    for (i = 0; i < number_of_rows_I_own; i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
}
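Once the MPI calls have been introduced, the approximate code above can be completed. A compilable sketch follows; the small size N, the requirement that N divides evenly among the tasks, the toy chain-graph initialization and the INF sentinel are all assumptions made here for illustration:

#include <mpi.h>
#include <stdio.h>

#define N   8                         /* small illustrative size */
#define INF 1000000                   /* "no edge" sentinel */

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = N / p;                 /* assumes N % p == 0 */
    int a[N][N];                      /* only the first `rows` rows are used locally */
    int rowk[N];

    /* toy graph: a chain 0-1-2-...-(N-1) with unit edge weights */
    for (int i = 0; i < rows; i++) {
        int gi = id * rows + i;       /* global index of my local row i */
        for (int j = 0; j < N; j++)
            a[i][j] = (gi == j) ? 0 : (j == gi - 1 || j == gi + 1) ? 1 : INF;
    }

    for (int k = 0; k < N; k++) {
        int owner = k / rows;         /* task owning global row k */
        if (id == owner)
            for (int j = 0; j < N; j++)
                rowk[j] = a[k - id * rows][j];
        MPI_Bcast(rowk, N, MPI_INT, owner, MPI_COMM_WORLD);   /* everyone gets row k */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++)
                if (a[i][k] + rowk[j] < a[i][j])
                    a[i][j] = a[i][k] + rowk[j];
    }

    if (id == 0) {                    /* print rank 0's first row as a quick check */
        for (int j = 0; j < N; j++) printf("%d ", a[0][j]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}

Run it with a task count that divides N evenly, e.g. mpirun -np 2 or -np 4.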

The MPI model
• Recall MPI tasks are typically created when the jobs are launched – not inside the MPI program (no forking).
  − mpirun usually creates the task set
  − mpirun –np 2 a.out <args to a.out>
  − a.out is run on all nodes and a communication channel is set up between them
• Functions allow for tasks to find out
  − Size of the task group
  − One’s own position within the group

MPI Notions [ Taking from the example ]
• Communicator – A group of tasks in a program
• Rank – Each task’s ID in the group
  − MPI_Comm_rank() … /* use this to set “id” */
• Size – Of the group
  − MPI_Comm_size() … /* use to set “p” */
• Notion of send/receive/broadcast…
  − MPI_Bcast() … /* use to broadcast rowk[] */
• For actual syntax use a good MPI book or manual
• Online resource: http://www-unix.mcs.anl.gov/mpi/www/

MPI Prologue to our Floyd example

int a[MAX][MAX];
int n = 20;   /* real size of the matrix, can be read in */
int id, p;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
.
.   /* This is where all the real work happens */
.
MPI_Finalize();   /* Epilogue */

This is the time to try out several simple MPI programs using the few functions we have seen.
- use mpicc
- use mpirun

Visualizing the execution

Job is launched – tasks are started on CPUs (multiple tasks may be on the same node; the scheduler ensures 1 task per CPU)
• MPI_Init, MPI_Comm_rank, MPI_Comm_size etc…
• Other initializations, like reading in the array
• For initial values of k, the task with rank 0 broadcasts row k, the others receive it
• For each value of k they do their computation with the correct row
• Loop above for all values of k
• Task 0 receives all blocks of the final array and prints them out
• MPI_Finalize

Communication vs Computation
• Often communication is needed between iterations to complete the work.
• Often the more tasks there are, the more communication there can be.
  − In Floyd, bigger “p” indicates that “rowk” will be sent to a larger number of tasks.
  − If each iteration depends on more data, it can get very busy.
• This may mean network contention, i.e., delays.
• Try to count the number of “a”s in a string. Time vs p.
• This is why, for a fixed problem size, increasing the number of CPUs does not continually increase performance.
• This needs experimentation – problem specific.

Communication primitives
• MPI_Send(sendbuffer, senddatalength, datatype, destination, tag, communicator);
• MPI_Send("Hello", strlen("Hello"), MPI_CHAR, 2, 100, MPI_COMM_WORLD);
• MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
• Send-Recv happen in pairs.
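A minimal compilable sketch of such a pair (at least two tasks assumed; the message text and the tag value 100 are arbitrary choices):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank;
    char buf[64];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "Hello");
        /* +1 so the terminating '\0' travels with the string */
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 64, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
        printf("Rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}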

Collectives
• Broadcast is one-to-all communication
• Both receivers and sender call the same function
• All MUST call it. All end up with the SAME result.
• MPI_Bcast(buffer, count, type, root, comm);
• Examples
  − MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
  − Task 0 sends its integer k and all others receive it.
  − MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
  − current_owner_task sends rowk to all others.

Try out a simple MPI program with send-recvs and broadcasts. Try out Floyd’s algorithm. What if you have to read a file to initialize Floyd’s algorithm?

A bit more on Broadcast
[Figure: ranks 0, 1 and 2 start with x = 0, 1 and 2 respectively; each calls MPI_Bcast(&x,1,..,0,..); afterwards all three hold x = 0, the root's value]

Other useful collectives
• MPI_Reduce(&values, &results, count, type, operator, root, comm);
• MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 9, MPI_COMM_WORLD);
• Task number 9 gets in the variable res the sum of whatever was in x in all of the tasks (including itself).
• Must be called by ALL tasks.
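A minimal sketch of a sum reduction (root 0 is used here rather than the 9 of the slide, and the value contributed is simply each task's rank; both choices are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, np, x, res = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    x = rank;                       /* each task's local contribution */
    /* every task calls the reduce; only root 0 receives the sum in res */
    MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of ranks 0..%d = %d\n", np - 1, res);

    MPI_Finalize();
    return 0;
}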

Scattering as opposed to broadcasting
• MPI_Scatterv(sndbuf, sndcount[], send_disp[], type, recvbuf, recvcount, recvtype, root, comm);
• All nodes MUST call it
[Figure: Rank0 holds the full buffer and scatters one piece each to Rank0, Rank1, Rank2, Rank3]
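A sketch using the simpler fixed-size variant MPI_Scatter (equal pieces for every task; MPI_Scatterv adds the per-task counts and displacements shown above). The piece size, the data values and the limit of 16 tasks implied by the buffer size are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

#define PER_TASK 2

int main(int argc, char *argv[]) {
    int rank, np;
    int sendbuf[32];            /* only meaningful on the root; enough for 16 tasks */
    int recvbuf[PER_TASK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (rank == 0)
        for (int i = 0; i < np * PER_TASK; i++)
            sendbuf[i] = i;     /* root prepares one PER_TASK-int piece per task */

    /* every task calls it; each receives its own PER_TASK-int slice */
    MPI_Scatter(sendbuf, PER_TASK, MPI_INT, recvbuf, PER_TASK, MPI_INT,
                0, MPI_COMM_WORLD);

    printf("Rank %d got %d and %d\n", rank, recvbuf[0], recvbuf[1]);

    MPI_Finalize();
    return 0;
}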

Common Communication pitfalls!!
• Make sure that communication primitives are called by the right number of tasks.
• Make sure they are called in the right sequence.
• Make sure that you use the proper tags.
• If not, you can easily get into deadlock (“My program seems to be hung”)
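A classic instance of the sequencing pitfall, sketched under the assumption of two tasks exchanging one int each (extra ranks just idle): if both ranks posted their blocking MPI_Send first, each could wait forever for the other's receive; receiving first on one side, or using MPI_Sendrecv, avoids the hang.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, mine, theirs = -1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    mine = rank * 10;

    if (rank == 0) {
        MPI_Send(&mine, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&theirs, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
    } else if (rank == 1) {
        /* receive first here; if both ranks sent first, each could block
           waiting for the other's receive - the "program seems to be hung" case */
        MPI_Recv(&theirs, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&mine, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    if (rank < 2) printf("Rank %d received %d\n", rank, theirs);

    MPI_Finalize();
    return 0;
}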

More on work breakdown
• Finding the right work breakdown can be challenging
• Sometimes dynamic work breakdown is good
• Master (usually task 0) decides who will do what and collects the results.
• E.g., you have a huge number of 5x5 matrices to multiply (chained matrix multiplication).
• E.g., search for a substring in a huge collection of strings.

Master-slave dynamic work assignment
[Figure: the master (rank 0) hands out work items to the slaves (ranks 1, 2, 3, 4) and collects their results]

Master slave example – Reverse strings

Slave() {
    do {
        /* work always comes from the master (rank 0) */
        MPI_Recv(&work, MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        n = strlen(work);
        if (n == 0) break;          /* detecting the end */
        reverse(work);
        MPI_Send(&work, n + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } while (1);
    MPI_Finalize();
}

Master slave example – Reverse strings

Master() {                            /* rank 0 task; approximate code */
    initialize_work_items();
    for (i = 1; i < np; i++)          /* one item to each slave to start */
        MPI_Send(work[i], strlen(work[i]) + 1, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    while (work_items_remain()) {
        MPI_Recv(&result, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &stat);
        MPI_Send(next_item, strlen(next_item) + 1, MPI_CHAR,
                 stat.MPI_SOURCE, 0, MPI_COMM_WORLD);   /* next item to whoever finished */
    }
    for (i = 1; i < np; i++)
        MPI_Send("", 1, MPI_CHAR, i, 0, MPI_COMM_WORLD);  /* empty string: stop (see Slave) */
}

Master slave example

main() {
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    if (id == 0)
        Master();
    else
        Slave();
    ...
}

Matrix Multiply and Communication Patterns


Block Distribution of Matrices
• Matrix Multiply: Cij = Σk (Aik * Bkj)
• BMR Algorithm:
  − Each task owns a block – its own part of A, B and C
  − The old formula holds for blocks!
  − Example: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
  − Each block is a smaller submatrix

Block Distribution of Matrices
• Matrix Multiply: Cij = Σk (Aik * Bkj)
• BMR Algorithm: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
  − A22 is row-broadcast
  − A22*B21 added into C21
  − B_1 is rolled up one slot; our task now has B31
  − Now repeat the above block, except the item to broadcast is A23
  − Each block is a smaller submatrix

Attempt doing this with just SendRecv and Broadcast


Communicators and Topologies
• The BMR example shows limitations of broadcast.. although there is a pattern
• Communicators can be created on subgroups of processes.
• Communicators can be created that have a topology
  − Will make programming natural
  − Might improve performance by matching to hardware

/* BMR main loop (approximate code): s x s grid of tasks, each owning m x m blocks */
for (k = 0; k < s; k++) {
    sender = (my_row + k) % s;
    if (sender == my_col) {
        MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
        T = my_A;
    } else {
        MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);
    }
    my_C = my_C + T x my_B;   /* block multiply-accumulate (pseudocode) */
    MPI_Sendrecv_replace(my_B, m*m, MPI_INT, dest, 0, source, 0, col_comm, &status);
}

Creating topologies and communicators
• Creating a grid
• MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
  − int dim_sizes[2], int istorus[2], int canreorder, MPI_Comm grid_comm
• Divide a grid into rows – each with its own communicator
• MPI_Cart_sub(grid_comm, free, &rowcomm);
  − MPI_Comm rowcomm; int free[2]
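A compilable sketch that puts the two calls together for a 2 x 2 periodic (torus) grid, then extracts a per-row communicator; the grid size, the periodicity flags and the requirement of exactly 4 tasks are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int dim_sizes[2] = {2, 2};      /* 2 x 2 grid: run with exactly 4 tasks */
    int istorus[2]   = {1, 1};      /* wrap around in both dimensions */
    int canreorder   = 1;           /* let MPI renumber ranks to match hardware */
    int free[2]      = {0, 1};      /* keep the column dimension -> one comm per row */
    int grid_rank, coords[2];
    MPI_Comm grid_comm, rowcomm;

    MPI_Init(&argc, &argv);

    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_rank);
    MPI_Cart_coords(grid_comm, grid_rank, 2, coords);   /* my (row, col) position */

    MPI_Cart_sub(grid_comm, free, &rowcomm);            /* all tasks in my row */

    printf("Grid rank %d is at row %d, col %d\n", grid_rank, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}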

Try implementing the BMR algorithm with communicators

45

A brief on other MPI Topics – The last leg
• MPI + Multi-threaded / OpenMP
• One-sided Communication
• MPI and IO

MPI and OpenMP
• Grain …
• Communication …
• Where does the interesting pragma omp for fit in our MPI Floyd?
• How do I assign exactly one MPI task per CPU?
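One plausible answer to the first question, sketched as a fragment meant to sit inside the k loop of the parallel Floyd sketch earlier (the variable names follow that sketch; compile with the compiler's OpenMP flag, e.g. mpicc -fopenmp for a gcc-based mpicc):

/* after rowk has been broadcast for this k, the rows owned by this MPI task
   are updated by a team of OpenMP threads */
#pragma omp parallel for private(j)
for (i = 0; i < rows_I_own; i++)
    for (j = 0; j < n; j++)
        if (a[i][k] + rowk[j] < a[i][j])
            a[i][j] = a[i][k] + rowk[j];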

One-Sided Communication
• Have no corresponding send-recv pairs!
• RDMA
• Get
• Put
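A minimal sketch of the Put flavour using MPI-2 windows (at least two tasks assumed; the single-int window, the fence-based synchronization and the value 42 are illustrative choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, target_val = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every task exposes one int that others may write into */
    MPI_Win_create(&target_val, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int msg = 42;
        /* no matching receive anywhere: rank 0 writes directly into rank 1's window */
        MPI_Put(&msg, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                 /* completes the one-sided transfer */

    if (rank == 1) printf("Rank 1 now holds %d\n", target_val);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}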

IO in Parallel Programs
• Typically a root task does the IO.
  − Simpler to program
  − Natural because some post-processing is occasionally needed (sorting)
  − All nodes generating IO requests might overwhelm the fileserver, essentially sequentializing it.
• Performance is not the limitation for Lustre/SFS.
• Parallel IO interfaces such as MPI-IO can make use of parallel filesystems such as Lustre.

MPI-BLAST exec time vs other time[4]


How IO/Comm Optimizations help MPI-BLAST[4]


What did we learn?
• Distributed Memory Programming Model
• Parallel Algorithm Basics
• Work Breakdown
• Topologies in Communication
• Communication Overhead vs Computation
• Impact of Parallel IO

What MPI Calls did we see here?
1. MPI_Init
2. MPI_Finalize
3. MPI_Comm_size
4. MPI_Comm_rank
5. MPI_Send
6. MPI_Recv
7. MPI_Sendrecv_replace
8. MPI_Bcast
9. MPI_Reduce
10. MPI_Cart_create
11. MPI_Cart_sub
12. MPI_Scatter

References
1. Parallel Programming in C with MPI and OpenMP, M J Quinn, TMH. This is an excellent practical book. Motivated much of the material here, specifically Floyd's algorithm.
2. The BMR Algorithm for Matrix Multiply and the topology ideas are motivated by http://www.cs.indiana.edu/classes/b673/notes/matrix
3. MPI online manual: http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access For Parallel BLAST, IPDPS'05
