Parallel Programming and MPI
A course for IIT-M, September 2008
R Badrinath, STSD Bangalore ([email protected])
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Context and Background
• IIT-Madras has recently added a good deal of compute power.
• Why –
−Further R&D in sciences, engineering
−Provide computing services to the region
−Create new opportunities in education and skills
−…
• Why this course –
−Update skills to program modern cluster computers
• Length – 2 theory and 2 practice sessions, 4 hrs each
Audience Check
Contents
MPI calls covered:
1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Create_comm
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather
………………
Instead of going function by function, we will:
• Understand Issues
• Understand Concepts
• Learn enough to pick up from the manual
• Go by motivating examples
• Try out some of the examples
Outline
• Sequential vs Parallel programming
• Shared vs Distributed Memory
• Parallel work breakdown models
• Communication vs Computation
• MPI Examples
• MPI Concepts
• The role of IO
Sequential vs Parallel
• We are used to sequential programming – C, Java, C++, etc. E.g., Bubble Sort, Binary Search, Strassen Multiplication, FFT, BLAST, …
• Main idea – Specify the steps in perfect order
• Reality – We are used to parallelism a lot more than we think – as a concept; not for programming
• Methodology – Launch a set of tasks; communicate to make progress. E.g., sorting 500 answer papers by making 5 equal piles, having them sorted by 5 people, and merging them together.
Shared vs Distributed Memory Programming
• Shared Memory – All tasks access the same memory, hence the same data. (pthreads)
• Distributed Memory – All memory is local. Data sharing is by explicitly transporting data from one task to another (send-receive pairs in MPI, e.g.)
[Figure: each task has its own program and memory; tasks exchange data over a communications channel]
• HW – Programming model relationship – Tasks vs CPUs; SMPs vs Clusters
Designing Parallel Programs
Simple Parallel Program – sorting numbers in a large array A
• Notionally divide A into 5 pieces [0..99; 100..199; 200..299; 300..399; 400..499].
• Each part is sorted by an independent sequential algorithm and left within its region.
• The resultant parts are merged by simply reordering among adjacent parts.
What is different – Think about…
• How many people doing the work. (Degree of Parallelism)
• What is needed to begin the work. (Initialization)
• Who does what. (Work distribution)
• Access to work part. (Data/IO access)
• Whether they need info from each other to finish their own job. (Communication)
• When are they all done. (Synchronization)
• What needs to be done to collate the result.
Work Break-down
• Parallel algorithm
• Prefer simple intuitive breakdowns
• Usually highly optimized sequential algorithms are not easily parallelizable
• Breaking work often involves some pre- or post-processing (much like divide and conquer)
• Fine vs large grain parallelism and relationship to communication
Digression – Let’s get a simple MPI Program to work

#include <mpi.h>
#include <stdio.h>

int main() {
    int total_size, my_rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("\n Total number of programs = %d, out of which rank of this process is %d\n",
           total_size, my_rank);
    MPI_Finalize();
    return 0;
}
Getting it to work
• Compile it:
−mpicc –o simple simple.c
−# If you want HP-MPI, set your path to include /opt/hpmpi/bin
• Run it – This depends a bit on the system
−mpirun -np 2 simple
−qsub –l ncpus=2 –o simple.out /opt/hpmpi/bin/mpirun ./simple
−[Fun: qsub –l ncpus=2 –I hostname ]
• Results are in the output file.
• What is mpirun?
• What does qsub have to do with MPI?... More about qsub in a separate talk.
What goes on
• The same program is run at the same time on 2 different CPUs
• Each is slightly different in that each returns different values for some simple calls like MPI_Comm_rank.
• This gives each instance its identity
• We can make different instances run different pieces of code based on this identity difference (see the sketch below)
• Typically it is an SPMD model of computation
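For example, a minimal sketch of rank-based branching (the messages and structure here are illustrative, not taken from the course material):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Same binary runs everywhere; the rank picks the branch */
    if (my_rank == 0)
        printf("I am the coordinator (rank 0)\n");
    else
        printf("I am a worker, rank %d\n", my_rank);

    MPI_Finalize();
    return 0;
}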
Continuing work breakdown…
Simple Example: Find shortest distances
PROBLEM: Find shortest path distances.
• Let nodes be numbered 0, 1, …, n-1
• Let us put all of this in a matrix: A[i][j] is the distance from i to j
[Figure: a small weighted example graph on nodes 0–3 and its distance matrix A, with .. marking entries still to be computed]
Floyd’s (sequential) algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            if (a[i][k] + a[k][j] < a[i][j])
                a[i][j] = a[i][k] + a[k][j];
Parallelizing Floyd
• Actually we just need n² tasks, with each task iterating n times (once for each value of k).
• After each iteration we need to make sure everyone sees the matrix.
• ‘Ideal’ for shared memory programming
• What if we have less than n² tasks?... Say p
• Need to divide the work among the p tasks.
• We can simply divide up the rows.
Dividing the work
• Each task gets [n/p] rows, with the last possibly getting a little more (a sketch of this split in code follows below).
[Figure: the rows of A split into contiguous blocks; task Tq’s block starts at row q x [n/p]; an i-th row falls in T0’s block, a k-th row in Tq’s block]
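A minimal sketch of that decomposition, assuming n >= p. These BLOCK_* macros are illustrative helpers (not MPI library calls), written so that the last task absorbs the leftover rows:

/* Task id owns rows BLOCK_LOW(id) .. BLOCK_HIGH(id) of the n-row matrix,
   distributed over p tasks; assumes n >= p. */
#define BLOCK_LOW(id, p, n)   ((id) * ((n) / (p)))
#define BLOCK_SIZE(id, p, n)  ((id) == (p) - 1 ? (n) - ((p) - 1) * ((n) / (p)) \
                                               : (n) / (p))
#define BLOCK_HIGH(id, p, n)  (BLOCK_LOW(id, p, n) + BLOCK_SIZE(id, p, n) - 1)
#define BLOCK_OWNER(k, p, n)  ((k) / ((n) / (p)) >= (p) ? (p) - 1 \
                                                        : (k) / ((n) / (p)))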
Remember the observation
The MPI Model…
−All nodes run the same code!! p replica tasks!!
−Distributed Memory Model
−Sometimes they need to do different things
Note that each node calls its own matrix by the same name a[ ][ ] but has only [n/p] rows.

/* “id” is TASK NUMBER; each node has only the part of A that it owns.
   This is approximate code. */
for (k = 0; k < n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j = 0; j < n; j++)
            rowk[j] = a[k_here][j];
    }
    /* rowk is broadcast by the owner and received by the others..
       The MPI code will come here later */
    for (i = 0; i < NUM_ROWS_OF_MY_BLOCK(id); i++)   /* local rows only */
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
}
The MPI model
• Recall MPI tasks are typically created when the jobs are launched – not inside the MPI program (no forking).
−mpirun usually creates the task set
−mpirun –np 2 a.out <args to a.out>
−a.out is run on all nodes and a communication channel is set up between them
• Functions allow for tasks to find out
−Size of the task group
−One’s own position within the group
MPI Notions [ Taking from the example ]
• Communicator – A group of tasks in a program
• Rank – Each task’s ID in the group
−MPI_Comm_rank() … /* use this to set “id” */
• Size – Of the group
−MPI_Comm_size() … /* use to set “p” */
• Notion of send/receive/broadcast…
−MPI_Bcast() … /* use to broadcast rowk[] */
• For actual syntax use a good MPI book or manual
• Online resource: http://www-unix.mcs.anl.gov/mpi/www/
MPI Prologue to our Floyd example

int a[MAX][MAX];
int n = 20;   /* real size of the matrix, can be read in */
int id, p;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
.
.   /* This is where all the real work happens – sketched below */
.
MPI_Finalize();   /* Epilogue */
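A sketch of that “real work”, filling in the broadcast that the earlier approximate code left for later. It assumes the block-row helpers sketched after the “Dividing the work” slide (BLOCK_OWNER, BLOCK_LOW and BLOCK_SIZE are illustrative macros, not library calls) and the variables of the prologue above:

int rowk[MAX];
int i, j, k, owner, k_here;
int local_rows = BLOCK_SIZE(id, p, n);       /* rows owned by this task */

for (k = 0; k < n; k++) {
    owner = BLOCK_OWNER(k, p, n);
    if (id == owner) {                       /* owner copies its local row k */
        k_here = k - BLOCK_LOW(id, p, n);
        for (j = 0; j < n; j++)
            rowk[j] = a[k_here][j];
    }
    /* owner sends row k to everyone; all the others receive it */
    MPI_Bcast(rowk, n, MPI_INT, owner, MPI_COMM_WORLD);

    for (i = 0; i < local_rows; i++)         /* update only the local rows */
        for (j = 0; j < n; j++)
            if (a[i][k] + rowk[j] < a[i][j])
                a[i][j] = a[i][k] + rowk[j];
}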
This is the time to try out several simple MPI programs using the few functions we have seen. - use mpicc - use mpirun
Visualizing the execution
Job is launched; tasks run on CPUs. Multiple tasks may be on the same node; the scheduler ensures 1 task per CPU.
• MPI_Init, MPI_Comm_rank, MPI_Comm_size etc…
• Other initializations, like reading in the array
• For the initial values of k, the task with rank 0 broadcasts row k, the others receive it
• For each value of k they do their computation with the correct row
• Loop above for all values of k
• Task 0 receives all blocks of the final array and prints them out
• MPI_Finalize
Communication vs Computation
• Often communication is needed between iterations to complete the work.
• Often the more the tasks, the more the communication can become.
−In Floyd, bigger “p” indicates that “rowk” will be sent to a larger number of tasks.
−If each iteration depends on more data, it can get very busy.
• This may mean network contention; i.e., delays.
• Try to count the number of “a”s in a string. Time vs p
• This is why, for a fixed problem size, increasing the number of CPUs does not continually increase performance
• This needs experimentation – problem specific
Communication primitives
• MPI_Send(sendbuffer, senddatalength, datatype, destination, tag, communicator);
• MPI_Send(“Hello”, strlen(“Hello”), MPI_CHAR, 2, 100, MPI_COMM_WORLD);
• MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
• Send-Recv happen in pairs – a matching pair is sketched below.
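A minimal complete sketch of such a pair (the buffer size, the tag value 100 and the messages are arbitrary choices; run it with at least 2 tasks):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    char buf[32];
    int rank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "Hello");
        /* send the string, including its terminator, to rank 1 with tag 100 */
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* the matching receive: same tag, source rank 0 */
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
        printf("Rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}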
Collectives
• Broadcast is one-to-all communication
• Both receivers and sender call the same function
• All MUST call it. All end up with the SAME result.
• MPI_Bcast(buffer, count, type, root, comm);
• Examples
−MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
−Task 0 sends its integer k and all others receive it.
−MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
−current_owner_task sends rowk to all others.
Try out a simple MPI program with send-recvs and broadcasts. Try out Floyd’s algorithm. What if you have to read a file to initialize Floyd’s algorithm?
A bit more on Broadcast
[Figure: ranks 0, 1 and 2 start with x = 0, 1 and 2 respectively; each calls MPI_Bcast(&x,1,..,0,..); afterwards every rank has x = 0]
Other useful collectives
• MPI_Reduce(&values, &results, count, type, operator, root, comm);
• MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 9, MPI_COMM_WORLD);
• Task number 9 gets in the variable res the sum of whatever was in x in all of the tasks (including itself).
• Must be called by ALL tasks.
• (A reduction sketch for the count-the-“a”s exercise follows below.)
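A small sketch of the count-the-“a”s exercise from the Communication vs Computation slide, using MPI_Reduce; the input string is hard-coded here purely for illustration:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    const char *s = "a banana and an apple";   /* illustrative input */
    int rank, p, i, len, mycount = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    len = strlen(s);

    /* each rank scans every p-th character, starting at its own rank */
    for (i = rank; i < len; i += p)
        if (s[i] == 'a') mycount++;

    /* rank 0 collects the sum of all partial counts */
    MPI_Reduce(&mycount, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Number of 'a's = %d\n", total);

    MPI_Finalize();
    return 0;
}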
Scattering as opposed to broadcasting
• MPI_Scatterv(sndbuf, sndcount[], send_disp[], type, recvbuf, recvcount, recvtype, root, comm);
• All nodes MUST call it (a sketch with the simpler MPI_Scatter follows below)
[Figure: the root (rank 0) splits its buffer and sends one piece each to ranks 0, 1, 2 and 3]
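A sketch using the fixed-count variant MPI_Scatter (MPI_Scatterv differs only in allowing per-rank counts and displacements); the chunk size of 4 is an arbitrary choice:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4

int main(int argc, char *argv[]) {
    int rank, p, i;
    int *sendbuf = NULL;
    int recvbuf[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0) {                       /* only the root fills the send buffer */
        sendbuf = malloc(p * CHUNK * sizeof(int));
        for (i = 0; i < p * CHUNK; i++) sendbuf[i] = i;
    }

    /* every rank, root included, calls the same scatter */
    MPI_Scatter(sendbuf, CHUNK, MPI_INT, recvbuf, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d got %d..%d\n", rank, recvbuf[0], recvbuf[CHUNK - 1]);

    if (rank == 0) free(sendbuf);
    MPI_Finalize();
    return 0;
}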
Common Communication pitfalls!!
• Make sure that communication primitives are called by the right number of tasks.
• Make sure they are called in the right sequence.
• Make sure that you use the proper tags.
• If not, you can easily get into deadlock (“My program seems to be hung”)
More on work breakdown
• Finding the right work breakdown can be challenging
• Sometimes dynamic work breakdown is good
• Master (usually task 0) decides who will do what and collects the results.
• E.g., you have a huge number of 5x5 matrices to multiply (chained matrix multiplication).
• E.g., search for a substring in a huge collection of strings.
Master-slave dynamic work assignment
[Figure: the master (task 0) hands out work items to slave tasks 1–4 and collects their results]
Master slave example – Reverse strings

Slave() {
    do {
        /* work arrives from the master (rank 0) */
        MPI_Recv(work, MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        n = strlen(work);
        if (n == 0) break;   /* detecting the end */
        reverse(work);
        MPI_Send(work, n + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } while (1);
    MPI_Finalize();
}
Master slave example – Reverse strings

Master() {   /* rank 0 task; this is approximate code */
    initialize_work_items();
    /* seed each slave with one item */
    for (i = 1; i < np; i++)
        MPI_Send(item[i-1], strlen(item[i-1]) + 1, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    /* collect a result, then hand the next item to whichever slave returned it */
    while (items_remain()) {
        MPI_Recv(result, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &stat);
        MPI_Send(next_item(), strlen(next_item()) + 1, MPI_CHAR,
                 stat.MPI_SOURCE, 0, MPI_COMM_WORLD);
    }
    /* finally, a zero-length string tells each slave to stop */
    for (i = 1; i < np; i++)
        MPI_Send("", 1, MPI_CHAR, i, 0, MPI_COMM_WORLD);
}
Master slave example

main() {
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    if (id == 0) Master();
    else Slave();
    ...
}
Matrix Multiply and Communication Patterns
Block Distribution of Matrices
• Matrix Multiply:
−Cij = Σk (Aik * Bkj)
• BMR Algorithm:
•Each task owns a block – its own part of A, B and C
•The old formula holds for blocks!
•Example: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
Each Aij, Bij, Cij is a smaller block – a submatrix
Block Distribution of Matrices
• Matrix Multiply:
−Cij = Σk (Aik * Bkj)
• BMR Algorithm:
C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
•A22 is row-broadcast
•A22*B21 is added into C21
•B_1 is rolled up one slot
•Our task now has B31
Now repeat the above block, except the item to broadcast is A23
Each is a smaller block – a submatrix
Attempt doing this with just SendRecv and Broadcast
Communicators and Topologies
• The BMR example shows limitations of broadcast.. although there is a pattern
• Communicators can be created on subgroups of processes.
• Communicators can be created that have a topology
−Will make programming natural
−Might improve performance by matching to hardware
for (k = 0; k < s; k++) {
    sender = (my_row + k) % s;
    if (sender == my_col) {
        MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
        T = my_A;
    } else {
        MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);
    }
    my_C = my_C + T x my_B;   /* block multiply-accumulate (pseudocode) */
    MPI_Sendrecv_replace(my_B, m*m, MPI_INT, dest, 0, source, 0, col_comm, &status);
}
Creating topologies and communicators
• Creating a grid
• MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
−int dim_sizes[2], istorus[2], canreorder; MPI_Comm grid_comm
• Divide a grid into rows – each with its own communicator
• MPI_Cart_sub(grid_comm, free, &rowcomm);
−MPI_Comm rowcomm; int free[2]
• (A combined sketch follows below)
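A sketch putting the two calls together for a p = s x s grid with one communicator per row; it assumes p is a perfect square, and the variable names follow the slide:

#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int p, grid_id, s, coords[2];
    int dim_sizes[2], istorus[2], free_coords[2];
    MPI_Comm grid_comm, rowcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    s = (int) sqrt((double) p);            /* assume p = s*s; link with -lm */
    dim_sizes[0] = dim_sizes[1] = s;
    istorus[0] = istorus[1] = 1;           /* wrap around in both dimensions */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, 1, &grid_comm);

    /* ranks may be reordered in the new communicator, so query it afresh */
    MPI_Comm_rank(grid_comm, &grid_id);
    MPI_Cart_coords(grid_comm, grid_id, 2, coords);

    /* keep dimension 1 free: tasks sharing coords[0] form one row communicator */
    free_coords[0] = 0;
    free_coords[1] = 1;
    MPI_Cart_sub(grid_comm, free_coords, &rowcomm);

    printf("Grid rank %d is at (%d,%d)\n", grid_id, coords[0], coords[1]);

    MPI_Comm_free(&rowcomm);
    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}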
Try implementing the BMR algorithm with communicators
A brief on other MPI Topics – The last leg
• MPI + Multi-threaded / OpenMP
• One-sided Communication
• MPI and IO
MPI and OpenMP
• Grain
• Communication
• Where does the interesting pragma omp for fit in our MPI Floyd? (One possible placement is sketched below.)
• How do I assign exactly one MPI task per CPU?
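One plausible placement, as a sketch only: keep MPI for the row-k broadcast across tasks and let OpenMP threads share the local update loop. The variable names are those of the earlier Floyd sketch, and the -fopenmp flag name depends on the compiler:

/* inside the k loop of the MPI Floyd sketch shown earlier;
   compile with something like: mpicc -fopenmp floyd.c */
MPI_Bcast(rowk, n, MPI_INT, owner, MPI_COMM_WORLD);

#pragma omp parallel for private(j)          /* threads split the local rows */
for (i = 0; i < local_rows; i++)
    for (j = 0; j < n; j++)
        if (a[i][k] + rowk[j] < a[i][j])
            a[i][j] = a[i][k] + rowk[j];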
One-Sided Communication
• Have no corresponding send-recv pairs!
• RDMA
• Get
• Put (a small Put sketch follows below)
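A minimal sketch of the MPI-2 window/Put style (the buffer names and the value 42 are illustrative; run with at least 2 tasks):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, target_buf = -1, value;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every rank exposes one int through a window */
    MPI_Win_create(&target_buf, (MPI_Aint) sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {              /* rank 0 writes into rank 1's window: no recv on rank 1 */
        value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);        /* the fence completes the Put everywhere */

    if (rank == 1) printf("Rank 1's buffer now holds %d\n", target_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}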
IO in Parallel Programs
• Typically a root task does the IO (see the sketch below).
−Simpler to program
−Natural because of some post processing occasionally needed (sorting)
−All nodes generating IO requests might overwhelm the fileserver, essentially sequentializing it.
• Performance is not the limitation for Lustre/SFS.
• Parallel IO interfaces such as MPI-IO can make use of parallel filesystems such as Lustre.
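For the earlier Floyd question (“what if you have to read a file?”), a sketch of the root-does-IO pattern; the file name and its layout (n followed by n*n integers) are made up for illustration, and a, MAX, n, id are the variables of the Floyd prologue:

/* needs <stdio.h>; only the root touches the file */
if (id == 0) {
    FILE *fp = fopen("distances.txt", "r");   /* hypothetical input file */
    fscanf(fp, "%d", &n);
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            fscanf(fp, "%d", &a[i][j]);
    fclose(fp);
}
/* everyone learns the size, then gets the whole array */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&a[0][0], MAX * MAX, MPI_INT, 0, MPI_COMM_WORLD);
/* (rows could be MPI_Scatter-ed instead, so each task keeps only its block) */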
MPI-BLAST exec time vs other time[4]
How IO/Comm Optimizations help MPI-BLAST[4]
What did we learn?
• Distributed Memory Programming Model
• Parallel Algorithm Basics
• Work Breakdown
• Topologies in Communication
• Communication Overhead vs Computation
• Impact of Parallel IO
What MPI Calls did we see here?
1. MPI_Init
2. MPI_Finalize
3. MPI_Comm_size
4. MPI_Comm_rank
5. MPI_Send
6. MPI_Recv
7. MPI_Sendrecv_replace
8. MPI_Bcast
9. MPI_Reduce
10. MPI_Cart_create
11. MPI_Cart_sub
12. MPI_Scatter
References
1. Parallel Programming in C with MPI and OpenMP, M. J. Quinn, TMH. This is an excellent practical book; it motivated much of the material here, specifically Floyd’s algorithm.
2. The BMR algorithm for matrix multiply and the topology ideas are motivated by http://www.cs.indiana.edu/classes/b673/notes/matrix
3. MPI online manual: http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access for Parallel BLAST, IPDPS’05.