Message Passing Interface
Unit III – BCSCCS702 Parallel Computing & Algorithms
Srinivasa Ramanujan Centre, SASTRA University
June 2009
Introduction
• MPI, the Message Passing Interface, is a standardized and portable message-passing system.
• Standardization of message passing in a distributed-memory environment, i.e., scalable parallel computers (SPC) with distributed memory (DM) and networks of workstations (NOWs).
• The MPI standard defines the syntax and semantics of a core of library routines for Fortran 77 and C.
• The MPI standard was developed through a collaborative effort of about 80 people from 40 organizations in the MPI Forum (representing vendors of parallel systems, industrial users, industrial and national research laboratories, and universities).
• The MPI draft was released in November 1993 at the Supercomputing '93 conference. MPI 1.0 was released in June 1994.
MPI intro
• MPI has been strongly influenced by work at the IBM T. J. Watson Research Center, Intel's NX/2, Express, nCUBE's Vertex, p4, and PARMACS. Other important contributions have come from Zipcode, Chimp, PVM, Chameleon, and PICL.
• The MPI Forum identified some critical shortcomings of existing message passing systems, e.g., in support for complex data layouts, modularity, and safe communication.
• The major goal of MPI, as with most standards, is a degree of portability across different machines.
Supported Platforms
• Runs on:
  – Scalable parallel computers (SPC) with distributed memory
  – Shared-memory parallel computers
  – Networks of workstations (NOWs)
  – A set of processes on a single workstation
• Develop on a single workstation and deploy on an SPC for the production run.
• Runs transparently on heterogeneous systems – supports a virtual computing model that hides many architectural differences.
• An MPI implementation will automatically do any necessary data conversion and utilize the correct communication protocol.
• Portability is central, but performance is not compromised.
• More portable than vendor-specific systems.
Efficient Implementation • An important design goal of MPI was to allow efficient implementations across machines of differing characteristics. – MPI can be easily implemented on systems that buffer messages at the sender / receiver or do no buffering at all.
• Implementations can take advantage of specific features of the communication subsystem of various machines.
  – On machines with intelligent communication coprocessors, much of the message-passing protocol can be offloaded to this coprocessor. On other systems, most of the communication code is executed by the main processor.
Opaque Objects
• By hiding the details of how MPI-specific objects are represented, each implementation is free to do whatever is best under the circumstances.
Minimize work
• Avoid a requirement for large amounts of extra information with each message.
• Avoid a need for complex encoding or decoding of message headers.
• Avoid extra computation or tests in critical routines, since this can degrade performance.
• Avoid the need for extra copying and buffering of data.
  – Data can be moved from user memory directly to the wire and received directly from the wire into receiver memory.
• Encourage the reuse of previous computations through persistent communication requests and caching of attributes on communicators
Overlap computation with communication
• achieved by the use of non-blocking communication calls which separate the initiation of a communication from its completion • hide communication latencies
Scalability • Sub-group of processes • Collective communication • Topology – Cartesian and Graph topology
• Reliability of sending and receiving the message
Goals summarized
• To develop a widely used standard for writing message-passing programs.
• A practical, portable, efficient, and flexible standard for message passing.
• Efficient communication:
  – Avoid memory-to-memory copying; overlap computation with communication.
  – Offload to communication coprocessors when available.
• Support heterogeneous environments.
• C & Fortran 77 bindings – the semantics of the MPI APIs are language independent.
• Reliable communication.
• APIs not too different from PVM, NX, Express, p4, etc.
• MPI APIs are thread-safe.
Supported Architectures
• Runs on distributed-memory multicomputers, shared-memory multiprocessors, networks of workstations, and combinations of all of these.
  – MIMD, MPMD, SPMD
• MPI provides many features intended to improve performance on scalable parallel computers with specialized interprocessor communication hardware.
  – Thus we expect that native high-performance implementations of MPI will be provided on such machines.
• At the same time, implementations of MPI on top of standard Unix interprocessor communication protocols will provide portability to workstation clusters and heterogeneous networks of workstations.
What is included in MPI
• Point-to-point communication
• Collective operations
• Process groups
• Communication domains
• Process topologies
• Environmental management and inquiry
• Profiling interface
• Bindings for Fortran and C languages
P2P communication
• The transmittal of data between a pair of processes: one side sending, the other receiving.
• Almost all the constructs of MPI are built around the point-to-point operations.
P2P arguments
• MPI procedures are specified using a language-independent notation.
• The arguments of procedure calls are marked as IN, OUT or INOUT.
  – IN: the call uses but does not update an argument marked IN.
  – OUT: the call may update an argument marked OUT.
  – INOUT: the call both uses and updates an argument marked INOUT.
MPI_Send() args
MPI_Recv() args
Message Type
• MPI provides a set of send and receive functions that allow the communication of typed data with an associated tag.
  – Typing of the message contents is necessary for heterogeneous support: the type information is needed so that correct data representation conversions can be performed as data is sent from one architecture to another.
Message Tag
• The tag allows selectivity of messages at the receiving end: one can receive on a particular tag, or one can wildcard this quantity, allowing reception of messages with any tag.
• An integer value used by the application to distinguish messages.
• Range 0..UB, where UB is implementation dependent; its value is typically no less than 32767.
• UB can be found by querying the value of the attribute MPI_TAG_UB.
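A small sketch (illustrative, not from the slides) of querying the tag upper bound through the MPI_TAG_UB attribute, assuming an MPI-2 style implementation (older codes use MPI_Attr_get):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int *tag_ub;   /* MPI returns a pointer to the attribute value */
    int flag;

    MPI_Init(&argc, &argv);

    /* Query the predefined communicator attribute MPI_TAG_UB */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
    if (flag)
        printf("Largest usable tag value: %d\n", *tag_ub);

    MPI_Finalize();
    return 0;
}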
Communication domain • a communicator serves to define a set of processes that can be contacted. • Each such process is labeled by a process rank. • Process ranks are integers and are discovered by inquiry to a communicator. • MPI_COMM_WORLD is a default communicator provided upon startup that defines an initial communication domain for all the processes that participate in the computation
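As a minimal illustration (not part of the original slide), each process can discover its own rank and the size of the initial communication domain as follows:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* set up the MPI environment         */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's label in the domain */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of participating processes  */

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}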
p0 sending a message to p1
MPI_Send() args – blocking send
MPI_Send (msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
• Arg 1: the outgoing data is to be taken from msg.
• Arg 2: the message consists of strlen(msg)+1 entries (+1 for the terminating '\0').
• Arg 3: each entry is of type MPI_CHAR.
• Arg 4: specifies the message destination, which is process 1.
• Arg 5: specifies the message tag.
• Arg 6: the communicator that specifies a communication domain for this communication.
MPI_Recv() args – blocking receive
MPI_Recv (msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
• Arg 1: place the incoming data in msg.
• Arg 2: the maximum size of the incoming data is 20 entries.
• Arg 3: each entry is of type MPI_CHAR.
• Arg 4: receive the message from process 0.
• Arg 5: specifies the tag of the message to retrieve.
• Arg 6: the communicator that specifies a communication domain for this communication.
• Arg 7: the status variable, set by MPI_Recv().
MPI_Send()/MPI_Recv() – blocking send/receive
• The send call blocks until the send buffer can be reclaimed, i.e., until the sending process can safely overwrite the contents of msg.
• The receive function blocks until the receive buffer actually contains the contents of the message.
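Putting the two calls together, a minimal complete sketch of the exchange described above (the message text and tag value are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char msg[20];
    int rank, tag = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                          /* p0: send the greeting to p1 */
        strcpy(msg, "Hello, there");
        MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {                   /* p1: receive it from p0 */
        MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        printf("Process 1 received: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}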
Modes of p2p communication
• The mode allows one to choose the semantics of the send operation and, in effect, to influence the underlying protocol of the transfer of data.
• Both blocking and non-blocking communications have modes.
• 4 modes:
  – Standard mode
  – Buffered mode
  – Synchronous mode
  – Ready mode
4 modes of p2p communication
– Standard mode
  • The completion of the send does not necessarily mean that the matching receive has started, and no assumption should be made in the application program about whether the outgoing data is buffered by MPI.
– Buffered mode
  • The user can guarantee that a certain amount of buffering space is available. The catch is that the space must be explicitly provided by the application program.
– Synchronous mode
  • A rendezvous (ron'de-voo – a meeting place fixed beforehand) semantics between sender and receiver is used.
– Ready mode
  • This allows the user to exploit extra knowledge to simplify the protocol and potentially achieve higher performance. In a ready-mode send the user asserts that the matching receive has already been posted.
P2P Blocking Message Passing Routines
• MPI_Send (&buf,count,datatype,dest,tag,comm)
  – Returns only after the application buffer in the sending task is free for reuse.
• MPI_Recv (&buf,count,datatype,source,tag,comm,&status)
  – Receives a message and blocks until the requested data is available in the application buffer in the receiving task.
• MPI_Ssend (&buf,count,datatype,dest,tag,comm)
  – Synchronous blocking send: sends a message and blocks until the application buffer in the sending task is free for reuse and the destination process has started to receive the message.
• MPI_Bsend (&buf,count,datatype,dest,tag,comm)
  – Buffered blocking send: permits the programmer to allocate the required amount of buffer space into which data can be copied until it is delivered.
• MPI_Rsend (&buf,count,datatype,dest,tag,comm)
  – Ready blocking send: should only be used if the programmer is certain that the matching receive has already been posted.
• MPI_Sendrecv (&sendbuf,sendcount,sendtype,dest,sendtag, &recvbuf,recvcount,recvtype,source,recvtag, comm,&status)
  – Sends a message and posts a receive before blocking. Blocks until the sending application buffer is free for reuse and until the receiving application buffer contains the received message.
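As one illustration of MPI_Sendrecv (a sketch; the ring-shift pattern is an assumption, not from the slide), each process passes its rank to the right-hand neighbour while receiving from the left, avoiding the deadlock risk of two plain blocking sends:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, right, left, recvval;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;                /* neighbour to send to      */
    left  = (rank - 1 + size) % size;         /* neighbour to receive from */

    /* Both the send and the receive are posted by one call before blocking */
    MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, &status);

    printf("Process %d received %d from process %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}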
User defined buffering used along with MPI_Bsend() • MPI_Buffer_attach (&buffer,size) MPI_Buffer_detach (&buffer,size) – Used by programmer to allocate/deallocate message buffer space to be used by the MPI_Bsend routine. – The size argument is specified in actual data bytes - not a count of data elements. – Only one buffer can be attached to a process at a time.
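A sketch of buffered-mode usage (the one-integer message and the buffer sizing are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, data = 42, bufsize;
    void *buffer;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Space for one int plus the per-message overhead MPI requires */
        bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        buffer = malloc(bufsize);
        MPI_Buffer_attach(buffer, bufsize);

        MPI_Bsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Detach blocks until all buffered messages have been transmitted */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Received %d via buffered send\n", data);
    }

    MPI_Finalize();
    return 0;
}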
P2P Non-Blocking Message Passing Routines
• MPI_Isend (&buf,count,datatype,dest,tag,comm,&request)
• MPI_Irecv (&buf,count,datatype,source,tag,comm,&request)
  – Identifies an area in memory to serve as a send/receive buffer.
  – Processing continues immediately without waiting for the message to be copied out of / into the application buffer.
  – A communication request handle is returned for handling the pending message status.
  – The program should not use the application buffer until subsequent calls to MPI_Wait or MPI_Test indicate that the non-blocking send/receive has completed.
• MPI_Issend (&buf,count,datatype,dest,tag,comm,&request)
  – Non-blocking synchronous send.
• MPI_Ibsend (&buf,count,datatype,dest,tag,comm,&request)
  – Non-blocking buffered send.
• MPI_Irsend (&buf,count,datatype,dest,tag,comm,&request)
  – Non-blocking ready send.
MPI_Wait()
• MPI_Wait (&request,&status)
• MPI_Waitany (count,&array_of_requests,&index,&status)
• MPI_Waitall (count,&array_of_requests,&array_of_statuses)
• MPI_Waitsome (incount,&array_of_requests,&outcount, &array_of_offsets, &array_of_statuses)
  – Blocks until a specified non-blocking send or receive operation has completed.
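A non-blocking exchange sketch that pairs MPI_Irecv/MPI_Isend with MPI_Waitall (the ring pattern and the overlapped-computation placeholder are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, right, left, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;
    left  = (rank - 1 + size) % size;
    sendval = rank;

    MPI_Irecv(&recvval, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation could overlap with the communication here ... */

    MPI_Waitall(2, reqs, stats);    /* buffers are safe to reuse only after this */

    printf("Process %d got %d from process %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}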
MPI_Test()
• MPI_Test (&request,&flag,&status)
• MPI_Testany (count,&array_of_requests,&index,&flag,&status)
• MPI_Testall (count,&array_of_requests,&flag, &array_of_statuses)
• MPI_Testsome (incount,&array_of_requests,&outcount, &array_of_offsets, &array_of_statuses)
  – Checks the status of a specified non-blocking send or receive operation.
  – The "flag" parameter is returned logical true (1) if the operation has completed, and logical false (0) if not.
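A sketch of polling with MPI_Test so that other work can continue while a receive is outstanding (the counter standing in for useful work is an illustrative assumption):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 7, flag = 0;
    long busywork = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        while (!flag) {
            busywork++;                       /* stand-in for useful computation */
            MPI_Test(&request, &flag, &status);
        }
        printf("Received %d after %ld polling iterations\n", value, busywork);
    }

    MPI_Finalize();
    return 0;
}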
MPI_Probe() • MPI_Probe (source,tag,comm,&status) – Performs a blocking test for a message. – The "wildcards" MPI_ANY_SOURCE and MPI_ANY_TAG may be used to test for a message from any source or with any tag.
• MPI_Iprobe (source,tag,comm,&flag,&status) – Performs a non-blocking test for a message. – The integer "flag" parameter is returned logical true (1) if a message has arrived, and logical false (0) if not.
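MPI_Probe is commonly combined with MPI_Get_count to size the receive buffer before the actual receive; a sketch (the five-integer message is an illustrative assumption):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, count;
    int *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data[5] = {1, 2, 3, 4, 5};        /* only the sender knows the length */
        MPI_Send(data, 5, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Block until a matching message is pending, without receiving it yet */
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);

        buf = malloc(count * sizeof(int));
        MPI_Recv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Received %d integers\n", count);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}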
Collective Communications
• All-or-none property: collective communication must involve all processes in the scope of a communicator.
• All processes are, by default, members of the communicator MPI_COMM_WORLD.
• It is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operation.
Types of Collective Operations • Synchronization - processes wait until all members of the group have reached the synchronization point. • Data Movement - broadcast, scatter/gather, all to all. • Collective Computation (reductions) - one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
Programming Considerations and Restrictions
• Collective operations are blocking.
• Collective communication routines do not take message tag arguments.
• Collective operations within subsets of processes are accomplished by first partitioning the processes into new groups and then attaching the new groups to new communicators (see the sketch after this list).
• Collective computations (reductions) with the predefined operations can only be used with MPI predefined datatypes - not with MPI derived datatypes.
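For collectives over a subset of processes, the usual pattern is to split MPI_COMM_WORLD into new communicators, as sketched below with MPI_Comm_split (the even/odd grouping is an illustrative assumption):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, color, subrank, subsize;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    color = rank % 2;                         /* 0 = even ranks, 1 = odd ranks */
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

    MPI_Comm_rank(subcomm, &subrank);
    MPI_Comm_size(subcomm, &subsize);

    /* Collective operations now involve only the processes in subcomm */
    MPI_Barrier(subcomm);
    printf("World rank %d is rank %d of %d in group %d\n",
           rank, subrank, subsize, color);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}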
Collective Communication Routines
• MPI_Barrier (comm)
  – Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call.
• MPI_Bcast (&buffer,count,datatype,root,comm)
  – Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
• MPI_Scatter (&sendbuf,sendcnt,sendtype,&recvbuf, recvcnt,recvtype,root,comm)
  – Distributes distinct messages from a single source task to each task in the group.
• MPI_Gather (&sendbuf,sendcnt,sendtype,&recvbuf, recvcount,recvtype,root,comm)
  – Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
• MPI_Reduce (&sendbuf,&recvbuf,count,datatype,op,root,comm)
  – Applies a reduction operation on all tasks in the group and places the result in one task.
  – Predefined operations: MIN, MAX, SUM, PROD, LAND, BAND, LOR, BOR, LXOR, BXOR, MAXLOC, MINLOC.
• MPI_Allgather (&sendbuf,sendcount,sendtype,&recvbuf, recvcount,recvtype,comm)
  – Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
• MPI_Allreduce (&sendbuf,&recvbuf,count,datatype,op,comm)
  – Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
• MPI_Reduce_scatter (&sendbuf,&recvbuf,recvcount,datatype, op,comm)
  – First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.
• MPI_Alltoall (&sendbuf,sendcount,sendtype,&recvbuf, recvcnt,recvtype,comm)
  – Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.
• MPI_Scan (&sendbuf,&recvbuf,count,datatype,op,comm)
  – Performs a scan operation with respect to a reduction operation across a task group.
C Language - Collective Communications

#include "mpi.h"
#include <stdio.h>
#define SIZE 4

int main(int argc, char *argv[])
{
   int numtasks, rank, sendcount, recvcount, source;
   float sendbuf[SIZE][SIZE] = {
      { 1.0,  2.0,  3.0,  4.0},
      { 5.0,  6.0,  7.0,  8.0},
      { 9.0, 10.0, 11.0, 12.0},
      {13.0, 14.0, 15.0, 16.0} };
   float recvbuf[SIZE];

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

   if (numtasks == SIZE) {
      source = 1;
      sendcount = SIZE;
      recvcount = SIZE;
      /* Row i of sendbuf on the source task ends up in recvbuf of task i */
      MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf, recvcount,
                  MPI_FLOAT, source, MPI_COMM_WORLD);
      printf("rank= %d Results: %f %f %f %f\n", rank,
             recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
   }
   else
      printf("Must specify %d processors. Terminating.\n", SIZE);

   MPI_Finalize();
   return 0;
}
Sample program output rank= 0 Results: 1.000000 2.000000 3.000000 4.000000 rank= 1 Results: 5.000000 6.000000 7.000000 8.000000 rank= 2 Results: 9.000000 10.000000 11.000000 12.000000 rank= 3 Results: 13.000000 14.000000 15.000000 16.000000
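As a companion to the scatter example, a short sketch of a collective computation with MPI_Reduce, summing the ranks at root 0 (illustrative, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every task contributes its rank; the sum appears only at root 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of ranks 0..%d = %d\n", size - 1, sum);

    MPI_Finalize();
    return 0;
}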
Process Topologies
• A logical process arrangement is a virtual topology.
• Virtual topology vs. physical topology.
• Used for efficient mapping of processes to physical processors.
• A virtual topology describes a mapping/ordering of MPI processes into a geometric "shape".
• The two main types of topologies supported by MPI are Cartesian (grid) and graph.
• MPI topologies are virtual - there may be no relation between the physical structure of the parallel machine and the process topology.
• Virtual topologies are built upon MPI communicators and groups.
• Must be "programmed" by the application developer.
Why we need Topologies
• Some parallel algorithms / data structures suit a specific topology, e.g., a mesh for matrices, a tree for reduction and broadcast, etc.
• Virtual topologies may be useful for applications with specific communication patterns - patterns that match an MPI topology structure.
• A particular implementation may optimize process mapping based upon the physical characteristics of a given parallel machine.
• Provide a convenient naming mechanism for the processes of a group.
• In many parallel applications a linear ranking of processes does not adequately reflect the logical communication pattern of the processes,
  – which is usually determined by the underlying problem geometry and the numerical algorithm used.
Graph Vs Cart
• Graph topology is sufficient for all applications.
• In many applications, however, the graph structure is regular.
• The detailed set-up of a general graph would be inconvenient for the user and might be less efficient at run time.
• A large fraction of all parallel applications use process topologies such as rings, two- or higher-dimensional grids, or tori.
• The mapping of grids and tori is generally an easier problem than that of general graphs.
  – It is completely defined by the number of dimensions and the number of processes in each coordinate direction.
Cart topology
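A sketch of creating a two-dimensional Cartesian topology and locating neighbours (the 2 x 2 grid, the non-periodic boundaries, and reorder = 1 are illustrative assumptions; it assumes exactly 4 processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, coords[2], up, down, left, right;
    int dims[2]    = {2, 2};                  /* 2 x 2 process grid        */
    int periods[2] = {0, 0};                  /* non-periodic in both dims */
    MPI_Comm cartcomm;

    MPI_Init(&argc, &argv);

    /* Map the processes of MPI_COMM_WORLD onto the Cartesian grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cartcomm);

    MPI_Comm_rank(cartcomm, &rank);
    MPI_Cart_coords(cartcomm, rank, 2, coords);

    /* Neighbour ranks along each dimension (MPI_PROC_NULL at the edges) */
    MPI_Cart_shift(cartcomm, 0, 1, &up, &down);
    MPI_Cart_shift(cartcomm, 1, 1, &left, &right);

    printf("Rank %d at (%d,%d): up=%d down=%d left=%d right=%d\n",
           rank, coords[0], coords[1], up, down, left, right);

    MPI_Comm_free(&cartcomm);
    MPI_Finalize();
    return 0;
}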
User Defined Data types and Packing
Communicators
Process Topologies
Environmental Management
The MPI Profiling Interface
Conclusions
Reference
• MPI: The Complete Reference, Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra.