MPI is a programming model for parallel and distributed computing. It transfers data between processing units without requiring detailed manual setup. For example, with 4 machines each having 16 CPU cores, there are 64 cores in total. MPI can utilize all 64 cores uniformly without explicit configuration.
Note that there is a significant performance difference depending on where the data transfer occurs. Transfers between cores on the same machine are relatively fast. Transfers between cores on different machines are relatively slow.
Following functions can be used for MPI programming. This post will be updated later with more information.
Send
1 | |
Sends the message in the smessage buffer with a count of count elements.
The data type is specified by datatype.
The rank of the destination processor is given by dest.
An optional tag can be used to identify and validate messages.
The communicator is passed via comm.
Receive
1 | |
Receives a message into the rmessage buffer with a count of count elements.
The data type is specified by datatype.
The rank of the source processor is given by source.
An optional tag can be used to filter and validate messages.
The communicator is passed via comm.
The status of the received message is returned in status.
Tag
You may use MPI_ANY_SOURCE or MPI_ANY_TAG to receive a message from any source or with any tag.
Deadlock
MPI can cause deadlocks. If every processor waits for another, execution halts indefinitely.
MPI Send Variants
MPI supports several variants of MPI_Send.
In fact, the standard MPI_Send does not guarantee whether it is synchronous or buffered.
The following variants give explicit control:
MPI_Bsend(), MPI_Rsend(), MPI_Ssend(), MPI_Isend().
- MPI_Bsend() : (Buffered mode) Copy data to the buffer and return.
- MPI_Rsend() : (Ready) Wait untill opponent’s recv() called then return.
- MPI_Ssend() : (Synchronous) Wait untill opponent’s recv() finished then return.
- MPI_Send() : (Base) Nothing guaranteed, depending on situation
- MPI_Isend() : (Asynchronous) Non-blocking send. Making a sender thread and make a ticket to check it.
Asynchronous send
1 | |
It’s almost the same with MPI_Send() but it uses request as a ticket to check whether it has been done or not.
Asynchronous receive
1 | |
Like send operation, there is a asynchronous recv operation either. It’s almost the same with MPI_Recv() but it uses request as a ticket to check whether it has been done or not.
Wait
1 | |
Now, there is a wait function to wait for asynchronous operation ends. It uses ticket that we receives from asynchronous operations.
Wait all
1 | |
It has a generalized version of MPI_Wait. It supports to check multiple number of tickets at once.
Test
1 | |
A non-blocking version of MPI_Wait.
It returns immediately and sets flag to indicate whether the operation has completed.
Testall
1 | |
A generalized version of MPI_Test for multiple requests.
Note that it uses a single flag, so it succeeds only if all requests have completed.
Example
MPI example is like follow.
1 | |
MPI provides built-in support for common collective communication patterns.
Broadcast
1 | |
| Worker 1 | Worker 2 | Worker 3 | Worker 4 | |
|---|---|---|---|---|
| Before | A | |||
| After | A | A | A | A |
Broadcast is the simplest way to send data to all processors.
It sends the message at message to every other processor.
It can be thought of as a synchronization point for all processors.
Reduce
1 | |
It’s a reduce function for all operations. It’s like accumulating datas with op. For example if op is MPI_SUM then, it will sum all the data in sendbuf. Followings are possible op types.
- MPI_MAX : max operation
- MPI_MIN : min operation
- MPI_MAXLOC : max operation with indexing
- MPI_MINLOC : min operation with indexing
- MPI_SUM : sum operation
- MPI_PROD : product operation
- MPI_LAND : logical and operation
- MPI_BAND : bit-wise and operation
- MPI_LOR : logical or operation
- MPI_BOR : bit-wise or operation
- MPI_LXOR : logical xor operation
- MPI_BXOR : bit-wise xor operation
| Worker 1 | Worker 2 | Worker 3 | Worker 4 | |
|---|---|---|---|---|
| Before | A | B | C | D |
| After | A op B op C op D | B | C | D |
Scatter
1 | |
Similar to broadcast, but instead of sending the same data to all processors, it splits the data and distributes each chunk to one processor.
| Worker 1 | Worker 2 | Worker 3 | Worker 4 | |
|---|---|---|---|---|
| Before | A,B,C,D | |||
| After | A | B | C | D |
Gather
1 | |
Similar to reduce, but it simply collects data from all processors without applying any operation.
| Worker 1 | Worker 2 | Worker 3 | Worker 4 | |
|---|---|---|---|---|
| Before | A | B | C | D |
| After | A,B,C,D |
All Gather
1 | |
Equivalent to a Gather followed by a Scatter: collects all data and distributes the full dataset to every processor.
| Worker 1 | Worker 2 | Worker 3 | Worker 4 | |
|---|---|---|---|---|
| Before | A | B | C | D |
| After | A,B,C,D | A,B,C,D | A,B,C,D | A,B,C,D |
Reduce Scatter
Equivalent to performing a reduce and a scatter simultaneously. Each processor contributes data, and each receives the result of the reduction for its portion.
1 | |
| Worker 1 | Worker 2 | Worker 3 | Worker 4 | |
|---|---|---|---|---|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After | $A_1$ op $A_2$ op $A_3$ op $A_4$ | $B_1$ op $B_2$ op $B_3$ op $B_4$ | $C_1$ op $C_2$ op $C_3$ op $C_4$ | $D_1$ op $D_2$ op $D_3$ op $D_4$ |
All Reduce
Equivalent to performing a reduce and broadcasting the result to all processors simultaneously.
1 | |
| Worker 1 | Worker 2 | Worker 3 | Worker 4 | |
|---|---|---|---|---|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $B_2$ op $B_3$ op $B_4$, $C_1$ op $C_2$ op $C_3$ op $C_4$, $D_1$ op $D_2$ op $D_3$ op $D_4$ | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $B_2$ op $B_3$ op $B_4$, $C_1$ op $C_2$ op $C_3$ op $C_4$, $D_1$ op $D_2$ op $D_3$ op $D_4$ | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $B_2$ op $B_3$ op $B_4$, $C_1$ op $C_2$ op $C_3$ op $C_4$, $D_1$ op $D_2$ op $D_3$ op $D_4$ | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $B_2$ op $B_3$ op $B_4$, $C_1$ op $C_2$ op $C_3$ op $C_4$, $D_1$ op $D_2$ op $D_3$ op $D_4$ |
All to All
Equivalent to transposing the data matrix across processors.
1 | |
| Worker 1 | Worker 2 | Worker 3 | Worker 4 | |
|---|---|---|---|---|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After | $A_1,A_2,A_3,A_4$ | $B_1,B_2,B_3,B_4$ | $C_1,C_2,C_3,C_4$ | $D_1,D_2,D_3,D_4$ |
Other extension for functions
Note that all functions above have asynchronous variants (prefix I, e.g., Igather) and variable-size variants (suffix v, e.g., Alltoallv).
Variable-size variants allow each processor to send or receive different amounts of data.
Communicator Groups
By default, all processors participate in collective operations via MPI_COMM_WORLD.
However, you can create sub-groups or custom communicators in code.
An example is shown below.
1 | |