MPI is a programming model for parallel programming. It transfers data between processing units without detailed set-up. For example, suppose there are 4 machines, each with a 16-core CPU, giving 64 cores in total. MPI then uses all 64 cores uniformly, without any special consideration of where each core sits.
Notice, however, that the actual cost of a data transfer varies a lot. Transfers between cores in the same machine are relatively fast, while transfers between cores on different machines are relatively slow.
The following functions can be used for MPI programming. This post will be updated later with more information.
Send
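The C prototype of MPI_Send looks like this:

```c
int MPI_Send(const void *smessage, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
```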
It sends a message stored in the smessage buffer, consisting of count elements. You specify the type of the data with datatype. You must set dest to the rank of the target processor. If you want, you may use tag to label the message so the receiver can match it. You must also pass the communicator in comm.
Receive
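The C prototype of MPI_Recv looks like this:

```c
int MPI_Recv(void *rmessage, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);
```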
It receives a message into the rmessage buffer, up to count elements. You specify the type of the data with datatype. You must set source to the rank of the sending processor. If you want, you may use tag to accept only messages with a matching tag. You must also pass the communicator in comm, and you receive information about the message (such as its actual source and tag) in status.
tag
You may use MPI_ANY_TAG as the tag to receive a message with any tag (and, similarly, MPI_ANY_SOURCE as the source to receive from any sender).
Deadlock
MPI can also cause deadlock. If every processor waits for another processor, the program simply gets stuck at that moment.
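For example, here is a minimal sketch of the classic pattern (the buffer names and sizes are assumptions for illustration): both ranks call MPI_Send before MPI_Recv, so if the sends block until the matching receives are posted, neither rank ever reaches its receive.

```c
#include <mpi.h>

/* Possible deadlock: both ranks send first. If MPI_Send blocks until the
 * matching receive is posted (common for large messages), neither rank
 * ever reaches MPI_Recv. */
void exchange_may_deadlock(int rank, double *buf_out, double *buf_in, int n)
{
    int peer = (rank == 0) ? 1 : 0;
    MPI_Send(buf_out, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(buf_in, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

Swapping the send/recv order on one rank, or using the non-blocking MPI_Isend/MPI_Irecv, avoids this pattern.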
MPI_Send variants
MPI also supports asynchronous versions of MPI_Send. In fact, the standard MPI_Send does not guarantee whether it behaves synchronously or asynchronously. To be specific, there are several variants of Send (and Recv), and we choose the behavior by choosing the function itself: MPI_Bsend(), MPI_Rsend(), MPI_Ssend(), MPI_Isend().
- MPI_Bsend() : (Buffered mode) Copies the data into a user-provided buffer and returns.
- MPI_Rsend() : (Ready mode) Assumes the matching receive has already been posted; calling it otherwise is erroneous.
- MPI_Ssend() : (Synchronous mode) Returns only after the matching receive has started.
- MPI_Send() : (Standard mode) Nothing is guaranteed; it may buffer or block depending on the implementation and message size.
- MPI_Isend() : (Immediate, non-blocking) Returns right away and hands back a request, a "ticket" you can check later for completion.
Asynchronous send
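The C prototype of MPI_Isend looks like this:

```c
int MPI_Isend(const void *smessage, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request);
```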
It is almost the same as MPI_Send(), but it returns immediately and fills in request, a ticket for checking later whether the send has completed.
Asynchronous receive
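The C prototype of MPI_Irecv looks like this:

```c
int MPI_Irecv(void *rmessage, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Request *request);
```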
Like the send operation, there is also an asynchronous recv operation. It is almost the same as MPI_Recv(), but it fills in request as a ticket for checking later whether the receive has completed.
Wait
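The C prototype of MPI_Wait looks like this:

```c
int MPI_Wait(MPI_Request *request, MPI_Status *status);
```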
There is a wait function that blocks until an asynchronous operation has finished. It takes the ticket we received from the asynchronous call.
Wait all
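The C prototype of MPI_Waitall looks like this:

```c
int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[]);
```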
This is a generalized version of MPI_Wait: it waits on multiple tickets at once.
Test
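The C prototype of MPI_Test looks like this:

```c
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
```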
This is a non-blocking counterpart of the wait operation. It returns immediately and just tells you, through flag, whether the operation has finished or not.
Testall
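The C prototype of MPI_Testall looks like this:

```c
int MPI_Testall(int count, MPI_Request requests[], int *flag, MPI_Status statuses[]);
```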
Test also has a generalized version. Note that it has only a single flag, so it reports in an all-or-nothing manner: flag is set only when every listed operation has finished.
Example
An MPI example looks like the following.
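Here is a minimal sketch (the value 42 and the two-rank setup are arbitrary choices for illustration): rank 0 sends one integer to rank 1.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my id within the communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0) {
        value = 42;                         /* rank 0 sends a value to rank 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Compile it with mpicc and run it with mpirun (e.g., mpirun -np 2 ./a.out).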
There are common patterns for moving data between processors, so MPI supports several of them directly as collective operations.
Broadcast
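The C prototype of MPI_Bcast looks like this:

```c
int MPI_Bcast(void *message, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);
```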
| | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|---|---|---|---|---|
| Before | A | | | |
| After | A | A | A | A |
Broadcast is the easiest way to send data to all processors. The root sends the contents of message to every other processor. You can think of it as a synchronization step for the whole group of processors.
Reduce
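The C prototype of MPI_Reduce looks like this:

```c
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm);
```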
It is a reduce across all processors: their data is accumulated with op. For example, if op is MPI_SUM, it sums the data from every processor's sendbuf. The possible op types are:
- MPI_MAX : max operation
- MPI_MIN : min operation
- MPI_MAXLOC : max operation that also returns the location (index) of the maximum
- MPI_MINLOC : min operation that also returns the location (index) of the minimum
- MPI_SUM : sum operation
- MPI_PROD : product operation
- MPI_LAND : logical and operation
- MPI_BAND : bit-wise and operation
- MPI_LOR : logical or operation
- MPI_BOR : bit-wise or operation
- MPI_LXOR : logical xor operation
- MPI_BXOR : bit-wise xor operation
| | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|---|---|---|---|---|
| Before | A | B | C | D |
| After | A op B op C op D | B | C | D |
Scatter
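The C prototype of MPI_Scatter looks like this:

```c
int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int recvcount, MPI_Datatype recvtype,
                int root, MPI_Comm comm);
```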
It is similar to broadcast, but instead of sending the data as it is, the root splits its data and sends one piece to each processor.
| | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|---|---|---|---|---|
| Before | A,B,C,D | | | |
| After | A | B | C | D |
Gather
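The C prototype of MPI_Gather looks like this:

```c
int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm);
```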
It is similar to reduce, but it applies no operation: it just collects the data from all processors at the root.
| | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|---|---|---|---|---|
| Before | A | B | C | D |
| After | A,B,C,D | | | |
All Gather
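The C prototype of MPI_Allgather looks like this:

```c
int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm);
```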
It is equivalent to a Gather followed by a Broadcast: it collects every processor's data and then distributes the full collection to all of them.
| | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|---|---|---|---|---|
| Before | A | B | C | D |
| After | A,B,C,D | A,B,C,D | A,B,C,D | A,B,C,D |
Reduce Scatter
It is equivalent to doing a reduce and a scatter at the same time: it reduces element-wise across processors and scatters the results, one element per processor.
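The C prototype of MPI_Reduce_scatter looks like this:

```c
int MPI_Reduce_scatter(const void *sendbuf, void *recvbuf, const int recvcounts[],
                       MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
```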
| | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|---|---|---|---|---|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After | $A_1$ op $A_2$ op $A_3$ op $A_4$ | $B_1$ op $B_2$ op $B_3$ op $B_4$ | $C_1$ op $C_2$ op $C_3$ op $C_4$ | $D_1$ op $D_2$ op $D_3$ op $D_4$ |
All Reduce
It is equivalent to doing a reduce and a broadcast at the same time: it reduces element-wise across processors and broadcasts the full result to everyone.
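The C prototype of MPI_Allreduce looks like this:

```c
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
```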
| | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|---|---|---|---|---|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $B_2$ op $B_3$ op $B_4$, $C_1$ op $C_2$ op $C_3$ op $C_4$, $D_1$ op $D_2$ op $D_3$ op $D_4$ | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $B_2$ op $B_3$ op $B_4$, $C_1$ op $C_2$ op $C_3$ op $C_4$, $D_1$ op $D_2$ op $D_3$ op $D_4$ | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $B_2$ op $B_3$ op $B_4$, $C_1$ op $C_2$ op $C_3$ op $C_4$, $D_1$ op $D_2$ op $D_3$ op $D_4$ | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $B_2$ op $B_3$ op $B_4$, $C_1$ op $C_2$ op $C_3$ op $C_4$, $D_1$ op $D_2$ op $D_3$ op $D_4$ |
All to All
It is something like transposing the data matrix across processors.
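The C prototype of MPI_Alltoall looks like this:

```c
int MPI_Alltoall(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm);
```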
| | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|---|---|---|---|---|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After | $A_1,A_2,A_3,A_4$ | $B_1,B_2,B_3,B_4$ | $C_1,C_2,C_3,C_4$ | $D_1,D_2,D_3,D_4$ |
Other extensions of these functions
Notice that every function above has an asynchronous version and a variable-size version, obtained by adding I at the beginning of the function name or v at the end of it, respectively. For example, Alltoallv is the variable-size version of All to All, and Igather is the asynchronous version of Gather. Variable-size means each processor's piece of data may have a different size.
Communicator
We assumed that all processors join these operations. However, this can be changed in code: you can create a subgroup, or a more complex grouping of processors, in MPI. An example is below, and more detail will be added later.
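Here is a minimal sketch (the even/odd split is an arbitrary example): MPI_Comm_split divides MPI_COMM_WORLD into two sub-communicators based on rank parity, and a broadcast then runs only within each subgroup.

```c
#include <mpi.h>

void split_and_broadcast(void)
{
    int world_rank, value = 0;
    MPI_Comm subcomm;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* color = rank % 2 puts even ranks in one communicator and odd ranks in
     * another; key = world_rank keeps the original ordering inside each. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);

    /* Collective operations now stay inside the subgroup: root here is
     * rank 0 of each sub-communicator, not of MPI_COMM_WORLD. */
    MPI_Bcast(&value, 1, MPI_INT, 0, subcomm);

    MPI_Comm_free(&subcomm);
}
```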