MPI

MPI is a programming model for parallel programming. It transfers data between any pair of processes without detailed set-up. For example, assume there are 4 machines, each with a 16-core CPU, for 64 cores in total. MPI then treats all 64 cores uniformly as ranks, without distinguishing where each one lives.

Notice, however, that the actual cost of a transfer varies a lot. A transfer between cores in the same machine is relatively fast, while a transfer between cores on different machines is relatively slow.

The following functions can be used for MPI programming. This post will be updated later with more information.

Send

int MPI_Send(void *smessage, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

It sends the message in the smessage buffer, consisting of count elements. You specify the element type with datatype. You must put the rank of the target process in dest. If you want, you can use tag as a label to tell messages apart on the receiving side. Finally, you must pass the communicator in comm.

Receive

int MPI_Recv(void *rmessage, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

It receives a message into the rmessage buffer, up to count elements. You specify the element type with datatype. You must put the rank of the sending process in source, and you can use tag to accept only matching messages. You must pass the communicator in comm, and the details of the received message (actual source, tag, and size) come back through status.

tag

You may use MPI_ANY_TAG as the tag to receive a message with any tag. Similarly, MPI_ANY_SOURCE can be used as the source to receive a message from any sender.
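
A minimal sketch of a wildcard receive, assuming a rank that may hear from several senders (the buffer name and size here are made up for illustration); the status tells you who actually sent the message and with which tag:

// Accept a message from any sender with any tag,
// then inspect the status to find out where it came from.
MPI_Status status;
char buf[64];  // illustrative buffer size
MPI_Recv(buf, 64, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

int actual_source = status.MPI_SOURCE;  // rank that sent the message
int actual_tag    = status.MPI_TAG;     // tag it was sent with
int received;                           // number of elements actually received
MPI_Get_count(&status, MPI_CHAR, &received);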

Deadlock

MPI can also deadlock. If every process waits for another process (for example, two ranks that both block in a receive before sending), the program simply gets stuck.
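
A minimal sketch of the classic case, assuming two ranks that want to exchange a single int (rank is obtained as in the example further below; in and out are just local ints): if both ranks post the blocking receive first, neither ever reaches its send. Ordering the calls differently on each rank avoids it.

// Deadlock-prone exchange: both ranks block in MPI_Recv and nobody sends.
//   MPI_Recv(&in,  1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
//   MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

// One safe ordering: even ranks send first, odd ranks receive first.
int out = rank, in = 0;
int other = (rank % 2 == 0) ? rank + 1 : rank - 1;
if (rank % 2 == 0) {
    MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(&in,  1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    MPI_Recv(&in,  1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
}
// MPI_Sendrecv(&out, 1, MPI_INT, other, 0, &in, 1, MPI_INT, other, 0,
//              MPI_COMM_WORLD, MPI_STATUS_IGNORE) does both halves safely in one call.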

MPI_Send variants

MPI also supports other versions of MPI_Send. In fact, plain MPI_Send does not guarantee whether it behaves synchronously or asynchronously. To be specific, there are several send modes, and you pick one by calling the corresponding function: MPI_Bsend(), MPI_Rsend(), MPI_Ssend(), MPI_Isend().

  1. MPI_Bsend() : (Buffered) Copies the data into a user-attached buffer and returns (see the sketch after this list).
  2. MPI_Rsend() : (Ready) May only be called when the matching receive has already been posted.
  3. MPI_Ssend() : (Synchronous) Returns only after the matching receive has started.
  4. MPI_Send() : (Standard) No particular behavior guaranteed; the implementation may buffer or synchronize depending on the situation.
  5. MPI_Isend() : (Non-blocking) Returns immediately and hands back a request ("ticket") that you later check for completion.
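
Buffered mode only works if a buffer has been attached first. A minimal sketch, assuming sendmsg and MSGLEN from the example further below and a hypothetical destination rank dest; the buffer size is chosen arbitrarily for illustration:

// Buffered send: MPI copies the message into the attached buffer and returns.
int bufsize = 1 << 20;   // 1 MiB of user buffer, plus per-message overhead
char *attached = (char*)malloc(bufsize + MPI_BSEND_OVERHEAD);
MPI_Buffer_attach(attached, bufsize + MPI_BSEND_OVERHEAD);

MPI_Bsend(sendmsg, MSGLEN, MPI_CHAR, dest, 0, MPI_COMM_WORLD);  // dest is hypothetical here

// Detach blocks until all buffered messages have actually been delivered.
MPI_Buffer_detach(&attached, &bufsize);
free(attached);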

Asynchronous send

int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

It is almost the same as MPI_Send(), but it returns immediately and fills in request, a handle ("ticket") you later use to check whether the send has completed.

Asynchronous receive

int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

Like the send operation, there is a non-blocking receive as well. It is almost the same as MPI_Recv(), but it returns immediately and fills in request, which you later use to check whether the receive has completed.

Wait

int MPI_Wait(MPI_Request *request, MPI_Status *status) 

There is also a wait function that blocks until an asynchronous operation finishes. It takes the request handle ("ticket") returned by the non-blocking call.
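
A minimal sketch of the usual pattern, assuming the buffers from the example further below and a hypothetical neighbor rank: start the transfers, do some unrelated work, then wait before touching the buffers again.

// Start a non-blocking send and receive, overlap them with computation,
// then wait before reusing the buffers.
MPI_Request sreq, rreq;
MPI_Isend(sendmsg, MSGLEN, MPI_CHAR, neighbor, 0, MPI_COMM_WORLD, &sreq);
MPI_Irecv(recvbuf, MSGLEN, MPI_CHAR, neighbor, 0, MPI_COMM_WORLD, &rreq);

do_local_work();  // hypothetical function: anything that does not touch the buffers

MPI_Wait(&sreq, MPI_STATUS_IGNORE);  // sendmsg may be reused after this
MPI_Wait(&rreq, MPI_STATUS_IGNORE);  // recvbuf now holds the received data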

Wait all

int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[])

This is a generalized version of MPI_Wait: it waits on multiple requests ("tickets") at once.

Test

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

This is a non-blocking variant of the wait operation. It returns immediately and sets flag to tell you whether the operation has completed or not.

Testall

int MPI_Testall(int count, MPI_Request array_of_requests[], int *flag, MPI_Status array_of_statuses[])

Test also has a generalized version. Notice that it has only a single flag: it is set only when all of the requests have completed, so it is an all-or-nothing check.
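
A minimal sketch of a polling loop with MPI_Test, assuming a request started by an earlier MPI_Irecv (like rreq in the sketch above): the process keeps doing other work until the message has arrived.

// Poll a non-blocking receive: do useful work until the message arrives.
int done = 0;
MPI_Status status;
while (!done) {
    MPI_Test(&rreq, &done, &status);
    if (!done) {
        do_other_work();  // hypothetical function: make progress on something else
    }
}
// At this point the receive has completed and recvbuf is safe to read.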

Example

An MPI example looks like the following.

#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    //Initialize the MPI
    MPI_Init(&argc, &argv); 

    //Get number of ranks (processes)
    int num_ranks;
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); 

    //Get this process's rank
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    //Get processor name
    int len; 
    char hostname[MPI_MAX_PROCESSOR_NAME]; 
    MPI_Get_processor_name(hostname, &len);

    printf("hostname: %s, I am %d/%d\n", hostname, rank, num_ranks);

    //Preparing buffers
    constexpr int MSGLEN = 30;
    char *recvbuf = new char[MSGLEN];
    char *sendmsg = new char[MSGLEN];

    //How many times to repeat
    constexpr int REPEAT = 10;
    
    //Each of them will pass index of this iteration as tag
    for(int i=0; i<REPEAT; i++) { 
        if(rank % 2==0) {//If it's even rank, it sends a message
            snprintf(sendmsg, MSGLEN, "hello from %s_%d #%d", hostname, rank, i); //Prepare the message (truncated if too long)
            MPI_Send(sendmsg, MSGLEN, MPI_CHAR, rank + 1, i, MPI_COMM_WORLD);
        } 
        else {//If it's odd rank, it receives a message
            MPI_Status status; 
            MPI_Recv(recvbuf, MSGLEN, MPI_CHAR, rank - 1, i, MPI_COMM_WORLD, &status); 
            printf("#%d:recved %d bytes : %s\n", rank, MSGLEN, recvbuf);//Received message
        }
    }
    //Notice that this only works with an even number of ranks.

    //Free the buffers
    delete[] recvbuf;
    delete[] sendmsg;

    //Terminate MPI
    MPI_Finalize();
    return 0;
}

There are some common patterns for moving data among processes, and MPI supports them directly as collective operations.

Broadcast

int MPI_Bcast(void *message, int count, MPI_Datatype type, int root, MPI_Comm comm) 
|        | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|--------|----------|----------|----------|----------|
| Before | A        |          |          |          |
| After  | A        | A        | A        | A        |

Broadcast is the easiest way to send data to all processes. The root process sends the contents of message to every other process in the communicator. You can also think of it as a synchronization point for all processes.
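
A minimal sketch, assuming rank is set up as in the earlier example: rank 0 fills in some values and every other rank receives them through the very same call.

// Every rank calls MPI_Bcast; rank 0 (the root) supplies the data,
// the others receive it into the same buffer.
int config[4];
if (rank == 0) {
    config[0] = 10; config[1] = 20; config[2] = 30; config[3] = 40;
}
MPI_Bcast(config, 4, MPI_INT, 0, MPI_COMM_WORLD);
// Now config holds the same four values on every rank.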

Reduce

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm)

It is a reduction over the data of all processes: the elements in sendbuf are accumulated element-wise with op, and the result lands in recvbuf on the root. For example, if op is MPI_SUM, it sums the contributions from every process. The possible op types are listed below, and a short usage sketch follows the table.

  1. MPI_MAX : max operation
  2. MPI_MIN : min operation
  3. MPI_MAXLOC : max operation with indexing
  4. MPI_MINLOC : min operation with indexing
  5. MPI_SUM : sum operation
  6. MPI_PROD : product operation
  7. MPI_LAND : logical and operation
  8. MPI_BAND : bit-wise and operation
  9. MPI_LOR : logical or operation
  10. MPI_BOR : bit-wise or operation
  11. MPI_LXOR : logical xor operation
  12. MPI_BXOR : bit-wise xor operation
|        | Worker 1         | Worker 2 | Worker 3 | Worker 4 |
|--------|------------------|----------|----------|----------|
| Before | A                | B        | C        | D        |
| After  | A op B op C op D | B        | C        | D        |
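
A minimal sketch of a sum reduction, assuming rank is set up as in the earlier example: every rank contributes one value and rank 0 receives the total.

// Sum one value per rank onto rank 0.
int local = rank + 1;   // each rank's contribution
int total = 0;          // only meaningful on the root afterwards
MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
    printf("sum over all ranks = %d\n", total);
}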

Scatter

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

It is similar to broadcast, but instead of sending the same data to everyone, the root splits its buffer into equal chunks and sends one chunk to each process.

|        | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|--------|----------|----------|----------|----------|
| Before | A,B,C,D  |          |          |          |
| After  | A        | B        | C        | D        |

Gather

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

It is similar to reduce, except that no operation is applied: it simply collects one chunk from every process into the root's buffer. A combined scatter/gather sketch follows the table below.

|        | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|--------|----------|----------|----------|----------|
| Before | A        | B        | C        | D        |
| After  | A,B,C,D  |          |          |          |
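
A minimal sketch of the usual scatter/gather pair, assuming rank and num_ranks from the earlier example: the root scatters one int to each rank, each rank works on its piece, and the root gathers the results back in rank order.

// Root scatters one int per rank, everyone processes its piece locally,
// then the root gathers the results back.
int *input = NULL, *output = NULL;
if (rank == 0) {
    input  = new int[num_ranks];
    output = new int[num_ranks];
    for (int r = 0; r < num_ranks; r++) input[r] = r * 10;
}
int piece = 0;
MPI_Scatter(input, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
piece += rank;  // local work on the received chunk
MPI_Gather(&piece, 1, MPI_INT, output, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank == 0) { delete[] input; delete[] output; }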

All Gather

int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

It is equivalent to a gather followed by a broadcast: every piece of data is collected, and the combined result ends up on all processes.

|        | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|--------|----------|----------|----------|----------|
| Before | A        | B        | C        | D        |
| After  | A,B,C,D  | A,B,C,D  | A,B,C,D  | A,B,C,D  |

Reduce Scatter

It is equivalent to doing a reduce followed by a scatter: the data is reduced element-wise across processes, and the reduced result is then split among them.

int MPI_Reduce_scatter(const void *sendbuf, void *recvbuf, const int recvcounts[], MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
|        | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|--------|----------|----------|----------|----------|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After  | $A_1$ op $A_2$ op $A_3$ op $A_4$ | $B_1$ op $B_2$ op $B_3$ op $B_4$ | $C_1$ op $C_2$ op $C_3$ op $C_4$ | $D_1$ op $D_2$ op $D_3$ op $D_4$ |

All Reduce

It is equivalent to doing a reduce followed by a broadcast: the data is reduced element-wise across processes, and every process receives the full reduced result.

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
|        | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|--------|----------|----------|----------|----------|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After  | $A_1$ op $A_2$ op $A_3$ op $A_4$, $B_1$ op $\ldots$ op $B_4$, $C_1$ op $\ldots$ op $C_4$, $D_1$ op $\ldots$ op $D_4$ | same as Worker 1 | same as Worker 1 | same as Worker 1 |
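
A minimal sketch, assuming rank and num_ranks from the earlier example: a global sum that every rank needs, for instance to compute an average.

// Global sum that every rank receives (no separate broadcast needed).
double local_val = (double)rank;  // each rank's contribution
double global_sum = 0.0;
MPI_Allreduce(&local_val, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
double mean = global_sum / num_ranks;  // every rank can now use the result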

All to All

It is essentially a transpose of the data matrix: process i sends its j-th chunk to process j.

int MPI_Alltoall(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
|        | Worker 1 | Worker 2 | Worker 3 | Worker 4 |
|--------|----------|----------|----------|----------|
| Before | $A_1,B_1,C_1,D_1$ | $A_2,B_2,C_2,D_2$ | $A_3,B_3,C_3,D_3$ | $A_4,B_4,C_4,D_4$ |
| After  | $A_1,A_2,A_3,A_4$ | $B_1,B_2,B_3,B_4$ | $C_1,C_2,C_3,C_4$ | $D_1,D_2,D_3,D_4$ |

Other extensions of these functions

Notice that most of the functions above also have a non-blocking version and a variable-size version, obtained by adding an I prefix to the function name or a v suffix, respectively. For example, MPI_Alltoallv is the variable-size version of all-to-all, and MPI_Igather is the non-blocking version of gather. Variable-size means that each process's chunk of data may have a different size.
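
A minimal sketch of a variable-size gather, assuming rank and num_ranks from the earlier example: each rank contributes rank+1 ints, so the root needs per-rank counts and displacements.

// Variable-size gather: rank r contributes r+1 ints.
int mycount = rank + 1;
int *mydata = new int[mycount];
for (int i = 0; i < mycount; i++) mydata[i] = rank;

int *counts = NULL, *displs = NULL, *all = NULL;
if (rank == 0) {
    counts = new int[num_ranks];
    displs = new int[num_ranks];
    int offset = 0;
    for (int r = 0; r < num_ranks; r++) {
        counts[r] = r + 1;   // how many elements rank r sends
        displs[r] = offset;  // where rank r's data lands in 'all'
        offset += counts[r];
    }
    all = new int[offset];
}
MPI_Gatherv(mydata, mycount, MPI_INT, all, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);
// (buffers should be deleted after use)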

Communicators

So far we assumed that all processes join these operations. However, that can be changed in code: you can make a subgroup, or a more complex group, and build a new communicator from it. An example is below, and more detail will be added later.

// Assumes rank, buf, and MSGLEN are set up as in the earlier example.
MPI_Group world_group; 
MPI_Comm_group(MPI_COMM_WORLD, &world_group);   // group behind MPI_COMM_WORLD

// Build a subgroup containing only ranks 0, 2, and 4.
MPI_Group mpi_local_group;
int num_participants = 3;
int participants[3] = {0, 2, 4}; 
MPI_Group_incl(world_group, num_participants, participants, &mpi_local_group);

// Create a communicator for that subgroup (the tag 0 distinguishes concurrent creations).
MPI_Comm new_comm; 
MPI_Comm_create_group(MPI_COMM_WORLD, mpi_local_group, 0, &new_comm); 

// Only members of the subgroup get a valid communicator back.
if(new_comm != MPI_COMM_NULL) { 
    int local_rank;
    MPI_Comm_rank(new_comm, &local_rank);       // rank within the subgroup
    printf("%d:local :%d\n", rank, local_rank ); 
    MPI_Bcast(buf, MSGLEN, MPI_INT, 0, new_comm);  // collective over the subgroup only
}