CPSC 441, Fall 2014
About MPI


MPI is simply a specification of a set of functions, and there are many different implementations of the specification. We will be using one called MPICH. MPICH supports several kinds of parallel programming, including distributed computing on a network of workstations, which is how we will use it.


About Programming in C

MPI can work with various programming languages, but it is most commonly used with FORTRAN and with C or C++. We will use the C interface. Fortunately, most of what you need to know about C is very similar to Java. The major exception is that you need to know a little bit about pointers and arrays, which are significantly different in C. I will give you a little information here. Hopefully, that, combined with some sample MPI programs, will be enough to get you started with C programming for MPI.

The good news is that control structures are almost identical in C and in Java. There is no try statement in C, and the Linux C compiler, gcc, by default does not allow you to declare a for-loop variable in the loop (although there is a command-line option to gcc that enables that syntax).

The basic data types int and double are the same in both languages. The char type in C is really a small, one-byte numeric type (depending on the compiler, its range is either 0 to 255 or -128 to 127). C does not have a boolean type; instead it generally uses int, with 0 representing false and any other integer representing true.

There are no classes in C, so functions and variables exist on their own, outside of any class. There are no public, private, or static modifiers. (The word "static" has other uses in C, but we won't need it.) For declaring constants, const is used instead of final. Functions and variables must be declared before they are used, but you can declare a function by giving a function "prototype" (the function heading with a semicolon at the end instead of a body), and give the full definition later.
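
For example, a file might contain something like the following. (This is a made-up sketch; the names MAX_SIZE and average are just for illustration.)

          const int MAX_SIZE = 100;              // const is used instead of final

          double average( double x, double y );  // a prototype: the heading plus a semicolon

          int main() {
              double a;
              a = average( 3.0, 5.0 );           // legal here, since the prototype appears above
              return 0;
          }

          double average( double x, double y ) { // the full definition, given later
              return (x + y) / 2;
          }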

Function declarations are often placed into "header" files that can be "included" at the start of a program. Many function libraries have header files that must be included in order to use the functions. For example, an MPI program generally has the include statement

              #include <mpi.h>

at the top of the program. Programs often need header files for math functions (math.h), I/O functions (stdio.h), and standard library functions (stdlib.h). The actual compiled function definitions are in libraries that are linked to the program when it is compiled. (The math library is not automatically linked by gcc, and you have to specify the option "-lm" at the end of the gcc command when you compile a program that uses functions from math.h.)

While C does not have classes, it does have structs, which are similar to classes that have only public data members. A struct can be used much like a class to declare variables and parameters. We will only need the struct named MPI_Status, which is used for parameters in several MPI functions.

For input and output, C uses functions printf and scanf for user I/O (and similar functions fprintf and fscanf for file I/O). You shouldn't need scanf. The printf function in C is pretty much the same as System.out.printf in Java. That is, its first parameter is a format string that contains format specifications such as %d and %1.2f acting as placeholders for the output values. The remaining parameters are the output values. For example,

           printf( "The average of the %d values is %1.3f.\n", ct, avg );

Now, about arrays and pointers. Java, of course, uses pointers behind the scenes. In C, they become explicit. For example, you can have variables of type int, and you can also have variables of type "pointer to int," which is written int*. You will see that most MPI functions have some parameters of pointer type. If you have a variable num of type int, then &num represents the address of num, which is a value of type pointer to int. The & in &num is an operator that takes the address of a variable. The expression &num can be passed as an actual parameter corresponding to a formal parameter of type int*. One effect of this is that the function will be able to change the value of num; that is, num can have a different value coming out of the function than going in.
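
Here is a small made-up sketch of that idea (the function double_it is just for illustration):

          void double_it( int* p ) {    // p holds the address of some int
              *p = 2 * (*p);            // *p refers to the int that p points to
          }

          int main() {
              int num = 7;
              double_it( &num );        // pass the address of num;
                                        // num is now 14 -- the function changed it
              return 0;
          }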

Furthermore, an array variable in C is really nothing more than a pointer to the first element of the array. For example, if A is a variable of type array-of-int, then A can be passed as actual parameter for a formal parameter of type int*. (Note that the array name A itself, not &A, is what you need for a value of type int*.) Many parameters of pointer type in MPI functions will actually take an entire array of values as the actual parameter.

An array can be created in C with a variable declaration such as

         int A[100];

This makes an array of 100 ints, named A. (Note that you don't use a new operator to create an array.) There is no such thing as A.length in C, and there is no range check to stop you from using an array location that is outside the actual size of the array. (By the way, string variables are surprisingly tricky in C. A string is represented by a pointer value of type char*, which is actually just a pointer to the first character in the string. The end of a string is always marked by a null character, that is, by a character with numeric value 0. Fortunately, you won't need to deal with string variables, but this explains the usual parameters to main in C: int main(int argc, char** argv). Here, argv is an array of strings, given as a pointer to an array of pointers-to-char. And argc is the number of strings in the array, which is necessary since arrays in C don't come with lengths. The intent in C is pretty much the same as in Java: The parameters to main represent the command-line arguments.)
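
To illustrate these points with a made-up sketch (sum_array is a hypothetical function), an array is ordinarily passed to a function together with its length, since the function has no other way of knowing how big the array is:

          int sum_array( int* A, int length ) {   // A points to the first array element
              int total = 0;
              int i;
              for ( i = 0; i < length; i++ )
                  total += A[i];                  // a pointer can be indexed like an array
              return total;
          }

          int main( int argc, char** argv ) {
              int A[100];
              int i, total;
              for ( i = 0; i < 100; i++ )
                  A[i] = i;
              total = sum_array( A, 100 );        // pass A itself (not &A), plus its length
              return 0;
          }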


Compiling and Running MPI programs

MPICH comes with specialized commands for compiling and running programs that use MPI. Suppose mpiprog.c is a C program that uses MPI (including a main() routine for the program). You could compile it with

          mpicc  mpiprog.c

Just like the regular C compiler, this creates an executable program named a.out, which is not really what you want. To give the executable file a different name, use the "-o" option to the compiler command, with the name of the file. For example,

           mpicc  -o mpiprog  mpiprog.c

Assuming that there are no syntax errors in the program, this command creates an executable program named mpiprog. If the program uses functions from math.h, as does my sample program primes1.c, then you should add "-lm" at the end of the command. For example,

          mpicc  -o primes1  primes1.c  -lm

The command for running an MPI executable is mpirun. This command takes the name of the executable file as its argument. In general, you will also need some options in the command. The most common option is "-n", which specifies the number of processes that you want to start. For example:

           mpirun  -n 4  ./mpiprog

The "./" in front of the program name is necessary to run a program in your working directory. You could also use a full path name to the file. This command starts 4 processes on the computer that you are using. Each of those processes will run the program, mpiprog. This will provide some speedup on a multicore computer. To run the program on several computers, you should first create a file that contains the names of the computers that you want to use. For example, you could create a file named mpihosts that contains

           cslab3
           cslab8
           cslab10
           csfac2
           csfac4

You do not need to list the computer that you are using. Then use the "-f" option to mpirun to specify the file when you run the program:

           mpirun  -f mpihosts  -n 24  ./mpiprog

This will start four processes on each of six computers -- the one that you are using and the five that are listed in the hosts file. Process 0 will be on the computer that you are using (as long as you don't list it in the hosts file, or list it first in that file). On our network, mpirun will use ssh to start the processes on the networked computers. (It can actually use several different ways to manage hosts and processes and is smart about knowing what to do. On our network, only ssh is available.) It should be possible to use ssh transparently, but our system administrator has not been able to make that work with the AFS network file system that we use. So, unfortunately, you will have to enter your password when you run an MPI program on multiple computers.

When mpirun logs you into the networked computers, it will set your working directory on those computers to be the same working directory that you are using. The processes that are running on the networked computers will all run the same program. This will all work because your home directory is on a network file system and is available on all of our computers.


MPI Commands

We will use only a small subset of all the available MPI commands in this course. However, this subset has enough power to write a full range of message-passing, parallel-processing programs.

The four most basic commands are MPI_Init, MPI_Finalize, MPI_Comm_size and MPI_Comm_rank. These commands are not covered here, but they are illustrated in the sample program hello_mpi.c. They are used in basically the same way in almost every program.
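
For reference, here is a minimal sketch of how those four commands fit together. (This is only an outline in the spirit of hello_mpi.c, not the actual sample program; the variable names are just illustrations.)

          #include <stdio.h>
          #include <mpi.h>

          int main( int argc, char** argv ) {
              int process_count;   // total number of processes
              int my_rank;         // this process's number, from 0 to process_count-1
              MPI_Init( &argc, &argv );                         // always the first MPI call
              MPI_Comm_size( MPI_COMM_WORLD, &process_count );
              MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
              printf( "Process %d of %d is running.\n", my_rank, process_count );
              MPI_Finalize();                                   // always the last MPI call
              return 0;
          }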

Most MPI functions return an error code as the return value. However, the default behavior is for the MPI system itself to abort the program when an error occurs, so that you never see the error code. We will not use error codes in any of our programs. (It's possible to change the default behavior so that errors are reported instead of causing program termination, but we won't worry about that.)

The rest of this page contains information about other commands that you should know about. There are two classes of commands: those related to point-to-point communication between two processes (MPI_Send, MPI_Recv, and MPI_Iprobe) and those related to collective communications that involve all processes (MPI_Barrier, MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Gather, MPI_Allgather, and MPI_Scatter). Finally, there are two miscellaneous functions, MPI_Wtime and MPI_Abort.

MPI_Send
Prototype:
     int MPI_Send(
           void*        message,          // Pointer to data to send
           int          count,            // Number of data values to send
           MPI_Datatype datatype,         // Type of data (e.g. MPI_INT)
           int          destination_rank, // Rank of process to receive message
           int          tag,              // Identifies message type
           MPI_Comm     comm              // Use MPI_COMM_WORLD
         )
Explanation: Sends a message from the process that executes this command to another process. The receiver is specified by giving the destination_rank. (Recall that each process has a number between zero and process_count minus one.) The message consists of one or more items, all of which must be of the same type. The type of data is specified in the datatype parameter, which must be a constant such as MPI_INT, MPI_DOUBLE, or MPI_CHAR. (You will probably use either MPI_INT or MPI_DOUBLE.) The data type specified should match the actual type of the message parameter.

The actual data is referenced by the message parameter, which must be a pointer to the start of the data. If the data is a single value stored in a variable named data, for example, pass the address of the variable, &data. If the data consists of several values in an array named data_array, just pass the array name (with no "&" operator) -- remember that an array in C is already a pointer.

The tag is a non-negative integer that you make up and use in your program to identify each different type of message that you might want to send. The tag allows the recipient to differentiate between different types of messages, so they can be processed differently. You can use any value you like. If you just have one type of message, the value of tag is irrelevant for your program.

On the system that we are using, MPI_Send just drops the message into an outgoing message queue and returns immediately. The message is actually delivered by a separate process. On other systems, however, MPI_Send might block and fail to return until the message is actually sent. (This is unfortunate because it means that a program that works on one system might fail on another. MPI has many more specialized send functions that give more control.)

The last parameter to MPI_Send is of type MPI_Comm. This is a "communicator," which identifies some particular defined group of processes. MPI_COMM_WORLD is an MPI constant that represents the group that contains every process in the virtual machine. We will not work with other groups, so you should always specify MPI_COMM_WORLD as the value of a parameter of type MPI_Comm. Almost every MPI function has such a parameter; I will not mention them again.

The return value of the function is an error code, which we will always ignore, as discussed above.

Example(s):
     int count;
     double results[5];
        .
        .  // Compute values for count and for the results array
        .
     MPI_Send( &count, 1, MPI_INT, 0, 1, MPI_COMM_WORLD );
           // send 1 int to process 0 with identification tag 1
     MPI_Send( results, 5, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD );
           // Now send 5 doubles to process 0 with ID tag 2 --
           // the different tag indicates a different type of message.

MPI_Recv
Prototype:
     int MPI_Recv(
           void*        message,     // Points to location in memory where
                                     //     received message is to be stored.
           int          count,       // MAX number of data values to accept 
            MPI_Datatype datatype,    // Type of data (e.g. MPI_INT)
           int          source_rank, // Rank of process to receive from
                                     //    (Use MPI_ANY_SOURCE to accept
                                     //     from any sender)
           int          tag,         // Type of message to receive
                                     //    (Use MPI_ANY_TAG to accept any type)
           MPI_Comm     comm,        // Use MPI_COMM_WORLD
           MPI_Status*  status       // To receive info about the message
         )
Explanation: Receive a message sent by another process using MPI_Send (or certain other send functions that aren't covered here). This function will block until a message arrives. When it returns, the message that was received has been placed in the memory specified by the message parameter. This parameter will ordinarily be specified as an array name or as &data where data is a simple variable.

The count specifies the maximum number of data values that can be accepted. It should be 1 for a simple variable; for an array, it can be as large as the array size. The actual number of values received can be fewer than count. Usually, you know how many values you will get. If not, there is a function, MPI_Get_count, that can be used to find out the actual number received (see the example below).

If source_rank is MPI_ANY_SOURCE, this function will accept a message from any other process. If source_rank is a process number, then it will only accept a message from that process. (If there are messages waiting from other processes, they will be ignored -- they are not errors; they will just wait in the incoming message queue, and they can be read later with other calls to MPI_Recv.) If tag is MPI_ANY_TAG, then any message will be accepted from the specified source. Otherwise, only a message with a matching tag will be accepted. (A message's tag is specified by the sending process in the MPI_Send command; see above.)

The status parameter is a struct that is filled with information about the message when it is received. Usually, you will declare a variable named status of type MPI_Status and pass its address, &status, to the function. After the function returns, status.MPI_SOURCE contains the number of the process that sent the message, and status.MPI_TAG contains the message's tag. You might need this information if you used MPI_ANY_SOURCE or MPI_ANY_TAG.

Example(s):
     int count_received;
     double results_received[5];
     int stuff[100];
     MPI_Status status;
     MPI_Recv( &count_received, 1, MPI_INT, 17, 1, MPI_COMM_WORLD, &status );
           // Receive 1 int from process 17 in a message with tag 1
     MPI_Recv( results_received, 5, MPI_DOUBLE, 17, 2, MPI_COMM_WORLD, &status );
           // Receive up to 5 doubles from process 17 in a message with tag 2
     MPI_Recv( stuff, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );
           // Receive up to 100 ints from any source in a message with any tag;
           // status.MPI_SOURCE is the message source and status.MPI_TAG is its tag
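
As mentioned above, if you don't know in advance how many values actually arrived, MPI_Get_count can recover that number from the status. A small sketch, continuing the third example:

     int values_received;
     MPI_Get_count( &status, MPI_INT, &values_received );
           // values_received is the number of ints actually received into stuff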

MPI_Iprobe
Prototype:
     int MPI_Iprobe(
           int          source_rank, // Rank of process to receive from
                                     //    (Use MPI_ANY_SOURCE to accept
                                     //     from any sender)
           int          tag,         // Type of message to receive
                                     //    (Use MPI_ANY_TAG to accept any type)
           MPI_Comm     comm,        // Use MPI_COMM_WORLD
           int*         flag,        // Tells if a message is available
           MPI_Status*  status       // To receive info about the message
         )
Explanation: This function allows you to check whether a message is available, without blocking to wait for that message. The function returns immediately (which is what the "I" in the name stands for). If a message that matches source_rank and tag is available, the value of flag is set to true. In this case, you can call MPI_Recv to receive the message, and you can be sure that it won't block. If no such message is available at the current time, the value of flag is set to false. If a message is available, the status parameter is filled with information about the message in exactly the same way that it would be in the MPI_Recv function. The actual flag parameter in the function call will almost certainly be of the form &test, where test is a variable of type int.
Example(s):
    MPI_Status status;
    int flag;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status );
    if (flag) {
        // Receive message with MPI_Recv and do something with it.
    }
    else {
        // Do other work.
    }

MPI_Barrier
Prototype:
     int MPI_Barrier(
           MPI_Comm   comm    // Use MPI_COMM_WORLD
         )
Explanation: This is the simplest collective communication function. In a collective communication, every process must call the function. Otherwise, an error occurs. For example, the program might hang, waiting forever for one of the processes to call the function.

When a process calls MPI_Barrier, it will simply wait until every other process has also called the same function. At that time, all the processes start running again. The point of this function is simply to synchronize the processes by bringing them all to a certain point in the program before allowing them to continue. In practice, you will probably not use this function, since the other collective communication functions do the same sort of synchronization implicitly.
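
Example(s): (a minimal illustration -- there is nothing to specify except the communicator)
     MPI_Barrier( MPI_COMM_WORLD );
           // Each process pauses here until every process has made this call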


MPI_Bcast
Prototype:
     int MPI_Bcast(
           void*        message,   // Pointer to location of data
           int          count,     // Number of data values to broadcast
           MPI_Datatype datatype,  // Type of data (e.g. MPI_INT)
           int          root,      // Process that originally has the data
           MPI_Comm     comm       // Use MPI_COMM_WORLD
         )
           
Explanation: Broadcasts a message from one "root" process to every other process. Every process -- both the sender and all the receivers -- must call this function, or an error occurs. The rank of the process that originally has the data is specified in the root parameter. In this process, the message parameter specifies the data to be sent (as in an MPI_Send command). In every other process, the message parameter specifies where the incoming message is to be stored. Note that although all processes call the same function, the meaning is different for the sender than it is for the receivers.

As usual, the first parameter will ordinarily be either the name of an array or the address (&data) of a simple variable. The datatype parameter should match the actual type of the message parameter. The count should be the actual number of data values to be broadcast -- 1 for a simple variable or anything up to the array size for an array.

Example(s):
     int initialdata[20];
     if (my_rank == 0) {
         // Fill the initialdata array with some data.
         // Only process 0 does this.
     }
     MPI_Bcast( initialdata, 20, MPI_INT, 0, MPI_COMM_WORLD );
        // Now, every process has the same data in its initialdata array.

MPI_Reduce
Prototype:
     int MPI_Reduce(
           void*        value,      // Input value from this process
           void*        answer,     // Result -- on root process only
           int          count,      // Number of values -- usually 1
           MPI_Datatype datatype,   // Type of data (e.g. MPI_INT)         
           MPI_Op       operation,  // What to do (e.g. MPI_SUM)           
           int          root,       // Process that receives the answer    
           MPI_Comm     comm        // Use MPI_COMM_WORLD                  
         )
Explanation: This function is used when each process has a data value, and it is necessary to combine the values from the different processes in some way, such as adding them all up or finding the minimum. Every process must call this function. Each process passes its own input value as the first parameter. The answer is computed and is placed in the answer parameter on only one process. The process that is to receive the final answer is called the root and is specified in the root parameter. For other processes, the value of the answer parameter after the function is called is undefined (but they still have to specify the parameter). As usual, the first and second parameters can be either array names or addresses of simple variables.

The datatype parameter specifies the type of data to be operated on. The operation parameter specifies how the individual values are to be combined. Possible operations include: MPI_SUM, MPI_PROD, MPI_MIN, MPI_MAX, and others.

The count parameter specifies the number of values on each process. If count is greater than 1, then the first two parameters should be arrays, and the operation is done on each array location separately.

Note that every process must call this function, or an error occurs.

Example(s):
     double my_total;      // value computed by each process
     double overall_total; // total of values from all the processes
        .
        .  // Compute a value for my_total
        .
     MPI_Reduce( &my_total, &overall_total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
     if ( my_rank == 0 ) {
           // Only process 0 knows the overall answer!
        printf( "The final result is %f\n", overall_total );
     }

MPI_Allreduce
Prototype:
     int MPI_Allreduce(
           void*        value,      // Input value from this process
           void*        answer,     // Result -- on all processes
           int          count,      // Number of values -- usually 1
           MPI_Datatype datatype,   // Type of data (e.g. MPI_INT)         
           MPI_Op       operation,  // What to do (e.g. MPI_SUM)           
           MPI_Comm     comm        // Use MPI_COMM_WORLD                  
         )
Explanation: Does the same thing as MPI_Reduce, but every process receives a copy of the answer. (So, there is no root parameter.)

Example(s):
     double my_total;      // value computed by each process
     double overall_total; // total of values from all the processes
        .
        .  // Compute a value for my_total
        .
     MPI_Allreduce( &my_total, &overall_total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
     // All processes now have the answer in overall_total

MPI_Gather
Prototype:
     int MPI_Gather(
           void*        sendbuffer, // Contains data values to send
           int          sendcount,  // Number of values sent from each process --
                                    //    explanation below assumes sendcount is 1
           MPI_Datatype sendtype,   // Type of data (e.g. MPI_INT)         
           void*        recvbuffer, // Collected data stored here -- on root only
           int          recvcount,  // Use same value as sendcount
           MPI_Datatype recvtype,   // Use same value as sendtype        
           int          root,       // Process that receives the answer    
           MPI_Comm     comm        // Use MPI_COMM_WORLD                  
         )
Explanation: This function is used when each process has a data value, and it is necessary to collect all the data values from the various processes and combine them into an array on one process. The process where the data is gathered is called the root and is specified by the root parameter. Before this function is executed, each process -- including the root -- has one data value, which is stored in sendbuffer. The sendbuffer will usually be specified as the address of a simple variable. The recvbuffer must be specified by each process, but it is only used in the root process. It should be an array whose length is at least equal to the total number of processes. After the function returns, the array will contain the data from all the processes, with the value from process 0 in position 0, the value from process 1 in position 1, and so on.

(If sendcount is greater than 1, then sendbuffer should be an array of size sendcount. All the values from the arrays on all the processes are then collected and stored in the recvbuffer, which must be large enough to hold all the data.)

Note that every process must call this function, or an error occurs.

Example(s):
     int my_data;    // Data value computed by this process
     int all_data[process_count];  // Data from all processes
        .
        .  // Generate a value for my_data
        .
     MPI_Gather( &my_data, 1, MPI_INT, all_data, 1, MPI_INT, 0, MPI_COMM_WORLD );
        // Now process 0 has all the data in its all_data array.

MPI_Allgather
Prototype:
     int MPI_Allgather(
           void*        sendbuffer, // Contains data values to send
           int          sendcount,  // Number of values sent from each process --
                                    //    explanation below assumes sendcount is 1
           MPI_Datatype sendtype,   // Type of data (e.g. MPI_INT)         
            void*        recvbuffer, // Collected data stored here -- on every process
           int          recvcount,  // Use same value as sendcount
           MPI_Datatype recvtype,   // Use same value as sendtype        
           MPI_Comm     comm        // Use MPI_COMM_WORLD                  
         )
Explanation: Does the same thing as MPI_Gather, except that the combined data is distributed to all processes. That is, after MPI_Allgather executes, the recvbuffer on every process contains the complete set of data from all the processes.
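
Example(s): (a sketch modeled on the MPI_Gather example above; process_count is assumed to hold the total number of processes)
     int my_data;                  // Data value computed by this process
     int all_data[process_count];  // Data from all processes
        .
        .  // Generate a value for my_data
        .
     MPI_Allgather( &my_data, 1, MPI_INT, all_data, 1, MPI_INT, MPI_COMM_WORLD );
        // Now EVERY process has all the data in its all_data array.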

MPI_Scatter
Prototype:
     int MPI_Scatter(
           void*        sendbuffer, // An array of data values to send -- on root only
           int          sendcount,  // Number of values sent to each process --
                                    //    explanation below assumes sendcount is 1
           MPI_Datatype sendtype,   // Type of data (e.g. MPI_INT)         
           void*        recvbuffer, // Location to receive data value on each process
           int          recvcount,  // Use same value as sendcount
           MPI_Datatype recvtype,   // Use same value as sendtype        
            int          root,       // Process that originally has the data
           MPI_Comm     comm        // Use MPI_COMM_WORLD                  
         )
Explanation: This function is an inverse to MPI_Gather. It takes an array of data values on the root process and distributes one value from the array to each process. The size of the array should be equal to the number of processes. Note that the sendcount parameter is the number of data values to send to each process, which I am assuming is 1. (If it is more than one, then a block of data values is sent to each process.)
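
Example(s): (a sketch; process_count and my_rank are assumed to be set up as in the earlier examples)
     int full_data[process_count];  // On the root, one value for each process
     int my_value;                  // This process's own value from the array
     if (my_rank == 0) {
         // Fill the full_data array with one value for each process.
     }
     MPI_Scatter( full_data, 1, MPI_INT, &my_value, 1, MPI_INT, 0, MPI_COMM_WORLD );
        // Now process 0 has full_data[0] in my_value, process 1 has
        // full_data[1], and so on.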

MPI_Wtime
Prototype:
     double MPI_Wtime()        // No parameters
Explanation: This function allows you to time activities in a single process. It's not a communication function -- it just works on the single process where it is called. It gives the number of seconds that have elapsed since some reference time in the past. Note that it gives elapsed time, not compute time. (The elapsed time is called the "wall time," hence the "W" in the name of the function.)
Example(s):
     double start_time = MPI_Wtime();
       .
       .  // do some work
       .
     double elapsed_time = MPI_Wtime() - start_time;

MPI_Abort
Prototype:
     int MPI_Abort(
           MPI_Comm  comm,       // Use MPI_COMM_WORLD
           int       error_code  // Use 1
         )
Explanation: If you encounter an error or some other condition that makes you want to shut down your entire program, you can call MPI_Abort to do so. It is not sufficient to call the usual exit() function, since that will only abort the one process on which it is executed, and it will do so without issuing the required call to MPI_Finalize. MPI_Abort, on the other hand, will properly shut down all the processes. The error_code is returned to the environment just as the parameter of exit() would be. In practice, you will probably not need to use this function.
Example(s):
     MPI_Abort( MPI_COMM_WORLD, 1 );


David Eck