CPSC 441, Fall 2018
Lab 9, Part 1:  Introduction to MPI
This lab will get you started working with MPI. You will write a couple of simple MPI programs, and do some testing. You should have read the MPI handout before starting the lab!
In the second part of the MPI lab, next Monday, you will work on a more substantial program. Both MPI labs are meant to be fairly short. Work from both parts will be due on Friday, December 7. For this part of the lab, everyone will turn in files named ex1.c and ex2.c. The remaining exercises can be turned in either individually or with a partner.
Running MPI
You should work in your netXX account, and you should already have set up that account to do passwordless login to the cslab and csfac computers. (See the first section of the handout for instructions.)
The folder /classes/cs441/MPI contains files that you will need for this lab. You should copy that folder, and use the copy as your working directory for the rest of this lab. The commands given here assume that all the files are in your working directory.
To make sure that you can compile and run MPI programs, you should try compiling and running the sample program hello_mpi.c. This is the same sample program that I handed out in class. To compile it, use the command
    mpicc  -o hello_mpi  hello_mpi.c
The name that you use for the executable (after the "-o" option) is up to you, but using the name of the main .c file is typical. Once the program has been compiled, you can use
    mpirun  -n 8  hello_mpi
to run it. This command runs the program in 8 processes on the computer that you are using.
You should also try running the program on several computers. One way to do it is to use the -H option and give a list of the computers on which you want to run the program. For example:
    mpirun  -H cslab3,cslab8,csfac2,csfac6  -n 4  hello_mpi
However, this only allows one process on each of the named computers. If you try to run more processes than that, you will get an error message saying that you do not have enough "slots" available. To avoid that, you can specify a hostfile that lists available computers and the number of slots available on each. The file allhosts lists all 20 lab computers and specifies four slots on each one. You can use this file with the -hostfile option of mpirun. For example:
    mpirun  -hostfile allhosts  -n 20  hello_mpi
You can run up to 80 processes in this way. It is also possible to combine the -H and -hostfile options, to specify which computers from the file you would like to use:
    mpirun  -hostfile allhosts  -H csfac4,cslab7,cslab1  -n 12  hello_mpi
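For reference, an Open MPI hostfile is a plain text file with one line per computer, giving the host name followed by the number of slots. The lines below are only an illustration of that format, using a few of the host names mentioned above. They are not the actual contents of allhosts, which you should look at directly.

```
# One line per host: name, then the number of slots on that host
cslab1 slots=4
cslab3 slots=4
csfac2 slots=4
```

With a file in this format, mpirun places up to four processes on each listed computer before it reports that it has run out of slots.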
It is actually possible to tell MPI that it can start more processes than the specified number of slots. To do that, give the following command in the Terminal where you are working:
     export OMPI_MCA_rmaps_base_oversubscribe=1
(The last time I used MPI, it allowed oversubscribing by default, but apparently it's not considered to be a good idea.)
One more thing. You might want to run the following command to clean up any leftover MPI processes on all of the lab computers:
    mpirun --pernode --hostfile allhosts ompi-clean
I have occasionally run into some very long startup times for the MPI virtual machine. It could be that leftover processes are part of the problem, and running ompi-clean might help.
An MPI Program: Estimating PI (Badly)
The mathematical constant PI is equal to the area of a circle of radius one. Consider the circle of radius one defined by x*x + y*y < 1. Now consider the quarter-slice of the circle that satisfies x >= 0 and y >= 0. This quarter circle has area PI/4, and it lies inside the square 0 <= x < 1, 0 <= y < 1, which has area 1. Suppose that you pick a large number of random points in the square. Then you can expect the fraction of the random points that lie inside the circle to be approximately equal to PI/4. Multiplying this fraction by 4 will give an approximation for PI. The more points you use, the better the approximation you can expect to get (although the approximation turns out to be pretty poor, even using a lot of points).
The program estimate_pi_uniprocessor.c implements this algorithm, with no parallelism. Of course, if you use MPI to spread out the calculations onto a lot of computers, you should get the answer faster. That's the programming assignment for this lab. You might find it useful to look at the sample MPI programs primes1.c and primes2.c. The first uses MPI_Send/MPI_Recv to communicate, while the second uses MPI_Reduce.
Exercise 1: Write an MPI version of estimate_pi_uniprocessor.c that uses all available processes to do the work. Please name the program ex1.c. (You might want to start this exercise by copying primes1.c to ex1.c, and then paste in some code from estimate_pi_uniprocessor.c.) In the program, each process will perform the task of selecting many random points and counting how many of the points satisfy x*x + y*y < 1. Each process except process 0 should send its count back to process 0 using MPI_Send. Process 0 should use MPI_Recv to receive the messages from the other processes. It should add up the counts (including its own), and print out the estimate of PI given by the combined results. The number of trials to be performed by each process can be given by a constant in the program, as is done in the "uniprocessor" version. Only process 0 should do any output.
Exercise 2: Your program for Exercise 1 uses MPI_Send and MPI_Recv for communication. In fact, it is simpler to use the collective communication function MPI_Reduce to get the data from all of the processes to process 0. Write a second version of the pi-estimating program, using a collective communication function instead of MPI_Send/MPI_Recv. You should be able to start with a copy of ex1.c, and make some changes. Please name the program ex2.c.
A Little Empirical Speedup Test
Reminder: Programs such as primes1.c that use functions such as sqrt from the math library must be linked to that library. The math library is actually named just "m", and you can link to it by adding the option "-lm" to the end of the compilation command. For example,
    mpicc  -o primes2  primes2.c  -lm
Exercise 3: For this exercise, use the sample MPI program primes2.c. You should measure the speedup that you get by using various numbers of processes on one machine and on several machines. Please report and comment on your results from this exercise in a file named ex3.txt. Compile the program. Run it with just one process, and see how long it takes. (Process 0 will report the elapsed time to standard output.) You can do this with the command:
    mpirun  -n 1  primes2
Run the program with 4 processes on one computer. Run it with 4 processes on 4 computers. Run it with 16 processes on 4 computers. Run it with 20 processes using all 20 computers, and with 80 processes on all 20 computers. You might try each experiment several times, and take the average run time. You can try other numbers of processes and computers if you want. (One thing to keep in mind: Starting up and tearing down the MPI virtual machine takes time, and the more processes and computers you use, the more significant that overhead will be.)
Note that if you are trying to do this exercise at the same time as someone else, using the same computers, then you will be sharing processing time on those computers, and your time measurements won't give an accurate idea of the true compute time.