Running LAM MPI

From Bootable Cluster CD

Setting LAM as the Default MPI Environment

LAM is a great MPI environment. Unfortunately, it's not the default MPI environment used on the BCCD. The reason for this is simple: LAM lost the coin toss to MPICH when the BCCD was first created. Historically, the problems with having LAM-MPI and MPICH in the same environment have involved libraries, default executables, consistency across hosts, and the expectations of the end user.

Switching the default environment from MPICH to LAM is easy to do, but one needs to be completely thorough. In other words, if the systems are not completely transitioned to use LAM, the resulting environment will be very, very broken.

However, by following these directions completely, all will be well in your LAM-MPI world. For each host in the BCCD cluster, do the following steps:

Edit the bccd User's Default Settings

By default, the bccd user's PATH setting points to the MPICH binaries. This includes the MPICH versions of mpirun, mpicc, mpif77, etc. To point the bccd user's PATH to the LAM compiler wrappers and tools instead, modify the PATH setting in the bccd user's .bashrc file:

vi ~/.bashrc

Look for the line that reads:

export PATH=$PATH:/mpich/bin

Change this to read:

export PATH=/lam-mpi/bin:$PATH
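Since this edit has to be repeated on every host, it can help to script it. The following is a minimal sketch, assuming the .bashrc contains exactly the MPICH line shown above; the function name use_lam_path is hypothetical and not part of the BCCD.

```shell
#!/bin/sh
# Hypothetical helper: point the PATH line in a bashrc-style file at LAM.
# Assumes the file contains exactly the line: export PATH=$PATH:/mpich/bin
use_lam_path() {
    rc_file="$1"
    # Keep a backup so the original MPICH setting can be restored.
    cp "$rc_file" "$rc_file.bak" || return 1
    # Replace the MPICH line with the LAM line; all other lines are untouched.
    sed 's|^export PATH=\$PATH:/mpich/bin$|export PATH=/lam-mpi/bin:$PATH|' \
        "$rc_file.bak" > "$rc_file"
}
```

Calling use_lam_path ~/.bashrc applies the change and leaves the previous version in ~/.bashrc.bak. If the MPICH line is worded differently on your system, the file is left unchanged, so check the result by hand.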

Great!

Allow the Changes to Take Effect

Changes to your .bashrc do not take effect immediately. They apply the next time you log in, or when you "source" the .bashrc file in your current shell. Do one of the following:

  • Shut down X and/or log out of every shell. Then log back in.
  • Or issue:
. ~/.bashrc

Next, check which mpirun your shell will use. Seeing the mpirun under /lam-mpi/bin is a necessary (but not sufficient) condition for the switch to work:

which mpirun

If you don't see LAM's version of mpirun, then you've done something wrong (or there's a problem with this howto, in which case you'll of course create yourself an account, log in, and fix the problem, RIGHT?).
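The same check can be made explicit in a script, which is handy when verifying many hosts. This is a minimal sketch, assuming LAM lives under /lam-mpi as described above; the function names is_lam_mpirun and check_mpirun are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given path is LAM's mpirun.
# Assumes LAM is installed under /lam-mpi, as on the BCCD.
is_lam_mpirun() {
    case "$1" in
        /lam-mpi/bin/mpirun) return 0 ;;
        *) return 1 ;;
    esac
}

# Report which mpirun the current shell would run.
check_mpirun() {
    found=$(command -v mpirun 2>/dev/null)
    if is_lam_mpirun "$found"; then
        echo "OK: mpirun is $found"
    else
        echo "WARNING: mpirun is '${found:-not found}', expected /lam-mpi/bin/mpirun"
        return 1
    fi
}
```

Run check_mpirun in a freshly sourced shell on each host. A warning means the PATH change has not taken effect there.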

Rebuild the System Library Cache

Become root (su -, password letmein). Then issue:

ldconfig -v | less

Ignore errors, if any, about /usr/local/lib not being there. Somewhere in the output, you need to see that libmpi.so.0 is provided by LAM-MPI:

/lam-mpi/lib:
       libmpi.so.0 -> libmpi.so.0.0.0
       liblammpi++.so.0 -> liblammpi++.so.0.0.0
       liblamf77mpi.so.0 -> liblamf77mpi.so.0.0.0
       liblam.so.0 -> liblam.so.0.0.0

There are likely more entries in the output; the above just illustrates what you're looking for. Do this for every host in the cluster, then log out of the root account.
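Rather than paging through the output by eye, the check can be automated. The sketch below assumes the ldconfig -v output format shown above (a directory line starting with "/", followed by indented library entries); the function name libmpi_from_lam is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ldconfig -v` output on stdin and succeed only
# if libmpi.so.0 is listed under the /lam-mpi/lib directory.
libmpi_from_lam() {
    awk '
        # A line starting with "/" is a directory header; remember if it is LAM.
        /^\// { in_lam = ($0 ~ /^\/lam-mpi\/lib:/) }
        # Indented entries belong to the most recent directory header.
        in_lam && $1 == "libmpi.so.0" { found = 1 }
        END { exit !found }
    '
}
```

As root, a possible invocation is: ldconfig -v 2>/dev/null | libmpi_from_lam && echo "libmpi.so.0 comes from LAM". The function exits non-zero if libmpi.so.0 is only listed under some other directory.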

Booting LAM MPI

LAM-MPI requires a file listing the nodes to boot. Make sure that every node has started pkbcast, bccd-allowall, and bccd-snarfhosts, as discussed in Booting up the CD. The bccd-snarfhosts command should generate the appropriate machines file in the bccd user's home directory. This file contains a list of active nodes, and is exactly what LAM needs. Issue the following command to verify that the cluster is bootable:

recon -v ~/machines

If the command is successful, you should see a message beginning with:

Woo hoo!  recon has completed successfully.

To actually start LAM on the specified cluster, issue the following:

lamboot -v ~/machines

If you don't see any error message, then you can now run MPI programs under LAM. Gravy!
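The recon and lamboot steps can be combined so that LAM is only booted when recon succeeds. This is a minimal sketch, assuming the machines file written by bccd-snarfhosts is at ~/machines unless another path is given; the function name boot_lam is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: check the machines file, then run recon and lamboot.
# Assumes the machines file defaults to ~/machines, as generated above.
boot_lam() {
    machines="${1:-$HOME/machines}"
    # Refuse to continue if the machines file is missing or empty.
    if [ ! -s "$machines" ]; then
        echo "No usable machines file at $machines" >&2
        return 1
    fi
    # Only boot if recon reports that the cluster is bootable.
    recon -v "$machines" || return 1
    lamboot -v "$machines"
}
```

Calling boot_lam with no arguments uses ~/machines. A non-zero return means the file was missing or empty, or that recon or lamboot reported a problem.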

Compiling and Running MPI Programs with LAM

To find out how to compile and run sample MPI programs, take a look at Compiling and Running. Remember, the example programs for LAM are in the directory ~/lam-mpi/examples. The examples are sorted inside directories. You may go into each directory to compile and run each program using the familiar mpicc and mpirun commands.

Shutting Down LAM

Cleaning LAM

Instead of lambooting after each MPI run, we can issue a lamclean command to remove all user processes and messages:

lamclean -v

After doing this, we can mpirun another program.

Halting LAM

After we are all done, the lamhalt command removes all traces of the LAM session on the network.

lamhalt

And just in case...

In the case of a catastrophic failure (i.e., one or more LAM nodes crash), we can issue a wipe command to halt everything instead of issuing lamhalt.

wipe -v ~/machines