Heterogeneous networks and the ch_p4 device



A heterogeneous network of workstations is one in which the machines connected by the network have different architectures and/or operating systems. For example, a network may contain 3 Sun SPARC (sun4) workstations and 3 SGI IRIX workstations, all of which communicate via the TCP/IP protocol. The mpirun command may be told to use all of these with

mpirun -arch sun4 -np 3 -arch IRIX -np 3 program.%a 
While the ch_p4 device supports communication between workstations in heterogeneous TCP/IP networks, it does not allow the coupling of multiple multicomputers. To support such a configuration, you should use the ch_nexus device. See the following section for details.

The special program name program.%a allows you to specify the different executables for the program, since a Sun executable won't run on an SGI workstation and vice versa. The %a is replaced with the architecture name; in this example, program.sun4 runs on the Suns and program.IRIX runs on the SGI IRIX workstations. You can also put the programs into different directories; for example,

mpirun -arch sun4 -np 3 -arch IRIX -np 3 /tmp/%a/program 
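The effect of %a can be pictured as plain string substitution. The Python sketch below only illustrates the naming convention described above; it is not how mpirun is implemented, and the helper name expand_arch is hypothetical.

```python
def expand_arch(template: str, arch: str) -> str:
    """Replace every %a in a program-name template with the architecture name."""
    return template.replace("%a", arch)


# The two templates used in the examples above, expanded for each architecture.
for arch in ("sun4", "IRIX"):
    print(arch, expand_arch("program.%a", arch), expand_arch("/tmp/%a/program", arch))
```

Under this reading, each architecture named with -arch needs its own executable at the expanded path before mpirun is invoked.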
For even more control over how jobs get started, we need to look at how mpirun starts a parallel program on a workstation cluster. Each time mpirun runs, it constructs and uses a new file of machine names for just that run, using the machines file as input. (The new file is called PIyyyy, where yyyy is the process identifier.) If you specify -keep_pg on your mpirun invocation, you can use this information to see where mpirun ran your last few jobs. You can construct this file yourself and specify it as an argument to mpirun. To do this for ch_p4, use
mpirun -p4pg pgfile myprog 
where pgfile is the name of the file. The file format is defined below.

This is necessary when you want closer control over the hosts you run on, or when mpirun cannot construct it automatically. Such is the case when

* You want to run on a different set of machines than those listed in the machines file.
* You want to run different executables on different hosts (your program is not SPMD).
* You want to run on a heterogeneous network, which requires different executables.
* You want to run all the processes on the same workstation, simulating parallelism by time-sharing one machine.
* You want to run on a network of shared-memory multiprocessors and need to specify the number of processes that will share memory on each machine. At present this is a benefit only with the ch_p4 device; a shared-memory module for Nexus is under development and should be available in its next release.

The format of a ch_p4 procgroup file is a set of lines of the form
<hostname>  <#procs>  <progname>  [<login>] 
An example of such a file, where the command is being issued from host sun1, might be
sun1   0  /users/jones/myprog
sun2   1  /users/jones/myprog
sun3   1  /users/jones/myprog
hp1    1  /home/mbj/myprog    mbj
The above file specifies four processes, one on each of three Suns and one on another workstation where the user's account name is different. Note the 0 in the first line: it indicates that no processes are to be started on host sun1 other than the one the user starts directly by issuing the command.
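To make the counting rule concrete, here is a minimal Python sketch that reads lines of the form shown above and totals the processes. It assumes only the four whitespace-separated fields described in this section; the names ProcgroupEntry, parse_procgroup, and total_processes are hypothetical and are not part of mpich.

```python
from typing import List, NamedTuple, Optional


class ProcgroupEntry(NamedTuple):
    hostname: str
    nprocs: int
    progname: str
    login: Optional[str]  # None when the optional login field is absent


def parse_procgroup(text: str) -> List[ProcgroupEntry]:
    """Parse '<hostname> <#procs> <progname> [<login>]' lines, skipping blank lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) not in (3, 4):
            raise ValueError(f"expected 3 or 4 fields, got: {line!r}")
        login = fields[3] if len(fields) == 4 else None
        entries.append(ProcgroupEntry(fields[0], int(fields[1]), fields[2], login))
    return entries


def total_processes(entries: List[ProcgroupEntry]) -> int:
    """Processes listed in the file plus the one the user starts directly."""
    return 1 + sum(entry.nprocs for entry in entries)


example = """\
sun1   0  /users/jones/myprog
sun2   1  /users/jones/myprog
sun3   1  /users/jones/myprog
hp1    1  /home/mbj/myprog    mbj
"""

entries = parse_procgroup(example)
print(total_processes(entries))
```

The same rule explains why a single line with a count of 9 describes a 10-process job: the tenth process is the one started by the user.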

You might want to run all the processes on your own machine, as a test. You can do this by repeating its name in the file:

sun1 0 /users/jones/myprog
sun1 1 /users/jones/myprog
sun1 1 /users/jones/myprog
This will run three processes on sun1, communicating via sockets.
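A file like this is easy to generate for any process count. The following Python sketch writes one line with a count of 0 for the process the user starts, followed by one line per additional process; the function name single_host_procgroup is hypothetical.

```python
def single_host_procgroup(host: str, nprocs: int, progname: str) -> str:
    """Build procgroup text that runs nprocs processes on one host.

    The first line carries a count of 0 because that process is the one
    started by the user; each remaining process gets its own line.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    lines = [f"{host} 0 {progname}"]
    lines += [f"{host} 1 {progname}"] * (nprocs - 1)
    return "\n".join(lines) + "\n"


print(single_host_procgroup("sun1", 3, "/users/jones/myprog"), end="")
```

The resulting text can be saved to a file and passed to mpirun with -p4pg as described earlier.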

To run on a shared-memory multiprocessor, with 10 processes, you would use a file like:

sgimp  9  /u/me/prog 
Note that this is for 10 processes, one of them started by the user directly, and the other nine specified in this file. This requires that mpich was configured with the option -comm=shared; see the installation manual for more information.

If you are logged into host gyrfalcon and want to start a job with one process on gyrfalcon and three processes on alaska, where the alaska processes communicate through shared memory, you would use

local    0  /home/jbg/main
alaska   3  /afs/u/graphics





Using special switches



In some installations, certain hosts can be connected in multiple ways. For example, the "normal" Ethernet may be supplemented by a high-speed FDDI ring. Usually, alternate host names are used to identify the high-speed connection. All you need to do is put these alternate names in your machines/machines.xxxx file. In this case, it is important not to use the form local 0 but to use the name of the local host. For example, if hosts host1 and host2 have ATM connected to host1-atm and host2-atm respectively, the correct ch_p4 procgroup file to connect them (running the program /home/me/a.out) is

host1-atm 0 /home/me/a.out
host2-atm 1 /home/me/a.out
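As an illustration, the Python sketch below builds such a file from a list of hosts and a table of alternate names, and refuses the local form. The function name and the dictionary of alternate names are hypothetical; how alternate names are assigned depends on the installation.

```python
from typing import Dict, List


def procgroup_with_alternate_names(
    hosts: List[str], alt_names: Dict[str, str], progname: str
) -> str:
    """Build procgroup text using each host's alternate (high-speed) name.

    hosts[0] is assumed to be the host on which the command is issued, so it
    gets a count of 0; every other host gets one process.
    """
    lines = []
    for index, host in enumerate(hosts):
        if host == "local":
            raise ValueError("use the real name of the local host, not 'local'")
        name = alt_names.get(host, host)  # fall back to the normal name
        count = 0 if index == 0 else 1
        lines.append(f"{name} {count} {progname}")
    return "\n".join(lines) + "\n"


atm_names = {"host1": "host1-atm", "host2": "host2-atm"}
print(procgroup_with_alternate_names(["host1", "host2"], atm_names, "/home/me/a.out"), end="")
```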


