Heterogeneous networks of multicomputers and the ch_nexus device


The startup files for the ch_nexus device are somewhat different, but easy to understand. See http://www.mcs.anl.gov/nexus/ for the Nexus User's Guide. The information contained in a ch_p4 pgfile is a subset of the items in a Nexus resource database (.resource_database) file.

Here is an example of a Nexus .resource_database file:

sun1:sun1.mcs.anl.gov \
  startup_dir=/users/jones/sun1 \
  startup_exe=/users/jones/myprog

sun2:sun2.mcs.anl.gov \
  startup_dir=/users/jones/sun1 \
  startup_exe=/users/jones/myprog \
  startups=rsh

The .resource_database file can either be placed in your current working directory or your home directory.

This .resource_database file specifies two hosts, sun1 and sun2, as nodes where Nexus programs can be run. You can optionally include aliases for the hosts, using a colon to separate the names. Note: the .resource_database file is line-oriented in its syntax. The \ character is used to continue the input line for each resource being defined. There must be no character other than a new-line following the continuation character.
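
The line-oriented syntax described above can be sketched in a few lines of Python. This is only an illustration of the rules stated here (colon-separated names, attribute=value pairs, \ as the last character of a continued line); it is not the parser Nexus uses, and the function names join_continued_lines and parse_entry are hypothetical.

```python
def join_continued_lines(text):
    """Join backslash-continued lines into one logical line per resource."""
    entries, current = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.rstrip()
        if stripped.endswith("\\"):
            # The rule from the text: nothing but a new-line may follow "\".
            if raw != stripped:
                raise ValueError(
                    f"line {lineno}: characters after the continuation backslash"
                )
            current.append(stripped[:-1].strip())
            continue
        current.append(stripped.strip())
        logical = " ".join(part for part in current if part)
        if logical:
            entries.append(logical)
        current = []
    # Handle a file that ends while a continuation is still open.
    logical = " ".join(part for part in current if part)
    if logical:
        entries.append(logical)
    return entries


def parse_entry(logical_line):
    """Split one logical line into (host names and aliases, attribute dict)."""
    names, *pairs = logical_line.split()
    attributes = dict(pair.split("=", 1) for pair in pairs)
    return names.split(":"), attributes


if __name__ == "__main__":
    sample = (
        "sun1:sun1.mcs.anl.gov \\\n"
        "  startup_dir=/users/jones/sun1 \\\n"
        "  startups=rsh\n"
    )
    for line in join_continued_lines(sample):
        print(parse_entry(line))
```

A trailing blank after the backslash is reported as an error rather than silently accepted, which mirrors the restriction on the continuation character.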

The startup_dir attribute identifies the working directory for the executable; the startup_exe attribute identifies the full path to the executable. If you have built MPI with ch_nexus device support on multiple platforms, the startup_dir and startup_exe values will not necessarily be the same on each of them.

The startups attribute identifies the startup method used to establish a process on one of the machines. Currently, rsh is the preferred method, but in the future you may opt to use ssh (Secure Shell) and other startup methods.

These attributes are optional. Nexus in most cases will choose sensible defaults if they are not supplied. As an example, if you omit the startup_exe attribute, Nexus will use the path of the executable started on node 0 (zero).
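
The startup_exe default described above can be pictured as follows. This is a minimal sketch of the stated rule only, not Nexus code; the helper name with_default_exe and the path /users/jones/node0prog are hypothetical.

```python
def with_default_exe(attributes, node0_exe):
    """Return a copy of an entry's attributes with startup_exe filled in.

    If the entry omits startup_exe, fall back to the path of the
    executable that was started on node 0.
    """
    resolved = dict(attributes)
    resolved.setdefault("startup_exe", node0_exe)
    return resolved


if __name__ == "__main__":
    node0_exe = "/users/jones/node0prog"  # hypothetical path, for illustration
    explicit = {"startup_exe": "/users/jones/myprog"}
    minimal = {"startup_dir": "/users/jones/sun1"}
    print(with_default_exe(explicit, node0_exe)["startup_exe"])
    print(with_default_exe(minimal, node0_exe)["startup_exe"])
```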

Often you may wish to use a collection of workstations on which your login name is not the same. Use the rsh_login attribute whenever your login name on a machine differs from your login name on the machine where you started the Nexus-based MPI application, as follows:

sun1:sun1.mcs.anl.gov \
  startup_dir=/users/jones/sun1 \
  startup_exe=/users/jones/myprog \
  startups=rsh

sun2:sun2.mcs.anl.gov \
  startup_dir=/users/jones/sun1 \
  startup_exe=/users/jones/myprog \
  startups=rsh \
  rsh_login=jpjones
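
To make the effect of rsh_login concrete, the sketch below builds the kind of rsh command line that such an entry implies, using the standard -l option of rsh to select the remote login name. The exact command that Nexus issues is not documented here, so the shape of the remote command (cd into startup_dir, then run startup_exe) is an assumption, and rsh_command is a hypothetical helper.

```python
import shlex


def rsh_command(host, startup_dir, startup_exe, rsh_login=None):
    """Build an argv list for starting startup_exe on host via rsh."""
    argv = ["rsh"]
    if rsh_login:
        # -l selects the remote login name when it differs from the local one.
        argv += ["-l", rsh_login]
    # Assumed remote command: change to startup_dir, then run startup_exe.
    remote = f"cd {shlex.quote(startup_dir)} && {shlex.quote(startup_exe)}"
    return argv + [host, remote]


if __name__ == "__main__":
    print(rsh_command("sun2.mcs.anl.gov", "/users/jones/sun1",
                      "/users/jones/myprog", rsh_login="jpjones"))
```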

If you wish to use TCP/IP over an alternate host name, you can add the tcp_interface attribute to the entries in the .resource_database file:

sun1 tcp_interface=sun1-atm
sun2 tcp_interface=sun2-atm

Another attribute you can specify is domain. It tends to be important when starting up nodes on systems (e.g. Solaris) where the gethostname function call does not always return a fully qualified host name. For example:

sun1 domain=.mcs.anl.gov
sun2 domain=.mcs.anl.gov

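The purpose of the domain attribute can be illustrated with a small Python sketch: when a host reports only its short name, the domain supplies the missing suffix. How Nexus combines the two internally is not stated in this section, so the logic below is an assumption, and qualify is a hypothetical helper.

```python
def qualify(hostname, domain):
    """Return a fully qualified name, appending domain to a short host name."""
    if "." in hostname:
        # Already looks fully qualified; leave it alone.
        return hostname
    return hostname + domain


if __name__ == "__main__":
    print(qualify("sun1", ".mcs.anl.gov"))
    print(qualify("sun2.mcs.anl.gov", ".mcs.anl.gov"))
```
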
There are many other attributes that you can define in the .resource_database file. For example, the following entry could be used in your .resource_database file if you wanted to use an IBM SP-2 as part of a larger collection of machines:

mysp2.mcs.anl.gov \
  startup_dir=/sphome/jones/mpich-tcp+mpl/examples/ \
  startup_exe=/sphome/jones/mpich-tcp+mpl/examples/myprog \
  startups=rsh \
  startup_type=mpl \
  startup_count=2 \
  startup_mpl_hostname=spnode019.mcs.anl.gov \
  startup_mpl_hostfile=/sphome/jones/mpich-tcp+mpl/hostfile

This entry makes use of a number of additional attributes. The startup_dir, startup_exe, and startups attributes were explained in the earlier example.

The startup_type attribute defines MPL as the underlying communication mechanism to be used (on the SP-2 this is normally done via the Parallel Operating Environment, or poe).

The startup_count attribute indicates the number of nodes to be used on the SP-2 (in addition to any nodes you may be using on other workstations or multicomputers).

The startup_mpl_hostname attribute indicates where the poe command should be run. The startup_mpl_hostfile attribute names a file listing the machines (one hostname per line) where other nodes can be started.

A further attribute, startup_mpl_euilib, which does not appear in the entry above, indicates whether you are going directly to the switch or running TCP/IP over the switch. Use us to go directly to the switch or ip to use TCP/IP over the switch.

There are many configurations for which there is planned support, such as the Paragon. The information presented here is preliminary and demonstrates example configurations which are known to work as a result of MPI testing. Please check the Nexus home page http://www.mcs.anl.gov/nexus for the latest information and status pertaining to MPICH on Nexus and new protocol modules and startup mechanisms to be supported in the Nexus distribution.
