Problems starting programs




General



    1. Q: When trying to start a program with
    mpirun -np 2 cpi 
    
    either I get an error message or the program hangs.

    A: On Intel Paragons and IBM SP1 and SP2, there are many mutually exclusive ways to run parallel programs; each site can pick the approach(es) that it allows. The script mpirun tries one of the more common methods, but may make the wrong choice. Use the -v or -t option to mpirun to see how it is trying to run the program, and then compare this with the site-specific instructions for using your system. You may need to adapt the code in mpirun to meet your needs.


    2. Q: When trying to run a program with, e.g., mpirun -np 4 cpi, I get

    usage : mpirun [options] <executable> [<dstnodes>] [-- <args>] 
    
    or
    mpirun [options] <schema> 
    
    A: You have a command named mpirun in your path ahead of the mpich version. Execute the command
    which mpirun 
    
    to see which command named mpirun was actually found. The fix is to either change the order of directories in your path to put the mpich version of mpirun first, or to define an alias for mpirun that uses an absolute path. For example, in the csh shell, you might do
    alias mpirun /usr/local/mpi/bin/mpirun  
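
    If several directories in your path contain an mpirun, it can help to see all of them in search order rather than only the first one. The following is a minimal sketch in POSIX sh (not csh); the function name list_in_path is made up for this example and is not part of mpich.

```shell
#!/bin/sh
# list_in_path NAME: print every executable file called NAME that is found
# in $PATH, in search order. The first line printed is the one the shell runs.
# The body runs in a subshell, so changing IFS does not affect the caller.
list_in_path() (
    name=$1
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)

list_in_path mpirun
```

    If the mpich version is not on the first line of the output, reorder your path or use the alias shown above.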
    

    3. Q: When I issue the command:
    mpirun -dbx -np 1 foo 
    
    dbx does start up but this message appears:
    dbx version 3.19 Nov  3 1994 19:59:46 
    Unexpected argument ignored: -sr 
    /scr/MPI/me/PId8704 is not an executable 
    
    A: Your version of dbx does not support the -sr argument; this is needed to give dbx the initial commands to execute. You will not be able to use mpirun with the -dbx argument. Try using -gdb or -xxgdb instead of -dbx if you have the GNU debugger.


    4. Q: When attempting to run cpilog I get the following message:

    ld.so.1: cpilog: fatal: libX11.so.4: can't open file: errno 2 
    
    A: The X11 version that configure found isn't properly installed. This is a common problem with Sun/Solaris systems. One possibility is that your Solaris machines are running slightly different versions. You can try forcing static linking (-Bstatic on SunOS).

    Consider adding these lines to your .login (assuming C shell):

    setenv OPENWINHOME /usr/openwin 
    setenv LD_LIBRARY_PATH /opt/SUNWspro/lib:/usr/openwin/lib 
    
    (you may want to check with your system administrator first to make sure that the paths are correct for your system). Make sure that you add them before any line like
    if ($?USER == 0 || $?prompt == 0) exit  
    

    5. Q: My program fails when it tries to write to a file.

    A: If you opened the file before calling MPI_INIT, the behavior of MPI (not just mpich) is undefined. On the ch_p4 version, only process zero (in MPI_COMM_WORLD) will have the file open; the other processes will not have opened the file. Move the operations that open files and interact with the outside world to after MPI_INIT (and before MPI_FINALIZE).


    6. Q: Programs seem to take forever to start.

    A: There are several possible causes. On systems with dynamically-linked executables, the file system may be swamped when many processors simultaneously request the dynamically-linked parts of the executable (this has been measured as a problem with some DFS implementations). Try linking your application statically.

    On workstation networks, long startup times can be due to the time used to start remote processes; see the discussion on the secure server.





Workstation Networks



    1.


    2.


    3.


    4.


    5.


    6.


    7.


    8.


    9.


    10. Q: When running the workstation version (-device=ch_p4), I get error messages of the form

    more slaves than message queues 
    
    A: This means that you are trying to run mpich in one mode when it was configured for another. In particular, you are specifying in your p4 procgroup file that several processes are to share memory on a particular machine, by putting either a number greater than 0 on the first line (where it signifies the number of local processes besides the original one) or a number greater than 1 on any of the succeeding lines (where it indicates the total number of processes sharing memory on that machine). You should either change your procgroup file to specify only one process per line, or reconfigure mpich with
    configure -device=ch_p4 -comm=shared 
    
    which will reconfigure the p4 device so that multiple processes can share memory on each host. The reason this is not the default is that with this configuration you will see busy waiting on each workstation, as the device goes back and forth between selecting on a socket and checking the internal shared-memory queue.
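
    To check a procgroup file for entries that need the shared-memory configuration, something like the following sketch may help. It assumes that the process count is the second whitespace-separated field on each line, as in the description above; the function name procgroup_wants_shared and the file name my.pg are hypothetical.

```shell
#!/bin/sh
# procgroup_wants_shared FILE: succeed (exit status 0) if FILE asks for
# processes that share memory on some machine, that is, a count greater
# than 0 on the first line or greater than 1 on any later line.
# Assumption: the count is the second field of each non-blank line.
procgroup_wants_shared() {
    awk '
        NF == 0 { next }                       # ignore blank lines
        !seen   { seen = 1                     # first line: extra local processes
                  if ($2 + 0 > 0) found = 1
                  next }
        $2 + 0 > 1 { found = 1 }               # later lines: total processes on that host
        END     { exit found ? 0 : 1 }
    ' "$1"
}

if [ -f my.pg ] && procgroup_wants_shared my.pg; then
    echo "my.pg needs mpich configured with -comm=shared"
fi
```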


    11. Q: My programs seem to hang in MPI_Init.

    A: There are a number of ways that this can happen:

      1. One of the workstations you selected to run on is dead (try tstmachines).
      2. You linked with the FSU pthreads package; this has been reported to cause problems, particularly with the system select call that is part of Unix and is used by mpich.
      3. You linked with the library -ldxml (extended math library) on Digital Alpha systems. This has been observed to cause MPI_Init to hang. No workaround is known at this time; contact Digital for a fix if you need to use MPI and -ldxml together.


    12. Q: My program (using device ch_p4) fails with
    p0_2005:  p4_error: fork_p4: fork failed: -1 
                  p4_error: latest msg from perror: Error 0 
    
    A: The executable size of your program may be too large. When a ch_p4 or ch_tcp device program starts, it creates a copy of itself to handle certain communication tasks. Because of the way in which the code is organized, this is (at least temporarily) a full copy of your original program and occupies the same amount of space. Thus, if your program is over half as large as the maximum space available, you will get this error. On SGI systems, you can use the command size to get the size of the executable and swap -l to get the available space. Note that size gives you the size in bytes and swap -l gives you the size in 512-byte blocks. Other systems may offer similar commands.

    A similar problem can happen on IBM SPx using the ch_eui or ch_mpl device; the cause is the same but it originates within the IBM MPL library.
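
    The unit conversion above is easy to get wrong, so here is a small sketch of the comparison. fork_may_fail is a hypothetical helper, the numbers are invented for illustration, and you have to read the two values off the output of size and swap -l yourself.

```shell
#!/bin/sh
# fork_may_fail EXE_BYTES SWAP_BLOCKS: succeed if two copies of the program
# would not fit in the available space.
#   EXE_BYTES    program size in bytes (as reported by size)
#   SWAP_BLOCKS  available space in 512-byte blocks (as reported by swap -l)
fork_may_fail() {
    exe_bytes=$1
    swap_bytes=$(( $2 * 512 ))
    # The program is "over half as large" as the space when 2 * size > space.
    [ $(( exe_bytes * 2 )) -gt "$swap_bytes" ]
}

# Made-up example: a 314572800-byte program and 1000000 free blocks.
if fork_may_fail 314572800 1000000; then
    echo "two copies will not fit"
else
    echo "two copies fit"
fi
```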


    13. Q: Sometimes, I get the error

    Exec format error. Wrong Architecture. 
    
    A: You are probably using NFS (Network File System). NFS can fail to keep files updated in a timely way; this problem can be caused by creating an executable on one machine and then attempting to use it from another. Usually, NFS catches up with the existence of the new file within a few minutes. You can also try using the sync command. mpirun in fact tries to run the sync command, but on many systems, sync is only advisory and will not guarantee that the file system has been made consistent.
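
    If you would rather wait for NFS than guess, one option is to record a checksum of the executable on the machine where you built it and poll on the other machine until the file there matches. This is a sketch only: wait_for_copy is a hypothetical helper and not something mpirun does for you; cksum is the standard POSIX utility.

```shell
#!/bin/sh
# wait_for_copy FILE EXPECTED TRIES: check up to TRIES times, one second
# apart, whether "cksum < FILE" prints EXPECTED. Succeeds as soon as it
# matches; fails if it never does.
wait_for_copy() {
    file=$1
    expected=$2
    tries=$3
    while [ "$tries" -gt 0 ]; do
        if [ -f "$file" ] && [ "$(cksum < "$file")" = "$expected" ]; then
            return 0
        fi
        tries=$(( tries - 1 ))
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# On the build machine:  cksum < cpi     (note the two numbers it prints)
# On the other machine:  wait_for_copy cpi "<those two numbers>" 120
```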


    14.





Intel Paragon



    1. Q: How do I run jobs with mpirun under NQS on my Paragon?

    A: Give mpirun the argument -paragontype nqs.





IBM RS6000



    1. Q: When trying to run on an IBM RS6000 with the ch_p4 device, I got
    % mpirun -np 2 cpi 
    Could not load program /home/me/mpich/examples/basic/cpi  
    Could not load library libC.a[shr.o] 
    Error was: No such file or directory 
    
    A: This means that mpich was built with the xlC compiler but that some of the machines in your util/machines/machines.rs6000 file do not have xlC installed. Either install xlC or rebuild mpich to use another compiler (either xlc or gcc; gcc has the advantage of never having any licensing restrictions).


    2.





IBM SP



    1.


    2.


    3.


    4.


    5.


