
I'll preface this with the note that I'm not a sysadmin (anymore); I'm a scientist, but I'm trying to help our sysadmins out here. That means I can't immediately try anything that requires hardware changes or sudo, but I'll provide as much information as I can.

I'm trying to run program foo, written in Fortran and compiled with PGI and MPICH, across 8 nodes using MPI (the nodes are named node01, node02, etc.). Each node has 8 cores, and I'm trying to use all of them. The problem is that node08 isn't playing nicely with the rest, and I'm not quite sure why. To get the simple stuff out of the way: ssh, file transfer, etc. all work fine between node08 and every other node; I've set up passwordless ssh between all nodes (and it works on node08); the MPI executable is identical everywhere; and everything works when using any other set of nodes. It is also not a core-count issue: running 16 cores across node06 and node07 works fine, while running the same number of cores across node07 and node08 fails.

Below is how I run my jobs (I have tried the same thing with mpiexec), followed by example MPI output for each case. The MPICH version is 1.4.1 and is the same on every machine. Both runs were launched from node07, with only the machs file changed to include either node06 or node08.

mpiexec.hydra -iface eth1 -verbose -machinefile machs -np 16 ./executable
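
For reference, machs is just a plain host list. The contents below are my own illustration of the format Hydra accepts (one host per line, optionally with a :ranks count), not a copy of the actual file; for the working run, node06 takes the place of node08:

node07:8
node08:8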

When I run this command across node07 and node08, I get the following MPI output immediately before it hangs (to be clear, it hangs at 100% CPU on all cores on both nodes; I've also blocked out the IP addresses and the kvsname):

[mpiexec@node07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_xx14_0 key=P0-businesscard
[mpiexec@node07] PMI response to fd 7 pid 5: cmd=get_result rc=0 msg=success value=description#""$port#38204$ifname#xx.xx.xx.7$
[proxy:0:1@node08] got pmi command (from 5): get
kvsname=kvs_xx14_0 key=P0-businesscard
[proxy:0:1@node08] forwarding command (cmd=get kvsname=kvs_xx14_0 key=P0-businesscard) upstream
[proxy:0:1@node08] got pmi command (from 7): get
kvsname=kvs_xx14_0 key=P0-businesscard
[proxy:0:1@node08] forwarding command (cmd=get kvsname=kvs_xx14_0 key=P0-businesscard) upstream
[proxy:0:1@node08] got pmi command (from 10): get
kvsname=kvs_xx14_0 key=P0-businesscard
[proxy:0:1@node08] forwarding command (cmd=get kvsname=kvs_xx14_0 key=P0-businesscard) upstream
[proxy:0:1@node08] got pmi command (from 16): get
kvsname=kvs_xx14_0 key=P0-businesscard
[proxy:0:1@node08] forwarding command (cmd=get kvsname=kvs_xx14_0 key=P0-businesscard) upstream
[mpiexec@node07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_xx14_0 key=P0-businesscard
[mpiexec@node07] PMI response to fd 7 pid 7: cmd=get_result rc=0 msg=success value=description#""$port#38204$ifname#xx.xx.xx.7$
[proxy:0:1@node08] we don't understand the response get_result; forwarding downstream
[mpiexec@node07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_xx14_0 key=P0-businesscard
[mpiexec@node07] PMI response to fd 7 pid 10: cmd=get_result rc=0 msg=success value=description#""$port#38204$ifname#xx.xx.xx.7$
[mpiexec@node07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_xx14_0 key=P0-businesscard
[mpiexec@node07] PMI response to fd 7 pid 16: cmd=get_result rc=0 msg=success value=description#""$port#38204$ifname#xx.xx.xx.7$
[proxy:0:1@node08] we don't understand the response get_result; forwarding downstream
[proxy:0:1@node08] we don't understand the response get_result; forwarding downstream
[proxy:0:1@node08] we don't understand the response get_result; forwarding downstream

When running the same command across node06 and node07, this is the output at the point where the previous run hangs (and this run carries on normally):

[proxy:0:1@node07] got pmi command (from 22): get
kvsname=kvs_xxx00_0 key=P0-businesscard
[proxy:0:1@node07] forwarding command (cmd=get kvsname=kvs_xxx00_0 key=P0-businesscard) upstream
[mpiexec@node07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_xxx00_0 key=P0-businesscard
[mpiexec@node07] PMI response to fd 6 pid 22: cmd=get_result rc=0 msg=success value=description#""$port#48883$ifname#xxx.xx.xx.6$
[proxy:0:1@node07] got pmi command (from 6): get
kvsname=kvs_xxx00_0 key=P0-businesscard
[proxy:0:1@node07] forwarding command (cmd=get kvsname=kvs_xxx00_0 key=P0-businesscard) upstream
[mpiexec@node07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_xxx00_0 key=P0-businesscard
[mpiexec@node07] PMI response to fd 6 pid 6: cmd=get_result rc=0 msg=success value=description#""$port#48883$ifname#xxx.xx.xx.6$
[proxy:0:1@node07] we don't understand the response get_result; forwarding downstream
[proxy:0:1@node07] got pmi command (from 11): get
kvsname=kvs_xxx00_0 key=P0-businesscard
[proxy:0:1@node07] forwarding command (cmd=get kvsname=kvs_xxx00_0 key=P0-businesscard) upstream
[proxy:0:1@node07] got pmi command (from 16): get
kvsname=kvs_xxx00_0 key=P0-businesscard
[proxy:0:1@node07] forwarding command (cmd=get kvsname=kvs_xxx00_0 key=P0-businesscard) upstream
[mpiexec@node07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_xxx00_0 key=P0-businesscard
[mpiexec@node07] PMI response to fd 6 pid 11: cmd=get_result rc=0 msg=success value=description#""$port#48883$ifname#xxx.xx.xx.6$
[proxy:0:1@node07] we don't understand the response get_result; forwarding downstream
[mpiexec@node07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_xxx00_0 key=P0-businesscard
[mpiexec@node07] PMI response to fd 6 pid 16: cmd=get_result rc=0 msg=success value=description#""$port#48883$ifname#xxx.xx.xx.6$
[proxy:0:1@node07] we don't understand the response get_result; forwarding downstream
[proxy:0:1@node07] we don't understand the response get_result; forwarding downstream

To be clear: this code works fine across every other machine we have, and this is not an issue on any other deployment we've tried; it's just this one. I should also note that the nodes have IPs corresponding to their node numbers, so xx.xx.xx.7 is node07, xx.xx.xx.8 is node08, etc. Running on just node08 with all 8 cores works just fine; the problem only appears when combining any cores on node08 with any cores on any other machine.

Edited to add: ssh is the launcher (and the only one that we have access to).
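
For completeness, this is what the invocation would look like with the launcher pinned explicitly (I believe Hydra in this version accepts a -launcher ssh option; our actual runs simply rely on ssh being the default):

mpiexec.hydra -launcher ssh -iface eth1 -verbose -machinefile machs -np 16 ./executable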

Laxsnor
  • Following up on this, it appears that the specific line where it hangs is an HDF5 file close call (sketched below for context). I'm trying to investigate this further and also update my parallel HDF5. – Laxsnor Mar 07 '18 at 22:54
  • Okay, so I actually have no idea why this happened, but I found a workaround. node08 is the local host of the filesystem that I am outputting to, which was somehow causing the problem. If I output files to another filesystem, I'm OK. I'm still open to actual answers, though. – Laxsnor Mar 08 '18 at 22:36
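
For context, here is a minimal sketch of the kind of parallel HDF5 open/close sequence the comment above refers to. The program name, file name, and the omission of the actual dataset writes are my own simplifications for illustration; this is not foo's real I/O code:

program phdf5_close_sketch
  use mpi
  use hdf5
  implicit none
  integer :: hdferr, mpierr
  integer(hid_t) :: fapl_id, file_id

  call MPI_Init(mpierr)
  call h5open_f(hdferr)

  ! File-access property list requesting MPI-IO on MPI_COMM_WORLD
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
  call h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)

  ! Collective file create; every rank must participate
  call h5fcreate_f("output.h5", H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=fapl_id)

  ! ... collective dataset writes happen here in the real code ...

  call h5pclose_f(fapl_id, hdferr)
  call h5fclose_f(file_id, hdferr)   ! collective close: the call that never returns in the failing runs
  call h5close_f(hdferr)
  call MPI_Finalize(mpierr)
end program phdf5_close_sketch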

0 Answers