I think I have an InfiniBand setup issue. I'm not very experienced with setting up InfiniBand or networks. If I try to force MPICH to use IB, I get errors:

[Bryan@node1 shared]$ ./mpich-3.3.1/bin/mpiexec -hosts=node1,node2 -iface=ib0 -n 4 ./test
[mpiexec@node1] HYDU_sock_get_iface_ip (../../../../mpich-3.3.1/src/pm/hydra/utils/sock/sock.c:496): unable to find interface ib0
[mpiexec@node1] HYDU_sock_create_and_listen_portstr (../../../../mpich-3.3.1/src/pm/hydra/utils/sock/sock.c:550): unable to get network interface IP
[mpiexec@node1] HYD_pmci_launch_procs (../../../../mpich-3.3.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:114): unable to create PMI port
[mpiexec@node1] main (../../../../mpich-3.3.1/src/pm/hydra/ui/mpich/mpiexec.c:332): process manager returned error launching processes

I get a similar problem with OpenMPI. I've been trying to solve this for several days. It's probably something simple that I missed.

More background:
1. I installed a Mellanox ConnectX-3 MCX354A-QCBT in each of two machines, node1 and node2. Port 1 on one card is connected directly to port 1 on the other. There is no switch, and port 2 is not connected.
2. Each machine is running CentOS 7.
3. Set up passwordless ssh.
4. Installed the Mellanox drivers.
5. Installed MPICH in a shared folder.
6. I start the Mellanox drivers on both machines with sudo mst start.
7. I start opensm on node1. It enters master state.

Checking the connection, it looks good:

[Bryan@node1 shared]$ sudo ibnodes
Ca      : 0x0002c90300ee7620 ports 2 "node2 HCA-1"
Ca      : 0x0002c90300ee69e0 ports 2 "node1 HCA-1"

I used ibping as well to test the connection. Everything seems to be fine.

I run a test MPI program without forcing IB, works fine:

[Bryan@node1 shared]$ ./mpich-3.3.1/bin/mpicc -std=c11 MPI_Test.c -o test
[Bryan@node1 shared]$ ./mpich-3.3.1/bin/mpiexec -hosts=node1,node2 -n 4 ./test
Rank 0: Hostname node1
Rank 2: Hostname node1
Rank 1: Hostname node2
Rank 3: Hostname node2

nmcli showed the IB device as disconnected. I thought that MPI didn't need an IP address for IB, but I tried setting up a connection anyway:

[Bryan@node1 shared]$ sudo nmcli con add con-name ib0 ifname ib0 type infiniband ip4 10.0.0.1
[sudo] password for Bryan:
Connection 'ib0' (09669a93-98b3-4fcb-9fed-ea65fff65e24) successfully added.
[Bryan@node1 shared]$ nmcli device status
DEVICE  TYPE        STATE        CONNECTION
enp4s0  ethernet    connected    enp4s0
ib0     infiniband  connected    ib0
ib1     infiniband  unavailable  --
[Bryan@node1 shared]$ nmcli device show ib0
GENERAL.DEVICE:                         ib0
GENERAL.TYPE:                           infiniband
GENERAL.HWADDR:                         A0:00:02:20:FE:80:00:00:00:00:00:00:00:02:C9:03:00:EE:69:E1
GENERAL.MTU:                            2044
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     ib0
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/54
IP4.ADDRESS[1]:                         10.0.0.1/32
IP4.GATEWAY:                            --
IP4.ROUTE[1]:                           dst = 10.0.0.1/32, nh = 0.0.0.0, mt = 150
IP6.ADDRESS[1]:                         fe80::7a74:c87f:2d49:2cfc/64
IP6.GATEWAY:                            --
IP6.ROUTE[1]:                           dst = fe80::/64, nh = ::, mt = 150
IP6.ROUTE[2]:                           dst = ff00::/8, nh = ::, mt = 256, table=255

I did the same with node2 except using 10.0.0.2 as the IP.

If I try to run again forcing IB after setting up the IPs, it just hangs until I press Ctrl+C. What have I missed?

  • Normally you want to use RDMA/Verbs with InfiniBand instead of the IP stack. You only need an IP connection at startup, which can also be Ethernet; the MPI communication itself should then use RDMA/Verbs. I am not sure how you installed MPICH, but you should use `--with-device=ch3:nemesis:mxm` during `configure` according to https://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1-README.txt. Another option would be to use `mvapich` with the `OFA-IB-CH3` interface: http://mvapich.cse.ohio-state.edu/overview/. – Thomas Jul 31 '19 at 10:54
  • Configure was not picking up the mxm libraries. Specifying their location worked. Thank you so much! – Bryan Carroll Oct 26 '19 at 01:52

1 Answer

After adding `--with-device=ch3:nemesis:mxm` as Thomas suggested, the configure process told me the Mellanox libraries could not be found. Adding `--with-mxm=/opt/mellanox/mxm` to the configure options solved the issue.
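
For reference, here is a rough sketch of how the two options fit into a rebuild. Only the two `--with-...` flags come from this thread; `SRC_DIR` and `PREFIX` are placeholder paths I made up for illustration, so adjust them to wherever your source tree and install directory actually live.

```shell
#!/bin/sh
# Sketch only: rebuild MPICH with the MXM netmod.
# MXM_DIR is the path mentioned in this answer.
# SRC_DIR and PREFIX are hypothetical placeholders, not from the original setup.
MXM_DIR="${MXM_DIR:-/opt/mellanox/mxm}"
SRC_DIR="${SRC_DIR:-./mpich-3.3.1-src}"
PREFIX="${PREFIX:-$PWD/mpich-install}"

CONFIGURE_ARGS="--prefix=$PREFIX --with-device=ch3:nemesis:mxm --with-mxm=$MXM_DIR"

if [ -x "$SRC_DIR/configure" ]; then
    # Word splitting of CONFIGURE_ARGS is intentional (paths without spaces assumed).
    (cd "$SRC_DIR" && ./configure $CONFIGURE_ARGS && make && make install)
else
    # No source tree at SRC_DIR: just show what would be run.
    echo "configure not found in $SRC_DIR; would run: ./configure $CONFIGURE_ARGS"
fi
```

After installing, running `mpichversion` from the new `bin` directory should, as far as I know, print the device the build was configured with, which is a quick way to confirm the MXM netmod was picked up.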

The README has more detail than the mpich-3.3.1-installguide.pdf from mpich.org.
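
A side note on the IP-over-IB attempt in the question: the `nmcli device show ib0` output lists the address as `10.0.0.1/32` with a route only to itself. I can't say whether that explains the hang, but with a /32 prefix the peer at 10.0.0.2 is not on-link for `ib0`. The sketch below is plain shell arithmetic (the helper names `ip_to_int` and `same_subnet` are my own) that checks whether a peer falls inside a given address and prefix.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# same_subnet ADDR PREFIX PEER
# Succeeds if PEER lies inside the network ADDR/PREFIX.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$3")
    if [ "$2" -eq 0 ]; then
        mask=0
    else
        mask=$(( (0xFFFFFFFF << (32 - $2)) & 0xFFFFFFFF ))
    fi
    [ $(( a & mask )) -eq $(( b & mask )) ]
}

# Addresses taken from the question; /24 is an assumed alternative prefix.
for prefix in 32 24; do
    if same_subnet 10.0.0.1 "$prefix" 10.0.0.2; then
        echo "/$prefix: 10.0.0.2 is on-link"
    else
        echo "/$prefix: 10.0.0.2 is NOT on-link"
    fi
done
```

If this matters for your setup, giving nmcli an explicit prefix (for example `ip4 10.0.0.1/24` instead of `ip4 10.0.0.1`) would be the usual way to put both nodes in the same subnet. That is my assumption and was not tested in this thread.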