I have installed two dual-port FDR InfiniBand VPI HBAs, one in each of two servers running CentOS 6.9:
server1>lspci
03:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
server2>lspci
81:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
I want to use these for high-speed NFSv4 connectivity (probably via RDMA) between the two machines, which are directly attached to each other over a 2-meter 56 Gbps passive QSFP+ cable; the mount I ultimately have in mind is sketched at the end of this post. I have done the following on both machines (substituting the correct PCI address below):
yum -y install rdma infiniband-diags
chkconfig rdma on
service rdma start
printf "0000:XX:00.0 eth eth\n" >> /etc/rdma/mlx4.conf
echo eth > /sys/bus/pci/devices/0000:XX:00.0/mlx4_port1
echo eth > /sys/bus/pci/devices/0000:XX:00.0/mlx4_port2
modprobe -r mlx4_core
modprobe mlx4_core
modprobe ib_umad
cp -f ifcfg-eth4 /etc/sysconfig/network-scripts/ifcfg-eth4
cp -f ifcfg-eth5 /etc/sysconfig/network-scripts/ifcfg-eth5
chmod 644 /etc/sysconfig/network-scripts/ifcfg-*
chcon system_u:object_r:net_conf_t:s0 /etc/sysconfig/network-scripts/ifcfg-*
ifup eth4
ifup eth5
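As a sanity check after the module reload, I believe the port mode can be confirmed as follows; both sysfs files should print "eth", and both ports should report an Ethernet link layer (ibv_devinfo is from libibverbs-utils):

cat /sys/bus/pci/devices/0000:XX:00.0/mlx4_port1
cat /sys/bus/pci/devices/0000:XX:00.0/mlx4_port2
ibv_devinfo | grep -E 'port:|link_layer'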
An example network configuration file (e.g. ifcfg-eth4) looks like this, with the appropriate MAC and IP address substituted for each port:
DEVICE=eth4
HWADDR=XX:XX:XX:XX:XX:XX
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
USERCTL=no
NETWORK=10.72.1.0
NETMASK=255.255.255.0
IPADDR=XXX.XXX.XXX.XXX
There are three other similar files (two per machine in total), differing only in DEVICE, HWADDR, and IPADDR, and ifup and ifdown work for both interfaces on both machines. Additionally, the expected routes exist:
server1>ip route show
10.72.1.0/24 dev eth4 proto kernel scope link src 10.72.1.3
10.72.1.0/24 dev eth5 proto kernel scope link src 10.72.1.4
...
This is where things start going badly. ibstat reports:
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.11.500
Hardware version: 0
Node GUID: 0xf45...
System image GUID: 0xf45...
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x04010000
Port GUID: 0xf6...
Link layer: Ethernet
Port 2:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x04010000
Port GUID: 0xf6...
Link layer: Ethernet
Both machines show the same thing: "State: Down" and "Physical state: Disabled". The status lights on the HBAs themselves are dark. I have tried every combination of connections between the two machines, including connecting each card back to itself.
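Since the ports are in Ethernet mode, I would expect the same state to be visible from the network side as well; a quick cross-check (interface names as above):

ethtool eth4 | grep -E 'Speed|Link detected'
ip -s link show eth4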
I have read about the need for opensm, and I tried installing it, but despite what seems like correct configuration, it fails:
May 09 20:18:14 888369 [A8697700] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1
May 09 20:18:14 888418 [A8697700] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
May 09 20:18:14 888436 [A8697700] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
Further, I have read others say that opensm is not needed for this type of configuration. That would make some sense to me: opensm manages InfiniBand fabrics, and with both ports in Ethernet mode there is presumably no InfiniBand port for it to bind to, which could explain the bind error above, but I am not certain.
At this point, I do not know whether this suggests that one or both cards are bad, the cable is bad, some aspect of my configuration is bad, or something else. I have also tried yum -y groupinstall "Infiniband Support", but it did not help, and I subsequently removed the extraneous packages.
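One further low-level check I can think of is querying the adapters directly with mstflint (from the mstflint package); if the cards respond at all, this should at least print the firmware version and GUIDs, though I am not sure how much that would prove:

mstflint -d 03:00.0 query    # 81:00.0 on server2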
What I have not done is reboot the machines, because that is not presently an option, but I thought the modprobe -r; modprobe sequence would be equivalent, and everything related to module loading seems to be working correctly.
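For context, the mount I ultimately have in mind, once the link is up, is along these lines, based on my reading of the kernel's nfs-rdma documentation (the export path and mount point are hypothetical):

# on the server (with /export shared as usual via /etc/exports)
modprobe svcrdma
echo rdma 20049 > /proc/fs/nfsd/portlist
# on the client
modprobe xprtrdma
mount -t nfs -o vers=4,proto=rdma,port=20049 10.72.1.3:/export /mnt/export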
I would appreciate any thoughts!