
I have installed two Mellanox FDR dual-port ConnectX-3 HCA cards (CX354A), one in each of two machines. The machines are connected directly to each other (switchless configuration). Both ports on the cards are cabled so that port1 connects to port1 and port2 connects to port2. The ports are configured as follows:

HCA1 port1: ib0     inet addr:192.168.10.13  Bcast:192.168.10.255  Mask:255.255.255.0
     port2: ib1     inet addr:192.168.10.15  Bcast:192.168.10.255  Mask:255.255.255.0

HCA2 port1: ib0     inet addr:192.168.10.24  Bcast:192.168.10.255  Mask:255.255.255.0
     port2: ib1     inet addr:192.168.10.26  Bcast:192.168.10.255  Mask:255.255.255.0

I run two opensm instances on HCA1 as shown below, after which ibstat shows that all 4 ports are up and active.

root@HCA1# opensm -g <ib0 GUID> --daemon
root@HCA1# opensm -g <ib1 GUID> --daemon

With the above configuration, I can ping from any of these IP addresses to any of the others.

HOWEVER, when I disconnect the cables for port1, ping does not work between the still-connected port2 pair. If I disconnect the port2 pair and leave only the port1 pair connected, ping works fine, even to the IP addresses of the disconnected port2 (?). What could be the reason for this, and how can I fix the problem? Please mention what extra info I should post.

What I'm trying to achieve is a totally isolated link for each port pair, so that I can run separate Open MPI processes to test and compare the bandwidth of the two InfiniBand cables at the same time. Could anyone advise on how this could be done?

From what I have learnt, I think I need to create a different partition key for each port pair (currently they are using the default pkey 0xffff). However, this default pkey cannot be changed once InfiniBand is configured during boot-up. Any suggestions or advice?

Both machines are running CentOS 6.4 and I have installed Mellanox OFED 1.5.3.

This is the ifconfig output on both machines:

[root@HCA1 Desktop]# ifconfig ib0  
ib0       Link encap:InfiniBand  HWaddr   80:00:00:48:FE:81:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.10.13  Bcast:192.168.10.255  Mask:255.255.255.0  
          inet6 addr: fe80::202:c903:21:8f11/64 Scope:Link  
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1  
          RX packets:4144160 errors:0 dropped:0 overruns:0 frame:0  
          TX packets:4141376 errors:0 dropped:2 overruns:0 carrier:0  
          collisions:0 txqueuelen:1024  
          RX bytes:702746349 (670.1 MiB)  TX bytes:719570861 (686.2 MiB)  


[root@HCA1 Desktop]# ifconfig ib1  
ib1       Link encap:InfiniBand  HWaddr   80:00:00:49:FE:82:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.10.15  Bcast:192.168.10.255  Mask:255.255.255.0  
          inet6 addr: fe80::202:c903:21:8f12/64 Scope:Link  
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1  
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0  
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0  
          collisions:0 txqueuelen:1024  
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)  


[root@HCA2 Desktop]# ifconfig ib0  
ib0       Link encap:InfiniBand  HWaddr   80:00:00:48:FE:81:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.10.24  Bcast:192.168.10.255  Mask:255.255.255.0  
          inet6 addr: fe80::202:c903:21:8f51/64 Scope:Link  
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1  
          RX packets:4141382 errors:0 dropped:0 overruns:0 frame:0  
          TX packets:4144161 errors:0 dropped:2 overruns:0 carrier:0  
          collisions:0 txqueuelen:1024  
          RX bytes:703005597 (670.4 MiB)  TX bytes:719323129 (685.9 MiB)  


[root@HCA2 Desktop]# ifconfig ib1  
ib1       Link encap:InfiniBand  HWaddr   80:00:00:49:FE:82:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.10.26  Bcast:192.168.10.255  Mask:255.255.255.0  
          inet6 addr: fe80::202:c903:21:8f52/64 Scope:Link  
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1  
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0  
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0  
          collisions:0 txqueuelen:1024  
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)  

The loaded modules are as below:

[root@HCA1 Desktop]# /etc/init.d/openibd status

  HCA driver loaded

Configured IPoIB devices:
ib0 ib1

Currently active IPoIB devices:
ib0
ib1

The following OFED modules are loaded:

  rdma_ucm  
  rdma_cm  
  ib_addr  
  ib_ipoib  
  mlx4_core  
  mlx4_ib  
  mlx4_en  
  ib_mthca  
  ib_uverbs  
  ib_umad  
  ib_ucm  
  ib_sa  
  ib_cm  
  ib_mad  
  ib_core  
  iw_cxgb3  
  iw_nes  
Tom O'Connor
FC Yit
  • What does your /etc/default/opensm file contain? Does it have PORTS=ALL? – hookenz Jun 13 '13 at 08:09
  • Which OS are you using? – hookenz Jun 13 '13 at 08:10
  • I suspect that pair2 isn't actually working. You're reaching both IP's on the opposite machine via pair1 (ib0). – hookenz Jun 13 '13 at 08:17
  • Matt, Thanks for your comment. There is no such link as you mentioned. However I can find this file: /etc/opensm/opensm.conf. It's auto-generated with command # opensm -c -o. I have changed the subnet_prefix in this file. – FC Yit Jun 13 '13 at 08:32
  • The OS I'm using is Linux CentOS ver6.3 (not 6.4 as initially posted). – FC Yit Jun 13 '13 at 08:34
  • Yes, I agree with your point that pair2 isn't actually working and that the IPs are reached via pair1. However, I have no idea why the IP of port2 can be reached via pair1. What I want is an isolated link, where pair1 shouldn't be able to reach the port2 IPs. Are the two ports of a dual-port card merged by default, on the assumption that they will be used for high availability (redundancy)? – FC Yit Jun 13 '13 at 08:39
  • Do you have link lights on both pairs? – hookenz Jun 13 '13 at 20:06

2 Answers


OK, I'm not entirely familiar with the setup on CentOS, but here is what I think is happening: one or both copies of opensm are working on the ib0 link but not on the other, ib0 being the default for OpenSM.

As I understand it, you need two copies of opensm running in this particular setup: without a switch binding all the HCAs together, it is essentially two fabrics, and you need to run a subnet manager on each fabric. You've correctly picked that up, but you haven't actually run them correctly (specifically the second instance).

Ping appears to work when both pairs are connected because Linux is passing the ping to the second interface and responding for both IPs. All of that is working over ib0 (pair1).
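
One way to check this, as a rough sketch: ask the kernel which interface it would use to reach the remote port2 address, then force a ping out of ib1. The helper name `route_dev` is hypothetical; `ip route get` and `ping -I` are standard Linux tools, and the addresses are the ones from the question.

```shell
# route_dev: hypothetical helper that extracts the outgoing interface
# name from `ip route get` output read on stdin.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# On HCA1, which interface would carry traffic to HCA2's port2 address?
#   ip route get 192.168.10.26 | route_dev
# Force the ping out of ib1; if this fails while a plain ping succeeds,
# the replies were arriving over pair1:
#   ping -I ib1 -c 3 192.168.10.26
```

If the first command reports ib0 for a port2 address, that would be consistent with all traffic going over pair1.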

Under Ubuntu, which I'm familiar with, there is a config file, /etc/default/opensm.

It sounds like it's different on CentOS. On Ubuntu, that file is used to start opensm on the right ports, because you need an opensm subnet manager on each port.

Basically, what you want to do is not run

opensm -g --daemon

twice, but instead first run

/usr/sbin/ibstat -p

which will give output like:

0x001a4bffff0c34e5
0x001a4bffff0c34e6

Then run

opensm -g 0x001a4bffff0c34e5 --daemon 
opensm -g 0x001a4bffff0c34e6 --daemon 

Under Ubuntu, the init script automates that process for PORTS=ALL (read from /etc/default/opensm), where ALL is a keyword picked up by the init script.

There is likely an init script for opensm under CentOS. In the meantime, the above commands can be used, or you can write your own startup script.
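
As a minimal sketch of such a startup script (not the Ubuntu init script itself), the function below starts one opensm instance per port GUID. The function name `start_opensm_per_port` and the `DRY_RUN` variable are made up for illustration; the only opensm options used are the `-g` and `--daemon` ones shown above.

```shell
#!/bin/sh
# Sketch: start one opensm instance per local HCA port.
# Reads port GUIDs (one per line, as printed by `ibstat -p`) from stdin.
# Set DRY_RUN=1 (hypothetical switch) to print the commands instead of
# running them.
start_opensm_per_port() {
    while read -r guid; do
        # Skip blank lines.
        [ -n "$guid" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "opensm -g $guid --daemon"
        else
            # Detach stdin so opensm cannot consume the remaining GUIDs.
            opensm -g "$guid" --daemon < /dev/null
        fi
    done
}

# Typical use on the machine that hosts the subnet managers:
#   /usr/sbin/ibstat -p | start_opensm_per_port
```

Running it once with `DRY_RUN=1` first shows which GUIDs would be used before anything is started.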


UPDATE: I'm not sure whether it will make a difference, but I also have the following two kernel modules loaded, which you don't:

ib_ipath
ib_qib

Have you also flashed your HCAs with the latest firmware? This is actually quite important. Don't assume they come with the latest version from the factory.

hookenz
  • Thank you. Oh sorry for the typo in my initial posting. I believe what i have done was what you mentioned here. I used ibstat -p to check the guid of each port and actually run 2 instances of opensm -g (port1 guid) --daemon and opensm -g (port2 guid) --daemon to bring up the infiniband interfaces. – FC Yit Jun 13 '13 at 09:04
  • What does the output of ifconfig show on both hosts? – hookenz Jun 13 '13 at 10:10
  • Also what does your /etc/modules contain? – hookenz Jun 13 '13 at 10:22
  • This are the output of ifconfig – FC Yit Jun 13 '13 at 10:26
  • I have added the output of ifconfig in my question posting. There is no /etc/modules in my machine but I include output of openibd status to show what modules were loaded. – FC Yit Jun 13 '13 at 10:43

As far as I can see, two different physical subnets are configured with the same subnet address, 192.168.10.0. I think you should assign different subnet addresses to solve this issue.
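
A minimal sketch of what that could look like on CentOS, assuming the port2 pair is moved to 192.168.11.0/24 (an arbitrary choice, not something from the question) while the port1 pair stays on 192.168.10.0/24. This would be the content of /etc/sysconfig/network-scripts/ifcfg-ib1 on HCA1; HCA2 would use the same file with a different host address, for example 192.168.11.26.

```shell
# /etc/sysconfig/network-scripts/ifcfg-ib1 on HCA1
# Assumption: 192.168.11.0/24 is unused elsewhere on these machines.
DEVICE=ib1
BOOTPROTO=static
IPADDR=192.168.11.15
NETMASK=255.255.255.0
ONBOOT=yes
```

With each pair on its own subnet, the kernel has one connected route per interface, so traffic to 192.168.11.x should leave through ib1 rather than ib0.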

Veniamin