I have two identical computers with Mellanox cards connected to each other through a cable. No switch. Using opensm.

I have run several tests, including ping_pong tests, ibping, etc. They all seem to work. However, when I run this test, it comes back with what appears to be an error, which I don't understand.

I did tell the firewall to accept TCP and UDP traffic from the InfiniBand subnet:

sudo iptables -I INPUT -p tcp -s 192.168.0.0/24  -j ACCEPT -m comment --comment "Allow Infiniband"

sudo iptables -I INPUT -p udp -s 192.168.0.0/24  -j ACCEPT -m comment --comment "Allow Infiniband"
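To confirm that rules like these actually let the out-of-band TCP exchange through (the "Data ex. method : Ethernet" line in the output), a plain TCP connect from the other node can help. This is only a minimal sketch: the function name `tcp_port_open` is made up for illustration, and the port 18515 in the usage comment is what I believe is the perftest default, so treat it as an assumption and adjust it if the port was changed with `-p`.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Hypothetical usage, run on node1 while the server side is waiting on node2:
#   tcp_port_open("192.168.0.1", 18515)   # 18515 is an assumed default port
```

Note that probing the port this way will probably use up the waiting server's single connection, so the server side most likely has to be restarted before the real test.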

Any help deciphering and a possible solution would be great.

[idf@node2 Downloads]$ sudo ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF      Device         : mlx4_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC       Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x01 QPN 0x004a PSN 0xa79f2e RKey 0x50042a04 VAddr 0x007f1682804000
 remote address: LID 0x02 QPN 0x004a PSN 0x5ef914 RKey 0x40042502 VAddr 0x007f94f9ce9000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
 Unable to read to socket/rdam_cm
 Failed to exchange data between server and clients
[idf@node2 Downloads]$


[idf@node1 python]$ sudo ib_write_bw 192.168.0.1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF      Device         : mlx4_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC       Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x004a PSN 0x5ef914 RKey 0x40042502 VAddr 0x007f94f9ce9000
 remote address: LID 0x01 QPN 0x004a PSN 0xa79f2e RKey 0x50042a04 VAddr 0x007f1682804000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 1600.000000 != 1733.000000
Can't produce a report
[idf@node1 python]$ 
Ivan

2 Answers

It turns out this has been seen before. I don't like the answer because it seems to sweep the problem under the rug, but it is an answer nonetheless:

http://linuxtoolkit.blogspot.com/2013/01/errors-when-running-doing-ib-testing.html

Ivan
  • That's because CPU frequency scaling is enabled. Set the CPU to performance mode in the BIOS and that error will go away. Does lsmod show rdma_ucm and the other modules I mentioned in my answer? If it doesn't, then that is your issue: modprobe them on both machines and try again. Also make sure all the required packages are installed. – hookenz May 18 '15 at 19:35
  • Gotcha. Let me see if I can change that... – Ivan May 19 '15 at 02:20
  • That worked. On CentOS 7, I ran "cpupower frequency-set --governor performance". – Ivan May 23 '15 at 03:28
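For anyone who wants to check whether a machine is likely to trigger the "Conflicting CPU frequency values detected" message before running the benchmark, here is a rough sketch. It assumes the tool compares the per-core "cpu MHz" values from /proc/cpuinfo; the function names and the 2% tolerance are illustrative assumptions, not taken from the tool itself.

```python
import re


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value listed for each core, in file order."""
    pattern = r"^cpu MHz\s*:\s*([0-9.]+)"
    return [float(v) for v in re.findall(pattern, cpuinfo_text, re.MULTILINE)]


def frequencies_conflict(mhz_values, tolerance=0.02):
    """True if any core differs from the first core by more than `tolerance`.

    `tolerance` is a relative difference; 0.02 (2%) is an assumed value.
    """
    if not mhz_values or mhz_values[0] == 0:
        return False
    base = mhz_values[0]
    return any(abs(v - base) / base > tolerance for v in mhz_values[1:])


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            values = parse_cpu_mhz(f.read())
    except OSError:
        values = []  # not on Linux, or /proc is unavailable
    state = "conflicting" if frequencies_conflict(values) else "consistent"
    print(values, state)
```

With the two values from the error in the question (1600 and 1733 MHz), `frequencies_conflict` reports a conflict. Once the governor is set to performance, the cores should report close to the same frequency.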
This is usually the result of not having all the required modules loaded in the kernel. They don't load by default. I'm not sure how CentOS deals with it, but on Ubuntu you need to list these modules in /etc/modules so that the kernel loads them at boot.

mlx4_ib
rdma_ucm
ib_umad
ib_uverbs
ib_ipoib

I assume ib_ipoib and mlx4_ib are already loaded, or else you wouldn't have IP networking over InfiniBand working.
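A quick way to see which of the modules listed above are missing is to compare the list against /proc/modules, which is where lsmod gets its data. This is a minimal sketch; the helper names are made up for illustration, and it only sees loadable modules, so a driver compiled into the kernel would be reported as missing.

```python
# Module names taken from the list in this answer.
REQUIRED_MODULES = ("mlx4_ib", "rdma_ucm", "ib_umad", "ib_uverbs", "ib_ipoib")


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules content.

    Each line of /proc/modules starts with the module name.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that are not currently loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    for name in missing_modules(text):
        print("not loaded:", name)
```

Anything it prints can then be loaded with modprobe on both machines.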

You will also need to install libmlx4 if you haven't already.

Failing that, try this link, which lists all the required packages for CentOS. (Note: libmthca is for an older Mellanox chipset [InfiniHost], so you won't need it in your case.)

https://sort.symantec.com/public/documents/sfha/6.1/linux/productguides/html/sfrac_install/apls05s02.htm

hookenz
  • He obviously got mlx4_ib loaded since ib_write_bw is using mlx4_0. – haggai_e May 18 '15 at 06:00
  • Yes, but if you don't have ib_uverbs and rdma_ucm loaded by the kernel, only some tools work: the ones that use send/recv do, but the ones that use RDMA send/recv don't. – hookenz May 18 '15 at 19:31
  • Without ib_uverbs you wouldn't see mlx4_0 in user space tools like ib_write_bw – haggai_e May 18 '15 at 19:40
  • @haggai All I can say is that I've had this issue before, although under Ubuntu. I'm telling Ivan to ensure all the required packages and kernel modules are installed; then it should just work. – hookenz May 18 '15 at 20:10