0

I got InfiniBand running on RHEL 6.3:

[root@master ~]# ibv_devinfo 
hca_id: mthca0
transport:          InfiniBand (0)
fw_ver:             4.7.927
node_guid:          0017:08ff:ffd0:6f1c
sys_image_guid:         0017:08ff:ffd0:6f1f
vendor_id:          0x08f1
vendor_part_id:         25208
hw_ver:             0xA0
board_id:           VLT0060010001
phys_port_cnt:          2
    port:   1
        state:          PORT_ACTIVE (4)
        max_mtu:        2048 (4)
        active_mtu:     2048 (4)
        sm_lid:         2
        port_lid:       3
        port_lmc:       0x00
        link_layer:     InfiniBand

    port:   2
        state:          PORT_DOWN (1)
        max_mtu:        2048 (4)
        active_mtu:     512 (2)
        sm_lid:         0
        port_lid:       0
        port_lmc:       0x00
        link_layer:     InfiniBand

But it only works as root.

When I try as a non-root user, I get nothing:

[nicolas@master ~]$ ibv_devices
device                 node GUID
------              ----------------
mthca0              001708ffffd06f1c

So, how can I allow regular users to use InfiniBand?

user1219721

4 Answers

3

OK, this is a bug in the RHEL 6.3 release.

A udev rules file is missing. Create it with the content below:

/etc/udev/rules.d/90-rdma.rules

KERNEL=="umad*", SYMLINK+="infiniband/%k"
KERNEL=="issm*", SYMLINK+="infiniband/%k"
KERNEL=="ucm*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uat", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="ucma", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", SYMLINK+="infiniband/%k", MODE="0666"

See https://www.centos.org/modules/newbb/viewtopic.php?topic_id=38586&forum=55
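To confirm the rules took effect, a check along these lines might help. It is only a sketch: it assumes GNU stat and that the device nodes show up under /dev/infiniband, and the function name check_ib_modes is made up for illustration.

```shell
#!/bin/bash
# Sketch: print the octal mode of the verbs/CM device nodes in a directory
# and flag any node that is not world read/write (0666).
# The directory defaults to /dev/infiniband (an assumption about the layout).
check_ib_modes() {
    local dir="${1:-/dev/infiniband}"
    local rc=0 node mode
    for node in "$dir"/uverbs* "$dir"/ucm* "$dir"/rdma_cm; do
        [ -e "$node" ] || continue           # skip globs that matched nothing
        mode=$(stat -L -c '%a' "$node")      # -L follows a udev symlink
        if [ "$mode" = "666" ]; then
            echo "ok   $mode $node"
        else
            echo "FAIL $mode $node"
            rc=1
        fi
    done
    return $rc
}

# Possible use, as root, after creating the rules file:
#   udevadm control --reload-rules
#   udevadm trigger
#   check_ib_modes
```

The two udevadm commands should re-apply the rules on systems that ship udevadm; if they do not, a reboot has the same effect.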

user1219721
1

It is better to simply update the package with the repaired version, rdma-3.3-4. More details here: http://rhn.redhat.com/errata/RHBA-2012-1423.html
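A rough way to tell whether a node already has the repaired package is to compare version strings. This is a sketch that assumes GNU sort with the -V option; version_at_least is a hypothetical helper name, and 3.3-4 is the version named above.

```shell
#!/bin/bash
# Sketch: succeed if version $1 is at least version $2.
# Relies on GNU sort -V (version sort): the smaller of the two comes first.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible use on an RPM-based system (run as root for the update):
#   installed=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}\n' rdma)
#   version_at_least "$installed" "3.3-4" || yum update rdma
```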

0

Here is more complete information for anyone trying to solve this issue. I ran into it on RHEL 6.3 with this kernel: Linux 2.6.32-279.9.1.el6.x86_64 #1 SMP Fri Aug 31 09:04:24 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

**#ibstat**
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.10.700
        Hardware version: 0
        Node GUID: 0x0002c90300129780
        System image GUID: 0x0002c901013029781
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0x0002c901013029781
                Link layer: InfiniBand

1. Create the missing file as root:

**vi /etc/udev/rules.d/90-rdma.rules**

------------ cut here ------------
KERNEL=="umad*", SYMLINK+="infiniband/%k"
KERNEL=="issm*", SYMLINK+="infiniband/%k"
KERNEL=="ucm*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="uat", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="ucma", SYMLINK+="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", SYMLINK+="infiniband/%k", MODE="0666"
------------ cut here ------------

Do this on the management node (i.e. the head node, service node, etc.).

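2. Copy this file, via ssh or any other preferred method, to every compute node in the cluster. The commands below assume /home/90-rdma.rules is visible on each node (for example through a shared /home):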

**#ssh compute000 cp /home/90-rdma.rules /etc/udev/rules.d/90-rdma.rules**

**#ssh compute001 cp /home/90-rdma.rules /etc/udev/rules.d/90-rdma.rules**

**#ssh compute002 cp /home/90-rdma.rules /etc/udev/rules.d/90-rdma.rules**

**#ssh compute003 cp /home/90-rdma.rules /etc/udev/rules.d/90-rdma.rules**

**#ssh compute004 cp /home/90-rdma.rules /etc/udev/rules.d/90-rdma.rules**

and so on for the remaining nodes.
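With many nodes, generating the commands in a loop saves typing. The sketch below swaps the ssh/cp approach above for scp from the management node, so it does not need a shared /home. The node count and the copy_commands name are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: print one scp command per compute node instead of typing each.
# Node names follow the compute000, compute001, ... pattern used above.
RULES=/etc/udev/rules.d/90-rdma.rules

copy_commands() {
    local last="$1" i
    for i in $(seq 0 "$last"); do
        printf 'scp %s compute%03d:%s\n' "$RULES" "$i" "$RULES"
    done
}

# Review the list first, then pipe it to a shell to actually run it:
#   copy_commands 4
#   copy_commands 4 | sh
```

Printing the commands before running them keeps the step easy to review on a production cluster.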

3. Verify that the file exists in /etc/udev/rules.d on every compute node:

**#ssh compute000 ls /etc/udev/rules.d | grep rdm**

90-rdma.rules
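The same check can be run across all nodes at once. This is a sketch, not a tested cluster tool: missing_nodes is a hypothetical name, and the REMOTE variable exists only so the loop can be tried without real nodes.

```shell
#!/bin/bash
# Sketch: print the name of every node that does NOT have the rules file.
# REMOTE is the command used to reach a node; it defaults to ssh.
REMOTE="${REMOTE:-ssh}"
RULES=/etc/udev/rules.d/90-rdma.rules

missing_nodes() {
    local node
    for node in "$@"; do
        # "test -f" runs on the remote node; stdin is detached so the
        # remote command cannot consume input meant for the caller
        $REMOTE "$node" test -f "$RULES" </dev/null || echo "$node"
    done
}

# Possible use:
#   missing_nodes compute000 compute001 compute002
```

An empty result means every listed node has the file.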

4. Restart all the compute nodes and management nodes.

NOTE: a. After the change, running ibv_devices will still give this result:

[root@master ~]# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              0002c901013029781

But don't worry: just run your preferred MPI application and it will work fine.

b. The issue does not depend on the HCA vendor; it is tied directly to the OS.

c. This seems to be caused by an upstream change to the rdma package (it no longer ships udev rules), so the InfiniBand devices get created by the kernel with the wrong permissions. The problem has also been reported by users of CentOS 6.3 and Scientific Linux 6.3.

Hope this helps others.

Mark Henderson
Florin
0

I guess you ran into a situation similar to mine.

I ran rping and ib_write_bw and got output like:

Couldn't allocate MR

As Dotan explained:

I suspect that you are working as a non-root user and there is a limit to the amount of memory pages that can be locked (i.e. pinned). Increasing this size should solve the problem.

Thanks Dotan

The solution is simple, as Dotan describes here: https://www.rdmamojo.com/2014/10/11/working-rdma-redhatcentos-7/

Edit the file /etc/security/limits.conf and add the following lines:

    * soft memlock unlimited
    * hard memlock unlimited
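To double-check the edit, something like the following could be used. It is a sketch: has_unlimited_memlock is a hypothetical helper, it only looks at the one file it is given, and it assumes the entries use the "*" (all users) domain.

```shell
#!/bin/bash
# Sketch: succeed if a limits file sets both the soft and the hard memlock
# limit to unlimited for all users (the "*" domain).
has_unlimited_memlock() {
    local file="${1:-/etc/security/limits.conf}" kind
    for kind in soft hard; do
        grep -Eq "^[[:space:]]*\*[[:space:]]+${kind}[[:space:]]+memlock[[:space:]]+unlimited" "$file" || return 1
    done
}

# Possible use:
#   has_unlimited_memlock && echo "limits.conf looks right"
# The limit only applies to new login sessions, so log in again as the
# regular user and check what the shell actually got:
#   ulimit -l
```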
Y00