I have two nodes connected through an IB switch, each with a dual-port Mellanox ConnectX-3 VPI HCA. The nodes are two-socket machines with Haswell CPUs and two 16 GB DIMMs per socket (64 GB total). Everything seems to work, except that the performance numbers are lower than I'd expect.
When I run the ib_read_bw benchmark on a single port:
server# ib_read_bw --report_gbits
client# ib_read_bw server --report_gbits
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 1000 37.76 37.76 0.072016
---------------------------------------------------------------------------------------
But when I run it in dual-port mode (-O):
server# ib_read_bw --report_gbits -O
client# ib_read_bw server --report_gbits -O
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 2000 52.47 52.47 0.100073
---------------------------------------------------------------------------------------
That is less than a 40% improvement. Am I wrong to expect roughly 2x the single-port bandwidth?
I don't know what the bottleneck could be, or how to find it.
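One thing I tried ruling out myself is the PCIe link. Assuming the HCA sits in a PCIe 3.0 x8 slot (which I believe is the ConnectX-3 spec), a back-of-envelope calculation suggests the slot itself caps out below the 80 Gb/s that two ports would need:

```shell
# Rough PCIe ceiling, assuming a PCIe 3.0 x8 slot (8 GT/s per lane,
# 128b/130b encoding). Protocol overhead would lower this further.
awk 'BEGIN {
  lanes = 8; gtps = 8.0; enc = 128.0 / 130.0
  printf "PCIe 3.0 x8 raw ceiling: %.1f Gb/s\n", lanes * gtps * enc
}'
```

The negotiated link width and speed of the slot can be confirmed with `lspci -vv` (look at the `LnkSta` line for the HCA's device), but I am not sure how to interpret whether ~52 Gb/s measured against a ~63 Gb/s raw ceiling is expected.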
Other configuration details that may be relevant:
- Each socket has 8 cores; with hyper-threading, each machine has 32 hardware threads.
- Each DIMM provides ~14 GB/s of bandwidth (~28 GB/s per socket, ~56 GB/s per machine).
- I used Mellanox's Auto Tuning Utility to tune the interrupts.
- IB links are 4X 10.0 Gb/s (FDR10), i.e. 40 Gb/s per port.
- I am using Mellanox OFED 4.3.
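For what it's worth, converting the dual-port result to bytes (my own arithmetic, using the figures above) suggests memory bandwidth is not the limit:

```shell
# Dual-port result in GB/s versus per-socket memory bandwidth
# (52.47 Gb/s measured, ~28 GB/s per socket from the DIMM figures above).
awk 'BEGIN {
  measured_gbs = 52.47 / 8    # Gb/s -> GB/s, ~6.6 GB/s
  socket_membw = 28.0         # GB/s available on one socket
  printf "dual-port traffic: %.1f GB/s of ~%.0f GB/s per-socket mem bw\n",
         measured_gbs, socket_membw
}'
```

So even if all traffic lands on one socket, ~6.6 GB/s is well under ~28 GB/s, which is why I suspect the bottleneck is elsewhere.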