Dual-port 10GbE NIC performance is half of expected

I'm having trouble getting the expected throughput from an Intel dual-port 82599EB 10-Gigabit NIC. I've tried many things and want to know if there's anything I could try that I've missed.

My hardware configuration

Two servers running openSUSE, each with an Intel dual-port 82599EB 10GbE NIC. The ports are manually configured with static IPs, and each port on one machine is connected directly to a port on the second.
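For context, the addressing is just a static IP per port, roughly along these lines (interface names and prefix length are my illustration, not the exact configuration):

ip addr add 192.168.1.20/24 dev eth2   # first port on the transmitter (illustrative)
ip addr add 192.168.1.21/24 dev eth3   # second port on the transmitter (illustrative)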

(The relevant lspci -vv output is excerpted in the edits below.)

Throughput Test

I am using iperf to test. The cards are driven by the ixgbe driver.
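For completeness, the driver and negotiated link speed can be confirmed with ethtool (interface names here are just examples):

ethtool -i eth2   # reports driver: ixgbe, plus version and firmware
ethtool eth2      # reports Speed: 10000Mb/s when the link is up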

On the receiver side, I run

iperf -s

On the transmitter side:

iperf -c 192.168.1.10 -t 20 -B 192.168.1.20
iperf -c 192.168.1.11 -t 20 -B 192.168.1.21

And I am now getting around 4.x Gbit/s per interface when both run at once. If I run only one interface, I get 9.x Gbit/s.
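One variation that could rule out a single-stream or single-core limit is pinning each client to its own core and using parallel streams; the CPU numbers below are arbitrary examples:

taskset -c 2 iperf -c 192.168.1.10 -t 20 -B 192.168.1.20 -P 4
taskset -c 4 iperf -c 192.168.1.11 -t 20 -B 192.168.1.21 -P 4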

Configuration Attempts

I have looked around the SE sites and many other articles. Here are three helpful ones I found:

  1. Network Connectivity — Tuning Intel® Ethernet Adapter throughput performance
  2. https://www.kernel.org/doc/Documentation/networking/ixgbe.txt
  3. http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf (PDF)

The two things that really helped (example settings after this list):

  1. Using jumbo frames by setting the MTU to 9000.
  2. Increasing rmem settings in /etc/sysctl.conf
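For example (interface name and buffer sizes are illustrative, not my exact values):

ip link set eth2 mtu 9000

# in /etc/sysctl.conf, then applied with 'sysctl -p'
net.core.rmem_max = 16777216
net.core.rmem_default = 262144
net.ipv4.tcp_rmem = 4096 87380 16777216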

However, I am still getting only about 9.5 Gbit/s combined across both channels. I'm thinking I should get 9 Gbit/s or more per channel.

Things I've tried without much success (representative commands after the list):

  • Used ethtool -C to vary interrupt coalescing
  • Used ethtool -A to disable/enable flow control
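For example (interface name and values are illustrative):

ethtool -C eth2 rx-usecs 100    # vary interrupt coalescing
ethtool -A eth2 rx off tx off   # disable flow control
ethtool -A eth2 rx on tx on     # re-enable flow control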

Edits as per comments

To check CPU utilization I am using mpstat -P ALL 5. On the transmitting server, the busiest core shows about 61% system time:

01:12:59 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
...
01:12:59 PM    4    0.00    0.00   61.33    0.00    0.00    9.38    0.00    0.00   29.29

That should be okay, right? On the receiver I see a max of 30%.
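A related check is how the card's queue interrupts are spread across cores (the exact interrupt names depend on how ixgbe registered its queues):

grep eth /proc/interrupts      # one line per queue vector, with a per-CPU count
cat /proc/irq/*/smp_affinity   # CPU mask each IRQ is allowed to run on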

Using lspci, I got the following. I can post the full outputs if needed, but I think this shows the relevant PCIe info:

Sender:

1: LnkCap: Port #16, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <2us, L1 <32us
           ClockPM- Surprise- LLActRep- BwNot-
   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
           ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
   LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
   DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
   DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
   LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB

2: LnkCap: Port #16, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <2us, L1 <32us
           ClockPM- Surprise- LLActRep- BwNot-
   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
           ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
   LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
   DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
   DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
   LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB

Receiver:

1: LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <1us, L1 <8us
           ClockPM- Surprise- LLActRep- BwNot-
   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
           ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
   LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
   DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
   DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
   LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB

2: LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <1us, L1 <8us
           ClockPM- Surprise- LLActRep- BwNot-
   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
           ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
   LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
   DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
   DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
   LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB

5 GT/s at x8 should be plenty, right?
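For reference, my rough math, assuming PCIe 2.0 signalling with 8b/10b encoding and ignoring protocol overhead:

5 GT/s per lane with 8b/10b encoding = 4 Gbit/s usable per lane
x8 link: 8 * 4 Gbit/s = 32 Gbit/s per direction
x4 link: 4 * 4 Gbit/s = 16 Gbit/s per direction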

Nate

Posted 2014-07-30T23:04:21.187

Reputation: 153

"And I am now getting around 4.x Gb per interface. If I run only one interface, I get 9.x Gb." Are you sure there are no other bottlenecks and your problem is the network interfaces? Is your CPU maxed out or something? – Zoredache – 2014-07-31T00:26:05.217

Or your PCI Express bus? https://communities.intel.com/community/wired/blog/2009/06/08/understanding-pci-express-bandwidth – cpt_fink – 2014-07-31T06:00:45.100

The PCI bus is a common bottleneck for 10G cards. – MaQleod – 2014-07-31T19:34:02.933

Ah, good points! Thanks! I'm a little out of my element here, but I think I've provided the correct info. If not, let me know. It seems like 5 GT/s at x8 should be plenty, but I'll keep looking. – Nate – 2014-07-31T19:38:08.663

No answers