10

We have two Red Hat servers that are dedicated to customer speed tests. They both use 10Gb fiber connections and sit on 10Gb links, and all of the network gear between them fully supports 10Gb/s. Using iperf or iperf3, the best I can get is around 6.67Gb/s. One server is in production (customers are hitting it) and the other is online but not in use (we are using it for testing at the moment). I should also mention that the 6.67Gb/s is one way. We'll call these server A and server B.

When server A acts as the iperf server, we get the 6.67Gb/s speeds. When server A acts as the client to server B it can only push about 20Mb/s.

What I have done:

So far the only thing I have done is increase the TX/RX ring buffers on both servers to their maximums. One was set to 512 and the other to 453 (RX only; TX was already maxed out).
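For reference, the rings were raised to the pre-set maximums with ethtool's -G option, along these lines:

 ethtool -G em1 rx 4096 tx 4096     # Server A
 ethtool -G p1p1 rx 4078 tx 4078    # Server B

Here is what that looks like on both servers after the update: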

Server A:
Ring parameters for em1:
Pre-set maximums:
RX:     4096
RX Mini:    0
RX Jumbo:   0
TX:     4096
Current hardware settings:
RX:     4096
RX Mini:    0
RX Jumbo:   0
TX:     4096

Server B:
Ring parameters for p1p1:
Pre-set maximums:
RX:     4078
RX Mini:    0
RX Jumbo:   0
TX:     4078
Current hardware settings:
RX:     4078
RX Mini:    0
RX Jumbo:   0
TX:     4078

The NICs look like this:

Server A: 
ixgbe 0000:01:00.0: em1: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Server B:
bnx2x 0000:05:00.0: p1p1: NIC Link is Up, 10000 Mbps full duplex,     Flow control: ON - receive & transmit

Server A ethtool stats:
 rx_errors: 0
 tx_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_fifo_errors: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_csum_offload_errors: 123049

 Server B ethtool stats:
 [0]: rx_phy_ip_err_discards: 0
 [0]: rx_csum_offload_errors: 0
 [1]: rx_phy_ip_err_discards: 0
 [1]: rx_csum_offload_errors: 0
 [2]: rx_phy_ip_err_discards: 0
 [2]: rx_csum_offload_errors: 0
 [3]: rx_phy_ip_err_discards: 0
 [3]: rx_csum_offload_errors: 0
 [4]: rx_phy_ip_err_discards: 0
 [4]: rx_csum_offload_errors: 0
 [5]: rx_phy_ip_err_discards: 0
 [5]: rx_csum_offload_errors: 0
 [6]: rx_phy_ip_err_discards: 0
 [6]: rx_csum_offload_errors: 0
 [7]: rx_phy_ip_err_discards: 0
 [7]: rx_csum_offload_errors: 0
 rx_error_bytes: 0
 rx_crc_errors: 0
 rx_align_errors: 0
 rx_phy_ip_err_discards: 0
 rx_csum_offload_errors: 0
 tx_error_bytes: 0
 tx_mac_errors: 0
 tx_carrier_errors: 0
 tx_deferred: 0
 recoverable_errors: 0
 unrecoverable_errors: 0

Potential issue: Server A has tons of rx_csum_offload_errors. Server A is the one in production, and I can't help but think that CPU interrupts may be an underlying factor here and what's causing the errors I see.

cat /proc/interrupts from Server A:

122:   54938283          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-0
123:   51653771          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-1
124:   52277181          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-2
125:   51823314          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-3
126:   57975011          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-4
127:   52333500          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-5
128:   51899210          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-6
129:   61106425          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-7
130:   51774758          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-8
131:   52476407          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-9
132:   53331215          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-10
133:   52135886          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0

Would disabling rx-checksumming help if that is the issue? Also, I see no CPU interrupts on the server that's not in production, which makes sense, since its NIC doesn't need CPU time.
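If it does come to that, I assume it's just a matter of something like the following (and then re-checking with ethtool -k that it actually took):

 ethtool -K em1 rx off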

Server A:
 ethtool -k em1
Features for em1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-unneeded: off
tx-checksum-ip-generic: off
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: on [fixed]
tx-checksum-sctp: on [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
loopback: off [fixed]

Other than using jumbo frames, which is not possible because our network gear does not support them, what else can I do or check to get the most optimal TCP performance for my 10Gb network? The 6.67Gb/s is not that bad, I guess, considering that one of the servers is in production and my hypothesis about the CPU interrupts the NIC is generating. But the 20Mb/s speed in the other direction on a 10Gb link is simply not acceptable. Any help would be greatly appreciated.

Server A specs: x64, 24 vCPUs, 32GB RAM, RHEL 6.7

Server B specs: x64, 16 vCPUs, 16GB RAM, RHEL 6.7

user53029
  • Are the servers of the same specs? Is `irqbalance` enabled? Are you using a [tuned profile](http://serverfault.com/questions/518629/understanding-redhats-recommended-tuned-profiles/518709#518709)? – ewwhite Feb 17 '16 at 22:12
  • Updated question to include specs. irqbalance is not enabled and no tuned profile. – user53029 Feb 17 '16 at 22:25
  • There is a ton of tuning information here. I've used it more than a few times. http://fasterdata.es.net/ – Stefan Lasiewski Feb 17 '16 at 22:30

4 Answers

9

On Linux with Intel NICs, I would use the following methodology for performance analysis:

Hardware:

  • turbostat
    Look for C/P states for cores, frequencies, number of SMIs. [1]
  • cpufreq-info
    Look for current driver, frequencies, and governor.
  • atop
    Look for interrupt distribution across cores
    Look for context switches, interrupts.
  • ethtool
    -S for stats, look for errors, drops, overruns, missed interrupts, etc
    -k for offloads, enable GRO/GSO, rss(/rps/rfs)/xps
    -g for ring sizes, increase
    -c for interrupt coalescing
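    For example (em1 here is just the interface name from the question):

     ethtool -S em1 | grep -iE 'err|drop|miss'   # stats: errors, drops, misses
     ethtool -k em1                              # offload settings
     ethtool -g em1                              # ring sizes
     ethtool -c em1                              # interrupt coalescing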

Kernel:

  • /proc/net/softnet_stat[2] and /proc/interrupts[3]
    Again, distribution, missed, delayed interrupts, (optional) NUMA-affinity
  • perf top
    Look where kernel/benchmark spends its time.
  • iptables
    Check whether there are any rules that may affect performance.
  • netstat -s, netstat -m, /proc/net/*
    Look for error counters and buffer counts
  • sysctl / grub
    So much to tweak here. Try increasing hashtable sizes, playing with memory buffers, congestion control, and other knobs.
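    A rough sketch of the kind of sysctl knobs meant here (values are purely illustrative, not a recommendation):

     sysctl -w net.core.rmem_max=67108864
     sysctl -w net.core.wmem_max=67108864
     sysctl -w net.ipv4.tcp_rmem='4096 87380 33554432'
     sysctl -w net.ipv4.tcp_wmem='4096 65536 33554432'
     sysctl -w net.ipv4.tcp_congestion_control=htcp   # only if the htcp module is available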

In your case your main problem is interrupt distribution across the cores, so fixing that will be your best course of action.
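If you want to place the queues by hand instead of (or before) relying on irqbalance, the idea is simply to write a CPU mask per queue IRQ, e.g. for the em1 queues shown in the question (stop irqbalance first, or it will overwrite the masks):

 echo 1 > /proc/irq/122/smp_affinity   # em1-TxRx-0 -> CPU0
 echo 2 > /proc/irq/123/smp_affinity   # em1-TxRx-1 -> CPU1
 echo 4 > /proc/irq/124/smp_affinity   # em1-TxRx-2 -> CPU2
 # ...and so on, spreading the remaining queues across the other cores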

PS. Do not forget that in these kinds of benchmarks, kernel and driver/firmware versions play a significant role.

PPS. You probably want to install the newest ixgbe driver from Intel[4]. Do not forget to read the README there and examine the scripts directory. It has lots of performance-related tips.

[0] Intel also has nice docs about scaling network performance
https://www.kernel.org/doc/Documentation/networking/scaling.txt
[1] You can pin your processor to a specific C-state:
https://gist.github.com/SaveTheRbtz/f5e8d1ca7b55b6a7897b
[2] You can analyze that data with:
https://gist.github.com/SaveTheRbtz/172b2e2eb3cbd96b598d
[3] You can set affinity with:
https://gist.github.com/SaveTheRbtz/8875474
[4] https://sourceforge.net/projects/e1000/files/ixgbe%20stable/

SaveTheRbtz
4

Are the servers of the same specs (make and model)? Have you made any sysctl.conf changes?

You should enable irqbalance because your interrupts are only occurring on CPU0.

If you aren't using a tuned profile with EL6, you should choose one that's close to your workload, according to the schedule here.
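On EL6 that boils down to something like this (latency-performance is only an example; pick whichever profile actually matches your workload):

 chkconfig irqbalance on
 service irqbalance start
 yum install tuned
 tuned-adm profile latency-performance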

ewwhite
  • No, server A is a Dell PE R620 with an Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01), and server B is a Dell PE 430 with a Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10). Are there any best practices for tuning irqbalance for 10Gb Ethernet, or do I just start the service and that's it? I have made no changes with sysctl. – user53029 Feb 17 '16 at 22:55
  • Just start the service. – ewwhite Feb 17 '16 at 22:56
2

A speed of 6 Gb/s is OK if you run only one instance of iperf, since it is limited to a single CPU core. Two processes run simultaneously should give you the expected 10Gb/s.
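For example, since iperf3 is single-threaded even with -P, one way is to run two server processes on different ports and drive both at once:

 # on the receiving side
 iperf3 -s -p 5201 &
 iperf3 -s -p 5202 &

 # on the sending side
 iperf3 -c <server> -p 5201 -t 30 &
 iperf3 -c <server> -p 5202 -t 30 &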

The problem with 20Mb/s in one direction looks like a driver/firmware/hardware incompatibility issue.

I suggest you try the following troubleshooting steps:

1. Your NICs have dual ports, so first try loopback speed tests on both NICs. That can help you localize the problem to server A or server B.
2. Change patch cords.
3. Try new drivers.
4. Upgrade the firmware.
5. Change the NICs.

anx
  • Running 2 parallel streams does give me around 9Gb/s. I also noticed that when I run parallel streams in the slow direction, say up to 10, I can get around 220Mb/s; with 100 parallel I get 1.8Gb/s. So it seems to have the capability to push out more data, and in theory, if I used enough streams, I could probably max out the circuit. But why does it need only 2 streams in one direction and many more in the other? – user53029 Feb 18 '16 at 14:41
  • Slow direction is always from server A to server B? – anx Feb 18 '16 at 16:01
  • Yes, when server A is the iperf client and server B accepts the packets. – user53029 Feb 18 '16 at 16:03
  • Try another optical patch cord, or swap the fibers of the one that is used now. Maybe it's broken. – anx Feb 18 '16 at 16:07
1

I would try disabling LRO (Large Receive Offload)... I'd guess you have one server with it turned on and one with it turned off.

It's NIC/driver dependent, but in general, when we see this in our environment, we know we missed one and go disable LRO.
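Checking and flipping it is quick, e.g. on server A's interface from the question:

 ethtool -k em1 | grep large-receive    # see the current state
 ethtool -K em1 lro off                 # disable LRO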

kettlewell