How to fine tune TCP performance on Linux with a 10Gb fiber connection

Question

We have 2 Red Hat servers that are dedicated for customer speedtest. They both use 10Gb fiber connections and sit on 10Gb links. All network gear in between these servers fully support 10Gb/s. Using Iperf or Iperf3 the best I can get is around 6.67Gb/s. That being said, one server is in production (customers are hitting it) and the other server is online but not being used. (we are using it for testing atm) The 6.67Gb/s is also one way, I should mention. We'll call these server A and server B.

When server A acts as the iperf server, we get the 6.67Gb/s speeds. When server A acts as the client to server B it can only push about 20Mb/s.

What I have done:

So far the only thing I have done is increase the TX/RX buffers on both server to their max. One was set to 512 the other 453. (RX only, TX was already maxed out) so here is that that looks like on both after the update:

Server A:
Ring parameters for em1:
Pre-set maximums:
RX:     4096
RX Mini:    0
RX Jumbo:   0
TX:     4096
Current hardware settings:
RX:     4096
RX Mini:    0
RX Jumbo:   0
TX:     4096

Server B:
Ring parameters for p1p1:
Pre-set maximums:
RX:     4078
RX Mini:    0
RX Jumbo:   0
TX:     4078
Current hardware settings:
RX:     4078
RX Mini:    0
RX Jumbo:   0
TX:     4078

NICS look like this:

Server A: 
ixgbe 0000:01:00.0: em1: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Serer B:
bnx2x 0000:05:00.0: p1p1: NIC Link is Up, 10000 Mbps full duplex,     Flow control: ON - receive & transmit

Server A ethtool stats:
 rx_errors: 0
 tx_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_fifo_errors: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_csum_offload_errors: 123049

 Server B ethtool stats:
 [0]: rx_phy_ip_err_discards: 0
 [0]: rx_csum_offload_errors: 0
 [1]: rx_phy_ip_err_discards: 0
 [1]: rx_csum_offload_errors: 0
 [2]: rx_phy_ip_err_discards: 0
 [2]: rx_csum_offload_errors: 0
 [3]: rx_phy_ip_err_discards: 0
 [3]: rx_csum_offload_errors: 0
 [4]: rx_phy_ip_err_discards: 0
 [4]: rx_csum_offload_errors: 0
 [5]: rx_phy_ip_err_discards: 0
 [5]: rx_csum_offload_errors: 0
 [6]: rx_phy_ip_err_discards: 0
 [6]: rx_csum_offload_errors: 0
 [7]: rx_phy_ip_err_discards: 0
 [7]: rx_csum_offload_errors: 0
 rx_error_bytes: 0
 rx_crc_errors: 0
 rx_align_errors: 0
 rx_phy_ip_err_discards: 0
 rx_csum_offload_errors: 0
 tx_error_bytes: 0
 tx_mac_errors: 0
 tx_carrier_errors: 0
 tx_deferred: 0
 recoverable_errors: 0
 unrecoverable_errors: 0

Potential issue: Server A has tons of rx_csum_offload_errors. Server A is the one on production and I can't help but think that CPU interrupts may be an underlying factor here and whats causing the errors I see.

cat /proc/interrupts from Server A:

122:   54938283          0          0          0          0            0          0          0          0          0          0          0            0          0          0          0          0          0          0           0          0          0          0          0  IR-PCI-MSI-edge      em1-  TxRx-0
123:   51653771          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-1
124:   52277181          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-2
125:   51823314          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-3
126:   57975011          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-4
127:   52333500          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-5
128:   51899210          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-6
129:   61106425          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-7
130:   51774758          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-8
131:   52476407          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-9
132:   53331215          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-10
133:   52135886          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0

Would disabling rx-checksumming help if this is what the issue may be? Also I see no CPU interrupts on the server that's not in production, which makes sense, since its NIC is not needing CPU time.

Server A:
 ethtool -k em1
Features for em1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-unneeded: off
tx-checksum-ip-generic: off
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: on [fixed]
tx-checksum-sctp: on [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
loopback: off [fixed]

Other than using jumbo frames, which is not possible because our network gear does not support them, what else can I do or check to provide me with the most optimal TCP performance for my 10Gb network? The 6.67Gb/s is not that bad I guess taking into consideration that one of the servers is in production and my hypothesis about the CPU interrupts the NIC is generating. But the 20Mb/s speed in the other direction on a 10Gb link is simply not acceptable. Any help would be greatly appreciated.

Server A specs: x64 24v CPU 32GB RAM RHEL 6.7

Server B Specs: x64 16v CPU 16GB ram RHEL 6.7

Are the servers of the same specs? Is `irqbalance` enabled? Are you using a [tuned profile](http://serverfault.com/questions/518629/understanding-redhats-recommended-tuned-profiles/518709#518709)? — ewwhite, Feb 17 '16 at 22:12
Updated question to include specs. irqbalance is not enabled and no tuned profile. — user53029, Feb 17 '16 at 22:25
There is a ton of tuning information here. I've used it more then a few times. http://fasterdata.es.net/ — Stefan Lasiewski, Feb 17 '16 at 22:30

score 9 · Answer 1 · answered Feb 18 '16 at 02:25

In Linux/Intel I would use following methodology for performance analysis:

Hardware:

turbostat
Look for C/P states for cores, frequencies, number of SMIs. [1]
cpufreq-info
Look for current driver, frequencies, and governor.
atop
Look for interrupt distribution across cores
Look for context switches, interrupts.
ethtool
-S for stats, look for errors, drops, overruns, missed interrupts, etc
-k for offloads, enable GRO/GSO, rss(/rps/rfs)/xps
-g for ring sizes, increase
-c for interrupt coalescing

Kernel:

/proc/net/softirq[2] and /proc/interrupts[3]
Again, distribution, missed, delayed interrupts, (optional) NUMA-affinity
perf top
Look where kernel/benchmark spends its time.
iptables
Look if there are rules (if any) that may affect performance.
netstat -s, netstat -m, /proc/net/*
Look for error counters and buffer counts
sysctl / grub
So much to tweak here. Try increasing hashtable sizes, playing with memory buffers, congestion control, and other knobs.

In your case your main problem is interrupt distribution across the cores, so fixing it will be your best corse of action.

PS. Do not forget that in those kinds of benchmarks kernel and driver/firmware versions play a significant role.

PPS. You probably want to install the newest ixgbe driver from Intel[4]. Do not forget to read README there and examine scripts directory. It has lots of performance-related tips.

[0] Intel also has nice docs about scaling network performance
https://www.kernel.org/doc/Documentation/networking/scaling.txt
[1] You can pin your processor to a specific C-state:
https://gist.github.com/SaveTheRbtz/f5e8d1ca7b55b6a7897b
[2] You can analyze that data with:
https://gist.github.com/SaveTheRbtz/172b2e2eb3cbd96b598d
[3] You can set affinity with:
https://gist.github.com/SaveTheRbtz/8875474
[4] https://sourceforge.net/projects/e1000/files/ixgbe%20stable/

score 4 · Answer 2 · edited Apr 13 '17 at 12:14

4

Are the servers of the same specs (make and model)? Have you made any sysctl.conf changes?

You should enable irqbalance because your interrupts are only occurring on CPU0.

If you aren't using a tuned profile with EL6, you should choose one that's close to your workload, according to the schedule here.

edited Apr 13 '17 at 12:14

Community

1

answered Feb 17 '16 at 22:28

ewwhite

194,921
91
434
799

No, server A is a Dell PE R620 with a Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) and server B a Dell PE 430. with Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10). Is there any best practices on tuning irqbalance for 10 Gb ethernet or do I just start the service and thats it? I have made no change with sysctl. – user53029 Feb 17 '16 at 22:55
1

Just start the service. – ewwhite Feb 17 '16 at 22:56

score 2 · Answer 3 · answered Feb 18 '16 at 11:01

2

Speed 6 Gb/s is ok, if you run only one instance of iperf, as it's limited to single CPU core. Two processes simultaneously should give you expected 10Gb/s.

The problem with 20Mb/s in one direction looks like driver/firmware/hardware incompatibility issue.

I suggest you to try the following troubleshooting steps:

Your NICs have dual ports, so first, try loopback speed tests on both NICs. It can help you to localize the problem: on server A or on server B. 2. Change patch cords. 3. Try new drivers. 4. Upgrade firmware. 5. Change NICs)

answered Feb 18 '16 at 11:01

anx

328
1
6

running 2 parallel streams does give me around 9Gb/s. I also noticed that when I run parallel streams in the slow direction, say, up to 10, I can get around 220Mb/s. 100 parallel and I get 1.8Gb/s. So it seems to have the capability to push out more data and in theory if I used enough streams I could probably get max of the circuit. But why does it need only 2 streams in one direction and multiple in the other? – user53029 Feb 18 '16 at 14:41
Slow direction is always from server A to server B? – anx Feb 18 '16 at 16:01
Yes, when server A is the iperf client and server B accepts the packets. – user53029 Feb 18 '16 at 16:03
1

Try another optical patch. Or swap the fibers of the one that is used now. May be it's broken. – anx Feb 18 '16 at 16:07

score 1 · Answer 4 · answered Jan 12 '18 at 15:14

1

I would try disabling LRO ( Large Receive Offload )... I'd guess you have one with it turned on, and one with it turned off.

It's NIC / driver dependent, but in general, when we see that in our environment, we know we missed one, and go disable LRO

answered Jan 12 '18 at 15:14

kettlewell

11
1

How to fine tune TCP performance on Linux with a 10Gb fiber connection

4 Answers4

Linked