42

I’m trying to improve my TCP throughput over a “gigabit network with lots of connections and high traffic of small packets”. My server OS is Ubuntu 11.10 Server 64bit.

There are about 50,000 (and growing) clients connected to my server through TCP sockets (all on the same port).

95% of my packets are 1-150 bytes in size (TCP header and payload). The remaining 5% vary from 150 up to 4096+ bytes.

With the config below my server can handle traffic up to 30 Mbps (full duplex).

Can you please advise on best practices for tuning the OS for my needs?

My /etc/sysctl.conf looks like this:

kernel.pid_max = 1000000
net.ipv4.ip_local_port_range = 2500 65000
fs.file-max = 1000000
#
net.core.netdev_max_backlog=3000
net.ipv4.tcp_sack=0
#
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.somaxconn = 2048
#
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216
#
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_mem = 50576   64768   98152
#
net.core.wmem_default = 65536
net.core.rmem_default = 65536
net.ipv4.tcp_window_scaling=1
#
net.ipv4.tcp_mem= 98304 131072 196608
#
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_rfc1337 = 1
net.ipv4.ip_forward = 0
net.ipv4.tcp_congestion_control=cubic
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0
#
net.ipv4.tcp_orphan_retries = 1
net.ipv4.tcp_fin_timeout = 25
net.ipv4.tcp_max_orphans = 8192

Here are my limits:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 193045
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1000000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1000000

[ADDED]

My NICs are the following:

$ dmesg | grep Broad
[    2.473081] Broadcom NetXtreme II 5771x 10Gigabit Ethernet Driver bnx2x 1.62.12-0 (2011/03/20)
[    2.477808] bnx2x 0000:02:00.0: eth0: Broadcom NetXtreme II BCM57711E XGb (A0) PCI-E x4 5GHz (Gen2) found at mem fb000000, IRQ 28, node addr d8:d3:85:bd:23:08
[    2.482556] bnx2x 0000:02:00.1: eth1: Broadcom NetXtreme II BCM57711E XGb (A0) PCI-E x4 5GHz (Gen2) found at mem fa000000, IRQ 40, node addr d8:d3:85:bd:23:0c

[ADDED 2]

ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off

[ADDED 3]

 sudo ethtool -S eth0 | grep -vw 0
 NIC statistics:
      [1]: rx_bytes: 17521104292
      [1]: rx_ucast_packets: 118326392
      [1]: tx_bytes: 35351475694
      [1]: tx_ucast_packets: 191723897
      [2]: rx_bytes: 16569945203
      [2]: rx_ucast_packets: 114055437
      [2]: tx_bytes: 36748975961
      [2]: tx_ucast_packets: 194800859
      [3]: rx_bytes: 16222309010
      [3]: rx_ucast_packets: 109397802
      [3]: tx_bytes: 36034786682
      [3]: tx_ucast_packets: 198238209
      [4]: rx_bytes: 14884911384
      [4]: rx_ucast_packets: 104081414
      [4]: rx_discards: 5828
      [4]: rx_csum_offload_errors: 1
      [4]: tx_bytes: 35663361789
      [4]: tx_ucast_packets: 194024824
      [5]: rx_bytes: 16465075461
      [5]: rx_ucast_packets: 110637200
      [5]: tx_bytes: 43720432434
      [5]: tx_ucast_packets: 202041894
      [6]: rx_bytes: 16788706505
      [6]: rx_ucast_packets: 113123182
      [6]: tx_bytes: 38443961940
      [6]: tx_ucast_packets: 202415075
      [7]: rx_bytes: 16287423304
      [7]: rx_ucast_packets: 110369475
      [7]: rx_csum_offload_errors: 1
      [7]: tx_bytes: 35104168638
      [7]: tx_ucast_packets: 184905201
      [8]: rx_bytes: 12689721791
      [8]: rx_ucast_packets: 87616037
      [8]: rx_discards: 2638
      [8]: tx_bytes: 36133395431
      [8]: tx_ucast_packets: 196547264
      [9]: rx_bytes: 15007548011
      [9]: rx_ucast_packets: 98183525
      [9]: rx_csum_offload_errors: 1
      [9]: tx_bytes: 34871314517
      [9]: tx_ucast_packets: 188532637
      [9]: tx_mcast_packets: 12
      [10]: rx_bytes: 12112044826
      [10]: rx_ucast_packets: 84335465
      [10]: rx_discards: 2494
      [10]: tx_bytes: 36562151913
      [10]: tx_ucast_packets: 195658548
      [11]: rx_bytes: 12873153712
      [11]: rx_ucast_packets: 89305791
      [11]: rx_discards: 2990
      [11]: tx_bytes: 36348541675
      [11]: tx_ucast_packets: 194155226
      [12]: rx_bytes: 12768100958
      [12]: rx_ucast_packets: 89350917
      [12]: rx_discards: 2667
      [12]: tx_bytes: 35730240389
      [12]: tx_ucast_packets: 192254480
      [13]: rx_bytes: 14533227468
      [13]: rx_ucast_packets: 98139795
      [13]: tx_bytes: 35954232494
      [13]: tx_ucast_packets: 194573612
      [13]: tx_bcast_packets: 2
      [14]: rx_bytes: 13258647069
      [14]: rx_ucast_packets: 92856762
      [14]: rx_discards: 3509
      [14]: rx_csum_offload_errors: 1
      [14]: tx_bytes: 35663586641
      [14]: tx_ucast_packets: 189661305
      rx_bytes: 226125043936
      rx_ucast_packets: 1536428109
      rx_bcast_packets: 351
      rx_discards: 20126
      rx_filtered_packets: 8694
      rx_csum_offload_errors: 11
      tx_bytes: 548442367057
      tx_ucast_packets: 2915571846
      tx_mcast_packets: 12
      tx_bcast_packets: 2
      tx_64_byte_packets: 35417154
      tx_65_to_127_byte_packets: 2006984660
      tx_128_to_255_byte_packets: 373733514
      tx_256_to_511_byte_packets: 378121090
      tx_512_to_1023_byte_packets: 77643490
      tx_1024_to_1522_byte_packets: 43669214
      tx_pause_frames: 228

Some info about SACK: "When to turn TCP SACK off?"

Worker

6 Answers

23

The problem might be that you are getting too many interrupts on your network card. If bandwidth is not the problem, interrupt frequency is:

  • Turn up send/receive buffers on the network card

    ethtool -g eth0
    

This will show you the current settings (256 or 512 entries). You can probably raise these to 1024, 2048 or 3172. More probably does not make sense. This is just a ring buffer that only fills up if the server is not able to process incoming packets fast enough.
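
Raising them is done with ethtool -G (a sketch: the 2048 below is illustrative; stay within the pre-set maximums that ethtool -g reports for your NIC):

    # raise the RX/TX ring sizes; must not exceed the reported maximums
    sudo ethtool -G eth0 rx 2048 tx 2048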

If the buffer starts to fill, flow control is an additional means to tell the router or switch to slow down:

  • Turn on flow control in/outbound on the server and the switch/router-ports it is attached to.

    ethtool -a eth0
    

This will probably show:

Pause parameters for eth0:
Autonegotiate:  on
RX:             on
TX:             on

Check /var/log/messages for the current setting of eth0. Look for something like:

eth0: Link is up at 1000 Mbps, full duplex, flow control tx and rx

If you don't see tx and rx, your network admins have to adjust the values on the switch/router. On Cisco, that is "receive/transmit flow control on".
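
To change it on the server side, ethtool -A sets the pause parameters (a sketch; which combination sticks depends on what the switch negotiates):

    # enable RX/TX pause frames; the link will flap briefly
    sudo ethtool -A eth0 autoneg on rx on tx on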

Beware: Changing these values will bring your link down and up for a very short time (less than 1 s).

  • If all this does not help, you can also lower the speed of the network card to 100 Mbit/s (do the same on the switch/router ports):

    ethtool -s eth0 autoneg off && ethtool -s eth0 speed 100
    

But in your case I would say: raise the receive buffers (the RX ring) on the NIC.

Nils
  • Looking at your numbers from `ethtool` I would say - set the receive buffers of the network card to the maximum to avoid the RX discards. I hope your Broadcom has enough of these. – Nils Feb 12 '12 at 21:36
  • Increasing buffering with TCP is almost never a good idea. We have way too much buffering already: http://www.bufferbloat.net/projects/bloat/wiki/Introduction – rmalayter Feb 16 '12 at 19:38
  • This buffer is a hardware buffer directly on the NIC. I will update my answer with more details. Since you are losing incoming packets you need that buffer. I have a similar server where I had to switch to a different NIC (from onboard Broadcom to PCIe Intel) to be able to increase these buffers. After that I never encountered lost RX packets any more. – Nils Feb 16 '12 at 21:43
  • @rmalayter: this is a ring buffer on layer 2. See my updated answer. – Nils Feb 16 '12 at 22:04
  • Finally we have 1 Gbps. There was a lot of tuning in different places, so I cannot really say that there was a single problem. – Worker Feb 17 '12 at 05:22
6

The following might not be the definitive answer, but it will definitely put forth some ideas.

Try adding these to sysctl.conf

## TCP selective acknowledgements
net.ipv4.tcp_sack = 1
## enable window scaling
net.ipv4.tcp_window_scaling = 1
##
net.ipv4.tcp_no_metrics_save = 1

Selective TCP ACK is good for optimal performance on high-bandwidth networks, but beware of its other drawbacks. The benefits of window scaling are described here. As for the third sysctl option: by default, TCP saves various connection metrics in the route cache when a connection closes, so that connections established in the near future can use these to set initial conditions. Usually, this increases overall performance, but it may sometimes cause performance degradation. If set, TCP will not cache metrics on closing connections.
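
To load them without a reboot, standard sysctl usage applies (shown for completeness):

    # re-read /etc/sysctl.conf
    sudo sysctl -p
    # or set a single key immediately
    sudo sysctl -w net.ipv4.tcp_no_metrics_save=1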

Check with

ethtool -k ethX

to see whether offloading is enabled or not. TCP checksum offload and large segment offload are supported by the majority of today's Ethernet NICs, and apparently Broadcom supports them as well.
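
If any of them show up as off, they can usually be switched on with ethtool -K (a sketch; not every driver accepts every flag):

    # enable TCP segmentation, generic segmentation and generic receive offload
    sudo ethtool -K eth0 tso on gso on gro on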

Try using the tool

powertop

while the network is idle and again when network saturation is reached. This will definitely show whether NIC interrupts are the culprit. Device polling is one answer to such a situation. FreeBSD supports a polling switch right inside ifconfig, but Linux has no such option. Consult this guide to enable polling; it says Broadcom also supports polling, which is good news for you.
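
Interrupt load can also be read straight from the kernel, independent of powertop (a quick check, assuming the bnx2x queues are named after eth0):

    # per-CPU interrupt counts for the NIC queues; run twice and compare
    grep eth0 /proc/interrupts
    # system-wide interrupt and context-switch rates, one-second samples
    vmstat 1 5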

The jumbo-packet tweak might not cut it for you, since you mentioned that your traffic consists mostly of small packets. But hey, try it out anyway!
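
If you do try it, a jumbo frame is just a larger MTU, but every hop in the path (switch ports included) must support it (the 9000 below is the usual illustrative value):

    # requires jumbo-frame support end-to-end
    sudo ip link set dev eth0 mtu 9000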

kaji
  • 2,510
  • 16
  • 17
  • kaji, I will try your suggestions tomorrow. About powertop - should I tune power saving if my goal is performance? – Worker Feb 12 '12 at 07:24
  • Yes, of course that might also help. I mentioned powertop just to make sure whether interrupts are the evil. Interrupt frequency could also be harvested from other tools. – kaji Feb 12 '12 at 07:33
  • I see high "Rescheduling Interrupts" - could that be the reason? What are "Rescheduling Interrupts"? – Worker Feb 12 '12 at 09:20
  • Try to follow this ---> https://help.ubuntu.com/community/ReschedulingInterrupts – kaji Feb 12 '12 at 09:29
  • Yeah, I saw that tutorial, but it is for laptops, while I see high interrupts on the server. Will try to apply it to the server. – Worker Feb 12 '12 at 09:56
3

I noticed in the list of tweaks that timestamps are turned off; please do not do that. That is an old throwback to the days of yore when bandwidth was really expensive and people wanted to save a few bytes per packet. The TCP stack uses timestamps these days, for example, to tell whether a packet arriving for a socket in TIME_WAIT is an old packet for the previous connection or a new packet for a new connection, and they help in RTT calculations. Saving the few bytes of a timestamp is NOTHING compared to what IPv6 addresses are going to add. Turning off timestamps does more harm than good.

This recommendation about turning off timestamps is just a throwback that keeps getting passed from one generation of sysadmins to the next. An "urban legend" of sorts.
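
If the tweak has already been applied, reverting it is a one-liner (plus removing the line from /etc/sysctl.conf):

    sudo sysctl -w net.ipv4.tcp_timestamps=1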

GeorgeB
2

You need to distribute the load across all CPU cores. Start 'irqbalance'.
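
A minimal sketch for Ubuntu 11.10 (package and service names as in that release):

    sudo apt-get install irqbalance
    sudo service irqbalance start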

user175978
  • This will not help if a single IRQ has a very high frequency. irqbalance tries to distribute single IRQs to suitable logical processors, but there will never be more than one processor serving a single IRQ. – Nils Nov 23 '13 at 22:05
2

In my case, only a single tuning:

net.ipv4.tcp_timestamps = 0

made a very big and useful change: site loading time decreased by 50%.

avz2012
  • Something must be severely broken in your setup in order for that to happen. Timestamps use less than 1% of the bandwidth under normal circumstances and will allow TCP to do retransmissions much more tightly timed than otherwise. – kasperd Feb 19 '16 at 11:39
1

I propose this:

kernel.sem = 350 358400 64 1024
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 4194304
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_rmem = 4096 262144 4194304
net.ipv4.tcp_wmem = 4096 262144 4194304
net.ipv4.tcp_keepalive_time = 900
net.ipv4.tcp_keepalive_intvl = 900
net.ipv4.tcp_keepalive_probes = 9

Tested on Oracle DB servers running RHEL, and with backup software.
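
If you borrow these values, it is worth verifying what the kernel actually accepted (a quick read-back):

    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_keepalive_time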

  • These numbers are configurable because there is no one-size-fits-all. That means the numbers themselves are not valuable. What could be valuable is the method you used to decide which numbers to use. – kasperd Feb 19 '16 at 11:41