33

We had a little failover problem with one of our HAProxy VMs today. When we dug into it, we found this:

Jan 26 07:41:45 haproxy2 kernel: [226818.070059] __ratelimit: 10 callbacks suppressed
Jan 26 07:41:45 haproxy2 kernel: [226818.070064] Out of socket memory
Jan 26 07:41:47 haproxy2 kernel: [226819.560048] Out of socket memory
Jan 26 07:41:49 haproxy2 kernel: [226822.030044] Out of socket memory

Which, per this link, apparently has to do with low default settings for net.ipv4.tcp_mem. So we increased them to 4x their defaults (this is Ubuntu Server; not sure if the Linux flavor matters):

current values are:    45984   61312   91968
new values are:       183936  245248  367872
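
(For anyone wanting to reproduce this: the change can be made with sysctl, and note that net.ipv4.tcp_mem is measured in memory pages, not bytes. A sketch using the values above:)

sysctl net.ipv4.tcp_mem                                  # show the current min/pressure/max values
sudo sysctl -w net.ipv4.tcp_mem="183936 245248 367872"   # apply the 4x values at runtime
# to make the change persistent, add this line to /etc/sysctl.conf:
#   net.ipv4.tcp_mem = 183936 245248 367872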

After that, we started seeing a bizarre error message:

Jan 26 08:18:49 haproxy1 kernel: [ 2291.579726] Route hash chain too long!
Jan 26 08:18:49 haproxy1 kernel: [ 2291.579732] Adjust your secret_interval!

Shh.. it's a secret!!

This apparently has to do with /proc/sys/net/ipv4/route/secret_interval, which defaults to 600 and controls periodic flushing of the route cache.

The secret_interval instructs the kernel how often to blow away ALL route hash entries regardless of how new or old they are. In our environment this is generally bad: the CPU will be busy rebuilding thousands of entries per second every time the cache is cleared. However, we did set it to run once a day to keep memory leaks at bay (though we've never actually had one).
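
("Once a day" here is 86400 seconds; a sketch of how such a change is applied:)

sysctl net.ipv4.route.secret_interval                 # current value, in seconds (default 600)
sudo sysctl -w net.ipv4.route.secret_interval=86400   # flush the whole route cache only once a day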

While we are happy to reduce this, it seems odd to recommend dropping the entire route cache at regular intervals, rather than simply pushing old values out of the route cache faster.

After some investigation, we found /proc/sys/net/ipv4/route/gc_elasticity which seems to be a better option for keeping the route table size in check:

gc_elasticity can best be described as the average bucket depth the kernel will accept before it starts expiring route hash entries. This will help maintain the upper limit of active routes.

We adjusted the elasticity from 8 to 4, in the hope that the route cache will prune itself more aggressively. Fiddling with secret_interval does not feel correct to us. But there are a bunch of related settings, and it's unclear which are really the right ones to tune here:

  • /proc/sys/net/ipv4/route/gc_elasticity (8)
  • /proc/sys/net/ipv4/route/gc_interval (60)
  • /proc/sys/net/ipv4/route/gc_min_interval (0)
  • /proc/sys/net/ipv4/route/gc_timeout (300)
  • /proc/sys/net/ipv4/route/secret_interval (600)
  • /proc/sys/net/ipv4/route/gc_thresh (?)
  • rhash_entries (kernel parameter, default unknown?)
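
(All of these can be inspected and changed on the fly with sysctl; for example, the gc_elasticity change mentioned above looks like this:)

sysctl -a 2>/dev/null | grep '^net.ipv4.route'   # list every route-cache tunable and its current value
sudo sysctl -w net.ipv4.route.gc_elasticity=4    # accept shallower hash chains before pruning (was 8)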

We don't want to make the Linux routing worse, so we're kind of afraid to mess with some of these settings.

Can anyone advise which routing parameters are best to tune, for a high traffic HAProxy instance?

Jeff Atwood

3 Answers

28

I have never encountered this issue myself. However, you should probably increase your hash table width in order to reduce its depth. Using "dmesg", you can see how many entries you currently have:

$ dmesg | grep '^IP route'
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)

You can change this value with the kernel boot command-line parameter rhash_entries. First try it by hand, then add it to your lilo.conf or grub.conf.

For example: kernel vmlinux rhash_entries=131072
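
(If your Ubuntu release boots with GRUB 2 instead of a classic grub.conf, the equivalent, assuming the stock /etc/default/grub layout, is to append the parameter to the kernel command line and regenerate the menu; a sketch:)

# /etc/default/grub -- append to whatever options are already there
GRUB_CMDLINE_LINUX_DEFAULT="quiet rhash_entries=131072"

sudo update-grub   # regenerate the boot configuration, then reboot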

It is possible that you have a very limited hash table because you have assigned little memory to your HAProxy VM (the route hash size is adjusted depending on total RAM).

Concerning tcp_mem, be careful. Your initial settings make me think you were running with 1 GB of RAM, 1/3 of which could be allocated to TCP sockets. Now you have allowed up to 367872 * 4096 bytes = 1.5 GB of RAM for TCP sockets. You should be very careful not to run out of memory. A rule of thumb is to allocate 1/3 of the memory to HAProxy, another 1/3 to the TCP stack, and the last 1/3 to the rest of the system.
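
(A quick way to sanity-check those ceilings yourself, since tcp_mem is counted in pages of 4096 bytes on a typical x86 VM:)

echo $(( 91968  * 4096 / 1048576 )) MB   # old max: ~359 MB, roughly 1/3 of a 1 GB machine
echo $(( 367872 * 4096 / 1048576 )) MB   # new max: ~1437 MB, i.e. the ~1.5 GB above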

I suspect that your "out of socket memory" message comes from the default settings in tcp_rmem and tcp_wmem. By default you have 64 kB allocated on output for each socket and 87 kB on input. Since a proxied connection uses two sockets (one on the client side, one on the server side), that means about 300 kB per connection just for socket buffers. Add 16 or 32 kB for HAProxy's own buffers, and you can see that with 1 GB of RAM you'll only support about 3000 connections.
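
(The same back-of-the-envelope math in shell form, using the default buffer sizes above:)

echo $(( (64 + 87) * 2 + 32 )) kB          # two sockets per proxied connection + HAProxy buffers: ~334 kB
echo $(( 1024 * 1024 / 334 )) connections  # out of 1 GB of RAM: roughly 3100 connections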

By lowering the default (middle) values of tcp_rmem and tcp_wmem, you can get memory usage a lot lower. I get good results with values as low as 4096 for the write buffer, and 7300 or 16060 in tcp_rmem (5 or 11 TCP segments). You can change those settings without restarting; however, they will only apply to new connections.
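
(A sketch of what that looks like with sysctl. The middle number is the per-socket default and is the one being lowered; the min and max shown here are only illustrative, not a recommendation:)

sudo sysctl -w net.ipv4.tcp_wmem="4096 4096  4194304"    # write buffer default down to 4096 bytes
sudo sysctl -w net.ipv4.tcp_rmem="4096 16060 6291456"    # read buffer default down to 16060 bytes (11 segments)
# existing connections keep their old buffers; only new connections pick these up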

If you prefer not to touch your sysctls too much, the latest HAProxy, 1.4-dev8, allows you to tweak those parameters from the global configuration, and per side (client or server).
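
(In the HAProxy configuration that corresponds to the per-side buffer keywords in the global section; a sketch, assuming the tune.sndbuf.* / tune.rcvbuf.* keywords and reusing the illustrative values above:)

global
    tune.sndbuf.client 4096
    tune.sndbuf.server 4096
    tune.rcvbuf.client 16060
    tune.rcvbuf.server 16060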

I am hoping this helps!

Willy Tarreau
8

The Out of socket memory error is often misleading. Most of the time, on Internet-facing servers, it does not indicate any problem related to running out of memory. As I explained in far greater detail in a blog post, the most common reason is the number of orphan sockets. An orphan socket is a socket that isn't associated with a file descriptor.

In certain circumstances, the kernel will issue the Out of socket memory error even though you're 2x or 4x away from the limit (/proc/sys/net/ipv4/tcp_max_orphans). This happens frequently in Internet-facing services and is perfectly normal. The right course of action in this case is to tune tcp_max_orphans up to at least 4x the number of orphans you normally see at peak traffic.
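
(A sketch of how to check where you stand and raise the ceiling; the 65536 figure is just the example value mentioned in the comments below:)

grep '^TCP:' /proc/net/sockstat                 # the "orphan" field is the current orphan count
cat /proc/sys/net/ipv4/tcp_max_orphans          # the limit the kernel compares against
sudo sysctl -w net.ipv4.tcp_max_orphans=65536   # raise it to at least 4x your peak orphan count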

Do not listen to any advice that recommends tuning tcp_mem, tcp_rmem, or tcp_wmem unless you really know what you're doing. Those giving out such advice typically don't. Their voodoo is often wrong or inappropriate for your environment and will not solve your problem. It might even make it worse.

tsuna
  • When this happens, the message in dmesg is different: you see "too many orphaned sockets". However, I agree with you that orphans can consume a huge amount of memory. – Willy Tarreau Mar 15 '11 at 07:36
  • When you exceed `/proc/sys/net/ipv4/tcp_max_orphans` you will experience a different error. The entire Stack Exchange stack, for instance, has `/proc/sys/net/ipv4/tcp_max_orphans` at 65536, and `/proc/net/sockstat` reports TCP: inuse 2996 orphan 171 tw 15972 alloc 2998 mem 1621 – a difference that cannot be ignored. – Geoff Dalgas Mar 15 '11 at 08:26
-4

We tune some of these parameters regularly. Our standard for high-throughput, low-latency trading platforms is:

net.ipv4.tcp_rmem = 4096 16777216 33554432
net.ipv4.tcp_wmem = 4096 16777216 33554432
net.ipv4.tcp_mem = 4096 16777216 33554432
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 30000
Jeff Atwood
  • per Willy's math, that means your standard memory pressure # (middle number) is 68 GB?! Times three (rmem, wmem, mem)?? – Jeff Atwood Jan 26 '10 at 23:00
  • These tunables are wrong and are very frequently found in bench environments, then blindly copy-pasted. They will not cause any problem with just a few concurrent sessions, but even with 100 TCP sockets you'll allocate 3.2 GB of RAM. As long as the latency is low, you won't notice anything suspect. You just have to unplug a remote machine during a transfer to see the output buffers fill, or freeze a local task and see the input buffer fill. This is insane... – Willy Tarreau Jan 27 '10 at 05:37
  • Jeff, this is not times three. tcp_mem is in pages and defines the global size. tcp_rmem and tcp_wmem are in bytes and define the per-socket size. – Willy Tarreau Jan 27 '10 at 05:40
  • Those tunables look wrong. For concurrent servers with small data you don't want to reserve that much socket buffer space, and tcp_mem is totally different from tcp_rmem/tcp_wmem: using the same numbers does not really make sense (one is bytes per connection, the other pages per system). – eckes Aug 07 '19 at 19:12