41

Ubuntu Server 10.04.1 x86

I've got a machine with an FCGI HTTP service behind nginx that serves a lot of small HTTP requests to a lot of different clients. (About 230 requests per second at peak hours; average response size with headers is 650 bytes; several million distinct clients per day.)

As a result, I have a lot of sockets hanging in TIME_WAIT (the graph was captured with the TCP settings below):

[Munin graph: TIME_WAIT socket count]

I'd like to reduce the number of sockets.

What can I do besides this?

$ cat /proc/sys/net/ipv4/tcp_fin_timeout
1
$ cat /proc/sys/net/ipv4/tcp_tw_recycle
1
$ cat /proc/sys/net/ipv4/tcp_tw_reuse
1

Update: some details on the actual service layout on the machine:

client -----TCP-socket--> nginx (load balancer reverse proxy) 
       -----TCP-socket--> nginx (worker) 
       --domain-socket--> fcgi-software
                          --single-persistent-TCP-socket--> Redis
                          --single-persistent-TCP-socket--> MySQL (other machine)

I probably should switch the load-balancer --> worker connection to a domain socket as well (see the sketch below), but the TIME_WAIT issue would remain: I plan to add a second worker on a separate machine soon, and I won't be able to use domain sockets across machines.
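For reference, a hypothetical sketch of what the load-balancer-to-worker hop could look like over a domain socket (the upstream name and socket path are made up):

    # load-balancer nginx: proxy to the worker over a Unix domain socket
    upstream worker {
        server unix:/var/run/nginx-worker.sock;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://worker;
        }
    }

The worker nginx would then use `listen unix:/var/run/nginx-worker.sock;` in its own server block.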

Alexander Gladysh

2 Answers

33

One thing you should do to start is to fix net.ipv4.tcp_fin_timeout=1. That is way too low; you should probably not take it much lower than 30.
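For example (30 is a starting point, not a magic number; persisting it is optional but shown for completeness):

    $ sudo sysctl -w net.ipv4.tcp_fin_timeout=30
    net.ipv4.tcp_fin_timeout = 30
    $ # make it survive reboots:
    $ echo 'net.ipv4.tcp_fin_timeout = 30' | sudo tee -a /etc/sysctl.conf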

Since this is behind nginx: does that mean nginx is acting as a reverse proxy? If so, then your connections are 2x (one to the client, one to your web server). Do you know which end these sockets belong to?
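One way to find out is to group the TIME_WAIT sockets by local address, something like this (a rough sketch; field positions assume standard `netstat -ant` output):

    $ netstat -ant | awk '$6 == "TIME_WAIT" {print $4}' | sort | uniq -c | sort -rn | head

If most of them show local port 80, they are on the client-facing side; if they cluster on ephemeral ports towards your backend addresses, they are nginx-to-backend sockets.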

Update:
fin_timeout is how long sockets stay in FIN-WAIT-2 (from networking/ip-sysctl.txt in the kernel documentation):

tcp_fin_timeout - INTEGER
        Time to hold socket in state FIN-WAIT-2, if it was closed
        by our side. Peer can be broken and never close its side,
        or even died unexpectedly. Default value is 60sec.
        Usual value used in 2.2 was 180 seconds, you may restore
        it, but remember that if your machine is even underloaded WEB server,
        you risk to overflow memory with kilotons of dead sockets,
        FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1,
        because they eat maximum 1.5K of memory, but they tend
        to live longer. Cf. tcp_max_orphans.

I think you may just have to let Linux keep the TIME_WAIT socket count up against what looks like a 32k cap, which is where Linux recycles them. This 32k cap is alluded to in this link:

Also, I find the /proc/sys/net/ipv4/tcp_max_tw_buckets confusing. Although the default is set at 180000, I see a TCP disruption when I have 32K TIME_WAIT sockets on my system, regardless of the max tw buckets.

This link also suggests that the TIME_WAIT state is 60 seconds and cannot be tuned via proc.

Random fun fact:
You can see the timers on each TIME_WAIT socket with netstat: `netstat -on | grep TIME_WAIT | less`
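On boxes with iproute2 you can get the same view from `ss`, which has a built-in state filter (assuming a reasonably recent `ss`):

    $ ss -o state time-wait | less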

Reuse vs. recycle:
These are kind of interesting; it reads like reuse enables the reuse of TIME_WAIT sockets, and recycle puts it into TURBO mode:

tcp_tw_recycle - BOOLEAN
        Enable fast recycling TIME-WAIT sockets. Default value is 0.
        It should not be changed without advice/request of technical
        experts.

tcp_tw_reuse - BOOLEAN
        Allow to reuse TIME-WAIT sockets for new connections when it is
        safe from protocol viewpoint. Default value is 0.
        It should not be changed without advice/request of technical
        experts.

I wouldn't recommend using net.ipv4.tcp_tw_recycle as it causes problems with NAT clients.

You might try not having both of those switched on and see what effect it has (try one at a time and see how each works on its own). I would use `netstat -n | grep TIME_WAIT | wc -l` for faster feedback than Munin.
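For instance, a simple way to watch the count while flipping the settings one at a time (interval and order are arbitrary):

    $ sudo sysctl -w net.ipv4.tcp_tw_recycle=0     # recycle off first, leave reuse on
    $ watch -n 5 'netstat -n | grep -c TIME_WAIT'  # re-count every 5 seconds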

Kyle Brandt
  • @Kyle: what value for `net.ipv4.tcp_fin_timeout` would you recommend? – Alexander Gladysh Dec 13 '10 at 18:37
  • @Kyle: client --TCP-socket--> nginx (load balancer reverse proxy) --TCP-socket--> nginx (worker) --domain-socket--> fcgi-software – Alexander Gladysh Dec 13 '10 at 18:39
  • @Kyle: I probably should replace TCP socket between load balancer and worker on the same machine with domain socket as well. – Alexander Gladysh Dec 13 '10 at 18:40
  • I would say `30` or maybe `20`. Try it and see. You have a lot of load so a lot of TIME_WAIT kind of makes sense. – Kyle Brandt Dec 13 '10 at 18:41
  • @Kyle: sorry for a stupid question (I'm on a cargo-cult level here so far, unfortunately), but what exactly should I expect to see when I change `net.ipv4.tcp_fin_timeout` from `1` to `20`? – Alexander Gladysh Dec 13 '10 at 18:47
  • @Kyle: The graph was captured when the TCP settings, listed in question, were in effect. – Alexander Gladysh Dec 13 '10 at 18:59
  • RRD (Munin) is set to `GAUGE` – Alexander Gladysh Dec 13 '10 at 19:00
  • @Kyle: sorry, I'm confused about your Update 2. How to keep `net.ipv4.tcp_fin_timeout` "at the 32k cap", when it is in seconds? – Alexander Gladysh Dec 13 '10 at 19:12
  • @Kyle: it seems that by increasing `net.ipv4.tcp_fin_timeout`, I'll increase the memory consumption. What will I gain instead? – Alexander Gladysh Dec 13 '10 at 19:14
  • @Kyle: I'm not sure that there is a cap at 32K sockets in TIME_WAIT. (At least, my graph does not show it — it seems to be just the natural load profile that incidentally gets flat.) Well, I expect higher load tomorrow, will see... – Alexander Gladysh Dec 13 '10 at 19:17
  • @Kyle: TCP disruption? I don't like the sound of this... :-( – Alexander Gladysh Dec 13 '10 at 19:49
  • @Alexander: You will probably hit conntrack issues before that I bet :-P Just make sure to keep a close eye on /var/log/messages for any of these sort of things – Kyle Brandt Dec 13 '10 at 19:50
  • Also I should monitor this information as well! Right now I am at 14k TIME_WAIT on my lb – Kyle Brandt Dec 13 '10 at 19:52
  • If 32K TIME_WAIT is a hard limit, I feel that getting yet another IP for the machine would not work, and I would have to get a whole second front-end machine... That's a pity, even the first one is working at most at 20% of estimated capacity, judging by CPU and memory load... – Alexander Gladysh Dec 13 '10 at 19:53
  • @Kyle: by conntrack you mean something like this? http://serverfault.com/questions/189796/error-cant-alloc-conntrack/189800#189800 – Alexander Gladysh Dec 13 '10 at 19:56
  • @Alex: Exactly, I actually just bypassed it on my LB with the NOTRACK iptables target, since my LBs are behind firewalls already. – Kyle Brandt Dec 13 '10 at 19:58
  • @Alex: More on the conntrack here http://serverfault.com/questions/202157/mysterious-haproxy-request-errors/202205#202205 .. make sure you are not bumping up against your max. – Kyle Brandt Dec 13 '10 at 20:02
  • @Kyle: Thanks. It seems that I'm not bumping against it (yet). I've got conntrack_max at 65536 and no messages in dmesg since boot 3 days ago. – Alexander Gladysh Dec 13 '10 at 20:07
  • @Kyle: Please advise, what line should I set my automonitoring to watch for in `/var/log/messages` to catch possible conntrack issues? Will "Can't alloc conntrack" do? Or something broader/narrower? – Alexander Gladysh Dec 13 '10 at 20:09
  • @Alex: Conn is probably fine. If you *really* want to get it right, grab the kernel source code and look for the printk (kernel print) calls and the exact current text for your kernel version ;-) – Kyle Brandt Dec 13 '10 at 20:12
  • @Alex: How close are you currently? `cat /proc/sys/net/netfilter/nf_conntrack_count` – Kyle Brandt Dec 13 '10 at 20:13
  • @Kyle: My `nf_conntrack_count` is `33789` now. – Alexander Gladysh Dec 13 '10 at 20:14
  • @Kyle: even if my conntrack_count is far from the limit, perhaps the extra monitoring wouldn't hurt — to play safe. – Alexander Gladysh Dec 13 '10 at 20:15
  • @Alex: Are you running netstat as root? Don't think that matters though.. Maybe pipe it to `less` and see if there is just a grep issue or something. – Kyle Brandt Dec 13 '10 at 20:20
  • @Kyle: No. Just tried running it as root (with sudo, and then from the root shell); no significant difference. 429-480 sockets. – Alexander Gladysh Dec 13 '10 at 20:25
  • @Kyle: I checked, it does not seem to be a grep issue. Roughly 5K connections total, 500 of them are in TIME_WAIT. – Alexander Gladysh Dec 13 '10 at 20:26
  • Oh, here is a nice one-liner: `netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c`. So @Alex, if Munin doesn't line up, maybe drill into just how it monitors these stats. Maybe the only issue is that Munin is giving you bad data :-) – Kyle Brandt Dec 13 '10 at 20:27
  • @Kyle: 10% according to netstat, instead of 400% according to Munin. I wonder, is this a Munin bug? – Alexander Gladysh Dec 13 '10 at 20:28
  • @Kyle: Nice one-liner indeed! Numbers are the same though. 5000/500. – Alexander Gladysh Dec 13 '10 at 20:30
  • @Alex: I use this n2rrd tool to do my graphing. So I write the rrd data files and the graph files myself, as well as often the checks. What I can tell you from that is that it is very easy to choose the wrong type (like the GAUGE vs COUNTER issue) or make all sorts of other typos. So there could be a bug there. – Kyle Brandt Dec 13 '10 at 20:30
  • @Kyle: about nf_conntrack_count and the string for monitoring: I guess I'll just monitor current count vs. maximum. – Alexander Gladysh Dec 13 '10 at 20:38
  • @Kyle: Here is what Munin does: http://munin-monitoring.org/browser/tags/1.4.4/plugins/node.d.linux/fw_conntrack.in – Alexander Gladysh Dec 13 '10 at 21:42
  • @Kyle: Munin looks at /proc/net/ip_conntrack. Currently it has 19K sockets in TIME_WAIT out of a total of 20K. Netstat says 300/3000. Who's right? o_O – Alexander Gladysh Dec 13 '10 at 21:50
  • @Kyle: Seems that I look at the wrong graph for Munin. That looks like some iptables state, not actual sockets. – Alexander Gladysh Dec 13 '10 at 21:53
  • Indeed, `net.netfilter.nf_conntrack_tcp_timeout_time_wait` does not affect `TIME_WAIT` timeout: http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html – msanford Aug 05 '14 at 14:13
  • btw, what is the problem with TIME_WAIT? If it is a problem, why not set tcp_max_tw_buckets to a very low value, say 1000? – Scott混合理论 Dec 06 '20 at 14:20
3

tcp_tw_reuse is relatively safe as it allows TIME_WAIT connections to be reused.

Also, you could run more services listening on different ports behind your load balancer if running out of ephemeral ports is a problem (see the sketch below).
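For example, a hypothetical nginx upstream fanned out across several local ports (service ports are made up); each extra destination port gives the load balancer a fresh range of source-port/destination-port combinations before ephemeral ports run out:

    # hypothetical upstream; ports are illustrative
    upstream backends {
        server 127.0.0.1:9001;
        server 127.0.0.1:9002;
        server 127.0.0.1:9003;
    }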

andrew pate