Our app became unresponsive under high load, with longer wait times. CPU usage was abnormally low (~15% utilisation per process; the app runs on 8 processes).
Nginx error log output showed a number of these:
2014/12/04 03:39:31 [crit] 24383#0: *2008067 connect() to 127.0.0.1:4567 failed (99: Cannot assign requested address) while connecting to upstream, client: 108.162.246.229, server: example.org, request: "GET /socket.io/?EIO=3&transport=polling&t=1417682366937-11501 HTTP/1.1", upstream: "http://127.0.0.1:4567/socket.io/?EIO=3&transport=polling&t=1417682366937-11501", host: "example.org", referrer: "https://example.org/unread"
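If I understand that error correctly, (99: Cannot assign requested address) means nginx couldn't get a free local ephemeral port for the upstream connection. The range the kernel hands out can be checked with something like:

# Ephemeral port range used for outgoing connections
# (on many distros this defaults to "32768 61000", i.e. roughly 28k ports)
cat /proc/sys/net/ipv4/ip_local_port_range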
What I saw
- Output of ss -tan | grep TIME-WAIT | wc -l was somewhere in the neighbourhood of 30,000, ouch! (see the breakdown sketch after this list)
- The app would be responsive, and then:
- All processes would suddenly drop down to near 0 CPU usage
- App would become unresponsive
- After ~30 seconds, app would be back up, repeat ad infinitum
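To see where all those TIME-WAIT sockets were pointing, a breakdown by peer address along these lines is what I used (a rough sketch; $5 is the "Peer Address:Port" column in ss -tan output):

# Count TIME-WAIT sockets per peer address:port
ss -tan | grep TIME-WAIT | awk '{print $5}' | sort | uniq -c | sort -rn | head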
I needed to get the app back up, so as a band-aid solution:
echo 28000 65535 > /proc/sys/net/ipv4/ip_local_port_range
(MongoDB runs on 27101 so I picked a lower limit above that)
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
This reduced the number of sockets in TIME-WAIT state to a more manageable ~400.
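For reference, the same band-aid expressed as sysctl settings (a sketch of the runtime equivalents; they'd still need to go into /etc/sysctl.conf to survive a reboot):

# Runtime equivalents of the /proc writes above
sysctl -w net.ipv4.ip_local_port_range="28000 65535"
sysctl -w net.ipv4.tcp_tw_reuse=1
# I've read tcp_tw_recycle can misbehave for clients behind NAT,
# so I'm treating this one as strictly temporary
sysctl -w net.ipv4.tcp_tw_recycle=1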
Here's a snippet of ss -tan | grep TIME-WAIT:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
TIME-WAIT 0 0 127.0.0.1:29993 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:28522 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:29055 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:31849 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:32744 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:28304 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34858 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:36707 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34756 127.0.0.1:4567
TIME-WAIT 0 0 104.131.91.122:443 108.162.250.6:55549
TIME-WAIT 0 0 127.0.0.1:32629 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34544 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34732 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:33820 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:33609 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34504 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:32463 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:35089 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:30003 127.0.0.1:4567
TIME-WAIT 0 0 104.131.91.122:443 199.27.128.100:36383
TIME-WAIT 0 0 127.0.0.1:33040 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34038 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:28096 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:29541 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:30022 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:31375 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:29827 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:29334 127.0.0.1:4567
My questions:
- A lot of these are from 127.0.0.1 to 127.0.0.1; is this normal? Shouldn't the peer addresses all be external IPs?
- Our Node.js app sits behind an nginx proxy, which sits behind CloudFlare DNS, so the number of unique inbound IP addresses is limited. Could that be related?
- How do I properly reduce the number of sockets in TIME-WAIT state?
- I'm positive we don't have 3,000 unique socket connections per second. Is something misconfigured on our end, opening hundreds of sockets when it should be opening one? (There's a quick check sketched below.)
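For that last question, here's the sort of quick check I have in mind, to see whether nginx is reusing connections to the upstream or opening a fresh one per request (a sketch; it just counts sockets to the local upstream port 4567 by TCP state, and lots of TIME-WAIT with almost no ESTABLISHED would suggest no connection reuse between nginx and the app):

# Count sockets to the local upstream (port 4567) by state;
# mostly TIME-WAIT with few ESTABLISHED would mean each proxied
# request gets its own short-lived upstream connection
ss -tan | awk '$5 == "127.0.0.1:4567" {print $1}' | sort | uniq -c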
Thanks in advance for any help you can offer!