Our app became unresponsive under high load, with longer wait times. CPU usage was abnormally low (~15% utilisation per process; the app runs on 8 processes).
Nginx error log output showed a number of these:
2014/12/04 03:39:31 [crit] 24383#0: *2008067 connect() to 127.0.0.1:4567 failed (99: Cannot assign requested address) while connecting to upstream, client: 108.162.246.229, server: example.org, request: "GET /socket.io/?EIO=3&transport=polling&t=1417682366937-11501 HTTP/1.1", upstream: "http://127.0.0.1:4567/socket.io/?EIO=3&transport=polling&t=1417682366937-11501", host: "example.org", referrer: "https://example.org/unread"
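If I understand that error correctly, (99: Cannot assign requested address) means nginx couldn't get a free local ephemeral port for the upstream connection. The range the kernel hands out can be checked with something like:

# Ephemeral port range used for outgoing connections
# (on many distros this defaults to "32768 61000", i.e. roughly 28k ports)
cat /proc/sys/net/ipv4/ip_local_port_range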
What I saw
- Output of ss -tan | grep TIME-WAIT | wc -l was somewhere in the neighbourhood of 30,000, ouch! (see the breakdown sketch after this list)
- The app would be responsive, and then:
- All processes would suddenly drop down to near 0 CPU usage
- App would become unresponsive
- After ~30 seconds, app would be back up, repeat ad infinitum
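To see where all those TIME-WAIT sockets were pointing, a breakdown by peer address along these lines is what I used (a rough sketch; $5 is the "Peer Address:Port" column in ss -tan output):

# Count TIME-WAIT sockets per peer address:port
ss -tan | grep TIME-WAIT | awk '{print $5}' | sort | uniq -c | sort -rn | head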
I needed to get the app back up, so as a band-aid solution:
echo 28000 65535 > /proc/sys/net/ipv4/ip_local_port_range
(MongoDB runs on 27101 so I picked a lower limit above that)
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
This reduced the number of sockets in TIME-WAIT state to a more manageable ~400.
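For reference, the same band-aid expressed as sysctl settings (a sketch of the runtime equivalents; they'd still need to go into /etc/sysctl.conf to survive a reboot):

# Runtime equivalents of the /proc writes above
sysctl -w net.ipv4.ip_local_port_range="28000 65535"
sysctl -w net.ipv4.tcp_tw_reuse=1
# I've read tcp_tw_recycle can misbehave for clients behind NAT,
# so I'm treating this one as strictly temporary
sysctl -w net.ipv4.tcp_tw_recycle=1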
Here's a snippet of ss -tan | grep TIME-WAIT:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
TIME-WAIT 0 0 127.0.0.1:29993 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:28522 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:29055 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:31849 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:32744 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:28304 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34858 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:36707 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34756 127.0.0.1:4567
TIME-WAIT 0 0 104.131.91.122:443 108.162.250.6:55549
TIME-WAIT 0 0 127.0.0.1:32629 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34544 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34732 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:33820 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:33609 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34504 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:32463 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:35089 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:30003 127.0.0.1:4567
TIME-WAIT 0 0 104.131.91.122:443 199.27.128.100:36383
TIME-WAIT 0 0 127.0.0.1:33040 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:34038 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:28096 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:29541 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:30022 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:31375 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:29827 127.0.0.1:4567
TIME-WAIT 0 0 127.0.0.1:29334 127.0.0.1:4567
My questions:
- A lot of these are from 127.0.0.1 to 127.0.0.1; is this normal? Shouldn't the peer addresses all be external IPs?
- Our Node.js app sits behind an nginx proxy, which sits behind CloudFlare DNS, so the number of unique inbound IP addresses is limited. Could that be related?
- How do I properly reduce the number of sockets in TIME-WAIT state?
- I'm positive we don't have 3,000 unique socket connections per second. Is something misconfigured on our end, opening hundreds of sockets when it should be opening one? (There's a quick check sketched below.)
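For that last question, here's the sort of quick check I have in mind, to see whether nginx is reusing connections to the upstream or opening a fresh one per request (a sketch; it just counts sockets to the local upstream port 4567 by TCP state, and lots of TIME-WAIT with almost no ESTABLISHED would suggest no connection reuse between nginx and the app):

# Count sockets to the local upstream (port 4567) by state;
# mostly TIME-WAIT with few ESTABLISHED would mean each proxied
# request gets its own short-lived upstream connection
ss -tan | awk '$5 == "127.0.0.1:4567" {print $1}' | sort | uniq -c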
Thanks in advance for any help you can offer!