17

OS: Windows Server 2008 SP2 (running on Amazon EC2).

Running a web app using Apache httpd and Tomcat 6.02; the web server has keep-alive enabled.

There are around 69,250 (HTTP port 80) + 15,000 (ports other than 80) TCP connections in the TIME_WAIT state (observed with netstat and TCPView). These connections don't seem to close even after stopping the web server (I waited 24 hours).

Performance monitor counters:

  • TCPv4 Active Connections: 145K
  • TCPv4 Passive Connections: 475K
  • TCPv4 Connection Failures: 16K
  • TCPv4 Connections Reset: 23K

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters does not have a TcpTimedWaitDelay value, so it should be at the default (2×MSL, i.e. 4 minutes).
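To double-check, I verified the value is absent with a quick read along these lines (a Python 3 sketch using the standard-library winreg module; the fallback just mirrors the documented default):

```python
# Sketch: read TcpTimedWaitDelay from the registry, falling back to the
# documented default of 240 seconds (2*MSL) when the value is absent.
# Assumes Windows and Python 3; winreg ships with the standard library.
import winreg

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def tcp_timed_wait_delay() -> int:
    """Return the effective TIME_WAIT delay in seconds."""
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
            value, _type = winreg.QueryValueEx(key, "TcpTimedWaitDelay")
            return value
    except FileNotFoundError:
        return 240  # value not set: default of 4 minutes applies

print(f"TIME_WAIT delay: {tcp_timed_wait_delay()} seconds")
```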

Even if thousands of connection requests arrive at the same time, why is the Windows OS unable to clean these connections up eventually?
What could be the reasons behind this situation?
Is there any way to forcefully close all these TIME_WAIT connections without restarting the Windows OS?

After a few days the app stops accepting any new connections.

Aliaksandr Belik

6 Answers

14

We've been dealing with this issue too. It looks like Amazon found the root cause and corrected it. Here is the info they gave me.

Hi, I am pasting below an explanation of what was causing this issue. Good news is that this has been fixed very recently by our engineering team. To get the fix, all you'll have to do is STOP/START the Windows Server 2008 instances where you are seeing this issue. Again, I am not talking about REBOOT, which is different. STOP/START causes the instance to move to a different (healthy) host. When these instances launch again, they will be running on hosts that have the fix in place, so they won't have this issue again.

Now below is the engineering explanation of this issue. After an in-depth investigation, we've found that when running Windows 2008 x64 on most available instance types, we've identified an issue which may result in TCP connections that remain in TIME_WAIT/CLOSE_WAIT for excessively long periods of time (in some cases, remaining in this state indefinitely). While in these states, the particular socket pairs remain unusable and, if enough accumulate, will result in port exhaustion for the ports in question. If this particular circumstance occurs, the only solution to clear the socket pairs in question is to reboot the instance in question.

We have determined the cause to be the values produced by a timer function in the Windows 2008 kernel API which, on many of our 64-bit platforms, will occasionally retrieve a value that is extremely far in the future. This affects the TCP stack by causing the timestamps on the TCP socket pairs to be stamped significantly far in the future. According to Microsoft, there is a stored cumulative counter which will not be updated unless the value produced by this API call is larger than the cumulative value. The ultimate result is that sockets created after this point will all be stamped too far in the future until that future time is reached. In some cases, we have seen this value several hundred days into the future, thus the socket pairs appear to be stuck forever.
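To make the mechanism in that last paragraph concrete, here is a toy model (purely illustrative, not the actual kernel code) of a cumulative, forward-only counter: a single glitched far-future sample poisons every later timestamp.

```python
# Toy model of the behavior described above: a cumulative counter that only
# moves forward. One absurd far-future sample from the timer poisons every
# subsequent timestamp. Illustrative only; not the actual Windows kernel logic.

class CumulativeClock:
    def __init__(self) -> None:
        self.high_water = 0.0

    def stamp(self, raw_timer_value: float) -> float:
        # The counter never goes backwards: it keeps the maximum seen so far.
        self.high_water = max(self.high_water, raw_timer_value)
        return self.high_water

clock = CumulativeClock()
print(clock.stamp(100.0))   # 100.0 -> normal
print(clock.stamp(101.0))   # 101.0 -> normal
print(clock.stamp(3.0e9))   # glitched sample "hundreds of days" ahead
print(clock.stamp(102.0))   # 3.0e9 -> every later socket is stamped in the future
```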

GregB
  • This thread is like two weeks old, and somehow you posted their response _seconds_ before me. Excellent news! They've been giving us the runaround for months now. – Marc Bollinger Apr 04 '11 at 17:54
  • @MarcBollinger: Just found [your answer](http://serverfault.com/a/255560/10305) via the AWS team response to the thread you mentioned ([System.Diagnostics.Stopwatch not working](https://forums.aws.amazon.com/message.jspa?messageID=190686#190686)) - that thread is still unanswered, but your comment here seems to indicate it might actually have been addressed already as per the info @GregB quoted? Or could the `QueryPerformanceCounter` issue root cause still be in place and only the TCP issue at hand has been remedied? Thanks for your insight! – Steffen Opel May 25 '12 at 11:10
4

Ryan's answer is good general advice, except that it doesn't apply to the condition Ravi is experiencing on EC2. We too have seen this problem, and for whatever reason Windows completely ignores TcpTimedWaitDelay and never releases the socket from its TIME_WAIT state.

Waiting doesn't help... restarting the app doesn't help... the only remedy we've found is to restart the OS. Really ugly.

3

I found this thread completely by accident while debugging a separate issue, but this is a rarely raised yet well-known issue with Windows on EC2. We used to have premium support and discussed this with them privately through that channel, but this is a related issue that we did discuss in the public forums.

As others have mentioned, you do need to tune Windows Server out of the box. However, in the same way that Stopwatch isn't working in the above thread, the TCP/IP stack also uses the QueryPerformanceCounter call to determine exactly how long the TIME_WAIT period should last. The problem is that on EC2 they've encountered, and know about, an issue in which QueryPerformanceCounter goes haywire and may return times far, far in the future; it's not that your TIME_WAIT setting is being ignored, it's that the expiration time of TIME_WAIT is potentially years in the future. When running an httpd server, you can see how you quickly accumulate these zombie sockets once the state is encountered (we generally see that this is a discrete event, rather than a slow accumulation of zombies).
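If you want to spot-check the timer yourself, a small ctypes sketch along these lines (my own ad-hoc diagnostic, not something AWS provides; the "healthy" range is an arbitrary assumption) shows whether QueryPerformanceCounter is tracking wall-clock time:

```python
# Sketch: sanity-check QueryPerformanceCounter from Python via ctypes.
# Two samples taken a second apart should differ by roughly one second's
# worth of ticks; a wildly larger delta hints at the glitch described
# above. Windows-only; the healthy-range thresholds are guesses.
import ctypes
import time

kernel32 = ctypes.windll.kernel32

def qpc() -> int:
    value = ctypes.c_int64()
    kernel32.QueryPerformanceCounter(ctypes.byref(value))
    return value.value

def qpc_frequency() -> int:
    freq = ctypes.c_int64()
    kernel32.QueryPerformanceFrequency(ctypes.byref(freq))
    return freq.value

freq = qpc_frequency()
start = qpc()
time.sleep(1.0)
elapsed = (qpc() - start) / freq
print(f"QPC measured {elapsed:.3f}s for a 1s sleep")
if not 0.5 < elapsed < 2.0:
    print("QPC looks unhealthy on this host")
```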

What we do is run a service in the background that queries the number of sockets in the TIME_WAIT state, and once it hovers above a certain threshold, we take action (reboot the server). Somehow in the past 45 seconds, someone pointed out that you can stop/start the server to fix the issue; I suggest you couple these two approaches.
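The watchdog itself is simple; a stripped-down sketch (Python, with a made-up threshold you'd tune to your own baseline) looks something like this:

```python
# Sketch of the watchdog described above: count sockets in TIME_WAIT by
# parsing `netstat -ano` output and flag the host once a threshold is
# crossed. The threshold and the action taken are site-specific
# assumptions; wire in your reboot/stop-start logic where noted.
import subprocess

THRESHOLD = 30000  # assumption: tune to your normal TIME_WAIT baseline

def count_time_wait() -> int:
    output = subprocess.run(
        ["netstat", "-ano", "-p", "TCP"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in output.splitlines() if "TIME_WAIT" in line)

count = count_time_wait()
print(f"{count} sockets in TIME_WAIT")
if count > THRESHOLD:
    print("Threshold exceeded; schedule a stop/start of the instance here")
```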

2

The default settings for the TCP stack in Windows are, to say the least, not optimal for systems that are going to host an HTTP server.

To get the best out of your Windows machine when used as an HTTP server, there are a few parameters you'd normally tweak, such as MaxUserPort, TcpTimedWaitDelay, TcpAckFrequency, EnableDynamicBacklog, and KeepAliveInterval.

I had written a note to self on this a few years ago, just in case I needed some quick defaults to start with. Feel free to understand the parameters and then tweak them.
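As a flavor of what that tweaking looks like, here is a sketch that applies two of the values above via the registry (the numbers are common starting points, not recommendations for your workload, and changes only take effect after a reboot):

```python
# Sketch: set two of the TCP tuning values mentioned above via the registry.
# Run elevated; the specific numbers are common starting points, not
# recommendations for any particular workload. A reboot is required for
# the changes to take effect.
import winreg

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

settings = {
    "MaxUserPort": 65534,      # widen the ephemeral port range
    "TcpTimedWaitDelay": 30,   # shorten TIME_WAIT from the 240s default
}

with winreg.OpenKey(
    winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS, 0, winreg.KEY_SET_VALUE
) as key:
    for name, value in settings.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)
        print(f"Set {name} = {value}")
```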

Ryan Fernandes
2

Unrelated to AWS, we just ran into this problem; it appears to be the issue described in this KB article:

http://support.microsoft.com/kb/2553549/en-us

Basically, it kicks in if a system is up for more than 497 days and the hotfix hasn't been applied. A reboot has, of course, cleared it down; we might not know for the next 16 months whether the hotfix worked, but this may help anyone who has long-uptime servers out there.
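For the curious, 497 days is not an arbitrary number; it matches a 32-bit counter of 10 ms ticks wrapping around (my reading of the KB's magic number, shown as arithmetic):

```python
# Why 497 days: a 32-bit tick counter incrementing every 10 ms wraps after
# 2**32 ticks. (Interpretation of the KB's magic number, not official text.)
ticks = 2 ** 32
tick_ms = 10
seconds = ticks * tick_ms / 1000
days = seconds / 86400
print(f"{days:.1f} days until rollover")  # ~497.1 days
```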

rmc47
  • What a strange number of days. We were just bitten by this too - 500 days 12 hours uptime. Time to decomm this box anyway. – Josh Smeaton Mar 03 '15 at 03:11
0

I was experiencing almost exactly the same thing on a number of boxes running Windows Server 2008 R2 x64 with SP1, mostly with CLOSE_WAIT (which is somewhat different from TIME_WAIT: CLOSE_WAIT means the remote end has closed the connection but the local application hasn't yet closed its socket). I bumped into this answer, which referenced a KB article from Microsoft and a hotfix for servers running behind a load balancer (which mine are). After installing the hotfix and rebooting, all of the CLOSE_WAIT stuff was resolved.

Jonathan Oliver