Short version: One Windows Server 2012 machine on my network is getting persistent but intermittent TCP RSTs when connecting to certain websites. Dunno where they're coming from. Check out the wireshark log for my analysis & questions.

Long version:

We run a caching web-proxy on one of our servers to service our small office. A co-worker reported getting a lot of 'Connection Reset' or 'Page can't be displayed' errors when connecting to certain sites, but that refreshing usually fixes it.

I reproduced the behavior in a proxied browser, and then more directly with an un-proxied browser on the server itself. Pings & traceroutes to the troublesome sites don't show any problems, though; the issue seems to be limited to TCP connections.

I then made a script to test the affected sites by sending them HTTP HEAD requests directly via cURL & checking how often they succeed. A typical test looks like this: (this is unproxied, running directly on the bad server)

C:\sdk\Apache24\htdocs>php rhTest.php
Sending HTTP HEAD requests to "http://www.washingtonpost.com/":
20:21:42: Length: 0     Response Code: NULL (0%)
20:22:02: Length: 0     Response Code: NULL (0%)
20:22:22: Length: 0     Response Code: NULL (0%)
20:22:42: Length: 0     Response Code: NULL (0%)
20:23:02: Length: 3173  Response Code: HTTP/1.1 302 Moved Temporarily (20%)
20:23:22: Length: 3174  Response Code: HTTP/1.1 302 Moved Temporarily (33.33%)
20:23:43: Length: 0     Response Code: NULL (28.57%)
20:24:03: Length: 3171  Response Code: HTTP/1.1 302 Moved Temporarily (37.5%)
20:24:23: Length: 3173  Response Code: HTTP/1.1 302 Moved Temporarily (44.44%)
20:24:43: Length: 3172  Response Code: HTTP/1.1 302 Moved Temporarily (50%)
20:25:03: Length: 0     Response Code: NULL (45.45%)

Over the long term, only about 60% of the requests succeed; the rest return nothing, with the curl error "cURL error (56): Failure when receiving data from the peer". The bad behavior is consistent for the websites I test (no site has ever 'gotten better') and it's quite persistent: I've been troubleshooting for a week now, and co-workers report the problem has apparently been there for months.
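For anyone who wants to repeat this kind of test, here is a minimal Python sketch of a similar probe. It is not the original rhTest.php (which is not shown here); the function names `head_request` and `run_probes` are made up for illustration, and it only handles plain `http://` URLs. It sends HEAD requests without following redirects and keeps the same running success percentage as in the output above.

```python
import http.client
import time
from urllib.parse import urlsplit


def head_request(url, timeout=10):
    """Send one HTTP HEAD request without following redirects.

    Returns the status code, or None if the connection failed or was reset.
    Hypothetical helper: plain http:// URLs only.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Covers connection resets, timeouts and truncated responses.
        return None
    finally:
        conn.close()


def run_probes(probe, url, attempts, delay=0.0):
    """Call probe(url) repeatedly; return a list of (status, cumulative success %)."""
    results = []
    successes = 0
    for attempt in range(1, attempts + 1):
        status = probe(url)
        if status is not None:
            successes += 1
        results.append((status, round(100.0 * successes / attempt, 2)))
        if delay and attempt < attempts:
            time.sleep(delay)
    return results
```

Calling `run_probes(head_request, "http://www.washingtonpost.com/", attempts=11, delay=20)` would approximate the run shown above. Passing the probe in as a parameter also makes it easy to swap in a proxied variant for comparison.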

I tested the HEAD request script on other machines on our network: no problems, all connections go through to all the sites on my test list. Then I set up a proxy on my personal desktop, and when I run the HEAD requests from the problematic server through it, all connections go through. So whatever the problem is, it's very specific to this server.

Next I tried to isolate which websites exhibit the connection-reset behavior:

  • None of our intranet sites (192.168.x.x) drop connections.
  • No ipv6 site I've tested drops connections. (We are dual-stack)
  • Only a small minority of internet ipv4 sites drop connections.
  • Every site which uses cloudflare as a CDN (that I've tested) drops connections. (but the problem does not seem to be exclusive to cloudflare sites)

This angle wasn't developing into anything really helpful, so next I installed Wireshark to look at what was going on when a request failed. A failed HEAD request looks like this: (larger screenshot here: http://imgur.com/TNfRUtX )

127 48.709776000    192.168.1.142   192.33.31.56    TCP 66  52667 > http [SYN, ECN, CWR] Seq=0 Win=8192 Len=0 MSS=8960 WS=256 SACK_PERM=1
128 48.728207000    192.33.31.56    192.168.1.142   TCP 66  http > 52667 [SYN, ACK, ECN] Seq=0 Ack=1 Win=42340 Len=0 MSS=1460 SACK_PERM=1 WS=128
129 48.728255000    192.168.1.142   192.33.31.56    TCP 54  52667 > http [ACK] Seq=1 Ack=1 Win=65536 Len=0
130 48.739371000    192.168.1.142   192.33.31.56    HTTP    234 HEAD / HTTP/1.1 
131 48.740917000    192.33.31.56    192.168.1.142   TCP 60  http > 52667 [RST] Seq=1 Win=0 Len=0
132 48.757766000    192.33.31.56    192.168.1.142   TCP 60  http > 52667 [ACK] Seq=1 Ack=181 Win=42240 Len=0
133 48.770314000    192.33.31.56    192.168.1.142   TCP 951 [TCP segment of a reassembled PDU]
134 48.807831000    192.33.31.56    192.168.1.142   TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
135 48.859592000    192.33.31.56    192.168.1.142   TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
138 49.400675000    192.33.31.56    192.168.1.142   TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
139 50.121655000    192.33.31.56    192.168.1.142   TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
141 51.564009000    192.33.31.56    192.168.1.142   TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
143 54.452561000    192.33.31.56    192.168.1.142   TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897

The way I'm reading this (correct me if I'm wrong, this isn't really my area) is that:

  • We open a tcp connection to the webserver
  • Webserver ACKs
  • HTTP HEAD request is sent
  • There is a RST packet, marked as from the webserver IP, that kills the connection.
  • Webserver sends ACK
  • Webserver tries to respond to the HEAD request with valid HTTP data (the 951-byte reply contains the correct HTTP header)
  • Webserver retransmits (several times over several seconds) the valid HTTP response, but it cannot succeed since the connection has been RST

So if the webserver has sent a valid RST, why does it keep trying to fill the request? And if the webserver didn't generate the RST, what the heck did?
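One hedged way to approach the second question is to compare timestamps in the capture above. The sketch below only redoes the arithmetic on the frame times shown (frames 127, 128, 130, 131 and 132); the variable names are made up for illustration.

```python
# Capture timestamps in seconds, copied from the Wireshark frames above.
SYN_SENT = 48.709776          # frame 127: our SYN leaves
SYN_ACK_RECEIVED = 48.728207  # frame 128: webserver's SYN, ACK arrives
HEAD_SENT = 48.739371         # frame 130: our HEAD request leaves
RST_RECEIVED = 48.740917      # frame 131: the RST arrives
ACK_RECEIVED = 48.757766      # frame 132: webserver's ACK of the request arrives


def elapsed_ms(start, end):
    """Elapsed time between two capture timestamps, in milliseconds."""
    return round((end - start) * 1000, 3)


handshake_rtt_ms = elapsed_ms(SYN_SENT, SYN_ACK_RECEIVED)
rst_delay_ms = elapsed_ms(HEAD_SENT, RST_RECEIVED)
ack_delay_ms = elapsed_ms(HEAD_SENT, ACK_RECEIVED)

print(f"handshake round trip: {handshake_rtt_ms} ms")
print(f"HEAD sent -> RST:     {rst_delay_ms} ms")
print(f"HEAD sent -> ACK:     {ack_delay_ms} ms")
```

On these numbers the RST shows up about 1.5 ms after the HEAD request leaves, while the handshake took about 18.4 ms and the webserver's real ACK arrives about 18.4 ms after the request. That would be consistent with the RST being generated by something much closer than the webserver, though a single capture is not proof.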

Things I have tried that have had no effect:

  • Disabling NIC teaming
  • Changing out the network adaptor (replacement NIC was known to be working)
  • Assigning a static ip.
  • Disabling ipv6.
  • Disabling jumbo frames.
  • Plugging server directly into our modem one night, bypassing our switches & router.
  • Turning off windows firewall.
  • Resetting TCP settings via netsh
  • Disabling practically every other service on the server. (We mostly use it as a fileserver, but there's apache & a couple DB's)
  • Banging head on desk (repeatedly)

I suspect something on the server is generating the RST packets, but for the life of me I can't find it. I feel like it'd help a lot if I knew why it's just this server, or why it's only some websites. While I'm still curious, I'm increasingly inclined to nuke from orbit & start over.

Ideas / Suggestions?

-Thanks

Morty
  • What operating system does this caching proxy server run? And what is the proxy server software? – Michael Hampton Nov 04 '14 at 02:35
  • The server is running Windows Server 2012, the proxy is squid 3.3.3 running via cygwin; but this happens to all TCP connections from the machine, not just the proxy's connections. The curl test script is unproxied. – Morty Nov 04 '14 at 02:51

1 Answer

Your packet capture had something unusual: The ECN bits were set in the outgoing SYN packet.

Explicit Congestion Notification (ECN) is an extension to IP and TCP that allows hosts to react more quickly to network congestion. It was first introduced to the Internet 15 years ago, but there were serious issues noted when it was first deployed. The most serious of them was that many firewalls would either drop packets or return an RST when receiving a SYN packet with the ECN bits set.
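To make "the ECN bits" concrete, here is a small Python sketch that decodes the TCP flags byte and classifies a SYN according to the ECN negotiation rules in RFC 3168. The helper names are made up for illustration, and it assumes that the "ECN" label in the Wireshark output in the question denotes the ECE flag.

```python
# Bit values of the eight flags in the low byte of the TCP flags field.
FIN, SYN, RST, PSH, ACK, URG, ECE, CWR = (1 << i for i in range(8))

FLAG_NAMES = [
    (FIN, "FIN"), (SYN, "SYN"), (RST, "RST"), (PSH, "PSH"),
    (ACK, "ACK"), (URG, "URG"), (ECE, "ECE"), (CWR, "CWR"),
]


def decode_flags(flags):
    """Return the names of the flags set in a TCP flags byte."""
    return [name for bit, name in FLAG_NAMES if flags & bit]


def classify_syn(flags):
    """Classify a SYN or SYN-ACK by its ECN negotiation bits (RFC 3168)."""
    if not (flags & SYN):
        return "not a SYN"
    if flags & ACK:
        # An ECN-setup SYN-ACK has ECE set and CWR clear.
        if (flags & ECE) and not (flags & CWR):
            return "ECN-setup SYN-ACK"
        return "non-ECN SYN-ACK"
    # An ECN-setup SYN has both ECE and CWR set.
    if (flags & ECE) and (flags & CWR):
        return "ECN-setup SYN"
    return "non-ECN SYN"


print(decode_flags(0xC2), classify_syn(0xC2))  # flags as in frame 127 of the question
print(decode_flags(0x52), classify_syn(0x52))  # flags as in frame 128 of the question
```

Under that assumption, frame 127 in the question is an ECN-setup SYN and frame 128 is an ECN-setup SYN-ACK.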

As a result, most operating systems disabled ECN by default, at least for outgoing connections. Because of that, I suspect that a lot of sites (and firewall vendors!) simply never fixed their firewalls.

Until Windows Server 2012 was released. Microsoft enabled ECN by default starting with this operating system version.

Unfortunately, nobody in recent memory has done any significant testing of Internet sites' responses to ECN, so it's hard to gauge whether the problems seen in the early 2000s are still extant. I strongly suspect that they are, and that your traffic is, at least some of the time, passing through such equipment.

After enabling ECN on my desktop and then firing up Wireshark it was only a few seconds before I caught an example of a host from which I got an RST to a packet with SYN and ECN set, though most hosts seem to work fine. Maybe I'll go scan the Internet myself...

You can try disabling ECN on your server to see if the issue clears up. This will also make you unable to use DCTCP, but in a small office it's highly unlikely that you are doing so or have any need to do so.

netsh int tcp set global ecncapability=disabled
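To confirm the setting afterwards, `netsh int tcp show global` lists the current value. The Python sketch below wraps that command and pulls out the ECN line; the helper names are made up for illustration, and the parser assumes the English-language label "ECN Capability", so it may need adjusting on a localized install.

```python
import re
import subprocess


def parse_ecn_capability(netsh_output):
    """Extract the ECN Capability value from `netsh int tcp show global` output.

    Assumes the English label "ECN Capability"; returns None if it is not found.
    """
    match = re.search(
        r"^\s*ECN Capability\s*:\s*(\S+)",
        netsh_output,
        re.MULTILINE | re.IGNORECASE,
    )
    return match.group(1).lower() if match else None


def current_ecn_capability():
    """Run netsh (Windows only) and return the current ECN setting."""
    completed = subprocess.run(
        ["netsh", "int", "tcp", "show", "global"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecn_capability(completed.stdout)
```

On the affected server, `current_ecn_capability()` should report `disabled` once the command above has been applied.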
Michael Hampton
  • Thank You! After disabling ECN I'm seeing a 100% success rate for connections to the most troublesome sites! I'll have to test more in the morning before turning our proxy back on, but I'm going to go ahead and mark this as both answered and as another smashing victory in Microsoft QA's continuing war on users. – Morty Nov 04 '14 at 04:15
  • To be fair, I don't think it's Microsoft's fault that some firewall admins are idiots. ECN is very nice to have, as it does help a lot, and it would be nice if we all could start to use it...someday. – Michael Hampton Nov 04 '14 at 04:28
  • Oh, I wonder if _this_ explains the tons of resets I've been getting from Imgur and Wikia for ages (happens with two different local ISPs, but never when VPN'd through another country, which confuses me) – user1686 Nov 04 '14 at 18:29
  • I _suspect_ (but obviously can't prove) that some of the machines responsible for this are lurking in the default-free zone. – Michael Hampton Nov 04 '14 at 18:36