Short version: One Windows Server 2012 machine on my network is getting persistent but intermittent TCP RSTs when connecting to certain websites. I don't know where they're coming from. See the Wireshark log below for my analysis & questions.
Long version:
We run a caching web-proxy on one of our servers to service our small office. A co-worker reported getting a lot of 'Connection Reset' or 'Page can't be displayed' errors when connecting to certain sites, but that refreshing usually fixes it.
I verified the behavior in a browser, and then more directly with an un-proxied browser on the server itself. Pings & traceroutes to the troublesome sites don't show any problems, though; the issue seems to be limited to TCP connections.
I then made a script to test the affected sites by sending them HTTP HEAD requests directly via cURL & checking how often they succeed. A typical test looks like this: (this is unproxied, running directly on the bad server)
C:\sdk\Apache24\htdocs>php rhTest.php
Sending HTTP HEAD requests to "http://www.washingtonpost.com/":
20:21:42: Length: 0 Response Code: NULL (0%)
20:22:02: Length: 0 Response Code: NULL (0%)
20:22:22: Length: 0 Response Code: NULL (0%)
20:22:42: Length: 0 Response Code: NULL (0%)
20:23:02: Length: 3173 Response Code: HTTP/1.1 302 Moved Temporarily (20%)
20:23:22: Length: 3174 Response Code: HTTP/1.1 302 Moved Temporarily (33.33%)
20:23:43: Length: 0 Response Code: NULL (28.57%)
20:24:03: Length: 3171 Response Code: HTTP/1.1 302 Moved Temporarily (37.5%)
20:24:23: Length: 3173 Response Code: HTTP/1.1 302 Moved Temporarily (44.44%)
20:24:43: Length: 3172 Response Code: HTTP/1.1 302 Moved Temporarily (50%)
20:25:03: Length: 0 Response Code: NULL (45.45%)
Over the long term, only about 60% of the requests succeed; the rest return nothing, with a cURL error of "cURL error (56): Failure when receiving data from the peer". The bad behavior is consistent for the websites I test (no site has ever 'gotten better') and it's quite persistent: I've been troubleshooting for a week now, and co-workers report the problem has apparently been there for months.
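The test above can be approximated with a short sketch like the following. It's in Python rather than the PHP/cURL of rhTest.php, so it is a stand-in and not the actual script; the function names (head_status, running_success_rates, probe) are made up for illustration. It assumes the percentage in the output is the running success rate, which matches the numbers shown, and it only handles plain http:// URLs.

```python
import http.client
import time
from urllib.parse import urlsplit


def head_status(url, timeout=10.0):
    """Send one HTTP HEAD request; return the status code, or None on failure."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Connection reset, timeout, DNS failure, truncated reply, ...
        return None
    finally:
        conn.close()


def running_success_rates(outcomes):
    """Return the cumulative success percentage after each attempt."""
    rates = []
    successes = 0
    for attempt, ok in enumerate(outcomes, start=1):
        successes += bool(ok)
        rates.append(round(100.0 * successes / attempt, 2))
    return rates


def probe(url, attempts=10, interval=20.0):
    """Probe url repeatedly, printing one line per attempt like the log above."""
    outcomes = []
    for _ in range(attempts):
        status = head_status(url)
        outcomes.append(status is not None)
        rate = running_success_rates(outcomes)[-1]
        print(f"{time.strftime('%H:%M:%S')}: status={status} ({rate}%)")
        time.sleep(interval)
    return outcomes
```

Calling probe("http://www.washingtonpost.com/") would run ten attempts 20 seconds apart. Unlike cURL's error 56, this sketch does not distinguish a reset from other failures; it only counts whether a status line came back.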
I tested the HEAD request script on other machines on our network: no problems, all connections go through to all the sites on my test list. Then I set up a proxy on my personal desktop, and when I run the HEAD requests from the problematic server through it, all connections go through. So whatever the problem is, it's very specific to this server.
Next I tried to isolate which websites exhibit the connection-reset behavior:
- None of our intranet sites (192.168.x.x) drop connections.
- No IPv6 site I've tested drops connections. (We are dual-stack.)
- Only a small minority of internet IPv4 sites drop connections.
- Every site I've tested that uses Cloudflare as a CDN drops connections. (But the problem does not seem to be exclusive to Cloudflare sites.)
This angle wasn't developing into anything really helpful, so next I installed Wireshark to look at what was going on when a request failed. A failed HEAD request looks like this: (larger screenshot here: http://imgur.com/TNfRUtX )
127 48.709776000 192.168.1.142 192.33.31.56 TCP 66 52667 > http [SYN, ECN, CWR] Seq=0 Win=8192 Len=0 MSS=8960 WS=256 SACK_PERM=1
128 48.728207000 192.33.31.56 192.168.1.142 TCP 66 http > 52667 [SYN, ACK, ECN] Seq=0 Ack=1 Win=42340 Len=0 MSS=1460 SACK_PERM=1 WS=128
129 48.728255000 192.168.1.142 192.33.31.56 TCP 54 52667 > http [ACK] Seq=1 Ack=1 Win=65536 Len=0
130 48.739371000 192.168.1.142 192.33.31.56 HTTP 234 HEAD / HTTP/1.1
131 48.740917000 192.33.31.56 192.168.1.142 TCP 60 http > 52667 [RST] Seq=1 Win=0 Len=0
132 48.757766000 192.33.31.56 192.168.1.142 TCP 60 http > 52667 [ACK] Seq=1 Ack=181 Win=42240 Len=0
133 48.770314000 192.33.31.56 192.168.1.142 TCP 951 [TCP segment of a reassembled PDU]
134 48.807831000 192.33.31.56 192.168.1.142 TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
135 48.859592000 192.33.31.56 192.168.1.142 TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
138 49.400675000 192.33.31.56 192.168.1.142 TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
139 50.121655000 192.33.31.56 192.168.1.142 TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
141 51.564009000 192.33.31.56 192.168.1.142 TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
143 54.452561000 192.33.31.56 192.168.1.142 TCP 951 [TCP Retransmission] http > 52667 [PSH, ACK] Seq=1 Ack=181 Win=42240 Len=897
The way I'm reading this (correct me if I'm wrong, this isn't really my area) is that:
- We open a tcp connection to the webserver
- Webserver replies with a SYN/ACK, and we ACK it
- HTTP HEAD request is sent
- There is a RST packet, marked as from the webserver IP, that kills the connection.
- Webserver sends ACK
- Webserver tries to respond to the HEAD request with valid HTTP data (the 951-byte reply contains the correct HTTP header)
- Webserver retransmits (several times over several seconds) the valid HTTP response, but it cannot succeed since the connection has been RST
So if the webserver has sent a valid RST, why does it keep trying to fill the request? And if the webserver didn't generate the RST, what the heck did?
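One check that needs nothing beyond the timestamps already in the capture is to compare how quickly the RST arrived after the HEAD request with the round-trip time seen during the handshake. A reply that really came from the webserver should take roughly one round trip. The arithmetic can be sketched like this (the times are copied from frames 127 to 132 above; the variable names are just for illustration):

```python
# Timestamps in seconds, copied from the capture above.
syn_sent = 48.709776      # frame 127: our SYN leaves
synack_rcvd = 48.728207   # frame 128: SYN/ACK arrives
head_sent = 48.739371     # frame 130: HEAD request leaves
rst_rcvd = 48.740917      # frame 131: RST arrives
ack_rcvd = 48.757766      # frame 132: ACK of the request arrives


def to_ms(seconds):
    """Convert a time difference in seconds to milliseconds, rounded to 3 places."""
    return round(seconds * 1000.0, 3)


handshake_rtt_ms = to_ms(synack_rcvd - syn_sent)  # round trip seen in the handshake
rst_delay_ms = to_ms(rst_rcvd - head_sent)        # request sent -> RST seen
ack_delay_ms = to_ms(ack_rcvd - head_sent)        # request sent -> ACK seen

print(f"handshake RTT:   {handshake_rtt_ms} ms")
print(f"request -> RST:  {rst_delay_ms} ms")
print(f"request -> ACK:  {ack_delay_ms} ms")
```

Read this way, the handshake and the later ACK both take about 18.4 ms, while the RST shows up about 1.5 ms after the request leaves. That would be hard to explain if the RST had travelled from the webserver and back, and it hints at something much closer to (or on) this server, although timing alone isn't proof.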
Things I have tried that have had no effect:
- Disabling NIC teaming
- Changing out the network adaptor (replacement NIC was known to be working)
- Assigning a static IP.
- Disabling IPv6.
- Disabling jumbo frames.
- Plugging the server directly into our modem one night, bypassing our switches & router.
- Turning off Windows Firewall.
- Resetting TCP settings via netsh.
- Disabling practically every other service on the server. (We mostly use it as a fileserver, but there's Apache & a couple of DBs.)
- Banging head on desk (repeatedly)
I suspect something on the server is generating the RST packets, but for the life of me I can't find it. I feel like it would help a lot if I knew either why it's just this server or why it's only some websites. While I'm still curious, I'm increasingly inclined to nuke it from orbit & start over.
Ideas / Suggestions?
-Thanks