32

TL;DR version: Turns out this was a deep Broadcom networking bug in Windows Server 2008 R2. Replacing with Intel hardware fixed it. We don't use Broadcom hardware any more. Ever.

We have been using HAProxy along with heartbeat from the Linux-HA project. We are using two Linux instances to provide failover. Each server has its own public IP, plus a single IP that is shared between the two using a virtual interface (eth1:1) at IP: 69.59.196.211

The virtual interface (eth1:1) IP 69.59.196.211 is configured as the gateway for the windows servers behind them and we use ip_forwarding to route traffic.

We are experiencing an occasional network outage on one of our windows servers behind our linux gateways. HAProxy will detect the server is offline which we can verify by remoting to the failed server and attempting to ping the gateway:

Pinging 69.59.196.211 with 32 bytes of data:
Reply from 69.59.196.220: Destination host unreachable.

Running arp -a on this failed server shows that there is no entry for the gateway address (69.59.196.211):

Interface: 69.59.196.220 --- 0xa
Internet Address      Physical Address      Type
69.59.196.161         00-26-88-63-c7-80     dynamic
69.59.196.210         00-15-5d-0a-3e-0e     dynamic
69.59.196.212         00-21-5e-4d-45-c9     dynamic
69.59.196.213         00-15-5d-00-b2-0d     dynamic
69.59.196.215         00-21-5e-4d-61-1a     dynamic
69.59.196.217         00-21-5e-4d-2c-e8     dynamic
69.59.196.219         00-21-5e-4d-38-e5     dynamic
69.59.196.221         00-15-5d-00-b2-0d     dynamic
69.59.196.222         00-15-5d-0a-3e-09     dynamic
69.59.196.223         ff-ff-ff-ff-ff-ff     static
224.0.0.22            01-00-5e-00-00-16     static
224.0.0.252           01-00-5e-00-00-fc     static
225.0.0.1             01-00-5e-00-00-01     static

On our linux gateway instances arp -a shows:

peak-colo-196-220.peak.org (69.59.196.220) at <incomplete> on eth1
stackoverflow.com (69.59.196.212) at 00:21:5e:4d:45:c9 [ether] on eth1
peak-colo-196-215.peak.org (69.59.196.215) at 00:21:5e:4d:61:1a [ether] on eth1
peak-colo-196-219.peak.org (69.59.196.219) at 00:21:5e:4d:38:e5 [ether] on eth1
peak-colo-196-222.peak.org (69.59.196.222) at 00:15:5d:0a:3e:09 [ether] on eth1
peak-colo-196-209.peak.org (69.59.196.209) at 00:26:88:63:c7:80 [ether] on eth1
peak-colo-196-217.peak.org (69.59.196.217) at 00:21:5e:4d:2c:e8 [ether] on eth1

Why would arp occasionally set the entry for this failed server as <incomplete>? Should we be defining our arp entries statically? I've always left arp alone since it works 99% of the time, but in this one instance it appears to be failing. Are there any additional troubleshooting steps we can take to help resolve this issue?

THINGS WE HAVE TRIED

I added a static arp entry for testing on one of the linux gateways which still didn't help.

root@haproxy2:~# arp -a
peak-colo-196-215.peak.org (69.59.196.215) at 00:21:5e:4d:61:1a [ether] on eth1
peak-colo-196-221.peak.org (69.59.196.221) at 00:15:5d:00:b2:0d [ether] on eth1
stackoverflow.com (69.59.196.212) at 00:21:5e:4d:45:c9 [ether] on eth1
peak-colo-196-219.peak.org (69.59.196.219) at 00:21:5e:4d:38:e5 [ether] on eth1
peak-colo-196-209.peak.org (69.59.196.209) at 00:26:88:63:c7:80 [ether] on eth1
peak-colo-196-217.peak.org (69.59.196.217) at 00:21:5e:4d:2c:e8 [ether] on eth1
peak-colo-196-220.peak.org (69.59.196.220) at 00:21:5e:4d:30:8d [ether] PERM on eth1

root@haproxy2:~# arp -i eth1 -s 69.59.196.220 00:21:5e:4d:30:8d
root@haproxy2:~# ping 69.59.196.220
PING 69.59.196.220 (69.59.196.220) 56(84) bytes of data.
--- 69.59.196.220 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6006ms

Rebooting the windows web server solves this issue temporarily with no other changes to the network but our experience shows this issue will come back.

Swapping network cards and switches

I noticed the link light on the port of the switch for the failed windows server was running at 100Mb instead of 1Gb on the failed interface. I moved the cable to several other open ports and the link indicated 100Mb for each port that I tried. I also swapped the cable with the same result. I tried changing the properties of the network card in windows and the server locked up and required a hard reset after clicking apply. This windows server has two physical network interfaces so I have swapped the cables and network settings on the two interfaces to see if the problem follows the interface. If the public interface goes down again we will know that it is not an issue with the network card.

(We also tried another switch we have on hand, no change)

Changing network hardware driver versions

We've had the same problem with the latest Broadcom driver, as well as the built-in driver that ships in Windows Server 2008 R2.

Replacing network cables

As a last ditch effort we remembered another change that occurred was the replacement of all of the patch cords between our servers / switch. We had purchased two sets, one green of lengths 1ft - 3ft for the private interfaces and another set of red cables for the public interfaces. We swapped out all of the public interface patch cables with a different brand and ran our servers without issue for a full week ... aaaaaand then the problem recurred.

Disable checksum offload, remove TProxy

We also tried disabling TCP/IP checksum offload in the driver, no change. We're now pulling out TProxy and moving to a more traditional x-forwarded-for network arrangement without any fancy IP address rewriting. We'll see if that helps.

Switch Virtualization providers

On the off chance this was related to Hyper-V in some way (we do host Linux VMs on it), we switched to VMWare Server. No change.

Switch host model

We've reached the end of our troubleshooting rope and are now formally involving Microsoft support. They recommended changing the host model:

We did that, and we also got some unpublished kernel hotfixes which were presumably rolled into 2008 R2 SP1. No fix.

Replacing network card hardware

Ultimately, replacing the Broadcom network hardware with Intel network hardware fixed this issue for us. So I am inclined to think that the Broadcom Windows Server 2008 R2 drivers are at fault!

http://blog.serverfault.com/post/broadcom-die-mutha/

Greg Askew
  • 34,339
  • 3
  • 52
  • 81
Geoff Dalgas
  • 2,416
  • 5
  • 31
  • 32
  • also of note -- we also use TProxy (transparent proxy) to send back the actual IP of the traffic coming in through HAProxy. http://blog.loadbalancer.org/configure-haproxy-with-tproxy-kernel-for-full-transparent-proxy/ – Jeff Atwood Jan 20 '10 at 22:13
  • LUnix... heh heh... http://hld.c64.org/poldi/lunix/lunix.html – Evan Anderson Jan 20 '10 at 22:56
  • 2
    Never trust auto settings on a production environment. Set the speed to what it should be, and put a monitor on it to be sure. – Daniel C. Sobral Jan 21 '10 at 11:25
  • 3
    @Daniel Sobral: I have to heartily disagree with you. In 2003 I suppose I could see that. With modern hardware, hard-setting port speed and duplex is a recipe for getting speed / duplex mismatches. Autonegotiation on modern Ethernet gear works fine. – Evan Anderson Jan 21 '10 at 13:10
  • 1
    I stand with @Daniel Sobral, too many times I've had network failures caused by bad speed negotiations at the worst moment, so on production systems I go with static settings. When that happens, what does the link state on the switch say? It is managed, right? What does the Windows system say? I would bet on the network failing at link level, and that is what is causing those ARP incompletes (failed or waiting to receive ARP who-has). Bad hardware/driver could be a cause. Let's see how it goes after swapping. – Pablo Alsina Jan 21 '10 at 14:11
  • @Evan I suppose you could be right about newer hardware (not 2004, however :), but I have had trouble with auto settings, never with hard settings. Whenever I connect a server to a switch, or connect switches and routers, I know precisely the settings they should have. So, until I do face the opposite problem, I'll stand by my recommendation. – Daniel C. Sobral Jan 21 '10 at 16:43
  • Point of interest, Service Pack 1 has now been released. – tombull89 Feb 23 '11 at 22:54
  • Shouldn't the answer be posted as a proper answer, and not as a question edit? This way the question could be marked as "answered". – egarcia Apr 25 '11 at 11:47
  • But you're still using Windows Server? – Rudie Apr 26 '11 at 19:41
  • @Rudie : Was there an OS issue or why are you saying that? – Andrei Rînea May 01 '11 at 22:41
  • @Jeff - weak, but any chance of a copy of that MSFT patch? We're having this exact problem on the 3 new Dell R610's hosting all SSL for our site :| (I have Intel dualport NICs on order in the meantime..) –  Jun 12 '11 at 18:50
  • @gdh no OS patches work -- this is purely a broadcom driver issue AFAIK and if you have the latest broadcom drivers there is nothing else to be done. – Jeff Atwood Jun 12 '11 at 22:55
  • you know its funny that i don't see **What's Your question?**, **This question is too broad?**, **This question isn't productive**, or **why are you even using 2008 windows server**? You know typical response that you get along termination of the question in < 1s. – Muhammad Umer Feb 25 '15 at 03:20

9 Answers

7

From http://linux-ip.net/html/ether-arp.html:

If no ARP cache entry exists for a requested destination IP, the kernel will generate mcast_solicit ARP requests until receiving an answer. During this discovery period, the ARP cache entry will be listed in an incomplete state. If the lookup does not succeed after the specified number of ARP requests, the ARP cache entry will be listed in a failed state. If the lookup does succeed, the kernel enters the response into the ARP cache and resets the confirmation and update timers.

It looks like your Windows server is not responding (or responding too slowly) to ARP requests from your gateway box. Does that <incomplete> eventually switch to <failed>? What network hardware do you have between the server and the gateway? Is it possible broadcast ARP requests are being filtered or blocked somewhere between the two hosts?
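To watch for that state over time rather than catching it by hand, something like the following rough sketch could flag unresolved entries. It assumes the Linux `arp -a` line format shown in the question; the function name and the regular expression are mine, not part of any tool.

```python
import re

# Matches one line of Linux `arp -a` output as it appears in the question, e.g.
#   name (69.59.196.212) at 00:21:5e:4d:45:c9 [ether] on eth1
#   name (69.59.196.220) at <incomplete> on eth1
# The format is assumed from the pasted output; other arp versions may differ.
ARP_LINE = re.compile(
    r"^(?P<name>\S+) \((?P<ip>[\d.]+)\) at (?P<hw>\S+)"
    r"(?: \[\w+\])?(?: PERM)? on (?P<iface>\S+)$"
)


def unresolved_neighbours(arp_output):
    """Return the IPs whose ARP entry is still listed as <incomplete>."""
    pending = []
    for line in arp_output.splitlines():
        match = ARP_LINE.match(line.strip())
        if match and match.group("hw") == "<incomplete>":
            pending.append(match.group("ip"))
    return pending


# Two lines copied from the gateway output in the question.
sample = """\
peak-colo-196-220.peak.org (69.59.196.220) at <incomplete> on eth1
stackoverflow.com (69.59.196.212) at 00:21:5e:4d:45:c9 [ether] on eth1
"""

print(unresolved_neighbours(sample))
```

Running this from cron against the real `arp -a` output and logging the result with a timestamp would at least show when the entry goes bad, and whether it stays that way.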

5

It means that you pinged the address, the IP has a PTR record (hence the name) but nothing responded from the machine in question. When we see this it's most commonly due to a subnet mask being set incorrectly - or in the case of IPs bound to a loopback interface that were accidentally bound to the eth interface instead.

What is 196.220? What is its relationship with 196.211? I'm assuming that .220 is one of the HAProxy hosts. When you run ifconfig -a & arp -a on it what does it show?

Jeff Atwood
  • 12,994
  • 20
  • 74
  • 92
Max Clark
  • 51
  • 2
  • If it's happening intermittently, though, that tends to make me think that it's not an incorrectly set subnet mask (which, admittedly, is often the cause of machines failing to answer ARP requests). – Evan Anderson Jan 20 '10 at 22:23
  • The post seems fairly clear to me. The .211 IP address is a virtual IP shared by the HAProxy instances. The .220 IP address is assigned to a Windows machine that, periodically, loses its ability to communicate with the .211 IP address (as can be seen in the "Interface:" line of the ARP output quoted in the post). – Evan Anderson Jan 20 '10 at 22:43
  • 196.220 is the ip of the failed windows server - 196.211 is the virtual ip for the haproxy interfaces. – Geoff Dalgas Jan 20 '10 at 22:50
4

As Max Clark says, the <incomplete> just means that 69.59.196.211 has put out an ARP request for 69.59.196.220 and hasn't received a response yet. (In Windows-land you'll see this as an ARP mapping to "00-00-00-00-00-00"... It seems odd to me, BTW, that you're not seeing such an ARP mapping on 69.59.196.220 for 69.59.196.211.)

I tend not to like to use static ARP entries because, in my experience, ARP has generally done its job all the time.

If it were me, I'd sniff the appropriate Ethernet interface on the "failing" Windows machine (69.59.196.220) to observe it ARP'ing for 69.59.196.211, and to observe how / if it's responding to ARP requests from 69.59.196.211. I'd also consider sniffing on the gateway machine for ARP only (tcpdump -i interface-name arp) to see what the ARP traffic looks like from the side of the Linux machine.
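Once you have that tcpdump text output, a small script can tell you which ARP requests never got an answer. This is only a sketch: it assumes the usual "ARP, Request who-has X tell Y" and "ARP, Reply X is-at MAC" line shapes, which may vary between tcpdump versions, and the timestamps in the sample lines are made up for illustration.

```python
import re

# Assumed tcpdump line shapes; adjust the patterns if your version prints
# something different.
REQUEST = re.compile(r"ARP, Request who-has (?P<target>[\d.]+).* tell (?P<sender>[\d.]+)")
REPLY = re.compile(r"ARP, Reply (?P<ip>[\d.]+) is-at (?P<mac>[0-9a-fA-F:]{17})")


def unanswered_targets(capture_lines):
    """Return {target_ip: request_count} for IPs that were asked for
    but never produced a reply anywhere in the capture."""
    asked = {}
    answered = set()
    for line in capture_lines:
        request = REQUEST.search(line)
        if request:
            target = request.group("target")
            asked[target] = asked.get(target, 0) + 1
            continue
        reply = REPLY.search(line)
        if reply:
            answered.add(reply.group("ip"))
    return {ip: count for ip, count in asked.items() if ip not in answered}


# Hypothetical capture lines (timestamps invented; addresses from the question).
lines = [
    "12:00:01.000000 ARP, Request who-has 69.59.196.220 tell 69.59.196.211, length 28",
    "12:00:02.000000 ARP, Request who-has 69.59.196.220 tell 69.59.196.211, length 28",
    "12:00:02.500000 ARP, Request who-has 69.59.196.212 tell 69.59.196.211, length 28",
    "12:00:02.500400 ARP, Reply 69.59.196.212 is-at 00:21:5e:4d:45:c9, length 46",
]

print(unanswered_targets(lines))
```

If .220 shows up here on the gateway capture while the Windows-side capture shows the requests arriving, the replies are getting lost on the way back; if the requests never arrive at all, look at the path in the other direction.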

I know, from the blog, that you've got a back-end network and a front-end network. During these outages, does the "failing" Windows server (69.59.196.220) have any problems communicating to other machines in the front-end network, or is it just having problems talking to its gateway? I'm curious if you're coming at the failing machine thru the front-end or back-end network when you're catching it in the act.

What are you doing to "resolve" the issue when it occurs?

Edit:

I see from your update that you're rebooting the "failing" Windows machine to resolve the issue. Before you do that next time, can you verify that the Windows machine is able to "talk" on its front-end interface at all? Also, grab a copy of the routing table from the Windows machine (route print) during a failure, too. (I'm trying to ascertain if the NIC / driver is going bonkers on the Windows machine, basically.)

Evan Anderson
  • 141,071
  • 19
  • 191
  • 328
  • When this issue occurs we can reboot the failed web server (196.220) and it will work - our experience has shown that within 24 hours it will fail again. – Geoff Dalgas Jan 20 '10 at 22:52
  • 1
    It would be interesting to know if the server was able to talk, at all, on the NIC attached to the segment with the .211 machine (which, I understand from your update, is now swapped with the back-end segment). My gut says "bonkers NIC" is going to be the root cause on this one, but we'll see... – Evan Anderson Jan 21 '10 at 13:31
  • 1
    When this happens, the machine definitely cannot talk on the front end (public) NIC *at all*. The back end (private) NIC is unaffected. I have always felt it was the NIC driver going bonkers, but the question is "why"? (also: this happens with the latest broadcom driver as well as the default Win2k8 R2 driver) I'm going to check the event logs after it reboots, which takes 10+ minutes as it has to eventually bluescreen as part of the shutdown first. I cleared them beforehand. – Jeff Atwood Jan 27 '10 at 21:10
  • we are now involving Microsoft support as we honestly believe this is an OS level issue. We've done *every possible bit of troubleshooting* we possibly can and ruled out.. well, everything. – Jeff Atwood Apr 22 '10 at 01:54
  • Zow. I'd love to hear how it turns out. – Evan Anderson Apr 23 '10 at 00:56
  • @evan see post update. Indeed, I called it: OS bug. – Jeff Atwood Jun 11 '10 at 08:16
  • @Jeff: Thanks for the update! I'm glad that I'm not seeing the misbehavior at any of my Customer sites, but now that I'm aware of it it'll almost certainly happen somewhere almost immediately! – Evan Anderson Jun 11 '10 at 12:44
2

This document shows the different states (table 2.1). Incomplete would mean that it has sent a first ARP request (presumably after a stale, delay, probe) but hasn't yet received a response.

Cade Roux
  • 375
  • 2
  • 5
  • 18
2

The reason the static ARP on the haproxy node doesn't help is that your web server still can't figure out how to get back to the gateway.

Static ARP on the web server breaks the ability for your web servers to switch gateways when one of the haproxy nodes fails. I'm guessing the virtual interface shares the same MAC address as the haproxy node's eth1, so you'd have to hard-code one of the two gateways into each web server.

Do you have any kind of security software installed on the failing web server? I spent a long night with a Windows 2008 server that had Symantec Endpoint Security on it -- it installs some filtering code in the networking stack that prevented it from seeing the gateway's ARP packets at all. The fix for that (as provided by Microsoft) was to remove the registry entry that loaded the DLL.

The other time this problem occurred, removing the whole network adapter from device manager and reinstalling seemed to help.

jaredg
  • 221
  • 1
  • 2
2

Since you've statically set your arp entry, your servers know where to find the gateway. However, if your switch doesn't know where the gateway is, it won't forward your packets.

Sounds like you've got a bad (or confused) switch between your HAProxy nodes and your web servers. Reboot it.

Either that, or your HAProxy servers disagree about which one is in control, and both are answering ARP lookups for .211.

Along the same lines, if your switch is overloaded, your HAproxies might be unable to communicate with each other fast enough, and are failing over.

Seth
  • 646
  • 2
  • 6
  • 17
1

The next time this problem occurs, I would suggest running some packet captures on the two hosts in question, to determine what ARP traffic each of them is observing.

Your HAproxy machine will most likely have some flavour of tcpdump installed. For the Windows machine you will either need a WinPCAP application, like Wireshark, or Microsoft Network Monitor.

In fact, thinking about it, as the problem appears to be with ARP specifically, you could potentially just continuously record all ARP traffic on the HAproxy machine and the Windows machine in question, with a rolling capture file of (for argument's sake) 10MB. That should be large enough such that by the time you've detected a failure, the capture file will still contain the ARP traffic from before the failure. (It's worth experimenting by running the capture for an hour or so, to see how much data it generates).

Example capture syntax for Linux tcpdump (note, I don't have a Linux box handy to test this on; please test the behaviour of -C and -W before using in production!):

tcpdump -C 10 -i eth1 -w /var/tmp/arp.cap -W 1 arp

This should hopefully give you some indication of what precisely is failing. When an ARP entry expires (and according to this article, newer versions of Windows appear to age out 'inactive' entries very aggressively), I would expect the following to happen:

  1. The source host will send an ARP request to the target host. ARP requests are generally broadcast, but in the case where a host is refreshing an existing entry, the ARP may be sent unicast.
  2. The target host will respond with an ARP reply. 99% of the time this will be unicast, but the RFC permits broadcast responses. (See also the RFC regarding IPv4 Address Collision Detection for more detail).

Simple as it sounds, there are a bunch of other things that may interfere with this process:

  • The original request may not be arriving at the target.
  • The request may be arriving at the target, but the response may not be reaching the source.
  • Some sort of high availability mechanism may be interfering with the 'normal' behaviour of ARP:
    • How does failover between the HAProxy nodes work? Does it use a shared MAC address, or does it use gratuitous ARP to fail an IP address over between nodes?
    • A lot of the MAC addresses in the ARP tables above begin with 00-15-5D, which is apparently registered to Microsoft. Are you using any form of clustering or other HA on the Windows machine in question? Are these 00-15-5D MAC addresses the same ones you see associated with the hardware NICs when you do an 'ipconfig /all' on the Windows server?
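For the MAC prefix question, here is a quick sketch that picks out the entries in the Windows ARP table from the question starting with 00-15-5D, and also lists any MAC that is mapped to more than one IP. The helper names are my own; the table data is copied from the dynamic entries in the question.

```python
def normalise_mac(mac):
    """Lower-case a MAC and use ':' separators (Windows prints '-')."""
    return mac.strip().lower().replace("-", ":")


def has_prefix(mac, prefix):
    """True if the MAC starts with the given vendor prefix, in either notation."""
    return normalise_mac(mac).startswith(normalise_mac(prefix))


def shared_macs(table):
    """Return {mac: [ips]} for every MAC that appears under more than one IP."""
    groups = {}
    for ip, mac in table.items():
        groups.setdefault(normalise_mac(mac), []).append(ip)
    return {mac: ips for mac, ips in groups.items() if len(ips) > 1}


# Dynamic entries from the `arp -a` output on the failed Windows server.
windows_arp = {
    "69.59.196.161": "00-26-88-63-c7-80",
    "69.59.196.210": "00-15-5d-0a-3e-0e",
    "69.59.196.212": "00-21-5e-4d-45-c9",
    "69.59.196.213": "00-15-5d-00-b2-0d",
    "69.59.196.215": "00-21-5e-4d-61-1a",
    "69.59.196.217": "00-21-5e-4d-2c-e8",
    "69.59.196.219": "00-21-5e-4d-38-e5",
    "69.59.196.221": "00-15-5d-00-b2-0d",
    "69.59.196.222": "00-15-5d-0a-3e-09",
}

microsoft_prefixed = [ip for ip, mac in windows_arp.items() if has_prefix(mac, "00-15-5D")]
print(microsoft_prefixed)
print(shared_macs(windows_arp))
```

Comparing that list against the physical addresses reported by 'ipconfig /all' on each box should show which of these belong to real NICs and which to something virtual.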

Things to check if/when this happens again:

  • Look at the packet captures of ARP traffic; has any part of the conversation obviously not occurred?
  • Check the switch's bridging/CAM tables; do all the MAC addresses in question map to the ports you expect them to?
  • Do other hosts on the subnet have valid ARP entries for the IP addresses of both the Windows and HAProxy hosts?
  • Do ARP entries for the same target IP on multiple different source machines resolve to the same MAC address? i.e. log on to a couple of other hosts on the subnet and verify that 196.211 resolves to the same MAC address on both.
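That last check is tedious to do by eye, so here is a minimal sketch of comparing ARP tables collected from several hosts. The host names and MAC addresses in the usage example are placeholders I made up, since the question does not show the MAC for 196.211; feed it whatever you actually collect.

```python
from collections import defaultdict


def normalise_mac(mac):
    """Lower-case a MAC and use ':' separators so Windows and Linux output compare equal."""
    return mac.strip().lower().replace("-", ":")


def conflicting_entries(tables):
    """tables maps host name -> {ip: mac} as seen from that host.

    Returns {ip: {mac: [hosts]}} for every IP that resolves to more than
    one MAC address across the hosts.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for host, table in tables.items():
        for ip, mac in table.items():
            seen[ip][normalise_mac(mac)].append(host)
    return {
        ip: {mac: sorted(hosts) for mac, hosts in macs.items()}
        for ip, macs in seen.items()
        if len(macs) > 1
    }


# Hypothetical data: two hosts disagreeing about the gateway's MAC.
views = {
    "web1": {"69.59.196.211": "02-00-00-00-00-01"},
    "web2": {"69.59.196.211": "02:00:00:00:00:02"},
}

print(conflicting_entries(views))
```

An empty result means every host agrees; anything else points at duplicate IPs, a failover fight between the HAProxy nodes, or stale entries on one of the machines.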
Murali Suriar
  • 10,166
  • 8
  • 40
  • 62
  • we are definitely looking at packet captures now – Jeff Atwood Jan 28 '10 at 20:15
  • unfortunately the packet captures didn't show us anything obvious, and the machine we captured on has sensitive network traffic.. so we can't give it to experts to look at. – Jeff Atwood Mar 12 '10 at 11:17
  • @Jeff: could you provide captures showing only the ARP traffic? I'd be interested to see the ARP behaviour if nothing else. – Murali Suriar Mar 12 '10 at 15:56
  • we followed MSFT support's directions on whatever data they want captured -- it took a few weeks, but eventually they found a private kernel networking hotfix for us. – Jeff Atwood Jun 11 '10 at 08:21
0

We had a similar issue with one of our 2008 R2 terminal servers where all traffic on the NIC would stop but stay connected, and the NIC LEDs would show comms. This was an ongoing issue that kept cropping up 2-3 times a week, but only after around 12-13 hours uptime (server is rebooted nightly).

I found Seriousbit Netbalancer was the cause, after I tried (out of curiosity) terminating the NetbalancerService service. Traffic then started moving across the interface. I've since uninstalled Netbalancer.

0

I had the same problem with the onboard LAN on an Asus mainboard. It was fixed by installing the latest driver from the Realtek website.

M-Razavi
  • 111
  • 4