Random and Selective ARP blindness in VMWare ESXi 4.1

Question

We have multiple VMware ESX servers spread out amongst our company, doing various tasks. One particular ESXi host is exhibiting very peculiar behavior. We detect it when our monitoring system (Orion) notifies us that it can no longer ping one of the host's guests.

Upon jumping on the local console of the guest in question, we see that it cannot ping any address that isn't already in its ARP table.

At first we thought the problem was specific to one of our guests, since it always seemed to happen to the same one, DevRedis. However, this afternoon the problem swapped and started happening on ApacheBox rather than DevRedis.

When I have been fortunate enough to catch the problem, I have run tcpdump on both sides of the connection (one side being the VMware guest, the other a physical webserver) and have noticed the following course of events:

1. Guest ApacheBox sends an ARP request for the MAC address of server WindowsBeast.
2. WindowsBeast sends an ARP is-at reply back to the network indicating its MAC address.
3. ApacheBox never sees the ARP is-at response.
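
The comparison of the two captures described above can be automated. The following is a minimal sketch, not what was actually run: it assumes tcpdump's text output uses the `ARP, Request who-has ... tell ...` and `ARP, Reply ... is-at ...` line format (older tcpdump builds print ARP differently), and the IP and MAC addresses in the sample data are made up for illustration.

```python
import re

# Assumed tcpdump text format; adjust the patterns if your build prints ARP differently.
REQUEST_RE = re.compile(r"ARP, Request who-has (\S+)(?: \(\S+\))? tell ([^,\s]+)")
REPLY_RE = re.compile(r"ARP, Reply (\S+) is-at ([0-9a-fA-F:]{17})")


def parse_arp(lines):
    """Return (set of requested IPs, dict of IP -> MAC from replies) seen in a capture."""
    requests, replies = set(), {}
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            requests.add(match.group(1))
            continue
        match = REPLY_RE.search(line)
        if match:
            replies[match.group(1)] = match.group(2).lower()
    return requests, replies


def lost_replies(guest_lines, server_lines):
    """Replies the server sent for IPs the guest asked about, but the guest never captured."""
    guest_requests, guest_replies = parse_arp(guest_lines)
    _, server_replies = parse_arp(server_lines)
    return {
        ip: mac
        for ip, mac in server_replies.items()
        if ip in guest_requests and ip not in guest_replies
    }


# Hypothetical sample data: 10.0.0.10 stands in for ApacheBox, 10.0.0.20 for WindowsBeast.
GUEST_CAPTURE = [
    "13:04:11.100000 ARP, Request who-has 10.0.0.20 tell 10.0.0.10, length 28",
]
SERVER_CAPTURE = [
    "13:04:11.100200 ARP, Request who-has 10.0.0.20 tell 10.0.0.10, length 46",
    "13:04:11.100250 ARP, Reply 10.0.0.20 is-at 00:1b:21:aa:bb:cc, length 28",
]

if __name__ == "__main__":
    for ip, mac in lost_replies(GUEST_CAPTURE, SERVER_CAPTURE).items():
        print(f"reply for {ip} ({mac}) left the server but never reached the guest")
```

Feeding it the output of `tcpdump -n arp` saved from each side should list exactly the replies that are being dropped somewhere between the physical server and the guest.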

The ESX host in question is running VMware ESXi 4.1.0, build 348481.

The two guests (DevRedis and ApacheBox) are both running CentOS 6.3, but on two different kernel versions (2.6.32-279.9.1.el6.x86_64 and 2.6.32-279.el6.x86_64), so I'm not convinced it's a CentOS problem.

Does anyone have any thoughts on what might cause this? Has anyone run into it before?

Anything of interest at the networking side? VLANs? Switch setups? Do you have something like dynamic VLAN memberships? — the-wabbit, Oct 09 '12 at 20:56
No dynamic VLANs, and we're not making any changes to the boxes when it happens; it just seems to happen out of nowhere. Something must be triggering it, but we've not figured out what it might be. — Peter Grace, Oct 09 '12 at 21:09
I would strongly suspect the switching layer. I have seen similar intermittent connectivity issues in setups where VLAN memberships have been incorrectly defined and the switch was set up to accept dynamic VLAN memberships. If you get the chance, try checking whether you can still ARP-resolve hosts on the same physical machine (i.e. just needing the VMware virtual switch) for further isolation of the issue. — the-wabbit, Oct 09 '12 at 22:06
What does the networking look like at the ESXi layer - how many uplinks does the host have for that vSwitch, what kind of load balancing? — Shane Madden, Oct 10 '12 at 00:04
@syneticon-dj This is a possibility, however the fact that I'm seeing the ARP request on the physical box leads me to believe the problem is not in the switching layer. — Peter Grace, Oct 10 '12 at 13:04
@ShaneMadden Four active gigabit uplinks, not LACP'd. I also tried doing active/passive on this vswitch, but it did not appear to affect this particular problem, and given that there are 4 other guests on this same vswitch, I'm wondering if the problem is somewhere other than the vSwitch. — Peter Grace, Oct 10 '12 at 13:08
@PeterGrace you might have a situation where the ARP *response* is not correctly forwarded for whatever reason, be it because of incorrect VLAN tables or because of incorrect forwarding DB tables (network loops can cause this, for example). — the-wabbit, Oct 10 '12 at 15:13
@PeterGrace Check if the physical switch(es) involved have the VM's address in their MAC address tables (in the right vlan) when the ARP response is not delivered to the VM correctly? Let's find out whether the switches just have no idea where the unicast ARP response should go, or if they think they do and are sending it to the wrong place, or if they're sending it to the right place and something else is up. Also, is there anything that might be toying with the MAC address in that VM, like VRRP? And is the vSwitch set to allow MAC address changes in its security tab? — Shane Madden, Oct 10 '12 at 15:39
Just as quickly as the problem surfaced, it's now disappeared again. If it pops back up we'll do some more detailed analysis. — Peter Grace, Oct 16 '12 at 18:36

score 1 · Answer 1 · answered Oct 15 '12 at 20:56

This sounds like you just might have a MAC address collision on your hands. The fact that the problem swaps between the two boxes is what suggests it to me. Something in the vSwitch layer may be forwarding packets incorrectly.
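
One way to test the collision theory is to gather the MAC address of every virtual NIC on the host and look for duplicates. The sketch below is only an illustration: the inventory list is hypothetical (you would fill it from wherever you record or export your guests' NIC settings), and nothing in the question shows that these guests actually share an address.

```python
from collections import defaultdict


def find_mac_collisions(inventory):
    """Group (guest name, MAC) pairs by normalized MAC and return MACs used more than once."""
    by_mac = defaultdict(list)
    for name, mac in inventory:
        # Normalize case and separator so 00-50-56-AA-BB-01 equals 00:50:56:aa:bb:01.
        normalized = mac.lower().replace("-", ":")
        by_mac[normalized].append(name)
    return {mac: names for mac, names in by_mac.items() if len(names) > 1}


# Hypothetical inventory; the MAC addresses are invented for illustration.
INVENTORY = [
    ("DevRedis", "00:50:56:AA:BB:01"),
    ("ApacheBox", "00-50-56-aa-bb-01"),
    ("OtherGuest", "00:50:56:aa:bb:02"),
]

if __name__ == "__main__":
    for mac, names in find_mac_collisions(INVENTORY).items():
        print(f"{mac} is shared by: {', '.join(names)}")
```

If two guests do turn up with the same MAC, the vSwitch and the physical switches can only deliver a unicast ARP reply to one of them, which would match the swapping behavior.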

sysadmin1138
