Windows Incorrect MAC Address Populating ARP Table

Question

Recently I have been running into issues where incorrect MAC addresses are populating the ARP tables of some Windows VMs running in a cloud provider's cloud. For example, if I ping 10.1.2.3, some Windows VMs show a different MAC address for it than the majority of other VMs do. As a result, these few Windows VMs cannot reach 10.1.2.3, but the rest of the VMs (both Windows and Linux) can.
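To make the comparison above repeatable, here is a minimal sketch that assumes the text output of `arp -a` has already been collected from each Windows VM. It parses each table, finds the MAC address that most VMs agree on for a given IP, and lists the VMs that disagree. The VM names, the sample MAC addresses, and the helper names `parse_arp` and `find_outliers` are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches data rows of Windows `arp -a` output, e.g.
#   "  10.1.2.3              0a-11-22-33-44-55     dynamic"
ARP_LINE = re.compile(
    r"^[ \t]*(\d{1,3}(?:\.\d{1,3}){3})[ \t]+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})[ \t]+\w+",
    re.MULTILINE,
)


def parse_arp(output):
    """Return {ip: mac} parsed from the text of Windows `arp -a`."""
    return {ip: mac.lower() for ip, mac in ARP_LINE.findall(output)}


def find_outliers(tables, target_ip):
    """Given {vm_name: {ip: mac}}, return (majority_mac, {vm_name: mac})
    where the dict holds only the VMs whose entry differs from the majority."""
    seen = {vm: t[target_ip] for vm, t in tables.items() if target_ip in t}
    if not seen:
        return None, {}
    majority_mac, _ = Counter(seen.values()).most_common(1)[0]
    outliers = {vm: mac for vm, mac in seen.items() if mac != majority_mac}
    return majority_mac, outliers


# Hypothetical sample output; real tables would come from each VM.
SAMPLE_GOOD = """
Interface: 10.1.2.10 --- 0x5
  Internet Address      Physical Address      Type
  10.1.2.1              0a-11-22-33-44-01     dynamic
  10.1.2.3              0a-11-22-33-44-55     dynamic
"""

SAMPLE_BAD = """
Interface: 10.1.2.12 --- 0x7
  Internet Address      Physical Address      Type
  10.1.2.1              0a-11-22-33-44-01     dynamic
  10.1.2.3              02-01-0a-01-02-03     dynamic
"""

tables = {
    "vm-a": parse_arp(SAMPLE_GOOD),
    "vm-b": parse_arp(SAMPLE_GOOD),
    "vm-c": parse_arp(SAMPLE_BAD),
}
majority, outliers = find_outliers(tables, "10.1.2.3")
print("majority MAC:", majority)
print("VMs that disagree:", outliers)
```

This only shows which VMs hold the odd entry. It says nothing about where the entry came from, which still needs a packet capture.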

After running packet captures, the source of the incorrect MAC addresses seems to be MS-NLB-PhysServer-XX_, which is included in Wireshark's published list. I am not running any sort of MS-NLB, though, so it is very confusing what that source is. My cloud provider says that it is not coming from them. My questions are:

Is there a good way to identify the source device based on its MAC address if I do not own that device? I.e., I am wondering if it is coming from our cloud provider's load balancers.
What are reasons this source device would be sending incorrect MAC addresses to other devices? I.e., why does it have the wrong MAC address for 10.1.2.3 and for other newly created network interfaces?
What are reasons only a subset of the VMs get the bad MAC addresses from this source, while other VMs in the same subnet get good MAC addresses from other sources?
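On the first question, one check that does not require owning the device is whether the MAC address is vendor-assigned at all. In an Ethernet MAC address, bit 0x02 of the first octet marks the address as locally administered (set by software rather than burned in by a manufacturer), and bit 0x01 marks it as multicast. If the locally administered bit is set, a vendor (OUI) lookup cannot point to a hardware manufacturer. A minimal sketch follows; the function name `mac_flags` and the `02-...` sample address are hypothetical.

```python
def mac_flags(mac):
    """Return (is_locally_administered, is_multicast) for a MAC address
    written with '-' or ':' separators."""
    first_octet = int(mac.replace(":", "-").split("-")[0], 16)
    return bool(first_octet & 0x02), bool(first_octet & 0x01)


for mac in ("02-01-0a-01-02-03", "00:1a:2b:3c:4d:5e"):
    local, multicast = mac_flags(mac)
    print(mac, "locally administered" if local else "vendor-assigned",
          "multicast" if multicast else "unicast")
```

A locally administered address suggests the entry was generated by software (a virtual switch, a load-balancing feature, a container network), not that a particular vendor's hardware is involved.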

Which cloud provider? Some of them do *very* strange things with networking. In Azure, for example, all ARP table entries always point to `12-34-56-78-9a-bc`, because all Azure networking is handled at the IP layer and ARP doesn't exist or work at all. — Massimo, Dec 29 '20 at 22:17
I don't know AWS in the slightest, can't help you here, sorry. But maybe someone with more AWS experience can understand what's going on; I've added the `amazon-web-services` and `networking` tags to your question. — Massimo, Dec 29 '20 at 22:19

score 2 · Accepted Answer · answered Apr 23 '21 at 23:39

We ran into this too; it happens on our EKS Windows nodes after they reboot. Our nodes join a domain for gMSA, which requires a reboot, so these instances saw the issue immediately.

I opened a support ticket, and the workaround they provided is to have a shutdown script run the following:

powershell.exe /c "get-hnsendpoint | remove-hnsendpoint"
exit

The `exit` was said to be important because it prevents shutdown from hanging for a period of time.

I used this answer as the basis for automating this process: https://stackoverflow.com/a/47709154

Wow yes this is exactly what happened. I am bound under some nebulous NDA under Enterprise Support so I don’t think I’m allowed to elaborate. I hope they eventually announce this issue in their EKS Windows documentation. — Howard_Roark, Apr 24 '21 at 03:16

score 0 · Answer 2 · answered Dec 30 '20 at 08:14

If you don't own the other device, I assume that is because it sits on a wholly different network. In that case you won't see its MAC address; you will see the MAC address of the device closest to you that routes traffic toward that final device.

Remember that end-to-end communication doesn't happen at the data-link layer (i.e., Layer 2).

The most likely scenario here might be that routing is set up incorrectly at the OS level on some of your VMs rather than in your cloud provider's network route table, or that the VMs are on different networks (different AWS subnets, perhaps) with different route tables.

Oscar De León

By not owning the device, I mean that I only control the abstractions AWS provides (i.e., their VPC abstraction, their EC2 abstraction, etc.), and I have no idea what devices are actually there behind the scenes. All of the VMs are built from an identical out-of-the-box AWS image (AMI), and they are brought up in an identical way (i.e., via Terraform) in the same subnet. So it's not like I'm doing anything random on some but not others. They're all identical. – Howard_Roark Dec 30 '20 at 15:45

score 0 · Answer 3 · answered Apr 24 '21 at 00:15

Adding this as an answer too.

Several cloud providers have a quite peculiar view of networking, and handle it in a way that a networking professional would find just outrageous; however, this is the way they work, and we have to just deal with it.

In Azure, MAC addresses just don't make any sense; all ARP table entries always point to 12-34-56-78-9a-bc, because all Azure networking is handled at the IP layer and ARP doesn't exist or work at all; an Azure VM can't simply yell "I have this IP address" (AKA "gratuitous ARP"), because the Azure platform needs to know about it in order to route traffic to that VM. Azure clustering works in such a strange way that you have to put a very unusual load balancer in front of your cluster.

I honestly don't know how this works in AWS, but I guess it's as strange if not worse.
