
After upgrading our development team from Windows 7 to Windows 10, we are seeing an ARP cache problem: a machine has the correct IP-MAC mapping cached, but the entry's type is invalid because connections failed while the target machine was being power cycled. On Windows 10, but not on Windows 7, the target machine cannot be reached until the ARP cache is cleared. I can reproduce the issue as follows, where 10.10.10.10 is the correct IP address and 01:23:45:67:89:AB is the correct MAC address of the target machine:

  1. Start with the target machine powered off, and ping it continuously for the entire test:

    ping -t 10.10.10.10
    

Ping replies with "Request timed out" and the ARP cache contains, as expected

    10.10.10.10    00:00:00:00:00:00    invalid

  2. Power on the target machine. Ping starts getting replies, and the ARP cache updates to

    10.10.10.10    01:23:45:67:89:AB    dynamic
    

So far so good.

  3. Power off the target machine. Ping starts reporting "Request timed out" and the ARP cache remains

    10.10.10.10    01:23:45:67:89:AB    dynamic
    
  4. After about 40 seconds, ping replies with "Destination host unreachable" for one request, then returns to reporting "Request timed out", and the ARP cache changes to

    10.10.10.10    01:23:45:67:89:AB    invalid
    
  5. Power on the target machine. Ping (and any other connection) will not find it until you clear the ARP cache, or at least delete the offending entry, which has the correct IP-MAC mapping but the invalid type.

How do I prevent the ARP cache from getting into this state, given that the target machine in the development environment does tend to require power cycling during the development process? Manually manipulating the ARP cache is not sustainable, and nobody reported this issue before moving to Windows 10.

Windows 7 behaves as we would expect and want. The ARP cache goes through the same stages as above, but ping differs: it replies "Destination host unreachable" before the target is powered on (Windows 10 reports "Request timed out"), and it keeps replying "Destination host unreachable" after the target is powered off (Windows 10 reports that only once). When the machine is powered on again, the connection is established immediately and the ARP cache returns to

    10.10.10.10    01:23:45:67:89:AB    dynamic

without any need to clear any entries first.

The developers' specific setup is a Windows workstation connected to several Beaglebone Blacks (small ARM-based embedded boards running Linux) through a simple unmanaged 8-port gigabit switch. IP addresses are assigned by reserved DHCP, and the addresses are picked up successfully each time the Beaglebones are powered on. When one Windows 10 machine has the invalid ARP entry that needs deleting, other machines that do not have the Beaglebone in their ARP cache can still connect to the target machine.

Adam

1 Answer


Still the same after all these years. Very irritating when doing embedded development.

At least I found a few semi-solutions.

Win10 immediately drops all cached ARP entries for an interface when its link goes down. If powering down the embedded device drops the link, the cache is therefore flushed. This is not a problem.

The problem comes when Windows attempts to ARP the device before it is able to respond. This sets the ARP cache up for failure ("incomplete"). Even if the device then comes online and would answer ARP requests, no such requests are made for a while. The situation requires either another hard link drop and re-ARP, which amounts to an ARP cache flush, or a wait of something like a minute before Win10 re-ARPs.
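If you end up clearing the stuck entry by hand, that step can at least be scripted. The following is only a rough sketch, not a fix: it assumes the table comes from "arp -a" in the usual three-column layout (address, MAC, type), and it treats an entry as stuck when its type is "invalid" but it still holds a non-zero MAC, as in the question. The function names are made up for illustration.

```python
import ipaddress


def find_stale_entries(arp_output):
    """Return IPv4 addresses whose entry is 'invalid' but still holds a real MAC.

    Assumes each entry line has exactly three whitespace-separated columns:
    address, MAC, type.
    """
    stale = []
    for line in arp_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # header, blank line, or "Interface:" line
        ip, mac, entry_type = fields
        try:
            ipaddress.IPv4Address(ip)
        except ValueError:
            continue  # first column is not an IPv4 address
        # Strip separators so both 01-23-... and 01:23:... are accepted.
        mac_digits = mac.replace("-", "").replace(":", "").lower()
        has_real_mac = bool(mac_digits.strip("0"))
        if entry_type.lower() == "invalid" and has_real_mac:
            stale.append(ip)
    return stale


def delete_commands(ips):
    """Build one 'arp -d <ip>' argument list per stale address."""
    return [["arp", "-d", ip] for ip in ips]
```

On the Windows PC you would feed it the text returned by subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout and run the resulting commands from an elevated prompt. It only automates the manual delete; it does not stop the entry from going bad in the first place.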

A simple but usually irritating solution is to add a static ARP entry. This is irritating because a) you need to know this in advance and b) you need to be admin on the PC to do it.
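For what it's worth, here is a minimal sketch of how the static entry could be generated per board, assuming you go through "netsh interface ipv4 add neighbors" (plain "arp -s" is the older alternative). The helper name is made up, and "Ethernet" in the usage note is a placeholder for whatever your interface is called. It only builds the command; running it still needs an elevated prompt.

```python
import ipaddress
import re


def static_neighbor_command(interface, ip, mac):
    """Build a netsh command that pins ip to mac on the given interface."""
    ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    digits = re.sub(r"[-:.]", "", mac).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    # netsh takes the link-layer address as dash-separated byte pairs.
    dashed = "-".join(digits[i:i + 2] for i in range(0, 12, 2))
    return ["netsh", "interface", "ipv4", "add", "neighbors",
            interface, ip, dashed]
```

For the addresses in the question, static_neighbor_command("Ethernet", "10.10.10.10", "01:23:45:67:89:AB") gives an argument list you can pass to subprocess.run on the Windows PC. You still have to know each board's MAC in advance, so drawback a) stays.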

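Another fix is to put an Ethernet switch in the line so that the Win10 PC never sees a link drop, and to disable Neighbor Unreachability Detection (NUD) on the interface (as admin). The command needs the interface name or index; "Ethernet" below is a placeholder, and "netsh interface ipv4 show interfaces" lists the real ones:

    netsh interface ipv4 set interface "Ethernet" nud=disabled store=persistent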

Now a failing attempt to reach the booting device won't drop the cache down to "Incomplete".

Neither of these solutions is satisfactory, but they do get you a bit further along.

Hoppie
