Hoping someone here might have some insight into the issue we are facing. Cisco TAC is currently looking at the case, but they are struggling to find the root cause.
Although the title mentions ARP broadcasts and high CPU usage, we are unsure at this stage whether the two are related.
The original issue has been posted on the INE Online Community.
We have stripped the network down to a single-link, no-redundancy setup; think of it as a star topology.
Facts:
- We use Catalyst 3750-X switches, four in one stack, running IOS 15.0(1)SE3. Cisco TAC confirms there are no known high-CPU or ARP bugs for this particular version.
- No hubs or unmanaged switches connected.
- The core stack has been reloaded.
- We don't have a default route ("ip route 0.0.0.0 0.0.0.0 f1/0"); we use OSPF for routing.
- We see a large number of broadcast packets on VLAN 1, which is used for desktop devices. We use 192.168.0.0/20.
- Cisco TAC said they don't see anything wrong with using a /20, other than that it gives us a large broadcast domain; it should still function.
- Wi-Fi, management, printers, etc. are all on separate VLANs.
- Spanning tree has been verified by Cisco TAC and CCNP/CCIE-qualified individuals. We shut down all redundant links.
- Configuration on the core has been verified by Cisco TAC.
- We have the default ARP timeout (four hours) on the majority of the switches.
- We do not implement Q-in-Q tunneling.
- No new switches have been added (at least none that we know of).
- We cannot use Dynamic ARP Inspection on the edge switches because they are 2950s.
- We used show interfaces | inc line|broadcast to figure out where the large number of broadcasts was coming from; however, both Cisco TAC and two other engineers (CCNP and CCIE) confirmed this is normal behaviour given what is happening on the network (i.e. the large number of MAC flaps causing the increased broadcasts). We verified that STP was functioning correctly on the edge switches (the exact checks are sketched just after this list).
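To be concrete, the ARP-timeout and broadcast-counter checks were along the following lines. This is only a rough sketch; Vlan 1 is simply our desktop VLAN, not a recommendation:
show interfaces vlan 1 | include ARP Timeout
! the IOS default is "ARP Timeout 04:00:00" (14400 seconds); "arp timeout <seconds>" under the SVI would change it
show interfaces | include line protocol|broadcasts
! per-port broadcast input counters; a port whose counter climbs far faster than its peers points towards the offending segment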
Symptoms on the network and switches:
- Large number of MAC flaps
- High CPU usage from the ARP Input process
- A huge number of ARP packets, rapidly increasing and clearly visible
- Wireshark shows that hundreds of computers are flooding the network with ARP broadcasts
- As a test, we moved approximately 80 desktop machines to a different VLAN; however, this made no visible difference to the high CPU or ARP Input load
- We have run various AV/malware/spyware scans, but no viruses are visible on the network.
- sh mac address-table count shows approximately 750 different MAC addresses on VLAN 1, as expected.
#sh processes cpu sorted | exc 0.00%
CPU utilization for five seconds: 99%/12%; one minute: 99%; five minutes: 99%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
12 111438973 18587995 5995 44.47% 43.88% 43.96% 0 ARP Input
174 59541847 5198737 11453 22.39% 23.47% 23.62% 0 Hulc LED Process
221 7253246 6147816 1179 4.95% 4.25% 4.10% 0 IP Input
86 5459437 1100349 4961 1.59% 1.47% 1.54% 0 RedEarth Tx Mana
85 3448684 1453278 2373 1.27% 1.04% 1.07% 0 RedEarth I2C dri
- Ran show mac address-table on different switches and on the core itself (on the core, for example, on a port with my own desktop plugged in directly), and we can see several different MAC addresses registered on the interface even though only one computer is attached to it (an example of how we traced these entries follows the TCAM output below):
Vlan Mac Address Type Ports
---- ----------- -------- -----
1 001c.c06c.d620 DYNAMIC Gi1/1/3
1 001c.c06c.d694 DYNAMIC Gi1/1/3
1 001c.c06c.d6ac DYNAMIC Gi1/1/3
1 001c.c06c.d6e3 DYNAMIC Gi1/1/3
1 001c.c06c.d78c DYNAMIC Gi1/1/3
1 001c.c06c.d7fc DYNAMIC Gi1/1/3
- show platform tcam utilization
CAM Utilization for ASIC# 0 Max Used
Masks/Values Masks/values
Unicast mac addresses: 6364/6364 1165/1165
IPv4 IGMP groups + multicast routes: 1120/1120 1/1
IPv4 unicast directly-connected routes: 6144/6144 524/524
IPv4 unicast indirectly-connected routes: 2048/2048 77/77
IPv4 policy based routing aces: 452/452 12/12
IPv4 qos aces: 512/512 21/21
IPv4 security aces: 964/964 45/45
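As an illustration of how we chased a single entry, the walk from the core towards the edge was roughly as follows. The MAC address and port are taken from the table above purely as an example; the commands are the standard ones on the 3750/2950:
show mac address-table address 001c.c06c.d620
! shows the port this switch is currently learning the MAC on; repeat on the next switch down that port
show mac address-table interface Gi1/1/3
! lists every MAC learned on the suspect port; several entries on a single-host access port is exactly the symptom above
In our case each hop ended at an access port that was itself learning multiple unrelated MAC addresses.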
We are now at the stage where we will need a huge amount of downtime to isolate each area one at a time, unless anyone else has ideas to identify the source or root cause of this weird and bizarre issue.
Update
Thank you @MikePennington and @RickyBeam for the detailed responses. I will try to answer what I can.
- As mentioned, 192.168.0.0/20 is an inherited mess. We do intend to split it up in the future, but unfortunately this issue occurred before we could do so. I personally agree with the majority that the broadcast domain is far too big.
- Using arpwatch is definitely something we can try, but I suspect that because several access ports are registering MAC addresses that don't belong to them, arpwatch's conclusions may not be useful.
- I completely agree that we cannot be 100% sure we have found all redundant links and unknown switches on the network, but to the best of our knowledge this is the case until we find evidence otherwise.
- Port security has been looked into; unfortunately, management has decided not to use it for various reasons, the most common being that we constantly move computers around (college environment).
- We use spanning-tree portfast in conjunction with spanning-tree bpduguard by default on all access ports (desktop machines).
- We do not currently use switchport nonegotiate on access ports, but we are not seeing any VLAN-hopping attacks bouncing across multiple VLANs.
- Will give mac address-table notification a go and see if we can find any patterns (a minimal sketch is included below).
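For completeness, the access-port hardening we already have plus the notification we are about to try look roughly like this. This is a minimal sketch; the interface range and numbering are illustrative, not our actual configuration:
conf t
 mac address-table notification mac-move
 ! each move should now log a %SW_MATM-4-MACFLAP_NOTIF message naming the MAC, the VLAN, and the two ports involved
 interface range GigabitEthernet1/0/1 - 48
  switchport mode access
  spanning-tree portfast
  spanning-tree bpduguard enable
end
show mac address-table notification mac-move
The hope is that the logged port pairs reveal a pattern, e.g. one uplink or one edge switch common to most of the flaps.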
"Since you're getting a large number of MAC flaps between switchports, it's hard to find where the offenders are (suppose you find two or three mac addresses that send lots of arps, but the source mac addresses keep flapping between ports)."
- We started on this: we picked one flapping MAC address and worked our way from the core switch through distribution to the access switch, but what we found, once again, was that the access-port interface was hogging multiple MAC addresses (hence the MAC flaps); so back to square one.
- Storm control is something we did consider, but we fear that some legitimate packets would be dropped, causing further issues (see the sketch after this list).
- Will triple-check the VM host configuration.
- @ytti, the unexplained MAC addresses are behind many access ports rather than an individual one. We haven't found any loops on these interfaces. The same MAC addresses also appear on other interfaces, which would explain the large number of MAC flaps.
- @RickyBeam, I agree with the question of why hosts are sending so many ARP requests; this is one of the puzzling issues. A rogue wireless bridge is an interesting one that I hadn't given thought to; as far as we are aware, wireless is on a different VLAN, but a rogue device would obviously mean it could well be on VLAN 1.
- @RickyBeam, I don't really wish to unplug everything, as this will cause a massive amount of downtime; however, this is where it may well be heading. We do have Linux servers, but no more than three.
- @RickyBeam, can you explain DHCP server "in-use" probing?
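If we do end up trying storm control despite the concern above, it would probably be with a deliberately high broadcast threshold plus a trap, so that only an outright storm is suppressed. The 10%/5% figures below are our own guess at a safe starting point, not something TAC recommended:
interface GigabitEthernet1/0/1
 storm-control broadcast level 10.00 5.00
 ! broadcasts above 10% of line rate are suppressed until the rate drops back under 5%
 storm-control action trap
 ! send an SNMP trap when the threshold is crossed rather than err-disabling the port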
We (Cisco TAC, CCIEs, CCNPs) all agree that this is not a switch configuration problem; rather, a host or device is causing the issue.