Hoping someone here might have some insight into the issue we are facing. Cisco TAC is currently looking at the case, but they are struggling to find the root cause.
Although the title mentions ARP broadcasts and high CPU usage, we are unsure at this stage whether the two are related.
The original issue has been posted on the INE Online Community.
We have stripped the network down to a single-link, no-redundancy setup; think of it as a star topology.
Facts:
- We use Catalyst 3750-X switches, four in one stack, running IOS 15.0(1)SE3. Cisco TAC confirms there are no known high-CPU or ARP bugs for this particular version.
- No hubs or unmanaged switches connected.
- The core stack has been reloaded.
- We don't have a default route ("ip route 0.0.0.0 0.0.0.0 f1/0"); we use OSPF for routing.
- We see a large number of broadcast packets on VLAN 1, which is used for desktop devices. We use 192.168.0.0/20.
- Cisco TAC said they don't see anything wrong with using a /20, other than that it gives us a large broadcast domain; it should still function.
- Wi-Fi, management, printers, etc. are all on separate VLANs.
- Spanning tree has been verified by Cisco TAC and CCNP/CCIE-qualified individuals. We shut down all redundant links.
- Configuration on the core has been verified by Cisco TAC.
- We have the default ARP timeout (four hours) on the majority of the switches.
- We do not implement Q-in-Q tunneling.
- No new switches have been added (at least none that we know of).
- We cannot use Dynamic ARP Inspection on the edge switches because they are 2950s.
- We used show interfaces | inc line|broadcast to figure out where the large number of broadcasts was coming from; however, both Cisco TAC and two other engineers (CCNP and CCIE) confirmed this is normal behaviour given what is happening on the network (i.e. the large number of MAC flaps causing the increased broadcasts). We verified that STP was functioning correctly on the edge switches (the exact checks are sketched just after this list).
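To be concrete, the ARP-timeout and broadcast-counter checks were along the following lines. This is only a rough sketch; Vlan 1 is simply our desktop VLAN, not a recommendation:
show interfaces vlan 1 | include ARP Timeout
! the IOS default is "ARP Timeout 04:00:00" (14400 seconds); "arp timeout <seconds>" under the SVI would change it
show interfaces | include line protocol|broadcasts
! per-port broadcast input counters; a port whose counter climbs far faster than its peers points towards the offending segment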
Symptoms on the network and switches:
- Large number of MAC flaps
- High CPU usage from the ARP Input process
- A huge number of ARP packets, rapidly increasing and clearly visible
- Wireshark shows that hundreds of computers are flooding the network with ARP broadcasts
- As a test, we moved approximately 80 desktop machines to a different VLAN; however, this made no visible difference to the high CPU or ARP Input load
- We have run various AV/malware/spyware scans, but no viruses are visible on the network.
- sh mac address-table count shows approximately 750 different MAC addresses on VLAN 1, as expected.
#sh processes cpu sorted | exc 0.00%
CPU utilization for five seconds: 99%/12%; one minute: 99%; five minutes: 99%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
12 111438973 18587995 5995 44.47% 43.88% 43.96% 0 ARP Input
174 59541847 5198737 11453 22.39% 23.47% 23.62% 0 Hulc LED Process
221 7253246 6147816 1179 4.95% 4.25% 4.10% 0 IP Input
86 5459437 1100349 4961 1.59% 1.47% 1.54% 0 RedEarth Tx Mana
85 3448684 1453278 2373 1.27% 1.04% 1.07% 0 RedEarth I2C dri
- Ran show mac address-table on different switches and on the core itself (on the core, for example, on a port with my own desktop plugged in directly), and we can see several different MAC addresses registered on the interface even though only one computer is attached to it (an example of how we traced these entries follows the TCAM output below):
Vlan Mac Address Type Ports
---- ----------- -------- -----
1 001c.c06c.d620 DYNAMIC Gi1/1/3
1 001c.c06c.d694 DYNAMIC Gi1/1/3
1 001c.c06c.d6ac DYNAMIC Gi1/1/3
1 001c.c06c.d6e3 DYNAMIC Gi1/1/3
1 001c.c06c.d78c DYNAMIC Gi1/1/3
1 001c.c06c.d7fc DYNAMIC Gi1/1/3
- show platform tcam utilization
CAM Utilization for ASIC# 0 Max Used
Masks/Values Masks/values
Unicast mac addresses: 6364/6364 1165/1165
IPv4 IGMP groups + multicast routes: 1120/1120 1/1
IPv4 unicast directly-connected routes: 6144/6144 524/524
IPv4 unicast indirectly-connected routes: 2048/2048 77/77
IPv4 policy based routing aces: 452/452 12/12
IPv4 qos aces: 512/512 21/21
IPv4 security aces: 964/964 45/45
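As an illustration of how we chased a single entry, the walk from the core towards the edge was roughly as follows. The MAC address and port are taken from the table above purely as an example; the commands are the standard ones on the 3750/2950:
show mac address-table address 001c.c06c.d620
! shows the port this switch is currently learning the MAC on; repeat on the next switch down that port
show mac address-table interface Gi1/1/3
! lists every MAC learned on the suspect port; several entries on a single-host access port is exactly the symptom above
In our case each hop ended at an access port that was itself learning multiple unrelated MAC addresses.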
We are now at the stage where we will need a huge amount of downtime to isolate each area one at a time, unless anyone else has ideas to identify the source or root cause of this weird and bizarre issue.
Update
Thank you @MikePennington and @RickyBeam for the detailed responses. I will try to answer what I can.
- As mentioned, 192.168.0.0/20 is an inherited mess. We do intend to split it up in the future, but unfortunately this issue occurred before we could do so. I personally agree with the majority that the broadcast domain is far too big.
- Using arpwatch is definitely something we can try, but I suspect that because several access ports are registering MAC addresses that don't belong to them, arpwatch's conclusions may not be useful.
- I completely agree that we cannot be 100% sure we have found all redundant links and unknown switches on the network, but to the best of our knowledge this is the case until we find evidence otherwise.
- Port security has been looked into; unfortunately, management has decided not to use it for various reasons, the most common being that we constantly move computers around (college environment).
- We use spanning-tree portfast in conjunction with spanning-tree bpduguard by default on all access ports (desktop machines).
- We do not currently use switchport nonegotiate on access ports, but we are not seeing any VLAN-hopping attacks bouncing across multiple VLANs.
- Will give mac address-table notification a go and see if we can find any patterns (a minimal sketch is included below).
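For completeness, the access-port hardening we already have plus the notification we are about to try look roughly like this. This is a minimal sketch; the interface range and numbering are illustrative, not our actual configuration:
conf t
 mac address-table notification mac-move
 ! each move should now log a %SW_MATM-4-MACFLAP_NOTIF message naming the MAC, the VLAN, and the two ports involved
 interface range GigabitEthernet1/0/1 - 48
  switchport mode access
  spanning-tree portfast
  spanning-tree bpduguard enable
end
show mac address-table notification mac-move
The hope is that the logged port pairs reveal a pattern, e.g. one uplink or one edge switch common to most of the flaps.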
"Since you're getting a large number of MAC flaps between switchports, it's hard to find where the offenders are (suppose you find two or three mac addresses that send lots of arps, but the source mac addresses keep flapping between ports)."
- We started on this: we picked one flapping MAC address and worked our way from the core switch through distribution to the access switch, but what we found, once again, was that the access-port interface was hogging multiple MAC addresses (hence the MAC flaps); so back to square one.
- Storm control is something we did consider, but we fear that some legitimate packets would be dropped, causing further issues (see the sketch after this list).
- Will triple-check the VM host configuration.
- @ytti, the unexplained MAC addresses are behind many access ports rather than an individual one. We haven't found any loops on these interfaces. The same MAC addresses also appear on other interfaces, which would explain the large number of MAC flaps.
- @RickyBeam, I agree with the question of why hosts are sending so many ARP requests; this is one of the puzzling issues. A rogue wireless bridge is an interesting one that I hadn't given thought to; as far as we are aware, wireless is on a different VLAN, but a rogue device would obviously mean it could well be on VLAN 1.
- @RickyBeam, I don't really wish to unplug everything, as this will cause a massive amount of downtime; however, this is where it may well be heading. We do have Linux servers, but no more than three.
- @RickyBeam, can you explain DHCP server "in-use" probing?
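If we do end up trying storm control despite the concern above, it would probably be with a deliberately high broadcast threshold plus a trap, so that only an outright storm is suppressed. The 10%/5% figures below are our own guess at a safe starting point, not something TAC recommended:
interface GigabitEthernet1/0/1
 storm-control broadcast level 10.00 5.00
 ! broadcasts above 10% of line rate are suppressed until the rate drops back under 5%
 storm-control action trap
 ! send an SNMP trap when the threshold is crossed rather than err-disabling the port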
We (Cisco TAC, CCIEs, CCNPs) all agree that this is not a switch configuration problem; rather, a host or device is causing the issue.