
Today we had a number of machines stop getting internet access. After a lot of troubleshooting, the common thread is that they all had their DHCP lease renewed today (we're on 8-day leases here).

Everything you would expect looks good after the lease renewal: they have a valid IP address, DNS server, and gateway. They have access to internal resources (file shares, intranet, printers, etc.). A little more troubleshooting reveals they are unable to ping or tracert to our gateway, but they can reach our core layer 3 switch just in front of the gateway. Assigning a static IP to the machine works as a temporary solution.

One final wrinkle is that so far reports have only come in for clients on the same VLAN as the gateway. Our administrative staff and faculty are on the same VLAN as the servers and printers, but phones, key fob/cameras, students/wifi, and labs each have their own VLANs, and as far as I've seen nothing on any of the other VLANs has had a problem yet.

I have a separate ticket in with the gateway vendor, but I suspect they'll take the easy out and tell me the problem is elsewhere on the network, so I'm asking here as well. I've cleared ARP caches on the gateway and core switch. Any ideas welcome.

Update:
I tried pinging from the gateway back to some affected hosts, and the odd thing is that I did get a response: from a completely different IP address. I tried a few more at random and eventually got this:

Fri Sep 02 2011 13:08:51 GMT-0500 (Central Daylight Time)
PING 10.1.1.97 (10.1.1.97) 56(84) bytes of data.
64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms
64 bytes from 10.1.1.97: icmp_seq=1 ttl=255 time=39.9 ms (DUP!)

10.1.1.97 is the actual intended target of the ping. 10.1.1.105 is supposed to be a printer in another building. I have never seen a DUP in a ping response before.
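
For anyone wanting to spot this pattern across many hosts, here is a minimal sketch that scans ping output like the above for replies coming from an address other than the target, and for replies flagged `(DUP!)`. The function name `check_ping_replies` and the regular expression are my own illustration and assume the Linux-style output format shown above; other ping implementations print differently.

```python
import re

# Matches reply lines such as:
#   64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms
REPLY_RE = re.compile(
    r"bytes from (?P<src>\d{1,3}(?:\.\d{1,3}){3}): icmp_seq=(?P<seq>\d+)(?P<rest>.*)"
)


def check_ping_replies(target, output):
    """Return a list of (seq, source, reason) tuples for suspicious replies."""
    findings = []
    for line in output.splitlines():
        m = REPLY_RE.search(line)
        if not m:
            continue  # header or summary line, not a reply
        src, seq = m.group("src"), int(m.group("seq"))
        if src != target:
            findings.append((seq, src, "reply from unexpected address"))
        if "(DUP!)" in m.group("rest"):
            findings.append((seq, src, "duplicate reply"))
    return findings


# The output pasted above, used as sample input.
sample = """PING 10.1.1.97 (10.1.1.97) 56(84) bytes of data.
64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms
64 bytes from 10.1.1.97: icmp_seq=1 ttl=255 time=39.9 ms (DUP!)"""

for finding in check_ping_replies("10.1.1.97", sample):
    print(finding)
```

On the sample it flags the reply from 10.1.1.105 as coming from an unexpected address and the second reply as a duplicate.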

My best guess at the moment is a rogue wifi router in one of our dorm rooms on the 10.1.1.0/24 subnet with a bad gateway.

...continued. I've now powered down the offending printer, and pings to an affected host from the gateway just fail completely.

Update 2:
I checked the ARP tables on an affected machine, the gateway, and every switch between them. At each point, the entries for those devices were all correct. I didn't verify every entry in each table, but every entry that could possibly impact traffic between the host and the gateway was okay. ARP is not the problem.

Update 3:
Things are working at the moment, but I can't see anything I did to fix them and so I have no idea whether this might be just a temporary lull. Anyway, there's not much I can do to diagnose or troubleshoot now, but I'll update more if it breaks again.

Joel Coel
  • Ping work to their gateway? Are the configured DNS server(s) on the same subnet, or elsewhere? DNS resolution working? – Shane Madden Sep 02 '11 at 18:03
  • @Shane, all that works, and is answered in the text – Joel Coel Sep 02 '11 at 18:06
  • You said "unable to ping or tracert to our gateway" - is that the devices' first-hop gateway, or an internet router that their traffic gets routed to after being routed by a different first-hop device? – Shane Madden Sep 02 '11 at 18:12
  • I would run a packet capture on one of the clients and then ping and trace route to the gateway. See which MAC addresses show up in the capture for which IP addresses and also look for ICMP redirects. I would also take a close look at the ARP table on one of the clients, the switch, and the gateway and make sure they look right. – joeqwerty Sep 02 '11 at 18:17
  • You will see a DUP response to a ping when you ping the broadcast address. So if you ping 192.168.1.255 on a /24 network, then you should get lots of replies, though a lot less these days since firewalls typically block this now. – Zoredache Sep 02 '11 at 18:20
  • It's not really a duplicate response, it's a separate response from two different hosts. It's pretty strange nonetheless. Did anything in the DHCP scope(s) change, like the gateway address or the subnet mask? – joeqwerty Sep 02 '11 at 18:26
  • Wait, there's dorm rooms on this same layer 2 segment? Don't rule out ARP poisoning.. – Shane Madden Sep 02 '11 at 18:27
  • To clarify: you're saying that the gateway has valid ARP for an affected host, and the host has valid ARP back to the gateway, but the gateway is getting no reply when trying to ping the host? Are the ping packets getting to the device, or are they not getting switched properly? – Shane Madden Sep 02 '11 at 19:08
  • @Shane - I think the packets made it to the gateway from the host, but not back. I can't confirm this, though, because things started working again on their own before my wire capture was ready. I'm a little nervous right now that it will start failing again. – Joel Coel Sep 02 '11 at 19:33

3 Answers


"My best guess at the moment is a rogue wifi router in one of our dorm rooms on the 10.1.1.0/24 subnet with a bad gateway."

This happened in my office. The offending device turned out to be a rogue Android device:

http://code.google.com/p/android/issues/detail?id=11236

If the Android device gets the gateway's IP from another network via DHCP, it may join your network and start responding to ARP requests for the gateway IP with its own MAC. Your use of the common 10.1.1.0/24 network increases the probability of this rogue scenario.

I was able to check the ARP cache on an affected workstation on the network. There, I observed an ARP flux problem where the workstation would flip-flop between the correct MAC and a MAC address from some rogue device. When I looked up the suspicious MAC the workstation had for the gateway, it came back with a Samsung prefix. The astute user with the troubled workstation replied that he knew who had a Samsung device on our network. Turned out to be the CEO.
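
The check described above can be sketched roughly as follows: sample the ARP cache entry for the gateway several times, then compare what was seen against the MAC you know to be correct. This is only an illustration. The function names and the MAC addresses are made up, and collecting the samples (for example by running `arp -a` repeatedly) is left out. The first three octets of any unexpected MAC are the vendor prefix you would look up.

```python
from collections import Counter


def normalize_mac(mac):
    """Lower-case the MAC and use ':' separators (Windows prints '-')."""
    return mac.lower().replace("-", ":")


def gateway_mac_report(expected_mac, observed_macs):
    """Summarize repeated samples of the gateway's ARP cache entry."""
    expected = normalize_mac(expected_mac)
    counts = Counter(normalize_mac(m) for m in observed_macs)
    rogue = {mac: n for mac, n in counts.items() if mac != expected}
    return {
        "flux": len(counts) > 1,        # more than one MAC seen for one IP
        "rogue_macs": rogue,            # unexpected MAC -> times seen
        "rogue_ouis": sorted({mac[:8] for mac in rogue}),  # vendor prefixes
    }


# Hypothetical samples: the second and fourth entries are the rogue device.
samples = [
    "00:11:22:33:44:55",
    "AA-BB-CC-00-00-01",
    "00:11:22:33:44:55",
    "aa:bb:cc:00:00:01",
]
print(gateway_mac_report("00:11:22:33:44:55", samples))
```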

dmourati

As already discussed in the comment section, getting a packet capture is really critical. However, there is also a really great tool called arpwatch:

http://ee.lbl.gov/

(or http://sid.rstack.org/arp-sk/ for windows)

This tool will email you, or just keep a log, whenever it sees a new MAC address on the network or a change in the MAC address for an IP on a given subnet (a flip-flop). For the issue you had, it would have detected either of the current theories: it would report flip-flops for IPs changing MACs, or you would see a new MAC for the rogue DHCP router when it first started communicating with hosts. The one downside is that the host running it needs to be connected to every network you monitor, but that is a small price for the information it provides when diagnosing these sorts of issues.
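
To make the idea concrete, here is a minimal sketch of the kind of bookkeeping such a tool does. It is not arpwatch's actual implementation; the event labels are loosely modeled on its report types, and the function name, the gateway address 10.1.1.1, and the MAC addresses are assumptions for illustration.

```python
def classify_arp_events(sightings):
    """Turn a stream of (ip, mac) sightings into a list of notable events."""
    current = {}   # ip -> MAC currently associated with it
    previous = {}  # ip -> MAC it had before the current one
    events = []
    for ip, mac in sightings:
        mac = mac.lower()
        if ip not in current:
            events.append(("new station", ip, mac))
            current[ip] = mac
        elif mac != current[ip]:
            # Going back to the MAC seen just before is a flip-flop;
            # anything else is a plain change of address.
            if previous.get(ip) == mac:
                kind = "flip flop"
            else:
                kind = "changed ethernet address"
            events.append((kind, ip, mac))
            previous[ip] = current[ip]
            current[ip] = mac
    return events


# Hypothetical sightings: a rogue device briefly answers for the gateway IP.
sightings = [
    ("10.1.1.1", "00:11:22:33:44:55"),
    ("10.1.1.1", "aa:bb:cc:00:00:01"),
    ("10.1.1.1", "00:11:22:33:44:55"),
    ("10.1.1.97", "00:11:22:aa:bb:cc"),
]
for event in classify_arp_events(sightings):
    print(event)
```

A repeated sighting of the same IP and MAC produces no event, so the log stays quiet until something actually changes.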

polynomial

A quick way to detect a typical rogue DHCP server is to ping the gateway that it serves up and then examine that gateway's MAC in the corresponding ARP table. If the switching infrastructure is managed, the MAC can also be tracked down to the port hosting it, and the port can be either shut down or traced back to the location of the offending device for further redress.

DHCP snooping, on switches that support it, can also be an effective way to protect a network from rogue DHCP servers.

ewwhite
user48838