
Problem

All production servers suddenly lost internet access, while four other servers on the same VLAN with the same eth0 settings can still reach it.


Figure 1: System 1 represents the four systems that can access the internet, while System 2 represents the ones that have been unable to since this afternoon.

Analysis

  • System 1 can reach System 2 and vice versa
  • The default gateway (10.10.10.1) can be pinged from both System 1 and System 2
  • System 1 can access the internet
  • System 2 cannot access the internet
  • The eth0 configuration shown by ifconfig is identical on all production servers
  • The internal DNS server is the same one used by the systems that can access the internet
  • The IPs and names listed in /etc/resolv.conf can be reached
  • The internet can be accessed from the switch
  • The configuration of all 8 switch ports on Cisco IOS is identical
  • Tracepath from System 2 to 8.8.8.8 (Google DNS), a Google IP, or google.com hangs at the default gateway
  • The systems which cannot access the internet seem to have an em1 adapter instead of eth0
  • sudo arping -I eth0 ping.tweakers.net works on all 8 systems
  • sudo iptables-save produces output on one of the systems which cannot access the internet
  • The output of route -n is identical on all systems

Tracepath

[username@hostname ~]$ tracepath google.com
 1:  10.10.10.10 (10.10.10.10)                                  0.222ms pmtu 1500
 1:  10.10.10.1 (10.10.10.1)                                    0.662ms
 1:  10.10.10.1 (10.10.10.1)                                    0.601ms
 2:  no reply
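
To make the reading of this output explicit, here is a minimal Python sketch that finds the last hop that replied. The sample text is copied from the output above; the helper name `last_responding_hop` and the regular expression are my own illustrations, and other tracepath versions may format their lines differently.

```python
import re

# Sample copied from the tracepath output above.
TRACEPATH_OUTPUT = """\
 1:  10.10.10.10 (10.10.10.10)                                  0.222ms pmtu 1500
 1:  10.10.10.1 (10.10.10.1)                                    0.662ms
 1:  10.10.10.1 (10.10.10.1)                                    0.601ms
 2:  no reply
"""

# Hop lines look like " 1:  <address> ..." (some versions print "1?:").
HOP_LINE = re.compile(r"^\s*(\d+)\??:\s+(.*)$")


def last_responding_hop(output):
    """Return (hop_number, address) of the last line that got a reply, or None."""
    last = None
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if match is None:
            continue
        hop, rest = int(match.group(1)), match.group(2)
        if rest.startswith("no reply"):
            continue  # nothing answered at this hop
        last = (hop, rest.split()[0])
    return last


print(last_responding_hop(TRACEPATH_OUTPUT))
```

For the sample, the last address that replies is the default gateway, which matches the observation that the trace stops there.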

ARP

System1: ? (10.10.10.1) at AA:BB:CC:DD:EE:FF [ether] on em1

System2: ? (10.10.10.1) at AA:BB:CC:DD:EE:FF [ether] on eth0
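
A small Python sketch of the comparison behind these two lines: parse each ARP entry and check whether both systems resolve the gateway to the same MAC address. The pattern assumes the `arp -a` line layout shown above, and `parse_arp_entry` is a hypothetical helper name. The MAC address is the anonymised placeholder from the post.

```python
import re

# Matches lines such as "? (10.10.10.1) at AA:BB:CC:DD:EE:FF [ether] on em1".
ARP_LINE = re.compile(
    r"\((?P<ip>[\d.]+)\) at (?P<mac>[0-9A-Fa-f:]{17}) \[\w+\] on (?P<iface>\S+)"
)


def parse_arp_entry(line):
    """Return a dict with ip, mac (lower-cased) and iface, or None if no match."""
    match = ARP_LINE.search(line)
    if match is None:
        return None
    return {
        "ip": match["ip"],
        "mac": match["mac"].lower(),
        "iface": match["iface"],
    }


entries = {
    "System1": parse_arp_entry("? (10.10.10.1) at AA:BB:CC:DD:EE:FF [ether] on em1"),
    "System2": parse_arp_entry("? (10.10.10.1) at AA:BB:CC:DD:EE:FF [ether] on eth0"),
}

# A single distinct MAC means both systems see the same device as the gateway.
gateway_macs = {entry["mac"] for entry in entries.values()}
same_gateway_mac = len(gateway_macs) == 1
print(same_gateway_mac, sorted(entry["iface"] for entry in entries.values()))
```

With the two entries above, only the interface name differs; the gateway MAC is the same on both systems.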

Output of iptables-save on one of the systems which cannot access the internet

# Generated by iptables-save vX on Fri Aug  1 10:00:01 2014
*filter
:INPUT ACCEPT [X:Y]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [X:Y]
COMMIT
# Completed on Fri Aug  1 10:00:01 2014
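
As a rough check of what this dump contains, the following Python sketch collects the default policy of each chain and any `-A` rules. The sample is the anonymised output above, placeholders included; `summarize` is a hypothetical helper, and it only understands the basic `iptables-save` layout (`*table`, `:CHAIN POLICY [counters]`, `-A ...`).

```python
# Sample copied from the iptables-save output above (placeholders kept as-is).
IPTABLES_SAVE = """\
# Generated by iptables-save vX on Fri Aug  1 10:00:01 2014
*filter
:INPUT ACCEPT [X:Y]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [X:Y]
COMMIT
# Completed on Fri Aug  1 10:00:01 2014
"""


def summarize(dump):
    """Return ({(table, chain): policy}, [(table, rule_line), ...])."""
    policies = {}
    rules = []
    table = None
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("*"):
            table = line[1:]  # start of a table section, e.g. "*filter"
        elif line.startswith(":"):
            chain, policy = line[1:].split()[:2]
            policies[(table, chain)] = policy
        elif line.startswith("-A"):
            rules.append((table, line))
    return policies, rules


policies, rules = summarize(IPTABLES_SAVE)
print(policies)
print(rules)
```

For this dump, every chain in the filter table has policy ACCEPT and there are no rules, so nothing in it suggests that the local firewall on this host is dropping traffic.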

route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.10.10.0      0.0.0.0         255.255.255.0   U     0      0        0 eth0
X.Y.0.0         0.0.0.0         255.255.0.0     U     Z      0        0 eth0
0.0.0.0         10.10.10.1      0.0.0.0         UG    0      0        0 eth0
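
To illustrate which entry of this table applies to internet-bound traffic, here is a minimal Python sketch of longest-prefix route selection. It is a simplification that ignores metrics, and the anonymised X.Y.0.0/16 route is left out because its real value is not shown. `ROUTES` and `select_route` are illustrative names, and the genmasks are written as prefix lengths (255.255.255.0 is /24, 0.0.0.0 is /0).

```python
import ipaddress

# (destination in CIDR form, gateway, interface), taken from route -n above.
# The X.Y.0.0/16 entry is omitted because it is anonymised in the post.
ROUTES = [
    ("10.10.10.0/24", "0.0.0.0", "eth0"),
    ("0.0.0.0/0", "10.10.10.1", "eth0"),
]


def select_route(destination, routes=ROUTES):
    """Return (network, gateway, iface) of the most specific matching route."""
    address = ipaddress.ip_address(destination)
    best = None
    for cidr, gateway, iface in routes:
        network = ipaddress.ip_network(cidr)
        if address not in network:
            continue
        # Prefer the route with the longest prefix (most specific match).
        if best is None or network.prefixlen > best[0].prefixlen:
            best = (network, gateway, iface)
    return best


print(select_route("8.8.8.8"))     # internet destination
print(select_route("10.10.10.1"))  # the gateway itself, on the local subnet
```

Under these assumptions, 8.8.8.8 only matches the default route via 10.10.10.1, while the gateway itself is reached directly on the local subnet, which agrees with the ping and tracepath results.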

It is unclear why the four production servers can no longer access the internet. As they are running in production, restarting the network should be avoided. Which further tests could be done to investigate the issue?

030
  • The actual `tracepath` output may provide a hint. Also check the ARP cache on each system to see if they have the same MAC for the gateway. You can use `arp -n` to see the ARP cache. – kasperd Jul 31 '14 at 18:29
  • @kasperd The `tracepath` output has been added. I will check the ARP cache of each system at work tomorrow. I executed `arp -n` on a local test system, and the MAC addresses it shows differ from the ones `ifconfig` reports for the adapters, while the internet can be accessed. Could you explain why there is a discrepancy, why the internet can still be accessed, and what can be concluded from the output of `arp -n`? – 030 Jul 31 '14 at 19:19
  • The ARP cache doesn't contain the MAC address of the host itself, it contains the MAC addresses of the hosts it is communicating with. The `tracepath` output shows the packets are making it to the gateway and back, so there is nothing suggesting it would be an ARP related problem. Rather the root cause is likely to be found on the gateway. Maybe a misconfigured NAT. – kasperd Jul 31 '14 at 19:29
  • @kasperd Executing `arp -n` on a VM showed a MAC address which is identical to my host. This is correct as the VM communicates via the Host to the internet. Thank you for the explanation. I will search commands to check the NAT table now. – 030 Jul 31 '14 at 19:37
  • @kasperd Could you indicate whether [these commands](http://www.cyberciti.biz/faq/howto-iptables-show-nat-rules/) are useful to investigate the NAT? – 030 Jul 31 '14 at 19:44
  • If the NAT is running on Linux, then those commands should work. However I prefer looking at the rules in the format output by `iptables-save`, as that provides the complete rule set with no details missing. – kasperd Jul 31 '14 at 19:55
  • @kasperd The systems which cannot access the internet seem to have an `em1` adapter instead of `eth0` (found after executing `arp -a`) – 030 Aug 01 '14 at 07:11
  • Seems Fedora switched to a different naming scheme for network interfaces. But there is nothing suggesting that would be causing connectivity problems. – kasperd Aug 01 '14 at 07:24
  • `sudo arping -I em1 ` works from the systems which cannot ping the internet – 030 Aug 01 '14 at 07:46
  • Yes, the `tracepath` output already established that there is no connectivity problem between the host and gateway. Symptoms suggest the problem is a misconfiguration on the NAT. – kasperd Aug 01 '14 at 07:48
  • `sudo iptables-save` produces output on one of the four systems which cannot access the internet. I will check the NAT on the switch as well. – 030 Aug 01 '14 at 08:03

0 Answers