
My server is operating properly.

Something happens...

Suddenly, my server is inaccessible via ssh and http (the only ports open in the firewall).

All eth0 configuration is static, no DHCP.

Everything is configured properly in /etc/network/interfaces.

ethtool is configured to force 100baseT full duplex, as is the gateway - autoneg is off.

ifconfig shows zero errors, zero dropped packets, zero overruns, and zero collisions.

My server responds to pings sent from other networks. (Which must go through the gateway?)

route shows everything configured properly.

My server receives no response from pings sent to the gateway.

My server gets responses from other machines on the local network.

After receiving ping responses from other local machines, the server can then get a response from the gateway, and both http and ssh work properly again. I didn't restart or reconfigure the network in any way; the only commands I used to bring server access back were ifconfig, ethtool, route and ping (in the order indicated above).

See this question for more background information. (I implemented the suggested solution provided, but to no avail.)

Any ideas? This problem has been ongoing for weeks, and I'm at a complete loss...

Andrew Parker
  • Maybe the NAT table filled up or has a memory leak? What kind of router is this? – SpacemanSpiff Mar 08 '12 at 21:19
  • Did you try using a different ethernet port on the server? – aseq Mar 08 '12 at 21:25
  • It's another server. I'm not sure of its details, I don't have access to it. But other sites behind this firewall/gateway are not losing their connectivity - just mine. – Andrew Parker Mar 08 '12 at 21:25
  • @aseq No, I haven't. I can try it, but what makes you think that'd make a difference? – Andrew Parker Mar 08 '12 at 21:27
  • Well, you seem to have exhausted almost all other options. The server seems to be configured right, speed of ethernet ports is correct. The other thing is that the gateway itself could be at fault. I assume you switched cables? – aseq Mar 08 '12 at 21:29
  • The cables are run up through the ceiling... by "their best man". (That's a direct quote.) It could be done, I'm just hoping to take suggestions for exhausting all possibilities first. – Andrew Parker Mar 08 '12 at 21:32
  • Check the Arp table on the router, and the Switching Tables on any switches involved while the system is in a broken state. Compare to the working state. I've seen something like this (http://serverfault.com/questions/215500/cant-reliably-ping-6224-router-from-directly-attached-system) but that is an edge case that probably doesn't apply here. – David Mackintosh Mar 09 '12 at 06:16
  • Will do... been fortunate so far. No crashes since I set the server up to ping a non-gw address every hour. – Andrew Parker Mar 12 '12 at 22:41
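
The comparison suggested in the comments (ARP tables in the broken state versus the working state) can also be done from the server's side. The following is only a minimal sketch: it assumes a Linux host where the neighbour cache is readable as text in the `/proc/net/arp` layout, and the two snapshots, the addresses (taken from the 192.0.2.0/24 documentation range) and the function names are made up for illustration. It shows only what the server has cached, not what the gateway has.

```python
# Column layout of /proc/net/arp on Linux.
HEADER = "IP address       HW type     Flags       HW address            Mask     Device"

# Hypothetical snapshots; in practice, save the file's contents in each state.
BROKEN = HEADER + """
192.0.2.1        0x1         0x0         00:00:00:00:00:00     *        eth0
192.0.2.20       0x1         0x2         00:aa:bb:cc:dd:20     *        eth0
"""
WORKING = HEADER + """
192.0.2.1        0x1         0x2         00:11:22:33:44:55     *        eth0
192.0.2.20       0x1         0x2         00:aa:bb:cc:dd:20     *        eth0
"""


def parse_arp_table(text):
    """Parse text in the /proc/net/arp layout into {ip: mac}."""
    entries = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6:
            entries[fields[0]] = fields[3].lower()
    return entries


def diff_arp(before, after):
    """Return {ip: (old_mac, new_mac)} for entries that differ or are missing."""
    changes = {}
    for ip in sorted(set(before) | set(after)):
        old, new = before.get(ip), after.get(ip)
        if old != new:
            changes[ip] = (old, new)
    return changes


for ip, (old, new) in diff_arp(parse_arp_table(BROKEN), parse_arp_table(WORKING)).items():
    print(f"{ip}: {old} -> {new}")
```

An all-zero hardware address in the broken snapshot would mean the server never got an ARP reply from the gateway. A different, non-zero address would point toward another device answering for that IP.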

1 Answer


You don't say what your gateway is, but my vote is that this is a router/gateway issue.

Most likely the ARP entry on the gateway is timing out and not getting refreshed, or another device on the network is either answering ARP requests for your server's IP address or sending gratuitous ARPs for it.

When you ping the gateway, it looks in its ARP table, finds an incorrect entry, and replies to the wrong host. When you ping the other hosts, they ARP, your server responds, and then everyone's ARP table is updated with the correct info and it all works again. Just like magic.

kls
  • Agreed - ARP was the first thing that popped into my head as well when I read this. – EEAA Mar 08 '12 at 21:38
  • Would insufficient traffic to my machine cause an arp timeout? My current bandaid for this situation is to ping another machine on my network (not the gw) every hour. Does that sound sufficient if arp is the problem? – Andrew Parker Mar 08 '12 at 21:43
  • Depends on the gateway you are using and what they set the ARP timeout to. Generally it's 5 minutes or less, but some implementations (Cisco cef) are set to 8 hours. If it were me and I couldn't get rid of the gw, I'd ping every minute. The amount of traffic is almost non-existent and it is a server after all. Don't want the thing going down. – kls Mar 08 '12 at 22:41
  • However, you should diagnose this more because it could be a sign of bad things happening on one of your computers. MITM attacks use gratuitous arp. If it were me, I wouldn't just bandaid it until I knew it was just a bad gw. – kls Mar 08 '12 at 22:46
  • Okay. The gw admin just emailed that an ARP check during the outage was pointing to the correct address. Also, the timeout is set to four hours. – Andrew Parker Mar 08 '12 at 23:13
  • What kind of gw is this? Your server is being NATed at the gw, right? The pings that get answered from outside when this happens are the gw responding for the server. It's called proxy ARP, and it is necessary/typical because of NAT. When actual packets such as ssh try to get through, the gw must send them on to the server, and they get lost because the gw does not know how to reach the server or fails to forward the packets to it. – kls Mar 08 '12 at 23:37
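
The stopgap discussed in the comments (pinging another local machine on a fixed schedule, every minute rather than every hour) could look roughly like the sketch below. It assumes the Linux iputils `ping` with its `-c` (count) and `-W` (timeout in seconds) options; the function names and the target address are hypothetical. It is a band-aid, not a diagnosis, so it does not replace finding out why the ARP entry goes bad.

```python
import subprocess
import time


def ping_once(host, run=subprocess.call):
    """Send one ICMP echo request; return True if ping exits with status 0."""
    cmd = ["ping", "-c", "1", "-W", "2", host]
    return run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0


def keepalive(host, interval=60, rounds=None, run=subprocess.call, sleep=time.sleep):
    """Ping host every `interval` seconds; return the number of failed pings.

    rounds=None loops forever. `run` and `sleep` are injectable so the loop
    can be exercised without touching the network.
    """
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        if not ping_once(host, run):
            failures += 1
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} no reply from {host}")
        done += 1
        sleep(interval)
    return failures
```

Calling `keepalive("192.0.2.20")` (a placeholder for a non-gateway machine on the local network) would run until interrupted. The timestamps of any logged failures could then be lined up against the outages.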