The tale of the NIC that quit;

I have chucked the end of dmesg output from a server into pastebin;

This server (PowerEdge 1850) has two NICs, eth0 and eth1. eth1 has a couple of VLANs defined on it, and each VLAN is in a different bridge; one bridge has multiple IPs. eth1 is the public-facing interface. eth0 is for backend/management access.

The server went "off-line" in the sense that it stopped serving public requests and I received an alert. I connected via SSH using the management IP on eth0 and found the server up, with low load and plenty of disk space, RAM and CPU cycles. All services were up and running, but the server wasn't serving any webpages.

That's when I checked dmesg and saw the above output. It seems that there was a problem with eth1 and it wouldn't send any packets, but it was receiving them. There are a few "Reset adapter" messages in the dmesg output, so I assume the server was "self repairing"?

[10716872.816012] e1000 0000:07:08.0: eth1: Reset adapter
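To see how often the adapter reset and how far apart the resets were, the dmesg lines can be tallied with a short script. This is only a minimal sketch in Python: the function name `find_resets` and the regular expression are assumptions, modelled on the single line format shown above, and other e1000 messages may be laid out differently.

```python
import re

# Assumed line format, taken from the one example above:
# [10716872.816012] e1000 0000:07:08.0: eth1: Reset adapter
RESET_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+(\S+)\s+(\S+):\s+([\w.]+):\s+Reset adapter"
)


def find_resets(lines):
    """Return (seconds_since_boot, driver, interface) for each reset line."""
    resets = []
    for line in lines:
        match = RESET_RE.match(line.strip())
        if match:
            resets.append((float(match.group(1)), match.group(2), match.group(4)))
    return resets


sample = ["[10716872.816012] e1000 0000:07:08.0: eth1: Reset adapter"]
resets = find_resets(sample)
```

Feeding it the full dmesg output (one string per line) gives a list whose timestamps can be compared to see whether the resets were clustered around the time of the outage.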

I ran tcpdump to see what was going on (the output of which I have misplaced!). However, I could see that the default gateway facing the public VLAN sub-interfaces on eth1 was ARP'ing for the public IPs assigned to the server, but the server wasn't sending any response.

So this is likely why the public facing services weren't working. I restarted the interface with sudo ifdown eth1 && sudo ifup eth1 which executed successfully, but didn't help.

I checked the arp table;

user@server:~$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
5.5.5.6                  (incomplete)                                      br12
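Unresolved neighbours like this one can be picked out of the `arp -n` output programmatically, which is handy when checking several bridges at once. The following is a rough Python sketch; the function name `incomplete_arp_entries` is hypothetical, and it assumes the column layout shown above, where an unresolved entry has `(incomplete)` in the second column and the interface in the last.

```python
def incomplete_arp_entries(arp_output):
    """Return (address, iface) pairs for entries the kernel has not resolved."""
    entries = []
    for line in arp_output.splitlines():
        fields = line.split()
        # Unresolved entries show "(incomplete)" where HWtype/HWaddress would be.
        if len(fields) >= 3 and fields[1] == "(incomplete)":
            entries.append((fields[0], fields[-1]))
    return entries


sample = (
    "Address                  HWtype  HWaddress           Flags Mask            Iface\n"
    "5.5.5.6                  (incomplete)                                      br12\n"
)
unresolved = incomplete_arp_entries(sample)
```

An entry for the default gateway in this list means the server sent ARP requests but never saw (or never processed) a reply, which fits an interface that has stopped transmitting.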

Seeing this incomplete address I took a shot in the dark (not really expecting it to work) and added the MAC for the default gateway of 5.5.5.6 manually. It didn't work.

It had been several minutes of production downtime so I rebooted the server moments later, and everything was back to normal after it rebooted.

Below I have posted the /etc/network/interfaces contents; however, I need some help understanding the pastebin entry I linked above. What is a possible cause for eth1 to quit its day job in the middle of the working day?

allow-hotplug eth0 
allow-hotplug eth1 
allow-hotplug eth1.1
allow-hotplug eth1.2

auto eth0
iface eth0 inet static
address 10.0.1.25
netmask 255.255.255.0

auto eth1
iface eth1 inet manual

auto eth1.2
iface eth1.2 inet manual
vlan_raw_device eth1

auto br12
iface br12 inet static
address 10.0.0.25
netmask 255.255.255.0
bridge_ports eth1.2
bridge_stp off

auto eth1.1
iface eth1.1 inet manual
vlan_raw_device eth1

auto br11
iface br11 inet static
address 5.5.5.5
netmask 255.255.255.248
gateway 5.5.5.6
bridge_ports eth1.118
bridge_stp off

auto br11:0
iface br11:0 inet static
address 5.5.5.4
netmask 255.255.255.248

auto br11:1
iface br11:1 inet static
address 5.5.5.3
netmask 255.255.255.248
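In the config above, br11 lists eth1.118 as its bridge port while the VLAN stanza is named eth1.1. That may simply be an artifact of editing the file for posting, and a missing stanza is not necessarily an error, but it is the kind of thing a quick consistency check can surface. The sketch below is an assumption-laden Python example: the function name `undefined_bridge_ports` is hypothetical, and it only handles the simple `iface` and `bridge_ports` lines used here, not every form that interfaces(5) allows.

```python
def undefined_bridge_ports(config_text):
    """Return (bridge, port) pairs whose port has no iface stanza of its own."""
    defined = set()
    ports = []
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] == "iface" and len(fields) >= 2:
            current = fields[1]
            defined.add(current)
        elif fields[0] == "bridge_ports" and current is not None:
            # "none" means a bridge with no ports, so it is skipped here.
            ports.extend((current, p) for p in fields[1:] if p != "none")
    return [(bridge, port) for bridge, port in ports if port not in defined]


sample = """
iface eth1.2 inet manual
iface br12 inet static
bridge_ports eth1.2
iface eth1.1 inet manual
iface br11 inet static
bridge_ports eth1.118
"""
mismatches = undefined_bridge_ports(sample)
```

Run against the real file (for example by passing the contents of /etc/network/interfaces as a string), an empty result means every bridge port has a matching stanza.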

For debug purposes;

user@server:~$ uname -a
Linux server.site.com 3.4.10 #1 SMP Thu Sep 13 13:12:24 BST 2012 x86_64 GNU/Linux
user@server:~$ cat /etc/issue
Debian GNU/Linux 6.0 \n \l

The server has been up for 3 days and 17 hours now, no errors in dmesg/kern.log/message/syslog and it's running fine. This is the lshw details for the NICs.

jwbensley

1 Answer

I think this was a kernel error, probably a driver bug or a hardware error.

You can try looking for known kernel bugs affecting this driver, updating the kernel, and so on.

Brigo