Over the last few days I've noticed the same kind of message repeating in the logs, and I can say with certainty that nothing was intentionally changed (installed or uninstalled) in that period.

Here's a sample of the messages from /var/log/kern.log:

Mar 30 06:32:45 aurora kernel: [566322.867110] e1000e: eth0 NIC Link is Down

Mar 30 06:32:47 aurora kernel: [566325.313634] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Mar 30 06:32:59 aurora kernel: [566337.632930] e1000e: eth0 NIC Link is Down

Mar 30 06:33:18 aurora kernel: [566356.543664] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

Mar 30 11:05:47 aurora kernel: [582689.779752] e1000e: eth0 NIC Link is Down

Mar 30 11:05:50 aurora kernel: [582692.174337] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Taking all messages of this kind in the complete log file into account, I can conclude:

  • eth0 goes down every few hours
  • in the sample above, eth0 was down for about 2 seconds in the first case and about 19 seconds in the second
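
The outage lengths above can be read straight off the log. The following is a minimal sketch, assuming the kern.log line format shown in the sample (a bracketed kernel uptime stamp followed by "NIC Link is Down" or "NIC Link is Up"); the function name link_down_durations is hypothetical, and the subtraction is only meaningful between stamps from the same boot.

```shell
#!/bin/sh
# Print how long each link outage lasted, based on the kernel uptime stamps.
# Reads kern.log lines on stdin; the interface name defaults to eth0.
link_down_durations() {
    iface="${1:-eth0}"
    awk -v ifc="$iface" '
        # Extract the [seconds.micros] uptime stamp from a log line.
        function stamp(line) {
            if (match(line, /\[ *[0-9]+\.[0-9]+\]/))
                return substr(line, RSTART + 1, RLENGTH - 2) + 0
            return -1
        }
        # Remember when the link went down.
        index($0, ifc " NIC Link is Down") {
            down = stamp($0); when = $1 " " $2 " " $3; have = 1; next
        }
        # On the next "Up" line, report the elapsed time.
        index($0, ifc " NIC Link is Up") && have {
            printf "%s down for %.1f s\n", when, stamp($0) - down
            have = 0
        }
    '
}
```

It could be run as `link_down_durations eth0 < /var/log/kern.log` to get one line per outage.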

This is a production server.

How can I solve this problem? It is a production mail server, and I cannot tolerate network outages lasting 19 seconds.

Miloš Đakonović
  • What have you checked so far? Is the cable properly attached and in unharmed condition? Does the switch on the other end also observe the link going down? Worth noting is that the detected link is different at different times (flow control differs in your log). Maybe the autonegotiation fails? Does the problem go away if you force 1000Mbps FD Rx/Tx? – Håkan Lindqvist Mar 30 '14 at 11:36
  • @HåkanLindqvist I don't have the option to check the cable, since the server is not physically near me. Is that something I should ask the server farm tech staff to check? How do I force 1000Mbps FD Rx/Tx? And, about flow control being different at different times, is this an issue? – Miloš Đakonović Mar 30 '14 at 11:51
  • The link "type" changing over time suggests to me that something is not quite right but finding the actual cause is of course a separate question entirely. Asking the tech staff may be a good idea. – Håkan Lindqvist Mar 30 '14 at 11:54
  • You can use ethtool or mii-tool to check auto-negotiate status etc. at the server end. You need to make sure that the switch and your server are set up to match. This sounds like a hardware problem: it could be the server adapter, the cable or the switch. I suggest looking at the status of the switch to see what it thinks is happening. – Paul Haldane Mar 30 '14 at 15:35

2 Answers

  1. Check for errors on the wire: look at the "errors" field in the output of ifconfig. If it is non-zero, there is a hardware problem (cable, NIC, or hub/switch). An unreliable Ethernet cable will produce errors in this field too.
  2. Replace the Ethernet cable, regardless of the result of step 1. This is quick, cheap and easy, and should be done whenever your link is going up and down at random intervals.
  3. Use ethtool to make sure the network settings (duplex, etc.) match those on the switch. If you are not the admin of the switch, ask the network admin to provide you with the settings.
  4. If the switch has flow control enabled, be sure it is also enabled on your Linux box. Otherwise, disable it on the Linux box.
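
Steps 1, 3 and 4 can be sketched as two small helpers. This is an illustration, not a definitive procedure: the function names iface_errors and link_settings are hypothetical, the counters are read from the standard sysfs statistics files instead of ifconfig, and the second argument of iface_errors (an alternative sysfs root) is an assumption added only to make the helper easy to try out.

```shell
#!/bin/sh
# Step 1: print error counters for an interface from sysfs.
# $2 overrides the sysfs root (hypothetical, for testing only).
iface_errors() {
    iface="${1:-eth0}"
    root="${2:-/sys/class/net}"
    for counter in rx_errors tx_errors rx_crc_errors; do
        file="$root/$iface/statistics/$counter"
        [ -r "$file" ] && printf '%s=%s\n' "$counter" "$(cat "$file")"
    done
    return 0
}

# Steps 3 and 4: show the negotiated link settings and the pause
# (flow control) parameters. Needs ethtool and usually root.
link_settings() {
    iface="${1:-eth0}"
    ethtool "$iface"       # speed, duplex, auto-negotiation, link detected
    ethtool -a "$iface"    # pause parameters: autonegotiate, RX, TX
}
```

If the switch turns out to have flow control disabled, `ethtool -A eth0 rx off tx off` is the matching change on the Linux side; whether that is the right setting depends on what the network admin reports for the switch port.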

As a side note, you should assess whether you need flow control. According to HP, it is only necessary for high-performance applications: see HP article on When to Use Flow Control

Michael Martinez

Here's my fix. This problem happens only on specific hardware (on one machine, only 1 of the 2 ports on the NIC is affected), always with the e1000e driver, and since kernel 3.9 or so. The script below is for CentOS 7; it goes in /etc/init.d/ and has to be enabled with chkconfig --add <name>. The interface name is hardcoded, so be sure to set it.

#!/bin/sh

### BEGIN INIT INFO
# Provides:          pm-e1000e-fix
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 6
# Short-Description: workaround for e1000e issue
# Description:       e1000e fix
### END INIT INFO

################################################################################
# Give Usage Information                                                       #
################################################################################
usage() {
    echo "Usage: $0 start|restart" >&2
    exit 1
}

################################################################################
# E X E C U T I O N    B E G I N S   H E R E                                   #
################################################################################
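# Use an empty default so a missing argument falls through to usage();
# a bare "shift" with no arguments makes some shells (e.g. dash) abort.
command="${1:-}"

# Interface to apply the workaround to; change this to match your system.
interface="eth0"

case "$command" in
    start|restart)
        # Turn off generic segmentation/receive offload and TCP segmentation offload.
        ethtool -K "$interface" gso off gro off tso off
        ;;
    *)
        usage
        ;;
esac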
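
To confirm the workaround took effect, the offload state can be checked with `ethtool -k`. The helper below is a minimal sketch with a hypothetical name, offloads_still_on; it assumes the usual `feature-name: on|off` line format of `ethtool -k` output and prints any of the three offloads that are still enabled.

```shell
#!/bin/sh
# Read `ethtool -k <iface>` output on stdin and print those of the three
# offloads disabled by the workaround that are still reported as "on".
offloads_still_on() {
    awk -F': *' '
        $1 ~ /^[ \t]*(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload)$/ &&
        $2 ~ /^on/ { sub(/^[ \t]+/, "", $1); print $1 }
    '
}
```

Running `ethtool -k eth0 | offloads_still_on` after the init script has started should print nothing if all three offloads were switched off.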
Peter