How to fix “eth0: Detected Hardware Unit Hang” in Debian 9?

8

4

I have a home server that acts as a gateway to the internet (I don't know if that is the best name for it). It has two Ethernet ports, one connected to my ISP and the other to a LAN switch. Routing and NAT are configured and working, along with a bunch of other services.

Recently I migrated from Ubuntu 14.04 to Debian 9 (a new, clean install) and am now slowly restoring the previous configuration. I got stuck quite early: I had only done the basic network configuration needed to let other computers, phones, TVs, etc. access the internet, but I noticed a lot of packet loss, and the connection seems to hang for a few seconds at a time. Inspecting the logs gave me this:

    [  212.088208] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
                     TDH                  <69>
                     TDT                  <aa>
                     next_to_use          <aa>
                     next_to_clean        <69>
                   buffer_info[next_to_clean]:
                     time_stamp           <ffffa7f6>
                     next_to_watch        <69>
                     jiffies              <ffffa9e8>
                     next_to_watch.status <0>
                   MAC Status             <80083>
                   PHY Status             <796d>
                   PHY 1000BASE-T Status  <3800>
                   PHY Extended Status    <3000>
                   PCI Status             <10>
    [  214.072275] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
                     TDH                  <69>
                     TDT                  <aa>
                     next_to_use          <aa>
                     next_to_clean        <69>
                   buffer_info[next_to_clean]:
                     time_stamp           <ffffa7f6>
                     next_to_watch        <69>
                     jiffies              <ffffabd8>
                     next_to_watch.status <0>
                   MAC Status             <80083>
                   PHY Status             <796d>
                   PHY 1000BASE-T Status  <3800>
                   PHY Extended Status    <3000>
                   PCI Status             <10>
    [  216.088094] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
                     TDH                  <69>
                     TDT                  <aa>
                     next_to_use          <aa>
                     next_to_clean        <69>
                   buffer_info[next_to_clean]:
                     time_stamp           <ffffa7f6>
                     next_to_watch        <69>
                     jiffies              <ffffadd0>
                     next_to_watch.status <0>
                   MAC Status             <80083>
                   PHY Status             <796d>
                   PHY 1000BASE-T Status  <3800>
                   PHY Extended Status    <3000>
                   PCI Status             <10>
    [  218.071082] ------------[ cut here ]------------
    [  218.072129] WARNING: CPU: 0 PID: 0 at /build/linux-EAZfyE/linux-4.9.51/net/sched/sch_generic.c:316 dev_watchdog+0x22d/0x230
    [  218.073249] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
    [  218.074368] Modules linked in: xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf pcspkr i915 sg drm_kms_helper lpc_ich mei_me mfd_core drm ie31200_edac joydev evdev mei edac_core shpchp i2c_algo_bit battery video button ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear hid_generic usbhid hid
    [  218.078853]  raid1 md_mod sd_mod crc32c_intel i2c_i801 ahci i2c_smbus libahci libata scsi_mod ehci_pci ehci_hcd xhci_pci xhci_hcd e1000e ptp usbcore pps_core usb_common fan thermal
    [  218.082049] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
    [  218.083772] Hardware name:                  /DQ77KB, BIOS KBQ7710H.86A.0051.2013.0329.1350 03/29/2013
    [  218.085468]  0000000000000000 ffffffffa7729974 ffff98909e203e20 0000000000000000
    [  218.087205]  ffffffffa7476eae 0000000000000000 ffff98909e203e78 ffff989094e28000
    [  218.088980]  0000000000000000 ffff989094fb9c80 0000000000000001 ffffffffa7476f2f
    [  218.090766] Call Trace:
    [  218.092579]  <IRQ>
    [  218.092597]  [<ffffffffa7729974>] ? dump_stack+0x5c/0x78
    [  218.094413]  [<ffffffffa7476eae>] ? __warn+0xbe/0xe0
    [  218.096268]  [<ffffffffa7476f2f>] ? warn_slowpath_fmt+0x5f/0x80
    [  218.098133]  [<ffffffffa74aed52>] ? enqueue_task_fair+0x82/0x940
    [  218.100024]  [<ffffffffa792cb2d>] ? dev_watchdog+0x22d/0x230
    [  218.101909]  [<ffffffffa792c900>] ? qdisc_rcu_free+0x40/0x40
    [  218.103860]  [<ffffffffa74e4020>] ? call_timer_fn+0x30/0x110
    [  218.105766]  [<ffffffffa74e4524>] ? run_timer_softirq+0x1d4/0x430
    [  218.107709]  [<ffffffffa74f4ca0>] ? tick_sched_handle.isra.12+0x20/0x50
    [  218.109654]  [<ffffffffa74f4d08>] ? tick_sched_timer+0x38/0x70
    [  218.111630]  [<ffffffffa7a0b0d5>] ? __do_softirq+0x105/0x290
    [  218.113594]  [<ffffffffa747cf8e>] ? irq_exit+0xae/0xb0
    [  218.115567]  [<ffffffffa7a0aeee>] ? smp_apic_timer_interrupt+0x3e/0x50
    [  218.117536]  [<ffffffffa7a0a202>] ? apic_timer_interrupt+0x82/0x90
    [  218.119509]  <EOI>
    [  218.119527]  [<ffffffffa78cd31a>] ? cpuidle_enter_state+0x11a/0x2b0
    [  218.121505]  [<ffffffffa74b9634>] ? cpu_startup_entry+0x154/0x240
    [  218.123486]  [<ffffffffa8138f57>] ? start_kernel+0x443/0x463
    [  218.125426]  [<ffffffffa8138120>] ? early_idt_handler_array+0x120/0x120
    [  218.127400]  [<ffffffffa8138408>] ? x86_64_start_kernel+0x14c/0x170
    [  218.129384] ---[ end trace 6cd1142bfcc66b87 ]---
    [  218.131367] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
    [  222.052843] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

I found this question: "e1000e Reset adapter unexpectedly / Detected Hardware Unit Hang", which seems to describe the same problem, but none of the fixes suggested there worked.

I tried:

  • Booting the kernel with pcie_aspm=off
  • Turning off segmentation and receive offloads: ethtool -K eth0 gso off gro off tso off
  • Disabling ASPM in the BIOS
  • Disabling all power-saving features in the BIOS
  • Running the fixeep-82573-dspd.sh script, which reported that my hardware is not compatible with that fix (or something to that effect)
  • Compiling the newest driver from the Intel website

What else can I try? I have already lost a whole day on this. The internet connection is unusable, and everybody has to use their own LTE/3G connection on their phones to access the web.

Is Debian a bad choice for such a server?

WombaT

Posted 2017-11-22T09:14:21.913

Reputation: 215

Other options are changing the driver/firmware version. If you can find out what you used in Ubuntu 14.04, try to get these installed. – dirkt – 2017-11-22T09:41:45.227

@dirkt I already tried the newest driver and it did not help, but what about the firmware? How can I update it? I still have old backups of the previous system; can I check the driver version from those files? – WombaT – 2017-11-22T12:41:55.493

Answers

7

There's a question on Server Fault that has two more potential fixes:

https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang

As mentioned in there, you might want to try:

  • disabling enhanced C1 power saving state (C1E) in the BIOS settings, or
  • disabling TCP checksum offloading with ethtool -K eth0 tx off rx off
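
To confirm that the second suggestion actually took effect, the summary lines printed by `ethtool -k` (lowercase `k`, which lists the current offload settings) can be filtered out. This is only a minimal sketch: the function name `csum_offload_state` is made up for illustration, and it assumes the usual `rx-checksumming:` / `tx-checksumming:` lines appear in the output.

```shell
# Hypothetical helper (not part of ethtool): reads `ethtool -k` output on
# stdin and keeps only the two checksum-offload summary lines.
csum_offload_state() {
    grep -E '^(rx|tx)-checksumming:'
}

# Example use, as root, with the interface name from the log above:
#   ethtool -k eth0 | csum_offload_state
```

Settings applied with `ethtool -K` do not survive a reboot. Assuming the interface is configured through /etc/network/interfaces, one common way to reapply them is a line such as `post-up ethtool -K eth0 tx off rx off` in the eth0 stanza.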

If you can extract the e1000e.ko kernel module from your backups, you can use the modinfo command on it to list the driver version.
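
As a rough sketch of that step, the version field can be picked out of the `modinfo` output like this. The function name and the backup path in the usage comment are hypothetical; substitute wherever the old module ends up after extraction.

```shell
# Hypothetical helper: reads `modinfo` output on stdin and prints only the
# value of the "version:" field (not "srcversion:" or "vermagic:").
module_version_field() {
    awk '$1 == "version:" { print $2 }'
}

# Example use (the path is a placeholder for the extracted backup file):
#   modinfo /path/to/backup/e1000e.ko | module_version_field
```

`modinfo -F version <file>` should give the same result directly; the filter is spelled out here only to show which line matters.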

To identify the possibility of a NIC firmware update, a precise identification of the exact NIC model is necessary. From your log output, I can see your NIC is PCI device 00:19.0 and the name of the network interface is eth0. Please run these commands as root:

# lspci -nn -s 00:19.0 -v
# ethtool -i eth0

The first command reveals the PCI ID numbers of the NIC, and the second command has the NIC firmware version number in its output.
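
If only those two values are wanted, the output of the commands above can be filtered as sketched below. Both function names are made up for illustration, and the sketch assumes the usual output shapes: a `[vendor:device]` pair in the first `lspci -nn` line and a `firmware-version:` line in the `ethtool -i` output.

```shell
# Hypothetical helper: prints the first vendor:device pair found in
# `lspci -nn` output, e.g. "8086:1503".
pci_id() {
    grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]' | head -n 1 | tr -d '[]'
}

# Hypothetical helper: prints the firmware version reported by `ethtool -i`.
firmware_version() {
    awk -F': ' '$1 == "firmware-version" { print $2 }'
}

# Example use, as root:
#   lspci -nn -s 00:19.0 | pci_id
#   ethtool -i eth0 | firmware_version
```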

If the NIC happens to be an Intel 82579V (PCI ID 8086:1503), it has a known bug related to power-saving states that has also caused problems in Windows 8 and newer (i.e. in operating systems new enough to use the more advanced power-saving states of modern hardware). The move from Ubuntu 14.04 to Debian 9 might have brought an equivalent update to the power-saving code in Linux, causing the bug to trigger.

Intel has a specific firmware update tool for the 82579V chip that can even be used with NICs integrated on motherboards. Unfortunately, I think the update tool must be run in Windows.

telcoM

Posted 2017-11-22T09:14:21.913

Reputation: 2 016

I cannot see anything related to C1E in the BIOS. The ethtool command did not change anything. The NICs are 82574L and 82574LM. Anyway, the problem is solved, as described in my answer. – WombaT – 2017-11-27T11:19:19.590

Disabling checksum offloading worked well for me. – Ceisc – 2018-05-09T14:34:20.047

Disabling checksum offloading fixed it for me, but are there any consequences for doing so? If it's on by default, I would have expected that to be for a reason. – John Leuenhagen – 2019-12-18T10:38:12.550

Disabling checksum offloading may somewhat increase your CPU workload for a given network traffic level, as the CPU will have to be used to calculate those checksums instead. But the increase should not be very significant unless you have a really low-power processor and/or a lot of network traffic. – telcoM – 2019-12-18T12:21:55.840

1

Problem solved by... swapping the cables and configuration between the NICs. Originally eth0 was the LAN-side device and eth1 was the internet-side device. After the swap, eth0 became the internet-side device and eth1 the LAN-side one. I don't know why or how, but it simply works. Even under heavy load, everything is still OK after a couple of hours. When I switch back to the initial configuration, the driver crashes within 2 minutes.

I'm completely unable to find any reason why this happens, but well... it works now.
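
For anyone comparing two cabling configurations like this, counting the hang reports in the kernel log gives a more objective measure than waiting for the connection to stall. This is a minimal sketch; the function name `count_hangs` is made up, and it matches the message text exactly as it appears in the log in the question.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints how many
# "Detected Hardware Unit Hang" reports it contains (0 if there are none).
count_hangs() {
    grep -c 'Detected Hardware Unit Hang' || true
}

# Example use, as root, after running each configuration for a while:
#   dmesg | count_hangs
```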

WombaT

Posted 2017-11-22T09:14:21.913

Reputation: 215

This is most likely because the bug is usually only triggered during high bandwidth usage (> 500 Mbit/s). Your internet connection is probably not giving you that bandwidth. – Till Schäfer – 2018-03-01T13:21:01.650

This solved it for me too: I have 2 interfaces (the same model, provided by the motherboard), one connected to a device and the other to the local network. The one connected to the network (without internet) is the one that was throwing that error. After swapping them, it worked without issues. I'm sure it's a hardware issue, as other servers (a different brand) with exactly the same configuration haven't shown this behavior. I'm not using Debian, but Ubuntu 14.04, on those servers. In my case it is not about bandwidth, as there is almost no traffic going through the local network interface. – lepe – 2019-07-12T03:52:50.783