
I have a SLES 12 system with an Intel 82599ES NIC (2 × 10-Gigabit SFI/SFP+ ports); the two ports are bonded with LACP. Recently the system's network became unreachable for about 3 minutes.

Looking through the message log, I noticed that when the interface goes down, we get the following:

2019-03-03T09:23:10.491731+08:00 oradb12 kernel: [9519285.192448] ixgbe 0000:02:00.1 eth5: initiating reset due to tx timeout
2019-03-03T09:23:10.491754+08:00 oradb12 kernel: [9519285.192464] ixgbe 0000:02:00.1 eth5: Reset adapter
2019-03-03T09:23:16.995739+08:00 oradb12 kernel: [9519291.696952] ixgbe 0000:02:00.1 eth5: speed changed to 0 for port eth5
2019-03-03T09:23:16.995763+08:00 oradb12 kernel: [9519291.697438] bond1: link status definitely down for interface eth5, disabling it
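For tracking how often this happens (especially across many machines), the reset events can be counted straight from the kernel log. A minimal sketch, with the two sample lines above inlined for illustration; in practice you would feed it `journalctl -k` or /var/log/messages instead:

```shell
#!/bin/sh
# Count "tx timeout" resets per interface in a kernel log excerpt.
# The heredoc holds sample lines for illustration; replace it with
# the real log source, e.g. journalctl -k or /var/log/messages.
cat <<'EOF' > /tmp/ixgbe.log
2019-03-03T09:23:10.491731+08:00 oradb12 kernel: [9519285.192448] ixgbe 0000:02:00.1 eth5: initiating reset due to tx timeout
2019-03-03T09:23:10.491754+08:00 oradb12 kernel: [9519285.192464] ixgbe 0000:02:00.1 eth5: Reset adapter
EOF
# Pull out "<iface>: initiating reset due to tx timeout" and tally
# resets per interface name.
grep -o 'eth[0-9]*: initiating reset due to tx timeout' /tmp/ixgbe.log \
  | cut -d: -f1 | sort | uniq -c
```

Run against the sample above this prints one reset for eth5; a climbing count on a single port (rather than all ports) usually points at that port's hardware, cabling, or switch side.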

The kernel version is as follows:

Linux oradb12 4.4.74-92.35-default #1 SMP Mon Aug 7 18:24:48 UTC 2017 (c0fdc47) x86_64 x86_64 x86_64 GNU/Linux
oradb12:/etc/sysconfig/network # cat /etc/SuSE-release 
SUSE Linux Enterprise Server 12 (x86_64)
VERSION = 12
PATCHLEVEL = 2

The bonding network interface configuration is as follows:

oradb12:/etc/sysconfig/network # cat ifcfg-bond1
BOOTPROTO='static'
STARTMODE='onboot'
BONDING_MASTER='yes'
BONDING_SLAVE0='eth3'
BONDING_SLAVE1='eth5'
IPADDR=10.252.128.2
GATEWAY=10.252.128.1
NETMASK=255.255.255.0
USERCONTROL='no'
BONDING_MODULE_OPTS='mode=4 miimon=100 use_carrier=1' 
oradb12:/etc/sysconfig/network # cat ifcfg-eth3
NAME='bond1-slave-eth3'
TYPE='Ethernet'
BOOTPROTO='none'
STARTMODE='onboot'
MASTER='bond1'
SLAVE='yes'
USERCONTROL='no'
oradb12:/etc/sysconfig/network # cat ifcfg-eth5
NAME='bond1-slave-eth5'
TYPE='Ethernet'
BOOTPROTO='none'
STARTMODE='onboot'
MASTER='bond1'
SLAVE='yes'
USERCONTROL='no'

The bonding network interface status is as follows:

oradb12:/etc/sysconfig/network # cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 48:fd:8e:c9:21:64
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 2
    Actor Key: 13
    Partner Key: 10273
    Partner Mac Address: 74:4a:a4:08:ea:14

Slave Interface: eth3
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 48:fd:8e:c9:21:64
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 48:fd:8e:c9:21:64
    port key: 13
    port priority: 255
    port number: 1
    port state: 61
details partner lacp pdu:
    system priority: 32768
    system mac address: 74:4a:a4:08:ea:14
    oper key: 10273
    port priority: 32768
    port number: 33
    port state: 61

Slave Interface: eth5
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 24
Permanent HW addr: 48:fd:8e:c9:21:65
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 48:fd:8e:c9:21:64
    port key: 13
    port priority: 255
    port number: 2
    port state: 61
details partner lacp pdu:
    system priority: 32768
    system mac address: 74:4a:a4:08:ea:14
    oper key: 10273
    port priority: 32768
    port number: 87
    port state: 61
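The Link Failure Count lines above (1 on eth3 vs 24 on eth5) already single out eth5 as the flapping slave. A small sketch that pulls those counters out of the bonding status, run here against an inlined excerpt of the dump above rather than the live /proc file:

```shell
#!/bin/sh
# Extract per-slave "Link Failure Count" from a bonding status dump.
# An excerpt is inlined for illustration; on a live system read
# /proc/net/bonding/bond1 instead.
cat <<'EOF' > /tmp/bond1.status
Slave Interface: eth3
Link Failure Count: 1
Slave Interface: eth5
Link Failure Count: 24
EOF
# Remember the current slave name, then print it next to its counter.
awk '/^Slave Interface:/ { iface = $3 }
     /^Link Failure Count:/ { print iface, $4 }' /tmp/bond1.status
```

Polling this periodically (e.g. from cron) gives a cheap per-slave flap history without having to grep the kernel log.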

The network interface driver information is as follows:

oradb12:/etc/sysconfig/network # ethtool -i eth3
driver: ixgbe
version: 4.2.1-k
firmware-version: 0x800003df
expansion-rom-version: 
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
oradb12:/etc/sysconfig/network # ethtool -i eth5
driver: ixgbe
version: 4.2.1-k
firmware-version: 0x800003df
expansion-rom-version: 
bus-info: 0000:02:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

When the network interface goes down, restarting the networking service on the server (by running service network restart) seems to remedy the issue.

I was wondering whether anyone has experienced similar issues before, and/or has any suggestions for debugging the cause of something like this?
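One way to narrow the cause down is to snapshot the NIC statistics before and after an incident and diff whichever counters moved. A sketch under stated assumptions: the two counter names shown are examples taken from the ixgbe `ethtool -S` statistics set, the snapshot files are inlined for illustration, and on a real system you would capture them with `ethtool -S eth5 | sort > /tmp/ethtool.before` (and `.after`):

```shell
#!/bin/sh
# Diff two ethtool -S snapshots to see which counters incremented
# around a tx timeout. Snapshots inlined for illustration; on a live
# system capture them with: ethtool -S eth5 | sort > /tmp/ethtool.before
cat <<'EOF' > /tmp/ethtool.before
tx_restart_queue: 12
tx_timeout_count: 0
EOF
cat <<'EOF' > /tmp/ethtool.after
tx_restart_queue: 40
tx_timeout_count: 1
EOF
# Join on the counter name, then print only counters whose value changed.
join /tmp/ethtool.before /tmp/ethtool.after \
  | awk '$2 != $3 { print $1, $2 "->" $3 }'
```

A rising tx_timeout_count with errors confined to one port suggests a per-port problem (optics, cable, switch port); increments spread across all ports point more toward the driver or the switch itself.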

jim corleone
  • I don't think it is related to the bonding, but rather to the driver; similar problems were reported, see [https://sourceforge.net/p/e1000/bugs/583/](https://sourceforge.net/p/e1000/bugs/583/) – Soulimane Mammar Mar 05 '19 at 07:14
  • Thank you for your reply. I just read the link you provided. That question is very similar to mine, but the post did not mention the cause of the breakdown. – jim corleone Mar 05 '19 at 07:30
  • My guess it is probably a bug in the driver (some kind of memory leak) – Soulimane Mammar Mar 05 '19 at 08:45
  • Thank you. In fact, we have more than one hundred Oracle database servers, and this issue happens every once in a while; I am very troubled now. – jim corleone Mar 06 '19 at 02:02

1 Answer


I just hit the same issue with the same error messages, but in my case the problem wasn't on the server side at all. The kernel logged these errors not only for the e1000e NIC but for all 4 of them, and the messages could be reproduced by disconnecting and reconnecting the cable, so different drivers showed the same behavior. After software debugging on the server and then the cabling (swapping in new cables), what remained was the top-of-rack switch.

A switch reboot solved it.

Sto