Why would a server lockup knock other servers off the network?

Question

We have a couple dozen Proxmox servers (Proxmox runs on Debian), and about once a month one of them has a kernel panic and locks up. The worst part about these lockups is that when the crashed server is on a different switch from the cluster master, all other Proxmox servers on that switch stop responding until we can find the server that has actually crashed and reboot it.

When we reported this issue on the Proxmox forum, we were advised to upgrade to Proxmox 3.1 and we've been in the process of doing that for the past several months. Unfortunately, one of the servers that we migrated to Proxmox 3.1 locked up with a kernel panic on Friday, and again all Proxmox servers that were on that same switch were unreachable over the network until we could locate the crashed server and reboot it.

Well, almost all Proxmox servers on the switch... I found it interesting that the Proxmox servers on that same switch that were still on Proxmox version 1.9 were unaffected.

Here is a screen shot of the console of the crashed server:

[Image: console screenshot of the crashed server]

When the server locked up, the rest of the servers on the same switch that were also running Proxmox 3.1 became unreachable and were spewing the following:

e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
...etc...

uname -a output of locked server:

Linux ------ 2.6.32-23-pve #1 SMP Tue Aug 6 07:04:06 CEST 2013 x86_64 GNU/Linux

pveversion -v output (abbreviated):

proxmox-ve-2.6.32: 3.1-109 (running kernel: 2.6.32-23-pve)
pve-manager: 3.1-3 (running version: 3.1-3/dc0e9b0e)
pve-kernel-2.6.32-23-pve: 2.6.32-109

Two questions:

1. Any clues as to what could be causing the kernel panic (see image above)?
2. Why would other servers on the same switch and the same version of Proxmox be knocked off the network until the locked server is rebooted? (Note: Other servers on the same switch that were running the older 1.9 version of Proxmox were unaffected. Also, Proxmox servers in the same 3.1 cluster that were not on that switch were unaffected.)

Thanks in advance for any advice.

Can you give the full crashdump? The picture above cut off the interesting parts. Also, did you post the crashdump on [lkml](http://vger.kernel.org/vger-lists.html#linux-kernel)? However, looking at it again, this is a pretty old kernel, are there plans to upgrade Debian to a current stable release? — ckujau, Jan 31 '14 at 05:16
Unfortunately, we don't have a crash dump. I've added it to my list to configure a serial console and/or kdump. As for the kernel being old, Proxmox uses an OpenVZ kernel, which is a branch off the mainstream kernel. So, once I can get crash dumps working, I'll contact the OpenVZ developers for help. Thanks for your comment... it helped point me in the right direction. — Curtis, Jan 31 '14 at 22:46
The issue has happened with 3 different switches (one dlink and 2 cisco). I don't have the model numbers on the two previous switches, but the latest is a Cisco SG102-24. Since it only affects servers on the switch that are running the same kernel, and because I'm on my third switch it seems unlikely that the switch is to blame (although that was my original thought too). — Curtis, Feb 19 '14 at 18:24
I received an email notification that someone posted the following comment here... "I have a similar issue except that I can make mine crash with a couple containers doing hard core..." Unfortunately, it was cut off there and when I came here, the author had removed their comment so I don't know what the rest of it was. But, I will add that I have noted that the problem does seem to happen most often when there is heavy network traffic (like when backups are running). Perhaps that comment was "hardcore network transfers"? — Curtis, Feb 19 '14 at 18:28

score 2 · Answer 1 · edited Apr 07 '14 at 17:52

I'm almost certain your problem is not caused by just one single factor but rather by a combination of factors. What those individual factors are is not certain, but most likely one factor is either the network interface or driver and another factor is found on the switch itself. Hence it is quite likely the problem can only be reproduced with this particular brand of switch combined with this particular brand of network interface.

You seem to assume that the trigger for the problem is something happening on one individual server, which then has a kernel panic whose effects somehow manage to propagate across the switch. That sounds plausible, but I'd say it is about as likely that the trigger is somewhere else.

It could be that something is happening on the switch or network interface, which simultaneously causes the kernel panic and link issues on the switch. In other words, even if the kernel had not had a kernel panic, the trigger may very well have brought down connectivity on the switch.

One has to ask, what could possibly happen on the individual server, which could have this effect on the other servers. It shouldn't be possible, so the explanation has to involve a flaw somewhere in the system.

If it was just the link between the crashed server and the switch that went down or became unstable, that should have no effect on the link state to the other servers. If it does, that would count as a flaw in the switch. And traffic-wise, the other servers should see slightly less traffic once the crashed server has lost connectivity, which cannot explain the problem they are seeing.

This leads me to believe a design flaw on the switch is likely.

However a link problem is not the first explanation one would look for when trying to explain how an issue on one server could cause problems to other servers on the switch. A broadcast storm would be a more obvious explanation. But could there be a link between a server having a kernel panic and a broadcast storm?

Multicast frames and frames destined for unknown MAC addresses are treated more or less the same as broadcasts, so a storm of such packets would count as well. Could the panicked server be trying to send a crash dump across the network to a MAC address the switch does not recognize?
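To make that flooding rule concrete, here is a minimal Python sketch of how a basic learning switch decides whether a frame goes out of every port. The function names and the `learned` table are hypothetical, and it assumes the switch does no multicast snooping; a real switch may behave differently.

```python
def normalize_mac(mac):
    """Return the MAC address as lowercase, colon-separated hex."""
    parts = mac.replace("-", ":").lower().split(":")
    if len(parts) != 6:
        raise ValueError("not a 48-bit MAC address: %r" % mac)
    return ":".join("%02x" % int(part, 16) for part in parts)


def classify_destination(mac):
    """Classify a destination MAC as 'broadcast', 'multicast', or 'unicast'."""
    normalized = normalize_mac(mac)
    if normalized == "ff:ff:ff:ff:ff:ff":
        return "broadcast"
    first_octet = int(normalized.split(":")[0], 16)
    # The least significant bit of the first octet is the group (multicast) bit.
    if first_octet & 0x01:
        return "multicast"
    return "unicast"


def is_flooded(dst_mac, learned):
    """Return True if the frame would be sent out of every port.

    `learned` is a hypothetical set of normalized MAC addresses the switch
    has already seen as source addresses. Assumption: no IGMP/MLD snooping,
    so all multicast is flooded just like broadcast.
    """
    if classify_destination(dst_mac) != "unicast":
        return True
    # Unicast to an address the switch has not learned is flooded as well.
    return normalize_mac(dst_mac) not in learned
```

For example, `is_flooded("00:1b:21:aa:bb:cd", {"00:1b:21:aa:bb:cc"})` returns `True`: the destination is unicast, but the switch has never seen it, so every port receives the frame.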

If that's the trigger, then something is also going wrong on the other servers, because a packet storm should not cause this kind of error on the network interface. "Reset adapter unexpectedly" does not sound like a packet storm (which should just cause a drop in performance but no errors as such), and it does not sound like a link problem (which should have resulted in messages about links going down, but not the error you are seeing).

So it is likely there is some flaw in the network interface hardware or driver, which is triggered by the switch.

A few suggestions that can give additional clues:

1. Can you hook up some other equipment to the switch and look at what traffic you see on the switch when the problem shows up? (I predict it either goes quiet or you see a flood.)
2. Would it be possible to replace the network interface on one of the servers with a different brand using a different driver to see how the result turns out differently?
3. Is it possible to replace one of the switches with a different brand? I expect replacing the switch will ensure the problem no longer affects multiple servers. What's more interesting to know is if it also stops the kernel panics from happening.

Thank you for your thoughtful reply. In terms of your 3 suggestions: 1) What type of equipment/software would do that? 2) Wish I could, but there are a lot of servers involved and I don't know where the problem is going to happen next. 3) I've tried 3 different switches already (3 different models, 2 different brands). Also interesting is that only servers on the same version of Proxmox are affected. Proxmox does have a cluster syncing mechanism, so I suspect it has something to do with that. Fortunately, it's been a couple months since the issue has occurred now. — Curtis, Apr 07 '14 at 17:08
For looking at the traffic on the switch, I was thinking of hooking up an ordinary PC with tcpdump and/or wireshark. Obviously you'd want to avoid having the affected software installed on that PC. But it sounds like there must actually be a bug in the code that Proxmox installs into the kernel. If it happens so rarely that you only see it about once per month and only on one switch at a time, then it may take a long time to track it down. I'll think a bit about it and comment if more ideas come up. — kasperd, Apr 07 '14 at 18:39

score 1 · Answer 2 · answered Mar 21 '14 at 20:58

It sounds to me like a bug in the ethernet driver or the hardware/firmware, this being a red flag:

e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly

I have seen these before and it can knock the server offline. I don't remember exactly whether it was on intel ethernet cards but I believe so. It could even be related to a bug in the ethernet cards themselves. I remember reading something about particular intel ethernet cards having such issues. But I lost the article's link.
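If you want to see how many of these resets each machine logged, and on which interface, something like the following Python sketch could tally them from saved kernel log text. The regular expression is based only on the three lines quoted above and the function name is made up for illustration, so treat it as a starting point rather than a tested tool.

```python
import re
from collections import Counter

# Pattern derived from the quoted messages: driver name, PCI address,
# interface name, then the fixed text. Other kernels may format this differently.
RESET_RE = re.compile(
    r"e1000e (?P<pci>\S+): (?P<iface>\w+): Reset adapter unexpectedly"
)


def count_resets(log_text):
    """Count 'Reset adapter unexpectedly' lines per (PCI address, interface)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            counts[(match.group("pci"), match.group("iface"))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly\n"
        "e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly\n"
        "e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly\n"
    )
    for (pci, iface), n in count_resets(sample).items():
        print(pci, iface, n)
```

Comparing the counts across machines, and against the time the locked server was rebooted, might show whether the resets really start and stop with that one server.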

I would imagine that the trigger for this depends partially on the driver (version) being used; the fact that an older version of the software works OK seems to confirm that. You say the vendor uses their own custom kernel, so try to update the Ethernet driver module that's being used for your particular Ethernet hardware, either with one from your vendor or with one from the official kernel source tree.

Also look into bonding your Ethernet hardware. Normally a server has two Ethernet ports, onboard and/or on add-on card(s). That way, if one Ethernet card is having this problem, the other will pick up. I use the word "card", but it applies to any Ethernet hardware, of course.

Replacing the Ethernet hardware can also fix it. Either replace the card or add a newer (Intel) Ethernet card and use that instead. If the issue is in the hardware/firmware, chances are that a newer card has a fix (or perhaps an older one is unaffected).

The machines all do have dual Ethernet ports. However, this error happens across multiple servers on the same switch, all at the same time, at the very moment that one of the machines locks up. The moment the one locked server is power cycled, all affected servers instantly become accessible again. This seems to indicate that the locked server isn't completely locked but is somehow flooding the rest of the machines on the same switch. It would be interesting to see if a driver update could help, but based on the evidence I don't think activating the other Ethernet card would help. — Curtis, Mar 24 '14 at 18:20
Old thread, but even with Intel e1000e NIC Model 82574L and one of the newer ProxMox versions of 5.0-23/af4267bf the network issues still remain. I can bring up my windows laptop (wake from sleep or just login) connected to same switch and the ProxMox server reboots basically every time. I've also seen it just reboot sporadically when not connected to the switch. And it will reboot when I first connect it to the switch. Current driver is 3.3.5.3 and there's a 3.3.5.10, 3.3.6 and 3.4.0.2 so I'll likely try building and using those. My .02c. — JGlass, Apr 15 '18 at 21:09
