2

We have a server that has been kernel panicking occasionally for a while now, and we believe it has a hardware problem. How would you go about troubleshooting hardware that you don't have physical access to? Are there any tools I can use from within the OS itself to diagnose individual components and figure out what's causing the panics?

Jeremy Privett
  • Are you able to capture the kernel panic/oops output? e.g. via a serial console, virtual serial port, netconsole, IPMI serial over LAN etc? – James Mar 29 '10 at 16:47
  • Only a small part of it. The IPMI console has no scrollback capability. – Jeremy Privett Mar 29 '10 at 17:16
  • http://www.kernel.org/doc/Documentation/networking/netconsole.txt - great way to capture kernel panics and oops. – Khushil Oct 18 '10 at 18:11
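
To make the netconsole suggestion in the comments above concrete, here is a minimal sketch of the receiving side: a small UDP listener on another machine that appends whatever the panicking server sends to a file, which works around the missing scrollback on the IPMI console. It assumes netconsole has already been configured on the server as described in the linked document and that it targets UDP port 6666 (the default target port in that document). The function names and the `netconsole.log` file name are hypothetical, chosen only for illustration.

```python
import socket


def open_listener(host="0.0.0.0", port=6666, timeout=None):
    """Bind a UDP socket for incoming netconsole datagrams.

    Port 6666 is assumed here as netconsole's default target port;
    change it if the sender was configured differently.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(timeout)  # None means block forever
    return sock


def capture(sock, log_path, max_packets=None):
    """Append each received datagram to log_path.

    Returns the number of datagrams written. Stops after max_packets
    datagrams, or when the socket timeout (if any) expires.
    """
    count = 0
    with open(log_path, "ab") as log:
        while max_packets is None or count < max_packets:
            try:
                data, _sender = sock.recvfrom(65535)
            except socket.timeout:
                break
            log.write(data)
            log.flush()  # keep the file current; a panic gives no warning
            count += 1
    return count


if __name__ == "__main__":
    # Runs until interrupted; "netconsole.log" is a hypothetical file name.
    capture(open_listener(), "netconsole.log")
```

Run it on a machine that the server can reach over the network, and make sure any firewall in between allows the chosen UDP port. Since UDP is unreliable, treat the capture as best effort rather than a guaranteed complete record.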

2 Answers

4

Barring anything revealing in the system's logs or vendor-supplied test tools (front panel display, Dell Diagnostics, etc.), most diagnostic procedures will require physical access to the system.

My suggestion would be to run memtest86 or memtest86+ on the system: bad RAM is a very common cause of panics and random crashes, and these tools will usually catch it.
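
For the "anything revealing in the system's logs" step, a rough first pass can be automated from within the OS. The sketch below scans kernel log lines for a few strings that often accompany hardware trouble. The pattern list is an assumption made for illustration, not an exhaustive or authoritative set, and the exact message wording varies between kernel versions and drivers.

```python
import re
import sys

# Illustrative signatures only (an assumption, not a complete list).
SIGNATURES = {
    "machine check": re.compile(r"machine check|\bmce\b|Hardware Error", re.I),
    "memory (EDAC)": re.compile(r"\bEDAC\b"),
    "disk I/O": re.compile(r"I/O error", re.I),
    "thermal": re.compile(r"thermal|temperature above threshold", re.I),
}


def scan(lines):
    """Group log lines by the hardware-related signature they match."""
    hits = {}
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Example use: dmesg | python scan_hw.py   (script name is hypothetical)
    for label, matched in scan(sys.stdin).items():
        print(f"{label}: {len(matched)} line(s)")
        for entry in matched:
            print("   ", entry)
```

An empty result does not clear the hardware; it only means none of these particular strings showed up in the log that was scanned.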

voretaq7
  • I PXE booted into memtest; it passes a couple of tests, then reboots abruptly every time. I'm not familiar with this behavior. Is it a sign that the RAM is bad? – Jeremy Privett Mar 29 '10 at 20:02
  • well, it definitely shouldn't do that :-) I'd say it points to RAM (90%) or thermal issues (10% - normally a thermal problem is a hard shutdown rather than a reboot) – voretaq7 Mar 29 '10 at 20:43
  • Don't know if it's RAM, but it sounds like the best thing to start with at this point. – Bart Silverstrim Mar 29 '10 at 22:50
3

You're going to have a really hard time diagnosing hardware problems without access to the hardware. If it's not obvious in the logs, or from smoke and crackly noises followed by neat sparkles of light, then a lot of hardware troubleshooting comes down to swapping parts until the issue goes away.

The thing with hardware is that software used to troubleshoot it can only tell you what is the problem; it can't rule out what might be the problem. That is, if memtest86 finds a memory problem, you definitely have a memory problem, but if memtest86 says there isn't a memory problem, you might still have one (I've had systems that tested fine but only stopped crashing after the module was swapped).

It's like asking your brain to diagnose itself. You can't trust the conclusions. :-)
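
The asymmetry described above (a failed test is conclusive, a passed test is not) can be put in rough numbers with Bayes' rule. The sketch below is only an illustration: the prior, the test's sensitivity, and its false-positive rate are made-up values, not measurements of memtest86 or any other tool, and the function name is hypothetical.

```python
def p_bad_given_result(prior_bad, sensitivity, false_positive_rate, test_failed):
    """Probability the part is bad, given the diagnostic's verdict.

    prior_bad:            belief that the part is bad before testing
    sensitivity:          chance the test fails when the part IS bad
    false_positive_rate:  chance the test fails when the part is fine
    Not defined when the observed verdict has zero probability.
    """
    if test_failed:
        numerator = prior_bad * sensitivity
        denominator = numerator + (1 - prior_bad) * false_positive_rate
    else:
        numerator = prior_bad * (1 - sensitivity)
        denominator = numerator + (1 - prior_bad) * (1 - false_positive_rate)
    return numerator / denominator


if __name__ == "__main__":
    # All three numbers below are assumptions chosen for illustration.
    prior, sens, fpr = 0.5, 0.8, 0.0
    print("P(bad | test failed):", p_bad_given_result(prior, sens, fpr, True))
    print("P(bad | test passed):", p_bad_given_result(prior, sens, fpr, False))
```

With these assumed inputs, a failure leaves no doubt, while a pass still leaves roughly a one-in-six chance that the part is bad, which is why a clean result is not grounds to stop suspecting a component.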

Bart Silverstrim
  • If you have burst capacitors on your mainboard, you're certainly not going to see any evidence in logs or from diagnostic utilities. Hardware diagnostics often can be "best guessed" based on experience but I'm right with you on troubleshooting properly. Sometimes, you can luck out in dmesg. Hard drives usually leave evidence. – Warner Mar 29 '10 at 16:42
  • @Warner: Yeah, like neat grinding and clicking noises when you hit the power button :-) On the server itself (with proper hardware) there are actually helpful things like diag messages on LCD panels or blinking status lights on drive cages. – Bart Silverstrim Mar 29 '10 at 16:46
  • Panel displays are a godsend if you have them; unfortunately, el-cheapo servers don't usually come with them. The next best thing, if you can get into the machine or have a management network, is an IPMI event reader (which is really all those panels are anyway :) – voretaq7 Mar 29 '10 at 20:45
  • +1 on "any negative diagnosis is suspect" -- I've replaced entire systems one component at a time even though diagnostics claimed everything was fine (by the time we were done the only original part was the sheet metal chassis :) – voretaq7 Mar 29 '10 at 20:47