6

I have a Linux server I've just set up (Debian squeeze, kernel 2.6.32-5-amd64), and over the past week it has rebooted three times, twice in one day. There was no power outage that I am aware of (and it's running on a UPS), and there are no errors in syslog apart from a few expected ones at boot about clearing entries in the ext4 journal after the unclean shutdown.

What steps can I take to determine the cause of the reboots? Is there a way to get it to hang instead of rebooting, so I can copy stack traces or something off the screen? Any way to increase debug messages, or get it to dump things to disk, or something?

davr

3 Answers

2

This may be a hardware problem; the most common causes are failing RAM and overheating. You could install mbmon to monitor motherboard and CPU temperature, and run memtest86+ to check your RAM and CPU cache.

wazoox
  • mbmon gives "No Hardware Monitor found" but after upgrading my kernel, lm-sensors gives CPU core temperature now, which seems reasonable (average around 35 C, goes up to 55C if I run a benchmark on all cores). Will try memtest when I figure out how to run it on a remote server. – davr Jun 02 '11 at 14:36
  • You can't run it directly on a remote server, but there's a Linux equivalent you can run without rebooting (though it's not as thorough): memtester. – wazoox Jun 04 '11 at 15:11
1

There is a chance it is a kernel panic, in which case a kernel 'oops' message is sent to the console before the reboot. The kernel can be configured either to reboot on panic or to stay halted. Check:

cat /proc/sys/kernel/panic

If it is non-zero, try setting it to 0 (you can write directly to the file, use /etc/sysctl.conf, which is usually parsed at boot, or use the sysctl utility); this should stop the machine from rebooting on a panic. If it is already 0, the reboots are not caused by kernel panics.
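The check and the change described above can be sketched roughly as follows. This is a minimal sketch, not a definitive procedure: the function names and the PANIC_FILE override are made up for illustration (the override exists only so the helpers can be tried on a scratch file), and writing to the real file requires root.

```shell
#!/bin/sh
# Sketch: inspect and disable reboot-on-panic.
# PANIC_FILE is a hypothetical override; by default the real sysctl file is used.
PANIC_FILE="${PANIC_FILE:-/proc/sys/kernel/panic}"

# Print the current setting with a short explanation.
describe_panic() {
    value=$(cat "$PANIC_FILE") || return 1
    if [ "$value" -eq 0 ]; then
        echo "kernel.panic=0: the kernel stays halted after a panic"
    else
        echo "kernel.panic=$value: the kernel reboots after a panic"
    fi
}

# Write 0 to the setting (needs root for the real file; lost at next boot).
disable_panic_reboot() {
    echo 0 > "$PANIC_FILE"
}
```

On the real system, describe_panic can be run as any user, while disable_panic_reboot has to be run as root. Running sysctl -w kernel.panic=0 has the same effect, and adding the line kernel.panic = 0 to /etc/sysctl.conf keeps the setting across boots.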

Jacek Konieczny
0

Check the output of last and look for reboot entries. Try to correlate them with who was logged in, if anyone, and who has superuser privileges. If no user is responsible, you may have power or heat issues, or some kind of kernel panic. Try to rule those out one by one.
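One way to do that correlation is to let a small filter pick out the boots that were not preceded by a clean shutdown. The sketch below assumes the output of last -x, which lists newest entries first and includes shutdown records alongside the reboot ones; the function name unclean_boots is made up for illustration, and the result is only as complete as the wtmp file that last reads.

```shell
#!/bin/sh
# Sketch: read `last -x` output (newest first) on stdin and print every
# "reboot" entry whose next older boot/shutdown record is another "reboot",
# i.e. the previous session ended without a recorded shutdown.
unclean_boots() {
    awk '
        $1 == "reboot" || $1 == "shutdown" {
            # A pending boot followed by an older boot means no clean
            # shutdown was logged in between.
            if (pending != "" && $1 == "reboot") print pending
            pending = ($1 == "reboot") ? $0 : ""
        }
    '
}
```

Typical use would be last -x | unclean_boots. The oldest boot in the log is never reported, because there is nothing earlier to compare it with.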

dmourati
  • Nobody but me is logging in (I even double-checked the IPs in last). It's on a UPS, so I don't think it's a power issue, unless the server's power supply is failing. I don't think it's a heat issue either: the server is not heavily loaded and the current temperature is quite low (CPU at 36C). I'll start logging the temperature though. – davr Jun 02 '11 at 05:42
  • Is there a way to get it to dump the kernel panic to disk or to screen? I'm worried that it's just rebooting and not saving the error messages anywhere. Is there a 'debug mode' or something I can enable? – davr Jun 02 '11 at 05:43
  • A kernel panic is very unlikely - hardware trouble such as an overheating processor or memory errors is much more likely. – reinierpost Jun 02 '11 at 11:26
  • How can I verify that? I have the same problem, but I need to convince my provider that the cause is hardware. Frequent reboots could also be a software issue. – user4234 Jul 25 '13 at 04:22
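For the temperature logging mentioned in the comments above, a small helper along these lines may be enough. It is only a sketch under assumptions: the function name log_temp is made up, and it expects a sysfs-style sensor file containing the temperature in millidegrees Celsius. Which file that is depends on the kernel and the sensor driver, so the path has to be found on the machine in question.

```shell
#!/bin/sh
# Sketch: append one timestamped temperature reading to a log file.
#   $1 = sensor file holding millidegrees Celsius (path is system-specific)
#   $2 = log file to append to
log_temp() {
    sensor_file=$1
    log_file=$2
    raw=$(cat "$sensor_file") || return 1
    # Integer division converts millidegrees to whole degrees.
    echo "$(date '+%Y-%m-%d %H:%M:%S') $((raw / 1000))C" >> "$log_file"
}
```

Called from cron once a minute, this leaves a record of the last readings before each reboot. Running sync after each call makes it more likely that the final line reaches the disk if the machine goes down abruptly.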