DELL PowerEdge - System fatal error during previous boot

My dedicated DELL R710 server (CentOS 6.4) keeps rebooting by itself and then shows the following error.

[Image: screenshot of the error screen, not preserved]

Does this mean the box cannot boot, or that the kernel panicked during Linux boot-up and the server somehow knows?

Could anyone advise on how to diagnose this, or whether it is a hardware issue that should be passed to the datacentre from whom I rent the box? It has been running fine for months, and in the past two days it has rebooted randomly.

Update - The box continues to reboot. One minute it is working; the next log entry shows the kernel booting, with no shutdown or other error message in between.

Jan 10 16:29:12 squirtle kernel: Firewall: *TCP_IN Blocked* IN=em1 OUT= MAC=84:2b:2b:54:84:58:00:04:96:82:74:3e:08:00 SRC=93.174.93.67 DST=13.129.118.21 LEN=40 TOS=0x00 PREC=0x00 TTL=245 ID=54321 PROTO=TCP SPT=35003 DPT=21320 WINDOW=65535 RES=0x00 SYN URGP=0
Jan 10 16:35:50 squirtle kernel: Firewall: *UDP_IN Blocked* IN=em1 OUT= MAC=84:2b:2b:54:84:58:00:04:96:82:74:3e:08:00 SRC=179.107.38.35 DST=13.129.118.21 LEN=443 TOS=0x00 PREC=0x00 TTL=53 ID=0 DF PROTO=UDP SPT=5067 DPT=5060 LEN=423
Jan 10 16:42:05 squirtle kernel: imklog 5.8.10, log source = /proc/kmsg started.
Jan 10 16:42:05 squirtle rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1203" x-info="http://www.rsyslog.com"] start
Jan 10 16:42:05 squirtle kernel: Initializing cgroup subsys cpuset
Jan 10 16:42:05 squirtle kernel: Initializing cgroup subsys cpu
Jan 10 16:42:05 squirtle kernel: Linux version 2.6.32-431.3.1.el6.i686 (mockbuild@c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Fri Jan 3 18:53:30 UTC 2014
Jan 10 16:42:05 squirtle kernel: KERNEL supported cpus:
Jan 10 16:42:05 squirtle kernel:  Intel GenuineIntel
Jan 10 16:42:05 squirtle kernel:  AMD AuthenticAMD
Jan 10 16:42:05 squirtle kernel:  NSC Geode by NSC
Jan 10 16:42:05 squirtle kernel:  Cyrix CyrixInstead
Jan 10 16:42:05 squirtle kernel:  Centaur CentaurHauls
Jan 10 16:42:05 squirtle kernel:  Transmeta GenuineTMx86
Jan 10 16:42:05 squirtle kernel:  Transmeta TransmetaCPU
Jan 10 16:42:05 squirtle kernel:  UMC UMC UMC UMC
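The pattern in the excerpt above (ordinary entries, then a kernel boot banner with no shutdown messages in between) can be searched for across the whole log. Below is a rough sketch, not a definitive tool. It assumes that a clean shutdown on this rsyslog version leaves a line containing "Kernel logging (proc) stopped" or "exiting on signal 15" shortly before the next boot; those marker strings, the function name `find_unclean_boots`, and the `lookback` window are my assumptions and should be checked against a reboot known to be clean.

```python
import re

# Kernel boot banner, as seen in the log excerpt from the question.
BOOT_MARKER = re.compile(r"kernel: Linux version ")

# ASSUMPTION: strings that a clean shutdown leaves in /var/log/messages
# with rsyslog 5.x. Verify them against a known clean reboot first.
CLEAN_MARKERS = ("Kernel logging (proc) stopped", "exiting on signal 15")


def find_unclean_boots(lines, lookback=20):
    """Return (index, line) for each boot banner that has no clean-shutdown
    marker within the `lookback` lines before it.

    Note: the first boot in a freshly rotated log file is also reported,
    because the lines preceding it live in the previous file.
    """
    unclean = []
    for i, line in enumerate(lines):
        if not BOOT_MARKER.search(line):
            continue
        window = lines[max(0, i - lookback):i]
        saw_shutdown = any(m in prev for prev in window for m in CLEAN_MARKERS)
        if not saw_shutdown:
            unclean.append((i, line.rstrip()))
    return unclean
```

Feeding it the lines of /var/log/messages (for example via `open("/var/log/messages").readlines()`) gives a list of suspect boots whose timestamps can then be compared with whatever the hardware event log reports.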

Update 2

I have been running the stress utility on the server for the past 4 days, and the server has not rebooted once. It is maxing out all cores at 100% CPU. I still need to check whether stress is exercising memory or disk writes, but as far as the processors are concerned they seem OK.

g18c

Posted 2014-01-09T11:12:33.933

Reputation: 212

4

If the reboot was caused by a hardware issue then you can use IPMI SEL. It will provide more information on the FATAL errors. Can you check those and add that information?

– Hennes – 2014-01-09T11:56:08.527

Thanks @Hennes, I presume I will need to set up the IPMI IP address on the server? It continues to reboot randomly, and /var/log shows nothing of use. – g18c – 2014-01-10T16:30:01.610

This does sound like hardware. If the kernel had detected an issue and rebooted the server, it would have had a chance to log something in /var/log/messages. If the power or hardware caused the reboot, the kernel never gets a chance to log it. – R Hughes – 2014-01-12T07:04:40.970
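On the IPMI SEL suggestion in the comments: as far as I know, ipmitool can also read the event log in-band from the running OS through the OpenIPMI kernel driver, so a BMC IP address may not be needed just to dump it. A minimal sketch of collecting the entries follows, assuming `ipmitool` is installed, the IPMI modules are loaded, and a Python 3.7+ interpreter is available (CentOS 6 ships Python 2.6 by default). The function names are hypothetical, and the parser assumes the usual pipe-delimited layout of `ipmitool sel elist`.

```python
import subprocess


def parse_sel(text):
    """Parse pipe-delimited `ipmitool sel elist` output into dicts.

    ASSUMPTION: each record looks like
    "id | date | time | sensor | event description [| direction]".
    Lines that do not fit this shape are skipped.
    """
    events = []
    for raw in text.splitlines():
        fields = [f.strip() for f in raw.split("|")]
        if len(fields) < 5:
            continue
        events.append({
            "id": fields[0],
            "date": fields[1],
            "time": fields[2],
            "sensor": fields[3],
            "event": " | ".join(fields[4:]),
        })
    return events


def read_sel():
    """Run `ipmitool sel elist` locally (normally needs root) and parse it."""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    return parse_sel(result.stdout)
```

Entries whose date and time fall just before one of the unexplained reboots are the ones worth posting or handing to the datacentre.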

Answers

2

As the R710 dates from 2009/2010, component failure is always a possibility.

Dell documentation (although for the R410) says:

Alert! System fatal error during previous boot.
An error caused the system to reboot.
Check other system messages for additional information for possible causes. 

As the only other message I see is about the fan speed, I think you should carefully examine and log the temperature and its variation.

See for example How to monitor & log server hardware temperatures & load.
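If you prefer to script the logging yourself, here is a minimal sketch, assuming the kernel exposes temperature sensors through the hwmon sysfs interface (for example via the coretemp driver). Whether this box exposes anything there, and under which of the two directory layouts globbed below, is an assumption to verify; the function names and the log path in the usage note are hypothetical. Values in `temp*_input` files are in millidegrees Celsius.

```python
import glob
import os
import time


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {relative_sensor_path: degrees_C} for readable hwmon sensors."""
    # ASSUMPTION: older kernels may place the files under hwmonN/device/,
    # newer ones directly under hwmonN/, so both layouts are tried.
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    ]
    temps = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as fh:
                    millideg = int(fh.read().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable or not numeric; skip it
            temps[os.path.relpath(path, root)] = millideg / 1000.0
    return temps


def log_temps(logfile, root="/sys/class/hwmon"):
    """Append one timestamped line per sensor to `logfile`."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(logfile, "a") as fh:
        for label, deg in sorted(read_hwmon_temps(root).items()):
            fh.write(f"{stamp} {label} {deg:.1f}\n")
```

Calling `log_temps("/var/log/temps.log")` once a minute from cron would leave a record to compare against the reboot times. Because the file is appended to and closed on every call, the last readings before a sudden reset have a reasonable chance of reaching the disk.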

It also wouldn't hurt to open up the server, clean it up and check all contacts.

You could try the tools described in the article How to troubleshoot hardware problems in Linux and report their results here.

harrymc

Posted 2014-01-09T11:12:33.933

Reputation: 306 093

2

That message is coming from the BIOS, asking you whether to continue. That means the motherboard saw something it did not like at the hardware level. The OS would not have done that, and it would have logged something to the messages file if it had been given the chance to. I would request that a full diagnostic be run on the server. The F1/F2 prompt is usually a BIOS misconfiguration or hardware fault alert.

R Hughes

Posted 2014-01-09T11:12:33.933

Reputation: 181