
My server crashes about once a week and doesn't leave any clue as to what's causing it. I have checked /var/log/messages, and it simply stops recording at some point and starts again with the boot (POST) information after I perform a hard reboot.

Is there something I can check or software I can install that can determine the cause?

I'm running CentOS 7.

Here is the only error/problem in my /var/log/dmesg: https://paste.netcoding.net/cosisiloji.log

[    3.606936] md: Waiting for all devices to be available before autodetect
[    3.606984] md: If you don't use raid, use raid=noautodetect
[    3.607085] md: Autodetecting RAID arrays.
[    3.608309] md: Scanned 6 and added 6 devices.
[    3.608362] md: autorun ...
[    3.608412] md: considering sdc2 ...
[    3.608464] md:  adding sdc2 ...
[    3.608516] md: sdc1 has different UUID to sdc2
[    3.608570] md:  adding sdb2 ...
[    3.608620] md: sdb1 has different UUID to sdc2
[    3.608674] md:  adding sda2 ...
[    3.608726] md: sda1 has different UUID to sdc2
[    3.608944] md: created md2
[    3.608997] md: bind<sda2>
[    3.609058] md: bind<sdb2>
[    3.609116] md: bind<sdc2>
[    3.609175] md: running: <sdc2><sdb2><sda2>
[    3.609548] md/raid1:md2: active with 3 out of 3 mirrors
[    3.609623] md2: detected capacity change from 0 to 98520989696
[    3.609685] md: considering sdc1 ...
[    3.609737] md:  adding sdc1 ...
[    3.609789] md:  adding sdb1 ...
[    3.609841] md:  adding sda1 ...
[    3.610005] md: created md1
[    3.610055] md: bind<sda1>
[    3.610117] md: bind<sdb1>
[    3.610175] md: bind<sdc1>
[    3.610233] md: running: <sdc1><sdb1><sda1>
[    3.610714] md/raid1:md1: not clean -- starting background reconstruction
[    3.610773] md/raid1:md1: active with 3 out of 3 mirrors
[    3.610854] md1: detected capacity change from 0 to 20970405888
[    3.610917] md: ... autorun DONE.
[    3.610999] md: resync of RAID array md1
[    3.611054] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[    3.611119] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[    3.611180] md: using 128k window, over a total of 20478912k.
[    3.611244]  md1: unknown partition table
[    3.624786] EXT3-fs (md1): error: couldn't mount because of unsupported optional features (240)
[    3.627095] EXT2-fs (md1): error: couldn't mount because of unsupported optional features (244)
[    3.630284] EXT4-fs (md1): INFO: recovery required on readonly filesystem
[    3.630341] EXT4-fs (md1): write access will be enabled during recovery
[    3.819411] EXT4-fs (md1): orphan cleanup on readonly fs
[    3.836922] EXT4-fs (md1): 24 orphan inodes deleted
[    3.836975] EXT4-fs (md1): recovery complete
[    3.840557] EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: (null)
Nahydrin

3 Answers


If you have crashkernel/kdump installed and enabled, you should be able to examine the crashed kernel with relative ease using the crash utility. For example, presuming that your crash-kernel dumps are saved under /var/crash: `crash /var/crash/2009-07-17-10\:36/vmcore /usr/lib/debug/lib/modules/$(uname -r)/vmlinux`.
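
For reference, a minimal sketch of setting kdump up on CentOS 7 might look like the following (the crashkernel= size, the BIOS-style GRUB path, and the availability of a debuginfo repository via yum-utils are assumptions; adjust for your system):

```
# Install the capture tool (kexec-tools), the analyzer (crash) and the
# debug symbols for the running kernel (needs yum-utils + a debuginfo repo).
yum install -y kexec-tools crash yum-utils
debuginfo-install -y kernel

# Reserve memory for the capture kernel: add crashkernel=auto to
# GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config
# (path shown is for BIOS boot; UEFI uses /boot/efi/EFI/centos/grub.cfg).
grub2-mkconfig -o /boot/grub2/grub.cfg

# Enable the kdump service and reboot so the memory reservation takes effect.
systemctl enable kdump
reboot

# After the next crash, open the saved dump with the matching debug vmlinux.
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
```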

Have a look here and here for additional details.

shodanshok
  • I have repaired the `/dev/md1 not found` error when running `grub2-probe`, installed and configured crashkernel/kdump, and will report back if/when it crashes again. – Nahydrin Apr 18 '17 at 02:15

You could check the dmesg file at /var/log/dmesg, which logs the kernel messages. The messages log only records service and application messages; if you hit a kernel error, the services and applications simply stop running, but the kernel error is still recorded in dmesg.
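
If the box locks up hard, the dmesg of the crashed boot is gone after the reset, so it can also help to make journald keep the kernel log on disk. A rough sketch, assuming stock CentOS 7 systemd/journald:

```
# Switch journald to persistent storage so kernel messages survive reboots.
mkdir -p /var/log/journal
systemctl restart systemd-journald

# After the next crash and hard reboot, read the kernel messages from the
# previous boot (-k = kernel only, -b -1 = one boot back).
journalctl -k -b -1 --no-pager
```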

TooCloudy
  • I checked dmesg and dmesg.old; both only contain the startup information (about 4.8 seconds). The only "problem" I can see is that the startup disk or RAID drives appear to have something wrong, but the system fixes it and works regardless. Check the main post for the link. – Nahydrin Apr 17 '17 at 20:48
  • BIOS memory test
  • BIOS hard drive test
  • Check the SMART drive log: `smartctl -a /dev/sda`
  • Run SMART drive self-tests (see the sketch after this list)
  • Leave `dmesg -wH` running in a window
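
For the SMART items above, a minimal sketch (the device names are assumptions; repeat for each member of the RAID1 set):

```
# Start a long (extended) offline self-test on each drive; it runs in the
# background and can take an hour or more depending on drive size.
smartctl -t long /dev/sda

# Once it has finished, review the self-test log and the full attribute dump.
smartctl -l selftest /dev/sda
smartctl -a /dev/sda
```
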
Jim U
  • I've run SMART drive tests on all 3 drives; they are uncorrupted. I have `dmesg -wH` running in a window (I assume until it crashes again, and I can still read the output after the crash over SSH). I do not have physical access to the machine; should I ask my host to run the BIOS memory and hard drive tests? – Nahydrin Apr 17 '17 at 20:59