
I have a brand-new CentOS 6.5 install with two 1 TB Western Digital Black drives in RAID 1 with mdadm (mounted at /mnt/data), configured via the installer. Unfortunately, every now and again the entire system kernel panics with a trace similar to the one below:

(screenshot: kernel panic trace)

Any tips on diagnosing or fixing this? Much appreciated!
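For capturing the full trace rather than a photo of the console, stock CentOS 6 ships kdump, which saves a vmcore from the crashed kernel. A minimal sketch, assuming the default /boot/grub/grub.conf layout (the 128M reservation is a guess; size it to your RAM):

    # Install the capture tooling (kexec-tools provides kdump on CentOS 6)
    yum install kexec-tools crash

    # Reserve memory for the capture kernel: append crashkernel=128M to the
    # kernel line in /boot/grub/grub.conf, then reboot so it takes effect.

    # Arrange for kdump to start at boot, and start it now
    chkconfig kdump on
    service kdump start

    # After the next panic the machine reboots itself and the core lands
    # under /var/crash/<timestamp>/vmcore, readable with the crash utility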

EDIT: It appears this happened around the same time as a RAID data check occurred (log below).

EDIT 2: The last two crashes have happened just past 1am on Sunday, the same time the data check runs:

    Mar 23 01:00:02 beta kernel: md: data-check of RAID array md0
    Mar 23 01:00:02 beta kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
    Mar 23 01:00:02 beta kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
    Mar 23 01:00:02 beta kernel: md: using 128k window, over a total of 976629568k.
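For reference: on CentOS 6 that weekly check is driven by the mdadm package's cron job, /etc/cron.d/raid-check, which by default runs at 1:00am on Sunday and writes "check" into the array's sync_action, matching the timestamps above. The same sysfs interface can be used to reproduce (or abort) a check on demand; a rough sketch:

    # Start a data-check by hand (the same thing the weekly cron job does)
    echo check > /sys/block/md0/md/sync_action

    # Watch progress in /proc/mdstat
    watch -n 5 cat /proc/mdstat

    # Number of mismatches found so far; persistently non-zero counts
    # after a clean check are worth investigating
    cat /sys/block/md0/md/mismatch_cnt

    # Abort a running check
    echo idle > /sys/block/md0/md/sync_action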

/proc/mdstat

    Personalities : [raid1] 
    md0 : active raid1 sdc1[1] sdb1[0]
          976629568 blocks super 1.1 [2/2] [UU]
          bitmap: 0/8 pages [0KB], 65536KB chunk

    unused devices: <none>

mdadm -D

/dev/md0:
        Version : 1.1
  Creation Time : Fri Mar  7 16:07:17 2014
     Raid Level : raid1
     Array Size : 976629568 (931.39 GiB 1000.07 GB)
  Used Dev Size : 976629568 (931.39 GiB 1000.07 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Mar 23 03:36:59 2014
          State : active 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : beta.fmt2.spigot-servers.net:0  (local to host beta.fmt2.spigot-servers.net)
           UUID : 89a86538:f6162473:d5e0524c:b80566d6
         Events : 1728

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1

EDIT 3: A different crash occurred during a forced resync/check; memtest also passed 4 full passes just fine: http://files.md-5.net/s/X3Hi.png

EDIT 4: Even dd is causing crashes: http://files.md-5.net/s/hba2.png

EDIT 5: The SSD survives the dd torture test, so I guess that means I'm going to try the drives without RAID.
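For the record, the dd "torture test" was nothing exotic; something along these lines, with device names as on this box (iflag=direct bypasses the page cache so the drives themselves do the work):

    # Sequential read of each RAID member end to end, discarding the data
    dd if=/dev/sdb of=/dev/null bs=1M iflag=direct
    dd if=/dev/sdc of=/dev/null bs=1M iflag=direct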

  • I'm just trying to get my head around how the SSD relates, could you cut-and-paste the output of `cat /proc/mdstat` into your question? – MadHatter Mar 23 '14 at 09:42
  • The SSD doesn't relate. I have amended the question with the mdstat output as well as some info regarding the potential cause – md_5 Mar 23 '14 at 10:44
  • Yup, the last two crashes have been at just past 1am Sunday, the same time the data check is running. – md_5 Mar 23 '14 at 10:48
  • Try to get the _first_ crash after a system boot. These are not the first crash and may not be indicative of the actual problem. You ought to find it in the system log `/var/log/messages`. Try searching the log for `Not tainted`. – Michael Hampton Mar 23 '14 at 14:17
  • The only other stuff is this: `/var/log/messages-20140323:Mar 23 03:37:07 beta kernel: WARNING: at drivers/pci/dmar.c:588 warn_invalid_dmar+0x7a/0x90() (Tainted: G I--------------- )` and `/var/log/messages-20140323:Mar 23 03:37:07 beta kernel: Pid: 1, comm: swapper Tainted: G I--------------- 2.6.32-431.5.1.el6.x86_64 #1`, which happens right at the start of system boot, indicating the BIOS reported a DMAR table at address 0. This is fine and a known Linux quirk on all Gigabyte mobos; it hasn't affected stability on any of my other systems. – md_5 Mar 23 '14 at 20:12
  • Can you take the machine offline to run memtest? Alternatively, put some load on the machine with `fio` or similar benchmarking tools. I would suspect some flaky hardware here that only manifests itself during the high load of the RAID verify. – devicenull Mar 24 '14 at 00:16
  • Already run 4 full passes of memtest in the last 12 hours. Gonna try forcing a data-check and then fio. – md_5 Mar 24 '14 at 05:43
  • Yup, failed 58% through resync: http://files.md-5.net/s/X3Hi.png – md_5 Mar 24 '14 at 07:15
  • Yup, even dd is causing completely disparate crashes, I am at a loss: http://files.md-5.net/s/hba2.png – md_5 Mar 24 '14 at 07:48
  • Could you try to reproduce the test with the two drives on another server, to see if it crashes there as well? Additionally: do you monitor temperatures of CPU/disk and/or CPU frequency? I remember having had some issues with CPU frequency changes and short (~2-3 sec) freezes a few years ago; back then I disabled the power saving options and my problem was gone. Try to monitor the CPU load while doing the RAID recheck/verify. – Dennis Nolte Mar 27 '14 at 09:04
  • dstat reported it as being fine. I moved the disks to another server and they worked better, although I did still get one crash. I am able to crash it even without mdadm (although not on the new server), so I think it's just bad hardware. The part that doesn't make sense is that both drives do it, but not a third drive of the same model :( – md_5 Mar 27 '14 at 09:19
  • You have been rebooting after each incident, right? As I said before, none of these crash reports are useful. – Michael Hampton Mar 27 '14 at 13:24
  • do you have a swap partition / swap file on the raid array? – Olivier S Mar 27 '14 at 19:29
  • No I do not. Yes, I have to do a full power reset after each incident. – md_5 Mar 28 '14 at 05:19
  • What graphics card do you have in this box? Is it by any chance using a proprietary driver? – MadHatter Mar 31 '14 at 08:44
  • It's Intel integrated into the CPU. I don't think it is; it's using whatever the kernel provides, i915 I think. – md_5 Apr 01 '14 at 06:04
  • I'm trying to work out why the kernel's tainted, since "proprietary kernel module" often maps directly to "kernel keep dying". Do you have any idea why it's tainted? Could you give us the output of `cat /proc/sys/kernel/tainted`? – MadHatter Apr 01 '14 at 15:10
  • Oh well, sorry, I'd hoped my bounty might get you some better answers. I'd still like to know about the kernel taint, though (see above). – MadHatter Apr 02 '14 at 15:38
  • "268437504" is what it outputs – md_5 Apr 03 '14 at 05:57
  • Wow, that's a bigger output than I've ever seen. That is a *highly* tainted kernel; I'm not surprised it's falling over. – MadHatter Apr 04 '14 at 19:49
  • It's stock CentOS. I haven't touched it. – md_5 Apr 06 '14 at 01:27
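For what it's worth, that taint value decodes mechanically: 268437504 = 2^28 + 2^11. Bit 11 is TAINT_FIRMWARE_WORKAROUND in kernels of this vintage (the "I" flag visible in the boot log quoted above, set by the invalid-DMAR warning), and bit 28 is not a mainline flag at all; RHEL 6 kernels define their own vendor bits up in that range. So the taint appears to point back at the firmware quirk, not at a proprietary module. A quick shell sketch for decoding any such value:

    # List which bits are set in /proc/sys/kernel/tainted
    t=$(cat /proc/sys/kernel/tainted)
    for i in $(seq 0 31); do
        [ $(( (t >> i) & 1 )) -eq 1 ] && echo "taint bit $i is set"
    done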

1 Answer


This might give an indication of the disk hardware state:

    [root@ninja ~]$ /etc/rc.d/init.d/smartd start

    [root@ninja ~]$ smartctl --all /dev/sdc | grep 'health'
    SMART overall-health self-assessment test result: PASSED

    [root@ninja ~]$ smartctl --all /dev/sdb | grep 'health'
    SMART overall-health self-assessment test result: PASSED
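A PASSED overall verdict is a weak signal on its own; it would also be worth kicking off a long self-test and checking the attributes that usually flag a dying disk. A sketch using standard smartmontools flags:

    # Start an extended (long) offline self-test; takes hours on a 1 TB disk
    smartctl -t long /dev/sdb

    # Afterwards: self-test results, plus the usual suspect attributes
    smartctl -l selftest /dev/sdb
    smartctl -A /dev/sdb | egrep -i 'realloc|pending|uncorrect'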
– Onnonymous