CentOS 6.5 mdadm Raid 1 - Kernel Panic during raid data check

Question

I have a brand new CentOS 6.5 install with two (mounted to /mnt/data) 1tb Western Digital Black drives in raid 1 with mdadm, configured via the installer. Unfortunately every now and again the entire system kernel panics with a trace similar to below:

Kernel Panic

Any tips on diagnosing or fixing this? Much appreciated!

EDIT: It appears this happened around the same time as a raid data check occured: EDIT 2: The last two crashes have happened at just past 1am Sunday, same time data check occurs.

Mar 23 01:00:02 beta kernel: md: data-check of RAID array md0
Mar 23 01:00:02 beta kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Mar 23 01:00:02 beta kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Mar 23 01:00:02 beta kernel: md: using 128k window, over a total of 976629568k.

/proc/mdstat

    Personalities : [raid1] 
md0 : active raid1 sdc1[1] sdb1[0]
      976629568 blocks super 1.1 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

unused devices: <none>

mdadm -D

/dev/md0:
        Version : 1.1
  Creation Time : Fri Mar  7 16:07:17 2014
     Raid Level : raid1
     Array Size : 976629568 (931.39 GiB 1000.07 GB)
  Used Dev Size : 976629568 (931.39 GiB 1000.07 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Mar 23 03:36:59 2014
          State : active 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : beta.fmt2.spigot-servers.net:0  (local to host beta.fmt2.spigot-servers.net)
           UUID : 89a86538:f6162473:d5e0524c:b80566d6
         Events : 1728

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1

EDIT 3: Different crash, occurred during a forced resync / check, also memtest passed 4 passes just fine: http://files.md-5.net/s/X3Hi.png

EDIT 4: Even dd is causing crashes: http://files.md-5.net/s/hba2.png

EDIT 5: The SSD survives the dd torture test, guess that means I'm gonna try the drives without raid.

I'm just trying to get my head around how the SSD relates, could you cut-and-paste the output of `cat /proc/mdstat` into your question? — MadHatter, Mar 23 '14 at 09:42
The SSD doesn't relate. I have amended the question with the mdstat output as well as some info regarding the potential cause — md_5, Mar 23 '14 at 10:44
Yup, the last two crashes have been at just past 1am Sunday, the same time the data check is running. — md_5, Mar 23 '14 at 10:48
Try to get the _first_ crash after a system boot. These are not the first crash and may not be indicative of the actual problem. You ought to find it in the system log `/var/log/messages`. Try searching the log for `Not tainted`. — Michael Hampton, Mar 23 '14 at 14:17
The only other stuff is this: /var/log/messages-20140323:Mar 23 03:37:07 beta kernel: WARNING: at drivers/pci/dmar.c:588 warn_invalid_dmar+0x7a/0x90() (Tainted: G I--------------- ) /var/log/messages-20140323:Mar 23 03:37:07 beta kernel: Pid: 1, comm: swapper Tainted: G I--------------- 2.6.32-431.5.1.el6.x86_64 #1 Which happens right at the start of system boot indicating the bios reported DMAR at address 0. This is fine and a normal bug with Linux on all Gigabyte mobos, hasn't affected stability on any other of my systems. — md_5, Mar 23 '14 at 20:12
Can you take the machine offline to run memtest? Alternatively, put some load on the machine with `fio` or similar benchmarking tools. I would suspect some flaky hardware here that only manifests itself during the high load of the RAIAD verify. — devicenull, Mar 24 '14 at 00:16
Already run 4 full passes of memtest from last 12 hours.Gonna try forcing a ddata-check and then fio. — md_5, Mar 24 '14 at 05:43
Yup, failed 58% through resync: http://files.md-5.net/s/X3Hi.png — md_5, Mar 24 '14 at 07:15
Yup, even dd is causing completely disparate crashes, I am at a loss: http://files.md-5.net/s/hba2.png — md_5, Mar 24 '14 at 07:48
Could you try to reproduce the test with the 2 drives on another server, see if is crashing there aswell? Additionally: do you monitor temperatures of cpu/disk and/or cpu frequency? i remember having had some issues with CPU frequency changes and short (~2-3sec) freezes a few years ago, back then i disabled the power saving options and my problem was gone. try to monitor the cpu load while doing the raid recheck/verify. — Dennis Nolte, Mar 27 '14 at 09:04
dstat reported it as being fine. I moved the disks to another server and they worked better, although I did still get one crash. I am able to crash it even without mdadm (although not on the new server), therefore I think its just bad hardware. The part that doesn't make sense is both drives do it, but not a third drive of the same model :( — md_5, Mar 27 '14 at 09:19
You have been rebooting after each incident, right? As I said before, none of these crash reports are useful. — Michael Hampton, Mar 27 '14 at 13:24
No I do not. Yes, I have to do a full power reset after each incident. — md_5, Mar 28 '14 at 05:19
What graphics card do you have in this box? Is it by any chance using a proprietary driver? — MadHatter, Mar 31 '14 at 08:44
Its intel integrated into the CPU. I don't think it is, its using whatever the kernel provides, i915 I think — md_5, Apr 01 '14 at 06:04
I'm trying to work out why the kernel's tainted, since "proprietary kernel module" often maps directly to "kernel keep dying". Do you have any idea why it's tainted? Could you give us the output of `cat /proc/sys/kernel/tainted`? — MadHatter, Apr 01 '14 at 15:10
Oh well, sorry, I'd hoped my bounty might get you some better answers. I'd still like to know about the kernel taint, though (see above). — MadHatter, Apr 02 '14 at 15:38
Wow, that's a bigger output than I've ever seen. That is a *highly* tainted kernel; I'm not surprised it's falling over. — MadHatter, Apr 04 '14 at 19:49

score 1 · Answer 1 · answered Mar 27 '14 at 09:20

1

This might give an indication of the disk hardware state:

[root@ninja ~]$ /etc/rc.d/init.d/smartd start

[root@ninja ~]$ smartctl --all /dev/sdc | grep 'health'
SMART overall-health self-assessment test result: PASSED

[root@ninja ~]$ smartctl --all /dev/sdb | grep 'health'
SMART overall-health self-assessment test result: PASSED

answered Mar 27 '14 at 09:20

Onnonymous

1,064
1
9
14

1

Both drives pass, they are fine. – md_5 Mar 27 '14 at 09:24
Or, indeed, `smartctl -H /dev/sda`. – MadHatter Mar 27 '14 at 09:31
1

Have you tried other kernels, for e.g. kernel-2.6.32-358.18.1/2.6.32-431.11.2.el6? – ALex_hha Mar 27 '14 at 11:35

CentOS 6.5 mdadm Raid 1 - Kernel Panic during raid data check

1 Answers1