
It's happened again! I have four servers that crash periodically, and no information is printed to the system logs or the serial console.

In addition, the Linux kdump service isn't writing core dumps to the default location of /var/crash.

  • Can you help me figure out why?
  • Does it matter if my root filesystem is an LVM volume?

Here is what I've tried.

  1. My system is Scientific Linux 6.5 with the latest kernel.

    [root@host1 ~]# uname -r
    2.6.32-431.11.2.el6.x86_64
    [root@host1 ~]# cat /etc/issue
    Scientific Linux release 6.5 (Carbon)
    
  2. The file /etc/kdump.conf is the vanilla file containing the default settings. Most lines are commented out; the only two active lines are path and core_collector.

    #net my.server.com:/export/tmp
    #net user@my.server.com
    path /var/crash
    core_collector makedumpfile -c --message-level 1 -d 31
    #core_collector scp
    
  3. I ensure that the kdump service is running, and that kdump doesn't need to rebuild my initrd.

    [root@host1 ~]# chkconfig --list kdump
    kdump           0:off   1:off   2:off   3:on    4:on    5:on    6:off
    [root@host1 ~]# /etc/init.d/kdump restart
    Stopping kdump:                                            [  OK  ]
    Starting kdump:                                            [  OK  ]
    [root@host1 ~]# 
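
    For what it's worth, a sketch of two extra sanity checks (the crashkernel value below is illustrative, not my real one):

        # the stock RHEL6 init script prints "Kdump is operational" on success
        service kdump status
        # confirm the crash kernel memory was actually reserved at boot
        grep -o 'crashkernel=[^ ]*' /proc/cmdline   # e.g. crashkernel=128M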
    
  4. Then, I force a kernel crash using these commands, borrowed from the RHEL6 Deployment Guide, Chapter 29 (The kdump Crash Recovery Service):

    Then type the following commands at a shell prompt:

    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger
    

    This will force the Linux kernel to crash.

  5. The system crashes, and I can watch the progress on my serial console. I see the message Saving to the local filesystem UUID=e7abcdeb-1987-4c69-a867-fabdceffghi2, but immediately after that I see the strange message Usage: fsck.ext4, which looks as if something is calling fsck.ext4 without a valid device argument instead of doing whatever it should be doing. I see no mention of an out-of-memory error or anything similar.

    host1.example.org login: SysRq : Trigger a crash
    BUG: unable to handle kernel NULL pointer dereference at (null)
    ...
    ... skipping 50 lines of output
    ...
    Creating block device ram8
    Creating block device ram9
    Creating Remain Block Devices
    Making device-mapper control node
    Scanning logical volumes
      Reading all physical volumes.  This may take a while...
      No volume groups found
      No volume groups found
    Activating logical volumes
      No volume groups found
      No volume groups found
    Free memory/Total memory (free %): 58272 / 116616 ( 49.9691 )
    Saving to the local filesystem UUID=e7abcdeb-1987-4c69-a867-fabdceffghi2
    Usage: fsck.ext4 [-panyrcdfvtDFV] [-b superblock] [-B blocksize]
            [-I inode_buffer_blocks] [-P process_inode_size]
            [-l|-L bad_blocks_file] [-C fd] [-j external_journal]
            [-E extended-options] device
    
    Emergency help:
     -p                   Autom
    
  6. And then the system reboots (which is the default).
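
    (For reference, the failure action is controlled by the default directive in /etc/kdump.conf, which is commented out in my vanilla file. A sketch of the two settings relevant here:)

        # reboot if the dump fails:
        #default reboot
        # or drop to a shell instead (what I try in step 9 below):
        #default shell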

  7. When the system comes back online, there is nothing in /var/crash. I assume that the crash dump was not written.

    [root@host1 ~]# ls -lA /var/crash/
    total 0
    [root@host1 ~]#
    
  8. I know that crash dumps can work in general: if I tell kdump to copy the core dump to another system with the following configuration, kdump successfully writes the core dump to the remote host:

    path vmcore
    ssh user@hostb.example.org
    sshkey /root/.ssh/kdump_id_rsa
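
    (For reference, a sketch of installing the key on the remote host, assuming the stock RHEL6 init script's propagate action:)

        # copies the public half of /root/.ssh/kdump_id_rsa to
        # user@hostb.example.org's authorized_keys
        service kdump propagate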
    
  9. If I set default shell in /etc/kdump.conf, rebuild the initrd, and then crash the system again, I get a slightly more informative error: mount: can't find /mnt in /etc/fstab.
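
    For reference, a sketch of the change and rebuild (my understanding is that the RHEL6 init script rebuilds the kdump initrd on restart when the config file is newer):

        # append the failure action to the config
        echo 'default shell' >> /etc/kdump.conf
        # restart; this detects the changed config and rebuilds the kdump initrd
        service kdump restart

    The serial console output then looks like this: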

    Free memory/Total memory (free %): 58272 / 116616 ( 49.9691 )
    Saving to the local filesystem UUID=e720481b-1987-4c69-a867-f2b4cba3b312
    Usage: fsck.ext4 [-panyrcdfvtDFV] [-b superblock] [-B blocksize]
    [-I inode_buffer_blocks] [-P process_inode_size]
    [-l|-L bad_blocks_file] [-C fd] [-j external_journal]
    [-E extended-options] device
    
    Emergency help:
     -p                   Automatic repair (no questions)
     -n                   Make no changes to the filesystem
     -y                   Assume "yes" to all questions
     -c                   Check for bad blocks and add them to the badblock list
     -f                   Force checking even if filesystem is marked clean
     -v                   Be verbose
     -b superblock        Use alternative superblock
     -B blocksize         Force blocksize when looking for superblock
     -j external_journal  Set location of the external journal
     -l bad_blocks_file   Add to badblocks list
     -L bad_blocks_file   Set badblocks list
    mount: can't find /mnt in /etc/fstab
    dropping to initramfs shell
    exiting this shell will reboot your system
    /sys/block #
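
    For completeness, roughly what can be poked at from that shell (a sketch; I'm assuming the lvm binary is present in the kdump initrd, since the boot output above runs a volume scan):

        # is the volume group visible at all inside the kdump environment?
        lvm vgscan
        lvm lvs
        # what has the dump script actually mounted so far?
        cat /proc/mounts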
    
  10. But now, I'm stuck.

Stefan Lasiewski
  • What is the make/model of the server? – ewwhite Jun 06 '14 at 21:57
  • This is a Supermicro with an X9DRW4 motherboard and the latest BIOS. – Stefan Lasiewski Jun 06 '14 at 22:03
  • Bummer. I'm having a [similar crash on HP ProLiants](https://access.redhat.com/site/solutions/707563) with the newest RHEL6 kernel. I'm wondering if it's a deeper issue. – ewwhite Jun 06 '14 at 22:14
  • To me, it looks a bit like a bug. But I don't remember what the output should look like. – Stefan Lasiewski Jun 06 '14 at 23:05
  • Hi. Did you resolve this issue? I'm facing a very similar problem. – Chul-Woong Yang May 31 '15 at 02:13
  • This may be due to trying to generate the core on an LVM partition. Did you check whether generating the core on a normal filesystem works? Another possibility is the array controller card, which might not be able to initialize when the system boots into the mini kernel. – Sep 17 '16 at 10:41
  • @StefanLasiewski did you manage to solve it? – sherpaurgen Sep 02 '20 at 08:26
  • @satch_boogie No I don't think I ever did, sorry. We gave up on that crash. I would avoid the LVM filesystem for this as an experiment. – Stefan Lasiewski Sep 03 '20 at 00:16
  • @satch_boogie I was facing the same problem some time ago, and it was related to the fact that the crash directory was mounted on an LVM partition; the moment I plugged in a new hard drive and mounted it as ext4, kdump started to persist vmcores in the crash directory. – Ottovsky Sep 16 '20 at 12:09
  • @AdamOtto I fell back to a lower kernel version, 4.4.0 (I was using 4.15, which didn't work), and it worked for LVM as well as a raw ext4 partition. Generating a core dump on kernel 5.4 (Ubuntu 20 LTS) also works. I have not yet found out why it isn't generated for some kernel versions. – sherpaurgen Sep 17 '20 at 09:02

2 Answers


A little late to the game, but if you need to configure kdump in the future:

I think the path directive designates a path relative to the partition or filesystem designated for the dump; by default this is the root FS. If you have a separate partition for /var in /etc/fstab, it will hide the crash directory when your system is booted normally. That is, if you were to boot normally and unmount /var, you would see crash/[UniqCoreDir] underneath it on the root filesystem. You can adjust this by adding an ext4 /PATH/TO/DEVICE directive in kdump.conf, or you could use a different path that won't be mounted over.

Just a guess, but you might have a number of vmcores buried under /var.
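
An easy way to check for that is to bind-mount the root filesystem somewhere else and look underneath the /var mount point. A sketch (the scratch mount point /mnt/rootfs is arbitrary):

    # bind-mount / so we can see what is hidden under the mounted /var
    mkdir /mnt/rootfs
    mount --bind / /mnt/rootfs
    ls -lA /mnt/rootfs/var/crash
    umount /mnt/rootfs

And a sketch of pinning the dump target explicitly in kdump.conf (the device path here is hypothetical; substitute the device that holds your /var):

    # dump straight to the filesystem on this device; path is then
    # relative to the root of that filesystem
    ext4 /dev/mapper/vg_host1-lv_var
    path /crash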

Yuri
nick

Pull apart your kdump initrd in /boot/ and check to see the final path that it's trying to dump to.
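
A sketch of doing that, assuming the RHEL6 naming convention of initrd-<kernel-version>kdump.img (adjust to whatever is actually in your /boot):

    # unpack the kdump initrd (a gzipped cpio archive) into a scratch directory
    mkdir /tmp/kdump-initrd && cd /tmp/kdump-initrd
    zcat /boot/initrd-$(uname -r)kdump.img | cpio -idmv
    # the dump logic lives in the top-level init script; look for the
    # mount/fsck calls and the target it resolves
    grep -n 'fsck\|mount\|UUID\|/var/crash' init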

  • I think the "path" option is a little weird; I'd probably leave it at the default or set it explicitly to /var/crash.

  • Do you have some kind of watchdog rebooting the machine? This may also prevent the core from being created, by rebooting the machine before the dump is started.

No Username
  • I'll check out the initrd and see what I find. The `path` option in #2 is the default path (`/var/crash`). – Stefan Lasiewski Jun 20 '14 at 05:35
  • No I do not have a watchdog rebooting the machine. Turns out that the LSI controller + Samsung SSDs are periodically freezing for reasons that we don't totally understand. – Stefan Lasiewski Jan 22 '15 at 00:16
  • Did you get any feedback? Because that is pretty crazy; maybe a power-draw problem dropping the voltage too low? – No Username Apr 20 '15 at 12:49