3

I'm running Oracle Linux 6 on an HP ProLiant server. It's been running fine for the last week, but seemed slow earlier so the Oracle service was stopped. Rather than restart the service, I was asked to reboot the server, but on start we got a kernel panic.

First I get the following, which HP said isn't important, but I'm inclined not to believe them:

[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
ERST: Can not request iomem region <0xffff88030c1dfe20-0xffff1006183bfc40> for ERST

Then the kernel panic:

Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.39-300.26.1.el6euk.x86-64 #1
Call Trace:
[<ffffffff81509077>] panic+0x91/0x1a8
[<ffffffff81061562>] ? enqueue_entity+0x52/0x210
[<ffffffff8107196b>] forget_original_parent+0x32b/0x330
[<ffffffff8105adbd>] ? sched_move_task+0x9d/0x150
[<ffffffff8107198b>] exit_notify+0x1b/0x190
[<ffffffff81072a8e>] do_exit+0x1fe/0x430
[<ffffffff81072d15>] do_group_exit+0x55/0xd0
[<ffffffff81072da7>] sys_exit_group+0x17/0x20
[<ffffffff81514402>] system_call_fastpath+0x16/0x1b
panic occurred: switching back to text console

Could anyone give me a pointer as to what is or even could be causing this? I'm completely stumped at this point. (System administration isn't my day job - I can get a server running but kernel panics are outside my comfort zone)

Edit: Tested with the following kernels

2.6.39-300.26.1.el6euk.x86_64
2.6.39-200.24.1.el6euk.x86_64
2.6.32-279.19.1.el6.x86_64
2.6.32-279.el6.x86_64

Jon Story

2 Answers

1

The first message you see during init, "[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)", is not an issue. It's standard on EL6 with ProLiant systems. However, the fix to remove the message is available here.

As for the crazy Oracle Linux kernel version, 2.6.39-300.26.1.el6euk.x86-64, can you try booting with the previous kernel in GRUB?
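
For reference, EL6 boots with GRUB legacy, whose menu is read from /boot/grub/grub.conf: each `title` line is one boot entry, and `default=N` selects the zero-based entry that boots automatically. Below is a minimal sketch for listing those entries, assuming that standard layout. The `list_grub_entries` helper and the sample text (including the "Oracle Linux Server" titles) are hypothetical, made up for illustration; only the kernel version strings come from the question.

```python
import re


def list_grub_entries(conf_text):
    """Return (default_index, [titles]) parsed from GRUB legacy grub.conf text."""
    default = 0  # GRUB legacy boots entry 0 when no default is set
    titles = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        match = re.match(r"default\s*=\s*(\d+)\s*$", stripped)
        if match:
            default = int(match.group(1))
            continue
        parts = stripped.split(None, 1)
        if parts and parts[0] == "title":
            titles.append(parts[1] if len(parts) > 1 else "")
    return default, titles


# Hypothetical sample, not taken from the affected server.
SAMPLE = """\
default=0
timeout=5
title Oracle Linux Server (2.6.39-300.26.1.el6euk.x86_64)
title Oracle Linux Server (2.6.39-200.24.1.el6euk.x86_64)
title Oracle Linux Server (2.6.32-279.19.1.el6.x86_64)
title Oracle Linux Server (2.6.32-279.el6.x86_64)
"""

default_index, entry_titles = list_grub_entries(SAMPLE)
for index, title in enumerate(entry_titles):
    marker = "*" if index == default_index else " "
    print(f"{marker} {index}: {title}")
```

On a machine that won't boot you can't run this, of course; the equivalent step is to interrupt the GRUB menu at startup and pick an entry other than the default.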

ewwhite
  • What's crazy about the kernel version? I'm not questioning you, I just don't know the difference. I believe there are 3 more options, I'll check what they are and get back to you. – Jon Story Feb 08 '13 at 15:33
  • Oh, it's because Oracle's Linux departs from the RHEL and CentOS standard with much newer kernels. Check the system boot with a different kernel. – ewwhite Feb 08 '13 at 15:35
  • @ewwhite That would make sense: I'm running diagnostics at the moment, but if they don't come up with anything I'll try one of the other 3 kernels in GRUB. I'm not sure what they are right now, but I'll post the options before trying it out. – Jon Story Feb 08 '13 at 16:40
  • There's nothing wrong with your server hardware. This is a kernel/OS interaction. Don't waste the time. – ewwhite Feb 08 '13 at 16:42
  • @ewwhite The same error occurs with all 4 kernel options available (I've posted these in the main question as I can't put line breaks in the comment) – Jon Story Feb 08 '13 at 17:16
  • I fear an issue with an initial ramdisk. The server can't find its root filesystem, it seems... What type of storage is this system using? Local? SAN? – ewwhite Feb 08 '13 at 17:20
  • @ewwhite A local 6-disk RAID-5 array – Jon Story Feb 09 '13 at 17:08
  • If I were you, I'd check the firmware on the server/controllers. Use the [**Firmware Update DVD**](http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?swItem=MTX-9ed665a89aba447d925937f38b&lang=en&cc=us&mode=3&) or [**Service Pack for ProLiant DVD**](http://h18004.www1.hp.com/products/servers/management/spp/index.html). Outside of that, it's possible that some autoupdating of packages could have rendered the system unbootable. – ewwhite Feb 10 '13 at 11:33
  • Will do, cheers. We've ended up nuking the disk and re-installing - not ideal, but it's got us a working system again. I'll try to keep a closer eye on what gets updated in case anything similar happens. We've gone for CentOS this time, so I know exactly what's being installed - Oracle Linux starts off with too many packages for my liking. Thanks for your help. – Jon Story Feb 12 '13 at 10:57
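
A note on reading the trace in the question, since it is consistent with the "kernel/OS interaction" diagnosis in the comments: "Attempted to kill init!" is what the kernel prints when PID 1 exits, and the frames `sys_exit_group`, `do_group_exit` and `do_exit` show init getting there through an ordinary exit system call. Frames prefixed with `?` are addresses found on the stack that may not be part of the real call chain. Here is a rough sketch that pulls the reliable function names out of such a trace, assuming the `[<address>] symbol+offset/size` line format shown in the question (the `trace_symbols` helper is hypothetical):

```python
import re

# One frame: "[<address>] [? ]symbol+0xoffset/0xsize"
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def trace_symbols(trace_text):
    """Return function names of reliable frames, innermost first.

    Frames marked with '?' are skipped because they are only guesses.
    """
    symbols = []
    for line in trace_text.splitlines():
        match = FRAME.search(line)
        if match and not match.group(1):
            symbols.append(match.group(2))
    return symbols


# Call trace copied from the question.
TRACE = """\
[<ffffffff81509077>] panic+0x91/0x1a8
[<ffffffff81061562>] ? enqueue_entity+0x52/0x210
[<ffffffff8107196b>] forget_original_parent+0x32b/0x330
[<ffffffff8105adbd>] ? sched_move_task+0x9d/0x150
[<ffffffff8107198b>] exit_notify+0x1b/0x190
[<ffffffff81072a8e>] do_exit+0x1fe/0x430
[<ffffffff81072d15>] do_group_exit+0x55/0xd0
[<ffffffff81072da7>] sys_exit_group+0x17/0x20
[<ffffffff81514402>] system_call_fastpath+0x16/0x1b
"""

# Print outermost call first so the path reads top-down.
print(" -> ".join(reversed(trace_symbols(TRACE))))
```

Read outermost first, the path is a system call, then the exit path, then `panic`, which points at init terminating itself rather than at the kernel faulting.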
0

I think it's a hardware problem: memory, CPU or similar. First try booting from a rescue disk (CD or USB) with memtest, and let it run for a few hours.

If you're lucky, you'll only have to replace the RAM; if not, you may have to replace the motherboard, CPU, ...

Brigo
  • Memtest seems fine, but I'll run a longer test soon. Currently doing some diagnostics on the disks. Thanks. – Jon Story Feb 08 '13 at 16:39
  • Apologies, I forgot to come back to this - there was no hardware issue we ever found. After a re-install, the system has been rock solid for 18 months – Jon Story Sep 15 '14 at 14:03