
We have a Windows server running 24/7.
I have been worried ever since I started looking at the Windows event log.
There I found a lot of instances of Kernel-Power Event ID 41.
They indicated that multiple times a day (mainly during the night) the server had unexpectedly rebooted after a crash.

The server had been running rock solid for years!
So my first assumption was faulty software introduced by recent patches.
But I just couldn't make out any pattern as to why and when exactly a crash would occur.
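
One way to look for a time pattern is to bucket the crash timestamps by hour of day. The following is only a minimal sketch: it assumes the Event ID 41 timestamps have already been exported from the event log as ISO 8601 strings, and both the `crashes_by_hour` helper and the sample values are hypothetical, not data from this server.

```python
from collections import Counter
from datetime import datetime


def crashes_by_hour(timestamps):
    """Count events per hour of day (0-23) from ISO 8601 timestamp strings."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    # Return a plain dict ordered by hour so it prints as a simple histogram.
    return {hour: hours[hour] for hour in sorted(hours)}


# Hypothetical sample data, standing in for exported Event ID 41 timestamps.
sample = [
    "2021-03-20T02:14:05",
    "2021-03-20T02:51:40",
    "2021-03-21T03:07:12",
    "2021-03-21T14:30:00",
]

histogram = crashes_by_hour(sample)
for hour, count in histogram.items():
    print(f"{hour:02d}:00  {'#' * count}  ({count})")
```

A cluster around particular hours would hint at a scheduled job (backup, patching, scans) rather than a random hardware fault; an even spread would fit the hardware theory better.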

Web searches for Kernel-Power Event ID 41 mainly point to hardware issues:
PSU glitches, CPU or memory overheating, etc.

The server has an LSI MegaRAID 9260-4i with 4 physical HDDs, configured as two RAID 1 arrays of two disks each.

The RAID controller logs don't show anything suspicious (no sign of problems with any of the physical disks).

So I'm currently thinking the RAID controller itself may be having problems.
This idea is backed up by the following two observations:

1)
I boot from the Windows Server installation CD,
then go into the recovery options
and select "restore from backup" (with the USB HDD backup drive connected).
At a certain stage during the restore procedure it throws error 0x80070002.
If I then switch over to the command prompt, no drives are visible.

2)
It is quite similar with Acronis True Image (ATI).
I boot from the ATI recovery CD
and select to back up my partitions.
It all starts processing,
but at some point it throws an error.
After cancelling that backup procedure and going to "backup my disks and partitions", everything is empty! No disks are shown.

--

All of the above makes me assume the following:
The RAID controller itself (not the physical HDDs) must be defective:
right in the middle of operations the logical drives just "disappear".

During Windows Server uptime this causes an OS crash, followed by a reboot.
During a Windows backup restore from CD the drives suddenly disappear.
During an ATI backup from CD the drives suddenly disappear.

--

Considering all of the above:
Is it safe to assume that these are symptoms of the RAID controller itself dying, and that neither the physical HDDs nor any other system component is causing the problems?

To get the current problems solved:
Would the best option be to get the current RAID controller replaced with an identical one?

paulgutten
  • Have you run a utility like smartctl to see what the disks themselves are reporting for disk health? If your RAID controller is failing as consistently as you say, then you will probably start seeing the drives take errors. – tilleyc Mar 25 '21 at 20:03
  • Maybe you're having power outages. Do you have a UPS? – Andrew Schulman Mar 29 '21 at 12:09

2 Answers


The biggest part of IT is: don't panic and don't assume.

Often you have to load additional drivers for the Windows recovery or installation disc, or for Acronis, before they can see the RAID configuration. Knowing the version of Windows Server would help determine whether the RAID controller drivers should already be on the recovery media. Also, if you did not build the Acronis media on that server, it likely doesn't have the drivers to see the RAID controller.

Side note: check the power profile in the Control Panel and ensure the drive(s) and system never go to sleep or power down. This is likely not the issue, but check it anyway. Let us know which version of Windows you are running.

Cheers!

bitcollision

I'm the OP.

My key questions were:

Considering all of the above: Is it safe to assume that these are symptoms of the RAID controller itself dying, and that neither the physical HDDs nor any other system component is causing the problems?

To get the current problems solved: Would the best option be to get the current RAID controller replaced with an identical one?

I'm answering both of those questions with: YES

I have replaced the RAID controller with a "similar" one I bought second-hand. It is the same brand, but this time an LSI MegaRAID 9260-8i.
Surprisingly, I didn't face any major issues when simply attaching the existing disks (with the existing RAID configuration) to the new controller.

To be on the safe side, this time I have installed an additional fan that blows directly onto the heatsink of the RAID controller.

References: https://vcojot.blogspot.com/2015/07/lsi-megaraid-hbas-overheating-and-one.html and LSI MegaRAID Expected Chip Temperature?

The server is running rock solid again, with uptime easily exceeding 90 days.

paulgutten