We have a Windows server running 24/7.
I have been worried for quite a while, ever since I started taking a look at the Windows event log.
There I found a lot of instances of Kernel-Power Event ID 41.
They indicate that multiple times a day (mainly during the night) the server rebooted unexpectedly after a crash.
The server has been running rock solid for years!
So my first assumption was that a recent patch had introduced some faulty software.
But I just couldn't make out any pattern as to why and when exactly a crash would occur.
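One way to look for a pattern is to take the timestamps of the Event ID 41 entries and bucket them by hour of day, then look at the gaps between consecutive crashes. Here is a minimal Python sketch, assuming the timestamps have already been exported from the event log as ISO 8601 strings. The function names and the sample values are made up for illustration and are not real data from this server.

```python
from collections import Counter
from datetime import datetime


def crashes_per_hour(timestamps):
    """Count crash events for each hour of the day (0-23)."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return {hour: hours.get(hour, 0) for hour in range(24)}


def gaps_in_hours(timestamps):
    """Return the time between consecutive crashes, in hours (1 decimal)."""
    times = sorted(datetime.fromisoformat(ts) for ts in timestamps)
    return [
        round((later - earlier).total_seconds() / 3600, 1)
        for earlier, later in zip(times, times[1:])
    ]


if __name__ == "__main__":
    # Hypothetical sample timestamps, NOT taken from the actual server.
    sample = [
        "2024-01-10T02:14:05",
        "2024-01-10T03:40:00",
        "2024-01-11T02:59:59",
    ]
    for hour, count in crashes_per_hour(sample).items():
        if count:
            print(f"{hour:02d}:00  {'#' * count}")
    print("Gaps between crashes (hours):", gaps_in_hours(sample))
```

If the crashes cluster around a fixed hour, that would point towards a scheduled job (backup, patrol read, consistency check) rather than a random hardware fault. Evenly spread crashes would fit a hardware fault better.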
Web searches for Kernel-Power Event ID 41 mainly point to hardware issues:
PSU glitches, CPU or memory overheating, etc.
The server has an LSI MegaRAID 9260-4i with 4 physical HDDs, configured as two RAID 1 arrays of two disks each.
The RAID controller logs don't show anything suspicious (with regard to any of the physical disks having problems).
So I'm currently thinking the RAID controller itself may be having problems.
This idea is backed up by the following two observations:
1)
I boot from the Windows Server installation CD.
Then I go into the recovery options.
Then I select "restore from backup" (with the USB HDD backup drive connected).
At a certain stage during the restore procedure it throws error 0x80070002.
And if I then switch over to the command prompt, no drives are visible.
2)
It is quite similar with Acronis True Image (ATI).
I boot from the ATI recovery CD.
Then I select to back up my partitions.
It all starts processing.
But at some point it throws an error.
And after cancelling that backup procedure and then going to "backup my disks and partitions", everything is empty! No disks are shown.
--
All of the above makes me assume the following:
The RAID controller itself (not the physical HDDs) must be defective:
right in the middle of operations the logical drives just "disappear".
During Windows Server uptime this causes an OS crash, followed by a reboot.
During the Windows backup restore from CD the drives suddenly disappear.
During the ATI backup from CD the drives suddenly disappear.
--
Considering all of the above:
Is it safe to assume that these are symptoms of the RAID controller itself dying, and that neither the physical HDDs nor any other system components are causing the problems?
To get the current problems solved:
Would the best option be to replace the current RAID controller with an identical one?