I had a disastrous weekend. I'm running a server with several KVM virtual machines, each hosting roughly 100 users. The load stays between 0.40 and 0.89 all day, and the machine has 128 GB of RAM.

On Saturday the server was no longer reachable. I immediately logged in through IPMI and couldn't believe what I saw: the RAID was completely degraded. Only 2 hard disks were "alive", but there was no data on them.

About an hour before I was informed about the crash, I saw that a Proxmox backup was running. But could this really be the reason for a crash of all HDD?

I'm not quite sure what I should do to prevent this...

MyFault
It could be a RAID controller failure. You should provide all information: What kind of RAID controller, what version of Linux kernel, what filesystems... Else it's just a guessing game. – wazoox Aug 13 '16 at 12:33

1 Answer

But could this really be the reason for a crash of all HDD?

It seems unlikely, but it may be worth checking with Proxmox.

I'm not quite sure what I should do to prevent this...

In order to prevent this from happening again, you need to understand why it happened.

To do that you will need to bring your (or someone else's) sysadmin toolkit to bear on the problem and do some root cause analysis.

I find the good old scientific method is the perfect tool for this.

Here's some Q&A I prepared earlier which should help.

I would imagine that in your case, reading your logs for relevant information would be a good place to start.
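
As a rough illustration of that first pass over the logs, here is a minimal sketch that pulls out lines mentioning disk or RAID trouble. The keyword list and the function name `find_storage_events` are assumptions for illustration only; the messages your controller, kernel and filesystem actually emit may look different, so treat the pattern as a starting point to adjust.

```python
import re

# Hypothetical keyword list: adjust it to what your hardware and kernel log.
STORAGE_PATTERN = re.compile(
    r"(I/O error|md/raid|mdadm|degraded|smartd|\bata\d+|medium error)",
    re.IGNORECASE,
)


def find_storage_events(lines):
    """Return (line_number, text) pairs for lines matching the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if STORAGE_PATTERN.search(line)
    ]


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_storage_events(handle)
```

Calling `scan_file` on whichever system log your distribution writes gives a list of candidate lines with their line numbers. Sorting what you find by timestamp and comparing it with the time the backup started is one way to test the "backup killed the disks" hypothesis rather than guess at it.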

user9517