trouble with fsck on raid - abort or not abort

Question

I have a problem. While I was away, a server with a 50 TB hardware RAID (RAID 5, I think) apparently kicked out two drives for some reason. A colleague put them back in simply by re-adding them in the configuration utility. Everything seemed to be OK. Then I noticed that I was getting I/O errors on a lot of files.

I then thought that I could correct them with fsck.ext4. This ran for an hour or so and then crashed: the 16 GB of RAM were full. I had to create a 64 GB swap file (on HDD...) to keep it from crashing. It has now been running for two weeks, constantly reporting that some blocks are used by different files and that multiply referenced blocks are being cloned. I know it is so slow because it is swapping like crazy.

Do you think that if I abort and restart, it might not go into swap this time because a lot of the work has already been done? Would it be OK to abort, or should I let it run? Or did I just destroy every file on the RAID?

I actually tried Ctrl+C, but nothing happens.

score 1 · Answer 1 · answered Jun 07 '19 at 14:56

Activate your business continuity plan.

Determine the state of any backups. If you have an acceptable point in time, rebuild the array with good drives and restore. While you are at it, use RAID 6 or a similar parity scheme that can survive more than one drive failure.

If data must be recovered from the faulty array, define an alternate plan for getting operational. Consider getting another equivalent array to restore to while you attempt data recovery on the original.

Reduce the memory consumption of e2fsck by configuring a scratch_files directory in e2fsck.conf, located on storage other than the array being checked. It will run slowly, but the memory system won't thrash the paging space.
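
A minimal sketch of what that configuration might look like. The [scratch_files] stanza and its directory relation are described in the e2fsck.conf man page; the path used here is only an assumption and should point to an existing, writable directory on a different disk with plenty of free space.

```
# /etc/e2fsck.conf
[scratch_files]
# Assumed path: any existing, writable directory NOT on the array being checked
directory = /var/cache/e2fsck
```

The setting is read when e2fsck starts, so it would only take effect on a fresh run, not on the one already in progress.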

Make a decision to abort or not. Forcibly terminating a running fsck (for example by rebooting the host) may cause further data loss. However, you may need the array freed up to recover in a timely manner.
