How to check/repair a RAID1 array?

1

1

I’m currently using a software RAID-1 array on Linux, built on top of an HDD and an SSD. I have a strong feeling that the SSD is failing.

I’d like to check how badly the SSD is behaving. I ran a check of the array with echo check > /sys/block/md1/md/sync_action and, when it had finished, looked at the content of /sys/block/md1/md/mismatch_cnt. I ran it 3 times in a row and got 3 different results: 256, 128 and 384. What puzzles me is that the second run gave a lower result than the first one. Was a mismatch fixed?
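For reference, here is roughly how that check can be scripted. This is only a sketch: the function names (md_start_check, md_wait_idle, md_mismatch_cnt) are made up for illustration, the sysfs directory is passed in as a parameter, and it assumes root privileges and that sync_action reads back as idle once the check is over.

```shell
#!/bin/sh
# Sketch: run an md consistency check and read the mismatch counter.
# Function names are hypothetical; the sysfs files are the ones used above.

# Ask the kernel to start a read-only consistency check.
# $1 = md sysfs directory, e.g. /sys/block/md1/md
md_start_check() {
    echo check > "$1/sync_action"
}

# Poll until the array reports that no sync action is running.
# $2 = polling interval in seconds (defaults to 10).
md_wait_idle() {
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${2:-10}"
    done
}

# Print the mismatch counter left behind by the last check.
md_mismatch_cnt() {
    cat "$1/mismatch_cnt"
}

# Example (as root), assuming the array is md1:
#   md=/sys/block/md1/md
#   md_start_check "$md" && md_wait_idle "$md" && md_mismatch_cnt "$md"
```

Running the example line several times in a row reproduces the experiment described above.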

Is there a way I can get more detail about the mismatches that are detected? It might be interesting to check if the mismatching blocks change or if it’s always the same. I’d also like to have a look at the contents of the mismatching blocks, to see if I can tell which one is correct. (For example if the SSD has zeroed some blocks it could not reread.)

Moreover, I see there is an option to repair an MD array. But I’m somewhat suspicious: how can the kernel guess which one of the mismatching blocks is correct?

user2233709

Posted 2018-07-13T00:24:08.667

Reputation: 163

I don't know the answer to the question, but a RAID1 setup operates at the speed of the slowest component. It's no faster than running on the HD alone. – Christopher Hostage – 2018-07-13T00:35:33.273

@ChristopherHostage That’s not true for read performance if I configure the array to read the SSD rather than the HDD, which I can do by using the “write-mostly” option when I add the HDD. – user2233709 – 2018-07-13T00:39:22.387

Huh. "Write-mostly" was a new term for me... Googling it led me to the following link, which also mentions the tool you used. Neat. https://www.tansi.org/hybrid/

– Christopher Hostage – 2018-07-13T00:43:11.620

Answers

1

Well… Judging from the source code of the process_checks function in drivers/md/raid1.c from Linux 4.9.88, and if I read it correctly:

  1. There is no way to make the check or repair operations verbose about where mismatches are found.
  2. If a read failure is encountered during a check or repair operation, the failing block will be rewritten.
  3. If a mismatch is encountered during a repair operation, it will be fixed by copying the block from the “primary” (first non-failing) component to the other one(s).

Hence, the kernel does not try to guess which of the mismatching blocks is correct; it simply takes the first one as correct. (As I read it, this holds even if there are 3 components and the 2nd and 3rd have the same contents.)
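To make that policy concrete, here is a toy model of it using plain files as stand-ins for the array components. It is not kernel code: toy_repair is a hypothetical helper, it works on whole files rather than on blocks, and it only mirrors my reading of process_checks (the first component wins, with no majority vote).

```shell
#!/bin/sh
# Toy model of the "repair" policy described above, using plain files as
# stand-ins for array components. Hypothetical helper, not kernel code.

# toy_repair PRIMARY OTHER...
# Overwrite every OTHER that differs from PRIMARY; no majority vote is taken.
toy_repair() {
    primary=$1
    shift
    for other in "$@"; do
        # cmp -s is silent and returns non-zero when the files differ.
        if ! cmp -s "$primary" "$other"; then
            cp "$primary" "$other"
            echo "rewrote $other from $primary"
        fi
    done
}

# Example: components 2 and 3 agree, yet both get overwritten by component 1.
#   printf 'AAAA' > c1; printf 'BBBB' > c2; printf 'BBBB' > c3
#   toy_repair c1 c2 c3
```

In the example, c2 and c3 hold identical data but both end up with the contents of c1, which is exactly the situation where a bad first component would propagate its data to the good ones.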

user2233709

Posted 2018-07-13T00:24:08.667

Reputation: 163