Online Drive Replacement: BTRFS with RAID 6


On a test machine I installed four HDDs and configured them as a BTRFS RAID 6 volume. As a test, I removed one of the drives (/dev/sdk) while the volume was mounted and data was being written to it. This worked well, as far as I can tell: some I/O errors were logged to /var/log/syslog, but the volume kept working. Unfortunately, "btrfs fi sh" did not show any missing drives, so I remounted the volume in degraded mode:

$ mount -t btrfs /dev/sdx1 -o remount,rw,degraded,noatime /mnt

Now the drive in question was reported as missing. Then I plugged the HDD back in (it is of course /dev/sdk again) and started a balance:

$ btrfs filesystem balance start /mnt

The volume now looks like this:

$ btrfs fi sh
Label: none  uuid: 28410e37-77c1-4c01-8075-0d5068d9ffc2
    Total devices 4 FS bytes used 257.05GiB
    devid    1 size 465.76GiB used 262.03GiB path /dev/sdi1
    devid    2 size 465.76GiB used 262.00GiB path /dev/sdj1
    devid    3 size 465.76GiB used 261.03GiB path /dev/sdh1
    devid    4 size 465.76GiB used 0.00 path /dev/sdk1

How do I reintegrate /dev/sdk1? Running "$ btrfs fi ba start /mnt" does not help. I tried to remove the HDD, but:

$ btrfs de de /dev/sdk1 /mnt/
ERROR: error removing the device '/dev/sdk1' - unable to go below four devices on raid6 

A replacement does not work this way either:

$ btrfs replace start -f -r /dev/sdk1 /dev/sdk1 /mnt
/dev/sdk1 is mounted

Are there other ways to replace/reintegrate the HDD other than converting to RAID 5?

Oliver R.


Would it be better to post this on Server Fault? If so, could an admin please move this question? – Oliver R. – 2014-11-28T13:59:25.863

What does btrfs scrub say? – basic6 – 2015-07-08T19:07:31.923

Answers


I have repeated this test on a test system running kernel 4.3.

Like you, I have created a BTRFS RAID-6 array with 4 drives:

# mkfs.btrfs -m raid6 -d raid6 /dev/sdb /dev/sdc /dev/sdd /dev/sde

I then mounted it and started writing data to it.
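
Mounting any one of the member devices brings up the whole filesystem, so the mount step was simply something like this (mount point as used below):

# mount /dev/sdb /mnt/tmp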

While that was going on, I removed one of the drives. Of course, this caused plenty of error messages in the logs. But as expected, the write process was not interrupted and no files were damaged.

More importantly, BTRFS increased its error counts (dev stats) for write and flush errors. So if this had been a monitored production system, a cron job such as this one would have generated a notification email:

MAILTO=admin@myserver.com
@hourly /sbin/btrfs device stats /mnt/tmp | grep -vE ' 0$'
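
The counters can also be inspected manually at any time; the same command without the grep filter prints all of them:

# btrfs device stats /mnt/tmp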

I then ran a scrub instead of a balance, because I wanted BTRFS to scan the whole filesystem and fix all errors, which is exactly what a scrub does.

# btrfs scrub start -B /mnt/tmp
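
The -B flag keeps the scrub in the foreground until it is finished. Without it, the scrub runs in the background and its progress can be checked with:

# btrfs scrub status /mnt/tmp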

Finally, I reset the BTRFS error counters back to zero (this would stop the notification emails if the filesystem were being monitored as described above):

# btrfs device stats -z /mnt/tmp

Another scrub found no more errors.

And the file that I was writing during the test is correct. Its MD5 sum matches the original.
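
That check is nothing more than comparing checksums of the written copy against the source; the file names here are only placeholders:

# md5sum /mnt/tmp/testfile /path/to/original/testfile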

Of course, every test run is a little different. If the 3rd drive (sdd) comes back under a new name such as sdf, you can replace it with itself, effectively resilvering it:

# btrfs replace start 3 /dev/sdf /mnt/tmp
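
The rebuild runs in the background by default; its progress can be followed with the status subcommand:

# btrfs replace status /mnt/tmp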

By the way, you mentioned removing the drive. You don't need to do that; it would only shuffle your devids and is less efficient than a replace. The replace command has been available for a long time.
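
If you are unsure which devid to pass to replace, the filesystem show output (as in your question) lists the devid of every member device:

# btrfs filesystem show /mnt/tmp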

By the way, in one case BTRFS caused the test system to crash while I was reading from the damaged filesystem, before I ran the scrub. That is not too surprising: unlike most parts of this filesystem, BTRFS RAID-5/RAID-6 is still considered experimental (it is constantly being improved, so this statement may be outdated; it applies to kernel 4.3). This happened only once; when I repeated the test, it did not crash. Still, it shows that even though BTRFS RAID-6 may crash while it is experimental, it protects your data, and a scrub reliably tells you whether there are errors, because it verifies the files against the stored checksums.

I have also repeated the test causing errors on two drives. Since this is RAID-6, that also worked as expected: everything was fine after a scrub.

basic6
