
I have a Rocks cluster with a NAS that uses a Btrfs file system in a RAID 10 configuration. Recently we have been hitting "no space left on device" errors, which I finally tracked down to the metadata allocation being almost entirely used up. I want to run a balance operation to fix this.

What is not clear to me is whether our users can continue working and accessing their directories on the NAS while the balance is running. The manual on balance states:

"The on-disk state of the filesystem is always consistent so an unexpected interruption (eg. system crash, reboot) does not corrupt the filesystem. The progress of the balance operation is temporarily stored and will be resumed upon mount, unless the mount option skip_balance is specified."

This makes me think that chunks of data only get reallocated after the balancing of that chunk is complete, but I haven't found a definite answer anywhere. Is it safe for users to continue reading and writing data on the NAS during a balance, or is it necessary to take the system offline for a process that could take many hours or days for our terabytes of data?

Abs

1 Answer


Yes, you can do this while the filesystem is online. References to data or metadata are only updated once the balance has completed for a particular chunk, so the filesystem remains consistent even while it is being modified.

If the system is write-heavy, the balance will take somewhat longer, but that is far better than having to take the system offline.
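
As a rough illustration of the workflow around an online balance, here is a minimal Python sketch. It parses the text printed by `btrfs filesystem df` to estimate how full the metadata allocation is, then builds (but does not run) the argument lists for `btrfs balance start` with usage filters and for `btrfs balance status`. The mount point `/mnt/nas`, the sample numbers, the 50% filter value, and all function names are assumptions made up for the example; they are not taken from the asker's system.

```python
import re

# Binary unit suffixes as printed by `btrfs filesystem df`.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2,
         "GiB": 1024**3, "TiB": 1024**4, "PiB": 1024**5}

# Matches lines such as: "Metadata, RAID10: total=20.00GiB, used=19.60GiB"
LINE = re.compile(
    r"^(\w+), (\w+): total=([\d.]+)(\w+), used=([\d.]+)(\w+)$")

# Hypothetical sample output, invented for illustration only.
SAMPLE = """\
Data, RAID10: total=10.00TiB, used=7.50TiB
System, RAID10: total=64.00MiB, used=1.00MiB
Metadata, RAID10: total=20.00GiB, used=19.60GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""


def parse_df(text):
    """Return {block_type: (used_bytes, total_bytes)} for each parsed line."""
    usage = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            kind, _profile, total, t_unit, used, u_unit = m.groups()
            usage[kind] = (float(used) * UNITS[u_unit],
                           float(total) * UNITS[t_unit])
    return usage


def fill_ratio(usage, kind="Metadata"):
    """Fraction of the allocated chunks of this type that is in use."""
    used, total = usage[kind]
    return used / total if total else 0.0


def balance_start_command(mount_point, data_usage=None, metadata_usage=None):
    """Build the argv for a filtered balance; nothing is executed here."""
    cmd = ["btrfs", "balance", "start"]
    if data_usage is not None:
        cmd.append(f"-dusage={data_usage}")      # only data chunks below N% full
    if metadata_usage is not None:
        cmd.append(f"-musage={metadata_usage}")  # only metadata chunks below N% full
    cmd.append(mount_point)
    return cmd


def balance_status_command(mount_point):
    """Build the argv for checking progress of a running balance."""
    return ["btrfs", "balance", "status", mount_point]


if __name__ == "__main__":
    usage = parse_df(SAMPLE)
    print(f"metadata fill: {fill_ratio(usage):.0%}")
    print(" ".join(balance_start_command("/mnt/nas", data_usage=50)))
    print(" ".join(balance_status_command("/mnt/nas")))
```

Which filter and threshold are appropriate depends on how space is allocated on the particular filesystem, so treat the values above as placeholders. A usage filter limits the balance to partially filled chunks, which usually means less I/O competing with users than an unfiltered balance.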

Spooler
  • This seems to make sense. For my own peace of mind (and perhaps an even better answer), have you had experience doing this in the past? Do you know if there is any 'official' information about this online somewhere? Thank you – Abs Nov 16 '17 at 19:55
  • Yeah, I've done it a lot of times. I've been managing Btrfs-based NAS devices of various kinds for a few years. Balancing is a pretty routine operation on them, and I've never had issues with it in modern implementations. Let me see if I can find some documentation on it that's generic enough. – Spooler Nov 16 '17 at 19:57
  • You've already posted the most definitive answer about this from the balance manpage itself, and that's the most commonly referenced note regarding data durability during balance. Beware of using too much I/O, though. Running a balance on an already overburdened rotational system would seriously affect performance, while on SSD I've never had an issue even in heavily used systems. – Spooler Nov 16 '17 at 20:02
  • @Abs, the fact that the system even LETS you do it online should be a strong indicator that it is safe to do so. That is to say, if it wasn't safe to balance online, it would make you take it offline to do so. – psusi Nov 16 '17 at 22:56
  • @psusi I actually really don't like this line of reasoning, because there are in fact plenty of unsafe things that the system will allow you to do, especially as root. I wouldn't count on this fact alone. – Abs Nov 17 '17 at 03:23
  • @Abs, true, but many of them will warn you at least that you are about to do something stupid, like running fsck on a mounted volume. Also it is *only* possible to do a balance on a mounted btrfs volume so clearly this is intended to be safe. – psusi Nov 19 '17 at 01:03