I'm researching ways to build and run a huge storage server (it must run Linux) on which I can run a consistency check and repair on any data array while the applications using the arrays keep reading and writing as usual.
Say you have many TB of data on a single traditional Linux filesystem (EXT4, XFS) used by hundreds of users, and suddenly the system reports a consistency/corruption problem with it, or you know the machine recently went down uncleanly and filesystem corruption is very likely.
Taking the filesystem offline and running a filesystem check can easily mean many hours or days of downtime, since neither EXT4 nor XFS can check and repair itself during normal operation; the filesystem has to be unmounted first.
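To be concrete, this is the offline workflow I'm trying to avoid; the device and mount point below are just placeholders for my setup:

    # every application using /srv/data loses access for the whole repair
    umount /srv/data
    xfs_repair /dev/md0            # or, on EXT4: e2fsck -f /dev/md0
    mount /dev/md0 /srv/data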
How can I avoid this weakness of EXT4/XFS on Linux? How can I build a large storage server that never needs to be taken offline for hours of maintenance?
I've read a lot about ZFS and the reliability it gets from checksumming its data and metadata. Is it possible to check and repair a ZFS filesystem without taking it offline? Would some other newer filesystem, or some other way of organizing the data on disk, be better?
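From what I've read so far, a ZFS scrub is supposed to run while the pool stays imported and in use, something like this (the pool name "tank" is just the usual example, not my actual setup):

    zpool scrub tank       # online walk of all data/metadata checksums; repairs from redundant copies
    zpool status tank      # shows scrub progress and any repaired or unrecoverable errors

but I'd appreciate confirmation that this really covers the "check and fix" case in practice, not just detection.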
Another option I'm considering is dividing the data array into a ridiculously large number (hundreds) of partitions, each with its own independent filesystem, and modifying the applications to spread their data across all of them. Then, when one of them needs to be checked, only that one has to be taken offline. Not a perfect solution, but better than nothing.
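Roughly what I have in mind, with made-up LVM and mount-point names; only the affected shard goes offline while the rest keep serving users:

    # shard 17 reported corruption; all other shards stay mounted and in use
    umount /srv/data/shard17
    e2fsck -f /dev/vg_data/shard17
    mount /dev/vg_data/shard17 /srv/data/shard17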
Is there a perfect solution to this problem?