
I'm researching ways to build and run a huge storage server (it must run Linux) where I can run a consistency check and repair on all data arrays while the usual applications using those arrays (reads and writes) keep working as usual.

Say you have many TB of data on a single traditional Linux filesystem (EXT4, XFS) that is used by hundreds of users, and suddenly the system reports a consistency/corruption problem with it, or you know that the machine recently went down in a dirty way and filesystem corruption is very likely.

Taking the filesystem offline and running the filesystem check can easily mean many hours or days of downtime, since neither EXT4 nor XFS can run a check & repair during normal operation; the filesystem needs to be taken offline first.

How can I avoid this weakness of EXT4/XFS on Linux? How can I build a large storage server without ever needing to take it offline for hours of maintenance?

I've read a lot about ZFS and the reliability it gains from its data/metadata consistency checks. Is it possible to run a consistency check and repair on a ZFS filesystem without taking it offline? Would some other newer filesystem, or some other way of organizing the data on disk, be better?

One other option I'm considering is to divide the data array into ridiculously many (hundreds of) partitions, each with its own independent filesystem, and fix the applications to know to use all those partitions. Then, when the need arises to check one of them, only that one has to be taken offline. Not a perfect solution, but better than nothing.

Is there a perfect solution to this problem?

Ján Lalinský
  • ZFS is self-healing and virtually all repair operations one might do manually with ZFS are done online. – Michael Hampton Sep 21 '19 at 21:30
  • Your many-partitions solution is likely a nightmare to maintain. I agree with ewwhite's answer about ZFS and experts. Depending on your requirements, however, another thought: what about writing to a database (or database-backed filesystem) and replicating that, which would give you failover and redundancy as well. – davidgo Sep 23 '19 at 06:08

3 Answers


This would be a case for XFS or ZFS. FSCK is not a concept in the ZFS world.

There's a good amount of skill in building something like this in a robust manner. If there's a budget for bringing in an expert or ZFS consultant, your organization should consider doing so.

ewwhite

The crude reality is that legacy filesystems are not well suited for multi-TB volumes. For example, Red Hat recommends EXT4 filesystems no bigger than 50 TB, with fsck time being one of the limiting factors.

XFS is in better shape, both due to the much faster xfs_repair (compared to the old xfs_check) and to the ongoing project to add online scrub.

EXT4, XFS and other filesystems (BTRFS excluded) can be checked online by taking a snapshot of the main volume and running an fsck against the snapshot rather than the main filesystem itself. This will catch any serious error without requiring downtime, but it clearly needs a volume manager (with snapshot capability) in place under the filesystem. As a side note, this is one of the main reasons why Red Hat uses LVM by default.
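
As a rough sketch of the idea with classic LVM snapshots (the volume group/LV names and the 20G snapshot size are placeholders; the snapshot only needs enough room for the writes that land while the check runs):

```
# Create a snapshot of the logical volume that holds the data
lvcreate --snapshot --size 20G --name data_check /dev/vg0/data

# Run the check against the snapshot, not the live filesystem
e2fsck -fn /dev/vg0/data_check      # EXT4: forced, read-only check
xfs_repair -n /dev/vg0/data_check   # XFS: "no modify" mode, report only

# Discard the snapshot once the check is done
lvremove /dev/vg0/data_check
```

If the check reports real damage you still need a maintenance window to repair the live filesystem, but at least you know about the problem early and can schedule that window rather than discovering it the hard way.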

That said, the best-known and most reliable filesystem with online scrubbing is clearly ZFS: it was designed from the start to efficiently support very large arrays, and its online scrub facility is extremely effective. If anything, it has the opposite problem: it lacks an offline fsck, which would be useful to correct some rare classes of errors.
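
For comparison, a minimal sketch of what the online check looks like on ZFS (the pool name `tank` is a placeholder):

```
# Start an online scrub; applications keep reading and writing
zpool scrub tank

# Watch progress and see whether any errors were found or repaired
zpool status -v tank

# Stop a running scrub if it competes too much with production I/O
zpool scrub -s tank
```

With redundant vdevs (mirror/raidz), the scrub repairs any corrupted blocks it finds from the good copies, all while the pool stays online.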

shodanshok
  • Thanks, I wonder how the snapshot method works. Does it need to replicate all the data to additional disks? – Ján Lalinský Sep 23 '19 at 14:03
  • @JánLalinský No, for a snapshot you don't need any additional disk. A snapshot is a CoW instance of the main filesystem, created and modified "on the fly" during normal operation. I strongly suggest you familiarize yourself with snapshots and related concepts before even trying to manage huge filesystems/volumes. – shodanshok Sep 23 '19 at 16:17
  • Will do. Did you mean LVM snapshots or XFS snapshots? So far I've found LVM snapshots are reported to have many problems, like extremely slow boot or corruption, e.g. https://serverfault.com/a/72743/387576 . Also, sysadmin1138 warns not to use LVM snapshots if XFS is used: https://serverfault.com/a/41202/387576 So XFS snapshots seem preferable. Overall, ZFS seems much simpler and more reliable to set up and maintain; the only problem I see is that ZFS on Linux is not supported by the Linux kernel developers, so there is some uncertainty about its future. – Ján Lalinský Sep 24 '19 at 11:34
  • @JánLalinský What you linked is basically outdated knowledge. Fast-forward to 2019, where we have two kinds of LVM snapshots: classical/legacy vs thinsnap. Both are stable and not susceptible to data loss, but they have vastly different behavior and performance profiles (spoiler: thin snapshots are much faster and more flexible). This is too large a topic for a comment; I strongly advise you to read the official LVM documentation and, maybe, [this mailing list thread](https://www.redhat.com/archives/linux-lvm/2017-April/msg00000.html) for real-world use cases for the two snapshot methods. – shodanshok Sep 24 '19 at 14:36

Do a business continuity analysis by asking the organization how much downtime for this storage is acceptable. Doing better than a handful of planned outages and a couple of hours of downtime per year usually requires investing in a multi-node solution.

Protect against as many downtime risks as you can think of. For example, a fire in the data center will shut things down for a couple hours, whatever the storage technology. If service must continue, replicate the data to a different system in a different building.

Regarding the file system, pick something you can fix and/or your vendor can support. EXT4 will strongly encourage you to fsck every so many mounts. On XFS, fsck does nothing because of the journal; the real check/repair tool (xfs_repair, formerly xfs_check) runs offline. ZFS has no fsck; instead it has online scrubs.
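
For instance, the EXT4 periodic-check behavior can be inspected and, at your own risk, disabled with tune2fs (a minimal sketch; the device path /dev/vg0/data is a placeholder):

```
# Show the current mount-count and time-based check settings
tune2fs -l /dev/vg0/data | grep -Ei 'mount count|check'

# Disable the forced fsck every N mounts / N days
tune2fs -c 0 -i 0 /dev/vg0/data
```

Only do this if you have another way of catching corruption (scrubs, snapshot-based checks); otherwise you are just hiding the problem.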

Splitting data into multiple volumes might make sense to some extent. It would isolate failures, perhaps by organizational unit or application. However, hundreds of small volumes just to keep fsck fast increases work, and one advantage of centrally managed storage was supposed to be less administrative work.

For multi-node availability and performance, consider adding another layer: a scale-out distributed file system such as Ceph, Lustre, or Gluster. These are quite different from one large array. Implementations vary in whether they use a local file system underneath, and in whether they present block or file protocols to users.

John Mahowald