9

I recently experienced a filesystem meltdown. I had a server running for about 180 days nonstop without any issues, but then I noticed weird stuff happen and apparently the ext3 filesystem was in really bad shape. I had the drives and the memory tested and they were all fine. Ultimately, I was forced to hose the system and do a full reinstall. fsck.ext3 only made things worse.

Now, I don't want this to happen again, so this time I went with XFS instead, which I feel is more mature than ext3, but I am at a loss as to how to monitor the health of the filesystem. xfs_check simply won't let me scan the device while it is mounted.

So, how do you monitor the health of an XFS filesystem while the system is online?

Yuri
zidar

6 Answers

8

Truthfully there isn't much you can do to monitor the operational health of the filesystem itself. This thread explains the reasons why you can't perform an fsck-style check on a filesystem which is online as read/write.

In part, you should trust that, as a journalling filesystem, XFS is doing its best to keep your data in good health. You may also take some solace in knowing that xfs_check is much faster than fsck.ext3, and that XFS doesn't stipulate periodic checks in the same way as ext3's 180 day / x mounts rule.


Edit to comments:

While I understand that you're once bitten, twice shy, I can assure you that "complete meltdown" isn't a systemic issue associated with UNIX filesystems. In my experience such events tend only to materialise hand in hand with hardware failure, user error (no disrespect intended), or an unfortunate mixture of both. However, it is hard to reason about this on a technical level without some very specific details of what went wrong with your previous ext3 install.

Dan Carley
  • 1
    +1, exactly my thoughts as well. Just leave it alone :) – pauska Dec 07 '09 at 17:24
  • But how do I know I'm not facing a full system meltdown like I just had with ext3? Surely there must be some kind of counter or something I can monitor.. – zidar Dec 07 '09 at 19:01
  • 1
    Of course you can watch the health of the underlying physical disk that holds the filesystem. If bad blocks appear on the disk, they could potentially lead to filesystem errors. There are some tools that can read SMART data (smartd). – nrgyz Dec 08 '09 at 00:57
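To make the SMART suggestion above a little more concrete, here is a minimal sketch of a health probe that could be run from cron. It assumes `smartctl` (from the same smartmontools package as smartd) is installed, and that `smartctl -H` prints either the usual ATA line ("SMART overall-health self-assessment test result: ...") or the SCSI line ("SMART Health Status: ..."). The function names are made up for illustration. Note that this watches the disk, not the XFS metadata itself.

```python
import re
import subprocess
from typing import Optional

# Assumed wording of the summary lines printed by `smartctl -H`.
_ATA_RESULT = re.compile(r"overall-health self-assessment test result:\s*(\w+)")
_SCSI_RESULT = re.compile(r"SMART Health Status:\s*(\w+)")


def parse_smart_health(output: str) -> Optional[bool]:
    """Return True if healthy, False if failing, None if no verdict was found."""
    match = _ATA_RESULT.search(output)
    if match:
        return match.group(1) == "PASSED"
    match = _SCSI_RESULT.search(output)
    if match:
        return match.group(1) == "OK"
    return None


def check_device(device: str) -> Optional[bool]:
    """Run `smartctl -H` on a device (hypothetical helper; usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return parse_smart_health(result.stdout)
```

Something like `check_device("/dev/sda")` returning `False` or `None` would be the cue to raise an alert; smartd can do the same job continuously if you prefer a daemon.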
6

Put the filesystem on an LVM logical volume, create a temporary snapshot of the logical volume, and then fsck this snapshot (while the logical volume is still online).

Maybe Theodore Ts'o's e2croncheck script for ext3 will get you started.
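As a rough sketch of the snapshot approach for XFS (not a port of e2croncheck), the script below builds the three commands involved: create a snapshot, check it without modifying it, then remove it. The volume group and volume names, the snapshot name and the 1G snapshot size are placeholders I made up. I am also assuming `xfs_repair -n` (no-modify mode) is an acceptable checker for the snapshot. By default it only prints the commands.

```python
import subprocess


def snapshot_check_commands(vg, lv, snap_name="fscheck-snap", snap_size="1G"):
    """Return the commands for a snapshot-based, read-only check, in order."""
    origin = f"/dev/{vg}/{lv}"
    snapshot = f"/dev/{vg}/{snap_name}"
    return [
        # Create a temporary copy-on-write snapshot of the live volume.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap_name, origin],
        # Check the snapshot in no-modify mode; the live volume is untouched.
        ["xfs_repair", "-n", snapshot],
        # Drop the snapshot again so it does not fill up.
        ["lvremove", "-f", snapshot],
    ]


def run_snapshot_check(vg, lv, dry_run=True, **kwargs):
    """Print the commands (dry run) or execute them; returns the checker's exit code."""
    create, check, remove = snapshot_check_commands(vg, lv, **kwargs)
    if dry_run:
        for cmd in (create, check, remove):
            print(" ".join(cmd))
        return None
    subprocess.run(create, check=True)
    try:
        # A non-zero exit code from the checker is what you would alert on.
        return subprocess.run(check).returncode
    finally:
        # Always remove the snapshot, even if the check itself failed.
        subprocess.run(remove, check=True)
```

Calling `run_snapshot_check("vg0", "data")` just prints what would be run; pass `dry_run=False` only once you are happy with the names and have enough free extents in the volume group for the snapshot.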

(As 3dinfluence mentioned: ZFS is definitely the better solution...)

MadMike
knweiss
  • 5
    +1, but make sure you run xfs_freeze before taking a snapshot otherwise you'll be checking an inconsistent filesystem... – James Dec 08 '09 at 09:21
5

I noticed weird stuff happen

Then the issue is not the filesystem (or at least it's extremely unlikely). ext3 is one of the most widely used filesystems, and any bug severe enough to cause catastrophic corruption should have already been found and fixed.

The cause lies elsewhere, possibly in the hardware itself (maybe the RAM).

To answer your question: you can check an XFS filesystem online, but only if it's mounted read-only.

Luca Tettamanti
3

Short disclaimer: I love XFS and its speed. This isn't so much a rant as it is a warning.


Immediate answer: no, you will need to unmount the filesystem to perform the check. Running a fsck on a live filesystem is a bad thing. The filesystem is constantly changing underneath such an examination, meaning you can never really be sure if it is consistently being examined, or worse, if your "repairs" won't make it worse.

While this is not a direct answer, it is a clear one. Ext3 is probably a better option for you, and if you're experiencing corruption with Ext3 then you'll want to re-examine your hardware. For the love of ${DEITY}, you shouldn't be using XFS if you're looking for something that won't (potentially) lose data during recovery. Under certain circumstances it will zero out data blocks during recovery.

Quoted from the 2nd link:

5.1 Write Failures

Data: We see that data errors are mostly ignored or little action is taken other than informing the user of the error. In most cases data loss occurs silently without the knowledge of the user.

Keep in mind that XFS was originally designed with video work in mind. If you had a damaged video file, it wasn't a big deal: you could always splice in video to patch the "bad spot". Waiting a few days for a fsck on a 14 terabyte filesystem, on the other hand, was a big deal, so XFS trades some data integrity for shorter check times.

Avery Payne
3

Checking the consistency of any filesystem that is currently mounted is simply NOT recommended.

nrgyz
  • I don't want to modify or repair anything, just to make sure it's okay without taking the production critical server offline. – zidar Dec 07 '09 at 16:53
  • 2
    You don't want to do this. Seriously. Don't check filesystems while they are hot (mounted) unless you are *completely desperate* and have no other choice. Finding corruption errors in this manner is like finding the needle in the haystack that's inside a tornado going over an earthquake. – Avery Payne Feb 03 '10 at 18:13
2

Filesystem corruptions happen regardless of what file system you're using. I've had both Ext3 and XFS file systems go south on me over the years.

ZFS, while not available on Linux other than via FUSE, does have an online background scrub which can detect and repair errors before you encounter data loss. It also checksums all filesystem data and should detect and report any errors it encounters. It should be able to recover and heal itself from most of these, but even with all the checksumming that ZFS does, there have been some extreme cases, normally hardware issues, where a ZFS filesystem has been corrupted.
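To illustrate the idea behind a scrub, here is a toy sketch: every block has a stored checksum, each mirrored copy is re-read and compared against it, and a bad copy is rewritten from a good one. This is only an illustration of the principle. The in-memory lists, the SHA-256 choice and the function names are my own assumptions, not ZFS's actual on-disk format or implementation.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of one block (SHA-256 chosen only for the illustration)."""
    return hashlib.sha256(data).hexdigest()


def scrub(mirrors, checksums):
    """Verify every copy of every block and repair bad copies from a good one.

    mirrors   -- list of mirror sides, each a list of blocks (bytes)
    checksums -- expected checksum for each block index
    Returns a list of (block_index, mirror_index, outcome) tuples.
    """
    report = []
    for index, expected in enumerate(checksums):
        # Find any copy of this block that still matches its checksum.
        good = next(
            (side[index] for side in mirrors if checksum(side[index]) == expected),
            None,
        )
        for number, side in enumerate(mirrors):
            if checksum(side[index]) == expected:
                continue  # this copy is fine
            if good is None:
                report.append((index, number, "unrecoverable"))
            else:
                side[index] = good  # self-heal from the intact copy
                report.append((index, number, "repaired"))
    return report
```

For example, with two mirror sides and one silently corrupted block, `scrub` reports that block as repaired and a second pass reports nothing. If every copy of a block is bad, all it can do is report it as unrecoverable, which is where backups come in.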

The best thing to do is have a good backup strategy and DR plan in place. Restoring data from a known good backup is the fastest way to recover from these sorts of issues. Going through lost+found is a painful, error-prone process.

3dinfluence