38

Recently I've seen the root filesystem of a machine in a remote datacenter get remounted read-only, as a result of consistency issues.

On reboot, this error was shown:

UNEXPECTED INCONSISTENCY: RUN fsck MANUALLY (i.e., without -a or -p options)

After running fsck as suggested, and accepting the corrections manually with Y, the errors were corrected and the system is now fine.

Now, I think it would be interesting if fsck were configured to run and repair everything automatically, since the only alternative in some cases (like this one) is going in person to the remote datacenter and attaching a console to the affected machine.

My question is: why does fsck ask for manual intervention by default? How and when would a correction performed by such a program be unsafe? In which cases might the sysadmin want to set a suggested correction aside for a while (to perform some other operations) or abort it altogether?
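
For reference, this is roughly the kind of setup I have in mind. It's only a sketch, and the exact knobs depend on the distribution and init system; FSCKFIX and fsck.repair are the Debian-style and systemd options I'm aware of, and /dev/sda1 is just a placeholder device:

```
# Debian/Ubuntu with sysvinit: have the boot-time fsck answer "yes" to
# repair questions instead of dropping to a maintenance shell
# (setting in /etc/default/rcS):
FSCKFIX=yes

# systemd-based systems: kernel command line parameter read by
# systemd-fsck, e.g. appended to GRUB_CMDLINE_LINUX in /etc/default/grub:
fsck.repair=yes

# One-off manual run that assumes "yes" to every question:
fsck -y /dev/sda1
```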

0x5C91
  • 483
  • 1
  • 5
  • 10
  • 15
    If the developers were 100% confident the error could be fixed automatically, then it wouldn't be an error in the first place. – user253751 Jun 28 '16 at 11:23

3 Answers

42

fsck definitely causes more harm than good if the underlying hardware is somehow damaged: a bad CPU, bad RAM, a dying hard drive, a disk controller gone bad... in those cases more corruption is inevitable.

If in doubt, it's a good idea to just take an image of the corrupted disk with dd_rescue or some other tool, and then see if you can successfully fix that image. That way you still have the original setup available.
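
For example, something along these lines; a rough sketch, where /dev/sdb1 is the damaged partition and /mnt/rescue is a mount point on a separate, healthy disk (both placeholders):

```
# Copy the damaged partition to an image file, skipping unreadable
# sectors instead of aborting on the first error.
dd_rescue /dev/sdb1 /mnt/rescue/sdb1.img
# GNU ddrescue equivalent, with a map file so the copy can be resumed:
#   ddrescue -d /dev/sdb1 /mnt/rescue/sdb1.img /mnt/rescue/sdb1.map

# Try the repair against the image first; e2fsck works on plain files.
e2fsck -f /mnt/rescue/sdb1.img

# If the repaired image looks sane, inspect it via a read-only loop mount.
mount -o loop,ro /mnt/rescue/sdb1.img /mnt/check
```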

Janne Pikkarainen
  • 31,454
  • 4
  • 56
  • 78
  • 4
    I've worked a lot with failing hardware and I agree with this. The last thing I want to do is fsck if there's suspected bad hardware of any sort. I've also seen a low power event and subsequent recovery which was greatly delayed by automatic fsck. – jorfus Jun 28 '16 at 20:54
  • To give a concrete example: I have worked on a machine with a disk controller that "randomly" (about 1 time in 10^5) would turn a read or a write to block XXXXXXYY on any device into a write to block 000000YY on the first device. I.e., it frequently blasted wrong data, both structured and unstructured, onto the boot sector and various critical filesystem structures of the boot disk. Running fsck in such a situation (millions of reads) can eliminate any remaining chance of recovering data. – Eric Towers Jun 28 '16 at 21:15
  • 2
    1 in 10^5 is a lot... that's 10 bytes every MB. – Nelson Jun 29 '16 at 01:07
  • 1
    @Nelson : It sort of is... The unit there is "single block transfers", not "bytes". So ten bad block writes per million blocks (and blocks are significantly larger than bytes). – Eric Towers Jun 29 '16 at 03:18
21

You have seen one example where fsck worked, but I've seen more than enough damaged file systems where it did not work successfully at all. If it ran fully automatically, you might have no chance to do things like a dd disk dump first, which in many cases would be an excellent idea before attempting a repair.

It's never, ever a good idea to try something like that automatically.

Oh, and modern servers should have remote consoles or at least independent rescue systems to recover from something like that without lugging a KVM rack to the server.

Sven
  • 97,248
  • 13
  • 177
  • 225
  • 7
    Actually, what's not a good idea is to say "**never, ever**" like that, when it isn't true. Use case where it is a good idea: the server's main partitions can be re-created from scratch rather quickly in case of a problem. Actually important data gets accessed via a remote filesystem, with appropriate redundancy in place for that data. I'd much rather take the chance of `fsck -p /` and `fsck -p /var`, etc., working fine, and getting the server up without manual intervention, and risk the small, non-zero % chance of major catastrophe to those partitions, which I can just re-create if needed. – TOOGAM Jun 29 '16 at 04:45
  • 1
    If the system can be easily reinstalled, I just do that ... – Sven Jun 29 '16 at 09:21
  • 2
    That would take longer. Options are: A) Risk doing it automatically. B) Have someone tell `fsck` to preen, and then everything works fine. Takes about 2 minutes, if that. Downtime until this happens. C) Have someone re-install the operating system. Takes 30+ minutes. You're choosing option C? Maybe a key difference we have is that I've had `fsck` work a greater percentage of the time than what you quote in your answer. My main point wasn't the system design (this cheap-o system doesn't use a remote console), but just that saying "**never, ever**" was too strong a phrase to be accurate. – TOOGAM Jun 29 '16 at 14:36
  • Let's just agree to disagree. – Sven Jun 29 '16 at 14:49
0

First of all, you need to understand that with modern (journaling) filesystems, a system crash will not corrupt the filesystem and no fsck will be required at boot time.

Ext3, Ext4, ZFS, Btrfs, XFS and all other modern filesystems are 100% consistent after a crash or system reset.

Non-journaling filesystems like ext2 or vfat are a big no-go for a system rootfs.
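
If you are not sure whether your rootfs actually has a journal, it is easy to check, and an ext2 filesystem can be given one. A sketch for the ext family, with /dev/sda1 standing in for whatever device holds your rootfs:

```
# List the filesystem features; "has_journal" means a journal is present.
tune2fs -l /dev/sda1 | grep -i features

# Add a journal to a plain ext2 filesystem (effectively turning it into ext3).
tune2fs -j /dev/sda1
```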

Now, if your system requires a fsck at boot time, you should ask yourself: what was the reason for this in the first place?

You should investigate your kernel logs afterwards to find out when and what happened. You should also go back further in the logs to find out when the errors started. You should check your disks with smartctl. Etc... If you need an fsck on a journaling fs, it is virtually certain that your hardware is failing, assuming the fs was not damaged by an admin (with block-level tools like dd) or by a bug.
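
A sketch of that kind of investigation; device names and grep patterns are only examples:

```
# Look for I/O and filesystem errors in the kernel log around the incident.
dmesg | grep -iE 'i/o error|remount.*read-only|ext4'
journalctl -k -b -1 | grep -i error   # previous boot, on systemd systems

# Check the drive's own health counters and run a self-test.
smartctl -a /dev/sda
smartctl -t long /dev/sda
```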

So it is silly to use fsck to "fix" the problem without investigating and fixing the root cause (by replacing/upgrading the faulty hardware/firmware/software).

Doing an fsck, completing the boot and being happy is naive, to say the least. Stating "I've had fsck work a greater percentage of the time than what you quote" makes me wonder what you mean by "fsck work". fsck may have brought your fs back to a consistent state by losing some files and data in the process... Did you compare with a backup? Many people lose files or get file data corruption without noticing...
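
If you do want to know whether an fsck silently dropped something, compare the repaired filesystem against a known-good backup. A sketch, with made-up paths, using a checksum-based rsync dry run:

```
# Report files that differ from, or are missing compared to, the backup,
# without changing anything (-n = dry run, -c = compare by checksum).
rsync -rcnv --delete /backup/latest/ /mnt/repaired/

# Also look at what fsck parked in lost+found.
ls -l /mnt/repaired/lost+found/
```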

Pierre.Vriens
  • 1,159
  • 34
  • 15
  • 19