1

Once in a while after a reboot, I found that fsck would fail on the OS disk and booting became impossible on many of my servers. I then had to reinstall the OS and migrate the data from the failed OS disk.
Would the following measures prevent this from happening?

1. Regular scheduled fsck on disks
2. Use RAID 5/6

Any other suggestions and best practices?

user12145
  • Tell us about what type of storage these servers are using. – EEAA Feb 12 '15 at 02:27
  • I would strongly recommend trying to avoid using fsck if possible. If you do use it, use it in interactive mode and don't have it fix root blocks, directories that appear to be files, or anything like that. Occasionally corrupted disks of that sort can be overcome by mounting the drive with an alternate superblock (try looking it up), but in general if you're having weird filesystem problems I would either a) back up the drive before the fsck or b) migrate the data to a fresh setup. Other sysadmins may disagree with this advice. – Some Linux Nerd Feb 12 '15 at 22:34
  • Also if fsck fails sometimes it's bad sectors in important places. I've had good luck with ddrescue - it's like dd that you can have zero out bad sectors. Try ddrescue --no-split /dev/sda /dev/sdb assuming sdb is a new drive you just added. The no-split option just has it skip bad sectors, otherwise it'll take all week to recover the data – Some Linux Nerd Feb 12 '15 at 22:45
  • Oh right, you can also skip fsck if you can mount the drive and create the file /fastboot – Some Linux Nerd Feb 12 '15 at 22:49
  • Definitely back up the drive before doing anything. – Michael Martinez Feb 12 '15 at 23:31
  • @EEAA: The storage is an ext3 or ext4 filesystem on CentOS 6+. – user12145 Feb 13 '15 at 04:12

2 Answers

2

Once in a while after a reboot, I found that fsck would fail on the OS disk and booting became impossible on many of my servers.

Are you doing graceful reboots/shutdowns on these systems? If you are (meaning the filesystems get cleanly unmounted) and you are still seeing corruption, then it's likely that the underlying storage has issues.

What filesystem are you using? Hopefully a journaled one. With journaled filesystems, even if they go down hard (meaning the server is shut down before a clean unmount), large-scale corruption is very unlikely.
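
On ext3/ext4 you can confirm the journal is actually enabled by looking for has_journal in the filesystem features that tune2fs reports. A minimal sketch, assuming the filesystem lives on /dev/sda1 (a placeholder device) and using has_journal_feature as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the `tune2fs -l` output on stdin
# lists has_journal among the filesystem features.
has_journal_feature() {
    grep -q '^Filesystem features:.*has_journal'
}

DEVICE="${DEVICE:-/dev/sda1}"   # assumption: replace with your own partition

# Only query the device if it exists and tune2fs is installed (needs root).
if [ -b "$DEVICE" ] && command -v tune2fs >/dev/null 2>&1; then
    if tune2fs -l "$DEVICE" 2>/dev/null | has_journal_feature; then
        echo "$DEVICE: journal enabled"
    else
        echo "$DEVICE: no journal found (or superblock not readable)"
    fi
fi
```

ext3 and ext4 are normally created with a journal, so a missing has_journal would be worth investigating before anything else.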

1. Regular scheduled fsck on disks

Doing so won't hurt, but it's also not all that necessary if your hardware is good.
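
If you do want periodic checks on ext3/ext4, tune2fs can set a maximum mount count (-c) and a time interval (-i) after which fsck runs at boot. A rough sketch, where /dev/sda1, the count of 30 and the one-month interval are all placeholder assumptions, and periodic_fsck_cmd is a hypothetical helper. It only prints the command unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Hypothetical helper: build the tune2fs command that enables periodic checks.
# $1 = device, $2 = max mount count, $3 = interval (d = days, w = weeks, m = months)
periodic_fsck_cmd() {
    echo "tune2fs -c $2 -i $3 $1"
}

DEVICE="${DEVICE:-/dev/sda1}"   # assumption: replace with your own partition
DRY_RUN="${DRY_RUN:-1}"         # default: only show what would be run

cmd=$(periodic_fsck_cmd "$DEVICE" 30 1m)
if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
else
    $cmd                        # needs root
fi

# Current settings can be inspected with:
#   tune2fs -l /dev/sda1 | grep -Ei 'mount count|check'
```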

2. Use RAID 5/6

RAID won't do a thing for you here. RAID protects against disk hardware failure, not filesystem corruption.

EEAA
0
  • Run a memory test (for example Memtest86+) on your RAM. Bad RAM will cause random corruption in your filesystem.
  • Run a health check on your hard drives (for example SMART self-tests) and on your mainboard.
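
For the drive part, smartctl (from the smartmontools package) can report the overall SMART health of each disk. A minimal sketch, assuming ATA/SATA disks that show up as /dev/sd? and using smart_passed as a hypothetical helper. SCSI/SAS drives print a different status line, so the pattern would need adjusting there. RAM tests like Memtest86+ are normally booted separately and are not covered:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the `smartctl -H` output on stdin
# contains the ATA "test result: PASSED" line.
smart_passed() {
    grep -q 'test result: PASSED'
}

# Only run if smartctl is installed; it needs root to query the drives.
if command -v smartctl >/dev/null 2>&1; then
    for dev in /dev/sd?; do
        [ -b "$dev" ] || continue          # skip if the glob matched nothing
        if smartctl -H "$dev" 2>/dev/null | smart_passed; then
            echo "$dev: SMART health PASSED"
        else
            echo "$dev: SMART health not confirmed, look at: smartctl -a $dev"
        fi
    done
fi
```

A passing overall status does not rule out a failing disk, so it is also worth starting a self-test with smartctl -t short and reading the full smartctl -a output for reallocated or pending sectors.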
Michael Martinez