3

I am experiencing a fairly typical but strange problem: the data on my server's HDD gets trashed after power cuts. I can see fsck doing a huge amount of work on startup after the crash, and then my Subversion repository loses months of work, which seems weird to me. The server is running ext4 on that hard drive, so it is supposed to be safe, but it isn't. I am starting to suspect a hard drive problem, but could there be other causes?

The relevant fstab line is

/dev/mapper/vg_data-LV_data /data ext4 defaults 1 2

and the system is Fedora 11 x86_64.

Michael Pliskin

2 Answers

8

Irrespective of the claims that any filesystem makes about being resilient to unclean shutdowns, I'd never allow a production server computer to run without power protection. To my mind, there are too many potential layers of caching and too much abstraction for the OS to be absolutely sure that data really is committed (even when the disk subsystem claims it is).

It's not clear to me whether Fedora 11 shipped with the ext4 delayed allocation bug fixed or not. It looks like it did, but the phrasing of the FAQ isn't 100% clear (and I don't have time to look through the kernel SRPM for Fedora 11 right now).

For background: Kernel 2.6.30 changed the default behaviour of ext4 so that delayed allocation is less dangerous. Delayed allocation itself is still on by default, but from 2.6.30 the kernel forces the delayed blocks to be allocated when a file is replaced via rename or truncate. Prior to 2.6.30 that safeguard was missing, and delayed allocation could cause data loss (typically zero-length files) if power was lost before disk operations were committed. (Reference at http://en.wikipedia.org/wiki/Fedora_(operating_system) and http://en.wikipedia.org/wiki/Ext4#Delayed_allocation_and_potential_data_loss and background at http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/).

Make sure you're running the most updated kernel for Fedora 11 and, if at all possible, stop allowing the filesystem to be taken down hard.
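To rule delayed allocation out explicitly rather than rely on kernel defaults, ext4 accepts a `nodelalloc` mount option. A minimal sketch, assuming the fstab line from the question is otherwise left unchanged (expect some loss of write performance, and note that this does nothing about caching below the filesystem):

```
/dev/mapper/vg_data-LV_data /data ext4 defaults,nodelalloc 1 2
```

The option takes effect the next time /data is mounted; the options actually in use can be checked in /proc/mounts afterwards.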

Evan Anderson
  • 3
    Unusual part is that the poster is saying it's losing months of work multiple times. I wonder if there is something odd or corrupted in the filesystem and the power cutoffs are exacerbating it. – Bart Silverstrim Jul 01 '10 at 11:38
  • I've checked the kernel and it is kernel.x86_64 2.6.30.10-105.2.23.fc11, so it seems to be the right one. However, is there a way to disable delayed allocation in fstab or in some other way, just in case? @Bart Silverstrim: you might be right, it sounded suspicious to me too. I've just started a deep disk check to see if that's the case. – Michael Pliskin Jul 01 '10 at 12:11
  • @Michael: Hope you have good backups...it's possible the repair could make more than a few months of data disappear. – Bart Silverstrim Jul 01 '10 at 12:16
  • 1
    It's struck me as kind of weird-funny that we consider disk utilities to be repair tools when really they try to make the filesystem consistent. Sometimes consistent means wiping out part of the filesystem, leaving new admins scratching their heads as to what happened. – Bart Silverstrim Jul 01 '10 at 12:17
3

Depending on how fancy your LVM setup is, the problem might be that LVM disregards I/O barriers. Barriers on simple linear devices should work as of 2.6.30 (which you seem to have), but more complicated device-mapper setups are only expected to work as of 2.6.33.
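One cheap first check is whether barriers have been switched off at mount time. The sketch below assumes the options show up in /proc/mounts as `nobarrier` or `barrier=0`; the helper names `mount_opts` and `barriers_disabled` are hypothetical, made up for this example. It only inspects mount options, so a clean result does not prove that device-mapper actually passes barriers down to the disk.

```shell
# Sketch only: helper names are hypothetical, not part of any tool.

# Print the mount options recorded for a mount point.
# $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
mount_opts() {
    awk -v mp="$1" '$2 == mp { print $4 }' "${2:-/proc/mounts}"
}

# Succeed (exit status 0) if the comma-separated option list in $1
# explicitly disables write barriers.
barriers_disabled() {
    case ",$1," in
        *,nobarrier,*|*,barrier=0,*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the server this could be used as `barriers_disabled "$(mount_opts /data)" && echo "barriers are off for /data"`. It is also worth searching the kernel log with `dmesg | grep -i barrier`, since the kernel may report it when a barrier request fails on the underlying device.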

janneb
  • Agree that write barriers may well be the problem - see http://serverfault.com/questions/279571/lvm-dangers-and-caveats/279577#279577 for more details on LVM and write caching. Also, https://lwn.net/Articles/322823/ covers ext4 data loss issues. Buying a UPS is also a good idea so you have clean shutdowns on extended power outages. – RichVel Dec 07 '11 at 12:25