I have a Xen domU provided by a third party, running Ubuntu (10.04, server edition, stock -server kernel). The server runs Dovecot and Exim4, with mail stored in Maildirs, and a fairly typical LAMP stack with most applications written in Perl and all data stored either in a directory tree full of TIFF files or in a MySQL DB. It has been in operation for around 3 months for the LAMP stuff, and a month serving mail. All filesystems (except swap) are Ext3.

A couple of weeks ago we suddenly found a whole bunch of TIFF files which were no longer accessible, as noted by our backup script (using rsync). rsync on the remote host reported the following errors:

rsync: readlink_stat("/srv/data/documents/archive/pdf/2007/Aug/06/085717/00000002.TIF") failed: Input/output error (5)
rsync: readlink_stat("/srv/data/documents/archive/pdf/2007/Aug/06/085717/00000001.TIF") failed: Input/output error (5)
rsync: readlink_stat("/srv/data/documents/archive/pdf/2011/Jan/04/125227/XSMDESC.DAT") failed: Input/output error (5)
rsync: readlink_stat("/srv/data/documents/archive/pdf/2011/Jan/04/125227/DOC010.XST") failed: Input/output error (5)
rsync: readlink_stat("/srv/data/documents/archive/pdf/2011/Jan/04/125227/00000001.TIF") failed: Input/output error (5)

...and so on. The files will have been created either in late December or on the date given in the path, whichever is later, as we migrated our data to this machine late last year. To my knowledge, no process has written to the files since -- only read from them.

Throughout that day we noted the list of affected files increasing, so that night we unmounted the filesystem (a Xen Virtual Block Device) and ran an fsck, which found and fixed many, many errors. The affected files were now gone, but the corruption also stopped spreading once the fsck was complete and the filesystem was remounted.
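For completeness, the check amounted to something like the following. This is a sketch only: /dev/xvdb1 and /srv/data are placeholders for the actual VBD and mount point, and the -f flag simply forces a full check even if the filesystem claims to be clean.

# Placeholders: /dev/xvdb1 = the Xen VBD, /srv/data = its mount point
umount /srv/data
fsck.ext3 -f -v /dev/xvdb1                       # force a full check and report what it finds
tune2fs -l /dev/xvdb1 | grep -iE 'state|error'   # inspect the filesystem state afterwards
mount /srv/data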

(As an aside, to illustrate the kind of luck we've had here -- the single disk holding our only backup of this data died catastrophically the same afternoon. Yes, really. Our only other backup was from Dec 10th 2010...)

It may or may not be relevant that the vast majority of the files affected were created on Jan 4th or 5th of this year -- however some were documents from 2006/7, and some were newer.

With the fsck complete and the machine now apparently stable, we were worried -- neither the hosting provider nor we could find a root cause -- and we'd lost data, but at least the corruption had stopped.

Skip forward several days, and a routine mysqldump refuses to dump 3 tables because they are marked as crashed. mysqlcheck confirms this, and REPAIR TABLE [foo] fixes all 3, in 2 cases reporting fewer rows found afterwards than before. Again the vendor can see no root cause; there has been no interruption to power, disk access or mysqld. The problem appears unrelated, but in 3 months of hosting on this server we've already lost more data than in several years of running these applications on a variety of different (but never virtual!) platforms.
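For the record, the check and repair were nothing exotic; roughly the following, where ourdb and foo are placeholders for the real database and table names:

# Placeholders: ourdb = the database, foo = an affected table
mysqlcheck --check ourdb                  # reports the affected tables as crashed
mysql ourdb -e 'REPAIR TABLE foo;'        # one per affected table; reports rows found
mysqlcheck --check ourdb                  # confirm everything now reports OK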

Finally, this week we have found 3 files on the FS which appear to have turned to binary gunk -- more specifically, all 1s (or all 0xFF bytes, if you prefer). All 3 files (2 small text config files and one 100-ish line Perl script) were part of our web application, and would be frequently read but written only when we deployed a new version. Deployment works by updating a local "working" copy, exporting that working copy to get a clean fresh install, and pointing a symlink at that fresh install. The files were broken in the working copy and propagated from there, and the modification times on all the files were consistent with them not having been changed for many weeks (in which time there had been several deployments, all of which went fine!), so the content clearly changed without the mtime being updated.
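For anyone who wants to check their own tree for the same symptom, here is a quick sketch (bash) that flags non-empty files made up entirely of 0xFF bytes; /srv/www/app is a placeholder for the directory to scan:

# Placeholder path: /srv/www/app
find /srv/www/app -type f -size +0c -print0 |
while IFS= read -r -d '' f; do
    # tr deletes every 0xFF (octal 377) byte; if nothing remains, the file was all 0xFF
    if [ "$(tr -d '\377' < "$f" | wc -c)" -eq 0 ]; then
        echo "all 0xFF: $f"
    fi
done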

Any one of these events would have me restoring from backups, scratching my head and carrying on with my life, but 3 in a fortnight has me waiting for the next thing to happen.

My question is simple: is it even possible that these 3 events are connected, and if so, where should I be looking for a root cause?

(Answers regarding solutions are also welcome; however, we are already in the process of setting up a parallel platform running CentOS, on VMware, with the same vendor, to try to rule out distribution, kernel, hypervisor and virtual block device related issues. It would be great to know which of those was the issue, but if we don't have a diagnosis and replacing that whole stack works, that'll help me sleep at night ... eventually.)

As always if any extra information would help, please comment and I will update accordingly!

James Green
  • For what it's worth, about 8 hours after I posted this the VM failed catastrophically. The vendor's last good backup of the machine state + data was ... 14 days old, from exactly the day we first noted corruption on the FS. Hilariously they claim these events are unrelated... – James Green Feb 14 '11 at 20:30

1 Answer


It looks like the vendor's backup software corrupted the filesystem.

We had a similar case where a DomU started misbehaving after it got backed up by an unpatched version of our standard backup client.

We tried to repair the FS twice, but it kept misbehaving (files could not be read, and so on).

The solution was to completely re-create the filesystems (mkfs), install the latest patched version of the standard backup client, and then restore the last known-good data.
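In shell terms the recovery boils down to something like this. It is only a sketch: /dev/xvdb1 and /mnt/restore are placeholders, and the final step depends on whatever backup tooling you use.

# Placeholders: /dev/xvdb1 = the affected block device, /mnt/restore = a temporary mount point
umount /dev/xvdb1
mkfs.ext3 /dev/xvdb1                                  # wipe and re-create the filesystem
mount /dev/xvdb1 /mnt/restore
rsync -a /path/to/last-good-backup/ /mnt/restore/     # restore the last known-good data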

We were lucky here: The data partition (/opt) was still intact and never lost anything. The corrupted partitions just contained / and /var.

Nils