
I have a server which backs itself up to another server with duplicity (actually duply). The full backup consists of about 330 1-GB files. The full backup finished without problems, but the next day the incremental backup terminated with "CRC check failed". On the backup server several files appear to have a problem:

# gzip *20170530* --test
gzip: duplicity-full-signatures.20170530T032515Z.sigtar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol139.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol139.difftar.gz: invalid compressed data--length error
gzip: duplicity-full.20170530T032515Z.vol146.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol169.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol171.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol171.difftar.gz: invalid compressed data--length error
gzip: duplicity-full.20170530T032515Z.vol193.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol223.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol224.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol233.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol301.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol310.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol310.difftar.gz: invalid compressed data--length error
gzip: duplicity-full.20170530T032515Z.vol53.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol53.difftar.gz: invalid compressed data--length error
gzip: duplicity-full.20170530T032515Z.vol63.difftar.gz: invalid compressed data--crc error
gzip: duplicity-full.20170530T032515Z.vol63.difftar.gz: invalid compressed data--length error
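
Since gzip prints one line per error rather than one per file, a small loop can produce a clean list of the damaged volumes. This is only a minimal sketch: the function name `list_bad_gzip` is made up for illustration (it is not part of duplicity or duply), and it relies solely on `gzip --test` returning a non-zero exit status for a damaged file.

```shell
# list_bad_gzip: print the name of every file that fails `gzip --test`.
# (Hypothetical helper name, used here for illustration only.)
list_bad_gzip() {
    for f in "$@"; do
        # Skip arguments that are not regular files (e.g. an unmatched glob).
        [ -f "$f" ] || continue
        # gzip exits non-zero on CRC or length errors; hide its own messages.
        if ! gzip --test "$f" 2>/dev/null; then
            printf '%s\n' "$f"
        fi
    done
}

# Example: list_bad_gzip *20170530*
```

Each damaged file is printed once, even if gzip reports both a CRC and a length error for it.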

If only one file had an error I'd just retry, hoping it was a random error. But 13 files with errors? How should I debug this?

Both servers run Debian 8. Duplicity is 0.6.24, installed with apt, as are all its dependencies, with the exception of paramiko, for which version 1.16.0 has been installed.

The main server's logs do show some SATA stuff, but would this result in silently corrupting files? Wouldn't the full backup have stopped with an I/O error or something? Here's an example of stuff written in the log:

May 31 06:49:11 acheloos kernel: [1887359.720042] ata3.00: exception Emask 0x50 SAct 0x40000 SErr 0x280900 action 0x6 frozen
May 31 06:49:11 acheloos kernel: [1887359.720472] ata3.00: irq_stat 0x08000000, interface fatal error
May 31 06:49:11 acheloos kernel: [1887359.720870] ata3: SError: { UnrecovData HostInt 10B8B BadCRC }
May 31 06:49:11 acheloos kernel: [1887359.721255] ata3.00: failed command: READ FPDMA QUEUED
May 31 06:49:11 acheloos kernel: [1887359.721639] ata3.00: cmd 60/40:90:ac:3b:d8/00:00:2e:00:00/40 tag 18 ncq 32768 in
May 31 06:49:11 acheloos kernel: [1887359.721639]          res 40/00:94:ac:3b:d8/00:00:2e:00:00/40 Emask 0x50 (ATA bus error)
May 31 06:49:11 acheloos kernel: [1887359.722430] ata3.00: status: { DRDY }
May 31 06:49:11 acheloos kernel: [1887359.722927] ata3: hard resetting link
May 31 06:49:11 acheloos kernel: [1887360.040025] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May 31 06:49:11 acheloos kernel: [1887360.041846] ata3.00: configured for UDMA/133
May 31 06:49:11 acheloos kernel: [1887360.041859] ata3: EH complete
Antonis Christofides

1 Answer

May 31 06:49:11 acheloos kernel: [1887359.720870] ata3: SError: { UnrecovData HostInt 10B8B BadCRC }

This sums it up: an unrecoverable error occurred and some data was lost. How many drives do you have in the host? The usual cause is a bad drive, so the first step would be to replace that, and then move on to the RAM and then the controller itself (via the motherboard).

You could also query the SMART data from the drives with `smartctl --health /dev/sda` (change the drive name) and `smartctl --all /dev/sda`; https://www.faqforge.com/linux/get-the-disk-health-status-with-smart-monitor-tools-on-debian-and-ubuntu-linux/ describes that in more detail. You are looking for any error counters that keep incrementing.
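
As a rough illustration of what to look for, the attribute table printed by `smartctl --attributes` can be filtered for a few counters that commonly indicate trouble. This is a sketch under assumptions: the helper name `smart_nonzero` is invented here, the attribute names are ones many drives report but vendors vary, and the raw value is assumed to sit in the tenth column of the table.

```shell
# smart_nonzero: read `smartctl --attributes` output on stdin and print
# selected counters whose raw value is above zero.
# (Hypothetical helper; attribute names and column position are assumptions.)
smart_nonzero() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 > 0 {
            print $2, $10
        }
    '
}

# Example (run as root, change the drive name):
#   smartctl --attributes /dev/sda | smart_nonzero
```

A non-zero value is only a hint. What matters is whether the number grows between runs, so it is worth saving the output and comparing it a day later.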

You can test whether it is the drive by making a backup to an alternate drive and seeing whether those files show corruption. You might also run `badblocks` (https://linux.die.net/man/8/badblocks) to see if the existing drive has issues. If you can vacate the drive first, then the destructive test is better, but it is more work.
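
The copy test can be sketched as follows: copy a file to the alternate drive, then compare the copy byte for byte against the original. The helper name `copy_and_compare` and the paths in the example are placeholders, not anything from your setup. One caveat to keep in mind: a file that has just been read may be served from the page cache on the second read, so it is worth testing files larger than RAM or repeating the comparison after a reboot.

```shell
# copy_and_compare SRC DESTDIR: copy SRC into DESTDIR and verify the copy.
# (Hypothetical helper name, used here for illustration only.)
copy_and_compare() {
    src=$1
    destdir=$2
    dest=$destdir/$(basename "$src")
    # Give up with status 2 if the copy itself fails (e.g. an I/O error).
    cp -- "$src" "$dest" || return 2
    # cmp -s prints nothing and exits non-zero if the files differ.
    if cmp -s -- "$src" "$dest"; then
        echo "OK: $src"
    else
        echo "MISMATCH: $src"
        return 1
    fi
}

# Example (placeholder paths):
#   copy_and_compare /srv/data/bigfile /mnt/otherdisk
```

A mismatch, or a result that changes between repeated runs on the same file, would point at the read path rather than at the backup tool.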

At this point, if the backups you are making of the drive are being corrupted, it's quite possible the existing data is also corrupted. You'll want to consider your recovery scenario in case the existing data is bad.

It is a little odd that this did not bubble up through your backup tool as an error, though it is possible that the tool is ignoring such errors.

Jason Martin
  • I have three disks that form a soft RAID5 array. `smartctl` doesn't show any errors. If the drive had an error the RAID would fix it, and if it couldn't, I'd get an I/O error, not silent corruption (on 13 files!). Anyway, I shut down the machine and started it again. Now that it has come back up, the only fishy thing I see so far is a CPU usage of 5-15% by `ksoftirq` (but `/proc/interrupts` shows something like 100 interrupts per second, which I take to be normal). – Antonis Christofides Jun 01 '17 at 15:42