Are periodic full backups really necessary on an incremental backup setup?


I intend to use an old computer I have as a remote backup server for myself and a few other people. We are all geographically separated, and the plan is to do incremental daily backups using rsync and ssh.

My original idea was to make one initial full backup, never again have to deal with the overhead of doing one, and from that moment on only copy the files changed since the last backup.

I've been told that this could be bad, but I fail to understand why. Since each snapshot consists of hard links to the unchanged files plus full copies of the changed ones, isn't it going to be identical to a new full backup? Why would I want to make another full backup?

EDIT:

I should have explained a point better. When I say I'm going to do incremental backups using rsync, I mean this:

# A relative --link-dest path is resolved against the destination directory.
rsync -avh --delete --link-dest=../previous_increment ./local/ ./remote/new_increment

Which results in a complete and working snapshot, since it will contain hard links to all the unchanged files. Even if the full backup and all of the previous incremental ones happen to be deleted, the last incremental is still consistent. But I'm pretty sure that if any of the previously backed-up files get corrupted, the latest ones will too, since they point to the same inode.

What if I periodically do a synthetic full backup server-side, by breaking the links on the last snapshot and copying it to another, write-protected HD (say, once a month)? That way I would have a redundant full copy and would still avoid the overhead of re-sending the files.
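
For concreteness, a sketch of that monthly job (paths are assumptions: /backups/latest for the newest snapshot, /mnt/archive for the second drive). Note that copying to another disk allocates fresh inodes there anyway, so the hard links are broken as a side effect:

# Hypothetical paths; run on the backup server once a month.
# Copying across filesystems always creates new inodes, which
# breaks the hard links automatically.
rsync -a /backups/latest/ "/mnt/archive/full-$(date +%Y-%m)/"
# Write-protect the archived copy so routine jobs cannot alter it.
chmod -R a-w "/mnt/archive/full-$(date +%Y-%m)/"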

Would that solve the issue? Would I still need to do full backups?

rgcalsaverini

Posted 2013-10-23T07:11:40.283

Reputation: 123

This looks like exactly what dirvish (http://www.dirvish.org/) is meant to do.

– jia103 – 2015-12-24T19:07:35.567

I would say yes - in case your original full backup failed. – Journeyman Geek – 2013-10-23T07:23:51.513

Reliability. If your original backup or any of your incrementals get corrupted, then your backup is potentially worthless and you might as well not have bothered at all. You should have regular incremental and full backups. – Mokubai – 2013-10-23T08:21:57.880

I get the point, makes sense. I edited the question, do you think that the problem would be solved with server-side synthetic backups then? – rgcalsaverini – 2013-10-23T17:41:21.687

Answers


Normally, if you do incremental backups, you only store the actually changed files in some form (like a tar archive), while the unchanged files exist only in earlier backup files. This way you would need all the backup files for a restore and could never delete old backups. As this is not practical, you need to make new full backups after some time.
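
For example, a sketch of that classic scheme using GNU tar's listed-incremental mode (paths are illustrative): a restore has to extract the level-0 archive and then every later archive in order, so none of them can be deleted.

# The first run with a fresh snapshot file creates a full (level-0) archive.
tar --create --listed-incremental=/backups/state.snar --file=/backups/level0.tar /local
# Later runs with the same snapshot file store only the changed files.
tar --create --listed-incremental=/backups/state.snar --file=/backups/level1.tar /local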

What you are using is more advanced (rsnapshot?): you always store a full set of data but keep the overhead minimal by sharing data between backups through hard links. This way you can delete old backups without invalidating the current ones, so the usual argument doesn't apply.

Edit:

rsnapshot works like this:

The first time it just creates a full copy using rsync.

Any backup after that creates a new complete directory tree where all files are hard links to the previous backup. After that, the changed files are replaced by running rsync over this tree.
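
A minimal sketch of that rotation, assuming hypothetical /backups/daily.N directories:

# Clone the newest snapshot as a tree of hard links (no file data is copied).
cp -al /backups/daily.0 /backups/daily.1
# rsync replaces only the changed files. By default it writes each changed
# file to a temporary copy and renames it into place, which breaks the hard
# link and leaves the old version untouched in the older snapshot.
rsync -a --delete /local/ /backups/daily.0/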

So each backup is complete, but shares data with older backups. If you delete an older backup, only the files that no remaining backup shares are really deleted. For the shared files, only the hard link count is reduced by 1.
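
You can observe this sharing directly with GNU stat (hypothetical paths): the inode number is the same across snapshots, and the link count drops by one whenever a snapshot sharing the file is deleted.

# %i = inode number, %h = hard link count, %n = file name.
stat -c '%i %h %n' /backups/daily.0/somefile /backups/daily.1/somefile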

The overhead of each backup is the additional directory tree, which of course uses some space too. But you can delete old backup trees without affecting the remaining ones to recover that space.

The description of your backup strategy sounds like rsnapshot to me.

Edit2:

If you are concerned about bit rot - that is, existing backup files getting corrupted - you could add the option -c to rsync, which compares MD5 checksums of the local and remote files. This increases disk I/O significantly, as every file has to be read, but network traffic increases only slightly, since only the per-file checksums have to be transmitted additionally. This would remove the last reason for a new full backup.
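
As a sketch, a periodic verification pass with the paths from the question might look like this; -n (dry run) and -i (itemize changes) make rsync report files whose content differs without modifying anything:

# -c compares full-file checksums instead of size and modification time;
# -n -i only report differences, so nothing is transferred or changed.
rsync -avhcni ./local/ ./remote/new_increment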

Michael Suelmann

Posted 2013-10-23T07:11:40.283

Reputation: 578

I never heard of rsnapshot before, but you are right, that's precisely what I'm doing with rsync alone. I edited the question, do you think that it solves the problem? – rgcalsaverini – 2013-10-23T17:43:44.917

That is a great idea and would greatly improve the backup consistency! So in your opinion with this strategy further full backups would be unnecessary? – rgcalsaverini – 2013-10-23T18:51:10.087

A 2nd tier of backups, like occasionally copying a backup tree to another hard drive, could improve safety. With bad luck, the backup got corrupted just before the server crashed. But you can never be 100% sure. – Michael Suelmann – 2013-10-23T18:59:49.107


Incremental backup, e.g. using rsync, is a more complicated process than a static backup, e.g. using cp. Some people believe that an incremental backup is therefore more likely to be corrupted.

  1. The failure may be in the tool itself; rsync on Windows is known to be flaky, and sometimes deletes files from the backup when it shouldn't.

  2. If your backup tool only stores the binary difference between versions of a file, then losing an intermediate version may make it impossible to reconstruct the final one.
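
Point 2 can be illustrated with rdiff from librsync (purely as an example of a delta-storing tool; the rsync snapshot setup in the question does not work this way): reconstructing the new version needs both the old file and the delta, so losing one link of the chain loses everything after it.

rdiff signature old.file old.sig         # fingerprint of the old version
rdiff delta old.sig new.file new.delta   # store only the difference
rdiff patch old.file new.delta restored  # needs old.file AND new.delta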


Whatever your backup solution is, test it regularly by restoring a copy of your data from the backup.
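
A minimal sketch of such a test (paths are assumptions): restore the latest snapshot into a scratch directory and compare it against the live data.

# Restore into a scratch directory, then compare with the original data.
rsync -a /backups/latest/ /tmp/restore-test/
diff -r /local /tmp/restore-test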

Li-aung Yip

Posted 2013-10-23T07:11:40.283

Reputation: 568

Thanks, good to know that. 2. It doesn't - it stores the whole changed files plus hard links to the unchanged ones, so every snapshot is self-consistent. – rgcalsaverini – 2013-10-23T17:45:32.737


I think there has been a miscommunication.

Most of the time when I hear full backup and incremental backup, people mean:
Full: Back up all the data.
Incremental: Back up only the changes.

If you need to restore a backup, then you start with the full backup and then apply all the incrementals. That can take a lot of time. This is one reason why many corporations do full backups on the weekend and partial ones during weekdays. Up to five partials is manageable.

Now rsync usually does not make partial backups. It sends only the changes over the net, but the end result is a full copy of all the data. Thus the most common reason not to use only partials does not apply.
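
In other words, a plain rsync mirror (host and paths are assumptions) transfers only deltas but leaves a complete, standalone copy on the destination:

# Only the changed parts cross the network, yet the destination ends up
# as a full copy of /local. Host and paths are hypothetical.
rsync -avh --delete /local/ user@backupserver:/backups/mirror/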


Note that it is a good idea to have at least two backups: one known-good one and a working copy. Either alternate between these two, or make and test a yearly backup, set it read-only and use the other backup until next year. Then repeat.

Hennes

Posted 2013-10-23T07:11:40.283

Reputation: 60 739

Great advice about the alternating backups. Any reason for it to be yearly? Could I make a synthetic full backup once a month, store it on another HD and set it to read-only? – rgcalsaverini – 2013-10-23T17:48:06.297

No reason for it to be yearly except to select a regular period. I used to use 1) week tapes (full backups) & day tapes (Monday - Thursday incremental), 2) 4 different week tapes (#week in the month), 3) full month backups (stored off-site), 4) full year backups (kept in a vault in another building). All this might be a tad much for non-work backups. :) – Hennes – 2013-10-23T18:06:57.473

Hehe yeah, it's just some regular personal data. I edited my question, but the main point is: If I keep redundant full backups generated server-side from the incremental ones, do I still need to make full backups from the original data periodically? – rgcalsaverini – 2013-10-23T18:46:20.747