Could hardlinks be used for backup deduplication purposes?

I know about full backups, incremental backups and decremental backups. However, I wondered why nobody (Windows Backup, TrueImage, Paragon) seems to have implemented the following backup algorithm.

It needs a backup medium whose file system supports hardlinks, e.g. NTFS. Ideally the backup medium uses the same file system as the source in order to support all of its features, such as alternate data streams (ADS).

  1. First backup is a full backup. This will copy all files onto the backup medium into a subfolder of \. Let's call this folder L (for "last"). There's no special file format, just copy the files.
  2. With the next backup, a new subfolder of \ will be created, let's call it C (for "current"). Files which have changed from the full backup will be copied again from the source disk. Files which have not changed are moved from L to C and a hardlink is created to point from L to C.
  3. On repeated backups, the same procedure will be applied with C and another new folder. (A rough sketch of one round follows below.)
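
Sketch of one round, using Unix tools for brevity (I assume GNU coreutils and a hardlink-capable filesystem; SRC, LAST and CUR are placeholder paths, only a flat folder of files is handled, and "unchanged" is decided by a byte-for-byte compare; on NTFS, move and mklink /H would play the same roles):

    SRC=/data; LAST=/backup/L; CUR=/backup/C
    mkdir -p "$CUR"
    for f in "$SRC"/*; do
        name=$(basename "$f")
        if [ -f "$LAST/$name" ] && cmp -s "$f" "$LAST/$name"; then
            mv "$LAST/$name" "$CUR/$name"   # unchanged: move the file from L to C...
            ln "$CUR/$name" "$LAST/$name"   # ...and leave a hardlink behind in L
        else
            cp "$f" "$CUR/$name"            # changed or new: copy again from the source
        fi
    done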

Is there anything I have missed in this algorithm that would not work?

While I have not noticed any problems yet, I can see the following advantages:

  • The last backup (C) is always a full backup. For restoring, you only need this one backup, and the user can delete any old backup without losing the ability to recover (which is not the case with full, incremental and decremental backups).
  • Old backups act like full backups thanks to the links, but take much less space on disk.
  • There's a full history of changes for every file the user did not delete. But unlike SVN, it is possible to delete old revisions.
  • Moving files and creating links are very fast operations, so creating the backup should be fast as well.
  • It is possible to selectively delete changed files in old backups (e.g. only the big ones) instead of deleting a complete backup.

Thomas Weller

Related question: Using NTFS hard links to combine full/differential backups

– Simon East – 2019-09-12T10:10:16.277

BTW, this is very similar to the approach taken by Apple's Time Machine backup algorithm. – Simon East – 2019-09-12T10:59:50.390

Answers

What you describe is already in use with rsync and its --link-dest= option, via dozens of wrapper programs such as dirvish.
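
For illustration, a minimal invocation (paths are hypothetical): files unchanged since the previous snapshot come out as hardlinks into it, and changed files are copied in full. Since hardlinks have no direction, the result is equivalent to the scheme in the question.

    rsync -a --link-dest=/backup/last /data/ /backup/current/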

Dan D.

http://rsnapshot.org/ is pretty good, not sure about Windows – Tim Abell – 2016-06-18T16:42:12.377

I'll have a look into that. I just noticed that neither Windows Backup nor TrueImage nor Paragon implements such a feature. – Thomas Weller – 2014-01-07T07:37:22.457

It seems we still need a Linux system to get this up and running to back up a Windows machine. Reading the Windows Dirvish guide, usability appears to be a mess. Anyway, thanks for the input, it's a good source of information (http://www.dirvish.org/StevesWindowsGuide/Windows_Dirvish_Guide.html)

– Thomas Weller – 2014-01-07T11:03:57.047

I think that's basically what is called a Delorean copy. There is, for example, the Link Shell Extension for Windows, which implements this behaviour. They have quite a good explanation in their documentation:

http://schinagl.priv.at/nt/hardlinkshellext/linkshellextension.html#deloreancopy

bweber

Thanks for the link. The Delorean copy has the hardlinks pointing in the wrong direction: the file exists in the old location and new backups point to the old file. I want it the other way around: the file exists in the new version and old versions point to the new file. – Thomas Weller – 2017-03-08T16:32:22.430

The file doesn't exist at any single location in the filesystem. There are two layers: the actual files on the disk and their representation within the file system. The latter is what you see when you open Windows Explorer, for example. All the file entries you see there are basically hardlinks to a file on disk. – bweber – 2017-03-09T09:20:17.827

There can be multiple hardlinks to one file on disk, at different locations in the filesystem. Hardlinks have a reference counter. When a new hardlink to a file is created, this reference count is increased by one. When a hardlink is deleted, it is decreased by one. The actual file representation on disk is deleted when there is no more hardlink pointing to it, in other words when the reference count reaches zero. – bweber – 2017-03-09T09:20:26.990

So your way of seeing this is incorrect. There is not a hardlink and "the real file" that the hardlink points to. Instead, all representations you see of a file at the filesystem level are hardlinks pointing to the same file object on disk. Otherwise a hardlink could become invalid if the "original" it was pointing to was deleted. But that can't happen, because there is no "original". There is only one physical instance on disk, and it is not bound to one particular hardlink instance. What you might be thinking of are symbolic links. – bweber – 2017-03-09T09:20:41.463

If that's true, the picture in the link is incorrect, because it points from one folder to the next folder, but it should point to some sector on a hard disk instead. – Thomas Weller – 2017-03-09T10:09:07.167

You mean the red arrow? That arrow is not supposed to indicate the direction of links. It indicates that a hardlink copy is created from InitialBackup to Backup1 and then from there to Backup2. BTW, they also provide a description of hardlinks: http://schinagl.priv.at/nt/hardlinkshellext/linkshellextension.html#hardlinks

– bweber – 2017-03-09T10:20:51.240

Seems like a viable plan. It would decrease the amount of time taken to view and use the backups. If the backups are used frequently and one needs to see complete snapshots, this would be very handy.

I would change the wording "moved from L to C" to simply say "hard linked from L to C".

One consideration: deleting a file with a large number of links (in reference to your last bullet point) means locating all of those links and removing them. Recovering space selectively in that manner is therefore more challenging, but easy enough to do with the find command.
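
For example, with GNU find's -samefile test, which matches every name referring to the same file (the paths below are placeholders):

    # remove every hardlink to one big file across all backup folders at once
    find /backup -samefile /backup/current/bigfile.iso -delete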

ash

You bring up a significant change from the OP's question. Obviously, the implementation would need to be tested, but in theory, files identical in both L and the current folder could be stored in C using hardlinks to L, which would save the time of copying from L to C.

Again, in theory, the link for one or more files in L could be deleted, with the original file retained thanks to the existing hardlink in C. Perhaps this is moot, since the OP seems to be considering an existing solution, but just in case someone is considering using this algorithm for their own implementation. – GlennFromIowa – 2015-08-24T12:28:13.857

HardLinkShellExtension with its "Delorean Copy" (see the other answer) isn't the only ready-to-use solution. There are alternatives:

  • the console tool ln.exe from the same author, with the same functionality. The author also offers a pre-written batch file for timestamped DeLorean copies.
  • the GUI backup solution HardLinkBackup, which does pretty much exactly what you want.
  • use ln.exe from 1. to make a hardlink copy of the old backup into the new backup folder, then use xcopy or robocopy to copy only new files and remove old ones (it's /MIR for robocopy; see the sketch after this list). Test it to make sure changed files are deleted and then copied, not just modified in place (the latter would change the file in older backups too, because of the hardlinks).
  • use xcopy or robocopy to make normal backups and then run dfhl.exe /l /r /w /s /h "X:\Backups-parent-folder\." to hardlink all identical files.
  • the same as 3., but with finddupe -hardlink X:\Backups-parent-folder\** instead of dfhl.
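
A hedged batch sketch of option 3, using only built-in tools (mklink /H in place of ln.exe, so no flags have to be assumed); all paths are placeholders, paths with spaces would need extra quoting, and only a flat folder is cloned. As warned above, verify that the mirroring step replaces changed files rather than writing into them in place, or older backups will change too:

    mkdir D:\Backup\Current
    rem clone the last backup as hardlinks (flat folder only, batch-file syntax)
    for %%F in (D:\Backup\Last\*) do mklink /H "D:\Backup\Current\%%~nxF" "%%F"
    rem mirror the source over the clone; /MIR copies changed files and deletes removed ones
    robocopy C:\Data D:\Backup\Current /MIR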


Disclaimer: I've used all of the above-mentioned programs except finddupe, though not necessarily in the way described. And I have no monetary or other connection to any of the programs.

Limer

Shameless self-plug: since none of the above are FLOSS, I wrote a command-line tool that essentially does what the OP wants, including support for copying files that are in use via Volume Shadow Copy. Available at https://github.com/asdfjkl/yahb . In the hope that this could be useful to someone...

– ndbd – 2020-01-05T20:46:28.040

@ndbd - IMHO that may be a self-plug, but it's not shameless because it doesn't "advance your own selfish interests". A quote from the last changelog of ln.exe regarding FLOSS: "ln.exe is on gitlab.com. For now only private, but hope to change this soon." So maybe it will be F(L)OSS in the future. And I was going to add a script from Heise that I just remembered, but I saw it is very old and you were already there.... ;-)

– Limer – 2020-01-06T17:22:20.003

What you are describing is essentially an incremental backup scheme.

As Dan D. points out, it's actually used by various tools, particularly on Unix-like platforms where hardlinks are handled natively by a lot of the programs that care.

However, lots of Windows programs don't deal very well with hardlinks. Back in the days of FAT, hardlinks would actually have been considered an error, as no two names in the file system were allowed to point to the same data blocks.

What you describe is an incremental backup scheme because any one backup builds upon all the previous backups. The only real difference is how those previous backups are referenced, plus the fact that it's easier to delete a previous backup: the data is only actually deleted once the reference count for the file in question reaches zero, which happens when it is no longer referenced from any backup. The downside is that it's harder to predict exactly how much space deleting a given previous backup will free; in the extreme case of no changes between that backup and an adjacent one, it could actually be zero (apart from reclaimed file system metadata).
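
A quick illustration of that reference counting with GNU tools (fsutil hardlink list shows the same information on NTFS):

    echo data > a
    ln a b            # a second name for the same file; the link count is now 2
    stat -c '%h' b    # prints 2
    rm a              # removes one name only; the data stays reachable via b
    stat -c '%h' b    # prints 1; the data itself is freed once this reaches 0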

In the case of "normal" incremental backups, you have to do the restores manually. In the case of what you are describing, the reference is implicit. However, if you were to delete everything that wasn't actually copied during the most recent backup (everything whose reference count is greater than one), that backup would be just as incomplete as if you had made multiple incremental backups and then tried to restore only the most recent one.

a CVn

@aCVn - Hardlinks and Junctions do not require administrative rights. Only Symlinks do. – Limer – 2020-01-06T17:27:24.057

I agree with the FAT statement. That's why my focus is on NTFS or another file system supporting hardlinks. – Thomas Weller – 2014-01-07T10:52:14.553

@ThomasW. Obviously file system support for hardlinks would be needed to use hardlinks for deduplication of data. That's not the point, though. The point is that in *nix, hardlinks have been part of the ecosystem like forever, and are actually being used. In Windows, they are a relatively new (in versions of Windows used by most people) and rather obscure feature that IIRC isn't even available to non-administrator accounts by default. – a CVn – 2014-01-07T10:56:19.927

From the way the backup is created and what it needs to copy, it's an incremental backup. However, the result at the end is a full backup.

If you define it that way, a decremental backup is also an incremental backup, because it also copies only the differences from the last full backup. – Thomas Weller – 2014-01-07T10:58:27.327

Hardlinks have existed since Windows NT 4 SP4. Windows 7 uses hardlinks in several places, for the C:\User / C:\Benutzer directory as well as the C:\Program Files / C:\Programme directory. So it looks like hardlinks actually work nowadays, and I could give it a try. – Thomas Weller – 2014-01-07T11:07:46.703