What are the pitfalls of hardlinked files on my desktop PC?



All the identical-content files on my PC are now hardlinked. (My data is completely de-duplicated. It is a consequence of the way I copied my data from my old computer.)

What pitfalls do I need to be aware of now that certain actions on one file could silently affect a number of other files?

I know that deleting the file I'm working on is not a problem (assuming I deleted it on purpose). It doesn't affect any of the other hardlinked files and I don't see that the delete action would lead to unexpected side effects.

Moving or renaming the file is not a problem. I don't see any unexpected consequences.

I don't think copying hardlinked files is a problem, but I'm not as confident about any unexpected consequences in this regard. What I have seen is that making a copy (to the same disk) of a hardlinked file with cp keeps the copy hardlinked (i.e., inode number doesn't change in the copy). Copying to another filesystem obviously breaks the hardlink. (I guess one pitfall is forgetting this fact, given that my PC has 3 hard disks.)

Changing permissions does affect all linked files. So far this has proven handy. (I made a large number of the hardlinked files read-only.)

None of the operations above seem to produce any major unexpected consequences.

However, as was pointed out to me by Daniel Beck in a comment, editing or modifying a file can sometimes be a problem. It depends on the tool and maybe the type of edit. (For example, editing small text files using sed seems to always break the link while using nano doesn't.) This introduces the chance that editing one file could affect all the hardlinked files (i.e., alter the original inode).
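The difference between the two behaviours can be reproduced without any editor. The following is a minimal sketch, assuming a shell whose `[` supports the `-ef` (same file) test, as bash and dash do. The file names `a`, `b` and `a.tmp` are made up for the demonstration.

```shell
# Sketch: in-place write vs. replace-by-rename on a hard-linked pair.
# All file names here are invented for the demonstration.
demo_dir=$(mktemp -d)

printf 'old\n' > "$demo_dir/a"
ln "$demo_dir/a" "$demo_dir/b"      # a and b are now two names for one inode

# 1. In-place write: the existing inode is opened, truncated and rewritten.
printf 'in-place\n' > "$demo_dir/a"
inplace_b=$(cat "$demo_dir/b")      # b shows the new content too
if [ "$demo_dir/a" -ef "$demo_dir/b" ]; then inplace_linked=yes; else inplace_linked=no; fi

# 2. Replace by rename: write a new file, then rename it over the old name.
printf 'replaced\n' > "$demo_dir/a.tmp"
mv "$demo_dir/a.tmp" "$demo_dir/a"  # a now names a new inode; b keeps the old one
replace_b=$(cat "$demo_dir/b")      # b still holds the content from step 1
if [ "$demo_dir/a" -ef "$demo_dir/b" ]; then replace_linked=yes; else replace_linked=no; fi

echo "after in-place write: b='$inplace_b' linked=$inplace_linked"
echo "after replace:        b='$replace_b' linked=$replace_linked"
```

Step 1 corresponds to a tool that writes into the existing file, and step 2 to a tool that writes a new file and renames it over the old name, which is presumably what `sed -i` does in my tests.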

My proposed solution to this is to make all hardlinked files read-only (and that is already mostly the case). If I can't do that for some files, I will unlink those particular files. Is there any problem with this read-only approach?

I'm assuming that if I go to edit a file and find it to be read-only, I'll remember to unlink that filename while making it writable. So one pitfall might be forgetting this rule. In that case, I'll have to rely on my backups.

Am I correct in the above statements? And what else do I need to know?

BTW, I'm running Kubuntu 12.04. I'm also using btrfs. (I have 2 SSD's and 1 HDD in the PC. I will also be adding an external USB HDD. I'm also connected to a network and I mount some NFS shares. I don't assume any of these last bits are relevant to the question, but I'm adding them just in case.)

BTW, since I have more than one drive (with separate file systems), to unlink any file all I have to do is copy it to another drive, then move it back. However, using sed also works (in my testing). Here's my script:

sed -i 's/\(.\)/\1/' file1

Surprisingly, this even unlinks zero-byte files. In my testing it also appears to work on non-text files without any special options. (But I understand that the --binary option might be needed on Windows, MS-DOS and Cygwin.) However, copying to another disk and moving back may be the best way to unlink. For my use case, the unlink command doesn't really "unlink"; rather, it "removes" the file name.
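An alternative that does not depend on a second disk or on sed is to copy the file next to itself and rename the copy over the original name. This is only a sketch: `break_link` is a hypothetical helper name, not a standard command, and it assumes `mktemp`, `cp -p` and `mv` behave as in GNU coreutils.

```shell
# break_link FILE: give FILE its own inode by copying it and renaming the
# copy over the original name. Hypothetical helper, not a standard command.
break_link() {
    f=$1
    # create the temporary file in the same directory so mv is a plain rename
    tmp=$(mktemp "$f.XXXXXX") || return 1
    # cp -p keeps mode and timestamps (and ownership where permitted)
    if cp -p -- "$f" "$tmp" && mv -f -- "$tmp" "$f"; then
        return 0
    fi
    rm -f -- "$tmp"
    return 1
}
```

Usage would be `break_link file1`. The other names keep the old inode and its content; only `file1` ends up on a new inode.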


Posted 2012-03-19T06:37:47.810

Reputation: 1 735

Editing only affects just one file if the editor replaces the original file. Otherwise all hard linked instances get their content changed. This requires just two unrelated empty placeholder files and changing one of them to break anything relying on the other. – Daniel Beck – 2012-03-19T07:05:51.510

@Daniel Beck - thanks. I updated my question based on your comment. – MountainX – 2012-03-19T15:40:32.547

Though not duplicates, a lot of questions have gone into hardlinks in general - maybe Unix.SE or ServerFault holds some enlightenment. – new123456 – 2012-03-19T21:23:11.717

@new123456: those links are good references for other readers. I've already read all that stuff. I'm looking for real "in the trenches" feedback from people who have experience doing what I am doing. I understand the basics. I want to know the pitfalls and the hard core details that come with deeper experience than mine. – MountainX – 2012-03-20T01:55:52.837



Here are the pitfalls I have thought of so far:

1. It is possible to unintentionally change the content of one or more file x's when editing file y.

A workaround for this, as stated in my original question, is to make all hardlinked files read-only by default. For files that are edited often, I simply won't use hardlinks, as they are probably not appropriate.

IMPORTANT UPDATE: Here's a real pitfall. Sometimes editors will silently overwrite a file even if it is read-only. For example, I had an empty file with permissions of 400 and owned by root. I opened the file in nano, edited it and saved it. nano did not complain that it was read-only. All the hardlinked filenames now had the wrong content. So unfortunately, making the files read-only is not the workaround I expected, and this is indeed a serious pitfall.

2. It is possible to unintentionally create a new copy of a file. This is essentially the opposite of the first pitfall. The single file content may have N file names. Editing one of those file names may now lead to two distinct items of content with N (number of filenames) not changing at all. I could be unaware of the fact that this happened (if I don't pay attention to hardlinks).

An illustration of this in my case is my disorganized photo collection. I presently have the same photo stored under different names in different directories (e.g., because of downloading it from my camera more than once without taking the time to organize my photos). Hardlinking means that I no longer waste a lot of space because of this. I would prefer that editing one of these files would always affect all the hardlinked filenames (unless I specifically save the edited photo under a new name). However, this will most likely not be the case. So the pitfall is that editing a photo could lead back to more disorganization of my photo collection. The same pitfall could apply to music or videos (or virtual machine images, etc.).

The same workaround is the only one I have come up with: make the files read-only, so that when I need to edit one I am reminded to pay attention to the hardlinks. (Is there a better workaround, such as some way to quickly relink all the filenames?)
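One possible way to relink, sketched here under the assumption that I noted the other filenames before editing: point each of them at the edited file again with `ln -f`. The helper name `relink` is hypothetical, and all names must live on the same filesystem.

```shell
# relink SRC NAME...: make each NAME a hard link to SRC again, replacing
# whatever NAME currently points at. Hypothetical helper; SRC and every
# NAME must be on the same filesystem.
relink() {
    src=$1
    shift
    for name in "$@"; do
        # skip names that already share SRC's inode, otherwise relink them
        [ "$name" -ef "$src" ] || ln -f -- "$src" "$name" || return 1
    done
}
```

For example, `relink edited.jpg copy1.jpg copy2.jpg`. Note that once every old name has been relinked, the previous content is gone, so this only suits cases where the edit really should apply everywhere.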

Another (positive) consequence of my photo collection being hardlinked is that I can much more quickly organize it now. For example, with this command I can find all duplicate photos:

find /home/me/Pictures -type f -links +1 -printf "%n\t%i\t%d\t%s\t%t\t%p\n" 2>/dev/null | sort -gr > /home/me/Pictures/duplicatesList.txt

Using that list, I can confidently delete the file names I don't want to keep. Eventually, I may not have any more hardlinked photos.

3. I can't think of a third pitfall. If anyone has more than 2 pitfalls, please answer and I will accept your answer (assuming it is better than mine).

Overall, I don't think the hardlinks will add much complexity to my daily computing tasks if I make all hardlinked files read-only. I can do that easily with a command similar to this:

find . -type f -links +1 -perm /g+w,o+w -iname '*.gif' -exec chmod 444 '{}' \;

I can alter the path or file extension as needed. I don't plan to touch any hardlinks used by default installations of Linux. I'm only working with hardlinks in my personal data. I could simply change all my hardlinked files to read-only with a single command.

Over time, I'll get rid of unneeded filenames and simplify my data (and my life). If files truly are read-only and duplicates are warranted, I'll leave the hardlinks for those files indefinitely.

However, in some cases I'll unlink the files and leave independent duplicate files on purpose. This last case occurs very commonly in source code trees; the same file content is justified in more than one place and it should be writable. When I encounter a source code file that is read-only and I need to edit it, I'll unlink. Typically, just editing the file will unlink it. But I can be sure by using this command, which I know unlinks files:

sed -i 's/\(.\)/\1/' file1


Here's an example of pitfall #1 above. This is an actual example from my filesystem that I just came across.

I was going to destructively edit "Copy of index.html" because I saw the file "index.original.html" and I thought I was safe editing the copy. However, it turns out the files were hardlinked, so editing the "copy" would have changed the original too.

Here's the info showing the files were hardlinked:

2   45214   6   6641    Thu Oct 30 10:46:00.0000000000 2008 /Site/FusionAppsVPS/index.original.html
2   45214   6   6641    Thu Oct 30 10:46:00.0000000000 2008 /Site/FusionAppsVPS/Copy of index.html
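A quick check like the following, run before a destructive edit, would have shown the second name. It is a sketch only: `other_names` is a hypothetical helper, and it relies on the `-samefile` test of GNU find.

```shell
# other_names FILE [DIR]: print every path under DIR (default: FILE's own
# directory) that refers to the same inode as FILE, including FILE itself.
# Hypothetical helper built on find's -samefile test.
other_names() {
    file=$1
    dir=${2:-$(dirname -- "$file")}
    # -xdev keeps the search on one filesystem; hard links cannot cross it
    find "$dir" -xdev -samefile "$file" -print
}
```

For example, `other_names "/Site/FusionAppsVPS/Copy of index.html"` searches that directory tree. More than one line of output means the file is not safe to edit in place.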




A pitfall is the overwriting of the files.

Some applications remove the file and write a new one under the original name. In this case, the edited name is decoupled from the other names. Other applications open the existing file directly for writing. In this case, the content seen through all the other names changes too. However, since you are making all the duplicate linked files read-only, the two cases are easy to tell apart: for a non-root user, an in-place write to a read-only file is refused, while a replace can still succeed.

Michael Tsang

Posted 2012-03-19T06:37:47.810

Reputation: 306