How best to compare huge directory trees

6

7

How can I compare directory trees of huge size?

I am thinking a free tool to make a snapshot of the filesystem structure (listing of files and directories, their size & timestamps) would be ideal so I could compare the snapshot to another one made later.

Treecomp would be great for that but with a huge tree (I mean really huge!) it crashes because it tries to keep it in memory (4GB of memory are not enough)...

I worked around the problem by splitting the snapshots into pieces and comparing these pieces. But that's tedious, and the problem can surely be solved better.

Is there another free (ideally also open source) tool that I can try out? Or is there another way to do this that I am overlooking?

OS can be Linux or Windows.

jdehaan

Posted 2009-09-30T09:11:16.497

Reputation: 903

Anyone have a good command line equivalent for linux? I've rolled my own with find and sha1sum, but I think this warrants a first class program. – Peter Lyons – 2010-09-10T15:28:32.820

What I've done in the past is produce a directory dump to file, and then compare the files with an ad-hoc program. – Daniel R Hicks – 2012-07-13T02:00:23.040

Beyond Compare 3? How many files and folders are we talking? How big of a drive is this? – Richie086 – 2012-07-12T23:29:07.600

Answers

2

I'll try to expand a bit on how to do it with Total Commander (I hope I understood what you want to do).

  • install DiskDir packer plugin (I put a direct link to the plugin; if you prefer, you can go to the plugins page and look for the DiskDir plugin)
  • after the plugin is installed "pack" the directory you want to track changes of with Alt+F5 and select "lst" from the drop down list in Packer part of the dialog box; this will create a "package" that you can enter by pressing enter, like you would enter a directory and it will show complete contents of the directory
  • when comparing results go to the original directory on the left pane and enter desired snapshot on right pane
  • use "Synchronize Dirs" function, located in Commands menu
  • in Synchronize directories window uncheck compare by contents, check Subdirs and Ignore date (or not if changed date is important) and run comparison
  • window will show you files that are equal (in this case not by contents, only by size), files that are different and files missing on left/right side

Since the snapshot is a plain text file and you are not comparing by contents it should be fast but I never used it for a really huge directory.

This is useful if you are not making backups but only wish to make a snapshot of what the contents of the directory were at some point. If you do make backups, you can use the same tool (Synchronize dirs) to also compare by contents.

There is also an extended version of DiskDir plugin, download link is in the first post. This version enables you to have packages (like zip, 7z...) show as directories in the snapshot. This would of course increase time to make a snapshot.

T. Kaltnekar

Posted 2009-09-30T09:11:16.497

Reputation: 7 636

+1 for TC (although not free :) – None – 2009-09-30T16:12:50.723

7

You can just use this in the terminal:

du -a

This will return all the files in all subfolders, including their sizes; then just compare the listings.

To save the data to a text file

du -a > dump.txt

Then you can just use something like diff to compare the files

This is for linux :D
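For really huge trees, a rough sketch of the same idea that also records timestamps and avoids loading both listings into memory: sort each snapshot, then stream them through comm. This assumes GNU find (for -printf) and GNU sort, which spills to temporary files on large inputs. The function names snapshot and compare_snapshots are made up for illustration, and paths containing tabs or newlines are not handled.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, GNU find/sort assumed.

# snapshot TREE OUTFILE
# One line per entry: relative path, size in bytes, mtime (epoch seconds),
# sorted in the C locale so comm can merge the files later.
snapshot() {
    ( cd "$1" && find . -printf '%p\t%s\t%T@\n' ) | LC_ALL=C sort > "$2"
}

# compare_snapshots OLD NEW
# comm -3 drops lines common to both files; lines only in NEW come out
# with a leading tab, which is rewritten here into "< " / "> " markers.
compare_snapshots() {
    LC_ALL=C comm -3 "$1" "$2" |
        awk '{ if (substr($0, 1, 1) == "\t") print "> " substr($0, 2)
               else print "< " $0 }'
}
```

Usage would be something like snapshot /data old.txt today, snapshot /data new.txt next month, then compare_snapshots old.txt new.txt. A changed file shows up as a "<" line and a ">" line with the same path; added and deleted entries show up as a single line.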

monkey_p

Posted 2009-09-30T09:11:16.497

Reputation: 441

Just used this to compare copies of massive render directories with lots of subdirectories on my Mac. FileMerge completely choked until I fed it du -a dumps of the directory trees. I just had to run the output through sed to change the two root directory names to the same string. – rebusB – 2017-09-28T04:32:12.813
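A possible sketch of that root-renaming step, in case it helps someone. It uses awk rather than sed so the root path does not have to be escaped for a regex. normalize_root is a hypothetical name, and it assumes du -a style lines (size, tab, path) with no tabs inside the paths.

```shell
#!/bin/sh
# Sketch only: normalize_root is a made-up helper name.

# normalize_root ROOT < dump.txt
# Rewrites the leading ROOT of each path to "." so that dumps taken from
# two differently named roots can be compared line by line.
normalize_root() {
    awk -F '\t' -v root="$1" 'BEGIN { OFS = "\t" } {
        if (index($2, root) == 1) $2 = "." substr($2, length(root) + 1)
        print
    }'
}
```

For example: normalize_root /Volumes/copyA < a.txt > a.norm.txt, and the same with the other root for the second dump.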

3

I've used MD5 hashes and diff to compare trees in the past. It's slow but will find changed files in cases where the dates are not reliable. It's also portable so you can transfer the index instead of comparing files via the network.

find /path/to/check -type f -print0 | sort -z | xargs -0 md5sum > after.txt

diff before.txt after.txt > diffs.txt
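If you only need to know which files changed since the index was made, md5sum can check a saved list itself, one file at a time, so nothing large is held in memory. A minimal sketch, assuming GNU coreutils and findutils; make_list and check_tree are hypothetical helper names. Note that this reports changed and missing files but not newly added ones.

```shell
#!/bin/sh
# Sketch only: helper names are made up, GNU md5sum/find/xargs assumed.

# make_list TREE LISTFILE
# Hash every regular file, with paths relative to TREE.
make_list() {
    ( cd "$1" && find . -type f -print0 | LC_ALL=C sort -z |
        xargs -0 -r md5sum ) > "$2"
}

# check_tree TREE LISTFILE   (LISTFILE must be an absolute path)
# --quiet suppresses the "OK" lines, so only changed or missing
# files are printed; the summary warning on stderr is discarded.
check_tree() {
    ( cd "$1" && md5sum --quiet -c "$2" 2>/dev/null )
}
```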

Chris Nava

Posted 2009-09-30T09:11:16.497

Reputation: 7 009

Good answer, but I would avoid the md5 on a file system of the size under discussion here. – DaveParillo – 2009-09-30T23:57:46.493

1

For someone attempting to do something similar on a Windows Machine (2008/Vista and above) you can use the following command:

forfiles /P C:\Your\Path\Here /s /C "cmd /c rhash --simple @file" > C:\OutputOfHashes.txt

forfiles is a built in command as of 2008/Vista. http://technet.microsoft.com/en-us/library/cc753551%28v=ws.10%29.aspx

Simply replace the rhash (open source hash generation utility) command with a hasher of your choice. http://rhash.anz.ru/

– aolszowka – 2013-07-12T15:37:25.900

Perhaps a tool that can cache the hashes would be a solution. Something like Git will recompute only the hashes of changed files. I wonder if you could use its hash store as your comparison source... (Git uses SHA1 vs MD5 so the initial computation would be higher but the upkeep would be lower due to the caching features.) – Chris Nava – 2013-07-12T19:08:56.437

1

This is what I use to compare really big directory trees:

rsync --archive --dry-run --verbose /src/directory/ /dst/directory/
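A slightly more detailed variant, as a sketch: --itemize-changes prints a change code for each entry that differs, and --delete (harmless together with --dry-run) also reports entries that exist only on the destination side. tree_diff is just a hypothetical wrapper name.

```shell
#!/bin/sh
# Sketch only: tree_diff is a made-up wrapper name.

# tree_diff SRC DST
# Nothing is copied or deleted because of --dry-run. The trailing slashes
# make rsync compare the contents of the two directories. Entries that
# exist only in DST are printed as "*deleting" lines.
tree_diff() {
    rsync --archive --dry-run --itemize-changes --delete "$1/" "$2/"
}
```

Run it as tree_diff /src/directory /dst/directory. Entries that match in size and modification time produce no output.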

juangiordana

Posted 2009-09-30T09:11:16.497

Reputation: 11

1

You could just use the command prompt to dump the listing:

DIR /S >Listing1.txt

(you can fine tune the options if you want, but this basic syntax is probably good enough)

To compare the two listings use any file comparison tool, like WinDiff, or CompareIt etc. WikiPedia has a list of such tools here: http://en.wikipedia.org/wiki/Comparison_of_file_comparison_tools

ssollinger

Posted 2009-09-30T09:11:16.497

Reputation: 379

<sarcasm>Great trick</sarcasm>, if you tell me how to compare the resulting 2x 10GB files in a file comparison tool! A tool storing info into a database could help me but this doesn't sorry. – jdehaan – 2009-09-30T10:09:10.653

Sorry it sounds maybe a bit rude, after I reread myself. It wasn't meant so. This can maybe still help others with a smaller amount of data without installing any additional software on the system. – jdehaan – 2009-09-30T11:17:36.600

No problem. I didn't realize that your listings are that huge, and you are right that in this case my suggestion is not suitable. I thought it was worth mentioning this method as sometimes people get too carried away trying to find the best tool, forgetting about the simple ways of doing things. But as you said, in your case that's not a solution and you will need some other tool. – ssollinger – 2009-09-30T19:39:27.157

1

One week ago, take the first snapshot:

rsync --archive /the/source/ /var/snapshot1/

Now take the second snapshot:

rsync --archive /the/source/ /var/snapshot2/

And compare them:

rsync --archive --dry-run --itemize-changes --delete /var/snapshot1/ /var/snapshot2/

Perleone

Posted 2009-09-30T09:11:16.497

Reputation: 220

I like this answer, because: 1. rsync compares on file name, size and timestamp (just as the OP wants) and 2. It works on both Windows and Linux (and even on Windows drives cifs-mounted on Linux). – agtoever – 2014-10-14T20:42:01.257

0

Have you tried Back In Time?

It's a GNU/Linux tool that makes a snapshot of a filesystem by using hard links or physical copies of files and directories.

It's very configurable and has daemon and GUI parts that run separately.

atrent

Posted 2009-09-30T09:11:16.497

Reputation: 83

0

I did this in Total Commander, using the synchronise directory feature. 1.2TB data across two drives.

user3463

Posted 2009-09-30T09:11:16.497

Reputation:

Sounds good, but how do I compare the state of the data to the one that was there one month ago? I do not need a backup solution, I just want to identify changes from one checkpoint to another. I am not interested in the changes inside files, just changes to the structure: files added/deleted/modified, directories created/deleted/modified. The content doesn't matter to me. – jdehaan – 2009-09-30T10:14:02.190

The Total Commander synchronise feature shows you a list of files new or changed on both sides, without actually having to run the sync. – None – 2009-09-30T10:16:40.950

That's fine, but I have only the data once (today). For the other side (a month earlier) I would also need the data, and I don't have enough storage for a few dozen TB... Treecomp has this feature but doesn't scale well with big trees; up to 2TB it works. – jdehaan – 2009-09-30T10:24:43.883

I should clarify that the limitation is not really the amount of data but the number of files/directories, as the data is not included in the snapshot by treecomp. – jdehaan – 2009-09-30T10:26:01.103

0

Freecommander has the option to compare two different folders.

Steef Min

Posted 2009-09-30T09:11:16.497

Reputation: 69

Thanks but my problem is not so trivial... – jdehaan – 2009-09-30T10:26:58.060

0

You may also try :

Karen's Directory Printer

Karen's Directory Printer can print the name of every file on a drive, along with the file's size, date and time of last modification, and attributes (Read-Only, Hidden, System and Archive)! And now, the list of files can be sorted by name, size, date created, date last modified, or date of last access.

File List Generator

FLG is a free File List Generator. It searches the directory tree for the files with the requested criteria and produces a list in HTML format.

harrymc

Posted 2009-09-30T09:11:16.497

Reputation: 306 093

Karen's Directory Printer is really a nice tool. Maybe parsing the output files with a perl script could help me for comparing them but it would have to be smart to not use too much memory... I cannot really believe I am the only one having this trouble... – jdehaan – 2009-09-30T10:34:45.597

You're assuredly not the only one. Is your problem rather that of syncing directories? If so, I can recommend the very fast SyncBack Freeware at http://www.2brightsparks.com/assets/software/InfoHesiveViewerEP_Setup.exe.

– harrymc – 2009-09-30T13:31:16.383

0

Have you tried meld? I have no idea if it's any good for huge trees, but you can always give it a try :)

Meld is a visual diff and merge tool targeted at developers. Meld helps you compare files, directories, and version controlled projects. It provides two- and three-way comparison of both files and directories, and has support for many popular version control systems.

Meld helps you review code changes and understand patches. It might even help you to figure out what is going on in that merge you keep avoiding.

Peltier

Posted 2009-09-30T09:11:16.497

Reputation: 4 834

That's a very good and nice diff tool, but cannot save a directory tree state (at least not in the version I have) for later use and comparison – jdehaan – 2009-09-30T13:46:02.230

0

You could check Beyond Compare.

It is not free, but you can test it for 30 days (working days, not days after installation). Perhaps that's enough time to complete your task.

knut

Posted 2009-09-30T09:11:16.497

Reputation: 101