Find and delete duplicate files across different disks and directories

1

I have hundreds of thousands of files spread across many external disks and the internal disks of several computers, and many of them are duplicates. I caused this mess myself by making copies for safety purposes. From time to time I changed my directory structure, but I didn't replicate those changes in the other places that held copies.

Now I have a single huge disk with almost everything I really need, backed up and mirrored in the cloud.

I would like a way to delete everything from all those scattered disks that is already on the big disk.

Let me show the scenario:

OldDisk1:

/code/{manystructures}/{manyfiles}
/docs/{manystructures}/{manyfiles}

OldDisk2:

/dev/{another_structures}/{same_files_different_names}
/documents/{another_structures}/{same_files_different_names}

NewHugeDisk:

/home/username/code/{new_structure}/{new_files}
/home/username/documents/{new_structure}/{new_files}

Does anyone know of a tool or a way to do something like "find all files on OldDisk1 that are already on NewHugeDisk and delete them"?

I have looked at many tools (Windows, Mac and Linux, as I have this issue on all three), free and paid, but with no luck.

One idea would be to write some code to do that, but I'm not a developer. I can write small and simple scripts, but this kind of code would, I think, be too complicated for me.

I would appreciate any help or ideas on this.

Tuts

Posted 2017-06-13T23:58:44.630

Reputation: 13

What are some of the tools you've tried? How have they failed? – music2myear – 2017-06-14T00:14:00.583

If you are using Linux, I've had some luck with fslint. Of course, you will want to delete them as an explicit process (not automatically), but you can generate a list of file names for a delete script or whatever. – Frank Thomas – 2017-06-14T01:30:14.403

@music2myear I have tried many tools for Mac, Windows and Linux. This is a short list of what I have tried: Easy Duplicate, Duplifinder, Mr Clean, Gemini 2, Dupe Guru, CCleaner, Duplicate File Finder, Auslogics Duplicate File Finder, Disk Drill, Tidy Up, Duplicate Detective, Decloner, Clone Spy, Doppleganger. There are many others that I have read about and even tried, but they didn't seem to do what I expected. – Tuts – 2017-06-15T13:19:43.630

@Frank Thomas, I didn't try it, but from what I read, it will not be able to accomplish what I want. – Tuts – 2017-06-15T13:20:33.847

@music2myear I forgot to answer your second question.

All those tools delete any duplicate file they find on any path you provide. Let's say you provide /old/* and /new/*, and only the new disk has /new/dir1/a.txt and /new/dir2/a.txt: they will delete one of them. But I only want to delete files under /old/*. – Tuts – 2017-06-15T13:31:07.277

Answers

2

Assuming you can use Windows as the OS for the whole process and you don't like Free Duplicate File Finder (I've never tried it, but found it mentioned here), you could use PowerShell to achieve what you want with relatively little effort. Note: I'm not a real pro at PowerShell, so I'm pretty sure my code could be refined.

Just open PowerShell ISE (or, if you don't have that, use Notepad), copy and paste the following code into it, and save the resulting file somewhere as *.ps1. You also have to change the values of $oldpath and $newpath to your directories - just put your paths between the quotes.

# Search-and-Destroy script
# Get all files of both code directories:
$oldpath = "Disk1:\code"
$newpath = "DiskNew:\code"

$files_old = Get-ChildItem -Path $oldpath -Recurse -File
$files_new = Get-ChildItem -Path $newpath -Recurse -File

for($i = 0; $i -lt $files_old.Length; $i++){
    for($j = 0; $j -lt $files_new.Length; $j++){
        # if the file size and the last edit time are the same...
        if($files_old[$i].Length -eq $files_new[$j].Length -and $files_old[$i].LastWriteTime -eq $files_new[$j].LastWriteTime){
            # ...get the file hashes for those files (SHA1 should be enough)
            $files_old_hash = (Get-FileHash -Path $files_old[$i].FullName -Algorithm SHA1).Hash
            $files_new_hash = (Get-FileHash -Path $files_new[$j].FullName -Algorithm SHA1).Hash
            # if the hashes are also the same...
            if($files_old_hash -eq $files_new_hash){
                # ...remove the old file (-Confirm can be removed so you don't have to approve every file).
                # If you want to check the files before deletion, you could also just rename them instead
                # (here by adding the suffix ".DUPLICATE"):
                # Rename-Item -Path $files_old[$i].FullName -NewName "$($files_old[$i].Name).DUPLICATE"
                Remove-Item -Path $files_old[$i].FullName -Confirm
                Write-Host "DELETING`t$($files_old[$i].FullName)" -ForegroundColor Red
                # this old file is handled - continue with the next old file
                break
            }
        }
        # otherwise, compare the old file against the next new file
    }
}

Then start the script (via right-click, for example) - if that fails, make sure your ExecutionPolicy is set (https://superuser.com/a/106363/703240).
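For example, a quick way to check the current policy and, if necessary, relax it for your user account only (RemoteSigned is a common choice here, but pick whatever matches your own security requirements):

Get-ExecutionPolicy -List
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser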

I use an almost identical script to check for files that have already been copied (but possibly with changed names). This code assumes that only the names of the files differ, not their content. The last edit time usually stays the same even after copying a file to a new path - unlike the creation time. If the content differs, my solution fails badly - you could use other unique attributes of the files (but which?), or state that, for example, only files that are smaller or older (considering the edit time again) than the new files should be deleted.

What the script does:

  1. Get all files in the specified folders (and their subfolders).
  2. Get the first old file (specified by $i)...
  3. ...and compare its last edit time and its file size with those of the first new file (specified by $j)...
  4. ...if they are equal, calculate a file hash to be sure that it is definitely the same file (arguably, this could be a bit more effort than your goal requires).
  5. If the hashes are equal, the old file gets deleted (and the script writes which file to the terminal), then it starts again at 2. with the next old file...
  6. If the hashes are not equal (or the last edit times or file sizes don't match), it starts again at 3. with the next new file.
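With hundreds of thousands of files, this pairwise comparison can get slow. One possible refinement - just a rough, untested sketch reusing the $files_old and $files_new variables from the script above - is to first index the new files by size and last edit time, so each old file is only hash-compared against matching candidates:

# build a lookup table of the new files, keyed by size and last edit time
$index = @{}
foreach($new in $files_new){
    $key = "$($new.Length)|$($new.LastWriteTimeUtc.Ticks)"
    if(-not $index.ContainsKey($key)){ $index[$key] = @() }
    $index[$key] += $new
}
# for every old file, only hash-compare against new files with the same key
foreach($old in $files_old){
    $key = "$($old.Length)|$($old.LastWriteTimeUtc.Ticks)"
    if($index.ContainsKey($key)){
        $old_hash = (Get-FileHash -Path $old.FullName -Algorithm SHA1).Hash
        foreach($candidate in $index[$key]){
            if((Get-FileHash -Path $candidate.FullName -Algorithm SHA1).Hash -eq $old_hash){
                Remove-Item -Path $old.FullName -Confirm
                break
            }
        }
    }
}

The matching logic is the same as in the script above; only the lookup order changes.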

flolilo

Posted 2017-06-13T23:58:44.630

Reputation: 1 976

It is exactly what I needed. Thanks very much. Although I would be able to solve my issue using just Windows, I will try to "replicate this code" (if you allow me) in bash, as then I will be able to put it on my Raspberry Pi and let it run for as long as it needs.

I already made a little fix, as it wasn't working correctly: the old line for($i=0; $i -lt $files_new.length; $i++){ became for($i=0; $i -lt $files_old.length; $i++){

When I create the shell script, I will post it here. – Tuts – 2017-06-15T13:21:33.483

That's a nicely built solution. – music2myear – 2017-06-15T16:56:12.120

@Tuts thanks, I fixed that line just now. I'm really looking forward to the resulting bash-script! – flolilo – 2017-06-16T09:32:53.153

0

Have you tried using third-party deduplication software?
I have tried CloudBerry deduplication and it is really efficient, as:

  • it has its own dedup mechanism to eliminate duplicate data, thus saving a lot of storage space.
  • such tools also tend to be more reliable and have dedicated resource-management techniques.

user8010482

Posted 2017-06-13T23:58:44.630

Reputation: 1

Can you insert a link to the software? – yass – 2017-06-17T17:23:55.410

Please read How do I recommend software in my answers? and [edit] your answer accordingly.

– Kamil Maciorowski – 2017-06-17T18:34:54.000

https://www.cloudberrylab.com/dedup-server.aspx – user8010482 – 2017-06-20T18:17:58.950

I hadn't tried CloudBerry for deduplication. But having a look now, it is Windows Server software (this is for my home), and it still doesn't look like it will handle my issue, which is: I have a new disk and many old disks. There are duplicates inside the new disk (for now, I don't care) and there might be duplicates on the old disks. What I want is basically to delete what I already have on the new disk. (The new disk is a copy of all the other disks.) – Tuts – 2017-06-23T21:48:53.560

0

rmlint is a command-line utility with options to do exactly what you want. It runs on Linux and macOS. The command you want is:

$ rmlint --progress \
    --must-match-tagged --keep-all-tagged \
    /mnt/OldDisk1 /mnt/OldDisk2 // /mnt/NewHugeDisk

This will find the duplicates you want. Instead of deleting them directly, it creates a shell script (./rmlint.sh) which you can review, optionally edit and then execute to do the desired deletion.

The '--progress' option gives you a nice progress indicator. The '//' separates 'untagged' from 'tagged' paths; paths after '//' are considered 'tagged'. The '--must-match-tagged --keep-all-tagged' options mean: only find files in untagged paths that have a copy in a tagged path.

You can also shorten that command using the short format of the options:

rmlint -g -m -k /mnt/OldDisk1 /mnt/OldDisk2 // /mnt/NewHugeDisk

thomas_d_j

Posted 2017-06-13T23:58:44.630

Reputation: 121

Sounds like it really is exactly what I need. I just did some quick tests and it looks promising. Will do some more tests. Thanks a lot. – Tuts – 2017-07-10T16:43:41.767

You're welcome. Please use https://github.com/sahib/rmlint/issues to report any problems or improvement suggestions

– thomas_d_j – 2017-07-11T12:58:11.067