How to remove duplicated files in a directory?

25

9

I downloaded a lot of images into a directory.
The downloader renamed files that already existed.
I also renamed some of the files manually.

a.jpg
b.jpg
b(2).jpg
hello.jpg      <-- manually renamed `b(3).jpg`
c.jpg
c(2).jpg
world.jpg      <-- manually renamed `d.jpg`
d(2).jpg
d(3).jpg

How can I remove the duplicated ones? The result should be:

a.jpg
b.jpg
c.jpg
world.jpg

Note: the name doesn't matter. I just want unique files.

kev

Posted 2012-02-05T10:39:23.930

Reputation: 9 972

Answers

27

bash 4.x

#!/bin/bash
declare -A arr
shopt -s globstar

for file in **; do
  [[ -f "$file" ]] || continue

  # the first field of md5sum's output is the hash
  read -r cksm _ < <(md5sum "$file")
  # a hash we have already seen means this file is a duplicate
  if ((arr[$cksm]++)); then
    echo "rm $file"
  fi
done

This is both recursive and handles any file name. The downside is that it requires bash 4.x for associative arrays and recursive searching with **. Remove the echo once you are happy with the results.

gawk version

gawk '
  {
    # hash the current file; q holds a literal double quote (set below)
    cmd="md5sum " q FILENAME q
    cmd | getline cksm
    close(cmd)
    sub(/ .*$/,"",cksm)      # keep only the hash field
    if(a[cksm]++){           # hash already seen: duplicate file
      system("echo rm " q FILENAME q)
    }
    nextfile
  }' q='"' *

Note that this will still break on files that have double quotes in their name; there is no real way around that in awk. Remove the echo once you are happy with the results.

SiegeX

Posted 2012-02-05T10:39:23.930

Reputation: 1 911

Fine, the bash version worked for me, but in my test with 2 similar folders it deleted half of the duplicates in one folder and half in the other. Why? I would expect deletion of every duplicate from only one of the folders. – Ferroao – 2017-12-05T17:58:20.793

@Ferroao Perhaps they were not exact duplicates. If just one bit is off the md5 hash that my script is using to determine duplicity would be completely different. You could add an echo cksm just after the line starting with read if you want to see each file’s hash. – SiegeX – 2017-12-05T20:38:56.153

no, all "duplicates" (copies) were removed, remaining 1 version, let's say the original. half copies were deleted from one folder, and the other half from the other folder (100% deletion of copies). my 100% is for copies in excess, not of the totality – Ferroao – 2017-12-05T22:40:14.287

@Ferroao I see. In that case it seems when bash does its recursive path expansion via **, it orders the list in such a way that the two folders are interleaved rather than all of folder 1 then all of folder 2. The script will always leave the first ‘original’ it hits as it iterates through the list. You can echo $file before the read line to see if this is true. – SiegeX – 2017-12-06T00:49:28.000

46

fdupes is the tool of your choice. To find all duplicate files (by content, not by name) in the current directory:

fdupes -r .

To manually confirm deletion of duplicated files:

fdupes -r -d .

To automatically delete all copies but the first of each duplicated file (warning: this actually deletes files, as requested):

fdupes -r -f . | grep -v '^$' | xargs rm -v

I'd recommend manually checking the files before deletion:

fdupes -rf . | grep -v '^$' > files
... # check files
xargs -a files rm -v

Jakob

Posted 2012-02-05T10:39:23.930

Reputation: 683

Works great, but fails if file names contain spaces. – Daniel Wolf – 2017-06-23T12:15:26.917

@DanielWolf try with the xargs option -d '\n' – Jakob – 2017-06-27T12:13:15.863

Also, newer versions of fdupes have the built-in option to delete all but the first in a list of duplicate files: fdupes -rdN . where -r is recursive, -d is delete and -N is no-prompt – Rand – 2019-05-15T22:38:15.370

Thank you, this is outstanding because it can detect more than 2 duplicates and allows you to select which of the dupes you want to preserve (or all of them). – Smeterlink – 2019-08-04T16:39:15.970

xargs: unterminated quote – Alexey Sh. – 2020-01-03T05:53:52.787

2

You can try FSLint. It has both a command-line and a GUI interface.
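
For example, on Debian/Ubuntu the command-line scripts are installed outside the default PATH; the path below is the usual package location and may differ on your system:

# findup is FSLint's duplicate finder; it prints groups of identical files
/usr/share/fslint/fslint/findup /path/to/images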

Bibhas

Posted 2012-02-05T10:39:23.930

Reputation: 2 490

or dupeguru-se https://www.hardcoded.net/linux_notes – Ferroao – 2017-12-06T13:33:07.260

1

A more concise way to remove duplicated files (just one line):

young@ubuntu-16:~/test$ md5sum `find ./ -type f` | sort -k1 | uniq -w32 -d | xargs rm -fv

find_same_size.sh

#!/usr/bin/env bash
#set -x
# This small script finds files that have the same size.
find_same_size(){

  if [[ -z $1 || ! -d $1 ]]; then
    echo "Usage: $0 directory_name"
    exit 1
  else
    dir_name=$1
    echo "current directory is $1"

    for i in $(find "$dir_name" -type f); do
      ls -fl "$i"
    done | awk '{
        # rebuild the file name (fields 9..NF) and group files by size (field 5)
        f=""
        if(NF>9) for(i=9;i<=NF;i++) f=f?f" "$i:$i; else f=$9
        if(a[$5]){ a[$5]=a[$5]"\n"f; b[$5]++ } else a[$5]=f
      } END{ for(x in b) print a[x] }' | xargs stat -c "%s  %n"   # just list the files
  fi
}

find_same_size "$1"


young@ubuntu-16:~/test$ bash find_same_size.sh tttt/ | awk '{ if($1 !~ /^[[:alpha:]]+/) print $2}' | xargs md5sum | sort | uniq -w32 -d | xargs rm -vf

niceguy oh

Posted 2012-02-05T10:39:23.930

Reputation: 11

1

Being a bit lazy, it didn't take me long to find one online.

You first have to create a CRC checksum of each file, as you obviously only want to remove exact duplicates.

cksum  *.jpg | sort -n > filelist

Then, iterate over this file list, reading in the checksum and the filename. If two consecutive checksums are the same, the file will be removed. This works because the sort is numeric and sorts only on the checksums, which groups duplicate files together.

old=""
while read sum lines filename
do
      if [[ "$sum" != "$old" ]] ; then
            old="$sum"
            continue
      fi
      rm -f "$filename"
done < filelist

Obviously, this doesn't work recursively.

slhck

Posted 2012-02-05T10:39:23.930

Reputation: 182 472

1

How can we check whether two files have the same content?

if diff "$file1" "$file2" > /dev/null; then
    ...

How can we get the list of files in a directory?

files="$( find ${files_dir} -type f )"

We can then take any two files from that list and check whether their names are different but their content is the same.

#!/bin/bash
# removeDuplicates.sh
# Note: word splitting means this breaks on file names containing whitespace.

files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined"
    exit 1
fi

files="$( find "${files_dir}" -type f )"
for file1 in $files; do
    for file2 in $files; do
        # echo "checking $file1 and $file2"
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done

For example, we have some dir:

$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)

So there are only 3 unique files.

Let's run the script:

$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'

And only 3 files are left.

$> ls .tmp/ -1
all.txt
file
text(2)

ДМИТРИЙ МАЛИКОВ

Posted 2012-02-05T10:39:23.930

Reputation: 218

1

I wrote this tiny script to delete duplicated files:

https://gist.github.com/crodas/d16a16c2474602ad725b

Basically, it uses a temporary file (/tmp/list.txt) to create a map of files and their hashes. Later I use that file and the magic of Unix pipes to do the rest.

The script won't delete anything but will print the commands to delete files.

mfilter.sh ./dir | bash
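
For reference, a minimal sketch of that approach follows. This is not the author's exact script (see the gist above for that); the /tmp/list.txt path matches the description, everything else is illustrative:

#!/bin/bash
# Sketch: write "hash  filename" pairs to a temporary list, then print an rm
# command for every file whose hash has already been seen.
dir=${1:-.}
list=/tmp/list.txt

# md5sum prints "hash  filename", one line per file
find "$dir" -type f -exec md5sum {} + > "$list"

# The hash is the first 32 characters; the file name starts at column 35.
# Caveat: file names containing double quotes would break the generated command.
awk 'seen[substr($0,1,32)]++ { print "rm -- \"" substr($0,35) "\"" }' "$list"

Piping its output to bash, as in the usage line above, then performs the actual deletion.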

Hope it helps

crodas

Posted 2012-02-05T10:39:23.930

Reputation: 111

0

I found an easier way to perform the same task:

for i in `md5sum * | sort -k1 | uniq -w32 -d | awk '{print $2}'`; do
    rm -rf "$i"
done

Prashant Lakhera

Posted 2012-02-05T10:39:23.930

Reputation: 101

0

Most, and possibly all, of the remaining answers are terribly inefficient: they compute the checksum of each and every file in the directory to process.

A potentially orders-of-magnitude faster approach is to first get the size of each file, which is almost immediate (ls or stat), and then compute and compare checksums only for the files having a non-unique size.
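
A minimal sketch of that idea in bash (assumptions: GNU stat and md5sum are available, and it only prints the rm commands rather than running them):

#!/bin/bash
# Sketch: hash only those files whose size is shared with at least one other file.
declare -A size_count hash_seen

# First pass: count how many files have each size (stat is cheap).
while IFS= read -r -d '' f; do
  size=$(stat -c %s "$f")
  ((size_count[$size]++))
done < <(find . -type f -print0)

# Second pass: checksum only the files with a non-unique size.
while IFS= read -r -d '' f; do
  size=$(stat -c %s "$f")
  (( size_count[$size] > 1 )) || continue
  read -r hash _ < <(md5sum "$f")
  if ((hash_seen[$hash]++)); then
    echo "rm -- '$f'"   # duplicate content: print only, do not delete
  fi
done < <(find . -type f -print0)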

jlliagre

Posted 2012-02-05T10:39:23.930

Reputation: 12 469

0

This is not what you are asking for, but someone might find it useful when the checksums are not the same, yet the names are similar (with a suffix in parentheses). This script removes files whose names end with a "(digit)" suffix if a file with the same name minus the suffix exists.

#! /bin/bash
# Warning: globstar excludes hidden directories.
# Turn on recursive globbing (in this script) or exit if the option is not supported:
shopt -s globstar || exit

for f in **; do
  extension="${f##*.}"
  # keep only files whose name carries a "(digit)" suffix before the extension
  FILEWITHPAR=$( echo "${f%.*}.$extension" | grep -o -P "(.*\([0-9]\)\..*)" )
  if [ -n "$FILEWITHPAR" ]; then
    echo "$FILEWITHPAR ident"
    # check whether a similar file without the suffix exists
    FILENOPAR=$( echo "$FILEWITHPAR" | sed -e 's/^\(.*\)([0-9])\(.*\).*/\1\2/' )
    echo "$FILENOPAR exists?"
    if [ -f "$FILENOPAR" ]; then
      # delete the file with the suffix in parentheses
      echo "$FILEWITHPAR to be deleted"
      rm -Rf "$FILEWITHPAR"
    else
      echo "no"
    fi
  fi
done

Ferroao

Posted 2012-02-05T10:39:23.930

Reputation: 123

-3

I found a small program that really simplifies this kind of task: fdupes.
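
For example, on Debian-based systems (package name assumed to match the upstream name):

# install the package
sudo apt-get install fdupes

# list duplicate files under the current directory, recursively
fdupes -r .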

Ricky Neff

Posted 2012-02-05T10:39:23.930

Reputation: 55

Please add installation instructions and a usage example appropriate for the question. – simlev – 2018-09-21T08:05:10.317