How can we test whether two files have the same content?
if diff "$file1" "$file2" > /dev/null; then
...
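Since only the yes/no answer matters here, cmp -s may be a lighter alternative to diff: it prints nothing and can stop at the first differing byte. A minimal sketch, where same_content is a hypothetical helper name that does not appear in the original script:

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script):
# exit status 0 when both files contain exactly the same bytes, non-zero otherwise.
same_content() {
    # -s silences all output; "--" protects against names starting with "-".
    cmp -s -- "$1" "$2"
}
```

It can be dropped into the same kind of test, for example: if same_content "$file1" "$file2"; then echo "same"; fi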
How can we get the list of files in a directory?
files="$( find "${files_dir}" -type f )"
We can then take any two files from that list and check whether their names differ and their contents are the same.
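#!/bin/bash
# removeDuplicates.sh: delete files whose content duplicates another file
files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined" >&2
    exit 1
fi
# Read the file list NUL-delimited so names with spaces or newlines survive.
files=()
while IFS= read -r -d '' f; do
    files+=("$f")
done < <(find "$files_dir" -type f -print0)
for file1 in "${files[@]}"; do
    for file2 in "${files[@]}"; do
        # echo "checking $file1 and $file2"
        # Skip self-comparison and files already removed in an earlier pass.
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v -- "$file2"
            fi
        fi
    done
done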
For example, suppose we have this directory:
$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)
So there are only 3 unique files.
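One way to double-check a count like this is to checksum every file and count the distinct results. This is only a rough sketch: it assumes cksum collisions are unlikely enough to ignore for a quick sanity check, and count_unique is a hypothetical helper name, not something from the script above.

```shell
#!/bin/bash
# Hypothetical helper: print how many distinct file contents exist under "$1".
# Assumption: two different files will not share both CRC and size.
count_unique() {
    local f
    while IFS= read -r -d '' f; do
        # Reading from stdin makes cksum print "<crc> <size>" with no file name.
        cksum < "$f"
    done < <(find "$1" -type f -print0) | sort -u | wc -l
}
```

Calling count_unique .tmp on a directory like the one listed above would be expected to report the number of distinct contents rather than the number of files.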
Let's run the script:
$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'
And only 3 files are left:
$> ls .tmp/ -1
all.txt
file
text(2)
Fine, the bash version worked for me, but in my test with 2 similar folders it deleted half of the duplicates in one folder and half in the other. Why? I would expect every duplicate to be deleted from one folder. – Ferroao – 2017-12-05T17:58:20.793
@Ferroao Perhaps they were not exact duplicates. If just one bit is off, the md5 hash that my script uses to determine duplicity would be completely different. You could add an echo cksm just after the line starting with read if you want to see each file's hash. – SiegeX – 2017-12-05T20:38:56.153
No, all "duplicates" (copies) were removed, leaving 1 version, let's say the original. Half of the copies were deleted from one folder and the other half from the other folder (100% deletion of copies). My 100% refers to the copies in excess, not to the totality. – Ferroao – 2017-12-05T22:40:14.287
@Ferroao I see. In that case it seems that when bash does its recursive path expansion via **, it orders the list in such a way that the two folders are interleaved rather than all of folder 1 then all of folder 2. The script will always leave the first 'original' it hits as it iterates through the list. You can echo $file before the read line to see if this is true. – SiegeX – 2017-12-06T00:49:28.000
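The ordering effect discussed in these comments could be avoided by sorting the file list before scanning it, so that the first copy encountered is always the one in the folder that sorts first. The following is only a sketch under stated assumptions, not SiegeX's script: list_duplicates is a hypothetical name, it needs bash 4 or later for associative arrays, it uses cksum as a cheap pre-filter and confirms with cmp, and it only prints the duplicates instead of deleting them.

```shell
#!/bin/bash
# Hypothetical helper: print every file under "$1" whose content duplicates
# a file that comes earlier in byte-wise sorted path order. Nothing is deleted.
list_duplicates() {
    local dir=$1 file key
    local -A first_seen=()      # "<crc> <size>" -> first path seen with that key
    while IFS= read -r -d '' file; do
        # Skip unreadable files; cksum on stdin prints "<crc> <size>" only.
        key=$(cksum < "$file") || continue
        if [[ -z "${first_seen[$key]+x}" ]]; then
            first_seen[$key]=$file
        elif cmp -s -- "${first_seen[$key]}" "$file"; then
            # Same checksum and byte-identical: report the later path.
            printf '%s\n' "$file"
        fi
        # A checksum collision with different bytes is silently left alone.
    done < <(find "$dir" -type f -print0 | LC_ALL=C sort -z)
}
```

With two sibling folders, every path under the first folder sorts before every path under the second, so the reported copies would all come from the second folder. The output can be reviewed first and only then passed to rm.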