How to compare mp3, flac audio data in a file, ignoring header data (ID3 tag) etc.?

16

10

I've backed up some audio files in 2 places and added ID3 tags to one backup but not the other. Time has passed and I no longer remember whether the backups are actually the same. Because one now has ID3 data and the other doesn't, a basic binary compare will fail, and manual inspection would be cumbersome.

Is there a tool to compare just the audio data (not the header or ID3 tags) in mp3s, flac files, and other formats that carry header data such as ID3?

I started a thread on the Beyond Compare forum here: http://www.scootersoftware.com/vbulletin/showthread.php?t=7413

I would also consider other comparison software that does this task.

therobyouknow

Posted 2011-02-21T15:31:53.093

Reputation: 3 596

Answers

8

Ah, the eternal plight. I myself struggled with this very question for so long and tried so many duplicate-file-finding apps that I eventually gave up and decided to write one myself. And then I found AllDup.

AllDup made me indefinitely back-burner my own project because it is a fast duplicate file finder that can compare MP3 and JPEG files while ignoring their ID3 tags and Exif data respectively. Even better, Michael Thummerer is very responsive to feedback and is quick to fix bugs and implement suggestions (you can suggest ignoring FLAC headers). To top it all off, AllDup is free.

Synetech

Posted 2011-02-21T15:31:53.093

Reputation: 63 242

6

Here's a way to do it at the shell. You need avconv, which in Debian/Ubuntu is in libav-tools.

$ avconv -i INPUT_FILE -c:a copy -f crc - 2>/dev/null | grep CRC

You'll get a line like this:

CRC=0xabfdfe10

This reads every frame of audio data and generates a CRC over it. Because the audio codec is set to copy, the frames are not decoded first. So a command like this prints the CRC of multiple files so that they can be compared:

for f in *.mp3; do printf '%s: ' "$f"; avconv -i "$f" -c:a copy -f crc - 2>/dev/null | grep CRC; done
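
To check just two specific files instead of reading through a list, the CRC lines can be compared directly. What follows is only a minimal sketch under a few assumptions: it uses ffmpeg on the assumption that it accepts the same options as avconv here, audio_crc and same_audio are made-up helper names, and -map 0:a is an extra option not in the command above, added so that embedded cover art (which ffmpeg exposes as a video stream) is left out of the CRC.

```shell
#!/usr/bin/env bash
# Sketch only: assumes ffmpeg is on PATH and that its crc muxer prints "CRC=0x...".
# audio_crc and same_audio are hypothetical helper names, not part of any tool.

# Print the CRC of the audio packets of one file, copied without decoding.
audio_crc() {
    # -map 0:a keeps only the audio streams, so attached pictures are ignored.
    ffmpeg -v error -i "$1" -map 0:a -c:a copy -f crc - 2>/dev/null | grep '^CRC='
}

# Exit 0 if both files give the same CRC line, 1 if the CRCs differ,
# 2 if no CRC could be produced for one of the files.
same_audio() {
    local a b
    a=$(audio_crc "$1") || return 2
    b=$(audio_crc "$2") || return 2
    [ "$a" = "$b" ]
}
```

Usage would be something like: same_audio backup1/song.mp3 backup2/song.mp3 && echo "same audio". Since the frames are copied rather than decoded, a file that was re-encoded will not match, even if it sounds identical.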

blujay

Posted 2011-02-21T15:31:53.093

Reputation: 469

Not very fast, but works perfectly to get a unique checksum for mp3 files to check for duplicates. Thank you. – fred727 – 2016-11-11T16:13:25.127

A faster alternative, if you can use PHP, is the getID3 library: http://www.getid3.org/phpBB3/viewtopic.php?f=3&t=1936 – fred727 – 2016-11-11T23:15:12.450

@fred727 I checked the avconv man page and realized that the crc option decodes the audio and computes the CRC of the decoded audio. But you can avoid that by setting the audio codec to copy. Now, on my system, the command runs in 0.13 seconds instead of 1.13 seconds. I updated the answer, so now you can avoid using PHP. :) – blujay – 2016-11-14T06:59:04.693

2

One possible solution is to use any tool to convert the files into an uncompressed stream (PCM, WAV) without metadata and then compare the results. For the conversion you can use whatever software you have, such as ffmpeg, sox or avidemux.

For example, here is how I do that with ffmpeg.

Say I have 2 files with different metadata:

$ diff Original.mp3 Possible-dup.mp3 ; echo $?

Binary files Original.mp3 and Possible-dup.mp3 differ

A brute-force comparison complains that they differ.

Then we just convert both files and diff the audio body:

$ diff <( ffmpeg -loglevel 8 -i Original.mp3 -map_metadata -1 -f wav - ) <( ffmpeg -loglevel 8 -i Possible-dup.mp3 -map_metadata -1 -f wav - ) ; echo $?

0

Of course, the ; echo $? part is just for demonstration purposes, to show the return code.
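
For the situation in the question (two backups of the same tree, one tagged and one not), the same comparison could be run over every file. This is a rough sketch rather than a finished tool: decoded_equal and compare_trees are hypothetical names, it assumes bash, GNU find and sort, and ffmpeg on the PATH, and it assumes both backups use the same relative file names.

```shell
#!/usr/bin/env bash
# Sketch only. decoded_equal and compare_trees are hypothetical helper names.

# Succeed when both files decode to byte-identical WAV streams.
# Caveat: if ffmpeg fails on both files, both streams are empty and compare equal.
decoded_equal() {
    # </dev/null stops ffmpeg from consuming the file list read by the loop below.
    cmp -s \
        <(ffmpeg -loglevel 8 -i "$1" -map_metadata -1 -f wav - </dev/null) \
        <(ffmpeg -loglevel 8 -i "$2" -map_metadata -1 -f wav - </dev/null)
}

# For every .mp3 under tree A, report how it compares to the same path in tree B.
compare_trees() {
    local a=${1%/} b=${2%/} f rel
    while IFS= read -r -d '' f; do
        rel=${f#"$a"/}                      # path relative to tree A
        if [ ! -f "$b/$rel" ]; then
            echo "MISSING $rel"
        elif decoded_equal "$f" "$b/$rel"; then
            echo "SAME    $rel"
        else
            echo "DIFFER  $rel"
        fi
    done < <(find "$a" -type f -name '*.mp3' -print0 | sort -z)
}
```

Calling compare_trees /backup/untagged /backup/tagged would print one line per file. cmp -s is used instead of diff because it stays silent and only its exit status is needed.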

Processing multiple files (traverse directories)

If you want to find duplicates in a collection, it is worth calculating a checksum of the data in each file (any will do: CRC, MD5, SHA-2 such as SHA-256) and then just looking for collisions.

Although it is outside the scope of the question, here are some simple suggestions for finding duplicate files in a directory based only on their contents, without considering metadata.

  1. First calculate a hash of the data in each file (and save it to a file for further processing):

     for file in *.mp3; do printf "%s:%s\n" "$( ffmpeg -loglevel 8 -i "$file" -map_metadata -1 -f wav - | sha256sum | cut -d' ' -f1 )" "$file"; done > mp3data.hashes

     The file will look like this:

     $ cat mp3data.hashes
     ad48913a11de29ad4639253f2f06d8480b73d48a5f1d0aaa24271c0ba3998d02:file1.mp3
     54320b708cea0771a8cf71fac24196a070836376dd83eedd619f247c2ece7480:file2.mp3
     1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f:Original.mp3
     8918674499b90ace36bcfb94d0d8ca1bc9f8bb391b166f899779b373905ddbc1:Other-dup.mp3
     8918674499b90ace36bcfb94d0d8ca1bc9f8bb391b166f899779b373905ddbc1:Other.mp3
     1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f:Possible-dup.mp3

     Any RDBMS would be very helpful here to aggregate counts and select such data. But to continue with a pure command-line solution, you can take the simple steps below.

  2. Look at the duplicate hashes, if any (an extra step to show how it works; it is not needed to find dupes):

     $ count.by.regexp.awk '([0-9a-f]+):' mp3data.hashes
     [1:54320b708cea0771a8cf71fac24196a070836376dd83eedd619f247c2ece7480]=1
     [1:1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f]=2
     [1:ad48913a11de29ad4639253f2f06d8480b73d48a5f1d0aaa24271c0ba3998d02]=1

  3. And all together, to list the files duplicated by content:

     $ grep mp3data.hashes -f <( count.by.regexp.awk '([0-9a-f]+):' mp3data.hashes | grep -oP '(?<=\[1:).{64}(?!]=1$)' ) | sort
     1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f:Original.mp3
     1d8627a21bdbf74cc5c7bc9451f7db264c167f7df4cbad7d8db80bc2f347110f:Possible-dup.mp3
     8918674499b90ace36bcfb94d0d8ca1bc9f8bb391b166f899779b373905ddbc1:Other-dup.mp3
     8918674499b90ace36bcfb94d0d8ca1bc9f8bb391b166f899779b373905ddbc1:Other.mp3

count.by.regexp.awk is a simple awk script that counts matches of a regexp pattern.
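
If count.by.regexp.awk is not at hand, the same listing can probably be produced with plain awk. The following is a minimal sketch, not the script used above: list_dupes is a hypothetical helper name, and it assumes one "hash:filename" entry per line, as in mp3data.hashes, with no colon inside the hash.

```shell
#!/usr/bin/env bash
# Sketch only: list_dupes is a hypothetical helper, not the script used above.
# Input format assumed: one "hash:filename" entry per line.
list_dupes() {
    awk -F: '
        # Tally how often each hash occurs and remember every line.
        { count[$1]++; lines[NR] = $0; hash[NR] = $1 }
        END {
            # Print only the lines whose hash was seen more than once.
            for (i = 1; i <= NR; i++)
                if (count[hash[i]] > 1) print lines[i]
        }
    ' "$1" | LC_ALL=C sort
}
```

Running list_dupes mp3data.hashes prints every entry whose hash occurs more than once, sorted so that files with the same content end up next to each other.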

Hubbitus

Posted 2011-02-21T15:31:53.093

Reputation: 141

+1 thanks Hubbitus - a nice self-contained solution based on open-source. Good to know. Also useful for putting into a batch. – therobyouknow – 2017-06-15T11:28:11.190

2

Foobar2000 with the Binary Comparator plugin will do this.

afrazier

Posted 2011-02-21T15:31:53.093

Reputation: 21 316

+1 Foobar2000 looks FANTASTIC. Why? Because it uses proper Windows native UIs, looks nice, lightweight and minimalist like VNC, yet is rich in functionality and actually provides the information and features that one really wants, like song length etc. Windows Media Player and WinAmp don't show this information and instead give prominence to obscure features that one would rarely use. Binary Comparator is a great feature for the question I'm asking. Thanks. – therobyouknow – 2011-02-22T19:37:31.217

Glad you like it! – afrazier – 2011-02-22T21:07:33.743

1

I also asked this on the Beyond Compare forum, as mentioned in the question, and Beyond Compare also provides a solution:

http://www.scootersoftware.com/vbulletin/showthread.php?t=7413

Both approaches are worth considering:

  • The AllDup solution is best if you don't care which copies of the files are preserved and which are discarded in a directory tree AND you have a mix of tagged and untagged files in the same folders that you want to run the duplicate check on.

  • Beyond Compare is best if you want to retain the directory tree AND are comparing 2 separate directory structures, helped also by the on-the-fly, non-destructive flatten-tree option.

therobyouknow

Posted 2011-02-21T15:31:53.093

Reputation: 3 596