Is there a tool to measure file difference percentage?

7

1

I am looking to compare two text files. Normally, I can just use diff to compare the two files to see the differences. This is great, except that I am more concerned with the percentage difference of the two files.

For example:

File A:
    banana
    TESTING

File B:
    TESTING

In this case, the result would be a 50% difference. I've taken a look at wdiff, and it mostly works, with the exception being that it looks at elements word-by-word (in fact, I can get the result above by doing wdiff -s filea fileb).

Does a tool exist that reports the percentage difference between two files at the character or byte level?

NT3RP

Posted 2011-10-17T19:16:03.273

Reputation: 425

Answers

5

Doing a character-by-character comparison of two text files is effectively a Levenshtein distance calculation. There isn't a common standalone program in Linux that will do this calculation, but there are some library functions (I know PHP has one) and tons of example code online for this calculation.

One other little caveat is that Levenshtein distance is strictly the minimum number of single-character edits (insertions, deletions, and substitutions) needed to turn one string into the other, so if you're looking for a percentage, you'll need to normalize the calculated distance. Dividing by the mean of the lengths of the two strings (sizes of the text files) is a widely used normalization.
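As a rough illustration of the normalization described above, here is a minimal pure-Python sketch. The function names levenshtein and percent_difference are made up for this example, and the implementation is the textbook two-row dynamic programme, not any particular library's code. It takes time proportional to the product of the two lengths, so it is only practical for fairly small files.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def percent_difference(a: str, b: str) -> float:
    """Distance divided by the mean of the two lengths, as a percentage."""
    mean_length = (len(a) + len(b)) / 2
    if mean_length == 0:
        return 0.0  # two empty inputs are identical
    return 100.0 * levenshtein(a, b) / mean_length
```

Note that with the mean-length normalization the result can exceed 100% when one input is much shorter than the other; dividing by the longer length instead would cap it at 100%.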

stharward

Posted 2011-10-17T19:16:03.273

Reputation: 181

3

Try piping the output of diff to the wc command. There are several options, but -l will likely give you a decent count of the number of changed lines. Since diff outputs both the before and after lines along with some other formatting, you may have to divide the result by that factor and then place it over the length of the entire file in lines, as reported by wc -l.
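The same line-counting idea can be sketched in Python with the standard difflib module. This is a swapped-in technique rather than the diff and wc pipeline itself, and the function name changed_line_percent is hypothetical. It assumes that a modified line should count once, not once as a removal and once as an addition, and that the longer file is the denominator.

```python
import difflib


def changed_line_percent(old_lines, new_lines):
    """Percentage of lines that differ, relative to the longer file."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    removed = added = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed += i2 - i1  # lines present only in the old file
            added += j2 - j1    # lines present only in the new file
    total = max(len(old_lines), len(new_lines))
    if total == 0:
        return 0.0  # two empty files are identical
    # A replaced line shows up in both counters, so take the larger one
    return 100.0 * max(removed, added) / total
```

For the two files in the question, changed_line_percent(["banana", "TESTING"], ["TESTING"]) gives 50.0, which matches the expected 50% difference at line granularity.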

Chris Nava

Posted 2011-10-17T19:16:03.273

Reputation: 7 009

Small correction: you shouldn't have a space after -u, otherwise diff thinks 0 is a file name (at least on Linux). It should be diff -u0 file1 file2. – kristianp – 2020-02-16T21:51:50.027

diff -u 0 removes context – Lazy Badger – 2011-10-18T15:05:19.593

1

I had a similar problem with two sets of transcribed files. I used the Levenshtein distance, as suggested in the most upvoted answer, but found Python to be a better option:

pip install python-Levenshtein

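The code, saved as distance.py, is:

import sys

from Levenshtein import distance

# Read both files fully into memory as text
with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    txt1 = f1.read()
    txt2 = f2.read()

# Levenshtein edit distance between the two file contents
print("distance:", distance(txt1, txt2))

Usage: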

python distance.py file1 file2

Eduard Florinescu

Posted 2011-10-17T19:16:03.273

Reputation: 2 116