How can I compare the contents of .pdf files, excluding filenames from comparison?

3

I usually use WinMerge to view the differences between files, but in this case it doesn't help. The files I'm comparing are known to have different filenames, which is creating false positives when 2 files with the same document inside have different filenames.

I have a folder full of many directories representing all the vendors my company does business with, and they include many .pdf files of receipts & invoices. It's the master vendor list. The invoices & receipts are named such that the names don't make sense without the surrounding directory structure to provide context. For example here we have "Vendors/Company Foo/Product Bar/Invoice#3.pdf"

Then I have another folder with many receipts & invoices in it, that used to be maintained separately from the master vendor list, and was supposed to include a manually-created copy of every receipt & invoice that was entered into the appropriate entry in master vendor directory structure. These receipts & invoices were to have been renamed so they're easier for the accountant to read & know what they refer to. For example here we have "Taxes/CompanyFoo ProductBar.pdf".

I've searched for files of type .pdf in the top-level folder of the master vendor list, so that my search results include receipts & invoices from all the vendors in the directory structure. Then I copied these .pdf files to another folder on my Desktop, so I can compare them. I compared those files to the files in the 'taxes' folder using WinMerge to see if any of the files in the 'taxes' folder don't exist in the 'master vendor' directories, and vice-versa.

But WinMerge counts files as different just because their filenames don't match. I need to know if the file content is different despite what the filename is.

There are hundreds of these files & if any are in the 'taxes' folder that aren't in their corresponding 'master vendor' directory, I need to rectify that & file them correctly.

Can someone recommend a tool that can do this?

cdvonstinkpot

Posted 2012-03-18T19:29:40.570

Reputation: 31

Why don't you use md5sum recursively? Two PDF files with the same checksum and the same file size have an extremely low chance of being different. – Benoit – 2012-03-18T19:35:33.647

Possible duplicate of "Which duplicate files and folders finders exist for Windows?" – Daniel Beck – 2012-03-18T19:37:37.910

I found something in that thread that does what I need; in fact, it was the answer to that thread. Thanks, Daniel Beck! However, I don't know how to mark that as the answer to this question. – cdvonstinkpot – 2012-03-18T23:43:25.147

Answers

2

I think the i-net PDF content comparer would be helpful.

It is now at version 2.0, which offers a GUI and flexible pricing options. A free 30-day trial version is still available, with which you can evaluate every aspect of the software.

hamed

Posted 2012-03-18T19:29:40.570

Reputation: 4 960

Looked doable until I saw the price: 1295 US$. And the terms of the free trial make it unusable, since I'm not a developer. – cdvonstinkpot – 2012-03-18T23:21:10.963

1

If you have some kind of Unix environment available (on Windows, I suggest Cygwin), you can easily find duplicate files below the current directory with something like this:

find . -type f -exec md5sum '{}' '+' | sort | uniq -D -w 32

The output is the MD5 checksum and name of every file that has at least one duplicate (same checksum). Because the list is sorted by checksum, duplicates appear on consecutive lines. The -w 32 option makes uniq compare only the first 32 characters of each line, which is the checksum. Replace the . after find with the path you want to search if it is not the current directory.

Edit:

Conversely, to get the files that have no duplicates, you can use

find . -type f -exec md5sum '{}' '+' | sort | uniq -u -w 32

That prints only the files that have no duplicate anywhere below the current directory.
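For the situation in the question, where the goal is to find files in one folder whose content is missing from another directory tree, the same checksum idea can be applied to two trees separately. The following is only a sketch under the same assumptions as above (GNU find, md5sum and grep, for example from Cygwin). The function name unmatched_in is made up for illustration.

```shell
# Hypothetical helper: print the checksum and name of every file under "$1"
# whose content does not occur anywhere under "$2". File names are ignored.
unmatched_in() {
    src=$1
    ref=$2
    ref_sums=$(mktemp)
    # Collect the distinct MD5 checksums of all files in the reference tree.
    find "$ref" -type f -exec md5sum {} + | cut -c1-32 | sort -u > "$ref_sums"
    # Keep only the source files whose checksum is not in that list.
    # grep exits non-zero when nothing is left to print; that is not an error here.
    find "$src" -type f -exec md5sum {} + | grep -v -F -f "$ref_sums" || true
    rm -f "$ref_sums"
}
```

For example, `unmatched_in Taxes Vendors` would list the files under Taxes that have no byte-identical copy under Vendors, and swapping the arguments checks the other direction. Only byte-identical files count as matches, so a PDF that was re-saved or re-exported will be reported as unmatched.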

Eroen

Posted 2012-03-18T19:29:40.570

Reputation: 5 615

0

Try the app "PDF Compare", which compares both PDF document metadata and page images at the pixel level:

https://www.microsoft.com/en-us/store/p/pdfcompare/9n9dmzjbz2nl#

rick

Posted 2012-03-18T19:29:40.570

Reputation: 1

0

  1. You can (really, you must) use the xdocdiff plugin for WinMerge if you are comparing content by eye.
  2. CompareIt! can render PDF files (so-so) and display them in its comparison windows without additional plugins.
  3. DiffPDF compares and displays the files even better (see the screenshot on its page), and it is cross-platform.

As an alternative, you could store a plain-text copy of each PDF under the same name (converted with, for example, pdftotext) and compare only the text versions with any tool.
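The text-only comparison could be sketched roughly like this. It assumes the pdftotext command (from Poppler or Xpdf) as the converter, which is a choice made for this sketch, and the helper names pdf_text_sum and same_pdf_text are made up for illustration.

```shell
# Hypothetical helper: print the MD5 checksum of the text extracted from the PDF "$1".
# Assumes pdftotext is installed; the trailing "-" sends the text to standard output.
pdf_text_sum() {
    pdftotext -layout "$1" - | md5sum | cut -c1-32
}

# Hypothetical helper: succeed if two PDFs yield exactly the same extracted text.
same_pdf_text() {
    [ "$(pdf_text_sum "$1")" = "$(pdf_text_sum "$2")" ]
}
```

Comparing extracted text can match two PDFs whose bytes differ, for instance because one was re-saved with different metadata. The caveat is that scanned, image-only PDFs contain no extractable text, so they would all produce the same (empty) result and look identical to this check.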

Lazy Badger

Posted 2012-03-18T19:29:40.570

Reputation: 3 557

0

I just did this, and this is what I used. It worked well and it was simple!

http://www.qtrac.eu/diffpdf.html

Micah Armantrout

Posted 2012-03-18T19:29:40.570

Reputation: 624