Can two files of different formats containing the same data have the same checksum value?

2

I am not talking about deliberate exploitation (collision attacks); assume MD5 has no weaknesses.

  • If I have two files of the same format with the same content, both checksums will be the same. But if two files have the same content in different formats, like PDF and DOC, will they have different checksums?
  • If two files have the same Base64-encoded value, will they have the same MD5 checksum?
  • Applications that find duplicate files: do they use checksums, or some other technique?

P Satish Patro

Posted 2019-02-14T12:15:11.193

Reputation: 169

Answers

3

If two files have the same content but different formats, like PDF and DOC, will they have different checksums?

The format is part of the file's contents. What makes a file "a PDF file" or "a Word DOC file" isn't some auxiliary metadata – it's literally just bytes inside the file. So because a different format means different contents, it will generally mean a different hash/digest as well.
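To see why, here is a toy sketch in Python: the same sentence hashed as raw bytes versus wrapped in format-specific container bytes (a simplified stand-in for a real PDF structure, not an actual PDF) produces different digests.

```python
import hashlib

text = b"Hello, world"
as_plain = text
# Toy illustration only: real PDF/DOC files have far more structure.
as_wrapped = b"%PDF-1.4\n" + text + b"\n%%EOF"

# The digests differ because the on-disk bytes differ,
# even though the human-readable "content" is the same.
print(hashlib.md5(as_plain).hexdigest())
print(hashlib.md5(as_wrapped).hexdigest())
```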

If 2 files have same Base64 encoded value, will they have same MD5 checksum?

Base64 is not a compression function, it is a lossless 1:1 encoding. So if two files have the same Base64-encoded output, that means they had the same input before encoding, too.

In short, the files themselves are identical, so yes they'll have the same digest.
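A quick Python sketch of both directions of that argument: Base64 round-trips losslessly, so equal encoded output implies equal input bytes, and equal input bytes always produce equal MD5 digests.

```python
import base64
import hashlib

a = b"some file contents"
b = b"some file contents"

# Base64 is lossless: decoding the encoding returns the original bytes.
assert base64.b64decode(base64.b64encode(a)) == a

# Equal Base64 output implies equal input...
assert base64.b64encode(a) == base64.b64encode(b)
# ...and equal input bytes always hash to the same MD5 digest.
assert hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest()
```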

Application which can find duplicate files. Do they use checksum value or which technology?

The exact implementation varies, but usually yes, the application will digest the whole file and store the resulting hash in memory, then it'll look for identical hashes. This obviously requires much less memory than remembering the whole file, and much less time than comparing each possible pair one-at-a-time.
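A minimal sketch of that approach in Python (the function name and structure are my own, not any particular tool's implementation). It also uses file size as a cheap pre-filter, since files of different sizes can never be duplicates:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under `root` by content digest; groups of 2+ are duplicates."""
    # Cheap pre-filter: only same-size files can possibly match.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Hash only the candidates, then group by digest.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)

    return [group for group in by_digest.values() if len(group) > 1]
```

Real tools refine this further, e.g. hashing only the first few kilobytes before committing to a full-file hash, or streaming large files in chunks instead of reading them whole.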

user1686

Posted 2019-02-14T12:15:11.193

Reputation: 283 655

In the 3rd question, do they take metadata into consideration also? – P Satish Patro – 2019-02-14T12:39:39.690

You mean metadata like modification time? Probably not. Only file size is very useful, because obviously different-size files won't be duplicates. But other metadata might be different even though the contents are identical. – user1686 – 2019-02-14T12:43:15.713

Like created time and extension, etc. – P Satish Patro – 2019-02-14T12:44:17.060

Timestamps are generally useless for duplicate checks. You can easily end up with duplicates having different timestamps; e.g. simply copying the file to another disk and back will change the "created" time. The filename extension isn't actually separate metadata, it's merely part of the filename... it's possible that programs might use it to speed up scans, but I doubt it. – user1686 – 2019-02-14T12:45:46.313

Off-topic: in Windows, when I open Properties on a large folder, the size is not shown immediately but counts up gradually (100 KB, 2 MB, 4 MB, 18 MB, ... up to, say, 4 GB). If I then launch 'Folder Size', a third-party application that reports folder sizes and analyzes nested-folder storage, it surprisingly finishes first, even though it was not running in the background or indexing beforehand. How can this happen? Can we say Windows's way is relatively less efficient? – P Satish Patro – 2019-02-14T12:47:53.493

Got it @grawity – P Satish Patro – 2019-02-14T12:48:47.317