Could I recover the content of file from its checksum/hash?

Question

Let's say I have a video file that is split into multiple parts. Each piece is 2 Megabytes. I also have a list of the *insert hash name here* for each piece and also for the full file.

Now assume that I have misplaced/lost/fubar one of these pieces.

Could I retrieve the lost piece from its hash, using brute force or any other method in a human-lifespan amount of time?

A rainbow-style table would be unfeasible, I think.

Bonus numerical question - how much would it take on a medium-size distributed computing network based on mostly consumer PCs? (Example: 4 GHz CPU + entry level GPU + 8 GB RAM)

Dutch [Jan Sloot](https://en.wikipedia.org/wiki/Jan_Sloot) claimed he could compress a complete movie down to 8 kilobytes of data. Perhaps he used these Sha256 checksums. Unfortunately he died before publishing his work. — Jeff, Jan 25 '16 at 18:03
[A very similar question](http://stackoverflow.com/q/34757434/1468366) was recently asked on Stack Overflow. — MvG, Jan 25 '16 at 18:19
`x + y = 10`. Can I recover the exact values of `x` and `y` from 10? — Ajedi32, Jan 25 '16 at 20:12
@Jeff There is no way that could work. Even if you took only the script there'd be more than 8KB of entropy in that. — kasperd, Jan 25 '16 at 20:20
@Eumel, not necessarily. His question does not specify the COST of the inversion, just if it were possible. If such a thing were possible but extremely costly, the answer would still be Yes, but it would not necessarily be a practical compression scheme — SplashHit, Jan 25 '16 at 22:51
@SplashHit he asks for a human amount of time on his PC which would probably be beween a few hours and a few days. — Eumel, Jan 26 '16 at 09:04
The amount of time would be on the order of the time required to brute-force a 10-million character long password. With current technology that's beyond the realms of geological time, let alone human time. — Dog, Jan 26 '16 at 13:40
@Ajedi32 if you know that the result must look like a movie, then possibly. For example, the entropy should be low, so there is a good chance that x an y follow each other more or less closely, or that a specific header is known to be found in the file. — njzk2, Jan 26 '16 at 16:34

pri · Accepted Answer · 2016-01-25T13:38:28.687

61

A simple answer, NO.

It is like asking, if I know, that x%4 = 3, is it possible to find the value of x? No. Surely, there would be infinite values of x satisfying this equation, but you wouldn't simply know which one is correct.

Similarly, many(or infinite) video clips could result in a given hash value(obviously, infinite video clips have to be mapped to a specific number of hash values, so collisions are bound to happen). You wouldn't know which clip is correct.

That too, in human time? No.

EDIT: As pointed out in comments, since the file is chunked into pieces of 2 MB, there won't be infinite possibilities, but it would be pretty large(2 raised to power of 16.7 million, approximately). Brute-forcing such a large number of possibilities, in human time, is still close to impossible. But yeah, it's not infinite.

edited Jan 25 '16 at 13:38

answered Jan 25 '16 at 08:39

pri

4,438
24
31

1

Good answer. explains it very simply, very good especially for people who dont understand all that techy stuff... – My1 Jan 25 '16 at 08:56
I will remember this as well. I often have to eli5 (explain as if the other person would be a 5 years old) some tech stuff. my favorite example is explaining OS file deletion with books. – My1 Jan 25 '16 at 09:04
20

I disagree. *"You wouldn't know which clip is correct"* While it's true *in general*, here we have additional constraints: 1. data is 2MB in size, 2. data is a video file, 3. that video file is part of larger video. – el.pescado - нет войне Jan 25 '16 at 09:10
1

@el.pescado Still, even then there would be infinite samples which would result in the same hash. Also, op needs it in human time. – pri Jan 25 '16 at 09:59
2

@PriyankGupta Well, there are only finitely many 2MB chunks. But there is certainly more information to be expected in a 2MB video chunk than can be expressed using the 512 bits of the effectively two SHA256 hashes. – Hagen von Eitzen Jan 25 '16 at 13:18
@HagenvonEitzen I get it. Although still that would amount to 16.7M bits, and brute forcing wouldn't be done in human time, I'll edit the answer to make it more clear. – pri Jan 25 '16 at 13:32
2

@PriyankGupta: There are not infinite possibilities but many - about 2^(2 * 8 * 1024^2 - 256). (not counting the mentioned constraints 2. and 3.) Anyway such a large number is like infinite for our imagination :) In Python 3: `import decimal; '{:.2E}'.format(decimal.Decimal(2**(2*8*1024**2-256)))`: `1.57E+5050368` It is like guessing the 2 MB chunk shorter by the length of the hash (32 bytes). – pabouk - Ukraine stay strong Jan 25 '16 at 13:42
@pabouk Yes, I've edited the answer. :) – pri Jan 25 '16 at 13:45
I was about to call you out because your algebra example *was* figure-outtable; `x=12`. **BUT** then I realized that you used the modulo operator and not a divisor so I +1'ed :-) – MonkeyZeus Jan 25 '16 at 17:44
you also gain the benefit of knowing the format the data is in so any combination of data that doesn't fit your format you can quickly exclude, and that narrows the search space by many, many orders of magnitude. Still, my gut tells me it won't be enough orders of magnitude to get into a reasonable time. – corsiKa Jan 26 '16 at 13:16
Everyone seems to have missed the bit about there's also a checksum for the complete file. The chance of a hash collision that matches *both* checksums in combination with the rest of the file are far slimmer. – Dog Jan 26 '16 at 13:38
You can also correlate frames from the 2MB chunks before and after. There's probably enough constraints on the problem at that point to say that you could solve it a human amount of time. Hardest part would be integrating all the information. It would be similar to reversing a PRNG or a set of them. The larger the filesize of a frame the easier it will be. – Black Jan 26 '16 at 14:41

score 14 · Answer 2 · answered Jan 25 '16 at 05:43

14

This is not possible no matter how fast your computer is, simply because you cannot recreate the correct information out of practically nothing.

You are actually asking for restoring 2 MB from 32 byte (size of SHA-256) or at most 64 byte (SHA-256 for chunk and for total file). This would be an ratio of 1:65536 or 1:32768. Given that video is already heavily compressed the chance is practically zero that you could restore the original data from this few information. It might be that you could create a 2 MB chunk which results in the specific SHA-256 hashes but chances are very low that this would be that original chunk.

answered Jan 25 '16 at 05:43

Steffen Ullrich

184,332
29
363
424

36

If you actually manage to create a 2 MB chunk that results in a SHA-256 hash that you did not already know to be the result of that chunk, I think the NSA will want to have a word with you. – user541686 Jan 25 '16 at 10:17
3

@Mehrdad: Or the people running Powerball, for that matter. – Piskvor left the building Jan 25 '16 at 15:30

rdegges · Answer 3 · 2016-01-25T17:46:17.170

10

You could not reproduce the file in any reasonable amount of time. The reason is that the only way to 'reverse' a hash is via brute-force, and considering how large the original file was, it would take you that exact amount of bytes to brute force.

Let's say you have a video file that is 100MB large, precisely.

1MB = 1,000,000 bytes
100MB = 100,000,000 bytes

This means you would need to brute force this original file and verify it's hash, you'd need to try n^r permutations. Assuming the video file uses only 256 characters per byte (ascii), we'd be looking at:

256^100,000,000 ≈ 10^240,823,997 ≈ ∞

That's essentially infinite -- it would take basically FOREVER to compute this, regardless of CPU resources.

UPDATE: There's also, of course, the issue with hash collisions that I left out here -- with a Sha256 hash, you're likely going to run into pretty much an infinite amount of collisions with a file as large as our example. I neglected to mention this earlier for simplicity's sake.

edited Jan 25 '16 at 17:46

answered Jan 25 '16 at 03:33

rdegges

333
1
5

13

This isn't even the real problem, it's not possible even given infinite CPU power and time. The real problem is that with a file significantly larger than the hash size you're just going to find practically infinite hash collisions. You're equally as likely to find a 2 megabyte clip of the movie Inception as your original video. – Matti Virkkunen Jan 25 '16 at 08:04
5

This answer is completely wrong. The real reason why you can't recover is because there are a possibly infinite number of files which produce the same hash. How would you know which one is the correct one. Another way to look at it is this. If the hash contained all the information, why not store the hash instead of a much larger file? – Roy T. Jan 25 '16 at 08:29
9

@RoyT. - An infinite number of files, but not an infinite number of files *of that size*. – Compro01 Jan 25 '16 at 08:55
The end result also has to parse as a valid video file, so collision concerns aren't an issue – wireghoul Jan 25 '16 at 10:02
6

@wireghoul: Here's a thought experiment for you: Take any valid 2MB video file, and change pixel values around ever so slightly (it's lossy compression so it's unnoticeable anyways), until you get a hash collision. This will happen within 2^256 tries for SHA-256 because you literally run out of hash bits to be different. Given infinite CPU power or time you can generate practically any video content that matches a given hash. – Matti Virkkunen Jan 25 '16 at 10:38
2

@wireghoul: Obviously this is impossible in practice with current computing resources, but it shows how it is impossible to recover a video based on just the hash. Now of course if the original file had been stored somewhere in a searchable location, the hash is a practically perfect way to find and identify it, because the probability of a *random* collision is almost nonexistent. – Matti Virkkunen Jan 25 '16 at 10:39
1

And you have implied that ASCII is a binary encoding system and it is possible for a byte (8 bits) to store a number of values other than 2^8 = 256. Not to be a killjoy, but this is wrong on so many levels! – wizzwizz4 Jan 25 '16 at 17:02
@wizzwizz4 [A byte isn't necessarily 8 bits](https://stackoverflow.com/questions/2098149/what-platforms-have-something-other-than-8-bit-char), although it has nothing to do with ASCII. – isanae Jan 25 '16 at 17:34
@isanae On most (read: all modernish) file-systems, and **most importantly in SHA itself**, bytes are 8 bits long. – wizzwizz4 Jan 25 '16 at 17:37
@Matti uuuhh, ok, now back it up with math. How many collisions can you have in a 2mb file vs how many can you have in a 2mb video file? – wireghoul Jan 25 '16 at 21:40
@wireghoul: By far most bytes in video files can be changed without making it invalid, as long as you don't change the important header bits too badly. Most of the data in compressed video files is just compressed color values, and you will only get various visual artifacts, some more visible than others. A 2MB video file has 2097152 bytes to play with, and you only need 32 freely chosen bytes to get a collision for a 256-bit (32 byte) hash. – Matti Virkkunen Jan 26 '16 at 07:44
@wiregroul: Small side-not: From the beginning I'm assuming that SHA-256 values are evenly distributed here - it's possible to actually not get a collision even if you try 2^256 different values because you might get collisions *between the hashes you tried*. But SHA-256 is supposed to be "good" which means the hash values should be more or less evenly distributed. If they're not, trying with 33 or 34 bytes ought to do the trick. Still plenty left to choose from. – Matti Virkkunen Jan 26 '16 at 07:47

Mark Buffalo · Answer 4 · 2016-02-07T21:01:00.533

Let's say you have a computer that has infinite amounts of processing power, and can reliably check every possible message against every possible hash in a short time. Here's the problem you now face: collisions.

What's a collision? Many different files can match the exact same signature. Many different messages can match the exact same signature.

Hashing is one-way. You convert a series of characters to a hash. When you validate your hash, you are merely checking to see if the message matches the calculated value of the hash. The problem is that many different messages may match this same hash. It's called collision.

However, since you also have infinite computational power, you can also eventually reconstruct the file through supermassive trial and error. However, once you have all of the possible examples for this hash value, how are you going to tell which one is which?

So you're telling me there's a chance?

With today's technology, and since we'll never have infinite computational power, it will be completely impossible. Even taking the entire world's combined computing power, and multiplying it by a billion, you cannot do this. Even if you somehow did this, how would you be able to tell which message was correct?

Where would my idea apply?

Hashing is one-way. With the provided key, you only validate that it matches your calculated hash.
Encryption is two-way. With the provided key, you get the results back.

Your idea would apply under encryption, not hashing. With encryption, if you have the key, you can get the decrypted contents of the file.

Good explanation! And especially the reference to Encryption is good for addressing the question asked. — Tim Skov Jacobsen, Oct 03 '20 at 14:08

score 4 · Answer 5 · answered Jan 26 '16 at 01:22

It is hard if the underlying file has high enough entropy. If you know something about the underlying data, then you may well be able to recover it. For example, if there is a hacker anywhere in the vicinity it won't be long at all before someone tells you what I md5 hashed to get:

73868cb1848a216984dca1b6b0ee37bc

However video usually has lots of entropy, making this a lost cause or at least a darn hard one. You'd need the video to be a videocam and you'd have to hope that the missing chunk shows an hour of black as black night. Let's put this in perspective: Creating a bitcoin is essentially a matter of inverting a hash. Inverting a very short video snip is probably akin to making about 20 bitcoins, maybe more. So in your shoes I'd make the bitcoins, buy a fresh copy of the video and pocket the change. Nearly eight thousand dollars worth of change. Maybe I'd buy shares in a quantum computer company and make future exploits easier; it's fun doing the "impossible".

To those who say, "hashes are many to one, so you cannot tell what was hashed": That is true, but of all the many values that hash to that one value, some will be more plausible than others. If you invert the hash above you will have not the least doubt that you found the right input. Have fun! :-)

Finally, someone who considers plausibility of things being hashed =D — justhalf, Jan 26 '16 at 06:43
Bitcoin inverts **only a fraction** of (double) SHA256, for input that AIUI only nees 2 compression rounds. Currently total bitcoin mining is just over 2^60 hash/sec, so if the OPs case is SHA256 (it's now been edited out) and 32MB is about 2^19 rounds, the bitcoin capacity could find SOME video-like preimage in about 2^214 seconds or 300,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 years -- except that the universe will almost certainly end before you finish. Even then this has immeasuarbly small chance of being the DESIRED preimage. — dave_thompson_085, Aug 02 '16 at 09:30

score 3 · Answer 6 · answered Jul 24 '16 at 12:26

There is one possibility for this: Google it - literally.

If the file has been uploaded to any of a number of file-sharing sites already, they have likely posted a hash of it, and it may have been indexed.

For example, google '60CCE9E9C6557335B4F7B18D02CFE2B438A8B3E2'.

score 1 · Answer 7 · answered Jan 26 '16 at 04:02

A comment but it's too long:

As others have shown, this isn't possible. However, there is a related problem that certainly is reasonable:

Ok, you can't reconstruct that 200mb video that was split into 100 2mb files of which you have 99.

However, you can create another file that will be a hair over 2mb that will allow you to reconstruct any one missing file. Two such files will allow you to reconstruct any two missing files and so on. While the block size can't profitably be set higher than the file size (a 4mb repair file still only fixes one missing file) it can be set lower which can be of value if partial damage is a possibility. (The calc time goes up, the files get slightly larger but you have more ability to recover from damage.)

The standard program for a long time was: Quickpar but it hasn't been updated in ages. The more modern alternative that I'm aware of (but have not used much yet) is Multipar (Note: This site is in Japanese. The program is in good English, though.)

If I'm going to back up some data to a DVD I routinely create extra repair files just in case something happens--the extra space on the DVD is going to waste anyway, why not put some insurance there? Multipar even has modes specifically for this (although I have not yet tried them) where it will generate blocks to fill out a DVD-R or BD-R disk.

score 1 · Answer 8 · edited Jan 28 '16 at 10:12

It´ll basically take too long to achieve a satisfying result, addressing both: generating the missing video-part (according to computable criteria) and sorting the best ones out of them (that needs human intelligence or extremely high developed AI). Even if you finally have a nice video matching all criteria, you´ll never know if the original movie had the same contents. It might make no sense trying to "reconstruct" something that can be most variable - better and faster: use your own fantasy.

Certainly some "crossfiring" 10 Byte hash-values can´t represent/contain the information of 10 MB, so I think your gist is the following:

Even if you have lots of additional information for corrections inside the whole video-file: data-format, frames, the storyboard itself, voices of the actors and so on: there will be thousands of more or less different videos that will fit all known criteria. I´d even assume that a handful of single video-frames here and there could make any video leading to the same hashes.

This question is very alike: Is it possible for a (small) virus to add itself to a (big) file while keeping the file´s checksum the same value by padding a (not so big) amount of variable bytes? I guess it´s possible, though hard to calculate in time today. On the other hand, we know that many possible codes lead to the same hash, so computing time might be overestimated. Maybe it´s possible in seconds - only hackers will know.

Edit: Over night I got the inspiration for a nice additional comparison of your "lost-video-part-problem": For such cases (complete data recovery) there has already been invented the RAID-5 technology (Wiki see here: https://en.wikipedia.org/wiki/RAID ). One out of three or more harddrives could fail and all data can be reconstructed lossless. Certainly you have lots of data-overhead (redundancy for error-correction) stored over all drives to be able to do so.

Hashes/Checksums are good for detection of little (bits or few bytes) tampering/errors that occured somewhere inside a file. More advanced are CRCs with error-correction. At least we have redundacy-systems like RAID.

score 1 · Answer 9 · answered Feb 08 '16 at 03:02

The answer is NO, and it seems that you're mixing up two different things :

Checksums and Hashes are one-way Integrity checkers. The purpose of their usage in that matter is to make sure that the data was not corrupted, and nothing else
Recovery codes are the ones you're using if you need to recover your data by code provided. The most shining example is a Reed-Solomon code for recovering CD-ROM data. The purpose of their usage in this matter is to help you recover data corrupted/lost by some reason

They're seem to be similar from a first sight, but they're a VERY different things.

score 0 · Answer 10 · answered Jan 26 '16 at 21:41

It is effectively impossible, due to information theory. Effectively impossible, as in "heat death of the universe" becomes a legitimate limiting factor on your search.

You have a 2,000,000 byte (2MB) slice missing. A hash like SHA-1 has 20 bytes of information in it. By information theory, we should expect that there are 1,999,980 bytes which are still unknown. That means 2^(8*1,999,980) possible files to explore. That is a number so large that you start talking about heat death of the universe before every atom in the universe magically acting as a 2Ghz processor, working in tandem, could find it. And that doesn't include the challenge of actually figuring out which of the solutions is the right one. It's just the cost of eventually producing the right one.

Some have mentioned that you have additional information. For example, you have the SHA-1 of the entire file. Sadly, this is not very useful. Presuming you have this hash as well, you now have 1,999,960 bytes of information which are still unknown, and thus 2^(8*199,960) possible slices to consider. We're still in the heat death of the universe realm. We could add additional constraints, such as continuity with the existing video, but eventually we're going to run into limits as to how much we could possibly know about the slice without having enough information to just recreate it directly from the information we know.

The best chance you would have is to have the entire world all band together to solve your problem, and feed you every 2MB slice of data in the entire internet. It is highly likely that if you "lost" the data, someone else might have a copy of it. It's much easier to scan through the petabytes of data humanity has gathered than to can through the far larger number of possibilities 2MB of arbitrary data has to offer.

score 0 · Answer 11 · answered Feb 08 '16 at 02:38

0

Hashes are designed to be one way. Its easy to travel from left to right, but it is practically impossible to travel from right to left when talking about Hashing.

answered Feb 08 '16 at 02:38

abhinav singh

283
1
4

score 0 · Answer 12 · edited Jul 24 '16 at 11:38

Preface: a hash is normally utilised in verifying the integrity of a file or set of data.

Provided the checksum hash is inclusive of the data and name, then that could be a reference point for the container, which then could be implemented in searching through checksum pattern matching. Provided you knew a salt (which could include the date or time value for instance).

Although to cause a single collision at a rate of 1MH/s it could still take approximately 3 years to eliminate every absolute possibility for a result as little as 15 numbers. So understanding another reference e.g. where this file is on the storage medium would help to be more specific .e.g. sector or file id entry.

But it is credible to note that data transfer (particularly over networks) tends to commonly get in the way, with its own checksum for reference.

And in case anyone wants to argue, a salt is usually complimentary and cryptography shouldn't get mixed up with recovery, as when you encrypt with not just some pathetic cryptography standard, and you forget the key, then you'll generally be unable to recover your data.

Could I recover the content of file from its checksum/hash?

12 Answers12

So you're telling me there's a chance?

Where would my idea apply?

Linked

Related