Automatically 'brute force' a few bytes to recover a corrupt file

35

4

Does anyone out there know of a way to brute-force values at a particular offset in a file? It's 4 consecutive bytes that would need to be brute-forced. I know the correct SHA-1 of the uncorrupted file, so what I'd like to do is compare the complete file's SHA-1 each time the byte values change.

I know the exact 4 bytes that were changed, because the file was given to me by a data recovery expert as a recovery challenge. For those who are interested: the rar file has 4 bytes which were intentionally changed. I was told the offsets of the changed bytes and the original SHA-1. The person said it's IMPOSSIBLE to recover the exact file in the archive once the 4 bytes were changed, even though it's only a few bytes and I know exactly where the corruption is located, since the archive doesn't have a recovery record. I'm trying to see if there is a way for those particular 4 bytes to be filled in correctly so the file will decompress without error. The file size is around 5 MB.

Example:

I uploaded photos to show more clearly exactly what I'm looking to do. I believe someone with more rep can post them here for me.

Screenshot One

Screenshot Two

The example offset I'm focusing on is 0x78, where the first pic shows the value as CA. I want the script to take the value up by 1 so it becomes CB, as shown in the second pic. I want it to keep increasing the value by 1 and compare the whole file's SHA-1 each time, only making changes to those 4 bytes at the specified offset.

It will try CAC5C58A and compare the SHA-1. If it doesn't match, it will try CBC5C58A. Then once the first value reaches FF, it will go to 00C6C58A, and so on. Basically, I would like it to be able to go from 00000000 to FFFFFFFF, but also to have the option to choose where to start and end. I know it could take some time, but I would still like to try it. Keep in mind I know the exact offset of the corrupt bytes; I just need the correct values.

If you search on Google for "How to fix a corrupted file by brute force", there's a person who wrote a Linux program. However, it only works against the files included with the program. I'm looking for a way to use the same process on my file.

Sbt19

Posted 2018-04-19T10:14:04.057

Reputation: 341

3

Welcome to Super User! I have edited your question to remove the request for a program, which would be off-topic. Can you edit your question to include (some of) the examples you saw? It's good that you have done research, but showing us exactly what research that is would be helpful :)

– bertieb – 2018-04-19T10:21:51.117

Thanks for head up Bertieb! I added some more details. – Sbt19 – 2018-04-19T10:48:58.570

20

Could I ask how you ended up with this file, and how you can be sure that those are the only 4 corrupt bytes? – Edoardo – 2018-04-19T12:38:22.867

1

Do you know the file format? If you do, you might be able to work out the correct values or limit the ranges, rather than trying to brute force them. In general, however, I'd suggest any corrupted file should be dumped for security reasons. – StephenG – 2018-04-19T14:59:40.867

11

@eddyce I'm really interested in the second part of your question - why those 4 bytes? – Craig Otis – 2018-04-19T15:26:55.160

1

I guess the blog post you are alluding to is https://conorpp.com/how-to-fix-a-corrupted-file-by-brute-force

– tripleee – 2018-04-19T15:40:03.050

2

Out of curiosity, how did the file get corrupted? And how do you know it was those four bytes? – JohnEye – 2018-04-19T16:49:17.650

@CraigOtis I never asked why those 4 bytes; "how can you be sure those are the only 4 corrupt ones" is the thing – Edoardo – 2018-04-19T19:27:43.507

The program "ghex" is useful for things like this. – Lee Daniel Crocker – 2018-04-19T19:28:04.020

@LeeDanielCrocker Could you elaborate on how it's useful? Are you going to manually save 4 billion files in ghex, run sha on them and see if it matches? A bit tedious. – pipe – 2018-04-19T21:46:37.577

The question was about patching a few bytes in a single file. – Lee Daniel Crocker – 2018-04-19T22:55:58.137

1

@LeeDanielCrocker No, the question is about patching it until the checksum matches the expected value, and just as pipe already asked, we now wonder whether you didn't read the question properly, or whether ghex can actually do this. – tripleee – 2018-04-20T03:26:30.327

1

I added more details about the file in question. It's only a data recovery test file. – Sbt19 – 2018-04-20T05:12:00.813

@eddyce: It's pretty easy to get into this situation if you accidentally save an edit in your hex editor and then it discards its undo buffer when it saves. (I've used ones that do that.) – user541686 – 2018-04-20T07:37:31.187

1

Note that, due to the pigeonhole principle, there might be more than one sequence of bytes that makes the hash match. One of those sequences might be more "valid" for whatever type of file this is. – Roger Lipscombe – 2018-04-20T08:34:13.467

It looks like you are searching for a hex-editor. https://softwarerecs.stackexchange.com/ is the correct place to ask

– Mawg says reinstate Monica – 2018-04-20T09:58:58.340

@mehrdad ok, it's some kind of challenge then :) advice: make sure you are checking the file against the SHA-1 that was given to you and not only by unpacking the RAR archive, because - maybe - the changed 4 bytes are part of the RAR CRC records... – Edoardo – 2018-04-20T15:00:00.627

You might time how long it takes to sha1 the current file and multiply that by 2^32 for the worst case search time. If each sha1 evaluation takes 0.01 seconds you’re looking at worst case 1.36 years unless you parallelize the search. On average, half of that. – rrauenza – 2018-04-20T16:51:42.720
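The worst-case figure in the comment above is easy to reproduce with simple arithmetic (0.01 s per hash is the comment's assumed figure, not a measurement):

```python
per_hash = 0.01                       # assumed seconds per SHA-1 of the 5 MB file
worst_case = per_hash * 2**32         # every possible 4-byte value
years = worst_case / (365 * 24 * 3600)
print(f"{years:.2f} years")           # prints "1.36 years"; on average, half that
```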

Let’s Enhance! How we found @rogerkver’s $1,000 wallet obfuscated private key – Vlastimil Ovčáčík – 2018-04-25T09:33:05.893

Answers

27

Here's a small Python program which does what you seem to be describing.

#!/usr/bin/env python3
from hashlib import sha1

with open('binaryfile', 'rb') as f:
    binary = f.read()

base = 0x0078
# Add more candidate sequences to this list, or generate them (see below)
for seq in [[0xCA, 0xC5, 0xC5, 0x8A], [0xCB, 0xC5, 0xC5, 0x8A]]:
    copy = binary[0:base]
    copy += bytes(seq)
    copy += binary[base + len(seq):]
    if sha1(copy).hexdigest() == '9968733ce3ff0893bbb0a19e75faaf2fb0000e19':
        print('success with bytes {0}'.format(seq))
        break
else:
    print('no success')

Only briefly tested; please ping me if you find typos.

The base specifies where to try to apply the four bytes, and the long string '996873...' is the hex representation of the expected SHA-1. The line for seq in ... defines the bytes to try; and of course, replace 'binaryfile' with the path to the file you want to attempt to salvage.

You can replace the literal list [[0xCA, 0xC5,...]] with something to actually loop over all possible values but it's basically just a placeholder for something more useful because I'm not really sure what exactly you want there.

Something like for seq in itertools.product(range(256), repeat=4): will loop over all possible values from 0 to 2^32-1. (You will need to add import itertools near the top then.) Or perhaps you could simply add an offset; update the script to replace the current for seq in with the following (where again the import needs to go before the main program):

import struct

for n in range(2**32):
    val = (n + 0x8AC5C5CA) % 2**32  # notice reverse byte order
    seq = list(reversed(struct.pack(">I", val)))
    copy = ...

I reversed the order of the bytes so that it naturally increments from 0x8AC5C5CA to 0x8AC5C5CB, but then the next increment gives 0x8AC5C5CC, etc. The struct magic converts the value to a sequence of bytes (I had to look it up from https://stackoverflow.com/a/26920983/874188). This will start at 0x8AC5C5CA, go up to 0xFFFFFFFF, then wrap around to 0x00000000 and climb back up to 0x8AC5C5C9.
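The reversal can be checked in isolation. Packing the value big-endian and reversing the bytes gives the on-disk order CA C5 C5 8A from the question's screenshots, and it is the same as packing little-endian directly:

```python
import struct

val = 0x8AC5C5CA
seq = list(reversed(struct.pack(">I", val)))  # reverse the big-endian bytes
print([hex(b) for b in seq])                  # ['0xca', '0xc5', '0xc5', '0x8a']

# Equivalent, and a little shorter: pack little-endian directly
assert bytes(seq) == struct.pack("<I", val)
```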

If you have multiple candidate ranges you would like to examine in a particular order, maybe something like

for rge in [(0x8AC5C5CA, 0x8AFFFFFF), (0x00C6C58A, 0x00FFFFFF),
        (0x00000000, 0x00C6C589), (0x01000000, 0x8AC5C5C9)]:
    for val in range(*rge):
        seq=list(reversed(struct.pack(">I", val)))
        copy = ...

but then you'll need to make sure yourself that the (start, end) pairs in rge cover all of the space between 0x00000000 and 0xFFFFFFFF if you really want to examine all of it. (And again, notice that the range increments the last byte and that seq applies the bytes of the value in reverse, in accordance with your stated requirements.)
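Since range(*rge) is half-open (the end value itself is never tried), a quick sanity check helps. This hypothetical helper (not part of the script above) verifies that a list of (start, end) pairs tiles the whole 32-bit space with no gaps or overlaps:

```python
def covers_space(ranges, size=2**32):
    """True if the half-open (start, end) pairs tile [0, size) exactly."""
    pieces = sorted(ranges)
    if pieces[0][0] != 0 or pieces[-1][1] != size:
        return False
    # each piece must end exactly where the next one starts
    return all(pieces[i][1] == pieces[i + 1][0] for i in range(len(pieces) - 1))

print(covers_space([(0x00000000, 0x8AC5C5CA), (0x8AC5C5CA, 2**32)]))  # True
print(covers_space([(0x00000000, 0x00C6C589), (0x00C6C58A, 2**32)]))  # False: gap
```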

If you wanted to use two different base addresses, you quickly run up against the limits of what's feasible to do in your lifetime with brute force; but you could, for example, split the 4-byte number into two 2-byte parts and apply those at different offsets.

base1 = 0x1234
base2 = 0x2345

for val in range(2**32):
    seq = struct.pack(">I", val)
    copy = binary[0:base1]
    copy += seq[0:2]
    copy += binary[base1 + 2:base2]
    copy += seq[2:4]
    copy += binary[base2 + 2:]

tripleee

Posted 2018-04-19T10:14:04.057

Reputation: 2 480

Comments are not for extended discussion; this conversation has been moved to chat.

– Journeyman Geek – 2018-04-20T12:25:24.633

4

No, no, no and again NO!

Sometimes the answer you get is not the one you expect.

Some questions for you:

  • Is it possible that an expert doesn't know that it is possible to brute force a string of four bytes and iteratively try the SHA-1 until it matches? No
  • Is it possible he forgot it? No
  • Is it possible that you cannot do it on a rar file? No
  • Is the other answer wrong? Absolutely NO

So what? ... Time.

The point is that you have to change so few bytes... only 4!

What does that mean? 256⁴, that is 256x256x256x256 possibilities: a really, really big number.
If your computer were able to process 1 operation per second (substitution in the file + SHA-1)...
you would have to wait more than 136 years, or if you prefer, more than 49710 days.

You're lucky enough: a 5 MB pre-cached file (already loaded in RAM and in the cache) takes only about 0.03 seconds per SHA-1 (min 0.025 s) on an old computer. That shrinks your expected time down to 1242-1492 days (something more than 3 years).

It is true, BTW, that statistically you should get a positive answer in half of that time. Nonetheless, you would have to try all the possibilities to be sure that there is only one substitution that gives you the same SHA-1 checksum...

Now that IMPOSSIBLE sounds more like "not possible in a WORTHWHILE amount of time".


How to proceed

A more proper answer to your technical question: when you speak about brute force, it does not have to be blind brute force.

  • As stated in a comment on the other answer, you do not need to recalculate the SHA-1 of the part of the file before the corruption. You do it the first time and save that work on each successive iteration (maybe a factor of 2; it depends on the position).

  • Something that can change the worthwhileness of the effort is to write parallel code that runs on the GPU. If you have a good graphics card, you may have around 1000 cores that can compute for you in parallel (even more, but they run at a lower frequency than the CPU; still, they are a lot). If you are able to decrease the time from 1400 days down to 1.4, maybe you can even do it.

  • A different approach can lead you to a faster solution.
    You said it is a rar file. The rar file structure is divided into blocks. If you take account of that, you can see where the corruption falls: in the data, in the headers, or in both. Then you can act accordingly. For the sake of simplicity, let's suppose it is in the data:
    brute force your offset, and for each value for which that block's CRC checks out, verify whether the SHA-1 of the whole file matches too. Again, this can be parallel code.
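The first bullet's prefix-hashing idea can be sketched with hashlib, which lets you hash the unchanged part of the file once and then copy() the hash state for every candidate. This is only a sketch; the names binary, base, and search are assumptions for illustration, with the four bytes applied in the little-endian order used by the other answer:

```python
from hashlib import sha1

def search(binary, base, target_hex):
    """Brute-force 4 bytes at offset `base`, reusing the prefix hash state."""
    prefix = sha1(binary[:base])   # hash the unchanged prefix exactly once
    tail = binary[base + 4:]
    for val in range(2**32):
        h = prefix.copy()          # cheap copy of the hash midstate
        h.update(val.to_bytes(4, "little"))  # same reversed byte order as above
        h.update(tail)
        if h.hexdigest() == target_hex:
            return val.to_bytes(4, "little")
    return None
```

This avoids rehashing the bytes before the corruption on every iteration, which is where the "maybe a factor 2" saving in the first bullet comes from.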

Final note

If it were 6 bytes instead of 4, you would be out of the game with present technology.

Hastur

Posted 2018-04-19T10:14:04.057

Reputation: 15 043

Great answer - one wouldn’t necessarily need to exhaust the entire space though because the rar itself in this example would not uncompress due to internal checks even if the sha1 worked with a duplicate hash. Hitting 4 bytes that solved the sha1 falsely AND an internal crc falsely would be very very unlikely. – rrauenza – 2018-04-20T17:00:54.970

@rrauenza Thanks. BTW, not only that (the double check). Indeed, the block should be shorter than the whole part from the corrupted bytes to the end of the file, and the CRC should be cheaper to compute than the SHA-1 algorithm... – Hastur – 2018-04-20T17:45:52.093

@rrauenza Do you know how I would go about getting the actual parallel code to run on the GPU? I have a good GPU. Thanks. – Sbt19 – 2018-04-21T00:38:12.687

No I do not. You could use multiple cpus by partitioning the search space though. – rrauenza – 2018-04-21T00:43:44.333

@Sbt19 Whatever they told you about it, Google is not so scary to use ;-). Search for (if nvidia) Cuda, brute force, sha1 and you will find a lot of hints, e.g. source code. BTW, keep your attention high, because browsing from that Google path, oh my boy, may lead you to one of the dark sides of the net... :-). (Not on github... on other sites that you can meet with this kind of research.) PS> There are a lot of scientific papers on related topics, e.g. this one...

– Hastur – 2018-04-22T06:47:00.990

@Hastur Heh, I'm very good with using Google:D I do know about the all the sides of the net. However, I'm not skilled in the programming area. I do know about SHA1 brute forcing. I haven't been able to find right code which can brute force a file through the GPU. It would be much faster. P.S. Thanks for your elaborate answer. I'm sure it will help others. – Sbt19 – 2018-04-22T08:59:07.683

256⁴ is about 4.3 billion; that's not really a "really really big number". It's similar to, for example, the clock speeds CPUs run at. PS: When I run the numbers on this several year old computer, I get ~63 days, without anything complicated like using a GPU. – derobert – 2018-04-26T11:40:15.100

Also, if you have a CRC to check against, it'd be silly to brute force that — you can do that analytically. Then only bother with SHA-1 on the ones where CRC passes (this should divide your search space by approx 2¹⁶, bringing the search time down in to the minutes range.) – derobert – 2018-04-26T11:42:35.847

@derobert 1) I have problems counting past 21 (using fingers and nose... :-)); for me it's really really big when you think of the complete iteration to do. 2) The comparison with the CPU clock holds if you have a processor that in one cycle can do the SHA-1 of the whole 5MB file (so with 5MB registers...). 3) I totally agree that it is easier to use the CRC information (as I proposed), but the OP asked for SHA-1. 4) I suppose that the WORTHWHILE concept is the central point of the question... – Hastur – 2018-04-26T12:00:06.050

@Hastur On my not at all new machine, SHA-1 is ~490MB/s per core (at least per OpenSSL). Across all 4 cores, that's almost 2GB/sec. I think whether its worthwhile depends entirely on how much the data is worth. It's an "embarrassingly parallelizable" problem, so you could easily (for example) buy all that computing power from Amazon EC2 (etc.). If a c5large is at least as fast as my not-so-new machine, then at current spot prices that's under $50 to have it done in an hour. – derobert – 2018-04-26T12:13:41.050

I cannot agree more with you about the "embarrassingly parallelizable" -ness of the problem. IMHO the best approach without the rar info is to update the sha1 variable with the 1st chunk of the file (linear), generate the permutations of the 4 bytes, increase the sha1 variables, order them unique, continue to sha1 with the second chunk. With the rar info it depends, because if it falls on data and header you have to rebuild the latter(but not to count in the permutations at least 1 byte if not all)... then if the rar block (in which the corruption is) search for the correct CRC then sha1. – Hastur – 2018-04-26T12:25:34.723