
What exactly happens after I provide the hash from the pdf file to John for cracking the password?

Does John extract just the password hash from the file and work on it, or is there something else?

proslaniec
rats20

2 Answers


For a PDF, the hash you generally want to crack is the user hash, which is derived from the user's password. When a password is entered for an encrypted PDF, the reader does two things:

  1. It will derive a symmetric key from the user password. This is the key that the document is encrypted with.
  2. It will derive a hash from the password and compare it to the user hash in the document's metadata to check whether the password is correct.

PDF is strange in that it derives the symmetric key before it computes the hash. In fact, the symmetric key is used in the computation of the hash. The process of producing the hash is as follows:

  1. Derive the symmetric key from the user password.
  2. Concatenate the following values and pass the result to the MD5 hash function:
    • A 32 byte padding string (defined in the spec)
    • The 16 byte document ID (contained in the document's metadata)
  3. Encrypt the output of the MD5 call via RC4 with the symmetric key from step 1.
  4. Do the following for i = 1 to 19:
    • Create a new RC4 key by XORing every byte of the symmetric key from step 1 with i.
    • Take the output of the previous RC4 call and encrypt it under the new RC4 key.
  5. Append 16 bytes of arbitrary padding to the output of the last RC4 call. This is the user hash. (It's not clear why the user hash is padded at all, since the comparison to validate the user password throws out the padding bytes of the user hash and just looks at the first 16 bytes.)
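
Steps 2 to 5 above can be sketched in Python roughly as follows. This is only an illustration, not John the Ripper's code: the names `rc4`, `user_hash`, and `matches` are made up for this example, the symmetric key from step 1 is taken as an input because its derivation is not covered here, zero bytes stand in for the arbitrary padding, and the padding constant should be checked against the spec before relying on it.

```python
import hashlib

# 32-byte padding string from the PDF spec (assumption: verify against the spec).
PAD = bytes.fromhex(
    "28bf4e5e4e758a4164004e56fffa0108"
    "2e2e00b6d0683e802f0ca9fe6453697a"
)


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4: key scheduling, then XOR the data with the keystream."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def user_hash(sym_key: bytes, doc_id: bytes) -> bytes:
    """Compute the user hash from an already-derived symmetric key."""
    # Step 2: MD5 over the padding string followed by the document ID.
    digest = hashlib.md5(PAD + doc_id).digest()
    # Step 3: encrypt the digest with RC4 under the symmetric key.
    out = rc4(sym_key, digest)
    # Step 4: 19 more RC4 passes, each keyed with (key XOR i).
    for i in range(1, 20):
        out = rc4(bytes(b ^ i for b in sym_key), out)
    # Step 5: append 16 bytes of arbitrary padding (zeros in this sketch).
    return out + bytes(16)


def matches(candidate_key: bytes, doc_id: bytes, stored_hash: bytes) -> bool:
    """Validate a candidate: only the first 16 bytes are compared."""
    return user_hash(candidate_key, doc_id)[:16] == stored_hash[:16]
```

A cracker would call something like `matches` once per candidate password, after deriving that candidate's symmetric key.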

I assume that John the Ripper feeds candidate passwords (generated by whatever rules you give it) into the above algorithm until it computes a user hash that matches the one in the document's metadata (i.e. the hash that you provided). Since the hash derivation uses only MD5 and RC4, and not many rounds of either, a lot of passwords can be tried in a short amount of time, so PDFs protected this way are quite susceptible to brute-force and dictionary attacks. In fact, the whole algorithm is rather bizarre and doesn't instill much confidence in the security of password-protected PDFs.

puzzlepalace

John works on different kinds of hashes.

You can extract the hash from the PDF file using a utility such as pdf2john and then start cracking with john as usual.

Relevant - How can I extract the hash inside an encrypted PDF file?

Edit: How actual hash cracking works

Message digesting is the process of making hashes. You feed a message into a digest function and get a hash out. Functions of this kind are called one-way functions. This means that once you digest a message, you can't get it back from the hash by reversing the function. Information about the original message is irrecoverably lost in the process.

What can you do then? You can try to reproduce the message by hashing many candidate words and checking each result against the hash of the original message. If a generated hash is equal to the hash of the original message, you know that the word you used to generate it is equal to the original message.
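
As a minimal sketch of that guess-and-compare loop, the following uses plain MD5 from Python's `hashlib` as a stand-in digest. The function name `crack` and the tiny word list are invented for illustration, and real formats (PDF included) use more involved derivations than a single MD5 call.

```python
import hashlib


def crack(target_hex, candidates):
    """Return the first candidate whose MD5 digest equals target_hex, else None."""
    for word in candidates:
        # Hash the guess and compare it with the hash we are trying to match.
        if hashlib.md5(word.encode("utf-8")).hexdigest() == target_hex:
            return word
    return None


# Usage: pretend we only know the hash of the original message.
target = hashlib.md5(b"letmein").hexdigest()
print(crack(target, ["password", "123456", "letmein"]))  # prints: letmein
```

The function never reverses the hash. It only recognises the right word when it happens to be among the guesses.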

This is essentially how password cracking in John the Ripper works.

Words used for cracking may be generated incrementally (brute force) or taken from a dictionary.
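
To make the brute-force case concrete, here is a naive candidate generator that enumerates every string over an alphabet, shortest first. The name `incremental` is chosen for this example only, and this is not how John itself orders or generates its candidates.

```python
from itertools import product


def incremental(alphabet, max_len):
    """Yield every string over `alphabet` with length 1..max_len, shortest first."""
    for length in range(1, max_len + 1):
        for combo in product(alphabet, repeat=length):
            yield "".join(combo)


# Usage: every candidate of length 1 or 2 over the alphabet "ab".
print(list(incremental("ab", 2)))  # prints: ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The number of candidates grows exponentially with the maximum length, which is why dictionary words are usually tried first.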

You can read about John's cracking modes here.

You can read about password hashes in detail here: Why are hash functions one way? If I know the algorithm, why can't I calculate the input from it?

proslaniec