What exactly happens after I provide the hash from the pdf file to John for cracking the password?
Does John extract just the password hash from the file and work on it, or is there something else?
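Generally, the hash you want to break for a PDF is the user hash, which is derived from the user's password. When a password is entered for an encrypted PDF, the reader does two things: it derives a symmetric key from the password, and it computes the user hash to check that password against the value stored in the document.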
PDF is strange in that it derives the symmetric key before it computes the hash. In fact, the symmetric key is used in the computation of the hash. The process of producing the hash is as follows:
I would assume that John the Ripper feeds candidate passwords (generated by whatever rules you give it) into the above algorithm until it computes a user hash that matches the one in the document metadata (i.e. the hash that you provided). Since the hash derivation uses only MD5 and RC4 (and not many rounds of either), it is easy to try a lot of passwords in a short amount of time, so PDF is quite susceptible to brute-force and dictionary attacks. In fact, the whole algorithm is rather bizarre and doesn't instill much confidence in the security of password-protected PDFs.
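To make that loop concrete, here is a rough Python sketch of the user hash check as I understand it from the standard security handler in the PDF specification (revisions 2 and 3 only). It is not taken from John's source, and the function names (rc4, derive_key, user_hash, crack) are my own. The /O, /P and /ID values come from the document's encryption dictionary and trailer. Encoding candidates as Latin-1 is a simplifying assumption.

```python
import hashlib
import struct

# 32-byte padding string defined by the PDF standard security handler.
PAD = bytes.fromhex(
    "28bf4e5e4e758a4164004e56fffa0108"
    "2e2e00b6d0683e802f0ca9fe6453697a"
)


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4: key scheduling followed by keystream XOR."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def derive_key(password, o_entry, p_flags, doc_id, rev, key_len):
    """Derive the symmetric key from a candidate user password."""
    padded = (password + PAD)[:32]  # pad or truncate to 32 bytes
    p_bytes = struct.pack("<I", p_flags & 0xFFFFFFFF)  # /P, little-endian
    digest = hashlib.md5(padded + o_entry + p_bytes + doc_id).digest()
    if rev >= 3:
        for _ in range(50):  # extra MD5 rounds in revision 3
            digest = hashlib.md5(digest[:key_len]).digest()
    return digest[:key_len]


def user_hash(password, o_entry, p_flags, doc_id, rev=3, key_len=16):
    """Compute the value to compare with /U (key_len is 5 for revision 2)."""
    key = derive_key(password, o_entry, p_flags, doc_id, rev, key_len)
    if rev == 2:
        return rc4(key, PAD)
    out = rc4(key, hashlib.md5(PAD + doc_id).digest())
    for i in range(1, 20):  # 19 more RC4 passes, key XORed with the counter
        out = rc4(bytes(b ^ i for b in key), out)
    return out  # 16 bytes; the stored /U carries 16 arbitrary bytes after it


def crack(candidates, stored_u, o_entry, p_flags, doc_id, rev=3, key_len=16):
    """Try each candidate until the computed hash matches the stored one."""
    for word in candidates:
        computed = user_hash(word.encode("latin-1"), o_entry, p_flags,
                             doc_id, rev, key_len)
        if computed == stored_u[:len(computed)]:
            return word
    return None
```

Each guess costs only a few dozen MD5 and RC4 calls, which is why the search is so cheap.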
John works on different kinds of hashes.
You can extract the hash from the PDF file using a utility like pdf2john and then start cracking with john as usual.
Relevant - How can I extract the hash inside an encrypted PDF file?
Edit: How actual hash cracking works
Message digesting is the process of making hashes. You put a message into a digest function and you get a hash out. Functions of this kind are called one-way functions. This means that once you digest a message, you can't get it back from the hash by reversing the function. Information about the original message is irrecoverably lost in the process.
What can you do then? You can try to reproduce the message by hashing many candidate words and checking each result against the hash of the original message. If a generated hash is equal to the hash of the original message, you know that the word you used to generate it is the original message (barring an unlikely collision).
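As a minimal illustration of that guess-and-compare idea, here is a short Python sketch. It uses plain SHA-256 from the standard library as a stand-in digest. The function name crack_hash and the sample words are made up for the example, and real password formats usually add salts and iterations on top of this.

```python
import hashlib


def crack_hash(target_hex, candidates):
    """Hash each candidate and return the one whose digest matches the target."""
    for word in candidates:
        if hashlib.sha256(word.encode("utf-8")).hexdigest() == target_hex:
            return word
    return None  # no candidate produced the target hash


# The attacker only knows this digest, not the message behind it.
target = hashlib.sha256(b"hunter2").hexdigest()
print(crack_hash(target, ["password", "letmein", "hunter2"]))
```

The function never inverts the hash. It only runs it forwards on guesses.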
This is essentially how password cracking in John the Ripper works.
The words used for cracking may be generated incrementally (brute force) or taken from a dictionary.
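The two ways of producing candidates can be sketched in Python like this. The names incremental and dictionary are just illustrative, and this naive version walks the alphabet in plain order, whereas John's own incremental mode orders its guesses more cleverly.

```python
import itertools


def incremental(alphabet, max_len):
    """Brute force: yield every string over the alphabet, shortest first."""
    for length in range(1, max_len + 1):
        for combo in itertools.product(alphabet, repeat=length):
            yield "".join(combo)


def dictionary(lines):
    """Dictionary: yield each non-empty word from an iterable of lines."""
    for line in lines:
        word = line.strip()
        if word:
            yield word


print(list(incremental("ab", 2)))
print(list(dictionary(["admin\n", "\n", "letmein\n"])))
```

Brute force is exhaustive but grows exponentially with password length, while a dictionary is fast but only finds passwords that are on the list.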
You can read about john's cracking modes here.
You can read about password hashes in detail here: Why are hash functions one way? If I know the algorithm, why can't I calculate the input from it?