What security scheme is used by PDF password encryption, and why is it so weak?

Question

Many PDFs are distributed as encrypted PDFs to lock out some of their functionality (e.g. printing, writing, copying). However, PDF cracking software is available online, and it usually cracks the PDF passwords in less than 1 second.

This would not make sense if Adobe had implemented proper encryption techniques in their document security. It looks as if there is some major implementation error in their PDF encryption scheme that allows documents to be unlocked with a trivial amount of work.

What is the security scheme used in such locked PDF files, and why do these PDF password removers take so little time to defeat it?

(From what you describe, it sounds like getting around such protection would be much easier than cracking the password.) — , Aug 03 '15 at 12:08
You have to separate two different cases: 1. A truly encrypted PDF, which you cannot open at all until you provide the password. This can be encrypted well and is not so easy to crack. **AND** 2. A PDF which you can open and read without a password but cannot print or copy text from. Here we have a big problem: if you can read the document, it cannot be meaningfully encrypted. Your PC has to draw all the characters on the screen, which is the same as printing. So in this case the password is just an artificial hurdle, not real encryption or a real barrier. — Falco, Aug 03 '15 at 14:32
@Falco Even with a closed spec, and without breaking into the drawing routine, case 2 would be easy to defeat: Print Screen plus OCR, for example, is within the capabilities of most users. This is a typical example of making things harder for many legitimate users while barely slowing down the intended nefarious users. — Chris H, Aug 03 '15 at 15:31
@ChrisH Exactly! Reading a PDF but requiring a password for printing is functionality which can never really be secured in the way PDFs work today (there could be ways, such as streaming the PDF content via HDCP to your screen, but let's not go there...). So in the end it is just obfuscation and cannot be made "hard". Poor users who rely on this feature -_- — Falco, Aug 03 '15 at 15:35
PDF files which have been "protected for printing" can be converted to "unprotected" documents by using e.g. Ghostscript to make a new PDF out of the rendered document. This is actually very easy to do (a one-line shell command) and has nothing to do with encryption. Ghostscript knows how to render and write PDFs, so it has no difficulty rendering a protected document to something unprotected. — Alexandre C., Aug 03 '15 at 19:12
Disallowing printing and disallowing saving are essentially single bits in the document header. (It's a little more complicated than that, of course, but that's the gist.) What rather a lot of useful software does is ignore those bits and let you print/save anyway. This is not an encryption function/operation. — Eric Towers, Aug 04 '15 at 02:17
@Falco Note that HDCP is security by obscurity, so if you want a completely secure way to allow reading but prevent printing, that's still not it. Also consider that someone could take a picture of their screen. — user253751, Aug 04 '15 at 12:31
See also http://superuser.com/questions/56132/how-good-is-pdf-password-protection — caw, Dec 03 '16 at 23:19

score 53 · Accepted Answer · edited Aug 04 '15 at 15:08

There are two types of PDF protection: password-based encryption and user-interface restrictions. You are describing the second type, namely the missing permission to copy and paste, to print, and so on. If user-interface restrictions are placed on a PDF file, the viewer still needs to decrypt the contents to display them on your screen. So you are not in a "password-based encryption" scenario, where you are missing a key to decrypt the document, but in a "DRM" scenario, where you trust that the applications able to decrypt the file (based on static knowledge like master keys) do only the things the author wants them to do.

Nothing prevents computer experts from reverse engineering how the legitimate application decrypts the data (no password needed) and performing the decryption themselves. Once the document is decrypted, rights may be "adjusted", e.g. to include printing permission, or the decrypting application can do things itself (like copying all bitmap images).

Adobe tries to prevent "rogue applications" that allow you to circumvent the usage restrictions by their license on the PDF specification: They revoke the license to use the (claimed) intellectual property in that specification for applications that do not obey the usage restrictions. AFAIK some open source tools have or had a build switch for whether the usage restrictions should be obeyed or not. This makes a perfect starting point for people selling "PDF deprotector" software.

In the case described above, the "user password" is the empty string. PDF readers are required to try an empty user password when a protected PDF file is opened. Only if that fails the password validity check is the user asked for a password. begueradj describes the key derivation in his answer, and as you can see, the "DRM permissions" (the /P entry) enter the key derivation, so if you just "fix the permissions" in a protected PDF file, a conformant reader will derive the wrong key and fail to open the document. On the other hand, if a PDF file is completely protected by a password (even against opening), the user password is no longer empty, and this type of PDF protection is reasonably secure.
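
To illustrate why patching /P alone does not work, here is a minimal Python sketch of the revision 2 key derivation quoted in begueradj's answer. The function name `file_key`, the all-zero /O and /ID values, and the edited /P value are all placeholders made up for this illustration, not data from a real file.

```python
import hashlib

# Hardcoded 32-byte padding string from the quoted key derivation.
PADDING = bytes.fromhex(
    "28 BF 4E 5E 4E 75 8A 41 64 00 4E 56 FF FA 01 08 "
    "2E 2E 00 B6 D0 68 3E 80 2F 0C A9 FE 64 53 69 7A"
)


def file_key(user_password: bytes, o_entry: bytes, p_value: int, file_id: bytes) -> bytes:
    """Hypothetical helper: derive the 40-bit file key (revision 2 scheme)."""
    padded = (user_password + PADDING)[:32]
    p_bytes = (p_value & 0xFFFFFFFF).to_bytes(4, "little")  # /P, LSB first
    return hashlib.md5(padded + o_entry + p_bytes + file_id).digest()[:5]


# Placeholder /O and /ID values (assumptions, not taken from a real document).
o_entry = bytes(32)
file_id = bytes(16)

# Empty user password, as in the "restrictions only" case.
original_key = file_key(b"", o_entry, 65472, file_id)
# Same file with a hypothetically edited /P entry.
patched_key = file_key(b"", o_entry, 0xFFFFFFFC, file_id)

# The keys differ, so a reader that trusts the edited /P decrypts with the wrong key.
print(original_key.hex(), patched_key.hex())
```

A tool that wants to lift the restrictions therefore has to derive the key from the original /P, decrypt, and then write the document out again, rather than just flipping bits in place.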

The KDE pdf reader used to have a checkbox "obey DRM limitations" in its settings, which if nothing else is probably the best-named setting I've ever come across. — , Aug 03 '15 at 15:53
Okular still has this option (Settings->Configure Okular->General->"Obey DRM limitations") — and I think it is not on by default :) — dom0, Dec 28 '17 at 16:09

score 46 · Answer 2 · 2015-08-05T08:23:14.437

Adobe's PDF lock functionality follows the principle of security through obscurity. Third-party software can unlock a PDF file because an encrypted file necessarily contains the information needed to decrypt it.

The encryption key of a PDF file is generated as follows:

    1. Pad the user password out to 32 bytes, using a hardcoded
       32-byte string:
           28 BF 4E 5E 4E 75 8A 41 64 00 4E 56 FF FA 01 08
           2E 2E 00 B6 D0 68 3E 80 2F 0C A9 FE 64 53 69 7A
       If the user password is null, just use the entire padding
       string.  (I.e., concatenate the user password and the padding
       string and take the first 32 bytes.)

    2. Append the hashed owner password (the /O entry).

    3. Append the permissions (the /P entry), treated as a four-byte
       integer, LSB first.

    4. Append the file identifier (the /ID entry from the trailer
       dictionary).  This is an arbitrary string of bytes; Adobe
       recommends that it be generated by MD5 hashing various pieces
       of information about the document.

    5. MD5 hash this string; the first 5 bytes of output are the
       encryption key.  (This is a 40-bit key, presumably to meet US
       export regulations.)
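
The five steps above can be sketched in Python roughly as follows. This is only an illustration of the listed steps for the old 40-bit scheme; the function and parameter names are made up here, and the sample /O and /ID values in the usage lines are placeholders.

```python
import hashlib

# Step 1: the hardcoded 32-byte padding string.
PADDING = bytes.fromhex(
    "28 BF 4E 5E 4E 75 8A 41 64 00 4E 56 FF FA 01 08 "
    "2E 2E 00 B6 D0 68 3E 80 2F 0C A9 FE 64 53 69 7A"
)


def derive_file_key(user_password: bytes, o_entry: bytes,
                    p_value: int, file_id: bytes) -> bytes:
    """Derive the 5-byte (40-bit) file key from the steps listed above."""
    # Step 1: concatenate password and padding, keep the first 32 bytes.
    data = (user_password + PADDING)[:32]
    # Step 2: append the hashed owner password (/O entry).
    data += o_entry
    # Step 3: append /P as a four-byte integer, LSB first.
    data += (p_value & 0xFFFFFFFF).to_bytes(4, "little")
    # Step 4: append the file identifier (/ID entry from the trailer).
    data += file_id
    # Step 5: MD5 the whole string and keep the first 5 bytes.
    return hashlib.md5(data).digest()[:5]


# Usage with placeholder /O and /ID values (not from a real file).
key = derive_file_key(b"", bytes(32), 65472, bytes(16))
print(len(key) * 8, "bit key:", key.hex())
```

Note that with an empty user password every input comes from the file itself, which is why no guessing is needed in that case.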

This algorithm takes the user's password and several other pieces of data as input. Those come from the file's trailer dictionary and encryption dictionary, which look like this:

        /Size 95         % number of objects in the file
        /Root 93 0 R     % the page tree is object ID (93,0)
        /Encrypt 94 0 R  % the encryption dict is object ID (94,0)
        /ID [<1cf5...>]  % an arbitrary file identifier    

        /Filter /Standard   % use the standard security handler
        /V 1                % algorithm 1
        /R 2                % revision 2
        /U (xxx...xxx)      % hashed user password (32 bytes)
        /O (xxx...xxx)      % hashed owner password (32 bytes)
        /P 65472            % flags specifying the allowed operations

Software then decrypts each stream or string with this algorithm:

    1. Take the 5-byte file key (from above).

    2. Append the 3 low-order bytes (LSB first) of the object number
       for the stream/string object being decrypted.

    3. Append the 2 low-order bytes (LSB first) of the generation
       number.

    4. MD5 hash that 10-byte string.

    5. Use the first 10 bytes of the output as an RC4 key to decrypt
       the stream or string.  (This apparently still meets the US
       export regulations because it's a 40-bit key with an additional
       40-bit "salt".)
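
As a rough sketch, the per-object decryption steps above might look like this in Python. RC4 is written out by hand to keep the example self-contained; the function names and the sample file key are made up for illustration.

```python
import hashlib


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4; encryption and decryption are the same operation."""
    # Key-scheduling algorithm.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    # Keystream generation, XORed into the data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def object_key(file_key: bytes, obj_num: int, gen_num: int) -> bytes:
    """Steps 1-5: build the 10-byte RC4 key for one object."""
    data = file_key
    data += (obj_num & 0xFFFFFF).to_bytes(3, "little")  # 3 low-order bytes
    data += (gen_num & 0xFFFF).to_bytes(2, "little")    # 2 low-order bytes
    return hashlib.md5(data).digest()[:10]


def decrypt_object(file_key: bytes, obj_num: int, gen_num: int, data: bytes) -> bytes:
    return rc4(object_key(file_key, obj_num, gen_num), data)


# Usage with a made-up 5-byte file key: encrypt, then decrypt, object (12, 0).
demo_key = bytes.fromhex("0102030405")
ciphertext = decrypt_object(demo_key, 12, 0, b"BT /F1 12 Tf (Hello) Tj ET")
print(decrypt_object(demo_key, 12, 0, ciphertext))
```

Because every object gets its own RC4 key derived from the same 5-byte file key, recovering that one file key is enough to decrypt the whole document.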

Of course, this is only the general encryption/decryption scheme; the details differ between PDF versions.

This answer only applies for old PDF versions, but the decryption programs also work on newer PDF versions. — March Ho, Aug 03 '15 at 12:33
@MarchHo As I stated at the end of the answer, this is the general scheme; the details differ between Adobe PDF versions. You cannot know everything about it because it follows the principle of security through **obscurity** — , Aug 03 '15 at 12:36
Do we know why they use such a weak system? Is it one of those cases of "it's just a lock to keep nosy coworker Bob out and not to keep hacker Mallory out"? — David says Reinstate Monica, Aug 03 '15 at 14:04
Simple - it is just impossible to make it work. Can you imagine a book which you can read but not copy? All DRM systems are breakable unless you completely lock down hardware. — user158037, Aug 03 '15 at 15:16

score 20 · Answer 3 · edited Mar 17 '17 at 13:21

The main problem with password-protecting a PDF file is that you are basing the security on a password, which is some piece of data that a human user, somewhere, came up with in his mind, and was arrogant enough to deem "unguessable". It turns out that most passwords are guessable. The situation can be somewhat improved by making the password-to-key transformation expensive (this is called password hashing), but a weak password is still weak.

A second problem is that there is not one format for PDF encryption, but several. PDF encryption has a long history of custom schemes, the first of which took root at a time when the USA had strict export rules for cryptography-aware software; to make the story short, to allow the software to be exported without any administrative hassle, the crypto had to be laughably weak. Hence the encryption format described by @begueradj in his answer: the password is hashed, and only the first 40 bits of the result are kept as the "master key" for the whole file. A 40-bit key is highly amenable to exhaustive search with today's computers, making the whole encryption thing a joke. It is now possible to make strongly encrypted PDF files that modern versions of Adobe Reader can process (I personally wrote some code to make PDF files that could be decrypted only with a smart card), but you have to do it explicitly.
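
To put a number on "amenable to exhaustive search", here is a small back-of-the-envelope calculation in Python. The trial rates are assumptions picked purely for illustration, not measurements of any real cracking tool.

```python
KEY_BITS = 40
KEYSPACE = 2 ** KEY_BITS  # 1,099,511,627,776 possible keys


def worst_case_seconds(keys_per_second: float) -> float:
    """Time to try every 40-bit key at a given (assumed) trial rate."""
    return KEYSPACE / keys_per_second


# Hypothetical trial rates; real throughput depends on hardware and software.
for rate in (1e6, 1e9):
    seconds = worst_case_seconds(rate)
    print(f"{rate:.0e} keys/s: {seconds / 3600:.1f} hours worst case, "
          f"{seconds / 7200:.1f} hours on average")
```

At an assumed billion trials per second the whole key space is covered in under twenty minutes, regardless of how good the password was, because the search targets the 40-bit key and not the password.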

Compounding the situation is the PDF internal structure. A PDF is a set of "objects", some of them being streams of other objects, or raw data. The whole idea is that the document should be amenable to a variety of accesses, e.g. jumping to any page within the document (possibly before having downloaded it whole), or extracting a table of contents. Since encryption is applied on a per-stream basis, the usual consequence is that a lot of the document structure can be obtained without breaking the encryption (e.g. number of pages, length of each paragraph, number, size and position of pictures...). Whether this is a serious problem or not depends on the context, in particular why you want to encrypt. The real issue here is that the decision about what to encrypt and what not to encrypt is taken by some generic software that cannot, by definition, be aware of the context.

In practice, the point of password-protecting a PDF file is not to make it really inscrutable to eavesdroppers; it is to document, in a clear and unavoidable way, that the file contents are sensitive and that the file shall be handled with care. It is the equivalent of a red "top secret" stamp.
