
I have some PDF files generated from different sources (such as a web browser, Photoshop, etc.) on Unix.

How do I make sure a PDF doesn't contain any sensitive information such as an IP address, OS, user name, full name, or any other stored data, digital watermarks, or fingerprints?

As a result, no forensic analysis should be able to disclose the author or the origin of the file based on its content.

Ideally I'd like to know which Unix command-line tool would help me achieve that (something similar to pdftotext, but keeping the original format).

cicada

2 Answers


You can transform the PDF into uncompressed form using pdftk. Most metadata will then be immediately visible (and removable, provided you repair the file with pdftk afterwards). The same goes for "non-immediately-PDF" code (you can inspect that with tools such as PDFiD). You can, for example, easily alter the trailer, where fields such as /ID are found:

/Info 104 0 R
/ID [<81b14aafa313db63dbd6f981e49f94> <81b14aafa313db63dbd6f981e49f94>]

(In the case of /ID, just replace it with a sequence of random hex digits of the same length. No repair is necessary; the pdftk compress operation is advised afterwards to save disk space.)
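
A minimal sketch of that workflow, assuming pdftk and GNU sed are installed (file names and the replacement hex string are placeholders):

$ pdftk original.pdf output uncompressed.pdf uncompress
$ # replace the /ID hex strings with new random hex of the same length
$ sed -i 's/81b14aafa313db63dbd6f981e49f94/00112233445566778899aabbccddee/g' uncompressed.pdf
$ pdftk uncompressed.pdf output cleaned.pdf compress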

Much depends on what kind of redaction you're trying to achieve. "Semi-inadvertent" metadata such as the /ID (above) can easily be removed, either directly or by re-saving the PDF, which drops unused objects and previous revisions of extant objects; these could contain sensitive information either intentionally or unintentionally:

7.5.6 Incremental Updates

The contents of a PDF file can be updated incrementally without rewriting the entire file. When updating a PDF file incrementally, changes shall be appended to the end of the file, leaving its original contents intact.
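
Re-saving discards those earlier revisions, since the file is rewritten from scratch with only the current revision. A hedged sketch, assuming pdftk or qpdf is installed (file names are placeholders):

$ pdftk original.pdf output resaved.pdf compress
$ # or, equivalently:
$ qpdf original.pdf resaved.pdf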

Some filters may contain garbage without this compromising PDF legibility (for example, the DCTDecode filter used to store a JPEG entity within the document). The filter supplies its own data-size field internally, so the "outer" PDF reader gathers the DCTDecode-encoded object and passes it to the filter, which gives back a raster image. The original object might contain extra bytes that the filter will ignore and discard; these may be meaningless, may contain pieces of random memory from the encoding computer (possibly including sensitive information), or may contain the same kind of information placed there intentionally. Moreover, being a JPEG, a DCTDecode object can contain EXIF information, and this might be sensitive (e.g. GPS positioning and the like).

For example, if I take an image containing a copyright notice

$ exiftool istockphoto_2425717-getting-a-call.jpg | grep Copyright
Profile Copyright               : Copyright (c) 1998 Hewlett-Packard Company

and convert it to PDF

$ convert istockphoto_2425717-getting-a-call.jpg test.pdf

...the original string is still there:

$ strings test.pdf | grep Copyright
Copyright (c) 1998 Hewlett-Packard Company

To remove it, I would need to extract the JPEG(s), remove the EXIF and other data tags from them, and then re-embed them. It could conceivably be done using e.g. iText, or with a state machine identifying the streams

...
/Filter [/DCTDecode]
/Width 364
/Height 380
/BitsPerComponent 8
/Length 20514
/ColorSpace 8 0 R
>>
stream
ÿØÿà^@^PJFIF<this is the JPEG file...>
...
endstream
endobj
...

saving the stream object as a JPEG, manipulating it, and writing it back. The stream length then needs to be corrected, and the index offsets will have changed, yielding a broken PDF; pdftk can reconstruct the index from this broken PDF, generating a clean one.
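
A hedged sketch of the extract-and-strip part, assuming poppler-utils (pdfimages) and exiftool are installed; the re-embedding step still has to be done as described above:

$ pdfimages -j test.pdf img     # extracts embedded JPEGs as img-000.jpg, img-001.jpg, ...
$ exiftool -all= img-000.jpg    # strips EXIF and other metadata tags (keeps an _original backup)
$ # alternatively, forcing Ghostscript to re-encode image streams can drop such tags,
$ # though the exact behavior depends on the Ghostscript version:
$ gs -o stripped.pdf -sDEVICE=pdfwrite -dPassThroughJPEGImages=false test.pdf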

Yet other kinds of metadata, designed for stealth, may not be so easy to detect, much less remove. The possibilities there are endless. Some of the sneakier ones:

  • PDF has an internal index, so the actual ordering of most object entities inside the file does not matter. Find a metric (lexicographic would do) to define a "natural" order of those objects. If you have N objects, you have N! possible shuffles of those objects, and can then encode log2(N!) bits of information in the way they are ordered (see the capacity sketch after this list). Remove with: a tool capable of reordering the objects in the PDF file.
  • Fonts can map arbitrary symbols to glyphs and glyphs are ordered within a font. Again you can choose the mapping so it carries information. Remove with: most PDF "font optimizers".
  • Several objects allow positioning at much higher precision than is actually needed for typesetting. Assuming there's no typesetting difference between the coordinate values 3.1415 and 3.1416, a character positioned at 3.1415064 could "encode" the ASCII code 64. Remove with: a tool (e.g. iText) capable of extracting and processing every entity in the PDF and snapping it to a grid of suitable coarseness (here 0.0002), thereby rounding 3.1415064 to 3.1416.
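
To get a feel for the capacity of the ordering channel in the first bullet: with N = 100 objects there are 100! possible orderings, i.e. log2(100!) bits of information, enough for a payload of some 65 bytes. A quick illustrative check (python3 assumed available; lgamma(101) = ln(100!)), which prints roughly 524.8:

$ python3 -c 'import math; print(math.lgamma(101) / math.log(2))'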

A not always optimal, but working, way of doing all of the above is to regenerate the PDF from scratch, e.g. by printing it to another PDF device after typesetting it in a different format. A more radical version of this approach is to typeset the PDF onto a raster canvas and generate a PDF out of that. The resulting PDF will be far bulkier, since it now basically consists of a series of TIFF or JPEG images, one per page. It will also, of course, be non-searchable.
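
A hedged sketch of the rasterize-and-rebuild route, assuming poppler-utils (pdftoppm) and img2pdf are installed (the resolution and file names are placeholders):

$ pdftoppm -r 300 -jpeg original.pdf page   # renders page-1.jpg, page-2.jpg, ...
$ img2pdf page-*.jpg -o rasterized.pdf      # wraps the rasters back into a PDF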

It is still possible to hide information that will survive such treatment by steganographically encoding it in visible character kerning, which will be replicated in the rasterized version. Some "safe" documents are typeset that way: if the document leaks, access to a copy of it will reveal the identity of the responsible party; two copies assigned to different personnel, superimposed against a sunny window, will show that the characters aren't exactly aligned in the two copies. One scheme used to do the trick is called QIM, quantization index modulation (other schemes exist).
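
A toy sketch of the QIM idea, not any particular tool's implementation: pick a quantization step, then snap each kerning or coordinate value to an even or odd multiple of that step depending on the bit to embed; the detector recovers the bit from the parity of the nearest multiple.

$ python3 - <<'EOF'
delta = 0.0002  # quantization step (illustrative value)

def embed(x, bit):
    # snap x to the nearest multiple of delta whose parity equals bit
    q = round(x / delta)
    if q % 2 != bit:
        q += 1
    return q * delta

def detect(x):
    return round(x / delta) % 2

y = embed(3.14150, 1)
print(y, detect(y))  # the shifted coordinate, and the recovered bit 1
EOF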

Depending on the degree of control the encoding party has over the document, you might therefore never be sure you've successfully "anonymized" it. For sufficiently large values of paranoid, even going the full PDF-to-ASCII way might not touch information encoded in typos or, who knows, in synonym choice (I think I still have some texts that demonstrate such an encoding, but unfortunately they're in Italian :-) ).

LSerni

What you are looking for is called "redaction".

I am not aware of any "cheap" redaction tool for Unix. Depending on your documents, you will need a PDF viewer anyway, because if you have images, redaction will be visual (and you should always visually check your documents while redacting).

That said, I could see three possibilities:

a) Adobe Acrobat (which has a Redaction tool built in)

b) Adobe Acrobat with the Redax Plug-in (by Appligent); this is considered to be the industry standard, because Redax can do certain things better than Acrobat on its own

c) Redax Server by Appligent; this is a high-volume Redaction tool, which would run on some Unix systems

Option c) would come closest to your request (but you'd have to talk to Appligent about pricing…).

Max Wyss
  • I think you misunderstood my question. I don't care about redaction: images and content are things I put there deliberately (they could be generated automatically by some scripts). I care about metadata, which is usually hidden and can contain sensitive information without my knowledge. Like Word storing the full name of the PC user inside `.docx` files, or camera JPEG files storing the model name and lens of my camera, which makes it quite easy to identify the user who took the picture. – cicada Apr 13 '15 at 10:49
  • That is indeed possible (me misunderstanding the question). In this case, your tool essentially clears all the XMP entries. One quick and dirty workaround would be creating a proper PDF/X-1a file (which, if done right, has all metadata cleared). Or, another quick and dirty hack: refry the document (create a PostScript file, and recreate the PDF from it); Ghostscript should be able to do that (see the sketch below). – Max Wyss Apr 13 '15 at 11:13
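
A hedged sketch of the "refry" route from the last comment, using Ghostscript's pdf2ps and ps2pdf wrapper scripts (file names are placeholders; the round trip typically drops document-level metadata, but the result should be verified, e.g. with exiftool):

$ pdf2ps original.pdf intermediate.ps   # flatten to PostScript
$ ps2pdf intermediate.ps refried.pdf    # regenerate a fresh PDF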