
Background

My boss asked me to come up with a way ordinary users can redact information from PDF files using free software. We get a lot of scanned documents and our client requires that sensitive information be redacted from PDFs before they are uploaded to their system. Here's what I came up with. I have convinced myself that this will effectively destroy potentially sensitive client metadata from the original document, as well as making it impossible to remove any black bars covering up sensitive information. However, I have also found that I don't know nearly as much as I think I do.

Many forum members posting about this topic have stated quite firmly that only Adobe Acrobat or other paid software can do this securely. If you are of this opinion, please explain why. I'm having trouble figuring out why this wouldn't work.

Overview

In some PDF program, cover up the sensitive stuff with boxes, then convert it to a TIFF file. Then convert the TIFF file back to a PDF.

  • Would this work? Does the TIFF file preserve any information about objects or layers? Is any potentially sensitive metadata likely to make it through, or will all metadata be changed, as I hope?

How I'm doing it specifically

I don't know if I should include this, since the general question will probably be more useful, but here's my specific setup:

The software:

PDFCreator and Foxit PDF.

The setup:

Change the settings in PDFCreator so that it converts the document to a TIFF instead of a PDF. For the output action, set PDFCreator to print the TIFF back to Foxit rather than opening the document.

The process:

  1. Open the PDF in Foxit Reader and cover up any visible sensitive data with black rectangles.
  2. Print the document to PDFCreator.
  3. In the background, PDFCreator saves the file as a TIFF and then prints the TIFF to Foxit's PDF printer. Foxit asks where you want to save the PDF.

Related

Inspired by Blacking out a part of a PDF, or redaction of text on AskDifferent.

This is related to How to remove meta and sensitive data from PDF file?, but we are all on Windows, not Unix.

Also related from SuperUser: How to remove OCR from a PDF?

Step by step instructions for a similar process by someone else: Quick and Dirty Redaction

Summary

From a security standpoint, will converting a PDF to an image, blacking out a portion, then converting it back to a PDF be sufficient to remove information from the document?

browly
  • I can't find how this has a relation to infosec, really. What is your question? – Stephane Sep 16 '15 at 08:43
  • I think this would be better suited at Super User. – ThoriumBR Sep 16 '15 at 12:30
  • The OP states his question at the end. Will converting a PDF to an image, blacking out a portion, then converting it back to a PDF be sufficient in removing information from the document. – RoraΖ Sep 16 '15 at 12:56
  • @Stephane The information that needs to be secured here is the stuff we're redacting from the PDF. We don't want anyone to be able to access the redacted information if they have a copy of the PDF. So is converting to an image, and back to a PDF, enough to ensure that it is impossible to retrieve the redacted information from the new PDF? – browly Sep 16 '15 at 15:30
  • Do you have any control over the scanning process? If not, does the scanning process involve OCR? – Max Wyss Sep 16 '15 at 15:41
  • @Max If they mail us the documents, then we scan them ourselves, but most people fax or email them to us. Some of the emailed documents have OCR. The "new" document after converting back to a PDF doesn't have OCR, as far as I can tell. – browly Sep 16 '15 at 15:44
  • @browly OK, so the only place where you have to be careful is with the PDFs you get by email; the other documents are simple raster images (if you scan to TIFF anyway). If you want to be really sure to get rid of the metadata and private data of the PDFs, you might not just export as TIFF, but print to TIFF, and then proceed with the redacting. – Max Wyss Sep 16 '15 at 16:13

1 Answer


If the scanned documents have not been through an OCR process and arrive without sensitive metadata, then rendering to TIFF, changing the pixels to be redacted to a uniform black (or any other color), flattening the TIFF, and writing it back as a PDF would be sufficient. This is because you create a completely new document. If that document has metadata, it describes your process, not any previous one.
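To make "changing the pixels" concrete, here is a minimal sketch in Python. It assumes a hypothetical 8-bit grayscale raster held in a plain bytearray, one byte per pixel, as a stand-in for a decoded TIFF page; the function name and layout are illustrative only, and a real workflow would do this in an image editor or imaging library. The point is that the original values inside the box are overwritten, not covered.

```python
def redact_region(pixels: bytearray, width: int, height: int,
                  box: tuple[int, int, int, int], fill: int = 0) -> None:
    """Overwrite every pixel inside box (left, top, right, bottom) in place.

    `pixels` is a hypothetical row-major 8-bit grayscale raster.
    `right` and `bottom` are exclusive.
    """
    left, top, right, bottom = box
    # Clamp the box to the page so an oversized selection cannot wrap rows.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        return  # nothing to redact
    run = bytes([fill]) * (right - left)
    for y in range(top, bottom):
        start = y * width + left
        # Replace the stored values; the previous bytes no longer exist.
        pixels[start:start + len(run)] = run


# A 6x4 "page" of mid-grey pixels standing in for scanned content.
width, height = 6, 4
page = bytearray([128]) * (width * height)
redact_region(page, width, height, (1, 1, 4, 3))
```

After the call, the six pixels in the box hold the fill value and nothing in the buffer records what they held before, which is the property the redaction relies on.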

Depending on the requirements for the redacted document, you could run OCR over it, and/or add your custom metadata. But, again, that would be done under your control.
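As a rough spot check on the resulting PDF, one could scan its raw bytes for PDF names that suggest something other than a flat image is present. The sketch below is a heuristic under stated assumptions: the list of names is illustrative rather than exhaustive, and the function name is hypothetical. PDFs that use compressed object streams can hide these names, and image data can contain them by coincidence, so a clean result is not proof of anything.

```python
# PDF name tokens that hint at content beyond a flat page image.
# This selection is an illustrative assumption, not a complete audit.
SUSPECT_NAMES = {
    b"/Font": "font resources (possible text or OCR layer)",
    b"/Annots": "annotations (overlay boxes may not be flattened)",
    b"/EmbeddedFiles": "embedded file attachments",
    b"/OCProperties": "optional content groups (layers)",
}


def find_suspect_names(pdf_bytes: bytes) -> dict[str, str]:
    """Return the suspect names that occur anywhere in the raw PDF bytes."""
    found = {}
    for name, meaning in SUSPECT_NAMES.items():
        # Plain substring search: only sees names stored uncompressed.
        if name in pdf_bytes:
            found[name.decode("ascii")] = meaning
    return found


# Tiny hand-written fragment standing in for a real file's contents.
sample = b"%PDF-1.4\n1 0 obj\n<< /Type /Page /Annots [4 0 R] >>\nendobj\n"
for name, meaning in find_suspect_names(sample).items():
    print(f"{name}: {meaning}")
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. Any hit is a reason to inspect the document more closely before uploading it.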

Ideally, the scans would come in as TIFF, which you process, and only then create the PDF. This would simplify the workflow.

In any case, you would need an image editor that understands both TIFF and PDF, and you would do the redaction in that editor.

Max Wyss
  • Thanks, Max. Any particular reason why drawing the "black-out" rectangle has to occur on the TIFF and not the original PDF before printing to TIFF? Does it matter? My original question had me drawing the rectangle on the original PDF using Foxit Reader, before printing it to TIFF. – browly Sep 16 '15 at 16:25
  • Sorry for not answering quicker… Actually, there is a difference. When you black out or white out in the TIFF, you change the contents directly, whereas if you place a patch over the contents of the PDF, you change them only indirectly. An indirect change is not controlled, and there is a high chance that artefacts are created, which may allow the original contents to be reconstructed. – Max Wyss Sep 18 '15 at 21:29