Generate or update a PDF to include an encrypted, hidden watermark?

10

4

Background

I am using LaTeX to write a book. When a user purchases the book, the PDF will be generated automatically.

Problem

The PDF should have a watermark that includes the person's name and contact information.

Question

What software meets the following criteria:

  • Applies encrypted, invisible watermarks to a PDF
  • Open Source
  • Platform independent (Linux, Windows)
  • Fast (marks a 200 page PDF in under 1 second)
  • Batch processing (exclusively command-line driven)
  • Collusion-attack resistant
  • Non-fragile (e.g., PDF -> EPS -> PDF still contains the watermark)
  • Well documented (shows example usages)

Ideas & Resources

Some thoughts and findings:

The problem with NLP is that grammatical errors can be introduced. The problem with steganography is that the images are sourced from an image cache, so recreating that cache with watermarked images would introduce a delay when generating the PDF (I could just delete one image from the cache, but that is not an elegant solution).

Thank you!

Dave Jarvis

Posted 2010-12-26T09:02:47.810

Reputation: 2 126

Please modify your description of the requirements a bit, otherwise they are unclear. "undetectable watermarks" clearly are not what you want... otherwise, how would you yourself detect them if you needed to? – Kurt Pfeifle – 2010-12-26T12:28:59.920

It is a bit unclear what exactly the purpose of your conceived system is: Detect if the PDF is passed along to another user, even though your license forbids this? Detect if the PDF is printed on paper, even though your license forbids this? Track the path of a particular PDF through the internet and track when it's opened? Or something else? – Kurt Pfeifle – 2010-12-26T13:45:43.680

@pipitas: If a registered version of the PDF is released, without permission, into the wild, I would like to know who released it. But if people can see that the PDF has a watermark, then the watermark becomes that much easier to circumvent. – Dave Jarvis – 2010-12-26T16:08:01.553

Answers

6

I did something similar a few years ago. It did not meet all your "hard" criteria. It worked like this:

  • I put a barely detectable, 2x2 point "clickable" area at a random place along one of the borders of a random PDF page. It's not very likely to be discovered by accident (amongst the load of other, very obviously clickable hotspots that were in the PDF anyway...).

  • Should you click on the link, it would take you to a webpage http://my.own.site/project/87245e386722ad77b4212dbec4f0e912, with some made-up "errata" bullet points. (Did I mention that 87245e386722ad77b4212dbec4f0e912 was the MD5 hash of the person's name + contact data which I kept stored in a DB table? :-)
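A minimal sketch of how such a per-buyer URL could be derived, in Python. The function names, the newline separator between name and contact data, and the UTF-8 encoding are assumptions made for illustration; the answer only says that the token was the MD5 hash of the person's name plus contact data.

```python
import hashlib


def tracking_token(name: str, contact: str) -> str:
    """Return the hex MD5 digest of the buyer's name plus contact data."""
    # Assumption: the two fields are joined with a newline and encoded as UTF-8.
    payload = f"{name}\n{contact}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def tracking_url(name: str, contact: str,
                 base: str = "http://my.own.site/project/") -> str:
    """Build the per-buyer URL that the hidden hotspot would point to."""
    return base + tracking_token(name, contact)


if __name__ == "__main__":
    # Hypothetical buyer data, for illustration only.
    print(tracking_url("Jane Doe", "jane@example.com"))
```

The token would be stored next to the buyer's record in the DB table, so that a hit on the URL can be mapped back to a person. Because anyone who knows the name and contact data can recompute a plain MD5, mixing in a secret value before hashing may be worth considering.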

Obviously, this does not protect against printing+scanning+ocr-ing or against a PDF "refrying" cycle. And it also relies on some degree of "security by obscurity".

Here is how you use Ghostscript to add such a clickable hotspot to the lower left corner of page 1 of random-in.pdf:

gs \
 -o random-out.pdf \
 -sDEVICE=pdfwrite \
 -dPDFSETTINGS=/prepress \
 -c "[ /Rect [1 1 3 3]" \
 -c "  /Color [1 1 1]" \
 -c "  /Page 1" \
 -c "  /Action <</Subtype /URI" \
 -c "  /URI (http://my.own.site/87245e386722ad77b4212dbec4f0e912)>>" \
 -c "  /Subtype /Link" \
 -c "  /ANN pdfmark" \
 -f random-in.pdf

To make the clickable area bigger and visible, change the command-line parameters above like this:

 [....]
 -c "[/Rect [1 1 50 50]" \
 -c "  /Color [1 0 0]" \
 [....]
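For batch use, the command above could be assembled by a small script. The following Python sketch builds the same argument list with the rectangle, color, page and URL as parameters. The function names are hypothetical, and the check that rejects parentheses and backslashes in the URL is an added precaution (they would need escaping inside a PostScript string), not part of the original command.

```python
import subprocess


def build_gs_command(in_pdf, out_pdf, url,
                     rect=(1, 1, 3, 3), color=(1, 1, 1), page=1):
    """Build the Ghostscript argument list for one link annotation (pdfmark)."""
    # Parentheses and backslashes are special inside a PostScript string.
    if any(ch in url for ch in "()\\"):
        raise ValueError("URL contains characters that need PostScript escaping")
    rect_s = " ".join(str(v) for v in rect)
    color_s = " ".join(str(v) for v in color)
    return [
        "gs",
        "-o", out_pdf,
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        "-c", f"[ /Rect [{rect_s}]",
        "-c", f"  /Color [{color_s}]",
        "-c", f"  /Page {page}",
        "-c", "  /Action <</Subtype /URI",
        "-c", f"  /URI ({url})>>",
        "-c", "  /Subtype /Link",
        "-c", "  /ANN pdfmark",
        "-f", in_pdf,
    ]


def add_hotspot(in_pdf, out_pdf, url, **kwargs):
    """Run Ghostscript; assumes a `gs` executable is available on the PATH."""
    subprocess.run(build_gs_command(in_pdf, out_pdf, url, **kwargs), check=True)
```

Calling add_hotspot("random-in.pdf", "random-out.pdf", url) would then correspond to the first command, and passing rect=(1, 1, 50, 50), color=(1, 0, 0) to the bigger, visible variant.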

Even simpler would be to generate and keep an MD5 hash of the PDF in your database. It will be unique for each PDF you create, because of the document's UUID and the CreationDate and ModDate inside its metadata. Of course, this also only allows you to track the original PDFs in their digital form...
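A rough sketch of that bookkeeping in Python, assuming an SQLite database. The table name issued_pdfs, its columns and the function names are invented for illustration. Note that the lookup only matches a byte-identical copy of the file that was issued.

```python
import hashlib
import sqlite3


def pdf_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_db(db_path=":memory:"):
    """Open the database and make sure the (hypothetical) table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS issued_pdfs "
        "(md5 TEXT PRIMARY KEY, buyer TEXT NOT NULL)"
    )
    return conn


def record_pdf(conn, path, buyer):
    """Store the hash of a freshly generated PDF together with its buyer."""
    md5 = pdf_md5(path)
    conn.execute(
        "INSERT OR REPLACE INTO issued_pdfs (md5, buyer) VALUES (?, ?)",
        (md5, buyer),
    )
    conn.commit()
    return md5


def lookup_buyer(conn, path):
    """Return the buyer of a byte-identical PDF, or None if it is unknown."""
    row = conn.execute(
        "SELECT buyer FROM issued_pdfs WHERE md5 = ?", (pdf_md5(path),)
    ).fetchone()
    return row[0] if row else None
```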

Kurt Pfeifle

Posted 2010-12-26T09:02:47.810

Reputation: 10 024

How can one add the watermark to multiple pages/all pages? I've already tried reiterating the whole block ([ /Rect ... pdfmark) but with different page numbers to no avail. Just duplicating the /Page commands inside this block doesn't work either. I guess I have to read PostScript's pdfmark documentation. – ComFreek – 2017-03-16T10:43:21.640

Did you ever find a PDF in the wild and trace it back using this technique? – Dave Jarvis – 2010-12-28T20:19:05.820

@Dave Jarvis: Yes, I did, in a way... But it wasn't a "serious" thing, I didn't have any real interest in tracking. I did it as a proof of concept only, and after about 6 months I switched off the "tracking" web server. It was for a network PDF server I had set up inside a customer's company. The "tracker" was similar to the one described above, but used a full-page clickable area. I just tracked the number of "hits" in the Apache log file.... – Kurt Pfeifle – 2010-12-28T21:07:43.200

Nice idea, but note that running GhostScript over a PDF like that could degrade any sampled images that it contains, since GhostScript cannot pass them through without decompressing them (which loses information from images that were JPEG-style compressed in the input) and tends to apply JPEG-style compression to all images (even the ones it just decompressed)... – SamB – 2011-01-02T22:51:25.260

@SamB: I think you can add -dJPEGQ=100 -dQFactor=1.0 to the Ghostscript commandline to make sure you'll maintain 100% of existing JPEG quality. But no, I've not noticed any degrading of image quality in my files if I used the generic setting of -dPDFSETTINGS=/prepress when re-distilling any PDFs with Ghostscript.... – Kurt Pfeifle – 2011-01-03T00:05:18.617

[contd.] And no, it's not only JPEG compression that's on offer for images from Ghostscript -- you can use -dColorImageFilter=/FlateEncode (which is lossless ZIP) to override the default =/DCTEncode (which is lossy JPEG) in older GS versions. Since GS v7.21 the default is =/FlateEncode anyway... What holds for color also holds for -dGrayImageFilter=... (-dMonoImageFilter=... uses /CCITTFaxEncode by default.) – Kurt Pfeifle – 2011-01-03T00:07:41.653

1

This is a very hard one, and I am not sure that this will answer all of your questions.

I am not aware of an all-in-one solution that can do this, or randomise.

However, if I were tasked with this, I would think that the easiest way is to keep the document in an intermediate format such as formatted HTML, or similar.

Using a print CSS file or similar, you can get the layout to be identical to the book, use a script of some sort to randomise the picture, content or anything else, and use a server-side PDF component to assemble the document again.

So then, for example, when someone purchases the document, your buy script can randomly choose a number which identifies a protection mechanism (e.g. first picture, second picture, text somewhere etc.), and then generate a unique download link.

When that download link is called, it checks the number, performs the operation, compiles the document to PDF and then sends it to the client.
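The flow described above could be sketched roughly like this in Python. Everything here is hypothetical: the in-memory dictionary stands in for a real database, the base URL is a placeholder, and the step that actually marks the document and compiles the PDF is only a stub.

```python
import secrets

# The protection mechanisms named in the answer; the list index is the "number".
MECHANISMS = ["first picture", "second picture", "text somewhere"]

# Stand-in for a database table: token -> (buyer, mechanism number).
_pending = {}


def create_download_link(buyer, base_url="http://example.com/download/"):
    """Called by the buy script: pick a mechanism and issue a unique link."""
    token = secrets.token_urlsafe(16)
    mechanism_id = secrets.randbelow(len(MECHANISMS))
    _pending[token] = (buyer, mechanism_id)
    return base_url + token


def handle_download(token):
    """Called when the link is requested: look up the number and act on it."""
    buyer, mechanism_id = _pending[token]  # raises KeyError for unknown tokens
    # Stub: a real handler would apply the mechanism, compile the PDF
    # and stream the result to the client.
    return {"buyer": buyer, "mechanism": MECHANISMS[mechanism_id]}
```

Keeping the chosen number on the server, rather than in the link itself, means the buyer cannot tell from the URL which mechanism was applied.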

Again, I know this will not be easy/straightforward, but you are not asking for something that is easy, and this is the best way I can think of.

William Hilsum

Posted 2010-12-26T09:02:47.810

Reputation: 111 572

@Dave Jarvis - I understand fully what you are trying to do... as I said, I am not sure of the best solution, but what I said should at least work... it is just far from easy. – William Hilsum – 2010-12-26T16:31:12.057

@Dave Jarvis - What I was trying to say/get across is that I have never seen an all-in-one/easy way to do what you want, but using PHP/ASP.Net, it is easier to write scripts/call third party components. I would think that if you have the entire document in HTML formatted correctly/exactly, it would be very easy to use a PDF component to convert.... For example, let's say there are 100 pages and a picture on page 31: you could have pages 1-30 as a PDF and pages 32-100 as a PDF, while page 31 would be generated and formatted in HTML (in the style of the rest of the book). You can then use a 1/2 – William Hilsum – 2010-12-26T17:08:25.877

PDF component that will get the first PDF, convert the HTML page, get the second PDF and generate a new PDF combining all of it. The generated page can call scripts, can perform steganography (not sure on the verb!) or anything else you want... there are many (free and pay) PDF components - this is one for example... http://www.componentone.com/SuperProducts/PDF/ I hope this makes it a little clearer what I am trying to get across - it is just very hard to explain. 2/2 – William Hilsum – 2010-12-26T17:24:15.680

@Dave Jarvis - ehh, not exactly... As I said, very hard to explain.... Some PDF components are amazing along with CSS/print styles. For example, look at Moodle. It is possible to fully format a web page and make a print out look like a book / follow a style. You can then use a PDF component to export/save EXACTLY like how it should look at the end result. You can easily generate the picture you need and have the text, and assemble it (seamlessly to the end user) as a single PDF file. I just mention web/php/asp.net as I think it is the easiest way to get to what you want. – William Hilsum – 2010-12-26T18:04:22.697

@Dave Jarvis: I guess you aren't using pdfTeX, then? (Or were you more worried about users doing pdf->ps->pdf conversion and degrading the sample images in the process?). Anyway, ps->pdf conversion does typically degrade images, since GhostScript isn't smart enough to preserve JPEG-style images in compressed form, and tends to automatically apply JPEG-style compression to any images occurring in the input. (Distiller apparently can be instructed to leave JPEG-style images alone, but does anyone actually have that?) – SamB – 2011-01-02T22:44:02.303