15

I am legally obliged to distribute a document (probably by email, probably saved as MS word, or a PDF) to several hundred recipients.

The recipients are legally obliged to keep it confidential. However, based on past experience I'm pretty sure it's going to end up publicly leaked pretty quickly. (in the past it's been freely distributed verbatim)

This has happened before, it's a serious problem and causes us financial damage and I'd really like to put a stop to this and identify the miscreant.

I'm aware of the John Le Carre technique of making each document very slightly different (missing full stop here, minor typo there etc etc) but with several hundred recipients making several hundred uniquely identifiable copies of the same basic document would be a non-trivial task.

Is there a way to automate this? or is there a better way of finding who's doing the leaking?

UPDATES - documents published 2 or 3 times a year. In the past the whole pdf has been published verbatim on public or semi-public forums, often within days (sometimes hours) of being distributed. On other occasions the documents have been re-distributed via email from 'burner' accounts (normally gmail)

  • The document is released to meet various legal obligations, so the information HAS to be accurate. It also HAS to go to the various recipients. So changing any of the data is not an option, but there's no law against making a spelling/grammar error
ConanTheGerbil
  • How are you going to stop multiple collaborators from removing differences in their copies of the document before leaking it? – user Mar 13 '20 at 18:01
  • @user - I'm assuming there's only one leaker, I'm also guessing they're not going to spot the unique differences so long as they're not too obvious – ConanTheGerbil Mar 13 '20 at 18:08
  • How frequently do you send out the documents? How quickly does the document get leaked once sent out? – user Mar 13 '20 at 18:18
  • What is the leak vector? Do you expect the PDF itself to appear on a site somewhere, or that the text will be copied and pasted? – Andrew Leach Mar 13 '20 at 18:35
  • Is the problem the leaking of the full and unmodified document or the leakage of specific information contained in the document? – Steffen Ullrich Mar 13 '20 at 18:46
  • @Steffen Ullrich - does it make a difference? Both/either are a problem – ConanTheGerbil Mar 13 '20 at 18:48
  • @ConanTheGerbil Document watermarking will completely fail if the leaker pulls out numbers or content and leaks that instead of the document, unless you're going to be falsifying the actual document contents (prices, volume, etc.) and not just introducing superficial errors (typos, etc.). – user Mar 13 '20 at 18:50
  • @ConanTheGerbil: Sure it makes a difference. If it is about the information, then applying subtle changes to the document will likely not help, since they will only be propagated if the text is copied verbatim but not if the essence of the information is distributed. Essentially you are then not asking about detecting the origin of leaked documents but about leaked information, which is even harder or impossible since all recipients essentially get the same information. – Steffen Ullrich Mar 13 '20 at 18:51
  • While investigating the leaks of the FISA applications for Carter Page, the government altered the dates of the 4 applications, helping to track down the leaker. That’s why those dates were redacted in the early versions of the IG report. So altering insignificant pieces of data in each document has been effective. The trick is software to do it efficiently. – Darrell Root Mar 13 '20 at 20:09
  • if you send these documents via e-mail, there is no way to track them back to a specific person as they could be intercepted in-transit. – pcalkins Mar 13 '20 at 23:07

8 Answers

3

There are a number of ways to modify each copy so that the copies are not visibly different but can still be uniquely identified. Here are a few ideas.

Leaks of entire document

Changes in metadata
You could put a unique hash in each document's metadata.
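
One way to derive such a per-copy value, as a rough sketch: compute a keyed hash (HMAC) of the recipient's address, so the tag cannot be guessed or forged without the secret. The key, the release label, the addresses, and the function name `recipient_tag` are all illustrative assumptions. Writing the tag into the PDF or Word metadata field is left to whichever library produces the document.

```python
import hashlib
import hmac

# Hypothetical secret known only to the sender; use a real random key in practice.
SECRET_KEY = b"replace-with-a-long-random-secret"


def recipient_tag(recipient: str, release: str) -> str:
    """Return a short hex tag unique to this recipient and this release."""
    message = f"{release}:{recipient}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:16]


# Lookup table to keep on file: tag -> recipient (placeholder addresses).
recipients = ["alice@example.com", "bob@example.com"]
table = {recipient_tag(r, "2020-Q1"): r for r in recipients}
```

When a leaked file turns up, reading the tag back out of its metadata and looking it up in `table` would point to the copy it came from, provided the leaker did not strip the metadata.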

Slight changes in text color
You could use slightly different text colors that all look the same to the eye but whose differences can still be detected by a computer.

Assuming that there are only 2 colors to use that still look the same, you could color the first letter in color 1 and the rest in color 2 for doc 1, the second would have 2 letters of color 1 and the rest of color 2, etc.

Invisible characters
You could put a certain number of spaces at the end of each document and use that for identification.

Leaks by copy and paste

Encode a unique id using spaces between words
Put different numbers of spaces between specific words and use that to identify documents. For instance the first document would have two spaces between word 1 and 2 and one space between every other word. For document 2, you would have 2 spaces between words 2 and 3.
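
The spacing scheme just described could be sketched roughly as follows. The function names are made up for illustration, and the sketch assumes a single paragraph of plain text whose words are separated by single spaces.

```python
from typing import Optional


def mark_copy(text: str, copy_number: int) -> str:
    """Put a double space after word number `copy_number` (1-based).

    Assumes one paragraph of text; all other gaps become single spaces.
    """
    words = text.split()
    if not 1 <= copy_number < len(words):
        raise ValueError("copy_number must be between 1 and len(words) - 1")
    head = " ".join(words[:copy_number])
    tail = " ".join(words[copy_number:])
    return head + "  " + tail


def identify_copy(leaked: str) -> Optional[int]:
    """Return the copy number encoded in `leaked`, or None if none is found."""
    position = leaked.find("  ")
    if position == -1:
        return None
    # The number of words before the double space is the copy number.
    return len(leaked[:position].split())
```

One double space per copy means a text needs at least as many word gaps as there are recipients. Note also that anything rendered as HTML collapses runs of spaces, so the mark may not survive posting on every forum.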

Automation

I would highly recommend generating the documents with a Python script using the FPDF library.

Documentation and examples can be found here.

MikeSchem
2

A prudent leaker may easily spot some of these techniques, such as changes in punctuation or misspellings, particularly if the leaker is familiar with the sender's usual standard of writing, or simply because one doesn't expect such mistakes in a published document.

You can still exploit "slight changes", but with a more cautious approach: using synonyms.
For instance, in one copy you have the word "Changes", in a second copy you replace it with "Modifications" in one occurrence, and in the third copy you replace it in two or more occurrences.

Another example is to choose to replace three different words in one document and five different words in a second document and so on.

As you can imagine, there are various schemes, and you can also combine them.

Is there a way to automate this? or is there a better way of finding who's doing the leaking?

This approach can be automated with any scripting language you have to hand. As input to the script, you can use a dictionary of synonyms for the most common (English) words.

Initially, you can write a program that:

  • Generates distinct copies of a document.
  • Creates a list that maps a distinct copy to one single recipient.
  • Sends each copy to its recipient.
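
The first two steps above might look something like the sketch below. It uses one possible scheme, not necessarily the one the answer has in mind: each synonym pair carries one bit of a copy number. The pairs, the function names, and the recipient addresses are hypothetical, and the sketch assumes lowercase matches only and that the second word of each pair does not already occur in the text.

```python
import re

# Hypothetical synonym pairs; each pair carries one bit of the copy number.
# Assumes the first word of each pair occurs in the text and the second does not.
SYNONYM_PAIRS = [
    ("changes", "modifications"),
    ("show", "indicate"),
    ("large", "sizeable"),
]


def make_copy(text: str, copy_id: int) -> str:
    """Swap in the synonym of pair i wherever bit i of copy_id is set."""
    if not 0 <= copy_id < 2 ** len(SYNONYM_PAIRS):
        raise ValueError("copy_id does not fit in the available synonym pairs")
    for bit, (word, synonym) in enumerate(SYNONYM_PAIRS):
        if (copy_id >> bit) & 1:
            text = re.sub(rf"\b{re.escape(word)}\b", synonym, text)
    return text


def identify_copy(leaked: str) -> int:
    """Rebuild the copy number from which synonyms appear in the leak."""
    copy_id = 0
    for bit, (_, synonym) in enumerate(SYNONYM_PAIRS):
        if re.search(rf"\b{re.escape(synonym)}\b", leaked):
            copy_id |= 1 << bit
    return copy_id


# Map each recipient to a distinct copy number (placeholder addresses).
recipients = ["alice@example.com", "bob@example.com", "carol@example.com"]
assignments = {recipient: number for number, recipient in enumerate(recipients)}
```

With this encoding, nine synonym pairs would be enough for 512 distinct copies. Each pair should be checked by a person first, so that no substitution changes the meaning or breaks the grammar of the sentence around it.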
elsadek
1

You can earmark the pdfs individually for each recipient in a number of ways, but one of the tricks I've heard used a couple of times is to individually encode a signature in each document sent out to each recipient using non-printing characters such as zero-width spaces. These will be copied if someone copies and pastes the material verbatim. This will not allow you to track if the content is retyped or printed/scanned. The last example of this that I remember was described here: Google steals lyrics from Genius

There are various watermark technologies which I'm less familiar with that can track documents through printing/scanning. But most distribution in the 21st century will happen by copy-pasting or by sending the document as is, and either could be tracked using the above method.
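
A minimal sketch of the zero-width approach, under some assumptions of my own: the copy number is written as 16 bits, with U+200B (zero-width space) standing for 0 and U+200C (zero-width non-joiner) for 1, and the signature is tucked in after the first word. The names `embed_id` and `extract_id` are illustrative.

```python
from typing import Optional

ZERO = "\u200b"  # ZERO WIDTH SPACE encodes bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER encodes bit 1
BITS = 16        # assumed ID width: room for 65,536 copies


def embed_id(text: str, copy_id: int) -> str:
    """Hide copy_id as zero-width characters after the first word."""
    if not 0 <= copy_id < 2 ** BITS:
        raise ValueError("copy_id does not fit in BITS bits")
    bit_string = format(copy_id, f"0{BITS}b")
    signature = "".join(ONE if bit == "1" else ZERO for bit in bit_string)
    first_word, separator, rest = text.partition(" ")
    return first_word + signature + separator + rest


def extract_id(leaked: str) -> Optional[int]:
    """Read the first BITS zero-width characters back into a number."""
    bits = "".join("1" if ch == ONE else "0" for ch in leaked if ch in (ZERO, ONE))
    if len(bits) < BITS:
        return None
    return int(bits[:BITS], 2)
```

Repeating the signature in several places would give a better chance of catching a leak that only includes part of the text.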

TopherIsSwell
1

Unique watermarking is the most reliable way to identify leakers. If every recipient gets an identical copy of the document, you can't use a copy as evidence to identify the source of the leak.

I read a paper recently on using fonts to watermark the document. By using nearly identical glyphs pulled from different Unicode character sets, the author was able to encode an almost invisible unique code in each copy that was sent out. The Unicode characters survived copy/pasting. And if the glyphs are chosen carefully (nearly identical but not completely identical), they may also survive a photographic copy process.

You could also use a big, obvious watermark, such as printing the recipient's name in the header and footer of each page. But if you do that, you encourage them to copy/paste the contents, which risks damaging any hidden watermarks.

John Deters
1

Use homoglyphs.

Some characters look exactly the same but are different code points in Unicode. It's trivial to write a script that replaces one set of characters with another and keeps a record of the replacements in a file for later searching.

For example, the following words aren't the same:

cοnfidential confidential confᎥdential confiԁential confidentᎥal

It's impossible to know that the words aren't the same on a page full of text, but for a computer it's an easy task:

for word in cοnfidential confidential confᎥdential confiԁential confidentᎥal ; do 
   echo -n $word | hexdump -C ; done
00000000  63 ce bf 6e 66 69 64 65  6e 74 69 61 6c           |c..nfidential|
00000000  63 6f 6e 66 69 64 65 6e  74 69 61 6c              |confidential|
00000000  63 6f 6e 66 e1 8e a5 64  65 6e 74 69 61 6c        |conf...dential|
00000000  63 6f 6e 66 69 d4 81 65  6e 74 69 61 6c           |confi..ential|
00000000  63 6f 6e 66 69 64 65 6e  74 e1 8e a5 61 6c        |confident...al|

So write the document as text, in Markdown for example, and use a PHP/Python/Perl/ASP script to make random changes in every paragraph and generate a PDF from the edited document. It's very easy to write a script for that which outputs the formatted PDF, the changed words, and the filename (preferably something like document-firstname-lastname.pdf). Keep those records and send each of the hundreds of files to its recipient.

When the document leaks, you just have to look at your table and search for the changed words. Changing on every paragraph is important to detect leaks even if the culprit leaks only part of the document.
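
Here is a rough Python sketch of the per-paragraph idea. It is a variation rather than exactly what is described above: instead of random changes recorded in a table, it encodes a copy number directly in each paragraph, so any single leaked paragraph can be read back. The three substitutions are the ones from the hexdump example; the function names and the 10-bit width are assumptions.

```python
from typing import Optional

# Latin letter -> look-alike: Greek omicron, Cherokee letter V, Cyrillic komi de.
HOMOGLYPHS = {"o": "\u03bf", "i": "\u13a5", "d": "\u0501"}
REVERSE = {glyph: latin for latin, glyph in HOMOGLYPHS.items()}
BITS = 10  # assumed ID width: room for 1,024 copies


def mark_paragraph(paragraph: str, copy_id: int) -> str:
    """Encode copy_id in the first BITS occurrences of o, i or d."""
    if not 0 <= copy_id < 2 ** BITS:
        raise ValueError("copy_id does not fit in BITS bits")
    bits = format(copy_id, f"0{BITS}b")
    out, used = [], 0
    for ch in paragraph:
        if ch in HOMOGLYPHS and used < BITS:
            # Bit 1: swap in the look-alike. Bit 0: keep the Latin letter.
            out.append(HOMOGLYPHS[ch] if bits[used] == "1" else ch)
            used += 1
        else:
            out.append(ch)
    return "".join(out)


def read_paragraph(paragraph: str) -> Optional[int]:
    """Recover the copy number, or None if the paragraph is too short."""
    bits = ""
    for ch in paragraph:
        if ch in REVERSE:
            bits += "1"
        elif ch in HOMOGLYPHS:
            bits += "0"
        if len(bits) == BITS:
            return int(bits, 2)
    return None


def mark_document(text: str, copy_id: int) -> str:
    """Mark every paragraph (separated by blank lines) with the same ID."""
    return "\n\n".join(mark_paragraph(p, copy_id) for p in text.split("\n\n"))
```

Paragraphs with fewer than `BITS` eligible letters cannot carry a full ID in this sketch, and a leaker who normalizes the text back to plain Latin letters would erase the mark.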

But don't stop at the technical side. If there's a legal obligation not to divulge the contents of the document, have a lawyer at your side and involve them when the leaker is identified. Sue the leaker, and let the other recipients know that you will pursue and prosecute violations of the confidentiality contract.

ThoriumBR
1

Defense in depth

I suggest you do at least 2 layers of defense:

  • 1st, make sure the content of the PDFs you send cannot simply be cut and pasted - maybe make the content a rendered bitmap or something like that. That will make the file bigger and generally less useful, as search will probably no longer work. I suggest adding some random noise to the bitmap for good measure.

    Also add a visible watermark with the email of the recipient to make them KNOW you might be tracking those docs. That alone might scare a number of would be leakers. PsyOps is a thing.

    Also, this makes md5sum comparisons moot: the copies differ because of both the tracking changes AND the noise, so the real differences will be hard to pick out.

  • 2nd, add some of the steganographic methods above (before the rendering), like different spaces or unicode, typos etc. If you include lots of numbers, maybe you can introduce insignificant changes, e.g. reporting a 7.8123% unemployment rate vs a 7.8128% rate.

    The not-so-smart leakers might be thinking that removing the visible watermark is enough and not bother to do the rest.

The really smart leakers - I'm afraid, there's little one can do as one can always re-type documents and extract only relevant details, or use OCR to produce something similar.

0

I am legally obliged to distribute a document (probably by email, probably saved as MS word, or a PDF) to several hundred recipients.

The recipients are legally obliged to keep it confidential. However, based on past experience I'm pretty sure it's going to end up publicly leaked pretty quickly. (in the past it's been freely distributed verbatim)

This has happened before, it's a serious problem and causes us financial damage and I'd really like to put a stop to this and identify the miscreant.

So, when you identify the culprit(s), what are you going to do? Sue them? You could certainly add a watermark or change metadata in the documents, but I find that sneaky.

So much emphasis on meeting your various legal obligations, but is what you are contemplating even legal and vetted? I would talk to a lawyer to make sure you operate within legal bounds and that your evidence will be admissible in court.

It's the kind of thing the NSA does, and that's how one of their employees was busted. But classified intelligence enjoys certain protections, and I doubt you are handling data that would qualify as such.

This could backfire and create a PR disaster for your firm when people find out you are tracking them silently and without their consent. There is a chance that someone will find discrepancies between two versions of the same document (all it takes is two versions downloadable online). Even if you manage to keep the file size identical, the fingerprints will be different (as easy as running the md5sum command).

I am not sure the question belongs here; it's more a legal question than a technical one. Certainly, each PDF can be personalized, and there are tools such as pdftk for automation. Pretty much everything can be automated. The sending can and should be automated too.

Somebody who is paranoid enough will copy and paste, take screenshots, or print to PDF and defeat the built-in protections. So it's not certain that you are going to catch anybody. On the other hand, somebody could catch you doing things you'd rather not expose. So maybe the downside is too big here. Trying to use technology to solve human problems doesn't work very well.

Kate
  • I guess my question has two aspects: 1) the technical side of how to catch the leaker, and 2) the legal/moral question "is it OK to do something sneaky to catch someone doing something illegal". Personally, I'm happy I already know the answer to 2. – ConanTheGerbil Mar 14 '20 at 10:03
0

You can use an MD5 hash changer (such tools are publicly available on sites like GitHub) to create one file per employee.

You can make a list of which file was sent to whom and compare the MD5 of the leaked file against this list. The change is invisible to users, so rather than trying to copy/paste the content they may well leak the whole document as-is.
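
The list-and-compare step could be sketched as follows. This is not any particular hash-changer tool: the sketch simply appends a few trailing bytes to each copy, which is an assumption that only works if the file format and its readers tolerate trailing data, so check that with the real document first. All function names are illustrative.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def make_unique_copy(original: bytes, recipient_index: int) -> bytes:
    """Append a few bytes so that every copy has a different MD5.

    Assumption: the file format tolerates trailing data after its end marker.
    """
    return original + b"\n" + str(recipient_index).encode("ascii") + b"\n"


def build_table(
    original: bytes, recipients: List[str]
) -> Tuple[Dict[str, str], Dict[str, bytes]]:
    """Return (MD5 hex digest -> recipient, recipient -> file bytes)."""
    table: Dict[str, str] = {}
    copies: Dict[str, bytes] = {}
    for index, recipient in enumerate(recipients):
        copy = make_unique_copy(original, index)
        table[hashlib.md5(copy).hexdigest()] = recipient
        copies[recipient] = copy
    return table, copies


def identify(leaked: bytes, table: Dict[str, str]) -> Optional[str]:
    """Look up the leaked file's MD5 in the table; None if unknown."""
    return table.get(hashlib.md5(leaked).hexdigest())
```

This only identifies a leak when the file is passed on byte-for-byte. Re-saving, printing to PDF, or copying the text out produces a different hash and no match.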

schroeder