How to defeat confirmation-of-a-file attack?

Question

Suppose you want to upload some files to an online storage service without the storage provider figuring out what you have uploaded.

The obvious thing to do is of course to encrypt the files. However, we still suffer from a confirmation-of-a-file attack: the content of a publicly available file can be guessed by looking at its size.

To defeat this, I think we have to pad the files with zeros before encrypting. Is there FLOSS software or a known scheme for doing this?
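To make the padding idea concrete, here is a minimal sketch of what I have in mind. It is not an existing tool or standard scheme. The power-of-two buckets, the 8-byte length header, and the names `pad_to_bucket` and `unpad` are my own assumptions for illustration. The padded output would then be passed to whatever encryption tool you use.

```python
import struct

def pad_to_bucket(data: bytes, min_bucket: int = 1024) -> bytes:
    """Prefix the true length, then zero-pad up to a power-of-two bucket.

    Bucket sizes and the header layout are illustrative assumptions.
    """
    # 8-byte big-endian header so the original length can be recovered
    framed = struct.pack(">Q", len(data)) + data
    bucket = min_bucket
    while bucket < len(framed):
        bucket *= 2
    return framed + b"\x00" * (bucket - len(framed))

def unpad(padded: bytes) -> bytes:
    """Recover the original bytes from the output of pad_to_bucket."""
    (length,) = struct.unpack(">Q", padded[:8])
    return padded[8:8 + length]

if __name__ == "__main__":
    secret = b"some file content" * 100
    padded = pad_to_bucket(secret)
    print(len(secret), "->", len(padded))
    assert unpad(padded) == secret
```

With buckets like these, every file whose framed length falls between two powers of two has the same size on disk, at the cost of up to roughly doubling the storage used.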

Edit: Thanks for your responses, @Dissimilis and @MikeGoodwin. However, I am not convinced that it is a non-issue. I am happy to be shown wrong though. Here is an example illustrating my point. Suppose you live in a place where there are banned books and documentaries. An officer can force you to decrypt your files if they have reasonable suspicion. To guess if you have banned books and documentaries, they can run the following algorithm:

1. Download a list of banned books and documentaries; call them b₁, b₂, ..., b_n.
2. Compress these files to get compress(b₁), compress(b₂), ..., compress(b_n).
3. Compute the sizes of the above files to get size(b₁), size(b₂), ..., size(b_n), size(compress(b₁)), size(compress(b₂)), ..., size(compress(b_n)).
4. Check whether any of your files has a size close to one of the above sizes. If there is a match, then it is reasonable to suspect you have banned books and documentaries.
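The steps above could be sketched roughly as follows. This is only an illustration of the attack I am worried about. The function names, the use of `zlib` as the compressor, and the `tolerance` of 64 bytes are assumptions I made up for the example.

```python
import zlib

def candidate_sizes(known_files):
    """Map each plausible on-disk size to a label for the known file."""
    sizes = {}
    for name, content in known_files.items():
        sizes[len(content)] = name                               # raw size
        sizes[len(zlib.compress(content))] = name + " (compressed)"
    return sizes

def find_matches(uploaded_sizes, known_files, tolerance=64):
    """Return (uploaded_size, label) pairs whose sizes are within tolerance."""
    candidates = candidate_sizes(known_files)
    matches = []
    for size in uploaded_sizes:
        for cand_size, label in candidates.items():
            if abs(size - cand_size) <= tolerance:
                matches.append((size, label))
    return matches

if __name__ == "__main__":
    banned = {"book": b"x" * 10000}      # stand-in for a real banned file
    observed = [10016, 5000]             # sizes of encrypted uploads
    print(find_matches(observed, banned))
```

The `tolerance` parameter stands in for the small, roughly constant overhead that encryption adds to the file size.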

Of course, compression in step 2 is only an example. You could also apply other common operations that transform the file, although this makes the constant factor of the algorithm larger.

One may argue that this algorithm will yield too many false positives to be useful. However, there are two reasons to believe this can be fixed. Firstly, we only need the probability P(really have banned books and documentaries | positive) > 0.5, and we can always improve this probability by raising the number of matches needed to trigger a positive.
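To illustrate the first point, here is a rough Bayes calculation. It rests on strong simplifying assumptions of mine: each required match is treated as independent, and the prior, the per-file match rate for someone who has the files (`tpr`), and the per-file match rate for someone who does not (`fpr`) are made-up numbers, not measurements.

```python
def posterior(prior, tpr, fpr, k):
    """P(has banned files | k independent size matches), via Bayes' rule.

    prior: assumed fraction of users who hold banned files
    tpr:   assumed chance that a holder produces one size match
    fpr:   assumed chance that a non-holder produces one size match
    k:     number of matches required to trigger a positive
    """
    hit = prior * tpr ** k
    false_alarm = (1 - prior) * fpr ** k
    return hit / (hit + false_alarm)

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(posterior(prior=0.01, tpr=0.9, fpr=0.1, k=k), 3))
```

With these hypothetical numbers a single match leaves the posterior well below 0.5, while requiring three matches pushes it above 0.5. Whether real size collisions are anywhere near independent is exactly what is in dispute.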

Secondly, there are far more small files than large files, so the algorithm may not work for small files (too many of them share similar sizes). But for a large enough file, there are basically no other files of a similar size, so we can detect it more reliably.

Finally, encryption changes the size of a file only by a small, predictable amount, so the above algorithm still works.
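As a small worked example of how little the size changes, the ciphertext length for two common AES setups can be computed directly. I am assuming the IV or nonce is stored alongside the ciphertext, with a 16-byte IV for CBC and a 12-byte nonce plus 16-byte tag for GCM. Real tools add their own headers on top of this.

```python
AES_BLOCK = 16  # AES block size in bytes (128 bits)

def cbc_pkcs7_size(n, iv_len=16):
    """Ciphertext size for AES-CBC with PKCS#7 padding, IV prepended.

    PKCS#7 always adds between 1 and 16 bytes, so the plaintext is
    rounded up to the next full block even when already aligned.
    """
    return iv_len + (n // AES_BLOCK + 1) * AES_BLOCK

def gcm_size(n, nonce_len=12, tag_len=16):
    """Ciphertext size for AES-GCM: no padding, nonce and tag attached."""
    return nonce_len + n + tag_len

if __name__ == "__main__":
    for n in (100, 1_000_000, 3_000_000_000):
        print(n, cbc_pkcs7_size(n) - n, gcm_size(n) - n)
```

In both cases the overhead stays within a few dozen bytes regardless of file size, which is why a size comparison with a small tolerance survives encryption.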

If you are talking about encrypting files yourself and not about the broader concept, then this might be a non-issue, because your file will almost always sit in some kind of container (zip, 7z, PGP, etc.), and that container will have metadata and/or padding that makes size comparison infeasible. — Dissimilis, Apr 18 '19 at 10:16
How can the content of an encrypted file be guessed by looking at its size, unless the encryption scheme is flawed? — Mike Goodwin, Apr 18 '19 at 16:08
Short answer: encryption already handles the randomisation to prevent this sort of attack. — schroeder, Apr 18 '19 at 20:42
@schroeder I am still puzzled after reading the other question. Although encryption is not deterministic, my algorithm for detecting banned books and documentaries still works, right? Because there are simply too few files of the same size in the wild. So people can say: this file has size 3141516MiB, so it is probably the leaked video footage of war in XXX. Can you please elaborate? — anon, Apr 18 '19 at 20:58
I think you are assuming more of a scarcity mindset than is practical in the situation. A file of a known size *when encrypted in a particular way* is going to share a similar size with a very, very large number of other files in existence. Humans are adding new files at an alarming rate. — schroeder, Apr 18 '19 at 21:02
Download a bunch of regime-approved files (maybe the biography and 'thoughts' of the 'dear leader'?) and pad your 'bad' files (or pieces of them) to the same length as the 'good' files. Now they can't distinguish which you have encrypted and uploaded. — dave_thompson_085, Apr 19 '19 at 02:05
@anon Everyone is saying it is a non-issue because it is impractical. You correctly understand that the number of false positives would be too high. Encrypting with AES pads your plaintext up to a multiple of 128 bits, further increasing the rate of false positives. — Dissimilis, Apr 26 '19 at 13:41
@anon What you are asking is more along these lines: what if I upload an encrypted 1TB archive of recently stolen, highly confidential documents, and my storage provider is being monitored for uploads of files >1TB? No padding would help here, and this is probably a topic of plausible deniability. — Dissimilis, Apr 26 '19 at 13:51
