Suppose you want to upload some files to an online storage service without the storage provider figuring out what you have uploaded.
The obvious thing to do is, of course, to encrypt the files. However, we still suffer from a confirmation-of-a-file attack: the content of publicly available files can be guessed from their sizes.
To defeat this, I think we have to pad the files with zeros before encrypting. Is there a FLOSS tool or known scheme for doing this?
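To make the idea concrete, here is a minimal sketch of the kind of padding I have in mind. It is not an established scheme: the power-of-two bucket sizes, the 4096-byte minimum, and the 8-byte length prefix (needed so that trailing zeros in the original file survive unpadding) are all my own assumptions, and the names `bucket_size`, `pad`, and `unpad` are hypothetical.

```python
import struct

# Assumed format: 8-byte big-endian length prefix, then data, then zeros.
HEADER = struct.Struct(">Q")

def bucket_size(n: int, minimum: int = 4096) -> int:
    """Smallest value of the form minimum * 2**k that is >= n."""
    size = minimum
    while size < n:
        size *= 2
    return size

def pad(data: bytes) -> bytes:
    """Pad data with zeros so that only its size bucket is revealed."""
    total = bucket_size(HEADER.size + len(data))
    filler = b"\x00" * (total - HEADER.size - len(data))
    return HEADER.pack(len(data)) + data + filler

def unpad(padded: bytes) -> bytes:
    """Recover the original data using the stored length."""
    (length,) = HEADER.unpack_from(padded)
    return padded[HEADER.size:HEADER.size + length]
```

The padded output would then be encrypted as usual. With power-of-two buckets the padded file can be almost twice the original size, so finer buckets trade storage for less hiding.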
Edit: Thanks for your responses, @Dissimilis and @MikeGoodwin. However, I am not convinced that it is a non-issue. I am happy to be proven wrong, though. Here is an example illustrating my point. Suppose you live in a place where there are banned books and documentaries. An officer can force you to decrypt your files if they have reasonable suspicion. To guess whether you have banned books and documentaries, they can run the following algorithm:
- Download the list of banned books and documentaries; call them b1, b2, ..., bn
- Compress these files to get compress(b1), compress(b2), ..., compress(bn)
- Compute the sizes of the above files to get size(b1), size(b2), ..., size(bn), size(compress(b1)), size(compress(b2)), ..., size(compress(bn))
- Check whether any of your files has a size close to one of the above sizes. If there is a match, then it is reasonable to suspect that you have banned books and documentaries.
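The steps above can be sketched roughly as follows. This is only an illustration: `zlib` stands in for whatever compressor the officer might try, the 64-byte `tolerance` is an arbitrary assumption for "close", and the function names are hypothetical.

```python
import zlib

def candidate_sizes(banned: dict) -> dict:
    """Map each plausible size to a label, for raw and compressed files."""
    sizes = {}
    for name, content in banned.items():
        sizes[len(content)] = name
        sizes[len(zlib.compress(content))] = name + " (compressed)"
    return sizes

def size_matches(uploaded_sizes, banned: dict, tolerance: int = 64):
    """Return (uploaded size, label) pairs whose sizes are within tolerance."""
    targets = candidate_sizes(banned)
    hits = []
    for size in uploaded_sizes:
        for target, label in targets.items():
            if abs(size - target) <= tolerance:
                hits.append((size, label))
    return hits
```

Note that the attacker only needs the sizes of the uploaded files, never their contents.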
Of course, compression in step 2 is only an example. One can also apply other common operations that transform the file, although this will make the constant factor of the algorithm larger.
One may argue that this algorithm will yield too many false positives to be useful. However, there are two reasons to believe this can be fixed. Firstly, we only need the probability P(really have banned books and documentaries | positive) > 0.5, and we can always improve this probability by raising the number of matches needed to trigger a positive.
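As a rough illustration of why requiring more matches helps, here is a toy Bayes calculation. The model is entirely my assumption: a prior probability that someone holds banned material, a fixed chance that an innocent file matches a banned size by coincidence, coincidental matches treated as independent, and a real holder assumed to always produce the required matches.

```python
def posterior(prior: float, false_match_rate: float, matches: int) -> float:
    """P(has banned files | `matches` size matches) under the toy model."""
    # Probability that an innocent user shows this many matches by chance.
    chance = false_match_rate ** matches
    return prior / (prior + (1 - prior) * chance)

# Example with made-up numbers: prior 1%, coincidence rate 10% per match.
one_match = posterior(0.01, 0.1, 1)     # below 0.5
three_matches = posterior(0.01, 0.1, 3) # above 0.5
```

The numbers are invented; the point is only that the posterior rises quickly as the required number of matches grows.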
Secondly, there are far more small files than large files. So the algorithm may not work for small files (basically, too many small files have similar sizes). But for a large enough file, there is basically no other file of nearly the same size, so we can detect it more reliably.
Finally, encryption changes the size of a file only by a small amount. So the above algorithm still works.