
I'm wondering if the following system would be a secure way to store frequently duplicated files.

  1. Hash the source file and deterministically derive a key from the hash.
    • The source file(s) are likely to be high entropy (PDFs, archives of multiple files, etc.)
  2. Use that key to encrypt the source file.
  3. Upload that encrypted file to any public network.
    • If the files are "encrypted", the storage provider does not have to be trusted.
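
A minimal sketch of the three steps above, assuming SHA-256 for the hash and AES-GCM for the cipher (the function names are just illustrative):

```python
import hashlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def convergent_encrypt(plaintext: bytes) -> tuple[bytes, bytes]:
    # Step 1: derive the key deterministically from the file itself.
    key = hashlib.sha256(plaintext).digest()
    # A fixed nonce is acceptable here because each key is only ever used
    # to encrypt the one plaintext it was derived from.
    nonce = b"\x00" * 12
    # Step 2: encrypt with the derived key.
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    # Step 3: the ciphertext can be uploaded anywhere; the key (the hash)
    # is what gets shared to grant access.
    return key, ciphertext


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return AESGCM(key).decrypt(b"\x00" * 12, ciphertext, None)
```

Because the whole pipeline is deterministic, two people encrypting the same file produce byte-identical ciphertexts, so a content-addressed store keeps only one copy.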

To access or share a file, the address of the encrypted file and the source hash would need to be shared. The advantage I'm going for is that even if two people have the same source file (which is likely in my use case), it only has to be stored once, saving storage costs.

My question, from a storage provider's standpoint: is this system secure? (Assuming good hash and key derivation functions are used.)

Edit: I'm not concerned about the key changing, since the storage will primarily be content-addressed (IPFS-like) and is meant for long-term storage rather than regular updates.

  • Imagine the cost of this! Every change in the file will result in expensive encryption operations and the key will constantly change – Limit Feb 16 '21 at 04:12
  • you may wish to read about [*TahoeLAFS*](https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/about.rst) – brynk Feb 16 '21 at 05:55
  • @Limit wouldn't the expensive encryption cost apply to any encrypted cloud storage? The key changing is a good point that I'll update the question about. – Karmanyaahm Feb 16 '21 at 06:09
  • @Karmanyaahm usually file systems work at block level and not file level. But otherwise, you are adding the cost of computing the key to the cost of encryption in general. – Limit Feb 16 '21 at 19:59

2 Answers


No, this is not secure in general: you cannot have both deduplication across other users' files and privacy. How much this matters depends on your exact threat model and security goals.

Using the hash of a file as the key to encrypt it is called convergent encryption, and it is a deterministic encryption method. Notably, there is no secret key that would add security to the ciphertext; any security comes solely from the entropy of the plaintext files.

It is clear that this scheme cannot offer, e.g., semantic security: an attacker can efficiently learn something from the ciphertext. In particular, it is trivial to verify whether one of your files matches a publicly known file. Furthermore, PDF documents such as form letters might look high-entropy but may actually differ only in a few fields, which can be brute-forced. A malicious service provider can also pre-compute encryptions of documents they are interested in.
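
To make the confirmation attack concrete, here is a minimal sketch, assuming the scheme is instantiated with SHA-256 and AES-GCM as in the question (the names are illustrative):

```python
import hashlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def confirm_guess(stored_ciphertext: bytes, candidate: bytes) -> bool:
    # Re-run the public, deterministic scheme on the guessed plaintext.
    key = hashlib.sha256(candidate).digest()
    guess = AESGCM(key).encrypt(b"\x00" * 12, candidate, None)
    # A byte-for-byte match confirms exactly which file was stored;
    # no secret is needed at any point.
    return guess == stored_ciphertext
```

Brute-forcing a form letter works the same way: enumerate the few plausible values of the variable fields and test each candidate.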

The solution is to stop using a deterministic encryption method. For example, adding a sufficient number of random bits per block/file would help, but would also destroy deduplication opportunities. Using real symmetric encryption with a secret key is another option, and would at least allow deduplication among holders of the same key, as sketched below.
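
A minimal sketch of that keyed variant, assuming HMAC-SHA256 as the keyed derivation; Tahoe-LAFS does something similar with what it calls a convergence secret, and the names here are illustrative:

```python
import hashlib
import hmac


def keyed_convergent_key(convergence_secret: bytes, plaintext: bytes) -> bytes:
    # Identical plaintexts still map to identical keys within the group
    # that shares the secret, so deduplication survives inside that group,
    # while outsiders can no longer precompute or confirm ciphertexts.
    return hmac.new(convergence_secret, plaintext, hashlib.sha256).digest()
```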

You claim that your plaintext has high entropy. That is a strong assumption. Note that if your plaintext really is high-entropy, then deduplication is unlikely and you might as well use a secure encryption method instead. In the context of this discussion, entropy must be understood not relative to all-zero bits but relative to the distribution of content known to other participants. If every user knows 1000 files, one of which is the same 2 GB movie, that file has only around 10 bits of security (or actually zero, since everyone can precompute a fingerprint of the convergently encrypted file). Since your scheme involves no secrets, it is impossible to distinguish legitimate users from malicious attackers.
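
(For the arithmetic: guessing among 1000 equally likely candidates takes at most 2^10 = 1024 trials, and log2(1000) ≈ 10, which is where the "around 10 bits" figure comes from.)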

amon
  • Your concerns are valid, though I think I didn't explain my use case well enough in the question. I'm building a public file sharing service, so knowing someone has a file is not a concern. I think I'll have to weigh the advantages of documents only differing in a few fields. – Karmanyaahm Feb 17 '21 at 05:44
  • `In the context of this discussion, entropy must not be understood relative to all-zero bits but relative to the distribution of content known to other participants` I meant deduplication as in multiple people uploading the same file to the network. I'm not sure what you mean by that part about entropy. – Karmanyaahm Feb 17 '21 at 05:45
  • @Karmanyaahm I understand your scenario. Convergent encryption isn't necessarily useless, but you must decide which deduplication vs privacy tradeoff is appropriate for your use case. In the last paragraph, I take a jab at explaining the *security level* of this scheme, a measure of how hard the crypto is to crack. For symmetric encryption like AES, the security level is usually equal to the number of bits in the key (e.g. 256). A 2GB file would have an insane number of bits, but the resulting convergent encryption may only have negligible security, depending on what the attacker knows. – amon Feb 17 '21 at 10:14

@brynk suggested looking at TahoeLAFS, and apart from the obvious confirmation-of-a-file caveat, it seems to show that encryption as described in the question is workable and can be secure.

(If anyone knows otherwise, please post an answer.)

Edit: Reading more of the TahoeLAFS docs, it seems convergent encryption is relatively common; I just didn't know it was called that.