Possible Duplicate:
Online backup : how could encryption and de-duplication be compatible?
I just saw an article on Techcrunch wherein a startup company was describing their new cloud storage service. They've claimed (in a previous article) that everything is encrypted client-side with a key that only the user has, and the people running the service couldn't access your files even if they tried. From the second link:
It doesn’t see the file’s title or know its contents. It doesn’t know who wrote the file. And because the data is encrypted on the client side, Bitcasa doesn’t even know what it’s storing. So if you want to cloud-enable your 80 GB collection of MP3′s or a terabyte of movies (acquired mainly through torrenting, naughty you!), go ahead. Even if the RIAA and MPAA came knocking on Bitcasa’s doors, subpoenas in hand, all Bitcasa would have is a collection of encrypted bits with no means to decrypt them.
They also claim that their service does data de-duplication on the server side so that a user's files don't need to be uploaded if they're identical to the files of another user. From the second link, paraphrased slightly:
How on earth is it so cheap? The fact is, 60% of your data is duplicate. If you have an MP3 file, someone else probably has the same one, for example. Each person only tends to have around 25 GB of unique, personal data, he says. Using patented de-duplication algorithms, compression techniques and encryption, Bitcasa keeps costs down.
These two claims would appear to be contradictory. If you can't access the file's contents without my key, how could you provide the contents of the file to another user and allow them to decrypt it? It would seem to me that if anybody besides me could decrypt the data, then they can't have the security level they claim to have. And if nobody besides me can decrypt the data, then they can't use my data as a backup for another user who has identical files, because that user wouldn't be able to get at the data.
Others have expressed the same concerns, and this is what the company had to say about the question (from the first link I gave, paraphrased slightly):
Q: What do you do in terms of encryption or security?
A: We encrypt everything on the client side. We use AES-256 hash, SHA-256 hashing for all the data.
Q: So it’s encrypted all on the client side and you can’t look at it on the server side?
A: Exactly.
Q: So if I upload a file and Marissa uploads the same file, do you store two different copies of that or one?
A: No, we do de-duplication on the server side. So we actually determine on the server side if it’s there, and if it’s already there, we don’t have to upload it again.
Q: But how do you do that…if it’s encrypted and you don’t have the key?
A: There’s an academic paper called Convergent Encryption. This is actually something that’s been known for many years in the encryption community. But what we actually do is…we don’t encrypt it in the way that you think we’re doing it….There’s other ways to do it.
Q: I think the audience would like to know a little more about that…what does that mean?
A: OK, so convergent encryption….what happens is when you encrypt data, I have a key and you have a key. And let’s say that these are completely different. Let’s say that we both have the exact same file. I encrypt it with my key and you encrypt it with your key. Now the data looks completely different because the encryption keys are different. Well, what happens if you actually derive the key from the data itself? Now we both have the exact same encryption key and we can de-dupe on the server side.
Frankly, this doesn't make sense to me. Regardless of the details, they can't provide my data to another user in a form that's useful to them unless there's some way of decrypting it other than with the key I used. And if such a method exists, then they can't claim that the service operators are unable to access the data.
However, I will admit there are many things I don't know about the world of cryptography. Is it possible that the service really does work the way they claim? Does this "convergent encryption" exist, and does it somehow allow user-specific private encryption keys and server-side de-duplication at the same time? Or is this company making grandiose claims that they can't back up?
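In case it helps anyone answering, here is my literal reading of the "derive the key from the data itself" step, as a minimal sketch. Everything in it is my own assumption rather than anything the company has published: the `Server` class and the function names are hypothetical, and the SHA-256-based keystream is only a toy stand-in for AES-256 so that the example runs with nothing but the Python standard library. It is not meant for real use.

```python
import hashlib


def derive_key(plaintext: bytes) -> bytes:
    # The "convergent" step as I understand it: the key is a hash of the file itself.
    return hashlib.sha256(plaintext).digest()


def keystream(key: bytes, length: int) -> bytes:
    # Toy keystream (SHA-256 over key + counter). Stand-in for AES-256, not secure.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, data: bytes) -> bytes:
    # XOR with the keystream; the same call both encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


class Server:
    """Hypothetical server: it only ever sees ciphertext, indexed by its hash."""

    def __init__(self) -> None:
        self.blobs = {}

    def upload(self, ciphertext: bytes) -> bool:
        blob_id = hashlib.sha256(ciphertext).hexdigest()
        if blob_id in self.blobs:
            return False  # identical ciphertext already stored: de-duplicated
        self.blobs[blob_id] = ciphertext
        return True


def client_upload(server: Server, plaintext: bytes):
    # Encryption happens on the client; the key never leaves it.
    key = derive_key(plaintext)
    ciphertext = xor_cipher(key, plaintext)
    stored_new = server.upload(ciphertext)
    return key, hashlib.sha256(ciphertext).hexdigest(), stored_new


server = Server()
song = b"pretend these are the bytes of an MP3"

my_key, my_id, i_stored = client_upload(server, song)
her_key, her_id, she_stored = client_upload(server, song)

# The second user's key opens the blob that the first user uploaded.
recovered = xor_cipher(her_key, server.blobs[her_id])
```

If I've read the description correctly, both uploads produce the same key and the same ciphertext, so the server keeps a single blob without holding any key. But it also means that anyone who already has a copy of the file can derive that key, which is exactly the part I can't square with their security claims.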