58

A "soon to enter beta" online backup service, Bitcasa, claims to have both de-duplication (you don't backup something already in the cloud) and client side encryption.

http://techcrunch.com/2011/09/12/with-bitcasa-the-entire-cloud-is-your-hard-drive-for-only-10-per-month/

A patent search yields nothing under their company name, but the patents may well be in the pipeline and not granted yet.

I find the claim pretty dubious given the level of information I have now; does anyone know more about how they claim to achieve that? Had the founders of the company not had serious business backgrounds (Verisign, Mastercard...), I would have classified the product as snake oil right away, but maybe there is more to it.

Edit: found a worrying tweet: https://twitter.com/#!/csoghoian/status/113753932400041984. The encryption key for each file would be derived from its hash, so this is definitely not looking like the place to store your torrented film collection; not that I would ever do that.

Edit 2: We actually guessed right: they use so-called convergent encryption, and thus someone owning the same file as you do can tell whether yours is the same, since they have the key. This makes Bitcasa a very bad choice when the files you want to keep confidential are not original. http://techcrunch.com/2011/09/18/bitcasa-explains-encryption/

Edit 3: https://crypto.stackexchange.com/questions/729/is-convergent-encryption-really-secure has the same question and different answers.

Bruno Rohée
  • 5,221
  • 28
  • 39
  • 4
    I wonder if this deduplication feature could be used to identify individuals that had uploaded the same data? I think that would be a privacy issue. – MToecker Sep 14 '11 at 17:24
  • 2
    @MToecker: In this case, the purpose of the encryption is strictly to conceal the data. It is not intended to provide privacy. (That does have to be made clear to users, of course!) – David Schwartz Sep 20 '11 at 10:32
  • You could ask the developers at [Conformal Systems](https://opensource.conformal.com/) about their [Cyphertite](https://www.cyphertite.com/) project? –  Oct 13 '11 at 21:09

8 Answers

27

I haven't thought through the details, but if a secure hash of the file content were used as the key then any (and only) clients who "knew the hash" would be able to access the content.

Essentially the cloud storage would act as a collective partial (very sparse, in fact) rainbow table for the hashing function, allowing it to be "reversed".

From the article: "Even if the RIAA and MPAA came knocking on Bitcasa’s doors, subpoenas in hand, all Bitcasa would have is a collection of encrypted bits with no means to decrypt them." -- true, because Bitcasa doesn't hold the objectid/filename-to-hash/key mapping; only their clients do (client-side). If the RIAA/MPAA knew the hashes of the files in question (well known for e.g. specific song MP3s) they'd be able to decrypt them and prove you had a copy, but first they'd need to know which cloud-storage object/file held which song.

Clients would need to keep the hash for each cloud-stored object, and their local name for it, of course, to be able to access and decrypt it.
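
Bitcasa hasn't published implementation details, so the following is only a minimal sketch of the hash-as-key idea, assuming SHA-256 for the hash and a deterministic AES-GCM construction (both my assumptions, not anything Bitcasa has confirmed):

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(plaintext: bytes) -> tuple[bytes, bytes]:
    """Encrypt a file under a key derived from its own content."""
    key = hashlib.sha256(plaintext).digest()          # Kf = h(f)
    # Deriving the nonce from the key keeps encryption deterministic;
    # since the key is unique per plaintext, the (key, nonce) pair is
    # only ever reused on identical files, which is exactly the point.
    nonce = hashlib.sha256(b"nonce" + key).digest()[:12]
    return key, AESGCM(key).encrypt(nonce, plaintext, None)

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    nonce = hashlib.sha256(b"nonce" + key).digest()[:12]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

# Two independent users encrypting the same file produce identical
# ciphertexts, so the server can de-duplicate without holding any key.
k1, c1 = convergent_encrypt(b"the same song, bit for bit")
k2, c2 = convergent_encrypt(b"the same song, bit for bit")
assert k1 == k2 and c1 == c2
```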

Regarding some of the other features claimed in the article:

  • "compression" -- wouldn't work server-side (the encrypted content will not compress well) but could be applied client-side before encryption
  • "accessible anywhere" -- if the objid-to-filename-and-hash/key mapping is only on the client then the files are useless from other devices, which limits the usefulness of cloud storage. Could be solved by e.g. also storing the collection of objid-to-filename-and-hash/key tuples, client-side encrypted with a passphrase.
  • "patented de-duplication algorithms" -- there must be more going on than the above to justify a patent -- possibly de-duplication at a block, rather than file level?
  • the RIAA/MPAA would be able to come with a subpoena and an encrypted-with-its-own-hash copy of whatever song/movie they suspect people have copies of. Bitcasa would then be able to confirm whether or not that file had been stored. They wouldn't be able to decrypt it (without the RIAA/MPAA giving them the hash/key), and (particularly if they aren't enforcing per-user quotas because they offer "infinite storage") they might not have retained logs of which users uploaded/downloaded it. However, I suspect they could be required to remove the file (under DMCA safe harbour rules) or possibly to retain the content but then log any accounts which upload/download it in the future.
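
To illustrate that last point, here is a hypothetical sketch of the confirmation step, reusing convergent_encrypt from the sketch above; the object_id scheme is my own guess, not Bitcasa's documented behaviour:

```python
import hashlib

def object_id(ciphertext: bytes) -> str:
    # Assume the server indexes stored blobs by a hash of the ciphertext.
    return hashlib.sha256(ciphertext).hexdigest()

# RIAA side: owning the plaintext is enough to recompute Kf, the
# ciphertext, and therefore the object ID, with no help from Bitcasa.
song = b"...bytes of a well-known MP3..."
_, suspect_ciphertext = convergent_encrypt(song)

# Provider side: answering "has anyone stored this exact file?" is
# then a simple membership test against the object index.
stored_objects = {object_id(suspect_ciphertext)}   # toy server index
print(object_id(suspect_ciphertext) in stored_objects)   # True: confirmed
```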
Misha
  • 2,699
  • 2
  • 19
  • 17
  • 3
    It seems like it would be easy to dodge the RIAA's known hash of an MP3 by simply setting an ID3 tag to a long random string. A similar non-operational modification to movie files would hamper efforts by the MPAA. – bstpierre Oct 25 '11 at 12:27
  • Deduplication isn't likely to happen at the file level, but rather on blocks of a selected size, so hashing and deduplication probably couldn't be used to obtain very useful information about specific files. – deed02392 Feb 18 '14 at 17:25
23

The commercial ad you link to, and the company web site, are really short on information; and waving "20 patents" as a proof of competence is weird: patents do not prove that the technology is good, only that there are some people who staked a few thousand dollars on the idea that the technology will sell well.

Let's see if there is a way to make these promises come true.

If data is encrypted client-side, then there must be a secret key Kf for that file. The point of the thing is that Bitcasa does not know Kf. To implement de-duplication and caching and, more importantly, sharing, it is necessary that every user encrypting a given file f will end up using the same Kf. There is a nifty trick which consists in using the hash of the file itself, with a proper hash function (say, SHA-256), as Kf. With this trick, the same file will always end up in the same encrypted form, which can then be uploaded and de-duplicated at will.

Then a user would have a local store (on his computer) of all the Kf for all his files, along with a file ID. When user A wants to share the file with user B, user A "right clicks to get the sharing URL" and sends it to B. Presumably, the URL contains the file ID and Kf. The text says that both users A and B must be registered users for the sharing to work, so the "URL" is probably intercepted, on B's machine, by some software which extracts the ID and Kf from that "URL", downloads the file from the server, and decrypts it locally with its newly acquired knowledge of Kf.
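
As a guess at what such a "URL" could look like (the format here is entirely hypothetical), the file ID could go in the path and Kf in the fragment, which a browser never sends to the server:

```python
import base64

def make_share_url(file_id: str, kf: bytes) -> str:
    key_b64 = base64.urlsafe_b64encode(kf).rstrip(b"=").decode()
    # The fragment (after '#') stays on the client, so even a web-based
    # sharing flow need not reveal Kf to the storage servers.
    return f"https://bitcasa.example/share/{file_id}#{key_b64}"

def parse_share_url(url: str) -> tuple[str, bytes]:
    path, _, key_b64 = url.partition("#")
    file_id = path.rsplit("/", 1)[1]
    kf = base64.urlsafe_b64decode(key_b64 + "=" * (-len(key_b64) % 4))
    return file_id, kf
```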

For some extra resilience and usability, the set of known keys Kf for some user could be stored on the servers, too -- so you just need to "remember" a single Kf key, which you could transfer from one computer to another.

So I say that what Bitcasa promises is possible -- since I would know how to do it, and there is nothing really new or technologically advanced here. I cannot claim that this is what Bitcasa does, only that this is how I would do it. The "hard" part is integrating that in existing operating systems (so that "saving a file" triggers the encryption/upload process): some work, but hardly worth a patent, let alone 20 patents.

Note that using Kf = h(f) means that you can try an exhaustive search on the file contents. This is unavoidable anyway in a service with de-duplication: by "uploading" a new file and just timing the operation, you can know whether the file was already known server-side or not.
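
A minimal sketch of that exhaustive search, under the same assumptions as above (SHA-256 and a deterministic AES-GCM construction; Bitcasa's real cipher choices are unknown): anyone who can see an encrypted blob, the storage provider included, can test guesses about its plaintext offline.

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_with_own_hash(f: bytes) -> bytes:
    kf = hashlib.sha256(f).digest()                   # Kf = h(f)
    nonce = hashlib.sha256(b"nonce" + kf).digest()[:12]
    return AESGCM(kf).encrypt(nonce, f, None)

def test_guesses(observed_blob: bytes, guesses) -> bytes | None:
    # Whoever holds the ciphertext can confirm any guessable plaintext:
    # encrypt each candidate and compare for an exact match.
    for g in guesses:
        if encrypt_with_own_hash(g) == observed_blob:
            return g
    return None
```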

Thomas Pornin
  • 320,799
  • 57
  • 780
  • 949
  • Ain't TechCrunch a pinnacle of fair and ethical reporting ;-) – Bruno Rohée Sep 14 '11 at 12:55
  • 1
    If the technology functioned as you described, would that mean that if your hard drive crashed you wouldn't be able to recover your files from the cloud, as the originals (and probably the keys too) would have been lost to you? If this is the case it would make the service useless as backup, correct? – Joshua Carmody Sep 19 '11 at 16:01
  • 2
    @Joshua: well, with crypto you always have to start at _something_. If the servers stored everything in such a way that your data could be recovered even if you did not remember anything at all, then the system would not be secure against the servers themselves. What could be done is to store all the _Kf_ in a file, and then just "remember" the _Kf_ for that file -- possibly, encrypt it with a password, or write it down on a paper which you store in a safe. With crypto you can begin at a single, small key, which can be stored with low-tech tools. – Thomas Pornin Sep 19 '11 at 16:06
16

Bruce Schneier touched on the subject in May (http://www.schneier.com/blog/archives/2011/05/dropbox_securit.html), in relation to that week's Dropbox problem. TechRepublic offers a great 7-page white paper on the subject, for the price of an e-mail sign-up, at http://www.techrepublic.com/whitepapers/side-channels-in-cloud-services-the-case-of-deduplication-in-cloud-storage/3333347.

The paper focuses on the side-channel and covert-channel attacks available in cloud deduplication. The attacks leverage cross-user deduplication. For example, if you knew Bob was using the service and his template-built salary contract was up there, you could craft versions of it until you hit his salary, with success indicated by the time the file took to upload.
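
As a rough sketch of that attack (upload() here is a stand-in I'm inventing for the service's client API; a real attack would also need many samples, since network jitter dwarfs any single measurement):

```python
import time

def upload(blob: bytes) -> None:
    """Placeholder for the storage service's client upload call."""
    raise NotImplementedError

def is_deduplicated(blob: bytes, threshold_s: float = 0.5) -> bool:
    start = time.monotonic()
    upload(blob)
    # A fast "upload" suggests the server already had the blob and
    # skipped the transfer -- the side channel the paper describes.
    return time.monotonic() - start < threshold_s

def guess_salary(template: bytes):
    for salary in range(40_000, 200_000, 1_000):
        contract = template.replace(b"{SALARY}", str(salary).encode())
        if is_deduplicated(contract):
            return salary    # someone (Bob) already stored this version
    return None
```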

Of course, your protection is to encrypt prior to using the service. That would, however, eliminate almost all deduplication opportunities, and with them the cost savings that make the service economically viable, so the service is unlikely to encourage that choice.



zedman9991
  • 3,377
  • 15
  • 22
9

In addition to the other good answers here, I'd like to point you to the following two academic papers, which were published recently:

  • Martin Mulazzani, Sebastian Schrittwieser, Manuel Leithner, Markus Huber, and Edgar Weippl, Dark Clouds on the Horizon: Using Cloud Storage as Attack Vector and Online Slack Space, Usenix Security 2011.

    This paper describes how Dropbox does de-duplication and identifies attacks on the mechanism. They propose a novel way to defend against some -- but not all -- of these attacks, based upon requiring the client to prove that it knows the contents of the file (not just its hash) before it is allowed to access the file (a simplified sketch of this idea appears at the end of this answer).

  • Danny Harnik, Benny Pinkas, Alexandra Shulman-Peleg. Side channels in cloud services, the case of deduplication in cloud storage, IEEE Security & Privacy Magazine.

    This paper analyzes three cloud storage services that perform de-duplication (Dropbox, Mozy, and Memopal), and points out the consequent security and privacy risks. They propose a novel defense against these risks, based upon ensuring that a file is de-duplicated only if there are many copies of it, thus reducing the information leakage.

These papers seem directly relevant to your question. They also demonstrate that there is room for innovation on non-trivial mitigations for the risks of naive de-duplication.
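
For the curious, here is a much-simplified sketch of the proof-of-ownership idea from the first paper (the challenge format is my own simplification, not the paper's exact protocol): the server demands hashes over randomly chosen byte ranges, which a client holding only the file's hash cannot produce.

```python
import hashlib
import secrets

CHUNK = 64   # bytes per challenged range

def make_challenge(file_len: int, n: int = 8) -> list[int]:
    # Server picks n random offsets into the file.
    return [secrets.randbelow(max(1, file_len - CHUNK)) for _ in range(n)]

def respond(data: bytes, offsets: list[int]) -> bytes:
    # Client proves knowledge of the content by hashing the ranges.
    h = hashlib.sha256()
    for off in offsets:
        h.update(data[off:off + CHUNK])
    return h.digest()

def verify(data: bytes, offsets: list[int], response: bytes) -> bool:
    # Server holds the file too, so it can recompute and compare.
    return secrets.compare_digest(respond(data, offsets), response)
```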

D.W.
  • 98,420
  • 30
  • 267
  • 572
6

Encryption and de-duplication between arbitrary users are not compatible if you are concerned about an attacker being able to distinguish certain plaintexts. If you are not concerned about these types of attacks, then it can be safe.

If the data is only de-duplicated for a certain user, the server doesn't know anything about the equivalence of plaintexts and the attacks that remain are really minor.

If the data is de-duplicated between a circle of friends that share something that isn't known to the service provider (doable automatically), only people from that circle of friends can distinguish plaintexts (via timing etc.).

But if the data is de-duplicated between all users, then all a hypothetical attacker who wishes to know which plaintexts are accessed needs to do is store the file in the cloud themselves and then monitor which user accounts access the same data. Sure, the service can just "not log" the user accounts / IP addresses accessing the data - but then that has nothing to do with encryption, and the same "protection" would remain even if the files were plaintext.

None of the other answers given here seem to propose anything that would stop this attack and I believe Bitcasa does not either. I would be glad to be proven wrong though.

(Note: there are some ways to possibly achieve something close to this - quite a few papers have been published about secure cloud storage using all sorts of innovative techniques - but this is new research and most of it will probably be broken or shown infeasible rather quickly. I wouldn't trust my data to any of them yet.)

Nakedible
  • 4,501
  • 4
  • 25
  • 22
  • 1
    To this I can only add that the MPAA and RIAA will most likely just get a court order/law forcing Bitcasa to implement a mechanism enabling the two organizations to get a list of users having certain content. So the problem is not even technical. – Franci Penov Sep 15 '11 at 00:58
5

The same question was asked at the cryptography stack exchange. Please see my answer there, as there is a subtlety that is easy to overlook and that has been carefully analyzed by the Tahoe-LAFS open source project: https://crypto.stackexchange.com/questions/729/is-convergent-encryption-really-secure/758#758

Zooko
  • 151
  • 2
  • 1
    Can you expand just a little here - couple of bullet points on the subtlety you mention would help users. – Rory Alsop Sep 28 '11 at 14:23
  • 1
    There are two possible attacks. The first one, which we call the "confirmation of a file attack" is the obvious problem that deduplication exposes the fact that the two things were the same as each other. This issue was immediately appreciated and discussed when convergent encryption was first proposed (not under that name) on the cypherpunks mailing list in 1996. (Before Microsoft applied for a patent on convergent encryption, so the cypherpunks discussion is prior art that invalidates the Microsoft patent.) – Zooko Oct 01 '11 at 05:19
  • 1
    The second attack, which we call "learn the remaining information", is not so obvious, and as far as I know nobody was aware of this attack until 2008 when Drew Perttula and Brian Warner developed it as an attack against the Tahoe-LAFS secure filesystem. In the "learn the remaining information" attack, the attacker can make guesses about a few secret, random, unknown parts of a larger file and then find out if one of their guesses is correct. Please see the write-up at: http://tahoe-lafs.org/hacktahoelafs/drew_perttula.html – Zooko Oct 01 '11 at 05:21
2

Aside from the great answer @Misha just posted on the 'known hash', client-side encryption effectively removes any other way to do de-duplication unless there is an escrow key, which would potentially cause other logistical issues anyway.

Rory Alsop
  • 61,367
  • 12
  • 115
  • 320
  • 1
    I don't believe that is correct. Metadata is one side channel that can provide a deduplication avenue. Just look at the filesize of all your documents. Prior to easily available hashing, filesize was a frequently used metric for duplication detection. – this.josh Sep 16 '11 at 07:36
  • I didn't realise it was actually used. Anyone still doing it? Way too easy to spoof whatever file you want, surely? – Rory Alsop Sep 16 '11 at 17:03
  • 1
    It was used in Windows (3.1 and 95) shareware programs to look for duplicate files (when the filename wasn't enough). I don't think anyone is using that technique explicitly now, but size is an important protection against appending data to bring a modified file back to a target hash value. For the average home user, it used to be that they only had a few documents and they were usually different sizes. The massive amount of data the average consumer now has, along with hordes of nearly identically sized files (pictures), makes file size a poor indicator. – this.josh Sep 16 '11 at 19:10
-1

You're totally right! Using just convergent encryption is not a good choice, even for non-original files: https://tahoe-lafs.org/hacktahoelafs/drew_perttula.html. Fortunately, it looks like there is a solution that combines encryption and deduplication. It's called ClouDedup: http://elastic-security.com/2013/12/10/cloudedup-secure-deduplication/

pAkY88
  • 99
  • 1