Are multiple encrypted containers with the same passphrase containing the same files a cryptographic risk?

Question

Situation as follows:

Let's assume two (or more) containers, encrypted using the same passphrase.
They will contain the same files. So their content is identical.
However, the containers themselves aren't identical files - they're not duplicated, but individually created from scratch and then filled with content.
The attacker will have access to both (or more) encrypted container files.
The attacker will not know the content itself.
However, the attacker will know that the content is identical in each container.

Does this decrease overall cryptographic security of the container?

I feel like there is a vulnerability but I don't know how to search further.

(It's a general question, the container might be from dm-crypt/LUKS or VeraCrypt or even an encrypted ZIP archive.)

It makes a difference whether you created multiple containers from scratch and provided the same password for creation of each of them, or if you just created one container and at some point made a copy of the container. — kasperd, Dec 02 '18 at 00:28

score 2 · Answer 1 · answered Dec 20 '18 at 12:07

Most modern (symmetric and public key) encryption schemes are considered to have (at least) IND-CPA security. This is 'indistinguishability under a chosen plaintext attack' and is defined by a cryptographic game.

In the game, after the encryption scheme is set up, the adversary chooses two plaintext messages (m0, m1) and learns the ciphertext of one of them, mb, for a randomly chosen bit b. The adversary can repeat this procedure multiple times: after receiving the ciphertext for one of their plaintexts, they can choose two more plaintext messages (m0' and m1') and receive the ciphertext of mb', with the same b used each time. After receiving a computationally bounded number of ciphertexts, the adversary must guess the value of b. For the scheme to be IND-CPA secure, the probability of the adversary guessing b correctly must be no more than negligibly larger than 1/2.

This game applies to your setting because the adversary is not required to choose distinct messages in each request. Thus, even if m0 remains constant across requests, in a scheme which is IND-CPA secure the adversary gains no meaningful advantage in guessing b.
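To make the game concrete, here is a minimal simulation, assuming a toy stream cipher built from SHAKE-256 (an illustration only, not a vetted cipher and not what any of the tools in the question use). The function names and the adversary's strategy are invented for this sketch. The adversary first obtains an encryption of m0, then compares it with the challenge ciphertext.

```python
import hashlib
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream: SHAKE-256 over key || nonce. Illustration only.
    return hashlib.shake_256(key + nonce).digest(length)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_deterministic(key: bytes, msg: bytes) -> bytes:
    # Fixed nonce: the same plaintext always yields the same ciphertext.
    return xor(msg, keystream(key, b"\x00" * 16, len(msg)))


def encrypt_randomized(key: bytes, msg: bytes) -> bytes:
    # Fresh random nonce per message, stored in front of the ciphertext.
    nonce = secrets.token_bytes(16)
    return nonce + xor(msg, keystream(key, nonce, len(msg)))


def play_round(encrypt) -> bool:
    """One round of the game; returns True if the adversary guesses b."""
    key = secrets.token_bytes(32)
    m0, m1 = b"A" * 32, b"B" * 32
    b = secrets.randbelow(2)
    # Query 1: the adversary submits (m0, m0), so it gets an encryption of m0.
    reference = encrypt(key, m0)
    # Query 2: the adversary submits (m0, m1) and gets an encryption of mb.
    challenge = encrypt(key, (m0, m1)[b])
    # Strategy: if the challenge equals the known encryption of m0, guess 0.
    guess = 0 if challenge == reference else 1
    return guess == b


def win_rate(encrypt, rounds: int = 2000) -> float:
    return sum(play_round(encrypt) for _ in range(rounds)) / rounds


if __name__ == "__main__":
    print("deterministic:", win_rate(encrypt_deterministic))
    print("randomized:   ", win_rate(encrypt_randomized))
```

Against the deterministic variant this adversary wins every round, because equal plaintexts produce equal ciphertexts. Against the randomized variant the comparison tells it nothing, so its success rate stays close to one half.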

This is generally possible because encryption schemes add an element of randomness: most RSA padding inserts some randomness, and symmetric schemes use IVs (which must change for each identical plaintext encrypted under the same key, a point which may be worth noting in your scenario).

@Jayjayyy You are understanding the answer correctly, but the implication goes a little further than is explained. If you can't even deduce a single bit of information (i.e. whether it is the same or not), even when you can choose a near-infinite number of plaintexts to be encrypted, and you can observe the resulting ciphertext, then you can also not recover the key. I think that is what you are asking: "does it help the attacker get plain from cipher" is done using the key, so the question is whether the key can be recovered given many encryptions/decryptions/ciphertexts-known-to-be-equal. — Luc, Dec 21 '18 at 00:13

Luc · Answer 2 · 2018-12-21T08:07:38.397

Anything designed to be an encrypted container will not be weakened by having multiple, known-to-be-identical files placed in it. This happens regularly: when you install an operating system on an encrypted disk, there will be thousands of identical files of any size you might need. So dm-crypt/LUKS and VeraCrypt are definitely safe. For PKZIP, on the other hand, I don't know the details (I would hope it uses an IV), but I wouldn't trust something like that as much.

It also depends on the implementation: LUKS and VeraCrypt are meant to be cryptographic, and there is one implementation of each, plus perhaps a few compatible products. For PKZIP, there are thousands of implementations in active use. One of them definitely has a vulnerability, and it might be the one you are using. But those are implementation details; let's assume you're using something that was meant to be a cryptographic container, and that someone competent checked the design and code.

The reason this is not weakened is twofold:

They use a so-called initialization vector (a fancy term which roughly means "mix in random data in a certain way"), which makes each encryption unique. If you have two billion containers with identical files (each generated independently) and the attacker has access to all of them, then the two billion and first (2,000,000,001st) container again has a unique encryption, and the attacker cannot really tell anything about it.
No currently accepted encryption algorithm allows for key recovery: given a plaintext and its ciphertext, you cannot derive the key in any practical way. Someone recently asked about this because they had the plaintext and ciphertext of a database but had lost the password, and they wanted to decrypt another database encrypted with the same password. It was AES encrypted and there was nothing we could do for them.

There are some things an attacker can know: its size for example. Or, if the attacker can observe successive versions, they can observe how much change there was. If it's a payment system, an attacker might be able to learn how many purchases there were based on the frequency and size of the changes. But none of those are things which you couldn't already tell without access to any of the other containers.

Moreover, most of the setups don't encrypt the data with your password. They encrypt the encryption key with your password. So you encrypt the data with K1, randomly generated, and you encrypt K1 with your password. When you unlock the data, you decrypt K1 with your password, and K1 decrypts the data. This has several advantages, one easy to explain example being that if you change your password, you just re-encrypt K1 with the new password, instead of having to re-encrypt all data. So the password being the same might not even mean that the encryption key is actually the same. But even if it is, then there are still the above two properties that make it secure. (And an attacker also should not be able to tell that K1 was encrypted with the same password in different containers, assuming K1 was encrypted correctly (with an initialization vector).)
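The key-wrapping idea can be sketched as follows. This is a simplified illustration, not the header format of LUKS, VeraCrypt or any other product: the function names, the dictionary layout, the iteration count and the XOR-based wrapping are all assumptions chosen to keep the example short and limited to the Python standard library. A real format would also authenticate the wrapped key so that a wrong password is detected instead of silently producing a wrong K1.

```python
import hashlib
import secrets


def derive_kek(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    # Key-encryption key derived from the password and a per-container salt.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt,
                               iterations, dklen=32)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def create_header(password: str):
    """Create a new container header; returns (header, K1)."""
    master_key = secrets.token_bytes(32)   # K1: random, independent of password
    salt = secrets.token_bytes(16)         # fresh salt for every container
    # Toy wrapping: XOR K1 with the derived key (no integrity check).
    wrapped = xor(master_key, derive_kek(password, salt))
    return {"salt": salt, "wrapped_key": wrapped}, master_key


def unlock(header: dict, password: str) -> bytes:
    # Recover K1 from the header using the password.
    return xor(header["wrapped_key"], derive_kek(password, header["salt"]))


def change_password(header: dict, old: str, new: str) -> dict:
    # Only K1 is re-wrapped; data encrypted under K1 stays untouched.
    master_key = unlock(header, old)
    salt = secrets.token_bytes(16)
    return {"salt": salt,
            "wrapped_key": xor(master_key, derive_kek(new, salt))}


if __name__ == "__main__":
    header_a, key_a = create_header("same passphrase")
    header_b, key_b = create_header("same passphrase")
    print("same K1 in both containers?", key_a == key_b)
    print("same wrapped key?", header_a["wrapped_key"] == header_b["wrapped_key"])
```

Two containers created with the same passphrase get different salts and different random master keys, so neither the data keys nor the stored headers match, even though the passphrase does.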

I thought of something like [this](https://security.stackexchange.com/questions/153180): If it's not safe to store a container on Dropbox because of XTS, wouldn't the same attack be applicable for multiple containers? Is it because it's multiple containers of the same content vs multiple versions of one container? — finefoot, Jun 12 '19 at 18:08
@finefoot Correct, because the containers are encrypted with different keys you don't have that issue. — AndrolGenhald, Jun 12 '19 at 19:46

score 1 · Answer 3 · answered Dec 01 '18 at 18:30

If I understand the question correctly, that's more or less what I do with my backups: they are all encrypted with the same password.

According to the CIA triad, the security of data depends on:

confidentiality
integrity
availability

If all the archives are the same, that is, they contain the same data, and they use the same encryption method with the same password, then confidentiality isn't a problem. If an attacker manages to read the data in container A, then they already know all the data contained in container B, C, etc. Having different passwords won't help with confidentiality. Even if the encryption methods are different, the weakest one might be attacked more easily than the other, revealing all the data, and so having different passwords for the other archives won't help.

As for integrity (and availability), there is going to be a problem though. Since all the archives have the same password, if the attacker finds the password of one archive they will be able to access all the others, and corrupt or delete all of them. If you use a different password for each archive instead, if the attacker finds one password they will be able to read all your data and maybe corrupt an archive (total loss of confidentiality), but not corrupt all of the archives (total loss of integrity or availability, with no good copies of the data left!).

As I said, I use the same password for all my backups, but I mitigate the threat to integrity and availability by storing some copies off-site. An attacker would need to know the locations and have physical access there, which I don't even consider in my threat model, to be honest. I'm more worried about a natural disaster destroying one location (that's the actual reason for my off-site backups).

@Jayjayyy, oh, now I understand. Unfortunately I don't know the answer. I would guess it's not a problem as long as the encryption methods are good enough, but it's just a guess. I'll leave my somewhat unrelated answer here anyway for now. — reed, Dec 01 '18 at 19:17

score 1 · Answer 4 · answered Dec 20 '18 at 22:58

A couple of points on terminology. In your question you state that "they will contain the same files. So their content is identical." This statement is ambiguous at best. When discussing high-level cryptography, files and content have little meaning. Instead, there is only plaintext and cyphertext.

Let's reword your question for clarity:

Given two or more encrypted volumes initialized with the same passphrase, and each having the same size
With an identical set of files placed into each after initialization
Does an attacker gain any advantage by independently knowing that each volume, when presented with the correct decryption key, will produce an identical set of files?

If the above is correct, then the short answer is either "no", or "a very slight advantage". A longer answer depends on the exact encryption scheme used.

For the short version, let's use VeraCrypt for illustration. When a VeraCrypt volume is initialized, you are asked to move the mouse cursor randomly to generate entropy. This entropy will be very different for each of your volumes and is independent of using a repeated passphrase. The entropy serves two purposes. First, for many encryption schemes it is used to generate the IV (initialization vector). The IV is encrypted or combined with the passphrase and is used in a sequence of transformations on the plaintext. Part of the output of the previous sequence or block is used as the input for the next. This means that even though the plaintext may be identical, given different IVs, the cyphertext blocks will be very different from each other.
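The chaining described above can be illustrated with a small CBC-style sketch. It assumes a toy 4-round Feistel block cipher built from SHA-256, which is invented here purely so the example runs with the standard library. It is not a secure cipher, and this is not a claim about the mode VeraCrypt or any other product actually uses.

```python
import hashlib
import secrets

BLOCK = 32  # toy block size in bytes (two 16-byte halves)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _round(key: bytes, i: int, half: bytes) -> bytes:
    # Toy round function for the Feistel network below.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:16]


def encrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:16], block[16:]
    for i in range(4):
        left, right = right, xor(left, _round(key, i, right))
    return left + right


def decrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:16], block[16:]
    for i in reversed(range(4)):
        left, right = xor(right, _round(key, i, left)), left
    return left + right


def cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    if len(plaintext) % BLOCK:
        raise ValueError("plaintext must be a multiple of the block size")
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        # Each block is mixed with the previous ciphertext block (or the IV).
        prev = encrypt_block(key, xor(plaintext[i:i + BLOCK], prev))
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    prev, out = iv, []
    for i in range(0, len(ciphertext), BLOCK):
        block = ciphertext[i:i + BLOCK]
        out.append(xor(decrypt_block(key, block), prev))
        prev = block
    return b"".join(out)


if __name__ == "__main__":
    key = secrets.token_bytes(32)
    plaintext = b"X" * (BLOCK * 3)          # three identical plaintext blocks
    c1 = cbc_encrypt(key, secrets.token_bytes(BLOCK), plaintext)
    c2 = cbc_encrypt(key, secrets.token_bytes(BLOCK), plaintext)
    print("same ciphertext under different IVs?", c1 == c2)
```

With the same key and the same plaintext, two different IVs give unrelated-looking ciphertexts, and the three identical plaintext blocks inside one message also encrypt to three different cyphertext blocks because each one is chained to the block before it.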

Now, if you were merely encrypting a single file or set of files as opposed to placing them into a volume, you could expose your data to various types of plaintext attacks, including the one described by @arthurmilton. However, when using a volume or container, the entropy and IV are also used to fill the container with random data. This random data is encrypted using the same scheme, and the random cyphertext that is produced is practically indistinguishable from the real cyphertext belonging to the files. I say practically indistinguishable because, depending on the scheme being used, there are theoretical ways to increase the probability of distinguishing the random cyphertext from the file cyphertext, but AFAIK there has never been a real-world attack of this nature on a well-tested encryption scheme.

Taken together, these two things mean that an attacker gains nothing or almost nothing from her knowledge, since any comparison between the containers would require too vast a number of calculations to extract any pattern.

There are a few theoretical weak points (and assumptions) that interfere with the above. First, we are relying very heavily on the idea that the randomness generator will produce sequences random enough that an attacker who is blind to the entropy source(s) (the random mouse movements and other factors) won't be able to detect any patterns in a computationally feasible manner. This source of randomness is one of the harder problems to fully solve in encryption, but it appears that the generator in VeraCrypt is "good enough" for now. Second, we are assuming that the container is larger than the size of the files being placed into it. When combined with poor entropy generation, having less random cyphertext could make it slightly easier to identify the cyphertext that belongs to the real files. (I'm somewhat misusing the distinction between random and non-random cyphertext, but it helps illustrate the point.) There are other counters to this issue, and VeraCrypt does not require or even recommend that some ratio of files to available space be maintained in its containers.

Third, we are assuming that the encryption scheme uses a transformation mode with known protections against reuse attacks. This is a very dense area that I don't have the expertise to simplify.
