3

Consider a system that receives, but does not store, a piece of sensitive information X from a user. If I wanted to add a way to remember when a user has submitted the same X more than once, what would be the most secure way to do so? X is used for no other purpose; it is not a password.

The obvious but insecure approach would be to store all submissions of X and check for duplicates. One way to make this more secure would be to compute a verifier (bcrypt or similar) of each submission. My understanding is that, properly generated, such a verifier would defeat precomputed rainbow-table attacks.
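
To make this concrete, here is a minimal sketch of what I mean by a verifier (assuming the Python bcrypt package; the storage list and function name are purely illustrative):

    import bcrypt

    stored_verifiers = []  # stand-in for a real datastore, one entry per submission

    def record_submission(x: bytes) -> bool:
        """Return True if this X was already submitted, otherwise remember it."""
        if any(bcrypt.checkpw(x, v) for v in stored_verifiers):
            return True
        # gensalt() embeds a fresh random salt in every verifier, which is what
        # defeats precomputed rainbow tables.
        stored_verifiers.append(bcrypt.hashpw(x, bcrypt.gensalt()))
        return False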

My concern is with dictionary attacks. If X is relatively guessable, a determined attacker can probably make slow but steady progress against the verifier. One solution is to deliberately truncate the verifier so that it matches a larger number of possible plaintexts. Because X is not a password, it should be acceptable to increase the number of false positives (a duplicate detected when there is none). On the attacker's side, however, it may be easy to find a matching plaintext, but there are so many of them that it's impossible to tell which one is the original.

In other words, if the verifier entropy is significantly less than the plaintext entropy, the set of all plaintexts that match a given verifier should be much larger than if the entropies were of comparable size. Then it's impossible to tell which plaintext is the original.
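
For concreteness, here is a rough sketch of the kind of truncation I have in mind (the bit count is a placeholder, and a salted SHA-256 stands in for whatever slow verifier is actually used):

    import hashlib, os

    VERIFIER_BITS = 16      # keep far fewer bits than X has entropy
    SALT = os.urandom(16)   # per-user salt, stored alongside the verifiers

    def truncated_verifier(x: bytes) -> int:
        digest = hashlib.sha256(SALT + x).digest()
        # Keeping only VERIFIER_BITS bits means roughly 2**(H(X) - VERIFIER_BITS)
        # plaintexts map to each value, so a match stays ambiguous to an attacker.
        return int.from_bytes(digest, "big") >> (256 - VERIFIER_BITS)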

Is this a valid approach? Or is it overkill if the verifier is computed using a slow algorithm like bcrypt?

Alex Quach
  • In your scenario, what is X? Is it a password? How frequently does it change? Does it need to be recoverable? – user2320464 Aug 11 '16 at 19:10
  • I don't understand what the goal of your attacker is. Submit something twice? Get something identified as a duplicate which is not? Find out if a piece of information was already submitted? – Philipp Aug 11 '16 at 19:25
  • If it's okay to have realistic false positives, then are you fine with real users running into these false positives? – Macil Aug 11 '16 at 20:39
  • @user2320464 X is not a password; it is not used as part of any authentication scheme. Maybe think about it as a piece of sensitive personal information. It doesn't need to be recoverable; all I want to do is know when the same X is submitted twice by the same user. – Alex Quach Aug 12 '16 at 20:35
  • @Philipp The attacker's goal is to reconstruct X, given access to any information derived from the X's passing through the system. So if the system stores a hash of X's, then the attacker could attempt to reconstruct X from the hashes. – Alex Quach Aug 12 '16 at 20:36
  • @AgentME Great question. The answer is probably "sure, but does there exist a scheme that will generate enough false positives to confuse an attacker, but few enough that a realistic user would not regularly encounter them?" Apologies if the question was too vague/open-ended. – Alex Quach Aug 12 '16 at 20:38
  • How frequent such collisions would be depends on how much data you plan to store. Is it going to be stored forever, or just for a minute? Is it going to be checked across all users, or for each user separately? This question probably belongs on the User Experience SE, not here. – axapaxa Oct 11 '16 at 00:37
  • This seems heavily related to http://security.stackexchange.com/q/142110/38069 – Michael Nov 10 '16 at 05:41
  • What happens with X once it is in the system? Will the submission give any feedback to the user depending on whether it is a duplicate, or will the system just keep that information to itself? If not, you need not worry. If yes, you may want to redesign your process. – Marcel Nov 10 '16 at 06:52

2 Answers

1

You said that you allow X to have false positives. Therefore you can simply strip away much of the entropy and accept false positives; an attacker who obtained your database would face the same ambiguity.

But if X has enough entropy on its own (128 or 256 bits), an attacker won't be able to extract any more information from a hash (for example SHA-256 or SHA-3), and you won't have false positives. The hash then reveals exactly one piece of information: whether a newly submitted value is the same as a stored one.

If your X doesn't have enough entropy, and/or you accept the risk of false positives, then you can truncate your hash as well (the question then becomes how many false positives you are willing to accept).
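
As a rough way to quantify that trade-off (the function below is only a sketch, using the standard birthday approximation; the example numbers are made up):

    # Expected number of accidental duplicates among n stored values after
    # truncating the hash to kept_bits bits (birthday approximation).
    def expected_false_positives(n: int, kept_bits: int) -> float:
        return n * (n - 1) / 2 / 2 ** kept_bits

    # e.g. 10,000 stored values and a 32-bit truncated hash:
    # expected_false_positives(10_000, 32)  ->  about 0.012 accidental collisions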

On the other hand, with bcrypt and other PBKDFs, I find that you quickly lock yourself out of the information. When a value a arrives, you have no choice but to iterate over every stored bcrypt hash and check whether it matches a, which (since PBKDFs are deliberately slow) is inefficient.
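
A small illustration of that lookup problem, assuming the Python bcrypt package:

    import bcrypt

    x = b"same sensitive value"
    h1 = bcrypt.hashpw(x, bcrypt.gensalt())
    h2 = bcrypt.hashpw(x, bcrypt.gensalt())

    assert h1 != h2                       # the hash cannot serve as a lookup key
    assert bcrypt.checkpw(x, h1) and bcrypt.checkpw(x, h2)

    # So detecting a duplicate costs one slow checkpw() call per stored verifier.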

Also, remember that the weak point of a system that only passes confidential information through is an active attacker: once the system is compromised, he can simply record every X your service receives from that point on.

axapaxa
0

It depends on the range of possible values of X.

For example, if X can take on about 1,000,000 possible values, you could compute an MD5 of X, store only the first 16 hexadecimal characters (64 bits), and keep the risk of collision low. The birthday bound shows that the probability of a collision is at most (1,000,000 × 1,000,000) / 2^64 × 100% ≈ 0.0000054%.
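
As a quick sanity check of that estimate (the figure above is the loose n²/N bound; the standard birthday approximation comes out at roughly half of it):

    import math

    n = 1_000_000        # stored values
    space = 2 ** 64      # possible values of the 64-bit truncated digest

    # Birthday approximation: P(any collision) ~ 1 - exp(-n*(n-1) / (2*space))
    p = 1 - math.exp(-n * (n - 1) / (2 * space))
    print(f"{p * 100:.7f}%")   # about 0.0000027%, below the 0.0000054% bound above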

If an attacker wants to find MD5 preimages, he has to choose among at least 18,446,744,073,709,551,616 (2^64) equiprobable candidates, with no way to know whether a given guess is the original, which makes the attempt nearly useless. So this part is fine.

But you have to determine which collision rate is acceptable for X. 0.0000054% is not very high, but when a collision does happen, is it a small problem or a big one? If you cannot tolerate any false positives on X values, then hashes should not be used, truncated or not.

But considering what you described, your approach seems strong enough. You don't really need a slow password-hashing algorithm like bcrypt; a fast cryptographic hash such as SHA-256 could be sufficient for your purpose.

Sibwara