Anonymisation better than just a hash

Question

I'm trying to help anonymise users but still give them some controls. So this is different from say, anonymising a data set where you never need to go back to the original users.

Let me give you an example. I want a user to come in with his or her email address to log into my system, but I don't want to store the email address. I want to assign that user a UUID that I will then use. However, if the user loses the random identifier they need to be able to re-enter their email address and then get back into the system.

The simplest answer to this is a hash. Of course I could store the hash next to the UUID. If they lose the UUID they can come back and I can re-hash the password. On the other hand, if my system is broken into, the bad guys can simply do a dictionary attack on a list of email addresses and then re-identify the UUIDs.

Firstly, is there a standard way of doing this? bcrypt and PBKDF2 pop to mind, but I obviously can't store a tuple of <email, salt, iterations> without making the intruders job even easier.

I don't like to invent new security stuff, but I have had one idea. Basically I store <SHA512(email), salt, iterations> and then I store <PBKDF2(email, salt, iterations), UUID>.

That way they have to first dictionary attack the first table, and then use the results of that to do individual dictionary attacks on each row of the second table, which should slow things down.

First of all, asking users to remember a UUID would be bad for usability of your solution. Secondly, can you tell us what is the risk associated with attackers identifying the UUID? — Limit, Jul 07 '16 at 17:53
The behaviour of the user could be tracked via the UUID. For example, I might store a history of web interactions that includes the UUID. I'm trying to make it so that if someone attacks the site, it is very hard for them to then figure out who did what. Its probably a little paranoid, but that is a good thing right? — Paul Fremantle, Jul 07 '16 at 18:07
one can't "simply do a dictionary attack" on bcrypt in a reasonable amount of time. — dandavis, Jul 07 '16 at 19:22
@dandavis: actually, one *can* simply dictionary attack bcrypt in this attack scenario. Suppose that everyone on earth has exactly one email address, so we've got 7 billion email addresses, and you've configured your bcrypt load factor to take 1 seconds to compute in a typical computer. An attacker only need to have 220 of these typical computers to go through every email address in that list in 1 year. There are a lot of computer clusters that are much more powerful that that. — Lie Ryan, Jul 09 '16 at 05:07
@LieRyan: that assumes the attacker already knows which 7 billion email addresses to try, not very realistic, and even then it takes a year with hundreds of computers (>$25k) and lots in electricity (~$20k). If it's worth about 50 grand and that much effort, I can't help but think that someone would just beat the info out of you instead. — dandavis, Jul 09 '16 at 08:10
@dandavis: It isn't that hard to collect a large number of email addresses. You can buy various email address lists someone else collected, spammers do that all the time. Two hundred CPU are not that expensive for a serious attacker either. An AWS c4.8xlarge Spot Instance costs $0.324 per Hour for a 36 CPU machine. Seven of these instances (250 CPUs) costs an attacker only about $20000, which is chump change for attackers with enough motivation (e.g. governments, medium-sized criminal organizations). Also, I'm thinking of mass surveillance scenario, rather than deanonymizing a single target. — Lie Ryan, Jul 09 '16 at 08:43
@LieRyan: it's a cat and mouse game to be sure. Mouse could add an extra pre-pbkdf hash call, then after a couple months, hash the hash, then later, hash the hash of the hash, each time disposing of all that hard pre-imaging effort. In fact, since hash is cheap, maybe re-hashing every month should be a standard practice on high-risk setups... — dandavis, Jul 09 '16 at 08:53
@dandavis: None of that matters. $20000 is for pure brute force with just a list of all known email addresses and a leaked database, no preimaging or anything else. The goal of deanonymization is to identify the user, not pretend to be them, so once you've got the leaked database, it doesn't matter if the defender change their hash methods. — Lie Ryan, Jul 09 '16 at 09:19
If your system can find a user account just based on their email address, so can an attacker who has your database - nothing you can do to stop that, because they can just pretend to be your system. — user253751, Oct 07 '16 at 10:57

Lie Ryan · Answer 1 · 2016-07-09T05:58:25.107

1

Keep things simple, if you want anonymity, don't use email for logins.

Make your user generate an asymmetric key pair to authenticate to your system, call this the login key.

When the user want to post a message to your system, the user generates a new asymmetric key pair, call this the data key. The user would then sign their message with the data key and encrypt the private data key with their public login key addressed to themselves. The user would submit the encrypted private data key, the public data key, and the signed message.

To edit the posted message, your user downloads the encrypted data key, decrypt it with their private key. And then they sign the message with the private data key. Your system knows that the new message are generated by the same user as the original, because both messages are signed by the same public data key.

Be careful:

many implementations of asymmetric cryptography attaches the encrypting key's ID to an encrypted message, to make it easy for the recipient of a message to identify which key to use to decrypt the message. You want to make sure you encrypt the private data key without adding these metadata.
an attacker that manages to observe who downloaded which encrypted private key would be able to observe which user downloaded which key and make reasonable guess as to who is doing what. You can prevent this by forcing all users to download all encrypted private keys (or a reasonable chunk of it).

Basically I store <SHA512(email), salt, iterations>

If you're storing the SHA512 of the users' email, wouldn't identifying the users be as easy as creating a SHA512 rainbow table?

edited Jul 09 '16 at 05:58

answered Jul 09 '16 at 05:33

Lie Ryan

31,089
6
68
93

"If you're storing the SHA512 of the users' email, wouldn't identifying the users be as easy as creating a SHA512 rainbow table?" => No, because he's using a salt. – Nathan Sep 07 '16 at 08:08
1

@Nathan: salt makes little difference here. Password is intended to be a secret, but email address, you give this to everyone you meet, you post them on websites, publicly on mailing lists, and SMTP sends them as plain text. Any anynomization strategy need to take into account that the possible key space of email address is much smaller than possible key space of passwords, so reversing email hashes is much easier than reversing password hashes. – Lie Ryan Sep 08 '16 at 04:13
It makes all the difference: you can't use a rainbow table if the plaintext being attacked is salted. It might be a reduced keyspace before being salted but when you salt your email address before hashing it then the keyspace is now however large you decide to make your salt (+ the the original email keyspace)... So, the simple fact is that salt = no rainbow table. If you believe otherwise, then you need to go learn about rainbow tables, how they're built and their space/time tradeoff. – Nathan Sep 08 '16 at 11:11
1

@Nathan - Salt makes little difference because the search space of email addresses is small enough that you don't need a rainbow table. It is particularly annoying that you've made condescending comments when you are peddling misleading information. No wonder Lie Ryan didn't bother to correct you. – paj28 Oct 07 '16 at 09:16
A thought for you @paj28 - if each stored record is a hash-salt pair, then if we use the arbitrary figure of $20,000 used in the comments above as an example of the cost of brute forcing the "limited" key space of emails, then just by adding salt the cost becomes $20,000 PER RECORD to break instead of a single charge. Hardly a "little difference" eh? – Nathan Oct 07 '16 at 11:33
@Nathan - Sorry, I should have clarified, a salt makes little difference for unmasking a single user. From the context it sounds like unmasking a single user is a concern - otherwise it is a poor anonymous site? You are absolutely right that a salt is a good thing. Your comment about rainbow tables is technically correct, but I think misleading in this context - because of the ability to do a targeted brute force. I also think $20k is out by two orders of magnitude, but OP did not specify the work factor, so it could be true. – paj28 Oct 07 '16 at 11:57
1

@paj28 - No problem - and yes, the conversation is at a bit of a tangent to to the original question. Lie Ryan kind of chucked the rainbow table comment in at the end: "If you're storing the SHA512 of the users' email, wouldn't identifying the users be as easy as creating a SHA512 rainbow table?" - and that's what I responded to... his response was that "reversing email hashes is much easier than reversing password hashes" which is patently incorrect when the plaintext is correctly salted... hence my response which may have come off as condescending... but as you say, it's correct. – Nathan Oct 07 '16 at 12:16
@paj28 PS - the OP does mention a "dictionary attack on a list of email addresses", so on reflection I don't think that it's that much of a tangent :) – Nathan Oct 07 '16 at 12:18
This sounds like a good approach, but will only work if the user can keep track of a private key. If they will 'lose' (similar to forget) their private key, then they would have to re-register, no? – 700 Software Oct 07 '16 at 13:59
@Nathan: Your cost estimation is off by at least five orders magnitude. The cost of guessing email addresses is nowhere near $20000. According to http://www.inboxinteractive.com/buying-renting-email-lists/, the cost of buying email address list can be as low as $1000 per 15 million addresses. A decent rack can calculate the hashes for all 15 million in less than a second (https://security.stackexchange.com/questions/18191/how-many-possibilities-can-todays-computers-check-per-second-in-a-sha512-hash). – Lie Ryan Oct 07 '16 at 15:08
@Nathan: Governments issuing national security letters to just a few tech companies like Google and Microsoft can easily harvest billions of valid email addresses. The point is that email addresses is public information, and most people don't generate email addresses randomly. In many companies, you can even guess someone's email address if you know the person's name, and people heavily reuse username as emails. – Lie Ryan Oct 07 '16 at 15:11
That wasn't my cost estimation, it was used from another commenter as an illustration. And yes, thanks, I know how email works. What you are saying doesn't change how salting and rainbow tables work. – Nathan Oct 08 '16 at 21:57

score 0 · Answer 2 · edited Mar 17 '17 at 13:14

If the user loses the random identifier they need to be able to re-enter their email address and then get back into the system.

To acquire that feature, you have no choice but to store a hash of the email address. Also the Salt cannot be secret because the user has already forgotten everything except their email address.

Email Addresses are Low-Entropy like Passwords, so use a (Slow) Password Hash

BCrypt is commonly recommended, but be sure to run a quick SHA-2 hash on the input data, so super-long input will not be truncated by BCrypt
PBKDF2 but that is less GPU-resistant because it has lower Memory requirements.
SCrypt is better than BCrypt if you have a high enough work factor. Otherwise it is worse against GPUs. (again, because of higher or lower Memory requirements)
The winner of the Password Hashing Competition may be even better than the aforementioned, but has not yet stood the test of time, so don't use it just yet. It's called Argon2, and has separate Work Factor settings for CPU time and Memory load. (nice!)
Repetitive SHA-2 can be used instead of PBKDF2 (still not GPU resistant), but this is more tricky to implement the repetition efficiently (i.e. to be brute-force resistant) because SHA-2 is actually a General Purpose (Fast) Hash.

Most of these options generate random Salt by default, but you should verify whether this is the case!

It is best to include some Pepper (~72 bits of entropy) before the Input prior to hashing. The Pepper can be the same for all your users, but should be stored in a file outside of the database so that component cannot be found via SQL Injection.

Make sure your Work Factor requires about 100ms (with appropriate DoS protection) on your target hardware (knowing that attackers will use faster hardware for Brute force)

Keep in mind that very simple email addresses might be brute-forced anyway, so don't let the hash get stolen! Also relevant, is that an attacker could quickly check for presence of a particular email address.

It would be wise to look at other security precautions you can make such as Separation of SQL Database Users to contain potential SQLi attacks. Also make sure your system is secure overall.

I don't like to invent new security stuff, but I have had one idea. Basically I store <SHA512(email), salt, iterations> and then I store <PBKDF2(email, salt, iterations), UUID>.

There's no benefit to adding complexity here. If you want the attacker to have to do more work, just raise the Work Factor and use Pepper as described.

Anonymisation better than just a hash

2 Answers2