How to securely hash/tokenize a string

Question

A system I'm working on accepts as input a customer account number and needs to generate a token based on it. We're not allowed to store the plain text of the account number itself, so the goal of the token is as follows:

Can not be reversed into the account number.
Can be used to lookup and uniquely identify an account metadata record in our database.

Account numbers resemble credit card numbers. They're 16-digit long numeric strings; the first 4 characters are constant; the last character is a computable check digit. This means that the actual size of the input set is an 11 character numeric string: 99,999,999,999 possible permutations.

I've thought about the methods below. Assume hash means a sufficiently slow secure hash such as high iteration PBKDF2, bcrypt, or argon2 and encrypt means AES256.

1. Plain Hash

hash(account_num)

While being simple, this approach is easy to reverse via brute-force and is vulnerable to rainbow tables.

2. Per-Account Salted Hash

hash(salt + account_num)

This approach fixes vulnerability to rainbow tables, however, due to the limited size of input set is still easy to reverse via brute-force.

3. Per-user Salted Hash encrypted with Global Pepper

encrypt(hash( salt + account_num ), pepper)

This is based on Dropbox's password storage mechanism. Reversing via brute-force requires leaking both the encrypted blobs and the encryption key. However, since encrypting the same value twice with the same key results in different output blobs, this breaks the ability to select an account from the database by account number.

4. Hybrid Approach

Store the last 4 digits of the account number in plain text.
Stored the entire account number as a pepper-encrypted salted hash.
1. Use AWS KMS for hash encryption to reduce odds of leaking a key.

What this accomplishes:

We can lookup accounts using the last 4 digits of the account number. Based on checking a few thousand account numbers, this select will return between 1–3 possible accounts.
Iterate over each of the possible account matches…
1. Decrypt the account number hash for the account.
2. Compare the decrypted hash to the input-account number.
3. Stop iterating as soon as we find a matching hash (use this account row) or run out of accounts (create a new account row).

My question: Does approach 4 actually make sense? For the extra security it provides, is it overly complicated? Does it have flaws I haven't thought about? Most of all, is there a simpler way to solve this problem?

How much of the account number *are* you allowed to store in plaintext? Credit cards are "First six, last four", leaving 3-9 digits hidden. You mention using the last four in a plaintext lookup and the first four as constant. Also, do you need to store hashed account numbers? Or can you just issue a token and link the token directly to the metadata? — Bobson, Feb 28 '17 at 22:35

score 6 · Answer 1 · answered Feb 28 '17 at 23:01

6

It sounds like your account number is a credit card number. Conveniently, I know from experience that a five year old graphics card can exhaust a salted 10 numeric digit key with salt in 3 days, meaning an average of 36 hours to reverse a PBKDF(4096) hash. It's not looking good for you already.

You're much better off generating a random nonce to pair with the transaction and associating that nonce with the card.

answered Feb 28 '17 at 23:01

Jeff Ferland

38,090
9
93
171

This is basically what I was thinking. Just generate an unrelated identifier to use and throw away the account number. – Bobson Mar 01 '17 at 11:32
2

I'd love to do that, but it doesn't solve the problem. For example, a customer inputs their account number and performs an action. Then two months, they come back, enter their account number again, and perform another action. I need to be able to correlate these two actions as being from the same account and the account number is the only piece of data I have to do that with. If I just use a random nonce, then the second action will get a different random nonce and I'll have no way of knowing they're from the same account. – crgwbr Mar 01 '17 at 15:55
@crgwbr I think I might be able to better help if you tell us the problem in greater detail rather than proposed solutions you have. Who is the customer? The cardholder? A merchant? What's the nature of the system? Website login? Call center? – Jeff Ferland Mar 02 '17 at 04:46
@crgwbr - That's very reasonable, but you should clarify that in your question. You asked about looking up the account *by the token*, which is not the same thing. Like Jeff said, taking a step back may be helpful. This may be an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Bobson Mar 02 '17 at 14:37

Out of Band · Answer 2 · 2017-02-28T23:08:43.100

You could simply use an HMAC: HMACs have two inputs, a message and an encryption key, and produce a message authentication code, which is basically a hash (e.g. you can't turn the MAC back into the message, or the key, or both). So in your case you'd have:

token = hmac(account_num, key)

HMAC is constructed in a special way to protect against attacks that might reveal the key. It's a well-understood cryptographic primitive.

Without knowing the key, a brute force attack on the token to get the account number isn't practical. OTOH, for you, knowing the key, it's trivial to turn an account number into a token.

Of course, now the security of your account numbers rests with the security of the key; so if you lose the key and the list of tokens to an attacker, you've opened your account numbers to brute force attacks. But your idea number 4 suffers from the same problem.

It might be a good idea to think about key management, so you can easily change a key if one becomes compromised.

score 1 · Answer 3 · answered Mar 01 '17 at 14:58

Your design is not secure against an oracle attack. Any cleartext number put into the system will result in the same token being generated. Instead of brute force hashing, the attacker simply has to use your system to make his series of guesses, and your oracle will tell him if he's right or wrong.

You've called these "account numbers" without defining anything about how they're being assigned. If they are truly random, then yes, you have 10,000,000,000 possible guesses. But as we know, true random is hard to come by. And if they are not cryptographically randomly spaced across the possible values, an attacker gains a huge crib.

I'm going to first use credit cards as a real-world example of how cribs enable them to be been broken, and then show how the example could be exploited in other systems, perhaps yours.

Consider account numbers issued by banks. Each card identifies the bank in the first six digits, which are called the Bank Identification Number (BIN). Statistically, customers will have a mix of credit cards, but a significant fraction (let's say 13%) of them are highly likely to have been issued by banks local to the shop where they are being used. That's our crib. So if I see hash value with a card ending in 1234, and I know the local bank's BIN is 444444, I simply brute force all the missing digits in this picture: 4444 44?? ???# 1234. The # character uses the check digit algorithm to recover one missing digit from a guess. Given only 100,000 guesses, I'll have a 13% chance of guessing a valid card in your system.

So let's extend this to your cards, and you have a couple of card vendors printing them for you. If you place an order for 100,000 cards today and reorder 100,000 cards next month, the vendor may have no way of knowing if the new run is issuing the same card numbers as the previous run. So to avoid collisions he puts a six digit unique number as the first six digits of the card numbers, ensuring each batch is different from every other batch. (This is very common behavior for vendors printing gift cards.) This has a similar effect to what a BIN number does on making the numbers guessable. There are about 50,000 BINs issued for credit cards, but it may be worse in your case because we don't know how many possible batches of cards you've printed.

Remember, an attacker isn't trying to make the books balance; he doesn't need perfection to succeed as a thief. He doesn't have to decrypt a specific token number he finds in order to steal from one person. All he has to do is brute force against your database, and find one or more numbers that will work for him. The more he successfully guesses, the more profit he makes, but any success at card theft is a win for him.

Instead, consider a single-use token system. If the attacker puts in 4444 4400 0001 1234 two times, he gets two different tokens out of the system. That's the only way to prevent your system from providing potential attackers with an oracle.

The token is just an internal identifier used to uniquely identify an account. It's never displayed to the user or sent anywhere. — crgwbr, Mar 01 '17 at 15:58
@JohnDeters: Interesting analysis. You could make the attack expensive by using aggressive rate limiting. If you allow for five queries from the same IP in, say, 10 minutes, and have anyone who exhausts this wait for progressively longer periods of time between tries, an attacker will need to rent a large botnet to hide his oracle attack from the rate limiting code - and the rate limiting code could still pick up increased usage and either sound an alarm, lock down the system or implement even harder rate limiting. — Out of Band, Mar 02 '17 at 08:58
Careful auditing and rate limiting are the best ways to secure an oracle against attack. And while the original poster says the tokens are never "sent", that does not stop an insider threat, nor is the future protection of the tokens guaranteed. It's far too easy for the marketing side of the business to look around a year from now and do something unsafe, like "let's include these tokens on the survey forms so we can see who is responding!" Multi-use tokens just take on too much risk. — John Deters, Mar 02 '17 at 19:17

How to securely hash/tokenize a string

3 Answers3