A system I'm working on accepts as input a customer account number and needs to generate a token based on it. We're not allowed to store the plain text of the account number itself, so the goal of the token is as follows:
- Can not be reversed into the account number.
- Can be used to lookup and uniquely identify an account metadata record in our database.
Account numbers resemble credit card numbers. They're 16-digit long numeric strings; the first 4 characters are constant; the last character is a computable check digit. This means that the actual size of the input set is an 11 character numeric string: 99,999,999,999 possible permutations.
I've thought about the methods below. Assume hash means a sufficiently slow secure hash such as high iteration PBKDF2, bcrypt, or argon2 and encrypt means AES256.
1. Plain Hash
hash(account_num)
While being simple, this approach is easy to reverse via brute-force and is vulnerable to rainbow tables.
2. Per-Account Salted Hash
hash(salt + account_num)
This approach fixes vulnerability to rainbow tables, however, due to the limited size of input set is still easy to reverse via brute-force.
3. Per-user Salted Hash encrypted with Global Pepper
encrypt(hash( salt + account_num ), pepper)
This is based on Dropbox's password storage mechanism. Reversing via brute-force requires leaking both the encrypted blobs and the encryption key. However, since encrypting the same value twice with the same key results in different output blobs, this breaks the ability to select an account from the database by account number.
4. Hybrid Approach
- Store the last 4 digits of the account number in plain text.
- Stored the entire account number as a pepper-encrypted salted hash.
- Use AWS KMS for hash encryption to reduce odds of leaking a key.
What this accomplishes:
- We can lookup accounts using the last 4 digits of the account number. Based on checking a few thousand account numbers, this select will return between 1–3 possible accounts.
- Iterate over each of the possible account matches…
- Decrypt the account number hash for the account.
- Compare the decrypted hash to the input-account number.
- Stop iterating as soon as we find a matching hash (use this account row) or run out of accounts (create a new account row).
My question: Does approach 4 actually make sense? For the extra security it provides, is it overly complicated? Does it have flaws I haven't thought about? Most of all, is there a simpler way to solve this problem?