
We have a database of sort codes (6 digit numbers) and account numbers (8 digit numbers) that we use to reconcile monthly accounts with the table of supporters.

There is nothing in the data received from the bank that uniquely identifies the supporter, other than the sort code and account number. ... I know, it's annoying.

While this data is not as sensitive as card data (and not subject to PCI-DSS), it's still pretty sensitive and I'd like to find another way to do the reconciliation to reduce the liability of having all this data.

Combining sort code and account number gives up to 10^14 possibilities.

Is there a way (using a reliable and established PHP function) to hash the data and store only the hash, such that I could take a monthly file of, say, 1000 records and match them up to the hashed data? Or is there really no point, and should I instead focus on hardening security around this db?

The security advantage I'm seeking is that the database does not have a ready-to-use list of people's bank details. The transactional monthly bank statement data can be considered to be of short lifespan (it is received encrypted, decrypted, processed, deleted).

I've read a helpful detailed comparison of hashing functions, but obviously here we're not talking about passwords, and in effect we need to be able to crack them every month! Hmmm.


EDIT: Conclusion

Thanks to the answers below, here's what I plan to do:

Set-up

  1. Create a map for sort codes and account numbers to random ids.
  2. Replace real data with mapped data.
  3. Encrypt this map using PHP's Mcrypt AES 256 with a user-provided key never stored on server
  4. Store the encrypted map on the server.
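
The set-up steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code actually deployed: the function names (`buildMap`, `encryptMap`, `decryptMap`), the `sortcode:account` key format, the 100000-iteration PBKDF2 key derivation and the blob layout are all hypothetical. It also swaps in PHP's OpenSSL extension with AES-256-GCM in place of Mcrypt, since Mcrypt was removed from PHP core in 7.2.

```php
<?php
// Hypothetical helpers; names, key format and parameters are assumptions.

/** Map each "sortcode:account" pair to a random 128-bit ID (hex string). */
function buildMap(array $accounts): array {
    $map = [];
    foreach ($accounts as [$sortCode, $accountNumber]) {
        $key = $sortCode . ':' . $accountNumber;
        if (!isset($map[$key])) {
            $map[$key] = bin2hex(random_bytes(16)); // unrelated to the input
        }
    }
    return $map;
}

/** Encrypt the map with AES-256-GCM under a key derived from a passphrase. */
function encryptMap(array $map, string $passphrase): string {
    $salt = random_bytes(16);
    $key  = hash_pbkdf2('sha256', $passphrase, $salt, 100000, 32, true);
    $iv   = random_bytes(12);
    $tag  = '';
    $ciphertext = openssl_encrypt(json_encode($map), 'aes-256-gcm', $key,
                                  OPENSSL_RAW_DATA, $iv, $tag);
    if ($ciphertext === false) {
        throw new RuntimeException('Encryption failed');
    }
    // Assumed layout: salt (16) | iv (12) | tag (16) | ciphertext
    return base64_encode($salt . $iv . $tag . $ciphertext);
}

/** Reverse of encryptMap(); throws if the passphrase or blob is wrong. */
function decryptMap(string $blob, string $passphrase): array {
    $raw = base64_decode($blob, true);
    if ($raw === false || strlen($raw) < 44) {
        throw new RuntimeException('Malformed blob');
    }
    $salt = substr($raw, 0, 16);
    $iv   = substr($raw, 16, 12);
    $tag  = substr($raw, 28, 16);
    $key  = hash_pbkdf2('sha256', $passphrase, $salt, 100000, 32, true);
    $json = openssl_decrypt(substr($raw, 44), 'aes-256-gcm', $key,
                            OPENSSL_RAW_DATA, $iv, $tag);
    if ($json === false) {
        throw new RuntimeException('Decryption failed');
    }
    return json_decode($json, true);
}
```

Only the string returned by `encryptMap()` would be written to disk; the passphrase is supplied by the user at run time.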

Now: you can take the database, you don't get the data, or any way to decrypt it by brute force, thanks to the random map.

You can take the map also and figure out how it works (not relying on obscurity), but you still need to be able to crack the encryption to get access to the map. This feels like a suitable level of risk.

Reconciliation

  1. Decrypt the PGP content from bank locally.
  2. Over SSL, upload the month's transactions and also provide the decryption key.
  3. Server decrypts the map, applies it to the uploaded data, stores mapped data for later processing, deletes raw uploaded file.
  4. User deletes decrypted bank data locally.
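
Step 3 of the reconciliation might look something like the sketch below, once the map has been decrypted into an array in memory. The function name `applyMap`, the row field names (`sort_code`, `account_number`, `amount`, `supporter_ref`) and the `sortcode:account` key format are assumptions made for illustration.

```php
<?php
/**
 * Replace bank details in each uploaded row with the mapped random ID.
 * Rows with no entry in the map are only counted, never stored.
 */
function applyMap(array $map, array $transactions): array {
    $matched = [];
    $unmatched = 0;
    foreach ($transactions as $row) {
        $key = $row['sort_code'] . ':' . $row['account_number'];
        // Drop the raw bank details before the row goes anywhere near storage.
        unset($row['sort_code'], $row['account_number']);
        if (isset($map[$key])) {
            $row['supporter_ref'] = $map[$key];
            $matched[] = $row;
        } else {
            $unmatched++;
        }
    }
    return ['matched' => $matched, 'unmatched' => $unmatched];
}
```

The `matched` rows contain only the random ID and the transaction fields, so they can be stored for later processing without holding any bank details.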

This means the key and decrypted map are only ever in RAM. The month's transactions are temporarily stored on disk, but that's an acceptable level of risk IMO (could use a secure deletion method like bleachbit etc.).

Updating the key is as simple as providing the existing and new keys: decrypt the map, re-encrypt it, store it.

If there was concern that the decrypted map had been compromised, this could be rebuilt, too, although it's more effort as it means updating all the stored data.

artfulrobot

2 Answers


Be wary of hashing things where people might determine characteristics of the input. One company used MD5 of taxi IDs for anonymizing, which was quickly reversed. Yes, you could try some home-baked hash modification that would make it less obvious than just a straight MD5, but that's security through obscurity. Solving almost any hashing function for every 8 digit account number is trivial, at which point your data is as good as plaintext. Concatenating account numbers with the sort code isn't going to be much better.
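
To illustrate why an unsalted hash of a short numeric identifier offers so little protection, here is a minimal sketch of an exhaustive search. The function name `bruteForceMd5` is hypothetical, and the example uses 5 digits only so that it finishes instantly; the same loop over 8 digits is 10^8 iterations.

```php
<?php
/**
 * Try every zero-padded number with the given digit count until one
 * hashes to the target. Returns the original value, or null if none match.
 */
function bruteForceMd5(string $targetHash, int $digits): ?string {
    $limit = 10 ** $digits;
    for ($n = 0; $n < $limit; $n++) {
        $candidate = str_pad((string) $n, $digits, '0', STR_PAD_LEFT);
        if (md5($candidate) === $targetHash) {
            return $candidate;
        }
    }
    return null;
}

// Recover a 5-digit value from its MD5 alone.
echo bruteForceMd5(md5('04217'), 5), "\n"; // prints 04217
```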

What you should do instead is make a table/program/whatever that maps your sensitive data to random IDs. Your system would require access to that table/program to do the conversion, so you can take steps to secure it (like storing it in a TrueCrypt volume) while you work with the truly anonymized data.

Aron Foster

If you consider bank account numbers sensitive, then yes it is worth hashing them.

When we talk about hashing we should always talk about salting the hash. Salting each hash separately is the approach you should always start with, but in this case it would be computationally expensive for you.

As you are trying to use this as a lookup value based on the plain text (bank account number + sort code), if you salted each row individually you would have to hash each received record with the salt of every stored row in turn. This would slow down your process from O(log(n)) to O(n) per lookup, where n is the number of records you are storing.
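
The difference between the two lookups can be sketched as below. The function names, the row layout (`salt`, `hash`, `supporter_id`) and the use of a PHP array as a stand-in for an indexed database column are assumptions for illustration; a PHP array lookup is a hash-table probe rather than the O(log(n)) walk of a typical database index, but the contrast with the linear scan is the same.

```php
<?php
/** Global salt: hash the incoming details once, then do one indexed lookup. */
function lookupGlobalSalt(array $index, string $salt, string $details): ?int {
    $hash = hash('sha256', $salt . $details);
    return $index[$hash] ?? null;
}

/** Per-row salt: every stored row needs its own hash computation. */
function lookupPerRowSalt(array $rows, string $details): ?int {
    foreach ($rows as $row) {
        $candidate = hash('sha256', $row['salt'] . $details);
        if (hash_equals($row['hash'], $candidate)) {
            return $row['supporter_id'];
        }
    }
    return null;
}
```

With 1000 incoming records and n stored rows, the per-row version computes up to 1000 × n hashes each month, against 1000 for the global-salt version.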

So I would recommend having one salt for all the bank accounts. This will prevent general rainbow tables being used to reverse your hash, but won't prevent someone creating a rainbow table specific to your application. So what would it take to store a rainbow table for all possible account numbers?

There are 10^8 possible account numbers and 10^6 possible sort codes, giving 10^14 possible (account number + sort code) combinations. A SHA-1 digest takes 20 bytes, so storing every possible hash would take 20*10^14 bytes, which is about 1819 tebibytes (TiB). So it appears that creating a rainbow table to reverse every hash would be infeasible. SHA-256, at 32 bytes per digest, would require about 2910 TiB.
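
The storage figures can be checked with a few lines of arithmetic. The helper name `tableSizeTiB` is hypothetical, and the calculation assumes 64-bit PHP integers and counts only the raw digests, ignoring any index or plaintext stored alongside them.

```php
<?php
/** Size in TiB of a table holding one digest per combination. */
function tableSizeTiB(int $digestBytes, int $combinations): float {
    return $digestBytes * $combinations / (2 ** 40); // 1 TiB = 2^40 bytes
}

$combinations = 10 ** 14; // 10^6 sort codes x 10^8 account numbers

printf("SHA-1:   %.0f TiB\n", tableSizeTiB(20, $combinations)); // SHA-1:   1819 TiB
printf("SHA-256: %.0f TiB\n", tableSizeTiB(32, $combinations)); // SHA-256: 2910 TiB
```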

It is worth noting that this will be reversible by anyone with enough computing power. Based on SHA-256 and the speed listed here, it would take a single-core computer approximately 80 days to hash all sort code/account number combinations. With a top-of-the-line modern desktop I would guess this could come down to single-digit days. If this is a concern to you, you can move to a slower hash function such as PBKDF2 (see also), which you can then configure to run as slowly as you like.
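
One way to choose an iteration count is to time PBKDF2 on your own server and scale that up to the size of the search space. The sketch below does this; the function names and the benchmark inputs are hypothetical, and the result is a single-core estimate on whatever machine runs it, so an attacker with better hardware would do correspondingly better.

```php
<?php
/** Time one PBKDF2-HMAC-SHA256 derivation, in seconds. */
function timePbkdf2(int $iterations): float {
    $start = microtime(true);
    hash_pbkdf2('sha256', '12345612345678', 'benchmark-salt', $iterations, 64);
    return microtime(true) - $start;
}

/** Single-core days needed to hash every candidate in the search space. */
function bruteForceDays(float $secondsPerHash, float $searchSpace): float {
    return $secondsPerHash * $searchSpace / 86400; // 86400 seconds per day
}

$seconds = timePbkdf2(10000);
printf("10000 iterations: %.4f s per hash\n", $seconds);
printf("All 10^14 combinations: %.0f single-core days\n", bruteForceDays($seconds, 1e14));
```

Raising the iteration count scales the attacker's cost linearly, and it scales the cost of hashing each month's incoming records by the same factor.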

Recommendation

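I would recommend hashing these values using SHA-256 or PBKDF2, with a global salt. Please see the following PHP sketch:

// Application-wide salt: generate it once and keep it out of the database.
$salt = "A Random Long String I Did Not Copy From The Internet";
$iterations = 10000; // Make this number larger for the hash to be more secure/slower

// Option 1: a single salted SHA-256 (fast, so also fast to brute-force).
function hashBankAccountSha256($accountAndSortCode, $salt) {
    return hash("sha256", $salt . $accountAndSortCode);
}

// Option 2: PBKDF2-HMAC-SHA256, returned as 64 hex characters (32 bytes).
function hashBankAccountPbkdf2($accountAndSortCode, $salt, $iterations) {
    return hash_pbkdf2("sha256", $accountAndSortCode, $salt, $iterations, 64);
}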

You can then store the result of this function in your database.

David Waters
  • You assume sort codes and account numbers are random and that all of them are valid. This is not the case (http://www.sortcode.org/bankdata/abbey/national/index.php - the website contains 26000 sort codes). One could for example pick sort codes of a popular bank, and reduce the search space to 10^9 or 10^10. – domen Feb 04 '15 at 11:26
  • @domen - Yea I was aware of that, and had a little look to try and find the number of valid sort codes (one per branch?). But my answer was pretty long already so I decided to skip it. Using PBKDF2 configured to take a second on your server should still make 10^9 expensive. – David Waters Feb 04 '15 at 20:47
  • Right, so just plain SHA-256 isn't a good recommendation in this case. – domen Feb 05 '15 at 09:49