1

Let's assume there is a database containing names and each name is hashed and the clear-text is deleted.

If an attacker were to steal this database, would he/she be able to get the list of names?

Gurzo
  • 1,117
  • 6
  • 18

3 Answers3

6

A hash function is one-way in the following sense: given the output, the fastest method to find a matching input is to try possible inputs until a match is found. You cannot get more one-way than that. However, this does not mean that hashing names and hoping for a match does not work; in fact, people names are depressingly not varied enough, it is easy to enumerate and hash them all. Since a hash function is public (no secret element) and deterministic (if you hash twice the same data, you get twice the same output), brute force on names will work. A PC with a medium-sized GPU can hash billions of strings per second.

The same problem occurs with passwords, which are limited by the average imagination of human users. When we hash passwords, we must do it with care. The same would apply to names, possibly even more so.

Thomas Pornin
  • 320,799
  • 57
  • 780
  • 949
  • The average name probably has a depressingly little amount of entropy (likely fewer than 10 bits for the vast majority of people). Even a state of the art password-hashing function would be a bad choice for storage. – Stephen Touset Jan 22 '14 at 01:36
  • would there be any solution to stop this???? – Sungwon Lee Jan 22 '14 at 03:19
  • 1
    You need to explain what you want to protect against what: people names aren't, by default, secret (or even private). What could be either private or secret is the association between people and data. How to do this right depends heavily on the data itself and the way you need to access it. – Stephane Jan 22 '14 at 08:45
1

trivially.

Names don't have enough entropy, and as Thomas Pornin said, a GPU is fast enough to let you bruteforce your way through billions of names until you find matching hashes relatively quickly.

If the database has other longer fields that happen to have a lot of entropy in them, concatenating them together and hashing the sum that with the names will increase the entropy; the longer the better chance you have of reaching a length long enough that an attacker has to go to reasonable lengths to get matches, provided you use a sufficiently strong hashing algorithm.

What my last paragraph explains is essentially salting.

The attacker is likely to thus have the salts; this in itself makes things difficult for rainbow-tables style attacks but doesn't mean attacker can't still bruteforce concatenating strings onto the salt - particularly by doing common names first they're likely not to have to try too many.

This write-up is quite excellent: https://crackstation.net/hashing-security.htm

pacifist
  • 794
  • 3
  • 8
0

There are two problems here that you need to pay attention to.

First, as mentioned in other answers, there is a limited number of possible (or likely) name values. If you run through a list of about 500k names, you're likely to find most names (either first or last). 500k names is a pretty small brute force attack since hashing all those names is not very time consuming.

Second, because you're not talking about salting your hashes, any hash can be compared against all the rows. That means that all an attacker needs to do is hash that list of 500k names once. If, however, you were to add a random salt that is prepended to the name, that list of 500k names will need to be hashed with every salt in every row in order to brute force the table. That makes for a LOT more work to reverse the column.

But, as mentioned, there aren't enough possible values for names to make this an impossible task. You would need to add at least some secret key to make the job impossible. The simplest might be to xor with a secret string but you can use any encryption algorithm.

However, even if using an encryption algorithm, you must continue to salt the entries or you data will be susceptible to a statistical attack (all "John"s will be encrypted to the same string, and based on the percentage of "John"s in the population its possible to guess which rows are "John").

But I have to wonder about your use case. Hashes are non-reversible. That means that the names in the column can only be used to compare against some type of input. That's why hashes are mostly used for either passwords or various types of digital signatures. I can't imagine what situation merits hashing a piece of data like a name.

eyalk
  • 21
  • 2