This does not have any serious benefits as far as I can tell.
To summarize the content you linked, the idea is that you don't have a row with the username and hashed password:
users = [
{"username": "admin", "salt": "PyoBT3KBnn31qDP9", "password_hash": "e53992bb5f7af6f51773ad8ea3033d66"},
{"username": "someuser", "salt": "2IY3VczmbhHyEMQJ", "password_hash": "a7d38c0493cf0ff4af602b92c4f77a81" }
]
Instead, you remove the `password_hash` field from the `users` table and store the hashes separately in a `hashes` table:
hashes = [ "e53992bb5f7af6f51773ad8ea3033d66, "a7d38c0493cf0ff4af602b92c4f77a81" ]
(Note that the article from gosquared also has a second hash and second salt, but that is pointless as well; I'll get to that later.)
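To make the scheme concrete, here is a minimal sketch of what the login check looks like once the hash no longer sits next to the username. This is my own illustration, not code from either article; plain SHA-256 stands in for whatever (hopefully slow) hash is actually used, and the digests above are dummy values, so nothing will actually match:

```python
import hashlib

users = [
    {"username": "admin", "salt": "PyoBT3KBnn31qDP9"},
    {"username": "someuser", "salt": "2IY3VczmbhHyEMQJ"},
]
# The hashes live in their own table, with no link back to any user.
hashes = {"e53992bb5f7af6f51773ad8ea3033d66", "a7d38c0493cf0ff4af602b92c4f77a81"}

def verify(username, password):
    # Look up the user's salt, hash the attempt, and accept it if the
    # result appears *anywhere* in the hashes table.
    user = next((u for u in users if u["username"] == username), None)
    if user is None:
        return False
    digest = hashlib.sha256((password + user["salt"]).encode()).hexdigest()
    return digest in hashes
```

Note that the legitimate login check and an attacker's guess do exactly the same work: one hash computation, then one membership test against the hashes table.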
If you store the username and hash together, an attacker can make a guess, for example that the password is "Pizza7!", compute the hash with the salt, and check whether it's correct. With the tables separated, an attacker has to check the result against every entry in the `hashes` table.
If you have 100 users, an attacker has to match the result against 100 hashes. But only one in 256 hashes has a matching first byte, so most candidates are rejected almost immediately: nearly all the time you're just comparing the first one or two bytes of the result against the first one or two bytes of every user's hash (with some shuffling for cache locality, that is effectively instant). That is why the author advises injecting bogus hashes, so the table becomes huge. Or maybe you just have a billion users and your table is huge anyway.
If you have a huge table, looping through every hash is not efficient. An attacker would take the first, say, five bytes of every hash in the table, sort them, and keep that in memory. That's five gigabytes for a billion hashes, which easily fits in the RAM of an ordinary computer. Then you apply binary search (also known as bisection search) to find a match in about `O(log n)` comparisons. So for a billion rows, that would be something like 30 comparisons.
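As a sketch of that attacker-side index (my own illustration, not anything from the articles; the two digests stand in for the stolen `hashes` table, and SHA-256 is again a placeholder for the site's real hash):

```python
import hashlib
from bisect import bisect_left

# Stand-in for the stolen hashes table (hex digests).
hashes_table = ["e53992bb5f7af6f51773ad8ea3033d66", "a7d38c0493cf0ff4af602b92c4f77a81"]

# Build a sorted list of 5-byte prefixes once; for a billion hashes that is
# about 5 GB and fits in the RAM of an ordinary computer.
prefix_index = sorted(bytes.fromhex(h)[:5] for h in hashes_table)

def guess_matches(password_guess, salt):
    # Hash the guess the same way the site does (placeholder fast hash here),
    # then binary-search the prefix index: O(log n), roughly 30 comparisons
    # for a billion entries.
    prefix = hashlib.sha256((password_guess + salt).encode()).digest()[:5]
    i = bisect_left(prefix_index, prefix)
    return i < len(prefix_index) and prefix_index[i] == prefix
```

The rare prefix matches can then be confirmed against the full hashes.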
So you bloated your database to a billion rows, and all the attacker needs to do extra is a handful of lookups.
But the 5 GB table is way too large for a GPU. You can't shrink the prefixes to, say, three bytes, because (almost) all possible three-byte values will occur, and regardless, three gigabytes still doesn't fit in those tiny cores. So you made it impossible to do the comparison part on a GPU. However, the hash iterations of something like PBKDF2 or Bcrypt can still be done on the GPU; only the result needs to be compared against the table in the computer's main memory. Unless you use Scrypt or Argon2, the attacker loses virtually no speed at all. You might as well increase the `cost` parameter of Bcrypt or the `iterations` parameter of PBKDF2 (both just scale up the work per guess).
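Raising the work factor is a one-line change. A minimal sketch using PBKDF2 from Python's standard library (the iteration count is purely illustrative, not a recommendation):

```python
import hashlib
import os

salt = os.urandom(16)
# Doubling the iteration count doubles the attacker's per-guess cost,
# GPU or not; no extra tables needed. 600_000 is an illustrative value.
key = hashlib.pbkdf2_hmac("sha256", b"Pizza7!", salt, 600_000)
```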
Why are Scrypt and Argon2 special? Because they're "memory hard": they already have a memory parameter that lets you require a lot of memory for every single hash computation. Sound familiar? Indeed, that's pretty much the same as using a huge table, except that this actually prevents the attacker from using a GPU at all.
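For example, `hashlib.scrypt` (available when Python is built against a recent OpenSSL) exposes that memory knob directly; each hash needs roughly `128 * n * r` bytes. The parameters below are illustrative only, not tuning advice:

```python
import hashlib
import os

salt = os.urandom(16)
# n and r set the memory cost: about 128 * n * r bytes per guess, here
# roughly 16 MiB, far more than a GPU can give each of its thousands of
# parallel cores. (Illustrative parameters, not tuning advice.)
key = hashlib.scrypt(b"Pizza7!", salt=salt, n=2**14, r=8, p=1)
```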
If you don't use any iterations, for example if you use `sha256(password+salt)`, then this scheme might help you a little, because the lookups you add are a tiny bit slower than the hash itself. If you do something sane (any slow hash), I don't see an advantage in doing this.
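To put rough numbers on the gap (a quick, hardware-dependent sketch of my own; treat the output as order-of-magnitude only):

```python
import hashlib
import os
import timeit

salt = os.urandom(16)

# One fast hash per guess: on the order of a microsecond in Python.
fast = timeit.timeit(lambda: hashlib.sha256(b"Pizza7!" + salt).digest(),
                     number=100_000) / 100_000
# One deliberately slow hash per guess: hundreds of milliseconds.
slow = timeit.timeit(lambda: hashlib.pbkdf2_hmac("sha256", b"Pizza7!", salt, 600_000),
                     number=5) / 5

print(f"sha256(password+salt):   {fast * 1e6:8.2f} microseconds per guess")
print(f"PBKDF2, 600k iterations: {slow * 1e3:8.1f} milliseconds per guess")
```

An extra in-memory lookup costs something on the order of the fast case, so it only makes a dent if your hashing was already far too fast.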
The scheme I explained above does not match the gosquared article. They used the scheme as described in the follow-up post from opine.me, where a second salt is added. If you use a secure hashing algorithm, a random salt will already result in a unique hash, so there will not be any collisions in the `hashes` table; that is not the reason for the second salt. Instead, the reason they give is this:
> If an attacker is able to both steal a user’s salt value, and insert a specific hash of their choosing into the Hashes table, they can grant themselves access to a specific user’s account
So the scenario is that an attacker can add values to the hashes table, but not to the users table. Let me repeat that: they are trying to block an attacker who is able to write values to one user-related table (`hashes`), but not to some other user-related table (`users`). Privilege separation strict enough to create a scenario in which an attacker can write to some tables but not to your users table is very rare to begin with, but I'm all for layered defenses. However, your account creation script will need to write to both tables, and any login-checking script needs to write to neither. Being able to write to one but not the other is an inconceivably unlikely scenario; there will never be a database user with such a set of permissions.
Furthermore, the author that cooked this up argues the following:
> First, I didn’t do a good job in the previous post stating my design goals. scrypt already does a perfectly fine job of letting you tune the CPU and RAM cost of completing a single hash. There’s absolutely no need for any other method than scrypt if all you want to do is tune the CPU and RAM burden on your attacker. What I want to do is two-fold:
>
> - Impose additional costs (other than RAM and CPU) on an attacker
> - Make it easier to detect if attackers start stealing your hashes
>
> [...]
So this approach focuses on increasing the disk storage and bandwidth requirement of copying hashes from a target site.
Even petabytes of data would not do much to the lookup speed. It would no longer fit in RAM, but on-disk lookups can use binary search just as well. If you can log in fast enough, an attacker can do the lookups fast enough as well. If anything, you are likely to reduce your security by making your hashing algorithm faster, so that the slow lookups don't make your users wait too long.
Buying loads of hard drives to achieve all this is possible (remember that you also need to back all of it up), and I guess if you indeed have a few dozen terabytes of password hashes, it might actually prevent an attacker from getting all of them. Anything less is not going to be useful: network speeds are fast enough, and bandwidth and storage are cheap enough. A few terabytes won't faze any attacker; there are already password dumps of 500 GB floating around the internet as torrents.
So would a, say, 40 TB `hashes` table actually work? The author's assumption is that you'll detect the breach. In practice, breaches very often go undetected for months; some companies may notice fast, but just as many will be compromised for multiple years before they notice. 40 TB takes about four days to transfer over an ordinary gigabit connection (`40*8*1024/3600/24 ≈ 3.8`). Another issue is storing the data, which would indeed be non-trivial at that size: the attackers would probably have to ask friends and still pool some money for a few extra hard drives. Nothing insurmountable, though. And with 40 TB stored on slow, spinning drives, you're almost certainly making your hashes faster (and thus weaker) to keep logins responsive under this scheme. And even if you do succeed in thwarting the attackers and they manage to get only 1 TB (which fits on any modern laptop), they still have the hashes of 1 in 40 users, plenty to be a big deal.
It's a huge amount of bulk for a very small amount of gain, assuming you don't degrade the quality (slowness) of the hashes.