Practical advice
- Use Argon2id.
- The more memory you use, the less of an advantage an attacker will have compared to your CPU.
- In absolute numbers, against 2021 GPUs:
  - 128 MiB is quite secure (but don't hesitate to go for 2 GiB or more, if you can!),
  - 32 MiB is reasonable, and
  - 4 MiB is decidedly on the low side.
- Against ASICs, you definitely want to go higher. If your adversary has the budget for Argon2 ASICs, a few gigabytes of regular RAM being used for password hashing is no excessive luxury. Also effective is adding multi-factor authentication and enforcing globally unique passwords (check against public breaches).
For less-powerful adversaries, it is also interesting to note that cracking software seems to be lagging behind and as of 2021, there is still no good Argon2 cracker available. Bcrypt and scrypt are (in turn) much better than PBKDF2-class* algorithms, mostly due to their better design but also because less effort has been put into optimizing for these strong hashes, especially when using scrypt with >16 MiB of memory.
* Anything that does plain iterations of hashing functions, so this includes custom designs that a lot of developers use(d). Bcrypt is in a higher class because "Bcrypt happens to heavily rely on accesses to a table which is constantly altered throughout the algorithm execution. This is very fast on a PC, much less so on a GPU, where memory is shared and all cores compete for control of the internal memory bus. Thus, the boost that an attacker can get from using GPU is quite reduced, compared to what the attacker gets with PBKDF2 or similar designs." --Thomas Pornin in Do any security experts recommend bcrypt for password storage?
Cracking memory-hard hashes
A hashcat release came out today (while I was wrapping up my research) with better support for scrypt (another memory-hard algorithm). I've been testing with an RX 5700 GPU which seems to cost close to $1000 in today's inflated prices and has 8 GB of VRAM. Trying to crack scrypt hashes with hashcat earlier today, it errored out at 16 MiB already (N=16×1024, r=8, p=1); with the new update, I can now crack scrypt hashes with a whopping hundred mebibytes of memory. At 64 MiB, hashcat is 17 times faster than Python's hashlib.scrypt, which uses OpenSSL's implementation. It's better than nothing, but for comparison, PBKDF2 gets a 1000 times speedup on the same CPU/GPU combination. (And, for the record, Bcrypt also gets a 17-times speed-up from CPU to GPU.)
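To relate scrypt parameters to the memory figures above, here is a minimal sketch in Python. The helper names are my own; the formula is the usual approximation of 128·N·r bytes for the main table plus 128·r·p bytes of block buffer. The explicit `maxmem` is there because OpenSSL applies a default memory cap (32 MiB, as far as I know) that larger settings would exceed.

```python
import hashlib

MIB = 1024 * 1024


def scrypt_memory_bytes(n: int, r: int, p: int = 1) -> int:
    """Approximate working memory of scrypt: table V (128*N*r) plus buffer B (128*r*p)."""
    return 128 * n * r + 128 * r * p


def scrypt_hash(password: bytes, salt: bytes, n: int, r: int = 8, p: int = 1) -> bytes:
    """Derive a 32-byte key, raising the memory cap just enough for these parameters."""
    # Add 1 MiB of headroom on top of the estimate (an assumption, not a tuned value).
    maxmem = scrypt_memory_bytes(n, r, p) + MIB
    return hashlib.scrypt(password, salt=salt, n=n, r=r, p=p, maxmem=maxmem, dklen=32)


if __name__ == "__main__":
    for n in (16 * 1024, 64 * 1024):
        print(f"N={n}, r=8, p=1 needs about {scrypt_memory_bytes(n, 8) / MIB:.0f} MiB")
```

With N=16×1024, r=8, p=1 this gives roughly 16 MiB, matching the setting at which hashcat used to error out; N=64×1024 corresponds to the 64 MiB measurement.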
For Argon2, the fastest cracking software that I can find is a CPU implementation. The only decent-looking GPU implementation that I can find can allocate up to 434 MiB before erroring out, but is slower than PHP's implementation (using libsodium if I'm not mistaken) regardless of how much memory you make it use. PHP also doesn't limit the memory usage: it will happily compute Argon2 with 13 GiB of RAM for you in as many seconds.
Theoretically
You should be able to allocate as much memory as your GPU has, even if practically no free cracking software actually supports that. Either way, by requiring constant access to main memory rather than being able to use core-local registers, you do limit a GPU's effectiveness: its memory is all shared and the bandwidth is limited. The CPU has the same limitation, but that just puts it on a more level playing field. GPU bandwidth still exceeds that of a CPU*, but it's on the same order of magnitude.
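As a rough illustration of the bandwidth argument, the sketch below uses a deliberately crude model of my own: each pass over the memory must at least write every byte once, so bandwidth divided by memory gives an upper bound on hashes per second. The bandwidth figures are the ones quoted in the footnote (10 GiB/s for a fast Xeon, 40-60 GiB/s for a Tesla K20X); real throughput will be lower on both sides.

```python
def max_hashes_per_second(bandwidth_gib_s: float, memory_gib: float, passes: int = 1) -> float:
    """Crude upper bound: every pass has to touch the whole memory at least once."""
    return bandwidth_gib_s / (memory_gib * passes)


# Figures taken from the footnote; 1 GiB per hash and a single pass are assumptions.
cpu = max_hashes_per_second(10, memory_gib=1.0)
gpu_low = max_hashes_per_second(40, memory_gib=1.0)
gpu_high = max_hashes_per_second(60, memory_gib=1.0)

if __name__ == "__main__":
    print(f"CPU bound: {cpu:.0f} hashes/s, GPU bound: {gpu_low:.0f}-{gpu_high:.0f} hashes/s")
    print(f"GPU advantage under this model: {gpu_low / cpu:.0f}x to {gpu_high / cpu:.0f}x")
```

Under this model the GPU's edge is a single-digit factor, nowhere near the 1000 times speedup seen for PBKDF2.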
ASICs are a next-level kind of attack that I did not worry about when writing the question. But I've looked into it now anyway, and GPUs seem largely ineffective with today's state of the software almost regardless of what settings you use, so we might as well consider them.
The cost of producing an ASIC seems to be mostly determined by the amount of memory you need. The Argon2 paper (section 2.1) describes it as:
We aim to maximize the cost of password cracking on ASICs. There can be different approaches to measure this cost, but we turn to one of the most popular – the time-area product [4, 16].
[...]
- The 50-nm DRAM implementation [10] takes 550 mm² per GByte;
- The Blake2b implementation in the 65-nm process should take about 0.1 mm² (using Blake-512 implementation in [11]);
It isn't stated in as many words and the paper just speaks of surface area rather than cost, but it sounds like increasing RAM capacity on an ASIC is a lot more expensive than increasing computational power is.
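To make the comparison concrete, here is a small calculation using only the two figures quoted above. It is a back-of-the-envelope sketch: it treats "GByte" as GiB and ignores that the two numbers come from different process nodes (50 nm and 65 nm), so read the result as an order of magnitude only.

```python
DRAM_MM2_PER_GIB = 550.0  # 50-nm DRAM, as quoted from the Argon2 paper
BLAKE2B_CORE_MM2 = 0.1    # 65-nm Blake2b core, as quoted from the Argon2 paper


def dram_area_mm2(memory_mib: float) -> float:
    """Silicon area needed to hold `memory_mib` of DRAM."""
    return DRAM_MM2_PER_GIB * memory_mib / 1024


def blake2b_cores_in_same_area(memory_mib: float) -> float:
    """How many Blake2b cores would fit in the area that the memory occupies."""
    return dram_area_mm2(memory_mib) / BLAKE2B_CORE_MM2


if __name__ == "__main__":
    for mib in (4, 32, 128, 1024, 2048):
        print(f"{mib:>5} MiB: {dram_area_mm2(mib):8.1f} mm², "
              f"about {blake2b_cores_in_same_area(mib):.0f} Blake2b cores' worth of area")
```

One gigabyte of memory takes about as much area as 5,500 hashing cores, which is why the memory parameter dominates the attacker's cost per parallel guess.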
So in terms of picking Argon2id parameters, the correct order is:
- First increase memory usage to as much as possible with a time cost parameter of 1**.
- Increase the parallelism as much as still possible without lowering the previous parameter, until you hit the number of cores your system has.
- Increase the time cost parameter as much as still possible without lowering the previous parameters.
The "as much as possible" means: until it either runs out of system resources (RAM, cores) or until it gets slower than your users have patience for.
It's not weird if the only parameter you tweak is the memory usage. It might get plenty slow from that already.
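The ordering above can be sketched as a small tuning loop. This is an illustration under assumptions, not a definitive procedure: `measure` is a hypothetical callable you supply that runs your actual Argon2id library with the given (memory in KiB, lanes, time cost) and returns the elapsed seconds, and doubling the memory per step is my own choice of step size. The sketch also assumes the starting configuration already fits within the budget.

```python
from typing import Callable, Tuple

# Hypothetical signature: (memory_kib, lanes, time_cost) -> elapsed seconds.
Measure = Callable[[int, int, int], float]


def tune_argon2id(measure: Measure, budget_s: float, max_memory_kib: int, cores: int,
                  start_memory_kib: int = 4 * 1024,
                  max_time_cost: int = 64) -> Tuple[int, int, int]:
    """Pick (memory_kib, lanes, time_cost): memory first, then lanes, then time cost."""
    memory_kib, lanes, time_cost = start_memory_kib, 1, 1

    # 1. Grow memory (doubling) with time cost 1 until RAM or patience runs out.
    while (memory_kib * 2 <= max_memory_kib
           and measure(memory_kib * 2, lanes, time_cost) <= budget_s):
        memory_kib *= 2

    # 2. Add lanes, never lowering memory, up to the number of cores.
    while lanes + 1 <= cores and measure(memory_kib, lanes + 1, time_cost) <= budget_s:
        lanes += 1

    # 3. Only then raise the time cost, keeping memory and lanes fixed.
    while (time_cost + 1 <= max_time_cost
           and measure(memory_kib, lanes, time_cost + 1) <= budget_s):
        time_cost += 1

    return memory_kib, lanes, time_cost


def toy_model(memory_kib: int, lanes: int, time_cost: int) -> float:
    """Made-up cost model for demonstration: 1 s per GiB per pass, speed-up over <= 4 lanes."""
    return (memory_kib / (1024 * 1024)) * time_cost / min(lanes, 4)


if __name__ == "__main__":
    print(tune_argon2id(toy_model, budget_s=0.5, max_memory_kib=2 * 1024 * 1024, cores=4))
```

In real use, replace `toy_model` with a function that times your library on the production hardware, ideally averaging a few runs.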
Parallelism is a bit complicated.
In Argon2, the two parts when setting p=2 can be computed in parallel, so if your memory parameter fits in cache or if you use less than half the available memory bandwidth, p=2 should not take any longer than p=1 on your CPU, yet should be stronger. That seems to work in practice: testing in PHP (which uses libsodium), increasing the 'threads' setting is clearly a lot faster. (Note that the paper speaks of 'lanes' and 'parallelism', which PHP then calls 'threads'...) To make sure you use your hardware to its full extent (making it comparatively harder for GPUs/ASICs to compete), raising this setting is recommended before increasing the time cost parameter (since the time cost parameter is nothing more than running the algorithm multiple times over the specified amount of memory).
Scrypt doesn't say much about its p parameter in its paper or RFC, but it seems to serve the same function. In practice, it seems scrypt implementations don't do multithreading: Hashcat and OpenSSL seem to just take p times as long. There is also an scrypt JavaScript implementation which "can multi-thread in browsers", but this only seems to work in Firefox on Linux with p=1 and p=2 (equally fast) or in Firefox on Windows with p=2 and p=4 (equally fast). Other values (e.g. p=1 and p=2 in Firefox on Windows) are linearly slower, and other browsers (such as Chromium Edge) don't seem to support it at all. You might want to look for an implementation that supports this to get a stronger hash, or use Argon2 instead.
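If you want to check how your own scrypt implementation treats p, a quick timing comparison is enough. The sketch below uses Python's hashlib.scrypt (OpenSSL-backed); the function names and the fixed password and salt are mine, and the timings you get will depend on your machine, so no expected numbers are given.

```python
import hashlib
import time
from typing import Dict, Iterable


def time_scrypt(p: int, n: int = 2 ** 14, r: int = 8) -> float:
    """Time one scrypt computation with the given parallelism parameter."""
    # Raise OpenSSL's memory cap to fit these parameters, plus 1 MiB of headroom.
    maxmem = 128 * r * (n + p) + 1024 * 1024
    start = time.perf_counter()
    hashlib.scrypt(b"password", salt=b"0123456789abcdef", n=n, r=r, p=p,
                   maxmem=maxmem, dklen=32)
    return time.perf_counter() - start


def compare_parallelism(values: Iterable[int] = (1, 2, 4), n: int = 2 ** 14) -> Dict[int, float]:
    """Return elapsed seconds per value of p."""
    return {p: time_scrypt(p, n) for p in values}


if __name__ == "__main__":
    for p, seconds in compare_parallelism().items():
        print(f"p={p}: {seconds * 1000:.1f} ms")
```

If the time grows roughly linearly with p, the implementation is computing the p parts one after another rather than in parallel.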
* "The CUDA implementation can reach about 40-60 GiB/s [...] on an NVIDIA Tesla K20X. For comparison, a fast Intel Xeon processor can only reach about 10 GiB/s." ---Panos Kal., 2017, on "Cryptography algorithms that take longer to solve on a GPU than a CPU"
** Since memory is most important, the time cost parameter should be set to 1 initially. This is not insecure: the Argon2 authors wrote "Again, there is no 'insecure value' for T" in section 6.4 of the Argon2 paper, and the Crypto Forum Research Group writes that "Argon2id [with] t=1 and 2GiB memory is [...] suggested as a default setting for all environments. [Setting] t=3 and 64 MiB memory is [...] suggested as a default setting for memory-constrained environments."
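The two suggested settings from that quote can be written down as a small configuration sketch. The class and constant names are hypothetical. Many Argon2 bindings expect the memory parameter in KiB rather than MiB, which is an assumption you should verify for your library, so the conversion is included. Parallelism is left out because the quote above doesn't state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argon2idProfile:
    name: str
    time_cost: int   # number of passes (t)
    memory_mib: int  # memory per hash in MiB

    @property
    def memory_kib(self) -> int:
        """Memory in KiB, the unit many Argon2 APIs take (check your library)."""
        return self.memory_mib * 1024


# Values as quoted from the Crypto Forum Research Group text above.
DEFAULT = Argon2idProfile("default", time_cost=1, memory_mib=2048)
MEMORY_CONSTRAINED = Argon2idProfile("memory-constrained", time_cost=3, memory_mib=64)

if __name__ == "__main__":
    for profile in (DEFAULT, MEMORY_CONSTRAINED):
        print(f"{profile.name}: t={profile.time_cost}, m={profile.memory_kib} KiB")
```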
Question's Quirks
The question's phrasing demonstrates misconceptions on the part of the author.
The proposed statement "[X MiB of memory] already makes a common consumer GPU completely ineffective" is simply false because the GPU cores can address all the memory the GPU has, but also kind of true in a practical sense because current implementations seem to break long before getting close to the GPU's actual memory limit. To put a number to this value, 1 GiB seems plenty to trigger this effect, but fixing that is a matter of software and not due to hardware limitations.
The alternative statement "The slowdown on GPUs due to increased memory requirements is a linear scale." is also wrong, since it fails to factor in the memory bandwidth difference. It's not the case that you can close the speed gap by approaching the amount of memory your cracking hardware has. On the other hand, using too low a memory parameter does not saturate your CPU's memory bandwidth and gives a GPU cracker a comparative advantage, so there is (accidentally) also a core of truth here.
As a user
Keep in mind that you can always defend your own accounts by using strong and unique passwords. Even the dumbest of password hashes is (pre-quantum) secure if your password is something like 15 random characters (picked from {a-z, A-Z, 0-9}). If you don't re-use passwords, then it also doesn't really matter if one was cracked. A password manager can help you generate and manage these passwords.
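For a sense of scale, 15 characters picked uniformly at random from {a-z, A-Z, 0-9} give 15 × log2(62) ≈ 89 bits of entropy. A minimal sketch that generates such a password and computes that figure, using only Python's standard library (the function names are mine):

```python
import math
import secrets
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits  # 62 symbols


def random_password(length: int = 15) -> str:
    """Pick each character independently with a cryptographically secure RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def entropy_bits(length: int = 15, alphabet_size: int = len(ALPHABET)) -> float:
    """Entropy of a uniformly random password: length * log2(alphabet size)."""
    return length * math.log2(alphabet_size)


if __name__ == "__main__":
    print(random_password())
    print(f"{entropy_bits():.1f} bits of entropy")
```

A password manager does the same thing for you; the point is only that the strength comes from the randomness, not from the hash.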