Are there implementations of password hashing algorithms for major frameworks that utilizes specialized hardware like GPUs/FPGAs?

Question

It is common knowledge that password cracking attempts can greatly benefit from specialized hardware such as large clusters of GPUs or FPGAs.

Are there any implementations of the commonly recommended password hashing algorithms (PBKDF2/bcrypt) for common or popular programming languages and/or frameworks such as .NET that offloads the password hashing portion of the authentication process to specialized hardware such as GPUs or FPGAs?

If there isn't such an existing library, are there any good reasons why they do not exist?

Off-the-cuff comment with no research: have you looked at the oclHashcat source code? They may well have an implementation that you can "borrow". — Polynomial, Jun 28 '13 at 14:03
@Polynomial Well, I *can* do that but I think most people would prefer to use an existing implementation actually meant for authentication in the form of a convenient library if at all possible. :) — , Jun 28 '13 at 14:13
I saw a DEFCON talk about FPGAs where one of the uses they had it at was MD5 hash cracking. — shieldfoss, Jun 28 '13 at 21:31

score 2 · Answer 1 · edited Apr 13 '17 at 12:48

2

The benefit of computing hash functions with an FPGA or GPU is that they are heavily parallel, such parallelization is not required for most authentication systems. Further more memory intense hash functions, such as bcrypt and scrypt, remove parallelization advantage that FPGAs and GPUs have over CPUs.

It is possible to implement such a feature, however it would be counter productive. The point of using bcrypt, and other key stretching algorithms, is to remove the computational advantage provided by a heavily parallelized computational resource such as a FPGA or GPU.

Related: Why can't one impermanent bcrypt in Cuda?

edited Apr 13 '17 at 12:48

Community

1

answered Jun 28 '13 at 15:26

rook

46,916
10
92
181

Why is it counter productive? bcrypt can be computed rather efficiently with FPGAs. PBKDF2 can be computed efficiently with GPUs. An attacker is going to be using such specialized hardware anyway. Why can't someone use the same hardware to increase the number of rounds each password is hashed to make things much more difficult...? – Jun 28 '13 at 15:38
@Terry Chia There is a very very good reason for this. Perhaps you should read the link at the bottom of my post more carefully and read my update. – rook Jun 28 '13 at 16:19
Although a GPU might not help when executing bcrypt once, I would think that an algorithm which could take maximum advantage of the hardware on which it was being "legitimately" used would be a good thing. The better an algorithm fits the abilities of a production machine, the smaller the advantage an attacker would be able to achieve by optimizing hardware for the task of password-breaking. – supercat Oct 19 '14 at 22:24

score 2 · Accepted Answer · answered Jun 28 '13 at 17:24

As @Rook points out, most of the advantage of specialized hardware such as GPU is through parallelization. A single GPU core, by itself, is quite feeble; it is clocked at a frequency lower than that of the main CPU, and it will compute only one operation per clock cycle (at best, and with high latency). However, a GPU includes hundreds of cores all dancing simultaneously.

Now, it so happens that usual password hashing functions are inherently sequential. See for instance PBKDF2: each output chunk is computed as the end result of a sequence of U_i values, where U_i+1 is obtained by processing U_i with a PRF (normally HMAC). This cannot be made parallel. If you want to compute a single PBKDF2 instance on a GPU, then it will use only one core on that GPU, and this will be very slow. Indeed, a typical GPU core can launch one instruction per clock cycles, but the result will be available only ten or twenty cycles later, so the GPU is used at its full power only if it has thousands of tasks to run in parallel.

The attacker benefits from GPU because, by definition, the attacker has a lot of potential passwords to try. So he can use parallel computing to its full power; brute forcing of passwords is an embarrassingly parallel problem. Indeed, since each GPU core The defender, on the other hand, does not have a lot of hashing to do: only one per incoming client at a time. We could imagine a very busy server with, at any time, hundreds of clients trying to open a session, and that might be amenable to optimization with a GPU, but this is not very realistic.

So, there is no ready-to-use implementation of an authentication framework which offloads the PBKDF2 or bcrypt cost to a GPU because it would not work. In the context of authenticating incoming clients (the defender's situation), the best hardware to use is the CPU, not a GPU.

That being said, this is really because PBKDF2 and bcrypt (and also scrypt and most other functions of the same kind) are sequential. One could design password hashing functions which can be made parallel. As an illustration, imagine a slow password hashing function designed like this:

Password is pw, salt is s.
For i = 1 to 10000, define V_i = PBKDF2(pw||i, s).
Final hash value is SHA-256(V₁ || V₂ || ... || V₁₀₀₀₀).

In the case of that function, the bulk of the computational effort is the ten thousands of PBKDF2 instances, and these can be optimized with a GPU, even if you have only a single password to hash.

(Caution: the function above is presented only as a speculative illustration. Don't believe that it is secure ! This has not undergone any kind of review by lots of trained cryptographers during several years.)

There is an ongoing competition for defining new password hashing primitives, in the same spirit as the AES, SHA-3 or eSTREAM efforts. If you have some nifty ideas about defining a password hashing function amenable to parallelism, then, by all means, consider submitting a candidate.

Just an afterthought - is there any specialized hardware that could potentially be used to accelerate _slow hash functions_, maybe something similar to Bitcoin miners? Or it's just not worth it and investing into more CPUs/cores would yield better bang for the buck? I'm specifically thinking here of bcrypt/scrypt functions. Cheers! — TildalWave, Jun 28 '13 at 18:00
For the serious attackers against bcrypt, FPGA are best (well, ASIC are even better, but you first test the design on FPGA). Good FPGA have embedded RAM blocks, which allow for fast parallel bcrypt implementations. But the initial investment is rather high (it takes some time to build and integrate a working FGPA implementation) so mere amateurs prefer to use CPU or GPU. For scrypt, a plain old CPU in a PC is supposed to be the best architecture (that's the point of scrypt, really). — Tom Leek, Jun 28 '13 at 18:45
Even if existing password hashing algorithms are sequential when run for a single password, that would not preclude the possibility of designing a password-hash function that was designed to be GPU-friendly. I suspect an off-the-shelf GPU running algorithms that are optimized for it could perform more operations per second per dollar than an FPGA-based approach, thus putting legitimate users on par with attackers. — supercat, Dec 18 '14 at 16:54
There are proposals that support heavy parallelism, and you can always turn a CPU-hard (not memory-hard) function parallel by defining _h'(p,s) = h(p,s)_ XOR _h(p,s+1)_ XOR _h(p,s+2)_... (_p_ = password, _s_ = salt). Whether this brings the defender on par with the attacker remains to be seen: it depends on whether the defender has GPU (e.g. on his server). There is also all the debate about buying vs running costs (e.g. GPU tend to use a lot of energy, and some people argue that in the long run, power consumption is what matters for the average attack cost). — Tom Leek, Dec 18 '14 at 17:08
@TomLeek: In asking http://security.stackexchange.com/questions/68512/does-password-hashing-busy-work-need-to-be-cryptographically-secure (I'd be interested in your thoughts there) my thought is that attackers' advantage could be minimized by having defenders use algorithms that are optimized for whatever hardware they have on hand. Energy consumption is an issue I hadn't considered, though I wouldn't think that a GPU would be worse than an FPGA when running algorithm tailored to the GPU's strength; while more energy-efficient logic costs more than less-efficient logic, the difference... — supercat, Dec 18 '14 at 17:21
...in price is usually hot huge; if reducing energy consumption reduces heat-sinking requirements, that could be a net cost savings. — supercat, Dec 18 '14 at 17:29

score 0 · Answer 3 · answered Jun 13 '20 at 19:09

As the other answers here have explained, password hash functions should be designed to be nearly as efficient on the general purpose hardware used by authentication services as it is on the specialized hardware used by attackers. It is also very unusual for a service to be doing so many password hashes in parallel that they need to scale it up or out.

But bcrypt has now been implemented for both GPUs and for FPGAs. See Bcrypt password cracking extremely slow? Not if you are using hundreds of FPGAs!.

The GPU implementation described there is just barely faster than the CPU implementation. But the FPGA implementation is much more cost-effective, and takes over an order of magnitude less power. But so far it only seems to run on discontinued FPGA boards.

Also note that the Auth0 service is one of the rare ones that does have a business need to do lots of authentications in parallel. So they came up with the BaaS (Bcrypt as a Service) approach described here: Parallelizing and auto-scaling bcrypt - Auth0 Engineering and available as open source here: baas: Node.js implementation of Bcrypt as a micro service.. I think it still relies on standard Node implementations of bcrypt, and isn't optimized for GPUs or other special hardware,

score 0 · Answer 4 · answered Jun 29 '13 at 06:08

Have a look at bitcoin mining code -- their whole process is based on calculating SHA-256 hashes, and I know that there exists not only GPU implementations, but also FPGAs and now even ASICs appearing.

An ASIC-based device available today for $1300 can blow through about 66,300,000,000 SHA-256 hashes per second. An ATI/AMD 7970 video card can do about 687,000,000 hashes per second.

Are there implementations of password hashing algorithms for major frameworks that utilizes specialized hardware like GPUs/FPGAs?

4 Answers4