As @Rook points out, most of the advantage of specialized hardware such as a GPU comes from parallelization. A single GPU core, by itself, is quite feeble; it is clocked at a frequency lower than that of the main CPU, and it will compute only one operation per clock cycle (at best, and with high latency). However, a GPU includes hundreds of cores all dancing simultaneously.
Now, it so happens that usual password hashing functions are inherently sequential. See for instance PBKDF2: each output chunk is computed as the end result of a sequence of U_i values, where U_{i+1} is obtained by processing U_i with a PRF (normally HMAC). This cannot be made parallel. If you want to compute a single PBKDF2 instance on a GPU, then it will use only one core on that GPU, and this will be very slow. Indeed, a typical GPU core can launch one instruction per clock cycle, but the result will be available only ten or twenty cycles later, so the GPU is used at its full power only if it has thousands of tasks to run in parallel.
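To make the sequential dependency concrete, here is a minimal sketch of the first PBKDF2 output block, written out by hand with HMAC-SHA256 as the PRF. The function name `pbkdf2_first_block` is made up for this illustration; in practice you would simply call `hashlib.pbkdf2_hmac`.

```python
import hashlib
import hmac


def pbkdf2_first_block(password: bytes, salt: bytes, iterations: int) -> bytes:
    """Compute the first PBKDF2-HMAC-SHA256 output block, one U_i at a time."""
    # U_1 = PRF(password, salt || INT(1)); INT(1) is the 4-byte big-endian
    # index of the output block.
    u = hmac.new(password, salt + (1).to_bytes(4, "big"), hashlib.sha256).digest()
    acc = int.from_bytes(u, "big")
    for _ in range(iterations - 1):
        # U_{i+1} = PRF(password, U_i): this step cannot start before the
        # previous one has finished, so the loop cannot be spread over cores.
        u = hmac.new(password, u, hashlib.sha256).digest()
        acc ^= int.from_bytes(u, "big")  # the block is the XOR of all U_i
    return acc.to_bytes(32, "big")
```

With a 32-byte output, this should match `hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=32)`.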
The attacker benefits from a GPU because, by definition, the attacker has a lot of potential passwords to try. So he can use parallel computing to its full power; brute forcing of passwords is an embarrassingly parallel problem. The defender, on the other hand, does not have a lot of hashing to do: only one per incoming client at a time. We could imagine a very busy server with, at any time, hundreds of clients trying to open a session, and that might be amenable to optimization with a GPU, but this is not very realistic.
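The attacker's side of this can be sketched as follows. This is only an illustration of the structure: Python threads stand in for GPU cores, and `SALT`, `ITERATIONS`, `hash_password` and `find_password` are hypothetical names and values chosen for the example, not taken from any real cracking tool.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Sequence

SALT = b"example-salt"   # hypothetical salt, for illustration only
ITERATIONS = 1000        # deliberately low so that the example runs quickly


def hash_password(candidate: str) -> bytes:
    # One complete (internally sequential) PBKDF2 computation per candidate.
    return hashlib.pbkdf2_hmac("sha256", candidate.encode(), SALT, ITERATIONS)


def find_password(target: bytes, candidates: Sequence[str]) -> Optional[str]:
    # Candidates do not depend on each other, so they can all be hashed
    # at the same time: one independent task per candidate.
    with ThreadPoolExecutor() as pool:
        digests = pool.map(hash_password, candidates)
        for candidate, digest in zip(candidates, digests):
            if digest == target:
                return candidate
    return None
```

Each individual hash is still computed sequentially; the parallelism comes only from having many candidates. A server verifying a single login has just one such task to run.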
So, there is no ready-to-use implementation of an authentication framework which offloads the PBKDF2 or bcrypt cost to a GPU because it would not work. In the context of authenticating incoming clients (the defender's situation), the best hardware to use is the CPU, not a GPU.
That being said, this is really because PBKDF2 and bcrypt (and also scrypt and most other functions of the same kind) are sequential. One could design password hashing functions which can be made parallel. As an illustration, imagine a slow password hashing function designed like this:
- Password is pw, salt is s.
- For i = 1 to 10000, define V_i = PBKDF2(pw||i, s).
- Final hash value is SHA-256(V_1 || V_2 || ... || V_10000).
In the case of that function, the bulk of the computational effort is the ten thousand PBKDF2 instances, and these can be optimized with a GPU, even if you have only a single password to hash.
(Caution: the function above is presented only as a speculative illustration. Don't believe that it is secure! It has not undergone the years of review by many trained cryptographers that such a function requires.)
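For what it is worth, the illustrative function could be sketched like this, with the same caveat: it is a toy, not something to deploy. The name `parallel_hash`, the 4-byte big-endian encoding of i, the choice of HMAC-SHA256 inside PBKDF2 and the per-instance iteration count are all assumptions made for the example; the description above does not fix them. Threads again stand in for GPU cores.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def parallel_hash(pw: bytes, salt: bytes, lanes: int = 10000,
                  iterations: int = 100) -> bytes:
    def lane(i: int) -> bytes:
        # V_i = PBKDF2(pw || i, s); the encoding of i is an assumption.
        return hashlib.pbkdf2_hmac(
            "sha256", pw + i.to_bytes(4, "big"), salt, iterations
        )

    # The V_i are independent of each other, so they can be computed
    # concurrently even though there is only one password to hash.
    with ThreadPoolExecutor() as pool:
        values = list(pool.map(lane, range(1, lanes + 1)))

    # Final hash value: SHA-256(V_1 || V_2 || ... || V_lanes).
    return hashlib.sha256(b"".join(values)).digest()
```

The final SHA-256 call is the only step that has to wait for everything else, and it is cheap compared to the PBKDF2 instances.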
There is an ongoing competition for defining new password hashing primitives, in the same spirit as the AES, SHA-3 or eSTREAM efforts. If you have some nifty ideas about defining a password hashing function amenable to parallelism, then, by all means, consider submitting a candidate.