
First of all, my understanding of the p parameter in scrypt is that it multiplies the amount of work to do, but in such a way that the additional workloads are independent of each other and can be run in parallel. With that interpretation of p out of the way: why is the recommended value still 1? More generally, why is it a good thing that key-stretching algorithms are not parallelizable?

From the point of view of an attacker trying to crack a password, it doesn't matter whether an algorithm is parallelizable. After all, even if the entire algorithm is sequential, the attacker can just crack several different passwords in parallel.
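To make that point concrete, here is a minimal sketch (using Python's built-in `hashlib.scrypt`; the salt, parameters and candidate list are made up for illustration) of an attacker parallelizing across guesses even though each individual scrypt call is strictly sequential:

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor

# Toy stand-in for a leaked credential: a salt plus the scrypt hash of "hunter2".
SALT = b"0123456789abcdef"
PARAMS = dict(n=2**15, r=8, p=1, maxmem=64 * 1024 * 1024, dklen=32)
TARGET = hashlib.scrypt(b"hunter2", salt=SALT, **PARAMS)

def try_guess(guess: str) -> bool:
    # Each call here is sequential internally (p = 1)...
    return hashlib.scrypt(guess.encode(), salt=SALT, **PARAMS) == TARGET

if __name__ == "__main__":
    candidates = ["password", "letmein", "hunter2", "correct horse battery staple"]
    # ...but nothing stops the attacker from testing many guesses at once.
    with ProcessPoolExecutor() as pool:
        for guess, hit in zip(candidates, pool.map(try_guess, candidates)):
            if hit:
                print("cracked:", guess)
```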

I understand that scrypt being memory-hard makes it difficult to use GPUs for cracking. GPUs have much greater combined computational power across their many weak cores than CPUs do, but their memory bandwidth is roughly comparable, so memory-hardness levels the playing field between authentic users on a CPU and attackers on a GPU.
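As a rough back-of-the-envelope calculation (the parameters and the 16 GiB card below are just illustrative assumptions; scrypt's scratchpad is roughly 128 · r · N bytes per instance):

```python
# Approximate per-instance scratchpad for scrypt: about 128 * r * N bytes.
def scrypt_memory_bytes(n: int, r: int) -> int:
    return 128 * r * n

per_instance = scrypt_memory_bytes(n=2**18, r=8)   # ~256 MiB
gpu_ram = 16 * 1024**3                              # assume a 16 GiB card

print(per_instance // 2**20, "MiB per instance")
print(gpu_ram // per_instance, "instances fit in GPU memory")
# Only ~64 instances fit, so a GPU with thousands of cores leaves most of
# them idle (or contending for memory bandwidth) when attacking scrypt.
```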

However, subdividing an scrypt workload that accesses 256 MB of RAM into 4 parallel scrypt workloads accessing 64 MB each would still consume the same total memory bandwidth for an attacker, and therefore run at the same throughput, while running about 4 times faster for an authentic user on a quad-core (or better) CPU.
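For concreteness, this is how those two parameter choices would look with Python's built-in `hashlib.scrypt` (the numbers are just the ones from the paragraph above; note that, as far as I know, this particular implementation runs the `p` lanes one after another, so the 4× speedup for the defender would need an implementation that actually schedules the lanes on separate cores):

```python
import hashlib

password, salt = b"correct horse battery staple", b"0123456789abcdef"

# One big sequential lane: 128 * 8 * 2**18 bytes ~= 256 MiB of scratchpad.
one_lane = hashlib.scrypt(password, salt=salt, n=2**18, r=8, p=1,
                          maxmem=512 * 1024 * 1024, dklen=32)

# Four independent lanes of 128 * 8 * 2**16 bytes ~= 64 MiB each.
# The total mixing work (proportional to N * r * p) is the same as above,
# but each lane could in principle run on its own CPU core.
four_lanes = hashlib.scrypt(password, salt=salt, n=2**16, r=8, p=4,
                            maxmem=512 * 1024 * 1024, dklen=32)
```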

Is there any fundamental flaw in my logic? Why is the recommended value still p = 1? Is there some downside to increasing p that I'm not seeing?

negamartin
  • If you're running it on a server, which is the common use case, then you probably already have parallelization because of the many users connecting to the system. That might be a good reason *to keep things simple*. But as I'm not *implementing* such servers at the moment, this is just an educated guess. – Maarten Bodewes Jul 18 '20 at 18:48

1 Answer


There are two reasons I can think of:

First, because it weakens the main use case. An explicit design goal of scrypt is to make parallelized cracking hardware expensive - and by "hardware" here, we do not mean merely GPUs, but also FPGAs and, ultimately, custom ASICs. Building fast-access RAM close to the processor cores that do the calculation is what drives the cost up.

If the workload can be subdivided, it's more likely that it can be parallelized by an attacker, which reduces the cost of the hardware necessary to attack scrypt at scale.

For more background on this, see Percival's slides from his talk announcing scrypt, in which the hardware cost trade-off is explicitly discussed.

Second (and this is just speculation), it's not yet widely needed - perhaps because most installations are either not yet using scrypt at the scale that would require more parallelization for performance reasons, or else they're big enough (like Facebook) to afford the extra hardware necessary to manage any potential authentication "storms" (many people trying to authenticate at once).

More generally, I would dispute your claim that resistance to parallelization doesn't matter from the attacker's perspective. As an attacker, I can tell you that such resistance is significantly frustrating: it takes more resources to perform the same attack, which forces me to be much more selective about which attacks I run and to wait longer for the results.

In fact, one of the other goals of reducing parallelization is to make it harder to crack passwords in the aggregate - attacking an entire leak with thousands or even millions of hashes. The techniques used to optimize such attacks are dramatically hampered by resistance to parallelization. Attacking a million MD5 hashes is a cakewalk; attacking a million scrypt hashes is a slog. Only attacks like correlation (matching passwords reused by the same user in another, already-cracked leak) can make a dent in large sets of such hashes with anything approaching efficiency.

Which was Percival's goal. :)

Royce Williams
  • Note that if you try to run the ROMix instances in parallel, then you need a lot of memory, since `p` multiplies the memory footprint, too. – kelalaka Jul 12 '20 at 20:51