
I am generating passwords by selecting random words from a dictionary and computing the entropy as entropy = words_in_passphrase * log(dictionary_size) / log(2). This algorithm assumes words are selected from the dictionary with uniform distribution.
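For example, with purely illustrative numbers (a 7776-word list and an 80-bit target, neither of which is my real setup), the formula works out like this:

use strict;
use warnings;
use POSIX qw(ceil);

# Illustrative numbers only -- not my actual dictionary or target.
my $dictionary_size = 7776;   # e.g. a Diceware-sized list
my $target_entropy  = 80;     # bits

my $entropy_per_word = log ($dictionary_size) / log (2);   # ~12.925 bits
my $words_needed     = ceil ($target_entropy / $entropy_per_word);

printf "%.3f bits per word => %d words for at least %d bits\n",
    $entropy_per_word, $words_needed, $target_entropy;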

I think the design principle is sound, but I'm not sure about the implementation. I'm reading enough bytes from /dev/urandom to express an index into the @words dictionary, and converting those bytes into an integer somewhat awkwardly.

The code appears to work correctly, but are there any caveats to Perl or /dev/urandom which will skew the results? Full source.

use POSIX qw(ceil);   # ceil() is used below and is not a Perl built-in

# @words (the word list) and $entropy (the target entropy in bits)
# are populated before this point.

my $entropy_per_word = log (@words) / log (2);

my $pw = "";
my $pw_entropy = 0;

# maybe /dev/random ?
open my $random_source, '<', "/dev/urandom" or die "Could not open /dev/urandom";

binmode $random_source or die $!;

my $random_buffer = "";

my $random_bytes = ceil ($entropy_per_word / 8);

die "Tried to read no random bytes." if ($random_bytes < 1);

while ($pw_entropy < $entropy)
{
    die "Could not read random bytes"
    if ($random_bytes !=
        read ($random_source, $random_buffer, $random_bytes));

    my @bytes = unpack 'C ' x $random_bytes, $random_buffer;

    my $n = 0;

    foreach (@bytes)
    {
        $n *= 256;
        $n += $_ * 1;
    }

    next if ($n >= @words); # don't use %, it will bias the randomness

    $pw .= ' ' if ("" ne $pw);

    foreach (split //, $words[$n])
    {
        # sprinkle some capitalization for good measure
        $_ = uc if (rand > 0.8);

        $pw .= $_;
    }

    $pw_entropy += $entropy_per_word;
}
spraff

1 Answer


You are right to use /dev/urandom. Don't use /dev/random: it is not better than /dev/urandom, and in some ways it is worse. This page is a nice summary on that subject.

Your entropy calculation is correct. Entropy is a measure of what the password generation process could produce, not of what it has produced; hence, if you have n words in your list and pick each word randomly and uniformly, then you get exactly log n bits of entropy per word (base-2 logarithm here). The entropy is not reduced if you happen to pick the same word several times. More detailed explanations are available in this answer.
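To make that concrete (the list size and passphrase length below are arbitrary examples, not values taken from your script):

use strict;
use warnings;

my $n     = 7776;   # example word-list size
my $count = 6;      # example passphrase length, in words

my $per_word = log ($n) / log (2);    # ~12.925 bits per word
my $total    = $count * $per_word;    # ~77.5 bits

# The total is the same whether the six words are all distinct or one
# word happened to be drawn six times: the attacker still faces
# 7776**6 equally likely sequences.
printf "%.3f bits per word, %.1f bits for %d words (%.3g sequences)\n",
    $per_word, $total, $count, $n ** $count;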

Your looping test:

next if ($n >= @words); # don't use %, it will bias the randomness

is correct. You avoided the bias induced by the modulo operation. However, there is a slight inefficiency here, in that you generate an integral number of bytes. For instance, if your list contains, say, 1000 words, then you will produce sequences of two bytes, hence a number in the 0..65535 range. You will have to loop about 65.5 times on average (65536/1000) before hitting a number in the 0..999 range. Depending on your context, this may or may not matter; modern computers are very fast, and if the password generation is human-driven (you produce a single password for a human user who is waiting for it), then the computer has a lot of computing power to spare, and the dozens or hundreds of loops won't even be noticeable.

If you want to be more efficient, there are mainly two improvements that can be made:

  • Truncate the byte sequence to the minimal length in bits. For instance, if you have a 1000-word list, the minimal bit length is 10 (because 2^10 = 1024). This would entail masking out (i.e. clearing) the top 6 bits of the 16-bit random integer. In that specific case, this would mean that looping would happen only in 2.4% of cases, i.e. quite rarely. On a more general basis, truncation to the correct bit length ensures that the average number of loop iterations will be no more than 2.

  • Use a combination of exclusion (when the number is too high) and modulo reduction. With a 1000-word list, this would mean generating a number in the 0..64999 range, and then reducing modulo 1000: the next clause would be triggered only when the 16-bit integer falls in the 65000..65535 range (i.e. not often at all), and the modulo would not induce a bias because 65000 is a multiple of 1000. You might want to take inspiration from the documentation for Java's java.util.Random.nextInt(int bound) method (the documentation includes a code excerpt which is enlightening; this is Java, but the concept applies to other languages as well). Both approaches are sketched below.
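In Perl, the two reductions could look roughly like this (a sketch only: the 1000-word list, the 16-bit draw and the helper names are just the illustrative figures from above, not code from your script):

use strict;
use warnings;

my $list_size = 1000;   # illustrative, as in the example above

# 1. Mask to the minimal bit length: 2^10 = 1024 >= 1000, so keep 10 bits.
sub index_by_masking {
    my ($draw16) = @_;                 # $draw16: uniform in 0..65535
    my $n = $draw16 & 0x3FF;           # clear the top 6 bits, keep 0..1023
    return undef if $n >= $list_size;  # reject ~2.4% of draws and retry
    return $n;
}

# 2. Reject only the top sliver, then reduce modulo the list size.
#    65000 is the largest multiple of 1000 that fits in 16 bits, so the
#    modulo introduces no bias.
sub index_by_modulo {
    my ($draw16) = @_;
    my $limit = 65536 - (65536 % $list_size);   # 65000
    return undef if $draw16 >= $limit;          # reject ~0.8% of draws and retry
    return $draw16 % $list_size;
}

Either helper keeps the spirit of your existing `next` test: whenever it returns undef, the caller simply draws two fresh bytes and tries again.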

I would advise against the random capitalisations: they don't directly hurt security, but they consume human memory resources for a comparatively mediocre entropy gain. That is, if by not using that capitalisation, you make enough room in your brain to remember a couple of extra words, then that's probably a better bargain. Furthermore, having an all-lowercase password makes for smoother password entry, especially on smartphones.

Thomas Pornin
  • Wow, that link about `/dev/urandom` is awesome. I did not think of that. But why do you argue that you can count word entropy? One way to get entropy is by complexity, and complexity can be estimated with compression. A simple test: `echo correct horse battery staple | gzip -c` gives 49 bytes, while `echo horse horse horse horse | gzip -c` gives 29 bytes. – grochmal Oct 22 '16 at 14:29
  • 3
  • Compression does not estimate entropy. At best, it can put a maximum on entropy, but only if you take _averages_. You cannot do that on a single output. Consider the following compression algorithm: if the input is exactly "correct horse battery staple", produce one bit of value 0; otherwise, produce one bit of value 1 followed by the input. This is a well-defined compression algorithm, and it reduces "correct horse battery staple" to 1 bit. Would it imply that the entropy of "correct horse battery staple" is 1 bit? – Thomas Pornin Oct 22 '16 at 14:43
  • 1
  • At best, you could generate all possible passwords from a given generator, compress all of them (independently of each other), and compute the average resulting length. This would be an upper bound for the password entropy (but not necessarily a good estimate). The important point is that entropy is a property of the process, not the result, and it measures what the attacker does not know. Mathematically, this makes sense only _on average_, not on a single example. – Thomas Pornin Oct 22 '16 at 14:45
  • Cool, I fell into my favorite trap: *entropy is a loaded word*. I used the *thermodynamic* definition. I am being annoying, I know, but I will pester you with one more point on complexity while we are at it: I am confident that using a general compressor (gzip uses a Huffman table) you can approximate Kolmogorov complexity, and you should be able to estimate entropy based on that, right? ( [I'm thinking of this paragraph](https://en.wikipedia.org/wiki/Kolmogorov_complexity#Relation_to_entropy) ) My guess is that my Kolmogorov complexity estimator is not good enough in the example. – grochmal Oct 22 '16 at 15:02
  • Compression can give an upper bound on entropy, provided you take care to do it on sufficiently many samples, but it often overestimates things. For instance, a general compression algorithm does not "know" that you generate passwords from a relatively short list of words (Huffman tables may pick up that some _letters_ are more frequent than others, but not _words_). More generally, consider a PRNG that produces a 1-gigabyte stream from a 128-bit key: the compression is unlikely to find that 128-bit key. – Thomas Pornin Oct 22 '16 at 20:05
  • I finally got it, thanks. I now understand what you mean by the overestimation when using compression. Thanks for keeping up the discussion with this annoying person for so long. – grochmal Oct 22 '16 at 21:58