6

Intro

From XKCD #936: Short complex password, or long dictionary passphrase?,

After some search for a tool that generate random pass phrases, I've initiate my own...

Take already present dictionary on my desk: /usr/share/dict/american-english and see:

wc -l /usr/share/dict/american-english
98569

After a quick look, I see many 's termination and names that begin with a capitalized letter.

sed -ne '/^[a-z]\{4,99\}$/p' /usr/share/dict/american-english | wc -l
63469

Oh, there is less than 65536, As I can't read only 15.953bit, I will drop this down to 15bits index (using pseudo random as this could be sufficient for now.).

Than with 5 words, I could compute a 75 bits passphrase:

#!/usr/bin/perl -w
use strict;

open my $fh, "</usr/share/dict/american-english" or die;
my @words = map { chomp $_; $_ } grep { /^[a-z]{4,11}$/ } <$fh>;
close $fh;

while (scalar @words > 32768 ) {
    my $rndIdx=int( rand(1) * scalar @words );
    splice @words, $rndIdx, 1 if $words[$rndIdx]=~/s$/ || int(rand()*3)==2;
}

open $fh, "</dev/random" or die;
$_='';
do { sysread $fh, my $buff, 10; $_.=$buff; } while 10 > length;
$_ = unpack "B80", $_;

s/([01]{15})/print "  ".$words[unpack("s",pack("b15",$1))]/eg;
print "\n";

This could produce output like:

  value  nationally  blacktopped  prettify  celebration

The perl script

I wrote a little perl script, passphrase.pl:

$ ./passphrase.pl -h
Usage: passphrase.pl [-h] [-d dict file] [-i mIn length] [-a mAx length]
   [-e entropy bits] [-r random file] [-w words] [-l lines] [lines]
Version: passphrase.pl v1.5 - (2013-07-05 08:34:21).
    -h           This help.
    -l num       number of phrases to generate  (default: 1)
    -w num       number of words by phrase  (default: 5)
    -e bits      Entropy bits for each words (default: 15)
    -d filename  Dictionary file (default: /usr/share/dict/american-english)
    -s filename  Dict file to save after initial drop (default: none)
    -i length    Minimal word length (default: 4)
    -a length    Maximal word length (default: 11)
    -r device    Random file or generator (default: /dev/urandom)
    -q           Quietly generate lines without computations.

The default output look like:

With 5 words over 32768 (     15 entropy bits ) = 1/3.777893e+22 -> 75 bits.
With 5 words from 56947 ( 15.797 entropy bits ) = 1/5.988999e+23 -> 78.987 bits.
  3.736 206.819 foggier     enforced    albatrosses loftiest    foursquare  

First line show count of uniq word found in dictionary, dropped down to 2^Entropy. Second line show initial count of uniq word, and compute theorical entropy based on this.

Each output lines begin with two values, The first is the Shanon's entropy and a flat entropy based on number of character in the whole line power 26.

Usage and human entropy reduction

The answer from David Cary confirm that this calcul is very approximative and difficult to represent, but give some good evaluations and a way of thinking:

3 Questions

.1 What's minimal length for one word? Is 4 chars sufficient? How to compute entropy for a 4 letter word?

In plain alphabet a letter is 1/26 -> 4.7bits, but the following letter is generaly a vowel so 1/6 -> 2.5bits!?

If I'm right, a 4 letter word could not represent more than 14.57bits??

.2 Some could try to run this several time to obtain some choice:

The way I use this look like:

passphrase.pl -d /usr/share/dict/american-english -l 5
With 5 words over 32768 (     15 entropy bits ) = 1/3.777893e+22 -> 75 bits.
With 5 words from 56195 ( 15.778 entropy bits ) = 1/5.603874e+23 -> 78.891 bits.
  3.711 211.520 bittersweet damsons     snarkiest   distillery  keyboard
  3.894 188.018 outdo       caliphs     junction    uneventful  inflexible
  3.920 211.520 contemplate gripped     capitols    plagiarizes obtusely
  3.642 155.115 shark       procured    espied      amperage    goalie
  3.718 150.414 drunken     sulked      derisory    influx      smear

and choose in this bunch 5 words with is human sensibility:

bittersweet gripped distillery derisory influx

This will reduce entropy in that:

sexy (and known) words would have more chance to be chosen.

Anyway if excellently detailed, David Cary's answer don't match my question: As I do not make a bunch from a reduced list, I make a random list of 20 to 30 words, then human will make his choice, regarding his knowledge and feeling at this time. If there is 2 unknowns words, human could maybe learn one and drop the other...

From an attacker point of vue, there is near no way to create a reduced dictionary that contain only words known by target user, mostly if I could imagine user is able to learn one word more...

In fine, there a 5 words on a bunch of 63'469 words ('15.953bit' entropy). I don't imagine a way an attacker could reduce this, knowing the fact: 5 words was chosen on a random bunch of 30 words.

So this will theoricaly drop overall entropy down! In my mind this is negligible, but I can't represent this by numeral argumentation...

2 Answers2

13

Entropy is a measure of the password generation process. Suppose that you have a list of 32768 words to choose from. You select 5 words randomly and uniformly from that list (the words are chosen independently of each other, so you might get twice the same). Then you have exactly 75 bits of entropy. Your password generation process can result in precisely 275 distinct passphrases which all have exactly the same probability of being selected. That's 75 bits of entropy, without any ambiguity.

That some words are short and others are long, or some are plurals, or some are sexier than others, has no influence whatsoever on your entropy. Entropy is a property of the generation process, and your generation process pays no attention to the length or sexiness of words. Average human users, when left to their own meat-based devices (their brains), tend to select some words more often than others; they are bad at uniform randomness. For them, computing the actual "entropy" is hard because we don't really know how biased they are. But this has no bearing on your generator, which does not use a brain, but /dev/random. /dev/random does not find some words more attractive than others; its tastes are simpler.

(Speaking of which, you should use /dev/urandom, not /dev/random. See this.)

Tom Leek
  • 168,808
  • 28
  • 337
  • 475
  • No, I'm pretty sure in my second sample, using 450bits to *humanly* choose 5 words would drop entropy down **under** 75bit in fine. This seem paradoxal, but... There must exist a mathematical demonstration. – F. Hauri - Give Up GitHub Apr 07 '13 at 21:07
  • 1
    While I could use this to generate sometime (no more than 10 per a day) *one* passphrase, with the need of only 10 bytes, is there a good reason to *not* use `/dev/random` and to prefer `/dev/urandom` ?? – F. Hauri - Give Up GitHub Apr 07 '13 at 21:13
  • @F.Hauri - Tom's answer agrees with that. He says that for humans, "computing actual entropy is hard" – Rory Alsop Apr 08 '13 at 11:06
  • `/dev/random` may block for indefinite amounts of time, which is always inconvenient, and it does so because whoever implemented it thinks of entropy as if it was gasoline, which is, at best, a flawed way of thinking. – Tom Leek Apr 08 '13 at 12:25
  • Isn't "real" entropy (as opposed to psuedo random made up entropy) similar to gasoline in that you can only "pump" a finite amount of it out of your system until you need to go searching for more, which can take some time? – Johnny May 13 '13 at 21:10
  • @Johnny It's more like solar energy: mathematically speaking there's only a finite amount of it, but you're if you have access to it at all then you're never realistically going to reach that theoretical limit. – Gilles 'SO- stop being evil' May 13 '13 at 22:08
  • In fine, I've followed your recommendation to set default random generator to `/dev/urandom`, but I've added an option to let user ability to choose another way `genpassphrase.pl -d wl_15bits.lst -r <(echo -ne '\001\002\003\004\005\006\007\010\011\012')`. – F. Hauri - Give Up GitHub May 19 '13 at 12:46
5

As Tom Leek correctly noted, entropy is a property of the generation process; it's not a property of any particular passphrase generated by the process.

What's minimal length for one word? Is 4 chars sufficient? How to compute entropy for a 4 letter word?

When you pull 15 bits from a random number generator, and use those bits to uniformly select one of 2^15 unique words from a dictionary, every word has exactly 15 bits of entropy -- it doesn't matter how many characters long it is.

Yes, using a dictionary that only has 4-letter English words would result in passphrases with less than 15 bits of entropy per 4-letter word -- that's another way of saying there are less than 2^15 words that are 4 letters long in a English dictionary. But that's irrelevant in this case -- there's no reason to arbitrarily exclude short words from your dictionary.

Likewise, there are less than 2^15 words that start with "be" in an English dictionary -- so words that start with "be" have less than 15 bits of entropy per word. Likewise, that fact is irrelevant -- there's no reason to arbitrarily exclude words starting with "be" from your dictionary.

... run this several time to obtain some choice: ... and choose in this bunch 5 words with is human ... This will reduce entropy ...

Yes, if you let a human reject some words, then it will reduce entropy.

One way to estimate this is to assume that there is some "real" list of words that the human will accept, and he will reject all others. If that list has only 2^N bits, then each word in the actual human-selected password will have (at best) only N bits of entropy per word. Alas, it is difficult to find out what "N" really is.

Sometimes the reason humans reject certain words is because they don't know how to spell them. One way to avoid this involves a much shorter dictionary that only includes common, easy-to-spell words. For example, compared to the best case of 15 bits/word * 5 words = 75 bits, you get slightly more entropy (77 bits) by uniformly picking 7 words from a much shorter dictionary of 2^11 common words, such as the the S/Key 2048-word dictionary.

Perhaps a better option is for the computer to pick 5-word passphrases as an all-or-nothing list. If you show your user 8 such passphrases, and force the user to pick one of those 8 passphrases (rather than mixing-and-matching any 5 words from a list of 30 words), it can be shown that: In the worst case it reduces the strength of the selected passphrase by 3 bits (to 72 bits). In the best cases (the user always picks the first one, or the user uses 3 fair coin-flips to choose one of the 8), the strength of the selected passphrase is the full entropy of 15 bits * 5 words = 75 bits.

1/3 words are plurals

As long as each word in your dictionary of 2^15 words is unique and uniformly chosen, it is irrelevant whether it is plural. As in the above "be" case, there is no reason to arbitrarily exclude words ending in "s" from your dictionary.

(And I doubt that 1/3 of the words are really true plurals -- your quick test catches words like "abacus" and "bus" and "boss" that are not true plurals).

David Cary
  • 2,720
  • 4
  • 19
  • 20