Hard question to answer exactly. I'm going to refer exclusively to Theodore Ts'o's pwgen (v2.07) implementation here (pwgen -A0).
These pronounceable passwords use "phonemes" as "symbols", rather than single characters. In the (English-biased) pwgen a phoneme can be 1 or 2 characters. There are 40 defined (in pw_phonemes.c): 25 are a single character (a-z, except "q") and 15 are pairs (diphthongs). The average is 1.375 characters per phoneme (closer to 1.425 in use due to consonant/vowel alternation).
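The 1.375 figure is just a weighted mean over the 40 table entries. A minimal Python check, using only the counts quoted above (the variable names are mine):

```python
# Counts quoted in the text: 25 one-character and 15 two-character phonemes.
single, pairs = 25, 15

total = single + pairs                        # 40 phonemes in all
avg_chars = (single * 1 + pairs * 2) / total  # weighted mean length

print(total, avg_chars)  # 40 1.375
```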
Phonemes aren't combined randomly; that's the trick, of course. There are rules which make the end results pronounceable. For pwgen we have (roughly):
- some phonemes cannot start a word (2 phonemes excluded)
- some phonemes cannot follow a vowel (for 13 vowel phonemes, 8 phonemes excluded)
- having picked a consonant pick a vowel next
- having picked a vowel: after a previous vowel, or on a diphthong, or randomly (60%) pick a consonant next
- otherwise allow another vowel next
- a diphthong (2 characters) cannot be chosen when only one character of the password remains (the most obvious side effect is that a password will never end with "q", since q only appears as the diphthong "qu").
(If you can formulate the exact number of permutations based on that, well done!)
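To make the rules concrete, here is a rough Python sketch of a generator that follows them. It illustrates the rules as listed, and is not a copy of pwgen's source: the phoneme table is my assumption, chosen to match the category counts given further down (20/5/2/5/8), and the names `pronounceable`, `PHONEMES` and the flag constants are invented for this sketch.

```python
import random

V, C, DIP, NOT_FIRST = 1, 2, 4, 8   # flag bits (names invented for this sketch)

def _table():
    """Assumed phoneme table, built to match the category counts in this answer."""
    t = {}
    for s in "aeiou":
        t[s] = V
    for s in "ae ah ai ee ei ie oh oo".split():
        t[s] = V | DIP
    for s in "bcdfghjklmnprstvwxyz":
        t[s] = C
    for s in "ch ph qu sh th".split():
        t[s] = C | DIP
    for s in "gh ng".split():
        t[s] = C | DIP | NOT_FIRST
    return sorted(t.items())

PHONEMES = _table()

def pronounceable(size, rng=random):
    """Build a password of exactly `size` characters by rejection sampling."""
    out, prev, first = "", 0, True
    should_be = rng.choice((V, C))            # start on a vowel or a consonant
    while len(out) < size:
        s, flags = rng.choice(PHONEMES)
        if not flags & should_be:             # wrong class for this position
            continue
        if first and flags & NOT_FIRST:       # cannot start a word
            continue
        if prev & V and flags & V and flags & DIP:
            continue                          # vowel diphthong cannot follow a vowel
        if len(s) > size - len(out):          # diphthong does not fit in one slot
            continue
        out += s
        if should_be == C:
            should_be = V                     # consonant is always followed by a vowel
        elif prev & V or flags & DIP or rng.random() < 0.6:
            should_be = C                     # second vowel, diphthong, or 60% chance
        else:
            should_be = V                     # otherwise allow another vowel
        prev, first = flags, False
    return out

print(pronounceable(8))
```

The output is random, so none is shown. Passing a seeded `random.Random` instance as `rng` makes runs repeatable, which helps when counting distinct outputs empirically.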
A "symbol" for an [a-z] password is a single character, for a pronounceable password it's a phoneme of 1 or 2 characters.
For an [a-z] password of length N, there are 4.7 bits (lg2(26)) per symbol. There are 26^N or 2^(4.7*N) possible passwords, so its estimated entropy is 4.7*N bits (4.7 bits per character).
For phonemes we have 5.3 bits (lg2(40)) per symbol. A password of n symbols has 40^n or 2^(5.3*n) possibilities, so its estimated entropy is 5.3*n bits (3.9 bits per character). A phoneme password of m symbols will (ignoring any deviation caused by the above rules) be an average of 1.375m characters.
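The per-symbol and per-character figures can be reproduced in a few lines of Python. This only restates the arithmetic above and ignores the pronounceability rules:

```python
from math import log2

AVG_CHARS = 1.375                    # average characters per phoneme (from the text)

bits_per_letter = log2(26)           # one [a-z] character
bits_per_phoneme = log2(40)          # one phoneme, rules ignored
bits_per_char = bits_per_phoneme / AVG_CHARS

print(f"{bits_per_letter:.2f} {bits_per_phoneme:.2f} {bits_per_char:.2f}")
# 4.70 5.32 3.87
```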
The maximum entropy of the two types of password (at the same average length, n=1.375m characters) can be approximated by 26^(1.375m) and 40^m. The former grows quicker *, which bears out your assertion (the count of pronounceable words of length n is obviously less than 26^n).
At a minimum, a pronounceable password created this way should be about 20% longer than a straight [a-z] random password in order to have a comparable entropy. The presumed advantage is that pronounceable probably means more memorable, so for the human a longer password may actually be easier to memorise.
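The "about 20%" figure is the ratio of the two per-character entropies. A small sketch, again ignoring the pronounceability rules:

```python
from math import log2

per_char_az = log2(26)                 # [a-z], bits per character
per_char_phoneme = log2(40) / 1.375    # phonemes, bits per character
factor = per_char_az / per_char_phoneme

# Same comparison per phoneme slot: 26^1.375 candidate strings versus 40 phonemes.
print(f"{factor:.2f} {26 ** 1.375:.1f}")   # 1.21 88.2
```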
The constraints due to pronounceability limit this further.
Estimating a numerical difference is trickier... this is hopefully an "order of magnitude" approximation. pwgen's 40 phonemes break down as:
20 CONSONANT
5 CONSONANT DIPTHONG
2 CONSONANT DIPTHONG NOT_FIRST
5 VOWEL
8 VOWEL DIPTHONG
(Diphthong is mis-spelt in the source, no matter.)
I will (heavily) approximate a calculation for a 3-4 phoneme (~5 character) password, based on the above rules (and with a little empirical evidence). ~80% of passwords are of the form of alternating consonant/vowel phonemes, i.e. C V C [V ...] or V C V [C ...]; the remaining ~20% have a vowel pair, e.g. C V V C (consonant phoneme pairs are forbidden; they may occur in the output characters though, particularly due to the phoneme "ng"). (A problem here is that working out the length in characters from the phonemes makes the problem intractable. This isn't just a permutation problem; I suspect you have to work out permutations of permutations for an accurate answer.)
A reasonable estimate can be had by counting the most frequent arrangements:
c v c v = 25*13*27*13 = 114075
v c v = 13*27*13 = 4563
c v c = 25*13*27 = 8775
v c v c = 13*27*13*27 = 123201
c v v c = 25*13*5*27 = 43875
v c v v = 13*27*13*5 = 22815
v v c v = 13*5*27*13 = 22815
-------
340119
The magic numbers here are:
25 number of consonants (incl. diphthongs) without NOT_FIRST,
27 number of consonants (incl. diphthongs),
13 number of vowels (incl. diphthongs),
5 number of non-diphthong vowels that can follow a vowel
Empirical data indicates the true number to be about 15% higher, but if more of the permutations are included they start to exceed the length of 5 characters, giving an inflated answer.
A random 5 letter [a-z] password has approx 11.9M permutations; the estimate above is less than 3% of that.
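The table and the comparison with 26^5 can be reproduced with a short Python script. It uses the same four "magic numbers" and inherits the same heavy approximations. The constant names are mine:

```python
C1, C, V, V2 = 25, 27, 13, 5   # the "magic numbers" listed above

patterns = {
    "c v c v": C1 * V * C * V,
    "v c v":   V * C * V,
    "c v c":   C1 * V * C,
    "v c v c": V * C * V * C,
    "c v v c": C1 * V * V2 * C,
    "v c v v": V * C * V * V2,
    "v v c v": V * V2 * C * V,
}
total = sum(patterns.values())
share = total / 26 ** 5          # fraction of all 5-letter [a-z] strings

print(total, f"{share:.2%}")     # 340119 2.86%
```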
A rough approximation then, ignoring edge cases and considering pairs of symbols at a time, for a pwgen pronounceable password of length n characters:
P = 767 ^ (n/(2*1.4))
where 767 is ( 27*13 + 13*27 + 13*5 ), the permutations of the symbol pairs c v, v c and v v; the 2 accounts for taking symbols in pairs, and 1.4 converts the character length n to the number of phonemes. (Having the estimated number 1.4 in a power makes the formula somewhat sensitive to minor changes.)
767 (valid symbol pairs) consumes approx 2.8 characters, for 9.6 bits (log2(767)) of effective entropy, or 3.4 bits per character. Compared with 4.7 bits for [a-z], we need an overall factor of approx 1.37 to bring these passwords up to comparable strength, i.e. a little over one third longer.
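As a sketch, the formula and the per-character figure derived from it look like this in Python. `estimated_bits` is a name made up for this example and simply returns log2(P); the last printed value is the length factor relative to [a-z], which comes out close to the one quoted above.

```python
from math import log2

PAIRS = 27 * 13 + 13 * 27 + 13 * 5     # c v, v c, v v  -> 767
CHARS_PER_PAIR = 2 * 1.4               # two phonemes at ~1.4 characters each

def estimated_bits(n_chars):
    """log2 of P = 767 ^ (n / (2 * 1.4)) for a password of n_chars characters."""
    return log2(PAIRS) * n_chars / CHARS_PER_PAIR

bits_per_char = estimated_bits(1)
print(PAIRS, f"{log2(PAIRS):.1f} {bits_per_char:.1f} {log2(26) / bits_per_char:.2f}")
# 767 9.6 3.4 1.37
```

Because 1.4 sits in the exponent, nudging it to 1.375 or 1.425 shifts the per-character figure noticeably, which is the sensitivity mentioned above.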
For comparison, allowing random mixed case and digits in the pwgen output gets you back up to ~4 bits per character, so a pwgen password of length n (without -A0) is still weaker than an [a-z] password of the same length (~4.7).
(For empirical proofs, note that pwgen only uses pronounceable phonemes when the length is >=5.)
For use as a password, you may want at least 50 bits of entropy (e.g. equivalent to 8 characters from [a-zA-Z0-9] + ASCII punctuation, 6.5 bits per character). This can be achieved with a pwgen -A0 password length of 15-16 characters (~3.4 bits per character). That is a doubling of the length for a password of comparable strength.
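The length needed for a given entropy target follows directly from the bits-per-character estimate. A minimal sketch, assuming the ~3.4 figure from above and the 94 printable ASCII symbols (letters, digits and 32 punctuation characters):

```python
from math import ceil, log2

TARGET_BITS = 50
PWGEN_BITS_PER_CHAR = 3.4              # estimate from the text for pwgen -A0
full_ascii = log2(26 + 26 + 10 + 32)   # bits per character, 94 printable symbols

print(f"{full_ascii:.2f}", f"{8 * full_ascii:.1f}",
      ceil(TARGET_BITS / PWGEN_BITS_PER_CHAR))
# 6.55 52.4 15
```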
Number of n len words (26^n) > Number of n len pronounceable words > Number of n len pass-phrases > Number of n len English words.
All true (I assume that "n len English words" means using a single word as a password).
Pass-phrases need to be quite long to be effective, perhaps 2 bits per character (e.g. 40k words, avg length 8, with unrelated words -- related words is lower).
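The ~2 bits per character follows from the example figures, assuming each word is picked independently and uniformly from the list and ignoring any separators between words:

```python
from math import log2

WORDS, AVG_LEN = 40_000, 8           # example figures from the text

bits_per_word = log2(WORDS)          # entropy of one uniformly chosen word
print(f"{bits_per_word:.1f} {bits_per_word / AVG_LEN:.1f}")   # 15.3 1.9
```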
Fixed short-word dictionary style schemes like that used in RFC2289 achieve ~3 bits per character.
* Wolfram graph: 26^(1.375n) versus 40^n, or also try a log plot of 26^(1.375n) versus 40^n for n in [0,16]
which is cached here