
One way to build a password is to take the first letter from each word in a sentence. For example, the password "Itsfrqbtwaawwnsl" comes from the sentence

"I tried searching for related questions but they were all about whole words not single letters".

In such a password, how much entropy does each letter add?

Is it the same entropy as a 100% random alphabetical password (log2 26)? Or is it lower entropy because the more frequent letters tend to show up more often than the rare letters? For example, the example password above has more than one 'a', 's', and 't', which are some of the most common starting letters for English words.

schroeder
hugomg
  • This is similar to [this other question](https://security.stackexchange.com/questions/22717/how-secure-are-passwords-made-of-whole-english-sentences) but that question is about entire passphrases, not just the initial letters. – hugomg Nov 16 '21 at 22:22
  • Haven't you answered your own question? – schroeder Nov 16 '21 at 23:43
  • Actually that link has an answer that addresses your question directly. – schroeder Nov 16 '21 at 23:45
  • Which answer are you referring to? The one that is closest in my impression is [this one](https://security.stackexchange.com/a/22729/36414) but it is talking about a different thing. Their estimate of 1 bit per character refers to full English text, not to the first characters of each word. – hugomg Nov 16 '21 at 23:51
  • https://security.stackexchange.com/a/22737/6253 – schroeder Nov 17 '21 at 00:09
  • Sorry, but I don't see how that answers my question. How would we go about actually computing the entropy of English sentences? (Or of the first letters of the sentences)? – hugomg Nov 17 '21 at 00:23

2 Answers


Information entropy only applies as a measure when the characters are chosen randomly and uniformly. Choosing the first letter of each word in the sentence is inherently non-random and non-uniform, so you cannot calculate the entropy of passwords generated by this scheme. Any estimation would be a gross simplification to the point of being misleading.

Since what you're really asking is "how hard is this to crack compared to a randomly chosen uniform distribution of letters?", let's consider the problem of how to answer that, instead of trying to apply a metric that isn't applicable.

The first thing we might want to know is the relative frequencies of the first letter of words in the English language. This alone does not have a clear answer, since "English words" is a poorly defined set. The list of words in the dictionary is one source, but that doesn't account for words that aren't commonly listed in the dictionary, nor does it account for the frequency of use of the words. Words appear once each in the dictionary, but with varying frequency in real sentences. Many words in the dictionary are obscure and are extremely unlikely to appear in a sentence chosen by a user.

Wikipedia has a table of letter frequencies that lists the relative frequencies of the first letter of English words as appearing in texts. The distribution is as follows:

T   16%
A   11.7%
O   7.6%
I   7.3%
S   6.7%
W   5.5%
C   5.2%
B   4.4%
P   4.3%
H   4.2%
F   4%
M   3.8%
D   3.2%
E   2.8%
R   2.8%
L   2.4%
N   2.3%
G   1.6%
U   1.2%
K   0.86%
V   0.82%
Y   0.76%
J   0.51%
Q   0.22%
X   0.045%
Z   0.045%
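Treating each first letter as an independent draw from this table (an assumption the rest of this answer argues against), the Shannon entropy works out to roughly 4.1 bits per letter, versus log₂ 26 ≈ 4.70 bits for a uniformly random letter. A quick Python check:

```python
import math

# First-letter frequencies (percent) from the table above
freqs = {
    'T': 16, 'A': 11.7, 'O': 7.6, 'I': 7.3, 'S': 6.7, 'W': 5.5,
    'C': 5.2, 'B': 4.4, 'P': 4.3, 'H': 4.2, 'F': 4, 'M': 3.8,
    'D': 3.2, 'E': 2.8, 'R': 2.8, 'L': 2.4, 'N': 2.3, 'G': 1.6,
    'U': 1.2, 'K': 0.86, 'V': 0.82, 'Y': 0.76, 'J': 0.51,
    'Q': 0.22, 'X': 0.045, 'Z': 0.045,
}

# Normalise (the published percentages don't sum to exactly 100)
total = sum(freqs.values())
probs = [v / total for v in freqs.values()]

# Shannon entropy of a single draw from this distribution
entropy = -sum(p * math.log2(p) for p in probs)

print(f"{entropy:.2f} bits per letter")       # roughly 4.1
print(f"{math.log2(26):.2f} bits if uniform")
```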

Sounds promising, right? Except it isn't all that useful. The probability of a particular letter being the first letter of a word varies wildly with that word's position in the sentence. On top of that, we have so far assumed that the first letters of adjacent words are independent, but they are not: the distribution varies not only with the word's position in the sentence, but also with the word that came before it, and the word before that.

What this does mean, though, is that we can build our own custom dictionary based on a large corpus of text. I grabbed the OANC corpus for my own tests, but a better corpus could be generated using a large library of famous quotes, song lyrics, meme formats, popular tweets, etc.

The wordlist building approach is as follows:

  • Take the contents of every text file and read it line by line.
  • Trim leading and trailing whitespace from each line.
  • Concatenate all lines together with a space between them, into one long unbroken string.
  • Utilise a sentence-splitting regex to recover the component sentences from that string.
  • Normalise the sentences to remove punctuation.
  • Extract the first letter of each word in each sentence, in order, and build a character sequence from it.
  • If the sequence is between 6 and 20 characters (i.e. a sentence with between 6 and 20 words), add it to the sequence list.
  • Once the sequence list is built, sort it alphabetically.
  • Count the occurrences of each entry in the list and sort the list by that count, descending.
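The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the exact pipeline I used: the sentence-splitting regex, punctuation handling, and the `*.txt` corpus layout are all assumptions.

```python
import re
from collections import Counter
from pathlib import Path

def build_wordlist(corpus_dir):
    """Build a first-letters wordlist from a directory of plain-text files."""
    # Steps 1-3: read every file line by line, trim, join into one long string
    text = " ".join(
        line.strip()
        for path in Path(corpus_dir).glob("**/*.txt")
        for line in path.open(encoding="utf-8", errors="ignore")
    )
    # Step 4: naive sentence-splitting regex (good enough for a sketch)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    sequences = []
    for sentence in sentences:
        # Steps 5-6: strip punctuation, keep the first letter of each word
        words = re.findall(r"[A-Za-z]+", sentence)
        seq = "".join(w[0].lower() for w in words)
        # Step 7: keep only sentences of 6 to 20 words
        if 6 <= len(seq) <= 20:
            sequences.append(seq)
    # Steps 8-9: count duplicates and sort by count, descending
    return Counter(sequences).most_common()
```

A real run would simply point `build_wordlist` at the directory of extracted corpus text.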

For a ~3GB corpus of plaintext, this process takes around 30 seconds. The OANC corpus contains around 550,000 sentences, producing a first-letters wordlist of approximately 400,000 unique entries.

Given that the brute-force search space for a 6 character lowercase alpha password is around 300 million, this implies that the search space reduction caused by this first-letter scheme is very significant, bringing it within range of a trivial dictionary attack once the corpus is built.

Humans are most likely to pick memorable sentences of interest, rather than obscure sentences or anything close to truly random, and the subset of those sentences with enough words to meet typical password length requirements is smaller still. Given a reasonable source of popular phrases, the likelihood of breaking this scheme is therefore quite high.


As a bonus, I did some trigram analysis on the sequences. A trigram is any sequence of three letters. For example, the sentence "the quick brown fox jumps over the lazy dog" would have a starting trigram of "tqb" and contain the trigrams "tqb qbf bfj fjo jot otl tld".
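As a small illustration (a sketch, not the code used for the analysis below), a Python helper that recovers those trigrams:

```python
def first_letter_trigrams(sentence):
    """Overlapping trigrams of the sentence's first-letter sequence."""
    letters = "".join(word[0].lower() for word in sentence.split())
    return [letters[i:i + 3] for i in range(len(letters) - 2)]

print(first_letter_trigrams("the quick brown fox jumps over the lazy dog"))
# ['tqb', 'qbf', 'bfj', 'fjo', 'jot', 'otl', 'tld']
```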

The 200 most common trigrams that occurred at the start of a sentence were as follows:

tia, iat, tao, tat, teo, taa, att, tst, tit, its, tpo, tso, itc, tro, tsa, tio, iia, tas, mot, tpa, tsi, tco, tsw, tai, itp, tii, ats, tpw, ata, ita, tpi, sot, trt, tcw, tra, tca, oot, tri, twa, tci, ttt, itt, iaa, oto, iit, aar, tta, tot, tps, ias, iot, ist, tmo, ato, tss, aot, ict, tto, tac, tct, atp, bta, bot, tin, toa, tdi, tma, ati, tti, aat, twt, tft, tmi, tdo, tcs, ite, ate, aaa, too, taw, tlo, tae, asi, tsc, tis, tfi, iwa, iti, tts, trs, tea, fet, tno, ait, tfa, atc, tar, tap, tet, tfo, tiw, wat, tsp, iae, iac, tpc, itm, tmt, tos, tic, tmw, trw, atm, iwt, tpt, itf, tms, ttc, ttw, bit, tam, tda, itr, tcc, iaw, ito, btt, aso, iin, bti, iai, ift, tte, tmc, tds, atw, tei, tah, aas, tad, ton, iii, itw, sao, ttp, twn, aia, tfs, toi, bts, atr, tdt, tmp, aft, bat, aii, iow, tor, ihb, irt, taf, tsh, tan, wit, saa, wta, iha, aac, tnt, tsr, aao, tpr, tuo, iht, eot, fta, tho, tha, itl, trc, aai, tdw, tla, wst, asa, iwb, fts, ast, bia, toc, not, tmb, tsf, iip, sat, fea, tsd, cwt, hia, ftr

These trigrams appeared 126173 times, making up 26.5% of the total. As such, for any given sentence you have a better than 1 in 4 chance of the first three letters of the sequence appearing on this list.

The 701 most frequent trigrams made up 50% of the occurrences; the remaining 50% were spread across 11392 trigrams.

Expanding this to look at all trigrams in the sequences, not just those appearing at the start of a sequence, the 200 most common were:

att, tat, aat, sot, tao, aaa, taa, aot, ata, teo, tso, tta, tpo, tst, tia, iat, tsa, ots, ttt, ota, tit, tco, pot, tot, its, sat, ait, ott, ats, tto, stt, cot, otc, ita, itt, oot, otp, tro, tca, ato, tct, tpa, eot, iaa, ttc, tts, itc, mot, oaa, oto, iit, atc, atp, itp, aoa, iot, oat, ati, ast, tio, sit, aia, asa, toa, ctt, trt, saa, aso, tas, cat, wat, tti, ttp, aao, aas, rot, tai, ate, tpt, awa, ote, oti, aai, pit, rtt, tsi, tft, cit, otm, iia, tra, twa, ias, tet, pat, sta, tdo, atr, bat, aco, otr, bot, atm, awt, tea, tma, twt, too, tte, tmo, tpi, dot, tba, act, tci, aft, aac, caa, ite, aca, tsw, apo, tac, ito, tfo, ttw, iti, atw, btt, tii, tmt, oit, apa, otf, asi, tdt, ttm, tbt, aap, bta, itm, eit, tlo, tcw, aar, ttr, atb, tno, soa, tfa, sia, oia, eat, tda, wta, tss, aii, art, tap, fot, waa, tpw, atf, iai, aea, oas, oao, paa, rat, tnt, tbo, ftt, ooa, otb, rit, atd, itf, itr, aaw, dit, cwt, coa, iac, tps, fta, taw, fat, rta, wtt, otd, ath, aae, tha, baa, ioa, poa, tae, cta, ist, ptt

These trigrams appeared 1880623 times, making up 20.6% of the total. The 969 most frequent trigrams made up 50% of the occurrences, with the remaining 50% spread across 15225 trigrams.

Overall, the disparity between the frequency of the most-seen trigrams and the frequency of the least-seen trigrams is pretty huge, which serves to demonstrate just how non-uniform the sequences of first letters of words are.

Polynomial
  • Interesting! I agree that trying to find a perfect association between sentence length and entropy is problematic because, like normal passwords, people gravitate towards the most common sentences (which is vulnerable to a dictionary attack). However, I would still like to put a number on it, to have a better estimate of how "worse" it is. And you just gave me an idea: if we model the letters as a markov chain (using a real corpus to compute the probabilities) there surely must be some formula out there that would give us the entropy rate per character... – hugomg Nov 17 '21 at 03:00
  • The problem with trying to quantify this scheme like a normal password is that the search space doesn't grow anywhere near exponentially with length, like a randomly chosen password does. For example, a 6-character password using this scheme has about 400k possible values, given this particular corpus. That's about equivalent to 4 randomly chosen lower-alpha characters. A 10-character password chosen with this scheme _still only has about 405k possible values_, because the pool of possible sentences diminishes, so it's no stronger even though it's longer. – Polynomial Nov 17 '21 at 03:09
  • With a randomly chosen sequence of characters the search space is _cⁿ_, where _c_ is the size of the character set and _n_ is the length. Each character added to the password length multiplies the size of the search space by _c_. As the length increases, the search space tends towards infinity, becoming computationally infeasible to brute-force. With the first letter scheme, however, the search space for each length plots as a bell curve. This makes the scheme self-limiting and no amount of password length can make it equivalent to even an 8 character random lower-alpha password. – Polynomial Nov 17 '21 at 03:45
  • Sure, but that might be because the corpus doesn't have many long sentences? We have to remember that while the corpus is large, it is still finite. – hugomg Nov 17 '21 at 03:45
  • That doesn't sound right... If I take the first letter of each word in your answer (which is over a hundred letters), surely that must be harder to guess than an 8-letter password? – hugomg Nov 17 '21 at 03:49
  • _Sentences_, not entire paragraphs. Keep in mind that for your scheme to be useful a user would have to be able to remember the sentence that generated the password. Otherwise they might as well just use a password manager to randomly generate the password. Even if the corpus could be considered infinite, a user's memory is not. – Polynomial Nov 17 '21 at 04:07
  • So Taoists have very low entropy... – gnasher729 Nov 17 '21 at 11:21

This technique has very low entropy and is not wise to use.

You'd have to look at word frequencies; you'll have a lot of the small common words like articles and conjunctions ("a", "the", "and", "also"). Then you'd have to look at the frequency of each word's first letter, as weighted by the word's frequency.

Word frequency can theoretically be calculated with Google Books Ngrams.

Letter frequency can come from an offline dictionary. Here's a quick mock-up using a shell script on a Debian box with the wamerican spelling dictionary package installed:

$ grep -o ^. /usr/share/dict/american-english \
  |awk '{count[tolower($1)]++} END{for(i in count) print count[i],i}' \
  |sort -n |tail

3726 f
4493 t
4847 r
5330 d
5440 a
5495 m
5525 b
6723 p
8396 c
10009 s

Even in the top ten (without weighting by word frequency), you can see there's a massive discrepancy, with s beginning 2.7 times as many words as f. On the other end (not shown; change tail to head), x only begins 103 words. Graphed:

[graph of words' first-letter frequencies]

The top ten initial letters account for nearly two thirds of all words. I'm guessing that when you control for word frequency, this share will go up further; I'd be pretty surprised if it went down. When it comes to entropy, you want a system with good uniformity among the letters. This is not at all uniform.

This technique is not at all sound.

You want truly random passcodes. I suggest using a password manager and having it generate and store your codes.

Assuming a uniform distribution, this technique would generate passwords whose entropy is log₂(26ⁿ), where n is the number of letters you've pulled from the words in your sentence. Your sample 16-letter password Itsfrqbtwaawwnsl would then have an entropy of about 75 bits (not great but perhaps acceptable). However, the true entropy of this system is significantly lower. Let's say it's around log₂(13¹⁶) ≈ 59 bits. That is pretty weak.
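The two figures above are just log calculations, easy to verify:

```python
import math

n = 16  # letters in the sample password "Itsfrqbtwaawwnsl"

uniform = n * math.log2(26)  # every letter equally likely
skewed = n * math.log2(13)   # the rough halved-alphabet estimate

print(round(uniform), round(skewed))  # 75 59
```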

You'd be far better off using that sentence itself, with punctuation. It's still not ideal (you need random words in a passphrase, just as you need random characters in a passcode; see this answer for details.)

The only sane solution is to use a password manager.

If you need to use a password when you don't have access to a password manager (like to unlock your computer or for the pw manager itself), use a collection of 5+ unrelated words and try to use 2FA to additionally protect the account. If that's too long, consider three words and a 5-9 character passcode, like axiom-money-9Nu6<g-amigas.

You could even write the code part down and put it in a secure location (like your wallet). If an attacker finds it, they don't have the words.

Adam Katz
  • Of course this is not supposed to replace a password manager, which can store a long 100% random password. The point of the question is, how long does the sentence need to be to have enough entropy? Is that log(13) an actual estimate or just a guess? Even if it is a lower amount per letter, that would just mean you'd need more characters. And on a character-by-character basis, I'd expect it would be higher entropy than the wordlist-based password... Not to mention, this is an actual password generation scheme that some people use. I think it is valuable to be able to estimate the entropy. – hugomg Nov 17 '21 at 00:18
  • 1
    That log(13**n) is an estimate given the fact that the top ten letters represent 66.5% of words and the top half of letters represent 70.4%. You're also still assuming there is no relationship between the words that the letters represent. If the words are fully random, then you could indeed eliminate that variable with more words, but why not just have the whole words present? Yes, this is a scheme people use and experts used to suggest. It is not strong. – Adam Katz Nov 17 '21 at 01:47