There seem to be many different 'kinds' of entropy. I've come across two different concepts:
A) The XKCD example of correcthorsebatterystaple. It has 44 bits of entropy because four words chosen uniformly at random from a list of 2048 words give 4 * log2(2048) = 44 bits of entropy. This I understand.
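In code, A measures the generation process rather than any particular string. A minimal sketch, assuming a 2048-word list and four independent uniform picks:

import math

wordlist_size = 2048  # size of the word list (as in the comic)
num_words = 4         # words drawn independently and uniformly at random

# log2 of the number of equally likely passphrases the scheme can produce
print(num_words * math.log2(wordlist_size))  # => 44.0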
B) The Shannon entropy of the actual string, i.e. the entropy calculated from the frequencies of the letters/symbols it contains. Applying the Shannon formula to correcthorsebatterystaple gives 3.36 bits of entropy per character.
# adapted from http://stackoverflow.com/a/2979208 (ported to Python 3)
import math

def entropy(string):
    """Calculate the Shannon entropy of a string."""
    # probability of each distinct character in the string
    prob = [string.count(c) / len(string) for c in set(string)]
    # entropy in bits per character
    return -sum(p * math.log2(p) for p in prob)

print(entropy('correcthorsebatterystaple'))
# => 3.36385618977...
Wikipedia only adds to my confusion:
It is important to realize the difference between the entropy of a set of possible outcomes, and the entropy of a particular outcome. A single toss of a fair coin has an entropy of one bit, but a particular result (e.g. "heads") has zero entropy, since it is entirely "predictable".
-- Wikipedia: Entropy (information theory)
I don't quite understand the distinction between the entropy of the toss (generation) and the entropy of the result (the string).
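If I read that right, the two calculations would look something like this (my own sketch, using 'H' as a stand-in for the result "heads"):

import math

# Entropy of the toss (the generator): two equally likely outcomes.
outcomes = [0.5, 0.5]
print(-sum(p * math.log2(p) for p in outcomes))  # => 1.0 bit

# Entropy of the result (one particular string): the lone symbol 'H'
# has frequency 1, so the frequency-based formula yields zero bits.
result = 'H'
prob = [result.count(c) / len(result) for c in set(result)]
print(-sum(p * math.log2(p) for p in prob))  # => -0.0 (i.e. zero)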
- When is B used and for what purpose?
- Which concept accurately reflects the entropy of the password?
- Is there terminology to differentiate between the two?
- True randomness could give us correctcorrectcorrectcorrect. Using A we still have 44 bits. Using B the entropy would be the same as that of correct. When is the difference between the two important? (See the sketch after this list.)
- If a requirement specifies that a string needs to have 20 bits of entropy, do I use A or B to determine the entropy?
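To illustrate, here is a quick demonstration (reusing the entropy() function defined above; the repeated passphrase is a hypothetical output of a truly random generator):

import math

repeated = 'correct' * 4  # 'correctcorrectcorrectcorrect'

# B only sees symbol frequencies, which are identical in both strings:
print(entropy('correct'))  # => 2.2359... bits per character
print(entropy(repeated))   # => 2.2359... (repetition changes nothing)

# A depends only on the generation scheme, not on the outcome:
print(4 * math.log2(2048))  # => 44.0 bits either way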