I have tested different sizes of data from the Arduino, and found that longer data usually resulted in a noticeably greater entropy (7.98 versus 7.95).
This is most likely because randomness tests typically require obscene amounts of input data. Accurately assessing the raw input from a true randomness source can sometimes require terabytes of input! When less input is provided, the tests are less accurate. Think about it from a theoretical perspective: if I give you the bit 1, how certain are you that it is random? What if I give you 11? What about 111? As you see more bits, you can determine with greater confidence that the data is not random. I imagine it won't take many more 1s for you to conclude that I am just giving you the same bit over and over. Randomness testing algorithms are similar, though they use more thorough tests than "how often does a single bit repeat", of course. But the same principle applies, and the tests will always require a fairly large input.
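To see the effect concretely, here is a minimal sketch (using os.urandom as a stand-in for the Arduino data) that estimates entropy in bits per byte from byte frequencies. Even an ideal source scores below 8 bits per byte on a short sample:

```python
import os
from collections import Counter
from math import log2

def bits_per_byte(data: bytes) -> float:
    """Estimate Shannon entropy in bits per byte from byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Even an ideal source scores lower on short samples, simply because 256
# byte values cannot all appear with equal frequency in only a few bytes.
for size in (1_000, 100_000, 10_000_000):
    print(size, round(bits_per_byte(os.urandom(size)), 4))
```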
This is strange, as I have heard elsewhere on the internet that /dev/urandom decreases in entropy after the entropy pool is drained.
This is a pervasive myth caused by a very misleading manual page. In reality, /dev/urandom can generate a virtually unlimited amount of pseudorandom data, indistinguishable from "true" randomness, as long as it is seeded initially with at least a hundred or so bits of entropy.
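For illustration, this is what reading from the OS CSPRNG looks like in Python; os.urandom draws from the same generator as /dev/urandom on Linux:

```python
import os

# os.urandom() reads from the OS CSPRNG (backed by /dev/urandom on Linux);
# once the kernel pool is seeded, reads never "drain" it.
seed = os.urandom(32)            # 256 bits
more = os.urandom(1024 * 1024)   # a megabyte more, equally unpredictable
```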
My question is: should true random data have more or less entropy when longer? Why?
Technically, a true random source will have as much entropy in bits as there are bits of data. 5000 bits of data can exist in 2^5000 different states. If that data is true randomness, then all of those states will be possible with equal probability. If you add one more truly random bit (i.e. a bit that has a genuine, true probability of 50% of being a "1"), then you have doubled the possible combinations, giving you instead 2^5000 × 2 = 2^5001 possible combinations of bits.
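A quick check of that arithmetic in Python, using the numbers from the paragraph above:

```python
from math import log2

states_5000 = 2 ** 5000        # equally likely states of 5000 truly random bits
states_5001 = states_5000 * 2  # one more truly random bit doubles the count

print(log2(states_5000))  # 5000.0 -> 5000 bits of entropy
print(log2(states_5001))  # 5001.0 -> 5001 bits of entropy
```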
Are pseudorandom numbers any different? In what way?
In the context of computer security and cryptography, the idea of having more "true" randomness is typically not meaningful. This is due to the fact that algorithms called cryptographically-secure pseudorandom number generators (CSPRNGs) are capable of taking a finite number of bits in the form of a key (or "seed") and expanding them into a virtually unlimited amount of pseudorandom bits. A CSPRNG that can be seeded with 256 bits of entropy is capable of producing up to 2^256 distinct pseudorandom streams. Despite being limited to "only" 256 bits, it is impossible to determine what the key is (and thus what the subsequent generated bits will be) without iterating through all 2^256 possible keys. Since this is not only impossible with current technology, but also impossible even if we harness the entire energy output of the sun for the rest of its life, there is no reason to ever require more than a very small amount of "true" entropy.
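As a rough sketch of the idea (not a production design), an extendable-output function like SHAKE-256 can stretch a single 256-bit seed into an arbitrarily long stream:

```python
import os
import hashlib

# Illustration only: a real system should simply keep using the OS CSPRNG.
seed = os.urandom(32)  # 256 bits of entropy

# SHAKE-256 is an extendable-output function, so one small seed can be
# stretched into an arbitrarily long pseudorandom stream.
stream = hashlib.shake_256(seed).digest(1_000_000)  # one million bytes
```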
Is there any more reliable way to measure entropy?
Not really. Randomness tests are strange beasts. They are capable of checking for certain biases and as such can tell you that the data is, with a very high likelihood, not random, but they are completely incapable of telling you that it is random. As such, randomness tests would be better off called non-randomness tests! Unfortunately, even state-of-the-art tests only check for certain types of patterns and distributions. Unless you have an oracle that tells you exactly how many uniformly distributed states a number of bits can exist in, you cannot accurately measure entropy.
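To make that concrete, here is a sketch of the simplest such test, the NIST-style frequency (monobit) test. Note how a trivially patterned input can still pass:

```python
from math import erfc, sqrt

def monobit_test(bits: str, alpha: float = 0.01) -> bool:
    """NIST-style frequency (monobit) test. It can only ever *reject*:
    a True result means "no bias found here", not "this is random"."""
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)
    p_value = erfc(abs(s) / sqrt(2 * n))
    return p_value >= alpha

print(monobit_test("1" * 100))   # False: obviously biased
print(monobit_test("10" * 50))   # True: passes, yet "1010..." is anything but random
```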
So, how should you collect randomness from an external noise source like an ADC? You need to gather a few hundred bits (256 is a safe number) of entropy from your noise source and use it to seed a cryptographically-secure random number generator. There are four common techniques:
Use the first 256 bits directly. If you are absolutely sure that your noise source is a true random number generator (e.g. it uses beam splitting or otherwise takes advantage of quantum uncertainty), then this is sufficient. The benefit is that you get the seed very quickly, but the drawback is that you need to trust your noise source to generate true randomness.
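A minimal sketch of this approach, assuming the noise source streams raw bytes over a serial port via pyserial (the port path and baud rate here are hypothetical):

```python
import serial  # pyserial; assumes the Arduino streams raw noise bytes over serial

# Hypothetical port and baud rate; only sound if the source is a genuine TRNG.
with serial.Serial("/dev/ttyUSB0", 115200, timeout=5) as port:
    seed = port.read(32)  # the first 256 bits, used directly as the seed
assert len(seed) == 32, "source stalled before delivering 256 bits"
```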
Employ debiasing on a large number of input bits. Debiasing is any post-processing technique that reduces the bias in an input. For example, ADC output may be biased towards 1s, but that does not mean it is unsuitable as an entropy source. Even if each bit has a 60% chance of being a 1, that chance can still be genuinely random. Debiasing can then be used to create a uniformly-distributed output. One of the first techniques is Von Neumann debiasing.
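A sketch of Von Neumann debiasing, which maps the bit pair 01 to 0, maps 10 to 1, and discards 00 and 11:

```python
def von_neumann(bits):
    """Von Neumann debiasing: examine non-overlapping bit pairs, keep
    01 -> 0 and 10 -> 1, discard 00 and 11. The output is unbiased as
    long as the input bits are independent, even if each leans toward 1."""
    out = []
    for a, b in zip(bits[::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out

# A source that emits 1 sixty percent of the time still yields unbiased output:
print(von_neumann([1, 1, 0, 1, 1, 0, 0, 1, 1, 1]))  # pairs 11,01,10,01,11 -> [0, 1, 0]
```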
XOR the noise source. When you have a random value A and a known or attacker-controlled value B, you can combine them using the XOR operation. If you compute A ⊕ B = C, knowledge of B alone does not reveal any information about C. This allows you to mix together a large number of mutually-independent sources such that the final value is at least as unpredictable as the most unpredictable source. This technique is very fast, but it is only useful on already debiased output. If there is any internal structure to the noise source, simply XORing it with itself may not be particularly effective.
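A small sketch of XOR combining, with a hypothetical worthless source thrown in to show that mixing it in cannot hurt the result:

```python
def xor_combine(*sources: bytes) -> bytes:
    """XOR equal-length outputs from mutually independent sources; the result
    is at least as unpredictable as the best single source."""
    length = min(len(s) for s in sources)
    out = bytearray(length)
    for s in sources:
        for i in range(length):
            out[i] ^= s[i]
    return bytes(out)

# Hypothetical values: a good source masks a weak or attacker-known one.
a = bytes.fromhex("9f86d081884c7d65")  # independent random source A
b = bytes.fromhex("0000000000000000")  # worthless source B (all zeros)
print(xor_combine(a, b).hex())          # still as unpredictable as A alone
```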
Pass it to a secure hash algorithm. This is the technique you mentioned you ended up using. A cryptographic hash can take an input of virtually unlimited size and output a finite digest. This digest has a uniform distribution of bits, and the output reveals nothing about the input. If given 4 random samples, you can safely combine them with H(S0 || S1 || S2 || S3). Hash functions do not keep the entire input in their state, so memory does not limit how many samples you can hash together. This technique is by far the most effective way to combine multiple noise sources, or a single biased noise source, into a small, random seed that can be used to generate an unlimited amount of randomness later.
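A sketch of that construction using SHA-256 (the sample values below are made up for illustration; real code would read them from the noise source):

```python
import hashlib

def seed_from_samples(samples) -> bytes:
    """Condense any number of biased noise samples into a 256-bit seed:
    H(S0 || S1 || S2 || S3 || ...) using SHA-256."""
    h = hashlib.sha256()
    for s in samples:      # incremental hashing, so memory use stays flat
        h.update(s)
    return h.digest()      # 32 bytes, uniformly distributed

# Hypothetical ADC samples:
samples = [b"\x01\x02\x03", b"\xff\xfe", b"\x10" * 64, b"\x00\x42"]
seed = seed_from_samples(samples)
print(seed.hex())
```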