I have tested different sizes of data from the Arduino, and found that longer data usually resulted in a noticeably greater entropy (7.98 versus 7.95).
This is most likely because randomness tests typically require obscene amounts of input data. Accurately assessing the raw input from a true randomness source can sometimes require terabytes of input! When less input is provided, the tests are less accurate. Think about it from a theoretical perspective: if I give you the bit 1, how certain are you that it is random? What if I give you 11? What about 111? As you see more bits, you can determine with greater confidence that the data is not random. I imagine it won't take many more 1s for you to conclude that I am just giving you the same bit over and over. Randomness testing algorithms are similar, though they use more thorough tests than "how often does a single bit repeat", of course. But the same principle applies, and the tests will always require a fairly large input.
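To see the effect concretely, here is a minimal sketch (using os.urandom as a stand-in for the Arduino data) that estimates entropy in bits per byte from byte frequencies. Even an ideal source scores below 8 bits per byte on a short sample:

```python
import os
from collections import Counter
from math import log2

def bits_per_byte(data: bytes) -> float:
    """Estimate Shannon entropy in bits per byte from byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Even an ideal source scores lower on short samples, simply because 256
# byte values cannot all appear with equal frequency in only a few bytes.
for size in (1_000, 100_000, 10_000_000):
    print(size, round(bits_per_byte(os.urandom(size)), 4))
```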
This is strange, as I have heard elsewhere on the internet that /dev/urandom decreases in entropy after the entropy pool is drained.
This is a pervasive myth caused by a very misleading manual page. In reality, /dev/urandom can generate a virtually unlimited amount of pseudorandom data, indistinguishable from "true" randomness, as long as it is seeded initially with at least a hundred or so bits of entropy.
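For illustration, this is what reading from the OS CSPRNG looks like in Python; os.urandom draws from the same generator as /dev/urandom on Linux:

```python
import os

# os.urandom() reads from the OS CSPRNG (backed by /dev/urandom on Linux);
# once the kernel pool is seeded, reads never "drain" it.
seed = os.urandom(32)            # 256 bits
more = os.urandom(1024 * 1024)   # a megabyte more, equally unpredictable
```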
My question is: should true random data have more or less entropy when longer? Why?
Technically, a true random source will have as much entropy in bits as there are bits of data. 5000 bits of data can exist in 2^5000 different states. If that data is true randomness, then all of those states will be possible with equal probability. If you add one more truly random bit (i.e. a bit that has a genuine, true probability of 50% of being a "1"), then you have doubled the possible combinations, giving you instead 2^5000 × 2 = 2^5001 possible combinations of bits.
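A quick check of that arithmetic in Python, using the numbers from the paragraph above:

```python
from math import log2

states_5000 = 2 ** 5000        # equally likely states of 5000 truly random bits
states_5001 = states_5000 * 2  # one more truly random bit doubles the count

print(log2(states_5000))  # 5000.0 -> 5000 bits of entropy
print(log2(states_5001))  # 5001.0 -> 5001 bits of entropy
```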
Are pseudorandom numbers any different? In what way?
In the context of computer security and cryptography, the idea of having more "true" randomness is typically not meaningful. This is due to the fact that algorithms called cryptographically-secure pseudorandom number generators (CSPRNGs) are capable of taking a finite number of bits in the form of a key (or "seed") and expanding them into a virtually unlimited amount of pseudorandom bits. A CSPRNG that can be seeded with 256 bits of entropy is capable of producing up to 2^256 distinct pseudorandom streams. Despite being limited to "only" 256 bits, it is impossible to determine what the key is (and thus what the subsequent generated bits will be) without iterating through all 2^256 possible keys. Since this is not only impossible with current technology, but also impossible even if we harness the entire energy output of the sun for the rest of its life, there is no reason to ever require more than a very small amount of "true" entropy.
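As a rough sketch of the idea (not a production design), an extendable-output function like SHAKE-256 can stretch a single 256-bit seed into an arbitrarily long stream:

```python
import os
import hashlib

# Illustration only: a real system should simply keep using the OS CSPRNG.
seed = os.urandom(32)  # 256 bits of entropy

# SHAKE-256 is an extendable-output function, so one small seed can be
# stretched into an arbitrarily long pseudorandom stream.
stream = hashlib.shake_256(seed).digest(1_000_000)  # one million bytes
```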
Is there any more reliable way to measure entropy?
Not really. Randomness tests are strange beasts. They are capable of checking for certain biases and as such can tell you that the data is, with a very high likelihood, not random, but they are completely incapable of telling you that it is random. As such, randomness tests would be better off called non-randomness tests! Unfortunately, even state-of-the-art tests only check for certain types of patterns and distributions. Unless you have an oracle that tells you exactly how many uniformly distributed states a number of bits can exist in, you cannot accurately measure entropy.
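To make that concrete, here is a sketch of the simplest such test, the NIST-style frequency (monobit) test. Note how a trivially patterned input can still pass:

```python
from math import erfc, sqrt

def monobit_test(bits: str, alpha: float = 0.01) -> bool:
    """NIST-style frequency (monobit) test. It can only ever *reject*:
    a True result means "no bias found here", not "this is random"."""
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)
    p_value = erfc(abs(s) / sqrt(2 * n))
    return p_value >= alpha

print(monobit_test("1" * 100))   # False: obviously biased
print(monobit_test("10" * 50))   # True: passes, yet "1010..." is anything but random
```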
So, how should you collect randomness from an external noise source like an ADC? You need to gather a few hundred bits (256 is a safe number) of entropy from your noise source and use it to seed a cryptographically-secure random number generator. There are four common techniques:
Use the first 256 bits directly. If you are absolutely sure that your noise source is a true random number generator (e.g. it uses beam splitting or otherwise takes advantage of quantum uncertainty), then this is sufficient. The benefit is that you get the seed very quickly, but the drawback is that you need to trust your noise source to generate true randomness.
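A minimal sketch of this approach, assuming the noise source streams raw bytes over a serial port via pyserial (the port path and baud rate here are hypothetical):

```python
import serial  # pyserial; assumes the Arduino streams raw noise bytes over serial

# Hypothetical port and baud rate; only sound if the source is a genuine TRNG.
with serial.Serial("/dev/ttyUSB0", 115200, timeout=5) as port:
    seed = port.read(32)  # the first 256 bits, used directly as the seed
assert len(seed) == 32, "source stalled before delivering 256 bits"
```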
Employ debiasing on a large number of input bits. Debiasing is any post-processing technique that reduces the bias in an input. For example, ADC output may be biased towards 1s, but that does not mean it is unsuitable as an entropy source. Even if each bit has a 60% chance of being a 1, that chance can still be genuinely random. Debiasing can then be used to create a uniformly-distributed output. One of the first techniques is Von Neumann debiasing.
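A sketch of Von Neumann debiasing, which maps the bit pair 01 to 0, maps 10 to 1, and discards 00 and 11:

```python
def von_neumann(bits):
    """Von Neumann debiasing: examine non-overlapping bit pairs, keep
    01 -> 0 and 10 -> 1, discard 00 and 11. The output is unbiased as
    long as the input bits are independent, even if each leans toward 1."""
    out = []
    for a, b in zip(bits[::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out

# A source that emits 1 sixty percent of the time still yields unbiased output:
print(von_neumann([1, 1, 0, 1, 1, 0, 0, 1, 1, 1]))  # pairs 11,01,10,01,11 -> [0, 1, 0]
```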
XOR the noise source. When you have a random value A and a known or attacker-controlled value B, you can combine them using the XOR operation. If you compute A ⊕ B = C, knowledge of B alone does not reveal any information about C. This allows you to mix together a large number of mutually-independent sources such that the final value is at least as unpredictable as the most unpredictable source. This technique is very fast, but it is only useful on already debiased output. If there is any internal structure to the noise source, simply XORing it with itself may not be particularly effective.
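A small sketch of XOR combining, with a hypothetical worthless source thrown in to show that mixing it in cannot hurt the result:

```python
def xor_combine(*sources: bytes) -> bytes:
    """XOR equal-length outputs from mutually independent sources; the result
    is at least as unpredictable as the best single source."""
    length = min(len(s) for s in sources)
    out = bytearray(length)
    for s in sources:
        for i in range(length):
            out[i] ^= s[i]
    return bytes(out)

# Hypothetical values: a good source masks a weak or attacker-known one.
a = bytes.fromhex("9f86d081884c7d65")  # independent random source A
b = bytes.fromhex("0000000000000000")  # worthless source B (all zeros)
print(xor_combine(a, b).hex())          # still as unpredictable as A alone
```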
Pass it to a secure hash algorithm. This is the technique you mentioned you ended up using. A cryptographic hash can take an input of virtually unlimited size and output a finite digest. This digest has a uniform distribution of bits, and the output reveals nothing about the input. If given 4 random samples, you can safely combine them with H(S0 || S1 || S2 || S3). Hash functions do not keep the entire input in their state, so memory does not limit how many samples you can hash together. This technique is by far the most effective way to combine multiple noise sources, or a single biased noise source, into a small, random seed that can be used to generate an unlimited amount of randomness later.
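A sketch of that construction using SHA-256 (the sample values below are made up for illustration; real code would read them from the noise source):

```python
import hashlib

def seed_from_samples(samples) -> bytes:
    """Condense any number of biased noise samples into a 256-bit seed:
    H(S0 || S1 || S2 || S3 || ...) using SHA-256."""
    h = hashlib.sha256()
    for s in samples:      # incremental hashing, so memory use stays flat
        h.update(s)
    return h.digest()      # 32 bytes, uniformly distributed

# Hypothetical ADC samples:
samples = [b"\x01\x02\x03", b"\xff\xfe", b"\x10" * 64, b"\x00\x42"]
seed = seed_from_samples(samples)
print(seed.hex())
```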