I'm working on some code that attempts to identify files whose contents appear to be "random". As such, I'm looking for statistical measures that can be used to identify such randomness.
I've implemented the following so far (see the sketch after this list):
- Shannon entropy of the sample (i.e. −∑ p(xᵢ) log₂ p(xᵢ) over the sample X; 8 bits per byte is the theoretical maximum)
- Bytewise sample variance, i.e. the variance of the 256 byte-value counts (0.0 is a perfect score, corresponding to a perfectly flat histogram)
- Bitwise mean (0.5 is a perfect score)
- Bitwise mean computed separately for each bit position 0 to 7 of each byte (looking for any significant variance between positions)
- Percentage of bytes within the range 0x20 ≤ b ≤ 0x7E (37.11% is a perfect score)
- Bit-pair counts (i.e. checking for equal counts of 00, 01, 10, 11)
- Repeated byte distances (i.e. looking for xx, x*x, x**x, x***x, ... patterns)
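For concreteness, here's a minimal Python sketch of the measures above. The function names are my own, and two interpretations are assumptions on my part: the bit pairs are taken as the four non-overlapping 2-bit groups within each byte, and the repeat distances are the gaps between successive occurrences of the same byte value.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Byte-wise Shannon entropy in bits per byte (8.0 is the maximum)."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def histogram_variance(data: bytes) -> float:
    """Variance of the 256 byte-value counts (0.0 = perfectly flat histogram)."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    mean = len(data) / 256
    return sum((c - mean) ** 2 for c in counts) / 256

def bitwise_mean(data: bytes) -> float:
    """Mean of all bits in the sample (0.5 is ideal)."""
    return sum(bin(b).count("1") for b in data) / (8 * len(data))

def per_bit_means(data: bytes) -> list:
    """Mean of each bit position 0..7 across all bytes (each should be ~0.5)."""
    n = len(data)
    return [sum((b >> i) & 1 for b in data) / n for i in range(8)]

def printable_fraction(data: bytes) -> float:
    """Fraction of bytes in 0x20..0x7E (95/256 ~ 37.11% is ideal)."""
    return sum(0x20 <= b <= 0x7E for b in data) / len(data)

def bit_pair_counts(data: bytes) -> list:
    """Counts of the 2-bit patterns 00, 01, 10, 11, taken as the four
    non-overlapping pairs in each byte (equal counts are ideal)."""
    counts = [0, 0, 0, 0]
    for b in data:
        for shift in (0, 2, 4, 6):
            counts[(b >> shift) & 0b11] += 1
    return counts

def repeat_distances(data: bytes) -> Counter:
    """Histogram of gaps between successive occurrences of the same byte value
    (distance 1 = xx, 2 = x*x, 3 = x**x, ...); for uniform random bytes the
    gap counts should decay geometrically with ratio 255/256."""
    last_seen = {}
    gaps = Counter()
    for i, b in enumerate(data):
        if b in last_seen:
            gaps[i - last_seen[b]] += 1
        last_seen[b] = i
    return gaps
```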
These seem to give a reasonable measure of uniform "randomness", in that 128 bytes generated by concatenating two SHA512 hashes can easily be distinguished from ASCII text or an executable file simply by looking at the statistics.
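A quick way to reproduce that comparison, reusing the helpers from the sketch above (the seed strings and sample text are arbitrary placeholders of my own):

```python
import hashlib

# 128 "random" bytes: two concatenated SHA512 digests, as described above.
random_sample = (hashlib.sha512(b"seed one").digest()
                 + hashlib.sha512(b"seed two").digest())
# 128 bytes of plain ASCII text for comparison.
ascii_sample = (b"The quick brown fox jumps over the lazy dog. " * 3)[:128]

for name, sample in (("sha512x2", random_sample), ("ascii", ascii_sample)):
    print(f"{name:8s} entropy={shannon_entropy(sample):.3f} "
          f"bit-mean={bitwise_mean(sample):.3f} "
          f"printable={printable_fraction(sample):.2%}")
```

The SHA512 sample should land near a 0.5 bit mean and ~37% printable bytes, while the ASCII sample shows nearly 100% printable bytes and markedly lower entropy.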
However, I'd like to know whether there are any other strong indicators I should look into that would suggest the sample data was generated by a cryptographic PRNG.
Update: To be absolutely clear, I'm not trying to test the quality of a PRNG as a cryptographic randomness source. I'm trying to distinguish between non-random files (e.g. executables, JPEGs, archives, etc.) and random files (e.g. TrueCrypt keyfiles). My tests so far serve either to identify obviously non-random data (e.g. ASCII text) or to flag data that isn't uniformly distributed in some way. I'm looking for other ways to test this.