
I'm working on some code that attempts to identify files whose contents appear to be "random". As such, I'm looking for statistical measures that can be used to identify such randomness.

I've implemented the following so far (a rough sketch of a few of these follows the list):

  • Shannon entropy of the sample (i.e. -∑ p(x_i) log2 p(x_i) over the sample X)
  • Bytewise sample variance (0.0 is a perfect score)
  • Bitwise mean (0.5 is a perfect score)
  • Bitwise mean distributed across bits 0 to 7 of each byte (looking for any significant variance)
  • Percentage of bytes within the range 0x20 ≤ b ≤ 0x7E (37.11% is a perfect score)
  • Bit-pair counts (i.e. checking for equal counts of 00, 01, 10, 11)
  • Repeated byte distances (i.e. looking for xx, x*x, x**x, x***x, ... patterns)
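
For reference, here is a rough sketch in Python of how three of these measures can be computed (the function names are mine, not from my actual implementation):

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    # -sum p(x_i) log2 p(x_i) over the byte values; the maximum is 8.0 bits
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def bitwise_mean(data: bytes) -> float:
    # mean of all bits in the sample; 0.5 is a perfect score
    return sum(bin(b).count("1") for b in data) / (8 * len(data))

def printable_ratio(data: bytes) -> float:
    # fraction of bytes in 0x20..0x7E; 95/256 ~ 37.11% for uniform data
    return sum(1 for b in data if 0x20 <= b <= 0x7E) / len(data)

sample = os.urandom(4096)  # stand-in for a file's contents
print(shannon_entropy(sample), bitwise_mean(sample), printable_ratio(sample))
```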

These seem to give a reasonable measure of uniform "randomness", in that 128 bytes generated by concatenating two SHA512 hashes can easily be distinguished from ASCII text or an executable file simply by looking at the statistics.

However, I'd like to know if there are any other strong indicators that I should look into that might indicate that the sample data was generated by a cryptographic PRNG.

Update: To be absolutely clear, I'm not trying to test the quality of a PRNG as a cryptographic randomness source. I'm trying to distinguish between non-random files (e.g. executables, JPEGs, archives, etc.) and random files (e.g. TrueCrypt keyfiles). My tests so far are used as ways to identify obviously non-random data (e.g. ASCII text) or data that isn't uniformly distributed in some way. I'm looking for other ways to test this.

Polynomial

  • If you consider the input file as the output of a random number generator, you might try applying the [Diehard tests](http://en.wikipedia.org/wiki/Diehard_tests). – Dan D. Mar 11 '13 at 05:57
  • Why only `0 to 7`? Simple, small texts could be encoded in 5, 6, 7 or 8 bits in order to hide them using steganography, for example. – F. Hauri - Give Up GitHub Mar 11 '13 at 07:26
  • What are the $x_i$ in your Shannon formula? Bytes? Bits? It looks a bit redundant with the other formulas to me. – CodesInChaos Mar 11 '13 at 08:02
  • Do you want to work on short byte sequences or on long ones? For example, Dieharder is mainly designed for long sequences. – CodesInChaos Mar 11 '13 at 11:31
  • @F.Hauri I'm only trying to distinguish non-random-looking data from random-looking data, so looking for the average value of the 0th bit across all bytes vs. the average value of the 1st bit across all bytes vs. ... vs. the average value of the 7th bit across all bytes gives me a decent indicator of non-random data (e.g. the 0th bit will have a mean of 0 for standard ASCII). – Polynomial Mar 11 '13 at 12:46
  • @CodesInChaos x_i is bytes - sorry, the subscript and lowercase are rather confusing. I wanted to use the proper math markup but we don't have it here. It's an implementation taken from a proper stats library, so it is indeed correct. – Polynomial Mar 11 '13 at 12:53

2 Answers


During the AES competition, the organizing body (NIST) went to the trouble of running extensive statistical testing on the output of the 15 submitted block ciphers, with what was believed to be the gold standard of such tests, the Diehard tests. They found nothing, of course. Comments from cryptographers at the time were that these tests were a bit ridiculous in the context of cryptographic algorithms (unofficial comments, of course: it would have been discourteous to embarrass the host of the event); it was good for public relations that someone ran them, but nothing worth noting was expected from it.

This illustrates the difference between statistical randomness, as is needed for instance to power big simulations of physical systems, and unpredictability, which is required for security. Consider the following PRNG:

  • Set a counter to a 128-bit seed.
  • To produce the next 128 bits of the PRNG, increment the counter by 1, then encrypt that value with AES and an all-zero key, yielding the 128 bits.
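
A minimal sketch of that PRNG, assuming the third-party Python `cryptography` package (the name `counter_prng` is mine):

```python
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def counter_prng(seed: int, blocks: int) -> bytes:
    # Encrypt an incrementing 128-bit counter under a fixed all-zero
    # AES-128 key: statistically excellent output, zero unpredictability.
    key = bytes(16)
    out = bytearray()
    counter = seed
    for _ in range(blocks):
        counter = (counter + 1) % (1 << 128)
        enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
        out += enc.update(counter.to_bytes(16, "big")) + enc.finalize()
    return bytes(out)
```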

This PRNG will successfully pass all the tests of the Diehard suite; it is statistically "perfect" as far as the tests are concerned (departure from pure alea might be detectable only after generating about 2^71 bits, i.e. more than 200 million terabytes). However, for security it is terrible, since it is trivial to predict future output once 128 bits of output have been observed (decrypt those 128 bits with AES and the all-zero key: this yields the current counter value).
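
To make the prediction attack concrete, here is a hedged sketch that recovers the counter from a single observed block and predicts the next one (it reuses `counter_prng` from the sketch above):

```python
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = bytes(16)  # the well-known all-zero key from the PRNG above

def predict_next_block(observed: bytes) -> bytes:
    # Decrypt the observed 16-byte block to recover the counter value,
    # then encrypt counter + 1 to predict the next output block.
    dec = Cipher(algorithms.AES(KEY), modes.ECB()).decryptor()
    counter = int.from_bytes(dec.update(observed) + dec.finalize(), "big")
    nxt = ((counter + 1) % (1 << 128)).to_bytes(16, "big")
    enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    return enc.update(nxt) + enc.finalize()

stream = counter_prng(seed=2024, blocks=2)
assert predict_next_block(stream[:16]) == stream[16:]  # prediction succeeds
```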

Summary: a cryptographically secure PRNG produces output which is indistinguishable from randomness, not only by statistical tests but also by someone who knows exactly which PRNG algorithm might have been used. If you have a statistical tool which can show that a given sequence of bytes comes from a PRNG, then you have proven not only that the PRNG is not cryptographically secure, but also that it is awfully bad at its job. Even Dave's hash is "random" from the point of view of the Diehard tests.

In shorter words, your tool won't catch randomness which was generated with some expertise, but it will catch most non-randomness from complete amateurs. With such tools, you are not targeting "non-cryptographers who should know better than to use homemade schemes", but rather "chimpanzees who should not be allowed to touch a keyboard".

Note: good steganography tools encrypt data before stuffing it into the transport medium, precisely to avoid detection by statistical tools. Encryption, too, should yield bits indistinguishable from randomness.

Thomas Pornin
  • I'm not so much looking to detect *good* randomness as looking to detect *a lack of non-randomness*. – Polynomial Mar 11 '13 at 12:54
  • If the data is non-random but not _intentionally_ non-random (just something like a compressed file or a JPEG image), then chances are that the file will begin with a "known header". The [Unix "file" command](http://en.wikipedia.org/wiki/File_(command)) will detect many common file types. – Thomas Pornin Mar 11 '13 at 12:56
  • Yep, that's my primary source of identification. I'm using a set of definitions that match the file extension to known internal data structures and magic numbers. – Polynomial Mar 11 '13 at 20:31

http://www.fourmilab.ch/random/ (the `ent` utility) uses the following tests; a sketch of two of them follows the list:

  • Chi-square test
  • Arithmetic mean
  • Monte Carlo Pi estimation
  • Serial Correlation Coefficient
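
A rough sketch of two of these, loosely modelled on ent's approach (ent consumes 6 bytes per Monte Carlo point; the exact normalisation here is my assumption):

```python
import os

def chi_square(data: bytes) -> float:
    # chi-square statistic of byte counts against the uniform expectation
    # n/256; for random data it should land near 255 (degrees of freedom)
    expected = len(data) / 256
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return sum((c - expected) ** 2 / expected for c in counts)

def monte_carlo_pi(data: bytes) -> float:
    # read successive 6-byte groups as an (x, y) point in the unit square;
    # pi is estimated as 4 * (fraction of points inside the quarter circle)
    inside = total = 0
    for i in range(0, len(data) - 5, 6):
        x = int.from_bytes(data[i:i + 3], "big") / 0xFFFFFF
        y = int.from_bytes(data[i + 3:i + 6], "big") / 0xFFFFFF
        total += 1
        inside += x * x + y * y <= 1.0
    return 4 * inside / total if total else float("nan")

data = os.urandom(60000)
print(chi_square(data), monte_carlo_pi(data))  # expect ~255 and ~3.14
```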

You might want to add "incompressibility" too (this also means you might need to identify compressed files in order to eliminate them); a sketch of this follows below. Dan D.'s Diehard (and Dieharder) suggestion looks like a very thorough solution. (Edit: Dieharder is strictly targeted at testing PRNGs, not snapshots of their output, and is definitely not suitable for small data samples.)
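
For the incompressibility test, a minimal sketch using Python's standard zlib (the 0.9 threshold is illustrative, not calibrated):

```python
import zlib

def compression_ratio(data: bytes) -> float:
    # random data is essentially incompressible, so a ratio near (or even
    # slightly above) 1.0 suggests randomness; text and executables
    # typically compress well below that
    return len(zlib.compress(data, 9)) / len(data)

def looks_incompressible(data: bytes, threshold: float = 0.9) -> bool:
    return compression_ratio(data) > threshold
```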

Your task shares some of the problems involved in statistical file carving and forensic searching for encrypted data: the smaller the data set, the harder it is to be definitive about whether it's "random".

mr.spuratic