16

If I have a good random number generator that gives me a byte of data at a time, and I want to extract a random decimal digit of 0 to 9 from that byte stream, what is the correct way to do that?

At first I naively assumed that a simple (randomByte mod 10) calculation would be sufficient, but since 256 is not evenly divisible by 10, that results in a clear bias in the "random" digits:

0: 101323 #################################
1: 101261 #################################
2: 101473 #################################
3: 101389 #################################
4: 101551 #################################
5: 101587 #################################
6: 97831  ###############################
7: 97893  ###############################
8: 97843  ###############################
9: 97849  ###############################
(histogram from 1 million 'random' digits)

One method that appears to work is to discard any value above 249 and divide by 25. Is that cryptographically correct? Is there a better method that doesn't involve discarding (potentially expensive) bytes of randomness?

(this question is prompted by reading about a CryptoCat vulnerability, where one of the discovered flaws was that they discarded random values above 250 instead of above 249, giving a slight bias in their "random" numbers... so I was curious what the "right" way to do it is)

Johnny
  • 1,418
  • 13
  • 18
  • 3
    Discarding isn't expensive on average. Especially if you generate multiple digits from multiple bytes at a time. Btw. good stream ciphers(and thus PRNGs) are pretty fast, so the modulo operation is probably more expensive than generating a byte. – CodesInChaos Jul 20 '13 at 20:26
  • @CodesInChaos The question seem more theoretical. Anyway someone could work on old firmware where modern PRNG simply don't exist... ( In his question, Johnny don't speak about *modulo* but *divide by 25*, which have probably not same cost ;-) – F. Hauri - Give Up GitHub May 20 '14 at 21:25
  • @F.Hauri The cost of division and modulo should have the same on most CPUs. In fact on x86 the division instruction always outputs *both*. – CodesInChaos May 21 '14 at 07:50

3 Answers3

17

There are two generic ways to produce a "sufficiently unbiased" random digit.

First method is to loop if the byte was not in the right range. I.e.:

  • Get next random byte b.
  • If b is in the 0..249 range, return b mod 10.
  • Loop.

This method may consume an unbounded number of random bytes, but it is perfectly unbiased and it is very unlikely to require looping many times. That's what Java's Random.nextInt(int) standard method applies (albeit with 32-bit words instead of bytes).

Second method is to use as source value not one byte but a big enough word. Say, use 20 bytes, interpret these as an integer x in the 0..2160-1 range, and return x mod 10. This way, the bias is still there, but can be made arbitrarily small, to the point that it no longer matters. This is computationally expensive (more than the first method) but has the advantage of always requiring the same number of input bytes, which can be handy in some specific situations (e.g. side-channel leaks).

Tom Leek
  • 168,808
  • 28
  • 337
  • 475
  • 1
    With the first method one could simply include an upper bound for the number of iterations. Hitting it becomes exponentially rate and the bias exponentially small when the underlying generator is good. The advantage of such an upper bound is that you can prove that the generator will terminate. – CodesInChaos Jul 20 '13 at 20:31
  • 4
    Concerning side-channels, if the underlying PRNG is secure and properly seeded the side-channel shouldn't leak anything useful. It only gives the attacker information about discarded bytes, which shouldn't allow prediction of other bytes. The control flow of the first method isn't secret, even if the output is. – CodesInChaos Jul 20 '13 at 20:34
  • Your first method use *1,02* bytes for 1 *unbiaised number*, your second method use *20* bytes for 1 *near unbiaised number*... Please have a look at my method that use only *0.8* bytes for 1 *unbiaised number*! – F. Hauri - Give Up GitHub May 20 '14 at 21:07
3

Splitting byte, for rendering several numbers

As random generator may be something expensive, the more efficient way is to cut each byte (0-255) in two part (0-15), before dropping out of limit values:

There is a little sample using :

unset random cnt
while [ ! "$random" ] ;do
    ((cnt++))
    i=$(dd if=/dev/random bs=1 count=1 2>/dev/null | od -A n -t u1)
    for val in $((i&15)) $((i>>4)) ;do
        [ $val -lt 10 ] && [ ! "$random" ] && random=$val
    done
done
printf "%d bytes read, random value: %d\n" $cnt $random

Making a kind of comparission:

As an aswer to @AndreasKrey's comment, there is a little demo, where I try to obtain 10 numbers between 0 an 9:

I use same pot (of random numbers) in both methods:

  • Splitting byte in two parts and filtering bigger number than 9
  • Dropping numbers bigger than 249 and using mod:

.

#!/bin/bash

myRandom() {
    printf ${1+-v} $1 "%s" $(
        head -c1 /dev/urandom | od -A n -t u1
    )
}

num=${1:-10} byteMod=0 byteSplit=0 potMod=() potSplit=() wholePot=()

while [ ${#potMod[@]} -lt $num ] || [ ${#potSplit[@]} -lt $num ];do
    myRandom rndVal
    wholePot+=($rndVal)

    [ ${#potMod[@]}   -lt $num ] && ((byteMod+=1)) &&
        [ $rndVal -lt 250 ] && potMod+=($[rndVal%10])

    [ ${#potSplit[@]} -lt $num ] && ((byteSplit+=1))

    for val in $[rndVal&15] $[rndVal>>4] ;do
        [ $val -lt 10 ] && [ ${#potSplit[@]} -lt $num ] && potSplit+=($val)
      done

  done

printf "%2d bytes was read for rendering %2d values by split * max 10: %s\n" \
    $byteSplit ${#potSplit[@]} "${potSplit[*]}"
printf "%2d bytes was read for rendering %2d values by max 250 && mod: %s\n" \
    $byteMod ${#potMod[@]} "${potMod[*]}"

echo Whole pot: ${wholePot[@]}

This could be run several times:

./randGen10.sh
 6 bytes was read for rendering 10 values by split * max 10: 8 3 9 7 9 3 1 1 3 4
10 bytes was read for rendering 10 values by max 250 && mod: 6 1 7 0 9 0 3 1 3 9
Whole pot: 56 121 57 30 49 20 183 161 123 239

./randGen10.sh
 7 bytes was read for rendering 10 values by split * max 10: 7 1 5 0 7 4 6 9 4 4
10 bytes was read for rendering 10 values by max 250 && mod: 3 3 6 1 8 3 4 0 4 9
Whole pot: 23 213 176 71 198 73 244 220 154 139

./randGen10.sh
10 bytes was read for rendering 10 values by split * max 10: 0 8 3 9 6 6 8 9 2 3
11 bytes was read for rendering 10 values by max 250 && mod: 1 8 2 5 4 8 9 9 0 7
Whole pot: 221 128 254 62 105 214 168 249 189 50 77

./randGen10.sh
 7 bytes was read for rendering 10 values by split * max 10: 3 1 5 9 5 8 6 9 7 7
10 bytes was read for rendering 10 values by max 250 && mod: 9 1 9 8 8 1 7 4 7 6
Whole pot: 19 181 89 168 198 121 247 54 117 226

./randGen10.sh
 9 bytes was read for rendering 10 values by split * max 10: 5 8 6 3 6 8 5 4 0 1
10 bytes was read for rendering 10 values by max 250 && mod: 4 0 0 9 3 4 8 6 6 9
Whole pot: 234 90 200 109 243 214 88 196 16 199

./randGen10.sh
11 bytes was read for rendering 10 values by split * max 10: 3 1 9 5 0 0 6 9 5 5
10 bytes was read for rendering 10 values by max 250 && mod: 5 9 7 0 5 0 1 7 8 9
Whole pot: 175 19 157 90 235 0 191 107 238 89 117

Of course, there is some cases where splited max 10 method do use more bytes than mod max 250...

Explanation:

Surprisingly, having to drop 6/256 values -> 2.34% seem a lot smaller than having to drop 6/16 values -> 37.5%, but as we could obtain a second chance for each number:

  • by splitting, we have 2 x 62.5% => 125% chances to obtain a correct number
  • by using mod (or divide by 25), we only have 97.65% chances...

So, for rendering 100'000 values:

./randGen10.sh 100000 | sed 's/^\(.\{74\}\).*$/\1.../'
 80086 bytes was read for rendering 100000 values by split * max 10: 9 9 2...
102397 bytes was read for rendering 100000 values by max 250 && mod: 3 7 6...
Whole pot: 233 217 46 36 193 182 9 44 187 48 100 172 127 230 157 194 197 1...

This seem correct: 100'000/97.65% = 102'407 and 100'000/1.25=80'000.

  • Actually, this is less efficient. There are 36 byte values where neither the upper nor the lower nibble are decimal while there are only six values (250-255; FA-FF) you need to drop to make the (byte%10) method unbiased. (Unless, o/c, you can actually make use of the other nibble elsewhere.) – Andreas Krey May 20 '14 at 12:25
  • @AndreasKrey Hmm, no: By using `split` you have `2x63% = 125%` chances of having a correct number, while instead you only obtain `98%` chance by each byte read from random generator. – F. Hauri - Give Up GitHub May 20 '14 at 21:01
  • As I said, 'unless you can use the other nibble'. If you want to get *one* digit out of a random byte, mod 10 is more efficient - otherwise nibble-split is. But there is an even better way: If the random by is less than 200, take the two lower digits of the decimal representation, otherwise if it is less the 250 take the last digit. That yields 450 result digits from the 256 input values, or 175%. – Andreas Krey May 22 '14 at 06:16
2

There are two ways you can do this.

First is to split the byte into two 4 bit numbers. If the 4-bit number is above 9, discard it. You can get up to 2 digits per byte this way.

Another way you could do this is if it is under 200, take the last two digits. Then you have 2 random digits. If it's under 250, take the last digit. If it's above 250, discard it. This way you can get the most random digits our of your number.

schroeder
  • 123,438
  • 55
  • 284
  • 319
Spl1ce
  • 21
  • 1