Super fast sampling from a tricky random distribution

This is a micro-optimization challenge. The task is to sample from the maximally skewed stable distribution F(x; 1, -1, pi/2, 0); see Table 1 of http://arxiv.org/pdf/0908.3961v2.pdf . It is easiest to describe how to do this with the following C code, which uses the Mersenne Twister RNG from the mtwist C library.

#include <stdio.h>
#include <stdlib.h>
#include "mtwist.h"
#include <math.h>

int main(void) {
   int i;
   volatile double x;   /* volatile so the loop is not optimized away */
   mt_seed();
   double u1;
   double u2;
   double w1;
   double w2;
   for(i = 0; i < 100000000; ++i) {
     u1 = mt_drand();          /* uniform on [0, 1) */
     u2 = mt_drand();
     w1 = M_PI*(u1-1/2.0);     /* uniform on [-pi/2, pi/2) */
     w2 = -log(u2);            /* standard exponential */
     /* formula from Table 1 of the linked paper */
     x = tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1));
   }
   return EXIT_SUCCESS;
}

I compile with

gcc -Wall -O3 random.c mtwist-1.5/mtwist.c -o random -lm

The running time is 20 seconds, i.e. about 5,000,000 iterations per second.

There are two obvious ways of making this faster. The first is to choose a very fast uniform random number generator. If you choose anything other than the Mersenne Twister, then you must run it through the Diehard tests and check the P-values it reports. Use the code at http://www.stat.fsu.edu/pub/diehard/ to do this. Your uniform random number generator must pass at least 15 of the tests.
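
For illustration, here is a minimal C sketch of one popular candidate, the xorshift128+ generator (also suggested in the comments below). The seed constants are arbitrary placeholders, and any such generator would still need the Diehard log the rules require.

#include <stdint.h>

/* xorshift128+; the state must be seeded to anything nonzero.
   These seed constants are arbitrary placeholders. */
static uint64_t s[2] = { 0x123456789abcdefULL, 0xfedcba9876543210ULL };

static inline uint64_t xorshift128plus(void) {
    uint64_t x = s[0];
    uint64_t const y = s[1];
    s[0] = y;
    x ^= x << 23;
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
    return s[1] + y;
}

/* Map the top 53 bits to a double in [0, 1). */
static inline double xs_drand(void) {
    return (xorshift128plus() >> 11) * (1.0 / 9007199254740992.0); /* 2^-53 */
}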

The second is to find fast micro-optimizations for the transcendental functions, perhaps using CPU-specific features, for example those of the AMD FX-8350 CPU in my case.

Rules

The code must maintain at least 32 bits of accuracy throughout. It should be compilable on Ubuntu using easy-to-install free software only; please provide full compile and run instructions. You should also provide a Diehard log for any non-Mersenne-Twister RNG, as was done for "Build a random number generator that passes the Diehard tests". Note the very simple RNG that passes 15 tests in that link.

Scores

Your score is the time taken by the code above divided by the time taken by your code on your own machine. You can use any compiler and compilation options you like for your code; to time the code above, use gcc 4.8 or later with -O3. If any submission takes less than one second, I will increase the number of iterations N by factors of 10 until the fastest submission takes longer than one second.

Test machine

My test machine was a standard Ubuntu install with 8 GB of RAM on an AMD FX-8350 processor. Your machine can be anything that supports gcc 4.8 or later.

The Intel optimization manual says

If there is no critical need to evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision and the size of the look-up table, and by taking advantage of the parallelism of the SSE and the SSE2 instructions.
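
As a rough illustration of that idea (my own sketch, not code from the manual), a table-based sine with linear interpolation might look like the following. Note that a 4096-entry table with linear interpolation only reaches roughly 21-22 bits of accuracy, so meeting the 32-bit rule would need a larger table or higher-order interpolation.

#include <math.h>

#define TABLE_BITS 12
#define TABLE_SIZE (1 << TABLE_BITS)

static double sin_table[TABLE_SIZE + 1];

static void init_sin_table(void) {
    int i;
    for (i = 0; i <= TABLE_SIZE; ++i)
        sin_table[i] = sin(2.0 * M_PI * i / TABLE_SIZE);
}

/* Approximate sin(x) for x in [0, 2*pi) by linear interpolation. */
static inline double sin_lut(double x) {
    double t = x * (TABLE_SIZE / (2.0 * M_PI));
    int i = (int)t;
    double frac = t - (double)i;
    return sin_table[i] + frac * (sin_table[i + 1] - sin_table[i]);
}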

According to this very helpful table, fcos has latency 154 and fptan has latency 166-231 on my AMD FX-8350.

user9206

Posted 2014-05-17T19:06:28.973


If you want to avoid compilers optimizing away the entire computation, the easiest way is to declare x as volatile double x. – Reimer Behrends – 2014-05-22T14:15:51.647

@ReimerBehrends I am happy with that solution! Thank you. Now I just need some answers to the question :) – None – 2014-05-22T15:46:19.353

Incidentally, is the line w2 = -log(u2) correct? After all, u2 can be zero. – Reimer Behrends – 2014-05-23T09:41:33.760

Looking at it more closely, I suspect that both u1 and u2 should be drawn from (0, 1) rather than [0,1). Also, 1/2 should probably be 1/2.0 in C (since / is integer division if both operands are ints). – Reimer Behrends – 2014-05-23T10:34:03.293

@ReimerBehrends I will check the possible 1/2 bug. Thank you. As for the other part, the formula is taken from Table 1 of http://arxiv.org/pdf/0908.3961v2.pdf , so if my implementation doesn't follow that, then that is my mistake. I suppose mathematically the probability that u1 or u2 is exactly 0 is exactly 0; it's only with limited precision that you get a non-zero probability.

– None – 2014-05-23T13:20:12.613

According to the paper, u1/u2 are indeed meant to be uniformly distributed over the open interval (0,1), not over [0,1). That's easy to fix, though. – Reimer Behrends – 2014-05-23T13:25:51.087
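
A minimal C fix along those lines is rejection: mt_drand() returns values in [0, 1), so only 0.0 ever needs to be rejected.

/* Draw uniformly from the open interval (0, 1). */
static inline double mt_drand_open(void) {
    double u;
    do {
        u = mt_drand();
    } while (u == 0.0);
    return u;
}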

A double has 53 bits of precision, so some libraries generate 53 random bits, then divide the 53-bit integer by 2^53. This yields a double in [0,1), with a 1 in 2^53 chance of 0. – kernigh – 2014-06-01T02:35:03.087
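
For reference, a sketch of the classic way to do this with a 32-bit generator (the genrand_res53 recipe from the Mersenne Twister reference code, assuming mt_lrand() returns 32 random bits, as in mtwist-1.5):

#include <stdint.h>

/* Build a double in [0, 1) from 53 random bits: 27 high bits and
   26 low bits, combined and divided by 2^53. */
static inline double drand53(void) {
    uint64_t a = mt_lrand() >> 5;   /* top 27 bits */
    uint64_t b = mt_lrand() >> 6;   /* top 26 bits */
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0);
}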

Some suggestions. 1) Use the RNG xorshift128+ from http://xorshift.di.unimi.it/xorshift128plus.c . 2) Remove the slow tan: use sincos and compute tan from its results. 3) Even better, observe that you are computing tan(w1)*(M_PI_2-w1) + log(cos(w1)/(M_PI_2-w1)) + log(w2), and build a polynomial approximation for the first two terms that falls back to the expensive version near its extremes (which will be hard to approximate).

– Anush – 2014-06-02T19:55:24.677
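
Suggestion 2 in C, for reference: glibc exposes sincos() as a GNU extension, so one transcendental call yields both values. This is a sketch of the inner computation only, not a full program.

#define _GNU_SOURCE   /* for sincos() in glibc */
#include <math.h>

/* w1 in (-pi/2, pi/2) and w2 > 0, as in the question's loop. */
static inline double sample(double w1, double w2) {
    double s, c;
    sincos(w1, &s, &c);   /* one call instead of separate tan and cos */
    return (s / c) * (M_PI_2 - w1) + log(w2 * c / (M_PI_2 - w1));
}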

Answers


GMPY2

Abuse of GMP. @OP: the slowest part of the code is printing the numbers; if leaving them in memory is fine, the code will be much, much faster.

import math
import gmpy2

gmpy2.get_context().precision = 32  # 32 bits of accuracy

N = 100000000
random_state = gmpy2.random_state()
pi = math.pi
half_pi = pi / 2
for i in range(N):
    u1 = gmpy2.mpfr_random(random_state)  # uniform on [0, 1)
    u2 = gmpy2.mpfr_random(random_state)
    w1 = gmpy2.mul(pi, gmpy2.sub(u1, 0.5))
    w2 = -gmpy2.log(u2)
    a = gmpy2.mul(gmpy2.tan(w1), gmpy2.sub(half_pi, w1))
    b = gmpy2.log(gmpy2.div(gmpy2.mul(w2, gmpy2.cos(w1)), gmpy2.sub(half_pi, w1)))
    x = gmpy2.add(a, b)

qwr

Posted 2014-05-17T19:06:28.973

Reputation: 8 929

I would like to remove the printing. The problem is that the whole code might be optimized out then! – None – 2014-05-17T21:47:23.490

@Lembik Write it to file? – qwr – 2014-05-17T22:47:46.373

I added an if statement which should reduce the output size by a factor of about 10000. – None – 2014-05-18T07:15:28.330

@Lembik: if statements are slow. – Kyle Kanos – 2014-05-18T19:05:46.053

@KyleKanos If you can think of a better way to do this please let me know! It was my best attempt so far. But is it really slow if the test almost always fails? – None – 2014-05-18T19:14:31.837

@Lembik: The code needs to test the inequality each iteration, which is what makes it slow. A faster way to ensure it isn't optimized out is to just sum x and print it out at the end (or something similar). – Kyle Kanos – 2014-05-18T19:16:30.423

@KyleKanos Are you available on chat? I just tested without the if and it takes the same amount of time. – None – 2014-05-18T19:21:48.097

@Lembik: I'm not available for chat, I'm about to head out with my kids. Probably tomorrow though. By summing x, I get a ~11s run, whereas printing for x>3 takes more than 2 minutes (using ifort for both runs). – Kyle Kanos – 2014-05-18T19:31:45.977

@KyleKanos OK chat tomorrow. Very odd though about your timings. I get 20 seconds for both using gcc -O3. Are you piping to /dev/null? Otherwise it would be slower. – None – 2014-05-18T19:41:14.923

@Lembik I can't run your code: even though I have mtwist.h, I get the error `undefined reference to 'mt_default_state'`, and there's not much documentation online about it. I think you should test my code (python and gmpy2 are very easy to install). – qwr – 2014-05-29T22:07:39.740

You need the .c file (see the example gcc line in the question). I just put all the mtwist files in one directory and ran that gcc line. I can test your code on Sunday as well. – None – 2014-05-30T10:42:24.627

Unfortunately your code is so slow I killed it after half an hour. How long does it take for you? – None – 2014-05-31T19:06:14.120


Nimrod (Score = 1.71, 4.8 vs. 8.2 seconds)

Observe that x->[[cos x, -sin x], [sin x, cos x]] is a group isomorphism between [0, 2*pi) (with respect to addition modulo 2*pi) and the rotation matrices (as a subgroup of SL(2, R)).

Because w1 is effectively of the form pi*k/2^32, where k is a signed 32-bit integer, we can split k into two 16-bit halves h and l, represent w1 as the sum pi*h/2^16 + pi*l/2^32, and tabulate rotation matrices for each addend in 2*sizeof(double)*2^16 bytes per table. We can then derive sin(w1) and cos(w1) from the product of the two rotation matrices per the isomorphism above (and tan(w1) as sin(w1)/cos(w1)).

The code below is in Nimrod, but should be directly portable to C, with the necessary renaming to integer and float types (float64 in Nimrod corresponds to double in C, and int corresponds to intptr_t, i.e. 64-bit on most modern architectures) and essentially no change in runtime. It currently does not print anything, but uses a volatile variable as described in my earlier comment to avoid the code being optimized away.

Further speedups may come from looking at how to make the computation of the natural logarithm cheaper (and, of course, replace the Mersenne Twister with a different RNG, but that's not very interesting).

Compile with nimrod cc -d:release random.nim.

Compile with -d:validate instead to check floating point errors with respect to the normal sin/cos calculations.

Update: Added an option to use an Xorshift RNG, enabled by compiling with -d:xorshift. This reduces runtime by approximately a quarter. Also fixed RNG generation to avoid u1 == 0 or u2 == 0.

import math, mersenne, unsigned

# It is sufficient to represent the rotation matrix
# [ [ cos x, -sin x], [sin x, cos x] ]]
# by just the values in the left column/bottom row.

type RotMat = tuple[sine, cosine: float64]

var tab1, tab2: array[65536, RotMat]

for i in 0..65535:
  tab1[i] = (sin(pi*float64(i-32768)/float64(65536)),
             cos(pi*float64(i-32768)/float64(65536)))
  tab2[i] = (sin(pi*float64(i)/float64(65536*65536)),
             cos(pi*float64(i)/float64(65536*65536)))

proc `*`(x, y: RotMat): RotMat {.inline.} =
  result.cosine = x.cosine*y.cosine-x.sine*y.sine
  result.sine = x.cosine*y.sine+x.sine*y.cosine

proc sincos(x: int): RotMat =
  let r1 = tab1[(x shr 16) and 65535]
  let r2 = tab2[x and 65535]
  return r1*r2

when defined(xorshift):
  type RNGState = object
    x, y, z, w: uint32

  proc getNum(s: var RNGState): int =
    let t = s.x xor (s.x shl 11)
    s.x = s.y; s.y = s.z; s.z = s.w
    s.w = s.w xor (s.w shr 19) xor t xor (t shr 8)
    result = int(s.w)

  var rng = RNGState(x: 1)
else:
  var rng = newMersenneTwister(1)

const maxRand = 1 shl 32

for i in 1..100_000_000:
  var t1, t2: int
  while true:
    t1 = rng.getNum; if t1 != 0: break
  let u1 = t1/maxRand
  while true:
    t2 = rng.getNum; if t2 != 0: break
  let u2 = t2/maxRand
  let w1 = pi*(u1-1/2)
  let w2 = -ln(u2)
  when defined(original):
    let (sw1, cw1) = (sin(w1), cos(w1))
  else:
    let (sw1, cw1) = sincos(t1)
  var x {.volatile.} = sw1/cw1*(pi/2-w1)+ln(w2*cw1/(pi/2-w1))
  when defined(validate):
    var x2 = sin(w1)/cos(w1)*(pi/2-w1)+ln(w2*cos(w1)/(pi/2-w1))
    echo sw1-sin(w1)
    echo cw1-cos(w1)
    echo x-x2

Reimer Behrends

Posted 2014-05-17T19:06:28.973

Reputation: 899

Thank you! Do you know any faster RNGs which pass 15 diehard tests? – None – 2014-05-23T06:50:28.383


Certain variants of Xorshift should fit the bill, though I can't say that I'm much of an expert on PRNGs.

– Reimer Behrends – 2014-05-23T09:36:51.623

What is your current score using xorshift? Also, what do you think the prospects are for further speed ups? It seems sse might help...maybe. – None – 2014-05-25T19:16:18.557

The current score (on my machine, this can vary greatly by hardware) is about 2 (compared to compiling with -d:original). That's assuming that Xorshift is an acceptable PRNG. The limiting factor seems to be the need for at least 32 bits of precision. This makes it difficult to use cheaper approximations or to create a polynomial approximation of the formula as a whole. – Reimer Behrends – 2014-05-26T10:00:47.103

The problem is that the distribution is dramatically skewed. Try plotting it and you will see. I am not sure how to capture this long tail without using a lot of bits. – None – 2014-05-26T16:48:33.130

I understand why you want the precision; I was merely noting that it prevents some common tricks that you can pull in other situations (such as certain applications in computer graphics, where the limitations of human vision may impose lesser constraints). If you want full precision, it's generally hard to beat the established algorithms that have generally been optimized to a fare-thee-well. I got a speedup primarily because I could exploit the fact that tan and cos were operating on a limited domain. – Reimer Behrends – 2014-05-26T19:23:38.947

I updated the rules so you can just run it on your own CPU (assuming you pass the diehard tests). – None – 2014-05-29T19:28:06.487

I added some suggestions, but the one most relevant to your answer is that I think you can make a good approximation for most of the distribution and just fall back to the more precise version when the random variables are near their extremes. – Anush – 2014-06-02T19:57:38.483

@Anush: Yes, I was already aware of that, but the rules require 32-bit precision throughout, and I can't change that. – Reimer Behrends – 2014-06-06T12:13:10.357

@Lembik: W.r.t. using SSE, that's possible in principle. The RNG could be optimized that way (since you essentially can generate four 32-bit random numbers in parallel). The main problem with the rest of the code is that I'd have to write a sufficiently precise natural logarithm implementation for SSE (which isn't exactly a quick thing to do), so that I could run two of the calculations in parallel (using packed double representation). – Reimer Behrends – 2014-06-06T12:25:31.243

Having thrown together a quick SSE implementation for Xorshift, that doesn't seem to help; if anything, it seems to be marginally slower, and the Xorshift RNG already takes up less than 10% of the total time. The bulk of the remaining time appears to be spent computing the natural logarithm; removing the ln(...) calls reduces runtime to less than 1.5 seconds. – Reimer Behrends – 2014-06-06T16:03:19.363
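
For context, a vectorized Xorshift along the lines described might look like the SSE2 sketch below: two independent xorshift128+ streams, one per 64-bit lane. This is an illustration, not Reimer's actual code; each lane must be seeded with a distinct nonzero value.

#include <emmintrin.h>

/* Two independent xorshift128+ streams, one per 64-bit lane. */
static __m128i s0, s1;   /* seed lanes with distinct nonzero values */

static inline __m128i xorshift128plus_x2(void) {
    __m128i x = s0;
    __m128i y = s1;
    s0 = y;
    x = _mm_xor_si128(x, _mm_slli_epi64(x, 23));
    s1 = _mm_xor_si128(_mm_xor_si128(x, y),
                       _mm_xor_si128(_mm_srli_epi64(x, 17),
                                     _mm_srli_epi64(y, 26)));
    return _mm_add_epi64(s1, y);   /* two 64-bit random numbers */
}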

Is this helpful? http://bpaste.net/show/BD3JwvM5UMhj2UmNKzmQ/

– Anush – 2014-06-06T19:50:20.493


Numpy (score = 1.8)

In response to my previous code not being fast enough (I admit it was very slow), I used numpy, which is extremely fast at working with large arrays of numbers because its operations run in low-level compiled code. This is almost a direct translation of your code, but it is already noticeably faster and has the bonus of being very readable.

import math
from numpy import *

pi = math.pi
half_pi = pi / 2

# 10000 batches of 10000 samples each = 10^8 samples, matching the C code
for i in range(10000):
    u1 = random.rand(10000)
    u2 = random.rand(10000)

    w1 = pi * (u1 - 0.5)   # 0.5, not 1/2, which is 0 under Python 2
    w2 = -log(u2)

    a = half_pi - w1
    x = tan(w1)*a + log(w2*cos(w1)/a)

print("done")

I am running on an i7 4770 with 8 GB of RAM. Your code (including -O3 in gcc): 21.8 sec; my code: 11.8 sec.

qwr

Posted 2014-05-17T19:06:28.973

Reputation: 8 929

Thanks for this, but your code takes 23 seconds compared to 20 seconds for mine on my machine. What timings are you getting, and are you compiling mine with -O3? – None – 2014-06-01T06:56:07.967

@Lembik I am running on i7 4770. Your code (including -O3): 21.8 sec, my code: 11.8 sec. I find it unlikely that gcc could beat numpy, because numpy operates efficiently on arrays and uses BLAS. Maybe try it again? I'm not sure what could have caused this difference. – qwr – 2014-06-01T21:46:12.437