Evaluate CDF of Student-t distribution

7

1

The goal of this codegolf is evaluating the cumulative distribution function of the student-t probability distribution. This is not trivial since there is no closed form and the function is dependent of two variables v and x. The distribution function is

enter image description here

For getting the CDF you could integrate this function from -inf to x. As you can see the evaluation of this function relies much on integration as you'd also have to evaluate nontrivial values of the gamma function.

Details

You can can write a program or function that accepts a two numbers xand v as input, where x is the position (must at least accept floats from -10 upto 10) at which the CDF is evaluated and v is the number of degrees of freedom. v is a positive integer (range 1 upto 1450) but it must be possible to set it to infinity (for v=inf the student-t distribution is equal to the normal distribution). But you can specify that e.g. v = -1 should denote infinity. The output consists must be a number between 0 and 1 and must have an accuracy of at least 4 decimal places. If your language does not support runtime inputs or arguments on program call you may assume that both arguments are stored in variables (or another 'memory unit')

Restrictions

You may not use special functions like the gamma function or probability distribution functions as well as numerical integration functions or libraries (as long as you do not include their source in your code which counts for the score) etc. The idea is using only basic math functions. I tried to make a list of what I consider as 'basic' but there might be others I forgot - if so please comment.

Explicitly allowed are

+ - / * 
% (modulo)
! (factorial for integers only)
&& || xor != e.t.c. (logical operators)
<< >> ^ e.t.c. (bitshift and bitwise operators)
** pow (powerfunction for floats like a^b including sqrt or and other roots)
sin cos tan arcsin arccos arctan arctan2
ln lb log(a,b)
rand() (generating pseudorandom numbers of UNIFORM distribution)

Scoring

Lowest number of bytes wins.

Entry

Please explain how your code works. Please show the output of your program for at least 5 different arguments. Among those should be the following three. For your own values please also provide 'exact' values (like from wolframalpha or what ever you prefer).

x = 0      v= (any)  (should return 0.5000)
x = 3.078  v= 1      (should return 0.9000
x = 2.5    v= 23     (should return 0.9900)

If possible, it would be great if you could provide some fancy plots of the values calculated with your program (the code for plotting does not count for the score).

(from Wikipedia): plots

flawr

Posted 2014-11-09T14:53:52.613

Reputation: 40 560

1Many languages natively support powerful numeric integration algorithms, which aren't "special functions like the gamma function or probability distribution functions etc.". These algorithms are implemented using only the basic mathematical operators you've listed. Is the use of such library functions allowed? – COTO – 2014-11-09T15:11:07.397

Thank you for pointing this out. No, I'd like the participants to program it 'from scratch', so they should implement the integration method in the program itself (it does not matter whether it is 'stolen' from a library, as long as it is included in the code and counts for the score.) – flawr – 2014-11-09T15:27:11.653

1So numerical integration is allowed provided it is included in the program? That shouldn't be too dificult, a simple riemann sum or trapezium rule can be implemented with a for loop, provided it starts at a low enough value. On the other hand, a polynomial approximation should be able to solve this to the required accuracy. Is that allowed? (What if I call it an "analytically derived" Taylor series, or even an "empirically derived" one?) – Level River St – 2014-11-09T17:31:27.960

Yes this is allowed, the goal is having the required accuracy on the whole given domain. I am looking forward to seeing different approaches! If you want you could even do the whole think with one big lookup table, but it has to be included in the bytecount=) – flawr – 2014-11-09T20:53:52.217

May we generate random numbers (like floats between 0 and 1)? – xnor – 2014-11-10T09:32:17.197

Good question, I didn't think about that but I think I'm gonna allow that, I guess you are after MC-methods, right? (Or is there any other 'shortcut' that makes it way easier to solve the challenge with random numbers than without?) – flawr – 2014-11-10T17:03:44.103

Does it have to be 4 significant digits (or is it 4 decimal places)? For example according to http://www.wolframalpha.com/input/?i=t+distribution+probabilities for x=2.5788 v=1500 I get 0.9950, the same answer to 4 significant digits as for v=inf.(table at http://mathworld.wolfram.com/Studentst-Distribution.html ). But for x=-2.5788 v=1450 I get 0.005048, which only matches the result for v=inf in its last digit. At x=-10 v=1450 I get 4.096E-23. The asymmetric number of decimal places required by current wording may require a datum at x=-10 rather than the obvious x=0 or x=-inf

– Level River St – 2014-11-10T18:33:24.470

This question has turned out harder than I thought. First I tried integrating (1+x^2/v)^(-v/2-.5) numerically in the ranges -9999...x and -9999...9999 and dividing one by the other (taking 9999 as an approximation for infinity and knowing that the gamma part would cancel out) but I ran into problems with machine limits. I have a way of calculating the gamma function in half-integer steps inductively. There's a complete formula at http://www.vosesoftware.com/ModelRiskHelp/index.htm not requiring integration, but it's pretty complicated.

– Level River St – 2014-11-10T18:47:45.950

@steveverrill Oh I didn't consider that, so would you suggest altering the question to 4 decimal places (E.g. If the exact value is 0.0001624... the acceptable answer would be 0.0002)? – flawr – 2014-11-10T18:51:33.613

@flawr I've gone ahead and edited it, and fixed a spelling mistake in the title too. Roll it back if you disagree, but I think it's better that way. I dont think centering the calculations on x=-10 was your intention, and v=1450 is more than enough to make this challenging. – Level River St – 2014-11-10T19:06:08.240

@steveverrill The changes are ok, I just didn't understand what you mean by the second part? Why do you mean by centering the calculations on x=-10? Would you suggest v=1450 as limit? – flawr – 2014-11-10T19:12:52.227

@flawr per my previous comment, around v=1450 is where the curve becomes indistinguishable from v=inf based on a precision of 4 decimal places, so now there shouldn't be a need for a limit on v. Gamma(1450) is a big number but just about manageable. It would be easier if it was lower :-) The regarding the x=-10, the point of my edit was to avoid the need for excessive precision at the low end of the x domain, so that part is solved too. I was just reiterating why I think the change is a good one. Because I'm embarrased about changing the rules on a question I'm planning to answer. – Level River St – 2014-11-10T19:30:25.857

Ah ok now I understand, well there are really some problems I didn't consider (e.g. the limits of the computers, I only thought about the algorithms in a mathematical way.) So no need to feel embarrassed, I am really glad you helped eliminating the 'impossible' aspects=) – flawr – 2014-11-10T19:36:13.420

Can I use a built-in value of π? – TwiNight – 2014-11-11T04:35:12.693

@TwiNight Yes you can, I'll add that. – flawr – 2014-11-11T13:56:14.870

My MC convergence rate essentially becomes 0. Can't get 4 digits ahhhhhh! I'm guessing this is a problem with Box Muller. – Vlo – 2014-11-18T22:18:27.607

Well it seems this thread is not getting flooded with answers, so feel free to post your approach anyway! – flawr – 2014-11-19T13:49:00.123

You can use Box Muller (or many of the more algorithmically complex but better methods) to generate normal i.i.d. variables using uniform random. Then you can use (http://stats.stackexchange.com/a/70270) to generate student-t random variables (should be an exact result). Generate a lot of these student-t and rely on almost-non-existant-convergence to find the density. I'm doing this in R and its loop performance is essentially 0. I figure an implementation in a faster language with couple billion samples could possibly get the solution fairly close (assuming normal r.v. gen is good...)

– Vlo – 2014-11-19T19:42:29.833

Answers

5

BBC BASIC, 155 ASCII characters, tokenised filesize 140

Download emulator at http://www.bbcbasic.co.uk/bbcwin/bbcwin.html

  a=1s=.000003INPUTx,v
  IFv<1v=9999
  FORw=1TOv-1a=w/a
  NEXTb=SGN(x)*s*a/SQR(v)IFv MOD2b=b/PI ELSEb=b/2
  f=.5FORz=0TOABS(x)STEPs
  f+=b/(1+z*z/v)^(v/2+.5)NEXTPRINTf

The code is in two parts. The first calculates the coefficient that is independent of x. and stores it in b. The algorithm is a bit like a continued fraction, but a mathematician would probably call it a simple quotient of two products. It's based on a formula from the Wikipedia page in the question, which is similar to a factorial but has the product (v-1)(v-3)... divided by the product (v-2)(v-4)... This is possible because we are only interested in half-integral values of the gamma function (arbitrary values of the gamma function would be a lot harder.) Using this formula we never have to calculate either gamma function, so we avoid large numbers. The largest intermediate value is approximately the square root of v.

The second is a straightforward integration from 0 to x by summation. It is known that f(0) is 0.5 so we start from there and work outwards. the sign and magnitude of b are modified before the loop, in order to be able to give the final answer directly.

Pretty version, ungolfed

This contains a few extra lines to add colour and to plot the data. The range x=0 to endpoint, and f(x)=0.5 to 1 is shown. Also v is modified to 2^v to be able to show the full range of v values without exceeding the limits on colour. In order to speed the drawing up, there are two less digits in s than in the golfed version. Even so, there is only one discrepancy when compared with http://www.wolframalpha.com/input/?i=student+t-distribution+probability: 0.8130575 rounds up to 0.8131. The golfed version give the correct answer (0.8130) for this and all the other cases below.

  a=1                              :REM accumulator for gamma series
  s=.0001                          :REM step size for x integration
  INPUTx,v
  IFv<1 v=9999                     :REM if v<1, set v to a value that gives results indistinguishable from infinity
  GCOLv                            :REM pick a colour
  v=2^v                            :REM for the pretty version, modify v to 2^v
  FORw=1TOv-1
    a=w/a                          :REM calculate series as per Wikipedia
  NEXT
  b=a/SQR(v)                       :REM divide a by the necessary values...
  IFv MOD2 b=b/PI ELSE b=b/2       :REM to give the coefficient, per Wikipedia

  b*=SGN(x)*s                      :REM modify according to sign and step size
  f=.5                             :REM initialise f(x)=.5 at x=0
  FORz=0TO ABS(x) STEPs
    f+=b/(1+z*z/v)^(v/2+.5)        :REM perform integration as simple "barchart" sum
    PLOT69,z*100+200,(f-.5)*1600   :REM plot a point
  NEXT
  PRINT f

enter image description here enter image description here

Level River St

Posted 2014-11-09T14:53:52.613

Reputation: 22 049