Biased Random results in Excel

3

1

Over on WorldBuilding we like to play with science-fictiony themes, including "what if there were more that two sexes" (examples here, here, here). All good clean fun and, I thought, a chance for me to polish my spreadsheet skills (I use Excel 2003).

I built a spreadsheet for a Monte Carlo model of three sexes using X and Y chromosomes: Male (YY), Female (XX) and Hermaphrodite (XY), using ordinary Mendelian inheritance. Thus M-F will always produce H, M-H will produce 50% M and 50% H, F-H similar, and H-H will produce 25% M, 25% F and 50% H.

I made a population of 200, breeding for 256 generations, and the results look reasonable, except for one thing. I had expected the population to stabilise at 50% H and 25% each of M and F, but I consistently get about 54% H and only 23% M and F. This does not seem to be an Excel bug, I get similar results with LibreOffice, nor is it affected by the sex of the "buffer" parents (see below) or the initial distibution.

The spreadsheet is available here (my personal website, usual disclaimers of course) and I have documented it below. It looks as if my random functions IF(RAND()>0.5 and CHOOSE(INT(4*RAND()+1) are not returning 50% and 25% as intended but I don't know how to test that directly. Am I doing something wrong?

Code Details

I encode Male as 1, Female as 2 and Hermaphrodite as 4 so that I can simply add the two parents. Most of the cells (B6:IV205) calculate the next generation and contain (this is cell B6):

=CHOOSE (A5+A7,                        // sum of parents  
  -99,                                 // 1 can't happen  
  A6,                                  // 2 = MM = don't breed, copy old  
  4,                                   // 3 = MF -> H  
  A6,                                  // 4 = FF = don't breed, copy old  
  IF(RAND()>0.5,1,4),                  // 5 = MH -> M or H  
  IF(RAND()>0.5,2,4),                  // 6 = FH -> F or H  
  -99,                                 // 7 can't happen  
  CHOOSE(INT(4*RAND()+1),1,2,4,4)      // 8 = HH -> M or F or H or H  
 )  

The first row of Sheet1 shows a generation count, the second, third and fourth the number of each "sex" in that generation (divided by 2 to give a percentage of the population of 200). Rows 5 and 206 provide a (buffer) parent for rows 6 and 205. Column A builds an initial population based on the value in A2.

I use conditional formatting to colour Males as blue, Females as pink and Hermaphrodites as green; with 25% zoom and a column width of 3 I can see the whole range on my monitor. On Sheet2 I plot the populations using rows 2-3-4 of Sheet1.

Details of all cells on Sheet1:

A1 "gen1"  
B1:IV1 ="g"&COLUMN()  // column title  
A2 50%  // initial hermaphrodite percentage  
B2:IV2 =COUNTIF(B6:B205,"=4")/2  // hermaphrodite percentage  
A3 ="M:"&COUNTIF(A6:A205,"=1")/2  
B3:IV3 =COUNTIF(B6:B205,"=1")/2  // male percentage  
A4 ="F:"&COUNTIF(A6:A205,"=2")/2  
B4:IV4 =COUNTIF(B6:B205,"=2")/2  // female percentage  
A5 4  // lower buffer parent  
B5:IV5 =A5  // echo buffer  
A6:A205 =IF(RAND()>$A$2,IF(RAND()>0.5,1,2),4)  // initial population  
B6:IV205 =CHOOSE(A5+A7,-99,A6,4,A6,IF(RAND()>0.5,1,4),IF(RAND()>0.5,2,4),-99,CHOOSE(INT(4*RAND()+1),1,2,4,4))  
A206 4  // upper buffer parent  
B206:IV206 =A206  // echo buffer  

Of course I'm open to advice on all aspects of my opus.

NL_Derek

Posted 2016-02-29T22:43:25.850

Reputation: 143

Answers

0

I suspect this is not a bug with Excel, but it's correct (at least, as correct as it can be based on random chance.)

The reason for this is not due to the randomness, but instead your probability within this calculation:

=CHOOSE (A5+A7,                        // sum of parents  
  -99,                                 // 1 can't happen  
  A6,                                  // 2 = MM = don't breed, copy old  
  4,                                   // 3 = MF -> H  
  A6,                                  // 4 = FF = don't breed, copy old  
  IF(RAND()>0.5,1,4),                  // 5 = MH -> M or H  
  IF(RAND()>0.5,2,4),                  // 6 = FH -> F or H  
  -99,                                 // 7 can't happen  
  CHOOSE(INT(4*RAND()+1),1,2,4,4)      // 8 = HH -> M or F or H or H  
 )  

Specifically, your very last situation, HH -> M or F or H or H. This means you have a 50% chance to get H compared to 25% each for M or F, should we be using the result of a HH. As such, a skew towards H over M or F should be expected.

I suspect if you change this part to

CHOOSE(INT(3*RAND()+1),1,2,4)

you would see the probability you were expecting, but not necessarily what is correct (I haven't actually done the math.)

Jonno

Posted 2016-02-29T22:43:25.850

Reputation: 18 756

You are right; I now get about 25-25-50%. Thanks! Now I have to work out why (watch this space, I'll report back). – NL_Derek – 2016-03-01T16:12:29.490

Your last sentence is right on. My original expectation of 25-25-50% in the population seems to have been incorrect. My formulation of CHOOSE(INT(4*RAND()+1),1,2,4,4) gives me 25-25-50% in the generation but not in the population; your formulation of CHOOSE(INT(3*RAND()+1),1,2,4) gives me 33-33-33% in the generation and (contrary to my expectation) 25-25-50% in the population. Now I have to wrap my brain around this population-generation distinction and find out what's happening. I have marked your answer as correct: it is not an Excel bug but an NL_Derek bug ;-) – NL_Derek – 2016-03-01T21:35:13.363