Calculate the correlation coefficient

9

Given a series of numbers for events X and Y, calculate Pearson's correlation coefficient. The probability of each event is equal, so expected values can be calculated by simply summing each series and dividing by the number of trials.
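For reference, the coefficient can be written in terms of expected values as follows. The notation E[·] and the symbols x_i, n are introduced here for illustration and do not come from the original post:

```latex
r = \frac{E[XY] - E[X]\,E[Y]}
         {\sqrt{\left(E[X^2] - E[X]^2\right)\left(E[Y^2] - E[Y]^2\right)}},
\qquad
E[X] = \frac{1}{n}\sum_{i=1}^{n} x_i
```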

Input

1   6.86
2   5.92
3   6.08
4   8.34
5   8.7
6   8.16
7   8.22
8   7.68
9   12.04
10  8.6
11  10.96

Output

0.769

Shortest code wins. Input can be by stdin or arg. Output will be by stdout.
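As an ungolfed reference for the task above, here is a minimal sketch in Python. The helper names mean and pearson are assumptions made for illustration; only sum and len are used, so it stays within the spirit of the no-builtins rule.

```python
from math import sqrt

def mean(values):
    """Expected value under equal probabilities: sum divided by count."""
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson's r as (E[XY] - E[X]E[Y]) / sqrt(Var[X] * Var[Y])."""
    mx, my = mean(xs), mean(ys)
    cov = mean([x * y for x, y in zip(xs, ys)]) - mx * my
    var_x = mean([x * x for x in xs]) - mx * mx
    var_y = mean([y * y for y in ys]) - my * my
    return cov / sqrt(var_x * var_y)

# Sample data from the question
xs = list(range(1, 12))
ys = [6.86, 5.92, 6.08, 8.34, 8.7, 8.16, 8.22, 7.68, 12.04, 8.6, 10.96]
print(round(pearson(xs, ys), 3))  # 0.769
```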

Edit: Built-in functions (i.e. ones that calculate expected value, variance, deviation, etc.) should not be used, to allow more diversity in solutions. However, feel free to demonstrate a language that is well suited to the task using built-ins (for exhibition).

Based on David's idea for input for Mathematica (86 char using builtin mean)

m=Mean;x=d[[All,1]];y=d[[All,2]];(m@(x*y)-m@x*m@y)/Sqrt[(m@(x^2)-m@x^2)(m@(y^2)-m@y^2)]

m = Mean;
x = d[[All,1]];
y = d[[All,2]];
(m@(x*y) - m@x*m@y)/((m@(x^2) - m@x^2)(m@(y^2) - m@y^2))^.5

Skirting by using our own mean (101 char)

m=Total[#]/Length[#]&;x=d[[All,1]];y=d[[All,2]];(m@(x*y)-m@x*m@y)/((m@(x^2)-m@x^2)(m@(y^2)-m@y^2))^.5

m = Total[#]/Length[#]&;
x = d[[All,1]];
y = d[[All,2]];
(m@(x*y)-m@x*m@y)/((m@(x^2)-m@x^2)(m@(y^2)-m@y^2))^.5

miles

Posted 2012-11-28T22:32:28.043

Reputation: 15 654

Very nice streamlining of the Mathematica code, using your own mean! – DavidC – 2012-11-29T14:18:13.863

The MMa code can be shortened. See my comment under David's answer. Also, in your code you may define m=Total@#/Length@#& – Dr. belisarius – 2012-12-12T12:12:22.740

Answers

3

PHP 144 bytes

<?
for(;fscanf(STDIN,'%f%f',$$n,${-$n});$f+=${-$n++})$e+=$$n;
for(;$$i;$z+=$$i*$a=${-$i++}-=$f/$n,$y+=$a*$a)$x+=$$i*$$i-=$e/$n;
echo$z/sqrt($x*$y);

Takes the input from STDIN, in the format provided in the original post. Result:

0.76909044055492

Using the vector dot product:

$$r = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

where $a$ and $b$ are the input vectors adjusted downwards by $\bar{x}$ and $\bar{y}$ respectively.
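The same dot-product idea, ungolfed, might look like the following Python sketch. The function names centered, dot and pearson_dot are hypothetical and chosen only for this illustration; the structure (centre both vectors, then divide their dot product by the root of the product of their squared lengths) follows the PHP code's final `$z/sqrt($x*$y)`.

```python
from math import sqrt

def centered(values):
    """Shift every value down by the mean of the list."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    """Plain vector dot product."""
    return sum(p * q for p, q in zip(u, v))

def pearson_dot(xs, ys):
    a, b = centered(xs), centered(ys)
    # dot(a, a) and dot(b, b) are the squared lengths of a and b
    return dot(a, b) / sqrt(dot(a, a) * dot(b, b))

xs = list(range(1, 12))
ys = [6.86, 5.92, 6.08, 8.34, 8.7, 8.16, 8.22, 7.68, 12.04, 8.6, 10.96]
print(round(pearson_dot(xs, ys), 5))  # 0.76909
```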

Perl 112 bytes

/ /,$e+=$`,$f+=$',@v=($',@v)for@u=<>;
$x+=($_-=$e/$.)*$_,$y+=($;=$f/$.-pop@v)*$;,$z-=$_*$;for@u;
print$z/sqrt$x*$y

0.76909044055492

Same algorithm, different language. In both cases, newlines have been added for readability and are not required. The only notable difference in length is in the first line: the parsing of input.
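Since input parsing accounts for most of the difference, here is a rough Python sketch of reading the question's two-column format. The function name read_pairs is an assumption for illustration; it accepts any iterable of lines, so sys.stdin could be passed in place of the in-memory sample used here.

```python
import io

def read_pairs(stream):
    """Split each whitespace-separated 'x y' line into two float columns."""
    xs, ys = [], []
    for line in stream:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        xs.append(float(fields[0]))
        ys.append(float(fields[1]))
    return xs, ys

# First three rows of the question's input, held in memory for the demo
sample = io.StringIO("1   6.86\n2   5.92\n3   6.08\n")
xs, ys = read_pairs(sample)
print(xs, ys)  # [1.0, 2.0, 3.0] [6.86, 5.92, 6.08]
```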

primo

Posted 2012-11-28T22:32:28.043

Reputation: 30 891

5

Mathematica 34 bytes

Here are a few ways to obtain the Pearson product moment correlation. They all produce the same result. From Dr. belisarius: 34 bytes

Dot@@Normalize/@(#-Mean@#&)/@{x,y}

Built-in Correlation function I: 15 chars

This assumes that x and y are lists corresponding to each variable.

x~Correlation~y

0.76909


Built-in Correlation function II: 31 chars

This assumes d is a list of ordered pairs.

d[[;;,1]]~Correlation~d[[;;,2]]

0.76909

The use of ;; for All thanks to A Simmons.


Relying on the Standard Deviation function: 115 chars (previously 118)

The correlation can be determined by:

s=StandardDeviation;
m=Mean;
n=Length@d;
x=d[[;;,1]];
y=d[[;;,2]];
Sum[((x[[i]]-m@x)/s@x)((y[[i]]-m@y)/s@y),{i,n}]/(n-1)

0.76909
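For comparison, the same standardised-score formula can be sketched in Python with the standard library's statistics module (so this is for exhibition only). The name pearson_z is hypothetical. The division by n - 1 is correct here only because statistics.stdev is the sample standard deviation, which also uses an n - 1 denominator.

```python
from statistics import mean, stdev

def pearson_z(xs, ys):
    """Sum of products of standardised scores, divided by n - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (n - 1)

xs = list(range(1, 12))
ys = [6.86, 5.92, 6.08, 8.34, 8.7, 8.16, 8.22, 7.68, 12.04, 8.6, 10.96]
print(round(pearson_z(xs, ys), 5))  # 0.76909
```

Keeping the n - 1 divisor while dropping the standard deviations (or the reverse) throws the result off by a constant factor.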


Hand-rolled Correlation: 119 chars

Assuming x and y are lists...

s=Sum;n=Length@x;m@p_:=Tr@p/n;
(s[(x[[i]]-m@x)(y[[i]]-m@y),{i,n}]/Sqrt@(s[(x[[i]]-m@x)^2,{i,n}] s[(y[[i]] - m@y)^2,{i,n}]))

0.76909

DavidC

Posted 2012-11-28T22:32:28.043

Reputation: 24 524

;; shaves a byte off each instance of All in your builtin answers – A Simmons – 2016-02-16T16:00:15.823

I get 0.076909 for the last code snippet. Also why do you have s = StandardDeviation; when s is never applied? – miles – 2012-11-29T08:03:58.653

Considering assumptions in answer for Q-language, in Mathematica it is just x~Correlation~y – Vitaliy Kaurov – 2012-11-29T08:07:25.263

@VitaliyKaurov, Yes, good point, now taken into account. – DavidC – 2012-11-29T15:18:09.490

@milest. Of course! StandardDeviation was "legacy" from the earlier solutions. Think I'll reserve s for Sum. – DavidC – 2012-11-29T15:20:46.420

@milest The error in the final output was also due to /(n-1) being mistakenly carried over from the earlier solution. Now corrected. – DavidC – 2012-11-29T15:34:43.610

In the final one you should be able to save 4 chars by taking the root of the product rather than the product of the roots. Might also be worth defining your own function to calculate stddev? – Peter Taylor – 2012-11-29T17:58:34.630

Good, about the square root. BTW, the final approach does use my own stddev. (In that case, s is an abbreviation of Sum.) – DavidC – 2012-11-29T19:01:42.377

I meant that defining a function would probably be shorter than repeating s[(#[[i]] - m@#)^2, {i, n}]. – Peter Taylor – 2012-11-29T23:02:04.137

I follow the logic of your suggestion, but I don't see any economy of code aside from {i,n} which would be offset by the additional code for the definition itself. I don't want to define the function in such a way that incrementing always stops at a specific n that was not passed as a parameter. – DavidC – 2012-11-30T00:20:34.650

The following has 82 chars: n=Length@x;z@u_:=Sum[u,{i,n}];u@p_:=p[[i]]-Tr@p/n;z[u@x u@y]/(z[u@x^2]z[u@y^2])^.5 – Dr. belisarius – 2012-12-11T19:15:02.173

And following is 40 r@x_:=x-Mean@x;r@x.r@y/Norm@r@x/Norm@r@y – Dr. belisarius – 2012-12-11T19:32:57.620

Neat. I've never used Norm. – DavidC – 2012-12-11T22:05:40.927

This is 37 r=#-Mean@#&;r@x.r@y/Norm@r@x/Norm@r@y – Dr. belisarius – 2012-12-12T12:19:04.330

This is 35 r=#-Mean@#&;n=r@#/Norm@r@#&;n@x.n@y – Dr. belisarius – 2012-12-12T12:26:38.000

This is 34 Dot@@Normalize/@(#-Mean@#&)/@{x,y} – Dr. belisarius – 2012-12-12T12:38:19.227

Your responses are too good to leave in the margins. Shall I convert my post into a community wiki so you can paste them above, or would you like to post them as your own answer? – DavidC – 2012-12-12T14:10:41.450

Miles, s is used in the final line of code. – DavidC – 2016-07-19T09:15:10.250

2

Q

Assuming builtins are allowed and x,y data are separate vectors (7 chars):

x cor y

If data are stored as ordered pairs, as indicated by David Carraher, we get (for 12 characters):

{(cor).(+)x}

skeevey

Posted 2012-11-28T22:32:28.043

Reputation: 4 139

Don't correlation data normally consist of ordered pairs? – DavidC – 2012-11-29T02:51:56.840

I added an alternative for that case – skeevey – 2012-11-29T03:41:43.073

2

MATLAB/Octave

For the purpose of demonstrating built-ins only:

octave:1> corr(X,Y)
ans =  0.76909
octave:2> 

Paul R

Posted 2012-11-28T22:32:28.043

Reputation: 2 893

2

J, 27 bytes (previously 30)

([:+/*%*&(+/)&.:*:)&(-+/%#)

This time as a function taking two arguments. Uses the vector formula for calculating it.

Usage

   f =: ([:+/*%*&(+/)&.:*:)&(-+/%#)
   (1 2 3 4 5 6 7 8 9 10 11) f (6.86 5.92 6.08 8.34 8.7 8.16 8.22 7.68 12.04 8.6 10.96)
0.76909

Explanation

Takes two lists a and b as separate arguments.

([:+/*%*&(+/)&.:*:)&(-+/%#)  Input: a on LHS, b on RHS
                   &(     )  For a and b
                         #     Get the count
                      +/       Reduce using addition to get the sum
                        %      Divide the sum by the count to get the average
                     -         Subtract the average from each initial value
                             Now a and b have both been shifted by their average
                             For both a and b
                *:             Square each value
         (+/)&.:               Reduce the values using addition to get the sum
                               Apply the inverse of squaring to take the square root
                               of the sum to get the norm
       *&                    Multiply norm(a) by norm(b)
     *                       Multiply a and b elementwise
      %                      Divide a*b by norm(a)*norm(b) elementwise
 [:+/                        Reduce using addition to get the sum, which is the
                             correlation coefficient and return it
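For readers who do not know J, the steps in the explanation map roughly onto the Python sketch below. The helper names shift and norm are assumptions for illustration, and the comments quote the J fragments they are meant to mirror.

```python
from math import sqrt

def shift(v):
    """(-+/%#): subtract the average (sum divided by count) from each value."""
    avg = sum(v) / len(v)
    return [t - avg for t in v]

def norm(v):
    """(+/)&.:*: : square, sum, then undo the squaring with a square root."""
    return sqrt(sum(t * t for t in v))

def f(a, b):
    a, b = shift(a), shift(b)
    scale = norm(a) * norm(b)  # *& : norm(a) times norm(b)
    # * then % then [:+/ : elementwise product, divide by scale, sum
    return sum(p * q / scale for p, q in zip(a, b))

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
b = [6.86, 5.92, 6.08, 8.34, 8.7, 8.16, 8.22, 7.68, 12.04, 8.6, 10.96]
print(round(f(a, b), 5))  # 0.76909
```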

miles

Posted 2012-11-28T22:32:28.043

Reputation: 15 654

You can factor out the x and y in the final line by stitching them together with ,. to give you ((m@:*/@|:-*/@m)%%:@*/@(m@:*:-*:@m))x,.y – Gareth – 2012-11-29T23:34:05.537

I have to admit, the code in itself looks gorgeous... speaking as someone who loves his non-alphanumeric code... ;) – WallyWest – 2016-10-06T01:52:58.690

There is a shorter 24-byte version, +/ .*&(%+/&.:*:)&(-+/%#), recognized by Oleg on the J forums. – miles – 2017-07-11T03:48:52.743

2

APL 57

Using the dot product approach:

a←1 2 3 4 5 6 7 8 9 10 11

b←6.86 5.92 6.08 8.34 8.7 8.16 8.22 7.68 12.04 8.6 10.96

(a+.×b)÷((+/(a←a-(+/a)÷⍴a)*2)*.5)×(+/(b←b-(+/b)÷⍴b)*2)*.5

0.7690904406         

Graham

Posted 2012-11-28T22:32:28.043

Reputation: 3 184

1

Python 3, 140 bytes

E=lambda x:sum(x)/len(x)
S=lambda x:(sum((E(x)-X)**2for X in x)/len(x))**.5
lambda x,y:E([(X-E(x))*(Y-E(y))for X,Y in zip(x,y)])/S(x)/S(y)

Two helper functions (E and S, for expected value and standard deviation, respectively) are defined. Input is expected as two iterables (lists, tuples, etc.). Try it online.
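A possible way to call it, with whitespace restored and the anonymous lambda bound to a name. The name f is an assumption for this example and is not part of the answer.

```python
# Same helpers as in the answer, spaced out for readability
E = lambda x: sum(x) / len(x)
S = lambda x: (sum((E(x) - X) ** 2 for X in x) / len(x)) ** .5

# The answer's anonymous function, bound to a hypothetical name f
f = lambda x, y: E([(X - E(x)) * (Y - E(y)) for X, Y in zip(x, y)]) / S(x) / S(y)

x = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)  # a tuple works as well as a list
y = [6.86, 5.92, 6.08, 8.34, 8.7, 8.16, 8.22, 7.68, 12.04, 8.6, 10.96]
print(round(f(x, y), 5))  # 0.76909
```

Both S and the covariance term divide by len(x), so the population denominators cancel and the result matches the other answers.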

Mego

Posted 2012-11-28T22:32:28.043

Reputation: 32 998

1

Oracle SQL 11.2, 152 bytes (for exhibition)

SELECT CORR(a,b)FROM(SELECT REGEXP_SUBSTR(:1,'[^ ]+',1,2*LEVEL-1)a,REGEXP_SUBSTR(:1,'[^ ]+',1,2*LEVEL)b FROM DUAL CONNECT BY INSTR(:1,' ',2,LEVEL-1)>0);

Un-golfed

SELECT CORR(a,b)
FROM
(
  SELECT REGEXP_SUBSTR(:1, '[^ ]+', 1, 2*LEVEL-1)a, REGEXP_SUBSTR(:1, '[^ ]+', 1, 2*LEVEL)b
  FROM DUAL
  CONNECT BY INSTR(:1, ' ', 2, LEVEL - 1) > 0
)

Input string should use the same decimal separator as the database.

Jeto

Posted 2012-11-28T22:32:28.043

Reputation: 1 601

1

Python 3 with SciPy, 52 bytes (for exhibition)

from scipy.stats import*
lambda x,y:pearsonr(x,y)[0]

An anonymous function that takes input of the two data sets as lists x and y, and returns the correlation coefficient.

How it works

There's not a lot going on here; SciPy has a builtin that returns both the coefficient and the p-value for testing non-correlation, so the function simply passes the data sets to this and returns the first element of the (coefficient, p-value) tuple returned by the builtin.

Try it on Ideone

TheBikingViking

Posted 2012-11-28T22:32:28.043

Reputation: 3 674