Rank Correlation Coefficient

13

2

The usual correlation coefficient (in 2d) measures how well a set of points can be described by a line, and if yes, its sign tells us whether we have a positive or negative correlation. But this assumes that coordinates of the points can actually interpreted quantitatively for instance as measurements.

If you cannot do that but you can still order the coordinates, there is the rank correlation coefficient: It measures how well the points can be described by a monotonic function.

Challenge

Given a list of 2d points, determine their rank correlation coefficient.

Details

  • You can assume the input to be positive integers (but you don't have to), or any other "sortable" values.
  • The points can be taken as a list of points, or two lists for the x- and y-coordinates or a matrix or 2d array etc.
  • The output must be a floating point or rational type, as it should represent a real number between 0 and 1.

Definitions

Rank: Given a list of numbers X=[x(1),...,x(n)] we can assign a positive number rx(i) called rank to each entry x(i). We do so by sorting the list and assigning the index of x(i) in the sorted list rx(i). If two or more x(i) have the same value, then we just use the arithmetic mean of all the corresponding indices as rank. Example:

          List: [21, 10, 10, 25, 3]
Indices sorted: [4, 2, 3, 5, 1]

The number 10 appears twice here. In the sorted list it would occupy the indices 2 and 3. The arithmetic mean of those is 2.5 so the ranks are

         Ranks: [4, 2.5, 2.5, 5, 1]

Rank Correlation Coefficient: Let [(x(1),y(1)),(x(2),y(2)),...,(x(n),y(n))] be the given points where each x(i) and y(i) is a real number (wlog. you can assume it is an integer) For each i=1,...,n we compute the rank rx(i) and ry(i) of x(i) and y(i) respectively.

Let d(i) = rx(i)-ry(i) be the rank difference and let S be the sum S = d(1)^2 + d(2)^2 + ... + d(n)^2. Then the rank correlation coefficient rho is given by

rho = 1 - 6 * S / (n * (n^2-1))

Example

x   y   rx              ry   d      d^2
21  15  4               5   -1      1
10  6   2&3 -> 2.5      2    0.5    0.25
10  7   2&3 -> 2.5      3   -0.5    0.25
25  11  5               4    1      1
3   5   1               1    0      0

    rho = 1 - 6 * (1+0.25+0.25+1)/(5*(5^2-1)) = 0.875   

flawr

Posted 2017-08-09T19:58:31.800

Reputation: 40 560

From wikipedia : "Only if all n ranks are distinct integers, it can be computed using the popular formula"

– rahnema1 – 2017-08-10T05:09:27.090

What do you want to say with that? – flawr – 2017-08-10T11:32:02.980

I say that the formula that you provided is for the special cases where the ranks are integers according to wikipedia. However you used the formula for the ranks such as 2.5. – rahnema1 – 2017-08-10T11:34:18.387

Well that is if you're using integers in the first place. And even if you're doing so, you're still gonna get a good approximation. Many authors even use the formula of this challenge as a definition. Furthermore keep in mind that a ranking is unstable and does not necessarily have a such an impactful meaning as an usual correlation coefficient. But all this is irrelevant for this challenge. – flawr – 2017-08-10T12:12:09.900

Answers

5

MATL, 33 bytes

,it7#utb,&S]2XQw)]-Us6*1GntUq*/_Q

Try it online!

Explanation

,           % Do...twice
  it        %   Input a numeric vector. Duplicate
  7#u       %   Replace each element by a unique integer label (1, 2, ...)
  t         %   Duplicate
  b         %   Bubble up: moves original numeric vector to top
  ,         %   Do...twice
    &S      %     Sort and push the indices of the sorting
  ]         %   End
            %   The above do...twice loop gives the sorted indices (as
            %   explained in the challenge text) for the current input
  2XQ       %   Compute average for entries with the same integer label
  w         %   Swap: move vector of integer labels to top
  )         %   Index. This gives the rank vector for the current input
]           % End
-           % Subtract the two results. Gives d
Us          % Square each entry, sum of vector. S
6*          % Times 6. Gives 6*S
1G          % Push first input vector again
n           % Number of entries. Gives n
t           % Duplicate 
Uq          % Square, minus 1. Gives n^2-1
*           % Times. Gives n*(n^2-1)
/           % Divide. Gives 6*S/(n*(n^2-1))
_Q          % Negate, plus 1. Gives 1-6*S/(n*(n^2-1))

Luis Mendo

Posted 2017-08-09T19:58:31.800

Reputation: 87 464

4I've never seen something with such resemblance to keyboard mashing that actually does something before. +1 – HyperNeutrino – 2017-08-09T20:06:49.893

5

R, 64 60 bytes

function(x,y)1-6*sum((rank(x)-rank(y))^2)/((n=sum(x|1))^3-n)

Try it online!

rank in R is the builtin that computes the desired rank; the rest is just the math to do the rest of the job.

Thanks to CriminallyVulgar for saving 4 bytes

As mentioned in the comments, the stated definition of rank correlation coefficient doesn't correspond precisely to the Spearman correlation coefficient, else a valid answer would be 26 bytes:

function(x,y)cor(x,y,,"s")

Giuseppe

Posted 2017-08-09T19:58:31.800

Reputation: 21 077

2Wee 4 byte tweak: (n^3-n) for the last bracket – CriminallyVulgar – 2017-08-10T08:15:29.103

@CriminallyVulgar thanks! my wedding was not too long after your comment so I didn't see it... – Giuseppe – 2018-10-03T14:34:12.327

3

Python 3, 141 bytes

lambda X,Y,Q=lambda U,S=sorted:[S(U).index(y)+S(U).count(y)/2+.5for y in U]:1-6*sum((i[1]-i[0])**2for i in zip(Q(X),Q(Y)))/(len(X)**3-len(X))

This defines an anonymous function which takes input as two lists corresponding to the x and y values. Output is returned as a floating-point value.

Try it online!

R. Kap

Posted 2017-08-09T19:58:31.800

Reputation: 4 730

2

Mathematica, 89 bytes

(F[x_]:=Min@N@Mean@Position[Sort@x,#]&;1-6Tr[(F@#/@#-F@#2/@#2)^2]/((y=Length@#)(y^2-1)))&

Try it online! (in order to work on mathics, "Tr" is replaced with "Total")

J42161217

Posted 2017-08-09T19:58:31.800

Reputation: 15 931

0

Wolfram Language (Mathematica), 18 bytes

N[SpearmanRho@@#]&

Try it online!

nixpower

Posted 2017-08-09T19:58:31.800

Reputation: 71

Unfortunately it looks like the definition of RCC in the question doesn't match precisely to the Spearman Rho -- it works only in the case of distinct integer inputs. See for instance my R answer or the comment linked therein.

– Giuseppe – 2018-10-10T00:26:29.933

The author of the question seems to suggest that this is fine here. The question gave the Spearman Rho formula as a definition so I would consider this to be valid despite its mathematical inaccuracy.

– nixpower – 2018-10-10T00:41:34.543