Correlation Coefficient of Words

3

This is going to be relatively quick...

Challenge

Given a lower-case word as input, calculate Pearson's Correlation Coefficient between each letter's position within the word (nth letter within the word, x) and its position within the alphabet (nth letter of the alphabet, y).

Background

Pearson's Correlation Coefficient is worked out for data sets x and y as:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

where x and y are the respective data sets as explained above (and further below), and n is the length of the string.

Examples

  Input: forty
  Output: 0.964406804

Explanation:

  Letters: f o r t y
  Alphabet Number (y): 6 15 18 20 25
  Letter Position (x): 1 2 3 4 5
  Length of word (n): 5

Correlation coefficient is worked out from the above formula.
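
As a rough illustration of the arithmetic for forty, here is a minimal Python sketch. It assumes the raw-sums form of Pearson's formula, r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²)); the function name pearson_of_word is hypothetical and not part of the challenge.

```python
from math import sqrt

def pearson_of_word(word):
    # x: 1-based position within the word; y: 1-based position in the alphabet
    xs = list(range(1, len(word) + 1))
    ys = [ord(c) - ord('a') + 1 for c in word]
    n = len(word)

    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

print(round(pearson_of_word("forty"), 9))
```

For forty the sums are Σx = 15, Σy = 84, Σxy = 295, Σx² = 55 and Σy² = 1610, so the numerator is 215 and the product under the square root is 50 × 994 = 49700, which should agree with the output quoted above.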

  Input: bells
  Output: 0.971793199

Explanation:

  Letters: b e l l s
  Alphabet Number (y): 2 5 12 12 19
  Letter Position (x): 1 2 3 4 5
  Length of word (n): 5

Correlation coefficient is once again worked out from the above formula.

Rules

  1. Input must provide the word as a lower-case alphabetic string with no spaces (assumed input). The delivery of this string is entirely up to you (file, STDIN, function, array, carrier pigeon, etc.)
  2. Output is in the form of a numeric that will, by the formula, provide r, such that -1 <= r <= 1. Delivery of the output, once again, is entirely up to you.
  3. No silly loopholes
  4. This is not code-golf, as I'm sure many languages have native statistical functions. Instead, this will be a popularity contest, so most votes wins; make your entry count... I'll be throwing in a bounty a little later down the line for the most creative answer that catches my attention...

Any questions/comments?

WallyWest

Posted 2016-10-09T11:54:03.280

Reputation: 6 949

Question was closed 2016-10-09T15:01:53.477

Relevant Wikipedia article – Copper – 2016-10-09T13:54:55.510

5

I wish you had made this a pop con in the sandbox, so I could have said it over there. If you want to avoid built-ins in code golf challenges, just ban functions that compute the PCC. This type of popularity contest has fallen out of scope. The tag wiki explicitly says Questions like "do (this) the most creative way" should be avoided. Creativity should be the tool, not the goal.

– Dennis – 2016-10-09T14:42:11.503

1

I wasn't aware of this when I posted my previous comment, but this has already been done as a code golf challenge.

– Dennis – 2016-10-09T15:04:10.677

Answers

2

R, 63 bytes

s=strsplit(scan(,""),"")[[1]];cor(match(s,letters),1:length(s))

Ungolfed :

#Takes the input and splits it letter by letter
s=strsplit(
           scan(,""), 
           "")[[1]];

#Finds the position of each letter in the lowercase alphabet, and correlates these with
#the letters' positions in the word
cor(
    match(s,letters),
    1:length(s)
    )

Frédéric

Posted 2016-10-09T11:54:03.280

Reputation: 2 059

Nice! I came up with a similar solution, but yours is much neater (especially the use of match). You can save one byte by making it a one-liner, and another byte by getting the nchar of the input string rather than the length of the split vector. Like so: cor(match(strsplit(i<-scan(,""),"")[[1]],letters),1:nchar(i)) – rturnbull – 2016-10-09T15:19:03.873

... it's not code-golf though – Jonathan Allan – 2016-10-09T15:19:45.170

@rturnbull: your solution doesn't work on my computer. I wonder whether having strsplit(scan) inside the cor call is the problem. @JonathanAllan: True enough, but it's always more fun! – Frédéric – 2016-10-09T15:25:08.917

Ah, I think the problem was the placement of the assignment operator to i. (I don't actually fully understand R's behavior with this.) Try it now: cor(match(strsplit(i,"")[[1]],letters),1:nchar(i<-scan(,""))). – rturnbull – 2016-10-09T15:29:45.510

2

Python

That's a lot of traversals; let's do everything we need in one pass for efficiency.

Since y will be [1,2,...,len(x)], we can calculate its sum and sum of squares using the formulas for triangular and square pyramidal numbers respectively.

...and there's really no need for two square roots: since the two are multiplied, we can just take one square root of the product.

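def p(l):
    # One pass over the word: y counts positions, x is the alphabet number
    y = sx = sxy = sx2 = 0
    for c in l:
        y += 1
        x = ord(c) - 96  # 'a' -> 1, ..., 'z' -> 26
        sx += x
        sxy += x * y
        sx2 += x * x
    n = y  # word length
    sy = n * (n + 1) / 2  # sum of 1..n (triangular number)
    sy2 = sy * (2 * n + 1) / 3  # sum of squares of 1..n (square pyramidal number)
    return (n * sxy - sx * sy) / ((n * sx2 - sx * sx) * (n * sy2 - sy * sy)) ** .5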
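
The two closed forms this answer relies on can be checked against direct summation. This is only a small sketch; position_sums is a hypothetical helper name, not something from the answer.

```python
def position_sums(n):
    # Closed forms for the positions 1..n (no loop over the word needed)
    total = n * (n + 1) // 2                  # triangular number: 1 + 2 + ... + n
    squares = n * (n + 1) * (2 * n + 1) // 6  # square pyramidal number: 1^2 + ... + n^2
    return total, squares

# Compare against direct summation for a few word lengths
for n in range(1, 8):
    assert position_sums(n) == (sum(range(1, n + 1)),
                                sum(k * k for k in range(1, n + 1)))

print(position_sums(5))  # (15, 55)
```

Since n(n + 1)(2n + 1)/6 equals the triangular number times (2n + 1)/3, the sum of squares can be derived from the plain sum, which is what the sy2 line does.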

Jonathan Allan

Posted 2016-10-09T11:54:03.280

Reputation: 67 804

1

Python 3

Simply uses the formula in the challenge. Doesn't use any libraries like numpy and friends.

from math import sqrt

def correlation_coefficient(word):
  alphabet = 'abcdefghijklmnopqrstuvwxyz'

  x = list(range(len(word)))
  y = list(map(alphabet.index, word))
  n = len(word)

  top = n * sum(a*b for a, b in zip(x, y)) - sum(x) * sum(y)
  bottom = (sqrt(n * sum(a ** 2 for a in x) - sum(x) ** 2)
          * sqrt(n * sum(a ** 2 for a in y) - sum(y) ** 2))

  r = top / bottom
  return r
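
Note that the code above numbers both the word positions and the alphabet from 0, whereas the challenge counts from 1. Pearson's r does not change when a constant is added to either data set, so the result is the same. A small check, assuming a hypothetical helper pearson that applies the same raw-sums formula (with a single square root of the product):

```python
from math import sqrt

def pearson(xs, ys):
    # Raw-sums form of Pearson's r, with one square root over the product
    n = len(xs)
    top = n * sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys)
    bottom = sqrt((n * sum(a * a for a in xs) - sum(xs) ** 2)
                  * (n * sum(b * b for b in ys) - sum(ys) ** 2))
    return top / bottom

word = "bells"
# 0-based indices, as in the answer above
zero_based = pearson(list(range(len(word))), [ord(c) - 97 for c in word])
# 1-based indices, as in the challenge statement
one_based = pearson(list(range(1, len(word) + 1)), [ord(c) - 96 for c in word])

# The two agree up to floating-point rounding
print(abs(zero_based - one_based) < 1e-12)
```

Like the answer itself, this sketch divides by zero for a one-letter word or a word made of a single repeated letter, where the correlation is undefined.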

Ideone it!

Copper

Posted 2016-10-09T11:54:03.280

Reputation: 3 684