Linear Regression on a String

25

4

This challenge is a little tricky, but rather simple, given a string s:

meta.codegolf.stackexchange.com

Use the position of the character in the string as an x coordinate and the ascii value as a y coordinate. For the above string, the resultant set of coordinates would be:

0, 109
1, 101
2, 116
3, 97
4, 46
5, 99
6, 111
7, 100
8, 101
9, 103
10,111
11,108
12,102
13,46
14,115
15,116
16,97
17,99
18,107
19,101
20,120
21,99
22,104
23,97
24,110
25,103
26,101
27,46
28,99
29,111
30,109

Next, you must calculate both the slope and the y-intercept of the set you've garnered using Linear Regression, here's the set above plotted:

Plot

Which results in a best fit line of (0-indexed):

y = 0.014516129032258x + 99.266129032258

Here's the 1-indexed best-fit line:

y = 0.014516129032258x + 99.251612903226

So your program would return:

f("meta.codegolf.stackexchange.com") = [0.014516129032258, 99.266129032258]

Or (Any other sensible format):

f("meta.codegolf.stackexchange.com") = "0.014516129032258x + 99.266129032258"

Or (Any other sensible format):

f("meta.codegolf.stackexchange.com") = "0.014516129032258\n99.266129032258"

Or (Any other sensible format):

f("meta.codegolf.stackexchange.com") = "0.014516129032258 99.266129032258"

Just explain why it is returning in that format if it isn't obvious.


Some clarifying rules:

- Strings are 0-indexed or 1 indexed both are acceptable.
- Output may be on new lines, as a tuple, as an array or any other format.
- Precision of the output is arbitrary but should be enough to verify validity (min 5).

This is lowest byte-count wins.

Magic Octopus Urn

Posted 2017-01-09T17:45:18.940

Reputation: 19 422

3Do you have any link / formula to calculate the slope and the y-intercept? – Rod – 2017-01-09T17:56:28.257

1Agreed. You should include the method for computing the line in the question. – mbomb007 – 2017-01-09T17:58:31.920

16Dear Unclear-voters: While I agree that it is nice to have the formula, it is by no means necessary. Linear regression is a well-defined thing in the mathematical world, and the OP may want to leave finding the equation up to the reader. – Nathan Merrill – 2017-01-09T18:02:57.643

6https://en.wikipedia.org/wiki/Simple_linear_regression – Magic Octopus Urn – 2017-01-09T18:11:01.420

1There are many ways to perform linear regression. Some are estimations/approximations while other methodologies strive to be exact calculations. So, how you go about it is up to you. But you should be able to explain the method you chose. – Magic Octopus Urn – 2017-01-09T18:15:26.567

2Is it okay to return the actual equation of the best-fit line, such as 0.014516129032258x + 99.266129032258? – Greg Martin – 2017-01-09T18:18:51.047

1@GregMartin yes. As long as the slope and y-intercept are displayed in a sensible manner I will accept it. – Magic Octopus Urn – 2017-01-09T18:39:28.283

2

This challenge's title has put this wonderful song in my head for the rest of the day

– Luis Mendo – 2017-01-09T21:09:16.923

1I joined this community just to up-vote this challenge! – CraigR8806 – 2017-01-10T15:52:19.463

@CraigR8806 have a look around, this is by far not even close to the best challenge out there. – Magic Octopus Urn – 2017-01-10T19:15:13.530

@carusocomputing I have been lurking through here for awhile and there are definitely some novel challenges. I found this one to be really interesting though. – CraigR8806 – 2017-01-10T19:16:45.050

1Obligatory xkcd. (Though the example in this challenge does have something of a pattern, the outliers really skew the line of best fit!) – DLosc – 2017-01-11T07:28:00.323

Answers

2

MATL, 8 bytes

n:G3$1ZQ

1-based string indexing is used.

Try it online!

Explanation

n:     % Input string implicitly. Push [1 2 ... n] where n is string length.
       % These are the x values
G      % Push the input string. A string is an array of chars, which is
       % equivalent to an array of ASCII codes. These are the y values
3$     % The next function will use 3 inputs
1      % Push 1
ZQ     % Fit polynomial of degree 1 to those x, y data. The result is an
       % array with the polynomial coefficients. Implicitly display

Luis Mendo

Posted 2017-01-09T17:45:18.940

Reputation: 87 464

7

Octave, 29 26 24 20 bytes

@(s)s/[!!s;1:nnz(s)]

Try it Online!

We have the model

y= intercept *x^0 + slope * x
 = intercept * 1  + slope * x

Here y is the ASCII value of string s

To find parameters intercept and slope we can form the following equation:

s = [intercept slope] * [1 X]

so

[intercept slope] = s/[1 x]

!!s converts a string to a vector of ones with the same length as the string.
The vector of ones is used for estimation of the intercept.
1:nnz(s) is range of values from 1 to number of elements of the string used as x.

Previous answer

@(s)ols(s'+0,[!!s;1:nnz(s)]')

For test paste the following code into Octave Online

(@(s)ols(s'+0,[!!s;1:nnz(s)]'))('meta.codegolf.stackexchange.com')

A function that accepts a string as input and applies ordinary least squares estimation of model y = x*b + e

The first argument of ols is y that for it we transpose the string s and add with number 0 to get its ASCII code.

rahnema1

Posted 2017-01-09T17:45:18.940

Reputation: 5 435

/, great idea! – Luis Mendo – 2017-01-09T20:06:26.423

6

TI-Basic, 51 (+ 141) bytes

Strings are 1-based in TI-Basic.

Input Str1
seq(I,I,1,length(Str1->L1
32+seq(inString(Str2,sub(Str1,I,1)),I,1,length(Str1->L2
LinReg(ax+b)

Like the other example, this outputs the equation of the best fit line, in terms of X. Also, in Str2 you need to have this string, which is 141 bytes in TI-Basic:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}~

The reason this cannot be a part of the program is because two characters in TI-Basic cannot be automatically added to a string. One is the STO-> arrow, but this is not a problem because it is not a part of ASCII. The other is the string literal ("), which can be stringified only by typing into a Y= equation and using Equ>String(.

Timtech

Posted 2017-01-09T17:45:18.940

Reputation: 12 038

I was seriously wondering if anyone would bust out their old calculators for this :). I had my old TI-83 in mind when I thought this up. – Magic Octopus Urn – 2017-01-09T19:08:18.247

@carusocomputing Hey, nice! I like the TI-Basic programming language a lot and I use it for many of my code golfs. If only it supported ASCII... – Timtech – 2017-01-09T21:13:49.340

Two comments: 1, you can stringify " by prompting for it as user input in a program as well, which doesn't help you here, but I just wanted to point that fact out. 2, I don't recognize some of those characters as existing on the calculator. I could be wrong, but for example, where do you get @ and ~? As well as #, $, and &. – Patrick Roberts – 2017-01-10T10:57:29.190

Thanks for the comment, @PatrickRoberts. Those are two-byte tokens starting with 0xBB. Look in Column D of http://tibasicdev.wikidot.com/miscellaneous-tokens

– Timtech – 2017-01-10T23:07:23.913

6

R, 46 45 bytes

x=1:nchar(y<-scan(,""));lm(utf8ToInt(y)~x)$co

Reads input from stdin and for the given test case returns (one-indexed):

(Intercept)           x 
99.25161290  0.01451613 

Billywob

Posted 2017-01-09T17:45:18.940

Reputation: 3 363

Slightly shorter (but untested, possibly some evaluation problems in parsing the formula): lm(utf8ToInt(y<-scan(,""))~1:nchar(y))$co – rturnbull – 2017-01-10T15:16:46.027

@rturnbull I tried this at first but it seems like the x variable has to be pre-defined for lm to work. – Billywob – 2017-01-10T15:17:47.727

@rturnbull I get a variable lengths differ error on that. We are given s so x=1:nchar(s);lm(charToRaw(s)~x)$co saves some bytes. I also don't know if the $co is technically necessary, as you still get the intercept + coefficient without it – Chris – 2017-01-10T15:18:08.050

@Chris Fairly sure that is not a viable answer. There should be some input from stdin or as a function argument. – Billywob – 2017-01-10T15:24:30.747

Fair enough, just my reading of the question - it gives a more fair comparison to the python + octave answers as well – Chris – 2017-01-10T15:27:46.807

5

Python, 82 80 bytes

-2 bytes thanks to @Mego

Using scipy:

import scipy
lambda s:scipy.stats.linregress(range(len(s)),list(map(ord,s)))[:2]

dfernan

Posted 2017-01-09T17:45:18.940

Reputation: 528

Unnamed lambdas are allowed, so you can drop the f=. – Mego – 2017-01-09T20:15:13.493

@DigitalTrauma numpy.linalg.lstsq apparently differs in arguments to scipy.stats.linregress and is more complex. – dfernan – 2017-01-09T21:09:23.223

4

Mathematica, 31 bytes

Fit[ToCharacterCode@#,{1,x},x]&

Unnamed function taking a string as input and returning the actual equation of the best-fit line in question. For example, f=Fit[ToCharacterCode@#,{1,x},x]&; f["meta.codegolf.stackexchange.com"] returns 99.2516 + 0.0145161 x.

ToCharacterCode converts an ASCII string to a list of the corresponding ASCII values; indeed, it defaults to UTF-8 more generally. (Kinda sad, in this context, that one function name comprises over 48% of the code length....) And Fit[...,{1,x},x] is the built-in for computing linear regression.

Greg Martin

Posted 2017-01-09T17:45:18.940

Reputation: 13 940

1Thanks for the example of the 1-indexed line, didn't have to calculate it because of you haha. – Magic Octopus Urn – 2017-01-09T19:19:39.487

4

Node.js, 84 bytes

Using regression:

s=>require('regression')('linear',s.split``.map((c,i)=>[i,c.charCodeAt()])).equation

Demo

// polyfill, since this is clearly not Node.js
function require(module) {
  return window[module];
}
// test
["meta.codegolf.stackexchange.com"].forEach(function test(string) {
  console.log(string);
  console.log(this(string));
},
// submission
s=>require('regression')('linear',s.split``.map((c,i)=>[i,c.charCodeAt()])).equation
);
<script src="https://cdn.rawgit.com/Tom-Alexander/regression-js/master/src/regression.js"></script>

Patrick Roberts

Posted 2017-01-09T17:45:18.940

Reputation: 2 475

3

Sage, 76 bytes

var('m','c')
y(x)=m*x+c
f=lambda x:find_fit(zip(range(len(x)),map(ord,x)),y)

Hardly any golfing, probably longer than a golfed Python answer, but yeah...

busukxuan

Posted 2017-01-09T17:45:18.940

Reputation: 2 728

2

J, 11 bytes

3&u:%.1,.#\

This uses one-based indexing.

Try it online!

Explanation

3&u:%.1,.#\  Input: string S
         #\  Get the length of each prefix of S
             Forms the range [1, 2, ..., len(S)]
      1,.    Pair each with 1
3&u:         Get the ASCII value of each char in S
    %.       Matrix divide

miles

Posted 2017-01-09T17:45:18.940

Reputation: 15 654

2

JavaScript, 151 148 bytes

s=>([a,b,c,d,e]=[].map.call(s,c=>c.charCodeAt()).reduce(([a,b,c,d,e],y,x)=>[a+1,b+x,c+x*x,d+y,e+x*y],[0,0,0,0,0]),[k=(e*a-b*d)/(c*a-b*b),(d-k*b)/a])

More readable:

function f(s) {
  var [n, sx, sx2, sy, sxy] = [].map.call(s, c => c.charCodeAt(0))
    .reduce(([n, sx, sx2, sy, sxy], y, x) => [
      n + 1,
      sx + x,
      sx2 + x * x,
      sy + y,
      sxy + x * y
    ], [0, 0, 0, 0, 0]);
  var k = (sxy * n - sx * sy) / (sx2 * n - sx * sx);
  return [k, (sy - k * sx) / n];
}

$(() => $('#run').on('click', () => console.log(f($('#input').val()))));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea id="input" style="width: 100%">meta.codegolf.stackexchange.com</textarea>
<button id="run">Run</button>

Markus Jarderot

Posted 2017-01-09T17:45:18.940

Reputation: 121

You can save a byte by removing 0 from c.charCodeAt(0), and another 2 bytes by moving the k=... comma group and putting it directly in first index of the returned array like [k=...,(d-k*b)/a] – Patrick Roberts – 2017-01-10T10:41:29.073

2

Haskell, 154 142 bytes

import Statistics.LinearRegression
import Data.Vector
g x=linearRegression(generate(Prelude.length x)i)$i.fromEnum<$>fromList x
i=fromIntegral

It is far too long for my likings because of the imports and long function names, but well. I couldn't think of any other Golfing method left, although I'm not expert on the area of golfing imports.

Stripped 12 bytes by replacing ord and the import of Data.Char by fromEnum thanks to nimi.

Renzeee

Posted 2017-01-09T17:45:18.940

Reputation: 599

1You can replace ord with fromEnum and get rid of import Data.Char. – nimi – 2017-01-10T23:48:42.737

2

Javascript (ES6), 112 bytes

s=>[m=(a=b=c=d=0,([...s].map((u,x)=>{a+=n=x,b+=y=u.charCodeAt(),c+=x*x,d+=x*y}),++n)*d-a*b)/(n*c-a*a),b/n-m*a/n]

F=s=>[m=(a=b=c=d=0,([...s].map((u,x)=>{a+=n=x,b+=y=u.charCodeAt(),c+=x*x,d+=x*y}),++n)*d-a*b)/(n*c-a*a),b/n-m*a/n]

const update = () => {
  console.clear();
  console.log(F(input.value));
};
input.oninput = update;
update();
#input {
  width: 100%;
  box-sizing: border-box;
}
<input id="input" type="text" value="meta.codegolf.stackexchange.com" length=99/>
<div id="output"></div>

George Reith

Posted 2017-01-09T17:45:18.940

Reputation: 2 424

1

SAS Macro Language, 180 bytes

Uses 1-based indexing. The solution gets to be pretty wordy when the output is only the slope and intercept.

%macro t(a);data w;%do i=1 %to %length(&a);x=&i;y=%sysfunc(rank(%substr(&a,&i,1)));output;%end;run;proc reg outtest=m;model y=x/noprint;run;proc print data=m;var x intercept;%mend;

J_Lard

Posted 2017-01-09T17:45:18.940

Reputation: 351

1

Clojure, 160 bytes

No built-ins, uses the iterative algorithm described at Perceptron article. Might not converge on other inputs, in that case lower the learning rate 2e-4 and maybe increase the iteration count 1e5. Not sure if the non-iterative algorithm would have been shorter to implement.

#(nth(iterate(fn[p](let[A apply e(for[x(range(count %))](-(int(get % x))(*(p 1)x)(p 0)))](mapv(fn[p e](+(* e 2e-4)p))p[(A + e)(A +(map *(range)e))])))[0 0])1e5)

Example:

(def f #( ... ))
(f "meta.codegolf.stackexchange.com")

[99.26612903225386 0.014516129032464659]

NikoNyrh

Posted 2017-01-09T17:45:18.940

Reputation: 2 361

1

Maple, 65 bytes

Statistics:-LinearFit(b*x+a,[$(1..length(s))],convert(s,bytes),x)

Usage:

s := "meta.codegolf.stackexchange.com";
Statistics:-LinearFit(b*x+a,[$(1..length(s))],convert(s,bytes),x);

Returns:

99.2516129032259+0.0145161290322573*x

Notes: This uses the Fit command to fit a polynomial of the form a*x + b to the data. The ASCII values for the string are found by converting to bytes.

DSkoog

Posted 2017-01-09T17:45:18.940

Reputation: 560