Find the largest root of a polynomial with a neural network


The challenge

Find the smallest feedforward neural network such that, given any 3-dimensional input vector \$(a,b,c)\$ with integer entries in \$[-10,10]\$, the network outputs the largest (i.e., "most positive") root of the polynomial \$x^3+ax^2+bx+c\$ with error strictly smaller than \$0.1\$.

Admissibility

The notion of admissibility in my previous neural net golfing challenge seemed a bit restrictive, so for this challenge, we are using a more liberal definition of feedforward neural network:

A neuron is a function \$\nu\colon\mathbf{R}^n\to\mathbf{R}\$ that is specified by a vector \$w\in\mathbf{R}^{n}\$ of weights, a bias \$b\in\mathbf{R}\$, and an activation function \$f\colon\mathbf{R}\to\mathbf{R}\$ in the following way:

$$ \nu(x) := f(w^\top x+b), \qquad x\in\mathbf{R}^n. $$

A feedforward neural network with input nodes \$\{1,\ldots,n\}\$ is a function of \$(x_1,\ldots,x_n)\in\mathbf{R}^n\$ that can be built from a sequence \$(\nu_k)_{k=n+1}^N\$ of neurons, where each \$\nu_k\colon\mathbf{R}^{k-1}\to\mathbf{R}\$ takes inputs from \$(x_1,\ldots,x_{k-1})\$ and outputs a scalar \$x_k\$. Given some specified set \$S\subseteq\{1,\ldots,N\}\$ of output nodes, the output of the neural network is the vector \$(x_k)_{k\in S}\$.

Since activation functions can be tuned for any given task, we need to restrict the class of activation functions to keep this challenge interesting. The following activation functions are permitted:

  • Identity. \$f(t)=t\$

  • ReLU. \$f(t)=\operatorname{max}(t,0)\$

  • SoftPlus. \$f(t)=\ln(e^t+1)\$

  • Sigmoid. \$f(t)=\frac{e^t}{e^t+1}\$

  • Sinusoid. \$f(t)=\sin t\$
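
For concreteness, here are the five permitted activations written as MATLAB anonymous functions (a convenience sketch for experimenting; the variable names are mine, not part of the challenge):

identity = @(t) t;
relu     = @(t) max(t, 0);
softplus = @(t) log(exp(t) + 1);          % note: overflows for very large t
sigmoid  = @(t) exp(t) ./ (exp(t) + 1);
sinusoid = @(t) sin(t);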

Overall, an admissible neural net is specified by input nodes, a sequence of neurons, and output nodes, while each neuron is specified by a vector of weights, a bias, and an activation function from the above list. For example, the following neural net is admissible, though it does not meet the performance goal of this challenge:

  • Input nodes: \$\{1,2\}\$

  • Neurons: \$\nu_k(x_1,\ldots,x_{k-1}):=x_{k-2}+x_{k-1}\$ for \$k\in\{3,\ldots,10\}\$

  • Output nodes: \$\{5,9,10\}\$

This network consists of 8 neurons, each with zero bias and identity activation. In words, this network computes the generalized Fibonacci sequence generated by \$x_1\$ and \$x_2\$ and then outputs the 5th, 9th, and 10th numbers from this sequence, in that order.
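
As a sanity check, here is a minimal MATLAB sketch of this example network (my own illustration, not part of the challenge): every neuron has zero bias, identity activation, and a weight vector with ones in positions \$k-2\$ and \$k-1\$.

x = zeros(10, 1);
x(1) = 1;  x(2) = 1;        % example input (x_1, x_2) = (1, 1)
for k = 3:10
    w = zeros(1, k - 1);    % weight vector of neuron nu_k
    w([k-2, k-1]) = 1;      % pick out x_{k-2} and x_{k-1}; bias is 0
    x(k) = w * x(1:k-1);    % identity activation
end
disp(x([5 9 10])')          % output nodes {5, 9, 10}; prints [5 34 55] here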

Scoring

Given a real number \$x\$ with terminating decimal expansion, let \$p(x)\$ be the smallest nonnegative integer \$p\$ for which \$10^{-p}\cdot |x|<1\$, and let \$q(x)\$ be the smallest nonnegative integer \$q\$ for which \$10^q \cdot x\$ is an integer. Then we say \$p(x)+q(x)\$ is the precision of \$x\$.

For example, \$x=1.001\$ has a precision of \$4\$, whereas \$x=0\$ has a precision of \$0\$.

Your score is the sum of the precisions of the weights and biases in your neural network.

(E.g., the above example has a score of 16.)
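
For reference, here is one possible MATLAB implementation of this scoring rule (a sketch only; the prec helper linked in the answers below may differ in details such as tolerances):

function total = precsketch(v)
% Sum of the precisions of the entries of v, per the definition above.
% Assumes every entry has a terminating decimal expansion.
total = 0;
for x = v(:)'
    p = 0;
    while 10^(-p) * abs(x) >= 1                    % smallest p with 10^-p * |x| < 1
        p = p + 1;
    end
    q = 0;
    while abs(10^q * x - round(10^q * x)) > 1e-9   % smallest q with 10^q * x an integer
        q = q + 1;
    end
    total = total + p + q;
end
end

For example, precsketch(1.001) returns 4 and precsketch(0) returns 0, matching the examples above.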

Verification

While the roots can be expressed in terms of the cubic formula, the largest root is perhaps most easily accessed by numerical means. Following @xnor's suggestion, I computed the largest root for every choice of integers \$a,b,c\in[-10,10]\$, and the results can be found here. Each line of this text file is of the form a,b,c,root. For example, the first line reports that the largest root of \$x^3-10x^2-10x-10\$ is approximately \$10.99247140445449\$.

Edit: The original file I posted had errors in cases where the polynomial exhibited a multiple root. The current version should be free of such errors.
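
For anyone who wants to regenerate such a table, a minimal MATLAB sketch along these lines is below (not the script used for the posted file; the output file name, the tolerance for discarding complex roots, and the printed precision are my choices). As the edit above warns, the tolerance matters near multiple roots, where the computed roots can pick up tiny spurious imaginary parts.

fid = fopen('roots_check.txt', 'w');
for a = -10:10
    for b = -10:10
        for c = -10:10
            r = roots([1 a b c]);                 % roots of x^3 + a*x^2 + b*x + c
            re = real(r(abs(imag(r)) < 1e-4));    % keep the (numerically) real roots
            fprintf(fid, '%d,%d,%d,%.14g\n', a, b, c, max(re));
        end
    end
end
fclose(fid);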

Dustin G. Mixon

Posted 2019-09-29T21:03:27.200

Reputation: 1 833

What happens if the input polynomial has no real roots, like when a=0 and the quadratic has two complex roots? – xnor – 2019-09-29T21:27:03.247

I think the cleanest solution would be to say that the input will have a be nonzero, or even just 1. Also, I'd recommend putting in some test cases, giving the roots to high precision so we can check ours are within 0.1. It would also be good to have outputs for all possible inputs, probably in a link since that's a lot for the post. – xnor – 2019-09-29T21:32:29.583

@xnor - Good eye, thanks! I'll work on some test cases. – Dustin G. Mixon – 2019-09-29T21:54:21.357

I like the new admissibility rules. It looks like the new sinusoid function is extremely exploitable though. I have a sketchy proof that a function of the form x -> a * sin(b * softplus(x) + c) can overfit any finite number of data points with integer x to arbitrary precision using an extremely large and precise frequency. – xnor – 2019-09-29T22:15:08.313

@xnor - Thanks for spotting that. I expect sinusoid to be useful, but I agree that your exploit is extreme, so I updated the scoring mechanism to reflect precision. It seems more natural in hindsight. – Dustin G. Mixon – 2019-09-29T23:08:37.237

Not sure how useful it would be (for future challenges): In number theory we use height functions to measure the complexity of a number. For example, the naive height of a (reduced) fraction $p/q$ is given by $h=\log\max\{|p|,|q|\}$ (and there are a lot of generalizations). Maybe this could be used as an alternative measure.

– flawr – 2019-09-30T09:03:04.777

@DustinG.Mixon I'm not sure if you're aware, but we do have a sandbox for posting drafts and discussing the details of a challenge, as well as a chat.

– flawr – 2019-09-30T09:09:40.903

@xnor He changed the coefficient of x^3 to the constant 1 so as to prevent this. BTW: this would only happen if a is zero and the discriminant of the quadratic is negative. – mondlos – 2019-09-30T16:28:11.640

Answers


3,806 total precision (down from earlier scores of 14,674,000,667, 5,436,050, 5,403,448, 10,385, 5,994, and 4,447)

For a baseline, I investigated the following approach: Select \$M,\delta,\epsilon>0\$ such that if we sample the polynomial \$p(x)=x^3+ax^2+bx+c\$ at

$$ S:=\{-M,-M+\delta,-M+2\delta,\ldots,M\}, $$

then the largest sample point \$s^\star\in S\$ satisfying \$p(s^\star)<\epsilon\$ necessarily exists and necessarily resides within \$0.1\$ of the largest root of \$p\$. One can show that for our collection of polynomials, one may take \$M=11\$, \$\delta=0.1\$, and \$\epsilon=10^{-4}\$.
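
This claim can also be verified by brute force against the roots file from the question (a throwaway MATLAB sketch; it assumes roots.txt has been downloaded to the working directory):

A = importdata('roots.txt');    % columns: a, b, c, largest root
S = -11:0.1:11;
ok = true;
for ii = 1:size(A, 1)
    p = S.^3 + A(ii,1)*S.^2 + A(ii,2)*S + A(ii,3);   % sample the polynomial on S
    sstar = max(S(p < 1e-4));                        % largest sample with p(s) < eps
    ok = ok && ~isempty(sstar) && abs(sstar - A(ii,4)) < 0.1;
end
disp(ok)                        % the claim is that this prints 1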

To design a neural net that implements this logic, we start with a layer of neurons that sample the polynomial on \$S\$. For each \$s\in S\$, we take

$$ x_{1,s} = s^2\cdot a + s\cdot b + 1\cdot c + s^3. $$

Next, we identify which of these are less than \$\epsilon=10^{-4}\$. It turns out that for \$s\in S\$, it holds that \$p(s)<10^{-4}\$ only if \$p(s)\leq 0\$; indeed, since every \$s\in S\$ is a multiple of \$0.1\$ and \$a,b,c\$ are integers, \$p(s)\$ is an integer multiple of \$10^{-3}\$. As such, we can use relu activations to exactly identify our samples:

$$ \frac{\mathrm{relu}(10^{-4}-t) - \mathrm{relu}(-t)}{10^{-4}} = \left\{\begin{array}{cl}1&\text{if }t\leq 0\\0&\text{if }t\geq 10^{-4}.\end{array}\right.$$
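
A quick numerical check of this identity (a throwaway sketch, not part of the network; between \$0\$ and \$10^{-4}\$ the expression interpolates linearly, which is harmless here because the sampled values never land strictly between \$0\$ and \$10^{-4}\$):

relu = @(t) max(t, 0);
f = @(t) (relu(1e-4 - t) - relu(-t)) / 1e-4;
disp([f(-3) f(0) f(5e-5) f(1e-4) f(2)])   % approximately [1 1 0.5 0 0]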

We implement this with a few layers of neurons:

$$ \begin{aligned} x_{2,s} &= \mathrm{relu}(-1\cdot x_{1,s}+10^{-4}), \\ x_{3,s} & = \mathrm{relu}(-1\cdot x_{1,s}), \\ x_{4,s} &= 10^{4}\cdot x_{2,s}-10^{4}\cdot x_{3,s}. \end{aligned}$$

At this point, we have \$x_{4,s}=1\$ when \$p(s)<10^{-4}\$, and otherwise \$x_{4,s}=0\$. Recall that we seek the largest \$s^\star\$ for which \$x_{4,s^\star}=1\$. To this end, we label \$x_{4,M}\$ as \$x_{5,M}\$ (for notational convenience), and for each \$k\geq 1\$, we iteratively define

$$ x_{5,M-k\delta} = 1\cdot x_{4,M-k\delta}+2\cdot x_{5,M-(k-1)\delta} = \sum_{j=0}^k 2^{k-j}x_{4,M-j\delta}. $$
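
To see why the unique value \$1\$ ends up at \$s^\star\$, here is a tiny worked example (hypothetical \$x_{4,s}\$ values, ordered from \$s=M\$ down to \$s=-M\$):

x4 = [0 0 1 1 1];                  % hypothetical indicators, s = M first
x5 = x4(1);                        % x_{5,M} := x_{4,M}
for k = 2:numel(x4)
    x5(k) = x4(k) + 2 * x5(k-1);   % the recursion above
end
disp(x5)                           % prints [0 0 1 3 7]; the unique 1 marks s*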

Thanks to this transformation, every \$x_{5,s}\$ is a nonnegative integer, and \$s^\star\$ is the unique \$s\$ for which \$x_{5,s}=1\$. We may now identify \$s^\star\$ by another application of relu activations:

$$ \mathrm{relu}(t-2)-2\cdot\mathrm{relu}(t-1)+t = \left\{\begin{array}{cl}1&\text{if }t=1\\0&\text{if }t\in\mathbf{Z}_{\geq 0}\setminus\{1\}.\end{array}\right.$$

Explicitly, we define neurons by

$$ \begin{aligned} x_{6,s} &= \mathrm{relu}(1\cdot x_{5,s} - 2), \\ x_{7,s} &= \mathrm{relu}(1\cdot x_{5,s} - 1), \\ x_{8,s} &= 1\cdot x_{6,s} - 2\cdot x_{7,s} + 1\cdot x_{5,s}. \end{aligned} $$

Then \$x_{8,s}=1\$ if \$s=s^\star\$ and otherwise \$x_{8,s}=0\$. We linearly combine these to produce our output node:

$$ x_9 = \sum_{s\in S} s\cdot x_{8,s} = s^\star. $$

For the score, each layer has neurons with different levels of precision: (1) \$6+3+1+9=19\$, (2) \$1+4=5\$, (3) \$1\$, (4) \$5+5=10\$, (5) \$1+1=2\$, (6) \$1+1=2\$, (7) \$1+1=2\$, (8) \$1+1+1=3\$, (9) \$3|S|\$. Furthermore, all but two of the layers have \$|S|=221\$ neurons; layer 5 has \$|S|-1\$ neurons and layer 9 has \$1\$ neuron.

Edit: Improvements: (1) We can sample the polynomial much more efficiently using finite differences. (2) We can bypass layers 2 through 4 by instead using a sigmoid activation. (3) The overflow issues in layer 5 can be averted (and subsequent layers can be combined) by more carefully applying relu activations. (4) The final sum is cheaper with summation by parts.
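
Regarding improvement (1): the fourth finite difference of a cubic vanishes, so once four consecutive samples are known, every further sample is an integer combination of the previous four; this is what the weights [-1 4 -6 4] in the code below implement:

$$ p(s) = 4\,p(s-\delta) - 6\,p(s-2\delta) + 4\,p(s-3\delta) - p(s-4\delta). $$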

What follows is the MATLAB code. To be clear, prec is a function (found here) that computes the precision of a vector of weights or biases.

function sstar = findsstar2(a,b,c)

relu = @(x) x .* (x>0);

totprec = 0;

% x1 samples the polynomial on -11:0.1:11
x1=[];
for s = -11:0.1:11
    if length(x1) < 5
        w1 = [s^2 s 1];
        b1 = s^3;
        x1(end+1,:) = w1 * [a; b; c] + b1;
        totprec = totprec + prec(w1) + prec(b1);
    else
        w1 = [-1 4 -6 4];
        x1(end+1,:) = w1 * x1(end-3:end,:);
        totprec = totprec + prec(w1);
    end
end

% x4 indicates whether the polynomial is nonpositive
w4 = -6e5;
b4 = 60;
x4=[];
for ii=1:length(x1)
    x4(end+1) = sigmf(w4 * x1(ii) + b4, [1,0]);
    totprec = totprec + prec(w4) + prec(b4);
end

% x6 indicates which entries are less than or equal to sstar
x5 = zeros(size(x1));
x6 = zeros(size(x1));
x5(end) = 0;
x6(end) = 0;
for ii = 1:length(x5)-1
    w5 = [-1 -1];
    b5 = 1;
    x5(end-ii) = relu(w5 * [x4(end-ii); x6(end-ii+1)] + b5);
    totprec = totprec + prec(w5) + prec(b5);
    w6 = -1;
    b6 = 1;
    x6(end-ii) = w6 * x5(end-ii) + b6;
    totprec = totprec + prec(w6) + prec(b6);
end

% a linear combination produces sstar
w7 = 0.1*ones(1,length(x1));
w7(1) = -11;
sstar = w7 * x6;

%disp(totprec) % uncomment to display score

end
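
A usage sketch (my own; it assumes roots.txt from the question and the prec helper are on the path) that measures the worst-case error over all \$21^3\$ inputs:

A = importdata('roots.txt');
maxerr = 0;
for ii = 1:size(A, 1)
    r = findsstar2(A(ii,1), A(ii,2), A(ii,3));
    maxerr = max(maxerr, abs(r - A(ii,4)));
end
disp(maxerr)    % should be strictly smaller than 0.1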

Dustin G. Mixon

Posted 2019-09-29T21:03:27.200

Reputation: 1 833


29,306 total precision (down from earlier scores of 53,268 and 29,596)

Private communication with @A.Rex led to this solution, in which we construct a neural net that memorizes the answers. The core idea is that every function \$f\colon S\to\mathbf{R}\$ over a finite set \$S\$ enjoys the decomposition

$$ f(x) = \sum_{s\in S}f(s)\cdot \left\{\begin{array}{cl}1&\text{if }x=s\\0&\text{else}\end{array}\right\}. $$

As such, one may construct a neural net implementation of \$f\$ by first transforming the input into an indicator (one-hot) representation over \$S\$, and then linearly combining with the desired outputs as weights. Furthermore, relu activations make this transformation possible:

$$ \mathrm{relu}(t-1)-2\cdot\mathrm{relu}(t)+\mathrm{relu}(t+1) = \left\{\begin{array}{cl}1&\text{if } t=0\\0&\text{if }t\in\mathbf{Z}\setminus\{0\}.\end{array}\right. $$
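
A quick numerical check of this identity (again a throwaway sketch, not part of the network):

relu = @(t) max(t, 0);
g = @(t) relu(t - 1) - 2 * relu(t) + relu(t + 1);
disp(arrayfun(g, -3:3))   % prints [0 0 0 1 0 0 0]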

What follows is a MATLAB implementation of this approach. To be clear, roots.txt is the roots file posted above (found here), and prec is a function (found here) that computes the total precision of a vector of weights or biases.

Edit 1: Two improvements over the original: (1) I factored some neurons out of the for loops. (2) I implemented "Lebesgue integration" in the final sum by first combining terms from the same level set. This way, I pay for the higher-precision value of an output only once every level set. Also, it is safe to round outputs to the nearest fifth by the rational root theorem.

Edit 2: Additional minor improvements: (1) I factored more neurons out of a for loop. (2) I don't bother computing the term in the final sum for which the output is already zero.

function r = approxroot(a,b,c)

relu = @(x)x .* (x>0);

totalprec=0;

% x4 indicates which entry of (-10:10) is a
w1 = ones(21,1);   b1 = -(-10:10)'-1;    x1 = relu(w1 * a + b1);
w2 = ones(21,1);   b2 = -(-10:10)';      x2 = relu(w2 * a + b2);
w3 = ones(21,1);   b3 = -(-10:10)'+1;    x3 = relu(w3 * a + b3);
w4p1 = ones(21,1); w4p2 = -2*ones(21,1); w4p3 = ones(21,1);
x4 = w4p1 .* x1 + w4p2 .* x2 + w4p3 .* x3;
totalprec = totalprec + prec(w1) + prec(w2) + prec(w3) + prec(b1) + prec(b2) + prec(b3) + prec(w4p1) + prec(w4p2) + prec(w4p3);

% x8 indicates which entry of (-10:10) is b
w5 = ones(21,1);   b5 = -(-10:10)'-1;    x5 = relu(w5 * b + b5);
w6 = ones(21,1);   b6 = -(-10:10)';      x6 = relu(w6 * b + b6);
w7 = ones(21,1);   b7 = -(-10:10)'+1;    x7 = relu(w7 * b + b7);
w8p1 = ones(21,1); w8p2 = -2*ones(21,1); w8p3 = ones(21,1);
x8 = w8p1 .* x5 + w8p2 .* x6 + w8p3 .* x7;
totalprec = totalprec + prec(w5) + prec(w6) + prec(w7) + prec(b5) + prec(b6) + prec(b7) + prec(w8p1) + prec(w8p2) + prec(w8p3);

% x12 indicates which entry of (-10:10) is c
w9 = ones(21,1);    b9 = -(-10:10)'-1;     x9 = relu(w9 * c + b9);
w10 = ones(21,1);   b10 = -(-10:10)';      x10 = relu(w10 * c + b10);
w11 = ones(21,1);   b11 = -(-10:10)'+1;    x11 = relu(w11 * c + b11);
w12p1 = ones(21,1); w12p2 = -2*ones(21,1); w12p3 = ones(21,1);
x12 = w12p1 .* x9 + w12p2 .* x10 + w12p3 .* x11;
totalprec = totalprec + prec(w9) + prec(w10) + prec(w11) + prec(b9) + prec(b10) + prec(b11) + prec(w12p1) + prec(w12p2) + prec(w12p3);

% x15 indicates which row of the roots file is relevant
x15=[];
for aa=-10:10
    w13 = 1;
    b13 = -2;
    x13 = w13 * x4(aa+11) + b13;
    totalprec = totalprec + prec(w13) + prec(b13);
    for bb=-10:10
        w14p1 = 1;
        w14p2 = 1;
        x14 = w14p1 * x13 + w14p2 * x8(bb+11);
        totalprec = totalprec + prec(w14p1) + prec(w14p2);
        for cc=-10:10
            w15p1 = 1;
            w15p2 = 1;
            x15(end+1,1) = relu(w15p1 * x14 + w15p2 * x12(cc+11));
            totalprec = totalprec + prec(w15p1) + prec(w15p2);
        end
    end
end

% r is the desired root, rounded to the nearest fifth
A = importdata('roots.txt');
outputs = 0.2 * round(5 * A(:,4)');
uniqueoutputs = unique(outputs);
x16 = [];
for rr = uniqueoutputs
    if rr == 0
        x16(end+1,:) = 0;
    else
        lvlset = find(outputs == rr);
        w16 = ones(1,length(lvlset));
        x16(end+1,:) = w16 * x15(lvlset);
        totalprec = totalprec + prec(w16);
    end
end
w17 = uniqueoutputs;
r = w17 * x16;
totalprec = totalprec + prec(w17);

%disp(totalprec) % uncomment to display score

end
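
A usage sketch (it assumes roots.txt and the prec helper are on the path): the first line of the roots file says the largest root of \$x^3-10x^2-10x-10\$ is about \$10.9925\$, so the memorized output should land within \$0.1\$ of that.

r = approxroot(-10, -10, -10);
disp(r)    % expect a value within 0.1 of 10.99247140445449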

Dustin G. Mixon

Posted 2019-09-29T21:03:27.200

Reputation: 1 833