Sort with a neural network

The previous neural net golfing challenges (this and that) inspired me to pose a new challenge:

The challenge

Find the smallest feedforward neural network such that, given any 4-dimensional input vector \$(a,b,c,d)\$ with integer entries in \$[-10,10]\$, the network outputs \$\textrm{sort}(a,b,c,d)\$ with a coordinate-wise error strictly smaller than \$0.5\$.
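For concreteness, the success criterion can be checked by brute force over all \$21^4\$ integer inputs. The following Octave sketch assumes a hypothetical function net(u) standing for the candidate network, mapping a 4-dimensional column vector to a 4-dimensional column vector:

% Exhaustive check of the sorting criterion; net(u) is a hypothetical
% candidate network to be tested.
ok = true;
for a = -10:10
  for b = -10:10
    for c = -10:10
      for d = -10:10
        u = [a; b; c; d];
        if any(abs(net(u) - sort(u)) >= 0.5)   % coordinate-wise error must stay below 0.5
          ok = false;
        end
      end
    end
  end
end
disp(ok)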

Admissibility

For this challenge, a feedforward neural network is defined as a composition of layers. A layer is a function \$L\colon\mathbf{R}^n\to\mathbf{R}^m\$ that is specified by a matrix \$A\in\mathbf{R}^{m\times n}\$ of weights, a vector \$b\in\mathbf{R}^m\$ of biases, and an activation function \$f\colon\mathbf{R}\to\mathbf{R}\$ that is applied coordinate-wise:

$$ L(x) := f(Ax+b), \qquad x\in\mathbf{R}^n. $$
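In code, a layer is just an affine map followed by a coordinate-wise nonlinearity. A minimal Octave sketch (the helper name make_layer is purely illustrative):

% Build a layer L(x) = f(A*x + b) with f applied coordinate-wise.
make_layer = @(A, b, f) @(x) f(A * x + b);
% e.g. an identity layer on R^2: L = make_layer(eye(2), [0; 0], @(t) t);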

Since activation functions can be tuned for any given task, we need to restrict the class of activation functions to keep this challenge interesting. The following activation functions are permitted (a short code sketch of all five follows the list):

  • Identity. \$f(t)=t\$

  • ReLU. \$f(t)=\operatorname{max}(t,0)\$

  • Softplus. \$f(t)=\ln(e^t+1)\$

  • Hyperbolic tangent. \$f(t)=\tanh(t)\$

  • Sigmoid. \$f(t)=\frac{e^t}{e^t+1}\$
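For reference, all five permitted activations are one-liners in Octave (a sketch; the variable names are purely illustrative):

identity = @(t) t;
relu     = @(t) max(t, 0);
softplus = @(t) log(exp(t) + 1);        % log is the natural logarithm
tanh_act = @(t) tanh(t);                % built-in hyperbolic tangent
sigmoid  = @(t) exp(t) ./ (exp(t) + 1);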

Overall, an admissible neural net takes the form \$L_k\circ L_{k-1}\circ\cdots \circ L_2\circ L_1\$ for some \$k\$, where each layer \$L_i\$ is specified by weights \$A_i\$, biases \$b_i\$, and an activation function \$f_i\$ from the above list. For example, the following neural net is admissible (while it doesn't satisfy the performance goal of this challenge, it may be a useful gadget):

$$\left[\begin{array}{c}\min(a,b)\\\max(a,b)\end{array}\right]=\left[\begin{array}{rrrr}1&-1&-\frac{1}{2}&-\frac{1}{2}\\1&-1&\frac{1}{2}&\frac{1}{2}\end{array}\right]\mathrm{ReLU}\left[\begin{array}{rr}\frac{1}{2}&\frac{1}{2}\\-\frac{1}{2}&-\frac{1}{2}\\1&-1\\-1&1\end{array}\right]\left[\begin{array}{c}a\\b\end{array}\right]$$

This example exhibits two layers. Both layers have zero bias. The first layer uses ReLU activation, while the second uses identity activation.
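As a quick sanity check, the gadget can be transcribed and evaluated in a few lines of Octave (a sketch; the names W1 and W2 are purely illustrative):

relu = @(t) max(t, 0);
W1 = [1/2 1/2; -1/2 -1/2; 1 -1; -1 1];    % first layer: ReLU activation, zero bias
W2 = [1 -1 -1/2 -1/2; 1 -1 1/2 1/2];      % second layer: identity activation, zero bias
minmax = @(a, b) W2 * relu(W1 * [a; b]);
minmax(3, -7)                             % returns [-7; 3], i.e. [min; max]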

Scoring

Your score is the total number of nonzero weights and biases.

(E.g., the above example has a score of 16 since the bias vectors are zero.)
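In Octave terms, the score is simply the number of nonzero entries over all weight matrices and bias vectors, e.g. (a sketch, reusing W1 and W2 from the gadget above):

score = @(As, bs) sum(cellfun(@nnz, As)) + sum(cellfun(@nnz, bs));
score({W1, W2}, {zeros(4, 1), zeros(2, 1)})   % returns 16 for the example above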

Dustin G. Mixon

Posted 2019-09-27T11:42:21.090

Reputation: 1 833

@Close-voter: What exactly is unclear? I don't think either of the previous NN challenges was so well specified. – flawr – 2019-09-27T14:44:32.247

Are skip connections (NOT feedback connections) allowed? – Joel – 2019-09-27T15:51:27.187

No - skip connections are not allowed. – Dustin G. Mixon – 2019-09-27T16:13:02.627

@flawr It's probably just because of the high level of knowledge and math that is required. I don't have a specific idea of what the question is asking, so I am not qualified to know whether it is unclear or not. – mbomb007 – 2019-09-27T21:44:06.673

@DustinG.Mixon Out of curiosity: Did you construct the max/min network yourself or did you find it somewhere else? – flawr – 2019-09-27T22:13:22.563

@flawr - I constructed it myself. It may not be minimal. – Dustin G. Mixon – 2019-09-27T22:23:17.453

@DustinG.Mixon I actually just found an approach for the max/min that only uses 15 weights instead of 16, but it is considerably less elegant:) – flawr – 2019-09-27T23:27:24.523

This is a nicely specified challenge that I think could serve as a model for future neural-network challenges. – xnor – 2019-09-27T23:30:01.277

I personally find it difficult to optimize without skip connections. This is because a sorting NN is required to output numbers close enough to the inputs. So it seems necessary to 'remember' / 'reconstruct' the inputs across layers. I do not see how that could be done easily once $e^t$ is involved, since no inverses of those functions are allowed as activations. So we are only left with the ReLUs, for which the baseline (with minor improvements as shown in flawr's answer) is already near optimal. – Joel – 2019-09-28T00:16:07.493

@Joel I agree, more flexibility in the architecture would have made this a more interesting challenge. I guess the drawback with allowing e.g. skip connections is that the specs would be a lot more complicated. – flawr – 2019-09-28T12:43:57.260

Another alternative is to allow outputting the indices of sorted inputs (i.e. argsort) rather than the actual sorted inputs. That would also avoid the need of reconstructing the inputs entirely. – Joel – 2019-09-28T13:04:57.160

Answers


Octave, 50 weights & biases (down from 96, 88, 87, 84, 76, 54)

This 5-layer neural net is essentially a three-stage sorting network built from a very simple min/max network as a component. It is basically the example network from Wikipedia, with a small modification: the first two comparisons are done in parallel (the comparator structure is sketched in plain code below). To pass negative numbers safely through the ReLUs, we simply add 100 first and subtract 100 again at the end.

So this should just be considered a baseline, as it is a naive implementation. It does, however, exactly sort all inputs whose entries are not too large in magnitude. (The range can be adjusted by replacing 100 with another constant.)

Try it online!
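For reference, the underlying comparator network, independent of any neural-net encoding, sorts four numbers with five min/max comparisons arranged in three stages. Here is a plain Octave sketch (the function name sort4 is purely illustrative); this is the structure the layers below implement:

% Three-stage sorting network on four inputs (five comparators).
function s = sort4(a, b, c, d)
  [a, c] = deal(min(a, c), max(a, c));   % stage 1: compare (a,c) ...
  [b, d] = deal(min(b, d), max(b, d));   %          ... and (b,d) in parallel
  [a, b] = deal(min(a, b), max(a, b));   % stage 2: compare the two minima ...
  [c, d] = deal(min(c, d), max(c, d));   %          ... and the two maxima
  [b, c] = deal(min(b, c), max(b, c));   % stage 3: sort the two middle elements
  s = [a; b; c; d];
end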

max/min-component

There is a (now considerably more elegant, thanks @xnor!) way to find the minimum and maximum of two numbers using fewer parameters:

$$\begin{align} \min(a,b) &= a - \mathrm{ReLU}(a-b) \\ \max(a,b) &= b + \mathrm{ReLU}(a-b) \end{align}$$

This means we need far fewer weights and biases.
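A quick numerical sanity check of this identity (just a sketch):

relu = @(t) max(t, 0);
a = 3; b = -7;
[a - relu(a - b), b + relu(a - b)]   % returns [-7, 3], i.e. [min(a,b), max(a,b)]

After the +100 shift the ReLU leaves \$a\$ and \$b\$ untouched, so each comparator only needs one extra coordinate for \$a-b\$, which is where the savings come from.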

Thanks @Joel for pointing out that it suffices to make all numbers non-negative in the first layer and to undo the shift only in the last one, which saves 8 weights. Thanks @xnor for pointing out an even shorter max/min method, which saves another 22 weights! Thanks @DustinG.Mixon for the tip of combining certain matrices, which saves a further 4 weights!

function z = net(u)
% Layer 1 (ReLU): shift the inputs by +100 and form the comparator terms
% ReLU(a-c) and ReLU(b-d) for the first stage.
a1 = [100;100;0;100;100;0];
A1 = [1 0 0 0;0 0 1 0;1 0 -1 0;0 1 0 0;0 0 0 1;0 1 0 -1];
% Layer 2 (identity): combine into the min/max of the pairs (a,c) and (b,d).
B1 = [1 0 -1 0 0 0;0 0 0 1 0 -1;0 1 1 0 0 0;0 0 0 0 1 1];
% Layer 3 (ReLU): comparator terms for the second stage, which compares the
% two minima and the two maxima.
A2 = [1 0 0 0;0 1 0 0;1 -1 0 0;0 0 1 0;0 0 0 1;0 0 1 -1];
% Layer 4 (ReLU): second-stage combination fused with the third-stage comparator term.
A3 = [1 0 -1 0 0 0;0 1 1 0 0 0;0 0 0 1 0 -1;0 1 1 -1 0 1;0 0 0 0 1 1];
% Layer 5 (identity): final combination, then undo the +100 shift.
B3 = [1 0 0 0 0;0 1 0 -1 0;0 0 1 1 0;0 0 0 0 1];
b3 = -[100;100;100;100];
relu = @(x) x .* (x > 0);
id = @(x) x;
v = relu(A1 * u + a1);
w = id(B1 * v);
x = relu(A2 * w);
y = relu(A3 * x);
z = id(B3 * y + b3);
% disp(nnz(a1)+nnz(A1)+nnz(B1)+nnz(A2)+nnz(A3)+nnz(B3)+nnz(b3)); % uncomment to print the score (nonzero weights & biases)
end

Try it online!
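For example, evaluating the network on a sample input returns its entries in ascending order (a spot check, assuming the function above has been loaded):

net([3; -7; 10; 0])   % returns [-7; 0; 3; 10]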

flawr

Posted 2019-09-27T11:42:21.090

Reputation: 40 560

The constant offsets are basically used to make the inputs non-negative. Once done in the first layer, all intermediate outputs of the comparison blocks are non-negative and it suffices to change it back only in the last layer. – Joel – 2019-09-27T23:47:37.273

Might you get a shorter min-max gadget with (a - relu(a-b), b + relu(a-b))? – xnor – 2019-09-28T00:00:33.267

@joel Oh now I see, that makes a lot of sense:) – flawr – 2019-09-28T00:00:52.713

@xnor Thanks a lot that makes a huge difference!!!! – flawr – 2019-09-28T00:16:03.713

Looks like you can shave off another 2 weights by multiplying A3*B2 instead of applying them in series. – Dustin G. Mixon – 2019-09-28T02:27:23.303

Inconsequential nitpick: The score for the first bias is nnz(A1*a0), not nnz(a0). (Or else we must pay the price of an identity matrix.) These numbers are the same in this case. – Dustin G. Mixon – 2019-09-28T12:12:52.103

@DustinG.Mixon Thanks for the hint, I was able to fix it by omitting the first identity altogether! – flawr – 2019-09-28T12:34:58.920