Sort with a neural network

The previous neural net golfing challenges (this and that) inspired me to pose a new challenge:

The challenge

Find the smallest feedforward neural network such that, given any 4-dimensional input vector \$(a,b,c,d)\$ with integer entries in \$[-10,10]\$, the network outputs \$\textrm{sort}(a,b,c,d)\$ with a coordinate-wise error strictly smaller than \$0.5\$.
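For concreteness, the success criterion can be checked by brute force over all \$21^4\$ integer inputs. The following Octave sketch assumes a hypothetical function net(u) standing for the candidate network, mapping a 4-dimensional column vector to a 4-dimensional column vector:

% Exhaustive check of the sorting criterion; net(u) is a hypothetical
% candidate network to be tested.
ok = true;
for a = -10:10
  for b = -10:10
    for c = -10:10
      for d = -10:10
        u = [a; b; c; d];
        if any(abs(net(u) - sort(u)) >= 0.5)   % coordinate-wise error must stay below 0.5
          ok = false;
        end
      end
    end
  end
end
disp(ok)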

Admissibility

For this challenge, a feedforward neural network is defined as a composition of layers. A layer is a function \$L\colon\mathbf{R}^n\to\mathbf{R}^m\$ that is specified by a matrix \$A\in\mathbf{R}^{m\times n}\$ of weights, a vector \$b\in\mathbf{R}^m\$ of biases, and an activation function \$f\colon\mathbf{R}\to\mathbf{R}\$ that is applied coordinate-wise:

$$ L(x) := f(Ax+b), \qquad x\in\mathbf{R}^n. $$
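In code, a layer is just an affine map followed by a coordinate-wise nonlinearity. A minimal Octave sketch (the helper name make_layer is purely illustrative):

% Build a layer L(x) = f(A*x + b) with f applied coordinate-wise.
make_layer = @(A, b, f) @(x) f(A * x + b);
% e.g. an identity layer on R^2: L = make_layer(eye(2), [0; 0], @(t) t);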

Since activation functions can be tuned for any given task, we need to restrict the class of activation functions to keep this challenge interesting. The following activation functions are permitted (a short code sketch of all five follows the list):

  • Identity. \$f(t)=t\$

  • ReLU. \$f(t)=\operatorname{max}(t,0)\$

  • Softplus. \$f(t)=\ln(e^t+1)\$

  • Hyperbolic tangent. \$f(t)=\tanh(t)\$

  • Sigmoid. \$f(t)=\frac{e^t}{e^t+1}\$
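For reference, all five permitted activations are one-liners in Octave (a sketch; the variable names are purely illustrative):

identity = @(t) t;
relu     = @(t) max(t, 0);
softplus = @(t) log(exp(t) + 1);        % log is the natural logarithm
tanh_act = @(t) tanh(t);                % built-in hyperbolic tangent
sigmoid  = @(t) exp(t) ./ (exp(t) + 1);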

Overall, an admissible neural net takes the form \$L_k\circ L_{k-1}\circ\cdots \circ L_2\circ L_1\$ for some \$k\$, where each layer \$L_i\$ is specified by weights \$A_i\$, biases \$b_i\$, and an activation function \$f_i\$ from the above list. For example, the following neural net is admissible (while it doesn't satisfy the performance goal of this challenge, it may be a useful gadget):

$$\left[\begin{array}{c}\min(a,b)\\\max(a,b)\end{array}\right]=\left[\begin{array}{rrrr}1&-1&-\frac{1}{2}&-\frac{1}{2}\\1&-1&\frac{1}{2}&\frac{1}{2}\end{array}\right]\mathrm{ReLU}\left[\begin{array}{rr}\frac{1}{2}&\frac{1}{2}\\-\frac{1}{2}&-\frac{1}{2}\\1&-1\\-1&1\end{array}\right]\left[\begin{array}{c}a\\b\end{array}\right]$$

This example exhibits two layers. Both layers have zero bias. The first layer uses ReLU activation, while the second uses identity activation.
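As a quick sanity check, the gadget can be transcribed and evaluated in a few lines of Octave (a sketch; the names W1 and W2 are purely illustrative):

relu = @(t) max(t, 0);
W1 = [1/2 1/2; -1/2 -1/2; 1 -1; -1 1];    % first layer: ReLU activation, zero bias
W2 = [1 -1 -1/2 -1/2; 1 -1 1/2 1/2];      % second layer: identity activation, zero bias
minmax = @(a, b) W2 * relu(W1 * [a; b]);
minmax(3, -7)                             % returns [-7; 3], i.e. [min; max]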

Scoring

Your score is the total number of nonzero weights and biases.

(E.g., the above example has a score of 16 since the bias vectors are zero.)
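In Octave terms, the score is simply the number of nonzero entries over all weight matrices and bias vectors, e.g. (a sketch, reusing W1 and W2 from the gadget above):

score = @(As, bs) sum(cellfun(@nnz, As)) + sum(cellfun(@nnz, bs));
score({W1, W2}, {zeros(4, 1), zeros(2, 1)})   % returns 16 for the example above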

Dustin G. Mixon

Posted 2019-09-27T11:42:21.090

Reputation: 1 833

@Close-voter: What exactly is unclear? I don't think either of the previous NN challenges was so well specified. – flawr – 2019-09-27T14:44:32.247

Are skip connections (NOT feedback connections) allowed? – Joel – 2019-09-27T15:51:27.187

No - skip connections are not allowed. – Dustin G. Mixon – 2019-09-27T16:13:02.627

@flawr It's probably just because of the high level of knowledge and math that is required. I don't have a specific idea of what the question is asking, so I am not qualified to know whether it is unclear or not. – mbomb007 – 2019-09-27T21:44:06.673

@DustinG.Mixon Out of curiosity: Did you construct the max/min network yourself or did you find it somewhere else? – flawr – 2019-09-27T22:13:22.563

@flawr - I constructed it myself. It may not be minimal. – Dustin G. Mixon – 2019-09-27T22:23:17.453

@DustinG.Mixon I actually just found an approach for the max/min that only uses 15 weights instead of 16, but it is considerably less elegant:) – flawr – 2019-09-27T23:27:24.523

This is a nicely specified challenge that I think could serve as a model for future neural-network challenges. – xnor – 2019-09-27T23:30:01.277

I personally find it difficult to optimize without skip connections. This is because a sorting NN is required to output numbers close enough to the inputs. So it seems necessary to 'remember' / 'reconstruct' the inputs across layers. I do not see how that could be done easily once $e^t$ is involved, since no inverses of those functions are allowed as activations. So we are only left with the ReLUs, for which the baseline (with minor improvements as shown in flawr's answer) is already near optimal. – Joel – 2019-09-28T00:16:07.493

@Joel I agree, more flexibility in the architecture would have made this a more interesting challenge. I guess the drawback with allowing e.g. skip connections is that the specs would be a lot more complicated. – flawr – 2019-09-28T12:43:57.260

Another alternative is to allow outputting the indices of sorted inputs (i.e. argsort) rather than the actual sorted inputs. That would also avoid the need of reconstructing the inputs entirely. – Joel – 2019-09-28T13:04:57.160

Answers


Octave, 50 weights & biases (down from 96, 88, 87, 84, 76, 54)

This 5-layer neural net is essentially a three-stage sorting network built from a very simple min/max network as a component. It is basically the example network from Wikipedia, with a small modification: the first two comparisons are done in parallel (the comparator structure is sketched in plain code below). To pass negative numbers safely through the ReLUs, we simply add 100 first and subtract 100 again at the end.

So this should just be considered a baseline, as it is a naive implementation. It does, however, exactly sort all inputs whose entries are not too large in magnitude. (The range can be adjusted by replacing 100 with another constant.)

Try it online!
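For reference, the underlying comparator network, independent of any neural-net encoding, sorts four numbers with five min/max comparisons arranged in three stages. Here is a plain Octave sketch (the function name sort4 is purely illustrative); this is the structure the layers below implement:

% Three-stage sorting network on four inputs (five comparators).
function s = sort4(a, b, c, d)
  [a, c] = deal(min(a, c), max(a, c));   % stage 1: compare (a,c) ...
  [b, d] = deal(min(b, d), max(b, d));   %          ... and (b,d) in parallel
  [a, b] = deal(min(a, b), max(a, b));   % stage 2: compare the two minima ...
  [c, d] = deal(min(c, d), max(c, d));   %          ... and the two maxima
  [b, c] = deal(min(b, c), max(b, c));   % stage 3: sort the two middle elements
  s = [a; b; c; d];
end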

max/min-component

There is a (now considerably more elegant, thanks @xnor!) way to find the minimum and maximum of two numbers using fewer parameters:

$$\begin{align} \min(a,b) &= a - \mathrm{ReLU}(a-b) \\ \max(a,b) &= b + \mathrm{ReLU}(a-b) \end{align}$$

This means we need far fewer weights and biases.
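A quick numerical sanity check of this identity (just a sketch):

relu = @(t) max(t, 0);
a = 3; b = -7;
[a - relu(a - b), b + relu(a - b)]   % returns [-7, 3], i.e. [min(a,b), max(a,b)]

After the +100 shift the ReLU leaves \$a\$ and \$b\$ untouched, so each comparator only needs one extra coordinate for \$a-b\$, which is where the savings come from.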

Thanks @Joel for pointing out that it suffices to make all numbers non-negative in the first layer and to undo the shift only in the last one, which saves 8 weights. Thanks @xnor for pointing out an even shorter max/min method, which saves another 22 weights! Thanks @DustinG.Mixon for the tip of combining certain matrices, which saves a further 4 weights!

function z = net(u)
% Layer 1 (ReLU): shift the inputs by +100 and form the comparator terms
% ReLU(a-c) and ReLU(b-d) for the first stage.
a1 = [100;100;0;100;100;0];
A1 = [1 0 0 0;0 0 1 0;1 0 -1 0;0 1 0 0;0 0 0 1;0 1 0 -1];
% Layer 2 (identity): combine into the min/max of the pairs (a,c) and (b,d).
B1 = [1 0 -1 0 0 0;0 0 0 1 0 -1;0 1 1 0 0 0;0 0 0 0 1 1];
% Layer 3 (ReLU): comparator terms for the second stage, which compares the
% two minima and the two maxima.
A2 = [1 0 0 0;0 1 0 0;1 -1 0 0;0 0 1 0;0 0 0 1;0 0 1 -1];
% Layer 4 (ReLU): second-stage combination fused with the third-stage comparator term.
A3 = [1 0 -1 0 0 0;0 1 1 0 0 0;0 0 0 1 0 -1;0 1 1 -1 0 1;0 0 0 0 1 1];
% Layer 5 (identity): final combination, then undo the +100 shift.
B3 = [1 0 0 0 0;0 1 0 -1 0;0 0 1 1 0;0 0 0 0 1];
b3 = -[100;100;100;100];
relu = @(x) x .* (x > 0);
id = @(x) x;
v = relu(A1 * u + a1);
w = id(B1 * v);
x = relu(A2 * w);
y = relu(A3 * x);
z = id(B3 * y + b3);
% disp(nnz(a1)+nnz(A1)+nnz(B1)+nnz(A2)+nnz(A3)+nnz(B3)+nnz(b3)); % uncomment to print the score (nonzero weights & biases)
end

Try it online!
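For example, evaluating the network on a sample input returns its entries in ascending order (a spot check, assuming the function above has been loaded):

net([3; -7; 10; 0])   % returns [-7; 0; 3; 10]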

flawr

Posted 2019-09-27T11:42:21.090

Reputation: 40 560

The constant offsets are basically used to make the inputs non-negative. Once done in the first layer, all intermediate outputs of the comparison blocks are non-negative and it suffices to change it back only in the last layer. – Joel – 2019-09-27T23:47:37.273

Might you get a shorter min-max gadget with (a - relu(a-b), b + relu(a-b))? – xnor – 2019-09-28T00:00:33.267

@joel Oh now I see, that makes a lot of sense:) – flawr – 2019-09-28T00:00:52.713

@xnor Thanks a lot that makes a huge difference!!!! – flawr – 2019-09-28T00:16:03.713

Looks like you can shave off another 2 weights by multiplying A3*B2 instead of applying them in series. – Dustin G. Mixon – 2019-09-28T02:27:23.303

Inconsequential nitpick: The score for the first bias is nnz(A1*a0), not nnz(a0). (Or else we must pay the price of an identity matrix.) These numbers are the same in this case. – Dustin G. Mixon – 2019-09-28T12:12:52.103

@DustinG.Mixon Thanks for the hint, I was able to fix it by omitting the first identity altogether! – flawr – 2019-09-28T12:34:58.920