Machine Learning Golf: Multiplication

69

15

I'd like to propose a different kind of golfing challenge to this community:

(Artificial) Neural Networks are very popular machine learning models that can be designed and trained to approximate any given (usually unknown) function. They're often used to solve highly complex problems that we don't know how to solve algorithmically like speech recognition, certain kinds of image classifications, various tasks in autonomous driving systems, ... For a primer on neural networks, consider this excellent Wikipedia article.

As this is the first in what I hope to be a series of machine learning golf challenges, I'd like to keep things as simple as possible:

In the language and framework of your choice, design and train a neural network that, given \$(x_1, x_2)\$ calculates their product \$x_1 \cdot x_2\$ for all integers \$x_1, x_2\$ between (and including) \$-10\$ and \$10\$.

Performance Goal

To qualify, your model may not deviate by more than \$0.5\$ from the correct result on any of those entries.

Rules

Your model

  • must be a 'traditional' neural network (a node's value is calculated as a weighted linear combination of some of the nodes in a previous layer followed by an activation function),
  • may only use the following standard activation functions:
    1. \$\textrm{linear}(x) = x\$,
    2. \$\textrm{softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}}\$,
    3. \$\textrm{selu}_{\alpha, \beta}(x) = \begin{cases} \beta \cdot x & \text{, if } x > 0 \\ \alpha \cdot \beta (e^x -1 ) & \text{, otherwise} \end{cases}\$,
    4. \$\textrm{softplus}(x) = \ln(e^x+1)\$,
    5. \$\textrm{leaky-relu}_\alpha(x) = \begin{cases} x & \text{, if } x < 0 \\ \alpha \cdot x & \text{, otherwise} \end{cases}\$,
    6. \$\tanh(x)\$,
    7. \$\textrm{sigmoid}(x) = \frac{e^x}{e^x+1}\$,
    8. \$\textrm{hard-sigmoid}(x) = \begin{cases} 0 & \text{, if } x < -2.5 \\ 1 & \text{, if } x > 2.5 \\ 0.2 \cdot x + 0.5 & \text{, otherwise} \end{cases}\$,
    9. \$e^x\$
  • must take \$(x_1, x_2)\$ either as a tupel/vector/list/... of integers or floats as its only input,
  • return the answer as an integer, float (or a suitable container, e.g. a vector or list, that contains this answer).

Your answer must include (or link to) all code necessary to check your results -- including the trained weights of your model.

Scoring

The neural network with the smallest number of weights (including bias weights) wins.

Enjoy!

Stefan Mesken

Posted 2019-07-02T11:38:39.713

Reputation: 779

1My personal best, which I'll submit later, uses 43 weights and I'm sure there's plenty of room for improvements. – Stefan Mesken – 2019-07-02T11:51:59.410

9Welcome to the site! I think this challenge could benefit a good deal from a more robust definition of a neural network. There are a couple of things here 1) It would be very nice for you to state it in language that does not already imply knowledge of NNs 2) You really should list the activation functions in your post rather than link out to an external source (outside links can change or disappear). – Post Rock Garf Hunter – 2019-07-02T12:29:46.570

2@SriotchilismO'Zaic I will add the activation functions but I don't think I want to include a definition of a neural network. I'm confident that my post contains enough information for people who are familiar with neural networks and for people who don't, this isn't the right place to learn about them. – Stefan Mesken – 2019-07-02T12:33:14.793

2Can we have constant offsets? e.g., for a node with a single input x, compute its value as ax+b? And if yes, does that count as 1 or 2 weights? – Grimmy – 2019-07-02T14:30:46.270

1@Grimy I'd be fine with that. However, they must be counted as two weights as otherwise this would allow for some shenanigans (e.g. in my code, it would reduce the weights to 33). – Stefan Mesken – 2019-07-02T14:40:03.957

4Can we reuse weights/use convolutional layers? (I recommend removing the bonus, as it doesn't add anything to the challenge and just distracts from the main goal.) Are the weights supposed to be real or can they be complex? – flawr – 2019-07-02T14:41:42.513

4Your wording implies nodes from layer 3 can't use inputs from layer 1. Does it cost a weight to have a layer 2 node simply doing f(x) = x to forward its input? – Grimmy – 2019-07-02T14:46:26.923

1@flawr CNNs are allowed. Basically any forward propagation that is a chain of 'weighted sum of previous nodes followed by activation function' is permissible. In particular, you may also include RNN layers. The rules are meant to encourage experimentation while protecting this challenge from becoming trivialized (e.g. using a neural network that solves this problem as activation function or data preprocessing that simplifies the task). – Stefan Mesken – 2019-07-02T14:47:16.887

3@Grimy You're correct. My wording doesn't correctly reflect my intentions. I'll edit that. Skipping layers is allowed and if you do so with one of the listed activation functions, it doesn't cost you any weights. – Stefan Mesken – 2019-07-02T14:48:40.040

How many weights does a mapping like $x \mapsto (x,x,x)$ count as? – flawr – 2019-07-02T15:02:46.410

1@flawr That's zero weights. You may reroute/duplicate your states as you wish. So, you may also have a map like $(x,y) \mapsto (x,y,y,x,y)$ for free. (I sure hope this doesn't break the game in some way I don't immediately foresee ;-)) – Stefan Mesken – 2019-07-02T15:08:31.293

1Should we counts weights that are equal to 1? For example, would a node doing output = tanh(input) (or some other standard activation function) count as 0 or 1 weight? – Grimmy – 2019-07-02T16:10:24.920

@mbomb007 Sure, I'll add that. – Stefan Mesken – 2019-07-03T02:37:44.990

Like Grimy, I'm also unclear whether a weight that's equal to 1 counts as a weight for scoring. How much does it cost to do z = x + y to get the sum of the two inputs? – xnor – 2019-07-03T03:14:35.430

@xnor That should count as 2 weights as this is equivalent to a dense layer with 2 inputs, one output without bias. Admittedly, this is awkward and certainly shows there a problems with the current set of rules. However, two points: 1. This is my first attempt at coming up with a machine learning golf challenge, so these kinds of problems are to be expected. 2. Because I've deliberately chosen a problem that has a trivial algorithmic solution, these quirks of the rules, unfortunately, are more problematic than they would be in an actual machine learning problem. Sorry about that! – Stefan Mesken – 2019-07-03T03:33:15.020

2If people are interested in machine learning golf (which I really hope they are), I'm confident we'll flash out these issues in no time and end up with a semi-standard set of rules similar to those for more traditional golfing challenges. – Stefan Mesken – 2019-07-03T03:37:40.550

If we are reusing the same weights, do we have to count them each time or can we just count them once? – flawr – 2019-07-03T06:24:03.753

1@flawr I've thought about that and I think reusing them is fine. It will have a significant impact on this particular challenge but it overall better captures what I intend to measure (model complexity) and it will allow for more creative solutions in this and future challenges. So, yes: Feel free to reuse your weights to improve your score! – Stefan Mesken – 2019-07-03T06:33:04.513

Is there any particular reason you don't have y=ln(x) in your list of activation functions? – FirefoxMetzger – 2019-07-03T13:13:57.383

@FirefoxMetzger Yes, there is. And I'm glad you asked: $a \cdot b = e^{\ln(a \cdot b)} = e^{\ln(a) + \ln(b)}$ – Stefan Mesken – 2019-07-03T14:15:00.067

@StefanMesken I am aware that this would become a solution (in this case 1 neuron would be the anwer). I'm curious about the logic behind making exp(.) part of the 'standard activation functions', but not it's inverse. – FirefoxMetzger – 2019-07-03T14:18:52.787

@FirefoxMetzger $\exp$ is a total function which makes it more suitable as an activation function. Since I wanted to pick one and only one, $\exp$ seemed like the better choice. – Stefan Mesken – 2019-07-03T14:19:56.367

2

IMO this fails to reach the required level of clarity on at least two counts: firstly, it's very vague what the criteria are for an answer to be valid (contrast https://codegolf.stackexchange.com/q/183036/194 , which took a lot of work to polish up to the required standard); and secondly, it's too vague about the scoring (see the same earlier question as an example of how to be clear enough).

– Peter Taylor – 2019-07-03T16:08:08.890

1@PeterTaylor I'm not sure that I fully agree with that criticism but I'm certainly willing to improve this post. In fact, I think these first attempts should be viewed as a trial run that guide the development of a better setup. (I don't imagine that the first challenges posted on code golf se were up to the current standards either and I'm making them up for ML golf as we speak. However, doing so without participants that bring up issues seems doomed which is why I plead for keeping this post open.) – Stefan Mesken – 2019-07-03T16:11:25.040

4There should be a link in the right column to the Sandbox, which was created expressly to fix these kind of issues before the question is even posted to the main site. And the network philosophy is that it's better to close a question, fix it, and reopen it than to get a bunch of answers which either will make no sense after the question is fixed or will tightly constrain the changes which can be made to the question. – Peter Taylor – 2019-07-03T16:22:30.780

1@PeterTaylor 'These kinds of issues' are detected by participation -- not by looking at some preliminary version in a vacuum. Before I ever considered posting this challenge, I've spent a week worth of afternoons doing trial runs and coming up with ways to cheat the system and how to defend against that. We are walking a thin line here: Creating rules restrictive enough to make the challenge meaningful while allowing as much freedom as possible to make it accessible to a wide audience. Strongly disagree with that last comment. – Stefan Mesken – 2019-07-03T16:25:34.190

7Not at all. These kinds of issues are detected by many years' experience of seeing other people make the same kind of mistakes. Some ambiguities slip past the sandbox, but many more are caught there. And this would definitely have been caught, because as indicated in my first comment we had exactly the same problems with a neural net question two months ago. – Peter Taylor – 2019-07-03T16:29:35.080

1Typo in the hard sigmoid function: should be 1 if x > 2.5, not -2.5. – Ray – 2019-07-04T02:36:56.687

Am I allowed to use gating (like in LSTM) and self-gating? – CrabMan – 2019-07-04T21:16:12.973

1@CrabMan Only if you can realize them as a combination of weighted sums followed by one of the allowed activation functions. (Doing is naively would require a lot of weights... But maybe you can do it efficiently.) – Stefan Mesken – 2019-07-05T04:33:40.437

From review queue, this question is not unclear. Just because the math and concepts are complicated and you don't understand them (I don't either) doesn't mean the question is not specific or clear. – mbomb007 – 2019-07-09T13:38:20.720

@StefanMesken: Thank you for asking this question and resisting defining "neural network", as well as disagreeing with other parts of the "network philosophy". I asked the first neural network question a couple months before you: https://codegolf.stackexchange.com/q/183036/ . It received a lot of pushback from the community and was in fact closed. It was mentioned above that it took "a lot of work to polish up to the required standard", but in fact I believe that work I did was worthless. Good luck with further questions.

– A. Rex – 2019-09-24T19:17:07.060

1If linear activations are legal, then you can reduce every solution (up to machine precision) to another solution that has only two distinct weights. Indeed, every neural net whose weights are integer multiples of 10^(-2k) can be simulated with a much larger neural net whose weights are plus or minus 10^(-k). To see this, observe that multiplication by any such weight w can be implemented as w.x = sgn(w).v^T.v.x, where v is the column vector consisting of 10^(2k).abs(w) entries, each entry equaling 10^(-k). – Dustin G. Mixon – 2019-09-24T23:35:13.443

Answers

37

21 13 11 9 weights

This is based on the polarization identity of bilinear forms which in the one dimensional real case reduces to the polynomial identity:

$$ x\cdot y = \frac{(x+y)^2 - (x-y)^2}{4}$$

So y1 just computes [x+y, x-y] using a linear transformation, and y3 is just the absolute value of y1 as a preprocessing step for the next one: Then the "hard" part is computing the squares which I will explain below, and after that just computing a difference and scaling which is again a linear operation.

To compute the squares I use an exponential series \$s\$ which should be accurate for all integers \$\{0,1,2,\ldots,20\}\$ within around \$0.5\$. This series is of the form

$$ \text{approx_square}(x) = \sum_{i=0}^2 w_i \exp(0.0001 \cdot i \cdot x)$$

where I just optimized for the weights W2 (\$=(w_i)_i\$). This whole approximation comprises again just two linear transformations with an exponential activation sandwiched in between. This approach results in a maximal deviation of about 0.02.

function p = net(x)
% 9 weights
one = 1; 
mone =-1;
zero = 0;
fourth = 0.25;
W1 = [1e-4, 2e-4];
W2  = [-199400468.100687;99700353.6313757];
b2 = 99700114.4299316;
leaky_relu = @(a,x)max(a*x,x); 


% Linear
y0 = [one, one; one, mone] * x;

% Linear + ReLU
y1 = mone * y0;
y2 = [leaky_relu(zero, y0), leaky_relu(zero, y1)];

% Linear
y3 = y2 * [one; one];

% Linear + exp
y4 = exp(y3 * W1); 

% Linear + Bias
y5 =  y4 * W2 + b2;

% Linear
y6 = [one, mone]*y5;
p = y6 * fourth;

end

Try it online!

flawr

Posted 2019-07-02T11:38:39.713

Reputation: 40 560

I think your checking code in the footer of the TIO link misses an application of abs. But everything is fine anyway. – Christian Sievers – 2019-07-02T22:27:11.057

@ChristianSievers Thanks, I updated the TIO link! – flawr – 2019-07-03T05:32:27.987

I'm not an expert on NN, out of curiosity, how is the weight count done? y0 needs 4, y1 needs 2, y3 needs 2, y4 needs 1, y5 needs 1 and y6 needs 2. That's 12? – Margaret Bloom – 2019-07-03T19:27:32.393

3@MargaretBloom Yes this is indeed a little bit unusual, but the OP said in the comments that we can reuse weights and only ever have to count them once, even if we use the same weight multiple times. So all the weights I'm using are defined in the first part of the function. – flawr – 2019-07-03T19:50:14.757

31

7 weights

eps = 1e-6
c = 1 / (2 * eps * eps)

def f(A, B):
	e_s = exp(eps * A + eps * B)  # 2 weights, exp activation
	e_d = exp(eps * A - eps * B)  # 2 weights, exp activation
	return c * e_s + (-c) * e_d + (-1 / eps) * B  # 3 weights, linear activation

Try it online!

Uses the following approximate equality for small \$\epsilon\$ based on the Taylor expansion \$ e^x \approx 1 + x + \frac{x^2}{2}\$:

$$ AB \approx \frac{e^{\epsilon A+\epsilon B} - e^{\epsilon A-\epsilon B}}{2 \epsilon^2} - \frac{B}{\epsilon} $$

Picking \$\epsilon\$ small enough gets us within the required error bounds. Note that eps and c are constant weights in the code.

xnor

Posted 2019-07-02T11:38:39.713

Reputation: 115 687

1Not sure that this counts as a 'traditional neural network' (rule #1) but it's obvious that it can be reformatted into one, so I see no issue with that. Nice solution! – Stefan Mesken – 2019-07-03T03:55:15.867

1You could define C = -B (1 weight) and then have [e_s, e_d] = conv([A,B,C], [eps, eps]) (2 weights) to save one weight:) (BTW: Very clever approach!) – flawr – 2019-07-03T05:39:07.880

(I forgot to add the exp) – flawr – 2019-07-03T06:21:22.360

4

You can even get a lot lower by reusing the weights - you don't have to count the same weight multiple times.

– flawr – 2019-07-03T06:54:39.413

2@flawr That's a nice trick, but I think the allowances for convolution and reusing weights in the comments make this so much a different challenge that I'm going to keep this answer as is. – xnor – 2019-07-04T02:10:56.713

22

33 31 weights

# Activation functions
sub hard { $_[0] < -2.5 ? 0 : $_[0] > 2.5 ? 1 : 0.2 * $_[0] + 0.5 }
sub linear { $_[0] }

# Layer 0
sub inputA() { $a }
sub inputB() { $b }

# Layer 1
sub a15() { hard(5*inputA) }

# Layer 2
sub a8()  { hard(-5*inputA + 75*a15 - 37.5) }

# Layer 3
sub aa()  { linear(-5*inputA + 75*a15 - 40*a8) }

# Layer 4
sub a4()  { hard(aa - 17.5) }

# Layer 5
sub a2()  { hard(aa - 20*a4 - 7.5) }

# Layer 6
sub a1()  { linear(0.2*aa - 4*a4 - 2*a2) }

# Layer 7
sub b15() { hard(0.25*inputB - 5*a15) }
sub b8()  { hard(0.25*inputB - 5*a8) }
sub b4()  { hard(0.25*inputB - 5*a4) }
sub b2()  { hard(0.25*inputB - 5*a2) }
sub b1()  { hard(0.25*inputB - 5*a1) }

# Layer 8
sub output() { linear(-300*b15 + 160*b8 + 80*b4 + 40*b2 + 20*b1 - 10*inputA) }

# Test
for $a (-10..10) {
        for $b (-10..10) {
                die if abs($a * $b - output) >= 0.5;
        }
}

print "All OK";

Try it online!

This does long multiplication in (sorta) binary, and thus returns the exact result. It should be possible to take advantage of the 0.5 error window to golf this some more, but I'm not sure how.

Layers 1 to 6 decompose the first input in 5 "bits". For golfing reasons, we don't use actual binary. The most significant "bit" has weight -15 instead of 16, and when the input is 0, all the "bits" are 0.5 (which still works out fine, since it preserves the identity inputA = -15*a15 + 8*a8 + 4*a4 + 2*a2 + 1*a1).

Grimmy

Posted 2019-07-02T11:38:39.713

Reputation: 12 521

1I did expect someone to come up with a hard-coded, ANN-ified multiplication algorithm. But I didn't think it would be the first response. Well done! (I'm also eager to see whether you'll be able to pull something like this off with the MNIST dataset or some other, more relastic ML problem :D.) – Stefan Mesken – 2019-07-02T15:55:41.133

14

43 weights

The two solutions posted so far have been very clever but their approaches likely won't work for more traditional tasks in machine learning (like OCR). Hence I'd like to submit a 'generic' (no clever tricks) solution to this task that hopefully inspires other people to improve on it and get sucked into the world of machine learning:

My model is a very simple neural network with 2 hidden layers built in TensorFlow 2.0 (but any other framework would work as well):

model = tf.keras.models.Sequential([
tf.keras.layers.Dense(6, activation='tanh', input_shape=(2,)),
tf.keras.layers.Dense(3, activation='tanh'),
tf.keras.layers.Dense(1, activation='linear')
])

As you can see, all layers are dense (which most certainly isn't optimal), the activation function is tanh (which might actually be okay for this task), except for the output layer that, due to the nature of this task, has a linear activation function.

There are 43 weights:

  • \$(2+1) \cdot 6 = 18\$ between the input and first hidden layer,
  • \$(6+1) \cdot 3 = 21\$ between the hidden layers and
  • \$(3+1) \cdot 1 = 4\$ connecting the last hidden and the output layer.

The weights have been trained (with an adam optimizer) by a layered fitting approach: First they've been fitted to minimize the mean squarred error not only on integer multiplication between \$-10\$ and \$10\$ but actually on inputs in a certain neighborhood around these values. This results in much better convergence due to the nature of gradient descent. And it accounted for 400 epochs worth of training on 57,600 training samples each, using a batch size of 32.

Next, I've fine-tuned them -- optimizing for the maximum deviation on any of the integer multiplication tasks. Unfortunately, my notes don't show much fine tuning I ended up doing, but it was very minor. In the neighborhood of 100 epochs on those 441 training samples, with a batch size of 441.

These are the weights I ended up with:

[<tf.Variable 'dense/kernel:0' shape=(2, 6) dtype=float32, numpy=
 array([[ 0.10697944,  0.05394982,  0.05479664, -0.04538541,  0.05369904,
         -0.0728976 ],
        [ 0.10571832,  0.05576797, -0.04670485, -0.04466859, -0.05855528,
         -0.07390639]], dtype=float32)>,
 <tf.Variable 'dense/bias:0' shape=(6,) dtype=float32, numpy=
 array([-3.4242163, -0.8875816, -1.7694025, -1.9409281,  1.7825342,
         1.1364107], dtype=float32)>,
 <tf.Variable 'dense_1/kernel:0' shape=(6, 3) dtype=float32, numpy=
 array([[-3.0665843 ,  0.64912266,  3.7107112 ],
        [ 0.4914808 ,  2.1569328 ,  0.65417236],
        [ 3.461693  ,  1.2072319 , -4.181983  ],
        [-2.8746269 , -4.9959164 ,  4.505049  ],
        [-2.920127  , -0.0665407 ,  4.1409926 ],
        [ 1.3777553 , -3.3750365 , -0.10507642]], dtype=float32)>,
 <tf.Variable 'dense_1/bias:0' shape=(3,) dtype=float32, numpy=array([-1.376577  ,  2.8885336 ,  0.19852689], dtype=float32)>,
 <tf.Variable 'dense_2/kernel:0' shape=(3, 1) dtype=float32, numpy=
 array([[-78.7569  ],
        [-23.602606],
        [ 84.29587 ]], dtype=float32)>,
 <tf.Variable 'dense_2/bias:0' shape=(1,) dtype=float32, numpy=array([8.521169], dtype=float32)>]

which barely met the stated performance goal. The maximal deviation ended up being \$0.44350433\$ as witnessd by \$9 \cdot 10 = 90.443504\$.

My model can be found here and you can also Try it online! in a Google Colab environment.

Stefan Mesken

Posted 2019-07-02T11:38:39.713

Reputation: 779

8

2 weights

I was inspired by the other answers to approximate the polarization identity in a different way. For every small \$\epsilon>0\$, it holds that

$$ xy \approx \frac{e^{\epsilon x+\epsilon y}+e^{-\epsilon x-\epsilon y}-e^{\epsilon x-\epsilon y}-e^{-\epsilon x+\epsilon y}}{4\epsilon^2}.$$

It suffices to take \$\epsilon=0.01\$ for this challenge.

The obvious neural net implementation of this approximation takes weights in \$\{\pm\epsilon,\pm(4\epsilon^2)^{-1}\}\$. These four weights can be golfed down to three \$\{\pm\epsilon,(4\epsilon^3)^{-1}\}\$ by factoring \$\pm(4\epsilon^2)^{-1}=\pm\epsilon\cdot(4\epsilon^3)^{-1}\$. As I mentioned in a comment above, every neural net with weights in machine precision can be golfed to a (huge!) neural net with only two distinct weights. I applied this procedure to write the following MATLAB code:

function z=approxmultgolfed(x,y)

w1 = 0.1;   % first weight
w2 = -w1;   % second weight

k  = 250000;
v1 = w1*ones(k,1);
v2 = w2*ones(k,1);

L1 = w1*eye(2);
L2 = [ w1 w1; w2 w2; w1 w2; w2 w1 ];
L3 = [ v1 v1 v2 v2 ];
L4 = v1';

z = L4 * L3 * exp( L2 * L1 * [ x; y ] );

All told, this neural net consists of 1,250,010 weights, all of which reside in \$\{\pm0.1\}\$.

How to get away with just 1 weight (!)

It turns out you can simulate any neural net that has weights in \$\{\pm0.1\}\$ with a larger neural net that has only one weight, namely, \$-0.1\$. Indeed, multiplication by \$0.1\$ can be implemented as

$$ 0.1x = w^\top wx, $$

where \$w\$ is the column vector of \$10\$ entries, all equal to \$-0.1\$. For neural nets in which half of the weights are positive, this transformation produces a neural net that is \$10.5\$ times larger.

The obvious generalization of this procedure will transform any neural net with weights in \$\{\pm 10^{-k}\}\$ into a larger neural net with the single weight \$-10^{-k}\$. Combined with the procedure in my comment above, it therefore holds that every neural net with machine-precision weights can be transformed into a single-weight neural net.

(Perhaps we should modify how re-used weights are scored in future neural net golfing challenges.)

Dustin G. Mixon

Posted 2019-07-02T11:38:39.713

Reputation: 1 833