Can a neural network recognize primes?

Background

Recognizing primality seems like a poor fit for (artificial) neural networks. However, the universal approximation theorem states that neural networks can approximate any continuous function, so in particular it should be possible to represent any finitely-supported function one desires. So let's try to recognize all the primes among the first million numbers.

More precisely, because this is a programming website, let's go up to 2^20 = 1,048,576. The number of primes below this threshold is 82,025, or roughly 8%.

Challenge

How small of a neural network can you find that correctly classifies all 20-bit integers as prime or not prime?

For the purposes of this challenge, the size of a neural network is the total number of weights and biases required to represent it.

Details

The goal is to minimize the size of a single, explicit neural network.

The input to your network will be a vector of length 20 containing the individual bits of an integer, represented either with 0s and 1s or alternatively with -1s and +1s. The ordering of these can be most-significant-bit first or least-significant-bit first.

The output of your network should be a single number, such that above some cutoff the input is recognized as prime and below the same cutoff the input is recognized as not prime. For example, positive might mean prime (and negative not prime), or alternatively greater than 0.5 might mean prime (and less than 0.5 not prime).
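
For concreteness, a minimal sketch of one valid convention (0/1 bits, least-significant-bit first, and a cutoff of 0.5) follows; the helper names are purely illustrative, and any of the allowed conventions works equally well.

# Illustrative sketch of one valid I/O convention (not required by the challenge):
# 0/1 bits, least-significant-bit first, cutoff 0.5.
def encode(n, bits=20):
    return [(n >> i) & 1 for i in range(bits)]

def classify(output, cutoff=0.5):
    return "prime" if output > cutoff else "not prime"

print(encode(11))  # [1, 1, 0, 1, 0, 0, ..., 0] -- the 20 bits of 11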

The network must be 100% accurate on all 2^20 = 1,048,576 possible inputs. As mentioned above, note that there are 82,025 primes in this range. (It follows that always outputting "not prime" would be 92% accurate.)

In terms of standard neural network terminology, this would likely be called overfitting. In other words, your goal is to perfectly overfit the primes. Another way to put it is that the "training set" and the "test set" are the same.

This challenge does not consider the number of "trainable" or "learnable" parameters. Indeed, your network is likely to contain hard-coded weights, and the example below is entirely hard-coded. Instead, all weights and biases are considered parameters and are counted.

The length of the code necessary to train or generate your neural network is not relevant to your score, but posting the relevant code is certainly appreciated.

Baseline

As a baseline, it is possible to "memorize" all 82,025 primes with 1,804,551 total weights and biases.

Note that the code that follows includes many things: a working example, working test code, a working definition of a neural network using a known neural network library, a "hard-coded" (or at least, not "trained") neural network, and a working measurement of the score.

import numpy as np

from keras.models import Sequential
from keras.layers import Dense

from sympy import isprime

bits = 20

# Hardcode some weights
weights = []
biases  = []
for n in range(1 << bits):
    if not isprime(n):
        continue
    bit_list = [(n >> i) & 1 for i in range(bits)]
    weight = [2*bit - 1 for bit in bit_list]
    bias   = -(sum(bit_list) - 1)
    weights.append(weight)
    biases.append(bias)
nprimes = len(biases)
weights1 = np.transpose(np.array(weights))
biases1  = np.array(biases)
weights2 = np.full((nprimes, 1), 1)
biases2  = np.array([0])

model = Sequential()
model.add(Dense(units=nprimes, activation='relu', input_dim=bits, weights=[weights1, biases1]))
model.add(Dense(units=1, activation='relu', weights=[weights2, biases2]))
print("Total weights and biases: {}".format(np.size(weights1) + np.size(weights2) + np.size(biases1) + np.size(biases2)))

# Evaluate performance
x = []
y = []
for n in range(1 << bits):
    x.append([(n >> i) & 1 for i in range(bits)])
    y.append(1 if isprime(n) else 0)
x = np.array(x)
y = np.array(y)

model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

loss, accuracy = model.evaluate(x, y, batch_size=256)
if accuracy == 1.0:
    print("Perfect fit.")
else:
    print("Made at least one mistake.")

What is a neural network?

For the purposes of this challenge, we can write down a narrow but precise definition of an (artificial) neural network. For some external reading, I suggest Wikipedia on artificial neural network, feedforward neural network, multilayer perceptron, and activation function.

A feedforward neural network is a collection of layers of neurons. The number of neurons per layer varies, with 20 neurons in the input layer, some number of neurons in one or more hidden layers, and 1 neuron in the output layer. (There must be at least one hidden layer because primes and not-primes are not linearly separable according to their bit patterns.) In the baseline example above, the sizes of the layers are [20, 82025, 1].

The values of the input neurons are determined by the input. As described above, this will either be 0s and 1s corresponding to the bits of a number between 0 and 2^20, or -1s and +1s similarly.

The values of the neurons of each following layer, including the output layer, are determined from the layer beforehand. First a linear function is applied, in a fully-connected or dense fashion. One method of representing such a function is using a weights matrix. For example, the transitions between the first two layers of the baseline can be represented with an 82,025 x 20 matrix. The number of weights is the number of entries in this matrix, e.g. 1,640,500. Then each entry has a (separate) bias term added. This can be represented by a vector, e.g. an 82,025 x 1 matrix in our case. The number of biases is the number of entries, e.g. 82,025. (Note that the weights and biases together describe an affine linear function.)

A weight or bias is counted even if it is zero. For the purposes of this narrow definition, biases count as weights even if they are all zero. Note that in the baseline example, only two distinct weights (+1 and -1) are used (and only slightly more distinct biases); nonetheless, the size is more than a million, because the repetition does not help with the score in any way.

Finally, a nonlinear function called the activation function is applied entry-wise to the result of this affine linear function. For the purposes of this narrow definition, the allowed activation functions are ReLU, tanh, and sigmoid. The entire layer must use the same activation function.
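
For reference, a short NumPy sketch of these standard activation functions (definitions only, not part of the baseline code):

import numpy as np

# Standard activation functions, applied entry-wise to the affine output.
def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# tanh is available directly as np.tanh.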

In the baseline example, the number of weights is 20 * 82,025 + 82,025 * 1 = 1,722,525 and the number of biases is 82,025 + 1 = 82,026, for a total score of 1,722,525 + 82,026 = 1,804,551. As a symbolic example, if there were one more layer and the layer sizes were instead [20, a, b, 1], then the number of weights would be 20 * a + a * b + b * 1 and the number of biases would be a + b + 1.
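
Since this count depends only on the list of layer sizes, a small helper (a sketch, not part of the baseline code) can compute the score of any such architecture:

# Sketch: total weights and biases for a fully-connected network with the given layer sizes.
def score(layer_sizes):
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases  = sum(layer_sizes[1:])
    return weights + biases

print(score([20, 82025, 1]))  # 1804551, matching the baseline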

This definition of a neural network is well-supported by many frameworks, including Keras, scikit-learn, and TensorFlow. Keras is used in the baseline example above, with code essentially as follows:

import numpy
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=82025, activation='relu', input_dim=20, weights=[weights1, biases1]))
model.add(Dense(units=1, activation='relu', weights=[weights2, biases2]))
score = numpy.size(weights1) + numpy.size(biases1) + numpy.size(weights2) + numpy.size(biases2)

If the weights and bias matrices are numpy arrays, then numpy.size will directly tell you the number of entries.

Are there other kinds of neural networks?

If you want a single, precise definition of neural network and score for the purposes of this challenge, then please use the definition in the previous section. If you think that "any function" looked at the right way is a neural network with no parameters, then please use the definition in the previous section.

If you are a more free spirit, then I encourage you to explore further. Perhaps your answer will not count toward the narrow challenge, but maybe you'll have more fun. Some other ideas you may try include more exotic activation functions, recurrent neural networks (reading one bit at a time), convolutional neural networks, more exotic architectures, softmax, and LSTMs (!). You may use any standard activation function and any standard architecture. A liberal definition of "standard" neural network features could include anything posted on the arXiv prior to the posting of this question.

A. Rex

Posted 2019-04-11T16:43:00.077

Reputation: 1 979

What sort of types are these weights? Usually people use floats; can we use other numeric types, e.g. types of less, more, or unlimited precision? – Post Rock Garf Hunter – 2019-04-12T20:59:54.667

@SriotchilismO'Zaic: For the purposes of the narrow definition, I think it makes sense to restrict to float and double (IEEE single- and double-precision floating point real numbers) for all of the weights and intermediate values. (Although note that some implementations may use other amounts of precision -- eg 80-bit -- during evaluation.) – A. Rex – 2019-04-13T15:53:47.970

I love this question but am disappointed there isn't a much smaller neural network to be found with enough training time. – Anush – 2019-05-12T15:37:14.500

Answers

Trial division: score 59407, 6243 layers, 16478 neurons in total

Given as a Python program which generates and validates the net. See the comments in trial_division for an explanation of how it works. The validation is quite slow (as in, running time measured in hours): I recommend using PyPy or Cython.

All layers use ReLU (\$\alpha \to \max(0, \alpha)\$) as the activation function.

The threshold is 1: anything above that is prime, anything below is composite or zero, and the only input which gives an output of 1 is 1 itself.

#!/usr/bin/python3

import math


def primes_to(n):
    ps = []
    for i in range(2, n):
        is_composite = False
        for p in ps:
            if i % p == 0:
                is_composite = True
                break
            if p * p > i:
                break
        if not is_composite:
            ps.append(i)
    return ps


def eval_net(net, inputs):
    for layer in net:
        inputs.append(1)
        n = len(inputs)
        inputs = [max(0, sum(inputs[i] * neuron[i] for i in range(n))) for neuron in layer]
    return inputs


def cost(net):
    return sum(len(layer) * len(layer[0]) for layer in net)


def trial_division(num_bits):
    # Overview: we convert the bits to a single number x and perform trial division.
    # x is also our "is prime" flag: whenever we prove that x is composite, we clear it to 0
    # At the end x will be non-zero only if it's a unit or a prime, and greater than 1 only if it's a prime.
    # We calculate x % p as
    #     rem = x - (x >= (p << a) ? 1 : 0) * (p << a)
    #     rem -= (rem >= (p << (a-1)) ? 1 : 0) * (p << (a-1))
    #     ...
    #     rem -= (rem >= p ? 1 : 0) * p
    #
    # If x % p == 0 and x > p then x is a composite multiple of p and we want to set it to 0

    N = 1 << num_bits
    primes = primes_to(1 + int(2.0 ** (num_bits / 2)))

    # As a micro-optimisation we exploit 2 == -1 (mod 3) to skip a number of shifts for p=3.
    # We need to bias by a multiple of 3 which is at least num_bits // 2 so that we don't get a negative intermediate value.
    bias3 = num_bits // 2
    bias3 += (3 - (bias3 % 3)) % 3

    # inputs: [bit0, ..., bit19]
    yield [[1 << i for i in range(num_bits)] + [0],
           [-1] + [0] * (num_bits - 1) + [1],
           [0] * 2 + [-1] * (num_bits - 2) + [1],
           [(-1) ** i for i in range(num_bits)] + [bias3]]

    for p in primes[1:]:
        # As a keyhole optimisation we overlap the cases slightly.
        if p == 3:
            # [x, x_is_even, x_lt_4, x_reduced_mod_3]
            max_shift = int(math.log((bias3 + (num_bits + 1) // 2) // p, 2))
            yield [[1, 0, 0, 0, 0], [0, 1, -1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, -1, p << max_shift]]
            yield [[1, -N, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, -1, 1]]
            yield [[1, 0, 0, 0], [0, 1, -p << max_shift, 0]]
        else:
            # [x, x % old_p]
            max_shift = int(num_bits - math.log(p, 2))
            yield [[1, 0, 0], [1, -N, -p_old], [-1, 0, p << max_shift]]
            yield [[1, -N, 0, 0], [0, 0, -1, 1]]
            yield [[1, 0, 0], [1, -p << max_shift, 0]]

        for shift in range(max_shift - 1, -1, -1):
            # [x, rem]
            yield [[1, 0, 0], [0, 1, 0], [0, -1, p << shift]]
            yield [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, -1, 1]]
            yield [[1, 0, 0, 0], [0, 1, -p << shift, 0]]
        # [x, x % p]
        p_old = p

    yield [[1, 0, 0], [1, -N, -p]]
    yield [[1, -N, 0]]


def validate_primality_tester(primality_tester, threshold):
    num_bits = len(primality_tester[0][0]) - 1
    primes = set(primes_to(1 << num_bits))
    errors = 0
    for i in range(1 << num_bits):
        expected = i in primes
        observed = eval_net(primality_tester, [(i >> shift) & 1 for shift in range(num_bits)])[-1] > threshold
        if expected != observed:
            errors += 1
            print("Failed test case", i)
        if (i & 0xff) == 0:
            print("Progress", i)

    if errors > 0:
        raise Exception("Failed " + str(errors) + " test case(s)")


if __name__ == "__main__":
    n = 20

    trial_div = list(trial_division(n))
    print("Cost", cost(trial_div))
    validate_primality_tester(trial_div, 1)

As an aside, re

the universal approximation theorem states that neural networks can approximate any continuous function

it's easy to show that a neural network using ReLU is Turing complete. The easiest logic gate to implement robustly is NOR: an n-input NOR gate is \$\max\left(0, 1 - \sum a_i\right)\$. I say robustly because this gate accepts inputs greater than 1 but (provided the inputs are not between 0 and 1) only ever outputs 0 or 1. A single-layer AND gate is \$\max\left(0, 1 + \sum(a_i - 1)\right)\$ but only works correctly if its inputs are guaranteed to be 0 or 1, and may output larger integers. Various other gates are possible in one layer, but NOR by itself is Turing-complete so there's no need to go into detail.
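
A quick sketch (not part of the answer's network) that checks these gate formulas with plain ReLU arithmetic on 0/1 inputs:

from itertools import product

def relu(z):
    return max(0, z)

def nor_gate(inputs):  # max(0, 1 - sum(a_i))
    return relu(1 - sum(inputs))

def and_gate(inputs):  # max(0, 1 + sum(a_i - 1)); assumes inputs are 0 or 1
    return relu(1 + sum(a - 1 for a in inputs))

for bits in product([0, 1], repeat=3):
    assert nor_gate(bits) == int(not any(bits))
    assert and_gate(bits) == int(all(bits))
print("NOR and AND gates behave as described on 0/1 inputs.")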

Peter Taylor

Posted 2019-04-11T16:43:00.077

Reputation: 41 901

As a further aside, I started work on an Euler test before I tried trial division, because I thought it would be more efficient, but raising a number (7 was the best candidate) to a power of (x-(x mod 2)) would require 38 multiplications followed by reduction mod x, and the best network I found for multiplying 20-bit numbers costs 1135, so it's not going to be competitive. – Peter Taylor – 2019-04-18T23:07:39.470

Score 984314, 82027 layers, 246076 neurons in total

We can keep things entirely in the integers if we use the ReLU activation function, which simplifies the analysis.

Given an input \$x\$ which is known to be an integer, we can test whether \$x = a\$ with two layers and three neurons:

  1. First layer: outputs \$\textrm{ge}_a = (x - a)^+\$ and \$\textrm{le}_a = (-x + a)^+\$
  2. Second layer: outputs \$\textrm{eq}_a = (-\textrm{ge}_a - \textrm{le}_a + 1)^+\$. \$\textrm{eq}_{a}\$ will be \$1\$ if \$x = a\$ and \$0\$ otherwise.

Layer 1: reduce the 20 inputs to one value \$x\$ with weights 1, 2, 4, ... and bias 0. Cost: (20 + 1) * 1 = 21.

Layer 2: outputs \$\textrm{ge}_2 = (x - 2)^+\$, \$\textrm{le}_2 = (-x + 2)^+\$. Cost (1 + 1) * 2 = 4.

Layer 3: outputs \$\textrm{accum}_2 = (-\textrm{ge}_2 - \textrm{le}_2 + 1)^+\$, \$\textrm{ge}_3 = (\textrm{ge}_2 - (3-2))^+\$, \$\textrm{le}_3 = (-\textrm{ge}_2 + (3-2))^+\$. Cost (2 + 1) * 3 = 9.

Layer 4: outputs \$\textrm{accum}_3 = (2^{21} \textrm{accum}_2 -\textrm{ge}_3 - \textrm{le}_3 + 1)^+\$, \$\textrm{ge}_5 = (\textrm{ge}_3 - (5-3))^+\$, \$\textrm{le}_5 = (-\textrm{ge}_3 + (5-3))^+\$. Cost (3 + 1) * 3 = 12.

Layer 5: outputs \$\textrm{accum}_5 = (2^{21} \textrm{accum}_3 -\textrm{ge}_5 - \textrm{le}_5 + 1)^+\$, \$\textrm{ge}_7 = (\textrm{ge}_5 - (7-5))^+\$, \$\textrm{le}_7 = (-\textrm{ge}_5 + (7-5))^+\$. Cost (3 + 1) * 3 = 12.

...

Layer 82026: outputs \$\textrm{accum}_{1048571} = (2^{21} \textrm{accum}_{1048559} -\textrm{ge}_{1048571} - \textrm{le}_{1048571} + 1)^+\$, \$\textrm{ge}_{1048573} = (\textrm{ge}_{1048571} - ({1048573}-{1048571}))^+\$, \$\textrm{le}_{1048573} = (-\textrm{ge}_{1048571} + ({1048573}-{1048571}))^+\$. Cost (3 + 1) * 3 = 12.

Layer 82027: outputs \$\textrm{accum}_{1048573} = (2^{21} \textrm{accum}_{1048571} -\textrm{ge}_{1048573} - \textrm{le}_{1048573} + 1)^+\$. Cost (3 + 1) * 1 = 4.

The threshold is 0. If working with doubles, overflow to \$+\infty\$ is quite possible, but that seems to be perfectly in accordance with the rules.

Score is (82026 - 3) * 12 + 21 + 4 + 9 + 4 = 984314.
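
As a sanity check, the two-layer equality test described above can be evaluated directly with ReLU arithmetic (a sketch for illustration only, not part of the scored network):

def relu(z):
    return max(0, z)

def eq(x, a):  # the two-layer, three-neuron equality test for integer x
    ge_a = relu(x - a)
    le_a = relu(-x + a)
    return relu(-ge_a - le_a + 1)

assert all(eq(x, 7) == (1 if x == 7 else 0) for x in range(-5, 20))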

Peter Taylor

Posted 2019-04-11T16:43:00.077

Reputation: 41 901

Cool. As I understand, this also "memorizes" the primes, but it tests for equality "sequentially" rather than in "parallel". (Alternatively, it's like a transpose of the baseline.) The first step is to immediately move away from the bit pattern and just work with the actual integer itself. As a result, there isn't a 20-fold penalty in the equality check. Thanks for your submission – A. Rex – 2019-04-14T01:39:03.250

What is superscript plus? – feersum – 2019-04-14T04:57:10.837

1@feersum, that's notation from the Wikipedia page on ReLU linked in the question. $x^+ = \max(0, x)$ – Peter Taylor – 2019-04-14T05:17:08.103