Testing if a number is a square

16

Write a GOLF assembly program that, given a 64-bit unsigned integer in register n, puts a non-zero value into register s if n is a square, and 0 into s otherwise.

Your GOLF binary (after assembling) must fit in 4096 bytes.


Your program will be scored using the following Python3 program (which must be put inside the GOLF directory):

import random, sys, assemble, golf, decimal

def is_square(n):
    nd = decimal.Decimal(n)
    with decimal.localcontext() as ctx:
        ctx.prec = n.bit_length() + 1
        i = int(nd.sqrt())
        return i*i == n

with open(sys.argv[1]) as in_file:
    binary, debug = assemble.assemble(in_file)

score = 0
random.seed(0)
for i in range(1000):
    cpu = golf.GolfCPU(binary)

    if random.randrange(16) == 0: n = random.randrange(2**32)**2
    else:                         n = random.randrange(2**64)

    cpu.regs["n"] = n
    cpu.run()
    if bool(cpu.regs["s"]) != is_square(n):
        raise RuntimeError("Incorrect result for: {}".format(n))
    score += cpu.cycle_count
    print("Score so far ({}/1000): {}".format(i+1, score))

print("Score: ", score)

Make sure to update GOLF to the latest version with git pull. Run the score program using python3 score.py your_source.golf.

It uses a static seed to generate a set of numbers, roughly 1/16 of which are squares. Optimizing towards this particular set of numbers is against the spirit of the question, and I may change the seed at any point. Your program must work for any non-negative 64-bit input number, not just these.
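
For anyone who wants to sanity-check the reference test on its own, here is a minimal sketch that draws inputs with the same 1/16 split and compares the decimal-based check against an integer-only one. The names is_square_decimal, is_square_isqrt and sample_inputs are made up for this illustration, and math.isqrt is assumed to be available (Python 3.8+); the scoring script itself only uses the decimal version.

```python
import decimal
import math
import random


def is_square_decimal(n):
    # Same idea as the scoring script: take a decimal square root with
    # enough digits of precision, truncate it, and square it again.
    nd = decimal.Decimal(n)
    with decimal.localcontext() as ctx:
        ctx.prec = n.bit_length() + 1
        i = int(nd.sqrt())
        return i * i == n


def is_square_isqrt(n):
    # Integer-only alternative (an assumption, not part of the scorer).
    r = math.isqrt(n)
    return r * r == n


def sample_inputs(count, seed=0):
    # Roughly 1 in 16 inputs is a perfect square, the rest are uniform.
    rng = random.Random(seed)
    for _ in range(count):
        if rng.randrange(16) == 0:
            yield rng.randrange(2**32) ** 2
        else:
            yield rng.randrange(2**64)


if __name__ == "__main__":
    inputs = list(sample_inputs(1000))
    agree = all(is_square_decimal(n) == is_square_isqrt(n) for n in inputs)
    print("checks agree:", agree)
```

Both checks should agree on every 64-bit input; the integer version is just convenient when building lookup tables or test cases offline.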

Lowest score wins.


Because GOLF is very new I'll include some pointers here. You should read the GOLF specification, which lists all instructions and their cycle costs. Example programs can be found in the GitHub repository.

For manual testing, compile your program to a binary by running python3 assemble.py your_source.golf. Then run your program using python3 golf.py -p s your_source.bin n=42. This should start the program with n set to 42 and print register s and the cycle count after exiting. To see the contents of all registers at program exit, use the -d flag; use --help to see all flags.

orlp

Posted 2015-04-27T20:57:21.430

Reputation: 37 067

I unrolled a 32-iteration loop to save ~64 operations per test. That's probably outside the spirit of the challenge. Maybe this would work better as speed divided by codesize? – Sparr – 2015-04-28T06:19:28.050

@Sparr Loop unrolling is allowed, as long as your binary fits in 4096 bytes. Do you feel this limit is too high? I'm willing to lower it. – orlp – 2015-04-28T06:20:05.303

@Sparr Your binary right now is 1.3k, but I think that 32-iteration loop unrolling is a bit much. How does a binary limit of 1024 bytes sound? – orlp – 2015-04-28T06:24:58.443

Warning to all contestants! Update your GOLF interpreter with git pull. I found a bug in the leftshift operand where it didn't properly wrap. – orlp – 2015-04-28T07:07:59.427

I'm not sure. 1024 would only require me to loop once; I'd still save ~62 ops per test by the unrolling.

I suspect someone might also put that much space to good use as a lookup table. I've seen some algorithms that want 2-8k of lookup tables for 32 bit square roots. – Sparr – 2015-04-28T15:26:41.083

all the existing solutions, including mine before unrolling, fit in much much less space than that. How big is yours? – Sparr – 2015-04-28T15:28:25.117

Is there a way to get access to math (in particular, math.sqrt) to programmatically generate a lookup table in the data section? Right now I have a string of 3k octal escapes pasted in my file... – 2012rcampion – 2015-07-01T02:34:08.973

@2012rcampion No, but you can use x ** 0.5 to do square roots. – orlp – 2015-07-01T07:04:55.697

Answers

2

Score: 22120 (3414 bytes)

My solution uses a 3kB lookup table to seed a Newton's method solver that runs for zero to three iterations depending on the result's size.

    lookup_table = bytes(int((16*n)**0.5) for n in range(2**10, 2**12))

    # use orlp's mod-64 trick
    and b, n, 0b111111
    shl v, 0xc840c04048404040, b
    le q, v, 0
    jz not_square, q
    jz is_square, n

    # x will be a shifted copy of n used to index the lookup table.
    # We want it shifted (by a multiple of two) so that the two most 
    # significant bits are not both zero and no overflow occurs.
    # The size of n in bit *pairs* (minus 8) is stored in b.
    mov b, 24
    mov x, n 
    and c, x, 0xFFFFFFFF00000000
    jnz skip32, c
    shl x, x, 32
    sub b, b, 16
skip32:
    and c, x, 0xFFFF000000000000
    jnz skip16, c
    shl x, x, 16
    sub b, b, 8
skip16:
    and c, x, 0xFF00000000000000
    jnz skip8, c
    shl x, x, 8
    sub b, b, 4
skip8:
    and c, x, 0xF000000000000000
    jnz skip4, c
    shl x, x, 4
    sub b, b, 2
skip4:
    and c, x, 0xC000000000000000
    jnz skip2, c
    shl x, x, 2
    sub b, b, 1
skip2:

    # now we shift x so it's only 12 bits long (the size of our lookup table)
    shr x, x, 52

    # and we store the lookup table value in x
    add x, x, data(lookup_table)
    sub x, x, 2**10
    lbu x, x

    # now we shift x back to the proper size
    shl x, x, b

    # x is now an initial estimate for Newton's method.
    # Since our lookup table is 12 bits, x has at least 6 bits of accuracy
    # So if b <= -2, we're done; else do an iteration of Newton
    leq c, b, -2
    jnz end_newton, c
    divu q, r, n, x
    add x, x, q
    shr x, x, 1

    # We now have 12 bits of accuracy; compare b <= 4
    leq c, b, 4
    jnz end_newton, c
    divu q, r, n, x
    add x, x, q
    shr x, x, 1

    # 24 bits, b <= 16
    leq c, b, 16
    jnz end_newton, c
    divu q, r, n, x
    add x, x, q
    shr x, x, 1

    # 48 bits, we're done!

end_newton:

    # x is the (integer) square root of n: test x*x == n
    mulu x, h, x, x
    cmp s, n, x
    halt 0

is_square:
    mov s, 1

not_square:
    halt 0
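
The idea behind the assembly above (a table lookup on the top 12 bits, followed by Newton steps) can be sketched in Python roughly as follows. This is not a line-by-line translation: the function names are hypothetical, and instead of the fixed zero to three Newton steps it simply iterates until the estimate stops decreasing, which is slower but easy to check.

```python
# Same table expression as in the answer: entry k - 1024 holds
# int(sqrt(16 * k)), which lies between 128 and 255.
LOOKUP_TABLE = bytes(int((16 * k) ** 0.5) for k in range(2**10, 2**12))


def table_seed(n):
    """Initial square-root estimate for 1 <= n < 2**64."""
    pairs = (64 - n.bit_length()) // 2   # leading zero bit *pairs*
    top12 = (n << (2 * pairs)) >> 52     # 1024 <= top12 < 4096
    b = 24 - pairs                       # plays the role of register b
    seed = LOOKUP_TABLE[top12 - 2**10]
    # Shift the estimate back to the proper size (b may be negative).
    return seed << b if b >= 0 else seed >> -b


def isqrt_newton(n):
    """Floor of sqrt(n), using the table value as the starting guess."""
    if n == 0:
        return 0
    seed = table_seed(n)
    # One Newton step from any positive guess lands at or above the root.
    x = (seed + n // seed) // 2
    while True:
        y = (x + n // x) // 2
        if y >= x:          # no further decrease: x is the integer root
            return x
        x = y


def is_square(n):
    r = isqrt_newton(n)
    return r * r == n
```

For example, is_square(144) is true and is_square(145) is false. The assembly gets its speed from knowing in advance how many Newton steps each input size needs, which this sketch does not try to reproduce.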

2012rcampion

Posted 2015-04-27T20:57:21.430

Reputation: 1 319

10

Score: 27462

About time I'd compete in a GOLF challenge :D

    # First we look at the last 6 bits of the number. These bits must be
    # one of the following:
    #
    #     0x00, 0x01, 0x04, 0x09, 0x10, 0x11,
    #     0x19, 0x21, 0x24, 0x29, 0x31, 0x39
    #
    # That's 12/64, so about 81% of inputs are rejected as non-squares right away!
    #
    # Conveniently, a 64 bit number can hold 2**6 binary values. So we can
    # use a single integer as a lookup table, by shifting. After shifting
    # we check if the top bit is set by doing a signed comparison to 0.

    and b, n, 0b111111
    shl v, 0xc840c04048404040, b
    le q, v, 0
    jz no, q
    jz yes, n

    # Hacker's Delight algorithm - Newton-Raphson.
    mov c, 1
    sub x, n, 1
    geu q, x, 2**32-1
    jz skip32, q
    add c, c, 16
    shr x, x, 32
skip32:
    geu q, x, 2**16-1
    jz skip16, q
    add c, c, 8
    shr x, x, 16
skip16:
    geu q, x, 2**8-1
    jz skip8, q
    add c, c, 4
    shr x, x, 8
skip8:
    geu q, x, 2**4-1
    jz skip4, q
    add c, c, 2
    shr x, x, 4
skip4:
    geu q, x, 2**2-1
    add c, c, q

    shl g, 1, c
    shr t, n, c
    add t, t, g
    shr h, t, 1

    leu q, h, g
    jz newton_loop_done, q
newton_loop:
    mov g, h
    divu t, r, n, g
    add t, t, g
    shr h, t, 1
    leu q, h, g
    jnz newton_loop, q
newton_loop_done:

    mulu u, h, g, g
    cmp s, u, n 
    halt 0
yes:
    mov s, 1
no:
    halt 0
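
The mod-64 filter at the top of the program can be checked with a short Python sketch. It rebuilds the 64-bit constant from the possible residues of a square modulo 64 and applies the same shift-and-test; the names SQUARE_RESIDUES, MASK and may_be_square are made up for this illustration.

```python
# Residues that a perfect square can leave modulo 64.
SQUARE_RESIDUES = sorted({(k * k) % 64 for k in range(64)})

# Bit (63 - r) is set exactly when r is such a residue, so shifting the
# constant left by r moves the answer for r into the top (sign) bit.
MASK = sum(1 << (63 - r) for r in SQUARE_RESIDUES)


def may_be_square(n):
    """False means n is certainly not a square; True means 'keep checking'."""
    shifted = (MASK << (n & 0b111111)) & (2**64 - 1)  # emulate 64-bit wraparound
    return bool(shifted >> 63)


if __name__ == "__main__":
    print([hex(r) for r in SQUARE_RESIDUES])
    print(hex(MASK))
```

The filter never rejects a true square; it only saves the Newton iteration for inputs whose low six bits already rule them out.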

orlp

Posted 2015-04-27T20:57:21.430

Reputation: 37 067

If I steal your lookup idea then my score drops from 161558 to 47289. Your algorithm still wins. – Sparr – 2015-04-28T15:37:04.010

Have you tried unrolling the newton loop? How many iterations does it need, for the worst case? – Sparr – 2015-05-26T15:11:30.100

@Sparr Yes, it's not faster to unroll because there's high variance in number of iterations. – orlp – 2015-05-27T05:59:15.950

does it ever complete in zero or one iterations? what's the max? – Sparr – 2015-05-27T16:04:14.940

The lookup table idea was also in the answer http://stackoverflow.com/a/18686659/4339987.

– lirtosiast – 2015-07-01T14:57:07.390

5

Score: 161558 (previously 227038, 259038, 260038, 263068)

I took the fastest integer square root algorithm I could find and return whether its remainder is zero.

# based on http://www.cc.utah.edu/~nahaj/factoring/isqrt.c.html
# converted to GOLF assembly for http://codegolf.stackexchange.com/questions/49356/testing-if-a-number-is-a-square

# unrolled for speed, original source commented out at bottom
start:
    or u, t, 1 << 62
    shr t, t, 1
    gequ v, n, u
    jz nope62, v
    sub n, n, u
    or t, t, 1 << 62
    nope62:

    or u, t, 1 << 60
    shr t, t, 1
    gequ v, n, u
    jz nope60, v
    sub n, n, u
    or t, t, 1 << 60
    nope60:

    or u, t, 1 << 58
    shr t, t, 1
    gequ v, n, u
    jz nope58, v
    sub n, n, u
    or t, t, 1 << 58
    nope58:

    or u, t, 1 << 56
    shr t, t, 1
    gequ v, n, u
    jz nope56, v
    sub n, n, u
    or t, t, 1 << 56
    nope56:

    or u, t, 1 << 54
    shr t, t, 1
    gequ v, n, u
    jz nope54, v
    sub n, n, u
    or t, t, 1 << 54
    nope54:

    or u, t, 1 << 52
    shr t, t, 1
    gequ v, n, u
    jz nope52, v
    sub n, n, u
    or t, t, 1 << 52
    nope52:

    or u, t, 1 << 50
    shr t, t, 1
    gequ v, n, u
    jz nope50, v
    sub n, n, u
    or t, t, 1 << 50
    nope50:

    or u, t, 1 << 48
    shr t, t, 1
    gequ v, n, u
    jz nope48, v
    sub n, n, u
    or t, t, 1 << 48
    nope48:

    or u, t, 1 << 46
    shr t, t, 1
    gequ v, n, u
    jz nope46, v
    sub n, n, u
    or t, t, 1 << 46
    nope46:

    or u, t, 1 << 44
    shr t, t, 1
    gequ v, n, u
    jz nope44, v
    sub n, n, u
    or t, t, 1 << 44
    nope44:

    or u, t, 1 << 42
    shr t, t, 1
    gequ v, n, u
    jz nope42, v
    sub n, n, u
    or t, t, 1 << 42
    nope42:

    or u, t, 1 << 40
    shr t, t, 1
    gequ v, n, u
    jz nope40, v
    sub n, n, u
    or t, t, 1 << 40
    nope40:

    or u, t, 1 << 38
    shr t, t, 1
    gequ v, n, u
    jz nope38, v
    sub n, n, u
    or t, t, 1 << 38
    nope38:

    or u, t, 1 << 36
    shr t, t, 1
    gequ v, n, u
    jz nope36, v
    sub n, n, u
    or t, t, 1 << 36
    nope36:

    or u, t, 1 << 34
    shr t, t, 1
    gequ v, n, u
    jz nope34, v
    sub n, n, u
    or t, t, 1 << 34
    nope34:

    or u, t, 1 << 32
    shr t, t, 1
    gequ v, n, u
    jz nope32, v
    sub n, n, u
    or t, t, 1 << 32
    nope32:

    or u, t, 1 << 30
    shr t, t, 1
    gequ v, n, u
    jz nope30, v
    sub n, n, u
    or t, t, 1 << 30
    nope30:

    or u, t, 1 << 28
    shr t, t, 1
    gequ v, n, u
    jz nope28, v
    sub n, n, u
    or t, t, 1 << 28
    nope28:

    or u, t, 1 << 26
    shr t, t, 1
    gequ v, n, u
    jz nope26, v
    sub n, n, u
    or t, t, 1 << 26
    nope26:

    or u, t, 1 << 24
    shr t, t, 1
    gequ v, n, u
    jz nope24, v
    sub n, n, u
    or t, t, 1 << 24
    nope24:

    or u, t, 1 << 22
    shr t, t, 1
    gequ v, n, u
    jz nope22, v
    sub n, n, u
    or t, t, 1 << 22
    nope22:

    or u, t, 1 << 20
    shr t, t, 1
    gequ v, n, u
    jz nope20, v
    sub n, n, u
    or t, t, 1 << 20
    nope20:

    or u, t, 1 << 18
    shr t, t, 1
    gequ v, n, u
    jz nope18, v
    sub n, n, u
    or t, t, 1 << 18
    nope18:

    or u, t, 1 << 16
    shr t, t, 1
    gequ v, n, u
    jz nope16, v
    sub n, n, u
    or t, t, 1 << 16
    nope16:

    or u, t, 1 << 14
    shr t, t, 1
    gequ v, n, u
    jz nope14, v
    sub n, n, u
    or t, t, 1 << 14
    nope14:

    or u, t, 1 << 12
    shr t, t, 1
    gequ v, n, u
    jz nope12, v
    sub n, n, u
    or t, t, 1 << 12
    nope12:

    or u, t, 1 << 10
    shr t, t, 1
    gequ v, n, u
    jz nope10, v
    sub n, n, u
    or t, t, 1 << 10
    nope10:

    or u, t, 1 << 8
    shr t, t, 1
    gequ v, n, u
    jz nope8, v
    sub n, n, u
    or t, t, 1 << 8
    nope8:

    or u, t, 1 << 6
    shr t, t, 1
    gequ v, n, u
    jz nope6, v
    sub n, n, u
    or t, t, 1 << 6
    nope6:

    or u, t, 1 << 4
    shr t, t, 1
    gequ v, n, u
    jz nope4, v
    sub n, n, u
    or t, t, 1 << 4
    nope4:

    or u, t, 1 << 2
    shr t, t, 1
    gequ v, n, u
    jz nope2, v
    sub n, n, u
    or t, t, 1 << 2
    nope2:

    or u, t, 1 << 0
    shr t, t, 1
    gequ v, n, u
    jz nope0, v
    sub n, n, u
    nope0:

end:
    not s, n        # return !remainder
    halt 0


# before unrolling...
#
# start:
#     mov b, 1 << 62  # squaredbit = 01000000...
# loop:               # do {
#     or u, b, t      #   u = squaredbit | root
#     shr t, t, 1     #   root >>= 1
#     gequ v, n, u    #   if remainder >= u:
#     jz nope, v
#     sub n, n, u     #       remainder = remainder - u
#     or t, t, b      #       root = root | squaredbit
# nope:
#     shr b, b, 2     #   squaredbit >>= 2
#     jnz loop, b      # } while (squaredbit > 0)
# end:
#     not s, n        # return !remainder
#     halt 0

EDIT 1: removed squaring test, return !remainder directly, save 3 ops per test

EDIT 2: use n as the remainder directly, save 1 op per test

EDIT 3: simplified the loop condition, save 32 ops per test

EDIT 4: unrolled the loop, save about 65 ops per test
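
For reference, the commented-out loop above corresponds to the classic bit-by-bit square root. A rough Python rendering, with the hypothetical name isqrt_rem and variable names taken from the comments, might look like this:

```python
def isqrt_rem(n):
    """Return (root, remainder) with root = floor(sqrt(n)), for 0 <= n < 2**64."""
    root = 0
    remainder = n
    squaredbit = 1 << 62          # highest power of four that fits in 64 bits
    while squaredbit > 0:
        u = squaredbit | root
        root >>= 1
        if remainder >= u:
            remainder -= u
            root |= squaredbit
        squaredbit >>= 2
    return root, remainder


def is_square(n):
    # n is a perfect square exactly when nothing is left over.
    return isqrt_rem(n)[1] == 0
```

The loop always runs 32 times regardless of the input, which is why unrolling it removes a fixed amount of work per test.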

Sparr

Posted 2015-04-27T20:57:21.430

Reputation: 5 758

You can use full Python expressions in instructions, so you can write 0x4000000000000000 as 1 << 62 :) – orlp – 2015-04-28T05:39:17.883

3

Score: 344493

Does a simple binary search within the interval [1, 4294967296) to approximate sqrt(n), then checks if n is a perfect square.

mov b, 4294967296
mov c, -1

lesser:
    add a, c, 1

start:
    leu k, a, b
    jz end, k

    add c, a, b
    shr c, c, 1

    mulu d, e, c, c

    leu e, d, n
    jnz lesser, e
    mov b, c
    jmp start

end:
    mulu d, e, b, b
    cmp s, d, n

    halt 0
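
As a cross-check, the same binary search can be written in Python. This is a sketch of the logic rather than of the exact register usage, and the name is_square_bsearch is made up here:

```python
def is_square_bsearch(n):
    """Binary search for the smallest c in [0, 2**32] with c*c >= n."""
    lo, hi = 0, 1 << 32
    while lo < hi:
        mid = (lo + hi) >> 1
        if mid * mid < n:
            lo = mid + 1      # mid is too small, search above it
        else:
            hi = mid          # mid might be the root, keep it in range
    return hi * hi == n
```

Each call takes about 32 halvings of the interval, each with one multiplication, which is roughly where the cycle count of this approach comes from.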

es1024

Posted 2015-04-27T20:57:21.430

Reputation: 8 953

Nice starting answer! Do you have any feedback on programming in GOLF assembly, the tools I made for GOLF, or the challenge? This type of challenge is very new, and I'm eager to hear feedback :) – orlp – 2015-04-28T04:39:31.987

Your answer is bugged for n = 0 sadly, 0 is 0 squared :) – orlp – 2015-04-28T04:46:35.913

@orlp fixed for n = 0. Also, I'd suggest adding an instruction to print a register's value mid-execution, which could make debugging GOLF programs easier. – es1024 – 2015-04-28T04:54:12.403

I'm not going to add such an instruction (that would mean challenges have to add extra rules about disallowed debugging instructions), instead I have interactive debugging planned, with breakpoints and viewing all register contents. – orlp – 2015-04-28T04:56:31.397

you could maybe speed this up by weighting your binary search to land somewhere other than the midpoint. the geometric mean of the two values perhaps? – Sparr – 2015-04-28T05:33:31.823