Verity 0.10, optimized for source code size (1944 bytes)
I originally misread the question and interpreted it as a code-golf. That was probably for the best, as it's much easier to write a quine with short source code than short object code under the restrictions in the question; that made the question easy enough that I felt I could reasonably produce an answer, and might work as a stepping stone on the way to a better answer. It also prompted me to use a higher-level language for the input, meaning that I'd need to express less in the program itself. I didn't create Verity as a golfing language for hardware (I was actually hired to create it a while ago in an entirely different context), but there's quite a reminiscence there (e.g. it's substantially higher level than a typical HDL is, and it has much less boilerplate; it's also much more portable than the typical HDL).
I'm pretty sure that the correct solution for short object code involves storing the data in some kind of tree structure, given that the question disallows the use of block ROM, which is where you'd normally store it in a practical program; I might have a go at writing a program that uses this principle (not sure what language, maybe Verity, maybe Verilog; VHDL has too much boilerplate to likely be optimal for this sort of problem) at some point. That would mean that you wouldn't need to pass every bit of the source code to every bit of your "manually created ROM". However, the Verity compiler currently synthesizes the structure of the output based on the precedence and associativity of the input, meaning that it's effectively representing the instruction pointer (thus the index to the lookup table) in unary, and a unary index multiplied by the length of the lookup table gives this O(n²) space performance.
The program itself:
import <print>new x:=0$1296in(\p.\z.\a.new y:=(-a 5-a 1-a 1-a 2-a 4-a 2-a 3-a 2-a 6-a 2-a 0-a 3-a 0-a 4-a 4-a 7-a 4-a 2-a 6-a 2-a 5-a 1-a 2-a 2-a 0-a 3-a 6-a 7-a 2-a 2-a 1-a 1-a 3-a 3-a 0-a 4-a 4-a 3-a 2-a 7-a 5-a 7-a 0-a 6-a 4-a 4-a 1-a 6-a 2-a 6-a 1-a 7-a 6-a 6-a 5-a 1-a 2-a 2-a 0-a 5-a 0-a 0-a 4-a 2-a 6-a 5-a 0-a 0-a 6-a 3-a 6-a 5-a 0-a 0-a 5-a 0-a 6-a 5-a 2-a 2-a 1-a 1-a 3-a 3-a 0-a 4-a 5-a 3-a 2-a 7-a 5-a 7-a 0-a 5-a 5-a 5-a 1-a 4-a 4-a 3-a 1-a 5-a 5-a 1-a 2-a 2-a 0-a 4-a 3-a 3-a 4-a 1-a 5-a 1-a 0-a 2-a 1-a 1-a 1-a 4-a 4-a 3-a 6-a 7-a 0-a 6-a 0-a 1-a 3-a 2-a 0-a 5-a 4-a 2-a 0-a 5-a 5-a 1-a 2-a 1-a 0-a 4-a 6-a 3-a 4-a 7-a 3-a 6-a 2-a 6-a 0-a 3-a 4-a 1-a 1-a 1-a 2-a 2-a 0-a 4-a 6-a 3-a 3-a 5-a 1-a 7-a 2-a 6-a 1-a 1-a 0-a 2-a 7-a 2-a 1-a 1-a 0-a 4-a 6-a 3-a 1-a 5-a 3-a 7-a 5-a 1-a 2-a 1-a 0-a 4-a 6-a 3-a 5-a 7-a 5-a 7-a 4-a 6-a 5-a 6-a 0-a 3-a 4-a 1-a 1-a 1-a 2-a 2-a 0-a 4-a 3-a 3-a 4-a 1-a 5-a 1-a 0-a 2-a 1-a 1-a 1-a 4-a 5-a 3-a 6-a 7-a 0-a 6-a 0-a 1-a 3-a 2-a 0-a 5-a 4-a 2-a 0-a 4-a 1-a 7-a 7-a 6-a 3-a 7-a 4-a 2-a 0-a 4-a 3-a 6-a 2-a 6-a 3-a 7-a 4-a 2-a 0-a 5-a 4-a 6-a 0-a 7-a 2-a 0-a 1-a 4-a 5-a 3-a 4-a 4-a 4-a 4-a 3-a 6-a 4-a 4-a 4-a 4-a 3-a 6-a 2-a 6-a 1-a 5-a 3-a 7-a 4-a 2-a 0-a 4-a 4-a 6-a 5-a 6-a 3-a 7-a 5-a 3-a 2-a 7-a 5-a 7-a 1-a 4-a 5-a 3-a 6-a 7-a 6-a 7-a 3-a 6-a 1-a 5-a 1-a 1-a 0-a 2-a 7-a 2-a 1-a 1-a 0-a 4-a 7-a 2-a 7-a 1-a 5-a 1-a 4-a 2-a 3-a 7-a 4-a 3-a 2-a 7-a 5-a 7-a 1-a 4-a 4-a 3-a 6-a 7-a 6-a 7-a 6-a 6-a 1-a 5-a 1-a 5-a 4-a 2-a 6-a 2-a 5-a 1-a 2-a 2-a 0-a 3-a 0-a 5-a 1-a 4-a 4-a 3-a 4-a 4-a 4-a 4-a 6-a 6-a 4-a 4-a 4-a 4-a 3-a 6-a 2-a 6-a 1-a 5-a 0-a 5-a 0-a 0-a 0-a 1-a 6-a 5-a 4-a 3-a 2-a 7-a 5-a 7-a 1-a 4-a 4-a 3-a 6-a 7-a 6-a 7-a 3-a 6-a 2-a 0-a 0-a 1-a 4-a 7-a 4-a 7-a 1-a 6-a 2-a 6-a 1-a 7-a 3-a 6-a 3-a 7-a 0-a 6-a 1-a 5-!x)in while!x>0do(p(if z<32then z+92else z);if z==45then while!y>0do(p 97;p 32;p(48^!y$$3$$32);p 45;y:=!y>>3)else skip;x:=!x>>6))print(!x$$6$$32)(\d.x:=!x>>3^d<<1293;0)
More readable:
import <print>
new x := 0$1296 in
(\p.\z.\a.
new y := (-a 5-a 1-
# a ton of calls to a() omitted...
-a 1-a 5-!x) in
while !x>0 do (
p(if z<32 then z+92 else z);
if z==45
then while !y>0 do (
p 97;
p 32;
p(48^!y$$3$$32);
p 45;
y:=!y>>3 )
else skip;
x:=!x>>6
)
)(print)(!x$$6$$32)(\d.x:=!x>>3^d<<1293;0)
The basic idea is that we store the entire data in the variable x
. (As usual for a quine, we have a code section and a data section; the data encodes the text of the code, and can also be used to regenerate the text of the data.) Unfortunately, Verity doesn't currently allow very large constants to be written in the source code (it uses OCaml integers during compilation to represent integers in the source, which clearly isn't correct in a language that supports arbitrarily wide integer types) – and besides, it doesn't allow constants to be specified in octal – so we generate the value of x
at runtime via repeated calls to a function a
. We could create a void function and call it repeatedly as separate statements, but that would make it hard to identify where to start outputting the text of the data section. So instead, I made a
return an integer, and use arithmetic to store the data (Verity guarantees that arithmetic evaluates left to right). The data section is encoded in x
using a single -
sign; when this is encountered at run time, it's expanded to the full -a 5-a 1-
, etc., via the use of y
.
Initializing y
as a copy of x
is fairly subtle here. Because a
returns zero specifically, most of the sum is just zero minus zero minus… and cancels itself out. We end with !x
(i.e. "the value of x
"; in Verity, as in OCaml, a variable's name works more like a pointer than anything else, and you have to dereference it explicitly to get at the variable's value). Verity's rules for unary minus are a little complex – the unary minus of v
is written as (-v)
– thus (-0-0-0-!x)
parses as (-(0-0-0-!x))
, which is equal to !x
, and we end up initializing y
as a copy of x
. (It's also worth noting that Verity is not call-by-value, but rather allows functions and operators to choose the order they evaluate things; -
will evaluate the left argument before the right argument, and in particular, if the left argument has side effects, those will be visible when the right argument is evaluated.)
Each character of the source code is represented using two octal digits. This means that the source code is limited to 64 different characters, so I had to create my own codepage for internal use. The output is in ASCII, so I needed to convert internally; this is what the (if z<32 then z+92 else z)
is for. Here's the character set I used in the internal representation, in numerical order (i.e. \
has codepoint 0, ?
has codepoint 63):
\]^_`abcdefghijklmnopqrstuvwxyz{ !"#$%&'()*+,-./0123456789:;<=>?
This character set gives us most of the characters important for Verity. Notable characters missing are }
(meaning that we can't create a block using {}
, but luckily all statements are expressions so we can use ()
instead); and |
(this is why I had to use an exclusive rather than inclusive OR when creating the value of x
, meaning I need to initialize it to 0; however, I needed to specify how large it was anyway). Some of the critical characters that I wanted to ensure were in the character set were <>
(for imports, also shifts), ()
(very hard to write a program that can be parsed without these), $
(for everything to do with bitwidth), and \
(for lambdas; theoretically we could work around this with let…in
but it would be much more verbose).
In order to make the program a bit shorter, I constructed abbreviations for print
and for !x$$6$$32
(i.e. "the bottom 6 bits of !x
, cast to be usable to the print
library) via temporarily binding them to lambda arguments.
Finally, there's the issue of output. Verity provides a print
library that's intended for debug output. On a simulator, it prints the ASCII codes to standard output, which is perfectly usable for testing the program. On a physical circuit board, it depends on a print
library having been written for the particular chip and board surrounding it; there's a print
library in the Verity distribution for an evaluation board I had access to that prints the output on seven-segment displays. Given that the library will end up taking space on the resulting circuit board, it may be worth using a different language for an optimized solution to this problem so that we can output the bits of the output directly on wires.
By the way, this program is O(n²) on hardware, meaning that it's much worse on a simulator (I suspect O(n⁴); not sure, though, but it was hard enough to simulate that it seems unlikely to be even cubic, and based on how the time reacted to my changes as I was writing the program, the function seems to grow very quickly indeed). The Verity compiler needed 436 optimization passes (which is much, much more than it'd typically use) to optimize the program, and even after that, simulating it was very hard for my laptop. The complete compile-and-simulate run took the following time:
real 112m6.096s
user 105m25.136s
sys 0m14.080s
and peaked at 2740232 kibibytes of memory. The program takes a total of 213646 clock cycles to run. It does work, though!
Anyway, this answer doesn't really fulfil the question as I was optimizing for the wrong thing, but seeing as there are no other answers yet, this is the best by default (and it's nice to see what a golfed quine would look like in a hardware language). I'm not currently sure whether or not I'll work on a program that aims to produce more optimized ouptut on the chip. (It would likely be a lot larger in terms of source, as an O(n) data encoding would be rather more complex than the one seen here.)
1The first compiler run says that only 130 LE gates are used. Because an LE gate can only store 2 bits of data and you use a ~750 bit data stream this may mean that the compiler compressed the data or that something went wrong while compiling. Tomorrow in the morning I'll verify if the compiler result is correct. – Martin Rosenau – 2016-11-17T04:24:26.457
With this construction, much of the data is stored in the pattern of connections between the gates rather than the gates themselves, so I can believe that 130 LE gates is correct. (The synthesizer is basically brute-forcing a formula that maps a 10-bit index to a 1-bit value; I specified the formula in the Verilog program using a lookup table with 1024 entries, but the synthesizer is very likely to use a more efficient representation based on something like a K-map minimization.) – None – 2016-11-17T04:55:12.970
Oh, you should probably also check to make sure that the compiler hasn't optimized part of the code into a block RAM or block ROM (disallowed by the question). I didn't request one, and haven't written the code in a form that would imply one (I was careful to make the lookup table combinatorial), but sometimes compiler optimizations do weird things. If there is an optimization interfering with that, you'd have to turn the optimization off. – None – 2016-11-17T06:34:23.127
I tried to get that running on my FPGA for half a day (OK. 3 hours wasted because I searched for some CD-ROM I obviously lost): The data stream returned by the FPGA seems not to be correct. Unfortunately I havn't found out why... – Martin Rosenau – 2016-11-17T12:48:42.633
1Ok. I managed the problems with the pins now. The code seems to work well. Outstanding. Only 130 LE-type cells! RAM, ROM etc. is not used. I think the compiler does some optimization similar to KV diagrams to "compress" the data. – Martin Rosenau – 2016-11-17T13:49:16.837
1You win! Congratulations. – Martin Rosenau – 2016-11-22T09:49:13.533