0"D34çýÇbεDg•Xó•18в@ƶà©i7j0ìëR6ôRíć7®-jšTìJ1®<×ì]ð0:J"D34çýÇbεDg•Xó•18в@ƶà©i7j0ìëR6ôRíć7®-jšTìJ1®<×ì]ð0:J
05AB1E has no UTF-8 conversion builtins, so I have to do everything manually.
Try it online or verify that it's a quine.
Explanation:
Quine part:
The shortest quine for 05AB1E is this one: 0"D34çý"D34çý
(14 bytes), provided by @OliverNi. My answer uses a modified version of that quine, with the challenge code added at the ... here: 0"D34çý..."D34çý...
A short explanation of this quine:
0 # Push a 0 to the stack (can be any digit)
"D34çý" # Push the string "D34çý" to the stack
D # Duplicate this string
34ç # Push 34 converted to an ASCII character to the stack: '"'
ý # Join everything on the stack (the 0 and both strings) by '"'
# (output the result implicitly)
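For comparison, the same trick can be sketched in Python (just to illustrate the structure; it is not related to the 05AB1E code, and the variable name s is my own): keep the code once as a string, then print it once quoted (via repr) and once unquoted (as code).

    s = 'print("s = " + repr(s) + "; " + s)'; print("s = " + repr(s) + "; " + s)

Here s plays the role of "D34çý": the body of the program is stored once as data and emitted twice, which mirrors the duplicate-and-join-by-'"' steps above.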
Challenge part:
Now for the challenge part of the code. As I mentioned at the top, 05AB1E has no UTF-8 conversion builtins, so I have to do these things manually. I've used this source as a reference on how to do that: Manually converting unicode codepoints into UTF-8 and UTF-16. Here is a short summary of it regarding the conversion of Unicode characters to UTF-8:
- Convert the Unicode characters to their Unicode values (e.g. "dЖ丽" becomes [100,1046,20029])
- Convert these Unicode values to binary (e.g. [100,1046,20029] becomes ["1100100","10000010110","100111000111101"])
- Check in which of the following ranges the characters are:
  0x00000000 - 0x0000007F (0-127): 0xxxxxxx
  0x00000080 - 0x000007FF (128-2047): 110xxxxx 10xxxxxx
  0x00000800 - 0x0000FFFF (2048-65535): 1110xxxx 10xxxxxx 10xxxxxx
  0x00010000 - 0x001FFFFF (65536-2097151): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
There are also ranges for 5 or 6 bytes, but let's leave them out for now.
The character d will be in the first range, so 1 byte in UTF-8; character Ж is in the second range, so 2 bytes in UTF-8; and character 丽 is in the third range, so 3 bytes in UTF-8.
The x in the pattern behind each range are filled with the binary of these characters, from right to left. So the d (1100100) with pattern 0xxxxxxx becomes 01100100; the Ж (10000010110) with pattern 110xxxxx 10xxxxxx becomes 11010000 10010110; and the 丽 (100111000111101) with pattern 1110xxxx 10xxxxxx 10xxxxxx becomes 1110x100 10111000 10111101, after which the remaining x are replaced with 0: 11100100 10111000 10111101.
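To make this concrete, here is a rough Python sketch of the manual conversion (just as an illustration; it is not the 05AB1E implementation, and the names PATTERNS / to_utf8_bits are my own), using the ranges and patterns listed above:

    # Illustrative sketch of the manual codepoint -> UTF-8 conversion.
    PATTERNS = [
        (0x7F,     "0xxxxxxx"),
        (0x7FF,    "110xxxxx 10xxxxxx"),
        (0xFFFF,   "1110xxxx 10xxxxxx 10xxxxxx"),
        (0x1FFFFF, "11110xxx 10xxxxxx 10xxxxxx 10xxxxxx"),
    ]

    def to_utf8_bits(ch):
        cp = ord(ch)                                  # Unicode value of the character
        pattern = next(p for limit, p in PATTERNS if cp <= limit)
        # Fill the x from right to left; remaining x become 0 (done here by
        # left-padding the binary string with zeros up to the number of x).
        bits = bin(cp)[2:].rjust(pattern.count("x"), "0")
        it = iter(bits)
        return "".join(next(it) if c == "x" else c for c in pattern)

    for ch in "dЖ丽":
        print(ch, to_utf8_bits(ch))
    # d 01100100
    # Ж 11010000 10010110
    # 丽 11100100 10111000 10111101

The printed patterns match the worked example above and, with the spaces removed, Python's own "dЖ丽".encode("utf-8").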
That is also the approach I used in my code. However, instead of checking the actual ranges, I just look at the length of the binary string and compare it to the amount of x in the patterns, since that saves a few bytes.
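As a small illustration (again just a Python sketch, not part of the golfed code; utf8_byte_count is my own name), that length-versus-threshold check could look like this; the commented 05AB1E code below does the equivalent with •Xó•18в, @, ƶ and à:

    # Byte count per character, derived from the length of its binary string.
    # The thresholds [1, 8, 12, 17] are the ones the 05AB1E code decompresses.
    def utf8_byte_count(cp):
        length = len(bin(cp)) - 2        # length of the binary string
        return max(i for i, t in enumerate([1, 8, 12, 17], 1) if length >= t)

    print([utf8_byte_count(ord(c)) for c in "dЖ丽"])   # [1, 2, 3]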
Ç # Convert each character in the string to its unicode value
b # Convert each value to binary
ε # Map over these binary strings:
Dg # Duplicate the string, and get its length
•Xó• # Push compressed integer 8657
18в # Converted to Base-18 as list: [1,8,12,17]
@ # Check for each whether the length is >= this value
# (1 if truthy; 0 if falsey)
ƶ # Multiply each by its 1-based index
à # Pop and get its maximum
© # Store it in the register (without popping)
i # If it is exactly 1 (first range):
7j # Add leading spaces to the binary to make it of length 7
0ì # And prepend a "0"
ë # Else (any of the other ranges):
R # Reverse the binary
6ô # Split it into parts of size 6
Rí # Reverse it (and each individual part) back
ć # Pop, and push the remainder-list and the head separately to the stack
7®- # Calculate 7 minus the value from the register
j # Add leading spaces to the head binary to make it of that length
š # Add it at the start of the remainder-list again
Tì # Prepend "10" before each part
J # Join the list together
1®<× # Repeat "1" (the value from the register minus 1) times
ì # Prepend that at the front
] # Close both the if-else statement and map
ð0: # Replace all spaces with "0"
J # And join all modified binary strings together
# (which is output implicitly - with trailing newline)
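For readers who do not speak 05AB1E, here is a loose Python rendering of the per-character steps above (an illustration based on the comments, not a byte-for-byte translation; the name encode_char_bits is my own, b is the binary string and r the range index from the thresholds):

    # Illustrative sketch of one map iteration, following the comments above.
    def encode_char_bits(b, r):
        if r == 1:                                    # i: first range, one byte
            return ("0" + b.rjust(7)).replace(" ", "0")       # 7j 0ì ... ð0:
        rb = b[::-1]                                  # R: reverse the binary
        chunks = [rb[i:i+6] for i in range(0, len(rb), 6)]    # 6ô: parts of size 6
        parts = [c[::-1] for c in chunks][::-1]       # Rí: undo both reversals
        head, rest = parts[0], parts[1:]              # ć: split off the head
        head = head.rjust(7 - r)                      # 7®-j: pad head with spaces
        parts = ["10" + p for p in [head] + rest]     # š Tì: prepend "10" to each
        out = "1" * (r - 1) + "".join(parts)          # J 1®<× ì: join, prepend 1s
        return out.replace(" ", "0")                  # ð0:: spaces become 0s

    for ch, r in [("d", 1), ("Ж", 2), ("丽", 3)]:
        print(ch, encode_char_bits(bin(ord(ch))[2:], r))
    # d 01100100
    # Ж 1101000010010110
    # 丽 111001001011100010111101

(The ð0: step actually happens once after the map in the real code, but since the spaces only ever occur inside the individual binary strings, doing it per character gives the same joined result.)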
See this 05AB1E answer of mine (sections How to compress large integers? and How to compress integer lists?) to understand why •Xó•18в is [1,8,12,17].
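As a quick sanity check of that compression (a Python one-off, not part of the answer), taking the base-18 digits of 8657 indeed gives the threshold list:

    n, digits = 8657, []
    while n:                      # repeatedly take base-18 digits
        n, d = divmod(n, 18)
        digits.append(d)
    print(digits[::-1])           # [1, 8, 12, 17]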
By "binary", do you mean a string representation of the binary values, i.e. a string consisting of only 1's and 0's? – None – 2019-02-04T18:38:00.987
@mdahmoune Now that's already much better. The question remains how to represent something as UTF-8. Notice that Unicode representation is mainly based on the looks of a character (only occasionally on semantic meaning). What if no assigned Unicode glyph looks like a character in the source code? Unicode also has many look-alikes (homoglyphs). How does one decide which one to use? E.g. Dyalog APL has an AND function which may be encoded as 01011110 or 0010011100100010 in UTF-8 (they look pretty alike: ^ vs ∧) – Adám – 2019-02-04T19:44:15.290
Better example: 01111100 and 0010001100100010 encode | and ∣. – Adám – 2019-02-04T19:51:13.917
@Adám I think it would be fair to output any binary sequence that corresponds to a symbol that will compile/run in a certain implementation of a language. – qwr – 2019-02-04T20:09:22.837
How about machine code? (Commodore C64 takes 28 bytes assuming the machine code itself is the "source") – Martin Rosenau – 2019-02-04T21:18:22.377
Does it have to be a program, not a function that takes a pointer to an output buffer? Like @MartinRosenau, I'm wondering about machine code. If a whole program is required, I think we could probably still follow the usual code-golf rules of only counting the actual executable bytes as "the source", not any executable file metadata. (i.e. the contents of a .text section for an x86-64 Linux executable that base2-dumps itself to stdout, perhaps using RIP-relative addressing to get its own code bytes. Or not, because if we have to be an executable, we can be position-dependent and shorter.) – Peter Cordes – 2019-02-05T02:38:50.723
Does the "binary" dump of UTF-8 have to be in any particular character set or encoding? Can we choose to dump it in a format that packs 8 binary bits per octet (i.e. actual binary UTF-8, a serialization format for Unicode codepoints)? Or do you require a text representation of base 2 digits, using the ASCII subset of UTF-8? Or any choice of pre-existing encoding used for text, like EBCDIC or UTF-16? Basically it annoys me when people use "binary" to mean a serialization format with 1 bit per character, similar to hex. UTF-8 itself is a binary format composed of 0s and 1s. – Peter Cordes – 2019-02-05T02:52:50.097
@PeterCordes If you can dump it in a format that packs 8 binary bits per octet then any standard quine would count as a binary quine. – user253751 – 2019-02-05T03:02:32.413
@immibis: that's exactly my point. In computing, the word "binary" is not sufficient to describe what this question is trying to ask for. It's fairly clear from context what's intended, but I think I could rules-lawyer my way into a standard quine in a UTF-8 source (i.e. not x86 machine code, because that's not always valid UTF-8) with phrases like "the binary UTF-8 representation", because UTF-8 is a binary format for serializing Unicode codepoints. – Peter Cordes – 2019-02-05T03:04:26.670
Oh, that's probably what @MartinRosenau was getting at. What about languages that aren't textual in the first place, like machine code? Many instruction sequences don't form valid UTF-8 sequences, where the upper bits signal how many later bytes are part of the same character. Can we just have a function base-2 dump itself in ASCII/UTF-8? – Peter Cordes – 2019-02-05T03:10:56.530