The aim of this challenge is to find an impossibly short implementation of the following function p, in the language of your choosing. Here is C code implementing it (see this TIO link that also prints its outputs) and a Wikipedia page containing it.

unsigned char pi[] = {
    252,238,221,17,207,110,49,22,251,196,250,218,35,197,4,77,
    233,119,240,219,147,46,153,186,23,54,241,187,20,205,95,193,
    249,24,101,90,226,92,239,33,129,28,60,66,139,1,142,79,
    5,132,2,174,227,106,143,160,6,11,237,152,127,212,211,31,
    235,52,44,81,234,200,72,171,242,42,104,162,253,58,206,204,
    181,112,14,86,8,12,118,18,191,114,19,71,156,183,93,135,
    21,161,150,41,16,123,154,199,243,145,120,111,157,158,178,177,
    50,117,25,61,255,53,138,126,109,84,198,128,195,189,13,87,
    223,245,36,169,62,168,67,201,215,121,214,246,124,34,185,3,
    224,15,236,222,122,148,176,188,220,232,40,80,78,51,10,74,
    167,151,96,115,30,0,98,68,26,184,56,130,100,159,38,65,
    173,69,70,146,39,94,85,47,140,163,165,125,105,213,149,59,
    7,88,179,64,134,172,29,247,48,55,107,228,136,217,231,137,
    225,27,131,73,76,63,248,254,141,83,170,144,202,216,133,97,
    32,113,103,164,45,43,9,91,203,155,37,208,190,229,108,82,
    89,166,116,210,230,244,180,192,209,102,175,194,57,75,99,182,
};

unsigned char p(unsigned char x) {
     return pi[x];
}

What is `p`

p is a component of two Russian cryptographic standards, namely the hash function Streebog and the block cipher Kuznyechik. In this article (and during ISO meetings), the designers of these algorithms claimed that they generated the array pi by picking random 8-bit permutations.

"Impossible" implementations

There are \$256! \approx 2^{1684}\$ permutations on 8 bits. Hence, a program that implements a given random permutation should not be expected to need fewer than 1683 bits.
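As a sanity check on this bound, the size of \$256!\$ can be computed exactly. The following is a minimal sketch (the helper names factorial_bit_length and check_bit_length are ours, not part of the challenge) that multiplies out the factorial in base \$2^{32}\$ and counts the bits of the result:

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to write n! in binary (exact for n <= 256). */
unsigned factorial_bit_length(unsigned n) {
    uint32_t w[64] = {1};          /* little-endian base-2^32 digits of n! */
    unsigned len = 1;
    for (unsigned i = 2; i <= n; i++) {
        uint64_t carry = 0;
        for (unsigned j = 0; j < len; j++) {
            uint64_t t = (uint64_t)w[j] * i + carry;
            w[j] = (uint32_t)t;
            carry = t >> 32;
        }
        if (carry) w[len++] = (uint32_t)carry;
    }
    unsigned bits = 32 * (len - 1);
    for (uint32_t top = w[len - 1]; top; top >>= 1) bits++;
    return bits;
}

/* Self-check: 2^1683 <= 256! < 2^1684. */
void check_bit_length(void) {
    assert(factorial_bit_length(256) == 1684);
}
```

A bit length of 1684 means that \$\log_2(256!)\$ lies between 1683 and 1684, which is consistent with the figures quoted above.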

However, we have found multiple abnormally small implementations (which we list here), for example the following C program:

p(x){unsigned char*k="@`rFTDVbpPBvdtfR@\xacp?\xe2>4\xa6\xe9{z\xe3q5\xa7\xe8",l=0,b=17;while(--l&&x^1)x=2*x^x/128*285;return l%b?k[l%b]^k[b+l/b]^b:k[l/b]^188;}

which contains only 158 characters and thus fits in 1264 bits. Click here to see that it works.

We talk about an "impossibly" short implementation because, if the permutation was the output of a random process (as claimed by its designers), then a program this short would not exist (see this page for more details).

Reference implementation

A more readable version of the previous C code is:

unsigned char p(unsigned char x){
     unsigned char
         s[]={1,221,146,79,147,153,11,68,214,215,78,220,152,10,69},
         k[]={0,32,50,6,20,4,22,34,48,16,2,54,36,52,38,18,0};
     if(x != 0) {
         unsigned char l=1, a=2;
         while(a!=x) {
             a=(a<<1)^(a>>7)*29;
             l++;
         }
         unsigned char i = l % 17, j = l / 17;
         if (i != 0) return 252^k[i]^s[j];
         else return 252^k[j];
     }
     else return 252;
}
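To check the readable version against the lookup table, one possible harness is sketched below. It copies the table pi and the function from above (renamed p_ref here to keep the sketch self-contained; count_mismatches is a hypothetical helper name) and compares them on all 256 inputs:

```c
#include <assert.h>

static const unsigned char pi[256] = {
    252,238,221,17,207,110,49,22,251,196,250,218,35,197,4,77,
    233,119,240,219,147,46,153,186,23,54,241,187,20,205,95,193,
    249,24,101,90,226,92,239,33,129,28,60,66,139,1,142,79,
    5,132,2,174,227,106,143,160,6,11,237,152,127,212,211,31,
    235,52,44,81,234,200,72,171,242,42,104,162,253,58,206,204,
    181,112,14,86,8,12,118,18,191,114,19,71,156,183,93,135,
    21,161,150,41,16,123,154,199,243,145,120,111,157,158,178,177,
    50,117,25,61,255,53,138,126,109,84,198,128,195,189,13,87,
    223,245,36,169,62,168,67,201,215,121,214,246,124,34,185,3,
    224,15,236,222,122,148,176,188,220,232,40,80,78,51,10,74,
    167,151,96,115,30,0,98,68,26,184,56,130,100,159,38,65,
    173,69,70,146,39,94,85,47,140,163,165,125,105,213,149,59,
    7,88,179,64,134,172,29,247,48,55,107,228,136,217,231,137,
    225,27,131,73,76,63,248,254,141,83,170,144,202,216,133,97,
    32,113,103,164,45,43,9,91,203,155,37,208,190,229,108,82,
    89,166,116,210,230,244,180,192,209,102,175,194,57,75,99,182,
};

/* The readable implementation, copied from the text above. */
unsigned char p_ref(unsigned char x) {
    static const unsigned char
        s[] = {1,221,146,79,147,153,11,68,214,215,78,220,152,10,69},
        k[] = {0,32,50,6,20,4,22,34,48,16,2,54,36,52,38,18,0};
    if (x == 0) return 252;
    unsigned char l = 1, a = 2;
    while (a != x) {                  /* brute-force discrete logarithm */
        a = (a << 1) ^ (a >> 7) * 29;
        l++;
    }
    unsigned char i = l % 17, j = l / 17;
    return i != 0 ? 252 ^ k[i] ^ s[j] : 252 ^ k[j];
}

/* Number of inputs on which p_ref disagrees with the table. */
int count_mismatches(void) {
    int bad = 0;
    for (int x = 0; x < 256; x++)
        if (p_ref((unsigned char)x) != pi[x]) bad++;
    return bad;
}

/* Self-check: the two implementations agree everywhere. */
void check_reference(void) {
    assert(count_mismatches() == 0);
}
```

If the two implementations agree, count_mismatches() returns 0.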

The table k is such that k[x] = L(16-x), where L is linear in the sense that L(x^y)==L(x)^L(y), and where, like in C, ^ denotes the XOR. However, we did not manage to leverage this property to shorten our implementation. We are not aware of any structure in s that could allow a simpler implementation. Its output is always in the subfield, though, i.e. \$s[x]^{16}=s[x]\$ where the exponentiation is done in the finite field. Of course, you are absolutely free to use a simpler expression of s should you find one!
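Both properties can be checked mechanically. The sketch below makes two assumptions: that linearity is tested for x and y in 0..15 (so that x^y stays inside the table), and that the field is the one used by the while loop, i.e. reduction modulo \$X^8+X^4+X^3+X^2+1\$. The helper names gf_mul, s_in_subfield and L_is_linear are ours:

```c
#include <assert.h>
#include <stdint.h>

static const uint8_t s[15] = {1,221,146,79,147,153,11,68,214,215,78,220,152,10,69};
static const uint8_t k[17] = {0,32,50,6,20,4,22,34,48,16,2,54,36,52,38,18,0};

/* Multiplication in GF(2^8) modulo X^8+X^4+X^3+X^2+1 (shift-and-add). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a >> 7) * 29));   /* a *= X */
        b >>= 1;
    }
    return r;
}

/* Returns 1 if s[j]^16 == s[j] for every entry of s. */
int s_in_subfield(void) {
    for (int j = 0; j < 15; j++) {
        uint8_t t = s[j];
        for (int r = 0; r < 4; r++) t = gf_mul(t, t);   /* four squarings: t = s[j]^16 */
        if (t != s[j]) return 0;
    }
    return 1;
}

/* With L(x) = k[16 - x], returns 1 if L(x^y) == L(x)^L(y) for x, y in 0..15. */
int L_is_linear(void) {
    for (int x = 0; x < 16; x++)
        for (int y = 0; y < 16; y++)
            if (k[16 - (x ^ y)] != (k[16 - x] ^ k[16 - y])) return 0;
    return 1;
}

/* Self-check of both properties. */
void check_tables(void) {
    assert(s_in_subfield());
    assert(L_is_linear());
}
```

Both functions should return 1 for the tables given above.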

The while loop corresponds to the evaluation of a discrete logarithm in the finite field with 256 elements. It works via a simple brute-force search: the dummy variable a is set to a generator of the finite field, and it is multiplied by this generator until the result is equal to x. At that point, l is the discrete log of x. This function is not defined at 0, hence the special case handled by the if statement.
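Isolated from the rest of the function, the brute-force search can be sketched as follows (dlog and check_dlog are illustrative names, not taken from the reference code). The check confirms that every nonzero byte gets a distinct logarithm in 1..255, which is what it means for 2 to be a generator:

```c
#include <assert.h>

/* Brute-force discrete logarithm of x (x != 0) to the base 2 in GF(2^8). */
unsigned dlog(unsigned char x) {
    unsigned char a = 2;      /* the generator, i.e. the polynomial X */
    unsigned l = 1;           /* invariant: a == generator^l */
    while (a != x) {
        a = (unsigned char)((a << 1) ^ (a >> 7) * 29);   /* a *= X */
        l++;
    }
    return l;
}

/* Every value in 1..255 should occur exactly once as a logarithm. */
void check_dlog(void) {
    unsigned char seen[256] = {0};
    for (unsigned x = 1; x < 256; x++) {
        unsigned l = dlog((unsigned char)x);
        assert(l >= 1 && l <= 255 && !seen[l]);
        seen[l] = 1;
    }
}
```

For instance, dlog(2) is 1 and dlog(1) is 255, matching the convention used by the reference implementation.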

The multiplication by the generator can be seen as a multiplication by \$X\$ in \$\mathbb{F}_2[X]\$ which is then reduced modulo the polynomial \$X^8+X^4+X^3+X^2+1\$. Declaring a as an unsigned char ensures that it stays within 8 bits. Alternatively, we could use a=(a<<1)^(a>>7)*(256^29), where 256^29 is the XOR of 256 and 29, i.e. 285; in that case a could be an int (or any other integer type). On the other hand, it is necessary to start with l=1,a=2 as we need to have l=255 when x is equal to 1.
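The equivalence of the two update rules can be checked exhaustively. In this sketch, step8, step_int and check_steps are hypothetical names for the 8-bit step, the int step and the comparison:

```c
#include <assert.h>

/* One multiplication by X with an 8-bit variable, as in the reference code. */
unsigned char step8(unsigned char a) {
    return (unsigned char)((a << 1) ^ (a >> 7) * 29);
}

/* The same step with a plain int: XORing with 256^29 == 285 also clears bit 8. */
int step_int(int a) {
    return (a << 1) ^ (a >> 7) * (256 ^ 29);
}

/* The two steps agree on every byte value. */
void check_steps(void) {
    for (int a = 0; a < 256; a++)
        assert(step_int(a) == step8((unsigned char)a));
}
```

Since step_int maps 0..255 into 0..255, the int-based loop never leaves the byte range, which is why the golfed program can use the constant 285.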

More details on the properties of p are presented in our paper, with a writeup of most of our optimizations to obtain the previous short implementation.

Rules

Propose a program that implements the function p in fewer than 1683 bits. The shorter the program, the more abnormal it is, so within a given language, shorter is better. If your language happens to have Kuznyechik, Streebog or p as a builtin, you cannot use them.

The metric we use to determine the best implementation is the program length in bytes. We use the bit-length in our academic paper but we stick to bytes here for the sake of simplicity.

If your language does not have a clear notion of function, argument or output, the encoding is up to you to define, but tricks like encoding the value pi[x] as x are obviously forbidden.

We have already submitted a research paper with our findings on this topic. It is available here. However, should it be published in a scientific venue, we will gladly acknowledge the authors of the best implementations.

By the way, thanks to xnor for his help when drafting this question!

picarresursix

Posted 2019-06-07T05:14:03.617

Reputation: 871

12 I hope someone submits an answer in Seed. – Robin Ryder – 2019-06-07T05:43:10.820

That looks quite interesting:) Can you explain: What exactly does L do? Regarding input/output: By default we allow a variety of input and output methods for code-golf to accommodate for the different types of languages. If there is something that you think does not apply for this challenge please mention this. I'm curious to see what people come up with!

– flawr – 2019-06-07T06:55:18.757

7 Similarly, can, for example, brainfuck code be scored at 3 bits per character if it has no nops? And is the 1683 bits at most a strict restriction [sic?] or the goal? – my pronoun is monicareinstate – 2019-06-07T10:04:32.193

35 "If the permutation was the output of a random process (as claimed by its designers), then a program this short would not exist" I disagree (although it doesn't matter for the challenge). If it was the output of a random process, it would be unlikely that such program existed; or it would be difficult to find – Luis Mendo – 2019-06-07T11:13:36.970

5 @LuisMendo For probabilities less than 1 in 10^100, insisting on the term "unlikely" rather than "impossible" is just pedantry. (At least in a programming context. If we were doing pure math, the distinction would be important). – Grimmy – 2019-06-07T11:18:51.200

8 @Grimy The statement then a program this short would not exist is a conceptual one (not a programming one). Using your terms, it belongs to the pure-math world, not to the programming world – Luis Mendo – 2019-06-07T12:16:20.807

For your C program, you can assign the last value to any variable instead of returning it, and it'll work as if you returned it in GCC (I think that's UB though) – my pronoun is monicareinstate – 2019-06-07T15:59:23.810

1 In the paper, an 80 byte solution was achieved. Perhaps this should be mentioned in the challenge. – lirtosiast – 2019-06-07T16:56:03.520

5 Although the "program this short would not exist" statement is technically false, the chances of a program under 1300 bits existing for a completely random 1683-bit sequence are about 1 in 10^115 – BlueRaja - Danny Pflughoeft – 2019-06-07T19:18:38.277

1 @lirtosiast That's what I was thinking. The S-box was reverse engineered into several different possible representations, and those representations would be far easier to compress and golf than the raw permutation of values. – forest – 2019-06-08T02:05:55.700

7 It may have been already noticed, but just in case: \$s_i \text{ XOR } s_{i-1}\$ results in only 8 distinct values: \$1,10,68,79,146,153,220,221\$ (starting with \$i=1\$ and assuming \$s_0=0\$). – Arnauld – 2019-06-08T08:43:42.727

4 Also, there are countless ways of sorting \$s\$ such that \$s_i\text{ XOR }s_{i-1}\$ results in only 4 distinct values. Example: [ 11, 153, 146, 221, 79, 68, 214, 215, 152, 147, 220, 78, 1, 10, 69 ] can be built by XOR'ing with { 1, 11, 79, 146 }. – Arnauld – 2019-06-08T08:51:51.100

4 I am very impressed by the implementations found! Let me answer some questions. I stuck with the byte count as a measure of size as a finer-grained estimate can be difficult in some contexts. It also makes little difference between different programs in a given language. The patterns in the table found by Arnauld are consequences of the linearity of the corresponding component. Finally, the term "impossible" is indeed used a bit loosely here; I am more rigorous in the explanation on my website (and in the paper). Still, we are talking about very, very unlikely events. – picarresursix – 2019-06-09T16:21:34.897

This is likely a botched attempt to create a kleptographic backdoor.

– wizzwizz4 – 2019-08-19T19:31:47.987

Answers

AMD64 Assembly (78 bytes or 624 bits of machine code)

uint8_t SubByte(uint8_t x) {
    uint8_t y,z;
    uint8_t s[]=
      {1,221,146,79,147,153,11,68,214,215,78,220,152,10,69};

    uint8_t k[]=
      {0,32,50,6,20,4,22,34,48,16,2,54,36,52,38,18,0};

    if(x) {
      for(y=z=1;(y=(y<<1)^((y>>7)*29)) != x;z++);
      x = (z % 17);
      z = (z / 17);
      x = (x) ? k[x] ^ s[z] : k[z];
    }
    return x^252;
}
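The assembly listings below store k with the final XOR by 252 already folded in, so that no separate XOR is needed before returning. A small sketch confirming that the assembly table is k[i]^252 entry by entry (k_asm and folded_table_matches are names introduced here for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* k as used in the C code above. */
static const uint8_t k[17] = {0,32,50,6,20,4,22,34,48,16,2,54,36,52,38,18,0};

/* k as stored in the assembly listings. */
static const uint8_t k_asm[17] = {
    0xfc, 0xdc, 0xce, 0xfa, 0xe8, 0xf8, 0xea, 0xde,
    0xcc, 0xec, 0xfe, 0xca, 0xd8, 0xc8, 0xda, 0xee, 0xfc
};

/* Returns 1 if k_asm[i] == k[i] ^ 252 for every index. */
int folded_table_matches(void) {
    for (int i = 0; i < 17; i++)
        if (k_asm[i] != (k[i] ^ 252)) return 0;
    return 1;
}

/* Self-check of the folded table. */
void check_folded_table(void) {
    assert(folded_table_matches());
}
```

Because exactly one entry of k is used on each path, XORing 252 into the table is equivalent to the final x^252 in the C version.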

64-bit x86 assembly

    ; 78 bytes of AMD64 assembly
    ; odzhan
    bits   64

    %ifndef BIN
      global SubBytex
    %endif

SubBytex:
    mov    al, 252
    jecxz  L2                ; if(x) {
    call   L0
k:
    db     0xfc, 0xdc, 0xce, 0xfa, 0xe8, 0xf8, 0xea, 0xde, 
    db     0xcc, 0xec, 0xfe, 0xca, 0xd8, 0xc8, 0xda, 0xee, 0xfc
s:
    db     0x01, 0xdd, 0x92, 0x4f, 0x93, 0x99, 0x0b, 0x44, 
    db     0xd6, 0xd7, 0x4e, 0xdc, 0x98, 0x0a, 0x45
L0:
    pop    rbx
    mov    al, 1             ; y = 1
    cdq                      ; z = 0
L1:
    inc    dl                ; z++
    add    al, al            ; y = y + y
    jnc    $+4               ; skip XOR if no carry
    xor    al, 29            ;
    cmp    al, cl            ; if(y != x) goto L1
    jne    L1    

    xchg   eax, edx          ; eax = z
    cdq                      ; edx = 0
    mov    cl, 17            ; divisor
    div    ecx               ; eax = z / 17, edx = z % 17

    mov    cl, [rbx+rax+17]  ; cl = s[z]
    xlatb                    ; al = k[z]
    test   dl, dl            ; if(x == 0) goto L2
    jz     L2
    xchg   eax, edx          ; al = x
    xlatb                    ; al = k[x]
    xor    al, cl            ; al ^= s[z]
L2:
    ret

Disassembled 64-bit code

00000000  B0FC              mov al,0xfc
00000002  67E348            jecxz 0x4d
00000005  E820000000        call qword 0x2a
; k[] = 0xfc, 0xdc, 0xce, 0xfa, 0xe8, 0xf8, 0xea, 0xde, 
;       0xcc, 0xec, 0xfe, 0xca, 0xd8, 0xc8, 0xda, 0xee, 0xfc
; s[] = 0x01, 0xdd, 0x92, 0x4f, 0x93, 0x99, 0x0b, 0x44, 
;       0xd6, 0xd7, 0x4e, 0xdc, 0x98, 0x0a, 0x45
0000002A  5B                pop rbx
0000002B  B001              mov al,0x1
0000002D  99                cdq
0000002E  FEC2              inc dl
00000030  00C0              add al,al
00000032  7302              jnc 0x36
00000034  341D              xor al,0x1d
00000036  38C8              cmp al,cl
00000038  75F4              jnz 0x2e
0000003A  92                xchg eax,edx
0000003B  99                cdq
0000003C  B111              mov cl,0x11
0000003E  F7F1              div ecx
00000040  8A4C0311          mov cl,[rbx+rax+0x11]
00000044  D7                xlatb
00000045  84D2              test dl,dl
00000047  7404              jz 0x4d
00000049  92                xchg eax,edx
0000004A  D7                xlatb
0000004B  30C8              xor al,cl
0000004D  C3                ret

32-bit x86 assembly

    ; 72 bytes of x86 assembly
    ; odzhan
    bits   32

    %ifndef BIN
      global SubBytex
      global _SubBytex
    %endif

SubBytex:
_SubBytex:
    mov    al, 252
    jecxz  L2                ; if(x) {
    call   L0
k:
    db     0xfc, 0xdc, 0xce, 0xfa, 0xe8, 0xf8, 0xea, 0xde, 
    db     0xcc, 0xec, 0xfe, 0xca, 0xd8, 0xc8, 0xda, 0xee, 0xfc
s:
    db     0x01, 0xdd, 0x92, 0x4f, 0x93, 0x99, 0x0b, 0x44, 
    db     0xd6, 0xd7, 0x4e, 0xdc, 0x98, 0x0a, 0x45
L0:
    pop    ebx
    mov    al, 1             ; y = 1
    cdq                      ; z = 0
L1:
    inc    edx               ; z++
    add    al, al            ; y = y + y
    jnc    $+4               ; skip XOR if no carry
    xor    al, 29            ;
    cmp    al, cl            ; if(y != x) goto L1
    jne    L1    
    xchg   eax, edx          ; al = z
    aam    17                ; al|x = z % 17, ah|z = z / 17
    mov    cl, ah            ; cl = z
    cmove  eax, ecx          ; if(x == 0) al = z else al = x
    xlatb                    ; al = k[z] or k[x]
    jz     L2                ; if(x == 0) goto L2
    xor    al, [ebx+ecx+17]  ; al ^= s[z]
L2:
    ret
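For readers less familiar with aam, the instructions after the loop can be modelled in C roughly as follows. This is only a sketch of the data flow, not a cycle-accurate emulation; tail32 is a hypothetical name, and k_asm is the table from the listing (k with 252 folded in):

```c
#include <assert.h>
#include <stdint.h>

static const uint8_t k_asm[17] = {
    0xfc, 0xdc, 0xce, 0xfa, 0xe8, 0xf8, 0xea, 0xde,
    0xcc, 0xec, 0xfe, 0xca, 0xd8, 0xc8, 0xda, 0xee, 0xfc
};
static const uint8_t s[15] = {
    0x01, 0xdd, 0x92, 0x4f, 0x93, 0x99, 0x0b, 0x44,
    0xd6, 0xd7, 0x4e, 0xdc, 0x98, 0x0a, 0x45
};

/* Models the code after the loop, for a discrete log z in 1..255. */
uint8_t tail32(uint8_t z) {
    uint8_t al = z % 17;                 /* aam 17: AL = remainder, sets ZF */
    uint8_t ah = z / 17;                 /* aam 17: AH = quotient           */
    if (al == 0) return k_asm[ah];       /* cmove + xlatb, then jz to ret   */
    return k_asm[al] ^ s[ah];            /* xlatb, then xor with s[z / 17]  */
}

/* Spot checks against the table pi: p(2) = 221 and p(1) = 238. */
void check_tail32(void) {
    assert(tail32(1) == 221);
    assert(tail32(255) == 238);
}
```

Here z = 1 is the discrete log of 2 and z = 255 is the discrete log of 1, so the two spot checks correspond to pi[2] and pi[1].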

Disassembled 32-bit code

00000000  B0FC              mov al,0xfc
00000002  E343              jecxz 0x47
00000004  E820000000        call dword 0x29
; k[] = 0xfc, 0xdc, 0xce, 0xfa, 0xe8, 0xf8, 0xea, 0xde, 
;       0xcc, 0xec, 0xfe, 0xca, 0xd8, 0xc8, 0xda, 0xee, 0xfc
; s[] = 0x01, 0xdd, 0x92, 0x4f, 0x93, 0x99, 0x0b, 0x44, 
;       0xd6, 0xd7, 0x4e, 0xdc, 0x98, 0x0a, 0x45
00000029  5B                pop ebx
0000002A  B001              mov al,0x1
0000002C  99                cdq
0000002D  42                inc edx
0000002E  00C0              add al,al
00000030  7302              jnc 0x34
00000032  341D              xor al,0x1d
00000034  38C8              cmp al,cl
00000036  75F5              jnz 0x2d
00000038  92                xchg eax,edx
00000039  D411              aam 0x11
0000003B  88E1              mov cl,ah
0000003D  0F44C1            cmovz eax,ecx
00000040  D7                xlatb
00000041  7404              jz 0x47
00000043  32440B11          xor al,[ebx+ecx+0x11]
00000047  C3                ret

odzhan

Posted 2019-06-07T05:14:03.617

Reputation: 491

1 Nice answer! Since the OP was looking for bit count, this (85 bytes) comes out to 680 bits, using 8 bits per byte, or 595 bits using 7 bits per byte (possible since all the characters are ASCII). You could probably go shorter if you compressed to an even more restrictive character set. – Cullub – 2019-06-07T16:41:19.277

1 Welcome to PPCG; nice first solution. – Shaggy – 2019-06-07T17:27:38.413

@Cullub I don't understand your comment about ASCII, surely that doesn't apply to machine code? – ArBo – 2019-06-07T17:46:20.120

It doesn't apply to this one in specific because amd64 uses 8 bit characters. However, all of ASCII can fit into just 7bits, so if we're talking bit counts, and just counting up the number of bits used in the code above, which comprises solely of ASCII characters, we could use special counting methods if we wanted to – Cullub – 2019-06-07T17:51:14.640

9 @Cullub My point was, the code in this answer is just C/Assembler that gets compiled, so the byte count isn't that of the given code, but of the compiled code. And that's not ASCII. – ArBo – 2019-06-07T17:58:28.217

7 Just for clarification, the 84 bytes is the size of the machine code after this has been assembled? If so the title should be updated to reflect that this is a machine code answer rather than an assembly answer. – Potato44 – 2019-06-07T23:06:23.450

@odzhan: machine-code answers should include the actual answer (a hexdump of the machine code)! e.g. like a nasm -l/dev/stdout listing or objdump -d, or a separate block of hexdump separate from the source code. See Tips for golfing in x86/x64 machine code for examples. My answer on The Chroma Key to Success shows how to use nasm | cut -b -28,$((28+12))- to get reasonably narrow listings that fit well into answers.

– Peter Cordes – 2019-06-08T19:36:31.370

1 And BTW, you don't have to use a standard calling convention; you can use a custom convention where RBX is call-clobbered, saving 2 bytes for push/pop. (And where uint8_t args are zero-extended to 64-bit for JRCXZ). Also, if you write position-dependent code you can put the table address into a register with a 5-byte mov ebx, imm32 instead of a 6-byte call/pop. Or use it as a disp32 in mov al, [table + rax], but that might lose since you have two xlatb and a mov already. The call+pop shellcode trick does win vs. 7-byte RIP-relative LEA with the data after the ret, though. – Peter Cordes – 2019-06-08T19:45:45.323

Have you considered using 32-bit mode for division by an immediate with aam 17? You can still use a register-arg calling convention, and clobber as many regs as you want with a custom calling convention.

– Peter Cordes – 2019-06-08T19:47:46.570

Hey all, thanks for the feedback. @PeterCordes I did consider writing a 32-bit version and using AAM for the division, but haven't tried. Wanted to avoid saving+restoring RBX. This works if assembled with NASM/YASM and linked with C code using MSVC or MinGW. – odzhan – 2019-06-08T21:27:59.160

To make it testable from C, you can write a wrapper function in asm (that saves/restores extra registers and adapts for args / return value in different regs), and call that wrapper from C. The wrapper isn't part of your answer / byte-count, because the language you're using is machine-code (where custom calling conventions for private helper functions is fine), not C. – Peter Cordes – 2019-06-08T21:30:32.260

Okay, I didn't know that would be allowed. I'll update the answer. – odzhan – 2019-06-08T21:31:31.720

CJam, 72 67 66 63 bytes

ri{_2md142*^}es*]2#~Hmd{5\}e|Fm2b"Ý0$&ÜÖD
×EON".*Lts:^i

es* repeats the preceding block as many times as the current timestamp, which is a big number, so this version would take far too long to finish.

Actually testable version, 64 bytes:

ri{_2md142*^}256*]2#~Hmd{5\}e|Fm2b"Ý0$&ÜÖD
×EON".*Lts:^i

00000000: 7269 7b5f 326d 6431 3432 2a5e 7d32 3536  ri{_2md142*^}256
00000010: 2a5d 3223 7e48 6d64 7b35 5c7d 657c 466d  *]2#~Hmd{5\}e|Fm
00000020: 3262 22dd 3024 2612 dc99 d644 0092 0b0a  2b".0$&....D....
00000030: 98d7 454f 934e 0122 2e2a 4c74 733a 5e69  ..EO.N.".*Lts:^i

Proving that a Russian cryptographic standard is too structured
