Gimli, make it even shorter?

25

2

I'm one of the authors of Gimli. We already have a 2-tweet (280 chars) version in C but I would like to see how small it can get.

Gimli (paper,website) is a high speed with high security level cryptographic permutation design that will be presented at the Conference on Cryptographic Hardware and Embedded Systems (CHES) 2017 (September 25-28).

The task

As usual: to make the smalled usable implementation of Gimli in the language of your choice.

It should be able to take as input 384 bits (or 48 bytes, or 12 unsigned int...) and return (may modify in place if you use pointers) the result of Gimli applied on these 384 bits.

Input conversion from decimal, hexadecimal, octal or binary is allowed.

Potential corner cases

Integer encoding is assumed to be little-endian (e.g. what you probably already have).

You may rename Gimli into G but it must still be a function call.

Who wins?

This is code-golf so the shortest answer in bytes wins! Standard rules apply of course.

A reference implementation is provided below.

Note

Some concern has been raised:

"hey gang, please implement my program for free in other languages so I don't have to" (thx to @jstnthms)

My answer is as follow:

I can easily do it in Java, C#, JS, Ocaml... It is more for the fun. Currently We (the Gimli team) have it implemented (and optimized) on AVR, Cortex-M0, Cortex-M3/M4, Neon, SSE, SSE-unrolled, AVX, AVX2, VHDL and Python3. :)


About Gimli

The state

Gimli applies a sequence of rounds to a 384-bit state. The state is represented as a parallelepiped with dimensions 3×4×32 or, equivalently, as a 3×4 matrix of 32-bit words.

state

Each round is a sequence of three operations:

  • a non-linear layer, specifically a 96-bit SP-box applied to each column;
  • in every second round, a linear mixing layer;
  • in every fourth round, a constant addition.

The non-linear layer.

The SP-box consists of three sub-operations: rotations of the first and second words; a 3-input nonlinear T-function; and a swap of the first and third words.

SP

The linear layer.

The linear layer consists of two swap operations, namely Small-Swap and Big-Swap. Small-Swap occurs every 4 rounds starting from the 1st round. Big-Swap occurs every 4 rounds starting from the 3rd round.

Linear

The round constants.

There are 24 rounds in Gimli, numbered 24,23,...,1. When the round number r is 24,20,16,12,8,4 we XOR the round constant (0x9e377900 XOR r) to the first state word.

enter image description here

reference source in C

#include <stdint.h>

uint32_t rotate(uint32_t x, int bits)
{
  if (bits == 0) return x;
  return (x << bits) | (x >> (32 - bits));
}

extern void gimli(uint32_t *state)
{
  int round;
  int column;
  uint32_t x;
  uint32_t y;
  uint32_t z;

  for (round = 24; round > 0; --round)
  {
    for (column = 0; column < 4; ++column)
    {
      x = rotate(state[    column], 24);
      y = rotate(state[4 + column],  9);
      z =        state[8 + column];

      state[8 + column] = x ^ (z << 1) ^ ((y&z) << 2);
      state[4 + column] = y ^ x        ^ ((x|z) << 1);
      state[column]     = z ^ y        ^ ((x&y) << 3);
    }

    if ((round & 3) == 0) { // small swap: pattern s...s...s... etc.
      x = state[0];
      state[0] = state[1];
      state[1] = x;
      x = state[2];
      state[2] = state[3];
      state[3] = x;
    }
    if ((round & 3) == 2) { // big swap: pattern ..S...S...S. etc.
      x = state[0];
      state[0] = state[2];
      state[2] = x;
      x = state[1];
      state[1] = state[3];
      state[3] = x;
    }

    if ((round & 3) == 0) { // add constant: pattern c...c...c... etc.
      state[0] ^= (0x9e377900 | round);
    }
  }
}

Tweetable version in C

This might not be the smallest usable implementation but we wanted to have a C standard version (thus no UB, and "usable" in a library).

#include<stdint.h>
#define P(V,W)x=V,V=W,W=x
void gimli(uint32_t*S){for(long r=24,c,x,y,z;r;--r%2?P(*S,S[1+y/2]),P(S[3],S[2-y/2]):0,*S^=y?0:0x9e377901+r)for(c=4;c--;y=r%4)x=S[c]<<24|S[c]>>8,y=S[c+4]<<9|S[c+4]>>23,z=S[c+8],S[c]=z^y^8*(x&y),S[c+4]=y^x^2*(x|z),S[c+8]=x^2*z^4*(y&z);}

Test vector

The following input generated by

for (i = 0;i < 12;++i) x[i] = i * i * i + i * 0x9e3779b9;

and "printed" values by

for (i = 0;i < 12;++i) {
  printf("%08x ",x[i])
  if (i % 4 == 3) printf("\n");
}

thus:

00000000 9e3779ba 3c6ef37a daa66d46 
78dde724 1715611a b54cdb2e 53845566 
f1bbcfc8 8ff34a5a 2e2ac522 cc624026 

should return:

ba11c85a 91bad119 380ce880 d24c2c68 
3eceffea 277a921c 4f73a0bd da5a9cd8 
84b673f0 34e52ff7 9e2bef49 f41bb8d6 

Biv

Posted 2017-09-08T22:29:41.697

Reputation: 363

3A tweet is 140 chars, not a 280 – Stan Strum – 2017-09-08T22:36:06.843

1

I know, which is why it fits into 2 ;) https://twitter.com/TweetGimli .

– Biv – 2017-09-08T22:36:37.563

10"hey gang, please implement my program for free in other languages so I don't have to" – jstnthms – 2017-09-08T22:53:41.500

hahaha Nah I already have it in Python, and I can easily do it in Java, C#, JS. It is more for the fun. :) – Biv – 2017-09-08T22:54:35.643

Currently I have it implemented on AVR, Cortex-M0, Cortex-M3/M4, Neon, SSE, SSE-unrolled, AVX, AVX2 and Python3. – Biv – 2017-09-08T22:55:52.940

5

The reference code on the website has a crucial error, -round instead of --round means that it never terminates. Converting -- to an en dash is probably not suggested in code :)

– orlp – 2017-09-09T00:21:01.237

@Shaggy updated. – Biv – 2017-09-09T01:14:42.030

"2-tweet" hey, now you have a 1-tweet version! – undergroundmonorail – 2017-11-11T01:55:15.140

my 112-bytes binary version is only 154 bytes in base64. You can tweet that now. – peter ferrie – 2017-11-15T06:44:12.377

and when base85-encoded, it's exactly 140 bytes – peter ferrie – 2017-11-15T17:31:14.073

Answers

4

CJam (114 chars)

{24{[4/z{[8ZT].{8\#*G8#:Mmd+}__)2*\+.^W%\[_~;&8*\~@1$|2*@@&4*].^Mf%}%z([7TGT]R=4e!=\f=(2654435608R-_4%!*^\@]e_}fR}

This is an anonymous block (function): if you want to name it G then append :G. In CJam assigned names can only be single upper-case letters. There's space to append a comment e# Gimli in CJam and have characters left in a single tweet.

Online test

Dissection

{                                e# Define a block
  24{                            e# For R=0 to 23...
    [                            e#   Collect values in an array
      4/z                        e#     Transpose to columns
      {                          e#     Map over each column
        [8ZT].{8\#*G8#:Mmd+}     e#       Rotations, giving [x y z]
        __)2*\+.^W%\             e#       => [y^z x^y x^z*2] [x y z]
        [_~;&8*\~@1$|2*@@&4*].^  e#       => [x' y' z']
        Mf%                      e#       Map out any bits which overflowed
      }%
      z                          e#    Transpose to rows
      ([7TGT]R=4e!=\f=           e#    Permute first row
      (2654435608R-_4%!*^        e#    Apply round constant to first element
      \@                         e#    Put the parts in the right order
    ]e_                          e#  Finish collecting in array and flatten
  }fR
}

Peter Taylor

Posted 2017-09-08T22:29:41.697

Reputation: 41 901

For a moment I was thrown of by the fact that the ouput was not in hex (in the online test). :) – Biv – 2017-09-14T07:56:20.373

16

C (gcc), 237 bytes

#define P(a,l)x=a;a=S[c=l>>r%4*2&3];S[c]=x;
r,c,x,y,z;G(unsigned*S){
for(r=24;r;*S^=r--%4?0:0x9e377901+r){
for(c=4;c--;*S++=z^y^8*(x&y))
x=*S<<24|*S>>8,y=S[4]<<9|S[4]>>23,z=S[8],S[8]=x^2*z^4*(y&z),S[4]=y^x^2*(x|z);
S-=4;P(*S,33)P(S[3],222)}}

I probably gained bytes with my swapping method, but it's too cute not to use.

orlp

Posted 2017-09-08T22:29:41.697

Reputation: 37 067

lost or gained? – HyperNeutrino – 2017-09-09T01:41:33.883

@HyperNeutrino Gained, making me a loser :) – orlp – 2017-09-09T01:42:06.573

Ah ok :P makes sense :P :P – HyperNeutrino – 2017-09-09T01:43:59.247

This is still definitely an improvement, but it's somewhat cheating to use unsigned instead of uint32_t (and OP's code was somewhat cheating to use long) because the idea behind the cipher is that it's highly portable. (In fact, fundamentally this saves a mere 8 bytes). – Peter Taylor – 2017-09-09T09:20:18.817

1@PeterTaylor Even though my code is similar, I'm not really competing against OP's code. I'm working under the rules of PPCG, where it must work with at least an implementation on a platform, and it does with gcc on a 32-bit or 64-bit Intel CPU (and probably many more). – orlp – 2017-09-09T11:56:56.827

@orlp I really like your column traversal too. :) – Biv – 2017-09-09T12:04:04.063

@Biv And this technique, to use a bitshift as a small lookup table is actually a side-channel resistant form of doing lookup tables, so potentially useful in implementing crypto primitives :) Obviously in the way it's used here the access pattern is always consistent and public information, so no reason to worry about cache effects, but the technique is interesting regardless. – orlp – 2017-09-09T13:57:20.157

4

C, 268 chars (268 bytes) using uint32_t

NB Since the original code uses <stdint.h> and types S as uint32_t *, I think the use of long is a cheat to get into 280 chars at the cost of the portability which is the reason for using uint32_t in the first place. If for fairness of comparison we require consistent use of uint32_t and the explicit signature void gimli(uint32_t *), the original code is really 284 chars, and orlp's code is 276 chars.

#include<stdint.h>
#define R(V)x=S[V],S[V]=S[V^y],S[V^y]=x,
void gimli(uint32_t*S){for(uint32_t r=24,x,y,z,*T;r--;y=72>>r%4*2&3,R(0)R(3)*S^=y&1?0x9e377901+r:0)for(T=S+4;T-->S;*T=z^y^8*(x&y),T[4]=y^x^2*(x|z),T[8]=x^2*z^4*(y&z))x=*T<<24|*T>>8,y=T[4]<<9|T[4]>>23,z=T[8];}

This can be split into two tweets with continuation markers as

#include<stdint.h>
#define R(V)x=S[V],S[V]=S[V^y],S[V^y]=x,
void gimli(uint32_t*S){for(uint32_t r=24,x,y,z,*T;r--;y=72>>r%4*2&3,R(0)R(3)// 1

and

*S^=y&1?0x9e377901+r:0)for(T=S+4;T-->S;*T=z^y^8*(x&y),T[4]=y^x^2*(x|z),T[8]=x^2*z^4*(y&z))x=*T<<24|*T>>8,y=T[4]<<9|T[4]>>23,z=T[8];}// 2/2

Peter Taylor

Posted 2017-09-08T22:29:41.697

Reputation: 41 901

The use of long in my version is safe (with respect to portability) because the minimum size of a long is 32 bits by the standard (as opposed to int). Rotations of x and y are done before the cast into long at the assignment, making them safe (as the right shift on signed value is CC dependent). The cast when going back to the uint32_t* S) gets rid of the upper bits and puts us in the right state :). – Biv – 2017-09-09T12:06:34.127

2

Java (OpenJDK 8), 351 343 339 320 318 247 + 56 bytes

Just a near 1:1 port of the reference to start golfing from.

void f(int[]x,int y,int z){int q=x[y];x[y]=x[z];x[z]=q;}

s->{for(int r=24,c,x,y,z;r>0;--r){for(c=0;c<4;x=s[c]<<24|s[c]>>>8,y=s[4+c]<<9|s[4+c]>>>23,z=s[8+c],s[8+c]=x^z<<1^(y&z)<<2,s[4+c]=y^x^(x|z)<<1,s[c++]=z^y^(x&y)<<3);if((r&3)==2){f(s,0,2);f(s,1,3);}if((r&3)<1){f(s,0,1);f(s,2,3);s[0]^=0x9e377900|r;}}}

Try it online!

Roberto Graham

Posted 2017-09-08T22:29:41.697

Reputation: 1 305

1Why use Integer at all? o_O Since you don't use any Integer method, there's no reason to not use ints here... – Olivier Grégoire – 2017-09-09T12:50:33.927

@OlivierGrégoire I think just a remnant from me trying Integer.divideUnsigned, but I realised I can have >>> – Roberto Graham – 2017-09-09T12:56:34.147

s[0]^=(0x9e377900|r); (at the very end) -- can't you drop the extra parentheses? – Clashsoft – 2017-09-09T14:27:22.213

Same with s[4+c]>>>(23). – Clashsoft – 2017-09-09T14:28:14.187

1You can make far fewer changes and get 300: void P(int[]S,int a,int b){int x=S[a];S[a]=S[b];S[b]=x;}void gimli(int[]S){for(int r=24,c,x,y,z;r>0;S[0]^=y<1?0x9e377901+r:0){for(c=4;c-->0;){x=S[c]<<24|S[c]>>>8;y=S[c+4]<<9|S[c+4]>>>23;z=S[c+8];S[c]=z^y^8*(x&y);S[c+4]=y^x^2*(x|z);S[c+8]=x^2*z^4*(y&z);}y=r%4;if(--r%2>0){P(S,0,1+y/2);P(S,3,2-y/2);}}} . I've basically made the minimal changes necessary to get it to compile. Java's precedence rules aren't very different to C's. – Peter Taylor – 2017-09-09T16:06:09.267

Order of Operations – HyperNeutrino – 2017-09-09T18:16:52.060

238 – ceilingcat – 2019-11-20T22:50:43.430

2

JavaScript (ES6), 231 bytes

s=>{for(r=25;--r;[a,b,c,d,...e]=s,s=r&1?s:r&2?[c,d,a,b,...e]:[b,a,d,c,...e],s[0]^=r&3?0:0x9e377900|r)for(c=4;c--;x=s[c]<<24|s[c]>>>8,y=s[j=c+4]<<9|s[j]>>>23,z=s[c+8],s[c+8]=x^z*2^(y&z)*4,s[j]=y^x^(x|z)*2,s[c]=z^y^(x&y)*8);return s}

Demo

let f =

s=>{for(r=25;--r;[a,b,c,d,...e]=s,s=r&1?s:r&2?[c,d,a,b,...e]:[b,a,d,c,...e],s[0]^=r&3?0:0x9e377900|r)for(c=4;c--;x=s[c]<<24|s[c]>>>8,y=s[j=c+4]<<9|s[j]>>>23,z=s[c+8],s[c+8]=x^z*2^(y&z)*4,s[j]=y^x^(x|z)*2,s[c]=z^y^(x&y)*8);return s}

console.log(
  f([
    0x00000000, 0x9e3779ba, 0x3c6ef37a, 0xdaa66d46,
    0x78dde724, 0x1715611a, 0xb54cdb2e, 0x53845566,
    0xf1bbcfc8, 0x8ff34a5a, 0x2e2ac522, 0xcc624026
  ])
  .map(x=>(x >>> 0).toString(16)).join(' ')
)

Arnauld

Posted 2017-09-08T22:29:41.697

Reputation: 111 334

1

32-bit x86 assembler (112 bytes)

(__cdecl calling convention)

            pusha
            mov     ecx, 9E377918h
    loc_6:  mov     esi, [esp+24h]
            push    esi
            push    4
            pop     ebx
    loc_E:  lodsd
            ror     eax, 8
            mov     ebp, [esi+0Ch]
            rol     ebp, 9
            mov     edx, [esi+1Ch]
            push    eax
            push    ebp
            lea     edi, [edx+edx]
            and     ebp, edx
            shl     ebp, 2
            xor     edi, ebp
            xor     eax, edi
            mov     [esi+1Ch], eax
            pop     ebp
            pop     eax
            push    eax
            push    ebp
            xor     ebp, eax
            or      eax, edx
            shl     eax, 1
            xor     ebp, eax
            mov     [esi+0Ch], ebp
            pop     ebp
            pop     eax
            xor     edx, ebp
            and     eax, ebp
            shl     eax, 3
            xor     edx, eax
            push    edx
            dec     ebx
            jnz     short loc_E
            pop     esi
            pop     ebp
            pop     ebx
            pop     eax
            pop     edi
            mov     dl, cl
            and     dl, 3
            jnz     short loc_5B
            xchg    eax, ebx
            xchg    esi, ebp
            xor     eax, ecx
    loc_5B: cmp     dl, 2
            jnz     short loc_63
            xchg    eax, ebp
            xchg    esi, ebx
    loc_63: stosd
            xchg    eax, ebx
            stosd
            xchg    eax, ebp
            stosd
            xchg    eax, esi
            stosd
            dec     cl
            jnz     short loc_6
            popa
            retn

Tweetable version (z85-format Base85 encoding):

v7vb1h>C}HbQuA91y51A:oWYw48G)?I=H/]rGf9Na>sA.DWu06{6f#TEC^CM:#IeA-cstx7:>!VfVf#u*YB&mP(tuCl*+7eENBP)$:)Lh!k}t$^wM51j%LDf$HMAg2bB^MQP

peter ferrie

Posted 2017-09-08T22:29:41.697

Reputation: 804