Implement Rijndael's S-box

15

4

Rijndael's S-box is a frequently used operation in AES encryption and decryption. It is typically implemented as a 256-byte lookup table. That's fast, but means you need to enumerate a 256-byte lookup table in your code. I bet someone in this crowd could do it with less code, given the underlying mathematical structure.

Write a function in your favorite language that implements Rijndael's S-box. Shortest code wins.

Keith Randall

Posted 2011-10-04T02:26:29.103

Reputation: 19 865

1Bonus points (upvotes from me) if the resulting function is constant-time (i.e. no data-dependent code paths or array accesses or whatever your language supports). – Paŭlo Ebermann – 2011-10-04T08:08:16.803

@PaŭloEbermann array accesses are constant time in many languages (it's adding a (scaled) value to a pointer and dereferencing it, this is why a lookup table is so very fast) – ratchet freak – 2011-10-04T08:58:13.153

@ratchetfreak Array accesses are O(1), but the actual access time depends on cache hits or misses, for example, which leads to side-channel attacks on AES. – Paŭlo Ebermann – 2011-10-04T09:01:47.997

@PaŭloEbermann, but you can use the shorter code to fill a lookup table, which will then fit in well under a page of memory. – Peter Taylor – 2011-10-04T09:32:08.477

@PaŭloEbermann and if the 256-length table is stored along the code (as enum generated at compile time) you nearly guaranteed a cache hit – ratchet freak – 2011-10-04T09:35:52.263

There is some discussion about the relevance of cache timing attacks on AES in the Cryptography chat. (@ratchet freak)

– Paŭlo Ebermann – 2011-10-04T11:58:17.990

@PaŭloEbermann you can't argue that my ubyte to ubyte version is fully branchless (except for 1 if in the mulInv() which is 1 instruction over 256 iterations and then it continues with searching the full space) – ratchet freak – 2011-10-04T12:22:11.527

Answers

6

Ruby, 161 characters

R=0..255
S=R.map{|t|t=b=R.select{|y|x=t;z=0;8.times{z^=y*(x&1);x/=2;y*=2};r=283<<8;8.times{r/=2;z^r<z/2&&z^=r};z==1}[0]||0;4.times{|r|t^=b<<1+r^b>>4+r};t&255^99}

In order to check the output you can use the following code to print it in tabular form:

S.map{|x|"%02x"%x}.each_slice(16){|l|puts l*' '}

Howard

Posted 2011-10-04T02:26:29.103

Reputation: 23 109

7

x86-64 Machine code - 23 22 20 19 bytes

Uses the AES-NI instruction set

66 0F 6E C1          movd        xmm0,ecx
66 0F 38 DD C1       aesenclast  xmm0,xmm1
0F 57 C1             xorps       xmm0,xmm1  
66 0F 3A 14 C0 00    pextrb      eax,xmm0,0
C3                   ret

Using Windows calling conventions, takes in a byte and outputs a byte. It is not necessary to reverse the ShiftRows as it does not affect the first byte.

me'

Posted 2011-10-04T02:26:29.103

Reputation: 451

2For once, x86_64 pulls a mathematica, and has a builtin for that. – moonheart08 – 2019-07-10T05:55:05.690

7

GolfScript, 60 characters

{[[0 1{.283{1$2*.255>@*^}:r~^}255*].@?~)={257r}4*99]{^}*}:S;

This code defines a function named S that takes in a byte and applies the Rijndael S-box to it. (It also uses an internal helper function named r to save a few chars.)

This implementation uses a logarithm table to compute the GF(28) inverses, as suggested by Thomas Pornin. To save a few chars, the whole logarithm table is recalculated for each input byte; even so, and despite GolfScript being a very slow language in general, this code takes only about 10 ms to process a byte on my old laptop. Precalculating the logarithm table (as L) speeds it up to about 0.5 ms per byte, at the modest cost of three more chars:

[0 1{.283{1$2*.255>@*^}:r~^}255*]:L;{[L?~)L={257r}4*99]{^}*}:S;

For convenience, here's a simple test harness that calls the function S, as defined above, to compute and print out the whole S-box in hex like on Wikipedia:

"0123456789abcdef"1/:h; 256, {S .16/h= \16%h= " "++ }% 16/ n*

Try this code online.

(The online demo precalculates the logarithm table to avoid taking too much time. Even so, the online GolfScript site may sometimes randomly time out; this is a known issue with the site, and a reload usually fixes it.)

Explanation:

Let's start with the logarithm table calculation, and specifically with the helper function r:

{1$2*.255>@*^}:r

This function takes two inputs on the stack: a byte and a reduction bitmask (a constant between 256 and 511). It duplicates the input byte, multiplies the copy by 2 and, if the result exceeds 255, XORs it with the bitmask to bring it back under 256.

Within the log-table generating code, the function r is called with the reduction bitmask 283 = 0x11b (which corresponds to the Rijndael GF(28) reduction polynomial x8 + x4 + x3 + x + 1), and result is XORed with the original byte, effectively multiplying it by 3 (= x + 1, as a polynomial) in the Rijndael finite field. This multiplication is repeated 255 times, starting from the byte 1, and the results (plus an initial zero byte) are collected into a 257-element array L that looks like this (middle part omitted):

[0 1 3 5 15 17 51 85 255 26 46 ... 180 199 82 246 1]

The reason why there are 257 elements is that, with the prepended 0 and with 1 occurring twice, we can find the modular inverse of any given byte simply by looking up its (zero-based) index in this array, negating it, and looking up the byte at the negated index in the same array. (In GolfScript, as in many other programming languages, negative array indexes count backwards from the end of the array.) Indeed, this is exactly what the code L?~)L= at the beginning of the function S does.

The rest of the code calls the helper function r four times with the reduction bitmask 257 = 28 + 1 to create four bit-rotated copies of the inverted input byte. These are all collected into an array, together with the constant 99 = 0x63, and XORed together to produce the final output.

Ilmari Karonen

Posted 2011-10-04T02:26:29.103

Reputation: 19 513

6

The table can be generated without computing inverses in the finite field GF(256), by using logarithms. It would look like this (Java code, using int to avoid problems with the signed byte type):

int[] t = new int[256];
for (int i = 0, x = 1; i < 256; i ++) {
    t[i] = x;
    x ^= (x << 1) ^ ((x >>> 7) * 0x11B);
}
int[] S = new int[256];
S[0] = 0x63;
for (int i = 0; i < 255; i ++) {
    int x = t[255 - i];
    x |= x << 8;
    x ^= (x >> 4) ^ (x >> 5) ^ (x >> 6) ^ (x >> 7);
    S[t[i]] = (x ^ 0x63) & 0xFF;
}

The idea is that 3 is a multiplicative generator of GF(256)*. The table t[] is such that t[x] is equal to 3x; since 3255 = 1, we get that 1/(3x) = 3255-x.

Thomas Pornin

Posted 2011-10-04T02:26:29.103

Reputation: 191

Is that a >>> instead of a >> in line 4? – Joe Z. – 2013-01-22T17:22:55.263

@JoeZeng: both would work. In Java, '>>>' is the "unsigned shift", '>>' is the "signed shift". They differ by how they handle the sign bit. Here, the values will never be wide enough for the sign bit to be non-zero, so it makes no real difference. – Thomas Pornin – 2013-01-22T17:39:40.790

I was wondering whether that was just a typo or not. Seems like it isn't. – Joe Z. – 2013-01-22T17:40:18.583

shouldn't it be 0x1B (one 1 in the hex literal) instead of 0x11B – ratchet freak – 2011-10-04T12:31:19.240

@ratchetfreak: no, it must be 0x11B (I tried). The int type is 32-bit in Java; I must cancel out the higher bit. – Thomas Pornin – 2011-10-04T12:36:38.290

ah didn't realize that – ratchet freak – 2011-10-04T12:47:49.617

6

GolfScript (82 chars)

{256:B,{0\2${@1$3$1&*^@2/@2*.B/283*^}8*;;1=},\+0=B)*:A.2*^4A*^8A*^128/A^99^B(&}:S;

Uses global variables A and B, and creates the function as global variable S.

The Galois inversion is brute-force; I experimented with having a separate mul function which could be reused for the post-inversion affine transform, but it turned out to be more expensive because of the different overflow behaviour.

This is too slow for an online demo - it would time out even on the first two rows of the table.

Peter Taylor

Posted 2011-10-04T02:26:29.103

Reputation: 41 901

Mine's faster (and shorter ;). +1 anyway, though. – Ilmari Karonen – 2015-04-08T23:15:00.980

4

Python, 176 chars

This answer is for Paŭlo Ebermann's comment-question about making the function constant time. This code fits the bill.

def S(x):
 i=0
 for y in range(256):
  p,a,b=0,x,y
  for j in range(8):p^=b%2*a;a*=2;a^=a/256*283;b/=2
  m=(p^1)-1>>8;i=y&m|i&~m
 i|=i*256;return(i^i/16^i/32^i/64^i/128^99)&255

Keith Randall

Posted 2011-10-04T02:26:29.103

Reputation: 19 865

Multiplication being constant-time is platform dependent (even on 32-bit platforms, e.g. ARM Cortex M0). See this related question

– fgrieu – 2015-03-06T13:46:07.163

1@fgrieu Sure, but these are all multiplications by constants, which can be easily implemented in constant time using shifts and adds. – Keith Randall – 2015-03-06T19:55:13.573

2

d

ubyte[256] getLookup(){

    ubyte[256] t=void;
    foreach(i;0..256){
        t[i] = x;
        x ^= (x << 1) ^ ((x >>> 7) * 0x1B);
    }
    ubyte[256] S=void;
    S[0] = 0x63;
    foreach(i;0..255){
        int x = t[255 - i];
        x |= x << 8;
        x ^= (x >> 4) ^ (x >> 5) ^ (x >> 6) ^ (x >> 7);
        S[t[i]] = cast(ubyte)(x & 0xFF) ^ 0x63 ;
    }
    return S;

}

this can generate the lookup table at compile time, I could save some by making ubyte a generic param

edit direct ubyte to ubyte without array lookups, no branching and fully unrollable loops

B[256] S(B:ubyte)(B i){
    B mulInv(B x){
        B r;
        foreach(i;0..256){
            B p=0,h,a=i,b=x;
            foreach(c;0..8){
                p^=(b&1)*a;
                h=a>>>7;
                a<<=1;
                a^=h*0x1b;//h is 0 or 1
                b>>=1;
            }
            if(p==1)r=i;//happens 1 or less times over 256 iterations
        }
        return r;
    }
    B s= x=mulInv(i);
    foreach(j,0..4){
        x^=(s=s<<1|(s>>>7));
    }
    return x^99;
}

edit2 used @Thomas' algo for creating the lookup table

ratchet freak

Posted 2011-10-04T02:26:29.103

Reputation: 1 334

0

Stax, 53 bytes

ë■ÿÆ♂◄º5hUø√!«mr¿¡ƒR3Å←ç≥#/$taJkαn∩╓▒ÿ╔τy╫π§╪S♫╔┴H╔║Ö

Run and debug it

I have no particular understanding of S-boxes. This is a conversion of Thomas Pornin's (8 year old!) solution.

recursive

Posted 2011-10-04T02:26:29.103

Reputation: 8 616