C: replace AES FIPS-197 SubBytes table by constant-time code

In FIPS-197 (the Advanced Encryption Standard, known as AES), it is made heavy use of SubBytes, which could be implemented as

unsigned char SubBytes(unsigned char x) {
static const unsigned char t[256] = {
  0x63,0x7C,0x77,0x7B,0xF2,0x6B,0x6F,0xC5,0x30,0x01,0x67,0x2B,0xFE,0xD7,0xAB,0x76,
  0xCA,0x82,0xC9,0x7D,0xFA,0x59,0x47,0xF0,0xAD,0xD4,0xA2,0xAF,0x9C,0xA4,0x72,0xC0,
  0xB7,0xFD,0x93,0x26,0x36,0x3F,0xF7,0xCC,0x34,0xA5,0xE5,0xF1,0x71,0xD8,0x31,0x15,
  0x04,0xC7,0x23,0xC3,0x18,0x96,0x05,0x9A,0x07,0x12,0x80,0xE2,0xEB,0x27,0xB2,0x75,
  0x09,0x83,0x2C,0x1A,0x1B,0x6E,0x5A,0xA0,0x52,0x3B,0xD6,0xB3,0x29,0xE3,0x2F,0x84,
  0x53,0xD1,0x00,0xED,0x20,0xFC,0xB1,0x5B,0x6A,0xCB,0xBE,0x39,0x4A,0x4C,0x58,0xCF,
  0xD0,0xEF,0xAA,0xFB,0x43,0x4D,0x33,0x85,0x45,0xF9,0x02,0x7F,0x50,0x3C,0x9F,0xA8,
  0x51,0xA3,0x40,0x8F,0x92,0x9D,0x38,0xF5,0xBC,0xB6,0xDA,0x21,0x10,0xFF,0xF3,0xD2,
  0xCD,0x0C,0x13,0xEC,0x5F,0x97,0x44,0x17,0xC4,0xA7,0x7E,0x3D,0x64,0x5D,0x19,0x73,
  0x60,0x81,0x4F,0xDC,0x22,0x2A,0x90,0x88,0x46,0xEE,0xB8,0x14,0xDE,0x5E,0x0B,0xDB,
  0xE0,0x32,0x3A,0x0A,0x49,0x06,0x24,0x5C,0xC2,0xD3,0xAC,0x62,0x91,0x95,0xE4,0x79,
  0xE7,0xC8,0x37,0x6D,0x8D,0xD5,0x4E,0xA9,0x6C,0x56,0xF4,0xEA,0x65,0x7A,0xAE,0x08,
  0xBA,0x78,0x25,0x2E,0x1C,0xA6,0xB4,0xC6,0xE8,0xDD,0x74,0x1F,0x4B,0xBD,0x8B,0x8A,
  0x70,0x3E,0xB5,0x66,0x48,0x03,0xF6,0x0E,0x61,0x35,0x57,0xB9,0x86,0xC1,0x1D,0x9E,
  0xE1,0xF8,0x98,0x11,0x69,0xD9,0x8E,0x94,0x9B,0x1E,0x87,0xE9,0xCE,0x55,0x28,0xDF,
  0x8C,0xA1,0x89,0x0D,0xBF,0xE6,0x42,0x68,0x41,0x99,0x2D,0x0F,0xB0,0x54,0xBB,0x16};
return t[x];}

This function is not arbitrary; it is a reversible mapping, consisting of an inversion in a Galois Field followed by an affine transformation. All the details are in FIPS-197 section 5.1.1 or here section 4.2.1 (under a slightly different name).

One problem with implementation as a table is that it opens to so-called cache-timing attacks.

Thus your mission is to devise an exact replacement for the above SubBytes() function, that exhibits constant-time behavior; we'll assume that's the case when nothing depending on the input x of SubBytes is used either:

as an array index,
as the control operand of if, while, for, case, or operator ?:;
as any operand of operators &&, ||, !, ==, !=, <, >, <=, >=, *, /, %;
as the right operand of operators >>, <<, *=, /=, %=, <<=, >>=.

The winning entry will be one with the lowest cost, obtained from the number of operators executed in the input-dependent data path, with a weight of 5 for unary operators - and ~ as well as for <<1, >>1, +1, -1; weight of 7 for all other operators, shifts with other counts, or adds/subs of other constants (type casts and promotions are free). By principle, that cost is unchanged by unrolling loops (if any), and independent of the input x. As a tie-breaker, the answer with the shortest code after removing whitespace and comments wins.

I plan to designate an entry as answer as early as I can in year 2013, UTC. I will consider answers in languages that I've some knowledge of, ranking them as a straight translation to C not optimized for size.

_{Apologies for the initial omission of +1 and -1 in the favored operators, of free casts and promotions, and ranking of size. Note that * is prohibited both when unary, and as multiplication.}

fgrieu

Posted 2012-12-21T17:40:24.530

Reputation: 545

1Worth noting that lookups are free because you can inline them as constants. – Peter Taylor – 2012-12-22T15:21:57.867

"early in year 2013, UTC" – wouldn't the date be more interesting than the time zone? – Paŭlo Ebermann – 2012-12-23T16:17:22.907

@PaŭloEbermann: My intention should be clear now. – fgrieu – 2012-12-23T18:10:57.217

– Keith Randall – 2012-12-31T18:13:38.133

Answers

Score: 940 933 926 910, field tower approach

public class SBox2
{
    public static void main(String[] args)
    {
        for (int i = 0; i < 256; i++) {
            int s = SubBytes(i);
            System.out.format("%02x  ", s);
            if (i % 16 == 15) System.out.println();
        }
    }

    private static int SubBytes(int x) {
        int fwd;
        fwd  = 0x010001 & -(x & 1); x >>= 1; //   7+5+7+5+ | 24+
        fwd ^= 0x1d010f & -(x & 1); x >>= 1; // 7+7+5+7+5+ | 31+
        fwd ^= 0x4f020b & -(x & 1); x >>= 1; // 7+7+5+7+5+ | 31+
        fwd ^= 0x450201 & -(x & 1); x >>= 1; // 7+7+5+7+5+ | 31+
        fwd ^= 0xce080d & -(x & 1); x >>= 1; // 7+7+5+7+5+ | 31+
        fwd ^= 0xa20f0f & -(x & 1); x >>= 1; // 7+7+5+7+5+ | 31+
        fwd ^= 0xc60805 & -(x & 1); x >>= 1; // 7+7+5+7+5+ | 31+
        fwd ^= 0x60070e & -x;                // 7+7+5+     | 19+

        // Running total so far: 229

        int p1;
        {
            int ma = fwd;
            int mb = fwd >> 16;         // 7+         | 7+
            p1  = ma & -(mb&1); ma<<=1; //   7+5+7+5+ | 24+
            p1 ^= ma & -(mb&2); ma<<=1; // 7+7+5+7+5+ | 31+
            p1 ^= ma & -(mb&4); ma<<=1; // 7+7+5+7+5+ | 31+
            p1 ^= ma & -(mb&8);         // 7+7+5+7+   | 26+
            int t = p1 >> 3;            // 7+         | 7+
            p1 ^= (t >> 1) ^ (t & 0xe); // 7+5+7+7+   | 26+
        }

        // Running total so far: 229 + 152 = 381

        int y3, y2, y1, y0;
        {
            int Kinv = (fwd >> 20) ^ p1;     // 7+7+
            int w0 = Kinv & 1; Kinv >>= 1;   // 7+5+
            int w1 = Kinv & 1; Kinv >>= 1;   // 7+5+
            int w2 = Kinv & 1; Kinv >>= 1;   // 7+5+
            int w3 = Kinv & 1;               // 7+

            int t0 = w1 ^ w0 ^ (w2 & w3);      // 7+7+7+
            int t1 = w2 ^ (w0 | w3);           // 7+7+
            int t2 = t0 ^ t1;                  // 7+

            y3 = t2 ^ (t1 & (w1 & w3));        // 7+7+7+
            y2 = t0 ^ (w0 | t2);               // 7+7+
            y1 = w0 ^ w3 ^ (t1 & t0);          // 7+7+7+
            y0 = w3 ^ (t0 | (w1 ^ (w0 | w2))); // 7+7+7+7


        }

        // Running total so far: 381 + 24*7 + 3*5 = 564

        int p2;
        {
            int ma = fwd;
            p2  = ma & -y0; ma<<=1;       //   7+5+5+ | 17+
            p2 ^= ma & -y1; ma<<=1;       // 7+7+5+5+ | 24+
            p2 ^= ma & -y2; ma<<=1;       // 7+7+5+5+ | 24+
            p2 ^= ma & -y3;               // 7+7+5+   | 19+
            int t = p2 >> 3;              // 7+       | 7+
            p2 ^= (t >> 1) ^ (t & 0xe0e); // 7+5+7+7+ | 26
        }

        // Running total so far: 564 + 117 = 681

        int inv8;
        inv8  =  31 & -(p2 & 1);           //   7+5+7+   | 19+
        inv8 ^= 178 & -(p2 & 2); p2 >>= 2; // 7+7+5+7+7+ | 33+
        inv8 ^= 171 & -(p2 & 1);           // 7+7+5+7+   | 26+
        inv8 ^=  54 & -(p2 & 2); p2 >>= 6; // 7+7+5+7+7+ | 33+
        inv8 ^= 188 & -(p2 & 1);           // 7+7+5+7+   | 26+
        inv8 ^=  76 & -(p2 & 2); p2 >>= 2; // 7+7+5+7+7+ | 33+
        inv8 ^= 127 & -(p2 & 1);           // 7+7+5+7+   | 26+
        inv8 ^= 222 & -(p2 & 2);           // 7+7+5+7    | 26+

        return inv8 ^ 0x63;                // 7+         | 7+

        // Grand total: 681 + 229 = 910
    }
}

The structure is essentially the same as Boyar and Peralta's implementation - reduce inversion in GF(2^8) to inversion in GF(2^4), break it down into a linear prologue, a non-linear body, and a linear epilogue, and then minimise them separately. I pay some penalties for bit extraction, but I compensate by being able to do operations in parallel (with some judicious padding of the bits of fwd). In more detail...

Background

As mentioned in the problem description, the S-box consists of an inversion in a particular implementation of the Galois field GF(2^8) followed by an affine transformation. If you know what both of those mean, skip this section.

An affine (or linear) transformation is a function f(x) which respects f(x + y) = f(x) + f(y) and f(a*x) = a*f(x).

A field is a set F of elements with two special elements, which we'll call 0 and 1, and two operators, which we'll call + and *, which respects various properties. In this section, assume that x, y, and z are elements of F.

The elements of F form an abelian group under + with 0 as the identity: i.e. x + y is an element of F; x + 0 = 0 + x = x; each x has a corresponding -x such that x + (-x) = (-x) + x = 0; x + (y + z) = (x + y) + z; and x + y = y + x.
The elements of F other than 0 form an abelian group under * with 1 as the identity.
Multiplication distributes over addition: x * (y + z) = (x * y) + (x * z).

It turns out that there are some quite severe limitations on finite fields:

They must have a number of elements which is a power of a prime.
They are isomorphic with all other finite fields of the same size (i.e. there's really only one finite field of a given size, and any others are just relabellings; call that field GF(p^k) where p is the prime and k is the power).
The multiplicative group F\{0} under * is cyclic; i.e. there's at least one element g such that every element is a power of g.
For powers greater than 1 there is a representation as univariate polynomials of order k of the field of the prime order. E.g. GF(2^8) has a representation in terms of polynomials of x over GF(2). In fact there's usually more than one representation. Consider x^7 * x in GF(2^8); it must be equivalent to some order-7 polynomial, but which? There are a lot of choices which give the right structure; AES chooses to make x^8 = x^4 + x^3 + x + 1 (the lexicographically smallest polynomial which works).

So how do we compute an inverse in that particular representation of GF(2^8)? It's too chunky a problem to tackle directly, so we need to break it down.

Field towers: representing GF(2^8) in terms of GF(2^4)

Instead of representing GF(2^8) with polynomials of 8 terms over GF(2) we can represent it with polynomials of 2 terms over GF(2^4). This time we need to choose a linear polynomial for x^2. Suppose we choose x^2 = px + q. Then (ax + b) * (cx + d) = (ad + bc + acp)x + (bd + acq).

Is it easier to find an inverse in this representation? If (cx + d) = (ax + b)^-1 we get the simultaneous equations

ad + (b + ap)c = 0
bd + (aq)c = 1

Let D = [b(b+ap) + a^2 q] and set c = a * D^-1; d = (b + ap) * D^-1. So we can do an inverse in GF(2^8) for the cost of a conversion to GF(2^4), an inverse and a few additions and multiplications in GF(2^4), and a conversion back. Even if we do the inverse by means of a table, we've reduced the table size from 256 to 16.

Implementation details

To construct a representation of GF(4) we can choose between three polynomials to reduce x^4:

x^4 = x + 1
x^4 = x^3 + 1
x^4 = x^3 + x^2 + x + 1

The most important difference is in the implementation of multiplication. For any of the three (which correspond to poly of 3, 9, f) the following will work:

// 14x &, 7x unary -, 3x <<1, 3x >>1, 3x >>3, 6x ^ gives score 226
int mul(int a, int b) {
    // Call current values a = a0, b = b0
    int p = a & -(b & 1);
    a = ((a << 1) ^ (poly & -(a >> 3))) & 15;
    b >>= 1;
    // Now p = a0 * (b0 mod x); a = a0 x; b = b0 div x

    p ^= a & -(b & 1);
    a = ((a << 1) ^ (poly & -(a >> 3))) & 15;
    b >>= 1;
    // Now p = a0 * (b0 mod x^2); a = a0 x^2; b = b0 div x^2

    p ^= a & -(b & 1);
    a = ((a << 1) ^ (poly & -(a >> 3))) & 15;
    b >>= 1;
    // Now p = a0 * (b0 mod x^3); a = a0 x^3; b = b0 div x^3

    p ^= a & -(b & 1);
    // p = a0 * b0

    return p;
}

However, if we choose poly = 3 we can handle the overflow much more efficiently because it has a nice structure: there's no double-overflow, because the two inputs are both cubics and x^6 = x^2 (x + 1) is also cubic. In addition we can save the shifts of b: since we're leaving the overflow to last, a0 x^2 doesn't have any set bits corresponding to x or 1 and so we can mask it with -4 instead of -1. The result is

// 10x &, 4x unary -, 3x <<1, 1x >>1, 1x >>3, 5x ^ gives score 152
int mul(int a, int b) {
    int p;
    p  = a & -(b & 1); a <<= 1;
    p ^= a & -(b & 2); a <<= 1;
    p ^= a & -(b & 4); a <<= 1;
    p ^= a & -(b & 8);
    // Here p = a0 * b0 but with overflow, which we need to bring back down.

    int t = p >> 3;
    p ^= (t >> 1) ^ (t & 0xe);
    return p & 15;
}

We still need to choose the values of p and q for the representation of GF(2^8) over GF(2^4), but we don't have many constraints on them. The one thing that matters is that we can get a linear function from the bits of our original representation to the bits of the working representation. This matters for two reasons: firstly, it's easy to do linear transformations, whereas a non-linear transformation would require optimisation equivalent in difficulty to just optimising the entire S-box; secondly, because we can get some side benefits. To recap the structure:

GF256 SubBytes(GF256 x) {
    GF16 a, b, t, D, Dinv, c, d;

    (a, b) = f(x); // f is linear

    t = b + a * p;
    D = b * t + a * a * q;
    Dinv = inverse_GF16(D);
    c = a * Dinv;
    d = t * Dinv;

    return finv(c, d); // finv is also linear
}

If the bits of x are x7 x6 ... x0 then since the transform is linear we get a = f({x7}0000000 + 0{x6}000000 + ... + 0000000{x0}) = f({x7}0000000) + f(0{x6}000000) + ... + f(0000000{x0}). Square it and we get a^2 = f({x7}0000000)^2 + f(0{x6}000000)^2 + ... + f(0000000{x0})^2 where the cross-terms cancel (because in GF(2), 1 + 1 = 0). So a^2 can also be computed as a linear function of x. We can augment the forward linear transform to get:

GF256 SubBytes(GF256 x) {
    GF16 a, b, t, a2q, D, Dinv, c, d;

    (a, b, t, a2q) = faug(x);

    D = b * t + a2q;
    Dinv = inverse_GF16(D);
    c = a * Dinv;
    d = t * Dinv;

    return finv(c, d);
}

and we're down to three multiplications and one addition. We can also extend the multiplication code to do the two multiplications by Dinv in parallel. So our total cost is a forward linear transformation, an addition, two multiplications, an inverse in GF(2^4), and a back linear transformation. We can roll in the post-inverse linear transformation of the S-box and get that essentially for free.

The computation of the coefficients for the linear transformations isn't very interesting, and nor is the micro-optimisation to save a mask here and a shift there. The remaining interesting part is the optimisation of inverse_GF16. There are 2^64 different functions from 4 bits to 4 bits, so a direct optimisation requires lots of memory and time. What I've done is to consider 4 functions from 4 bits to 1 bit, place a cap on the total cost permitted for any one function (with a maximum cost of 63 per function I can enumerate all suitable expressions in under a minute), and for each tuple of functions do common subexpression elimination. After 25 minutes of crunching, I find that the best inverse possible with that cap has a total cost of 133 (an average of 33.25 per bit of the output, which isn't bad considering that the cheapest expression for any individual bit is 35).

I'm still experimenting with other approaches to minimise the inversion in GF(2^4), and an approach which builds bottom-up rather than top-down has yielded an improvement from 133 to 126.

Peter Taylor

Posted 2012-12-21T17:40:24.530

Reputation: 41 901

Fantastic! I confirm it works! A detail: the 8th & 1 can be trimmed (esp. if x is unsigned char; CHAR_BIT is 8 in codegolf). – fgrieu – 2012-12-31T18:22:03.837

@fgrieu, good point. – Peter Taylor – 2012-12-31T18:31:57.827

**Score: 980=75+1157+75+157, Boyar and Peralta's minimization**

I found A new combinational logic minimization technique with applications to cryptology by Joan Boyar and René Peralta, which (save for the C formalism) does what's required. The technique used to derive their equations is patented by no less than the USA. I just did a straight C translation of their equations, kindly linked here.

unsigned char SubBytes_Boyar_Peralta(unsigned char x7){
  unsigned char 
  x6=x7>>1,x5=x6>>1,x4=x5>>1,x3=x4>>1,x2=x3>>1,x1=x2>>1,x0=x1>>1,
  y14=x3^x5,y13=x0^x6,y9=x0^x3,y8=x0^x5,t0=x1^x2,y1=t0^x7,y4=y1^x3,y12=y13^y14,y2=y1^x0,
  y5=y1^x6,y3=y5^y8,t1=x4^y12,y15=t1^x5,y20=t1^x1,y6=y15^x7,y10=y15^t0,y11=y20^y9,y7=x7^y11,
  y17=y10^y11,y19=y10^y8,y16=t0^y11,y21=y13^y16,y18=x0^y16,t2=y12&y15,t3=y3&y6,t4=t3^t2,
  t5=y4&x7,t6=t5^t2,t7=y13&y16,t8=y5&y1,t9=t8^t7,t10=y2&y7,t11=t10^t7,t12=y9&y11,
  t13=y14&y17,t14=t13^t12,t15=y8&y10,t16=t15^t12,t17=t4^t14,t18=t6^t16,t19=t9^t14,
  t20=t11^t16,t21=t17^y20,t22=t18^y19,t23=t19^y21,t24=t20^y18,t25=t21^t22,t26=t21&t23,
  t27=t24^t26,t28=t25&t27,t29=t28^t22,t30=t23^t24,t31=t22^t26,t32=t31&t30,t33=t32^t24,
  t34=t23^t33,t35=t27^t33,t36=t24&t35,t37=t36^t34,t38=t27^t36,t39=t29&t38,t40=t25^t39,
  t41=t40^t37,t42=t29^t33,t43=t29^t40,t44=t33^t37,t45=t42^t41,z0=t44&y15,z1=t37&y6,
  z2=t33&x7,z3=t43&y16,z4=t40&y1,z5=t29&y7,z6=t42&y11,z7=t45&y17,z8=t41&y10,z9=t44&y12,
  z10=t37&y3,z11=t33&y4,z12=t43&y13,z13=t40&y5,z14=t29&y2,z15=t42&y9,z16=t45&y14,z17=t41&y8,
  t46=z15^z16,t47=z10^z11,t48=z5^z13,t49=z9^z10,t50=z2^z12,t51=z2^z5,t52=z7^z8,t53=z0^z3,
  t54=z6^z7,t55=z16^z17,t56=z12^t48,t57=t50^t53,t58=z4^t46,t59=z3^t54,t60=t46^t57,
  t61=z14^t57,t62=t52^t58,t63=t49^t58,t64=z4^t59,t65=t61^t62,t66=z1^t63,s0=t59^t63,
  s6=t56^t62,s7=t48^t60,t67=t64^t65,s3=t53^t66,s4=t51^t66,s5=t47^t65,s1=t64^s3,s2=t55^t67;
  return (((((((s0<<1|s1&1)<<1|s2&1)<<1|s3&1)<<1|s4&1)<<1|s5&1)<<1|s6&1)<<1|s7&1)^0x63;}

fgrieu

Posted 2012-12-21T17:40:24.530

Reputation: 545

Wow, really works, and really cheap. When disassembling, it's indeed 144 instructions, excluding prologue, epilogie and move instructions. – ugoren – 2012-12-24T14:30:09.327

Score: 10965

This uses the same principle of unrolling the array lookup. May require extra casts.

Thanks to ugoren for pointing out how to improve is_zero.

// Cost: 255 * (5+7+24+7) = 10965
unsigned char SubBytes(unsigned char x) {
    unsigned char r = 0x63;
    char c = (char)x;
    c--; r ^= is_zero(c) & (0x63^0x7c); // 5+7+24+7 inlining the final xor
    c--; r ^= is_zero(c) & (0x63^0x77); // 5+7+24+7
    // ...
    c--; r ^= is_zero(c) & (0x63^0x16); // 5+7+24+7
    return r;
}

// Cost: 24
// Returns (unsigned char)-1 when input is 0 and 0 otherwise
unsigned char is_zero(char c) {
    // Shifting a signed number right is unspecified, so use unsigned
    unsigned char u;
    c |= -c;               // 7+5+
    u = (unsigned char)c;
    u >>= (CHAR_BITS - 1); // 7+
    c = (char)u;
    // c is 0 if we want -1 and 1 otherwise.
    c--;                   // 5
    return (unsigned char)c;
}

Peter Taylor

Posted 2012-12-21T17:40:24.530

Reputation: 41 901

2For an integer c, (c|-c)>>31 is 0 for 0 and -1 otherwise. – ugoren – 2012-12-23T11:19:23.257

@ugoren, in sensible languages, yes. In C, right shifting an unsigned type is unspecified. – Peter Taylor – 2012-12-24T14:16:43.767

1I guess you mean signed. But this site isn't really famed for strict standard compliance. Also, your c >> 4 seems like signed right shift to me. And if you really insist - ((unsigned int)(c|-c))>>31 is c?1:0. – ugoren – 2012-12-24T20:35:01.550

@ugoren, you're right, I meant signed. The c >>4 works with or without sign extension. But good catch on using an unsigned shift: will edit when I get home and can use a proper computer rather than a phone. Thanks. – Peter Taylor – 2012-12-25T09:07:47.790

Score: 9109, algebraic approach

I'll leave the lookup approach in case anyone can improve it drastically, but it turns out that a good algebraic approach is possible. This implementation finds the multiplicative inverse using Euclid's algorithm. I've written it in Java but in principle it can be ported to C - I've commented the parts which would change if you want to rework it using only 8-bit types.

Thanks to ugoren for pointing out how to shorten the is_nonzero check in a comment on my other answer.

public class SBox
{
    public static void main(String[] args)
    {
        for (int i = 0; i < 256; i++) {
            int s = SubBytes(i);
            System.out.format("%02x  ", s);
            if (i % 16 == 15) System.out.println();
        }
    }

    // Total cost: 9109
    public static int SubBytes(int x)
    {
        x = inv_euclid(x); // 9041
        x = affine(x);     // 68
        return x;
    }

    // Total cost: 68
    private static int affine(int s0) {
        int s = s0;
        s ^= (s0 << 1) ^ (s0 >> 7); // 5 + 7
        s ^= (s0 << 2) ^ (s0 >> 6); // 7 + 7
        s ^= (s0 << 3) ^ (s0 >> 5); // 7 + 7
        s ^= (s0 << 4) ^ (s0 >> 4); // 7 + 7
        return (s ^ 0x63) & 0xff;   // 7 + 7
    }

    // Does the inverse in the Galois field for a total cost of 24 + 9010 + 7 = 9041
    private static int inv_euclid(int s) {
        // The first part of handling the special case: cost of 24
        int zeromask = is_nonzero(s);

        // NB the special value of r would complicate the unrolling slightly with unsigned bytes
        int r = 0x11b, a = 0, b = 1;

        // Total cost of loop: 7*(29+233+566+503+28) - 503 = 9010
        for (int depth = 0; depth < 7; depth++) { // 7*(
            // Calculate mask to fake out when we're looping further than necessary: cost 29
            int mask = is_nonzero(s >> 1);

            // Cost: 233
            int ord = polynomial_order(s);

            // This next block does div/rem at a total cost of 6*(24+49) + 69 + 59 = 566
            int quot = 0, rem = r;
            for (int i = 7; i > 1; i--) {                   // 6*(
                int divmask = is_nonzero(ord & (rem >> i)); // 24+7+7
                quot ^= (1 << i) & divmask;                 // 7+0+7+ since 1<<i is inlined on unrolling
                rem ^= (s << i) & divmask;                  // 7+7+7) +
            }
            int divmask1 = is_nonzero(ord & (rem >> 1));    // 24+7+5
            quot ^= 2 & divmask1;                           // 7+7+
            rem ^= (s << 1) & divmask1;                     // 7+5+7+
            int divmask0 = is_nonzero(ord & rem);           // 24+7
            quot ^= 1 & divmask0;                           // 7+7+
            rem ^= s & divmask0;                            // 7+7

            // This next block does the rest for the cost of a mul (503) plus 28
            // When depth = 0, b = 1 so we can skip the mul on unrolling
            r = s;
            s = rem;
            quot = mul(quot, b) ^ a;
            a = b;
            b ^= (quot ^ b) & mask;
        }

        // The rest of handling the special case: cost 7
        return b & zeromask;
    }

    // Gets the highest set bit in the input. Assumes that it's always at least 1<<1
    // Cost: 233
    private static int polynomial_order(int s) {
        int ord = 2;
        ord ^= 6 & -((s >> 2) & 1);           // 7+7+5+7+7 = 33 +
        ord ^= (ord ^ 8) & -((s >> 3) & 1);   // 7+7+7+5+7+7 = 40 +
        ord ^= (ord ^ 16) & -((s >> 4) & 1);  // 40 +
        ord ^= (ord ^ 32) & -((s >> 5) & 1);  // 40 +
        ord ^= (ord ^ 64) & -((s >> 6) & 1);  // 40 +
        ord ^= (ord ^ 128) & -((s >> 7) & 1); // 40
        return ord;
    }

    // Returns 0 if c is 0 and -1 otherwise
    // Cost: 24
    private static int is_nonzero(int c) {
        c |= -c;   // 7+5+
        c >>>= 31; // 7+ (simulating a cast to unsigned and right shift by CHAR_BIT)
        c = -c;    // 5+ (could be saved assuming a particular implementation of signed right shift)
        return c;
    }

    // Performs a multiplication in the Rijndael finite field
    // Cost: 503 (496 if working with unsigned bytes)
    private static int mul(int a, int b) {
        int p = 0;
        for (int counter = 0; counter < 8; counter++) { // 8*(_+_
            p ^= a & -(b & 1);                          // +7+7+5+7
            a = (a << 1) ^ (0x1b & -(a >> 7));          // +5+7+7+5+7
            b >>= 1;                                    // +5)
        }
        p &= 0xff;                                      // +7 avoidable with unsigned bytes
        return p;
    }
}

Peter Taylor

Posted 2012-12-21T17:40:24.530

Reputation: 41 901

**Score: 256(7+(8(7+7+7)-(2+2))+5+7+7)=48640 (assuming loops unrolled)**

unsigned char SubBytes(unsigned char x) {
static const unsigned char t[256] = {
  0x63,0x7C,0x77,0x7B,0xF2,0x6B,0x6F,0xC5,0x30,0x01,0x67,0x2B,0xFE,0xD7,0xAB,0x76,
  0xCA,0x82,0xC9,0x7D,0xFA,0x59,0x47,0xF0,0xAD,0xD4,0xA2,0xAF,0x9C,0xA4,0x72,0xC0,
  0xB7,0xFD,0x93,0x26,0x36,0x3F,0xF7,0xCC,0x34,0xA5,0xE5,0xF1,0x71,0xD8,0x31,0x15,
  0x04,0xC7,0x23,0xC3,0x18,0x96,0x05,0x9A,0x07,0x12,0x80,0xE2,0xEB,0x27,0xB2,0x75,
  0x09,0x83,0x2C,0x1A,0x1B,0x6E,0x5A,0xA0,0x52,0x3B,0xD6,0xB3,0x29,0xE3,0x2F,0x84,
  0x53,0xD1,0x00,0xED,0x20,0xFC,0xB1,0x5B,0x6A,0xCB,0xBE,0x39,0x4A,0x4C,0x58,0xCF,
  0xD0,0xEF,0xAA,0xFB,0x43,0x4D,0x33,0x85,0x45,0xF9,0x02,0x7F,0x50,0x3C,0x9F,0xA8,
  0x51,0xA3,0x40,0x8F,0x92,0x9D,0x38,0xF5,0xBC,0xB6,0xDA,0x21,0x10,0xFF,0xF3,0xD2,
  0xCD,0x0C,0x13,0xEC,0x5F,0x97,0x44,0x17,0xC4,0xA7,0x7E,0x3D,0x64,0x5D,0x19,0x73,
  0x60,0x81,0x4F,0xDC,0x22,0x2A,0x90,0x88,0x46,0xEE,0xB8,0x14,0xDE,0x5E,0x0B,0xDB,
  0xE0,0x32,0x3A,0x0A,0x49,0x06,0x24,0x5C,0xC2,0xD3,0xAC,0x62,0x91,0x95,0xE4,0x79,
  0xE7,0xC8,0x37,0x6D,0x8D,0xD5,0x4E,0xA9,0x6C,0x56,0xF4,0xEA,0x65,0x7A,0xAE,0x08,
  0xBA,0x78,0x25,0x2E,0x1C,0xA6,0xB4,0xC6,0xE8,0xDD,0x74,0x1F,0x4B,0xBD,0x8B,0x8A,
  0x70,0x3E,0xB5,0x66,0x48,0x03,0xF6,0x0E,0x61,0x35,0x57,0xB9,0x86,0xC1,0x1D,0x9E,
  0xE1,0xF8,0x98,0x11,0x69,0xD9,0x8E,0x94,0x9B,0x1E,0x87,0xE9,0xCE,0x55,0x28,0xDF,
  0x8C,0xA1,0x89,0x0D,0xBF,0xE6,0x42,0x68,0x41,0x99,0x2D,0x0F,0xB0,0x54,0xBB,0x16};

unsigned char ret_val = 0;
int i,j;
for(i=0;i<256;i++) {
  unsigned char is_index = (x ^ ((unsigned char) i));
  for(j=0;j<8;j++) {
   is_index |= (is_index << (1 << j)) | (is_index >> (1 << j));
  }
  is_index = ~is_index;
  ret_val |= is_index & t[i];
}

return ret_val;}

Explanation:

Essentially an array lookup reimplemented using bitwise operators and always processing the entire array. We loop through the array, and xor x with each index, then use bitwise operators to logically negate the result, so we get 255 when x=i and 0 otherwise. We bitwise-and this with the array-value, so that the chosen value is unchanged and the others become 0. Then we take the bitwise-or of this array, thereby pulling out only the chosen value.

The two 1 << j operations would go away as part of unrolling the loop, replacing them with the powers of 2 from 1 through 128.

histocrat

Posted 2012-12-21T17:40:24.530

Reputation: 20 600

Now to see if it's possible to actually do the math using bitwise operators. – histocrat – 2012-12-21T20:46:16.053

Looking through the algorithm here, I kind of doubt it'll be possible to implement the polynomial inversion without looping over all bytes at least once as a substitute for some of the polynomial-time steps. So this might well beat any "smart" solutions. I suspect tuning this constant-time array lookup is a more promising avenue.

– histocrat – 2012-12-21T22:25:23.733

Nice. The function rj_sbox in aes.c here might give inspiration (although, as is, it does not match the question).

– fgrieu – 2012-12-22T00:04:21.990

Where does the -(2+2) in your scoring calculation come from? Edit: ah, inlining creates a <<1 and a >>1. – Peter Taylor – 2012-12-22T15:06:07.467

Score 1968 1692, using the lookup table

Note: This solution doesn't pass the criteria, because of w >> b.

Using the lookup table, but reading 8 bytes at a time.
3*7 + 32 * (6*7 + 2*5) + 7 = 692

unsigned char SubBytes(unsigned char x){

static const unsigned char t[256] = {
  0x63,0x7C,0x77,0x7B,0xF2,0x6B,0x6F,0xC5,0x30,0x01,0x67,0x2B,0xFE,0xD7,0xAB,0x76,
  0xCA,0x82,0xC9,0x7D,0xFA,0x59,0x47,0xF0,0xAD,0xD4,0xA2,0xAF,0x9C,0xA4,0x72,0xC0,
  0xB7,0xFD,0x93,0x26,0x36,0x3F,0xF7,0xCC,0x34,0xA5,0xE5,0xF1,0x71,0xD8,0x31,0x15,
  0x04,0xC7,0x23,0xC3,0x18,0x96,0x05,0x9A,0x07,0x12,0x80,0xE2,0xEB,0x27,0xB2,0x75,
  0x09,0x83,0x2C,0x1A,0x1B,0x6E,0x5A,0xA0,0x52,0x3B,0xD6,0xB3,0x29,0xE3,0x2F,0x84,
  0x53,0xD1,0x00,0xED,0x20,0xFC,0xB1,0x5B,0x6A,0xCB,0xBE,0x39,0x4A,0x4C,0x58,0xCF,
  0xD0,0xEF,0xAA,0xFB,0x43,0x4D,0x33,0x85,0x45,0xF9,0x02,0x7F,0x50,0x3C,0x9F,0xA8,
  0x51,0xA3,0x40,0x8F,0x92,0x9D,0x38,0xF5,0xBC,0xB6,0xDA,0x21,0x10,0xFF,0xF3,0xD2,
  0xCD,0x0C,0x13,0xEC,0x5F,0x97,0x44,0x17,0xC4,0xA7,0x7E,0x3D,0x64,0x5D,0x19,0x73,
  0x60,0x81,0x4F,0xDC,0x22,0x2A,0x90,0x88,0x46,0xEE,0xB8,0x14,0xDE,0x5E,0x0B,0xDB,
  0xE0,0x32,0x3A,0x0A,0x49,0x06,0x24,0x5C,0xC2,0xD3,0xAC,0x62,0x91,0x95,0xE4,0x79,
  0xE7,0xC8,0x37,0x6D,0x8D,0xD5,0x4E,0xA9,0x6C,0x56,0xF4,0xEA,0x65,0x7A,0xAE,0x08,
  0xBA,0x78,0x25,0x2E,0x1C,0xA6,0xB4,0xC6,0xE8,0xDD,0x74,0x1F,0x4B,0xBD,0x8B,0x8A,
  0x70,0x3E,0xB5,0x66,0x48,0x03,0xF6,0x0E,0x61,0x35,0x57,0xB9,0x86,0xC1,0x1D,0x9E,
  0xE1,0xF8,0x98,0x11,0x69,0xD9,0x8E,0x94,0x9B,0x1E,0x87,0xE9,0xCE,0x55,0x28,0xDF,
  0x8C,0xA1,0x89,0x0D,0xBF,0xE6,0x42,0x68,0x41,0x99,0x2D,0x0F,0xB0,0x54,0xBB,0x16};

  unsigned long long *t2 = (unsigned long long *)t;
  int a = x>>3, b=(x&7)<<3;                       // 7+7+7
  int i;
  int ret = 0;
  for (i=0;i<256/8;i++) {                         // 32 *
      unsigned long long w = t2[i];
      int badi = ((unsigned int)(a|-a))>>31;      // 7+7+5
      w &= (badi-1);                              // +7+7
      a--;                                        // +5
      ret |= w >> b;                              // +7+7
  }
  return ret & 0xff;                              // +7
}

ugoren

Posted 2012-12-21T17:40:24.530

Reputation: 16 527

I don't think this meets the definition of constant time in the question, because w>>b has RHS calculated from x – Peter Taylor – 2012-12-25T09:24:25.823

There are several violations: w >> b where b depends on input; also x/8, x%8, and *= (1-badi). The first one is especially likely to degenerate into a timing dependency on low-end CPUs. However, the idea of using wide variables certainly has potential. – fgrieu – 2012-12-25T23:35:41.163

I didn't read the instructions carefully enough. I can fix most of the problems, but w >> b is quite essential (I need to see if it can be fixed without rewriting everything. – ugoren – 2012-12-26T05:23:03.260

Table lookup and mask, score = 256 * (57+15) = 10240

Uses table lookup with a mask to pick out only the result we want. Uses the fact that j|-j is either negative (when i!=x) or zero (when i==x). Shifting makes an all-one or all-zero mask which is used to select out only the entry we want.

static const unsigned char t[256] = {
  0x63,0x7C,0x77,0x7B,0xF2,0x6B,0x6F,0xC5,0x30,0x01,0x67,0x2B,0xFE,0xD7,0xAB,0x76,
  0xCA,0x82,0xC9,0x7D,0xFA,0x59,0x47,0xF0,0xAD,0xD4,0xA2,0xAF,0x9C,0xA4,0x72,0xC0,
  0xB7,0xFD,0x93,0x26,0x36,0x3F,0xF7,0xCC,0x34,0xA5,0xE5,0xF1,0x71,0xD8,0x31,0x15,
  0x04,0xC7,0x23,0xC3,0x18,0x96,0x05,0x9A,0x07,0x12,0x80,0xE2,0xEB,0x27,0xB2,0x75,
  0x09,0x83,0x2C,0x1A,0x1B,0x6E,0x5A,0xA0,0x52,0x3B,0xD6,0xB3,0x29,0xE3,0x2F,0x84,
  0x53,0xD1,0x00,0xED,0x20,0xFC,0xB1,0x5B,0x6A,0xCB,0xBE,0x39,0x4A,0x4C,0x58,0xCF,
  0xD0,0xEF,0xAA,0xFB,0x43,0x4D,0x33,0x85,0x45,0xF9,0x02,0x7F,0x50,0x3C,0x9F,0xA8,
  0x51,0xA3,0x40,0x8F,0x92,0x9D,0x38,0xF5,0xBC,0xB6,0xDA,0x21,0x10,0xFF,0xF3,0xD2,
  0xCD,0x0C,0x13,0xEC,0x5F,0x97,0x44,0x17,0xC4,0xA7,0x7E,0x3D,0x64,0x5D,0x19,0x73,
  0x60,0x81,0x4F,0xDC,0x22,0x2A,0x90,0x88,0x46,0xEE,0xB8,0x14,0xDE,0x5E,0x0B,0xDB,
  0xE0,0x32,0x3A,0x0A,0x49,0x06,0x24,0x5C,0xC2,0xD3,0xAC,0x62,0x91,0x95,0xE4,0x79,
  0xE7,0xC8,0x37,0x6D,0x8D,0xD5,0x4E,0xA9,0x6C,0x56,0xF4,0xEA,0x65,0x7A,0xAE,0x08,
  0xBA,0x78,0x25,0x2E,0x1C,0xA6,0xB4,0xC6,0xE8,0xDD,0x74,0x1F,0x4B,0xBD,0x8B,0x8A,
  0x70,0x3E,0xB5,0x66,0x48,0x03,0xF6,0x0E,0x61,0x35,0x57,0xB9,0x86,0xC1,0x1D,0x9E,
  0xE1,0xF8,0x98,0x11,0x69,0xD9,0x8E,0x94,0x9B,0x1E,0x87,0xE9,0xCE,0x55,0x28,0xDF,
  0x8C,0xA1,0x89,0x0D,0xBF,0xE6,0x42,0x68,0x41,0x99,0x2D,0x0F,0xB0,0x54,0xBB,0x16};

unsigned char SubBytes(unsigned char x) {
  unsigned char r = 255;
  for (int i = 0; i < 256; i++) {
    int j = i - x;
    r &= t[i] | ((j | -j) >> 31);
  }
  return r;
}

Keith Randall

Posted 2012-12-21T17:40:24.530

Reputation: 19 865

Isn't this the same as my second answer except less optimised? – Peter Taylor – 2012-12-31T18:45:53.077

Close, I guess. I'm using signed shift so I don't have to do the -1 at the end. – Keith Randall – 2012-12-31T21:18:12.280