GolfScript, 60 characters
{[[0 1{.283{1$2*.255>@*^}:r~^}255*].@?~)={257r}4*99]{^}*}:S;
This code defines a function named S
that takes in a byte and applies the Rijndael S-box to it. (It also uses an internal helper function named r
to save a few chars.)
This implementation uses a logarithm table to compute the GF(28) inverses, as suggested by Thomas Pornin. To save a few chars, the whole logarithm table is recalculated for each input byte; even so, and despite GolfScript being a very slow language in general, this code takes only about 10 ms to process a byte on my old laptop. Precalculating the logarithm table (as L
) speeds it up to about 0.5 ms per byte, at the modest cost of three more chars:
[0 1{.283{1$2*.255>@*^}:r~^}255*]:L;{[L?~)L={257r}4*99]{^}*}:S;
For convenience, here's a simple test harness that calls the function S
, as defined above, to compute and print out the whole S-box in hex like on Wikipedia:
"0123456789abcdef"1/:h; 256, {S .16/h= \16%h= " "++ }% 16/ n*
Try this code online.
(The online demo precalculates the logarithm table to avoid taking too much time. Even so, the online GolfScript site may sometimes randomly time out; this is a known issue with the site, and a reload usually fixes it.)
Explanation:
Let's start with the logarithm table calculation, and specifically with the helper function r
:
{1$2*.255>@*^}:r
This function takes two inputs on the stack: a byte and a reduction bitmask (a constant between 256 and 511). It duplicates the input byte, multiplies the copy by 2 and, if the result exceeds 255, XORs it with the bitmask to bring it back under 256.
Within the log-table generating code, the function r
is called with the reduction bitmask 283 = 0x11b (which corresponds to the Rijndael GF(28) reduction polynomial x8 + x4 + x3 + x + 1), and result is XORed with the original byte, effectively multiplying it by 3 (= x + 1, as a polynomial) in the Rijndael finite field. This multiplication is repeated 255 times, starting from the byte 1, and the results (plus an initial zero byte) are collected into a 257-element array L
that looks like this (middle part omitted):
[0 1 3 5 15 17 51 85 255 26 46 ... 180 199 82 246 1]
The reason why there are 257 elements is that, with the prepended 0 and with 1 occurring twice, we can find the modular inverse of any given byte simply by looking up its (zero-based) index in this array, negating it, and looking up the byte at the negated index in the same array. (In GolfScript, as in many other programming languages, negative array indexes count backwards from the end of the array.) Indeed, this is exactly what the code L?~)L=
at the beginning of the function S
does.
The rest of the code calls the helper function r
four times with the reduction bitmask 257 = 28 + 1 to create four bit-rotated copies of the inverted input byte. These are all collected into an array, together with the constant 99 = 0x63, and XORed together to produce the final output.
1Bonus points (upvotes from me) if the resulting function is constant-time (i.e. no data-dependent code paths or array accesses or whatever your language supports). – Paŭlo Ebermann – 2011-10-04T08:08:16.803
@PaŭloEbermann array accesses are constant time in many languages (it's adding a (scaled) value to a pointer and dereferencing it, this is why a lookup table is so very fast) – ratchet freak – 2011-10-04T08:58:13.153
@ratchetfreak Array accesses are O(1), but the actual access time depends on cache hits or misses, for example, which leads to side-channel attacks on AES. – Paŭlo Ebermann – 2011-10-04T09:01:47.997
@PaŭloEbermann, but you can use the shorter code to fill a lookup table, which will then fit in well under a page of memory. – Peter Taylor – 2011-10-04T09:32:08.477
@PaŭloEbermann and if the 256-length table is stored along the code (as enum generated at compile time) you nearly guaranteed a cache hit – ratchet freak – 2011-10-04T09:35:52.263
There is some discussion about the relevance of cache timing attacks on AES in the Cryptography chat. (@ratchet freak)
– Paŭlo Ebermann – 2011-10-04T11:58:17.990@PaŭloEbermann you can't argue that my ubyte to ubyte version is fully branchless (except for 1
if
in themulInv()
which is 1 instruction over 256 iterations and then it continues with searching the full space) – ratchet freak – 2011-10-04T12:22:11.527