Compute CRC32 Hash

14

1

Credits

This challenge originated from @miles.


Create a function that computes the CRC32 hash of an input string. The input will be an ASCII string of any length. The output will be the CRC32 hash of that input string.

Explanation

The algorithm of CRC32 and other CRC are essentially the same, so only CRC3 will be demonstrated here.

Firstly, you have the generator polynomial, which is actually a 4-bit [n+1] integer (would be 33-bit in CRC32).

In this example, the generator polynomial is 1101.

Then, you will have the string to be hashed, which in this example would be 00010010111100101011001101.

00010010111100101011001101|000 (1)    append three [n] "0"s
   1101                        (2)    align with highest bit
00001000111100101011001101|000 (3)    XOR (1) and (2)
    1101                       (4)    align with highest bit
00000101111100101011001101|000 (5)    XOR (3) and (4)
     1101                      (6)    align with highest bit
00000011011100101011001101|000 (7)    XOR (5) and (6)
      1101                     (8)    align with highest bit
00000000001100101011001101|000 (9)    XOR (7) and (8)
          1101                 (10)   align with highest bit
00000000000001101011001101|000 (11)   XOR (9) and (10)
             1101              (12)   align with highest bit
00000000000000000011001101|000 (13)   XOR (11) and (12)
                  1101         (14)   align with highest bit
00000000000000000000011101|000 (15)   XOR (13) and (14)
                     1101      (16)   align with highest bit
00000000000000000000000111|000 (17)   XOR (15) and (16)
                       110 1   (18)   align with highest bit
00000000000000000000000001|100 (19)   XOR (17) and (18)
                         1 101 (20)   align with highest bit
00000000000000000000000000|001 (21)   XOR (19) and (20)
^--------REGION 1--------^ ^2^

The remainder obtained at (21), when region 1 is zero, which is 001, would be the result of the CRC3 hash.

Specs

  • The generator polynomial is 0x104C11DB7, or 0b100000100110000010001110110110111, or 4374732215.
  • Input can be a string or a list of integers, or any other reasonable format.
  • Output be a hex string or just an integer, or any other reasonable format.
  • Built-ins that compute the CRC32 hash are not allowed.

Goal

Standard rules for apply.

The shortest code wins.

Test cases

input         output      (hex)
"code-golf"   147743960   08CE64D8
"jelly"       1699969158  65537886
""            0           00000000

Leaky Nun

Posted 2016-05-01T01:32:15.147

Reputation: 45 011

If I understand right, this is doing polynomial division modulo 2 and finding the remainder, i.e. the analogue of mod in XOR multiplication.

– xnor – 2016-05-01T01:43:02.570

1

Yep. This is not xnor modulo though, this is xor modulo.

– Leaky Nun – 2016-05-01T01:45:41.980

For CRC32, do you first append 31 0's? – xnor – 2016-05-01T01:46:58.493

Yes – – – – – – – – – – Leaky Nun – 2016-05-01T01:47:41.130

1@KennyLau you can ping people with their name, just like chat. – Rɪᴋᴇʀ – 2016-05-01T01:51:28.493

In your example, why is the remainder 001 rather than 1100? – Dennis – 2016-05-01T02:09:11.443

@Dennis Added . – Leaky Nun – 2016-05-01T02:10:59.833

Are builtins for polynomial division allowed? – Luis Mendo – 2016-05-01T02:12:20.913

@LuisMendo I don't think there will be any built-in suitable for this challenge. Would you care to provide an example? – Leaky Nun – 2016-05-01T02:13:07.937

But isn't the CRC in your example a 3-bit hash? – Dennis – 2016-05-01T02:13:49.793

@KennyLau I meant this. I'm not sure it helps. It's standard polynomial division

– Luis Mendo – 2016-05-01T02:17:16.733

@Dennis Should be fixed now? – Leaky Nun – 2016-05-01T02:23:35.443

@LuisMendo I don't think that would be useful. – Leaky Nun – 2016-05-01T02:24:20.980

You say the string in the example is 00010010111100101011001010, but the last four bits in stage 1 are 1101. – Dennis – 2016-05-02T05:51:49.827

Answers

12

Intel x86, 34 30 29 27 bytes

Takes the address of the zero-terminated string in ESI, and returns the CRC in EBX:

31 db ac c1 e0 18 74 01 31 c3 6a 08 59 01 db 73 
06 81 f3 b7 1d c1 04 e2 f4 eb e7

Disassembly (AT&T syntax):

00000000    xorl    %ebx, %ebx
00000002    lodsb   (%esi), %al
00000003    shll    $24, %eax
00000006    je      0x9
00000008    xorl    %eax, %ebx
0000000a    pushl   $8
0000000c    popl    %ecx
0000000d    addl    %ebx, %ebx
0000000f    jae     0x17
00000011    xorl    $0x4c11db7, %ebx
00000017    loop    0xd
00000019    jmp     0x2
0000001b

Incorporating suggestions from Peter Cordes to save four more bytes. This assumes a calling convention where the direction flag for string instructions is cleared on entry.

Incorporating suggestion of Peter Ferrie to use push literal and pop to load a constant, saving one byte.

Incorporating suggestion of Peter Ferrie to jump to the second byte of an xorl %eax, %ebx instruction which is a retl instruction, combined with changing the interface of the routine to take a zero-terminated string instead of length, saving two bytes in total.

Mark Adler

Posted 2016-05-01T01:32:15.147

Reputation: 759

Use a calling convention that requires the direction flag to be cleared on entry, so you can save the cld insn (like I did in my adler32 answer). Is it normal practice to allow totally arbitrary calling conventions for asm answers?

– Peter Cordes – 2016-05-02T05:15:50.983

Anyway, it looks like your code will work as x86-64 machine code, and you could use the x86-64 SysV x32 calling convention to take count in edi and the pointer in esi (maybe not zero-extended, so maybe fudge things and require a 64bit zero-extended pointer). (x32 so you can safely use 32bit pointer math, but still have the register-args calling convention. Since you don't use inc, there's no downside to long mode.) – Peter Cordes – 2016-05-02T05:17:48.763

Did you consider keeping edx in byte-reversed order? bswap edx is only 2B. shr %edx is 2B, same as your left-shift with add %edx,%edx. This probably isn't helpful; Unless it enables more optimization, you save 3B for the shl $24, %eax, but you spend 4B for xor %eax,%eax at the start and bswap %edx at the end. Zeroing eax does let you use cdq to zero %edx, so overall it's a wash. It would perform better, though: it avoids the partial-register stall/slowdown on every iteration from writing al and then reading eax with shl. :P – Peter Cordes – 2016-05-02T05:33:47.777

If you limit the input length to 0..INT_MAX (signed), then you can save 3B (in 32bit code) by looping with dec %ebx / jge instead of comparing with an end pointer (this saves the addl %esi, %ebx as well as cmp->dec). jge instead of jne is the trick to still returning in the len=0 case. In 64bit code, dec %ebx is 2B instead of 1B, so I guess nvm my suggestion to use the x32 ABI. But requiring the direction flag to be cleared on call/ret is standard for many calling conventions, so you shouldn't worry about requiring it if you're already doing stuff like returning in edx. – Peter Cordes – 2016-05-02T05:50:15.247

Thanks! I incorporated the use of jge (there is in fact a limit on the length in the rules), and took out the cld. The byte swap would not work at all, since then the bits are in the wrong place to use a one-bit shift right. – Mark Adler – 2016-05-02T06:35:01.330

Oh right, silly me, forgot about the bitwise part. BTW, I think disassembly output showing the instruction bytes would be ideal. I like being able to see which instructions are which length just from counting the bytes, not from subtracting addresses. Branch labels are a bonus, too. objdump -d is nice. Putting a label after the last instruction byte means gives you the byte count without having to remember to count the ret byte. – Peter Cordes – 2016-05-02T14:45:52.403

The question says "ASCII string of any length", but a length limit of 2GiB (signed length) should be acceptable if a limit of ~4GiB (32bit pointer) is acceptable. Fun fact, even if the crc32 instruction was allowed, it's not useful because it uses the now-prefered CRC32C polynomial (0x11EDC6F41), rather than the question's 0x104C11DB7. – Peter Cordes – 2016-05-02T15:00:21.747

1Got confused with the Adler-32 question, which has a length limit. This question doesn't have an explicit length limit. – Mark Adler – 2016-05-02T15:15:32.703

1There might be a way to make this shorter with the PCLMULQDQ instruction. However its use tends to need a lot of constants, so possibly not. – Mark Adler – 2016-05-02T17:51:35.270

Interesting. Intel has a paper on using it for CRC with arbitrary polynomials, obviously focusing on performance (not code size). Even movd to get the result back into an integer register is 4B. Vector code can be many fewer insns, but most integer insns are at least 4B. (SSE ps instructions like xorps are 3B, when the operands only require a one-byte mod/rm). There's no 1B-shorter MMX version of pclmul, and its encoding is 6B (including the imm8)

– Peter Cordes – 2016-05-02T17:57:14.523

Ah, I see you already wrote an SO answer to a question about that paper. :P Anyway, it's going to be a rare question where SSE helps at all for golfing, because Intel didn't leave much opcode space for short instructions. And if you need to handle input lengths that aren't multiples of 16, then you need a scalar cleanup loop (or for some algorithms like min(), do a potentially-overlapping unaligned last vector of data if it doesn't matter that you see some elements multiple times.)

– Peter Cordes – 2016-05-02T18:08:30.843

the "xorl %ecx,%ecx/movb $8,%cl" can be "push $8/pop %ecx" to save one byte. If the string is always non-NULL, then the "jmp 0x1a" can be removed, to save two bytes, because the CRC will calculate as zero. If the CRC were stored in %ebx instead, then the exit condition could branch to the $0xC3 part of the $0x31 $0xC3-encoded XOR instruction, to save another byte. – peter ferrie – 2018-03-15T19:56:35.980

@peterferrie Thanks! I incorporated the pop and the jump into the xor, as well as making an assumption that it is a zero-terminated string. – Mark Adler – 2018-03-16T04:26:16.647

4

Jelly, 34 bytes

l2_32Ḟ4374732215æ«^
Oḅ⁹æ«32Çæ»32$¿

Try it online!

Leaky Nun

Posted 2016-05-01T01:32:15.147

Reputation: 45 011

4

Ruby, 142 bytes

Anonymous function; takes a string as input, returns an integer.

->s{z=8*i=s.size;r=0;h=4374732215<<z
l=->n{j=0;j+=1 while 0<n/=2;j}
s.bytes.map{|e|r+=e*256**(i-=1)};r<<=32
z.times{h/=2;r^=l[h]==l[r]?h:0}
r}

Value Ink

Posted 2016-05-01T01:32:15.147

Reputation: 10 608

2Can you change your name so that people can distinguish us? XD – Leaky Nun – 2016-05-01T03:18:19.713

2@KennyLau must you be so picky... OK fine – Value Ink – 2016-05-01T03:22:29.970

I was just kidding xd – Leaky Nun – 2016-05-01T03:24:51.357

4

Jelly, 23 bytes

ḅ⁹Bµ4374732215B×ḢḊ^µL¡Ḅ

Input is in form of a list of integers. Try it online! or verify all test cases.

How it works

While Jelly has bitwise XOR, padding the input with zeroes and aligning the polynomial with the most significant binary digit makes this approach, which uses lists of bits instead, a tad shorter.

ḅ⁹Bµ4374732215B×ḢḊ^µL¡Ḅ  Main link. Argument: A (list of bytes)

ḅ⁹                       Convert A from base 256 to integer.
  B                      Convert the result to binary, yielding a list.
   µ                     Begin a new, monadic chain. Argument: B (list of bits)
    4374732215B          Convert the integer to binary, yielding a list.
                Ḣ        Pop and yield the first, most significant bit of B.
               ×         Multiply each bit in the polynomial by the popped bit.
                 ^       Compute the element-wise XOR of both lists.
                         If one of the lists is shorter, the elements of the other
                         lists do not get modified, thus avoiding the necessity
                         of right-padding B with zeroes.
                  µ      Convert the previous chain into a link.
                   L¡    Execute the chain L times, where L is the number of bits
                         in the original bit list.
                     Ḅ   Convert from binary to integer.

Dennis

Posted 2016-05-01T01:32:15.147

Reputation: 196 637

3

Pyth, 42 bytes

?!Q0h<#^2 32.uxN.<4374732215.as-lN32.<CQ32

Test suite.

Leaky Nun

Posted 2016-05-01T01:32:15.147

Reputation: 45 011

3

CJam, 37 36 bytes

q256b32m<{Yb4374732215Yb.^Yb_Yb32>}g

Test it here.

Explanation

q               e# Read input.
256b            e# Convert to single number by treating the character codes
                e# as base-256 digits.
32m<            e# Left-shift the number by 32 bits, effectively appending 32
                e# zeros to the binary representation.
{               e# While the condition on top of the stack is truthy...
  Yb            e#   Convert the number to base 2.
  4374732215Yb  e#   Convert the polynomial to base 2.
  .^            e#   Take the bitwise XOR. If the number is longer than the
                e#   polynomial, the remaining bits will be left unchanged.
  Yb            e#   Convert the list back from base 2, effectively stripping
                e#   leading zeros for the next iteration.
  _             e#   Duplicate the result.
  Yb            e#   Convert back to base 2.
  32>           e#   Remove the first 32 bits. If any are left, continue the loop.
}g

Martin Ender

Posted 2016-05-01T01:32:15.147

Reputation: 184 808

q256bYb_,{(4374732215Ybf*1>.^}*Yb saves a few bytes. – Dennis – 2016-05-02T04:26:28.583

@Dennis That's really clever, feel free to make it a separate answer. :) – Martin Ender – 2016-05-02T08:16:36.780

3

Pyth, 28 bytes

uhS+GmxG.<C"Á·"dlhG.<Cz32

Try it online: Demonstration or Test Suite

Explanation:

uhS+GmxG.<C"..."dlhG.<Cz32   implicit: z = input string
                      Cz     convert to number
                    .<  32   shift it by 32 bits
u                            apply the following expression to G = ^,
                             until it get stuck in a loop:
     m           lhG            map each d in range(0, log2(G+1)) to:
          C"..."                   convert this string to a number (4374732215)
        .<      d                  shift it by d bits
      xG                           xor with G
   +G                           add G to this list
 hS                             take the minimum as new G

Jakube

Posted 2016-05-01T01:32:15.147

Reputation: 21 462

2

JavaScript (ES6), 180 bytes

f=(s,t=(s+`\0\0\0\0`).replace(/[^]/g,(c,i)=>(c.charCodeAt()+256*!!i).toString(2).slice(!!i)))=>t[32]?f(s,t.replace(/.(.{32})/,(_,m)=>(('0b'+m^79764919)>>>0).toString(2))):+('0b'+t)

The lack of a 33-bit XOR operator, or even an unsigned 32-bit XOR operator, is unhelpful.

Neil

Posted 2016-05-01T01:32:15.147

Reputation: 95 035

1

CJam, 33 bytes

q256bYb_,{(4374732215Ybf*1>.^}*Yb

Input is in form of a string. Try it online!

How it works

q                                  Read all input from STDIN.
 256bYb                            Convert it from base 256 to base 2.
       _,{                   }*    Compute the length and repeat that many times:
          (                          Shift out the first bit.
           4374732215Yb              Convert the integer to base 2.
                       f*            Multiply each bit by the shifted out bit.
                         1>          Remove the first bit.
                           .^        Compute the element-wise XOR of both lists.
                                     If one of the lists is shorter, the elements
                                     of the other lists do not get modified, thus
                                     avoiding the necessity of right-padding B with
                                     zeroes.
                               Yb  Convert the final result from base 2 to integer.

Dennis

Posted 2016-05-01T01:32:15.147

Reputation: 196 637