Nearest hamming cycle period in MD5

1

This riddle was inspired by that thread. Consider that you are super-hacker and try to break MD5 hashing algorithm by looking for a hash collisions for a given hash string which your friend gave to you. But this is a non-trivial task and you have a migraine disorder so you try to break the problem into smaller parts. First thing which you try is to look for a hamming distance patterns in a repeated execution of MD5() feeding it's last output to an input. For example, if your friend gave to you a hash bb2afeb932cc5aafff4a3e7e31d6d3fb, then you will notice a such pattern :

e3b86f8d25e5a71a88eabef0d7ed9840=md5(bb2afeb932cc5aafff4a3e7e31d6d3fb), hamming 95
45fcdb59517b39f18d1622a84be0a301=md5(e3b86f8d25e5a71a88eabef0d7ed9840), hamming 91
0b3f2eac737b916f3b78f240e7fc97ec=md5(45fcdb59517b39f18d1622a84be0a301), hamming 77
fe56b1802722cc30b936186aacce1e01=md5(0b3f2eac737b916f3b78f240e7fc97ec), hamming 95

So after 4 iterations you arrived at hamming distance 95 between an output and a first input, which is the same 95 number which you have started from. So nearest hamming cycle period of MD5(bb2afeb932cc5aafff4a3e7e31d6d3fb) is 4. You have found a hamming collision !

Your task is to write a program which takes an arbitrary MD5 hash from the standard input, and outputs it's nearest hamming cycle period on standard output. For example:

Input:  45fcdb59517b39f18d1622a84be0a301, Output: 3
Input:  1017efa1722ba21873d79c3a215da44a, Output: 7
Input:  edc990163ff765287a4dd93ee48a1107, Output: 90
Input:  c2ca818722219e3cc5ec6197ef967f5d, Output: 293

Shortest code wins.

Notes:

  1. If your language doesn't have an md5() built-in, feel free to implement that yourself and then write in the solution total byte count and byte count without md5() sub-program implementation, the later will be considered as an answer.

  2. Hamming distance is defined as number of differing bits between two strings. If strings converted to binary (by character ASCII code) has different lengths - smaller one is padded from the left with 0. Example:

Word1: nice Binary1: 01101110011010010110001101100101
Word2: car  Binary2: 00000000011000110110000101110010
Hamming distance (differing bits): 12
  1. In this task you should calculate hamming distance between a FIRST input in a cycle and current output. So according to a given hamming cycle example above you should calculate
hamming(bb2afeb932cc5aafff4a3e7e31d6d3fb,e3b86f8d25e5a71a88eabef0d7ed9840)
hamming(bb2afeb932cc5aafff4a3e7e31d6d3fb,45fcdb59517b39f18d1622a84be0a301)
hamming(bb2afeb932cc5aafff4a3e7e31d6d3fb,0b3f2eac737b916f3b78f240e7fc97ec)
hamming(bb2afeb932cc5aafff4a3e7e31d6d3fb,fe56b1802722cc30b936186aacce1e01)

Agnius Vasiliauskas

Posted 2019-03-22T08:39:44.440

Reputation: 125

I feel like the note favors languages with non-built-in md5 hashing, as one can practically choose the required bytes for calling the hashing. – Jonathan Frech – 2019-03-22T09:38:42.790

@JonathanFrech I'm not sure about this really. Calling hash sub-program should take relatively small amount of bytes in relation to total program source. So this is not an important factor. Important is - how the rest will be coded, and it should be minimized. So can you elaborate more on this ? – Agnius Vasiliauskas – 2019-03-22T09:52:03.497

3

I think what Jonathan means that if you choose to use a language, that does not have builtin md5 hashing, you can just define a function with a short name, while calling the builtin function in a different language might cost a lot of bytes (example in Python).

– ovs – 2019-03-22T11:04:13.870

1Actually, it seems like you're computing the Hamming distance between the ASCII codes of the hexadecimal representation of the MD5 digests in lower case. This should definitely be specified in the challenge. – Arnauld – 2019-03-22T12:56:24.790

1I'm not the downvoter, but the question seems to be just asking us to implement the md5 algorithm, with an extra part tacked on. It would be fine if it was one or the other, but as it is, it discourages languages without md5 built-ins (and therefore the people using them). This is exacerbated by letting people not count the code to implement the md5 algorithm, but forcing them to implement it anyway – Jo King – 2019-03-24T03:04:08.480

Answers

3

Jelly, 37 bytes

Ç,³OBU+0x8¤_/FAS,Ç
Ç;0ịÇƊ$⁺-ịn1ị$Ʋ¿LH

Try it online!

Try it online! - using full implementation of MD5

I'll comment it when I have time. Jelly doesn't have an MD5 built-in, so the header for the first TIO implements MD5 for a fixed length (32 bytes) input. I've also now extended this to a general purpose MD5 routine (second TIO link). For the purposes of the challenge, there is a monadic link just above the challenge code that takes a 32 character string and returns the MD5 hash. The remaining code (as displayed above here) calculates the hamming distance between the original input and sequential hashes until it matches the first one.

Nick Kennedy

Posted 2019-03-22T08:39:44.440

Reputation: 11 829

2

Python 2, 199 196 bytes

-3 bytes thanks to Kevin Cruijssen

from md5 import*
B=bytearray
h=lambda a,b:sum(bin(i^j).count('1')for i,j in zip(B(a),B(b)))
m=lambda a:new(a).hexdigest()
def f(x):
 a=b=m(x);i=1
 while 1:
    b=m(b);i+=1
    if h(x,a)==h(x,b):return i

Try it online!

Jonas Ausevicius

Posted 2019-03-22T08:39:44.440

Reputation: 311

1The two space-indentations before b=.. and if h... can be a single tab to save 2 bytes. :) Also, return(i) can be return i to save an additional byte. – Kevin Cruijssen – 2019-03-22T13:01:30.753

1

perl 5 (-MDigest::MD5=md5_hex -M5.01 -pl), 83 bytes

may be golfed

$x=$_;$h=(unpack'B*',($x=md5_hex($x))^$_)=~y/1//,$i||=$h while$c++<2|$i!=$h;$_=$c-1

TIO

Nahuel Fouilleul

Posted 2019-03-22T08:39:44.440

Reputation: 5 582

Why you say that there is no cycle ? Isn't it that we get to the same hamming number we started from ? Call it as you like, but it is a loop - hamming output mapped to the same starting hamming input – Agnius Vasiliauskas – 2019-03-22T19:43:06.823

1i mean it's not a cycle with the meaning : it depends from the point we start, if we start from next md5 we don't get the same result – Nahuel Fouilleul – 2019-03-23T05:53:37.777

Should we ? Of course not, because we are searching a hash (hamming) collision - duplicated input which when hashed gives us the same output. Now when you change starting MD5 - this means you begin looking for other colliding input pairs. And in general - every loop in the world depends on starting conditions. So it's not a problem with a cycle, but rather with your understanding about it. – Agnius Vasiliauskas – 2019-03-23T08:11:25.863

admitted, i remove from the answer – Nahuel Fouilleul – 2019-03-23T08:44:31.147

You can leave your answer, it's valid. Suppose we are saving our passwords as CRC32 trimmed to higher 16 bits. Now we want to crack it - find a fake password which matches the same hash. Look here .

– Agnius Vasiliauskas – 2019-03-23T09:03:14.360

Changing starting hash in my example code above, means that we are trying to break other password, than we had before. By analogy, same holds for a hamming loop - changing starting MD5 means finding a hamming collision to other password than we had before. Hope this makes sense to you. – Agnius Vasiliauskas – 2019-03-23T09:10:40.650

ok, but using hamming distance between md5 hex string and md5 hex string of md5 hex string as hash would be a weak hash – Nahuel Fouilleul – 2019-03-23T09:33:28.447

I don't completely understood your point, but yes it can be called a "weak-collision". In reality it would be much more fun to find true colliding MD5 hashes. So it's just a tool to find some repeated patterns in MD5 algo before applying full-scale attack to MD5. But it's non-trivial like I said in original post, so ... – Agnius Vasiliauskas – 2019-03-23T09:41:43.490

1

JavaScript (Node.js), 157 bytes

f=(s,n,a=(B=Buffer)(s))=>a.map((x,i)=>r+=(g=x=>x&&x%2+g(x/2))(x^B(s)[i]),s=require('crypto').createHash('md5').update(s).digest`hex`,r=0)|r==n||1+f(s,n||r,a)

Try it online!

Arnauld

Posted 2019-03-22T08:39:44.440

Reputation: 111 334

I'm thinking should I add a new rule that language can't use it's deprecated features. Anyway It would be good if you could eliminate deprecation warning Buffer() is deprecated due to security and usability issues – Agnius Vasiliauskas – 2019-03-23T13:27:13.767

1

@AgniusVasiliauskas Golfed code is basically the exact opposite of clean code and if could save a few more bytes by doing even more ugly things, I definitely would. As pointed out by Jo King in the comments, you may want to have a look at other winning criteria for your future challenges.

– Arnauld – 2019-03-23T17:05:43.747

Fixing the deprecation warning would cost 10 bytes.

– Arnauld – 2019-03-23T17:06:47.067

Ok, agree, that cleaning a code is not a point here :-) – Agnius Vasiliauskas – 2019-03-23T18:45:54.430