Tweetable hash function challenge


In this challenge you will write a hash function in 140 bytes¹ or fewer of source code. The hash function must take an ASCII string as input, and return a 24-bit unsigned integer ([0, 2^24 − 1]) as output.

Your hash function will be evaluated for every word in this large British English dictionary². Your score is the number of words that share a hash value with another word (a collision).

The lowest score wins, ties broken by first poster.

Test case

Before submitting, please test your scoring script on the following input:

duplicate
duplicate
duplicate
duplicate

If it gives any score other than 4, it is buggy.
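For reference, here is a minimal scoring sketch in Python (not the official scoring script; `H` is just a placeholder hash, so only the counting logic matters). On the input above it prints 4:

```python
from collections import Counter

def H(s):
    # placeholder hash; substitute the function being scored
    return sum(map(ord, s)) % 2**24

words = ["duplicate"] * 4
counts = Counter(H(w) for w in words)
# every word whose hash is shared with another word counts toward the score
score = sum(v for v in counts.values() if v > 1)
print(score)  # prints 4
```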


Clarifying rules:

  1. Your hash function must operate on a single string, not a whole array. Also, your hash function may not perform any I/O other than taking the input string and returning the output integer.
  2. Built-in hash functions or similar functionality (e.g. encryption to scramble bytes) are disallowed.
  3. Your hash function must be deterministic.
  4. Contrary to most other contests, optimizing specifically for the scoring input is allowed.

¹ I am aware Twitter limits characters rather than bytes, but for simplicity we will use bytes as the limit for this challenge.
² Modified from Debian's wbritish-huge, removing any non-ASCII words.

orlp

Posted 2016-04-01T11:03:04.587

Reputation: 37 067

Are seeded rand functions allowed, or are they considered "similar functionality"? – Doorknob – 2016-04-01T11:05:21.077

@Doorknob I would consider them similar functionality. It is intended you scramble the input using your own code, and not through some side effect of other hashing algorithms, crypto, rng, compression, etc. – orlp – 2016-04-01T11:06:12.530

Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch's ? What the...? – Luis Mendo – 2016-04-01T11:39:40.643


@DonMuesli https://en.wikipedia.org/wiki/Llanfairpwllgwyngyll (fun fact: that word is also in Jelly's built-in compression dictionary)

– Martin Ender – 2016-04-01T11:40:41.300


Relevant: http://140byt.es/

– ASCII-only – 2016-04-01T12:29:05.957

@orlp What domain does the hash function have to be defined on? Is it only required to work for words in the dictionary? Or does it have to work with inputs not in that dictionary as well? – kasperd – 2016-04-01T13:04:37.133

This is a lovely challenge! – None – 2016-04-01T14:58:48.133

I think you should disallow built-in dictionaries. – Dennis – 2016-04-01T15:29:52.963

Per the spec in the question, a valid hash should cover the inclusive range [0, 2**24-1], but your own submission uses %(2**24-1), which covers the range [0, 2**24-2], since n%n (in this case n is 2**24-1) returns 0. Could you clarify the challenge's specifications more, please? – mbomb007 – 2016-04-01T16:22:37.600

I estimate that a random oracle would score around 6832 on this challenge. Anything better than that has to be optimized for the dictionary used for scoring. Any decent hash function will score around the same value. There will be some variation in score depending on the exact constants used in the hashing algorithm, thus it is possible to do a little better than 6832, but if you want to do significantly better than that, you have to use an approach that is much more targeted towards the specific dictionary. – kasperd – 2016-04-01T16:43:36.710

For reference: Taking the 24 MSB of SHA-512 would achieve a score of 6816. – Dennis – 2016-04-01T17:26:41.587

@Dennis I interpret built-in dictionaries as being disallowed by clarifying rule #1. I consider that to be extra I/O, as you're reading a nontrivial amount of extra information outside of the input string. – orlp – 2016-04-01T22:39:41.390


@kasperd That's correct; this challenge, however, is not about finding a generic hash function, it's about finding a perfect hash function.

– orlp – 2016-04-01T22:40:20.173

@mbomb007 That's a mistake, my submission had a bug. The specs are still authoritative. – orlp – 2016-04-01T22:41:19.367

@kasperd It's not required that the hash function works on elements not inside the dictionary, although I don't see why that would matter for any valid implementation (dictionary lookups are considered extra I/O, even when the dictionary is a builtin in your language). – orlp – 2016-04-01T22:43:59.273

@orlp If the function is only required to work for words in the dictionary, then the modulus operation in your answer is redundant. – kasperd – 2016-04-01T22:57:01.377

Not going to post as an answer because it definitely should be disallowed (the hash function should not use mutable global state, or something like that), but I had some fun writing this: H=lambda s,i=[1],d={}:d[s] if s in d else (i[0],d.__setitem__(s,i[0]),i.__setitem__(0,i[0]+1))[0] - score is 0. – Random832 – 2016-04-02T07:25:56.830

@Random832 Global state is considered extra I/O. – orlp – 2016-04-02T08:27:18.650

@orlp That's an unreasonable definition of "I/O". You're not reading it, you're maintaining it in memory. The text of the clarification isn't sufficient to conclude that "I/O" is meant to include things that are not I/O. The text of the clarification is clearly targeted at hash functions that open the file again themselves. – Random832 – 2016-04-02T08:40:34.297

Some back-of-the-envelope calculations: With D=340275 words and R=2^24 hash outputs, a random hash has an expected D^2/(2*R) = 3450 colliding pairs, some of which overlap. There are an expected D^3/(6*R^2) = 23 colliding triples and a negligible number of larger collisions, which means these triples are likely disjoint. This gives an expected 6829 words that share a hash value, ~70 in triples and the rest in pairs. The standard deviation is estimated at 118, so getting <6200 with a random hash is roughly a 5 sigma event. – xnor – 2016-04-02T17:57:13.487
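xnor's figures can be reproduced in a few lines (a sketch of the same estimate; the small overlap corrections are ignored, so the total lands near 6832 rather than exactly 6829):

```python
D = 340275  # words in the dictionary
R = 2**24   # possible hash values

pairs = D**2 / (2 * R)       # expected colliding pairs, ~3451
triples = D**3 / (6 * R**2)  # expected colliding triples, ~23
# a pair contributes 2 colliding words and a triple 3; subtracting
# 3*triples compensates for pairs counted inside triples
colliding = 2 * pairs - 3 * triples
print(round(pairs), round(triples), round(colliding))
```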

@Random832 A hash function is supposed to produce the same output each time it is given the same input. That needs to be true even across invocations of the program. So after the program has been stopped and started again, your hash function still needs to reproduce the same hash output given the same input. That means the global state cannot help you, because you are required to give the same result, even when given a new clean version of the global state. – kasperd – 2016-04-02T20:03:32.103

@orlp The point of my calculations was to reason about what was needed for an entry to be competitive. Those calculations led me to the conclusion that a generic hash function cannot be competitive, not even if you tweak the parameters. So if I wanted a better score, I would have to do something else (which I did). And the calculation also shows that any answer with a generic hash function and a score less than 6000 is essentially guaranteed to be flawed. – kasperd – 2016-04-02T20:06:47.667

@xnor Those numbers appear to be consistent with my rough estimate in an earlier comment. Nice to know the standard deviation as well. With those numbers I can be even more confident when I point out answers with an unrealistic score. – kasperd – 2016-04-02T20:10:24.110

3 of spades! Yay! – CalculatorFeline – 2016-04-02T21:10:17.860

@kasperd You're misinterpreting the question. You can do significantly better than the random oracle, and that's the point of this question. Please read up on perfect hash functions. This question is not about generic hash functions.

– orlp – 2016-04-02T22:28:38.200

@orlp You are totally not getting the point. There have been numerous answers providing a generic hash function with no targeting for the specific dictionary which still claimed a score in the ballpark around 3600. As long as the answer is a generic hash function where you have perhaps tweaked some of the parameters you will get a score within a few standard deviations of the expected value, there is no way such an entry can score less than 6000. – kasperd – 2016-04-02T22:34:09.577

@kasperd All right, I haven't seen those answers, so I was unaware of that. – orlp – 2016-04-02T22:35:05.880

@orlp That's probably because each time it happens there is somebody around to point it out in comments, and the answer gets fixed or deleted. It has happened frequently enough that the question even got edited to warn about it. Not that it helped, I have seen at least 3 of them after the warning was added to the question. – kasperd – 2016-04-02T22:39:23.423

Oh cool, my generic hash function comes to 6825, so slightly better than expected. But given the other contenders, I'm not bothering to reimplement it in a short language (I used the shell implementation for ease of testing). – mirabilos – 2016-04-04T08:05:06.973

@mirabilos If you feel your code can be golfed to within the limit, you may submit a link with the same for someone else to do so. – ghosts_in_the_code – 2016-04-04T13:48:30.620

Answers


Alright fine I’ll go learn a golfing language.

CJam, 140 bytes, 3314 colliding words

00000000: 7b5f 3162 225e d466 4a55 a05e 9f47 fc51  {_1b"^.fJU.^.G.Q
00000010: c45b 4965 3073 72dd e1b4 d887 a4ac bcbd  .[Ie0sr.........
00000020: 9c8f 70ca 2981 b2df 745a 10d0 dfca 6cff  ..p.)...tZ....l.
00000030: 7a3b 64df e730 54b4 b068 8584 5f6c 9f6b  z;d..0T..h.._l.k
00000040: b7f8 7a1f a2d3 b2b8 bcf5 cfa6 1ef7 a55c  ..z............\
00000050: dca8 795c 2492 dc32 1fb6 f449 f9ca f6b7  ..y\$..2...I....
00000060: a2cf 4772 266e ad4f d90c d236 b51d c5d5  ..Gr&n.O...6....
00000070: 5c46 3f9b 7cb4 f195 4efc fe4a ce8d 9aee  \F?.|...N..J....
00000080: 9dbc 223d 6962 3443 2329 257d            .."=ib4C#)%}

Defines a block (anonymous function). To test, you could add qN%%N*N to take the newline-separated list of words on stdin and write a newline-separated list of hashes on stdout. Equivalent Python code:

b=lambda s,a:reduce(lambda n,c:n*a+ord(c),s,0)
f=lambda s:b(s,ord('^\xd4fJU\xa0^\x9fG\xfcQ\xc4[Ie0sr\xdd\xe1\xb4\xd8\x87\xa4\xac\xbc\xbd\x9c\x8fp\xca)\x81\xb2\xdftZ\x10\xd0\xdf\xcal\xffz;d\xdf\xe70T\xb4\xb0h\x85\x84_l\x9fk\xb7\xf8z\x1f\xa2\xd3\xb2\xb8\xbc\xf5\xcf\xa6\x1e\xf7\xa5\\\xdc\xa8y\\$\x92\xdc2\x1f\xb6\xf4I\xf9\xca\xf6\xb7\xa2\xcfGr&n\xadO\xd9\x0c\xd26\xb5\x1d\xc5\xd5\\F?\x9b|\xb4\xf1\x95N\xfc\xfeJ\xce\x8d\x9a\xee\x9d\xbc'[b(s,1)%125]))%(8**8+1)

Pyth, 140 bytes, 3535 3396 colliding words

00000000: 4c25 4362 2d68 5e38 2038 2a36 3643 4022  L%Cb-h^8 8*66C@"
00000010: aa07 f29a 27a7 133a 3901 484d 3f9b 1982  ....'..:9.HM?...
00000020: d261 79ab adab 9d92 888c 3012 a280 76cf  .ay.......0...v.
00000030: a2e5 8f81 7039 acee c42e bc18 28d8 efbf  ....p9......(...
00000040: 0ebe 2910 9c90 158e 3742 71b4 bdf5 59c2  ..).....7Bq...Y.
00000050: f90b e291 8673 ea59 6975 10be e750 84c8  .....s.Yiu...P..
00000060: 0b0f e7e8 f591 f628 cefa 1ab3 2e3c 72a3  .......(.....<r.
00000070: 7f09 6190 dbd2 d54e d6d0 d391 a780 ebb6  ..a....N........
00000080: ae86 2d1e 49b0 552e 7522 4362            ..-.I.U.u"Cb

Defines a function named y. To test, you could add jmyd.z to take the newline-separated list of words on stdin and write a newline-separated list of hashes on stdout. Equivalent Python code:

b=lambda s,a:reduce(lambda n,c:n*a+ord(c),s,0)
f=lambda s:b(s,256)%(8**8+1-66*ord("\xaa\x07\xf2\x9a'\xa7\x13:9\x01HM?\x9b\x19\x82\xd2ay\xab\xad\xab\x9d\x92\x88\x8c0\x12\xa2\x80v\xcf\xa2\xe5\x8f\x81p9\xac\xee\xc4.\xbc\x18(\xd8\xef\xbf\x0e\xbe)\x10\x9c\x90\x15\x8e7Bq\xb4\xbd\xf5Y\xc2\xf9\x0b\xe2\x91\x86s\xeaYiu\x10\xbe\xe7P\x84\xc8\x0b\x0f\xe7\xe8\xf5\x91\xf6(\xce\xfa\x1a\xb3.<r\xa3\x7f\ta\x90\xdb\xd2\xd5N\xd6\xd0\xd3\x91\xa7\x80\xeb\xb6\xae\x86-\x1eI\xb0U.u"[b(s,256)%121]))

Theoretical limits

How well can we expect to do? Here is a plot of x, the number of colliding words, vs. y, the entropy in bytes required to get at most x colliding words. For example, the point (2835, 140) tells us that a random function gets at most 2835 colliding words with probability 1/256**140, so it’s exceedingly unlikely that we’ll ever be able to do much better than that with 140 bytes of code.

graph

Anders Kaseorg

Posted 2016-04-01T11:03:04.587

Reputation: 29 242

Nice analysis of the theoretical limits. In order to beat that theoretical limit one would probably have to use a language with builtin functions optimized for the dictionary in the question (which would be cheating). If the language has a builtin cryptographic hash, the limit can be turned into a more-or-less constructive method for finding an optimal solution. Consider this: $h(w || c) % 2^{24}$ where $c$ is a byte string constant. In a random oracle model that could be shown to get close to the optimum with high probability. Of course brute forcing the $c$ would not be feasible. – kasperd – 2016-04-09T18:52:49.940

How did you calculate the formula for the graph? Really interesting! – NikoNyrh – 2017-07-03T12:10:18.767

@NikoNyrh Dynamic programming. Let (w, c, h) represent a state with w words, of which c are colliding with h distinct hashes, and the remaining w − c all have distinct hashes. If we add a random word, the state becomes (w + 1, c, h) with probability 1 − (h + w − c) / 2^24, or (w + 1, c + 1, h) with probability h / 2^24, or (w + 1, c + 2, h + 1) with probability (w − c) / 2^24. Then the final entropy graphed with x colliding words is the log base 1/256 of the sum of probabilities at states (340275, c, h) with c ≤ x. – Anders Kaseorg – 2017-07-04T08:07:54.680
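Anders's recurrence can be sketched numerically; this is a toy-sized demo (hypothetical small word count and hash range, chosen so the state space stays tiny, not the real D = 340275 and R = 2^24):

```python
from collections import defaultdict

R = 2**10  # toy hash range
D = 200    # toy word count

# state (c, h): c colliding words sharing h distinct hash values;
# the other w - c words added so far all have unique hashes
states = {(0, 0): 1.0}
for w in range(D):
    nxt = defaultdict(float)
    for (c, h), p in states.items():
        unique = w - c       # words currently alone in a bucket
        p_coll = h / R       # lands on an already-colliding hash
        p_pair = unique / R  # pairs up with a previously unique word
        nxt[(c, h)] += p * (1 - p_coll - p_pair)
        nxt[(c + 1, h)] += p * p_coll
        nxt[(c + 2, h + 1)] += p * p_pair
    states = nxt

total = sum(states.values())                          # should be ~1
expected = sum(c * p for (c, _), p in states.items())  # expected colliding words
print(total, expected)
```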

I can’t believe no one has asked how you came up with the hash function? I would be very interested to know. – Anush – 2019-03-19T05:35:23.200


Python, 5333 4991

I believe this is the first contender to score significantly better than a random oracle.

def H(s):n=int(s.encode('hex'),16);return n%(8**8-ord('+%:5O![/5;QwrXsIf]\'k#!__u5O}nQ~{;/~{CutM;ItulA{uOk_7"ud-o?y<Cn~-`bl_Yb'[n%70]))

kasperd

Posted 2016-04-01T11:03:04.587

Reputation: 829

Sorcery! def H(s):n=int(s.encode('hex'),16);return n%... saves 5 bytes, in case you can use them somehow... – Dennis – 2016-04-01T21:56:52.063

@Dennis I could have used 5 bytes to make the string constant 5 bytes longer. But I would have to start over building the string constant from scratch if I change the length. And I am not sure those 5 bytes will give me enough improvement that it is worth starting over building the string. I have spent hours of CPU time optimizing the string constant already. – kasperd – 2016-04-01T22:02:29.313

@Dennis I guess a few extra bytes would give me the freedom to use some characters in the constant needing escaping. That way I could potentially use a few extra bytes without having to construct the string all over again. – kasperd – 2016-04-01T22:19:26.423

If you want another byte, 2**24 == 8**8. – Anders Kaseorg – 2016-04-02T04:23:14.290


Python 2, 140 bytes, 4266 colliding words

I didn’t really want to start with the non-printable bytes thing given their unclear tweetability, but well, I didn’t start it. :-P

00000000: efbb bf64 6566 2066 2873 293a 6e3d 696e  ...def f(s):n=in
00000010: 7428 732e 656e 636f 6465 2827 6865 7827  t(s.encode('hex'
00000020: 292c 3336 293b 7265 7475 726e 206e 2528  ),36);return n%(
00000030: 382a 2a38 2b31 2d32 3130 2a6f 7264 2827  8**8+1-210*ord('
00000040: 6f8e 474c 9f5a b49a 01ad c47f cf84 7b53  o.GL.Z........{S
00000050: 49ea c71b 29cb 929a a53b fc62 3afb e38e  I...)....;.b:...
00000060: e533 7360 982a 50a0 2a82 1f7d 768c 7877  .3s`.*P.*..}v.xw
00000070: d78a cb4f c5ef 9bdb 57b4 7745 3a07 8cb0  ...O....W.wE:...
00000080: 868f a927 5b6e 2536 375d 2929            ...'[n%67]))

Python 2, 140 printable bytes, 4662 4471 4362 colliding words

def f(s):n=int(s.encode('hex'),16);return n%(8**8+3-60*ord('4BZp%(jTvy"WTf.[Lbjk6,-[LVbSvF[Vtw2e,NsR?:VxC0h5%m}F5,%d7Kt5@SxSYX-=$N>'[n%71]))

Inspired by the form of kasperd’s solution, obviously—but with the important addition of an affine transformation on the modulus space, and entirely different parameters.

Anders Kaseorg

Posted 2016-04-01T11:03:04.587

Reputation: 29 242

+1 I am not giving up without a fight. But I think I have to stop optimizing my current solution and find another approach, because I am not going to beat you if I keep using my current approach to optimizing the parameters. I'll be back with an edit to my solution once I have beaten yours.... – kasperd – 2016-04-02T07:18:23.667

@kasperd: Awesome, bring it on. :-P – Anders Kaseorg – 2016-04-02T07:39:56.860

@AndersKaseorg How do you find the string? – ASCII-only – 2016-04-02T13:16:05.617

@AndersKaseorg I managed to speed up my parameter search a lot. And I removed an obstacle which was causing my search to get stuck on suboptimal solutions. But I still couldn't make it any better than 4885. After some pondering on why I couldn't make it any further I suddenly realized what is wrong with my solution and how it can be fixed. Now the affine transformation in your solution makes perfect sense to me. I think the only way I can possibly catch up is by using an affine transformation myself. – kasperd – 2016-04-02T16:52:35.033

@kasperd: Very nice. When searching for a better string in n%(8**8-ord('…'[n%70])) without other parameter changes, I had only managed to get to 4995, so it looks like your new optimizer has caught up to mine. Now this gets more interesting! – Anders Kaseorg – 2016-04-02T17:43:58.537

@AndersKaseorg I am crunching through the parameter space for a different affine transformation. You are still 47 points ahead of me, but I count on a few hours of CPU time changing that. – kasperd – 2016-04-02T18:03:09.203


CJam, 4125 3937 3791 3677

0000000: 7b 5f 39 62 31 31 30 25 5f 22 7d 13 25 77  {_9b110%_"}.%w
000000e: 77 5c 22 0c e1 f5 7b 83 45 85 c0 ed 08 10  w\"...{.E.....
000001c: d3 46 0c 5c 22 59 f8 da 7b f8 18 14 8e 4b  .F.\"Y..{....K
000002a: 3a c1 9e 97 f8 f2 5c 18 21 63 13 c8 d3 86  :.....\.!c....
0000038: 45 8e 64 33 61 50 96 c4 48 ea 54 3b b3 ab  E.d3aP..H.T;..
0000046: bc 90 bc 24 21 20 50 30 85 5f 7d 7d 59 2c  ...$! P0._}}Y,
0000054: 4a 67 88 c8 94 29 1a 1a 1a 0f 38 c5 8a 49  Jg...)....8..I
0000062: 9b 54 90 b3 bd 23 c6 ed 26 ad b6 79 89 6f  .T...#..&..y.o
0000070: bd 2f 44 6c f5 3f ae af 62 9b 22 3d 69 40  ./Dl.?..b."=i@
000007e: 62 31 35 32 35 31 39 25 31 31 30 2a 2b 7d  b152519%110*+}

This approach divides domain and codomain into 110 disjoint sets, and defines a slightly different hash function for each pair.

Scoring / Verification

$ echo $LANG
en_US
$ cat gen.cjam
"qN%{_9b110%_"
[125 19 37 119 119 34 12 225 245 123 131 69 133 192 237 8 16 211 70 12 34 89 248 218 123 248 24 20 142 75 58 193 158 151 248 242 92 24 33 99 19 200 211 134 69 142 100 51 97 80 150 196 72 234 84 59 179 171 188 144 188 36 33 32 80 48 133 95 125 125 89 44 74 103 136 200 148 41 26 26 26 15 56 197 138 73 155 84 144 179 189 35 198 237 38 173 182 121 137 111 189 47 68 108 245 63 174 175 98 155]
:c`"=i@b152519%110*+}%N*N"
$ cjam gen.cjam > test.cjam
$ cjam test.cjam < british-english-huge.txt | sort -n > temp
$ head -1 temp
8
$ tail -1 temp
16776899
$ all=$(wc -l < british-english-huge.txt)
$ unique=$(uniq -u < temp | wc -l)
$ echo $[all - unique]
3677

The following port to Python can be used with the official scoring snippet:

h=lambda s,b:len(s)and ord(s[-1])+b*h(s[:-1],b)

def H(s):
 p=h(s,9)%110
 return h(s,ord(
  '}\x13%ww"\x0c\xe1\xf5{\x83E\x85\xc0\xed\x08\x10\xd3F\x0c"Y\xf8\xda{\xf8\x18\x14\x8eK:\xc1\x9e\x97\xf8\xf2\\\x18!c\x13\xc8\xd3\x86E\x8ed3aP\x96\xc4H\xeaT;\xb3\xab\xbc\x90\xbc$! P0\x85_}}Y,Jg\x88\xc8\x94)\x1a\x1a\x1a\x0f8\xc5\x8aI\x9bT\x90\xb3\xbd#\xc6\xed&\xad\xb6y\x89o\xbd/Dl\xf5?\xae\xafb\x9b'
  [p]))%152519*110+p

Dennis

Posted 2016-04-01T11:03:04.587

Reputation: 196 637

I've ported my code to Python for easy verification. – Dennis – 2016-04-03T19:16:02.380

Does h in that Python port correspond to a CJam builtin? – kasperd – 2016-04-03T20:27:01.697

Yes. It's CJam's b (base conversion). – Dennis – 2016-04-03T21:46:59.937

Is your scoring process in bash? – GamrCorps – 2016-04-04T14:12:01.443

@GamrCorps Yes, that's Bash. – Dennis – 2016-04-04T14:19:33.870


Python, 6446 6372


This solution achieves lower collision count than all previous entries, and it needs only 44 of the 140 bytes allowed for code:

H=lambda s:int(s.encode('hex'),16)%16727401

kasperd

Posted 2016-04-01T11:03:04.587

Reputation: 829

@mbomb007 orlp's own submission does %(2**24-1), so I think it might be good to ask for clarification – Sp3000 – 2016-04-01T16:09:16.020

@mbomb007 The challenge says no such thing. It says the function must take an ASCII string as input and output an integer in that range. Regardless of which input you give my function, the output will be in that range. The mathematical definition of the word function does not require it to produce every allowed output. If that's what you wanted the mathematical term you'd be using was surjective function. But the word surjective was not used in the requirements. – kasperd – 2016-04-01T16:22:29.603

@mbomb007: There is no requirement that hash functions be surjective. For example, many memory-address-based hash functions can only produce multiples of some small power of 2 due to memory alignment, including the default object hash in older versions of Python. Many hash functions even have a smaller domain than codomain, so they couldn't be surjective anyway. – user2357112 supports Monica – 2016-04-01T21:12:33.177

@mbomb007 - In fact, given there are far more number values from [0, 2**24-1] than there are words in the English language, it would be mathematically impossible to make a hash where every single value in that range was possible. – Darrel Hoffman – 2016-04-03T18:38:28.480


JavaScript (ES6), 6389

The hash function (105 bytes):

s=>[...s.replace(/[A-Z]/g,a=>(b=a.toLowerCase())+b+b)].reduce((a,b)=>(a<<3)*28-a^b.charCodeAt(),0)<<8>>>8

The scoring function (NodeJS) (170 bytes):

h={},c=0,l=require('fs').readFileSync(process.argv[2],'utf8').split('\n').map(a=>h[b=F(a)]=-~h[b])
for(w of Object.getOwnPropertyNames(h)){c+=h[w]>1&&h[w]}
console.log(c)

Call as node hash.js dictionary.txt, where hash.js is the script, dictionary.txt is the dictionary text file (without the final newline), and F is defined as the hashing function.

Thanks Neil for shaving 9 bytes off the hashing function!

Mwr247

Posted 2016-04-01T11:03:04.587

Reputation: 3 494

Why the assignment to a? Also, instead of ((...)>>>0)%(1<<24) you can probably use (...)<<8>>>8. – Neil – 2016-04-01T15:02:45.627

@Neil Because alphabet, and I'm boring =P Also, nice bitwise mathing! That saved 7 bytes =) – Mwr247 – 2016-04-01T15:06:28.543

Good thing this isn't code golf, otherwise I'd have to ding you for the unused variable i as well. – Neil – 2016-04-01T15:07:42.230

@Neil Crap >_< I use that while testing some alternative hashing ideas and forgot to remove XD Yes, good thing this isn't a golf, though all the same, I'd love it if I could compress the hash and scoring functions into the same 140 bytes, so every bit helps ;) – Mwr247 – 2016-04-01T15:10:57.380

@Neil Same thing with my (a=...) deal (I thought you meant why I used a as my variable name, haha). Removed that as well, and now down to 57. – Mwr247 – 2016-04-01T15:12:31.450

@Sp3000 Gah, I see what you mean. Mine is not counting the ones that are in there initially when the collision is found. I'll fix that. – Mwr247 – 2016-04-01T15:24:15.590

Ok, I'm going to stop rushing this and get an actual fix out here. – Mwr247 – 2016-04-01T15:30:35.467

@Sp3000 How's this look? – Mwr247 – 2016-04-01T15:44:31.707

I can't check the JS since I don't have node on this computer, but I've got the same score from porting to Python. – Sp3000 – 2016-04-01T15:45:52.853

Let's take a crack at that scoring function... maybe if you use a=>h[b=F(a)]=-~h[b] and c+=h[w]>1&&h[w]? – Neil – 2016-04-01T16:07:04.270

By what appears to be pure coincidence, your latest edit got you the exact same score as the Mathematica answer.

– kasperd – 2016-04-01T16:54:32.697

@kasperd Indeed haha, I noticed that coincidence. I actually found a better one, but I'm spending a little more time searching before I update. – Mwr247 – 2016-04-01T17:12:10.007


CJam, 6273

{49f^245b16777213%}

XOR each character with 49, reduce the resulting string via x, y ↦ 245x + y, and take the residue modulo 16,777,213 (the largest 24-bit prime).
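In Python, the same function reads (an unofficial port, mirroring the description above):

```python
def H(s):
    n = 0
    for ch in s:
        n = n * 245 + (ord(ch) ^ 49)  # XOR with 49, then fold as base-245 digits
    return n % 16777213               # largest prime below 2^24
```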

Scoring

$ cat hash.cjam
qN% {49f^245b16777213%} %N*N
$ all=$(wc -l < british-english-huge.txt)
$ unique=$(cjam hash.cjam < british-english-huge.txt | sort | uniq -u | wc -l)
$ echo $[all - unique]
6273

Dennis

Posted 2016-04-01T11:03:04.587

Reputation: 196 637

I reimplemented the algorithm in python from your description. I can confirm that your score checks out with the official score computation. – kasperd – 2016-04-01T21:26:29.820


Python, 340053

A terrible score from a terrible algorithm; this answer exists mainly to provide a small Python script that demonstrates scoring.

H=lambda s:sum(map(ord, s))%(2**24)

To score:

hashes = []
with open("british-english-huge.txt") as f:
    for line in f:
        word = line.rstrip("\n")
        hashes.append(H(word))

from collections import Counter
print(sum(v for k, v in Counter(hashes).items() if v > 1))

orlp

Posted 2016-04-01T11:03:04.587

Reputation: 37 067

It could be useful to have the scoring code assert that the return value from the hash function is an integer in the permitted range. – kasperd – 2016-04-02T20:26:07.100


Mathematica, 6473

The next step up... instead of summing the character codes we treat them as the digits of a base-151 number, before taking them modulo 2^24.

hash[word_] := Mod[FromDigits[ToCharacterCode @ word, 151], 2^24]

Here is a short script to determine the number of collisions:

Total[Last /@ DeleteCases[Tally[hash /@ words], {_, 1}]]

I've just tried all bases systematically from 1 onwards, and so far base 151 yielded the fewest collisions. I'll try a few more to bring down the score a bit further, but the testing is a bit slow.
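For anyone without Mathematica, a Python equivalent (a sketch; FromDigits with base 151 is just a left-to-right fold over the character codes):

```python
def hash151(word):
    n = 0
    for code in map(ord, word):
        n = n * 151 + code  # character codes as base-151 digits
    return n % 2**24
```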

Martin Ender

Posted 2016-04-01T11:03:04.587

Reputation: 184 808


JavaScript (ES5), 6765

This is CRC24 shaved down to 140 Bytes. Could golf more but wanted to get my answer in :)

function(s){c=0xb704ce;i=0;while(s[i]){c^=(s.charCodeAt(i++)&255)<<16;for(j=0;j++<8;){c<<=1;if(c&0x1000000)c^=0x1864cfb}}return c&0xffffff}

Validator in node.js:

var col = new Array(16777215);
var n = 0;

var crc24_140 = 
function(s){c=0xb704ce;i=0;while(s[i]){c^=(s.charCodeAt(i++)&255)<<16;for(j=0;j++<8;){c<<=1;if(c&0x1000000)c^=0x1864cfb}}return c&0xffffff}

require('fs').readFileSync('./dict.txt','utf8').split('\n').map(function(s){ 
    var h = crc24_140(s);
    if (col[h]===1) {
        col[h]=2;
        n+=2;
    } else if (col[h]===2) {
        n++;
    } else {
        col[h]=1;
    }
});

console.log(n);

binarymax

Posted 2016-04-01T11:03:04.587

Reputation: 151

Welcome to Programming Puzzles & Code Golf! – Alex A. – 2016-04-01T19:09:17.093

...And thanks for the warm welcome @AlexA.! – binarymax – 2016-04-01T19:15:54.107


Python, 6390 6376 6359

H=lambda s:reduce(lambda a,x:a*178+ord(x),s,0)%(2**24-48)

May be considered a trivial modification to Martin Büttner's answer.

ASCII-only

Posted 2016-04-01T11:03:04.587

Reputation: 4 687

@mbomb007 That's not true. If your function always outputs 4 it's still outputting in the range [0, 2**24-1]. The only thing that's not allowed is outputting any number not inside that range, e.g. -1 or 2**24. – orlp – 2016-04-01T22:45:31.907


Python, 9310


Yeah, not the best, but at least it is something. As we say in crypto, never write your own hash function.

This is exactly 140 bytes long, as well.

F=lambda x,o=ord,m=map:int((int(''.join(m(lambda z:str(o(z)^o(x[-x.find(z)])^o(x[o(z)%len(x)])),x)))^(sum(m(int,m(o,x))))^o(x[-1]))%(2**24))

Mr Public

Posted 2016-04-01T11:03:04.587

Reputation: 669


Matlab, 30828 8620 6848

It builds the hash by assigning a prime number to each ASCII character/position combination and computing the product of those primes for each word, modulo the largest prime smaller than 2^24. Note that for testing I moved the call to primes out of the hash function and into the tester, directly before the while loop, and passed the prime list into the hash function; this sped things up by about a factor of 1000. The version below, however, works and is self-contained. It may crash on words longer than about 40 characters.

function h = H(s)
p = primes(1e6);
h = 1;
for i=1:length(s)
    h = mod(h*p(double(s(i))*i),16777213);
end
end
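A rough Python equivalent of the function above (a sketch for testing outside Matlab; the 1-based Matlab indexing into the prime list is translated with an explicit `- 1`):

```python
def primes_upto(limit):
    # simple sieve of Eratosthenes
    sieve = bytearray([1]) * (limit + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i, flag in enumerate(sieve) if flag]

P = primes_upto(10**6)  # plays the role of p = primes(1e6)

def H(s):
    h = 1
    for i, ch in enumerate(s, 1):              # Matlab's loop index starts at 1
        h = h * P[ord(ch) * i - 1] % 16777213  # p(double(s(i))*i) in Matlab
    return h
```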

Tester:

clc
clear variables
close all

file = fopen('british-english-huge.txt');
hashes = containers.Map('KeyType','uint64','ValueType','uint64');

words = 0;
p = primes(1e6);
while ~feof(file)
    words = words + 1;
    word = fgetl(file);
    hash = H(word,p);
    if hashes.isKey(hash)
        hashes(hash) = hashes(hash) + 1;
    else
        hashes(hash) = 1;
    end
end

collisions = 0;
for key=keys(hashes)

    if hashes(key{1})>1
        collisions = collisions + hashes(key{1});
    end
end

Godric Seer

Posted 2016-04-01T11:03:04.587

Reputation: 131

If you want to save space in your program, you don't have to convert your char to a double explicitly. Also you could use numel rather than length. Not sure what you'd do with all those extra bytes though! – Suever – 2016-04-03T00:53:22.033


tcl

88 bytes, 6448 / 3233 collisions

I see people have been counting either the number of colliding words, or else the number of words placed in nonempty buckets. I give both counts: the first is according to the problem specification, and the second is what several other posters have been reporting.

# 88 bytes, 6448 collisions, 3233 words in nonempty buckets

puts "[string length {proc H w {incr h;lmap c [split $w {}] {set h [expr (2551*$h+[scan $c %c])%2**24]};set h}}] bytes"

proc H w {incr h;lmap c [split $w {}] {set h [expr (2551*$h+[scan $c %c])%2**24]};set h}

# change 2551 above to:
#   7: 85 bytes, 25839 colliding words, 13876 words in nonempty buckets
#   97: 86 bytes, 6541 colliding words, 3283 words in nonempty buckets
#   829: 87 bytes, 6471 colliding words, 3251 words in nonempty buckets


# validation program

set f [open ~/Downloads/british-english-huge.txt r]
set words [split [read $f] \n]
close $f

set have {};                        # dictionary whose keys are hash codes seen
foreach w $words {
    if {$w eq {}} continue
    set h [H $w]
    dict incr have $h
}
set coll 0
dict for {- count} $have {
    if {$count > 1} {
        incr coll $count
    }
}
puts "found $coll collisions"

Kevin Kenny

Posted 2016-04-01T11:03:04.587

Reputation: 11

Where do you see answers using the incorrect method for computing the score? There have been a lot, but they were all corrected or deleted years ago. I see four answers remaining with scores less than 6000, because those four answers have actually been optimized to have such low scores. – kasperd – 2017-06-30T07:26:27.060

As far as I can tell, your code is proc H w {incr h;lmap c [split $w {}] {set h [expr (2551*$h+[scan $c %c])%2**24]};set h}...right? – Erik the Outgolfer – 2017-06-30T07:42:09.237

@EriktheOutgolfer: Yes, it is – sergiol – 2017-07-04T22:34:20.623

I second @kasperd: can you point out which answers are not counting collisions according to the question spec? Did you really try running them? – sergiol – 2017-07-04T22:39:00.263


JavaScript, 121 bytes, 3268 3250 3244 6354(3185) collisions

s=>{v=i=0;[...s].map(z=>{v=((((v*13)+(s.length-i)*7809064+i*380886)/2)^(z.charCodeAt(0)*266324))&16777215;i++});return v}

Parameters (13, 7809064, 380886, 2, 266324) are by trial and error.

I think this is still optimizable, and there is room for adding extra parameters; I am working on further optimization...

Verification

hashlist = [];
conflictlist = [];
for (x = 0; x < britain.length; x++) {
    hash = h(britain[x]);                      //britain is the 340725-entry array
    hashlist.push(hash);
}

conflict = 0; now_result = -1;
(sortedlist = sort(hashlist)).map(v => {
    if (v == now_result) {
        conflict++;
        conflictlist.push(v);
    }
    else
        now_result = v;
});

console.log(conflictlist);

var k = 0;
while (k < conflictlist.length) {
    if (k < conflictlist.length - 1 && conflictlist[k] == conflictlist[k+1])
        conflictlist.splice(k,1);
    else
        k++;
}

console.log(conflict + " " + (conflict+conflictlist.length));

3268 > 3250 - Changed the 3rd parameter from 380713 to 380560.

3250 > 3244 - Changed the 3rd parameter from 380560 to 380886.

3244 > 6354 - Changed the 2nd parameter from 7809143 to 7809064, and found I've used the wrong calculation method ;P

Shieru Asakoto

Posted 2016-04-01T11:03:04.587

Reputation: 4 445

1

Python 3, 89 bytes, 6534 hash collisions

def H(x):
 v=846811
 for y in x:
  v=(972023*v+330032^ord(y))%2**24
 return v%2**24

All the large magic numbers you see here are fudge constants.

Magenta

Posted 2016-04-01T11:03:04.587

Reputation: 1 322

1

Here are a few similar constructs, which are quite "seedable" and make incremental parameter optimization possible. Damn, it is difficult to get lower than 6k! Assuming the score has a mean of 6829 and a std of 118, I also calculated the likelihood of getting such low scores at random.
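For context on those probabilities: the score a random oracle would get has a closed form. Each of the n words collides unless none of the other n−1 words lands in its bucket, so the expected score is n·(1−(1−1/m)^(n−1)) with m = 2^24 buckets. A quick sketch (n = 340725 is the dictionary size quoted elsewhere on this page):

```python
n = 340725   # words in the scoring dictionary
m = 2 ** 24  # possible hash values

# Probability that at least one of the other n - 1 words shares a given hash.
p_collide = 1 - (1 - 1 / m) ** (n - 1)

# Expected number of colliding words for a random oracle.
expected_score = n * p_collide
print(round(expected_score))  # same ballpark as the ~6829 mean quoted above
```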

Clojure A, 6019, Pr = 1:299.5e9

 #(reduce(fn[r i](mod(+(* r 811)i)16777213))(map *(cycle(map int"~:XrBaXYOt3'tH-x^W?-5r:c+l*#*-dtR7WYxr(CZ,R6J7=~vk"))(map int %)))

Clojure B, 6021, Pr = 1:266.0e9

#(reduce(fn[r i](mod(+(* r 263)i)16777213))(map *(cycle(map int"i@%(J|IXt3&R5K'XOoa+Qk})w<!w[|3MJyZ!=HGzowQlN"))(map int %)(rest(range))))

Clojure C, 6148, Pr = 1:254.0e6

#(reduce(fn[r i](mod(+(* r 23)i)16777213))(map *(cycle(map int"ZtabAR%H|-KrykQn{]u9f:F}v#OI^so3$x54z2&gwX<S~"))(for[c %](bit-xor(int c)3))))

Clojure, 6431, Pr = 1:2.69e3 (something different)

#(mod(reduce bit-xor(map(fn[i[a b c]](bit-shift-left(* a b)(mod(+ i b c)19)))(range)(partition 3 1(map int(str"w"%"m")))))16776869)

This was my original ad hoc hash function; it has four tunable parameters.

NikoNyrh

Posted 2016-04-01T11:03:04.587

Reputation: 2 361

The trick to a low score is a string constant in which each character can be optimized independently without ruining the optimization you have done for the other characters. – kasperd – 2017-07-04T23:14:48.797

Yeah, I tried optimizing for only shorter strings first, as appending more characters to the "entropy" string doesn't affect those (once the multiplier of r is fixed). But still my search algorithm is essentially brute force, and I'm not sure if the initial choice of the multiplier of r is important or not. – NikoNyrh – 2017-07-05T08:12:32.647

Maybe just multiplying ASCII values does not bring enough entropy to the game. Many well-scoring algorithms seem to have the form f(n) % (8^8 - g(n)). – NikoNyrh – 2017-07-05T08:19:44.487

There is one answer which explains how it got as low as 3677. The ones scoring even lower than that have little explanation. – kasperd – 2017-07-05T19:27:03.617

1

Java 8, 7054 6467

This is inspired by (but not copied from) the builtin java.lang.String.hashCode function, so feel free to disallow according to rule #2.

w -> { return w.chars().reduce(53, (acc, c) -> Math.abs(acc * 79 + c)) % 16777216; };

To score:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TweetableHash {
    public static void main(String[] args) throws Exception {
        List<String> words = Files.readAllLines(Paths.get("british-english-huge.txt"));

        Function<String, Integer> hashFunc = w -> { return w.chars().reduce(53, (acc, c) -> Math.abs(acc * 79 + c)) % 16777216; };

        Map<Integer, Integer> hashes = new HashMap<>();
        for (String word : words) {
            int hash = hashFunc.apply(word);
            if (hash < 0 || hash >= 16777216) {
                throw new Exception("hash too long for word: " + word + " hash: " + hash);
            }

            Integer numOccurences = hashes.get(hash);
            if (numOccurences == null) {
                numOccurences = 0;
            }
            numOccurences++;

            hashes.put(hash, numOccurences);
        }

        int numCollisions = hashes.values().stream().filter(i -> i > 1).reduce(Integer::sum).get();
        System.out.println("num collisions: " + numCollisions);
    }
}

Bewusstsein

Posted 2016-04-01T11:03:04.587

Reputation: 111

@muddyfish could you check out the current version? I think I've accounted for 3-way collisions and am still getting the same result. – Bewusstsein – 2016-04-01T17:53:15.573

This doesn't account for three-way collisions. If you replace hashes with Map<Integer, Integer> hashes = new HashMap<>() and then count the number of words for each hash, you can account for them correctly. – Peter Taylor – 2016-04-01T19:09:48.333

Your scoring still looks incorrect. I think to compute a correct score, you have to output numHashes + numCollisions. (Which I guess will put you close to my 6832 estimate for a random oracle.) – kasperd – 2016-04-01T19:38:12.967

Modify the grading portions to this: http://pastebin.com/nLeg4qut – TheNumberOne – 2016-04-01T19:43:32.547

yeah, fixed the grading and it looks like a much more reasonable value now, ty – Bewusstsein – 2016-04-01T21:46:45.107

1

C#, 6251 6335

int H(String s){int h = 733;foreach (char c in s){h = (h * 533 + c);}return h & 0xFFFFFF;}

The constants 533 and 733 (earlier: 889 and 155) give the best score out of all of the ones I've searched so far.

bmm6o

Posted 2016-04-01T11:03:04.587

Reputation: 111

1

Ruby, 9309 collisions, 107 bytes

def hash(s);require'prime';p=Prime.first(70);(0...s.size).reduce(0){|a,i|a+=p[i]**(s[i].ord)}%(2**24-1);end 

Not a good contender, but I wanted to explore a different idea from other entries.

Assign the first n primes to the first n positions of the string, then sum all prime[i] ** (ascii code of string[i]), then mod 2**24-1.
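The same construction can be mirrored in Python. Since the result is reduced mod 2^24 − 1, each power can be computed with three-argument `pow` (modular exponentiation) without changing the value; the prime generator below is an illustrative helper, not part of the answer:

```python
M = 2 ** 24 - 1

def primes(n):
    # First n primes by trial division (plenty fast for n = 70).
    ps = []
    k = 2
    while len(ps) < n:
        if all(k % p for p in ps):
            ps.append(k)
        k += 1
    return ps

P = primes(70)

def h(s):
    # Sum of P[i] ** ord(s[i]) over the string, reduced mod 2**24 - 1;
    # pow(base, exp, M) keeps the intermediate numbers small.
    return sum(pow(P[i], ord(c), M) for i, c in enumerate(s)) % M
```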

jose_castro_arnaud

Posted 2016-04-01T11:03:04.587

Reputation: 229

1

Python, 6995 6862 6732

Just a simple RSA function. Pretty lame, but beats some answers.

M=0x5437b3a3b1
P=0x65204c34d
def H(s):
    n=0
    for i in range(len(s)):
        n+=pow(ord(s[i]),P,M)<<i
    return n%(8**8)

Blue

Posted 2016-04-01T11:03:04.587

Reputation: 1 986

1

C++: 7112 6694 6483 6479 6412 6339 collisions, 90 bytes

I implemented a naïve genetic algorithm for my coefficient array. I'll update this code as it finds better ones. :)

int h(const char*s){uint32_t t=0,p=0;while(*s)t="cJ~Z]q"[p++%6]*t+*s++;return t%16777213;}

Test function:

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main(void)
{
    std::map<int, int> shared;

    std::string s;
    while (std::cin >> s) {
        shared[h(s.c_str())]++;
    }

    int count = 0;
    for (auto c : shared) {
        if ((c.first & 0xFFFFFF) != c.first) { std::cerr << "invalid hash: " << c.first << std::endl; }
        if (c.second > 1) { count += c.second; }
    }

    std::cout << count << std::endl;
    return 0;
}
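A minimal mutation-based search in the spirit described above might look like the following Python sketch. The fitness loop and word list are assumptions, not the author's actual search code; the hash is a port of the C++ function with the 32-bit wraparound made explicit:

```python
import random

def h(s, coeffs):
    # Port of the C++ hash above: t = coeffs[p % 6] * t + byte (uint32
    # wraparound), then reduced mod 16777213.
    t = 0
    for p, ch in enumerate(s):
        t = (ord(coeffs[p % len(coeffs)]) * t + ord(ch)) % (1 << 32)
    return t % 16777213

def score(words, coeffs):
    counts = {}
    for w in words:
        k = h(w, coeffs)
        counts[k] = counts.get(k, 0) + 1
    return sum(c for c in counts.values() if c > 1)

def mutate(coeffs):
    # Replace one coefficient with a random printable ASCII character.
    i = random.randrange(len(coeffs))
    return coeffs[:i] + chr(random.randrange(33, 127)) + coeffs[i + 1:]

def search(words, start="cJ~Z]q", generations=200):
    best, best_score = start, score(words, start)
    for _ in range(generations):
        cand = mutate(best)
        s = score(words, cand)
        if s < best_score:
            best, best_score = cand, s
    return best, best_score
```

A real GA would keep a population and do crossover as well; this keeps only the single best individual per generation.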

fluffy

Posted 2016-04-01T11:03:04.587

Reputation: 725

0

tcl

91 bytes, 6508 6502 collisions

proc H s {lmap c [split $s ""] {incr h [expr [scan $c %c]*875**[incr i]]};expr $h&0xFFFFFF}

My computer is still searching for a base value that causes fewer collisions than 147 875 (which replaced 147), the record holder so far.

sergiol

Posted 2016-04-01T11:03:04.587

Reputation: 3 055

0

Ruby, 6473 collisions, 129 bytes

h=->(w){@p=@p||(2..999).select{|i|(2..i**0.5).select{|j|i%j==0}==[]};c=w.chars.reduce(1){|a,s|(a*@p[s.ord%92]+179)%((1<<24)-3)}}

The @p variable is filled with all the primes below 1000.

This converts ASCII values into prime numbers and takes their product modulo a large prime. The fudge factor of 179 deals with the fact that the original algorithm was for finding anagrams, where all words that are rearrangements of the same letters get the same hash. Adding the factor inside the loop gives anagrams distinct codes.

I could remove the **0.5 (sqrt test for prime) at the expense of poorer performance to shorten the code. I could even make the prime number finder execute in the loop to remove nine more characters, leaving 115 bytes.
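The anagram property is easy to demonstrate: with the additive constant removed, the hash degenerates to a pure product of the per-letter primes, which is order-independent. A small Python illustration mirroring the Ruby above:

```python
def primes_up_to(n):
    # All primes below n, by trial division.
    return [i for i in range(2, n)
            if all(i % j for j in range(2, int(i**0.5) + 1))]

P = primes_up_to(999)
M = (1 << 24) - 3

def h(word, fudge):
    a = 1
    for ch in word:
        a = (a * P[ord(ch) % 92] + fudge) % M
    return a

# fudge = 0: a pure product, so anagrams collide.
# fudge = 179: each step becomes order-sensitive, so they generally don't.
```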

To test, the following tries to find the best value for the fudge factor in the range 1 to 300. It assumes that the word file is in the /tmp directory:

h=->(w,y){
  @p=@p||(2..999).
    select{|i|(2..i**0.5). 
    select{|j|i%j==0}==[]};
  c=w.chars.reduce(1){|a,s|(a*@p[s.ord%92]+y)%((1<<24)-3)}
}

american_dictionary = "/usr/share/dict/words"
british_dictionary = "/tmp/british-english-huge.txt"
words = (IO.readlines british_dictionary).map{|word| word.chomp}.uniq
wordcount = words.size

fewest_collisions = 9999
(1..300).each do |y|
  whash = Hash.new(0)
  words.each do |w|
    code=h.call(w,y)
    whash[code] += 1
  end
  hashcount = whash.size
  collisions = whash.values.select{|count| count > 1}.inject(:+)
  if (collisions < fewest_collisions)
    puts "y = #{y}. #{collisions} Collisions. #{wordcount} Unique words. #{hashcount} Unique hash values"
    fewest_collisions = collisions
  end
end

Paul Chernoch

Posted 2016-04-01T11:03:04.587

Reputation: 249

1The score looks suspicious. Are you sure you are counting all the colliding words? Several previous answers mistakenly counted only one word for each colliding hash value. – kasperd – 2016-04-01T21:06:12.313

You may be right. I must consider how I counted and see if it is the same as your definition. I am counting how many words there are and subtracting how many unique hashcodes were generated. If words A and B get the same hashcode, is that one collision or two? I count it as one. – Paul Chernoch – 2016-04-01T21:08:20.750

1I didn't define the scoring function. I just copied it from the example answer posted by the same user who posted the challenge. Most answers have scores ranging between 6273 and 6848. There have been multiple answers each making the same mistake in the score computation leading to computing a score approximately half of what it should have been. (Exactly half the correct score if there are no cases of three colliding words.) – kasperd – 2016-04-01T21:16:31.223

1Yes, I made the same mistake. I will amend my answer later. Got to catch a bus. – Paul Chernoch – 2016-04-01T21:17:29.313

Fixed the scoring. – Paul Chernoch – 2016-04-02T00:44:11.580
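The mistake discussed in this thread is easy to reproduce on the challenge's own test case; a small Python sketch of both counting methods:

```python
def correct_score(hashes):
    # Count every word that shares its hash with another word.
    counts = {}
    for h in hashes:
        counts[h] = counts.get(h, 0) + 1
    return sum(c for c in counts.values() if c > 1)

def wrong_score(hashes):
    # Words minus unique hash values: undercounts, e.g. a pair counts as 1.
    return len(hashes) - len(set(hashes))

hs = [1234] * 4           # four copies of "duplicate" hashing identically
print(correct_score(hs))  # 4, as the challenge's test case requires
print(wrong_score(hs))    # 3 -- roughly half the correct score on real data
```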

Why @p and not just p? – Cyoce – 2017-06-30T17:46:59.373