
I'm building a URL-shortening tool. For an arbitrary link, I need to produce a fixed-length slug that will index the full URL in a database. For prettiness, I'd like to keep the slug reasonably short (8 alphanumeric characters seems reasonable).

It seems obvious to me that hashing the URL and rendering the output as hex is an easy way of generating such a slug without having to worry about collisions. I am not expecting more than a million links to ever be posted (it's not a public tool), so I don't need anything like the collision resistance a normal hashing algorithm would provide. Unfortunately, hash values tend to be rather long - even MD5 uses 32 hex characters (I also don't know how to square this with the fact that it produces a 128-bit value, and 16^32 is much bigger than that).

Suppose I took some collision-resistant hash function like SHA-512, took its 128-character hex output:

ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f

and truncate it to just eight:

ddaf35a1
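For concreteness, here is a minimal Python sketch of that truncation (the function name `slug` and the default length are just for illustration):

```python
import hashlib

def slug(url: str, length: int = 8) -> str:
    """Hash a URL with SHA-512 and truncate the hex digest to a short slug."""
    return hashlib.sha512(url.encode("utf-8")).hexdigest()[:length]

print(slug("abc"))  # ddaf35a1
```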

I have two (probably naive) questions:

  1. Why does the 512 bit digest take 128 hex characters and not log_{16}(512)? Or, the other way round, can I cram 16^8 bits of entropy into eight hex characters (or any short alphanumeric string)?
  2. Assuming 1. is something obvious I don't see, does truncating a 128 character hash to 8 characters behave "like an 8-character hash function"? In other words, other than accounting for the reduced size of the hash space, are collisions more likely than you would expect from a hash function with a digest of that length?
  • *"Why does the 512 bit digest take 128 hex characters"* - I have the feeling that you don't understand what hex means. 1 hex character represents 4 bits (2^4 = 16), i.e. 2 hex characters represent 8 bits (a byte) and 128 hex characters represent 128*4 = 512 bits. *"can I cram 16^8 bits of entropy into eight hex characters"* - 16^8 = 2^32, i.e. 32 bits. 8 hex characters are 8*4 = 32 bits. So - yes. – Steffen Ullrich Apr 14 '21 at 16:58
  • @SteffenUllrich That's a great way to explain it. I've seen this explained many different ways, but that's probably the most concise and succinct explanation I've ever seen. – mti2935 Apr 14 '21 at 17:01
  • @SteffenUllrich I now remember there are 2^512 512-bit numbers, not 512 of them. Oops :( – Daniel Littlewood Apr 14 '21 at 17:02
  • *"... does truncating a 128 character hash to 8 characters behave 'like an 8-character hash function'?"* - Yes. Does this answer your question? [How bad is it to truncate a hash?](https://security.stackexchange.com/questions/72673/how-bad-is-it-to-truncate-a-hash), [Secure way to shorten a hash](https://security.stackexchange.com/questions/97377/secure-way-to-shorten-a-hash). – Steffen Ullrich Apr 14 '21 at 17:03
  • @SteffenUllrich Yes, it certainly answers the on-topic part of my question. Thank you! My earlier misunderstanding might imply I need to do something different (and it certainly looks like conventional URL shorteners do). – Daniel Littlewood Apr 14 '21 at 17:10
  • Note that according to the birthday principle, if you have a 32-bit hash, then you have a 50% chance of generating at least one collision after you generate 2^16 hashes, i.e. 65536 hashes. – user253751 Apr 14 '21 at 17:35
  • Have you considered encoding using "base64url" for the hash? That way you can get 48 bits of entropy in 8 characters instead of 32. Which means that by the birthday principle, you can generate more than 16 million hashes before you have 50% chance of a collision. – nobody Apr 15 '21 at 07:20
  • @nobody Thank you for that idea. I'm currently instead trying out an enumeration of short alphanumeric strings. So naively I'd just encode a primary key to get 0..9,a,...z,A,...Z,10,... and so on. If you do some simple modular arithmetic it's not hard to get something pretty and guarantee uniqueness up to 62^5 URLs with only 5 characters, and it's easy to extend if necessary. – Daniel Littlewood Apr 15 '21 at 10:19
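The base-62 enumeration described in the last comment might be sketched as follows (the alphabet ordering - digits, then lowercase, then uppercase - is an assumption based on the `0..9,a,...z,A,...Z` sequence given there):

```python
# Assumed ordering: digits, lowercase, then uppercase (62 symbols total).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Encode a non-negative integer (e.g. a primary key) as a base-62 slug.

    Distinct keys always produce distinct slugs, so uniqueness is guaranteed
    without any collision handling; 5 characters cover 62^5 (~916 million) URLs.
    """
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

print(encode_base62(0))   # 0
print(encode_base62(61))  # Z
print(encode_base62(62))  # 10
```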

0 Answers