12

I was reading this article which talks about a design to shorten URLs, and in the design section, it says that the given URL can be hashed using a hashing algorithm such as MD5, and then be encoded for display purposes using base64 or similar encoding.

I am confused so as to why would we need to encode a hashed string. Is it not safe to display an MD5 hash, or is there any other benefit? Can anyone shed some light on this? I tried searching online, and a lot of pages talk about encryption, etc but not about the above scenario.

blahdiblah
  • 113
  • 3
Ufder
  • 223
  • 2
  • 5
  • 30
    You should properly distinguish between encrypt, encode and hash and only use them where appropriate, e.g. "encrypt" is ill-suited for both base64 (encode) and md5 (hash). – luk2302 May 06 '21 at 08:50
  • 2
    See RFC-4648. Avoid the ordinary Base-64 for URL's. – MSalters May 06 '21 at 08:51
  • 7
    The entire approach the article uses with MD5 and Base64 is completely unnecessary. The shortened URL need only be unique in that URL shortener database. It could just as easily start with "**A**" and work it's way upwards with each new URL submission. Just using capital letters and numbers alone would support 36 single character URLs, and 36² two character URLs, etc. – user10216038 May 06 '21 at 21:42
  • @user10216038 Sure, you could do it that way, but URLs would be longer. Base64 is nice that it is rather short and the existing algorithms are optimized for byte boundaries. As for just counting up instead of hashing: that assumes that only a single server is handling everything. – trlkly May 08 '21 at 08:23
  • @trikly: (counting 'single') not really; as long as they're online (and a URL shortener must be) it's easy to have a protocol that hands out subspaces to be enumerated -- things like SETI@home have done this for a long time, and bitcoin mining pools do it at scale now. Although it does expose when the assignment was done, like US Social Security Numbers exposed people's approximate date and place of birth (which are too-often used as 'security' questions) from 1988 until recently. – dave_thompson_085 May 09 '21 at 03:23
  • @trikly - *"... that assumes that only a single server is handling everything ..."*. Nothing of the kind. As I stated, The shortened URL need only be unique in that URL shortener database. Unlike general DNS, there is no need for global uniqueness. – user10216038 May 09 '21 at 16:30

5 Answers5

45

Hash functions output binary data, usually as a byte array. This cannot be displayed correctly, therefore, you need encoding.

Transmitting binary data can create problems, especially in protocols that are designed to deal with textual data. To avoid it altogether, we don't transmit binary data. Many of the programming errors related to encryption on Stack Overflow are due to sending binary data over text-based protocols. Most of the time this works, but occasionally it fails and the coders wonder about the problem. The binary data corrupts the network protocol.

Therefore hex, base64, or similar encodings are necessary to mitigate this. Base64 is not totally URL safe and one can make it URL safe with a little work.

In another way, character encodings are reversible and don't require encryption keys. This has nothing to do with security; it is about visibility and interoperability.

nobody
  • 11,251
  • 1
  • 41
  • 60
kelalaka
  • 5,409
  • 4
  • 24
  • 47
  • 6
    This makes sense, and now I understand the design. So basically the output of MD5 itself is binary and I need some form of encoding to see it in a non-binary format or to transmit it elsewhere. Thank you! – Ufder May 05 '21 at 19:28
  • 5
    Really? Sending binary over a binary protocol is safe. But there are a gazillion different "text" formats, and those break all the times. Even with ASCII - who hasn't seen a `&` in the wild? The question even contains a **known bad mix** of two text formats: only 62 of the 64 characters in Base64 are URL-safe. – MSalters May 06 '21 at 08:50
  • @MSalters you are right, it was too general. Updated, could you see any problem now ? – kelalaka May 06 '21 at 09:49
  • @kelalaka: Looks good. It may sound like nitpicking, but these sort of things are known causes of security bugs. An attacker may be able to force an illegal character in a URL for instance, if you use Base-64. If this causes a security check to mis-parse that URL, it could incorrectly apply permissions. – MSalters May 06 '21 at 10:16
  • How does binary data corrupt the network protocol? Binary data gets sent all the time through the network without any problems. – The Red Fox May 06 '21 at 10:31
  • 1
    @TheRedFox if you know that you are only transferring in the protocols where sending binary is safe then no problem. What if the data goes through multiple protocols that your party not totally aware of. I can find you many error reporting in [so] with cryptography/encryption tag that people complaining errors that only caused by the non-existing of proper encoding. That is the core of my point. – kelalaka May 06 '21 at 10:50
  • 3
    @TheRedFox: Binary data goes over binary network protocols without issue. But many network protocols are text-based, and if you put binary data into a text-based protocol, then invariably something will break. A program trying to parse the data will read it as corrupt text and fail, or something will try to switch encodings, corrupting the binary data, or sometimes the binary data happens to contain a text delimiter, or sometimes something else. – Mooing Duck May 06 '21 at 16:22
14

What you see as the MD5 hash is the hex encoded version of it. The hash itself is binary, but we usually don't like to see binary data on screen. Another way to display the hash is using Base64, so all chars are printable.

Encrypted strings are binary too, so you will want to encode them as base64 too. Or if you want a vintage encoding, try uuencode.

ThoriumBR
  • 50,648
  • 13
  • 127
  • 142
  • 1
    The main advantage of Base64 in comparison to hex is that Base64 is a great deal shorter, especially for large amounts of data. – Kevin May 06 '21 at 18:18
  • 1
    @Kevin In that case just use gzip or other compression before the encoding and it'll be even shorter. :) – KeyWeeUsr May 08 '21 at 10:55
  • 2
    @KeyWeeUsr: MD5 output (or the output of any "reasonable" hashing function) should be statistically indistinguishable from random noise and therefore incompressible. – Kevin May 09 '21 at 00:52
  • @Kevin yup, just tested... my bad – KeyWeeUsr May 09 '21 at 05:35
  • And while gzipping _after_ MD5 does reduce the size a lot, it sort of defeats the purpose ;) – MSalters May 10 '21 at 07:00
5

While the other answers contain plenty of good information, I think they missed the original point. This is a URL shortener. So Length is a critical factor. An MD5 hash gives you 128 1's and 0's. Naively using those 1's and 0's in your url makes it unnecessarily long. Encoding those 1's and 0's with base 64 considerably shortens the number of characters needed to represent the hash.

In the article they even state how it takes only 6 characters of base 64 to yield 68.7 billion possibilities.

It would take 36 characters of binary to achieve the same number of possibilities.

noslenkwah
  • 273
  • 1
  • 2
  • 8
  • 1
    Yes, I put together that part based on the above answers. Thanks for clarifying it! – Ufder May 06 '21 at 14:04
  • 3
    To be fair, the shortest way is to not encode it at all. I believe your example is encoding via hex? – not my real name May 06 '21 at 16:15
  • @RayWu - The example is not mine but the one the article gives. They claim B64 and not hex. I added the 36 char of binary example myself, ceiling(log2(68.7B)). Not encoding it at all is the worst case scenario. It would mean your URL is just 1's and 0's. Maybe I misunderstood what you meant? – noslenkwah May 06 '21 at 16:32
  • 3
    No, not encoding at all does not mean your url is ones and zeros, that is encoding as a string of binary digits. It means you do not have a URL due to non-url-safe characters – not my real name May 06 '21 at 17:21
  • @noslenkwah you are forgetting that each base64 character needs 8 bits (1 byte). So 6 base64 character = 48 bits which is more than the 36 needed for the unencoded binary. – puhlen May 06 '21 at 18:22
  • @puhlen - The number of bits your system memory requires to represent the hash is a red herring in the context of a URL shortener. All that matters is how many characters are used in the URL. Encoding the hash is a way to achieve that. The author of the article chose to use base 64. – noslenkwah May 06 '21 at 19:11
  • @RayWu - What would a shortened URL look like that incorporates a non-encoded MD5 hash? – noslenkwah May 06 '21 at 19:13
  • @noslenkwah I mean answer says "Encoding the MD5 hash with b64 considerably shortens the number of characters." to which I say "shorter than what"? – not my real name May 06 '21 at 19:51
  • @RayWu - Shorter than using 128 1's and 0's in the url. I've amended the answer to make that clear. – noslenkwah May 06 '21 at 20:21
  • 2
    Neither I nor the OP have ever mentioned encoding using a string of binary digits, and I don't think this was implied – not my real name May 06 '21 at 20:22
  • @RayWu - You claimed "the shortest way is to not encode it at all". If that doesn't mean using 1's and 0's in your URL then please explain what that means. – noslenkwah May 06 '21 at 20:29
  • I mean not being in the context of a url shortener, just sending the raw bytes would be shortest. A hash isn't "128 1's and 0's" in the sense of being a string of digits – not my real name May 06 '21 at 20:31
  • 4
    The binary MD5 requires only 16 bytes, but can't be used as a URL, due to encoding protocols. Where binary stores 256 possible states per byte, base 64 encoding stores only 64 possible states per byte. Hex encoding provides only 16 possible states per byte. Nobody would ever use a base 2 encoding of ASCII 1's and 0's for any practical application. – jwdonahue May 06 '21 at 21:50
  • @jwdonahue You are exactly saying what I wanted to say – not my real name May 07 '21 at 15:54
1

Is it unsafe to display unencoded?

The problem is not safety, at least in the way you're asking. The problem is the actual result of a hashing algorithm is binary. Md5 produces 128 bits (16 bytes) of data, but these are not ascii (or even Unicode) bytes. If treated as text, the result is very likely to include characters that are not printable, and therefore you must have some kind of encoding scheme to map the hash result into space of printable characters.

Now there is still a safety issue in play. If you did try to just print out the raw bits as if they were characters there's a chance the bit pattern could produces unwanted (and unsafe) results. But really it's about communicating the pattern in an accurate (and recoverable/reproducible) way.

Joel Coehoorn
  • 2,116
  • 1
  • 13
  • 14
1

A hash is an algorithm that computes a mostly-evenly distributed, fixed-sized output from a variably-sized input.

A digest is the output of the hash function, which is a series of bytes.

Bytes cannot be displayed or used in text interface, but they can be encoded into text.

Base64 is one such text encoding of bytes into ASCII characters.

In particular, the articles describes a URL, encode the text into bytes, hash those bytes, then encode the digest into text in a shortened URL.


On a side note, the approach in the article is very weird.

There is no given reason to hash the URL in the first place. In fact, the articles discusses adding a sequence number to undo the determinism of the hash.

A random number generator would work just as well if not better for the stated purposes. The article author has some inexplicable compulsion to use a hash.

Paul Draper
  • 958
  • 8
  • 18
  • Yeah, exactly my thoughts. I would imagine storing and retrieving urls while avoiding collisions shouldn’t be this complex! – Ufder May 08 '21 at 04:45
  • Hashes are a pretty standard way to reduce a directory of arbitrary length strings, into strings of a known fixed length. If you didn't hash first, you couldn't control or limit the length of the final.URL. For example a URL that's 200 unicode or ascii characters long, could be hashed - and the hash is far fewer bits, hence the final base 64 is also very short. – Stilez May 09 '21 at 06:35