11

I'm looking at the code of a particular web application that handles file uploads. For some reason, instead of using the cryptographic hash directly (SHA-256 in this case), they derive an ID from it and use that everywhere instead, to identify files uniquely.

The steps involved are as follows:

  • Calculate the SHA-256 sum of the file in question.
  • Take 3 hex characters at a time (fewer for the final group), treat them as a hex number, and convert it to its base62 representation (i.e. 0-9a-zA-Z => 0 - 61).
  • Concatenate these strings in that order to obtain the "ID".

For example:

hash (file) = 26ba0a896923d2de4cad532a3f05da725d9cc08d371eaf96905f5bbc1901b56f

26b  -------> 9Z
a0a  -------> Fs
896  -------> zs
923  -------> BJ
d2d  -------> Sp
e4c  -------> X2
ad5  -------> IJ
32a  -------> d4
3f0  -------> gg
5da  -------> oa
725  -------> tv
d9c  -------> Uc
c08  -------> NG
d37  -------> Sz
1ea  -------> 7U
f96  -------> 12m
905  -------> Bf
f5b  -------> 11p
bc1  -------> Mx
901  -------> Bb
b56  -------> KO
f    -------> f

ID = 9ZFszsBJSpX2IJd4ggoatvUcNGSz7U12mBf11pMxBbKOf
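Reconstructed in Python from the steps above (an illustrative sketch, not the application's actual code), the derivation looks like this:

```python
import string

# Digit alphabet implied by the example: 0-9, a-z, A-Z for values 0-61
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def to_base62(n: int) -> str:
    """Convert a non-negative integer to base62, without leading zeros."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def hash_to_id(hex_digest: str) -> str:
    """Split the hex digest into 3-character groups and base62-encode each."""
    groups = (hex_digest[i:i + 3] for i in range(0, len(hex_digest), 3))
    return "".join(to_base62(int(g, 16)) for g in groups)

# Reproduces the example above
digest = "26ba0a896923d2de4cad532a3f05da725d9cc08d371eaf96905f5bbc1901b56f"
assert hash_to_id(digest) == "9ZFszsBJSpX2IJd4ggoatvUcNGSz7U12mBf11pMxBbKOf"
```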

To me, this does not seem to be a safe way to truncate the hash at all. In particular, it looks to me that the probability of collisions increases this way.*

Do the above operations pose a problem, or do they not interfere with the cryptographic strength of SHA-256?

* The resistance properties of the SHA-2 functions may prevent an attacker from exploiting this. However, I'm just concerned about the premise of the scheme itself.

CodesInChaos
S. B.
  • 1
    Isn't this essentially what Base64 does? – RoraΖ May 27 '15 at 16:46
  • @raz It's similar in theory but neither identical nor without problems. apsillers's answer points out the case when it's not reversible. – Aron Foster May 27 '15 at 19:51
  • 1
    As long as it's reversible, you don't have a problem. (But apsillers has shown why it's not reversible here) – user253751 May 27 '15 at 22:19
  • 2
    It is indeed not reversible. However truncating from 256 bits to 224 bits is not reversible either, and that is what SHA224 does. My gut feeling says the particular transformation preserves more than 224 bits of the entropy, but the calculations to compute the actual amount of preserved entropy are not entirely obvious. Probably it is safer than SHA224, but I would still pick SHA224 if I was forced to chose between the two. – kasperd May 27 '15 at 22:25
  • Isn't this similar to a (very inefficient) zip file method? I can't remember the name of it off the top of my head. – Jon May 28 '15 at 01:43
  • @Chipperyman Are you thinking of [Lessiss-Moore](http://web.archive.org/web/20030207194302/http://lzip.sourceforge.net/faq.html#08)? Or did you have something serious in mind? – kasperd May 28 '15 at 07:12
  • I'd either use url-safe Base64 (which uses `_` and `-` in addition to alphanumeric ASCII) or Bitcoin's Base58, which avoids special characters and look-alike characters like `l` vs `I`. – CodesInChaos May 28 '15 at 08:07

5 Answers

29

This is almost a perfectly fine practice, but it has a bit of a flaw.

In general, a hash is just a numeric value, and you can express it in whatever base you please. For example, you could convert your hash to binary and express it as base64:

   2   6   b   a  ...
   |   |   |   |
0010011010111010  ...
      |      |
      T      u

However, the serious problem with your approach here is the clustering of the output. Three hex digits may transform into either one, two, or three base62 digits. There is no reliable way to decide how to cluster the base62 values. If you had leading zeros (i.e., you transformed three hex digits into three base62 digits) and/or used a larger base (e.g., three hex digits could map onto exactly two base128 digits with leading zeroes), you could avoid this problem.

To see a practical example of this, consider that hex f43 maps to base62 111 while 03f maps to base62 11. Consider the impossibility of distinguishing between the base62 forms of the following hashes:

f43f43f43f43f43f43f43f43f43f4303f03f03f03f03f03f03f03f03f03f9991
03f03f03f03f03f03f03f03f03f03ff43f43f43f43f43f43f43f43f43f439991
03ff4303ff4303ff4303ff4303ff4303ff4303ff4303ff4303ff4303ff439991

All of these hashes transform into

11111111111111111111111111111111111111111111111111DD1

There is no way to know which 1s are part of a three-character group and which are part of a two-character group. Obviously, this is an extreme example, but the problem arises any time a group has an ambiguous leading 1.
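Re-implementing the question's encoding (an illustrative sketch in Python) confirms that such inputs really are indistinguishable after encoding:

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def to_base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def hash_to_id(hex_digest: str) -> str:
    return "".join(to_base62(int(hex_digest[i:i + 3], 16))
                   for i in range(0, len(hex_digest), 3))

colliding = [
    "f43f43f43f43f43f43f43f43f43f4303f03f03f03f03f03f03f03f03f03f9991",
    "03f03f03f03f03f03f03f03f03f03ff43f43f43f43f43f43f43f43f43f439991",
    "03ff4303ff4303ff4303ff4303ff4303ff4303ff4303ff4303ff4303ff439991",
]
# Three distinct hashes collapse to a single ID
assert len({hash_to_id(h) for h in colliding}) == 1
```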

However, three- and one-digit output groups only happen for 314 of the 4096 possible values a group can take, and there will only be ambiguity for a fraction of those cases. A comment from Gilles, below, estimates that the resulting value preserves roughly 254 bits of strength:

As far as we know, the bits of a SHA-2 hash are independent. This truncation doesn't exactly strip bits, but it's close enough that it should be independent too. The non-uniqueness concerns only about lg(2¹²/62²) ≈ 0.1 bit per 3 hex digits, so the result should have roughly the strength of a 254-bit hash.

The loss of two bits is obviously not optimal, but it's far from a devastating loss.
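The leading-zero fix mentioned above can be sketched by padding every group to a fixed three base62 digits (which, as noted, gives up the length savings of the original scheme):

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def group_to_base62_padded(group: str, width: int = 3) -> str:
    """Encode one hex group as exactly `width` base62 digits, keeping
    leading zeros so adjacent groups can always be separated again."""
    n = int(group, 16)
    digits = []
    for _ in range(width):
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

# "03f" + "f43" and "f43" + "03f" no longer produce the same output:
assert group_to_base62_padded("03f") + group_to_base62_padded("f43") == "011111"
assert group_to_base62_padded("f43") + group_to_base62_padded("03f") == "111011"
```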

apsillers
  • What if 11 cannot be a two-character group, or even if 1 cannot begin a two-character group at all? – Random832 May 27 '15 at 19:40
  • 1
    Er, I had misunderstood the scenario, I thought it was an arbitrary mapping rather than exactly base62. However, using a leading zero only when the "tens" digit is 0 or 1, and not otherwise, would be sufficient. – Random832 May 27 '15 at 19:44
  • 1
    @Random832 We're not looking for a solution; the solution is to convert to binary and to then apply base 64, possibly with a file name safe alphabet. We're trying to determine if there is a chance at collisions, and - if possible - how high this chance is. – Maarten Bodewes May 27 '15 at 21:42
  • If you can use base 64, there's no need to convert to binary per se, since three hexadecimal digits is 12 bits which is exactly two base-64 characters. Presumably there was some reason why only alphanumeric characters were wanted in the file name, but if so you need a more complicated encoding (e.g., actually convert the entire number to base62) to make it reversible. – Harry Johnston May 27 '15 at 22:00
  • 2
    You're correct in your description of collisions. However, your wording implies that the collisions are a problem, but in fact they don't significantly weaken the security of the hash. As far as we know, the bits of a SHA-2 hash are independent. This truncation doesn't exactly strip bits, but it's close enough that it should be independent too. The non-uniqueness concerns only about lg(2¹²/62²) ≈ 0.1 bit per 3 hex digits, so the result should have roughly the strength of a 254-bit hash. That's not a cause for concern. – Gilles 'SO- stop being evil' May 28 '15 at 07:15
  • @Gilles An excellent point. I hope my edit does your comment justice? – apsillers May 28 '15 at 13:29
12

From what I can see, this isn't truncation at all. Each 12-bit section (3 ASCII hex characters) is converted to its equivalent base62 representation, which is a bijective operation. You can take the values on the right and turn them back into the values on the left.

The operation doesn't truncate the value, but rather reduces its resultant length by using a more efficient encoding, just like computing the base64 value of the raw hash bytes would.

Polynomial
  • 6
    If this were a bijective transformation, this would be perfectly correct; unfortunately, it is not. Three hex digits can express 4096 values, while two base62 digits can only express 3906 values, which sometimes requires the scheme to use *three* base62 digits. If this scheme had used leading zeroes, it could work (but I suppose that would defeat the entire point, mapping 3 digits onto three digits). Note the three-digit `11p` and `12m` output groups in the question's example. – apsillers May 27 '15 at 19:19
  • @apsillers It's still possible that the overall transformation is bijective (or injective, depending on how you define the codomain) - if the initial pairs of the three-digit groups cannot appear as two-digit groups, for example. – Random832 May 27 '15 at 19:39
  • @apsillers My understanding is that they're not truncating each block, but rather concatenating all of that data together, such that there's no loss of information even if you can't trivially identify which ones were 3-char and which ones were 2-char. – Polynomial May 27 '15 at 20:05
  • I think that the answer of apsillers already shows that there can be loss of information. It would be tricky to calculate how much though. – Maarten Bodewes May 27 '15 at 20:11
  • @Polynomial `03ff4303ff43...03ff439991` and `f4303ff4303f...f4303f9991` both map to the same output value (`1111111111...1111111DD1`). There are lots of ways to modify the scheme so it doesn't cause collisions (I have suggested a few trivial ones), but as presented in the question, values can collide. – apsillers May 27 '15 at 20:12
  • That would only be true if ONLY two base62 characters were put into the string regardless of what the three base16 characters encode to. I don't think that's the case though - I think all 3 characters are added and the string gets longer. – Polynomial May 27 '15 at 20:12
  • 2
    @Polynomial I'm not suggesting that the extra digit is dropped; I agree that it does remain, and that that causees a problem. If the groups `03f` and `f43` are adjacent in the input (`11` and `111` in the output), it is not possible to use the five character `11111` in the output to determine which group came first in the input. To restate a much shorter version of my previous example, the output from `03ff43` and `f4303f` are indistinguishable. – apsillers May 27 '15 at 20:16
  • 1
    @apsillers Ah, I see what you're saying - there's a potential for collision between a pair of 3-char and 2-char base62 units where the encoding could be 3-then-2 or 2-then-3. – Polynomial May 27 '15 at 20:23
  • @Polynomial And that's just the start, there's also the single character encoding... – Maarten Bodewes May 27 '15 at 21:24
3

"Truncate" means to remove a portion altogether. In this example, if I truncated the right half of the hash characters, the remainder would look like this: 26ba0a896923d2de4cad532a3f05da72

So yes, truncation will increase your collisions, but that's not what is happening here.

user77432
  • "Truncating" often means "cut off the end", but I think the definition that the OP meant is the more abstract one: "make a fixed size, losing information if necessary". By the second definition, `N (mod p)` counts as a truncation of `N`. There are many other non-obvious forms of truncation (in fact, any hash function counts). And yes, truncation in any form always increases the number of collisions. – Mike Ounsworth May 27 '15 at 19:12
0

If the length of the hex representation of the hash is unacceptable, and one wants to represent hashes uniquely in a shorter string using a limited character set, using base-64 rather than base-62 would allow a nice easy mapping (even if one has to replace . and / with different characters). If there are only 62 acceptable characters, one could subdivide the data into 64-bit chunks and use 11 base-62 characters to store each, for a fixed total length of 44, only one character more than the optimal fixed-length encoding using 43 characters (your encoding would sometimes use 43 characters, but would sometimes require more, and wouldn't be unique). Encoding 64 bits in base-62 should be reasonably easy on any platform that has a 64-bit unsigned integer type; on platforms that don't, one could encode 53 bits as 11 base-31 characters and add one of the remaining 11 bits to each of the base-31 characters to yield a base-32 character.
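A sketch of the 64-bit-chunk approach described above (my reconstruction, with hypothetical helper names, assuming a platform with big integers such as Python):

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode64(n: int, width: int = 11) -> str:
    """Encode a 64-bit integer as exactly 11 base62 digits (62**11 > 2**64)."""
    digits = []
    for _ in range(width):
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def fixed_length_id(data: bytes) -> str:
    """256-bit SHA-256 digest -> four 64-bit chunks -> 44 base62 characters."""
    digest = hashlib.sha256(data).digest()
    return "".join(encode64(int.from_bytes(digest[i:i + 8], "big"))
                   for i in range(0, 32, 8))
```

Because every chunk always occupies exactly 11 characters, the encoding is fixed-length and trivially reversible back to the digest.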

supercat
0

I don't think there is enough information to give a really good answer. The possible 'weakness' with this approach is that by reducing the representation length, you have increased the chance of collision: two files with different hashes may end up with the same transformed code. However, this may not be an issue; depending on the application, the risk of collision may be of less concern than the need to reduce representation length. There really isn't enough info to judge.

Having said that, on the face of it, it seems difficult to justify the increased potential for collision given the minimal saving in representation length, especially since, if the length of the representation is the problem, you would assume there must be a lot of these hashes to store, which makes collision more likely. Then again, perhaps it is all just about the efficiency gained by comparing shorter signatures in a setting where collisions are not a concern.

Tim X