47

A strong cryptographic hash makes collisions unlikely. Many cryptographic protocols build on that fact. But Git is using SHA-1 hashes as object identifiers. So there are a lot of already computed hashes out there in the public Git repositories of the web, along with details on how to reproduce them.

Is there some known attack on some protocol where this might be leveraged? Something like “well, I can do something evil if I replace this unknown plain text with some other plain text with the same SHA-1 hash, so instead of computing a collision I'll google for it.” Of course, the space of all hashes is still far from covered by Git commits, but nevertheless, I'd guess all the Git commits out there might amount to quite some CPU hours of computing SHA-1 hashes. I'm not sure whether that guess is justified, though.

As far as I can see, such an attack would only work if the hash is visible, the plain text from which it was generated is not, but some cypher text generated from is, and a different text can be encrypted as well. So this looks like it might apply to some public key based protocols, where you can encrypt but not decrypt. Furthermore, you don't have control over the colliding plain text, so obvious things like putting your own name as the beneficiary of some financial transaction won't work. Are there any scenarios where such a crowd-sourced hash collision could cause serious trouble with non-negligible probability?

Peter Mortensen
  • 877
  • 5
  • 10
MvG
  • 745
  • 5
  • 10

3 Answers3

49

Is Git crowdsourcing the production of SHA-1 preimages? Not to any meaningful degree.

Github doesn't say how many commits it's tracking, but it's probably not more than a few billion. For comparison, there are 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 possible SHA-1 hashes, so the odds of finding a plaintext matching an arbitrary hash of interest are effectively non-existent.

Mark
  • 34,390
  • 9
  • 85
  • 134
  • 12
    "Probably not more than a few billion" what's a few billion commits between friends? – corsiKa Dec 12 '14 at 20:37
  • 17
    To establish an absurdly optimistic upper bound: in 2011 Github had [2M repos](https://github.com/blog/841-those-are-some-big-numbers). Assuming this tripled each year, they'd be at 54M repos now. If each is as large as the [Linux kernel](https://github.com/torvalds/linux) (500k commits), that would be 2.7e13 commits. If GitHub held only 0.1% of the world's code, that would make 2.7e16 commits worldwide - which still pales to the ~1e48 possible hashes. No matter how you adjust those numbers, it won't come anywhere close to finding you a useful collision. – Chris Hayes Dec 13 '14 at 08:54
  • 4
    To add to this, it is nonsensical to attempt using a source file (which is typically 1-2 kilobytes) as password to log into a web page. Insofar, _even if_ Git crowdsources SHA-1 (or any other hash), and _even if, against all odds_ you do find a suitable preimage for a hash on Github or such, it is most likely noth worth a lot. – Damon Dec 13 '14 at 13:36
  • 6
    Aren't _salted_ hashes the recommended way of securely storing passwords? If I understand correctly, that would render Git's hashes even more meaningless. – F.X. Dec 13 '14 at 13:52
  • 2
    One of the design goals of SHA-1 was to make it so that brute-force attacks would be useless. Given that the Git community is not trying to attack SHA-1, it can be considered, at best, a brute-force attack. Brute-forcing SHA-1 is a heat-death-of-the-universe problem. – Cort Ammon Dec 13 '14 at 21:25
  • 1
    Birthday paradox does help: if we are just looking for a collision, it squares the efficiency. So e24 to get a collision 50-50 chance: or short more than e7.5 code. – Yakk Dec 14 '14 at 06:22
12

You could probably compute your own SHA1 hashes quicker from small arbitrary texts than that you harvest the hashes that someone else computed. But there's a lot of possible SHA1 digests, about as many as atoms in the world. That illustrates the challenge if you want to keep a list of all known digests and search that list.

rvdheij
  • 166
  • 5
  • 16
    I was going to say there are *many* more SHA-1 hashes than atoms, but I looked it up and actually it's the other way round – the estimated number of atoms in the universe is around 10^80 while the number of possible hashes is around 10^48. – ntoskrnl Dec 12 '14 at 17:32
  • 2
    To be fair, there are 10^80 atoms in the _visible_ universe. We don't know what may be beyond that. – dotancohen Dec 13 '14 at 19:20
  • The statement that there are "a few [SHA1 hashes] for each atom in the universe" is incorrect. In fact, the ratio of git commits to SHA1 hashes is almost as large as the ratio of SHA1 hashes to atoms! (~1e13 commits to ~1e48 hashes vs. 1e48 hashes to 1e80 atoms) – Kevin Dec 13 '14 at 23:10
  • The number of atoms in the solar system is a closer match (but still probably larger than 2^160). – hobbs Dec 14 '14 at 07:10
  • Thanks for the 1e80. I must have confused it with the atoms on earth and just corrected that. Either way, the irony in the response apparently wasn't that clear. – rvdheij Jan 02 '15 at 08:06
9

The amount of human effort which has gone into computing each of those SHA-1 hashes found in Git is significant. And that means the number of hashes computed that way is fairly limited.

If you want to find collisions, you need zero human effort per hash and very little computer time spend on each hash.

Bitcoin might be the only system with enough computing power to perform the 2^80 cryptographic operations needed to find a SHA-1 collision through brute force. Though most of this computing power is specialized hardware that only does SHA-2 and could not be repurposed to compute SHA-1 hashes.

It still provides an idea of the scale of deployment needed for such a brute-force attack. Bitcoin has proven that 2^80 cryptographic operations is doable. And for that reason alone, we should move to stronger hashes than SHA-1 ASAP.

Had bitcoin been based on SHA-1, a collision would have occurred already. And that would have been by brute force without even exploiting any weakness in SHA-1. That is because bitcoin does almost nothing but compute hashes all the time, and it has specialized hardware to do so.

The actual hash function used in bitcoin has a larger output than SHA-1, so there most likely hasn't been a collision. Additionally it would have required a different design to be able to find out if a collision had occurred because bitcoin as it exists today discards most of the hashes immediately, so even if a collision had occurred we wouldn't know about it.

Peter Mortensen
  • 877
  • 5
  • 10
kasperd
  • 5,402
  • 1
  • 19
  • 38
  • Haha, so we should reframe this. Question: is bitcoin crowdsourcing hashes? Answer: Yes, definitely. – Kzqai Dec 13 '14 at 19:10
  • 1
    @Kzqai, Bitcoin is crowdsourcing even fewer hashes: only one hash gets accepted into the blockchain every 10 minutes. The collective mining capacity may be computing a billion times as many hashes per second as Git has since the beginning of time, but virtually every one of those is discarded as soon as it's generated. – Mark Dec 14 '14 at 07:01
  • @Mark Good point, I forgot that all the extra hashes don't go public. – Kzqai Dec 28 '14 at 04:53
  • They probablly don't even go off the ASIC that calculated them. IMO when attacking hashes storage and indexing is probablly a bigger challange than the actual calculation. – Peter Green Sep 25 '20 at 22:07