Can a file be maliciously changed in a way that maintains its original SHA-1 Hash?

According to this article, and many others, SHA-1 is not secure.

In my case, I am not concerned about passwords or digital certificates. I am concerned about file integrity.

Is it reasonably possible for a file (e.g. an ISO image or executable file) to be maliciously altered in a way that:

Maintains the original file's SHA-1 hash, and
Maintains the file's overall content and operation (but of course now includes malicious content that was not originally there)

The way I see it, altering a file in a way that yields a SHA-1 collision would render the file totally useless. The ISO would be totally corrupt, or the executable file would be so utterly scrambled it wouldn't even be an executable file anymore.

But, the way I see it could well be wrong. So far I have found nothing on Google searches in regards to SHA-1's continued suitability for file verification. Any insights?

misha256

Posted 2015-03-15T22:21:16.377

Reputation: 10 292

7The answer is "it depends". If the ISO happened to contain lots of jpegs or movie files - along with your target executable, then it is possible. You can modify jpeg files quite dramatically without altering their size or visual appearance. Ultimately, the larger the file, the more you have to play with, and the better the chance of a non destructive collision. – Paul – 2015-03-15T22:44:07.523

How are you getting the hash value? From the download page, or some other way? – cpast – 2015-03-16T03:27:24.937

7@cpast exactly, many websites list SHA-1 hashes to allow you to verify your download. Thinking about it, it seems far more likely that a hacker would compromise a website by altering the content and the published hash. Then you're really screwed. – misha256 – 2015-03-16T04:05:54.937

1Just fyi, my question asks about SHA-1 specifically because it's quite common, especially with downloads from Microsoft/MSDN. Of course some websites publish MD5 hashes, others SHA256, etc. – misha256 – 2015-03-16T04:13:02.243

2The question is, why would you want to use a hash that has any known vulnerabilities, when there are alternatives that are just as fast, easy to use, and widely available that don't (eg. SHA-256)? Additionally, there's a reason cryptographers declare a hash insecure after only one vulnerability is found: history has shown that when one is found, others quickly follow. The famous Bruce Schneier quote is "Attacks always get better, they never get worse" – BlueRaja - Danny Pflughoeft – 2015-03-16T04:31:52.473

It's important to rememeber that if attacker has access to the server, he can replace both the file and the checksum.

– gronostaj – 2015-03-16T15:59:52.833

3@misha256 Those sha1 hashes are for you to check for download corruption, not for security. If you want security then use gpg signed files – Daenyth – 2015-03-17T15:13:43.747

Answers

Nobody has yet accomplished this for SHA-1. It is possible in theory, but still not practical. The reports about insecurity in SHA-1 just mean that the security level is not as high as we would like it to be and that means we don't have as many years before we have to worry about this as we thought we did.

It is harder to produce a file with the same SHA-1 hash as a given file than it is to craft two files yourself with the same SHA-1 hash. And as far as we know, nobody anywhere in the world has yet accomplished even this easier task. That doesn't mean it can't happen tomorrow though.

David Schwartz

Posted 2015-03-15T22:21:16.377

Reputation: 58 310

Is there even a known attack on SHA-1 for collisions with a given file? I was under the impression that that attack hasn't been found for either MD5 or SHA-1 (there's just a collision attack, not a second-preimage attack) – cpast – 2015-03-16T01:25:05.230

@cpast the Flame malware used an MD5 collision to appear to be from Microsoft and hijack Windows Update. They might have had a bunch of Microsoft certs to choose from, but they weren't just trying to find any 2 files with the same MD5. – Aron Foster – 2015-03-16T15:19:28.780

2@Aron No, that was not an example of a collision with a given file. With Flame, Microsoft had a licensing server that would sign X.509 certificates according to a certificate signing request, which means the attacker controls what is being signed within some limits. There was no preexisting certificate they found a collision with; Microsoft signed CSRs from customers as part of activation, which allows use of a collision attack (which is not a second-preimage attack). – cpast – 2015-03-16T15:48:05.520

@cpast You're right. Thanks for the clarification! – Aron Foster – 2015-03-16T16:28:50.080

According to the wiki for sha-1 (http://en.wikipedia.org/wiki/SHA-1): As of 2012, the most efficient attack against SHA-1 is considered to be the one by Marc Stevens[34] with an estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers.[35] ( ... ). On 8 November 2010, he claimed he had a fully working near-collision attack against full SHA-1 working with an estimated complexity equivalent to 2^57.5 SHA-1 compressions. He estimates this attack can be extended to a full collision with a complexity around 2^61. . So not exactly "never done", but quite costly

– Olivier Dulac – 2015-03-16T16:59:56.937

2@OlivierDulac No, it has indeed never been done. There are no known SHA-1 collisions. The estimated cost is just an estimate - it's not that someone did it and this is how much we think it cost, it's that no one has done it but we think this is how much it would cost. – cpast – 2015-03-16T17:13:53.823

@cpast We do not know for sure if has been done or not, but a $3 million attack is less than 0.03% of the NSA's annual budget (in fact, the attack should be cheaper given that they already own the hardware and don't have to rent). It is reasonable to conclude that, since they have the means and motivation to do it, then they probably already did. Remember Flame.

– bain – 2015-03-16T19:50:52.693

It is theoretically possible, but it has not yet been done.

What you are looking for is called a "hash collision:" two files with the same hash. Cryptographic hash codes like SHA-1 are generally designed to make this difficult. Because SHA-1 is a 160 bit code, it will take on average 2^159 brute force attempts to find a duplicate. If an algorithm is found which reliably does better than that against a cryptographic hash, the hash is considered "broken."

MD-5 is an example of a very broken hash. It was supposed to have a strength of 128 bits, requiring on average 2^127 attempts. As is, abusing known vulnerabilities, the actual number of attempts needed can be as low as 2^47. This is a LOT smaller than 2^127. In fact, it has been done in under a day on a modern computing cluster.

I give that example because that is closest to how you are looking to use SHA-1. However, that is not the most common approach cryptanalysis use for making sure hashes aren't broken. They usually allow a collision between two files, as chosen by the attacker, instead of having you pick one file and the attacker seeking to match it. This sort of attack has an advantage of being easier to benchmark. If I find it is "hard" to crack your file, does that mean that another file is similarly strong? This attack where the attacker gets to choose both files ensures we catch the worst of the worst.

This sort of attack allows an interesting trick known as the "birthday attack." Long story short, getting to use the birthday attack halves the strength of the algorithm, so SHA-1 requires 2^80 attempts (on average) and MD5 requires 2^64 attempts (on average). These are half of 160 and 128 respectively.

SHA-1 has known attacks that decrease its strength from 2^80 to 2^69. This is not going to matter much to you. 2^69 attempts is a long time.

However, from history, we've found that hash algorithms are not broken spontaneously, but rather broken over time. Nobody cracks an algorithm like MD-5 taking it from 2^64 to 2^47 over night. It happens over time, as many individuals publish papers about the math they're using against it. One can usually watch the complexity of the attacks go down slowly from the inception of the algorithm (where the best attack is usually the birthday attack).

The fact that we are seeing some changes in collisions suggests SHA-1 is seeing the light at the end of the tunnel. It's still strong, but there might be a desire to go up to the newest SHA-3 which is currently much more secure.

You should really make such decisions from a threat model perspective. How much damage can be done by an attacker if they get one of these collisions. Are your attackers script kiddies with access to a few laptops, or governments with entire supercomputing clusters at their disposal. How large of a time window does an attacker have to break the hash before it's no use (many uses of cryptography involve a "changing of the guard," such as password rotation). All of these will affect how seriously you have to consider collisions.

Cort Ammon

Posted 2015-03-15T22:21:16.377

Reputation: 2 316

8Regarding your birthday attack paragraph, 2^80 is the square root of 2^160, not half of it (which would be 2^159). – Andrew Morton – 2015-03-16T20:32:43.530

The question is about second-preimage attacks, but your answer is about collisions. Preimage attacks against SHA-1 — and even MD5 — are absurdly impractical. (There's a 2^123 preimage attack against MD5, but with SHA-1 you're stuck with a 2^160 brute-force.) – Matt Nordhoff – 2015-03-17T01:09:11.150

"Because SHA-1 is a 160 bit code, it will take on average 2^159 brute force attempts to find a duplicate." But a 2^2 code takes 2^2 guesses. I'm not seeing why you -1. "Long story short," ... "halves the strength of the algorithm, so SHA-1 requires 2^80" ... "MD5 requires 2^64" ... "These are half of 160 and 128 respectively." Here you should have -1'ed. The bits add to strength exponentially, so halving the strength of a 160 bit hash would treat it as a 159 bit hash, not an 80 bit hash. Each bit doubles the challenge of a brute force attack. – TOOGAM – 2015-03-18T13:07:20.040

@TOOGAM: He said 'on average'; over multiple trials, only 50% of the key space must be searched on average to succeed in a brute-force attack. As for the halving comment, Andrew Morton's comment above explains that; it should be the square root, not half, of the complexity. – Reid – 2015-03-18T14:58:36.223

@AndrewMorton good point, I was not clear with my wording. I find the literature switches between number of states and the base-2 logarithm of the number of states quite often. My wording referred to halving the number of bits because people tend to talk about "strength" in number of bits. I was so used to switching back and forth that I did so unconsciously. I'll edit to remove the confusion. – Cort Ammon – 2015-03-26T16:23:10.030

@MattNordhoff In my third paragraph I covered why I was talking about collisions, not preimage attacks. It is much harder to do an analysis of a preimage attack because some data interacts with the hash in ways which make it substantially harder to find a preimage (consider the famous case of Length Extension Attacks, where some signature constructions are trivially weak compared to others). Collisions are more of a "level" playing field. In addition, one sees the slow breaking of the hash in its collisions first, so it's a good place to look to see how strong the hash currently is. – Cort Ammon – 2015-03-26T16:27:18.307

The flaws in SHA-1 discussed in that article are very specific: they allow attackers to create two things that hash to the same value (this is called a "collision attack"). However, a collision attack requires that the attacker control both files involved. If the attacker doesn't control the original file, a collision attack does not let them find another file with the same hash value.

The reason this matters for TLS/SSL (and signatures in general) is that with those, an attacker often can control both files. A TLS certificate is mostly created by the person requesting it (the bits they don't control are often predictable), so collisions let them make a legitimate certificate and an illegitimate one, get the legitimate one signed, and transfer the signature.

For files, the same situation doesn't always apply. If your concern is that the person making the file is the attacker (e.g. they'll get one thing independently verified as good, and then send you the evil payload with the same hash), the SHA-1 attack applies, and you should look towards phasing it out (although it's not critical yet, as David Schwartz mentioned). If the original file is trusted, then an attacker can't apply the currently known SHA-1 attacks, although you should still think about phasing it out if you can (if you have a choice, use a hash without known attacks like SHA-2).

In response to "the collision won't be useful" -- While an attack doesn't require an attacker to be able to get a useful collision, it's generally not all that hard to turn "collision" into "useful collision." Many file formats have a fair amount of room in which you can have whatever you want without affecting the functionality of the file; an attacker can typically modify that in order to get a collision (if collisions are practically findable), while keeping the functional part as whatever they want it to be. The gap between "academic attack" and "practical attack" can be large; the gap between "any collision" and "useful collision" is generally much smaller.

The more serious issue, which is unrelated to the choice of algorithm, is how you're getting the hash. All a hash does is to shift the problem from "get the real file" to "get the real hash value;" a hash value sent from the same server and over the same connection type as the file is utterly worthless against malicious modification (any attacker who can tamper with the file can tamper with the hash). Hashes are only useful for this if you can trust the hash more than you can trust the file; while that's sometimes the case (torrents, mirrors), they're often used when it isn't the case. So you should be very careful about that whenever using hashes for integrity verification.

cpast

Posted 2015-03-15T22:21:16.377

Reputation: 2 279

You have to differentiate between a collision attack and a preimage attack. Finding any two messages that hash to the same value is a collision attack.
Replacing one particular, given message (here: an executable) with another message that has the same hash is a (second) preimage attack.

SHA-1 is broken insofar as a collision attack can be done in 2⁵² operations according to a Wikipedia article that does not provide a citation for that number (the best attack that I am aware of which is actually credible is the one by Marc Stevens, which takes 2⁶⁰ operations). But let's assume the pessimistic case of 2⁵².

This is concerning because an attack at that scale is not only theoretically conceivable but indeed perfectly doable in under a day on a multi-GPU rig. That's of course a problem for applications where "any two" messages will do. Even the 2⁶⁰ figure given by Stevens (which is 256 times more work) is perfectly feasible if your attacker is willing to throw some extra money at the problem, or is willing to spend a year of time.
Which is exactly the kind of thing that won't prevent someone involved in espionage or cyber crime from forging certificates.

Now, a preimage attack has an exponent that is twice as large, so assuming 2⁵² for the collision attack, that would be 2¹⁰⁴ operations, which is a totally different ballpark.

This is not only impractical (a machine that is a billion times faster than the one mentioned in the previous paragraph would still take around 6 million or so years), but given our puny means of generating energy this is entirely impossible.

Doing such a massive calculation would require an energy source which is much larger than anything we can afford to dedicate to a single operation. No, not quite an energy source the size of the sun, but still a quite big one.

You can realistically expect to get anything from 10 to 50 GFLOPS out of one Watt. Assuming that some kind of miracle happens and processors get about several thousand times more energy efficient over night, one could assume 1 SHA ≈ 1 FLOP (quite optimistic!). This would mean that in order to perform 2¹⁰⁴ hash calculations within 10 years, you need a 10¹²W power plant. In order to run the attack within 1 year, you need a 10¹³W power plant. That's about 50 times of what the entire nuclear power plants of the USA, France, and Japan can produce together, only for forging a single hash.

This is not going to happen, there are much easier ways of achieving the same goal (exploiting the server that stores the original hash and replacing that one, blackmailing someone, etc.).

Damon

Posted 2015-03-15T22:21:16.377

Reputation: 4 002

"... much easier ways of achieving the same thing..." As illustrated in http://xkcd.com/538/

– Ralph J – 2015-03-18T00:28:42.327

The general point of the article refered in the question is: SHA1 is deprecated and should be phased out while you still have time to do it smoothly. In some areas, the time is running out since Google and Microsoft enforce deadlines.

Rule of thumb for deprecated technology:

If you do a new design or add features, don't use it (SHA1).
If you maintain something old, make a plan when to replace it (SHA1).

Summary quote from the 2012 blog post by Bruce Schneier.: "The point is that we in the community need to start the migration away from SHA-1 and to SHA-2/SHA-3 now."

jmn

Posted 2015-03-15T22:21:16.377

Reputation: 121

For the SHA-1 hash collision part of your question, this has been addressed by a few of the answers.

However, a big portion of this hinges on the type of file we're working with:

Maintains the file's overall content and operation (but of course now includes ~~malicious content that was not originally there~~ changed contents)

What this means varies greatly on what is detecting the alterations:

If it's a signed executable, not a (reasonable) chance: you'd have to get two hash collisions somehow: the SHA-1 of the file and the internal .exe signature.
If it's an unsigned executable, .com, unsigned .dll, or similar, their resource forks can be added to in ways that will not change their operation and thus you could (eventually) get a hash collision that is not detectable by 'normal' operation.
If it's a source code file or similar structure (.cs, .c, .h, .cpp, .rb, .yml, .config, .xml, .pl, .bat, .ini) the additions, modifications, or removals can be constrained to valid comment syntax such that the change would not be discernible by most uses (compiling or running it, not opening it up with a text editor).
If it's an .iso or .zip or other container format, it is also more unlikely since most random changes will corrupt the container. It is possible to do: add a bogus file entry or alter a content within the container and recheck it, but you're adding a layer of complexity and adding additional time to check the result, as well as having limited degrees of freedom with respect to how and what contents may be changed.
If it's a text or text-like format, they can be changed almost any way you like while still being a 'valid' file, though the content will probably be noticeable.
With many formats like .rtf, .doc, .html, .xslx, and other markup-esque formats, they can be added or modified in ways that will be undetectable by parsers, so other than the length (or even with a constrained length, less freedom) the files can be altered to (eventually) get a hash collision while still being not only a valid file, but not noticeably changed in any way that would be visible to the typical applications they would be used with.

So, what you're left with is how to get collisions in whatever structure that is noncorrupting and some degree of undetectable perhaps:

Make any functional changes you desire (perhaps insertion of malicious content) and make any additional changes to retain file format specific validity
Add a section that will be non-functional (between comment blocks, at the very end of a text file with 3k carriage returns above it, isolate a current comment block)
Add or select a character/code point/byte for modification and try every possible valid combination (not every byte combination is valid for different encodings, for example).
Recompute the hash, see if collision matches.
if it does not, goto 3.

Let's say you have a super fast computer and a smallish file, such that modification with a valid byte sequence and recomputing the hash takes 1 millisecond (probably requiring some dedicated hardware). If the hash distribution is perfectly random and distributed across the range, you will get a collision with SHA-1 every 2^160 attempts (brute forcing it).

2^160/1000/60/60/24/365.24 
= 4.63x10^37 years 
= 46,300,000,000,000,000,000,000,000,000,000,000,000 years 
= 46 undecillion years.

But hey, let's try the 2^60 and 2^52 versions, and pretend that they allow us to modify the file any way we like (they don't) and that they, too, can be done in 1ms each try:

2^52 yields 142,714 years 
/*humans might still be around to care, but not about these antiquated formats*/
2^60 yields 3.65x10^7 years = 36,500,000 years 
/*machines will probably have taken over anyway*/

But hey, you might get lucky. Really, really, more-of-a-miracle-than-anything-people-call-miracles lucky.

Ehryk

Posted 2015-03-15T22:21:16.377

Reputation: 240

Not really, you can satisfy one of those conditions at a time but not both.. it is possible to get the same hash for two different files but for someone to alter a file and then try to get the same hash is pretty much impossible as far as I know

Anthony Guess

Posted 2015-03-15T22:21:16.377

Reputation: 11

1Pretty much impossible yet. With sufficient computing power everything is possible. – None – 2015-03-16T02:50:09.207

-6

Yes, it is possible. Think of how viruses work on EXEs. The malware payload is appended to the original EXE, so that the program still does what it originally did, but also spreads as a virus. Now, to maintain the same hash, you would need additional specifically crafted padding.

That means that the file would be larger. But in the case of an EXE, maybe you could remove some of the less-used code, so that the program would only appear to work initially. In the case of a JPEG, you could compress the image further or use a different image entirely. For an ISO, you could remove sets of files. The computations required to replicate the hash would be more difficult, and perhaps mathematically impossible for specific cases, but would still be possible in general.

Ken

Posted 2015-03-15T22:21:16.377

Reputation: 7 497

7-1 everything in this post is completely made up. Length-extension attacks do not "maintain the same hash" (the hash just changes in a known way). Also, there's no reason a virus would have to remove "the less-used code" (how would it even determine what that is?). And what do jpegs have to do with anything!? – BlueRaja - Danny Pflughoeft – 2015-03-16T03:19:05.870

2This is just totally wrong, I can't even begin to suggest corrections without rewriting the whole answer – Mark K Cowan – 2015-03-16T10:42:11.917

2-1 Not right at all. aka "Not even wrong" (Wolfgang Pauli) – Olivier Dulac – 2015-03-16T17:03:33.040

1Well, we could start with the fact that if something is possible in general then it obviously is possible in a specific case. The opposite is not always true, however: it is easy to imagine a problem that can be solved for a specific case, but not generally. – a CVn – 2015-03-16T19:48:07.167