Using MD5 for malware ids: collision attack risks?

Question

It has been known since 2004 that the MD5 hash is vulnerable to collision attacks (update - not "preimage" attacks - my mistake....). Yet it still seems that people are using it to identify malware. E.g. reports about the new Flame malware document people going back several years to discover the same md5 signatures in archived md5 data.

How old is Flame? - Alienvault Labs

An attacker could presumably ensure that all their files matched the md5 hash of other files which they make public and which seem innocuous, so relying on md5 seems dangerous.

I don't see references to sha256 or even sha1, which have not seen (public) collision attacks. What is the status of moving to better hashes for virus databases?

Update: the concern I had was that if the virus db didn't also retain full copies of all the files in question (eg because some were really big or whatever), and/or if folks searching the db didn't check the full contents of the new files they're looking up with the archived files, then a new file from a malicious virus, which matched an old "innocuous" file might be mistakenly dismissed as not dangerous just based on an md5 match. But hopefully the full files are retained and checked by anti-virus researchers, or else they would be vulnerable to this attack.

So what sorts of attacks against malware ids might make use of the ease of producing md5 collisions, and what steps are actually taken in specific hash databases and AV software to thwart them?

Jeff Ferland · Answer 1 · 2012-06-04T03:08:38.697

A quick analysis:

Threat: Somebody creates a clean file that matches a malicious file's MD5 hash.

Result: The clean file is identified as malicious, but is merely a collision. Another file that does match still exists and will always be identified as the same.

I suppose if this happens, there might be some talk about moving on. My guesses as to why we haven't:

This isn't authenticating malware, but identifying it. Inserting false positive matches has limited value. True positives will still be located.
It is presently universal. One can identify malware with only one hashing algorithm operation. If we change over, you'll have to start hashing every file with multiple algorithms unless somebody keeps a repository and can post new algorithm hashes for everything.
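
To put the "multiple algorithms" cost in concrete terms, here is a minimal sketch of computing several digests in a single read of the data. The function name `multi_digest` and the choice of algorithms are illustrative assumptions, not something any particular AV product is known to do.

```python
import hashlib
import io

def multi_digest(stream, algorithms=("md5", "sha1", "sha256"), chunk_size=65536):
    """Read a binary stream once and return {algorithm name: hex digest}.

    Hypothetical helper: each chunk is fed to every hasher, so the data is
    read only once even though several digests are produced.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}

# Usage: an in-memory stream stands in for a file opened in binary mode.
digests = multi_digest(io.BytesIO(b"hello"))
for name, value in digests.items():
    print(name, value)
```

The I/O cost is shared across algorithms, so the extra work is CPU time for the additional hash functions, plus the storage and lookup changes on the database side.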

A false match has very limited value in that you can't do much more than try to convince somebody that they're looking at a piece of malicious software that might be a different piece of malicious software... or just a bunch of bits. Only researchers trying to learn about that particular malware could care and they should realize what's going on.

Update

It is my understanding that virus databases do not include "clean" checksums. If there is a matching MD5 entry, it is for something that you don't want on your system unless you're trying to research it. Because a section of an executable can be set aside to be filled with any old nonsense, it is possible to create a malware file which has the same MD5 sum as another innocuous file (a "collision attack"). While we don't know how to do a practical "preimage attack", the design of executables makes it reasonably likely that a focused attacker could create a collision attack as described on Wikipedia's MD5 page. Namely, executable structure allows a lot of flexibility in inserting arbitrary data that is ignored during execution. Further, one could load an otherwise non-executable file from a launcher, allowing modification of any of the leading or trailing data in the file. That provides for the use of generic launchers and exceptionally easy hash collisions, as the first and last bytes can be anything at all.

Since databases don't contain clean files, you won't get a false negative. You could get a false positive if somebody engineered a piece of malware with that in mind. If you were in charge of building the first malware database today, you'd use a different hash algorithm. For historical reasons and a relatively low impact of a successful collision attack, MD5 continues to march on, though it is not ideal now.

Thanks. I updated my question with more details. If you know or could reference more info on how virus dbs and anti-virus researchers do their work to address the attack I describe (i.e. "they always check for full file content matches also"), I'll accept this. — nealmcb, May 31 '12 at 11:18
@nealmcb Updated; describing possible collision scenario and likelihood. — Jeff Ferland, Jun 03 '12 at 14:44

score 6 · Accepted Answer · edited Jun 04 '12 at 14:31

First off -- you are right that it would be better to use SHA1, SHA256 (or another member of the SHA2 family), or some more modern hash function.

However, I don't think the risk is very high. To explain why I have to give a little bit of background about attacks on hash functions. There are two kinds of attacks to worry about:

Collision attacks. An attacker finds two files M (Malicious), S (Safe) that have the same hash. Here the attacker can choose M and S freely.
Second pre-image attacks. The attacker is given a file C (Common), and has to find a second file M that has the same hash as file C. Notice that the attacker cannot choose C; it is given to him. The attacker's only degree of freedom is in the choice of M.
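
The gap between the two attack types can be felt with a toy experiment. The sketch below is an assumption-laden illustration: it truncates MD5 to 16 bits (`toy_hash`, a made-up helper) so that plain brute force finishes quickly, and it uses generic search, not the cryptanalytic techniques behind the real MD5 collision attacks.

```python
import hashlib
from itertools import count

BITS = 16  # toy size chosen for the demo; real MD5 output is 128 bits

def toy_hash(data: bytes) -> int:
    """First BITS bits of the MD5 digest, as an integer (illustration only)."""
    return int.from_bytes(hashlib.md5(data).digest()[:BITS // 8], "big")

def find_collision():
    """Collision: the attacker picks both inputs, so any repeated value wins."""
    seen = {}
    for i in count():
        msg = b"msg-%d" % i
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, i + 1  # two distinct inputs, number of tries
        seen[h] = msg

def find_second_preimage(target: bytes):
    """Second pre-image: the target is fixed, only the second input is free."""
    want = toy_hash(target)
    for i in count():
        msg = b"try-%d" % i
        if msg != target and toy_hash(msg) == want:
            return msg, i + 1

a, b, collision_tries = find_collision()
m, preimage_tries = find_second_preimage(b"common file C")
print("collision after", collision_tries, "tries")
print("second pre-image after", preimage_tries, "tries")
```

For an n-bit hash, generic collision search is expected to take on the order of 2^(n/2) tries and generic second pre-image search on the order of 2^n, so on this 16-bit toy the first search usually ends after a few hundred tries and the second after tens of thousands. Exact counts depend on the inputs tried.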

Second pre-image attacks are much more harmful, because they let the attacker do stuff he couldn't do with just a collision attack. The thing to know about MD5 is that it is known to be vulnerable to collision attacks. However, there are no known second pre-image attacks against MD5.

With this context, I can now answer your question. While MD5 is vulnerable to collision attacks, there's no clear way an attacker could use this to cause problems. Sure, an attacker could find a malicious file M and a harmless file S that have the same hash. The attacker could then start spreading the malware M, so that the MD5 hash of M gets onto someone's blacklist, and this might cause some people to falsely conclude that S is harmful. But so what? S will be a bunch of random-looking bytes. There's no reason why anyone would already have S stored on their systems, so the fact that the attacker can trigger false alarms on S is basically harmless.

A second pre-image attack on MD5 would be much more problematic. It would let the attacker choose some benign file C that is stored on everyone's hard drive: maybe a file that is critical to the operation of the Windows Firewall, for instance. Then (if the attacker knew a way to do a second pre-image attack on MD5) the attacker might be able to construct a malicious file M that has the same hash as C, and start spreading M around. When anti-virus companies add the MD5 hash of M to some blacklist, this could cause problems: it might cause anti-virus software to wrongly conclude that C is harmful, and that false alarm might end up disabling Windows Firewall on a bunch of systems or something like that. That would be bad, if it were possible. However, as far as we know today, such a bad scenario is not possible, because no one knows of any way to mount a successful second pre-image attack on MD5.

Bottom line: while it would be better to use a more modern hash function instead of MD5, I don't think there's much potential for bad guys to exploit the current practice of using MD5. The risk seems awfully low.

Thanks - that's an attack I hadn't thought of. I edited your answer to change the file names to be more mnemonic and consistent. The one I talked about above is where the attacker first spreads S around, gets AV folks to notice that it is innocuous, and thus think that md5(S) means "innocuous", and later when M is spread around, fails to notice that M is different than S. Probably hard to leverage as an attack. And in both your attack and mine, all it takes is for the database and software to also store and check the full file contents to deal with false positives based on the hash match. — nealmcb, Jun 04 '12 at 14:31
What about an attack involving a collision between relatively benign but annoying adware, and advanced spyware? It would be problematic if AV software detected a highly invasive specimen as a merely irritating source of ads, leading users to take the wrong steps to remove it. Another possibility is creating something that looks sorta malicious (like network scanning software) that collides with genuine malware. All the malware author has to do is present to the MD5 database holder the benign software and explain that it's a false positive, making social engineering easier. — forest, May 29 '18 at 05:26
@forest, Good point. I overlooked that possibility. You've convinced me. Given the state of MD5 today and how badly broken it is against collision attacks (and just how much control we can have over the collision), I suspect your attack might be feasible. Want to write an answer explaining that in more detail and then ping me, so I can remove my answer and upvote yours? Or would you prefer to see this answer edited to explain that? Feel free to suggest an edit (or if you prefer I can try to do it). — D.W., May 29 '18 at 20:03
Your answer is already very good (especially in regards to a preimage being far more problematic, whereas an attack using a collision is far more conditional). No need to remove it, even if I write my own answer! Perhaps just change the paragraph with `there's no clear way...` to describe these potential attacks. — forest, May 30 '18 at 01:17

Ramhound · Answer 3 · 2012-06-11T11:25:53.887

I have to disagree with the conclusions the author of that article wrote.

We have found a version of the main component (mssecmgr.ocx) that seems to be compiled at the end of 2008. It can indicate that Flame has been around at least for 4 years.

Actually, all this indicates is that the main component of a very large piece of malware is 4 years old. This doesn't mean there are other main components being used. As it is, Flame exploits a flaw that was fixed, I believe, in 2010.

Update: After the first two weeks of looking at Flame, it was discovered that Flame was actually several years old. It was able to remain hidden because it exploited a flaw in Terminal Services and used an MD5 collision attack against a Microsoft certificate.

The rest of the author's conclusion I agree with. There is no doubt that Flame is new in the sense that it's the motherlode of malware: it exploits a bunch of stuff in order to carry out several sophisticated attacks.

An attacker could presumably ensure that all their files matched the md5 hash of other files which they make public and which seem innocuous, so relying on md5 seems dangerous.

Flame did not attempt to do this. Your concern is not directed towards the correct thing.

Update: the concern I had was that if the virus db didn't also retain full copies of all the files in question (eg because some were really big or whatever), and/or if folks searching the db didn't check the full contents of the new files they're looking up with the archived files, then a new file from a malicious virus, which matched an old "innocuous" file might be mistakenly dismissed as not dangerous just based on an md5 match. But hopefully the full files are retained and checked by anti-virus researchers, or else they would be vulnerable to this attack.

All I have to say to this update is that you are worried about the wrong thing. These virus database websites do retain a copy of the file. The security companies also use more than just the MD5 hash of a file to determine if a file is malicious.

The chances of a real file matching a malicious file's MD5 hash is really really REALLY small. So even if a malicious file matched a real file, more than just the MD5 hash is used to identify the threat.
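
As a rough illustration of "more than just the MD5 hash", here is a sketch of a lookup that uses MD5 only as an index and confirms the match with SHA-256. The `SampleDatabase` class, its labels, and the choice of SHA-256 as the second check are all assumptions made for the example; this is not a description of how any real vendor's database works.

```python
import hashlib

def fingerprint(data: bytes):
    """Return (md5 hex, sha256 hex) for the given bytes."""
    return hashlib.md5(data).hexdigest(), hashlib.sha256(data).hexdigest()

class SampleDatabase:
    """Hypothetical sample database keyed by MD5, with SHA-256 kept for confirmation."""

    def __init__(self):
        self._by_md5 = {}

    def add(self, data: bytes, label: str) -> None:
        md5, sha256 = fingerprint(data)
        self._by_md5[md5] = (sha256, label)

    def lookup(self, data: bytes) -> str:
        md5, sha256 = fingerprint(data)
        entry = self._by_md5.get(md5)
        if entry is None:
            return "unknown"
        known_sha256, label = entry
        if known_sha256 != sha256:
            # Same MD5 but different content: treat as a possible engineered collision.
            return "md5-collision-suspected"
        return label

db = SampleDatabase()
db.add(b"pretend these bytes are a malware sample", "malicious")
print(db.lookup(b"pretend these bytes are a malware sample"))
print(db.lookup(b"some unrelated file"))
```

Comparing against the retained full file would serve the same purpose as the second digest. The point is only that an MD5 match alone is treated as a hint, not a verdict.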

Most of the well known malicious infections in the last 10 years have used the same exploit, in addition to exploits discovered between "discovery dates", in addition to working in very similar ways.

Everyone thought Stuxnet was some really advanced trojan written in a "custom" language. It turned out that it was written in C and shared the same components as its siblings.

*"The chances of a real file matching a malicious file's MD5 hash is really really REALLY small."* - I think you're missing the point. Attacks are not a blind random process; they are deliberate. If there was a way for an attacker to carefully choose things so that a real file had the same hash as a malicious file, then that'd be bad, even if the chances of stumbling upon this by mere chance are low. — D.W., Jun 04 '12 at 03:28
@D.W. - You have to understand that when I wrote this answer, what was known about Flame was limited. We didn't know they used an MD5 collision attack on the certificate in order to forge the component. I also learned that Flame actually IS 2 years old; my thought process, while sound, was not correct. — Ramhound, Jun 08 '12 at 17:41
cool. Perhaps you might want to edit your answer to remove the incorrect statement, then, given your latest understanding? (The question specifically asks about the risks of collision attacks when using MD5, so when answering that question, we do have to consider, well, the risk of collision attacks on MD5. By the way, my comment is not specific to Flame and applies generally.) — D.W., Jun 08 '12 at 17:48
@D.W. - I could update the question I suppose. I will work on that. — Ramhound, Jun 11 '12 at 11:20
