Introduction

In my previous article, Good Bye MD5, I introduced the current findings in cryptology on MD5 collisions. A debate followed, and most people seem to think that these findings are not a serious issue.

Microsoft agreed that this is an important issue:

"Microsoft is banning certain cryptographic functions from new computer code, citing increasingly sophisticated attacks that make them less secure, according to a company executive".

"The Redmond, Wash., software company instituted a new policy for all developers that bans functions using the DES, MD4, MD5 and, in some cases, the SHA1 encryption algorithm, which is becoming "creaky at the edges," said Michael Howard, senior security program manager at the company". Source: Microsoft Scraps Old Encryption in New Code

How to exploit the collisions

There is a known result about the MD5 hash function:

If MD5(x) == MD5(y) then MD5(x+q) == MD5(y+q)

So, if you have a pair of messages x and y with the same MD5 value, you can append a payload q of arbitrary size to both of them and the two resulting MD5 values will still be equal to each other. Strictly speaking, this relies on x and y having the same length, a multiple of MD5's 64-byte block size, and leaving the hash in the same internal state; the published collision pairs satisfy this. You need such a pair of vectors, x and y, to mount the exploit. You can try to find a pair yourself, but a pair of values has already been published by the researchers Joux and Wang. A practical use of this pair of vectors is explained in the paper MD5 To Be Considered Harmful Someday, by Dan Kaminsky.
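The reason the property holds can be illustrated on a toy hash. The sketch below is not MD5 and does not use the real collision vectors: it is a hypothetical 16-bit hash (the names ToyHash, Compress and FindCollision are invented for this illustration) that chains a compression function over fixed-size blocks and pads with the message length, which is the same overall structure MD5 has. Its state is small enough that a colliding pair of blocks can be found by brute force, after which any payload can be appended without breaking the collision.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Toy model only: a 16-bit hash built like MD5 (block chaining plus length padding).
public static class ToyCollisionDemo
{
    const int BlockSize = 4;            // toy block size (MD5 uses 64 bytes)
    const ushort InitialState = 0x1234; // arbitrary starting state for the toy hash

    static readonly MD5 Md5 = MD5.Create();

    // Toy compression function: mixes the 16-bit state with one 4-byte block.
    static ushort Compress(ushort state, byte[] data, int offset)
    {
        byte[] input = new byte[2 + BlockSize];
        input[0] = (byte)(state >> 8);
        input[1] = (byte)state;
        Array.Copy(data, offset, input, 2, BlockSize);
        byte[] h = Md5.ComputeHash(input);
        return (ushort)((h[0] << 8) | h[1]);
    }

    // Pads with 0x80, zeros and the message length, then chains Compress over all blocks.
    public static ushort ToyHash(byte[] message)
    {
        int paddedLength = (message.Length / BlockSize + 2) * BlockSize;
        byte[] padded = new byte[paddedLength];
        Array.Copy(message, padded, message.Length);
        padded[message.Length] = 0x80;
        BitConverter.GetBytes(message.Length).CopyTo(padded, paddedLength - BlockSize);

        ushort state = InitialState;
        for (int i = 0; i < paddedLength; i += BlockSize)
            state = Compress(state, padded, i);
        return state;
    }

    // Brute-force search for two different one-block messages that leave the same state.
    public static void FindCollision(out byte[] x, out byte[] y)
    {
        Dictionary<ushort, int> seen = new Dictionary<ushort, int>();
        for (int i = 0; ; i++)
        {
            byte[] block = BitConverter.GetBytes(i);
            ushort state = Compress(InitialState, block, 0);
            int earlier;
            if (seen.TryGetValue(state, out earlier))
            {
                x = BitConverter.GetBytes(earlier);
                y = block;
                return;
            }
            seen[state] = i;
        }
    }

    public static byte[] Concat(byte[] a, byte[] b)
    {
        byte[] result = new byte[a.Length + b.Length];
        a.CopyTo(result, 0);
        b.CopyTo(result, a.Length);
        return result;
    }

    public static void Main()
    {
        byte[] x, y;
        FindCollision(out x, out y);
        byte[] q = Encoding.ASCII.GetBytes("any payload, of any length");

        Console.WriteLine(ToyHash(x) == ToyHash(y));                       // True
        Console.WriteLine(ToyHash(Concat(x, q)) == ToyHash(Concat(y, q))); // True
    }
}
```

Once x and y have been processed, both computations continue from the same internal state and see exactly the same remaining bytes, so they cannot diverge. This is also why the colliding prefixes have to end on a block boundary.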

Hacking software distribution

The proof of concept to be shown in this article has the following scenario:

This is a simulated software distribution mechanism.

The software is distributed in binary format, in files with .bin extension.

There is an extraction program that checks and extracts the software from the .bin file.

For verification purposes, we use the MD5 value to check the integrity of the .bin files.
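As a minimal sketch of that verification step, assuming the published MD5 value is available as a hexadecimal string: the class and method names (Md5Check, Md5Hex, Verify) are hypothetical, and the three bytes of "abc" stand in for the contents of a .bin file.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Md5Check
{
    // Returns the MD5 digest of the data as a lowercase hexadecimal string.
    public static string Md5Hex(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(data);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Compares the computed digest with the published one, ignoring letter case.
    public static bool Verify(byte[] package, string publishedMd5Hex)
    {
        return string.Equals(Md5Hex(package), publishedMd5Hex,
                             StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Stand-in for the bytes of a downloaded .bin file.
        byte[] package = Encoding.ASCII.GetBytes("abc");

        Console.WriteLine(Md5Hex(package));                                     // 900150983cd24fb0d6963f7d28e17f72
        Console.WriteLine(Verify(package, "900150983CD24FB0D6963F7D28E17F72")); // True
    }
}
```

A check like this only says that the file has the expected digest. It cannot tell two different files apart if they were built to share that digest.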

This picture shows a scenario where a pair of binary files with the same MD5 is generated, so that MD5(good.bin) == MD5(evil.bin):

Attacking the distribution software

First, we will build a generator program. It takes a pair of executables, a harmless program and a harmful (evil) one, and generates a pair of binary distribution files (.bin files): good.bin and evil.bin.

The good program

The code of the harmless program is simple:

    using System;

    namespace GoodExe
    {
        /// <summary>
        /// A harmless program.
        /// </summary>
        class Class1
        {
            /// <summary>
            /// The main entry point for the application.
            /// </summary>
            [STAThread]
            static void Main(string[] args)
            {
                // Stands in for the legitimate software being distributed.
                Console.WriteLine("this is a good executable");
            }
        }
    }

The evil program

This is the code of the evil program, which simulates harmful behavior:

Remember that, given this pair of vectors and a payload of any size, MD5(vec1+payload) == MD5(vec2+payload). The payload is built as follows: the length of the good file, the length of the evil file, the content of the good file, and the content of the evil file.
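The layout just described can be sketched as follows. This is an illustration under assumptions, not the article's generator: the name PackageBuilder, the 4-byte little-endian encoding of the two lengths and the 128-byte vector size are all choices made here. The vectors in Main are placeholders that do NOT collide; a real attack would substitute a published colliding pair.

```csharp
using System;
using System.IO;
using System.Text;

public static class PackageBuilder
{
    // Assumed layout: vector | good length | evil length | good bytes | evil bytes
    public static byte[] Build(byte[] vector, byte[] good, byte[] evil)
    {
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream))
        {
            writer.Write(vector);       // raw bytes, no length prefix
            writer.Write(good.Length);  // Int32, little-endian
            writer.Write(evil.Length);  // Int32, little-endian
            writer.Write(good);
            writer.Write(evil);
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // Placeholder vectors: they differ in one byte but do not share an MD5 value.
        byte[] vec1 = new byte[128];
        byte[] vec2 = new byte[128];
        vec2[123] = 0x80;

        byte[] good = Encoding.ASCII.GetBytes("good"); // stand-in for good.exe
        byte[] evil = Encoding.ASCII.GetBytes("evil!"); // stand-in for evil.exe

        byte[] goodBin = Build(vec1, good, evil);
        byte[] evilBin = Build(vec2, good, evil);

        Console.WriteLine(goodBin.Length); // 145 = 128 + 4 + 4 + 4 + 5
        Console.WriteLine(evilBin.Length); // 145
    }
}
```

Both packages carry both programs, so everything after the prefix is byte-for-byte identical. That shared suffix is what lets a collision between the two prefixes carry over to the whole files.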

Now, we can publish good.bin on the Internet for people to download, and later replace it with evil.bin. Users will then get infected without noticing, convinced that there has been no tampering, because the MD5 signature is the same for both files; in other words, MD5(good.bin) == MD5(evil.bin).

The extractor program

Now, suppose we have replaced the extractor program with our own version. Our extractor receives the .bin distribution file and extracts the good or the evil program depending on which prefix vector appears at the beginning of the .bin file. We use the byte at position 123 to detect which vector was used as the prefix.

Suppose you receive the good.bin file: you run the extractor on good.bin and good.exe is extracted. But if you receive the evil.bin file, the extractor extracts evil.exe, i.e. the harmful executable. Remember that MD5(evil.bin) == MD5(good.bin).
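A rough sketch of such an extractor is shown below. Only the use of the byte at position 123 comes from the description above; the 128-byte vector length, the two 4-byte little-endian length fields, the marker values 0x00 and 0x80, and the names Extractor, Extract and MakePackage are assumptions made for this illustration.

```csharp
using System;
using System.Text;

public static class Extractor
{
    const int VectorLength = 128; // assumed size of the prefix vector
    const int MarkerOffset = 123; // byte used to tell the two vectors apart

    // Returns the good program if the marker byte matches goodMarker, else the evil one.
    // Assumes the length fields are little-endian, as on typical x86/x64 machines.
    public static byte[] Extract(byte[] package, byte goodMarker)
    {
        int goodLength = BitConverter.ToInt32(package, VectorLength);
        int evilLength = BitConverter.ToInt32(package, VectorLength + 4);
        int goodStart = VectorLength + 8;

        bool isGood = package[MarkerOffset] == goodMarker;
        int start = isGood ? goodStart : goodStart + goodLength;
        int length = isGood ? goodLength : evilLength;

        byte[] result = new byte[length];
        Array.Copy(package, start, result, 0, length);
        return result;
    }

    // Test helper: builds a package whose prefix is all zeros except the marker byte.
    public static byte[] MakePackage(byte marker, byte[] good, byte[] evil)
    {
        byte[] package = new byte[VectorLength + 8 + good.Length + evil.Length];
        package[MarkerOffset] = marker;
        BitConverter.GetBytes(good.Length).CopyTo(package, VectorLength);
        BitConverter.GetBytes(evil.Length).CopyTo(package, VectorLength + 4);
        good.CopyTo(package, VectorLength + 8);
        evil.CopyTo(package, VectorLength + 8 + good.Length);
        return package;
    }

    public static void Main()
    {
        byte[] good = Encoding.ASCII.GetBytes("good program");
        byte[] evil = Encoding.ASCII.GetBytes("evil program");
        const byte goodMarker = 0x00; // hypothetical value of byte 123 in the good vector

        byte[] goodBin = MakePackage(0x00, good, evil);
        byte[] evilBin = MakePackage(0x80, good, evil);

        Console.WriteLine(Encoding.ASCII.GetString(Extract(goodBin, goodMarker))); // good program
        Console.WriteLine(Encoding.ASCII.GetString(Extract(evilBin, goodMarker))); // evil program
    }
}
```

The selection logic lives entirely in the extractor, which is why the scenario depends on the attacker controlling that program as well as the distribution files.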

Conclusion

Recently, the world of cryptographic hash functions has been in crisis. Many researchers have announced "attacks" that find collisions for common hash functions such as MD5 and SHA-1. "For cryptographers, these results are exciting - but many so-called "practitioners" turned them down as practically irrelevant".

I hope this proof of concept will convince you that there is a serious issue with MD5.

This article shows how a failure in the software distribution chain makes it possible to exploit the current cryptological findings about the MD5 hash function.

History

September 20th, 2005: Some grammar correction, and title modified.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

Actually I agree with him. This doesn't prove a thing other than what has long been known - there have to be collisions. And this can be proven with the pigeonhole principle - you cannot shove more pigeons into a set of pigeonholes than there are pigeonholes without at least two of them sharing a hole. In other words (for MD5) if you have 2 ^ 128 + 1 files then you _must_ have a collision.

Note that (in this example) the md5extractor is part of the "evil" system. _it_ makes the selection between which of the two files are going to be placed back. In short, this attack does not allow me to replace good code with evil code. Not yet anyway. Take for example the way in which Gentoo Linux distributes packages (I'd attack the rsync mechanism in that particular case if I was evil). It stores an MD5 hash of the _good_ code. Now if I want to replace the code with something evil then I don't have code on the "clean" system to begin with - all I've got to work with is replacing the source archive, and the MD5s need to match. I've got no starting matching data and I must generate a _valid_ .tar.gz archive. Suddenly the problem becomes a lot harder again.

Give respect and credit where it is due, to Joux and Wang for actually finding two vectors that produce the same MD5.

I'll be impressed when someone shows me a way to take any arbitrary file and add a small amount of data to that file to get to a particular MD5 (and in the case of .tar.gz still extract without any warnings). Not when someone already has an evil bootstrap on the system under attack. To get md5extractor onto the system (or replace emerge with the evil bootstrap) the system needs to be broken already - what is the point of this attack then?

Wow. You have utterly misunderstood what this article is about. While it is obvious that MD5 has collisions (as do all hash functions, by definition), the exploit is that if MD5(x) == MD5(y) then MD5(x+p) == MD5(y+p) where p is arbitrary.

Say someone hacked microsoft.com and overwrote the installation for the dotNET framework or Microsoft Vista beta 2, for instance, and used this method to insert an arbitrary program (which is our selected p) into the executable. No one, Microsoft included, would be any the wiser, because the MD5 checksum would be the same.

Coincidentally the same is also true for MD4, but not SHA1, or SHA256.

You apparently don't get it. The whole point is that, after submitting the good code (with the bad code appended to it, although inert), a malicious programmer could then replace the file with one that has the bad code active, and the good code inert, and the MD5 checksum would be the same.

No dude, this is too much for you to comprehend so quit arguing senselessly!

xacatecas hit the nail on the head. What Kaminsky and this nut who wrote this C# 'simulation' did is nothing more than a magic trick. Re-read what 'xacatecas' posted above, I'll refrain from reiterating.

Before replying to a post with arrogance, try comprehending the original goddamn research.

would be a tool which creates all of the possible "candidate" files from an MD5 value. So, let's say you had the MD5 value of an executable. With this tool you could produce all of the possible original files which result in this MD5 hash value. Most of them will not be well-formed executables, so they could be quickly filtered out.

The remaining files could be further filtered if you had some idea of the size of the original file. Other filtering techniques could be applied to still further reduce the number of possible files.

The end result - possibly after considerable computation time - would be a few files and one of them would be the original executable. Right?

Do not be misled into thinking that this would be some magic that produces the one true file. There would be a significant number of candidates despite whatever filtering was applied. MD5 is a hash function where multiple sources resolve to the same hash value.

Obviously, there is a brute-force way of doing this, but that would be a rather crude approach. It would seem there would be a mathematical approach that would be significantly better than brute force.

Nice example, and it's good to know that MD5 collisions can be extended that easily, which is indeed very dangerous. Because the example needs an additional attack on the extractor software, it can trick us into underestimating the problem. There is no need for an extra attack on trusted software that uses the colliding files, if that trusted software is itself not sensitive to the first part of the files.

Consider for example an interpreter that by construction skips a fixed number of bytes at the beginning of the source code files when starting the interpretation, e.g. because the interpreter expects only initialization data at that place. The initialization data will be made available for use and interpretation by the program itself, which starts after the fixed initial space. Now suppose we can fit the colliding payloads of the good and bad source files into that initial space. Our good and bad programs are the same, but what they actually do depends completely on that initial data. In that case, a code analysis of the "good" source code could generate well founded trust about the good program, based on its harmless initialization data. This trust would then extend to the bad program because both the good and the bad source code files have the same hash.

I haven't downloaded, but based on the cmd window screenshot, you need to clearly show that the MD5 of good.exe and bad.exe match - not goodexe.exe & badexe.exe for which we can't associate similar or different code.

So here is my question relating to the demonstration of a compromise:

MD5(x) == MD5(y) then MD5(x+q) == MD5(y+q)

is understood.

So we start out w/ two byte blocks which are known to result in the same hash: x & y

Then we embed these in an exe - one exe that does something good and one that does something bad.

qx != qy

so MD5(x+qx) != MD5(y+qy). Check the md5 sums of good.exe and bad.exe.

Ok. So this is obvious, I must have missed your point. It would be very interesting to show software or a general algorithm which demonstrates the compromise of MD5 in finding x and y such that MD5(x) == MD5(y) other than using brute force. I'm not clear on the application of what is shown here.

MD5(vector1 + good + bad) == MD5(vector2 + good + bad), where MD5(vector1) == MD5(vector2). The extractor program then checks if vector1 or vector2 was used to create the binary package, and extracts the good or bad exe depending on that condition.

No, md5 is just a hash function, so it is possible to create files with the same size (those files would be quite large, I guess) and the same md5 sum. You know, we live in the world where everything is possible.

If you've got the rights to write good.exe, why bother with all this? Just create bad.exe and rename it good.exe. You don't have to bother with MD5 hash collisions, because you can just post the MD5 hash for bad.exe.

Both your example and my counter require that good.exe's author be the culprit, and if he's not trusted, then how do you know he didn't write something that just works for a few weeks before formatting your hard drive?

If I author good.exe and the md5 hash, and you can produce a bad.exe with a matching hash, now we're talking a HUGE vulnerability. But you're totally glossing over the fact that the Author is the only one who can take advantage of this. And if he's already seeking to do harm, then this is actually a complicated and difficult manner of accomplishing that goal.

But suppose you are able to hack a distribution package that uses MD5 as a checksum, like rpm for Linux, for example. Then things get more interesting....

This quote is from Practical Attacks on Digital Signatures using MD5 Message Digest:

"Now it is clear that there is possibility to create two custom self-extract executable binaries containing any arbitrary files having equal MD5 sums. Take a web-browser software as an example.

Suppose a packager in a company creates self-extract packages intended for distribution. Once the web-browser's development is finished, all of the web-browser's files are given to packager to create installation scripts and create a package ready for distribution. Packager creates the scripts and the package itself. The package is then sent to testing department to check whether it passes the tests and whether the web-browser can be installed. If all tests are passed, the package is good and will be digitally signed. Testing department will sign the package as to prove that the software was tested, passed the tests and is ready for distribution. Both web-browser package, its MD5 sum and signature is then put on company's ftp or web page for download.

Now suppose the packager is a dishonest person (an attacker). So he/she creates a pair of packages with equal MD5 sums, one containing the original files and the other package deliberately flawed. The good package is sent for testing, passes tests and gets signed using MD5 hash as message digest. Later it's put on ftp/web page for download. The attacker as an insider has access to company's servers (either legitimate or not), so he/she replaces the good package with flawed package. The MD5 sum and digital signature will hold even for the flawed package.

If the attacker is clever, he/she will modify the original software only slightly and in a discreet way. A very rare and obfuscated race condition that only he/she knows how to trigger is a good example. The flawed web browser will be downloaded and installed. Since MD5 sum and signature holds and the software does not act suspiciously, it can take a long time until the flaw is detected (if ever). All mischief after discovering the flaw would fall upon testing department's head, since their signature guarantees that the software was well-tested.

Now the attacker can for example sell the knowledge of the flaw to spammers, virus-writers, create own web-pages which will infect computers using the flawed browser or hack into vulnerable web-servers and slightly modify the pages adding own code that triggers the flaw in browser and causes infection, later using infected computers to send spam, launch distributed denial of service attacks, etc."

Keep in mind that RPM (from Red Hat) and Apache use MD5 in their digital signatures.

And the differential algorithm for calculating MD5 collisions will be published in the near future.

And the packager could do the exact same thing by building code that says "act nice if running before a certain date" or "act nice if running on the tester's machine" or "act nice if running in the test domain" or... Get the point?

As you suggested, make it a small bit of code, and who's going to find it?

That person has numerous ways to harm people who trust him already. Yes, MD5 collisions adds another way, but if you switched the signature to use a SHA-512 hash it wouldn't even inconvenience him, now would it?

I believe the intent is to demonstrate a very specific type of attack that exploits the inherent trust of an MD5 hash. It's sort of a semi-social engineering attack.

The attack shown in the article illustrates how the distributor/author/etc can issue a hash for a "good" application and then begin distributing an "evil" application by exploiting the trust of the "good" application's published MD5 hash. Needless to say, this is a very specific scenario and uses predefined vectors; however, it does work to reinforce a fundamental truth about security.

Bruce Schneier asserts that security is a process, not a product[1]. This article demonstrates a specific attack that exploits the product-based view of security... "This file is safe because the MD5 hash I just calculated matches the published (well known) MD5 hash". The article highlights the importance of establishing a process that assumes products are flawed, manages exposure to attack and effectively deals with breaches.

chuck5761 wrote: If you've got the rights to write good.exe, why bother with all this? Just create bad.exe and rename it good.exe. You don't have to bother with MD5 hash collisions, because you can just post the MD5 hash for bad.exe.

You're assuming that a person who has the ability to post bad.exe as good.exe also has the ability to post the MD5 hash. That's not the way distribution via mirror sites needs to work, regardless of the signature algorithm.

If somebody.net has authored a software package and wishes to distribute it via somemirror.net, then the proper way to check the signature is to compare the expected signature from somebody.net to the computed package signature from somemirror.net. I realize that many mirror sites also post the signatures, but as your example shows, that's a bad practice.

GFeldman wrote: You're assuming that a person who has the ability to post bad.exe as good.exe also has the ability to post the MD5 hash. That's not the way distribution via mirror sites needs to work, regardless of the signature algorithm.

Actually no. I'm assuming that the person who has the ability to post good.exe also has the ability to post the MD5 hash, which seems to be universally true.

If somebody.net has authored a software package and wishes to distribute it via somemirror.net, then the proper way to check the signature is to compare the expected signature from somebody.net to the computed package signature from somemirror.net.

You're correct. I've just been trying to point out that there's no way for someone with access to ONLY somemirror.net to make the md5 hash of his package match the real one on somebody.net.

So while I agree it's a weakness, I'm arguing that it's minor because the only person who can take advantage of it is the author, since he's the only one that can produce the "modified" versions of both files so that their MD5s match.

chuck5761 wrote: You're correct. I've just been trying to point out that there's no way for someone with access to ONLY somemirror.net to make the md5 hash of his package match the real one on somebody.net.

Granted, the way the other attacks cited work is to have one author for two separate documents. However, a subtle point here is that there is a way for someone with only access to somemirror.net to make the md5 hash of the bogus package match that of the real one, under certain contrived circumstances, and only if the real author somehow used a vulnerable mechanism for the MD5. This may just be a tiny crack that isn't practical, but tiny cracks have a habit of getting bigger over time.

You're assuming that a person who has the ability to post bad.exe as good.exe also has the ability to post the MD5 hash. That's not the way distribution via mirror sites needs to work, regardless of the signature algorithm.

But this example requires that the same person be in a position to generate good.exe, bad.exe and the md5 hash. If they're in that position then they're surely in a position to post those files.

This example doesn't show how somemirror.net could generate a package with the same md5 as that generated by somebody.net

Exactly: if anybody working for the producer of the software that you download (whether "packager", "programmer", "tester" or "CEO") is evil, then you are screwed. The use or non-use of MD5 is a red herring.

If you can change the extractor, what prevents you from just putting some harmful program in its place? The purpose of checking the hash is to make sure the file hasn't been modified. So if you are not sure your extractor is genuine, you should check its hash first with some other utility and find that it was replaced. The real threat would come from modifying the good.bin file into an evil one without actually having to embed both files and some colliding value. It sure is possible, just not easy.

The extractor example is just a proof of concept. The real trick is to hack self-extracting files while maintaining the MD5 checksum, or to develop a way to put malicious code into a program without altering its MD5.

Remember, if MD5(x) == MD5(y) then MD5(x+q) == MD5(y+q).

Suppose x is the self-extracting code, or a valid header. Then finding the y part is only a matter of time, until collision-finding algorithms are disclosed.