Vulnerability of software integrity and code signing applications to chosen-prefix collisions for MD5

November 30, 2007
Marc StevensCentrum voor Wiskunde en Informatica, Amsterdam Centrum voor Wiskunde en Informatica
Arjen LenstraÉcole Polytechnique Fédérale de Lausanne Bell Labs EPFL and Bell Labs
Benne de WegerTechnische Universiteit Eindhoven

Announcement We announce two different Win32 executable files with different functionality but identical MD5 hash values. This shows that trust in MD5 as a tool for verifying software integrity, and as a hash function used in code signing, has become questionable.

Software Integrity Checksums Software is vulnerable to threats on its integrity. For example, when a program is made available as an executable file on the web for downloading, some bits of the file may be accidentally changed during the download process. Or, when a program is stored on a hard disk that becomes victim to a virus attack, the program code may be changed on purpose, to get an executable file that has been infected by the virus. A popular method of thwarting those threats is to compute a checksum of the original executable file, and to publish this checksum along with the program on the download webpage, or to store this checksum in a database. At any later moment the user of the program is then able to recompute the checksum from the file as it is then residing on his computer, and verify the checksum with the stored original value. When the checksum has changed, apparently the file has changed, and should not be trusted any longer.
Code Signing Software vendors have set up code signing schemes to assist users in distinguishing software coming from a trusted source from malware. A public key signature scheme based on PKI (Public Key Infrastructure) is used to provide the user, along with the software itself, with a digital signature of the software. Such a signature is unforgeable in a strong cryptographic sense, as it has been constructed using the software vendors private key. The users can verify this signature using the corresponding public key, usually made available inside a certificate. A prime example of such a code signing scheme is Microsoft Authenticode.
Hash functions play an essential role in the process of making digital signatures. The private key operation is not applied directly to the to be signed file, but to a hash value of this file instead.
Cryptographic hash functions Cryptographic hash functions are widely known and easy to calculate functions that are believed to have the property that no two messages with identical hash values can be found. A well known and widely used cryptographic hash function is MD5. Especially in software integrity applications and code signing programs MD5 has been a popular hash function for many years now. Many websites making software (source code or executable) available for download provide on the website with the software itself its MD5 checksum. Another much used hash function is SHA-1. For the last decade MD5 and SHA-1 have been the default choice for hash functions in most code signing schemes.

Hash functions should be collision resistant In order for a software integrity checksum or a digital signature based on a hash value to be of any value, the cryptographic hash function that is used must be collision resistant. That is, it must be practically impossible to find different messages that have the same hash value. Otherwise, a miscreant can use a single hash value to commit to more than a single file.
MD5 is not collision resistant The cryptographic hash function MD5 was shown to be not collision resistant, by prof. Xiaoyun Wang and her co-authors, in 2004 (see the EuroCrypt 2005 paper "How to break MD5 and other hash functions"). Although devastating from a theoretical point of view, many practitioners believed that the practical repercussions of prof. Wang’s attack would be minimal because of the lack of useful structure of the collisions that could be constructed.
Chosen-prefix collisions When incorporating collisions into a pair of files, prof. Wang's original method of constructing collisions required the files to be exactly equal, except for the relatively small random looking block of bytes comprising the collision. Last year we showed, with the kind assistance of prof. Wang, how her attack on MD5 can be improved to provide chosen-prefix collisions. Our chosen-prefix collisions have only the requirement that after the collision the files should be exactly equal. Before the collision the two files, for which a collision is to be found, can be anything: our chosen-prefix collision finding method will always produce a collision that can be incorporated into the two files, irrespective of what data is present before the collision.
Extending executables Eric Verheul pointed out to us that random bytes may be appended to various formats, without changing their functionality. This idea also applies to executable files. So, by appending a chosen-prefix collision, any perfectly harmless program can be made to collide with any other program, such as a piece of malware. The two files show unchanged behaviour when executed, one good and the other bad, but now the files have identical MD5 hash values. A digital signature obtained on the harmless program will then be equally valid for the malware.

Proof of concept We now present our proof of concept pair of Win32 executable files that have different functionality but identical MD5 hash values.
A good executable We wrote the program HelloWorld.cpp, with the following C++ source code:

#include "stdafx.h"
int main(int argc, char* argv[])
    printf("Hello World ;-)\n");
    return 0;

As you can see from the source code, it's completely innocent. Here is the
Win32 executable HelloWorld.exe.
The MD5 hash value of the executable file HelloWorld.exe is

A bad executable We also wrote the program GoodbyeWorld.cpp, with the following C++ source code:

#include "stdafx.h"
int main(int argc, char* argv[])
        printf("Goodbye World :-(\n");
    return 0;

As you can see from the source code, it's not really harmful, just annoying. Here is the
Win32 executable GoodbyeWorld.exe.
The MD5 hash value of the executable file GoodbyeWorld.exe is


Building both executables from source code was done by a standard MS Visual C++ installation.
Making the executables collide As a proof of concept we applied our chosen-prefix collision finding method to the files HelloWorld.exe and GoodbyeWorld.exe. This came down to carefully constructing two blocks of 832 bytes, and appending them to the end of both files. These 2 times 832 bytes have been constructed such that the resulting files, renamed to HelloWorld-colliding.exe and GoodbyeWorld-colliding.exe, have exactly identical MD5 hash values.
The colliding executables Here are the two colliding files, the
Win32 executable HelloWorld-colliding.exe, and the
Win32 executable GoodbyeWorld-colliding.exe.
HelloWorld-colliding.exe has exactly the same good functionality as HelloWorld.exe, while GoodbyeWorld-colliding.exe has exactly the same annoying functionality as GoodbyeWorld.exe.
Though HelloWorld-colliding.exe and GoodbyeWorld-colliding.exe are different files, containing programs with different functionality, they both have the MD5 hash value


Signing the code To demonstrate the code signing vulnerability, we have set up our own Code Signing Authority, complete with a public key certificate and corresponding private key. Then we signed the colliding executable files using the private key of our Code Signing authority. The signature is

0D4488B0 838D4C50 08DE9F71 14ED9AC1 1A14586F 3A7ADEF1 D5AD5D5E F2B9C5C2 3E98F316 8BCDBD0E 2D07DCFF 011BF918 261AD036 A862FC8E 108A3FBF 0E72D4A6 849F2E9A 1B14102D 291F3804 BFE5BC8A D8051F4B A938EC40 3C91ED99 FBB59F45 3F34E110 EEE2D7EA 7B334874 584CE49C 600084EF 4EA9BA3A C7A9AE65 33478EE8

This signature was made by applying RSA to the standard PKCS#1 padded hash value.

Verification You are invited to verify for yourself the following essential points.
- the files HelloWorld.exe and HelloWorld-colliding.exe differ only in that 832 bytes have been appended;
- dito for GoodbyeWorld.exe and GoodbyeWorld-colliding.exe;
- appending those bytes did not change the functionality of the programs, they are normally behaving win32 executables, even though the files HelloWorld-colliding.exe and GoodbyeWorld-colliding.exe have not been constructed as output of a normal process of building an executable;
- the MD5 hash values of HelloWorld-colliding.exe and GoodbyeWorld-colliding.exe are identical;
- the signature is valid for both the colliding files.
Here is an OpenSSL recipe for verifying the signature. You will need the signature in a binary file, the Code Signing Authority certificate and the public key file. The CSA-certificate in turn has been issued (i.e. signed) by our own Hash Collision Certification Authority. So the CSA-certificate itself can be validated by the CA certificate.
See also the output of fciv - Microsoft's File Checksum Integrity Verifier utility.
Here is a file with all IHVs.
Here is a pretty picture showing the bit differences in the internal states of MD5 as it is applied to the two colliding files.

Vulnerability analysis It is important to note that the hash value shared by the two different files is a result of the collision construction process. We cannot target a given hash value, and produce a (meaningful) input bit string hashing to that given value. In cryptographic terms: our attack is an attack on collision resistance, not on preimage or second preimage resistance. This implies that both colliding files have to be specially prepared by the attacker, before they are published on a download site or presented for signing by a code signing scheme. Existing files with a known hash that have not been prepared in this way are not vulnerable.
What we can easily do, however, is to do a brute force search on a few additional bytes, to e.g. let the first three and the last three bytes of the hash value match with a given target. This already may introduce an additional vulnerability, as many people (including at least two of us) usually only look at the first and last three bytes when checking a hash value.
For abusing a chosen-prefix collision on a software integrity protection or a code signing scheme, the attacker should be able to manipulate the files before they are being hashed and/or signed. This may mean that the attacker needs insider access to the party operating the trusted software integrity protection or code signing process. An attacker with such access can most probably do more harm anyway, without the need for chosen-prefix collisions, to get ´official´ digital signatures on malware.
On the other hand, there is the viewpoint of the relying party, i.e. the user downloading hashed or signed code who needs some guarantee that this software can be trusted. This relying party can not be sure anymore that the published hash value or the digital signature is valid for only the executable file he downloaded. There might very well be a sibling file with the same hash value or digital signature, while only one of these siblings went through the proper hashing or signing procedure. Especially when the software integrity verification takes place under the hood, with the user not knowing that the operating system or some hidden application is silently verifying digital signatures on software to be installed, the user may be more easily lured into installing malware.
Note that it is not necessary for an attacker to build both executables from source code. It is perfectly well possible to take as the first file any executable from any source, and as the second file produce a second executable as malware. Then a byte block to be appended to both files can be found such that the resulting files have the same MD5 hash value. If an attacker can then get the first file to be signed, e.g. by the original software vendor, this signature will also be valid for the attacker-constructed malware.
It can be objected that at a bit-level inspection of the executable(s) it will become apparent that there is a strange non-functional random looking block of bytes at the end of the executable. This may be true for specialists knowing what to expect in executables, at least for us it's a complete mystery what should and should not be expected inside executables. Moreover, it could very well be possible to hide the colliding blocks not at the end, but hidden somewhere else in a less conspicuous place of the binary executable files.
Note that with the technique of constructing multi-collisions as described on our Nostradamus website, it is very well feasible to produce as many different (executable) files as one wants, all with the same MD5 hash.
One idea to thwart our attacks is to provide software with self-checking components, e.g. the software checks the integrity of its own executable as the first step of the execution. This might be a countermeasure against our attack. Nevertheless we think that the proper way to go is to avoid relying on broken crypto such as MD5 altogether.
Given the recent insights into the weaknesses of MD5, the bottomline of our work is:
MD5 should no longer be used as a hash function for software integrity or code signing purposes.
By now, everyone should be aware of this. If you are still using MD5, it would be good to carefully analyse the risks that result from its continued usage.
Note that also the collision resistance of SHA-1 does not live up to its design criteria anymore, though attacking SHA-1 is still much more difficult than attacking MD5. See the IAIK Krypto collision website for the present state of affairs.

Hasn't this been done before? In December 2004 Dan Kaminsky and Ondrej Mikle, and later Peter Selinger, published similar attacks, based on the Wang-type collisions that require two binary files that differ only in the colliding blocks. To create such files from two executables with different behaviour that yet collide under MD5, each of the two files has to contain both executables in full, somehow using the collision to switch on the one and hide the other.
Our colliding files are based on chosen-prefix collisions. This means that we only have to append a few thousand carefully chosen bytes to each file to reach an MD5 collision. Each file by itself contains only one of the two executables. This is less suspicious.

Sony PlayStation 3 As mentioned on the Nostradamus attack website, we make our chosen-prefix collisions using a Sony Playstation 3. The collision built inside the above executables took us less than 2 days to make.

- Our recent implementation of a Nostradamus attack, predicting the next US president.
- The (finished) HashClash project, Marc Stevens' MSc project on fast and smart MD5 collision construction.
- Our previous work on chosen prefix collisions.
- Our webpage on colliding X.509 certificates for different identities.
- Our old colliding certificates construction.
- Our webpage on Creating a rogue CA certificate.

 / _ \ 
| / | |
| \_|/ 
Marc Stevens, CWI, Amsterdam, The Netherlands
Arjen K. Lenstra, EPFL, Lausanne, Switzerland, and Bell Labs, Murray Hill, USA
Benne de Weger, TU/e, Eindhoven, The Netherlands

Please send all correspondence to Benne de Weger.

Acknowledgements We thank Eric Verheul for pointing out to us that appending any string of bytes to many data formats does not change the functionality.
A large part of the work behind this collision construction was done while Marc was visiting Arjen at EPFL in August and September 2007.
We are grateful to Dag Arne Osvik at EPFL for his help with the PlayStation 3.

Publication Date November 30, 2007
Latest Modification January 1, 2009
Vanity Counter