Search

Subscribe

Cryptanalysis of SHA-1

On Tuesday, I blogged about a new cryptanalytic result -- the first attack faster than brute-force against SHA-1. I wrote about SHA, and the need to replace it, last September. Aside from the details of the new attack, everything I said then still stands. I'll quote from that article, adding new material where appropriate.

One-way hash functions are a cryptographic construct used in many applications. They are used in conjunction with public-key algorithms for both encryption and digital signatures. They are used in integrity checking. They are used in authentication. They have all sorts of applications in a great many different protocols. Much more than encryption algorithms, one-way hash functions are the workhorses of modern cryptography.

In 1990, Ron Rivest invented the hash function MD4. In 1992, he improved on MD4 and developed another hash function: MD5. In 1993, the National Security Agency published a hash function very similar to MD5, called SHA (Secure Hash Algorithm). Then, in 1995, citing a newly discovered weakness that it refused to elaborate on, the NSA made a change to SHA. The new algorithm was called SHA-1. Today, the most popular hash function is SHA-1, with MD5 still being used in older applications.

One-way hash functions are supposed to have two properties. One, they're one way. This means that it is easy to take a message and compute the hash value, but it's impossible to take a hash value and recreate the original message. (By "impossible" I mean "can't be done in any reasonable amount of time.") Two, they're collision free. This means that it is impossible to find two messages that hash to the same hash value. The cryptographic reasoning behind these two properties is subtle, and I invite curious readers to learn more in my book Applied Cryptography.

Breaking a hash function means showing that either -- or both -- of those properties are not true.

Earlier this week, three Chinese cryptographers showed that SHA-1 is not collision-free. That is, they developed an algorithm for finding collisions faster than brute force.

SHA-1 produces a 160-bit hash. That is, every message hashes down to a 160-bit number. Given that there are an infinite number of messages that hash to each possible value, there are an infinite number of possible collisions. But because the number of possible hashes is so large, the odds of finding one by chance is negligibly small (one in 280, to be exact). If you hashed 280 random messages, you'd find one pair that hashed to the same value. That's the "brute force" way of finding collisions, and it depends solely on the length of the hash value. "Breaking" the hash function means being able to find collisions faster than that. And that's what the Chinese did.

They can find collisions in SHA-1 in 269 calculations, about 2,000 times faster than brute force. Right now, that is just on the far edge of feasibility with current technology. Two comparable massive computations illustrate that point.

In 1999, a group of cryptographers built a DES cracker. It was able to perform 256 DES operations in 56 hours. The machine cost $250K to build, although duplicates could be made in the $50K-$75K range. Extrapolating that machine using Moore's Law, a similar machine built today could perform 260 calculations in 56 hours, and 269 calculations in three and a quarter years. Or, a machine that cost $25M-$38M could do 269 calculations in the same 56 hours.

On the software side, the main comparable is a 264 keysearch done by distributed.net that finished in 2002. One article put it this way: "Over the course of the competition, some 331,252 users participated by allowing their unused processor cycles to be used for key discovery. After 1,757 days (4.81 years), a participant in Japan discovered the winning key." Moore's Law means that today the calculation would have taken one quarter the time -- or have required one quarter the number of computers -- so today a 269 computation would take eight times as long, or require eight times the computers.

The magnitude of these results depends on who you are. If you're a cryptographer, this is a huge deal. While not revolutionary, these results are substantial advances in the field. The techniques described by the researchers are likely to have other applications, and we'll be better able to design secure systems as a result. This is how the science of cryptography advances: we learn how to design new algorithms by breaking other algorithms. Additionally, algorithms from the NSA are considered a sort of alien technology: they come from a superior race with no explanations. Any successful cryptanalysis against an NSA algorithm is an interesting data point in the eternal question of how good they really are in there.

For the average Internet user, this news is not a cause for panic. No one is going to be breaking digital signatures or reading encrypted messages anytime soon. The electronic world is no less secure after these announcements than it was before.

But there's an old saying inside the NSA: "Attacks always get better; they never get worse." Just as this week's attack builds on other papers describing attacks against simplified versions of SHA-1, SHA-0, MD4, and MD5, other researchers will build on this result. The attack against SHA-1 will continue to improve, as others read about it and develop faster tricks, optimizations, etc. And Moore's Law will continue to march forward, making even the existing attack faster and more affordable.

Jon Callas, PGP's CTO, put it best: "It's time to walk, but not run, to the fire exits. You don't see smoke, but the fire alarms have gone off." That's basically what I said last August.

It's time for us all to migrate away from SHA-1.

Luckily, there are alternatives. The National Institute of Standards and Technology already has standards for longer -- and harder to break -- hash functions: SHA-224, SHA-256, SHA-384, and SHA-512. They're already government standards, and can already be used. This is a good stopgap, but I'd like to see more.

I'd like to see NIST orchestrate a worldwide competition for a new hash function, like they did for the new encryption algorithm, AES, to replace DES. NIST should issue a call for algorithms, and conduct a series of analysis rounds, where the community analyzes the various proposals with the intent of establishing a new standard.

Most of the hash functions we have, and all the ones in widespread use, are based on the general principles of MD4. Clearly we've learned a lot about hash functions in the past decade, and I think we can start applying that knowledge to create something even more secure.

Hash functions are the least-well-understood cryptographic primitive, and hashing techniques are much less developed than encryption techniques. Regularly there are surprising cryptographic results in hashing. I have a paper, written with John Kelsey, that describes an algorithm to find second preimages with SHA-1 ­-- a technique that generalizes to almost all other hash functions -- in 2106 calculations: much less than the 2160 calculations for brute force. This attack is completely theoretical and not even remotely practical, but it demonstrates that we still have a lot to learn about hashing.

It is clear from rereading what I wrote last September that I expected this to happen, but not nearly this quickly and not nearly this impressively. The Chinese cryptographers deserve a lot of credit for their work, and we need to get to work replacing SHA.

I don't know why the paper is taking so long to be published. Lots of cryptographers have seen it in private -- and they're the people who can understand it -- so I know it's not secrecy. I speculate there's an internal review process at their university that they have to go through before they can publish the paper, but I don't know for sure.

This is very interesting however I think reports on some of the popular tech news sites are sensationalising this. We knew that SHA-1 would not last forever and this should be seen as a good reason to put some time and money into identifying an alternative. Jon Callas put it perfectly in that we should start to walk but not run towards the fire exit.

And as you said Bruce the team who did it deserve a lot of credit. It is an excellent achievement.

Its not uncommon for "big" results to be circulated in this way. Such results are associated with long standing problems and contain subtle issues. It takes time to verify the results---perhaps more than for typical conference processes and no one wants to make a mistake on verifying it, so its the right thing to do.

Consider it at this point like a medical study which finds a statistically significant result. Such studies will be repeated and often corrected.

You two properties are clearly wrong. If I do a one way hash of the complete works of shakespear (oft cited in such frivolous experiemnts) the resulting hash is a fixed size right?

Or do you mean the hash function that will not compress the hash result? if the has isn't exactly the same size as the input, then you can gaurauntee that there will be collisions (even if it is larger than the input)

Two, they're collision free. This means that it is impossible to find two messages that hash to the same hash value.

Unless your rather weak version of impossible (meaning entirely easy and possible, if you have enough time) is what you are stating.

Hmmm - I have a hard time understanding how these "collisions" are a problem. Sure you can have two (and lots more) documents with the same one-way hash. But context makes a whole lot of difference. If I use SHA-1 (or -224, -256, -384 or -512) to get a one-way hash of a legal document. What are the chances that you can derive another legal document about the same subject and on the same details concerning the same people, that anybody is going to believe is a forgery of the original document. Chances are that the document you find with the same one-way hash is going to be pure gibberish when viewed as a legal document about the problem at hand.

No, what the forger wants is to fake your legal document with a different date, or different names or change paragraph 1 to read differently and leave the rest of the doument unchanged. Out of all of those infinite document collisions, you may be able to find one that accomplishes that, but I seriosuly doubt you are going to be able accomplish that in the lifetime of anybody who would be concerned.

There's something I have wondered (don't know if this was covered in another thread).

You have some plaintext P and two different hash functions H1() and H2(). Say that H1 and H2 have known collision weaknesses such that it is "easy" to find H1(P') = H1(P) or H2(P'') = H2(P).

What's the probability of finding value P''' where H1(P''') = H1(P) and H2(P''') = H2(P)? That is, find a P''' that causes a collision for P under both H1 and H2? If brute force on H1 required 2^80 attempts and on H2 it was 2^80, would the double collision brute force take 2^160? Or is it more or less complicated than that?

Would the use of two (or n) hashing algorithms in your application be too expensive? Does this just defer the problem down the hardware scale?

Granted, I'm thinking here of the most naive application of hashes, such as the storing of pre-calculated hash values of passwords for later comparison.

To the person complaining about the description of hashing function's properties:

There are an infinite number of messages that hash down to a finite hash size. So, yes, there are an infinite number of hash collisions.

For a cryptographically strong hashing function, however, it should be computationally infeasible (what Bruce means by "impossible") to construct two messages that hash to the same value.

What I'm curious about is how this affects existing messages hashed by SHA-1. It seems this attack is against the hash itself, such that the attacker chooses both texts to hash. If I have a fixed text that I've hashed (say as part of an integrity check), does this attack make it easier for the attacker to generate a new text that collides with my text? It doesn't sound like it.

I talk about how collisions affect security in "Applied Cryptography."

Here's one scenario: Make one message that says "Buy 1000 shares" with a varying number of nulls at the end. Make another message that says "Sell 1000 shares" with a varying number of nulls at the end. Hash the zillions of messages. Look for a collision. Get victim to sign one message, and now you have the other message signed.

There are lots of attacks like this against various protocols that use hash functions. Really, collisions are a bad thing in hash functions. All crypto texts explain why.

Bruce, I guess I'll need to read so more to catch the subtleties. I know for my particular SHA-1 application, the hashed text is a fixed, known-ahead-of-time size and the message itself must have certain other properties to be considered valid. This severely constrains an attacker looking for collisions. The "varying trailing nulls" attack won't work for those messages.

I guess that's the crux of the matter: If the hashed text is highly flexible and has no other form of integrity check beyond the SHA-1 function, then this attack is quite bad. Other layers of integrity checking do mitigate the damage somewhat. (Though it is a case of "walk, don't run to the exits.")

This comment is from Niels Ferguson, to someone who says "it's only a collision attack":

"A good hash function should behave like a random permutation. I cannot afford to do a security analysis of every place I use a hash function to find out whether a collision attack is dangerous. Hash functions are so useful they appear in dozens of locations through a design. It's just easier, and safer, to use a good hash function than trying to live with a partially broken one."

Okay - but wouldn't it be trivial problem to ensure that the message DOESN'T have ANY nulls at the end. Alternatively, I could hash the message and include the message size in the hash. Your xxxxx nulls at the end are no longer a problem. They change the message themselves and are detectable. Different messages that any competent s/w developer could test for.

Okay - I think I understand part of what you saying - If I originate the message and have total control over the content, then I could jigger the content in such a manner that I could dupe somebody that has no idea what is happening into "signing" two documents when they thought they were "signing" only one.

But again doesn't that come down to the "human" element in any security scheme. By duping the human element - any security scheme can be defeated. You have to know the people you are dealing with.

"Okay - but wouldn't it be trivial problem to ensure that the message DOESN'T have ANY nulls at the end. Alternatively, I could hash the message and include the message size in the hash. Your xxxxx nulls at the end are no longer a problem. They change the message themselves and are detectable. Different messages that any competent s/w developer could test for.

"Again, what am I missing here???"

Hash functions have certain security properties. Of course, it's often possible to use a hash function that doesn't have those security properties. For example, "any competent s/w developer could test for" the particular attack I described above. But how many thousands of possible attacks are out there? Can you think and test for them all? And what about all those less-than-competent s/w developers, that ones that seem to code every encryption product I end up analyzing?

SHA is broken. You can either work around the break, or replace it. Replacing SHA is by far the better option.

Replacing SHA-1 *right now* might prove problematic at least. Suppose that one is building a PKI with the CA's certificate using some alternative, such as SHA-256 or SHA-512 for the hash function. Wouldn't this require everybody who is verifying a cert chain leading up to this CA to support the replacement hash function as well?

With a general purpose PKI, this would require changes into web browsers and servers (doing client auth), VPN gateways & clients, S/MIME clients and well, everything.. And not just the implementations, if the corresponding standards were not crafted with 'swappable' cipher suites or hash functions in mind.

If I understand this right, this risk is a bit more severe with a PKI since the typical lifetime of a CA is 5 - 10 years , giving time to Moore's law. A collision attack against PKI using SHA or MD5 could involve attacker crafting two CSR's: one completely legal one and other with subject: CN=*, Basic constraints: cA=TRUE. Attacker would end up with a regular certificate that the CA signed and another one with a valid signature, but with arbitrary name and "useful properties". Of course the attack would be more complex than this (if even possible), but this is still worrysome stuff from implementor's perspective as well.

Bruce, you are referencing SHA-512, etc approved by NIST. What about Whirlpool algorithm recommended after more recent research by NESSIE for European Commission? Is there are any paper that compares SHA-xxx algorithms and Whirlpool?

"Bruce, you are referencing SHA-512, etc approved by NIST. What about Whirlpool algorithm recommended after more recent research by NESSIE for European Commission? Is there are any paper that compares SHA-xxx algorithms and Whirlpool?"

SHA-256 and SHA-512 are the easy solutions, because they're NIST standards like SHA-1 is. Whirlpool is certainly a good choice. And while I have not seen any direct comparisons of post-SHA hash functions, my guess is that we'll be seeing a bunch of them soon.

The link to my URL links to a tarball that has a working implementation of Whirlpool in C (the public domain reference implementation; any and all changes I have made are also public domain). Actually, it has implementations of both versions of Whirlpool.

Hmm. I am not a hardware techie, so bare with me here but - does Moore's law apply to FPGA machines? I know CPUs and their associated chipsets have been pushing that boundary for some time now - but I don't know of any comparable "arms race" that is forcing FPGAs to evolve as fast. How MUCH faster is a 2005 FPGA than a 1998 one?

The second question (which can be only answered after we see the paper of course) is - how much practical value is this result? From my memory of the previous breaks, they allowed an attacker to construct two messages which had the same hash - but the attacker had no control over what changes had to be made to the plaintext of the second message to make it hash out the same; is an attacker going to be able to make useful changes (eg, the value of a transaction, or the payment date, or whatever you are using a digital signature for) and create a plausable and usable second document from a first they may or may not get to specify?

David Howe: Moore's "Law" (it's less a law than it is a rough guide) states that the number of transistors on integrated circuits will double every 18 months. This does not necessary correlate to speed.

Compliance with the Law also varies from architecture to architecture and from year to year.

The "varying trailing nulls" attack might be practical for a malicious software developer. Mallory iterates on two versions of voting software -- one counts accurately, one has a back door -- and finds a collision. The accurate version is certified. The back door version is loaded into voting machines and shows the same hash as the certified one.

It seems like high-value systems -- where the value of an attack exceeds the computational cost of 2**69 -- might be vulnerable.

[Dave], FPGAs performance follow "Moore's Observation" (It is by no means a law in any conventional sense) the same as any other types of integrated circuits. In fact, FPGA vendors are very aggressive in their push to use the forefront of technology to stay ahead of the game. They are one of the first IC vendors that come out with devices in new technology nodes. However, as [Mat] remarked, it is no longer true that speed is a given when going down in geometry sizes as it was in the past.

It looks like the 2**69 number is for the existence problem: find two (any two) messages M0 and M1 so that SHA1(M0) = SHA1(M1). What's the number for the construtive problem: given a known message M0, find another message M1 so that SHA1(M0) = SHA1(M1)?

One should notice that there are hidden considerations in every discussion. Once getting awared to them your attitude usually changes and becomes more positively balanced. Please find the hidden consideration in your point of view. The common one is lack of understanding of the drivers that make people think differently then us.

It's too bad that a prominent commentator scared everyone unnecessarily with his inflammatory claim that "SHA-1 HAS BEEN BROKEN", without emphasizing the difficulty of applying the attack. Now we have to go back and try to calm everyone down.

"It looks like the 2**69 number is for the existence problem: find two (any two) messages M0 and M1 so that SHA1(M0) = SHA1(M1). What's the number for the construtive problem: given a known message M0, find another message M1 so that SHA1(M0) = SHA1(M1)?"

The first problem is finding collisions, and that's what the attack does: 2**69 work. The second problem is reversing the one-wayness, and that is not what the attack is about. SHA-1 brute force says 2**160, and the best attack is the Kelsey-Schneier attack (mentioned in the main post) at 2**106.

"It's too bad that a prominent commentator scared everyone unnecessarily with his inflammatory claim that 'SHA-1 HAS BEEN BROKEN', without emphasizing the difficulty of applying the attack. Now we have to go back and try to calm everyone down."

That kind of thing happens. I never talk about cryptographic anything being "broken" without stating the complexity of the attack, but not everyone is as diligent as I am.

The real problem is that there is a mathematical meaning to the term "broken," and cryptographers sometimes forget that they're talking to the general public and not members of their own tribe.

Bruce Schneier, could you possibly produce a shortened edition of Applied Cryptography in paparback that costs only about $30? $60 for a paperback is pretty steep.

Despite your relative calm about the rate at which SHA-1 will be broken in practice, I'm concerned about the legal validity of digital signatures. These are generally regarded as binding now, and many contracts last a long time. In many cases it would be worth breaking the digital signature on a contract in order to falsify the contract terms and reseal it, even five or ten years into the contract term. Do I have a valid concern here?

I think one of the causes of the misunderstanding is that Bruce lists two security properties in his post, whereas there are really at least six one could consider:

a) Given H and a set A of "well-formed" messages, find M in A such that SHA1(M)=H.
b) Given M1 and a set A of "well-formed" messages, find M2 in A such that SHA1(M2)=SHA1(M1).
c) Given a set A of "well-formed" messages, find M1 and M2 in A such that SHA1(M1)=SHA1(M2).
d) Given H, find any M such that SHA1(M)=H
e) Given M1, find any M2 such that SHA1(M2)=SHA1(M1)
f) Given nothing, find any M1 and M2 such that SHA1(M1)=SHA2(M2).

What the Chinese have demonstrated, according to Bruce, is that (f) is not as hard as it was previously believed. So what? say some readers; (f) is only of theoretical interest, and a real attacker would have to solve one of the presumably harder (a), (b), or (c).

One thing these people miss is that it is not sound *engineering* to consider only (a), (b), (c), because the set A is different for every application and for every mode of attack. Some A's may have some internal structure that makes an attack easier, so if (a), (b) and (c) were the only problems we could consider, we'd have to analyse the security of our hash function from first principles for *each* *individual* application of it. That would pretty much invalidate every advantage of having a common hash function as an off-the-shelf component, except for the very minor matter of reusing source code.

Another more subtle point is the (f) is the easiest of the six problems - any feasible solution to one of the others would immediately produce a feasible answer to (f). And much of the confidence that (a)-(e) is hard is based on the evidence that *even (f)* was so hard that nobody could find a better method than brute-forcing it. This premise has fallen away now, which forces Bruce and his community to reconsider which reasons they have to think that the harder problems are sufficiently hard.

It is conceivable that we are actually still safe, and that there exists a way of thinking about SHA-1 that allows an argument that the underlying ideas behind the Chinese break is actually only good for (f) and cannot possibly scale to the more immediately interesting (d) and (e). But the cryptographic community seems to believe that it would be *easier* to reestablish trust by finding aother hash function where (f) is still hard. And they are really the only ones who can have an informed belief about which option would have the best chance of success.

As for cryptographers defining "hard enough" as "so hard that no algorithm is appreciably cheaper than a brute-force search", it's really a tribute to their ingenuity that they've been able to define anything that even looked as if it would meet this standard, and I think the rest of us should approve of them trying to reach it for as long as they have any hope left. (Though something with a proven lower bound, even one significantly below brute-force level, would be better if anybody could figure out how to do that).

[Disclaimer: I have officially no idea what I'm talking about, but it soulds cool]

Why is the solution not to use two very different cryptographic hashes? It should be exponentially more difficult to find matching hashes from two separate algorithms, but only a constant work addition on the confirmation side. Heck, you could use rather insecure hashes and still have a near-unsolvable problems. Being little more than a dabbler in math, mabey ye learned people can enlighten me.

Regarding the comment that there are 6 security properties of hash functions instead of two: Keep in mind the idea, espoused in /Practical Crytography/, that our security primitives be black boxes, whose weaknesses we don't need to worry about when constructing systems around them.

From that perspective, it is unacceptable to have to worry about:

1) Whether our restrictions on message format reduce the message-space enough to eliminate attacks. That will likely require a statistical calculation particular to the strength of the attack and the space of legal messages, a calculation we _should not have to make_ to guarantee security. This eliminates the distinction between {abc} and {def}.

2) Whether the message being hashed is a secret or not. This assumption blurs the line between authentication (for which we should be able to use a hash function) and encryption (which should be completely orthogonal). That eliminates the distinction between {ad} and {be}.

Keeping that in mind, we are left with two interesting properties: essentially E and F. All the others are redundant. B ut the extra complexity isn't that interesting anyway: your line of reasoning is still fine. The fact that F is no longer "hard," for a cryptographer's definition of hard, means it's time to start looking for a new hash function, because 1) we're no longer _sure_ that E is hard, having been shown that F is not, and 2) we don't want to have to distinguish between "hash functions that are universal," and "hash functions that are only good for E but not F" for reasons of engineering.

Paul:
> Henning - I don't see much practical difference between (d) and (e) in your list.

(d) is the problem you have to solve if you know the hash of a password and want to get in. It is intrinsically harder than (e). For example, consider a hash function that strips off and forgets the first byte of its input and then computes a real, crytographically strong, hash of the rest. I can do (e) for this function with no effort at all, but (d) remains difficult.

> From my understanding of the previous reports, the break wasn't in the (f) slot but in the (e) slot. I could be wrong.

If the break had been for (e), the brute-force comparison would have been 2^160, not 2^80. The birthday paradox technique that allows you to halve the bit count only works when you get to select both of M1 and M2 yourself, i.e. problem (f).

Glenn:
I think we are in complete agreement and actually trying to make the same points. I listed the six different problems not because I think each is an interesting property to analyse for, but because some posters in the thead (or whatever it is called in blogspeak) seemed to be confused by the differences between them. In particular I wanted to *state* the questions in order to get a language in which I could try to explain why cryptographers care about the seemingly insignificant (f).

In fact, as long as (f) is technologically infeasible we don't need to worry about (e) either, because any feasible solution to (e) can trivially be turned into one for (f). Research that specifically aims at attacking (e) - such as the 2^106 attack mentioned in Bruce's original item - is useful because it is one possible strategy for breaking (e), but also because (e) is the second line of defense. Even if a feasible attack on (f) is demonstrated, if we still have some faith in (e) we'll have a little breathing space that we can use to switch to another hash function, because *most* applications of cryptographic hashes do not depend that much on (f). [I acknowledge that we'd still have to worry about a gajilion instances of (c) then, so it won't be apicnic]. On the other hand, if (e) gets cracked we'll basically have no guarantees left about _anything_.

"If you hashed 2^80 random messages, you'd find one pair that hashed to the same value."
Ah, but finding that pair would require storing 2^80 hashes in "memory" no?

The nature of keysearches is that they are cpu-bound: memory latency isn't a problem. I'd be somewhat surprised if that was the case this new attack. So it may not be "on the far edge of feasibility". Or it may: I'm not in a position to know.

Honestly, I prefer Twofish over AES, it simply looks better (from the naive point of view of a programmer like me).

So how about creating a hash function? Do you think it would be possible to base a hash function on Twofish, similar as Whirpool is based on a function very similar to AES? Any idea how good such a function would be?

I personally don't trust SHA-256 or higher either. I never liked SHA in the first place.

Just one thought, IF these three bods have broaken SHA-1 (and I see no reason why they have not) what other use could there technique be put to, as Bruce pointed out some hash functions have a lot of similarities when compared.

Therefore I think before I start walking, I would like to know the direction I am heading is away from the fire not towards it...

I am not a Cryptographer either myself but I have two questions regarding the use of Hash functions.

One of them has already been asked here and is "How hard would it really be to find, given a message M1, another message M2 that has the same hashes as M1 with two different hash algorithms? MD5 has been 'broken', so has SHA-1, but is it feasibly possible to find M1 and M2 so that MD5(M1)=MD5(M2) and SHA-1(M1)=SHA-1(M2)?"

The second question is, how hard would it be, given M1, to find M2 so that SHA-1(M1+MD5(M1))=SHA-1(M2+MD5(M2))? This second composition leads to a 160 bit hash and seems not to be vulnerable to either the MD5 or SHA-1 brokerage.

Once again I would love to be enlighted on these two options, not being a cryptographer very often makes me not see the obvious flaws.

> Ah, but finding that pair would require storing 2^80 hashes in "memory" no?
Er, make that messages, not hashes.

Tobias:
> In many cases it would be worth breaking the digital signature on a contract in order to falsify the contract terms and reseal it, even five or ten years into the contract term. Do I have a valid concern here?
Yes -- but. Such a forgery leaves evidence; namely the original contract with an identical hash. If you had the original contract you could produce it in court. What happens then is a question for lawyers. But there's also a flip side: if signatures are too easily forgable, they may cease to mean very much and other parties may have trouble enforcing countracts you have digitally signed.

I'd like to reply to Terry who said that you just need to ensure there aren't any nulls at the end of the message. Now think of a legal document: they're usually quite long and full of big words, not the trivial example Bruce presented. Any other useful document is usually quite long and nontrivial, by the way. Once you have a feasible, better-than-brute-force attack on the hash, you will probably be able to find the hash you're looking for by changing a few letters here and there and it will look like an innocent typo. Nobody is going to get suspicious for a typo in a real-life document -- usually there are some already. If that's a concern, you could even spell-check the doctored text and use the corrections as your starting point. :)

(I am not a cryptanalist. Bruce, if I said something stupid, feel free to point it out.)

"Why is the solution not to use two very different cryptographic hashes?"

Correct me if I'm wrong, but wouldn't that be equivalent to using a stronger hash with roughly as many bits as the other two combined? And you have the added advantage that you can choose a not-already-broken hash.

In other words, how does the cost of computing a hash vary with the hash length?

Earlier there were some discussions of adding null characters to files and other various tricks. I've got two ways that's easier than most people seem to think.

Firstly, you could substitute a different format of file - maybe I could use a WordPerfect file instead of a text file. People send all sorts of file formats, why does the replacement have to be the same?

Secondly - an idea inspired by the first - many complicated file formats are capable of nearly arbitrary manipulation while retaining their original contents using traditional viewers.

What I mean is - Microsoft Word documents contain undo metadata, and all sorts of other metadata which is not obviously viewable, and which does not need to match in any way at all the original document. Many versions of Word even used sparse files - not writing at all in certain sections. You could hack a file and put whatever you wanted into the sparse sections.

Another example is something like a video file. QuickTime, say, lets you have multiple streams of many different formats. You could have a video, audio, and "other" stream, the "other" stream being meaningless data for hash function manipulation.

Final example is TIFF. TIFF lets you encode many, many different types of data. Microsoft TabletPCs, I've been told, can use TIFFs to transfer an image of your handwriting - with an extra layer containing the vector stroke data.

How many file formats simply ignore garbage appended to the end? Xmodem in some (all?) variants, I believe, would "round" transferred files to the block size you used back in the BBS modem days.

While it would be hard to use a broken hash to exploit short pure ASCII messages, I think it's very clear that more complicate file formats are wide open, and in many cases nearly trivial to modify without making it apparent.

Even if you do as a developer protect the file format in your application, in the real business world, chances are somebody could encode the same document in an easily modifiable format without arousing suspicion.

A minor error in the article: "If you hashed 2^80 random messages, you'd find one pair that hashed to the same value."

No, if you had one message you wished to match, you'd find (on average) one match for it in 2^80 random messages.

This is the "Birthday problem": "How many people do you need so that there is a 50% chance that a pair of them share the same birthday?" The answer is 23. It is a different question from "How many people do you need so that there is at least a 50% chance that one of them shares *my* birthday", for which the answer is >183. (365 times whatever the mean of a Poisson distribution has to be for P(0)

Sorry, I hadn't read all the comments (I did skim them before posting, but missed the relevant bits.) It looks like this has already been accounted for in the difference between 2^160 and 2^80. (I.e. there are 2^160 possible 'birthdays', so we need 2^80 people to find a single pair that share a birthday.)

Ralph Loader,
That's clever all right; thanks for the explanations. And I can see how to extend it along the lines of Mr. Schneier's example:
Instead of SHA-1, use SHA-1':
SHA-1'(x) = SHA-1("message-id: " + x + "\n\nBuy " + x[0] + "000 shares.")

Since the method you described produces a sequence of collisions, one could pick the desired collision from the sequence so long as the difference between the two messages is quite small.

I guess we'll have to wait to see what sorts of attacks the new technique allows. But I take Bruce Schneier's point: even if the current attack isn't exploitable, it is a starting point for developing better attacks.

I will accept the engineering advice and will end up using whatever turns out to be outside the exits that cryptographers are now walking towards...

But it still seems to me - as a lawyer not a cryptographer or engineer - to be a slow walk. By applying 2**69 computing power (possibly years of work or many parallel computers), one can get *one* random message that will match the one you start with. But how much computing power and time is needed to get a *useful* message that matches the one you start with?

Bruce mentions the "zillions" of messages (he's surely allowed to be imprecise when he chooses to be!) to sort through to find a "sell 1000 shares" message to substitute for a "buy 1000 shares" message. But if I have a digitally signed contract, how long will it take someone to come up with an apparently identically signed contract that has only a meaningful change and otherwise appears to be credible (even allowing for a few typos and invisible metadata)?

Surely a VERY long time.

It will be a while as well before there are tools available to apply to existing digitally signed documents to create useful (to the bad sort of people) variants that will appear legitimate.

And the signature and the encryption and the hashing all produce SOME evidence, and not usually the best.

I recognize that there are lots of other uses of hashes, as Bruce says in his original article, and they need to work reliably for many systems to work. That's why I will expect the engineers and cryptgraphers to change hashes, but why I would not advise my clients to panic, or even to settle ...

A few people seem to be bothered by the fact that the SHA-1 paper isn't in print yet. Academic papers take a long time to get published. As an example, the submission deadline for Crypto 2005 was February 14 and the conference is in August.

I wrote:
> Since the method you described produces a sequence of collisions,
Well, I'm embarassed; that's just not true. Still the difference between '1' and '5' in ascii is just one bit so if you're a bit lucky, one collision is all you need.

John Gregory wrote:
> But if I have a digitally signed contract, how long will it take someone to come up with an apparently identically signed contract that has only a meaningful change and otherwise appears to be credible (even allowing for a few typos and invisible metadata)?
That's not what this current attack does, rather it is a way of finding two messages with the same hash (If one is signed, the signature is valid for the other as well). For the brute-force attack, producing two contracts where one has a '5' in place of a '1' in the other (and invisible metadata changed) is easy enough. For the new attack, well we just don't know.

I'd like to respond to John Gregory's and Andrew Wade's message by drawing an analogy between the need for competent software engineers as mentioned by "Terry" and Bruce Schneier and the need for competent lawyers applying cryptographic procedures: Simply applying existing techniques without thinking thoroughly about the consequences can get you in deep trouble, especially if one element you were relying on appears broken, which now is the case with SHA-1. I'm convinced that the risks in using digital signatures for contracts that are based on SHA-1 now that SHA-1 has been broken can be reduced to an acceptable level (risk=0 does not exist) by applying additional measures that are thoroughly thought out.

Bruce Schneier pointed out in several of his comment messages that SHA-1 being broken is bad, and he's in a much better position than I am to say so. That being said, one should not forget that in many fields hashing is not the only element that is of importance, so the specialists may often be able to reduce the now-increased risk again by applying other measures. However, cryptology cannot be expected to take all fields of application of a hash algorithm into account, so it needs one that's not broken.

Finding two messages with the same hash in 2**69 computations is impressive to the cryptographer, but not to the person interested in a transaction.

If I have a short message (and a lot of contracts are not short) saying something like "I accept your offer to buy 1000 widgets for $100 each for delivery on March 1 in Chicago" - then there is no point in doing 2**69 computations to get a message that has the same hash but which reads "Home, home on the range, where the deer and the antelope play", or (more likely) "dowinr02asvi0nvewt0svj0wr2309fj jdjr032493jr0d9j()D9JR)#(jFDj ".

It would be almost inconceivable to find a message that shared a hash with another message that was identical to it except for a significant credible variation.

And how long does the bad guy/good cryptographer have to come up with the variation? In my little example, if it's not by March 1, forget it.

Presumably once the hash is broken, it will be further broken, i.e. the computations required to find a matching hash will fall from 2**69 to something less over time - but it's still going to be a challenge to find the significant credible variation.

Besides, there is usually lots of evidence about what a transaction is or was, besides the document or the signature. (I have long thought that high-value e-commerce between strangers was unlikely, without a trusted intermediary, not a CA but a financial institution.) So a court would usually be able to figure out which variant of a document was credible, just as it does now for paper documents that vary when they are not supposed to.

So it's going to be a very slow walk to the exits for the lawyers. We'll adopt what the engineers and cryptographers are selling long before we need it, because they need it more and will change the market accordingly.

As John Gregory pointed out for a lot of things there is a timleyness required for the search.

There are three points to consider,

The first, is some contracts (ie your home mortgage for instance) have a very long time to run, often the important thing in long contracts are the termination clauses (think leasess and in the UK the PFI contracts).

The second, is usually any new attack leads to insite into other attacks (The history of FEAL for instance) as noted above the NSA have a saying about attackes getting better not worse.

The Third and perhaps the most important, is perception, currently it is very difficult to explain digital signitures and crypto attacks to involved parties such as engineers, think what problems our legislators are going to have. Any successfull attack on any crypto primative is going to erode confidence and delay any eventual acceptance of digital contracts etc.

So if this IS THE summary, then I assume the page 3 is the result of the attack? It seems to me that nowhere on the paper it showed and proved the full ***80-step SHA-1*** was cracked; the page 3 summary showed a ***58-step SHA-1*** attack. Can you even call that SHA-1?

Maybe I'm stupid, but page 1 seems to have lots of vague and misleading clauses to me.

"Our analysis shows that collisions of SHA1 can be found with complexity less than 2^69 hash operations. This is the first attack on the full 80-step SHA1 with complexity less than the 2^80 theoretical bound."

This makes you think, oh SHA-1 (the real 80-step one) collision was found. And next sentence it says

"Based on our ***ESTIMATION?!**, we expect that real collisions of SHA1 reduced to 70-steps can be found using today's supercomputers."

And only in the final paragraph it mentioned

"We also ***implemented*** the attack on SHA1 reduced to 58 steps and found real collisions with less than 2^33 hash operations."

So it seems to me the 2^69 number was an ***estimated result" based on the 58-step attack??

The whole hell raising alarm about the "broken SHA-1" was based on this (3 page) paper or is there something else?

John, it's not necessary to find a meaningful message that collides
with a similar meaningful message to do damage. In many transactions it's sufficent
to have a random message and meaningful message collide.

Say I email you:

'John I've some business I'd like to discuss with you but I'd like
to be sure that it's you I'm dealing with. Can you please digitally sign this
random string and send it back to me: "90823jkasdc9ujasdcv zai".
The reason I'm asking you to sign something random is so that someone
can't grab a previously signed "Yes, it's me message" out of their mail
from you and forward it to me.'

On the surface that's a reasonable request to identify yourself with a
protocol that avoids replay attacks.

You then do as I ask unknowing that I have prepared a message:

"I agree to pay Ian 1,000,000 Guineas Sterling for a short and
painful lesson in cryptographic protocols."

which has a hash collision with the random string I sent you. I substitute
that string for the random one in your signed message and then instruct
my own lawyer to recover the 1,000,000 guineas from you. I chose the message
then I found a random collision for it and presented it so that a random
string looked reasonable to you.

I'm not a lawyer, but I don't think that random message thing can even fool me.

First of all, if I'd even consider signing something like that, it'd be for the entire message string (from "John..." to "...zai"). What is the chance the signatures will match? There are enough this kind of cheap pranks around; I think I'm educated enough thank you.

Even if somebody is dumb enough to sign something like that; I would think it'd be from a trusted friend. And if just by chance, the poor guy actually kept a copy of the thing signed and produced the evidence in court, what's going to happen to you?

Thirdly, the collision case found so far is only a 58th-step ***intermediate hash value***, the SHA-1 algorithm wasn't even finished yet. There is really no proof of the 2^69 number yet at this time; or if a collision can really be found in 2^69. And even if that number actually works, how long it's gonna take you to find the random string?

It's an exemplar to demonstrate how a random string can be useful, as opposed to a pair of choosen plaintexts.

Substitute 'program' for 'John' and 'nonce' for random string and this hits a number of common cryptographic protocols.

That's why collisions matter and that's why it matters finding them in less than brute force order. If the probability of a failure, and the associated risk assessment, is based on the brute force order then reducing the attack difficulty by a factor of 10^(80-69) = 10^11 may be significant.

If your original risk assessment said "We'll take a 1 in one billion risk" (very acceptable for most commercial ventures) you've just been holed below the waterline.

Seems like everybody is waiting for somebody else to do the work.. well, I guess curiosity kills the cat..

Here is what I found with the messages. The 58th-step hashes do match for the two messages, but the number I got is different from the 'h1' in the paper. I don't know what kind of SHA-1 was implemented for the paper, but I'm quite sure my SHA-1 module is fine. The NIST test case of ASCII "abc" was put through the module first and produced correct results.

I am wondering how this can be applied to highly structured messages such as validated XML. Example: assuming you manipulate an HTML document, add an appendix after the closing html-tag, no browser will worry about displaying the gibberish but will only display the manipulated page. The security mechanisms will tell the user everything is ok, since they consider the whole document.

Now if you take a validated XML document and an XML Signature (according to W3C specs) from all I know this is to ignore extra whitespace and obviously any gibberish not obeying the XML structure will render the document invalid.

Does this now justify the conclusion that highly structured content such as validated XML is more robust against this kind of collision attacks?

What about composing a message M1 and a large random string C1 and transmitting the following data:
M1, C1, SHA1(M1), SHA1(C1), SHA1(M1+C1)

Try to crack this by making a forged "valid" message like:

M2, C2, SHA1(M2), SHA1(C2), SHA1(M2+C2)

This is the same as solving a problem (e) for M and solving problem (f) for C, but at the same time
try to make the SHA1 of the sum of M+C in the forged situation equal to the SHA1 of the original M+C.
The last SHA1 causes that the common results of the full result set of the first, second and third problem
will be a very meager collection of possibilities, although a brute force attack might result in a solution.

If such a combination of criteria makes it difficult to find a solution, the combination can be expanded with more random data strings and SHA1's and more SHA1-ed additions.

I'm not a cryptographer but does this make the forging attempts much more difficult?

This is a situation of coupled equations.
The important thing is not that simple SHA1s are transmitted but that an RSA encrypted version (using private key) of the SHA1s is transmitted: digital signatures.
Therefore the SHA1 values cannot simply be replaced by other values.

To achieve this the following system of coupled equations need to be solved:

1) SHA1(M1)=SHA1(M2)
2) SHA1(C1)=SHA1(C2)
3) SHA1(M1+C1)=SHA1(M2+C2)

If equation 1 can be solved the same effort or more is needed to solve equation 2.
The difficulty is in the third equation that couples the equations 1 and 2.

If finding collisions for equation 1 and 2 is easy it is still difficult to solve equation 3. Assume the number of possible collisions equal to 2^80 for a 160 bits SHA1. To solve equation 3 all possible solutions of equation 1 (2^80) and equation 2 (also 2^80) need to be combined and tried out. That is the same amount of possibilities ( 2^80 * 2^80 = 2^160 ) as a brute force attack should use to solve equation 1.

If there will be developed advanced algorithms that break SHA1 in 2^50 calculations these algorithms will not give all 2^80 solutions but only one or a few.

If applying such an algorithm to equation 1 and 2 of the simple case described above this needs 2^51 calculations.
The combination of solutions used to solve equation 3 has a chance of 1 / 2^80 of being correct.

If a very advanced algorithm (feeded with a parameter to obtain a different solution) only requires 1 calculation to result in a solution for equation 1, still all 2^80 * 2^80 possibilities need to be tried to find the correct solution for equation 3.

Its now almost 3 years later. Has anything happened on the topic of replacing SHA-1? I have been reviewing a design document written in 2004 with references to the use of SHA-1. I was just wondering if there is an alternative to using SHA-1. Or are SHA-224, SHA-256 etc. the only real alternatives?

There are several widely published and reviewed cryptographic hash functions which have no known effective attacks. Of these, the oldest and most thoroughly studied is probably RIPEMD-160, which was published 13 years ago and has been adopted as a standard by several international organisations. Another good example in this class is Whirlpool, which has a much larger output than RIPEMD-160 and has also been studied for a fair while now, although not as long.

There are many more which are currently unbroken, but for which there may be a little concern about their long term survival. This includes Tiger and the whole SHA family later than SHA-1. I would be happy to use any of these at present but would be wary of embedding them into a long-lived system that was difficult to upgrade.

Assume the user enters DENVER BRONCOS which I hash once. Then I hash the hash and repeat that 9 more times.

Am I correct that it will take 9 times 2**80 to find a comparable string that equals the same hash?

If not, why not?

If I am correct then all a programmer has to do is to repeat the hasting for multiple times to gain the security desired, then whats the big problem?

For use in passwords, where the user enters a plaintext pass, I store the hash of the input and do the same hashing when the user signs in, and compare it against the hashed value in the database.
(a common approach).

Since the program does the hashing, it is no burden to the user, other than a little more time to wait for confirmation.

If the above is true, what is the big concern about the supposed weakness of SHA-1?

The solution appears to be trivial to implement to whatever level of security one desires, ie, just do it multiple times.

One problem that seems to be overlooked is the question: HOW DO YOU KNOW YOU'VE DECODED THE STRING?

Apparently the unstated assumption is that everyone is using English. The 2nd unstated assumption is that the first time you get clear test that makes sense, you've successfully decoded the string. Save I get a decode of NICE DAY when I was trying to decode the input string of DENVER BRONCOS. How do I know that NICE DAY isn't the clear text input?

Are we getting so technically focused that common sense has been lost?

Decrypting anything is a real beast of a problem. Even if I had a bank of super computers with hugh amounts of cpu power, actually decrypting anything encrypted with any thing that does not simply repeat the same letter for the same input is not a trivial pursuit.

Just to run a decrypting program thru a big bank of super computers or their equivalent in parallel desk top computers would cost about $10,000 a hour and it could take days, months or years to get something you think is clear text.

By that time, the information is probably stale, ie, no longer useful. Battlefield conditions, for example, aren't much use two weeks later, to anyone. How about your competitor's sales reports 5 years ago? Is that worth a million dollars to get?

So let's put a does of common sense into all of this discussion. Or perhaps the information should be entitled: FOR CRYPTO GEEKS ONLY!

Since the length of the password is a major contributor to the time required to reverse engineer a hash or an encryption, why shouldn't I programmatically increase the size of the user input password to, say 256 characters in length, padding to the end from a fixed string (so I can decrypt it)? Wouldn't that make the translation much more time consuming, than just say DENVER BRONCOS which is only 14 characters long?

So i dare since this is an old post and hey, what better than to be proven wrong or foolish. Here's some guess work.

Would detecting collisions benefit from next to using a specific length plain-text ( which never changes ) also using "random as in using Pi to an equal length of the plaintext as plaintext " or "controlled pseud-random sequences as plaintext" ?

My guess is this may "reveal" the plausibility of collisions by comparing sufficiënt sets of data and filtering out any "patterns" which could not but be explained as by the algorithm(s) themselve.