Assuming $H$ is a hash function, the following function $H'$ should — to my understanding — also be a hash function:

$$H'(m) = m_0 || H(m_1||m_2||\dots||m_k)$$

where $m_i$ is the $i$th byte of $m$.
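To make the construction concrete, here is a minimal Python sketch of $H'$. It assumes SHA-256 stands in for $H$; the name `h_prime` and the `leak` parameter are hypothetical and only there for illustration.

```python
import hashlib


def h_prime(m: bytes, leak: int = 1) -> bytes:
    """Leak the first `leak` bytes of m, then append H(rest of m).

    Assumption: H is SHA-256. Any other hash would do for the sketch.
    """
    if len(m) < leak:
        raise ValueError("message is shorter than the leaked prefix")
    return m[:leak] + hashlib.sha256(m[leak:]).digest()


digest = h_prime(b"hello world")
```

With `leak=1` this is exactly the formula above: the output starts with $m_0$ in the clear, followed by the hash of the remaining bytes.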

$H'$ leaks the first byte of $m$, but even with one byte leaked, you still can't find a pre-image or any collisions.

When looking at HMAC, Wikipedia says that it takes a hash function (without making further assumptions on the hash function). If I use my hash function $H'$ for HMAC (letting it leak $\operatorname{len}(key)$ bytes instead of one, but not the long message), HMAC would be insecure, since everybody now sees the key.

So maybe my $H'$ is not a cryptographic hash function after all — in which case my question is: Why not?

Or $H'$ is a cryptographic hash function, but building an HMAC requires additional assumptions. What additional assumptions must be fulfilled by the hash function to be secure in HMAC? Do the three properties above imply some other properties I don't see?

Also, PBKDF2 takes a PRF, for which HMAC-SHA256 should be secure. But HMAC built from a hash function that merely has the three properties won't be a PRF, to my understanding. Again the same questions: Are there more assumptions? Even more than on HMAC?

This question came from our site for information security professionals.

> In order for a hash function with n bits of output to be collision resistant, it must take at least $2^n$ work and storage to find a collision

Don't you just need $2^{\frac{n}{2}}$ due to the birthday attack to find a collision?
– So Morr, Jun 4 '13 at 15:02

In order for a hash function with $n$ bits of output to be collision resistant, it must take at least $2^{n/2}$ work and storage to find a collision. Presuming your hash function $H$ is collision resistant, it is not obvious $H'$ also is collision resistant. Finding two different messages that both share the same prefix, and collide in the last $n-8$ bits, requires less work and less storage than finding a collision in all $n$ bits of $H$.
– Henrick Hellström, Jun 4 '13 at 15:44

2 Answers

Before we jump into this question, you first need to know a bit about the internals of hash functions with the Merkle-Damgård construction. Here's a pretty picture from Wikipedia:

In this diagram, you see the compression function $f$ being fed the message blocks along with the output of the state of the previous compression block (or the IV). The final output is the result of the last compression function. (You can ignore the finalization step for our purposes.)
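For readers without the diagram at hand, the iteration it depicts can be sketched roughly as follows. This is a toy, not any real hash: the compression function `compress`, the all-zero `IV`, and the padding helper `md_pad` are assumptions chosen only to show the chaining structure.

```python
import hashlib
import struct

BLOCK = 64       # block size in bytes (assumed for this toy)
IV = bytes(32)   # all-zero initial state (assumed, not a real hash's IV)


def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function f: (32-byte state, 64-byte block) -> 32-byte state."""
    return hashlib.sha256(state + block).digest()


def md_pad(msg: bytes) -> bytes:
    """Length padding: a 0x80 byte, zero bytes, then the 64-bit message bit length."""
    zeros = (-(len(msg) + 1 + 8)) % BLOCK
    return msg + b"\x80" + bytes(zeros) + struct.pack(">Q", 8 * len(msg))


def md_hash(msg: bytes, iv: bytes = IV) -> bytes:
    """Feed each block, together with the previous state, into f."""
    state = iv
    padded = md_pad(msg)
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    # The output is nothing but the state after the last compression call.
    return state
```

The point to take away is the last line of `md_hash`: the digest is solely the final chaining state, which is the property the rest of this answer leans on.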

> In the rest of this paper we will concentrate on iterated hash functions, except if stated otherwise.

This should be the first clue that your scheme is not one that will work with NMAC/HMAC: it's not iterated! Not all of it, at least. The fact that the first $n$ bytes are concatenated (leaked, what have you) means that your hash function's output is no longer solely the result from the compression function evaluated on the last block. This changes the construction of the scheme drastically.

In regular circumstances, the (almost implicit) assumption that the underlying hash function is iterated is not an unreasonable one at all: all of the popular hash functions of today are. (SHA3/Keccak is a bit of a special case. It's not clear one even needs the HMAC construction for it. But that's a topic for another question.)

For example, what do you do with the IV (initialization vector, or as Bellare et al. call it, "initial variable")? Do you simply pass it along to the $H$ in $H'$? If so, then your scheme doesn't actually leak the key with NMAC, although it does leak $k \oplus \mathtt{opad}$ in HMAC. In case you're unfamiliar with NMAC, the basic idea of the scheme is to replace the IVs of the regular hash functions with the keys $k = (k_1, k_2)$. In the case of HMAC, the "new IVs" are (where $f$ is the compression function for the hash in question) $k_1 = f(k \oplus \mathtt{opad})$ and $k_2 = f(k \oplus \mathtt{ipad})$. But note that this, too, carries with it the implicit assumption that the starting state of the next $f$ in the chain is the previous evaluation of $f$.
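The HMAC leak is easy to see if you plug a leaky hash into a textbook HMAC. The sketch below assumes SHA-256 as $H$, assumes $H'$ leaks one full 64-byte block (so the whole padded key is exposed), and uses hypothetical helper names (`hmac_generic`, `recover_key`); it is an illustration, not a statement about how any real library behaves.

```python
import hashlib
import hmac

BLOCK = 64  # SHA-256 block size in bytes


def sha256(m: bytes) -> bytes:
    return hashlib.sha256(m).digest()


def h_prime(m: bytes, leak: int = BLOCK) -> bytes:
    """Leaky hash: the first `leak` bytes in the clear, then SHA-256 of the rest."""
    return m[:leak] + sha256(m[leak:])


def hmac_generic(hash_fn, key: bytes, msg: bytes) -> bytes:
    """Textbook HMAC over an arbitrary hash_fn (keys longer than a block not handled)."""
    if len(key) > BLOCK:
        raise ValueError("sketch only supports keys up to one block")
    key = key.ljust(BLOCK, b"\x00")
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    return hash_fn(opad + hash_fn(ipad + msg))


def recover_key(tag: bytes) -> bytes:
    """The tag starts with k XOR opad, so XORing with 0x5C again gives the padded key."""
    return bytes(b ^ 0x5C for b in tag[:BLOCK])


key, msg = b"secret key", b"some long message"

# Sanity check: with plain SHA-256 the sketch agrees with the standard library.
assert hmac_generic(sha256, key, msg) == hmac.new(key, msg, hashlib.sha256).digest()

tag = hmac_generic(h_prime, key, msg)
leaked = recover_key(tag).rstrip(b"\x00")
```

Here `leaked` comes straight out of the public tag, with no knowledge of `key` used in `recover_key`.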

Trying to discuss $H'$ in the context of HMAC is difficult, though. The primary issue is that your $H'$ doesn't have a clearly available compression function. Sure, $H$ (probably) has a compression function, but for $H'$, things are much less certain. Even if you attempted to define a compression function for $H'$, in order to be an iterated hash function, it would somehow have to leak the first $n$ bytes of the original message while simultaneously evaluating $H$ for the rest of the message.

Here is the problem, though: even if you created such a compression function, it would be insecure, naturally. Namely, it no longer would act as a pseudorandom function (PRF) (or for Wikipedia's less thorough, but perhaps easier, explanation, see here). In this relatively new paper, Bellare proves that HMAC is a PRF itself if the compression function of the underlying hash is a PRF. If $H'$ had a compression function, then it definitely would not be a PRF, since it (quite literally) leaks part of the input.

The prerequisite that the compression function is a PRF is quite a weak requirement, too. Given that $H'$ (even if you did somehow come up with a compression function for it) simply fails this requirement, the security proofs for HMAC do not cover it. Further, pretty much all of the security proofs given in favor of HMAC assume that the attacker doesn't know the key. Those proofs are possibly invalid if this assumption is invalid.

So, to answer your question directly, HMAC requires an iterated hash scheme whose compression function is, as best as we can tell, a PRF. A generic cryptographic hash function, at least using the definition Wikipedia gives, is not strong enough to guarantee a solid MAC. But the PRF requirement is relatively weak, as even MD5 (which is completely broken as far as collisions go) still appears to satisfy it.

Pre-image resistance means that given $h$, it's difficult to find $m$ such that $h = H(m)$. Intuitively speaking, your only chance is to have started with an $h$ that is in the relatively small set of already-computed hashes.

Now suppose you have $h'$ and are looking for $m'$ such that $h' = H'(m')$. You have a better chance of finding $m'$ than you had of finding $m$, because if you had previously computed $H'(a||m'') = a||H(m'')$ such that $h' = b||H(m'')$, you can take $m' = b||m''$. With an alphabet of size $n$, $H'$ gives you $n$ times better chances to find a pre-image.

This is perhaps more visible if you think of $H'$ consisting of, say, the first 128 bits of the message followed by a 128-bit hash of the rest. Then $H'$ hashes are 256 bits long, but the pre-image resistance of $H'$ is no more than that of a 128-bit hash: half the bit strength one would expect from the output length.
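The table-reuse argument can be tried out on a deliberately tiny hash. In this sketch $H$ is assumed to be SHA-256 truncated to 16 bits, purely so that the example runs instantly; `table` and `preimage_from_table` are hypothetical names.

```python
import hashlib


def h(m: bytes) -> bytes:
    """Toy 16-bit hash (assumption: SHA-256 truncated to 2 bytes)."""
    return hashlib.sha256(m).digest()[:2]


def h_prime(m: bytes) -> bytes:
    """H'(m) = m_0 || H(m_1 || ... || m_k)."""
    return m[:1] + h(m[1:])


# Hashes computed earlier, all for messages that start with the byte b"a".
table = {}
for i in range(1000):
    tail = str(i).encode()
    table[h_prime(b"a" + tail)[1:]] = tail  # index by the H(tail) part only


def preimage_from_table(target: bytes):
    """Reuse the table for ANY leaked first byte: only the hash part must match."""
    tail = table.get(target[1:])
    return None if tail is None else target[:1] + tail


target = h_prime(b"b" + b"123")   # a digest whose first byte was never tabulated
found = preimage_from_table(target)
```

Even though every tabulated message started with `a`, the table yields a pre-image for a digest starting with `b`, because the leaked byte can simply be swapped in.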

A natural question at this point is, what if you define $H''(m) = m_0 || H(m)$? (That is, calculate the same hash, but leak a fixed-size prefix of the message.)
You do not get a better chance of finding a pre-computed pre-image. However, the amount of work required to find a pre-image by brute-force is clearly less, because you can concentrate your efforts on messages with the prefix $m_0$. This is actually closer to the mathematical definition of difficulty (computational complexity of finding a pre-image) than the informal explanation above.
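A rough sketch of that reduced brute-force search, again on a toy scale: $H$ is assumed to be SHA-256 truncated to 32 bits and messages are assumed to be exactly two bytes long, so the full search space is $256^2$ but the leaked byte cuts it to 256 candidates. The names `h2` and `brute_force` are hypothetical.

```python
import hashlib
from itertools import product


def h(m: bytes) -> bytes:
    """Toy 32-bit hash (assumption: SHA-256 truncated to 4 bytes)."""
    return hashlib.sha256(m).digest()[:4]


def h2(m: bytes) -> bytes:
    """H''(m) = m_0 || H(m): hash the whole message, but leak its first byte."""
    return m[:1] + h(m)


def brute_force(target: bytes, length: int = 2):
    """Search only messages of the given length that start with the leaked byte."""
    prefix = target[:1]
    tries = 0
    for rest in product(range(256), repeat=length - 1):
        tries += 1
        candidate = prefix + bytes(rest)
        if h2(candidate) == target:
            return candidate, tries
    return None, tries


target = h2(b"\x42\x99")
found, tries = brute_force(target)
```

Without the leaked byte the attacker would have to walk through up to 65536 two-byte messages; with it, `tries` never exceeds 256.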

I'm not fully satisfied with this explanation. What about $H^\circ(m) = \mathbf{0} || H(m)$ where $\mathbf{0}$ is some constant string? This is obviously as good a hash as $H$ since it doesn't leak anything. Yet the amount of work required to find a pre-image is only as good as $H$, despite the longer hash, which is no different from the complaint against $H'$ above.

In any case, as far as I know, the Wikipedia article is a simplification: there is no general result that any hash function can be used to build an HMAC in this way. The security proofs of HMAC only apply to hash functions of a certain form, which includes all Merkle-Damgård hash functions, but not the oddball variants considered in this thread.

The last part of this answer is the crucial bit: HMAC's security proof relies on iterated hash functions, specifically MD constructions. The construction in the question fails this criterion.
– Reid, Jun 4 '13 at 17:07

@Reid I don't find this fully satisfactory: sure, the proof doesn't apply, but what causes the result not to hold?
– Gilles, Jun 4 '13 at 18:20

@Gilles: The structure of the above construction is entirely different from usual hash schemes. How would you define the compression function for $H'$? And note that in order for NMAC's security proof to be relevant, the scheme needs to be iterated, so you somehow have to define the compression function in such a way that the final state will include the first $n$ bytes of the message. Of course, the compression function needs to be a PRF, but leaking even the first byte of the message sounds like a death knell for that idea.
– Reid, Jun 4 '13 at 19:55

@Reid Please write an answer that expands on this! I'd be very interested.
– Gilles, Jun 4 '13 at 20:01