The reduction and the freeze seem to indicate that this is some kind of residue number system, but I can't find an explicit description of how that would work.

One of Daniel Bernstein's motivations when writing this was probably to have a constant-time implementation, if that helps identify the algorithm.

Does anyone recognize the algorithm being used here, and know of/can provide a complete explanation of it (i.e. one that explains how the magic numbers 320 and minusp are derived, and the basis for all xor operations in the freeze)?

I edited the declarations back in - they're important to know exactly what's going on.
– orlp, Jul 14 '13 at 20:34

It's funny that this implementation is used so often, given its bad performance.
– CodesInChaos, Jul 14 '13 at 21:45

The sheer horror of the 53 floating-point implementation (200+ lines of variable declarations, 1600 lines overall), and the fact that it doesn't work on systems with a 52-bit mantissa, may have something to do with the fondness for the ref implementation. I'd be keen to hear of better portable approaches if you know of them (i.e. ones that will work in Java etc.).
– archie, Jul 14 '13 at 21:58

1 Answer

Numbers are represented in base 256 with 17 digits, i.e. $h = \sum_{i=0}^{16} h_i \cdot 256^i$. Since the digits are stored in ints, which are significantly wider than a byte, you don't need to propagate carries immediately.

If you forget about modular reduction, then the $i$th digit of the result is computed as $\sum_{j=0}^i h_j\cdot r_{i-j}$. Apart from the lack of carry this is pretty much the same as schoolbook multiplication.

To take modular reduction into account, use the fact that $p = 2^{130}-5$, so $2^{130} \equiv 5 \pmod p$. You take the high bits of the result, starting with bit 130, shift them to the right by 130 bits, multiply them by 5 and add that to the low 130 bits of the unreduced result. Shifting to the right by 130 bits can be done by shifting to the right by 17 bytes (136 bits) and then shifting 6 bits to the left. Shifting 17 bytes to the right is done using index arithmetic in r[i + 17 - j]. Shifting 6 bits to the left is equivalent to multiplying by 64. Combining that with the multiplication by 5 gives you a multiplication by 320.
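Written out as congruences (assuming $p = 2^{130}-5$ and 17 base-256 digits indexed 0 to 16), the constant and the resulting digit formula look like this:

$$256^{17} = 2^{136} = 2^{6} \cdot 2^{130} \equiv 64 \cdot 5 = 320 \pmod p$$

$$(hr)_i = \sum_{j=0}^{i} h_j \cdot r_{i-j} \;+\; 320 \sum_{j=i+1}^{16} h_j \cdot r_{i+17-j}$$

A product $h_j \cdot r_k$ with $j + k = i + 17$ has weight $256^{i+17} \equiv 320 \cdot 256^{i}$, which is why it lands in digit $i$ with the factor 320.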

So the first inner loop is the contribution of those digits that didn't overflow, and the second inner loop the contribution of the overflow.
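A minimal sketch of those two inner loops, assuming 17 base-256 digits held in unsigned ints. The function name `mul_nocarry` is made up for illustration, and the carry propagation that follows each multiplication is left out on purpose, so the output digits can be larger than 255.

```c
/* Sketch: multiply h by r modulo 2^130 - 5 without propagating carries.
   h and r each hold 17 base-256 digits, least significant first.
   mul_nocarry is a hypothetical name used only for this example. */
static void mul_nocarry(unsigned int h[17], const unsigned int r[17])
{
    unsigned int hr[17];
    unsigned int i, j, u;

    for (i = 0; i < 17; ++i) {
        u = 0;
        /* Digit pairs with j + (i - j) = i: weight 256^i, no overflow. */
        for (j = 0; j <= i; ++j)
            u += h[j] * r[i - j];
        /* Digit pairs with j + (i + 17 - j) = i + 17: weight 256^(i+17),
           folded back into digit i using 256^17 = 320 (mod p). */
        for (j = i + 1; j < 17; ++j)
            u += 320 * h[j] * r[i + 17 - j];
        hr[i] = u;
    }
    for (i = 0; i < 17; ++i)
        h[i] = hr[i];
}
```

For example, multiplying $256^{16}$ by $256$ gives $256^{17}$, which this sketch stores as the single digit 320 in position 0.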

freeze seems to simply be:

    if (h >= p)
        return h - p;
    else
        return h;

Except it uses bitwise operations to avoid the branch. minusp is $2^{8 \cdot 17}-(2^{130}-5)$, i.e. it simply represents $-p$. It can be used to reduce modulo $p$ if the input is less than $2p$.
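A minimal sketch of how the branch can be avoided, assuming 17 carry-propagated base-256 digits in unsigned ints and an input below $2p$. The names follow the ones used above, but this is a reconstruction for illustration rather than a verbatim copy of the reference source. Note that $2^{136} - 2^{130} + 5 = 252 \cdot 2^{128} + 5$, which gives the digits of `minusp`.

```c
/* -p modulo 2^136, i.e. 2^136 - (2^130 - 5) = 252 * 2^128 + 5. */
static const unsigned int minusp[17] = {
    5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 252
};

/* h += c modulo 2^136, with digits normalised to 0..255. */
static void add(unsigned int h[17], const unsigned int c[17])
{
    unsigned int j, u = 0;
    for (j = 0; j < 17; ++j) {
        u += h[j] + c[j];
        h[j] = u & 255;
        u >>= 8;
    }
}

/* Replace h by h - p if h >= p, otherwise leave it alone, without branching. */
static void freeze(unsigned int h[17])
{
    unsigned int horig[17];
    unsigned int j, negative;

    for (j = 0; j < 17; ++j)
        horig[j] = h[j];
    add(h, minusp);                 /* h = (h - p) mod 2^136 */
    /* Bit 135 is set exactly when h - p wrapped around, i.e. h < p.
       negative is then all ones, otherwise zero. */
    negative = -(h[16] >> 7);
    /* If negative is all ones, h ^ (horig ^ h) restores horig;
       if it is zero, h keeps the subtracted value. */
    for (j = 0; j < 17; ++j)
        h[j] ^= negative & (horig[j] ^ h[j]);
}
```

The xor with a mask is a constant-time select: both candidate values are always computed, and the mask decides which one survives.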

Since the individual elements of h can exceed 255 but represent base-256 digits, you need to propagate carries at the end of each multiplication. This is done using squeeze. After propagating the carries you get a number between 0 and $2p-1$. Having a slightly too large value as the result of an intermediate step isn't a problem, so no further reduction happens by default.
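A minimal sketch of such a carry propagation, assuming 17 digits in unsigned ints where the top digit only needs to keep bits 128 and 129. It is a reconstruction for illustration, not a verbatim copy of the reference source.

```c
/* Propagate carries so that digits 0..15 are back in 0..255, and fold
   everything from bit 130 upward back in using 2^130 = 5 (mod p). */
static void squeeze(unsigned int h[17])
{
    unsigned int j, u = 0;

    /* First pass: carry through the low 16 digits. */
    for (j = 0; j < 16; ++j) {
        u += h[j];
        h[j] = u & 255;
        u >>= 8;
    }
    u += h[16];
    h[16] = u & 3;       /* keep bits 128 and 129 */
    u = 5 * (u >> 2);    /* bits 130 and up, multiplied by 5 */

    /* Second pass: add the folded part and carry again. */
    for (j = 0; j < 16; ++j) {
        u += h[j];
        h[j] = u & 255;
        u >>= 8;
    }
    u += h[16];
    h[16] = u;
}
```

For example, an input whose only nonzero digit is h[16] = 4 represents $2^{130}$ and comes out as 5.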

But at the end of the MAC you need a number between 0 and $p-1$, so freeze is used as the final reduction.