Poly1305-AES for the 8051

A few people have asked me how well Poly1305-AES will perform
on a very small CPU, such as the Intel 8051.

Gura et al. have reported a 160x160 multiplication
(without any fancy algorithms)
in 1986 cycles on the 8051-compatible ATmega128 with MULACC.
The main loop in Poly1305-AES handles 16 bytes of input
using a 130x124 multiplication (considerably smaller than 160x160),
a reduction modulo 2^130-5 (relatively easy),
and a 128-bit addition (easy).
So it's reasonable to shoot for 2000 cycles on the ATmega128
(or even smaller with better multiplication algorithms);
in other words, 125 cycles/byte.

Storage:
The 16-byte AES key k and 16-byte nonce n can be quickly merged
into a 16-byte output AES_k(n).
Other storage: 16-byte r; 16-byte message chunk;
1 byte saying how long the chunk is; 17-byte accumulator.
All of this fits comfortably into the 8051 RAM,
leaving many bytes free for intermediate results.