要約:This application note describes how to improve the speed of modular exponentiation by more than 50% when using MAXQ® microcontrollers that have a modular arithmetic accelerator (MAA).

Introduction

Modular exponentiation, ae modulo m, is a common operation in many cryptographic functions. The modular arithmetic accelerator (MAA) in MAXQ microprocessors can perform this operation with modulus sizes of up to 2048 bits. It is easy to load the memory areas with a, e, and m, and start the operation.

When the modulus is the product of two or more primes, we can use the results of the Chinese remainder theorem (CRT) to reduce the execution time by doing two smaller modular exponentiations instead of one large one. Specifically, we are using Garner's algorithm for this operation.

Description

In a typical RSA decryption operation, we recover our plain text (pt) from the cipher text (ct) by executing pt = ctd mod n, where d and n form the private key. The value d is our decryption exponent, and n is the product of the primes p and q. In general, p and q will be the same number of bits in length, and n, pt, and ct will be twice that number of bits. For example, if p and q are 1024 bits long, then n will be a 2048 bit number about 60% of the time.

Notice that now our modular exponentiations in the c1 and c2 terms will have half as many bits than with the ctd mod n operation.

The terms m1 and m2 can both be precomputed. The p-1 mod q is some value, say y, such that p × y mod q = 1. For example, if p = 11 and q = 17, 11-1 mod 17 = 14, since 11 times 14 mod 17 = 1. These inverse values can be found using the extended Euclidian algorithm, or since p and q are primes, executing the function pq-2 mod q. This modular exponentiation to find an inverse is based on Fermat's little theorem.

The c1 and c2 terms are interesting because ct and d are both twice as large as their modulus values of p and q. The one thing the MAA cannot do is operate on values greater than the modulus size; we need to reduce both of these values before we can use the MAA to perform the modular exponentiation. The reduction of the exponent d relative to p and q can be precomputed. These new exponents are simply (d - 1) mod p and (d - 1) mod q. The reductions of the exponents is also based on Fermat's little theorem.

Reducing ct in both c1 and c2 is done at execution time with a modular multiply. For example, if the ct is 64 bits long and p and q are both 32 bits long, then we can perform the following multiplication: ct × 232 mod (p × 232). This would be a 64-bit modular multiplication. This is actually simpler than the equation looks. We put the 64-bit, 4-word ct into the MAA register a and a clear everything in MAA register b. We then set bit 32 of register b, making the register equal 232. In the modulus we will write the bottom two words to zero, and then copy the value of p into the next two words. We then set MAWS to 64 and perform the modular multiplication. The reduced value we are looking for is in words 3 and 4 of the result.

To make things easier for the rest of the calculation, we find which term is the larger between c1 and c2, and then select the equation where we subtract the smaller from the larger to avoid getting a negative number. Now do a modulo multiplication, and then add this to c1 (if we multiplied by m1) or c2 (if we multiplied by m2).

Code Description

Listing 1 shows the initialization of unsigned long word pointers to each of the registers in the MAA. The hard-coded address for the MAA in the DeepCover® Secure Microcontroller (MAXQ1103) with secure RISC architecture is given.

Listing 2 contains the four precomputed constants, piqtp (read as p inverse q times p), qiptq (read as q inverse p times q), dphip (read as d phi of p), and dphiq (read as d phi of q). The constants diptp and piqtq are the m1 and m2 terms from above. The constants dphip and dphiq are the reduced decryption exponents for the c1 and c2 terms from above. The two vectors ptp and ptq are for temporary storage. The other variables p, q, n, phi, e, d, pt, and ct describe all the values needed for RSA. The term phi is equal to (p - 1) × (q - 1).

The term nwords is the number of words in n, the modulus of the keys. In this implementation it is assumed that both p and q will have exactly nwords × 16 bits, and that n will have exactly nwords × 32 bits.

The words in these vectors are saved from the least significant word to the most significant word. They are loaded into the MAA register from low word to high word. To be clear, here is an example of p × q = n using the constants from Listing 2 with alternate long words in bold.

Listing 3 has the do_crt function, which is called with the number of words in p or q. The routine starts off by creating the terms c1 and c2 from above and saving these values in ptp and ptq, respectively. Then we determine which is larger of ptp or ptq, and then call the routine that will do the modular multiplication and addition. This leaves the pt in the maa_resw memory.

Listing 4 has the details for initializing the MAA. Here we clear all 384 words in the MAA, initialize the memory select register, MAMS, and then tell the MAA which is the most significant bit in the modulus, MAWS.

Listing 5 has the details for doing the modular reduction for expressions of the form ptp = ctdphip mod p. The first thing this routine does is reduce ct to half its size by doing a modular multiply with a shifted modulus p, and letting the multiplicand be the shifted value. These routines use moves of long words rather than shifting several words one bit at a time. (A move by a distance of a long word is like shifting 32 times.) After doing the modular multiplication, our reduced answer is shifted left in the maa_resw register.

With the reduced ct, we next do the modular exponentiation to get the result this subroutine was meant to obtain, ptp.

Listing 6 describes the function that puts everything together. We are going to compute the equation ct = (c1 + m1(c2 - c1)) mod n if c2 is larger than c1 or ct = (c2 + m2(c1 - c2)) mod n otherwise. It starts off with the subtraction. Notice we are using n as our modulus, and we will be at the full key length for the multiply and add. After the subtraction, we move the result from maa_resw to maa_aw, and copy our multiplier, m1 or m2, our argument c into maa_bw, and start the modular multiply. In the last step, we copy the result of the multiply from maa_resw to maa_aw, and copy c1 or c2, our argument b, into maa_bw and do the modular addition. When it is done, our plain text is in maa_resw.

Performance and Conclusion

Tables 1 and 2 give an indication of the speed improvements that can be realized by using the algorithm discussed above. We get greater reductions in time as the modulus size increases.

The C implementation presented here is intended to demonstrate that a speed increase is possible using this algorithm. The code also shows how the MAA is manipulated. Many things can be done to increase the speed of the algorithm, including loop unrolling, loop optimization, using the compilers optimized data movement routines, and using assembly language.

The quickest and easiest improvement in speed would be to manipulate the MAA's MAMS register, which will eliminate some data movement. The MAMS register lets us rename memory segments. It was not done in this application, for simplicity.

Theoretically, it should be possible to get close to a timing ratio of 4 to 1 in the improved speed by using this algorithm.

Always use the crypto ring as your crypto clock source. This is accomplished by clearing the Master Crypto Source Select (MCSS, PMR.7) in the Power Management Register (PMR). When decrypting, it is recommended that the Optimized Calculation Control (OCALC, MACT.4) of the Modular Arithmetic Accelerator Control Register (MACT) be cleared to zero, disabling the function.

Numerical Example of RSA Using the CRT to Recover the Plain Text

This example goes through the steps of constructing the public and private keys for RSA, then taking a sample message, encrypting it, and decrypting it. Then we show how the same encrypted message can be decrypted using the CRT.

We start by finding a couple of prime numbers. Let p = 0xE747 and q = 0xC7A5. Both are 16-bit prime numbers.

Now we will recover the plain text by using Chinese remainder theorem.

Offline we precompute four constants. The first is piqtp = 0x9E1D261C, which is (p-1mod q) × p. The second is qiptq = 0x16401BA8, which is (q-1mod p) times q. These are both 32 bits long. You can use the extended Euclidian algorithm or execute pq-2 mod q and qp-2 mod p to find the inverses. We also need dphip = d mod phi(p) = 0x9B111CC9 mod (0xE747 - 1) = 0x4AAB and dphiq = d mod phi(q) = 0x9B111CC9 mod (0xC7A5 - 1) = 0x9A0D. These numbers are both 16 bits, half the size of n.

The online calculation starts by reducing the cipher text by mod p. The cipher text is 32 bits long and our modulus is 16 bits long. To use the MAA to do the reduction, we shift the modulus, p, left 16 bits and multiply the cipher text by 216. That looks like 0x87CCFE27 × 0x10000 mod 0xE7470000 = 0x36B00000. Now we shift the answer right 16 bits or just grab the upper 16 bits of the word. We need this value to do the 16-bit modular exponentiation of 0x36B0dphip mod p, which looks like 0x36B00x4AAB mod 0xE747 = 0x6425 = ptp.

The second modular reduction with cipher text mod q expands as 0x87CCFE27 × 0x10000 mod 0xC7A50000 = 0x543D0000. Shift the answer right 16 bits, or just grab the upper word, getting the reduction, 0x543D. Use this result and perform the modular exponentiation 0x543Ddphiq mod q, which looks like 0x543D0x9A0D mod 0xC7A5 = 0x1671 = ptq.

We see that ptp is larger than ptq. The difference, ptp - ptq = 0x4DB4. The product of ((ptp - ptq) and qiptq) mod n, and (0x4DB4 × 0x16401BA8) mod 0xB45D41C3 is 0xABCDD8A1. Adding ptq, 0x1671 gives us back our plain text, 0xABCDEF12, and we are finished.