Abstract

The growing volume of information that must be protected places specific requirements on information security systems. The main goal of this paper is to increase the performance of public-key cryptographic transformations by increasing the performance of integer squaring. The authors apply the delayed carry mechanism and efficient parallelization approaches, which they previously proposed for the Comba multiplication algorithm. The idea is to accumulate carries while adding the products of the relevant machine words in columns, so that the products in each column can be added independently of one another. However, independent accumulation of products and carries requires correcting the intermediate results to account for the accumulated carries. Because accumulation in the columns is independent, the accumulation of products can be parallelized, which leads to several approaches. Theoretical estimates of the computational complexity of the proposed squaring algorithms are obtained. Software implementations of the algorithms in C++ yield practical performance results that do not contradict the theoretical estimates. The authors are the first to apply the delayed carry method and parallelization techniques, previously proposed for integer multiplication, to squaring algorithms.

1. Introduction

Cryptographic transformations with a public key (CTPK) are the basis for most modern cryptosystems. The growing volume of information that must be protected places specific demands on CTPK. Multiplicative operations [2, 3], such as multiplication and squaring of integers, are the most frequently used operations in CTPK. One approach to increasing CTPK performance is to increase the performance of basic operations such as multiplication, squaring, modular reduction and multiplicative inversion. Approaches that increase CTPK performance by speeding up integer multiplication were reviewed in [4, 5, 6]. The main goal of this paper is to find ways of increasing CTPK performance by speeding up integer squaring, using the delayed carry mechanism [5, 6] and efficient parallelization approaches [4].

Squaring is a special case of multiplication in which both multipliers are equal [1, 3]. The features of multiplication and squaring can be shown by considering the "schoolbook" multiplication of the two integers 123 and 456 (Figure 1).

Figure 1 shows that calculating the product of the two integers 123 and 456 requires 9 unique multiplication operations. Squaring with "schoolbook" multiplication allows some optimizations. Consider multiplying the integer 123 by itself using "schoolbook" multiplication (Figure 2).

Figure 2 shows that some digit products are the same products taken in a different order: the products in rows 0 and 2 in column 2 (3 * 1 = 1 * 3), the products in rows 0 and 1 in column 1 (3 * 2 = 2 * 3) and the products in rows 1 and 2 in column 3 (1 * 2 = 2 * 1). Therefore, squaring an n-digit number requires only $(n^2 + n)/2$ unique multiplications ($n^2$ operations are required for multiplication in the general case).
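The count of unique multiplications can be checked with a short derivation. It assumes, as Figure 2 suggests, that the square of an n-digit number consists of n square terms plus one product for each unordered pair of distinct digit positions; the indices i and j are used here only for illustration.

```latex
\[
\underbrace{n}_{\text{squares } a_i^2}
+ \underbrace{\binom{n}{2}}_{\text{pairs } a_i a_j,\ i<j}
= n + \frac{n(n-1)}{2}
= \frac{n^2 + n}{2},
\qquad
n = 3:\ \frac{9 + 3}{2} = 6 \ \text{instead of}\ n^2 = 9 .
\]
```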

Let $a$ be the integer being squared and $a_i$ the $i$-th digit of $a$. The following features are easy to notice:

1. In row $k$, the product in column $2k$ is the square term $a_k^2$. In Figure 2 these are 3 * 3, 2 * 2 and 1 * 1.

2. Every non-square term of a column appears twice (the product $a_i a_j$ in column $i + j$ in row $i$, where $i \neq j$, has a pair $a_j a_i$ in row $j$). In Figure 2 these are 3 * 1 = 1 * 3, 3 * 2 = 2 * 3 and 1 * 2 = 2 * 1. Every odd column is made up entirely of product pairs.

3. For row $k$, where $0 \le k < n - 1$, the first unique product that is not a square is located in column $2k + 1$. In Figure 2, for row 1, it is 2 * 1.
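The three features can be illustrated with a minimal C++ sketch that squares a small decimal number digit by digit. It is not one of the algorithms of this paper: the names `SquareResult` and `square_digits` are hypothetical, the digits are assumed to be stored least significant first, and the number is assumed small enough for its square to fit in 64 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Square of the input and the number of digit multiplications performed.
struct SquareResult {
    std::uint64_t value;
    unsigned multiplications;
};

// Squares a number given as base-10 digits, least significant digit first.
// Assumption of this sketch: the square fits in 64 bits.
SquareResult square_digits(const std::vector<unsigned>& a) {
    assert(!a.empty());
    const std::size_t n = a.size();
    std::vector<std::uint64_t> column(2 * n, 0);
    unsigned mults = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Feature 1: the square term of row i lands in column 2i.
        column[2 * i] += a[i] * a[i];
        ++mults;
        // Features 2 and 3: every product with j > i has a pair,
        // so it is computed once and counted twice.
        for (std::size_t j = i + 1; j < n; ++j) {
            column[i + j] += 2ull * (a[i] * a[j]);
            ++mults;
        }
    }
    // Combine the column sums using their positional weights 10^k.
    std::uint64_t value = 0, weight = 1;
    for (std::size_t k = 0; k < 2 * n; ++k) {
        value += column[k] * weight;
        weight *= 10;
    }
    return {value, mults};
}
```

For the digits of 123, `square_digits({3, 2, 1})` returns the value 15129 after 6 digit multiplications, compared with the 9 multiplications of the general schoolbook method.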

2. Multiplication Algorithm Modified Comba

A generalized modified Comba algorithm for integer multiplication, Modified Comba (MC), which uses the idea of delayed carry, was proposed in [5]. The algorithm is built on outer loops (steps 2 and 3) and inner loops (steps 2.1 and 3.1). At the lowest level of the hierarchy, in loops 2.1 and 3.1, the products are computed and the delayed carry is accumulated. The accumulated carry is taken into account in the closing steps of the iterations of loops 2 and 3. Storing w-bit values in 2w-bit variables removes the need to handle the carry of a w-bit variable after each arithmetic operation: the carry accumulates in the upper half of the 2w-bit variable and is taken into account when needed (Figure 3). The generalized MC algorithm [5] for w-bit systems is given below.

Multiplication algorithm 1. Modified Comba

INPUT: , ,

OUTPUT:

1. , , .

2. For , , do

2.1. For , , , , do

2.1.1. .

2.1.2. ,

2.2. ,

2.3. , , , .

3. For , ,, , do

3.1. For , , , , do

3.1.1. .

3.1.2. ,

3.2. ,

3.3. , , , .

4. .

5. Return .

where n is the number of w-bit machine words required to store a multiplier of the given size, and the remaining symbols denote a multiplication operation for w-bit words and an addition operation for 2w-bit and w-bit words. Assignment operations are not taken into account in the computational complexity of the algorithms.

With delayed carry, the multiplication results belonging to each column can be added independently, which makes it possible to accumulate the sums of the high-order and low-order parts in separate parallel threads. However, after the sums have been accumulated in each thread, an adjustment (carry accounting) must be made before the result is stored. Figure 3 and Figure 4 give a graphical interpretation of the MC algorithm for n=3, showing how the corresponding products are added in columns.
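The delayed carry idea can be sketched in C++ as follows. This is a minimal illustration, not the MC listing from [5]: it assumes w = 32, little-endian word arrays of equal length, and the hypothetical function name `comba_mul`. The low and high halves of every product are accumulated in separate 64-bit (2w-bit) variables, and the carry is resolved once per column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Column-wise (Comba) multiplication with delayed carry, w = 32.
// a and b are little-endian arrays of n words; the result has 2n words.
std::vector<std::uint32_t> comba_mul(const std::vector<std::uint32_t>& a,
                                     const std::vector<std::uint32_t>& b) {
    assert(!a.empty() && a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<std::uint32_t> c(2 * n, 0);
    std::uint64_t carry = 0;  // carry handed over from the previous column
    for (std::size_t k = 0; k < 2 * n - 1; ++k) {
        std::uint64_t lo = 0, hi = 0;  // 2w-bit accumulators of column k
        const std::size_t i_min = (k < n) ? 0 : k - n + 1;
        const std::size_t i_max = (k < n) ? k : n - 1;
        for (std::size_t i = i_min; i <= i_max; ++i) {
            const std::uint64_t p = static_cast<std::uint64_t>(a[i]) * b[k - i];
            lo += p & 0xFFFFFFFFu;  // low half; carries pile up in bits 32..63
            hi += p >> 32;          // high half, accumulated independently
        }
        // Correction step: account for the delayed carries once per column.
        lo += carry;
        c[k] = static_cast<std::uint32_t>(lo);
        carry = hi + (lo >> 32);
    }
    c[2 * n - 1] = static_cast<std::uint32_t>(carry);
    return c;
}
```

Because `lo` and `hi` never interact inside the inner loop, the two sums of a column (and the sums of different columns) could be accumulated by different threads; only the correction step has to run in column order.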

3. Squaring Algorithms

3.1. Squaring algorithm Modified Comba SQR

Using the delayed carry mechanism [5, 6] and the parallelization approaches [4, 5], we propose three squaring algorithms that take the above features into account. To exploit the features of squaring, the inner loops of delayed carry accumulation in the MC algorithm (steps 2.1 and 3.1) were modified, and an additional check was added to avoid duplication in the sum accumulation (steps 2.1.2 and 3.1.2).
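A squaring routine in this spirit might look like the sketch below. It is an illustration under stated assumptions rather than the MCSQR listing: w = 32, a little-endian word array, and the hypothetical name `comba_sqr`. Each paired product of a column is computed once, the accumulated sums are doubled with one shift, and the single square term is added only in even columns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Column-wise squaring with delayed carry, w = 32.
// a is a little-endian array of n words; the result has 2n words.
std::vector<std::uint32_t> comba_sqr(const std::vector<std::uint32_t>& a) {
    assert(!a.empty());
    const std::size_t n = a.size();
    std::vector<std::uint32_t> c(2 * n, 0);
    std::uint64_t carry = 0;
    for (std::size_t k = 0; k < 2 * n - 1; ++k) {
        std::uint64_t lo = 0, hi = 0;
        const std::size_t i_min = (k < n) ? 0 : k - n + 1;
        // Paired products a[i] * a[k - i] with i < k - i: computed once.
        for (std::size_t i = i_min; 2 * i < k; ++i) {
            const std::uint64_t p = static_cast<std::uint64_t>(a[i]) * a[k - i];
            lo += p & 0xFFFFFFFFu;
            hi += p >> 32;
        }
        // One shift per accumulator doubles all paired products of the column.
        lo <<= 1;
        hi <<= 1;
        // Even columns also contain the single square term a[k/2]^2.
        if (k % 2 == 0) {
            const std::uint64_t s = static_cast<std::uint64_t>(a[k / 2]) * a[k / 2];
            lo += s & 0xFFFFFFFFu;
            hi += s >> 32;
        }
        // Correction step: account for the delayed carries once per column.
        lo += carry;
        c[k] = static_cast<std::uint32_t>(lo);
        carry = hi + (lo >> 32);
    }
    c[2 * n - 1] = static_cast<std::uint32_t>(carry);
    return c;
}
```

Compared with general column-wise multiplication, the inner loop visits roughly half of the word pairs of each column, at the cost of two shifts and one parity check per column.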

where n is the number of w-bit machine words required to store a multiplier of the given size, and the remaining symbols denote a multiplication operation for w-bit words, an addition operation for 2w-bit and w-bit words, and a word shift operation. Assignment operations are not taken into account in the computational complexity of the algorithms.

The estimated computational complexity of MCSQR for multipliers of different bit lengths is shown in Table 1 and Table 2 (MUL, ADD and SHIFT are the numbers of required multiplication, addition and shift operations).

where n is the number of w-bit machine words required to store a multiplier of the given size, and the remaining symbols denote a multiplication operation for w-bit words, an addition operation for 2w-bit and w-bit words, and a shift operation. Assignment operations are not taken into account in the computational complexity of the algorithms.

The estimated computational complexity of MCSQR2x for multipliers of different bit lengths is shown in Table 3 and Table 4 (MUL, ADD and SHIFT are the numbers of required multiplication, addition and shift operations):

where Z is the number of parallel threads, n is the number of w-bit machine words required to store a multiplier of the given size, and the remaining symbols denote a multiplication operation for w-bit words, an addition operation for 2w-bit and w-bit words, and a word shift operation. Assignment operations are not taken into account in the computational complexity of the algorithms.

The estimated computational complexity of MCSQRMx for multipliers of different bit lengths is shown in Table 5 and Table 6 for Z=4 (MUL, ADD and SHIFT are the numbers of required multiplication, addition and shift operations):

Table 6. The number of operations for MCSQRMx (w=64 bit)

Theoretical calculations show that the parallel squaring algorithms have a lower computational complexity, primarily due to the parallel execution of the elementary addition and multiplication operations. Furthermore, using 64-bit machine words reduces the number of multiplications by a factor of 4.
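The source of the parallel speed-up, the independence of the column sums, can be sketched with standard C++ threads. This is an illustration only and not the MCSQR2x or MCSQRMx listing: the names `ColumnSum`, `square_column` and `parallel_sqr`, the round-robin assignment of columns to threads, and w = 32 are assumptions of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Delayed-carry sums of one column: low and high halves kept separately.
struct ColumnSum {
    std::uint64_t lo = 0, hi = 0;
};

// Accumulates column k of the square of a (little-endian 32-bit words).
static ColumnSum square_column(const std::vector<std::uint32_t>& a, std::size_t k) {
    const std::size_t n = a.size();
    ColumnSum s;
    for (std::size_t i = (k < n) ? 0 : k - n + 1; 2 * i < k; ++i) {
        const std::uint64_t p = static_cast<std::uint64_t>(a[i]) * a[k - i];
        s.lo += p & 0xFFFFFFFFu;
        s.hi += p >> 32;
    }
    s.lo <<= 1;  // paired products are counted twice
    s.hi <<= 1;
    if (k % 2 == 0) {  // square term of an even column
        const std::uint64_t q = static_cast<std::uint64_t>(a[k / 2]) * a[k / 2];
        s.lo += q & 0xFFFFFFFFu;
        s.hi += q >> 32;
    }
    return s;
}

// Squares a using z worker threads; thread t handles columns t, t + z, ...
std::vector<std::uint32_t> parallel_sqr(const std::vector<std::uint32_t>& a, unsigned z) {
    assert(!a.empty() && z > 0);
    const std::size_t cols = 2 * a.size() - 1;
    std::vector<ColumnSum> sums(cols);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < z; ++t) {
        workers.emplace_back([&, t] {
            // Each thread writes only its own elements of sums: no data race.
            for (std::size_t k = t; k < cols; k += z) sums[k] = square_column(a, k);
        });
    }
    for (auto& w : workers) w.join();
    // Sequential correction: propagate the delayed carries in column order.
    std::vector<std::uint32_t> c(cols + 1, 0);
    std::uint64_t carry = 0;
    for (std::size_t k = 0; k < cols; ++k) {
        const std::uint64_t lo = sums[k].lo + carry;
        c[k] = static_cast<std::uint32_t>(lo);
        carry = sums[k].hi + (lo >> 32);
    }
    c[cols] = static_cast<std::uint32_t>(carry);
    return c;
}
```

The round-robin split keeps the per-thread work roughly balanced, since the middle columns contain the most products. For operands of only a few words, thread start-up cost is likely to outweigh any gain.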

4. Field Research

The squaring algorithms MCSQR, MCSQR2x and MCSQRMx, like the previously proposed multiplication algorithms MC, MC2x and MCMx (Kovtun et al., 2012), (Kovtun and Okhrimenko, 2012), (Kovtun and Okhrimenko, 2013), were implemented in C++ using the Intel C++ Compiler XE 13. The proposed algorithms were implemented for 32- and 64-bit platforms. Measurements were performed on a computer running Microsoft Windows 7 Ultimate x64 SP1 with an Intel Core i5-3570 processor (6M Cache, 3.40 GHz) with four physical cores. Two 64-bit integers were multiplied with the compiler intrinsic function _umul128 (the 128-bit result of the multiplication is represented as an array of 64-bit words). The results were compared by the average execution time of the software implementations of the multiplication algorithms MC, MC2x and MCMx and of the proposed squaring algorithms MCSQR, MCSQR2x and MCSQRMx over 1 million iterations. The experimental results for 32-bit platforms are shown in Table 7 and Table 8.

Table 9. Normalized results of experiments for w = 32 bit

Table 9 shows that all the proposed squaring algorithms for 32-bit platforms are more efficient than the corresponding multiplication algorithms. The single-threaded algorithm MCSQR is more efficient than MC by 5%, and the advantage grows to 15% as the bit size of the multipliers increases. The algorithm MCSQR2x is more efficient than MC2x by 6%, and the advantage grows to 56% as the bit size of the multipliers increases. The multi-threaded algorithm MCSQRMx is more efficient than MCMx by 6%, and the advantage grows to 9% as the bit size of the multipliers increases.

The experimental results for 64-bit platforms are shown in Table 10 and Table 11.

Table 12. Normalized results of experiments for w = 64 bit

Table 12 shows that all the proposed squaring algorithms for 64-bit platforms are more efficient than the corresponding multiplication algorithms. The single-threaded algorithm MCSQR is more efficient than MC by 7%, and the advantage grows to 36% as the bit size of the multipliers increases. The algorithm MCSQR2x is more efficient than MC2x by 30%, and the advantage grows to 41% as the bit size of the multipliers increases. The multi-threaded algorithm MCSQRMx is more efficient than MCMx by 37% on average.

To compare the performance of the software implementations of the squaring algorithms for 32- and 64-bit platforms, the results obtained on the 32-bit platform were divided by the results obtained on the 64-bit platform for the same algorithms. The comparison results are shown in Table 13 and Table 14.

Table 14 shows that, as in the case of multiplication, the software implementations of the squaring algorithms for 64-bit platforms are more efficient than the same implementations for 32-bit platforms (up to 4 times for MCSQR and up to 3.6 times for MCSQR2x). The multithreaded multiplication and squaring algorithms were more efficient on 32-bit platforms than on 64-bit platforms, because modern compilers do not support 128-bit operations, so these operations have to be emulated in software. The MCSQRMx algorithm for 64-bit platforms is more efficient for 128-512-bit multipliers (by 4-6%), which are widely used in cryptography.
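To show what software emulation of a 128-bit operation involves, the following portable C++ sketch computes the full 128-bit product of two 64-bit words from four 32-bit partial products. It is not the code used in the experiments; the names `U128`, `mul64x64` and `mul64x64_example` are hypothetical. The result layout (low and high 64-bit words) mirrors the two words produced by a 64-by-64-bit multiplication such as _umul128.

```cpp
#include <cassert>
#include <cstdint>

// 128-bit value stored as two 64-bit words.
struct U128 {
    std::uint64_t lo, hi;
};

// Full 64 x 64 -> 128-bit multiplication using only 64-bit arithmetic.
U128 mul64x64(std::uint64_t x, std::uint64_t y) {
    const std::uint64_t x0 = x & 0xFFFFFFFFu, x1 = x >> 32;
    const std::uint64_t y0 = y & 0xFFFFFFFFu, y1 = y >> 32;
    const std::uint64_t p00 = x0 * y0;  // weight 2^0
    const std::uint64_t p01 = x0 * y1;  // weight 2^32
    const std::uint64_t p10 = x1 * y0;  // weight 2^32
    const std::uint64_t p11 = x1 * y1;  // weight 2^64
    // Sum of everything that lands in bits 32..63; each of the three terms
    // is below 2^32, so the sum cannot overflow 64 bits.
    const std::uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    U128 r;
    r.lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}

// Usage example: (2^64 - 1)^2 = 2^128 - 2^65 + 1.
inline void mul64x64_example() {
    const U128 r = mul64x64(~0ull, ~0ull);
    assert(r.lo == 1 && r.hi == 0xFFFFFFFFFFFFFFFEull);
}
```

The emulation costs four multiplications and several additions and shifts per 64-bit word product. This overhead is one plausible reason a 64-bit implementation may lose part of its expected advantage when 128-bit arithmetic is emulated rather than native.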

The theoretical estimates for MCSQR and MCSQR2x are confirmed by the practical results. The test results of the MCSQRMx algorithm for 64-bit platforms reflect the lack of operations performed natively on 128-bit data.

Theoretical calculations show that the parallel squaring algorithms have a lower computational complexity, primarily due to the parallel execution of the elementary addition and multiplication operations. Furthermore, using 64-bit machine words reduces the number of multiplications by a factor of 4.

The software implementations of the squaring algorithms for 64-bit platforms are more efficient than the same implementations for 32-bit platforms (up to 4 times for MCSQR and up to 3.6 times for MCSQR2x). The MCSQRMx algorithm for 64-bit platforms is more efficient for 128-512-bit multipliers (by 4-6%), which are widely used in cryptography. The multithreaded multiplication and squaring algorithms were more efficient on 32-bit platforms than on 64-bit platforms, because modern compilers do not support 128-bit operations, so these operations have to be emulated in software.

The experiments have shown that the proposed squaring algorithms are more efficient than the multiplication algorithms on both 32-bit and 64-bit platforms. The theoretical results are confirmed in practice.

The most promising algorithm is MCSQRMx, which shows significantly better results than the other algorithms presented. MCSQRMx has a high degree of parallelism, which allows it to be implemented on various microprocessor platforms; further research will therefore focus on its development using specialized software and hardware (e.g., NVIDIA CUDA and OpenCL).