High Performance Elliptic Curve Cryptographic Processor Over Gf (2163 )

  • Uploaded by: Susan Carter
  • 0
  • 0
  • July 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View High Performance Elliptic Curve Cryptographic Processor Over Gf (2163 ) as PDF for free.

More details

  • Words: 2,725
  • Pages: 6
High Performance Elliptic Curve Cryptographic Processor Over GF (2163 ) Hyun Min Choi, Chun Pyo Hong Dept. of Computer and Communication Daegu University, South Korea [email protected], [email protected] Abstract In this paper, we propose a high performance elliptic curve cryptographic processor over GF (2163 ). The proposed architecture is based on a modified L´opez-Dahab elliptic curve point multiplication algorithm and uses Gaussian normal basis (GNB) for GF (2163 ) field arithmetic. To achieve a high throughput rates, we design two new wordlevel arithmetic units over GF (2163 ) and derive a parallelized elliptic curve point doubling and point addition algorithm. We implement our design using Xilinx XC4VLX80 FPGA device which uses 24,263 slices and has a maximum frequency of 143MHz. Our design is roughly 4.8 times faster with 2 times increased hardware complexity compared with the previous hardware implementation. Therefore, the proposed architecture is well suited to elliptic curve cryptosystems requiring high throughput rates such as network processors and web servers.

1. Introduction Elliptic curve cryptosystems (ECC) have recently gained much attention in industry and academia. The main reason is that for a properly chosen elliptic curve, no known subexponential algorithm can be used to break the encryption through the solution of the discrete logarithm problem. This means that significantly smaller parameters can be used in ECC than in other competitive systems such as RSA and ElGamal with equivalent levels of security. Some benefits of having smaller key sizes include faster computations, reductions in processing power, storage space, and bandwidth. Due to these advantages of ECC, a number of software and hardware implementations [1-3] have been proposed, and included in many standards such as IEEE 1363 [4] and NIST [5] for elliptic curve digital signature algorithm. Computing kP (a point or scalar multiplication) is the most important arithmetic operation in ECC, where k is an integer and P is a point on an elliptic curve. This opera-

Chang Hoon Kim Dept. of Computer and Information Daegu University, South Korea [email protected]

tion can be computed by repeated point additions and doublings. If we implement ECC over GF (2m ), an efficiency of computing kP depends on the choice of a point multiplication algorithm, coordinate system, and basis representation. From previous results [1-3], it can be noted that, for hardware implementations of ECC, L´opez-Dahap Montgomery’s kP algorithm based on projective coordinate is faster than any other point multiplication methods. Another advantage of this algorithm is that the same operations are performed in every iteration of the main loop, thereby potentially increasing resistance to timing attacks and power analysis attacks. In this paper, we propose a high performance elliptic curve cryptographic processor over GF (2163 ). The proposed architecture is based on a modified L´opez-Dahab elliptic curve point multiplication algorithm and uses GNB for GF (2m ) field arithmetic. Three major characteristics of the proposed architecture are: 1) it uses fast arithmetic units based on a word-level multiplier [6] 2) it adopts a parallelized point doubling and point addition unit with uniform addressing mode, 3) and it utilizes benefits of GNB representation. Therefore, the proposed architecture leads to a considerable reduction of computational delay time compared with previous hardware implementations. Furthermore, since the proposed architecture has the features of modularity and a simple control structure, it is well suited to VLSI implementation.

2 Arithmetic unit for ECC 2.1

L´ opez-Dahab kP algorithm for GF (2m )

Let E be an elliptic curve over GF (2m ) defined by E : y 2 + xy = x3 + ax2 + b,

(1)

where a, b are in GF (2m ) and b 6= 0. Let P = (x, y) be a point on E of large prime order in E(GF (2m )), and let

k ≥ 0 be an integer. We may write k as a binary expansion k = (ks−1 ks−2 · · · k1 k0 )2 = ks−1 2s−1 + ks−2 2s−2 + · · · k1 s + k0 ,

(2)

where ki = 0, 1 with ks−1 = 1. We can compute kP using the following L´opez-Dahab algorithm [7]. Two main advantages of the L´opez-Dahab algorithm are: 1) A number of inversions are avoided by using projective coordinate representation and 2) The same operations are performed in every iteration of the main loop. These two benefits mean that faster implementation is possible with potentially increasing resistance to timing attacks. Algorithm 1 L´opez-Dahab version for point multiplication Input: P = (x, y) ∈ E(GF (2m )), an integer k ≥ 0. Output: kP = (x0 , y0 ). 1. If k = 0 or x = 0, then kP = ∞ or P . 2. k ← (kl−1 , · · · , k1 , k0 )2 . 3. (X1 , Z1 ) ← (x, 1), (X2 , Z2 ) ← (x4 + b, x2 ). 4. For i = l − 2 down to 0 do 5. Z3 ← (X1 Z2 + X2 Z1 )2 . 6. If ki = 1 then 7. X1 ← xZ3 + (X1 Z2 )(X2 Z1 ), Z1 ← Z3 , X2 ← X24 + bZ24 , Z2 ← X22 Z22 . 8. Else 9. X2 ← xZ3 + (X1 Z2 )(X2 Z1 ), Z2 ← Z3 , X1 ← X14 + bZ14 , Z1 ← X12 Z12 . 10. End if 11. End for 1 12. x0 ← X Z1 , 1 1 y0 ← x · (x + X Z1 )· X2 2 1 {(x + X Z1 )(x + Z2 ) + x + y} + y. 13. Return kP = (x0 , y0 ).

2.2

Parallelized arithmetic unit for point doubling & addition

By observing the main loop of the L´opez-Dahab point multiplication scheme in algorithm 1, we can see that both step 7 and step 9 have the same operations except for input and output variables depending on the value of kl . This condition means that we can construct uniform operations structure by appropriate modifications. Based on these properties, Ansari and Hasan [9] proposed a modified version for point doubling and addition (DA) as described in algorithm 2. From algorithm 2, if we use one multiplier, six multiplications are required. By observing the algorithm 2, however, we can find that three multiplications are independently computed, i.e., the multiplications X1 Z2 (T1 ), X2 Z1

Algorithm 2 Point doubling & addition with uniform addressing if kl−2 =1 then Swap(X1 , X2 ), Swap(Z1 , Z2 ) end if for i = l − 2 down to 0 do Z3 ← (X1 Z2 + X2 Z1 )2 X2 ← xZ3 + (X1 Z2 )(X2 Z1 ), Z2 ← Z3 , X1 ← X14 + bZ14 , Z1 ← X12 Z12 if (l 6= 0 and kl 6= kl−1 ) or (l = 0 and kl = 1) then Swap(X1 , X2 ), Swap(Z1 , Z2 ) end if end for

(T2 ), and X1 Z1 can be computed at the first step and the multiplications T1 T2 , xZ3 , and bZ14 can be computed at the second step respectively. From this observation, we can derive a new parallelized version corresponding to the algorithm 2, and a modified version is described in algorithm 3. Algorithm 3 Parallelized version of point doubling & addition with uniform addressing if kl−2 =1 then Swap(X1 , X2 ), Swap(Z1 , Z2 ) end if for i = l − 2 down to 0 do 2 1. T1 ← (X1 Z2 ), T2 ← (X2 Z1 ), Z3 ← (T1 + T2 ) , T3 ← (X1 Z1 )2 , Z2 ← Z3 2. X2 ← T1 T2 + xZ3 , X1 ← bZ14 + X14 , Z1 ← T3 if (l 6= 0 and kl 6= kl−1 ) or (l = 0 and ki = 1) then Swap(X1 , X2 ), Swap(Z1 , Z2 ) end if end for

A data dependence graph for point DA corresponding to the algorithm 3 is given in Fig. 1. Based on the data dependence graph of Fig. 1, we can derive a new arithmetic unit as shown in Fig. 2. As described in Fig. 2, this arithmetic unit is merged version of step 1 and step 2 in the dependence graph of Fig. 1. In Fig. 2, input and output values are controlled by 2-bits Ctrl-A and 1-bit Ctrl-B signals. This leads to a very simple control architecture. The new arithmetic unit of Fig. 2 consists of three multipliers, two adder, six 3-to-1 multiplexers, two buffer registers, and two swap s logic blocks. As a reference, since the A2 operations of an m element A in GF (2 ) are s-bits cyclic shift when GNB is used, any logic gates for these operations are not required. In Fig. 2, Xi , Zi , X14 , and Z14 , b, and x are transferred from register file, and Xi 0 and Zi 0 mean temporary computation

Figure 3. Arithmetic unit for coordinate conversion

scheme. The Itoh-Tsujii’s algorithm is based on the following relation:

Figure 1. Data dependence graph for point doubling & addition

results and directly used as input values for the next iteration, where i ∈ 1, 2. Therefore we can reduce clock cycles for data fetch from register file. The two buffer registers are used to adjust input timing. In other words, the temporary results Z1 and Z2 are computed at the first step, however, these results should be used at the same time as inputs of the next iteration with the temporary results X1 and X2 . From Fig. 2, we can notice that (l − 1)·2(clock cycles for multiplication) clock cycles are required for main loop of the L´opez-Dahab point multiplication.

2.3

Arithmetic unit for coordinate conversion

Unlike point DA algorithm, the coordinate conversion (CC) routine in algorithm 1 includes inversion operation. Therefore, we firstly derive inverter over GF (2m ). Let A be an element in GF (2m ). Since the order of the m multiplicative group GF (2m )× is 2m − 1, we get A2 −1 = 1. That is A−1 = A2

m

−2

= A2(2

m−1

−1)

(3)

Based on (3), Itoh-Tsujii [8] proposed an efficient inversion

s

s

2s − 1 = (2 2 − 1)(2 2 + 1), = 2(2

s−1 2

− 1)(2

s−1 2

if s = even, + 1) + 1,

if s = odd. (4)

The Itoh-Tsujii’s inversion scheme requires blog2 (m−1)c+ H(m − 1) − 1 multiplications in GF (2m ), where H denotes the hamming weight of the binary expansion of given integer. In case of m = 163, we get m − 1 = 162 = (10100010)2 . Thus, the number of required multiplications are 7 + 3 − 1 = 9. Therefore, we can compute the inverse of A in GF (2163 ) in the following order of the exponents. A−1 = A2(2

81

+1)[2(240 +1)(220 +1)(210 +1)(25 +1){2(22 +1)(2+1)+1}+1]

(5) Based on (5), we can derive a new VLSI architecture for CC over GF (2163 ) as shown in Fig. 3. As described in Fig. 3, we add 163(3-to-1) multiplexers, 163(9-to-1) multiplexers, 163 (2-to-1) multiplexers, one GF (2163 ) adder, and control signal with length 8.

Figure 2. Arithmetic unit for point doubling & addition

3 Elliptic curve cryptographic processor over GF (2163 )

multiplication over GF (2163 ), where we assumed word size ω = 55, i.e., L = 3.

3.1

3.2

Datapath for elliptic curve point multiplication over GF (2163 )

A new ECC processor for GF (2163 ), proposed in this paper, is shown in Fig. 4. The ECC processor, shown in Fig. 4, consists of eight main components. Eight components are host interface (HI), data memory, register file, instruction memory, control-1, control-2, AU-1, and AU-2. The HI communicates with host microprocessor, i.e., host microprocessor transmits all parameters for kP to HI with Start signal, and receives kP results and End signal from HI. The data memory consists of 8×163-bit dual port block RAM and the instruction memory contains 13 microcode sequences of 11-bits word. For high performance implementation of point doubling & addition, we add 7 × 163-bit register file, which receives data form HI and transmits temporary computation results (X1 , X2 , Z1 , Z1 ) to data memory. The AU-1 is used for point doubling & addition and controlled by Control-1. The AU-2 performs the coordinate conversion in algorithm 1. The Control-2 receives operation code from instruction memory and generates control signals for AU-2, data memory, and HI. Table 1 gives number of required clock cycles to perform elliptic curve point

FPGA implementation and performance analysis

The ECC processor over GF (2163 ) in Fig. 4 was coded using VHDL and synthesized by Synopsys FPGA Compiler II, in which Xilinx XC4VLX80 was used as the target device. The placement, route process, and timing analysis of the synthesized designs were accomplished using Xilinx’s foundation software. We implemented the design using Libtron’s SYS-Lab 5000 system-on-chip test board which includes Intel PXA272 microprocessor and Xilinx XC4VLX80 FPGA device. Performance comparisons with recently proposed architectures are given in Table 2. From Table 2, it is noted that the proposed architecture is the fastest design including ASIC implementation. As a detailed comparison, our ECC processor is 4.8 times faster than the architecture in [1] which is the best FPGA implementation to up to date to the author’s knowledge. The proposed design, however, uses roughly 2 times more hardware resources than the Shu’s architecture, since one slice of Xilinx’s XC4VLX80 device has 2 LUTs.

Figure 4. Architecture of ECC processor over GF (2163 ) Table 1. Clock cycles required to compute kP over GF (2163 ) # Operations # Clock Cycles Initialize Add.: 1, Sqr.: 1 1 Main Loop 162(Mult.: 2) 162{2(dm/ωe + 1)}=1296 Coordinate Conversion Inv: 3, Mult: 5, Add: 5 32 dm/ωe + 53 =149

Shu et al. [1] 2m , 163-bit, NIST Satoh et al. [2] 2m , 160-bit, Any Gura et al. [3] 2m , 163-bit This Work 2m , 163-bit

Table 2. Performance comparison results Device/Size f(MHz)/Time XCV2000E-7 68.9 25,763 LUTs 48µs 0.13 µs CMOS 510.2 117,500 Gates 190µs XCV2000E-7 66.5 19,508 LUTs 143µs XC4VLX80 143 24,363 Slices 10µs

Remarks 6 MSD Multipliers D = 32, 8 MMM, D = 64 LSD Multiplier, D = 64 3 GNB Multipliers, D = 55

4 Conclusions In this paper, we proposed a high performance ECC processor. We implemented our design using Xilinx XC4VLX80 FPGA device. The proposed architecture uses 24,263 slices and has a maximum frequency of 143MHz. Our design is roughly 4.8 times faster with 2 times increased hardware complexity compared with the previous hardware implementation proposed by Shu. et al. [1]. Therefore, the proposed elliptic curve cryptographic processor is well suited to elliptic curve cryptosystems requiring high throughput rates such as network processors and web servers. Furthermore, since the proposed architecture has the features of modularity and a simple control structure, it can easily be implemented on ASICs or FPFAs by using hardware description languages.

References [1] C. Shu, K. Gaj, and T. El-Ghazawi, “Low Latency Elliptic Curve Cryptography Accelerators for NIST Curves over Binary Fields,” FPT 2005 1965, pp. 309-310, 2005. [2] A. Satoh and K. Takano, “A Scalable Dual-Field Elliptic Curve Cryptographic Processor,” IEEE Trans. Computers, Vol. 52, No. 4, pp. 449-460, April. 2003. [3] N. Gura, S.C. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy, and D. Stebila, “An End-to-End Systems Approach to Elliptic Curve Cryptography,” CHES 2002, Lecture Notes in Computer Science 2523, pp. 349-365, 2002. [4] IEEE 1363, Standard Specifications for Publickey Cryptography, 2000. [5] NIST, Recommended elliptic curves for federal government use, May 1999. http://csrc.nist.gov/encryption. [6] C.H. Kim, Y.K. Kwon, T.H. Kim, S. Kwon, and C.P. Hong, “Word Level Multiplier for GF (2m ) Using Gaussian Normal Basis,” Journal of Korea Information and Communications., Vol. 31, No. 11c, pp. 1120-1127, Nov. 2006. [7] J. L´opez and R. Dahab, “Fast Multiplication on Elliptic Curves over GF (2m ) without Precomputation,” CHES 1999, Lecture Notes in Computer Science Vol. 1717, pp. 316-327, 1999. [8] T. Itoh and S. Tsuji, “A fast algorithm for computing multiplicative inverses GF (2m ) in using normal bases,” Information and Computing, Vol. 78, No. 3, pp. 171-177, 1988. [9] B. Ansari, M. Anwar Hasan, “High Performance Architecture of Elliptic Curve Scalar Multiplication,” Tech. Report CACR 2006-01, 2006.

Related Documents


More Documents from ""