Title: P230/MAPLD 2004
1. Elliptic Curve Cryptography over GF(2^m) on a Reconfigurable Computer: Polynomial Basis vs. Optimal Normal Basis Representation Comparative Study
Kris Gaj, Sashisu Bajracharya, Nghi Nguyen, Deapesh Misra, Tarek El-Ghazawi
George Mason University
The George Washington University
2. What is a reconfigurable computer?
[Figure: a microprocessor system coupled to a reconfigurable processor system. The microprocessor system contains µPs with µP memory, an interface, and I/O; the reconfigurable processor system contains FPGAs with FPGA memory, an interface, and I/O.]
3. Why is cryptography a good application for reconfigurable computers?
- computationally intensive arithmetic operations
- unconventionally long operand sizes (160-2048 bits)
- multiple algorithms, parameters, key sizes, and architectures
- need for reconfiguration
4. SRC Hardware and Software
5. SRC-6E from SRC Computers, Inc.
6. SRC Hardware Architecture
7. SRC Programming
[Figure: SRC programming model. Labels: HLL (C), HDL (VHDL); µP system, SRC FPGA system; Application Programmer, Library Developer.]
8. SRC Compilation Process
9. Elliptic Curve Cryptosystems
10. Elliptic Curve Cryptosystems
- public key (asymmetric) cryptosystems
- first true alternative to RSA
- several times shorter keys
- fast and compact implementations, in particular in hardware
- a family of cryptosystems, instead of a single cryptosystem
11. Three Classes of Elliptic Curves
[Figure: elliptic curves built over K = GF(p) or K = GF(2^m). Labels: arithmetic operations present in many libraries; normal basis representation; polynomial basis representation; fast in hardware; compact in hardware.]
Secure m: m = 155 .. 512
Our m: m = 233
12. Basic operations of ECC
Basic operations in Galois Field GF(2^m)
- addition and subtraction (XOR): x + y, x - y
- multiplication, squaring: x · y, x^2
- inversion: x^-1
Basic operations on points of an Elliptic Curve
- addition of points: P + Q
- doubling a point: 2P
- projective to affine coordinate: P2A
Complex operations on points of an Elliptic Curve
- scalar multiplication: k · P = P + P + ... + P (k times)
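The GF(2^m) field operations listed on this slide can be sketched in Python for the polynomial basis and m = 233. This is only an illustration under stated assumptions: the reduction polynomial x^233 + x^74 + 1 (the NIST trinomial for this field size) and all function names are chosen here, since the slides do not say which polynomial or inversion algorithm the authors used. Inversion is done with Fermat's little theorem, a^-1 = a^(2^m - 2).

```python
M = 233
# Assumption: NIST reduction trinomial x^233 + x^74 + 1 (not stated in the slides).
POLY = (1 << 233) | (1 << 74) | 1


def gf_add(x, y):
    """Addition and subtraction coincide in GF(2^m): bitwise XOR."""
    return x ^ y


def gf_mul(x, y):
    """Shift-and-add multiplication with interleaved modular reduction."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        y >>= 1
        x <<= 1
        if x >> M:          # degree reached M, so reduce by POLY
            x ^= POLY
    return result


def gf_sqr(x):
    """Squaring, written as a plain multiplication for clarity."""
    return gf_mul(x, x)


def gf_inv(x):
    """Inversion via Fermat: x^(2^M - 2) = product of x^(2^i), i = 1..M-1."""
    if x == 0:
        raise ValueError("zero has no inverse")
    result, power = 1, x
    for _ in range(M - 1):
        power = gf_sqr(power)
        result = gf_mul(result, power)
    return result
```

A hardware design would typically replace the bit-serial loop with a parallel or digit-serial multiplier; the sketch only fixes the arithmetic meaning of XOR, MUL, SQR, and INV.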
13. ECC hierarchy of functions
- High level: kP
- projective_to_affine (P2A)
- Medium level: P+Q, 2P
- Low level 2: INV
- Low level 1: XOR, SQR, MUL
Legend: independent of the GF representation; specific to the given GF representation
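To show how the top of this hierarchy rests on the level below it, here is a minimal Python sketch of kP computed by left-to-right double-and-add, using only a point addition (P+Q) and a point doubling (2P) passed in as callbacks. The function names and the toy group (integers modulo a prime under addition) are assumptions made for illustration; they stand in for the real elliptic-curve point operations, and the slides do not state which scalar multiplication algorithm was implemented.

```python
def scalar_mul(k, point, add, double, identity):
    """Left-to-right double-and-add: k*point from P+Q and 2P only."""
    result = identity
    for bit in bin(k)[2:]:          # most significant bit first
        result = double(result)
        if bit == "1":
            result = add(result, point)
    return result


# Hypothetical stand-in group: integers modulo a prime under addition.
# Real ECC code would plug in the curve's P+Q and 2P here instead.
TOY_MODULUS = 2**61 - 1


def toy_add(p, q):
    return (p + q) % TOY_MODULUS


def toy_double(p):
    return (2 * p) % TOY_MODULUS


TOY_IDENTITY = 0
```

Because `scalar_mul` never looks inside a point, it is independent of the GF representation; only the callbacks would change between a polynomial basis and a normal basis implementation.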
14. Investigated Partitioning Schemes
15. SRC Program Partitioning
[Figure: three code layers. C function for µP (HLL, µP system); C function for MAP (HLL, FPGA system); VHDL macro (HDL, FPGA system).]
16. H00 Partitioning (µP Software Only)
- C function for µP: H (kP)
- C function for MAP: 0
- VHDL macro: 0
Legend: specific to the given GF representation
17. 00H Partitioning (VHDL only)
- C function for µP: 0
- C function for MAP: 0
- VHDL macro: H (kP)
Legend: specific to the given GF representation
18. 0HL1 Partitioning
Legend: independent of the GF representation; specific to the given GF representation
19. FPGA Contents (0HL1)
[Figure: blocks placed in the FPGA: kP, MUL4, P+Q, 2P, MUL2, MUL, POW, INV, P2A.]
20. 0HL2 Partitioning
- C function for µP: 0
- C function for MAP: H (kP, P2A, P+Q, 2P)
- VHDL macros: L2 (MUL4, ROT, SQR, XOR, INV)
Legend: independent of the GF representation; specific to the given GF representation
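A ROT block is plausibly present among these macros because squaring in a normal basis is just a cyclic rotation of the coefficient vector; the slides do not say this explicitly, so treat it as an interpretation. The Python sketch below checks the property on a toy field, GF(2^4), rather than the m = 233 field used in the study. The field polynomial x^4 + x^3 + x^2 + x + 1, the basis {β, β^2, β^4, β^8} with β = x, and all names are assumptions chosen for illustration.

```python
M = 4
MOD = 0b11111  # x^4 + x^3 + x^2 + x + 1, irreducible over GF(2)


def pb_mul(a, b):
    """Reference multiplication in polynomial basis, reduced modulo MOD."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MOD
    return result


# Normal basis elements beta^(2^i), stored in polynomial-basis form.
BASIS = [0b0010]                     # beta = x
for _ in range(M - 1):
    BASIS.append(pb_mul(BASIS[-1], BASIS[-1]))


def nb_to_pb(v):
    """Convert a normal-basis coefficient vector (bit i = a_i) to polynomial basis."""
    acc = 0
    for i in range(M):
        if (v >> i) & 1:
            acc ^= BASIS[i]
    return acc


def nb_square(v):
    """Squaring in normal basis: rotate the M-bit vector left by one position."""
    return ((v << 1) | (v >> (M - 1))) & ((1 << M) - 1)
```

The rotation direction depends on how coefficients are mapped to bit positions; with the opposite bit ordering the same operation becomes a right rotation.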
21. 0HM Partitioning
- C function for µP: 0
- C function for MAP: H
- VHDL macros: M (P+Q, 2P, P2A)
Legend: independent of the GF representation; specific to the given GF representation
22. Results
23. Timing Measurements
[Figure: timeline of a MAP function call made from the .c file into the .mc file. Phases: MAP Alloc., FPGA Configure, DMA Data In, FPGA Computation, DMA Data Out, MAP Free. Measured times: End-to-End time (HW), End-to-End time (SW), MAP Allocation time, Configuration time, MAP Release time.]
24. Results for Optimal Normal Basis (Latency)
25. Results for Polynomial Basis (Latency)
26. Results for Optimal Normal Basis (Area)
27. Results for the Polynomial Basis (Area)
28. Number of lines of code
Algorithm Partitioning Scheme   VHDL PB   VHDL ONB   Macro Wrapper   MAP C   Main C
0HL1                            N/A       1007       260             371     153
0HL2                            714       1291       230             349     153
0HM                             N/A       1744       160             185     153
00H                             N/A       1960       36              78      153
29. Conclusions
Assuming focus on:
- Timing
- Resources
- Ease of programming
30. Best implementation approaches
- Optimal Normal Basis: 0HL1 scheme
- Polynomial Basis: 0HL2 scheme
- Large speedup vs. software
- Ease of implementation
- Flexibility
31. Conclusions
- Elliptic Curve Cryptosystem implementation challenging for reconfigurable computers because of
  - optimization for latency rather than throughput
  - limited amount of parallelism
- Absolute latency and resource utilization similar for Optimal Normal Basis and Polynomial Basis
- Number of lines of VHDL code smaller for the polynomial basis representation
32. Conclusions cont.
Speed-up over Intel P3 microprocessor implementation:
- from 893 to 1305 times for Optimal Normal Basis
- from 33 times for Polynomial Basis
- 27x greater for Optimal Normal Basis compared to Polynomial Basis representation, because polynomial basis operations are more efficient in software