Title: Superscalar Coprocessor for High-speed Curve-based Cryptography
1 Superscalar Coprocessor for High-speed Curve-based Cryptography
- K. Sakiyama, L. Batina, B. Preneel, I. Verbauwhede
- Katholieke Universiteit Leuven / IBBT
- Department Electrical Engineering - ESAT/COSIC
2 Overview
- Introduction
- Curve-based Cryptography
- HW/SW Partitioning
- Superscalar Coprocessor
- Results
- Conclusions
3 Introduction: Motivation
- High-speed curve-based cryptography in HW/SW co-design
  - How much instruction-level parallelism can we obtain from coprocessor instructions?
  - Performance improvement for different operation forms in the datapath
    - AB+C mod P vs A(B+D)+C mod P; A, B, C, D, P: polynomials
  - Performance comparison of three different curve-based cryptosystems
    - Which one is fastest among ECC, HECC, and ECC over a composite field?
- Programmability and scalability
  - Programmable in order to support different cryptosystems?
  - Scalable in field sizes?
4 Introduction: Target Architecture
- Curve-based cryptography over binary fields
  - Hardware can be smaller and faster than for a prime field
- ECC over a binary field, e.g. GF(2^163)
- HECC of genus 2
  - Field length can be shorter by a factor of 2, e.g. GF(2^83)
- ECC over a composite field
  - Field length can be shorter by a factor of 2, e.g. GF((2^83)^2)
- The datapath can be shared
  - Programmable coprocessor supporting three curve-based cryptosystems by defining coprocessor instruction(s)
  - (Coprocessor) instruction-level parallelism by superscalar
6 Curve-based Cryptography: HW/SW partitioning (1)
- General hierarchy in coprocessor for curve-based cryptography
[Figure: three-level hierarchy]
- Point/Divisor Multiplication (SW or HW controller)
- Point/Divisor Addition, Point/Divisor Doubling (SW or HW controller)
- Finite Field Addition, Finite Field Multiplication, Finite Field Inversion (HW Datapath)
7 Curve-based Cryptography: Proposed Hierarchy (1)
- Single instruction for all finite field operations
- Fixed-cycle execution enables efficient implementation
[Figure: conventional hierarchy vs single-instruction hierarchy]
Conventional:
- Point/Divisor Multiplication
- Point/Divisor Addition, Point/Divisor Doubling
- Finite Field Addition, Finite Field Multiplication, Finite Field Inversion
Single Instruction (Datapath):
- Point/Divisor Multiplication
- Point/Divisor Addition, Point/Divisor Doubling, Finite Field Inversion
- Finite Field Operation, e.g. AB+C mod P
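To illustrate why one instruction of this form can cover the other finite field operations, here is a minimal Python sketch. It is a software model only, not the coprocessor's implementation. Polynomials over GF(2) are stored as integers (bit i is the coefficient of x^i). The function names, the Fermat-based inversion, and the GF(2^8) test polynomial 0x11B are assumptions chosen for illustration; the slides do not specify them.

```python
# Software model of a single "A*B + C mod P" instruction over GF(2)[x].
# Assumption: operands a and c have degree below deg(p).

def malu_op(a: int, b: int, c: int, p: int) -> int:
    """Return (a*b + c) mod p; addition of polynomials over GF(2) is XOR."""
    k = p.bit_length() - 1                      # degree of the modulus
    acc = 0
    for i in reversed(range(b.bit_length())):   # scan b from its top bit down
        acc <<= 1                               # acc = acc * x
        if (acc >> k) & 1:                      # degree reached k: subtract p
            acc ^= p
        if (b >> i) & 1:
            acc ^= a
    return acc ^ c

def field_add(a: int, c: int, p: int) -> int:
    return malu_op(a, 1, c, p)                  # B = 1 gives A + C

def field_mul(a: int, b: int, p: int) -> int:
    return malu_op(a, b, 0, p)                  # C = 0 gives A * B

def malu_abdc(a: int, b: int, d: int, c: int, p: int) -> int:
    """A(B+D)+C form: one extra XOR on the input saves a separate addition."""
    return malu_op(a, b ^ d, c, p)

def field_inv(a: int, p: int) -> int:
    """Inverse as a^(2^k - 2), built only from multiplications (assumed method)."""
    k = p.bit_length() - 1
    result, sq = 1, a
    for _ in range(k - 1):
        sq = field_mul(sq, sq, p)               # sq = a^(2^i), i = 1..k-1
        result = field_mul(result, sq, p)
    return result

AES_P = 0x11B  # x^8 + x^4 + x^3 + x + 1, a small stand-in field for testing
print(hex(field_mul(0x57, 0x83, AES_P)))        # 0xc1 (worked example in FIPS 197)
```

In this model inversion sits above the datapath, as a sequence of the single instruction, which matches the hierarchy shown on the slide where Finite Field Inversion moves up next to point/divisor addition and doubling.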
8 Curve-based Cryptography: Modular Arithmetic Logic Unit (MALU)
- (a) Building block: regular XOR chains
- (b) Scalable in digit size (d) and field size (k) by interconnecting several building blocks
- We use MALU83 (n = 83, d = 12) as building block
- 2xMALU83 can be configured as 1xMALU163
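The effect of the digit size d can be sketched in software as a digit-serial multiplier that consumes d bits of B per modelled clock cycle. This is a hedged illustration, not the MALU's actual design: the most-significant-digit-first order, the cycle model of ceil(k/d) cycles per operation, and the function names are assumptions. The degree-163 modulus used for checking is the NIST binary-field reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.

```python
from math import ceil

def malu_cycles(k: int, d: int) -> int:
    """Modelled cycle count: one d-bit digit of B per cycle (assumption)."""
    return ceil(k / d)

def digit_serial_malu(a: int, b: int, c: int, p: int, d: int):
    """Return ((a*b + c) mod p, cycles), processing b in d-bit digits."""
    k = p.bit_length() - 1
    cycles = malu_cycles(k, d)
    acc = 0
    for j in reversed(range(cycles)):            # most significant digit first
        digit = (b >> (j * d)) & ((1 << d) - 1)
        for i in reversed(range(d)):             # d chained XOR rows per cycle
            acc <<= 1                            # multiply accumulator by x
            if (acc >> k) & 1:                   # reduce when degree hits k
                acc ^= p
            if (digit >> i) & 1:
                acc ^= a
    return acc ^ c, cycles

# NIST reduction polynomial for GF(2^163)
P163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

print(malu_cycles(83, 12))    # cycles for a MALU83-sized operand under this model
print(malu_cycles(163, 12))   # cycles for a MALU163-sized operand under this model
```

Under this model a larger d trades more XOR rows per cycle for fewer cycles, while the result stays identical for any d.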
10 HW/SW Partitioning: TYPE I = Smallest implementation (baseline)
[Block diagram: Main CPU with SRAM and Program ROM, connected via Memory Mapped I/O over the Instruction Bus (32-bit instructions) and Data Bus (32-bit data) to the Coprocessor, which contains IBC, DBC, and MALU83]
11 HW/SW Partitioning: TYPE II = TYPE I + µ-code RAM
[Block diagram: Main CPU with SRAM and Program ROM, connected via Memory Mapped I/O over the Instruction Bus (32-bit instructions) and Data Bus (32-bit data) to the Coprocessor, which contains IBC, FSM, µ-code RAM, DBC, and MALU83]
12 HW/SW Partitioning: TYPE III = TYPE I + Coprocessor Memory
[Block diagram: Main CPU with SRAM and Program ROM, connected via Memory Mapped I/O over the Instruction Bus (32-bit instructions) and Data Bus (32-bit data) to the Coprocessor, which contains IBC, DBC, MALU83, and Coprocessor Memory]
13 HW/SW Partitioning: TYPE IV = TYPE I + Copro. Mem. + µ-code RAM
[Block diagram: Main CPU with SRAM and Program ROM, connected via Memory Mapped I/O over the Instruction Bus (32-bit instructions) and Data Bus (32-bit data) to the Coprocessor, which contains IBC, FSM, µ-code RAM, DBC, MALU83, and Coprocessor Memory]
14 HW/SW Partitioning: Co-design flow with GEZEL
[Flow diagram; block labels:]
- C/C++ codes for PKCs
- Partitioning of functions
- C/C++ codes + H/W behavior blocks w/ interface
- ARM (SW): C/C++ codes w/ physical memory map, Cross compile, Program codes
- Co-processor (HW): GEZEL FDL codes, Synthesis, VHDL codes
- Cycle-true sim. (GEZEL)
15 HW/SW Partitioning Result: Vertical Exploration of System
- HECC Performance for different HW/SW partitioning
- (Performance Point/Divisor multiplication)
17 Superscalar Coprocessor: Proposed Hierarchy (2)
- Multiple Modular Arithmetic Logic Units (MALUs) in coprocessor
[Figure: single-MALU hierarchy vs multiple-MALU hierarchy]
Single MALU:
- Point/Divisor Multiplication
- Point/Divisor Addition, Point/Divisor Doubling, Finite Field Inversion
- Finite Field Operation, e.g. AB+C mod P
Multiple MALUs:
- Point/Divisor Multiplication
- Point/Divisor Addition, Point/Divisor Doubling, Finite Field Inversion
- Four Finite Field Operations in parallel, each e.g. AB+C mod P
18 Superscalar Coprocessor: Parallel Processing Architecture (TYPE IV-based)
19 Superscalar Coprocessor: Horizontal Exploration of System
- Performance of ECC and HECC
21 Results: Performance for ECC over GF(2^83)
- Fastest of three
- x1.8 speed-up by 2-way superscaling (ILPDP6) with A(B+D)+C
- Still more improvement is possible by adding MALUs
[Chart: performance with AB+C vs A(B+D)+C]
22 Results: Performance of HECC over GF(2^83)
- Faster than ECC over a composite field
- x2.7 speed-up by 4-way superscaling (ILPDP5) with A(B+D)+C
- Less improvement as the number of MALUs increases
[Chart: performance with AB+C vs A(B+D)+C]
23 Results: Performance for ECC over GF((2^83)^2)
- Slowest of three
- x2.5 speed-up by 4-way superscaling (ILPDP6) with A(B+D)+C
- Less improvement as the number of MALUs increases
[Chart: performance with AB+C vs A(B+D)+C]
24 Results: Comparison of ECC/HECC implementations on FPGAs
[11] T. Wollinger, PhD thesis, 2004.
[13] G. Orlando and C. Paar, CHES 2000.
[14] N. Gura et al., CHES 2002.
[29] Nazar A. Saqib et al., International Journal of Embedded Systems, 2005.
25 Conclusions
- Performance improvement / Comparison
  - ECC was improved by a factor of 1.8 (2-way)
  - HECC (genus 2) was improved by a factor of 2.7 (4-way)
  - ECC over a composite field was improved by a factor of 2.5 (4-way)
  - A(B+D)+C offers better performance than AB+C
  - ECC is the fastest in this case study
- Programmability & flexibility
  - Supports three different curve-based cryptosystems over a binary field
  - Arbitrary irreducible polynomial
  - Field size up to 332 bits by using 4xMALU83
26 Thank you!
27 Parallel issue of instructions: Case of using 4 MALUs
- IF/D: Instruction Fetch / Decode
- R_: Read operands (dependent on the type of operation)
- EX: Execution (dependent on MALU configuration, k and d)
- W_: Write (dependent on the number of instructions issued in parallel)
28 Parallel issue of instructions: Out-of-order Execution
- Check RAW (Read After Write) dependency for in-/out-of-order execution
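The RAW check for parallel issue can be sketched as a small scheduling model. This is an illustration under stated assumptions, not the coprocessor's issue logic: every instruction occupies one MALU for one issue slot (fixed-cycle execution), each register is written at most once so that only RAW hazards arise, and the `Instr` type, `schedule` function, and register names are hypothetical.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction

def schedule(program, n_malus, out_of_order=False):
    """Return issue slots; each slot lists the instruction indices issued together."""
    producer = {ins.dest: i for i, ins in enumerate(program)}  # single assignment
    done = set()                      # instructions whose results are written back
    pending = list(range(len(program)))
    slots = []
    while pending:
        slot = []
        for i in pending:
            if len(slot) == n_malus:  # all MALUs busy in this slot
                break
            # RAW hazard: a source is produced by an instruction not yet completed
            raw = any(s in producer and producer[s] not in done
                      for s in program[i].srcs)
            if raw:
                if out_of_order:
                    continue          # skip it and look further down the program
                break                 # in-order issue stalls at the first hazard
            slot.append(i)
        if not slot:
            raise ValueError("source register is never produced earlier")
        done.update(slot)             # results become visible for the next slot
        pending = [i for i in pending if i not in done]
        slots.append(slot)
    return slots

prog = [
    Instr("t1", ("a", "b", "c")),     # t1 = a*b + c
    Instr("t2", ("t1", "d", "e")),    # depends on t1 (RAW)
    Instr("t3", ("f", "g", "h")),     # independent
    Instr("t4", ("t3", "d", "e")),    # depends on t3 (RAW)
]
print(schedule(prog, 1))                     # [[0], [1], [2], [3]]
print(schedule(prog, 2))                     # [[0], [1, 2], [3]]
print(schedule(prog, 2, out_of_order=True))  # [[0, 2], [1, 3]]
```

In this toy program, in-order issue with two MALUs stalls once on the dependency between the first two instructions, while out-of-order issue fills the second MALU with the independent instruction and finishes in two slots.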