Title: CITS Workshop on Factorization
1 Workshop on Factorization, CITS Bochum, 11-12 September 2009
Implementing the Elliptic Curve Method (ECM) on Special-Purpose Hardware
Ralf Zimmermann, Tim Güneysu, Christof Paar
Horst Görtz Institute for IT-Security, Ruhr-University Bochum
2 Outline
- Introduction
- Background and Arithmetic on Modern FPGAs
- Overview of Elliptic Curve Method (ECM)
- Initial Implementation of ECM (Phase 1)
- Optimization and Implementation of ECM (Phase 1 & 2)
- Results
- Conclusions
4 Design Goals and Decisions
- Usage of the Elliptic Curve Method (ECM) in co-factorization
- Goals of this work group
- ECM implementation for small bit sizes (up to 200 bits)
- Implement both Phase 1 and Phase 2
- Usable on COPACOBANA for massive parallelism
- Four individual phases of this work package
- Determine the best platform for ECM
- Redesign the COPACOBANA architecture for the target platform
- Implement ECM on the target platform
- Optimize ECM on the target platform
5 Selection Process
- Task: find the best platform for ECM (other than CPUs)!
- Based on three (predefined) platforms
- FPGAs: Spartan-3/Virtex-4
- Digital Signal Processors (DSP): TI C6713
- Smartcards with PK accelerator: Siemens SL88
Winner of this competition: Virtex-4 FPGAs
6 Design Goals and Decisions II
- Use of elliptic curves in Montgomery form
- Efficient formulas for hardware
- Computation/storage of the y-coordinate can be omitted
- External inputs provided by the host PC
- Initial values: k (Phase 1), prime table (Phase 2)
- Each unit: curve parameters, modified moduli
- GCD on host PC
8 General FPGA Architecture
- Configurability by a mesh of programmable elements
- Configurable Logic Blocks (CLBs)
- Modern FPGAs contain tens of thousands of CLBs
- Connection via interconnect (switching matrix)
- Modern FPGAs contain hardcores
- Dedicated memory elements
- Arithmetic hardcores, e.g., to accelerate integer multiplication and addition
- Embedded PowerPC processors
- High-speed I/O transceivers
9 Generic FPGA Structure (simplified)
[Figure: simplified FPGA structure showing long lines, switch matrices, input/output blocks, and Configurable Logic Blocks.]
10 Embedded Memory Elements in Virtex-4
- 18 kbit storage element (BlockRAM)
- 400 MHz with connected output register
- Flexible BRAM configuration
- Dual-Port storage
- Single-port storage
- RAM or ROM
- Cascading possible
11 Digital Signal Processing (DSP) Hardcores
- Fast signed 18x18-bit multiplication and signed 48-bit addition/subtraction (400 MHz)
- Integrated pipeline register
- Controllable by an OPMODE signal
- DSP elements can be cascaded
12 Simplified Architecture of Virtex-4 FPGAs
[Figure: simplified Virtex-4 layout with CLBs, DSP elements, Block RAM elements, and an optional PowerPC core.]
Location and columnwise alignment of elements are important for place and route of FPGA designs!
13 Basic Modular Addition
- Input: modulus M
- Operands: A and B, both bounded by M
- Output: A + B (mod M)
- S = A + B
- S' = A + B - M
- If (b = 1) then
- Return S
- Else
- Return S'
[Figure: worked example in decimal digits with A = 2354, B = 4875, M = 6888. The addition gives S = 7229, the subtraction gives S' = S - M = 0341 with final borrow b = 0, so S' is returned.]
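The selection between S and S' by the final borrow can be sketched in software. The following is a minimal illustration, not the hardware design: the function name `mod_add`, the default block width of 17 bits, and the block count are assumptions chosen for the example. Both results are computed block by block, and the borrow of the subtraction decides which one is returned.

```python
def mod_add(a, b, m, width=17, words=9):
    """Blockwise modular addition: returns (a + b) mod m for a, b < m.

    Assumes m < 2**(width * words). One extra block is processed so that
    the carry out of the top block takes part in the subtraction.
    """
    mask = (1 << width) - 1
    carry = borrow = 0
    s = s_red = 0
    for i in range(words + 1):
        shift = i * width
        aw, bw, mw = ((v >> shift) & mask for v in (a, b, m))
        t = aw + bw + carry                 # block of S = A + B
        carry, sw = t >> width, t & mask
        u = sw - mw - borrow                # block of S' = S - M
        borrow, rw = (1, u + (1 << width)) if u < 0 else (0, u)
        s |= sw << shift
        s_red |= rw << shift
    # A final borrow means S < M, so the unreduced sum is already correct.
    return s if borrow else s_red
```

With the example values from the slide, `mod_add(2354, 4875, 6888)` takes the no-borrow branch and returns S'.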
14 DSP-supported Modular Addition
[Figure: data flow of the DSP-supported modular addition for the example values A = 2354, B = 4875, M = 6888.]
15 Basic Modular Multiplication
- Modular multiplication with quotient pipelining (Orup)
- Input: modulus-dependent parameter $M' = M'(M, k, d)$, operands $A$, $B$
- Output: $S = A \cdot B \cdot R^{-1} \pmod{M}$
- // n is the number of rounds
- for $i = 0$ to $n$ do
- $q_i = S_i \bmod 2^k$
- $S_{i+1} = \lfloor S_i / 2^k \rfloor + q_i M' + b_i A$
- Return $S_{n+1}$
[Figure: A, B, and M are split into blocks of k bits (a_0 ... a_n, b_0 ... b_n, m_0 ... m_n). Each round selects block S_{i,0}, performs the scalar multiplications, accumulates the blocks S_{i,j}, and shifts by k bits.]
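The loop above can be tried out in a few lines of Python. This is a sketch under stated assumptions, not the FPGA implementation: it uses delay d = 0, and it assumes the modulus-dependent parameter is built as in Orup's construction, i.e. M~ = (-M^-1 mod 2^k) * M and M' = (M~ + 1) / 2^k. The names `precompute_m_prime` and `mont_mul` are hypothetical. The result is congruent to A * B * 2^(-k*n) modulo M but is not fully reduced.

```python
def precompute_m_prime(m, k):
    """Modulus-dependent constant for odd m, delay d = 0 (assumed construction)."""
    m_inv_neg = (-pow(m, -1, 1 << k)) % (1 << k)   # -M^-1 mod 2^k
    m_tilde = m_inv_neg * m                        # M~ is congruent to -1 mod 2^k
    return (m_tilde + 1) >> k                      # exact division by 2^k

def mont_mul(a, b, m_prime, k, n):
    """Rounds i = 0..n; requires b < 2**(k * n) so that block b_n is zero."""
    mask = (1 << k) - 1
    s = 0
    for i in range(n + 1):
        q = s & mask                     # q_i = S_i mod 2^k
        b_i = (b >> (i * k)) & mask      # i-th k-bit block of B
        s = (s >> k) + q * m_prime + b_i * a
    return s                             # congruent to a*b*2^(-k*n) mod m
```

For k = 17 and n = 9 this covers operands of up to 151 bits; a final reduction modulo M (or keeping values in this redundant form between operations) is left to the caller.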
17 Introduction to Elliptic Curve Method
- Phase 1 and 2 of the elliptic curve method
- Phase 1 was originally proposed by Lenstra
- Idea: Pollard's p-1 method adapted for elliptic curves
- Brent and Montgomery extended Phase 1 by a continuation (Phase 2)
- Cost-intensive operations in Phase 1 & 2
- Phase 1 is basically a scalar multiplication
- Gaj et al. described an algorithm for the standard continuation (CHES 2006)
- Standard continuation in hardware
- Precomputations
- Scalar multiplications jQ
- Storing primes in a table
- Main computations
- Point addition
- Accumulation of product d
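Since Phase 1 is a scalar multiplication, the host has to supply the scalar k. A common choice, assumed here and not taken from the slides, is the product of all maximal prime powers not exceeding the bound B1. The helper names below are hypothetical.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes; returns all primes <= limit (limit >= 2)."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [p for p in range(2, limit + 1) if sieve[p]]

def phase1_scalar(b1):
    """k = product over primes p <= B1 of the largest power of p that is <= B1."""
    k = 1
    for p in primes_up_to(b1):
        q = p
        while q * p <= b1:
            q *= p
        k *= q
    return k
```

`phase1_scalar(960).bit_length()` gives the size of k for the bound used later in the talk. The same sieve could also produce the Phase 2 prime table for the range between B1 and B2.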
19 Example ECM Operations
- Arithmetic unit comprises 2 multipliers and 1 addition/subtraction component
- Example: concurrent point doubling/point addition in 6 steps
- Implementation of the Montgomery ladder for ECC point multiplication
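The Montgomery ladder with x-only arithmetic can be sketched as follows. This is a software illustration using the standard textbook formulas for curves in Montgomery form, not the 6-step hardware schedule from the slide. The names `xdbl`, `xadd`, and `ladder` are hypothetical, and `a24` stands for (A + 2)/4 modulo n. Each ladder step performs one differential addition and one doubling, which is the pair of operations the arithmetic unit runs concurrently.

```python
def xdbl(x, z, a24, n):
    """x-only doubling of (x : z) on a Montgomery curve, a24 = (A + 2) / 4 mod n."""
    s = (x + z) ** 2 % n
    d = (x - z) ** 2 % n
    t = (s - d) % n                      # equals 4*x*z
    return s * d % n, t * (d + a24 * t) % n

def xadd(xp, zp, xq, zq, xd, zd, n):
    """Differential addition: P + Q from P, Q and the difference P - Q = (xd : zd)."""
    u = (xp - zp) * (xq + zq) % n
    v = (xp + zp) * (xq - zq) % n
    return zd * (u + v) ** 2 % n, xd * (u - v) ** 2 % n

def ladder(k, x, z, a24, n):
    """Returns (X : Z) of k*P for P = (x : z) and k >= 1."""
    x0, z0 = x, z                        # R0 = P
    x1, z1 = xdbl(x, z, a24, n)          # R1 = 2P, so R1 - R0 = P throughout
    for bit in bin(k)[3:]:               # remaining bits after the leading 1
        if bit == "1":
            x0, z0 = xadd(x0, z0, x1, z1, x, z, n)
            x1, z1 = xdbl(x1, z1, a24, n)
        else:
            x1, z1 = xadd(x0, z0, x1, z1, x, z, n)
            x0, z0 = xdbl(x0, z0, a24, n)
    return x0, z0
```

In an ECM setting n would be the composite to be factored, and the host would compute the GCD of the returned Z coordinate with n.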
20 Initial Project Implementation
- Initial project implemented only Phase 1
- Pros
- Proof-of-concept implementation
- DSP implementation of multiplier and addition/subtraction
- 8 units calculating Phase 1
- Cons
- Very complex instructions (size and routing)
- Addition works with a different bit width than multiplication
- Special way to feed the multiplier with data
- Design reaches the DSP limit
- Extension to Phase 2 might not be possible
21 Initial ECM System Architecture
[Figure: ECM system composed of ECM units.]
- 8 independent units
- supports up to 151-bit numbers
- Each unit computes a different parameter set
22 Initial ECM Systems on COPACOBANA
- Host computer performs (simple) precomputations
- COPACOBANA performs cost-intensive operations
- Each module consists of 8 Virtex-4 FPGAs
- For more information about COPACOBANA, please see tomorrow's talk
23 Initial Multiplication with DSPs
- 151-bit multiplier using 10 DSP elements
- Special structure: $a_j$ / $a_{j+5}$ and $m_j$ / $m_{j+5}$
- Output $s_j$ after 66 clock cycles in 17-bit blocks
24 Initial Data Storage
[Figure: BRAMs grouped into storage blocks I, II, and III.]
- BRAM 0 to 5 contain input values for the multipliers
- BRAM 6/7 contain input values for the addition/subtraction unit
- BRAM 8 contains moduli for multiplication and addition/subtraction
- Multipliers transfer 17-bit result blocks to storage block I (BRAM 0-2), or storage block II (BRAM 3-5), and/or storage block III (BRAM 6-7)
- Addition/subtraction transfers a 34-bit result block to one of storage blocks I/II/III
Too complex! Routing kills the frequency → max. 100 of 400 MHz
26 Optimization of Addition/Subtraction
- 17-bit internal bus width
- No conversion between add/sub (34 bit) and multiplication (17 bit)
- Less space in fabric when buffering bus signals
- Better routing possible
- Changed code of the modular addition/subtraction algorithm
- Maximized clock frequency under optimal conditions
- Removed output buffers and distributed RAM
27 Optimization of Multiplication
- Previous approach to multiplying
- Use n DSPs to multiply n·17-bit values (parallel) → (n+1)·6 cycles
- New approach
- Use 3 DSPs to multiply n·17-bit values (sequential) → 6 + (n+2)·n cycles
- Here n = 10: reduce DSPs by 7 at the cost of 60 clock cycles
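The trade-off can be checked with a short calculation. The two cycle formulas are read from the slide as (n+1)·6 for the parallel design and 6 + (n+2)·n for the sequential one; the function names are only illustrative.

```python
def cycles_parallel(n):
    """Previous design: n DSP elements working in parallel."""
    return (n + 1) * 6

def cycles_sequential(n):
    """New design: 3 DSP elements processing the n blocks sequentially."""
    return 6 + (n + 2) * n

n = 10
extra_cycles = cycles_sequential(n) - cycles_parallel(n)
dsps_saved = n - 3
```

For n = 10 the parallel count matches the 66 clock cycles stated for the initial 151-bit multiplier, and the difference matches the 60 extra cycles paid for saving 7 DSP elements.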
28 Optimization of Multiplication
- Multiplication results
- → Algorithmic-Logic Unit (ALU) results
Promising results for the basic structure!
29 Optimization of Memory Usage
- Instructions
- Reduced bit width by a factor > 2.5
- Instructions for all ECM operations: 1 BRAM
- ECM instructions merged with ECM parameters k and prime table
- Memory usage per unit
- ALU: 6 BRAMs
- ECM unit: 1 BRAM
(1) Project 1 uses global instructions for Phase 1, using 17 BRAMs for 8 units.
(2) Project 2 uses 2 BRAMs, one for ALU and one for ECM instructions, for both phases.
30 Implementation of ECM Phase 2
- Different clock domains
- 100 MHz for the ECM finite state machine (FSM)
- 200 MHz for the algorithmic-logic unit (ALU)
- Workspace RAM: ALU playground
- 32 cells, each containing 2 blocks
- Each block stores 16 values of 17 bits
- → 17,408 bits (equals 1 BRAM)
[Figure: workspace RAM layout with regions for Phase 2 precomputation, Phase 2, P1/P2 temporary result storage, point addition/doubling results, and P2 pre temporary points.]
31 Optimizing Phase 2 Precalculation
- From Gaj et al. (CHES 2006)
Calculate the set J_S of integers j
Use the multiples j·Q0
32 Optimizing Phase 2 Precalculation II
- Suggested parameter D = 210 for hardware
- Which multiples of Q0 are needed?
- J_S(D) = {1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103}
- Calculation suggested in [1] calculates J_S(D) = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105}
- Are there more needed?
- 210·Q0 and MMIN·D·Q0
- But: point addition on elliptic curves?
- To calculate (P_x : P_z) + (Q_x : Q_z) efficiently we need the coordinates ((P-Q)_x : (P-Q)_z)
- → What to do with R = R + Q?
33 Optimizing Phase 2 Precalculation III
- What about MMIN·D?
- MMIN = round(B1 / D)
- Suggested B1 = 960 → MMIN = 5
- MMIN·D = 1050
- What about R - Q?
- Start: R = 5·D·Q0, Q = D·Q0
- Point addition needs R - Q = 4·D·Q0 = 840·Q0
- Iteration increases the factor by one: 5·D·Q0, 6·D·Q0, ...
- Calculations suggested (A = point addition, D = point doubling)
- Odd multiples: 1 D + 52 A
- 210·Q0, 1050·Q0: (8 + 11) A+D using the Montgomery ladder
- 840·Q0: never mentioned, 10 A+D using the Montgomery ladder
- → not implemented and/or inefficient
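The Phase 2 parameters discussed on the last slides can be reproduced with a few lines. This sketch assumes that J_S(D) consists of the integers j with 1 <= j <= D/2 and gcd(j, D) = 1, which reproduces the 24 values listed for D = 210. The function name `phase2_params` is hypothetical.

```python
from math import gcd

def phase2_params(b1, d):
    """Returns (J_S, MMIN) for the standard continuation with parameter D = d."""
    # Assumed definition of J_S: residues up to D/2 that are coprime to D.
    js = [j for j in range(1, d // 2 + 1) if gcd(j, d) == 1]
    # MMIN = round(B1 / D), written with integer arithmetic.
    m_min = (b1 + d // 2) // d
    return js, m_min

js, m_min = phase2_params(960, 210)
start_multiple = m_min * 210          # multiple of Q0 where R starts
diff_multiple = (m_min - 1) * 210     # multiple of Q0 needed as R - Q
```

Besides the 24 points j·Q0, the differential addition therefore needs D·Q0, (MMIN - 1)·D·Q0 and MMIN·D·Q0 to be available before the main loop starts.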
34 Implementation of Phase 2 Precalculation
- Use a chain of point additions / doublings to calculate multiples
- Number of operations: 30 A + 7 D
- Number of points in RAM: 27 + 2 temporary
- → reduce number of operations drastically
- → calculate every point needed for addition
- → calculate every point needed for Phase 2
35 Implementation of Phase 2 Precalculation II
- Parameters might change: what now?
- Change of B1 results in a change of MMIN
- MMIN = 0: not possible
- MMIN = 1: first R + Q changes to doubling Q
- MMIN > 1 is implementable using this strategy
- Required space in RAM: unchanged
- Required space in fabric: unchanged
- Varies the number of operations in precomputation
37 Results of the new ECM implementation
[Results figure: stages data transfer, Phase 1, Phase 2, data transfer.]
Results for ECM Phase 1 & 2 (B1 = 960 and B2 = 57600)
38 Results for ECM Phase 1 & 2
ECM Phase 1 & 2 using B1 = 960, B2 = 57600, 1723-bit k
69,120 ECM calculations / second on COPACOBANA (1)
(1) COPACOBANA using 4 FPGA modules (= 32 FPGAs)
39 Comparison (ECM Phase 1)
Comparing ECM Phase 1 implementations using B1 = 960, B2 = 57000, 1323-bit k
(1) Cost for 1 FPGA: www.em.avnet.com (September 2009)
41 Conclusions
- Novel and complete implementation
- Implementation results: implementation of Phase 2 on FPGA, not just estimates
- Optimization results: at least twice as effective as the documented result
- Generic and scalable system
- Architecture can be easily optimized for (n·17) - 2 bit with 2 < n < 14; implemented: n = 9
- Only small changes in fabric/resources, but exchange of instruction ROM
- Highly parallel architecture for co-factorization
- Multiple ECM units per FPGA (24 units supporting 151-bit on Virtex-4 SX 35)
- Multiple FPGAs on the COPACOBANA cluster (128 Virtex-4 SX 35)
42 Thank you for your attention! Any questions?
- Ralf Zimmermann,
- Tim Güneysu,
- Christof Paar
43 Redesign of FPGA Module for ECM
Original: 6x Spartan-3 XC3S1000
Redesign: 8x Virtex-4 XC4VSX35
44 COPACOBANA with Virtex-4 Devices