CITS Workshop on Factorization

1
Workshop on Factorization, CITS Bochum, 11-12 September 2009
Implementing the Elliptic Curve Method (ECM) on Special-Purpose Hardware
Ralf Zimmermann, Tim Güneysu, Christof Paar
Horst Görtz Institute for IT-Security, Ruhr-University Bochum
2
Outline

Introduction
Background and Arithmetic on Modern FPGAs
Overview of Elliptic Curve Method (ECM)
Initial Implementation of ECM (Phase 1)
Optimization and Implementation of ECM (Phases 1 & 2)
Results
Conclusions

4
Design Goals and Decisions

Usage of Elliptic Curve Method (ECM) in co-factorization
Goals of this work group:
ECM implementation for small numbers (up to 200 bits)
Implement both Phases 1 and 2
Usable on COPACOBANA for massive parallelism
Four individual phases of this work package:
Determine the best platform for ECM
Redesign COPACOBANA architecture for target platform
Implement ECM on the target platform
Optimize ECM on the target platform

5
Selection Process

Task: Find the best platform for ECM (≠ CPUs)!
Based on three (predefined) platforms:
FPGAs: Spartan-3/Virtex-4
Digital Signal Processors (DSP): TI C6713
Smartcards with PK accelerator: Siemens SL88

Winner of this competition: Virtex-4 FPGAs
6
Design Goals and Decisions II

Use of Elliptic Curves in Montgomery Form
Efficient formulas for hardware
Computation/storage of y-coordinate can be omitted
External inputs provided by the host PC
Initial values: k (Phase 1), prime table (Phase 2)
Each unit: curve parameters, modified moduli
GCD on Host PC

8
General FPGA Architecture

Configurability by mesh of programmable elements
Configurable Logic Blocks (CLB)
Modern FPGAs contain tens of thousands of CLBs
Connection via interconnect (switching matrix)
Modern FPGAs contain hardcores
Dedicated memory elements
Arithmetic hardcores, e.g., to accelerate integer multiplication and addition
Embedded PowerPC processors
High-speed I/O Transceivers

9
Generic FPGA Structure (simplified)
[Figure labels: long lines, switch matrix, input/output, Configurable Logic Block]
10
Embedded Memory Elements in Virtex-4

18kbit storage element (BlockRAM)
400 MHz with connected output register
Flexible BRAM configuration
Dual-Port storage
Single-port storage
RAM or ROM
Cascading possible

11
Digital Signal Processing (DSP) Hardcores

Fast signed 18x18-bit multiplication and signed 48-bit addition/subtraction (400 MHz)
Integrated pipeline register
Controllable by an OPMODE signal
DSP elements can be cascaded
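
A signed 18x18-bit multiplier can take unsigned operands of at most 17 bits, which fits the 17-bit blocks that appear later in the deck. The following is a minimal Python sketch of assembling a wide product from 17x17-bit partial products. The little-endian limb layout and the helper names `to_limbs`, `from_limbs` and `limb_multiply` are assumptions made for illustration; they are not taken from the slides and do not model the actual DSP wiring.

```python
W = 17  # unsigned word width assumed to fit a signed 18x18-bit multiplier


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 17-bit limbs."""
    mask = (1 << W) - 1
    return [(x >> (W * i)) & mask for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 17-bit limbs into one integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))


def limb_multiply(a, b, n):
    """Schoolbook product built only from 17x17-bit partial products."""
    a_limbs, b_limbs = to_limbs(a, n), to_limbs(b, n)
    acc = 0
    for i, a_i in enumerate(a_limbs):
        for j, b_j in enumerate(b_limbs):
            partial = a_i * b_j              # at most 34 bits wide
            assert partial < (1 << 48)       # would fit a 48-bit accumulator
            acc += partial << (W * (i + j))  # weight by the limb position
    return acc
```

With n = 9 limbs the operands may be up to 153 bits wide, so 151-bit values fit.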

12
Simplified Architecture of Virtex-4 FPGAs
[Figure labels: CLB, PowerPC (optional), DSP elements, Block RAM elements]
Location and columnwise alignment of elements are important for place and route of FPGA designs!
13
Basic Modular Addition
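
Input: Modulus M, operands A ≤ M, B ≤ M
Output: A + B (mod M)
S = A + B
S' = A + B - M
If (b = 1) then
Return S
Else
Return S'

Worked example from the figure (decimal digits): A = 2354, B = 4875, M = 6888
S = A + B = 7229 (carries 0, 1, 1, 0, starting at the least significant digit)
S' = S - M = 0341 with borrow (b) = 0, so S' is returned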
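
The add-then-conditionally-subtract scheme on this slide can be sketched in a few lines of Python. This is only an illustration under the assumption that 0 <= A, B < M, so that a single subtraction is enough; the function name `mod_add` is hypothetical, and the hardware evaluates both candidates digit-serially rather than on big integers.

```python
def mod_add(a, b, m):
    """Return (a + b) mod m, assuming 0 <= a, b < m."""
    s = a + b                             # S  = A + B
    s_prime = s - m                       # S' = A + B - M
    borrow = 1 if s_prime < 0 else 0      # b = 1 means the subtraction underflowed
    return s if borrow == 1 else s_prime  # pick the candidate that lies in [0, M)
```

Both candidates are computed unconditionally and the borrow only selects between them, which mirrors the slide's structure.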
14
DSP-supported Modular Addition
[Figure: the same example (A = 2354, B = 4875, M = 6888) computed with DSP elements, yielding S = 7229 and S' = 0341]
15
Basic Modular Multiplication

Modular Multiplication with Quotient Pipelining (Orup)
Input: Modulus-dependent parameter M'(M, k, d), operands A, B
Output: S = A · B · R^-1 (mod M)
// n is number of rounds
for i = 0 to n do
q_i = S_i (mod 2^k)
S_(i+1) = S_i / 2^k + q_i · M' + b_i · A
Return S_(n+1)

[Figure: A, B and M are split into blocks of k bits (a_0 ... a_n, b_0 ... b_n, m_0 ... m_n); in each round the block S_(i,0) is selected, S_i is shifted by k bits, and the scalar multiplications are accumulated into the blocks S_(i,j)]
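
The recurrence above can be tried out with a short Python sketch. It covers only the case without pipeline delay (d = 0) and assumes an odd modulus, R = 2^(k·n), and the parameter M' = (M · (-M^-1 mod 2^k) + 1) / 2^k. That formula, the names `orup_param` and `mont_mul`, and the final `% m` reduction are assumptions added for illustration, not details confirmed by the slides.

```python
def orup_param(m, k):
    """Precompute M' = (M * (-M^-1 mod 2^k) + 1) / 2^k for an odd modulus M."""
    r = 1 << k
    m_hat = m * ((-pow(m, -1, r)) % r)  # multiple of M that is -1 modulo 2^k
    return (m_hat + 1) >> k             # exact division by 2^k


def mont_mul(a, b, m, k, n):
    """Return a * b * 2^(-k*n) mod m, consuming b in n blocks of k bits."""
    m_prime = orup_param(m, k)
    mask = (1 << k) - 1
    s = 0
    for i in range(n + 1):              # rounds i = 0 .. n, with b_n = 0
        b_i = (b >> (k * i)) & mask if i < n else 0
        q_i = s & mask                  # q_i = S_i mod 2^k
        s = (s >> k) + q_i * m_prime + b_i * a
    return s % m                        # final reduction, added for this sketch
```

Each round needs only a shift, two block-by-operand multiplications and an accumulation, which is why the method maps well onto multiply-accumulate hardware. `pow(m, -1, r)` requires Python 3.8 or newer.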
17
Introduction to Elliptic Curve Method

Phases 1 and 2 of the elliptic curve method
Phase 1 was originally proposed by Lenstra
Idea: Pollard's p-1 method adapted for elliptic curves
Brent and Montgomery extended Phase 1 by a continuation (Phase 2)
Cost-intensive operations in Phases 1 & 2
Phase 1 is basically a scalar multiplication
Gaj et al. described an algorithm for the standard continuation (CHES 2006)
Standard continuation in hardware
Precomputations:
Scalar multiplications jQ
Storing primes in a table
Main computations:
point addition
accumulation of product d
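
Since Phase 1 is a single scalar multiplication, the scalar itself can be precomputed on the host. The sketch below uses the usual textbook definition, k = product of the largest prime powers not exceeding B1. The slides do not spell out this definition, so treat it and the helper names `primes_up_to` and `phase1_scalar` as assumptions.

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: all primes p <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            # Clear every multiple of p starting at p*p.
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return [p for p in range(2, n + 1) if sieve[p]]


def phase1_scalar(b1):
    """k = product over primes p <= B1 of the largest power p^e <= B1."""
    k = 1
    for p in primes_up_to(b1):
        power = p
        while power * p <= b1:
            power *= p
        k *= power
    return k
```

Calling `phase1_scalar(960).bit_length()` gives the width of the scalar for the B1 used later in the deck.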

19
Example ECM Operations

Arithmetic unit comprises 2 multipliers and 1 addition/subtraction component
Example: concurrent point doubling/point addition in 6 steps
Implementation of the Montgomery ladder for ECC point multiplication
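
To make the ladder concrete, here is a minimal Python sketch of x-only arithmetic on a Montgomery curve B·y^2 = x^3 + a·x^2 + x, using the standard differential addition and doubling formulas. It does not reproduce the 6-step hardware schedule. The function names, the choice of starting point and the `ecm_phase1` wrapper are assumptions for illustration only.

```python
from math import gcd


def x_dbl(P, a24, n):
    """Double a point given as (X : Z); a24 = (a + 2) / 4 mod n."""
    X, Z = P
    s = (X + Z) ** 2 % n
    d = (X - Z) ** 2 % n
    t = (s - d) % n                   # equals 4*X*Z
    return (s * d % n, t * (d + a24 * t) % n)


def x_add(P, Q, diff, n):
    """Add P and Q given the known difference diff = P - Q."""
    (X1, Z1), (X2, Z2), (X0, Z0) = P, Q, diff
    u = (X1 - Z1) * (X2 + Z2) % n
    v = (X1 + Z1) * (X2 - Z2) % n
    return (Z0 * (u + v) ** 2 % n, X0 * (u - v) ** 2 % n)


def ladder(k, P, a24, n):
    """Montgomery ladder: (X : Z) of k*P for k >= 1."""
    r0, r1 = P, x_dbl(P, a24, n)      # invariant: r1 - r0 = P
    for bit in bin(k)[3:]:            # bits below the leading 1
        if bit == "1":
            r0, r1 = x_add(r0, r1, P, n), x_dbl(r1, a24, n)
        else:
            r0, r1 = x_dbl(r0, a24, n), x_add(r0, r1, P, n)
    return r0


def ecm_phase1(n, k, a, x):
    """One Phase 1 trial for odd n; returns gcd(Z of k*P, n)."""
    a24 = (a + 2) * pow(4, -1, n) % n
    _, z = ladder(k, (x % n, 1), a24, n)
    return gcd(z, n)
```

Every ladder step performs exactly one doubling and one differential addition, and the y-coordinate never appears. A return value strictly between 1 and n is a nontrivial factor; 1 or n means the trial failed and another curve is needed.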

20
Initial Project Implementation

Initial project implemented only Phase 1
Pros
Proof-of-concept implementation
DSP implementation of multiplier and addition/subtraction
8 units calculating Phase 1
Cons
Very complex instructions (size and routing)
Addition works with a different bit width than multiplication
Special way to feed the multiplier with data
Design reaches the DSP limit
Extension to Phase 2 might not be possible

21
Initial ECM System Architecture
[Figure labels: ECM system, ECM unit]

8 independent units
Supports up to 151-bit numbers
Each unit computes a different parameter set

22
Initial ECM Systems on COPACOBANA

Host computer performs (simple) precomputations
COPACOBANA performs cost-intensive operations
Each module consists of 8 Virtex-4 FPGAs
For more information about COPACOBANA, please see the talk tomorrow

23
Initial Multiplication with DSPs

151-bit multiplier using 10 DSP elements
Special structure: a_j, a_(j+5) and m_j, m_(j+5)
Output s_j after 66 clock cycles in 17-bit blocks

24
Initial Data storage
[Figure labels: storage block I, storage block II, storage block III]

BRAM 0 to 5 contain input values for the multipliers
BRAM 6/7 contain input values for the addition/subtraction unit
BRAM 8 contains moduli for multiplication and addition/subtraction
Multipliers transfer 17-bit result blocks to storage block I (BRAM 0-2), storage block II (BRAM 3-5), and/or storage block III (BRAM 6-7)
Addition/subtraction transfers 34-bit result blocks to one of the storage blocks I/II/III

Too complex! Routing kills the frequency → at most 100 MHz out of a possible 400 MHz
26
Optimization of Addition/Subtraction

17-bit internal bus width
No conversion between add/sub (34 bit) and multiplication (17 bit)
Less space in fabric when buffering bus signals
Better routing possible
Changed code of modular addition/subtraction algorithm
Maximized clock frequency under optimal conditions
Removed output buffers and distributed RAM

27
Optimization of Multiplication

Previous approach to multiplying
Use n DSPs to multiply (n · 17)-bit values (parallel) → (n + 1) · 6 cycles
New approach
Use 3 DSPs to multiply (n · 17)-bit values (sequential) → 6 + (n + 2) · n cycles
Here n = 10: reduce DSPs by 7 at the cost of 60 clock cycles
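
The trade-off can be checked numerically. The sketch below reads the two cycle counts on this slide as (n + 1) · 6 for the parallel design and 6 + (n + 2) · n for the sequential one. This reading is an assumption; it is consistent with the 66 cycles quoted for the initial multiplier and with the stated cost of 60 extra cycles. The function names are hypothetical.

```python
def cycles_parallel(n):
    """Cycle count of the initial design using n DSP elements."""
    return (n + 1) * 6


def cycles_sequential(n):
    """Cycle count of the optimized design using 3 DSP elements."""
    return 6 + (n + 2) * n


def tradeoff(n):
    """Return (DSPs saved, extra clock cycles) when moving to 3 DSPs."""
    return n - 3, cycles_sequential(n) - cycles_parallel(n)
```

For n = 10 this gives 66 and 126 cycles, that is, 7 DSP elements saved for 60 additional cycles per multiplication.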

28
Optimization of Multiplication

Multiplication results
→ Arithmetic logic unit (ALU) results

Promising results for the basic structure!
29
Optimization of Memory Usage

Instructions
Reduced bit width by factor > 2.5
Instructions for all ECM operations: 1 BRAM
ECM instructions merged with ECM parameters k and prime table
Memory usage per unit
ALU: 6 BRAMs
ECM unit: 1 BRAM

(1) Project 1 uses global instructions for Phase 1, using 17 BRAMs for 8 units
(2) Project 2 uses 2 BRAMs, one for ALU and one for ECM instructions, for both phases
30
Implementation of ECM Phase 2

Different clock domains
100 MHz for ECM finite state machine (FSM)
200 MHz for arithmetic logic unit (ALU)
Workspace RAM: ALU playground
32 cells, each containing 2 blocks
Each block stores 16 values of 17 bits
→ 32 · 2 · 16 · 17 = 17,408 bits (equals 1 BRAM)

[Figure labels: Phase 2 precomputation; Phase 2; P1/P2 temporary result storage; point addition / doubling results; P2 pre temporary points]
31
Optimizing Phase 2 Precalculation

From Gaj et al. (CHES 2006):

Calculate set JS of integers j
Use the j-th multiple of Q0
32
Optimizing Phase 2 Precalculation II

Suggested parameter D = 210 for hardware
Which multiples of Q0 are needed?
JS(D) = {1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103}
Calculation suggested in [1] calculates JS(D) = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105}
Are there more needed?
210·Q0 and MMIN·D·Q0
But point addition on elliptic curves?
To calculate (Px : Pz) + (Qx : Qz) efficiently, we need the coordinates ((P-Q)x : (P-Q)z)
→ What to do with R = R + Q?
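
The shorter of the two sets on this slide consists of exactly the integers j with 1 <= j <= D/2 and gcd(j, D) = 1. A minimal Python sketch reproduces it and counts how many multiples of Q0 each variant has to store; the function names `j_set` and `odd_set` are hypothetical.

```python
from math import gcd


def j_set(d):
    """Integers j in [1, d/2] that are coprime to d."""
    return [j for j in range(1, d // 2 + 1) if gcd(j, d) == 1]


def odd_set(d):
    """All odd integers in [1, d/2], as in the longer list on the slide."""
    return list(range(1, d // 2 + 1, 2))
```

For D = 210 the coprime set has 24 entries while the odd set has 53, so restricting to coprime j more than halves the number of stored points.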

33
Optimizing Phase 2 Precalculation III

What about MMIN · D?
MMIN = round(B1 / D) (mathematical rounding)
Suggested B1 = 960 → MMIN = 5
MMIN · D = 1050
What about R - Q?
Start: R = 5·D·Q0, Q = D·Q0
Point addition needs R - Q = 4·D·Q0 = 840·Q0
Iteration increases factor by one: 5·D·Q0, 6·D·Q0, ...
Calculations suggested
[1]: 1 D + 52 A
210·Q0, 1050·Q0: (8 + 11) AD using Montgomery ladder
840·Q0: never mentioned; 10 AD using Montgomery ladder
→ not implemented and/or inefficient
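
The numbers on this slide can be reproduced with a few lines of Python. The sketch assumes that MMIN is B1 / D rounded half up, and that a Montgomery ladder costs one combined addition/doubling step per bit of the scalar. The second assumption is an interpretation of the 8, 11 and 10 quoted above, not something the slides state. The function names are hypothetical.

```python
def phase2_start(b1, d):
    """Return (MMIN, MMIN * D, (MMIN - 1) * D) with B1 / D rounded half up."""
    m_min = (2 * b1 + d) // (2 * d)  # integer form of floor(B1 / D + 1/2)
    return m_min, m_min * d, (m_min - 1) * d


def ladder_steps(k):
    """Assumed ladder cost: one addition/doubling step per bit of k."""
    return k.bit_length()
```

For B1 = 960 and D = 210 this yields MMIN = 5, the starting multiple 1050·Q0, and the difference point 840·Q0 that the differential addition requires.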

34
Implementation of Phase 2 Precalculation

Use a chain of point additions / doublings to calculate multiples
Number of operations: 30 A + 7 D (A = point additions, D = point doublings)
Number of points in RAM: 27 + 2 temporary
→ reduce number of operations drastically
→ calculate every point needed for addition
→ calculate every point needed for Phase 2

35
Implementation of Phase 2 Precalculation II

Parameters might change: what now?
Change of B1 results in a change of MMIN
MMIN = 0: not possible
MMIN = 1: first R + Q changes to doubling Q
MMIN > 1: implementable using this strategy
Required space in RAM unchanged
Required space in fabric unchanged
Varies the number of operations in precomputation

37
Results of the new ECM implementation
[Figure: timeline of data transfer, Phase 1, Phase 2, data transfer]
Results for ECM Phases 1 & 2 (B1 = 960 and B2 = 57600)
38
Results for ECM Phases 1 & 2
ECM Phases 1 & 2 using B1 = 960, B2 = 57600, 1723-bit k
69,120 ECM calculations / second on COPACOBANA (1)
(1) COPACOBANA using 4 FPGA modules (= 32 FPGAs)
39
Comparison (ECM Phase 1)
Comparing ECM Phase 1 implementations using B1 = 960, B2 = 57000, 1323-bit k
(1) Cost for 1 FPGA: www.em.avnet.com (September 2009)
41
Conclusions

Novel and complete implementation
Implementation results: implementation of Phase 2 on FPGA, not just estimates
Optimization results: at least twice as effective as the documented result
Generic and scalable system
Architecture can be easily optimized for ((n · 17) - 2)-bit numbers with 2 < n < 14; implemented: n = 9 (151 bits)
Only small changes in fabric/resources, but exchange of the instruction ROM
Highly parallel architecture for co-factorization
Multiple ECM units per FPGA (24 units supporting 151-bit on Virtex-4 SX 35)
Multiple FPGAs on the COPACOBANA cluster (128 Virtex-4 SX 35)