CITS Workshop on Factorization - PowerPoint PPT Presentation

About This Presentation
Title:

CITS Workshop on Factorization

Description:

Ralf Zimmermann, Tim Güneysu, Christof Paar. Horst Görtz Institute for IT-Security ... Idea: Pollard's p-1 method adapted for elliptic curves ...


Transcript and Presenter's Notes

Title: CITS Workshop on Factorization


1
Workshop on Factorization, CITS Bochum, 11-12 September 2009
Implementing the Elliptic Curve Method (ECM) on Special-Purpose Hardware
Ralf Zimmermann, Tim Güneysu, Christof Paar
Horst Görtz Institute for IT-Security
Ruhr-University Bochum
2
Outline
  • Introduction
  • Background and Arithmetic on Modern FPGAs
  • Overview of Elliptic Curve Method (ECM)
  • Initial Implementation of ECM (Phase 1)
  • Optimization and Implementation of ECM (Phases 1 and 2)
  • Results
  • Conclusions

4
Design Goals and Decisions
  • Use of the Elliptic Curve Method (ECM) in co-factorization
  • Goals of this work group:
      • ECM implementation for small bit lengths (up to 200 bits)
      • Implement both Phases 1 and 2
      • Usable on COPACOBANA for massive parallelism
  • Four individual phases of this work package:
      • Determine the best platform for ECM
      • Redesign the COPACOBANA architecture for the target platform
      • Implement ECM on the target platform
      • Optimize ECM on the target platform

5
Selection Process
  • Task: find the best platform for ECM (other than CPUs)!
  • Based on three (predefined) platforms:
      • FPGAs: Spartan-3/Virtex-4
      • Digital Signal Processors (DSP): TI C6713
      • Smartcards with PK accelerator: Siemens SL88

Winner of this competition: Virtex-4 FPGAs
6
Design Goals and Decisions II
  • Use of elliptic curves in Montgomery form
      • Efficient formulas for hardware
      • Computation/storage of the y-coordinate can be omitted
  • External inputs provided by the host PC
      • Initial values: k (Phase 1), prime table (Phase 2)
      • For each unit: curve parameters, modified moduli
  • GCD computed on the host PC

8
General FPGA Architecture
  • Configurability by mesh of programmable elements
  • Configurable Logic Blocks (CLB)
  • Modern FPGAs contain tens of thousands of CLBs
  • Connection via interconnect (switching matrix)
  • Modern FPGAs contain hardcores
  • Dedicated memory elements
  • Arithmetic hardcores, e.g., to accelerate integer
    multiplication and addition
  • Embedded PowerPC processors
  • High-speed I/O Transceivers

9
Generic FPGA Structure (simplified)
[Figure labels: long lines, switch matrix, input/output, configurable logic block]
10
Embedded Memory Elements in Virtex-4
  • 18kbit storage element (BlockRAM)
  • 400 MHz with connected output register
  • Flexible BRAM configuration
  • Dual-Port storage
  • Single-port storage
  • RAM or ROM
  • Cascading possible

11
Digital Signal Processing (DSP) Hardcores
  • Fast signed 18x18-bit multiplication and signed
    48-bit addition/subtraction (400 MHz)
  • Integrated pipeline register
  • Controllable by an OPMODE signal
  • DSP elements can be cascaded

12
Simplified Architecture of Virtex-4 FPGAs
[Figure labels: CLB, PowerPC (optional), DSP elements, Block RAM elements]
Location and column-wise alignment of elements are important for place and route of FPGA designs!
13
Basic Modular Addition
  • Input: modulus M, operands A < M, B < M
  • Output: A + B (mod M)
  • S = A + B
  • S' = A + B - M
  • If (b = 1) then
      • Return S
  • Else
      • Return S'

Worked example (decimal digits): A = 2354, B = 4875, M = 6888. The sum is S = A + B = 7229. The subtraction S' = S - M gives 0341 with final borrow b = 0, so the result is S' = 341.
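The add, subtract and select steps of this slide can be sketched digit by digit as follows. This is a minimal Python model under stated assumptions, not the authors' hardware design: the helper names `to_limbs` and `mod_add_limbs` and the default of four decimal digits are illustrative choices that mirror the slide's example.

```python
def to_limbs(x, n, base):
    """Split x into n digits (limbs) in the given base, least significant first."""
    return [(x // base**i) % base for i in range(n)]

def mod_add_limbs(a, b, m, n=4, base=10):
    """Return (a + b) mod m for 0 <= a, b < m < base**n.

    Hypothetical model: S = A + B with a carry chain, S' = S - M with a
    borrow chain, then the final borrow b selects S (b = 1) or S' (b = 0).
    """
    A, B, M = (to_limbs(v, n, base) for v in (a, b, m))
    s, carry = [], 0
    for i in range(n):                       # S = A + B
        carry, digit = divmod(A[i] + B[i] + carry, base)
        s.append(digit)
    s.append(carry)                          # the sum may need one extra limb
    s_prime, borrow = [], 0
    for i, m_i in enumerate(M + [0]):        # S' = S - M
        t = s[i] - m_i - borrow
        borrow = 1 if t < 0 else 0
        s_prime.append(t + base * borrow)
    chosen = s if borrow else s_prime        # final borrow decides
    return sum(d * base**i for i, d in enumerate(chosen))

print(mod_add_limbs(2354, 4875, 6888))
```

The same function can model 17-bit limbs by passing `base=1 << 17` and a matching limb count `n`.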
14
DSP-supported Modular Addition
[Figure: digit-wise worked example with A = 2354, B = 4875, M = 6888, yielding A + B = 7229 and A + B - M = 0341]
15
Basic Modular Multiplication
  • Modular multiplication with quotient pipelining (Orup)
  • Input: modulus-dependent parameter M̃(M, k, d), operands A, B
  • Output: S = A · B · R^-1 (mod M)
  • // n is the number of rounds
  • for i = 0 to n do
      • q_i = S_i (mod 2^k)
      • S_(i+1) = S_i / 2^k + q_i · M̃ + b_i · A
  • Return S_(n+1)

[Figure labels: block of k bits; operand blocks a_0 ... a_n, b_0 ... b_n, m_0 ... m_n; block selection of S_i,0; shift by k bits; blocks S_i,j; scalar multiplication; accumulation]
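To make the round structure concrete, here is a minimal sketch of plain radix-2^k Montgomery multiplication in Python. It is a swapped-in, simpler relative of the algorithm on this slide: it does not use Orup's quotient pipelining or the precomputed modulus-dependent parameter, and the function name and argument layout are assumptions made for illustration. It does compute the same kind of result, A · B · R^-1 (mod M) with R = 2^(k·n).

```python
def montgomery_multiply(a, b, m, k, n):
    """Return a * b * R^-1 mod m with R = 2**(k*n).

    Assumes m is odd and 0 <= a, b < m < R. Plain interleaved Montgomery
    multiplication; not the pipelined variant named on the slide.
    """
    mask = (1 << k) - 1
    m_prime = (-pow(m, -1, 1 << k)) & mask    # -m^-1 mod 2^k (Python 3.8+)
    s = 0
    for i in range(n):
        b_i = (b >> (k * i)) & mask           # i-th k-bit block of B
        s += b_i * a                          # accumulate b_i * A
        q_i = (s * m_prime) & mask            # quotient digit
        s = (s + q_i * m) >> k                # exact division by 2^k
    return s - m if s >= m else s             # single conditional subtraction

# Example with 17-bit blocks and 9 rounds (R = 2^153)
m = (1 << 151) - 195
a, b = pow(3, 200, m), pow(5, 300, m)
print(montgomery_multiply(a, b, m, 17, 9))
```

The quotient digit q_i is chosen so that s + q_i · m is divisible by 2^k, which is what makes the shift exact.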
17
Introduction to Elliptic Curve Method
  • Phases 1 and 2 of the elliptic curve method
      • Phase 1 was originally proposed by Lenstra
      • Idea: Pollard's p-1 method adapted for elliptic curves
      • Brent and Montgomery extended Phase 1 by a continuation (Phase 2)
  • Cost-intensive operations in Phases 1 and 2
      • Phase 1 is basically a scalar multiplication
      • Gaj et al. described an algorithm for the standard continuation (CHES 2006)
  • Standard continuation in hardware
      • Precomputations:
          • Scalar multiplications jQ
          • Storing primes in a table
      • Main computations:
          • Point addition
          • Accumulation of the product d

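The scalar used in the Phase 1 multiplication is not defined in the transcript. A common choice, assumed here, is the product of all maximal prime powers not exceeding B1, which equals lcm(1, ..., B1). The function name `phase1_scalar` is hypothetical.

```python
from math import gcd

def phase1_scalar(b1):
    """Return lcm(1, ..., b1), the product of all maximal prime powers <= b1.

    Assumption: this is the usual ECM Phase 1 scalar k; the slides only say
    that k is provided by the host PC.
    """
    k = 1
    for i in range(2, b1 + 1):
        k = k * i // gcd(k, i)
    return k

k = phase1_scalar(960)
print(k.bit_length())   # size of k in bits for B1 = 960
```

Because k depends only on B1, the host can compute it once and send it to every ECM unit.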
19
Example ECM Operations
  • The arithmetic unit comprises 2 multipliers and 1 addition/subtraction component
  • Example: concurrent point doubling/point addition in 6 steps
  • Implementation of the Montgomery ladder for ECC point multiplication

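The ladder mentioned on this slide can be sketched in software with the standard x-only formulas for Montgomery curves, where each ladder step performs one doubling and one differential addition. This is a minimal sketch, not the authors' 6-step hardware schedule. The function names, the constant a24 = (A + 2) / 4 and the helper `ecm_phase1_try` are illustrative assumptions.

```python
from math import gcd

def xdbl(P, a24, n):
    """x-only doubling on a Montgomery curve; a24 = (A + 2) / 4 mod n."""
    X, Z = P
    s = (X + Z) ** 2 % n
    d = (X - Z) ** 2 % n
    t = (s - d) % n                          # equals 4 * X * Z
    return (s * d % n, t * (d + a24 * t) % n)

def xadd(P, Q, diff, n):
    """x-only differential addition: x(P + Q) from x(P), x(Q) and x(P - Q)."""
    (XP, ZP), (XQ, ZQ), (XD, ZD) = P, Q, diff
    u = (XP - ZP) * (XQ + ZQ) % n
    v = (XP + ZP) * (XQ - ZQ) % n
    return (ZD * (u + v) ** 2 % n, XD * (u - v) ** 2 % n)

def ladder(k, P, a24, n):
    """Montgomery ladder: x([k]P) in projective form (X : Z), for k >= 1."""
    R0, R1 = P, xdbl(P, a24, n)              # invariant: R1 - R0 = P
    for bit in bin(k)[3:]:                   # bits of k below the leading 1
        if bit == "1":
            R0, R1 = xadd(R0, R1, P, n), xdbl(R1, a24, n)
        else:
            R0, R1 = xdbl(R0, a24, n), xadd(R0, R1, P, n)
    return R0

def ecm_phase1_try(n, k, x0, a24):
    """One Phase 1 attempt (hypothetical helper): gcd(Z, n) may reveal a factor."""
    _, Z = ladder(k, (x0 % n, 1), a24, n)
    return gcd(Z, n)
```

Only the X and Z coordinates appear, so the y-coordinate never has to be computed or stored. Whether a given curve and starting point actually yield a nontrivial factor depends on the group order, so a real run would try many parameter sets.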
20
Initial Project Implementation
  • The initial project implemented only Phase 1
  • Pros:
      • Proof-of-concept implementation
      • DSP implementation of multiplier and addition/subtraction
      • 8 units calculating Phase 1
  • Cons:
      • Very complex instructions (size and routing)
      • Addition works with a different bit width than multiplication
      • Special way to feed the multiplier with data
      • Design reaches the DSP limit
      • Extension to Phase 2 might not be possible

21
Initial ECM System Architecture
[Figure labels: ECM system, ECM unit]
  • 8 independent units
  • supports up to 151-bit numbers
  • Each unit computes a different parameter set

22
Initial ECM Systems on COPACOBANA
  • The host computer performs (simple) precomputations
  • COPACOBANA performs the cost-intensive operations
  • Each module consists of 8 Virtex-4 FPGAs
  • For more information about COPACOBANA, please see the talk tomorrow

23
Initial Multiplication with DSPs
  • 151-bit multiplier using 10 DSP elements
  • Special structure: a_j, a_(j+5) and m_j, m_(j+5)
  • Output s_j after 66 clock cycles in 17-bit blocks

24
Initial Data storage
[Figure labels: storage block I, storage block II, storage block III]
  • BRAMs 0 to 5 contain input values for the multipliers
  • BRAMs 6/7 contain input values for the addition/subtraction unit
  • BRAM 8 contains the moduli for multiplication and addition/subtraction
  • The multipliers transfer 17-bit result blocks to storage block I (BRAMs 0-2), storage block II (BRAMs 3-5), and/or storage block III (BRAMs 6-7)
  • The addition/subtraction unit transfers 34-bit result blocks to one of the storage blocks I/II/III

Too complex! Routing kills the frequency → max. 100 of 400 MHz
26
Optimization of Addition/Subtraction
  • 17-bit internal bus width
      • No conversion between add/sub (34 bit) and multiplication (17 bit)
      • Less space in fabric when buffering bus signals
      • Better routing possible
  • Changed code of the modular addition/subtraction algorithm
      • Maximized clock frequency under optimal conditions
      • Removed output buffers and distributed RAM

27
Optimization of Multiplication
  • Previous approach to multiplying
      • Use n DSPs to multiply (n · 17)-bit values (parallel) → (n + 1) · 6 cycles
  • New approach
      • Use 3 DSPs to multiply (n · 17)-bit values (sequential) → 6 + (n + 2) · n cycles
  • Here n = 10: reduces the number of DSPs by 7 at the cost of 60 clock cycles (126 instead of 66)

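The trade-off on this slide is simple arithmetic and can be checked with a few lines. The operators in the cycle formulas were lost in the transcript, so the readings (n + 1) · 6 and 6 + (n + 2) · n used below are assumptions. They are supported by the 66-cycle figure on the earlier multiplier slide and by the stated cost of 60 cycles for n = 10. The function names are hypothetical.

```python
def cycles_parallel(n):
    """Assumed reading of the old design: n DSPs, (n + 1) * 6 cycles."""
    return (n + 1) * 6

def cycles_sequential(n):
    """Assumed reading of the new design: 3 DSPs, 6 + (n + 2) * n cycles."""
    return 6 + (n + 2) * n

n = 10
extra = cycles_sequential(n) - cycles_parallel(n)
print(cycles_parallel(n), cycles_sequential(n), extra, n - 3)
```

For n = 10 this gives 66 and 126 cycles, a difference of 60 cycles in exchange for 7 fewer DSP elements per multiplier.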
28
Optimization of Multiplication
  • Multiplication results
  • → Arithmetic logic unit (ALU) results

Promising results for the basic structure!
29
Optimization of Memory Usage
  • Instructions
      • Bit width reduced by a factor > 2.5
      • Instructions for all ECM operations: 1 BRAM
      • ECM instructions merged with the ECM parameters (k and prime table)
  • Memory usage per unit
      • ALU: 6 BRAMs
      • ECM unit: 1 BRAM

[1] Project 1 uses global instructions for Phase 1, using 17 BRAMs for 8 units.
[2] Project 2 uses 2 BRAMs, one for ALU and one for ECM instructions, for both phases.
30
Implementation of ECM Phase 2
  • Different clock domains
      • 100 MHz for the ECM finite state machine (FSM)
      • 200 MHz for the arithmetic logic unit (ALU)
  • Workspace RAM: ALU playground
      • 32 cells, each containing 2 blocks
      • Each block stores 16 values of 17 bits
      • → 32 · 2 · 16 · 17 = 17,408 bits (equals 1 BRAM)

[Figure labels (workspace RAM): Phase 2 precomputation; Phase 2; P1/P2 temporary result storage; point addition/doubling results; P2 pre temporary points]
31
Optimizing Phase 2 Precalculation
  • From Gaj et al. (CHES 2006)

Calculate the set J_S of integers j
Use the multiples j·Q0 of Q0
32
Optimizing Phase 2 Precalculation II
  • Suggested parameter for hardware: D = 210
  • Which multiples of Q0 are needed?
      • J_S(D) = {1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103}
      • The calculation suggested in [1] computes J_S(D) = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105}
  • Are more needed?
      • 210·Q0 and M_MIN·D·Q0
  • But: point addition on elliptic curves?
      • To calculate (P_x : P_z) + (Q_x : Q_z) efficiently, we need the coordinates ((P-Q)_x : (P-Q)_z)
      • → What to do with R = R + Q?

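The 24 values listed for D = 210 are exactly the integers up to D/2 that are coprime to D. The transcript does not state this definition, so it is an assumption here, taken from the usual standard continuation; it reproduces the listed set. The function name `j_set` is hypothetical.

```python
from math import gcd

def j_set(d):
    """Integers j with 1 <= j <= d // 2 and gcd(j, d) == 1 (assumed definition of J_S)."""
    return [j for j in range(1, d // 2 + 1) if gcd(j, d) == 1]

needed = j_set(210)
all_odd = list(range(1, 106, 2))      # every odd multiple up to 105
print(len(needed), len(all_odd))
```

Computing every odd multiple up to 105·Q0 produces 53 points, while only 24 of them are needed for Phase 2.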
33
Optimizing Phase 2 Precalculation III
  • What about M_MIN · D?
      • M_MIN = round(B1 / D)
      • Suggested B1 = 960 → M_MIN = 5
      • M_MIN · D = 1050
  • What about R - Q?
      • Start: R = 5D·Q0, Q = D·Q0
      • Point addition needs R - Q = 4D·Q0 = 840·Q0
      • Each iteration increases the factor by one: 5D·Q0, 6D·Q0, ...
  • Calculations suggested (A = point addition, D = point doubling):
      • 1 D + 52 A; 210·Q0, 1050·Q0: (8 + 11) (A + D) using the Montgomery ladder
      • 840·Q0 is never mentioned: 10 (A + D) using the Montgomery ladder
  • → not implemented and/or inefficient

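The role of M_MIN and of the starting multiples can be checked numerically. The sketch below assumes the usual standard-continuation idea that every prime p between B1 and B2 is written as p = m·D ± j with j taken from the set of integers coprime to D; the helper names are hypothetical and the rounding is done in integer arithmetic.

```python
from math import gcd, isqrt

def m_min(b1, d):
    """round(b1 / d) in integer arithmetic."""
    return (2 * b1 + d) // (2 * d)

def split_prime(p, d):
    """Return (m, j) with p = m*d + j or p = m*d - j and 0 <= j <= d // 2."""
    m = (2 * p + d) // (2 * d)           # index of the nearest multiple of d
    return m, abs(p - m * d)

def primes_between(lo, hi):
    """All primes p with lo < p <= hi (simple sieve of Eratosthenes)."""
    sieve = bytearray([1]) * (hi + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, isqrt(hi) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, hi + 1, i)))
    return [p for p in range(lo + 1, hi + 1) if sieve[p]]

B1, B2, D = 960, 57600, 210
start = m_min(B1, D)
print(start, start * D, (start - 1) * D)   # M_MIN, first R multiple, first R - Q multiple
```

With these parameters the first multiple is 1050·Q0 and the difference point needed for the first addition is 840·Q0, matching the slide.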
34
Implementation of Phase 2 Precalculation
  • Use a chain of point additions/doublings to calculate the multiples
      • Number of operations: 30 A + 7 D
      • Number of points in RAM: 27 + 2 temporary
  • → drastically reduces the number of operations
  • → calculates every point needed for addition
  • → calculates every point needed for Phase 2

35
Implementation of Phase 2 Precalculation II
  • Parameters might change: what now?
  • A change of B1 results in a change of M_MIN
      • M_MIN = 0: not possible
      • M_MIN = 1: the first R + Q changes to a doubling of Q
      • M_MIN > 1: implementable using this strategy
  • Required space in RAM: unchanged
  • Required space in fabric: unchanged
  • The number of operations in the precomputation varies

37
Results of the new ECM implementation
[Figure labels: data transfer, Phase 1, Phase 2, data transfer]
Results for ECM Phases 1 and 2 (B1 = 960 and B2 = 57600)
38
Results for ECM Phases 1 and 2
ECM Phases 1 and 2 using B1 = 960, B2 = 57600, 1723 bit k
69,120 ECM calculations per second on COPACOBANA [1]
[1] COPACOBANA using 4 FPGA modules (32 FPGAs)
39
Comparison (ECM Phase 1)
Comparing ECM Phase 1 implementations using B1 = 960, B2 = 57000, 1323 bit k
[1] Cost for 1 FPGA: www.em.avnet.com (September 2009)
41
Conclusions
  • Novel and complete implementation
      • Implementation results: Phase 2 implemented on FPGA, not just estimated
      • Optimization results: at least twice as effective as the documented result
  • Generic and scalable system
      • Architecture can easily be optimized for (n · 17 - 2)-bit numbers with 2 < n < 14; implemented: n = 9
      • Only small changes in fabric/resources, but the instruction ROM must be exchanged
  • Highly parallel architecture for co-factorization
      • Multiple ECM units per FPGA (24 units supporting 151-bit numbers on Virtex-4 SX 35)
      • Multiple FPGAs on the COPACOBANA cluster (128 Virtex-4 SX 35)

42
Thank you for your attention!
Any questions?
  • Ralf Zimmermann,
  • Tim Güneysu,
  • Christof Paar

43
Redesign of FPGA Module for ECM
Original: 6 x Spartan-3 XC3S1000
Redesign: 8 x Virtex-4 XC4VSX35
44
COPACOBANA with Virtex-4 Devices