Title: Montgomery
1 Montgomery's Multiplication Technique: How to Make it Smaller and Faster
- Colin D. Walter
- Computation Department, UMIST, UK
- www.co.umist.ac.uk
2 Peter Montgomery
- "Modular Multiplication without Trial Division", Math. Computation, vol. 44 (1985), 519-521
- Computes $(A \times B) \bmod M$ without obtaining the digits of the quotient $q = (A \times B) / M$
3 Motivation
- Faster RSA cryptosystem, through a pipelined array
- Safer encryption, against timing or DPA attacks
4 Overview
- RSA Notation
- Classical Algorithm
- Montgomery's Version
- Comparison
- carry propagation
- digit distribution
- communication
- timing/power attacks
- Conclusion
5 Enigma
- Special purpose: Colossus (1943-44)
- Tommy Flowers,
- Bletchley Park, England.
- General purpose: ENIAC (1943-46)
- John Eckert and John Mauchly
- Philadelphia, US.
6 RSA
- Modulus $M$ of around 1024 bits
- Two keys $d$ and $e$ such that $A^{de} \equiv A \bmod M$
- $A$ encrypted to $C = A^e \bmod M$
- $C$ decrypted by $A = C^d \bmod M$
- $M = PQ$, a product of two large primes
- $e$ is often small (e.g. a Fermat prime)
- $d$ satisfies $de \equiv 1 \bmod (P-1)(Q-1)$
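The key relations above can be tried out with toy numbers. This is only a sketch: the primes 61 and 53, the exponent 17 and the helper names `encrypt` and `decrypt` are illustrative assumptions, far too small for real use.

```python
# Toy RSA parameters (illustrative assumptions, not from the slides).
P, Q = 61, 53
M = P * Q                    # modulus M = PQ
phi = (P - 1) * (Q - 1)      # (P-1)(Q-1)
e = 17                       # small public exponent (a Fermat prime)
d = pow(e, -1, phi)          # d with d*e = 1 mod (P-1)(Q-1); needs Python 3.8+


def encrypt(A: int) -> int:
    """C = A^e mod M."""
    return pow(A, e, M)


def decrypt(C: int) -> int:
    """A = C^d mod M."""
    return pow(C, d, M)


A = 65
C = encrypt(A)
assert decrypt(C) == A       # A^(de) = A mod M
```

Every modular multiplication hidden inside `pow` here is the operation that the rest of the talk implements in hardware.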
7 Faster H/W ⇒ More Secure Encryption
- Work to factorize $M$ doubles for every extra 15 bits (for key lengths around $2^{10}$ bits)
- Work to en/decrypt grows by a factor of
- $((1024+15)/1024)^2$ per multiplication
- $((1024+15)/1024)^3$ per exponentiation,
- i.e. only about 5% extra!
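The cost figures above are plain arithmetic and can be reproduced in a couple of lines. The variable names are illustrative; the quadratic and cubic cost model is the one stated on the slide.

```python
bits = 1024
extra = 15

growth = (bits + extra) / bits          # relative increase in operand length
per_multiplication = growth ** 2        # multiplication cost grows quadratically
per_exponentiation = growth ** 3        # exponentiation cost grows cubically

print(f"multiplication: +{(per_multiplication - 1) * 100:.1f}%")
print(f"exponentiation: +{(per_exponentiation - 1) * 100:.1f}%")
```

So doubling the attacker's factoring work costs the legitimate user a few percent of extra en/decryption work.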
8 Number representations
- $X = \sum_{i=0}^{n-1} x_i r^i$
- $r = 2^k$ is the radix (prime to $M$)
- $x_i$ is the $i$th digit (usually $0 \le x_i < r$)
- $n$ = max no. of digits in any number
- Redundant reps
- wider digit range than $0 .. r-1$
- H/W is built from $k \times k$-bit multipliers
- $n$ fixed by H/W register size
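A minimal sketch of the radix-$2^k$ representation above. The helper names `to_digits` and `from_digits` are hypothetical; `from_digits` also accepts redundant digits outside $0 .. r-1$.

```python
def to_digits(X: int, k: int, n: int) -> list[int]:
    """Split X into n digits x_0 .. x_(n-1) in radix r = 2^k, least significant first."""
    r = 1 << k
    if not 0 <= X < r ** n:
        raise ValueError("X does not fit in n digits")
    return [(X >> (k * i)) & (r - 1) for i in range(n)]


def from_digits(x: list[int], k: int) -> int:
    """Evaluate X = sum of x_i * r^i; digits may be redundant (x_i >= r is allowed)."""
    r = 1 << k
    return sum(x_i * r ** i for i, x_i in enumerate(x))
```

For example, `to_digits(0x1234, 8, 3)` gives the digits `0x34`, `0x12`, `0`, and the redundant digit list `[259, 1]` in radix 256 evaluates to the same value as `[3, 2]`.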
9 Redundancy
- Digits $x_j$ split into carry-save parts: $x_j = x_{j,s} + r x_{j,c}$
- $X \leftarrow X + Y$ is performed by digit-parallel addition:
- $x_j \leftarrow x_{j,s} + x_{j-1,c} + y_j$
- No carry propagation: only old carries on right side
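The carry-save update above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, $Y$ is taken to be in non-redundant form, and the top position is assumed to have headroom so that no carry is lost.

```python
def carry_save_add(xs, xc, y, k):
    """One digit-parallel step X <- X + Y in carry-save form.

    xs[j], xc[j] are the sum and carry parts of digit j (x_j = xs[j] + r*xc[j]);
    y[j] are ordinary digits of Y. Position j reads only the OLD carry xc[j-1],
    so all positions can be computed at once and nothing ripples along the word.
    """
    r = 1 << k
    n = len(xs)
    assert xc[-1] == 0, "top position needs headroom"
    new_s, new_c = [0] * n, [0] * n
    for j in range(n):                      # independent per position
        old_carry = xc[j - 1] if j > 0 else 0
        t = xs[j] + old_carry + y[j]
        new_s[j] = t % r                    # new sum part
        new_c[j] = t // r                   # new carry part, consumed next step
    return new_s, new_c


def value(xs, xc, k):
    """Integer value of a carry-save number."""
    r = 1 << k
    return sum((s + r * c) * r ** j for j, (s, c) in enumerate(zip(xs, xc)))
```

A carry generated at position $j$ simply waits in `xc[j]` until the next addition, which is what allows one addition per clock cycle regardless of word length.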
10 Multiplication $A \times B$
- Use $n$ digit multipliers to form $a_i \times B$ and add to a partial product $P$
- $P \leftarrow 0$
- For $i = n-1$ downto $0$ do
- $P \leftarrow r \times P + a_i \times B$
- Post-condition: $P = A \times B$
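The loop above, written out as a sketch. Python integers stand in for the hardware registers, and the function name is an assumption.

```python
def multiply_msd_first(A: int, B: int, k: int, n: int) -> int:
    """Form A*B by scanning the digits of A from most to least significant."""
    r = 1 << k
    assert 0 <= A < r ** n
    a = [(A >> (k * i)) & (r - 1) for i in range(n)]   # digits a_0 .. a_(n-1)
    P = 0
    for i in range(n - 1, -1, -1):
        P = r * P + a[i] * B       # shift the partial product up, add a_i * B
    return P                       # post-condition: P = A * B
```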
11
- Either: use redundancy in $P$ and parallel digit addition to add $a_i B$ in one clock cycle
- Cell $j$ computes $a_i b_j$ in cycle $i$
[Figure: cells $j+1$, $j$, $j-1$, each receiving $a_i$ and its own digit $b_{j+1}$, $b_j$, $b_{j-1}$, with sum and carry digits $p_{j,s}$, $p_{j,c}$ as inputs and outputs. Caption: $P \leftarrow P + a_i \times B$ (digit-parallel); $P$ in carry-save form, $p_j = p_{j,s} + r \times p_{j,c}$]
12
- Or: pipeline the addition of $a_i \times B$ over $n$ cycles and propagate carries with no redundancy
- Cell $j$ computes $a_i b_j$ in cycle $i+j$
[Figure: cells holding $p_{j+1}, b_{j+1}$; $p_j, b_j$; $p_{j-1}, b_{j-1}$ at times $j+1$, $j$, $j-1$, with $a_i$ and the carry passed from cell to cell. Caption: $P \leftarrow P + a_i \times B$ (digit-serial)]
13 Multiplier Complexity
- Assume wires take area but not time (or power).
- Area $\times$ Time$^2$ complexity for un-pipelined $k$-bit multiplication is bounded below by $k^2$
- This can be achieved for time in $\log k .. \sqrt{k}$
- Discrete Fourier Transform has large constants for time and area.
- Better, but asymptotically poorer, designs for the $k$ expected here.
14
- Cross-over point?
- $10^7$ transistors available for RSA?
- $k \le 64$ to accommodate $a_i \times B$
- Speed by using at least $n$ multipliers to perform a full length $a_i \times B$ (or equivalent) in one cycle.
15 Real-Time?
- Assume
- bus is one $k$-bit digit per cycle
- $k$-bit multiplier operates in one cycle
- Then
- $A \times B$ takes $n$ cycles using $n$ multipliers
- Throughput is one digit per cycle for multiplication.
- Need $O(nk)$ multiplications for decryption
- Conclude
- Need $O(nk)$ rows of $n$ multipliers.
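16 Classical Modular Multiplication Algorithm
- Pre-condition: $0 \le A < r^n$
- $P \leftarrow 0$
- For $i = n-1$ downto $0$ do
- Begin
- $P \leftarrow rP + a_i B$
- $q_i \leftarrow P$ div $M$
- $P \leftarrow P - q_i M$
- End
- Post-conditions: $P = AB - QM$,
- $P \equiv (AB) \bmod M$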
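A sketch of the classical algorithm above, using exact integer division for $q_i$ (a hardware design would only estimate it from the top digits). The function name and the accumulation of $Q$ are illustrative additions used to check the post-condition.

```python
def classical_mod_mul(A: int, B: int, M: int, k: int, n: int) -> int:
    """Interleaved classical modular multiplication, most significant digit first."""
    r = 1 << k
    assert 0 <= A < r ** n          # pre-condition
    P = 0
    Q = 0                           # Q = sum of q_i * r^i, kept only for the check below
    for i in range(n - 1, -1, -1):
        a_i = (A >> (k * i)) & (r - 1)
        P = r * P + a_i * B
        q_i = P // M                # needs the full, carry-resolved value of P
        P = P - q_i * M
        Q = r * Q + q_i
    assert P == A * B - Q * M       # post-condition P = AB - QM
    return P
```

The line computing `q_i` is the critical one: it depends on the most significant digits of `P`, so any pending carries must be resolved first.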
17 Comments
- Carry propagation a problem
- (it slows finding $q$)
- Use only top digits of $M$ and $P$ to determine a good multiple of $M$ to remove
- $P$ is bounded by small multiple of $M$
- Clean up only at end
- Critical path is finding $q$.
18 Disadvantages
- Redundant rep. for digit-parallel operation
- Global broadcast of $q$ to each digit position
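19 Montgomery's Modular Multiplication Algorithm
- Pre-condition: $0 \le A < r^n$
- $P \leftarrow 0$
- For $i = 0$ to $n-1$ do
- Begin
- $q_i \leftarrow (p_0 + a_i b_0)(-m_0^{-1}) \bmod r$
- $P \leftarrow (P + a_i B + q_i M)$ div $r$
- Invariant: $0 \le P < M + B$
- End
- Post-condition: $P r^n = AB + QM$,
- $P \equiv (AB r^{-n}) \bmod M$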
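A sketch of Montgomery's loop as stated above, with Python integers in place of digit registers. The function name is an assumption; `pow(m0, -1, r)` (Python 3.8+) supplies $m_0^{-1} \bmod r$, which exists because $M$ is odd and $r = 2^k$.

```python
def montgomery_mul(A: int, B: int, M: int, k: int, n: int) -> int:
    """Return P with P * r^n = A*B + Q*M, i.e. P = A*B*r^(-n) mod M up to a multiple of M."""
    r = 1 << k
    assert M % 2 == 1               # r = 2^k must be prime to M
    assert 0 <= A < r ** n          # pre-condition
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    b0 = B % r
    P = 0
    for i in range(n):              # least significant digit of A first
        a_i = (A >> (k * i)) & (r - 1)
        p0 = P % r
        q_i = ((p0 + a_i * b0) * neg_m0_inv) % r   # uses lowest digits only
        P = (P + a_i * B + q_i * M) // r           # division by r is exact
        assert 0 <= P < M + B                      # loop invariant
    return P
```

Note that `q_i` depends only on the lowest digits `p0`, `a_i` and `b0`, so there are no carries to wait for.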
20 Peter Montgomery
- reverses the multiplication order
- chooses digits from least to most significant
- shifts down on each iteration.
- uses the least significant digits to determine multiple of $M$ to subtract.
- Computes $(AB r^{-n}) \bmod M$
21
- The factor $r^{-n}$ is cleared up in post-processing
- Any extra multiple of $M$ is removed then
- $q_i$ has no carries to wait for
- Pipelining of the digits can now take place
- compute $a_i b_{j+1}$ on the cycle after $a_i b_j$
- use a non-redundant representation
- no broadcasting of $q_i$
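22 The Post-Condition
- $m_0^{-1}$ exists (mod $r$, since $r$ is prime to $M$)
- $q_i$ chosen so division by $r$ is exact
- Define $A_i = \sum_{j=0}^{i} a_j r^j$ and $Q_i$ analogously
- Then $A_i = A_{i-1} + r^i a_i$ and $A_{n-1} = A$
- So $r^{i+1} P = A_i B + Q_i M$ at end of $i$th iteration
- Hence $r^n P = AB + QM$ at end.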
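The induction above can be checked numerically by carrying $A_i$ and $Q_i$ alongside $P$. This is a checking sketch only; the function name is hypothetical and the assertions restate the slide's claims.

```python
def check_post_condition(A: int, B: int, M: int, k: int, n: int):
    """Run Montgomery's loop and assert r^(i+1) * P = A_i*B + Q_i*M after each iteration."""
    r = 1 << k
    assert M % 2 == 1 and 0 <= A < r ** n
    neg_m0_inv = (-pow(M % r, -1, r)) % r      # exists because r is prime to M
    P = A_i = Q_i = 0
    for i in range(n):
        a_i = (A >> (k * i)) & (r - 1)
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        total = P + a_i * B + q_i * M
        assert total % r == 0                  # q_i makes the division exact
        P = total // r
        A_i += a_i * r ** i                    # A_i = A_(i-1) + r^i * a_i
        Q_i += q_i * r ** i
        assert r ** (i + 1) * P == A_i * B + Q_i * M
    assert A_i == A                            # A_(n-1) = A
    return P, Q_i
```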
23 The Bounds
- $A$ converted on-line to non-redundant form
- Can assume $a_i \le r-1$
- So loop invariant $P < M + B$
24
- If critical path length is computing $q$:
- Scale $M$ to ensure $(-m_0^{-1}) \bmod r = 1$
- Shift $B$ up to make $b_0 = 0$
- Result
- $q_i = p_0 \bmod r$ is simple
- Critical path in repeated cell.
- Cost
- Increase $n$ by 2
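One possible reading of the two adjustments above, as a sketch. The assumptions are mine: $M$ is scaled by the digit $d = (-m_0^{-1}) \bmod r$ so that its lowest digit becomes $r-1$, and $B$ is shifted up by one digit. The function name is hypothetical.

```python
def montgomery_mul_simple_q(A: int, B: int, M: int, k: int, n: int) -> int:
    """Montgomery loop in which q_i is just the lowest digit of P.

    Assumed scaling: M' = d*M with d = (-m_0^(-1)) mod r, so M' ends in digit r-1
    and (-m'_0^(-1)) mod r = 1. Assumed shift: B' = r*B, so b'_0 = 0.
    Uses n + 2 iterations; the result satisfies P = A*B*r^(-(n+1)) mod M
    up to a multiple of M.
    """
    r = 1 << k
    assert M % 2 == 1 and 0 <= A < r ** n
    d = (-pow(M % r, -1, r)) % r
    M_scaled = d * M                # lowest digit is r - 1
    assert M_scaled % r == r - 1
    B_shifted = r * B               # lowest digit is 0
    P = 0
    for i in range(n + 2):          # cost: n increased by 2
        a_i = (A >> (k * i)) & (r - 1)
        q_i = P % r                 # no multiplication needed to find q_i
        P = (P + a_i * B_shifted + q_i * M_scaled) // r   # still exact
    return P
```

Since $M$ divides the scaled modulus, the result is still correct modulo $M$; only the power of $r$ to be removed afterwards changes.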
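25 Removing $r^{-n}$
- The Montgomery class of $A$ is
- $\bar{A} \equiv r^n A \bmod M$
- Montgomery modular multiplication is denoted $\otimes$.
- Montgomery product of $\bar{A}$ and $\bar{B}$ is
- $\bar{A} \otimes \bar{B} \equiv \bar{A} \bar{B} r^{-n} \equiv A B r^n \equiv \overline{AB} \bmod M$.
- Applying $\otimes$ to $\bar{A}$ instead of $\times$ to $A$ produces
- $\overline{A^e}$ in an exponentiation algorithm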
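26 Encryption Process
- Process: $A \to \bar{A} \to \overline{A^e} \to A^e$
- Precompute $R_2 = \overline{r^n} \equiv r^{2n} \bmod M$
- Start with $A \otimes R_2 \equiv A r^n \equiv \bar{A} \bmod M$
- Exponentiate to obtain $\overline{A^e}$
- End with $\overline{A^e} \otimes 1 \equiv A^e \bmod M$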
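The process above, sketched with a left-to-right square-and-multiply loop. Several things here are assumptions for illustration: the function names, the choice of exponentiation method, and the plain `% M` at the end of `mont`, which keeps the sketch simple.

```python
def mont(X: int, Y: int, M: int, k: int, n: int) -> int:
    """Montgomery product X (x) Y = X*Y*r^(-n) mod M, reduced to [0, M) for simplicity."""
    r = 1 << k
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = 0
    for i in range(n):
        x_i = (X >> (k * i)) & (r - 1)
        q_i = ((P + x_i * Y) * neg_m0_inv) % r
        P = (P + x_i * Y + q_i * M) // r
    return P % M


def mont_exp(A: int, e: int, M: int, k: int, n: int) -> int:
    """A^e mod M via A -> A-bar -> (A^e)-bar -> A^e. Requires odd M < r^n and 0 <= A < M."""
    r = 1 << k
    R2 = pow(r, 2 * n, M)               # precomputed r^(2n) mod M
    A_bar = mont(A, R2, M, k, n)        # A (x) R2 = A * r^n mod M
    acc = mont(1, R2, M, k, n)          # Montgomery class of 1
    for bit in bin(e)[2:]:              # exponent bits, most significant first
        acc = mont(acc, acc, M, k, n)   # square
        if bit == "1":
            acc = mont(acc, A_bar, M, k, n)   # multiply
    return mont(acc, 1, M, k, n)        # (A^e)-bar (x) 1 = A^e mod M
```

Only the first and last calls convert into and out of the Montgomery class; every step of the exponentiation itself stays inside it.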
27 2M Bound
- Outputs are re-used as inputs.
- So need to bound I/O
- Suppose $a_{n-1} = 0$
- Then $P < M + B$ at end of loop $n-2$
- yields $P < M + r^{-1} B$ at very end.
- e.g. If $B < 2M$ then $P < 2M$
28
- Suppose $2rM < r^n$, $A < 2M$ and $R_2 < 2M$
- Then $\bar{A} < 2M$, $\overline{A^e} < 2M$ and $P = \overline{A^e} \otimes 1 < 2M$
- Final output $P$ satisfies
- $P r^n = \overline{A^e} + QM$ where $Q \le r^n - 1$.
- Here $\overline{A^e} < 2M$ yields $P r^n < (r^n + 1)M$. So $P \le M$
- $P = M \Rightarrow A^e \equiv 0 \bmod M \Rightarrow A \equiv 0 \bmod M$
- $A = M$ should never arise; $A = 0$ yields $P = 0$.
- So no final modular adjustment is necessary.
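The bound argument above can be exercised by running an exponentiation with no subtraction of $M$ anywhere and asserting the claimed bounds along the way. This is a checking sketch: the function names and the square-and-multiply schedule are assumptions, and $e \ge 1$ is required.

```python
def mont_raw(X: int, Y: int, M: int, k: int, n: int) -> int:
    """Montgomery product with no final adjustment: returns P with P*r^n = X*Y + Q*M."""
    r = 1 << k
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = 0
    for i in range(n):
        x_i = (X >> (k * i)) & (r - 1)
        q_i = ((P + x_i * Y) * neg_m0_inv) % r
        P = (P + x_i * Y + q_i * M) // r
    return P


def mont_exp_no_adjust(A: int, e: int, M: int, k: int, n: int) -> int:
    """A^e mod M with every intermediate value merely bounded by 2M."""
    r = 1 << k
    assert 2 * r * M < r ** n and 0 <= A < M and e >= 1
    R2 = pow(r, 2 * n, M)
    A_bar = mont_raw(A, R2, M, k, n)
    assert A_bar < 2 * M
    acc = A_bar
    for bit in bin(e)[3:]:                    # bits after the leading 1
        acc = mont_raw(acc, acc, M, k, n)
        assert acc < 2 * M                    # output is a valid input again
        if bit == "1":
            acc = mont_raw(acc, A_bar, M, k, n)
            assert acc < 2 * M
    P = mont_raw(acc, 1, M, k, n)
    assert P <= M                             # P = M only if A = 0 mod M
    return P
```

With $r = 16$ and $M = 3233$, the condition $2rM < r^n$ needs $n = 5$, two digits more than $M$ itself occupies.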
29 Digit-Parallel Implementation
- Classical vs Montgomery
- Similarities
- Broadcasting of $q_i$ and $a_i$
- Redundant representations
- Computing $q_i$ takes time
- Differences
- Bits to determine $q_i$
30
[Figure: cells $j+1$, $j$, $j-1$ with broadcast inputs $a_i$, $q_i$, local digits $m_j$, $b_j$, and carry-save digits $p_{j,s}$, $p_{j,c}$ as inputs and outputs. Caption: $P \leftarrow P + a_i \times B$ (digit-parallel, not modular); $P$ in carry-save form, $p_j = p_{j,s} + r \times p_{j,c}$]
31
[Figure: cells $n-1$, $j+1$, $j$, $j-1$ with broadcast inputs $a_i$, $q_i$, local digits $m_j$, $b_j$, carry-save digits $p_{j,s}$, $p_{j,c}$, and $q_{i+1}$ produced at the top cell. Caption: Digit-parallel $P \leftarrow rP + a_i \times B - q_i \times M$ (Classical)]
32
[Figure: cells $j+1$, $j$, $j-1$, ..., $0$ with inputs $a_i$, $q_i$, $m_j$, $b_j$, carries $c_{i,j}$, and digits $p_j^{(i)}$, $p_j^{(i+1)}$, ..., $p_j^{(n)}$. Caption: Data flow for $P^{(i+1)} \leftarrow (P^{(i)} + a_i \times B + q_i \times M)/r$ (Montgomery)]
33 Systolic Array (Montgomery)
- Write $i$th value of $P$ as $P^{(i)} = \sum_{j=0}^{n-1} p^{(i-1)}_{j+1} r^j$
- Cells in col $j$ compute $p^{(i)}_j$ at time $2i+j$:
- $p^{(i)}_j + r c^{(i)}_j \leftarrow p^{(i-1)}_{j+1} + c^{(i)}_{j-1} + a_i b_j + q_i m_j$
- Cells in col $0$ compute $q_i$ at time $2i$:
- $q_i \leftarrow (p^{(i-1)}_1 + a_i b_0)(-m_0^{-1}) \bmod r$
- Any number of rows may be constructed
- Different timing schedules are possible
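A software model of the cell recurrence above, evaluated row by row. This is a sketch under assumptions: the function name, the extra top column that absorbs the final carry, and the `schedule` dictionary recording the tick $2i+j$ of each cell are illustrative, and carries are allowed to be wider than one digit.

```python
def systolic_montgomery(A: int, B: int, M: int, k: int, n: int):
    """Evaluate cell (i, j) of the array for all rows i and columns j; return (P, schedule)."""
    r = 1 << k
    assert M % 2 == 1 and M < r ** n and 0 <= A < r ** n and 0 <= B < r ** n

    def digits(X):
        # n digits plus one zero guard digit for the top column
        return [(X >> (k * j)) & (r - 1) for j in range(n)] + [0]

    a, b, m = digits(A), digits(B), digits(M)
    neg_m0_inv = (-pow(m[0], -1, r)) % r

    p_prev = [0] * (n + 2)        # p_prev[j] = p^(i-1)_j; all zero since P^(0) = 0
    schedule = {}                 # (i, j) -> clock tick 2i + j
    for i in range(n):
        # Column 0 derives q_i from the incoming digit p^(i-1)_1.
        q_i = ((p_prev[1] + a[i] * b[0]) * neg_m0_inv) % r
        p_new = [0] * (n + 2)
        carry = 0                 # c^(i)_(j-1), passed from column to column
        for j in range(n + 1):
            schedule[(i, j)] = 2 * i + j
            t = p_prev[j + 1] + carry + a[i] * b[j] + q_i * m[j]
            p_new[j] = t % r      # p^(i)_j
            carry = t // r        # c^(i)_j
        p_new[n + 1] = carry
        assert p_new[0] == 0      # the choice of q_i makes the lowest digit vanish
        p_prev = p_new
    # The final value is read from digits 1 .. n+1 of the last row (shift down by one).
    P = sum(p_prev[j] * r ** (j - 1) for j in range(1, n + 2))
    return P, schedule
```

Cell $(i, j)$ needs $p^{(i-1)}_{j+1}$ from tick $2i+j-1$ and the carry from cell $(i, j-1)$, also at tick $2i+j-1$, so every input is ready one tick before it is used.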
34 Systolic Array for $P = (A \times B + Q \times M) r^{-n}$
[Figure: two rows of the array. Cells $(i, j+1)$, $(i, j)$, $(i, j-1)$ take digits $p^{(i)}$, local $m_j$, $b_j$, and pass $a_i$, $q_i$ and the carry along the row; their outputs $p^{(i+1)}$ feed cells $(i+1, j+1)$, $(i+1, j)$, $(i+1, j-1)$, which pass $a_{i+1}$, $q_{i+1}$ and the carry and output $p^{(i+2)}$]
35
[Figure: cells $j+1$, $j$, $j-1$, ..., $0$ with inputs $a_i$, $q_i$, $m_j$, $b_j$, carries $c_{i,j}$, and digits $p_j^{(i)}$, $p_j^{(i+1)}$, ..., $p_j^{(n)}$. Caption: Data flow for $P^{(i+1)} \leftarrow (P^{(i)} + a_i \times B + q_i \times M)/r$]
36 Digit-Serial Implementation (Montgomery)
- Advantages
- Local communication
- Shorter critical path
- Critical path easily in repeated cell
- Non-redundant representation
- Digit-serial I/O
- Different digits $q_i$ and $a_i$ in use in each cycle (re DPA)
37 Digit-Serial Implementation (Montgomery)
- Disadvantage
- H/W only half used
- Solutions
- Interleave two multiplications
- E.g. configure exponentiation ⇒ 75% use
- Group digits as per Peter Kornerup 94
38
- Other cell boundaries/groupings are possible
- Timing front angles in the data dependency graph can be altered
- For current speed of array implementations see Blum and Paar 99
- Vuillemin et al. 97 constructed an array
- Design is parametrised by $k$ and no. of rows.
39 Data Dependency Diagrams
40 Data Dependency Diagrams
Parallel Digit Implementation
[Figure: data dependency diagram with time fronts $t = 0$ to $t = 3$]
41 Data Dependency Diagrams
Walter 93
[Figure: data dependency diagram with time fronts $t_0$ to $t_7$, ...; spacings of 1 tick and 2 ticks marked]
42 Data Dependency Diagrams
Kornerup 94
[Figure: data dependency diagram with time fronts $t_0$ to $t_6$]
43 Data Integrity
- $P = AB - QM$ or $P r^n = AB + QM$
- These are easily checked mod $m$.
- e.g. $m$ a prime just above the maximum cell output.
- Cost: one cell in the array, i.e. increasing $n$ by 1.
- On error, abort or re-compute by another route
- e.g. $M$ replaced by $dM$ for a digit $d$ prime to $r$.
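A sketch of the residue check above for the Montgomery case. The function names and the check modulus $m = 257$ are assumptions chosen for illustration; in hardware the residues would be accumulated by an extra cell rather than computed with `%`.

```python
def montgomery_mul_with_q(A: int, B: int, M: int, k: int, n: int):
    """Montgomery loop returning both P and Q, where P * r^n = A*B + Q*M."""
    r = 1 << k
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = Q = 0
    for i in range(n):
        a_i = (A >> (k * i)) & (r - 1)
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        P = (P + a_i * B + q_i * M) // r
        Q += q_i * r ** i
    return P, Q


def residue_check(P, Q, A, B, M, k, n, m=257):
    """Check P * r^n = A*B + Q*M modulo a small prime m (m = 257 is an assumed value)."""
    r = 1 << k
    lhs = (P % m) * pow(r, n, m) % m
    rhs = ((A % m) * (B % m) + (Q % m) * (M % m)) % m
    return lhs == rhs


P, Q = montgomery_mul_with_q(1234, 567, 1009, 4, 3)
ok = residue_check(P, Q, 1234, 567, 1009, 4, 3)            # fault-free run passes
faulty = residue_check(P + 1, Q, 1234, 567, 1009, 4, 3)    # injected fault in P
```

A check of this kind misses any error that happens to be a multiple of $m$, so it reduces the chance of an undetected fault rather than eliminating it.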
44 Timing and Power Attacks
- Most attacks which succeed on the classical algorithm have equivalents which will succeed on the corresponding implementation of Montgomery's algorithm.
- With parallel digit processing, the same digits of $A$ and $Q$ are used in every digit slice in the same cycle. So DPA might reveal them.
- Pipelined version has no equivalent (see data dependency graph). It uses many different digits of $A$ and $Q$ in each cycle. DPA is more difficult.
45 Conclusions
- For a single $k$-bit multiplier or an array of $n$ parallel cells, classical and Montgomery algorithms are almost equal.
- For a pipelined array, the Montgomery method has advantages: smaller time and area constants, better I/O, better against DPA
- Pipeline is more complex for 100% use, but faster clock.
- Parameters can be chosen for specific purposes.