Title: Unified Architectures for Efficient and Compact Crypto-Processing
1Unified Architectures for Efficient and Compact
Crypto-Processing
- Erkay Savas
- Sabanci University
2Outline
- Research Motivation
- Public Key Cryptography
- Unified Arithmetic
- High-Radix Multiplication
- Dual-Radix Multiplication
- Support for GF(3^n) Arithmetic
- Implementation Results
- Future Research
3Motivation
- Compatibility
- support for fast arithmetic in different finite fields and groups
- Saving in Area
- Improve time × area metric
- Algorithm Agility
- NTRU ↔ ECC
4Public Key Cryptography (PKC)
- Each user has a pair of keys
- Private Key - known only to the owner
- Public Key - known to everyone in the system, with assurance
- Encryption
- Encryption with the Public Key of the receiver
- Decryption
- Only the receiver can decrypt the message, using her/his Private Key
5Public Key Cryptography in Use
- RSA, Rabin's scheme
- Integer factorization, square roots modulo a composite number
- Discrete Logarithm Based Algorithms
- Diffie-Hellman Key Exchange, ElGamal
- Elliptic curve DH Key Exchange, ECDSA
- Discrete logarithm over elliptic curves
- IBE
- pairings over elliptic curve points
6RSA
- Most popular PKC
- Invented by Rivest/Shamir/Adleman in 1977 at MIT.
- Its patent expired in 2000.
- Based on the integer factorization problem
- Each user has a public and private key pair.
7RSA Encryption Decryption
- Encryption is done using the public key
- y = x^e mod n, where x, y < n
- Decryption is done using the private key
- x = y^d mod n
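The two formulas on this slide can be tried out with a toy example. The sketch below uses textbook-sized numbers chosen only for illustration (the primes 61 and 53, the exponent 17 and the message 65 are assumptions, not values from the talk) and is far too small to be secure.

```python
# Toy RSA round trip; all numeric values are illustrative assumptions.
p, q = 61, 53                # hypothetical secret primes
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent: e*d = 1 (mod phi); Python 3.8+


def encrypt(x, e, n):
    """y = x^e mod n, using the receiver's public key."""
    return pow(x, e, n)


def decrypt(y, d, n):
    """x = y^d mod n, using the receiver's private key."""
    return pow(y, d, n)


x = 65                       # message, must satisfy x < n
y = encrypt(x, e, n)
recovered = decrypt(y, d, n)
```

Both directions are a single modular exponentiation, which is why fast modular multiplication dominates the cost.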
8DL Based Cryptosystems
- Fundamental operation
- g^x mod p, where x, g < p and g is a primitive element
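As a rough illustration of why this operation reduces to repeated modular multiplication, here is a minimal square-and-multiply sketch. The function name `mod_exp` and the small parameters in the usage lines are assumptions made for the example.

```python
def mod_exp(g, x, p):
    """Left-to-right binary (square-and-multiply) computation of g^x mod p."""
    result = 1
    for bit in bin(x)[2:]:                  # exponent bits, most significant first
        result = (result * result) % p      # square in every step
        if bit == "1":
            result = (result * g) % p       # extra multiply when the bit is set
    return result


# Toy Diffie-Hellman exchange with illustrative parameters.
p, g = 23, 5
a_secret, b_secret = 6, 15
a_public = mod_exp(g, a_secret, p)
b_public = mod_exp(g, b_secret, p)
shared_a = mod_exp(b_public, a_secret, p)
shared_b = mod_exp(a_public, b_secret, p)
```

For a k-bit exponent the loop performs k squarings and, on average, about k/2 multiplications, all modulo p.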
9Elliptic Curve Cryptography 1/2
- Emerging public key cryptography standard for constrained devices.
- A 160-bit key is equivalent in cryptographic strength to 1024-bit RSA.
- 313-bit ECC is equivalent to 4096-bit RSA
- As algebraic/geometric entities, elliptic curves have been studied extensively for the past 150 years.
- Rich and deep theory suitable to cryptography
- First proposed for cryptographic usage in 1985, independently by Neal Koblitz and Victor Miller
10Elliptic Curve Cryptography 2/2
- Dominant fundamental operations
- Multiplication in GF(q), where q = p^k and p is prime
- Alternatives
- GF(p): k = 1
- GF(2^k): p = 2
- GF(p^k)
- GF(3^k): p = 3
11Identity Based Encryption (IBE)
- Public key can be any string
- e-mail address, name, etc.
- No need for certificates
- Anonymity achieved
- users can choose any public key without revealing their ID
- They can easily change it
12IBE Bilinear Mapping
- e(xP, yQ) = e(P, Q)^(xy) = e(yP, xQ) = g
- g is an element of (an extension of) the underlying field.
- Bilinear mapping over elliptic curves
- Weil pairing
- Tate pairing
- Resource consuming
- Most efficient bilinear mappings
- defined on curves over GF(3^k)
13An Introduction to Unified Arithmetic
- Types of finite fields that are heavily used
- Prime fields, GF(p)
- Binary extension fields, GF(2^k)
- Ternary extension fields, GF(3^k) (recently, due to IBE schemes)
- These finite fields feature dissimilar properties
- Different implementations on specialized hardware
14Unified Arithmetic
- Unified hardware design methodology requires
- A single (unified) datapath
- A single (unified) control
- Insignificant overhead in the area
- Insignificant overhead in the time complexity (e.g. critical path delay)
- Good time × area metric
15Unified Arithmetic (GF(p) and GF(2^k))
- A unified hardware design methodology for both fields is possible since
- the elements of either field are represented using almost the same data structures in digital systems
- the algorithms for basic arithmetic operations in both fields have structural similarities (i.e. the steps of the algorithms are almost identical)
- Hence, unified arithmetic is possible
16Finite Field Operations in ECC
- Addition in GF(p) and GF(2^k)
- Relatively inexpensive in area and time complexity
- Multiplicative inversion in GF(p) and GF(2^k)
- Prohibitively expensive in terms of time
- Possible to avoid some of them
- Multiplication in GF(p) and GF(2^k)
- Expensive in terms of time and area
- Usually most important operation
- Our focus
17Montgomery Multiplication
- Very efficient way of doing multiplication in GF(p) and GF(2^k) (now also in GF(3^k))
- Faster (replaces division by shifts)
- Suitable for unified design
- Suitable for scalable design
- Highly parallel
- Suitable for pipelining
18Montgomery Multiplication
- Definition
- Given a, b ∈ GF(p), MonMul(a, b) = a·b·R^(-1) mod p, where R = 2^k mod p and k = ⌈log2 p⌉.
- Algorithm (a_i: bit i of a; c_0: least significant bit of c)
- c = 0
- for i = 0 to k-1
-   c = c + a_i·b
-   c = (c + c_0·p)/2
- if c ≥ p then c = c - p (final subtraction)
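The bit-serial loop on this slide can be sketched directly in software. The version below is a minimal illustration, assuming p is odd, a, b < p, and k is taken as the bit length of p; the function name `mon_mul` and the test values are not from the talk.

```python
def mon_mul(a, b, p):
    """Bit-serial Montgomery product a*b*2^(-k) mod p for odd p and a, b < p."""
    k = p.bit_length()               # assumption: k = number of bits of p
    c = 0
    for i in range(k):
        a_i = (a >> i) & 1           # scan the multiplier one bit at a time
        c += a_i * b                 # conditional addition of b
        c = (c + (c & 1) * p) >> 1   # add p if c is odd, then divide by 2 exactly
    if c >= p:                       # single final subtraction (c < 2p here)
        c -= p
    return c


# Usage: map operands into the Montgomery domain, multiply, map back.
p = 239
k = p.bit_length()
R = pow(2, k, p)
a, b = 100, 200
a_bar, b_bar = (a * R) % p, (b * R) % p
c_bar = mon_mul(a_bar, b_bar, p)     # equals a*b*R mod p
c = mon_mul(c_bar, 1, p)             # equals a*b mod p
```

The division by 2 is always exact because p is added whenever c is odd, so no trial division by p is needed inside the loop.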
19Algorithm for GF(2^k)
- Input: a(x), b(x) ∈ GF(2^k), p(x) and k
- Output: c(x) = a(x)·b(x)·x^(-k) mod p(x), c(x) ∈ GF(2^k)
- c(x) = 0
- for i = 0 to k-1
-   c(x) = c(x) ⊕ a_i·b(x)
-   c(x) = (c(x) ⊕ c_0·p(x))/x
- No final subtraction
- Note that
- c/2 and c(x)/x are implemented in an identical way in SW and HW (a right shift by one position)
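The binary-field version of the loop can be sketched the same way, with polynomials over GF(2) packed into integers (bit i holds the coefficient of x^i). The names `mon_mul_gf2` and `gf2_mul`, and the choice of the degree-8 polynomial x^8 + x^4 + x^3 + x + 1 for the usage lines, are assumptions for illustration only.

```python
def mon_mul_gf2(a, b, p, k):
    """a(x)*b(x)*x^(-k) mod p(x), with p(x) of degree k and constant term 1."""
    c = 0
    for i in range(k):
        if (a >> i) & 1:
            c ^= b           # addition in GF(2)[x] is XOR, no carries
        if c & 1:
            c ^= p           # clear the constant term so division by x is exact
        c >>= 1              # division by x is a right shift, like c/2 in GF(p)
    return c                 # degree is already below k: no final subtraction


def gf2_mul(a, b, p, k):
    """Plain shift-and-add product a(x)*b(x) mod p(x), used here as a reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= p           # reduce as soon as the degree reaches k
    return r


p, k = 0x11B, 8              # x^8 + x^4 + x^3 + x + 1
a, b = 0x57, 0x83
x_k = p ^ (1 << k)           # x^k mod p(x)
c = mon_mul_gf2(a, b, p, k)
```

Compared with the integer version, only the additions change (XOR instead of add with carry); the shift step is identical, which is the structural similarity the unified design exploits.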
20Representation
- Addition
- Atomic operation: multiplication is performed as repeated addition
- Unified addition
- most efficient when carry-save representation is used for elements of GF(p)
- Carry-save representation
- an integer is represented as the sum of two other integers
- x = xs + xc (sum and carry parts, resp.)
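A small software model may help make the carry-save idea concrete. The sketch below assumes non-negative integers and models one level of 3:2 compression on whole bit-vectors; the function name `carry_save_add` and the operand values are illustrative.

```python
def carry_save_add(xs, xc, y):
    """Add y to the carry-save pair (xs, xc) without propagating carries."""
    s = xs ^ xc ^ y                                  # per-bit sum
    c = ((xs & xc) | (xs & y) | (xc & y)) << 1       # per-bit carry, weight 2
    return s, c


# Accumulate several operands; carries are resolved only once at the end.
operands = [1234, 5678, 91011, 1213]
xs, xc = 0, 0
for y in operands:
    xs, xc = carry_save_add(xs, xc, y)
total = xs + xc          # the single carry-propagating addition
```

Each step costs one full-adder delay regardless of operand length, which is why the representation suits repeated additions inside a multiplier.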
21Scalability
- Original Montgomery multiplication algorithm performs full-precision integer additions
- Not scalable
- Instead,
- long integers are divided into words
- Additions of words are handled separately on word adders.
- Choice of word length depends on the precision, area and speed requirements
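The word-splitting idea can be sketched as follows: a long addition is carried out one w-bit word at a time, with only a small carry passed between words. This is a simplified model of word-level processing, not the multiplier datapath itself; the helper names and the 64-bit operands are assumptions.

```python
def split_words(x, w, e):
    """Split x into e words of w bits each, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * j)) & mask for j in range(e)]


def add_by_words(a_words, b_words, w):
    """Add two word vectors using only a w-bit adder plus a carry between words."""
    mask = (1 << w) - 1
    carry, out = 0, []
    for a_j, b_j in zip(a_words, b_words):
        t = a_j + b_j + carry
        out.append(t & mask)     # low w bits stay in this word
        carry = t >> w           # the overflow moves to the next word
    return out, carry


def join_words(words, w):
    """Reassemble an integer from its words."""
    return sum(word << (w * j) for j, word in enumerate(words))


w, e = 16, 4
a, b = 0xFFFF0000FFFF1234, 0x0001FFFF0001EDCC
words, carry = add_by_words(split_words(a, w, e), split_words(b, w, e), w)
result = join_words(words, w) + (carry << (w * e))
```

The same adder handles any operand length by running more iterations, which is the sense in which the word-based design is scalable.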
22Word-Based Multiplication
[Figure: processing unit PU_i with multiplier bit a_i; signals c^(j)_0, c^(j)_1, ..., c^(j)_(w-1) and c^(j+1)_0, c^(j+1)_1, ..., c^(j+1)_(w-1)]
23Dependency Graph
24Processing Unit (PU) with w = 2
[Figure: PU datapath; signals C1^(j) and C0^(j)]
25Dual-Field Adder (DFA) 1/2
- Almost identical to a full-adder (FA)
- Difference
- it has an additional (control) input (FSEL), which suppresses the carry output of the adder when it is set to logic-0
- Namely, when FSEL = 0 the adder operates in GF(2^k); otherwise it becomes a regular FA
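The behaviour described here can be modelled at bit level. The sketch below assumes the sum output is the usual three-input XOR and that FSEL simply gates the majority (carry) term; the function names and the ripple-style word adder built on top are illustrative, not the circuit from the slides.

```python
def dual_field_adder(a, b, c, fsel):
    """One-bit dual-field adder: a full adder whose carry is gated by FSEL."""
    s = a ^ b ^ c
    cout = fsel & ((a & b) | (a & c) | (b & c))   # carry suppressed when fsel == 0
    return s, cout


def dual_field_add(x, y, w, fsel):
    """Add two w-bit words by chaining dual-field adders bit by bit."""
    result, carry = 0, 0
    for i in range(w):
        s, carry = dual_field_adder((x >> i) & 1, (y >> i) & 1, carry, fsel)
        result |= s << i
    return result


w = 8
int_sum = dual_field_add(0xB7, 0x5D, w, fsel=1)    # integer addition modulo 2^w
poly_sum = dual_field_add(0xB7, 0x5D, w, fsel=0)   # GF(2)[x] addition, i.e. XOR
```

With FSEL = 0 no carry is ever generated, so the same cells compute a bitwise XOR, which is exactly polynomial addition over GF(2).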
26DFA 2/2
[Figure: DFA circuit; signals A, B, C, FSEL, S and Cout]
27Pipeline Organization with two PUs
s = the number of PUs
28Total Computation Time (in clock cycles)
w = word size, k = precision, e = ⌈k/w⌉, s = the number of PUs
29Example Execution Times
- Example: k = 1024, w = 32
- s = 17 → T = 2105
- s = 15 → T = 2305
- s = 10 → T = 3415
- s = 1 → T = 33792
- Example: k = 2048, w = 32
- s = 33 → T = 4221
- s = 30 → T = 4543
- s = 10 → T = 13343
- s = 1 → T = 133120
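The timing formula itself did not survive in this transcript. As a loudly labelled assumption, the sketch below uses a two-case cycle count reconstructed to fit the listed examples: one case where the pipeline is kept busy (e + 1 > 2s) and one where it stalls waiting for the first PU. The function name `total_cycles` and the formula are not taken from the slides.

```python
def total_cycles(k, w, s):
    """Estimated clock cycles for one multiplication (reconstructed, hypothetical)."""
    e = -(-k // w)                 # e = ceil(k / w), number of words
    passes = -(-k // s)            # ceil(k / s), trips through the s-stage pipeline
    if e + 1 > 2 * s:
        # Few PUs: each pass takes e + 1 cycles, plus pipeline fill latency.
        return passes * (e + 1) + 2 * (s - 1)
    # Many PUs: each pass is limited by the 2-cycle delay per pipeline stage.
    return passes * 2 * s + e - 1


table = {(k, s): total_cycles(k, 32, s)
         for k, s in [(1024, 17), (1024, 15), (1024, 10), (1024, 1),
                      (2048, 33), (2048, 30), (2048, 10), (2048, 1)]}
```

Under this assumption the sketch reproduces seven of the eight listed values; for k = 1024, s = 10 it gives 3417 rather than the listed 3415, so the reconstruction should be treated as approximate.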
30Comparison to the single-field (GF(p)) design
w = word size; 1.2 µm CMOS technology
31Design Alternatives
- Higher Radix
- Original design is radix 2
- Namely, multiplier bits are scanned one bit in each clock cycle
- Possible to scan two or more bits of the multiplier a
- Radix-4: two bits
- Radix-8: three bits
- More complex design: lower clock frequency, higher area
- Lower clock cycle count → faster execution of multiplication
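To illustrate what scanning several multiplier bits per step involves, here is a minimal sketch of Montgomery multiplication in radix 2^r for GF(p). It assumes odd p, a, b < p and r·n at least the bit length of a; the function name `mon_mul_radix` and the precomputed constant are illustrative and not the hardware design from the talk.

```python
def mon_mul_radix(a, b, p, r, n):
    """a*b*2^(-r*n) mod p, consuming r bits of the multiplier a per iteration."""
    base = 1 << r
    digit_mask = base - 1
    p_neg_inv = (-pow(p, -1, base)) % base     # -p^(-1) mod 2^r (Python 3.8+)
    c = 0
    for i in range(n):
        a_i = (a >> (r * i)) & digit_mask      # next radix-2^r digit of a
        c += a_i * b
        q = ((c & digit_mask) * p_neg_inv) & digit_mask   # makes c + q*p divisible by 2^r
        c = (c + q * p) >> r                   # exact division by 2^r
    if c >= p:
        c -= p
    return c


p, a, b = 239, 100, 200
radix2 = mon_mul_radix(a, b, p, r=1, n=8)      # 8 iterations
radix4 = mon_mul_radix(a, b, p, r=2, n=4)      # 4 iterations
radix8 = mon_mul_radix(a, b, p, r=3, n=3)      # 3 iterations
```

The iteration count drops from k to about k/r, but each step needs a digit-by-operand product and the quotient digit q, which is the added complexity the slide refers to.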
32Comparison
- Higher radix vs. single radix
- Metric
- area × time
- For small total area (i.e. < 10000 equivalent NAND gates) the performances of radix-2 and radix-8 are comparable
- Radix-8 multiplier outperforms radix-2 multiplier by more than 3 times when the total area is around 25000 NAND gates
33Dual-Radix Multiplier
- Radix-2 for GF(p) and radix-4 for GF(2^k)
34Dual-Radix Multiplier
- Three multipliers
- A1: GF(p)-only multiplier
- A2: single-radix unified multiplier (with precomp.)
- A3: dual-radix multiplier
- Performance (area × time)
- A3 performs slightly worse than A1 and A2 (by 7% to 19%) in GF(p) mode
- A3 outperforms A2 by 38% to 46% in GF(2^k) mode
35Unified Arithmetic?
- Unified multiplier
- carry-save adders used in multiplier
- It is not easy to perform other arithmetic operations, such as subtraction and comparison (essential in inversion), with carry-save representation
36New Redundant Representation
- Recall
- Carry-save representation
- X = xs + xc.
- New redundant representation
- Redundant signed digit (RSD) representation
- X = xp - xn.
- Subtraction is equivalent to addition
- X - Y = (xp - xn) - (yp - yn) = (xp - xn) + (yn - yp)
- Comparison is relatively easy
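The identity behind RSD subtraction can be checked with a value-level sketch. It models a number only as a pair (xp, xn) with X = xp - xn and uses ordinary integer addition on the two parts, so it shows the algebra rather than the carry-free digit-level adder a hardware design would use; all names are illustrative.

```python
def rsd_value(x):
    """Value of a pair (xp, xn): X = xp - xn."""
    xp, xn = x
    return xp - xn


def rsd_negate(x):
    """Negation just swaps the positive and negative parts."""
    xp, xn = x
    return (xn, xp)


def rsd_add(x, y):
    """Value-level addition: add positive parts and negative parts separately."""
    return (x[0] + y[0], x[1] + y[1])


def rsd_sub(x, y):
    """X - Y = (xp - xn) + (yn - yp): an addition with Y's parts swapped."""
    return rsd_add(x, rsd_negate(y))


def rsd_is_negative(x):
    """Comparison against zero reduces to comparing the two parts."""
    return x[0] < x[1]


x, y = (13, 4), (20, 2)          # X = 9, Y = 18
diff = rsd_sub(x, y)
```

Because negation is a swap, the same adder serves for both addition and subtraction, which is what makes the representation attractive for inversion.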
37RSD
- All previous multipliers require a reverse transformation to non-redundant form after each multiplication
- There are thousands of multiplications in ECC
- With RSD, all the computation can be done in RSD form without any reverse transformation
- a single transformation is necessary if the result is needed in non-redundant form.
38Support for GF(3^n) Arithmetic
- RSD lends itself to a unified arithmetic architecture that efficiently supports GF(3^n) arithmetic
39Analysis
- A1: GF(p)-only architecture
- A2: GF(2^k)-only architecture
- A3: GF(3^n)-only architecture
- A4: Unified architecture (GF(p), GF(2^k))
- A5: Unified architecture (GF(p), GF(2^k), GF(3^n))
- A1 + A2: Hypothetical architecture that has separate datapaths for GF(p) and GF(2^k)
40Analysis
- Metric: area × time
- A4 over A1 + A2: 7.94%
- A5 over A1 + A2 + A3: 33.54%
- A5 over A4 + A3: 28.36%
41Implementation Results
- 4 PUs → 11,000 NAND gates; 8 PUs → 15,000 NAND gates
42Research Directions
- Embed the unified architectures into common general-purpose processors
- Unified inversion using RSD
- Unified architectures for other PKC
43Ending
- Questions
- Contact
- Erkay Savas
- erkays_at_sabanciuniv.edu
- http://people.sabanciuniv.edu/erkays