Title: Cryptographic Algorithms Implemented on FPGAs
1Cryptographic Algorithms Implemented on FPGAs
2Why Secure Hardware?
- Embedded systems are now common in the industry
- Hardware tokens, smartcards, crypto accelerators, internet appliances
- Detailed analysis and reverse engineering techniques are available to all
- Increase difficulty of attack
- The means exist
3Attacker resources and methods vary greatly
Resource      Teenager   Academic   Org. Crime  Govt
Time          Limited    Moderate   Large       Large
Budget ($)    <1000      10K-100K   100K        Unknown
Creativity    Varies     High       Varies      Varies
Detectability High       High       Low         Low
Target        Challenge  Publicity  Money       Varies
Number        Many       Moderate   Few         Unknown
Organized     No         No         Yes         Yes
Spread info?  Yes        Yes        Varies      No
Source: Cryptography Research, Inc., 1999, "Crypto Due Diligence"
4Minimal key lengths for symmetric ciphers
Source: Blaze/Diffie/Rivest/Schneier/Shimomura/Thompson/Wiener, www.bsa.org/policy/encryption
                                                        Time and cost per key recovered            Length needed for
Type of attacker      Budget   Tool                     40 bits              56 bits               protection in late 1995
Pedestrian Hacker     tiny     scavenged computer time  1 week               infeasible            45
                      $400     FPGA                     5 hours ($0.08)      38 years ($5,000)     50
Small Business        $10,000  FPGA                     12 min ($0.08)       556 days ($5,000)     55
Corporate Department  $300K    FPGA                     24 sec ($0.08)       19 days ($5,000)      60
                               ASIC                     0.18 sec ($0.001)    3 hours ($38)
Big Company           $10M     FPGA                     7 sec ($0.08)        13 hours ($5,000)     70
                               ASIC                     0.005 sec ($0.001)   6 min ($38)
Intelligence Agency   $300M    ASIC                     0.0002 sec ($0.001)  12 sec ($38)          75
5Reconfigurable Hardware
- In commercial applications, Reconfigurable Hardware (RCHW) mostly means
- Field Programmable Gate Arrays (FPGAs)
- Erasable Programmable Logic Devices (EPLDs)
6Field Programmable Gate Arrays
- Can realize a variety of circuits,
- can be reprogrammed in-system,
- consist of boolean and storage elements,
- can realize fairly large circuits (> 100,000 gates).
7Reconfigurable Computing - Characteristics
- RC is the middle ground between ASICs and microprocessors. ASICs are the ultimate in speed but lack flexibility, while processors are the ultimate in flexibility but lack speed.
- Its key feature is the ability to perform computations in hardware to increase performance, while retaining much of the flexibility of a software solution.
8Choosing a Platform
- Choice of implementation is driven by
- Algorithm performance
- Cost: per-unit cost, development cost
- Power consumption (wireless devices!)
- Flexibility: parameter change, key agility, algorithm agility
- Physical security
9Platform Implementation for Cryptographic Algorithms
Cryptographic Algorithms
- Software: general purpose µProcs, embedded µProcs, etc.
- Classic Hardware: VLSI ASIC chips
- Reconfigurable HW: FPGAs
10Reconfigurable Computing - defined
[Table: ASIC, Processor and Reconfigurable Hardware compared by Performance, Flexibility, Unit Cost and Development Cost]
11Why Crypto-algorithms in Hardware
- Two main reasons
- Software implementations are too slow for some applications (symmetric algorithms: encryption rates of 100 Mbit/sec; public-key algorithms: > 10 msec)
- Hardware implementations are intrinsically more physically secure: key access and algorithm modification are considerably harder.
12But why reconfigurable hardware?
- Potential advantages of crypto algorithms implemented on reconfigurable platforms
- Algorithm Agility
- Algorithm Upgrade
- Architecture Efficiency
- Resource Efficiency
- Algorithm Modification
- (Throughput relative to software)
- (Cost Efficiency relative to ASICs)
13Crypto and FPGAs: Algorithm Agility
- Observation: modern security protocols are defined to be algorithm independent
- Encryption algorithm is negotiated on a per-session basis.
- A wide variety of ciphers can be required. Ex.: IPsec-allowed algorithms DES, 3DES, Blowfish, CAST, IDEA, RC4 and RC6, future extensions!
- Same holds for public-key algorithms, e.g., Diffie-Hellman and ECDH.
- Recall that ASIC solutions can provide algorithm agility only at high costs.
14Crypto and FPGAs: Algorithm Upgrade
- Applications may need an upgrade to a new algorithm because
- Current algorithm was broken (DES)
- Standard expired (again DES)
- New standard was created (AES)
- Algorithm list of an algorithm-independent protocol was extended
- Upgrade of an ASIC-implemented algorithm is practically infeasible if many devices are affected or in applications such as satellite communications.
15Crypto and FPGAs: Architecture Efficiency
- In certain cases a hardware architecture can be much more efficient if it is designed for a specific set of parameters. Parameters for cryptographic algorithms can be, for example, the key, the underlying finite field, the coefficients used (e.g., the specific curve of an ECC system), and so on. Generally speaking, the more specifically an algorithm is implemented, the more efficient it can become.
16Crypto and FPGAs: Resource Efficiency
- Observation: the majority of security protocols use private-key as well as public-key algorithms during one session, but not simultaneously.
- The same FPGA device can be used for both through run-time reconfiguration.
17Crypto and FPGAs: Algorithm Modification
- Some applications require public algorithms (such as AES candidates) with proprietary modules, e.g., proprietary S-boxes or permutations.
- Change of modes of operation (feedback modes, counter mode, etc.)
- Cryptanalytical implementations, such as key-search machines, may use slightly altered versions of the algorithms.
- With FPGAs, these changes can readily be implemented.
18Motivation
19- Motivation(1) FPGAs
- potential features
20Configurable Logic Block
[Figure: CLB slice in logic mode: two 4-input combinational logic blocks, each followed by a 1-bit register.]
21- Motivation(1)
- High density built-in modules
Virtex-II Pro
Feature/Product XC2VP2 XC2VP4 XC2VP7 XC2VP20 XC2VP30 XC2VP40 XC2VP50 XC2VP70 XC2VP100 XC2VP125
EasyPath cost reduction - - - - XCE2VP30 XCE2VP40 XCE2VP50 XCE2VP70 XCE2VP100 XCE2VP125
Logic Cells 3,168 6,768 11,088 20,880 30,816 43,632 53,136 74,448 99,216 125,136
Slices 1,408 3,008 4,928 9,280 13,696 19,392 23,616 33,088 44,096 55,616
BRAM (Kbits) 216 504 792 1,584 2,448 3,456 4,176 5,904 7,992 10,008
18x18 Multipliers 12 28 44 88 136 192 232 328 444 556
Digital Clock Management Blocks 4 4 4 8 8 8 8 8 12 12
Config (Mbits) 1.31 3.01 4.49 8.21 11.36 15.56 19.02 25.6 33.65 42.78
PowerPC Processors 0 1 1 2 2 2 2 2 2 4
Max Available Multi-Gigabit Transceivers 4 4 8 8 8 12 16 20 20 24
Max Available User I/O 204 348 396 564 644 804 852 996 1164 1200
1 Logic Cell = (1) 4-input LUT + (1) FF + (1) Carry Logic; 1 CLB = (4) Slices
http://www.xilinx.com/products/tables/fpga.htm#v2p
22 Motivation(2) Cryptographic algorithms: Basic primitives
Survey by Stephen et al, LNCS 1482, Sep. 98
23 Motivation(1 + 2) Cryptographic algorithms on FPGAs
- Cryptographic algorithms
- Simple logical operations at the bit level
- Replicated blocks
- Block length is high
- FPGAs
- FPGAs natively handle bit-level operations
- Blocks can simply be copied
- Parallelism is possible (high number of I/Os)
- More physical security
- Flexibility
- High density
24- Motivation(3)
- High Performance
25- Motivation(4)
- Smart card applications
26Case Study: Modular Exponentiation
27But why are we interested in modular
exponentiation in the first place?
28RSA cryptosystem by layers
Protocols and Applications: SSL, TLS, WTLS, WAP, etc.
PKCS User Functions: PKCS1_OAEP_Encrypt, PKCS1_OAEP_Decrypt, PKCS1_v15_Sign,
PKCS Primitives: PKCS1_OAEP_Encode, PKCS1_OAEP_Decode, etc.
RSA primitive operations: Encryption C = M^e mod n, Decryption M = C^d mod n.
F_p finite field operations: addition, squaring, multiplication, inversion and exponentiation
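The RSA primitive layer reduces to a single modular exponentiation in each direction. A minimal sketch, assuming toy parameters (p = 61, q = 53, e = 17 are illustrative choices, not values from the slides) and no padding layer:

```python
# Toy parameters for illustration only (assumed, not from the slides).
# Real deployments use moduli of 2048 bits or more and a padding scheme such as OAEP.
p, q = 61, 53
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler totient of n
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)


def rsa_encrypt(m, e, n):
    """RSA primitive: C = M^e mod n."""
    return pow(m, e, n)


def rsa_decrypt(c, d, n):
    """RSA primitive: M = C^d mod n."""
    return pow(c, d, n)


message = 65
cipher = rsa_encrypt(message, e, n)
recovered = rsa_decrypt(cipher, d, n)
```

Both directions call the same modular exponentiation, which is why that one operation is the natural target for a hardware implementation.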
29Public-Key Cryptography
30Public-Key Cryptography
31Modern Cryptosystems: A Top-Down Model
Applications: e-commerce, smart cards, digital money, secure communications, etc.
Crypto-protocols: Diffie-Hellman, authentication protocols, etc.
Top-level crypto-primitives: key-pair generation, signing and verification
Low-level crypto-primitives: addition, doubling, scalar multiplication
F_{2^m} finite field operations: addition, squaring, multiplication and inversion
32AES (Rijndael) Algorithm Implementation
33AES Advanced Encryption Standard (Rijndael)
[Figure: AES block. Inputs: Plain Text (128 bits) and Key (128 bits), with selection of rounds; output: Cipher Text (128 bits).]
- AES Processes
- Key Scheduling
- Encryption
- Decryption
34AES Advanced Encryption Standard
Input: 128 bits = 16 bytes
35Key Scheduling
[Figure: the user key is expanded into the generated keys Round Key 0, Round Key 1, Round Key 2, ..., Round Key 10.]
36AES Encryption Algorithm Flow
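[Figure: AES encryption flow from IN to OUT. An initial ARK uses the user key; the middle rounds, repeated (ROUND-1) times, apply BS, SR, MC and ARK with a sub key; the final round applies BS, SR and ARK.]
BS: Byte Substitution; SR: Shift Rows; MC: Mix Column; ARK: Add Round Key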
371. Byte Substitution
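[Figure: each byte a(i,j) of the 4x4 state matrix is replaced through the S-BOX (16x16 table) by the byte b(i,j).]
State matrix before BS:      State matrix after BS:
a0,0 a0,1 a0,2 a0,3          b0,0 b0,1 b0,2 b0,3
a1,0 a1,1 a1,2 a1,3          b1,0 b1,1 b1,2 b1,3
a2,0 a2,1 a2,2 a2,3          b2,0 b2,1 b2,2 b2,3
a3,0 a3,1 a3,2 a3,3          b3,0 b3,1 b3,2 b3,3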
382. ShiftRow(SR)
[Figure: each row of the state matrix is rotated left by its offset before MC.]
Offset 0:  a b c d  ->  a b c d
Offset 1:  e f g h  ->  f g h e
Offset 2:  i j k l  ->  k l i j
Offset 3:  m n o p  ->  p m n o
[Figure: the same diagram is repeated for the path into IMC.]
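The row offsets 0 to 3 can be sketched in a few lines. This is an illustrative software model (the function names are hypothetical), using the letters a to p from the slide as state bytes:

```python
def shift_rows(state):
    """Rotate row r of the 4x4 state left by r positions (offsets 0..3)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row r right by r positions, undoing shift_rows."""
    return [row[len(row) - r:] + row[:len(row) - r] for r, row in enumerate(state)]


state = [list("abcd"), list("efgh"), list("ijkl"), list("mnop")]
shifted = shift_rows(state)
```

In hardware the same permutation costs no logic at all, since it is only a fixed rewiring of the 16 byte lanes.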
393. MixColumn(MC) Inv MixColumn(IMC)
[Figure: MC multiplies each state column i = 0, 1, 2, 3 by a fixed coefficient matrix; IMC uses the inverse matrix.]
Every entry is represented in GF(2^8)
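As a software reference for the column transforms, the sketch below uses the standard AES coefficient matrices (taken from the AES specification, since the matrices themselves are not legible in the slide) and a hypothetical helper `mat_vec`:

```python
def xtime(v):
    """Multiply v by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    v <<= 1
    return (v ^ 0x11B) & 0xFF if v & 0x100 else v


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add, built on xtime."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


# Standard AES coefficient matrices (circulant).
MC = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
IMC = [[0x0E, 0x0B, 0x0D, 0x09], [0x09, 0x0E, 0x0B, 0x0D],
       [0x0D, 0x09, 0x0E, 0x0B], [0x0B, 0x0D, 0x09, 0x0E]]


def mat_vec(matrix, col):
    """Multiply a 4x4 matrix by one state column over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


def mix_column(col):
    return mat_vec(MC, col)


def inv_mix_column(col):
    return mat_vec(IMC, col)


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

MC only needs the coefficients 01, 02 and 03, so one `xtime` per byte suffices, whereas IMC needs up to three chained `xtime` steps per byte.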
404. AddRoundKey(ARK)
[Figure: ARK combines the state with the round key (sub key) by bitwise XOR.]
41AES Implementation Strategies
The commonly used architectures are
- Iterative looping (repeated n times)
- Inner-round pipelining (one round)
- Loop unrolling
42AES Implementation Strategies
Metrics to measure performance?
[Equations (1) and (2)]
- FPGA resources used
- CLB slices
- BRAMs
- etc.
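A common way to quantify these metrics is throughput and throughput per slice. The definitions below are an assumption rather than the slide's own formulas, though they reproduce the 258.5 and 2868 Mbits/s figures reported for the sequential and pipelined designs; the function names are hypothetical:

```python
def throughput_mbps(block_bits, freq_mhz, cycles_per_block):
    """Throughput in Mbits/s: bits per block times blocks completed per microsecond."""
    return block_bits * freq_mhz / cycles_per_block


def throughput_per_area(throughput, slices):
    """Throughput/Area: Mbits/s delivered per CLB slice."""
    return throughput / slices


# Sequential core: assumed to finish one 128-bit block every 10 clock cycles.
sequential = throughput_mbps(128, 20.192, 10)
# Fully pipelined core: one 128-bit block per clock cycle once the pipeline is full.
pipelined = throughput_mbps(128, 22.41, 1)
```

With these definitions the sequential core lands at about 258.5 Mbits/s and the pipelined core at about 2868 Mbits/s.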
43- Design 1: Encryptor Core
- Sequential vs. Pipelined Architecture
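44AES Algorithm Implementation: Sequential Approach
[Figure: sequential datapath. PLAIN TEXT enters RND 0; a LATCH driven by CLK and the select signal S feeds RND 1-9 iteratively; RND 10 produces CIPHER TEXT. A KGEN block with its own LATCH, driven by CLK, S, RCON and the USER KEY, supplies the ROUND KEY to each round.]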
45AES Algorithm Implementation: Sequential Approach
Byte Substitution (BS): look-up table method
[Figure: the state bytes B1, B2, ..., B15, B16 each pass through their own S-Box (256 x 8). The CLB is used in memory mode: two 16x1 RAMs with 4-bit addresses, each followed by a 1-bit register.]
46AES Algorithm Implementation: Sequential Approach
SR
IN (4 bytes):  a b c d
OUT (4 bytes): b c d a
Just a change of wires, no space occupied
47AES Algorithm Implementation: Sequential Approach
AddRoundKey
Key
Here xtime(v) represents 02·v in GF(2^8).
48Performance results
Target device: VirtexE XCV812
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 2744 (22%)
BRAMs: not used
I/Os: 385 (95%)
Achieved frequency: 20.192 MHz
Throughput: 258.5 Mbits/s
Throughput/Area: 0.09
49AES Algorithm Implementation: Pipelined Approach
[Figure: pipelined datapath. IN passes through IN REG and then through eleven round stages RND 0 to RND 10 to OUT. A chain of eleven KGEN blocks, fed by the USER-KEY through its own IN REG, supplies the round keys RK 0 to RK 10 to the matching stages.]
50Performance results
Target device: VirtexE XCV812
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 2136 (18%)
BRAMs: 100
I/Os: 385 (95%)
Achieved frequency: 22.41 MHz
Throughput: 2868 Mbits/s
Throughput/Area: 1.29
51- Design 2: Encryptor/Decryptor Core
- MixColumn and Inv. MixColumn Modified
52MixColumn(MC) and Inv MixColumn(IMC) Revisited
[Figure: the MC and IMC coefficient matrices.]
Every entry is represented in GF(2^8)
53MixColumn(MC) and Inv MixColumn(IMC), Cont.
[Equations: MC and IMC expressed through the multiplications 02(x), 04(x) and 08(x).]
- The coefficients for IMC have a higher Hamming weight, which makes it a costly operation.
54MixColumn(MC) and Inv MixColumn(IMC), Cont.
We observe that:
[Equations (1) and (2)]
The biggest coefficient in Eq. 2 is 05.
Eq. 1 we already have, and the Eq. 2 calculation can be made before Eq. 1.
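One well-known factorization that fits this remark about a largest coefficient of 05 is IMC = MC · ModM, where ModM is the circulant matrix with first row (05, 00, 04, 00). That this is the decomposition the slide intends is an assumption; the sketch below only checks the identity numerically, and `mat_mul` is a hypothetical helper:

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


MC = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
IMC = [[0x0E, 0x0B, 0x0D, 0x09], [0x09, 0x0E, 0x0B, 0x0D],
       [0x0D, 0x09, 0x0E, 0x0B], [0x0B, 0x0D, 0x09, 0x0E]]
# Assumed modification matrix: its largest coefficient is 05.
MODM = [[5, 0, 4, 0], [0, 5, 0, 4], [4, 0, 5, 0], [0, 4, 0, 5]]


def mat_mul(a, b):
    """4x4 matrix product over GF(2^8); addition is XOR."""
    result = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            for k in range(4):
                result[i][j] ^= gf_mul(a[i][k], b[k][j])
    return result


product = mat_mul(MC, MODM)
```

Because `product` equals `IMC`, a decryptor can apply the cheap ModM step first and then reuse the encryptor's MC hardware, which matches the ModM followed by MC ordering in the decryption data path.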
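55Data Path for Encryption/Decryption
[Figure: two versions of the data path from IN to OUT, with the E/D signal selecting the ENC or DEC branch. First version: ENC uses MI, AF, SR, MC, ARK and DEC uses ISR, IAF, MI, IMC, IARK. Second version: DEC replaces IMC and IARK by ModM followed by the shared MC and ARK.]
Encryption: MI -> AF -> SR -> MC -> ARK
Decryption: ISR -> IAF -> MI -> ModM -> MC -> ARK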
56Performance results
Target device: VirtexE XCV2600
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 5677 (22.3%)
BRAMs: 80 (43%)
I/Os: 386 (48%)
Achieved frequency: 34.2 MHz
Throughput: 4121 Mbits/s
Throughput/Area: 0.73
57- Design 3: Encryptor/Decryptor Core
- S-Box and Inv. S-Box
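58Byte Substitution (Revisited)
[Figure: each byte a(i,j) of the 4x4 state matrix is replaced through the S-BOX (256 x 8) by the byte b(i,j).]
State matrix before BS:      State matrix after BS:
a0,0 a0,1 a0,2 a0,3          b0,0 b0,1 b0,2 b0,3
a1,0 a1,1 a1,2 a1,3          b1,0 b1,1 b1,2 b1,3
a2,0 a2,1 a2,2 a2,3          b2,0 b2,1 b2,2 b2,3
a3,0 a3,1 a3,2 a3,3          b3,0 b3,1 b3,2 b3,3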
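59BS and Inverse BS
S-BOX: IN -> MI -> AF
INV S-BOX: IN -> IAF -> MI
MI is computed in GF(2^8)
[Figure: merged datapath in which the E/D signal selects between the AF and IAF branches around one shared MI block.]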
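The decomposition of the S-box into a multiplicative inverse (MI) and an affine transformation (AF) can be checked in software. A minimal sketch, assuming the standard AES field polynomial and affine constants; the function names are hypothetical, and the inverse is computed by plain exponentiation rather than by any hardware-oriented method:

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def mi(a):
    """MI: multiplicative inverse as a^254, which maps 0 to 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r if a else 0


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def af(x):
    """AF: the AES affine transformation."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def iaf(y):
    """IAF: inverse of the AES affine transformation."""
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def sbox(x):
    return af(mi(x))        # S-BOX: MI then AF


def inv_sbox(y):
    return mi(iaf(y))       # INV S-BOX: IAF then MI
```

Both directions call the same `mi`, which is the block that a combined encryptor/decryptor can share.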
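60MI 1st Approach
[Figure: encryption/decryption data path from IN to OUT with a single MI block shared by both directions; E/D selects AF, SR, MC, ARK for encryption and ISR, IAF, IMC, IARK for decryption.]
- MI with look-up table
- Same S-Box (MI) for encryption/decryption
- Memory requirements are halved
- BRAMs are used for storing the MI values.
- No initial time is needed to prepare them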
61Performance results
Target device: VirtexE XCV2600
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 6677 (26.3%)
BRAMs: 80 (43%)
I/Os: 386 (48%)
Achieved frequency: 30 MHz
Throughput: 3840 Mbits/s
Throughput/Area: 0.58
62MI 2nd Approach
[Figure: three stages using the matrices M and M^-1: 1st transformation (GF(2^8) to field F), MI manipulation in GF(2^4), 2nd transformation (field F to GF(2^8)).]
MI three-stage strategy: S. Morioka and A. Satoh, CHES 2002
- MI with composite fields GF((2^2)^2) and GF((2^4)^2)
- Map the element A ∈ GF(2^8) to a composite field F
- Compute the multiplicative inverse over the field F
- Map back from field F to GF(2^8)
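63MI Implementation
[Figure: MI datapath. A is converted from GF(2^8) to GF(2^4) components AH and AL (4 bits each); a squarer (X^2), a multiplication by λ and 4x4 multipliers (Mul 4x4) form A^17 from λ·AH^2 and AL·A^16; A^17 is inverted (X^-1) and multiplied with A^16; the result is converted from GF(2^4) back to GF(2^8) as A^-1.]
Let $A \in F$ and $A = A_H y + A_L$; then it can be shown that
$$A^{-1} = \left(A^{17}\right)^{-1} A^{16}$$
$$A^{16} = A_H y + (A_H + A_L)$$
$$A^{17} = A \cdot A^{16} = 0 \cdot y + \lambda A_H^{2} + (A_H + A_L) A_L$$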
64Performance results
Target device: VirtexE XCV2600
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 13416 (52%)
BRAMs: none
I/Os: 386 (48%)
Achieved frequency: 24.5 MHz
Throughput: 3136 Mbits/s
Throughput/Area: 0.24
65AES Algorithm Implementations
Results Comparison
66Sequential Vs Pipeline design
Sequential Design
Pipeline Design
67MixColumn vs Inv MixColumn
Design          Device    BRAMs  CLB slices (S)  Throughput in Mbits/s (T)  T/S
McLoone et al.  XCV3200E  102    7576            3239                       0.43
This design     XCV2600E  80     5677            4121                       0.73
- Two approaches for MC/IMC
- Fewer BRAMs
- Fewer slices
- Highest throughput reported to date
68S-Box vs Inv S-Box
Design       Device    BRAMs     CLB slices (S)  Throughput in Mbits/s (T)  T/S
McLoone      XCV3200E  102       7576            3239                       0.43
E/D GF(2^8)  XCV2600E  80        6676            3840                       0.58
E/D GF(2^4)  XCV2600E  No BRAMs  13416           3136                       0.24
- Two approaches for MI
- Key scheduling included
- No initial delay
- First design uses a look-up table for MI: fast, but with high memory requirements
- Second design uses the composite field approach for MI: slower, with lower memory requirements
- Both are efficient compared with the reported design
69Modular Exponentiation Binary Method Variations
70Side Channel Attacks
Algorithm: Binary exponentiation
Input: a in G, exponent d = (d_k, d_{k-1}, ..., d_0) (d_k is the most significant bit)
Output: c = a^d in G
1. c = a
2. For i = k-1 down to 0
3.     c = c^2
4.     If d_i = 1 then c = c·a
5. Return c
The time or the power needed to execute c^2 and c·a are different (side channel information).
Algorithm: Coron's exponentiation
Input: a in G, exponent d = (d_k, d_{k-1}, ..., d_0)
Output: c = a^d in G
1. c_0 = a
2. For i = k-1 down to 0
3.     c_0 = c_0^2
4.     c_1 = c_0·a
5.     c_0 = c_{d_i}
6. Return c_0
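Both exponentiation variants can be compared side by side. A minimal sketch, assuming G is the multiplicative group of integers modulo n, d >= 1, and an accumulator that starts at a because the most significant bit d_k is 1; the function names are hypothetical:

```python
def binary_exp(a, d, n):
    """MSB-first square-and-multiply: the multiply happens only for 1 bits."""
    bits = bin(d)[2:]               # d_k ... d_0 with d_k = 1
    c = a % n
    for bit in bits[1:]:            # i = k-1 down to 0
        c = c * c % n
        if bit == "1":
            c = c * a % n
    return c


def always_multiply_exp(a, d, n):
    """Square-and-multiply-always: every bit costs one squaring and one multiply."""
    bits = bin(d)[2:]
    c = [a % n, 0]
    for bit in bits[1:]:
        c[0] = c[0] * c[0] % n      # c0 = c0^2
        c[1] = c[0] * a % n         # c1 = c0 * a, computed regardless of the bit
        c[0] = c[int(bit)]          # keep c1 only when d_i = 1
    return c[0]
```

The second variant performs the same operation sequence for every exponent of a given length, which removes the square-versus-multiply pattern an attacker could read from a power trace. Python integers are not constant-time, so this only illustrates the control flow, not a hardened implementation.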
71Mod. Exponentiation: LSB-First Binary
- Let k be the number of bits of e, i.e., k = 1 + ⌊log2 e⌋
- Input: M, e, n.
- Output: R = M^e mod n
- R = 1; C = M
- For i = 0 to k-1
- If e_i = 1 then R = R·C mod n
- C = C^2 mod n
- Return R
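The LSB-first method translates directly into code. In this sketch (the function name is hypothetical) the loop runs over the k bits of e, and the accumulator R, not the running square C, holds the result:

```python
def lsb_first_exp(m, e, n):
    """LSB-first binary exponentiation: returns m^e mod n."""
    k = e.bit_length()          # number of bits of e
    r, c = 1, m % n             # R = 1, C = M
    for i in range(k):          # i = 0 to k-1
        if (e >> i) & 1:        # e_i = 1
            r = r * c % n       # R = R * C mod n
        c = c * c % n           # C = C^2 mod n
    return r


result = lsb_first_exp(7, 250, 1009)
```

The multiplication into R and the squaring of C in one iteration do not depend on each other, so a hardware implementation can run the two in parallel.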
72Modular Exponentiation: LSB-First Binary
- Example: e = 250 = (11111010), thus k = 8
i  e_i  Step 3 (R)               Step 4 (C)
0  0    1                        M^2
1  1    1 · M^2 = M^2            (M^2)^2 = M^4
2  0    M^2                      (M^4)^2 = M^8
3  1    M^2 · M^8 = M^10         (M^8)^2 = M^16
4  1    M^10 · M^16 = M^26       (M^16)^2 = M^32
5  1    M^26 · M^32 = M^58       (M^32)^2 = M^64
6  1    M^58 · M^64 = M^122      (M^64)^2 = M^128
7  1    M^122 · M^128 = M^250    (M^128)^2 = M^256
73Modular Exponentiation: LSB-First Binary
- The LSB-first binary method requires
- Squarings: k-1
- Multiplications: the number of 1s in the binary expansion of e, excluding the MSB.
- The total number of multiplications
- Maximum: (k-1) + (k-1) = 2(k-1)
- Minimum: (k-1) + 0 = k-1
- Average: (k-1) + 1/2 (k-1) = 1.5(k-1)
- Same as before, but here the multiplication can be computed in parallel with the squarings!
74Multiplier Architecture (Mario García et al., ENC03)
75Development (q-ary Method)
76Development (q-ary Method)
- Precomputation of W.
- Size of q.
- Computation of d 2p q
77Development (Analysis)
- Memory size and execution time of the precomputation of W.
- Number of multiplications and squarings for the q-ary method.
78Execution Time vs. Number of Procs.
79Memory Size
80First Layer: Field Multiplication
- Preliminary results yield a time delay of 50-70 µs and about 9K slices of hardware resource utilization.