Cryptographic Algorithms Implemented on FPGAs

Slides: 81

Provided by: deltaCsCi


Transcript and Presenter's Notes

Title: Cryptographic Algorithms Implemented on FPGAs

1
Cryptographic Algorithms Implemented on FPGAs
2
Why Secure Hardware?

Embedded systems are now common in the industry
Hardware tokens, smartcards, crypto accelerators, internet appliances
Detailed analysis and reverse-engineering techniques are available to all
Increase the difficulty of attack
The means exist

3
Attacker resources and methods vary greatly
Resource        Teenager    Academic    Org. Crime   Govt
Time            Limited     Moderate    Large        Large
Budget ($)      < 1000      10K-100K    100K         Unknown
Creativity      Varies      High        Varies       Varies
Detectability   High        High        Low          Low
Target          Challenge   Publicity   Money        Varies
Number          Many        Moderate    Few          Unknown
Organized       No          No          Yes          Yes
Spread info?    Yes         Yes         Varies       No
Source: Cryptography Research, Inc. 1999, "Crypto Due Diligence"
4
Minimal key lengths for symmetric ciphers
Source: Blaze/Diffie/Rivest/Schneier/Shimomura/Thompson/Wiener, www.bsa.org/policy/encryption
Time and cost per key recovered are given for 40-bit and 56-bit keys.
Type of attacker       Budget    Tool                      40 bits               56 bits               Length needed for protection in late 1995
Pedestrian Hacker      tiny      scavenged computer time   1 week                infeasible            45
                       $400      FPGA                      5 hours ($0.08)       38 years ($5,000)     50
Small Business         $10,000   FPGA                      12 min ($0.08)        556 days ($5,000)     55
Corporate Department   $300K     FPGA                      24 sec ($0.08)        19 days ($5,000)      60
                                 ASIC                      0.18 sec ($0.001)     3 hours ($38)
Big Company            $10M      FPGA                      7 sec ($0.08)         13 hours ($5,000)     70
                                 ASIC                      0.005 sec ($0.001)    6 min ($38)
Intelligence Agency    $300M     ASIC                      0.0002 sec ($0.001)   12 sec ($38)          75
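The 40-bit and 56-bit columns are roughly related by the size of the key space. The following is a minimal sketch, assuming that exhaustive search time doubles with every additional key bit on the same hardware; the function name is hypothetical and not part of the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def scale_search_time(time_40bit_seconds: float, key_bits: int) -> float:
    """Scale a 40-bit exhaustive-search time to another key length.

    Assumption: same hardware, and the work doubles per extra key bit.
    """
    return time_40bit_seconds * 2 ** (key_bits - 40)

# The $400 FPGA attacker needs about 5 hours for a 40-bit key.
five_hours = 5 * 3600
years_for_56_bits = scale_search_time(five_hours, 56) / SECONDS_PER_YEAR
print(f"{years_for_56_bits:.1f} years")  # close to the 38 years quoted in the table
```

The factor between the two columns is 2^16 = 65536, which is why a few hours at 40 bits turns into decades at 56 bits.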
5
Reconfigurable Hardware

In commercial applications, reconfigurable hardware (RCHW) mostly means:
Field Programmable Gate Arrays (FPGAs)
Erasable Programmable Logic Devices (EPLDs).

6
Field Programmable Gate Arrays

Can realize a variety of circuits,
can be reprogrammed in-system,
consist of Boolean and storage elements,
can realize fairly large circuits (> 100,000 gates).

7
Reconfigurable Computing - Characteristics

RC is the middle ground between ASICs and
microprocessors. ASICs are the ultimate in speed
but lack flexibility while processors have the
ultimate in flexibility but lack speed.
Its key feature is the ability to perform
computations in hardware to increase performance,
while retaining much of the flexibility of a
software solution.

8
Choosing a Platform

Choice of implementation is driven by:
Algorithm performance
Cost: per-unit cost, development cost
Power consumption (wireless devices!)
Flexibility: parameter change, key agility, algorithm agility
Physical security

9
Platform Implementation for Cryptographic Algorithms
Cryptographic algorithms can be implemented on:
Classic hardware: VLSI ASIC chips
Reconfigurable HW: FPGAs
Software: general-purpose µProcs, embedded µProcs, etc.
10
Reconfigurable Computing - defined
[Figure: ASIC, processor, and reconfigurable hardware compared on performance, flexibility, unit cost, and development cost]
11
Why Crypto-algorithms in Hardware

Two main reasons:
Software implementations are too slow for some applications (symmetric algorithms: encryption rates of 100 Mbit/sec; public-key algorithms: > 10 msec).
Hardware implementations are intrinsically more physically secure: key access and algorithm modification are considerably harder.

12
But why reconfigurable hardware?

Potential advantages of crypto algorithms
implemented on reconfigurable platforms
Algorithm Agility
Algorithm Upgrade
Architecture Efficiency
Resource Efficient
Algorithm Modification
(Throughput relative to software)
(Cost Efficiency relative to ASICs)

13
Crypto and FPGAs: Algorithm Agility

Observation: Modern security protocols are defined to be algorithm independent.
The encryption algorithm is negotiated on a per-session basis.
A wide variety of ciphers can be required. Example: IPsec-allowed algorithms include DES, 3DES, Blowfish, CAST, IDEA, RC4 and RC6, plus future extensions!
The same holds for public-key algorithms, e.g., Diffie-Hellman and ECDH.
Recall that ASIC solutions can provide algorithm agility only at high cost.

14
Crypto and FPGAs: Algorithm Upgrade

Applications may need an upgrade to a new algorithm because:
The current algorithm was broken (DES)
The standard expired (again DES)
A new standard was created (AES)
The algorithm list of an algorithm-independent protocol was extended
Upgrading an ASIC-implemented algorithm is practically infeasible if many devices are affected, or in applications such as satellite communications.

15
Crypto and FPGAs: Architecture Efficiency

In certain cases a hardware architecture can be much more efficient if it is designed for a specific set of parameters. Parameters for cryptographic algorithms can be, for example, the key, the underlying finite field, the coefficients used (e.g., the specific curve of an ECC system), and so on. Generally speaking, the more specifically an algorithm is implemented, the more efficient it can become.

16
Crypto and FPGAs: Resource Efficiency

Observation: The majority of security protocols use private-key as well as public-key algorithms during one session, but not simultaneously.
The same FPGA device can be used for both through run-time reconfiguration.

17
Crypto and FPGAs: Algorithm Modification

Some applications require public algorithms (such as AES candidates) with proprietary modules, e.g., proprietary S-boxes or permutations.
Change of modes of operation (feedback modes, counter mode, etc.)
Cryptanalytic implementations, such as key-search machines, may use slightly altered versions of the algorithms.
With FPGAs, these changes can readily be implemented.

18
Motivation
19

Motivation(1) FPGAs
potential features

Motivation (1): FPGAs
CLB: Configurable Logic Block
[Figure: CLB in logic mode, with two 4-input combinational logic blocks, each feeding a 1-bit register]
21

Motivation(1)
High density built-in modules

Virtex-II Pro
Feature/Product XC2VP2 XC2VP4 XC2VP7 XC2VP20 XC2VP30 XC2VP40 XC2VP50 XC2VP70 XC2VP100 XC2VP125
EasyPath cost reduction - - - - XCE2VP30 XCE2VP40 XCE2VP50 XCE2VP70 XCE2VP100 XCE2VP125
Logic Cells 3,168 6,768 11,088 20,880 30,816 43,632 53,136 74,448 99,216 125,136
Slices 1,408 3,008 4,928 9,280 13,696 19,392 23,616 33,088 44,096 55,616
BRAM (Kbits) 216 504 792 1,584 2,448 3,456 4,176 5,904 7,992 10,008
18x18 Multipliers 12 28 44 88 136 192 232 328 444 556
Digital Clock Management Blocks 4 4 4 8 8 8 8 8 12 12
Config (Mbits) 1.31 3.01 4.49 8.21 11.36 15.56 19.02 25.6 33.65 42.78
PowerPC Processors 0 1 1 2 2 2 2 2 2 4
Max Available Multi-Gigabit Transceivers 4 4 8 8 8 12 16 20 20 24
Max Available User I/O 204 348 396 564 644 804 852 996 1164 1200
1 Logic Cell = (1) 4-input LUT + (1) FF + (1) Carry Logic; 1 CLB = (4) Slices
http://www.xilinx.com/products/tables/fpga.htm#v2p
22
Motivation (2): Cryptographic algorithms -> basic primitives
Survey by Stephen et al., LNCS 1482, Sep. 98
23
Motivation (1 + 2): Cryptographic algorithms on FPGAs

Cryptographic algorithms:
Simple logical operations at the bit level
Replicated blocks
Large block length
FPGAs:
Operate directly at the bit level
Blocks can simply be copied
Parallelism is possible (high number of I/Os)
More physical security
Flexibility
High density

Motivation(3)
High Performance

Motivation(4)
Smart card applications

26
Case Study: Modular Exponentiation
27
But why are we interested in modular
exponentiation in the first place?
28
RSA cryptosystem by layers
Protocols and applications: SSL, TLS, WTLS, WAP, etc.
PKCS user functions: PKCS1_OAEP_Encrypt, PKCS1_OAEP_Decrypt, PKCS1_v15_Sign, ...
PKCS primitives: PKCS1_OAEP_Encode, PKCS1_OAEP_Decode, etc.
RSA primitive operations: encryption C = M^e mod n, decryption M = C^d mod n.
F_p finite field operations: addition, squaring, multiplication, inversion and exponentiation
29
Public-Key Cryptography
30
Public-Key Cryptography
31
Modern Cryptosystems: A Top-Down Model
Applications: e-commerce, smart cards, digital money, secure communications, etc.
Crypto-protocols: Diffie-Hellman, authentication protocols, etc.
Top-level crypto-primitives: key-pair generation, signing and verification
Low-level crypto-primitives: addition, doubling, scalar multiplication
F_(2^m) finite field operations: addition, squaring, multiplication and inversion
32
AES (Rijndael) Algorithm Implementation
33
AES: Advanced Encryption Standard (Rijndael)
[Figure: AES block diagram. Inputs: 128-bit plain text, 128-bit key, selection of rounds. Output: 128-bit cipher text]

AES processes:
Key scheduling
Encryption
Decryption
34
AES Advanced Encryption Standard
Input: 128 bits = 16 bytes
35
Key Scheduling
[Figure: the user key is expanded into the generated keys Round Key 0, Round Key 1, ..., Round Key 10]
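The slide shows only the result of key scheduling: eleven round keys derived from one user key. As an illustration, here is a minimal sketch of the standard AES-128 key expansion from FIPS-197; the slides do not describe how their KGEN block is built, so this is an assumption about the function, not about the hardware. The S-box is computed rather than tabulated to keep the example self-contained.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def sbox(x: int) -> int:
    """AES S-box: multiplicative inverse (x^254) followed by the affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63

SBOX = [sbox(x) for x in range(256)]

def expand_key(key: bytes) -> list:
    """Return Round Key 0 .. Round Key 10 for a 16-byte user key."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(words[i - 1])
        if i % 4 == 0:
            t = t[1:] + t[:1]           # rotate the word by one byte
            t = [SBOX[b] for b in t]    # substitute each byte
            t[0] ^= rcon                # add the round constant
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], t)])
    return [bytes(sum(words[4 * r:4 * r + 4], [])) for r in range(11)]

round_keys = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(round_keys[10].hex())
```

Round Key 0 is the user key itself, so encryption can start while the later round keys are still being generated.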
36
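AES Encryption Algorithm Flow
[Figure: AES encryption flow. IN -> ARK -> (BS -> SR -> MC -> ARK), repeated (ROUND-1) times -> BS -> SR -> ARK -> OUT; the USER KEY feeds the first ARK and SUB KEYs feed the later ones]
BS: Byte Substitution, SR: Shift Rows, MC: Mix Column, ARK: Add Round Key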
37
1. Byte Substitution
[Figure: position of BS in the round (BS, SR, MC, ARK with SUB KEY). Each byte a(i,j) of the state matrix is replaced through the 16x16 S-box by the byte b(i,j), for i, j = 0..3]
38
2. ShiftRow (SR)
[Figure: position of SR in the round (BS, SR, MC, ARK with SUB KEY)]
Each row of the state is rotated by its offset before the result goes to MC:
            Input      Output
Offset 0    a b c d    a b c d
Offset 1    e f g h    f g h e
Offset 2    i j k l    k l i j
Offset 3    m n o p    p m n o
[The figure is repeated for the decryption path, with IMC in place of MC]
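The row offsets in the figure can be sketched in a few lines. This is a minimal software illustration, assuming the state is stored as a list of four rows; the function names are hypothetical.

```python
def shift_rows(state):
    """Rotate row r of the state left by r positions (offsets 0, 1, 2, 3)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def inv_shift_rows(state):
    """Rotate row r of the state right by r positions, undoing shift_rows."""
    return [row[len(row) - r:] + row[:len(row) - r] for r, row in enumerate(state)]

state = [list("abcd"), list("efgh"), list("ijkl"), list("mnop")]
for row in shift_rows(state):
    print(" ".join(row))
```

In hardware the same permutation needs no logic at all, since it is only a fixed reordering of wires.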
39
3. MixColumn (MC) and Inv MixColumn (IMC)
[Figure: position of MC in the round (BS, SR, MC, ARK with SUB KEY); MC and IMC are applied to each state column i = 0, 1, 2, 3]
Every entry is represented in GF(2^8)
40
4. AddRoundKey (ARK)
[Figure: position of ARK in the round (BS, SR, MC, ARK with SUB KEY); the key is applied to the state]
41
AES Implementation Strategies
The commonly used architectures are:
Iterative looping: repeated n times
Inner-round pipelining: one round
Loop unrolling
42
AES Implementation Strategies
Metrics to measure performance?
1
2

FPGAs Resources used
CLB slices
BRAMs
etc.

Design 1: Encryptor Core
Sequential vs. Pipelined Architecture

44
AES Algorithm Implementation: Sequential Approach
[Figure: sequential architecture. PLAIN TEXT -> RND 0 -> RND 1-9 (looped through a LATCH, with CLK and select S) -> RND 10 -> CIPHER TEXT; KGEN derives each ROUND KEY from the USER KEY and RCON through a LATCH]
45
AES Algorithm Implementation: Sequential Approach
Byte Substitution (BS): look-up table method
[Figure: sixteen S-boxes (256 x 8 each), one per state byte B1, B2, ..., B15, B16; CLB in memory mode, with two 16x1 RAMs each feeding a 1-bit register]
46
AES Algorithm Implementation: Sequential Approach
SR
[Figure: one row of the state. IN (4 bytes): a b c d; OUT (4 bytes): b c d a]
Just a change of wiring; no area is occupied
47
AES Algorithm Implementation: Sequential Approach
AddRoundKey
Key
Here xtime(v) represents 02 · v.
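The xtime operation is what makes MixColumn cheap, because multiplication by 03 is just xtime(v) XOR v. The sketch below assumes the standard AES reduction polynomial x^8 + x^4 + x^3 + x + 1 and the standard MixColumn coefficients (02, 03, 01, 01); the slide's own equations are not in the transcript.

```python
def xtime(v: int) -> int:
    """Multiply a byte by 02 in GF(2^8): shift left, reduce on overflow."""
    v <<= 1
    if v & 0x100:
        v ^= 0x11B
    return v

def mix_column(col):
    """Apply MixColumn to one 4-byte column using only xtime and XOR."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Each output byte needs two xtime calls and a handful of XORs, which maps well onto 4-input LUTs.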
48
Performance results
Target device: VirtexE XCV812
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 2744 (22%)
BRAMs: not used
I/Os: 385 (95%)
Achieved frequency: 20.192 MHz
Throughput: 258.5 Mbits/s
Throughput/Area: 0.09
49
AES Algorithm Implementation: Pipelined Approach
[Figure: pipelined architecture. IN -> IN REG -> RND 0 -> RND 1 -> ... -> RND 10 -> OUT; a chain of KGEN blocks derives the round keys RK 0 ... RK 10 from the USER-KEY]
50
Performance results
Target device: VirtexE XCV812
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 2136 (18%)
BRAMs: 100
I/Os: 385 (95%)
Achieved frequency: 22.41 MHz
Throughput: 2868 Mbits/s
Throughput/Area: 1.29
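The throughput figures reported for the sequential and pipelined designs are consistent with a simple relation between clock frequency and cycles per block. The sketch below assumes 10 clock cycles per 128-bit block for the sequential core and one block per clock for the full pipeline; the cycle counts are not stated in the slides and the function name is hypothetical.

```python
def throughput_mbps(freq_mhz: float, cycles_per_block: int, block_bits: int = 128) -> float:
    """Throughput in Mbits/s = block size x clock frequency / cycles per block."""
    return block_bits * freq_mhz / cycles_per_block

sequential = throughput_mbps(20.192, cycles_per_block=10)  # assumed 10 cycles
pipelined = throughput_mbps(22.41, cycles_per_block=1)     # assumed 1 block per clock
print(round(sequential, 1), round(pipelined, 1))
```

Under these assumptions the pipeline gains roughly a factor of ten at almost the same clock rate, at the cost of replicating every round.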
51

Design 2: Encryptor/Decryptor Core
MixColumn and Inv. MixColumn Modified

52
MixColumn (MC) and Inv MixColumn (IMC) Revisited
[Figure: MC and IMC]
Every entry is represented in GF(2^8)
53
MixColumn (MC) and Inv MixColumn (IMC), Cont.
[Equations: MC and IMC written in terms of 02(x), 04(x) and 08(x)]

The coefficients for IMC have a higher Hamming weight, so IMC is the costlier operation.

54
MixColumn (MC) and Inv MixColumn (IMC), Cont.
We observe that:
[Equations (1) and (2)]
The biggest coefficient in Eq. 2 is 05.
Eq. 1 we already have, so the Eq. 2 calculation can be made before Eq. 1.
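The two equations themselves are not in the transcript. One well-known factorization that fits the remark about the coefficient 05 writes Inv MixColumn as MixColumn combined with a sparse matrix P whose only coefficients are 04 and 05. The sketch below checks that identity numerically; treating P as the slide's Eq. 2 and MC as its Eq. 1 is an assumption.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def mat_mul(x, y):
    """4x4 matrix product over GF(2^8) (addition is XOR)."""
    out = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            for k in range(4):
                out[i][j] ^= gf_mul(x[i][k], y[k][j])
    return out

MC = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
IMC = [[0x0E, 0x0B, 0x0D, 0x09], [0x09, 0x0E, 0x0B, 0x0D],
       [0x0D, 0x09, 0x0E, 0x0B], [0x0B, 0x0D, 0x09, 0x0E]]
# Assumed pre-processing matrix: largest coefficient is 05.
P = [[5, 0, 4, 0], [0, 5, 0, 4], [4, 0, 5, 0], [0, 4, 0, 5]]

print(mat_mul(MC, P) == IMC)
```

With this factorization a decryptor can reuse the encryptor's MC block and add only the small P stage in front of it.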
55
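Data Path for Encryption/Decryption
[Figure: combined encryption/decryption data path selected by E/D, shown in two versions. ENC blocks: MI, AF, SR, MC, ARK. DEC blocks: ISR, IAF, MI, IARK and IMC in the first version, ModM in the second]
Encryption: MI -> AF -> SR -> MC -> ARK
Decryption: ISR -> IAF -> MI -> ModM -> MC -> ARK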
56
Performance results
Target device: VirtexE XCV2600
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 5677 (22.3%)
BRAMs: 80 (43%)
I/Os: 386 (48%)
Achieved frequency: 34.2 MHz
Throughput: 4121 Mbits/s
Throughput/Area: 0.73
57

Design 3: Encryptor/Decryptor Core
S-Box and Inv. S-Box

58
Byte Substitution (Revisited)
[Figure: each byte a(i,j) of the state matrix is replaced through the 256 x 8 S-box by the byte b(i,j), for i, j = 0..3]
59
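BS and Inverse BS
[Figure: S-BOX = MI followed by AF; INV S-BOX = IAF followed by MI; MI is computed in GF(2^8). In the merged version an E/D select shares one MI block between the AF and IAF paths]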
60
MI: 1st Approach
[Figure: encryption/decryption data path with a shared MI block, selected by E/D; blocks: MI, AF, SR, MC, ARK and ISR, IAF, IMC, IARK]

MI with look-up table
Same S-box (MI) for encryption and decryption
Memory requirements are halved
BRAMs are used for storing the MI values
No initial time is needed to prepare them
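The sharing of one MI table between encryption and decryption can be illustrated in software. This is a minimal sketch, assuming the standard AES affine transformation (AF) and its inverse (IAF); the table `MI` plays the role of the BRAM contents, and all names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def rotl8(v: int, n: int) -> int:
    return ((v << n) | (v >> (8 - n))) & 0xFF

def gf_inv(x: int) -> int:
    """Multiplicative inverse as x^254 (0 maps to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, x)
    return r

MI = [gf_inv(x) for x in range(256)]  # the single shared look-up table

def af(x: int) -> int:
    """Affine transformation used on the encryption path."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63

def iaf(y: int) -> int:
    """Inverse affine transformation used on the decryption path."""
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05

def sbox(x: int) -> int:
    return af(MI[x])       # S-BOX = MI then AF

def inv_sbox(y: int) -> int:
    return MI[iaf(y)]      # INV S-BOX = IAF then MI

print(hex(sbox(0x53)), hex(inv_sbox(0xED)))
```

Only the small AF and IAF networks differ between the two directions, which is why storing MI instead of two full S-boxes halves the memory.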

61
Performance results
Target device: VirtexE XCV2600
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 6677 (26.3%)
BRAMs: 80 (43%)
I/Os: 386 (48%)
Achieved frequency: 30 MHz
Throughput: 3840 Mbits/s
Throughput/Area: 0.58
62
MI: 2nd Approach
[Figure: three-stage MI. 1st transformation: GF(2^8) to field F; MI manipulation in GF(2^4); 2nd transformation: field F to GF(2^8); the transformations use the matrices M and M^-1]
MI three-stage strategy: S. Morioka and A. Satoh, CHES 2002

MI with composite fields GF((2^2)^2) and GF((2^4)^2)
Map the element A ∈ GF(2^8) to a composite field F
Compute the multiplicative inverse over the field F
Map back from field F to GF(2^8)

63
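MI Implementation
[Figure: MI data path between the GF(2^8) to GF(2^4) and GF(2^4) to GF(2^8) mappings, built from three 4x4 multipliers (Mul 4x4), a squarer (X^2), a multiplication by λ and a GF(2^4) inverter (X^-1); signals: A, A_H, A_L, λ A_H^2, A^16, A^17, A^-1]
Let $A \in F$ and $A = A_H y + A_L$; then it can be shown that
$A^{16} = A_H y + (A_H + A_L)$
$A^{17} = A \cdot A^{16} = 0 \cdot y + \lambda A_H^2 + (A_H + A_L) A_L$
$A^{-1} = (A^{17})^{-1} \cdot A^{16}$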
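The inversion formula can be checked exhaustively in a small model of the composite field. The sketch below assumes GF(2^4) is built with x^4 + x + 1 and that F is GF(2^4)[y] / (y^2 + y + λ), with λ found by brute force as the first constant that makes the quadratic irreducible; the polynomial choices and all names are assumptions, and the mapping to and from GF(2^8) is left out.

```python
def gf16_mul(a: int, b: int) -> int:
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf16_inv(a: int) -> int:
    """Inverse in GF(2^4) as a^14 (0 maps to 0)."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

# First lambda for which y^2 + y + lambda has no root in GF(2^4).
LAMBDA = next(l for l in range(1, 16)
              if all(gf16_mul(y, y) ^ y ^ l for y in range(16)))

def comp_mul(a: int, b: int) -> int:
    """Multiply A = A_H y + A_L by B in F, with y^2 = y + lambda."""
    ah, al, bh, bl = a >> 4, a & 0xF, b >> 4, b & 0xF
    hh = gf16_mul(ah, bh)
    high = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(al, bl)
    return (high << 4) | low

def comp_inv(a: int) -> int:
    """A^-1 = (A^17)^-1 * A^16, where A^17 lies in GF(2^4)."""
    ah, al = a >> 4, a & 0xF
    a17 = gf16_mul(LAMBDA, gf16_mul(ah, ah)) ^ gf16_mul(ah ^ al, al)
    d = gf16_inv(a17)
    # A^16 = A_H y + (A_H + A_L); scale both halves by d.
    return (gf16_mul(d, ah) << 4) | gf16_mul(d, ah ^ al)

print(all(comp_mul(a, comp_inv(a)) == 1 for a in range(1, 256)))
```

The only true inversion is the 4-bit one, which is small enough to build from logic instead of BRAM.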
64
Performance results
Target device: VirtexE XCV2600
Tools used: Xilinx Foundation Tool F4.1i
CLB slices: 13416 (52%)
BRAMs: none
I/Os: 386 (48%)
Achieved frequency: 24.5 MHz
Throughput: 3136 Mbits/s
Throughput/Area: 0.24
65
AES Algorithm Implementations
Results Comparison
66
Sequential Vs Pipeline design
Sequential Design
Pipeline Design
67
MixColumn vs Inv MixColumn
                Device     BRAMs   CLB slices (S)   Throughput in Mbits/s (T)   T/S
McLoone et al   XCV3200E   102     7576             3239                        0.43
This design     XCV2600E   80      5677             4121                        0.73

Two approaches for MC/IMC
Fewer BRAMs
Fewer slices
Higher throughput than reported to date

68
S-Box vs Inv S-Box
              Device     BRAMs      CLB slices (S)   Throughput in Mbits/s (T)   T/S
McLoone       XCV3200E   102        7576             3239                        0.43
E/D GF(2^8)   XCV2600E   80         6676             3840                        0.58
E/D GF(2^4)   XCV2600E   No BRAMs   13416            3136                        0.24

Two approaches for MI
Key scheduling included
No initial delay

The first design uses a look-up table for MI: fast, but with high memory requirements.
The second design uses the composite field approach for MI: slower, with lower memory requirements.

Both are efficient compared to the reported design.

69
Modular Exponentiation Binary Method Variations
70
Side Channel Attacks
Algorithm: Binary exponentiation
Input: a in G, exponent d = (d_k, d_(k-1), ..., d_0) (d_k is the most significant bit)
Output: c = a^d in G
1. c = a
2. For i = k-1 down to 0
3.     c = c^2
4.     If d_i = 1 then c = c * a
5. Return c
The time or the power needed to execute c^2 and c * a is different (side channel information).
Algorithm: Coron's exponentiation
Input: a in G, exponent d = (d_k, d_(k-1), ..., d_0)
Output: c_0 = a^d in G
1. c_0 = 1
2. For i = k-1 down to 0
3.     c_0 = c_0^2
4.     c_1 = c_0 * a
5.     c_0 = c_(d_i)
6. Return c_0
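The two algorithms can be compared side by side in a short sketch. It assumes G is the multiplicative group of integers modulo n, and both loops start from 1 and scan every exponent bit including d_k, so that the result does not depend on d_k being 1. The Python code only illustrates the operation sequence; it is not constant-time.

```python
def binary_exp(a: int, d: int, n: int) -> int:
    """MSB-first square-and-multiply: multiplies only when the bit is 1."""
    c = 1
    for bit in bin(d)[2:]:
        c = (c * c) % n
        if bit == "1":
            c = (c * a) % n
    return c

def coron_exp(a: int, d: int, n: int) -> int:
    """Square-and-multiply-always: one squaring and one multiplication per bit."""
    c = [1, 1]
    for bit in bin(d)[2:]:
        c[0] = (c[0] * c[0]) % n
        c[1] = (c[0] * a) % n   # computed for every bit, used only if bit is 1
        c[0] = c[int(bit)]
    return c[0]

print(binary_exp(7, 250, 1009), coron_exp(7, 250, 1009))
```

The second variant performs the same operations whatever the key bits are, which removes the squaring-versus-multiplication pattern at the cost of extra multiplications.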
71
Mod. Exponentiation: LSB-First Binary

Let k be the number of bits of e, i.e., e = (e_(k-1) ... e_1 e_0) in binary.
Input: M, e, n.
Output: R = M^e mod n
1. R = 1, C = M
2. For i = 0 to k-1
3.     If e_i = 1 then R = R * C mod n
4.     C = C^2 mod n
5. Return R
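A direct software rendering of the LSB-first method is given below as a minimal sketch. It accumulates the result in R and uses C only for the running squares; the function name is hypothetical.

```python
def lsb_first_exp(m: int, e: int, n: int) -> int:
    """LSB-first binary exponentiation: returns m^e mod n (for n > 1)."""
    r, c = 1, m % n
    while e:
        if e & 1:             # bit e_i is 1: fold the current power into R
            r = (r * c) % n
        c = (c * c) % n       # C holds m^(2^i) for the next bit
        e >>= 1
    return r

print(lsb_first_exp(3, 250, 1_000_003))
```

The update of R and the squaring of C use only the old value of C, so in hardware the two operations of one iteration can run on separate multipliers at the same time.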

72
Modular Exponentiation LSB First Binary

Example: e = 250 = (11111010), thus k = 8

i   e_i   Step 3 (R)                 Step 4 (C)
0   0     1                          M^2
1   1     1 * M^2 = M^2              (M^2)^2 = M^4
2   0     M^2                        (M^4)^2 = M^8
3   1     M^2 * M^8 = M^10           (M^8)^2 = M^16
4   1     M^10 * M^16 = M^26         (M^16)^2 = M^32
5   1     M^26 * M^32 = M^58         (M^32)^2 = M^64
6   1     M^58 * M^64 = M^122        (M^64)^2 = M^128
7   1     M^122 * M^128 = M^250      (M^128)^2 = M^256
73
Modular Exponentiation LSB First Binary

The LSB-first binary method requires:
Squarings: k-1
Multiplications: the number of 1s in the binary expansion of e, excluding the MSB.
The total number of multiplications:
Maximum: (k-1) + (k-1) = 2(k-1)
Minimum: (k-1) + 0 = k-1
Average: (k-1) + 1/2 (k-1) = 1.5(k-1)
Same as before, but here we can compute the multiplication operation in parallel with the squarings!