TIE Extensions for Cryptographic Acceleration

Provided by: whoc

Learn more at: https://suif.stanford.edu

Transcript and Presenter's Notes

1
TIE Extensions for Cryptographic Acceleration

Charles-Henri Gros
Alan Keefer
Ankur Singla

2
Agenda

Introduction
Survey of Existing Architectures
Xtensa Crypto Processor
Rijndael Algorithm (AES final selection)
RC6, IDEA, and DES
Performance
Trade-off Analysis
Conclusion

3
Introduction

Commercial networking applications require
flexible, high-throughput secure connectivity
Encryption/decryption algorithms are computation
intensive
Multi-session applications present significant
load on embedded processors
Embedded systems need performance while
optimizing power and area
Our study: survey of existing architectures,
analysis of Xtensa as an alternative, performance
analysis, and trade-offs for embedded systems

4
Survey of Existing Architectures

Three categories
Specialized Crypto Processors
Reconfigurable Architectures
Full Hardware Implementation (ASICs/FPGAs)
High Variation in architecture complexity
Performance vs Area tradeoff
Suitability for Embedded Applications

5
Specialized Crypto Processors

A few VLIW architectures, e.g. CryptoManiac
Instruction combining to exploit ILP
Crypto Arithmetic Unit(s): multiple XORs, GF
multiplication/addition, lookup table
substitution, and permutation
Coarse configurability of datapath
Mostly lacking SIMD support
Performance is typically 2x to 6x that of
general-purpose processors

6
Reconfigurable Architectures

Numerous reconfigurable processor architectures
PipeRench, MorphoSys, COBRA, and GARP
Functional Units that provide all crypto
arithmetic: multiple XORs, GF multiplication/addition,
modulo multiplication
Reconfigurable Interconnection Network to provide
dynamic change to functional unit connectivity
VLIW Instructions
Reconfiguration Registers
Suitable for Block Ciphers
High variability in performance increase w.r.t.
general-purpose processors

7
Full Hardware Implementation

High performance implementations targeted to
ASICs/FPGAs
DES 12 Gbps on Virtex-E XCV300E
AES 18 Gbps on ASIC using TSMC 0.18 µm process
Lacking flexibility and crypto-modes
Memory and Area efficient
Latency typically comes only from DMA of data to
the hardware unit
Need additional processor for control path

8
Xtensa Crypto Architecture

Custom Extensions to Xtensa Processor using the
TIE framework
Addition of Generic Key Schedule Register File
and Instructions to support all Crypto Algorithms
studied
Addition of multiple on-chip SRAMs (in addition
to 4 Data-RAMs) to the Xtensa processor
Currently Implemented using Table construct in
TIE
Hacked the Verilog code generated by the TIE
compiler to instantiate multiple RAM models
(implemented using multi-dimensional arrays) for
viability analysis
Addition of 4 State Registers and 4 Next State
Registers generic to all algorithms studied
Possible future extensions to include
multi-session key storage and fast retrieval
support

9
AES Overview

AES (Advanced Encryption Standard) is the
standard set to replace DES for both government
and private-sector encryption
Uses a fixed block size of 128 bits, with key
sizes of 128, 192, or 256 bits
Designed to be efficient in both hardware and
software across a variety of platforms
10, 12, or 14 rounds depending on key size
128-bit round key used for each round
Can be pre-computed and cached for future
encryptions
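
The relationship between key size, round count, and the size of a cached key schedule can be written out as a small sketch. The function names are illustrative only; the formula Nr = Nk + 6 (Nk being the key length in 32-bit words) comes from the AES specification rather than from these slides.

```python
def aes_rounds(key_bits):
    """Number of AES rounds for a given key size: Nr = Nk + 6."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    return key_bits // 32 + 6


def round_key_bytes(key_bits):
    """Bytes needed to cache the expanded key.

    One 128-bit (16-byte) round key per round, plus one for the
    initial key XOR before the first round.
    """
    return 16 * (aes_rounds(key_bits) + 1)


for bits in (128, 192, 256):
    print(bits, aes_rounds(bits), round_key_bytes(bits))
```

For a 128-bit key this gives 10 rounds and 176 bytes of round-key storage per session, which is the quantity a multi-session key store would have to hold for each connection.
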

10
AES Implementation Abstraction

Each round consists of a lookup, byte-level
permutation, finite field multiplication, and key
XOR
Lookup and multiplication can be combined into
four separate 8x32 lookup tables, so each round
is 16 lookups and 16 XORs
Decryption is essentially the same, but with
different tables and a different key schedule

11
TIE Implementation

Our implementation does all 16 lookups in
parallel, requiring 16 SRAMs
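x0, x1, x2, x3 represent the round state (each
32 bits), k0, k1, k2, k3 are the current round
key, and Tij are the T-boxes, where i is a
duplication index and j is the T-box index
Each round is then
x0 = T00[x0] ⊕ T01[x1 >> 8] ⊕ T02[x2 >> 16] ⊕ T03[x3 >> 24] ⊕ k0
x1 = T10[x1] ⊕ T11[x2 >> 8] ⊕ T12[x3 >> 16] ⊕ T13[x0 >> 24] ⊕ k1
x2 = T20[x2] ⊕ T21[x3 >> 8] ⊕ T22[x0 >> 16] ⊕ T23[x1 >> 24] ⊕ k2
x3 = T30[x3] ⊕ T31[x0 >> 8] ⊕ T32[x1 >> 16] ⊕ T33[x2 >> 24] ⊕ k3
Each T-box is indexed by the low 8 bits of its
argument, and every right-hand side uses the
state from the previous round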
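
A software sketch of one such T-box round may help make the lookups concrete. It is not the authors' TIE code. It assumes each state word holds one AES column with row 0 in the low byte, builds the S-box by brute force rather than from a stored table, and uses a single copy of each T-box (the duplication index i only matters in hardware, where 16 SRAMs are read in parallel). All function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11b."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox_entry(a):
    """AES S-box: multiplicative inverse, then the affine transform."""
    inv = 0 if a == 0 else next(b for b in range(1, 256) if gf_mul(a, b) == 1)
    x = inv
    for i in range(1, 5):
        x ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
    return x ^ 0x63


def rotl32(w, n):
    """Rotate a 32-bit word left by n bits."""
    return ((w << n) | (w >> (32 - n))) & 0xFFFFFFFF


SBOX = [sbox_entry(a) for a in range(256)]
# T0 folds the S-box lookup and the MixColumns multiplication into one
# 8x32 table; T1..T3 are byte rotations of T0.
T0 = [gf_mul(s, 2) | (s << 8) | (s << 16) | (gf_mul(s, 3) << 24) for s in SBOX]
T = [[rotl32(w, 8 * j) for w in T0] for j in range(4)]


def aes_round(x, k):
    """One regular AES round: 16 table lookups and 16 XORs."""
    return [
        T[0][x[i] & 0xFF]
        ^ T[1][(x[(i + 1) % 4] >> 8) & 0xFF]
        ^ T[2][(x[(i + 2) % 4] >> 16) & 0xFF]
        ^ T[3][(x[(i + 3) % 4] >> 24) & 0xFF]
        ^ k[i]
        for i in range(4)
    ]


def to_words(hex_block):
    """Split a 16-byte hex string into four little-endian column words."""
    b = bytes.fromhex(hex_block)
    return [int.from_bytes(b[4 * c:4 * c + 4], "little") for c in range(4)]


def to_hex(words):
    """Inverse of to_words."""
    return b"".join(w.to_bytes(4, "little") for w in words).hex()
```

As a check, feeding the round-1 input state and round key from the worked example in the AES specification (`to_words("193de3bea0f4e22b9ac68d2ae9f84808")` and `to_words("a0fafe1788542cb123a339392a6c7605")`) through `aes_round` and `to_hex` should reproduce the round-2 input `a49c7ff2689f352b6b5bea43026a5049`. The sketch covers only the regular rounds; the final AES round omits the column mixing and needs different tables.
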

12
Other Ciphers Implemented

DES (Data Encryption Standard)
64-bit block, 56-bit key, 16 rounds, Feistel
network
8 6x4 S-Boxes, XORs, and bit-level permutations
Can't really be done efficiently in software
TIE Implementation required 1 Instruction per
round
IDEA (International Data Encryption Algorithm)
64-bit block, 128-bit key, 8 rounds, iterated,
operates on 16-bit numbers
4 multiplications mod 2^16 + 1, 4 additions mod
2^16, 6 XORs
Each round is highly sequential, so difficult to
parallelize
TIE Implementation required 7 Instructions per
round
RC6
Same block and key modes as AES, 20 rounds,
iterated
Multiplication mod 2^32, XORs, rotations, addition
mod 2^32
TIE Implementation required 2 Instructions per
round
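
The modular arithmetic that IDEA and RC6 rely on is easy to misread from the bullet points, so here is a minimal sketch of the two multiplication primitives as they are defined in the published ciphers. The function names are illustrative, and nothing here reflects how the TIE instructions themselves were written.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def idea_mul(a, b):
    """IDEA multiplication modulo 2**16 + 1.

    Operands are 16-bit values in which 0 stands for 2**16, so every
    operand is a nonzero residue modulo the prime 65537.
    """
    a = a or 0x10000
    b = b or 0x10000
    # A result of 2**16 is mapped back to the 16-bit value 0.
    return (a * b % 0x10001) & MASK16


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (n taken modulo 32)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def rc6_f(x):
    """RC6 quadratic function: (x * (2x + 1) mod 2**32) rotated left by 5."""
    return rotl32((x * (2 * x + 1)) & MASK32, 5)
```

The special encoding of zero in `idea_mul` is what makes this operation awkward on a general-purpose processor, since each multiplication needs extra compares and a modular reduction. A 32-bit multiply followed by a fixed rotate, as in `rc6_f`, maps more directly onto ordinary datapath hardware.
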

13
AES Performance in Xtensa

Performance of TIE extensions approaches
performance of non-pipelined ASICs
Total of 31 run-time instructions per data block
1 initial XOR instruction
1 instruction per round computation (10 total)
20 cycles for load and store of the 128-bit data
block
Generally an order of magnitude better than pure
software
Also faster than reconfigurable hardware or a
specialized VLIW processor

14
Throughput (Mbps)

        Base    VLIW    TIE     ASIC    Reconfig.
AES     43.7    512     984     18000   594
DES     26.5    240     586     15000   53.3
IDEA    28      200     231     2034    1013
RC6     61      368     508     15200   470

15
Cycles Per Block

        Base    VLIW    TIE     ASIC
AES     838     90      31      10
DES     690     112     26      16
IDEA    653     112     66      9
RC6     600     140     60      9

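
The throughput and cycle-count figures are linked by throughput = block size × clock frequency / cycles per block. The slides do not state the clock frequency, so the sketch below only recovers the value that the TIE columns jointly imply, using the block sizes given earlier (128 bits for AES and RC6, 64 bits for DES and IDEA). Function names are illustrative.

```python
def throughput_mbps(block_bits, cycles_per_block, clock_mhz):
    """Throughput in Mbps for one block every cycles_per_block cycles."""
    return block_bits * clock_mhz / cycles_per_block


def implied_clock_mhz(block_bits, cycles_per_block, mbps):
    """Clock frequency (MHz) implied by a throughput and a cycle count."""
    return mbps * cycles_per_block / block_bits


# (block size in bits, TIE cycles per block, TIE throughput in Mbps)
TIE = {
    "AES": (128, 31, 984),
    "DES": (64, 26, 586),
    "IDEA": (64, 66, 231),
    "RC6": (128, 60, 508),
}

for name, (bits, cycles, mbps) in TIE.items():
    print(name, round(implied_clock_mhz(bits, cycles, mbps), 1))
```

All four TIE rows work out to roughly 238 MHz, which suggests the TIE figures were computed at a single clock rate. The same calculation on the other columns gives different frequencies, so the tables compare designs running at different clocks.
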
16
Design Tradeoffs

Flexibility
Algorithm changes
New algorithms
New encryption modes
Implementation bugs
Time to Market
Closer to software development time
Can choose which parts to accelerate

17
Power vs. Performance (Mbps/mW)

        Base    VLIW    TIE     ASIC    Reconfig.
AES     0.36    1.15    5.63    30      0.66
DES     0.22    0.54    4.19    59.13   0.08
IDEA    0.23    0.62    2.13    15.8    2.89
RC6     0.51    1.37    4.69    14.12   1.35

18
Conclusion

Xtensa instructions provide flexibility,
performance, and Mbps/mW all somewhere between an
ASIC and a VLIW or Software-based solution
Suitable for most Embedded Applications like
802.11i, etc.
Using Xtensa for cryptography is a good choice
if:
You don't need absolute throughput
You don't need absolute flexibility
You need a control processor anyway
The algorithms needed are known ahead of time