Title: TIE Extensions for Cryptographic Acceleration
1TIE Extensions for Cryptographic Acceleration
- Charles-Henri Gros
- Alan Keefer
- Ankur Singla
2Agenda
- Introduction
- Survey of Existing Architectures
- Xtensa Crypto Processor
- Rijndael Algorithm (AES final selection)
- RC6, IDEA, and DES
- Performance
- Trade-off Analysis
- Conclusion
3Introduction
- Commercial Networking Applications require flexible, high-throughput secure connectivity
- Encryption/Decryption algorithms are computation intensive
- Multi-session applications present significant load on embedded processors
- Embedded systems need performance while optimizing power and area
- Our study: existing architectures, analysis of Xtensa as an alternative, performance analysis, and trade-offs for embedded systems
4Survey of Existing Architectures
- Three categories
- Specialized Crypto Processors
- Reconfigurable Architectures
- Full Hardware Implementation (ASICs/FPGAs)
- High Variation in architecture complexity
- Performance vs Area tradeoff
- Suitability for Embedded Applications
5Specialized Crypto Processors
- Few VLIW architectures, e.g. CryptoManiac
- Instruction Combining: instruction-word combining to exploit ILP
- Crypto Arithmetic Unit(s): multiple XORs, GF multiplication/addition, lookup table substitution, and permutation
- Coarse configurability of datapath
- Mostly lacking SIMD support
- Performance is typically 2x to 6x that of general-purpose processors
6Reconfigurable Architectures
- Numerous reconfigurable processor architectures: PipeRench, MorphoSys, COBRA, and GARP
- Functional Units that provide all crypto arithmetic: multiple XORs, GF multiplication/addition, modulo multiplication
- Reconfigurable Interconnection Network to provide dynamic change to functional unit connectivity
- VLIW Instructions
- Reconfiguration Registers
- Suitable for Block Ciphers
- High Variability in Performance increase w.r.t. Processors
7Full Hardware Implementation
- High-performance implementations targeted to ASICs/FPGAs
- DES: 12 Gbps on Virtex-E XCV300E
- AES: 18 Gbps on ASIC using TSMC 0.18 µm process
- Lacking flexibility and crypto-modes
- Memory and Area efficient
- Typical latency only in DMA of data to Hardware unit
- Need additional processor for control path
8Xtensa Crypto Architecture
- Custom Extensions to Xtensa Processor using the TIE framework
- Addition of Generic Key Schedule Register File and Instructions to support all Crypto Algorithms studied
- Addition of multiple on-chip SRAMs (in addition to 4 Data-RAMs) to the Xtensa processor
- Currently Implemented using Table construct in TIE
- Hacked TIE Compiler generated Verilog Code to instantiate multiple RAM models (implemented using multi-dimensional array) for viability analysis
- Addition of 4 State Registers and 4 Next State Registers generic to all algorithms studied
- Possible future extensions to include multi-session key storage and fast retrieval support
9AES Overview
- AES (Advanced Encryption Standard) is the standard set to replace DES for both government and private-sector encryption
- Uses a fixed block size of 128 bits, with key sizes of 128, 192, or 256 bits
- Designed to be efficient in both hardware and software across a variety of platforms
- 10, 12, or 14 rounds depending on key size
- 128-bit round key used for each round
- Can be pre-computed and cached for future encryptions
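The relation between key size, round count, and the amount of round-key material to cache can be sketched as follows. This is an illustration only: `aes_params` and its field names are hypothetical, and `Nk`/`Nr` follow FIPS-197 naming. The schedule holds one more 128-bit round key than there are rounds, because the initial key XOR also consumes one.

```python
BLOCK_BITS = 128  # AES block size is fixed


def aes_params(key_bits: int) -> dict:
    """Return round count and round-key storage for a given AES key size."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    nk = key_bits // 32      # key length in 32-bit words (Nk)
    nr = nk + 6              # number of rounds (Nr): 10, 12, or 14
    round_keys = nr + 1      # one per round plus the initial key XOR
    return {
        "rounds": nr,
        "round_keys": round_keys,
        "schedule_bytes": round_keys * BLOCK_BITS // 8,
    }


for bits in (128, 192, 256):
    print(bits, aes_params(bits))
```

The `schedule_bytes` figure is the per-key storage a cached key schedule would need, which is relevant to the multi-session key storage mentioned earlier.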
10AES Implementation Abstraction
- Each round consists of a lookup, byte-level permutation, finite field multiplication, and key XOR
- Lookup and multiplication can be combined into four separate 8x32 lookup tables, so each round is 16 lookups and 16 XORs
- Decryption is essentially the same, but with different tables and a different key schedule
11TIE Implementation
- Our implementation does all 16 lookups in parallel, requiring 16 SRAMs
- x0, x1, x2, x3 represent the round state (each 32 bits), k0, k1, k2, k3 are the current round key, and Tij are the T-boxes, where i is a duplication index and j is the T-box index
- Each round is then (^ is XOR; each T-box is indexed by the low 8 bits of its argument; right-hand sides use the previous state)
- x0 = T00[x0] ^ T01[x1>>8] ^ T02[x2>>16] ^ T03[x3>>24] ^ k0
- x1 = T10[x1] ^ T11[x2>>8] ^ T12[x3>>16] ^ T13[x0>>24] ^ k1
- x2 = T20[x2] ^ T21[x3>>8] ^ T22[x0>>16] ^ T23[x1>>24] ^ k2
- x3 = T30[x3] ^ T31[x0>>8] ^ T32[x1>>16] ^ T33[x2>>24] ^ k3
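A software model of this round can make the table construction concrete. The sketch below is not the TIE implementation: it assumes little-endian packing of each state column into a 32-bit word, drops the duplication index i (software has no SRAM port conflicts), and all names (`gmul`, `make_sbox`, `T`, `tbox_round`, `reference_round`) are illustrative. It builds the four T-boxes from the S-box and checks one T-box round against a plain SubBytes/ShiftRows/MixColumns/AddRoundKey round. The final AES round, which omits MixColumns, is not covered.

```python
def xtime(a):
    """Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gmul(a, b):
    """GF(2^8) multiplication by shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def make_sbox():
    """AES S-box: field inverse (0 maps to 0) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for i in range(1, 5):  # XOR in the four left-rotations of inv
            s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def pack(b0, b1, b2, b3):
    """Pack four bytes into a word, byte 0 in the low 8 bits (assumed order)."""
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)


SBOX = make_sbox()
# T[j][a]: contribution of a row-j input byte a to a whole output column,
# i.e. S-box lookup and MixColumns multiplication folded into one 8x32 table.
T = [
    [pack(gmul(s, 2), s, s, gmul(s, 3)) for s in SBOX],
    [pack(gmul(s, 3), gmul(s, 2), s, s) for s in SBOX],
    [pack(s, gmul(s, 3), gmul(s, 2), s) for s in SBOX],
    [pack(s, s, gmul(s, 3), gmul(s, 2)) for s in SBOX],
]


def tbox_round(x, k):
    """One inner round: 16 lookups and 16 XORs over four column words."""
    return [
        T[0][x[c] & 0xFF]
        ^ T[1][(x[(c + 1) % 4] >> 8) & 0xFF]
        ^ T[2][(x[(c + 2) % 4] >> 16) & 0xFF]
        ^ T[3][(x[(c + 3) % 4] >> 24) & 0xFF]
        ^ k[c]
        for c in range(4)
    ]


def reference_round(x, k):
    """Same round done step by step, for cross-checking the tables."""
    s = [[(x[c] >> (8 * r)) & 0xFF for c in range(4)] for r in range(4)]
    # SubBytes and ShiftRows: row r is rotated left by r columns
    s = [[SBOX[s[r][(c + r) % 4]] for c in range(4)] for r in range(4)]
    out = []
    for c in range(4):
        a = [s[r][c] for r in range(4)]
        mixed = [gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                 ^ a[(r + 2) % 4] ^ a[(r + 3) % 4] for r in range(4)]
        out.append(pack(*mixed) ^ k[c])  # MixColumns then AddRoundKey
    return out


state = [0x03020100, 0x07060504, 0x0B0A0908, 0x0F0E0D0C]
round_key = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
assert tbox_round(state, round_key) == reference_round(state, round_key)
print([f"{w:08x}" for w in tbox_round(state, round_key)])
```

In hardware the four tables are replicated per output column so that all 16 lookups can be issued in the same cycle; the software model reuses one copy of each table sequentially.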
12Other Ciphers Implemented
- DES (Data Encryption Standard)
- 64-bit block, 56-bit key, 16 rounds, Feistel network
- 8 6x4 S-Boxes, XORs, and bit-level permutations
- Can't really be done efficiently in software
- TIE Implementation required 1 Instruction per round
- IDEA (International Data Encryption Algorithm)
- 64-bit block, 128-bit key, 8 rounds, iterated, operates on 16-bit numbers
- 4 Multiplications mod 2^16 + 1, 4 adds mod 2^16, 6 XORs
- Each round is highly sequential, so difficult to parallelize
- TIE Implementation required 7 Instructions per round
- RC6
- Same block and key modes as AES, 20 rounds, iterated
- Multiplication mod 2^32, XORs, rotations, addition mod 2^32
- TIE Implementation required 2 Instructions per round
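The arithmetic primitives named for IDEA and RC6 can be sketched in a few lines. This is a software illustration under stated assumptions, not the TIE instructions themselves: the function names are hypothetical, the IDEA multiply uses the usual convention that the 16-bit value 0 stands for 2^16, and `rc6_round` follows the standard RC6-32 round structure with two round-key words `s0` and `s1` supplied by the caller (key scheduling is omitted).

```python
M16 = 0xFFFF
M32 = 0xFFFFFFFF


def idea_mul(a: int, b: int) -> int:
    """IDEA multiplication mod 2^16 + 1, where 0 represents 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % 0x10001) & M16  # a result of 2^16 is stored as 0


def idea_add(a: int, b: int) -> int:
    """IDEA addition mod 2^16."""
    return (a + b) & M16


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r (only the low 5 bits of r are used)."""
    r &= 31
    return ((x << r) | (x >> (32 - r))) & M32


def rc6_f(x: int) -> int:
    """RC6 quadratic function: (x * (2x + 1) mod 2^32) rotated left by 5."""
    return rotl32((x * (2 * x + 1)) & M32, 5)


def rc6_round(a, b, c, d, s0, s1):
    """One RC6 round on four 32-bit words with round-key words s0, s1."""
    t = rc6_f(b)
    u = rc6_f(d)
    a = (rotl32(a ^ t, u) + s0) & M32
    c = (rotl32(c ^ u, t) + s1) & M32
    return b, c, d, a  # word rotation (A, B, C, D) = (B, C, D, A)


print(idea_mul(3, 5), idea_mul(0, 0), rc6_f(1))
```

The data-dependent rotations and the chained multiplies are what make these rounds sequential in software; a custom instruction can fold several of these steps into one operation, which is consistent with the per-round instruction counts listed above.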
13AES Performance in Xtensa
- Performance of TIE extensions approaches performance of non-pipelined ASICs
- Total of 31 run-time instructions per data-block
- Initial EXOR Instruction
- 1 Instruction per round computation (10 total)
- 20 Cycles for Load and Store of 128-bit Data Blocks
- Generally an order of magnitude better than pure software
- Also faster than reconfigurable hardware or a specialized VLIW processor
14Mbps of Throughput
        Base    VLIW    TIE     ASIC    Reconfig.
AES     43.7    512     984     18000   594
DES     26.5    240     586     15000   53.3
IDEA    28      200     231     2034    1013
RC6     61      368     508     15200   470
15Cycles Per Block
        Base    VLIW    TIE     ASIC
AES     838     90      31      10
DES     690     112     26      16
IDEA    653     112     66      9
RC6     600     140     60      9
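Cycles per block and throughput are linked by throughput = block size x clock frequency / cycles per block. The slides do not state the clock frequencies, so the sketch below only back-computes the clock implied by the Base and TIE columns, assuming one block in flight at a time and 1 Mbps = 10^6 bits/s. Function and dictionary names are illustrative; block sizes are those given for each cipher above (RC6 uses the AES block size).

```python
BLOCK_BITS = {"AES": 128, "DES": 64, "IDEA": 64, "RC6": 128}

# Values copied from the throughput and cycles-per-block tables.
CYCLES = {
    "Base": {"AES": 838, "DES": 690, "IDEA": 653, "RC6": 600},
    "TIE": {"AES": 31, "DES": 26, "IDEA": 66, "RC6": 60},
}
MBPS = {
    "Base": {"AES": 43.7, "DES": 26.5, "IDEA": 28, "RC6": 61},
    "TIE": {"AES": 984, "DES": 586, "IDEA": 231, "RC6": 508},
}


def throughput_mbps(block_bits, cycles, clock_mhz):
    """Throughput in Mbps for a given clock, one block at a time."""
    return block_bits * clock_mhz / cycles


def implied_clock_mhz(block_bits, cycles, mbps):
    """Clock frequency (MHz) that would reconcile the two tables."""
    return mbps * cycles / block_bits


for impl in ("Base", "TIE"):
    for cipher, bits in BLOCK_BITS.items():
        f = implied_clock_mhz(bits, CYCLES[impl][cipher], MBPS[impl][cipher])
        print(f"{impl:4} {cipher:4} implied clock ~ {f:6.1f} MHz")
```

Under these assumptions the four TIE entries agree on roughly 238 MHz and the four Base entries on roughly 286 MHz, which suggests the two tables are mutually consistent; the actual clock targets are not given in the slides.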
16Design Tradeoffs
- Flexibility
- Algorithm changes
- New algorithms
- New encryption modes
- Implementation bugs
- Time to Market
- Closer to software development time
- Can choose which parts to accelerate
17Power vs. Performance Mbps/mW
        Base    VLIW    TIE     ASIC    Reconfig.
AES     0.36    1.15    5.63    30      0.66
DES     0.22    0.54    4.19    59.13   0.08
IDEA    0.23    0.62    2.13    15.8    2.89
RC6     0.51    1.37    4.69    14.12   1.35
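Dividing throughput (Mbps) by efficiency (Mbps/mW) gives the power each data point implies. The slides do not list power directly, so the sketch below is a derived view of the two tables; the names `implied_power_mw` and `gain_over_base` are illustrative.

```python
COLUMNS = ["Base", "VLIW", "TIE", "ASIC", "Reconfig."]

# Values copied from the throughput and Mbps/mW tables.
MBPS = {
    "AES": [43.7, 512, 984, 18000, 594],
    "DES": [26.5, 240, 586, 15000, 53.3],
    "IDEA": [28, 200, 231, 2034, 1013],
    "RC6": [61, 368, 508, 15200, 470],
}
MBPS_PER_MW = {
    "AES": [0.36, 1.15, 5.63, 30, 0.66],
    "DES": [0.22, 0.54, 4.19, 59.13, 0.08],
    "IDEA": [0.23, 0.62, 2.13, 15.8, 2.89],
    "RC6": [0.51, 1.37, 4.69, 14.12, 1.35],
}


def implied_power_mw(cipher):
    """Power in mW implied by throughput / efficiency, per implementation."""
    return {col: m / e
            for col, m, e in zip(COLUMNS, MBPS[cipher], MBPS_PER_MW[cipher])}


def gain_over_base(cipher, column="TIE"):
    """Ratio of Mbps/mW for one implementation relative to the base core."""
    eff = dict(zip(COLUMNS, MBPS_PER_MW[cipher]))
    return eff[column] / eff["Base"]


for cipher in MBPS:
    power = implied_power_mw(cipher)
    print(cipher,
          {col: round(p, 1) for col, p in power.items()},
          f"TIE gain over base: {gain_over_base(cipher):.1f}x")
```

Read this way, the Base column corresponds to roughly 120 mW for all four ciphers, while the TIE column's implied power varies by cipher, so the efficiency gain comes mainly from higher throughput rather than from lower power.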
18Conclusion
- Xtensa instructions provide flexibility, performance, and Mbps/mW all somewhere between an ASIC and a VLIW or Software-based solution
- Suitable for most Embedded Applications like 802.11i, etc.
- Using Xtensa for cryptography is a good choice if
- You don't need absolute throughput
- You don't need absolute flexibility
- You need a control processor anyway
- The algorithms needed are known ahead of time