TIE Extensions for Cryptographic Acceleration - PowerPoint PPT Presentation

Provided by: whoc

1
TIE Extensions for Cryptographic Acceleration
  • Charles-Henri Gros
  • Alan Keefer
  • Ankur Singla

2
Agenda
  • Introduction
  • Survey of Existing Architectures
  • Xtensa Crypto Processor
  • Rijndael Algorithm (AES final selection)
  • RC6, IDEA, and DES
  • Performance
  • Trade-off Analysis
  • Conclusion

3
Introduction
  • Commercial networking applications require
    flexible, high-throughput, secure connectivity
  • Encryption/decryption algorithms are
    computation-intensive
  • Multi-session applications present significant
    load on embedded processors
  • Embedded systems need performance while
    optimizing power and area
  • Our study: a survey of existing architectures,
    analysis of Xtensa as an alternative, and
    performance and trade-off analysis for embedded
    systems

4
Survey of Existing Architectures
  • Three categories
  • Specialized Crypto Processors
  • Reconfigurable Architectures
  • Full Hardware Implementation (ASICs/FPGAs)
  • High Variation in architecture complexity
  • Performance vs Area tradeoff
  • Suitability for Embedded Applications

5
Specialized Crypto Processors
  • A few VLIW architectures, e.g. CryptoManiac
  • Instruction combining: instruction-word
    combining to exploit ILP
  • Crypto arithmetic unit(s): multiple XORs, GF
    multiplication/addition, lookup-table
    substitution, and permutation
  • Coarse configurability of datapath
  • Mostly lacking SIMD support
  • Performance is typically 2x to 6x that of
    general-purpose processors

6
Reconfigurable Architectures
  • Numerous reconfigurable processor architectures:
    PipeRench, MorphoSys, COBRA, and GARP
  • Functional units that provide all crypto
    arithmetic: multiple XORs, GF
    multiplication/addition, modulo multiplication
  • Reconfigurable Interconnection Network to provide
    dynamic change to functional unit connectivity
  • VLIW Instructions
  • Reconfiguration Registers
  • Suitable for Block Ciphers
  • High variability in performance increase
    relative to general-purpose processors

7
Full Hardware Implementation
  • High performance implementations targeted to
    ASICs/FPGAs
  • DES: 12 Gbps on Virtex-E XCV300E
  • AES: 18 Gbps on an ASIC using the TSMC 0.18 µm
    process
  • Lacking flexibility and crypto-modes
  • Memory and Area efficient
  • Latency is typically limited to the DMA transfer
    of data to the hardware unit
  • Need additional processor for control path

8
Xtensa Crypto Architecture
  • Custom Extensions to Xtensa Processor using the
    TIE framework
  • Addition of Generic Key Schedule Register File
    and Instructions to support all Crypto Algorithms
    studied
  • Addition of multiple on-chip SRAMs (in addition
    to 4 Data-RAMs) to the Xtensa processor
  • Currently Implemented using Table construct in
    TIE
  • Hand-modified the Verilog code generated by the
    TIE compiler to instantiate multiple RAM models
    (implemented using multi-dimensional arrays) for
    viability analysis
  • Addition of 4 State Registers and 4 Next State
    Registers generic to all algorithms studied
  • Possible future extensions to include
    multi-session key storage and fast retrieval
    support

9
AES Overview
  • AES (Advanced Encryption Standard) is the
    standard set to replace DES for both government
    and private-sector encryption
  • Uses a fixed block size of 128 bits, with key
    sizes of 128, 192, or 256 bits
  • Designed to be efficient in both hardware and
    software across a variety of platforms
  • 10, 12, or 14 rounds depending on key size
  • 128-bit round key used for each round
  • Round keys can be precomputed and cached for
    future encryptions
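
The relationship between key size, round count, and cached round-key storage can be sketched as follows. This is a minimal illustration based on the AES standard rather than on the slides: the function names are hypothetical, and the storage figure assumes one 128-bit round key per round plus one for the initial XOR.

```python
# Round count and cached round-key storage for each AES key size.
BLOCK_BITS = 128

def aes_rounds(key_bits):
    """Number of rounds Nr = Nk + 6, where Nk is the key length in 32-bit words."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    return key_bits // 32 + 6

def round_key_bytes(key_bits):
    """Bytes needed to cache the expanded key: one 128-bit round key
    per round, plus one for the initial XOR."""
    return (aes_rounds(key_bits) + 1) * BLOCK_BITS // 8

if __name__ == "__main__":
    for bits in (128, 192, 256):
        print(bits, aes_rounds(bits), round_key_bytes(bits))
```

A multi-session system would need this much key storage per active session, which is relevant to the key-storage extension mentioned on the architecture slide.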

10
AES Implementation Abstraction
  • Each round consists of a lookup, byte-level
    permutation, finite field multiplication, and key
    XOR
  • Lookup and multiplication can be combined into
    four separate 8x32 lookup tables, so each round
    is 16 lookups and 16 XORs
  • Decryption is essentially the same, but with
    different tables and a different key schedule
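
As a rough sketch of how the S-box lookup and the finite field multiplication fold into a single 8x32 table, the code below builds the four encryption tables from first principles. The names are hypothetical, and the byte order within each 32-bit entry follows the big-endian convention common in software reference code; that is an assumption, and a hardware implementation may pack the same four bytes differently.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def sbox(a):
    """AES S-box: multiplicative inverse followed by the affine transform."""
    inv = 0
    if a:
        inv = 1
        for _ in range(254):  # a^254 equals a^-1 in GF(2^8)
            inv = gf_mul(inv, a)
    out = inv
    for shift in range(1, 5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63

def rotr32(x, n):
    """Rotate a 32-bit word right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def make_t0():
    """T0[a] packs the MixColumns multiples (2s, s, s, 3s) of s = sbox(a)."""
    table = []
    for a in range(256):
        s = sbox(a)
        table.append((gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3))
    return table

# The other three tables are byte rotations of the first.
T0 = make_t0()
T1 = [rotr32(t, 8) for t in T0]
T2 = [rotr32(t, 16) for t in T0]
T3 = [rotr32(t, 24) for t in T0]
```

Each table has 256 entries of 32 bits, which is where the 8x32 figure comes from: an 8-bit index selects a 32-bit word.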

11
TIE Implementation
  • Our implementation does all 16 lookups in
    parallel, requiring 16 SRAMs
  • x0, x1, x2, x3 represent the round state (each
    32 bits), k0, k1, k2, k3 are the current round
    key, and Tij are the T-boxes, where i is a
    duplication index and j is the T-box index
  • Each round is then
  • x0 = T00[x0] ^ T01[x1>>8] ^ T02[x2>>16] ^ T03[x3>>24]
    ^ k0
  • x1 = T10[x1] ^ T11[x2>>8] ^ T12[x3>>16] ^ T13[x0>>24]
    ^ k1
  • x2 = T20[x2] ^ T21[x3>>8] ^ T22[x0>>16] ^ T23[x1>>24]
    ^ k2
  • x3 = T30[x3] ^ T31[x0>>8] ^ T32[x1>>16] ^ T33[x2>>24]
    ^ k3
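
A software model of the round wiring may make the indexing clearer. This is only a sketch under stated assumptions: each table is indexed by the low 8 bits of its shifted argument, state word i is XORed with key word ki, and all four words are computed from the old state before any is updated. The function name and the placeholder tables in the usage example are hypothetical; the placeholders are not real T-boxes and only expose which state byte feeds which output position.

```python
MASK8 = 0xFF

def aes_round(x, k, T):
    """One T-box round.

    x: four 32-bit state words (x0..x3)
    k: four 32-bit round-key words (k0..k3)
    T: T[i][j] is copy i of T-box j, each a 256-entry list
    Returns the four next-state words, computed from the old state only.
    """
    nxt = []
    for i in range(4):
        word = k[i]
        for j in range(4):
            # Output word i takes byte j of state word (i + j) mod 4.
            byte = (x[(i + j) % 4] >> (8 * j)) & MASK8
            word ^= T[i][j][byte]
        nxt.append(word)
    return tuple(nxt)

# Placeholder tables: T-box j simply moves its index byte into byte lane j.
lane = [[b << (8 * j) for b in range(256)] for j in range(4)]
T = [lane] * 4  # four identical copies, one per output word

state = (0x03020100, 0x07060504, 0x0B0A0908, 0x0F0E0D0C)
zero_key = (0, 0, 0, 0)
print([hex(w) for w in aes_round(state, zero_key, T)])
```

In software the 16 lookups run one after another; with 16 separate SRAMs they can all be issued in the same cycle, which is what lets a whole round fit in one instruction.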

12
Other Ciphers Implemented
  • DES (Data Encryption Standard)
  • 64-bit block, 56-bit key, 16 rounds, Feistel
    network
  • Eight 6x4 S-boxes, XORs, and bit-level permutations
  • Can't really be done efficiently in software
  • TIE Implementation required 1 Instruction per
    round
  • IDEA (International Data Encryption Algorithm)
  • 64-bit block, 128-bit key, 8 rounds, iterated,
    operates on 16-bit numbers
  • 4 multiplications mod 2^16 + 1, 4 additions mod
    2^16, 6 XORs
  • Each round is highly sequential, so difficult to
    parallelize
  • TIE Implementation required 7 Instructions per
    round
  • RC6
  • Same block and key modes as AES, 20 rounds,
    iterated
  • Multiplication mod 2^32, XORs, rotations, addition
    mod 2^32
  • TIE Implementation required 2 Instructions per
    round
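
The modular arithmetic named above for IDEA and RC6 can be sketched in a few lines. These are the standard definitions of the two primitives as commonly published, not the TIE instructions themselves, and the function names are hypothetical.

```python
def idea_mul(a, b):
    """IDEA multiplication modulo 2^16 + 1.

    Operands are 16-bit values in which 0 stands for 2^16, so every
    operand is a nonzero element modulo the prime 65537.
    """
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % 0x10001) & 0xFFFF  # a result of 2^16 maps back to 0

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def rc6_quadratic(b):
    """RC6 quadratic step: b * (2b + 1) mod 2^32, rotated left by 5 bits."""
    return rotl32((b * (2 * b + 1)) & 0xFFFFFFFF, 5)

if __name__ == "__main__":
    print(idea_mul(2, 3), idea_mul(0, 0), rc6_quadratic(1))
```

In IDEA the result of one multiplication feeds the next operation in the round, which is why the round resists parallelization even when the multiplier itself is fast.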

13
AES Performance in Xtensa
  • Performance of TIE extensions approaches
    performance of non-pipelined ASICs
  • Total of 31 run-time instructions per data block:
  • 1 initial XOR instruction
  • 1 instruction per round (10 total)
  • 20 cycles for load and store of the 128-bit data
    block
  • Generally an order of magnitude better than pure
    software
  • Also faster than reconfigurable hardware or a
    specialized VLIW processor
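
The 31-instruction budget above can be checked with simple arithmetic. The sketch below assumes a 128-bit key (10 rounds) and one cycle per instruction; both are assumptions drawn from the counts on this slide, and the variable names are illustrative.

```python
# Per-block budget for AES in the TIE implementation, from the slide's counts.
INITIAL_XOR = 1          # initial key XOR
ROUND_INSTRUCTIONS = 10  # one instruction per round, 128-bit key
LOAD_STORE_CYCLES = 20   # moving the 128-bit block in and out
BLOCK_BITS = 128

total = INITIAL_XOR + ROUND_INSTRUCTIONS + LOAD_STORE_CYCLES
bits_per_cycle = BLOCK_BITS / total
load_store_share = LOAD_STORE_CYCLES / total

print(f"total per block: {total}")
print(f"bits per cycle:  {bits_per_cycle:.2f}")
print(f"load/store share: {load_store_share:.1%}")
```

Under these assumptions roughly two thirds of the per-block budget goes to data movement rather than to the cipher rounds.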

14
Mbps of Throughput
        Base   VLIW   TIE    ASIC    Reconfig.
AES     43.7   512    984    18000   594
DES     26.5   240    586    15000   53.3
IDEA    28     200    231    2034    1013
RC6     61     368    508    15200   470

15
Cycles Per Block
        Base   VLIW   TIE    ASIC
AES     838    90     31     10
DES     690    112    26     16
IDEA    653    112    66     9
RC6     600    140    60     9

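
The throughput and cycle figures can be cross-checked against each other. The slides do not state a clock frequency; the sketch below derives an implied one for the TIE column, assuming throughput = block bits x clock / cycles per block, with block sizes of 128 bits for AES and RC6 and 64 bits for DES and IDEA. It also computes the cycle-count speedup of TIE over the base processor. All names are illustrative.

```python
BLOCK_BITS = {"AES": 128, "DES": 64, "IDEA": 64, "RC6": 128}

# Values copied from the "Mbps of Throughput" and "Cycles Per Block" slides.
TIE_MBPS = {"AES": 984, "DES": 586, "IDEA": 231, "RC6": 508}
TIE_CYCLES = {"AES": 31, "DES": 26, "IDEA": 66, "RC6": 60}
BASE_CYCLES = {"AES": 838, "DES": 690, "IDEA": 653, "RC6": 600}

def implied_clock_mhz(mbps, cycles, bits):
    """Clock (MHz) implied by throughput = bits * clock / cycles."""
    return mbps * cycles / bits

implied = {c: implied_clock_mhz(TIE_MBPS[c], TIE_CYCLES[c], BLOCK_BITS[c])
           for c in BLOCK_BITS}
speedup = {c: BASE_CYCLES[c] / TIE_CYCLES[c] for c in BLOCK_BITS}

for cipher in BLOCK_BITS:
    print(f"{cipher}: implied clock {implied[cipher]:.1f} MHz, "
          f"cycle speedup {speedup[cipher]:.1f}x")
```

If the four implied clocks agree, the two tables are mutually consistent for the TIE column; the speedups show which ciphers gain most from the extensions.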
16
Design Tradeoffs
  • Flexibility
  • Algorithm changes
  • New algorithms
  • New encryption modes
  • Implementation bugs
  • Time to Market
  • Closer to software development time
  • Can choose which parts to accelerate

17
Power vs. Performance (Mbps/mW)
        Base   VLIW   TIE    ASIC    Reconfig.
AES     0.36   1.15   5.63   30      0.66
DES     0.22   0.54   4.19   59.13   0.08
IDEA    0.23   0.62   2.13   15.8    2.89
RC6     0.51   1.37   4.69   14.12   1.35

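
Dividing throughput by the Mbps/mW figure gives the power draw each entry implies. The slides do not list power directly, so the values this sketch prints are derived quantities under the assumption that both tables describe the same operating point; the names are illustrative and only the AES row is shown.

```python
# AES row from the throughput (Mbps) and efficiency (Mbps/mW) slides.
AES_MBPS = {"Base": 43.7, "VLIW": 512, "TIE": 984, "ASIC": 18000, "Reconfig.": 594}
AES_MBPS_PER_MW = {"Base": 0.36, "VLIW": 1.15, "TIE": 5.63, "ASIC": 30, "Reconfig.": 0.66}

def implied_power_mw(mbps, mbps_per_mw):
    """Power (mW) implied by throughput divided by efficiency."""
    return mbps / mbps_per_mw

aes_power = {arch: implied_power_mw(AES_MBPS[arch], AES_MBPS_PER_MW[arch])
             for arch in AES_MBPS}

for arch, mw in aes_power.items():
    print(f"{arch}: about {mw:.0f} mW")
```

Reading the efficiency table alongside the implied power helps separate architectures that are efficient because they are fast from those that are efficient because they draw little power.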
18
Conclusion
  • Xtensa instructions provide flexibility,
    performance, and Mbps/mW all somewhere between an
    ASIC and a VLIW or Software-based solution
  • Suitable for most Embedded Applications like
    802.11i, etc.
  • Using Xtensa for cryptography is a good choice
    if:
  • You don't need maximum throughput
  • You don't need maximum flexibility
  • You need a control processor anyway
  • The algorithms needed are known ahead of time