Title: Performance and Overhead in a Hybrid Reconfigurable Computer
1. Performance and Overhead in a Hybrid Reconfigurable Computer
- O. D. Fidanci (1), D. Poznanovic (2), K. Gaj (3), T. El-Ghazawi (1), N. Alexandridis (1)
- (1) George Washington University
- (2) SRC Computers Inc.
- (3) George Mason University
http://cpe02.gmu.edu/rcm/
2. Features of General-Purpose Reconfigurable Computers
- composed of traditional microprocessors and Field Programmable Gate Arrays (FPGAs) closely integrated with each other
- programming does not require knowledge of hardware design
- permit run-time reconfiguration of FPGAs
3. Hardware Architecture and Programming Model of SRC-6E
4. SRC Hardware Architecture
5. SRC Hardware Architecture cont.
6. SRC Programming Model
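7. Compilation Process of SRC-6E
[Figure: compilation flow, summarized from the diagram labels]
- Application sources (.c or .f files) are processed by the µP Compiler (Intel) and the MAP Compiler
- The compilers produce object files (.o files); the MAP Compiler also produces .v files
- Macro sources (HDL sources, .vhd or .v files) and the .v files go through logic synthesis (Synplicity), which produces netlists (.ngo files)
- Place & Route (Xilinx) produces configuration bitstreams (.bin files)
- The Linker produces the application executable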
8. Application Case Study 1
High-throughput Triple DES encryption
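9. High-throughput encryption
[Figure: a stream of message blocks ..., Mi+2, Mi+1, Mi enters the 3 DES unit, keyed with K0, and a stream of ciphertext blocks ..., Ci+2, Ci+1, Ci leaves it]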
10. Fully pipelined architecture of Triple DES
[Figure: three DES macros connected in series, with pipeline stages numbered 1 to 51]
- 51 pipeline stages
- A new input is accepted and a new output is produced every clock cycle
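The two properties above (51 stages, one block per clock cycle) determine how many cycles a batch of blocks needs. The following is a minimal sketch of that arithmetic. The function names and the 100 MHz clock are assumptions made for illustration and do not come from the slides; the 64-bit block size is the standard DES block size.

```python
def pipeline_cycles(num_blocks: int, stages: int = 51) -> int:
    """Clock cycles needed to push num_blocks through a fully pipelined datapath."""
    if num_blocks <= 0:
        return 0
    # The first block needs `stages` cycles to emerge; every further block
    # completes exactly one cycle after its predecessor.
    return stages + (num_blocks - 1)


def steady_state_throughput_bps(clock_hz: float, block_bits: int = 64) -> float:
    """Throughput in bits per second once the pipeline is full (one block per cycle)."""
    return clock_hz * block_bits


ASSUMED_CLOCK_HZ = 100e6  # hypothetical clock frequency, for illustration only

if __name__ == "__main__":
    for n in (1, 1024, 500_000):
        print(n, "blocks ->", pipeline_cycles(n), "cycles")
    print(steady_state_throughput_bps(ASSUMED_CLOCK_HZ) / 1e9, "Gbit/s")
```

For large inputs the 50-cycle fill latency becomes negligible, so the cycle count is close to the number of blocks.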
11. Overhead of the data transfer
[Figure: block diagram with the labels µP Board, Xeon µP, Private Memory, and MAP Board]
12. Timing Measurements
A three-level timing measurement scheme was employed:
- End-to-end execution time (wall clock time, HLL level): includes the configuration, data transfer, and data processing times
- Time without configuration (wall clock time, HLL level): excludes the configuration time but includes the data transfer and data processing times
- MAP time (clock counter, hardware level): includes only the data processing time
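Because each level adds one more component to the one beneath it, the individual overheads follow by subtraction. The sketch below shows that derivation. The class and function names are hypothetical, and the numbers in the usage example are made up rather than measured.

```python
from dataclasses import dataclass


@dataclass
class TimingBreakdown:
    configuration_ms: float
    data_transfer_ms: float
    computation_ms: float


def breakdown(end_to_end_ms: float, without_config_ms: float, map_ms: float) -> TimingBreakdown:
    """Split three nested measurements into their individual components."""
    if not (end_to_end_ms >= without_config_ms >= map_ms >= 0):
        raise ValueError("expected end_to_end >= without_config >= map >= 0")
    return TimingBreakdown(
        # end-to-end = configuration + transfer + processing
        configuration_ms=end_to_end_ms - without_config_ms,
        # without configuration = transfer + processing
        data_transfer_ms=without_config_ms - map_ms,
        # MAP time = processing only
        computation_ms=map_ms,
    )


if __name__ == "__main__":
    # Illustrative values only, not results from the study.
    print(breakdown(end_to_end_ms=100.0, without_config_ms=40.0, map_ms=10.0))
```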
13. Triple DES Encryption
[Figure: stacked bar chart of execution time in ms (axis 0 to 160) versus number of encrypted blocks (1024; 10,000; 25,000; 50,000; 100,000; 250,000; 500,000), split into configuration, data transfer, and computation]
14. Problems
- execution time dominated by
- - configuration of the MAP FPGA, and
- - data transfer between the System Common Memory and the On-Board Memory
- configuration time hiding techniques
- - preloading the configuration before execution
- - flip-flopping FPGAs during reconfiguration
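To illustrate why flip-flopping can hide configuration time, the sketch below compares a single FPGA, which pays for every configuration in sequence, with two FPGAs used alternately, where the next configuration is loaded while the current one executes. This is a simplified scheduling model, not the SRC-6E runtime. It assumes configurations are loaded one at a time, and the function names and task durations are hypothetical.

```python
from typing import List, Tuple

Task = Tuple[float, float]  # (configuration time, execution time), same unit


def single_fpga_time(tasks: List[Task]) -> float:
    """One FPGA: each task is configured, then executed, strictly in sequence."""
    return sum(cfg + run for cfg, run in tasks)


def flip_flop_time(tasks: List[Task]) -> float:
    """Two FPGAs used alternately: task i runs on FPGA (i mod 2)."""
    cfg_done: List[float] = []
    run_done: List[float] = []
    for i, (cfg, run) in enumerate(tasks):
        # Assumption: only one configuration can be loaded at a time.
        loader_free = cfg_done[i - 1] if i >= 1 else 0.0
        # The target FPGA is free once the task two steps back has finished.
        fpga_free = run_done[i - 2] if i >= 2 else 0.0
        cfg_done.append(max(loader_free, fpga_free) + cfg)
        # Tasks execute in order, each after its own configuration is loaded.
        prev_run = run_done[i - 1] if i >= 1 else 0.0
        run_done.append(max(cfg_done[i], prev_run) + run)
    return run_done[-1] if run_done else 0.0


if __name__ == "__main__":
    tasks = [(100.0, 200.0)] * 3  # made-up durations
    print("single FPGA:", single_fpga_time(tasks))
    print("flip-flop:  ", flip_flop_time(tasks))
```

In this model only the first configuration stays on the critical path, as long as each execution lasts at least as long as the next configuration.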
15. Data transfer hiding techniques
- Data transfer can be hidden by overlapping DMA time with the data processing time
[Figure: timeline in which the input DMA, encryption, and output DMA of successive data chunks overlap]
Possible speed-up up to 33
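The effect of overlapping can be estimated with a simple three-stage pipeline model, sketched below. It assumes that input DMA, processing, and output DMA can all run concurrently on different chunks, and that enough buffering is available. If the two DMA directions share a channel, the gain is smaller. The function names and timings are hypothetical.

```python
def sequential_time(n_chunks: int, t_in: float, t_compute: float, t_out: float) -> float:
    """No overlap: each chunk is transferred in, processed, and transferred out in turn."""
    return n_chunks * (t_in + t_compute + t_out)


def overlapped_time(n_chunks: int, t_in: float, t_compute: float, t_out: float) -> float:
    """Full overlap: the three steps form a pipeline over successive chunks."""
    if n_chunks <= 0:
        return 0.0
    # The first chunk passes through all three steps; after that, one chunk
    # completes per period of the slowest step.
    return (t_in + t_compute + t_out) + (n_chunks - 1) * max(t_in, t_compute, t_out)


if __name__ == "__main__":
    n, t_in, t_compute, t_out = 100, 1.0, 1.0, 1.0  # made-up per-chunk times
    seq = sequential_time(n, t_in, t_compute, t_out)
    ovl = overlapped_time(n, t_in, t_compute, t_out)
    print(f"sequential: {seq}, overlapped: {ovl}, ratio: {seq / ovl:.2f}")
```

The model shows that the achievable gain depends on how balanced the three steps are: the slowest step sets the steady-state rate.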
16. Reference software implementations
Platform
- Pentium 4, 1.8 GHz, 512 kB cache, 1 GB RAM
Software
- Optimized for encryption (but not for cipher breaking): Phil Karn's DES code; C and assembly language with look-up table precomputations; GNU gcc v. 2.96, -O4 optimization
- Non-optimized: public domain code; C only; Intel C, -O3 optimization
17. Total execution time of Triple DES for Pentium 4 using optimized and non-optimized code
Optimized P4 code / Non-optimized P4 code ≈ 4
18. Throughput results for SRC-6E and Pentium 4
19. SRC-6E vs. Pentium 4 speed-up
20. Application Case Study 2
DES cipher breaking
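21. Secret-key breaking
[Figure: DES applied to the block M0 under the candidate keys K1, K2, K3, ..., KN, which are generated by the DES breaker, with C0 as the ciphertext to match]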
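The key search shown on this slide amounts to a loop over candidate keys, sketched below. The Python standard library has no DES implementation, so the sketch uses a toy 16-bit keyed function as a stand-in. `toy_encrypt` is not DES and not secure, and all names here are hypothetical.

```python
from typing import Callable, Iterable, Optional


def toy_encrypt(key: int, block: int) -> int:
    """Stand-in for DES (NOT a real cipher): a keyed 16-bit mixing function."""
    x = (block ^ key) & 0xFFFF
    x = (x * 0x9E37 + key) & 0xFFFF
    return x ^ (x >> 7)


def break_key(
    encrypt: Callable[[int, int], int],
    m0: int,
    c0: int,
    candidate_keys: Iterable[int],
) -> Optional[int]:
    """Return the first candidate key that maps the known block m0 to c0."""
    for key in candidate_keys:
        if encrypt(key, m0) == c0:
            return key
    return None


if __name__ == "__main__":
    secret_key = 0x1234
    m0 = 0xBEEF
    c0 = toy_encrypt(secret_key, m0)
    found = break_key(toy_encrypt, m0, c0, range(1 << 16))
    print("candidate key found:", found)
```

A key that matches one message/ciphertext pair may be a false positive, so a practical search confirms each hit against a second pair. Real DES has a 56-bit key, so the candidate space is 2^56 rather than the 2^16 used here.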
22. Keys generated in the User FPGA
[Figure: block diagram with the labels µP Board, Xeon µP, Private Memory, and MAP Board]
23. DES breaking machine
[Figure: stacked bar chart of execution time in ms (axis 0 to 1,200) versus number of tested keys (128,000; 1,000,000; 100,000,000), split into configuration, data transfer, and computation]
24. SRC-6E vs. Pentium 4 speed-up
25. Conclusions
- Two different classes of applications were developed and tested on the SRC-6E and a Pentium 4 PC:
- - Triple DES encryption: real-time data streaming
- - DES breaking: minimal input/output
26. Conclusions cont.
Wall-clock speed-ups
- 3 DES Encryption: 3.4 vs. P4 C code; 12.5 vs. P4 assembly code
- DES Breaking: vs. P4 C code (larger for real-time input sizes)
Speed-ups without reconfiguration
- 3 DES Encryption: 11 vs. P4 C code; 41 vs. P4 assembly code
- DES Breaking: 1583 vs. P4 C code
27. Informal speed/cost comparison
Cost of the SRC machine / Cost of PC ≈ 100
Speed of the SRC machine / Speed of PC ≈ 1600 (with only one out of four FPGAs used in computations)
16x improved speed/cost ratio
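The 16x figure follows from dividing the speed ratio by the cost ratio. The short sketch below reproduces that arithmetic, treating the slide's values of 100 and 1600 as rough estimates. The function name is hypothetical.

```python
def speed_cost_gain(speed_ratio: float, cost_ratio: float) -> float:
    """Improvement in speed per unit cost of machine A over machine B."""
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    return speed_ratio / cost_ratio


if __name__ == "__main__":
    # Values taken from the slide: speed ratio of about 1600, cost ratio of about 100.
    print(speed_cost_gain(speed_ratio=1600, cost_ratio=100))  # 16.0
```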
28. Conclusions: Overheads
Reconfiguration time
- Most affected applications: short execution time, large resource requirements, frequent reconfiguration
- Minimization techniques
- - preloading configuration
- - flip-flopping among multiple FPGAs
Data transfer time
- Most affected applications: high-speed real-time input/output
- Minimization techniques
- - overlapping data transfer with computations