Title: Experience with Early High-Performance Reconfigurable Computing Systems
1Experience with Early High-Performance
Reconfigurable Computing Systems
- Tarek El-Ghazawi
- The George Washington University
- tarek_at_gwu.edu
- http://www.seas.gwu.edu/tarek
2Outline
- Acknowledgements
- Definition
- What Have we Been Doing?
- Our Systems (big toys!)
- Our applications
- A Classification for HPRC Architectures
- Single Node Experience
- Parallel Applications and System Level Experience
- Conclusions
3Acknowledgements
This work is supported in part by the Arctic Region Supercomputing Center and NASA GSFC
Students: M. Taher, M. Aboallil, E. El-Araby
Collaborators: Kris Gaj, Jacqueline Le Moigne, Duncan Buell, SRC, SGI, Cray
4High-Performance Reconfigurable Computing (HPRC)
Defined
- Efficient general-purpose computing for the
masses using parallel (and distributed) systems
of both reconfigurable hardware resources and
conventional microprocessors
5Why HPRC? Ans. SYNERGISM between µPs and RPs
6Our HPRC Toys
7Applications
- Remote Sensing/Imaging Applications
  - Hyperspectral Dimension Reduction
  - Wavelet Decomposition
  - Image Registration
  - Cloud Detection
- Bioinformatics
  - Smith-Waterman
- Cryptography
  - DES, 3DES
  - IDEA
  - RC5
8An Architectural Classification for
High-Performance Reconfigurable Computers
91. Homogeneous-Nodes Heterogeneous System
a. Non-Scalable (or Attached Processor) Architecture
[Diagram: a µP subsystem (µP 1 through µP N) and an RP subsystem (RP 1 through RP N)]
Examples: SRC 6E and SBS HC
10Homogeneous-Nodes Heterogeneous System
b. Scalable System
[Diagram: nodes connected through IN and/or GSM]
Examples: SRC 6, SGI RASC
11Heterogeneous-Nodes Homogeneous System
[Diagram: nodes that each contain both µP and RP, connected through IN and/or GSM]
Example: Cray XD1
3 Node Architecture Options
12A Classification for High-Performance
Reconfigurable Computers (HPRCs)
HPRCs
- Homogeneous Nodes Heterogeneous System
  - Scalable Systems (e.g. SRC 6 and SGI RASC)
  - Attached Processors (e.g. SRC 6E and SBS HCs)
- Heterogeneous Nodes Homogeneous System
  - Separate Processors (e.g. Cray XD1)
  - Integrated Processors
    - RP inside µP
    - µP inside RP
13Single Node Experience
14Hyperspectral Dimension Reduction
S. Kaewpijit, J. Le Moigne, T. El-Ghazawi,
Automatic Reduction of Hyperspectral Imagery
Using Wavelet Spectral Analysis, IEEE
Transactions on Geoscience and Remote Sensing,
Vol. 41, No. 4, April, 2003, pp. 863-871.
15Wavelet-Based Dimension Reduction (Execution Profiles on SRC)
Pentium4, 1.8GHz: Total Execution Time 20.21 sec
SRC-6E, P3: Total Execution Time 1.67 sec, Speedup 12.08x (without streaming), 13.21x (with streaming)
SRC-6: Total Execution Time 0.84 sec, Speedup 24.06x (without streaming), 32.04x (with streaming)
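The speedups on this slide follow the usual ratio of baseline time to accelerated time. A minimal sketch using only the times quoted above; the function and constant names are illustrative, not from the slides:

```python
def speedup(baseline_sec: float, accelerated_sec: float) -> float:
    """Speedup = baseline execution time / accelerated execution time."""
    return baseline_sec / accelerated_sec

PENTIUM4_SEC = 20.21   # Pentium4, 1.8GHz (from the slide)
SRC6E_SEC = 1.67       # SRC-6E, P3 (from the slide)
SRC6_SEC = 0.84        # SRC-6 (from the slide)

src6e = speedup(PENTIUM4_SEC, SRC6E_SEC)
src6 = speedup(PENTIUM4_SEC, SRC6_SEC)
print(f"SRC-6E: {src6e:.2f}x, SRC-6: {src6:.2f}x")
```

The ratios land on or near the without-streaming figures (12.08x and 24.06x), which suggests the quoted totals are the without-streaming times. The small gap for the SRC-6E is presumably rounding in the reported times.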
16Multi-Resolution DWT Decomposition (Mallat
Algorithm)
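As a rough illustration of the Mallat scheme named in this slide: each level splits the signal into a low-pass (approximation) band and a high-pass (detail) band, downsamples both by two, and recurses on the approximation. The sketch below assumes the Haar filter pair and a 1-D signal whose length is divisible by 2**levels. The slides do not state which wavelet filter was used, so the filter choice and all function names are assumptions.

```python
import math

def haar_step(signal):
    """One Mallat level with Haar filters: filter, then downsample by 2."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def mallat_decompose(signal, levels):
    """Multi-resolution decomposition: recurse on the approximation band.

    Returns (final_approximation, [detail_level_1, detail_level_2, ...]).
    """
    approx, details = list(signal), []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

approx, details = mallat_decompose(
    [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0], levels=2
)
```

For images, the same step is applied along rows and then along columns at each level.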
17DWT Decomposition(One Engine ? One FPGA)
18System Level Experience
19DWT on XD1 (MPI Overhead and Computation Speedup)
20Extrapolated Performance on XD1 Using MPI
21DNA Sequencing with Smith-Waterman
- Amino acids
  - The building blocks (monomers) of proteins. 20 different amino acids are used to synthesize proteins. The shape and other properties of each protein are dictated by the precise sequence of amino acids in it.
- Deoxyribonucleic acid (DNA) is written using a code of only 4 letters (bases)
  - Two purines, called adenine (A) and guanine (G)
  - Two pyrimidines, called thymine (T) and cytosine (C)
- DNA sequencing
  - The determination of the precise sequence of nucleotides in a sample of DNA
- Why determine origin,
22DNA Matching Basics
- Example
- Find the best pairwise alignment of GAATC and
CATAC
- We need a way to measure the quality of a candidate alignment
- Alignment scores are derived from
  - substitution matrix
  - gap penalty
GAAT-C
CA-TAC
-5 10 ? 10 ? 10 ?
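The column-by-column scoring of the candidate alignment above can be sketched as follows. The match score of 10 and mismatch score of -5 mirror the numbers on the slide and stand in for a full substitution matrix. The slide leaves the gap penalty open, so `gap=-4` is an assumed placeholder, and the function name is illustrative.

```python
def score_alignment(top, bottom, match=10, mismatch=-5, gap=-4):
    """Sum the per-column scores of two equal-length gapped sequences.

    match/mismatch stand in for a substitution matrix (values from the
    slide); gap=-4 is an assumed value, not one given in the slides.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # a residue aligned against a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total

print(score_alignment("GAAT-C", "CA-TAC"))  # 17 with the assumed gap penalty
```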
23Computing the Scoring Matrix
[Figure: scoring matrix with one sequence labeling the rows and the other labeling the columns]
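A minimal software sketch of how a Smith-Waterman scoring matrix is filled, assuming a linear gap penalty and a simple match/mismatch score in place of a full substitution matrix. The parameter values and the function name are assumptions for illustration. This is the textbook recurrence, not necessarily the exact variant implemented in the hardware.

```python
def smith_waterman_score(query, database, match=10, mismatch=-5, gap=-4):
    """Best local-alignment score and the filled matrix (linear gap penalty).

    H[i][j] = max(0,
                  H[i-1][j-1] + s(query[i-1], database[j-1]),
                  H[i-1][j] + gap,
                  H[i][j-1] + gap)
    """
    rows, cols = len(query) + 1, len(database) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == database[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # extend along the diagonal
                          H[i - 1][j] + gap,     # gap in the database sequence
                          H[i][j - 1] + gap)     # gap in the query sequence
            best = max(best, H[i][j])
    return best, H

best, H = smith_waterman_score("GAATC", "CATAC")
```

Each cell depends only on its left, upper, and upper-left neighbors, so all cells on one anti-diagonal can be computed at the same time. That independence is what makes the algorithm a natural fit for FPGA pipelines.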
24Hardware Implementation (32x1 Sliding Window)
Multiple Databases and Multiple Queries
32 Residue Window Size (Node 1)
Unlimited Database Size
32 Residue Window Size (Node n)
Unlimited Database Size
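The slides give only the window size, so the following is a software sketch of one plausible reading of a 32x1 sliding window, not the actual hardware design. It assumes the query is processed in strips of up to 32 residues, the database is streamed past the strip one residue at a time, and the last row of each strip is carried into the next strip. Scores and names are illustrative assumptions.

```python
def sw_score_windowed(query, database, window=32,
                      match=10, mismatch=-5, gap=-4):
    """Smith-Waterman score computed strip by strip (linear gap penalty)."""
    n = len(database)
    top_row = [0] * (n + 1)        # matrix row just above the current strip
    best = 0
    for start in range(0, len(query), window):
        strip = query[start:start + window]   # up to `window` query residues
        bottom_row = [0] * (n + 1)            # becomes top_row for next strip
        prev_col = [0] * len(strip)           # column 0 of the strip is zero
        for j in range(1, n + 1):             # stream one database residue
            col = []
            for k, q in enumerate(strip):     # one cell per window position
                up = top_row[j] if k == 0 else col[k - 1]
                diag = top_row[j - 1] if k == 0 else prev_col[k - 1]
                left = prev_col[k]
                s = match if q == database[j - 1] else mismatch
                cell = max(0, diag + s, up + gap, left + gap)
                col.append(cell)
                best = max(best, cell)
            bottom_row[j] = col[-1]           # save the strip's last row
            prev_col = col
        top_row = bottom_row
    return best
```

Only one column of the strip and one boundary row are kept at any time, so storage is independent of the number of database residues already streamed. That is consistent with the "Unlimited Database Size" label on the slide.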
25Implementation for Hardware (contd.) (MPI Implementation)
[Flow diagram across Node 0, Node 1, ..., Node N-1: Query Sequences and Database Sequences, Scatter Queries, Broadcast DBs, Processing, Score Array, Gather Scores, Done]
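The scatter, broadcast, process, and gather flow in this diagram can be sketched as a serial simulation. This is not real MPI code: the nodes are emulated with plain Python lists, the per-node work runs in a loop rather than in parallel, and every function name here is hypothetical. Queries are assumed distinct because they are used as dictionary keys.

```python
def local_score(query, db, match=10, mismatch=-5, gap=-4):
    """Compact Smith-Waterman score (linear gap), keeping one row at a time."""
    prev, best = [0] * (len(db) + 1), 0
    for q in query:
        cur = [0]
        for j, d in enumerate(db, start=1):
            s = match if q == d else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def scatter(items, n_nodes):
    """Round-robin split of the queries, one share per node."""
    return [items[rank::n_nodes] for rank in range(n_nodes)]

def run_spmd(queries, databases, n_nodes):
    shares = scatter(queries, n_nodes)          # Scatter Queries
    per_node_dbs = [databases] * n_nodes        # Broadcast DBs (every node sees all)
    partial = []
    for rank in range(n_nodes):                 # Processing (serial stand-in)
        partial.append({q: [local_score(q, db) for db in per_node_dbs[rank]]
                        for q in shares[rank]})
    gathered = {}                               # Gather Scores at node 0
    for scores in partial:
        gathered.update(scores)
    return gathered

result = run_spmd(["ACGT", "TTTT", "GG"], ["ACGT", "GGGG"], n_nodes=2)
```

Every node runs the same program on its own share of the queries, which is the SPMD pattern. In a real MPI program the three helper steps would map onto the scatter, broadcast, and gather collectives.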
26MPI and SPMD on the Cray XD1
- 6 nodes, each with 2 µPs and 1 FPGA
- Only 1 µP can communicate with the FPGA in each node at a time, keeping the other µP under-utilized
MPI Running
27MPI and SPMD Parallel Programming on the SRC
- 2 µPs on each board
- 1 SNAP per µP board
- Only 1 µP can access the SNAP at a time
- One processor per processor board can be used
- Two MAPs can be used out of 4
- Code modified to use two FPGAs
[Diagram: SRC-6 with MPI running. Shown: SRC Hi-Bar Switch, Common Memory, two µP boards (each with two µPs, Memory, PCI-X, and a SNAP), and two MAPs (each with two FPGAs)]
28Performance Results
29Smith-Waterman Scalability on SRC-6 (window of 32x1 residues)
30Time Distribution of Smith-Waterman on SRC-6 (window of 32x1 residues)
31Smith-Waterman Scalability on XD1 (window of 32x1 residues)
32Smith-Waterman Scalability on XD1 (window of 32x1 residues)
33Time Distribution of Smith-Waterman on XD1 (window of 32x1 residues)
34Additional Experience: Computationally Intensive Applications
P3 version of SRC-6E
35Conclusions
- Like any real-life story, there are ups and downs, but the outlook is bright if we do the right things
- Tremendous potential in cost, power, and size, as seen in compute-intensive integer applications
- Rest of applications coming up
- Early systems suffered from low processor-to-FPGA transfer bandwidth; some improvements have been seen, and more are still needed
- Fixed clock restrictions may simplify things, but may result in great disadvantages
36Conclusions
- More work is needed on how to manage heterogeneity in HPRCs such that simple SPMD programming models are enabled while hardware is well utilized
- A simple 1:1 heterogeneous-node homogeneous architecture may be the answer for general-purpose HPRCs; other architectures may suit special-purpose applications
- Creative interconnection networks may help here
- Attention is needed to latency and bandwidth between FPGA and microprocessor in relation to the performance of the whole interconnection fabric