Experience with Early High-Performance Reconfigurable Computing Systems


1
Experience with Early High-Performance
Reconfigurable Computing Systems

Tarek El-Ghazawi
The George Washington University
tarek_at_gwu.edu
http://www.seas.gwu.edu/tarek

2
Outline

Acknowledgements
Definition
What Have We Been Doing?
Our Systems (big toys!)
Our Applications
A Classification for HPRC Architectures
Single Node Experience
Parallel Applications and System Level Experience
Conclusions

3
Acknowledgements
This work is supported in part by the Arctic Region Supercomputing Center and NASA GSFC
Students: M. Taher, M. Aboallil, E. El-Araby
Collaborators: Kris Gaj, Jacqueline Le Moigne, Duncan Buell, SRC, SGI, Cray
4
High-Performance Reconfigurable Computing (HPRC)
Defined

Efficient general-purpose computing for the
masses using parallel (and distributed) systems
of both reconfigurable hardware resources and
conventional microprocessors

5
Why HPRC? Answer: SYNERGISM between µPs and RPs
6
Our HPRC Toys
7
Applications

Remote Sensing/Imaging Applications
Hyperspectral Dimension Reduction
Wavelet decomposition
Image Registration
Cloud Detection
Bioinformatics
Smith-Waterman

Cryptography
DES, 3DES
IDEA
RC5

8
An Architectural Classification for
High-Performance Reconfigurable Computers
9
1. Homogeneous-Nodes Heterogeneous System
a. Non-Scalable (or Attached Processor) Architecture
[Diagram: a µP subsystem (µP 1 ... µP N) attached to an RP subsystem (RP 1 ... RP N)]
Examples: SRC 6E and SBS HC
10
Homogeneous-Nodes Heterogeneous System
b. Scalable System
IN and/or GSM
Examples: SRC 6, SGI RASC
11
Heterogeneous-Nodes Homogeneous System
[Diagram: nodes that each pair a µP with an RP, connected through IN and/or GSM]
Example: Cray XD1
3 Node Architecture Options
12
A Classification for High-Performance
Reconfigurable Computers (HPRCs)
HPRCs
  Homogeneous Nodes, Heterogeneous System
    Scalable Systems (e.g. SRC 6 and SGI RASC)
    Attached Processors (e.g. SRC 6E and SBS HCs)
  Heterogeneous Nodes, Homogeneous System
    Separate Processors (e.g. Cray XD1)
    Integrated Processors
      µP inside µP
      µP inside RP
13
Single Node Experience
14
Hyperspectral Dimension Reduction
S. Kaewpijit, J. Le Moigne, and T. El-Ghazawi, "Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis," IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 4, April 2003, pp. 863-871.
15
Wavelet-Based Dimension Reduction (Execution
Profiles on SRC)
Pentium 4, 1.8 GHz: Total Execution Time 20.21 sec
SRC-6E, P3: Total Execution Time 1.67 sec; Speedup 12.08x (without streaming), 13.21x (with streaming)
SRC-6: Total Execution Time 0.84 sec; Speedup 24.06x (without streaming), 32.04x (with streaming)
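
The speedup figures on this slide can be checked against the reported execution times. The sketch below assumes speedup is simply the Pentium 4 time divided by the reconfigurable-system time, and that the reported times are the without-streaming runs; the function and constant names are illustrative only. Under those assumptions the SRC-6 ratio reproduces the reported 24.06x, while the SRC-6E ratio comes out near 12.10x rather than 12.08x, presumably because the displayed times are rounded.

```python
def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Ratio of baseline run time to accelerated run time."""
    return baseline_s / accelerated_s


# Execution times in seconds, as reported on the slide.
PENTIUM4_S = 20.21  # Pentium 4, 1.8 GHz
SRC6E_S = 1.67      # SRC-6E, P3
SRC6_S = 0.84       # SRC-6

src6e_speedup = speedup(PENTIUM4_S, SRC6E_S)
src6_speedup = speedup(PENTIUM4_S, SRC6_S)

print(f"SRC-6E: {src6e_speedup:.2f}x")
print(f"SRC-6:  {src6_speedup:.2f}x")
```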
16
Multi-Resolution DWT Decomposition (Mallat
Algorithm)
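
As a rough software illustration of multi-resolution decomposition, the sketch below applies Mallat's pyramid scheme in one dimension: each level splits the current approximation band into a half-length approximation and a half-length detail band, then recurses on the approximation. The Haar filter pair is an assumption made for brevity; the slides do not say which wavelet filter the FPGA engine uses, and the function names are hypothetical.

```python
import math


def haar_step(signal):
    """One analysis step: return (approximation, detail), each half the input length."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def mallat_decompose(signal, levels):
    """Repeatedly split the approximation band, as in Mallat's pyramid scheme.

    Returns (final_approximation, [detail_level_1, detail_level_2, ...]).
    Assumes len(signal) is divisible by 2**levels.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


approx, details = mallat_decompose(
    [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0], levels=2
)
print(approx)
print(details)
```

Because the Haar pair is orthonormal, the sum of squared coefficients across all bands equals the sum of squared input samples, which is a convenient sanity check for any implementation of the scheme.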
17
DWT Decomposition (One Engine ? One FPGA)
18
System Level Experience
19
DWT on XD1 (MPI Overhead and Computation Speedup)
20
Extrapolated Performance on XD1 Using MPI
21
DNA Sequencing with Smith-Waterman

Amino acids
The building blocks (monomers) of proteins. 20 different amino acids are used to synthesize proteins. The shape and other properties of each protein are dictated by the precise sequence of amino acids in it.
Deoxyribonucleic acid (DNA) is written using a code of only 4 letters (bases):
Two purines, called adenine (A) and guanine (G)
Two pyrimidines, called thymine (T) and cytosine (C)
DNA sequencing
The determination of the precise sequence of nucleotides in a sample of DNA
Why determine origin,

22
DNA Matching Basics

Example
Find the best pairwise alignment of GAATC and
CATAC

We need a way to measure the quality of a
candidate alignment
Alignment scores are derived from
substitution matrix
gap penalty

G   A   A   T   -   C
C   A   -   T   A   C
-5 + 10 + ? + 10 + ? + 10 = ?
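
To make the scoring of a candidate alignment concrete, the sketch below scores the GAAT-C / CA-TAC alignment column by column. The match score (10) and mismatch score (-5) are taken from the slide's example; the gap penalty of -4 is a hypothetical value chosen only for illustration, since the slide leaves it open. A flat match/mismatch rule stands in here for a full substitution matrix.

```python
MATCH = 10     # from the slide's example
MISMATCH = -5  # from the slide's example
GAP = -4       # hypothetical value; the slide leaves the gap penalty open


def alignment_score(top: str, bottom: str, gap: int = GAP) -> int:
    """Sum per-column scores of two already-aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # column contains a gap
        elif a == b:
            total += MATCH      # identical residues
        else:
            total += MISMATCH   # substitution
    return total


score = alignment_score("GAAT-C", "CA-TAC")
print(score)
```

With the assumed gap penalty the total is -5 + 10 - 4 + 10 - 4 + 10 = 17; a different penalty changes only the two gap columns.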
23
Computing the Scoring Matrix
[Figure: scoring matrix. Axis labels: - G A T A C T T T and - G G T T A T G C; origin cell 0]
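
A minimal software sketch of how a Smith-Waterman scoring matrix is filled is given below. It uses the standard local-alignment recurrence with a linear gap penalty: each cell takes the maximum of zero, the diagonal neighbor plus the substitution score, and the upper or left neighbor plus the gap penalty. The scoring parameters are assumptions (the gap value in particular is hypothetical), and this is not a description of the FPGA design.

```python
def smith_waterman(query: str, database: str,
                   match: int = 10, mismatch: int = -5, gap: int = -4):
    """Fill the local-alignment scoring matrix; return (matrix, best score)."""
    rows, cols = len(query) + 1, len(database) + 1
    # Row 0 and column 0 stay at zero: a local alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == database[j - 1] else mismatch
            H[i][j] = max(
                0,                    # restart the alignment here
                H[i - 1][j - 1] + sub,  # align the two residues
                H[i - 1][j] + gap,      # gap in the database sequence
                H[i][j - 1] + gap,      # gap in the query sequence
            )
            best = max(best, H[i][j])
    return H, best


matrix, best_score = smith_waterman("GAATC", "CATAC")
for row in matrix:
    print(row)
print("best local score:", best_score)
```

Each cell depends only on its upper, left, and upper-left neighbors, so all cells on one anti-diagonal are independent of each other, which is what makes the recurrence attractive for parallel hardware.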
24
Hardware Implementation (32x1 Sliding Window)
Multiple Databases and Multiple Queries
Node 1: 32-residue window size, unlimited database size
Node n: 32-residue window size, unlimited database size
25
Implementation for Hardware (cont'd) (MPI
Implementation)
[Diagram: MPI flow across Node 0, Node 1, ..., Node N-1]
Scatter Queries (query sequences)
Broadcast DBs (database sequences)
Processing (score array)
Gather Scores
Done
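
The data flow on this slide (scatter the queries, broadcast the databases, process locally, gather the scores) can be sketched without an MPI library. The code below is only an in-process simulation: a loop plays the role of the N nodes, and plain list operations stand in for the collective operations that an MPI program would perform with MPI_Scatter, MPI_Bcast, and MPI_Gather. All function names, the round-robin distribution, and the scoring parameters are assumptions for illustration.

```python
def local_score(query, database, match=10, mismatch=-5, gap=-4):
    """Best Smith-Waterman local score, keeping only one matrix row in memory."""
    prev = [0] * (len(database) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, d in enumerate(database, start=1):
            sub = match if q == d else mismatch
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def scatter(items, n_nodes):
    """Deal items to nodes round-robin, standing in for an MPI scatter."""
    return [items[rank::n_nodes] for rank in range(n_nodes)]


def run_spmd(queries, databases, n_nodes):
    per_node_queries = scatter(queries, n_nodes)  # scatter queries
    per_node_dbs = [databases] * n_nodes          # broadcast databases
    gathered = []                                 # gather scores
    for rank in range(n_nodes):                   # each iteration plays one node
        scores = {
            (q, db): local_score(q, db)
            for q in per_node_queries[rank]
            for db in per_node_dbs[rank]
        }
        gathered.append(scores)
    return gathered


results = run_spmd(["ACGT", "TTTT", "GAATC"], ["ACGTACGT", "CATAC"], n_nodes=2)
for rank, scores in enumerate(results):
    print(rank, scores)
```

Every node runs the same code on its own share of the queries against the full set of databases, which is the SPMD pattern the slide describes.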
26
MPI and SPMD on the Cray XD1

6 nodes, each with 2 µPs and 1 FPGA
Only 1 µP can communicate with the FPGA in each node at a time, keeping the other µP under-utilized

MPI Running
27
MPI and SPMD Parallel Programming on the SRC

2 µPs on each board
1 SNAP per µP board
Only 1 µP can access the SNAP at a time
One processor per processor board can be used
Two MAPs can be used out of 4
Code modified to use two FPGAs

[Diagram: SRC-6 running MPI. Two µP boards (each with 2 µPs, memory, a SNAP, and PCI-X), the SRC Hi-Bar switch, common memory, and two MAPs with 2 FPGAs each]
28
Performance Results
29
Smith-Waterman Scalability on SRC-6 (window of
32x1 residues)
30
Time Distribution of Smith-Waterman on
SRC-6 (window of 32x1 residues)
31
Smith-Waterman Scalability on XD1 (window of
32x1 residues)
32
Smith-Waterman Scalability on XD1 (window of 32x1
residues)
33
Time Distribution of Smith-Waterman on
XD1 (window of 32x1 residues)
34
Additional Experience: Computationally
Intensive Applications
P3 version of SRC-6E
35
Conclusions

Like any real-life story, there are ups and downs, but the outlook is bright if we do the right things
Tremendous potential in cost, power, and size, as seen with compute-intensive integer applications
Rest of applications coming up
Early systems suffered from low processor-to-FPGA transfer bandwidth; some improvements have been seen, but more are still needed
Fixed clock restrictions may simplify things, but may result in great disadvantages

36
Conclusions

More work is needed on how to manage heterogeneity in HPRCs such that simple SPMD programming models are enabled while hardware is well utilized
A simple 1:1 heterogeneous-node homogeneous architecture may be the answer for general-purpose HPRCs; other architectures may suit special-purpose applications
Creative interconnection networks may help here
Attention is needed to latency and bandwidth between FPGA and microprocessor in relation to the performance of the whole interconnection fabric