Experience with Early High-Performance Reconfigurable Computing Systems

1
Experience with Early High-Performance
Reconfigurable Computing Systems
  • Tarek El-Ghazawi
  • The George Washington University
  • tarek_at_gwu.edu
  • http://www.seas.gwu.edu/tarek

2
Outline
  • Acknowledgements
  • Definition
  • What Have We Been Doing?
  • Our Systems (big toys!)
  • Our Applications
  • A Classification for HPRC Architectures
  • Single Node Experience
  • Parallel Applications and System Level Experience
  • Conclusions

3
Acknowledgements
This work is supported in part by the Arctic
Region Supercomputing Center and NASA GSFC
  • Students: M. Taher, M. Aboallil, E. El-Araby
  • Collaborators: Kris Gaj, Jacqueline Le Moigne,
    Duncan Buell, SRC, SGI, Cray

4
High-Performance Reconfigurable Computing (HPRC)
Defined
  • Efficient general-purpose computing for the
    masses using parallel (and distributed) systems
    of both reconfigurable hardware resources and
    conventional microprocessors

5
Why HPRC? Answer: SYNERGISM between µPs and RPs
6
Our HPRC Toys
7
Applications
  • Remote Sensing/Imaging Applications
  • Hyperspectral Dimension Reduction
  • Wavelet decomposition
  • Image Registration
  • Cloud Detection
  • Bioinformatics
  • Smith-Waterman
  • Cryptography
  • DES, 3DES
  • IDEA
  • RC5

8
An Architectural Classification for
High-Performance Reconfigurable Computers
9
1. Homogeneous-Nodes Heterogeneous System
a. Non-Scalable (or Attached Processor)
Architecture
[Diagram: µP subsystem (µP 1 ... µP N) and RP subsystem (RP 1 ... RP N)]
Examples: SRC 6E and SBS HC
10
Homogeneous-Nodes Heterogeneous System
b. Scalable System
[Diagram: nodes connected through IN and/or GSM]
Examples: SRC 6, SGI RASC
11
Heterogeneous-Nodes Homogeneous System
[Diagram: nodes combining µP and RP, connected through IN and/or GSM]
Example: Cray XD1
3 Node Architecture Options
12
A Classification for High-Performance
Reconfigurable Computers (HPRCs)
  • HPRCs
    • Homogeneous Nodes Heterogeneous System
      • Scalable Systems (e.g. SRC 6 and SGI RASC)
      • Attached Processors (e.g. SRC 6E and SBS HCs)
    • Heterogeneous Nodes Homogeneous System
      • Separate Processors (e.g. Cray XD1)
      • Integrated Processors
        • µP inside µP
        • µP inside RP
13
Single Node Experience
14
Hyperspectral Dimension Reduction
S. Kaewpijit, J. Le Moigne, T. El-Ghazawi,
Automatic Reduction of Hyperspectral Imagery
Using Wavelet Spectral Analysis, IEEE
Transactions on Geoscience and Remote Sensing,
Vol. 41, No. 4, April, 2003, pp. 863-871.
15
Wavelet-Based Dimension Reduction (Execution
Profiles on SRC)
  • Pentium 4, 1.8 GHz: total execution time 20.21 sec
  • SRC-6E, P3: total execution time 1.67 sec;
    speedup 12.08x (without streaming), 13.21x
    (with streaming)
  • SRC-6: total execution time 0.84 sec;
    speedup 24.06x (without streaming), 32.04x
    (with streaming)
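
The speedup figures on this slide appear to be the ratio of the Pentium 4 time to the accelerated time. The short Python sketch below checks that arithmetic. It assumes the 1.67 sec figure belongs to the SRC-6E and that the quoted totals are the without-streaming times. Both are readings of the slide layout, not facts stated on it, and the function names are illustrative.

```python
# Times in seconds, as quoted on the slide.
# Assumption: 1.67 s is the SRC-6E total and 0.84 s is the SRC-6 total,
# both for the without-streaming runs.
PENTIUM4_TIME = 20.21
ACCELERATED_TIMES = {"SRC-6E": 1.67, "SRC-6": 0.84}


def speedup(reference_time, accelerated_time):
    """Speedup = reference execution time / accelerated execution time."""
    return reference_time / accelerated_time


def implied_time(reference_time, reported_speedup):
    """Execution time implied by a reported speedup over the reference."""
    return reference_time / reported_speedup


if __name__ == "__main__":
    for system, seconds in ACCELERATED_TIMES.items():
        print(f"{system}: {speedup(PENTIUM4_TIME, seconds):.2f}x without streaming")
    # Times implied by the with-streaming speedups quoted on the slide.
    print(f"SRC-6E with streaming: {implied_time(PENTIUM4_TIME, 13.21):.2f} s")
    print(f"SRC-6 with streaming:  {implied_time(PENTIUM4_TIME, 32.04):.2f} s")
```

The SRC-6 ratio reproduces the quoted 24.06x. The SRC-6E ratio comes out slightly above the quoted 12.08x, which is consistent with the times on the slide having been rounded.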
16
Multi-Resolution DWT Decomposition (Mallat
Algorithm)
17
DWT Decomposition (One Engine, One FPGA)
18
System Level Experience
19
DWT on XD1 (MPI Overhead and Computation Speedup)
20
Extrapolated Performance on XD1 Using MPI
21
DNA Sequencing with Smith-Waterman
  • Amino acids
    • The building blocks (monomers) of proteins. 20
      different amino acids are used to synthesize
      proteins. The shape and other properties of each
      protein are dictated by the precise sequence of
      amino acids in it.
  • Deoxyribonucleic acid (DNA) is written using a
    code of only 4 letters (bases)
    • Two purines, called adenine (A) and guanine (G)
    • Two pyrimidines, called thymine (T) and cytosine
      (C)
  • DNA sequencing
    • The determination of the precise sequence of
      nucleotides in a sample of DNA
    • Why determine origin,

22
DNA Matching Basics
  • Example
    • Find the best pairwise alignment of GAATC and
      CATAC
  • We need a way to measure the quality of a
    candidate alignment
  • Alignment scores are derived from
    • substitution matrix
    • gap penalty

GAAT-C
CA-TAC
-5 + 10 + ? + 10 + ? + 10 = ?
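
As a worked illustration of scoring a candidate alignment, the sketch below sums per-column scores for the GAAT-C / CA-TAC example. The match score (10) and mismatch score (-5) are taken from the numbers on the slide. The gap penalty of -4 is a placeholder assumption, since the slide leaves those entries open, and the function name is hypothetical.

```python
def score_alignment(top, bottom, match=10, mismatch=-5, gap=-4):
    """Sum column scores for two already-aligned sequences.

    A simple substitution rule stands in for a full substitution matrix:
    identical residues score `match`, different residues score `mismatch`.
    The default `gap` value is an assumption, not a value from the slides.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one sequence has a gap in this column
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


if __name__ == "__main__":
    # Columns: G/C mismatch, A/A match, A/- gap, T/T match, -/A gap, C/C match
    print(score_alignment("GAAT-C", "CA-TAC"))  # -5 + 10 - 4 + 10 - 4 + 10 = 17
```

With a different gap penalty only the two gap columns change, so the total moves by twice the difference.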
23
Computing the Scoring Matrix
[Figure: scoring matrix for two example nucleotide sequences]
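
A minimal software sketch of how a Smith-Waterman scoring matrix is filled in follows. It uses the textbook recurrence with a linear gap penalty. The scoring values are illustrative assumptions, and this is not a description of the FPGA design presented in the talk.

```python
def smith_waterman(query, database, match=10, mismatch=-5, gap=-4):
    """Fill the local-alignment scoring matrix H and return (H, best score).

    H[i][j] is the best score of any local alignment ending at query[i-1]
    and database[j-1]. Row 0 and column 0 stay at 0, and no cell drops
    below 0, which is what makes the alignment local.
    """
    rows, cols = len(query) + 1, len(database) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == database[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment
                H[i - 1][j - 1] + sub,  # align the two residues
                H[i - 1][j] + gap,      # gap in the database sequence
                H[i][j - 1] + gap,      # gap in the query sequence
            )
            best = max(best, H[i][j])
    return H, best


if __name__ == "__main__":
    matrix, best = smith_waterman("TTAC", "GATTACA")
    for row in matrix:
        print(row)
    print("best local score:", best)
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal can be computed in parallel. That property is what hardware implementations of this algorithm typically exploit.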
24
Hardware Implementation (32x1 Sliding Window)
Multiple Databases and Multiple Queries
  • Node 1: 32-residue window size, unlimited database size
  • ...
  • Node n: 32-residue window size, unlimited database size
25
Implementation for Hardware (cont'd) (MPI
Implementation)
[Flow diagram across Node 0, Node 1, ..., Node N-1]
  • Scatter queries (query sequences)
  • Broadcast DBs (database sequences)
  • Processing (each node produces a score array)
  • Gather scores
  • Done
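
The scatter / broadcast / process / gather flow on this slide can be sketched in a single Python process, with plain lists standing in for nodes. No MPI library is used here. In a real MPI program the three data-movement steps would correspond to collective operations such as MPI_Scatter, MPI_Bcast, and MPI_Gather. All function names and scoring parameters below are hypothetical, and the per-node kernel is a software stand-in for the FPGA engine.

```python
def sw_best(query, database, match=10, mismatch=-5, gap=-4):
    """Best Smith-Waterman local score, keeping only two matrix rows."""
    prev = [0] * (len(database) + 1)
    best = 0
    for q in query:
        cur = [0]
        for j, d in enumerate(database, start=1):
            sub = match if q == d else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def scatter(items, n_nodes):
    """Split items into contiguous per-node blocks (some may be empty)."""
    chunk = -(-len(items) // n_nodes)  # ceiling division
    return [items[r * chunk:(r + 1) * chunk] for r in range(n_nodes)]


def run(queries, databases, n_nodes):
    """Simulate the slide's flow and return one score row per query."""
    node_queries = scatter(queries, n_nodes)                # scatter queries
    node_dbs = [list(databases) for _ in range(n_nodes)]    # broadcast DBs
    node_scores = [                                         # processing
        [[sw_best(q, d) for d in dbs] for q in qs]
        for qs, dbs in zip(node_queries, node_dbs)
    ]
    # Gather scores: concatenate per-node score arrays in node order.
    return [row for scores in node_scores for row in scores]


if __name__ == "__main__":
    queries = ["ACGT", "TTAC", "GG"]
    databases = ["GATTACA", "ACGTACGT"]
    for query, row in zip(queries, run(queries, databases, n_nodes=2)):
        print(query, row)
```

Because each query is scored independently against the full set of databases, the result does not depend on the number of nodes. Only the amount of work per node changes.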
26
MPI and SPMD on the Cray XD1
  • 6 nodes, each with 2 µPs and 1 FPGA
  • Only 1 µP can communicate with the FPGA in each
    node at a time, keeping the other µP underutilized

[Diagram: MPI running]
27
MPI and SPMD Parallel Programming on the SRC
  • 2 µPs on each board
  • 1 SNAP per µP board
  • Only 1 µP can access the SNAP at a time
  • One processor per processor board can be used
  • Two MAPs can be used out of 4
  • Code modified to use two FPGAs

[Diagram: SRC-6 running MPI; components shown: µPs, memory, PCI-X, SNAPs, SRC Hi-Bar switch, common memory, MAPs, FPGAs]
28
Performance Results
29
Smith-Waterman Scalability on SRC-6 (window of
32x1 residues)
30
Time Distribution of Smith-Waterman on
SRC-6 (window of 32x1 residues)
31
Smith-Waterman Scalability on XD1 (window of
32x1 residues)
32
Smith-Waterman Scalability on XD1 (window of 32x1
residues)
33
Time Distribution of Smith-Waterman on
XD1 (window of 32x1 residues)
34
Additional Experience: Computationally
Intensive Applications
P3 version of SRC-6E
35
Conclusions
  • Like any real-life story, there are ups and
    downs, but the outlook is bright if we do the
    right things
  • Tremendous potential in cost, power, and size, as
    seen in compute-intensive integer applications
  • Rest of applications coming up
  • Early systems suffered from low processor-to-FPGA
    transfer bandwidth; some improvements have been
    seen, and more are still needed
  • Fixed clock restrictions may simplify things, but
    may result in great disadvantages

36
Conclusions
  • More work is needed on how to manage
    heterogeneity in HPRCs such that simple SPMD
    programming models are enabled while hardware is
    well utilized
  • A simple 1:1 heterogeneous-node homogeneous-system
    architecture may be the answer for general-purpose
    HPRCs; other architectures may suit
    special-purpose applications
  • Creative interconnection networks may help here
  • Attention is needed to latency and bandwidth
    between FPGA and microprocessor in relation to the
    performance of the whole interconnection fabric