Title: High Performance Embedded Computing Software Initiative (HPEC-SI)
1. High Performance Embedded Computing Software Initiative (HPEC-SI)
Dr. Jeremy Kepner, MIT Lincoln Laboratory
- This work is sponsored by the High Performance
Computing Modernization Office under Air Force
Contract F19628-00-C-0002. Opinions,
interpretations, conclusions, and recommendations
are those of the author and are not necessarily
endorsed by the United States Government.
2. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
3. Overview - High Performance Embedded Computing (HPEC) Initiative
[Figure: Common Imagery Processor (CIP), an embedded multi-processor, with a shared memory server and ASARS-2]
Challenge: Transition advanced software technology and practices into major defense acquisition programs
4. Why Is DoD Concerned with Embedded Software?
Source: HPEC Market Study, March 2001
[Chart: estimated DoD expenditures for embedded signal and image processing hardware and software ($B)]
- COTS acquisition practices have shifted the burden from point-design hardware to point-design software
- Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards
5. Issues with Current HPEC Development: Inadequacy of Software Practices & Standards
- High Performance Embedded Computing is pervasive through DoD applications
  - Airborne radar insertion program: 85% software rewrite for each hardware platform
  - Missile common processor: processor board costs < $100k; software development costs > $100M
  - Torpedo upgrade: two software re-writes required after changes in hardware design
- Today, embedded software is:
  - Not portable
  - Not scalable
  - Difficult to develop
  - Expensive to maintain
6. Evolution of Software Support Towards "Write Once, Run Anywhere/Anysize"
[Figure: DoD software development vs. COTS development timelines from 1990 onward, showing application software layered on vendor software]
- Application software has traditionally been tied to the hardware
7. Overall Initiative Goals & Impact
- Program Goals
  - Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance
  - Engage acquisition community to promote technology insertion
  - Deliver quantifiable benefits
- Quantifiable benefits
  - Portability: reduction in lines-of-code to change (port/scale) to a new system
  - Productivity: reduction in overall lines-of-code
  - Performance: computation and communication benchmarks
8. HPEC-SI Path to Success
- HPEC Software Initiative builds on
- Proven technology
- Business models
- Better software practices
9. Organization
- Partnership with ODUSD(S&T), Government Labs, FFRDCs, Universities, Contractors, Vendors, and DoD programs
- Over 100 participants from over 20 organizations
10. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
11. Emergence of Component Standards
[Figure: a parallel embedded processor (nodes P0-P3) connected to consoles and other computers]
- Data communication: MPI, MPI/RT, DRI
- Control communication: CORBA, HP-CORBA
- Computation: VSIPL, VSIPL++, Parallel VSIPL++
Definitions:
- VSIPL = Vector, Signal, and Image Processing Library
- Parallel VSIPL++ = parallel object-oriented VSIPL
- MPI = Message-Passing Interface
- MPI/RT = MPI Real-Time
- DRI = Data Re-org Interface
- CORBA = Common Object Request Broker Architecture
- HP-CORBA = High Performance CORBA
- HPEC Initiative builds on completed research and existing standards and libraries
12. The Path to Parallel VSIPL++ (world's first parallel object-oriented standard)
- First demo successfully completed
- VSIPL++ v0.5 spec completed
- VSIPL++ v0.1 code available
- Parallel VSIPL++ spec in progress
- High performance C++ demonstrated
[Roadmap: functionality grows over time as each capability moves from applied research through development to demonstration; fault tolerance and self-optimization research follow in later phases]
- Existing standards (VSIPL + MPI)
  - Demonstrate insertions into fielded systems (e.g., CIP)
  - Demonstrate 3x portability
- Object-oriented standard (VSIPL++ prototype)
  - High-level code abstraction
  - Reduce code size 3x
- Parallel VSIPL++ prototype
  - Unified embedded computation/communication standard
  - Demonstrate scalability
13. Working Group Technical Scope
Development: VSIPL++
- Mapping (data parallelism)
- Early binding (computations)
- Compatibility (backward/forward)
- Local knowledge (accessing local data)
- Extensibility (adding new functions)
- Remote procedure calls (CORBA)
- C++ compiler support
- Test suite
- Adoption incentives (vendor, integrator)
Applied Research: Parallel VSIPL++
- Mapping (task/pipeline parallel)
- Reconfiguration (for fault tolerance)
- Threads
- Reliability/availability
- Data permutation (DRI functionality)
- Tools (profiles, timers, ...)
- Quality of service
14. Overall Technical Tasks and Schedule
[Gantt chart spanning FY01-FY08 (near-, mid-, and long-term), with each task moving through applied research, development, and demonstration:]
- VSIPL (Vector, Signal, and Image Processing Library)
- MPI (Message Passing Interface)
- VSIPL++ (object oriented): v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code
- Parallel VSIPL++: v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code
- Fault tolerance / self-optimizing software
- Demonstrations: CIP Demo 2, followed by Demos 3 through 6
15. HPEC-SI Goals: 1st Demo Achievements
- Portability: zero code changes required (goal 3x, achieved 10x)
- Productivity: DRI code 6x smaller vs. MPI (est.) (goal 3x, achieved 6x)
- Performance: 3x reduced cost or form factor (goal 1.5x, achieved 2x)
[Figure: the HPEC Software Initiative cycle of Demonstrate, Develop, and Prototype built around open, object-oriented, interoperable, and scalable standards]
16. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
17. Parallel Pipeline
Signal Processing Algorithm (a simple sketch of these stages follows below):
- Filter: XOUT = FIR(XIN)
- Beamform: XOUT = w * XIN
- Detect: XOUT = (XIN > c)
- Data parallel within stages
- Task/pipeline parallel across stages
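To make the three stages concrete, here is a minimal, library-free C++ sketch. The FIR taps, the per-element beamforming weight, and the magnitude-threshold detection are illustrative assumptions, not the VSIPL++ implementation shown later in this briefing.

  #include <cmath>
  #include <complex>
  #include <vector>

  using cvec = std::vector<std::complex<float>>;

  // Filter: XOUT = FIR(XIN) -- a hypothetical 3-tap FIR for illustration
  cvec filter(const cvec& xin) {
      const float taps[3] = {0.25f, 0.5f, 0.25f};
      cvec xout(xin.size());
      for (std::size_t i = 2; i < xin.size(); ++i)
          xout[i] = taps[0]*xin[i] + taps[1]*xin[i-1] + taps[2]*xin[i-2];
      return xout;
  }

  // Beamform: XOUT = w * XIN (one weight per element in this toy version)
  cvec beamform(const cvec& xin, const cvec& w) {
      cvec xout(xin.size());
      for (std::size_t i = 0; i < xin.size(); ++i)
          xout[i] = w[i] * xin[i];
      return xout;
  }

  // Detect: XOUT = (XIN > c) -- interpreted here as magnitude above threshold c
  std::vector<bool> detect(const cvec& xin, float c) {
      std::vector<bool> xout(xin.size());
      for (std::size_t i = 0; i < xin.size(); ++i)
          xout[i] = std::abs(xin[i]) > c;
      return xout;
  }

  int main() {
      cvec x(64, std::complex<float>(1.0f, 0.0f));
      cvec w(64, std::complex<float>(0.5f, 0.0f));
      auto detections = detect(beamform(filter(x), w), 0.1f);
      return detections.empty();
  }

Each stage can be data parallel (rows of a matrix split across processors) while the stages themselves run as a task/pipeline chain on different processor sets.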
18. Types of Parallelism
[Figure: illustrations of the types of parallelism, e.g., data parallel decomposition across processors]
19. Current Approach to Parallel Code
Algorithm + Mapping = Code

  // Mapping 1: Stage 1 on processors 1-2, Stage 2 on processors 3-4
  while (!done)
  {
    if ( rank()==1 || rank()==2 )
      stage1();
    else if ( rank()==3 || rank()==4 )
      stage2();
  }

  // Mapping 2: growing Stage 2 to processors 3-6 requires editing the code
  while (!done)
  {
    if ( rank()==1 || rank()==2 )
      stage1();
    else if ( rank()==3 || rank()==4 || rank()==5 || rank()==6 )
      stage2();
  }

- Algorithm and hardware mapping are linked
- Resulting code is non-scalable and non-portable
20. Scalable Approach
[Figure: the same addVectors code runs under a single-processor mapping or a multi-processor mapping; only the Map objects passed in differ]

  #include <Vector.h>
  #include <AddPvl.h>

  void addVectors(const Map& aMap, const Map& bMap, const Map& cMap)
  {
    Vector< Complex<Float> > a("a", aMap, LENGTH);
    Vector< Complex<Float> > b("b", bMap, LENGTH);
    Vector< Complex<Float> > c("c", cMap, LENGTH);

    b = 1;
    c = 2;
    a = b + c;
  }

- Single-processor and multi-processor code are the same
- Maps can be changed without changing software
- High level code is compact
Lincoln Parallel Vector Library (PVL)
21. C++ Expression Templates and PETE
Example expression: A = B + C*D
Expression templates encode the parse tree of such an expression in a type, e.g.

  BinaryNode<OpAssign, Vector,
    BinaryNode<OpAdd, Vector,
      BinaryNode<OpMultiply, Vector, Vector> > >

How an assignment such as A = B + C is evaluated:
1. B and C references are passed to operator +
2. operator + creates an expression parse tree
3. The expression parse tree is returned (parse trees, not vectors, are created)
4. A reference to the expression tree is passed to operator =
5. operator = calculates the result and performs the assignment A = B + C
- Expression templates enhance performance by allowing temporary vectors (and the copies they imply) to be avoided; a minimal sketch follows
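A minimal, self-contained C++ sketch of the expression-template idea (illustrative only; it does not use PETE's actual class names): operator+ returns a lightweight parse-tree node instead of computing into a temporary vector, and the assignment operator evaluates the whole tree in a single loop.

  #include <cstddef>
  #include <vector>

  template <class L, class R>
  struct AddNode {                 // parse-tree node representing L + R
      const L& l; const R& r;
      double operator[](std::size_t i) const { return l[i] + r[i]; }
  };

  struct Vec {
      std::vector<double> data;
      explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
      double  operator[](std::size_t i) const { return data[i]; }
      double& operator[](std::size_t i)       { return data[i]; }
      std::size_t size() const { return data.size(); }

      // Assignment walks the expression tree element by element:
      // one loop, no temporary vectors.
      template <class Expr>
      Vec& operator=(const Expr& e) {
          for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
          return *this;
      }
  };

  // operator+ builds a node instead of computing a result vector.
  template <class L, class R>
  AddNode<L, R> operator+(const L& l, const R& r) { return {l, r}; }

  int main() {
      Vec a(8), b(8, 1.0), c(8, 2.0), d(8, 3.0);
      a = b + c + d;   // builds AddNode<AddNode<Vec,Vec>,Vec>, then one loop
      return 0;
  }

PETE generalizes this pattern to arbitrary operators and operand types, which is how high-level array expressions can match hand-coded loops.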
22. PETE Linux Cluster Experiments
[Plots: relative execution time vs. vector length (8 to 131072) for three expressions: A=B+C, A=B+C*D, and A=B+C*D/E+fft(F)]
- PVL with VSIPL has a small overhead
- PVL with PETE can surpass VSIPL
23. PowerPC AltiVec Experiments
- Results
  - Hand-coded loop achieves good performance, but is problem specific and low level
  - Optimized VSIPL performs well for simple expressions, worse for more complex expressions
  - PETE-style array operators perform almost as well as the hand-coded loop and are general, can be composed, and are high-level
[Bar charts: performance of each software technology on four benchmark expressions, from the simple A=B+C to more complex multi-operand expressions]
Software technologies compared (the two coding styles are sketched below):
- AltiVec loop: C; for loop; direct use of AltiVec extensions; assumes unit stride; assumes vector alignment
- VSIPL (vendor optimized): C; AltiVec-aware VSIPro Core Lite (www.mpi-softtech.com); no multiply-add; cannot assume unit stride; cannot assume vector alignment
- PETE with AltiVec: C++; PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment
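To make the comparison concrete, here is a simplified scalar sketch of the two programming styles being benchmarked (the real benchmarks used AltiVec vector extensions and VSIPL/PETE types, which are not shown here):

  // Style 1: hand-coded loop -- low level and problem specific, but the
  // multiply-add is naturally fused and no temporaries are created.
  void add_mul_loop(float* A, const float* B, const float* C,
                    const float* D, int n)
  {
      for (int i = 0; i < n; ++i)
          A[i] = B[i] + C[i] * D[i];
  }

  // Style 2: high-level array operators (PETE / VSIPL++ style) -- the same
  // computation written against whole arrays; expression templates reduce it
  // to a loop equivalent to the one above:
  //
  //     A = B + C * D;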
24. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
25. A = sin(A) + 2*B
- Generated code (no temporaries):

    for (index i = 0; i < A.size(); ++i)
      A.put(i, sin(A.get(i)) + 2*B.get(i));

- Apply inlining to transform to (Ablock and Bblock are the raw data blocks):

    for (index i = 0; i < A.size(); ++i)
      Ablock[i] = sin(Ablock[i]) + 2*Bblock[i];

- Apply more inlining to transform to:

    T* Bp   = &(Bblock[0]);
    T* Aend = &(Ablock[A.size()]);
    for (T* Ap = &(Ablock[0]); Ap < Aend; ++Ap, ++Bp)
      *Ap = fmadd (2, *Bp, sin(*Ap));

- Or apply PowerPC AltiVec extensions
- Each step can be automatically generated
- Optimization level: whatever the vendor desires
26. BLAS zherk Routine
- BLAS: Basic Linear Algebra Subprograms
- Hermitian matrix M: conjug(M) = Mt
- zherk performs a rank-k update of Hermitian matrix C:
    C <- alpha * A * conjug(A)t + beta * C
- VSIPL code:

    A   = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
    C   = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    vsip_cmprodh_d(A,A,tmp);        /* A*conjug(A)t */
    vsip_rscmmul_d(alpha,tmp,tmp);  /* alpha*A*conjug(A)t */
    vsip_rscmmul_d(beta,C,C);       /* beta*C */
    vsip_cmadd_d(tmp,C,C);          /* alpha*A*conjug(A)t + beta*C */
    vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
    vsip_cblockdestroy(vsip_cmdestroy_d(C));
    vsip_cblockdestroy(vsip_cmdestroy_d(A));

- VSIPL++ code (also parallel; continued in the sketch below):

    Matrix<complex<double> > A(10,15);
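The slide is truncated at this point. A hedged sketch of how the VSIPL++ version plausibly continues, assuming a prodh-style product-with-Hermitian-transpose function and overloaded scalar/matrix arithmetic (names are assumptions, not quotations from the spec):

    // Hedged continuation of the VSIPL++ version (assumed names)
    Matrix<complex<double> > C(10,10);
    double alpha = 1.0, beta = 1.0;          // illustrative scalars

    // C = alpha * A * conjug(A)t + beta * C in one expression; temporaries,
    // block management, and (optionally) the parallel mapping are handled
    // by the library rather than by the user.
    C = alpha * prodh(A, A) + beta * C;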
27. Simple Filtering Application

  int main ()
  {
    using namespace vsip;
    const length ROWS = 64;
    const length COLS = 4096;
    vsipl v;

    FFT<Matrix, complex<double>, complex<double>,
        FORWARD, 0, MULTIPLE, alg_hint()>
      forward_fft (Domain<2>(ROWS,COLS), 1.0);
    FFT<Matrix, complex<double>, complex<double>,
        INVERSE, 0, MULTIPLE, alg_hint()>
      inverse_fft (Domain<2>(ROWS,COLS), 1.0);

    const Matrix<complex<double> > weights (load_weights (ROWS, COLS));

    try
    {
      while (1) output (inverse_fft (forward_fft (input ()) * weights));
    }
    catch (std::runtime_error)
    {
      // Successfully caught access outside domain.
    }
  }
28. Explicit Parallel Filter

  #include <vsiplpp.h>
  using namespace VSIPL;
  const int ROWS = 64;
  const int COLS = 4096;

  int main (int argc, char* argv[])
  {
    Matrix<Complex<Float>> W (ROWS, COLS, "WMap");   // weights matrix
    Matrix<Complex<Float>> X (ROWS, COLS, "WMap");   // input matrix
    Matrix<Complex<Float>> Y (ROWS, COLS, "WMap");   // output matrix (declaration assumed; not on the original slide)
    load_weights (W);
    try
    {
      while (1)
      {
        input (X);                      // some input function
        Y = IFFT ( mul (FFT(X), W) );
        output (Y);                     // some output function
      }
    }
    catch (Exception& e) { cerr << e << endl; }
  }
29. Multi-Stage Filter (main)

  using namespace vsip;
  const length ROWS = 64;
  const length COLS = 4096;

  int main (int argc, char* argv[])
  {
    sample_low_pass_filter<complex<float> > LPF;
    sample_beamform<complex<float> >        BF;
    sample_matched_filter<complex<float> >  MF;

    try
    {
      while (1) output (MF(BF(LPF(input ()))));
    }
    catch (std::runtime_error)
    {
      // Successfully caught access outside domain.
    }
  }
30. Multi-Stage Filter (low pass filter)

  template<typename T>
  class sample_low_pass_filter
  {
  public:
    sample_low_pass_filter()
      : FIR1_(load_w1 (W1_LENGTH), FIR1_LENGTH),
        FIR2_(load_w2 (W2_LENGTH), FIR2_LENGTH)
    {}

    Matrix<T> operator () (const Matrix<T>& Input)
    {
      Matrix<T> output(ROWS, COLS);
      for (index row=0; row<ROWS; ++row)
        output.row(row) = FIR2_(FIR1_(Input.row(row)).second).second;
      return output;
    }

  private:
    FIR<T, SYMMETRIC_ODD, FIR1_DECIMATION, CONTINUOUS, alg_hint()> FIR1_;
    FIR<T, SYMMETRIC_ODD, FIR2_DECIMATION, CONTINUOUS, alg_hint()> FIR2_;
  };
31. Multi-Stage Filter (beam former)

  template<typename T>
  class sample_beamform
  {
  public:
    sample_beamform() : W3_(load_w3 (ROWS,COLS)) {}

    Matrix<T> operator () (const Matrix<T>& Input) const
    { return W3_ * Input; }

  private:
    const Matrix<T> W3_;
  };
32. Multi-Stage Filter (matched filter)

  template<typename T>
  class sample_matched_filter
  {
  public:
    sample_matched_filter()
      : W4_(load_w4 (ROWS,COLS)),
        forward_fft_ (Domain<2>(ROWS,COLS), 1.0),
        inverse_fft_ (Domain<2>(ROWS,COLS), 1.0)
    {}

    Matrix<T> operator () (const Matrix<T>& Input) const
    { return inverse_fft_ (forward_fft_ (Input) * W4_); }

  private:
    const Matrix<T> W4_;
    FFT<Matrix<T>, complex<double>, complex<double>,
        FORWARD, 0, MULTIPLE, alg_hint()> forward_fft_;
    FFT<Matrix<T>, complex<double>, complex<double>,
        INVERSE, 0, MULTIPLE, alg_hint()> inverse_fft_;
  };
33. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
34. Dynamic Mapping for Fault Tolerance
[Figure: an input task and an output task on a parallel processor; the output data (XIN -> XOUT) is placed on nodes 0,1 by Map0 and can be re-placed on nodes 0,2 by Map1]
- Switching processors is accomplished by switching maps (illustrated in the sketch below)
- No change to algorithm required
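The mechanism can be sketched in plain C++ (illustrative only; the Map and Task types below are stand-ins, not PVL's actual API): a task owns a map object listing the nodes it runs on, and fault recovery swaps in a different map without touching the algorithm.

  #include <cstdio>
  #include <vector>

  // Illustrative stand-in for a PVL-style map: just the list of node IDs
  // a task's data and computation are assigned to.
  struct Map {
      std::vector<int> nodes;
  };

  struct Task {
      Map map;                                  // current mapping
      void remap(const Map& m) { map = m; }     // swap maps; algorithm unchanged
      void run() {
          // The numerical kernel only sees "my portion of the data";
          // which physical nodes hold that portion comes from the map.
          std::printf("running on %zu nodes\n", map.nodes.size());
      }
  };

  int main() {
      Map map0{{0, 1}};            // nominal mapping: nodes 0 and 1
      Map map1{{0, 2}};            // fail-over mapping: nodes 0 and 2

      Task output_task{map0};
      output_task.run();

      // Node 1 fails: switch maps; no change to the algorithm required.
      output_task.remap(map1);
      output_task.run();
      return 0;
  }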
35. Dynamic Mapping Performance Results
[Plot: relative time vs. data size]
- Good dynamic mapping performance is possible
36. Optimal Mapping of Complex Algorithms
[Figure: the same application requires different optimal maps on different hardware: workstation, embedded board, embedded multi-computer, PowerPC cluster, Intel cluster]
- Need to automate the process of mapping an algorithm to hardware
37. Self-optimizing Software for Signal Processing (S3P)
[Plots: latency (seconds) and throughput (frames/sec) vs. number of CPUs (4 to 8) for a small (48x4K) and a large (48x128K) problem size; each point is annotated with the selected stage-to-CPU mapping, e.g., 1-1-1-1 through 1-3-2-2]
- Find
  - Min(latency | # CPUs)
  - Max(throughput | # CPUs)
- S3P selects the correct optimal mapping
- Excellent agreement between S3P-predicted and achieved latencies and throughputs (a toy sketch of the selection step follows)
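A toy sketch of the selection step S3P performs (illustrative only; the real system measures or models each candidate mapping rather than using the made-up numbers below): enumerate candidate stage-to-CPU mappings for a given machine size and pick the one that minimizes latency or maximizes throughput.

  #include <algorithm>
  #include <cstdio>
  #include <string>
  #include <vector>

  // Toy model of one candidate mapping: how many CPUs each of the four
  // pipeline stages gets, plus its latency and throughput.
  struct Mapping {
      std::string cpus_per_stage;   // e.g. "1-2-2-1"
      double latency_s;             // seconds per frame (illustrative numbers)
      double throughput_fps;        // frames per second (illustrative numbers)
  };

  int main() {
      // Hypothetical candidates for a 6-CPU system (numbers are made up).
      std::vector<Mapping> candidates = {
          {"1-1-2-2", 0.20, 4.0},
          {"1-2-2-1", 0.18, 3.5},
          {"2-2-1-1", 0.22, 3.0},
      };

      auto min_latency = *std::min_element(candidates.begin(), candidates.end(),
          [](const Mapping& a, const Mapping& b) { return a.latency_s < b.latency_s; });
      auto max_throughput = *std::max_element(candidates.begin(), candidates.end(),
          [](const Mapping& a, const Mapping& b) { return a.throughput_fps < b.throughput_fps; });

      std::printf("min-latency mapping:    %s\n", min_latency.cpus_per_stage.c_str());
      std::printf("max-throughput mapping: %s\n", max_throughput.cpus_per_stage.c_str());
      return 0;
  }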
38. High Level Languages
High Performance Matlab Applications
- Parallel Matlab need has been identified
  - HPCMO (OSU)
- Required user interface has been demonstrated
  - Matlab*P (MIT/LCS)
  - PVL (MIT/LL)
- Required hardware interface has been demonstrated
  - MatlabMPI (MIT/LL)
[Figure: application domains (DoD sensor processing, DoD mission planning, scientific simulation, commercial applications) sit on a user interface, the Parallel Matlab Toolbox, a hardware interface, and parallel computing hardware]
- Parallel Matlab Toolbox can now be realized
39. MatlabMPI Deployment (speedup)
- Maui
  - Image filtering benchmark (300x on 304 CPUs)
- Lincoln
  - Signal processing (7.8x on 8 CPUs)
  - Radar simulations (7.5x on 8 CPUs)
  - Hyperspectral (2.9x on 3 CPUs)
- MIT
  - LCS Beowulf (11x Gflops on 9 duals)
  - AI Lab face recognition (10x on 8 duals)
- Other
  - Ohio St. EM simulations
  - ARL SAR image enhancement
  - Wash U hearing aid simulations
  - So. Ill. benchmarking
  - JHU digital beamforming
  - ISL radar simulation
  - URI heart modeling
[Plot: image filtering performance (Gigaflops) vs. number of processors on the IBM SP at the Maui Computing Center]
- Rapidly growing MatlabMPI user base demonstrates need for parallel Matlab
- Demonstrated scaling to 300 processors
40. Summary
- HPEC-SI expected benefit
  - Open software libraries, programming models, and standards that provide portability (3x), productivity (3x), and performance (1.5x) benefits to multiple DoD programs
- Invitation to participate
  - DoD program offices with signal/image processing needs
  - Academic and government researchers interested in high performance embedded computing
  - Contact: kepner@ll.mit.edu
41. The Links
- High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
- High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
- Vector, Signal, and Image Processing Library: http://www.vsipl.org/
- MPI Software Technologies, Inc.: http://www.mpi-softtech.com/
- Data Reorganization Initiative: http://www.data-re.org/
- CodeSourcery, LLC: http://www.codesourcery.com/
- MatlabMPI: http://www.ll.mit.edu/MatlabMPI