Title: High Performance Embedded Computing Software Initiative (HPEC-SI)
1. High Performance Embedded Computing Software Initiative (HPEC-SI)
Dr. Jeremy Kepner, MIT Lincoln Laboratory
- This work is sponsored by the High Performance
Computing Modernization Office under Air Force
Contract F19628-00-C-0002. Opinions,
interpretations, conclusions, and recommendations
are those of the author and are not necessarily
endorsed by the United States Government.
2. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
3. Overview - High Performance Embedded Computing (HPEC) Initiative
[Figure: Common Imagery Processor (CIP), an embedded multi-processor, with a shared memory server and ASARS-2]
Challenge: Transition advanced software technology and practices into major defense acquisition programs
4. Why Is DoD Concerned with Embedded Software?
Source: HPEC Market Study, March 2001
[Chart: estimated DoD expenditures for embedded signal and image processing hardware and software ($B)]
- COTS acquisition practices have shifted the burden from point-design hardware to point-design software
- Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards
5. Issues with Current HPEC Development: Inadequacy of Software Practices & Standards
- High Performance Embedded Computing is pervasive through DoD applications
  - Airborne radar insertion program: 85% software rewrite for each hardware platform
  - Missile common processor: processor board costs < $100k; software development costs > $100M
  - Torpedo upgrade: two software re-writes required after changes in hardware design
- Today, embedded software is:
  - Not portable
  - Not scalable
  - Difficult to develop
  - Expensive to maintain
6. Evolution of Software Support Towards "Write Once, Run Anywhere/Anysize"
[Figure: DoD software development vs. COTS development timelines from 1990 onward, showing application software layered on vendor software]
- Application software has traditionally been tied to the hardware
7. Overall Initiative Goals & Impact
- Program Goals
  - Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance
  - Engage acquisition community to promote technology insertion
  - Deliver quantifiable benefits
- Quantifiable benefits
  - Portability: reduction in lines-of-code to change (port/scale) to a new system
  - Productivity: reduction in overall lines-of-code
  - Performance: computation and communication benchmarks
8. HPEC-SI Path to Success
- HPEC Software Initiative builds on
- Proven technology
- Business models
- Better software practices
9. Organization
- Partnership with ODUSD(S&T), Government Labs, FFRDCs, Universities, Contractors, Vendors, and DoD programs
- Over 100 participants from over 20 organizations
10. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
11. Emergence of Component Standards
[Figure: a parallel embedded processor (nodes P0-P3) connected to consoles and other computers]
- Data communication: MPI, MPI/RT, DRI
- Control communication: CORBA, HP-CORBA
- Computation: VSIPL, VSIPL++, Parallel VSIPL++
Definitions:
- VSIPL = Vector, Signal, and Image Processing Library
- Parallel VSIPL++ = parallel object-oriented VSIPL
- MPI = Message-Passing Interface
- MPI/RT = MPI Real-Time
- DRI = Data Re-org Interface
- CORBA = Common Object Request Broker Architecture
- HP-CORBA = High Performance CORBA
- HPEC Initiative builds on completed research and existing standards and libraries
12. The Path to Parallel VSIPL++ (world's first parallel object-oriented standard)
- First demo successfully completed
- VSIPL++ v0.5 spec completed
- VSIPL++ v0.1 code available
- Parallel VSIPL++ spec in progress
- High performance C++ demonstrated
[Roadmap: functionality grows over time as each capability moves from applied research through development to demonstration; fault tolerance and self-optimization research follow in later phases]
- Existing standards (VSIPL + MPI)
  - Demonstrate insertions into fielded systems (e.g., CIP)
  - Demonstrate 3x portability
- Object-oriented standard (VSIPL++ prototype)
  - High-level code abstraction
  - Reduce code size 3x
- Parallel VSIPL++ prototype
  - Unified embedded computation/communication standard
  - Demonstrate scalability
13. Working Group Technical Scope
Development: VSIPL++
- Mapping (data parallelism)
- Early binding (computations)
- Compatibility (backward/forward)
- Local knowledge (accessing local data)
- Extensibility (adding new functions)
- Remote procedure calls (CORBA)
- C++ compiler support
- Test suite
- Adoption incentives (vendor, integrator)
Applied Research: Parallel VSIPL++
- Mapping (task/pipeline parallel)
- Reconfiguration (for fault tolerance)
- Threads
- Reliability/availability
- Data permutation (DRI functionality)
- Tools (profiles, timers, ...)
- Quality of service
14. Overall Technical Tasks and Schedule
[Gantt chart spanning FY01-FY08 (near-, mid-, and long-term), with each task moving through applied research, development, and demonstration:]
- VSIPL (Vector, Signal, and Image Processing Library)
- MPI (Message Passing Interface)
- VSIPL++ (object oriented): v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code
- Parallel VSIPL++: v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code
- Fault tolerance / self-optimizing software
- Demonstrations: CIP Demo 2, followed by Demos 3 through 6
15. HPEC-SI Goals: 1st Demo Achievements
- Portability: zero code changes required (goal 3x, achieved 10x)
- Productivity: DRI code 6x smaller vs. MPI (est.) (goal 3x, achieved 6x)
- Performance: 3x reduced cost or form factor (goal 1.5x, achieved 2x)
[Figure: the HPEC Software Initiative cycle of Demonstrate, Develop, and Prototype built around open, object-oriented, interoperable, and scalable standards]
16. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
17. Parallel Pipeline
Signal Processing Algorithm (a simple sketch of these stages follows below):
- Filter: XOUT = FIR(XIN)
- Beamform: XOUT = w * XIN
- Detect: XOUT = (XIN > c)
- Data parallel within stages
- Task/pipeline parallel across stages
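To make the three stages concrete, here is a minimal, library-free C++ sketch. The FIR taps, the per-element beamforming weight, and the magnitude-threshold detection are illustrative assumptions, not the VSIPL++ implementation shown later in this briefing.

  #include <cmath>
  #include <complex>
  #include <vector>

  using cvec = std::vector<std::complex<float>>;

  // Filter: XOUT = FIR(XIN) -- a hypothetical 3-tap FIR for illustration
  cvec filter(const cvec& xin) {
      const float taps[3] = {0.25f, 0.5f, 0.25f};
      cvec xout(xin.size());
      for (std::size_t i = 2; i < xin.size(); ++i)
          xout[i] = taps[0]*xin[i] + taps[1]*xin[i-1] + taps[2]*xin[i-2];
      return xout;
  }

  // Beamform: XOUT = w * XIN (one weight per element in this toy version)
  cvec beamform(const cvec& xin, const cvec& w) {
      cvec xout(xin.size());
      for (std::size_t i = 0; i < xin.size(); ++i)
          xout[i] = w[i] * xin[i];
      return xout;
  }

  // Detect: XOUT = (XIN > c) -- interpreted here as magnitude above threshold c
  std::vector<bool> detect(const cvec& xin, float c) {
      std::vector<bool> xout(xin.size());
      for (std::size_t i = 0; i < xin.size(); ++i)
          xout[i] = std::abs(xin[i]) > c;
      return xout;
  }

  int main() {
      cvec x(64, std::complex<float>(1.0f, 0.0f));
      cvec w(64, std::complex<float>(0.5f, 0.0f));
      auto detections = detect(beamform(filter(x), w), 0.1f);
      return detections.empty();
  }

Each stage can be data parallel (rows of a matrix split across processors) while the stages themselves run as a task/pipeline chain on different processor sets.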
18. Types of Parallelism
[Figure: illustrations of the types of parallelism, e.g., data parallel decomposition across processors]
19. Current Approach to Parallel Code
Algorithm + Mapping = Code

  // Mapping 1: Stage 1 on processors 1-2, Stage 2 on processors 3-4
  while (!done)
  {
    if ( rank()==1 || rank()==2 )
      stage1();
    else if ( rank()==3 || rank()==4 )
      stage2();
  }

  // Mapping 2: growing Stage 2 to processors 3-6 requires editing the code
  while (!done)
  {
    if ( rank()==1 || rank()==2 )
      stage1();
    else if ( rank()==3 || rank()==4 || rank()==5 || rank()==6 )
      stage2();
  }

- Algorithm and hardware mapping are linked
- Resulting code is non-scalable and non-portable
20. Scalable Approach
[Figure: the same addVectors code runs under a single-processor mapping or a multi-processor mapping; only the Map objects passed in differ]

  #include <Vector.h>
  #include <AddPvl.h>

  void addVectors(const Map& aMap, const Map& bMap, const Map& cMap)
  {
    Vector< Complex<Float> > a("a", aMap, LENGTH);
    Vector< Complex<Float> > b("b", bMap, LENGTH);
    Vector< Complex<Float> > c("c", cMap, LENGTH);

    b = 1;
    c = 2;
    a = b + c;
  }

- Single-processor and multi-processor code are the same
- Maps can be changed without changing software
- High level code is compact
Lincoln Parallel Vector Library (PVL)
21. C++ Expression Templates and PETE
Example expression: A = B + C*D
Expression templates encode the parse tree of such an expression in a type, e.g.

  BinaryNode<OpAssign, Vector,
    BinaryNode<OpAdd, Vector,
      BinaryNode<OpMultiply, Vector, Vector> > >

How an assignment such as A = B + C is evaluated:
1. B and C references are passed to operator +
2. operator + creates an expression parse tree
3. The expression parse tree is returned (parse trees, not vectors, are created)
4. A reference to the expression tree is passed to operator =
5. operator = calculates the result and performs the assignment A = B + C
- Expression templates enhance performance by allowing temporary vectors (and the copies they imply) to be avoided; a minimal sketch follows
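A minimal, self-contained C++ sketch of the expression-template idea (illustrative only; it does not use PETE's actual class names): operator+ returns a lightweight parse-tree node instead of computing into a temporary vector, and the assignment operator evaluates the whole tree in a single loop.

  #include <cstddef>
  #include <vector>

  template <class L, class R>
  struct AddNode {                 // parse-tree node representing L + R
      const L& l; const R& r;
      double operator[](std::size_t i) const { return l[i] + r[i]; }
  };

  struct Vec {
      std::vector<double> data;
      explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
      double  operator[](std::size_t i) const { return data[i]; }
      double& operator[](std::size_t i)       { return data[i]; }
      std::size_t size() const { return data.size(); }

      // Assignment walks the expression tree element by element:
      // one loop, no temporary vectors.
      template <class Expr>
      Vec& operator=(const Expr& e) {
          for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
          return *this;
      }
  };

  // operator+ builds a node instead of computing a result vector.
  template <class L, class R>
  AddNode<L, R> operator+(const L& l, const R& r) { return {l, r}; }

  int main() {
      Vec a(8), b(8, 1.0), c(8, 2.0), d(8, 3.0);
      a = b + c + d;   // builds AddNode<AddNode<Vec,Vec>,Vec>, then one loop
      return 0;
  }

PETE generalizes this pattern to arbitrary operators and operand types, which is how high-level array expressions can match hand-coded loops.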
22. PETE Linux Cluster Experiments
[Plots: relative execution time vs. vector length (8 to 131072) for three expressions: A=B+C, A=B+C*D, and A=B+C*D/E+fft(F)]
- PVL with VSIPL has a small overhead
- PVL with PETE can surpass VSIPL
23. PowerPC AltiVec Experiments
- Results
  - Hand-coded loop achieves good performance, but is problem specific and low level
  - Optimized VSIPL performs well for simple expressions, worse for more complex expressions
  - PETE-style array operators perform almost as well as the hand-coded loop and are general, can be composed, and are high-level
[Bar charts: performance of each software technology on four benchmark expressions, from the simple A=B+C to more complex multi-operand expressions]
Software technologies compared (the two coding styles are sketched below):
- AltiVec loop: C; for loop; direct use of AltiVec extensions; assumes unit stride; assumes vector alignment
- VSIPL (vendor optimized): C; AltiVec-aware VSIPro Core Lite (www.mpi-softtech.com); no multiply-add; cannot assume unit stride; cannot assume vector alignment
- PETE with AltiVec: C++; PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment
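To make the comparison concrete, here is a simplified scalar sketch of the two programming styles being benchmarked (the real benchmarks used AltiVec vector extensions and VSIPL/PETE types, which are not shown here):

  // Style 1: hand-coded loop -- low level and problem specific, but the
  // multiply-add is naturally fused and no temporaries are created.
  void add_mul_loop(float* A, const float* B, const float* C,
                    const float* D, int n)
  {
      for (int i = 0; i < n; ++i)
          A[i] = B[i] + C[i] * D[i];
  }

  // Style 2: high-level array operators (PETE / VSIPL++ style) -- the same
  // computation written against whole arrays; expression templates reduce it
  // to a loop equivalent to the one above:
  //
  //     A = B + C * D;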
24. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
25. A = sin(A) + 2*B
- Generated code (no temporaries):

    for (index i = 0; i < A.size(); ++i)
      A.put(i, sin(A.get(i)) + 2*B.get(i));

- Apply inlining to transform to (Ablock and Bblock are the raw data blocks):

    for (index i = 0; i < A.size(); ++i)
      Ablock[i] = sin(Ablock[i]) + 2*Bblock[i];

- Apply more inlining to transform to:

    T* Bp   = &(Bblock[0]);
    T* Aend = &(Ablock[A.size()]);
    for (T* Ap = &(Ablock[0]); Ap < Aend; ++Ap, ++Bp)
      *Ap = fmadd (2, *Bp, sin(*Ap));

- Or apply PowerPC AltiVec extensions
- Each step can be automatically generated
- Optimization level: whatever the vendor desires
26. BLAS zherk Routine
- BLAS: Basic Linear Algebra Subprograms
- Hermitian matrix M: conjug(M) = Mt
- zherk performs a rank-k update of Hermitian matrix C:
    C <- alpha * A * conjug(A)t + beta * C
- VSIPL code:

    A   = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
    C   = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    vsip_cmprodh_d(A,A,tmp);        /* A*conjug(A)t */
    vsip_rscmmul_d(alpha,tmp,tmp);  /* alpha*A*conjug(A)t */
    vsip_rscmmul_d(beta,C,C);       /* beta*C */
    vsip_cmadd_d(tmp,C,C);          /* alpha*A*conjug(A)t + beta*C */
    vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
    vsip_cblockdestroy(vsip_cmdestroy_d(C));
    vsip_cblockdestroy(vsip_cmdestroy_d(A));

- VSIPL++ code (also parallel; continued in the sketch below):

    Matrix<complex<double> > A(10,15);
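The slide is truncated at this point. A hedged sketch of how the VSIPL++ version plausibly continues, assuming a prodh-style product-with-Hermitian-transpose function and overloaded scalar/matrix arithmetic (names are assumptions, not quotations from the spec):

    // Hedged continuation of the VSIPL++ version (assumed names)
    Matrix<complex<double> > C(10,10);
    double alpha = 1.0, beta = 1.0;          // illustrative scalars

    // C = alpha * A * conjug(A)t + beta * C in one expression; temporaries,
    // block management, and (optionally) the parallel mapping are handled
    // by the library rather than by the user.
    C = alpha * prodh(A, A) + beta * C;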
27. Simple Filtering Application

  int main ()
  {
    using namespace vsip;
    const length ROWS = 64;
    const length COLS = 4096;
    vsipl v;

    FFT<Matrix, complex<double>, complex<double>,
        FORWARD, 0, MULTIPLE, alg_hint()>
      forward_fft (Domain<2>(ROWS,COLS), 1.0);
    FFT<Matrix, complex<double>, complex<double>,
        INVERSE, 0, MULTIPLE, alg_hint()>
      inverse_fft (Domain<2>(ROWS,COLS), 1.0);

    const Matrix<complex<double> > weights (load_weights (ROWS, COLS));

    try
    {
      while (1) output (inverse_fft (forward_fft (input ()) * weights));
    }
    catch (std::runtime_error)
    {
      // Successfully caught access outside domain.
    }
  }
28. Explicit Parallel Filter

  #include <vsiplpp.h>
  using namespace VSIPL;
  const int ROWS = 64;
  const int COLS = 4096;

  int main (int argc, char* argv[])
  {
    Matrix<Complex<Float>> W (ROWS, COLS, "WMap");   // weights matrix
    Matrix<Complex<Float>> X (ROWS, COLS, "WMap");   // input matrix
    Matrix<Complex<Float>> Y (ROWS, COLS, "WMap");   // output matrix (declaration assumed; not on the original slide)
    load_weights (W);
    try
    {
      while (1)
      {
        input (X);                      // some input function
        Y = IFFT ( mul (FFT(X), W) );
        output (Y);                     // some output function
      }
    }
    catch (Exception& e) { cerr << e << endl; }
  }
29. Multi-Stage Filter (main)

  using namespace vsip;
  const length ROWS = 64;
  const length COLS = 4096;

  int main (int argc, char* argv[])
  {
    sample_low_pass_filter<complex<float> > LPF;
    sample_beamform<complex<float> >        BF;
    sample_matched_filter<complex<float> >  MF;

    try
    {
      while (1) output (MF(BF(LPF(input ()))));
    }
    catch (std::runtime_error)
    {
      // Successfully caught access outside domain.
    }
  }
30. Multi-Stage Filter (low pass filter)

  template<typename T>
  class sample_low_pass_filter
  {
  public:
    sample_low_pass_filter()
      : FIR1_(load_w1 (W1_LENGTH), FIR1_LENGTH),
        FIR2_(load_w2 (W2_LENGTH), FIR2_LENGTH)
    {}

    Matrix<T> operator () (const Matrix<T>& Input)
    {
      Matrix<T> output(ROWS, COLS);
      for (index row=0; row<ROWS; ++row)
        output.row(row) = FIR2_(FIR1_(Input.row(row)).second).second;
      return output;
    }

  private:
    FIR<T, SYMMETRIC_ODD, FIR1_DECIMATION, CONTINUOUS, alg_hint()> FIR1_;
    FIR<T, SYMMETRIC_ODD, FIR2_DECIMATION, CONTINUOUS, alg_hint()> FIR2_;
  };
31. Multi-Stage Filter (beam former)

  template<typename T>
  class sample_beamform
  {
  public:
    sample_beamform() : W3_(load_w3 (ROWS,COLS)) {}

    Matrix<T> operator () (const Matrix<T>& Input) const
    { return W3_ * Input; }

  private:
    const Matrix<T> W3_;
  };
32. Multi-Stage Filter (matched filter)

  template<typename T>
  class sample_matched_filter
  {
  public:
    sample_matched_filter()
      : W4_(load_w4 (ROWS,COLS)),
        forward_fft_ (Domain<2>(ROWS,COLS), 1.0),
        inverse_fft_ (Domain<2>(ROWS,COLS), 1.0)
    {}

    Matrix<T> operator () (const Matrix<T>& Input) const
    { return inverse_fft_ (forward_fft_ (Input) * W4_); }

  private:
    const Matrix<T> W4_;
    FFT<Matrix<T>, complex<double>, complex<double>,
        FORWARD, 0, MULTIPLE, alg_hint()> forward_fft_;
    FFT<Matrix<T>, complex<double>, complex<double>,
        INVERSE, 0, MULTIPLE, alg_hint()> inverse_fft_;
  };
33. Outline
- Introduction
- Software Standards
- Parallel VSIPL++
- Future Challenges
- Summary
34. Dynamic Mapping for Fault Tolerance
[Figure: an input task and an output task on a parallel processor; the output data (XIN -> XOUT) is placed on nodes 0,1 by Map0 and can be re-placed on nodes 0,2 by Map1]
- Switching processors is accomplished by switching maps (illustrated in the sketch below)
- No change to algorithm required
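The mechanism can be sketched in plain C++ (illustrative only; the Map and Task types below are stand-ins, not PVL's actual API): a task owns a map object listing the nodes it runs on, and fault recovery swaps in a different map without touching the algorithm.

  #include <cstdio>
  #include <vector>

  // Illustrative stand-in for a PVL-style map: just the list of node IDs
  // a task's data and computation are assigned to.
  struct Map {
      std::vector<int> nodes;
  };

  struct Task {
      Map map;                                  // current mapping
      void remap(const Map& m) { map = m; }     // swap maps; algorithm unchanged
      void run() {
          // The numerical kernel only sees "my portion of the data";
          // which physical nodes hold that portion comes from the map.
          std::printf("running on %zu nodes\n", map.nodes.size());
      }
  };

  int main() {
      Map map0{{0, 1}};            // nominal mapping: nodes 0 and 1
      Map map1{{0, 2}};            // fail-over mapping: nodes 0 and 2

      Task output_task{map0};
      output_task.run();

      // Node 1 fails: switch maps; no change to the algorithm required.
      output_task.remap(map1);
      output_task.run();
      return 0;
  }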
35. Dynamic Mapping Performance Results
[Plot: relative time vs. data size]
- Good dynamic mapping performance is possible
36. Optimal Mapping of Complex Algorithms
[Figure: the same application requires different optimal maps on different hardware: workstation, embedded board, embedded multi-computer, PowerPC cluster, Intel cluster]
- Need to automate the process of mapping an algorithm to hardware
37. Self-optimizing Software for Signal Processing (S3P)
[Plots: latency (seconds) and throughput (frames/sec) vs. number of CPUs (4 to 8) for a small (48x4K) and a large (48x128K) problem size; each point is annotated with the selected stage-to-CPU mapping, e.g., 1-1-1-1 through 1-3-2-2]
- Find
  - Min(latency | # CPUs)
  - Max(throughput | # CPUs)
- S3P selects the correct optimal mapping
- Excellent agreement between S3P-predicted and achieved latencies and throughputs (a toy sketch of the selection step follows)
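A toy sketch of the selection step S3P performs (illustrative only; the real system measures or models each candidate mapping rather than using the made-up numbers below): enumerate candidate stage-to-CPU mappings for a given machine size and pick the one that minimizes latency or maximizes throughput.

  #include <algorithm>
  #include <cstdio>
  #include <string>
  #include <vector>

  // Toy model of one candidate mapping: how many CPUs each of the four
  // pipeline stages gets, plus its latency and throughput.
  struct Mapping {
      std::string cpus_per_stage;   // e.g. "1-2-2-1"
      double latency_s;             // seconds per frame (illustrative numbers)
      double throughput_fps;        // frames per second (illustrative numbers)
  };

  int main() {
      // Hypothetical candidates for a 6-CPU system (numbers are made up).
      std::vector<Mapping> candidates = {
          {"1-1-2-2", 0.20, 4.0},
          {"1-2-2-1", 0.18, 3.5},
          {"2-2-1-1", 0.22, 3.0},
      };

      auto min_latency = *std::min_element(candidates.begin(), candidates.end(),
          [](const Mapping& a, const Mapping& b) { return a.latency_s < b.latency_s; });
      auto max_throughput = *std::max_element(candidates.begin(), candidates.end(),
          [](const Mapping& a, const Mapping& b) { return a.throughput_fps < b.throughput_fps; });

      std::printf("min-latency mapping:    %s\n", min_latency.cpus_per_stage.c_str());
      std::printf("max-throughput mapping: %s\n", max_throughput.cpus_per_stage.c_str());
      return 0;
  }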
38. High Level Languages
High Performance Matlab Applications
- Parallel Matlab need has been identified
  - HPCMO (OSU)
- Required user interface has been demonstrated
  - Matlab*P (MIT/LCS)
  - PVL (MIT/LL)
- Required hardware interface has been demonstrated
  - MatlabMPI (MIT/LL)
[Figure: application domains (DoD sensor processing, DoD mission planning, scientific simulation, commercial applications) sit on a user interface, the Parallel Matlab Toolbox, a hardware interface, and parallel computing hardware]
- Parallel Matlab Toolbox can now be realized
39. MatlabMPI Deployment (speedup)
- Maui
  - Image filtering benchmark (300x on 304 CPUs)
- Lincoln
  - Signal processing (7.8x on 8 CPUs)
  - Radar simulations (7.5x on 8 CPUs)
  - Hyperspectral (2.9x on 3 CPUs)
- MIT
  - LCS Beowulf (11x Gflops on 9 duals)
  - AI Lab face recognition (10x on 8 duals)
- Other
  - Ohio St. EM simulations
  - ARL SAR image enhancement
  - Wash U hearing aid simulations
  - So. Ill. benchmarking
  - JHU digital beamforming
  - ISL radar simulation
  - URI heart modeling
[Plot: image filtering performance (Gigaflops) vs. number of processors on the IBM SP at the Maui Computing Center]
- Rapidly growing MatlabMPI user base demonstrates need for parallel Matlab
- Demonstrated scaling to 300 processors
40. Summary
- HPEC-SI expected benefit
  - Open software libraries, programming models, and standards that provide portability (3x), productivity (3x), and performance (1.5x) benefits to multiple DoD programs
- Invitation to participate
  - DoD program offices with signal/image processing needs
  - Academic and government researchers interested in high performance embedded computing
  - Contact: kepner@ll.mit.edu
41. The Links
- High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
- High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
- Vector, Signal, and Image Processing Library: http://www.vsipl.org/
- MPI Software Technologies, Inc.: http://www.mpi-softtech.com/
- Data Reorganization Initiative: http://www.data-re.org/
- CodeSourcery, LLC: http://www.codesourcery.com/
- MatlabMPI: http://www.ll.mit.edu/MatlabMPI