Title: Languages and Compilers for High Performance Computing
1 Languages and Compilers for High Performance Computing
- Kathy Yelick
- EECS Department
- U.C. Berkeley
2 Research Problems in High Performance
- Organ Simulation: Domain-Specific Tools
- Titanium: Language for Parallel Scientific Computing (with Graham, Hilfinger)
- BeBOP: Architecture-Specific Optimization (with Demmel)
3 Titanium
- Language for grid-based scientific computing
- Based on Java (but compiled)
- Extensions
- Multidimensional arrays with iterators
- Immutable (value) classes
- Templates
- Operator overloading
- Checked Synchronization
- Zone-based memory management
4 Is High Performance Java an Oxymoron?
5 Parallel Dependence Analysis: Cycle Detection
- First, find potential race conditions
- If none, then use traditional sequential analysis
- Analysis of shared/private data can help
- Code defines a program order on accesses
- P is the union of these across processors
- Memory system defines an access order
- A is access order (read/write and write/write pairs)
- Avoid reordering along edges of a cycle
- Intuition: time cannot flow backwards.
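The cycle test on this slide can be sketched as a graph search. This is a minimal, conservative illustration and not Titanium's actual analysis: the function names, the string labels for accesses, and the simplification that any path back through P and A counts as a cycle are all assumptions made for the example. Program-order edges are directed; conflict edges in A are followed in either direction.

```python
from collections import defaultdict

def build_graph(program_order, conflicts):
    """Union of P (directed) and A (conflict edges, usable in either direction)."""
    succ = defaultdict(set)
    for u, v in program_order:
        succ[u].add(v)
    for u, v in conflicts:
        succ[u].add(v)
        succ[v].add(u)
    return succ

def reachable(succ, src, dst):
    """Depth-first search: is there a path from src to dst?"""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def edges_on_cycles(program_order, conflicts):
    """Program-order edges (u, v) that lie on a cycle: v can get back to u."""
    succ = build_graph(program_order, conflicts)
    return {(u, v) for u, v in program_order if reachable(succ, v, u)}

# Hypothetical two-processor example: each writes one variable, then reads the other.
P = [("p0:write x", "p0:read y"), ("p1:write y", "p1:read x")]
A = [("p0:write x", "p1:read x"), ("p1:write y", "p0:read y")]
unsafe = edges_on_cycles(P, A)  # both program-order edges lie on a cycle
```

In this example neither processor may reorder its two accesses. If the conflict on y is removed, no cycle remains and both pairs become candidates for reordering.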
6 Parallel Control Analysis: Synchronization
- Given a program P, determine which segments of P could run in parallel.
- Match barriers (single analysis in Titanium)
- Match synchronized regions
- Both analyses can be used to
- Detect bugs (race conditions)
- Enable optimizations
- Prefetching, split-phase memory, loop transformations, scheduling, etc.
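The idea of matching barriers to find segments that could run in parallel can be illustrated on recorded per-processor traces. This sketch is an assumption-laden simplification: Titanium's single analysis works statically on source code, whereas the hypothetical helpers below (`phases`, `barriers_match`, `may_run_in_parallel`) operate on lists of operation names with a `"barrier"` marker.

```python
BARRIER = "barrier"

def phases(trace):
    """Split one processor's trace into segments separated by barriers."""
    segments, current = [], []
    for op in trace:
        if op == BARRIER:
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)
    return segments

def barriers_match(traces):
    """All processors must execute the same number of barriers."""
    return len({t.count(BARRIER) for t in traces}) <= 1

def may_run_in_parallel(traces):
    """Pairs of operations on different processors in the same barrier phase."""
    if not barriers_match(traces):
        raise ValueError("processors execute different numbers of barriers")
    per_proc = [phases(t) for t in traces]
    pairs = set()
    for p in range(len(per_proc)):
        for q in range(p + 1, len(per_proc)):
            for seg_p, seg_q in zip(per_proc[p], per_proc[q]):
                pairs.update((a, b) for a in seg_p for b in seg_q)
    return pairs

# Hypothetical traces: the barrier separates the writes from the reads.
traces = [
    ["write a", BARRIER, "read b"],
    ["write b", BARRIER, "read a"],
]
concurrent = may_run_in_parallel(traces)
```

Here the write to `a` and the read of `a` fall in different phases, so they are never reported as concurrent; only operations within the same phase are candidates for race detection.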
7 Titanium Research Problems
- Designed for block-structured grids; add support for unstructured grids.
- Optimizations for local memory hierarchies (more on this later)
- Design of low-cost communication layers for read/write
- Add communication optimizations
- See the projects web page: http://titanium.cs.berkeley.edu/tasks.html
8 Performance Tuning
- Motivation: performance of many applications dominated by a few kernels
- Heart simulation → Navier-Stokes
- Sparse matrix-vector multiply (Multigrid)
- Fast Fourier Transforms
- Information retrieval → LSI, LDA
- Sparse matrix-vector multiply
- Image processing → filtering, segmentation
- Sorting/Histograms, Cosine transform, Sparse matrix-vector multiply
- Many other examples
9 Architectural Trends
- A cache miss is O(100) cycles
- Getting worse every year
[Figure: Processor-Memory Performance Gap (grows 50%/year). Performance (log scale, 1 to 1000) versus Year (1980 to 2000). CPU (µProc) improves 60%/yr, following Moore's Law; DRAM improves 7%/yr.]
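The gap figure follows from the two growth rates. A quick check, assuming the chart's 60 and 7 are percent improvements per year for processors and DRAM respectively (the variable names are illustrative):

```python
cpu_rate = 1.60   # processor performance multiplier per year (60%/yr)
dram_rate = 1.07  # DRAM performance multiplier per year (7%/yr)

# The gap is the ratio of the two, so it compounds at cpu_rate / dram_rate.
gap_per_year = cpu_rate / dram_rate

def gap_after(years):
    """Relative processor-memory gap after the given number of years."""
    return gap_per_year ** years

print(f"gap grows about {100 * (gap_per_year - 1):.0f}% per year")
print(f"relative gap after 20 years: about {gap_after(20):.0f}x")
```

The ratio is a little under 1.5, which matches the chart's stated growth of roughly 50% per year.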
10 Conventional Performance Tuning
- Vendor or user hand tunes kernels
- Drawbacks
- Very time consuming and difficult work
- Even with intimate knowledge of architecture and compiler, performance hard to predict
- Must be redone for every architecture, compiler
- Not just a compiler problem
- Best algorithm may depend on input, so some tuning must occur at run-time.
- Multiple algorithms for the same problem may not be provably equivalent by program analysis
11 Automatic Performance Tuning
- Approach: for each kernel
- Identify and generate a space of algorithms
- Search for the fastest one, by running them
- Constrain search space using performance models
- What is a space of algorithms?
- Depends on kernel and input
- May vary
- instruction mix and order
- memory access patterns
- data structures
- mathematical formulation
- Search both off-line and on-the-fly
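The generate, prune, and search loop can be sketched for a blocked matrix multiply. This is a toy illustration under stated assumptions, not PHiPAC or any BeBOP tool: the space of algorithms is just a list of block sizes, the performance model (`fits_model`, with a made-up `cache_words` parameter) only checks that three tiles fit in cache, and the search times each surviving candidate by running it.

```python
import time

def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_c, row_a = c[i], a[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c

def fits_model(block, cache_words):
    """Toy performance model: three block x block tiles should fit in cache."""
    return 3 * block * block <= cache_words

def autotune(n, candidates, cache_words, trials=2):
    """Return the candidate block size with the lowest measured run time."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best_block, best_time = None, float("inf")
    for block in candidates:
        if not fits_model(block, cache_words):
            continue  # pruned by the model, never run
        elapsed = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block

best = autotune(n=32, candidates=[2, 4, 8, 16, 32], cache_words=1024)
```

The winning block size depends on the machine and on the input size, which is why the search is done by running the candidates rather than by prediction alone.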
12 How Much Does Tuning Help?
- Experience from PHiPAC: 10x on matmul
13 Sparse Matrices as Graphs
- Sparse matrix is adjacency matrix for a graph
- Matrix vector multiplication is nearest neighbor computation
- Optimizations
- Register blocking: look for fixed size cliques
- Unroll loops and optimize dense kernels
- Cache blocking: partition graph and layout in memory by partitions
- Multiple vectors: assume each node holds a vector, update them all simultaneously
- Common in some types of solvers
- Exploit symmetry (undirected graph)
- Exploit bounded degree or other special structures
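A small sketch of the baseline kernel and of register blocking may help. It assumes a compressed sparse row (CSR) layout for the baseline and a blocked variant that stores fixed-size r x c dense blocks, padding with explicit zeros; the function names and the dense-input converters are illustrative only and are not taken from the Sparsity system.

```python
def dense_to_csr(a):
    """Convert a dense matrix (list of lists) to CSR arrays."""
    row_ptr, col_idx, vals = [0], [], []
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals

def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A x: each row (graph node) gathers values from its neighbors."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def dense_to_bcsr(a, r, c):
    """Keep every r x c block that holds a nonzero, padded with explicit zeros."""
    n_rows, n_cols = len(a), len(a[0])
    assert n_rows % r == 0 and n_cols % c == 0
    brow_ptr, bcol_idx, blocks = [0], [], []
    for bi in range(n_rows // r):
        for bj in range(n_cols // c):
            block = [a[bi * r + di][bj * c + dj]
                     for di in range(r) for dj in range(c)]
            if any(v != 0 for v in block):
                bcol_idx.append(bj)
                blocks.append(block)
        brow_ptr.append(len(bcol_idx))
    return brow_ptr, bcol_idx, blocks

def bcsr_spmv(brow_ptr, bcol_idx, blocks, x, r, c):
    """Blocked y = A x: the inner r x c loops form a small dense kernel."""
    y = [0.0] * ((len(brow_ptr) - 1) * r)
    for bi in range(len(brow_ptr) - 1):
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            block, x0 = blocks[k], bcol_idx[k] * c
            for di in range(r):
                acc = 0.0
                for dj in range(c):
                    acc += block[di * c + dj] * x[x0 + dj]
                y[bi * r + di] += acc
    return y

a = [[1, 2, 0, 0],
     [3, 4, 0, 0],
     [0, 0, 5, 0],
     [0, 0, 0, 6]]
x = [1, 2, 3, 4]
y_csr = csr_spmv(*dense_to_csr(a), x)
y_bcsr = bcsr_spmv(*dense_to_bcsr(a, 2, 2), x, 2, 2)
```

In a compiled language the fixed-size inner loops would be fully unrolled so the block of `x` and the accumulators stay in registers; the padding zeros are the cost paid for that regularity.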
14 Speedups from Sparsity with 1 Vector
15 Speedups from Sparsity with 9 Vectors
16 BeBOP Research
- BeBOP: Berkeley Benchmarking and Optimization group
- Hand optimizations
- Understood for some problems
- How to build tools
- Work across machines (self-tuning)
- Work on multiple problems (code generation)
17 Application-Specific Tools
- Simulation of the human body
- Imagine a digital body double
- 3D image-based medical record
- Includes diagnostic, pathologic, and other information
- Used for
- Diagnosis
- Less invasive surgery-by-robot
- Experimental treatments
- Where are we today?
18 From Visible Human to Digital Human
- Building 3D models from images
- Source: John Sullivan et al., WPI
- Source: www.madsci.org
19 Heart Simulation Calculation
- Developed by Peskin and McQueen at NYU
- Done on a Cray C90: 1 heart-beat in 100 hours
- Used for evaluating artificial heart valves
- Scalable parallel version done here
- Implemented in Titanium
- Model also used for
- Inner ear
- Blood clotting
- Embryo growth
- Insect flight
- Paper making
20 Digital Human Roadmap
[Figure: roadmap timeline with year marks 1995, 2000, 2005, 2010. Labels: 1 organ, 1 model; 1 organ, multiple models; multiple organs; organ system; 3D model construction; new algorithms; scalable implementations; coupled models; 100x performance.]
21 Summary
- Three related projects
- Titanium
- http://titanium.cs.berkeley.edu
- BeBOP
- http://www.cs.berkeley.edu/richie/bebop
- Organ Simulation
- Research issues
- How to make high performance easy
- Increasingly complex applications
- Increasingly complex machines
22 Simulation of a Heart