Languages and Compilers for High Performance Computing - PowerPoint PPT Presentation

About This Presentation

Title:

Languages and Compilers for High Performance Computing

Description:

Languages and Compilers for High Performance Computing. Kathy Yelick, EECS Department, U.C. Berkeley

Slides: 23

Provided by: Compute64

Learn more at: https://people.eecs.berkeley.edu

Transcript and Presenter's Notes

Title: Languages and Compilers for High Performance Computing

1
Languages and Compilers for High Performance
Computing

Kathy Yelick
EECS Department
U.C. Berkeley

2
Research Problems in High Performance
Organ Simulation: Domain-Specific Tools
Titanium: Language for Parallel Scientific
Computing (with Graham, Hilfinger)
BeBOP: Architecture-Specific Optimization (with
Demmel)
3
Titanium

Language for grid-based scientific computing
Based on Java (but compiled)
Extensions
Multidimensional arrays with iterators
Immutable (value) classes
Templates
Operator overloading
Checked Synchronization
Zone-based memory management

4
Is High Performance Java an Oxymoron?
5
Parallel Dependence Analysis: Cycle Detection

First, find potential race conditions
If none, then use traditional sequential analysis
Analysis of shared/private data can help
Code defines a program order on accesses
P is the union of these program orders across
processors
Memory system defines an access order
A is the access order (read/write and
write/write pairs)
Avoid reordering along edges of a cycle in P ∪ A
Intuition: time cannot flow backwards.
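
The cycle check above can be sketched in a few lines. This is a minimal Python illustration, not the Titanium compiler's analysis: the access names, the `program_order` and `conflicts` inputs, and the function `delay_edges` are all hypothetical. It treats P as directed edges and A as undirected conflict edges, and conservatively reports every program-order edge that lies on some cycle of P ∪ A, i.e. every pair of accesses that should not be reordered.

```python
from collections import defaultdict

def delay_edges(program_order, conflicts):
    """Return the program-order edges that lie on a cycle of P ∪ A.

    program_order: list of (earlier, later) access pairs on one processor (P).
    conflicts: list of (access, access) pairs on different processors that
               touch the same location, at least one being a write (A).
    """
    graph = defaultdict(set)
    for u, v in program_order:   # P edges are directed
        graph[u].add(v)
    for u, v in conflicts:       # A edges: either access may come first
        graph[u].add(v)
        graph[v].add(u)

    def reaches(src, dst):
        """Depth-first search: is there a path from src to dst?"""
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    # Edge u -> v closes a cycle exactly when v can get back to u.
    return [(u, v) for u, v in program_order if reaches(v, u)]

# Hypothetical flag/data handoff: processor 1 writes data then flag,
# processor 2 reads flag then data.
po = [("w_data", "w_flag"), ("r_flag", "r_data")]
cf = [("w_data", "r_data"), ("w_flag", "r_flag")]
must_keep_order = delay_edges(po, cf)
```

In the handoff example both program-order edges sit on the cycle, so neither pair may be reordered. If the reader never touched `data`, the cycle would disappear and the writer's two stores could be freely reordered or overlapped.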

6
Parallel Control Analysis: Synchronization

Given a program P, determine which segments of P
could run in parallel.
Match barriers (single analysis in Titanium)
Match synchronized regions
Both analyses can be used to
Detect bugs (race conditions)
Enable optimizations
Prefetching, split-phase memory, loop
transformations, scheduling, ...

7
Titanium Research Problems

Designed for block-structured grids; add support
for unstructured grids.
Optimizations for local memory hierarchies (more
on this later)
Design of low-cost communication layers for
read/write
Add communication optimizations
See the projects web page:
http://titanium.cs.berkeley.edu/tasks.html

8
Performance Tuning

Motivation: performance of many applications is
dominated by a few kernels
Heart simulation → Navier-Stokes
Sparse matrix-vector multiply (Multigrid)
Fast Fourier Transforms
Information retrieval → LSI, LDA
Sparse matrix-vector multiply
Image processing → filtering, segmentation
Sorting/Histograms, Cosine transform, Sparse
matrix-vector multiply
Many other examples

9
Architectural Trends

A cache miss is O(100) cycles
Getting worse every year

[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) vs. Year (1980 to 2000). CPU (µProc, "Moore's Law") improves 60%/yr.; DRAM improves 7%/yr.; the gap grows 50%/year.]
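
The gap growth rate follows from the two improvement rates, reading the figure's "60/yr." and "7/yr." as 60% and 7% per year (an assumption, since the percent signs did not survive extraction). The relative gap after one year is the ratio of the two growth factors:

```latex
\frac{1 + 0.60}{1 + 0.07} = \frac{1.60}{1.07} \approx 1.495
```

That is, the gap widens by roughly 50% per year, which matches the figure's label.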
10
Conventional Performance Tuning

Vendor or user hand tunes kernels
Drawbacks
Very time consuming and difficult work
Even with intimate knowledge of architecture and
compiler, performance hard to predict
Must be redone for every architecture, compiler
Not just a compiler problem
Best algorithm may depend on input, so some
tuning must occur at run-time.
Multiple algorithms for the same problem may not
be provably equivalent by program analysis

11
Automatic Performance Tuning

Approach: for each kernel
Identify and generate a space of algorithms
Search for the fastest one, by running them
Constrain search space using performance models
What is a space of algorithms?
Depends on kernel and input
May vary
instruction mix and order
memory access patterns
data structures
mathematical formulation
Search both off-line and on-the-fly
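
The generate-and-search loop above can be illustrated with a toy example. This is a minimal Python sketch under stated assumptions, not PHiPAC or any BeBOP tool: the kernel `matmul_blocked`, the driver `tune`, the helper `make_args`, and the candidate block sizes are all hypothetical, and the "space of algorithms" is reduced to a single parameter (the tile size of a blocked matrix multiply).

```python
import time

def matmul_blocked(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using block x block tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c

def tune(kernel, candidates, make_args, repeats=3):
    """Run every candidate, keep its best time, return (fastest, timings)."""
    timings = {}
    for param in candidates:
        best = float("inf")
        for _ in range(repeats):
            args = make_args()
            start = time.perf_counter()
            kernel(*args, param)
            best = min(best, time.perf_counter() - start)
        timings[param] = best
    return min(timings, key=timings.get), timings

n = 32  # small problem size so the search finishes quickly

def make_args():
    """Build fresh inputs for each timed run."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    return a, b, n

# Off-line search over a hand-picked candidate set.
best_block, timings = tune(matmul_blocked, [4, 8, 16, 32], make_args)
```

A performance model would enter as a filter on the candidate list before timing, for example by dropping tile sizes whose working set cannot fit in cache. The winner depends on the machine, which is the point of searching by running the code.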

12
How Much Does Tuning Help?

Experience from PHiPAC: 10x on matmul

13
Sparse Matrices as Graphs

A sparse matrix is the adjacency matrix of a graph
Matrix-vector multiplication is a nearest-neighbor
computation
Optimizations
Register blocking: look for fixed-size cliques
Unroll loops and optimize dense kernels
Cache blocking: partition the graph and lay it out in
memory by partitions
Multiple vectors: assume each node holds a
vector, update them all simultaneously
Common in some types of solvers
Exploit symmetry (undirected graph)
Exploit bounded degree or other special structures
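
To make the graph view and register blocking concrete, here is a minimal Python sketch. It is an illustration only, not code generated by Sparsity: the function names, the storage layouts (standard CSR and a fixed 2x2 block variant), and the example matrix are assumptions chosen for brevity.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A*x with A in compressed sparse row (CSR) form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):                        # each node i ...
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[p] * x[col_idx[p]]        # ... sums over its neighbors
    return y

def spmv_bcsr_2x2(brow_ptr, bcol_idx, blocks, x):
    """y = A*x with A stored as dense 2x2 blocks (register blocking)."""
    y = [0.0] * (2 * (len(brow_ptr) - 1))
    for bi in range(len(brow_ptr) - 1):
        y0 = y1 = 0.0                              # accumulators a compiler can keep in registers
        for p in range(brow_ptr[bi], brow_ptr[bi + 1]):
            a00, a01, a10, a11 = blocks[p]
            x0, x1 = x[2 * bcol_idx[p]], x[2 * bcol_idx[p] + 1]
            y0 += a00 * x0 + a01 * x1              # fully unrolled 2x2 dense kernel
            y1 += a10 * x0 + a11 * x1
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y

# Example 4x4 matrix with two dense 2x2 cliques on the diagonal:
#   [1 2 0 0]
#   [3 4 0 0]
#   [0 0 5 6]
#   [0 0 7 8]
row_ptr = [0, 2, 4, 6, 8]
col_idx = [0, 1, 0, 1, 2, 3, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

brow_ptr = [0, 1, 2]
bcol_idx = [0, 1]
blocks = [(1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0)]

x = [1.0, 2.0, 3.0, 4.0]
y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_bcsr = spmv_bcsr_2x2(brow_ptr, bcol_idx, blocks, x)
```

The blocked version stores one column index per 2x2 block instead of one per nonzero and reuses `x0` and `x1` across two rows. When the matrix lacks such cliques, blocks must be padded with explicit zeros, which is why the best block size depends on the matrix and is worth searching for.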

14
Speedups from Sparsity with 1 Vector
15
Speedups from Sparsity with 9 Vectors
16
BeBOP Research

BeBOP: Berkeley Benchmarking and Optimization
group
Hand optimizations
Understood for some problems
How to build tools
Work across machines (self-tuning)
Work on multiple problems (code generation)

17
Application-Specific Tools

Simulation of the human body
Imagine a digital body double
3D image-based medical record
Includes diagnostic, pathologic, and other
information
Used for
Diagnosis
Less invasive surgery-by-robot
Experimental treatments
Where are we today?

18
From Visible Human to Digital Human
Building 3D Models from images
Source: John Sullivan et al., WPI
Source: www.madsci.org
19
Heart Simulation Calculation

Developed by Peskin and McQueen at NYU
Done on a Cray C90: 1 heartbeat in 100 hours
Used for evaluating artificial heart valves
Scalable parallel version done here
Implemented in Titanium

Model also used for
Inner ear
Blood clotting
Embryo growth
Insect flight
Paper making

20
Digital Human Roadmap
[Figure: roadmap timeline with year axis 1995, 2000, 2005, 2010. Figure labels: 1 organ, 1 model; 1 organ, multiple models; multiple organs; organ system; 3D model construction; new algorithms; scalable implementations; coupled models; 100x performance.]
21
Summary