Languages and Compilers for High Performance Computing


Kathy Yelick, EECS Department, U.C. Berkeley



1
Languages and Compilers for High Performance
Computing
  • Kathy Yelick
  • EECS Department
  • U.C. Berkeley

2
Research Problems in High Performance
  • Organ Simulation: Domain-Specific Tools
  • Titanium: Language for Parallel Scientific Computing (with Graham,
    Hilfinger)
  • BeBOP: Architecture-Specific Optimization (with Demmel)
3
Titanium
  • Language for grid-based scientific computing
  • Based on Java (but compiled)
  • Extensions:
    • Multidimensional arrays with iterators
    • Immutable (value) classes
    • Templates
    • Operator overloading
    • Checked synchronization
    • Zone-based memory management

4
Is High Performance Java an Oxymoron?
5
Parallel Dependence Analysis: Cycle Detection
  • First, find potential race conditions
    • If there are none, use traditional sequential analysis
    • Analysis of shared/private data can help
  • Code defines a program order on accesses
    • P is the union of these orders across processors
  • Memory system defines an access order
    • A is the access order (read/write and write/write pairs)
  • Avoid reordering along edges of a cycle in P ∪ A
  • Intuition: time cannot flow backwards.
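
The cycle-detection idea on this slide can be sketched in a few lines. This is a simplified, conservative illustration and not the analysis implemented in the Titanium compiler: the function name `delay_edges`, the access labels, and the two-processor example are all assumptions made for the sketch. It marks a program-order edge as "must not be reordered" whenever that edge lies on any cycle of P ∪ A. A production analysis would likely restrict attention to a smaller class of cycles.

```python
from collections import defaultdict

def delay_edges(program_order, conflicts):
    """Return the program-order edges that lie on a cycle of P union A.

    program_order: list of (u, v) pairs; u precedes v on one processor.
    conflicts: list of (u, v) pairs of accesses to the same location from
               different processors, at least one of them a write. The
               memory system may order them either way, so both
               directions are added to the graph.
    """
    graph = defaultdict(set)
    for u, v in program_order:
        graph[u].add(v)
    for u, v in conflicts:
        graph[u].add(v)
        graph[v].add(u)

    def reaches(src, dst):
        # Iterative depth-first search from src, looking for dst.
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    # The edge (u, v) is on a cycle exactly when v can get back to u.
    return [(u, v) for u, v in program_order if reaches(v, u)]

# Hypothetical two-processor example:
#   processor 1: write x; read y      processor 2: write y; read x
P = [("w1_x", "r1_y"), ("w2_y", "r2_x")]
A = [("w1_x", "r2_x"), ("w2_y", "r1_y")]
print(delay_edges(P, A))
```

In this example both program-order edges sit on the cycle w1_x, r1_y, w2_y, r2_x, so neither processor's two accesses may be swapped. If processor 2 only read x, no cycle would exist and the usual sequential reordering rules would apply.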

6
Parallel Control Analysis: Synchronization
  • Given a program P, determine which segments of P
    could run in parallel.
    • Match barriers (single analysis in Titanium)
    • Match synchronized regions
  • Both analyses can be used to:
    • Detect bugs (race conditions)
    • Enable optimizations: prefetching, split-phase memory, loop
      transformations, scheduling, etc.

7
Titanium Research Problems
  • Designed for block-structured grids; add support
    for unstructured grids.
  • Optimizations for local memory hierarchies (more
    on this later)
  • Design of low-cost communication layers for
    read/write
  • Add communication optimizations
  • See the projects web page:
    http://titanium.cs.berkeley.edu/tasks.html

8
Performance Tuning
  • Motivation: performance of many applications is
    dominated by a few kernels
  • Heart simulation → Navier-Stokes
    • Sparse matrix-vector multiply (Multigrid)
    • Fast Fourier Transforms
  • Information retrieval → LSI, LDA
    • Sparse matrix-vector multiply
  • Image processing → filtering, segmentation
    • Sorting/Histograms, Cosine transform, Sparse
      matrix-vector multiply
  • Many other examples

9
Architectural Trends
  • A cache miss is O(100) cycles
  • Getting worse every year

[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) versus year, 1980 to 2000. CPU (Moore's Law): µProc 60%/yr. DRAM: 7%/yr. The gap grows 50%/year.]
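
The growth rates quoted in the figure on this slide compound as in the sketch below. It assumes the figure's numbers are percentages per year (60 for the processor, 7 for DRAM). The function names are hypothetical and exist only for this illustration.

```python
def annual_gap_growth(cpu_rate=0.60, dram_rate=0.07):
    """Yearly growth factor of the CPU/DRAM performance ratio."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate)

def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    """Relative processor-memory gap after `years`, starting from 1.0."""
    return annual_gap_growth(cpu_rate, dram_rate) ** years

g = annual_gap_growth()
print(f"gap grows about {100 * (g - 1):.0f}% per year")  # about 50%
```

Dividing the two growth factors (1.60 / 1.07) gives roughly 1.495, which is where the "grows 50% per year" label comes from.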
10
Conventional Performance Tuning
  • Vendor or user hand-tunes kernels
  • Drawbacks:
    • Very time-consuming and difficult work
    • Even with intimate knowledge of architecture and
      compiler, performance is hard to predict
    • Must be redone for every architecture and compiler
  • Not just a compiler problem:
    • Best algorithm may depend on input, so some
      tuning must occur at run-time.
    • Multiple algorithms for the same problem may not
      be provably equivalent by program analysis

11
Automatic Performance Tuning
  • Approach: for each kernel
    • Identify and generate a space of algorithms
    • Search for the fastest one, by running them
    • Constrain search space using performance models
  • What is a space of algorithms?
    • Depends on kernel and input
    • May vary:
      • instruction mix and order
      • memory access patterns
      • data structures
      • mathematical formulation
  • Search both off-line and on-the-fly
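
The "generate a space, then search by running" loop above can be sketched as follows. This is a toy illustration under stated assumptions, not the PHiPAC or BeBOP implementation: the space of algorithms is reduced to one parameter (the tile size of a blocked matrix multiply), and the names `blocked_matmul` and `autotune`, the matrix size, and the candidate list are all hypothetical.

```python
import time

def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c

def autotune(n=48, candidates=(4, 8, 16, 48), repeats=3):
    """Time every candidate tile size; return (best_block, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):  # keep the fastest of several runs
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    return min(timings, key=timings.get), timings

best_block, timings = autotune()
print("fastest block size on this machine:", best_block)
```

In interpreted Python the tile size has little effect on speed, so the winner mostly reflects loop overhead. The point is the structure of the search: every variant computes the same result, and the choice is made by measurement on the target machine, not by prediction. A performance model would be used to prune `candidates` before timing.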

12
How Much Does Tuning Help?
  • Experience from PHiPAC: 10x on matmul

13
Sparse Matrices as Graphs
  • A sparse matrix is the adjacency matrix of a graph
  • Matrix-vector multiplication is a nearest-neighbor
    computation
  • Optimizations:
    • Register blocking: look for fixed-size cliques
      • Unroll loops and optimize dense kernels
    • Cache blocking: partition the graph and lay it out in
      memory by partition
    • Multiple vectors: assume each node holds a
      vector and update them all simultaneously
      • Common in some types of solvers
    • Exploit symmetry (undirected graph)
    • Exploit bounded degree or other special structures
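
Register blocking, as described above, can be illustrated with a block compressed sparse row (BCSR) layout. The sketch below is a minimal example under assumptions: it is not the Sparsity implementation, the helper names `to_bcsr` and `bcsr_matvec_2x2` are hypothetical, the block size is fixed at 2 x 2, and the matrix dimensions are assumed to be multiples of the block size.

```python
def to_bcsr(entries, n_rows, r, c):
    """Group nonzeros {(i, j): value} into dense r x c blocks.

    Returns (row_ptr, block_cols, blocks) in block-CSR form. Blocks store
    explicit zeros where the original matrix had none (fill-in).
    Assumes n_rows is a multiple of r.
    """
    block_rows = [dict() for _ in range(n_rows // r)]
    for (i, j), value in entries.items():
        block = block_rows[i // r].setdefault(
            j // c, [[0.0] * c for _ in range(r)])
        block[i % r][j % c] = value
    row_ptr, block_cols, blocks = [0], [], []
    for by_col in block_rows:
        for bj in sorted(by_col):
            block_cols.append(bj)
            blocks.append(by_col[bj])
        row_ptr.append(len(blocks))
    return row_ptr, block_cols, blocks

def bcsr_matvec_2x2(row_ptr, block_cols, blocks, x):
    """y = A * x for 2 x 2 blocks, with the block product fully unrolled."""
    y = [0.0] * (2 * (len(row_ptr) - 1))
    for bi in range(len(row_ptr) - 1):
        y0 = y1 = 0.0  # accumulators a compiler could keep in registers
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            (a00, a01), (a10, a11) = blocks[k]
            x0, x1 = x[2 * block_cols[k]], x[2 * block_cols[k] + 1]
            y0 += a00 * x0 + a01 * x1
            y1 += a10 * x0 + a11 * x1
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y

# Hypothetical 4 x 4 matrix with two dense 2 x 2 cliques plus one stray entry.
entries = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0,
           (2, 2): 2.0, (2, 3): 1.0, (3, 2): 1.0, (3, 3): 5.0,
           (0, 3): 2.0}
row_ptr, block_cols, blocks = to_bcsr(entries, 4, 2, 2)
y = bcsr_matvec_2x2(row_ptr, block_cols, blocks, [1.0, 2.0, 3.0, 4.0])
print(y)  # [14.0, 7.0, 10.0, 23.0]
```

The stray entry at (0, 3) forces a whole 2 x 2 block with three explicit zeros. That fill-in is the cost of register blocking, which is why the best block size depends on the matrix and is a natural target for the search-based tuning described earlier.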

14
Speedups from Sparsity with 1 Vector
15
Speedups from Sparsity with 9 Vectors
16
BeBOP Research
  • BeBOP: Berkeley Benchmarking and Optimization
    group
  • Hand optimizations
    • Understood for some problems
  • How to build tools that:
    • Work across machines (self-tuning)
    • Work on multiple problems (code generation)

17
Application-Specific Tools
  • Simulation of the human body
  • Imagine a digital body double
  • 3D image-based medical record
  • Includes diagnostic, pathologic, and other
    information
  • Used for:
    • Diagnosis
    • Less invasive surgery-by-robot
    • Experimental treatments
  • Where are we today?

18
From Visible Human to Digital Human
  • Building 3D models from images
  • Source: John Sullivan et al., WPI
  • Source: www.madsci.org
19
Heart Simulation Calculation
  • Developed by Peskin and McQueen at NYU
    • Done on a Cray C90: 1 heartbeat in 100 hours
    • Used for evaluating artificial heart valves
  • Scalable parallel version done here
    • Implemented in Titanium
  • Model also used for:
    • Inner ear
    • Blood clotting
    • Embryo growth
    • Insect flight
    • Paper making

20
Digital Human Roadmap
[Figure: roadmap timeline, 1995 to 2010. Milestones: 1 organ, 1 model; 1 organ, multiple models; multiple organs; organ system. Enabling work: 3D model construction, new algorithms, scalable implementations, coupled models, 100x performance.]
21
Summary
  • Three related projects
    • Titanium
      • http://titanium.cs.berkeley.edu
    • BeBOP
      • http://www.cs.berkeley.edu/richie/bebop
    • Organ Simulation
  • Research issues
    • How to make high performance easy
    • Increasingly complex applications
    • Increasingly complex machines

22
Simulation of a Heart