Title: Titanium: A High-Level Parallel Language
1 Titanium: A High-Level Parallel Language
- Kathy Yelick
- University of California, Berkeley and
- Lawrence Berkeley National Laboratory
2 Global Address Space Languages
- Explicitly parallel model with SPMD parallelism (sketch below)
- Fixed at program start-up, typically 1 thread per processor
- Global address space model of memory
- Allows programmer to directly represent distributed data structures
- Address space is logically partitioned
- Local vs. remote memory (two-level hierarchy)
- Programmer control over performance-critical decisions
- Data layout and communication
- Performance transparency and tunability are goals
- Initial implementation can use fine-grained shared memory
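As a concrete illustration of the SPMD model, here is a minimal Titanium-style sketch (not from the talk); Ti.thisProc(), Ti.numProcs(), and Ti.barrier() are the standard Titanium calls for thread identity, thread count, and global synchronization.

    class Hello {
      public static void main(String[] args) {
        // every one of the fixed set of SPMD threads executes main()
        System.out.println("hello from thread " + Ti.thisProc()
                           + " of " + Ti.numProcs());
        Ti.barrier();                      // all threads synchronize here
        if (Ti.thisProc() == 0) {
          System.out.println("all threads past the barrier");
        }
      }
    }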
3 Global Address Space
[Figure: the global address space is split into a shared region, partitioned among the processors (holding x0, x1, ..., xP), and a private region per processor; private pointers may refer into any processor's shared partition]
- Global address space abstraction
- Shared memory is partitioned by processors
- Remote memory may stay remote: no automatic caching implied
- One-sided communication through reads/writes of shared variables (sketch below)
- Less restricted than the MPI-2 one-sided model
- Both individual and bulk memory copies
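A minimal sketch (illustrative, not from the talk) of one-sided access through the global address space. The Counter class is hypothetical; "broadcast E from p" is Titanium's built-in expression for handing every thread the value computed on thread p.

    class Counter { public int val; }

    class OneSided {
      public static void main(String[] args) {
        Counter mine = new Counter();       // allocated in local memory
        // every thread receives a reference to thread 0's object
        Counter c0 = broadcast mine from 0;
        if (Ti.thisProc() == 1) {
          c0.val = 42;                      // one-sided write to remote memory
        }
        Ti.barrier();
        if (Ti.thisProc() == 0) {
          System.out.println("c0.val = " + c0.val);
        }
      }
    }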
4 Three Related Languages
- Unified Parallel C (UPC)
- Follow-on to the Split-C, AC, and PCP research languages
- UPC supported on Cray and HP machines
- Open source compilers from Intrepid and LBNL
- Co-Array Fortran (CAF)
- Based on Fortran 90
- Supported on Cray machines
- Open source compiler under development at Rice
- Titanium
- Based on Java
- Open source compiler from U.C. Berkeley
5 Titanium
- Based on Java, a cleaner C++
- classes, automatic memory management, etc.
- compiled to C and then native binary (no JVM)
- Same parallelism model as UPC and CAF
- SPMD with a global address space
- Dynamic Java threads are not supported
- Optimizing compiler
- static (compile-time) optimizer, not a JIT
- communication and memory optimizations
- synchronization analysis (e.g. static barrier analysis)
- cache and other uniprocessor optimizations
6 Summary of Features Added to Java
- Scalable parallelism (Java threads replaced)
- Immutable (value) classes
- Multidimensional arrays with unordered iteration
- Checked Synchronization
- Operator overloading
- Templates
- Zone-based memory management (regions)
- Libraries for collective communication, distributed arrays, and bulk I/O
7 Immutable Classes in Titanium
- For small objects, would sometimes prefer
- to avoid a level of indirection
- pass by value (copy entire object)
- especially when immutable -- fields never modified
- Example

    immutable class Complex {
      Complex() { real = 0; imag = 0; }
      Complex operator+(Complex c) { ... }
    }
    ...
    Complex c1 = new Complex(7.1, 4.3);
    c1 = c1 + c1;

- Addresses performance and programmability
- Similar to structs in C (not C++ classes) in terms of performance
- Adds support for complex types
8 Multidimensional Arrays in Titanium
- Index set is a domain with a rich set of operators
- Unordered iteration over domains helps optimization
    aInterior = a.restrict(2);
    foreach (p in aInterior.domain())
        ... aInterior[p] ...
    b.copy(aInterior);

[Figure: two overlapping 2-d grids; a has corners (0,0) and (n,n), b extends to (2n,2n); b.copy(aInterior) fills b from a's interior where the two grids overlap]
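To make the fragment above self-contained, here is a small sketch (names and sizes are illustrative) using Titanium's multidimensional arrays, rectangular domains, and foreach; copy() transfers data only over the intersection of the two arrays' domains, and restrict(2) is used exactly as in the fragment.

    class ArrayDemo {
      public static void main(String[] args) {
        int n = 8;
        double [2d] a = new double [[0,0] : [n,n]];          // grid a
        double [2d] b = new double [[n/2,n/2] : [2*n,2*n]];  // overlaps a
        foreach (p in a.domain()) {
          a[p] = 1.0;              // unordered iteration over the points of a
        }
        double [2d] aInterior = a.restrict(2);   // interior of a, as above
        b.copy(aInterior);         // copies only where the domains overlap
      }
    }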
9 Titanium Compiler Status
- Titanium compiler runs on almost any machine
- Requires a C compiler (and decent C++ to compile the translator)
- Pthreads for shared memory
- Communication layer for distributed memory (or hybrid)
- Recently moved onto GASNet, shared with UPC
- Obtained GM, Elan, and improved LAPI implementations
- Recent language extensions
- Indexed array copy (scatter/gather style)
- Non-blocking array copy under development
- Compiler optimizations
- Cache optimizations, for-loop optimizations
- Communication optimizations for overlap, pipelining, and scatter/gather under development
10 Serial Performance (Pure Java)
- Several optimizations in the Titanium compiler (tc) over the past year
- These codes are all written in pure Java without performance extensions
11 Communication Optimizations
- Possible communication optimizations
- Communication overlap, aggregation, caching
- Effectiveness varies by machine
12 Parallel Performance and Scalability
- Poisson solver using the Method of Local Corrections [Balls, Colella]
- Communication is low (flat)
- Results shown for the IBM SP and Cray T3E
13 NAS MG in Titanium
- Preliminary performance for the MG code on the IBM SP
- Speedups are nearly identical
- About 25% difference in serial performance
14 Applications in Titanium
- Several benchmarks
- Fluid solvers with Adaptive Mesh Refinement (AMR)
- Conjugate Gradient
- 3D Multigrid
- Unstructured mesh kernel (EM3D)
- Dense linear algebra: LU, MatMul
- Tree-structured n-body code
- Finite element benchmark
- Genetics micro-array selection
- SciMark serial benchmarks
- Larger applications
- Heart simulation
- Ocean modeling with AMR (in progress)
15 AMR for Ocean Modeling
- Ocean modeling [Wen, Colella]
- Requires embedded boundaries to model floor/coastline
- Line vs. point relaxation for the aspect ratio (1000 km x 10 km)
- Results in irregular data structures and array accesses
- Currently developing
- Basin-scale AMR circulation model
- Initially non-adaptive
- Compiler and language support design
- Graphics from Titanium AMR Gas Dynamics [McCorquodale, Colella]
16 Simulating Fluid Flow in Biological Systems
- Immersed Boundary Method [Peskin/MacQueen]
- Material (e.g., heart muscles, cochlea structure) modeled by a grid of material points
- Fluid space modeled by a regular lattice
- Irregular material points need to interact with the regular fluid lattice
- Trade-off between load balancing of fibers and minimizing communication
- Memory and communication intensive
- Random array access is the key performance problem (illustrated below)
- Developed compiler optimizations to improve its performance
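A purely illustrative sketch (assumed names, not the heart code itself) of why the fiber-fluid interaction produces random array accesses: each material point updates the fluid cell nearest to it, so the regular lattice is indexed by data-dependent points.

    // fiberCell[k] is the lattice cell nearest to material point k,
    // fiberForce[k] is the force it carries (both assumed precomputed)
    static void spreadForces(Point<3> [1d] fiberCell, double [1d] fiberForce,
                             double [3d] fluidForce) {
      foreach (k in fiberCell.domain()) {
        // the index into fluidForce is data-dependent: a random access
        fluidForce[fiberCell[k]] += fiberForce[k];
      }
    }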
17 Titanium Group
- Susan Graham
- Katherine Yelick
- Paul Hilfinger
- Phillip Colella (LBNL)
- Alex Aiken
- Dan Bonachea
- Kaushik Datta
- Ed Givelberg
- Sabrina Merchant
- Szu-Huey Chuang
- Carol Ho
- Jimmy Su
- Greg Balls (SDSC)
- Peter McCorquodale (LBNL)
- Andrew Begel
- Tyson Condie
- Carrie Fei
- David Gay
- Ben Liblit
- Chang Sun Lin
- Geoff Pike
- Ellen Tsai
- Mike Welcome (LBNL)
- Siu Man Yau
18 The End
- http://upc.nersc.gov
- http://titanium.cs.berkeley.edu/
19 Target Problems
- Many modeling problems in astrophysics, biology, material science, and other areas require
- Enormous range of spatial and temporal scales
- Requires
- Adaptive methods
- Large scale parallel machines
- Titanium supports
- Structured grids
- Locally-structured grids (AMR)
- Unstructured grids (in progress)
20 Java Compiled by Titanium Compiler
21 Java Compiled by Titanium Compiler
22 Parallel Applications
- Genome application
- Heart simulation
- AMR elliptic and hyperbolic solvers
- Scalable Poisson for infinite domains
- Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join
23 MOOSE Application
- Problem: microarray construction
- Used for genome experiments
- Possible medical applications long-term
- Microarray Optimal Oligo Selection Engine (MOOSE)
- A parallel engine for selecting the best oligonucleotide sequences for genetic microarray testing
- Uses dynamic load balancing within Titanium
24 Heart Simulation
- Problem: compute blood flow in the heart
- Modeled as an elastic structure in an incompressible fluid
- The immersed boundary method [Peskin and McQueen]
- 20 years of development in the model
- Many other applications: blood clotting, inner ear, paper making, embryo growth, and more
- Can be used for design of prosthetics
- Artificial heart valves
- Cochlear implants
25 Scalable Poisson Solver
- MLC for Finite-Differences by Balls and Colella
- Poisson equation with infinite boundaries
- Arises in astrophysics, some biological systems, etc.
- Method is scalable
- Low communication
- Performance on
- SP2 (shown) and T3E
- scaled speedups
- nearly ideal (flat)
- Currently 2D and non-adaptive
26 Error on High-Wavenumber Problem
- Charge is
- 1 charge of concentric waves
- 2 star-shaped charges.
- Largest error is where the charge is changing rapidly
- Note
- Discretization error
- Faint decomposition error
- Run on 16 procs
27 AMR Gas Dynamics
- Developed by McCorquodale and Colella
- 2D example (3D supported)
- Mach-10 shock on solid surface at oblique angle
- Future: self-gravitating gas dynamics package
28 Unstructured Mesh Kernel
- EM3D: relaxation on a 3D unstructured mesh (sketch below)
- Speedup on an UltraSPARC SMP
- Simple kernel; mesh not partitioned
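A minimal sketch of an EM3D-style relaxation step (the Node class here is assumed, not the benchmark's actual code): each node recomputes its value as a weighted sum of its neighbors' values, and on a distributed mesh some of those neighbor reads are remote.

    class Node {
      public double value;
      public Node[]   neighbors;  // mesh edges; may reference remote nodes
      public double[] weights;    // one coefficient per edge

      public void relax() {
        double sum = 0.0;
        for (int i = 0; i < neighbors.length; i++) {
          sum += weights[i] * neighbors[i].value;  // possibly remote reads
        }
        value = sum;
      }
    }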
29 Recent Developments
- Interfaces to libraries
- KeLP and (older) PETSc and Metis
- New IBM SP implementation
- Uses LAPI rather than MPI, about 2x performance gain
- New release: IBM, SGI, Cray, Linux cluster, Threads
- Uniprocessor optimizations
- Method inlining, both automated and manual
- Cache optimizations
- Shared pointer analysis
- Support for unstructured computation
- General sub-array copy now with arbitrary points
30 Future Plans
- Merge communication layer with UPC
- Unified Parallel C has broad vendor support.
- Uses the same execution model as Titanium
- Automated communication overlap
- Analysis and refinement of cache optimizations
- Additional support for unstructured grids
- Conjugate gradient and particle methods are motivations
- Better uniprocessor optimizations, possibly new arrays