Title: Computer Science Overview
1. Computer Science Overview
2. Computer Science Projects: Posters
- Rocketeer
  - Home-grown visualizer
  - John, Fiedler
- Rocpanda
  - Parallel I/O
  - Winslett et al.
- Novel Linear System Solvers
  - de Sturler, Heath, Saylor
- Performance monitoring
  - Campbell, Zheng, Lee
- Parallel Mesh support
  - FEM Framework
  - Parallel remeshing
  - Parallel solution transfer
  - Adaptive mesh refinement
3. Computer Science Projects: Talks
- Kale
  - Processor virtualization via migratable objects
- Jiao
  - Integration Framework
  - Surface propagation
  - Mesh adaptation
4. Migratable Objects and Charm++
- Charm++
  - Parallel C++
  - Arrays of objects
  - Automatic load balancing
  - Prioritization
  - Mature system
  - Available on all parallel machines we know
- Rocket Center Collaborations
  - It was clear that Charm++ would not be adopted by the whole application community
  - It was equally clear to us that it was a unique technology that would improve programmer productivity substantially
  - Led to the development of AMPI
    - Adaptive MPI
5. Processor Virtualization
- Programmer: over-decomposition into virtual processors (VPs)
- Runtime: assigns VPs to processors
- Enables adaptive runtime strategies
- Implementations: Charm++, AMPI
Benefits
- Software engineering
  - Number of virtual processors can be independently controlled
  - Separate VPs for different modules
- Message driven execution
  - Adaptive overlap of communication
  - Predictability
  - Automatic out-of-core
  - Asynchronous reductions
- Dynamic mapping
  - Heterogeneous clusters
  - Vacate, adjust to speed, share
  - Automatic checkpointing
  - Change set of processors used
  - Automatic dynamic load balancing
  - Communication optimization
  - Collectives
Figure: MPI processes run as virtual processors (user-level migratable threads)
6. Highly Agile Dynamic Load Balancing
- Needed, for example, for handling the advent of plasticity around a crack
- Here, a simple example
  - Plasticity in a bar
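The rebalancing idea behind this slide can be sketched in a few lines. When some virtual processors suddenly become more expensive (for instance, elements that turn plastic), the runtime can reassign VPs to processors using measured loads. The sketch below assumes a simple greedy heuristic (heaviest VP first, onto the currently least-loaded processor). The names `greedy_assign` and `processor_loads` and all load values are hypothetical, and this is not claimed to be the strategy the Charm++ runtime actually applies.

```python
import heapq


def greedy_assign(vp_loads, num_procs):
    """Map each virtual processor (VP) to a physical processor.

    vp_loads: dict of VP id -> measured load (hypothetical units).
    Heaviest VPs are placed first, each on the least-loaded processor so far.
    """
    heap = [(0.0, proc) for proc in range(num_procs)]  # (total load, processor)
    heapq.heapify(heap)
    assignment = {}
    for vp, load in sorted(vp_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, proc = heapq.heappop(heap)  # least-loaded processor
        assignment[vp] = proc
        heapq.heappush(heap, (total + load, proc))
    return assignment


def processor_loads(assignment, vp_loads, num_procs):
    """Sum the VP loads that ended up on each processor."""
    totals = [0.0] * num_procs
    for vp, proc in assignment.items():
        totals[proc] += vp_loads[vp]
    return totals
```

Calling `greedy_assign` again with updated loads after the plastic zone appears yields a new mapping; in a real system the cost of migrating objects would also have to be weighed, which this sketch ignores.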
7. Optimizing All-to-All via Mesh
Organize processors in a 2D (virtual) grid
Phase 1: Each processor sends messages within its row
Phase 2: Each processor sends messages within its column
Message from (x1,y1) to (x2,y2) goes via (x1,y2)
- 2(√P − 1) messages per processor on a √P × √P grid, instead of P − 1
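The two-phase routing on this slide can be checked with a small simulation. The sketch below assumes a square grid with `side` processors per row and column, and it combines all data bound for the same next hop into one message. The function name `mesh_all_to_all` is hypothetical, and the code only counts messages and verifies delivery; it does not model timing.

```python
from collections import defaultdict


def mesh_all_to_all(side):
    """Simulate all-to-all on a side x side virtual grid.

    Returns (sent, delivered): messages sent per processor, and for each
    processor the set of sources whose data reached it.
    """
    procs = [(x, y) for x in range(side) for y in range(side)]
    sent = defaultdict(int)

    # Phase 1: each source (x1, y1) forwards data for (x2, y2) to (x1, y2),
    # sending one combined message per row-mate.
    staged = defaultdict(list)  # intermediate processor -> [(src, dst), ...]
    for src in procs:
        bundles = defaultdict(list)
        for dst in procs:
            if dst != src:
                bundles[(src[0], dst[1])].append((src, dst))
        for via, items in bundles.items():
            if via != src:  # data for the source's own column stays local
                sent[src] += 1
            staged[via].extend(items)

    # Phase 2: each intermediate forwards along its column,
    # again one combined message per column-mate.
    delivered = defaultdict(set)
    for via in procs:
        bundles = defaultdict(list)
        for src, dst in staged[via]:
            bundles[dst].append(src)
        for dst, sources in bundles.items():
            if dst != via:
                sent[via] += 1
            delivered[dst].update(sources)
    return sent, delivered
```

For `side = 4` (P = 16), each processor sends 6 messages in this model, compared with 15 for direct all-to-all; the saving comes at the price of each item travelling two hops.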
8. Optimized All-to-all Surprise
Completion time vs. computation overhead
76 bytes all-to-all on Lemieux
CPU is free during most of the time taken by a collective operation
Led to the development of Asynchronous Collectives, now supported in AMPI
9. Latency Tolerance: Multi-Cluster Jobs
- Job co-scheduled to run across two clusters to provide access to large numbers of processors
- But cross-cluster latencies are large!
- Virtualization within Charm++ masks high inter-cluster latency by allowing overlap of communication with computation
Figure: Cluster A and Cluster B; intra-cluster latency (microseconds), inter-cluster latency (milliseconds)
10. Hypothetical Timeline of a Multi-Cluster Computation
Figure: timeline for processors A, B, and C, with the cross-cluster boundary between B and C
- Processors A and B are on one cluster, Processor C on a second cluster
- Communication between clusters via high-latency WAN
- Processor virtualization allows latency to be masked
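How virtualization masks latency can be illustrated with a toy model of a single processor. The sketch below assumes that the processor's work per step is split evenly among its virtual processors, that each VP must wait a fixed `latency` after finishing a step before its next input arrives, and that the processor always runs whichever VP is ready (message-driven execution). The function name `simulate` and all parameters are hypothetical, and times are in arbitrary units.

```python
import heapq


def simulate(num_vps, compute_per_step, latency, steps):
    """Return the time at which one processor finishes all its work.

    The processor hosts num_vps virtual processors; each runs `steps`
    iterations of compute_per_step / num_vps work and then waits `latency`
    for a remote message before its next iteration can start.
    """
    work = compute_per_step / num_vps
    ready = [(0.0, vp) for vp in range(num_vps)]  # (time input arrives, VP)
    heapq.heapify(ready)
    remaining = [steps] * num_vps
    clock = 0.0
    while ready:
        ready_at, vp = heapq.heappop(ready)
        clock = max(clock, ready_at)  # idle only if no VP has input yet
        clock += work                 # run this VP's share of the step
        remaining[vp] -= 1
        if remaining[vp] > 0:
            heapq.heappush(ready, (clock + latency, vp))
    return clock
```

With one VP the processor idles for the full latency on every step; with several VPs it can work on other objects while a message is in flight, so the wait is partly or wholly hidden, depending on how the latency compares with the available computation.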
11. Multi-cluster Experiments
- Experimental environment
  - Artificial latency environment: VMI delay device adds a pre-defined latency between arbitrary pairs of nodes
  - TeraGrid environment: experiments run between NCSA and ANL machines (1.725 ms one-way latency)
- Experiments
  - Five-point stencil (2D Jacobi) for matrix sizes 2048x2048 and 8192x8192
  - LeanMD molecular dynamics code running a 30,652-atom system
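For readers unfamiliar with the first benchmark, here is a minimal serial sketch of one five-point stencil (2D Jacobi) sweep. It assumes the common Laplace-equation form in which each interior point becomes the average of its four neighbours and boundary values stay fixed; the slides do not give the exact update used in the experiments. The function name `jacobi_step` is hypothetical.

```python
def jacobi_step(grid):
    """Return a new grid after one Jacobi sweep.

    grid: list of equal-length lists of floats. Boundary entries are copied
    unchanged; each interior entry is replaced by the mean of its north,
    south, west, and east neighbours from the old grid.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new
```

In a parallel version the matrix is split into blocks, and each block needs only the edge values of its neighbouring blocks per step, which is why the benchmark is sensitive to communication latency.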
12. Five-Point Stencil Results (P = 64)
13. Fault Tolerance
- Automatic Checkpointing for AMPI and Charm++
  - Migrate objects to disk!
  - Automatic fault detection and restart
  - Now available in distribution version of AMPI and Charm++
- New work
  - In-memory checkpointing
  - Scalable fault tolerance
- Impending Fault Response
  - Migrate objects to other processors
  - Adjust processor-level parallel data structures
14. Scalable Fault Tolerance
- Motivation
  - When one processor out of 100,000 fails, the other 99,999 shouldn't have to run back to their checkpoints!
- How?
  - Sender-side message logging
  - Latency tolerance mitigates costs
  - Restart can be sped up by spreading out objects from the failed processor
- Long term project
- Current progress
  - Basic scheme idea implemented and tested in simple programs
  - General purpose implementation in progress
Only the failed processor's objects recover from checkpoints, while others continue
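The sender-side logging idea can be sketched as follows. Each sender keeps a log of what it sent; when a receiver fails, only that receiver restores its checkpoint, and senders replay the messages it had not yet absorbed at checkpoint time. This is a heavily simplified illustration, not the scheme implemented in Charm++ or AMPI: the `Worker` class and `replay` function are hypothetical, the application state is an order-independent sum, and message ordering across senders is ignored.

```python
class Worker:
    """A toy process whose state is the sum of all payloads it has received."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self.received_from = {}  # sender name -> number of messages processed
        self.sent_log = {}       # destination name -> payloads sent (the log)
        self._checkpoint = (0, {})

    def send(self, dest, payload):
        # Log on the sender side before delivering.
        self.sent_log.setdefault(dest.name, []).append(payload)
        dest.receive(self.name, payload)

    def receive(self, sender_name, payload):
        self.total += payload
        self.received_from[sender_name] = self.received_from.get(sender_name, 0) + 1

    def checkpoint(self):
        self._checkpoint = (self.total, dict(self.received_from))

    def fail_and_restore(self):
        # Only this worker rolls back; everyone else keeps running.
        self.total, saved = self._checkpoint
        self.received_from = dict(saved)


def replay(sender, recovered):
    """Resend the logged messages the recovered worker has not yet processed."""
    log = sender.sent_log.get(recovered.name, [])
    already = recovered.received_from.get(sender.name, 0)
    for payload in log[already:]:
        recovered.receive(sender.name, payload)
```

After `fail_and_restore()` on one worker and a `replay` from each of its senders, that worker's state matches what it was before the failure, while the senders never rolled back.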
15. Develop Abstractions in Context of Full-Scale Applications
Core technology: Parallel Objects, Adaptive Runtime System, Libraries and Tools
Applications:
- Protein Folding
- Quantum Chemistry (QM/MM)
- Molecular Dynamics
- Computational Cosmology
- Crack Propagation
- Space-time meshes
- Dendritic Growth
- Rocket Simulation
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE
16. Next
- Jim Jiao
  - Integration Framework
  - Surface propagation
  - Mesh adaptation