1
Computer Science Overview
  • Laxmikant (Sanjay) Kale

2
Computer Science Projects: Posters
  • Rocketeer
  • Home-grown visualizer
  • John, Fiedler
  • Rocpanda
  • Parallel I/O
  • Winslett et al
  • Novel Linear System Solvers
  • de Sturler, Heath, Saylor
  • Performance monitoring
  • Campbell, Zheng, Lee
  • Parallel Mesh support
  • FEM Framework
  • Parallel remeshing
  • Parallel Solution transfer
  • Adaptive mesh refinement

3
Computer Science Projects: Talks
  • Kale
  • Processor virtualization via migratable objects
  • Jiao
  • Integration Framework
  • Surface propagation
  • Mesh adaptation

4
Migratable Objects and Charm++
  • Charm++
  • Parallel C++
  • Arrays of objects
  • Automatic load balancing
  • Prioritization
  • Mature system
  • Available on all parallel machines we know
  • Rocket Center Collaborations
  • It was clear that Charm++ would not be adopted by
    the whole application community
  • It was equally clear to us that it was a unique
    technology that would improve programmer
    productivity substantially
  • Led to the development of AMPI
  • Adaptive MPI
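
As a concrete illustration of "arrays of objects", here is a minimal Charm++-style chare array. It is a sketch only: it follows real Charm++ conventions (CProxy, ckNew, the generated .decl.h/.def.h headers), but the module, class, and method names are invented for this example, and clean termination (a reduction followed by CkExit()) is omitted.

```cpp
// Sketch of a 1D chare array in Charm++ (illustrative names).
// The companion interface file, hello.ci, would declare:
//   mainmodule hello {
//     mainchare Main { entry Main(CkArgMsg *m); };
//     array [1D] Worker {
//       entry Worker();
//       entry void work(int step);
//     };
//   }
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    delete m;
    // Over-decompose: create many more array elements than
    // physical processors; the runtime maps and migrates them.
    CProxy_Worker workers = CProxy_Worker::ckNew(8 * CkNumPes());
    workers.work(0);  // broadcast to every element
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage *m) {}  // required for migratability
  void work(int step) {
    CkPrintf("element %d running on PE %d\n", thisIndex, CkMyPe());
    // A real program would contribute to a reduction and
    // eventually call CkExit().
  }
};

#include "hello.def.h"
```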

5
Processor Virtualization
Benefits
  • Programmer: over-decomposition into virtual
    processors (VPs)
  • Runtime: assigns VPs to processors
  • Enables adaptive runtime strategies
  • Implementations: Charm++, AMPI
  • Software engineering
  • Number of virtual processors can be independently
    controlled
  • Separate VPs for different modules
  • Message driven execution
  • Adaptive overlap of communication
  • Predictability
  • Automatic out-of-core
  • Asynchronous reductions
  • Dynamic mapping
  • Heterogeneous clusters
  • Vacate, adjust to speed, share
  • Automatic checkpointing
  • Change set of processors used
  • Automatic dynamic load balancing
  • Communication optimization
  • Collectives

[Figure: MPI processes implemented as virtual
processors (user-level migratable threads)]
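
To make "message-driven execution" concrete, here is a minimal plain-C++ sketch (all names hypothetical, single-threaded for clarity) of one physical processor running many virtual processors: the scheduler always delivers whichever message is available next, so no single VP's communication wait stalls the processor.

```cpp
// Sketch of message-driven execution over many virtual
// processors (VPs) sharing one physical processor.
#include <cstdio>
#include <queue>
#include <vector>

struct Message { int targetVP; int data; };

struct VirtualProcessor {
  int id = 0;
  long sum = 0;
  void handle(const Message& m) {  // "entry method": runs on arrival
    sum += m.data;
    std::printf("VP %d processed %d\n", id, m.data);
  }
};

int main() {
  const int numVPs = 8;  // many VPs per physical processor
  std::vector<VirtualProcessor> vps(numVPs);
  for (int i = 0; i < numVPs; ++i) vps[i].id = i;

  std::queue<Message> inbox;  // stands in for the network
  for (int i = 0; i < 32; ++i) inbox.push({i % numVPs, i});

  // Scheduler loop: whichever message is available next drives
  // its target VP; waits of one VP overlap with work of another.
  while (!inbox.empty()) {
    Message m = inbox.front(); inbox.pop();
    vps[m.targetVP].handle(m);
  }
}
```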
6
Highly Agile Dynamic Load Balancing
  • Needed, for example, to handle the advent of
    plasticity around a crack
  • Here, a simple example: plasticity in a bar

7
Optimizing All-to-All via Mesh
Organize processors in a 2D (virtual) grid
Phase 1: Each processor sends messages within its row
Phase 2: Each processor sends messages within its
column
A message from (x1,y1) to (x2,y2) goes via (x1,y2)
  • 2(√P - 1) messages instead of P - 1
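
A small standalone sketch (hypothetical demo code, not the library implementation) of the row-then-column routing rule and the resulting per-processor message counts:

```cpp
// Demo of the 2D-mesh all-to-all routing rule: a message from
// (x1,y1) to (x2,y2) is sent along the row to (x1,y2), then
// along the column to (x2,y2). Each processor sends sqrt(P)-1
// row messages plus sqrt(P)-1 column messages, i.e.
// 2(sqrt(P)-1) messages instead of P-1 direct messages.
#include <cmath>
#include <cstdio>

struct Coord { int x, y; };

// Intermediate hop for the row-then-column route.
Coord intermediate(Coord src, Coord dst) { return {src.x, dst.y}; }

int main() {
  int P = 64;                              // total processors
  int side = (int)std::sqrt((double)P);    // virtual grid is side x side
  Coord src{1, 2}, dst{5, 7};
  Coord via = intermediate(src, dst);
  std::printf("route (%d,%d) -> (%d,%d) -> (%d,%d)\n",
              src.x, src.y, via.x, via.y, dst.x, dst.y);
  std::printf("messages per processor: %d (mesh) vs %d (direct)\n",
              2 * (side - 1), P - 1);
}
```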

8
Optimized All-to-All: A Surprise
[Figure: completion time vs. computation overhead for
a 76-byte all-to-all on Lemieux]
The CPU is free during most of the time taken by a
collective operation
This led to the development of asynchronous
collectives, now supported in AMPI
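
The pattern that asynchronous collectives enable can be sketched with the standard MPI-3 nonblocking call MPI_Ialltoall; note this is an approximation of the idea, not AMPI's own interface, which predates MPI-3.

```cpp
// Overlapping computation with an asynchronous all-to-all,
// sketched with MPI-3's MPI_Ialltoall.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int p; MPI_Comm_size(MPI_COMM_WORLD, &p);

  std::vector<int> sendbuf(p), recvbuf(p);
  for (int i = 0; i < p; ++i) sendbuf[i] = i;

  MPI_Request req;
  MPI_Ialltoall(sendbuf.data(), 1, MPI_INT,
                recvbuf.data(), 1, MPI_INT,
                MPI_COMM_WORLD, &req);

  // The CPU is free while the collective is in flight:
  // do useful work here instead of idling in MPI_Alltoall.
  double acc = 0;
  for (int i = 0; i < 1000000; ++i) acc += i * 0.5;

  MPI_Wait(&req, MPI_STATUS_IGNORE);  // collective now complete
  MPI_Finalize();
  return acc < 0;  // use acc so the overlap work is not optimized out
}
```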
9
Latency Tolerance: Multi-Cluster Jobs
  • Job co-scheduled to run across two clusters to
    provide access to large numbers of processors
  • But cross cluster latencies are large!
  • Virtualization within Charm++ masks high
    inter-cluster latency by allowing overlap of
    communication with computation

[Figure: Cluster A and Cluster B; intra-cluster
latency in microseconds, inter-cluster latency in
milliseconds]
10
Hypothetical Timeline of a Multi-Cluster
Computation
[Figure: timeline of processors A and B (first
cluster) and processor C (second cluster), with
messages crossing the cluster boundary]
  • Processors A and B are on one cluster, Processor
    C on a second cluster
  • Communication between clusters via high-latency
    WAN
  • Processor Virtualization allows latency to be
    masked

11
Multi-cluster Experiments
  • Experimental environment
  • Artificial latency environment: a VMI delay
    device adds a pre-defined latency between
    arbitrary pairs of nodes
  • TeraGrid environment: experiments run between
    NCSA and ANL machines (1.725 ms one-way latency)
  • Experiments
  • Five-point stencil (2D Jacobi) for matrix sizes
    2048x2048 and 8192x8192
  • LeanMD molecular dynamics code running a 30,652
    atom system
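
For reference, the computational kernel of the five-point stencil benchmark is simple; below is a minimal serial sketch (the actual experiments decompose the 2048x2048 and 8192x8192 matrices across virtual processors and exchange ghost rows).

```cpp
// Serial sketch of the five-point stencil (2D Jacobi) kernel.
#include <cstddef>
#include <vector>

// One Jacobi sweep: every interior point becomes the average
// of its four neighbors.
void jacobiSweep(const std::vector<double>& a,
                 std::vector<double>& b, std::size_t n) {
  for (std::size_t i = 1; i + 1 < n; ++i)
    for (std::size_t j = 1; j + 1 < n; ++j)
      b[i*n + j] = 0.25 * (a[(i-1)*n + j] + a[(i+1)*n + j]
                           + a[i*n + j-1] + a[i*n + j+1]);
}

int main() {
  const std::size_t n = 256;  // experiments used 2048 and 8192
  std::vector<double> a(n * n, 0.0);
  for (std::size_t j = 0; j < n; ++j) a[j] = 1.0;  // hot boundary row
  std::vector<double> b = a;
  for (int iter = 0; iter < 100; ++iter) {
    jacobiSweep(a, b, n);
    a.swap(b);
  }
}
```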

12
Five-Point Stencil Results (P = 64)
13
Fault Tolerance
  • Automatic checkpointing for AMPI and Charm++
  • Migrate objects to disk!
  • Automatic fault detection and restart
  • Now available in the distribution version of AMPI
    and Charm++
  • New work
  • In-memory checkpointing
  • Scalable fault tolerance
  • Impending Fault Response
  • Migrate objects to other processors
  • Adjust processor-level parallel data structures
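
"Migrate objects to disk" means the same pack/unpack machinery used for migration between processors also writes a checkpoint. Below is a simplified, hypothetical plain-C++ sketch of that idea; Charm++'s real mechanism is its PUP framework, not this code.

```cpp
// Sketch of checkpoint-by-migration: one pack/unpack routine
// serializes an object's state, whether the destination is
// another processor or a checkpoint file on disk.
#include <cstdint>
#include <fstream>
#include <vector>

struct WorkUnit {
  int step = 0;
  std::vector<double> state;

  void pack(std::ostream& out) const {
    std::uint64_t n = state.size();
    out.write(reinterpret_cast<const char*>(&step), sizeof step);
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(state.data()),
              n * sizeof(double));
  }
  void unpack(std::istream& in) {
    std::uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&step), sizeof step);
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    state.resize(n);
    in.read(reinterpret_cast<char*>(state.data()),
            n * sizeof(double));
  }
};

int main() {
  WorkUnit w; w.step = 42; w.state.assign(1000, 3.14);
  { std::ofstream f("ckpt.bin", std::ios::binary); w.pack(f); }  // checkpoint
  WorkUnit r;
  { std::ifstream f("ckpt.bin", std::ios::binary); r.unpack(f); }  // restart
}
```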

14
Scalable Fault Tolerance
  • Motivation
  • When one processor out of 100,000 fails, the
    other 99,999 shouldn't have to roll back to their
    checkpoints!
  • How?
  • Sender-side message logging
  • Latency tolerance mitigates costs
  • Restart can be sped up by spreading out the
    failed processor's objects across other processors
  • Long term project
  • Current progress
  • Basic scheme implemented and tested in simple
    programs
  • General purpose implementation in progress

Only the failed processor's objects recover from
checkpoints, while the others continue
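
A simplified, hypothetical sketch of the sender-side idea: every sender retains the messages it has sent, so after a failure only the failed processor's objects roll back, and their inbound messages are replayed from the senders' logs rather than by re-executing the senders.

```cpp
// Sketch of sender-side message logging (simplified: no
// sequence-number garbage collection, no determinism records).
#include <cstdio>
#include <map>
#include <vector>

struct LoggedMessage { int seq; int payload; };

struct Sender {
  std::map<int, std::vector<LoggedMessage>> log;  // per-destination
  int seq = 0;

  void send(int dest, int payload) {
    log[dest].push_back({seq++, payload});  // keep a copy
    // ... actually transmit the message here ...
  }

  // On failure of `dest`, replay its messages in the original
  // order; the restarted objects reprocess them to reach the
  // pre-crash state while everyone else keeps running.
  void replay(int dest) {
    for (const LoggedMessage& m : log[dest])
      std::printf("replaying seq %d payload %d to %d\n",
                  m.seq, m.payload, dest);
  }
};

int main() {
  Sender s;
  s.send(3, 10); s.send(3, 20); s.send(5, 30);
  s.replay(3);  // processor 3 failed: only its messages are resent
}
```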
15
Develop abstractions in the context of full-scale
applications
[Figure: applications (Protein Folding, Quantum
Chemistry (QM/MM), Molecular Dynamics, Computational
Cosmology, Crack Propagation, Space-time Meshes,
Dendritic Growth, Rocket Simulation) built on Parallel
Objects, Adaptive Runtime System, and Libraries and
Tools]
The enabling CS technology of parallel objects and
intelligent runtime systems has led to several
collaborative applications in CSE
16
Next
  • Jim Jiao
  • Integration Framework
  • Surface propagation
  • Mesh adaptation