
Transcript and Presenter's Notes

Title: Applications, scalability, and technological change


1
Applications, scalability, and technological
change
  • Scott B. Baden, Gregory T. Balls
  • Dept. of Computer Science and Engineering - UCSD
  • Phillip Colella
  • Advanced Numerical Algorithms Group - LBNL

2
Asynchronous computation and a data-driven model
of execution
  • Scott B. Baden
  • Dept. of Computer Science and Engineering
    University of California, San Diego

3
Motivation
  • Petascale architectures motivate the design of
    new algorithms and programming models to cope
    with technological evolution
  • The growing processor-memory gap continues to
    raise the cost of communication
  • Amdahl's law amplifies the cost of resource
    contention
  • Reformulate the algorithm to
  • Reduce the amount of communication
  • Reduce the cost

4
Motivating applications
  • Elliptic solvers
  • High communication overheads due to global
    coupling
  • Low ratio of flops-to-mems
  • Asynchronous algorithms
  • Brownian dynamics for cell microphysiology
  • Dynamic data assimilation

5
Roadmap
  • SCALLOP
  • A highly scalable, infinite domain Poisson solver
  • Written in KeLP
  • Asynchronous algorithms with Tarragon
  • Non-BSP programming model
  • Communication overlap

6
Infinite Domain Poisson Equation
  • SCALLOP is an elliptic solver for constant
    coefficient problems in 3D
  • Free space boundary conditions
  • We consider the Poisson equation ∆φ = ρ(x,y,z)
  • with infinite domain boundary conditions: φ decays
    at infinity like the potential of a point charge of
    strength R
  • where R is the total charge

7
Infinite domain BCs in practice
  • Infinite domain BCs arise in various applications
  • Modeling the human heart [Yelick, Peskin, and
    McQueen]
  • Astrophysics [Colella et al.]
  • Computing infinite domain boundary conditions is
    expensive, especially on a parallel computer
  • Alternatives
  • Extending the domain
  • Periodic boundary conditions

8
Elliptic regularity
  • The Poisson equation ∆φ = ρ(x,y,z)
  • Let's assume that ρ = f(x,y,z) for (x,y,z) ∈ Ω
  • Ω is the set of points where ρ ≠ 0, i.e. supp(ρ)
  • The solution φ ∈ C∞ outside of Ω
  • We can represent φ at a lower numerical
    resolution outside Ω than we can inside Ω

[Figure: computational domain D containing the charge support Ω]
9
Elliptic regularity
  • Superposition and linearity
  • We can divide D into D1 ∪ D2 ∪ ... ∪ Dn = D
  • To get the solution over D, we sum the solutions
    over the Di due to the charges ρi in each Di
  • The solution φi ∈ C∞ outside Ωi
  • We can represent each φi at a lower numerical
    resolution outside Ωi than we can inside Ωi

[Figure: the domain D, a subdomain Di, and the charge support Ω]
10
SCALLOP
  • Exploits elliptic regularity to reduce
    communication costs significantly
  • Barnes-Hut (1986), Anderson's MLC (1986), FMM
    (1987), Bank-Holst (2000), Balls and Colella
    (2002)
  • Our contribution: extension of these ideas to
    finite-difference problems in three dimensions

11
Domain Decomposition Strategy
  • Divide problem into subdomains
  • Use a reduced description of far-field effects
  • Stitch solutions together

12
Comparison with Traditional Domain Decomposition
Methods
  • E.g. Smith and Widlund
  • Multiple iterations between local and nonlocal
    domains
  • Multiple communication steps
  • SCALLOP employs a fixed number (2) of
    communication steps

13
Comparison with Traditional Domain Decomposition
Methods
  • Construct a dense linear system for degrees of
    freedom on the boundaries between subdomains
    using a Schur complement (Smith and Widlund)
  • Multiple iterations between local and nonlocal
    domains
  • Multiple communication steps
  • SCALLOP employs a fixed number of communication
    steps

14
SCALLOP in Context
  • Finite element methods
  • Bank-Holst (2000)
  • Particle Methods
  • Fast Multipole Method [Greengard and Rokhlin,
    1987]
  • Users pay a computational premium in exchange for
    parallelism
  • Method of Local Corrections [Anderson, 1986]
  • Not well-suited to finite-difference
    calculations: difficult to generate suitable
    derivatives

15
Domain Decomposition Definitions
  • N³ is the global problem size
  • Divided into q³ subdomains
  • The number of processors must divide q³ evenly
  • Local mesh of size (N/q)³
  • Coarse mesh of size (N/C)³
  • C is the coarsening factor
  • In this 2-D slice, N = 16, q = 2, and C = 4
  • 16³ mesh split over 8 processors
  • Local mesh: 8³
  • Coarse mesh: 4³

16
The Scallop Domain Decomposition Algorithm
  • Five-step algorithm, 2 communication steps
  • Serial building blocks:
  • Dirichlet solver (FFTW)
  • Infinite domain solver (built on the Dirichlet
    solver): two complete Dirichlet solutions on
    slightly enlarged domains
  • The infinite domain boundary calculation consumes
    most of the running time

17
Domain Decomposition Algorithm
  • 1. On each subdomain, solve an infinite domain
    problem, ignoring all other subdomains, and
    create a coarse representation of the charge.
  • O((N/q)³) parallel running time, no
    communication

18
Domain Decomposition Algorithm
  • 2. Aggregate all the coarse charge fields into
    one global charge.
  • O((Nq/C)³), all-to-all communication

19
Domain Decomposition Algorithm
  • 3. Calculate the global infinite domain
    solution. (Duplicate solves on all processors)
  • O((N/C)³) running time, no communication

20
Domain Decomposition Algorithm
  • 4. Compute boundary conditions for the final local
    solve: neighbors exchange boundary data of local
    solutions and combine local fine grids with the
    global coarse grid
  • O(1) running time, nearest-neighbor
    communication

21
Domain Decomposition Algorithm
  • 5. Solve a Dirichlet problem on each subdomain to
    obtain the local portion of the infinite domain
    solution. O((N/q)³) running time, no
    communication

22
Domain Decomposition Algorithm
  • 1. Initial solution: O((N/q)³)
  • 2. Aggregation: O((Nq/C)³)
  • 3. Global coarse solution: O((N/C)³)
  • 4. Local correction: O((N/q)³) (less than the ID
    solution)
  • 5. Final calculation: O(1)

Overall O((N/q)³ + (N/C)³) work, two
communication steps (sketched in the code below)
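
A minimal C++ sketch of the five-step driver follows. It is illustrative only,
not the SCALLOP/KeLP code: the serial kernels solve_infinite_domain,
solve_dirichlet, coarsen, and interpolate_boundary are hypothetical stand-ins
for the FFTW-based building blocks (left as declarations), and step 2 is shown
as a simple MPI_Allreduce over a zero-padded global coarse grid.

    // Sketch of the five-step driver; helper routines are hypothetical.
    #include <mpi.h>
    #include <vector>

    struct Grid { int n = 0; std::vector<double> v; };   // an n^3 mesh

    // Assumed serial building blocks (FFTW-based in SCALLOP itself).
    Grid solve_infinite_domain(const Grid& rho);
    Grid solve_dirichlet(const Grid& rho, const Grid& bc);
    Grid coarsen(const Grid& fine, int C);
    Grid interpolate_boundary(const Grid& global_coarse_phi, const Grid& local_phi);

    Grid scallop_solve(const Grid& local_rho, int C, MPI_Comm comm) {
      // 1. Local infinite domain solve, ignoring all other subdomains.
      //    O((N/q)^3) work, no communication.
      Grid local_phi = solve_infinite_domain(local_rho);

      // 2. Aggregate coarse charge fields into one global coarse charge.
      //    Global communication on the reduced description only.
      Grid coarse_rho = coarsen(local_rho, C);    // zero-padded to the global coarse grid
      Grid global_rho = coarse_rho;
      MPI_Allreduce(coarse_rho.v.data(), global_rho.v.data(),
                    static_cast<int>(coarse_rho.v.size()),
                    MPI_DOUBLE, MPI_SUM, comm);

      // 3. Global coarse infinite domain solve, duplicated on every rank.
      //    O((N/C)^3) work, no communication.
      Grid global_phi = solve_infinite_domain(global_rho);

      // 4. Combine neighbors' local solutions with the coarse solution to
      //    form Dirichlet boundary values (nearest-neighbor exchange elided).
      Grid bc = interpolate_boundary(global_phi, local_phi);

      // 5. Final local Dirichlet solve.  O((N/q)^3) work, no communication.
      return solve_dirichlet(local_rho, bc);
    }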
23
Computational Tradeoffs
  • Accuracy is only weakly dependent on C
  • Goal: minimize the cost of the global coarse-grid solve
  • C ≥ 2q
  • Global coarse work is less than 1/8 of the local fine work
  • O( (N/q)³ + (N/C)³ ) ≈ O( (N/q)³ )
  • For the current implementation, large C leads to
    extra local fine grid work

24
Subdomain Overlap
  • In order to ensure smooth solutions and accurate
    interpolation, the local (fine) domains need to
    overlap
  • The overlap is measured in coarse grid spacing
  • For large refinement ratios, the overlap (in
    terms of fine grid points) gets very large
  • Here we see the domain of influence of a fine
    mesh cell

25
Overheads
  • SCALLOP performs 3 solves on slightly enlarged
    local domains
  • Communication in a fixed number of steps
  • Infinite domain BC computation performs global
    communication on a reduced description of the data

26
Analytic performance model for computational
overheads
  • Let TSID(N) = the time for a serial infinite domain
    solve on an N³ mesh (step 1); the ID BCs account for
    92% of this time
  • Global coarse-grid solve: TSID(N/C), which for C ≥ 2
    is at most (1/8)TSID(N), since the cost scales as the
    cube of the mesh dimension
  • Final solve: 0.08 TSID (we don't need to
    compute the ID BCs!)
  • Total cost: roughly 1.2 TSID (1 + 1/8 + 0.08 ≈ 1.2)
  • If we could reduce the cost of the ID BC
    computation to zero, the total is at worst 2.0 TSID

27
Computational Overhead
  • Global coarse-grid solve on a grid of size
    (N/C)³: small if N/C is small relative to N/q
  • Extra computation due to overlap of the fine-grid
    domains: small if C is reasonably small
  • Two fine-grid calculations (complete solutions,
    not just smoothing steps or V-cycles):
    unavoidable, but the final Dirichlet solution is
    less costly than a full infinite domain solution

28
Limitations of Current Implementation
  • Earlier we mentioned that we require C ≥ 2q
  • For interpolation in step 4, we require a border
    of 2 coarse grid cells around each subdomain
  • To obtain those coarse values, we currently use
    the local fine grid ID solution from step 1
  • We thus require a local mesh of size Nf,G = Nf +
    4C
  • Our analytic performance model assumes that the
    local extended mesh size is Nf,G ≤ 1.2 Nf
  • But as q grows, Nf + 4C > 1.2 Nf
  • Computational work is then not strictly O(N³)

29
Limitations of Current Implementation
  • How does 1.2 Nf ≥ Nf + 4C constrain us?
  • Take, as before
  • C = 2q
  • 1.2 Nf ≥ Nf + 8q
  • 1.2 ≥ 1 + 8q/Nf
  • q ≤ Nf/40
  • For Nf = 160, q ≤ 4
  • Without some tradeoffs, we're limited to q³ = 64
    procs

30
Alternate Implementation
  • The necessary coarse grid values can be computed
    during the infinite domain boundary calculation,
    without calculating the corresponding fine grid
    values
  • Local domain sizes are kept reasonably small
  • No longer Nf,G = max(1.2 Nf, Nf + 4C),
  • just Nf,G = 1.2 Nf
  • All computational costs are strictly O(N³)
  • A new serial ID solver has been tested; the parallel
    implementation is underway

31
Limit of Parallelism
  • We are limited only by the maximum coarsening factor:
    at least 1 coarse cell per local domain.
  • Nc ≥ q, or N/C ≥ q
  • If we take C = 2q and Nf = N/q, as before,
  • Nf ≥ 2q
  • Total problem size and parallelism are now a
    function of the local memory available
  • For Nf = 128:
  • q ≤ 64, q³ = 262,144 processors

32
Experiments
  • Ran on two SP systems with Power 3 CPUs
  • NPACI's Blue Horizon
  • NERSC's Seaborg
  • Used a serial FFT solver implemented with FFTW
  • Compiled with -O2, standard environment

33
Scaled Speed-up
  • Try to maintain constant work per processor
  • Number of processors, P, proportional to
    N³, q³, C³
  • We report performance in terms of grind time
  • Ideally should be a constant

Tgrind = T / N³
34
Results - Seaborg
[Charts: grind times and communication percentage vs. processor count]
  • Grind time increases by a factor of 2.4 over a
    range of 16 - 1024 processors on Seaborg.
  • Communication takes less than 12% of the running
    time.

35
Implementation
  • SCALLOP was implemented with KeLP
  • A rapid development infrastructure for
    distributed memory machines
  • KeLP simplifies the expression of coarse-to-fine
    grid communication
  • Bookkeeping
  • Domains of dependence
  • KeLP provides useful abstractions (see the sketch
    after this list)
  • Set operations on geometric domains (FIDIL,
    BoxLib, Titanium)
  • Express communication in geometric terms
  • Separation of concerns
  • KeLP is unaware of the representation of user
    data structures
  • The user is unaware of the low-level details involved
    in moving data
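
To make the idea of set operations on geometric domains concrete, here is a
small self-contained C++ sketch in the spirit of KeLP/FIDIL/BoxLib, but not
their actual APIs: intersecting a ghost-extended box with a neighbor's box
yields exactly the regular section of data that must be communicated.

    // Minimal illustration of a geometric region calculus (not the KeLP API).
    #include <algorithm>
    #include <iostream>

    struct Box {
      int lo[3], hi[3];                       // inclusive bounds in Z^3
      bool empty() const {
        return lo[0] > hi[0] || lo[1] > hi[1] || lo[2] > hi[2];
      }
    };

    // Grow a box by g cells in every direction (its ghost/halo extent).
    Box grow(Box b, int g) {
      for (int d = 0; d < 3; ++d) { b.lo[d] -= g; b.hi[d] += g; }
      return b;
    }

    // Intersection of two boxes: the regular section they share.
    Box intersect(const Box& a, const Box& b) {
      Box r;
      for (int d = 0; d < 3; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
      }
      return r;
    }

    int main() {
      Box mine  = {{0, 0, 0}, {7, 7, 7}};     // my patch
      Box neigh = {{8, 0, 0}, {15, 7, 7}};    // neighbor's patch
      // Data I must receive: neighbor's cells that fall in my ghost region.
      Box recv = intersect(grow(mine, 1), neigh);
      std::cout << "receive x-range: " << recv.lo[0] << ".." << recv.hi[0] << "\n";
      return 0;                               // prints "receive x-range: 8..8"
    }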

36
The KeLP Data Motion Model
  • User defines persistent communication objects
    customized for regular section communication
  • Replace low level point-to-point messages with
    high level geometric descriptions of data
    dependences
  • Optimizations
  • Execute asynchronously to overlap with
    computation
  • Modify the dependencies with meaning-preserving
    transformations that improve performance

37
KeLP's view of communication
  • Communication exhibits collective behavior, even
    if all pairs of processors aren't communicating
  • The data dependencies have an intuitive geometric
    structure involving regular section data motion
    within a global coordinate system

38
KeLP's Structural Abstractions
  • Distributed patches of storage living in a global
    coordinate system, each with its own origin
  • Geometric meta-data describing the structure of
    blocked patches of data and of data dependences
  • A geometric calculus for manipulating the
    meta-data
  • Unit of dependence is a regular section

39
Abstract representation
  • The dependence structure and the data are
    abstract
  • KeLP doesn't say how the data are represented or
    how the data will be moved
  • The user provides rules to instantiate and
    flatten a subspace of Zⁿ

40
Examples
  • Define a grid over an irregular subset of a
    bounding rectangle (Colella and van Straalen,
    LBNL)
  • Particles
  • We might represent these internally with trees,
    hash tables, etc.
  • KeLP enforces the model that we move data lying
    within rectangular subspaces

41
Summing up Scallop
  • A philosophy for designing algorithms that
    embraces technological change
  • Sophisticated algorithms that replace (expensive)
    communication with (cheaper)
    computation
  • To develop these algorithms, we need appropriate
    infrastructure (KeLP is another talk)
  • Scaling to larger problems is underway
  • Reducing the effective cost of domain overlap
  • Reducing the cost of the infinite domain boundary
    calculation
  • Extension to adaptive mesh refinement algorithm

42
Roadmap
  • SCALLOP
  • A highly scalable, infinite domain Poisson solver
  • Written in KeLP
  • Asynchronous algorithms with Tarragon
  • Non-BSP programming model
  • Communication overlap

43
Roadmap
  • SCALLOP A highly scalable, infinite domain
    Poisson solver written in KeLP
  • Asynchronous algorithms with Tarragon
  • Communication overlap
  • Monte Carlo simulation of cell microphysiology

44
Performance Robustness in the presence of
technological change
  • The recipe for writing high-quality application
    software changes over time
  • Either the application must be capable of
    responding to change
  • Or it will have to be reformulated
  • We've just looked at a numerical technique for
    dealing with approach 2
  • Now let's consider a non-numerical approach
  • Application: overlapping communication with
    computation

45
Canonical variants
  • Many techniques are aimed at enhancing memory
    locality within a single address space
  • ATLAS [Dongarra et al. 98], PHiPAC [Demmel et
    al. 96], Sparsity [Demmel & Yelick 99], FFTW
    [Frigo & Johnson 98]
  • Architectural cognizance [Gatlin & Carter 99]
  • DESOBLAS [Beckmann and Kelly, LCPC 99]: delayed
    evaluation of task graphs
  • But the rising cost of data transfer is also a
    concern
  • We'll explore a canonical variant for overlapping
    computation with interprocessor communication in
    MIMD architectures

46
What's difficult about hiding communication?
  • The programmer must hard code the overlap
    technique into the application software
  • The required knowledge is beyond the experience
    of many application programmers
  • The specific technique is sensitive to the
    technology and the application, hence the code is
    not robust

47
Motivating application
  • Iterative solver for Poisson's equation in 3
    dimensions
  • Jacobi's method, 7-pt stencil
  • for (i,j,k) in 1..N x 1..N x 1..N
  •   u'[i,j,k] = ( u[i-1,j,k] + u[i+1,j,k] +
  •                 u[i,j-1,k] + u[i,j+1,k] +
  •                 u[i,j,k+1] + u[i,j,k-1] ) / 6

48
Traditional SPMD implementation
  • Decompose the domain into subregions, one per
    process
  • Transmit halo regions between processes
  • Compute the inner region after communication
    completes (see the MPI sketch below)
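
A minimal MPI sketch of this traditional pattern, assuming a 1-D slab
decomposition of the N³ mesh (illustrative only, not the KeLP code): halo
planes are exchanged with blocking calls, and the Jacobi update runs only
after communication completes.

    // Traditional SPMD Jacobi step: halo exchange first, then compute.
    #include <mpi.h>
    #include <vector>

    void jacobi_step(std::vector<double>& u, std::vector<double>& unew,
                     int nx, int N, MPI_Comm comm) {
      int rank, nprocs;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);
      int up = (rank + 1 < nprocs) ? rank + 1 : MPI_PROC_NULL;
      int dn = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
      int plane = (N + 2) * (N + 2);             // one i-plane, including ghosts
      auto idx = [&](int i, int j, int k) { return (i * (N + 2) + j) * (N + 2) + k; };

      // 1. Transmit halo planes between neighboring processes (blocking).
      MPI_Sendrecv(&u[idx(nx, 0, 0)], plane, MPI_DOUBLE, up, 0,
                   &u[idx(0, 0, 0)],  plane, MPI_DOUBLE, dn, 0, comm, MPI_STATUS_IGNORE);
      MPI_Sendrecv(&u[idx(1, 0, 0)],  plane, MPI_DOUBLE, dn, 1,
                   &u[idx(nx + 1, 0, 0)], plane, MPI_DOUBLE, up, 1, comm, MPI_STATUS_IGNORE);

      // 2. Compute only after communication completes (no overlap).
      for (int i = 1; i <= nx; ++i)
        for (int j = 1; j <= N; ++j)
          for (int k = 1; k <= N; ++k)
            unew[idx(i, j, k)] = (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                                  u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                                  u[idx(i, j, k + 1)] + u[idx(i, j, k - 1)]) / 6.0;
    }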

49
Multi-tier Computers
  • High opportunity cost of communication
  • Hierarchical organization amplifies node
    performance relative to the interconnect
  • Trends: more processors per node, faster
    processors
  • r∞ = DGEMM floating point rate per node, MFLOP/s
  • β∞ = peak pt-to-pt MPI message BW, MBYTE/s
  • IBM SP2/Power2SC: r∞ = 640, β∞ =
    100
  • NPACI Blue Horizon: r∞ = 14,000, β∞ = 400
  • NPACI Data Star: r∞ = 48,000, β∞ ≈
    800

50
Overlapped variant
  • Reformulate the algorithm
  • Isolate the inner region from the halo
  • Execute communication concurrently with
    computation on the inner region
  • Compute on the annulus when the halo finishes
    (see the sketch below)
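
The same step reorganized for overlap, again as an illustrative MPI sketch
with a 1-D slab decomposition (not the KeLP code): nonblocking sends and
receives are posted first, the inner planes that do not depend on the halo are
updated while the messages are in flight, and the two boundary planes are
updated after the wait. The KeLP2 version on the next slide expresses the same
pattern at a higher level.

    // Overlapped Jacobi step: nonblocking halo exchange runs concurrently
    // with the inner-region update; the annulus is computed afterwards.
    #include <mpi.h>
    #include <vector>

    void jacobi_step_overlap(std::vector<double>& u, std::vector<double>& unew,
                             int nx, int N, MPI_Comm comm) {
      int rank, nprocs;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);
      int up = (rank + 1 < nprocs) ? rank + 1 : MPI_PROC_NULL;
      int dn = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
      int plane = (N + 2) * (N + 2);
      auto idx = [&](int i, int j, int k) { return (i * (N + 2) + j) * (N + 2) + k; };
      auto update = [&](int i, int j, int k) {
        unew[idx(i, j, k)] = (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                              u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                              u[idx(i, j, k + 1)] + u[idx(i, j, k - 1)]) / 6.0;
      };

      // Start the halo exchange asynchronously.
      MPI_Request req[4];
      MPI_Irecv(&u[idx(0, 0, 0)],      plane, MPI_DOUBLE, dn, 0, comm, &req[0]);
      MPI_Irecv(&u[idx(nx + 1, 0, 0)], plane, MPI_DOUBLE, up, 1, comm, &req[1]);
      MPI_Isend(&u[idx(nx, 0, 0)],     plane, MPI_DOUBLE, up, 0, comm, &req[2]);
      MPI_Isend(&u[idx(1, 0, 0)],      plane, MPI_DOUBLE, dn, 1, comm, &req[3]);

      // Update the inner region, which does not touch the halo planes.
      for (int i = 2; i <= nx - 1; ++i)
        for (int j = 1; j <= N; ++j)
          for (int k = 1; k <= N; ++k) update(i, j, k);

      // Wait for the halo, then update the annulus (planes i = 1 and i = nx).
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
      for (int j = 1; j <= N; ++j)
        for (int k = 1; k <= N; ++k) { update(1, j, k); update(nx, j, k); }
    }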

51
Overlapped code (KeLP2)
  • Relax(Distributed_Data X, Mover Communication)
  •   Communication.start()   // begin the halo exchange asynchronously
  •   for each subdomain x in X
  •     Update x              // inner-region work overlaps communication
  •   Communication.wait()    // halo data has now arrived
  •   // Repeat the update over the annulus
  • Implemented with KeLP2 [Fink 98, SC99]
  • KeLP2 implements a message proxy to realize
    overlap
  • It also provides hierarchical control flow

52
Performance on 8 nodes of Blue Horizon
With KeLP2 [Fink 98, SC99]
[Bar chart: running times for the HAND, ST, MT(8), MTV(7), and MTV(7) OPT
variants; values shown include 732, 713, 655, and 626 (14)]
53
Observations
  • We had to hard code the overlap strategy as well
    as the parallel control flow into the application
  • Split-phase communication, scheduling,
    complicated partitioning
  • Optimal ordering of communication and computation
    varies across generational changes in technology
  • The characteristic communication delays increase
    relative to that of computation
  • The costs may be irregular
  • The hard coded strategy imposes unnecessary
    constraints
  • Computation and communication are partially
    ordered if you decrease the granularity of the
    computation
  • Applications are rich in potential parallelism

54
Tarragon: an alternative approach
  • Testbed for exploring communication tolerant
    algorithms
  • Asynchronous task graph model of execution
  • Data-driven: departs from the traditional bulk
    synchronous model
  • Communication and computation do not execute as
    distinct phases but are coupled activities
  • Tolerate unpredictable or irregular task and
    communication latencies

55
Data driven execution
  • Overdecompose the problem so that each process
    owns several tasks
  • Construct a task graph indicating the data
    dependences
  • A task suspends until the required communication
    completes, at which point the task is runnable
  • The Tarragon run time system schedules runnable tasks
    according to the flow of data in the task graph
    (see the sketch below)
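
A minimal single-node C++ sketch of this data-driven pattern (illustrative,
not the Tarragon API): each graph vertex carries a count of inputs it is still
missing, delivering an output decrements its consumers' counts, and a
dispatcher loop runs whatever has become runnable.

    // Toy data-driven task execution: tasks run when all inputs have arrived.
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    struct Task {
      std::function<void()> run;     // the computation at this graph vertex
      std::vector<int> consumers;    // edges: tasks that depend on our output
      int missing_inputs = 0;        // inputs not yet delivered
    };

    void execute(std::vector<Task>& graph) {
      std::queue<int> ready;
      for (int t = 0; t < (int)graph.size(); ++t)
        if (graph[t].missing_inputs == 0) ready.push(t);    // initially runnable

      while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        graph[t].run();                                      // run a runnable task
        for (int c : graph[t].consumers)                     // "deliver" its output
          if (--graph[c].missing_inputs == 0) ready.push(c); // consumer now runnable
      }
    }

    int main() {
      // Two producers feeding one consumer: 0 -> 2 and 1 -> 2.
      std::vector<Task> g(3);
      g[0].run = []{ std::cout << "task 0\n"; }; g[0].consumers = {2};
      g[1].run = []{ std::cout << "task 1\n"; }; g[1].consumers = {2};
      g[2].run = []{ std::cout << "task 2\n"; }; g[2].missing_inputs = 2;
      execute(g);
      return 0;
    }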

56
Tarragon in Context
  • Data driven techniques used in DataFlow,
    databases and data intensive applications (Data
    Cutter, ADR)
  • Charm Kale 93
  • Parallelism expressed across object collections
    by making remote method invocations (message
    passing)
  • Global name space
  • Tarragon
  • Functions operate on local data only; data motion
    is explicit
  • Tune performance by adjusting task granularity
    and by decorating the graph with performance
    metadata

57
Tarragon API
  • We express parallelism in an abstract form
  • A task graph describes the partial ordering of
    tasks
  • Vertices → computation
  • Edges → dependences
  • A background thread called the mover-dispatcher
    provides available tasks and processes completions

58
The Mover-Dispatcher
  • Processes incoming and outgoing communication
  • Determines when tasks are ready
  • Calls a scheduler to determine the order of ready
    task execution
  • Labels on the taskGraph guide the scheduling
    process
  • Processes completions
  • The completion handler is a user-defined callback
    that invokes single-sided communication

59
A look inside the Run Time System
  • E: execution engine
  • M: mover/dispatcher

[Diagram: two nodes, NA and NB, each with several execution engines (E), a
mover/dispatcher (M), and Run, Rdy, and Done task queues]
60
Benefits
  • Tolerate unpredictable or irregular latencies at
    different scales
  • Communication and computation are coupled
    activities rather than distinct phases
  • Tune slackness to improve communication
    pipelining
  • Flexible Scheduling
  • Schedulers may be freely substituted, and may be
    application-specific [AppLeS, Berman]
  • Performance meta data enable us to alter the
    execution order without having to change the
    scheduler
  • Run time system optimizes execution ordering
    without entailing heroic reprogramming

61
Slackness
  • Multiple tasks per processing module
  • Improve communication pipelining: communication
    occurs incrementally and in parallel with
    computation
  • Tolerate irregular communication delays
  • Treat load balancing as a scheduling activity
    (migration)

62
First steps
  • KeLP2: 4 applications formulated for overlap
    [Fink and Baden 1997, Baden and Fink 1998]
  • Quantum KeLP: F. David Sacerdoti (MS 02,
    SIAM 03)
  • Overdecomposed workload
  • Load balancer migrates work grains between
    processors

63
Summary
  • Asynchronous task graph execution model
  • Non-bulk-synchronous execution model
  • communication and computation are coupled
    activities rather than distinct phases
  • Tolerate unpredictable or irregular task and
    communication latencies
  • Performance meta data decorate the graph to
    provide scheduling hints
  • Generalizations
  • Very long latencies (on the order of
    milliseconds)
  • Application coupling
  • Incorporate dynamic data sources into ongoing
    computations

64
Roadmap
  • SCALLOP A highly scalable, infinite domain
    Poisson solver written in KeLP
  • Asynchronous algorithms with Tarragon
  • Communication overlap
  • Monte Carlo simulation of cell microphysiology

65
MCELL
  • Monte Carlo simulator of cellular microphysiology
  • Biochemical reaction dynamics in realistic 3D
    microenvironments
  • Brownian dynamics of individual molecules and
    their chemical interactions
  • Developed at the Salk Institute and Pittsburgh
    Supercomputing Center by Tom Bartol and Joel
    Stiles
  • 100 users (first released in 1997)
  • MCell-K: a parallel variant implemented with KeLP
    (with Tom Bartol and Terrence Sejnowski, Salk)

66
Cell microphysiology simulation
  • Collaboration with Tom Bartol, Tilman Kispersky,
    Terrence Sejnowski (Salk Institute), Joel Stiles
    (PSC)
  • MCell: a general Monte Carlo simulator of
    cellular microphysiology
  • Brownian dynamics: random walk of individual
    molecules and chemical interactions
  • 100 users (first released in 1997)
  • MCell-K: a parallel variant implemented with KeLP

67
Motivating application
  • Cerebellar Glomerulus
  • 2 CPU-months on a single processor
  • 24 GB of RAM
  • 20 million Ca2+ ions, 10 million polygons
  • With serial MCell
  • Run 1/8 of the domain of the problem on a single
    processor
  • Reduced resolution
  • Scalable KeLP version MCell-K
  • Running on up to 128 processors on Blue Horizon
  • Collaboration involving Greg Balls (UCSD),
    Srinivas Turaga, Tilman Kispersky (UCSD/Salk),
    Tom Bartol (Salk), Terry Sejnowski (Salk)

68
Animation
  • Simulation of a chick ciliary ganglion synapse
  • A real-world problem
  • 400,000 polygons in the surface
  • Approximately 40,000 molecules diffusing
  • Approximately 500,000 surface receptors

69
Chick ciliary ganglion synapse
Receptors
Chick ciliary ganglion synapse courtesy of Darwin
Berg, Jay Coggan, Mark Ellisman, Eduardo
Esquenazi, Terry Sejnowski, Tom Bartol
70
Chick ciliary ganglion synapse
Ligands
71
Diffusion and Interactions
  • Ligands: neurotransmitter molecules
  • Bind to sites under constraints
  • Bounce off of surfaces
  • Uneven distributions in space and time

release sites
72
On a parallel computer
  • Partition boundary splits up the problem over
    multiple processors
  • As ligands cross a processor boundary, we color
    them yellow

Bound ligands
73
Animation
74
Movie
75
Issues in Parallelization
  • Particles move over a sequence of timesteps
  • React with embedded 2D surfaces - cell membranes
  • Processor boundaries introduce uncertainties in
    handling communication and the need to detect
    termination
  • A particle may bounce among processors owning
    nearby regions of space

76
Two questions
  • How do we know when the current timestep has
    completed?
  • How and when do we transmit particles among
    processors?

77
Parallelization Strategy
  • To detect termination we divide each timestep
    into sub-timesteps
  • We continue to the next time step only when there
    are no more ligands to update or communicate
    (see the sketch below)
  • Currently implemented with a barrier
  • Aggregate communication of ligands to amortize
    message starts
  • Buffers and message lengths are scaled automatically
  • Uniform static decomposition
  • Work on dynamic load balancing is underway
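
A minimal MPI sketch of the sub-timestep loop (illustrative, not the MCell-K
code; advance_local_ligands and exchange_boundary_ligands are hypothetical
helpers left as declarations): each sub-timestep ends with a global reduction,
and the timestep completes only when no process has ligands left to update or
communicate.

    // Termination detection for one timestep via repeated sub-timesteps.
    #include <mpi.h>

    // Hypothetical helpers: advance local ligands, returning how many still
    // need processing, and exchange ligands that crossed subdomain boundaries.
    long advance_local_ligands();
    long exchange_boundary_ligands();

    void do_timestep(MPI_Comm comm) {
      while (true) {                               // sub-timesteps
        long local_pending = advance_local_ligands();
        local_pending += exchange_boundary_ligands();

        long global_pending = 0;                   // acts as the synchronization point
        MPI_Allreduce(&local_pending, &global_pending, 1, MPI_LONG, MPI_SUM, comm);
        if (global_pending == 0) break;            // timestep complete everywhere
      }
    }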

78
Software infrastructure: Abstract KeLP
  • A rapid development infrastructure for
    distributed memory machines
  • Implemented as a C++ class library layered on MPI
  • Communication orchestration
  • Manage communication in terms of geometric set
    operations
  • KeLP doesn't need to know how the user
    represents application data structures; the user
    doesn't need to know about the low-level details of
    moving data
  • User-defined container classes (see the sketch below)
  • Wrote a special-purpose molecule class
  • Callbacks to handle data packing and unpacking
  • Simple interface
  • Clean separation of parallelism from other code
  • Small change from the original serial code
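
A hypothetical sketch of such a user-defined container (not the actual
MCell-K/KeLP molecule class): the pack callback serializes outgoing molecules
into a flat buffer that the communication layer can ship without knowing the
representation, and the unpack callback appends received molecules.

    // Illustrative molecule container with pack/unpack callbacks.
    #include <cstring>
    #include <vector>

    struct Molecule { double x, y, z; int species; };

    class MoleculeContainer {
    public:
      void add(const Molecule& m) { mols_.push_back(m); }

      // Pack callback: serialize outgoing molecules into a flat byte buffer.
      std::vector<char> pack() const {
        std::vector<char> buf(mols_.size() * sizeof(Molecule));
        std::memcpy(buf.data(), mols_.data(), buf.size());
        return buf;
      }

      // Unpack callback: append molecules received from another process.
      void unpack(const std::vector<char>& buf) {
        size_t n = buf.size() / sizeof(Molecule);
        size_t old = mols_.size();
        mols_.resize(old + n);
        std::memcpy(mols_.data() + old, buf.data(), n * sizeof(Molecule));
      }

    private:
      std::vector<Molecule> mols_;
    };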

79
Computational results
  • Chick ciliary ganglion
  • 400k surface triangles
  • 192 release sites (max of 550)
  • Each site releases 5000 ligands at t = 0 (960k
    total)
  • 2500 time steps
  • Persistent ligand case
  • Enzymes that destroy ligands are made less
    effective
  • Most ligands are present at the end of the
    simulation
  • Report summary statistics in epochs of 100 time
    steps
  • Ran on NPACI Blue Horizon with 16, 32, and 64 processors

80
Performance on NPACI Blue Horizon
  • Running times scale well

81
Parallel Efficiency
  • Communication costs for this algorithm are low on
    Blue Horizon
  • Communicating a few thousand molecules
  • An all-reduce takes a few hundred µs for 64 procs
  • Each time step requires 1 s of computation

82
Performance Prediction
  • Running times are predicted well by maximum
    ligands per processor

83
Load imbalance
  • Maximum load close to 2x average load.

84
Uneven workload distributions
  • Loads vary significantly, dynamically

85
Load Balancing
  • 1 release site, 10,000 molecules
  • 8 simulated processors

86
Load Balancing - Ganglion
  • 18 release sites, 1000 molecules each
  • 8 simulated processors

87
The Future
  • Ligands may bounce across processor boundaries
  • Detecting termination is expensive
  • Lost opportunities due to load imbalance
  • Motivates asynchronous execution and novel
    scheduling
  • New project: Tarragon, NSF ITR

Wire frame view of rat diaphragm synapse courtesy
Tom Bartol and Joel Stiles
88
Non-BSP programming with Tarragon
  • Tarragon employs a task graph model of execution
  • Couples task completion with communication
  • Tolerates unpredictable or irregular task and
    communication latencies
  • Different from traditional BSP programming
  • Arriving data triggers computation
  • Task completion triggers communication
  • Testbed for exploring communication tolerant
    algorithms (linear algebra, data assimilation)

89
Asynchronous computation with Tarragon
  • ITR: Asynchronous execution for scalable
    simulation of cell physiology
  • Cleaner treatment of migrating particles
  • Change owners dynamically
  • Avoid sub-timestepping, which exacerbates load
    imbalance
  • Many-to-one task assignments
  • Automated load balancing via workload migration
  • Finer-grained intermittent communication

90
Current and Future Work
  • Asynchronous computation
  • Large scale simulations
  • Load balancing
  • Predictive modeling (U. Rao Venkata)
  • Parameter sweep

91
Dynamic data driven applications
  • Using Tarragon's data-driven programming model,
    we can couple external data sources into ongoing
    computation
  • Work in progress, 2 applications
  • Dynamic clamping of neurons (Bartol, Sejnowski,
    Salk)
  • Feedback MCell simulations of neural
    microphysiology into living neurons in vitro via
    patch clamping
  • Living and simulated neurons are (virtually) part
    of the same circuit
  • Interactive ray tracing of dynamic scenes (H.
    Jensen, UCSD)
  • Change the lighting, scene, camera angle, ...
  • Need an interactive feel

92
Dynamic scenes
  • Original scene (courtesy Henrik Jensen)

93
Dynamic scenes
  • Adding an object to the scene

94
Dynamic scenes
  • Changing the camera angle

95
Conclusions and the Future
  • We've seen two techniques for coping with
    technological change
  • SCALLOP: a new numerical algorithm
  • Tarragon: a new execution model (partitioning and
    scheduling)
  • Each technique requires appropriate
    infrastructure to contend with the rising costs
    of data motion
  • Generalizations
  • Very long latencies (on the order of
    milliseconds)
  • Application coupling
  • Incorporate dynamic data sources into ongoing
    computation
  • What are the roles of libraries and programming
    languages?

96
Conclusions
  • We've looked at an asynchronous data-driven
    programming model with motivating applications
  • Communication tolerance
  • Dynamic data-driven applications that couple
    simulations with the real world
  • An appropriate programming model simplifies the
    design
  • Scheduling is important, too
  • What are the roles of libraries and programming
    languages?


98
Acknowledgements and support
  • Support
  • NSF ACI-0326013, ACI-9619020, IBN-9985964
  • Howard Hughes Medical Institute
  • University of California, San Diego
  • San Diego Supercomputer Center (SDSC)
  • Cal-(IT)2
  • DoE (ISCR, CASC)
  • EPSRC (visits to Imperial College, UK)
  • Papers and software: http://www-cse.ucsd.edu/groups/hpcl/scg/

99
Technology transitions
  • The KeLP technology is employed in the CHOMBO
    structured adaptive mesh refinement (SAMR)
    infrastructure (P. Colella, LBNL)
  • The technology is also employed in the SAMRAI
    infrastructure for SAMR (S. Kohn, R. Horning,
    LLNL)

100
Applications
  • MCell: cell microphysiology (T. Sejnowski, T.
    Bartol (Salk), J. R. Stiles (PSC))
  • First-principles simulations of real materials
    using structured adaptive mesh refinement (J.
    Weare et al.)
  • Mortar space method for subsurface modeling (M.
    F. Wheeler, TICAM); production code called
    UTPROJ3D
  • Data management
  • Compression in direct numerical simulation of
    turbulence (K. K. Nomura, P. Diamessis, W.
    Kerney)
  • Querying structured adaptive mesh refinement
    datasets (J. Saltz, T. Kurc, OSU; P. Colella,
    LBNL)
  • KeLP I/O: target for the telescoping compiler (B.
    Broom, R. Fowler, K. Kennedy, Rice)

101
A cast of many
  • Scott Kohn
  • Stephen J. Fink
  • Frederico Sacerdoti
  • Daniel Shalit
  • Urvashi Rao Venkata
  • Jake Sorensen
  • Pietro Cicotti
  • Paul Kelly (Imperial College)