Title: Applications, scalability, and technological change
1. Applications, scalability, and technological change
- Scott B. Baden, Gregory T. Balls
- Dept. of Computer Science and Engineering - UCSD
- Phillip Colella
- Advanced Numerical Algorithms Group - LBNL
2. Asynchronous computation and a data-driven model of execution
- Scott B. Baden
- Dept. of Computer Science and Engineering
University of California, San Diego
3. Motivation
- Petascale architectures motivate the design of new algorithms and programming models to cope with technological evolution
- The growing processor-memory gap continues to raise the cost of communication
- Amdahl's law amplifies the cost of resource contention
- Reformulate the algorithm to
- Reduce the amount of communication
- Reduce the cost
4. Motivating applications
- Elliptic solvers
- High communication overheads due to global coupling
- Low ratio of floating-point operations to memory accesses
- Asynchronous algorithms
- Brownian dynamics for cell microphysiology
- Dynamic data assimilation
5. Roadmap
- SCALLOP
- A highly scalable, infinite domain Poisson solver
- Written in KeLP
- Asynchronous algorithms with Tarragon
- Non-BSP programming model
- Communication overlap
6. Infinite Domain Poisson Equation
- SCALLOP is an elliptic solver for constant coefficient problems in 3D
- Free space boundary conditions
- We consider the Poisson equation Δφ = ρ(x,y,z) (stated compactly below)
- with infinite domain boundary conditions
- R is the total charge
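Stated compactly, the problem reads as follows; the far-field form is my reading of the free-space condition (it follows from the free-space Green's function), not text taken from the slide:

    \Delta \varphi = \rho(x,y,z) \quad \text{in } \mathbb{R}^3, \qquad
    \varphi(\mathbf{x}) \to -\frac{R}{4\pi|\mathbf{x}|} \;\text{ as } |\mathbf{x}| \to \infty, \qquad
    R = \int \rho \, dV .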
7. Infinite domain BCs in practice
- Infinite domain BCs arise in various applications
- Modeling the human heart (Yelick, Peskin, and McQueen)
- Astrophysics (Colella et al.)
- Computing infinite domain boundary conditions is expensive, especially on a parallel computer
- Alternatives
- Extending the domain
- Periodic boundary conditions
8. Elliptic regularity
- The Poisson equation Δφ = ρ(x,y,z)
- Let's assume that ρ = f(x,y,z) for (x,y,z) ∈ Ω
- Ω is the set of points where ρ ≠ 0, i.e., supp(ρ)
- The solution φ is C∞ outside of Ω
- We can represent φ at a lower numerical resolution outside Ω than we can inside Ω
[Figure: domain D containing the charge support Ω]
9. Elliptic regularity
- Superposition and linearity
- We can divide D into D_1 ∪ D_2 ∪ … ∪ D_n
- To get the solution over D, we sum the solutions over the D_i due to the charges ρ_i in each D_i (see the identity below)
- The solution φ_i is C∞ outside Ω_i
- We can represent φ at a lower numerical resolution outside Ω than we can inside Ω
[Figure: domain D partitioned into subdomains D_i, with charge support Ω]
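Written out, the superposition identity these bullets rely on (in the notation above) is:

    \Delta \varphi_i = \rho_i \quad (i = 1,\dots,n), \qquad
    \rho = \sum_{i=1}^{n} \rho_i
    \;\Longrightarrow\;
    \varphi = \sum_{i=1}^{n} \varphi_i ,

with each φ_i smooth outside its own support Ω_i.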
10. SCALLOP
- Exploits elliptic regularity to reduce communication costs significantly
- Barnes-Hut (1986), Anderson's MLC (1986), FMM (1987), Bank-Holst (2000), Balls and Colella (2002)
- Our contribution: extension of these ideas to finite difference problems in three dimensions
11. Domain Decomposition Strategy
- Divide problem into subdomains
- Use a reduced description of far-field effects
- Stitch solutions together
12. Comparison with Traditional Domain Decomposition Methods
- E.g., Smith and Widlund
- Multiple iterations between local and nonlocal domains
- Multiple communication steps
- SCALLOP employs a fixed number (2) of communication steps
13. Comparison with Traditional Domain Decomposition Methods
- Construct a dense linear system for the degrees of freedom on the boundaries between subdomains using a Schur complement (Smith and Widlund)
- Multiple iterations between local and nonlocal domains
- Multiple communication steps
- SCALLOP employs a fixed number of communication steps
14. SCALLOP in Context
- Finite element methods
- Bank-Holst (2000)
- Particle methods
- Fast Multipole Method (Greengard and Rokhlin, 1987)
- Users pay a computational premium in exchange for parallelism
- Method of Local Corrections (Anderson, 1986)
- Not well suited to finite-difference calculations: difficult to generate suitable derivatives
15. Domain Decomposition Definitions
- N^3 is the global problem size
- Divided into q^3 subdomains
- The number of processors must divide q^3 evenly
- Local mesh of size (N/q)^3
- Coarse mesh of size (N/C)^3
- C is the coarsening factor
- In this 2-D slice, N = 16, q = 2, and C = 4
- 16^3 mesh split over 8 processors
- Local mesh: 8^3
- Coarse mesh: 4^3
16. The SCALLOP Domain Decomposition Algorithm
- Five-step algorithm, 2 communication steps
- Serial building blocks:
- Dirichlet solver (FFTW)
- Infinite domain solver (built on the Dirichlet solver)
- Two complete Dirichlet solutions on slightly enlarged domains
- The infinite domain boundary calculation consumes most of the running time
17. Domain Decomposition Algorithm
- 1. On each subdomain, solve an infinite domain problem, ignoring all other subdomains, and create a coarse representation of the charge.
- O((N/q)^3) parallel running time, no communication
18. Domain Decomposition Algorithm
- 2. Aggregate all the coarse charge fields into one global charge.
- O((Nq/C)^3), all-to-all communication
19. Domain Decomposition Algorithm
- 3. Calculate the global infinite domain solution. (Duplicate solves on all processors)
- O((N/C)^3) running time, no communication
20. Domain Decomposition Algorithm
- 4. Compute boundary conditions for the final local solve. Neighbors exchange boundary data of local solutions and combine local fine grids with the global coarse grid.
- O(1) running time, nearest-neighbor communication
21. Domain Decomposition Algorithm
- 5. Solve a Dirichlet problem on each subdomain to obtain the local portion of the infinite domain solution.
- O((N/q)^3) running time, no communication
22. Domain Decomposition Algorithm
- 1. Initial solution: O((N/q)^3)
- 2. Aggregation: O((Nq/C)^3)
- 3. Global coarse solution: O((N/C)^3)
- 4. Local correction: O((N/q)^3) (less than the ID solution)
- 5. Final calculation: O(1)
Overall: O((N/q)^3 + (N/C)^3) work, two communication steps (see the driver sketch below)
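A structural sketch of the five steps summarized above. Every name and stub here is an illustrative placeholder of my own, not the SCALLOP/KeLP API, and the stub bodies stand in for the serial FFTW-based solvers:

    // Sketch of the SCALLOP driver: 5 steps, 2 communication steps.
    #include <vector>

    struct Grid3D {
      int n = 0;                  // points per side
      std::vector<double> data;   // flattened field (charge or potential)
    };

    // Step 1: local infinite-domain solve plus a coarse charge representation (stubs).
    Grid3D solveLocalInfiniteDomain(const Grid3D& rhoLocal) { return Grid3D{}; }
    Grid3D coarseChargeOf(const Grid3D& rhoLocal, int C)    { return Grid3D{}; }
    // Step 2: the single all-to-all communication step (stub).
    Grid3D allGatherCoarseCharge(const Grid3D& coarseLocal) { return Grid3D{}; }
    // Step 3: global coarse infinite-domain solve, duplicated on every processor (stub).
    Grid3D solveGlobalCoarse(const Grid3D& rhoCoarseGlobal) { return Grid3D{}; }
    // Step 4: nearest-neighbor exchange to assemble Dirichlet boundary values (stub).
    Grid3D boundaryFrom(const Grid3D& phiLocal, const Grid3D& phiCoarse) { return Grid3D{}; }
    // Step 5: local Dirichlet solve on the subdomain (stub).
    Grid3D solveDirichlet(const Grid3D& rhoLocal, const Grid3D& bc) { return Grid3D{}; }

    Grid3D scallopSolve(const Grid3D& rhoLocal, int C) {
      Grid3D phiLocal  = solveLocalInfiniteDomain(rhoLocal);  // O((N/q)^3), no communication
      Grid3D rhoCoarse = coarseChargeOf(rhoLocal, C);
      Grid3D rhoGlobal = allGatherCoarseCharge(rhoCoarse);    // all-to-all communication
      Grid3D phiCoarse = solveGlobalCoarse(rhoGlobal);        // O((N/C)^3), no communication
      Grid3D bc        = boundaryFrom(phiLocal, phiCoarse);   // nearest-neighbor communication
      return solveDirichlet(rhoLocal, bc);                    // O((N/q)^3), no communication
    }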
23. Computational Tradeoffs
- Accuracy is only weakly dependent on C
- Goal: minimize the cost of the global coarse-grid solve
- C ≥ 2q
- Global coarse work is less than 1/8 of the local fine work (see the arithmetic below)
- O((N/q)^3 + (N/C)^3) ≈ O((N/q)^3)
- For the current implementation, large C leads to extra local fine grid work
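The 1/8 figure follows directly from the choice of C; taking the equality case C = 2q:

    \left(\frac{N}{C}\right)^3 = \left(\frac{N}{2q}\right)^3
      = \frac{1}{8}\left(\frac{N}{q}\right)^3 ,
    \qquad\text{so}\qquad
    O\!\left(\left(\tfrac{N}{q}\right)^3 + \left(\tfrac{N}{C}\right)^3\right)
      \approx O\!\left(\left(\tfrac{N}{q}\right)^3\right).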
24. Subdomain Overlap
- To ensure smooth solutions and accurate interpolation, the local (fine) domains need to overlap
- The overlap is measured in units of the coarse grid spacing
- For large refinement ratios, the overlap (in terms of fine grid points) gets very large
- Here we see the domain of influence of a fine mesh cell
25. Overheads
- SCALLOP performs 3 solves on slightly enlarged local domains
- Communication occurs in a fixed number of steps
- The infinite domain BC computation performs global communication on a reduced description of the data
26. Analytic performance model for computational overheads
- Let T_SID(N) = serial infinite domain solve on an N^3 mesh (step 1); ID BCs account for 92% of the time
- Global coarse-grid solve: T_SID(N/C); with C = 2, this is about (1/8) T_SID(N)
- Final solve: 0.08 T_SID (we don't need to compute ID BCs!)
- Total cost: 1.2 T_SID (see the sum below)
- If we could reduce the cost of the ID BC computation to zero, the total is at worst 2.0 T_SID
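Adding up the bulleted costs (my arithmetic, using the figures above):

    T_{\text{total}} \approx
      \underbrace{T_{SID}(N)}_{\text{step 1}}
      + \underbrace{\tfrac{1}{8}\,T_{SID}(N)}_{\text{coarse solve}}
      + \underbrace{0.08\,T_{SID}(N)}_{\text{final Dirichlet solve}}
      \approx 1.2\,T_{SID}(N).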
27. Computational Overhead
- Global coarse-grid solve on a grid of size (N/C)^3: small if N/C is small
- Extra computation due to overlap of the fine-grid domains: small if C is reasonably small
- Two fine-grid calculations (complete solutions, not just smoothing steps or V-cycles): unavoidable, but the final Dirichlet solution is less costly than a full infinite domain solution
28. Limitations of Current Implementation
- Earlier we mentioned that we require C ≥ 2q
- For interpolation in step 4, we require a border of 2 coarse grid cells around each subdomain
- To obtain those coarse values, we currently use the local fine grid ID solution from step 1
- We thus require a local mesh of size N_f,G ≥ N_f + 4C
- Our analytic performance model assumes that the local extended mesh size is N_f,G ≈ 1.2 N_f
- But as q grows, N_f + 4C > 1.2 N_f
- Computational work is not strictly N^3
29. Limitations of Current Implementation
- How does 1.2 N_f ≥ N_f + 4C constrain us?
- Take, as before,
- C = 2q
- 1.2 N_f ≥ N_f + 8q
- 1.2 ≥ 1 + 8q/N_f
- q ≤ N_f / 40
- For N_f = 160, q ≤ 4
- Without some tradeoffs, we're limited to q^3 = 64 processors
30. Alternate Implementation
- The necessary coarse grid values can be computed during the infinite domain boundary calculation, without calculating the corresponding fine grid values
- Local domain sizes are kept reasonably small
- No longer N_f,G = max(1.2 N_f, N_f + 4C),
- just N_f,G = 1.2 N_f
- All computational costs are strictly O(N^3)
- New serial ID solver tested; parallel implementation underway
31. Limit of Parallelism
- We are limited only by the maximum coarsening factor: at least 1 coarse cell per local domain.
- N_c ≥ q, i.e., N/C ≥ q
- If we take C = 2q and N_f = N/q, as before,
- N_f ≥ 2q
- Total problem size and parallelism are now a function of the local memory available
- For N_f = 128,
- q ≤ 64, q^3 = 262,144 processors
32. Experiments
- Ran on two SP systems with Power3 CPUs
- NPACI's Blue Horizon
- NERSC's Seaborg
- Used a serial FFT solver implemented with FFTW
- Compiled with -O2, standard environment
33. Scaled Speed-up
- Try to maintain constant work per processor
- Number of processors, P, proportional to
- N^3, q^3, C^3
- We report performance in terms of grind time
- Ideally this should be a constant
T_grind = T / N^3
34. Results: Seaborg
[Charts: grind times and communication percentage]
- Grind time increases by a factor of 2.4 over a range of 16 to 1024 processors on Seaborg.
- Communication takes less than 12% of the running time.
35. Implementation
- SCALLOP was implemented with KeLP
- A rapid development infrastructure for distributed memory machines
- KeLP simplifies the expression of coarse-to-fine grid communication
- Bookkeeping
- Domains of dependence
- KeLP provides useful abstractions
- Set operations on geometric domains (FIDIL, BoxLib, Titanium)
- Express communication in geometric terms
- Separation of concerns
- KeLP is unaware of the representation of user data structures
- The user is unaware of the low-level details involved in moving data
36. The KeLP Data Motion Model
- The user defines persistent communication objects customized for regular section communication
- Replace low-level point-to-point messages with high-level geometric descriptions of data dependences
- Optimizations
- Execute asynchronously to overlap with computation
- Modify the dependences with meaning-preserving transformations that improve performance
37. KeLP's view of communication
- Communication exhibits collective behavior, even if not all pairs of processors are communicating
- The data dependences have an intuitive geometric structure involving regular section data motion within a global coordinate system
38. KeLP's Structural Abstractions
- Distributed patches of storage living in a global coordinate system, each with its own origin
- Geometric metadata describing the structure of blocked patches of data and of data dependences
- A geometric calculus for manipulating the metadata (sketched below)
- The unit of dependence is a regular section
39. Abstract representation
- The dependence structure and the data are abstract
- KeLP doesn't say how the data are represented nor how the data will be moved
- The user provides rules to instantiate and flatten a subspace of Z^n
40. Examples
- Define a grid over an irregular subset of a bounding rectangle (Colella and van Straalen, LBNL)
- Particles
- We might represent these internally with trees, hash tables, etc.
- KeLP enforces the model that we move data lying within rectangular subspaces
41. Summing up SCALLOP
- A philosophy for designing algorithms that embraces technological change
- Sophisticated algorithms that replace (expensive) communication with (cheaper) computation
- To develop these algorithms, we need appropriate infrastructure (KeLP is another talk)
- Scaling to larger problems is underway
- Reducing the effective cost of domain overlap
- Reducing the cost of the infinite domain boundary calculation
- Extension to adaptive mesh refinement algorithms
42. Roadmap
- SCALLOP
- A highly scalable, infinite domain Poisson solver
- Written in KeLP
- Asynchronous algorithms with Tarragon
- Non-BSP programming model
- Communication overlap
43. Roadmap
- SCALLOP: a highly scalable, infinite domain Poisson solver written in KeLP
- Asynchronous algorithms with Tarragon
- Communication overlap
- Monte Carlo simulation of cell microphysiology
44. Performance Robustness in the Presence of Technological Change
- The recipe for writing high-quality application software changes over time
- Either the application must be capable of responding to change
- Or it will have to be reformulated
- We've just looked at a numerical technique for dealing with approach 2
- Now let's consider a non-numerical approach
- Application: overlapping communication with computation
45. Canonical variants
- Many techniques are aimed at enhancing memory locality within a single address space
- ATLAS [Dongarra et al. 98], PHiPAC [Demmel et al. 96], Sparsity [Demmel & Yelick 99], FFTW [Frigo & Johnson 98]
- Architectural cognizance [Gatlin & Carter 99]
- DESOBLAS [Beckmann and Kelly, LCPC 99]: delayed evaluation of task graphs
- But the rising cost of data transfer is also a concern
- We'll explore a canonical variant for overlapping computation with interprocessor communication on MIMD architectures
46. What's difficult about hiding communication?
- The programmer must hard-code the overlap technique into the application software
- The required knowledge is beyond the experience of many application programmers
- The specific technique is sensitive to the technology and the application, hence the code is not robust
47. Motivating application
- Iterative solver for Poisson's equation in 3 dimensions
- Jacobi's method, 7-point stencil
- for (i,j,k) in 1..N x 1..N x 1..N
- u[i,j,k] = (u[i-1,j,k] + u[i+1,j,k] + u[i,j-1,k] + u[i,j+1,k] + u[i,j,k+1] + u[i,j,k-1]) / 6
48. Traditional SPMD implementation
- Decompose the domain into subregions, one per process
- Transmit halo regions between processes
- Compute the inner region after communication completes (see the MPI sketch below)
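A minimal MPI sketch of the non-overlapped pattern just described, using a 1-D slab decomposition of the 3-D grid for brevity. This is my reconstruction for illustration, not the code used in the talk:

    #include <mpi.h>
    #include <vector>

    // u is stored with one ghost plane on each side: (nxLocal+2) x ny x nz,
    // index (i*ny + j)*nz + k, with i = 0 and i = nxLocal+1 the ghost planes.
    inline int idx(int i, int j, int k, int ny, int nz) { return (i*ny + j)*nz + k; }

    // 7-point Jacobi sweep over the local slab (j,k boundaries held fixed).
    void jacobiStep(const std::vector<double>& u, std::vector<double>& unew,
                    int nxLocal, int ny, int nz) {
      for (int i = 1; i <= nxLocal; ++i)
        for (int j = 1; j < ny - 1; ++j)
          for (int k = 1; k < nz - 1; ++k)
            unew[idx(i,j,k,ny,nz)] =
              ( u[idx(i-1,j,k,ny,nz)] + u[idx(i+1,j,k,ny,nz)]
              + u[idx(i,j-1,k,ny,nz)] + u[idx(i,j+1,k,ny,nz)]
              + u[idx(i,j,k-1,ny,nz)] + u[idx(i,j,k+1,ny,nz)] ) / 6.0;
    }

    void relaxOnce(std::vector<double>& u, std::vector<double>& unew,
                   int nxLocal, int ny, int nz, MPI_Comm comm) {
      int rank, nprocs;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);
      int up   = (rank + 1 < nprocs) ? rank + 1 : MPI_PROC_NULL;  // owner of the next slab
      int down = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;  // owner of the previous slab
      int plane = ny * nz;

      // 1. Blocking halo exchange: no overlap with computation in this variant.
      MPI_Sendrecv(&u[idx(1, 0, 0, ny, nz)],           plane, MPI_DOUBLE, down, 0,
                   &u[idx(nxLocal + 1, 0, 0, ny, nz)], plane, MPI_DOUBLE, up,   0,
                   comm, MPI_STATUS_IGNORE);
      MPI_Sendrecv(&u[idx(nxLocal, 0, 0, ny, nz)],     plane, MPI_DOUBLE, up,   1,
                   &u[idx(0, 0, 0, ny, nz)],           plane, MPI_DOUBLE, down, 1,
                   comm, MPI_STATUS_IGNORE);

      // 2. Only after the exchange completes do we sweep the stencil.
      jacobiStep(u, unew, nxLocal, ny, nz);
    }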
49. Multi-tier Computers
- High opportunity cost of communication
- Hierarchical organization amplifies node performance relative to the interconnect
- Trends: more processors per node, faster processors
- r∞ = DGEMM floating point rate per node, MFLOP/s
- β∞ = peak point-to-point MPI message bandwidth, MB/s
- IBM SP2/Power2SC: r∞ = 640, β∞ = 100
- NPACI Blue Horizon: r∞ = 14,000, β∞ = 400
- NPACI DataStar: r∞ = 48,000, β∞ ≈ 800 (ratios computed below)
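The trend the slide points to is visible in the ratio r∞/β∞ (my arithmetic from the figures above, in flops per byte of MPI bandwidth):

    \frac{r_\infty}{\beta_\infty}:\qquad
    \frac{640}{100} \approx 6,\qquad
    \frac{14{,}000}{400} = 35,\qquad
    \frac{48{,}000}{800} = 60 .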
50. Overlapped variant
- Reformulate the algorithm
- Isolate the inner region from the halo
- Execute communication concurrently with computation on the inner region
- Compute on the annulus when the halo exchange finishes
51. Overlapped code (KeLP2)

  Relax(Distributed_Data X, Mover Communication)
    Communication.start()
    for each subdomain x in X
      Update x
    Communication.wait()
    // Repeat over the annulus

- Implemented with KeLP2 [Fink 98, SC99]
- KeLP2 implements a message proxy to realize overlap
- It also provides hierarchical control flow
52. Performance on 8 nodes of Blue Horizon
- With KeLP2 [Fink 98, SC99]
[Chart: running times for the HAND, ST, MT(8), MTV(7), and MTV(7) OPT variants; reported values include 732, 713, 655, and 626 (14)]
53. Observations
- We had to hard-code the overlap strategy, as well as the parallel control flow, into the application
- Split-phase communication, scheduling, complicated partitioning
- The optimal ordering of communication and computation varies across generational changes in technology
- The characteristic communication delays increase relative to those of computation
- The costs may be irregular
- The hard-coded strategy imposes unnecessary constraints
- Computation and communication are only partially ordered if you decrease the granularity of the computation
- Applications are rich in potential parallelism
54. Tarragon: an alternative approach
- Testbed for exploring communication-tolerant algorithms
- Asynchronous task graph model of execution
- Data-driven: departs from the traditional bulk synchronous model
- Communication and computation do not execute as distinct phases but are coupled activities
- Tolerates unpredictable or irregular task and communication latencies
55. Data-driven execution
- Overdecompose the problem so that each process owns several tasks
- Construct a task graph indicating the data dependences
- A task suspends until the required communication completes, at which point the task is runnable
- The Tarragon run-time system schedules runnable tasks according to the flow of data in the task graph (see the toy sketch below)
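As a toy, single-process illustration of this execution model (not the Tarragon API): each task carries a count of outstanding inputs; arriving data decrements the count, and tasks whose count reaches zero move to a ready queue and run.

    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <vector>

    struct Task {
      int pendingInputs;                 // messages still to arrive before the task can run
      std::function<void()> body;        // the computation
      std::vector<int> successors;       // tasks that receive this task's output
    };

    struct Scheduler {
      std::vector<Task> tasks;
      std::queue<int> ready;

      // Called when data destined for task t arrives (e.g., by a mover thread).
      void deliver(int t) {
        if (--tasks[t].pendingInputs == 0) ready.push(t);
      }

      // Run tasks as they become runnable; each completed task "sends" to its successors.
      void run() {
        for (std::size_t t = 0; t < tasks.size(); ++t)
          if (tasks[t].pendingInputs == 0) ready.push(static_cast<int>(t));
        while (!ready.empty()) {
          int t = ready.front(); ready.pop();
          tasks[t].body();
          for (int s : tasks[t].successors) deliver(s);
        }
      }
    };

    int main() {
      // A three-task chain: 0 -> 1 -> 2.
      Scheduler sch;
      sch.tasks = {
        { 0, []{ std::puts("task 0"); }, {1} },
        { 1, []{ std::puts("task 1"); }, {2} },
        { 1, []{ std::puts("task 2"); }, {}  },
      };
      sch.run();
      return 0;
    }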
56. Tarragon in Context
- Data-driven techniques are used in dataflow, databases, and data-intensive applications (DataCutter, ADR)
- Charm [Kale 93]
- Parallelism expressed across object collections by making remote method invocations (message passing)
- Global name space
- Tarragon
- Functions operate on local data only; data motion is explicit
- Tune performance by adjusting task granularity and by decorating the graph with performance metadata
57. Tarragon API
- We express parallelism in an abstract form
- A task graph describes the partial ordering of tasks
- Vertices → computation
- Edges → dependences
- A background thread called the mover-dispatcher provides available tasks and processes completions
58. The Mover-Dispatcher
- Processes incoming and outgoing communication
- Determines when tasks are ready
- Calls a scheduler to determine the order of ready-task execution
- Labels on the task graph guide the scheduling process
- Processes completions
- The completion handler is a user-defined callback that invokes one-sided communication
59. A look inside the Run Time System
- E: execution engines
- M: mover/dispatcher
[Diagram: two nodes (N_A, N_B), each with execution engines (E) and a mover/dispatcher (M); tasks move among Ready (Rdy), Run, and Done queues]
60. Benefits
- Tolerate unpredictable or irregular latencies at different scales
- Communication and computation are coupled activities rather than distinct phases
- Tune slackness to improve communication pipelining
- Flexible scheduling
- Schedulers may be freely substituted, and may be application-specific (AppLeS, Berman)
- Performance metadata enable us to alter the execution order without having to change the scheduler
- The run-time system optimizes execution ordering without entailing heroic reprogramming
61. Slackness
- Multiple tasks per processing module
- Improve communication pipelining: communication occurs incrementally and in parallel with computation
- Tolerate irregular communication delays
- Treat load balancing as a scheduling activity (migration)
62. First steps
- KeLP2: 4 applications formulated for overlap [Fink and Baden 1997, Baden and Fink 1998]
- Quantum KeLP: F. David Sacerdoti (MS 02, SIAM 03)
- Overdecomposed workload
- Load balancer migrates work grains between processors
63. Summary
- Asynchronous task graph execution model
- Non-bulk-synchronous execution model
- Communication and computation are coupled activities rather than distinct phases
- Tolerates unpredictable or irregular task and communication latencies
- Performance metadata decorate the graph to provide scheduling hints
- Generalizations
- Very long latencies (on the order of milliseconds)
- Application coupling
- Incorporate dynamic data sources into ongoing computations
64. Roadmap
- SCALLOP: a highly scalable, infinite domain Poisson solver written in KeLP
- Asynchronous algorithms with Tarragon
- Communication overlap
- Monte Carlo simulation of cell microphysiology
65. MCell
- Monte Carlo simulator of cellular microphysiology
- Biochemical reaction dynamics in realistic 3D microenvironments
- Brownian dynamics of individual molecules and their chemical interactions
- Developed at the Salk Institute and the Pittsburgh Supercomputing Center by Tom Bartol and Joel Stiles
- 100 users (first released in 1997)
- MCell-K: parallel variant implemented with KeLP (with Tom Bartol and Terrence Sejnowski, Salk)
66. Cell microphysiology simulation
- Collaboration with Tom Bartol, Tilman Kispersky, Terrence Sejnowski (Salk Institute), Joel Stiles (PSC)
- MCell: a general Monte Carlo simulator of cellular microphysiology
- Brownian dynamics: random walk of individual molecules and chemical interactions
- 100 users (first released in 1997)
- MCell-K: parallel variant implemented with KeLP
67. Motivating application
- Cerebellar glomerulus
- 2 CPU-months on a single processor
- 24 GB of RAM
- 20 million Ca2+ ions, 10 million polygons
- With serial MCell
- Run 1/8 of the domain of the problem on a single processor
- Reduced resolution
- Scalable KeLP version: MCell-K
- Running on up to 128 processors on Blue Horizon
- Collaboration involving Greg Balls (UCSD), Srinivas Turaga, Tilman Kispersky (UCSD/Salk), Tom Bartol (Salk), Terry Sejnowski (Salk)
68. Animation
- Simulation of a chick ciliary ganglion synapse
- A real-world problem
- 400,000 polygons in the surface
- Approximately 40,000 molecules diffusing
- Approximately 500,000 surface receptors
69. Chick ciliary ganglion synapse
Receptors
Chick ciliary ganglion synapse courtesy of Darwin Berg, Jay Coggan, Mark Ellisman, Eduardo Esquenazi, Terry Sejnowski, Tom Bartol
70. Chick ciliary ganglion synapse
Ligands
71. Diffusion and Interactions
- Ligands: neurotransmitter molecules
- Bind to sites under constraints
- Bounce off of surfaces
- Uneven distributions in space and time
[Figure annotation: release sites]
72. On a parallel computer
- The partition boundary splits up the problem over multiple processors
- As ligands cross a processor boundary, we color them yellow
[Figure annotation: bound ligands]
73. Animation
74. Movie
75. Issues in Parallelization
- Particles move over a sequence of timesteps
- They react with embedded 2D surfaces (cell membranes)
- Processor boundaries introduce uncertainties in handling communication and the need to detect termination
- A particle may bounce among processors owning nearby regions of space
76. Two questions
- How do we know when the current timestep has completed?
- How and when do we transmit particles among processors?
77. Parallelization Strategy
- To detect termination, we divide each timestep into sub-timesteps
- We continue to the next timestep only when there are no more ligands to update or communicate (see the sketch below)
- Currently implemented with a barrier
- Aggregate communication of ligands to amortize message starts
- Buffers and message lengths are scaled automatically
- Uniform static decomposition
- Work on dynamic load balancing is underway
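A minimal sketch of the sub-timestep termination test described above, assuming each process can count the ligands it still has to update or ship to a neighbor. The MPI_Allreduce stands in for the barrier-style check, and the two helper functions are placeholders for the MCell-K internals, not its actual code:

    #include <mpi.h>

    long countPendingLigands() { return 0; }   // placeholder: real code counts in-flight ligands
    void processOneSubTimestep() {}            // placeholder: real code moves/ships ligands

    void advanceOneTimestep(MPI_Comm comm) {
      while (true) {
        processOneSubTimestep();
        long localPending  = countPendingLigands();
        long globalPending = 0;
        // Global check: the timestep is done only when every process reports zero.
        MPI_Allreduce(&localPending, &globalPending, 1, MPI_LONG, MPI_SUM, comm);
        if (globalPending == 0) break;         // safe to proceed to the next timestep
      }
    }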
78. Software infrastructure: Abstract KeLP
- A rapid development infrastructure for distributed memory machines
- Implemented as a C++ class library layered on MPI
- Communication orchestration
- Manage communication in terms of geometric set operations
- KeLP doesn't need to know how the user represented application data structures; the user doesn't need to know about the low-level details of moving data
- User-defined container classes
- We wrote a special-purpose molecule class
- Callbacks handle data packing and unpacking
- Simple interface
- Clean separation of parallelism from other code
- Small change from the original serial code
79. Computational results
- Chick ciliary ganglion
- 400k surface triangles
- 192 release sites (max of 550)
- Each site releases 5000 ligands at t = 0 (960k total)
- 2500 time steps
- Persistent ligand case
- Enzymes that destroy ligands are made less effective
- Most ligands are present at the end of the simulation
- Report summary statistics in epochs of 100 time steps
- Ran on NPACI Blue Horizon with 16, 32, 64 processors
80. Performance on NPACI Blue Horizon
81. Parallel Efficiency
- Communication costs for this algorithm are low on Blue Horizon
- Communicating a few thousand molecules
- All-reduce: a few hundred µs for 64 processors
- Each time step requires 1 s of computation
82. Performance Prediction
- Running times are predicted well by the maximum number of ligands per processor
83. Load imbalance
- Maximum load close to 2x average load.
84. Uneven workload distributions
- Loads vary significantly, dynamically
85. Load Balancing
- 1 release site, 10,000 molecules
- 8 simulated processors
86. Load Balancing: Ganglion
- 18 release sites, 1000 molecules each
- 8 simulated processors
87. The Future
- Ligands may bounce across processor boundaries
- Detecting termination is expensive
- Lost opportunities due to load imbalance
- Motivates asynchronous execution and novel scheduling
- New project: Tarragon, NSF ITR
Wire-frame view of a rat diaphragm synapse courtesy of Tom Bartol and Joel Stiles
88. Non-BSP programming with Tarragon
- Tarragon employs a task graph model of execution
- Couples task completion with communication
- Tolerates unpredictable or irregular task and communication latencies
- Different from traditional BSP programming
- Arriving data triggers communication
- Task completion triggers computation
- Testbed for exploring communication-tolerant algorithms (linear algebra, data assimilation)
89. Asynchronous computation with Tarragon
- ITR: asynchronous execution for scalable simulation of cell physiology
- Cleaner treatment of migrating particles
- Change owners dynamically
- Avoid sub-timestepping, which exacerbates load imbalance
- Many-to-one task assignments
- Automated load balancing via workload migration
- Finer-grained intermittent communication
90. Current and Future Work
- Asynchronous computation
- Large scale simulations
- Load balancing
- Predictive modeling (U. Rao Venkata)
- Parameter sweep
91. Dynamic data driven applications
- Using Tarragon's data-driven programming model, we can couple external data sources into an ongoing computation
- Work in progress, 2 applications
- Dynamic clamping of neurons (Bartol and Sejnowski, Salk)
- Feed back MCell simulations of neural microphysiology into living neurons in vitro via patch clamping
- Living and simulated neurons are (virtually) part of the same circuit
- Interactive ray tracing of dynamic scenes (H. Jensen, UCSD)
- Change the lighting, scene, camera angle, etc.
- Need an interactive feel
92. Dynamic scenes
- Original scene (courtesy Henrik Jensen)
93. Dynamic scenes
- Adding an object to the scene
94. Dynamic scenes
- Changing the camera angle
95. Conclusions and the Future
- We've seen two techniques for coping with technological change
- SCALLOP: a new numerical algorithm
- Tarragon: a new execution model (partitioning and scheduling)
- Each technique requires appropriate infrastructure to contend with the rising costs of data motion
- Generalizations
- Very long latencies (on the order of milliseconds)
- Application coupling
- Incorporate dynamic data sources into ongoing computation
- What are the roles of libraries and programming languages?
96. Conclusions
- We've looked at an asynchronous, data-driven programming model with motivating applications
- Communication tolerance
- Dynamic data-driven applications that couple simulations with the real world
- The appropriate programming model simplifies the design
- Scheduling is important, too
- What are the roles of libraries and programming languages?
98. Acknowledgements and support
- Support
- NSF ACI-0326013, ACI-9619020, IBN-9985964
- Howard Hughes Medical Institute
- University of California, San Diego
- San Diego Supercomputer Center (SDSC)
- Cal-(IT)2
- DoE (ISCR, CASC)
- EPSRC (visits to Imperial College, UK)
- Papers and software: http://www-cse.ucsd.edu/groups/hpcl/scg/
99. Technology transitions
- The KeLP technology is employed in the Chombo structured adaptive mesh refinement (SAMR) infrastructure (P. Colella, LBNL)
- The technology is also employed in the SAMRAI infrastructure for SAMR (S. Kohn, R. Hornung, LLNL)
100. Applications
- MCell: cell microphysiology (T. Sejnowski, T. Bartol (Salk), J. R. Stiles (PSC))
- First-principles simulations of real materials using structured adaptive mesh refinement (J. Weare et al.)
- Mortar space method for subsurface modeling (M. F. Wheeler, TICAM; production code called UTPROJ3D)
- Data management
- Compression in direct numerical simulation of turbulence (K. K. Nomura, P. Diamessis, W. Kerney)
- Querying structured adaptive mesh refinement datasets (J. Saltz, T. Kurc, OSU; P. Colella, LBNL)
- KeLP I/O: target for a telescoping compiler (B. Broom, R. Fowler, K. Kennedy, Rice)
101. A cast of many
- Scott Kohn
- Stephen J. Fink
- Frederico Sacerdoti
- Daniel Shalit
- Urvashi Rao Venkata
- Jake Sorensen
- Pietro Cicotti
- Paul Kelly (Imperial College)