Title: Parallel Computing, MPI and FLASH (March 23, 2005)

1 Parallel Computing, MPI and FLASH
March 23, 2005
2 What is Parallel Computing? And why is it useful
- Parallel computing is more than one CPU working together on one problem.
- It is useful when:
  - The problem is large and would take very long on a single processor
  - The data is too big to fit in the memory of one processor
- When to parallelize:
  - When the problem can be subdivided into relatively independent tasks
- How much to parallelize:
  - As long as the speedup relative to a single processor stays of the order of the number of processors (see the definitions below)
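For reference (standard definitions, not from the slides): if \(T(1)\) is the runtime on one processor and \(T(p)\) the runtime on \(p\) processors, then

\[ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} \]

Parallelization pays off while the efficiency \(E(p)\) stays close to 1.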
3 Parallel paradigms
- SIMD: Single Instruction, Multiple Data
  - Processors work in lock-step
- MIMD: Multiple Instruction, Multiple Data
  - Processors do their own thing, with occasional synchronization
- Shared memory
  - One-way communication
- Distributed memory
  - Message passing
- Loosely coupled
  - The process on each CPU is fairly self-contained and relatively independent of processes on other CPUs
- Tightly coupled
  - CPUs need to communicate with each other frequently
4 How to Parallelize
- Divide the problem into a set of mostly independent tasks
  - Partitioning the problem
- Give each task its own data
  - Localize the task
  - Tasks operate on their own data for the most part
  - Try to make each task self-contained
- Occasionally:
  - Data may be needed from other tasks
    - Inter-process communication
  - Synchronization may be required between tasks
    - Global operations
- Map tasks to different processors
  - One processor may get more than one task
  - Task distribution should be well balanced
5 New Code Components
- Initialization
- Query of the parallel state
  - Identify this process
  - Identify the number of processes
- Exchange of data between processes
  - Local, global
- Synchronization
  - Barriers, blocking communication, locks
- Finalization
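A minimal sketch in C of how these components map onto MPI calls (the MPI functions are standard; the program itself is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);                    /* initialization */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);    /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* identify this process */

    printf("Process %d of %d\n", rank, nprocs);

    /* ... exchange data between processes here ... */

    MPI_Barrier(MPI_COMM_WORLD);               /* synchronization */
    MPI_Finalize();                            /* finalization */
    return 0;
}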
6 MPI
- Message Passing Interface: the standard for the distributed-memory model of parallelism
- MPI-2 supports one-sided communication, commonly associated with shared-memory operations
- Works with communicators: collections of processors
  - MPI_COMM_WORLD is the default
- Supports the lowest-level communication operations as well as composite operations
- Has blocking and non-blocking operations
7 Low-level Operations in MPI
- MPI_Init
- MPI_Comm_size
  - Find the number of processors
- MPI_Comm_rank
  - Find my processor number
- MPI_Send / MPI_Recv
  - Communicate with other processors one at a time
- MPI_Bcast
  - Global data transmission
- MPI_Barrier
  - Synchronization
- MPI_Finalize
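An illustrative sketch in C of point-to-point and broadcast communication (the values sent are made up; run with at least two processes):

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, n = 0;
    double x = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* point-to-point: rank 0 sends one double to rank 1 */
    if (rank == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    }

    /* global: rank 0 broadcasts n to every process */
    if (rank == 0) n = 100;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}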
8 Advanced Constructs in MPI
- Composite operations
  - Gather/Scatter
  - Allreduce
  - Alltoall
- Cartesian grid operations
  - Shift
- Communicators
  - Creating subgroups of processors to operate on
- User-defined datatypes
- I/O
  - Parallel file operations
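As an example of a composite operation, a global maximum with MPI_Allreduce (a minimal sketch; the local quantity is a stand-in):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double local_max, global_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_max = (double)rank;   /* stand-in for a locally computed quantity */

    /* combine local maxima; every process receives the global maximum */
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX,
                  MPI_COMM_WORLD);

    if (rank == 0) printf("global max = %g\n", global_max);
    MPI_Finalize();
    return 0;
}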
9 Communicators
(Figure: MPI_COMM_WORLD divided into two subgroups, COMM1 and COMM2)
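A sketch in C of how two such subgroups could be created with MPI_Comm_split; splitting by even/odd rank is an assumption for illustration:

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Comm subcomm;   /* plays the role of COMM1 or COMM2 */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* even ranks form one communicator, odd ranks the other */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &subcomm);

    /* collective operations on subcomm now involve only that subgroup */

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}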
10 Communication Patterns
11 Communication Overheads
- Latency vs. bandwidth
- Blocking vs. non-blocking
  - Overlap
  - Buffering and copying
- Scale of communication
  - Nearest neighbor
  - Short range
  - Long range
- Volume of data
- Resource contention for links
- Efficiency
  - Hardware, software, communication method
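A common first-order cost model for these overheads (standard textbook material, not from the slides): transferring a message of \(n\) bytes takes roughly

\[ T(n) = \alpha + \frac{n}{\beta} \]

where \(\alpha\) is the latency and \(\beta\) the bandwidth. Latency dominates when sending many small messages; bandwidth dominates for a few large ones, which is why aggregating messages often helps.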
12 Parallelism in FLASH
- Short-range communications
  - Nearest neighbor
- Long-range communications
  - Regridding
- Other global operations
  - All-reduce operations on physical quantities
- Specific to solvers
  - Multipole method
  - FFT-based solvers
13 Domain Decomposition
(Figure: the physical domain split among processors P0, P1, P2 and P3)
14 Border Cells / Ghost Points
- When splitting up solnData, data is needed from other processors.
- Each processor needs a layer of cells from its neighbors.
- These layers must be updated each time step.
15 Border/Ghost Cells
Short-range communication
16 Two MPI Methods for Doing It
- Method 1: Cartesian topology (see the sketch below)
  - MPI_Cart_create
    - Create the topology
  - MPE_Decomp1d
    - Domain decomposition on the topology
  - MPI_Cart_shift
    - Who is on the left/right?
  - MPI_Sendrecv
    - Fill ghost cells on the left
  - MPI_Sendrecv
    - Fill ghost cells on the right
- Method 2: manual decomposition
  - MPI_Comm_rank
  - MPI_Comm_size
  - Manually decompose the grid over processors
  - Calculate left/right neighbors
  - MPI_Send / MPI_Recv
    - Ordered carefully to avoid deadlocks
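A condensed sketch in C of the first method (sizes and data are illustrative; for simplicity each process owns a fixed number of cells, so the MPE_Decomp1d step is omitted):

#include <mpi.h>

#define N 64   /* interior cells per process (illustrative) */

int main(int argc, char *argv[])
{
    int rank, nprocs, left, right;
    int dims[1] = {0}, periods[1] = {0};
    double u[N + 2];             /* N interior cells plus 2 ghost cells */
    MPI_Comm cart;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* build a 1D Cartesian topology over all processes */
    MPI_Dims_create(nprocs, 1, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);

    for (int i = 0; i < N + 2; i++)  /* stand-in for real solution data */
        u[i] = rank;

    /* who is on the left/right? domain boundaries get MPI_PROC_NULL */
    MPI_Cart_shift(cart, 0, 1, &left, &right);

    /* send rightmost interior cell right, receive our left ghost cell */
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left,  0, cart, &status);
    /* send leftmost interior cell left, receive our right ghost cell */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                 &u[N + 1], 1, MPI_DOUBLE, right, 1, cart, &status);

    MPI_Finalize();
    return 0;
}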
17 Adaptive Grid Issues
- Discretization is not uniform
- Simple left-right guard cell fills are inadequate
- Adjacent grid points may not be mapped to nearest neighbors in the processor topology
- Redistribution of work is necessary
18 Regridding
- Change in the number of cells/blocks
  - Some processors get more work than others
  - Load imbalance
- Redistribute data to even out the work on all processors
  - Long-range communications
  - Large quantities of data moved
19 Regridding
20 Other Parallel Operations in FLASH
- Global max/sum etc. (Allreduce)
  - Physical quantities
  - In solvers
  - Performance monitoring
- Alltoall
  - FFT-based solver on a uniform grid (UG)
- User-defined datatypes and file operations
  - Parallel I/O
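As an illustration of user-defined datatypes (a generic MPI sketch, not FLASH's actual code): a strided column of a 2D array can be described once and then communicated as a single unit:

#include <mpi.h>

#define NX 8
#define NY 8

int main(int argc, char *argv[])
{
    double a[NX][NY];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);

    for (int i = 0; i < NX; i++)        /* stand-in data */
        for (int j = 0; j < NY; j++)
            a[i][j] = i * NY + j;

    /* NX elements, 1 double each, NY doubles apart: one column of a */
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* the column can now be sent as one unit, e.g.
       MPI_Send(&a[0][0], 1, column, dest, tag, MPI_COMM_WORLD); */

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}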
21 A Little FLASH History
- FLASH0
  - Paramesh2, Prometheus and EOS/Burn
- FLASH1
  - Smoothing out the smash
  - First form of module architecture and inheritance
- FLASH2
  - Untangle modules from each other (Grid)
  - dBase
  - Concept of levels of users
- FLASH3
  - Stricter interface control in the module architecture
  - Taming the database
22 What FLASH Provides
- Physics
  - Hydrodynamics
    - PPM
    - MHD
    - Relativistic PPM
  - Nuclear physics
  - Gravity
  - Cosmology
  - Particles
- Infrastructure
  - Setup
  - AMR: Paramesh
  - Regular testing
  - Parallel I/O
    - hdf5, pnetcdf, ...
  - Profiling
  - Runtime and post-processing visualization
23 FLASH Code Basics
- An application code, composed of units/modules. Particular modules are set up together to run different physics problems.
- Performance, testing, usability, portability
  - Fortran, C, Python, ...
  - More than 560,000 lines of code
    - 75% code, 25% comments
  - Very portable
  - Scales to 1000s of processors
24 Basic Computational Unit: the Block
- The adaptive grid is composed of blocks
- All blocks have the same dimensions
- Blocks cover different fractions of the physical domain
- Blocks at different levels of refinement have different grid spacing
25 Structure of FLASH Modules (not exact!)
(Diagram of the module structure: Hydro provides init(), tstep() and hydro3d(); Gravity provides init(), tstep() and grav3d(); Source_terms provides init(), tstep() and src_terms(); Materials provides eos3d(), eos1d() and eos().)
26 What's a FLASH Module?
- Modules are FLASH's basic architectural unit
  - A component of the FLASH code providing a particular functionality
  - Different combinations of modules are used for particular problem setups
  - Examples: driver, hydro, mesh, dBase, I/O
- Inheritance is faked through the directory structure
- Modules communicate through:
  - The Driver
  - The Variable Database
27 Abstract FLASH2 Module
A FLASH component consists of:
- 1. Meta-data (configuration info)
  - Interface with driver and setup
  - Variable/parameter registration
  - Variable attributes
  - Module requirements
- 2. Interface wrapper
  - Exchange with the variable database
  - Prepares data for the kernels
- 3. Physics kernel(s)
  - Single-patch, single-processor functions
  - Written in any language
  - Can be sub-classed
A FLASH application is a driver plus a collection of FLASH2 modules and the variable database.
28 Module Implementations
- FLASH2 modules are directory trees
  - source/hydro/explicit/split/ppm
- Each level may have source code
  - Source at a level is relevant for all directories/implementations below it
  - Preserves interfaces
  - Allows flexible implementations
29 Inheritance Through Directories: Hydro
(Diagram of the Hydro directory tree)
- Hydro (top of the tree): empty default implementations of init, hydro and tstep
- Hydro/Explicit: replaces tstep, introduces shock; no hydro implemented yet!
- Hydro/Explicit/Split: implements hydro and replaces init; uses the general explicit tstep and shock
- Hydro/Explicit/DeltaForm: a sibling implementation alongside Split
30 The Module Config File
- Declares solution variables and fluxes
- Declares runtime parameters
  - Sets defaults
- Lists required and exclusive modules
- Config files are additive down the directory tree
  - No replacements
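An illustrative sketch of what such a Config file could contain (keywords follow the FLASH conventions as best remembered; the specific names and values here are made up):

REQUIRES driver              # modules this module needs
EXCLUSIVE split deltaform    # implementations that cannot be combined
DEFAULT split                # default implementation choice
VARIABLE dens                # declare a solution variable
PARAMETER gamma REAL 1.6667  # runtime parameter with its default value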
31 Setup: Building an Application
(Diagram: the configuration tool (setup) assembles modules such as Mesh and Database into an application)
32 FLASH Setup Implements the Architecture
- Python code links together the physics and tools needed for a problem
  - object directory
- Traverses the modules to get implementations
- Determines the solution data storage list
- Creates the list of parameters from the modules
- Configures the Makefiles properly
33 Pulling It All Together
- Choose a problem setup
- Run setup to configure that problem
  - Everything goes into a new top-level directory: object
- Make
- Run
  - flash.par for runtime parameters
  - Defaults already set by the particular modules
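Concretely, the workflow could look like the following (the problem name and executable name are examples; details vary between FLASH versions):

./setup sedov -auto   # configure the (example) sedov problem; creates object/
cd object
make                  # build the configured application
./flash2              # run; reads runtime parameters from flash.par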
34 Setups
- A basic problem setup contains:
  - Config file
    - Required physics modules
  - flash.par
    - Default runtime parameter configuration
  - init_block
    - Initial conditions for the problem, set block by block
- Many other files are possible
  - Driver, refinement algorithms, user-defined boundary conditions
- Any files in the setup take precedence
35 Provided Driver
- Provided:
  - Second order, state form, Strang split
- New drivers:
  - Put them in setups
  - Contributions welcome
flash.F90:
  Initialize()
  Loop over timesteps:
    evolvePhysics()
    timestep()
    output()
    visualize()
  End loop
  Finalize()

evolve.F90 (one evolvePhysics step):
  set time step
  hydro, sourceTerms, cosmology, radiation, particles, gravity
  set time step, repeat the physics (Strang splitting)
  Mesh_updateGrid
36 FLASH Applications
- Compressible reactive flow
- Wide range of length and time scales
- Many interacting physical processes
- Only indirect validation possible for the astrophysics
- Many people in collaboration
Example applications:
- Flame-vortex interactions
- Compressible turbulence
- Shocked cylinder
- Nova outbursts on white dwarfs
- Intracluster interactions
- Cellular detonations
- White dwarf deflagration
- Helium burning on neutron stars
- Rayleigh-Taylor instability
37 And that brings us to questions and discussion.