Title: Introduction to Parallel MM5
1 Introduction to Parallel MM5
- Yupeng Li
- Air Quality Modeling, Dept. of Computer Science, UH
2 Credits
- Parallelized by John Michalakes, Software
Engineer, MCS Division, Argonne National
Laboratory
3 Parallelism in MM5
- What is meant by parallel?
- Increase the computational and memory resources available for larger, faster runs by having more than one computer work on the problem
- Isn't MM5 already parallel?
- Yes, the model has been able to run shared-memory parallel since MM4, using Cray microtasking directives
- More recently, standardized OpenMP directives have been incorporated
- What is DM-parallelism? Why?
- Processors store part of the model domain in local memory, not shared with other processors, and work together on the problem by exchanging messages over a network
- Scalable, because it eliminates bottlenecks on shared resources such as the bus or memory
- Possibly also more cost-effective, since the systems can be commodity hardware
- You already have the DM-parallel version of MM5
4 A Little on Parallel Computing
- There are two ways to use more than one processor:
- (SMP) Shared Memory
- (DMP) Distributed Memory
- There is also a newer one, called Shared Distributed Memory (SDMP)
5 Memory Hierarchy
Approximate access speed, in machine cycles, at each level:
- CPU register (1)
- Cache (2-5) (may have several levels)
- Main memory (30)
- Disk (10^6)
6 Shared Memory Architecture
- For small numbers of processors
[Diagram: processors proc1 ... procN, each with its own cache (cache1 ... cacheN), all connected to one shared memory]
7 Distributed Memory
- The alternative model to shared memory
[Diagram: processors proc1 ... procN, each with its own local memory (mem1 ... memN), connected to one another by a network]
8 The Revolution of Communication Networks: From LAN to SAN (System Area Network)
- Myrinet: scalable interconnection networks; its single-chip building block appeared in 1995
- high bandwidth
- low latency
- reliable
- success story: Berkeley NOW
- Synfinity: a component for configuring very large, highly reliable interconnection networks (HAL/Fujitsu, 1998)
- high bandwidth (1.6 GBytes/sec/port)
- low latency (42 ns)
- highly reliable
9 Machines
10 Programming Technology for Parallel Architectures
- (SMP) Shared Memory
- Pthreads
- Directives: OpenMP (OMP)
- (DMP) Distributed Memory
- API: Message Passing Interface (MPI)
- Language: High Performance Fortran (HPF)
- OMP and MPI are the most widely used.
11 OpenMP
- Mainly directives, plus a small set of library functions:

      PROGRAM TEST
      ...
!$OMP PARALLEL
      ...
!$OMP DO
      DO I = ...
         ...
         CALL SUB1
         ...
      ENDDO
      ...
      CALL SUB2
      ...
!$OMP END PARALLEL
12 OpenMP
- And:

      INTEGER NTHREADS, TID
      INTEGER OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
C     Obtain and print the thread id
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread ', TID
C     Only the master thread does this
      IF (TID .EQ. 0) THEN
         NTHREADS = OMP_GET_NUM_THREADS()
         PRINT *, 'Number of threads ', NTHREADS
      END IF
C     All threads join the master thread and disband
!$OMP END PARALLEL
13 OpenMP
- If the compiler cannot recognize OMP directives, the directives are simply ignored and the code builds as a serial program (see the sketch below).
- If there are not enough processors for the threads you ask for, some threads will have to share a processor.
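
A minimal sketch of how this works in practice, assuming a generic Unix Fortran compiler (the OpenMP flag name is vendor-specific, e.g. -mp or -omp; the flag shown here is illustrative):

      f77 hello.f -o hello          # no OpenMP flag: directives ignored, serial run
      f77 -omp hello.f -o hello     # OpenMP enabled
      setenv OMP_NUM_THREADS 4      # ask for 4 threads
      ./hello

If the machine has fewer than 4 processors, the run still works; some threads simply time-share a processor.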
14 MPI
- A bunch of function calls:

      include 'mpif.h'
      integer myrank, numprocs, ierr
      integer status(MPI_STATUS_SIZE)
      real side, square
      side = 7.0
      square = 0.0
C
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myrank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      print *, 'I am node ', myrank, ' out of ', numprocs, ' nodes.'
      if (myrank .eq. 1) then
         square = side * side
         call MPI_SEND(square,1,MPI_REAL,0,99,MPI_COMM_WORLD,ierr)
      endif
      if (myrank .eq. 0) then
         print *, 'Before receive square is ', square
         call MPI_RECV(square,1,MPI_REAL,1,99,MPI_COMM_WORLD,
     &                 status,ierr)
         print *, 'After receive square is ', square
      endif
      call MPI_FINALIZE(ierr)
      end
15 MPI
- Usually involves one master node (rank 0) and several slave nodes
- The master distributes the job load to the set of computing nodes (MPI_SEND) and collects results from them as well (MPI_RECV)
- Initialization and finalization are a must (see the run sketch below)
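
A minimal sketch of compiling and running the example above under MPICH (wrapper and launcher names differ in other MPI implementations):

      mpif77 square.f -o square
      mpirun -np 2 ./square

With two nodes, rank 1 computes 7.0 * 7.0 and sends it to rank 0, which should print square as 0.0 before the receive and 49.0 after.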
16 Before we had MPI in MM5
- There were already several MPI implementations of MM5 out there, but none was officially adopted, because it is too expensive to maintain two sets of code.
- Then the Runtime System Library (RSL) and the Fortran Loop and Index Converter (FLIC) were used to parallelize MM5 for DMP. This is called the "same source" approach.
- A single-source implementation of parallelism has obvious benefits for maintainability, avoiding the effort needed to keep multiple, architecture-specific versions up to date with respect to each other.
17 Same source concept
- Ideal: source code for the DM-parallel and non-DM-parallel model is identical at the science level
- Hide the parallel details under the hood: automate and encapsulate
- Parallel toolbox:
- FLIC: automatic generation of I and J loop indexes (see the sketch below)
- RSL: routines for domain decomposition and message passing
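
To illustrate the idea, here is a hypothetical before/after sketch of the kind of loop transformation FLIC performs (the bounds variables ISTART/IEND/JSTART/JEND are made-up names for the patch a processor owns, not FLIC's actual output). A whole-domain loop such as

C     Original loop over the full domain
      DO J = 2, JLX-1
      DO I = 2, ILX-1
         T(I,J) = T(I,J) + DT*TTEN(I,J)
      ENDDO
      ENDDO

is rewritten so that each processor sweeps only its local patch:

C     Converted loop (illustrative): bounds clipped to this processor's patch
      DO J = MAX(2, JSTART), MIN(JLX-1, JEND)
      DO I = MAX(2, ISTART), MIN(ILX-1, IEND)
         T(I,J) = T(I,J) + DT*TTEN(I,J)
      ENDDO
      ENDDO

The science code inside the loop body is untouched; only the index ranges change, which is what keeps the DM-parallel and non-DM-parallel source identical at the science level.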
18 Distributed Memory Parallel Performance
19 DM-parallel Features
- All MM5 options supported except
- Moving nested grids
- Arakawa-Schubert Cumulus
- Pleim-Xu PBL
20 Building the DM-parallel MM5
- Download:
      ftp://ftp.ucar.edu/mesouser/MM5V3/MM5.TAR.gz
      ftp://ftp.ucar.edu/mesouser/MM5V3/MPP.TAR.gz
- Unzip and untar:
      gzip -d -c MM5.TAR.gz | tar -xf -
      cd MM5
      gzip -d -c MPP.TAR.gz | tar -xf -
- Edit the configure.user file for your computer and configuration
21 Editing configure.user
- Find the MPP subsection in Section 7 of configure.user pertaining to your computer, and uncomment those rules
- Adjust the PROCMIN_NS and PROCMIN_EW settings at the top of Section 7 for memory scaling
22 PROCMIN variables
- Determine the horizontal dimensions of the MM5 arrays on each processor at compile time
- PROCMIN_NS divides MIX (north-south decomposition); PROCMIN_EW divides MJX (east-west decomposition)
- Can reduce the per-processor size of the MM5 arrays to exploit the aggregate memory of the parallel machine (see the example below)
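
For instance, a configure.user fragment might look like the following (the values and comments are illustrative; check Section 7 of your configure.user for the exact layout):

      # Size arrays for at least a 2 (N-S) x 2 (E-W) decomposition
      PROCMIN_NS = 2        # per-processor N-S dimension is roughly MIX/2
      PROCMIN_EW = 2        # per-processor E-W dimension is roughly MJX/2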
23 PROCMIN variables
- An executable compiled with PROCMIN_NS = 1 and PROCMIN_EW = 1 uses the maximum per-processor memory, but is valid for any number of processors.
- Warning! An executable compiled with PROCMIN_NS = 2 and PROCMIN_EW = 2 can be run on no fewer than 4 processors, and even then only on counts that factor into at least 2 in each direction. For example, it can NOT be run on 5 processors (the MIX/2 dimension is too small for a 1x5 decomposition).
24 PROCMIN variables
- Experiment with different decompositions: e.g., run times for 16-processor jobs compiled as 2x8, 4x4, and 8x2 might vary significantly.
25 Building the code
- Build the model: make mpp
- Resulting executable: Run/mm5.mpp
- To remake the code in a different configuration: make mpclean
- To reinstall the code in a different location: make uninstall
26 Running the model
- Generate the mmlif (namelist) file:
- make mm5.deck
- Edit mm5.deck
- ./mm5.deck (creates the namelist file in Run/mmlif; does not run the code)
- Also rerun it after you change configure.user
- Run the model (a complete session sketch follows this list):
- cd Run
- mpirun -np 4 ./mm5.mpp (standard, MPICH)
- dmpirun (DEC MPI)
- mprun (Sun MPI)
- mpimon (Linux/ScaMPI)
- poe (IBM)
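
Putting the steps together, a typical MPICH session looks like this (the processor count and editor are just examples):

      make mm5.deck
      vi mm5.deck            # edit run options
      ./mm5.deck             # writes Run/mmlif; does not run the model
      cd Run
      mpirun -np 4 ./mm5.mpp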
27 Running the model
- The model generates the normal MMOUT_DOMAIN output files, plus 3 text files per processor:
- rsl.out.0000 (contains standard output)
- rsl.error.0000 (contains standard error)
- show_domain_0000 (shows the domain decomposition)
28 Test datasets
- Storm of the Century:
      ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/input2mm5.tar.gz
      ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/soc_benchmark_config.tar.gz
- Good small case for initial testing
- Includes a nest
- Large domain (World Series Rain-out):
      ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/largedomainrun.tar.gz
- Representative problem sizes for distributed memory
29 Important Information Sources
- http://www.mmm.ucar.edu/mm5/
- When you really cannot find an answer in the documentation, send email to mesouser@ncar.ucar.edu
- Subscribe to the MM5 users' discussion mailing list through the address below, to see whether others have had the same experience:
      http://www.mmm.ucar.edu/mm5/support/news.html